Blue Book for Bulldozers

In this example, we build a model for the Blue Book for Bulldozers Kaggle competition, where the goal is to predict the sale price of bulldozers sold at auction.

First, we observe that this is a regression task, since we are predicting a real-valued number. For this tutorial, the dataset is a single CSV file containing more than 20,000 examples, each with 51 features (excluding the SalesID and SalePrice columns).

Install

First, we need to install the necessary dependencies. We can do that either by running !pip install --user <package_name> for each package, or by listing everything in a requirements.txt file and running !pip install --user -r requirements.txt. The dependencies are already listed in a requirements.txt file, so we will use the second method.

NOTE: Do not forget to use the --user argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.

!pip3 install --user -r requirements.txt

Imports

In this section we import the packages we need for this example. When working in a Jupyter Notebook, it is a good habit to gather your imports in a single place.

import numpy as np
import pandas as pd

from kubeflow import katib
from kale.types import MLTask
from kale import ml as kale_ml
from kale.common import artifacts
from autosklearn import metrics
from fastai.tabular import core

Data Loading

In this section we load the dataset and do some light processing. Specifically, we need to turn every categorical feature into a number. First, let’s load the data.

df = pd.read_csv("data/train.csv", low_memory=False)
df.head()
   SalesID  SalePrice  MachineID  ModelID  datasource  auctioneerID  YearMade  MachineHoursCurrentMeter  UsageBand  saledate        ...  Undercarriage_Pad_Width  Stick_Length  Thumb  Pattern_Changer  Grouser_Type  Backhoe_Mounting     Blade_Type  Travel_Controls      Differential_Type  Steering_Controls
0  1597691  10000.0    1204623    4600     132         18.0          1979      NaN                       NaN        4/21/1994 0:00  ...  NaN                      NaN           NaN    NaN              NaN           NaN                  NaN         NaN                  NaN                NaN
1  1363686  10000.0    1149235    7267     132         8.0           1978      NaN                       NaN        3/27/2002 0:00  ...  NaN                      NaN           NaN    NaN              NaN           NaN                  NaN         NaN                  Standard           Conventional
2  1767435  64000.0    1164512    28919    132         1.0           2006      NaN                       NaN        3/15/2011 0:00  ...  NaN                      NaN           NaN    NaN              NaN           NaN                  NaN         NaN                  Standard           Conventional
3  1639287  24500.0    1457804    1894     132         15.0          2001      NaN                       NaN        11/4/2010 0:00  ...  NaN                      NaN           NaN    NaN              NaN           None or Unspecified  PAT         None or Unspecified  NaN                NaN
4  2277587  21750.0    572911     2758     136         20.0          1998      5155.0                    Medium     4/16/2008 0:00  ...  NaN                      NaN           NaN    NaN              NaN           None or Unspecified  PAT         None or Unspecified  NaN                NaN

5 rows × 53 columns

Data Processing

Let’s go through the transformations we need to do to our data:

  1. Extract the target variable (SalePrice)
  2. Encode the ordinal variables
  3. Unfold the dates to engineer more features, and
  4. Split the dataset into train and valid sets

First, let’s keep our target in a variable:

target_var = 'SalePrice'

The next step is to encode the ordinal variables. ProductSize gets special treatment because the order of its values matters: when we assign a number to each value, the numbers must preserve that order. Thus, Large can take the value 1, Large / Medium the value 2, and so on.

# ordinal data
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'

df['ProductSize'] = df['ProductSize'].astype('category')
# set_categories returns a new Series (the inplace argument was removed in pandas 2.0)
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
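
To make the idea of an order-preserving encoding concrete, here is a minimal sketch in plain Python. It is only an illustration, not what pandas or fastai do internally: the size_rank mapping and the encode_size helper are hypothetical names, and mapping missing values to 0 is an assumption of this sketch.

```python
# Size labels in the same order as in the cell above.
sizes = ('Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact')

# Hypothetical mapping: the position in the tuple becomes a 1-based rank.
size_rank = {size: rank for rank, size in enumerate(sizes, start=1)}


def encode_size(value):
    """Return the rank of a size label, or 0 if it is missing or unknown."""
    return size_rank.get(value, 0)


sample = ['Medium', 'Large', None, 'Compact']
encoded = [encode_size(v) for v in sample]
print(encoded)  # [3, 1, 0, 6]
```

Because the ranks follow the declared order, comparisons such as "at most Medium" become simple numeric comparisons on the encoded column.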

Next, we need to unfold the date feature to extract more information from it. For example, it helps to know whether a sale took place around Christmas, on a weekend, in summer or winter, or on which day of the week.

To achieve this, we will use a handy function provided by the fastai library: add_datepart.

# expand dates
df = core.add_datepart(df, 'saledate')
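
As a rough illustration of what unfolding a date means, the sketch below expands a single saledate string using only the standard library. It does not reproduce add_datepart: the unfold_date helper is hypothetical, the feature names other than saleYear and saleMonth are assumptions modelled on that naming pattern, and the weekend flag is an extra added for illustration.

```python
from datetime import datetime, timedelta


def unfold_date(raw, prefix="sale"):
    """Expand one date string into several numeric and boolean features."""
    d = datetime.strptime(raw, "%m/%d/%Y %H:%M")
    return {
        f"{prefix}Year": d.year,
        f"{prefix}Month": d.month,
        f"{prefix}Day": d.day,
        f"{prefix}Dayofweek": d.weekday(),            # Monday is 0
        f"{prefix}Dayofyear": d.timetuple().tm_yday,
        f"{prefix}Is_weekend": d.weekday() >= 5,      # illustrative extra
        # The day is a month end if the next day falls in another month.
        f"{prefix}Is_month_end": (d + timedelta(days=1)).month != d.month,
    }


# The first saledate value shown in the table above.
features = unfold_date("4/21/1994 0:00")
print(features)
```

A single timestamp column becomes several columns that a tree-based or linear model can use directly.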

Now we are ready to split our dataset into train and validation sets. Since we are trying to predict the sale price of bulldozers at future auctions, we should be extra careful about how we split our data: the validation set must be later in time than the training set, so that future values do not leak into training.

# create splits
condition = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(condition)[0]
valid_idx = np.where(~condition)[0]
splits = (list(train_idx), list(valid_idx))
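
The split condition can be checked on a handful of made-up rows. The sketch below uses hypothetical (saleYear, saleMonth) pairs and a hypothetical is_train helper. It also shows a caveat: the condition sends everything from October to December 2011 to validation, but it only separates past from future if the data contains no sales after 2011, which is assumed here.

```python
# Hypothetical (saleYear, saleMonth) pairs standing in for dataset rows.
sales = [(2009, 12), (2011, 3), (2011, 9), (2011, 10), (2011, 12)]


def is_train(year, month):
    """Same boolean condition as in the cell above."""
    return year < 2011 or month < 10


train_idx = [i for i, (y, m) in enumerate(sales) if is_train(y, m)]
valid_idx = [i for i, (y, m) in enumerate(sales) if not is_train(y, m)]
print(train_idx, valid_idx)  # [0, 1, 2] [3, 4]

# Caveat: a sale in March 2012 would also satisfy the condition and
# end up in the training set.
print(is_train(2012, 3))  # True
```

If later data were ever appended, comparing a single combined date value against a cutoff would avoid this edge case.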

# locate continuous and categorical features
cont, cat = core.cont_cat_split(df, 1, dep_var=target_var)

# preprocess the dataset
df_proc = core.TabularPandas(df, [core.Categorify], cat, cont, y_names=target_var, splits=splits)
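
Turning categorical features into numbers can be pictured with the simplified sketch below. It is not fastai's Categorify implementation: build_vocab and encode are hypothetical helpers, the sample values are made up, and reserving code 0 for missing or unseen values is an assumption of this sketch. It does show why the vocabulary should come from the training rows only.

```python
def build_vocab(values):
    """Map each distinct non-missing value to an integer code starting at 1."""
    distinct = sorted({v for v in values if v is not None})
    return {v: code for code, v in enumerate(distinct, start=1)}


def encode(values, vocab):
    """Replace each value by its code; missing or unseen values become 0."""
    return [vocab.get(v, 0) for v in values]


# Hypothetical sample values for one categorical column.
train_usage = ['Low', 'High', None, 'Medium', 'Low']
valid_usage = ['Medium', 'Unseen', None]

vocab = build_vocab(train_usage)
print(vocab)                       # {'High': 1, 'Low': 2, 'Medium': 3}
print(encode(train_usage, vocab))  # [2, 1, 0, 3, 2]
print(encode(valid_usage, vocab))  # [3, 0, 0]
```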

Now that we have finished processing our dataset, we are ready to extract our features and labels into numpy arrays.

# create feature and target matrices
X_train = df_proc.train.items.drop("SalePrice", axis=1).values
y_train = df_proc.train.items["SalePrice"].values
X_valid = df_proc.valid.items.drop("SalePrice", axis=1).values
y_valid = df_proc.valid.items["SalePrice"].values

Kale provides a useful abstraction to group together our dataset. We just need to provide the X and y values.

# create Kale Dataset
dataset = artifacts.Dataset(
    features=X_train,
    targets=y_train,
    features_test=X_valid,
    targets_test=y_valid,
    name="bluebook-bulldozers")

Finally, we are ready to run our AutoML experiment using Kale. We need three things:

  • The dataset
  • The task (in our case regression)
  • The metric we are trying to optimize

We can also pass a parameter that tells Kale how many different ML configurations to try, as well as a Katib specification if we want to further optimize the best-performing predictor.

# create the Katib configuration
tuner = katib.V1beta1ExperimentSpec(
    objective=katib.V1beta1ObjectiveSpec(
        goal=0.,
        type="maximize"
    ),
    max_trial_count=2,
    parallel_trial_count=1
)
# execute the AutoML experiment
automl = kale_ml.run_automl(dataset,
                            MLTask.SIMPLE_REGRESSION,
                            metric=metrics.mean_squared_log_error,
                            number_of_configurations=4,
                            max_parallel_configurations=2,
                            tuner=tuner)
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: dataset
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: task
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving scikit-learn object using SKLearn backend: metric
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: number_of_configurations
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: max_parallel_configurations
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: tuner
2021-04-22 11:45:09 Kale rokutils:156         [INFO]     Taking a snapshot of the Pod's volumes...
2021-04-22 11:45:09 Kale rokutils:105         [INFO]     Taking a snapshot of pod kubecon-tutorial-0 in namespace kubeflow-user ...
2021-04-22 11:45:09 Kale rokutils:313         [INFO]     Creating Rok bucket 'notebooks'...
2021-04-22 11:45:09 Kale rokutils:323         [INFO]     Rok bucket 'notebooks' already exists
2021-04-22 11:45:10 Kale rokutils:177         [INFO]     Monitoring Rok snapshot with task id: de71f0965d1a4f9dbf2cd867e555f846
2021-04-22 11:45:22 Kale rokutils:192         [INFO]     Successfully created Rok snapshot
2021-04-22 11:45:22 Kale podutils:275         [INFO]     Getting the base image of container...
2021-04-22 11:45:22 Kale podutils:288         [INFO]     Retrieved image: gcr.io/arrikto-playground/elikatsis/jupyter-kale@sha256:021d062da17aca25f85513ca7b00e77fac6d94addefb68ac0fd33a84e9eb24ff
2021-04-22 11:45:22 Kale kfutils:70           [INFO]     Retrieving PodDefaults applied to server...
2021-04-22 11:45:22 Kale kfutils:76           [INFO]     Retrieved applied PodDefaults: ['access-ml-pipeline', 'rok-auth']
2021-04-22 11:45:22 Kale kfutils:80           [INFO]     PodDefault labels applied on server: access-ml-pipeline: true, access-rok: true
2021-04-22 11:45:22 Kale kale                 [INFO]     Compiling to a KFP Pipeline
2021-04-22 11:45:22 Kale kale                 [WARNING]  Failed to enable 'set_owner_reference' for 'create-volume-1'. Moving on without garbage collection...
2021-04-22 11:45:22 Kale kale                 [INFO]     Saving generated code in /home/jovyan/prv-kale/backend/kale/ml/examples/tutorial/.kale
2021-04-22 11:45:22 Kale kale                 [INFO]     Successfully saved workflow yaml: /home/jovyan/prv-kale/backend/kale/ml/examples/tutorial/.kale/automl-orchestrate.kale.yaml
2021-04-22 11:45:22 Kale kfputils:120         [INFO]     Uploading pipeline 'automl-orchestrate'...
2021-04-22 11:45:22 Kale kfputils:143         [INFO]     Successfully uploaded version 'vu38n' for pipeline 'automl-orchestrate'.
2021-04-22 11:45:22 Kale kfputils:162         [INFO]     Creating KFP experiment 'kale-automl-4d7x4x'...
[INFO]:root:Creating experiment kale-automl-4d7x4x.
2021-04-22 11:45:22 Kale kfputils:175         [INFO]     Submitting new pipeline run 'automl-orchestrate-vu38n-dk9u7' for pipeline 'automl-orchestrate' (version: 'vu38n') ...
2021-04-22 11:45:22 Kale kfputils:182         [INFO]     Successfully submitted pipeline run.
2021-04-22 11:45:22 Kale kfputils:183         [INFO]     Run URL: <host>/pipeline/?ns=kubeflow-user#/runs/details/6ff92eaf-3d9a-4190-9318-d6d29d78100c

You can monitor the experiment by printing a summary of the AutoML task at any point in time.

automl.summary()
AutoML Orchestrator status: Succeeded

4/4 Configuration Runs have started.

Status     Count
Running    0
Succeeded  4
Skipped    0
Failed     0
Error      0

#  KFP Run                               Status     Metric (mean_squared_log_error)
1  c7ab64ce-b59a-4193-ad75-098065e59666  Succeeded  -0.467985
2  0600f3e3-855c-411d-8034-6aaf76417b5d  Succeeded  -0.0687647
3  3f2d62bc-2a99-40c6-9237-c61525acc31c  Succeeded  -0.483285
4  ccbab346-776f-46f3-85bf-5c79b03c98f2  Succeeded  -0.257837
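
The Metric column shows negative numbers, although a mean squared log error can never be negative. This is consistent with a scorer convention that negates error metrics so that larger is always better, which would also match the maximize objective with goal 0 in the Katib configuration. That reading is an assumption; the output itself does not state it. The sketch below computes the metric by hand with the usual log(1 + y) definition and then negates it. The predictions are made up.

```python
import math


def mean_squared_log_error(y_true, y_pred):
    """Mean of the squared differences between log(1 + y) values."""
    terms = [(math.log1p(t) - math.log1p(p)) ** 2
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


y_true = [10000.0, 64000.0, 24500.0]   # sale prices from the table above
y_pred = [12000.0, 60000.0, 24500.0]   # hypothetical predictions

msle = mean_squared_log_error(y_true, y_pred)
score = -msle  # negated, so that a value closer to 0 is better
print(msle, score)
```

Under this reading, run 2 in the summary (-0.0687647) is the best configuration, since its score is closest to 0.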

Serve the best-performing model

Now that we have found the best configuration and performed hyperparameter optimization on it, let’s serve the best-performing model.

from kale.serve import serve

kale_model_artifact_id = <KALE_MODEL_ARTIFACT_ID_PLACEHOLDER>
kale_transformer_artifact_id = <KALE_TRANSFORMER_ARTIFACT_ID_PLACEHOLDER>

serve_config = {"limits": {"memory": "4Gi"},
                "annotations": {"sidecar.istio.io/inject": "false"},
                "predictor": {"container": {"name": "container", "image": "gcr.io/arrikto/kserve-sklearnserver-arr:v0.8.0-32-g2ae228dd"}}}

isvc = serve(name="automl-example", model_id=kale_model_artifact_id, transformer_id=kale_transformer_artifact_id, serve_config=serve_config)
from kale.serve import Endpoint

endpoint = Endpoint("automl-example")
endpoint

Run predictions against the model

import json
data = {"instances": X_valid[0:3].tolist()}
res = endpoint.predict(json.dumps(data))
print(res)
{'predictions': [46462.40234375, 19027.392578125, 28755.17578125]}

TensorBoard logs

from kale.common.tbutils import create_tensorboard_from_mlmd
tensorboard_logs_artifact_id = <TENSORBOARD_LOGS_ARTIFACT_ID_PLACEHOLDER>
tb = create_tensorboard_from_mlmd(tensorboard_logs_artifact_id, name="blue-book-bulldozers")
2021-04-22 13:57:00 Kale api:553              [INFO]     Hydrating PVC 'kubeflow-user/blue-book-bulldozers-pua2d-pvc' from Rok URI: rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale rokutils:481         [INFO]     Unpacking Rok URI: 'rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs'
2021-04-22 13:57:00 Kale rokutils:485         [INFO]     User kubeflow-user; bucket tensorboard-logs; object katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc; version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9; path /prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale rokutils:251         [INFO]     Creating new PVC 'blue-book-bulldozers-pua2d-pvc' from Rok version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9 ...
2021-04-22 13:57:00 Kale rokutils:263         [INFO]     Using Rok url: http://rok.rok.svc.cluster.local/swift/v1/kubeflow-user/tensorboard-logs/katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc?version=2e4b6d3f-97ab-403e-bb5e-5f169f720bc9
2021-04-22 13:57:00 Kale rokutils:285         [INFO]     Successfully submitted PVC.
2021-04-22 13:57:00 Kale rokutils:481         [INFO]     Unpacking Rok URI: 'rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs'
2021-04-22 13:57:00 Kale rokutils:485         [INFO]     User kubeflow-user; bucket tensorboard-logs; object katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc; version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9; path /prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale tensorboardutils:40  [INFO]     Creating Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d'...
2021-04-22 13:57:00 Kale tensorboardutils:53  [INFO]     Successfully created Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d'
2021-04-22 13:57:00 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:30 Kale tensorboardutils:116 [INFO]     Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' is ready
2021-04-22 14:00:30 Kale api:530              [INFO]     Adding OwnerReference on PVC...
2021-04-22 14:00:31 Kale api:546              [INFO]     Successfully added OwnerReference on PVC
2021-04-22 14:00:31 Kale api:565              [INFO]     You can visit the Tensorboards Web App to view it! URL path: /tensorboard/kubeflow-user/blue-book-bulldozers-pua2d/