Blue Book for Bulldozers

In this example, we will create a model to solve the Blue Book for Bulldozers Kaggle competition. The goal of the competition is to predict the sale price of bulldozers sold at auction.

First, we observe that this is a regression task, since we are predicting a real-valued number. For this tutorial, the dataset is a single CSV file containing more than 20,000 examples, each with 51 features (excluding the SalesID and SalePrice attributes).

Install

First things first, we need to install the necessary dependencies. We can do that by either running !pip install --user <package_name> or including everything in a requirements.txt file and running !pip install --user -r requirements.txt. We have put the dependencies in a requirements.txt file already, so we will use the second method.

NOTE: Do not forget to use the --user argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.

!pip3 install --user -r requirements.txt

Imports

In this section we import the packages we need for this example. When working in a Jupyter Notebook, it is a good habit to gather your imports in a single place.

import numpy as np
import pandas as pd

from kubeflow import katib
from kale.types import MLTask
from kale import ml as kale_ml
from kale.common import artifacts
from autosklearn import metrics
from fastai.tabular import core

Data Loading

In this section we load the dataset and do some light processing. Specifically, we need to turn every categorical feature into a number. First, let’s load the data.

df = pd.read_csv("data/train.csv", low_memory=False)
df.head()

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	saledate	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
0	1597691	10000.0	1204623	4600	132	18.0	1979	NaN	NaN	4/21/1994 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1363686	10000.0	1149235	7267	132	8.0	1978	NaN	NaN	3/27/2002 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
2	1767435	64000.0	1164512	28919	132	1.0	2006	NaN	NaN	3/15/2011 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
3	1639287	24500.0	1457804	1894	132	15.0	2001	NaN	NaN	11/4/2010 0:00	...	NaN	NaN	NaN	NaN	NaN	None or Unspecified	PAT	None or Unspecified	NaN	NaN
4	2277587	21750.0	572911	2758	136	20.0	1998	5155.0	Medium	4/16/2008 0:00	...	NaN	NaN	NaN	NaN	NaN	None or Unspecified	PAT	None or Unspecified	NaN	NaN

5 rows × 53 columns

Data Processing

Let’s go through the transformations we need to do to our data:

Extract the target variable (SalePrice)
Encode the ordinal variables
Unfold the dates to engineer more features, and
Split the dataset into train and valid sets

First, let’s keep our target in a variable:

target_var = 'SalePrice'

The next step is to encode the ordinal variables. We treat the ProductSize variable differently from the other categorical features because the order of its values matters. If we assign a number to every value this variable can take, the numbers need to respect that order. Thus, Large can take the value 1, Large / Medium the value 2, and so on.

# ordinal data
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'

df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
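
To see what the ordered category gives us, here is a minimal sketch on a toy column. The sample rows are made up for illustration; only the `sizes` ordering comes from the cell above. It assigns the result of `set_categories` back to the column rather than relying on `inplace=True`, which newer pandas releases no longer accept. Note that the raw pandas codes start at 0 and use -1 for missing values, so later preprocessing may shift them.

```python
import pandas as pd

# Same ordering as in the tutorial cell above.
sizes = ['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']

# Toy column; these rows are illustrative, not taken from the dataset.
toy = pd.DataFrame({'ProductSize': ['Mini', 'Large', None, 'Medium', 'Compact']})

toy['ProductSize'] = toy['ProductSize'].astype('category')
toy['ProductSize'] = toy['ProductSize'].cat.set_categories(sizes, ordered=True)

# Codes follow the position of each value in `sizes`; missing values get -1.
codes = toy['ProductSize'].cat.codes.tolist()
print(codes)

# Because the category is ordered, comparisons are meaningful:
# "greater than Medium" here means later in `sizes`, i.e. a smaller machine.
is_smaller_than_medium = (toy['ProductSize'] > 'Medium').tolist()
print(is_smaller_than_medium)
```

Without `ordered=True`, the comparison in the last step would raise an error, because pandas would have no way to rank the values.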

Next, we need to unfold the date feature to extract more information out of it. For example, it is useful to know whether a sale took place around Christmas, on a weekend, in summer or winter, or on which day of the week.

To achieve this, we will use a handy function provided by the fastai library: add_datepart.

# expand dates
df = core.add_datepart(df, 'saledate')
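
As a rough illustration of what unfolding a date means, the sketch below derives a few calendar features with plain pandas. It is not fastai's implementation: the helper name `unfold_date` and the exact set of columns are assumptions chosen for illustration, although the `saleYear` and `saleMonth` naming mirrors the columns used later in this tutorial. The two dates are taken from the first rows of the data preview.

```python
import pandas as pd

def unfold_date(df, field, prefix='sale'):
    """Hypothetical helper: expand a date column into calendar features."""
    out = df.copy()
    dates = pd.to_datetime(out[field])
    out[prefix + 'Year'] = dates.dt.year
    out[prefix + 'Month'] = dates.dt.month
    out[prefix + 'Day'] = dates.dt.day
    out[prefix + 'Dayofweek'] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out[prefix + 'Is_weekend'] = dates.dt.dayofweek >= 5
    # The raw date column is dropped once its parts have been extracted.
    return out.drop(columns=[field])

toy = pd.DataFrame({'saledate': ['4/21/1994 0:00', '3/27/2002 0:00']})
expanded = unfold_date(toy, 'saledate')
print(expanded)
```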

Now we are ready to split our dataset into train and validation sets. Since we are trying to predict the sale price of bulldozers at future auctions, we should be extra careful about how we split our data. We want the validation set to be later in time than the training set, so that future values do not leak into our training set.

# create splits
condition = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(condition)[0]
valid_idx = np.where(~condition)[0]
splits = (list(train_idx), list(valid_idx))
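
A small sanity check helps confirm what this condition selects. The sketch below applies the same expression to a toy frame (the rows are made up) and shows that only sales from October to December 2011 land in the validation set. As written, the condition would also keep January to September of any year after 2011 in the training set, so it separates past from future cleanly only if the data ends in 2011; that is an assumption about this dataset worth verifying.

```python
import numpy as np
import pandas as pd

# Toy frame; rows are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'saleYear':  [2009, 2011, 2011, 2011, 2010],
    'saleMonth': [12, 9, 10, 12, 11],
})

# Same expression as in the tutorial.
condition = (toy.saleYear < 2011) | (toy.saleMonth < 10)
train_idx = np.where(condition)[0]
valid_idx = np.where(~condition)[0]

print(train_idx.tolist())
print(valid_idx.tolist())
```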

# locate continuous and categorical features
cont, cat = core.cont_cat_split(df, 1, dep_var=target_var)

# preprocess the dataset
df_proc = core.TabularPandas(df, [core.Categorify], cat, cont, y_names=target_var, splits=splits)
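
To make the continuous/categorical split less of a black box, here is a rough pandas-only approximation of the heuristic as we understand it: float columns, and integer columns with more than `max_card` distinct values, are treated as continuous, and everything else as categorical. The function name `split_cont_cat` is hypothetical and the rule is an assumption about fastai's behaviour, so check the library source before relying on it. The toy rows are illustrative.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def split_cont_cat(df, max_card, dep_var):
    """Hypothetical approximation of a continuous/categorical column split."""
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the target is not a feature
        many_ints = is_integer_dtype(df[col]) and df[col].nunique() > max_card
        if many_ints or is_float_dtype(df[col]):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

toy = pd.DataFrame({
    'SalePrice': [10000.0, 64000.0, 24500.0],
    'YearMade': [1979, 2006, 2001],                            # integer column
    'MachineHoursCurrentMeter': [float('nan'), 5155.0, 12.0],  # float column
    'UsageBand': ['Low', 'Medium', None],                      # string column
})
cont, cat = split_cont_cat(toy, max_card=1, dep_var='SalePrice')
print(cont, cat)
```

With `max_card=1`, as in the call above, any integer column with more than one distinct value ends up in the continuous list under this rule.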

Now that we have finished processing our dataset, we are ready to extract our features and labels into numpy arrays.

# create feature and target matrices
X_train = df_proc.train.items.drop("SalePrice", axis=1).values
y_train = df_proc.train.items["SalePrice"].values
X_valid = df_proc.valid.items.drop("SalePrice", axis=1).values
y_valid = df_proc.valid.items["SalePrice"].values

Kale provides a useful abstraction to group together our dataset. We just need to provide the X and y values.

# create Kale Dataset
dataset = artifacts.Dataset(
    features=X_train,
    targets=y_train,
    features_test=X_valid,
    targets_test=y_valid,
    name="bluebook-bulldozers")

Finally, we are ready to run our AutoML experiment using Kale. We need three things:

The dataset
The task (in our case regression)
The metric we are trying to optimize

Moreover, we can pass a parameter that tells Kale how many different ML configurations it should try, and a Katib specification if we want to further optimize the best-performing predictor.

# create the Katib configuration
tuner = katib.V1beta1ExperimentSpec(
    objective=katib.V1beta1ObjectiveSpec(
        goal=0.,
        type="maximize"
    ),
    max_trial_count=2,
    parallel_trial_count=1
)

# execute the AutoML experiment
automl = kale_ml.run_automl(dataset,
                            MLTask.SIMPLE_REGRESSION,
                            metric=metrics.mean_squared_log_error,
                            number_of_configurations=4,
                            max_parallel_configurations=2,
                            tuner=tuner)

2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: dataset
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: task
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving scikit-learn object using SKLearn backend: metric
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: number_of_configurations
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: max_parallel_configurations
2021-04-22 11:45:09 Kale marshalling          [INFO]     Saving generic object using Default backend: tuner
2021-04-22 11:45:09 Kale rokutils:156         [INFO]     Taking a snapshot of the Pod's volumes...
2021-04-22 11:45:09 Kale rokutils:105         [INFO]     Taking a snapshot of pod kubecon-tutorial-0 in namespace kubeflow-user ...
2021-04-22 11:45:09 Kale rokutils:313         [INFO]     Creating Rok bucket 'notebooks'...
2021-04-22 11:45:09 Kale rokutils:323         [INFO]     Rok bucket 'notebooks' already exists
2021-04-22 11:45:10 Kale rokutils:177         [INFO]     Monitoring Rok snapshot with task id: de71f0965d1a4f9dbf2cd867e555f846
2021-04-22 11:45:22 Kale rokutils:192         [INFO]     Successfully created Rok snapshot
2021-04-22 11:45:22 Kale podutils:275         [INFO]     Getting the base image of container...
2021-04-22 11:45:22 Kale podutils:288         [INFO]     Retrieved image: gcr.io/arrikto-playground/elikatsis/jupyter-kale@sha256:021d062da17aca25f85513ca7b00e77fac6d94addefb68ac0fd33a84e9eb24ff
2021-04-22 11:45:22 Kale kfutils:70           [INFO]     Retrieving PodDefaults applied to server...
2021-04-22 11:45:22 Kale kfutils:76           [INFO]     Retrieved applied PodDefaults: ['access-ml-pipeline', 'rok-auth']
2021-04-22 11:45:22 Kale kfutils:80           [INFO]     PodDefault labels applied on server: access-ml-pipeline: true, access-rok: true
2021-04-22 11:45:22 Kale kale                 [INFO]     Compiling to a KFP Pipeline
2021-04-22 11:45:22 Kale kale                 [WARNING]  Failed to enable 'set_owner_reference' for 'create-volume-1'. Moving on without garbage collection...
2021-04-22 11:45:22 Kale kale                 [INFO]     Saving generated code in /home/jovyan/prv-kale/backend/kale/ml/examples/tutorial/.kale
2021-04-22 11:45:22 Kale kale                 [INFO]     Successfully saved workflow yaml: /home/jovyan/prv-kale/backend/kale/ml/examples/tutorial/.kale/automl-orchestrate.kale.yaml
2021-04-22 11:45:22 Kale kfputils:120         [INFO]     Uploading pipeline 'automl-orchestrate'...
2021-04-22 11:45:22 Kale kfputils:143         [INFO]     Successfully uploaded version 'vu38n' for pipeline 'automl-orchestrate'.
2021-04-22 11:45:22 Kale kfputils:162         [INFO]     Creating KFP experiment 'kale-automl-4d7x4x'...
[INFO]:root:Creating experiment kale-automl-4d7x4x.
2021-04-22 11:45:22 Kale kfputils:175         [INFO]     Submitting new pipeline run 'automl-orchestrate-vu38n-dk9u7' for pipeline 'automl-orchestrate' (version: 'vu38n') ...
2021-04-22 11:45:22 Kale kfputils:182         [INFO]     Successfully submitted pipeline run.
2021-04-22 11:45:22 Kale kfputils:183         [INFO]     Run URL: <host>/pipeline/?ns=kubeflow-user#/runs/details/6ff92eaf-3d9a-4190-9318-d6d29d78100c

You can monitor the experiment by printing a summary of the AutoML task at any point in time.

automl.summary()

AutoML Orchestrator status: Succeeded

4/4 Configuration Runs have started.

Status	Count
Running	0
Succeeded	4
Skipped	0
Failed	0
Error	0

#	KFP Run	Status	Metric (mean_squared_log_error)
1	c7ab64ce-b59a-4193-ad75-098065e59666	Succeeded	-0.467985
2	0600f3e3-855c-411d-8034-6aaf76417b5d	Succeeded	-0.0687647
3	3f2d62bc-2a99-40c6-9237-c61525acc31c	Succeeded	-0.483285
4	ccbab346-776f-46f3-85bf-5c79b03c98f2	Succeeded	-0.257837
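
For reference, mean squared log error compares predictions and targets on a log scale, so being off by a given ratio costs roughly the same for cheap and expensive machines. The sketch below computes it with the standard library only; the function name `msle` and the sample prices are made up for illustration. The summary above reports negative values, which suggests the metric is negated so that larger is better (consistent with the `maximize` objective and a goal of 0); that reading is an assumption, not something the output states.

```python
import math

def msle(y_true, y_pred):
    """Mean squared log error: mean of (log(1 + y) - log(1 + y_hat))**2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        total += (math.log1p(y) - math.log1p(y_hat)) ** 2
    return total / len(y_true)

# Illustrative prices, not results from the experiment.
y_true = [10000.0, 64000.0]
y_pred = [10000.0, 32000.0]

score = msle(y_true, y_pred)
print(score)    # error form: lower is better
print(-score)   # negated form: higher is better
```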

Serve the best-performing model

Now that we have found the best configuration and performed hyperparameter optimization on it, let’s serve the best-performing model.

from kale.serve import serve

kale_model_artifact_id = <KALE_MODEL_ARTIFACT_ID_PLACEHOLDER>
kale_transformer_artifact_id = <KALE_TRANSFORMER_ARTIFACT_ID_PLACEHOLDER>

serve_config = {"limits": {"memory": "4Gi"},
                "annotations": {"sidecar.istio.io/inject": "false"},
                "predictor": {"container": {"name": "container", "image": "gcr.io/arrikto/kserve-sklearnserver-arr:v0.8.0-32-g2ae228dd"}}}

isvc = serve(name="automl-example", model_id=kale_model_artifact_id, transformer_id=kale_transformer_artifact_id, serve_config=serve_config)

from kale.serve import Endpoint

endpoint = Endpoint("automl-example")
endpoint

Run predictions against the model

import json
data = {"instances": X_valid[0:3].tolist()}
res = endpoint.predict(json.dumps(data))
print(res)

{'predictions': [46462.40234375, 19027.392578125, 28755.17578125]}
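
The request body follows the `{"instances": [...]}` shape used above, and the response carries one prediction per instance. The sketch below builds such a payload and unpacks a response using only the standard library, without calling any endpoint. The helper names and the two-feature rows are hypothetical, while the response dictionary is the one printed above.

```python
import json

def build_payload(rows):
    """Hypothetical helper: wrap feature rows in an 'instances' JSON body."""
    return json.dumps({"instances": [list(row) for row in rows]})

def extract_predictions(response, expected_count):
    """Hypothetical helper: return predictions after checking their count."""
    predictions = response["predictions"]
    if len(predictions) != expected_count:
        raise ValueError("expected one prediction per instance")
    return predictions

# Made-up feature rows standing in for X_valid[0:3].tolist().
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
payload = build_payload(rows)

# Response copied from the output shown above.
response = {'predictions': [46462.40234375, 19027.392578125, 28755.17578125]}
predictions = extract_predictions(response, expected_count=len(rows))

print(payload)
print([round(p, 2) for p in predictions])
```

Checking the number of predictions against the number of instances is a cheap way to catch a malformed request before comparing the values with `y_valid`.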

TensorBoard logs

from kale.common.tbutils import create_tensorboard_from_mlmd
tensorboard_logs_artifact_id = <TENSORBOARD_LOGS_ARTIFACT_ID_PLACEHOLDER>
tb = create_tensorboard_from_mlmd(tensorboard_logs_artifact_id, name="blue-book-bulldozers")

2021-04-22 13:57:00 Kale api:553              [INFO]     Hydrating PVC 'kubeflow-user/blue-book-bulldozers-pua2d-pvc' from Rok URI: rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale rokutils:481         [INFO]     Unpacking Rok URI: 'rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs'
2021-04-22 13:57:00 Kale rokutils:485         [INFO]     User kubeflow-user; bucket tensorboard-logs; object katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc; version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9; path /prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale rokutils:251         [INFO]     Creating new PVC 'blue-book-bulldozers-pua2d-pvc' from Rok version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9 ...
2021-04-22 13:57:00 Kale rokutils:263         [INFO]     Using Rok url: http://rok.rok.svc.cluster.local/swift/v1/kubeflow-user/tensorboard-logs/katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc?version=2e4b6d3f-97ab-403e-bb5e-5f169f720bc9
2021-04-22 13:57:00 Kale rokutils:285         [INFO]     Successfully submitted PVC.
2021-04-22 13:57:00 Kale rokutils:481         [INFO]     Unpacking Rok URI: 'rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs'
2021-04-22 13:57:00 Kale rokutils:485         [INFO]     User kubeflow-user; bucket tensorboard-logs; object katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc; version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9; path /prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale tensorboardutils:40  [INFO]     Creating Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d'...
2021-04-22 13:57:00 Kale tensorboardutils:53  [INFO]     Successfully created Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d'
2021-04-22 13:57:00 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:05 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:10 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:15 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:20 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:25 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:30 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:35 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:40 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:45 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:50 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:55 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:00 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:05 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:10 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:15 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:20 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:25 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:30 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:35 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:40 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:45 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:50 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:55 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:00 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:05 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:10 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:15 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:20 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:25 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:30 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:35 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:40 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:45 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:50 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:55 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:00 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:05 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:10 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:15 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:20 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:25 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:30 Kale tensorboardutils:108 [INFO]     Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:30 Kale tensorboardutils:116 [INFO]     Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' is ready
2021-04-22 14:00:30 Kale api:530              [INFO]     Adding OwnerReference on PVC...
2021-04-22 14:00:31 Kale api:546              [INFO]     Successfully added OwnerReference on PVC
2021-04-22 14:00:31 Kale api:565              [INFO]     You can visit the Tensorboards Web App to view it! URL path: /tensorboard/kubeflow-user/blue-book-bulldozers-pua2d/
