!pip3 install --user -r requirements.txt
Blue Book for Bulldozers
In this example, we will create a model to solve the Blue Book for Bulldozers Kaggle competition. For this competition, we should predict the sale price of bulldozers sold at auctions.
First, we observe that this is a regression task, since we are predicting a real-valued number. For this tutorial, the dataset is a single CSV file containing more than 20,000 examples, each with 51 features (excluding the SalesID and SalePrice attributes).
Install
First things first, we need to install the necessary dependencies. We can do that either by running !pip install --user <package_name> or by listing everything in a requirements.txt file and running !pip install --user -r requirements.txt. We have already put the dependencies in a requirements.txt file, so we will use the second method.
NOTE: Do not forget to use the --user argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.
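After the installation finishes, a quick check can confirm that the packages are actually importable (a kernel restart may be needed before newly installed packages become visible). The sketch below is only illustrative: the helper name `missing_packages` is hypothetical, and the package list is taken from the import statements in this notebook, on the assumption that requirements.txt covers them.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported later in this notebook
# (assumed, not verified, to be listed in requirements.txt).
required = ["numpy", "pandas", "kubeflow", "kale", "autosklearn", "fastai"]
print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; any name that is printed still needs to be installed.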
Imports
In this section we import the packages we need for this example. When working in a Jupyter Notebook, it is a good habit to gather your imports in a single place.
import numpy as np
import pandas as pd
from kubeflow import katib
from kale.types import MLTask
from kale import ml as kale_ml
from kale.common import artifacts
from autosklearn import metrics
from fastai.tabular import core
Data Loading
In this section we load the dataset and do some light processing. Specifically, we need to turn every categorical feature into a number. First, let’s load the data.
df = pd.read_csv("data/train.csv", low_memory=False)
df.head()
 | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1597691 | 10000.0 | 1204623 | 4600 | 132 | 18.0 | 1979 | NaN | NaN | 4/21/1994 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1363686 | 10000.0 | 1149235 | 7267 | 132 | 8.0 | 1978 | NaN | NaN | 3/27/2002 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
2 | 1767435 | 64000.0 | 1164512 | 28919 | 132 | 1.0 | 2006 | NaN | NaN | 3/15/2011 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
3 | 1639287 | 24500.0 | 1457804 | 1894 | 132 | 15.0 | 2001 | NaN | NaN | 11/4/2010 0:00 | ... | NaN | NaN | NaN | NaN | NaN | None or Unspecified | PAT | None or Unspecified | NaN | NaN |
4 | 2277587 | 21750.0 | 572911 | 2758 | 136 | 20.0 | 1998 | 5155.0 | Medium | 4/16/2008 0:00 | ... | NaN | NaN | NaN | NaN | NaN | None or Unspecified | PAT | None or Unspecified | NaN | NaN |
5 rows × 53 columns
Data Processing
Let’s go through the transformations we need to do to our data:
- Extract the target variable (SalePrice)
- Encode the ordinal variables
- Unfold the dates to engineer more features, and
- Split the dataset into train and valid sets
First, let’s keep our target in a variable:
target_var = 'SalePrice'
The next step is to encode the ordinal variables. Why do we treat the ProductSize variable differently? Because here the order matters. If we want to assign a number to every value this variable can take, the numbers must respect that order. Thus, Large can take the value 1, Large / Medium the value 2, and so on.
# ordinal data
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'

# set_categories returns a new Series; its `inplace` argument was removed in pandas 2.0
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
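To make the effect of an ordered category concrete, here is a minimal sketch on a toy column. It assumes that the numeric value is derived from the pandas category code plus one, so that missing values land on 0 and Large on 1; whether the preprocessing used later in this notebook numbers the values in exactly this way is not confirmed here.

```python
import pandas as pd

# Same ordering as in the notebook, from largest to smallest.
sizes = ['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']

# Toy data (not taken from the real dataset), including one missing value.
toy = pd.DataFrame({'ProductSize': ['Medium', 'Large', None, 'Compact']})
toy['ProductSize'] = pd.Categorical(toy['ProductSize'], categories=sizes, ordered=True)

# pandas codes start at 0 and use -1 for missing values;
# adding 1 (an assumption for this sketch) maps missing to 0 and 'Large' to 1.
codes = toy['ProductSize'].cat.codes + 1
print(codes.tolist())

# Because the category is ordered, comparisons follow the declared order.
before_small = toy['ProductSize'] < 'Small'
print(before_small.tolist())
```

Without `ordered=True`, the comparison in the last step would raise an error, because pandas would have no order to compare against.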
Next, we need to unfold the date feature to extract more information from it. For example, it makes sense to know whether it’s Christmas or a weekend, whether it’s summer or winter, or even which day of the week it is.
To achieve this, we will use a handy function provided by the fastai library: add_datepart.
# expand dates
df = core.add_datepart(df, 'saledate')
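To illustrate what unfolding a date means, the sketch below derives a few calendar features with plain pandas. It is not fastai's implementation: `unfold_date` is a hypothetical helper, it produces only a handful of columns, and the column names are chosen to resemble the saleYear and saleMonth columns used later in this notebook.

```python
import pandas as pd


def unfold_date(frame, column, prefix="sale"):
    """Replace a date column with a few derived calendar features (illustrative only)."""
    dates = pd.to_datetime(frame[column])
    out = frame.drop(columns=[column]).copy()
    out[prefix + "Year"] = dates.dt.year
    out[prefix + "Month"] = dates.dt.month
    out[prefix + "Day"] = dates.dt.day
    out[prefix + "Dayofweek"] = dates.dt.dayofweek   # Monday=0 ... Sunday=6
    out[prefix + "Is_weekend"] = dates.dt.dayofweek >= 5
    return out


# The two dates below appear in the first rows of the dataset preview.
toy = pd.DataFrame({"saledate": ["4/21/1994 0:00", "3/27/2002 0:00"],
                    "SalePrice": [10000.0, 10000.0]})
unfolded = unfold_date(toy, "saledate")
print(unfolded)
```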
Now we are ready to split our dataset into train and valid sets. Since we are trying to predict the sale price of bulldozers sold at auction, we should be extra careful about how we split our data. We want the validation set to be later in time than the training set, so that no future values leak into the training set.
# create splits
condition = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(condition)[0]
valid_idx = np.where(~condition)[0]
splits = (list(train_idx), list(valid_idx))

# locate continuous and categorical features
cont, cat = core.cont_cat_split(df, 1, dep_var=target_var)

# preprocess the dataset
df_proc = core.TabularPandas(df, [core.Categorify], cat, cont, y_names=target_var, splits=splits)
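The time-based split can be checked on a toy frame. The sketch below uses the same condition as the notebook and then verifies that every training row is earlier than every validation row. The toy rows are invented, and the check relies on the assumption that no sale is dated after 2011; a row from early 2012 would satisfy `saleMonth < 10` and end up in the training set.

```python
import numpy as np
import pandas as pd

# Toy sale dates (invented), none later than December 2011.
toy = pd.DataFrame({
    "saleYear":  [2009, 2010, 2011, 2011, 2011, 2011],
    "saleMonth": [5, 12, 3, 9, 10, 12],
})

# Same condition as in the notebook: everything before October 2011 is training data.
condition = (toy.saleYear < 2011) | (toy.saleMonth < 10)
train_idx = np.where(condition)[0]
valid_idx = np.where(~condition)[0]

# Express each date as a month count so the two sets can be compared directly.
period = toy.saleYear * 12 + toy.saleMonth
no_leak = period.iloc[train_idx].max() < period.iloc[valid_idx].min()
print(train_idx, valid_idx, no_leak)
```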
Now that we have finished processing our dataset, we are ready to extract our features and labels into numpy arrays.
# create feature and target matrices
X_train = df_proc.train.items.drop("SalePrice", axis=1).values
y_train = df_proc.train.items["SalePrice"].values
X_valid = df_proc.valid.items.drop("SalePrice", axis=1).values
y_valid = df_proc.valid.items["SalePrice"].values
Kale provides a useful abstraction to group together our dataset. We just need to provide the X and y values.
# create Kale Dataset
dataset = artifacts.Dataset(
    features=X_train,
    targets=y_train,
    features_test=X_valid,
    targets_test=y_valid,
    name="bluebook-bulldozers")
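Before handing the arrays to an experiment, it can help to confirm that features and targets line up. The sketch below does this on toy frames with plain pandas and NumPy. The helpers `split_features_target` and `check_shapes` are hypothetical and not part of Kale, and the toy rows only reuse a few values from the dataset preview.

```python
import numpy as np
import pandas as pd


def split_features_target(frame, target):
    """Return (X, y) NumPy arrays from a frame that contains the target column."""
    X = frame.drop(target, axis=1).values
    y = frame[target].values
    return X, y


def check_shapes(X_a, y_a, X_b, y_b):
    """Raise ValueError if rows or feature counts are inconsistent."""
    if X_a.shape[0] != y_a.shape[0] or X_b.shape[0] != y_b.shape[0]:
        raise ValueError("features and targets must have the same number of rows")
    if X_a.shape[1] != X_b.shape[1]:
        raise ValueError("train and valid sets must have the same number of features")


toy_train = pd.DataFrame({"YearMade": [1979, 1978, 2006],
                          "ModelID": [4600, 7267, 28919],
                          "SalePrice": [10000.0, 10000.0, 64000.0]})
toy_valid = pd.DataFrame({"YearMade": [2001],
                          "ModelID": [1894],
                          "SalePrice": [24500.0]})

X_toy_train, y_toy_train = split_features_target(toy_train, "SalePrice")
X_toy_valid, y_toy_valid = split_features_target(toy_valid, "SalePrice")
check_shapes(X_toy_train, y_toy_train, X_toy_valid, y_toy_valid)
print(X_toy_train.shape, y_toy_train.shape, X_toy_valid.shape, y_toy_valid.shape)
```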
Finally, we are ready to run our AutoML experiment using Kale. We need three things:
- The dataset
- The task (in our case regression)
- The metric we are trying to optimize
Moreover, we can pass a parameter that tells Kale how many different ML configurations it should try, and a Katib specification if we want to further optimize the best-performing predictor.
# create the Katib configuration
tuner = katib.V1beta1ExperimentSpec(
    objective=katib.V1beta1ObjectiveSpec(
        goal=0.,
        type="maximize"
    ),
    max_trial_count=2,
    parallel_trial_count=1
)
# execute the AutoML experiment
automl = kale_ml.run_automl(dataset,
                            MLTask.SIMPLE_REGRESSION,
                            metric=metrics.mean_squared_log_error,
                            number_of_configurations=4,
                            max_parallel_configurations=2,
                            tuner=tuner)
2021-04-22 11:45:09 Kale marshalling [INFO] Saving generic object using Default backend: dataset
2021-04-22 11:45:09 Kale marshalling [INFO] Saving generic object using Default backend: task
2021-04-22 11:45:09 Kale marshalling [INFO] Saving scikit-learn object using SKLearn backend: metric
2021-04-22 11:45:09 Kale marshalling [INFO] Saving generic object using Default backend: number_of_configurations
2021-04-22 11:45:09 Kale marshalling [INFO] Saving generic object using Default backend: max_parallel_configurations
2021-04-22 11:45:09 Kale marshalling [INFO] Saving generic object using Default backend: tuner
2021-04-22 11:45:09 Kale rokutils:156 [INFO] Taking a snapshot of the Pod's volumes...
2021-04-22 11:45:09 Kale rokutils:105 [INFO] Taking a snapshot of pod kubecon-tutorial-0 in namespace kubeflow-user ...
2021-04-22 11:45:09 Kale rokutils:313 [INFO] Creating Rok bucket 'notebooks'...
2021-04-22 11:45:09 Kale rokutils:323 [INFO] Rok bucket 'notebooks' already exists
2021-04-22 11:45:10 Kale rokutils:177 [INFO] Monitoring Rok snapshot with task id: de71f0965d1a4f9dbf2cd867e555f846
2021-04-22 11:45:22 Kale rokutils:192 [INFO] Successfully created Rok snapshot
2021-04-22 11:45:22 Kale podutils:275 [INFO] Getting the base image of container...
2021-04-22 11:45:22 Kale podutils:288 [INFO] Retrieved image: gcr.io/arrikto-playground/elikatsis/jupyter-kale@sha256:021d062da17aca25f85513ca7b00e77fac6d94addefb68ac0fd33a84e9eb24ff
2021-04-22 11:45:22 Kale kfutils:70 [INFO] Retrieving PodDefaults applied to server...
2021-04-22 11:45:22 Kale kfutils:76 [INFO] Retrieved applied PodDefaults: ['access-ml-pipeline', 'rok-auth']
2021-04-22 11:45:22 Kale kfutils:80 [INFO] PodDefault labels applied on server: access-ml-pipeline: true, access-rok: true
2021-04-22 11:45:22 Kale kale [INFO] Compiling to a KFP Pipeline
2021-04-22 11:45:22 Kale kale [WARNING] Failed to enable 'set_owner_reference' for 'create-volume-1'. Moving on without garbage collection...
2021-04-22 11:45:22 Kale kale [INFO] Saving generated code in /home/jovyan/prv-kale/backend/kale/ml/examples/tutorial/.kale
2021-04-22 11:45:22 Kale kale [INFO] Successfully saved workflow yaml: /home/jovyan/prv-kale/backend/kale/ml/examples/tutorial/.kale/automl-orchestrate.kale.yaml
2021-04-22 11:45:22 Kale kfputils:120 [INFO] Uploading pipeline 'automl-orchestrate'...
2021-04-22 11:45:22 Kale kfputils:143 [INFO] Successfully uploaded version 'vu38n' for pipeline 'automl-orchestrate'.
2021-04-22 11:45:22 Kale kfputils:162 [INFO] Creating KFP experiment 'kale-automl-4d7x4x'...
[INFO]:root:Creating experiment kale-automl-4d7x4x.
2021-04-22 11:45:22 Kale kfputils:175 [INFO] Submitting new pipeline run 'automl-orchestrate-vu38n-dk9u7' for pipeline 'automl-orchestrate' (version: 'vu38n') ...
2021-04-22 11:45:22 Kale kfputils:182 [INFO] Successfully submitted pipeline run.
2021-04-22 11:45:22 Kale kfputils:183 [INFO] Run URL: <host>/pipeline/?ns=kubeflow-user#/runs/details/6ff92eaf-3d9a-4190-9318-d6d29d78100c
You can monitor the experiment by printing a summary of the AutoML task at any point in time.
automl.summary()
AutoML Orchestrator status: Succeeded
4/4 Configuration Runs have started.
Status | Count |
---|---|
Running | 0 |
Succeeded | 4 |
Skipped | 0 |
Failed | 0 |
Error | 0 |
# | KFP Run | Status | Metric (mean_squared_log_error) |
---|---|---|---|
1 | c7ab64ce-b59a-4193-ad75-098065e59666 | Succeeded | -0.467985 |
2 | 0600f3e3-855c-411d-8034-6aaf76417b5d | Succeeded | -0.0687647 |
3 | 3f2d62bc-2a99-40c6-9237-c61525acc31c | Succeeded | -0.483285 |
4 | ccbab346-776f-46f3-85bf-5c79b03c98f2 | Succeeded | -0.257837 |
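The metric column shows negative numbers even though a mean squared log error can never be negative. A plausible reading, and it is an assumption here rather than something the output states, is that the loss is reported negated so that a higher score is better, which would match the "maximize" objective type in the Katib configuration. The sketch below computes the metric with NumPy and picks the best run from the scores in the table above under that assumption.

```python
import numpy as np


def msle(y_true, y_pred):
    """Mean squared log error: mean of (log(1 + y_pred) - log(1 + y_true))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Scores copied from the summary table, keyed by run number.
scores = {1: -0.467985, 2: -0.0687647, 3: -0.483285, 4: -0.257837}

# Assuming the scores are negated losses, the best run has the highest score,
# that is, the value closest to zero.
best_run = max(scores, key=scores.get)
print(best_run, -scores[best_run])
```

Under this reading, run 2 has the lowest error of the four configurations.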
Serve the best-performing model
Now that we have found the best configuration and performed hyperparameter optimization on it, let’s serve the best-performing model.
from kale.serve import serve

kale_model_artifact_id = <KALE_MODEL_ARTIFACT_ID_PLACEHOLDER>
kale_transformer_artifact_id = <KALE_TRANSFORMER_ARTIFACT_ID_PLACEHOLDER>

serve_config = {"limits": {"memory": "4Gi"},
                "annotations": {"sidecar.istio.io/inject": "false"},
                "predictor": {"container": {"name": "container", "image": "gcr.io/arrikto/kserve-sklearnserver-arr:v0.8.0-32-g2ae228dd"}}}

isvc = serve(name="automl-example", model_id=kale_model_artifact_id, transformer_id=kale_transformer_artifact_id, serve_config=serve_config)
from kale.serve import Endpoint

endpoint = Endpoint("automl-example")
endpoint
Run predictions against the model
import json

data = {"instances": X_valid[0:3].tolist()}
res = endpoint.predict(json.dumps(data))
print(res)
{'predictions': [46462.40234375, 19027.392578125, 28755.17578125]}
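The request body is a JSON object with an "instances" key holding one list of feature values per row. The sketch below builds such a payload from a toy array (the values are invented) and shows why the `.tolist()` call matters: a NumPy array cannot be serialized by the `json` module directly.

```python
import json
import numpy as np

# Toy feature matrix standing in for X_valid (values are invented).
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# Passing the array itself to json.dumps fails with a TypeError.
try:
    json.dumps({"instances": X_toy[0:3]})
    array_is_serializable = True
except TypeError:
    array_is_serializable = False

# Converting to nested Python lists first gives a valid JSON payload.
payload = json.dumps({"instances": X_toy[0:3].tolist()})
decoded = json.loads(payload)
print(array_is_serializable, decoded)
```

The response shown above has the same row-per-entry shape: one prediction for each of the three instances sent.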
TensorBoard logs
from kale.common.tbutils import create_tensorboard_from_mlmd

tensorboard_logs_artifact_id = <TENSORBOARD_LOGS_ARTIFACT_ID_PLACEHOLDER>
tb = create_tensorboard_from_mlmd(tensorboard_logs_artifact_id, name="blue-book-bulldozers")
2021-04-22 13:57:00 Kale api:553 [INFO] Hydrating PVC 'kubeflow-user/blue-book-bulldozers-pua2d-pvc' from Rok URI: rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale rokutils:481 [INFO] Unpacking Rok URI: 'rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs'
2021-04-22 13:57:00 Kale rokutils:485 [INFO] User kubeflow-user; bucket tensorboard-logs; object katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc; version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9; path /prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale rokutils:251 [INFO] Creating new PVC 'blue-book-bulldozers-pua2d-pvc' from Rok version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9 ...
2021-04-22 13:57:00 Kale rokutils:263 [INFO] Using Rok url: http://rok.rok.svc.cluster.local/swift/v1/kubeflow-user/tensorboard-logs/katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc?version=2e4b6d3f-97ab-403e-bb5e-5f169f720bc9
2021-04-22 13:57:00 Kale rokutils:285 [INFO] Successfully submitted PVC.
2021-04-22 13:57:00 Kale rokutils:481 [INFO] Unpacking Rok URI: 'rok:kubeflow-user:tensorboard-logs:katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc:2e4b6d3f-97ab-403e-bb5e-5f169f720bc9:/prv-kale/backend/kale/ml/examples/tutorial/logs'
2021-04-22 13:57:00 Kale rokutils:485 [INFO] User kubeflow-user; bucket tensorboard-logs; object katib-trial-sklearn-configuration-lbhxh-automl-orchestrate-rwcg5-workspace-kubecon-tutorial-i6nwkt6jc; version 2e4b6d3f-97ab-403e-bb5e-5f169f720bc9; path /prv-kale/backend/kale/ml/examples/tutorial/logs
2021-04-22 13:57:00 Kale tensorboardutils:40 [INFO] Creating Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d'...
2021-04-22 13:57:00 Kale tensorboardutils:53 [INFO] Successfully created Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d'
2021-04-22 13:57:00 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:05 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:10 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:15 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:20 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:25 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:30 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:35 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:40 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:45 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:50 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:57:55 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:00 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:05 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:10 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:15 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:20 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:25 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:30 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:35 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:40 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:45 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:50 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:58:55 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:00 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:05 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:10 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:15 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:20 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:25 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:30 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:35 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:40 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:45 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:50 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 13:59:55 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:00 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:05 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:10 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:15 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:20 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:25 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:30 Kale tensorboardutils:108 [INFO] Waiting for Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' to become ready...
2021-04-22 14:00:30 Kale tensorboardutils:116 [INFO] Tensorboard 'kubeflow-user/blue-book-bulldozers-pua2d' is ready
2021-04-22 14:00:30 Kale api:530 [INFO] Adding OwnerReference on PVC...
2021-04-22 14:00:31 Kale api:546 [INFO] Successfully added OwnerReference on PVC
2021-04-22 14:00:31 Kale api:565 [INFO] You can visit the Tensorboards Web App to view it! URL path: /tensorboard/kubeflow-user/blue-book-bulldozers-pua2d/