Create Custom Container Steps: Datafile

You can define custom containers for use as pipeline steps with inputs and outputs, creating pipelines that execute both function-based and container-based steps. There are two ways to do this: using plain values and using data files. In this unit you will use a data file stored on a marshal volume to share data between steps.

Please follow along in your own copy of our notebook as we complete the steps below.
1. Download custom_container_step_datafile.py

For this unit, we will look at a new code example focused specifically on using a datafile.

First, download the new code file by clicking here and add it to your Notebook Server.

2. Install scikit-learn

Open a new terminal, navigate to the kale-sdk-datavol-1 directory, and execute the following command to install scikit-learn.

pip3 install --user scikit-learn==0.23.0

3. Review Python Code

In this code example we define a pipeline with three steps:

  • The first step, load_dataset, is function-based; it loads the digits dataset and returns the samples and targets.
  • The second step, split_dataset, is container-based; it splits the dataset into train and test subsets and returns the test subset.
  • The third step, print_dataset, is function-based; it prints the first element of the test subset.
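
To make the structure concrete, a minimal sketch of the two function-based steps is shown below. It assumes the @step decorator from the Kale SDK covered earlier in the course; the step bodies are illustrative rather than copied from the downloaded file.

    from kale.sdk import step
    from sklearn.datasets import load_digits

    @step(name="load_dataset")
    def load_dataset():
        # Load the digits dataset and return the samples and targets.
        x, y = load_digits(return_X_y=True)
        return x, y

    @step(name="print_dataset")
    def print_dataset(x_test):
        # Print the first element of the test subset produced by split_dataset.
        print(x_test[0])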

Notice how the pipeline definition is extended to use a marshal volume, a concept discussed earlier in this course.

This volume is used to save large data objects in files.

The volume_mounts, as well as the file names, are defined in the split_dataset container step. Recall that the file name format takes the form input.filename or output.filename.
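
Continuing the sketch above, the container-based step and the pipeline wiring might look roughly like the following. The ContainerStep class name, its constructor arguments, and the /marshal mount path are assumptions for illustration only; the downloaded custom_container_step_datafile.py contains the authoritative definition.

    from kale.sdk import pipeline
    # ASSUMPTION: the import below and the constructor arguments are
    # illustrative; check the downloaded code file for the exact Kale SDK API.
    from kale.sdk import ContainerStep

    # Container-based step backed by the image that ships split.py.
    split_dataset = ContainerStep(
        name="split_dataset",
        image="gcr.io/arrikto/example-add-split",
        command=["python3", "split.py"],
        # Mount the marshal volume so inputs and outputs are exchanged as
        # files named input.<filename> and output.<filename>.
        volume_mounts=["/marshal"],
    )

    @pipeline(name="split-dataset", experiment="kale-sdk-datavol")
    def ml_pipeline():
        # load_dataset and print_dataset are the function-based steps
        # sketched earlier in this unit.
        x, y = load_dataset()
        x_test = split_dataset(x, y)
        print_dataset(x_test)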

Refer to the Custom Container Steps: Data File Procedure if you would like to see the code contained in split.py, which is sourced from gcr.io/arrikto/example-add-split.
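
For orientation, a hypothetical version of split.py is sketched below. The file paths, the serialization format (NumPy arrays written through file handles), and the 80/20 split ratio are assumptions; the actual script in gcr.io/arrikto/example-add-split may differ.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Assumed mount point of the kale-marshal-volume inside the container.
    MARSHAL_DIR = "/marshal"

    def main():
        # Read the samples and targets written by the load_dataset step,
        # following the input.<filename> naming convention.
        with open(f"{MARSHAL_DIR}/input.x", "rb") as f:
            x = np.load(f)
        with open(f"{MARSHAL_DIR}/input.y", "rb") as f:
            y = np.load(f)

        # Split into train and test subsets; only the test samples are
        # passed on to the next step.
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.2, random_state=42)

        # Write the test subset using the output.<filename> convention so
        # the print step can pick it up from the marshal volume.
        with open(f"{MARSHAL_DIR}/output.x_test", "wb") as f:
            np.save(f, x_test)

    if __name__ == "__main__":
        main()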

4. Open a new terminal

Select Terminal from the launcher to open a new terminal.

5. Deploy Code as Kubeflow Pipeline

Make sure you are in the directory with the Python code, kale-sdk-datavol-1, and execute the following to deploy the code as a Kubeflow Pipeline.

python3 -m kale custom_container_step_datafile.py --kfp

6. Review Pipeline Runs

Navigate to the Kubeflow UI and open the Runs UI. The number of runs will continue to increase throughout the course.

7. Access the Pipeline run

Select the split-dataset-*** run to see the Kubeflow Pipeline execute successfully with the custom container output.

Notice the use of the kale-marshal-volume which hosts the datafile.

8. Review print_data

Select the print_data step and review the Logs to see the first line of the test subset.

You have successfully executed a Kubeflow Pipeline using Custom Container Steps with a Datafile.