Data Passing with Volumes

When you create your Notebook Server using the Jupyter Web App, a new PVC is mounted under the user's home directory by default. We call this volume the **workspace volume**. You can also provision more volumes to be mounted at locations of your choice. We call these the **data volumes**. Whenever you submit a pipeline to KFP, Kale clones the Notebook's volumes for two important reasons:

1. **Marshalling:** A mechanism to seamlessly pass data between steps. The system requires a shared folder where Kale can serialize and de-serialize data; Kale uses a hidden folder within the workspace volume as the shared marshalling location.

2. **Reproducibility and experimentation:** When you work with a Notebook, you often install new libraries, write new modules, and create or download assets required by your code. By seamlessly cloning the workspace and data volumes, Kale ensures your environment is versioned and replicated to the new pipeline you deploy. As a result, the pipeline is always reproducible thanks to the immutable snapshots, and you do not have to build new Docker images for each pipeline run.
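The marshalling mechanism can be illustrated with plain Python: each step serializes its outputs into a shared folder, and downstream steps deserialize them from there. The sketch below uses `pickle` and a temporary directory standing in for the hidden marshal folder; the folder handling and helper names are illustrative, not Kale's actual internals.

```python
import pickle
import tempfile
from pathlib import Path

# A temporary directory stands in for the hidden marshal folder that
# Kale creates inside the workspace volume (the real location is
# managed by Kale and is not part of its public API).
MARSHAL_DIR = Path(tempfile.mkdtemp())

def marshal_save(name, obj):
    """Serialize a step output under a well-known name."""
    with open(MARSHAL_DIR / f"{name}.pkl", "wb") as f:
        pickle.dump(obj, f)

def marshal_load(name):
    """Deserialize an upstream step's output by name."""
    with open(MARSHAL_DIR / f"{name}.pkl", "rb") as f:
        return pickle.load(f)

# "Load" step: produce data and marshal it for downstream steps.
def load_step():
    data = [0.1, 0.2, 0.3]
    marshal_save("data", data)

# "Train" step: in the real pipeline this runs in a different pod, but
# it can read the upstream output because the marshal folder is shared.
def train_step():
    data = marshal_load("data")
    return sum(data) / len(data)

load_step()
print(train_step())  # mean of the marshalled data
```

This is exactly why a shared volume is required: the serialized files written by one step must be visible to the pods running the later steps.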

You can see this functionality in the log output from the execution in the prior unit: notice the two *taking a snapshot* lines, which indicate when Kale is taking snapshots. Additionally, notice the *Successfully created Rok* lines, which indicate that Kale is leaning on Rok to facilitate the movement of these snapshots.

For more information on snapshots, please take the Rok 101 course.

Marshal Volume

Using the workspace volume to marshal large volumes of data can become problematic if the data brought in by the load step exhausts the available space on the workspace volume. Therefore, we recommend provisioning a dedicated marshal volume and instructing Kale to use it to pass data. Use the following parameters in the *@pipeline* decorator to take advantage of the marshal volume:

  • marshal_volume=True
  • marshal_volume_size="##Gi" (replace ## with the number of gigabytes to use)
  • marshal_path="/data/marshal"
Please follow along in your own copy of our notebook as we complete the steps below.
1. Add marshal_volume parameters to the @pipeline decorators

Replace the existing *@pipeline* decorator with the following updated code:

@pipeline(name="binary-classification", experiment="kale-sdk", marshal_volume=True, marshal_volume_size="10Gi", marshal_path="/data/marshal")
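For context, here is a sketch of how the updated decorator fits into a complete Kale SDK file. The step names and function bodies are illustrative; it assumes the SDK exposes `pipeline` and `step` under `kale.sdk`, and it falls back to no-op decorators so the sketch also runs where Kale is not installed.

```python
# Assumption: the Kale SDK provides `pipeline` and `step` decorators.
# Fall back to pass-through decorators so the sketch runs anywhere.
try:
    from kale.sdk import pipeline, step
except ImportError:
    def step(name=None):
        def wrap(fn):
            return fn
        return wrap

    def pipeline(**kwargs):
        def wrap(fn):
            return fn
        return wrap

@step(name="load")
def load():
    # Illustrative "load" step producing some data.
    return list(range(5))

@step(name="train")
def train(data):
    # Illustrative "train" step consuming the upstream output.
    return sum(data)

@pipeline(name="binary-classification",
          experiment="kale-sdk",
          marshal_volume=True,
          marshal_volume_size="10Gi",
          marshal_path="/data/marshal")
def ml_pipeline():
    data = load()
    train(data)

if __name__ == "__main__":
    ml_pipeline()
```

With these parameters, the outputs passed between `load` and `train` are marshalled on the dedicated 10Gi volume mounted at /data/marshal instead of the workspace volume.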

2. Open a new terminal

Select Terminal from the launcher to open a new terminal.

3. Deploy Code as Kubeflow Pipeline

Execute the following to deploy the code as a Kubeflow Pipeline. The *--kfp* flag deploys the code as a Kubeflow Pipeline.

python3 -m kale kale_sdk.py --kfp


4. Review Pipeline Execution

Navigate to the Kubeflow UI and open the Runs UI.

You will now see two runs as a result of the work done in the course thus far.

5. Access the pipeline run

Select the latest binary-classification run to see the Kubeflow Pipeline.

The green checkmarks indicate that the Kubeflow Pipeline was deployed and executed successfully. We recommend testing your code locally, as you did in this course, before deploying to minimize troubleshooting complexity.

Note that you can also use the marshal volume to perform data passing yourself, in case you need to write files to a shared location and have downstream steps consume them.