Pipeline Creation with the Kale API: Overview

This is a high-level overview of this example.

Kale provides a high-level Python API, called distribute, which automates the creation and submission of a PyTorchJob CR, greatly simplifying the manual process described in the previous section. To use this function, you need to provide (as sketched in the example after this list):

  1. a model, a loss function, a data loader, and an optimizer
  2. a training and an evaluation step function
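
The snippet below is a minimal sketch of what these assets could look like for a toy classifier. The model, dataset, and step-function signatures here are illustrative placeholders, not requirements imposed by Kale; check the Kale documentation for the exact contract the step functions must follow.

  import torch
  import torch.nn as nn
  from torch.utils.data import DataLoader, TensorDataset

  # Asset set 1: a model, a loss function, a data loader, and an optimizer.
  model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
  loss_fn = nn.CrossEntropyLoss()
  dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
  train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

  # Asset set 2: training and evaluation step functions. The argument list
  # used here is an assumption for illustration only.
  def train_step(model, data_loader, loss_fn, optimizer, device):
      model.train()
      for features, labels in data_loader:
          features, labels = features.to(device), labels.to(device)
          optimizer.zero_grad()
          loss = loss_fn(model(features), labels)
          loss.backward()
          optimizer.step()

  def eval_step(model, data_loader, loss_fn, device):
      model.eval()
      total_loss = 0.0
      with torch.no_grad():
          for features, labels in data_loader:
              features, labels = features.to(device), labels.to(device)
              total_loss += loss_fn(model(features), labels).item()
      return total_loss / len(data_loader)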

We use Rok snapshots to marshal these ML assets (model, data loader, optimizer, loss function, and the train and eval step functions) into the master and worker Pods. These Pods run with a Kale entry point, which deserializes the ML assets, prepares them for distributed training, and then starts the training. The train and eval step functions govern what happens during training.
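
From the notebook, handing these assets to Kale could look roughly like the following. The module path, argument names, and the number_of_processes parameter are assumptions made for illustration; the exact signature of the distribute API is defined by your Kale release.

  # Hypothetical sketch of submitting the distributed job through Kale;
  # module path and argument names are assumptions, not the documented API.
  from kale.distributed import pytorch

  job = pytorch.distribute(
      model,             # the assets defined above are snapshotted by Rok
      train_loader,      # and marshaled into the master and worker Pods
      loss_fn,
      optimizer,
      train_step,        # governs what each training iteration does
      eval_step,         # governs how the model is evaluated
      number_of_processes=2,  # e.g., one master Pod plus one worker Pod
  )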

Finally, once all of this is done, you can go back to your notebook and, through the Kale API, restore the trained model from a Rok snapshot, create a new JupyterLab server, and use the model to make predictions.

You now have a high-level picture of the entire workflow. Let’s dive into the individual parts.