
Distributed PyTorch with Kubeflow and Kale

In this course, you will train an artificial neural network in a distributed manner using Kubeflow, Kale, PyTorch, and Rok. First, you will launch a PyTorch distributed training job from your Notebook using Kale. Then you will monitor the pipeline run and the distributed training process. Finally, you will restore the trained model from a Rok snapshot and evaluate it on the test dataset.


About This Course

In this course, you will run a PyTorch distributed training job starting from your Notebook. The Kubeflow PyTorch operator is a Kubeflow component that orchestrates distributed PyTorch deep learning jobs. To describe and submit such a job, you would normally need to write a PyTorchJob CR (a YAML file) specifying the run configuration and the container images the Pods will run. Besides writing this YAML file, you would also need to build a Docker image for the Pods and make sure your data is available to them. Fortunately, Kale now automates all of this.
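
To make concrete what Kale automates, here is a minimal sketch of the manual path: building a PyTorchJob CR as a Python dictionary and submitting it with the official Kubernetes Python client. The job name, namespace, image, and replica counts below are placeholders, not values from this course.

    # Sketch of manually submitting a PyTorchJob CR -- the work Kale automates.
    # The name, namespace, image, and replica counts are placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() inside a Pod

    container = {
        "name": "pytorch",  # the PyTorch operator expects this container name
        "image": "my-registry/my-training-image:latest",  # placeholder image
    }

    pytorch_job = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": "my-training-job", "namespace": "my-namespace"},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [container]}},
                },
                "Worker": {
                    "replicas": 2,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [container]}},
                },
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org",
        version="v1",
        namespace="my-namespace",
        plural="pytorchjobs",
        body=pytorch_job,
    )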

In this course, you will use a high-level Python API, called distribute, that Kale provides. The Kale API automates the creation and submission of a PyTorchJob CR, greatly simplifying the process described above. To use this function, you need to provide (see the sketch after this list):

  • a model, a loss function, a data loader, and an optimizer
  • a training and an evaluation step function
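
The snippet below sketches how these pieces fit together: it defines the required assets and hands them to the distribute API. The module path, parameter names, and step-function signatures are assumptions made for illustration; consult the Kale documentation for the exact API.

    # Sketch of using Kale's distribute API. The kale.distributed import path,
    # the distribute() keyword arguments, and the step-function signatures are
    # ASSUMPTIONS for illustration, not the confirmed Kale API.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # The assets the distribute API needs: model, loss, data loader, optimizer.
    model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    train_loader = DataLoader(
        TensorDataset(torch.randn(1024, 28 * 28), torch.randint(0, 10, (1024,))),
        batch_size=64,
    )

    # Train and eval step functions governing what happens during training.
    def train_step(model, batch, loss_fn, optimizer):
        inputs, labels = batch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

    def eval_step(model, batch, loss_fn):
        inputs, labels = batch
        with torch.no_grad():
            loss = loss_fn(model(inputs), labels)
        return loss.item()

    from kale.distributed import pytorch as kale_pytorch  # assumed module path

    job = kale_pytorch.distribute(  # assumed signature
        model=model,
        train_data_loader=train_loader,
        loss=loss_fn,
        optimizer=optimizer,
        train_step=train_step,
        eval_step=eval_step,
        number_of_processes=3,  # e.g., 1 master + 2 workers (assumed parameter)
    )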

We use Rok snapshots to marshal these ML assets (model, data loader, optimizer, loss function, train and eval functions) into the master and worker Pods. These Pods run with a Kale entrypoint, which deserializes the ML assets, prepares them for distributed training, and then starts the training. The train and eval step functions govern what happens during training. Once all of this is done, you can go back to your notebook and, with a single Kale API call (see the sketch after this list):

  • Monitor the logs of the distributed job.
  • Manage the PyTorchJob CR lifecycle.
  • Delete the job and garbage collect all resources.
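
The calls below illustrate these lifecycle operations on the job handle returned by the distribute call. The method names are assumptions; the real Kale API may name them differently.

    # Hypothetical lifecycle calls on the job handle -- names are assumptions.
    job.stream_logs()  # monitor the logs of the master and worker Pods
    job.get()          # inspect the current state of the PyTorchJob CR
    job.delete()       # delete the job and garbage collect all its resources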

Finally, you will restore the trained model from a Rok snapshot, create a new JupyterLab server, and evaluate the model on the test dataset.
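
As a sketch of this last step, assume the training code saved the model's weights with torch.save to a file inside the snapshotted volume; in the new JupyterLab server restored from the Rok snapshot, you can then load the weights from disk and evaluate. The file path and the test DataLoader below are placeholders, and model is the same architecture defined in the earlier sketch.

    import torch

    # Load the trained weights from the volume restored by the Rok snapshot.
    # "trained_model.pt" is a placeholder path.
    model.load_state_dict(torch.load("trained_model.pt"))
    model.eval()

    # Evaluate accuracy on the test dataset; test_loader is a placeholder
    # DataLoader over the test set.
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    print(f"Test accuracy: {correct / total:.2%}")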

Frequently Asked Questions

What web browser should I use?

The Open edX platform works best with current versions of Chrome, Edge, Firefox, Internet Explorer, or Safari.

See our list of supported browsers for the most up-to-date information.
