
Distributed Training and Model Serving

In this course, you will learn how to use distributed training for your models.

  • Course Number

    Course
  • Self-Paced

About This Course

In this course, you will run a PyTorch distributed training job starting from your Notebook. The Kubeflow PyTorch operator is a Kubeflow component that orchestrates distributed PyTorch deep learning jobs. To describe and submit such a job, you would normally need to write a PyTorchJob custom resource (CR) as a YAML file, providing the run configuration and the containers with which the Pods will run. Besides writing this YAML file, you would also need to build a Docker image for the Pods and make sure your data is available to them. Fortunately, Kale now automates all of this.
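To give a rough idea of what writing a PyTorchJob CR by hand involves, here is a minimal sketch that builds such a manifest as a Python dictionary and prints it as JSON. The job name, image, and training command are hypothetical placeholders, not values from this course, and the manifest is deliberately stripped down: a real job would usually also need resource requests, volumes for the data, and so on.

```python
import json


def build_pytorchjob(name, image, workers, command):
    """Return a minimal PyTorchJob manifest as a plain dict.

    Illustrative sketch only: name, image, and command are
    hypothetical placeholders supplied by the caller.
    """

    def replica(count):
        # Master and Worker replicas share the same Pod template here.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",
                            "image": image,
                            "command": list(command),
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


manifest = build_pytorchjob(
    name="pytorch-dist-example",  # hypothetical job name
    image="registry.example.com/pytorch-train:latest",  # hypothetical image
    workers=2,
    command=["python", "/workspace/train.py"],  # hypothetical entry point
)

# Print the manifest; the JSON form describes the same resource as the YAML form.
print(json.dumps(manifest, indent=2))
```

Even this stripped-down version leaves out the image build and the data volumes mentioned above, which is the manual work that Kale is described as automating.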

Instructor Led Option

This course is also offered monthly with an instructor if you would prefer to take it live. If this is your preference, please sign up here.

What is Kubeflow?

Kubeflow got its start at Google. The idea was to create a simpler way to run TensorFlow jobs on Kubernetes. Kubeflow was initially built to run TensorFlow, based on a pipeline called TensorFlow Extended, and was later extended to support multiple architectures and multiple clouds so that it could serve as a framework for running entire machine learning pipelines. The Kubeflow open source project (licensed under Apache 2.0) was formally announced at the end of 2017.

In a nutshell, Kubeflow is the machine learning toolkit that runs on top of Kubernetes. Kubeflow’s combined components allow both data scientists and DevOps to manage data, train models, tune and serve them, as well as monitor them.

Who is the “Distributed Training” course for?

Data scientists and DevOps engineers with little or no experience with Kubeflow.

Requirements

We assume that you have basic familiarity with cloud computing environments like AWS, GCP, or Azure, as well as a basic understanding of cloud-native architectures and Kubernetes concepts like pods, controllers, nodes, container images, and volumes. Additionally, we assume that you are familiar with ML concepts like algorithms, model training, and parameter tuning.

Frequently Asked Questions

What web browser should I use?

The Open edX platform works best with current versions of Chrome, Edge, Firefox, Internet Explorer, or Safari.

See our list of supported browsers for the most up-to-date information.
