(Hands-On) Monitor the Distributed Training Process

In this section, we will view the pipeline run and monitor the distributed training process.

2. Find the pipeline run

Find the pipeline run and click on it:

3. View the progress

View the progress of the pipeline run:

4. View the submit_training_job step

In this step, Kale submits a PyTorch distributed training job:
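If you prefer to check the submitted job from a notebook cell instead of the UI, the following minimal sketch may help. It assumes that the job is exposed as a PyTorchJob custom resource (group kubeflow.org, version v1) and that its status carries a list of conditions such as Created, Running, Succeeded, or Failed, appended in order. The helper names latest_condition and fetch_pytorchjob are hypothetical, and the job name and namespace are placeholders that you need to supply yourself.

```python
from typing import Any, Dict, Optional


def latest_condition(job: Dict[str, Any]) -> Optional[str]:
    """Return the type of the last condition whose status is "True".

    Assumes conditions are listed oldest first, so the last active one
    describes the current state of the job.
    """
    conditions = job.get("status", {}).get("conditions", [])
    active = [c for c in conditions if c.get("status") == "True"]
    return active[-1]["type"] if active else None


def fetch_pytorchjob(name: str, namespace: str) -> Dict[str, Any]:
    """Fetch a PyTorchJob object as a plain dict (hypothetical helper).

    Requires the `kubernetes` package and in-cluster credentials, for
    example when running inside a notebook server pod.
    """
    from kubernetes import client, config  # imported lazily on purpose

    config.load_incluster_config()
    api = client.CustomObjectsApi()
    return api.get_namespaced_custom_object(
        group="kubeflow.org",
        version="v1",
        namespace=namespace,
        plural="pytorchjobs",
        name=name,
    )


if __name__ == "__main__":
    # Hand-written sample status for illustration, not real cluster output.
    sample_job = {
        "status": {
            "conditions": [
                {"type": "Created", "status": "True"},
                {"type": "Running", "status": "False"},
                {"type": "Succeeded", "status": "True"},
            ]
        }
    }
    print(latest_condition(sample_job))
```

On a live cluster you would pass the result of fetch_pytorchjob to latest_condition. The parsing is kept separate from the API call so that it can be tried out without cluster access.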

5. View the logs of the distributed training job

To view the logs of the distributed training job, click on the monitor step:

Go to the Logs tab:

6. Wait for the pipeline run to complete

Wait for the pipeline run to complete:
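Waiting can also be scripted. The sketch below is a generic polling loop, not part of Kale or Kubeflow: it repeatedly calls a hypothetical get_status callable that you provide and returns once the status reaches a terminal state. The terminal state names "Succeeded" and "Failed", the polling interval, and the timeout are assumptions, so adjust them to whatever your status source actually reports.

```python
import time
from typing import Callable, FrozenSet

# Assumed terminal state names; adjust to match your status source.
TERMINAL_STATES: FrozenSet[str] = frozenset({"Succeeded", "Failed"})


def wait_for_completion(
    get_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    terminal_states: FrozenSet[str] = TERMINAL_STATES,
) -> str:
    """Poll get_status until it returns a terminal state, then return it.

    Raises TimeoutError if no terminal state is seen within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status is still {status!r} after {timeout_s} seconds"
            )
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence for illustration, not real output.
    fake_statuses = iter(["Pending", "Running", "Running", "Succeeded"])
    result = wait_for_completion(lambda: next(fake_statuses), interval_s=0)
    print(result)
```

Passing the status source in as a callable keeps the loop independent of any particular client library, so the same function can wrap a pipeline run lookup or a training job lookup.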

Congratulations! You have successfully trained an artificial neural network in a distributed manner using Kubeflow, Kale, PyTorch, and Rok.