(Hands-On) Monitor the Distributed Training Process
In this section, we will view the pipeline run and monitor the distributed training process.
2. Find the pipeline run
Find the pipeline run and click on it:
3. View the progress
View the progress of the pipeline run:
4. View the submit_training_job step
In this step, Kale submits a PyTorch distributed training job:
5. View the logs of the distributed training job
To view the logs of the distributed training job, click on the monitor step:
Go to the Logs tab:
6. View the pipeline run
Wait for the pipeline run to complete:
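The same wait can also be sketched outside the UI. The snippet below is a minimal illustration, not Kale's own implementation: it assumes the training job is exposed as a Kubeflow PyTorchJob resource whose status carries a list of conditions (such as Created, Running, Succeeded, Failed), and that the most recent condition marked "True" reflects the current phase. The names job_phase, wait_for_job, and fetch are hypothetical helpers introduced here for illustration.

```python
import time
from typing import Callable, Optional

# Condition types treated as final in this sketch (an assumption).
TERMINAL_PHASES = ("Succeeded", "Failed")


def job_phase(job: dict) -> Optional[str]:
    """Return the type of the last condition whose status is "True", or None."""
    conditions = (job.get("status") or {}).get("conditions") or []
    active = [c.get("type") for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


def wait_for_job(
    fetch: Callable[[], dict],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll fetch() until the job reports a terminal phase, then return it.

    fetch is a hypothetical callable that returns the job resource as a dict.
    With the official Kubernetes Python client it could wrap, for example:
        client.CustomObjectsApi().get_namespaced_custom_object(
            group="kubeflow.org", version="v1",
            namespace="my-namespace",      # hypothetical namespace
            plural="pytorchjobs",
            name="my-training-job")        # hypothetical job name
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        phase = job_phase(fetch())
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal phase in time")
        time.sleep(poll_seconds)
```

Passing fetch as a callable keeps the polling logic independent of how the resource is retrieved, so it can be exercised with canned dictionaries before pointing it at a real cluster.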
Congratulations! You have successfully trained an artificial neural network in a distributed manner using Kubeflow, Kale, PyTorch, and Rok.