
PyTorch Distributed

In this guide, you will learn how to use PyTorch with the integrated Training Operator in a Jupyter notebook on Rafay's Kubeflow-based MLOps platform to perform distributed training on a dataset.


Step 1: Login

In this step, you will log in to your MLOps Platform.

  • Navigate to the URL provided by your platform team
  • Log in using your local credentials or SSO credentials (via an identity provider such as Okta)

(Screenshot: Login)

Once logged in, you will see the home dashboard screen.

(Screenshot: Dashboard)


Step 2: Create a Notebook

In this step, you will create a Jupyter Notebook that will be used to run the PyTorch code.

  • Navigate to Notebooks
  • Click New Notebook
  • Enter a name for the notebook
  • Select JupyterLab
  • Select kubeflownotebookswg/jupyter-pytorch-full:v1.8.0 as the custom notebook image
  • Set the minimum CPU to 1
  • Set the minimum memory to 1 Gi
  • Click Advanced Options under Data Volumes
  • Click Launch

(Screenshot: Launch)

It will take 1-2 minutes to create the notebook.

(Screenshot: Launch)


Step 3: Execute Training

In this step, you will use the notebook to install the required packages and create PyTorchJobs that perform distributed training with the DistributedDataParallel strategy (illustrative sketches of the setup and training code appear after the steps below).

  • Navigate to Notebooks
  • Click Connect on the previously created notebook
  • Download the following notebook file
  • In the left-hand file browser, click the upload files icon
  • Upload the previously downloaded pytorch-distributed.ipynb file
  • Double-click the pytorch-distributed.ipynb file in the file browser to open the notebook
  • Click the restart kernel and run all cells icon
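
The first cells of the notebook install the Python dependencies for the rest of the run. As a rough sketch (the exact packages and versions in pytorch-distributed.ipynb may differ), the key extra piece is the Kubeflow Training SDK, which provides the client used to create PyTorchJobs from a notebook cell:

```python
# Install the Kubeflow Training SDK used later to submit PyTorchJobs.
# The package list is an assumption; follow whatever the provided notebook actually installs.
!pip install kubeflow-training
```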

It will take ~3 minutes to run the training.

  • Navigate to the output of cell 3 to view the training that was performed locally in the notebook

(Screenshot: Local Training)
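
For reference, the local run in cell 3 amounts to an ordinary single-process PyTorch training loop. The sketch below is illustrative rather than a copy of the notebook's cell: the model, the FashionMNIST dataset, and the hyperparameters are assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_local(epochs: int = 1):
    # Use a GPU if the notebook pod has one, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Small fully connected classifier; stands in for whatever model the notebook defines.
    model = nn.Sequential(
        nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
    ).to(device)
    dataset = datasets.FashionMNIST("./data", train=True, download=True,
                                    transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for step, (images, labels) in enumerate(loader):
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")

train_local()
```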

  • Navigate to the output of cell 7 to view the training that was distributed across three workers

(Screenshot: Distributed Training)
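
Behind cell 7, the notebook hands a training function to the Training Operator, which runs it on several worker pods using PyTorch's DistributedDataParallel. The sketch below shows that pattern with the Kubeflow Training SDK's TrainingClient; the job name, model, dataset, resource requests, and package list are assumptions, and the exact create_job arguments should be checked against the SDK version installed in your notebook.

```python
from kubeflow.training import TrainingClient

def train_distributed():
    # This function runs inside each PyTorchJob worker pod, so imports live inside it.
    # The Training Operator injects RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so init_process_group can pick them up from the environment.
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms

    dist.init_process_group(backend="gloo")  # gloo works for CPU-only workers
    model = nn.parallel.DistributedDataParallel(
        nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
    )
    dataset = datasets.FashionMNIST("./data", train=True, download=True,
                                    transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset)  # shards the data across the workers
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for step, (images, labels) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0 and step % 100 == 0:
            print(f"step {step} loss {loss.item():.4f}")

# Submit the function as a PyTorchJob with three workers.
# The job name, resources, and extra packages are illustrative.
client = TrainingClient()
client.create_job(
    name="pytorch-ddp-demo",
    train_func=train_distributed,
    num_workers=3,
    resources_per_worker={"cpu": "1", "memory": "2Gi"},
    packages_to_install=["torchvision"],
)
client.get_job_logs(name="pytorch-ddp-demo", follow=True)
```

With three workers, the DistributedSampler gives each pod roughly a third of the dataset per epoch, and DistributedDataParallel averages the gradients across workers after every step, which is what produces the distributed output shown above.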


Recap

Congratulations! You have successfully created a Jupyter notebook and used the Training Operator to perform both local and distributed training on a dataset.