PyTorch Distributed
In this guide, you will learn how to use PyTorch with the integrated Training Operator in a Jupyter notebook on Rafay's Kubeflow-based MLOps platform to perform distributed training of a model.
Step 1: Log In
In this step, you will log in to your MLOps platform.
- Navigate to the URL (this will be provided by your platform team)
- Log in using your local credentials or SSO credentials (an identity provider such as Okta)
Once logged in, you will see the home dashboard screen.
Step 2: Create a Notebook
In this step, you will create a Jupyter Notebook that will be used to execute PyTorch.
- Navigate to Notebooks
- Click New Notebook
- Enter a name for the notebook
- Select JupyterLab
- Select kubeflownotebookswg/jupyter-pytorch-full:v1.8.0 as the custom notebook image
- Set the minimum CPU to 1
- Set the minimum memory to 1
- Click Advanced Options under Data Volumes
- Click Launch
It will take 1-2 minutes to create the notebook.
Step 3: Execute Training
In this step, you will use the notebook to install the required packages and create PyTorchJobs that perform distributed training with the DistributedDataParallel strategy.
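For context, DistributedDataParallel (DDP) wraps a model so that gradients are averaged across all workers on each backward pass. The sketch below is a hypothetical, minimal example of that pattern, not the contents of the tutorial notebook; it runs as a single process on CPU with the gloo backend, and all names (`train`, the toy model and data) are illustrative. In a real PyTorchJob, the rank, world size, and master address are injected by the Training Operator as environment variables.

```python
# Hypothetical sketch of DistributedDataParallel (DDP) training.
# In a PyTorchJob, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
# are set by the Training Operator; here we default to a single process.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int = 0, world_size: int = 1) -> float:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Wrapping the model in DDP makes gradients all-reduce across workers.
    model = DDP(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    x, y = torch.randn(32, 10), torch.randn(32, 1)  # toy data
    for _ in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradient synchronization happens here
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(f"final loss: {train():.4f}")
```

With more than one worker, each process would call `train(rank, world_size)` with its own rank, and a `DistributedSampler` would shard the dataset across workers.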
- Navigate to Notebooks
- Click Connect on the previously created notebook
- Download the pytorch-distributed.ipynb notebook file
- In the left-hand folder tree, click the upload files icon
- Upload the previously downloaded pytorch-distributed.ipynb file
- Double click the pytorch-distributed.ipynb file in the folder tree to open the notebook
- Click the "Restart Kernel and Run All Cells" icon
It will take ~3 minutes to run the training.
- Navigate to the output of cell 3 to view the training that was performed locally in the notebook
- Navigate to the output of cell 7 to view the training that was distributed across three workers
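Under the hood, the distributed run in cell 7 is expressed as a PyTorchJob, a Kubernetes custom resource managed by the Training Operator. A simplified manifest for a job with three workers might look like the following; the job name, image, and script path are illustrative, not taken from the tutorial notebook:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-example        # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch         # container must be named "pytorch"
              image: kubeflownotebookswg/jupyter-pytorch-full:v1.8.0  # illustrative
              command: ["python", "/workspace/train.py"]              # illustrative
    Worker:
      replicas: 3                   # the three workers seen in cell 7
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflownotebookswg/jupyter-pytorch-full:v1.8.0
              command: ["python", "/workspace/train.py"]
```

The Training Operator creates the master and worker pods and injects the environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that PyTorch's distributed runtime uses to form the process group.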
Recap
Congratulations! At this point, you have successfully created a Jupyter notebook and used the Training Operator to perform both local and distributed training of a model.