

The Training Operator implements a Python SDK to simplify the creation of distributed training and fine-tuning jobs for data scientists. Interested in trying this hands-on? Navigate to our Get Started Guide for Distributed Training with PyTorch.


PyTorch-based Distributed Training

For PyTorch-based distributed training, the data scientist is responsible for writing the training code using native PyTorch Distributed APIs. They then create a PyTorchJob with the required number of workers and GPUs using the Training Operator Python SDK.

Note

All of the above is generally done natively in Python code; the user does not need any knowledge of Kubernetes.
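
For example, a job like the one described above could be created roughly as follows. This is a minimal sketch using the kubeflow-training SDK's TrainingClient; the job name is hypothetical, and parameter names such as num_workers and resources_per_worker may vary between SDK versions.

```python
from kubeflow.training import TrainingClient


def train_func():
    # Native PyTorch Distributed training code goes here (see the snippets below).
    # Imports live inside the function so the SDK can ship it to the workers.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    print(f"Hello from rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()


# Ask the Training Operator to run train_func on 4 workers with 1 GPU each.
TrainingClient().create_job(
    name="pytorch-dist-example",   # hypothetical job name
    train_func=train_func,
    num_workers=4,
    resources_per_worker={"gpu": "1"},
)
```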

Once the Training Operator receives the request, it creates the necessary Kubernetes Pods with the appropriate environment variables so that the torchrun CLI can start the distributed PyTorch training job.
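
As an illustration, the training code can pick up those variables through PyTorch's standard env:// rendezvous. The sketch below assumes each worker Pod has a GPU and that torchrun has exported the usual MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK variables.

```python
import os

import torch
import torch.distributed as dist

# torchrun, launched inside each worker Pod by the Training Operator, exports
# MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK. The "env://"
# rendezvous reads them, so no addresses are hard-coded in the training code.
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

print(f"rank={dist.get_rank()} world_size={dist.get_world_size()}")
```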

Users can also define any of the distributed strategies supported by PyTorch in their training code, and the Training Operator automatically sets the appropriate environment variables for torchrun accordingly.
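
One common example of such a strategy is DistributedDataParallel (DDP). A minimal sketch, assuming the process group from the previous snippet is already initialized and the current CUDA device was set from LOCAL_RANK:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group() has already run (see the previous snippet).
model = torch.nn.Linear(128, 10).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

# backward() now averages gradients across all workers automatically.
loss = ddp_model(torch.randn(32, 128).cuda()).sum()
loss.backward()
```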

The image below shows how the Training Operator creates PyTorch workers for the ring all-reduce algorithm. At the end of the ring all-reduce, the gradients (g1, g2, g3, g4) are synchronized across every worker and the model is trained.

[Image: PyTorch-based Distributed Training]
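
For intuition, the collective underlying this step is an all-reduce over gradient tensors; DDP issues it automatically, but it can also be called directly. A minimal sketch, assuming the process group is already initialized:

```python
import torch
import torch.distributed as dist

# Each worker contributes its local gradient and receives the sum of all of
# them; dividing by the world size gives every worker the same averaged gradient.
grad = torch.randn(1024).cuda()          # this worker's local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()
```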


TensorFlow-based Distributed Training

For TensorFlow-based distributed training, the data scientist is responsible for writing the training code using native TensorFlow Distributed APIs. They then create a TFJob with the required number of parameter servers (PS), workers, and GPUs using the Training Operator Python SDK.

Note

All of the above is generally done natively in Python code; the user does not need any knowledge of Kubernetes.
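
As with the PyTorch case, the job could be created through the SDK roughly as follows. This is a minimal sketch; job_kind, num_ps_replicas, and the other parameter names are assumptions and may differ between SDK versions.

```python
from kubeflow.training import TrainingClient


def train_func():
    # Native TensorFlow distributed training code goes here (see below).
    ...


# Ask the Training Operator to run train_func on 1 parameter server and
# 2 workers with 1 GPU each. Job name and parameter names are illustrative.
TrainingClient(job_kind="TFJob").create_job(
    name="tf-dist-example",        # hypothetical job name
    train_func=train_func,
    num_ps_replicas=1,
    num_workers=2,
    resources_per_worker={"gpu": "1"},
)
```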

The Training Operator then creates the required Kubernetes Pods with the TF_CONFIG environment variable set appropriately to start the distributed TensorFlow training job. The parameter server splits the training data across the workers and averages the model weights based on the gradients produced by every worker.
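
For reference, TF_CONFIG is a JSON document that describes the whole cluster and the role of the current Pod. The values below are illustrative only (hostnames and ports are assumptions), but the structure is what TensorFlow expects.

```python
import json
import os

# Illustrative TF_CONFIG, similar in shape to what the Training Operator
# injects into each Pod. "cluster" lists every replica; "task" identifies
# the Pod this process is running in.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["tfjob-example-ps-0:2222"],
        "worker": ["tfjob-example-worker-0:2222",
                   "tfjob-example-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

print(json.loads(os.environ["TF_CONFIG"])["task"])
```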

Users can specify any of the distributed strategies supported by TensorFlow in their training code, and the Training Operator sets the TF_CONFIG environment variable accordingly.
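
For example, tf.distribute.MultiWorkerMirroredStrategy reads TF_CONFIG on its own, so the training code only needs to build the model inside the strategy scope. A minimal sketch:

```python
import tensorflow as tf

# The strategy discovers the other workers from the TF_CONFIG environment
# variable that the Training Operator sets in every Pod.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are replicated and kept in sync across workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) would then run the synchronized training loop.
```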

[Image: TensorFlow-based Distributed Training]