
Overview

The Training Operator is extremely well suited for fine-tuning and scalable distributed training of machine learning (ML) models created with different ML frameworks such as PyTorch, TensorFlow, XGBoost, and others.

The Training Operator allows you to use Kubernetes workloads to effectively train your large models via the Kubernetes Custom Resources APIs or the Training Operator Python SDK. Users can also run high-performance computing (HPC) tasks with the Training Operator and MPIJob, since it supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC.
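
For example, a distributed PyTorchJob can be created directly through the Kubernetes Custom Resources API. The sketch below uses the standard Kubernetes Python client; the image name, job name, and replica counts are placeholders, and the exact PyTorchJob fields should be checked against the Training Operator version you run.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a cluster

# Minimal PyTorchJob custom resource (placeholder image and replica counts).
# The container is named "pytorch", which is the default container name the
# operator expects for PyTorchJob.
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-dist-example", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "pytorch", "image": "my-registry/train:latest"}
                        ]
                    }
                },
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "pytorch", "image": "my-registry/train:latest"}
                        ]
                    }
                },
            },
        }
    },
}

# Create the custom resource; the Training Operator controller then creates
# the corresponding pods and keeps the job status in sync.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    body=pytorch_job,
)
```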

Internally, the Training Operator implements a centralized Kubernetes controller to orchestrate distributed training jobs.

See the Get Started Guide for Distributed Training with PyTorch.


Why Training Operator

The Training Operator simplifies running distributed training and fine-tuning, providing the means to easily scale model training from a single machine to a large-scale distributed Kubernetes cluster.

The Training Operator is a unified operator for distributed training with all supported ML frameworks (see the list below).

Users can leverage advanced Kubernetes scheduling techniques such as the following to optimize costs, especially for long-running ML training tasks (see the sketch after this list).

  • Kueue
  • Volcano
  • Yunikorn
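
For example, with Kueue a job is typically routed to a LocalQueue through a label on its metadata. The sketch below assumes Kueue is installed in the cluster and uses a placeholder queue name; Volcano and Yunikorn have their own scheduling configuration.

```python
# Sketch: admit a PyTorchJob through a Kueue LocalQueue by labeling its metadata.
# Assumes Kueue is installed; "team-queue" is a placeholder LocalQueue name.
job_metadata = {
    "name": "pytorch-dist-example",
    "namespace": "default",
    "labels": {"kueue.x-k8s.io/queue-name": "team-queue"},
}
# This metadata would replace the "metadata" block of the custom resource
# shown earlier before the job is created.
```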

Train Locally and Remotely from Notebooks

With the Training Operator, data scientists can submit distributed training jobs directly from their notebooks using the Python SDK. They do not need to know anything about Kubernetes (e.g. kubectl or YAML) or the underlying infrastructure.
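
For instance, a job can be defined and submitted entirely in Python from a notebook. This is a minimal sketch assuming a recent kubeflow-training SDK and its TrainingClient API; method and parameter names vary between SDK versions, so check the SDK reference for the release you use.

```python
from kubeflow.training import TrainingClient


def train_func():
    # Training code that runs inside each worker; a placeholder here.
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")


training_client = TrainingClient()

# Submit a distributed PyTorchJob from the notebook
# (the job name and worker count are placeholders).
training_client.create_job(
    name="pytorch-sdk-example",
    train_func=train_func,
    num_workers=2,
)

# Stream logs from the job's pods once it is running
# (parameter names may differ by SDK version).
training_client.get_job_logs(name="pytorch-sdk-example", follow=True)
```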


Architecture

The Training Operator is essentially a “frontend” operator. Once it receives a training job, it decomposes it into various Kubernetes resources (e.g. roles, pod templates, fault-tolerance settings). It then watches over the Custom Resources and manages the training pods.

(Architecture diagram: a PyTorchJob and its communication methods)

Important

The above image shows a PyTorchJob and its communication methods. Each ML framework can have its own approaches and its own set of configurable resources.


Supported ML Frameworks

The Training Operator implements the following Custom Resources for each ML framework:

ML Framework   Custom Resource
PyTorch        PyTorchJob
TensorFlow     TFJob
XGBoost        XGBoostJob
MPI            MPIJob
PaddlePaddle   PaddleJob