Overview
The Training Operator is designed for fine-tuning and scalable distributed training of machine learning (ML) models created with different ML frameworks such as PyTorch, TensorFlow, XGBoost, and others.
The Training Operator allows you to use Kubernetes workloads to effectively train large models via the Kubernetes Custom Resources APIs or the Training Operator Python SDK. Users can also run high-performance computing (HPC) tasks with the Training Operator and MPIJob, since it supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC.
Internally, the Training Operator implements a centralized Kubernetes controller to orchestrate distributed training jobs.
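As a concrete illustration of the Custom Resources path, the sketch below submits a minimal PyTorchJob through the official Kubernetes Python client. The job name, namespace, image, and command are placeholder assumptions, and the manifest is trimmed to the fields needed for one master and two workers.

```python
# Minimal sketch: submitting a PyTorchJob through the Kubernetes Custom Resources API.
# The job name, namespace, image, and command below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a cluster

pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-example", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                # The default container for a PyTorchJob is named "pytorch".
                                "name": "pytorch",
                                "image": "docker.io/your-org/train:latest",  # placeholder image
                                "command": ["python", "/workspace/train.py"],  # placeholder entrypoint
                            }
                        ]
                    }
                },
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "pytorch",
                                "image": "docker.io/your-org/train:latest",
                                "command": ["python", "/workspace/train.py"],
                            }
                        ]
                    }
                },
            },
        }
    },
}

# Create the Custom Resource; the Training Operator's controller takes over from here.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    body=pytorchjob,
)
```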
See the Get Started guide for distributed training with PyTorch.
Why Training Operator
The Training Operator simplifies running distributed training and fine-tuning, providing the means to easily scale model training from a single machine to a large-scale distributed Kubernetes cluster.
The Training Operator is a unified operator for distributed training across ML frameworks (see the supported frameworks below).
Users can leverage advanced Kubernetes scheduling techniques such as the following to optimize costs, especially for long-running ML training tasks (a Kueue example is sketched after this list).
- Kueue
- Volcano
- YuniKorn
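As one hedged illustration, a job can be routed through a Kueue LocalQueue by labeling its metadata with Kueue's queue-name label; the queue name below is a placeholder and assumes Kueue is installed and configured for the target namespace.

```python
# Sketch: routing a PyTorchJob through a Kueue LocalQueue via a metadata label.
# "team-queue" is a placeholder; the label key follows Kueue's documented convention.
pytorchjob_metadata = {
    "name": "pytorch-example",
    "namespace": "default",
    "labels": {"kueue.x-k8s.io/queue-name": "team-queue"},
}
```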
Train Locally and Remotely from Notebooks
With the Training Operator, data scientists can submit distributed training jobs directly from their notebooks using the Python SDK. They do not need to know anything about Kubernetes (e.g. kubectl or YAML) or the underlying infrastructure.
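A rough sketch of that workflow is shown below. It assumes a recent version of the Training Operator Python SDK whose `TrainingClient` exposes `create_job` and `get_job_logs`; exact method and parameter names can differ between SDK releases, and the job name is a placeholder.

```python
# Sketch: submitting a distributed PyTorch job from a notebook with the
# Training Operator Python SDK. Method and parameter names are indicative of
# recent SDK versions and may differ in yours; treat this as an outline.
from kubeflow.training import TrainingClient

def train_func():
    # Ordinary training code goes here; the SDK runs one copy per worker with
    # the usual distributed environment variables (RANK, WORLD_SIZE, ...) set.
    import os
    print(f"rank={os.environ.get('RANK')} world_size={os.environ.get('WORLD_SIZE')}")

client = TrainingClient()
client.create_job(
    name="pytorch-from-notebook",  # placeholder job name
    train_func=train_func,
    num_workers=2,
)

# Stream the job's logs without leaving the notebook.
client.get_job_logs(name="pytorch-from-notebook", follow=True)
```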
Architecture
The Training Operator is essentially a “frontend” operator. Once it receives a training job, it decomposes the job into various Kubernetes resources (e.g. Role, PodTemplate, fault-tolerance settings, etc.). It then watches over those Custom Resources and manages the lifecycle of the resulting pods.
Important
The above image shows a PyTorchJob and its communication methods. Each ML framework can have its own approaches and its own set of configurable resources.
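To make the decomposition concrete, the sketch below lists the pods created for a single PyTorchJob by selecting on the `training.kubeflow.org/job-name` label that the operator attaches to its pods; the job name and namespace are placeholders.

```python
# Sketch: inspecting the pods the Training Operator created for one PyTorchJob.
# Label keys follow the operator's training.kubeflow.org/* convention; the job
# name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
pods = client.CoreV1Api().list_namespaced_pod(
    namespace="default",
    label_selector="training.kubeflow.org/job-name=pytorch-example",
)
for pod in pods.items:
    labels = pod.metadata.labels or {}
    print(
        pod.metadata.name,
        labels.get("training.kubeflow.org/replica-type"),
        labels.get("training.kubeflow.org/replica-index"),
        pod.status.phase,
    )
```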
Supported ML Frameworks
The Training Operator implements the following Custom Resources for each ML framework:
| ML Framework | Custom Resource |
| ------------ | --------------- |
| PyTorch | PyTorchJob |
| TensorFlow | TFJob |
| XGBoost | XGBoostJob |
| MPI | MPIJob |
| PaddlePaddle | PaddleJob |