Why do we need a GPU Operator for Kubernetes

This is a follow-up to the previous blog, where we discussed device plugins for GPUs in Kubernetes and reviewed why the Nvidia device plugin is necessary for GPU support. A GPU Operator is needed in Kubernetes to automate and simplify the management of GPUs for workloads running on the cluster.

In this blog, we will look at how a GPU Operator helps automate and streamline operations through the lens of a market-leading implementation by Nvidia.

Nvidia GPU Operator vs Device Plugin

With just the Device Plugin, managing GPUs in Kubernetes requires multiple manual steps, such as installing drivers and libraries and configuring the GPUs on every node. The latter in particular is tedious and error-prone.

The goal of the GPU Operator is to eliminate this complexity by automating the entire setup and lifecycle management of the necessary components. A GPU Operator implements the Kubernetes Operator pattern, which lets users extend a cluster's behavior without modifying Kubernetes itself. Shown below is an image comparing the experience "without" and "with" the GPU Operator.

Without and With GPU Operator

With that context, let's review some of the benefits of the GPU Operator over the Device Plugin.

1. Automated GPU Driver Installation and Management

For Kubernetes nodes with GPUs, the administrator needs to manually install the correct Nvidia drivers on each node to enable GPU-accelerated workloads. In addition, managing driver updates and ensuring compatibility across different Kubernetes versions can be tedious and error-prone.

The Nvidia GPU Operator automates the installation and lifecycle management of Nvidia GPU drivers on each node in the cluster. It also ensures that the correct drivers are installed and updates them when necessary, removing the need for manual intervention.
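As a quick sketch of what this looks like in practice: the operator runs the driver installer as a DaemonSet. The namespace and label below are the chart defaults at the time of writing and may differ in your install; `<driver-pod-name>` is a placeholder.

```shell
# The operator runs the driver installer as a DaemonSet, one pod per GPU node.
# Namespace and label are the chart defaults and may vary by operator version.
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset

# Verify the driver is loaded on a node by running nvidia-smi inside its driver pod.
kubectl exec -n gpu-operator <driver-pod-name> -- nvidia-smi
```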

2. Simplified CUDA Toolkit and Runtime Installation

Many GPU-accelerated workloads rely on the CUDA toolkit and runtime, which includes essential libraries such as cuDNN for deep learning, and NCCL for multi-GPU communication. These components must be installed and configured manually across the cluster nodes.

The Nvidia GPU Operator manages the installation of the CUDA runtime, cuDNN, and other related Nvidia libraries, ensuring that containers running GPU-accelerated workloads have the necessary dependencies pre-installed.
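For example, a container can use the CUDA runtime with no host-side setup beyond what the operator installed. The sketch below uses Nvidia's published vectorAdd sample image; treat the exact image tag as illustrative.

```shell
# Run Nvidia's CUDA vectorAdd sample. The container toolkit installed by the
# operator injects the driver libraries the CUDA runtime needs at startup.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod completes, its logs should report the sample passed.
kubectl logs pod/cuda-vectoradd
```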

3. Device Plugin Management

To enable GPU support in Kubernetes, the Nvidia device plugin must be deployed on every node that has a GPU. The device plugin advertises available GPU resources to Kubernetes, making it possible for Pods to request GPUs. If your cluster has 100 nodes, consider the operational burden to deploy and manage the device plugin on every node.

The Nvidia GPU Operator automatically deploys the Nvidia device plugin as a DaemonSet across all GPU-enabled nodes. This ensures that GPUs are correctly discovered and made available for scheduling without manual configuration.
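You can confirm this is working with a couple of commands. The DaemonSet name below is the operator's default and may vary by version; `<gpu-node-name>` is a placeholder for one of your nodes.

```shell
# The operator deploys the device plugin as a DaemonSet on GPU nodes.
kubectl get daemonset -n gpu-operator nvidia-device-plugin-daemonset

# Confirm a node now advertises GPUs in its capacity/allocatable resources.
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```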

4. GPU Monitoring and Metrics Collection

Monitoring GPU usage (such as memory utilization, temperature, and power usage) is crucial for managing workloads efficiently and ensuring that GPU resources are being used optimally. Without the operator, the administrator has to set up metric-collection tools on each node manually.

The Nvidia GPU Operator integrates with monitoring systems such as Prometheus and provides access to detailed GPU metrics. This makes it easy to track GPU usage and performance, allowing for better resource management and alerting in case of issues.
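Concretely, the operator deploys Nvidia's DCGM exporter, which exposes GPU metrics on a `/metrics` endpoint for Prometheus to scrape. The service name and port below are the chart defaults and may vary by version.

```shell
# Forward the DCGM exporter service locally (default port 9400).
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &

# Sample a few GPU metrics: utilization, framebuffer memory used, temperature.
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP)'
```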

Info

Read our 5-part blog series describing GPU metrics and how Rafay helps users centrally aggregate and visualize GPU metrics.

5. Simplified Deployment with Helm

Manually installing all the required components for GPU management, including drivers, CUDA, the device plugin, and monitoring tools, is error-prone and time-consuming.

The Nvidia GPU Operator is easily deployable via Helm, providing a one-step installation process for setting up GPU management across the entire Kubernetes cluster. This drastically simplifies deployment, especially in large or dynamic environments.
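The installation itself is a short sequence against Nvidia's Helm repository (shown with default values; production installs typically pin a chart version and customize values):

```shell
# Add Nvidia's Helm repository and refresh the local index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator; --wait blocks until its components report ready.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```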

6. Taints and Tolerations Management

GPU nodes may be dedicated to GPU workloads, and without proper configuration, Kubernetes may schedule non-GPU workloads onto GPU nodes, leading to inefficient use of resources.

The Nvidia GPU Operator ensures proper taints and tolerations are applied so that only GPU-specific workloads are scheduled on GPU nodes. This maintains GPU node exclusivity for GPU-accelerated tasks.
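As a sketch of the underlying mechanism: a taint on the node repels ordinary pods, and GPU workloads opt back in with a matching toleration. The taint key/value below is a common convention, not a requirement; the operator's own DaemonSets tolerate GPU taints so its components still run on the node.

```shell
# Reserve a node for GPU workloads; pods without a matching toleration
# will not be scheduled onto it.
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule

# GPU pods then opt in with a matching toleration in their spec:
#   tolerations:
#   - key: nvidia.com/gpu
#     operator: Exists
#     effect: NoSchedule
```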

7. Lifecycle Management and Updates

The Nvidia software components (i.e. drivers, CUDA, and libraries) need to be maintained and updated regularly to keep up with new Kubernetes releases, security patches, or performance improvements.

The Nvidia GPU Operator manages the entire lifecycle of these components, ensuring they are updated in a coordinated manner. The operator also handles version compatibility and applies updates without the need for manual intervention.


Deploy the Nvidia GPU Operator using Rafay

Nearly all of Rafay's customers deploy and manage the Nvidia GPU Operator on their Kubernetes clusters using Cluster Blueprints. This allows them to create a single, standardized combination of cluster add-ons packaged as a cluster blueprint and reuse it across hundreds of clusters. The example below shows a Rafay MKS (Upstream) cluster in an on-premises data center with 56 GPUs attached to it.

Cluster with GPU Operator

Note that the cluster has a blueprint deployed on it with the Nvidia GPU Operator add-on.

Cluster with GPU Operator Blueprint

Once deployed on the cluster, the cluster blueprint with the Nvidia GPU Operator add-on continuously monitors for GPU-enabled nodes and ensures that the required components (i.e. drivers, device plugins, and monitoring tools) are properly installed and running. Below is an image of a node that has been automatically "labeled" by the GPU Operator with all the necessary node labels.

Labels in GPU enabled node
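These labels are applied by the operator's GPU feature discovery component and can be queried directly. The label keys below are the discovery component's defaults and may vary by version; `<gpu-node-name>` is a placeholder.

```shell
# List nodes that feature discovery has labeled as GPU-capable.
kubectl get nodes -l nvidia.com/gpu.present=true

# Inspect the GPU-related labels on one node (count, product, driver, etc.).
kubectl describe node <gpu-node-name> | grep 'nvidia.com/'
```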

Given the significant user interest, we have documented step-by-step Getting Started guides for both data center and popular public cloud cluster deployments.


Conclusion

By using the Nvidia GPU Operator, administrators can spend more time operating and optimizing their applications and less time managing the underlying GPU infrastructure, leading to more efficient operations, fewer errors, and better GPU resource management.

Stay tuned for an upcoming blog where we will discuss various approaches for GPU Sharing.

  • Free Org


    Sign up for a free Org if you want to try this yourself with our Get Started guides.

    Free Org

  • 📆 Live Demo


    Schedule time with us to watch a demo in action.

    Schedule Demo

  • Rafay's AI/ML Products


    Learn about Rafay's offerings in AI/ML Infrastructure and Tooling

    Learn More

  • About the Author


    Read other blogs by the author. Connect with the author on

    Blogs