Using GPUs in Kubernetes¶
Unlike CPU and memory, GPUs are not natively supported in Kubernetes. Kubernetes manages CPU and memory out of the box: it can automatically schedule containers based on these resources, allocate them to Pods, and handle resource isolation and over-subscription.
GPUs are considered specialized hardware, and Kubernetes requires device plugins to support them. Device plugins make Kubernetes GPU-aware, allowing it to discover, allocate, and schedule GPUs for containerized workloads. Without a device plugin, Kubernetes is unaware of the GPUs available on the nodes and cannot assign them to Pods. In this blog, we will discuss why GPUs are not natively supported and how device plugins address this gap.
Why are GPUs not Natively Supported?¶
GPUs are specialized hardware with very different characteristics from CPU and memory. They require specific drivers, libraries (e.g., CUDA for Nvidia GPUs), and device access mechanisms to work.
Kubernetes is designed for general-purpose resource management (CPU, memory), so GPUs require additional mechanisms for proper scheduling, isolation, and management. To address these needs, the Kubernetes device plugin framework was introduced, starting with Kubernetes v1.10.
Device Plugin for GPUs¶
The device plugin framework allows third-party plugins (e.g. Nvidia’s GPU device plugin) to register and advertise specialized hardware devices to Kubernetes. This allows Kubernetes to schedule GPU resources in the same way it manages CPU and memory.
In order to do this effectively, the third-party device plugin software for the GPU has to perform several tasks reliably.
Device Discovery¶
The plugin should automatically detect and register the GPU resources available on the nodes. The plugin software is typically deployed as a DaemonSet that runs on every node with a GPU and registers that node's GPUs with the kubelet.
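Once registered, the GPUs show up in the node's status. Shown below is an illustrative excerpt of `kubectl get node <node-name> -o yaml` for a node with two Nvidia GPUs; the counts and memory value are placeholders, and the resource name depends on the vendor's plugin.

```yaml
# Illustrative node status after the device plugin registers two GPUs
status:
  capacity:
    cpu: "16"
    memory: 65867088Ki
    nvidia.com/gpu: "2"    # advertised by the device plugin
  allocatable:
    nvidia.com/gpu: "2"    # available for the scheduler to assign to Pods
```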
Resource Allocation¶
The plugin should allow Pods to request specific GPU resources (e.g. nvidia.com/gpu) and ensure that each Pod is properly isolated when accessing GPU resources. A complete Pod spec appears in the Custom GPU Resource section below.
Driver and Library Access¶
The plugin should ensure that GPU device files, libraries (e.g., CUDA), and drivers are mounted inside the container, so the application can use the GPU.
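On Nvidia nodes, this wiring is typically provided by the NVIDIA Container Toolkit. As a minimal sketch, a RuntimeClass can point Pods at a GPU-enabled runtime handler; the handler name `nvidia` is an assumption and must match the handler configured in your container runtime (e.g. containerd).

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia    # Pods opt in via spec.runtimeClassName: nvidia
handler: nvidia   # assumed handler name; must exist in the node's runtime config
```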
Metrics and Monitoring¶
Ideally, the plugin should also provide metrics and monitoring capabilities for GPU workloads. Users expect metrics to be integrated with cloud native frameworks such as Prometheus.
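For example, Nvidia GPU metrics are commonly exposed by dcgm-exporter and scraped by Prometheus. Below is a minimal sketch using a Prometheus Operator ServiceMonitor; the namespace, label, and port name are assumptions that must match your dcgm-exporter deployment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring      # assumed namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter     # assumed label on the dcgm-exporter Service
  endpoints:
  - port: metrics            # assumed port name exposing GPU metrics
```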
Info
We have blogged extensively about GPU metrics and how Rafay helps users by centrally aggregating and visualizing them.
Limitations¶
While the device plugin framework provides basic GPU support, it also has limitations. Let's review a few of them.
GPU Sharing¶
By default, Kubernetes does not support fine-grained GPU sharing: a Pod is allocated whole GPUs. Features such as Nvidia MPS (Multi-Process Service), MIG (Multi-Instance GPU), or time-slicing are required to share a single GPU among multiple Pods, as illustrated below.
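For example, recent versions of the Nvidia device plugin support time-slicing, which oversubscribes a physical GPU across Pods. Here is a sketch of the plugin's sharing configuration, assuming a plugin version that reads its config from a ConfigMap (the name and replica count below are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # assumed name; the plugin must be configured to read it
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU is advertised as 4 schedulable slices
```

Note that time-slicing does not enforce memory or fault isolation between the sharing Pods; MIG is the option to use when hard isolation is required.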
Resource Isolation¶
GPUs don’t provide the same level of isolation as CPUs or memory, which can result in performance unpredictability if not managed properly.
Complexity¶
Setting up GPU support in Kubernetes requires additional components such as the device plugin software and GPU drivers. This adds complexity to cluster add-on management because of the constant updates and versioning that an administrator now has to deal with.
Info
Users of the Rafay platform address this challenge using our Standardization Suite, especially cluster blueprints, which allow them to create and manage version-controlled add-on bundles for all clusters in the Org.
Custom GPU Resource¶
Once the plugin is deployed, Kubernetes exposes a custom resource (e.g. amd.com/gpu or nvidia.com/gpu). Workloads can now consume these GPUs by requesting the custom GPU resource in the same way they would request CPU or memory.
Shown below is the YAML spec for a Pod requesting 1 Nvidia GPU.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.2.1-base-ubuntu20.04 # Example GPU-accelerated image
    resources:
      limits:
        nvidia.com/gpu: 1 # Request 1 GPU for this container
```
Important
GPUs should only be specified in the limits section. If you also specify GPUs in requests, ensure that the value is identical to the one in limits.
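For example, the following resources stanza is valid because the two values match:

```yaml
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1   # must be identical to the value in requests
```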
Conclusion¶
While Kubernetes does not natively support GPUs as it does CPUs and memory, the device plugin framework enables Kubernetes to manage and schedule them. These plugins provide essential features such as GPU discovery, allocation, and resource isolation, allowing you to run GPU-accelerated workloads in a Kubernetes cluster. The table below lists where you can find the device plugins for the top-3 GPU vendors.
| Vendor | Git Repository for Kubernetes Device Plugin |
| --- | --- |
| Nvidia | Nvidia GPU Device Plugin |
| AMD | AMD ROCm Kubernetes Device Plugin |
| Intel Gaudi | Intel Gaudi Device Plugin |
Info
In an upcoming blog, we will discuss the rationale for a Kubernetes Operator for GPUs.