GPU Sharing Strategies in Kubernetes¶
In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.
Pod manifests must request GPU resources in whole integers, which means an entire physical GPU is allocated to a single container even if that container needs only a fraction of its resources. In this blog, we will describe two popular and commonly used strategies to share a GPU on Kubernetes.
Background¶
When using GPUs in a Kubernetes cluster, sharing them among multiple workloads or containers presents several challenges. GPUs are powerful computational resources often used for tasks like machine learning, deep learning, and high-performance computing. Unlike CPUs, GPUs are typically designed for exclusive access, so fine-grained resource sharing across multiple processes is not possible in the way it is with CPUs. Nvidia's GPU Operator supports several approaches to help with oversubscription of GPUs; the two most commonly used are described below.
1. Time Sharing/Slicing¶
This approach allows multiple workloads to share a GPU by alternating execution time (i.e. workloads are interleaved on the GPU). It is configured through a set of extended options for the NVIDIA Kubernetes Device Plugin. The administrator defines a set of “replicas” for a GPU, and each replica can be handed out independently to a pod to run workloads. Each process then gets a turn at using the GPU in round-robin fashion. In the image below, containers "A", "B" and "C" get round-robin access to the GPU.
How to Configure?¶
The Kubernetes device plugin is the interface used to apply configuration changes on nodes containing GPUs. The device plugin is responsible for advertising the availability of GPU resources to the Kubernetes API server. In the example below, a Tesla T4 GPU is configured to be shared by 3 pods.
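A minimal sketch of that configuration, assuming the GPU Operator's device plugin ConfigMap mechanism (the ConfigMap name, namespace, and the tesla-t4 key are illustrative; the replicas value is what controls the sharing):

```yaml
# Time-slicing configuration consumed by the NVIDIA Kubernetes Device Plugin.
# The ConfigMap name, namespace, and "tesla-t4" key are illustrative placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3   # advertise each physical GPU as 3 schedulable replicas
```

Once the device plugin picks up this configuration, a node with a single T4 advertises nvidia.com/gpu: 3, and up to three pods each requesting nvidia.com/gpu: 1 can be scheduled onto the same physical GPU.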
When is it a good fit?¶
This approach is well suited for bursty, interactive workloads with significant idle periods. Allocating a fully dedicated GPU to these types of workloads can be cost-prohibitive. Of course, such workloads need to tolerate slower performance and higher latency.
Info
There is no memory or fault isolation between replicas. Time-slicing simply multiplexes workloads onto the same underlying GPU.
2. Multi Instance GPU (MIG)¶
With Nvidia's MIG, a single supported GPU can be spatially partitioned into up to seven instances (i.e. slices). Each instance can then be allocated to a single container on the node, so a maximum of seven containers can use the GPU concurrently. Unlike time slicing, spatial partitioning divides the GPU into isolated, static instances, providing hardware isolation and consistent QoS.
Info
MIG profiles are static once they are configured, so dynamic resource allocation or frequent changes in GPU partitioning require rebooting the GPU.
How to Configure?¶
MIG can be configured using various strategies depending on your workload’s needs. Unlike time slicing, Nvidia MIG only supports a set of published profiles. Shown below is an image of a node in one of our clusters with 8 Nvidia H100 GPUs. Notice the label created by the GPU Operator indicating the GPU is MIG capable.
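As a hedged sketch of what this looks like in practice, the GPU Operator's MIG Manager reads a set of named MIG profiles and applies whichever one the node's nvidia.com/mig.config label selects. The custom ConfigMap below (name, namespace, and profile key are illustrative) would split each H100 into seven 1g.10gb instances:

```yaml
# Custom MIG profile consumed by the GPU Operator's MIG Manager.
# The ConfigMap name/namespace and the "all-1g.10gb" key are illustrative;
# a similar profile ships in the operator's default MIG configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |-
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all          # apply to every GPU on the node
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7        # seven 1g.10gb instances per H100
```

Labeling a node with nvidia.com/mig.config=all-1g.10gb then triggers the MIG Manager to repartition that node's GPUs to match the selected profile.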
When is it a good fit?¶
Nvidia's MIG is supported only on Nvidia's Ampere and later architectures, so if you have older generation GPUs, time slicing may be the only practical option. Here are some common scenarios where MIG is especially useful; a minimal example of a pod requesting a MIG slice follows the table.
Scenario | Use Case |
---|---|
High-Throughput Inference (Many Small Jobs) | Inference workloads that are memory-light and compute-light |
Multi-Tenancy with Resource Isolation | When you need to allocate GPU resources to multiple users or tenants in a shared environment |
Efficient GPU Utilization | Workloads are too small to use the full GPU but still need accelerated compute |
Mixed Workloads (Different Resource Requirements) | Mix of workloads that require different amounts of GPU |
Scalable Multi-Model Inference | Serving multiple models at once |
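Each of these scenarios ultimately boils down to a pod requesting a MIG slice instead of a whole GPU. A minimal sketch, assuming MIG devices are exposed under the mixed strategy (the pod name, image tag, and resource profile are illustrative):

```yaml
# Illustrative pod requesting a single MIG slice rather than a full GPU.
# The resource name depends on the MIG strategy and profile configured on the node.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-demo
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
      command: ["nvidia-smi", "-L"]                # prints the MIG device visible to the container
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 1g.10gb instance of an H100
```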
Summary¶
In this blog, we discussed two common approaches to sharing Nvidia GPUs. We specifically learned that Nvidia's MIG is only supported on recent architectures and on high-end GPUs. The table below captures a high-level summary of the two approaches.
Features | Time Slicing | MIG |
---|---|---|
Partition Type | Logical | Physical |
Max Partitions | Unlimited | 7 |
SM QoS | ❌ | ✅ |
Memory QoS | ❌ | ✅ |
Error Isolation | ❌ | ✅ |
Reconfigure | Dynamic | Requires Reboot |
GPU Support | Most GPUs | A100, A30, Blackwell & Hopper Series |
In the next blog, we will dive deeper into Nvidia's MIG and discuss MIG strategies for different use cases.