
GPU Sharing Strategies in Kubernetes

In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.

Pod manifests must request GPU resources in integers, which results in an entire physical GPU being allocated to a single container even if the container only requires a fraction of its resources. In this blog, we describe two commonly used strategies for sharing a GPU on Kubernetes.


Background

When using GPUs in a Kubernetes cluster, sharing them among multiple workloads or containers presents several challenges. GPUs are powerful computational resources often used for tasks like machine learning, deep learning, and high-performance computing. Unlike CPUs, GPUs are typically designed for exclusive access, so fine-grained resource sharing across multiple processes is not possible in the way it is with CPUs. Nvidia's GPU Operator supports several approaches to oversubscribing GPUs; this blog focuses on the two most commonly used: time slicing and MIG.

1. Time Sharing/Slicing

This approach allows multiple workloads to share a GPU by alternating execution time (i.e. workloads are interleaved on the GPU). It is configured through a set of extended options for the NVIDIA Kubernetes Device Plugin. The administrator defines a number of "replicas" for a GPU, each of which can be handed out independently to a pod to run workloads on. Each process then gets a turn at using the GPU in a round-robin fashion. In the image below, containers "A", "B" and "C" get round-robin access to the GPU.

Time Slicing

How to Configure?

The Kubernetes device plugin is the interface used to apply the configuration changes on the nodes containing GPUs. The device plugin is responsible for advertising the availability of GPU resources to the Kubernetes API server. In the example below, a Tesla T4 GPU is configured to be shared by up to 3 pods.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 3
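
For the device plugin to pick this up, the ConfigMap has to be referenced from the GPU Operator's ClusterPolicy. The commands below are a minimal sketch, assuming the operator runs in the gpu-operator namespace with the default ClusterPolicy name (cluster-policy) and that the ConfigMap above was saved as time-slicing-config.yaml; adjust the names for your cluster.

# Create the ConfigMap (file name is illustrative)
kubectl create -n gpu-operator -f time-slicing-config.yaml

# Point the device plugin at the sharing configuration
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'

# Once the device plugin pods restart, the node should advertise 3 schedulable GPU replicas
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"

The default field selects which entry in the ConfigMap applies when a node is not assigned a specific configuration.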

When is it a good fit?

This approach is well suited for workloads that are bursty and interactive, with significant idle periods. Allocating a fully dedicated GPU to these types of workloads can be cost prohibitive. Of course, these workloads need to be tolerant of lower performance and higher latency.

Info

There is no memory or fault isolation between replicas. Time-slicing simply multiplexes workloads from the replicas onto the same underlying GPU.
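
From a workload's point of view, nothing changes: pods still request nvidia.com/gpu in whole units, but each unit now maps to one time-sliced replica rather than an entire physical GPU. Below is a minimal sketch of such a pod (the sample image is illustrative).

apiVersion: v1
kind: Pod
metadata:
  name: time-sliced-workload
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # one of the 3 time-sliced replicas, not a full GPU

With 3 replicas configured, up to three such pods can be scheduled onto the same physical Tesla T4.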


2. Multi Instance GPU (MIG)

With Nvidia's MIG, a single supported GPU can be spatially partitioned into as many as seven instances (i.e. slices). Each instance can then be allocated to one container on the node, i.e. a maximum of seven containers can use the GPU at once. Unlike time slicing, spatial partitioning divides the GPU into isolated, statically sized instances, providing hardware isolation and consistent QoS.

Nvidia MIG

Info

MIG profiles are static once they are configured, so dynamic resource allocation or frequent changes in GPU partitioning require resetting the GPU (after draining any workloads running on it).

How to Configure?

MIG can be configured with different strategies depending on your workloads' needs. Unlike time slicing, where the number of replicas is arbitrary, MIG partitions must use one of Nvidia's published profiles. Shown below is an image of a node in one of our clusters with 8 Nvidia H100 GPUs. Notice the label created by the GPU Operator indicating the GPU is MIG capable.

MIG Node
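
As a rough sketch of the workflow (the node name and profile below are illustrative and depend on the GPU model and MIG strategy in use): the GPU Operator's MIG Manager watches the nvidia.com/mig.config node label and applies the requested partitioning, and with the mixed strategy each profile is then advertised as its own extended resource.

# Partition every GPU on the node into 1g.10gb instances
# (profile names vary by GPU; 1g.10gb is an H100 80GB profile, 1g.5gb its A100 40GB counterpart)
kubectl label node <mig-node-name> nvidia.com/mig.config=all-1g.10gb --overwrite

# With the "mixed" MIG strategy, each profile shows up as its own resource,
# which pods request directly, e.g. nvidia.com/mig-1g.10gb: 1
kubectl describe node <mig-node-name> | grep "nvidia.com/mig-"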

When is it a good fit?

Nvidia's MIG is supported only on Nvidia's Ampere and later architectures. So, if you have older generation GPUs, time slicing may be the only practical option. Here are some common scenarios where MIG is particularly useful.

  • High-Throughput Inference (Many Small Jobs): inference workloads that are memory- and compute-light
  • Multi-Tenancy with Resource Isolation: allocating GPU resources to multiple users or tenants in a shared environment
  • Efficient GPU Utilization: workloads that are too small to use the full GPU but still need accelerated compute
  • Mixed Workloads (Different Resource Requirements): a mix of workloads that require different amounts of GPU
  • Scalable Multi-Model Inference: serving multiple models at once

Summary

In this blog, we discussed two common approaches for sharing Nvidia GPUs. We also noted that Nvidia's MIG is supported only on recent architectures and higher-end GPUs. The table below captures a high-level summary of the two approaches.

Feature             Time Slicing      MIG
Partition Type      Logical           Physical
Max Partitions      Unlimited         7
SM QoS              No                Yes
Memory QoS          No                Yes
Error Isolation     No                Yes
Reconfiguration     Dynamic           Requires GPU reset
GPU Support         Most GPUs         A100, A30, Hopper & Blackwell Series

In the next blog, we will dive deeper into Nvidia's MIG and discuss MIG strategies for different use cases.

  • Free Org


    Sign up for a free Org if you want to try this yourself with our Get Started guides.

    Free Org

  • 📆 Live Demo


    Schedule time with us to watch a demo in action.

    Schedule Demo

  • Rafay's AI/ML Products


    Learn about Rafay's offerings in AI/ML Infrastructure and Tooling

    Learn More

  • Upcoming Events


    Meet us in person at the Rafay booth at one of the upcoming events

    Event Calendar