
Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management, and various strategies to share GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.

Nvidia MIG


Common Use Cases

Let's look at some common use cases and applications for GPU Sharing using Nvidia MIG.

1. GPU Cloud

GPU Cloud providers can use MIG to provide GPU resources to multiple customers simultaneously. Instead of allocating an entire GPU to a single user, they can partition GPUs and offer these slices to users who only require a fraction of a full GPU. This reduces costs for users running smaller-scale workloads and maximizes the provider’s GPU utilization.

2. Inference for Multiple Applications

Inference workloads generally do not require the full power of a GPU. With MIG, multiple models or applications can be run concurrently on the same GPU. For example,

  • A video streaming platform could run multiple real-time tasks (e.g. object detection, facial recognition, and background noise suppression) on different MIG instances.
  • A hedge fund could use MIG to run multiple financial models concurrently (e.g. one model analyzing market sentiment and another predicting stock prices from historical data).

3. Multi-Tenancy for Enterprises

A large research lab can use MIG in a multi-tenant Kubernetes environment, where multiple data scientists are running various machine learning workloads. Each scientist’s pod in the Kubernetes cluster can be assigned a different MIG instance, ensuring fair and isolated access to GPU resources without needing to allocate an entire GPU to each workload.

4. Healthcare

Many hospitals and research institutions run ML models for medical image analysis (e.g. detecting tumors in MRI scans or analyzing CT images for abnormalities). MIG enables running different models for multiple patients on the same GPU simultaneously. Each model can be given a dedicated portion of the GPU, speeding up diagnosis.


Concepts

MIG enables a single physical GPU to be split into multiple independent instances. Each instance operates with its own memory, cache, and compute cores, effectively allowing different workloads to run concurrently on separate GPU partitions. With MIG, supported GPUs such as the NVIDIA A100 can be securely partitioned into up to seven separate GPU instances.

A MIG device is identified by a 3-tuple: the top-level GPU, a GPU Instance created inside that GPU, and a Compute Instance created inside that GPU Instance.

  • A GPU Instance is the first level of partitioning on a MIG-capable GPU. It creates a wrapper around a fixed set of memory slices, a fixed set of compute slices, and a fixed set of "other" GPU engines (such as JPEG encoders, DMA engines, etc.).

  • A Compute Instance is the second level of partitioning on a MIG-capable GPU. It takes some subset of the compute slices inside a GPU Instance and bundles them together. Each Compute Instance has dedicated access to the compute slices it has bundled together, but shared access to its wrapping GPU Instance's memory and "other" GPU engines.

This gives us the MIG device naming convention "[GPU Instance slice count]g.[total memory]gb", for example, 3g.20gb.


Partitioning Strategy

Three strategies are supported with MIG, allowing for flexible use of the GPUs.
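If the GPUs are managed through the NVIDIA GPU Operator, the strategy is typically selected when the operator is installed. Below is a minimal sketch of the relevant Helm values, assuming the GPU Operator chart (field names may vary across versions):

# Hedged sketch of NVIDIA GPU Operator Helm values selecting a MIG strategy
mig:
  strategy: mixed      # one of: none | single | mixed
migManager:
  enabled: true        # allows the operator to apply MIG layouts to nodes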

None

Use this strategy when MIG mode is not enabled on your GPU. This is the default mode and no changes are required to existing pod manifests.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: gpu-container
      image: "nvidia/cuda:11.0-base"
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # Requesting 1 GPU

Info

Notice the last line where the pod is requesting 1 GPU. This is a standard pod spec.


Single

With this strategy, a node exposes a single type of MIG device across all of its GPUs. The logic behind this strategy is to avoid fragmenting jobs across different nodes. Pods need to use a nodeSelector to specify the exact GPU model or MIG device required. In single mode, the GPU is partitioned, and users should specify the MIG profile being used.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: gpu-container
      image: "nvidia/cuda:11.0-base"
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:  # This ensures the pod gets scheduled on nodes with the correct MIG profile
    nvidia.com/gpu.product: "A100-SXM4-40GB"

Let's review what needs to change in our YAML spec from above. The key change is the addition of the nodeSelector section, which ensures that the pod gets scheduled on nodes with the required GPU model (i.e., an NVIDIA A100 with MIG support).
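Under the single strategy, the partitioned devices are still advertised as nvidia.com/gpu. The uniform layout itself is usually requested by labeling the node so that the GPU Operator's MIG manager can apply it. Below is a hedged sketch of the relevant node metadata and status, assuming the operator's default profile names (e.g., all-1g.5gb) on an A100 40GB:

# Hedged sketch of a node configured for the single strategy
metadata:
  labels:
    nvidia.com/mig.config: all-1g.5gb   # layout applied by the MIG manager (assumed default profile name)
status:
  capacity:
    nvidia.com/gpu: "7"                 # seven 1g.5gb MIG devices, still exposed as nvidia.com/gpu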


Mixed

With the mixed strategy, a node can expose a mix of MIG device types (and, optionally, full GPUs). Applications need to request a specific MIG device by specifying both the number of compute instances and the total amount of memory provided by the device type (e.g. 2g.10gb). To request a MIG device from the mixed strategy, users must change their YAML specs to request the corresponding resource type.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: gpu-container
      image: "nvidia/cuda:11.0-base"
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1  # Requesting 1 MIG instance with the 3g.20gb profile
  nodeSelector:  # This ensures the pod gets scheduled on nodes with the correct MIG profile
    nvidia.com/gpu.product: "A100-SXM4-40GB"

Let's review what needs to change in our YAML spec from above. The key changes are as follows:

| Field | Description |
|---|---|
| resources.limits | Changed from nvidia.com/gpu: 1 to nvidia.com/mig-3g.20gb: 1. This requests one MIG device with the 3g.20gb profile. |
| nodeSelector | Added to ensure that the pod gets scheduled on nodes with the required GPU model (i.e., an NVIDIA A100 with MIG support). |
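As with the single strategy, the node must first be partitioned into the desired mix of profiles; with the GPU Operator this is again driven by the nvidia.com/mig.config node label. Below is a hedged sketch, assuming the operator's default all-balanced layout for an A100 40GB:

# Hedged sketch of a node configured for the mixed strategy
metadata:
  labels:
    nvidia.com/mig.config: all-balanced   # assumed default layout: 2x 1g.5gb, 1x 2g.10gb, 1x 3g.20gb
status:
  capacity:
    nvidia.com/mig-1g.5gb: "2"
    nvidia.com/mig-2g.10gb: "1"
    nvidia.com/mig-3g.20gb: "1"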

MIG Profiles

MIG offers predefined profiles that specify how much of the GPU’s resources (e.g., compute units, memory) are allocated to each instance. MIG does not allow GPU instances to be created with an arbitrary number of GPU slices. On a given GPU, you can create multiple GPU instances by mixing and matching the supported profiles. Some profiles allocate more GPU cores or memory, while others create smaller, more numerous instances. MIG profiles follow the naming convention xg.ygb, where

  • x refers to the number of GPU compute slices (fractions of the GPU's SMs) and
  • ygb refers to the memory allocated in gigabytes.

Shown below is a table of available MIG profiles for an Nvidia A100 Tensor Core GPU (40GB).

| MIG Profile | Description | Max Instances per GPU |
|---|---|---|
| 1g.5gb | 1 compute slice with 5GB of memory | 7 |
| 2g.10gb | 2 compute slices with 10GB of memory | 3 |
| 3g.20gb | 3 compute slices with 20GB of memory | 2 |
| 4g.20gb | 4 compute slices with 20GB of memory | 1 |
| 7g.40gb | Full GPU with all compute slices and memory (40GB) | 1 |

Nvidia MIG Profiles
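With the GPU Operator, these layouts are declared in a mig-parted style configuration (a ConfigMap of named MIG layouts), and custom mixes can be added. Below is a minimal sketch of such a configuration, assuming the mig-parted format and an A100 40GB; the name custom-balanced is illustrative:

# Hedged sketch of a mig-parted style MIG configuration
version: v1
mig-configs:
  custom-balanced:          # illustrative name; selected via the nvidia.com/mig.config node label
    - devices: all          # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1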


Considerations

Listed below are considerations that users need to factor in when using MIG. These are valid at the time of writing this blog and may change as Nvidia updates MIG's capabilities.

  1. Available only on specific GPU models like A30, A100 (PCIE/SXM4), H100 (PCIE/SXM5/GH200), and H200 (SXM5)
  2. Designed primarily for computing tasks (i.e. no graphical applications)
  3. Best suited for servers with GPUs of the same model.
  4. Requires the latest version of the GPU drivers.
  5. Reconfiguration requires unloading all computing tasks from memory and stopping applications.
  6. GPU to GPU P2P (i.e. PCIe or NVLink) is not supported
  7. Best suited for Linux

Info

Enabling MIG is a disruptive operation. It completely clears GPU memory and resets the GPU.


Summary

NVIDIA MIG (Multi-Instance GPU) allows partitioning a single GPU into multiple isolated instances, enabling better resource utilization by running different workloads concurrently on a single GPU. This improves efficiency by tailoring GPU resources to the specific needs of diverse applications, such as training, inference, and development. Shown below is a table that maps common MIG device types to types of workload.

| MIG Device Type | Workload Type |
|---|---|
| 7g.40gb | Heavy deep learning training jobs |
| 3g.20gb | Small/medium training jobs |
| 1g.10gb | Inference workloads (low-latency models), development Jupyter notebooks |

In the next blog, we will discuss how organizations can provide their users (data scientists and ML engineers) self-service access to GPU-enabled workspaces from a shared pool of GPU resources.
