
GPU Resource Management in Kubernetes: From Extended Resource to DRA

This blog is part of our DRA series, continuing from our earlier posts: Introduction to DRA, Enabling DRA with Kind, and MIG with DRA. This post focuses on pre-DRA vs. post-DRA GPU management on Rafay upstream Kubernetes clusters.

Overview

With the rise of AI, ML, and HPC workloads, GPU resource management has become a cornerstone of Kubernetes scheduling. Over time, Kubernetes has evolved from static, count-based GPU allocation using extended resources (nvidia.com/gpu) to the more flexible Dynamic Resource Allocation (DRA) framework, now a stable feature in Kubernetes v1.34.

This guide walks through the evolution from pre-DRA GPU management to DRA-based allocation and sharing, complete with examples.

Pre-DRA: GPU Management Using Extended Resources

Before DRA, Kubernetes workloads used the NVIDIA Device Plugin to expose GPUs as extended resources. These resources could then be requested by pods just like CPU or memory.

GPU Operator Components

To enable GPU scheduling, the NVIDIA GPU Operator packaged all required components:

  • Host components:
    • NVIDIA GPU driver
  • Kubernetes components:
    • NVIDIA device plugin
    • MIG Manager
    • DCGM Exporter
    • GPU Feature Discovery (GFD)

Each of these components was deployed as a DaemonSet on GPU nodes, ensuring the scheduler could detect and allocate GPU resources properly.
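
In practice, these components are typically installed together via the GPU Operator Helm chart. A representative install is shown below; the repository and chart names follow NVIDIA's public documentation, but flags and values vary by version, so treat this as a sketch rather than an exact recipe:

# Add NVIDIA's Helm repository and install the GPU Operator into its own namespace
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator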

Requesting GPUs (Pre-DRA)

Here's an example of how users would request GPU access before DRA:

apiVersion: v1
kind: Pod
metadata:
  name: pod-gpu-classic
spec:
  containers:
    - name: app-container
      image: nvidia/cuda
      resources:
        limits:
          nvidia.com/gpu: 2

This tells Kubernetes to assign two GPUs to the container. The scheduler and device plugin work together to:

  • Locate a node with at least two available GPUs
  • Schedule the pod there
  • Inject the GPU devices into the container

If a specific GPU type was needed (e.g., A100-40GB), node labels and selectors could be used to ensure the pod landed on the right hardware.
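
For example, GPU Feature Discovery labels each node with the detected GPU model under nvidia.com/gpu.product, so a nodeSelector can pin the pod to that hardware. A minimal sketch follows; the label value is illustrative and depends on the product name GFD reports on your nodes:

apiVersion: v1
kind: Pod
metadata:
  name: pod-gpu-a100
spec:
  # Label set by GPU Feature Discovery; the value below is illustrative
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
    - name: app-container
      image: nvidia/cuda
      resources:
        limits:
          nvidia.com/gpu: 2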

Post-DRA: Dynamic Resource Allocation

With Kubernetes v1.34, DRA has graduated to General Availability (GA): a new, flexible, and vendor-extensible approach to requesting resources such as GPUs.

Why DRA?

DRA addresses key limitations of the old model:

  • Enables fine-grained GPU sharing and complex constraints when requesting a GPU (see the selector sketch after this list)
  • Allows custom APIs and parameters from vendors
  • Supports better isolation and resource reusability
  • Makes cross-pod sharing possible through claims
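
As a sketch of such a constraint, a request inside a ResourceClaimTemplate (introduced in the next section) can carry CEL selectors over the attributes and capacities a driver publishes. The memory capacity name below is hypothetical and depends on what the driver exposes:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu
  name: large-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
          selectors:
          - cel:
              # Hypothetical capacity name; drivers publish their own attributes and capacities
              expression: device.capacity['gpu.example.com'].memory.compareTo(quantity('40Gi')) >= 0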

Requesting GPUs with DRA

Instead of requesting GPUs via a simple count (nvidia.com/gpu: 2), DRA introduces three main objects:

  1. DeviceClass – defines a category of devices that can be claimed and how to select specific device attributes in claims (a minimal example is shown after this list)
  2. ResourceClaimTemplate – defines how a claim should be created
  3. ResourceClaim – represents an actual allocated resource
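
For reference, a minimal DeviceClass similar to the one installed by the example DRA driver could look like this; treat it as a sketch, since the exact object ships with the driver:

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      # Match every device published by the example driver
      expression: device.driver == 'gpu.example.com'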

Here's the DRA equivalent for requesting two GPUs:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu
  name: multiple-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpu-1
        exactly:
          deviceClassName: gpu.example.com
      - name: gpu-2
        exactly:
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu
  name: pod0
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpus
  resourceClaims:
  - name: gpus
    resourceClaimTemplateName: multiple-gpus

This example shows one pod with one container requesting two GPUs using the DRA mechanism.

kubectl get pods -n gpu

NAME   READY   STATUS    RESTARTS   AGE
pod0   1/1     Running   0          81s

Check the pod's logs to verify that the GPUs were allocated to it:

kubectl logs -f -n gpu pod0
declare -x DRA_RESOURCE_DRIVER_NAME="gpu.example.com"
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_RESOURCE_CLAIM="c2d2a1d2-b52b-4b4c-b5ae-c1cace625493"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_RESOURCE_CLAIM="c2d2a1d2-b52b-4b4c-b5ae-c1cace625493"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Default"
declare -x HOME="/root"
declare -x HOSTNAME="pod0"
declare -x KUBERNETES_NODE_NAME="mks-demo"
declare -x KUBERNETES_PORT="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP_ADDR="10.96.0.1"
declare -x KUBERNETES_PORT_443_TCP_PORT="443"
declare -x KUBERNETES_PORT_443_TCP_PROTO="tcp"
declare -x KUBERNETES_SERVICE_HOST="10.96.0.1"
declare -x KUBERNETES_SERVICE_PORT="443"
declare -x KUBERNETES_SERVICE_PORT_HTTPS="443"
declare -x OLDPWD
declare -x PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
declare -x PWD="/"
declare -x SHLVL="1"
  • Two device environment variables are present, GPU_DEVICE_3="gpu-3" and GPU_DEVICE_4="gpu-4", confirming that two GPUs were allocated to the container.
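
You can also inspect the ResourceClaim objects that were generated from the template and check their allocation status with standard kubectl commands (output omitted here):

kubectl get resourceclaims -n gpu
kubectl describe resourceclaims -n gpu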

Controlled GPU Sharing with DRA

One of DRA's most powerful capabilities is controlled GPU sharing, which allows multiple containers or pods to access the same GPU safely.

1. Intra-Pod GPU Sharing (Multiple Containers in One Pod)

Multiple containers within the same pod can reference a single ResourceClaim, giving them shared access to the same GPU.

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu0
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu0
  name: pod0
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu

This example shows one pod with two containers sharing a single GPU through time-slicing, using the Dynamic Resource Allocation (DRA) mechanism.

kubectl get pods -n gpu0

NAME   READY   STATUS    RESTARTS   AGE
pod0   2/2     Running   0          9s

kubectl logs -f -n gpu0 pod0 -c ctr0
declare -x DRA_RESOURCE_DRIVER_NAME="gpu.example.com"
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_RESOURCE_CLAIM="99944ed0-806f-496f-bdb5-e457b6a66a2d"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"
declare -x HOME="/root"
declare -x HOSTNAME="pod0"
declare -x KUBERNETES_NODE_NAME="mks-demo"
declare -x KUBERNETES_PORT="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP_ADDR="10.96.0.1"
declare -x KUBERNETES_PORT_443_TCP_PORT="443"
declare -x KUBERNETES_PORT_443_TCP_PROTO="tcp"
declare -x KUBERNETES_SERVICE_HOST="10.96.0.1"
declare -x KUBERNETES_SERVICE_PORT="443"
declare -x KUBERNETES_SERVICE_PORT_HTTPS="443"
declare -x OLDPWD
declare -x PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
declare -x PWD="/"
declare -x SHLVL="1"

kubectl logs -f -n gpu0 pod0 -c ctr1
declare -x DRA_RESOURCE_DRIVER_NAME="gpu.example.com"
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_RESOURCE_CLAIM="99944ed0-806f-496f-bdb5-e457b6a66a2d"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"
declare -x HOME="/root"
declare -x HOSTNAME="pod0"
declare -x KUBERNETES_NODE_NAME="mks-demo"
declare -x KUBERNETES_PORT="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP_ADDR="10.96.0.1"
declare -x KUBERNETES_PORT_443_TCP_PORT="443"
declare -x KUBERNETES_PORT_443_TCP_PROTO="tcp"
declare -x KUBERNETES_SERVICE_HOST="10.96.0.1"
declare -x KUBERNETES_SERVICE_PORT="443"
declare -x KUBERNETES_SERVICE_PORT_HTTPS="443"
declare -x OLDPWD
declare -x PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
declare -x PWD="/"
declare -x SHLVL="1"
  • GPU_DEVICE_3="gpu-3" is present in both ctr0 and ctr1, confirming that the two containers share the same GPU.

2. Inter-Pod GPU Sharing (Global Claim Across Pods)

You can create a global ResourceClaim and reference it across multiple pods — ideal for workloads that need coordinated access (like shared inference).

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  namespace: gpu1
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu1
  name: pod0
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu1
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu

This example shows two pods, each with one container, sharing a single GPU through time-slicing using the Dynamic Resource Allocation mechanism.
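
To confirm that both pods were allocated the same device, compare the GPU_DEVICE_* environment variables exported in each container (output omitted here); the same device name and claim UID should appear in both:

kubectl logs -n gpu1 pod0 | grep GPU_DEVICE
kubectl logs -n gpu1 pod1 | grep GPU_DEVICE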

Conclusion

The move from extended resources to Dynamic Resource Allocation represents a major leap in how Kubernetes manages GPUs and other accelerators.

DRA brings flexibility, fine-grained control, and vendor extensibility — making it the future of GPU scheduling in Kubernetes. Whether you're enabling fractional GPU usage, managing shared inference workloads, or defining custom device policies, DRA unlocks new capabilities that were never possible before.


NVIDIA NIM Operator: Bringing AI Model Deployment to the Kubernetes Era

In the previous blog, we learnt the basics about NIM (NVIDIA Inference Microservices). In this follow-on blog, we will do a deep dive into the NIM Kubernetes Operator, a Kubernetes-native extension that automates the deployment and management of NVIDIA’s NIM containers. By combining the strengths of Kubernetes orchestration with NVIDIA’s optimized inference stack, the NIM Operator makes it dramatically easier to deliver production-grade generative AI at scale.

NVIDIA NIM: Why It Matters—and How It Stacks Up

Generative AI is moving from experiments to production, and the bottleneck is no longer training—it’s serving: getting high-quality model inference running reliably, efficiently, and securely across clouds, data centers, and the edge.

NVIDIA’s answer is NIM (NVIDIA Inference Microservices): a set of prebuilt, performance-tuned containers that expose industry-standard APIs for popular model families (LLMs, vision, speech) and run anywhere there’s an NVIDIA GPU. Think of NIM as a “batteries-included” model-serving layer that blends TensorRT-LLM optimizations, Triton runtimes, security hardening, and OpenAI-compatible APIs into one deployable unit.

Dynamic Resource Allocation for GPU Allocation on Rafay's MKS (Kubernetes 1.34)

This blog demonstrates how to leverage Dynamic Resource Allocation (DRA) for efficient GPU allocation using Multi-Instance GPU (MIG) strategy on Rafay's Managed Kubernetes Service (MKS) running Kubernetes 1.34.

In our previous blog series, we covered various aspects of Dynamic Resource Allocation (DRA) in Kubernetes.

DRA is GA in Kubernetes 1.34

With Kubernetes 1.34, Dynamic Resource Allocation (DRA) is Generally Available (GA) and enabled by default on MKS clusters. This means you can immediately start using DRA features without additional configuration.

Prerequisites

Before we begin, ensure you have:

  • A Rafay MKS cluster running Kubernetes 1.34 (see MKS v1.34 Blog)
  • GPU nodes with compatible NVIDIA GPUs (A100, H100, or similar MIG-capable GPUs)
  • Container Device Interface (CDI) enabled (automatically enabled in MKS for Kubernetes 1.34)
  • Basic understanding of Dynamic Resource Allocation concepts (covered in our previous blog series)
  • Active Rafay account with appropriate permissions to manage MKS clusters and addons

Kubernetes v1.34 for Rafay MKS

As part of our continuous effort to bring the latest Kubernetes versions to our users, support for Kubernetes v1.34 will be added soon to the Rafay Operations Platform for MKS cluster types.

Both new cluster provisioning and in-place upgrades of existing clusters are supported. As with most Kubernetes releases, this version also deprecates and removes a number of features. To ensure there is zero impact to our customers, we have made sure that every feature in the Rafay Kubernetes Operations Platform has been validated on this Kubernetes version. This will be promoted from Preview to Production in a few days and will be made available to all customers.

Deploy Workload using DRA ResourceClaim in Kubernetes

In the first blog in the DRA series, we introduced the concept of Dynamic Resource Allocation (DRA), which recently went GA in Kubernetes v1.34, released at the end of August 2025.

In the second blog, we installed a Kubernetes v1.34 cluster and deployed an example DRA driver on it with "simulated GPUs". In this blog, we will deploy a few workloads on the DRA-enabled Kubernetes cluster to understand how "ResourceClaim" and "ResourceClaimTemplate" objects work.

Info

We have streamlined the steps so that users can experience this on their laptops in less than 5 minutes. The steps in this blog are written for macOS users.

GPU/Neo Cloud Billing using Rafay’s Usage Metering APIs

Cloud providers offering GPU or Neo Cloud services need accurate and automated mechanisms to track resource consumption. Usage data becomes the foundation for billing, showback, or chargeback models that customers expect. The Rafay Platform provides usage metering APIs that can be easily integrated into a provider’s billing system.

In this blog, we’ll walk through how to use these APIs with a sample Python script to generate detailed usage reports.

Upstream Kubernetes on RHEL 10 using Rafay

Our upcoming release update will add support for a number of new features and enhancements. This blog is focused on the upcoming support for Upstream Kubernetes on nodes based on Red Hat Enterprise Linux (RHEL) v10.0. Both new cluster provisioning and in-place upgrades of Kubernetes clusters will be supported for lifecycle management.

Support for Parallel Execution with Rafay's Integrated GitOps Pipeline

At Rafay, we are continuously evolving our platform to deliver powerful capabilities that streamline and accelerate the software delivery lifecycle. One such enhancement is the recent update to our GitOps pipeline engine, designed to optimize execution time and flexibility — enabling a better experience for platform teams and developers alike.

Integrated Pipeline for Diverse Use Cases

Rafay provides a tightly integrated pipeline framework that supports a range of common operational use cases, including:

  • System Synchronization: Use Git as the single source of truth to orchestrate controller configurations
  • Application Deployment: Define and automate your app deployment process directly from version-controlled pipelines
  • Approval Workflows: Insert optional approval gates to control when and how specific pipeline stages are triggered, offering an added layer of governance and compliance

This comprehensive design empowers platform teams to standardize delivery patterns while still accommodating organization-specific controls and policies.

From Sequential to Parallel Execution with DAG Support

Historically, Rafay’s GitOps pipeline executed all stages sequentially, regardless of interdependencies. While effective for simpler workflows, this model imposed time constraints for more complex operations.

With our latest update, the pipeline engine now supports Directed Acyclic Graphs (DAGs) — allowing stages to execute in parallel, wherever dependencies allow.

Important Update: Changes to Bitnami Public Catalog

Recently, Bitnami announced significant changes to its container image distribution. As part of this update, the Bitnami public catalog (docker.io/bitnami) will be permanently deleted on September 29th.

What’s Changing

  • All existing container images (including older or versioned tags such as 2.50.0, 10.6, etc.) will be moved from the public catalog (docker.io/bitnami) to a Bitnami Legacy repository (docker.io/bitnamilegacy).
  • The legacy catalog will no longer receive updates or support. It is intended only as a temporary migration solution to give users time to transition.