
Mohan Atreya

Self-Service Fractional GPU Memory with Rafay GPU PaaS

In Part-1, we explored how Rafay GPU PaaS empowers developers to use fractional GPUs, allowing multiple workloads to share GPU compute efficiently. This enabled better utilization and cost control — without compromising isolation or performance.

In Part-2, we will show how you can enhance this by providing users the means to select fractional GPU memory. While fractional GPUs provide a share of the GPU’s compute cores, different workloads have dramatically different GPU memory needs. With this update, developers can now choose exactly how much GPU memory they want for their pods — bringing fine-grained control, better scheduling, and cost efficiency.
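
As a purely hypothetical sketch (the resource name below is illustrative only, not Rafay's actual API), a pod that asks for a fixed slice of GPU memory could be expressed along these lines:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fractional-memory-demo
spec:
  containers:
  - name: notebook
    image: jupyter/base-notebook
    resources:
      limits:
        example.com/gpu-memory: 8Gi   # hypothetical fractional GPU memory resource
EOF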


Self-Service Fractional GPUs with Rafay GPU PaaS

Enterprises and GPU Cloud providers are rapidly evolving toward a self-service model for developers and data scientists. They want to provide instant access to high-performance compute — especially GPUs — while keeping utilization high and costs under control.

Rafay GPU PaaS enables enterprises and GPU Clouds to achieve exactly that: developers and data scientists can spin up resources such as Developer Pods or Jupyter Notebooks backed by fractional GPUs, directly from an intuitive self-service interface.

This is Part-1 of a multi-part series on end-user, self-service access to fractional-GPU-based AI/ML resources.


NVIDIA NIM Operator: Bringing AI Model Deployment to the Kubernetes Era

In the previous blog, we learned the basics of NIM (NVIDIA Inference Microservices). In this follow-on blog, we will do a deep dive into the NIM Kubernetes Operator, a Kubernetes-native extension that automates the deployment and management of NVIDIA’s NIM containers. By combining the strengths of Kubernetes orchestration with NVIDIA’s optimized inference stack, the NIM Operator makes it dramatically easier to deliver production-grade generative AI at scale.


NVIDIA NIM: Why It Matters—and How It Stacks Up

Generative AI is moving from experiments to production, and the bottleneck is no longer training—it’s serving: getting high-quality model inference running reliably, efficiently, and securely across clouds, data centers, and the edge.

NVIDIA’s answer is NIM (NVIDIA Inference Microservices). NIM is a set of prebuilt, performance-tuned containers that expose industry-standard APIs for popular model families (LLMs, vision, speech) and run anywhere there’s an NVIDIA GPU. Think of NIM as a “batteries-included” model-serving layer that blends TensorRT-LLM optimizations, Triton runtimes, security hardening, and OpenAI-compatible APIs into one deployable unit.
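
Because NIM exposes OpenAI-compatible APIs, a deployed NIM container can be queried with a standard chat-completions request. The port and model name below are assumptions and will vary with the NIM you deploy:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Summarize NIM in one sentence."}],
        "max_tokens": 64
      }'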


Deploy Workload using DRA ResourceClaim in Kubernetes

In the first blog in the DRA series, we introduced the concept of Dynamic Resource Allocation (DRA), which recently went GA in Kubernetes v1.34, released at the end of August 2025.

In the second blog, we installed a Kubernetes v1.34 cluster and deployed an example DRA driver on it with "simulated GPUs". In this blog, we will deploy a few workloads on the DRA-enabled Kubernetes cluster to understand how "ResourceClaim" and "ResourceClaimTemplate" resources work.

Info

We have optimized the steps so users can try this on their laptops in less than 5 minutes. The steps in this blog are written for macOS users.
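
As a preview, a minimal sketch of a ResourceClaimTemplate and a pod that consumes it might look like the following. The gpu.example.com device class comes from the example DRA driver, and the field names follow the resource.k8s.io/v1 API; treat this as an illustration rather than the exact manifests used in the blog:

kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com   # device class provided by the example DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-demo
spec:
  containers:
  - name: ctr
    image: ubuntu:24.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: gpu                            # consume the claim by name
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu    # a ResourceClaim is generated per pod from the template
EOF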

GPU/Neo Cloud Billing using Rafay’s Usage Metering APIs

Cloud providers offering GPU or Neo Cloud services need accurate and automated mechanisms to track resource consumption. Usage data becomes the foundation for billing, showback, or chargeback models that customers expect. The Rafay Platform provides usage metering APIs that can be easily integrated into a provider’s billing system.

In this blog, we’ll walk through how to use these APIs with a sample Python script to generate detailed usage reports.
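
As a rough, purely illustrative sketch (the host, path, header name, and query parameters below are placeholders rather than the actual Rafay API), pulling usage data for a reporting window from a metering endpoint could look like this:

# placeholder endpoint and parameters, not the actual Rafay API
export RAFAY_API_KEY="<your-api-key>"
curl -s -H "X-API-KEY: $RAFAY_API_KEY" \
  "https://<rafay-console-host>/usage-metering/report?startDate=2025-09-01&endDate=2025-09-30" | jq .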


Enable Dynamic Resource Allocation (DRA) in Kubernetes

In the previous blog, we introduced the concept of Dynamic Resource Allocation (DRA), which just went GA in Kubernetes v1.34, released in August 2025.

In this blog post, we will configure DRA on a Kubernetes v1.34 cluster.

Info

We have optimized the steps so users can try this on their macOS or Windows laptops in less than 15 minutes. The steps in this blog are written for macOS users.

NVIDIA Performance Reference Architecture: An Introduction

Artificial intelligence (AI) and high-performance computing (HPC) workloads are evolving at unprecedented speed. Enterprises today require infrastructure that can scale elastically, provide consistent performance, and ensure secure multi-tenant operation. NVIDIA’s Performance Reference Architecture (PRA), built on HGX platforms with Shared NVSwitch GPU Passthrough Virtualization, delivers precisely this capability.

This is the introductory blog in a multi-part series. In this blog, we explain why PRA is critical for modern enterprises and service providers, highlight the benefits of adoption, and outline the key steps required to successfully deploy and support the PRA architecture.

Deep Dive into nvidia-smi: Monitoring Your NVIDIA GPU with Real Examples

Whether you're training deep learning models, running simulations, or just curious about your GPU's performance, nvidia-smi is your go-to command-line tool. Short for NVIDIA System Management Interface, this utility provides essential real-time information about your NVIDIA GPU’s health, workload, and performance.

In this blog, we’ll explore what nvidia-smi is, how to use it, and walk through a real output from a system using an NVIDIA T1000 8GB GPU.


What is nvidia-smi?

nvidia-smi is a CLI utility bundled with the NVIDIA driver. It enables:

  • Real-time GPU monitoring
  • Driver and CUDA version discovery
  • Process visibility and control
  • GPU configuration and performance tuning

You can execute it using:

nvidia-smi
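
Beyond the default summary, a few commonly used invocations (all standard nvidia-smi flags) look like this:

nvidia-smi -l 5                          # refresh the summary every 5 seconds
nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv
nvidia-smi -q -d TEMPERATURE,POWER       # detailed temperature and power report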

Introduction to Dynamic Resource Allocation (DRA) in Kubernetes

In the previous blog, we reviewed the limitations of Kubernetes GPU scheduling. These often result in:

  1. Resource fragmentation – large portions of GPU memory remain idle and unusable.
  2. Topology blindness – multi-GPU workloads may be scheduled suboptimally.
  3. Cost explosion – teams overprovision GPUs to work around scheduling inefficiencies.

In this post, we’ll look at how a new GA feature in Kubernetes v1.34 — Dynamic Resource Allocation (DRA) — aims to solve these problems and transform GPU scheduling in Kubernetes.