
GPU

Accelerating the AI Factory: Rafay & NVIDIA NCX Infra Controller (NICo)

Acquiring GPU hardware is the easy part. Turning it into a productive, multi-tenant AI service with proper isolation, self-service provisioning, and the governance to operate it at scale is where most providers get stuck. Custom integration work piles up, timelines slip, and the gap between racked hardware and revenue widens.

Rafay is closing that gap through a new integration with the NVIDIA NCX Infrastructure Controller (NICo), NVIDIA's open-source component for automated bare-metal lifecycle management. Together, Rafay and NICo give operators a unified platform to manage their GPU fleet and deliver cloud-like, self-service experiences to end users.

How Rafay and NVIDIA Help Neoclouds Monetize Accelerated Computing with Token Factories

The AI boom has created an unprecedented demand for GPUs. In response, a new generation of GPU-first cloud providers purpose-built for AI workloads—known as neoclouds—has emerged to deliver the infrastructure needed to power AI applications.

However, a critical shift is happening in the market. Selling raw GPU infrastructure is no longer enough. The real opportunity lies in turning GPU capacity into AI services. Developers and enterprises don't want GPUs. They want models, APIs, and intelligence on demand.

With Rafay's Token Factory offering, neoclouds can transform GPU clusters into a self-service AI platform that exposes models through token-metered APIs. The result is a marketplace where neoclouds monetize infrastructure, model developers reach users, and developers build applications, all on the same platform.

This is where Rafay and NVIDIA have come together to unlock a powerful new business model for AI infrastructure providers.

[Image: End User Portal Token Factory]

NVIDIA AICR Generates It. Rafay Runs It. Your GPU Clusters, Finally Under Control

Deploying GPU-accelerated Kubernetes infrastructure for AI workloads has never been simple. Administrators face a relentless compatibility matrix: matching GPU driver versions to CUDA releases, pinning Kubernetes versions to container runtimes, tuning configurations differently for NVIDIA H100s versus A100s, and doing it all differently again for training versus inference.

One wrong version combination and workloads fail silently, or worse, perform far below hardware capability. For years, the answer was static documentation, tribal knowledge, and hoping that whoever wrote the runbook last week remembered to update it.

NVIDIA's AI Cluster Runtime (AICR) and the Rafay Platform represent a new approach — one where GPU infrastructure configuration is treated as code, generated deterministically, validated against real hardware, and enforced continuously across fleets of clusters.

Together, they cover the full lifecycle from first aicr snapshot to production-grade day-2 operations, with cluster blueprints as the critical bridge between the two.

[Image: Baton Pass]

From Slurm to Kubernetes: A Guide for HPC Users

If you've spent years submitting batch jobs with Slurm, moving to a Kubernetes-based cluster can feel like learning a new language. The concepts are familiar — resource requests, job queues, priorities — but the vocabulary and tooling are different. This guide bridges that gap, helping HPC veterans understand how Kubernetes handles workloads and what that means day-to-day.
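To make the mapping concrete, here is a minimal Slurm batch script alongside a rough Kubernetes Job equivalent. The job name, container image tag, and resource sizes below are illustrative, not prescriptive:

```yaml
# Slurm batch script:
#   #!/bin/bash
#   #SBATCH --job-name=train
#   #SBATCH --gres=gpu:1
#   #SBATCH --cpus-per-task=4
#   #SBATCH --mem=16G
#   python train.py
#
# Rough Kubernetes equivalent:
apiVersion: batch/v1
kind: Job
metadata:
  name: train
spec:
  template:
    spec:
      restartPolicy: Never          # like a batch job, run once and stop
      containers:
      - name: train
        image: nvcr.io/nvidia/pytorch:24.05-py3   # illustrative image tag
        command: ["python", "train.py"]
        resources:
          requests:
            cpu: "4"                # ~ --cpus-per-task=4
            memory: 16Gi            # ~ --mem=16G
          limits:
            nvidia.com/gpu: 1       # ~ --gres=gpu:1
```

The rough correspondences: `--gres=gpu:1` becomes a `nvidia.com/gpu` limit, CPU and memory flags become resource requests, and the scheduler queue becomes a namespace plus priority class.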

[Image: Slurm to k8s]

Run nvidia-smi on Remote GPU Kubernetes Clusters Using Rafay Zero Trust Access

Infra operators managing GPU-enabled Kubernetes clusters often need a fast and secure way to validate GPU visibility, driver health, and runtime readiness without exposing the cluster directly or relying on bastion hosts, VPNs, or manually managed kubeconfigs.

With Rafay's zero trust kubectl, operators can securely access remote Kubernetes resources and execute commands inside running pods from the Rafay platform. A simple but powerful example is running nvidia-smi inside a GPU Operator pod to confirm that the NVIDIA driver stack, CUDA runtime, and GPU devices are functioning correctly on a remote cluster.

In this post, we walk through how infra operators can use Rafay's zero trust access workflow to run nvidia-smi on a remote GPU-based Kubernetes cluster.
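The check itself boils down to a single exec into a driver pod. A sketch assuming a standard GPU Operator install; the namespace and label selector may differ in your environment:

```
# Find a driver daemonset pod (namespace/label assume a default GPU Operator install)
kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset

# Exec into it and confirm driver, CUDA runtime, and GPU visibility
kubectl -n gpu-operator exec -it <driver-pod-name> -- nvidia-smi
```

With Rafay's zero trust kubectl, the same commands run through the platform's audited access path rather than a direct network route to the cluster.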

[Image: nvidia-smi over ZTKA]

Deep Dive into nvidia-smi: Monitoring Your NVIDIA GPU with Real Examples

Whether you're training deep learning models, running simulations, or just curious about your GPU's performance, nvidia-smi is your go-to command-line tool. Short for NVIDIA System Management Interface, this utility provides essential real-time information about your NVIDIA GPU’s health, workload, and performance.

In this blog, we’ll explore what nvidia-smi is, how to use it, and walk through a real output from a system using an NVIDIA T1000 8GB GPU.


What is nvidia-smi?

nvidia-smi is a CLI utility bundled with the NVIDIA driver. It enables:

  • Real-time GPU monitoring
  • Driver and CUDA version discovery
  • Process visibility and control
  • GPU configuration and performance tuning

You can execute it using:

nvidia-smi
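Because nvidia-smi also supports machine-readable output via its `--query-gpu` and `--format=csv` flags, it is easy to script. Below is a minimal parser sketch; the sample output is canned (matching the T1000 system above) so the parsing logic can run even without a GPU attached:

```python
import csv
import io

def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    headers = [h.strip() for h in rows[0]]
    return [dict(zip(headers, (v.strip() for v in row))) for row in rows[1:]]

# Canned sample of: nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
sample = """name, memory.total [MiB], memory.used [MiB]
NVIDIA T1000 8GB, 8192 MiB, 412 MiB"""

for gpu in parse_gpu_query(sample):
    print(gpu["name"], gpu["memory.used [MiB]"])
```

On a live system you would feed the function the output of `subprocess.run(["nvidia-smi", "--query-gpu=...", "--format=csv"], capture_output=True, text=True).stdout` instead of the canned sample.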

Custom GPU Resource Classes in Kubernetes

In the modern era of containerized machine learning and AI infrastructure, GPUs are a critical and expensive asset. Kubernetes makes scheduling and isolation easier—but managing GPU utilization efficiently requires more than just assigning something like

nvidia.com/gpu: 1

In this blog post, we will explore what custom GPU resource classes are, why they matter, and when to use them for maximum impact. They are a powerful technique for fine-grained GPU management in multi-tenant, cost-sensitive, and performance-critical environments.
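One concrete way to surface a custom resource name is the GPU Operator device plugin's time-slicing configuration, which can rename and oversubscribe the advertised resource. A sketch, with the namespace, replica count, and config key chosen for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true   # advertise nvidia.com/gpu.shared instead of nvidia.com/gpu
        resources:
        - name: nvidia.com/gpu
          replicas: 4           # each physical GPU appears as 4 schedulable units
```

With this in place, cost-sensitive workloads can request `nvidia.com/gpu.shared: 1` while latency-critical workloads keep requesting whole GPUs, and the two classes never collide on the same resource name.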

Info

If you are new to GPU sharing approaches, we recommend reading the following introductory blogs: Demystifying Fractional GPUs in Kubernetes and Choosing the Right Fractional GPU Strategy.

Fractional GPUs using Nvidia's KAI Scheduler

At KubeCon Europe in April 2025, NVIDIA announced and launched the Kubernetes AI (KAI) Scheduler, an open-source project maintained by NVIDIA.

The KAI Scheduler is an advanced Kubernetes scheduler that allows administrators of Kubernetes clusters to dynamically allocate GPU resources to workloads. Users of the Rafay Platform can immediately leverage the KAI scheduler via the integrated Catalog.

[Image: KAI Scheduler in the Rafay Catalog]

To help you understand the basics quickly, we have also created a brief video introducing the concepts and a live demonstration showcasing how you can allocate fractional GPU resources to workloads.
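Under the hood, workloads opt in by naming the scheduler and annotating the GPU fraction they need. A sketch based on the KAI quickstart (the annotation and scheduler names should be verified against the project's README for your release, and the image is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: half-gpu-inference
  annotations:
    gpu-fraction: "0.5"        # ask KAI for half of one GPU's memory
spec:
  schedulerName: kai-scheduler # hand the pod to KAI instead of the default scheduler
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.05-py3  # illustrative image
```

Two such pods can then be bin-packed onto a single physical GPU, which the default Kubernetes scheduler cannot do with the integer-only `nvidia.com/gpu` resource.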

How to Select the Right GPU for Open Source LLMs?

Deploying and operating an open-source Large Language Model (LLM) requires careful planning when selecting the right GPU model and memory capacity. Choosing the optimal configuration is crucial for performance, cost efficiency, and scalability. However, this process comes with several challenges.

In this blog, we will describe the factors you need to consider to select the optimal GPU model for your LLM. We have also published a table capturing optimal GPU models for deploying the top 10 open-source LLMs.
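A useful first-order check before consulting any table is whether the model's weights fit in GPU memory at all: weight size is roughly parameter count times bytes per parameter, plus headroom for the KV cache and activations. A sketch, where the 20% overhead multiplier is a simplifying assumption rather than a measured figure:

```python
def min_vram_gib(params_billion, bytes_per_param=2, overhead=1.2):
    """First-order VRAM estimate for serving an LLM.

    params_billion : model size in billions of parameters
    bytes_per_param: 2 for FP16/BF16, 1 for INT8, 0.5 for 4-bit quantization
    overhead       : rough multiplier for KV cache, activations, runtime buffers
    """
    total_bytes = params_billion * 1e9 * bytes_per_param * overhead
    return total_bytes / 2**30

# A 70B model in FP16 exceeds a single 80 GiB GPU, so it must be
# sharded or quantized; at 4-bit it shrinks dramatically.
print(round(min_vram_gib(70), 1))
print(round(min_vram_gib(70, bytes_per_param=0.5), 1))
```

The estimate only bounds the weights and a rough buffer; real capacity planning must also account for batch size, context length, and tensor-parallel sharding.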

[Image: How many GPUs]

Spatial Partitioning of GPUs using Nvidia MIG

In prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator streamlines their management, and various strategies for sharing GPUs on Kubernetes. In 2020, NVIDIA introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.
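On a MIG-capable GPU, partitioning is driven through nvidia-smi itself. A sketch of the typical flow; profile IDs vary by GPU model, so always confirm with the listing command first:

```
# Enable MIG mode on GPU 0 (requires a MIG-capable GPU such as an A100;
# may require draining workloads and a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports, with their IDs
nvidia-smi mig -lgip

# Create two GPU instances plus their compute instances in one step
# (profile ID 19 is assumed here to be 1g.5gb; confirm via -lgip on your GPU)
sudo nvidia-smi mig -cgi 19,19 -C
```

Once the GPU Operator picks up the partitions, Kubernetes workloads can request a slice by name, for example `nvidia.com/mig-1g.5gb: 1`, instead of a whole `nvidia.com/gpu`.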

[Image: Nvidia MIG]