Product Blog

Automated GPU Health Monitoring with NVIDIA NVSentinel on the Rafay Platform

GPU clusters are expensive and GPU failures are costly. In modern AI infrastructure, organizations operate large fleets of NVIDIA GPUs that can cost tens of thousands of dollars each. When a GPU develops a hardware fault (e.g. a double-bit ECC error, a thermal throttle, or a silent data corruption event), the consequences ripple outward: training jobs fail hours into a run, inference latency spikes, and expensive hardware sits idle while engineers scramble to diagnose the root cause.

Traditional monitoring catches these problems eventually, but rarely fixes them. Diagnosing and remediating GPU faults still requires deep expertise, and remediation timelines are measured in hours or days. For organizations running AI workloads at scale — and especially for GPU cloud providers who must deliver uptime SLAs to their tenants — this gap between detection and resolution translates directly into SLA breaches, lost revenue, and eroded customer trust.

NVIDIA's answer to this challenge is NVSentinel — an open-source, Kubernetes-native system that continuously monitors GPU health and automatically remediates issues before they disrupt workloads.

In this blog, we describe how Rafay integrates with NVSentinel, enabling GPU cloud operators and enterprises to deploy intelligent GPU fault detection and self-healing across their entire fleet — consistently, repeatably, and at scale.
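
To make "automatic remediation" concrete, the sketch below shows the kind of action such a system automates: cordoning a Kubernetes node whose GPU has been flagged as unhealthy, so that new workloads stop landing on it. It uses the official Kubernetes Python client; the `gpu-health=unhealthy` label is a hypothetical stand-in for a real health signal, and the snippet illustrates the pattern rather than NVSentinel's actual implementation.

```python
# Minimal sketch of an automated remediation step: cordon any node whose
# (hypothetical) gpu-health label reports a fault, so the scheduler stops
# placing new pods on it. Illustrative only; not NVSentinel's code.
from kubernetes import client, config

def cordon_unhealthy_gpu_nodes():
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node(label_selector="gpu-health=unhealthy").items:
        if node.spec.unschedulable:
            continue  # already cordoned
        # Mark the node unschedulable (equivalent to `kubectl cordon <node>`).
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"Cordoned {node.metadata.name} pending GPU diagnostics")

if __name__ == "__main__":
    cordon_unhealthy_gpu_nodes()
```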

Rafay and NVSentinel

NVIDIA Dynamo: Turning Disaggregated Inference Into a Production System

In Part 1, we covered the core idea behind disaggregated inference. That architectural split is no longer just a research pattern. It turns inference from a simple “deploy a container on GPUs” exercise into a distributed systems problem.

Once prefill and decode are separated, the platform has to coordinate routing, GPU-to-GPU KV cache transfer, placement, autoscaling, service discovery, and fault handling across multiple worker pools. NVIDIA Dynamo provides the distributed inference framework for this, and Kubernetes provides the control plane foundation to operate it at scale. 
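
Before diving into Dynamo itself, a toy sketch helps fix the mental model. The Python below is not Dynamo's API; it simply shows why the split creates a coordination problem: prefill runs once over the full prompt and produces the KV cache, decode then generates tokens one at a time against that cache, and once the two phases run in separate worker pools, that cache has to be moved between GPUs.

```python
# Toy illustration of the prefill/decode split (not NVIDIA Dynamo's API).
from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: list[str]          # stand-in for per-layer key/value tensors

def prefill_worker(prompt: str) -> KVCache:
    # Compute-bound: processes the whole prompt in one pass.
    return KVCache(tokens=prompt.split())

def decode_worker(cache: KVCache, max_new_tokens: int) -> list[str]:
    # Memory-bandwidth-bound: one token per step, reusing the cache.
    generated = []
    for step in range(max_new_tokens):
        generated.append(f"tok{step}")   # placeholder for real sampling
        cache.tokens.append(generated[-1])
    return generated

# In a disaggregated deployment these two calls run on different GPU pools,
# and `cache` is what gets shipped GPU-to-GPU between them.
cache = prefill_worker("Explain disaggregated inference in one sentence")
print(decode_worker(cache, max_new_tokens=5))
```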

In this blog post, we will review NVIDIA's Dynamo project with a focus on what it does and when it makes sense to use it.

NVIDIA Dynamo Logo

Introduction to Disaggregated Inference: Why It Matters

The explosive growth of generative AI has placed unprecedented demands on GPU infrastructure. Enterprises and GPU cloud providers are deploying large language models at scale, but the underlying inference serving architecture often can't keep up.

In this first blog post on disaggregated inference, we will discuss how it differs from traditional serving, why it matters for platform teams managing GPU infrastructure, and how the ecosystem—from NVIDIA Dynamo to open-source frameworks—is making it production-ready.

Disaggregated Inference

Fine Tuning as a Service using Rafay and Unsloth Studio

Fine-tuning large language models used to be an exercise reserved for teams with deep MLOps expertise and bespoke infrastructure. With Unsloth Studio — an open-source web UI for training and running LLMs — the barrier to entry has dropped considerably.

But packaging Unsloth Studio into a repeatable, self-service experience that neo clouds and enterprises can offer their end users? That still requires thoughtful orchestration.

In this post, we walk through how to deliver Unsloth Studio as a one-click, app-store-style experience using Rafay's App Marketplace. By the end, you'll understand how to create an Unsloth Studio App SKU, configure it for end users, test it, and share it across customer organizations — all without requiring your users to know anything about Kubernetes, Docker, or GPU scheduling.

Unsloth Studio in Rafay

Running GPU Infrastructure on Kubernetes: What Enterprise Platform Teams Must Get Right

KubeCon + CloudNativeCon Europe 2026, Amsterdam


If you are at KubeCon this week in Amsterdam, you are likely hearing the same question repeatedly: how do we actually operate GPU infrastructure on Kubernetes at enterprise scale? The announcements from NVIDIA (the DRA Driver donation, the KAI Scheduler entering the CNCF Sandbox, and GPU support for Kata Containers) expand what is technically possible. But for enterprise platform teams, the harder problem is not capability. It is operating GPU infrastructure efficiently and responsibly once demand arrives.

This post is written for platform teams building internal GPU platforms — on-premises, in sovereign environments, or in hybrid models. You are not just provisioning infrastructure. You are governing access to some of the most expensive and constrained resources in the organization.

At scale, GPU inefficiency is not accidental. It is structural:

  • Idle GPUs that remain allocated but unused
  • Over-provisioned workloads consuming more than needed
  • Fragmented capacity that cannot satisfy real workloads
  • Lack of cost visibility and accountability

Solving this requires more than infrastructure. It requires a governed platform model.
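
As a small illustration of the first item in that list, idle-but-allocated GPUs can be surfaced by querying utilization metrics. The sketch below assumes a Prometheus server scraping NVIDIA's dcgm-exporter (which exposes the `DCGM_FI_DEV_GPU_UTIL` metric); the Prometheus URL and the 5% threshold are placeholder assumptions to adapt to your environment.

```python
# Minimal sketch: flag GPUs whose average utilization over the past day is
# near zero, using the Prometheus HTTP API and dcgm-exporter's
# DCGM_FI_DEV_GPU_UTIL metric. URL and threshold are assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical
QUERY = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) < 5"

def find_idle_gpus():
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        print(
            f"Idle GPU {labels.get('gpu', '?')} on "
            f"{labels.get('Hostname', labels.get('instance', '?'))}: "
            f"{float(series['value'][1]):.1f}% avg utilization over 24h"
        )

if __name__ == "__main__":
    find_idle_gpus()
```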

Advancing GPU Scheduling and Isolation in Kubernetes

KubeCon + CloudNativeCon Europe 2026, Amsterdam


At KubeCon Europe 2026, NVIDIA made a set of significant open-source contributions that advance how GPUs are managed in Kubernetes. These developments span three areas: resource allocation (DRA), scheduling (KAI), and isolation (Kata Containers). Specifically, NVIDIA donated its DRA Driver for GPUs to the Cloud Native Computing Foundation, transferring governance from a single vendor to full community ownership under the Kubernetes project. The KAI Scheduler was formally accepted as a CNCF Sandbox project, marking its transition from an NVIDIA-governed tool to a community-developed standard. And NVIDIA collaborated with the CNCF Confidential Containers community to introduce GPU support for Kata Containers, extending hardware-level workload isolation to GPU-accelerated workloads. Together, these contributions move GPU infrastructure closer to a first-class, community-owned, scheduler-integrated model.

From Docker Image to 1-Click App: Enabling Self-Service for Custom Apps

In the Developer Pods series (part-1, part-2 and part-3), we made a simple point: most users do not want infrastructure. They want outcomes.

They do not want tickets. They do not want YAML. They do not want to think about pods, namespaces, ingress, or DNS. They want a working environment or application, available quickly, through a clean self-service experience. That was the core theme behind Developer Pods: Kubernetes is a powerful engine, but it should not be the user interface.

The next step is just as important: letting end users deploy applications packaged as Docker containers into shared, multi-tenant Kubernetes clusters with a true 1-click experience.

Rafay’s 3rd Party App Marketplace is designed for exactly this. It allows providers to curate and publish containerized apps from Docker Hub, third-party vendors, or open-source communities, package them with defaults, user overrides, and policies, and expose them as a secure, governed self-service experience for users across multiple tenants.

Docker App

Adding New Language Support to the Self Service Portal in 5 Mins

GPU cloud providers and enterprises serving a global user base need the end-user-facing Self Service Portal to speak their end users' language — literally. If you're serving AI researchers in Paris, data scientists in Montreal, or ML engineers across Francophone Africa, offering the portal in French is a powerful way to reduce friction and make GPU consumption feel native.

The Rafay Platform's Language Customization feature makes it straightforward for admins to add French (or any other language), customize translations, and give end users the ability to switch languages on their own. In this post, we'll walk through the entire process of adding French to the Self Service Portal — from configuring the default locale to verifying the end user experience.

New Language

Developer Pods for Platform Teams: Designing the Right Self-Service GPU Experience

In Part 1, we discussed the core problem: most organizations still deliver GPU access through the wrong abstraction. Developers and data scientists do not want tickets, YAML, and long provisioning cycles. They want a ready-to-use environment with the right amount of compute, available when they need it.

In Part 2, we looked at what that self-service experience feels like for the end user: a familiar, guided workflow that lets them select a profile, launch an environment, and SSH into it in about 30 seconds.

In this part, we shift to the other side of the experience: how platform teams design that experience in the first place. Specifically, we will look at how teams can configure and customize a Developer Pod SKU using the integrated SKU Studio in the Rafay Platform.

SKU in Rafay Platform

Flexible GPU Billing Models for Modern Cloud Providers — Powering the AI Factory with Rafay

The GPU cloud market is evolving fast. At NVIDIA GTC 2026, one theme rang loud and clear: enterprises are no longer experimenting with AI, they are committing to it at scale. Training frontier models, fine-tuning domain-specific LLMs, and running large-scale inference workloads on NVIDIA gear require sustained, predictable access to high-end GPU infrastructure. That kind of commitment demands a billing model to match.

If you are running a GPU cloud business, you already know that a simple pay-as-you-go model doesn't cut it anymore. Your enterprise customers want options, and your ability to offer those options is a direct competitive advantage. That's where Rafay comes in.