NVIDIA AICR Generates It. Rafay Runs It. Your GPU Clusters, Finally Under Control¶
Deploying GPU-accelerated Kubernetes infrastructure for AI workloads has never been simple. Administrators face a relentless compatibility matrix: matching GPU driver versions to CUDA releases, pinning Kubernetes versions to container runtimes, tuning configurations differently for NVIDIA H100s versus A100s, and doing all of it differently again for training versus inference.
One wrong version combination and workloads fail silently, or worse, perform far below hardware capability. For years, the answer was static documentation, tribal knowledge, and hoping that whoever wrote the runbook last week remembered to update it.
NVIDIA's AI Cluster Runtime (AICR) and the Rafay Platform represent a new approach — one where GPU infrastructure configuration is treated as code, generated deterministically, validated against real hardware, and enforced continuously across fleets of clusters.
Together, they cover the full lifecycle from first aicr snapshot to production-grade day-2 operations, with cluster blueprints as the critical bridge between the two.
The Problem: The "Documentation Drift" Era¶
Previously, administrators relied on static documentation and manual installation guides. This approach forced teams to manually track compatibility matrices across dozens of components — matching specific NVIDIA GPU Operator versions to specific driver versions and Kubernetes releases — a process prone to human error and configuration drift.
Static guides quickly become outdated as new software versions are released, and generic installation guides rarely account for specific hardware differences like H100 vs GB200, or workload intent differences between training and inference.
The result was GPU clusters that were difficult to reproduce, hard to audit, and nearly impossible to manage at scale. A configuration that worked on one team's Kubernetes cluster would break on another team's cluster running a slightly different kernel.
Documentation that was accurate in Q4-2025 was wrong by Q1-2026. When something broke, finding the root cause meant manually diffing configurations across nodes, operators, and driver versions.
This is precisely the problem AICR is built to solve.
AICR Generates the Blueprint¶
NVIDIA AICR replaces manual interpretation of documentation with an automated approach. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment.
The workflow runs in four stages.
1. Snapshot¶
The "aicr snapshot" step captures the real state of the cluster — GPU topology, driver version, OS release, Kubernetes configuration, and systemd services — without any assumptions. For an 8-node Kubernetes cluster running H100-SXM5 GPUs on Ubuntu 22.04, the snapshot records NVLink topology, CUDA capability, and runtime details in a single YAML file that can be stored directly in a Kubernetes ConfigMap.
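To make the idea concrete, here is a minimal sketch of a snapshot wrapped in a ConfigMap manifest. The field names and version strings are illustrative assumptions, not AICR's actual snapshot schema; JSON is used for the payload since it is a valid subset of YAML.

```python
import json

# Hypothetical snapshot contents -- illustrative fields, not AICR's real schema.
snapshot = {
    "nodes": 8,
    "gpu": {"model": "H100-SXM5", "count_per_node": 8, "nvlink": True},
    "driver": "550.54.15",
    "cuda_capability": "9.0",
    "os": "Ubuntu 22.04",
    "kubernetes": "1.29",
}

def to_configmap(name: str, data: dict) -> dict:
    """Wrap snapshot data in a Kubernetes ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        # JSON is valid YAML, so the payload can live under a .yaml key.
        "data": {"snapshot.yaml": json.dumps(data, indent=2)},
    }

cm = to_configmap("aicr-snapshot", snapshot)
print(cm["kind"], cm["metadata"]["name"])
```

Storing the snapshot in-cluster like this keeps the recorded state next to the workloads it describes, where later stages (and audits) can read it back.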
2. Recipe¶
Next, the "aicr recipe" stage ingests that snapshot and matches it against a library of validated overlays, generating a hardware-specific configuration recommendation. Users specify an intent — training or inference — and AICR adjusts accordingly. A training recipe optimizes for throughput and pinned memory bandwidth; an inference recipe tunes for latency and concurrent request handling. The recipe selects the correct GPU Operator version, network operator configuration, and driver pins for that exact combination of hardware and intent, acting as a dynamic compatibility matrix that updates as NVIDIA releases new software.
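Conceptually, the recipe stage behaves like a lookup over validated overlays keyed by hardware and intent. The sketch below is a toy model of that idea; the overlay keys and version numbers are invented for illustration and do not reflect AICR's real overlay library.

```python
# Toy overlay library: (gpu_model, intent) -> validated configuration.
# All versions and keys here are hypothetical.
OVERLAYS = {
    ("H100", "training"):  {"gpu_operator": "v24.9.0", "tuning": "throughput"},
    ("H100", "inference"): {"gpu_operator": "v24.9.0", "tuning": "latency"},
    ("A100", "training"):  {"gpu_operator": "v23.9.2", "tuning": "throughput"},
}

def recipe(gpu_model: str, intent: str) -> dict:
    """Return the validated overlay for this hardware/intent pair."""
    key = (gpu_model, intent)
    if key not in OVERLAYS:
        raise ValueError(f"no validated overlay for {key}")
    return OVERLAYS[key]

print(recipe("H100", "inference"))
```

The important property is that the lookup is deterministic: the same snapshot and intent always yield the same pinned configuration, which is what makes the output reproducible across teams.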
3. Validation¶
Next, the "aicr validate" stage runs multi-phase constraint checking against the actual cluster snapshot. Validation phases cover readiness (infrastructure prerequisites like K8s version, OS, kernel, and GPU hardware), deployment (component health and expected resources), performance (system performance and network fabric health), and conformance (workload-specific requirements including DRA support, gang scheduling, and inference gateway behavior).
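The four phases and fail-on-error behavior can be modeled as a simple gate. This is a stand-in sketch: the individual checks, field names, and the string-based version comparison are placeholders, not AICR's actual validation logic.

```python
def validate(snapshot: dict, fail_on_error: bool = True) -> list:
    """Run each validation phase and collect the names of failed phases."""
    phases = {
        # Placeholder checks keyed by AICR's four phase names.
        "readiness":   snapshot.get("kubernetes", "") >= "1.28",
        "deployment":  snapshot.get("gpu_operator_healthy", False),
        "performance": snapshot.get("nvlink_ok", True),
        "conformance": snapshot.get("dra_supported", False),
    }
    errors = [name for name, ok in phases.items() if not ok]
    if errors and fail_on_error:
        # Mirrors a --fail-on-error pipeline gate: non-zero exit on failure.
        raise SystemExit(f"validation failed: {errors}")
    return errors

errors = validate({
    "kubernetes": "1.29",
    "gpu_operator_healthy": True,
    "nvlink_ok": True,
    "dra_supported": True,
})
print("errors:", errors)
```

In a CI pipeline, the non-zero exit on failure is what stops a bad configuration from being promoted to production.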
Info
Teams will likely run this in deployment pipelines with --fail-on-error to catch compatibility issues before they reach production.
4. Bundle¶
Finally, the "aicr bundle" stage converts the validated recipe into concrete deployment artifacts: H100-tuned Helm values files, Kubernetes manifests, a pre-flight install.sh script, and SLSA Level 3 provenance attestations. The bundle is ready to deploy, with every component version pinned and every configuration value validated against real hardware.
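The essence of the bundle step is rendering a recipe into pinned deployment artifacts. The sketch below renders a toy Helm values file; the YAML keys and versions are hypothetical and do not follow the GPU Operator chart's actual schema.

```python
def render_values(recipe: dict) -> str:
    """Render a recipe dict into a pinned Helm-values-style YAML string.

    The keys below are illustrative, not a real chart schema.
    """
    lines = [
        "operator:",
        f"  version: {recipe['gpu_operator']}",
        "driver:",
        f"  version: {recipe['driver']}",
        "tuning:",
        f"  profile: {recipe['tuning']}",
    ]
    return "\n".join(lines) + "\n"

values = render_values({
    "gpu_operator": "v24.9.0",
    "driver": "550.54.15",
    "tuning": "throughput",
})
print(values)
```

Because every value is derived from the validated recipe rather than typed by hand, the rendered file carries no unpinned or guessed versions — which is exactly what makes it safe to hand off.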
Important
This is where AICR stops. And where Rafay begins.
Handoff: AICR Bundle Becomes a Rafay Blueprint¶
The output of AICR's bundle stage is a set of Helm values files optimized for specific hardware and intent. The Rafay Platform takes these values files as validated inputs for critical software add-on configurations within a cluster blueprint.
Rafay's cluster blueprints automate the management of NVIDIA software components including NVIDIA drivers, the device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labeling, and monitoring. The entire workflow can be fully automated and embedded into an automation pipeline using the Rafay CLI, APIs, or Terraform Provider.
The Rafay cluster blueprint can now be applied to as many clusters as required — whether that's three H100 Kubernetes clusters or five Amazon EKS based clusters using A100 GPUs. Every one of them runs the same AICR-validated configuration, enforced continuously.
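The fan-out can be pictured as applying one validated configuration to every cluster in the fleet. The cluster names and the apply step below are placeholders for what the Rafay CLI or API would actually do.

```python
# One AICR-validated configuration, applied fleet-wide.
# Blueprint contents and cluster names are hypothetical.
BLUEPRINT = {"gpu_operator": "v24.9.0", "driver": "550.54.15"}
CLUSTERS = ["h100-train-1", "h100-train-2", "eks-a100-infer-1"]

def apply_blueprint(cluster: str, blueprint: dict) -> dict:
    """Record the intent to enforce this blueprint on one cluster.

    In practice this would be a Rafay CLI/API call; here it only
    models the result.
    """
    return {"cluster": cluster, "applied": blueprint, "status": "enforced"}

results = [apply_blueprint(c, BLUEPRINT) for c in CLUSTERS]
print([r["status"] for r in results])
```

The point of the loop is uniformity: no cluster gets a hand-edited variant, so "slightly different kernel" surprises are caught by enforcement rather than discovered in an incident.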
Day-2: Where Rafay Makes the Difference¶
Generating a correct configuration once is valuable. Keeping it correct across a fleet of clusters through months and years of driver updates, operator upgrades, and infrastructure changes is where operations teams feel the most pain.
Rafay addresses this across three dimensions:
1. Drift Prevention¶
The Rafay cluster blueprint defines the desired state declaratively and enforces it continuously. Ad-hoc changes, the kind that cause "it works on my cluster" problems, are blocked at the platform level before they can cause incidents.
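At its core, drift detection is a comparison between desired and observed state. A minimal sketch, assuming a flat key/value view of cluster configuration (real blueprint state is richer than this):

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return every key where observed state diverges from desired state."""
    return {
        key: {"desired": want, "actual": actual.get(key)}
        for key, want in desired.items()
        if actual.get(key) != want
    }

desired = {"driver": "550.54.15", "gpu_operator": "v24.9.0"}
actual  = {"driver": "550.54.15", "gpu_operator": "v23.9.2"}  # ad-hoc downgrade
print(detect_drift(desired, actual))
```

A platform that runs this comparison continuously, and reverts or blocks the divergence, is what turns a one-time correct configuration into a durable one.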
2. Observability¶
The integrated GPU Resource Dashboard in the Rafay Platform, powered by the DCGM Exporter deployed as part of the GPU Operator, gives deep visibility into GPU core resources. Rafay's managed Prometheus automatically scrapes and aggregates GPU metrics in a multi-tenant time series database.
Platform teams get per-cluster and per-GPU visibility from a single control plane, without instrumenting each cluster independently.
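For a feel of the underlying data, here is a sketch that aggregates GPU utilization from DCGM Exporter-style Prometheus text output. The sample lines follow the exporter's general format, but the exact metric labels vary by deployment, so treat them as illustrative.

```python
# Sample scrape in Prometheus text format (labels are illustrative).
SCRAPE = """\
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-1"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-1"} 91
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-2"} 12
"""

def mean_util(scrape: str) -> float:
    """Average GPU utilization across all sampled GPUs."""
    values = [
        float(line.rsplit(" ", 1)[1])          # value is the last field
        for line in scrape.splitlines()
        if line.startswith("DCGM_FI_DEV_GPU_UTIL")
    ]
    return sum(values) / len(values)

print(mean_util(SCRAPE))
```

In the managed setup described above, this aggregation happens in Prometheus itself; the sketch just shows what "scraping and aggregating GPU metrics" means at the data level.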
3. Secure Access¶
Infra operators can use Rafay's zero-trust kubectl to securely access remote Kubernetes resources and run validation commands — like nvidia-smi inside a GPU Operator pod — without exposing the cluster directly or relying on VPNs or manually managed kubeconfigs.
Running a quick GPU health check on a remote production cluster becomes a routine, auditable operation rather than a credentials-management exercise.
Conclusion¶
What makes NVIDIA AICR and Rafay a "lethal combination" is that they complement each other completely.
- NVIDIA AICR answers the question "What should this GPU cluster look like, given this exact hardware and this exact workload intent?" with a validated, hardware-specific answer.
- Rafay answers the question "How do I deploy that answer across dozens of clusters, keep it enforced over time, and give my operators safe, audited access to investigate issues?"
Neither solution alone is sufficient for production AI infrastructure at scale. AICR without Rafay generates great bundles that still require manual deployment and have no drift protection. Rafay without AICR can manage GPU clusters, but relies on operational personnel to manually construct and validate the right Helm values for each hardware combination.
Together, they close the loop from raw GPU hardware to production-managed, validated, drift-proof AI infrastructure — with cluster blueprints as the architectural bridge between generation and operation.
- Free Org: Sign up for a free Org if you want to try this yourself with our Get Started guides.
- Live Demo: Schedule time with us to watch a demo in action.

