Accelerating the AI Factory: Rafay & NVIDIA NCX Infra Controller (NICo)
Acquiring GPU hardware is the easy part. Turning it into a productive, multi-tenant AI service with proper isolation, self-service provisioning, and the governance to operate it at scale is where most operators get stuck. Custom integration work piles up, timelines slip, and the gap between racked hardware and revenue widens.
Rafay is closing that gap through a new integration with the NVIDIA NCX Infrastructure Controller (NICo), NVIDIA's open-source component for automated bare-metal lifecycle management. Together, Rafay and NICo give operators a unified platform for managing their GPU fleet and delivering cloud-like, self-service experiences to end users.
A Smarter Foundation for Multi-Tenancy
Traditional bare-metal multi-tenancy relies on network configuration at the switch layer: VPCs, subnets, and tenant isolation are all created through switch-level APIs. This works, but it introduces operational complexity that grows with every new tenant and every new rack.
NICo changes the model. Network configuration moves directly to the host, implemented through the NVIDIA BlueField DPU on each server. The DPU operates in zero trust mode: the host operating system cannot configure the DPU directly. It remains owned and controlled by the service provider, while the host is handed to the tenant. This means that even if a tenant's workload or OS is compromised, the network and management planes stay secure and isolated, a meaningful improvement over conventional bare-metal multi-tenancy.
The result is a network isolation model that is both more secure and dramatically simpler to operate at scale.
Where Rafay Comes In
NICo provides the hardware automation layer. Rafay sits on top of NICo's host-based networking model and uses it as the foundation for delivering multi-tenancy at scale, without the operational overhead that traditionally limits how many tenants a team can serve.
Bare-metal provisioning. Rafay uses NICo's provisioning APIs to automate the full node lifecycle, from zero-touch discovery and hardware validation through OS imaging and tenant delivery. What previously required manual intervention or fragmented scripts is now a fully automated, repeatable workflow triggered directly from the Rafay platform.
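The lifecycle stages named above can be pictured as a simple linear state machine. The following is a minimal sketch only: the state names and the `advance` and `provision` helpers are hypothetical and do not reflect NICo's or Rafay's actual APIs or state model.

```python
from enum import Enum


class NodeState(Enum):
    """Hypothetical lifecycle stages, mirroring the order described in the text."""
    DISCOVERED = "discovered"   # zero-touch discovery
    VALIDATED = "validated"     # hardware validation
    IMAGED = "imaged"           # OS imaging
    DELIVERED = "delivered"     # handed to the tenant


# Allowed forward transitions (illustrative, not NICo's real state names).
NEXT_STATE = {
    NodeState.DISCOVERED: NodeState.VALIDATED,
    NodeState.VALIDATED: NodeState.IMAGED,
    NodeState.IMAGED: NodeState.DELIVERED,
}


def advance(state: NodeState) -> NodeState:
    """Move a node to the next stage, or raise if it is already delivered."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]


def provision(start: NodeState = NodeState.DISCOVERED) -> list:
    """Walk a node through every remaining stage and return the path taken."""
    path = [start]
    while path[-1] in NEXT_STATE:
        path.append(advance(path[-1]))
    return path
```

Modeling the workflow as explicit transitions makes it repeatable: every node follows the same path, and a node cannot skip validation or imaging on its way to a tenant.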
Self-service provisioning. Rafay abstracts NICo's APIs into simple workflows. A developer requests a GPU environment; Rafay triggers the NICo workflow to provision and deliver a ready-to-use node, fully isolated at the host network layer, with no manual operator steps and no switch changes.
Standardized service SKUs. Operators define SKUs that encode server configuration, OS image, networking, and security controls. Because tenant network isolation is handled by the DPU rather than the switch, those SKUs are faster to deliver and easier to replicate consistently across tenants.
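One way to think about a SKU is as an immutable template that is stamped out per tenant. The sketch below is an assumption-laden illustration: the `ServiceSku` class, its field names, and `render_request` are hypothetical and are not Rafay's SKU schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServiceSku:
    """Hypothetical SKU: the four things the text says a SKU encodes."""
    name: str
    server_config: str              # e.g. a hardware profile identifier
    os_image: str                   # OS image to lay down on the node
    network_profile: str            # host-level (DPU-enforced) network settings
    security_controls: tuple = ()   # named security policies to apply


def render_request(sku: ServiceSku, tenant: str) -> dict:
    """Combine a SKU with a tenant name into a provisioning request.

    Only the tenant differs between requests; the spec is copied verbatim
    from the SKU, which is what keeps deliveries consistent across tenants.
    """
    return {"tenant": tenant, "spec": asdict(sku)}
```

Because the dataclass is frozen, a SKU cannot be mutated after definition, so two tenants who order the same SKU receive the same specification.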
Enterprise governance. Rafay adds RBAC, resource quotas, and audit logging across the entire fleet. Every provisioning event is tracked, and only authorized users can access specific resources. Access is enforced at both the platform and the network layer.
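The interplay of the three governance controls can be sketched in a few lines. This is a toy model under stated assumptions: the `Tenant` class, `request_gpus` function, and the log record fields are all hypothetical, and real RBAC in Rafay is richer than a set of user names.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    """Hypothetical tenant record with a GPU quota and an allow-list of users."""
    name: str
    gpu_quota: int
    gpus_in_use: int = 0
    authorized_users: set = field(default_factory=set)


def request_gpus(tenant: Tenant, user: str, count: int, audit_log: list) -> bool:
    """Check authorization, then quota, and record the outcome either way."""
    if user not in tenant.authorized_users:
        outcome = "denied:unauthorized"
    elif tenant.gpus_in_use + count > tenant.gpu_quota:
        outcome = "denied:quota"
    else:
        tenant.gpus_in_use += count
        outcome = "granted"
    # Every request is logged, including denials, so the trail is complete.
    audit_log.append(
        {"tenant": tenant.name, "user": user, "count": count, "outcome": outcome}
    )
    return outcome == "granted"
```

The ordering is a design choice in this sketch: authorization is checked before quota so that an unauthorized caller learns nothing about the tenant's remaining capacity.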
Cluster assembly. Once a node is provisioned and network-isolated, Rafay can automatically install the Kubernetes or SLURM stack, GPU drivers, and AI software needed to start work immediately.
| Capability | Rafay's Role |
|---|---|
| Inventory | Unified management across hardware generations and configurations |
| Automation | Self-service workflows for bare-metal compute provisioning |
| Multi-Tenancy | Host-network-level isolation with RBAC and quota enforcement |
| Usage & Cost | Metering and chargeback per tenant or business unit |
Summary
The hardest part of building an AI infrastructure platform is operationalizing hardware at scale. Rafay's integration with the NVIDIA NCX Infrastructure Controller makes this tractable by combining NICo's host-based networking and lifecycle automation with Rafay's orchestration and governance layer. The result is secure, scalable multi-tenancy without the complexity that has historically made bare-metal GPU services difficult to operate.
For more information, see the NVIDIA NCX Infrastructure Controller documentation.
- Explore the Service: Learn more about how Rafay handles Bare Metal GPU Orchestration.
- See It In Action: Watch a demo of our unified provisioning workflow.
