Overview

As the demand for AI training and inference surges, GPU Clouds are increasingly looking to offer higher-level, turnkey AI services rather than just raw GPU instances. Some customers may already be familiar with Run:AI from NVIDIA, an AI workload orchestration and optimization platform. Delivering Run:AI as a scalable, repeatable SKU, something customers can select and provision with a few clicks, requires deep automation, lifecycle management, and tenant isolation capabilities. This is exactly what Rafay provides.

With Rafay, GPU Clouds can deliver Run:AI as a self-service SKU, ensuring customers receive a fully configured Run:AI environment—complete with GPU infrastructure, a Kubernetes cluster, necessary operators, and a ready-to-use Run:AI tenant—all deployed automatically. This blog explains how Rafay enables cloud providers to industrialize Run:AI provisioning into a consistent, production-ready SKU.

Run:AI via Self Service


Rationale

For GPU Clouds, SKU-based managed services offer tremendous benefits:

  1. Predictable, standardized offerings for customers
  2. Reduced complexity, since the SKU hides all underlying infrastructure
  3. Faster onboarding, enabling customers to begin using Run:AI in minutes
  4. Higher margins, by offering value-added services instead of raw compute
  5. Scalability, allowing dozens or hundreds of customers/tenants to onboard seamlessly

In short, turning Run:AI into a cloud SKU transforms it from a complex integration into a consumption-ready product. The experience begins in the GPU Cloud provider’s marketplace or self-service portal. Customers simply choose the Run:AI SKU, which can come in variants such as:

  • Run:AI Standard — 4 GPUs (e.g., L40S or A100)
  • Run:AI Enterprise — 8 GPUs (e.g., H100)
  • Multi-node Run:AI SKU (e.g., 2× H100 nodes)
  • Bare metal or VM-backed Infrastructure

Each SKU is pre-defined by the cloud provider and backed by Rafay, which performs sophisticated automation behind the scenes: orchestrating the required infrastructure, deploying and configuring the required software, and so on. An illustrative example is shown below.

Run:AI via Self Service Inputs
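To make the SKU variants above concrete, here is a minimal sketch of what such a catalog might look like in code. The class name, fields, and SKU identifiers are illustrative assumptions for this post, not Rafay's actual schema or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunAISku:
    """Hypothetical SKU definition; fields are illustrative only."""
    name: str
    gpu_model: str       # e.g. "L40S", "A100", "H100"
    gpus_per_node: int
    node_count: int
    infra: str           # "bare-metal" or "vm"

# A catalog mirroring the example variants listed above
CATALOG = {
    "runai-standard":   RunAISku("Run:AI Standard", "L40S", 4, 1, "vm"),
    "runai-enterprise": RunAISku("Run:AI Enterprise", "H100", 8, 1, "bare-metal"),
    "runai-multinode":  RunAISku("Run:AI Multi-node", "H100", 8, 2, "bare-metal"),
}

def total_gpus(sku: RunAISku) -> int:
    """Total GPUs a customer receives for a given SKU."""
    return sku.gpus_per_node * sku.node_count

print(total_gpus(CATALOG["runai-multinode"]))  # 2 nodes x 8 H100s = 16
```

Keeping the catalog declarative like this is what lets the provider expose each variant as a one-click marketplace item while the automation layer resolves it into concrete infrastructure.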


Benefits

Rafay transforms Run:AI from a manually deployed application and infrastructure into a self-service SKU that GPU Cloud providers can expose to customers with confidence. By automating everything—from provisioning GPU infrastructure to tenant creation to cluster onboarding—Rafay ensures that customers can begin using Run:AI within minutes of selecting a SKU.
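The end-to-end flow described above, provisioning GPU infrastructure, creating and onboarding a Kubernetes cluster, installing the necessary operators, and creating a Run:AI tenant, can be sketched as a simple pipeline. Every function name below is a hypothetical placeholder standing in for an automation step, not Rafay's actual API.

```python
# Illustrative sketch of the SKU provisioning pipeline; each function is a
# stub representing one automated stage, not a real Rafay or Run:AI call.

def provision_gpu_infrastructure(sku: str) -> dict:
    # Stage 1: allocate bare-metal or VM nodes with the SKU's GPUs
    return {"sku": sku, "nodes": ["node-1"]}

def create_kubernetes_cluster(infra: dict) -> dict:
    # Stage 2: bring up and onboard a Kubernetes cluster on those nodes
    return {"cluster": "runai-cluster", "nodes": infra["nodes"]}

def install_operators(cluster: dict) -> None:
    # Stage 3: deploy required operators (e.g. a GPU operator) and Run:AI components
    cluster.setdefault("operators", []).extend(["gpu-operator", "runai"])

def create_runai_tenant(cluster: dict, customer: str) -> dict:
    # Stage 4: create a ready-to-use Run:AI tenant for the customer
    return {"tenant": customer, "cluster": cluster["cluster"]}

def provision_sku(sku: str, customer: str) -> dict:
    """Run all stages in order, as the self-service workflow would."""
    infra = provision_gpu_infrastructure(sku)
    cluster = create_kubernetes_cluster(infra)
    install_operators(cluster)
    return create_runai_tenant(cluster, customer)

print(provision_sku("runai-standard", "acme"))
```

The point of the sketch is the ordering: each stage consumes the output of the previous one, which is why hiding the whole chain behind a single SKU selection is what makes the offering feel turnkey to the customer.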

For customers, the result is instant access to Run:AI. For cloud operators, it means:

  • Higher operational efficiency
  • Scalable onboarding of new customers
  • Stronger differentiation in the GPU Cloud market
  • A future-proof platform for expanding GPU-accelerated services