Powering Multi-Tenant, Serverless AI Inference for Cloud Providers¶
The AI revolution is here, and Large Language Models (LLMs) are at its forefront. Cloud providers are uniquely positioned to offer powerful AI inference services to their enterprise and retail customers. However, delivering these services in a scalable, multi-tenant, and cost-effective serverless manner presents significant operational challenges.
Rafay enables cloud providers to deliver Serverless Inference to hundreds of users and enterprises.
Info
Earlier this week, we announced our Multi-Tenant Serverless Inference offering for GPU & Sovereign Cloud Providers. Learn more about this here.
The Big Picture: Conceptual Architecture¶
At its core, Rafay enables cloud providers to set up a Unified Inference Endpoint (e.g., https://inference.cloud.com/v1/chat/completions). This single endpoint can serve requests for multiple LLMs (like LLaMA, Qwen, DeepSeek) running on the provider's GPU-accelerated infrastructure (e.g., NVIDIA GPUs) within their data centers.
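Because the endpoint exposes the familiar /v1/chat/completions path, requests can generally follow an OpenAI-compatible schema. Below is a minimal sketch of how two different models could be served from the same endpoint; the URL is the illustrative one above, and the model names and API key are placeholders, not values defined by Rafay.

```python
import requests

ENDPOINT = "https://inference.cloud.com/v1/chat/completions"  # illustrative unified endpoint
API_KEY = "YOUR_API_KEY"  # issued per user/tenant (see the end-user section below)

def chat(model: str, prompt: str) -> str:
    """Send an OpenAI-style chat completion request to the unified endpoint."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,  # the same endpoint routes to different deployed LLMs
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Two different models, one endpoint -- routing is handled by the platform.
print(chat("llama-3-8b", "Summarize Kubernetes in one sentence."))
print(chat("qwen-2.5-7b", "Summarize Kubernetes in one sentence."))
```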
This service is designed for:
- Retail Internet Users: Individual users accessing the models directly.
- Enterprise Customers: Specific organizations (tenants like "Customer 1," "Customer 2") with their own users.
The Rafay GPU PaaS is the engine driving this. It not only manages the model deployments and the inference endpoint but also, crucially, collects token usage and cost metrics. These metrics can then be integrated via APIs into the cloud provider's existing billing system, allowing for accurate chargeback per tenant and per user.
Imagine users/customers sending requests to a unified endpoint, which routes them to different LLMs running on GPUs. Usage metrics are automatically collected and the cloud provider's billing system can retrieve this data programmatically.
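As a purely hypothetical sketch of that billing integration (the metrics endpoint, record fields, and rates below are invented for illustration and are not the actual Rafay or provider APIs), a provider-side job might periodically pull per-tenant token counts and turn them into billable charges:

```python
import requests

METRICS_API = "https://rafay.example.com/api/usage"      # hypothetical usage-metrics endpoint
BILLING_API = "https://billing.example.com/api/charges"  # hypothetical provider billing endpoint

# Hypothetical per-model pricing (cost per million tokens), as configured during model deployment.
PRICING = {"llama-3-8b": {"input": 0.20, "output": 0.60}}

def chargeback(period: str) -> None:
    """Pull per-tenant token usage for a billing period and post charges."""
    usage = requests.get(METRICS_API, params={"period": period}, timeout=30).json()
    for record in usage:
        # Assumed record shape: {"tenant": "customer-1", "model": "llama-3-8b",
        #                        "input_tokens": 1200000, "output_tokens": 350000}
        rates = PRICING[record["model"]]
        cost = (record["input_tokens"] / 1e6) * rates["input"] \
             + (record["output_tokens"] / 1e6) * rates["output"]
        requests.post(
            BILLING_API,
            json={"tenant": record["tenant"], "period": period, "amount": round(cost, 2)},
            timeout=30,
        )

chargeback("2025-06")
```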
The End-User Experience: Simplicity is Key¶
For end-users, whether retail or enterprise, accessing these powerful AI models is designed to be incredibly straightforward.
- Access End-User Portal: Users log in to the self-service portal.
- Generate API Key: A one-time task to generate a unique API key for authenticated access.
- Use Inference Endpoint API: Programmatically use the provided API key to send requests to the inference endpoint and receive model completions, as shown in the sketch below.
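Since the endpoint follows the /v1/chat/completions convention, most OpenAI-compatible clients can be pointed at it directly. The sketch below assumes that compatibility; the base URL, model name, and key are placeholders for the values a user would obtain from the portal.

```python
from openai import OpenAI

# base_url points at the provider's unified endpoint; the API key is the one
# generated in the end-user portal.
client = OpenAI(
    base_url="https://inference.cloud.com/v1",
    api_key="YOUR_GENERATED_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3-8b",  # any model the provider has deployed
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```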
Info
Watch a brief video of the end user experience.
The Cloud Provider Administrator's Role: Setting Up the Service¶
While the end-user experience is simple, Rafay also streamlines the setup and management for the cloud provider administrator. The typical steps for Cloud Providers are:
1. One-Time Setup:
- Provide GPU Compute: Integrate their GPU-accelerated compute resources (e.g., bare metal servers with NVIDIA GPUs) with the Rafay platform. Rafay dynamically provisions VMs on the bare metal servers, converts them into Kubernetes worker nodes, and deploys the Inference Service on them.
- Create Storage Namespaces: Set up storage (e.g., Ceph, Weka) where model weights and artifacts will be stored. This is crucial for dynamic model loading and auto-scaling.
- Create Inference Endpoint: Define the public-facing HTTPS endpoint, associating it with the compute resources and configuring TLS.
2. Per-Model Deployment:
- Upload Model: Upload the desired LLM and its weights to the configured storage namespace.
- Model Deployment: Configure how this model will be served (a sketch of these parameters follows this list). This involves:
- Selecting the model from storage.
- Associating it with an inference endpoint.
- Specifying optimization and scaling parameters (e.g., target accelerator like L40/H100, quantization level like Auto/FP16/INT8, initial scale/number of instances).
- Defining pricing (currency, cost per million input tokens, cost per million output tokens).
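To make those parameters concrete, a model deployment conceptually carries fields like the ones below. This is an illustrative sketch only, not the actual Rafay configuration schema; every name and value here is an assumption.

```python
# Illustrative only -- not the actual Rafay deployment schema.
model_deployment = {
    "model": "llama-3-8b",                 # selected from the storage namespace
    "endpoint": "inference.cloud.com",     # inference endpoint to associate with
    "accelerator": "H100",                 # target GPU, e.g. L40 or H100
    "quantization": "FP16",                # Auto / FP16 / INT8
    "replicas": {"initial": 1, "max": 8},  # initial scale and auto-scaling ceiling
    "pricing": {
        "currency": "USD",
        "per_million_input_tokens": 0.20,
        "per_million_output_tokens": 0.60,
    },
}
```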
Info
Watch a brief video of the experience for the Cloud Provider Administrator.
Key Benefits with Rafay¶
1. Simplified Operations
Rafay abstracts the complexity of deploying, managing, and scaling LLM inference.
2. True Multi-Tenancy
Easily manage and isolate different enterprise customers or offer public access.
3. Integrated Billing Metrics
Seamlessly track token usage and costs for accurate chargeback to tenants/users.
4. Scalability & Optimization
Supports auto-scaling and various optimization techniques like quantization.
5. Flexibility
Works with various GPU types, storage solutions, and model families.
Conclusion¶
Rafay's platform provides a comprehensive solution for cloud providers looking to offer sophisticated, multi-tenant, serverless AI inferencing services. By simplifying both the end-user experience and the administrative overhead, Rafay empowers cloud providers to quickly tap into the booming AI market and deliver value to their customers.
- Free Org: Sign up for a free Org if you want to try this yourself with our Get Started guides.
- Live Demo: Schedule time with us to watch a demo in action.