
Model Deployments are "running instances" of an already configured model. When a new model is created and configured, by default, it has zero active model deployments. For example, in the image below, for the Facebook OPT 125m model, there are no active model deployments.

No Active Deployments

Administrators can deploy and operate multiple model deployments for a given model. In the image below, for the "llama-8b-instruct" model, there is one active model deployment.

Active Deployments


New Deployment

Click on "Deploy" to start a new model deployment.

  • Provide a name (unique in your environment) and an optional description
  • The "model" field will auto-populate, since this is a deployment of a specific model
  • Select the "endpoint" from the dropdown list that will service requests to the model deployment

New Deployment - General


Select Inference Engine

In this step, the admin has to select their preferred Inference engine. Three options are currently supported:

Select the preferred Inference engine

  1. vLLM
  2. NIM
  3. Nvidia Dynamo (coming soon!)

Multiple Engines

Info

The default engine selection is vLLM. The use of NIM requires a license and keys from Nvidia. Please work with your Nvidia team for this.

New Deployment- Inference Engine


Option 1: vLLM

Follow the steps below if you selected vLLM as the inference engine.

vLLM Settings

Important

The vLLM container image is extremely large (~10-25 GB). Admins are strongly advised to download and host the vLLM container image locally in a container registry. This ensures sovereignty, security and performance.

  • Specify the path for the vLLM container image and tag
  • Specify the number of replicas (default is 1)
  • Specify the size of the volume for each replica

Resource Requests & Limits

Update the default resource requests/limits (CPU, Memory and GPU) if required. This ensures that the vLLM pod is allocated the resources it needs to run reliably.
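For reference, here is a minimal sketch of how these console fields correspond to Kubernetes-style resource requests/limits for the vLLM pod. The values are illustrative placeholders only; actual sizing depends on your model and GPU type.

```python
# Illustrative only: the request/limit fields map to Kubernetes-style
# resource specifications for the vLLM pod. Values are placeholders.
vllm_resources = {
    "requests": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
    "limits":   {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
}
```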


Auto Scaling

Enable auto scaling of replicas if required. Once this is enabled, the admin is presented with the configuration details for auto scaling.

vLLM Auto Scaling Config

  • Specify the minimum number of replicas. This is the base capacity of the service
  • Specify the maximum number of replicas. This is the upper limit for the service
  • Specify the metrics that will be used to trigger auto scaling events

You can select either CPU or memory as the resource metric and specify a utilization threshold. Once the threshold is breached, auto scaling is performed.
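As a rough mental model, replica auto scaling follows a Kubernetes HPA-style calculation: the replica count grows in proportion to how far the observed utilization is above the threshold, bounded by the configured minimum and maximum. The sketch below is illustrative only and is not Rafay's exact implementation.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA-style sketch: scale replicas in proportion to how far the
    observed CPU/memory utilization is from the configured threshold."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# Example: 2 replicas running at 90% CPU with a 60% threshold -> scale out to 3
print(desired_replicas(2, 90, 60, min_replicas=1, max_replicas=5))
```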


Advanced Configuration

Admins can also fine-tune and optimize the vLLM inference engine by providing "custom environment variables"; an example is shown below. vLLM's environment variable documentation is available here.

vLLM Advanced Settings
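For illustration, custom environment variables are supplied as key/value pairs. The variable names below are examples of documented vLLM settings, but verify names and permitted values against the vLLM documentation linked above before using them.

```python
# Illustrative custom environment variables for the vLLM engine.
# Confirm names and permitted values in the vLLM documentation.
vllm_env = {
    "VLLM_LOGGING_LEVEL": "INFO",            # verbosity of vLLM logs
    "VLLM_ATTENTION_BACKEND": "FLASHINFER",  # choice of attention backend
}
```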


Option 2: NIM

NVIDIA NIM (NVIDIA Inference Microservices) are pre-built, optimized AI containers for deploying large language models (LLMs) and other generative AI models easily and efficiently on NVIDIA-accelerated hardware. They simplify development by providing ready-to-use APIs and optimized engines (such as TensorRT) for faster inference across cloud, data center, and edge devices, essentially serving as "readymade kits" for AI applications such as chatbots or image generation.

  • Select NIM as the underlying engine for Inference.

NIM Settings

  • Specify the number of replicas
  • Specify the number of GPUs
  • Add environment variables (Key + Value); an example is shown after the note below

Info

The latest version of the container image for NIM will be dynamically downloaded from Nvidia's NGC repository. Please ensure your Inference Data Plane compute servers have connectivity to the Internet to pull these artifacts from NGC.
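As an illustration, the key/value environment variables for a NIM deployment typically include the NGC credentials used to pull model artifacts. The variable name below is an assumption on our part; confirm it against Nvidia's NIM documentation.

```python
# Illustrative environment variables for a NIM-based deployment.
# NGC_API_KEY (assumed name) holds the key issued by Nvidia for
# authenticating to NGC; confirm against Nvidia's NIM documentation.
nim_env = {
    "NGC_API_KEY": "<your-ngc-api-key>",
}
```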


Option 3: Dynamo

Coming soon! Support for Nvidia Dynamo as the underlying inference engine is planned for H1-2026.


Rate Limiting

Rate limits are critical to ensure that a single tenant (i.e. a Rafay Org) or a single user does not consume all available resources. Rate limiting is disabled by default; admins need to explicitly enable it and configure the desired behavior.

Rate Limits

For each option, admins need to specify limits for the following:

  • Max tokens per minute
  • Max requests per minute

Info

Max tokens is the sum of both "input" and "output" tokens.

1. By Org (Tenant)

This option ensures that overall capacity can be shared equitably across all "Orgs" (i.e. tenants). The specified values for max tokens/min and max requests/min are totals across all users in each Org under management.

Important

We recommend that all customers enable and configure this option.

2. By User

This option makes sure that overall capacity can be shared equitably across "Users". The specified values for max tokens/min and max requests/min are totals per unique user.

3. By API Key

This option makes sure that overall capacity can be shared equitably across "API Keys". The specified values for max tokens/min and max requests/min are totals per unique API Key.
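Conceptually, each scope (Org, User, or API Key) is tracked against its own per-minute counters, and a request is rejected once either counter would exceed its limit. The sketch below is a simplified illustration of that bookkeeping, not Rafay's implementation; the limit values are placeholders.

```python
import time
from collections import defaultdict

# Simplified fixed-window rate limiter, keyed by Org, User, or API Key.
# "tokens" means input tokens + output tokens for the request, matching
# the "max tokens per minute" definition above. Limits are placeholders.
LIMITS = {"max_tokens_per_min": 100_000, "max_requests_per_min": 600}
windows = defaultdict(lambda: {"minute": 0, "tokens": 0, "requests": 0})

def allow_request(scope_key: str, input_tokens: int, output_tokens: int) -> bool:
    minute = int(time.time() // 60)
    w = windows[scope_key]
    if w["minute"] != minute:                        # new minute: reset counters
        w.update(minute=minute, tokens=0, requests=0)
    tokens = input_tokens + output_tokens            # max tokens = input + output
    if (w["requests"] + 1 > LIMITS["max_requests_per_min"]
            or w["tokens"] + tokens > LIMITS["max_tokens_per_min"]):
        return False                                 # over a per-minute limit
    w["requests"] += 1
    w["tokens"] += tokens
    return True

# Example: admit or reject a request attributed to a specific API key
print(allow_request("api-key-123", input_tokens=1_200, output_tokens=800))
```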


Specify Pricing

New Deployment Currency

  • Select the currency used for billing (default = USD)
  • Specify the cost per "1M" input and output tokens
  • Click on Save once you have specified all the required inputs

Select Currency

Input tokens are the text you send to an LLM, while output tokens are the text the LLM generates back. Output tokens are typically more expensive because they require more computational power to generate one by one, whereas input tokens are processed in a single pass.

It is common for providers to charge more for "output" tokens than for "input" tokens because every request's input can be completely different, so there are generally no optimizations available to drive down processing costs.
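For example, the billed cost of a single request is simply the input and output token counts scaled by their respective per-1M-token rates. The prices in the sketch below are placeholders.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Cost of one request given per-1M-token prices (illustrative rates)."""
    return (input_tokens / 1_000_000) * price_in_per_1m \
         + (output_tokens / 1_000_000) * price_out_per_1m

# Example: 2,000 input tokens at $0.50/1M plus 500 output tokens at $1.50/1M
print(round(request_cost(2_000, 500, 0.50, 1.50), 6))  # 0.00175
```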

Note

Rafay's serverless inferencing solution allows you to charge for input and output tokens at different rates.


Deployment Status

Once the admin initiates the deployment of a model, the Rafay Control plane will attempt to deploy the model to the data plane (i.e. k8s cluster with GPUs). The admin can monitor progress and status of the deployment right in the console.

Deployment Status


View Deployment

To view a deployment, click on the name. You will be presented with the details of the deployment.

View Deployment


Edit Deployment

Click on the "ellipses" under Actions and select "Edit Configuration". Make the updates you require and save.

Edit Deployment


Delete Deployment

Click on the "ellipses" under Actions and select "Delete" to delete the deployment.

Delete Deployment


Share Deployment

Click on the "ellipses" under Actions. Now, click on "Manage Sharing" to initiate a workflow to share the model with All or Select tenant orgs.

  • By default, a newly created model is not shared with any tenant org.
  • Select "All Orgs" to make the model available to all tenant orgs under management
  • Select "Select Orgs" to make the model available to selected tenant orgs.

Share Deployment


Model Metrics

For a given model deployment, model metrics are aggregated and made available to the administrator. The metrics are continuously aggregated but calculated every 60 minutes, so data points are available at 1-hour granularity.

  • Click on a model deployment
  • Click on metrics

Admins can filter and visualize the metrics for a specific time period.

Time to First Token (TTFT)

Time-to-First-Token (TTFT) measures how quickly an LLM begins generating output after receiving a prompt. It reflects initial processing latency, including model loading, prompt encoding, and the start of inference.

Lower TTFT improves responsiveness and user experience, especially for interactive applications like chat, streaming responses, and real-time decision systems.

Inter Token Latency

Inter-token latency measures the time an LLM takes to generate each subsequent token after the first. It reflects the model’s throughput and compute efficiency during streaming output.

Lower inter-token latency enables smoother, more natural real-time responses, improving usability for chat systems, agents, and interactive AI applications.
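To build intuition for how these metrics are derived, the sketch below measures TTFT and average inter-token latency on the client side. It assumes the deployment exposes an OpenAI-compatible streaming endpoint and that the `openai` Python package is installed; the endpoint URL, API key, and model name are placeholders.

```python
import time
from openai import OpenAI  # assumes the openai package and an OpenAI-compatible endpoint

# Placeholders: substitute your deployment's endpoint URL, API key and model name.
client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.chat.completions.create(
    model="llama-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain rate limiting in one paragraph."}],
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = now      # first generated token -> TTFT
        token_times.append(now)

ttft = first_token_at - start
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps) if gaps else 0.0
print(f"TTFT: {ttft:.3f}s, average inter-token latency: {itl:.4f}s")
```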

Model Metrics