
Model Deployments are "running instances" of an already configured model. When a new model is created and configured, by default, it has zero active model deployments. For example, in the image below, for the Facebook OPT 125m model, there are no active model deployments.

No Active Deployments

Administrators can deploy and operate multiple model deployments for a given model. In the image below, for the "llama-8b-instruct" model, there is one active model deployment.

Active Deployments


New Deployment

Click on "Deploy" to start a new model deployment.

  • Provide a name (unique in your environment) and an optional description
  • The "model" field will auto-populate, since this is a deployment of a specific model
  • Select the "endpoint" from the dropdown list that will service requests to the model deployment

New Deployment - General


Select Inference Engine

In this step, the admin has to select their preferred Inference engine. Three options are currently supported:

Select the preferred Inference engine

  1. vLLM
  2. NIM
  3. Nvidia Dynamo (coming soon!)

Multiple Engines

Info

The default engine selection is vLLM. The use of NIM requires a license and keys from Nvidia. Please work with your Nvidia team for this.

New Deployment- Inference Engine


Option 1: vLLM

Follow the steps below if you selected vLLM as the inference engine.

vLLM Settings

Important

The vLLM container image is extremely large (~10-25 GB). Admins are strongly advised to download and host the vLLM container image locally in a container registry. This ensures sovereignty, security and performance.

  • Specify the path for the vLLM container image and tag
  • Specify the number of replicas (default is 1)
  • Specify the size of the volume for each replica

Resource Requests & Limits

Update the default resource requests/limits (CPU, Memory and GPU) if required. This ensures that the vLLM pod is allocated the resources it needs to run reliably.
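For reference, here is a minimal sketch of how these console fields correspond to Kubernetes-style resource requests/limits for the vLLM pod. The values are illustrative placeholders only; actual sizing depends on your model and GPU type.

```python
# Illustrative only: the request/limit fields map to Kubernetes-style
# resource specifications for the vLLM pod. Values are placeholders.
vllm_resources = {
    "requests": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
    "limits":   {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
}
```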


Auto Scaling

Enable auto scaling of replicas if required. Once this is enabled, the admin is presented with the configuration details for auto scaling.

vLLM Auto Scaling Config

  • Specify the minimum number of replicas. This is the base capacity of the service
  • Specify the maximum number of replicas. This is the upper limit for the service
  • Specify the metrics that will be used to trigger auto scaling events

You can select either CPU or memory as the resource metric and specify a utilization threshold. Once the threshold is breached, auto scaling is performed.
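As a rough mental model, replica auto scaling follows a Kubernetes HPA-style calculation: the replica count grows in proportion to how far the observed utilization is above the threshold, bounded by the configured minimum and maximum. The sketch below is illustrative only and is not Rafay's exact implementation.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA-style sketch: scale replicas in proportion to how far the
    observed CPU/memory utilization is from the configured threshold."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# Example: 2 replicas running at 90% CPU with a 60% threshold -> scale out to 3
print(desired_replicas(2, 90, 60, min_replicas=1, max_replicas=5))
```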


Advanced Configuration

Admins can also fine-tune and optimize the vLLM inference engine by providing "custom environment variables"; an example is shown below. vLLM's environment variable documentation is available here.

vLLM Advanced Settings
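For illustration, custom environment variables are supplied as key/value pairs. The variable names below are examples of documented vLLM settings, but verify names and permitted values against the vLLM documentation linked above before using them.

```python
# Illustrative custom environment variables for the vLLM engine.
# Confirm names and permitted values in the vLLM documentation.
vllm_env = {
    "VLLM_LOGGING_LEVEL": "INFO",            # verbosity of vLLM logs
    "VLLM_ATTENTION_BACKEND": "FLASHINFER",  # choice of attention backend
}
```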


Option 2: NIM

NVIDIA NIM (NVIDIA Inference Microservices) are pre-built, optimized AI containers for deploying large language models (LLMs) and other generative AI models easily and efficiently on NVIDIA-accelerated hardware. They simplify development by providing ready-to-use APIs and optimized engines (such as TensorRT) for faster inference across cloud, data center, and edge devices, essentially serving as "readymade kits" for AI applications such as chatbots or image generation.

  • Select NIM as the underlying engine for Inference.

NIM Settings

  • Specify the number of replicas
  • Specify the number of GPUs
  • Add environment variables (Key + Value); an example is shown after the note below

Info

The latest version of the container image for NIM will be dynamically downloaded from Nvidia's NGC repository. Please ensure your Inference Data Plane compute servers have connectivity to the Internet to pull these artifacts from NGC.
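As an illustration, the key/value environment variables for a NIM deployment typically include the NGC credentials used to pull model artifacts. The variable name below is an assumption on our part; confirm it against Nvidia's NIM documentation.

```python
# Illustrative environment variables for a NIM-based deployment.
# NGC_API_KEY (assumed name) holds the key issued by Nvidia for
# authenticating to NGC; confirm against Nvidia's NIM documentation.
nim_env = {
    "NGC_API_KEY": "<your-ngc-api-key>",
}
```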


Option 3: Dynamo

Coming soon! Support for Nvidia Dynamo as the underlying inference engine is planned for H1-2026.


Rate Limiting

Rate limits are critical to ensure that a single tenant (i.e. a Rafay Org) or a single user does not consume all available resources. Rate limiting is disabled by default; admins need to explicitly enable it and configure the desired behavior.

Rate Limits

For each option, admins need to specify limits for the following:

  • Max tokens per minute
  • Max requests per minute

Info

Max tokens is the sum of both "input" and "output" tokens.

1. By Org (Tenant)

This option ensures that overall capacity can be shared equitably across all "Orgs" (i.e. tenants). The specified values for max tokens/min and max requests/min are totals across all users in each Org under management.

Important

We recommend that all customers enable and configure this option.

2. By User

This option makes sure that overall capacity can be shared equitably across "Users". The specified values for max tokens/min and max requests/min are totals per unique user.

3. By API Key

This option makes sure that overall capacity can be shared equitably across "API Keys". The specified values for max tokens/min and max requests/min are totals per unique API Key.
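Conceptually, each scope (Org, User, or API Key) is tracked against its own per-minute counters, and a request is rejected once either counter would exceed its limit. The sketch below is a simplified illustration of that bookkeeping, not Rafay's implementation; the limit values are placeholders.

```python
import time
from collections import defaultdict

# Simplified fixed-window rate limiter, keyed by Org, User, or API Key.
# "tokens" means input tokens + output tokens for the request, matching
# the "max tokens per minute" definition above. Limits are placeholders.
LIMITS = {"max_tokens_per_min": 100_000, "max_requests_per_min": 600}
windows = defaultdict(lambda: {"minute": 0, "tokens": 0, "requests": 0})

def allow_request(scope_key: str, input_tokens: int, output_tokens: int) -> bool:
    minute = int(time.time() // 60)
    w = windows[scope_key]
    if w["minute"] != minute:                        # new minute: reset counters
        w.update(minute=minute, tokens=0, requests=0)
    tokens = input_tokens + output_tokens            # max tokens = input + output
    if (w["requests"] + 1 > LIMITS["max_requests_per_min"]
            or w["tokens"] + tokens > LIMITS["max_tokens_per_min"]):
        return False                                 # over a per-minute limit
    w["requests"] += 1
    w["tokens"] += tokens
    return True

# Example: admit or reject a request attributed to a specific API key
print(allow_request("api-key-123", input_tokens=1_200, output_tokens=800))
```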


Specify Pricing

New Deployment Currency

  • Select the currency used for billing (default = USD)
  • Specify the cost per "1M" input and output tokens
  • Click on Save once you have specified all the required inputs

Select Currency

Input tokens are the text you send to an LLM, while output tokens are the text the LLM generates back. Output tokens are typically more expensive because they require more computational power to generate one by one, whereas input tokens are processed in a single pass.

It is common for providers to charge more for "output" tokens than for "input" tokens because every request's input can be completely different, so there are generally no optimizations available to drive down processing costs.
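For example, the billed cost of a single request is simply the input and output token counts scaled by their respective per-1M-token rates. The prices in the sketch below are placeholders.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Cost of one request given per-1M-token prices (illustrative rates)."""
    return (input_tokens / 1_000_000) * price_in_per_1m \
         + (output_tokens / 1_000_000) * price_out_per_1m

# Example: 2,000 input tokens at $0.50/1M plus 500 output tokens at $1.50/1M
print(round(request_cost(2_000, 500, 0.50, 1.50), 6))  # 0.00175
```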

Note

Rafay's serverless inferencing solution allows you to charge for input and output tokens at different rates.


Deployment Status

Once the admin initiates the deployment of a model, the Rafay Control plane will attempt to deploy the model to the data plane (i.e. k8s cluster with GPUs). The admin can monitor progress and status of the deployment right in the console.

Deployment Status


View Deployment

To view a deployment, click on the name. You will be presented with the details of the deployment.

View Deployment


Edit Deployment

Click on the "ellipses" under Actions and select "Edit Configuration". Make the updates you require and save.

Edit Deployment


Delete Deployment

Click on the "ellipses" under Actions and select "Delete" to delete the deployment.

Delete Deployment


Share Deployment

Click on the "ellipses" under Actions. Now, click on "Manage Sharing" to initiate a workflow to share the model with All or Select tenant orgs.

  • By default, a newly created model is not shared with any tenant org.
  • Select "All Orgs" to make the model available to all tenant orgs under management
  • Select "Select Orgs" to make the model available to selected tenant orgs.

Share Deployment


Model Metrics

For a given model deployment, model metrics are aggregated and made available to the administrator. The metrics are continuously aggregated but calculated every 60 minutes, so data points are available at 1-hour granularity.

  • Click on a model deployment
  • Click on metrics

Admins can filter and visualize the metrics for a specific time period.

Time to First Token (TTFT)

Time-to-First-Token (TTFT) measures how quickly an LLM begins generating output after receiving a prompt. It reflects initial processing latency, including model loading, prompt encoding, and the start of inference.

Lower TTFT improves responsiveness and user experience, especially for interactive applications like chat, streaming responses, and real-time decision systems.

Inter Token Latency

Inter-token latency measures the time an LLM takes to generate each subsequent token after the first. It reflects the model’s throughput and compute efficiency during streaming output.

Lower inter-token latency enables smoother, more natural real-time responses, improving usability for chat systems, agents, and interactive AI applications.
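To build intuition for how these metrics are derived, the sketch below measures TTFT and average inter-token latency on the client side. It assumes the deployment exposes an OpenAI-compatible streaming endpoint and that the `openai` Python package is installed; the endpoint URL, API key, and model name are placeholders.

```python
import time
from openai import OpenAI  # assumes the openai package and an OpenAI-compatible endpoint

# Placeholders: substitute your deployment's endpoint URL, API key and model name.
client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.chat.completions.create(
    model="llama-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain rate limiting in one paragraph."}],
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = now      # first generated token -> TTFT
        token_times.append(now)

ttft = first_token_at - start
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps) if gaps else 0.0
print(f"TTFT: {ttft:.3f}s, average inter-token latency: {itl:.4f}s")
```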

Model Metrics