Deployments
Model Deployments are "running instances" of an already configured model. When a new model is created and configured, it has zero active model deployments by default. For example, in the image below, the Facebook OPT 125m model has no active model deployments.
Administrators can deploy and operate multiple model deployments for a given model. In the image below, for the "llama-8b-instruct" model, there is one active model deployment.
New Deployment¶
Click on "Deploy" to start a new model deployment.
- Provide a name (unique in your environment) and an optional description
- The "Model" field is auto-populated because the deployment is tied to a specific model
- From the dropdown list, select the endpoint that will service requests to this model deployment
Select Inference Engine¶
In this step, the admin selects their preferred inference engine. Three options are currently supported:
- vLLM
- NIM
- Nvidia Dynamo (coming soon)
Info
The default engine selection is vLLM. The use of NIM requires a license and keys from Nvidia. Please work with your Nvidia team for this.
vLLM-Inference Engine¶
Follow the steps below if you selected vLLM as the inference engine.
Important
The vLLM container image is extremely large (~10-25 GB). Admins are strongly encouraged to download the vLLM container image and host it in a local container registry. This ensures sovereignty, security and performance.
- Specify the path for the vLLM container image and tag
- Specify the number of replicas (default is 1)
- Specify the size of the volume for each replica
Resource Requests & Limits
Update the default resource requests/limits (CPU, Memory and GPU) if required
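As a rough illustration only (the field names below are hypothetical and do not reflect Rafay's actual API or schema), the form inputs above map to a deployment spec along these lines:

```python
# Illustrative sketch of the vLLM deployment inputs described above.
# Field names and values are examples, not Rafay's actual schema.
vllm_deployment = {
    "image": "registry.example.com/vllm/vllm-openai:v0.6.3",  # locally hosted image + tag
    "replicas": 1,                  # default is 1
    "volume_size_gb": 100,          # per-replica volume for model weights / cache
    "resources": {
        "requests": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        "limits":   {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
    },
}
```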
Auto Scaling
Enable auto scaling of replicas if required. Once this is enabled, the admin will be presented with the configuration details for auto scaling.
- Specify the minimum number of replicas. This is the base capacity of the service
- Specify the maximum number of replicas. This is the upper limit for the service
- Specify the metrics that will be used to trigger auto scaling events
You can select either CPU or memory utilization as the metric and specify a utilization threshold. Once the threshold is breached, an auto scaling event is triggered.
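For intuition, the scaling decision can be thought of along the lines of the standard Kubernetes HPA formula. The sketch below is illustrative only and is not Rafay-specific code:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,   # e.g. 0.90 = 90% CPU or memory
                     target_utilization: float,    # configured threshold, e.g. 0.60
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified utilization-based scaling decision (HPA-style)."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the configured floor (base capacity) and ceiling (upper limit).
    return max(min_replicas, min(desired, max_replicas))

# Example: 2 replicas at 90% utilization with a 60% target scales to 3 replicas.
print(desired_replicas(2, 0.90, 0.60, min_replicas=1, max_replicas=5))  # -> 3
```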
Auto Scaling Behavior
Admins are also provided with access to advanced configurations for auto scaling behavior.
- Scale Down
Stabilization window in seconds for scale down events and associated policies
- Scale Up
Stabilization window in seconds for scale up events and associated policies
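The stabilization window dampens flapping by looking at recent recommendations rather than acting on the latest one alone. The sketch below mirrors HPA-style behavior and is shown for intuition only, not as Rafay's implementation:

```python
import time
from collections import deque

def stabilized_replicas(history: deque, new_recommendation: int,
                        window_seconds: int, scale_down: bool) -> int:
    """Dampen scaling decisions over a stabilization window (HPA-style sketch).

    history holds (timestamp, recommended_replicas) tuples.
    """
    now = time.time()
    history.append((now, new_recommendation))
    # Drop recommendations that fall outside the stabilization window.
    while history and history[0][0] < now - window_seconds:
        history.popleft()
    recent = [replicas for _, replicas in history]
    # Scale down: act on the highest recent recommendation (shrink cautiously).
    # Scale up: act on the lowest recent recommendation (grow cautiously).
    return max(recent) if scale_down else min(recent)
```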
Advanced Configuration
Admins can also fine-tune/optimize the vLLM inference engine by providing custom environment variables. vLLM's environment variable documentation is available here.
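For instance, custom environment variables could look like the following. The names and values shown are examples; verify them against the vLLM environment variable documentation for the vLLM version you are deploying:

```python
# Example key/value pairs only -- confirm names and accepted values against
# the vLLM documentation for your vLLM version.
custom_env_vars = {
    "VLLM_LOGGING_LEVEL": "DEBUG",           # more verbose engine logs
    "VLLM_ATTENTION_BACKEND": "FLASHINFER",  # pin a specific attention backend
    "HF_HOME": "/models/hf-cache",           # relocate the Hugging Face cache to the volume
}
```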
NIM-Inference Engine¶
- Specify number of replicas
- Specify number of GPUs
- Add environment variables (Key + Value)
Info
The latest version of the container image for NIM is downloaded from Nvidia's NGC repository.
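As an illustration (field names are hypothetical, not Rafay's schema), the NIM deployment inputs might look like the following. An NGC API key from your Nvidia account is typically supplied as an environment variable; confirm the exact variable name and value with your Nvidia team:

```python
# Illustrative NIM deployment inputs; field names are hypothetical.
nim_deployment = {
    "replicas": 2,
    "gpus_per_replica": 1,
    "env": {
        "NGC_API_KEY": "<your-ngc-api-key>",  # used to pull NIM images/models from NGC
    },
}
```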
Rate Limiting¶
Rate limits are critical to ensure that a single user does not overwhelm the shared platform and consume all the available resources. Admins should specify the following parameters for rate limiting.
- Max tokens per minute
- Max requests per minute
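Conceptually, both budgets are enforced per user over a rolling one-minute window. The sketch below illustrates that idea only and is not Rafay's actual implementation:

```python
import time
from collections import defaultdict, deque

class PerUserRateLimiter:
    """Conceptual sliding-window limiter for requests/min and tokens/min."""

    def __init__(self, max_requests_per_min: int, max_tokens_per_min: int):
        self.max_requests = max_requests_per_min
        self.max_tokens = max_tokens_per_min
        self.events = defaultdict(deque)  # user -> deque of (timestamp, tokens)

    def allow(self, user: str, tokens: int) -> bool:
        now = time.time()
        window = self.events[user]
        # Evict events older than 60 seconds.
        while window and window[0][0] < now - 60:
            window.popleft()
        requests_used = len(window)
        tokens_used = sum(t for _, t in window)
        if requests_used + 1 > self.max_requests or tokens_used + tokens > self.max_tokens:
            return False  # reject: over the per-minute budget
        window.append((now, tokens))
        return True

limiter = PerUserRateLimiter(max_requests_per_min=60, max_tokens_per_min=100_000)
print(limiter.allow("alice", tokens=1_200))  # -> True while within budget
```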
Specify Pricing¶
- Select the currency used for billing (default = USD)
- Specify the cost per "1M" input and output tokens
- Click on Save once you have specified all the required inputs
Input tokens are the text you send to an LLM, while output tokens are the text the LLM generates back. Output tokens are typically more expensive because they must be generated sequentially, one at a time, whereas input tokens are processed together in a single pass. It is therefore common for providers to charge more per output token than per input token.
Note
Rafay's serverless inferencing solution allows you to charge different rates for input and output tokens.
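As a sketch of how per-1M-token pricing translates into a per-request cost (the rates used here are made-up examples):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_per_1m_input: float, price_per_1m_output: float) -> float:
    """Charge input and output tokens at their own per-1M-token rates."""
    return (input_tokens / 1_000_000) * price_per_1m_input \
         + (output_tokens / 1_000_000) * price_per_1m_output

# Example rates (hypothetical): $0.50 per 1M input tokens, $1.50 per 1M output tokens.
# A request with 12,000 input tokens and 800 output tokens costs:
print(round(request_cost(12_000, 800, 0.50, 1.50), 6))  # -> 0.0072
```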
View Deployment¶
To view a deployment, click on the name. You will be presented with the details of the deployment.
Edit Deployment¶
Click on the "ellipses" under Actions and select "Edit Configuration". Make the updates you require and save.
Delete Deployment¶
Click on the "ellipses" under Actions and select "Delete" to delete the deployment.
Share Deployment¶
Click on the "ellipses" under Actions. Now, click on "Manage Sharing" to initiate a workflow to share the model with All or Select tenant orgs.
- By default, a newly created model is not shared with any tenant org.
- Select "All Orgs" to make the model available to all tenant orgs under management
- Select "Select Orgs" to make the model available to selected tenant orgs.
Model Metrics¶
For a given model deployment, model metrics are aggregated and made available to the administrator. The metrics are collected continuously but aggregated every 60 minutes, so data points are available at 1-hour granularity.
- Click on a model deployment
- Click on metrics
Admins can filter and visualize the metrics for a specific time period.
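To make the 60-minute aggregation concrete, the sketch below buckets raw metric samples into hourly data points. It is an illustration only, not how Rafay stores metrics:

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_buckets(samples):
    """Group (unix_timestamp, value) samples into 1-hour buckets and average them."""
    buckets = defaultdict(list)
    for ts, value in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    # Average per hour; other aggregations (p95, max) work the same way.
    return {hour: sum(values) / len(values) for hour, values in buckets.items()}
```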
Time to First Token (TTFT)¶
Time-to-First-Token (TTFT) measures how quickly an LLM begins generating output after receiving a prompt. It reflects initial processing latency, including model loading, prompt encoding, and the start of inference.
Lower TTFT improves responsiveness and user experience, especially for interactive applications like chat, streaming responses, and real-time decision systems.
Inter Token Latency¶
Inter-token latency measures the time an LLM takes to generate each subsequent token after the first. It reflects the model’s throughput and compute efficiency during streaming output.
Lower inter-token latency enables smoother, more natural real-time responses, improving usability for chat systems, agents, and interactive AI applications.
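Both metrics can also be approximated from the client side, assuming the deployment exposes an OpenAI-compatible streaming endpoint (as vLLM does). The base URL, API key, and model name below are placeholders, and chunk-level timing only approximates true per-token latency:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholders: substitute your deployment's endpoint URL, API key, and model name.
client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

start = time.perf_counter()
first_token_at = None
chunk_times = []

stream = client.chat.completions.create(
    model="llama-8b-instruct",  # example model name from above
    messages=[{"role": "user", "content": "Summarize what TTFT measures."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # first streamed content -> TTFT
        chunk_times.append(now)

ttft = first_token_at - start
gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
itl = sum(gaps) / len(gaps) if gaps else 0.0
print(f"TTFT: {ttft * 1000:.1f} ms, mean inter-token latency: {itl * 1000:.1f} ms")
```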