Model Deployment
Overview¶
Model Deployments define how a GenAI model is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.
A newly created model has zero deployments by default. One or more deployments can be created for the same model, each with different endpoints or runtime configurations.
Accessing Model Deployments¶
Navigate to Operations Console → GenAI → Model Deployments.
This view lists all model deployments across the system and is used for centralized deployment management and monitoring.
Create a Model Deployment¶
Select New Model Deployment to open the deployment configuration form.
Details¶
- Name: Unique name for the deployment
- Description: Optional reference description
Model and Endpoint¶
- Model: The model being deployed. When accessed from a model page, this field is preselected and not editable.
- Endpoint: The endpoint where the model is exposed for inference
Inference Engine¶
The Inference Engine section defines how the model is executed at runtime and controls performance, scalability, and resource usage for the deployment.
Inference Engine Selection
- vLLM: An open-source, high-performance inference engine suitable for a wide range of models, including models sourced from storage namespaces and public repositories.
- NVIDIA NIM: NVIDIA’s optimized inference engine, designed primarily for models sourced from NVIDIA NGC. This option requires appropriate licensing.
Engine Configuration
When an engine is selected, engine-specific settings are displayed:
- Engine Image: Container image used for inference
- Replicas: Number of inference replicas
- Context Length: Maximum supported context window
- Prompt Caching: Controls prompt caching behavior
Resource Requirements¶
Specify the compute resources allocated to each replica.
- CPU: Number of CPU cores
- Memory: Memory allocation
- GPU: Number of GPUs per replica
Resource values depend on model size and inference engine requirements.
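As a rough sizing guide, the GPU memory needed just to hold the model weights can be estimated from the parameter count and precision. The sketch below is a back-of-the-envelope calculation, not a platform feature; it assumes FP16 weights at 2 bytes per parameter, with additional headroom needed for the KV cache and engine overhead.

```bash
# Back-of-the-envelope GPU memory estimate for FP16 weights.
# Assumptions: 2 bytes per parameter, plus extra headroom for the
# KV cache, activations, and engine overhead. Actual requirements
# depend on the model, context length, and inference engine.
PARAMS_B=7            # model size in billions of parameters (example)
BYTES_PER_PARAM=2     # FP16
WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))
echo "Approx. weight memory: ${WEIGHTS_GB} GiB (plus KV cache and overhead)"
```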
Advanced Settings¶
Advanced settings provide Kubernetes-level controls for fine-tuning deployment behavior.
- Environment Variables: Used to pass runtime configuration values to the inference engine container.
- Labels: Custom key-value pairs for resource organization, filtering, or cost allocation.
- Annotations: Non-identifying metadata used for operational or tooling integrations.
- Tolerations: Allow scheduling on nodes with matching taints.
- Node Selector: Restrict scheduling to nodes with specific labels.
- Pod Affinity: Control pod placement relative to other workloads.
- Extra Engine Arguments: Additional command-line arguments passed directly to the inference engine.
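For illustration, the values below show the kind of environment variables and extra engine arguments a vLLM-based deployment might use. The variable names and flags are examples drawn from vLLM's documented options, not required settings for this platform; verify them against the engine version you deploy.

```bash
# Example environment variables (hypothetical values) for the engine container:
export HF_TOKEN=<huggingface-token>        # model download credentials, if needed
export VLLM_LOGGING_LEVEL=INFO             # vLLM logging verbosity

# Example extra engine arguments for vLLM (confirm against your vLLM version):
#   --max-model-len 8192            # cap the context window
#   --gpu-memory-utilization 0.90   # fraction of GPU memory vLLM may use
#   --tensor-parallel-size 2        # shard the model across 2 GPUs
```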
Rate Limiting¶
The Rate Limiting section defines request and token usage limits for the model deployment. Limits can be configured at multiple levels to control traffic and usage.
- Organization: Applies to all users and API keys within the organization.
  - Max Tokens per Minute: Maximum tokens processed per minute.
  - Max Requests per Minute: Maximum requests allowed per minute.
- User: Applies per individual user and is enforced in addition to organization limits.
  - Max Tokens per Minute: Token usage limit per user.
  - Max Requests per Minute: Request limit per user.
- API Key: Applies per API key and is enforced in addition to organization and user limits.
  - Max Tokens per Minute: Token usage limit per API key.
  - Max Requests per Minute: Request limit per API key.
Rate limiting is configurable from the UI. Backend enforcement is currently under implementation.
Pricing¶
The Pricing section defines the cost model for the deployed model based on token usage.
- Currency: Currency used for pricing.
- Input Tokens: Cost per one million input tokens.
- Output Tokens: Cost per one million output tokens.
These values represent logical pricing configuration for the deployment and are used for cost visibility and tracking.
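To make the cost model concrete, the sketch below estimates the cost of a single request from its token usage, using placeholder prices of 0.50 per million input tokens and 1.50 per million output tokens (example values, not defaults).

```bash
# Estimate the cost of one request from its token usage.
# Prices are placeholder values; use the ones configured for your deployment.
INPUT_PRICE_PER_M=0.50     # cost per 1M input tokens
OUTPUT_PRICE_PER_M=1.50    # cost per 1M output tokens
PROMPT_TOKENS=1200
COMPLETION_TOKENS=300

COST=$(awk -v ip="$INPUT_PRICE_PER_M" -v op="$OUTPUT_PRICE_PER_M" \
           -v pt="$PROMPT_TOKENS" -v ct="$COMPLETION_TOKENS" \
           'BEGIN { printf "%.6f", (pt/1e6)*ip + (ct/1e6)*op }')
echo "Estimated request cost: ${COST}"
```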
Save and Deploy¶
Select Save Changes to start the deployment process.
After saving:
- The deployment is created and associated with the selected model and endpoint
- The inference engine image is pulled in the background
- Model files are loaded into GPU memory
- Inference pods are started on the compute cluster
When the deployment reaches the Running state, it is ready to serve inference requests.
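If you have kubectl access to the underlying compute cluster, you can watch the inference pods start while the deployment moves toward the Running state. The namespace and label selector below are hypothetical; actual values depend on how your cluster is organized.

```bash
# Watch inference pods come up (namespace and label are hypothetical examples).
kubectl get pods -n genai-inference -l app=<deployment-name> -w

# Inspect events if a pod stays in Pending or CrashLoopBackOff,
# e.g. to spot unmet tolerations, node selectors, or GPU requests.
kubectl describe pod <pod-name> -n genai-inference
```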
Share Model Deployment¶
From the Model Deployments page, open a deployment and select Manage Sharing.
Sharing options include:
- None
- All Organizations
- Specific Organizations
A single deployment can be shared with multiple organizations.
Accessing a Deployed Model from Developer Hub¶
- Navigate to Developer Hub → GenAI → Models
- Select the model and open its deployments
- Select a deployment to view API usage details
This page displays the inference endpoint, sample API requests, and deployment metadata.
API Key Usage¶
An API key is required to invoke a deployed model.
Create an API key from Developer Hub → API Keys and export it as an environment variable:
```bash
export API_KEY=<your-api-key>
```
The API key is passed in the Authorization header when sending inference requests.
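As a minimal example, the request below calls the deployment through an OpenAI-compatible chat completions route, which both vLLM and NVIDIA NIM expose. The base URL and model name are placeholders; use the inference endpoint and sample request shown for your deployment in Developer Hub.

```bash
# Minimal inference request (endpoint URL and model name are placeholders).
curl -s https://<inference-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model-name>",
        "messages": [
          {"role": "user", "content": "Summarize the benefits of prompt caching."}
        ],
        "max_tokens": 128
      }'
```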
Inference Response¶
The response contains:
- The generated output from the deployed model
- Token usage details including prompt tokens, completion tokens, and total tokens
These token values are used for usage tracking, rate limiting, and pricing calculations.
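Assuming the OpenAI-compatible response format used by vLLM and NIM, the generated text and usage fields can be pulled out of the response with jq for quick inspection or downstream cost tracking:

```bash
# Extract generated text and token usage from an OpenAI-style response.
# Requires jq; field names follow the OpenAI-compatible schema used by
# vLLM and NVIDIA NIM, and may differ for other response formats.
RESPONSE=$(curl -s https://<inference-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}]}')

echo "$RESPONSE" | jq -r '.choices[0].message.content'
echo "$RESPONSE" | jq '{prompt: .usage.prompt_tokens,
                        completion: .usage.completion_tokens,
                        total: .usage.total_tokens}'
```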
Inference Availability¶
When the model deployment is in the Running state, the endpoint is ready to serve inference requests. A successful response confirms that the deployment is active and functioning correctly.