Model Deployment
Overview¶
Model Deployments define how a GenAI model is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.
A newly created model has zero deployments by default. One or more deployments can be created for the same model, each with different endpoints or runtime configurations.
Accessing Model Deployments¶
Navigate to Operations Console → GenAI → Model Deployments.
This view lists all model deployments across the system and is used for centralized deployment management and monitoring.
Create a Model Deployment¶
Select New Model Deployment to open the deployment configuration form.
Details¶
- Name: Unique name for the deployment
- Description: Optional reference description
Model and Endpoint¶
- Model: The model being deployed. When accessed from a model page, this field is preselected and not editable.
- Endpoint: The endpoint where the model is exposed for inference
Inference Engine¶
The Inference Engine section defines how the model is executed at runtime and controls performance, scalability, and resource usage for the deployment.
Inference Engine Selection
- vLLM: An open-source, high-performance inference engine suitable for a wide range of models, including models sourced from storage namespaces and public repositories.
- NVIDIA NIM: NVIDIA’s optimized inference engine, designed primarily for models sourced from NVIDIA NGC. This option requires appropriate licensing.
Engine Configuration
When an engine is selected, engine-specific settings are displayed:
- Engine Image: Container image used for inference
- Replicas: Number of inference replicas
- Context Length: Maximum supported context window
- Prompt Caching: Controls prompt caching behavior
Resource Requirements¶
Specify the compute resources allocated to each replica.
- CPU: Number of CPU cores
- Memory: Memory allocation
- GPU: Number of GPUs per replica
Resource values depend on model size and inference engine requirements.
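As a rough sizing guide, the GPU memory needed just to hold the model weights can be estimated from the parameter count and precision. The sketch below is a back-of-the-envelope calculation, not a platform feature; it assumes FP16 weights at 2 bytes per parameter, with additional headroom needed for the KV cache and engine overhead.

```bash
# Back-of-the-envelope GPU memory estimate for FP16 weights.
# Assumptions: 2 bytes per parameter, plus extra headroom for the
# KV cache, activations, and engine overhead. Actual requirements
# depend on the model, context length, and inference engine.
PARAMS_B=7            # model size in billions of parameters (example)
BYTES_PER_PARAM=2     # FP16
WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))
echo "Approx. weight memory: ${WEIGHTS_GB} GiB (plus KV cache and overhead)"
```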
Advanced Settings¶
Advanced settings provide Kubernetes-level controls for fine-tuning deployment behavior.
- Environment Variables: Used to pass runtime configuration values to the inference engine container.
- Labels: Custom key-value pairs for resource organization, filtering, or cost allocation.
- Annotations: Non-identifying metadata used for operational or tooling integrations.
- Tolerations: Allow scheduling on nodes with matching taints.
- Node Selector: Restrict scheduling to nodes with specific labels.
- Pod Affinity: Control pod placement relative to other workloads.
- Extra Engine Arguments: Additional command-line arguments passed directly to the inference engine.
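For illustration, the values below show the kind of environment variables and extra engine arguments a vLLM-based deployment might use. The variable names and flags are examples drawn from vLLM's documented options, not required settings for this platform; verify them against the engine version you deploy.

```bash
# Example environment variables (hypothetical values) for the engine container:
export HF_TOKEN=<huggingface-token>        # model download credentials, if needed
export VLLM_LOGGING_LEVEL=INFO             # vLLM logging verbosity

# Example extra engine arguments for vLLM (confirm against your vLLM version):
#   --max-model-len 8192            # cap the context window
#   --gpu-memory-utilization 0.90   # fraction of GPU memory vLLM may use
#   --tensor-parallel-size 2        # shard the model across 2 GPUs
```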
Rate Limiting¶
The Rate Limiting section defines request and token usage limits for the model deployment. Limits can be configured at multiple levels to control traffic and usage.
- Organization: Applies to all users and API keys within the organization.
  - Max Tokens per Minute: Maximum tokens processed per minute.
  - Max Requests per Minute: Maximum requests allowed per minute.
- User: Applies per individual user and is enforced in addition to organization limits.
  - Max Tokens per Minute: Token usage limit per user.
  - Max Requests per Minute: Request limit per user.
- API Key: Applies per API key and is enforced in addition to organization and user limits.
  - Max Tokens per Minute: Token usage limit per API key.
  - Max Requests per Minute: Request limit per API key.
Rate limiting is configurable from the UI. Backend enforcement is currently under implementation.
Pricing¶
The Pricing section defines the cost model for the deployed model based on token usage.
- Currency: Currency used for pricing.
- Input Tokens: Cost per one million input tokens.
- Output Tokens: Cost per one million output tokens.
These values represent logical pricing configuration for the deployment and are used for cost visibility and tracking.
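To make the cost model concrete, the sketch below estimates the cost of a single request from its token usage, using placeholder prices of 0.50 per million input tokens and 1.50 per million output tokens (example values, not defaults).

```bash
# Estimate the cost of one request from its token usage.
# Prices are placeholder values; use the ones configured for your deployment.
INPUT_PRICE_PER_M=0.50     # cost per 1M input tokens
OUTPUT_PRICE_PER_M=1.50    # cost per 1M output tokens
PROMPT_TOKENS=1200
COMPLETION_TOKENS=300

COST=$(awk -v ip="$INPUT_PRICE_PER_M" -v op="$OUTPUT_PRICE_PER_M" \
           -v pt="$PROMPT_TOKENS" -v ct="$COMPLETION_TOKENS" \
           'BEGIN { printf "%.6f", (pt/1e6)*ip + (ct/1e6)*op }')
echo "Estimated request cost: ${COST}"
```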
Save and Deploy¶
Select Save Changes to start the deployment process.
After saving:
- The deployment is created and associated with the selected model and endpoint
- The inference engine image is pulled in the background
- Model files are loaded into GPU memory
- Inference pods are started on the compute cluster
When the deployment reaches the Running state, it is ready to serve inference requests.
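If you have kubectl access to the underlying compute cluster, you can watch the inference pods start while the deployment moves toward the Running state. The namespace and label selector below are hypothetical; actual values depend on how your cluster is organized.

```bash
# Watch inference pods come up (namespace and label are hypothetical examples).
kubectl get pods -n genai-inference -l app=<deployment-name> -w

# Inspect events if a pod stays in Pending or CrashLoopBackOff,
# e.g. to spot unmet tolerations, node selectors, or GPU requests.
kubectl describe pod <pod-name> -n genai-inference
```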
Share Model Deployment¶
From the Model Deployments page, open a deployment and select Manage Sharing.
Sharing options include:
- None
- All Organizations
- Specific Organizations
A single deployment can be shared with multiple organizations.
Accessing a Deployed Model from Developer Hub¶
- Navigate to Developer Hub → GenAI → Models
- Select the model and open its deployments
- Select a deployment to view API usage details
This page displays the inference endpoint, sample API requests, and deployment metadata.
API Key Usage¶
An API key is required to invoke a deployed model.
Create an API key from Developer Hub → API Keys and export it as an environment variable:
```bash
export API_KEY=<your-api-key>
```
The API key is passed in the Authorization header when sending inference requests.
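As a minimal example, the request below calls the deployment through an OpenAI-compatible chat completions route, which both vLLM and NVIDIA NIM expose. The base URL and model name are placeholders; use the inference endpoint and sample request shown for your deployment in Developer Hub.

```bash
# Minimal inference request (endpoint URL and model name are placeholders).
curl -s https://<inference-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model-name>",
        "messages": [
          {"role": "user", "content": "Summarize the benefits of prompt caching."}
        ],
        "max_tokens": 128
      }'
```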
Inference Response¶
The response contains:
- The generated output from the deployed model
- Token usage details including prompt tokens, completion tokens, and total tokens
These token values are used for usage tracking, rate limiting, and pricing calculations.
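Assuming the OpenAI-compatible response format used by vLLM and NIM, the generated text and usage fields can be pulled out of the response with jq for quick inspection or downstream cost tracking:

```bash
# Extract generated text and token usage from an OpenAI-style response.
# Requires jq; field names follow the OpenAI-compatible schema used by
# vLLM and NVIDIA NIM, and may differ for other response formats.
RESPONSE=$(curl -s https://<inference-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}]}')

echo "$RESPONSE" | jq -r '.choices[0].message.content'
echo "$RESPONSE" | jq '{prompt: .usage.prompt_tokens,
                        completion: .usage.completion_tokens,
                        total: .usage.total_tokens}'
```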
Inference Availability¶
When the model deployment is in the Running state, the endpoint is ready to serve inference requests. A successful response confirms that the deployment is active and functioning correctly.