Model Deployment

Overview

Model Deployments define how a GenAI model is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.

A newly created model has no deployments by default. One or more deployments can be created for the same model, each with a different endpoint or runtime configuration.


Accessing Model Deployments

Navigate to Operations Console → GenAI → Model Deployments.

This view lists all model deployments across the system and is used for centralized deployment management and monitoring.



Create a Model Deployment

Select New Model Deployment to open the deployment configuration form.

Details

  • Name: Unique name for the deployment
  • Description: Optional reference description

Model and Endpoint

  • Model: The model being deployed
    When accessed from a model page, this field is preselected and not editable.
  • Endpoint: The endpoint where the model is exposed for inference

Inference Engine

The Inference Engine section defines how the model is executed at runtime and controls performance, scalability, and resource usage for the deployment.

Inference Engine Selection

  • vLLM
    An open-source, high-performance inference engine suitable for a wide range of models, including models sourced from storage namespaces and public repositories.

  • NVIDIA NIM
    NVIDIA’s optimized inference engine designed primarily for models sourced from NVIDIA NGC. This option requires appropriate licensing.

Engine Configuration

When an engine is selected, engine-specific settings are displayed:

  • Engine Image: Container image used for inference
  • Replicas: Number of inference replicas
  • Context Length: Maximum supported context window
  • Prompt Caching: Controls prompt caching behavior
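
For orientation, the sketch below shows how these fields typically map onto a vLLM server launched inside each replica. This is an illustration only: the exact command and flags the platform passes depend on the selected engine image, and Replicas is not an engine flag at all; it controls how many such pods run.

```bash
# Illustrative mapping of Engine Configuration fields to vLLM server options
# (example values; the platform's actual invocation may differ):
#   Context Length -> --max-model-len          maximum context window, in tokens
#   Prompt Caching -> --enable-prefix-caching  reuse KV-cache across repeated prompt prefixes
vllm serve <model-name> --max-model-len 8192 --enable-prefix-caching
```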


Resource Requirements

Specify the compute resources allocated to each replica.


  • CPU: Number of CPU cores
  • Memory: Memory allocation
  • GPU: Number of GPUs per replica

Resource values depend on model size and inference engine requirements.
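
As a rough, illustrative rule of thumb, a 7-billion-parameter model served in 16-bit precision needs about 7B × 2 bytes ≈ 14 GB of GPU memory for its weights alone, before KV-cache and engine overhead, so per-replica GPU capacity should be sized with comfortable headroom above that figure.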

Advanced Settings

Advanced settings provide Kubernetes-level controls for fine-tuning deployment behavior.

  • Environment Variables: Used to pass runtime configuration values to the inference engine container.
  • Labels: Custom key-value pairs for resource organization, filtering, or cost allocation.
  • Annotations: Non-identifying metadata used for operational or tooling integrations.
  • Tolerations: Allow scheduling on nodes with matching taints.
  • Node Selector: Restrict scheduling to nodes with specific labels.
  • Pod Affinity: Control pod placement relative to other workloads.
  • Extra Engine Arguments: Additional command-line arguments passed directly to the inference engine.



Rate Limiting

The Rate Limiting section defines request and token usage limits for the model deployment. Limits can be configured at multiple levels to control traffic and usage.

  • Organization: Applies to all users and API keys within the organization.

    • Max Tokens per Minute: Maximum tokens processed per minute.
    • Max Requests per Minute: Maximum requests allowed per minute.
  • User: Applies per individual user and is enforced in addition to organization limits.

    • Max Tokens per Minute: Token usage limit per user.
    • Max Requests per Minute: Request limit per user.
  • API Key: Applies per API key and is enforced in addition to organization and user limits.

    • Max Tokens per Minute: Token usage limit per API key.
    • Max Requests per Minute: Request limit per API key.
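
For example, if the organization limit is 60,000 tokens per minute and the per-user limit is 10,000 tokens per minute, an individual user is throttled at 10,000 tokens per minute even while organization-wide usage remains below its cap; a request must satisfy every applicable level to be accepted.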

Rate limiting is configurable from the UI. Backend enforcement is currently under implementation.


Pricing

The Pricing section defines the cost model for the deployed model based on token usage.

  • Currency: Currency used for pricing.
  • Input Tokens: Cost per one million input tokens.
  • Output Tokens: Cost per one million output tokens.

These values represent logical pricing configuration for the deployment and are used for cost visibility and tracking.
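
As a worked example, with illustrative prices of $0.50 per one million input tokens and $1.50 per one million output tokens, a request that consumes 2,000 prompt tokens and 500 completion tokens is tracked at 2,000/1,000,000 × $0.50 + 500/1,000,000 × $1.50 ≈ $0.00175.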



Save and Deploy

Select Save Changes to start the deployment process.

After saving:

  • The deployment is created and associated with the selected model and endpoint
  • The inference engine image is pulled in the background
  • Model files are loaded into GPU memory
  • Inference pods are started on the compute cluster

When the deployment reaches the Running state, it is ready to serve inference requests.



Share Model Deployment

From the Model Deployments page, open a deployment and select Manage Sharing.

Sharing options include:

  • None
  • All Organizations
  • Specific Organizations

A single deployment can be shared with multiple organizations.


Accessing a Deployed Model from Developer Hub

  1. Navigate to Developer Hub → GenAI → Models
  2. Select the model and open its deployments
  3. Select a deployment to view API usage details

This page displays the inference endpoint, sample API requests, and deployment metadata.


API Key Usage

An API key is required to invoke a deployed model.

Create an API key from Developer Hub → API Keys and export it as an environment variable:

```bash
export API_KEY=
```

The API key is passed in the Authorization header when sending inference requests.
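
A minimal request sketch is shown below. The endpoint URL and model name are placeholders: copy the actual values from the deployment's API usage page in Developer Hub. The request body follows the OpenAI-compatible chat completions format that both vLLM and NVIDIA NIM expose.

```bash
# Illustrative request; replace the placeholder endpoint and model name with the
# values shown on the deployment's API usage page.
curl -s https://<inference-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<deployed-model-name>",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }' \
  -o response.json
```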

Inference Response

The response contains:

  • The generated output from the deployed model
  • Token usage details including prompt tokens, completion tokens, and total tokens

These token values are used for usage tracking, rate limiting, and pricing calculations.
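
Assuming the response above was saved to response.json, the usage block can be inspected directly; the field names below follow the OpenAI-compatible schema returned by vLLM and NIM.

```bash
# Print the token usage object from the saved response.
jq '.usage' response.json
# Typical shape:
# { "prompt_tokens": 12, "completion_tokens": 40, "total_tokens": 52 }
```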


Inference Availability

When the model deployment is in the Running state, the endpoint is ready to serve inference requests. A successful response confirms that the deployment is active and functioning correctly.