
Configuration

This Inference Template enables the deployment of an inference service on a GPU-enabled Kubernetes cluster, offering a scalable and configurable platform for serving popular large language models (LLMs) such as Llama 3 and DeepSeek variants.

The template provisions essential components such as model containers, GPU resource configurations, and network access policies within a Kubernetes namespace. It supports flexible model definitions using input parameters like GPU type, model name, and image version, enabling customization or extension as needed.

As part of the output, users receive an inference endpoint (URL) secured by the platform’s networking layer, enabling integration with downstream applications or testing tools.
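
As a quick way to exercise that endpoint, the following sketch sends a chat request with Python's `requests` library. It assumes the serving image exposes an OpenAI-compatible `/v1/chat/completions` route (common for vLLM-style containers, but not guaranteed by the template), and the URL and model name are placeholders; any authentication enforced by the platform's networking layer is omitted.

```python
import requests

# Placeholders: substitute the endpoint URL returned by the template
# and the model name you selected at launch time.
ENDPOINT = "https://inference-demo.example.com"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

# Assumes an OpenAI-compatible chat completions route; adjust the path
# if your serving image exposes a different API.
resp = requests.post(f"{ENDPOINT}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```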

For detailed steps to get started, refer to the Inference Template Get Started Guide.

Initial Setup

The platform administrator configures the inference template and shares it with an appropriate project. This ensures that the end user has access to a pre-configured setup with the necessary controls, GPU types, and base configurations for inference.

sequenceDiagram
    participant Admin as Platform Admin
    participant Catalog as System Catalog
    participant Project as Inference Project

    Admin->>Catalog: Selects Inference Template
    Admin->>Project: Shares Template with Predefined Inputs and Controls
    Project-->>Admin: Template Available for Deployment

End User Flow

The end user launches the inference template and provides required inputs (e.g., model name, image version, GPU type, and Hugging Face token) to deploy the inference service on a GPU-enabled Kubernetes cluster.

sequenceDiagram
    participant User as End User
    participant Project as Inference Project
    participant Cluster as GPU-enabled K8s Cluster

    User->>Project: Launches Template
    User->>Project: Provides Model Name, GPU Type, Image Version, HF Token
    User->>Project: Clicks "Deploy"
    Project->>Cluster: Provisions Inference Service with Selected Model
    Cluster-->>User: Inference Endpoint Ready
    Cluster-->>User: Kubeconfig and Endpoint Provided as Output
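
Once the kubeconfig and endpoint are returned as outputs (last two steps above), the deployment can be checked from the command line or with a short script. Below is a minimal sketch using the `kubernetes` Python client; the kubeconfig path and namespace are placeholders and must match the values from your launch.

```python
from kubernetes import client, config

# Placeholders: point these at the kubeconfig returned as a template output
# and at the namespace you supplied when launching the template.
KUBECONFIG_PATH = "./inference-kubeconfig.yaml"
NAMESPACE = "inference-demo"

config.load_kube_config(config_file=KUBECONFIG_PATH)
v1 = client.CoreV1Api()

# List the pods backing the inference service and report their phase.
for pod in v1.list_namespaced_pod(namespace=NAMESPACE).items:
    ready = all(cs.ready for cs in (pod.status.container_statuses or []))
    print(f"{pod.metadata.name}: phase={pod.status.phase} ready={ready}")
```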

The template is designed to support both:

  • Day 0 operations: Initial setup
  • Day 2 operations: Ongoing management

Resources

An Inference Service provisioned on a Kubernetes cluster, optimized for serving machine learning models at scale.

Pre-Requisites

  • Host Cluster: Ensure that a Kubernetes host cluster is available and ready for inference service deployment.
  • Agent Configuration: Configure agents through Global Settings or during cluster provisioning.

Configuration

At template launch, provide the required configuration values as exposed by the Platform Admin. These may include:

  • Inference Service Configuration:
    • Host Cluster Name: Select the host Kubernetes cluster for deploying the inference service.
    • Template Parameters: Provide additional parameters as required (for example, model selection, resource limits, or GPU type); the full list is documented under Input Variables below.

After entering the required information, click Deploy to initiate the inference service provisioning.


Input Variables for Inference Template

General Configuration

| Name | Value Type | Description |
| --- | --- | --- |
| Namespace | Text | Namespace where the inference service will be deployed |
| Action | Text | Operation to be performed on the inference service (e.g., start, stop, delete) |
| SKU Type | Text | Type of SKU used for the inference workload |
| Timeout Seconds | Number | Timeout in seconds for inference execution |
| Log Success Pattern | Text | Pattern in logs that indicates successful model startup or readiness |
| KeyAlpha | Text | Custom key parameter (user-defined) |
| KeyBeta | Text | Custom key parameter (user-defined) |
| KeyGamma | Text | Custom key parameter (user-defined) |
| KeyX | Text | Custom key parameter (user-defined) |
| KeyY | Text | Custom key parameter (user-defined) |
| KeyZ | Text | Custom key parameter (user-defined) |
| Device Details | Text | Device or hardware details allocated to the inference service |
| Host Cluster Name | Text | Name of the host Kubernetes cluster where inference runs |
| Kubeconfig | Text | Kubernetes configuration data for accessing the cluster |
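
As a rough illustration, the values below show how some of these general parameters might be populated. Every value is a placeholder; which fields are exposed, and their defaults, are controlled by the Platform Admin.

```python
# Illustrative General Configuration inputs (placeholders only).
general_config = {
    "namespace": "llm-inference",
    "action": "start",                 # e.g., start, stop, delete
    "sku_type": "gpu-small",
    "timeout_seconds": 900,
    "log_success_pattern": "Application startup complete",
    "host_cluster_name": "gpu-cluster-01",
}
```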

Access & Security

| Name | Value Type | Description |
| --- | --- | --- |
| Enable SSH Access | Boolean | Flag to enable SSH access to the inference pod |
| Autogenerate SSH Key | Boolean | Flag to automatically generate an SSH key for inference access |
| Public Key | Text | Public SSH key for accessing the inference pod |
| Enable Web Access | Boolean | Flag to enable browser-based access to the inference service |
| SSL Certificate Public Key | Text | Public SSL certificate for securing inference endpoints |
| SSL Certificate Private Key | Text | Private SSL key for inference endpoint authentication |
| Hostname Suffix | Text | Hostname suffix for accessing the inference service |
| Container Port | Number | Port exposed by the inference container |
| Ingress Annotations | Map | Ingress annotations for exposing inference endpoints |
| Ingress Class Name | Text | Ingress class used for routing inference traffic |
| Security Context | Map | Security context for running the inference pod |
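
The sketch below shows one way these access and security fields might be filled in. The annotation keys assume an NGINX ingress controller on the host cluster; the hostname, port, and security-context values are illustrative, not template defaults.

```python
# Illustrative Access & Security values (placeholders only).
access_security = {
    "enable_web_access": True,
    "enable_ssh_access": False,
    "hostname_suffix": "inference.example.com",
    "container_port": 8000,
    "ingress_class_name": "nginx",
    "ingress_annotations": {
        # NGINX ingress controller assumed; adjust for your controller.
        "nginx.ingress.kubernetes.io/proxy-body-size": "0",
        "nginx.ingress.kubernetes.io/proxy-read-timeout": "3600",
    },
    "security_context": {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "allowPrivilegeEscalation": False,
    },
}
```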

Resource & Runtime Settings

| Name | Value Type | Description |
| --- | --- | --- |
| CPU | Number | CPU resources requested for the inference pod |
| Memory | Number | Memory resources requested for the inference pod |
| GPU Count | Number | Number of GPUs allocated to the inference pod |
| GPU Type | Text | Type of GPU allocated to the inference workload |
| Node Type | Text | Type of node used for scheduling the inference pod |
| Pod Image | Text | Container image used for the inference service |
| Commands | List | Commands executed when starting the inference container |
| Arguments | List | Arguments passed to the inference container process |
| Env Vars | Map | Environment variables for the inference runtime |
| Is Private Registry | Boolean | Indicates if the inference image is from a private registry |
| Registry Server | Text | Address of the container registry for the inference image |
| Registry Username | Text | Username for accessing the container registry |
| Registry Password | Text | Password for registry authentication |
| Registry Email | Text | Email associated with the container registry account |
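
For orientation, the values below sketch a single-GPU Llama 3 deployment. The image, GPU type, model name, and resource sizes are placeholders chosen for illustration; they are not defaults shipped with the template.

```python
# Illustrative Resource & Runtime values (placeholders only).
runtime_settings = {
    "cpu": 8,
    "memory": 32,                               # units depend on the template
    "gpu_count": 1,
    "gpu_type": "nvidia-a100",
    "pod_image": "vllm/vllm-openai:latest",     # assumed OpenAI-compatible server
    "arguments": ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"],
    "env_vars": {"HUGGING_FACE_HUB_TOKEN": "<your-hf-token>"},
    "is_private_registry": False,
}
```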

Storage & Scheduling

| Name | Value Type | Description |
| --- | --- | --- |
| Enable Storage | Boolean | Flag to enable persistent storage for the inference service |
| Storage Size | Text | Size of the storage volume (e.g., 10Gi) for model or data |
| Storage Path | Text | Path for mounting storage inside the inference container |
| Storage Class | Text | Storage class used for provisioning inference volumes |
| Access Mode | Text | Access mode for the persistent volume (e.g., ReadWriteOnce) |
| Volume Mounts | List | Volume mounts for attaching storage to the inference pod |
| Tolerations | List | Tolerations applied for scheduling the inference pod on specific nodes |
| Node Selectors | Map | Key-value pairs to constrain the inference pod to specific nodes |
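
The sketch below illustrates typical storage and scheduling values for a GPU node pool. The taint key and node label follow common NVIDIA GPU-node conventions; they and the storage sizes are assumptions, not values mandated by the template.

```python
# Illustrative Storage & Scheduling values (placeholders only).
storage_scheduling = {
    "enable_storage": True,
    "storage_size": "50Gi",          # sized to hold model weights
    "storage_path": "/models",
    "storage_class": "standard",
    "access_mode": "ReadWriteOnce",
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ],
    "node_selectors": {"nvidia.com/gpu.present": "true"},
}
```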

Launch Time

Launching an inference service with this template takes approximately 14 minutes.