Configuration
This Inference Template enables the deployment of an inference service on a GPU-enabled Kubernetes cluster, offering a scalable and configurable platform for serving popular large language models (LLMs) such as Llama 3 and DeepSeek variants.
The template provisions essential components such as model containers, GPU resource configurations, and network access policies within a Kubernetes namespace. It supports flexible model definitions using input parameters like GPU type, model name, and image version, enabling customization or extension as needed.
As part of the output, users receive an inference endpoint (URL) secured by the platform’s networking layer, enabling integration with downstream applications or testing tools.
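For example, if the deployed model image exposes an OpenAI-compatible API (as many vLLM-based serving images do), the endpoint can be exercised with a short client script. The URL, token, model name, and request shape below are illustrative assumptions, not values produced by the template:

```python
# A minimal sketch of calling the inference endpoint after deployment.
# The URL, token, and OpenAI-compatible request format are assumptions:
# substitute whatever API the chosen model image actually exposes.
import requests

ENDPOINT = "https://my-inference.example.com/v1/chat/completions"  # placeholder URL from template output
HEADERS = {"Authorization": "Bearer <api-token>"}                   # only if the endpoint is secured

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    "messages": [{"role": "user", "content": "Summarize Kubernetes in one sentence."}],
    "max_tokens": 128,
}

response = requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```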
For detailed steps to get started, refer to the Inference Template Get Started Guide.
Initial Setup
The platform administrator configures the inference template and shares it with an appropriate project. This ensures that the end user has access to a pre-configured setup with the necessary controls, GPU types, and base configurations for inference.
```mermaid
sequenceDiagram
    participant Admin as Platform Admin
    participant Catalog as System Catalog
    participant Project as Inference Project
    Admin->>Catalog: Selects Inference Template
    Admin->>Project: Shares Template with Predefined Inputs and Controls
    Project-->>Admin: Template Available for Deployment
```
End User Flow
The end user launches the inference template and provides required inputs (e.g., model name, image version, GPU type, and Hugging Face token) to deploy the inference service on a GPU-enabled Kubernetes cluster.
```mermaid
sequenceDiagram
    participant User as End User
    participant Project as Inference Project
    participant Cluster as GPU-enabled K8s Cluster
    User->>Project: Launches Template
    User->>Project: Provides Model Name, GPU Type, Image Version, HF Token
    User->>Project: Clicks "Deploy"
    Project->>Cluster: Provisions Inference Service with Selected Model
    Cluster-->>User: Inference Endpoint Ready
    Cluster-->>User: Kubeconfig and Endpoint Provided as Output
```
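Once the kubeconfig and endpoint are returned, the deployment can be verified directly against the cluster. The sketch below assumes the kubeconfig output has been saved to a local file and that the service runs in the namespace supplied at launch; the file path and namespace are placeholders:

```python
# A minimal sketch: check that the inference pods are running and ready,
# using the kubeconfig returned as template output (path is a placeholder).
from kubernetes import client, config

config.load_kube_config(config_file="./inference-kubeconfig.yaml")
v1 = client.CoreV1Api()

namespace = "inference-demo"  # namespace provided at launch
for pod in v1.list_namespaced_pod(namespace).items:
    statuses = pod.status.container_statuses or []
    ready = bool(statuses) and all(cs.ready for cs in statuses)
    print(f"{pod.metadata.name}: phase={pod.status.phase} ready={ready}")
```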
The template is designed to support both:
- Day 0 operations: Initial setup
- Day 2 operations: Ongoing management
Resources
An Inference Service provisioned on a Kubernetes cluster, optimized for serving machine learning models at scale.
Pre-Requisites
- Host Cluster: Ensure that a Kubernetes host cluster is available and ready for inference service deployment (a quick GPU-capacity check is sketched after this list).
- Agent Configuration: Configure agents through Global Settings or during cluster provisioning.
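As a quick prerequisite check, you can confirm that the host cluster advertises allocatable GPUs before launching the template. The kubeconfig path and the nvidia.com/gpu resource name below are assumptions about how the cluster is set up:

```python
# A minimal sketch of a host-cluster prerequisite check: list nodes that
# advertise allocatable NVIDIA GPUs. Path and resource name are assumptions.
from kubernetes import client, config

config.load_kube_config(config_file="./host-cluster-kubeconfig.yaml")
nodes = client.CoreV1Api().list_node().items

gpu_nodes = {
    node.metadata.name: node.status.allocatable.get("nvidia.com/gpu", "0")
    for node in nodes
    if node.status.allocatable.get("nvidia.com/gpu", "0") != "0"
}
print("GPU-capable nodes:", gpu_nodes or "none found")
```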
Configuration
At template launch, provide the required configuration values as exposed by the Platform Admin. This may include:
- Inference Service Configuration:
    - Host Cluster Name: Select the host Kubernetes cluster for deploying the inference service.
    - Template Parameters: Provide additional parameters as required (for example, model selection, resource limits, and GPU type).
After entering the required information, click Deploy to initiate the inference service provisioning.
Input Variables for Inference Template
General Configuration
Name | Value Type | Description |
---|---|---|
Namespace | Text | Namespace where the inference service will be deployed |
Action | Text | Operation to be performed on the inference service (e.g., start, stop, delete) |
SKU Type | Text | Type of SKU used for the inference workload |
Timeout Seconds | Number | Timeout in seconds for inference execution |
Log Success Pattern | Text | Pattern in logs that indicates successful model startup or readiness |
KeyAlpha | Text | Custom key parameter (user-defined) |
KeyBeta | Text | Custom key parameter (user-defined) |
KeyGamma | Text | Custom key parameter (user-defined) |
KeyX | Text | Custom key parameter (user-defined) |
KeyY | Text | Custom key parameter (user-defined) |
KeyZ | Text | Custom key parameter (user-defined) |
Device Details | Text | Device or hardware details allocated to the inference service |
Host Cluster Name | Text | Name of the host Kubernetes cluster where inference runs |
Kubeconfig | Text | Kubernetes configuration data for accessing the cluster |
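The snippet below shows one plausible set of values for the general configuration inputs. The keys mirror the table above, but the exact names, formats, and accepted values are defined by the Platform Admin, so treat these purely as placeholders:

```python
# Illustrative General Configuration values; all names and values are
# placeholders, not defaults of the template.
general_config = {
    "Namespace": "inference-demo",
    "Action": "start",                            # e.g., start, stop, delete
    "SKU Type": "gpu-small",
    "Timeout Seconds": 900,
    "Log Success Pattern": "Uvicorn running on",  # readiness marker in container logs
    "Host Cluster Name": "gpu-cluster-01",
}
```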
Access & Security
Name | Value Type | Description |
---|---|---|
Enable SSH Access | Boolean | Flag to enable SSH access to the inference pod |
Autogenerate SSH Key | Boolean | Flag to automatically generate an SSH key for inference access |
Public Key | Text | Public SSH key for accessing the inference pod |
Enable Web Access | Boolean | Flag to enable browser-based access to the inference service |
SSL Certificate Public Key | Text | Public SSL certificate for securing inference endpoints |
SSL Certificate Private Key | Text | Private SSL key for inference endpoint authentication |
Hostname Suffix | Text | Hostname suffix for accessing the inference service |
Container Port | Number | Port exposed by the inference container |
Ingress Annotations | Map | Ingress annotations for exposing inference endpoints |
Ingress Class Name | Text | Ingress class used for routing inference traffic |
Security Context | Map | Security context for running the inference pod |
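If SSH access is enabled and Autogenerate SSH Key is left off, a key pair must be supplied manually. One way to produce a suitable value for the Public Key field is with the Python cryptography package; this is only a sketch, and any standard SSH key-generation tool works equally well:

```python
# A minimal sketch: generate an Ed25519 key pair in OpenSSH format using the
# 'cryptography' package. The public key is pasted into the Public Key field;
# the private key stays with you for SSH access to the inference pod.
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.hazmat.primitives import serialization

private_key = ed25519.Ed25519PrivateKey.generate()

public_openssh = private_key.public_key().public_bytes(
    serialization.Encoding.OpenSSH, serialization.PublicFormat.OpenSSH
).decode()  # value for the Public Key input

private_pem = private_key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.OpenSSH,
    serialization.NoEncryption(),
).decode()  # keep this file local and protected

print(public_openssh)
```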
Resource & Runtime Settings
Name | Value Type | Description |
---|---|---|
CPU | Number | CPU resources requested for the inference pod |
Memory | Number | Memory resources requested for the inference pod |
GPU Count | Number | Number of GPUs allocated to the inference pod |
GPU Type | Text | Type of GPU allocated to the inference workload |
Node Type | Text | Type of node used for scheduling the inference pod |
Pod Image | Text | Container image used for the inference service |
Commands | List | Commands executed when starting the inference container |
Arguments | List | Arguments passed to the inference container process |
Env Vars | Map | Environment variables for the inference runtime |
Is Private Registry | Boolean | Indicates if the inference image is from a private registry |
Registry Server | Text | Address of the container registry for the inference image |
Registry Username | Text | Username for accessing the container registry |
Registry Password | Text | Password for registry authentication |
Registry Email | Text | Email associated with the container registry account |
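As an illustration, the values below configure a single-GPU pod running a vLLM-style OpenAI-compatible server. The image, command, arguments, and environment variable names are assumptions about the chosen model container, not defaults of the template:

```python
# Illustrative Resource & Runtime values for a vLLM-style serving container.
# Image, commands, arguments, and env var names are assumptions; replace them
# with whatever the selected model image actually expects.
runtime_settings = {
    "Pod Image": "vllm/vllm-openai:latest",   # placeholder image
    "GPU Count": 1,
    "GPU Type": "nvidia-a100",
    "CPU": 8,
    "Memory": 32,                              # units per the template's own convention
    "Commands": ["python", "-m", "vllm.entrypoints.openai.api_server"],
    "Arguments": ["--model", "meta-llama/Meta-Llama-3-8B-Instruct", "--port", "8000"],
    "Env Vars": {"HUGGING_FACE_HUB_TOKEN": "<hf-token>"},  # token supplied at launch
}
```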
Storage & Scheduling
Name | Value Type | Description |
---|---|---|
Enable Storage | Boolean | Flag to enable persistent storage for the inference service |
Storage Size | Text | Size of the storage volume (e.g., 10Gi) for model or data |
Storage Path | Text | Path for mounting storage inside the inference container |
Storage Class | Text | Storage class used for provisioning inference volumes |
Access Mode | Text | Access mode for the persistent volume (e.g., ReadWriteOnce) |
Volume Mounts | List | Volume mounts for attaching storage to the inference pod |
Tolerations | List | Tolerations applied for scheduling the inference pod on specific nodes |
Node Selectors | Map | Key-value pairs to constrain the inference pod to specific nodes |
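The scheduling inputs follow standard Kubernetes pod-spec conventions. The example below assumes the host cluster taints its GPU nodes with nvidia.com/gpu and labels them with nvidia.com/gpu.present; adjust these to match the cluster's actual taints and labels:

```python
# Illustrative Storage & Scheduling values. Tolerations and Node Selectors use
# standard Kubernetes pod-spec structures; the taint key and node label shown
# are assumptions about how the host cluster's GPU nodes are configured.
storage_and_scheduling = {
    "Enable Storage": True,
    "Storage Size": "50Gi",        # leave room for model weights
    "Storage Path": "/models",
    "Storage Class": "standard",
    "Access Mode": "ReadWriteOnce",
    "Tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ],
    "Node Selectors": {"nvidia.com/gpu.present": "true"},
}
```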
Launch Time
Launching an inference service with this template takes approximately 14 minutes.