Configuration
This Inference Template enables the deployment of a vLLM-based inference service on a GPU-enabled Kubernetes cluster, offering a scalable and configurable platform for serving popular large language models (LLMs) such as Llama 3 and DeepSeek variants.
The template provisions essential components such as model containers, GPU resource configurations, and network access policies within a Kubernetes namespace. It supports flexible model definitions using input parameters like GPU type, model name, and image version, enabling customization or extension as needed.
As part of the output, users receive an inference endpoint (URL) secured by the platform’s networking layer, enabling integration with downstream applications or testing tools.
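Once the service is running, the endpoint serves vLLM's OpenAI-compatible API, so any OpenAI-style client or a plain HTTP request can be used as a quick smoke test. The sketch below assumes a hypothetical endpoint URL and model name; substitute the values from your deployment output, and add whatever authentication header the platform's networking layer requires.

```python
import requests

# Placeholder values: take the real endpoint URL from the template output and
# the model name from the Model Name input selected at launch.
ENDPOINT = "https://vllm.example.com"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

# vLLM's OpenAI-compatible server lists the served model(s) under /v1/models.
models = requests.get(f"{ENDPOINT}/v1/models", timeout=30)
models.raise_for_status()
print([m["id"] for m in models.json()["data"]])

# Minimal chat completion request against the served model.
resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```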
For detailed steps to get started, refer to the vLLM Inference Template Get Started Guide.
Initial Setup
The platform administrator configures the vLLM inference template and shares it with an appropriate project. This ensures that the end user has access to a pre-configured setup with the necessary controls, GPU types, and base configurations for inference.
sequenceDiagram
participant Admin as Platform Admin
participant Catalog as System Catalog
participant Project as Inference Project
Admin->>Catalog: Selects vLLM Inference Template
Admin->>Project: Shares Template with Predefined Inputs and Controls
Project-->>Admin: Template Available for Deployment
End User Flow
The end user launches the shared vLLM inference template and provides required inputs (e.g., model name, image version, GPU type, and Hugging Face token) to deploy the inference service on a GPU-enabled Kubernetes cluster.
sequenceDiagram
participant User as End User
participant Project as Inference Project
participant Cluster as GPU-enabled K8s Cluster
User->>Project: Launches Shared vLLM Template
User->>Project: Provides Model Name, GPU Type, Image Version, HF Token
User->>Project: Clicks "Deploy"
Project->>Cluster: Provisions Inference Service with Selected Model
Cluster-->>User: vLLM Inference Endpoint Ready
Cluster-->>User: Kubeconfig and Endpoint Provided as Output
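The kubeconfig provided as output can also be used for a quick post-deployment check that the inference workload is healthy. A minimal sketch with the Kubernetes Python client, assuming a hypothetical kubeconfig path and namespace (use the values from your own deployment):

```python
from kubernetes import client, config

# Placeholders: point these at the kubeconfig returned by the template and the
# namespace supplied at launch.
config.load_kube_config(config_file="vllm-kubeconfig.yaml")
namespace = "vllm-inference"

v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace).items:
    # Expect the vLLM pod(s) to reach the Running phase once the model is loaded.
    print(pod.metadata.name, pod.status.phase)
```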
The template is designed to support both:
- Day 0 operations: Initial setup
- Day 2 operations: Ongoing management
Resources
A virtual Kubernetes cluster running inside the custom namespace, operating independently while sharing the host cluster infrastructure
Prerequisites
- A Kubernetes cluster with GPU-enabled nodes that meet the resource requirements of your selected model (a quick GPU capacity check is sketched after this list)
- A Hugging Face access token (with read permission) to download models from Hugging Face
- A locally hosted model available in a configured MinIO bucket
- Configuration of domain-specific details to route and access the deployed inference endpoint (e.g., DNS, network access, or ingress rules) if the “Custom” domain option is used
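As a quick pre-flight check for the GPU prerequisite, the Kubernetes Python client can list allocatable GPUs per node. `nvidia.com/gpu` and `amd.com/gpu` are the resource names exposed by the NVIDIA and AMD device plugins; adjust if your cluster advertises GPUs differently.

```python
from kubernetes import client, config

# Uses your current kubeconfig/context; run this against the target cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu") or allocatable.get("amd.com/gpu") or "0"
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```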
Configuration
At template launch, provide the required configuration values as exposed by the Platform Admin. This may include:
- Credentials:
  - Hugging Face Token: Personal access token with read permission to download models from Hugging Face (a quick read-access check is sketched after this list)
- Inference Configuration:
  - Model Name: Specify the name of the LLM to deploy (e.g., `Llama 3`, `deepseek`). The list of available models can be updated or customized by the admin to include models from Hugging Face or locally hosted models, provided they are compatible with the vLLM engine
  - GPU Type: Select the appropriate GPU type (e.g., `nvidia` or `AMD`)
  - Image Version: Provide the vLLM image version to use
- Deployment Settings:
  - Namespace: Provide the namespace where the inference service will be deployed
  - Custom Domain (Optional): Configure DNS and ingress settings if a custom domain is required to expose the inference endpoint
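Before deploying, the Hugging Face token can be sanity-checked for read access with the `huggingface_hub` client. A minimal sketch; the token and model ID below are placeholders for your own values (gated models such as the Llama family also require the license to be accepted on Hugging Face):

```python
from huggingface_hub import HfApi

# Placeholders: use your own read-scoped token and the model you plan to deploy.
token = "hf_xxxxxxxxxxxxxxxxxxxx"
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

api = HfApi(token=token)
print("Token belongs to:", api.whoami()["name"])          # token is valid
print("Model is readable:", api.model_info(model_id).id)  # token can read the repo
```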
After entering the required information, click Deploy to initiate the provisioning of the vLLM inference service.
Input Variables for vLLM-Based LLM Inference Template
Cluster & Connection Configuration
Name | Value Type | Description |
---|---|---|
Name | Text | Name of the resource |
Host Server | Text | Host server URL or address |
Client Key Data | Text | Base64-encoded client key data |
Client Certificate Data | Text | Base64-encoded client certificate data |
Certificate Authority Data | Text | Base64-encoded CA certificate data |
Kubeconfig | Text | Kubeconfig content or reference |
cluster_name | Text | Name of the target cluster |
host_cluster_name | Text | Name of the host Kubernetes cluster |
Project | Text | Project under which the inference system is deployed |
Namespace | Text | Namespace for the deployment |
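The Client Key Data, Client Certificate Data, and Certificate Authority Data fields expect base64-encoded strings, in the same form they appear inside a kubeconfig. If you only have the PEM files on disk, a small sketch for producing those values (file names are placeholders):

```python
import base64

def b64_file(path: str) -> str:
    """Return the base64-encoded contents of a file, as used in a kubeconfig."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Placeholder file names; point these at your actual PEM files.
print(b64_file("client.key"))  # -> Client Key Data
print(b64_file("client.crt"))  # -> Client Certificate Data
print(b64_file("ca.crt"))      # -> Certificate Authority Data
```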
Network & Domain Settings
Name | Value Type | Description |
---|---|---|
Ingress Domain | Text | Domain used for ingress |
Ingress Controller IP | Text | IP address of the ingress controller |
Ingress Namespace | Text | Namespace where the ingress controller resides |
Sub Domain | Text | Subdomain used for routing |
Custom Domain | Text | User-defined custom domain |
Custom Cert | Text | Custom TLS certificate |
Custom Key | Text | Custom TLS private key |
Custom Secret Name | Text | Custom secret containing sensitive data |
Deployment & Resource Configuration
Name | Value Type | Description |
---|---|---|
Model | Text | Model identifier or path |
Extra Args | Text | Additional arguments for model inference |
Deployment Timeout | Text | Time in seconds before deployment times out |
Deployment Wait Timeout | Text | Time to wait for a deployment to become ready |
CPU Limits | Text | Maximum CPU resources allowed |
Memory Requests | Text | Minimum memory required |
Memory Limits | Text | Maximum memory allowed |
GPU Requests | Text | Minimum GPU resources required |
GPU Limits | Text | Maximum GPU resources allowed |
GPU Type | Text | Type of GPU to be used |
Image | Text | Container image to use |
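The CPU, memory, and GPU inputs above map to standard Kubernetes resource requests and limits on the model container. A sketch of that mapping with the Kubernetes Python client object models; the quantities and image tag are illustrative only, not defaults of this template:

```python
from kubernetes import client

# Illustrative values; the real ones come from the template inputs above.
resources = client.V1ResourceRequirements(
    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
    limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
)

container = client.V1Container(
    name="vllm",
    image="vllm/vllm-openai:latest",  # stands in for the Image input
    args=["--model", "meta-llama/Meta-Llama-3-8B-Instruct"],  # stands in for Model / Extra Args
    resources=resources,
)
print(container.resources.limits)
```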
Launch Time
The estimated time to launch a vLLM-based LLM inference service using this template is approximately 14 minutes.