
Configuration

This Inference Template enables the deployment of a vLLM-based inference service on a GPU-enabled Kubernetes cluster, offering a scalable and configurable platform for serving popular large language models (LLMs) such as Llama 3 and DeepSeek variants.

The template provisions essential components such as model containers, GPU resource configurations, and network access policies within a Kubernetes namespace. It supports flexible model definitions using input parameters like GPU type, model name, and image version, enabling customization or extension as needed.

As part of the output, users receive an inference endpoint (URL) secured by the platform’s networking layer, enabling integration with downstream applications or testing tools.
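
Because vLLM serves an OpenAI-compatible HTTP API, the endpoint returned by the template can be exercised with any standard HTTP client. Below is a minimal smoke-test sketch; the endpoint URL and model name are placeholders to be replaced with the values from your deployment output.

    # Minimal smoke test against the deployed endpoint (placeholders used
    # for the URL and model name; substitute your deployment output values).
    import requests

    ENDPOINT = "https://<your-inference-endpoint>"   # from the template output
    MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"    # example model id

    resp = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])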

For detailed steps to get started, refer to the vLLM Inference Template Get Started Guide.

Initial Setup

The platform administrator configures the vLLM inference template and shares it with an appropriate project. This ensures that the end user has access to a pre-configured setup with the necessary controls, GPU types, and base configurations for inference.

sequenceDiagram
    participant Admin as Platform Admin
    participant Catalog as System Catalog
    participant Project as Inference Project

    Admin->>Catalog: Selects vLLM Inference Template
    Admin->>Project: Shares Template with Predefined Inputs and Controls
    Project-->>Admin: Template Available for Deployment

End User Flow

The end user launches the shared vLLM inference template and provides required inputs (e.g., model name, image version, GPU type, and Hugging Face token) to deploy the inference service on a GPU-enabled Kubernetes cluster.

sequenceDiagram
    participant User as End User
    participant Project as Inference Project
    participant Cluster as GPU-enabled K8s Cluster

    User->>Project: Launches Shared vLLM Template
    User->>Project: Provides Model Name, GPU Type, Image Version, HF Token
    User->>Project: Clicks "Deploy"
    Project->>Cluster: Provisions Inference Service with Selected Model
    Cluster-->>User: vLLM Inference Endpoint Ready
    Cluster-->>User: Kubeconfig and Endpoint Provided as Output
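
Once the cluster reports the endpoint as ready, a simple way to confirm readiness from the user's side is to poll the OpenAI-compatible model listing route until the served model appears. A minimal sketch, assuming the endpoint URL from the template output (placeholder below):

    # Poll the endpoint until the served model is listed (the endpoint URL is
    # a placeholder taken from the template output).
    import time
    import requests

    ENDPOINT = "https://<your-inference-endpoint>"

    for _ in range(60):
        try:
            r = requests.get(f"{ENDPOINT}/v1/models", timeout=10)
            if r.status_code == 200:
                print("Ready, serving:", [m["id"] for m in r.json().get("data", [])])
                break
        except requests.RequestException:
            pass  # endpoint not reachable yet
        time.sleep(10)
    else:
        raise TimeoutError("Inference endpoint did not become ready in time")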

The template is designed to support both:

  • Day 0 operations: Initial setup
  • Day 2 operations: Ongoing management

Resources

A virtual Kubernetes cluster running inside the custom namespace, operating independently while sharing the host cluster infrastructure
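
Because the template also returns a kubeconfig for this virtual cluster, the inference workload can be inspected with the standard Kubernetes Python client. A minimal sketch, assuming a local kubeconfig file and a namespace name taken from your deployment output (both placeholders here):

    # List the inference pods inside the virtual cluster; the kubeconfig path
    # and namespace below are placeholders from the template output.
    from kubernetes import client, config

    config.load_kube_config(config_file="./vllm-kubeconfig.yaml")
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod(namespace="vllm-inference").items:
        print(pod.metadata.name, pod.status.phase)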

Prerequisites

  • A Kubernetes cluster with GPU-enabled nodes that meet the resource requirements of your selected model
  • A Hugging Face access token (with read permission) to download models from Hugging Face (a quick way to check the token is sketched after this list)
  • A locally hosted model available in a configured MinIO bucket
  • Configuration of domain-specific details to route and access the deployed inference endpoint (e.g., DNS, network access, or ingress rules) if the “Custom” domain option is used
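
Before launching, it can help to confirm that the Hugging Face token is valid. A minimal sketch using the huggingface_hub library (the token value is a placeholder):

    # Sanity-check the Hugging Face token before launching the template
    # (the token value is a placeholder).
    from huggingface_hub import HfApi

    api = HfApi(token="hf_xxx")   # replace with your token
    user = api.whoami()           # raises an error if the token is invalid
    print("Token OK for:", user.get("name", user))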

Configuration

At template launch, provide the required configuration values exposed by the Platform Admin. These may include:

  • Credentials:
    • Hugging Face Token: Personal access token with read permission to download models from Hugging Face
  • Inference Configuration:
    • Model Name: Specify the name of the LLM to deploy (e.g., Llama 3, DeepSeek). The admin can update or customize the list of available models to include models from Hugging Face or locally hosted models, provided they are compatible with the vLLM engine
    • GPU Type: Select the appropriate GPU type (e.g., NVIDIA or AMD)
    • Image Version: Provide the vLLM container image version to use for the inference service
  • Deployment Settings:
    • Namespace: Provide the namespace where the inference service will be deployed
    • Custom Domain (Optional): Configure DNS and ingress settings if a custom domain is required to expose the inference endpoint

After entering the required information, click Deploy to initiate the provisioning of the vLLM inference service.


Input Variables for vLLM-Based LLM Inference Template

Cluster & Connection Configuration

| Name | Value Type | Description |
| --- | --- | --- |
| Name | Text | Name of the resource |
| Host Server | Text | Host server URL or address |
| Client Key Data | Text | Base64-encoded client key data |
| Client Certificate Data | Text | Base64-encoded client certificate data |
| Certificate Authority Data | Text | Base64-encoded CA certificate data |
| Kubeconfig | Text | Kubeconfig content or reference |
| cluster_name | Text | Name of the target cluster |
| host_cluster_name | Text | Name of the host Kubernetes cluster |
| Project | Text | Project under which the inference system is deployed |
| Namespace | Text | Namespace for the deployment |

Network & Domain Settings

| Name | Value Type | Description |
| --- | --- | --- |
| Ingress Domain | Text | Domain used for ingress |
| Ingress Controller IP | Text | IP address of the ingress controller |
| Ingress Namespace | Text | Namespace where the ingress controller resides |
| Sub Domain | Text | Subdomain used for routing |
| Custom Domain | Text | User-defined custom domain |
| Custom Cert | Text | Custom TLS certificate |
| Custom Key | Text | Custom TLS private key |
| Custom Secret Name | Text | Custom secret containing sensitive data |

Deployment & Resource Configuration

| Name | Value Type | Description |
| --- | --- | --- |
| Model | Text | Model identifier or path |
| Extra Args | Text | Additional arguments for model inference (see the sketch after this table) |
| Deployment Timeout | Text | Time in seconds before deployment times out |
| Deployment Wait Timeout | Text | Time to wait for a deployment to become ready |
| CPU Limits | Text | Maximum CPU resources allowed |
| Memory Requests | Text | Minimum memory required |
| Memory Limits | Text | Maximum memory allowed |
| GPU Requests | Text | Minimum GPU resources required |
| GPU Limits | Text | Maximum GPU resources allowed |
| GPU Type | Text | Type of GPU to be used |
| Image | Text | Container image to use |
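
The Model, Extra Args, and GPU fields above are presumably passed through to the vLLM engine inside the container. As a rough illustration only (not this template's defaults), the same knobs look roughly like this in vLLM's offline Python API; the model name and values are examples, and the flags accepted by the deployed image may vary with the vLLM version.

    # Illustrative mapping of typical Extra Args / GPU settings onto vLLM
    # engine arguments; the values and model name are examples, not defaults.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # "Model" input
        tensor_parallel_size=1,                       # roughly tracks GPU Requests/Limits
        gpu_memory_utilization=0.90,                  # common extra argument
        max_model_len=8192,                           # common extra argument
        dtype="auto",                                 # common extra argument
    )

    out = llm.generate(
        ["Summarize what vLLM does in one sentence."],
        SamplingParams(temperature=0.2, max_tokens=64),
    )
    print(out[0].outputs[0].text)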

Launch Time

The estimated time to launch a vLLM-based LLM inference service using this template is approximately 14 minutes.