Configuration

This Inference Template enables the deployment of a vLLM-based inference service on a GPU-enabled Kubernetes cluster, offering a scalable and configurable platform for serving popular large language models (LLMs) such as Llama 3 and DeepSeek variants.

The template provisions essential components such as model containers, GPU resource configurations, and network access policies within a Kubernetes namespace. It supports flexible model definitions using input parameters like GPU type, model name, and image version, enabling customization or extension as needed.

As part of the output, users receive an inference endpoint (URL) secured by the platform’s networking layer, enabling integration with downstream applications or testing tools.
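
vLLM serves an OpenAI-compatible HTTP API, so the returned endpoint can typically be exercised with any standard HTTP client. The sketch below assumes a hypothetical endpoint URL and model name and that a chat model is deployed; substitute the values returned by the template, and add an authorization header if the endpoint is secured with one.

# Minimal sketch: querying the deployed vLLM endpoint via its OpenAI-compatible
# chat completions API. The endpoint URL and model name are placeholders.
import requests

ENDPOINT = "https://inference.example.com"  # hypothetical endpoint from the template output

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # example model identifier
    "messages": [{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    "max_tokens": 128,
}

resp = requests.post(f"{ENDPOINT}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])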

For detailed steps to get started, refer to the vLLM Inference Template Get Started Guide.

Initial Setup

The platform administrator configures the vLLM inference template and shares it with an appropriate project. This ensures that the end user has access to a pre-configured setup with the necessary controls, GPU types, and base configurations for inference.

sequenceDiagram
    participant Admin as Platform Admin
    participant Catalog as System Catalog
    participant Project as Inference Project

    Admin->>Catalog: Selects vLLM Inference Template
    Admin->>Project: Shares Template with Predefined Inputs and Controls
    Project-->>Admin: Template Available for Deployment

End User Flow

The end user launches the shared vLLM inference template and provides required inputs (e.g., model name, image version, GPU type, and Hugging Face token) to deploy the inference service on a GPU-enabled Kubernetes cluster.

sequenceDiagram
    participant User as End User
    participant Project as Inference Project
    participant Cluster as GPU-enabled K8s Cluster

    User->>Project: Launches Shared vLLM Template
    User->>Project: Provides Model Name, GPU Type, Image Version, HF Token
    User->>Project: Clicks "Deploy"
    Project->>Cluster: Provisions Inference Service with Selected Model
    Cluster-->>User: vLLM Inference Endpoint Ready
    Cluster-->>User: Kubeconfig and Endpoint Provided as Output
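
The kubeconfig returned as an output can be used to confirm that the inference deployment is up before sending traffic. The sketch below shows one way to do that with the Kubernetes Python client; the kubeconfig path and namespace are placeholders.

# Minimal sketch: checking deployment readiness with the kubeconfig returned by
# the template. Requires the "kubernetes" package; the kubeconfig path and
# namespace are placeholders.
from kubernetes import client, config

config.load_kube_config(config_file="vllm-kubeconfig.yaml")  # kubeconfig from the template output
apps = client.AppsV1Api()

for d in apps.list_namespaced_deployment(namespace="vllm-inference").items:  # hypothetical namespace
    ready = d.status.ready_replicas or 0
    print(f"{d.metadata.name}: {ready}/{d.spec.replicas} replicas ready")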

The templates are designed to support both:

  • Day 0 operations: Initial setup
  • Day 2 operations: Ongoing management

Resources

A virtual Kubernetes cluster running inside the custom namespace, operating independently while sharing the host cluster's infrastructure.

Prerequisites

To successfully use this template, the following prerequisites must be met:

  • A Kubernetes cluster with GPU-enabled nodes that meet the resource requirements of the selected model
  • One of the following model access options:
    • A Hugging Face access token (with read permission) to download models from Hugging Face; a quick way to verify the token is sketched after this list
    • A locally hosted LLM model available in an S3-compatible object storage bucket (such as MinIO)
  • Configuration of domain-specific details to route and access the deployed inference endpoint (e.g., DNS, network access, or ingress rules) if the “Custom” domain option is used
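
Before launching the template with the Hugging Face option, it can be worth confirming that the token is valid and can see the target model repository. A minimal sketch using the huggingface_hub client follows; the token and model ID are placeholders, and gated models (such as Llama 3) additionally require accepted license terms on Hugging Face.

# Minimal sketch: verifying a Hugging Face read token before deployment.
# Requires the "huggingface_hub" package; token and model ID are placeholders.
from huggingface_hub import HfApi

api = HfApi(token="hf_xxx")  # substitute your personal access token
identity = api.whoami()      # raises an error if the token is invalid
print(f"Token belongs to: {identity['name']}")

# Confirm the target repository is visible with this token
# (raises an error for gated repos without accepted terms or granted access).
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print(f"Model found: {info.id}")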

Configuration

At template launch, provide the required configuration values exposed by the Platform Admin. These may include:

  • Credentials:

    • Hugging Face Token: Personal access token with read permission to download models from Hugging Face
  • Inference Configuration:

    • Model Name: Specify the name of the LLM to deploy (e.g., Llama 3, DeepSeek). The list of available models can be updated or customized by the admin to include models from Hugging Face or locally hosted models, provided they are compatible with the vLLM engine
    • GPU Type: Select the appropriate GPU type (e.g., NVIDIA or AMD)
    • Image Version: Specify the vLLM container image version to use
  • Deployment Settings:

    • Namespace: Provide the namespace where the inference service will be deployed
    • Custom Domain (Optional): Configure DNS and ingress settings if a custom domain is required to expose the inference endpoint

After entering the required information, click Deploy to initiate the provisioning of the vLLM inference service.


Input Variables for vLLM-Based LLM Inference Template

Cluster & Connection Configuration

Name | Value Type | Description
Name | Text | Name of the resource
Host Server | Text | Host server URL or address
Client Key Data | Text | Base64-encoded client key data
Client Certificate Data | Text | Base64-encoded client certificate data
Certificate Authority Data | Text | Base64-encoded CA certificate data
Kubeconfig | Text | Kubeconfig content or reference
cluster_name | Text | Name of the target cluster
host_cluster_name | Text | Name of the host Kubernetes cluster
Project | Text | Project under which the inference system is deployed
Namespace | Text | Namespace for the deployment

Network & Domain Settings

Name | Value Type | Description
Ingress Domain | Text | Domain used for ingress
Ingress Controller IP | Text | IP address of the ingress controller
Ingress Namespace | Text | Namespace where the ingress controller resides
Sub Domain | Text | Subdomain used for routing
Custom Domain | Text | User-defined custom domain
Custom Cert | Text | Custom TLS certificate
Custom Key | Text | Custom TLS private key
Custom Secret Name | Text | Custom secret containing sensitive data
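
If the custom domain workflow in your environment expects the TLS material to be pre-created as a Kubernetes secret (referenced via Custom Secret Name) rather than passed directly as Custom Cert and Custom Key, the sketch below shows one way to create such a secret with the Kubernetes Python client. The secret name, namespace, and file paths are placeholders, and this is an illustration rather than the template's own mechanism.

# Illustrative sketch: creating a TLS secret for a custom domain so an ingress
# can reference it. Requires the "kubernetes" package; secret name, namespace,
# and certificate/key paths are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

with open("custom-domain.crt") as crt, open("custom-domain.key") as key:
    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name="vllm-custom-tls"),  # hypothetical secret name
        type="kubernetes.io/tls",
        string_data={"tls.crt": crt.read(), "tls.key": key.read()},
    )

core.create_namespaced_secret(namespace="vllm-inference", body=secret)  # hypothetical namespace
print("TLS secret created")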

Model Storage Configuration

Name | Value Type | Description
Model Source | Text | Indicates whether the model is stored locally or externally
Minio Endpoint | Text | Endpoint URL of the MinIO object storage
Minio Access key | Text | Access key for MinIO object storage
Minio Secret key | Text | Secret key for MinIO object storage
Model Data Path | Text | Path to the model data in object storage

⚠️ Important:
- Inference deployments using this template can retrieve the LLM model either from Hugging Face or from object storage.
- Selecting MinIO as the Model Source allows organizations to host LLM models privately in their own object storage buckets and reference them during deployment, without depending on external registries. This is particularly useful for air-gapped environments or setups with restricted internet access.
- Any S3-compatible storage can be used to host the model; MinIO is the recommended option with this template.
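
For the object storage option, the model files only need to be present in a bucket that the cluster can reach, at the location given by Model Data Path. The sketch below uploads a locally downloaded model directory to an S3-compatible endpoint such as MinIO using boto3; the endpoint, credentials, bucket, and paths are all placeholders.

# Minimal sketch: uploading a local model directory to an S3-compatible bucket
# (e.g., MinIO) so the template can reference it at deploy time. Requires the
# "boto3" package; endpoint, credentials, bucket, and paths are placeholders.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.com:9000",  # hypothetical MinIO endpoint
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

bucket = "llm-models"                          # hypothetical bucket
local_dir = "models/Meta-Llama-3-8B-Instruct"  # hypothetical local model directory

for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, start=os.path.dirname(local_dir))
        s3.upload_file(path, bucket, key)
        print(f"uploaded {key}")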

Deployment & Resource Configuration

Name | Value Type | Description
Model | Text | Model identifier or path
Extra Args | Text | Additional arguments for model inference
Deployment Timeout | Text | Time in seconds before deployment times out
Deployment Wait Timeout | Text | Time to wait for a deployment to become ready
CPU Limits | Text | Maximum CPU resources allowed
Memory Requests | Text | Minimum memory required
Memory Limits | Text | Maximum memory allowed
GPU Requests | Text | Minimum GPU resources required
GPU Limits | Text | Maximum GPU resources allowed
GPU Type | Text | Type of GPU to be used
Image | Text | Container image to use
Empty dir sizelimit | Text | Size limit for temporary emptyDir storage
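
The Model, Extra Args, and Image inputs ultimately shape how the vLLM server is started inside the container. As a purely illustrative example (not the template's exact mechanism), the sketch below shows how such inputs typically map onto a vLLM serve command line; the model and flag values are placeholders.

# Illustrative sketch only: how the Model and Extra Args inputs typically
# translate into a vLLM server command line. Values and flags are examples,
# not the template's exact mechanism.
import shlex

model = "meta-llama/Meta-Llama-3-8B-Instruct"                       # "Model" input
extra_args = "--max-model-len 8192 --gpu-memory-utilization 0.90"   # "Extra Args" input

command = ["vllm", "serve", model] + shlex.split(extra_args)
print(" ".join(command))
# vllm serve meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 8192 --gpu-memory-utilization 0.90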

Launch Time

The estimated time to launch a vLLM-based LLM inference service using this template is approximately 14 minutes.