
Configuration

This Inference Template enables the deployment of a vLLM-based inference service on a GPU-enabled Kubernetes cluster, offering a scalable and configurable platform for serving popular large language models (LLMs) such as Llama 3 and DeepSeek variants.

The template provisions essential components such as model containers, GPU resource configurations, and network access policies within a Kubernetes namespace. It supports flexible model definitions using input parameters like GPU type, model name, and image version, enabling customization or extension as needed.

As part of the output, users receive an inference endpoint (URL) secured by the platform’s networking layer, enabling integration with downstream applications or testing tools.
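
Because vLLM serves an OpenAI-compatible HTTP API, the endpoint returned by the template can be exercised with any standard HTTP client. Below is a minimal smoke-test sketch; the endpoint URL and model name are placeholders to be replaced with the values from your deployment output.

    # Minimal smoke test against the deployed endpoint (placeholders used
    # for the URL and model name; substitute your deployment output values).
    import requests

    ENDPOINT = "https://<your-inference-endpoint>"   # from the template output
    MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"    # example model id

    resp = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])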

For detailed steps to get started, refer to the vLLM Inference Template Get Started Guide.

Initial Setup

The platform administrator configures the vLLM inference template and shares it with an appropriate project. This ensures that the end user has access to a pre-configured setup with the necessary controls, GPU types, and base configurations for inference.

sequenceDiagram
    participant Admin as Platform Admin
    participant Catalog as System Catalog
    participant Project as Inference Project

    Admin->>Catalog: Selects vLLM Inference Template
    Admin->>Project: Shares Template with Predefined Inputs and Controls
    Project-->>Admin: Template Available for Deployment

End User Flow

The end user launches the shared vLLM inference template and provides required inputs (e.g., model name, image version, GPU type, and Hugging Face token) to deploy the inference service on a GPU-enabled Kubernetes cluster.

sequenceDiagram
    participant User as End User
    participant Project as Inference Project
    participant Cluster as GPU-enabled K8s Cluster

    User->>Project: Launches Shared vLLM Template
    User->>Project: Provides Model Name, GPU Type, Image Version, HF Token
    User->>Project: Clicks "Deploy"
    Project->>Cluster: Provisions Inference Service with Selected Model
    Cluster-->>User: vLLM Inference Endpoint Ready
    Cluster-->>User: Kubeconfig and Endpoint Provided as Output
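
Once the cluster reports the endpoint as ready, a simple way to confirm readiness from the user's side is to poll the OpenAI-compatible model listing route until the served model appears. A minimal sketch, assuming the endpoint URL from the template output (placeholder below):

    # Poll the endpoint until the served model is listed (the endpoint URL is
    # a placeholder taken from the template output).
    import time
    import requests

    ENDPOINT = "https://<your-inference-endpoint>"

    for _ in range(60):
        try:
            r = requests.get(f"{ENDPOINT}/v1/models", timeout=10)
            if r.status_code == 200:
                print("Ready, serving:", [m["id"] for m in r.json().get("data", [])])
                break
        except requests.RequestException:
            pass  # endpoint not reachable yet
        time.sleep(10)
    else:
        raise TimeoutError("Inference endpoint did not become ready in time")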

The template is designed to support both:

  • Day 0 operations: Initial setup
  • Day 2 operations: Ongoing management

Resources

A virtual Kubernetes cluster running inside the custom namespace, operating independently while sharing the host cluster infrastructure
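
Because the template also returns a kubeconfig for this virtual cluster, the inference workload can be inspected with the standard Kubernetes Python client. A minimal sketch, assuming a local kubeconfig file and a namespace name taken from your deployment output (both placeholders here):

    # List the inference pods inside the virtual cluster; the kubeconfig path
    # and namespace below are placeholders from the template output.
    from kubernetes import client, config

    config.load_kube_config(config_file="./vllm-kubeconfig.yaml")
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod(namespace="vllm-inference").items:
        print(pod.metadata.name, pod.status.phase)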

Prerequisites

  • A Kubernetes cluster with GPU-enabled nodes that meet the resource requirements of your selected model
  • A Hugging Face access token (with read permission) to download models from Hugging Face (a quick way to check the token is sketched after this list)
  • A locally hosted model available in a configured MinIO bucket
  • Configuration of domain-specific details to route and access the deployed inference endpoint (e.g., DNS, network access, or ingress rules) if the “Custom” domain option is used
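
Before launching, it can help to confirm that the Hugging Face token is valid. A minimal sketch using the huggingface_hub library (the token value is a placeholder):

    # Sanity-check the Hugging Face token before launching the template
    # (the token value is a placeholder).
    from huggingface_hub import HfApi

    api = HfApi(token="hf_xxx")   # replace with your token
    user = api.whoami()           # raises an error if the token is invalid
    print("Token OK for:", user.get("name", user))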

Configuration

At template launch, provide the required configuration values exposed by the Platform Admin. These may include:

  • Credentials:
    • Hugging Face Token: Personal access token with read permission to download models from Hugging Face
  • Inference Configuration:
    • Model Name: Specify the name of the LLM to deploy (e.g., Llama 3, DeepSeek). The admin can update or customize the list of available models to include models from Hugging Face or locally hosted models, provided they are compatible with the vLLM engine
    • GPU Type: Select the appropriate GPU type (e.g., NVIDIA or AMD)
    • Image Version: Provide the vLLM container image version to use for the inference service
  • Deployment Settings:
    • Namespace: Provide the namespace where the inference service will be deployed
    • Custom Domain (Optional): Configure DNS and ingress settings if a custom domain is required to expose the inference endpoint

After entering the required information, click Deploy to initiate the provisioning of the vLLM inference service.


Input Variables for vLLM-Based LLM Inference Template

Cluster & Connection Configuration

| Name | Value Type | Description |
| --- | --- | --- |
| Name | Text | Name of the resource |
| Host Server | Text | Host server URL or address |
| Client Key Data | Text | Base64-encoded client key data |
| Client Certificate Data | Text | Base64-encoded client certificate data |
| Certificate Authority Data | Text | Base64-encoded CA certificate data |
| Kubeconfig | Text | Kubeconfig content or reference |
| cluster_name | Text | Name of the target cluster |
| host_cluster_name | Text | Name of the host Kubernetes cluster |
| Project | Text | Project under which the inference system is deployed |
| Namespace | Text | Namespace for the deployment |

Network & Domain Settings

| Name | Value Type | Description |
| --- | --- | --- |
| Ingress Domain | Text | Domain used for ingress |
| Ingress Controller IP | Text | IP address of the ingress controller |
| Ingress Namespace | Text | Namespace where the ingress controller resides |
| Sub Domain | Text | Subdomain used for routing |
| Custom Domain | Text | User-defined custom domain |
| Custom Cert | Text | Custom TLS certificate |
| Custom Key | Text | Custom TLS private key |
| Custom Secret Name | Text | Custom secret containing sensitive data |

Deployment & Resource Configuration

| Name | Value Type | Description |
| --- | --- | --- |
| Model | Text | Model identifier or path |
| Extra Args | Text | Additional arguments for model inference (see the sketch after this table) |
| Deployment Timeout | Text | Time in seconds before deployment times out |
| Deployment Wait Timeout | Text | Time to wait for a deployment to become ready |
| CPU Limits | Text | Maximum CPU resources allowed |
| Memory Requests | Text | Minimum memory required |
| Memory Limits | Text | Maximum memory allowed |
| GPU Requests | Text | Minimum GPU resources required |
| GPU Limits | Text | Maximum GPU resources allowed |
| GPU Type | Text | Type of GPU to be used |
| Image | Text | Container image to use |
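
The Model, Extra Args, and GPU fields above are presumably passed through to the vLLM engine inside the container. As a rough illustration only (not this template's defaults), the same knobs look roughly like this in vLLM's offline Python API; the model name and values are examples, and the flags accepted by the deployed image may vary with the vLLM version.

    # Illustrative mapping of typical Extra Args / GPU settings onto vLLM
    # engine arguments; the values and model name are examples, not defaults.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # "Model" input
        tensor_parallel_size=1,                       # roughly tracks GPU Requests/Limits
        gpu_memory_utilization=0.90,                  # common extra argument
        max_model_len=8192,                           # common extra argument
        dtype="auto",                                 # common extra argument
    )

    out = llm.generate(
        ["Summarize what vLLM does in one sentence."],
        SamplingParams(temperature=0.2, max_tokens=64),
    )
    print(out[0].outputs[0].text)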

Launch Time

The estimated time to launch a vLLM-based LLM inference service using this template is approximately 14 minutes.