
Basics

In this guide you will deploy an instance of an open source model using Rafay's Token Factory and make it available to a customer/tenant for use as an inference endpoint.

The Token Factory provides operators with an integrated environment to onboard, manage, and deploy LLMs on their GPU-backed infrastructure. The service includes resources that support the complete model lifecycle, from preparing compute capacity to publishing models for organizational use.


Assumptions

This exercise assumes the following requirements are in place.

  • Admin access to the Rafay Operations Console
  • A customer (tenant) organization with a user that has a developer role
  • A managed Kubernetes cluster with at least one GPU
  • A TLS certificate and private key for the domain name being used
  • A Hugging Face account with an API key (you will deploy the "Qwen2-0.5B-Instruct" LLM from Hugging Face)

Info

To ensure this guide does not impose requirements for expensive GPU hardware, we have optimized it for Qwen2-0.5B-Instruct, a very small, lightweight model that requires minimal GPU memory (2-3 GB of VRAM). This makes it ideal for low-end GPUs, consumer hardware, or edge devices.


1. Create Compute Cluster

In this section, you will import/register a Kubernetes cluster with GPUs into the Token Factory.

  • In the Ops console, navigate to GenAI -> Compute Clusters
  • Click New Compute Cluster
  • Enter a name for the compute cluster
  • Click Save Changes

Compute Cluster

  • Download the bootstrap YAML file by clicking Download YAML Config
  • Copy the file to the Kubernetes cluster
  • Run the following command on the cluster
kubectl apply -f <compute-name>-compute-bootstrap.yaml

Compute Cluster

Once all resources in the gaap-controller and monitoring namespaces are in a Running state, the cluster will show a status of "Success".
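The same check can be scripted from the cluster. A minimal sketch, assuming `kubectl` access to the imported cluster; `all_running` is a helper introduced here for illustration, not part of the product:

```shell
# Helper: given "name phase" lines, succeed only if every pod is Running.
all_running() {
  awk 'NF && $2 != "Running" {bad=1} END {exit bad}'
}

# Real usage against the bootstrap namespaces:
#   kubectl get pods -n gaap-controller \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.status.phase}{"\n"}{end}' | all_running
#   kubectl get pods -n monitoring \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.status.phase}{"\n"}{end}' | all_running

# Demonstration with captured output (hypothetical pod names):
printf 'gaap-agent-0 Running\nmetrics-server-1 Pending\n' | all_running \
  && echo "ready" || echo "not ready"
# → not ready
```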


2. Create Endpoint

In this section, you will create an endpoint which represents the access point through which inference requests are served by the Token Factory. Endpoints route incoming traffic to deployed models running on a GPU-enabled compute cluster.

  • In the Ops console, navigate to GenAI -> Endpoints
  • Click New Endpoint
  • Enter a name for the endpoint
  • Enter the hostname for the endpoint (e.g. gs.paas.demo.gorafay.net). This will be the base for the inference URLs provided to the users.
  • Select the previously created compute cluster
  • Upload the TLS certificate for the endpoint
  • Upload the private key for the certificate
  • Click Save Changes

Endpoint

Once the endpoint is created, the IP address associated with the endpoint will be visible.

Important

Ensure this IP address is mapped to the hostname in DNS. This allows proper domain name resolution when users access the endpoint.

Endpoint IP
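Before handing the hostname to users, it is worth confirming the record has propagated. A minimal sketch, assuming a Linux host with `getent`; the hostname and IP in the commented usage line are placeholders, and `resolves_to` is a helper introduced here for illustration:

```shell
# Succeed only if the given hostname resolves to the expected IP address.
resolves_to() {
  getent ahosts "$1" | awk -v ip="$2" '$1 == ip {found=1} END {exit !found}'
}

# Real usage (placeholder values; substitute your endpoint hostname and IP):
#   resolves_to gs.paas.demo.gorafay.net <endpoint-ip>

# Demonstration with a name that resolves locally:
if resolves_to localhost 127.0.0.1; then
  echo "DNS OK"
else
  echo "DNS not propagated yet"
fi
```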


3. Create Provider

In this section, you will create a provider. A Provider represents the source or organization from which the LLM originates. Examples include Llama (Meta), Qwen, NVIDIA, Google, and custom enterprise providers.

Provider


4. Create Model

In this section, you will create a model. A model represents the LLM that the administrator onboards into the system. Each model includes core metadata such as name, description, provider, and use case. This information determines how the model is organized, how it appears in the console, and how it will later be deployed for inference.

  • In the Ops console, navigate to GenAI -> Models
  • Click New Model
  • Enter qwen-0.5B for the name
  • Select Chat for the use case
  • Select qwen for the provider
  • Select Hugging Face for the Repository
  • Enter your Hugging Face API Key
  • Enter huggingface.co/Qwen/Qwen2-0.5B-Instruct for the source
  • Enter main for the revision
  • Click Save

Model

Important

A newly created model has zero deployments by default. One or more deployments can be created for the same model, each with different endpoints or runtime configurations.


5. Create Model Deployment

In this section, you will create a model deployment. Model deployments define how an LLM is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.

  • In the Ops console, navigate to GenAI -> Model Deployments
  • Click New Model Deployment
  • Enter a name for the model deployment
  • Select the previously created model
  • Select the previously created Endpoint
  • Select vLLM for the inference engine
  • Enter vllm/vllm-openai:v0.14.1 for the vLLM Image
  • Enter 1 for the replicas
  • Enter 20Gi for the Volume size
  • Enter 2 for the CPU count
  • Enter 10Gi for the Memory amount
  • Enter 1 for the GPU count
  • Select US Dollar for the currency
  • Enter 2 for Input Tokens
  • Enter 4 for Output Tokens
  • Click Save Changes

Model Deployment

After a few minutes, the model will be deployed to the specified cluster.

Model Deployment
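Once the rollout finishes, the deployment should appear in the endpoint's model list (the vLLM OpenAI-compatible server exposes `GET /v1/models`). A minimal sketch that extracts model ids from such a response; `list_models` is a helper introduced here, and the sample JSON is a hypothetical captured response:

```shell
# Print the id of each model in a /v1/models response read from stdin.
list_models() {
  python3 -c 'import json,sys
for m in json.load(sys.stdin)["data"]:
    print(m["id"])'
}

# Real usage (placeholder hostname; requires a valid $API_KEY):
#   curl -s https://<endpoint-hostname>/v1/models \
#     -H "Authorization: Bearer $API_KEY" | list_models

# Demonstration with a hypothetical captured response:
echo '{"object":"list","data":[{"id":"gs-deployment","object":"model"}]}' | list_models
# → gs-deployment
```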


6. Model Deployment Sharing

Next, you will share the model with a downstream Tenant Organization to be used by end users.

  • In the Ops console, navigate to GenAI -> Model Deployments
  • Click the "Actions" icon near the previous model deployment and select Manage Sharing
  • Select Specific Organizations
  • Select the downstream Tenant Org to share the model with
  • Click Save Changes

Model Sharing


7. End User Utilization

Finally, you will log in as a tenant end user and call the inference endpoint. In this guide, the commands are executed directly from the Kubernetes cluster.

  • Log into the Developer Hub console as a tenant end user
  • Navigate to GenAI -> Model APIs
  • Click on the previously created model card
  • Click Get an API Key
  • Enter a name for the key
  • Click Create

API Key

  • Copy the key provided and store it in a safe location; it cannot be retrieved again
  • SSH into the Kubernetes Cluster
  • Run the following command to store the key as an environment variable. Be sure to update the command with your key value
export API_KEY=<API KEY VALUE>
  • Copy the cURL command from the console and run it in your terminal. You will see a response from the endpoint to the question asked in the cURL command.
{
  "id": "chatcmpl-c99463c4-28bd-46b3-a2de-20c410e5a506",
  "object": "chat.completion",
  "created": 1771890924,
  "model": "gs-deployment",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "There are several open-source inference libraries available, but one of the most widely used and well-regarded is PyTorch. It has been developed by Facebook and is known for its simplicity and ease of use.\nPyTorch includes a large number of pre-trained models that can be used as starting points for building more complex models. Some popular PyTorch models include ResNet, VGG, and SSD (Segmentation Dynamics).\nAnother open-source inference library is Keras, which is an extension of TensorFlow. While it does not have the same level of support as PyTorch, Keras offers many similar features and is easy to learn.\nBoth PyTorch and Keras are excellent choices for developing and training machine learning models on a variety of hardware platforms.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 30,
    "total_tokens": 184,
    "completion_tokens": 154,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
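For reference, the request behind the console's cURL snippet follows the OpenAI chat-completions format that vLLM serves. A hedged sketch that builds and validates such a payload locally; the hostname, model name, and prompt are placeholders (the model name is taken from the sample response above), and the console-provided command remains authoritative:

```shell
# Placeholder endpoint hostname; substitute your own.
HOST=gs.paas.demo.gorafay.net

# Chat-completions payload in the OpenAI-compatible format served by vLLM.
PAYLOAD=$(cat <<'EOF'
{
  "model": "gs-deployment",
  "messages": [
    {"role": "user", "content": "What is a good open source inference library?"}
  ]
}
EOF
)

# Validate the JSON locally before sending anything.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"
# → payload OK

# The request itself (requires network access and a valid $API_KEY):
# curl -s "https://$HOST/v1/chat/completions" \
#   -H "Authorization: Bearer $API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```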
  • Navigate to GenAI -> Token Usage

After a few minutes, the usage metrics will be populated and the user can view their historical usage and total spend.
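The figures shown under Token Usage roll up the usage block returned with each response. A minimal sketch that pulls the reply text and token counts out of a saved response; the JSON here is a trimmed, hypothetical copy of the sample response above:

```shell
# Trimmed copy of a chat.completion response, saved for inspection.
cat > response.json <<'EOF'
{
  "model": "gs-deployment",
  "choices": [
    {"message": {"role": "assistant", "content": "PyTorch is widely used."}}
  ],
  "usage": {"prompt_tokens": 30, "completion_tokens": 154, "total_tokens": 184}
}
EOF

# Pull out the assistant reply and the token counts.
python3 <<'EOF'
import json
with open("response.json") as f:
    r = json.load(f)
print(r["choices"][0]["message"]["content"])
u = r["usage"]
print(f"tokens: {u['prompt_tokens']} in, {u['completion_tokens']} out, {u['total_tokens']} total")
EOF
# → PyTorch is widely used.
# → tokens: 30 in, 154 out, 184 total
```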

Token Usage