
Basics

In this guide you will deploy an instance of an open source model using Rafay's Token Factory and make it available to a customer/tenant for use as an inference endpoint.

The Token Factory provides operators with an integrated environment to onboard, manage, and deploy LLMs on their GPU-backed infrastructure. The service includes resources that support the complete model lifecycle, from preparing compute capacity to publishing models for organizational use.


Assumptions

This exercise assumes the following requirements are in place.

  • Admin access to the Rafay Operations Console
  • A customer (tenant) organization with a user that has a developer role
  • A managed Kubernetes cluster with at least one GPU
  • A TLS certificate and private key for the domain name being used
  • A Hugging Face account with an API key (you will deploy the "Qwen2-0.5B-Instruct" LLM from Hugging Face)

Info

To ensure this guide does not impose requirements for expensive GPU hardware, we have optimized it for Qwen2-0.5B-Instruct, a very small, lightweight model that requires minimal GPU memory (2-3 GB of VRAM). This makes it ideal for low-end GPUs, consumer hardware, or edge devices.


1. Create Compute Cluster

In this section, you will import/register a Kubernetes cluster with GPUs into the Token Factory.

  • In the Ops console, navigate to GenAI -> Compute Clusters
  • Click New Compute Cluster
  • Enter a name for the compute cluster
  • Click Save Changes

Compute Cluster

  • Download the bootstrap YAML file by clicking Download YAML Config
  • Copy the file to the Kubernetes cluster
  • Run the following command on the cluster
kubectl apply -f <compute-name>-compute-bootstrap.yaml

Compute Cluster

Once all resources in the gaap-controller and monitoring namespaces are in a Running state, the cluster will show a status of "Success".
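The same check can be scripted from the cluster. A minimal sketch, assuming `kubectl` access to the imported cluster; `all_running` is a helper introduced here for illustration, not part of the product:

```shell
# Helper: given "name phase" lines, succeed only if every pod is Running.
all_running() {
  awk 'NF && $2 != "Running" {bad=1} END {exit bad}'
}

# Real usage against the bootstrap namespaces:
#   kubectl get pods -n gaap-controller \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.status.phase}{"\n"}{end}' | all_running
#   kubectl get pods -n monitoring \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.status.phase}{"\n"}{end}' | all_running

# Demonstration with captured output (hypothetical pod names):
printf 'gaap-agent-0 Running\nmetrics-server-1 Pending\n' | all_running \
  && echo "ready" || echo "not ready"
# → not ready
```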


2. Create Endpoint

In this section, you will create an endpoint which represents the access point through which inference requests are served by the Token Factory. Endpoints route incoming traffic to deployed models running on a GPU-enabled compute cluster.

  • In the Ops console, navigate to GenAI -> Endpoints
  • Click New Endpoint
  • Enter a name for the endpoint
  • Enter the hostname for the endpoint (e.g. gs.paas.demo.gorafay.net). This will be the base for the inference URLs provided to the users.
  • Select the previously created compute cluster
  • Upload the TLS certificate for the endpoint
  • Upload the private key for the certificate
  • Click Save Changes

Endpoint

Once the endpoint is created, the IP address associated with the endpoint will be visible.

Important

Ensure this IP address is mapped to the hostname in DNS. This allows proper domain name resolution when users access the endpoint.

Endpoint IP
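Before handing the hostname to users, it is worth confirming the record has propagated. A minimal sketch, assuming a Linux host with `getent`; the hostname and IP in the commented usage line are placeholders, and `resolves_to` is a helper introduced here for illustration:

```shell
# Succeed only if the given hostname resolves to the expected IP address.
resolves_to() {
  getent ahosts "$1" | awk -v ip="$2" '$1 == ip {found=1} END {exit !found}'
}

# Real usage (placeholder values; substitute your endpoint hostname and IP):
#   resolves_to gs.paas.demo.gorafay.net <endpoint-ip>

# Demonstration with a name that resolves locally:
if resolves_to localhost 127.0.0.1; then
  echo "DNS OK"
else
  echo "DNS not propagated yet"
fi
```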


3. Create Provider

In this section, you will create a provider. A Provider represents the source or organization from which the LLM originates. Examples include Llama (Meta), Qwen, NVIDIA, Google, and custom enterprise providers.

Provider


4. Create Model

In this section, you will create a model. A model represents the LLM that the administrator onboards into the system. Each model includes core metadata such as name, description, provider, and use case. This information determines how the model is organized, how it appears in the console, and how it will later be deployed for inference.

  • In the Ops console, navigate to GenAI -> Models
  • Click New Model
  • Enter qwen-0.5B for the name
  • Select Chat for the use case
  • Select qwen for the provider
  • Select Hugging Face for the Repository
  • Enter your Hugging Face API Key
  • Enter huggingface.co/Qwen/Qwen2-0.5B-Instruct for the source
  • Enter main for the revision
  • Click Save

Model

Important

A newly created model has zero deployments by default. One or more deployments can be created for the same model, each with different endpoints or runtime configurations.


5. Create Model Deployment

In this section, you will create a model deployment. Model deployments define how an LLM is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.

  • In the Ops console, navigate to GenAI -> Model Deployments
  • Click New Model Deployment
  • Enter a name for the model deployment
  • Select the previously created model
  • Select the previously created Endpoint
  • Select vLLM for the inference engine
  • Enter vllm/vllm-openai:v0.14.1 for the vLLM Image
  • Enter 1 for the replicas
  • Enter 20Gi for the Volume size
  • Enter 2 for the CPU count
  • Enter 10Gi for the Memory amount
  • Enter 1 for the GPU count
  • Select US Dollar for the currency
  • Enter 2 for Input Tokens
  • Enter 4 for Output Tokens
  • Click Save Changes

Model Deployment

After a few minutes, the model will be deployed to the specified cluster.

Model Deployment
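Once the rollout finishes, the deployment should appear in the endpoint's model list (the vLLM OpenAI-compatible server exposes `GET /v1/models`). A minimal sketch that extracts model ids from such a response; `list_models` is a helper introduced here, and the sample JSON is a hypothetical captured response:

```shell
# Print the id of each model in a /v1/models response read from stdin.
list_models() {
  python3 -c 'import json,sys
for m in json.load(sys.stdin)["data"]:
    print(m["id"])'
}

# Real usage (placeholder hostname; requires a valid $API_KEY):
#   curl -s https://<endpoint-hostname>/v1/models \
#     -H "Authorization: Bearer $API_KEY" | list_models

# Demonstration with a hypothetical captured response:
echo '{"object":"list","data":[{"id":"gs-deployment","object":"model"}]}' | list_models
# → gs-deployment
```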


6. Model Deployment Sharing

Next, you will share the model with a downstream Tenant Organization to be used by end users.

  • In the Ops console, navigate to GenAI -> Model Deployments
  • Click the "Actions" icon near the previous model deployment and select Manage Sharing
  • Select Specific Organizations
  • Select the downstream Tenant Org to share the model with
  • Click Save Changes

Model Sharing


7. End User Utilization

Finally, you will log in as a tenant end user and call the inference endpoint. In this guide, the commands are executed directly from the Kubernetes cluster.

  • Log into the Developer Hub console as a tenant end user
  • Navigate to GenAI -> Model APIs
  • Click on the previously created model card
  • Click Get an API Key
  • Enter a name for the key
  • Click Create

API Key

  • Copy the key provided and store it in a safe location; it cannot be retrieved again
  • SSH into the Kubernetes Cluster
  • Run the following command to store the key as an environment variable. Be sure to update the command with your key value
export API_KEY=<API KEY VALUE>
  • Copy the cURL command from the console and run it in your terminal. You will see a response from the endpoint to the question asked in the cURL command.
{
  "id": "chatcmpl-c99463c4-28bd-46b3-a2de-20c410e5a506",
  "object": "chat.completion",
  "created": 1771890924,
  "model": "gs-deployment",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "There are several open-source inference libraries available, but one of the most widely used and well-regarded is PyTorch. It has been developed by Facebook and is known for its simplicity and ease of use.\nPyTorch includes a large number of pre-trained models that can be used as starting points for building more complex models. Some popular PyTorch models include ResNet, VGG, and SSD (Segmentation Dynamics).\nAnother open-source inference library is Keras, which is an extension of TensorFlow. While it does not have the same level of support as PyTorch, Keras offers many similar features and is easy to learn.\nBoth PyTorch and Keras are excellent choices for developing and training machine learning models on a variety of hardware platforms.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 30,
    "total_tokens": 184,
    "completion_tokens": 154,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
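For reference, the request behind the console's cURL snippet follows the OpenAI chat-completions format that vLLM serves. A hedged sketch that builds and validates such a payload locally; the hostname, model name, and prompt are placeholders (the model name is taken from the sample response above), and the console-provided command remains authoritative:

```shell
# Placeholder endpoint hostname; substitute your own.
HOST=gs.paas.demo.gorafay.net

# Chat-completions payload in the OpenAI-compatible format served by vLLM.
PAYLOAD=$(cat <<'EOF'
{
  "model": "gs-deployment",
  "messages": [
    {"role": "user", "content": "What is a good open source inference library?"}
  ]
}
EOF
)

# Validate the JSON locally before sending anything.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"
# → payload OK

# The request itself (requires network access and a valid $API_KEY):
# curl -s "https://$HOST/v1/chat/completions" \
#   -H "Authorization: Bearer $API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```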
  • Navigate to GenAI -> Token Usage

After a few minutes, the usage metrics will be populated and the user can view their historical usage and total spend.
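The figures shown under Token Usage roll up the usage block returned with each response. A minimal sketch that pulls the reply text and token counts out of a saved response; the JSON here is a trimmed, hypothetical copy of the sample response above:

```shell
# Trimmed copy of a chat.completion response, saved for inspection.
cat > response.json <<'EOF'
{
  "model": "gs-deployment",
  "choices": [
    {"message": {"role": "assistant", "content": "PyTorch is widely used."}}
  ],
  "usage": {"prompt_tokens": 30, "completion_tokens": 154, "total_tokens": 184}
}
EOF

# Pull out the assistant reply and the token counts.
python3 <<'EOF'
import json
with open("response.json") as f:
    r = json.load(f)
print(r["choices"][0]["message"]["content"])
u = r["usage"]
print(f"tokens: {u['prompt_tokens']} in, {u['completion_tokens']} out, {u['total_tokens']} total")
EOF
# → PyTorch is widely used.
# → tokens: 30 in, 154 out, 184 total
```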

Token Usage