Advanced
In this guide, you will set up a storage namespace for hosting a GenAI model pulled from Hugging Face. Storage namespaces allow models to be loaded efficiently into GPU memory and avoid repeated downloads from remote repositories.
Assumptions¶
This exercise assumes you have completed the Token Factory Basics Get Started Guide and you have access to an Amazon S3 bucket for hosting models.
1. Create Storage Namespace¶
In this section, you will create the storage namespace.
- In the Ops console, navigate to GenAI -> Storage Namespaces
- Click New Storage Namespace
- Enter a name for the storage namespace
- Select AWS S3 for the Storage Option
- Enter the following details for the credentials:
- Bucket Name - AWS S3 Bucket Name
- Region - AWS S3 Bucket Region
- Access Key - AWS Access Key
- Secret Key - AWS Secret Key
- Click Save Changes
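Before saving, it can help to confirm that the credentials actually work against the bucket. A minimal sanity check with the AWS CLI, using hypothetical bucket and region values (substitute your own):

```shell
# Hypothetical values; substitute your own bucket name and region.
BUCKET="my-model-bucket"
REGION="us-west-2"

# Sanity check: confirm the access key/secret pair can list the bucket
# before entering it in the console (uncomment to run; requires the AWS CLI):
# aws s3 ls "s3://${BUCKET}" --region "${REGION}"
echo "Registering bucket ${BUCKET} in ${REGION}"
```

If the listing fails with an access error, fix the IAM permissions before registering the bucket in the console.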
2. Create Storage Namespace Access Keys¶
In this section, you will create an access key that will be used to load content into the storage namespace.
- In the Ops console, navigate to GenAI -> Storage Namespaces
- Click on the Access Key tab
- Click New Access Key
- Copy the Access Key name and Secret Key and store them for later use
3. Create Model¶
In this section, you will create a model. A model represents the LLM that the administrator onboards into the system. Each model includes core metadata such as name, description, provider, and use case. This information determines how the model is organized, how it appears in the console, and how it will later be deployed for inference.
- In the Ops console, navigate to GenAI -> Models
- Click New Model
- Enter qwen-0.5B-storage-namespace for the name
LLM Use Case¶
- Select Chat for the use case from the dropdown
Info
For the model to support chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
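You can check whether a model ships a chat template by inspecting the `chat_template` key of its `tokenizer_config.json`. A minimal sketch using a hypothetical config file written locally (a real one comes from the model repository; the Jinja2 snippet below is only illustrative of the Qwen2 style):

```python
import json
import os
import tempfile

# Hypothetical tokenizer_config.json -- a real one comes with the model weights.
config = {
    "tokenizer_class": "Qwen2Tokenizer",
    "chat_template": (
        "{% for message in messages %}"
        "<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n"
        "{% endfor %}"
    ),
}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_config.json")
with open(path, "w") as f:
    json.dump(config, f)

# vLLM's chat protocol needs this key to be present and non-empty.
with open(path) as f:
    has_template = bool(json.load(f).get("chat_template"))

print("chat template present:", has_template)
```

If the key is missing or empty, the model cannot be served with the Chat use case as-is.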
Model Selection and Location¶
In this guide, the deployments will download the model weights during deployment from the Storage Namespace.
- Select qwen for the provider
- Select Storage Namespace for the Repository
- Select the previously created storage namespace for the storage namespace
- Click Save
Important
A newly created model has zero deployments by default. One or more deployments can be created for the same model, each with different endpoints or runtime configurations.
4. Model Sharing¶
Next, you will share the model with a downstream Tenant Organization to be used by end users.
- In the Ops console, navigate to GenAI -> Models
- Click the "Actions" icon near the previous model and select Manage Sharing
- Select Specific Organizations
- Select the downstream Tenant Org to share the model with
- Click Save Changes
5. Upload Model Content¶
In this section, you will download a model from Hugging Face and upload the model content into the storage namespace.
- In the Ops console, navigate to GenAI -> Models
- Select the previously created model
- The Upload Model Content tab will provide instructions for uploading model content
- Follow the provided instructions to install and configure the AWS CLI.
Install AWS CLI¶
The AWS CLI is used to sync model content to the storage namespace.
- Run the following command to install the AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Configure AWS Credentials¶
Use the Access Key generated on the portal to authenticate the AWS CLI.
- Run the following command to configure the AWS CLI
aws configure
Provide the following:
- AWS Access Key ID — Key created from the Access Key tab
- AWS Secret Access Key — Secret key from the Access Key tab
- Default region — Region used in your storage namespace (e.g., us-west-2)
- Default output format — Optional
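For reference, the values entered above end up in the AWS CLI's standard config files. The equivalent entries look like this (placeholder values shown, not real keys):

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = <ACCESS_KEY_FROM_OPS_PORTAL>
aws_secret_access_key = <SECRET_KEY_FROM_OPS_PORTAL>

# ~/.aws/config
[default]
region = us-west-2
output = json
```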
Important
When configuring the AWS CLI, be sure to enter the Access Keys generated from the Ops Portal, not your AWS account access keys.
Download Model¶
Next, you will use the Hugging Face CLI to download a model locally.
- Run the following command to authenticate with Hugging Face. Note that your Hugging Face token is required to authenticate
Important
Instructions for installing the Hugging Face CLI can be found at https://huggingface.co/docs/huggingface_hub/guides/cli
hf auth login
- Once authenticated, run the following command to download the model locally
hf download Qwen/Qwen2-0.5B-Instruct
- Once downloaded, navigate to the directory where the model is located. The path to the model can be found in the output.
Fetching 10 files: 100%|█████████████████████████████████████| 10/10 [00:02<00:00, 3.87it/s]
Download complete: 1.00GB [00:02, 656MB/s]
/root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d0
Upload Model¶
Next, you will upload the model to the storage namespace.
- Within the model directory, run the Sync model content command provided in the Upload Model Content tab
This command uploads all files from the current directory to the model’s storage location using the Rafay gateway and the selected Storage Namespace.
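The exact command comes from the Upload Model Content tab, but its general shape is an `aws s3 sync` against the Rafay gateway. A dry-run sketch with hypothetical values (the real bucket path and endpoint URL come from the console):

```shell
# Hypothetical values -- copy the real command from the Upload Model Content tab.
BUCKET_PATH="s3://my-model-bucket/qwen-0.5B-storage-namespace"
GATEWAY="https://gateway.example.com"   # hypothetical Rafay gateway endpoint

# Echoed as a dry run; remove the leading 'echo' to actually upload.
echo aws s3 sync . "${BUCKET_PATH}" --endpoint-url "${GATEWAY}"
```

Run it from inside the model snapshot directory so that only the model files are synced.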
- Once the model has finished uploading, navigate to GenAI -> Models
- Select the previously created model and navigate to the Files & Versions tab
After the upload completes successfully, the model content is available in the bucket and the model becomes ready for deployment.
6. Create Model Deployment¶
In this section, you will create a model deployment. Model Deployments define how an LLM is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.
- In the Ops console, navigate to GenAI -> Model Deployments
- Click New Model Deployment
- Enter a name for the model deployment
- Select the previously created model using the storage namespace
- Select the previously created Endpoint
Engine Selection¶
- Select vLLM for the inference engine
- Enter vllm/vllm-openai:v0.14.1 for the vLLM Image
- Enter 1 for the replicas
Note
Operators can select the version of vLLM they wish to use. The vLLM image can be several gigabytes in size, so operators may wish to host the image in a local container registry to ensure fast, reliable deployments.
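As a sketch, mirroring the image into a local registry could look like the following ("registry.internal:5000" is a hypothetical registry host; requires Docker):

```shell
# Mirror the vLLM image into a local registry.
IMAGE="vllm/vllm-openai:v0.14.1"
LOCAL="registry.internal:5000/${IMAGE#*/}"   # strips the "vllm/" prefix

# Echoed as a dry run; remove 'echo' to actually pull, tag, and push.
echo "docker pull ${IMAGE} && docker tag ${IMAGE} ${LOCAL} && docker push ${LOCAL}"
```

The mirrored image reference would then be entered in the vLLM Image field in place of the Docker Hub one.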
Resources¶
Specify the CPU, memory, GPU, and storage resources that you wish to allocate to vLLM.
- Enter 2 for the CPU count
- Enter 10Gi for the Memory amount
- Enter 1 for the GPU count
Metering¶
In this section, the operator specifies how usage is metered: the currency and the rate for every million tokens. The Rafay Token Factory counts input and output tokens separately.
- Select US Dollar for the currency
- Enter 2 for Input Tokens
- Enter 4 for Output Tokens
- Click Save Changes
After a few minutes, the model will be deployed to the specified cluster.
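With the rates above ($2 per million input tokens, $4 per million output tokens), the charge for a single request can be computed directly from its usage counts. A small sketch:

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_rate: float = 2.0, output_rate: float = 4.0) -> float:
    """Cost in USD, given per-million-token rates for input and output."""
    return (prompt_tokens / 1_000_000 * input_rate
            + completion_tokens / 1_000_000 * output_rate)

# Example: the usage counts from the sample response later in this guide.
cost = request_cost(prompt_tokens=30, completion_tokens=256)
print(f"${cost:.6f}")  # 30*2/1e6 + 256*4/1e6 = $0.001084
```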
7. Model Deployment Sharing¶
Next, you will share the model deployment with a downstream Tenant Organization to be used by end users.
- In the Ops console, navigate to GenAI -> Model Deployments
- Click the "Actions" icon near the previous model deployment and select Manage Sharing
- Select Specific Organizations
- Select the downstream Tenant Org to share the model deployment with
- Click Save Changes
8. End User Utilization¶
Finally, you will use a tenant end user account to call the inference endpoint. In this guide, the commands are executed directly from the Kubernetes cluster.
- Log into the Developer Hub console as a tenant end user
- Navigate to GenAI -> Model APIs
- Click on the previously created model card
- Click Get an API Key
- Enter a name for the key
- Click Create
- Copy the key provided and store it in a safe location, as it cannot be retrieved again
- SSH into the Kubernetes Cluster
- Run the following command to store the key as an environment variable. Be sure to update the command with your key value
export API_KEY=<API KEY VALUE>
- Copy the cURL command from the console and run it in your terminal. You will see a response from the endpoint answering the question asked in the cURL command.
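The console-provided command follows the OpenAI-compatible chat completions shape. A hedged sketch with a hypothetical endpoint URL and model name (use the exact values from the console):

```shell
# Hypothetical endpoint and model name -- copy the real command from the console.
ENDPOINT="https://inference.example.com/v1/chat/completions"
MODEL="gs-storage-ns-deployment"

curl -s "${ENDPOINT}" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${MODEL}\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the best open source inference library?\"}], \"max_tokens\": 256}" \
  || echo "request failed (endpoint not reachable from this machine)"
```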
{
"id":"chatcmpl-1c0e2496-0ed0-40ba-9832-c0c2df775f2d",
"object":"chat.completion",
"created":1773340699,
"model":"gs-storage-ns-deployment",
"choices":[
{
"index":0,
"message":{
"role":"assistant",
"content":"There isn't necessarily one \"best\" open-source inference library as what works well for one application may not work so well for another. However, there are several popular and highly regarded libraries that you might consider depending on your specific use case:\n\n1. TensorFlow: A powerful framework developed by Google, widely used for both research and production purposes.\n\n2. PyTorch: Developed by Facebook AI Research (FAIR), it's known for its flexibility and ease of use.\n\n3. ONNX Runtime: This is an open-source runtime for running models produced with the Open Neural Network Exchange (ONNX).\n\n4. Caffe: An old but robust deep learning framework that has seen updates since its original development.\n\n5. MXNet: Another Python-based library designed to be flexible and scalable.\n\n6. TorchScript: Part of the Torch project from the University of Sydney, this allows models written in Torch to run efficiently on CPUs and GPUs.\n\n7. CoreML: Apple’s own machine learning model format, which can be converted to many frameworks including TensorFlow and PyTorch.\n\n8. ML.js: A JavaScript library for building and deploying neural networks.\n\n9. Keras: A high-level neural networks API built on top of Theano or Tensorflow, which is also very popular.\n\n",
"refusal":null,
"annotations":null,
"audio":null,
"function_call":null,
"tool_calls":[
],
"reasoning":null,
"reasoning_content":null
},
"logprobs":null,
"finish_reason":"length",
"stop_reason":null,
"token_ids":null
}
],
"service_tier":null,
"system_fingerprint":null,
"usage":{
"prompt_tokens":30,
"total_tokens":286,
"completion_tokens":256,
"prompt_tokens_details":null
},
"prompt_logprobs":null,
"prompt_token_ids":null,
"kv_transfer_params":null
}
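The response can also be consumed programmatically. A minimal sketch that extracts the assistant message and token usage from a response of the same shape (abbreviated sample shown here instead of a live request):

```python
import json

# Abbreviated sample with the same shape as the response above.
raw = json.dumps({
    "model": "gs-storage-ns-deployment",
    "choices": [{"message": {"role": "assistant",
                             "content": "There isn't necessarily one best library..."},
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 30, "completion_tokens": 256, "total_tokens": 286},
})

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
usage = resp["usage"]

print(answer)
print(f"tokens: {usage['prompt_tokens']} in / {usage['completion_tokens']} out")
```

Note that a `finish_reason` of `"length"` means the completion was truncated at the `max_tokens` limit rather than ending naturally.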








