Advanced
In this guide, you will set up a storage namespace for hosting a GenAI model pulled from Hugging Face. Storage namespaces allow models to be loaded efficiently into GPU memory and avoid repeated downloads from remote repositories.
Assumptions¶
This exercise assumes you have completed the Token Factory Basics Get Started Guide and you have access to an Amazon S3 bucket for hosting models.
1. Create Storage Namespace¶
In this section, you will create the storage namespace.
- In the Ops console, navigate to GenAI -> Storage Namespaces
- Click New Storage Namespace
- Enter a name for the storage namespace
- Select AWS S3 for the Storage Option
- Enter the following details for the credentials:
- Bucket Name - AWS S3 Bucket Name
- Region - AWS S3 Bucket Region
- Access Key - AWS Access Key
- Secret Key - AWS Secret Key
- Click Save Changes
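Before saving, it can help to confirm that the credentials actually work against the bucket. A minimal sanity check with the AWS CLI, using hypothetical bucket and region values (substitute your own):

```shell
# Hypothetical values; substitute your own bucket name and region.
BUCKET="my-model-bucket"
REGION="us-west-2"

# Sanity check: confirm the access key/secret pair can list the bucket
# before entering it in the console (uncomment to run; requires the AWS CLI):
# aws s3 ls "s3://${BUCKET}" --region "${REGION}"
echo "Registering bucket ${BUCKET} in ${REGION}"
```

If the listing fails with an access error, fix the IAM permissions before registering the bucket in the console.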
2. Create Storage Namespace Access Keys¶
In this section, you will create an access key that will be used to load content into the storage namespace.
- In the Ops console, navigate to GenAI -> Storage Namespaces
- Click on the Access Key tab
- Click New Access Key
- Copy the Access Key name and Secret Key and store them for later use
3. Create Model¶
In this section, you will create a model. A model represents the LLM that the administrator onboards into the system. Each model includes core metadata such as name, description, provider, and use case. This information determines how the model is organized, how it appears in the console, and how it will later be deployed for inference.
- In the Ops console, navigate to GenAI -> Models
- Click New Model
- Enter qwen-0.5B-storage-namespace for the name
LLM Use Case¶
- Select Chat for the use case from the dropdown
Info
For the model to support chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
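You can check whether a model ships a chat template by inspecting the `chat_template` key of its `tokenizer_config.json`. A minimal sketch using a hypothetical config file written locally (a real one comes from the model repository; the Jinja2 snippet below is only illustrative of the Qwen2 style):

```python
import json
import os
import tempfile

# Hypothetical tokenizer_config.json -- a real one comes with the model weights.
config = {
    "tokenizer_class": "Qwen2Tokenizer",
    "chat_template": (
        "{% for message in messages %}"
        "<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n"
        "{% endfor %}"
    ),
}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_config.json")
with open(path, "w") as f:
    json.dump(config, f)

# vLLM's chat protocol needs this key to be present and non-empty.
with open(path) as f:
    has_template = bool(json.load(f).get("chat_template"))

print("chat template present:", has_template)
```

If the key is missing or empty, the model cannot be served with the Chat use case as-is.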
Model Selection and Location¶
In this guide, the deployments will download the model weights during deployment from the Storage Namespace.
- Select qwen for the provider
- Select Storage Namespace for the Repository
- Select the previously created storage namespace for the storage namespace
- Click Save
Important
A newly created model has zero deployments by default. One or more deployments can be created for the same model, each with different endpoints or runtime configurations.
4. Model Sharing¶
Next, you will share the model with a downstream Tenant Organization to be used by end users.
- In the Ops console, navigate to GenAI -> Models
- Click the "Actions" icon near the previous model and select Manage Sharing
- Select Specific Organizations
- Select the downstream Tenant Org to share the model with
- Click Save Changes
5. Upload Model Content¶
In this section, you will download a model from Hugging Face and upload the model content into the storage namespace.
- In the Ops console, navigate to GenAI -> Models
- Select the previously created model
- The Upload Model Content tab will provide instructions for uploading model content
- Follow the provided instructions to install and configure the AWS CLI.
Install AWS CLI¶
The AWS CLI is used to sync model content to the storage namespace.
- Run the following command to install the AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Configure AWS Credentials¶
Use the Access Key generated on the portal to authenticate the AWS CLI.
- Run the following command to configure the AWS CLI
aws configure
Provide the following:
- AWS Access Key ID — Key created from the Access Key tab
- AWS Secret Access Key — Secret key from the Access Key tab
- Default region — Region used in your storage namespace (e.g., us-west-2)
- Default output format — Optional
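For reference, the values entered above end up in the AWS CLI's standard config files. The equivalent entries look like this (placeholder values shown, not real keys):

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = <ACCESS_KEY_FROM_OPS_PORTAL>
aws_secret_access_key = <SECRET_KEY_FROM_OPS_PORTAL>

# ~/.aws/config
[default]
region = us-west-2
output = json
```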
Important
When configuring the AWS CLI, be sure to enter the Access Keys generated from the Ops Portal, not your AWS account access keys.
Download Model¶
Next, you will use the Hugging Face CLI to download a model locally.
- Run the following command to authenticate with Hugging Face. Note that your Hugging Face token is required to authenticate
Important
Instructions for installing the Hugging Face CLI can be found at https://huggingface.co/docs/huggingface_hub/guides/cli
hf auth login
- Once authenticated, run the following command to download the model locally
hf download Qwen/Qwen2-0.5B-Instruct
- Once downloaded, navigate to the directory where the model is located. The path to the model can be found in the output.
Fetching 10 files: 100%|█████████████████████████████████████| 10/10 [00:02<00:00, 3.87it/s]
Download complete: 1.00GB [00:02, 656MB/s]
/root/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d0
Upload Model¶
Next, you will upload the model to the storage namespace.
- Within the model directory, run the Sync model content command provided in the Upload Model Content tab
This command uploads all files from the current directory to the model’s storage location using the Rafay gateway and the selected Storage Namespace.
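The exact command comes from the Upload Model Content tab, but its general shape is an `aws s3 sync` against the Rafay gateway. A dry-run sketch with hypothetical values (the real bucket path and endpoint URL come from the console):

```shell
# Hypothetical values -- copy the real command from the Upload Model Content tab.
BUCKET_PATH="s3://my-model-bucket/qwen-0.5B-storage-namespace"
GATEWAY="https://gateway.example.com"   # hypothetical Rafay gateway endpoint

# Echoed as a dry run; remove the leading 'echo' to actually upload.
echo aws s3 sync . "${BUCKET_PATH}" --endpoint-url "${GATEWAY}"
```

Run it from inside the model snapshot directory so that only the model files are synced.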
- Once the model has finished uploading, navigate to GenAI -> Models
- Select the previously created model and navigate to the Files & Versions tab
After the upload completes successfully, the model content is available in the bucket and the model becomes ready for deployment.
6. Create Model Deployment¶
In this section, you will create a model deployment. Model Deployments define how an LLM is deployed for inference. A deployment binds a model to an endpoint, selects an inference engine, and configures runtime resources such as replicas, CPU, memory, and GPU.
- In the Ops console, navigate to GenAI -> Model Deployments
- Click New Model Deployment
- Enter a name for the model deployment
- Select the previously created model using the storage namespace
- Select the previously created Endpoint
Engine Selection¶
- Select vLLM for the inference engine
- Enter vllm/vllm-openai:v0.14.1 for the vLLM Image
- Enter 1 for the replicas
Note
Operators can select the version of vLLM they wish to use. The vLLM image can be several gigabytes in size, so operators may wish to host the image in a local container registry to ensure fast, reliable deployments.
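As a sketch, mirroring the image into a local registry could look like the following ("registry.internal:5000" is a hypothetical registry host; requires Docker):

```shell
# Mirror the vLLM image into a local registry.
IMAGE="vllm/vllm-openai:v0.14.1"
LOCAL="registry.internal:5000/${IMAGE#*/}"   # strips the "vllm/" prefix

# Echoed as a dry run; remove 'echo' to actually pull, tag, and push.
echo "docker pull ${IMAGE} && docker tag ${IMAGE} ${LOCAL} && docker push ${LOCAL}"
```

The mirrored image reference would then be entered in the vLLM Image field in place of the Docker Hub one.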
Resources¶
Specify the CPU, memory, GPU, and storage resources that you wish to allocate to vLLM.
- Enter 2 for the CPU count
- Enter 10Gi for the Memory amount
- Enter 1 for the GPU count
Metering¶
In this section, the operator specifies how usage is metered: the currency and the rate for every million tokens. The Rafay Token Factory counts input and output tokens separately.
- Select US Dollar for the currency
- Enter 2 for Input Tokens
- Enter 4 for Output Tokens
- Click Save Changes
After a few minutes, the model will be deployed to the specified cluster.
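With the rates above ($2 per million input tokens, $4 per million output tokens), the charge for a single request can be computed directly from its usage counts. A small sketch:

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_rate: float = 2.0, output_rate: float = 4.0) -> float:
    """Cost in USD, given per-million-token rates for input and output."""
    return (prompt_tokens / 1_000_000 * input_rate
            + completion_tokens / 1_000_000 * output_rate)

# Example: the usage counts from the sample response later in this guide.
cost = request_cost(prompt_tokens=30, completion_tokens=256)
print(f"${cost:.6f}")  # 30*2/1e6 + 256*4/1e6 = $0.001084
```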
7. Model Deployment Sharing¶
Next, you will share the model deployment with a downstream Tenant Organization to be used by end users.
- In the Ops console, navigate to GenAI -> Model Deployments
- Click the "Actions" icon near the previous model deployment and select Manage Sharing
- Select Specific Organizations
- Select the downstream Tenant Org to share the model deployment with
- Click Save Changes
8. End User Utilization¶
Finally, you will use a tenant end user account to call the inference endpoint. In this guide, the commands are executed directly from the Kubernetes cluster.
- Log into the Developer Hub console as a tenant end user
- Navigate to GenAI -> Model APIs
- Click on the previously created model card
- Click Get an API Key
- Enter a name for the key
- Click Create
- Copy the key provided and store it in a safe location, as it cannot be retrieved again
- SSH into the Kubernetes Cluster
- Run the following command to store the key as an environment variable. Be sure to update the command with your key value
export API_KEY=<API KEY VALUE>
- Copy the cURL command from the console and run it in your terminal. You will see a response from the endpoint answering the question asked in the cURL command.
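The console-provided command follows the OpenAI-compatible chat completions shape. A hedged sketch with a hypothetical endpoint URL and model name (use the exact values from the console):

```shell
# Hypothetical endpoint and model name -- copy the real command from the console.
ENDPOINT="https://inference.example.com/v1/chat/completions"
MODEL="gs-storage-ns-deployment"

curl -s "${ENDPOINT}" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${MODEL}\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the best open source inference library?\"}], \"max_tokens\": 256}" \
  || echo "request failed (endpoint not reachable from this machine)"
```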
{
"id":"chatcmpl-1c0e2496-0ed0-40ba-9832-c0c2df775f2d",
"object":"chat.completion",
"created":1773340699,
"model":"gs-storage-ns-deployment",
"choices":[
{
"index":0,
"message":{
"role":"assistant",
"content":"There isn't necessarily one \"best\" open-source inference library as what works well for one application may not work so well for another. However, there are several popular and highly regarded libraries that you might consider depending on your specific use case:\n\n1. TensorFlow: A powerful framework developed by Google, widely used for both research and production purposes.\n\n2. PyTorch: Developed by Facebook AI Research (FAIR), it's known for its flexibility and ease of use.\n\n3. ONNX Runtime: This is an open-source runtime for running models produced with the Open Neural Network Exchange (ONNX).\n\n4. Caffe: An old but robust deep learning framework that has seen updates since its original development.\n\n5. MXNet: Another Python-based library designed to be flexible and scalable.\n\n6. TorchScript: Part of the Torch project from the University of Sydney, this allows models written in Torch to run efficiently on CPUs and GPUs.\n\n7. CoreML: Apple’s own machine learning model format, which can be converted to many frameworks including TensorFlow and PyTorch.\n\n8. ML.js: A JavaScript library for building and deploying neural networks.\n\n9. Keras: A high-level neural networks API built on top of Theano or Tensorflow, which is also very popular.\n\n",
"refusal":null,
"annotations":null,
"audio":null,
"function_call":null,
"tool_calls":[
],
"reasoning":null,
"reasoning_content":null
},
"logprobs":null,
"finish_reason":"length",
"stop_reason":null,
"token_ids":null
}
],
"service_tier":null,
"system_fingerprint":null,
"usage":{
"prompt_tokens":30,
"total_tokens":286,
"completion_tokens":256,
"prompt_tokens_details":null
},
"prompt_logprobs":null,
"prompt_token_ids":null,
"kv_transfer_params":null
}
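The response can also be consumed programmatically. A minimal sketch that extracts the assistant message and token usage from a response of the same shape (abbreviated sample shown here instead of a live request):

```python
import json

# Abbreviated sample with the same shape as the response above.
raw = json.dumps({
    "model": "gs-storage-ns-deployment",
    "choices": [{"message": {"role": "assistant",
                             "content": "There isn't necessarily one best library..."},
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 30, "completion_tokens": 256, "total_tokens": 286},
})

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
usage = resp["usage"]

print(answer)
print(f"tokens: {usage['prompt_tokens']} in / {usage['completion_tokens']} out")
```

Note that a `finish_reason` of `"length"` means the completion was truncated at the `max_tokens` limit rather than ending naturally.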








