Get Started with Token Factory and NVIDIA Dynamo¶

Overview¶

This guide walks you through deploying a token-metered AI model endpoint using Rafay Token Factory with NVIDIA Dynamo as the inference engine. You will create a model deployment with NVIDIA Dynamo and validate tenant access via the integrated playground.

Release Information
NVIDIA Dynamo support is available from release v3.1-39 onwards.

Prerequisite: Before creating a Dynamo model deployment, the Dynamo platform Helm chart must be installed on the compute cluster:

helm repo add dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/
helm repo update
helm install dynamo-platform \
  https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.1.1.tgz \
  --namespace dynamo-system \
  --create-namespace

Assumptions¶

This guide assumes you have already completed the Token Factory basics, including creating a model, endpoint, compute cluster, and provider. If you have not done this yet, refer to the Get Started with Token Factory guide before proceeding.

This guide picks up from where the basics guide leaves off and focuses specifically on creating a model deployment using NVIDIA Dynamo as the inference engine.

The following are assumed to be in place:

A Rafay Platform account with Token Factory enabled
A compute cluster with NVIDIA GPUs registered in your Rafay org
The NVIDIA Dynamo platform Helm chart installed on the compute cluster (see prerequisite callout above)
A model registered in Token Factory
A compute endpoint registered in Token Factory
Partner admin access to the Rafay Operations Console
At least one tenant organization created in your Rafay org

What You Will Do¶

| Step | Action |
|------|--------|
| 1 | Create a model deployment with NVIDIA Dynamo |
| 2 | Configure the Dynamo inference engine and workers |
| 3 | Set rate limits and pricing |
| 4 | Validate the deployment |
| 5 | Access the model as a tenant user via the playground |

Step 1 — Create a Model Deployment¶

Navigate to Token Factory → Model Deployments and click New Model Deployment.

Populate the fields below:

Details¶

| Field | What to enter |
|-------|---------------|
| Name | A unique name for this deployment (e.g. `qwen-0.5B-gs-dynamo`) |
| Description | Optional; useful for recording configuration notes |

Model & Endpoint¶

| Field | What to enter |
|-------|---------------|
| Model | Select an existing model |
| Endpoint | Select an existing endpoint |

In this example, we will be using the qwen-0.5B-instruct model.

Inference Engine¶

In the Inference Engine section, select NVIDIA Dynamo.

NVIDIA Dynamo is a modular inference framework optimized for low latency. It supports disaggregated prefill and decode, which separates the compute-intensive prompt processing (prefill) from the token generation loop (decode) across different worker pools. Each stage can then be scaled and tuned independently, which can increase throughput for large models on multi-GPU clusters.

Backend Framework¶

Select the backend framework that matches your model and performance requirements:

| Option | Best for |
|--------|----------|
| vLLM | High-throughput serving with continuous batching |
| SGLang | Fast prefix caching and low-latency pipelines |
| TensorRT-LLM | Maximum GPU-optimized throughput on NVIDIA hardware |

In this example, we will be using vLLM.

KV Cache Offload¶

Leave Enable KV Cache Offload disabled unless your cluster has NVLink-connected host memory configured for KV offload.

In this example, we will leave KV Cache Offload disabled.

Disaggregated Prefill¶

In this example, we will leave Disaggregated Prefill disabled.

Enable the Disaggregated Prefill toggle to separate prefill and decode workers. When enabled, a Prefill Worker configuration section appears alongside the Decode Worker section.

Enable disaggregated prefill when:

Your model is large (≥ 30B parameters)
Your workload has long input prompts relative to output length
You want to maximize cluster GPU utilization

Leave it disabled for smaller models or when running on a single node.
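
The guidance above can be expressed as a small decision helper. This is only a sketch of one reading of the criteria: the 30B threshold comes from this guide, while the function name, its parameters, and the `prompt_heavy_ratio` cutoff are hypothetical and should be tuned to your own workload.

```python
def suggest_disaggregated_prefill(
    params_billions: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    node_count: int,
    prompt_heavy_ratio: float = 4.0,  # hypothetical cutoff, not from the guide
) -> bool:
    """Return True when the criteria in this guide point toward enabling the toggle."""
    # The guide suggests leaving it disabled for smaller models or a single node.
    if node_count <= 1 or params_billions < 30:
        return False
    # "Long input prompts relative to output length", under the assumed ratio.
    return avg_input_tokens >= prompt_heavy_ratio * avg_output_tokens


# The 0.5B example model on one node keeps the toggle disabled.
small_model = suggest_disaggregated_prefill(0.5, 2000, 200, node_count=1)
# A 70B model on four nodes with prompt-heavy traffic is a candidate.
large_model = suggest_disaggregated_prefill(70, 8000, 500, node_count=4)
print(small_model, large_model)
```

Treat the result as a starting point for experimentation rather than a rule; the goal of maximizing cluster GPU utilization is not modeled here.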

Configure the Decode Worker¶

The decode worker runs the main inference loop and token generation. Configure it under the Decode Worker section.

Container¶

Enter the NVIDIA NGC container image for your selected backend:

| Backend | Runtime Image |
|---------|---------------|
| TensorRT-LLM | `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.1` |
| vLLM | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1` |
| SGLang | `nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.1.1` |
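
If you script your deployment inputs, the image table can be kept as a lookup so the backend choice and the image reference stay in step. The sketch below only mirrors the three rows listed in this guide; the `runtime_image` helper and its `tag` parameter are hypothetical names, not part of any Rafay or NVIDIA tooling.

```python
# Repositories copied from the table in this guide.
RUNTIME_REPOSITORIES = {
    "tensorrt-llm": "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime",
    "vllm": "nvcr.io/nvidia/ai-dynamo/vllm-runtime",
    "sglang": "nvcr.io/nvidia/ai-dynamo/sglang-runtime",
}


def runtime_image(backend: str, tag: str = "1.1.1") -> str:
    """Return the runtime image reference for a backend name (case-insensitive)."""
    key = backend.strip().lower()
    if key not in RUNTIME_REPOSITORIES:
        known = ", ".join(sorted(RUNTIME_REPOSITORIES))
        raise ValueError(f"unknown backend {backend!r}; expected one of: {known}")
    return f"{RUNTIME_REPOSITORIES[key]}:{tag}"


print(runtime_image("vLLM"))
```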

| Field | What to enter |
|-------|---------------|
| Runtime Image | The NGC image for your selected backend, from the table above |
| Replicas | Number of decode worker pods |
| Node Count | Number of nodes per decode worker pod (typically `1`) |

Resources¶

| Field | What to enter |
|-------|---------------|
| CPU | CPU request for the decode worker (e.g. `8`) |
| Memory | Memory request (e.g. `64Gi`) |
| GPU Allocation Type | Select Full GPU (use MIG Slice only for MIG-partitioned clusters) |
| GPU Quantity | Number of GPUs per decode worker |
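
Before submitting the form, it can help to estimate the total footprint the decode workers will request. The sketch below assumes each replica requests the full per-worker CPU, memory, and GPU values and that Node Count is `1`; the `DecodeWorkerSpec` class and both helper functions are hypothetical and do not reflect how the platform schedules pods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeWorkerSpec:
    replicas: int    # Replicas field
    cpu: int         # CPU request per worker
    memory_gib: int  # Memory request per worker, in GiB
    gpus: int        # GPU Quantity per worker


def total_requests(spec: DecodeWorkerSpec) -> dict:
    """Multiply per-worker requests by the replica count."""
    return {
        "cpu": spec.replicas * spec.cpu,
        "memory_gib": spec.replicas * spec.memory_gib,
        "gpus": spec.replicas * spec.gpus,
    }


def fits(spec: DecodeWorkerSpec, free_cpu: int, free_memory_gib: int, free_gpus: int) -> bool:
    """Check the totals against free capacity that you supply yourself."""
    total = total_requests(spec)
    return (
        total["cpu"] <= free_cpu
        and total["memory_gib"] <= free_memory_gib
        and total["gpus"] <= free_gpus
    )


# Two replicas using the example values from this guide (8 CPU, 64Gi) and one GPU each.
spec = DecodeWorkerSpec(replicas=2, cpu=8, memory_gib=64, gpus=1)
print(total_requests(spec))
```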

Set Pricing¶

Navigate to the Pricing section.

| Field | What to enter |
|-------|---------------|
| Currency | Select your billing currency (e.g. US Dollar) |
| Input Tokens | Cost per million input tokens |
| Output Tokens | Cost per million output tokens |
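
Because both prices are quoted per million tokens, the charge for a given amount of usage follows from simple arithmetic. The sketch below illustrates that calculation with made-up prices; the `usage_cost` function is hypothetical, and the platform's actual billing and rounding behaviour is not described in this guide.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def usage_cost(input_tokens: int, output_tokens: int,
               input_price: str, output_price: str) -> Decimal:
    """Cost of usage, given prices per million input and output tokens."""
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    input_cost = Decimal(input_tokens) / TOKENS_PER_MILLION * Decimal(input_price)
    output_cost = Decimal(output_tokens) / TOKENS_PER_MILLION * Decimal(output_price)
    return input_cost + output_cost


# Illustrative prices only: 0.50 per million input tokens, 1.50 per million output tokens.
cost = usage_cost(2_000_000, 500_000, input_price="0.50", output_price="1.50")
print(cost)
```

Here 2 million input tokens cost 1.00 and 500,000 output tokens cost 0.75, for a total of 1.75 in the selected currency.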

Click Create Model Deployment. Rafay will schedule the Dynamo workers on the selected compute endpoint.

After a few minutes, the model will be deployed.

Step 2 — Share the Model Deployment¶

Once the model deployment has been created, it must be explicitly shared with one or more tenant organizations before tenant users can access it. This gives operators full control over which tenants can consume which models.

Navigate to Token Factory → Model Deployments, click the Actions icon, and click Manage Sharing.

Select Specific Organizations
Select the tenant organization(s) to share the model deployment with
Click Save Changes

The deployment will now appear in the tenant organization's Token Factory view. Repeat this for each tenant org you want to grant access.

Step 3 — Tenant Access via the Playground¶

Once the deployment is running, users within a tenant organization can access the model through the integrated playground without any API client setup.

Ensure you have an API key created before proceeding to the playground.

As a Tenant User — Open the Playground¶

Log in to the Rafay console using your tenant organization credentials
In Developer Hub, navigate to Token Factory → Playground

Send Your First Request¶

In the playground:

Select the model from the model dropdown
Optionally adjust Output Length, Temperature, and Response Format in the settings panel
Type a prompt in the input box and press Send

Responses stream back token by token. TTFT and throughput metrics are displayed on the screen.
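
For reference, these two metrics are commonly derived from token arrival times. The sketch below uses one common definition: TTFT is the delay from sending the request to the first token, and throughput is the rate of tokens after the first. How the playground computes its figures is not stated in this guide, and the `stream_metrics` function is a hypothetical helper.

```python
from __future__ import annotations


def stream_metrics(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second) from arrival timestamps."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # Time to first token: delay between sending the request and the first token.
    ttft = token_times[0] - request_sent
    # Decode throughput: tokens after the first, over the time they took to arrive.
    decode_window = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_window <= 0:
        return ttft, 0.0
    return ttft, (len(token_times) - 1) / decode_window


# Request sent at t=0.0 s; five tokens arrive 0.25 s apart, starting at 0.5 s.
ttft, tokens_per_second = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tokens_per_second)
```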

View Token Usage¶

Tenant users can monitor their own consumption at any time:

Navigate to Token Factory → Token Usage
Filter by model deployment
Usage is broken down by Input Tokens and Output Tokens, matching the pricing set in the model deployment