
Get Started with Token Factory and NVIDIA Dynamo

Overview

This guide walks you through deploying a token-metered AI model endpoint using Rafay Token Factory with NVIDIA Dynamo as the inference engine. You will create a model deployment with NVIDIA Dynamo and validate tenant access via the integrated playground.

Release Information
NVIDIA Dynamo support is available from release v3.1-39 onwards.

Prerequisite: Before creating a Dynamo model deployment, the Dynamo platform Helm chart must be installed on the compute cluster:
helm repo add dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/
helm repo update
helm install dynamo-platform \
  https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.1.1.tgz \
  --namespace dynamo-system \
  --create-namespace
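
Before moving on, it can help to confirm that the release installed cleanly. The following is a minimal sketch, assuming helm and kubectl are both configured against the compute cluster; the function name check_dynamo_platform is illustrative and not part of Rafay or Dynamo.

```shell
# Sketch: confirm the Dynamo platform release exists and list its pods.
# Assumes helm and kubectl both point at the compute cluster.
check_dynamo_platform() {
  # Show the status of the release created by the helm install above.
  helm status dynamo-platform --namespace dynamo-system || return 1
  # List the pods in the release namespace so their state can be inspected.
  kubectl get pods --namespace dynamo-system
}
```

Run check_dynamo_platform and check that the release status is deployed and that the pods reach Running before creating the model deployment.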

Assumptions

This guide assumes you have already completed the Token Factory basics, including creating a model, endpoint, compute cluster, and provider. If you have not done this yet, refer to the Get Started with Token Factory guide before proceeding.

This guide picks up from where the basics guide leaves off and focuses specifically on creating a model deployment using NVIDIA Dynamo as the inference engine.

The following are assumed to be in place:

  • A Rafay Platform account with Token Factory enabled
  • A compute cluster with NVIDIA GPUs registered in your Rafay org
  • The NVIDIA Dynamo platform Helm chart installed on the compute cluster (see prerequisite callout above)
  • A model registered in Token Factory
  • A compute endpoint registered in Token Factory
  • Partner admin access to the Rafay Operations Console
  • At least one tenant organization created in your Rafay org

What You Will Do

Step  Action
1     Create a model deployment with NVIDIA Dynamo: configure the inference engine and decode worker, then set pricing
2     Share the deployment with a tenant organization
3     Validate access as a tenant user via the playground

Step 1 — Create a Model Deployment

Navigate to Token Factory → Model Deployments and click New Model Deployment.

Populate the fields below:

Details

Field        What to enter
Name         A unique name for this deployment (e.g. qwen-0.5B-gs-dynamo)
Description  Optional; useful for recording configuration notes

Model & Endpoint

Field     What to enter
Model     Select an existing model
Endpoint  Select an existing endpoint

In this example, we will be using the qwen-0.5B-instruct model.


Inference Engine

In the Inference Engine section, select NVIDIA Dynamo.

NVIDIA Dynamo is a modular, low-latency inference framework. It supports disaggregated prefill and decode, which runs the compute-intensive prompt processing (prefill) and the token generation loop (decode) on separate worker pools. Each stage can then be scaled and tuned independently, which can increase throughput for large models on multi-GPU clusters.

Backend Framework

Select the backend framework that matches your model and performance requirements:

Option        Best for
vLLM          High-throughput serving with continuous batching
SGLang        Fast prefix caching and low-latency pipelines
TensorRT-LLM  Maximum GPU-optimized throughput on NVIDIA hardware

In this example, we will be using vLLM.

KV Cache Offload

Leave Enable KV Cache Offload disabled unless your cluster has NVLink-connected host memory configured for KV offload.

In this example, we will leave KV Cache Offload disabled.


Disaggregated Prefill

Enable the Disaggregated Prefill toggle to run prefill and decode on separate workers. When enabled, a Prefill Worker configuration section appears alongside the Decode Worker section.

Enable disaggregated prefill when:

  • Your model is large (≥ 30B parameters)
  • Your workload has long input prompts relative to output length
  • You want to maximize cluster GPU utilization

Leave it disabled for smaller models or when running on a single node.

In this example, we will leave Disaggregated Prefill disabled.
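
The guidance above can be captured as a rough rule of thumb. This is a sketch only: the function name recommend_disaggregation is hypothetical, it encodes just the 30B parameter threshold and the single-node exclusion stated above, and it does not model prompt length or GPU utilization, which still need judgment.

```shell
# recommend_disaggregation PARAMS_BILLIONS NODE_COUNT
# Prints "enable" for large models (>= 30B parameters) on more than one node,
# and "disable" otherwise. Prompt/output length ratio is not modeled here.
recommend_disaggregation() {
  awk -v params="$1" -v nodes="$2" 'BEGIN {
    if (params >= 30 && nodes > 1) print "enable"; else print "disable"
  }'
}

recommend_disaggregation 0.5 1   # prints "disable"
```

For the 0.5B example model on a single node, the sketch agrees with the choice made in this guide.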

Configure the Decode Worker

The decode worker runs the main inference loop and token generation. Configure it under the Decode Worker section.

Container

Enter the NVIDIA NGC container image for your selected backend:

Backend       Runtime Image
TensorRT-LLM  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.1
vLLM          nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
SGLang        nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.1.1

Field          What to enter
Runtime Image  The NGC image for your selected backend, from the table above
Replicas       Number of decode worker pods
Node Count     Number of nodes per decode worker pod (typically 1)
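
If you script deployments or keep configuration notes, the backend-to-image mapping can be written down once. The sketch below uses the image names and the 1.1.1 tag listed in this guide; the function name runtime_image and the lowercase backend keys are illustrative assumptions.

```shell
# runtime_image BACKEND
# Prints the NGC runtime image for a backend, using the tag from this guide.
runtime_image() {
  tag="1.1.1"
  case "$1" in
    vllm)         echo "nvcr.io/nvidia/ai-dynamo/vllm-runtime:$tag" ;;
    sglang)       echo "nvcr.io/nvidia/ai-dynamo/sglang-runtime:$tag" ;;
    tensorrt-llm) echo "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$tag" ;;
    *)            echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

runtime_image vllm   # prints nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
```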

Resources

Field                What to enter
CPU                  CPU request for the decode worker (e.g. 8)
Memory               Memory request (e.g. 64Gi)
GPU Allocation Type  Select Full GPU (use MIG Slice only for MIG-partitioned clusters)
GPU Quantity         Number of GPUs per decode worker
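
To sanity-check that the cluster can hold the deployment, multiply the per-worker requests by the replica count. This is a back-of-the-envelope sketch that assumes the CPU, memory, and GPU values apply to each decode worker pod and that Node Count is 1; the function name deployment_footprint is hypothetical.

```shell
# deployment_footprint REPLICAS GPUS_PER_WORKER CPU_PER_WORKER MEMORY_GI_PER_WORKER
# Prints the total resources requested by all decode worker replicas.
deployment_footprint() {
  printf 'gpus=%d cpu=%d memory=%dGi\n' \
    "$(( $1 * $2 ))" "$(( $1 * $3 ))" "$(( $1 * $4 ))"
}

deployment_footprint 2 1 8 64   # prints gpus=2 cpu=16 memory=128Gi
```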

Set Pricing

Navigate to the Pricing section.

Field          What to enter
Currency       Select your billing currency (e.g. US Dollar)
Input Tokens   Cost per million input tokens
Output Tokens  Cost per million output tokens
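
Because both prices are expressed per million tokens, the charge for a given amount of usage is simple arithmetic. The sketch below is illustrative only: the function name estimate_cost and the example prices are assumptions, not Rafay defaults.

```shell
# estimate_cost INPUT_TOKENS OUTPUT_TOKENS INPUT_PRICE OUTPUT_PRICE
# Prices are per million tokens, as entered in the Pricing section.
estimate_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" \
    'BEGIN { printf "%.4f\n", (in_tok * in_price + out_tok * out_price) / 1000000 }'
}

# 250,000 input tokens at 0.50 and 100,000 output tokens at 1.50 per million
estimate_cost 250000 100000 0.50 1.50   # prints 0.2750
```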

Click Create Model Deployment. Rafay will schedule the Dynamo workers on the selected compute endpoint.

After a few minutes, the model will be deployed.


Step 2 — Share the Deployment with a Tenant

Once the model deployment has been created, it must be explicitly shared with one or more tenant organizations before tenant users can access it. This gives operators full control over which tenants can consume which models.

Navigate to Token Factory → Model Deployments, click the Actions icon, and click Manage Sharing.

Share with an Organization

  1. Select Specific Organizations
  2. Select the tenant organization(s) to share the model deployment with
  3. Click Save Changes

The deployment will now appear in the tenant organization's Token Factory view. Repeat this for each tenant org you want to grant access.


Step 3 — Tenant Access via the Playground

Once the deployment is running, users within a tenant organization can access the model through the integrated playground without any API client setup.

Ensure you have an API key created before proceeding to the playground.

As a Tenant User — Open the Playground

  1. Log in to the Rafay console using your tenant organization credentials
  2. In Developer Hub, navigate to Token Factory → Playground

Send Your First Request

In the playground:

  1. Select the model from the model dropdown
  2. Optionally adjust Output Length, Temperature, and Response Format in the settings panel
  3. Type a prompt in the input box and press Send

Responses stream back token by token. TTFT and throughput metrics are displayed on the screen.
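
For readers unfamiliar with these metrics, the sketch below shows one common way to derive them from request timestamps. The playground's exact formulas are not documented here, so treat the definitions as assumptions: TTFT (time to first token) is the delay between sending the request and receiving the first token, and throughput is output tokens per second over the generation window. The function names are hypothetical.

```shell
# ttft_ms REQUEST_SENT_MS FIRST_TOKEN_MS
# Time to first token, in milliseconds.
ttft_ms() {
  echo "$(( $2 - $1 ))"
}

# tokens_per_second OUTPUT_TOKENS FIRST_TOKEN_MS LAST_TOKEN_MS
# Output tokens per second between the first and last token.
tokens_per_second() {
  awk -v n="$1" -v first="$2" -v last="$3" 'BEGIN {
    if (last <= first) { print "n/a"; exit }
    printf "%.1f\n", n * 1000 / (last - first)
  }'
}

ttft_ms 0 180                  # prints 180
tokens_per_second 100 180 2180 # prints 50.0
```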

View Token Usage

Tenant users can monitor their own consumption at any time:

  1. Navigate to Token Factory → Token Usage
  2. Filter by model deployment
  3. Review the usage, which is broken down by Input Tokens and Output Tokens to match the pricing set in the model deployment
