SLA & High Availability

Always run at least two replicas

A single-replica deployment has no redundancy. If the GPU or pod fails, the endpoint goes down until recovery. With two replicas, the AI Gateway load balances traffic across both. If one fails, the other continues serving without interruption.

Set Replicas to 2 or more on every production deployment, both during endpoint configuration and during model deployment creation.

Run with headroom, not at full capacity

Load balancing across two replicas only helps during failover if each replica has room to absorb extra traffic. If both replicas are running at full GPU utilization under normal conditions and one goes down, the survivor is immediately saturated — it has no capacity to absorb the redirected traffic. Size your CPU, memory, and GPU resources so each replica runs well below its limit under normal load. That headroom is what makes failover actually work.
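The failover math is worth making concrete. This short sketch (the utilization figures are illustrative assumptions, not platform guidance) computes the load each surviving replica absorbs when one replica fails and traffic is redistributed evenly:

```python
def survivor_utilization(per_replica_util: float, replicas: int) -> float:
    """Utilization of each surviving replica after one replica fails,
    assuming total traffic is redistributed evenly across survivors."""
    total_load = per_replica_util * replicas
    return total_load / (replicas - 1)

# Two replicas each at 45% utilization: the survivor lands at 90%,
# close to the edge but still serving.
tight = survivor_utilization(0.45, 2)

# Two replicas each at 60%: the survivor would need 120% capacity,
# i.e. it saturates and starts shedding or queueing requests.
saturated = survivor_utilization(0.60, 2)
```

With two replicas, any steady-state utilization above 50% means the survivor cannot absorb a full failover, which is why sizing for headroom is part of the HA story rather than an optimization.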

| Replicas | Failure Behavior | Suitable For |
| --- | --- | --- |
| 1 | Complete outage until recovery | Dev / test only |
| 2, no headroom | Survivor saturates immediately | Not recommended for SLA commitments |
| 2+, with headroom | Survivor absorbs traffic, service continues | Production HA deployments |

Enable Auto Scaling

The Replicas setting is your floor — the minimum running at all times. Enable Auto Scaling in the Model Deployment configuration to let the platform add replicas during traffic spikes and return to the baseline when demand drops. This keeps your SLA intact without permanently over-provisioning GPU capacity.
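The platform's actual scaling policy is not documented here, but the floor-plus-target behavior described above can be sketched as a utilization-targeting decision function. All names and thresholds below are illustrative assumptions:

```python
import math

def desired_replicas(current: int, utilization: float,
                     floor: int = 2, ceiling: int = 8,
                     target: float = 0.6) -> int:
    """Illustrative autoscaler decision (not the platform's actual policy).

    Proposes a replica count that would bring average utilization back
    toward `target`, clamped so it never drops below the configured
    floor (the Replicas setting) or exceeds the ceiling.
    """
    if utilization <= 0:
        return floor
    proposed = math.ceil(current * utilization / target)
    return max(floor, min(ceiling, proposed))
```

For example, two replicas averaging 90% utilization against a 60% target would scale to three; when traffic drops, the count returns toward the floor rather than below it, so the baseline redundancy is preserved.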

Run a load test before going live

Before exposing a deployment to production traffic, simulate realistic load using the approach documented in the Token Factory Intermediate guide. Confirm that latency and throughput hold at expected peak load, and that each replica has headroom to spare. If not, increase replicas or resources before launch.
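The exact tooling from the Token Factory Intermediate guide is not reproduced here, but a minimal load-test harness looks roughly like the following sketch. It is parameterized over a `send_request` callable (an assumption, standing in for a real HTTP call to your endpoint) so the same harness works against any deployment:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send_request, concurrency: int, total_requests: int) -> dict:
    """Illustrative closed-loop load test: fires `total_requests` calls
    across `concurrency` workers and reports latency percentiles and
    overall throughput."""
    start = time.perf_counter()

    def timed_call(_):
        t0 = time.perf_counter()
        send_request()
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    elapsed = time.perf_counter() - start
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "throughput_rps": total_requests / elapsed,
    }

# Stand-in request that simulates a 10 ms round trip; replace with a
# real call to your endpoint when testing an actual deployment.
stats = load_test(lambda: time.sleep(0.01), concurrency=8, total_requests=100)
```

Compare the reported p95 latency and throughput against your expected peak; if the numbers only hold when replicas run near full utilization, you have no failover headroom and should add capacity before launch.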

Configure rate limits to protect shared deployments

When a deployment is shared across multiple users or organizations, rate limits prevent any single consumer from saturating the endpoint. Configure limits at the Organization, User, and API Key levels in the Rate Limiting section of the Model Deployment form — both Max Tokens per Minute and Max Requests per Minute.
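The platform enforces these limits server-side, but the semantics of a per-minute limit are worth seeing concretely. Below is a token-bucket sketch of the general mechanism (the scope names and limit values are illustrative assumptions, not platform defaults):

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter: capacity is the per-minute
    limit, refilled continuously at limit/60 tokens per second."""

    def __init__(self, max_per_minute: int):
        self.capacity = max_per_minute
        self.tokens = float(max_per_minute)
        self.refill_rate = max_per_minute / 60.0
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per scope mirrors the three configurable levels; a request
# is admitted only if every applicable limit allows it.
limits = {"org": TokenBucket(600), "user": TokenBucket(120), "key": TokenBucket(60)}

def admit(request_cost: float = 1.0) -> bool:
    return all(bucket.allow(request_cost) for bucket in limits.values())
```

Layering the scopes this way is what protects shared deployments: a single API key exhausting its bucket is throttled without consuming the organization-wide allowance that other consumers depend on.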

Failover Example