Rate limits restrict how often a user or client can access the Serverless Inference APIs within a set timeframe.
Rate Limits
Rate limiting refers to the constraints the API enforces on how frequently a user or client can access the services within a given timeframe. A request that exceeds a rate limit is rejected with HTTP status code 429 (Too Many Requests).
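A common way for a client to handle 429 responses is to retry with exponential backoff and jitter. The following is a minimal sketch, not an official client: `call_with_backoff`, its parameters, and the `send` callable (assumed to return a `(status_code, body)` tuple) are hypothetical names chosen for illustration, and the retry counts and delays are arbitrary defaults rather than documented values.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out.

    `send` is a hypothetical callable that performs one API request and
    returns (status_code, body). `sleep` is injectable so the function
    can be exercised without real waiting.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            # Success or a non-rate-limit error: hand it back to the caller.
            return status, body
        if attempt == max_retries:
            break
        # Exponential backoff capped at max_delay, with full jitter so that
        # many clients do not retry at the same instant.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
    # Still rate limited after the final attempt.
    return status, body
```

In practice `send` would wrap whatever HTTP client you already use; the function only needs the status code to decide whether to retry.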
Why Rate Limits
Rate limits are a standard practice for APIs. They safeguard against abuse or misuse of the API and help ensure equitable access with consistent performance for all users.
Rate Limit Approach
Rate limits are currently measured in requests per second (RPS) and tokens per second (TPS) for each model type. If you exceed any of these limits, the API responds with a 429 error.
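To stay under both limits before a request is sent, a client can throttle itself locally. The sketch below uses two token buckets, one for requests and one for tokens. It is only an illustration under stated assumptions: `ClientRateLimiter`, `Bucket`, and `try_acquire` are hypothetical names, the limit values are placeholders (the actual limits depend on the model type), and the service's own accounting may differ from this local estimate.

```python
import time


class Bucket:
    """Token bucket holding up to `rate` units, refilled at `rate` units/second."""

    def __init__(self, rate, now):
        self.rate = float(rate)
        self.level = float(rate)  # start full
        self.updated = now

    def refill(self, now):
        elapsed = now - self.updated
        self.level = min(self.rate, self.level + elapsed * self.rate)
        self.updated = now


class ClientRateLimiter:
    """Local throttle for a requests-per-second and a tokens-per-second budget."""

    def __init__(self, max_rps, max_tps, clock=time.monotonic):
        self.clock = clock  # injectable so the limiter can be tested
        now = clock()
        self.requests = Bucket(max_rps, now)
        self.tokens = Bucket(max_tps, now)

    def try_acquire(self, token_count):
        """Return True and consume budget if one request using `token_count`
        tokens fits in both buckets right now; otherwise return False.

        A request larger than max_tps can never fit in this sketch.
        """
        now = self.clock()
        self.requests.refill(now)
        self.tokens.refill(now)
        if self.requests.level >= 1 and self.tokens.level >= token_count:
            self.requests.level -= 1
            self.tokens.level -= token_count
            return True
        return False
```

A caller would check `try_acquire(estimated_tokens)` before each request and wait briefly when it returns `False`. Checking both buckets before consuming from either keeps a rejected request from using up budget.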
On multi-tenant inference endpoints, you may experience congestion caused by traffic from other users. If you require SLA-backed capacity, a dedicated endpoint is recommended.