API Usage
Serverless Inference endpoints hosted by Rafay are fully compatible with OpenAI's API, so you can use familiar OpenAI client libraries with your deployed models. This guide explains how to leverage this compatibility to integrate deployed models with existing OpenAI-based applications.
Endpoint Structure

You can make OpenAI-compatible API requests using this base URL pattern:

```
https://<inference_endpoint>/v2/ENDPOINT_ID/openai/v1
```
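For example, with a hypothetical endpoint host and endpoint ID (placeholders, not real values), the resolved base URL would look like:

```
https://inference.example.com/v2/abc123/openai/v1
```

Here `<inference_endpoint>` is the host of your Rafay inference endpoint and `ENDPOINT_ID` identifies your specific deployment.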
Supported APIs
The following core OpenAI API endpoints are supported:
| Endpoint | Description | Status |
|---|---|---|
| `/chat/completions` | Generate chat model completions | Supported |
| `/completions` | Generate text completions | Supported |
| `/models` | List available models | Supported |
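Since the `/models` endpoint is supported, you can query it to confirm which models your deployment exposes. A minimal sketch using the standard OpenAI client (the client setup is covered in detail below; `ENDPOINT_ID` and `API_KEY` are placeholders):

```python
from openai import OpenAI

# Replace ENDPOINT_ID and API_KEY with your actual values
client = OpenAI(
    api_key="API_KEY",
    base_url="https://<inference_endpoint>/v2/ENDPOINT_ID/openai/v1",
)

# List the models served by this endpoint; each entry's `id` is the
# value to use as MODEL_NAME in completion requests
for model in client.models.list():
    print(model.id)
```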
Model Naming

The `MODEL_NAME` variable is required in all OpenAI-compatible API requests. It corresponds to the model that has been deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`).
Important
This model name is used in chat and text completion API requests to identify which model should process your request.
Initialize Client

Before you can send API requests, set up an OpenAI client with your API key and the endpoint URL. An illustrative example is shown below.
```python
from openai import OpenAI

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # Use your deployed model

# Replace ENDPOINT_ID and API_KEY with your actual values
client = OpenAI(
    api_key="API_KEY",
    base_url="https://<inference_endpoint>/v2/ENDPOINT_ID/openai/v1",
)
```
Chat Completions API

The `/chat/completions` endpoint is designed for instruction-tuned LLMs that follow a chat format.
Request

Shown below is an example request:
```python
from openai import OpenAI

MODEL_NAME = "MODEL_NAME"  # Replace with your actual model

# Replace ENDPOINT_ID and API_KEY with your actual values
client = OpenAI(
    api_key="API_KEY",
    base_url="https://<inference_endpoint>/v2/ENDPOINT_ID/openai/v1",
)

# Chat completion request (for instruction-tuned models)
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, who are you?"},
    ],
    temperature=0.7,
    max_tokens=500,
)

# Print the response
print(response.choices[0].message.content)
```
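If your deployment also honors the standard OpenAI `stream` parameter (an assumption; verify against your endpoint's capabilities), the same request can be streamed token by token using the regular client pattern:

```python
# Streaming variant of the chat completion request above.
# Assumes the endpoint supports the standard OpenAI `stream` parameter.
stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, who are you?"},
    ],
    temperature=0.7,
    max_tokens=500,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content may be None
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```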
Response

The API returns responses in JSON format. Shown below is an example:
```json
{
  "id": "cmpl-123abc",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "I am Mistral, an AI assistant based on the Mistral-7B-Instruct model. How can I help you today?"
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 24,
    "total_tokens": 47
  }
}
```
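The `/completions` endpoint works the same way for plain text completions (for example, with base models that are not instruction-tuned). A minimal sketch, reusing the client configured above; the prompt and parameter values are illustrative:

```python
# Text completion request via the /completions endpoint
response = client.completions.create(
    model=MODEL_NAME,
    prompt="The capital of France is",
    temperature=0.7,
    max_tokens=50,
)

# Text completions return generated text directly, not a chat message
print(response.choices[0].text)

# Token accounting mirrors the `usage` block in the chat response
print("Total tokens used:", response.usage.total_tokens)
```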