API Usage

Serverless Inference endpoints hosted by Rafay are fully compatible with OpenAI's API, so you can use familiar OpenAI client libraries with deployed models. This guide explains how to use that compatibility to integrate your models with existing OpenAI-based applications.


Endpoint Structure

You can make OpenAI-compatible API requests by sending requests to this base URL pattern:

https://<inference_endpoint>/v2/ENDPOINT_ID/openai/v1

Supported APIs

The following core OpenAI API endpoints are supported:

Endpoint            Description                       Status
/chat/completions   Generate chat model completions   Supported
/completions        Generate text completions         Supported
/models             List available models             Supported

Model Naming

The MODEL_NAME value is required for all OpenAI-compatible API requests. It corresponds to the model that has been deployed (e.g., mistralai/Mistral-7B-Instruct-v0.2).

Important

This model name is used in chat and text completion API requests to identify which model should process your request.


Initialize Client

Before you can send API requests, set up an OpenAI client with your API key and the endpoint URL. An illustrative example is shown below.

from openai import OpenAI

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # Use your deployed model

# Replace ENDPOINT_ID and API_KEY with your actual values
client = OpenAI(
    api_key="API_KEY",
    base_url="https://<inference_endpoint>/v2/ENDPOINT_ID/openai/v1",
)
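
Because the /models endpoint is supported, you can confirm that the client is configured correctly by listing the models available on the endpoint. Below is a minimal sketch using the client initialized above:

# List the models served by the endpoint (uses the /models endpoint)
models = client.models.list()
for model in models.data:
    print(model.id)

The deployed model's name should appear in the output; this is the same value you pass as MODEL_NAME in the requests below.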

Chat Completions API

The "/chat/completions" endpoint is designed for instruction-tuned LLMs that follow a chat format.

Request

Shown below is an example request:

from openai import OpenAI
MODEL_NAME = "MODEL_NAME"  # Replace with your actual model

# Replace ENDPOINT_ID and API_KEY with your actual values
client = OpenAI(
    api_key="API_KEY",
    base_url="https://<inference_endpoint>>/v2/ENDPOINT_ID/openai/v1",
)

# Chat completion request (for instruction-tuned models)
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, who are you?"}
    ],
    temperature=0.7,
    max_tokens=500
)

# Print the response
print(response.choices[0].message.content)

Response

The API returns responses in JSON format. Shown below is an example:

{
  "id": "cmpl-123abc",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "I am Mistral, an AI assistant based on the Mistral-7B-Instruct model. How can I help you today?"
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 24,
    "total_tokens": 47
  }
}
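
When you use the Python client, the same fields are available as attributes on the returned response object rather than raw JSON. For example:

# Inspect the structured response returned by the client
print(response.choices[0].finish_reason)   # e.g., "stop"
print(response.usage.prompt_tokens)        # tokens in the prompt
print(response.usage.completion_tokens)    # tokens generated
print(response.usage.total_tokens)         # combined count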