Ray Serve

Ray Serve is a scalable and flexible model serving library built on Ray, designed to deploy machine learning models, manage their lifecycle, and scale them across a distributed cluster. Unlike many traditional model serving solutions, Ray Serve handles both batch and online inference, and it can serve not only machine learning models but also plain Python functions, making it highly versatile.


Key Features

Scalable Model Serving

Ray Serve is built on top of Ray’s distributed computing framework, which allows it to scale model inference across multiple machines and handle high levels of concurrent requests. By automatically scaling based on traffic and resource utilization, Ray Serve ensures that machine learning models can be served efficiently in production, whether the workload is small or enterprise-scale.
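For example, a deployment can declare its resource needs directly in the deployment decorator and hand replica scaling over to Serve. The sketch below is illustrative only; the autoscaling option names reflect one Ray Serve release and may differ in yours.

from ray import serve

# Illustrative only: let Serve scale replicas between 1 and 8 based on load.
# The autoscaling_config keys below may vary between Ray Serve versions.
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 5,
    },
    ray_actor_options={"num_cpus": 1},  # resources requested per replica
)
class ScalableModel:
    def __call__(self, request):
        return {"status": "ok"}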

Python-native API

One of Ray Serve’s most appealing features is its Python-native API. Users can deploy models and serve requests using familiar Python code, without needing to rely on external serving platforms. This makes it easy to integrate Ray Serve into existing Python workflows, and it reduces the complexity of managing different environments for development and production.
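As a minimal sketch, the snippet below turns an ordinary Python function into a deployment; it assumes the same serve.start()/deploy() style used in the full example at the end of this page.

from ray import serve

# A deployment is just decorated Python code; no separate serving
# configuration files or external platforms are required.
@serve.deployment
def greet(request):
    return {"message": "hello from plain Python"}

# With Ray and Serve already started, this exposes the function over HTTP.
greet.deploy()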

Support for Any Model Framework

Ray Serve supports serving models from any framework, including TensorFlow, PyTorch, Scikit-learn, XGBoost, and more. This flexibility allows users to deploy models written in various machine learning libraries, making it a good fit for heterogeneous environments where multiple types of models are developed and deployed.
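For instance, a framework-specific model can be wrapped in an ordinary deployment class. The sketch below assumes a pickled scikit-learn estimator at a hypothetical model.pkl path and a comma-separated feature string in the query parameters.

import pickle
from ray import serve

@serve.deployment
class SklearnModel:
    def __init__(self, model_path: str = "model.pkl"):
        # Load any pickled scikit-learn estimator; the path is a placeholder.
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def __call__(self, request):
        # Parse features such as "1.0,2.0,3.0" from the query string.
        row = [float(x) for x in request.query_params["features"].split(",")]
        prediction = self.model.predict([row])[0]
        return {"prediction": float(prediction)}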

Asynchronous and Batch Processing

Ray Serve supports both synchronous and asynchronous request handling, which is important for applications that require low-latency responses, such as real-time online inference. Additionally, it can batch requests to improve throughput and performance, especially for models that are optimized for batch processing. The ability to control batching and concurrency is a key feature for optimizing system performance.
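A rough sketch of request batching is shown below: concurrent calls are grouped into a single list so the model can process them in one pass. The serve.batch decorator and its parameter names may differ slightly between Ray Serve versions.

from typing import List
from ray import serve

@serve.deployment
class BatchedDoubler:
    # Group up to 8 concurrent requests, waiting at most 50 ms to fill a batch.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def handle_batch(self, numbers: List[int]) -> List[dict]:
        # One pass over the whole batch instead of one call per request.
        return [{"result": n * 2} for n in numbers]

    async def __call__(self, request):
        number = int(request.query_params["number"])
        return await self.handle_batch(number)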

Composability of Models and Pipelines

Ray Serve allows users to compose multiple models and endpoints into a single service, enabling complex inference pipelines. These pipelines can consist of various stages where each stage performs a different model inference or pre/post-processing step. This is particularly useful for applications like ensemble learning or workflows where multiple models are applied sequentially.
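The sketch below composes a preprocessing step and a model into a single pipeline deployment. Handle wiring and the exact handle-call syntax (for example, await handle.method.remote(...)) vary between Ray Serve releases, so treat this as illustrative.

from ray import serve

@serve.deployment
class Preprocessor:
    def transform(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Classifier:
    def predict(self, text: str) -> dict:
        return {"label": "positive" if "good" in text else "negative"}

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor_handle, classifier_handle):
        # Downstream deployments are reached through handles passed in here.
        self.preprocessor = preprocessor_handle
        self.classifier = classifier_handle

    async def __call__(self, request):
        text = request.query_params["text"]
        cleaned = await self.preprocessor.transform.remote(text)
        return await self.classifier.predict.remote(cleaned)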

Traffic Splitting and Versioning

Ray Serve provides native support for traffic splitting between different models or model versions. This feature enables A/B testing, where different versions of a model can serve a percentage of the incoming traffic, or canary deployments, where a new model version is gradually rolled out while the previous version is still active. This helps ensure smooth transitions when upgrading models in production.
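One way to picture this is a small routing deployment that sends a fraction of traffic to a candidate model while the rest goes to the stable version; depending on your Ray Serve release, built-in traffic splitting or multi-application routing may do this for you. The sketch below is plain Python and purely illustrative.

import random
from ray import serve

@serve.deployment
class ABRouter:
    def __init__(self, stable_handle, candidate_handle, candidate_share: float = 0.1):
        # Send roughly 10% of requests to the candidate model version.
        self.stable = stable_handle
        self.candidate = candidate_handle
        self.candidate_share = candidate_share

    async def __call__(self, request):
        # The downstream deployments here accept a plain integer when
        # called through a handle rather than over HTTP.
        target = self.candidate if random.random() < self.candidate_share else self.stable
        number = int(request.query_params["number"])
        return await target.remote(number)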

Dynamic API for Fast Updates

With Ray Serve’s dynamic API, models and endpoints can be updated or modified without restarting the entire service. This makes it easy to deploy new models, change routing policies, or adjust traffic splits on the fly, which is particularly useful in production environments where uptime and flexibility are critical.
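As a sketch, redeploying a deployment under the same name rolls its replicas over to the new code or configuration while Serve keeps running; the .options(...) call below is illustrative and its parameters may differ between versions.

from ray import serve

@serve.deployment(num_replicas=2)
class TextModel:
    def __call__(self, request):
        return {"version": "v1"}

TextModel.deploy()

# Later, scale the same deployment up on the fly; the Serve instance and its
# other deployments keep running while the replicas are rolled over.
TextModel.options(num_replicas=4).deploy()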

Integration with Ray’s Ecosystem

Ray Serve integrates seamlessly with the rest of Ray’s ecosystem, including Ray Tune (for hyperparameter tuning), Ray Train (for distributed training), and Ray Datasets (for data preprocessing). This allows users to manage the entire lifecycle of their machine learning models—from training to deployment—within a single framework.


Benefits

Ease of Use

With its Python-native API, Ray Serve simplifies the process of deploying and serving machine learning models. Developers and data scientists can deploy models directly from their Python code, avoiding the need for complex infrastructure management or external serving solutions.

Scalability and Flexibility

Ray Serve can scale to serve thousands of requests per second, ensuring that machine learning applications can handle varying traffic loads. Its flexibility to serve any type of model, along with the ability to scale horizontally across clusters, makes it a powerful tool for large-scale machine learning applications.

Efficient Resource Utilization

By enabling batch processing and asynchronous request handling, Ray Serve optimizes resource utilization, making it cost-effective to serve models at scale. Users can adjust batching and concurrency to meet specific performance requirements, ensuring efficient use of computational resources.

Continuous Integration and Deployment (CI/CD)

Ray Serve’s support for traffic splitting, model versioning, and dynamic updates allows teams to implement robust CI/CD pipelines. Models can be tested in production environments with minimal risk, and new versions can be rolled out gradually to ensure stability.

Unified Framework

With Ray Serve, organizations can leverage the entire Ray ecosystem for training, tuning, deployment, and serving, providing a unified framework for managing the entire machine learning lifecycle. This reduces the complexity of managing multiple tools and platforms for different stages of model development and deployment.


Example

Shown below is a simple example of how to use Ray Serve to deploy a basic Python function as a service and serve HTTP requests.

Note

This example assumes that the user has already launched a Ray as a Service tenant on the shared host cluster.

In this example, we will deploy a simple service that doubles a number. Ray Serve will expose it as an HTTP endpoint.

Create the Ray Serve Endpoint

import ray
from ray import serve

# Initialize Ray and Ray Serve
ray.init()
serve.start()

# Define a simple deployment
@serve.deployment
class SimpleService:
    def __call__(self, request):
        # Extract the number from the query string and return its double
        number = int(request.query_params["number"])
        return {"result": number * 2}

# Deploy the service
SimpleService.deploy()

# Serve is now running, and we can send requests to the HTTP endpoint.

Query the Ray Serve Endpoint

Once the service is deployed, you can send requests to the Ray Serve HTTP endpoint. To query the service, use curl or any HTTP client:

curl "http://<Ray Endpoint URL>:8000/SimpleService?number=5"

You should receive the following JSON response:

{
  "result": 10
}
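
You can also query the endpoint from Python; the snippet below assumes the requests package is installed and uses the same placeholder endpoint URL.

import requests

# Replace <Ray Endpoint URL> with the address of your Serve HTTP proxy.
response = requests.get(
    "http://<Ray Endpoint URL>:8000/SimpleService",
    params={"number": 5},
)
print(response.json())  # {"result": 10}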

Shut Down Ray Serve

When you're done testing, you can shut down Ray Serve and the Ray runtime with the following calls.

serve.shutdown()
ray.shutdown()