Use

When working with a remote Ray cluster on the "Ray as Service Tenant", there are several ways to connect to the cluster, submit jobs, and manage workloads. The right method depends on the type of workload (e.g., batch jobs, distributed ML training, or real-time applications) and on where the Ray cluster runs (e.g., cloud or on-premises). Ray Client and the Ray Job Submission API both let you interact with a remote Ray cluster, but they suit different use cases. Here is a brief comparison of the two approaches, including their pros and cons.

Note

In our examples and guides, we will be using the Ray Job Submission API approach.

Ray Client

The Ray Client allows you to connect to a remote Ray cluster and interact with it as though you were running Ray locally. You can use all the familiar Ray APIs, such as defining remote functions and actors, directly in your local environment while executing the actual tasks remotely.

How It Works

You connect to a remote Ray cluster using ray.init("ray://<remote-endpoint>") and then use Ray just as you would in a local setup. Here is an example:

import ray

# Connect to the remote Ray cluster
ray.init("ray://<remote-endpoint>")

@ray.remote
def add(x, y):
    return x + y

# Remote task
result = ray.get(add.remote(1, 2))
print(f"Result: {result}")

Pros

Interactive Programming

You can interact with the cluster in real-time, enabling use cases like interactive development and debugging, where you can dynamically run and test your code.

Full Ray API Support

You can use the full Ray API (remote functions, actors, object stores, etc.) just as if you were running Ray locally. This allows for full-fledged distributed computing workflows.

Persistent State

The Ray cluster's state persists across multiple function calls, meaning you can launch actors, store objects in the Ray object store, and interact with them later.

Flexible Workflow

You have full flexibility to define workflows with conditional logic, loops, and dynamic task submissions, making it very easy to build custom distributed computing pipelines.

Cons

Interactive Session Needed

Ray Client requires a live, persistent connection between the local machine and the Ray cluster. This is a poor fit for batch jobs or any use case where you want to submit and forget.

Resource Management

You are responsible for managing tasks, actors, and resources during the lifetime of the connection. This requires careful planning to avoid memory and resource leaks in a large-scale system.
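
One way to keep the connection lifetime explicit is to wrap it in a context manager, so that the client always disconnects, even when the driver code raises. The sketch below is illustrative only: client_address and ray_client_session are hypothetical helper names, the host is a placeholder, and port 10001 is assumed because it is Ray's default Ray Client port (the endpoint on your tenant may differ).

```python
from contextlib import contextmanager


def client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client address (hypothetical helper for this sketch)."""
    return f"ray://{host}:{port}"


@contextmanager
def ray_client_session(address: str):
    """Connect to a remote cluster and always disconnect on exit."""
    import ray  # imported lazily so client_address() works without Ray installed

    ray.init(address)
    try:
        yield ray
    finally:
        # Disconnect from the cluster even if the body raised an exception.
        ray.shutdown()


# Usage (requires a reachable cluster, so it is left as a comment):
#
# with ray_client_session(client_address("<remote-host>")) as ray:
#     ...  # define remote functions and actors here
```

Tying the session to a with block makes it harder to leave a connection, and the actors it owns, running by accident.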

Less Suitable for Job Queuing

Ray Client is not designed for long-running batch jobs that need to be queued and managed asynchronously. It is better suited to real-time interaction.

Limited Fault Tolerance

If the Ray Client connection drops, or if the local machine shuts down, it could disrupt running tasks. Reconnecting to the same session can be complex.

Ray Job Submission API

The Ray Job Submission API provides a way to submit jobs (scripts or commands) to a remote Ray cluster. The API manages the job lifecycle independently of the client, meaning you can submit a job, disconnect, and let the job complete in the background.

How It Works

You connect to the remote Ray cluster through the endpoint URL that was provided to you. Ray provides a Python API for submitting jobs directly from your client. Shown below is a minimal example of submitting a Ray job using the Ray Job Submission API.

Important

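If the endpoint uses a self-signed certificate, disable SSL verification (verify=False) when creating the client; otherwise the connection fails certificate validation. Do this only for testing against endpoints you trust.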
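
As an alternative to turning verification off, the verify argument of JobSubmissionClient also accepts a path to a file of trusted certificates, which keeps certificate checks in place for a self-signed endpoint. The following is a minimal sketch under stated assumptions: RAY_CA_BUNDLE is a hypothetical environment variable name chosen for this example (it is not a Ray setting), and tls_verify_setting and make_client are hypothetical helpers.

```python
import os


def tls_verify_setting(env=None):
    """Return the value to pass as verify= when creating the client.

    RAY_CA_BUNDLE is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    ca_bundle = env.get("RAY_CA_BUNDLE", "").strip()
    # A path verifies the server against that bundle; True keeps the
    # default certificate verification.
    return ca_bundle if ca_bundle else True


def make_client(address, headers=None):
    """Create a job submission client with verification left enabled."""
    from ray.job_submission import JobSubmissionClient  # lazy import

    return JobSubmissionClient(address, headers=headers, verify=tls_verify_setting())
```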

Copy the code below into a Python file using your favorite IDE.

import ray
from ray.job_submission import JobSubmissionClient
import urllib3
import time

# Suppress the warning about unverified HTTPS requests, since
# we are using self-signed certificates for testing
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Job submission client (this is not a Ray Client connection)
client = JobSubmissionClient(
    "<https URL for Ray Endpoint>", 
    headers={"Authorization": "Basic <Base64 Encoded Credentials>"}, 
    verify=False  # Disable SSL verification
)

# Submit the job to the remote Ray cluster
job_id = client.submit_job(
    entrypoint="python simple.py",  # The script to be executed remotely
    runtime_env={
        "working_dir": "./",  # The working directory containing the script
        "pip": []  # No additional dependencies for this simple test
    }
)

print(f"Submitted job with ID: {job_id}")

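# Poll until the job reaches a terminal state. Checking only for
# "SUCCEEDED" would loop forever if the job fails or is stopped.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED")
status = client.get_job_status(job_id)
while status not in TERMINAL_STATES:
    print(f"Job status: {status}")
    time.sleep(5)  # Wait 5 seconds before checking again
    status = client.get_job_status(job_id)

print(f"Job finished with status: {status}")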

# Retrieve the logs or output of the job
logs = client.get_job_logs(job_id)
print(f"Job logs: {logs}")
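
The polling step can also be factored into a reusable helper with a timeout, so that a stuck job does not block the caller indefinitely. This is a sketch rather than part of the Ray API: wait_for_job and its parameters are hypothetical, and the status getter is passed in as a callable so the helper can be exercised without a cluster.

```python
import time

# Terminal job statuses reported by the Ray Job Submission API.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED")


def wait_for_job(get_status, job_id, timeout_s=600.0, poll_s=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(job_id) until a terminal state or until timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status} after {timeout_s} seconds")
        sleep(poll_s)
```

With a real client this would be called as wait_for_job(client.get_job_status, job_id).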

Credentials

If basic auth is configured on the endpoint, base64 encode the username and password, joined by a colon (username:password), and pass the result in the Authorization header. You can use the following command to generate the base64 encoded credential.

echo -n 'admin:PASSWORD' | base64
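
The same credential can be built in Python, which avoids pasting the encoded value into the script by hand. A minimal sketch, assuming basic auth as described above; basic_auth_header is a hypothetical helper name, and the username and password are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Return a headers dict carrying a Basic auth credential."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials; in practice, read them from a secret store
# or environment variables instead of hard-coding them.
headers = basic_auth_header("admin", "PASSWORD")
```

The resulting dict can be passed as the headers argument of JobSubmissionClient.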

Pros

Batch Job Oriented

Designed for submitting batch jobs that run independently of the client. Once the job is submitted, you don’t need to maintain a persistent connection, and you can let the cluster handle job execution.

Simple and Lightweight

Very easy to use for submitting one-off jobs. You don’t have to manage complex workflows, actors, or other distributed system resources manually.

Fault Tolerance

The Job Submission API is resilient to disconnections between the client and the cluster. Even if your local machine disconnects, the job will continue running on the cluster.

Job Isolation

Each job can run with its own runtime environment (for example, its own working directory and pip dependencies, or a container image), which reduces the chance of interference between jobs.
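
As an illustration of per-job dependencies, two jobs can be given different runtime_env dictionaries. The helper name make_runtime_env and the package pins below are hypothetical and chosen only for this sketch.

```python
def make_runtime_env(working_dir="./", pip_packages=()):
    """Build a runtime_env dict for one job (hypothetical helper)."""
    return {"working_dir": working_dir, "pip": list(pip_packages)}


# Each job gets its own dependency set; the package pins are examples only.
training_env = make_runtime_env(pip_packages=["scikit-learn==1.4.2"])
report_env = make_runtime_env(pip_packages=["pandas==2.2.2"])
```

Each dictionary would be passed as runtime_env to a separate submit_job call.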

Suitable for CI/CD Pipelines

Ideal for integration with CI/CD pipelines, as jobs can be queued, monitored, and completed asynchronously. This allows jobs to be submitted and handled in an automated fashion.
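
In a pipeline, the final job status usually has to be turned into a process exit code so that the CI step passes or fails accordingly. A minimal sketch, assuming the convention that only SUCCEEDED counts as success; exit_code_for and finish_ci_step are hypothetical names.

```python
import sys


def exit_code_for(status) -> int:
    """Return 0 only when the job succeeded; anything else fails the step."""
    return 0 if status == "SUCCEEDED" else 1


def finish_ci_step(status) -> None:
    """Report the final status and end the process with a matching exit code."""
    print(f"Ray job finished with status: {status}")
    sys.exit(exit_code_for(status))
```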

Job Monitoring and Management

The Job Submission API provides a mechanism for querying the status of jobs (PENDING, RUNNING, SUCCEEDED, FAILED, or STOPPED), which can be useful for tracking progress and handling retries.
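
Retries are left to the caller, so a simple policy can be layered on top of the status check. The sketch below resubmits a job while it ends in FAILED, up to a fixed number of attempts. run_with_retries is a hypothetical helper; submit and wait are callables supplied by the caller (for example, thin wrappers around client.submit_job and a polling loop).

```python
def run_with_retries(submit, wait, max_attempts=3):
    """Resubmit a job while it ends in FAILED.

    submit() returns a job ID; wait(job_id) returns the job's final status.
    Returns the list of (job_id, final_status) pairs, one per attempt.
    """
    attempts = []
    for _ in range(max_attempts):
        job_id = submit()
        status = wait(job_id)
        attempts.append((job_id, status))
        if status != "FAILED":
            break  # succeeded or was stopped on purpose: do not retry
    return attempts
```

Returning every attempt, rather than only the last one, keeps the failed job IDs available for fetching their logs afterwards.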

Cons

Limited Interaction

Once you submit a job, you don’t have fine-grained, real-time interaction with it. You can’t interact with objects or actors inside the job, which limits flexibility compared to Ray Client.

No Persistent State

Jobs are stateless and independent, meaning you cannot easily share data or objects between jobs. You must explicitly pass any necessary data via files, databases, or message queues.
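
One common way to hand data locations to a job is through command-line arguments and environment variables. The sketch below builds the keyword arguments for submit_job; build_submission, the --input flag, and the OUTPUT_URI variable are hypothetical and would have to match what your script actually reads.

```python
import shlex


def build_submission(script, input_uri, output_uri):
    """Build keyword arguments for client.submit_job (hypothetical helper)."""
    # Quote each value so that paths containing spaces survive the shell.
    entrypoint = f"python {shlex.quote(script)} --input {shlex.quote(input_uri)}"
    return {
        "entrypoint": entrypoint,
        "runtime_env": {
            "working_dir": "./",
            # Environment variables are set for the job's processes.
            "env_vars": {"OUTPUT_URI": output_uri},
        },
    }


spec = build_submission("simple.py", "/data/in.csv", "/data/out.csv")
```

The job would then be submitted with client.submit_job(**spec).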

Not Suitable for Interactive Use

It’s not designed for real-time development, exploration, or debugging workflows. Feedback is limited to the job's status and logs, so the edit-run-inspect loop is slower than with Ray Client.
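
Logs can be followed while a job runs by using the client's tail_job_logs method, which is an asynchronous iterator. A minimal sketch, assuming the endpoint permits the streaming connection that log tailing relies on; stream_logs and its sink parameter are hypothetical.

```python
import asyncio


async def stream_logs(client, job_id, sink=None):
    """Forward log chunks from client.tail_job_logs(job_id) to sink as they arrive."""
    if sink is None:
        # Chunks already contain their own newlines, so do not add another.
        sink = lambda chunk: print(chunk, end="")
    async for chunk in client.tail_job_logs(job_id):
        sink(chunk)


# Usage with a real client (left as a comment because it needs a cluster):
#
# asyncio.run(stream_logs(client, job_id))
```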

Longer Setup Time

Submitting jobs can involve more overhead compared to Ray Client since each job may involve environment setup (e.g., installing dependencies) and starting new processes.

Which Should You Use and When?

Use Ray Client if:

You need interactive development where you can define tasks, run them, and get immediate feedback.
You want to leverage Ray's full API for managing distributed systems, actors, and complex workflows.
You need to manage persistent state or coordinate multiple actors and tasks.
You are experimenting or running dynamic, real-time workloads where flexibility is important.

Use Ray Job Submission API if:

You need to submit batch jobs that don’t require real-time interaction.
You want jobs to run asynchronously in the background and persist independently of the client.
You need to manage isolated environments for each job with specific dependencies.
You’re integrating with CI/CD pipelines or other automation systems where job management and fault tolerance are important.

Each approach has its own strengths, and the choice depends on the specific nature of your application or workload.