Use
When working with a remote Ray cluster on the "Ray as Service Tenant", there are several ways to connect to, submit jobs to, and manage workloads across the cluster. These methods vary depending on the type of workload (e.g., batch jobs, distributed ML training, or real-time applications) and the environment in which the Ray cluster is running (e.g., cloud, on-premises). Both Ray Client and Ray Job Submission API are ways to interact with a remote Ray cluster, but they are suited to different use cases. Here is a brief comparison of the two approaches, including their pros and cons.
Note
In our examples and guides, we will be using the Ray Job Submission API approach.
Ray Client
The Ray Client allows you to connect to a remote Ray cluster and interact with it as though you were running Ray locally. You can use all the familiar Ray APIs, such as defining remote functions and actors, directly in your local environment while executing the actual tasks remotely.
How It Works
You connect to a remote Ray cluster using ray.init("ray://<remote-endpoint>") and then use Ray just as you would in a local setup. Here is an example:
import ray

# Connect to the remote Ray cluster
ray.init("ray://<remote-endpoint>")

@ray.remote
def add(x, y):
    return x + y

# Run the task remotely and fetch the result
result = ray.get(add.remote(1, 2))
print(f"Result: {result}")
Pros
Interactive Programming
You can interact with the cluster in real-time, enabling use cases like interactive development and debugging, where you can dynamically run and test your code.
Full Ray API Support
You can use the full Ray API (remote functions, actors, object stores, etc.) just as if you were running Ray locally. This allows for full-fledged distributed computing workflows.
Persistent State
The Ray cluster's state persists across multiple function calls, meaning you can launch actors, store objects in the Ray object store, and interact with them later.
Flexible Workflow
You have full flexibility to define workflows with conditional logic, loops, and dynamic task submissions, making it very easy to build custom distributed computing pipelines.
Cons
Interactive Session Needed
Requires a live, persistent connection between the local machine and the Ray cluster, which may not always be ideal for batch jobs or use cases where you want to submit and forget.
Resource Management
You are responsible for managing tasks, actors, and resources during the lifetime of the connection. This requires careful planning to avoid memory and resource leaks in a large-scale system.
Less Suitable for Job Queuing
Ray Client is not designed for long-running batch jobs that need to be queued and managed asynchronously. It is better suited to real-time interaction than to asynchronous batch job management.
Limited Fault Tolerance
If the Ray Client connection drops, or if the local machine shuts down, it could disrupt running tasks. Reconnecting to the same session can be complex.
Ray Job Submission API
The Ray Job Submission API provides a way to submit jobs (scripts or commands) to a remote Ray cluster. The API manages the job lifecycle independently of the client, meaning you can submit a job, disconnect, and let the job complete in the background.
How It Works
Connect to the remote Ray cluster through the endpoint URL that was provided to you. Ray provides a Python client, JobSubmissionClient, for submitting jobs directly from your own Python code. Shown below is a minimal example of submitting a job with the Ray Job Submission API.
Important
If the endpoint uses a self-signed certificate, disable SSL verification when creating the client. If the certificate is issued by a trusted authority, leave verification enabled.
Copy the code below into a Python file using your favorite IDE.
from ray.job_submission import JobSubmissionClient

# Connect to the Ray cluster (make sure to replace this with your cluster URL)
client = JobSubmissionClient(
    "https://mlteam.acme.net",
    headers={"Authorization": "Basic nasda=="},
    verify=False,  # Disable SSL verification (self-signed certificates only)
)

# Define your job. The entrypoint is a shell command that runs on the cluster,
# so it must be self-contained: functions defined in this local file are not
# available there.
entrypoint = (
    "python -c 'import time; "
    "time.sleep(10); "  # Simulate long computation
    "print(\"Hello, Ray Cluster!\")'"
)

# Submit the job to the Ray cluster
submission_id = client.submit_job(entrypoint=entrypoint)

# Check the status of the job
status = client.get_job_status(submission_id)
print(f"Job status: {status}")

# Retrieve the logs or output of the job
logs = client.get_job_logs(submission_id)
print(f"Job logs: {logs}")
Credentials
If basic auth is configured on the endpoint, Base64-encode the username and password, joined by a colon (username:password), and pass the result in the Authorization header. You can use the following command to generate the Base64-encoded credential.
echo -n 'admin:PASSWORD' | base64
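The same credential can also be built in Python, which avoids copying the value out of a shell. The snippet below is a minimal sketch using only the standard library; basic_auth_header is a hypothetical helper name, and admin / PASSWORD are the placeholders from the command above.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header (hypothetical helper)."""
    # Same encoding as: echo -n 'username:password' | base64
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials; replace them with your own.
headers = basic_auth_header("admin", "PASSWORD")
```

The resulting dictionary can be passed as the headers argument when creating the JobSubmissionClient.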
Pros
Batch Job Oriented
Designed for submitting batch jobs that run independently of the client. Once the job is submitted, you don’t need to maintain a persistent connection, and you can let the cluster handle job execution.
Simple and Lightweight
Very easy to use for submitting one-off jobs. You don’t have to manage complex workflows, actors, or other distributed system resources manually.
Fault Tolerance
The Job Submission API is resilient to disconnections between the client and the cluster. Even if your local machine disconnects, the job will continue running on the cluster.
Job Isolation
Each job runs in an isolated environment (e.g., a container or a runtime environment) with its own dependencies, which reduces the chances of interference between different jobs.
Suitable for CI/CD Pipelines
Ideal for integration with CI/CD pipelines, as jobs can be queued, monitored, and completed asynchronously. This allows jobs to be submitted and handled in an automated fashion.
Job Monitoring and Management
The Job Submission API provides a mechanism for querying the status of jobs (e.g., pending, running, succeeded, failed), which can be useful for tracking progress and handling retries.
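Because a job keeps running after submission, automation usually needs to poll its status until it finishes. The following is a minimal sketch of such a loop, not part of Ray itself: wait_for_job is a hypothetical helper, the timeout and polling interval are arbitrary assumptions, and client is expected to be a JobSubmissionClient (or any object with a compatible get_job_status method).

```python
import time

# Terminal states reported by Ray's job status enum.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, submission_id, timeout_s=300.0, poll_interval_s=2.0):
    """Poll until the job reaches a terminal state or the timeout expires.

    Hypothetical helper: returns the final status name as a string and
    raises TimeoutError if the job is still active at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(submission_id)
        # The status may be an enum member; fall back to the raw value
        # so plain strings work too.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {submission_id} is still {name} after {timeout_s}s"
            )
        time.sleep(poll_interval_s)
```

With a connected client, wait_for_job(client, submission_id) would typically be called right after submit_job, and the returned status name can then drive retry or alerting logic.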
Cons
Limited Interaction
Once you submit a job, you don’t have fine-grained, real-time interaction with it. You can’t interact with objects or actors inside the job, which limits flexibility compared to Ray Client.
No Persistent State
Jobs are stateless and independent, meaning you cannot easily share data or objects between jobs. You must explicitly pass any necessary data via files, databases, or message queues.
Not Suitable for Interactive Use
It’s not designed for real-time development, exploration, or debugging workflows. You’ll need to wait for the job to finish before getting feedback.
Longer Setup Time
Submitting jobs can involve more overhead compared to Ray Client since each job may involve environment setup (e.g., installing dependencies) and starting new processes.
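The environment setup mentioned above is usually described through the runtime_env argument of submit_job. The sketch below shows one way to assemble that argument; build_job_request is a hypothetical helper, and the script name, package pin, and environment variable are illustrative assumptions rather than values from this guide.

```python
def build_job_request(entrypoint, pip_packages=None, working_dir=None, env_vars=None):
    """Assemble keyword arguments for submit_job (hypothetical helper)."""
    runtime_env = {}
    if pip_packages:
        # Packages installed for this job before it starts.
        runtime_env["pip"] = list(pip_packages)
    if working_dir:
        # Local directory uploaded to the cluster alongside the job.
        runtime_env["working_dir"] = working_dir
    if env_vars:
        # Environment variables visible to the job's processes.
        runtime_env["env_vars"] = dict(env_vars)

    request = {"entrypoint": entrypoint}
    if runtime_env:
        request["runtime_env"] = runtime_env
    return request


# Illustrative values only; train.py is an assumed script in working_dir.
request = build_job_request(
    "python train.py",
    pip_packages=["requests==2.31.0"],
    working_dir="./",
    env_vars={"STAGE": "dev"},
)
# With a connected client: submission_id = client.submit_job(**request)
```

Keeping the dependency list short and pinned tends to reduce the per-job setup time, since each distinct environment may need to be built before the job starts.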
Which Should You Use and When?
Use Ray Client if:
- You need interactive development where you can define tasks, submit jobs, and get immediate feedback.
- You want to leverage Ray's full API for managing distributed systems, actors, and complex workflows.
- You need to manage persistent state or coordinate multiple actors and tasks.
- You are experimenting or running dynamic, real-time workloads where flexibility is important.
Use Ray Job Submission API if:
- You need to submit batch jobs that don’t require real-time interaction.
- You want jobs to run asynchronously in the background and persist independently of the client.
- You need to manage isolated environments for each job with specific dependencies.
- You’re integrating with CI/CD pipelines or other automation systems where job management and fault tolerance are important.
Each approach has its own strengths, and the choice depends on the specific nature of your application or workload.