Design
Rafay's Ray as a Service offering is designed to be used concurrently by multiple users (such as data scientists or researchers) on the same host Kubernetes cluster. By deploying KubeRay inside a virtual cluster (vcluster) that runs on a host Kubernetes cluster, organizations can give data scientists and researchers isolated, multi-tenant environments for running Ray workloads on shared Kubernetes infrastructure.
As the design above shows, every user or team gets access to a dedicated, isolated virtual cluster with the KubeRay operator deployed inside it.
A custom batch scheduler is deployed on the host Kubernetes cluster, extending the native Kubernetes scheduling capabilities. This is a specialized batch scheduling system for Kubernetes that is designed to handle high-performance computing (HPC), AI/ML workloads, and other jobs that require complex orchestration of resources.
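As a rough illustration of how a workload can opt in to a non-default scheduler: Kubernetes pods select a scheduler through the `spec.schedulerName` field. The sketch below sets that field on a pod template. The scheduler name `example-batch-scheduler`, the helper `with_scheduler`, and the worker template are hypothetical placeholders; the source does not say which scheduler is used or how Rafay wires it to Ray pods.

```python
import copy


def with_scheduler(pod_template: dict, scheduler_name: str) -> dict:
    """Return a copy of a pod template whose pods request the given scheduler.

    The input template is left unmodified.
    """
    updated = copy.deepcopy(pod_template)
    # spec.schedulerName is the standard Kubernetes field for picking a scheduler.
    updated.setdefault("spec", {})["schedulerName"] = scheduler_name
    return updated


# Hypothetical Ray worker pod template; image and names are placeholders.
worker_template = {
    "spec": {
        "containers": [
            {"name": "ray-worker", "image": "rayproject/ray:latest"}
        ]
    }
}

# "example-batch-scheduler" is an assumed name, not the actual scheduler.
scheduled = with_scheduler(worker_template, "example-batch-scheduler")
```

Pods that leave `schedulerName` unset are handled by the default Kubernetes scheduler, so only workloads that set it explicitly would be placed by the batch scheduler.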
Benefits
This design is especially useful when supporting multiple users or teams that require Ray.
Isolation
Each team can have its own virtual Kubernetes cluster with dedicated KubeRay operators, ensuring that workloads are isolated and do not interfere with each other.
Customized Configurations
Different teams can run different versions or configurations of Kubernetes, KubeRay, or Ray without affecting the host cluster or other teams’ environments.
Resource Management
Administrators can allocate specific resources (CPU, memory, storage) to each vcluster, enabling better control over resource consumption and preventing any single tenant from monopolizing cluster resources.
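One common way to cap a tenant's consumption in Kubernetes is a `ResourceQuota` on the host namespace that backs the tenant's vcluster. The sketch below only builds such a manifest. It assumes each vcluster maps to a single host namespace, and the namespace name and limits are made-up values; the source does not describe how Rafay actually enforces allocations.

```python
import json


def build_resource_quota(namespace: str, cpu: str, memory: str, storage: str) -> dict:
    """Return a Kubernetes ResourceQuota manifest for one tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Caps on the summed resource requests of all pods and PVCs
                # in the namespace.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "requests.storage": storage,
            }
        },
    }


if __name__ == "__main__":
    # "team-a-vcluster" and the limits below are illustrative assumptions.
    quota = build_resource_quota("team-a-vcluster", "32", "128Gi", "1Ti")
    print(json.dumps(quota, indent=2))
```

Because JSON is valid YAML, the printed manifest could be saved to a file and applied with `kubectl apply -f`, one quota per tenant namespace.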
Test & Development
Users can spin up vclusters to test new features, configurations, or upgrades in a sandboxed environment that mimics production settings without risking stability.
Collisions & Conflicts
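Running multiple KubeRay operators in the same cluster can lead to conflicts, for example when operators watch overlapping namespaces or expect different versions of the cluster-scoped CRDs. Using vclusters ensures that each operator is scoped to its own virtual cluster, preventing such issues.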