Custom Scheduler

The Volcano batch scheduler is deployed on the host Kubernetes cluster, extending the native Kubernetes scheduling capabilities. Volcano is a specialized batch scheduling system designed to handle high-performance computing (HPC), AI/ML workloads, and other jobs that require complex orchestration of resources.

It makes Kubernetes better suited for workloads that need more scheduling efficiency, resource management, and task coordination than the default scheduler provides.

The Volcano scheduler shines in scenarios where complex, high-performance workloads demand sophisticated resource management and orchestration. If you're working with AI/ML, HPC, or large-scale batch jobs in Kubernetes, Volcano can greatly improve job efficiency, resource allocation, and fairness. However, for simple, stateless applications, the default Kubernetes scheduler is likely more than enough.


Why Use Volcano?

Optimized for Batch Workloads

While Kubernetes’ default scheduler is general-purpose, it may not optimize well for batch workloads that involve intensive, long-running jobs, such as machine learning training, scientific computing, or big data processing. Volcano introduces advanced scheduling strategies like gang scheduling, which ensures that jobs requiring multiple tasks only start when all resources are available.

Gang Scheduling

Gang scheduling ensures that all tasks in a job are scheduled together or not at all. This is crucial for workloads where distributed tasks need to run in parallel, like training deep learning models or running a Spark job. Without gang scheduling, Kubernetes might start a few tasks and leave others waiting, which can lead to inefficient resource utilization.
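A minimal sketch of how gang scheduling is expressed in a Volcano Job: the minAvailable field tells the scheduler that all four pods must be placeable before any of them start. The job name and container image below are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-training            # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 4                # gang constraint: schedule all 4 pods or none
  tasks:
    - name: ps
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ps
              image: example.com/training:latest   # placeholder image
    - name: worker
      replicas: 3
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: example.com/training:latest   # placeholder image
```

If only three of the four pods fit in the cluster, none are started, avoiding the partial-allocation deadlocks described above.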

Job Dependencies and Preemption

Some jobs depend on the completion of others. Volcano can manage job dependencies, ensuring that jobs are executed in a predefined order. It also supports preemption policies, which allow higher-priority jobs to interrupt or evict lower-priority ones, ensuring critical jobs get access to the resources they need.
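As a sketch of the preemption side, a Volcano Job can reference a standard Kubernetes PriorityClass; when the scheduler's preempt action is enabled in its configuration, higher-priority jobs may evict lower-priority ones. The names and priority value here are illustrative.

```yaml
# A standard Kubernetes PriorityClass (illustrative name and value)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-batch
value: 1000000
globalDefault: false
description: "High priority for critical batch jobs"
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-job               # hypothetical job name
spec:
  schedulerName: volcano
  priorityClassName: critical-batch   # may preempt lower-priority jobs
  minAvailable: 2
  tasks:
    - name: main
      replicas: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: main
              image: example.com/batch:latest   # placeholder image
```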

Advanced Resource Awareness

Volcano allows for more sophisticated resource management beyond just CPU and memory. It can handle GPU scheduling, NUMA-aware scheduling (for specific memory and CPU affinity), and disk I/O awareness. This is especially important for AI/ML workloads, which often need GPUs, TPUs, or other specialized hardware.
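For example, a task in a Volcano Job can request GPUs through the usual extended-resource syntax (this assumes the NVIDIA device plugin is installed on the cluster; the image name is a placeholder):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-train                # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - name: trainer
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: example.com/gpu-train:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1   # requires the NVIDIA device plugin
```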

Queueing and Fair Share Scheduling

Volcano introduces queueing mechanisms and supports fair share scheduling, which divides cluster resources fairly among multiple jobs or users. This is crucial in multi-tenant environments where resource contention can be an issue.
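A sketch of how fair share is typically configured: queues are Volcano custom resources with relative weights, and each Job targets a queue via its spec.queue field. The queue names and weights below are illustrative.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a                   # hypothetical tenant queue
spec:
  weight: 2          # under contention, team-a gets 2/3 of capacity
  reclaimable: true  # borrowed idle capacity can be reclaimed
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-b
spec:
  weight: 1          # under contention, team-b gets 1/3 of capacity
```

A Job submitted with `spec.queue: team-a` then competes for resources within team-a's share rather than cluster-wide.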

Elastic Job Support

Elastic jobs can scale up and down dynamically based on available resources. Volcano supports elastic scaling for batch workloads, helping to manage resources efficiently when running distributed training or big data jobs.
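One way elasticity is expressed, sketched below: setting minAvailable lower than the total replica count lets the job start with a minimum gang and grow toward the full replica count as capacity frees up. Names and images are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: elastic-train            # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 2      # job may start with as few as 2 workers...
  tasks:
    - name: worker
      replicas: 6      # ...and grow to 6 when resources become available
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: example.com/elastic:latest   # placeholder image
```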

Topology-Aware Scheduling

Volcano considers topology constraints, such as rack or availability zone preferences, when scheduling jobs. This can help optimize network bandwidth usage and ensure that jobs are placed where they can access the fastest possible resources.
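As a sketch, standard Kubernetes node-affinity constraints in a task template are one way to express such placement preferences; Volcano honors them while still scheduling the gang as a unit. The zone value is a placeholder.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: zone-local-job           # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 3
  tasks:
    - name: worker
      replicas: 3
      template:
        spec:
          restartPolicy: OnFailure
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["us-east-1a"]   # placeholder zone
          containers:
            - name: worker
              image: example.com/worker:latest   # placeholder image
```

Pinning all three workers to one zone keeps their traffic off cross-zone links, which matters for communication-heavy distributed jobs.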


When is Volcano a good fit?

Here are some scenarios where Volcano is an ideal fit for batch scheduling:

High-Performance Computing (HPC)

When running scientific simulations, financial modeling, or computational fluid dynamics where multiple tasks need to be executed in parallel across multiple nodes, Volcano’s gang scheduling and efficient resource management are vital.

Machine Learning (ML) and AI

For workloads like distributed deep learning, where tasks (e.g., parameter servers and workers) must run together, Volcano ensures that all necessary resources are allocated before starting the job. When using GPUs or TPUs for training models, the scheduler ensures these specialized resources are effectively allocated.

Big Data Processing

Jobs that involve tools like Apache Spark, Hadoop, or Flink, which require resource coordination across multiple pods, can benefit from Volcano’s ability to manage resource requests efficiently and schedule jobs in parallel with resource awareness.

Cloud-Native Batch Processing

For batch processing pipelines, such as ETL (Extract, Transform, Load) jobs, Volcano offers a more fine-grained scheduling approach than the default Kubernetes scheduler, ensuring that batch jobs are executed efficiently with minimal downtime.

Multi-Tenant Environments

If your Kubernetes cluster is shared by multiple teams or departments, Volcano’s support for fair-share scheduling, job preemption, and priority queues helps ensure that resources are distributed according to the needs and priorities of different workloads.

Long-Running Computational Jobs

If your workloads involve long-running, resource-intensive jobs, like simulations or training models for several hours or days, Volcano can manage job preemption and resume processes efficiently, ensuring optimal resource utilization.