
Background

Organizations are incorporating ML pervasively into their applications. Doing this requires teams to have access to GPUs, whose massively parallel architecture provides the computational speed and efficiency needed to tackle complex algorithms and vast datasets.

Handling large datasets

ML models often require processing and analyzing large datasets. With their high-bandwidth memory and parallel architecture, GPUs are adept at managing these data-intensive tasks, leading to quicker insights and model training.

Reducing computation time

The efficiency of GPUs in performing parallel computations drastically reduces the time required for training and inference in AI models. This speed is crucial for applications requiring real-time processing and decision-making, such as autonomous vehicles and real-time language translation.

However, there are significant challenges that organizations need to overcome to use GPUs effectively. As a result of these challenges, the typical organization struggles with gross underutilization of its GPUs (utilization of only 25-30% is not uncommon). Let's review some of these challenges.


Challenges

Lack of Sharing

Unlike traditional resources such as CPU and memory, GPUs do not have out-of-the-box support for fine-grained sharing (kernels must share a single GPU address space, i.e. a GPU context). As a result, applications are allocated entire GPUs but end up using only a fraction of them.

Sharing GPUs across multiple applications from different users can improve resource utilization and, consequently, cost, energy, and power efficiency. Vendors like Nvidia have developed technologies such as MIG (Multi-Instance GPU) that support "spatial partitioning" of a GPU into as many as seven instances. Each instance is fully isolated, with its own high-bandwidth memory, cache, and compute cores, providing guaranteed quality of service (QoS).

Organizations need a solution that will allow IT/Ops to perform spatial partitioning of beefy GPUs dynamically based on demand.
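
To make this concrete, here is a minimal Python sketch (not part of the original text) of how such dynamic partitioning could be automated by driving nvidia-smi's MIG commands. The partition_gpu helper, the chosen profile names, and the assumption of a MIG-capable GPU (e.g. A100/H100) with root access are all illustrative; the valid profile names and counts depend on the GPU model.

```python
import subprocess

def run(cmd):
    """Run a command and return its stdout, raising if it fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def partition_gpu(gpu_index, profiles):
    """Enable MIG mode on one GPU and carve it into the requested instance profiles.

    `profiles` is a list of MIG profile names such as "3g.20gb" or "1g.5gb";
    which names (and how many instances) are valid depends on the GPU model.
    """
    # Enable MIG mode on the target GPU (may require a GPU reset to take effect).
    run(["nvidia-smi", "-i", str(gpu_index), "-mig", "1"])
    # Create one GPU instance per requested profile, plus a default compute
    # instance inside each (-C).
    run(["nvidia-smi", "mig", "-i", str(gpu_index), "-cgi", ",".join(profiles), "-C"])
    # List the resulting GPU instances for verification.
    print(run(["nvidia-smi", "mig", "-i", str(gpu_index), "-lgi"]))

if __name__ == "__main__":
    # Example: split GPU 0 into a 3g.20gb and a 4g.20gb slice (A100 40GB profiles).
    partition_gpu(0, ["3g.20gb", "4g.20gb"])
```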


Locality Requirements

Training on large datasets often requires the use of multiple GPUs. ML frameworks typically require that the tasks on each GPU be scheduled at the same time (i.e. gang scheduled). Multi-GPU training also implies synchronization of model parameters across GPUs. It is therefore critical to ensure "locality" of tasks so that faster interconnects can be used for both inter- and intra-node communication.

Organizations need a solution that will ensure that GPU allocation is performed intelligently, taking into account use cases such as distributed training that require multiple GPUs.
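
The sketch below (a simplification, not any particular scheduler's implementation) shows the kind of logic involved: the request is placed all-or-nothing so the gang can start together, and a single node is preferred so faster intra-node interconnects such as NVLink can be used. The gang_allocate function and the node-to-free-GPU map are assumptions made for the example.

```python
from typing import Dict, List, Optional

def gang_allocate(free_gpus: Dict[str, int], requested: int) -> Optional[List[str]]:
    """All-or-nothing, locality-aware GPU allocation.

    `free_gpus` maps node name -> number of free GPUs on that node. Returns a
    list of node names (one entry per allocated GPU) if the whole gang can be
    placed, or None so the job waits instead of running partially.
    """
    # Prefer locality: the smallest single node that fits the whole request,
    # which keeps intra-node interconnects usable and preserves larger nodes
    # for bigger jobs.
    candidates = [n for n, free in free_gpus.items() if free >= requested]
    if candidates:
        node = min(candidates, key=lambda n: free_gpus[n])
        free_gpus[node] -= requested
        return [node] * requested

    # Otherwise spread across nodes, but only if the *entire* gang fits;
    # a partial allocation would leave GPUs idle while the job cannot start.
    if sum(free_gpus.values()) < requested:
        return None
    placement, remaining = [], requested
    for node in sorted(free_gpus, key=free_gpus.get, reverse=True):
        take = min(free_gpus[node], remaining)
        free_gpus[node] -= take
        placement += [node] * take
        remaining -= take
        if remaining == 0:
            break
    return placement
```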


Idle Time & Work Breaks

Expensive GPUs allocated to users will be idle during non-work hours (e.g. nights, weekends) and during work breaks (e.g. coffee breaks, lunch). GPUs will also be idle when a data scientist is experimenting with data in a Jupyter notebook as they develop a model. The data scientist will alternate between writing code, executing it on the GPU, and examining the results. As a result, the GPU will idle for extended periods of time during this process.

Organizations need a solution that will enforce schedules and time-of-day policies for GPU allocation.
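
A minimal sketch of such a policy check, assuming an organization-defined working-hours window and an idle threshold (the specific values and the should_reclaim helper are illustrative):

```python
from datetime import datetime, time

# Illustrative policy: GPUs may only be held during weekday working hours,
# and never through long stretches of idleness.
WORK_START, WORK_END = time(8, 0), time(18, 0)
IDLE_RECLAIM_MINUTES = 30

def should_reclaim(now: datetime, idle_minutes: float) -> bool:
    """Return True if an allocation should be returned to the central pool."""
    outside_hours = now.weekday() >= 5 or not (WORK_START <= now.time() < WORK_END)
    return outside_hours or idle_minutes >= IDLE_RECLAIM_MINUTES

# Example: a Saturday afternoon is reclaimed regardless of recent activity.
print(should_reclaim(datetime(2024, 6, 1, 14, 0), idle_minutes=0))  # True
```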


Static Assignments

Organizations have historically followed static assignment of GPU resources to users. GPUs are frequently assigned for extended periods of time even if the user no longer needs them. Organizations need a centralized system to monitor usage and an automated, policy-based workflow to recoup idle GPUs into the central pool.

Organizations need a solution that will enforce a TTL (time to live) for GPU allocations.
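
A sketch of what TTL enforcement could look like, assuming each allocation records when it was granted and the lease length agreed at grant time (the Allocation and sweep names are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Allocation:
    user: str
    gpu_id: str
    granted_at: datetime
    ttl: timedelta  # maximum lease duration agreed when the GPU was granted

def expired(alloc: Allocation, now: datetime) -> bool:
    """A lease past its TTL is due to be reclaimed."""
    return now - alloc.granted_at >= alloc.ttl

def sweep(allocations: List[Allocation], pool: List[str], now: datetime) -> List[Allocation]:
    """Return the still-active allocations; GPUs from expired leases go back to the pool."""
    active = []
    for alloc in allocations:
        if expired(alloc, now):
            pool.append(alloc.gpu_id)  # reclaim into the central pool
        else:
            active.append(alloc)
    return active
```

Running such a sweep periodically, with notifications sent before expiry, turns reclamation into an automated policy rather than a manual negotiation.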


Lack of Self Service

Without a self-service model for on-demand allocation of GPU resources, users struggle with inordinate delays and therefore tend to hold on to GPUs even when they are not actively using them.

Organizations need a solution that will provide their end users with an "on-demand, self-service" experience to request access to GPUs from a centralized pool.
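
The toy sketch below illustrates the idea: a central pool grants requests immediately when GPUs are free and queues them otherwise, so users no longer have a reason to hoard allocations. The GpuPool class and its methods are hypothetical, not an actual product API.

```python
from collections import deque

class GpuPool:
    """Toy central pool with an on-demand, self-service request queue."""

    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)
        self.waitlist = deque()  # (user, count) requests, served in FIFO order

    def request(self, user, count):
        """Grant GPUs immediately if available, otherwise queue the request."""
        if len(self.free) >= count:
            granted, self.free = self.free[:count], self.free[count:]
            return granted
        self.waitlist.append((user, count))
        return []  # caller is notified later, when GPUs are released

    def release(self, gpu_ids):
        """Return GPUs to the pool and serve any waiting requests."""
        self.free.extend(gpu_ids)
        while self.waitlist and len(self.free) >= self.waitlist[0][1]:
            user, count = self.waitlist.popleft()
            granted, self.free = self.free[:count], self.free[count:]
            print(f"granted {granted} to {user}")  # stand-in for a real notification
```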


GPU & Node Fragmentation

GPU fragmentation occurs when allocations are placed poorly, leaving free capacity scattered across many partially used GPUs. By assigning tasks with smaller GPU requirements to partially allocated GPUs, the available resources can be utilized more effectively. The image below shows how fragmentation can occur and how an intelligent bin packing approach can ensure that the maximum number of tasks is accommodated across the available GPU pool.

GPU Fragmentation

Node fragmentation can occur when a task requires more GPUs than can be allocated on any single node. Space has to be created for the new task on existing nodes that would otherwise remain underutilized: existing workloads may need to be preempted and rescheduled onto another node to make room for the incoming task.

Node Fragmentation

Organizations need a solution that will implement "continuous" bin packing techniques to ensure that GPU allocations are done optimally.
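
To illustrate the idea, here is a simple best-fit packing sketch for fractional GPU requests (the best_fit helper and the fractional representation are assumptions for the example; a production scheduler would also account for memory, interconnects, and preemption):

```python
from typing import List, Optional

def best_fit(requests: List[float], gpus: List[float]) -> Optional[List[int]]:
    """Place fractional-GPU requests using best-fit-decreasing bin packing.

    `requests` holds the GPU fraction each task needs (e.g. 0.5 = half a GPU);
    `gpus` holds the free fraction remaining on each GPU. Returns the chosen
    GPU index per request (in descending request-size order), or None if some
    request cannot be placed.
    """
    placements = []
    for need in sorted(requests, reverse=True):  # pack the largest requests first
        # Best fit: the GPU whose free capacity exceeds the request by the least,
        # keeping large contiguous capacity available for future big requests.
        candidates = [i for i, free in enumerate(gpus) if free >= need]
        if not candidates:
            return None
        best = min(candidates, key=lambda i: gpus[i] - need)
        gpus[best] -= need
        placements.append(best)
    return placements

# Four half-GPU tasks land on two GPUs instead of fragmenting all four.
print(best_fit([0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]))  # [0, 0, 1, 1]
```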