Background
Organizations are incorporating ML pervasively into their applications. Doing so requires teams to have access to GPUs, which offer unmatched computational speed and efficiency for ML workloads.
Why GPUs?
The massively parallel architecture of GPUs allows AI/ML workloads to tackle complex algorithms and vast datasets.
Handle Large Datasets
ML models often require processing and analyzing large datasets. With their high-bandwidth memory and parallel architecture, GPUs are adept at managing these data-intensive tasks, leading to quicker insights and model training.
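As a minimal sketch of this pattern (assuming PyTorch with a CUDA-capable GPU; the dataset here is synthetic), a DataLoader streams a large dataset to the GPU in batches, using pinned host memory so transfers can overlap with compute:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for a large dataset: 100k samples of 512 features.
dataset = TensorDataset(torch.randn(100_000, 512),
                        torch.randint(0, 10, (100_000,)))
loader = DataLoader(dataset, batch_size=1024, pin_memory=True, shuffle=True)

for features, labels in loader:
    # non_blocking=True lets the host-to-device copy overlap with compute.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass would run here ...
```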
Reduce Computation Time
The efficiency of GPUs in performing parallel computations drastically reduces the time required for training and inference in AI models. This speed is crucial for applications requiring real-time processing and decision-making, such as autonomous vehicles and real-time language translation.
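As a rough illustration (assuming PyTorch and an available CUDA GPU; timings will vary by hardware), the sketch below times the same large matrix multiply on CPU and GPU:

```python
import time
import torch

a_cpu = torch.randn(4096, 4096)
b_cpu = torch.randn(4096, 4096)

start = time.perf_counter()
_ = a_cpu @ b_cpu
cpu_time = time.perf_counter() - start

a_gpu, b_gpu = a_cpu.cuda(), b_cpu.cuda()
torch.cuda.synchronize()          # ensure transfers complete before timing
start = time.perf_counter()
_ = a_gpu @ b_gpu
torch.cuda.synchronize()          # wait for the kernel to finish
gpu_time = time.perf_counter() - start

print(f"CPU: {cpu_time:.3f}s  GPU: {gpu_time:.3f}s")
```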
However, there are significant challenges that organizations need to overcome to use GPUs effectively. As a result of these challenges, the typical organization struggles with gross underutilization of its GPUs (25-30% utilization is not uncommon). Let's review some of these challenges.
Challenges with GPUs
Lack of Sharing
Unlike traditional resources such as CPU and memory, GPUs do not have out-of-the-box support for fine-grained sharing (kernels must share a single GPU address space, i.e. a GPU context). As a result, applications are allocated entire GPUs but end up using only a fraction of each.
Sharing GPUs across multiple applications from different users can improve resource utilization and, consequently, cost, energy, and power efficiency. Vendors like Nvidia have developed technologies such as Multi-Instance GPU (MIG) that support "spatial partitioning" of a GPU into as many as seven instances. Each instance is fully isolated, with its own high-bandwidth memory, cache, and compute cores, providing guaranteed quality of service (QoS).
Organizations need a solution that will allow IT/Ops to perform spatial partitioning of beefy GPUs dynamically based on demand.
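As a sketch of what such tooling works with, the snippet below (assuming the `pynvml` NVML bindings and a MIG-capable GPU such as an A100, with MIG instances already created) enumerates the MIG instances carved out of each physical GPU:

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
        except pynvml.NVMLError:
            continue  # GPU does not support MIG
        if current_mode != 1:  # 1 == MIG enabled
            continue
        for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
            except pynvml.NVMLError:
                continue  # slot not populated with an instance
            print(f"GPU {i} / MIG {j}: {pynvml.nvmlDeviceGetName(mig)}")
finally:
    pynvml.nvmlShutdown()
```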
Locality Requirements
Training on large datasets often requires the use of multiple GPUs. ML frameworks typically require that tasks on each GPU be scheduled at the same time (i.e. gang scheduled). Multi-GPU training also implies synchronization of model parameters across GPUs. It is therefore critical to ensure "locality" of tasks to allow for the use of faster interconnects for both inter- and intra-node communication.
Organizations need a solution that will ensure that GPU allocation is performed intelligently taking into account use cases such as distributed training that require multiple GPUs.
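For example, here is a minimal data-parallel training sketch (assuming PyTorch with the NCCL backend, launched via `torchrun --nproc_per_node=<num_gpus>`) that shows why gang scheduling and locality matter: every rank blocks until all ranks have joined, and every backward pass synchronizes gradients across all GPUs:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # blocks until all ranks join
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradients all-reduced over NCCL

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(10):
    x = torch.randn(64, 512, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()        # implicit all-reduce across GPUs/nodes
    optimizer.step()

dist.destroy_process_group()
```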
Idle Time & Work Breaks
Expensive GPUs allocated to users sit idle during non-work hours (e.g. nights, weekends) and during work breaks (e.g. coffee breaks, lunch). GPUs are also idle while a data scientist experiments with data in a Jupyter notebook as they develop a model: they alternate between writing code, executing it on the GPU, and examining the results, so the GPU idles for extended periods throughout this process.
Organizations need a solution that will enforce schedules and time of day policies for GPU allocation.
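One building block for such a policy is idle detection. The sketch below (assuming `pynvml`; the thresholds and the reclamation step are illustrative assumptions, not a real scheduler integration) polls GPU utilization and flags the device once it has been idle for a full observation window:

```python
import time
import pynvml

IDLE_THRESHOLD_PCT = 5      # below this, treat the GPU as idle (assumed value)
WINDOW_SECONDS = 30 * 60    # how long it must stay idle before flagging
POLL_SECONDS = 60

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

idle_since = None
while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
    if util < IDLE_THRESHOLD_PCT:
        idle_since = idle_since or time.time()
        if time.time() - idle_since >= WINDOW_SECONDS:
            print("GPU 0 idle for 30m; flagging for reclamation")
            break  # a real system would notify the scheduler here
    else:
        idle_since = None
    time.sleep(POLL_SECONDS)
```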
Static Assignments
Organizations have historically assigned GPU resources to users statically. These assignments frequently persist for extended periods of time, even if the user no longer needs the GPUs. Organizations need a centralized system to monitor usage and an automated, policy-based workflow to recoup GPUs to the central pool.
Organizations need a solution that will enforce a TTL (time-to-live) on GPU allocations.
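A minimal sketch of such TTL enforcement follows; the allocation ledger and the `release_gpu` hook are hypothetical stand-ins for an organization's inventory service:

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=14)  # assumed policy: reclaim after two weeks

# Hypothetical allocation ledger: who holds which GPU, and since when.
allocations = [
    {"user": "alice", "gpu": "node1/gpu0",
     "granted_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"user": "bob", "gpu": "node2/gpu3",
     "granted_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

def release_gpu(gpu_id: str) -> None:
    """Hypothetical hook that returns a GPU to the central pool."""
    print(f"reclaimed {gpu_id}")

now = datetime.now(timezone.utc)
for alloc in allocations:
    if now - alloc["granted_at"] > TTL:
        release_gpu(alloc["gpu"])
```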
Lack of Self Service
Without a self-service model for on-demand allocation of GPU resources, users face inordinate delays and therefore tend to hold on to GPUs even when they are not actively using them.
Organizations need a solution that will provide their end users with an "on-demand, self-service" experience to request access to GPUs from a centralized pool.