Fractional GPUs

Choosing the Right Fractional GPU Strategy for Cloud Providers

As demand for GPU-accelerated workloads soars across industries, cloud providers are under increasing pressure to offer flexible, cost-efficient, and isolated access to GPUs. While full GPU allocation remains the norm, it often leads to resource waste—especially for lightweight or intermittent workloads.

In the previous post, we described the three primary technical approaches for fractional GPUs. In this post, we'll explore the most viable approaches to offering fractional GPUs in a GPU-as-a-Service (GPUaaS) model, and evaluate their suitability for cloud providers serving end customers.

Demystifying Fractional GPUs in Kubernetes: MIG, Time Slicing, and Custom Schedulers

As GPU acceleration becomes central to modern AI/ML workloads, Kubernetes has emerged as the orchestration platform of choice. However, allocating a full GPU is overkill for many real-world workloads, resulting in underutilization and soaring costs.

Enter the need for fractional GPUs: ways to share a physical GPU among multiple containers without compromising performance or isolation.

In this post, we'll walk through three approaches to achieve fractional GPU access in Kubernetes:

  1. MIG (Multi-Instance GPU)
  2. Time Slicing
  3. Custom Schedulers (e.g., KAI)

For each, we’ll break down how it works, its pros and cons, and when to use it.