Demystifying Fractional GPUs in Kubernetes: MIG, Time Slicing, and Custom Schedulers
As GPU acceleration becomes central to modern AI/ML workloads, Kubernetes has emerged as the orchestration platform of choice. However, allocating a full GPU to each workload is often overkill, resulting in underutilization and soaring costs.
Enter the need for fractional GPUs: ways to share a physical GPU among multiple containers without compromising performance or isolation.
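To see why whole-GPU allocation is so coarse: with the standard NVIDIA device plugin, the `nvidia.com/gpu` resource only accepts whole integers, so even a lightweight inference container claims an entire device. A minimal sketch (the pod name and image below are hypothetical placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod          # hypothetical name
spec:
  containers:
    - name: model-server
      image: my-inference-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # whole GPUs only; fractional values like 0.5 are rejected
```

Even if this container uses a fraction of the GPU's compute and memory, the whole device is locked to it until the pod terminates.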
In this post, we'll walk through three approaches to achieve fractional GPU access in Kubernetes:
- MIG (Multi-Instance GPU)
- Time Slicing
- Custom Schedulers (e.g., KAI)
For each, we'll break down how it works, its pros and cons, and when to use it.