Self-Service Fractional GPU Memory with Rafay GPU PaaS¶
In Part-1, we explored how Rafay GPU PaaS empowers developers to use fractional GPUs, allowing multiple workloads to share GPU compute efficiently. This enabled better utilization and cost control — without compromising isolation or performance.
In Part-2, we will show how you can enhance this by providing users the means to select fractional GPU memory. While fractional GPUs provide a share of the GPU’s compute cores, different workloads have dramatically different GPU memory needs. With this update, developers can now choose exactly how much GPU memory they want for their pods — bringing fine-grained control, better scheduling, and cost efficiency.
Why Specify GPU Memory Instead of Just a Fraction?¶
Traditionally, fractional GPUs divide a GPU into slices (e.g., ¼, ½, etc.), assuming proportional memory distribution. However, real-world workloads don’t always scale linearly with GPU memory or compute.
1. Different Workloads, Different Memory Needs¶
- A Stable Diffusion inference job might run comfortably within 2–5 GB of GPU memory.
- A small language model (e.g., a 1B-parameter LLM) could require 10–15 GB.
- A fine-tuning or training job might need 20 GB or more, even if the compute load remains moderate.
By letting users explicitly select GPU memory, Rafay GPU PaaS helps decouple memory allocation from compute fraction, ensuring that each workload gets exactly what it needs — no more, no less.
2. Better Resource Efficiency¶
Without fractional memory selection, administrators often over-allocate GPU memory simply to avoid out-of-memory (OOM) errors. For example, an NVIDIA H100 GPU has ~80 GB of memory, so a 25% GPU fraction corresponds to 20 GB — far more than many use cases need.
This leads to wasted GPU memory and stranded capacity. By allowing memory to be specified directly:
- GPU memory can be more evenly shared across users.
- Smaller workloads can pack more efficiently onto a single GPU.
- Cluster GPU utilization can increase dramatically, improving ROI on expensive hardware like A100s and L40s.
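The packing benefit above can be sketched with some quick arithmetic. The workload memory figures below are illustrative assumptions (not Rafay defaults), using the ~80 GB H100 example from earlier:

```python
# Compare how many workloads fit on one 80 GB GPU (e.g., NVIDIA H100)
# under fraction-based vs. explicit-memory allocation.
# ASSUMPTION: each workload actually needs 5 GB (e.g., lightweight
# inference) and the smallest available fraction is 25%.

GPU_MEMORY_GB = 80
actual_need_gb = 5

# Fraction-based: the smallest slice is 25% of the GPU, i.e. 20 GB each.
fraction_slice_gb = GPU_MEMORY_GB * 0.25
fits_by_fraction = int(GPU_MEMORY_GB // fraction_slice_gb)

# Memory-based: each workload requests only what it needs.
fits_by_memory = int(GPU_MEMORY_GB // actual_need_gb)

print(f"Fraction-based packing: {fits_by_fraction} workloads per GPU")
print(f"Memory-based packing:   {fits_by_memory} workloads per GPU")
```

With these assumed numbers, explicit memory requests pack four times as many workloads onto the same GPU as quarter-GPU slices.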
3. Predictable Cost and Performance¶
Developers can now see the impact of memory choices in real time. Rafay updates the cost estimate dynamically based on the selected CPU, memory, and GPU memory fraction. For example, selecting a 2 GB GPU memory fraction results in a cost of about $0.20/hour, scaling up predictably for larger allocations.
| GPU Memory | Estimated Cost | Ideal For |
|---|---|---|
| 2GB | $0.20/hr | Lightweight inference, preprocessing |
| 5GB | $0.45/hr | Image or small transformer models |
| 10GB | $0.90/hr | Mid-size AI inference workloads |
| 15GB+ | $1.30+/hr | Fine-tuning, compute-intensive jobs |
Transparency helps developers optimize both budget and performance before deployment.
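The table above implies a roughly linear rate of about $0.09–0.10 per GB-hour. A minimal estimator sketch follows; the rate is back-calculated from the example table in this post and is an assumption, not published Rafay pricing:

```python
# Rough cost estimator for fractional GPU memory.
# ASSUMPTION: a flat ~$0.09 per GB-hour rate, inferred from the example
# table above. Actual pricing is shown by Rafay at deployment time.

RATE_PER_GB_HOUR = 0.09

def estimate_cost(gpu_memory_gb: float, hours: float = 1.0) -> float:
    """Approximate cost in USD for a given GPU memory allocation."""
    return round(gpu_memory_gb * RATE_PER_GB_HOUR * hours, 2)

for gb in (2, 5, 10, 15):
    print(f"{gb} GB -> about ${estimate_cost(gb):.2f}/hr")
```

Under this assumed rate, a 5 GB allocation running for a 40-hour work week would cost roughly $18, which makes budget trade-offs easy to reason about before deploying.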
User Self-Service Workflow¶
The developer experience remains as seamless as ever:
- Go to Developer Pods → New Developer Pod.
- Select fractional-memory-gpu-dev-pod compute profile.
- Choose CPU, memory, and GPU memory size.
- Review cost estimate and deploy.
Rafay takes care of provisioning, isolation, scheduling, and lifecycle management — letting developers focus on building models and running workloads, not infrastructure tuning.
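For intuition, the request the platform makes on the developer's behalf might resemble a Kubernetes pod spec carrying a GPU-memory resource limit. This is purely a hypothetical sketch: the resource key `rafay.io/gpu-memory`, the image name, and the sizes below are invented placeholders, and Rafay abstracts the actual scheduling details away from the user:

```python
import json

# HYPOTHETICAL illustration of a developer pod's resource request.
# "rafay.io/gpu-memory" is an invented placeholder key, not a real
# resource name -- Rafay handles provisioning and scheduling internally.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dev-pod-example"},
    "spec": {
        "containers": [{
            "name": "workspace",
            "image": "example/dev-image:latest",  # placeholder image
            "resources": {
                "requests": {"cpu": "4", "memory": "16Gi"},
                "limits": {"rafay.io/gpu-memory": "5Gi"},  # 5 GB GPU memory
            },
        }],
    },
}

print(json.dumps(pod_spec, indent=2))
```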
Summary¶
In Part-1, we introduced fractional GPU compute, where users can allocate and consume a portion of GPU cores rather than an entire physical GPU. In Part-2, we expanded the capability with fractional GPU memory selection — giving developers precise control over how much GPU memory their workloads consume, independent of compute fraction.
In Part-3, we will show how you can further enhance the self-service experience by giving users the option to pay more for priority access to shared GPU resources.
- **Free Org**: Sign up for a free Org if you want to try this yourself with our Get Started guides.
- **Live Demo**: Schedule time with us to watch a demo in action.


