
How to Select the Right GPU for Open Source LLMs?

Deploying and operating an open-source Large Language Model (LLM) requires careful planning when selecting the right GPU model and memory capacity. Choosing the optimal configuration is crucial for performance, cost efficiency, and scalability. However, this process comes with several challenges.

In this blog, we describe the factors you need to consider when selecting the optimal GPU model for your LLM. We have also included a table mapping the Top-10 open-source LLMs to the GPU models best suited to deploying and serving them.

[Figure: How many GPUs]


Why Is GPU Selection Challenging for LLMs?

Diverse Hardware Requirements

The choice of GPU directly affects performance, inference speed, and whether the entire model can be loaded into memory. Failing to fit the model into GPU memory can be a showstopper.

Quantization Complexity

Models can be loaded using different quantization techniques (e.g., 32-bit, 16-bit, 8-bit, or 4-bit). Lower-bit quantization reduces memory usage but may degrade model accuracy.
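
As a concrete illustration, the sketch below loads a model in 4-bit precision using Hugging Face Transformers and bitsandbytes. The model ID and options are illustrative assumptions, and the exact API can vary between library versions.

```python
# Minimal sketch: loading an LLM in 4-bit with Hugging Face Transformers + bitsandbytes.
# The model ID and settings are illustrative assumptions, not a prescription.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # example model; substitute the LLM you plan to serve

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights: roughly 4x less memory than 16-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in 16-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs if a single card is too small
)
```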

Parallelization & Multi-GPU Scaling

Models with large parameter counts require multiple GPUs connected by high-bandwidth interconnects (NVLink, PCIe) to avoid communication bottlenecks.
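
For example, serving frameworks such as vLLM can shard a model across GPUs using tensor parallelism. The sketch below is a minimal example under assumed settings (the model ID and GPU count are placeholders); NVLink-connected GPUs will sustain this sharding far better than PCIe alone.

```python
# Illustrative sketch: tensor-parallel inference with vLLM across 4 GPUs.
# Model ID and tensor_parallel_size are assumptions for this example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # example 70B model
    tensor_parallel_size=4,             # shard weights across 4 GPUs (NVLink recommended)
    dtype="float16",                    # 16-bit weights
)

outputs = llm.generate(
    ["Explain NVLink in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```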

Inference vs. Training

GPU choice also depends on whether you are training a model from scratch, fine-tuning it, or simply running inference. Training requires significantly more compute and memory than inference.
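
As a rough back-of-the-envelope comparison (the per-parameter byte counts below are common rules of thumb, not exact figures), 16-bit inference stores about 2 bytes per parameter for weights, while mixed-precision training with Adam typically stores on the order of 16 bytes per parameter for weights, gradients, and optimizer states, before counting activations.

```python
# Back-of-the-envelope memory comparison for a 7B-parameter model.
# Bytes-per-parameter figures are common rules of thumb, not exact numbers.
params_billion = 7

inference_gb = params_billion * 2   # 16-bit weights only: ~2 bytes per parameter
training_gb = params_billion * 16   # fp16 weights + gradients + fp32 master weights + Adam states

print(f"~{inference_gb} GB of weights for 16-bit inference")
print(f"~{training_gb} GB of training state (before activations)")
```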

Power & Cooling Considerations

Deploying LLMs at scale means dealing with high power consumption and heat generation, requiring robust data center planning.

Info

We regularly speak with organizations whose data center racks were designed for a much lower power budget (e.g., 10 kW) than what high-end AI infrastructure requires. Next-generation racks draw up to 120 kW per cabinet, roughly an order of magnitude more than a conventional rack, generating heat that can no longer be air-cooled!


Estimating GPU Memory

To estimate the GPU memory required to deploy and operate an open-source LLM, we can use the following formula, originally published by Sam Stoelinga.

Formula

M = (P × 4B) / (32 / Q) × 1.2

Where:

  • M = Required GPU memory (in GB)
  • P = Number of parameters in the model (in billions)
  • 4B = 4 bytes per parameter (standard 32-bit floating point representation)
  • 32 = Number of bits in 4 bytes
  • Q = Number of bits used in quantization (e.g., 16, 8, or 4 bits)
  • 1.2 = Overhead factor (20%) for additional GPU memory usage
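
Transcribed directly into a small Python helper (the function name is just for illustration), the estimate reads:

```python
# Direct transcription of the memory estimate above.
def required_gpu_memory_gb(params_billion: float, quant_bits: int = 16) -> float:
    """Estimate GPU memory (GB) to load a model with `params_billion` parameters
    at `quant_bits` precision, including the 20% overhead factor."""
    bytes_per_param = 4  # 4B: 32-bit floating point baseline
    return params_billion * bytes_per_param / (32 / quant_bits) * 1.2

# Example: Llama 70B served at 16-bit precision needs roughly 168 GB.
print(required_gpu_memory_gb(70, 16))  # 168.0
```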

GPU Models for Top-10 LLMs

Using the formula above, let us calculate the minimum required GPU memory for the Top-10 open-source LLMs based on 16-bit precision (a common standard for efficient inference). We will then map this to recommendations for common NVIDIA GPU models based on their memory capacity and interconnect capabilities.

| Model | Parameters (B) | Min GPU Memory at 16-bit (GB) | Ideal NVIDIA GPUs |
| --- | --- | --- | --- |
| Llama 70B | 70 | 168 | A100 80GB x 3, H100 80GB x 3 |
| Llama 13B | 13 | 31.2 | A6000 48GB, A100 40GB |
| Llama 7B | 7 | 16.8 | RTX 3090, A5000 |
| Mistral 7B | 7 | 16.8 | RTX 3090, A5000 |
| Falcon 40B | 40 | 96 | A100 80GB x 2, H100 80GB x 2 |
| Falcon 7B | 7 | 16.8 | RTX 3090, A5000 |
| GPT-J 6B | 6 | 14.4 | RTX 3090, A5000 |
| GPT-NeoX 20B | 20 | 48 | A100 80GB, H100 80GB |
| BLOOM 176B | 176 | 422.4 | H100 80GB x 6, A100 80GB x 6 |
| OPT 66B | 66 | 158.4 | A100 80GB x 2, H100 80GB x 2 |
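
As a quick cross-check, the sketch below applies the same estimate to a few of the models above and derives a minimum GPU count; the 80 GB capacity is an assumption matching A100/H100 cards.

```python
import math

# Rough helper: minimum number of 80 GB cards (A100/H100) per model,
# using the same 16-bit estimate as the table above.
def min_gpu_count(params_billion: float, quant_bits: int = 16, gpu_memory_gb: int = 80) -> int:
    required_gb = params_billion * 4 / (32 / quant_bits) * 1.2
    return math.ceil(required_gb / gpu_memory_gb)

for name, params in [("Llama 70B", 70), ("Falcon 40B", 40), ("BLOOM 176B", 176)]:
    print(f"{name}: {min_gpu_count(params)} x 80GB GPUs")  # 3, 2, and 6 respectively
```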

Conclusion

Selecting the right GPU for deploying an open-source LLM depends on memory requirements, precision (quantization), and available hardware resources. For smaller models (≤13B), a single consumer or workstation GPU such as the RTX 4090 or A6000 can be sufficient (the 13B class fits on 24 GB cards with 8-bit quantization), while larger models (≥70B) require multiple high-memory enterprise GPUs such as the A100 or H100 connected via NVLink or other high-bandwidth interconnects. Careful hardware selection ensures efficient, cost-effective deployment of open-source LLMs in production environments.