Skip to content

Inference

Family vs. Lineage: Unpacking Two Often-Confused Ideas in the LLM World

LLMs have begun to resemble sprawling family trees. Folks that are relatively new to LLMs will notice two words appear constantly in technical blogs: "family" and "lineage".

They sound interchangeable and users frequently conflate them. But, they describe different slices of an LLM’s life story.

Important

Understanding the differences is more than trivia. This determines how you pick models, tune them, and keep inference predictable at scale.

LLM Family vs Lineage

Why “Family” Matters in the World of LLMs

When GPU bills run into six digits and every millisecond of latency counts, platform teams learn that vocabulary choices and hidden-unit counts aren’t the only things that separate one model checkpoint from another.

LLMs travel in families—lineages of models that share a common architecture, tokenizer, and training recipe. Think of them the way you might think of Apple’s M-series chips or Toyota’s Prius line: the tuning changes, the size varies, but the underlying design stays stable enough that tools, drivers, and workflows remain interchangeable.

In this blog, we will learn about what we mean by a family for LLMs and why this matters for Inference.

LLM Family

Demystifying Quantization: Why It Matters for LLMs and Inference Efficiency

As Large Language Models (LLMs) like GPT, LLaMA, and DeepSeek reach hundreds of billions of parameters, the demand for high-speed, low-cost inference has skyrocketed. Quantization is a technique that helps drastically reduces model size and computational requirements by using lower-precision numbers. In this blog, we will discuss quantization and why it is essential.

Quantization

Compiling a LLM for High Performance Inference

This is the next blog in the blog series on LLMs and Inference. In the previous blog on LLMs and Inference, we discussed about the safetensors format for LLMs. In this blog, we will walk through a critical step for LLM Inference.

Compiling a Large Language Model (LLM) generally refers to optimizing the model’s computational graph and kernel execution to improve inference or training performance on specific hardware (like GPUs or TPUs). Think of this as the next logical step that is performed after loading a model.

LLM Compilation