KV Cache
Your GPUs may be fast and your model well optimized, but inference throughput is often limited by KV-cache behavior rather than raw compute: most serving bottlenecks come from cache pressure, poor reuse, backend incompatibilities, and scheduling inefficiencies. The most effective improvements usually come from the following approaches:
- Increasing usable KV-cache capacity,
- Improving batching stability,
- Enabling chunked prefill,
- Standardizing prompts for reuse,
- Matching KV dtype to backend support, and
- Adding speculative decoding where memory limits dominate.
For the fastest path to measurable gains, start with these two changes:
- Enable FP8 KV-cache on supported platforms.
- Verify chunked prefill is active and performing well under load.
Then layer in prefix optimization, backend validation, and speculative decoding based on benchmark results.
1. Allocate Enough GPU Memory to the KV-cache
vLLM pre-allocates a large portion of VRAM for the KV-cache. If this allocation is too conservative, batching efficiency drops, preemptions increase, and throughput suffers.
Best Practices
- Increase --gpu-memory-utilization until preemptions become infrequent.
- Tune --max-num-seqs to maintain dense batches without causing fragmentation or scheduler thrash.
- Treat these two settings as a pair: higher utilization increases cache capacity, while the sequence limit preserves scheduler stability.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8b-instruct \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
```
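To build intuition for how these settings interact, here is a back-of-the-envelope sketch of KV-cache capacity. All figures (80 GiB card, 16 GiB of fp16 weights, 4 GiB activation workspace, Llama-3-8B-like attention shape) are illustrative assumptions, not vLLM's exact accounting:

```python
# Rough sketch of how --gpu-memory-utilization translates into KV-cache
# capacity. All figures are illustrative assumptions.

GIB = 1024**3

gpu_mem = 80 * GIB             # assumed H100-class card
utilization = 0.92             # --gpu-memory-utilization
weights = 16 * GIB             # ~8B parameters at fp16
activation_overhead = 4 * GIB  # assumed peak activation workspace

kv_cache_budget = gpu_mem * utilization - weights - activation_overhead

# Per-token KV footprint for a Llama-3-8B-like model (fp16):
# 2 tensors (K and V) x 32 layers x 8 KV heads x head dim 128 x 2 bytes.
bytes_per_token = 2 * 32 * 8 * 128 * 2  # 131072 bytes = 128 KiB

max_cached_tokens = int(kv_cache_budget // bytes_per_token)
print(f"KV budget: {kv_cache_budget / GIB:.1f} GiB")
print(f"~{max_cached_tokens:,} cacheable tokens across all sequences")
```

Raising `--gpu-memory-utilization` grows `kv_cache_budget` directly, which is why it is usually the first knob to turn.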
2. Use FP8 KV-cache
Quantizing the KV-cache to FP8 can increase effective context capacity and improve throughput by allowing larger active batches. Note that this does not quantize model weights; it only reduces KV-cache footprint.
Best Practices
- Use --kv-cache-dtype fp8 (or the explicit fp8_e4m3 / fp8_e5m2 variants) only on supported hardware.
- Validate that your GPU, runtime, and attention backend all support FP8 efficiently.
- Benchmark before and after enabling FP8, since compatibility gaps can erase the benefit.
Recommendations
- Prefer FP8 KV-cache on Hopper-, Ada-, or MI300-class hardware.
- Expect little or no gain on Ampere-era GPUs or with unsupported backend combinations.
- If throughput drops after enabling FP8, switch attention backends or revert to the prior KV dtype and compare again.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --kv-cache-dtype fp8_e5m2 \
    --gpu-memory-utilization 0.94 \
    --max-num-seqs 192
```
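The capacity effect is easy to quantify. A sketch assuming a Llama-3-8B-like attention shape (32 layers, 8 KV heads, head dimension 128):

```python
# Per-token KV-cache footprint for a Llama-3-8B-like model -- the layer,
# head, and dimension counts are illustrative assumptions.
layers, kv_heads, head_dim = 32, 8, 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V tensors per layer, one slot per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16 = kv_bytes_per_token(2)  # 131072 bytes/token
fp8 = kv_bytes_per_token(1)   # 65536 bytes/token

# The same KV budget holds twice as many tokens under FP8, so larger
# batches or longer contexts fit before preemption kicks in.
print(f"capacity gain: {fp16 / fp8:.1f}x")
```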
3. Enable Chunked Prefill
Long prompts can monopolize the GPU during the prefill phase, causing decode requests to wait. Chunked prefill breaks large prefills into smaller units so decode work can be interleaved.
Best Practices
- Enable or verify chunked prefill, especially for mixed workloads with both long prompts and live decode traffic.
- Monitor decode latency under load to confirm that scheduling remains balanced.
- Use chunked prefill as a default strategy for production systems serving variable prompt lengths.
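A toy latency model shows why chunking helps. The per-token prefill cost and chunk size below are assumptions for illustration; real scheduler behavior is more nuanced:

```python
# Toy model of decode stalls with and without chunked prefill.
# Assumes a fixed per-token prefill cost; real schedulers are more complex.
prompt_tokens = 8192
us_per_prefill_token = 50  # assumed cost, microseconds
chunk_size = 512           # tokens per prefill chunk

# Without chunking: decode waits for the entire prefill to finish.
stall_unchunked_ms = prompt_tokens * us_per_prefill_token / 1000

# With chunking: decode steps interleave between chunks, so the worst
# single stall is one chunk, not the whole prompt.
stall_chunked_ms = chunk_size * us_per_prefill_token / 1000

print(f"worst decode stall: {stall_unchunked_ms:.0f} ms -> {stall_chunked_ms:.0f} ms")
```

The total prefill work is unchanged; what improves is the worst-case wait a decode request can experience.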
4. Design Prompts for Prefix Cache Reuse
Info
Prefix caching only helps when leading tokens match exactly at the block level. Small prompt differences can prevent block reuse and reduce the expected benefit.
Best Practices
- Standardize system prompts and reusable preambles.
- Keep shared prefixes identical across requests.
- Minimize unnecessary variability in leading prompt tokens.
- Align reusable prompt structure as consistently as possible to maximize cache hits.
Practical Guidance
- Use templated prompts instead of ad hoc string generation.
- Keep shared instructions at the very front of the request.
- Avoid injecting dynamic fields early in the prompt when they could be placed later.
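The guidance above can be sketched as a prompt-construction pattern. The block size mirrors vLLM's default of 16 tokens; tokenization is faked with a whitespace split, and all function names are hypothetical:

```python
# Sketch: put the shared preamble first so leading KV blocks match exactly.
# BLOCK_SIZE mirrors vLLM's default; "tokens" here are just split words.
BLOCK_SIZE = 16

SYSTEM = (
    "You are a customer support assistant for Acme Cloud. Always answer "
    "concisely, cite the relevant policy section, and escalate anything "
    "involving billing disputes to a human agent before responding."
)

def good_prompt(user_name: str, question: str) -> str:
    # Dynamic fields come AFTER the shared prefix.
    return f"{SYSTEM}\nUser ({user_name}): {question}"

def bad_prompt(user_name: str, question: str) -> str:
    # Dynamic field first -- every request diverges almost immediately.
    return f"Session for {user_name}.\n{SYSTEM}\n{question}"

def shared_prefix_blocks(a: str, b: str) -> int:
    # Count whole blocks of leading (fake) tokens that match exactly.
    matched = 0
    for x, y in zip(a.split(), b.split()):
        if x != y:
            break
        matched += 1
    return matched // BLOCK_SIZE

g = shared_prefix_blocks(
    good_prompt("alice", "How do I reset my password?"),
    good_prompt("bob", "What is the refund window?"),
)
b = shared_prefix_blocks(
    bad_prompt("alice", "How do I reset my password?"),
    bad_prompt("bob", "What is the refund window?"),
)
print(g, b)  # shared-prefix-first reuses blocks; dynamic-first reuses none
```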
5. Sliding-window Attention
Sliding-window attention stores KV only for the recent context window rather than the full sequence, reducing cache growth for long sessions.
Best Practices
- Prefer models with sliding-window or local attention for long-lived chat and conversational workloads.
- Take advantage of vLLM’s hybrid KV-cache manager when using models that mix local and full attention layers.
- Use these models when sustained throughput matters more than full-sequence global attention.
6. Use RoPE Scaling Selectively
RoPE scaling can extend usable context length, but it does not reduce the KV-cache memory required per attended token. Longer context still consumes more cache.
Best Practices
- Use RoPE scaling only when the workload truly benefits from longer context.
- Apply it to retrieval-heavy prompts, evaluation scenarios, or specific long-context applications.
- Validate both scaling type and scaling factor in testing before broad rollout.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --rope-scaling '{"type":"dynamic","factor":4.0}' \
    --max-model-len 64000
```
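A quick sketch of the capacity implication, assuming a ~128 KiB-per-token KV footprint (Llama-3-8B-like, fp16):

```python
# RoPE scaling stretches position coverage, but every attended token still
# occupies a full KV slot. 128 KiB/token is an assumed footprint.
KIB = 1024
bytes_per_token = 128 * KIB

base_ctx = 16000    # assumed native context
scaled_ctx = 64000  # factor 4.0

per_seq_base_mib = base_ctx * bytes_per_token / (1024 * 1024)
per_seq_scaled_mib = scaled_ctx * bytes_per_token / (1024 * 1024)

# 4x the context means 4x the worst-case KV per sequence.
print(f"{per_seq_base_mib:.0f} MiB -> {per_seq_scaled_mib:.0f} MiB per full-length sequence")
```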
7. Add Speculative Decoding
When inference is bottlenecked by attention and memory movement rather than compute, speculative decoding can improve tokens per second by using a smaller draft model to propose likely continuations.
Best Practices
- Use speculative decoding when inter-token latency is dominated by memory-bound attention behavior.
- Start with a lightweight draft model and tune acceptance behavior through A/B testing.
- Measure both throughput and latency, since the net effect depends on context length and traffic mix.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8b-instruct \
    --speculative-model microsoft/phi-3-mini-4k-instruct
```
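A common way to reason about the ceiling is the expected number of tokens accepted per target-model pass. Under a simplified model where each of k draft tokens is accepted independently with probability alpha, the expectation is (1 - alpha^(k+1)) / (1 - alpha):

```python
# Expected tokens generated per target-model forward pass with k draft
# tokens and per-token acceptance rate alpha (simplified i.i.d. model).
def expected_tokens(alpha: float, k: int) -> float:
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A draft that gets ~80% of tokens right, proposing 4 tokens at a time,
# yields roughly 3.4 tokens per target pass instead of 1.
estimate = expected_tokens(0.8, 4)
print(f"{estimate:.2f} tokens per target pass")
```

Real acceptance rates depend on the draft/target pair and the traffic mix, which is why the text above recommends A/B testing rather than trusting the formula alone.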
8. Persist Hot KV State across Restarts
Autoscaling, rolling restarts, and pod churn can discard useful KV state and introduce repeated cold-start penalties.
Best Practices
- Use external KV persistence when repeated prompts, shared headers, or warm-start behavior matter.
- Consider persistent or shared KV storage for retrieval-heavy services and recurring prompt patterns.
- Evaluate external KV reuse particularly in environments with aggressive autoscaling.
9. Account for Multimodal Tokens in KV Planning
In multimodal systems, image and other modality tokens consume KV-cache capacity just like text tokens. This can reduce concurrency more than expected.
Best Practices
- Include multimodal token expansion in capacity planning.
- Reduce --max-num-seqs when needed to avoid sudden instability after enabling vision or other non-text inputs.
- Re-baseline throughput after introducing multimodal traffic rather than assuming text-only sizing still applies.
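A rough sketch of the concurrency effect. The 576 image tokens per request is an assumption (a common ViT-L/14-at-336px figure), and the budgets are illustrative:

```python
# Concurrency impact of multimodal tokens on a fixed KV-token budget.
# All numbers below are illustrative assumptions.
kv_token_budget = 400_000  # total cacheable tokens
text_tokens = 1_000        # prompt + generation per request
image_tokens = 576         # assumed per-image token expansion

text_only = kv_token_budget // text_tokens
with_images = kv_token_budget // (text_tokens + image_tokens)

print(f"max concurrent requests: {text_only} -> {with_images}")
```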
10. Match the Attention Backend to the KV-cache dtype
A KV-cache dtype is only useful if the attention backend handles it efficiently. Enabling FP8 on the wrong backend can reduce throughput instead of improving it.
Best Practices
- Validate backend compatibility before enabling FP8 KV-cache in production.
- Compare supported backends such as FlashAttention-2, XFormers, and FlashInfer on your exact stack.
- Re-test whenever you change CUDA, ROCm, vLLM, drivers, or model architecture.
Operational rule
If --kv-cache-dtype fp8* causes a throughput drop, assume you may have lost an optimized attention path and verify backend selection immediately.
11. Understand the PagedAttention Operating Model
PagedAttention improves serving efficiency by allocating fixed-size KV blocks rather than relying on large contiguous tensors. This makes cache placement more flexible and helps maintain batching efficiency as requests enter and leave the system.
Best Practices
- Optimize for dense, reusable block occupancy.
- Avoid settings that create excessive fragmentation or preemption.
- Treat cache layout efficiency as a first-class throughput concern, not just a memory concern.
Mental model
- Dense blocks support larger mixed batches.
- Larger mixed batches improve tokens per second.
- Fragmentation and holes reduce usable batching capacity.
- Reduced batching leads directly to lower throughput.
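The mental model above can be made concrete with a toy free-list allocator. This is an illustration of the paging idea only, not vLLM's actual implementation:

```python
# Minimal sketch of a paged KV-block allocator: fixed-size blocks come from
# a free list, so finished sequences return blocks that any new sequence can
# reuse -- no large contiguous regions, no external fragmentation.
class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, n_tokens: int, block_size: int = 16) -> list[int]:
        need = -(-n_tokens // block_size)  # ceiling division
        if need > len(self.free):
            raise MemoryError("no free blocks: this would trigger preemption")
        return [self.free.pop() for _ in range(need)]

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)  # immediately reusable by any sequence

alloc = BlockAllocator(num_blocks=8)
a = alloc.allocate(40)  # 40 tokens -> 3 blocks
b = alloc.allocate(70)  # 70 tokens -> 5 blocks; pool now exhausted
alloc.release(a)        # one sequence finishing frees capacity
c = alloc.allocate(30)  # 2 blocks, satisfied entirely from recycled blocks
```

Because every allocation is block-granular, a departing sequence always leaves holes that the next arrival can fill, which is what keeps mixed batches dense.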