KV Cache
Your GPUs may be fast and your model well optimized, but inference throughput is often limited by KV-cache behavior rather than raw compute: most serving bottlenecks come from cache pressure, poor reuse, backend incompatibilities, and scheduling inefficiencies. The most effective improvements usually come from the following approaches:
- Increasing usable KV-cache capacity,
- Improving batching stability,
- Enabling chunked prefill,
- Standardizing prompts for reuse,
- Matching KV dtype to backend support, and
- Adding speculative decoding where memory limits dominate.
For the fastest path to measurable gains, start with these two changes:
- Enable FP8 KV-cache on supported platforms.
- Verify chunked prefill is active and performing well under load.
Then layer in prefix optimization, backend validation, and speculative decoding based on benchmark results.
1. Allocate Enough GPU Memory to the KV-cache
vLLM pre-allocates a large portion of VRAM for the KV-cache. If this allocation is too conservative, batching efficiency drops, preemptions increase, and throughput suffers.
Best Practices
- Increase --gpu-memory-utilization until preemptions become infrequent.
- Tune --max-num-seqs to maintain dense batches without causing fragmentation or scheduler thrash.
- Treat these two settings as a pair: higher utilization increases cache capacity, while the sequence limit preserves scheduler stability.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8b-instruct \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
```
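To build intuition for how these settings interact, here is a back-of-the-envelope sketch of KV-cache capacity. All figures (80 GiB card, 16 GiB of fp16 weights, 4 GiB activation workspace, Llama-3-8B-like attention shape) are illustrative assumptions, not vLLM's exact accounting:

```python
# Rough sketch of how --gpu-memory-utilization translates into KV-cache
# capacity. All figures are illustrative assumptions.

GIB = 1024**3

gpu_mem = 80 * GIB             # assumed H100-class card
utilization = 0.92             # --gpu-memory-utilization
weights = 16 * GIB             # ~8B parameters at fp16
activation_overhead = 4 * GIB  # assumed peak activation workspace

kv_cache_budget = gpu_mem * utilization - weights - activation_overhead

# Per-token KV footprint for a Llama-3-8B-like model (fp16):
# 2 tensors (K and V) x 32 layers x 8 KV heads x head dim 128 x 2 bytes.
bytes_per_token = 2 * 32 * 8 * 128 * 2  # 131072 bytes = 128 KiB

max_cached_tokens = int(kv_cache_budget // bytes_per_token)
print(f"KV budget: {kv_cache_budget / GIB:.1f} GiB")
print(f"~{max_cached_tokens:,} cacheable tokens across all sequences")
```

Raising `--gpu-memory-utilization` grows `kv_cache_budget` directly, which is why it is usually the first knob to turn.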
2. Use FP8 KV-cache
Quantizing the KV-cache to FP8 can increase effective context capacity and improve throughput by allowing larger active batches. Note that this does not quantize model weights; it only reduces KV-cache footprint.
Best Practices
- Use --kv-cache-dtype fp8 (or the explicit fp8_e4m3 / fp8_e5m2 variants) only on supported hardware.
- Validate that your GPU, runtime, and attention backend all support FP8 efficiently.
- Benchmark before and after enabling FP8, since compatibility gaps can erase the benefit.
Recommendations
- Prefer FP8 KV-cache on Hopper-, Ada-, or MI300-class hardware.
- Expect little or no gain on Ampere-era GPUs or with unsupported backend combinations.
- If throughput drops after enabling FP8, switch attention backends or revert to the prior KV dtype and compare again.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --kv-cache-dtype fp8_e5m2 \
    --gpu-memory-utilization 0.94 \
    --max-num-seqs 192
```
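The capacity effect is easy to quantify. A sketch assuming a Llama-3-8B-like attention shape (32 layers, 8 KV heads, head dimension 128):

```python
# Per-token KV-cache footprint for a Llama-3-8B-like model -- the layer,
# head, and dimension counts are illustrative assumptions.
layers, kv_heads, head_dim = 32, 8, 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V tensors per layer, one slot per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16 = kv_bytes_per_token(2)  # 131072 bytes/token
fp8 = kv_bytes_per_token(1)   # 65536 bytes/token

# The same KV budget holds twice as many tokens under FP8, so larger
# batches or longer contexts fit before preemption kicks in.
print(f"capacity gain: {fp16 / fp8:.1f}x")
```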
3. Enable Chunked Prefill
Long prompts can monopolize the GPU during the prefill phase, causing decode requests to wait. Chunked prefill breaks large prefills into smaller units so decode work can be interleaved.
Best Practices
- Enable or verify chunked prefill, especially for mixed workloads with both long prompts and live decode traffic.
- Monitor decode latency under load to confirm that scheduling remains balanced.
- Use chunked prefill as a default strategy for production systems serving variable prompt lengths.
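A toy latency model shows why chunking helps. The per-token prefill cost and chunk size below are assumptions for illustration; real scheduler behavior is more nuanced:

```python
# Toy model of decode stalls with and without chunked prefill.
# Assumes a fixed per-token prefill cost; real schedulers are more complex.
prompt_tokens = 8192
us_per_prefill_token = 50  # assumed cost, microseconds
chunk_size = 512           # tokens per prefill chunk

# Without chunking: decode waits for the entire prefill to finish.
stall_unchunked_ms = prompt_tokens * us_per_prefill_token / 1000

# With chunking: decode steps interleave between chunks, so the worst
# single stall is one chunk, not the whole prompt.
stall_chunked_ms = chunk_size * us_per_prefill_token / 1000

print(f"worst decode stall: {stall_unchunked_ms:.0f} ms -> {stall_chunked_ms:.0f} ms")
```

The total prefill work is unchanged; what improves is the worst-case wait a decode request can experience.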
4. Design Prompts for Prefix Cache Reuse
Info
Prefix caching only helps when leading tokens match exactly at the block level. Small prompt differences can prevent block reuse and reduce the expected benefit.
Best Practices
- Standardize system prompts and reusable preambles.
- Keep shared prefixes identical across requests.
- Minimize unnecessary variability in leading prompt tokens.
- Align reusable prompt structure as consistently as possible to maximize cache hits.
Practical Guidance
- Use templated prompts instead of ad hoc string generation.
- Keep shared instructions at the very front of the request.
- Avoid injecting dynamic fields early in the prompt when they could be placed later.
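The guidance above can be sketched as a prompt-construction pattern. The block size mirrors vLLM's default of 16 tokens; tokenization is faked with a whitespace split, and all function names are hypothetical:

```python
# Sketch: put the shared preamble first so leading KV blocks match exactly.
# BLOCK_SIZE mirrors vLLM's default; "tokens" here are just split words.
BLOCK_SIZE = 16

SYSTEM = (
    "You are a customer support assistant for Acme Cloud. Always answer "
    "concisely, cite the relevant policy section, and escalate anything "
    "involving billing disputes to a human agent before responding."
)

def good_prompt(user_name: str, question: str) -> str:
    # Dynamic fields come AFTER the shared prefix.
    return f"{SYSTEM}\nUser ({user_name}): {question}"

def bad_prompt(user_name: str, question: str) -> str:
    # Dynamic field first -- every request diverges almost immediately.
    return f"Session for {user_name}.\n{SYSTEM}\n{question}"

def shared_prefix_blocks(a: str, b: str) -> int:
    # Count whole blocks of leading (fake) tokens that match exactly.
    matched = 0
    for x, y in zip(a.split(), b.split()):
        if x != y:
            break
        matched += 1
    return matched // BLOCK_SIZE

g = shared_prefix_blocks(
    good_prompt("alice", "How do I reset my password?"),
    good_prompt("bob", "What is the refund window?"),
)
b = shared_prefix_blocks(
    bad_prompt("alice", "How do I reset my password?"),
    bad_prompt("bob", "What is the refund window?"),
)
print(g, b)  # shared-prefix-first reuses blocks; dynamic-first reuses none
```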
5. Sliding-window Attention
Sliding-window attention stores KV only for the recent context window rather than the full sequence, reducing cache growth for long sessions.
Best Practices
- Prefer models with sliding-window or local attention for long-lived chat and conversational workloads.
- Take advantage of vLLM’s hybrid KV-cache manager when using models that mix local and full attention layers.
- Use these models when sustained throughput matters more than full-sequence global attention.
6. Use RoPE Scaling Selectively
RoPE scaling can extend usable context length, but it does not reduce the KV-cache memory required per attended token. Longer context still consumes more cache.
Best Practices
- Use RoPE scaling only when the workload truly benefits from longer context.
- Apply it to retrieval-heavy prompts, evaluation scenarios, or specific long-context applications.
- Validate both scaling type and scaling factor in testing before broad rollout.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --rope-scaling '{"type":"dynamic","factor":4.0}' \
    --max-model-len 64000
```
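A quick sketch of the capacity implication, assuming a ~128 KiB-per-token KV footprint (Llama-3-8B-like, fp16):

```python
# RoPE scaling stretches position coverage, but every attended token still
# occupies a full KV slot. 128 KiB/token is an assumed footprint.
KIB = 1024
bytes_per_token = 128 * KIB

base_ctx = 16000    # assumed native context
scaled_ctx = 64000  # factor 4.0

per_seq_base_mib = base_ctx * bytes_per_token / (1024 * 1024)
per_seq_scaled_mib = scaled_ctx * bytes_per_token / (1024 * 1024)

# 4x the context means 4x the worst-case KV per sequence.
print(f"{per_seq_base_mib:.0f} MiB -> {per_seq_scaled_mib:.0f} MiB per full-length sequence")
```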
7. Add Speculative Decoding
When inference is bottlenecked by attention and memory movement rather than compute, speculative decoding can improve tokens per second by using a smaller draft model to propose likely continuations.
Best Practices
- Use speculative decoding when inter-token latency is dominated by memory-bound attention behavior.
- Start with a lightweight draft model and tune acceptance behavior through A/B testing.
- Measure both throughput and latency, since the net effect depends on context length and traffic mix.
Example
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8b-instruct \
    --speculative-model microsoft/phi-3-mini-4k-instruct
```
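A common way to reason about the ceiling is the expected number of tokens accepted per target-model pass. Under a simplified model where each of k draft tokens is accepted independently with probability alpha, the expectation is (1 - alpha^(k+1)) / (1 - alpha):

```python
# Expected tokens generated per target-model forward pass with k draft
# tokens and per-token acceptance rate alpha (simplified i.i.d. model).
def expected_tokens(alpha: float, k: int) -> float:
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A draft that gets ~80% of tokens right, proposing 4 tokens at a time,
# yields roughly 3.4 tokens per target pass instead of 1.
estimate = expected_tokens(0.8, 4)
print(f"{estimate:.2f} tokens per target pass")
```

Real acceptance rates depend on the draft/target pair and the traffic mix, which is why the text above recommends A/B testing rather than trusting the formula alone.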
8. Persist Hot KV State across Restarts
Autoscaling, rolling restarts, and pod churn can discard useful KV state and introduce repeated cold-start penalties.
Best Practices
- Use external KV persistence when repeated prompts, shared headers, or warm-start behavior matter.
- Consider persistent or shared KV storage for retrieval-heavy services and recurring prompt patterns.
- Evaluate external KV reuse particularly in environments with aggressive autoscaling.
9. Account for Multimodal Tokens in KV Planning
In multimodal systems, image and other modality tokens consume KV-cache capacity just like text tokens. This can reduce concurrency more than expected.
Best Practices
- Include multimodal token expansion in capacity planning.
- Reduce --max-num-seqs when needed to avoid sudden instability after enabling vision or other non-text inputs.
- Re-baseline throughput after introducing multimodal traffic rather than assuming text-only sizing still applies.
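A rough sketch of the concurrency effect. The 576 image tokens per request is an assumption (a common ViT-L/14-at-336px figure), and the budgets are illustrative:

```python
# Concurrency impact of multimodal tokens on a fixed KV-token budget.
# All numbers below are illustrative assumptions.
kv_token_budget = 400_000  # total cacheable tokens
text_tokens = 1_000        # prompt + generation per request
image_tokens = 576         # assumed per-image token expansion

text_only = kv_token_budget // text_tokens
with_images = kv_token_budget // (text_tokens + image_tokens)

print(f"max concurrent requests: {text_only} -> {with_images}")
```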
10. Match the Attention Backend to the KV-cache dtype
A KV-cache dtype is only useful if the attention backend handles it efficiently. Enabling FP8 on the wrong backend can reduce throughput instead of improving it.
Best Practices
- Validate backend compatibility before enabling FP8 KV-cache in production.
- Compare supported backends such as FlashAttention-2, XFormers, and FlashInfer on your exact stack.
- Re-test whenever you change CUDA, ROCm, vLLM, drivers, or model architecture.
Operational rule
If --kv-cache-dtype fp8* causes a throughput drop, assume you may have lost an optimized attention path and verify backend selection immediately.
11. Understand the PagedAttention Operating Model
PagedAttention improves serving efficiency by allocating fixed-size KV blocks rather than relying on large contiguous tensors. This makes cache placement more flexible and helps maintain batching efficiency as requests enter and leave the system.
Best Practices
- Optimize for dense, reusable block occupancy.
- Avoid settings that create excessive fragmentation or preemption.
- Treat cache layout efficiency as a first-class throughput concern, not just a memory concern.
Mental model
- Dense blocks support larger mixed batches.
- Larger mixed batches improve tokens per second.
- Fragmentation and holes reduce usable batching capacity.
- Reduced batching leads directly to lower throughput.
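The mental model above can be made concrete with a toy free-list allocator. This is an illustration of the paging idea only, not vLLM's actual implementation:

```python
# Minimal sketch of a paged KV-block allocator: fixed-size blocks come from
# a free list, so finished sequences return blocks that any new sequence can
# reuse -- no large contiguous regions, no external fragmentation.
class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, n_tokens: int, block_size: int = 16) -> list[int]:
        need = -(-n_tokens // block_size)  # ceiling division
        if need > len(self.free):
            raise MemoryError("no free blocks: this would trigger preemption")
        return [self.free.pop() for _ in range(need)]

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)  # immediately reusable by any sequence

alloc = BlockAllocator(num_blocks=8)
a = alloc.allocate(40)  # 40 tokens -> 3 blocks
b = alloc.allocate(70)  # 70 tokens -> 5 blocks; pool now exhausted
alloc.release(a)        # one sequence finishing frees capacity
c = alloc.allocate(30)  # 2 blocks, satisfied entirely from recycled blocks
```

Because every allocation is block-granular, a departing sequence always leaves holes that the next arrival can fill, which is what keeps mixed batches dense.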