Part 5 of 8 · 3 min read
KV-Cache Optimisation: PagedAttention, Prefix Caching and Quantisation
The KV cache is the scarcest resource in LLM serving. Three techniques — PagedAttention, prefix caching and KV quantisation — multiply how many requests fit in the same GPU memory.
In lesson 2 we saw the KV cache is what limits concurrency. This lesson covers the three techniques that stretch that budget the furthest — and when each one actually helps.
1. PagedAttention: stop reserving, start paging
The naive allocator reserves KV-cache space for each request’s maximum possible length up front. If max_tokens is 2,048 but a reply finishes at 60 tokens, ~97% of that reservation is wasted. With many requests, internal fragmentation can waste the majority of your VRAM.
PagedAttention (the core idea behind vLLM) borrows from OS virtual memory. The KV cache is split into fixed-size blocks; a request is allocated blocks on demand as it generates, from a shared pool.
Reserved allocation: [req A: 2048 slots, 60 used ][req B: 2048, 90 used] ... → huge waste
Paged allocation: pool of 16-token blocks, handed out only as needed → ~no waste
Effect: little to no wasted cache, so you fit far more concurrent requests in the same GB, often 2–4× higher real-world concurrency. The runtime cost of the extra block-table lookup is small next to the memory it frees, which is a large part of why vLLM became the default.
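To make the reserved-versus-paged comparison concrete, here is a minimal sketch of both allocators. It is a toy model for counting slots, not vLLM's implementation: `BlockPool`, `reserved_slots` and `paged_slots` are hypothetical names, and the 16-token block size is simply taken from the diagram above.

```python
import math

BLOCK_SIZE = 16  # tokens per block, as in the diagram above


def reserved_slots(max_tokens: int, num_requests: int) -> int:
    """Slots held by a naive allocator that reserves max_tokens per request."""
    return max_tokens * num_requests


def paged_slots(used_tokens: list[int], block_size: int = BLOCK_SIZE) -> int:
    """Slots held when each request owns only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in used_tokens)


class BlockPool:
    """Toy shared pool: hands out block ids on demand and takes them back."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables: dict[str, list[int]] = {}  # request id -> block table
        self.lengths: dict[str, int] = {}       # request id -> tokens stored

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # first token, or the last block is full
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the shared pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


used = [60, 90]  # tokens actually generated by requests A and B
print(reserved_slots(2048, len(used)))  # 4096 slots reserved up front
print(paged_slots(used))                # 160 slots (4 blocks + 6 blocks)
```

In this sketch the only waste left is the unused tail of each request's last block: 10 slots across both requests, against 3,946 under up-front reservation.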
2. Prefix caching: don’t recompute shared prompts
Many workloads send the same prefix on every request — a long system prompt, a few-shot template, a shared document in a RAG context. Without help, every request re-runs prefill over that prefix and stores its own copy of the resulting KV cache.
Prefix caching (a.k.a. automatic prefix caching, enable_prefix_caching in vLLM) computes the KV cache for a shared prefix once, keeps it, and reuses it across all requests that start with that prefix.
The payoff is twofold:
- TTFT drops — the prefill for the cached prefix is skipped entirely; only the new suffix is processed. For a 2,000-token system prompt and a 50-token user turn, you go from prefilling 2,050 tokens to 50.
- Memory drops — one copy of the prefix cache instead of N.
This is the highest-leverage win for agent and chat workloads with big static system prompts. Pair it with provider-side prompt caching (Anthropic, OpenAI) which bills cached input tokens at a fraction of the normal rate — model the savings with the LLM Cost & Latency Estimator.
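The TTFT arithmetic above (2,050 tokens prefilled versus 50) can be sketched with a toy block-level cache. This is an illustration under assumptions, not how any particular engine implements it: `PrefixCache` and `tokens_to_prefill` are hypothetical names, only full 16-token blocks are cached, and each block's key chains in its parent's key so that a block only matches when everything before it matches too.

```python
BLOCK_SIZE = 16  # assumed block size for this sketch


class PrefixCache:
    """Toy prefix cache that tracks which full blocks have been computed."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.known_blocks: set[int] = set()

    def tokens_to_prefill(self, tokens: list[int]) -> int:
        """Return how many tokens still need prefill, then cache new blocks."""
        bs = self.block_size
        parent = 0            # key of the previous block in the chain
        hits = 0              # leading blocks found in the cache
        still_matching = True
        for start in range(0, len(tokens) - bs + 1, bs):
            key = hash((parent, tuple(tokens[start:start + bs])))
            if still_matching and key in self.known_blocks:
                hits += 1
            else:
                still_matching = False  # reuse stops at the first miss
                self.known_blocks.add(key)
            parent = key
        return len(tokens) - hits * bs


system_prompt = list(range(2000))             # stand-in for 2,000 prompt tokens
user_a = [10_000 + i for i in range(50)]      # first 50-token user turn
user_b = [20_000 + i for i in range(50)]      # a different 50-token user turn

cache = PrefixCache()
print(cache.tokens_to_prefill(system_prompt + user_a))  # 2050: cold cache
print(cache.tokens_to_prefill(system_prompt + user_b))  # 50: prefix reused
```

The example works out to exactly 50 because 2,000 is a multiple of the block size. In this sketch, a prefix that ends partway through a block would have that partial block recomputed along with the suffix.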
3. KV quantisation: half the bytes per token
The KV cache is normally stored in FP16/BF16 (2 bytes per value). KV quantisation stores it in FP8 or INT8 (1 byte), roughly halving bytes_per_value in the memory formula:
FP16 KV: 2 × layers × kv_heads × head_dim × 2 × seq_len
FP8 KV: 2 × layers × kv_heads × head_dim × 1 × seq_len → ~2× the concurrency
The trade-off is a small, usually-acceptable quality hit — validate it on your eval set, because the impact varies by model and task. KV quantisation is distinct from weight quantisation (next lesson’s territory); you can apply them independently.
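The memory formula above is easy to turn into a quick calculator. The sketch below uses hypothetical helper names, and the model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB cache budget are illustrative assumptions rather than figures from this lesson. It also ignores any scaling factors a real FP8/INT8 implementation may store alongside the quantised values.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache per token: keys and values (the leading 2) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_bytes: int, layers: int, kv_heads: int,
                      head_dim: int, bytes_per_value: int) -> int:
    """How many tokens of KV cache fit in a given memory budget."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return budget_bytes // per_token


# Assumed, illustrative model shape and cache budget.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 16 * 1024**3  # 16 GiB set aside for the KV cache

fp16_tokens = max_cached_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_tokens = max_cached_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, 1)

print(kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2))  # 131072 bytes/token
print(fp16_tokens)  # 131072 tokens
print(fp8_tokens)   # 262144 tokens
```

Under these assumptions the same budget holds twice as many cached tokens at 1 byte per value, which is where the "~2× the concurrency" estimate comes from.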
Choosing what to reach for
| Symptom | Reach for |
|---|---|
| Low concurrency, lots of short replies under a high max_tokens | PagedAttention (use vLLM) |
| Big shared system prompt / few-shot template / RAG context | Prefix caching + provider prompt caching |
| Memory-bound, willing to validate a small quality cost | KV (FP8/INT8) quantisation |
| Long contexts dominating memory | Shorter max_tokens, GQA models, all of the above |
These stack: PagedAttention removes waste, prefix caching removes redundant work, quantisation halves what’s left.
Takeaway
The KV cache is the binding constraint, so the biggest serving wins come from using it better: PagedAttention eliminates reservation waste, prefix caching skips recomputing shared prompts (cutting both TTFT and memory), and KV quantisation halves bytes per token. Measure concurrency before and after each — the memory formula predicts the gain, your benchmark confirms it.