KV-Cache Optimisation: PagedAttention, Prefix Caching and Quantisation

The KV cache is the scarcest resource in LLM serving. Three techniques — PagedAttention, prefix caching and KV quantisation — multiply how many requests fit in the same GPU memory.

In lesson 2 we saw the KV cache is what limits concurrency. This lesson covers the three techniques that stretch that budget the furthest — and when each one actually helps.

1. PagedAttention: stop reserving, start paging

The naive allocator reserves KV-cache space for each request’s maximum possible length up front. If max_tokens is 2,048 but a reply finishes at 60 tokens, ~97% of that reservation is wasted. Across many requests, this internal fragmentation can waste the majority of the memory set aside for the KV cache.
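The waste figure is simple arithmetic, and it helps to run it over a batch instead of a single reply. The sketch below is illustrative only: the function name and the batch of reply lengths are made up for the example.

```python
def reservation_waste(max_tokens: int, actual_lengths: list[int]) -> float:
    """Fraction of reserved KV slots never used when every request
    reserves max_tokens slots up front."""
    reserved = max_tokens * len(actual_lengths)
    used = sum(min(n, max_tokens) for n in actual_lengths)
    return 1 - used / reserved


# The single-reply case from the text: 60 tokens used out of 2,048 reserved.
single = reservation_waste(2048, [60])
# A hypothetical mixed batch of reply lengths.
batch = reservation_waste(2048, [60, 90, 300, 1200])

print(f"{single:.1%}")  # 97.1%
print(f"{batch:.1%}")   # 79.9%
```

Even with one long reply in the batch, most of the reserved space still sits idle.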

PagedAttention (the core idea behind vLLM) borrows from OS virtual memory. The KV cache is split into fixed-size blocks; a request is allocated blocks on demand as it generates, from a shared pool.

Reserved allocation:   [req A: 2048 slots, 60 used ][req B: 2048, 90 used] ...  → huge waste
Paged allocation:      pool of 16-token blocks, handed out only as needed       → ~no waste

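Effect: waste shrinks to at most one partially filled block per request, so far more concurrent requests fit in the same memory, often 2–4× higher real-world concurrency. The cost is small (a block-table lookup inside the attention kernel), which is a large part of why vLLM became a default choice for serving.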
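A toy allocator makes the on-demand behaviour concrete. This is a minimal sketch of the idea, not vLLM's implementation: the class and method names are hypothetical, and it tracks only block bookkeeping, not the attention kernel that reads through the block table.

```python
class BlockPool:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}             # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        """Record one more token; take a new block only when the last one is full."""
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = length + 1

    def release(self, req_id: str) -> None:
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = BlockPool(num_blocks=256, block_size=16)
for _ in range(60):   # request A generates 60 tokens
    pool.append_token("A")
for _ in range(90):   # request B generates 90 tokens
    pool.append_token("B")

blocks_a = len(pool.block_tables["A"])    # 4 blocks
blocks_b = len(pool.block_tables["B"])    # 6 blocks
slots_paged = (blocks_a + blocks_b) * 16  # 160 slots held
slots_reserved = 2 * 2048                 # 4096 slots under up-front reservation
print(slots_paged, slots_reserved)
```

Requests A and B from the diagram hold 160 slots between them instead of 4,096, and calling `pool.release("A")` makes A's blocks immediately available to other requests.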

2. Prefix caching: don’t recompute shared prompts

Many workloads send the same prefix on every request — a long system prompt, a few-shot template, a shared document in a RAG context. Without help, every request re-runs prefill over that prefix and stores its own copy of the resulting KV cache.

Prefix caching (also called automatic prefix caching; enable_prefix_caching in vLLM) computes the KV cache for a shared prefix once, keeps it, and reuses it across every request that starts with that prefix.

The payoff is twofold:

TTFT (time to first token) drops: the prefill for the cached prefix is skipped entirely and only the new suffix is processed. For a 2,000-token system prompt and a 50-token user turn, you go from prefilling 2,050 tokens to 50.
Memory drops: one copy of the prefix cache instead of N.
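One way to picture the mechanism is a lookup keyed on block contents, where each block's key also depends on everything before it. The sketch below is a simplified illustration under that assumption, not vLLM's actual code: `PrefixCache` and `tokens_to_prefill` are hypothetical names, token IDs are plain integers, and only full 16-token blocks are cached.

```python
BLOCK = 16  # tokens per KV block


class PrefixCache:
    """Toy prefix cache: remembers which full blocks already have KV stored."""

    def __init__(self) -> None:
        self.cached: set[int] = set()

    def tokens_to_prefill(self, tokens: list[int]) -> int:
        """Return how many tokens still need prefill, caching new full blocks."""
        parent = None       # key of the previous block, chains the prefix
        hit_tokens = 0
        matching = True     # a hit only counts while the prefix is unbroken
        full = len(tokens) - len(tokens) % BLOCK
        for start in range(0, full, BLOCK):
            key = hash((parent, tuple(tokens[start:start + BLOCK])))
            if matching and key in self.cached:
                hit_tokens = start + BLOCK
            else:
                matching = False
                self.cached.add(key)
            parent = key
        return len(tokens) - hit_tokens


system_prompt = list(range(2000))              # stand-in for a 2,000-token system prompt
user_1 = [10_000 + i for i in range(50)]       # stand-in for a 50-token user turn
user_2 = [20_000 + i for i in range(50)]       # a different 50-token user turn

cache = PrefixCache()
first = cache.tokens_to_prefill(system_prompt + user_1)
second = cache.tokens_to_prefill(system_prompt + user_2)
print(first, second)  # 2050 50
```

The first request pays for the full prefill; every later request sharing the system prompt only processes its own suffix.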

This is the highest-leverage win for agent and chat workloads with big static system prompts. Pair it with provider-side prompt caching (Anthropic, OpenAI), which bills cached input tokens at a fraction of the normal rate; model the savings with the LLM Cost & Latency Estimator.
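For a rough feel of the billing effect, the calculation can be sketched in a few lines. The price and the cached-token multiplier below are placeholders, not any provider's actual rates, and the sketch ignores any extra charge a provider may apply when the cache is first written.

```python
def input_cost_usd(
    prefix_tokens: int,
    suffix_tokens: int,
    requests: int,
    price_per_mtok: float,
    cached_multiplier: float = 1.0,
) -> float:
    """Input-token cost when the prefix is billed at cached_multiplier x the normal rate."""
    billed_tokens = prefix_tokens * cached_multiplier + suffix_tokens
    return billed_tokens * requests * price_per_mtok / 1_000_000


# Placeholder numbers: $1.00 per million input tokens, cached tokens at 10% of that.
uncached = input_cost_usd(2_000, 50, 1_000, price_per_mtok=1.00)
cached = input_cost_usd(2_000, 50, 1_000, price_per_mtok=1.00, cached_multiplier=0.10)
print(f"${uncached:.2f} vs ${cached:.2f}")  # $2.05 vs $0.25
```

Substitute your provider's published rates before drawing conclusions from the output.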

3. KV quantisation: half the bytes per token

The KV cache is normally stored in FP16/BF16 (2 bytes per value). KV quantisation stores it in FP8 or INT8 (1 byte), roughly halving bytes_per_value in the memory formula:

FP16 KV:  2 × layers × kv_heads × head_dim × 2 × seq_len
FP8  KV:  2 × layers × kv_heads × head_dim × 1 × seq_len   → ~2× the concurrency

The trade-off is a small, usually acceptable quality hit; validate it on your eval set, because the impact varies by model and task. KV quantisation is distinct from weight quantisation (next lesson’s territory); you can apply them independently.
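The memory formula above can be evaluated directly to see what halving bytes_per_value buys. The model shape (32 layers, 8 KV heads, head_dim 128) and the 20 GiB cache budget are illustrative assumptions, and the sketch ignores the small overhead of storing quantisation scales.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_gib: int, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the given memory budget."""
    return budget_gib * 1024**3 // bytes_per_token


# Assumed model shape and budget, for illustration only.
fp16 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
fp8 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=1)

print(fp16, max_cached_tokens(20, fp16))  # 131072 163840
print(fp8, max_cached_tokens(20, fp8))    # 65536 327680
```

The token capacity doubles, which is where the roughly 2× concurrency estimate comes from when requests have similar lengths.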

Choosing what to reach for

Symptom	Reach for
Low concurrency, lots of short replies under a high `max_tokens`	PagedAttention (use vLLM)
Big shared system prompt / few-shot template / RAG context	Prefix caching + provider prompt caching
Memory-bound, willing to validate a small quality cost	KV quantisation (FP8/INT8)
Long contexts dominating memory	Shorter `max_tokens`, GQA models, all of the above

These stack: PagedAttention removes waste, prefix caching removes redundant work, quantisation halves what’s left.

Takeaway

The KV cache is the binding constraint, so the biggest serving wins come from using it better: PagedAttention eliminates reservation waste, prefix caching skips recomputing shared prompts (cutting both TTFT and memory), and KV quantisation halves bytes per token. Measure concurrency before and after each — the memory formula predicts the gain, your benchmark confirms it.
