Part 2 of 8 · 3 min read
The LLM Inference Stack: Prefill, Decode and the KV Cache
How LLM inference actually runs: the prefill and decode phases, what the KV cache is, and the exact memory formula that decides how many requests you can serve at once.
To optimise LLM serving you need a mechanical picture of what the GPU is doing. The single most important structure is the KV cache, and the single most useful skill is being able to compute its size. Let’s build both.
Prefill vs decode, mechanically
A transformer generates text autoregressively — each new token attends to every previous token. Naively, generating token N would require re-processing all N−1 previous tokens. That would be catastrophically slow, so instead the model caches the intermediate attention vectors (the “keys” and “values”) for every token it has seen. That cache is the KV cache.
- Prefill phase. The prompt’s tokens are pushed through the model in one parallel pass, populating the KV cache for every prompt token. This is compute-heavy and GPU-efficient — lots of matrix multiply, high utilisation. It produces the first output token.
- Decode phase. Each subsequent token is generated one at a time. Each step reads the model weights and the entire KV cache, computes one new token, and appends its K/V to the cache. This is memory-bandwidth-bound: the GPU spends most of its time moving weights and cache out of memory, not computing.
This is why decode is the throughput bottleneck and why batching (covered next lesson) matters so much: batching lets the GPU reuse one expensive weight-load across many requests’ decode steps.
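The two phases can be sketched with a toy single-head, single-layer cache in plain Python. This is only an illustration of the bookkeeping, not how a real engine is written: the class `ToyKVCache`, the helper names, and the hand-picked vectors are all assumptions, and a real model would compute queries, keys and values from learned projections on the GPU.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over every cached key/value pair."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

class ToyKVCache:
    """Hypothetical cache: one (key, value) pair stored per token seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def prefill(cache, prompt_keys, prompt_values):
    # Prefill: store K/V for every prompt token.
    # A GPU does this in one parallel pass; the loop here is only for clarity.
    for key, value in zip(prompt_keys, prompt_values):
        cache.append(key, value)

def decode_step(cache, query, key, value):
    # Decode: append the new token's K/V, then read the whole cache once.
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)

cache = ToyKVCache()
prefill(cache, [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
out = decode_step(cache, [0.0, 0.0], [1.0, 1.0], [4.0, 4.0])
print(len(cache.keys), out)
```

Each decode step grows the cache by one entry and touches every entry already in it, which is why the cost of a step is dominated by memory reads rather than arithmetic.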
The KV cache memory formula
This is the equation to memorise:
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_value
total_kv = kv_bytes_per_token × sequence_length × batch_size
The leading 2 is for keys and values. bytes_per_value is 2 for FP16/BF16.
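As a minimal sketch, the two lines above translate directly into Python. The function names are illustrative, and the small model configuration in the usage lines is a made-up example chosen only to keep the arithmetic easy to follow.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes stored for one token; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def total_kv_bytes(per_token_bytes, sequence_length, batch_size):
    """KV-cache bytes for a whole batch of equal-length sequences."""
    return per_token_bytes * sequence_length * batch_size

# Hypothetical tiny model: 4 layers, 2 KV heads, head_dim 64, 2-byte values.
per_token = kv_bytes_per_token(num_layers=4, num_kv_heads=2, head_dim=64)
batch_total = total_kv_bytes(per_token, sequence_length=100, batch_size=3)
print(per_token, batch_total)
```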
Worked example: Llama-3-8B
Llama-3-8B has 32 layers, 8 KV heads (it uses grouped-query attention), and head_dim 128, in BF16:
kv_bytes_per_token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB/token
So a single 4,096-token context consumes:
128 KB × 4,096 = 512 MB per request
On an 80 GB A100/H100, after the model weights take ~16 GB (8B params × 2 bytes), you have roughly 64 GB for KV cache, ignoring activations and runtime overhead. That’s about 128 simultaneous 4K-token requests (64 GB ÷ 512 MB), the hard ceiling on your concurrency, before any batching cleverness. Grouped-query attention (8 KV heads instead of 32) is exactly why this number isn’t 4× smaller.
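The same capacity estimate can be reproduced in a few lines. This sketch follows the round numbers used above and assumes binary units (1 GB = 1024³ bytes) and no memory spent on activations or runtime overhead, so treat the result as an upper bound; `max_concurrent_requests` is an illustrative helper, not a library function.

```python
GIB = 1024 ** 3

def max_concurrent_requests(gpu_bytes, weight_bytes, kv_bytes_per_token, seq_len):
    """Upper bound on full-length requests that fit in the leftover GPU memory."""
    kv_budget = gpu_bytes - weight_bytes
    return kv_budget // (kv_bytes_per_token * seq_len)

# Llama-3-8B figures from the text: 32 layers, head_dim 128, BF16 (2 bytes).
gqa_per_token = 2 * 32 * 8 * 128 * 2    # 8 KV heads (grouped-query attention)
mha_per_token = 2 * 32 * 32 * 128 * 2   # hypothetical variant with 32 KV heads

gqa_limit = max_concurrent_requests(80 * GIB, 16 * GIB, gqa_per_token, 4096)
mha_limit = max_concurrent_requests(80 * GIB, 16 * GIB, mha_per_token, 4096)
print(gqa_limit, mha_limit)
```

The second figure shows the 4× penalty that a 32-KV-head layout would carry under the same assumptions.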
You can sanity-check pool-style ceilings like this with Little’s Law: concurrency is capped by memory, and throughput is capped by concurrency ÷ latency.
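A minimal sketch of that sanity check, using Little’s Law (L = λ × W) rearranged for throughput. The 8-second average request latency is a hypothetical figure chosen for illustration; measure your own before relying on the result.

```python
def max_throughput_rps(max_concurrency, avg_latency_s):
    """Little's Law rearranged: requests/second = concurrency / latency."""
    return max_concurrency / avg_latency_s

# 128 concurrent requests is the memory ceiling from the worked example;
# 8.0 seconds per request is an assumed value, not a measurement.
ceiling_rps = max_throughput_rps(max_concurrency=128, avg_latency_s=8.0)
print(ceiling_rps)
```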
Why this governs everything
Every serving optimisation you’ll meet is, underneath, about this memory budget:
- PagedAttention (vLLM) stops you from reserving max-length cache per request and only allocates the blocks actually used — packing far more requests into the same GB.
- KV-cache quantisation stores K/V in 8-bit instead of 16-bit, halving bytes_per_value and doubling concurrency.
- Prefix caching shares the KV cache of a common prompt prefix across requests instead of recomputing and re-storing it.
- Shorter max_tokens caps sequence_length, directly freeing cache for more concurrent requests.
We cover each in the KV-cache optimisation lesson.
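To make the budget effect concrete, here is a rough accounting sketch that compares reserving a maximum-length cache per request with allocating only the fixed-size blocks each request actually uses. It is not vLLM’s implementation: the request lengths are invented, the 16-token block size is an assumption, and the helper names are illustrative.

```python
import math

def reserved_tokens_static(request_lengths, max_len):
    """Token slots reserved when every request gets a max-length cache."""
    return len(request_lengths) * max_len

def reserved_tokens_paged(request_lengths, block_size=16):
    """Token slots reserved when cache is handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)

# Hypothetical mix of request lengths, in tokens.
lengths = [300, 1200, 50, 4096]
static_slots = reserved_tokens_static(lengths, max_len=4096)
paged_slots = reserved_tokens_paged(lengths, block_size=16)

# Bytes per token for Llama-3-8B in BF16, and with 8-bit K/V values.
bf16_per_token = 2 * 32 * 8 * 128 * 2
int8_per_token = 2 * 32 * 8 * 128 * 1
print(static_slots, paged_slots, bf16_per_token // int8_per_token)
```

In this made-up mix, block-level allocation reserves roughly a third of the slots that max-length reservation does, and 8-bit values halve the bytes behind each slot.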
Quick reference
| Lever | Effect on KV cache |
|---|---|
| Longer context / max_tokens | Linearly more cache per request |
| Grouped-query attention (GQA) | Fewer KV heads → less cache |
| KV quantisation (FP8/INT8) | ~2× less cache |
| PagedAttention | Far less wasted cache |
| Bigger batch | Linearly more total cache |
Takeaway
LLM inference is two phases: a parallel, compute-bound prefill that fills the KV cache, and a
serial, bandwidth-bound decode that reads it every step. The KV-cache memory formula —
2 × layers × kv_heads × head_dim × bytes × seq_len × batch — is the budget that sets your maximum
concurrency. Learn to compute it for your model and you can predict capacity before you deploy.