The LLM Inference Stack: Prefill, Decode and the KV Cache

How LLM inference actually runs: the prefill and decode phases, what the KV cache is, and the exact memory formula that decides how many requests you can serve at once.

To optimise LLM serving you need a mechanical picture of what the GPU is doing. The single most important structure is the KV cache, and the single most useful skill is being able to compute its size. Let’s build both.

Prefill vs decode, mechanically

A transformer generates text autoregressively: each new token attends to every previous token. Naively, generating token N would require re-processing all N−1 previous tokens. That would be catastrophically slow, so instead the model caches the per-layer attention vectors (the “keys” and “values”) for every token it has seen. That cache is the KV cache.
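The caching idea can be sketched with a toy single-head attention in plain Python. This is an illustration under loud assumptions, not how any real model is implemented: `project` is a hypothetical stand-in for the learned key and value projections, the query is just the token vector itself, and there is only one layer. It shows that attending over cached K/V gives the same output as recomputing K/V for the whole sequence at every step.

```python
import math

def project(x, scale):
    # Hypothetical stand-in for a learned K or V projection matrix.
    return [scale * xi for xi in x]

def attend(query, keys, values):
    """Softmax attention of one query vector over a list of keys/values."""
    norm = 1.0 / math.sqrt(len(query))
    scores = [norm * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def step_without_cache(tokens):
    # Naive: recompute K and V for every token seen so far.
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(tokens[-1], keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# Cached: compute K and V once per token, then only append.
cached_keys, cached_values, cached_outputs = [], [], []
for token in tokens:
    cached_keys.append(project(token, 0.5))
    cached_values.append(project(token, 2.0))
    cached_outputs.append(attend(token, cached_keys, cached_values))

naive_outputs = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

For a sequence of n tokens the naive path performs n(n+1)/2 key projections while the cached path performs n, which is the saving the KV cache buys at the cost of memory.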

Prefill phase. The prompt’s tokens are pushed through the model in one parallel pass, populating the KV cache for every prompt token. This is compute-heavy and GPU-efficient — lots of matrix multiply, high utilisation. It produces the first output token.
Decode phase. Each subsequent token is generated one at a time. Each step reads the model weights and the entire KV cache, computes one new token, and appends its K/V to the cache. This is memory-bandwidth-bound: the GPU spends most of its time moving weights and cache out of memory, not computing.

This is why decode is the throughput bottleneck and why batching (covered in the next lesson) matters so much: batching lets the GPU reuse one expensive weight-load across many requests’ decode steps.
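A rough way to see the batching effect is to model one decode step as pure memory traffic. The sketch below is a back-of-envelope model, not a benchmark: the 2 TB/s bandwidth, the 16 GB of weights and the 0.5 GB of KV cache per request are illustrative assumptions, and the model ignores compute time and kernel overhead entirely.

```python
def decode_step_seconds(weight_bytes, kv_bytes_per_request, batch_size,
                        bandwidth_bytes_per_s):
    """Lower bound on one decode step if it is purely bandwidth-bound.

    The weights are read once per step regardless of batch size;
    each request's KV cache is read once per step.
    """
    bytes_moved = weight_bytes + batch_size * kv_bytes_per_request
    return bytes_moved / bandwidth_bytes_per_s

def tokens_per_second(batch_size,
                      weight_bytes=16e9,          # assumed: ~8B params in BF16
                      kv_bytes_per_request=0.5e9, # assumed: one long-context request
                      bandwidth_bytes_per_s=2e12):# assumed: ~2 TB/s of HBM bandwidth
    step = decode_step_seconds(weight_bytes, kv_bytes_per_request,
                               batch_size, bandwidth_bytes_per_s)
    # Each step emits one token per request in the batch.
    return batch_size / step

for batch in (1, 8, 64):
    print(f"batch={batch:>2}  ~{tokens_per_second(batch):,.0f} tokens/s (upper bound)")
```

Under these assumptions throughput grows with batch size because the fixed weight read is amortised, but the growth flattens as KV-cache reads start to dominate the traffic.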

The KV cache memory formula

This is the equation to memorise:

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_value

total_kv = kv_bytes_per_token × sequence_length × batch_size

The leading 2 accounts for storing both keys and values. bytes_per_value is 2 for FP16/BF16 and 1 for FP8/INT8.
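The formula translates directly into a small helper. The function names are our own, and the model shape in the usage lines is a made-up toy configuration chosen only to keep the arithmetic easy to check by hand.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored for one token (keys plus values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def total_kv_bytes(num_layers, num_kv_heads, head_dim,
                   sequence_length, batch_size, bytes_per_value=2):
    """Total KV cache for a batch of equal-length sequences."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                   bytes_per_value)
    return per_token * sequence_length * batch_size

# Hypothetical toy model: 4 layers, 2 KV heads, head_dim 64, 16-bit values.
toy_per_token = kv_bytes_per_token(num_layers=4, num_kv_heads=2, head_dim=64)
toy_total = total_kv_bytes(4, 2, 64, sequence_length=1000, batch_size=3)
print(toy_per_token, toy_total)
```

Note that num_kv_heads is the number of key/value heads, which for grouped-query models is smaller than the number of query heads.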

Worked example: Llama-3-8B

Llama-3-8B has 32 layers, 8 KV heads (it uses grouped-query attention), and head_dim 128, in BF16:

kv_bytes_per_token = 2 × 32 × 8 × 128 × 2  = 131,072 bytes ≈ 128 KB/token

So a single 4,096-token context consumes:

128 KB × 4,096 ≈ 512 MB   ... per request

On an 80 GB A100/H100, after the model weights take ~16 GB (8B params × 2 bytes), you have roughly 64 GB for KV cache. At 512 MB per request, that’s about 128 simultaneous 4K-token requests (64 GB ÷ 512 MB). Treat this as an upper bound on your concurrency, before any batching cleverness: activations and runtime overhead also need memory, so the practical number is somewhat lower. Grouped-query attention (8 KV heads instead of 32) is exactly why this number isn’t 4× smaller.
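The same arithmetic as a short script, using the Llama-3-8B numbers above. It is a simplified estimate: it treats GB as GiB, takes the ~16 GB weight figure at face value, and ignores activations and runtime overhead. The function name is our own.

```python
GIB = 2**30

def max_concurrent_requests(gpu_bytes, weight_bytes,
                            kv_bytes_per_token, context_len):
    """How many full-length contexts fit in the memory left after weights."""
    free_bytes = gpu_bytes - weight_bytes
    return free_bytes // (kv_bytes_per_token * context_len)

# Llama-3-8B in BF16: 32 layers, 8 KV heads, head_dim 128.
gqa_per_token = 2 * 32 * 8 * 128 * 2
# Same model if it used 32 KV heads (no grouped-query attention).
mha_per_token = 2 * 32 * 32 * 128 * 2

with_gqa = max_concurrent_requests(80 * GIB, 16 * GIB, gqa_per_token, 4096)
without_gqa = max_concurrent_requests(80 * GIB, 16 * GIB, mha_per_token, 4096)
print(f"GQA (8 KV heads):  {with_gqa} requests")
print(f"MHA (32 KV heads): {without_gqa} requests")
```

Swapping in your own model's layer count, KV-head count and head dimension gives a first capacity estimate before any load testing.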

You can sanity-check pool-style ceilings like this with Little’s Law: concurrency is capped by memory, and throughput is capped by concurrency ÷ latency.
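As a quick illustration of that bound, the snippet below plugs the 128-request ceiling into Little's Law. The 20-second per-request latency is a hypothetical figure chosen for the example, not a measurement.

```python
def throughput_requests_per_second(concurrency, latency_seconds):
    """Little's Law rearranged: throughput = concurrency / latency."""
    return concurrency / latency_seconds

MAX_CONCURRENCY = 128      # memory-bound ceiling from the worked example
ASSUMED_LATENCY_S = 20.0   # hypothetical end-to-end time per request

ceiling = throughput_requests_per_second(MAX_CONCURRENCY, ASSUMED_LATENCY_S)
print(f"at most {ceiling:.1f} requests/s")
```

If measured throughput is far below this ceiling, the limit is probably somewhere other than KV-cache memory.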

Why this governs everything

Every serving optimisation you’ll meet is, underneath, about this memory budget:

PagedAttention (vLLM) stops you from reserving max-length cache per request and only allocates the blocks actually used — packing far more requests into the same GB.
KV-cache quantisation stores K/V in 8-bit instead of 16-bit, halving bytes_per_value and roughly doubling the number of requests that fit.
Prefix caching shares the KV cache of a common prompt prefix across requests instead of recomputing and re-storing it.
Shorter max_tokens caps sequence_length, directly freeing cache for more concurrent requests.

We cover each in the KV-cache optimisation lesson.
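To make the memory-budget framing concrete, here is a rough comparison of reserving a max-length cache per request against allocating only the blocks each request uses, with and without 8-bit K/V. It is a sketch of the accounting only, not of vLLM's allocator: the block size of 16 tokens and the four request lengths are assumptions chosen for illustration.

```python
import math

KV_BYTES_PER_TOKEN = 131072  # Llama-3-8B in BF16, from the worked example

def reserved_bytes(num_requests, max_len, per_token):
    """Every request reserves a full max-length cache up front."""
    return num_requests * max_len * per_token

def paged_bytes(actual_lens, block_size, per_token):
    """Each request holds only the fixed-size blocks it has actually filled."""
    blocks = sum(math.ceil(n / block_size) for n in actual_lens)
    return blocks * block_size * per_token

actual_lens = [100, 900, 2500, 4096]  # hypothetical current lengths, in tokens
BLOCK_SIZE = 16                       # assumed tokens per block

reserved = reserved_bytes(len(actual_lens), 4096, KV_BYTES_PER_TOKEN)
paged = paged_bytes(actual_lens, BLOCK_SIZE, KV_BYTES_PER_TOKEN)
paged_8bit = paged_bytes(actual_lens, BLOCK_SIZE, KV_BYTES_PER_TOKEN // 2)

for label, size in [("reserved", reserved), ("paged", paged),
                    ("paged + 8-bit", paged_8bit)]:
    print(f"{label:<14} {size / 2**20:8.1f} MiB")
```

The gap between the reserved and paged figures depends entirely on how much shorter real requests are than the maximum length, so the benefit varies with the workload.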

Quick reference

Lever	Effect on KV cache
Longer context / `max_tokens`	Linearly more cache per request
Grouped-query attention (GQA)	Fewer KV heads → less cache
KV quantisation (FP8/INT8)	~2× less cache
PagedAttention	Far less wasted cache
Bigger batch	Linearly more total cache

Takeaway

LLM inference is two phases: a parallel, compute-bound prefill that fills the KV cache, and a serial, bandwidth-bound decode that reads it every step. The KV-cache memory formula — 2 × layers × kv_heads × head_dim × bytes × seq_len × batch — is the budget that sets your maximum concurrency. Learn to compute it for your model and you can predict capacity before you deploy.
