Part 4 of 8 · 3 min read
Continuous Batching: How vLLM and TGI Hit High Throughput
Static batching wastes the GPU waiting for the slowest request. Continuous (in-flight) batching is why modern LLM servers achieve many times the throughput — here's how it works and how to tune it.
If one optimisation explains why modern LLM servers (vLLM, TGI, TensorRT-LLM) deliver several times the throughput of a naive request loop, it’s continuous batching. Understanding it changes how you reason about every latency number.
The problem with static batching
The obvious way to batch is to collect N requests, run them together, and return when all N finish. The flaw: requests have wildly different output lengths. Batch a 20-token reply with a 2,000-token reply and for 99% of the batch’s decode steps only one request is doing any work, while the other slot sits idle: already finished, still occupying the batch.
Static batch of 4 (─ = generating, · = idle, done but stuck):
req A ─────done······························
req B ─────────────done······················
req C ──────────────────────────────────done·
req D ───done·································
^ GPU mostly idle, waiting for C
GPU utilisation collapses, and short requests are held hostage by the longest one in their batch.
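To put a number on that waste, here is a minimal sketch that counts useful slot-steps in one static batch. It assumes every request occupies its slot until the longest one finishes and that each decode step costs the same; the function name is made up for illustration.

```python
def static_batch_utilisation(output_lengths):
    """Fraction of slot-steps that emit a token in one static batch."""
    steps = max(output_lengths)        # the batch runs until the longest request finishes
    useful = sum(output_lengths)       # slot-steps in which a request actually decodes
    return useful / (steps * len(output_lengths))


# The 20-token vs 2,000-token pairing from the text.
print(f"{static_batch_utilisation([20, 2000]):.3f}")  # 0.505
```

Roughly half of the slot-steps in that batch do no work, and the short request's slot is idle for 1,980 of the 2,000 steps.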
Continuous batching: fill the slots every step
Continuous (a.k.a. in-flight or iteration-level) batching works at the granularity of a single decode step rather than a whole request:
- Every decode iteration, the scheduler looks at all in-flight requests and runs one token for each.
- The instant a request emits its stop token (or hits its length limit), it leaves the batch and its slot is freed.
- A waiting request is admitted into that freed slot immediately — mid-flight, without waiting for the rest of the batch.
Continuous batching (slots refilled as they free):
req A ─────done
req E          ─────────────done ← admitted the moment A finished
req B ─────────────done
req F                  ──────────done
^ GPU stays busy
The GPU is kept saturated, so throughput rises dramatically — often 5–20× over static batching on mixed-length traffic — and short requests are no longer blocked behind long ones.
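The scheduler rules above can be sketched as a toy simulation that counts decode steps only. It is not how vLLM or TGI are implemented: it ignores prefill, memory limits, and the fact that step time grows with batch size, and the request lengths are illustrative.

```python
from collections import deque


def static_steps(lengths, slots):
    """Decode steps when each batch of `slots` requests runs until its longest finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths, slots):
    """Decode steps when freed slots are refilled at the next iteration."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before this step.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode iteration: every in-flight request emits one token.
        running = [r - 1 for r in running]
        # Finished requests leave the batch, freeing their slots.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lengths = [5, 13, 34, 3, 13, 10]   # output tokens per request, in arrival order
print(static_steps(lengths, slots=4))      # 47
print(continuous_steps(lengths, slots=4))  # 34
```

With continuous batching the two late requests slip into slots freed by the short ones, so the whole workload finishes when the longest request does.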
What this means for your latency numbers
This is the mechanism behind the throughput/latency tension from lesson 1:
- More in-flight requests → higher throughput, but each decode step processes more sequences, so per-token latency (TPOT) rises for everyone.
- A newly arriving request can only join at a step boundary, and only when a slot is free. Once the batch is full it has to queue, so TTFT under load is dominated by how full the batch already is, not by your prompt length.
That’s why p99 TTFT degrades sharply past the knee: the batch is full, new requests queue, and the wait compounds.
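The first bullet can be illustrated with a toy cost model. The linear step time and both constants below are assumptions chosen only to show the shape of the trade-off; they are not measurements from any server.

```python
def step_time_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Assumed cost of one decode step: a fixed part plus a per-sequence part."""
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size, **cost):
    """Each step emits one token per in-flight sequence."""
    return batch_size * 1000.0 / step_time_ms(batch_size, **cost)


for n in (1, 8, 32, 128):
    # TPOT equals the step time, because every sequence gets one token per step.
    print(f"batch={n:>3}  TPOT={step_time_ms(n):6.1f} ms  "
          f"throughput={tokens_per_second(n):7.1f} tok/s")
```

Under this model throughput keeps climbing with batch size while TPOT climbs too, which is the tension the bullets describe.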
Tuning knobs (vLLM terms)
| Knob | Effect | Trade-off |
|---|---|---|
| max_num_seqs | Max concurrent sequences in a batch | Higher = more throughput, worse TPOT |
| max_num_batched_tokens | Token budget per iteration (prefill + decode) | Trades prefill speed (TTFT) against decode stalls (TPOT) |
| gpu_memory_utilization | Fraction of VRAM vLLM may use; what is left after weights and activations becomes KV cache | Higher = more concurrency, less headroom |
| Chunked prefill (enable_chunked_prefill) | Splits long prefills across steps | Stops a big prompt from stalling everyone’s decode |
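As a rough sketch, these knobs can be passed as engine arguments when constructing vLLM's offline `LLM` object. The values and the model name are placeholder assumptions, not recommendations; check the engine-arguments documentation for your vLLM version before relying on defaults.

```python
# Illustrative values only; tune them against your own concurrency ramp.
engine_kwargs = {
    "max_num_seqs": 64,               # cap on concurrent sequences per batch
    "max_num_batched_tokens": 4096,   # per-iteration token budget (prefill + decode)
    "gpu_memory_utilization": 0.90,   # share of GPU memory vLLM may claim
    "enable_chunked_prefill": True,   # split long prefills across steps
}


def build_engine(model_name):
    """Construct a vLLM engine with the settings above (needs vLLM and a GPU)."""
    from vllm import LLM  # imported lazily so this file loads without vLLM installed
    return LLM(model=model_name, **engine_kwargs)


# Usage, with a hypothetical model id:
# llm = build_engine("your-org/your-model")
```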
Chunked prefill deserves a callout: without it, one 8,000-token prompt’s prefill occupies an entire step, and every other request’s decode waits for it, spiking their TPOT. Chunking splits that prefill into smaller pieces that run alongside the ongoing decodes.
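A simplified sketch of the idea: each step first reserves one token per in-flight decode, then spends whatever is left of the token budget on the next slice of the long prompt. It assumes the number of decoding sequences stays constant, which a real scheduler does not; the function name is hypothetical.

```python
def chunked_prefill_plan(prompt_tokens, decode_seqs, token_budget):
    """Return one (prefill_chunk, decode_tokens) pair per scheduler step."""
    if token_budget <= decode_seqs:
        raise ValueError("token budget leaves no room for prefill")
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are served first; prefill gets the rest of the budget.
        chunk = min(remaining, token_budget - decode_seqs)
        plan.append((chunk, decode_seqs))
        remaining -= chunk
    return plan


plan = chunked_prefill_plan(prompt_tokens=8000, decode_seqs=48, token_budget=2048)
print(len(plan), plan[0])  # 4 (2000, 48)
```

The 8,000-token prompt is prefilled over four steps, and all 48 decoding requests still receive a token in each of them.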
How to tune it
Run the concurrency ramp from the measuring lesson at a few max_num_seqs values. You’ll see throughput rise and p95 TTFT worsen as you raise it; pick the largest value that still meets your TTFT SLO. There is no universal best setting; it’s a point on your curve determined by your latency target.
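The selection rule can be written down directly. The sketch below assumes you have already collected one (max_num_seqs, throughput, p95 TTFT) row per ramp run; the rows shown are made-up numbers purely to exercise the function, not benchmark results.

```python
def pick_max_num_seqs(results, ttft_slo_ms):
    """Largest max_num_seqs whose p95 TTFT meets the SLO, or None if none does.

    results: iterable of (max_num_seqs, throughput_tok_s, p95_ttft_ms) tuples.
    """
    within_slo = [row for row in results if row[2] <= ttft_slo_ms]
    if not within_slo:
        return None
    return max(within_slo, key=lambda row: row[0])[0]


# Hypothetical ramp results, invented for illustration.
ramp = [(16, 900, 180), (32, 1500, 260), (64, 2100, 420), (128, 2400, 900)]
print(pick_max_num_seqs(ramp, ttft_slo_ms=500))  # 64
```

Returning None when nothing meets the SLO is a deliberate choice: it signals that the target is unreachable on this hardware at this load, rather than quietly picking the least-bad setting.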
Takeaway
Continuous batching refills batch slots every decode step instead of waiting for the slowest request,
keeping the GPU saturated and multiplying throughput. The cost is that throughput and per-token
latency move together through max_num_seqs — so tuning a server means choosing how much TPOT you’ll
trade for throughput, then defending a p95 TTFT SLO. Next: squeezing more requests into memory with
KV-cache optimisation.