Part 6 of 8 · 3 min read
Reducing Time to First Token (TTFT)
TTFT is the latency users feel first. A practical, ordered checklist for cutting it — prompt caching, chunked prefill, speculative decoding, routing and queue management — with the trade-offs.
Time to first token is the latency a user actually perceives: the gap between hitting enter and seeing the response start. This lesson is an ordered checklist for cutting it, cheapest and highest-leverage first.
First, decompose your TTFT
TTFT under load is three things stacked:
TTFT = queue_wait + prefill_time + first_decode_step
- queue_wait — time before the request is even admitted to the batch (dominates under load).
- prefill_time — processing the prompt; scales with input length.
- first_decode_step — one decode iteration to emit token one.
Measure which dominates for your traffic (use the recipe from lesson 3) before optimising — the right fix is completely different for a queue-bound vs a prefill-bound service.
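As a minimal sketch of that measurement step, the snippet below sums each term of the decomposition over a batch of request timings and reports which one dominates. The `TtftSample` class, its field names, and `dominant_component` are hypothetical names for illustration; it assumes you can already log the three timings per request.

```python
from dataclasses import dataclass


@dataclass
class TtftSample:
    """Per-request timings in milliseconds (hypothetical field names)."""
    queue_wait_ms: float
    prefill_ms: float
    first_decode_ms: float

    @property
    def ttft_ms(self) -> float:
        # TTFT = queue_wait + prefill_time + first_decode_step
        return self.queue_wait_ms + self.prefill_ms + self.first_decode_ms


def dominant_component(samples: list[TtftSample]) -> str:
    """Return the name of the term that contributes most to total TTFT."""
    if not samples:
        raise ValueError("need at least one sample")
    totals = {
        "queue_wait": sum(s.queue_wait_ms for s in samples),
        "prefill_time": sum(s.prefill_ms for s in samples),
        "first_decode_step": sum(s.first_decode_ms for s in samples),
    }
    return max(totals, key=totals.get)


samples = [TtftSample(400, 120, 20), TtftSample(900, 80, 20)]
print(dominant_component(samples))
```

Summing over all requests describes the average case. To see what drives the tail, run the same check on only the requests above your p95 TTFT.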
The levers, in order
1. Prompt / prefix caching (biggest win for shared prompts)
If requests share a long system prompt or few-shot template, caching its KV state means you prefill only the new tokens. A 2,000-token cached prefix + 50-token user message prefills 50 tokens, not 2,050. This is the single largest TTFT lever for chat and agent workloads — see KV-cache optimisation.
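The arithmetic above can be sketched as a longest-shared-prefix count over token IDs. This is an illustration under simplifying assumptions, not any engine's actual cache lookup: `tokens_to_prefill` is a hypothetical helper, and real engines typically cache in fixed-size blocks, so the actual saving may be rounded down.

```python
def tokens_to_prefill(prompt_tokens: list[int], cached_prefix: list[int]) -> int:
    """Count prompt tokens that still need prefill given a cached prefix.

    Only the leading run of tokens that matches the cached prefix exactly
    can be reused; everything after the first mismatch is recomputed.
    """
    shared = 0
    for prompt_tok, cached_tok in zip(prompt_tokens, cached_prefix):
        if prompt_tok != cached_tok:
            break
        shared += 1
    return len(prompt_tokens) - shared


system_prompt = list(range(2000))   # stand-in for a 2,000-token system prompt
user_message = [9001] * 50          # stand-in for a 50-token user message

hit = tokens_to_prefill(system_prompt + user_message, system_prompt)

# Changing the very first token (say, a timestamp at the top of the prompt)
# defeats the cache for everything after it.
edited_prompt = [-1] + system_prompt[1:] + user_message
miss = tokens_to_prefill(edited_prompt, system_prompt)
```

The second case is the practical design point: keep the shared, stable content at the start of the prompt and put per-request content at the end, otherwise the prefix never matches.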
2. Chunked prefill (fixes prefill-induced queue spikes)
Without chunked prefill, one long prompt’s prefill monopolises an engine iteration, stalling in-flight decodes and spiking everyone else’s TTFT and TPOT (time per output token). Chunking splits a large prefill across iterations so it interleaves with ongoing decodes. Turn it on for any workload with variable, sometimes-long prompts.
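A toy model makes the trade-off visible. The sketch below assumes each engine iteration has a token budget shared between one-token decode steps and prefill work; `schedule_prefill`, `token_budget`, and `active_decodes` are hypothetical names, and real schedulers are more involved.

```python
def schedule_prefill(prompt_len: int, token_budget: int,
                     active_decodes: int, chunked: bool) -> list[int]:
    """Return the tokens processed per iteration until the prompt is prefilled.

    Unchunked: the whole prompt runs in one iteration and decodes wait.
    Chunked: each iteration serves every active decode (one token each)
    and spends the remaining budget on the next slice of the prompt.
    """
    if not chunked:
        return [prompt_len]
    room = token_budget - active_decodes
    if room <= 0:
        raise ValueError("token_budget must exceed active_decodes")
    iterations = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, room)
        iterations.append(chunk + active_decodes)
        remaining -= chunk
    return iterations


unchunked = schedule_prefill(8000, token_budget=2048, active_decodes=48, chunked=False)
chunked = schedule_prefill(8000, token_budget=2048, active_decodes=48, chunked=True)
```

In this model the unchunked schedule is a single 8,000-token iteration during which the 48 decoding requests emit nothing, while the chunked schedule is four bounded iterations that each advance every decode by one token. The long prompt's own TTFT is spread over more iterations, which is the price of protecting everyone else.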
3. Shrink the prompt
The cheapest prefill is the one you don’t do. Trim boilerplate, summarise history instead of replaying it verbatim, retrieve fewer/shorter RAG chunks (see RAG performance). Prefill cost is roughly linear in input tokens at typical prompt lengths (attention adds a superlinear term for very long contexts), so a 30% shorter prompt is roughly a 30% shorter prefill.
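One simple way to bound prompt size is to keep only the most recent turns that fit a token budget. Note this is plain truncation, a cruder stand-in for the summarisation suggested above. `trim_history` is a hypothetical helper, and the default whitespace-based `count_tokens` is an assumption for illustration; in practice you would pass in your model's tokenizer.

```python
from typing import Callable


def trim_history(messages: list[str], max_tokens: int,
                 count_tokens: Callable[[str], int] = lambda text: len(text.split())
                 ) -> list[str]:
    """Keep the newest messages whose combined token count fits max_tokens."""
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i"]
recent = trim_history(history, max_tokens=6)
```

Stopping at the first overflow, rather than skipping it and continuing, keeps the retained history contiguous, so the model never sees a conversation with a turn missing from the middle.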
4. Manage the queue (the under-load reality)
Past the throughput knee, queue_wait dominates and no per-request trick helps: you’re out of capacity. Options: add replicas (horizontal scale), lower max_num_seqs to protect TTFT at the cost of throughput, or shed/deprioritise load. An error-budget-style policy works well: protect a p95 TTFT SLO and scale out when burn rate climbs (see the error budget calculator).
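A burn rate for a p95 TTFT SLO can be sketched as the observed fraction of slow requests divided by the fraction the SLO allows (5% for p95). The function names and the scale-out limit below are assumptions for illustration, not a prescribed policy.

```python
def ttft_burn_rate(ttft_ms: list[float], threshold_ms: float,
                   target: float = 0.95) -> float:
    """Ratio of the observed slow-request fraction to the allowed fraction.

    1.0 means the error budget is being spent exactly as fast as the SLO
    permits; above 1.0 it will run out early.
    """
    if not ttft_ms:
        raise ValueError("need at least one observation")
    slow = sum(1 for t in ttft_ms if t > threshold_ms)
    allowed_fraction = 1.0 - target
    return (slow / len(ttft_ms)) / allowed_fraction


def should_scale_out(burn_rate: float, limit: float = 1.0) -> bool:
    """Hypothetical policy: add capacity once the budget burns too fast."""
    return burn_rate > limit


# 100 requests in the window, 10 of them slower than a 500 ms threshold.
window = [200.0] * 90 + [800.0] * 10
rate = ttft_burn_rate(window, threshold_ms=500.0)
```

Here 10% of requests are slow against an allowance of 5%, so the budget is burning at about twice the sustainable rate and the policy would scale out.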
5. Speculative decoding (helps decode-bound latency)
A small “draft” model proposes several tokens which the big model verifies in one pass. When the draft is right, you get multiple tokens per expensive forward pass — cutting effective TPOT and total latency. It mainly helps the decode term and per-request latency rather than TTFT directly, and adds complexity and draft-model overhead; measure the acceptance rate on your traffic before committing.
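To see why the acceptance rate matters, the sketch below estimates how many tokens one verification pass yields. It assumes each draft token is accepted independently with the same probability, which real traffic only approximates; `expected_tokens_per_pass` is a hypothetical helper name.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per big-model verification pass.

    With independent per-token acceptance probability a and k drafted
    tokens, the expectation is (1 - a**(k + 1)) / (1 - a). The pass always
    yields at least one token, because the big model supplies its own
    token at the first rejected position.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)  # every draft token accepted, plus one
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


good_draft = expected_tokens_per_pass(0.8, draft_len=4)
poor_draft = expected_tokens_per_pass(0.3, draft_len=4)
```

Under these assumptions an 80% acceptance rate with four drafted tokens yields a little over three tokens per pass, while 30% yields well under two. The gain also has to cover the cost of running the draft model, which this estimate ignores.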
6. Right-size the model and hardware
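A smaller or quantised model prefills faster. Routing easy requests to a small model and only hard ones to a large model (a cascade) can cut average TTFT substantially when most traffic is easy. Newer GPUs help both phases: prefill is mostly compute-bound, so it gains from more FLOPs, while decode is mostly memory-bandwidth-bound, so it gains from faster memory.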
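The routing claim is a weighted average, sketched below. It assumes a router that sends each request to exactly one model and adds a fixed overhead to every request; the function name, parameters, and example numbers are illustrative, not measurements.

```python
def routed_average_ttft(easy_fraction: float, small_ttft_ms: float,
                        large_ttft_ms: float, router_overhead_ms: float = 0.0) -> float:
    """Average TTFT when easy requests go to a small model, the rest to a large one."""
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be in [0, 1]")
    mixed = easy_fraction * small_ttft_ms + (1.0 - easy_fraction) * large_ttft_ms
    return router_overhead_ms + mixed


# Illustrative numbers: 70% easy traffic, 100 ms vs 400 ms, 10 ms routing cost.
with_routing = routed_average_ttft(0.7, 100.0, 400.0, router_overhead_ms=10.0)
without_routing = routed_average_ttft(0.0, 100.0, 400.0)
```

The average improves, but hard requests still see the large model's TTFT plus the routing overhead, so check the tail as well as the mean.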
A practical order of operations
- Measure — find whether you’re queue-, prefill-, or decode-bound.
- If shared prompts exist → prompt/prefix caching.
- If prompts are variable-length → chunked prefill.
- Trim the prompt itself.
- If queue-bound under load → scale out or shed load.
- Then consider speculative decoding and model routing.
Takeaway
TTFT = queue + prefill + first decode step, and the winning fix depends on which dominates. For most real workloads the order is: cache shared prefixes, enable chunked prefill, shrink the prompt, then manage capacity. Speculative decoding and smaller models are powerful but come later. Always verify against a p95/p99 TTFT SLO, not an average.