Reducing Time to First Token (TTFT)

TTFT is the latency users feel first. A practical, ordered checklist for cutting it — prompt caching, chunked prefill, speculative decoding, routing and queue management — with the trade-offs.

Time to first token is the latency a user actually perceives: the gap between hitting enter and seeing the response start. This lesson is an ordered checklist for cutting it, cheapest and highest-leverage first.

First, decompose your TTFT

TTFT under load is three things stacked:

TTFT = queue_wait + prefill_time + first_decode_step

queue_wait — time before the request is even admitted to the batch (dominates under load).
prefill_time — processing the prompt; scales with input length.
first_decode_step — one decode iteration to emit token one.

Measure which dominates for your traffic (use the recipe from lesson 3) before optimising — the right fix is completely different for a queue-bound vs a prefill-bound service.
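As a minimal sketch of that measurement, assuming your serving stack can log four timestamps per request (the field names below are hypothetical, not taken from any particular engine), the decomposition and the dominant term can be computed like this:

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Per-request timestamps in seconds (hypothetical field names)."""
    arrived: float       # request received by the server
    admitted: float      # request scheduled into a batch
    prefill_done: float  # prompt fully processed
    first_token: float   # first output token emitted


def decompose_ttft(t: RequestTiming) -> dict:
    """Split TTFT into the three stacked terms."""
    return {
        "queue_wait": t.admitted - t.arrived,
        "prefill_time": t.prefill_done - t.admitted,
        "first_decode_step": t.first_token - t.prefill_done,
    }


def dominant_term(t: RequestTiming) -> str:
    """Name of the term contributing the most to this request's TTFT."""
    parts = decompose_ttft(t)
    return max(parts, key=parts.get)


# Illustrative numbers only: a request that waited 0.40 s in the queue.
timing = RequestTiming(arrived=0.0, admitted=0.40, prefill_done=0.55, first_token=0.58)
parts = decompose_ttft(timing)
bottleneck = dominant_term(timing)
```

Aggregating `dominant_term` over a sample of production requests gives a rough picture of whether the service is queue-bound or prefill-bound.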

The levers, in order

1. Prompt / prefix caching (biggest win for shared prompts)

If requests share a long system prompt or few-shot template, caching its KV state means you prefill only the new tokens. A 2,000-token cached prefix + 50-token user message prefills 50 tokens, not 2,050. This is the single largest TTFT lever for chat and agent workloads — see KV-cache optimisation.
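The arithmetic can be illustrated with a toy cache that only tracks which token prefixes have been seen. This is a sketch of the bookkeeping, not of how any real engine stores KV state; the class and function names are made up for illustration, and production engines typically match prefixes at block granularity rather than token by token.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers token sequences whose KV state is notionally cached."""

    def __init__(self) -> None:
        self._cached = []

    def insert(self, tokens: list) -> None:
        self._cached.append(list(tokens))

    def tokens_to_prefill(self, prompt: list) -> int:
        """Tokens that still need prefill after reusing the best cached prefix."""
        best = max((shared_prefix_len(c, prompt) for c in self._cached), default=0)
        return len(prompt) - best


# Illustrative token ids: a 2,000-token system prompt and a 50-token user message.
system_prompt = list(range(2000))
user_message = [9000 + i for i in range(50)]
prompt = system_prompt + user_message

cache = ToyPrefixCache()
cold = cache.tokens_to_prefill(prompt)  # nothing cached yet
cache.insert(system_prompt)
warm = cache.tokens_to_prefill(prompt)  # system prompt reused
```

Here `cold` is the full 2,050 tokens and `warm` is the 50 new ones, matching the example above.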

2. Chunked prefill (fixes prefill-induced queue spikes)

Without chunked prefill, one long prompt’s prefill monopolises an entire engine iteration, stalling in-flight decodes (a TPOT spike) and delaying queued requests (a TTFT spike). Chunking splits a large prefill across iterations so it interleaves with ongoing decodes. Turn it on for any workload with variable, sometimes-long prompts.
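To make the effect concrete, here is a toy model of the scheduler, assuming each iteration's cost is proportional to the tokens it processes and that every in-flight decode contributes one token per iteration. The function name and the budget value are illustrative assumptions, not a real engine's API.

```python
from typing import Optional


def iteration_sizes(prompt_tokens: int, active_decodes: int,
                    token_budget: Optional[int]) -> list:
    """Tokens processed per iteration while one new prompt is prefilled.

    token_budget=None models unchunked prefill: the whole prompt runs in a
    single iteration alongside the in-flight decodes.
    """
    if token_budget is None:
        return [prompt_tokens + active_decodes]
    if token_budget <= active_decodes:
        raise ValueError("token_budget must leave room for prefill tokens")
    sizes = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are served first; the leftover budget goes to a prefill chunk.
        chunk = min(remaining, token_budget - active_decodes)
        sizes.append(chunk + active_decodes)
        remaining -= chunk
    return sizes


# Illustrative workload: a 4,000-token prompt arrives while 8 requests decode.
unchunked = iteration_sizes(4000, active_decodes=8, token_budget=None)
chunked = iteration_sizes(4000, active_decodes=8, token_budget=512)
```

In this model the unchunked case produces one 4,008-token iteration that every decode has to wait through, while the chunked case produces eight iterations of at most 512 tokens. The trade-off is that the long prompt's own prefill is now spread over several iterations, so its own TTFT can grow slightly.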

3. Shrink the prompt

The cheapest prefill is the one you don’t do. Trim boilerplate, summarise history instead of replaying it verbatim, retrieve fewer/shorter RAG chunks (see RAG performance). Prefill cost is roughly linear in input tokens, so a 30% shorter prompt is roughly a 30% shorter prefill.
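One simple way to bound prompt size is to keep the system prompt plus only the most recent turns that fit a token budget. The sketch below drops old turns rather than summarising them, and it counts tokens by splitting on whitespace, which is a crude stand-in for a real tokenizer; both are simplifying assumptions.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(system: str, turns: list, budget: int) -> list:
    """Keep the system prompt plus the newest turns that fit in `budget` tokens."""
    used = count_tokens(system)
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [system] + kept[::-1]  # restore chronological order


# Illustrative conversation: 5 + 40 + 20 + 3 = 68 "tokens" before trimming.
system = "You are a helpful assistant."
turns = ["old " * 40, "mid " * 20, "new question here"]
trimmed = trim_history(system, turns, budget=30)
trimmed_tokens = sum(count_tokens(t) for t in trimmed)
```

With a budget of 30, the oldest turn is dropped and the prompt shrinks from 68 to 28 counted tokens. Whether answer quality survives the trimming has to be checked separately.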

4. Manage the queue (the under-load reality)

Past the throughput knee, queue_wait dominates and no per-request trick helps — you’re out of capacity. Options: add replicas (horizontal scale), lower max_num_seqs to protect TTFT at the cost of throughput, or shed/deprioritise load. An error-budget-style policy works well: protect a p95 TTFT SLO and scale out when burn rate climbs (see the error budget calculator).
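A burn-rate check for such a policy might look like the following sketch. It assumes the SLO is phrased as "95% of requests have TTFT at or below a threshold", and the scale-out limit of 2.0 is an arbitrary illustrative value, not a recommendation.

```python
def ttft_burn_rate(ttfts_s: list, threshold_s: float, target: float = 0.95) -> float:
    """Ratio of the observed bad-request fraction to the allowed fraction.

    A value of 1.0 means the error budget is being spent exactly as fast as
    the SLO allows; above 1.0 it is being spent faster.
    """
    if not ttfts_s:
        return 0.0
    bad = sum(1 for t in ttfts_s if t > threshold_s)
    bad_fraction = bad / len(ttfts_s)
    budget_fraction = 1.0 - target
    return bad_fraction / budget_fraction


def should_scale_out(burn_rate: float, limit: float = 2.0) -> bool:
    """Hypothetical policy: add replicas once the burn rate exceeds `limit`."""
    return burn_rate > limit


# Illustrative window: 88 fast requests and 12 that breach a 1.0 s threshold.
window = [0.3] * 88 + [1.5] * 12
burn = ttft_burn_rate(window, threshold_s=1.0)
scale = should_scale_out(burn)
```

In this window 12% of requests breach the threshold against a 5% allowance, so the burn rate is about 2.4 and the policy asks for more capacity.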

5. Speculative decoding (helps decode-bound latency)

A small “draft” model proposes several tokens which the big model verifies in one pass. When the draft is right, you get multiple tokens per expensive forward pass — cutting effective TPOT and total latency. It mainly helps the decode term and per-request latency rather than TTFT directly, and adds complexity and draft-model overhead; measure the acceptance rate on your traffic before committing.
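To see why the acceptance rate matters so much, here is a back-of-the-envelope estimate. It assumes each drafted token is accepted independently with the same probability, which real traffic only approximates, and it ignores the cost of running the draft model.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per big-model verification pass.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `acceptance`, and that the big model always contributes
    one token of its own (the correction or the bonus token).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


# Illustrative comparison with 4 drafted tokens per pass.
high = expected_tokens_per_pass(0.8, draft_len=4)
low = expected_tokens_per_pass(0.5, draft_len=4)
```

Under these assumptions an acceptance rate of 0.8 yields about 3.4 tokens per pass, while 0.5 yields under 2, which may not cover the draft model's overhead.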

6. Right-size the model and hardware

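A smaller or quantised model prefills faster. Routing easy requests to a small model and only hard ones to a large model (a cascade) cuts average TTFT substantially. Newer GPUs help both phases: more compute shortens prefill, which is typically compute-bound, and more memory bandwidth shortens decode, which is typically bandwidth-bound.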
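A routing rule can start out very simple. The sketch below uses prompt length and a few keywords as a placeholder difficulty signal; the model names, the word limit, and the marker list are all hypothetical, and a production router would more likely use a trained classifier or a confidence signal from the small model.

```python
def route(prompt: str, max_easy_words: int = 200,
          hard_markers: tuple = ("prove", "refactor", "step by step")) -> str:
    """Pick a model tier using a placeholder difficulty heuristic."""
    text = prompt.lower()
    too_long = len(text.split()) > max_easy_words
    has_marker = any(marker in text for marker in hard_markers)
    return "large-model" if (too_long or has_marker) else "small-model"


# Illustrative requests.
easy = route("What is the capital of France?")
hard = route("Prove that the square root of 2 is irrational.")
```

Here `easy` goes to the small tier and `hard` to the large one. The misrouting rate is worth tracking, since sending a hard request to the small model trades latency for quality.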

A practical order of operations

1. Measure: find whether you’re queue-, prefill-, or decode-bound.
2. If shared prompts exist → prompt/prefix caching.
3. If prompts are variable-length → chunked prefill.
4. Trim the prompt itself.
5. If queue-bound under load → scale out or shed load.
6. Then consider speculative decoding and model routing.

Takeaway

TTFT = queue + prefill + first decode step, and the winning fix depends on which dominates. For most real workloads the order is: cache shared prefixes, enable chunked prefill, shrink the prompt, then manage capacity. Speculative decoding and smaller models are powerful but come later. Always verify against a p95/p99 TTFT SLO, not an average.
