Part 3 of 8 · 3 min read

Measuring LLM Inference: TTFT, TPOT and Throughput with Code

A practical benchmarking recipe for LLM endpoints: measure time to first token, inter-token latency and throughput from a streaming response — with Python you can run today.

You can’t optimise what you don’t measure, and you can’t measure an LLM by timing the whole response — that hides the two numbers that matter most. This lesson is a concrete recipe for measuring TTFT, TPOT and throughput from a streaming endpoint.

Measure the stream, not the total

The mistake almost everyone makes first: calling the API non-streaming and recording total time. That collapses prefill and decode into one number and tells you nothing actionable. Instead, stream the response and timestamp the first token and the gaps between tokens.
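
To see why the total hides the split: end-to-end latency is roughly TTFT plus one inter-token gap for every token after the first. The sketch below uses two made-up request profiles (illustrative numbers, not measurements) that have the same total time but opposite bottlenecks. The helper name `total_latency_ms` is hypothetical.

```python
def total_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Approximate end-to-end latency: first token, then one gap per later token."""
    return ttft_ms + (output_tokens - 1) * tpot_ms

# Two hypothetical requests, each producing 101 output tokens.
prefill_heavy = total_latency_ms(ttft_ms=2000, tpot_ms=10, output_tokens=101)
decode_heavy = total_latency_ms(ttft_ms=200, tpot_ms=28, output_tokens=101)

# Identical totals, but one is slow to start and the other is slow to stream.
print(prefill_heavy, decode_heavy)
```

A non-streaming timer would report both requests as identical, even though one needs a prefill fix and the other a decode fix.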

Here’s a minimal, dependency-light measurement against any OpenAI-compatible endpoint (vLLM, TGI, llama.cpp server, or a hosted API):

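import json
import time

import requests

def measure(url, model, prompt, max_tokens=256, api_key=""):
    """Stream one chat completion and return TTFT, TPOT and decode speed."""
    # Only send the header when a key is set; local servers often need none.
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    first_token_at = None
    token_times = []

    t0 = time.perf_counter()
    with requests.post(
        url,
        headers=headers,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=120,
    ) as r:
        r.raise_for_status()  # fail loudly on 4xx/5xx instead of timing an error body
        for line in r.iter_lines():
            # Server-sent events: only "data: ..." lines carry payloads.
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[6:]
            if payload == b"[DONE]":
                break
            choices = json.loads(payload).get("choices") or []
            if not choices:
                continue  # some servers send chunks with no choices
            delta = choices[0].get("delta", {}).get("content")
            if delta:
                now = time.perf_counter()
                if first_token_at is None:
                    first_token_at = now
                # Each content chunk is counted as one token. A server that
                # packs several tokens into one chunk will be under-counted.
                token_times.append(now)

    if first_token_at is None:
        raise RuntimeError("stream ended without any content tokens")

    ttft = (first_token_at - t0) * 1000                      # ms
    n = len(token_times)
    decode_s = token_times[-1] - first_token_at if n > 1 else 0
    tpot = (decode_s / (n - 1)) * 1000 if n > 1 else 0       # ms/token
    tps = (n - 1) / decode_s if decode_s > 0 else 0          # tokens/sec
    return {"ttft_ms": round(ttft, 1), "tpot_ms": round(tpot, 2),
            "tokens_per_sec": round(tps, 1), "output_tokens": n}

print(measure("http://localhost:8000/v1/chat/completions",
              "meta-llama/Llama-3-8B-Instruct",
              "Explain the CAP theorem in three sentences."))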

One sample isn’t a measurement. Repeat the call many times and feed the results into the percentile thinking from “why p99 matters”.
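
As a rough sketch of that step, the helper below summarises repeated samples with nearest-rank percentiles. The names `percentile` and `summarize` are hypothetical, and the sample values are synthetic stand-ins for a list of result dicts with a `ttft_ms` field.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples, key):
    """Report p50, p95 and p99 for one metric across many result dicts."""
    vals = [s[key] for s in samples]
    return {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}

# Synthetic TTFT samples (1 ms to 100 ms), standing in for real runs.
samples = [{"ttft_ms": float(v)} for v in range(1, 101)]
print(summarize(samples, "ttft_ms"))
```

Nearest-rank is one of several percentile definitions. It always returns a value that was actually observed, which keeps the tail numbers easy to trace back to a specific request.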

Five rules for trustworthy numbers

  1. Pin output length per scenario. Variable max_tokens makes runs un-repeatable. Test fixed buckets — 128 / 512 / 2,048 output tokens — separately, because they stress decode differently.
  2. Use realistic prompt lengths. Prefill cost scales with input size; a 100-token and a 4,000-token prompt are different workloads. Replay production-shaped traffic, not “hello”.
  3. Warm up, then discard. The first requests pay model-load and cache cold-start costs. Throw away the warm-up window before recording.
  4. Separate prefill-bound from decode-bound scenarios. Long-prompt/short-output stresses prefill; short-prompt/long-output stresses decode. They scale and fail differently.
  5. Report p50, p95 and p99 — never the mean. Under load the tail is where users suffer.
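
Rules 1 to 4 can be encoded as a small scenario grid plus a warm-up filter. This is a sketch under assumptions: the bucket values, the warm-up size of 5 requests, and the names `scenarios` and `drop_warmup` are all illustrative. Note that `max_tokens` is only an upper bound, so it is worth checking the reported output token count to confirm each run actually reached its bucket.

```python
from itertools import product

# Hypothetical buckets; choose values that match your own traffic.
PROMPT_TOKENS = (100, 4000)       # short vs long prompt (prefill load)
OUTPUT_TOKENS = (128, 512, 2048)  # pinned output lengths (decode load)
WARMUP_REQUESTS = 5               # assumed size of the warm-up window

def scenarios():
    """One scenario per (prompt length, pinned output length) pair."""
    return [{"prompt_tokens": p, "max_tokens": o}
            for p, o in product(PROMPT_TOKENS, OUTPUT_TOKENS)]

def drop_warmup(samples, warmup=WARMUP_REQUESTS):
    """Discard the first `warmup` samples, which carry cold-start cost."""
    return samples[warmup:]

for s in scenarios():
    print(s)
```

Running each scenario separately keeps prefill-bound and decode-bound results from being averaged into one misleading number.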

Finding the throughput knee

Single-request numbers don’t tell you capacity. To find it, ramp concurrency and watch two curves: server throughput (total output tokens/sec) and p95 TTFT.

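import concurrent.futures as cf
import math
import time

def load_test(concurrency, requests_each=10, **kw):
    # kw is forwarded to measure(): url, model, prompt, max_tokens, api_key.
    def worker(_): return [measure(**kw) for _ in range(requests_each)]
    start = time.perf_counter()
    with cf.ThreadPoolExecutor(max_workers=concurrency) as ex:
        results = [m for batch in ex.map(worker, range(concurrency)) for m in batch]
    wall_s = time.perf_counter() - start
    total_tokens = sum(r["output_tokens"] for r in results)
    ttfts = sorted(r["ttft_ms"] for r in results)
    p95_ttft = ttfts[math.ceil(95 * len(ttfts) / 100) - 1]  # nearest-rank p95
    return {"concurrency": concurrency, "throughput_tps": round(total_tokens / wall_s, 1),
            "p95_ttft_ms": p95_ttft, "n": len(results)}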

Plot throughput and p95 TTFT against concurrency. Throughput climbs, then flattens; latency stays flat, then explodes. The knee — where throughput stops rising but latency starts to — is your usable capacity. Operate just below it.
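
If you would rather pick the knee programmatically than by eye, one simple heuristic is to take the last concurrency level where throughput still grew by a meaningful fraction. The sketch below assumes one dict per concurrency level with `concurrency`, `throughput_tps` and `p95_ttft_ms` fields. Those field names, the 10% threshold, and the data points are all illustrative assumptions, not measurements.

```python
def find_knee(points, min_gain=0.10):
    """Return the last level whose throughput grew by at least `min_gain`
    (as a fraction) over the previous level. Assumes positive throughput."""
    points = sorted(points, key=lambda p: p["concurrency"])
    knee = points[0]
    for prev, cur in zip(points, points[1:]):
        gain = (cur["throughput_tps"] - prev["throughput_tps"]) / prev["throughput_tps"]
        if gain < min_gain:
            break  # throughput has flattened; later levels only add latency
        knee = cur
    return knee

# Made-up ramp: throughput flattens after 8 while p95 TTFT takes off.
points = [
    {"concurrency": 1, "throughput_tps": 50, "p95_ttft_ms": 120},
    {"concurrency": 2, "throughput_tps": 98, "p95_ttft_ms": 125},
    {"concurrency": 4, "throughput_tps": 190, "p95_ttft_ms": 140},
    {"concurrency": 8, "throughput_tps": 340, "p95_ttft_ms": 190},
    {"concurrency": 16, "throughput_tps": 360, "p95_ttft_ms": 900},
    {"concurrency": 32, "throughput_tps": 365, "p95_ttft_ms": 4000},
]
print(find_knee(points))
```

A threshold rule like this is a starting point, not a substitute for the plot: always check that p95 TTFT at the chosen level is still within your latency budget.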

Takeaway

Measure the stream: timestamp the first token (TTFT) and the inter-token gaps (TPOT → tokens/sec), with output length pinned and prompts realistic. Then ramp concurrency to find the throughput/latency knee. Those three numbers and that one curve are the foundation everything else in this course builds on. Estimate the cost side with the LLM Cost & Latency Estimator.
