Part 1 of 8 · 3 min read
What Is AI Performance Engineering?
AI performance engineering applies load-testing and SRE rigour to LLM and ML systems. Meet the metrics that matter (TTFT, TPOT, throughput and cost-per-token) and see why they behave unlike those of a normal API.
Traditional performance engineering asks “how many requests per second, at what latency?” An LLM inference service answers that question very differently from a CRUD API — its latency depends on how many tokens you ask it to generate, its throughput depends on how requests are batched together, and every single token costs money. AI performance engineering is the discipline of measuring, modelling, and optimising those systems.
Why an LLM isn’t a normal API
A normal API request has one cost and one latency. An LLM request has two distinct phases, and they scale completely differently:
- Prefill: the model processes your entire prompt in parallel to build its internal state (the KV cache). Cost here scales with input length. This phase, plus any time spent queueing, determines how long it takes for the first token to appear.
- Decode: the model generates output tokens one at a time, each pass depending on the tokens before it. Cost scales with output length, and this phase is typically memory-bandwidth-bound rather than compute-bound, because every step re-reads the model weights and the KV cache to produce just one token per request.
That split is the root of almost everything that follows in this course. A 50-token prompt with a 2,000-token answer and a 2,000-token prompt with a 50-token answer are opposite workloads, even though both move ~2,050 tokens.
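To make the contrast concrete, here is a minimal sketch of a linear two-phase model. The two rate constants are hypothetical round numbers chosen for easy arithmetic, not measurements of any real model or GPU, and `Workload` and `phase_times` are illustrative names.

```python
from dataclasses import dataclass

# Hypothetical rates, chosen only to make the arithmetic easy to follow.
PREFILL_TOKENS_PER_SEC = 5_000.0  # prompt tokens processed per second (assumed)
DECODE_SEC_PER_TOKEN = 0.02       # seconds per generated token (assumed)


@dataclass
class Workload:
    name: str
    input_tokens: int
    output_tokens: int


def phase_times(w: Workload) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) under the linear toy model."""
    prefill = w.input_tokens / PREFILL_TOKENS_PER_SEC
    decode = w.output_tokens * DECODE_SEC_PER_TOKEN
    return prefill, decode


short_prompt_long_answer = Workload("short prompt, long answer", 50, 2_000)
long_prompt_short_answer = Workload("long prompt, short answer", 2_000, 50)

for w in (short_prompt_long_answer, long_prompt_short_answer):
    prefill, decode = phase_times(w)
    print(f"{w.name}: prefill {prefill:.2f}s, decode {decode:.2f}s")
```

Under these assumed rates, the first workload spends almost all of its time in decode, while the second spends a much larger share of a far shorter request in prefill, even though the token totals match.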
The four metrics that matter
| Metric | What it measures | Driven by |
|---|---|---|
| TTFT (Time To First Token) | Perceived latency — time until the user sees anything | Prefill + queueing |
| TPOT (Time Per Output Token) | Streaming speed once generation starts; 1/TPOT = tokens/sec for a single request | Decode, batch size |
| Throughput | Total output tokens/sec across all concurrent requests | Batching, hardware |
| Cost per request | input_tokens × price_in + output_tokens × price_out (prices per token) | Token counts, model |
End-to-end latency falls straight out of the first two:
latency ≈ TTFT + (output_tokens × TPOT)
For a chatbot, TTFT dominates the feel. For a long document summary, the decode term dominates the total.
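As a rough sketch of how these metrics can be derived in practice, the snippet below computes TTFT, TPOT and end-to-end latency from the arrival times of streamed tokens, and the cost of a request from its token counts. The function names, the synthetic timestamps and the per-token prices are all assumptions made for illustration; real prices and timings depend on your provider and hardware.

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, TPOT and end-to-end latency from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft = token_times[0] - request_start
    # TPOT is averaged over the gaps between tokens, so it needs at least two tokens.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    latency = token_times[-1] - request_start
    return {"ttft": ttft, "tpot": tpot, "latency": latency, "output_tokens": float(n)}


def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost per request, with both prices expressed per token."""
    return input_tokens * price_in + output_tokens * price_out


# Synthetic stream: first token 0.5 s after the request, then one token every 0.05 s.
start = 100.0
times = [start + 0.5 + 0.05 * i for i in range(101)]

m = stream_metrics(start, times)
approx = m["ttft"] + m["output_tokens"] * m["tpot"]
print(f"TTFT {m['ttft']:.2f}s, TPOT {m['tpot']:.3f}s, latency {m['latency']:.2f}s")
print(f"TTFT + output_tokens x TPOT = {approx:.2f}s")

# Hypothetical per-token prices, for illustration only.
cost = request_cost(2_000, 50, price_in=1e-6, price_out=2e-6)
print(f"cost per request: {cost:.4f}")
```

With TPOT defined this way, the approximation overshoots the measured latency by exactly one TPOT, because the first token has already arrived at TTFT. For anything longer than a handful of output tokens, the difference is negligible.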
The central tension
Here’s the trade-off you’ll spend this entire course navigating:
Larger batches raise throughput (cheaper per token, better GPU utilisation) but raise each request’s TPOT (slower for the individual user). Tuning an inference service is choosing a point on that curve.
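One way to picture that curve is a toy model in which each decode step pays a fixed cost plus a small cost per sequence in the batch, and emits one token for every sequence. Both constants below are hypothetical, and real servers behave less linearly, so treat this as a sketch of the shape of the trade-off rather than a prediction.

```python
# Toy model, not a measurement: each decode step has a fixed cost plus a small
# per-sequence cost. Both constants are hypothetical.
STEP_BASE_SEC = 0.020     # assumed fixed time per decode step
STEP_PER_SEQ_SEC = 0.001  # assumed extra time per sequence in the batch


def tpot(batch_size: int) -> float:
    """Seconds per output token seen by each request in the batch."""
    return STEP_BASE_SEC + STEP_PER_SEQ_SEC * batch_size


def throughput(batch_size: int) -> float:
    """Total output tokens per second across the whole batch."""
    # One decode step yields one token for every sequence in the batch.
    return batch_size / tpot(batch_size)


for b in (1, 8, 32, 128):
    print(f"batch {b:>3}: TPOT {tpot(b) * 1000:6.1f} ms, "
          f"throughput {throughput(b):7.1f} tok/s")
```

In this model both columns rise with batch size: total tokens per second keep climbing, with diminishing returns, while every individual user waits longer for each token.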
This is why you can’t optimise an LLM service by looking at a single request in isolation, and why averages are even more misleading here than usual — under load, requests queue behind the current batch and the latency distribution grows a long, fat tail. Always reason in p95/p99, never the mean (see why p99 matters).
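A small synthetic example shows how far the mean can drift from what users in the tail experience. The latency samples are made up for illustration, and `percentile` is a simple nearest-rank helper written for this sketch rather than a library function.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies (seconds): 90 fast requests and 10 that queued behind a batch.
latencies = [1.0] * 90 + [10.0] * 10

print(f"mean {mean(latencies):.2f}s")
for p in (50, 95, 99):
    print(f"p{p} {percentile(latencies, p):.2f}s")
```

Here the mean sits below 2 seconds and the median at 1 second, yet one request in ten takes 10 seconds, which only the p95 and p99 figures reveal.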
What this course covers
- This introduction — the metrics and the mental model.
- The inference stack — prefill, decode and the KV cache, with the memory math.
- Measuring inference — benchmarking TTFT/TPOT/throughput with real code.
- Continuous batching — how vLLM and TGI hit high throughput.
- KV-cache optimisation — PagedAttention, prefix caching and quantisation.
- Reducing TTFT — the concrete levers.
- Throughput & GPU sizing — batch size, parallelism, and a GPU-memory budget.
- RAG & vector search performance — the retrieval latency budget.
You’ll get the most from it alongside the LLM Cost & Latency Estimator and the Latency Percentile Analyser.
Takeaway
AI performance engineering is classic capacity planning with two twists: workloads are defined by token counts in two separately-scaling phases, and the throughput/latency trade-off is governed by batch size. Keep TTFT, TPOT, throughput and cost-per-token in view at all times, reason in percentiles, and you already think about these systems more clearly than most teams shipping them.