Intermediate → Advanced

AI Performance Engineering Tutorial

A hands-on course on the performance of LLM and AI systems: inference internals, TTFT/TPOT/throughput, KV-cache math, continuous batching, GPU sizing, RAG latency and cost — with code and worked examples.

8 lessons · ~24 min total
  1. What Is AI Performance Engineering? (3m)
     AI performance engineering applies load-testing and SRE rigour to LLM and ML systems. Meet the metrics that matter (TTFT, TPOT, throughput and cost-per-token) and see why they behave unlike those of a normal API.
  2. The LLM Inference Stack: Prefill, Decode and the KV Cache (3m)
     How LLM inference actually runs: the prefill and decode phases, what the KV cache is, and the exact memory formula that decides how many requests you can serve at once.
  3. Measuring LLM Inference: TTFT, TPOT and Throughput with Code (3m)
     A practical benchmarking recipe for LLM endpoints: measure time to first token, inter-token latency and throughput from a streaming response, with Python you can run today.
  4. Continuous Batching: How vLLM and TGI Hit High Throughput (3m)
     Static batching leaves the GPU idle while it waits for the slowest request in the batch. Continuous (in-flight) batching is why modern LLM servers achieve many times the throughput; here's how it works and how to tune it.
  5. KV-Cache Optimisation: PagedAttention, Prefix Caching and Quantisation (3m)
     The KV cache is the scarcest resource in LLM serving. Three techniques (PagedAttention, prefix caching and KV quantisation) multiply how many requests fit in the same GPU memory.
  6. Reducing Time to First Token (TTFT) (3m)
     TTFT is the latency users feel first. A practical, ordered checklist for cutting it (prompt caching, chunked prefill, speculative decoding, routing and queue management), with the trade-offs.
  7. Throughput and GPU Sizing: Batch Size, Parallelism and a Memory Budget (3m)
     How to size GPUs for LLM serving: a concrete VRAM budget (weights + KV cache + overhead), when to use tensor vs pipeline parallelism, and how batch size sets your throughput ceiling.
  8. RAG and Vector Search Performance (3m)
     A latency budget for retrieval-augmented generation: embedding time, ANN index trade-offs (HNSW parameters), reranking cost, and how retrieved context size feeds back into LLM prefill.
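
As a taste of the KV-cache math the lessons cover, here is a minimal sketch of the standard per-token sizing rule: each layer stores one key and one value vector per KV head, so the cache grows linearly with sequence length. The model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 20 GiB budget and the function names are illustrative assumptions, not figures taken from the course.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer. bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(kv_budget_bytes: int, per_token_bytes: int,
                            tokens_per_request: int) -> int:
    """How many requests of a given total length fit in the KV budget."""
    total_tokens = kv_budget_bytes // per_token_bytes
    return total_tokens // tokens_per_request


# Hypothetical model shape and budget, chosen only for illustration.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
kv_budget = 20 * 1024**3  # assume 20 GiB of VRAM left after weights and overhead
concurrent = max_concurrent_requests(kv_budget, per_token, tokens_per_request=4096)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Requests at 4096 tokens each: {concurrent}")
```

Under these assumptions each token costs 128 KiB of cache, so a 20 GiB budget holds 40 requests of 4096 tokens. Halving `bytes_per_elem` (for example with an 8-bit KV cache) doubles that count, which is why KV quantisation appears alongside PagedAttention and prefix caching in lesson 5.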