Intermediate → Advanced
AI Performance Engineering Tutorial
A hands-on course on the performance of LLM and AI systems: inference internals, TTFT/TPOT/throughput, KV-cache math, continuous batching, GPU sizing, RAG latency and cost — with code and worked examples.
- 01 What Is AI Performance Engineering? (3 min)
  AI performance engineering applies load-testing and SRE rigour to LLM and ML systems. Meet the metrics that matter (TTFT, TPOT, throughput and cost-per-token) and learn why they behave unlike the metrics of a normal API.
- 02 The LLM Inference Stack: Prefill, Decode and the KV Cache (3 min)
  How LLM inference actually runs: the prefill and decode phases, what the KV cache is, and the exact memory formula that decides how many requests you can serve at once.
- 03 Measuring LLM Inference: TTFT, TPOT and Throughput with Code (3 min)
  A practical benchmarking recipe for LLM endpoints: measure time to first token (TTFT), inter-token latency (TPOT) and throughput from a streaming response, with Python you can run today.
- 04 Continuous Batching: How vLLM and TGI Hit High Throughput (3 min)
  Static batching leaves the GPU idle while it waits for the slowest request in each batch. Continuous (in-flight) batching is why modern LLM servers achieve many times the throughput. Here's how it works and how to tune it.
- 05 KV-Cache Optimisation: PagedAttention, Prefix Caching and Quantisation (3 min)
  The KV cache is the scarcest resource in LLM serving. Three techniques (PagedAttention, prefix caching and KV quantisation) multiply how many requests fit in the same GPU memory.
- 06 Reducing Time to First Token (TTFT) (3 min)
  TTFT is the latency users feel first. A practical, ordered checklist for cutting it (prompt caching, chunked prefill, speculative decoding, routing and queue management), with the trade-offs of each.
- 07 Throughput and GPU Sizing: Batch Size, Parallelism and a Memory Budget (3 min)
  How to size GPUs for LLM serving: a concrete VRAM budget (weights + KV cache + overhead), when to use tensor vs pipeline parallelism, and how batch size sets your throughput ceiling.
- 08 RAG and Vector Search Performance (3 min)
  A latency budget for retrieval-augmented generation: embedding time, ANN index trade-offs (HNSW parameters), reranking cost, and how retrieved context size feeds back into LLM prefill.
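
As a taste of the KV-cache math and VRAM budgeting covered in lessons 02 and 07, here is a minimal sketch of the commonly used per-token KV-cache estimate (2 × layers × KV heads × head dimension × bytes per element) combined with a simple budget of weights + KV cache + overhead. The function names, the model shape (32 layers, 8 KV heads, head dimension 128, FP16) and the 80 GiB / 16 GiB / 4 GiB figures are illustrative assumptions, not values taken from the course.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    bytes_per_elem=2 assumes FP16/BF16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(vram_bytes, weight_bytes, overhead_bytes,
                            tokens_per_request, per_token_bytes):
    """Requests that fit once weights and fixed overhead are subtracted.

    Assumes every request reserves its full context length, which is a
    pessimistic simplification (paged allocators do better in practice).
    """
    free = vram_bytes - weight_bytes - overhead_bytes
    if free <= 0:
        return 0
    return free // (tokens_per_request * per_token_bytes)


# Hypothetical model shape and GPU, chosen only for illustration.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
fits = max_concurrent_requests(
    vram_bytes=80 * GIB,
    weight_bytes=16 * GIB,
    overhead_bytes=4 * GIB,
    tokens_per_request=4096,
    per_token_bytes=per_token,
)
print(f"{per_token / 1024:.0f} KiB per token, {fits} concurrent 4096-token requests")
```

Under these assumptions each token costs 128 KiB of cache, so a 4096-token request reserves 0.5 GiB and roughly 120 such requests fit in the 60 GiB left after weights and overhead.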