Part 8 of 8 · 3 min read
RAG and Vector Search Performance
A latency budget for retrieval-augmented generation: embedding time, ANN index trade-offs (HNSW parameters), reranking cost, and how retrieved context size feeds back into LLM prefill.
A RAG pipeline has more moving parts than a bare LLM call, and each adds latency before the model even starts. This final lesson builds a latency budget for RAG and shows where the time actually goes — including the feedback loop into LLM prefill you just spent this course learning about.
The RAG latency budget
A typical retrieval-augmented request:
total = embed_query + vector_search + (rerank?) + llm_prefill + llm_decode
embed_query : 5–50 ms (one embedding forward pass)
vector_search : 1–50 ms (ANN index lookup; depends heavily on params)
rerank : 10–200 ms (optional cross-encoder over top-k)
llm_prefill : scales with retrieved context length ← the big one
llm_decode : output_tokens × TPOT
The trap most teams fall into is obsessing over the millisecond-scale vector search while the retrieved context balloons LLM prefill into the largest term. Retrieving 20 chunks of 500 tokens each adds 10,000 tokens of prefill, which, as the lesson on reducing TTFT showed, is usually the dominant cost. Better retrieval quality lets you retrieve less, and that is the real latency win.
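To make the budget concrete, here is a minimal sketch that plugs numbers into the formula above. The stage latencies, the prefill throughput of 5,000 tokens/s, and the 20 ms TPOT are illustrative assumptions, not measurements; `estimate_budget` and `RagBudget` are hypothetical names. Prefill is modelled as linear in context length, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class RagBudget:
    embed_ms: float
    search_ms: float
    rerank_ms: float
    prefill_ms: float
    decode_ms: float

    @property
    def ttft_ms(self) -> float:
        # Everything that must finish before the first output token.
        return self.embed_ms + self.search_ms + self.rerank_ms + self.prefill_ms

    @property
    def total_ms(self) -> float:
        return self.ttft_ms + self.decode_ms


def estimate_budget(chunks, tokens_per_chunk, output_tokens,
                    prefill_tokens_per_s=5000.0,  # assumed; measure your own
                    tpot_ms=20.0,                 # assumed time per output token
                    embed_ms=20.0, search_ms=10.0, rerank_ms=50.0):
    """Rough RAG latency budget; prefill is treated as linear in tokens."""
    context_tokens = chunks * tokens_per_chunk
    prefill_ms = context_tokens / prefill_tokens_per_s * 1000.0
    decode_ms = output_tokens * tpot_ms
    return RagBudget(embed_ms, search_ms, rerank_ms, prefill_ms, decode_ms)


if __name__ == "__main__":
    for k in (20, 5):
        b = estimate_budget(chunks=k, tokens_per_chunk=500, output_tokens=200)
        print(f"top-{k}: prefill={b.prefill_ms:.0f} ms, TTFT={b.ttft_ms:.0f} ms")
```

Under these assumed numbers, dropping from 20 chunks to 5 cuts prefill from 2,000 ms to 500 ms, while no amount of vector search tuning could save more than the 10 ms assigned to that stage.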
ANN index trade-offs (HNSW)
Vector search uses approximate nearest neighbour (ANN) indexes because exact search is too slow at scale. HNSW is the most common, and three parameters trade recall against latency:
| Parameter | Higher value → | Trade-off |
|---|---|---|
| M (graph connectivity) | Better recall, more memory | Larger index |
| efConstruction | Better index quality | Slower, one-time build |
| efSearch | Better recall per query | Slower every query |
efSearch is the knob you tune at serving time: it's a direct recall-vs-latency dial. Plot recall@k against p95 query latency as you sweep efSearch, and pick the lowest value that meets your recall target. This is the same percentile-driven approach as in the lesson on why p99 matters. Benchmark it on your own vectors with the latency percentile analyser.
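The selection step of such a sweep can be sketched as follows. This assumes you have already measured, for each efSearch value, a mean recall@k against exact-search ground truth and a list of per-query latencies; the helper names (`percentile`, `recall_at_k`, `pick_ef_search`) and the shape of `sweep` are hypothetical, and the code is independent of any particular vector database.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (pct as an integer 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k neighbours found in the retrieved top-k."""
    truth = set(ground_truth[:k])
    return len(truth.intersection(retrieved[:k])) / len(truth)


def pick_ef_search(sweep, recall_target):
    """Return (ef_search, p95_ms) for the lowest ef_search meeting the target.

    sweep maps ef_search -> (mean recall@k, list of query latencies in ms).
    Returns None if no swept value reaches the recall target.
    """
    for ef in sorted(sweep):
        recall, latencies_ms = sweep[ef]
        if recall >= recall_target:
            return ef, percentile(latencies_ms, 95)
    return None
```

Scanning in ascending order relies on latency generally rising with efSearch; if your measurements are noisy enough to break that, compare p95 values directly among the candidates that meet the target.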
Practical levers
- Retrieve fewer, better chunks. Top-5 good chunks beat top-20 mediocre ones — less prefill, often better answers. Reranking a wide candidate set down to a tight top-k buys quality without bloating the LLM context.
- Cache embeddings. Cache query embeddings for repeated or similar queries, and never re-embed your corpus unnecessarily: embedding the corpus is a batch job, not a request-path cost.
- Batch embeddings. Embedding models love batching; embed many documents per forward pass during indexing.
- Right-size the embedding model. A smaller embedding model with adequate recall cuts both index-time and query-time embedding latency.
- Prefix-cache the shared scaffolding. The instruction template wrapping retrieved context is static — prefix-cache it (see KV-cache optimisation) so only the variable chunks hit prefill.
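Two of these levers can be sketched in a few lines: caching query embeddings, and trimming a reranked candidate set down to a tight top-k under a token cap. Everything here is an assumption for illustration: `embed_fn` stands in for whatever embedding call you use, the cache only hits on queries that match exactly after normalisation (catching merely similar queries would need a semantic cache, which is not shown), and the `(score, token_count, text)` tuple shape is hypothetical.

```python
from functools import lru_cache


def normalise(query: str) -> str:
    """Lower-case and collapse whitespace so trivial variants share a cache key."""
    return " ".join(query.lower().split())


def make_cached_embedder(embed_fn, maxsize=10_000):
    """Wrap embed_fn (str -> sequence of floats) with an in-process LRU cache."""
    @lru_cache(maxsize=maxsize)
    def _cached(normalised_query: str):
        # Store a tuple so the cached value cannot be mutated by callers.
        return tuple(embed_fn(normalised_query))

    def embed(query: str):
        return _cached(normalise(query))

    embed.cache_info = _cached.cache_info
    return embed


def select_context(scored_chunks, top_k=5, max_tokens=2500):
    """Keep the best-scoring chunks, up to top_k chunks and max_tokens tokens.

    scored_chunks: iterable of (rerank_score, token_count, text).
    Returns (list of chosen texts, total tokens used).
    """
    chosen, used = [], 0
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    for _score, tokens, text in ranked:
        if len(chosen) == top_k:
            break
        if used + tokens > max_tokens:
            continue  # skip a chunk that would overflow the prefill budget
        chosen.append(text)
        used += tokens
    return chosen, used
```

Capping by tokens as well as by count keeps one oversized chunk from quietly undoing the prefill savings of a small top-k.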
Measuring a RAG pipeline
Instrument each stage separately — a single end-to-end number hides which stage to fix. Emit a span per stage (embed, search, rerank, prefill, decode) and watch the p95 of each. You’ll almost always find the budget is dominated by prefill (too much retrieved context) or rerank (too wide a candidate set), not the vector search everyone worries about.
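A minimal sketch of per-stage spans using only the standard library is below. `StageTimer` is a hypothetical helper, not part of any tracing library; in production you would more likely emit these spans through your existing tracing stack, but the shape of the measurement is the same.

```python
import math
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock samples per pipeline stage and reports p95."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so tests can use a fake clock
        self.samples_ms = {}     # stage name -> list of durations in ms

    @contextmanager
    def span(self, stage):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.samples_ms.setdefault(stage, []).append(elapsed_ms)

    def p95(self, stage):
        """Nearest-rank p95 for a stage; raises KeyError if never recorded."""
        ordered = sorted(self.samples_ms[stage])
        rank = max(1, math.ceil(95 * len(ordered) / 100))
        return ordered[rank - 1]

    def dominant_stage(self):
        """Stage with the highest p95, i.e. the first one worth fixing."""
        return max(self.samples_ms, key=self.p95)


# Usage inside a request handler (stage bodies are placeholders):
#   timer = StageTimer()
#   with timer.span("embed"):   ...
#   with timer.span("search"):  ...
#   with timer.span("prefill"): ...
```

Keeping one timer across many requests lets `p95` reflect the tail rather than a single lucky request.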
Course recap
You’ve now got the full picture:
- The metrics — TTFT, TPOT, throughput, cost.
- The mechanics — prefill, decode, KV cache, and its memory formula.
- Measurement — streaming benchmarks and the throughput knee.
- Throughput — continuous batching and its latency cost.
- Memory — PagedAttention, prefix caching, quantisation.
- Latency — the ordered TTFT checklist.
- Capacity — the VRAM budget and parallelism.
- RAG — the retrieval latency budget and its feedback into prefill.
Takeaway
RAG performance is a budget where the surprise is that vector search is rarely the bottleneck — the
retrieved context size feeding LLM prefill usually is. Tune efSearch against measured recall and
latency, retrieve fewer better chunks, cache aggressively, and instrument every stage. Better retrieval
isn’t just about accuracy; retrieving less is how you make RAG fast.