Part 7 of 8 · 3 min read
Throughput and GPU Sizing: Batch Size, Parallelism and a Memory Budget
How to size GPUs for LLM serving: a concrete VRAM budget (weights + KV cache + overhead), when to use tensor vs pipeline parallelism, and how batch size sets your throughput ceiling.
“How many GPUs do I need?” is answerable with arithmetic, not vibes. This lesson gives you a VRAM budget you can compute, the parallelism options when one GPU isn’t enough, and how batch size sets the throughput ceiling.
The VRAM budget
Three things compete for GPU memory:
VRAM_needed ≈ weights + kv_cache + activations/overhead
weights = num_params × bytes_per_param
kv_cache = kv_bytes_per_token × seq_len × batch (see lesson 2)
overhead ≈ 10–20% for activations, fragmentation, CUDA context
Worked example: can Llama-3-70B serve on one 80 GB GPU?
weights (BF16) = 70e9 × 2 bytes = 140 GB ✗ won't fit on 80 GB
weights (FP8) = 70e9 × 1 byte = 70 GB ✗ leaves ~0 for KV cache
So 70B in BF16 needs multiple GPUs, and even FP8 on a single 80 GB card leaves no room for KV cache. That single calculation decides your deployment topology. (8B from lesson 2, by contrast, fits comfortably with room for ~128 concurrent 4K requests.)
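The budget above can be sketched in a few lines of Python. This is a minimal sketch, not a sizing tool: the function names are made up for illustration, overhead is assumed to be a flat 15% of total VRAM (the midpoint of the 10–20% range), and gigabytes are decimal to match the arithmetic above.

```python
GB = 1e9  # decimal gigabytes, matching the worked example

def weights_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory taken by the model weights alone."""
    return num_params * bytes_per_param

def kv_budget_bytes(vram_bytes: float, num_params: float,
                    bytes_per_param: float, overhead_frac: float = 0.15) -> float:
    """VRAM left for KV cache after weights and overhead.

    Assumption: overhead is modelled as a fixed fraction of total VRAM.
    A negative result means the model does not fit at all.
    """
    overhead = overhead_frac * vram_bytes
    return vram_bytes - overhead - weights_bytes(num_params, bytes_per_param)

# Llama-3-70B on one 80 GB GPU, at two precisions.
bf16_left = kv_budget_bytes(80 * GB, 70e9, 2)
fp8_left = kv_budget_bytes(80 * GB, 70e9, 1)
print(f"70B BF16 on 80 GB: {bf16_left / GB:+.0f} GB left for KV cache")
print(f"70B FP8  on 80 GB: {fp8_left / GB:+.0f} GB left for KV cache")
```

Both results come out negative under the 15% assumption, which is the same conclusion as the hand calculation: neither precision leaves usable KV-cache room on a single 80 GB card.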
Parallelism: when one GPU isn’t enough
Two ways to split a model across GPUs, plus replication for when the model already fits:
| Strategy | How it splits | Use when | Cost |
|---|---|---|---|
| Tensor parallelism (TP) | Each layer’s matrices split across GPUs, every token uses all GPUs | Model doesn’t fit on one GPU; latency matters | High inter-GPU bandwidth (NVLink) needed every layer |
| Pipeline parallelism (PP) | Different layers on different GPUs, tokens flow through | Spanning nodes / many GPUs | Adds pipeline-bubble latency |
| Data parallelism (replicas) | Whole model copied per GPU/node | Model fits; you need more throughput | Linear cost, simplest to reason about |
Rule of thumb: use tensor parallelism within a node (over NVLink) to make a big model fit and keep latency low, then replicas across nodes for throughput. Reach for pipeline parallelism only when a model is too big even for one node’s worth of TP.
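As a rough illustration, the rule of thumb can be written as a small heuristic. Everything here is an assumption made for the sketch: the function name `pick_topology`, the 85% usable-VRAM figure, the 20% slice reserved for KV cache, and the restriction to power-of-two TP degrees on a node whose GPU count is itself a power of two. A real deployment should confirm the choice by measurement.

```python
def pick_topology(weights_gb: float, gpu_vram_gb: float,
                  gpus_per_node: int = 8, usable_frac: float = 0.85,
                  min_kv_frac: float = 0.2) -> tuple[int, int]:
    """Return (tp, pp) following the rule of thumb.

    Grow tensor parallelism within a node first; add pipeline stages
    only if a full node of TP still cannot hold the weights.
    Assumes gpus_per_node is a power of two.
    """
    usable = gpu_vram_gb * usable_frac                # VRAM after overhead
    max_weights_per_gpu = usable * (1 - min_kv_frac)  # keep a slice for KV cache

    tp = 1
    while tp < gpus_per_node and weights_gb / tp > max_weights_per_gpu:
        tp *= 2  # double TP degree, staying inside one node

    pp = 1
    while weights_gb / (tp * pp) > max_weights_per_gpu:
        pp += 1  # one more pipeline stage, i.e. one more node of TP

    return tp, pp

print(pick_topology(16, 80))    # 8B in BF16
print(pick_topology(140, 80))   # 70B in BF16
print(pick_topology(810, 80))   # a 405B-parameter model in BF16
```

Replicas are deliberately left out: once a (tp, pp) group holds the model, the number of copies is a throughput question, not a memory one.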
Batch size sets the throughput ceiling
Throughput is, to first order:
throughput (tokens/sec) ≈ active_batch_size × per_sequence_decode_rate
Raising the batch size (max_num_seqs) raises throughput until you hit a wall: either KV-cache memory (no room for more sequences) or compute (the GPU saturates). Past that wall, latency rises with no throughput gain. Finding that wall is exactly the concurrency-ramp exercise from the measuring lesson; operate just below it.
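One way to locate the wall from ramp data is to stop at the first step where throughput no longer grows meaningfully. The sketch below assumes a hypothetical helper `find_wall` and a 5% minimum-gain threshold; the ramp numbers are invented for illustration and are not measurements.

```python
def find_wall(ramp: list[tuple[int, float]], min_gain: float = 0.05) -> int:
    """Return the highest concurrency that still bought real throughput.

    ramp: (concurrency, tokens_per_sec) pairs sorted by concurrency.
    A step counts as "real" if throughput rose by at least min_gain
    (relative) over the previous level.
    """
    best = ramp[0][0]
    for (_, t_prev), (c_next, t_next) in zip(ramp, ramp[1:]):
        if t_next < t_prev * (1 + min_gain):
            break  # throughput has flattened: this is the wall
        best = c_next
    return best

# Hypothetical ramp, for illustration only.
ramp = [(8, 900.0), (16, 1700.0), (32, 3000.0),
        (64, 4200.0), (128, 4300.0), (256, 4250.0)]
wall = find_wall(ramp)
print(f"throughput flattens after concurrency {wall}")
```

With these made-up numbers the step from 64 to 128 adds barely 2%, so 64 is the last level worth paying latency for. The same ramp should also record latency, since the SLO may bind before throughput flattens.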
Sizing for a target load
Work backwards from your traffic:
- Tokens/sec required = requests/sec × average output tokens.
- Throughput per GPU (or TP group) = measure it at the largest batch size that still meets your TTFT SLO; don’t guess.
- Replicas needed = required ÷ per-replica throughput, rounded up, plus headroom for peak (p99) load and for failover.
Then sanity-check concurrency with Little’s Law: in-flight requests = request rate (requests/sec) × average latency (sec). Per replica, that number must stay under the memory-bound concurrency ceiling from lesson 2.
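The sizing steps and the Little’s Law check can be combined into a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, headroom is a flat 30% multiplier, traffic is assumed to spread evenly across replicas, and all input numbers are invented for illustration.

```python
import math

def replicas_needed(requests_per_sec: float, avg_output_tokens: float,
                    tokens_per_sec_per_replica: float,
                    headroom: float = 0.3) -> int:
    """Replicas required to serve the token rate, with headroom, rounded up."""
    required = requests_per_sec * avg_output_tokens
    return math.ceil(required * (1 + headroom) / tokens_per_sec_per_replica)

def in_flight_per_replica(requests_per_sec: float, avg_latency_sec: float,
                          replicas: int) -> float:
    """Little's Law: in-flight = arrival rate x latency, split evenly."""
    return requests_per_sec * avg_latency_sec / replicas

def size_fleet(requests_per_sec: float, avg_output_tokens: float,
               tokens_per_sec_per_replica: float, avg_latency_sec: float,
               max_concurrency: int, headroom: float = 0.3) -> int:
    """Start from the throughput answer, then add replicas until the
    per-replica concurrency fits under the memory-bound ceiling."""
    n = replicas_needed(requests_per_sec, avg_output_tokens,
                        tokens_per_sec_per_replica, headroom)
    while in_flight_per_replica(requests_per_sec, avg_latency_sec, n) > max_concurrency:
        n += 1
    return n

# Hypothetical load: 20 req/s, 300 output tokens each, 15 s average latency,
# 4000 tok/s measured per replica, concurrency ceiling of 128 per replica.
by_throughput = replicas_needed(20, 300, 4000)
by_both = size_fleet(20, 300, 4000, 15, 128)
print(by_throughput, by_both)
```

In this invented scenario throughput alone suggests fewer replicas than the concurrency check allows, which is why the Little’s Law step is worth running even when the token arithmetic looks comfortable.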
Takeaway
GPU sizing is a budget: weights + KV cache + ~15% overhead must fit in VRAM, and that calculation alone often dictates multi-GPU topology. Use tensor parallelism within a node to fit big models, add replicas across nodes for throughput, and treat batch size as the dial that trades latency for throughput up to the memory/compute wall. Size replicas from measured per-GPU throughput at your latency SLO, never from peak benchmarks.