Throughput and GPU Sizing: Batch Size, Parallelism and a Memory Budget

How to size GPUs for LLM serving: a concrete VRAM budget (weights + KV cache + overhead), when to use tensor vs pipeline parallelism, and how batch size sets your throughput ceiling.

“How many GPUs do I need?” is answerable with arithmetic, not vibes. This lesson gives you a VRAM budget you can compute, the parallelism options when one GPU isn’t enough, and how batch size sets the throughput ceiling.

The VRAM budget

Three things compete for GPU memory:

VRAM_needed ≈ weights + kv_cache + activations/overhead

weights      = num_params × bytes_per_param
kv_cache     = kv_bytes_per_token × seq_len × batch        (see lesson 2)
overhead     ≈ 10–20% for activations, fragmentation, CUDA context

Worked example: can Llama-3-70B serve on one 80 GB GPU?

weights (BF16)  = 70e9 × 2 bytes            = 140 GB   ✗ won't fit on 80 GB
weights (FP8)   = 70e9 × 1 byte             = 70 GB    ✗ leaves ~0 for KV cache

So 70B in BF16 needs multiple GPUs, and even FP8 on a single 80 GB card leaves no room for KV cache. That single calculation decides your deployment topology. (8B from lesson 2, by contrast, fits comfortably with room for ~128 concurrent 4K requests.)

Parallelism: when one GPU isn’t enough

Two ways to split a model across GPUs:

Strategy	How it splits	Use when	Cost
Tensor parallelism (TP)	Each layer’s matrices split across GPUs, every token uses all GPUs	Model doesn’t fit on one GPU; latency matters	High inter-GPU bandwidth (NVLink) needed every layer
Pipeline parallelism (PP)	Different layers on different GPUs, tokens flow through	Spanning nodes / many GPUs	Adds pipeline-bubble latency
Data parallelism (replicas)	Whole model copied per GPU/node	Model fits; you need more throughput	Linear cost, simplest to reason about

Rule of thumb: use tensor parallelism within a node (over NVLink) to make a big model fit and keep latency low, then replicas across nodes for throughput. Reach for pipeline parallelism only when a model is too big even for one node’s worth of TP.

Batch size sets the throughput ceiling

Throughput is, to first order:

throughput (tokens/sec) ≈ active_batch_size × per_sequence_decode_rate

Raising the batch size (max_num_seqs) raises throughput until you hit a wall — either KV-cache memory (no room for more sequences) or compute (the GPU saturates). Past that wall, latency rises with no throughput gain. Finding that wall is exactly the concurrency-ramp exercise from the measuring lesson; operate just below it.

Sizing for a target load

Work backwards from your traffic:

Tokens/sec required = requests/sec × average output tokens.
Throughput per GPU (or TP group) = measure it at your TTFT-SLO batch size — don’t guess.
Replicas needed = required ÷ per-replica throughput, rounded up, plus headroom for the p99 and for failover.

Then sanity-check concurrency with Little’s Law: in-flight requests = throughput × average latency must stay under the memory-bound concurrency ceiling from lesson 2.

Takeaway

GPU sizing is a budget: weights + KV cache + ~15% overhead must fit in VRAM, and that calculation alone often dictates multi-GPU topology. Use tensor parallelism within a node to fit big models, add replicas across nodes for throughput, and treat batch size as the dial that trades latency for throughput up to the memory/compute wall. Size replicas from measured per-GPU throughput at your latency SLO, never from peak benchmarks.

Tips for this lesson

Loading…

Comments

Loading…