Batch Size
The number of requests processed simultaneously during inference — higher batch sizes improve GPU utilization but increase latency per request.
Batch size is the number of inference requests processed simultaneously in a single forward pass through the model. In a personal local AI setup, batch size is almost always 1 — you're the only user. But for anyone building a server that handles multiple users, batch size becomes critical to GPU utilization and cost efficiency.
Batch Size 1 (Single User)
At batch size 1, consumer GPU inference is almost entirely memory-bandwidth-limited. The GPU's compute cores sit largely idle while the model weights are streamed from VRAM for each matrix multiplication. This is why memory bandwidth matters more than raw TFLOPS for local single-user inference — you're bound by how fast weights can be loaded, not how fast math can be done.
This also means a high-bandwidth consumer GPU such as the RTX 4090 (~1 TB/s) significantly outperforms lower-bandwidth cards at single-user inference, even when their TFLOPS numbers are close.
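The bandwidth bound gives a useful back-of-envelope ceiling: at batch size 1, each generated token requires reading roughly all model weights from VRAM once, so decode speed can't exceed bandwidth divided by model size. A minimal sketch (the function name and example numbers are illustrative, not measured benchmarks):

```python
# Upper bound on decode tokens/sec for a bandwidth-bound GPU at batch size 1.
# Assumes every token reads all weights from VRAM exactly once.
def max_tokens_per_second(bandwidth_gb_s: float, params_b: float,
                          bytes_per_param: float) -> float:
    weight_gb = params_b * bytes_per_param  # model size in GB
    return bandwidth_gb_s / weight_gb

# A 7B model in 4-bit (~0.5 bytes/param) on a ~1 TB/s card:
print(max_tokens_per_second(1000, 7, 0.5))  # ≈ 285.7 tok/s ceiling
# The same model in FP16 (2 bytes/param):
print(max_tokens_per_second(1000, 7, 2.0))  # ≈ 71.4 tok/s ceiling
```

Real throughput lands well below these ceilings, but the ratio explains why quantization and bandwidth, not TFLOPS, dominate single-user speed.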
Higher Batch Sizes (Multi-User)
When multiple requests are batched together, the same weight matrices are loaded once but applied to multiple input sequences in parallel. This amortizes the memory bandwidth cost across the batch, shifting the bottleneck from memory bandwidth toward compute throughput.
A high-end GPU at batch size 16 might have 3x the throughput (requests/second) of batch size 1, because compute utilization jumps from roughly 20% to 80%. The trade-off: with static batching, each individual request waits longer (higher time-to-first-token) because the server holds requests until the batch fills before processing.
Continuous Batching
Modern inference servers (vLLM, TGI, SGLang) use continuous batching — adding new requests to an in-flight batch as existing requests finish. This avoids the latency overhead of waiting for a full batch while keeping GPU utilization high. It's the standard approach for serving multiple users efficiently.
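The scheduling idea can be shown in a few lines: after every decode step, finished requests free their slot and queued requests join immediately. This is a simplified sketch of the scheduling policy only, not how vLLM or TGI are actually implemented:

```python
import collections

def continuous_batching(requests, max_batch=4):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate).
    Returns the number of decode steps needed to finish all requests.
    """
    queue = collections.deque(requests)
    running = {}  # request_id -> tokens remaining
    steps = 0
    while queue or running:
        # Admit queued requests into any free batch slots.
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed mid-flight, not at batch end
        steps += 1
    return steps

print(continuous_batching([("a", 3), ("b", 1), ("c", 2), ("d", 5), ("e", 2)]))
# → 5 steps; a static batch of 4 would need 7 (5 for the first batch, 2 for "e")
```

The short requests ("b", "c") exit early and "e" backfills their slots, which is where the latency and utilization win comes from.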
KV Cache Impact
Batch size directly affects KV cache memory usage. Each request in the batch needs its own KV cache allocation. At batch size 8 with a 32K context window, KV cache requirements can exceed model weight storage. VRAM planning for multi-user serving must account for both model weights and KV cache at the target batch size.
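A quick calculator makes the scaling concrete. The formula below is the standard per-token KV cache size (K and V, per layer, per KV head); the example dimensions are illustrative numbers resembling a 7B-class model with grouped-query attention, not any specific model's published config:

```python
def kv_cache_gb(batch, seq_len, n_layers, n_kv_heads, head_dim,
                bytes_per_elem=2):
    """Total KV cache size in GB: 2 (K and V) per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token_bytes / 1e9

# Batch 8, 32K context, 32 layers, 8 KV heads of dim 128, FP16 cache:
print(kv_cache_gb(batch=8, seq_len=32768, n_layers=32,
                  n_kv_heads=8, head_dim=128))  # ≈ 34.4 GB
```

Against ~14 GB of FP16 weights for a 7B model, ~34 GB of cache shows how KV memory can dwarf the weights at moderate batch sizes and long contexts.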