Inference
Running a trained model to generate output — the part of the LLM lifecycle that produces tokens in response to a prompt.
Inference is the act of using a trained large language model to produce output. It is distinct from training (where the weights are learned) and fine-tuning (where the weights are adjusted). When you type a prompt into a local-LLM tool and watch tokens stream back, that is inference happening in real time on your hardware.
The Two Phases of Inference
Every prompt goes through two phases on the same hardware:
- Prefill — the model reads your input prompt and computes the KV cache for every token in it. This phase is compute-bound and scales with prompt length. A 4,000-token prompt does roughly 8× the prefill work of a 500-token prompt.
- Decode — the model generates new tokens one at a time, reading the KV cache built during prefill. This phase is memory-bandwidth-bound and is what tokens per second measures.
Long prompts feel "slow to start, then normal speed" because prefill must finish before the first output token appears. That initial wait is the time to first token; once decode begins, tokens stream at the usual rate.
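The "slow to start" effect can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a benchmark. It assumes prefill and decode each run at a constant rate; the function name `estimate_latency` and the rates of 2,000 and 30 tokens per second are hypothetical values chosen only for illustration. It also treats prefill cost as linear in prompt length, which ignores the extra attention cost at longer contexts.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token, total_time) in seconds.

    Assumes constant prefill and decode rates, which is a simplification.
    """
    time_to_first_token = prompt_tokens / prefill_tps  # prefill must finish first
    decode_time = output_tokens / decode_tps           # tokens then stream one by one
    return time_to_first_token, time_to_first_token + decode_time


# Hypothetical rates, for illustration only.
PREFILL_TPS = 2000.0
DECODE_TPS = 30.0

short_ttft, short_total = estimate_latency(500, 200, PREFILL_TPS, DECODE_TPS)
long_ttft, long_total = estimate_latency(4000, 200, PREFILL_TPS, DECODE_TPS)

print(f"500-token prompt:   first token after {short_ttft:.2f} s")
print(f"4,000-token prompt: first token after {long_ttft:.2f} s")
```

Under these assumptions the longer prompt waits 8 times as long for its first token, while the decode time for the 200 output tokens is the same in both cases.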
What Determines Inference Speed
Three hardware specs set the ceiling, in roughly this order of importance:
- Memory bandwidth — decode reads every active model weight from VRAM once per token. A 30 GB model on a 1,000 GB/s GPU caps at ~33 t/s; on a 400 GB/s GPU it caps at ~13 t/s.
- VRAM capacity — if the model doesn't fit, you offload layers to CPU RAM, which collapses speed by 5–10× because system memory is roughly an order of magnitude slower than GPU memory.
- Compute throughput — matters most for prefill (long prompts, document summarization, RAG). For short prompts and chat-style use, compute is rarely the bottleneck.
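The bandwidth ceiling in the first bullet is a single division, and the sketch below reproduces the two figures quoted there. `decode_ceiling_tps` is a hypothetical helper name. The model assumes every active weight byte is read exactly once per token with no other memory traffic, so real throughput will land somewhat below these numbers.

```python
def decode_ceiling_tps(model_size_gb, bandwidth_gb_per_s):
    """Upper bound on single-stream decode speed in tokens per second.

    Assumes each generated token requires reading all active weights
    from memory once, and ignores KV-cache reads and other overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_per_s / model_size_gb


for bandwidth in (1000, 400):
    ceiling = decode_ceiling_tps(30, bandwidth)
    print(f"30 GB model at {bandwidth} GB/s: about {ceiling:.1f} tokens/s at most")
```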
Single-Stream vs Server Inference
Single-stream inference (one user, one prompt at a time) maxes out at the bandwidth ceiling above. Server-style inference (vLLM, TGI, TensorRT-LLM) batches multiple concurrent prompts so that each weight read from VRAM serves every prompt in the batch, which can multiply aggregate throughput several-fold even though each individual stream runs no faster. For one person at a chat window, single-stream is fine; for a team or an API endpoint, batching is the right architecture.
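To see why batching helps, the following sketch extends the bandwidth reasoning: if one pass over the weights serves every prompt in the batch, aggregate throughput grows with batch size until compute becomes the limit. This is a deliberately simplified model, not a description of how vLLM, TGI, or TensorRT-LLM schedule work internally. The name `aggregate_tps` and the compute cap of 400 tokens per second are hypothetical, and KV-cache traffic is ignored.

```python
def aggregate_tps(batch_size, single_stream_tps, compute_cap_tps):
    """Estimate total tokens per second across a batch of prompts.

    Simplified model: one read of the weights produces one token for
    every prompt in the batch, until a (hypothetical) compute cap is hit.
    """
    return min(batch_size * single_stream_tps, compute_cap_tps)


SINGLE_STREAM_TPS = 1000 / 30   # bandwidth ceiling for a 30 GB model at 1,000 GB/s
COMPUTE_CAP_TPS = 400.0         # hypothetical compute-bound limit

for batch_size in (1, 4, 8, 32):
    total = aggregate_tps(batch_size, SINGLE_STREAM_TPS, COMPUTE_CAP_TPS)
    per_user = total / batch_size
    print(f"batch {batch_size:>2}: {total:6.1f} tokens/s total, {per_user:5.1f} per user")
```

In this model total throughput rises with batch size up to the cap, while per-user speed never exceeds the single-stream ceiling.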