Prefill (Time to First Token)
The phase where the model processes your input prompt before generating any output.
When you send a message to a local LLM, the model doesn't immediately start producing output. First it has to read and process everything you sent — your system prompt, conversation history, pasted documents, and your question. This processing phase is called prefill.
Only after prefill completes does the model generate its first output token. The time between hitting send and seeing the first word appear is called Time to First Token (TTFT), and prefill is what drives it.
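TTFT is easy to measure yourself against any streaming backend: start a clock, consume the token stream, and record when the first token lands. A minimal sketch, where `fake_stream` is a stand-in for a real streaming inference API (the 0.2 s sleep simulates prefill, the 20 ms sleeps simulate decode):

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, tokens) for any iterable that yields tokens.

    The clock starts when we begin consuming the stream, so with a real
    API you would create the stream (send the request) just before calling this.
    """
    start = time.perf_counter()
    tokens = []
    ttft = None
    for tok in token_stream:
        if ttft is None:
            # First token arrived: prefill is done.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens

# Stand-in for a real streaming backend.
def fake_stream():
    time.sleep(0.2)            # simulated prefill
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.02)       # simulated decode
        yield tok

ttft, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, generated {len(tokens)} tokens")
```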
Prefill vs Decode: Two Different Bottlenecks
Prefill and decode have fundamentally different performance characteristics:
- Prefill is compute-bound. The model processes all input tokens in parallel using the GPU's compute cores. More CUDA cores and higher compute throughput speed up prefill.
- Decode is memory-bandwidth bound. The model generates one token at a time, sequentially, reading through model weights for each. Memory bandwidth governs this phase.
This means optimizing for one phase doesn't necessarily improve the other. A data center A100 has roughly twice the memory bandwidth of a consumer RTX 4090, so it generates tokens noticeably faster, yet the 4090's comparable compute throughput means its prefill speed can match or even beat the A100's.
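The two bottlenecks can be sketched with back-of-envelope roofline math. All numbers below are illustrative assumptions (a 7B-parameter fp16 model on a hypothetical GPU with 80 TFLOPS of usable compute and 1 TB/s of memory bandwidth), not benchmarks:

```python
# Roofline-style estimates: prefill is limited by compute,
# decode by how fast the weights can be streamed from VRAM.
PARAMS = 7e9                 # assumed: 7B-parameter model
BYTES_PER_PARAM = 2          # assumed: fp16 weights
COMPUTE_FLOPS = 80e12        # assumed: 80 TFLOPS usable compute
MEM_BANDWIDTH = 1e12         # assumed: 1 TB/s memory bandwidth

def prefill_seconds(prompt_tokens):
    # Compute-bound: roughly 2 FLOPs per parameter per input token.
    return (2 * PARAMS * prompt_tokens) / COMPUTE_FLOPS

def decode_tokens_per_second():
    # Bandwidth-bound: each output token streams the full weights from VRAM.
    return MEM_BANDWIDTH / (PARAMS * BYTES_PER_PARAM)

print(f"Prefill, 32K-token prompt: ~{prefill_seconds(32_000):.1f} s")
print(f"Decode: ~{decode_tokens_per_second():.0f} tokens/s")
```

Doubling compute roughly halves the prefill estimate but leaves decode untouched, and vice versa for bandwidth, which is the asymmetry described above.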
How Long Prefill Takes in Practice
For short prompts (under 1,000 tokens), prefill is imperceptible — under a second on modern hardware. For long context, it becomes noticeable:
- 4K tokens: 1–3 seconds on a consumer GPU
- 32K tokens: 5–20 seconds
- 128K tokens: 30–120+ seconds depending on hardware
If you paste a large document and wait for the model to respond, that wait is almost entirely prefill.
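Why do the waits above grow faster than the prompt does? Prefill cost has a term linear in prompt length (the weight multiplies) plus a quadratic term from attention, and the quadratic term dominates at long context. A rough model, using assumed illustrative numbers (7B model, 32 layers, hidden size 4096, 80 TFLOPS of usable compute):

```python
# Rough prefill-time model: linear FLOPs from the weight multiplies plus a
# quadratic term from attention. All constants are illustrative assumptions.
PARAMS, N_LAYERS, HIDDEN = 7e9, 32, 4096
COMPUTE_FLOPS = 80e12  # assumed: 80 TFLOPS usable compute

def estimate_prefill_s(t):
    linear_flops = 2 * PARAMS * t                # ~2 FLOPs per parameter per token
    attn_flops = 4 * N_LAYERS * t * t * HIDDEN   # QK^T and AV, each T x T x d per layer
    return (linear_flops + attn_flops) / COMPUTE_FLOPS

for t in (4_096, 32_768, 131_072):
    print(f"{t:>7} tokens: ~{estimate_prefill_s(t):6.1f} s")
```

Under these assumptions the estimates land near the ranges listed above (under a second at 4K, around ten seconds at 32K, minutes at 128K), with real results varying by hardware and framework.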
Flash Attention's Role
Flash Attention is an algorithmic optimization that significantly speeds up prefill for long contexts. It restructures the attention computation into tiles so the large intermediate score matrix never has to be written to and re-read from memory, reducing both VRAM usage and time during the attention calculation. Most modern inference frameworks (llama.cpp, vLLM, ExLlamaV2) support it, though depending on the framework and version you may need to enable it explicitly.
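To see why avoiding that intermediate matrix matters, consider its size: naive attention materializes a T x T score matrix per attention head, and at long context that alone becomes enormous. A quick calculation for a single fp16 score matrix (illustrative; real frameworks differ in what they keep resident):

```python
# Size of one T x T fp16 attention score matrix, which naive attention
# materializes per head and Flash Attention processes in tiles instead.
def score_matrix_gib(context_tokens, bytes_per_element=2):
    return context_tokens**2 * bytes_per_element / 2**30

for t in (4_096, 32_768, 131_072):
    print(f"{t:>7} tokens: {score_matrix_gib(t):8.2f} GiB per head")
```

At 128K context a single head's score matrix would be 32 GiB, which is why tiling the computation rather than storing it is what makes long-context prefill practical at all.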
Why It Matters for Local AI
If your use case involves short conversational messages, prefill is irrelevant — it's too fast to notice. If you regularly work with large documents, codebases, or long system prompts, prefill latency becomes a real friction point. High compute throughput (more CUDA cores) and Flash Attention support are both worth prioritizing in that scenario.