Flash Attention
An algorithm that computes attention more efficiently by reducing reads and writes to VRAM, speeding up prefill and enabling longer context.
Flash Attention is an algorithm for computing the attention mechanism in transformers more efficiently. The standard attention calculation is slow and memory-hungry at long context lengths — its VRAM usage scales quadratically with sequence length. Flash Attention restructures the computation to reduce the number of times data is read from and written to VRAM, achieving the same mathematical result with dramatically less memory movement.
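To make the quadratic cost concrete, here is a minimal NumPy sketch of standard attention (single head, no masking; `naive_attention` is an illustrative name, not a library function). Note the explicitly materialized N×N `scores` matrix, which is exactly what Flash Attention avoids:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # shape (N, N): quadratic in sequence length
    # Row-wise softmax over the full matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V              # shape (N, d)
```

At N = 4096 this `scores` array already holds ~16.8 million entries; doubling the context quadruples it.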
The name comes from the "flash" metaphor — fast reads from fast memory (SRAM on the GPU chip) instead of repeated trips to slower VRAM.
What It Actually Improves
Flash Attention has two main effects:
- Reduced VRAM usage during attention computation. Standard attention requires storing an N×N attention matrix (where N is sequence length). At 128K tokens, that matrix alone is enormous. Flash Attention computes in tiles and never materializes the full matrix, reducing peak VRAM usage significantly.
- Faster prefill for long context. Because fewer data transfers happen between SRAM and VRAM, the attention computation completes faster. The benefit shows up most clearly on long input prompts (10K+ tokens).
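To put a number on "enormous": a back-of-the-envelope calculation (assuming fp16 score entries, a single head in a single layer) for that N×N matrix at 128K context:

```python
# Size of the full attention score matrix at 128K context,
# assuming 2-byte (fp16) entries, one head, one layer.
n = 128 * 1024               # sequence length N = 131,072
matrix_bytes = n * n * 2     # N^2 entries, 2 bytes each
print(matrix_bytes / 2**30)  # 32.0 (GiB)
```

That is 32 GiB for a single head's scores alone, before accounting for multiple attention heads, which is why never materializing the matrix matters so much.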
It does not directly improve decode speed for short responses — its benefit is most visible during prefill and for very long contexts.
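The tiling idea can be sketched in NumPy: process K and V in blocks while maintaining a running softmax maximum and denominator per query row, so the full N×N score matrix never exists. This is a simplified single-head illustration of the online-softmax trick, not the actual fused GPU kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=32):
    """Attention computed block-by-block with a running (online) softmax.
    Only an (N, tile) score block exists at any moment -- a sketch of the
    core Flash Attention idea, not the real kernel."""
    N, d = Q.shape
    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max per query row
    row_sum = np.zeros(N)           # running softmax denominator
    for j in range(0, N, tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = Q @ Kj.T / np.sqrt(d)              # (N, tile) score block only
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)      # rescale earlier partial sums
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj
        row_max = new_max
    return out / row_sum[:, None]
```

The output is numerically identical to standard attention; only the order of computation and the peak memory change. The real kernels additionally choose tile sizes to fit in on-chip SRAM and fuse the whole loop into one GPU kernel launch.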
Flash Attention 2 and 3
The original Flash Attention paper was published in 2022. Flash Attention 2 (2023) improved GPU utilization further. Flash Attention 3 targets Hopper architecture (H100) with additional optimizations. For consumer GPUs, Flash Attention 2 is the relevant version and is widely supported.
Framework Support
Flash Attention is implemented in most modern local inference frameworks:
- llama.cpp — supported and enabled by default on compatible hardware
- vLLM — core part of its performance profile
- ExLlamaV2 — supported
- Ollama — uses llama.cpp, inherits support
On Apple Silicon, the Metal backend in llama.cpp implements equivalent optimizations adapted for the unified memory architecture.
Why It Matters for Local AI
If you work with long context — pasting large documents, maintaining extended conversations, or processing codebases — Flash Attention is what makes that tractable on consumer hardware. Without it, 128K context inference would require substantially more VRAM and run materially slower. Verify your inference backend has it enabled when doing context-heavy work.