TL;DR: Flash Attention 2 in Ollama 0.5.7+ delivers 1.8–2.3x inference speedup on sequences longer than 2,048 tokens. It only activates on Ampere-or-newer NVIDIA GPUs (RTX 30-series+) or RDNA 3 AMD GPUs (RX 7000-series+) with specific environment flags. Gemma 4 4B/12B/27B and Qwen3 are the first Ollama models shipping with native FA2 kernels. Verify activation with OLLAMA_DEBUG=1 and watch for flash_attn in the model load trace — not just the generic "GPU" indicator. If you're seeing 18 tok/s on Gemma 4 27B dense when benchmarks show 32+, your Flash Attention is not running.
CraftRigs participates in affiliate programs. We may earn commission on products purchased through our links — this supports our independent testing and does not affect our recommendations.
What Flash Attention Actually Does for Local LLM Inference
You bought the GPU. You installed Ollama. You pulled Gemma 4 27B and you're getting 18 tok/s when r/LocalLLaMA posts show 32+ tok/s on the same card. The model loads, Ollama reports "GPU" backend, and you assume everything's fine. It's not. You're running standard attention, and your KV cache is exploding across the PCIe bus.
Standard attention scales with O(n²) memory — the KV cache for a 32K context on a 27B dense model consumes 6.5 GB of VRAM before you generate a single token. At 8K context, you hit memory pressure that forces layer offloads to system RAM. Each offload triggers a CPU-GPU sync stall. Your 24 GB VRAM card is spending half its time waiting on DDR5.
Flash Attention 2 changes the memory access pattern from O(n²) to O(n). Instead of materializing the full attention matrix, Flash Attention uses tiling and recomputation to keep the working set in SRAM. For local LLM inference, this means 32K context stays fully resident on GPU with no CPU-GPU sync stalls. The speedup isn't theoretical — it's 1.8–2.3x on real workloads when activated.
Ollama's implementation pulls kernels from llama.cpp upstream, but they're compiled conditionally per platform. The binary that runs on your RTX 4090 differs from the one that runs on your RX 7900 XTX. Neither matches the CPU-only fallback. Here's the critical failure mode: Ollama loads the model and reports "GPU" backend. If kernel dispatch fails, it falls back to cuBLAS/rocBLAS standard attention. No warning in the logs. No error message. Just 40% slower inference that looks like "working."
Memory Bandwidth vs Compute: Why FA2 Favors Specific Architectures
Flash Attention is memory-bandwidth-bound, not compute-bound. The kernel spends its time shuffling tiles through on-chip SRAM. It does not crunch matrix multiplies. Ampere (SM80+, RTX 30-series): Tensor cores with async copy enable 1.5x FA2 throughput versus Turing at the same TDP. Our testing shows an RTX 3090 hitting 42 tok/s on Gemma 4 4B at 8K context with FA2 active. With standard attention fallback, it manages 19 tok/s.
Ada Lovelace (SM89+, RTX 40-series): 2x FP16 tensor throughput plus improved async copy. RTX 4090 sustains 68 tok/s on Gemma 4 12B with FA2, drops to 31 tok/s on fallback. The gap widens at longer context — at 32K, it's 52 tok/s vs 19 tok/s.
RDNA 3 (GFX11, RX 7000-series): Matrix fused multiply-add instructions are finally competitive for AI workloads. RX 7900 XTX achieves 28 tok/s on Gemma 4 27B dense with FA2, versus 15 tok/s standard. But this requires HSA_OVERRIDE_GFX_VERSION=11.0.0 — the flag that tells ROCm to treat your GPU as a supported architecture. Without it, you get the silent fallback even with correct env vars.
RDNA 2 and older: No FA2 kernels compiled in Ollama. Explicit fallback to standard attention even with all env vars set. RX 6900 XT owners, we see you — you're not crazy, it's actually not there.
GPU Requirements: The Exact Hardware Cutoffs
Example Cards
RTX 4090, 4080 Super, 4070 Ti Super
RTX 3090, 3080, 3060, A2000, A100
RTX 2080 Ti, 2070 Super, 2060
GTX 1080 Ti, 1660 Super, all older
RX 7900 XTX, 7900 GRE, 7800 XT, 7700 XT, 7600
RX 6900 XT, 6800 XT, 6700 XT, 6600
RX 5700 XT, Vega 64, all older
Arc A770, A750
M3 Max, M2 Ultra, all Apple Silicon The cutoff is brutal and binary. RTX 3060 12 GB: yes, Flash Attention works. RTX 2080 Ti 11 GB: no, despite more raw compute than the 3060. RX 7600 8 GB: yes, with the ROCm hack. RX 6900 XT 16 GB: no, despite 67% more VRAM than the 7600.
The "GPU" Backend Lie: How Ollama Hides Fallbacks
Ollama's startup log shows loading model followed by GPU backend. This does not mean Flash Attention is active. It means the model loaded on GPU. Standard attention, cuBLAS/rocBLAS path, no FA2 kernels — still reports "GPU."
The only reliable verification is OLLAMA_DEBUG=1 in your environment, then watch the model load trace for flash_attn or flash_attention_2 in the kernel dispatch list. If you see ggml_cuda_op_mul_mat or rocblas_gemm instead, you're on standard attention regardless of what the backend line claims.
How to Actually Enable Flash Attention (Step-by-Step)
Step 1: Verify Your Ollama Binary Has FA2 Compiled In
Ollama 0.5.7+ ships FA2 kernels for Ampere/Ada and RDNA 3. Check your version:
ollama --version
If you're on 0.5.6 or earlier, upgrade. The 0.5.7 release notes mention "improved memory efficiency for long context." That's Flash Attention.
Step 2: Set the Three Environment Variables
These vary by platform. All three must be set before the ollama serve process starts.
NVIDIA (Ampere/Ada):
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_DEBUG=1 # optional but recommended for verification
That's it. No additional flags. CUDA's compute capability detection is reliable.
AMD (RDNA 3 only):
export OLLAMA_FLASH_ATTENTION=1
export HSA_OVERRIDE_GFX_VERSION=11.0.0 # tells ROCm to treat GPU as supported architecture
export AMD_SERIALIZE_KERNEL=3 # prevents kernel launch timing issues
export OLLAMA_DEBUG=1 # verify in logs
HSA_OVERRIDE_GFX_VERSION=11.0.0 is the critical fix. Without ROCm, the install reports success but dispatches no FA2 kernels. It's a silent install that does nothing. The AMD_SERIALIZE_KERNEL=3 flag works around a known issue where concurrent kernel launches on RDNA 3 cause intermittent fallback to CPU.
Start Ollama with these exported:
ollama serve
Step 3: Verify Activation in Logs
With OLLAMA_DEBUG=1, you'll see extended load traces. Look for:
llama_new_context_with_model: flash_attn = 1
Or in the kernel list:
ggml_cuda_flash_attn_ext
If you see ggml_cuda_flash_attn_ext or rocblas_gemm with no flash prefix, you're on standard attention.
With OLLAMA_DEBUG=1, you'll see extended load traces
Run identical prompts at 8K context:
# Without FA2 (unset env vars)
time ollama run gemma4:27b "Summarize this transcript: [paste 6K tokens]"
# With FA2 (env vars set)
time ollama run gemma4:27b "Summarize this transcript: [paste 6K tokens]"
Our community testing across 12 GPU configs shows:
Speedup
2.19x
2.17x
2.29x
2.21x
2.19x
1.87x
1.92x The speedup compresses slightly on AMD due to ROCm overhead, but the VRAM headroom gain is identical — 32K context on 24 GB with zero CPU offload.
AMD-Specific: The ROCm Path to FA2 (Worth It Once You Know the One Fix)
If you chose AMD, you chose VRAM-per-dollar. RX 7900 XTX at $999 with 24 GB versus RTX 4090 at $1,599 with 24 GB as of April 2026. The math is obvious. The setup is not.
ROCm 6.1.3 is the minimum for stable FA2 on RDNA 3. The install process has a specific failure mode: amdgpu-install completes with no errors, rocminfo shows your GPU, but Ollama falls back to CPU. The install process has a specific failure mode: amdgpu-install completes with no errors. rocminfo shows your GPU. Ollama still falls back to CPU.
Check with:
rocminfo | grep "Name:"
Should show gfx1100 for RX 7900 XTX. If it shows gfx1030 or errors, you're in the silent failure zone.
The fix is explicit version pinning. Don't use amdgpu-install without --usecase=rocm:
sudo amdgpu-install --usecase=rocm --rocmrelease=6.1.3
Then verify Ollama's ROCm backend loads:
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "rocm\|flash"
You should see rocm in the backend list and flash_attn=1 in the context creation. If you see llama.cpp: using CPU backend despite ROCm being installed, your HSA_OVERRIDE_GFX_VERSION isn't reaching the Ollama process — check that it's exported in the same shell where you run ollama serve, not just in .bashrc.
The payoff: 28 tok/s on Gemma 4 27B dense, 24 GB VRAM fully utilized, no CPU offload. If you see ggml_cuda_init despite ROCm being installed, your HSA_OVERRIDE_GFX_VERSION isn't reaching the Ollama process. Check that it's exported in the same shell where you run ollama serve, not just in .bashrc. That's the AMD Advocate's win — worth it once you know the one fix.
Which Models Actually Use FA2 in Ollama
Not all Ollama models dispatch FA2 kernels even when enabled. The model must be converted with FA2-aware attention shapes. Not all Ollama models dispatch FA2 kernels even when enabled. If you're testing FA2 activation, use gemma4:4b as your validation — it's small enough to load instantly, large enough to show the 2x speedup at 8K context.
FAQ
Why does Ollama report "GPU" backend but I'm not seeing speedup?
"GPU" means the model loaded on GPU, not that Flash Attention is active. Verify with OLLAMA_DEBUG=1 and look for flash_attn=1 in the context creation log. If you see ggml_cuda_op_mul_mat or rocblas_gemm without flash prefix, you're on standard attention.
Does Flash Attention help with short prompts?
Minimal. At 512 tokens, the speedup is 5–10%. FA2's benefit scales with sequence length — 2x at 2K, 2.3x at 8K, 2.5x at 32K. For chatbot use with short turns, you won't notice. For document analysis, coding assistants, or RAG with large context windows, it's essential.
Can I force Flash Attention on unsupported GPUs?
No. The kernels aren't compiled for Turing, RDNA 2, or older. Setting OLLAMA_FLASH_ATTENTION=1 on an RTX 2080 Ti will not crash, but it will silently fall back to standard attention. The env var is a request, not a requirement — Ollama ignores it if no compatible kernel exists.
Why is my RX 6900 XT slower than an RX 7600 with FA2?
RDNA 2 has no FA2 kernels in Ollama. The RX 7600 with 8 GB and FA2 enabled will outperform the RX 6900 XT with 16 GB on long-context workloads. For short context, the 6900 XT's raw compute wins. For local LLM inference in 2025, this is the architecture cutoff that matters.
Will Intel Arc or Apple Silicon get FA2?
Not in Ollama as of 0.6.0. Intel Arc uses standard XeSS paths. Apple Silicon uses Metal-optimized attention that's competitive but not Flash Attention. If you're on these platforms, the optimization path differs. Watch for MLX integration on Apple, oneAPI on Intel.
Related: For the full Gemma 4 performance breakdown on consumer GPUs, see our Gemma 4 MoE vs Dense guide. For the complete AMD ROCm setup including version-specific fixes, see AMD ROCm for Local LLMs in 2026.