CUDA Driver Version Insufficient Error: What It Means and How to Fix It
"CUDA driver version is insufficient" crashes LLM setup — here's the 550.90.07 driver fix, with DDU rollback if 570.xx broke everything.
Exact fixes for the most common local LLM setup problems — GPU not detected, VRAM errors, ROCm failures, and slow inference. Find your error, get it running.
ROCm doesn't run on Windows—your RX 7900 XTX sits idle at 4 tok/s. Force the Vulkan backend: 18 tok/s on 70B models. Requires one env var, Adrenalin 24.12.1+.
-ngl 99 still shows 0 layers? Your binary was compiled without CUDA/ROCm. Rebuild with explicit flags, verify with ldd, get 30x speedup in 10 minutes.
Dual EPYC crawling? That NUMA warning is why—single-node binding lifts 6.8→14.2 tok/s, if your model fits in 128 GB. The exact numactl command inside.
Your 4090 crawls at 2 tok/s? VRAM spill kills speed—fix with one quantization tweak, verify in 60s. Works for local LLMs only.
Ollama crashes Windows NVIDIA? 40% fail on driver 572.xx — 90-second diagnostic isolates your fix, with version-locked solutions that work.
OLLAMA_FLASH_ATTENTION=1 set but VRAM stuck? Env var hit systemd, not server. Fix scope, verify 35% savings at 32K context—if GPU's new enough.
"File does not exist" at 47%? 73% are disk, path, or permission issues—not remote failures. Map your exact error to the fix in 4 commands.
Two GPUs but only one running? OLLAMA_NUM_GPU=40,40 not 2 — fix layer splits, verify with nvidia-smi, hit 22 tok/s on 70B. Needs Ollama 0.1.38+
Ollama falls back to CPU at 3 tok/s? Run 3 commands, apply the NVIDIA/AMD/WSL2 fix, hit 45+ tok/s — but AMD GPU support needs Linux, not Windows.
Ollama crashes? NUM_CTX pre-allocates ~2.5MB per token on 70B models. Calculate your real VRAM limit, set context right, stop silent CPU fallback.
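The per-token figure follows from the standard KV-cache formula; a minimal sketch (the layer/head counts below assume a 70B-class model *without* grouped-query attention — an illustrative assumption, not the spec of any particular checkpoint):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes pre-allocated for the KV cache: keys and values (the leading 2)
    for every layer, KV head, and context position, at fp16 (2 bytes/elem)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

# 70B-class MHA model: 80 layers, 64 KV heads, head_dim 128
per_token = kv_cache_bytes(80, 64, 128, n_ctx=1)
print(per_token / 1024**2)   # 2.5 MiB per context token

# An 8192-token context, pre-allocated up front before a single token is generated:
print(kv_cache_bytes(80, 64, 128, n_ctx=8192) / 1024**3)  # 20.0 GiB
```

Models with grouped-query attention shrink this by the ratio of attention heads to KV heads, which is why the same context length fits on some 70B checkpoints and not others.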
32B Q4_K_M OOM'd? Calculate exact VRAM: weights + KV cache + overhead. 8GB cards hit wall at 13B—here's the quant that fits, with tok/s numbers.
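As a rough illustration of that arithmetic (the bits-per-weight value is an approximation for Q4_K_M, and the KV/overhead allowances are placeholder assumptions, not measurements):

```python
def model_vram_gb(n_params_billion, bits_per_weight, kv_gb=1.0, overhead_gb=1.0):
    """Rough VRAM budget: quantized weights + KV cache + runtime overhead.
    Params in billions times bits/8 gives weight size in GB."""
    return n_params_billion * bits_per_weight / 8 + kv_gb + overhead_gb

print(round(model_vram_gb(32, 4.85), 1))  # 21.4 GB: a 32B Q4_K_M overflows 8 GB badly
print(round(model_vram_gb(13, 4.85), 1))  # 9.9 GB: even 13B is past an 8 GB card
```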
ROCm won't detect your RX 7900? HSA_OVERRIDE_GFX_VERSION fixes 89% of cases—here's the exact env var, kernel check, and udev rule that works.
RX 6700 XT not detected by Ollama? One env variable unlocks 38 tok/s—if you pick the right GFX version. Mapping table + syntax for every runtime.
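A sketch of how such an override is applied from the launching process (the mapping values are the commonly reported community ones — treat them as assumptions to verify for your card, not an AMD-published table):

```python
import os

# Commonly reported override targets (assumed mapping; verify for your GPU):
GFX_OVERRIDE = {
    "gfx1031": "10.3.0",  # RX 6700 XT: RDNA2 card, ROCm ships gfx1030 kernels
    "gfx1100": "11.0.0",  # RX 7900 XT / XTX
}

# The variable must be set in the environment of the process that loads ROCm,
# i.e. before launching the runtime from this shell or service.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = GFX_OVERRIDE["gfx1031"]
print(os.environ["HSA_OVERRIDE_GFX_VERSION"])  # 10.3.0
```

The same value works as a plain `export HSA_OVERRIDE_GFX_VERSION=10.3.0` in a shell, as long as the LLM runtime is launched from that same session.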
LLM hangs 30s before output? That's prefill, not VRAM. Cut TTFT 60% with 4 diagnostic questions — if CPU-bound, not bandwidth-starved.
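The distinction can be put in numbers: decode speed is bounded by memory bandwidth, because each new token streams the full weight set, while prefill is compute over the whole prompt before the first token appears. A back-of-envelope sketch (the model size and bandwidth figures are illustrative assumptions):

```python
def decode_tok_s(model_gb, mem_bw_gb_s):
    """Upper bound on generation speed: every decoded token reads the
    entire weight set once, so tok/s <= bandwidth / model size."""
    return mem_bw_gb_s / model_gb

# A 40 GB quantized model on a ~1 TB/s GPU tops out near 25 tok/s of decode.
# A long pause *before* the first token is prefill (compute-bound), not this limit.
print(round(decode_tok_s(40, 1008), 1))  # 25.2
```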