This one shows up constantly in r/LocalLLM and the Ollama GitHub issues. Someone with a perfectly capable 24GB GPU (an RTX 3090 or RTX 4090) loads a 27B–32B model such as Qwen2.5 32B and hits "CUDA out of memory" before the model finishes loading. The card has 24GB. The model's GGUF is 16–20GB at Q4. The math should work.
It doesn't work because Windows is eating your VRAM before the model ever loads.
Here's every fix, ordered from easiest to most involved.
Quick Summary
- Windows VRAM fragmentation regularly eats 3–6GB of "available" VRAM before your LLM loads
- WSL2 is the most reliable fix — recovers 1–3GB of VRAM and eliminates fragmentation
- Reducing context length is the fastest in-session fix if you don't want to switch environments
Why This Happens on Windows
Windows 11's GPU memory management is built for gaming workloads, not sustained ML inference. Several processes compete for VRAM:
Desktop Window Manager (DWM): The Windows compositor uses GPU memory for UI rendering. On a high-resolution display with multiple monitors, DWM can consume 1–2GB of VRAM continuously.
Browser hardware acceleration: Chrome and Edge use GPU memory for accelerated rendering. With multiple tabs open, a browser can hold 500MB–2GB of VRAM.
NVIDIA drivers and control panel: Background driver processes use VRAM. GeForce Experience overlay adds another 200–400MB.
Other processes: Antivirus software, screen recorders, Discord overlays, Spotify visualizers — anything with GPU acceleration takes a slice.
The total: on a typical Windows 11 gaming machine, you may lose 3–6GB of your nominal VRAM to background processes. A "24GB" card has 18–21GB available for your LLM. That changes which models fit.
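You can measure this directly instead of guessing. nvidia-smi ships with the NVIDIA driver and runs from PowerShell or cmd; note that on Windows its per-process view only covers compute processes, so DWM and browsers may not appear there (Task Manager's GPU tab fills that gap):

```shell
# Total, used, and free VRAM for each GPU (values in MiB)
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv

# Per-process VRAM usage for compute processes, to spot what to close
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```

Run the first query before loading anything: the gap between memory.total and memory.free is what Windows and background apps are already holding.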
Fix 1: Close Everything First (Fast, Imperfect)
Before loading a model, close:
- Web browsers (all of them)
- Discord (or disable hardware acceleration: Settings → Advanced → Hardware Acceleration → Off)
- GeForce Experience (or disable in-game overlay)
- Any screen capture software
- Spotify, streaming apps with GPU acceleration
Check available VRAM in Task Manager → Performance → GPU, looking at "Dedicated GPU Memory Used."
Result: Typically recovers 1–3GB. Enough to load models that are just barely over the fragmentation threshold. Not a systematic fix — you'll lose it again next time you open a browser.
Fix 2: Disable Hardware Acceleration in Chrome/Edge
This is the most impactful single change if you use Chrome or Edge:
Chrome: Settings → System → "Use hardware acceleration when available" → Off → Relaunch
Edge: Settings → System and performance → "Use hardware acceleration when available" → Off → Restart
Result: Frees 500MB–2GB of VRAM permanently. Websites with heavy graphics will feel slightly slower. For most browsing, imperceptible.
Fix 3: WSL2 (Most Reliable)
WSL2 runs a Linux environment alongside Windows. Ollama, llama.cpp, and most local LLM tools have native Linux builds that work in WSL2.
Setup (15 minutes):
1. Enable WSL2 (Windows 11 has it built in):

```shell
wsl --install -d Ubuntu-24.04
```

2. Install CUDA for WSL2 (follow NVIDIA's WSL2 CUDA guide; the keyring package below comes from the Ubuntu 22.04 repo, so substitute the repo matching your release):

```shell
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update && sudo apt-get install -y cuda-toolkit-12-4
```

3. Install Ollama in WSL2:

```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull qwen2.5:32b-instruct-q4_K_M
```

4. Access from Windows: Ollama's server runs at localhost:11434, accessible from Windows applications, browser-based UIs, and API clients.
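Once inside WSL2, two quick checks confirm the setup worked: the GPU should be visible to Linux, and Ollama's HTTP API should answer on its default port.

```shell
# The GPU should appear here; WSL2 uses the Windows driver's Linux shim
nvidia-smi --query-gpu=name,memory.free --format=csv

# Ollama's API: /api/tags returns the models you've pulled, as JSON
curl -s http://localhost:11434/api/tags
```

If nvidia-smi is missing inside WSL2, the CUDA-for-WSL2 step didn't complete; don't install a Linux NVIDIA driver in the distro, as WSL2 reuses the Windows one.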
Why WSL2 helps: GPU access from WSL2 is paravirtualized through the Windows driver, but allocations made from the Linux side aren't managed under the same WDDM memory budgeting that native Windows applications are subject to. The practical effect: users consistently report 1–3GB more effective VRAM in WSL2 than on native Windows.
Result: Most reliable fix. Models that fail on native Windows load in WSL2.
Fix 4: Reduce Context Length
Every token of context window consumes VRAM for the KV cache. The larger your context, the more VRAM the model uses after loading.
A rough estimate for KV cache per 1K tokens:
- 7B model: ~0.25–0.5GB per 1K context
- 32B model: ~0.5–1GB per 1K context
- 70B model: ~1–2GB per 1K context
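Putting the rule of thumb into numbers: a hypothetical budget for a 32B Q4_K_M model, assuming a 17GB GGUF, 15% runtime overhead, and the midpoint KV figure above. All inputs are illustrative; plug in your own file size and context target.

```shell
awk 'BEGIN {
  model_gb = 17      # GGUF file size in GB (assumed)
  overhead = 1.15    # +15% runtime overhead
  kv_per_k = 0.75    # GB of KV cache per 1K tokens (32B midpoint)
  for (ctx_k = 4; ctx_k <= 16; ctx_k += 4)
    printf "ctx %2dK -> %.1f GB\n", ctx_k, model_gb * overhead + ctx_k * kv_per_k
}'
```

Even at 4K context the total is around 22.5GB, already past the 18–21GB a 24GB Windows card typically has free, which is why 32B models are borderline there and every extra 4K of context matters.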
In Ollama (Modelfile):

```
FROM qwen2.5:32b-instruct-q4_K_M
PARAMETER num_ctx 4096
```

In llama.cpp:

```shell
./llama-server -m qwen2.5-32b-q4_k_m.gguf --ctx-size 4096 -ngl 99
```
Start at 4K context. Work up in 4K increments until OOM appears. Back off 4K from the failure point. This is your maximum stable context.
Result: Immediate fix without changing environment. Trading context window for stability. For most chat use cases, 8K–16K context is more than sufficient.
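The work-up-in-increments search can be scripted with llama-cli (llama.cpp's one-shot CLI), which exits nonzero when model or KV-cache allocation fails. The model path and increments here are placeholders:

```shell
# Hypothetical sweep: find the largest context that loads without OOM.
MODEL=qwen2.5-32b-q4_k_m.gguf
LAST_OK=0
for ctx in 4096 8192 12288 16384 20480; do
  # Generate a single token; a CUDA OOM at this size makes llama-cli fail
  if ./llama-cli -m "$MODEL" --ctx-size "$ctx" -ngl 99 -p "hi" -n 1 >/dev/null 2>&1; then
    LAST_OK=$ctx
  else
    break
  fi
done
echo "Maximum stable context: $LAST_OK"
```

Each iteration loads the model from scratch, so expect the sweep to take a few minutes; run it once and write the number down.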
Fix 5: Model Loading Order and Quantization
If you're running multiple models or have other GPU processes active, the order of operations matters.
Reload the model after clearing VRAM: In Ollama, models are kept loaded by default. If another process used VRAM and then released it, the allocations may be fragmented even after the process ends. Restart Ollama to force a fresh VRAM allocation.
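In practice, assuming a recent Ollama build (the stop subcommand is not in old versions), the reset looks like:

```shell
ollama ps                                  # models currently held in memory, and whether on GPU or CPU
ollama stop qwen2.5:32b-instruct-q4_K_M    # unload one model and release its VRAM
# Or restart the whole server for a completely fresh allocation
# (Linux/WSL2 with the systemd service):
sudo systemctl restart ollama
```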
Drop to a lower quantization: Q3_K_M instead of Q4_K_M saves approximately 20% VRAM. For a 32B model, that's 3–4GB — often enough to recover from a borderline OOM.
The quality drop from Q4_K_M to Q3_K_M is typically ~3–5% on complex tasks.
Fix 6: CPU Offloading (Last Resort)
llama.cpp supports offloading some model layers to system RAM when VRAM is insufficient. The parameter is -ngl (number of GPU layers).
Full GPU inference:

```shell
./llama-server -m model.gguf -ngl 99
```

Partial offload (keep 60 layers on GPU, rest on CPU):

```shell
./llama-server -m model.gguf -ngl 60
```
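A back-of-the-envelope way to pick the -ngl value, assuming layers are roughly equal in size (they are close, though embeddings and the output layer differ). All the numbers are illustrative, not measurements:

```shell
awk 'BEGIN {
  layers     = 64   # total transformer layers in the model (assumed)
  model_gb   = 19   # GGUF size in GB (assumed)
  free_gb    = 14   # VRAM actually free right now
  reserve_gb = 2    # headroom for KV cache and compute buffers
  per_layer = model_gb / layers
  fit = int((free_gb - reserve_gb) / per_layer)
  if (fit > layers) fit = layers
  printf "-ngl %d\n", fit
}'
```

llama.cpp logs how many layers it actually offloaded at startup; use that to adjust the estimate up or down.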
Performance impact: Each layer offloaded to CPU reduces tokens/second significantly. Offloading 20% of layers to CPU typically reduces throughput by 30–50%. Offloading 50% of layers makes inference painfully slow.
Use CPU offloading only as a temporary workaround while you implement one of the better fixes above.
Diagnosing Which Fix You Need
Run this checklist before trying anything:
1. Check VRAM usage in Task Manager (GPU → Dedicated GPU Memory). If it's above 70% before you load the model, Fix 1 (close apps) is your first step.
2. Calculate the model's VRAM requirement: GGUF file size + 15–20% for runtime overhead + KV cache at your target context length. If the total exceeds available VRAM, Fix 4 (reduce context) may resolve it.
3. If you have enough VRAM on paper but still get OOM, the fragmentation issue is likely. Fix 3 (WSL2) is the systemic solution.
4. If you need this working on Windows without WSL2, Fix 2 (disable browser hardware acceleration) + Fix 1 (close apps) + Fix 5 (drop quantization) often stack well enough to get a borderline model running.
FAQ
Why does CUDA out of memory happen even when the model should fit in VRAM? Windows fragments VRAM across multiple processes — the GPU driver, desktop compositor, browser tabs with hardware acceleration, and your LLM all compete for the same VRAM pool. A 24GB GPU running Windows 11 may have only 18–20GB of free VRAM available even with nothing else open. Running inference from WSL2 sidesteps much of this overhead.
How does WSL2 fix CUDA OOM errors for local LLM? WSL2 runs a lightweight Linux VM that accesses the GPU with minimal Windows overhead. The VRAM available to Ollama or llama.cpp in WSL2 is typically 1–3GB higher than in native Windows — enough to load models that wouldn't fit before. Run Ollama in WSL2, access it from Windows localhost.
What context length should I set to avoid CUDA OOM? Each 1K tokens of context uses roughly 0.5–1GB of VRAM for KV cache on a 32B model (less for smaller models; it also varies with quantization). If you're hitting OOM after the model loads, reduce context length first. Start at 4K, work up until OOM appears, then drop back to the last size that worked. Use --ctx-size in llama.cpp or PARAMETER num_ctx in an Ollama Modelfile.