
CUDA Out of Memory: 12 Fixes Ranked by Success Rate [2026]

By CraftRigs Staff · 12 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


You picked a model that "should fit." You downloaded it, hit Run, and got a wall of red text ending in `CUDA out of memory. Tried to allocate 1.50 GiB`.

The maddening part isn't the error itself. It's that the error tells you nothing useful. It doesn't say whether the cause is model size, context overflow, a background app eating VRAM, or — on Windows — the desktop compositor taking a quiet 10–15% cut before Ollama even loads. Just: out of memory. Good luck.

Every other guide lists the same fixes in alphabetical order with no signal about which to try first. That's why you're still stuck. This one ranks them by how often they actually resolve OOM errors, so if Fix #3 (close Chrome) solves your problem in 30 seconds, you find that out before scrolling past eight other suggestions.

**"CUDA out of memory" on local LLM usually comes down to one of four causes: model too large for available VRAM, context overflow bloating the [KV cache](/glossary/kv-cache), background apps stealing GPU memory, or — on Windows — the desktop compositor consuming 10–15% of your [VRAM](/glossary/vram) before your model loads a single weight. Match your symptom to the diagnosis table below, then jump straight to the right fix. Most people are running again in under three minutes.**

## Which OOM Are You? Match Your Symptom to the Fix

Before you try anything, take 30 seconds to identify which OOM you're dealing with. Fix #6 (KV cache) does nothing for a load-time OOM. Fix #1 ([quantization](/glossary/quantization)) can't help if background apps are the culprit. Treating them the same is exactly why most guides fail you.

| Symptom | Start with |
|---|---|
| OOM on first model load, immediately | Fix #1, #2, or #5 |
| OOM after switching models without a restart | Fix #3 or #10 |
| OOM mid-conversation after 10+ messages | Fix #6 or #7 |
| OOM only on Windows — same model works on Linux | Fix #5 |
| OOM on GPU, but CPU inference completes | Fix #4 or #8 |
| OOM started after a driver or software update | Fix #9 or #12 |

### How to Read This Table

Load-time OOM happens before the model produces any output — the process dies while allocating weights. Runtime OOM happens mid-conversation as the KV cache grows past your VRAM limit. Not sure which you have? Load-time fails immediately on startup. Runtime OOM lets you start talking first, then crashes after several messages.

## Fixes #1–4: These Resolve the Majority of OOM Cases

In our experience, and consistent with patterns across r/LocalLLaMA and LM Studio's GitHub issue tracker, these four fixes account for the vast majority of OOM reports. They address the most common root causes. Try them in order before reaching for anything else.

### Fix #1 — Switch to a Smaller Quantization (Q4_K_M Instead of Q8 or FP16)

This is the most effective fix. Quantization compresses model weights to lower bit precision — less VRAM used, minimal quality cost. Q4_K_M is the community default recommendation because the quality difference versus Q8_0 is imperceptible for most workloads.
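The size differences follow from simple arithmetic: file size ≈ parameters × effective bits per weight ÷ 8. A quick sketch — the effective-bit figures here (Q4_K_M ≈ 4.8, Q8_0 ≈ 8.5) are approximate community estimates, not official numbers:

```shell
#!/bin/sh
# Rough GGUF file size: parameters x effective bits per weight / 8.
# Effective bits (approximate): FP16 = 16.0, Q8_0 ~ 8.5, Q4_K_M ~ 4.8.
PARAMS_B=14   # billions of parameters

size_gb() {   # size_gb <bits_per_weight x 10>  -> whole GB (integer math)
  echo $(( PARAMS_B * $1 / 80 ))
}

echo "FP16:   ~$(size_gb 160) GB"
echo "Q8_0:   ~$(size_gb 85) GB"
echo "Q4_K_M: ~$(size_gb 48) GB"
```

For a 14B model that works out to roughly 28 GB at FP16 versus roughly 8 GB at Q4_K_M — a 3.3× reduction from the quantization step alone.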

**VRAM by quantization level** (verified against current GGUF releases, April 2026): ~3.5 GB at Q3_K_M, rising through ~6.2 GB and ~6.5 GB at the mid-tier quantizations to ~13 GB at the highest-precision level listed.
**In Ollama:**

```bash
ollama pull qwen2.5:14b-instruct-q4_K_M
```

**In llama.cpp:**

```bash
./llama-cli -m models/qwen2.5-14b-q4_K_M.gguf -ngl 99
```

**Trade-off:** Q4_K_M versus Q8_0 quality difference is negligible for coding, summarization, and general chat. Q3_K_M starts to affect coherence on complex reasoning. Don't go below Q4 unless you have no other option.

> **Tip:** The objection: "But I want full quality." Q4_K_M *is* the quality recommendation. A Q4_K_M model that fits cleanly in VRAM outperforms a Q8_0 model being constantly swapped to system RAM. Clean fit beats higher precision every time.

### Fix #2 — Reduce Context Window Length

The KV cache is a separate VRAM allocation that stores attention values for every token in your conversation. It scales directly with context length — roughly double the context, roughly double the KV cache VRAM. Most load-time OOMs that aren't about model size are actually about someone accidentally setting context to 32,768 tokens.

Ollama's default context window is 4,096 tokens. Setting it to 32,768 without understanding the cost can add 3–6 GB of VRAM usage on a 14B model — enough to push a borderline fit over the edge.
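For intuition, here's a back-of-envelope KV-cache estimate. The formula — 2 (for K and V) × layers × KV heads × head dim × bytes per element × tokens — is standard; the layer and head counts below are illustrative 14B-class values, not from any specific model card:

```shell
#!/bin/sh
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; BYTES=2   # FP16 cache = 2 bytes/element

kv_mib() {   # kv_mib <context_length> -> cache size in MiB
  echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * $1 / 1024 / 1024 ))
}

echo "ctx 4096:  $(kv_mib 4096) MiB"
echo "ctx 32768: $(kv_mib 32768) MiB"
```

Eight times the context means eight times the cache — which is exactly where that surprise multi-GB allocation comes from.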

**In Ollama (Modelfile or API):**

```bash
PARAMETER num_ctx 2048
```

Or per-request:

```json
{"model": "qwen2.5:14b", "options": {"num_ctx": 2048}}
```

**In llama.cpp:**

```bash
./llama-cli -m model.gguf -c 2048
```

**In LM Studio:** Context Length slider in the model parameters panel.

**Trade-off:** Shorter context means shorter conversation memory. At 2,048 tokens, you'll need to start new chats more often. For coding and writing, 2,048–4,096 tokens is plenty. Only go above 8,192 if you're actively processing long documents and you've confirmed the VRAM fits.

### Fix #3 — Close Everything Else Using Your GPU

This one resolves roughly one in five OOM cases without touching the model at all. Chrome with hardware acceleration enabled, Discord, a game minimized in the background, OBS, another AI inference process — any of these can consume 500 MB–1 GB of VRAM before your model loads.

Everyone's done this: fired up Ollama with a game sitting minimized in the background, then wondered why a model that "should fit" is OOMing.

**Check VRAM usage on Windows:**

  1. Task Manager → Performance → GPU → Dedicated GPU Memory
  2. Or: `nvidia-smi` in a terminal — look at the Used column under GPU memory

**Check on Linux:**

```bash
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
```

Close everything using the GPU, restart your inference server, then try loading again.

### Fix #4 — Reduce GPU Layers (Partial Offload Instead of Full GPU)

If the model is genuinely larger than your VRAM, you don't have to run it entirely on CPU — you can split it. The `n_gpu_layers` parameter controls how many model layers load into VRAM. The rest run from system RAM. Less VRAM used, at the cost of speed.

**In llama.cpp:**

```bash
./llama-cli -m model.gguf -ngl 20
```

**In Ollama (Modelfile):**

```bash
PARAMETER num_gpu 20
```

**In LM Studio:** GPU Offload percentage slider on the model load screen — see our LM Studio setup guide for where to find it.

A rough rule: 50% GPU offloading keeps you at roughly 60–75% of full GPU inference speed. Slower, but functional. Use it while you plan the real fix.

> **Tip:** Try reducing `n_gpu_layers` by just 1 or 2 first. Sometimes a single layer is the difference between a clean load and OOM — and you lose almost no speed from it.
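If you'd rather start from an estimate than trial-and-error: layers are roughly equal in size, so a first guess for `-ngl` is free VRAM (minus a reserve for KV cache and runtime buffers) divided by per-layer size. All values below are illustrative; real layer counts come from your model's metadata:

```shell
#!/bin/sh
# First guess for -ngl: how many roughly-equal-sized layers fit in free VRAM
# after reserving room for the KV cache and runtime buffers.
MODEL_MB=8500; N_LAYERS=48; FREE_MB=7000; RESERVE_MB=1500

LAYER_MB=$((MODEL_MB / N_LAYERS))              # ~177 MB per layer
NGL=$(( (FREE_MB - RESERVE_MB) / LAYER_MB ))   # layers that fit in the budget
echo "try: -ngl ${NGL}"
```

Start there, then nudge up or down by one or two layers until the load is clean.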

## Fixes #5–8: These Work When the Cause Is More Specific

These fixes require one extra diagnostic step — you need to have identified a pattern before reaching for them. Once you have, they're fast.

### Fix #5 — The Windows VRAM Tax: Reclaim 10–15% from the Desktop Compositor

This is the OOM cause that almost never appears in guides — and it explains a disproportionate share of "same model, different result on Linux" reports.

Windows 11's Desktop Window Manager (DWM) permanently reserves VRAM for display composition — Mica effects, transparency, rounded corners, shadow rendering. On top of that, CUDA's runtime driver holds a ~300–500 MB reserve that's always present on any OS. Combined, a 12 GB RTX 3060 typically has 10.5–11.5 GB actually available for AI workloads on Windows 11, depending on your display setup (single 1080p vs. dual 1440p vs. 4K).

Ubuntu on the same card? Close to the full 12 GB. Same model, same quantization, same context window — clean load on Linux, OOM on Windows.

The 14B model at Q4_K_M needs ~8.5 GB base plus 0.7–0.9 GB for KV cache at 4,096 context. On Windows 11 at 1440p with DWM overhead, that's right at the edge. And the edge is exactly where OOM lives.
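Spelled out as arithmetic (a sketch using midpoint values from this section):

```shell
#!/bin/sh
# Worked example: 14B Q4_K_M on a 12 GB card under Windows 11 (midpoint values).
MODEL_MB=8500        # ~8.5 GB of weights at Q4_K_M
KV_MB=800            # ~0.7-0.9 GB KV cache at 4,096 context
AVAILABLE_MB=11000   # ~10.5-11.5 GB effective after DWM + CUDA reserve

NEEDED_MB=$((MODEL_MB + KV_MB))
HEADROOM_MB=$((AVAILABLE_MB - NEEDED_MB))
echo "needed ${NEEDED_MB} MB, headroom ${HEADROOM_MB} MB"
```

That's ~1.7 GB of headroom at the midpoints, shrinking toward ~1.2 GB on a dual-1440p setup — borderline against the ~1.5 GB margin this guide recommends, which is why this combination sits right at the edge.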

**Check your real available VRAM:**

  • Windows: GPU-Z → "Memory Used" at idle; or Task Manager → Performance → GPU → Dedicated GPU Memory in use
  • Linux: `nvidia-smi --query-gpu=memory.free --format=csv,noheader`

**Mitigation options:**

  1. Cap Ollama's VRAM budget: `OLLAMA_MAX_VRAM=10000000000` (10 GB, in bytes) — stops Ollama from attempting allocations that Windows overhead has already consumed
  2. Drop from dual monitors to single, or lower resolution before loading heavy models
  3. Disable Chrome and Discord hardware acceleration — recovers 200–400 MB with minimal UX impact

> **Warning:** Disabling Windows hardware acceleration system-wide to reclaim VRAM makes your desktop sluggish. Worth the trade-off during dedicated AI sessions; don't leave it off permanently.

### Fix #6 — KV Cache Quantization: Reduce Mid-Conversation VRAM Growth

Your model loads fine. But 12 messages in, mid-conversation — OOM. That's the KV cache, not the model weights. It stores attention key and value vectors for every token in your conversation history and grows with every message. Long system prompts accelerate this.

The clean fix: quantize the KV cache itself. Switching from the default FP16 KV cache to q8_0 cuts KV cache VRAM by roughly 50% with minimal quality impact.

**In llama.cpp:**

```bash
./llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
```

**In Ollama** (env variables — Ollama requires flash attention enabled for KV cache quantization to take effect):

```bash
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```

**Trade-off:** q8_0 KV cache gives ~50% savings with negligible quality loss. Dropping further to q4_0 cuts another ~50% but can hurt coherence on very long conversations. Start with q8_0.
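Those percentages are just bytes-per-element: FP16 stores 2 bytes per cached value, q8_0 roughly 1, q4_0 roughly 0.5 (ignoring the small per-block scale overhead in the quantized formats). Sketch with a hypothetical cache size:

```shell
#!/bin/sh
# KV-cache size scales with bytes per cached element:
# FP16 = 2 bytes, q8_0 ~ 1 byte, q4_0 ~ 0.5 bytes (block-scale overhead ignored).
FP16_KV_MB=1600   # hypothetical FP16 KV cache for a long conversation

echo "fp16: ${FP16_KV_MB} MB"
echo "q8_0: $((FP16_KV_MB / 2)) MB"   # ~50% of FP16
echo "q4_0: $((FP16_KV_MB / 4)) MB"   # ~50% of q8_0 again
```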

### Fix #7 — Restart the Inference Process (Memory Fragmentation)

You loaded a model fine yesterday. Today, same model, same settings — OOM. Or you loaded a model, unloaded it, tried a different one, and now the second model fails even though it's smaller than the first.

Repeated load/unload cycles leave fragmented VRAM allocations — technically "free" memory that the allocator can't combine into a large enough contiguous block. The fix is a full restart of Ollama, LM Studio, or llama-server. This clears the allocator state entirely.

This is not a permanent fix. If OOM happens on the very first load after a cold restart, fragmentation is not your cause — go back to Fix #1.

### Fix #8 — Memory-Mapped Loading for Borderline Fits

llama.cpp's mmap mode loads model weights from disk on demand rather than allocating everything in VRAM at once. For models that are right at your VRAM limit, this can be the difference between loading and failing.

**In llama.cpp** (mmap is the default loading path on every platform; there's no separate enable flag — it's on unless you pass `--no-mmap`):

```bash
./llama-cli -m model.gguf              # loads via mmap by default
# ./llama-cli -m model.gguf --no-mmap  # disables it
```

LM Studio handles this automatically. Ollama manages model caching internally without an exposed flag.

**Trade-off:** slower first-token latency as weights page in from disk. On an NVMe SSD this is barely noticeable; on an HDD the delay is significant. See our llama.cpp hybrid inference guide for more on partial-load configurations.

## Fixes #9–12: Edge Cases and Actual Last Resorts

These cover a small percentage of OOM cases but are fast to check. Run through them before concluding you need new hardware.

### Fix #9 — Update (or Roll Back) GPU Drivers

NVIDIA driver updates occasionally introduce VRAM reporting bugs or CUDA allocator regressions. The symptom: models that ran fine last week now OOM with no changes to your setup. The fix can go either direction — update to the current driver that patched the bug, or roll back to the last version that worked.

**How to diagnose:** `nvidia-smi` → note your driver version. Cross-reference with when the OOM started and the NVIDIA driver release notes.

**Safe rollback on Windows:** Use DDU (Display Driver Uninstaller) in safe mode before installing the older version — removes all remnants, prevents partial-install conflicts that cause their own problems.

### Fix #10 — Kill Zombie Processes Holding VRAM

A previous Ollama or llama.cpp process that didn't exit cleanly can hold VRAM allocations while appearing inactive. The VRAM is gone, your inference server can't see the process, and your "available" VRAM reads 2 GB lower than the hardware says it should.

**Windows:** Task Manager → Details tab → look for lingering `ollama.exe`, `llama-server.exe`, or `python.exe` processes. Kill them.

**Linux:**

```bash
fuser /dev/nvidia*
# or
nvidia-smi | grep -E "ollama|python|llama"
```

Kill the stale processes, check free VRAM again, then reload.

### Fix #11 — Re-Download the Model File

Corrupted GGUF files can trigger OOM during dequantization rather than from actual VRAM shortage. The giveaway: OOM occurs at exactly the same point in loading every single time, with no timing variation whatsoever.

Verify via SHA256 hash — compare your file against the value listed on the Hugging Face model card for that model:

```bash
# Linux/macOS
sha256sum your-model-file.gguf

# Windows (PowerShell)
Get-FileHash your-model-file.gguf -Algorithm SHA256
```

Mismatch means a corrupt download. Delete and re-download.

### Fix #12 — Switch Inference Backend

Different backends have different CUDA allocator implementations. llama.cpp, Ollama, LM Studio, and vLLM each handle memory allocation differently, and specific GPU/driver/OS combinations can hit backend-specific bugs.

  • Ollama OOMing → try llama.cpp directly
  • llama.cpp OOMing → try LM Studio (different build of llama.cpp)
  • Both failing → try vLLM if you're comfortable with Python setups

This is genuinely the last resort. If you're here, file an issue on the project's GitHub with your exact GPU model, driver version, model name, and quantization level. Backend memory bugs are rare but real, and they only get fixed when they're reported with specifics.

## The Real VRAM Math: What Actually Fits on Your GPU

Most "this model fits in X GB" claims online assume Linux, a clean environment, default context, and no background processes. In the real world, available VRAM looks like this:

**The VRAM Overhead Breakdown**

| Overhead source | Notes |
|---|---|
| Windows desktop compositor (DWM) | Varies by resolution and monitor count |
| CUDA runtime/driver reserve | Always present, regardless of OS |

**Effective VRAM by card** (Q4_K_M, 4,096 context, clean environment, as of April 2026) — the largest clean fits step up from 7B Q4_K_M (~5 GB), through 13B Q4_K_M (~8 GB) on the two mid-tier cards, to 30B Q4_K_M (~17 GB) on the largest.

> **Note:** The VRAM figure shown in Ollama's model library is the base minimum with no KV cache. Real-world loading requires that number plus 1–2 GB for the KV cache and system overhead. If your free VRAM doesn't clear the model size by at least 1 GB, expect OOM under normal use.

### Check Your Actual Free VRAM Before Loading

Run this before launching your inference server, not after wondering why it failed:

**Windows:**

  • GPU-Z → Memory Used (at idle)
  • or: Task Manager → Performance → GPU → Dedicated GPU Memory in use

**Linux:**

```bash
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
```

**Rule:** if free VRAM < (model file size + 1.5 GB), you will OOM. Reduce quantization or context first.
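The rule is easy to encode. A minimal sketch — `will_fit` is a hypothetical helper, and the ~1.5 GB margin is this article's rule of thumb, not a tool default:

```shell
#!/bin/sh
# Fits only if free VRAM >= model file size + ~1.5 GB margin.
will_fit() {   # will_fit <free_mb> <model_file_mb> -> prints FIT or OOM
  MARGIN_MB=1536
  if [ "$1" -ge $(( $2 + MARGIN_MB )) ]; then echo FIT; else echo OOM; fi
}

will_fit 11000 8500   # prints FIT (clears the 10,036 MB threshold)
will_fit 10000 8500   # prints OOM (margin not cleared)
```

Feed it the `memory.free` value from `nvidia-smi` and your GGUF file size in MB before you launch.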

## FAQ — CUDA Out of Memory on Local LLM [2026]

**What does "CUDA out of memory. Tried to allocate X GiB" actually mean?** It means your GPU ran out of contiguous free VRAM before finishing a specific allocation. The "X GiB" is the chunk that failed — often the KV cache or a weight shard, not the entire model. The model may have loaded partially before this point.

**Is Q4_K_M really good enough, or will I notice the quality drop vs Q8?** For coding, summarization, and general chat: you won't notice. Q4_K_M is the default recommendation across r/LocalLLaMA for exactly this reason. You might notice a gap on tasks requiring precise multi-step reasoning — those are Q8's narrow edge cases, not everyday use.

**Why does the same model load on Linux but OOM on Windows?** Windows 11's DWM reserves VRAM for display composition. Combined with CUDA's driver reserve, you lose 600 MB–1.5 GB before your model touches the card. Linux skips the DWM overhead entirely.

**Can I use RAM to extend VRAM for local LLM?** Yes, via partial GPU offloading (Fix #4). It works — you'll get responses — but token speed drops hard when layers run from system RAM. Use it to verify a workflow, then upgrade VRAM if the speed is unacceptable.

**Does my CPU matter for CUDA OOM errors?** Not directly — OOM is a VRAM issue. But if you land on Fix #4, CPU speed and system RAM capacity both matter for the layers running off-GPU.

## When the Real Fix Is More VRAM — Upgrade Path for Chronic OOM

If you've run through Fixes #1–4 and you're still OOMing on the models you want to run, that's signal — not a config problem. The upgrade tiers:

  • 12 GB card (RTX 3060 or RTX 4060 Ti 16GB)
  • 16 GB (RTX 4060 Ti 16GB)
  • 24 GB (RTX 3090 used, ~$500 as of April 2026)
  • Dual GPU or 48 GB workstation card

> **Tip:** 16 GB VRAM is the current sweet spot for local AI. 12 GB hits a hard wall at 14B models with real context windows. The RTX 4060 Ti 16GB is the most common upgrade path from 12 GB cards in 2026 — competitively priced new, and widely available used.

For GPU-by-GPU breakdown at each price point, see our GPU buyer's guide for local LLM inference. If partial offloading is your short-term path, the llama.cpp CPU+GPU hybrid inference guide covers the exact configuration flags to get the best speed out of a mixed setup.

