CraftRigs
Architecture Guide

CPU Offloading Explained: When and Why to Use It

By Georgia Thomas · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: CPU offloading lets you run models too large for your VRAM by splitting layers between your GPU and system RAM. Layers on the GPU are fast. Layers on the CPU are slow. The goal is always to push as many layers as possible onto the GPU — and minimize how much lands on the CPU.

What CPU Offloading Actually Is

When you load a model, it gets split into layers — transformer blocks stacked on top of each other. A 7B model has around 32 layers; a 70B model has 80. Each layer takes up a roughly fixed amount of memory.

Normally, you want all of those layers in VRAM. The GPU processes them in sequence using its high-bandwidth memory bus, and everything is fast.

CPU offloading is what happens when your VRAM can't hold the full model. Instead of refusing to load, tools like llama.cpp will place some layers in your GPU's VRAM and the rest in your system RAM, to be processed by your CPU.

The GPU handles its layers at VRAM speeds — hundreds of GB/s. The CPU handles its layers using system RAM bandwidth — typically 50–60 GB/s for DDR4 or 76–96 GB/s for DDR5. The GPU is anywhere from 5x to 20x faster per layer depending on your hardware.
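That gap follows directly from bandwidth. Token generation is largely memory-bound, so a rough sketch of per-layer time is just bytes read divided by bandwidth (illustrative numbers, not benchmarks):

```python
# Rough per-layer processing time from memory bandwidth alone (illustrative).
# Token generation is memory-bound: each layer's weights are read once per
# token, so time is approximately bytes / bandwidth.

def layer_time_ms(layer_gb, bandwidth_gbps):
    return layer_gb / bandwidth_gbps * 1000.0

layer_gb = 0.5                      # ~0.5 GB/layer for a 42 GB, 80-layer model
gpu = layer_time_ms(layer_gb, 936)  # RTX 3090-class VRAM bandwidth
cpu = layer_time_ms(layer_gb, 60)   # dual-channel DDR4-class bandwidth

print(f"GPU: {gpu:.2f} ms/layer, CPU: {cpu:.2f} ms/layer, ratio {cpu/gpu:.0f}x")
```

With these assumed figures the ratio lands around 16x, inside the 5x–20x range above; swap in your own card's bandwidth to see where you sit.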

Every forward pass through the model touches every layer. So even if 90% of your layers are on the GPU, that remaining 10% on the CPU creates a hard speed ceiling.
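A minimal sketch of that ceiling, assuming illustrative per-layer times with the CPU 10x slower per layer (inside the 5x–20x range above):

```python
# Sketch of the speed ceiling from partial offloading (illustrative numbers,
# not measurements). Every token passes through every layer, so time per
# token is the sum of per-layer times on each device.

def tokens_per_second(total_layers, gpu_layers, t_gpu_ms=0.5, t_cpu_ms=5.0):
    """Estimate tokens/sec for a given GPU/CPU layer split.

    t_gpu_ms / t_cpu_ms are assumed per-layer times; the CPU is taken
    to be 10x slower per layer.
    """
    cpu_layers = total_layers - gpu_layers
    ms_per_token = gpu_layers * t_gpu_ms + cpu_layers * t_cpu_ms
    return 1000.0 / ms_per_token

print(f"{tokens_per_second(80, 80):.1f} tok/s fully on GPU")
print(f"{tokens_per_second(80, 72):.1f} tok/s with 10% of layers on CPU")
```

Under these assumptions, moving just 10% of the layers to the CPU roughly halves throughput — that's the hard ceiling in action.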

The --n-gpu-layers Flag

In llama.cpp, the setting that controls offloading is --n-gpu-layers (sometimes written as -ngl).

  • --n-gpu-layers 0 — all layers run on CPU. No GPU used at all.
  • --n-gpu-layers 32 — 32 layers go to GPU VRAM. Any remaining layers go to CPU.
  • --n-gpu-layers 999 — sends as many layers as fit into VRAM. Overflow stays on CPU.

Setting it to 999 (or any number higher than the model's actual layer count) is the simplest way to tell llama.cpp "use as much VRAM as you have, and only offload what doesn't fit." Most people running with a single GPU just set it high and let the runtime figure out the split.

Ollama handles this automatically in the background — you don't see the flag, but it's doing the same calculation.
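A back-of-envelope version of that split calculation, as a hypothetical helper (real runtimes also reserve room for the KV cache and backend-specific overhead):

```python
# Back-of-envelope layer split (hypothetical, simplified: real runtimes
# also account for the KV cache and per-backend overhead).

def gpu_layer_split(model_gb, total_layers, vram_gb, reserve_gb=1.5):
    """Return how many layers fit in VRAM, reserving some headroom."""
    gb_per_layer = model_gb / total_layers          # rough per-layer cost
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable / gb_per_layer))

# 42 GB 70B model (80 layers) on a 24 GB GPU:
print(gpu_layer_split(42.0, 80, 24.0))
```

With these numbers, about 42 of 80 layers land on the GPU — which is why a 24GB card runs a 70B model at roughly half-offloaded speeds.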

Tip

If you want to know exactly how many layers your model has, check the model's GGUF metadata. Tools like the gguf-dump script that ships with llama.cpp will print it. Knowing the layer count lets you estimate roughly how much VRAM each layer needs: model size in GB divided by total layers.

What Triggers Offloading

Offloading kicks in automatically when the model is larger than your available VRAM. But VRAM usage isn't just the model weights — the runtime needs room for:

  • The model weights themselves (the dominant cost)
  • The KV cache — stores attention state for your context window. Grows linearly with context length.
  • Runtime overhead — a few hundred MB depending on the backend.

A 13B model in Q4_K_M quantization is about 8GB of weights. On a 12GB GPU it fits at short context, but once the KV cache for a longer context window is factored in, it may no longer fit cleanly. Offloading steps in to fill the gap.

The KV cache is important here: running a long context window means your KV cache grows large enough to push layers out of VRAM even on models that normally fit. This catches people off guard. A 7B model fits fine at short context, then suddenly starts offloading at 16K context because the KV cache ate 3GB of VRAM.
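You can estimate the KV cache yourself with the standard fp16 sizing formula; the model config below is an assumption (a Mistral-7B-style model with grouped-query attention), not something read from a specific runtime:

```python
# Standard fp16 KV-cache size estimate. The config values are assumptions
# for a Mistral-7B-style model with grouped-query attention (8 KV heads).

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # factor of 2 for the separate K and V tensors stored per layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

print(f"{kv_cache_gb(32, 8, 128, 16384):.2f} GB at 16K context")
print(f"{kv_cache_gb(32, 8, 128, 2048):.2f} GB at 2K context")
```

For this assumed config the cache grows from a quarter-gigabyte at 2K context to 2GB at 16K — and older architectures without grouped-query attention multiply that several times over, which is where the "ate 3GB of VRAM" scenario comes from.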

System RAM Requirements for Offloading

When layers move to CPU, they live in your system RAM. The requirement scales with how much you're offloading.

  • A 70B model in Q4_K_M is about 42GB total.
  • If you have 24GB VRAM, roughly 55% of layers fit on the GPU. The remaining 45% — about 19GB — sits in RAM.
  • You need that 19GB plus room for the OS, other apps, and KV cache overflow.

The practical rule: add the model size to your base system RAM usage and make sure you have headroom. If you're running a 42GB model with 24GB on the GPU, you need at least 24–32GB of system RAM. Running 64GB gives you comfortable headroom.
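A quick sanity check of that rule, with the OS-and-apps allowance as an assumed figure:

```python
# Headroom check for the rule above (all sizes in GB; the 6 GB allowance
# for the OS and other apps is an illustrative assumption).

def ram_needed(model_gb, vram_gb, os_and_apps_gb=6.0):
    offloaded = max(model_gb - vram_gb, 0.0)   # weights that spill to RAM
    return offloaded + os_and_apps_gb

need = ram_needed(42.0, 24.0)
print(f"~{need:.0f} GB RAM needed; 32 GB is tight, 64 GB is comfortable")
```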

More RAM doesn't make offloading faster. RAM bandwidth is the speed limiter on the CPU side — and that's fixed by your memory kit. Faster DDR5 (6000 MT/s vs 4800 MT/s) helps marginally, but it's not a game-changer.

Note

mmap (memory-mapped files) is a common fallback for very large models — instead of loading everything into RAM, weights are paged in from disk as they're needed. When the model exceeds your RAM, inference gets much slower because disk becomes the bottleneck, but you can run models far larger than your RAM. It's a fallback for extreme cases, not a primary strategy.

The Speed Impact Is Real

Here's what offloading costs you in real terms.

A 70B model running entirely in VRAM on dual RTX 3090s might do 25–30 tokens per second. The same model with half its layers offloaded to a modern CPU drops to 8–12 tokens per second. Fully on CPU: 2–4 tokens per second.

The hit scales with the ratio of CPU layers to total layers. A small amount of offloading (5–10% of layers) has minimal impact. A large amount (40%+) makes the model feel slow in real use.

For interactive chat, you generally want at least 8–10 tokens per second to feel responsive. Below 5 tokens per second, waiting for responses becomes noticeable. Below 2 tokens per second, it's frustrating.

When Offloading Is Acceptable

Running a slow model is better than not running it at all — that's the main case for offloading. If you have a 12GB GPU and want to run a 30B model because it's significantly smarter than anything that fits cleanly in 12GB VRAM, offloading makes that possible. You accept the speed penalty in exchange for model quality.

Batch processing is another good fit. If you're running the model overnight to process documents or generate content, 6 tokens per second is fine. You're not waiting on it interactively.
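Some quick arithmetic on what an overnight run yields at that speed (the 800-token summary size is an arbitrary example, and this counts generation time only):

```python
# What 6 tokens/sec buys you overnight (rough; generation time only).

def tokens_overnight(tok_per_sec, hours=8):
    return int(tok_per_sec * hours * 3600)

total = tokens_overnight(6)
print(f"{total:,} tokens in 8 hours")
print(f"~{total // 800} summaries at ~800 tokens each")
```

That's over 170,000 tokens — a couple hundred document summaries while you sleep, at a speed that would be painful interactively.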

CPU-only inference also has a place if you don't have a GPU at all — or if your GPU is too weak to be useful. In that case, --n-gpu-layers 0 and a high-RAM system running a quantized model is a legitimate setup.

The Practical Rule

Maximize GPU layers. Minimize CPU layers.

If your model doesn't fit in VRAM, the first question is whether a smaller quantization fixes it. Q4_K_M vs Q8_0 on a 13B model is roughly 8GB vs 13GB — same model, meaningfully different VRAM requirement. Dropping quantization to fit the model fully in VRAM is almost always faster than offloading the Q8 version.
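The size difference falls straight out of bits per weight. The effective bit rates below are approximate averages (Q4_K_M mixes quantization types across tensors, so there is no single exact figure):

```python
# Rough file-size estimate from effective bits per weight. The bit rates
# are approximate averages: Q4_K_M ~4.85 bpw, Q8_0 ~8.5 bpw.

def quant_size_gb(params_b, bits_per_weight):
    # params in billions -> decimal GB
    return params_b * bits_per_weight / 8

q4 = quant_size_gb(13.0, 4.85)
q8 = quant_size_gb(13.0, 8.5)
print(f"13B: Q4_K_M ~{q4:.1f} GB, Q8_0 ~{q8:.1f} GB")
```

That ~6GB gap is the difference between a 13B model fitting entirely on a 12GB card and spilling nearly half its layers to the CPU.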

If you can't fit it even with aggressive quantization, offload — but watch your tokens per second. If you're getting fewer than 5, either upgrade your VRAM or step down to a smaller model.

Caution

Don't confuse CPU offloading with CPU inference. Offloading is a split — GPU does most of the work, CPU handles overflow. Pure CPU inference (--n-gpu-layers 0) means the GPU isn't involved at all. They both use your CPU, but the performance profiles are very different.

See Also

Tags: cpu-offloading · llama-cpp · vram · ram · n-gpu-layers · inference · local-llm · performance
