
VRAM Offloading

Running some model layers in VRAM and the rest in system RAM when the model is too large to fit entirely on the GPU.

VRAM offloading is a technique where a model that's too large to fit entirely in GPU VRAM is split across two memory pools: as many transformer layers as possible load into VRAM, and the remaining layers run from system RAM via the CPU. This makes large models accessible on hardware that would otherwise be unable to load them at all.

In llama.cpp, the --n-gpu-layers flag (short form -ngl) controls how many layers go to the GPU. Setting it to the total layer count (e.g., --n-gpu-layers 80 for a 70B model with 80 layers) attempts full GPU loading; any lower value leaves the remaining layers in system RAM.
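A reasonable starting value for --n-gpu-layers can be estimated from the quantized model's file size and the card's free VRAM. The sketch below is a back-of-envelope estimate with assumed numbers; the fixed overhead figure and the equal-layer-size assumption are simplifications, not measurements:

```python
# Back-of-envelope estimate of how many layers fit in VRAM.
# All sizes here are illustrative assumptions, not measured values.

def layers_that_fit(vram_gb, model_gb, n_layers, overhead_gb=1.5):
    """Estimate an -ngl value: how many layers fit after reserving
    overhead_gb for the KV cache, compute buffers, and activations."""
    per_layer_gb = model_gb / n_layers            # assume equal-sized layers
    usable_gb = vram_gb - overhead_gb
    return max(0, min(n_layers, int(usable_gb // per_layer_gb)))

# Example: 24 GB card, a ~40 GB Q4 quant of an 80-layer 70B model
print(layers_that_fit(24, 40, 80))  # 45 — the other 35 layers stay in RAM
```

In practice you would start near this estimate and adjust downward if loading fails with an out-of-memory error, since real overhead varies with context length and backend.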

The Speed Penalty

The penalty for offloading layers to RAM is severe. VRAM bandwidth is typically 500–1,000 GB/s, while system RAM accessed through the CPU delivers roughly 50–100 GB/s, and effective inference throughput is often lower still due to bus overhead. Latency is also much higher.
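These bandwidth figures translate directly into generation speed: producing each token requires streaming essentially every active weight from memory once, so a rough ceiling is bandwidth divided by model size. A sketch with assumed round numbers:

```python
# Rough bandwidth-bound speed ceiling: generating one token streams
# every weight once, so t/s ≈ memory bandwidth / model size.
# The figures below are illustrative assumptions, not benchmarks.

def tokens_per_sec(model_gb, bandwidth_gb_s):
    return bandwidth_gb_s / model_gb

model_gb = 8.0  # assumed size of a ~13B model at Q4 quantization

print(f"VRAM at 1000 GB/s: {tokens_per_sec(model_gb, 1000):.0f} t/s")  # 125 t/s
print(f"RAM at 60 GB/s:    {tokens_per_sec(model_gb, 60):.1f} t/s")    # 7.5 t/s
```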

In practice:

  • Fully in VRAM: A 13B Q4_K_M on an RTX 4090 might yield 80+ t/s
  • Half offloaded to RAM: Same model might yield 10–20 t/s
  • Mostly in RAM: 2–5 t/s — technically working, but painful

The penalty scales with how many layers are offloaded. Offloading just a few layers has a modest cost; offloading most of the model makes inference very slow.
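The scaling can be made concrete with a simple two-pool model: per-token time is the time to stream the GPU-resident bytes plus the time to stream the RAM-resident bytes, so the slow pool quickly dominates the total. The bandwidths and model size below are assumed round numbers:

```python
# Two-pool speed model: per-token time is the sum of the streaming
# time for each memory pool, so the slow pool dominates.
# Bandwidths and model size are illustrative assumptions.

def blended_tps(model_gb, frac_on_gpu, gpu_bw=1000.0, cpu_bw=60.0):
    """Tokens/s when frac_on_gpu of the weights sit in VRAM
    and the remainder must be streamed from system RAM."""
    gpu_time = (model_gb * frac_on_gpu) / gpu_bw        # seconds per token, GPU part
    cpu_time = (model_gb * (1 - frac_on_gpu)) / cpu_bw  # seconds per token, CPU part
    return 1.0 / (gpu_time + cpu_time)

for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:.0%} on GPU: {blended_tps(8.0, frac):.1f} t/s")
```

With these numbers, moving just 10% of the weights to RAM already cuts throughput by more than half, which matches the steep drop-offs seen in practice.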

When Offloading Is Worth It

Offloading makes sense when:

  • The model is only slightly larger than your VRAM (one or two layers over the limit)
  • The task doesn't require fast generation — batch processing, overnight jobs, low-volume use
  • You're evaluating a model before committing to better hardware

It's not worth it for interactive use when generation speed drops below 10 t/s, unless there's genuinely no alternative.

Alternatives to Consider

Before offloading, check:

  • Lower quantization: Q3_K_M or Q2_K reduces size but cuts quality
  • Smaller model: A Q6_K 13B may outperform an offloaded 70B in practice
  • Multi-GPU: llama.cpp can split layers across multiple GPUs, keeping all weights in VRAM and avoiding most of the offloading penalty
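The size arithmetic behind the smaller-model option is straightforward: parameter count times average bits per weight. The bits-per-weight values below are approximate averages for these quant types, used only for illustration:

```python
# Approximate quantized model size: parameters × average bits per weight.
# Bits-per-weight figures are rough averages for these quant types.

def gguf_size_gb(params_billions, bits_per_weight):
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"13B Q6_K:    ~{gguf_size_gb(13, 6.6):.1f} GB")   # ~10.7 GB
print(f"70B Q4_K_M:  ~{gguf_size_gb(70, 4.85):.1f} GB")  # ~42.4 GB
```

By this estimate, the 13B Q6_K fits entirely on a 16 GB card and runs at full VRAM speed, while the 70B would be mostly offloaded on the same hardware.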

Why It Matters for Local AI

Offloading is a useful safety valve, not a long-term solution. It lets you test large models on constrained hardware, but the user experience at heavily offloaded speeds is poor. Understanding the speed penalty helps set expectations and informs hardware upgrade decisions.