
VRAM (Video RAM)

Dedicated high-speed memory on your GPU that stores model weights during inference.

VRAM is the dedicated memory built onto your graphics card. Unlike system RAM, it's wired directly to the GPU's compute cores and operates at extremely high bandwidth — which is exactly what LLM inference demands.

When you load a model, the full set of model weights must fit into VRAM for generation to run at full speed. This makes VRAM the single most important hardware constraint for running local LLMs on a dedicated GPU.

How Much You Need

Model size in VRAM scales directly with parameter count and quantization level. At FP16 (16-bit precision, the common unquantized format), the math is roughly 2 bytes per parameter:

  • 7B model at FP16: ~14GB
  • 13B model at FP16: ~26GB
  • 70B model at FP16: ~140GB
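The 2-bytes-per-parameter rule above reduces to a one-line estimate. A minimal sketch (real loads add a few hundred MB of runtime overhead on top):

```python
def fp16_weight_gb(params_billions: float) -> float:
    """Estimate FP16 weight size in GB: 2 bytes per parameter."""
    return params_billions * 1e9 * 2 / 1e9  # simplifies to params_billions * 2

print(fp16_weight_gb(7))   # 14.0
print(fp16_weight_gb(70))  # 140.0
```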

In practice, most people run quantized models (Q4_K_M is common), which cuts size significantly:

  • 7B Q4_K_M: ~4.5GB — fits in 8GB VRAM
  • 13B Q4_K_M: ~8GB — fits in 12–16GB VRAM
  • 32B Q4_K_M: ~20GB — fits in 24GB VRAM
  • 70B Q4_K_M: ~40GB — needs 48GB+ or multi-GPU
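The same estimate generalizes to quantized formats if you work in average bits per weight. Q4_K_M mixes 4- and 6-bit blocks, so the assumed ~4.85 bits/weight here is a rough figure, and real file sizes differ a little because of embeddings and metadata:

```python
def quant_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Estimate quantized weight size in GB from average bits per weight."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Assumed: Q4_K_M averages ~4.85 bits/weight (mixed 4- and 6-bit blocks)
print(round(quant_weight_gb(7, 4.85), 1))   # 4.2 — in the ballpark of the ~4.5GB above
print(round(quant_weight_gb(70, 4.85), 1))  # 42.4
```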

On top of model weights, you need headroom for the KV cache (conversation context) and runtime overhead. A 24GB card like the RTX 3090 or 4090 is the minimum for running 32B-class models comfortably.
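The KV cache grows linearly with context length, so the headroom it needs depends on how long your conversations run. A sketch of the standard sizing formula, using a hypothetical Llama-style 8B config (32 layers, 8 KV heads with grouped-query attention, head dim 128, FP16 cache) as the assumed numbers:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: two tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Assumed Llama-3-8B-style config; FP16 cache at 8K context
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # 1.07 GB on top of the weights
```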

What Happens When You Run Out

If the model doesn't fit entirely in VRAM, one of two things happens:

  1. The software refuses to load it (hard limit)
  2. It offloads layers to system RAM (VRAM offloading)

Offloading to system RAM is possible but brutal for speed. Layers stored in RAM take 10–50x longer to access than VRAM, so generation can drop from 80+ tokens/second to single digits.
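The collapse is worse than the offloaded fraction suggests, because every token must pass through every layer: the slow RAM-resident layers dominate per-token latency. A back-of-envelope model with assumed numbers (not benchmarks):

```python
def offload_tokens_per_sec(gpu_tps: float, gpu_fraction: float,
                           ram_slowdown: float) -> float:
    """Per-token time = GPU-resident share + slowed-down RAM-resident share."""
    t_full_gpu = 1.0 / gpu_tps  # seconds/token if everything fit in VRAM
    t = gpu_fraction * t_full_gpu + (1 - gpu_fraction) * t_full_gpu * ram_slowdown
    return 1.0 / t

# Assumed: 80 tok/s fully on GPU; 25% of layers offloaded at 30x slower access
print(round(offload_tokens_per_sec(80, 0.75, 30), 1))  # 9.7 — single digits
```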

Why It Matters for Local AI

Every hardware decision for a local LLM rig starts with VRAM. Before asking about CPU speed or system RAM, ask: how much VRAM does this card have, and what models does that unlock? For most users, 24GB (RTX 3090/4090) is the practical sweet spot — enough for 32B models quantized, without the cost of enterprise cards.

Related guides:

  • How much VRAM do you need for local LLMs? — exact VRAM requirements by model size and quantization.
  • Why your VRAM runs out mid-conversation: the KV cache explained — how context length eats into your headroom.
  • Best 16GB GPU for local LLMs in 2026 — the top VRAM-per-dollar options at the 16GB tier.