CraftRigs

VRAM Calculator: How Much Do You Actually Need?

By Charlotte Stewart · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: For 7B models at Q4, you need 5–6GB VRAM. For 14B models at Q4, you need 9–10GB. For 30B models at Q4, you need 18–20GB. For 70B models at Q4, you need 40GB+ (multi-GPU territory). Add 2–3GB buffer for context and overhead. Most people are fine with 16GB.


VRAM is the single most important spec for local LLM hardware. It determines which models you can run, at what quality, and at what speed. Too little and the model either won't load or has to offload to system RAM — which kills performance. Too much and you've overspent on a spec you don't use.

This guide gives you the exact numbers to plan your build or evaluate a GPU purchase.

How to Calculate VRAM Requirements

The formula is straightforward:

VRAM needed = (model parameters × bytes per parameter) + KV cache overhead + runtime overhead

For practical use, you don't need the math — just use the tables below based on quantization level and parameter count. But understanding the formula helps you reason about edge cases.

Bytes per parameter by quantization:

  • F32 (full precision): 4 bytes per parameter
  • F16 / BF16: 2 bytes per parameter
  • Q8_0: 1 byte per parameter
  • Q6_K: 0.75 bytes per parameter
  • Q5_K_M: 0.625 bytes per parameter
  • Q4_K_M: 0.5 bytes per parameter
  • Q3_K_M: 0.375 bytes per parameter
  • Q2_K: 0.25 bytes per parameter

Q4_K_M is the most common quantization for local use — it offers the best balance of quality and size reduction.
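The formula and the byte counts above can be sketched in a few lines of Python. Treat the result as a lower bound: real GGUF files run somewhat larger than parameters × bytes because K-quants store per-block scale metadata (which is why the 7B Q4_K_M file lands at ~4.1 GB rather than 3.5 GB).

```python
# Rough VRAM estimate per the formula above. Byte counts come from the
# quantization list; real GGUF files run ~10-20% larger than this estimate
# because K-quants store per-block scale metadata.

BYTES_PER_PARAM = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 1.0,
    "Q6_K": 0.75,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}

def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Lower-bound weight size in decimal GB: parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[quant]

def estimate_total_gb(params_billions: float, quant: str,
                      kv_cache_gb: float = 1.0, overhead_gb: float = 1.0) -> float:
    """Weights + KV cache + runtime overhead, per the formula above."""
    return estimate_weights_gb(params_billions, quant) + kv_cache_gb + overhead_gb

print(estimate_weights_gb(7, "Q4_K_M"))  # 3.5 (real file: ~4.1 GB)
print(estimate_total_gb(7, "Q4_K_M"))    # 5.5 -- in line with the 5-6 GB guidance
```

The default 1 GB KV cache and 1 GB overhead correspond to a standard 8K context; bump them for longer contexts, as covered below.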

VRAM Requirements by Model Size (Q4_K_M)

These are real-world numbers based on loading the model into VRAM with a standard context window (4K–8K tokens). They include the model weights but not extended context caching.

7B models (Llama 2 7B, Mistral 7B, Qwen 2.5 7B):

  • Model weights at Q4_K_M: ~4.1 GB
  • With overhead and standard context: ~5–6 GB total
  • Minimum VRAM: 6 GB
  • Comfortable VRAM: 8 GB

8B models (Llama 3 8B, Llama 3.1 8B):

  • Model weights at Q4_K_M: ~4.7 GB
  • With overhead and standard context: ~5.5–7 GB total
  • Minimum VRAM: 6 GB
  • Comfortable VRAM: 8–10 GB

13B–14B models (Llama 2 13B, Qwen 2.5 14B, Phi-4 14B):

  • Model weights at Q4_K_M: ~7.5–8.5 GB
  • With overhead and standard context: ~9–11 GB total
  • Minimum VRAM: 10 GB
  • Comfortable VRAM: 12–16 GB

24B–27B models (Gemma 3 27B, Mistral Small 3.1 24B):

  • Model weights at Q4_K_M: ~13–16 GB
  • With overhead and standard context: ~15–18 GB total
  • Minimum VRAM: 16 GB (tight, reduced context)
  • Comfortable VRAM: 24 GB

32B models (Qwen 2.5 32B, QwQ 32B):

  • Model weights at Q4_K_M: ~18–20 GB
  • With overhead and standard context: ~20–22 GB total
  • Minimum VRAM: 20–24 GB
  • Comfortable VRAM: 24 GB

70B models (Llama 3.1 70B, Qwen 2.5 72B):

  • Model weights at Q4_K_M: ~40–43 GB
  • With overhead and standard context: ~43–48 GB total
  • Minimum VRAM: 48 GB (two 24GB cards) or CPU offload (slow)
  • Comfortable VRAM: 2× 24 GB or 48 GB+ in a single card (workstation)

MoE models (Mixtral 8x7B, Mixtral 8x22B):

  • Important: MoE models have large total parameter counts but fewer active parameters
  • Mixtral 8x7B total weights at Q4: ~26 GB (not 7B worth of VRAM)
  • Mixtral 8x22B total weights at Q4: ~67 GB
  • The full weight set must be loaded regardless of active parameters
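To make the MoE point concrete, here's the arithmetic for Mixtral 8x7B, whose published total parameter count is roughly 46.7B (about 12.9B active per token). The active count affects generation speed, not memory:

```python
# Mixtral 8x7B: total vs. active parameters (approximate published figures).
total_params_b = 46.7    # all experts must be resident in VRAM
active_params_b = 12.9   # parameters used per token -- affects speed, not memory

q4_bytes_per_param = 0.5
raw_weights_gb = total_params_b * q4_bytes_per_param
print(f"{raw_weights_gb:.1f} GB")  # ~23 GB raw; real Q4 files land around 26 GB
# Sizing by active parameters (12.9 x 0.5 = ~6.5 GB) would undershoot by ~4x.
```

The gap between the ~23 GB estimate and the ~26 GB real file is the same quantization-metadata overhead noted earlier; the key point is that you must budget for the total, not the active, parameter count.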

VRAM Requirements by Quantization Level (7B Model)

To show how quantization changes the math for a single model size:

  • 7B at F16: ~14 GB
  • 7B at Q8_0: ~7.2 GB
  • 7B at Q6_K: ~5.5 GB
  • 7B at Q5_K_M: ~4.8 GB
  • 7B at Q4_K_M: ~4.1 GB
  • 7B at Q3_K_M: ~3.3 GB
  • 7B at Q2_K: ~2.5 GB

Quality degrades as you go lower. Q4_K_M is generally the lowest you should go before output quality starts noticeably suffering. Q3 and Q2 are useful for fitting a model into VRAM that would otherwise be slightly too large, but don't use them as your default.

Context Window VRAM Cost

The context window (how much text the model can process at once) adds to VRAM usage through the KV cache. This is separate from model weight VRAM.

Approximate KV cache size for common context lengths (7B model at standard KV precision):

  • 4K context: ~0.5 GB
  • 8K context: ~1 GB
  • 16K context: ~2 GB
  • 32K context: ~4 GB
  • 64K context: ~8 GB
  • 128K context: ~16 GB
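The KV cache has its own formula: 2 (one factor each for K and V) × layers × KV heads × head dimension × context length × bytes per element. A sketch using an assumed Mistral-7B-like configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) reproduces the list above; exact figures vary by architecture:

```python
# KV cache size sketch. Assumed config is Mistral-7B-like: 32 layers,
# 8 KV heads (grouped-query attention), head_dim 128, FP16 cache (2 bytes).
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # Leading 2x accounts for storing both the K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

for ctx in (4096, 8192, 16384, 32768, 65536, 131072):
    print(f"{ctx // 1024}K context: {kv_cache_gb(ctx):.1f} GB")
# 4K -> 0.5 GB, 8K -> 1.0 GB, ..., 128K -> 16.0 GB, matching the list above
```

Note the cache scales linearly with context length, which is why doubling the context doubles this cost. Older models without grouped-query attention (e.g. Llama 2 7B with 32 KV heads) need roughly 4× these numbers.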

For a 7B model at Q4_K_M with 4.1 GB weights, running 32K context adds another 4 GB — pushing total VRAM to ~8–9 GB. Running 128K context would need ~20 GB total — well beyond what most 16GB cards can do cleanly in VRAM alone.

Long context overflows to system RAM when it exceeds VRAM. This works but is dramatically slower. If you use long contexts regularly, size your VRAM or system RAM accordingly.

GPU VRAM by Model Support

Here's which GPUs support which model classes for practical daily use:

6–8 GB VRAM (RTX 3060 8GB, RTX 4060 8GB, RTX 3070):

  • 7B–8B models at Q4_K_M: yes, comfortable
  • 13B models: tight, possible at Q3
  • 14B+: no

12 GB VRAM (RTX 3060 12GB, Arc B580, RTX 4070):

  • 7B–8B models: yes, comfortable
  • 13B–14B models: possible at Q4, limited context
  • 24B+: no

16 GB VRAM (RTX 4070 Ti, RTX 4070 Ti Super, RTX 5070 Ti, RTX 5080):

  • 7B–14B models: yes, comfortable with good context
  • 24B–27B models: tight but possible at Q4, limited context
  • 32B models: Q3 required, not ideal
  • 70B+: no

24 GB VRAM (RTX 3090, RTX 4090, RTX 3090 Ti):

  • 7B–30B models: yes, comfortable with full context
  • 70B models: requires CPU offloading (slow) or dual GPU
  • MoE models (Mixtral 8x7B): yes

32 GB VRAM (RTX 5090):

  • 7B–32B models: run cleanly at Q4 with full context
  • 70B models: possible at Q3 (~26 GB weights); Q4 (~40 GB+) requires partial offload
  • MoE models: Mixtral 8x7B yes; 8x22B (~67 GB at Q4) needs heavy offload

40 GB VRAM (A100 40GB):

  • 7B–32B models at Q4_K_M: yes, comfortable
  • 70B at Q4_K_M: borderline — weights alone are ~40–43 GB, so expect a lower quant or reduced context
  • MoE models (Mixtral 8x7B): yes
  • 70B at Q8 (requires ~70GB): no — use A100 80GB for that
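The per-GPU guidance above collapses into a simple fit check: estimated Q4 weights plus the 2–3 GB buffer from the TL;DR against the card's VRAM. A hedged sketch — the thresholds mirror this guide's tables, not any official tool, and the card names are just illustrative labels:

```python
# Quick fit check: do Q4_K_M weights plus headroom fit a card's VRAM?
# buffer_gb=2.5 approximates the "2-3 GB for context and overhead" guidance;
# bytes_per_param=0.5 is the Q4_K_M rule of thumb from the formula section.

def fits_in_vram(params_billions: float, vram_gb: float,
                 bytes_per_param: float = 0.5, buffer_gb: float = 2.5) -> bool:
    weights_gb = params_billions * bytes_per_param
    return weights_gb + buffer_gb <= vram_gb

for params, card, vram in [(8, "RTX 4060 8GB", 8), (14, "RTX 4070 12GB", 12),
                           (32, "RTX 4090 24GB", 24), (70, "RTX 5090 32GB", 32)]:
    verdict = "fits" if fits_in_vram(params, vram) else "does not fit"
    print(f"{params}B on {card}: {verdict}")
```

Because the weight estimate is a lower bound, a "fits" verdict near the boundary still deserves a check against the real GGUF file size before you buy.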

The Quick-Reference Answer

Don't overthink this. Here's the straightforward version:

Just want to run AI locally, any model up to 14B: Get 16GB VRAM. RTX 4070 Ti Super, RTX 5070 Ti, or RTX 5080. Done.

Running 30B+ models or want real headroom: Get 24GB VRAM. RTX 3090 or RTX 4090.

Budget build, 7B models are fine: 12GB VRAM minimum. Arc B580 or RTX 3060 12GB.

Power user, MoE models, long context: 32GB+ or dual 24GB cards.

The people who regret their VRAM choice almost always went too small. 16GB is the sweet spot for 2026 — it covers everything except the very largest models, and those people know they need 24GB+.



Tags: vram-calculator · local-llm · gpu-memory · quantization · hardware-guide · reference
