How do I calculate VRAM needed for a model not in the table?

Use the formula: VRAM (GB) = (Parameters × bits_per_weight / 8 / 1,000,000,000) × 1.2. For a 20B parameter model at Q4_K_M (4.5 bits effective): (20,000,000,000 × 4.5 / 8 / 1e9) × 1.2 = 13.5GB. Add KV cache based on your target context length.

Why does the actual VRAM usage differ from the calculated value?

The 1.2x overhead factor accounts for framework overhead, KV cache at default context length, and buffer allocation. Actual usage depends on context length, batch size, and the inference backend. llama.cpp typically runs closer to the formula minimum; PyTorch-based frameworks often run higher.

Does quantization quality degrade significantly at Q4_K_M vs FP16?

For most tasks, Q4_K_M is virtually indistinguishable from FP16 on standard benchmarks. The K-quant methods (Q4_K_M, Q5_K_M) use importance-weighted quantization, meaning sensitive layers stay at higher precision. You will notice degradation on very precise mathematical reasoning; for general use, Q4_K_M is the standard recommendation.

What is KV cache and how much VRAM does it use?

Key-Value cache stores attention computations from previous tokens so they do not need to be recomputed. At 4K context length, KV cache is small (0.5–2GB for 7B models). At 128K context length, it becomes significant — a 7B model at FP16 KV cache uses approximately 16GB for 128K context alone.

VRAM Calculator for Local LLMs: Find Out Exactly What Fits in Your GPU

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Before buying a GPU or trying to load a model, you need to know whether it fits in VRAM. The calculation is straightforward once you know the formula — and it explains why a 70B model needs ~43GB at Q4_K_M instead of the ~35GB a naive size estimate would suggest.

This page gives you the formula, the derivation, worked examples for the most common model sizes, and a complete reference table. An interactive calculator is planned for /tools/vram-calculator-interactive — this page delivers the formula and tables as standalone reference.

Quick Summary

Core formula: VRAM (GB) = (parameters × bits_per_weight / 8 / 1e9) × 1.2 overhead
KV cache adds significantly at long context lengths — a factor most estimators ignore
Q4_K_M is the practical sweet spot — roughly 4.5 bits effective, excellent quality, roughly 1/4 the size of FP16

The VRAM Formula

Model Weights

VRAM_weights (GB) = (N × B / 8) / 1,000,000,000 × overhead

Where:

N = number of parameters (e.g., 7,000,000,000 for a 7B model)
B = bits per weight (16 for FP16, 8 for INT8/Q8_0, 4 for INT4/Q4)
1,000,000,000 = convert bytes to gigabytes (using 10⁹, not 2³⁰)
overhead = 1.15 to 1.25 (typically use 1.2)

The overhead factor accounts for:

Inference framework buffers
Temporary computation tensors
KV cache at default/short context lengths
Memory fragmentation

KV Cache

For long-context inference, KV cache becomes the dominant VRAM consumer:

KV_cache (GB) ≈ (2 × L × H × D × C × bytes) / 1,000,000,000

Where:

L = number of transformer layers
H = number of attention heads
D = head dimension (typically 128)
C = context length in tokens
bytes = bytes per element (2 for FP16, 1 for INT8, 0.5 for INT4)

For most users running 4K–8K context, KV cache adds 1–4GB on top of model weights. For 128K context, KV cache can exceed model weights entirely.

Worked Examples

Example 1: Llama 3.1 8B at FP16

N = 8,000,000,000
B = 16 bits
VRAM = (8e9 × 16 / 8) / 1e9 × 1.2
     = (8e9 × 2) / 1e9 × 1.2
     = 16 GB × 1.2
     = 19.2 GB

At FP16, a 7–8B model needs approximately 19GB of VRAM. This is why you cannot load Llama 3.1 8B at FP16 on a 16GB card.

Example 2: Llama 3.1 8B at Q8_0 (8-bit, ~1 byte per weight)

VRAM = (8e9 × 8 / 8) / 1e9 × 1.2
     = 8 GB × 1.2
     = 9.6 GB

Fits on a 12GB card with room for context cache.

Example 3: Llama 3.1 8B at Q4_K_M (~4.5 bits effective)

Q4_K_M uses mixed 4-bit and 6-bit quantization for sensitive layers, averaging ~4.5 bits:

VRAM = (8e9 × 4.5 / 8) / 1e9 × 1.2
     = 4.5 GB × 1.2
     = 5.4 GB

Fits on an 8GB card with approximately 2.5GB left for KV cache.

Example 4: 70B Model at Q4_K_M

VRAM = (70e9 × 4.5 / 8) / 1e9 × 1.2
     = 39.4 GB × 1.2
     = 47.2 GB

This confirms why a single RTX 4090 (24GB) cannot load a 70B model at Q4_K_M, and why you need 48GB+ of VRAM (dual RTX 4090, Apple Silicon 64GB+, or similar).

Bits Per Weight by Quantization Format

Notes

Training only; rarely used for inference

Full precision inference; maximum quality

Very close to FP16 quality; 2x smaller

Excellent quality; ~37% smaller than FP16

Strong quality; good balance point

Standard recommendation; minor quality loss

Smaller but lower quality than Q4_K_M

Noticeable quality degradation

Significant quality loss; emergency only The _K_M suffix in GGUF quantization names indicates k-quant with medium importance weighting — these variants use higher precision for attention layers that are most sensitive to quantization error.

Full Reference Table: VRAM Requirements by Model Size

All figures in GB, rounded to nearest 0.5GB, using 1.2x overhead factor at ~4K context length.

Q3_K_M

0.5

1.6

3.7

4.2

6.8

7.4

14.2

17.9

36.8

37.8

212.6 GPU-to-Model Compatibility Table

These are practical compatibility recommendations at Q4_K_M unless noted. "Comfortable" means room for long context; "tight" means fits but minimal KV cache headroom; "hybrid" means CPU+GPU split with llama.cpp.

Notes

Comfortable; limited KV cache

13B is tight at Q4_K_M

Same as 4060 8GB

27B Q4_K_M fits but watch context

Memory bandwidth advantage over 3060 12GB

Comfortable; 27B needs hybrid

Value king for 20–34B models

Same VRAM as 4060 Ti 16GB, much faster

Same VRAM as 3090, significantly faster

Linux ROCm only for optimal performance

Good value AMD option

Sweet spot for AMD single-card

Best AMD single-card for LLMs

Combined VRAM for large models

Fastest consumer dual-GPU setup

Unified memory; use MLX

Near-full-precision 70B

*Requires GTT config — see AMD GTT guide

KV Cache at Different Context Lengths

As context windows grow, KV cache adds substantially to VRAM needs. Approximate KV cache additions for 7B/8B models (FP16 KV cache):

70B Model KV Cache

~2 GB

~4 GB

~16 GB

~64 GB Most inference backends (llama.cpp, Ollama) default to 2K–4K context. Explicitly setting --ctx-size 32768 or larger will increase VRAM usage significantly. For long-document work, INT8 or INT4 KV cache quantization (available in llama.cpp with --cache-type-k q8_0) cuts this by 50–75%.

Common Model Sizes Quick Reference

For the models people actually run, actual measured VRAM in llama.cpp:

Notes

VRAM Calculator for Local LLMs: Find Out Exactly What Fits in Your GPU

The VRAM Formula

Model Weights

KV Cache

Worked Examples

Example 1: Llama 3.1 8B at FP16

Example 2: Llama 3.1 8B at Q8_0 (8-bit, ~1 byte per weight)

Example 3: Llama 3.1 8B at Q4_K_M (~4.5 bits effective)

Example 4: 70B Model at Q4_K_M

Bits Per Weight by Quantization Format

Full Reference Table: VRAM Requirements by Model Size

212.6

GPU-to-Model Compatibility Table

*Requires GTT config — see AMD GTT guide

KV Cache at Different Context Lengths

Common Model Sizes Quick Reference

Technical Intelligence, Weekly.