
VRAM Calculator for Local LLMs: Find Out Exactly What Fits in Your GPU

By Charlotte Stewart · 4 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Before buying a GPU or trying to load a model, you need to know whether it fits in VRAM. The calculation is straightforward once you know the formula — and it explains why a 70B model needs ~47GB at Q4_K_M instead of the ~35GB a naive size estimate would suggest.

This page gives you the formula, the derivation, worked examples for the most common model sizes, and a complete reference table. An interactive calculator is planned for /tools/vram-calculator-interactive; until then, this page serves as a standalone reference for the formula and tables.

Quick Summary

  • Core formula: VRAM (GB) = (parameters × bits_per_weight / 8 / 1e9) × 1.2 overhead
  • KV cache adds significantly at long context lengths — a factor most estimators ignore
  • Q4_K_M is the practical sweet spot — roughly 4.5 bits effective, excellent quality, roughly 1/4 the size of FP16

The VRAM Formula

Model Weights

VRAM_weights (GB) = (N × B / 8) / 1,000,000,000 × overhead

Where:

  • N = number of parameters (e.g., 7,000,000,000 for a 7B model)
  • B = bits per weight (16 for FP16, 8 for INT8/Q8_0, 4 for INT4/Q4)
  • 1,000,000,000 = convert bytes to gigabytes (using 10⁹, not 2³⁰)
  • overhead = 1.15 to 1.25 (typically use 1.2)

The overhead factor accounts for:

  • Inference framework buffers
  • Temporary computation tensors
  • KV cache at default/short context lengths
  • Memory fragmentation
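
As a quick sketch, the weights formula translates directly into a few lines of Python (the function name and defaults here are illustrative, not from any library):

```python
def weight_vram_gb(params: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Estimate the VRAM in GB needed to hold model weights.

    params          -- parameter count, e.g. 8e9 for an 8B model
    bits_per_weight -- 16 for FP16, 8 for Q8_0, ~4.5 for Q4_K_M
    overhead        -- 1.15-1.25 factor for framework buffers, temporary
                       tensors, short-context KV cache, and fragmentation
    """
    weight_bytes = params * bits_per_weight / 8   # bits -> bytes
    return weight_bytes / 1e9 * overhead          # bytes -> GB (10^9), plus overhead

print(weight_vram_gb(8e9, 16))   # -> 19.2
```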

KV Cache

For long-context inference, KV cache becomes the dominant VRAM consumer:

KV_cache (GB) ≈ (2 × L × H × D × C × bytes) / 1,000,000,000

Where:

  • L = number of transformer layers
  • H = number of key/value heads (equal to the attention head count for standard multi-head attention; smaller for grouped-query attention models like Llama 3)
  • D = head dimension (typically 128)
  • C = context length in tokens
  • bytes = bytes per element (2 for FP16, 1 for INT8, 0.5 for INT4)

For most users running 4K–8K context, KV cache adds roughly 0.5–4GB on top of model weights, depending on whether the model uses grouped-query attention. For 128K context, KV cache can exceed the model weights entirely.
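
A minimal sketch of the same calculation in Python, using Llama 3.1 8B's published attention layout (32 layers, 8 key/value heads, head dimension 128) as the example configuration:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2.0) -> float:
    """Estimate KV cache size in GB.

    The leading 2 accounts for the separate K and V tensors;
    bytes_per_elem is 2 for FP16, 1 for INT8, 0.5 for INT4.
    """
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Llama 3.1 8B at 8K context with an FP16 cache: roughly 1.1 GB
print(kv_cache_gb(layers=32, kv_heads=8, head_dim=128, context=8192))
```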


Worked Examples

Example 1: Llama 3.1 8B at FP16

N = 8,000,000,000
B = 16 bits
VRAM = (8e9 × 16 / 8) / 1e9 × 1.2
     = (8e9 × 2) / 1e9 × 1.2
     = 16 GB × 1.2
     = 19.2 GB

At FP16, a 7–8B model needs approximately 19GB of VRAM. This is why you cannot load Llama 3.1 8B at FP16 on a 16GB card.

Example 2: Llama 3.1 8B at Q8_0 (8-bit, ~1 byte per weight)

VRAM = (8e9 × 8 / 8) / 1e9 × 1.2
     = 8 GB × 1.2
     = 9.6 GB

Fits on a 12GB card with room for context cache.

Example 3: Llama 3.1 8B at Q4_K_M (~4.5 bits effective)

Q4_K_M uses mixed 4-bit and 6-bit quantization for sensitive layers, averaging ~4.5 bits:

VRAM = (8e9 × 4.5 / 8) / 1e9 × 1.2
     = 4.5 GB × 1.2
     = 5.4 GB

Fits on an 8GB card with approximately 2.5GB left for KV cache.

Example 4: 70B Model at Q4_K_M

VRAM = (70e9 × 4.5 / 8) / 1e9 × 1.2
     = 39.4 GB × 1.2
     = 47.2 GB

This confirms why a single RTX 4090 (24GB) cannot load a 70B model at Q4_K_M, and why you need 48GB+ of VRAM (dual RTX 4090, Apple Silicon 64GB+, or similar).
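
The four examples can be checked in one pass with a small helper; the card capacities below are the ones discussed in the text, and the helper itself is just a sketch of the weights formula:

```python
def fits(params: float, bits: float, card_gb: float, overhead: float = 1.2):
    """Return (estimated GB for weights, whether they fit on the card)."""
    need = params * bits / 8 / 1e9 * overhead
    return round(need, 1), need <= card_gb

print(fits(8e9, 16, 16))     # FP16 8B on a 16GB card     -> (19.2, False)
print(fits(8e9, 8, 12))      # Q8_0 8B on a 12GB card     -> (9.6, True)
print(fits(8e9, 4.5, 8))     # Q4_K_M 8B on an 8GB card   -> (5.4, True)
print(fits(70e9, 4.5, 24))   # Q4_K_M 70B on one RTX 4090 -> (47.2, False)
```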


Bits Per Weight by Quantization Format

Approximate effective bits per weight for the common formats:

  • FP32, 32 bits: training only; rarely used for inference
  • FP16, 16 bits: full-precision inference; maximum quality
  • Q8_0, ~8.5 bits: very close to FP16 quality; 2x smaller
  • Q6_K, ~6.6 bits: excellent quality; roughly 60% smaller than FP16
  • Q5_K_M, ~5.7 bits: strong quality; good balance point
  • Q4_K_M, ~4.5 bits effective: standard recommendation; minor quality loss
  • Q4_K_S, ~4.3 bits: smaller but lower quality than Q4_K_M
  • Q3_K_M, ~3.5 bits effective: noticeable quality degradation
  • Q2_K, ~3 bits: significant quality loss; emergency only

The _K_M suffix in GGUF quantization names indicates k-quant with medium importance weighting — these variants use higher precision for attention layers that are most sensitive to quantization error.


Full Reference Table: VRAM Requirements by Model Size

All figures in GB, rounded to one decimal place, using the 1.2x overhead factor at ~4K context length, with 16 bits for FP16, 8 for Q8_0, 4.5 effective bits for Q4_K_M, and 3.5 effective bits for Q3_K_M.

Model     FP16     Q8_0   Q4_K_M   Q3_K_M
1B         2.4      1.2      0.7      0.5
3B         7.2      3.6      2.0      1.6
7B        16.8      8.4      4.7      3.7
8B        19.2      9.6      5.4      4.2
13B       31.2     15.6      8.8      6.8
14B       33.6     16.8      9.5      7.4
27B       64.8     32.4     18.2     14.2
34B       81.6     40.8     23.0     17.9
70B      168.0     84.0     47.2     36.8
72B      172.8     86.4     48.6     37.8
405B     972.0    486.0    273.4    212.6

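The reference table can be regenerated from the formula; this sketch assumes 4.5 effective bits for Q4_K_M and 3.5 for Q3_K_M, matching the figures used throughout this page:

```python
SIZES_B = [1, 3, 7, 8, 13, 14, 27, 34, 70, 72, 405]
BITS = {"FP16": 16, "Q8_0": 8, "Q4_K_M": 4.5, "Q3_K_M": 3.5}

def weight_vram_gb(params, bits, overhead=1.2):
    # weights in bytes -> GB (10^9), plus the 1.2x overhead factor
    return params * bits / 8 / 1e9 * overhead

print("Model" + "".join(f"{name:>9}" for name in BITS))
for size in SIZES_B:
    row = "".join(f"{weight_vram_gb(size * 1e9, b):9.1f}" for b in BITS.values())
    print(f"{size:>4}B{row}")
```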
GPU-to-Model Compatibility Table

These are practical compatibility recommendations at Q4_K_M unless noted. "Comfortable" means room for long context; "tight" means fits but minimal KV cache headroom; "hybrid" means CPU+GPU split with llama.cpp.

  • 8GB cards (RTX 4060 class): 7–8B is comfortable with limited KV cache
  • 12GB cards (RTX 3060 12GB, RTX 4070): 13–14B fits, though 13B is tight at Q4_K_M with long context; the 4070 has a memory bandwidth advantage over the 3060 12GB
  • 16GB cards (RTX 4060 Ti 16GB and 16GB AMD options): comfortable up to 14B; 27B needs hybrid
  • 20GB AMD cards: the sweet spot for a single AMD card; Linux ROCm only for optimal performance
  • 24GB cards (RTX 3090, RTX 4090, RX 7900 XTX): 27B Q4_K_M fits but watch context; the used 3090 is the value king for 20–34B models, the 4090 has the same VRAM but is significantly faster, and the 7900 XTX is the best AMD single card for LLMs
  • Dual-GPU builds (2× 3090 or 2× 4090, 48GB combined): combined VRAM for large models; dual 4090 is the fastest consumer setup
  • Apple Silicon (unified memory; use MLX): 64GB+ handles 70B at Q4_K_M; higher-memory configurations run 70B near full precision
  • Integrated AMD GPUs require GTT config — see the AMD GTT guide

KV Cache at Different Context Lengths

As context windows grow, KV cache adds substantially to VRAM needs. For 7B/8B models, see the estimate above; for a 70B model, approximate FP16 KV cache by context length:

  • 4K context: ~2 GB
  • 8K context: ~4 GB
  • 32K context: ~16 GB
  • 128K context: ~64 GB

Most inference backends (llama.cpp, Ollama) default to 2K–4K context. Explicitly setting --ctx-size 32768 or larger will increase VRAM usage significantly. For long-document work, INT8 or INT4 KV cache quantization (available in llama.cpp with --cache-type-k q8_0) cuts this by 50–75%.
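
Cache size scales linearly with context, and cache quantization halves or quarters it. A sketch under an assumed Llama-3-70B-style grouped-query layout (80 layers, 8 key/value heads, head dimension 128 — actual figures vary with a model's attention layout):

```python
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_elem=2.0):
    # 2x for the separate K and V tensors; /1e9 for bytes -> GB
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Assumed 70B-class grouped-query config: 80 layers, 8 KV heads, head dim 128
for ctx in (4096, 8192, 32768, 131072):
    fp16 = kv_cache_gb(80, 8, 128, ctx)        # FP16 cache
    int8 = kv_cache_gb(80, 8, 128, ctx, 1.0)   # q8_0 cache: half the size
    print(f"{ctx:>6} tokens: {fp16:5.1f} GB FP16 | {int8:5.1f} GB INT8")
```

At 128K context this cache alone approaches the size of the 70B Q4_K_M weights in FP16, which is why long-document work leans on KV cache quantization.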


Common Model Sizes Quick Reference

The models people actually run span the range above: 7–8B entry models, 13–14B mid-size models, 27–34B upper-mid models, 70B+ flagships that need dual 4090s or larger, mixture-of-experts models that need 32GB+ at Q4_K_M, and Microsoft's Phi models for strong reasoning at small sizes. For a deeper breakdown of any specific model, see our hardware requirement guides: Qwen 2.5 hardware guide, Llama 70B on Mac, and how much VRAM do you need.
