
How Much VRAM Do You Actually Need? A Model-by-Model Breakdown

By Georgia Thomas · 3 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: VRAM is the single biggest constraint for running local LLMs. The rule is simple: match your VRAM to the model size you want to run. Going under means the model won't load or will run unacceptably slowly. Going over gives you headroom for longer conversations and faster inference.

The one-line rule: 8GB for 7B. 16GB for 13B–20B. 24GB for 32B. Anything bigger, go Mac or multi-GPU.

Why VRAM Is the Constraint

When you run a local LLM, the entire model (or as much of it as possible) needs to fit in GPU memory (VRAM). If it doesn't fit, the runtime has to spill layers to system RAM and shuttle data over PCIe, which is 10–50x slower. That's the difference between 60 t/s and 3 t/s.
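
A rough way to sanity-check fit: the weights take roughly (parameters × bits per weight ÷ 8) bytes, plus a margin for the KV cache and runtime buffers. Here's a back-of-the-envelope estimator in Python; the ~4.8 effective bits for Q4_K_M and the 15% overhead margin are assumptions, not measured figures.

    def estimate_vram_gb(params_billion, bits_per_weight=4.8, overhead=1.15):
        # Weights: 1 billion params is ~1 GB at 8 bits per weight.
        weights_gb = params_billion * bits_per_weight / 8
        # Overhead covers the KV cache, activations, and runtime buffers
        # (assumed ~15%; it grows with context length).
        return weights_gb * overhead

    for size in (7, 13, 22, 34, 70):
        print(f"{size}B at ~Q4: about {estimate_vram_gb(size):.0f} GB")

Run it and the results land within a couple of gigabytes of the per-tier figures later in this article.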

Memory bandwidth — how fast data moves within VRAM — is the #2 constraint. Once the model fits, bandwidth determines how quickly you generate tokens. This is why two GPUs with the same VRAM but different bandwidth produce different speeds.
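
You can put a rough ceiling on generation speed with the same kind of napkin math: each generated token has to stream essentially every weight out of memory once, so bandwidth divided by the model's in-memory size bounds tokens per second. The ~1,000 GB/s figure below is an approximation for an RTX 4090, and ~5 GB is an 8B model at Q4; real-world throughput lands well under this ceiling because compute and framework overhead also cost time.

    def tokens_per_sec_ceiling(bandwidth_gb_s, model_size_gb):
        # Each token reads roughly the whole model from memory once,
        # so this ratio is an upper bound on generation speed.
        return bandwidth_gb_s / model_size_gb

    # RTX 4090 (~1,000 GB/s) with an 8B model at Q4 (~5 GB in memory):
    print(tokens_per_sec_ceiling(1000, 5))  # ~200 t/s ceiling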

Compute power (raw TFLOPS) is a distant third for inference. Most local AI setups are memory-bound, not compute-bound.

VRAM vs. Unified Memory (Apple Silicon)

Apple Silicon (M1/M2/M3/M4) uses unified memory — the GPU and CPU share the same physical memory pool. This means a Mac Mini M4 with 16GB unified memory can load a model that would require a 24GB GPU on a standard desktop.

Why? Because there is no separate VRAM pool to overflow: the GPU can address most of the shared memory directly, so a model that outgrows a discrete card's VRAM never has to shuttle data over PCIe. A 16GB Mac Mini will outperform a discrete 16GB GPU on larger models, and can run models that simply won't fit in a discrete 16GB card.

The trade-off: Apple Silicon is slower per token than a high-bandwidth NVIDIA GPU at the same model size. An RTX 4090 runs Llama 3.1 8B at 127 t/s. A Mac Mini M4 runs the same model at ~30 t/s. But a Mac Mini with enough unified memory (48GB+ on the M4 Pro) can handle 70B-class models that a 24GB NVIDIA card struggles with.

The Breakdown by Model Size

7B models (7 billion parameters)

  • VRAM needed: ~5–6GB at Q4_K_M
  • Comfortable tier: 8GB GPU
  • Examples: Llama 3.1 8B (technically 8B, but it fits this tier), Mistral 7B, Gemma 7B

What this tier handles well: everyday chat, summarization, simple Q&A, light coding. Fast speeds even on budget hardware — the RTX 3060 Ti 8GB does 45+ t/s here.
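
If you want to feel out this tier, a minimal chat call through the Ollama Python client looks like the sketch below. It assumes Ollama is installed, the local server is running, and you've pulled a 7B/8B model first (for example "ollama pull llama3.1:8b", about 5GB at Q4); the prompt is just a placeholder.

    import ollama  # pip install ollama

    response = ollama.chat(
        model="llama3.1:8b",  # any 7B-class model you've pulled works here
        messages=[{"role": "user", "content": "Summarize this paragraph: ..."}],
    )
    print(response["message"]["content"])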

13B models

  • VRAM needed: ~9–10GB at Q4_K_M
  • Comfortable tier: 12–16GB GPU
  • Examples: Llama 2 13B, CodeLlama 13B

Noticeable quality jump over 7B for reasoning and instruction-following. This is where local AI starts feeling like a real tool.

20B–22B models

  • VRAM needed: ~13–15GB at Q4_K_M
  • Comfortable tier: 16GB GPU
  • Examples: Mistral 22B, various 20B fine-tunes

The 16GB tier was basically made for these models. Full quality at Q4. Good speeds on RTX 4060 Ti or RTX 4070 Ti Super.

30B–34B models

  • VRAM needed: ~20–22GB at Q4_K_M
  • Comfortable tier: 24GB GPU
  • Examples: CodeLlama 34B, Yi 34B, various 30B fine-tunes

This is the sweet spot for serious local AI use. On an RTX 3090 or 4090 (24GB), CodeLlama 34B at Q4 runs at ~30 t/s — excellent for coding assistance. Developers running local AI daily often settle here permanently.

70B models

  • VRAM needed: ~45–50GB at Q4_K_M (full quality)
  • Comfortable tier: 2x 24GB GPUs, or Apple Silicon 64GB+ unified memory
  • Examples: Llama 3.1 70B, DeepSeek R1 70B (Llama distill)

At Q2_K (heavily quantized), a single 24GB GPU can run 70B — but quality takes a significant hit. For real 70B work, you need either multi-GPU (expensive) or Mac Pro / Mac Studio with 64–128GB unified memory.
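
For the multi-GPU route, here is a minimal sketch of what a two-card split looks like through the llama-cpp-python bindings, assuming two identical 24GB GPUs and a CUDA build of the library; the GGUF filename is a placeholder, and the even tensor_split is just a starting point.

    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(
        model_path="./llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,          # offload every layer to the GPUs
        tensor_split=[0.5, 0.5],  # spread layers evenly across two 24GB cards
        n_ctx=4096,
    )

    out = llm("Explain KV cache growth in one paragraph.", max_tokens=200)
    print(out["choices"][0]["text"])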

By Use Case

Recommended model size by use case:

  • Everyday chat, summarization, simple Q&A: 7B models
  • Reasoning and instruction-following: 13B–20B
  • Serious coding assistance: 34B models
  • Larger models with headroom to spare: 34B or Mac 48GB+
  • Maximum quality on the hardest tasks: 70B

The Hardware Tiers

8GB GPUs (~$200): RTX 3060 Ti 8GB, RTX 4060 8GB. 7B models only.

12–16GB GPUs ($250–$400): Arc B580, RTX 3060 12GB, RTX 4060 Ti 16GB. 13B–20B comfortably. The RTX 5060 Ti 16GB is also entering this tier, though pricing is volatile. See budget options under $300 or 16GB GPU comparison.

24GB GPUs ($800–$1,600): RTX 3090 (used), RTX 4090. 34B comfortably, 70B with quantization. See the RTX 5090 vs 4090 breakdown.

Apple Silicon ($599–$2,000+): Mac Mini M4 (16GB unified handles 20B+), Mac Mini M4 Pro (48GB handles 70B). See cheapest way to run Llama 3 locally, or how the M4 Max compares to the RTX 4090 for larger models.

Multi-GPU / Workstation ($3,000+): 2x RTX 3090 dual-GPU build, Mac Studio. 70B at full quality. Running 70B on a Mac is the other path.

For the full breakdown of every GPU worth considering for local AI, see the complete GPU comparison guide.
