
The Local AI Hardware Decision Framework: Pick the Right Rig

By Georgia Thomas

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: Most people overbuy or underbuy local AI hardware because they optimize for the wrong spec. Answer five questions in order — budget, model size target, speed needs, form factor, and upgrade preference — and the right hardware becomes obvious. The spec that determines everything else is VRAM (or unified memory on Apple Silicon). Get that right and the rest follows.


The Five Questions

Ask these in order. Each answer eliminates options and narrows the field. By question five, you have a clear recommendation.


Question 1: What's your actual budget?

Not "what would I spend if I really wanted this" — your real number, including whatever you'll need for the rest of the system if you're building from scratch.

Under $600 total: You're in budget-GPU-plus-basic-PC territory, or the base Mac Mini. See the budget tier guide for specifics by price.

$600–$1,200: Mid-range GPU + solid PC build, or a Mac Mini M4 in various configurations. Real capability is available here.

$1,200–$2,500: High-end GPU territory or Mac Mini M4 Pro with 48GB unified memory.

$2,500+: Dual GPU builds, Mac Studio, or workstation hardware. Only makes sense with specific large-model needs.

Gotcha to avoid: Don't budget just for the GPU and forget everything else. A full PC build (CPU, motherboard, RAM, PSU, storage, case) adds $600–$900 on top of the GPU. A Mac's sticker price is the complete system cost. If you're comparing a $1,000 GPU to a $1,799 Mac, the Mac may actually be cheaper once you factor in the rest of the PC build.


Question 2: What model size do you actually want to run?

This is the most important question. The wrong answer here causes more wasted money than anything else.

7B–13B models: Most casual use — chat, summarization, drafting, Q&A. 8–12GB VRAM handles these well. Any budget GPU from $200–$300.

20B–22B models: Noticeably better than 13B for reasoning, instruction following, long-form generation. Need 16GB VRAM. RTX 4060 Ti 16GB or equivalent.

30B–34B models: Where local AI starts rivaling cloud quality for many tasks. Strong coding, complex reasoning, nuanced outputs. Need 24GB VRAM or 48GB unified memory. RTX 3090/4090 or Mac Mini M4 Pro 48GB.

70B models: Near-frontier quality. Meaningful for research, complex analysis, best-quality code generation. Need 48GB of VRAM (dual GPU) or unified memory (Apple Silicon) to run at Q4 quantization. For PC: dual RTX 3090 NVLink build (~$2,500). For Mac: Mac Mini M4 Pro 48GB ($1,799).
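A back-of-envelope way to sanity-check these tiers: estimate memory from parameter count and quantization. The sketch below assumes roughly 0.55 bytes per parameter at Q4 (including quantization metadata) and a 1.2x overhead factor for KV cache and runtime buffers; both figures are approximations that vary with context length and runtime.

```python
# Back-of-envelope VRAM estimate for a quantized model. Bytes-per-
# parameter values are approximate for common GGUF quantizations; the
# 1.2x overhead factor (KV cache, runtime buffers) is an assumption
# and grows with context length.

BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8": 1.0,    # ~8 bits per weight
    "q4": 0.55,   # ~4.5 bits per weight incl. quantization metadata
}

def estimated_vram_gb(params_billions: float, quant: str = "q4",
                      overhead: float = 1.2) -> float:
    """Approximate memory needed to load and run a model, in GB."""
    return params_billions * BYTES_PER_PARAM[quant] * overhead

for size in (7, 13, 20, 34, 70):
    print(f"{size}B @ Q4 ≈ {estimated_vram_gb(size):.1f} GB")

# 7B  ≈  4.6 GB -> fits an 8GB card
# 13B ≈  8.6 GB -> fits a 12GB card
# 20B ≈ 13.2 GB -> wants 16GB
# 34B ≈ 22.4 GB -> wants 24GB
# 70B ≈ 46.2 GB -> wants 48GB (dual GPU or unified memory)
```

The outputs line up with the tiers above, which is the point: the jump from one VRAM class to the next is really a jump in the model sizes you can load.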

If you don't know what model size you want: Start with 7B and a budget GPU. Run it. If you find yourself wishing for better quality, that tells you the upgrade case. Don't buy 70B capability speculatively: it's expensive, and the quality gap over 34B, while real, is incremental rather than transformative.

The trap: Buying hardware for a model size you think you want but probably don't need. 7B models in 2026 are dramatically better than 7B models from two years ago. Llama 3.1 8B handles a surprising range of tasks. Be honest about whether you actually need 34B or just want it.


Question 3: How fast do you need inference to be?

This question matters less than most people think — but it matters for some use cases.

Conversational use (chat, Q&A): 15–25 t/s is comfortable. This is achievable at every tier above budget CPU-only.

Daily productivity (drafting, summarization, coding assistant): 25–50 t/s is ideal. Achievable with any mid-range or better GPU, or Apple Silicon at 7B–20B model sizes.

Automated workflows / batch processing: Here speed becomes important. If you're running hundreds of inference calls per hour, 100+ t/s makes a meaningful difference. This pushes you toward high-end NVIDIA GPUs (RTX 3090/4090 at 7B–34B) rather than Apple Silicon.
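A quick way to see when speed becomes the constraint: multiply calls per hour by average output length and check whether your hardware's generation rate keeps up. The workload figures below (500 calls per hour, 400 output tokens each) are illustrative assumptions, not benchmarks.

```python
# Can a given tokens/sec rate keep up with a batch workload?
# Workload numbers here are illustrative assumptions.

calls_per_hour = 500
tokens_per_call = 400  # average output length
tokens_needed = calls_per_hour * tokens_per_call  # 200,000 tokens/hour

for rate in (25, 50, 100):  # tokens/sec
    hours = tokens_needed / (rate * 3600)
    status = "keeps up" if hours <= 1.0 else "falls behind"
    print(f"{rate:>3} t/s -> {hours:.1f}h of generation per hour of work ({status})")

# 25 t/s  -> 2.2h of generation per hour of work (falls behind)
# 50 t/s  -> 1.1h of generation per hour of work (falls behind)
# 100 t/s -> 0.6h of generation per hour of work (keeps up)
```

For conversational use, none of this matters; you read slower than any of these rates. For pipelines, it's the difference between a backlog and a finished queue.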

Speed truth: At matched model size, NVIDIA wins on speed. An RTX 3090 (936 GB/s bandwidth) generates tokens 3–5x faster than a Mac Mini M4 Pro (273 GB/s) running the same model. But that comparison is only relevant if both can actually run the model. If you need 70B models and the GPU can't fit them, its speed is irrelevant.

The trade-off: Apple Silicon is slower per token but can load larger models. NVIDIA is faster per token but can't run models above its VRAM ceiling. If your priority is running 70B models, accept that Apple Silicon will be slower: 11 t/s at 70B beats the 4 t/s you get when a GPU overflows its VRAM into system RAM.

If speed is critical and you also need large models: The dual RTX 3090 NVLink build runs 70B at ~18 t/s — the fastest consumer hardware path to that model size.


Question 4: What's your form factor and environment?

Often skipped, frequently regretted.

Home office with a dedicated workstation: PC tower is fine. You'll hear the fans, the case takes up space, the power draw is real — but for dedicated use, none of that is disqualifying.

Shared space, quiet environment, bedroom: A high-end GPU under sustained inference load is loud; an RTX 3090 at full tilt is not bedroom-quiet. If noise is a concern, Apple Silicon is the answer: a Mac Mini M4 under sustained load is nearly silent.

Portability: If you need inference on a laptop, you're constrained to whatever GPU is in the laptop (usually 8–16GB VRAM in high-end gaming laptops) or Apple Silicon MacBooks. Desktop GPUs don't apply.

24/7 always-on inference server: Power consumption compounds over time. An RTX 3090 at 350W running 8 hours a day costs ~$15–$20/month in electricity. An M4 Mac Mini at 50W running 8 hours a day costs ~$2/month. Over 3 years, that's a $500+ electricity difference. Factor it in.
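Here's the arithmetic behind those figures as a small sketch. The $0.20/kWh rate is an assumption; substitute your local rate.

```python
# Electricity cost of an always-on inference box. The $0.20/kWh rate
# is an assumption; substitute your local rate.

def electricity_cost(watts: float, hours_per_day: float = 8,
                     rate_per_kwh: float = 0.20) -> tuple[float, float]:
    """Return (dollars per month, dollars over 3 years)."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    monthly = kwh_per_month * rate_per_kwh
    return monthly, monthly * 36

for name, watts in (("RTX 3090 under load", 350),
                    ("M4 Mac Mini under load", 50)):
    monthly, three_yr = electricity_cost(watts)
    print(f"{name}: ${monthly:.2f}/month, ${three_yr:.0f} over 3 years")

# RTX 3090 under load:    $16.80/month, $605 over 3 years
# M4 Mac Mini under load:  $2.40/month,  $86 over 3 years
# Difference over 3 years: ~$518, in line with the $500+ figure above.
```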

Desk space: A Mac Mini is 5 inches square. A full PC tower plus monitor setup is substantially larger. Obvious, but worth naming.


Question 5: Do you plan to upgrade as models improve?

Models are improving rapidly. Hardware decisions should account for where you expect to be in 18–24 months, not just today.

Yes, I'll want to upgrade: Buy an NVIDIA GPU in a PC. GPU swaps are straightforward: keep the CPU, board, RAM, and PSU, and swap the card. A $300 GPU today can be replaced with a $400 GPU next year without replacing the system.

No, I want one device that works for years: Buy Apple Silicon. The Mac Mini M4 Pro 48GB will run 70B models today and will still run (presumably better-quantized) large models in 3 years. The specs are fixed, but Apple Silicon's software stack has consistently improved model performance over time: an M1 Mac runs models faster today than when llama.cpp and Ollama first supported it, purely from software optimization.

I want to scale beyond one GPU: Buy NVIDIA and plan for the dual-GPU path. The RTX 3090 with NVLink support is the key here — it's the only consumer card that lets you pool VRAM between two cards at speed. One RTX 3090 is a reasonable single-GPU purchase. A second one, plus an NVLink bridge, is a clear upgrade path.


Common Buying Mistakes

Mistake 1: Buying a 16GB card when you need 24GB

The 16GB-to-24GB jump is where model access opens up significantly. 16GB maxes out around 20B models; 24GB handles 34B cleanly. If you're about to spend $400 and would happily run 34B models, you're roughly $250 away from a used 24GB RTX 3090. The extra spend is worth it for the model access.

Mistake 2: Chasing TFLOPS instead of bandwidth

Compute throughput (TFLOPS) is largely irrelevant for local LLM inference. The bottleneck is memory bandwidth: how fast the GPU can move model weights through memory. An RTX 3090 (936 GB/s) outruns an RTX 4080 Super (736 GB/s) at this workload despite the newer card's higher compute throughput. Compare GB/s, not TFLOPS.
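One way to make the comparison concrete is bandwidth per dollar. A minimal sketch, with rough street-price assumptions (not quotes) for illustration:

```python
# Memory bandwidth per dollar: a better value metric than TFLOPS for
# local LLM inference. Prices are rough street-price assumptions.

cards = {
    "RTX 3090 (used)":  {"gbps": 936,  "vram_gb": 24, "price": 700},
    "RTX 4060 Ti 16GB": {"gbps": 288,  "vram_gb": 16, "price": 450},
    "RTX 4090":         {"gbps": 1008, "vram_gb": 24, "price": 1600},
}

for name, c in cards.items():
    print(f"{name}: {c['gbps'] / c['price']:.2f} GB/s per dollar, "
          f"{c['vram_gb']}GB VRAM")

# RTX 3090 (used):  1.34 GB/s per dollar, 24GB
# RTX 4060 Ti 16GB: 0.64 GB/s per dollar, 16GB
# RTX 4090:         0.63 GB/s per dollar, 24GB
```

On this metric the used 3090 roughly doubles everything else, which is why it keeps coming up in this guide.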

Mistake 3: Buying based on gaming benchmarks

A GPU that wins at 4K gaming benchmarks is not necessarily the best LLM inference card. Games are compute-bound. LLM inference is memory-bandwidth-bound. The spec that matters for gaming (shader FLOPS, compute units) is mostly irrelevant for inference. Look at memory bandwidth and VRAM.

Mistake 4: Ignoring the used market on NVIDIA

The RTX 3090 used market is the best value in local AI hardware right now. $600–$800 for 24GB and 936 GB/s bandwidth. These cards are plentiful, well-tested, and CUDA-supported. The only downsides are age (Ampere generation, no Ada features) and TDP (350W). Both are manageable. Don't reflexively avoid used hardware for this workload.

Mistake 5: Buying too much VRAM for your actual model targets

If you run 7B and 13B models and are happy with them, there's no reason to buy 24GB of VRAM. The money is better spent on a faster card with 16GB than an older card with 24GB at the same price. Buy for your actual model targets, not a hypothetical future where you want larger models.


The Spec That Matters Most

It's memory bandwidth. Not VRAM, not TFLOPS, not ray tracing cores — memory bandwidth.

Here's why: LLM inference works by loading model weights from memory, doing math on them, and moving to the next layer. The weights are large. The math is fast. The bottleneck is almost always how fast you can move data from VRAM to the compute units.

Higher bandwidth = more tokens per second. This is why:

  • RTX 3090 (936 GB/s) is faster than RTX 4060 Ti (288 GB/s) at the same model size
  • M4 Max (410 GB/s) is faster than M4 Pro (273 GB/s) for local inference
  • RTX 5090 (1,792 GB/s) is the fastest consumer single GPU for local inference

VRAM is still the gating factor — bandwidth doesn't matter if the model doesn't fit. But once the model fits, bandwidth determines speed.
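A useful mental model: each generated token requires reading roughly every weight once, so the theoretical tokens-per-second ceiling is bandwidth divided by model size in memory. Real-world throughput lands well below the ceiling, but the ratios between devices hold up. A minimal sketch, using the bandwidth figures above and an assumed ~9GB footprint for a 13B model at Q4 (small enough to fit every device listed):

```python
# Theoretical generation-speed ceiling for memory-bandwidth-bound
# inference: every token reads ~all model weights once, so the ceiling
# is bandwidth / model size. Real throughput is lower, but the ratios
# between devices are a good predictor.

MODEL_GB = 9.0  # assumed footprint of a 13B model at Q4

for name, bandwidth_gbps in (("RTX 4060 Ti", 288), ("M4 Pro", 273),
                             ("M4 Max", 410), ("RTX 3090", 936),
                             ("RTX 5090", 1792)):
    print(f"{name}: ~{bandwidth_gbps / MODEL_GB:.0f} t/s ceiling")

# RTX 4060 Ti ~32, M4 Pro ~30, M4 Max ~46, RTX 3090 ~104, RTX 5090 ~199
# The 3090 / M4 Pro ratio is ~3.4x, matching the 3-5x real-world gap
# cited in Question 3.
```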

When comparing two cards with the same VRAM, pick the one with higher bandwidth.


Quick Decision Table by Use Case

[Table: recommended hardware by use case with estimated budgets, ranging from $200 (GPU only) to $2,799 (complete build); the use-case column did not survive. Questions 1 and 2 above cover the same ground: budget tier plus model-size target.]

