
Common Local LLM Mistakes: Hardware Buying Guide for Beginners

By Georgia Thomas · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Every week someone shows up on r/LocalLLaMA asking why their new GPU can't run the model they wanted. Usually the answer is one of the same seven mistakes, repeated constantly because the buying guides don't warn you about them.

This is that warning.

Mistake 1: Buying 8GB VRAM Because "It's Enough for Most Games"

Gaming VRAM needs and inference VRAM needs are completely different. In a game, the GPU loads textures and geometry once per scene, then renders repeatedly from what's already loaded. In LLM inference, the GPU loads model weights — the entire model — into VRAM every time you start a session.

An 8GB GPU running Llama 3.1 8B Q4 is basically full. You have about 500MB of headroom for the context window, the framework, and anything else. Longer conversations eat into that headroom fast.

What you can actually run in 8GB: 7B models at Q4 only. That's it. No 13B. No 14B. No upgrades as better models release.

What you can run in 16GB: 7B at Q8 quality, 13B at Q4, 14B at Q4, and 20B+ with aggressive compression. That's a massive jump in capability for what is often only $100–150 more.
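You can sanity-check those claims yourself: weight memory is roughly parameters × bits per weight ÷ 8, plus headroom for the context window and the runtime. Here's a minimal Python sketch (the bits-per-weight averages and the flat 1.5GB overhead allowance are rough assumptions, not measured GGUF sizes):

```python
# Rough fit check: weight memory plus a flat allowance for the KV cache and
# runtime buffers, compared against 8GB and 16GB cards. Bits-per-weight
# averages and the 1.5GB overhead are assumptions for illustration.

QUANT_BITS = {"Q4_K_M": 4.8, "Q8_0": 8.5}
OVERHEAD_GB = 1.5  # context window (KV cache), CUDA context, framework

def fit_check(params_b: int, quant: str, vram_gb: int) -> str:
    needed = params_b * QUANT_BITS[quant] / 8 + OVERHEAD_GB
    verdict = "fits" if needed <= vram_gb else "too big"
    return f"{params_b}B {quant}: ~{needed:.1f} GB -> {verdict} in {vram_gb}GB"

for vram in (8, 16):
    for params, quant in [(8, "Q4_K_M"), (8, "Q8_0"), (14, "Q4_K_M")]:
        print(fit_check(params, quant, vram))
```

Run it and the pattern above falls out: only the 8B Q4 fits in 8GB, while 16GB comfortably takes Q8 and 14B models. Longer contexts push the overhead well past 1.5GB, which is why long conversations squeeze an 8GB card first.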

Caution

8GB is a dead end for local AI. The RTX 4060 (8GB) looks like a bargain next to the RTX 4060 Ti 16GB. It's not. You'll want to upgrade within 6 months as model sizes continue to grow. Spend the extra $150 now.

Mistake 2: Ignoring Memory Bandwidth

This one is subtler and bites people who did their research on VRAM capacity but missed the bandwidth spec.

Memory bandwidth is how fast the GPU can move model weights from VRAM into compute. Since LLM inference is almost entirely bandwidth-bound, this number determines your tokens per second more than CUDA core count, clock speed, or any other spec.

A card with 16GB of slow VRAM can run more models than an 8GB card, but it might generate tokens slower than a 12GB card with a wider memory bus and faster GDDR6X memory.

Real example: the RTX 4060 Ti 16GB has 288 GB/s bandwidth. The RTX 4070 12GB has 504 GB/s. For models that fit in 12GB, the 4070 produces tokens roughly 75% faster despite having less VRAM.
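The reason is that during generation the GPU has to stream essentially the whole set of weights out of VRAM for every token it produces, so the ceiling on tokens per second is roughly bandwidth divided by model size. A quick sketch (the 0.6 efficiency factor is an assumption; real throughput depends on kernels, context length, and batch size):

```python
# Back-of-the-envelope decode-speed ceiling for a bandwidth-bound LLM:
# each generated token reads roughly the full model from VRAM once, so the
# ceiling is bandwidth / model size. The 0.6 efficiency factor is an
# assumption; real numbers vary with kernels and context length.

def rough_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                         efficiency: float = 0.6) -> float:
    return bandwidth_gb_s * efficiency / model_size_gb

model_gb = 8.5  # roughly a 14B model at Q4
for name, bw in [("RTX 4060 Ti 16GB", 288), ("RTX 4070 12GB", 504)]:
    print(f"{name}: ~{rough_tokens_per_sec(bw, model_gb):.0f} tok/s ceiling")
```

Whatever efficiency factor you assume cancels out of the comparison, so the gap between the two cards tracks the bandwidth gap almost exactly.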

Check the bandwidth spec before you buy. Full comparison here.

Mistake 3: Getting 16GB of System RAM

System RAM is not VRAM. They're separate. But system RAM matters too.

Running Ollama or llama.cpp with 16GB of system RAM works fine until it doesn't. The OS, the inference framework, and any CPU offloading all compete for that 16GB. Start a browser alongside your inference session and you're in trouble.

Minimum for a local AI rig: 32GB. Comfortable: 64GB, especially if you're offloading model layers to CPU to handle models larger than your VRAM.

DDR5 is worth it on modern platforms. The bandwidth improvement over DDR4 is meaningful for CPU-offloaded inference — more on this in the DDR5 vs DDR4 guide.
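To put numbers on why 16GB runs out so fast once you start offloading, here's a rough sketch of where system RAM goes; the 8GB baseline for the OS, browser, and inference runtime is an assumption for illustration:

```python
# When a model doesn't fit in VRAM, the offloaded layers sit in system RAM
# alongside the OS, the inference runtime, and whatever else is open.
# The 8GB baseline for "everything else" is an assumption.

def system_ram_in_use_gb(model_gb: float, vram_gb: float,
                         os_and_apps_gb: float = 8.0) -> float:
    offloaded = max(0.0, model_gb - vram_gb)
    return os_and_apps_gb + offloaded

print(f"32B Q4 on a 16GB card: ~{system_ram_in_use_gb(20, 16):.0f} GB in use")
print(f"70B Q4 on a 24GB card: ~{system_ram_in_use_gb(42, 24):.0f} GB in use")
```

The first case technically fits in 16GB with almost nothing to spare; the second needs 32GB before you open a single browser tab. That's the gap between "works fine" and "works until it doesn't."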

Mistake 4: Skipping the NVMe Upgrade

Models don't live in VRAM permanently. Every time you start a session, llama.cpp or Ollama reads the model file from your storage drive and loads it into VRAM. The speed of that process depends entirely on your storage.

Loading a 14B model (roughly 8–9GB on disk) from a budget NVMe drive: 35–45 seconds. From a high-end PCIe 4.0 NVMe: 10–12 seconds. From PCIe 5.0: 7–8 seconds.

That's a 5x speed difference just for startup time, every single session. If you're experimenting with multiple models — switching between a coding model and a chat model, for instance — this adds up to real friction.
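Put numbers on that friction by multiplying the load times above by how often you actually switch models. A quick sketch (ten loads a day is an assumption; the load times are the figures quoted above for a ~14B model):

```python
# How much time per month goes to model loading alone, using the load times
# quoted above for a ~14B model. Ten loads per day is an assumption.

loads_per_day = 10
for drive, seconds in [("budget NVMe", 40), ("PCIe 4.0 NVMe", 11),
                       ("PCIe 5.0 NVMe", 7.5)]:
    minutes_per_month = seconds * loads_per_day * 30 / 60
    print(f"{drive}: ~{minutes_per_month:.0f} min/month waiting on loads")
```

That works out to roughly three hours a month staring at a progress bar on the budget drive, versus under an hour on PCIe 4.0.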

Tip

Minimum storage recommendation: At least PCIe 4.0 NVMe for your model storage drive. Budget $80–120 for a 2TB drive (you'll need the space — models range from 4GB to 40GB+ each). Full breakdown at Best NVMe SSDs for Local LLM Workflows.

Mistake 5: Not Understanding Quantization Before Buying

Beginners often look at a model card that says "Llama 3.1 70B" and think: "70 billion parameters, I need a massive GPU." Then they don't buy anything because they assume local AI is out of reach.

Quantization changes this calculus entirely. The 70B model in Q4_K_M format uses about 42GB of VRAM — not the 140GB the full FP16 model would require. In Q2_K it drops to ~25GB, runnable on two RTX 4090s or a single A100.

And the smaller models — 7B, 14B, 32B — run on very accessible hardware. A 32B model at Q4 fits in 24GB VRAM (single RTX 4090) with room to spare and produces output that rivals older 70B models in many tasks.
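The arithmetic behind those numbers is worth knowing, because it applies to every model card you'll ever read: parameters × bits per weight ÷ 8. A minimal sketch (the bits-per-weight values are approximate GGUF averages, an assumption; real files vary slightly by quant recipe):

```python
# Weight memory scales with bits per weight, which is what quantization
# changes. Bits-per-weight values below are approximate GGUF averages.

QUANTS = {"FP16": 16.0, "Q4_K_M": 4.8, "Q2_K": 2.9}

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    # billions of parameters x bits each / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8

for quant, bits in QUANTS.items():
    print(f"70B {quant}: ~{weights_gb(70, bits):.0f} GB")
print(f"32B Q4_K_M: ~{weights_gb(32, 4.8):.0f} GB")
```

Those figures are weights only; add a couple of gigabytes for the context window and runtime, which is why the 32B Q4 number still leaves headroom on a 24GB card while 70B Q4 won't fit on one.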

Understand the VRAM math before deciding what hardware you need. The VRAM requirements guide has exact numbers for every popular model and quantization level.

Mistake 6: Buying an AMD GPU Without Checking ROCm Support

AMD GPUs work for local LLMs — but not as easily as Nvidia, and not with every tool.

ROCm (AMD's GPU compute stack) has improved significantly but still has gaps. As of early 2026, llama.cpp supports AMD via ROCm on Linux, and Ollama has added AMD support, but you might hit driver issues on specific cards, and Windows ROCm support still lags behind Linux.

Intel Arc GPUs have even spottier support — they work via SYCL or Vulkan backends in llama.cpp, but performance can be inconsistent and setup requires more effort.

If you want the path of least resistance: Nvidia. If you specifically want AMD or Intel for a reason (price, power efficiency, availability), factor in the setup time and the possibility that some tools won't work out of the box.

Note

AMD's situation in 2026: The RX 7900 XTX (24GB) and RX 9070 XT (16GB) both work for local AI on Linux with recent ROCm builds. Windows support has improved but still has rough edges. The NVIDIA vs AMD vs Intel comparison covers the full picture.

Mistake 7: Buying on Specs Alone Without Checking Your Actual Use Case

The most expensive mistake isn't buying the wrong tier of hardware — it's buying the right hardware for the wrong use case.

Someone who wants a local AI coding assistant needs a different setup than someone who wants to run image generation alongside chat models, which is different from someone building a shared team inference server.

Before buying anything, answer:

  • What model sizes do I actually need? (Start with 14B. It's the sweet spot.)
  • Will I be running inference 24/7 or occasionally?
  • Am I the only user or sharing with a team?
  • Do I care about image generation too, or just text?

The answers point you to a very specific hardware profile. Without answering them, you're guessing. And guessing in PC component purchases is expensive.

For most beginners, the right setup is simpler than it sounds: a used RTX 4090 or an RTX 4060 Ti 16GB, 64GB of DDR5 RAM, and a fast NVMe. Start there, run some models, and you'll know within two weeks exactly what you wish you had more of. Then upgrade specifically that thing.
