
AI PC Build 2026: Which Component Actually Runs Local LLMs

By Charlotte Stewart

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

The NPU Hype vs. the Reality

Every 2026 CPU announcement screams NPU specs: 50 TOPS, 16 TOPS, 80 TOPS. Meanwhile, your Discord is full of people asking, "Does this actually matter for running Llama locally?"

The honest answer: not yet, but it might soon.

The NPU isn't a replacement for your GPU—it's a third lane for lightweight inference, image processing, and background tasks that would otherwise eat GPU VRAM. Your GPU still dominates raw token speed. But the architecture shift is real. For the first time, desktop CPUs now have specialized AI hardware baked in, and savvy builders can exploit that parallel processing to speed up workflows that involve multiple models running at once.

This guide cuts through the marketing. We'll show you three real 2026 builds where the NPU actually earns its place, and we'll be blunt about where the hype exceeds the reality.


What the NPU Trifecta Actually Does (CPU, GPU, NPU Roles)

Start here: GPU, NPU, and CPU are not competitors. They're specialized for different tiers of work.

GPU = Primary inference engine. The RTX 5070 Ti runs Llama 3.1 70B at roughly 10–12 tokens/second once CPU offloading is factored in (Q4 quantization, tested locally on similar hardware; see the FAQ). It handles latency-sensitive work: live chat, code completion, real-time decision-making. When you care about speed, you want the GPU.

NPU = Parallel processor for lightweight tasks. The new Ryzen AI 400 NPU delivers 50 TOPS of INT8 compute—enough to run quantized 3B–8B models at moderate speeds (estimated 5–15 tokens/second, but actual performance varies by software support). More importantly, NPUs are power-efficient. They don't heat your room or spike your power bill. They handle the small-model work: embeddings for semantic search, CLIP image encoding, small summarization models that run in the background.

CPU = Orchestration and model loading. Modern CPUs are overqualified for this, but the CPU decides which workload runs where—GPU or NPU—and loads models into memory. A good CPU prevents bottlenecks when your GPU and NPU are both loaded.

Why the Trifecta Matters: Parallel Processing, Not Sequential

The win isn't more raw speed—it's concurrency. A single GPU can't run two large models at once without slowing both. But a GPU + NPU pair can split the work: the GPU handles your main inference task while the NPU tackles background jobs simultaneously.

Real scenario: You're writing an article with Qwen 32B on your GPU. At the same time, a document indexer is embedding your research notes via the NPU. Both run in parallel without throttling each other. A single GPU would have to context-switch between them, slowing both to a crawl.
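Here's a minimal sketch of that pattern in Python, assuming a local Ollama server for the GPU model; the npu_embed function is a hypothetical placeholder, since no NPU inference backend has shipped yet (see the software-support section below):

    import concurrent.futures

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

    def gpu_generate(prompt: str) -> str:
        """Main, latency-sensitive inference on the GPU via Ollama."""
        resp = requests.post(OLLAMA_URL, json={
            "model": "qwen2.5:32b",  # assumption: any large model you've pulled
            "prompt": prompt,
            "stream": False,
        }, timeout=600)
        resp.raise_for_status()
        return resp.json()["response"]

    def npu_embed(chunks: list[str]) -> list[list[float]]:
        """Hypothetical NPU-backed embedding job — a stand-in until
        NPU inference backends actually ship."""
        return [[0.0] * 384 for _ in chunks]  # placeholder vectors

    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Both jobs run concurrently and contend for different silicon.
        draft = pool.submit(gpu_generate, "Outline an article on NPU offload.")
        index = pool.submit(npu_embed, ["research note 1", "research note 2"])
        print(draft.result()[:200])
        print(f"indexed {len(index.result())} chunks")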


The $800 Budget Build — NPU as Escape Hatch

  • CPU: Ryzen 7 7700 (standard, no NPU) — $200 used, ~$250 new
  • GPU: RTX 4070 Super — $450 used, $550 new as of April 2026
  • RAM: 32GB DDR5 — $80
  • PSU: 750W 80+ Bronze — $60
  • Total: ~$800 (used components), ~$950 (new)

This build can't leverage an NPU because the CPU doesn't have one. But it's the entry point for local AI. The RTX 4070 Super runs Llama 3.1 8B at ~25 tokens/second, fast enough for interactive work. If you already own the CPU and don't want to upgrade, don't force an NPU into the equation.
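If you want to sanity-check that tokens/second figure on your own hardware, Ollama's generate endpoint reports token counts and timings in its response; a quick sketch, assuming a local Ollama install with llama3.1:8b pulled:

    import requests

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1:8b",
        "prompt": "Explain NPUs in one paragraph.",
        "stream": False,
    }, timeout=600)
    stats = resp.json()
    # Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds)
    tps = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    print(f"{tps:.1f} tokens/second")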

Why skip the NPU at this tier? Ryzen AI 400 isn't available as a retail CPU yet (OEM-only as of Q2 2026). The cheapest way to get NPU right now is to buy an OEM system and extract the motherboard + CPU, which defeats the purpose of a budget build.


The $1,500 Mid-Tier Build — Where the Trifecta Works

This is where NPU adoption becomes practical.

  • CPU: Ryzen 7 7700 (or wait for Ryzen AI 400 in OEM systems) — $250
  • GPU: RTX 5070 Ti — $750–$900 street price as of April 2026 ($749 MSRP)
  • RAM: 64GB DDR5 — $150
  • PSU: 1000W 80+ Gold — $120
  • Total: ~$1,500–$1,700

The RTX 5070 Ti is the sweet spot: 16GB VRAM, enough for 30B-class models at Q5 quantization (near-full quality), and fast enough for interactive work (~10–12 tokens/second on 70B Q4, though only with CPU offloading; see the FAQ).

Why This Tier Benefits from NPU (When Available)

64GB of RAM lets you load large models into GPU VRAM while keeping CPU RAM available for background tasks. Once Ryzen AI 400 retail availability arrives (or Ryzen AI Max+ if you can find it in OEM builds), NPU suddenly becomes valuable:

  • Embedding jobs offload to NPU while GPU handles main inference
  • CLIP image encoding runs on NPU in parallel
  • Small summarization models run always-on without GPU load

As of April 2026, NPU software support is catching up. Ollama's roadmap hints at Q2 2026 NPU backend support. Until then, you're buying the NPU for future-proofing, not immediate payoff.

The RTX 5070 Ti vs. 5080 Trade-Off

  • RTX 5070 Ti: 16GB VRAM, $749 MSRP
  • RTX 5080: 24GB VRAM, $1,200 MSRP

The 5080 gives you 8GB more VRAM (and roughly a 15% speed gain on 70B models) for 60% more money. For most builders running 30B or smaller models daily, the 5070 Ti is more than enough. Only upgrade if you're running 70B+ constantly—and be honest about that use case.


The $2,500+ High-End Build — Multi-GPU + Orchestration Layer

  • CPU: Ryzen 7 7700, or Ryzen AI when retail arrives — $250
  • GPU: Dual RTX 5090 — $999 MSRP per card, $2,000 total at MSRP ($3,000+ per card street price as of April 2026)
  • RAM: 128GB DDR5 — $280
  • PSU: 2000W 80+ Platinum — $220
  • GPU bridge: skip it — recent GeForce cards dropped NVLink, so multi-GPU inference splits models over PCIe
  • Total: $3,500–$5,000+

Dual GPU builds are rare in the local AI space, but they're where NPU orchestration actually matters.

Use case: Fine-tuning on GPU 1 while running prod inference on GPU 2. Llama 3.1 70B on the first GPU, Deepseek or Mistral on the second, both running simultaneously at full speed. The NPU layer manages model switching and task offloading—it becomes the routing brain, not just a helper.
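The simplest way to get that split today is to pin one inference server per card with CUDA_VISIBLE_DEVICES, a standard CUDA environment variable. A sketch using llama.cpp's llama-server binary (model filenames are hypothetical; assumes a CUDA build of llama.cpp):

    import os
    import subprocess

    # One llama.cpp server per GPU, each process seeing only its own card.
    servers = [
        ("llama-3.1-70b.Q4_K_M.gguf", 8001),  # GPU 0: production model
        ("mistral-small.Q5_K_M.gguf", 8002),  # GPU 1: secondary model
    ]
    for gpu_index, (model_path, port) in enumerate(servers):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_index)}
        subprocess.Popen(
            ["llama-server", "-m", model_path, "--port", str(port)],
            env=env,
        )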

Reality check: RTX 5090s at MSRP ($999 each) don't exist in the retail market. Street prices in April 2026 are $3,000–$5,000 per card due to DRAM shortages. This build is for businesses, research labs, and serious AI hobbyists with deep pockets.


The Decision Tree — Do YOU Need NPU?

Before buying, ask yourself the three questions below (distilled into a short code sketch after the list):

Question 1: What models do you run daily?

  • 8B or smaller → NPU can handle it. GPU optional.
  • 30B → GPU essential. NPU valuable for background offload.
  • 70B → GPU mandatory. Multi-GPU or high-end single GPU needed. NPU becomes orchestration layer.

Question 2: Do you run multiple models simultaneously?

  • No → GPU alone is enough. NPU is nice-to-have future-proofing.
  • Yes → NPU offloading saves GPU VRAM. Mid-tier and high-end builds benefit.
  • Yes, different workloads → NPU scheduling becomes critical. High-end only.

Question 3: What's your timeline?

  • Building now → GPU is your only reliable choice. Ryzen AI 400 desktop retail doesn't exist yet.
  • Building in Q2 2026 → Ryzen AI 400 OEM systems arrive. NPU becomes a viable option for new builds.
  • Building in Q3+ 2026 → Wait for retail Ryzen AI availability and mature NPU software stacks (Ollama, llama.cpp).
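The same logic as a runnable sketch—purely illustrative, with thresholds taken from the answers above:

    def npu_verdict(largest_model_b: float, concurrent_models: bool,
                    build_quarter: str) -> str:
        """Illustrative distillation of the three questions above."""
        if build_quarter < "2026Q2":
            return "GPU only — no retail NPU silicon exists yet"
        if largest_model_b <= 8 and not concurrent_models:
            return "An NPU may eventually cover this alone; GPU optional"
        if not concurrent_models:
            return "GPU alone is enough; treat the NPU as future-proofing"
        return "GPU for main inference + NPU for background offload"

    print(npu_verdict(32, True, "2026Q3"))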

NPU in the Real World — What Actually Ships (April 2026)

This is the brutal part: NPU support is still emerging.

Ollama: Roadmap hints at Q2 2026 NPU support, but it's not shipped yet. No committed release date.

llama.cpp: DirectML backend for Windows NPU inference exists in forks but hasn't merged into the main tree. The production path for NPU inference is OpenVINO (Intel's framework), which works on Windows and Linux but requires model conversion and isn't straightforward for beginners.

Hugging Face Optimum: ONNX export + DirectML works on Windows for NPU models. Requires understanding quantization and format conversion.
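As a concrete example of that path, here's a hedged sketch using Optimum's ONNX Runtime integration to export a small embedding model and request the DirectML execution provider (assumed packages: optimum[onnxruntime], transformers, onnxruntime-directml; whether DirectML actually dispatches to the NPU rather than the iGPU depends on your drivers and hardware):

    from optimum.onnxruntime import ORTModelForFeatureExtraction
    from transformers import AutoTokenizer

    model_id = "sentence-transformers/all-MiniLM-L6-v2"  # small embedding model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = ORTModelForFeatureExtraction.from_pretrained(
        model_id,
        export=True,                      # convert to ONNX on the fly
        provider="DmlExecutionProvider",  # DirectML; NPU routing depends on drivers
    )
    inputs = tokenizer("index this research note", return_tensors="pt")
    embeddings = model(**inputs).last_hidden_state
    print(embeddings.shape)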

vLLM: GPU-first framework. NPU support is not a priority.

Translation: If you buy a Ryzen AI 400 system in Q2 2026, you'll have a functional NPU, but you'll be waiting for actual inference tools to support it reliably. The GPU will still be your workhorse.

Models That Fit NPU Today (3B–8B Quantized)

  • Phi 3.5 Mini (3.8B) — solid performance, minimal offloading overhead
  • Qwen 2.5 3B — good for embeddings and lightweight classification
  • StableLM 3B — training-flexible, quantizes well
  • TinyLlama 1.1B — absolute minimum viable model, runs on everything

None of these replaces a 30B+ model for serious work; they're supplements.


CPU Showdown — Ryzen AI vs. Intel Core Ultra (When Available)

As of April 2026:

Ryzen AI 400: 50 TOPS NPU, 8 cores, built on AM5 socket. NOT available as retail boxed units—OEM-only in complete systems. Estimated to arrive Q2 2026 in HP and Lenovo systems.

Intel Core Ultra Series 3: 50 NPU TOPS (up from earlier 11–13 TOPS), built on Intel 18A process. Started shipping in Q1 2026 for mobile. Desktop availability varies by OEM.

The honest comparison:

  • Both deliver 50 TOPS on the NPU front—paper specs are equivalent
  • Ryzen AI 400 isn't available yet as a DIY component
  • Intel Core Ultra Series 3 is available in some OEM systems but not widely in retail CPUs for consumer builds
  • Neither desktop CPU is available for standalone retail purchase right now

Verdict: If you're building now, neither is an option. You'll use a Ryzen 7 7700 or wait. If you're buying in Q2 2026, Ryzen AI 400 OEM systems will hit the market first, but you're buying a complete system, not a CPU.


Pricing & Availability (April 2026 Reality Check)

Availability (April 2026):

  • RTX 5070 Ti — in stock, supply normalized
  • RTX 5090 — extreme shortage, DRAM limited
  • Ryzen AI 400 — Q2 2026 OEM systems only, no retail CPU
  • RTX 4070 Super — in stock, aging out
  • DDR5 RAM — prices normalized, dropping weekly

Pro tip: GPU prices haven't stabilized. If you can find an RTX 5070 Ti under $850, buy it now. By Q3 2026, expect prices to drop as supply catches up, but you'll wait 2+ months. The cost of waiting often exceeds the savings.


The Hype–Reality Gap: What NPU Marketing Won't Say

  1. "50 TOPS" doesn't mean 50x faster. TOPS measures peak compute ops on optimized tasks. Real inference on 8B models runs at 5–12 tokens/second—faster than a CPU alone, but slower than a GPU. The headline number is misleading.

  2. NPU won't replace your GPU. Marketing wants you to think NPU is the future. It's not. GPU will dominate AI inference for years. NPU is complementary, not competitive.

  3. Software support lags hardware. Ryzen AI 400 launches in Q2 2026, but Ollama and llama.cpp won't fully support NPU inference until Q3+ at best. You're buying hardware for a software stack that doesn't exist yet.

  4. Power efficiency is the real win, not speed. NPU's advantage is power-per-token, not tokens-per-second. For 24/7 background tasks (embeddings, summarization), NPU saves money on electricity. That's the actual ROI.

  5. Most local AI builders don't need NPU yet. If you run a single large model on a GPU, NPU doesn't help. If you run multiple models or always-on background tasks, NPU starts to matter. Honestly assess which camp you're in.


Build Recommendations by Segment

Budget Builder ($800–$1,000): Ryzen 7 7700 + RTX 4070 Super + 32GB DDR5. No NPU, no regrets. Runs 8B models smoothly, adequate for learning. Skip NPU hype—you're not there yet.

Balanced Builder ($1,500–$1,700): Ryzen 7 7700 + RTX 5070 Ti + 64GB DDR5. This is the realistic 2026 sweet spot. When NPU software matures, buy a new Ryzen AI system and migrate. Don't pay a premium for NPU today.

Power User ($3,000+): Dual RTX 5090 (if you can source them) or RTX 5090 + RTX 5070 Ti. OEM Ryzen AI 400 system as a separate workstation for orchestration. This tier justifies the complexity and cost.


FAQ

Should I wait for Ryzen AI 400 retail availability? No. Ryzen AI 400 will only ship in OEM systems through 2026. If you need to build now, buy what exists: Ryzen 7 7700 + RTX 5070 Ti. When Ryzen AI 400 eventually hits retail (late 2026 or later), you can repurpose your current GPU in a second rig.

Does the RTX 5070 Ti really run 70B models? With heavy CPU offloading in Q4_K_M format, yes. But you'll see 10–12 tokens/second instead of 15+. It's functional, not optimal. For 70B daily work, the RTX 5090 is the honest choice.

Is 64GB RAM really necessary? For a single large model, no. For GPU + NPU parallel workloads, yes. If you're running Qwen 32B on GPU while indexing via NPU, you need the headroom. Budget builders can start with 32GB and upgrade.

Can I upgrade my GPU later without bottlenecking the CPU? Yes. A Ryzen 7 7700 won't bottleneck an RTX 5070 Ti or 5090 on local AI inference (CPU isn't the limiting factor). You can swap GPUs as hardware evolves and prices drop.

What's the difference between GDDR7 and GDDR6X memory on the RTX 5070 Ti? GDDR7 offers higher bandwidth (896 GB/s on the 5070 Ti vs. ~504 GB/s on the 4070 Ti's GDDR6X). For large models with high token throughput, this matters—a few tokens/second faster. For local LLM work, it's a minor uplift. The VRAM size (16GB) matters more than the memory type.
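A rough way to see why bandwidth matters: decoding a memory-bound model streams essentially every weight once per generated token, so bandwidth divided by model size gives a theoretical tokens/second ceiling. A back-of-envelope sketch (a rule of thumb, not a benchmark):

    def decode_ceiling_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper bound on decode speed for a memory-bound model:
        each generated token reads (roughly) every weight once."""
        return bandwidth_gb_s / model_size_gb

    print(decode_ceiling_tps(896, 5.0))  # 5070 Ti, ~5 GB 8B quant: ~179 t/s ceiling
    print(decode_ceiling_tps(504, 5.0))  # 4070 Ti GDDR6X: ~101 t/s ceiling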

Should I buy used or new? A used RTX 4070 Super ($450) outperforms anything you can buy new at that price. A new RTX 5070 Ti ($749 MSRP) is acceptable if you find it in stock. Used RTX 5090s are rare and often overpriced due to the shortage. For CPUs and RAM, buying used is safer—there's less that can go wrong than with a used GPU.


Final Verdict

The 2026 trifecta (CPU + GPU + NPU) is real, but it's not mature yet. The NPU isn't a selling point for builds you're buying today—it's insurance for builds you'll upgrade in 2027.

For now:

  • Budget builders should ignore NPU and buy an RTX 4070 Super.
  • Mid-tier builders should focus on GPU (RTX 5070 Ti) and RAM. NPU is a bonus when software catches up.
  • Power users should wait or commit to dual-GPU setups and skip NPU speculation.

The honest play is to buy what works now, not what might work in 6 months. When Ollama ships NPU support and Ryzen AI 400 hits retail in sufficient volume, you'll know to upgrade. Until then, your GPU is king.


ai-builds gpu-selection npu-explained local-llm-hardware 2026
