TL;DR — In 2026 local inference splits cleanly into three tiers: a desktop GPU rig for anything above 14B, an Apple Silicon laptop for 30B–70B portable workloads, and a phone NPU for 3B–8B chat-grade models. Pick by model size and power budget. Don't try to make one tier do another's job.
The Three Tiers of Local Inference
Desktop GPU. RTX 3060 12GB to RTX 5090 32GB and beyond. The only tier where 70B Q4 fits comfortably without offload and where vLLM-class throughput is on the table. Wall power, big PSU, no battery.
Apple Silicon. M4 Max and M5 Max with 64–128GB unified memory. The only laptops that run 70B+ models without thermal collapse. Memory bandwidth (400–600 GB/s) is the differentiator, not GPU FLOPS. See Apple Silicon M-series benchmarks for the actual tok/s curve.
Mobile NPU. iPhone 16 Pro (A18 Pro) and Snapdragon 8 Elite Android phones. 7B Q4 models now run at 8–15 tok/s on-device. Power-limited and RAM-limited — but private, offline, and instant. The full hardware-and-app rundown is in our iPhone and Android LLM guide.
When Each Tier Wins
Privacy beats everything? Mobile wins. Nothing leaves the device, even on a plane.
Latency-sensitive coding agent? Desktop GPU. You want >40 tok/s on a 14B–32B coder, and only a discrete GPU delivers that consistently.
Hotel-room workflow, full 70B? Apple Silicon. The Apple Silicon decision guide lays out exactly when an M-series Mac beats a desktop card, and when it loses.
Battery life matters? Mobile or Apple Silicon, never a discrete GPU. A 4090 burns 350W under inference; an M4 Max burns 30–45W for the same 13B prompt.
7B Models Are the New Floor
The reason mobile is suddenly viable: Qwen 2.5 7B, Llama 3.2 3B, and Phi-4 mini are good enough for real chat work. Two years ago, sub-10B models were toys. In 2026 they handle summarization, light coding, and reasoning that used to need a 70B. That shift is what made the VRAM tier ladder so important — 8GB now buys you a working assistant, not a demo.
That same compression unlocks the phone tier. An iPhone 16 Pro with 8GB RAM can hold a Q4 7B and still have headroom for the OS.
Picking Your Stack
llama.cpp — universal. Runs on every tier above. Default for desktop CUDA, default for Vulkan/ROCm, the engine behind most mobile apps.
MLX — Apple-only, fastest on M-series. Use it on Mac. Don't try to port your MLX pipeline anywhere else. The MLX vs llama.cpp vs Ollama comparison shows the gap.
Ollama — best onboarding on desktop and Mac. Wraps llama.cpp. Use it if you don't want to think about flags.
MLC-LLM — mobile-first. Tensor-compiled per device. The right pick if you're shipping an Android or iOS app, not a desktop tool.
The mistake to avoid: trying to run the same stack everywhere. Each tier has a native runtime. Use it.