NPU — Local AI Glossary | CraftRigs

An NPU (neural processing unit) is a fixed-function silicon block designed to run neural network math — mostly low-precision matrix multiplies — at a fraction of the power a GPU would draw. In a local AI rig it sits alongside the CPU and discrete GPU as a third compute path, optimized for always-on inference rather than peak throughput.

Where NPUs Show Up

NPUs ship inside modern laptop SoCs: Intel Core Ultra (Meteor Lake / Lunar Lake), AMD Ryzen AI (XDNA), Qualcomm Snapdragon X (Hexagon), and Apple's Neural Engine on Apple Silicon. They share system RAM — or unified memory on Apple parts — instead of carrying dedicated VRAM. That makes them cheap to include but bandwidth-bound for any model larger than a few billion parameters.

NPU vs GPU for Local LLMs

NPUs are tuned for INT8 and lower-precision workloads measured in TOPS, not the FP16/BF16 throughput an LLM decode loop actually needs. A 40-TOPS laptop NPU can handle wake-word detection, background noise suppression, and small classifier models without spinning up the GPU, but it will lose badly to even a mid-range discrete GPU on tokens per second for a 7B+ model. Software support is still fragmented — DirectML, OpenVINO, Core ML, and vendor SDKs each cover a slice — and most popular runtimes like llama.cpp and Ollama target GPU backends first. Apple's Neural Engine is the partial exception, since MLX and Core ML can route layers to it on M-series chips.

Why It Matters for Local AI

For a CraftRigs-style build, the NPU is rarely the thing running your main LLM — that job goes to VRAM-rich GPU silicon. But it changes the math on always-on assistants, voice pipelines, and background RAG embedding work, where offloading those side tasks to the NPU keeps the GPU free for token generation and drops idle power draw on a laptop or mini-PC. Treat it as a coprocessor for the small models around your big model, not a replacement for one.