numactl — Local AI Glossary | CraftRigs

numactl is the standard Linux utility for pinning a process — and its memory allocations — to a specific NUMA node. On multi-socket servers running local LLMs, it's the difference between a model that crawls and one that actually uses the hardware you paid for.

What NUMA Actually Means Here

NUMA (Non-Uniform Memory Access) describes systems where each CPU socket has its own bank of attached RAM; a socket together with its local memory is called a NUMA node. Accessing your own node's memory is fast; reaching across the interconnect to the other socket's memory is significantly slower. On a dual EPYC box, a llama.cpp process that allocates memory on node 0 but runs threads on node 1 can end up spending most of its time waiting on cross-socket traffic instead of doing math.
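Before binding anything, it helps to confirm how many nodes the kernel actually exposes and which CPUs belong to each. numactl --hardware prints this directly; the sketch below reads the same information from sysfs so it also works where numactl is not yet installed. The function names are illustrative, and the script assumes a Linux system that exposes /sys/devices/system/node.

```shell
#!/bin/sh
# Count NUMA nodes by looking for node<N> directories under sysfs.
# The root directory is a parameter (hypothetical convenience) so the
# function can be pointed elsewhere; the default is the real sysfs path.
count_numa_nodes() {
    root="${1:-/sys/devices/system/node}"
    count=0
    for dir in "$root"/node[0-9]*; do
        if [ -d "$dir" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Print the CPU list the kernel reports for each node (node<N>/cpulist).
print_node_cpus() {
    root="${1:-/sys/devices/system/node}"
    for dir in "$root"/node[0-9]*; do
        if [ -f "$dir/cpulist" ]; then
            echo "$(basename "$dir"): cpus $(cat "$dir/cpulist")"
        fi
    done
}

echo "NUMA nodes: $(count_numa_nodes)"
print_node_cpus
```

A count of 1 (or 0 on a system without the sysfs node directory) means there is nothing to bind; a count of 2 or more means the binding described in the next section is worth trying.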

Single-Node Binding for Inference

The fix llama.cpp prints in its NUMA warning is to bind the entire process to one node with numactl --cpunodebind=0 --membind=0. The first flag restricts the process's threads to node 0's CPUs, and the second restricts its memory allocations to node 0's RAM, so threads and weights sit on the same socket and remote memory hops disappear. The catch: you're now limited to that node's system RAM, typically half of total installed memory. On a 256 GB dual-socket box, you get 128 GB to work with. Models that fit see real speedups; models that don't will fail to load or spill to swap. Documented impact on dual EPYC: 6.8 → 14.2 tokens per second (roughly 2.1x) just from adding the binding flags.
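The fit-then-bind decision above can be sketched as a small launch wrapper. Everything specific here is an assumption for illustration: the 40 GiB model size, the 8 GiB headroom for KV cache and the OS, the ./llama-cli binary name, the model path, and the thread count. Only the numactl flags come from the text. The script prints the command rather than running it, so it is safe to try on any machine.

```shell
#!/bin/sh
# Succeed if a model of MODEL_GIB fits in NODE_GIB of node-local RAM while
# leaving HEADROOM_GIB free (default 8, an arbitrary illustrative margin).
fits_on_node() {
    model_gib="$1"
    node_gib="$2"
    headroom_gib="${3:-8}"
    [ $((model_gib + headroom_gib)) -le "$node_gib" ]
}

# Build the bound launch command for a node; it is echoed, not executed.
bound_command() {
    node="$1"
    shift
    echo "numactl --cpunodebind=$node --membind=$node $*"
}

NODE=0
NODE_GIB=128            # per-node RAM on the 256 GB dual-socket example
MODEL_GIB=40            # hypothetical model size
MODEL=./model.gguf      # hypothetical model path

if fits_on_node "$MODEL_GIB" "$NODE_GIB"; then
    # ./llama-cli and -t 32 are placeholders for your binary and core count.
    bound_command "$NODE" ./llama-cli -m "$MODEL" -t 32
else
    echo "model will not fit on node $NODE; binding would fail or swap" >&2
fi
```

Once the printed command looks right, run it directly. Setting the thread count to the number of physical cores on the bound node, rather than the whole machine, keeps threads from oversubscribing the one socket they are allowed to use.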

Why It Matters for Local AI

Most local LLM guides assume a single-socket consumer box where NUMA is irrelevant. The moment you move to used dual-socket Xeon or EPYC hardware — a popular path for cheap high-RAM CPU inference — NUMA becomes the silent bottleneck. Knowing numactl turns a confusing "why is my server slower than my desktop" problem into a one-line launch flag, and it determines whether a 70B model at Q4 is usable or unusable on the rig you just built.