nvidia-smi
Command-line utility bundled with NVIDIA drivers that reports GPU utilization, VRAM usage, temperature, and power draw in real time.
nvidia-smi (NVIDIA System Management Interface) is the default diagnostic tool for monitoring NVIDIA GPUs on Linux and Windows. For local AI builders, it's the first command you run to check whether your model is actually living on the card or quietly spilling to system RAM.
What It Reports
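Run it bare and you get a snapshot: driver version, the highest CUDA version the driver supports (not necessarily the toolkit you have installed), per-GPU utilization percentage, VRAM used vs. total, temperature, fan speed, power draw, and the PIDs of processes holding GPU memory. In multi-GPU rigs, it numbers cards (GPU 0, GPU 1) in PCI bus order. CUDA applications such as llama.cpp enumerate devices fastest-first by default, so the two numberings can differ on mixed-card systems; set CUDA_DEVICE_ORDER=PCI_BUS_ID to make them match, and remember that CUDA_VISIBLE_DEVICES renumbers whatever it exposes. With that in place, nvidia-smi is the canonical way to confirm which physical card is which before configuring --tensor-split.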
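The same fields can be pulled in machine-readable form with nvidia-smi's --query-gpu and --format=csv options. The following is a minimal Python sketch, not an official wrapper: the function names parse_gpu_csv and query_gpus are hypothetical, and the field list is just one reasonable selection from the names listed by nvidia-smi --help-query-gpu.

```python
import csv
import io
import subprocess

# Query field names as listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["index", "name", "utilization.gpu", "memory.used",
          "memory.total", "temperature.gpu", "power.draw"]


def parse_gpu_csv(text):
    """Parse --format=csv,noheader,nounits output into one dict per GPU.

    Values are kept as strings because some fields can come back as
    "[N/A]" on cards that do not expose them.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        rows.append(dict(zip(FIELDS, (cell.strip() for cell in record))))
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

On a machine with the driver installed, calling query_gpus() returns a list you can log or compare across runs; with nounits, memory values are in MiB and power in watts.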
Limits and Blind Spots
A bare run is a single snapshot, and the usual monitoring loops (watch -n 1 nvidia-smi, or nvidia-smi -l 1) refresh about once per second. That is fine for steady-state monitoring but misses transient behavior that matters for local LLMs. A 4090 can spike 150 W above its sustained draw for 50 milliseconds during prefill, and one-second sampling will almost never show it; your tripped breaker will. nvidia-smi also can't see PSU conversion losses (8-12% even on 80 Plus Gold), so wall-meter readings always exceed what it reports. Worse for inference debugging: a card showing 40% utilization while tokens-per-second collapses usually means silent CPU fallback. The model exceeded VRAM, layers were offloaded to system RAM, and the GPU is now waiting on PCIe transfers instead of computing.
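To get a feel for the gap between a reported GPU figure and a wall-meter reading, here is a rough arithmetic sketch. It assumes the 8-12% loss is a fraction of wall-side input power (so efficiency is 1 minus the loss), and it covers only the GPU's share of the load; CPU, drives, and fans add to the wall figure separately. The function names are hypothetical.

```python
def estimate_wall_watts(dc_watts, psu_loss_fraction):
    """Estimate AC draw at the wall needed to deliver dc_watts to the GPU.

    Assumption: efficiency = 1 - psu_loss_fraction, with the loss
    expressed as a fraction of wall-side input power.
    """
    if not 0 <= psu_loss_fraction < 1:
        raise ValueError("psu_loss_fraction must be in [0, 1)")
    return dc_watts / (1.0 - psu_loss_fraction)


def wall_range(dc_watts, low_loss=0.08, high_loss=0.12):
    """Return (low, high) wall-side estimates for the 8-12% loss band."""
    return (estimate_wall_watts(dc_watts, low_loss),
            estimate_wall_watts(dc_watts, high_loss))
```

For example, a card that nvidia-smi reports at 450 W would pull roughly 500 W from the wall at a 10% loss, before counting the rest of the system.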
Why It Matters for Local AI
Every VRAM-budget decision in local inference runs through nvidia-smi. It's how you confirm a quantized model actually fit, how you spot KV-cache growth eating headroom as context fills, and how you catch the 8 GB-card death spiral where prompts under 2K context fly and anything larger crawls. If watch -n 1 nvidia-smi isn't open in a second terminal while you benchmark, you're flying blind.
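The checks described above can be turned into a rough heuristic over the numbers nvidia-smi reports. This is a sketch under stated assumptions, not a reliable detector: the function names are hypothetical, and the default thresholds (50% utilization, 512 MiB of headroom) are arbitrary starting points you would tune for your own card and runtime.

```python
def vram_headroom_mib(used_mib, total_mib):
    """Free VRAM in MiB, from nvidia-smi's memory.used and memory.total."""
    return total_mib - used_mib


def looks_like_cpu_fallback(util_pct, used_mib, total_mib,
                            util_threshold=50, headroom_threshold_mib=512):
    """Heuristic flag: GPU mostly idle while VRAM is nearly full.

    Both thresholds are illustrative assumptions, not values from any
    NVIDIA or llama.cpp documentation.
    """
    nearly_full = vram_headroom_mib(used_mib, total_mib) < headroom_threshold_mib
    return util_pct < util_threshold and nearly_full
```

A flag from this function is only a prompt to look closer; confirm against tokens-per-second and your runtime's own log of how many layers it placed on the GPU.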