nvcc
NVIDIA's CUDA compiler driver — the toolchain component that compiles CUDA C++ source into GPU-executable code.
nvcc is NVIDIA's CUDA compiler driver: it compiles the GPU (device) code in CUDA C++ source files and hands the remaining host code to your regular C++ compiler. For local AI builders, it's the thing you want to see invoked during a build, because its presence in the compile log is direct evidence that GPU backend support is being baked into your inference runtime.
Where It Shows Up in a Build
When you compile llama.cpp with cmake -B build -DGGML_CUDA=ON (older releases used -DLLAMA_CUDA=ON, and before that -DLLAMA_CUBLAS=ON), the build system locates nvcc from your installed CUDA toolkit and routes every .cu file through it. Running cmake --build build --verbose will surface nvcc invocations in the output, and that's your confirmation the GPU backend is being compiled in. The AMD equivalent is hipcc, which plays the same role for ROCm/hipBLAS builds.
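Reading a long verbose log by eye is error-prone, so the check can be scripted. The following is a minimal sketch, not part of llama.cpp: the helper name nvcc_lines and the sample log excerpt are hypothetical, and the match is a heuristic that looks for any token whose basename is nvcc.

```python
import os


def nvcc_lines(log_text):
    """Return the build-log lines that appear to invoke nvcc.

    Heuristic: a line counts if any whitespace-separated token has the
    basename "nvcc" (e.g. /usr/local/cuda/bin/nvcc). This avoids matching
    lines that merely mention nvcc inside a longer word.
    """
    hits = []
    for line in log_text.splitlines():
        if any(os.path.basename(tok) == "nvcc" for tok in line.split()):
            hits.append(line)
    return hits


# Hypothetical two-line excerpt for illustration only (not real build output).
sample_log = (
    "/usr/bin/c++ -O3 -c /src/ggml.c -o ggml.o\n"
    "/usr/local/cuda/bin/nvcc -O3 -c /src/kernel.cu -o kernel.o\n"
)

print(len(nvcc_lines(sample_log)), "nvcc invocation(s) found")
```

To use it on a real build, save the output of cmake --build build --verbose to a file and pass the file's contents to nvcc_lines. An empty result suggests that no .cu file went through nvcc.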
Why Builds Silently Fall Back to CPU
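One of the most common causes of "0 layers offloaded" on a GPU machine is a llama.cpp binary that was compiled without nvcc ever being invoked, usually because the CUDA toolkit wasn't installed, wasn't on PATH, or the -DGGML_CUDA=ON flag (-DLLAMA_CUDA=ON on older releases) was missing. The build completes successfully and you get a working llama-server binary, but it's CPU-only. You can verify after the fact with ldd llama-server | grep cuda; if no CUDA libraries such as libcudart or libcublas are listed, the binary was almost certainly built without the CUDA backend. One caveat: a build that links CUDA statically or loads its backends at runtime may not list them in ldd, so treat an empty result as a strong hint rather than proof. When the backend really is missing, the only fix is to rebuild; no runtime flag such as -ngl can add GPU support to a CPU-only binary.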
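The ldd check above can be sketched as a small script. This is an illustration under stated assumptions, not an official tool: the function names run_ldd and cuda_libs are hypothetical, the sample output is made up, and the match simply mirrors grep cuda while also catching cublas.

```python
import os
import subprocess


def run_ldd(binary_path):
    """Run ldd on a binary and return its text output (Linux only)."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def cuda_libs(ldd_output):
    """Return the library names in ldd output that look CUDA-related."""
    libs = []
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        # The first field is the library name (or a path to it).
        name = os.path.basename(parts[0])
        lowered = name.lower()
        if "cuda" in lowered or "cublas" in lowered:
            libs.append(name)
    return libs


# Hypothetical ldd output for illustration only; real paths and versions vary.
sample_ldd = (
    "\tlinux-vdso.so.1 (0x00007ffd0a5f2000)\n"
    "\tlibcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f1c2a000000)\n"
    "\tlibcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f1c23000000)\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c22c00000)\n"
)

print(cuda_libs(sample_ldd))
```

On a real system, cuda_libs(run_ldd("build/bin/llama-server")) would perform the same check against your own binary; the path shown is an assumption about where your build places it. An empty list is the signal to revisit the build configuration.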
Why It Matters for Local AI
On a local rig with an NVIDIA card, nvcc is the gate between "my model loads in seconds and decodes at full speed" and "my 4090 sits idle while llama.cpp churns through CPU threads." Every CUDA-accelerated inference path in the llama.cpp ecosystem, including downstream wrappers like Ollama and KoboldCpp when built from source, depends on nvcc having compiled the CUDA kernels. If you've spent thousands on VRAM, confirming nvcc ran is a 30-second sanity check that protects the whole investment.