
CUDA

NVIDIA's parallel computing platform that enables GPU-accelerated AI workloads on GeForce and data center cards.

CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform, released in 2006. It's the software layer that lets developers write programs that run directly on NVIDIA GPUs, exposing thousands of parallel processor cores to applications. In the context of local AI, CUDA is what makes NVIDIA the dominant platform for running LLM inference on consumer hardware.

Why CUDA Dominates AI

When you run a model through llama.cpp, Ollama, ExLlamaV2, or any other local inference software, they all rely on CUDA kernels to accelerate computation on NVIDIA hardware. The ecosystem advantage is enormous: almost every AI framework (PyTorch, TensorFlow, JAX) supports CUDA first, with other platforms second. Driver support is mature, debugging tools exist, and optimizations like FlashAttention-2 were written for CUDA first.

This isn't just about raw performance. It's about software compatibility. A technique that runs on CUDA today might take 6–12 months to arrive on ROCm or Metal. For local AI users, this means NVIDIA hardware runs more software, with better optimization, immediately.

CUDA Versions and Compatibility

CUDA has a toolkit version number (currently 12.x), and your NVIDIA driver must support at least the version your software was built against. Most modern local AI software targets CUDA 12.1+. If your driver is too old, you'll hit build errors or silently fall back to CPU inference. Keeping NVIDIA drivers current is the simplest fix.

Compatibility is forward-only within a major version: a CUDA 12.1 binary runs on a driver that supports CUDA 12.5 without issues, but running a newer CUDA binary on an older driver causes problems.
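The within-major-version rule above can be sketched as a small check. This is purely illustrative: the function name and string format are assumptions, not any NVIDIA API, and it models only the within-major rule described here (real drivers also run binaries built for older major versions).

```python
def binary_runs_on_driver(binary_cuda: str, driver_cuda: str) -> bool:
    """Illustrative check of CUDA minor-version compatibility:
    a binary runs if the driver supports the same major version
    and an equal or newer minor version. Versions are given as
    "<major>.<minor>" strings, e.g. "12.1"."""
    b_major, b_minor = (int(x) for x in binary_cuda.split("."))
    d_major, d_minor = (int(x) for x in driver_cuda.split("."))
    return b_major == d_major and d_minor >= b_minor

# A CUDA 12.1 binary on a CUDA 12.5 driver: fine.
print(binary_runs_on_driver("12.1", "12.5"))  # True
# A CUDA 12.5 binary on a CUDA 12.1 driver: problems.
print(binary_runs_on_driver("12.5", "12.1"))  # False
```

In practice you'd compare the version your inference software was built against (e.g., what `nvcc --version` reports at build time) with what `nvidia-smi` reports for the installed driver.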

The Compute Capability System

Every NVIDIA GPU has a compute capability version (e.g., 8.6 for the RTX 3090, 8.9 for the RTX 4090). Higher compute capability unlocks newer CUDA features. For local AI, compute capability 7.0+ (Volta and newer) is the practical minimum; on the consumer side that means the RTX 20-series (Turing, 7.5) and newer, since Volta never shipped as a GeForce card. Pascal cards like the GTX 1080 Ti (6.1) fall below that line. Most modern optimizations like FlashAttention-2 require 8.0+ (Ampere and above).
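As a sketch of how software gates features on compute capability, treating each version as a (major, minor) pair. The helper and the example GPU values are illustrative, not an NVIDIA API; the thresholds are the ones discussed above.

```python
def meets_capability(device_cc: tuple, required: tuple) -> bool:
    """Compare compute capabilities as (major, minor) pairs;
    Python compares tuples element by element, which matches
    how capability versions are ordered."""
    return device_cc >= required

LOCAL_AI_MINIMUM = (7, 0)   # Volta and newer: practical minimum for local AI
FLASH_ATTENTION_2 = (8, 0)  # Ampere and newer

rtx_3090 = (8, 6)
gtx_1080_ti = (6, 1)  # Pascal: below the practical minimum

print(meets_capability(rtx_3090, FLASH_ATTENTION_2))     # True
print(meets_capability(gtx_1080_ti, LOCAL_AI_MINIMUM))   # False
```

In a real program you would query the capability at runtime rather than hard-code it, e.g. `torch.cuda.get_device_capability()` in PyTorch returns the (major, minor) pair for the active GPU.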

CUDA vs. ROCm

AMD's answer to CUDA is ROCm. While ROCm has improved significantly, CUDA retains meaningful advantages: larger ecosystem, better driver stability, and more pre-compiled binaries available. For local AI in 2026, NVIDIA CUDA hardware still offers better plug-and-play compatibility with inference software.

The performance gap between CUDA and ROCm running the same model is narrowing but still exists, especially for cutting-edge optimizations like speculative decoding and quantized kernels.