Vulkan
A cross-vendor graphics and compute API used by llama.cpp as a portable GPU backend when CUDA or ROCm aren't available.
Vulkan is a low-level, cross-vendor GPU API that llama.cpp ships as a compute backend, letting you run local LLMs on any GPU with a working Vulkan driver — Intel Arc, AMD Radeon, NVIDIA, even some integrated chips. For builders, it's the universal fallback when the vendor-native stack isn't an option.
How It Fits in the Local LLM Stack
Most serious local inference goes through vendor-specific backends: CUDA on NVIDIA, ROCm on AMD, oneAPI on Intel. Vulkan sits alongside them as a portable alternative rather than underneath them: a llama.cpp build with the Vulkan backend enabled can target any GPU whose driver exposes Vulkan compute. That portability is why Intel Arc cards like the B580 became viable for local LLMs well before Intel's native software stack matured.
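As a rough illustration of what "any GPU that exposes Vulkan compute" means in practice, the sketch below probes for a Vulkan driver and assembles a CMake configure command. It assumes the vulkaninfo utility from the Vulkan tools is on the PATH and that llama.cpp's CMake options are named GGML_VULKAN and GGML_CUDA; option names have changed between releases, so check the build documentation for your checkout. The helper names are hypothetical.

```python
import shutil
import subprocess

# Assumed CMake option names; verify them against your llama.cpp checkout.
BACKEND_FLAGS = {
    "vulkan": "-DGGML_VULKAN=ON",
    "cuda": "-DGGML_CUDA=ON",
}


def vulkan_driver_visible() -> bool:
    """Return True if `vulkaninfo --summary` is found and exits cleanly."""
    exe = shutil.which("vulkaninfo")
    if exe is None:
        return False  # tool not installed, so we cannot confirm a driver
    try:
        result = subprocess.run(
            [exe, "--summary"], capture_output=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def cmake_configure_command(backend: str, build_dir: str = "build") -> list[str]:
    """Build the argument list for configuring llama.cpp with one backend."""
    flag = BACKEND_FLAGS[backend]  # raises KeyError for unknown backends
    return ["cmake", "-B", build_dir, flag]


if __name__ == "__main__":
    print("Vulkan driver visible:", vulkan_driver_visible())
    print(" ".join(cmake_configure_command("vulkan")))
```

A False result from the probe only means the tool could not confirm a driver; the utility itself may simply be missing.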
Performance Tradeoffs
Vulkan is the lowest common denominator, and it shows. On AMD hardware under Linux, llama.cpp's Vulkan backend is generally slower than the ROCm backend; on NVIDIA, it is generally slower than CUDA. The size of the gap depends on the model, the quantization, the driver version, and whether you measure prompt processing or token generation, so benchmark your own setup before committing. The gap isn't catastrophic for casual inference, but if you're chasing tokens-per-second on a card that has a working native backend, Vulkan is the wrong choice. Where Vulkan wins is breadth: it works on Windows without ROCm headaches, it works on Intel Arc when oneAPI is rough, and it works on mixed-vendor setups where no single vendor stack covers every card.
Why It Matters for Local AI
For Intel Arc owners, Vulkan is currently the most reliable path to running GGUF models — it's what makes a B580 a usable local-LLM card today. For AMD and NVIDIA users, treat Vulkan as a backup: useful when ROCm or CUDA breaks, but not the build target if you care about decode speed. The practical rule is simple: pick the native backend if your hardware has one that works, and reach for Vulkan when portability beats peak throughput.
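The practical rule can be written down as a small decision function. This is only a sketch: the vendor-to-backend mapping mirrors the one used in this entry, while the function name, the string labels, and the CPU fallback when no Vulkan driver is present are assumptions added for illustration.

```python
# Native backend per vendor, as described in this entry.
NATIVE_BACKEND = {"nvidia": "cuda", "amd": "rocm", "intel": "oneapi"}


def choose_backend(vendors, working_native, vulkan_available=True):
    """Pick a backend label for the GPUs in one machine.

    vendors: vendor names, one per GPU (e.g. ["amd", "nvidia"]).
    working_native: set of native backends that actually work on this system.
    vulkan_available: whether a working Vulkan driver is present.
    """
    distinct = {v.lower() for v in vendors}

    # One vendor with a working native stack: prefer it for throughput.
    if len(distinct) == 1:
        native = NATIVE_BACKEND.get(next(iter(distinct)))
        if native in working_native:
            return native

    # Mixed vendors, or a missing or broken native stack: portability wins.
    if distinct and vulkan_available:
        return "vulkan"

    # Assumption for this sketch: fall back to CPU when nothing else applies.
    return "cpu"


if __name__ == "__main__":
    print(choose_backend(["nvidia"], {"cuda"}))
    print(choose_backend(["intel"], set()))
    print(choose_backend(["amd", "nvidia"], {"cuda", "rocm"}))
```

The mixed-vendor branch deliberately ignores working native stacks, since no single one of them covers every card in the machine.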