HIPBLAS — Local AI Glossary | CraftRigs

hipBLAS is AMD's BLAS (Basic Linear Algebra Subprograms) interface in the ROCm stack. It is a thin layer that forwards matrix and vector operations to a GPU backend, which on AMD hardware is rocBLAS. For local AI builders on Radeon hardware, it is the piece that lets a ROCm build of llama.cpp run on the GPU instead of falling back to the CPU.
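To make "matrix operations" concrete: the BLAS routine that matters most for inference is the general matrix multiply (GEMM), which computes C = alpha * A * B + beta * C. The pure-Python sketch below shows only the arithmetic that a GPU BLAS library accelerates. The name gemm_reference is hypothetical and is not part of hipBLAS or llama.cpp.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    Illustrative reference only: a GPU BLAS library performs the same
    arithmetic with tuned kernels instead of Python loops.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("c must have one row per row of a and one column per column of b")

    result = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Dot product of row i of a with column j of b.
            dot = sum(a[i][k] * b[k][j] for k in range(inner))
            out_row.append(alpha * dot + beta * c[i][j])
        result.append(out_row)
    return result
```

In a transformer, each weight matrix applied to a batch of token activations is a product of this shape, which is why the speed of the BLAS backend shows up so directly in tokens per second.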

How It Fits Into the llama.cpp Build

llama.cpp does not pick up AMD GPU support automatically. You have to compile with hipBLAS explicitly enabled, pointing the build at a working ROCm install. The CMake option is GGML_HIPBLAS=ON in the releases this entry describes; the name has changed over time (older trees used LLAMA_HIPBLAS, newer ones use GGML_HIP), so check the build documentation for your checkout. Without that flag, the binary will load fine and even accept -ngl layer offload arguments, but the layers stay on the CPU, often with no more than a one-line warning in the startup log. This is the most common cause of "GPU layers not offloading" symptoms on AMD systems.
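As a rough sketch of what the configure step looks like, the helper below assembles a CMake command line with the HIP option switched on. The function name hip_cmake_command is hypothetical, the option name is a parameter because it differs between llama.cpp versions, and the AMDGPU_TARGETS value gfx1100 is only an example architecture. Treat the exact flag set as an assumption to verify against the build docs for your checkout.

```python
import shlex


def hip_cmake_command(source_dir=".", build_dir="build",
                      hip_flag="GGML_HIPBLAS", gpu_target=None):
    """Assemble a CMake configure command for a HIP-enabled llama.cpp build.

    hip_flag is an assumption: use whichever option name your llama.cpp
    version documents. gpu_target is optional and, if given, is passed
    as AMDGPU_TARGETS to restrict the build to one GPU architecture.
    """
    cmd = [
        "cmake", "-S", source_dir, "-B", build_dir,
        f"-D{hip_flag}=ON",
        "-DCMAKE_BUILD_TYPE=Release",
    ]
    if gpu_target:
        cmd.append(f"-DAMDGPU_TARGETS={gpu_target}")
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(hip_cmake_command(gpu_target="gfx1100")))
```

Returning the command as a list keeps it safe to hand to subprocess.run without shell quoting problems, and printing it first makes it easy to confirm that the HIP option is actually present before a long compile.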

hipBLAS vs cuBLAS vs Vulkan

hipBLAS is the AMD analogue to NVIDIA's cuBLAS, which is what CUDA builds of llama.cpp link against. The APIs are deliberately similar (HIP is designed as a near drop-in for CUDA), but the compiled binaries do not mix: a build linked against cuBLAS will not run on a Radeon, and a build linked against hipBLAS for ROCm will not run on a GeForce. Vulkan backends exist as a more portable alternative, but hipBLAS generally delivers better throughput on supported AMD cards because it goes through ROCm's tuned kernels rather than a generic compute API.

Why It Matters for Local AI

If you are running an AMD GPU for local inference, hipBLAS is the difference between fast GPU-accelerated decoding and a build that quietly runs on the CPU at a fraction of the speed. Missing or misconfigured hipBLAS support is one of the most common reasons VRAM offloading appears broken on Radeon rigs, so verify the build flag first when tokens per second look wrong.
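One quick way to check whether offload happened is to look at the model-load log. The sketch below assumes a line of the form "offloaded N/M layers to GPU", which llama.cpp has printed at load time in versions we have seen; the exact wording and prefix may differ in your build. The function name offloaded_layers is hypothetical, and the sample line is illustrative rather than captured output.

```python
import re

# Assumed log format; adjust the pattern if your build words it differently.
OFFLOAD_PATTERN = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offloaded_layers(log_text):
    """Return (offloaded, total) parsed from the log, or None if no such line exists."""
    match = OFFLOAD_PATTERN.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample_log = "load_tensors: offloaded 0/33 layers to GPU"  # illustrative line
    result = offloaded_layers(sample_log)
    if result is None:
        print("No offload line found; the build may lack GPU support.")
    elif result[0] == 0:
        print("Zero layers offloaded; check the build flag and the -ngl value.")
    else:
        print(f"{result[0]} of {result[1]} layers are on the GPU.")
```

A missing offload line or a zero count both point back to the build configuration rather than to the model or the prompt.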