GGUF
The de facto standard file format for quantized LLMs, used by llama.cpp, Ollama, and LM Studio.
GGUF (GPT-Generated Unified Format) is the file format used to store and distribute quantized language models. It was introduced in August 2023 as the successor to GGML, addressing limitations in the older format and adding support for richer metadata inside the model file itself.
Nearly every quantized model you'll download for local inference today is a GGUF file. Hugging Face hosts thousands of them, and tools like Ollama, LM Studio, and llama.cpp all read GGUF natively.
How to Read a GGUF Filename
GGUF filenames follow a consistent naming convention that tells you exactly what you're getting. For example:
Llama-3.1-8B-Instruct-Q4_K_M.gguf
Breaking it down:
- Llama-3.1 — Model family and version
- 8B — Parameter count (8 billion)
- Instruct — Model variant (instruction-tuned vs base)
- Q4_K_M — Quantization level (4-bit, K-quants method, medium quality)
- .gguf — File format
The quantization suffix is the most important part for hardware planning, since it determines file size and memory footprint. Q4_K_M is the most common choice because it balances quality against size; Q5_K_M and Q8_0 give higher quality at the cost of larger files and more RAM.
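The breakdown above can be automated. Here's a minimal sketch of a filename parser targeting the common `Family-Size-Variant-Quant.gguf` pattern; `parse_gguf_filename` is a hypothetical helper (not part of any GGUF tooling), and real-world filenames vary enough that a regex like this will not catch every repo's convention:

```python
import re

def parse_gguf_filename(name: str) -> dict:
    """Split a conventional GGUF filename into its parts.
    Hypothetical helper: handles the common
    'Family-Size-Variant-Quant.gguf' pattern only."""
    m = re.match(
        r"(?P<family>.+?)-(?P<params>\d+(?:\.\d+)?B)-"
        r"(?:(?P<variant>[A-Za-z]+)-)?(?P<quant>(?:IQ|Q)\d\w*)\.gguf$",
        name,
    )
    if not m:
        raise ValueError(f"unrecognized GGUF filename: {name}")
    return m.groupdict()

info = parse_gguf_filename("Llama-3.1-8B-Instruct-Q4_K_M.gguf")
# info["family"] == "Llama-3.1", info["params"] == "8B",
# info["variant"] == "Instruct", info["quant"] == "Q4_K_M"
```

The variant group is optional because base models often omit it (e.g. `Llama-3.1-8B-Q4_K_M.gguf`).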
What's Inside a GGUF File
Unlike the older GGML format, GGUF files are self-contained. They include:
- All model weights (quantized to the specified precision)
- Model architecture metadata (layer count, attention heads, context length)
- Tokenizer data (vocab, special tokens)
- Recommended inference parameters
This self-contained design means you don't need separate config files or tokenizer downloads — the GGUF file has everything the inference engine needs.
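The self-contained layout starts with a small fixed-size binary header, followed by the metadata key/value pairs and tensor data. As a sketch of what that looks like on disk, the snippet below reads just the header fields defined by the GGUF spec (v2/v3): the 4-byte magic `GGUF`, a uint32 format version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian. The function name is my own; this is not a full parser:

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: 4-byte magic 'GGUF',
    uint32 version, uint64 tensor count, uint64 metadata
    key/value count (all little-endian, per spec v2/v3)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

Pointing this at any downloaded `.gguf` file is a quick sanity check that a transfer completed intact before loading it into an inference engine.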
Where to Get GGUF Models
The primary source is Hugging Face. Search for a model name plus "GGUF" and look for repos from quantizers like TheBloke (a large, established back catalog) or bartowski (newer uploads, often with higher-quality Q4_K_M variants). Ollama's model library handles GGUF downloads automatically when you run ollama pull.
Why It Matters for Local AI
GGUF is the format that made the local LLM ecosystem practical. Its broad tool support means a model file downloaded once works across llama.cpp, Ollama, LM Studio, and most other local inference tools without conversion. When building a local rig, GGUF is the default format you'll be working with.
Related guides: GGUF vs GPTQ vs AWQ vs EXL2: which quantization format should you use? — when to pick GGUF vs alternatives. How to download GGUF models from HuggingFace — step-by-step model acquisition workflow. Ollama vs LM Studio vs llama.cpp vs vLLM — which inference runtime to use with your GGUF files.