
Complete Local LLM Glossary: Every Term Explained

By Charlotte Stewart · 9 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: This is the reference glossary for local LLM terms. Bookmark it. Every technical term you'll encounter — quantization, VRAM, context window, KV cache, tokens per second, GGUF, llama.cpp, ROCm — is defined plainly here.


Local LLM documentation assumes you already know what everything means. Most beginners end up with 15 browser tabs open trying to understand one setup guide. This glossary closes those tabs.

Terms are organized by category. Use the section headers to find what you need.


Hardware Terms

VRAM (Video RAM) The memory built into your GPU. This is where the AI model's weights live during inference. VRAM is faster than system RAM and necessary for GPU-accelerated inference. Running out of VRAM forces the model to offload to system RAM, which is dramatically slower. VRAM is the single most important spec for local LLM hardware.

Memory Bandwidth How fast data can be moved between the GPU's VRAM and its processing cores. Measured in GB/s (gigabytes per second). For LLM inference, memory bandwidth is the primary bottleneck for token generation speed — the model weights have to be read from VRAM for every token generated. Higher bandwidth = faster token output. More important than CUDA core count for inference workloads.

CUDA Nvidia's proprietary parallel computing platform. When you run a local LLM on an Nvidia GPU, CUDA handles the actual computation. CUDA's maturity and optimization are a primary reason Nvidia dominates local AI inference despite competitors having competitive hardware specs.

CUDA Cores The processing units inside an Nvidia GPU. More CUDA cores generally means more raw compute throughput. For LLM inference, CUDA cores matter primarily for the prefill phase (processing input). Generation speed is more dependent on memory bandwidth than core count.

TDP (Thermal Design Power) The maximum sustained power draw of a GPU, in watts. The actual power draw during inference is typically 60–85% of TDP since inference is memory-bandwidth-bound, not fully compute-bound. TDP is also roughly how much heat the GPU produces — your cooling system needs to handle it.

PCIe (Peripheral Component Interconnect Express) The interface connecting your GPU to the motherboard. For consumer GPUs, PCIe x16 is standard. PCIe bandwidth is rarely a bottleneck for single-GPU LLM inference but matters for multi-GPU setups or when combining GPU and CPU offloading.

NVLink Nvidia's high-bandwidth GPU-to-GPU interconnect (112.5 GB/s on the RTX 3090; the 600 GB/s figure belongs to data-center GPUs like the A100). NVLink does NOT merge VRAM into a single unified pool — the GPUs still appear as separate devices with separate VRAM. For LLM inference, tools like llama.cpp split model layers across both GPUs via --tensor-split, and NVLink provides faster inter-GPU communication than PCIe alone. Only select Nvidia cards support it — the RTX 3090 was the last consumer GeForce card with an NVLink connector.

GDDR (Graphics Double Data Rate) The type of RAM used in GPUs. GDDR5, GDDR6, GDDR6X, and GDDR7 are generations, with each offering higher bandwidth. GDDR7 in the RTX 50 series is significantly faster than GDDR6X in the 40 series, which is a meaningful jump for inference speed.


Model & Inference Terms

Parameters The numerical weights that define an AI model's behavior. More parameters generally means more capable and nuanced outputs. Common model sizes: 7B (7 billion), 13B, 14B, 30B, 70B. Parameter count directly determines VRAM requirements (before quantization).

Quantization The process of reducing model weight precision from 32-bit or 16-bit floating point to lower bit-width integers (Q8, Q4, Q3, Q2). Dramatically reduces VRAM requirements at the cost of some output quality. Q4_K_M is the standard for local use — roughly half the VRAM of full precision with minimal quality degradation. See the VRAM Calculator for size estimates at each quantization level.
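The "roughly half the VRAM" relationship can be sketched as simple arithmetic. This is a rough sizing heuristic, not an exact calculator — it covers weights only, ignoring KV cache and runtime overhead, and the ~4.8 effective bits per weight for Q4_K_M is an approximation (K-quants mix bit widths across tensors):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone (no KV cache, no runtime overhead)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes

# FP16 weights: 16 bits each; Q4_K_M averages roughly 4.8 bits per weight
print(round(model_size_gb(7, 16), 1))   # ~14.0 GB for a 7B model at FP16
print(round(model_size_gb(7, 4.8), 1))  # ~4.2 GB for the same model at Q4_K_M
```

In practice the downloaded .gguf file will be close to this estimate, but you need extra VRAM headroom for the KV cache and compute buffers.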

Q4_K_M A specific 4-bit quantization format. K indicates K-quant (a higher-quality quantization method), M indicates medium. Q4_K_M is the most widely recommended quantization for local inference — good balance of quality and size. Other common variants: Q5_K_M (slightly better quality, larger), Q4_K_S (smaller than Q4_K_M, slightly lower quality).

GGUF (GGML Universal File Format) A model file format developed by Georgi Gerganov as the successor to the GGML format. GGUF stores model weights and metadata together in a single file, making it easy to run quantized models with llama.cpp, Ollama, LM Studio, and similar tools. When you download a model from Hugging Face for local use, you're usually downloading a .gguf file.

Inference Running a trained AI model to generate outputs — as opposed to training, which creates the model. When you ask your local LLM a question and it answers, that's inference. Inference is the computation type that matters for local AI use. Training requires dramatically more hardware.

Tokens The units of text that language models process. Roughly 1 token ≈ 0.75 English words, or about 4 characters. Context length is measured in tokens, and VRAM cost (via the KV cache) grows with it. Token limits determine how much text you can send in a single prompt.
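The "4 characters per token" rule of thumb is easy to apply in code. A sketch of that heuristic only — real tokenizers vary by model and language, so treat the result as a ballpark figure:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate via the ~4 characters per token heuristic."""
    return max(1, len(text) // 4)

prompt = "Explain the difference between VRAM and system RAM."
print(estimate_tokens(prompt))  # ballpark count; an actual tokenizer may differ
```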

Tokens per Second (t/s) The speed at which a model generates output. Measured in tokens per second. At 10 t/s you're watching words appear slowly. At 30–40 t/s responses feel fast and natural. At 100+ t/s the model outputs faster than you can read. Generation speed is primarily determined by GPU memory bandwidth.
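Because every generated token requires reading the full weight set from VRAM, memory bandwidth gives a hard theoretical ceiling on generation speed. A back-of-the-envelope sketch under that assumption (the RTX 3090's ~936 GB/s and the ~4.2 GB Q4_K_M size are example figures; real-world speeds land well below the ceiling due to compute overhead and KV cache reads):

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling: each token implies one full read of the weights."""
    return bandwidth_gb_s / model_size_gb

# RTX 3090 (~936 GB/s) running a 7B model at Q4_K_M (~4.2 GB of weights)
print(round(max_tokens_per_second(936, 4.2)))  # ~223 t/s theoretical ceiling
```

The same formula explains why a 70B model on the same card is so much slower: ten times the weights to read per token means roughly a tenth of the ceiling.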

Context Window How many tokens the model can "see" at once — both the input you provide and the output it generates. A 4K context window means 4,096 tokens total. Larger context windows let you work with longer documents, longer conversations, and bigger codebases. Context window size directly affects VRAM usage through the KV cache.

KV Cache (Key-Value Cache) The stored intermediate computations for all tokens in the current context. During generation, the model doesn't recompute attention for previously-seen tokens — it retrieves them from the KV cache. The KV cache lives in VRAM and grows with context length. At 128K context, the KV cache can be larger than the model weights themselves.
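The KV cache's growth with context length follows directly from the architecture: keys plus values, for every layer, every KV head, and every token. A sketch using a Llama-3-8B-style configuration as an assumed example (32 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: keys + values stored per layer, per KV head, per token."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1e9  # decimal gigabytes

# Llama-3-8B-style config, FP16 cache elements
print(round(kv_cache_gb(32, 8, 128, 8192), 2))    # ~1.07 GB at 8K context
print(round(kv_cache_gb(32, 8, 128, 131072), 2))  # ~17.18 GB at 128K context
```

Note the 128K figure: the cache alone dwarfs the ~4–5 GB the quantized weights occupy, which is exactly the "larger than the model weights themselves" situation described above.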

Prefill The first phase of inference: processing your input prompt. During prefill, the model reads all your input tokens simultaneously and computes the KV cache for them. Fast prefill means faster time-to-first-token. Prefill speed scales with CUDA core count more than generation speed does.

Generation (Decode) The second phase of inference: producing output tokens one at a time. Each token requires reading the model weights from VRAM and attending to the KV cache. Generation speed determines the responsiveness you actually experience — how fast text appears on screen. Directly limited by memory bandwidth.

Prompt The input you give to the model. Includes the system prompt (instructions about how the model should behave), prior conversation turns, and your actual question or request.

System Prompt Instructions given to the model before the conversation starts, defining its persona, constraints, or context. In Ollama and LM Studio, system prompts are configurable per model.

Temperature A parameter controlling randomness in model output. Temperature 0 = deterministic, always picks the most probable next token. Temperature 1 = default sampling, more varied. Temperature 2+ = increasingly random/creative/incoherent. Most practical use cases work well at 0.7–1.0.
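Mechanically, temperature divides the model's raw next-token scores (logits) before the softmax, then a token is sampled from the resulting distribution. A minimal sketch of that sampling step — the toy logits are illustrative, not from any real model:

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Scale logits by 1/T, softmax, then sample. T = 0 means pure argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]  # toy next-token scores
print(sample_with_temperature(logits, 0))  # always index 0 (most probable)
```

At low temperature the scaled gap between logits widens, so the distribution concentrates on the top token; at high temperature the gaps shrink and low-probability tokens get sampled more often — which is the "creative/incoherent" end of the dial.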


Architecture Terms

Transformer The core neural network architecture used by virtually all modern LLMs. Transformers process text by computing attention between all tokens in the context — which is why context window processing scales quadratically in naive implementations.

Attention Mechanism The component of a transformer that lets the model relate different tokens to each other regardless of their distance. "Self-attention" allows the model to understand that "it" in a sentence refers to a specific earlier noun, for example. Attention computation is what makes long context windows expensive in compute and memory.

Flash Attention An optimized implementation of the attention mechanism that is more memory-efficient by reorganizing the computation order. Flash Attention and its successors dramatically reduced the VRAM cost of long context windows, enabling consumer GPUs to run models with 32K+ context windows.

MoE (Mixture of Experts) An architecture where a model has multiple specialized "expert" sub-networks. For each token, a router selects a small subset of experts to activate. MoE models have large total parameter counts but fewer active parameters per token — improving efficiency on high-end hardware. Key caveat: the full weight set must be loaded into VRAM even though only a fraction activates per token.
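The efficiency claim comes down to the ratio of active to total parameters. A sketch using Mixtral-8x7B-style numbers as an assumed example (roughly 47B total parameters, roughly 13B active per token):

```python
def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of the weight set actually read per generated token."""
    return active_params_b / total_params_b

# Mixtral-8x7B-style: ~47B total, ~13B active per token
print(round(moe_active_fraction(47, 13), 2))  # ~0.28
```

Per-token compute and weight reads scale with the ~28% active fraction, but VRAM still has to hold all 100% — which is the caveat noted above.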

Llama (Large Language Model Meta AI) Meta's family of open-weight foundation models. Llama 1, 2, 3, and 3.1/3.2/3.3 have become the base for a huge proportion of local models. The Llama architecture is the most widely used and best-supported in local inference tools.


Software & Ecosystem Terms

llama.cpp The foundational C++ library for running quantized LLMs on consumer hardware. Almost everything in the local AI ecosystem either uses llama.cpp directly or is inspired by it. Supports Nvidia (CUDA), AMD (ROCm/HIP), Apple (Metal), and CPU inference. Open-source and actively maintained.

Ollama A user-friendly wrapper around llama.cpp that makes running local models simple. Manages model downloads, provides an API, and handles server lifecycle. The easiest starting point for most people. Runs on Mac, Windows, and Linux.

LM Studio A GUI application for running local LLMs. Downloads models, manages inference, and provides a chat interface. Good for people who prefer not to use command-line tools. Built on llama.cpp.

Hugging Face The primary repository for model weights. Most open-source LLMs are hosted here. When people say "download a model," they usually mean from huggingface.co. GGUF quantizations for llama.cpp are uploaded by community quantizers — "TheBloke" popularized the practice, and contributors like "bartowski" continue it.

CUDA Toolkit Nvidia's software development kit for GPU computing. Must be installed for Nvidia GPU acceleration to work with most local AI tools. The correct version must match both your GPU driver and the build of llama.cpp or Ollama you're using.

ROCm (Radeon Open Compute) AMD's equivalent of CUDA — the software platform for GPU computing on AMD hardware. ROCm support in local inference tools has improved significantly with RDNA 4, especially on Linux. Windows ROCm support still lags.

Metal Apple's GPU compute framework. Used for GPU-accelerated inference on Apple Silicon (M1, M2, M3, M4 chips). llama.cpp and Ollama support Metal, which is why Apple Silicon Macs perform well for local AI despite not having a discrete GPU.

Hugging Face Transformers A Python library for working with transformer models. More flexible than llama.cpp but less optimized for pure inference speed on consumer hardware. Used more in development and research contexts.

Quantization Formats: GGUF vs GPTQ vs AWQ Three common quantization approaches. GGUF (used by llama.cpp) supports CPU and GPU inference (CUDA, Metal, ROCm) with flexible quantization levels. GPTQ and AWQ are GPU-optimized quantization formats used with frameworks like vLLM and Transformers. For most local inference use cases, GGUF is the right choice.


Performance Terms

LocalScore A community benchmark for local LLM inference. Tests generation speed (tokens per second) and prefill speed across standardized model sizes. Useful for comparing GPUs because it measures actual inference performance rather than synthetic compute benchmarks.

Time to First Token (TTFT) How long you wait after submitting a prompt before the model starts generating output. Determined by prefill speed. A slow TTFT feels frustrating even if generation speed is fast.

Perplexity A metric for model quality — lower perplexity means the model assigns higher probability to real text. Used to evaluate quantization quality: a Q4_K_M quant shows slightly higher perplexity than the same model at F16, i.e. slightly lower quality. Not directly visible to users, but it explains why aggressive quantizations produce worse outputs.

Offloading Splitting model layers between GPU VRAM and system RAM (or CPU). When a model is too large to fully fit in VRAM, you can offload some layers to RAM. GPU layers run fast; CPU/RAM layers run slow. Maximum offload (all layers to GPU) is ideal; partial offload with some layers on CPU dramatically reduces generation speed.
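Deciding how many layers to offload is a quick division: weights divided by layer count, against whatever VRAM is left after reserving headroom. A sketch under simplifying assumptions (layers of equal size, a fixed reserve for KV cache and buffers; the 70B/80-layer/24 GB numbers are illustrative):

```python
def layers_that_fit(vram_gb, model_gb, n_layers, reserve_gb=1.5):
    """How many of n_layers fit in VRAM after reserving KV cache/overhead space."""
    per_layer_gb = model_gb / n_layers
    usable = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable / per_layer_gb))

# 70B model at Q4_K_M (~40 GB, 80 layers) on a 24 GB card
print(layers_that_fit(24, 40, 80))  # ~45 layers on GPU, the rest on CPU
```

Inference tools take a value like this as their GPU-layers setting; the generation-speed hit from the CPU-resident layers is why fully fitting the model in VRAM matters so much.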


