Local AI Glossary
Every term you'll encounter when running LLMs locally — explained without the jargon. 78 terms across 6 categories.
Memory & Storage
Context Window: The maximum number of tokens a model can process at once; in effect, its working memory.
EXPO: AMD's memory overclocking profile standard for DDR5, equivalent to Intel's XMP. It lets RAM run at advertised speeds instead of slower JEDEC defaults.
GDDR6X: The GPU memory standard used in high-end RTX 30 and 40-series cards, delivering up to 1,008 GB/s.
GDDR7: The latest GPU memory standard, used in RTX 50-series cards, with roughly double GDDR6X bandwidth.
KV Cache: Memory storage for the key-value attention states of all tokens in your current context.
LPDDR5X: Low-power, high-bandwidth RAM used in Apple Silicon chips as the unified memory substrate.
Memory Bandwidth: How fast data moves between memory and the processor, measured in GB/s.
RAM (System RAM): General-purpose computer memory used by the CPU and OS. It is distinct from VRAM, but relevant for LLM offloading and CPU-only inference.
System RAM (vs VRAM): General-purpose computer memory shared by the CPU and OS. It is slower than VRAM for GPU inference, but essential for CPU-only setups.
TurboQuant: A Google Research KV cache compression technique that extends usable context length 4-5x on consumer GPUs without retraining the model.
Unified Memory: A single memory pool shared by the CPU and GPU on Apple Silicon chips.
VRAM (Video RAM): Dedicated high-speed memory on your GPU that stores model weights during inference.
VRAM Offloading: Running some model layers in VRAM and the rest in system RAM when the model is too large to fit entirely on the GPU.
XMP: Intel's Extreme Memory Profile, a preset stored on DDR memory modules that lets the BIOS run RAM at its advertised speed and timings instead of conservative JEDEC defaults.
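To make the KV cache and context window entries concrete, here is a rough sizing sketch. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The function name and the model shape are hypothetical, chosen only for illustration, and real runtimes may add overhead or compress the cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer, KV head, and token."""
    keys_and_values = 2  # each cached token stores a key and a value
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_value)


# Hypothetical model shape (not taken from any specific model), FP16 cache values.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
print(f"{size / 1024**3:.2f} GiB")  # prints "1.00 GiB"
```

Doubling the context length doubles the estimate, which is why long contexts can exhaust VRAM even when the weights fit comfortably.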
Hardware
Apple Silicon: Apple's ARM-based system-on-chip family (M-series) that integrates CPU, GPU, and unified memory in a single package, used in Macs for local LLM inference.
Blackwell: NVIDIA's GPU architecture powering the RTX 50-series consumer cards and RTX Pro 6000 workstation GPUs, designed with GDDR7 memory and updated Tensor Cores for AI workloads.
CPU: The general-purpose processor that handles model layers and KV-cache when they don't fit in GPU VRAM. In local AI it's the slow fallback path, not the workhorse.
CUDA: NVIDIA's parallel computing platform that enables GPU-accelerated AI workloads on GeForce and data center cards.
GPU: A graphics processing unit, the parallel-compute chip that runs the matrix math behind local LLM inference. For local AI, the GPU's onboard memory and bandwidth matter more than its gaming performance.
Hopper: NVIDIA's data-center GPU architecture launched in 2022, powering the H100 and H200 accelerators that dominate large-scale AI training and inference.
NPU: A neural processing unit is a dedicated chip block built to accelerate small AI workloads at low power, separate from the CPU and GPU.
NVLink: A high-bandwidth NVIDIA interconnect that lets two GPUs share data directly, bypassing the slower PCIe bus.
PCIe: The high-speed slot standard that connects GPUs, NVMe drives, and other expansion cards to your motherboard. Each generation roughly doubles the bandwidth of the last.
RDNA 5: AMD's next-generation GPU architecture, expected to launch in mid-2027 as the successor to RDNA 4.
ROCm: AMD's open-source GPU compute platform, the AMD equivalent of CUDA for running AI workloads on Radeon GPUs.
Tensor Cores: Specialized compute units inside NVIDIA GPUs designed for matrix multiplication, the core operation in neural network inference.
Vera Rubin: NVIDIA's next-generation hyperscaler GPU platform announced at GTC 2026, succeeding Blackwell and integrating Groq's LPU inference technology.
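The PCIe entry's "each generation roughly doubles" can be checked with a little arithmetic. The sketch below covers generations 3 through 5 only, which share 128b/130b line encoding; the function name is hypothetical, and the result is theoretical one-direction throughput for a x16 slot, before protocol overhead.

```python
def pcie_x16_gb_per_s(generation):
    """Theoretical one-direction bandwidth of a x16 link in GB/s (PCIe 3.0 to 5.0 only)."""
    if generation not in (3, 4, 5):
        raise ValueError("this sketch only models PCIe 3.0, 4.0, and 5.0")
    transfers_per_s = 8e9 * 2 ** (generation - 3)  # 8, 16, 32 GT/s per lane
    lanes = 16
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    bits_per_s = transfers_per_s * lanes * encoding_efficiency
    return bits_per_s / 8 / 1e9


for gen in (3, 4, 5):
    print(f"PCIe {gen}.0 x16: {pcie_x16_gb_per_s(gen):.2f} GB/s")
```

Even the fastest of these is far below the GPU memory bandwidth figures quoted elsewhere in this glossary, which helps explain why layers that sit in system RAM slow inference down.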
Models & Quantization
AWQ: Activation-aware Weight Quantization, a 4-bit model compression format that protects the most important weights to preserve accuracy while shrinking VRAM footprint.
BF16 (Brain Float 16): Google's 16-bit floating-point format with the same exponent range as FP32. It is preferred over FP16 for training and increasingly common in inference.
Bits Per Weight (BPW): The number of bits used to store each model parameter, determining model size in memory.
Dense: A model architecture where every parameter activates on every token, as opposed to mixture-of-experts designs that only fire a subset.
Embedding: A dense numerical vector that represents text in a high-dimensional space; the foundation of semantic search and RAG systems.
EXL2: A GPU-only quantization format from the ExLlamaV2 project that supports fractional, mixed bits-per-weight for fast local LLM inference on NVIDIA cards.
Fine-Tuning: Training a pre-trained model on additional data to specialize its behavior, improve task performance, or adjust its output style.
FP16 (Half Precision): A 16-bit floating-point format used for AI model weights, with half the memory of FP32 and minimal quality loss for inference.
GGML: The predecessor file format to GGUF for storing quantized LLMs, used by early versions of llama.cpp.
GGUF: The standard file format for quantized LLMs, used by llama.cpp, Ollama, and LM Studio.
GPTQ: A post-training quantization method that compresses LLM weights to 4-bit (or lower) precision for GPU inference, reducing VRAM use with minimal accuracy loss.
INT4 (4-bit Integer): A 4-bit integer quantization format, the practical minimum precision for running large language models on consumer hardware.
LLM (Large Language Model): A neural network trained on large amounts of text that can generate, summarize, translate, and reason about language.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that adds small trainable weight matrices to a frozen base model, enabling custom model training on consumer hardware.
Model Parameters (7B, 13B, 70B): The number of learned numerical weights in a model, the primary predictor of capability and VRAM requirement.
MoE (Mixture of Experts): An architecture where a model has many specialized sub-networks (experts) but activates only a subset per token, enabling larger models with lower compute costs.
Multimodal: AI models that process and generate more than one type of data, typically text plus images, audio, or video.
Q4_K_M: A 4-bit quantization format for GGUF models that shrinks weights to roughly a quarter of their FP16 size while preserving most of the model's quality.
Quantization: Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.
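The relationship between parameter count, bits per weight, and memory footprint is simple arithmetic. The sketch below uses a hypothetical helper name and counts weights only; real quantized files also store scales and metadata, and inference needs extra room for the KV cache, so treat the numbers as lower bounds.

```python
def model_size_gb(n_parameters, bits_per_weight):
    """Approximate weight storage in decimal GB: parameters x bits, converted to bytes."""
    return n_parameters * bits_per_weight / 8 / 1e9


# A 7B-parameter model at FP16 versus a flat 4 bits per weight.
print(model_size_gb(7e9, 16))  # prints 14.0
print(model_size_gb(7e9, 4))   # prints 3.5
```

This is why dropping from 16 to 4 bits per weight can move a model from "needs a workstation card" to "fits on a mid-range GPU".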
Software & Tools
Black-Box RAG: Pre-packaged retrieval-augmented generation tools that hide chunking, embedding, and retrieval logic behind a simple UI, often at the cost of correctness and privacy.
ExLlamaV2: A high-performance inference library optimized for NVIDIA GPUs, known for fast quantized inference and support for the EXL2 quantization format.
Full RAG: A complete retrieval-augmented generation pipeline that chunks documents, embeds them into a vector database, and retrieves relevant context at query time to ground LLM responses.
hipBLAS: AMD's GPU-accelerated BLAS library that lets llama.cpp and similar runtimes offload matrix math to Radeon and Instinct cards through the ROCm stack.
hipcc: AMD's HIP C++ compiler driver, used to build GPU-accelerated code that runs on Radeon and Instinct cards via the ROCm stack.
KoboldCpp: An inference server with a web UI designed for creative writing and roleplay, built on llama.cpp with additional sampling controls and story management features.
llama.cpp: A C++ inference engine that runs quantized GGUF models on CPU, GPU, or both simultaneously.
LM Studio: A desktop application for downloading and running local LLMs with a graphical interface; the easiest entry point for local AI on Windows and macOS.
Local LLM: A large language model that runs entirely on your own hardware: no cloud API, no per-token billing, no data leaving the machine.
MLX: Apple's machine learning framework optimized for Apple Silicon. It enables fast local LLM inference on M-series Macs using unified memory.
numactl: A Linux command-line tool for binding processes to specific NUMA nodes, controlling which CPU cores and memory banks a workload uses on multi-socket systems.
nvcc: NVIDIA's CUDA compiler driver, the toolchain component that compiles CUDA C++ source into GPU-executable code.
nvidia-smi: A command-line utility bundled with NVIDIA drivers that reports GPU utilization, VRAM usage, temperature, and power draw in real time.
Ollama: A tool that makes running local LLMs as simple as a single terminal command.
oneAPI: Intel's open, cross-architecture programming toolkit for running compute workloads on Intel CPUs, GPUs, and accelerators. It's the Intel-side analog to CUDA or ROCm for local LLM inference on Arc cards.
RAG: Retrieval-Augmented Generation, a technique where an LLM pulls relevant text from an external document store at query time and uses it as context for its answer.
Tinygrad Software Stack: An open-source deep learning framework from George Hotz's tiny corp, designed as a minimal, hardware-agnostic alternative to PyTorch and CUDA for running and training neural networks.
TT-Forge software stack: Tenstorrent's open-source compiler and runtime stack for running ML models on its RISC-V-based Tensix accelerators, including the QuietBox 2 workstation.
Vulkan: A cross-vendor graphics and compute API used by llama.cpp as a portable GPU backend when CUDA or ROCm aren't available.
WDDM: Windows Display Driver Model, the GPU driver framework Windows uses to share and manage graphics memory, which on WSL2 reserves a slice of your VRAM before any LLM ever loads.
WSL1: The original Windows Subsystem for Linux that translates Linux syscalls to Windows kernel calls, running without a virtual machine layer.
WSL2: Windows Subsystem for Linux 2, a lightweight VM that runs a real Linux kernel inside Windows, letting you use Linux-native AI tooling without dual-booting.
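The chunk, embed, and retrieve steps described under Full RAG can be sketched in a few lines. This is a toy, not a production pipeline: word counts stand in for a real embedding model, a Python list stands in for the vector database, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


docs = [
    "GGUF is a file format for quantized models.",
    "NVLink connects two GPUs directly.",
    "Unified memory is shared by the CPU and GPU.",
]
question = "which file format stores quantized models"
context = retrieve(question, docs, k=1)
# The retrieved text is placed in the prompt so the LLM can ground its answer.
prompt = f"Context: {context[0]}\n\nQuestion: {question}"
print(prompt)
```

Seeing each step spelled out is the difference between Full RAG and Black-Box RAG: every stage here can be inspected and swapped for a better one.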
Performance
Batch Size: The number of requests processed simultaneously during inference. Higher batch sizes improve GPU utilization but increase latency per request.
Decode Speed: The rate at which output tokens stream out during the token generation (decode) phase of LLM inference.
Flash Attention: An algorithm that computes attention more efficiently by reducing VRAM reads, speeding up prefill and enabling longer context.
Inference: Running a trained model to generate output; the part of the LLM lifecycle that produces tokens in response to a prompt.
Prefill (Time to First Token): The phase where the model processes your input prompt before generating any output.
Prompt Caching: Reusing the computed KV cache state from a previous request's prefix, eliminating redundant compute for repeated system prompts or context.
Speculative Decoding: An inference acceleration technique that uses a smaller draft model to generate candidate tokens, which the main model then verifies in parallel, reducing effective latency.
Tensor Parallelism: Splitting individual model layers across multiple GPUs so each card holds a slice of every weight matrix and the layers compute in parallel.
Tokens Per Second (t/s): The primary speed metric for LLM inference: how many tokens the model generates each second.
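A common rule of thumb ties decode speed to memory bandwidth: generating each token requires reading roughly all active weights once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes a dense model held entirely in one memory pool; the function name is hypothetical, and real throughput lands below this bound.

```python
def max_decode_tokens_per_s(bandwidth_gb_per_s, model_size_gb):
    """Upper bound on decode speed when every token reads all weights once."""
    return bandwidth_gb_per_s / model_size_gb


# 1,008 GB/s of memory bandwidth and a 3.5 GB quantized model.
print(max_decode_tokens_per_s(1008, 3.5))  # prints 288.0
```

The same arithmetic shows why offloading hurts: layers served from slower system RAM lower the effective bandwidth, and the ceiling drops with it.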
All Terms A–Z
A
- Apple Silicon Apple's ARM-based system-on-chip family (M-series) that integrates CPU, GPU, and unified memory in a single package, used in Macs for local LLM inference.
- AWQ Activation-aware Weight Quantization — a 4-bit model compression format that protects the most important weights to preserve accuracy while shrinking VRAM footprint.
B
- Batch Size The number of requests processed simultaneously during inference — higher batch sizes improve GPU utilization but increase latency per request.
- BF16 (Brain Float 16) Google's 16-bit floating-point format with the same exponent range as FP32 — preferred over FP16 for training and increasingly common in inference.
- Bits Per Weight (BPW) The number of bits used to store each model parameter, determining model size in memory.
- Black-Box RAG Pre-packaged retrieval-augmented generation tools that hide chunking, embedding, and retrieval logic behind a simple UI, often at the cost of correctness and privacy.
- Blackwell NVIDIA's GPU architecture powering the RTX 50-series consumer cards and RTX Pro 6000 workstation GPUs, designed with GDDR7 memory and updated Tensor Cores for AI workloads.
C
- Context Window The maximum number of tokens a model can process at once — its working memory.
- CPU The general-purpose processor that handles model layers and KV-cache when they don't fit in GPU VRAM. In local AI it's the slow fallback path, not the workhorse.
- CUDA NVIDIA's parallel computing platform that enables GPU-accelerated AI workloads on GeForce and data center cards.
D
- Decode Speed The rate at which output tokens stream out during the token generation (decode) phase of LLM inference.
- Dense A model architecture where every parameter activates on every token, as opposed to mixture-of-experts designs that only fire a subset.
E
- Embedding A dense numerical vector that represents text in a high-dimensional space — the foundation of semantic search and RAG systems.
- EXL2 A GPU-only quantization format from the ExLlamaV2 project that supports fractional, mixed bits-per-weight for fast local LLM inference on NVIDIA cards.
- ExLlamaV2 High-performance inference library optimized for NVIDIA GPUs, known for fast quantized inference and support for EXL2 quantization format.
- EXPO AMD's memory overclocking profile standard for DDR5, equivalent to Intel's XMP — lets RAM run at advertised speeds instead of slower JEDEC defaults.
F
- Fine-Tuning Training a pre-trained model on additional data to specialize its behavior, improve task performance, or adjust its output style.
- Flash Attention An algorithm that computes attention more efficiently by reducing VRAM reads, speeding up prefill and enabling longer context.
- FP16 (Half Precision) 16-bit floating-point format used for AI model weights — half the memory of FP32 with minimal quality loss for inference.
- Full RAG A complete retrieval-augmented generation pipeline that chunks documents, embeds them into a vector database, and retrieves relevant context at query time to ground LLM responses.
G
- GDDR6X The GPU memory standard used in RTX 30 and 40-series high-end cards, delivering up to 1,008 GB/s.
- GDDR7 The latest GPU memory standard, used in RTX 50-series cards, with roughly double GDDR6X bandwidth.
- GGML The predecessor file format to GGUF for storing quantized LLMs, used by early versions of llama.cpp.
- GGUF The standard file format for quantized LLMs, used by llama.cpp, Ollama, and LM Studio.
- GPTQ A post-training quantization method that compresses LLM weights to 4-bit (or lower) precision for GPU inference, reducing VRAM use with minimal accuracy loss.
- GPU A graphics processing unit — the parallel-compute chip that runs the matrix math behind local LLM inference. For local AI, the GPU's onboard memory and bandwidth matter more than its gaming performance.
H
- HIPAA-Sensitive Describes data, workloads, or environments that handle protected health information (PHI) and must comply with HIPAA's privacy and security rules.
- hipBLAS AMD's GPU-accelerated BLAS library that lets llama.cpp and similar runtimes offload matrix math to Radeon and Instinct cards through the ROCm stack.
- hipcc AMD's HIP C++ compiler driver, used to build GPU-accelerated code that runs on Radeon and Instinct cards via the ROCm stack.
- Hopper NVIDIA's data-center GPU architecture launched in 2022, powering the H100 and H200 accelerators that dominate large-scale AI training and inference.
I
- Inference Running a trained model to generate output — the part of the LLM lifecycle that produces tokens in response to a prompt.
- INT4 (4-bit Integer) 4-bit integer quantization format — the practical minimum precision for running large language models on consumer hardware.
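K
- KV Cache Memory storage for the key-value attention states of all tokens in your current context.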
L
- llama.cpp A C++ inference engine that runs quantized GGUF models on CPU, GPU, or both simultaneously.
- LLM (Large Language Model) A neural network trained on large amounts of text that can generate, summarize, translate, and reason about language.
- LM Studio Desktop application for downloading and running local LLMs with a graphical interface — the easiest entry point for local AI on Windows and macOS.
- Local LLM A large language model that runs entirely on your own hardware — no cloud API, no per-token billing, no data leaving the machine.
- LoRA (Low-Rank Adaptation) Parameter-efficient fine-tuning technique that adds small trainable weight matrices to a frozen base model — enabling custom model training on consumer hardware.
- LPDDR5X Low-power high-bandwidth RAM used in Apple Silicon chips as the unified memory substrate.
M
- Memory Bandwidth How fast data moves between memory and the processor, measured in GB/s.
- MLX Apple's machine learning framework optimized for Apple Silicon — enables fast local LLM inference on M-series Macs using unified memory.
- Model Parameters (7B, 13B, 70B) The number of learned numerical weights in a model — the primary predictor of capability and VRAM requirement.
- MoE (Mixture of Experts) Architecture where a model has many specialized sub-networks (experts) but activates only a subset per token — enabling larger models with lower compute costs.
- Multimodal AI models that process and generate more than one type of data — typically text plus images, audio, or video.
N
- NPU A neural processing unit is a dedicated chip block built to accelerate small AI workloads at low power, separate from the CPU and GPU.
- numactl A Linux command-line tool for binding processes to specific NUMA nodes, controlling which CPU cores and memory banks a workload uses on multi-socket systems.
- nvcc NVIDIA's CUDA compiler driver — the toolchain component that compiles CUDA C++ source into GPU-executable code.
- nvidia-smi Command-line utility bundled with NVIDIA drivers that reports GPU utilization, VRAM usage, temperature, and power draw in real time.
- NVLink A high-bandwidth NVIDIA interconnect that lets two GPUs share data directly, bypassing the slower PCIe bus.
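O
- Ollama A tool that makes running local LLMs as simple as a single terminal command.
- oneAPI Intel's open, cross-architecture programming toolkit for running compute workloads on Intel CPUs, GPUs, and accelerators. It's the Intel-side analog to CUDA or ROCm for local LLM inference on Arc cards.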
P
- PCIe The high-speed slot standard that connects GPUs, NVMe drives, and other expansion cards to your motherboard. Each generation roughly doubles the bandwidth of the last.
- Prefill (Time to First Token) The phase where the model processes your input prompt before generating any output.
- Prompt Caching Reusing the computed KV cache state from a previous request's prefix — eliminating redundant compute for repeated system prompts or context.
Q
- Q4_K_M A 4-bit quantization format for GGUF models that shrinks weights to roughly a quarter of their FP16 size while preserving most of the model's quality.
- Quantization Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.
R
- RAG Retrieval-Augmented Generation: a technique where an LLM pulls relevant text from an external document store at query time and uses it as context for its answer.
- RAM (System RAM) General-purpose computer memory used by the CPU and OS — distinct from VRAM, but relevant for LLM offloading and CPU-only inference.
- RDNA 5 AMD's next-generation GPU architecture, expected to launch in mid-2027 as the successor to RDNA 4.
- ROCm AMD's open-source GPU compute platform — the AMD equivalent of CUDA for running AI workloads on Radeon GPUs.
S
- Speculative Decoding Inference acceleration technique that uses a smaller draft model to generate candidate tokens, verified in parallel by the main model — reducing effective latency.
- System RAM (vs VRAM) General-purpose computer memory shared by the CPU and OS — slower than VRAM for GPU inference, but essential for CPU-only setups.
T
- Tensor Cores Specialized compute units inside NVIDIA GPUs designed for matrix multiplication — the core operation in neural network inference.
- Tensor Parallelism Splitting individual model layers across multiple GPUs so each card holds a slice of every weight matrix and the layers compute in parallel.
- Tinygrad Software Stack An open-source deep learning framework from George Hotz's tiny corp, designed as a minimal, hardware-agnostic alternative to PyTorch and CUDA for running and training neural networks.
- Tokens Per Second (t/s) The primary speed metric for LLM inference — how many tokens the model generates each second.
- TT-Forge software stack Tenstorrent's open-source compiler and runtime stack for running ML models on its RISC-V-based Tensix accelerators, including the QuietBox 2 workstation.
- TurboQuant A Google Research KV cache compression technique that extends usable context length 4-5x on consumer GPUs without retraining the model.
U
- Unified Memory A single memory pool shared by the CPU and GPU on Apple Silicon chips.
V
- Vera Rubin NVIDIA's next-generation hyperscaler GPU platform announced at GTC 2026, succeeding Blackwell and integrating Groq's LPU inference technology.
- VRAM (Video RAM) Dedicated high-speed memory on your GPU that stores model weights during inference.
- VRAM Offloading Running some model layers in VRAM and the rest in system RAM when the model is too large to fit entirely on the GPU.
- Vulkan A cross-vendor graphics and compute API used by llama.cpp as a portable GPU backend when CUDA or ROCm aren't available.
W
- WDDM Windows Display Driver Model — the GPU driver framework Windows uses to share and manage graphics memory, which on WSL2 reserves a slice of your VRAM before any LLM ever loads.
- WSL1 The original Windows Subsystem for Linux that translates Linux syscalls to Windows kernel calls, running without a virtual machine layer.
- WSL2 Windows Subsystem for Linux 2 — a lightweight VM that runs a real Linux kernel inside Windows, letting you use Linux-native AI tooling without dual-booting.
X
- XMP Intel's Extreme Memory Profile, a preset stored on DDR memory modules that lets the BIOS run RAM at its advertised speed and timings instead of conservative JEDEC defaults.