Local AI Glossary

Every term you'll encounter when running LLMs locally — explained without the jargon. 78 terms across 6 categories.

Memory & Storage

Context Window

The maximum number of tokens a model can process at once, which acts as its working memory.

EXPO

AMD's memory overclocking profile standard for DDR5, equivalent to Intel's XMP. It lets RAM run at its advertised speed instead of slower JEDEC defaults.

GDDR6X

The GPU memory standard used in high-end RTX 30 and 40-series cards, delivering up to 1,008 GB/s on a 384-bit bus (as on the RTX 4090).

GDDR7

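The latest GPU memory standard, used in RTX 50-series cards, with higher per-pin data rates than GDDR6X. Paired with a 512-bit bus on the RTX 5090 it reaches 1,792 GB/s, close to double the RTX 4090's 1,008 GB/s.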

KV Cache

Memory that holds the attention key and value tensors for every token in your current context, so they are not recomputed for each new token. It grows linearly with context length.
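
As a rough sizing sketch, assuming a standard transformer that stores one key vector and one value vector per layer, per KV head, per token, the cache size can be estimated as follows. The function name and the model dimensions are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an FP16 (or BF16) cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 8,192-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=8192)
print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB
```

Doubling the context length doubles the result, which is why long contexts can cost as much memory as the weights themselves.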

LPDDR5X

Low-power, high-bandwidth RAM used as the unified memory substrate in recent Apple Silicon chips (the M4 generation; earlier M-series chips used LPDDR4X or LPDDR5).

Memory Bandwidth

How fast data moves between memory and the processor, measured in GB/s.

RAM (System RAM)

General-purpose computer memory used by the CPU and OS — distinct from VRAM, but relevant for LLM offloading and CPU-only inference.

System RAM (vs VRAM)

General-purpose computer memory shared by the CPU and OS — slower than VRAM for GPU inference, but essential for CPU-only setups.

TurboQuant

A Google Research KV cache compression technique that extends usable context length 4-5x on consumer GPUs without retraining the model.

Unified Memory

A single memory pool shared by the CPU and GPU on Apple Silicon chips.

VRAM (Video RAM)

Dedicated high-speed memory on your GPU that stores model weights, the KV cache, and activations during inference.

VRAM Offloading

Running some model layers in VRAM and the rest in system RAM when the model is too large to fit entirely on the GPU.
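
A back-of-the-envelope sketch of the split, assuming every layer is the same size and that some VRAM has already been set aside for the KV cache and runtime overhead. The function name and the numbers are hypothetical. In llama.cpp, the resulting layer count is the kind of value passed to `--n-gpu-layers`.

```python
def layers_on_gpu(n_layers, layer_size_gb, vram_budget_gb):
    """Return how many whole layers fit in the VRAM budget; the rest stay in system RAM."""
    fit = int(vram_budget_gb // layer_size_gb)
    return min(n_layers, fit)


# Hypothetical 80-layer model at 0.5 GB per layer, with 20 GB of VRAM left for weights.
gpu_layers = layers_on_gpu(n_layers=80, layer_size_gb=0.5, vram_budget_gb=20)
cpu_layers = 80 - gpu_layers
print(gpu_layers, cpu_layers)  # 40 40
```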

XMP

Intel's Extreme Memory Profile, a preset stored on DDR memory modules that lets the BIOS run RAM at its advertised speed and timings instead of conservative JEDEC defaults.

Hardware

Apple Silicon

Apple's ARM-based system-on-chip family (M-series) that integrates the CPU and GPU on one chip with unified memory in the same package, used in Macs for local LLM inference.

Blackwell

NVIDIA's GPU architecture powering the RTX 50-series consumer cards and RTX Pro 6000 workstation GPUs, designed with GDDR7 memory and updated Tensor Cores for AI workloads.

CPU

The general-purpose processor that handles model layers and the KV cache when they don't fit in GPU VRAM. In local AI it's the slow fallback path, not the workhorse.

CUDA

NVIDIA's parallel computing platform that enables GPU-accelerated AI workloads on GeForce and data center cards.

GPU

A graphics processing unit — the parallel-compute chip that runs the matrix math behind local LLM inference. For local AI, the GPU's onboard memory and bandwidth matter more than its gaming performance.

Hopper

NVIDIA's data-center GPU architecture launched in 2022, powering the H100 and H200 accelerators that dominate large-scale AI training and inference.

NPU

A neural processing unit: a dedicated chip block, separate from the CPU and GPU, built to accelerate small AI workloads at low power.

NVLink

A high-bandwidth NVIDIA interconnect that lets two or more GPUs share data directly, bypassing the slower PCIe bus.

PCIe

The high-speed slot standard that connects GPUs, NVMe drives, and other expansion cards to your motherboard. Each generation roughly doubles the bandwidth of the last.

RDNA 5

AMD's next-generation GPU architecture, expected to launch in mid-2027 as the successor to RDNA 4.

ROCm

AMD's open-source GPU compute platform — the AMD equivalent of CUDA for running AI workloads on Radeon GPUs.

Tensor Cores

Specialized compute units inside NVIDIA GPUs designed for matrix multiplication — the core operation in neural network inference.

Vera Rubin

NVIDIA's next-generation hyperscaler GPU platform announced at GTC 2026, succeeding Blackwell and integrating Groq's LPU inference technology.

Models & Quantization

AWQ

Activation-aware Weight Quantization — a 4-bit model compression format that protects the most important weights to preserve accuracy while shrinking VRAM footprint.

BF16 (Brain Float 16)

Google's 16-bit floating-point format with the same exponent range as FP32 — preferred over FP16 for training and increasingly common in inference.
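
To make the "same exponent range as FP32" point concrete, here is a minimal sketch that treats BF16 as the top 16 bits of an FP32 value (1 sign bit, 8 exponent bits, 7 mantissa bits). It truncates for simplicity, which is an assumption of this sketch; real converters usually round to nearest. The helper names are hypothetical.

```python
import struct


def float_to_bf16_bits(x):
    """Keep the top 16 bits of the float32 encoding of x (truncation, not rounding)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16


def bf16_bits_to_float(b):
    """Expand 16 BF16 bits back to a float by zero-filling the low 16 mantissa bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x


big = 1e38  # representable in FP32 and BF16; far above the FP16 maximum of 65504
roundtrip = bf16_bits_to_float(float_to_bf16_bits(big))
print(abs(roundtrip - big) / big < 0.01)  # True: the range survives, precision drops
```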

Bits Per Weight (BPW)

The number of bits used to store each model parameter, determining model size in memory.
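
The relationship is simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below uses a hypothetical 7-billion-parameter model and ignores file metadata and any tensors kept at higher precision.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bpw in (16, 8, 4.5):
    print(f"{bpw:>4} bpw -> {model_size_gb(7e9, bpw):.2f} GB")
# 16 bpw -> 14.00 GB, 8 bpw -> 7.00 GB, 4.5 bpw -> 3.94 GB
```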

Dense

A model architecture where every parameter activates on every token, as opposed to mixture-of-experts designs that only fire a subset.

Embedding

A dense numerical vector that represents text in a high-dimensional space — the foundation of semantic search and RAG systems.
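
Semantic search typically compares embeddings by cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not output from any real model.
query = [1.0, 0.0, 0.0]
doc_related = [0.9, 0.1, 0.0]
doc_unrelated = [0.0, 0.0, 1.0]

print(cosine_similarity(query, doc_related) > cosine_similarity(query, doc_unrelated))  # True
```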

EXL2

A GPU-only quantization format from the ExLlamaV2 project that supports fractional, mixed bits-per-weight for fast local LLM inference on NVIDIA cards.

Fine-Tuning

Training a pre-trained model on additional data to specialize its behavior, improve task performance, or adjust its output style.

FP16 (Half Precision)

16-bit floating-point format used for AI model weights — half the memory of FP32 with minimal quality loss for inference.

GGML

The predecessor file format to GGUF for storing quantized LLMs, used by early versions of llama.cpp.

GGUF

The standard file format for quantized LLMs, used by llama.cpp, Ollama, and LM Studio.

GPTQ

A post-training quantization method that compresses LLM weights to 4-bit (or lower) precision for GPU inference, reducing VRAM use with minimal accuracy loss.

INT4 (4-bit Integer)

4-bit integer quantization format, widely treated as the practical floor for running large language models on consumer hardware without a steep drop in quality.

LLM (Large Language Model)

A neural network trained on large amounts of text that can generate, summarize, translate, and reason about language.

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning technique that adds small trainable weight matrices to a frozen base model — enabling custom model training on consumer hardware.
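
A quick count shows why this fits on consumer hardware. Assuming the usual LoRA formulation, where the update to a d_out x d_in weight matrix is factored as B (d_out x r) times A (r x d_in), the trainable parameters per adapted matrix are r * (d_in + d_out). The matrix size and rank below are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


full = 4096 * 4096                                # parameters in one frozen weight matrix
lora = lora_trainable_params(4096, 4096, rank=8)  # parameters actually trained
print(full, lora, f"{lora / full:.2%}")           # 16777216 65536 0.39%
```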

Model Parameters (7B, 13B, 70B)

The number of learned numerical weights in a model — the primary predictor of capability and VRAM requirement.

MoE (Mixture of Experts)

Architecture where a model has many specialized sub-networks (experts) but activates only a subset per token — enabling larger models with lower compute costs.

Multimodal

AI models that process and generate more than one type of data — typically text plus images, audio, or video.

Q4_K_M

A 4-bit quantization format for GGUF models that averages roughly 4.8 bits per weight, shrinking weights to a little under a third of their FP16 size while preserving most of the model's quality.

Quantization

Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.
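
The core idea can be sketched with the simplest possible scheme: symmetric 4-bit quantization with one shared scale. This is a generic illustration, not the algorithm used by GGUF K-quants, GPTQ, or AWQ, which work on small blocks and use more elaborate scaling. The function names and weights are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to integers in [-7, 7] using a single scale derived from the largest magnitude."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [v * scale for v in q]


weights = [0.70, -0.33, 0.12, 0.0]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
print(q)  # [7, -3, 1, 0]
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Each weight now needs 4 bits plus a share of the scale, and the rounding error per weight is at most half a quantization step.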

Software & Tools

Black-Box RAG

Pre-packaged retrieval-augmented generation tools that hide chunking, embedding, and retrieval logic behind a simple UI, often at the cost of correctness and privacy.

ExLlamaV2

High-performance inference library optimized for NVIDIA GPUs, known for fast quantized inference and support for the EXL2 quantization format.

Full RAG

A complete retrieval-augmented generation pipeline that chunks documents, embeds them into a vector database, and retrieves relevant context at query time to ground LLM responses.

hipBLAS

AMD's GPU-accelerated BLAS library that lets llama.cpp and similar runtimes offload matrix math to Radeon and Instinct cards through the ROCm stack.

hipcc

AMD's HIP C++ compiler driver, used to build GPU-accelerated code that runs on Radeon and Instinct cards via the ROCm stack.

KoboldCpp

Inference server with a web UI designed for creative writing and roleplay — built on llama.cpp with additional sampling controls and story management features.

llama.cpp

A C++ inference engine that runs quantized GGUF models on CPU, GPU, or both simultaneously.

LM Studio

Desktop application for downloading and running local LLMs with a graphical interface, and an easy entry point for local AI on Windows, macOS, and Linux.

Local LLM

A large language model that runs entirely on your own hardware — no cloud API, no per-token billing, no data leaving the machine.

MLX

Apple's machine learning framework optimized for Apple Silicon — enables fast local LLM inference on M-series Macs using unified memory.

numactl

A Linux command-line tool for binding processes to specific NUMA nodes, controlling which CPU cores and memory banks a workload uses on multi-socket systems.

nvcc

NVIDIA's CUDA compiler driver — the toolchain component that compiles CUDA C++ source into GPU-executable code.

nvidia-smi

Command-line utility bundled with NVIDIA drivers that reports GPU utilization, VRAM usage, temperature, and power draw in real time.

Ollama

A command-line tool and local server that downloads and runs LLMs with a single terminal command, such as `ollama run <model>`.

oneAPI

Intel's open, cross-architecture programming toolkit for running compute workloads on Intel CPUs, GPUs, and accelerators. It's the Intel-side analog to CUDA or ROCm for local LLM inference on Arc cards.

RAG

Retrieval-Augmented Generation: a technique where an LLM pulls relevant text from an external document store at query time and uses it as context for its answer.

Tinygrad Software Stack

An open-source deep learning framework from George Hotz's tiny corp, designed as a minimal, hardware-agnostic alternative to PyTorch and CUDA for running and training neural networks.

TT-Forge software stack

Tenstorrent's open-source compiler and runtime stack for running ML models on its RISC-V-based Tensix accelerators, including the QuietBox 2 workstation.

Vulkan

A cross-vendor graphics and compute API used by llama.cpp as a portable GPU backend when CUDA or ROCm aren't available.

WDDM

Windows Display Driver Model — the GPU driver framework Windows uses to share and manage graphics memory, which on WSL2 reserves a slice of your VRAM before any LLM ever loads.

WSL1

The original Windows Subsystem for Linux that translates Linux syscalls to Windows kernel calls, running without a virtual machine layer.

WSL2

Windows Subsystem for Linux 2 — a lightweight VM that runs a real Linux kernel inside Windows, letting you use Linux-native AI tooling without dual-booting.

Performance

Batch Size

The number of requests processed simultaneously during inference — higher batch sizes improve GPU utilization but increase latency per request.

Decode Speed

The rate at which output tokens stream out during the token generation (decode) phase of LLM inference, typically limited by memory bandwidth rather than compute.
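
A common rule of thumb for single-user decoding on a dense model is that every generated token requires reading all the weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below states that assumption explicitly; real throughput is lower because of KV cache reads and other overhead. The function name is hypothetical.

```python
def decode_tps_upper_bound(bandwidth_gb_s, model_size_gb):
    """Bandwidth-bound ceiling on decode speed, assuming all weights are read once per token."""
    return bandwidth_gb_s / model_size_gb


# 1,008 GB/s of memory bandwidth and a hypothetical 14 GB model.
print(decode_tps_upper_bound(1008, 14.0))  # 72.0
```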

Flash Attention

An algorithm that computes exact attention in tiles held in fast on-chip memory, cutting VRAM reads and writes, speeding up prefill, and enabling longer context.

Inference

Running a trained model to generate output — the part of the LLM lifecycle that produces tokens in response to a prompt.

Prefill (Time to First Token)

The phase where the model processes your entire input prompt before generating any output. Its duration largely determines time to first token.

Prompt Caching

Reusing the computed KV cache state from a previous request's prefix — eliminating redundant compute for repeated system prompts or context.

Speculative Decoding

Inference acceleration technique that uses a smaller draft model to generate candidate tokens, verified in parallel by the main model — reducing effective latency.
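
The accept-or-correct loop can be sketched with toy stand-ins for the two models. This is the greedy variant only, under the assumption that each "model" is a plain function returning the next token. A real implementation checks all draft tokens in one batched forward pass of the main model, and sampling-based variants use a probabilistic acceptance rule. All names here are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy step: the draft proposes k tokens, the target keeps the longest agreeing run."""
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    accepted = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token when every guess was right
    return accepted


def target_next(tokens):
    """Toy main model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the target except that it never predicts 5."""
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


print(speculative_step([1], draft_next, target_next))  # [2, 3, 4, 5]
print(speculative_step([5], draft_next, target_next))  # [6, 7, 8, 9, 0]
```

Either way the output matches what the main model would have produced alone; the gain comes from producing several tokens per main-model pass when the draft guesses well.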

Tensor Parallelism

Splitting individual model layers across multiple GPUs so each card holds a slice of every weight matrix and the layers compute in parallel.
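
A minimal sketch of the idea, assuming a column-wise split of one weight matrix across two hypothetical devices: each device multiplies the same input by its own slice, and concatenating the partial outputs reproduces the full result. Plain Python lists stand in for GPU tensors, and the helper names are illustrative.

```python
def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def split_columns(W, parts):
    """Split W into equal column slices; assumes the column count divides evenly."""
    step = len(W[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]


x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, 2)                    # one shard per device
partials = [matvec(x, s) for s in shards]       # each device computes its slice
combined = [v for part in partials for v in part]
print(combined)  # [11.0, 14.0, 17.0, 20.0]
```

On real hardware the concatenation step is a collective communication between GPUs, which is why interconnect bandwidth matters for this technique.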

Tokens Per Second (t/s)

The primary speed metric for LLM inference — how many tokens the model generates each second.

Networking & Inference

HIPAA-Sensitive

Describes data, workloads, or environments that handle protected health information (PHI) and must comply with HIPAA's privacy and security rules.
