
Local AI Glossary

Every term you'll encounter when running LLMs locally — explained without the jargon. 41 terms across 5 categories.

Models & Quantization

BF16 (Brain Float 16)

Google's 16-bit floating-point format with the same exponent range as FP32 — preferred over FP16 for training and increasingly common in inference.

Bits Per Weight (BPW)

The number of bits used to store each model parameter, determining model size in memory.
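Multiplying parameter count by BPW gives a quick size estimate. A minimal sketch (the 7B size and 4.5 BPW figure are illustrative assumptions, not values from this glossary):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes.

    Ignores KV cache and runtime overhead; real model files also add
    a small amount of metadata on top of this.
    """
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

# Example: a 7B model at ~4.5 bits per weight (typical of 4-bit quants)
print(round(model_size_gb(7e9, 4.5), 2))  # ~3.94 GB
print(model_size_gb(7e9, 16.0))           # 14.0 GB at FP16
```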

Embedding

A dense numerical vector that represents text in a high-dimensional space — the foundation of semantic search and RAG systems.
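Semantic search compares embeddings by the angle between them. A minimal sketch using cosine similarity (the tiny 2- and 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = unrelated (orthogonal), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```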

Fine-Tuning

Training a pre-trained model on additional data to specialize its behavior, improve task performance, or adjust its output style.

FP16 (Half Precision)

16-bit floating-point format used for AI model weights — half the memory of FP32 with minimal quality loss for inference.

GGML

The predecessor file format to GGUF for storing quantized LLMs, used by early versions of llama.cpp.

GGUF

The standard file format for quantized LLMs, used by llama.cpp, Ollama, and LM Studio.

INT4 (4-bit Integer)

4-bit integer quantization format — the practical minimum precision for running large language models on consumer hardware.
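A toy sketch of the core round-and-scale idea behind 4-bit quantization (real INT4 schemes such as those in GGUF files quantize in small blocks with per-block scales; the weight values below are made up):

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to 4-bit signed integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive INT4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats; each value is off by at most scale/2."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.70, -0.08]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)  # close to the originals, not exact
```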

LLM (Large Language Model)

A neural network trained on large amounts of text that can generate, summarize, translate, and reason about language.

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning technique that adds small trainable weight matrices to a frozen base model — enabling custom model training on consumer hardware.
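The savings are easy to see with a back-of-the-envelope count (the 4096×4096 layer and rank 8 below are illustrative assumptions):

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters to update one weight matrix directly."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two small matrices, A (rank x d_in) and B (d_out x rank),
    whose product B @ A is added to the frozen weight W."""
    return rank * d_in + d_out * rank

# One 4096x4096 projection layer, LoRA rank 8
print(full_finetune_params(4096, 4096))  # 16_777_216 trainable params
print(lora_params(4096, 4096, 8))        # 65_536 (256x fewer)
```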

Model Parameters (7B, 13B, 70B)

The number of learned numerical weights in a model — the primary predictor of capability and VRAM requirement.

MoE (Mixture of Experts)

Architecture where a model has many specialized sub-networks (experts) but activates only a subset per token — enabling larger models with lower compute costs.

Multimodal

AI models that process and generate more than one type of data — typically text plus images, audio, or video.

Quantization

Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.

All Terms A–Z

B

  • Batch Size The number of requests processed simultaneously during inference — higher batch sizes improve GPU utilization but increase latency per request.
  • BF16 (Brain Float 16) Google's 16-bit floating-point format with the same exponent range as FP32 — preferred over FP16 for training and increasingly common in inference.
  • Bits Per Weight (BPW) The number of bits used to store each model parameter, determining model size in memory.

C

  • Context Window The maximum number of tokens a model can process at once — its working memory.
  • CUDA NVIDIA's parallel computing platform that enables GPU-accelerated AI workloads on GeForce and data center cards.

D

  • Decode Speed The rate at which output tokens stream out during the token generation phase of LLM inference.

E

  • Embedding A dense numerical vector that represents text in a high-dimensional space — the foundation of semantic search and RAG systems.
  • ExLlamaV2 High-performance inference library optimized for NVIDIA GPUs, known for fast quantized inference and support for EXL2 quantization format.

F

  • Fine-Tuning Training a pre-trained model on additional data to specialize its behavior, improve task performance, or adjust its output style.
  • Flash Attention An algorithm that computes attention more efficiently by reducing VRAM reads, speeding up prefill and enabling longer context.
  • FP16 (Half Precision) 16-bit floating-point format used for AI model weights — half the memory of FP32 with minimal quality loss for inference.

G

  • GDDR6X The GPU memory standard used in RTX 30 and 40-series high-end cards, delivering up to 1,008 GB/s.
  • GDDR7 The latest GPU memory standard, used in RTX 50-series cards, with roughly double GDDR6X bandwidth.
  • GGML The predecessor file format to GGUF for storing quantized LLMs, used by early versions of llama.cpp.
  • GGUF The standard file format for quantized LLMs, used by llama.cpp, Ollama, and LM Studio.

I

  • INT4 (4-bit Integer) 4-bit integer quantization format — the practical minimum precision for running large language models on consumer hardware.

K

  • KoboldCpp Inference server with a web UI designed for creative writing and roleplay — built on llama.cpp with additional sampling controls and story management features.
  • KV Cache Memory storage for the key-value attention states of all tokens in your current context.
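KV cache size is easy to estimate from the model's shape. A minimal sketch; the layer and head counts below are illustrative assumptions (roughly the shape of a modern 8B model), not values from this glossary:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int) -> int:
    """Memory needed to cache attention states for a full context.

    The leading 2 accounts for one K tensor plus one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# 32 layers, 8 KV heads (grouped-query attention), head_dim 128,
# 8192-token context, FP16 cache (2 bytes per value)
print(kv_cache_bytes(32, 8, 128, 8192, 2) / 2**30)  # 1.0 GiB
```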

L

  • llama.cpp A C++ inference engine that runs quantized GGUF models on CPU, GPU, or both simultaneously.
  • LLM (Large Language Model) A neural network trained on large amounts of text that can generate, summarize, translate, and reason about language.
  • LM Studio Desktop application for downloading and running local LLMs with a graphical interface — the easiest entry point for local AI on Windows and macOS.
  • LoRA (Low-Rank Adaptation) Parameter-efficient fine-tuning technique that adds small trainable weight matrices to a frozen base model — enabling custom model training on consumer hardware.
  • LPDDR5X Low-power high-bandwidth RAM used in Apple Silicon chips as the unified memory substrate.

M

  • Memory Bandwidth How fast data moves between memory and the processor, measured in GB/s.
  • MLX Apple's machine learning framework optimized for Apple Silicon — enables fast local LLM inference on M-series Macs using unified memory.
  • Model Parameters (7B, 13B, 70B) The number of learned numerical weights in a model — the primary predictor of capability and VRAM requirement.
  • MoE (Mixture of Experts) Architecture where a model has many specialized sub-networks (experts) but activates only a subset per token — enabling larger models with lower compute costs.
  • Multimodal AI models that process and generate more than one type of data — typically text plus images, audio, or video.
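Because decoding reads every active weight once per generated token, memory bandwidth sets a hard ceiling on decode speed. A rough sketch (the 1,008 GB/s figure is the GDDR6X number above; the 4 GB model size is an illustrative assumption for a 4-bit 7B model):

```python
def max_decode_tps(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Theoretical upper bound on tokens/second for memory-bound decoding:
    all weights must stream from memory once per token. Real-world speeds
    land below this because of compute and overhead."""
    return bandwidth_bytes_per_s / model_bytes

# 1,008 GB/s card running a ~4 GB quantized model
print(round(max_decode_tps(1008e9, 4e9)))  # 252 tokens/s ceiling
```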

O

  • Ollama A tool that makes running local LLMs as simple as a single terminal command.

P

  • Prefill (Time to First Token) The phase where the model processes your entire input prompt before generating any output; its duration determines the time to first token.
  • Prompt Caching Reusing the computed KV cache state from a previous request's prefix — eliminating redundant compute for repeated system prompts or context.

Q

  • Quantization Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.

R

  • RAM (System RAM) General-purpose computer memory used by the CPU and OS — distinct from VRAM, but relevant for LLM offloading and CPU-only inference.
  • ROCm AMD's open-source GPU compute platform — the AMD equivalent of CUDA for running AI workloads on Radeon GPUs.

S

  • Speculative Decoding Inference acceleration technique that uses a smaller draft model to generate candidate tokens, verified in parallel by the main model — reducing effective latency.
  • System RAM (vs VRAM) General-purpose computer memory shared by the CPU and OS — slower than VRAM for GPU inference, but essential for CPU-only setups.
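A toy sketch of greedy speculative decoding, with the "models" stubbed out as deterministic next-token functions (everything here is illustrative; a real engine verifies the whole draft in one batched forward pass rather than token by token):

```python
def greedy_decode(next_token, prompt, n_new):
    """Baseline: the main model generates every token itself."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq

def speculative_decode(target, draft, prompt, n_new, k=4):
    """Draft proposes k tokens; target accepts the matching prefix and
    supplies one corrected (or bonus) token. With greedy decoding the
    output is identical to greedy_decode(target, ...) -- only faster."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1. Draft model guesses k tokens autoregressively (cheap).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model checks each guess in order.
        ctx = list(seq)
        for t in proposal:
            correct = target(ctx)
            if t != correct:
                ctx.append(correct)  # first mismatch: keep target's token
                break
            ctx.append(t)            # guess accepted
        else:
            ctx.append(target(ctx))  # all accepted: one bonus token free
        seq = ctx
    return seq[:len(prompt) + n_new]

# Stub "models": the draft agrees with the target except after a 5
target = lambda s: (s[-1] * 3 + 1) % 11
draft = lambda s: 0 if s[-1] == 5 else (s[-1] * 3 + 1) % 11
```

The key property is that verification never changes the output, only how many target passes are needed per token.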

T

  • Tensor Cores Specialized compute units inside NVIDIA GPUs designed for matrix multiplication — the core operation in neural network inference.
  • Tokens Per Second (t/s) The primary speed metric for LLM inference — how many tokens the model generates each second.

U

  • Unified Memory A single memory pool shared by the CPU and GPU on Apple Silicon chips.

V

  • VRAM (Video RAM) Dedicated high-speed memory on your GPU that stores model weights during inference.
  • VRAM Offloading Running some model layers in VRAM and the rest in system RAM when the model is too large to fit entirely on the GPU.