RAG

Retrieval-Augmented Generation: a technique where an LLM pulls relevant text from an external document store at query time and uses it as context for its answer.

Retrieval-Augmented Generation (RAG) is the standard pattern for letting a local LLM answer questions about documents it was never trained on. Instead of stuffing every file into the prompt, you embed your corpus, retrieve the top matching chunks at query time, and inject only those chunks into the model's context window.

How the Pipeline Works

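A RAG stack has three moving parts: an embedding model that converts text into vectors, a vector store that holds those vectors, and a generator LLM that reads the retrieved chunks and writes the answer. On a local rig, all three usually run on the same machine. The embedding model and the generator typically share one GPU, while the vector store usually sits in system RAM or on disk. The embedding model is small (often under 1 GB of VRAM), so the bulk of your VRAM budget goes to the generator. Llama 3.1 8B at Q4_K_M is a common choice of generator, and it handles RAG workflows comfortably on a 16GB card. The 72.6 HumanEval score often quoted for it is a coding benchmark reported for the unquantized instruct model, so treat it as a rough capability signal rather than a measure of RAG quality.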
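The three parts above can be sketched in a few lines of plain Python. This is a toy illustration, not a production setup: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, `VectorStore` is an in-memory list rather than a real vector database, and the sample chunks are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Minimal in-memory store: keeps (chunk, vector) pairs in a list."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        self.items.append((chunk, embed(chunk)))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Index a tiny made-up corpus, then retrieve at query time.
store = VectorStore()
for chunk in [
    "The PSU must supply at least 750 watts for this build.",
    "Q4_K_M quantization shrinks model weights to roughly 4 bits each.",
    "The case supports GPUs up to 330 mm long.",
]:
    store.add(chunk)

hits = store.top_k("How many watts does the PSU need?", k=1)
print(hits)
```

In a real stack you would swap `embed` for an actual embedding model and pass `hits` to the generator as context. The retrieve-then-read flow stays the same.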

Black-Box vs Full RAG

There are two flavors worth knowing. Black-box RAG treats the LLM as a sealed unit: you only control the prompt and the retrieved context. Full RAG goes further and fine-tunes the generator on your retrieval format so it learns to weight cited chunks more heavily. Black-box is faster to ship and works with any GGUF model in Ollama or LM Studio. Full RAG demands fine-tuning infrastructure, but it can produce better grounding on niche corpora where the base model has little prior knowledge to fall back on.
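In the black-box case, the only lever you have is the text you send. A minimal sketch of that step might look like the following. The function name `build_prompt` and the instruction wording are assumptions chosen for illustration, not a required format for any particular model or runtime.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from a question and retrieved chunks."""
    # Number each chunk so the model can cite its sources as [1], [2], ...
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How many watts does the PSU need?",
    [
        "The PSU must supply at least 750 watts for this build.",
        "The case supports GPUs up to 330 mm long.",
    ],
)
print(prompt)
```

The resulting string goes to whatever runtime hosts your model. Because nothing here touches the weights, the same prompt works across models, which is what makes black-box RAG quick to ship.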

Why It Matters for Local AI

RAG is the single biggest reason a 16GB local rig can compete with frontier cloud models on your own data: you're not asking the model to memorize your docs, just to read them. It also reshapes your VRAM math. The KV-cache grows linearly with the number of tokens in context, so a model that fits at 4K tokens may OOM at 16K. Plan your chunk size and the number of chunks you retrieve against your card's VRAM, not against the context length the model technically supports.
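To put rough numbers on the KV-cache point, here is a back-of-the-envelope estimate. The defaults are assumptions: they follow Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache. The figure ignores model weights, activations, and runtime overhead, and a runtime that quantizes the KV-cache will use less.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 32,        # assumed: Llama 3.1 8B
    n_kv_heads: int = 8,       # assumed: grouped-query attention, 8 KV heads
    head_dim: int = 128,       # assumed: 4096 hidden size / 32 attention heads
    bytes_per_value: int = 2,  # assumed: fp16 cache
) -> int:
    """Estimate KV-cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


for ctx in (4096, 16384):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV-cache")
```

Under these assumptions the cache costs 128 KiB per token, so going from 4K to 16K of context moves it from about 0.5 GiB to about 2 GiB on top of the weights. Smaller chunks or a lower top-k are the cheapest ways to claw that back.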