Full RAG — Local AI Glossary | CraftRigs

Full RAG is the traditional retrieval-augmented generation stack: ingest documents, split them into chunks, embed each chunk with an embedding model, store the vectors in a database, and at query time pull the top-k most similar chunks into the prompt before generation. For local AI builders, it's the heavyweight option next to lighter approaches like native web search or black-box RAG services.
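The stages above can be sketched end to end in a few dozen lines. This is a minimal illustration, not a production pipeline: the `embed` function is a toy hashed bag-of-words stand-in for a real embedding model, the "vector store" is a plain Python list, and every function name here is hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # toy vector size; real embedding models fix their own dimension


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping windows of words."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def embed(text):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word, count in Counter(re.findall(r"\w+", text.lower())).items():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents):
    """Ingest: chunk every document and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]


def retrieve(index, query, k=2):
    """Return the top-k (score, chunk) pairs by cosine similarity."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query, hits):
    """Assemble retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunk for _, chunk in hits)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


docs = [
    "The GPU has 24 GB of VRAM and runs the chat model.",
    "Backups run nightly at 2 AM to the NAS.",
]
index = build_index(docs)
hits = retrieve(index, "How much VRAM does the GPU have?", k=1)
prompt = build_prompt("How much VRAM does the GPU have?", hits)
```

In a real stack, `embed` would call an embedding model and `index` would live in a store such as Chroma, Qdrant, FAISS, or pgvector. The shape of the data flow stays the same.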

What "Full" Actually Means

The "full" label distinguishes it from shortcut approaches. A full pipeline owns every stage: document loaders, chunking strategy, embedding model selection, vector store (Chroma, Qdrant, FAISS, pgvector), retrieval logic, reranking, and prompt assembly. Each layer is a knob you tune. Each layer is also a thing that can break, drift, or silently return garbage chunks that poison the context window.
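One common guard against garbage chunks is a similarity floor applied before prompt assembly. The sketch below assumes retrieval returns `(score, chunk)` pairs with higher scores meaning closer matches. The function name and the default threshold of 0.3 are illustrative assumptions, and a sensible cutoff depends on the embedding model and the corpus.

```python
def filter_hits(hits, min_score=0.3, max_chunks=4):
    """Keep only hits at or above min_score, best first, capped at max_chunks.

    hits: iterable of (score, chunk) pairs, higher score = more similar.
    min_score: hypothetical default; tune it per embedding model.
    """
    kept = [(score, chunk) for score, chunk in hits if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:max_chunks]


sample = [(0.82, "relevant passage"), (0.12, "unrelated noise"), (0.45, "partial match")]
print(filter_hits(sample))
```

Returning an empty list when nothing clears the floor lets the caller fall back to answering without context rather than padding the prompt with weak matches.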

When You Skip It

Tools like Ollama 0.18.1 added native web search precisely because full RAG is overkill for many workflows. If your knowledge source is the open web, a search API plus a fetched page beats spinning up a vector DB. If your corpus is small, stuffing it directly into a long context window can outperform retrieval entirely. Full RAG earns its weight when you have a large, private, relatively static corpus (internal docs, code repos, research libraries), where the one-time cost of embedding amortizes across many queries and similarity search reliably surfaces the relevant passages.
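Whether a corpus is "small" enough to stuff into the context can be checked roughly before building anything. The sketch below assumes about 4 characters per token, a crude rule of thumb for English text rather than a real tokenizer count, and reserves some of the window for the question and the answer. The function names and default values are hypothetical.

```python
def estimate_tokens(documents, chars_per_token=4):
    """Rough token estimate; a real tokenizer will give different numbers."""
    return sum(len(doc) for doc in documents) // chars_per_token


def fits_in_context(documents, context_tokens, reserve_tokens=1024):
    """True if the whole corpus plausibly fits, leaving reserve_tokens free."""
    return estimate_tokens(documents) <= context_tokens - reserve_tokens


corpus = ["a" * 4000] * 3  # three documents of 4,000 characters each
print(estimate_tokens(corpus))        # 3000
print(fits_in_context(corpus, 8192))  # True
print(fits_in_context(corpus, 2048))  # False
```

If the check passes with room to spare, skipping retrieval is worth trying first. If it fails by a wide margin, that is one signal that a retrieval pipeline may be worth its overhead.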

Why It Matters for Local AI

Running full RAG locally means your embedding model and vector DB compete for the same VRAM and RAM budget as the LLM itself. A 7B chat model plus a 500 MB embedding model plus a vector index that grows with your corpus adds up fast on a single-GPU rig. Builders increasingly ask whether the retrieval quality justifies that overhead, or whether a longer context window, a web-search shortcut, or a hosted retrieval service is the better trade for their hardware.
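A back-of-the-envelope calculation makes the budget concrete. The figures below are assumptions chosen for illustration: 4-bit quantized weights for the 7B model, 768-dimensional float32 vectors in a flat (uncompressed) index, and one million chunks. The estimate ignores the KV cache, activations, and any index metadata, so treat it as a lower bound.

```python
def model_weight_bytes(num_params, bits_per_weight):
    """Approximate size of model weights alone."""
    return num_params * bits_per_weight // 8


def vector_index_bytes(num_chunks, dim, bytes_per_value=4):
    """Raw size of a flat index: one float32 vector per chunk."""
    return num_chunks * dim * bytes_per_value


llm = model_weight_bytes(7_000_000_000, bits_per_weight=4)  # assumed 4-bit quantization
embedder = 500_000_000                                      # the 500 MB embedding model
index = vector_index_bytes(1_000_000, dim=768)              # assumed corpus and dimension

total = llm + embedder + index
print(f"LLM weights:  {llm / 1e9:.2f} GB")
print(f"Embedder:     {embedder / 1e9:.2f} GB")
print(f"Vector index: {index / 1e9:.2f} GB")
print(f"Total:        {total / 1e9:.2f} GB")
```

Under these assumptions the vector index is nearly as large as the quantized chat model. The index term scales linearly with chunk count, which is why it dominates as the corpus grows.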