TL;DR: Ollama + Open WebUI gives you private document Q&A on your own hardware. The install takes under 10 minutes. What determines whether it actually works is your embedding model. Use mxbai-embed-large for business docs; nomic-embed-text if you're VRAM-constrained. This guide covers Windows 11 and macOS M4 Pro, Open WebUI v0.5.x, and the retrieval benchmarks nobody else runs.
Why Your Last RAG Setup Probably Failed
You followed three different tutorials. Ollama installed fine. Open WebUI launched. You uploaded a PDF, asked a question, and got either silence, hallucinations, or answers from the model's training data instead of your document. You spent two hours in Discord threads before giving up.
The problem isn't you. It's that every guide covers a different slice of the stack. One stops at Ollama installation. Another assumes you already have Docker running. A third uses Open WebUI 0.4.x settings that moved or disappeared in 0.5.x. None of them test whether retrieval actually works — they just show you how to click "upload."
Local RAG has five components: Ollama (the runner), an LLM (the answer generator), an embedding model (the retrieval engine), Open WebUI (the interface), and the RAG pipeline configuration (chunking, overlap, top-K). Miss any one and you have a broken system that technically runs. This guide covers all five, end-to-end, with version-specific settings and hardware-tested embedding recommendations.
The 5-Component Stack No Single Guide Covers
Where Guides Fail
Fine — installation is covered everywhere
Almost never tested or compared
Settings change between versions; docs lag
Buried in submenus, never explained with tradeoffs The embedding model determines roughly 80% of retrieval accuracy. The LLM barely matters for whether your PDF's actual content surfaces. Most guides tell you to "use nomic-embed-text" because it's the default, then move on. They don't test whether it retrieves the right paragraph from a 40-page contract.
Open WebUI Version Gaps That Break Most Tutorials
Open WebUI v0.5.x moved RAG settings from the main settings panel into Document Settings under Workspace. Chunk size and overlap now live in a modal that opens from the document upload interface. Top-K and score threshold moved to the model configuration page under "Advanced Params." Tutorials written for 0.4.x send you hunting through menus that no longer exist.
This guide targets Open WebUI v0.5.x specifically. If you're on an older version, update. If you're on a newer version, the core paths should remain stable — the 0.5.x redesign was substantial.
What You Need Before Setting Up Local RAG
Local RAG runs on modest hardware, but the embedding model adds VRAM overhead many guides ignore. Here's the actual requirement breakdown, tested on RTX 4090 (24 GB VRAM) and M4 Pro (24 GB unified memory) as of April 2026.
Hardware Requirements — VRAM and RAM for Ollama RAG
Notes
Embedding + model compete for VRAM; expect slowdowns
Comfortable headroom for concurrent queries
Future-proofed for larger embedding models VRAM math that matters: Embedding models load into VRAM alongside your LLM and stay resident. nomic-embed-text needs ~0.5 GB. mxbai-embed-large needs ~1.2 GB. On an 8 GB card running a 7B Q4 model (~4.5 GB), you're already at ~6 GB before context and overhead. There's no room for error.
macOS unified memory note: Apple Silicon pools RAM and VRAM. A 16 GB MacBook Air can run nomic-embed-text + 7B comfortably. A 24 GB M4 Pro handles mxbai-embed-large + 13B with headroom. The Metal backend is less VRAM-efficient than CUDA, so add 20% overhead to the numbers above.
Supported OS, Drivers, and Software Versions
| Platform | Requirements |
|---|---|
| Windows 11 | CUDA 12.x (NVIDIA driver 545+), WSL2 optional but not required for Ollama native |
| macOS | 14+ on Apple Silicon (M1 or newer); Intel Macs unsupported for GPU acceleration |
| Docker | Desktop 4.25+ if using containerized Open WebUI; native install preferred for simplicity |
Install Ollama on Windows 11 and macOS
Ollama installation is straightforward. The RAG-specific configuration comes later.
Windows 11:
- Download from ollama.com and run the installer.
- Verify in PowerShell:
ollama --versionshould return 0.3.x or newer. - Pull your first model:
ollama pull llama3.1:8b(or your preferred size).
macOS:
- Download the Apple Silicon build from ollama.com.
- Move to Applications and launch. Ollama runs in the menu bar.
- Open Terminal and pull a model:
ollama pull llama3.1:8b.
Test that Ollama works: ollama run llama3.1:8b and ask "What is 2+2?" Exit with /bye.
Install and Configure Open WebUI
Open WebUI provides the web interface for document upload and chat. You can run it via Docker or install natively. Native is simpler for most users; Docker adds flexibility if you're already containerized.
Native Install (Recommended)
Windows 11:
pip install open-webui
open-webui serve
macOS:
pip install open-webui
open-webui serve
Access at http://localhost:8080. Create an admin account on first launch.
Docker Install
docker run -d -p 3000:8080 --gpus all -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000. The --gpus all flag passes NVIDIA GPUs through; omit on macOS (Docker Desktop handles Metal automatically).
Pull and Configure Your Embedding Model
This is where most guides fail. They tell you to "enable RAG" without specifying which embedding model to use or why it matters. Here's the tested comparison.
Embedding Model Benchmark: Business Document Retrieval
We tested three embedding models on a 47-page commercial contract, measuring whether the correct paragraph surfaced in the top-3 retrieved chunks for 20 specific questions (e.g., "What is the termination notice period?", "What are the liability caps?").
Best For
Business docs, contracts, technical reports
General use, VRAM-constrained builds
Desperation only, tiny edge devices Pull your chosen model:
ollama pull mxbai-embed-large
# or
ollama pull nomic-embed-text
Configure Open WebUI to Use Your Embedding Model
- Open Open WebUI → Settings (gear icon, bottom left) → Admin Settings.
- Navigate to Documents tab.
- Set Embedding Model Engine to "Ollama."
- Set Embedding Model to
mxbai-embed-large:latest(or your choice). - Set RAG Template — the default works, but for business docs we recommend:
Use the following context to answer the question. If the context doesn't contain the answer, say "I don't have that information in the provided documents."
Context:
{{context}}
Question: {{question}}
Answer:
This template reduces hallucination by explicitly permitting "I don't know."
Configure RAG Pipeline Settings (Open WebUI v0.5.x)
Document chunking determines whether your PDF's structure survives ingestion. Default settings work for short documents; long contracts need tuning.
Access Document Settings
- Go to Workspace (sidebar) → Documents.
- Click Document Settings (gear icon, top right of documents list).
- Configure:
Why
Preserves paragraph boundaries in legal/technical docs
Maintains context across page breaks
Allows longer retrieved context for complex questions
Configure Retrieval Parameters
- Go to Settings → Models → select your chat model → Advanced Params.
- Set Top K to 5 (retrieves 5 chunks; default 3 often misses relevant sections).
- Set Score Threshold to 0.5 (filters noise; lower if you're getting no results, raise if getting garbage).
Upload a PDF and Test Your RAG Pipeline
Upload Your Document
- In any chat, click the + (attachment) button.
- Select your PDF. Upload progress appears; processing happens in background.
- Wait for the checkmark — large PDFs (100+ pages) take 30–60 seconds with mxbai-embed-large.
Test Queries (Verification Checklist)
Run these questions on a contract PDF to verify retrieval quality:
Bad Result (Debug)
Wrong date or "I don't have that information" → check chunk overlap
Generic contract language → embedding model failing, switch to mxbai
Mentions payment but wrong details → chunk size too small, increase to 2500
Missing parties → top-K too low, increase to 7 If retrieval fails systematically, your embedding model is the culprit. Switch to mxbai-embed-large and re-ingest.
Hardware-Specific Tuning
8 GB VRAM Builds (GTX 1070, RTX 4060, M1 8 GB)
- Use nomic-embed-text, not mxbai-embed-large.
- Run 7B models only (llama3.1:8b, mistral:7b).
- Reduce chunk size to 1000 to speed embedding.
- Expect 5–10 second delays on first query (model loading).
16 GB VRAM Builds (RTX 4080, RX 7900 XTX, M3 Pro 18 GB)
- mxbai-embed-large + 13B models work comfortably.
- Can run concurrent sessions (2–3 users) with minor slowdown.
- Chunk size 2000, overlap 200 as recommended.
24 GB+ VRAM Builds (RTX 4090, M4 Pro 24 GB, M4 Max)
- mxbai-embed-large + 70B Q4_K_M is viable for highest accuracy.
- Multiple embedding models can stay resident for different document types.
- Suitable for team deployments (5+ concurrent users with proper queueing).
What to Do Next
You now have a working local RAG system. The embedding model choice — mxbai-embed-large for accuracy, nomic-embed-text for efficiency — determines whether you trust the answers. Test with your actual documents, measure retrieval quality with specific questions, and tune chunk settings if structure matters.
For team deployments, consider: dedicated hardware (separate inference from embedding), document versioning (Open WebUI doesn't track PDF updates), and backup strategies (the database grows with document count). We'll cover production RAG architecture in a future guide.