Can I use this with scanned PDFs or images?

No — not without OCR. Open WebUI's RAG extracts text from native PDFs only. For scanned documents, run through OCR first (Adobe Acrobat, Tesseract, or macOS Preview's 'Export as PDF' with text recognition). Then upload the OCR'd version.

How do I update Open WebUI without losing my documents?

Documents are stored in Open WebUI's database, not in Ollama. For native installs: back up ~/.open-webui (macOS) or %USERPROFILE%\.open-webui (Windows). For Docker: your -v volume mount persists data. Update with pip install -U open-webui or docker pull then restart.

Why does my M4 Mac show slower embedding than my older RTX card?

Metal Performance Shaders (MPS) are less optimized for embedding workloads than CUDA. The gap is 20–40% on identical memory configurations. The M4's unified memory advantage appears when running larger models that wouldn't fit in discrete VRAM at all.

Can multiple users access the same document collection?

Yes. Upload documents to a **Workspace** (not personal chat), then add users to that workspace. Each workspace maintains its own document store and RAG settings. Admin Settings → Users → configure access.

Is my data actually private? Does anything phone home?

Ollama and Open WebUI are fully local by default. Verify: disconnect your machine from the internet, then upload a document and query it. Everything works. The only network traffic is optional: Ollama checks for model updates (disable with OLLAMA_NO_HISTORY=1), and Open WebUI can fetch model info from Hugging Face (disable in Settings → General → Allow Online Model Search).

Ollama Open WebUI RAG Guide: PDF Q&A in 30 Minutes: A Step-by-Step Guide [2026]

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: Ollama + Open WebUI gives you private document Q&A on your own hardware. The install takes under 10 minutes. What determines whether it actually works is your embedding model. Use mxbai-embed-large for business docs; nomic-embed-text if you're VRAM-constrained. This guide covers Windows 11 and macOS M4 Pro, Open WebUI v0.5.x, and the retrieval benchmarks nobody else runs.

Why Your Last RAG Setup Probably Failed

You followed three different tutorials. Ollama installed fine. Open WebUI launched. You uploaded a PDF, asked a question, and got either silence, hallucinations, or answers from the model's training data instead of your document. You spent two hours in Discord threads before giving up.

The problem isn't you. It's that every guide covers a different slice of the stack. One stops at Ollama installation. Another assumes you already have Docker running. A third uses Open WebUI 0.4.x settings that moved or disappeared in 0.5.x. None of them test whether retrieval actually works — they just show you how to click "upload."

Local RAG has five components: Ollama (the runner), an LLM (the answer generator), an embedding model (the retrieval engine), Open WebUI (the interface), and the RAG pipeline configuration (chunking, overlap, top-K). Miss any one and you have a broken system that technically runs. This guide covers all five, end-to-end, with version-specific settings and hardware-tested embedding recommendations.

The 5-Component Stack No Single Guide Covers

Where Guides Fail

Fine — installation is covered everywhere

Almost never tested or compared

Settings change between versions; docs lag

Buried in submenus, never explained with tradeoffs The embedding model determines roughly 80% of retrieval accuracy. The LLM barely matters for whether your PDF's actual content surfaces. Most guides tell you to "use nomic-embed-text" because it's the default, then move on. They don't test whether it retrieves the right paragraph from a 40-page contract.

Open WebUI Version Gaps That Break Most Tutorials

Open WebUI v0.5.x moved RAG settings from the main settings panel into Document Settings under Workspace. Chunk size and overlap now live in a modal that opens from the document upload interface. Top-K and score threshold moved to the model configuration page under "Advanced Params." Tutorials written for 0.4.x send you hunting through menus that no longer exist.

This guide targets Open WebUI v0.5.x specifically. If you're on an older version, update. If you're on a newer version, the core paths should remain stable — the 0.5.x redesign was substantial.

What You Need Before Setting Up Local RAG

Local RAG runs on modest hardware, but the embedding model adds VRAM overhead many guides ignore. Here's the actual requirement breakdown, tested on RTX 4090 (24 GB VRAM) and M4 Pro (24 GB unified memory) as of April 2026.

Hardware Requirements — VRAM and RAM for Ollama RAG

Notes

Embedding + model compete for VRAM; expect slowdowns

Comfortable headroom for concurrent queries

Future-proofed for larger embedding models VRAM math that matters: Embedding models load into VRAM alongside your LLM and stay resident. nomic-embed-text needs ~0.5 GB. mxbai-embed-large needs ~1.2 GB. On an 8 GB card running a 7B Q4 model (~4.5 GB), you're already at ~6 GB before context and overhead. There's no room for error.

macOS unified memory note: Apple Silicon pools RAM and VRAM. A 16 GB MacBook Air can run nomic-embed-text + 7B comfortably. A 24 GB M4 Pro handles mxbai-embed-large + 13B with headroom. The Metal backend is less VRAM-efficient than CUDA, so add 20% overhead to the numbers above.

Supported OS, Drivers, and Software Versions

Platform	Requirements
Windows 11	CUDA 12.x (NVIDIA driver 545+), WSL2 optional but not required for Ollama native
macOS	14+ on Apple Silicon (M1 or newer); Intel Macs unsupported for GPU acceleration
Docker	Desktop 4.25+ if using containerized Open WebUI; native install preferred for simplicity

Install Ollama on Windows 11 and macOS

Ollama installation is straightforward. The RAG-specific configuration comes later.

Windows 11:

Download from ollama.com and run the installer.
Verify in PowerShell: ollama --version should return 0.3.x or newer.
Pull your first model: ollama pull llama3.1:8b (or your preferred size).

macOS:

Download the Apple Silicon build from ollama.com.
Move to Applications and launch. Ollama runs in the menu bar.
Open Terminal and pull a model: ollama pull llama3.1:8b.

Test that Ollama works: ollama run llama3.1:8b and ask "What is 2+2?" Exit with /bye.

Install and Configure Open WebUI

Open WebUI provides the web interface for document upload and chat. You can run it via Docker or install natively. Native is simpler for most users; Docker adds flexibility if you're already containerized.

Native Install (Recommended)

Windows 11:

pip install open-webui
open-webui serve

macOS:

pip install open-webui
open-webui serve

Access at http://localhost:8080. Create an admin account on first launch.

Docker Install

docker run -d -p 3000:8080 --gpus all -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Access at http://localhost:3000. The --gpus all flag passes NVIDIA GPUs through; omit on macOS (Docker Desktop handles Metal automatically).

Pull and Configure Your Embedding Model

This is where most guides fail. They tell you to "enable RAG" without specifying which embedding model to use or why it matters. Here's the tested comparison.

Embedding Model Benchmark: Business Document Retrieval

We tested three embedding models on a 47-page commercial contract, measuring whether the correct paragraph surfaced in the top-3 retrieved chunks for 20 specific questions (e.g., "What is the termination notice period?", "What are the liability caps?").

Best For

Business docs, contracts, technical reports

General use, VRAM-constrained builds

Desperation only, tiny edge devices Pull your chosen model:

ollama pull mxbai-embed-large
# or
ollama pull nomic-embed-text

Configure Open WebUI to Use Your Embedding Model

Open Open WebUI → Settings (gear icon, bottom left) → Admin Settings.
Navigate to Documents tab.
Set Embedding Model Engine to "Ollama."
Set Embedding Model to mxbai-embed-large:latest (or your choice).
Set RAG Template — the default works, but for business docs we recommend:

Use the following context to answer the question. If the context doesn't contain the answer, say "I don't have that information in the provided documents."

Context:
{{context}}

Question: {{question}}
Answer:

This template reduces hallucination by explicitly permitting "I don't know."

Configure RAG Pipeline Settings (Open WebUI v0.5.x)

Document chunking determines whether your PDF's structure survives ingestion. Default settings work for short documents; long contracts need tuning.

Access Document Settings

Go to Workspace (sidebar) → Documents.
Click Document Settings (gear icon, top right of documents list).
Configure:

Why

Preserves paragraph boundaries in legal/technical docs

Maintains context across page breaks

Allows longer retrieved context for complex questions

Configure Retrieval Parameters

Go to Settings → Models → select your chat model → Advanced Params.
Set Top K to 5 (retrieves 5 chunks; default 3 often misses relevant sections).
Set Score Threshold to 0.5 (filters noise; lower if you're getting no results, raise if getting garbage).

Upload a PDF and Test Your RAG Pipeline

Upload Your Document

In any chat, click the + (attachment) button.
Select your PDF. Upload progress appears; processing happens in background.
Wait for the checkmark — large PDFs (100+ pages) take 30–60 seconds with mxbai-embed-large.

Test Queries (Verification Checklist)

Run these questions on a contract PDF to verify retrieval quality:

Bad Result (Debug)

Wrong date or "I don't have that information" → check chunk overlap

Generic contract language → embedding model failing, switch to mxbai

Mentions payment but wrong details → chunk size too small, increase to 2500

Missing parties → top-K too low, increase to 7 If retrieval fails systematically, your embedding model is the culprit. Switch to mxbai-embed-large and re-ingest.

Hardware-Specific Tuning

8 GB VRAM Builds (GTX 1070, RTX 4060, M1 8 GB)

Use nomic-embed-text, not mxbai-embed-large.
Run 7B models only (llama3.1:8b, mistral:7b).
Reduce chunk size to 1000 to speed embedding.
Expect 5–10 second delays on first query (model loading).

16 GB VRAM Builds (RTX 4080, RX 7900 XTX, M3 Pro 18 GB)

mxbai-embed-large + 13B models work comfortably.
Can run concurrent sessions (2–3 users) with minor slowdown.
Chunk size 2000, overlap 200 as recommended.

24 GB+ VRAM Builds (RTX 4090, M4 Pro 24 GB, M4 Max)

mxbai-embed-large + 70B Q4_K_M is viable for highest accuracy.
Multiple embedding models can stay resident for different document types.
Suitable for team deployments (5+ concurrent users with proper queueing).

What to Do Next

You now have a working local RAG system. The embedding model choice — mxbai-embed-large for accuracy, nomic-embed-text for efficiency — determines whether you trust the answers. Test with your actual documents, measure retrieval quality with specific questions, and tune chunk settings if structure matters.

For team deployments, consider: dedicated hardware (separate inference from embedding), document versioning (Open WebUI doesn't track PDF updates), and backup strategies (the database grows with document count). We'll cover production RAG architecture in a future guide.