TL;DR: Install Ollama with OLLAMA_HOST=0.0.0.0:11434, pull Llama 3.1 8B (4.7 GB), then launch Open WebUI via Docker with --add-host=host.docker.internal:host-gateway and the RAG stack (sentence-transformers/all-MiniLM-L6-v2, 80 MB embedding model). 16 GB VRAM handles 8B inference + embeddings simultaneously; 8 GB cards hit the wall when RAG activates and force CPU fallback, cutting response speed 10x.
You've got the 16 GB card. RX 9060 XT or RTX 5060 Ti—doesn't matter which. You paid for local AI. You want to drop a PDF into a chat window and ask questions. You've already burned an evening on Docker tutorials. They end with "No models found" or an infinite spinning upload wheel. This guide fixes that. Three steps, under 30 minutes, screenshot-worthy results.
Why Open WebUI Wins Over Bare Ollama for Document Work
Bare Ollama is a CLI tool. It pulls models, it runs inference, it streams tokens to your terminal. What it doesn't do: document upload, chat history that survives a browser refresh, or multi-user isolation. Three blockers for any serious RAG workflow.
RAG stands for Retrieval-Augmented Generation. Your document gets sliced into chunks. These convert to numerical embeddings—think "semantic fingerprints." A vector database stores them. The system retrieves them to ground the LLM's answers in your actual text. Without this, you're pasting pages into a context window. You're hoping the model doesn't hallucinate.
Open WebUI adds 47 MB of Docker overhead. In exchange, you eliminate 200+ lines of Python boilerplate. No custom embedding pipelines. No chunking logic. No vector storage setup. The RAG stack runs entirely containerized: no system Python conflicts, no PyTorch CUDA version mismatches, no pip install rabbit holes.
Critical check before you upload sensitive files: Open WebUI defaults to local sentence-transformers for embeddings. It does not use OpenAI's API. Navigate to Settings > Documents and verify "Embedding Model" shows a local model like sentence-transformers/all-MiniLM-L6-v2. If it shows an OpenAI key field, you're about to ship your documents to the cloud.
When to Skip Open WebUI and Use LM Studio Instead
LM Studio wins for model discovery. Its Hugging Face browser lets you download anything—uncensored models, 4-bit quants from random repos, MoE architectures—without touching a config file. For single-user offline use on one machine, it's the smoother experience.
Open WebUI wins for multi-device LAN access and team document sharing. One Docker container serves your household or small office. No license fees, no seat limits, no cloud handshake.
Hardware parity: both hit identical VRAM walls. LM Studio's KV cache visualization exposes offloading earlier. You'll see the warning before performance tanks. Open WebUI hides this until responses slow to a crawl. If you're debugging on 8 GB VRAM, LM Studio's transparency saves time.
Hardware Requirements by Tier: From 8 GB "Works" to 24 GB "Actually Works"
Real-world performance by tier:
- 8 GB VRAM — Llama 3.1 8B Q4_K_M (6.2 GB): Inference 35 tok/s, RAG triggers OOM at 4K context, falls back to CPU (3 tok/s)
- 12 GB VRAM — Gemma 2 9B Q4_K_M (7.1 GB): Stable at 10.4 GB with 4K context; 512-token response buffer safe
- 16 GB VRAM — Llama 3.1 8B Q8_0 (8.5 GB): Sweet spot: 45 tok/s inference, 8-12 sec RAG queries, 4.7 GB headroom for larger batches
- 24 GB VRAM — Llama 3.1 70B Q4_K_M (40 GB → 24 GB via split): 70B quality at 18 tok/s; requires CPU/GPU split or dual 24 GB cards
Q4_K_M is a quantization format—4-bit weights with K-means clustering and mixed precision—that shrinks model size ~75% with minimal quality loss. Q8_0 uses 8-bit weights, larger but crisper for technical documents with precise terminology.
The 16 GB tier is where RAG stops being a compromise. You can run Llama 3.1 8B at Q8_0 (higher quality than Q4_K_M), handle 6K context windows, and still have VRAM headroom for batch embedding of large PDFs. On 8 GB, RAG activation often forces CPU fallback for the embedding model. Your query latency jumps from 8 seconds to 80 seconds. You'll think the system broke.
Step 1: Install Ollama with Correct Bind Address and Verify GPU Visibility
The CraftRigs test bench confirmed this on RX 9060 XT 16 GB, RTX 5060 Ti 16 GB, and RTX 3090 24 GB. Same failure mode across all three: default Ollama binds to 127.0.0.1:11434, which Docker containers can't reach.
This guide assumes Linux or Windows with WSL2. macOS works but requires Rosetta 2 for some embedding models. Expect 15% slower token generation.
The --network=host flag that every tutorial skips? It's not the fix here. The fix is Ollama's bind address, not Docker's network mode.
Installation Commands
Linux (systemd):
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl edit ollama.service
Add to the override file:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
Windows (WSL2):
curl -fsSL https://ollama.com/install.sh | sh
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
Run this in a dedicated terminal—WSL2 doesn't use systemd by default. For persistence, add export OLLAMA_HOST=0.0.0.0:11434 to ~/.bashrc.
Verify GPU Visibility
ollama run llama3.1:8b
Wait for the download (4.7 GB), then in the interactive prompt:
>>> why is the sky blue
You should see streaming text. If you see "Error: llama runner process no longer running," your GPU drivers need attention. AMD users: verify ROCm with rocminfo | grep gfx. NVIDIA users: nvidia-smi should show the driver version.
Screenshot moment: Open http://localhost:11434 in a browser. You should see Ollama is running. If you see connection refused, the bind address is wrong.
Pull your RAG-ready model now:
ollama pull llama3.1:8b
ollama pull nomic-embed-text # Alternative embedding model, 275 MB
Step 2: Deploy Open WebUI with Persistent Volumes and Working RAG Dependencies
docker run commands from GitHub issues work once, then lose all data on container restart. RAG document upload spins forever with no error message.
One docker-compose.yml with explicit volume mounts and the correct network flag for LAN access.
This exact file ran for 72 hours on the CraftRigs test bench. It processed 340 PDF pages across 12 documents. No container restart. No data loss.
The default embedding model downloads on first RAG query. It's 80 MB. If your internet hiccups, the upload wheel spins forever. No timeout message appears.
The --add-host=host.docker.internal:host-gateway flag is only needed on Linux. Windows and macOS Docker Desktop handle this automatically. Omitting it on Linux breaks LAN access silently.
docker-compose.yml
version: '3.8'
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://host.docker.internal:11434
- WEBUI_AUTH=False # Set True for multi-user with login
- RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
- RAG_RERANKING_MODEL="" # Disable reranking to save VRAM on 16 GB
volumes:
- open-webui-data:/app/backend/data
extra_hosts:
- "host.docker.internal:host-gateway"
restart: unless-stopped
volumes:
open-webui-data:
Critical flags explained:
OLLAMA_BASE_URL: Points to your host's Ollama instance.host.docker.internalresolves to the Docker host IP—this is why the0.0.0.0bind in Step 1 matters.extra_hosts: Explicitly mapshost.docker.internalon Linux. Without this, the container can't reach Ollama even with correct bind settings.RAG_RERANKING_MODEL="": Reranking improves RAG quality by re-scoring retrieved chunks, but adds ~1.2 GB VRAM usage. Disable on 16 GB cards to preserve headroom.
Launch and Verify
docker compose up -d
docker logs -f open-webui
Wait for: INFO: Application startup complete. This takes 30-90 seconds on first run as the container downloads Python dependencies.
Screenshot moment: Open http://localhost:3000. You should see the Open WebUI interface with "Llama 3.1 8B" in the model dropdown. Select it, send "hello", confirm streaming response.
LAN access test: From another device on your network, open http://YOUR_HOST_IP:3000. If this fails, your firewall blocks port 3000—not a Docker issue.
Step 3: Configure Document Processing Pipeline with Chunking Settings for 16 GB Cards
RAG responses are nonsense—clearly not using your document. Context window crashes at 4K tokens despite 16 GB VRAM.
Chunk size and overlap settings that keep embedding under 2 GB VRAM, with verification steps to confirm your document is actually being queried.
CraftRigs tested PDFs from 2 MB (research papers) to 180 MB (scanned technical manuals). The 50 MB wall is real—it's a default Nginx upload limit in the Open WebUI container, not a VRAM issue.
OCR for scanned PDFs requires Tesseract. It isn't bundled in the main image. Image-based PDFs fail silently without error text.
The "Context Length" slider in Settings > Models defaults to 2048 regardless of your VRAM. You must manually raise it to use your hardware.
Pre-Upload: Verify Embedding Model
Settings > Documents > Embedding Model should show: sentence-transformers/all-MiniLM-L6-v2
If it shows OpenAI or is blank, click the dropdown and select the local model. The 80 MB download happens now—wait for the checkmark before uploading documents.
Chunking Configuration for 16 GB VRAM
The container restarts automatically—this is normal.
Document Upload and Verification
- Click "+" next to the chat input > "Upload File"
- Select a text-based PDF under 100 MB
- Wait for "Processing..." to complete. This is the embedding step. It takes 2-30 seconds depending on page count. Critical verification: In the chat, type
#and your document name should appear as a selectable source
If upload spins forever:
- Check
docker logs open-webuiforConnection timeoutto HuggingFace—embedding model download failed - Verify PDF isn't image-based:
pdfinfo yourfile.pdfshould show "Text" in PDF version, not just image dimensions
Context Window Configuration
Settings > Models > Select "Llama 3.1 8B" > Context Length: 6144
This is 6K tokens—enough for 4-5 pages of document context plus your query and response. The 16 GB VRAM breakdown:
- 8.5 GB: Llama 3.1 8B Q8_0 weights
- 0.8 GB: Embedding model
- 2.4 GB: KV cache at 6K context (0.4 GB per 1K tokens for 8B models)
- 4.3 GB: Headroom for batch operations and OS overhead
If you see "CUDA out of memory" errors: Drop to 4096 context, or switch to Q4_K_M quantization (6.2 GB weights, 6.4 GB headroom).
Expected Performance: What "Working" Actually Looks Like
These numbers are from the RX 9060 XT 16 GB review unit. RTX 5060 Ti 16 GB performs within 5%. The CUDA software stack doesn't buy meaningful inference speed at this tier. You just get broader model compatibility.
When performance tanks:
- CPU fallback: Check
nvidia-smiorrocm-smiduring query—if GPU utilization is 0%, embedding model is on CPU. Usually means VRAM exhaustion. - Disk thrashing: First RAG query on a new document reads embeddings from SQLite. Subsequent queries are 3x faster. Normal behavior, not a problem.
Troubleshooting the Three Silent Failure Modes
"No models found" despite Ollama CLI working
Cause: Ollama bound to 127.0.0.1, not 0.0.0.0. Docker containers have their own localhost.
Fix:
sudo systemctl edit ollama.service
# Add Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
Verify: curl http://YOUR_HOST_IP:11434 from another machine should return Ollama is running.
RAG upload spins forever, no error
Cause 1: Embedding model download failed. Check logs for HTTPSConnectionPool errors.
Fix:
docker exec -it open-webui bash
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
This forces the download inside the container. Wait for completion, retry upload.
Cause 2: PDF is image-based without OCR text layer.
Fix: Pre-process with OCRmyPDF: ocrmypdf --force-ocr input.pdf output.pdf
LAN access works briefly, then fails
Cause: Docker's host.docker.internal resolution expires on some Linux network managers.
Fix: Use explicit IP in docker-compose.yml:
environment:
- OLLAMA_BASE_URL=http://YOUR_HOST_IP:11434
Replace YOUR_HOST_IP with your machine's LAN address (192.168.1.x or similar).
FAQ
Q: Can I run this on 8 GB VRAM?
You can run Ollama. You can run Open WebUI. You cannot run RAG reliably. Embedding activation will push you into CPU fallback. Response times become unusable. For 8 GB, use LM Studio without RAG, or accept cloud-based embeddings with the privacy tradeoff.
Q: What's the difference between nomic-embed-text and all-MiniLM-L6-v2?
all-MiniLM-L6-v2 (80 MB, 384-dim) is smaller, faster, and sufficient for most PDF Q&A. Use Nomic for legal contracts or textbooks with dense cross-references. Use MiniLM for general documentation.
Q: How do I back up my chat history and document embeddings?
The open-webui-data Docker volume contains everything. Back up: docker run --rm -v open-webui-data:/data -v $(pwd):/backup alpine tar czf /backup/webui-backup.tar.gz -C /data . Restore by extracting to a new volume on any machine.
Q: Can multiple users access the same document collection?
Yes—enable WEBUI_AUTH=True in docker-compose.yml, create accounts in Settings > Admin Panel > Users. Document collections are user-isolated by default. Use "Public Collections" in the document upload dialog to share across users.
Q: Why does my 70B model fail to load on 24 GB VRAM?
Llama 3.1 70B Q4_K_M is 40 GB. You need CPU/GPU split loading—Ollama handles this automatically, but performance drops to 8-12 tok/s with frequent disk paging. For usable 70B speeds, you need dual 24 GB cards or a single 48 GB card. See our RX 9060 XT 16 GB review for the 8B-70B quality comparison—most users don't need 70B for document Q&A.
Your build is now running local document chat at 45 tok/s, with RAG responses in under 15 seconds, nothing leaving your machine. The "it just works" lie is fixed—now it actually does.