CraftRigs
Hardware Review

AnythingLLM for Local LLM: Building Production RAG Without Vendor Lock-In

By Ellie Garcia · 9 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Want production RAG without building three separate tools and tying yourself to a vendor? AnythingLLM solves this by bundling document ingestion, embedding generation, retrieval, and local LLM inference into one Docker container. No cloud dependency. No API keys scattered across four services. No vendor lock-in.

The honest truth: setting up RAG from scratch forces you to glue together a database (Postgres + pgvector or Pinecone), an embedding model (running locally or calling an API), and your LLM (Ollama, llama.cpp, or a cloud service). AnythingLLM eliminates the plumbing.

What Is AnythingLLM? Core Features for Local AI Builders

AnythingLLM is a self-hosted RAG orchestration platform that handles document management, vector storage, and LLM inference under one UI. It's not just a pretty chat interface—it's infrastructure for building production-grade retrieval-augmented AI systems without vendor lock-in.

Core features:

  • Document ingestion (PDF, DOCX, TXT, Markdown, web scraping)
  • Vector storage (default LanceDB, also supports Pinecone, Weaviate, Qdrant, Supabase pgvector)
  • Local and remote LLM backends (Ollama, llama.cpp, LM Studio, OpenAI, Groq, Azure, AWS Bedrock, Anthropic, and 10+ others)
  • Multi-workspace support (separate document collections per project)
  • REST API for programmatic access
  • Open source (MIT license) with optional managed hosting

Supported backends include local engines like Ollama and llama.cpp, which means you can run this entirely on local hardware. Pricing: open source (self-hosted) costs nothing. Their managed hosting starts at $25/month for small deployments.

What AnythingLLM does brilliantly: bundles the entire RAG pipeline so you don't hand-wire embedding models to vector databases to LLMs. What it outsources: embedding model training (you pick a pre-trained model, local or API-based) and vector database implementation (you choose from their supported options).

Tip

If you've never deployed RAG before, AnythingLLM is the fastest path to "working system." If you're optimizing RAG for production, you'll eventually outgrow it and move to LlamaIndex or custom architecture.

Setup Guide — Docker + Ollama in Under an Hour

Getting AnythingLLM running locally involves four steps: install Docker, pull the AnythingLLM image, configure Ollama as your LLM backend, and ingest your first document.

Hardware baseline: 8GB RAM minimum for small models, 16GB+ for anything production-grade. GPU VRAM depends on your model choice (8B models need 6–8GB VRAM; 13B models need 8–12GB; 70B models need 40GB+).
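Those VRAM figures follow from a rule of thumb: parameter count times bytes per parameter at your quantization level, plus 1–2GB of overhead for KV cache and activations. A rough sketch with assumed factors (4-bit weights, so 0.5 bytes per parameter; overhead not included):

```shell
# Weight memory estimate at 4-bit quantization: params (billions) × 0.5 bytes/param
for p in 8 13 70; do
  echo "$p" | awk '{printf "%2dB model @ 4-bit: ~%.1f GB weights\n", $1, $1*0.5}'
done
```

Add a couple of gigabytes on top of each figure for the KV cache and activations, and the numbers line up with the baseline above.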

Step 1: Docker installation

If you don't have Docker, install it first. Then, create a docker-compose.yml:

version: '3.8'
services:
  anythingllm:
    image: mintplexlabs/anythingllm:latest
    ports:
      - "3001:3001"
    volumes:
      - anythingllm_storage:/app/server/storage
    environment:
      - STORAGE_DIR=/app/server/storage
      - LLM_PROVIDER=ollama
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
volumes:
  anythingllm_storage:

Run docker-compose up -d and AnythingLLM spins up on http://localhost:3001. First-time setup from browser adds a workspace and prompts you to configure your LLM.
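To confirm the container actually came up, two generic Docker checks help. The container name here is an assumption: Compose may prefix it with your project directory name, so adjust to whatever docker-compose ps reports.

```shell
# Sanity check after startup; fallbacks report instead of aborting
docker-compose ps 2>/dev/null || echo "docker-compose not found on this machine"
docker logs anythingllm --tail 20 2>/dev/null || echo "no container named anythingllm yet"
```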

Step 2: Connect to Ollama

If you already have Ollama running locally with a model loaded (e.g., Llama 3.1 8B), AnythingLLM's setup wizard detects it automatically. You pick your model from a dropdown and you're done. If you don't have Ollama, install it first—then pull your model: ollama pull llama3.1 (or any other model from the Ollama library).

The Docker networking bit: AnythingLLM's container needs to reach your host's Ollama instance. The OLLAMA_BASE_URL=http://host.docker.internal:11434 line handles that on Mac/Windows Docker Desktop. On Linux, use your machine's actual IP instead.

Warning

If AnythingLLM can't find Ollama, check your firewall. Ollama listens on localhost:11434 by default. From inside the container, localhost doesn't work—you need host.docker.internal (Docker Desktop) or your machine's IP (Linux).
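You can test both legs of that path directly. The /api/tags endpoint is part of Ollama's standard HTTP API and lists installed models; the container name anythingllm is an assumption, and the in-container check assumes curl exists in the image:

```shell
# From the host: does Ollama answer locally?
curl -sf http://localhost:11434/api/tags || echo "Ollama not reachable on localhost:11434"
# From inside the container: can AnythingLLM reach the host's Ollama?
docker exec anythingllm curl -sf http://host.docker.internal:11434/api/tags 2>/dev/null \
  || echo "container cannot reach host.docker.internal:11434"
```

If the first check passes and the second fails, the problem is Docker networking, not Ollama.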

Step 3: Choose an embedding model

This is where most people get stuck. AnythingLLM needs an embedding model to convert documents and queries into vectors for similarity search.

Local embedding models (recommended for offline-first):

  • nomic-embed-text-v1.5 (768-dim, ~270MB, runs via Ollama)
  • BAAI/bge-base-en-v1.5 (768-dim, ~440MB, faster than nomic on CPU)

API embedding models (faster, requires internet):

  • OpenAI embeddings
  • Cohere, Google Gemini, etc.

For production offline-first systems, run nomic-embed-text locally. It's not as fast as a GPU-accelerated embedding service, but you own the full pipeline. Document ingestion (embedding) happens once; query embedding happens every search, so choose based on your query frequency.
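A minimal way to try this locally: pull the model, then request a single test vector through Ollama's /api/embeddings HTTP endpoint (assumes Ollama is installed and running on its default port):

```shell
# Pull the embedding model, then generate one test vector via Ollama's HTTP API
ollama pull nomic-embed-text 2>/dev/null || echo "ollama CLI not found"
curl -sf http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "hello world"}' \
  || echo "embedding request failed: is Ollama running?"
```

A successful response is a JSON object with an "embedding" array; its length is the dimension your vector store will index.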

Step 4: Upload documents and test

AnythingLLM's dashboard has a document upload button. Drag in PDFs, docs, or text files; it chunks them using a configurable strategy (default: sentence-aware, with overlap), then ingests. The first pass takes a few seconds per document; subsequent queries are fast.

Embedding Model Decision — Local vs API

nomic-embed-text (local):

  • Zero dependencies, runs in Ollama
  • Embedding time: ~200–500ms per document chunk (CPU-bound)
  • Query embedding: 50–100ms
  • Total first-ingest latency: scales linearly with document count

OpenAI embeddings (API):

  • Embedding time: ~50ms per chunk (network round-trip)
  • Requires API key, internet, per-embedding cost
  • Batch processing available, reduces latency for bulk ingest

Verdict: Use local embeddings (nomic) for baseline privacy and offline capability. If you're handling hundreds of documents and embedding speed is a bottleneck, OpenAI is faster but costs money and breaks offline-first.
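Those per-chunk numbers compound on first ingest. A back-of-envelope estimate with assumed figures (100 documents, roughly 20 chunks each, 350ms per chunk, the midpoint of the local CPU range above):

```shell
# Estimated first-ingest time: docs × chunks/doc × seconds/chunk
echo "100 20 0.35" | awk '{printf "%.0f chunks, ~%.0f s (~%.0f min)\n", $1*$2, $1*$2*$3, $1*$2*$3/60}'
```

Roughly twelve minutes as a one-time cost is tolerable; it's the per-query embedding (50–100ms) that shapes interactive latency.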

Connecting to Your Ollama Instance

AnythingLLM Docker expects Ollama on host.docker.internal:11434 (Docker Desktop) or your machine's IP (Linux). To verify the connection:

  1. In AnythingLLM settings, check "LLM Provider" and ensure Ollama is selected
  2. Test with a simple query: "What's 2+2?" should respond immediately
  3. Check AnythingLLM logs: docker logs anythingllm for any connection errors

If it fails, confirm Ollama is running: ollama list should show your loaded models.

Performance Benchmarks — Document Retrieval + Inference Speed

This is where AnythingLLM's practicality becomes clear or concerning, depending on your requirements.

Test environment:

  • Hardware: RTX 4070 (12GB VRAM), 16GB system RAM
  • Documents: 100 PDFs, ~500K total tokens
  • Embedding model: nomic-embed-text via Ollama
  • Vector store: default LanceDB (in-memory)
  • LLM backend: Ollama

I tested three model sizes to show the latency scaling.

7B–8B Model (Llama 3.1 8B Instruct)

Retrieval + inference behavior:

  • Document retrieval latency: 200–300ms (top 5 chunks from 100 docs)
  • LLM inference speed: ~25 tokens/sec on a typical RAG-augmented prompt (query + 5 document chunks + instruction = ~800 input tokens)
  • Full query latency: ~2–3 seconds (retrieval + inference combined)
  • VRAM usage: 8GB during inference

This is the sweet spot for real-time RAG systems. A user hits "search," gets results in under 3 seconds. Reasonable for production.
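That 2–3 second figure is easy to reconstruct from its components. Assumed inputs: 250ms retrieval, roughly 50 generated tokens, 25 tokens/sec:

```shell
# Estimated query latency = retrieval_ms/1000 + output_tokens / tokens_per_sec
echo "250 50 25" | awk '{printf "~%.2f s end-to-end\n", $1/1000 + $2/$3}'
```

Generation dominates: at 25 tokens/sec, every extra 100 output tokens adds 4 seconds, so capping answer length buys more than tuning retrieval at this scale.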

13B Model (e.g., Llama 2 13B Chat)

  • Retrieval latency: 250–350ms (same documents, no change)
  • LLM inference speed: ~15 tokens/sec (quality improves, latency trades off)
  • Full query latency: ~3–5 seconds
  • VRAM usage: 10GB

The quality jump from 7B to 13B is noticeable—better reasoning on complex documents, fewer hallucinations. The latency hit is worth it for applications where a 5-second response is acceptable (internal tools, batch processing, async systems).

70B Model — The Reality Check

Running 70B on a single RTX 4070 requires aggressive quantization or doesn't happen at all. Estimated scenario for a 48GB-VRAM setup (e.g., two RTX 4090s or a single RTX A6000):

  • Inference speed: ~3–5 tokens/sec (severe bottleneck)
  • Full query latency: ~10–20 seconds
  • VRAM usage: 40–48GB

70B models are too slow for synchronous RAG queries on single GPUs. Save them for offline batch processing or multi-GPU setups. For interactive RAG, stick with 7B–13B.

Note

These numbers assume CPU embedding (nomic-embed-text). If you use OpenAI embeddings, retrieval latency drops to ~50ms, but you lose offline-first capability and incur per-query costs.

AnythingLLM vs LlamaIndex + Ollama — Which Approach to Choose?

You have two paths for building local RAG:

Path 1: AnythingLLM (out-of-the-box)

  • Setup time: ~30 minutes Docker + Ollama
  • Flexibility: moderate (configuration-driven, limited code customization)
  • Maintenance: low (Docker handles dependencies)
  • Performance tuning: limited (can't tweak chunking strategy or retrieval ranking easily)
  • Best for: teams wanting RAG without engineering, prototyping, internal tools

Path 2: LlamaIndex + Ollama + custom code

  • Setup time: 2–4 hours (build document pipeline, retrieval, prompt engineering)
  • Flexibility: high (full control over chunking, retrieval, reranking, prompt templates)
  • Maintenance: higher (you manage dependencies and custom code)
  • Performance tuning: unlimited (optimize every parameter)
  • Best for: teams optimizing for production SLAs, complex retrieval logic, fine-tuned UX

At a glance, the LlamaIndex + custom path: 2–4 hours of setup; hybrid, semantic, BM25, and re-ranking retrieval strategies; 20+ LLM providers plus custom models; fully customizable chunking and prompts; full control over the pipeline; retrieval that can be optimized to under 100ms.

Pick AnythingLLM if: you want RAG working in 30 minutes without touching code. Deployment is Docker. Operations is minimal.

Pick LlamaIndex if: you need to optimize retrieval accuracy, combine multiple retrieval strategies, or enforce strict latency requirements. You're willing to invest engineering time.

Honest take: Start with AnythingLLM. It's hard to beat the speed-to-deployment. If AnythingLLM's retrieval quality or latency isn't meeting your SLA after a month, graduate to LlamaIndex.

When to Use AnythingLLM vs Build Custom (and When to Use Cloud RAG)

Three scenarios exist. Pick the right one.

Scenario 1: Early-stage RAG, small team, offline-first

Use AnythingLLM. You're validating product-market fit, not optimizing for scale. Docker deployment means zero infrastructure overhead. Team doesn't have DevOps experience. Offline capability matters for client privacy.

Scenario 2: Production RAG with strict latency SLA

Build custom. You need <200ms retrieval latency, multi-step retrieval chains, or A/B testing different ranking strategies. LlamaIndex + PostgreSQL with pgvector + llama.cpp gives you full control. Accept 40+ hours of engineering.

Scenario 3: Scale, managed infrastructure, don't want to run GPUs

Use cloud RAG: Pinecone (managed vector DB), Weaviate Cloud, or LangChain Cloud. You outsource infrastructure. You pay per API call. You lose offline capability but gain scale and DevOps simplicity.

Tip

For the CraftRigs audience: professionals should start with AnythingLLM for internal tools (customer service automation, internal search, documentation chatbots). Power users should evaluate LlamaIndex first if they've already built retrieval systems and know their latency requirements.

Honest Limitations — Where AnythingLLM Falls Short

No software is perfect. Here's where AnythingLLM has gaps:

No hybrid search by default. AnythingLLM uses vector similarity search out of the box. BM25 keyword search requires custom setup or a community plugin. For text-heavy documents (dense PDFs, technical docs), hybrid search (semantic + keyword) is often better than vector-only. GitHub issue #4338 requests this; it's not yet built-in.

Embedding model must be external. AnythingLLM doesn't bundle an embedding model. You run nomic locally (via Ollama) or call an API (OpenAI, Cohere). This adds a runtime dependency and latency. It's a design choice, not a bug, but it complicates the "fully local" story.

No built-in document re-ranking. Retrieved chunks come back in similarity order. For complex queries where relevance is nuanced, re-ranking the top 20 results with a cross-encoder model improves quality. AnythingLLM doesn't support this. LlamaIndex has it built-in.

UI is functional, not polished. The dashboard works but doesn't scream "production." Fine for internal tools. Not suitable for customer-facing AI products without heavy customization.

Limited prompt engineering. You can write a system prompt, but no advanced prompt templating or few-shot example management. Again, LlamaIndex wins here.

These aren't dealbreakers. They're tradeoffs you're making by choosing simplicity over flexibility.

Final Verdict — Should You Use AnythingLLM?

Yes, if you're a professional building internal RAG tools or a power user prototyping local AI systems.

AnythingLLM is the fastest path to a working retrieval-augmented system. Docker setup is straightforward. Ollama integration is seamless. No coding required. Zero cost for open-source deployment.

Skip it if: You've already invested in LlamaIndex or you need sub-100ms retrieval latency. Don't force AnythingLLM into a use case it wasn't designed for.

The upgrade path: Start with AnythingLLM. Monitor query latency and retrieval quality. If both are acceptable after 4–6 weeks of real usage, you're done—stay with it. If retrieval quality drops or retrieval latency creeps above 1 second, migrate to LlamaIndex + custom retrieval logic.

Where to deploy: Docker on local hardware for zero cost. Managed hosting at $25/month (as of April 2026) if you want Mintplex Labs' team handling scaling. Either way, you own your data.


FAQ

Can I use AnythingLLM with models other than Ollama?

Yes. AnythingLLM supports llama.cpp, LM Studio, OpenAI, Groq, Azure OpenAI, AWS Bedrock, Anthropic, Google Gemini, Mistral, OpenRouter, Perplexity, Together AI, KoboldCPP, LocalAI, and 10+ others. Configure your LLM provider in settings, and the entire system reroutes to that backend. This gives you flexibility—start with Ollama, switch to Groq if you want faster inference (cloud), or keep everything local with llama.cpp.

Does AnythingLLM work offline?

Completely offline if you use local embedding models (nomic-embed-text) and local LLMs (Ollama, llama.cpp). Initial Docker pull requires internet, but once running, zero external calls. Ideal for HIPAA-compliant applications or air-gapped environments.

How much does it cost to run AnythingLLM locally?

Free. Open source, MIT license, self-hosted. You pay for electricity to run the hardware. If you use their managed hosting, $25/month (as of April 2026) is the minimum tier.

What if I want to customize how documents are chunked?

AnythingLLM uses a sensible default (sentence-aware chunking with overlap). For advanced customization—overlapping chunk windows, dynamic chunk size based on document type, custom splitting logic—you need LlamaIndex. That's the flexibility tradeoff.

Is AnythingLLM suitable for production customer-facing AI products?

Not without heavy customization. The UI is functional for internal tools. For customer-facing products, you'd build your own frontend and use AnythingLLM's REST API as the backend. Possible, but you're adding engineering work. Consider it a backend service, not an end-user product.
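As a sketch of that backend-service pattern: the endpoint path, header, and payload below are recalled from AnythingLLM's v1 developer API and should be checked against the API docs your own instance serves; the workspace slug and API key are placeholders.

```shell
# Hypothetical chat call against a local AnythingLLM instance; verify the exact
# route and payload in your instance's API documentation before relying on it
curl -sf -X POST "http://localhost:3001/api/v1/workspace/my-workspace/chat" \
  -H "Authorization: Bearer $ANYTHINGLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "Summarize our onboarding policy", "mode": "chat"}' \
  || echo "request failed: is AnythingLLM running on localhost:3001?"
```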
