Local embedding models convert your documents into searchable vectors on your own GPU — all-MiniLM-L6-v2 uses only ~90 MB VRAM and handles most personal use cases. Step up to nomic-embed-text or BGE-M3 when you need multilingual support or higher retrieval precision. They're small enough to run alongside your LLM without stealing VRAM.
If you're already running a local LLM for privacy, sending your documents to an embedding API defeats the point. This guide closes that gap.
What Are Embeddings and Why Do Local Builders Need Them?
An embedding is a list of numbers — a vector — that represents the meaning of a piece of text. Texts with similar meaning get vectors that are numerically close together in high-dimensional space, which enables semantic search.
Under the hood: the embedding model (typically a BERT-variant encoder) maps input tokens to a fixed-size dense vector, anywhere from 384 to 1,536 dimensions depending on the model. Those vectors get stored in a vector database. When you search, your query gets embedded the same way, and the database finds the stored vectors closest to it.
The analogy that actually makes it click: keywords are zip codes — exact address matching. Embeddings are GPS coordinates — you can find what's nearby even if the words don't match exactly. Searching "car repair" finds documents about "auto maintenance" because the GPS coordinates are close, even though none of the words overlap.
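To make the "GPS coordinates" idea concrete, here is a toy sketch in plain Python. The four-dimensional vectors are invented for the demo (real models emit 384-1,536 dimensions), but the cosine-similarity math is exactly what vector databases compute:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings, made up for illustration
car_repair  = [0.9, 0.8, 0.1, 0.0]   # "car repair"
auto_maint  = [0.8, 0.9, 0.2, 0.1]   # "auto maintenance"
cake_recipe = [0.0, 0.1, 0.9, 0.8]   # "cake recipe"

print(cosine_similarity(car_repair, auto_maint))   # high: same neighborhood
print(cosine_similarity(car_repair, cake_recipe))  # low: far apart
```

No word overlaps between "car repair" and "auto maintenance", yet their vectors score close to 1.0; that is the whole trick behind semantic search.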
Why Run Embeddings Locally Instead of Using APIs
Cloud embedding APIs are cheap enough that the cost argument isn't always decisive. But the privacy argument is.
OpenAI's text-embedding-3-small costs $0.02 per million tokens. Embedding a 10,000-document personal library at an average of 500 tokens per document comes to $0.10 — basically free. The economics don't matter until you're re-indexing frequently or working at scale.
At 100,000 documents (~50M tokens), the API costs about $1.00. That's still cheap. But it also means sending your private content — notes, PDFs, emails, research — to external servers. If you're building a local LLM setup specifically because you don't want your data leaving your machine, that's a contradiction.
The same job runs in 8-12 minutes on an RTX 4070 at $0.00. No data leaves your machine. You own the index. You can re-embed anytime without worrying about costs.
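The cost arithmetic above is easy to check, using the text-embedding-3-small rate quoted earlier:

```python
# Back-of-envelope check of the API-cost claims
PRICE_PER_MILLION_TOKENS = 0.02  # USD, OpenAI text-embedding-3-small

def embedding_cost(num_docs, avg_tokens_per_doc):
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(embedding_cost(10_000, 500))   # 10k-doc personal library: ten cents
print(embedding_cost(100_000, 500))  # 100k docs: about a dollar
```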
Note
For public documents or research content you're comfortable uploading, cloud APIs are fine. The local route makes most sense when your documents are private: personal notes, business data, client files, research in progress.
Local Embedding VRAM Cost vs Inference VRAM
This is where the "you need a separate GPU" myth falls apart. Embedding models are tiny compared to LLMs:
| Model | VRAM | Best For |
|---|---|---|
| all-MiniLM-L6-v2 | ~90 MB | Personal notes, English-only, fast indexing |
| nomic-embed-text | <300 MB | General purpose, good quality, multilingual baseline |
| | | Technical docs, code, higher precision |
| BGE-M3 | ~1.1 GB | Multilingual, long docs (up to 8K tokens), best overall |

These run alongside your LLM without eating into its VRAM budget. Embedding models aren't doing inference continuously; they run during document indexing, then sit idle while the LLM handles conversations.
Even BGE-M3 at 1.1 GB VRAM fits easily alongside a Q4_K_M 7B model (4.5 GB) on a 12 GB card, leaving ~6 GB still available for context. There's no meaningful tradeoff here.
Retrieval Quality Comparison on MTEB Benchmark
MTEB (Massive Text Embedding Benchmark) measures retrieval quality across a standardized set of tasks. Higher is better, and the differences are meaningful in practice:
| Model | MTEB Score | Notes |
|---|---|---|
| all-MiniLM-L6-v2 | | Adequate for personal note search, English-only |
| nomic-embed-text | 62.4 | Significant jump; good multilingual baseline |
| | | Strong for technical and code documents |
| BGE-M3 | | Best local option, especially for multilingual corpora |
| OpenAI text-embedding-3-large | | Top cloud option; local BGE-M3 beats it |

That last row matters: BGE-M3 running on your GPU outperforms OpenAI's best embedding API on retrieval quality. You get better results, keep your data private, and pay nothing per query.
Tip
Start with nomic-embed-text. It's available directly via ollama pull nomic-embed-text, takes under 300 MB of VRAM, and hits an MTEB score of 62.4, more than enough for most personal knowledge bases. Upgrade to BGE-M3 only if you need multilingual support or are seeing retrieval gaps.
How Local Embeddings and Vector Search Work
Two phases: indexing and retrieval.
Indexing is the one-time (or periodic) job: run your documents through the embedding model to generate vectors, store them in Qdrant alongside the source text. This is the expensive part — it takes a few minutes — but you only do it when your document collection changes.
Retrieval is fast and happens every query: embed the user's question (takes ~50ms on GPU), find the nearest stored vectors in Qdrant, return the matching document chunks, inject them into the LLM prompt.
The embedding model produces the same vector space for both documents and queries — semantic similarity is cosine distance in that shared space. If a document chunk and a query are close in that space, they're semantically similar.
Indexing Pipeline — Document → Vector → Qdrant
Before indexing, you chunk your documents. Full documents are usually too long for the embedding model's context window, and long chunks hurt retrieval precision. The standard approach: 256-512 tokens per chunk, with 50-token overlap between chunks to avoid cutting sentences at boundaries.
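A minimal sliding-window chunker shows the shape of that approach. Whitespace-split words stand in for model tokens here; a real pipeline would count tokens with the model's tokenizer:

```python
def chunk_words(words, chunk_size=512, overlap=50):
    """Split a token list into overlapping windows of at most chunk_size."""
    step = chunk_size - overlap  # advance by size minus overlap each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # last window reached the end of the document
    return chunks

# A fake 1,200-token document
doc = ["tok%d" % i for i in range(1200)]
chunks = chunk_words(doc, chunk_size=512, overlap=50)
print(len(chunks))                         # three windows cover the doc
print(chunks[1][:50] == chunks[0][-50:])   # consecutive chunks share 50 tokens
```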
Then:
- Run each chunk through the embedding model → get a 384 or 768-dimensional vector
- Store vector + metadata (source file, chunk index, original text) in a Qdrant collection
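The store step can be sketched against Qdrant's REST API (`PUT /collections/<name>/points`). The `docs` collection name, file path, and placeholder vectors below are invented for illustration; in a real pipeline the vectors come from the embedding model:

```python
import json
import urllib.request

def build_upsert_body(chunks):
    """Assemble Qdrant points: id + vector + payload per chunk."""
    points = []
    for i, (vector, text, source) in enumerate(chunks):
        points.append({
            "id": i,
            "vector": vector,
            "payload": {"source": source, "chunk_index": i, "text": text},
        })
    return {"points": points}

body = build_upsert_body([
    ([0.1] * 768, "First chunk text...", "notes/project.md"),
    ([0.2] * 768, "Second chunk text...", "notes/project.md"),
])

def upsert(collection="docs", host="http://localhost:6333"):
    # Requires a running Qdrant instance on the default port
    req = urllib.request.Request(
        f"{host}/collections/{collection}/points",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)

print(len(body["points"]))
```

Storing the original text in the payload is what lets retrieval hand readable chunks back to the LLM later, not just vectors.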
One-time indexing cost for 10,000 documents averaging 500 tokens each (about 40,000 total chunks) on an RTX 4070 with nomic-embed-text: roughly 3-5 minutes. After that, search is instant.
Retrieval Pipeline — Query → Vector → Results → LLM
This is RAG (Retrieval-Augmented Generation) in its basic form:
- User asks a question → embed the question (~50ms on GPU)
- Qdrant finds the top-K nearest chunks by cosine similarity (typically K=3-10)
- Top chunks get injected into the LLM prompt as retrieved context
- LLM generates an answer grounded in those chunks
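The whole retrieval loop fits in a few lines once the embedding model is stubbed out. The vectors and chunk texts below are made up; only the shape of the pipeline (rank by cosine similarity, take top-K, inject into the prompt) is the point:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# A tiny in-memory "vector database": (vector, text) pairs
index = [
    ([0.9, 0.1, 0.0], "Notes on the Q3 budget."),
    ([0.1, 0.9, 0.1], "Draft blog post about embeddings."),
    ([0.8, 0.2, 0.1], "Spreadsheet summary of Q3 spending."),
]

def retrieve(query_vector, k=2):
    """Return the texts of the k chunks nearest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, chunks):
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

query_vector = [0.85, 0.15, 0.05]  # pretend embedding of the question
top_chunks = retrieve(query_vector, k=2)
prompt = build_prompt("How much did we spend in Q3?", top_chunks)
print(top_chunks)  # the two budget-related chunks, not the blog draft
```

Swap the stub for a real embedding call and the in-memory list for Qdrant, and this is the entire retrieval half of RAG.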
The LLM isn't guessing from training data — it's reading your documents and answering based on what it finds. This is how you build a personal assistant that actually knows your notes.
Running Qdrant Locally
Qdrant is an open-source vector database written in Rust. It's fast, lightweight, and has a clean HTTP API.
Setup is one command:
```bash
docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
```
Default ports: 6333 (HTTP REST API), 6334 (gRPC). Collections persist to disk — your vectors survive restarts.
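Two REST calls against that HTTP port cover the basics: create a collection sized to your embedding model's dimensions, then search it. The `docs` collection name and the placeholder query vector are assumptions for the sketch:

```python
import json
import urllib.request

HOST = "http://localhost:6333"  # Qdrant's default HTTP port

# Collection config: vector size must match your embedding model (768
# for nomic-embed-text); cosine distance is the usual choice for text.
create_body = {"vectors": {"size": 768, "distance": "Cosine"}}

# Search request: placeholder query vector, top-5 results
search_body = {"vector": [0.1] * 768, "limit": 5}

def create_collection(name="docs"):
    # Requires a running Qdrant instance
    req = urllib.request.Request(
        f"{HOST}/collections/{name}",
        data=json.dumps(create_body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)

def search(name="docs"):
    req = urllib.request.Request(
        f"{HOST}/collections/{name}/points/search",
        data=json.dumps(search_body).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

print(create_body["vectors"]["size"])
```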
Memory usage is predictable: roughly 50 MB base, plus an HNSW index graph of about 0.5-2 MB per 10,000 768-dimensional vectors. The vector data itself costs 4 bytes per dimension in float32 (~3 KB per vector), held in RAM by default or moved out with Qdrant's on-disk storage. So a 100,000-vector collection needs only 5-20 MB of RAM for the index graph, plus around 300 MB for the vectors unless you enable on-disk storage; either way it's a light load next to an LLM.
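As a rule of thumb, the raw vector data costs 4 bytes per dimension in float32, separate from whatever the index graph adds on top:

```python
def vector_bytes(num_vectors, dims, bytes_per_float=4):
    """Raw float32 storage for a set of embedding vectors."""
    return num_vectors * dims * bytes_per_float

per_10k_mb = vector_bytes(10_000, 768) / 1024 / 1024
per_100k_mb = vector_bytes(100_000, 768) / 1024 / 1024
print(round(per_10k_mb, 1))   # roughly 29 MB per 10,000 vectors
print(round(per_100k_mb, 1))  # roughly 293 MB per 100,000 vectors
```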
Qdrant is a long-running service, not a script. Run it under Docker, PM2, or systemd so it starts with your machine.
"You Need a Separate GPU for Embeddings" — Not True
This misconception comes from conflating text embedding models with vision embedding models (like CLIP), which are much larger. Text embedding models are small.
all-MiniLM-L6-v2 uses about 90 MB of VRAM and runs in the background of any GPU setup. Even an RTX 3060 with a 7B model loaded has 6-7 GB of VRAM left over. That's room for a dozen embedding models simultaneously.
Embedding and inference also don't run at the same time in a typical RAG setup. The sequence is: retrieve documents (embed query, search Qdrant) → inject context → run LLM. The LLM waits for retrieval to finish. VRAM isn't split between competing processes — it's used sequentially.
Even in a worst-case scenario — BGE-M3 (1.1 GB) plus a Q4_K_M 7B model (4.5 GB) on a 12 GB card — you have 6.4 GB remaining. That covers a 4K-8K context window with room to spare.
Warning
The "separate GPU" advice usually applies to production serving environments where you're running embedding and inference simultaneously for multiple users. For a personal workstation with one user, sequential execution is fine and VRAM coexistence is a non-issue.
Embeddings in Practice — Personal Knowledge Base on RTX 4070
Hardware: RTX 4070 (12 GB GDDR6X), Qdrant 1.7 running locally on port 6333, nomic-embed-text served via Ollama.
Indexing test: 2,500 Markdown notes averaging 800 words each. Chunked into roughly 12,000 vectors at 512 tokens per chunk with 50-token overlap. Full index time: 4.5 minutes.
Query latency breakdown:
| Step | Time |
|---|---|
| Embed query (GPU) | ~50 ms |
| Qdrant vector search | <10 ms |
The search component — embedding plus vector lookup — takes under 60ms. Total response time is dominated by LLM generation, not retrieval.
Use cases where this setup shines:
- Finding notes by concept when you don't remember the exact words you used
- Answering questions about your own writing ("what did I say about X project last month?")
- Summarizing clusters of related documents on a topic
- Building a personal assistant that has context on your work
Getting started in under an hour:
```bash
# Start Qdrant
docker run -p 6333:6333 qdrant/qdrant

# Pull the embedding model via Ollama
ollama pull nomic-embed-text

# Pull a chat model
ollama pull mistral

# Install Open WebUI (includes RAG pipeline)
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
```
Open WebUI has a built-in RAG pipeline that connects Ollama + Qdrant. Point it at your nomic-embed-text model, upload a folder of documents, and you have a working personal knowledge base.
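If you'd rather script against it directly, Ollama also exposes the embedding model over HTTP (`POST /api/embeddings` on its default port 11434). A minimal sketch, assuming nomic-embed-text has been pulled:

```python
import json
import urllib.request

# Request shape for Ollama's embeddings endpoint
request_body = {
    "model": "nomic-embed-text",
    "prompt": "what did I say about the X project last month?",
}

def embed(host="http://localhost:11434"):
    # Requires a running Ollama instance with the model pulled
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=json.dumps(request_body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]  # list of 768 floats

print(request_body["model"])
```

Feed the returned vector straight into a Qdrant search request and you have the query half of the pipeline without any framework in between.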
Related Concepts for Local AI Builders
RAG (Retrieval-Augmented Generation) — the full pipeline that uses embeddings to ground LLM responses in your documents. Embeddings are the core mechanism that makes RAG work.
Qdrant setup guide — step-by-step Qdrant installation, collection creation, and basic search API. Start here if you're setting up a vector database for the first time.
VRAM — embedding models consume VRAM alongside your LLM, but the amounts are small enough that coexistence is the default, not the exception.
Context length — limits how much retrieved text you can inject into the LLM prompt. Longer context windows let you retrieve more chunks per query, which improves answer quality.
The full picture: run your LLM and your embedding model on the same GPU, store vectors in local Qdrant, query with Open WebUI. Private, fast, and free after hardware cost. The "build a local RAG pipeline" guide covers the full integration end-to-end.