
Embedding

A dense numerical vector that represents text in a high-dimensional space — the foundation of semantic search and RAG systems.

An embedding is a fixed-length vector of floating-point numbers that represents a piece of text in high-dimensional space. Similar texts produce similar vectors (measured by cosine similarity or dot product). This property is what makes semantic search possible: you can retrieve documents based on meaning rather than keyword overlap.
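Cosine similarity is just the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. A minimal stdlib-only sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: 1.0 means
    identical direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Scaling a vector does not change its similarity score, only rotating it does:
# texts with similar meaning produce vectors pointing in similar directions.
```
Real embeddings have hundreds of dimensions, but the formula is identical. Many embedding models emit pre-normalized vectors, in which case the dot product alone gives the same ranking.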

Why Embeddings Matter for Local AI

In a retrieval-augmented generation (RAG) system, embeddings are the indexing mechanism. When you ingest documents, each chunk is converted to an embedding and stored in a vector database. At query time, your question is embedded and compared against all stored embeddings — the closest matches are retrieved and injected into the LLM's context.

Without embeddings, RAG would require keyword search, which misses synonyms, paraphrasing, and conceptual similarity. Embeddings handle all of these naturally.
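The retrieval step described above reduces to ranking stored vectors by similarity to the query vector. A toy sketch with hypothetical 3-dimensional vectors standing in for real model output (a production index would hold 384+ dimensional embeddings and live in a vector database, not a list):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical ingested chunks with hand-made stand-in vectors.
index = [
    {"text": "How to install a GPU",     "vec": [0.9, 0.1, 0.0]},
    {"text": "Best pasta recipes",       "vec": [0.0, 0.2, 0.9]},
    {"text": "Choosing a graphics card", "vec": [0.8, 0.3, 0.1]},
]

def retrieve(query_vec, index, top_k=2):
    """Return the top_k chunk texts ranked by similarity to the query vector."""
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in ranked[:top_k]]
```
A query vector pointing in the "hardware" direction retrieves both GPU chunks ahead of the pasta chunk, even though the texts share no keywords; that is the synonym and paraphrase robustness keyword search lacks.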

Embedding Models vs. LLMs

Embedding models are different from generative LLMs. They're typically smaller (22M–400M parameters vs. 7B+) and faster, because they only need to produce a vector — not generate text. Popular local embedding models include:

  • nomic-embed-text — fast, high quality, runs locally
  • mxbai-embed-large — strong retrieval performance
  • all-minilm — compact, good for resource-constrained setups

Ollama and llama.cpp both support running embedding models locally alongside generation models.
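With Ollama running locally, generating an embedding is a single HTTP call to its default port. A hedged stdlib-only sketch, assuming the default endpoint and the nomic-embed-text model have been pulled:

```python
import json
import urllib.request

# Default Ollama endpoint; adjust host/port if your server differs.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_request(model: str, text: str) -> bytes:
    """Serialize the JSON payload the embeddings endpoint expects."""
    return json.dumps({"model": model, "prompt": text}).encode("utf-8")

def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Send text to a locally running Ollama server, return its embedding."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```
Swapping the model name is all it takes to try mxbai-embed-large or all-minilm, but remember that vectors from different models are not comparable (see below).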

Vector Dimensions

Embedding dimensions determine how much information the vector can encode. Larger dimensions (1536+) capture more semantic nuance but require more storage and computation. Practical local setups often use 384–768 dimensional embeddings, which balance quality and resource usage.

When building a vector index, all embeddings must use the same model and dimension — you can't mix embeddings from different models.

Practical Considerations

Embedding generation is CPU- and RAM-friendly — even on a modest machine, you can embed thousands of documents in minutes. The bottleneck is usually the vector database (Qdrant, Chroma, Weaviate) and retrieval latency at query time, not embedding generation speed.

For production local AI deployments, run a dedicated small embedding model alongside your main LLM. The overhead is minimal and the retrieval quality improvement is significant.