CraftRigs

OpenClaw Long-Term Memory: What Hardware You Need for Persistent Agents

By Charlotte Stewart 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

OpenClaw shipped long-term memory persistence last week. If you haven't followed OpenClaw, it's an open-source agent framework that lets you build AI assistants that operate autonomously across tasks — think AutoGPT but with better tool integration and more predictable behavior.

The memory update is significant because persistent context has been one of the main reasons people still reach for cloud-hosted AI over local inference. Claude remembers your project context across sessions because it has a database backend doing that work. Your local Ollama instance forgets everything the moment you close the terminal.

OpenClaw's persistence layer changes that equation — but it also introduces hardware requirements that most people running local AI haven't thought about yet.


Quick Summary

  • Persistent agents need a vector database running alongside the LLM — Qdrant, Chroma, or Weaviate are the main options
  • VRAM requirements stay the same; the new constraint is system RAM (64GB minimum, 128GB recommended for large memory stores)
  • Fast NVMe storage matters more than most people expect for retrieval-heavy workloads

What "Long-Term Memory" Actually Means for Hardware

When an agent has long-term memory, it's doing three things that weren't happening before:

1. Embedding generation. Every piece of information the agent wants to remember gets converted into a vector embedding — a numerical representation that captures semantic meaning. This is a separate model inference step, usually done with a small embedding model (BAAI/bge-small, nomic-embed-text, or similar). These are tiny — typically 100–500MB — and they don't need your main GPU.

2. Vector storage. The embeddings are stored in a vector database. This is where Qdrant, Chroma, or Weaviate comes in. The database lives in system RAM (or on disk, with RAM used as a cache). A million stored memories take roughly 3–4GB of RAM in Qdrant with default settings.

3. Retrieval at inference time. When the agent needs to answer a question or continue a task, it queries the vector database for relevant memories and injects them into the LLM's context window. This adds tokens to every prompt, which slightly raises VRAM use during the inference call because the KV cache grows with prompt length.

None of these steps are GPU-intensive. The bottleneck shifts from VRAM to system RAM and storage latency.
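The three steps above can be sketched end to end in a few lines. This is a toy illustration, not OpenClaw's implementation: `embed` is a hashed bag-of-words stand-in for a real embedding model, and `MemoryStore` does brute-force cosine search where Qdrant or Chroma would use an index. Every name here is hypothetical.

```python
import hashlib
import math

DIM = 1024  # toy dimension (assumption); real embedding models differ


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class MemoryStore:
    """Toy stand-in for a vector database: brute-force cosine search in RAM."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def remember(self, text: str) -> None:
        # Steps 1 and 2: embed the text, then store the vector with its source.
        self._items.append((embed(text), text))

    def recall(self, query: str, k: int = 2) -> list[str]:
        # Step 3: rank stored memories by cosine similarity to the query.
        q = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: sum(a * b for a, b in zip(q, item[0])),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


def build_prompt(store: MemoryStore, question: str, k: int = 2) -> str:
    """Inject the top-k retrieved memories ahead of the user's question."""
    memories = "\n".join(f"- {m}" for m in store.recall(question, k))
    return f"Relevant memories:\n{memories}\n\nUser: {question}"


store = MemoryStore()
store.remember("the project uses postgres for storage")
store.remember("user prefers dark mode")
print(build_prompt(store, "project uses postgres", k=1))
```

Everything in this sketch runs on CPU and in system RAM, which is why the memory layer leaves the GPU budget largely untouched.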


The VRAM Picture

The LLM is still doing the heavy lifting. VRAM requirements depend on which model you're running in OpenClaw, not on the memory system. Here's the relevant tier breakdown:

Notes

  • Good for light agent tasks
  • Strong reasoning, handles long context
  • Multi-GPU territory at Q4
  • Strong on tool-use tasks

For a capable persistent agent, the sweet spot is Mistral Small 4 or DeepSeek R1 32B. Both run on a single 24GB GPU at Q4. Both have good instruction-following for tool use, which agents rely on heavily.

The vector DB retrieval injects context into the prompt — typically 500–2,000 tokens of retrieved memories per query. On a 24GB card running a 22B model at Q4, you have roughly 30,000–50,000 tokens of usable context window. Memory injection is a small slice of that.
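To make "a small slice" concrete, here is the arithmetic using the ranges quoted above. The function name is made up for illustration, and the token counts are the article's estimates rather than measurements.

```python
def memory_share(retrieved_tokens: int, context_tokens: int) -> float:
    """Fraction of the context window consumed by injected memories."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    return retrieved_tokens / context_tokens


best = memory_share(500, 50_000)     # smallest injection, largest window
worst = memory_share(2_000, 30_000)  # largest injection, smallest window
print(f"{best:.1%} to {worst:.1%}")  # prints "1.0% to 6.7%"
```

Even in the worst case, retrieved memories take up under a tenth of the usable window.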


System RAM: The Actual Constraint

This is where persistent agents differ from standard inference rigs. You need more system RAM than you probably have if your rig was built just for LLM inference.

Qdrant's memory model keeps the HNSW index in RAM for fast retrieval. The index size depends on your vector dimensions and collection size:

| Memory store size | RAM needed (Qdrant) |
|---|---|
| 100K memories | ~400MB |
| 1M memories | ~4GB |
| 10M memories | ~40GB |
| 100M memories | ~400GB |

For a personal assistant or coding agent with a few months of history, you're well under 1M memories, so the DB needs at most about 4GB of RAM. That fits in 32GB of system RAM, although the recommendations below start at 64GB to leave headroom for CPU offloading.

For a more ambitious agent — one that's been ingesting documents, emails, or codebases — you can hit 5–10M memories. That pushes you toward 64GB system RAM before the LLM's own footprint even enters the picture (LLMs using RAM for CPU offloading add another 20–80GB depending on model size and offloading ratio).
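A rough budget calculator ties these numbers together. It assumes the linear scaling implied by the table (about 4KB of RAM per stored memory), an 8GB allowance for the OS and other processes, and a fixed list of common RAM capacities. The last two are placeholders of mine, not figures from Qdrant or OpenClaw, so adjust them to your setup.

```python
from typing import Optional

BYTES_PER_MEMORY = 4_000               # ~4GB per 1M memories, per the table
STANDARD_KITS_GB = (32, 64, 128, 256)  # assumed common capacities


def vector_db_ram_gb(n_memories: int) -> float:
    """Estimated RAM for the vector index, assuming linear scaling."""
    return n_memories * BYTES_PER_MEMORY / 1e9


def total_ram_gb(n_memories: int, llm_offload_gb: float = 0.0,
                 os_headroom_gb: float = 8.0) -> float:
    """Vector DB + CPU-offloaded LLM layers + OS allowance (assumed 8GB)."""
    return vector_db_ram_gb(n_memories) + llm_offload_gb + os_headroom_gb


def smallest_kit_gb(required_gb: float) -> Optional[int]:
    """Smallest listed capacity that covers the requirement, or None."""
    for kit in STANDARD_KITS_GB:
        if kit >= required_gb:
            return kit
    return None


need = total_ram_gb(10_000_000, llm_offload_gb=20)
print(need, smallest_kit_gb(need))  # 68.0 128
```

With 10M memories and 20GB of offloaded layers, the estimate lands just above 64GB, which is why the medium tier jumps to 128GB.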

Recommended RAM:

  • Light agent use (personal assistant, coding helper): 64GB
  • Medium use (document ingestion, multi-day research tasks): 128GB
  • Heavy use (continuous ingestion, multi-agent systems): 256GB+

Storage: NVMe Is Not Optional

Vector database retrieval latency directly affects agent response time. Qdrant and Weaviate can use memory-mapped storage for large collections — parts of the index that don't fit in RAM get paged from disk on demand.

On a SATA SSD, a cold retrieval hit can add 20–50ms of latency per query. On a PCIe 4.0 NVMe (like Samsung 980 Pro or WD Black SN850X), that drops to 2–5ms. For a conversational agent doing 10–20 memory queries per response, the difference compounds.
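Here is how that compounding works out, under the pessimistic assumption that every query is a cold hit and that queries run one after another. The function name is hypothetical, and the latencies are the ranges quoted above.

```python
def retrieval_overhead_ms(queries: int, per_query_ms: float) -> float:
    """Total added latency if each memory query pays the cold-read cost."""
    return queries * per_query_ms


sata_low, sata_high = retrieval_overhead_ms(10, 20), retrieval_overhead_ms(20, 50)
nvme_low, nvme_high = retrieval_overhead_ms(10, 2), retrieval_overhead_ms(20, 5)

print(f"SATA: {sata_low}-{sata_high} ms per response")  # SATA: 200-1000 ms per response
print(f"NVMe: {nvme_low}-{nvme_high} ms per response")  # NVMe: 20-100 ms per response
```

In practice some queries hit the RAM cache and some run in parallel, so treat these as upper bounds. The tenfold gap between the two drive types holds either way.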

The recommendation is simple: if your local AI rig has a SATA SSD as its primary drive, add a PCIe 4.0 NVMe. A 2TB Samsung 980 Pro runs around $120 and eliminates storage as a bottleneck entirely.


This is a single-GPU build tuned for running a capable persistent agent 24/7:

GPU: RTX 3090 24GB ($500 used) or RTX 4090 24GB ($1,600 new)

  • The 3090 handles 22B models at Q4 without drama. The 4090 adds speed and handles Q8 on smaller models.

CPU: AMD Ryzen 9 7900X or Intel Core i9-13900K

  • You want good single-core performance for the embedding model and vector search, which run on CPU.

RAM: 128GB DDR5 (2x 64GB sticks)

  • Headroom for the LLM (if offloading layers), the Qdrant index, and the OS. Don't go below 64GB.

Storage: 2TB PCIe 4.0 NVMe (primary) + 4TB secondary for document storage

  • Fast primary for the OS and active vector index, bulk storage for raw documents.

Total build cost: ~$1,800–$2,500 depending on GPU choice


Qdrant vs. Chroma vs. Weaviate for Local Agents

Qdrant is the first choice for most local deployments. It's written in Rust, runs in Docker with a 200MB image, has excellent performance on commodity hardware, and OpenClaw has native integration. The gRPC API is faster than Chroma's HTTP interface for high-frequency retrieval.

Chroma is better for getting started. Simpler setup, no Docker required, works directly with Python. The performance ceiling is lower than Qdrant for large collections, but for under 500K embeddings it's fine.

Weaviate is the enterprise option — more features, more configuration, more overhead. For a personal agent, it's overkill. Consider it if you're building a product on top of OpenClaw that needs multi-tenancy.


FAQ

How much VRAM does a persistent memory agent require? The LLM itself needs 16–24GB of VRAM for a capable 22–32B model at Q4; 70B-class models are multi-GPU territory. Vector DB operations add minimal GPU overhead, since most run on CPU. The real requirement is RAM: 64–128GB of system RAM for large memory stores, plus fast NVMe storage for persistence.

Can I run persistent AI agents on a single RTX 3090? Yes. An RTX 3090 (24GB) handles the LLM side well. Pair it with 64GB+ system RAM and a fast NVMe SSD for vector storage. The bottleneck for most agents is retrieval speed and context window size, not raw VRAM.

What vector database is best for local persistent agents? Qdrant is the most popular choice for local deployments — runs in Docker, has good Rust-based performance, and works well with llama.cpp and Ollama. Chroma is simpler to set up for prototyping. Weaviate scales better for large corpora but is heavier.
