Most GPU buying guides for local AI start with compute. CUDA cores, TFLOPS, architecture generation. That's the wrong place to start, and buying on those numbers will cost you either way — either you overspend on raw compute you don't need, or you end up with a beautiful, fast GPU that can't load the model you want.
The number that matters is VRAM. And for an AI agent specifically — something that calls tools, manages memory, runs multi-step reasoning loops — the VRAM calculus is messier than anyone tells you.
Here's the thing about agents that pure chatbot guides miss: they eat context. Every tool call, every observation, every retry cycle gets appended to the prompt. By the time your agent has browsed a page, written some code, run a test, and revised its answer, you might have 8,000–16,000 tokens of context sitting in the KV cache. A 13B model with a 16K context can need 20GB of VRAM just to stay functional. The model weights alone are only about 8GB.
That's what kills your setup. Not clock speed. Not CUDA cores. Context.
Why Agents Are Different From Chatbots
Running a simple Q&A chatbot is one context window. Ask, answer, done. The model loads, processes a single short prompt, you get your text.
An agent loop is something else. The prompt grows. Each cycle appends new observations. If your agent is using Ollama with tool calls enabled — or running a framework like Open Interpreter, Cline, or AutoGPT — it's constantly re-ingesting an expanding context. By cycle three or four of a non-trivial task, your memory requirements have doubled from what you calculated at the start.
This is why the "minimum VRAM for X model" numbers you see in benchmark tables are misleading. Those numbers assume a fresh context at every inference. Agents don't work that way.
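To make that growth concrete, here's a minimal sketch of an agent loop against Ollama's /api/chat endpoint (assuming a local Ollama install on the default port; the model name and the tool output are placeholders). The full message history gets re-sent every cycle, which is exactly what inflates the KV cache:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint
MODEL = "qwen2.5:7b"                            # placeholder model name

messages = [
    {"role": "system", "content": "You are a task agent with tools."},
    {"role": "user", "content": "Summarize the failing tests in this repo."},
]

for step in range(5):
    # The entire history is re-sent on every cycle, so the KV cache the
    # server holds grows with each loop iteration.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "messages": messages, "stream": False},
    ).json()
    messages.append({"role": "assistant", "content": resp["message"]["content"]})

    # Stand-in for a real tool call (web fetch, file read, test run, ...).
    observation = f"[tool output for step {step}: several KB of logs]"
    messages.append({"role": "user", "content": observation})

    # Rough context estimate at ~4 characters per token.
    approx_tokens = sum(len(m["content"]) for m in messages) // 4
    print(f"cycle {step + 1}: ~{approx_tokens} tokens in context")
```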
Warning
CPU offload is a trap for agents. When your model doesn't fit in VRAM, inference frameworks like llama.cpp and Ollama will silently offload layers to system RAM. For a one-shot chatbot, this is slow but tolerable. For an agent running multi-step loops, you drop from 40–60 tokens/second to 2–3 tokens/second. A 5-step agentic task that should take 30 seconds takes 10 minutes. The loop becomes unusable.
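The arithmetic behind that claim, assuming roughly 300 generated tokens per agent step (an illustrative figure, not a measurement):

```python
# 5 agent steps at ~300 generated tokens each (assumed), compared at a
# healthy in-VRAM speed versus a CPU-offloaded crawl.
steps, tokens_per_step = 5, 300
total_tokens = steps * tokens_per_step  # 1,500 tokens generated overall

for label, tok_per_sec in [("all in VRAM", 50), ("offloaded to RAM", 2.5)]:
    print(f"{label}: {total_tokens / tok_per_sec / 60:.1f} minutes")
# all in VRAM: 0.5 minutes; offloaded to RAM: 10.0 minutes
```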
The VRAM Numbers You Actually Need
Here's the table that matters. Q4_K_M quantization — the standard choice for local inference in 2026 — with a 16K token context window included. Not just model weights. Context too.
| Model (Q4_K_M) | Total VRAM needed (16K context) |
| --- | --- |
| 7B–8B | ~8GB |
| 13B–14B | ~12GB |
| 27B | ~20GB |
| 34B | ~26GB |
| 70B | ~53GB |

The 70B row alone explains why most single consumer GPUs can't touch 70B models for real agentic work. You'd need multi-GPU or something purpose-built.
For a practical home or small-business agent — running Qwen2.5, Mistral Nemo, or Llama 3.1 in the 7B to 27B range — 16GB is the minimum worth taking seriously. 24GB is where it gets genuinely comfortable.
Info
VRAM formula: (Model parameters × 0.5 bytes for Q4) + KV cache overhead. KV cache grows linearly with context length. The longer your agent's working session, the more VRAM the context consumes on top of the model weights. A single agentic session doing file I/O and web search can push context past 12K tokens fast.
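As a sanity check, here's that formula in code, with architecture numbers shaped like Llama 3.1 8B filled in for illustration (32 layers, 8 KV heads, head dimension 128; these are assumptions pulled from the published config, and your model will differ):

```python
def estimate_vram_gb(params_b, n_layers, n_kv_heads, head_dim,
                     context_len, bytes_per_weight=0.5, kv_bytes=2):
    """Rough total: quantized weights plus an FP16 KV cache.

    params_b          model size in billions of parameters
    bytes_per_weight  ~0.5 for Q4_K_M (4 bits plus quantization overhead)
    kv_bytes          2 for FP16 K/V entries
    """
    weights = params_b * 1e9 * bytes_per_weight
    # KV cache: keys + values, per layer, per KV head, per head dim, per token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) / 1e9

# An 8B-class model at a 16K agentic context:
print(estimate_vram_gb(8, 32, 8, 128, 16_384))  # ~6.1 GB before runtime overhead
```

Runtime overhead (CUDA context, activation buffers, the inference framework itself) adds a couple of gigabytes on top, which is how an 8B model ends up near the ~8GB row in the table above.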
The GPU Tiers for Self-Hosted Agents
12GB Cards — RTX 4070 Super (~$549)
Fine for simple 7B or 8B agents. The RTX 4070 Super delivers around 57 tokens/sec on Llama 3.1 8B, which feels responsive. You can run a Qwen2.5-Coder-7B coding assistant, a lightweight local research agent, or a basic task automation loop without hitting walls.
Where it breaks: anything with longer context or a bigger model. A 13B model barely fits the weights, and once your agent accumulates real conversation history across tool calls, you're offloading to RAM. At that point you're running at 2–3 tok/s, which feels like watching paint dry on every reasoning step.
If you're experimenting, starting here makes sense. If you're building something for daily production use, this card will frustrate you within a month.
16GB Cards — RTX 4080 Super ($800 used), RTX 4070 Ti Super ($799)
This is the honest minimum for serious agent work. 16GB comfortably fits 13B models with full agentic contexts, and lets 27B models run at reduced context without offloading.
The RTX 4080 Super at ~72–78 tokens/second makes multi-step agent loops feel real-time. Token generation at that speed means you're not staring at a progress bar while your agent plans its next tool call.
Used RTX 4080 Supers are sitting around $799–$849 on eBay in March 2026. That's probably the sharpest value point in the market right now for agentic use.
One thing people consistently miss: the RTX 4070 Ti Super and 4080 Super both have 16GB and both use a 256-bit memory bus. The real differentiator is bandwidth — 736 GB/s on the 4080 Super versus 672 GB/s on the 4070 Ti Super. Token generation is a memory-bandwidth-bound operation, not a compute-bound one. The 4080 Super is meaningfully faster on long contexts because it can stream the weights and KV cache out of VRAM faster.
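A rough way to see why bandwidth dominates: every generated token requires reading essentially all of the active weights out of VRAM once, so the ceiling on tokens/second is roughly bandwidth divided by model size in bytes. A quick sketch, assuming an 8B Q4_K_M file of about 5 GB:

```python
# Theoretical token-generation ceiling: each new token reads (approximately)
# the full set of model weights from VRAM once.
model_bytes = 5.0e9  # ~5 GB: rough Q4_K_M file size for an 8B model (assumed)

for card, bandwidth_gbs in [("RTX 4080 Super", 736), ("RTX 4070 Ti Super", 672)]:
    ceiling = bandwidth_gbs * 1e9 / model_bytes
    print(f"{card}: ~{ceiling:.0f} tokens/sec upper bound")
# Real throughput lands well below the ceiling, but the ~10% bandwidth gap
# between the two cards carries straight through to generation speed.
```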
24GB Cards — RTX 4090, RTX 3090 Used
Here's where local agent hosting stops feeling like a compromise. 24GB fits 27B models fully at Q4 with comfortable context headroom. You can push to 34B with careful context management. Multi-document RAG, long agentic sessions, code agents working through large repos — all of it works without offloading, without slowdown, without babysitting.
The RTX 4090 costs north of $1,900 new right now. Used examples run $1,600–$2,000. It benchmarks at around 108 tokens/sec on 8B models, which is fast. But for pure inference, the extra compute over a 4080 Super matters less than the extra 8GB of VRAM.
The sleeper option: a used RTX 3090 for $650–$750. You get 24GB of VRAM on a card that's architecturally older, but architecture matters less for inference than memory bandwidth. The 3090 runs hotter and louder. It's not a 2026 card. But if your goal is maximum context capacity per dollar, nothing else at that price comes close. r/LocalLLaMA has been saying this for two years and they're still right.
Tip
For agentic workloads specifically, prioritize VRAM over compute. A used RTX 3090 (24GB, ~$700) will outrun a brand-new RTX 4070 Super (12GB, ~$549) on any task involving long contexts or 27B+ models. Don't let the newer architecture fool you into buying less memory.
32GB+ — RTX 5090, Multi-GPU, or DGX Spark
Running 70B models locally requires either 32GB+ on a single card or two 24GB cards working in tandem. The RTX 5090 at 32GB and ~213 tokens/sec on 8B is genuinely extraordinary — but it's also $2,000+ and still inconsistent to find at MSRP in early 2026.
Two used RTX 3090s at 48GB combined unlock 70B inference for around $1,400–$1,600 total. The setup is more complex: Ollama will split a model's layers across both cards, but it doesn't shard tensors across GPUs the way vLLM does, so the second card buys you capacity more than speed, and you'll still be juggling CUDA_VISIBLE_DEVICES to keep each workload on the right device. Not impossible — just not plug-and-play.
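If you want true tensor-parallel sharding, a minimal vLLM sketch looks like this (the model name is a placeholder; in practice you'd point it at a 4-bit quantized build, which is what actually fits a 70B into 2×24GB, and exact options vary by vLLM version):

```python
# Minimal vLLM sketch: split one model across two GPUs with tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; use a quantized build
    tensor_parallel_size=2,                     # shard across both 3090s
    gpu_memory_utilization=0.90,
)

params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.generate(["List the steps to refactor this module."], params)
print(outputs[0].outputs[0].text)
```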
Worth it for a small team running shared infrastructure. Overkill for a single developer.
What Model Size Matches Your Use Case
The model you run determines the VRAM floor. Here's the practical map:
7B–8B: Simple task automation, lightweight coding assistance, quick research agents. Runs fine on 12GB. Hits reasoning ceilings on complex multi-step planning. Great for prototyping.
13B–14B: Noticeably better reasoning, handles tool calls more reliably, fewer loop failures on ambiguous tasks. Needs 12GB minimum, performs better with 16GB for real agentic context lengths.
27B–34B: Strong reasoning, dependable tool use, capable of nuanced multi-step planning. Needs 16–24GB. This is where agent quality stops feeling like a concession compared to cloud APIs.
70B: Near cloud-level capability for many tasks. 40GB+ VRAM required. Multi-GPU or purpose-built hardware. Overkill for most personal setups, but real for production deployments.
The Actual Recommendation
Buy for VRAM, not compute. That's the whole article, really.
Around $550–$600: RTX 4070 Super (12GB). Know its limits. It's a chatbot card running in agent mode, and long sessions will hit walls.
Around $750–$850: RTX 4080 Super used (16GB) or RTX 3090 used (24GB). The 3090 wins on pure VRAM capacity for the same money. The 4080 Super wins on speed and power draw. For agentic workloads that run long multi-step sessions, the 3090's extra 8GB is the better bet.
$1,600+: RTX 4090 (24GB). The speed advantage is real. Worth it if you're building something you run every day and want zero friction.
The mistake most people make is buying a 12GB card because it benchmarks well in gaming reviews, then discovering their agent crawls through anything beyond a 3-step task. That's not a model problem. It's a VRAM problem with a $700 solution.
Self-hosting an AI agent in 2026 is genuinely practical. The models are good enough. The tooling — Ollama, Open Interpreter, Cline — is mature and stable. The cost savings over cloud APIs are real; developers running heavy API usage can hit $300–$500/month, with hardware break-even typically under six months. The only part people keep getting wrong is the hardware. And it comes down to one number: how many gigabytes are on the card.