CraftRigs
Architecture Guide

Running Local AI for Software Development: Hardware Setup Guide

By Georgia Thomas

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: A local AI coding setup replaces your $10-20/month GitHub Copilot subscription with hardware you own. The sweet spot is an RTX 4090 (24GB, ~$1,600) running Qwen 2.5 Coder 32B, which matches GPT-4 on coding benchmarks and pays for itself in 8-16 months versus heavy cloud usage (the Claude Code tier). If budget is tight, an RTX 3090 (~$800 used) running 14B-32B coding models is still dramatically better than any 7B model on cheaper hardware.

Why Run Coding AI Locally?

Three reasons keep pushing developers toward local:

Cost. GitHub Copilot is $10/month individual, $19/month business. Cursor Pro is $20/month. Claude Code runs $20-200/month depending on usage. Over 12 months, that's $120-2,400 for a service that requires internet, has rate limits, and can change pricing at any time.

Privacy. Your codebase goes to someone else's servers. For personal projects, maybe that's fine. For proprietary code, client work, or anything under NDA, it's a real concern. Local inference means your code never leaves your machine.

Latency. Cloud coding assistants add 200-500ms of network latency to every completion. Local inference on a good GPU responds in under 50ms. That difference compounds across hundreds of completions per day. Once you've used a truly fast local setup, cloud completions feel sluggish.

What Coding AI Actually Needs From Hardware

Coding assistance breaks into four workloads, each with different hardware demands:

1. Code Autocomplete (Tab Completion)

This is the Copilot-style inline suggestion as you type. It needs to be fast — under 200ms or it disrupts your flow. The model runs constantly, predicting the next few lines as you code.

  • Model size: 1.5B-7B parameters (small and fast)
  • VRAM needed: 2-5GB
  • Speed target: 80+ tokens per second
  • Best models: Qwen 2.5 Coder 1.5B (autocomplete specialist), StarCoder2 3B

Even an 8GB GPU handles autocomplete comfortably. This workload is not the bottleneck.

2. Code Chat (Ask Questions, Get Explanations)

You highlight code, ask "what does this do?" or "how do I refactor this?" The model reads your code and generates a detailed response. Latency matters less here — you're reading the response, so 20-40 t/s is fine.

  • Model size: 14B-32B parameters (quality matters more than speed)
  • VRAM needed: 8-20GB at Q4 quantization
  • Speed target: 20+ t/s
  • Best models: Qwen 2.5 Coder 32B, DeepSeek Coder V2 16B, Gemma 3 27B

This is where you need real VRAM. A 14B model gives decent answers. A 32B model gives answers that rival GPT-4.

3. Code Review and Debugging

Feed the model a file or diff, ask it to find bugs or suggest improvements. Requires longer context windows (8K-32K tokens for real codebases) and strong reasoning.

  • Model size: 27B-32B parameters preferred
  • VRAM needed: 16-24GB at Q4
  • Context needs: 8K-32K tokens minimum
  • Best models: Qwen 2.5 Coder 32B, Gemma 3 27B
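
Those context numbers are what drive the VRAM requirement: the KV cache grows linearly with context length, on top of the model weights. A rough sketch at fp16, using the published Qwen 2.5 32B-class shape (64 layers, 8 KV heads via grouped-query attention, head dimension 128 — treat those dimensions as assumptions):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: one key and one value vector per KV head,
    per layer, for every token in the context (fp16 = 2 bytes)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 2**30

# Qwen 2.5 Coder 32B-class dimensions at a 32K-token context:
print(kv_cache_gib(64, 8, 128, 32768))  # → 8.0 (GiB, on top of the weights)
```

A 32K-token review session can add roughly 8 GiB on top of ~20GB of Q4 weights, which is why long-context review squeezes even a 24GB card and why runtimes commonly offer 8-bit KV cache quantization to halve that figure.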

4. Multi-File Agentic Coding

Tools like Aider and Claude Code can autonomously edit multiple files, run tests, and iterate. This is the most demanding use case — the model needs to hold large contexts and reason about complex codebases.

  • Model size: 32B+ parameters, or MoE models like Qwen3-Coder-Next
  • VRAM needed: 20-32GB+
  • Context needs: 32K-128K tokens
  • Best models: Qwen 2.5 Coder 32B, Qwen3-Coder-Next (80B MoE, 3B active)

The Models: What to Run

For Autocomplete: Qwen 2.5 Coder 1.5B

This is the undisputed king of local autocomplete. At 1.5 billion parameters, it loads in under 2GB of VRAM, generates completions at 150+ t/s on any modern GPU, and the quality is surprisingly good. Every local coding setup should have this model pulled and ready.

Run it: ollama pull qwen2.5-coder:1.5b
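
If you want to verify the throughput claim on your own hardware, Ollama's HTTP API reports token counts and timings in its non-streaming responses. A minimal sketch — it assumes `ollama serve` is running on the default port with the model already pulled, and `benchmark` is a helper name chosen here, not part of the Ollama API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def tokens_per_second(resp: dict) -> float:
    """Ollama's non-streaming response includes eval_count (tokens
    generated) and eval_duration (nanoseconds); divide for throughput."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

def benchmark(model: str = "qwen2.5-coder:1.5b") -> float:
    """Request a short completion from the local model and measure speed."""
    payload = json.dumps({
        "model": model,
        "prompt": "def fibonacci(n):",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return tokens_per_second(json.load(r))
```

Note that `eval_count` and `eval_duration` cover only the generation phase, so this measures decode speed rather than prompt processing.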

For Chat and Review: Qwen 2.5 Coder 32B

Scores 92.7% on HumanEval — matching GPT-4o. At Q4_K_M, it needs about 20GB of VRAM, fitting comfortably on an RTX 3090 or 4090. This is your workhorse for explaining code, writing functions, reviewing PRs, and debugging.

Run it: ollama pull qwen2.5-coder:32b

For Budget Setups: Qwen 2.5 Coder 14B

If 32B doesn't fit on your GPU, the 14B version is the next best thing. Needs about 9GB at Q4 — fits on any 12GB+ card. Quality is noticeably below the 32B version on complex tasks, but still far better than any 7B coding model.

Run it: ollama pull qwen2.5-coder:14b

For Reasoning-Heavy Tasks: DeepSeek R1 32B

When you need chain-of-thought reasoning — debugging complex logic, architectural decisions, understanding tricky algorithms — DeepSeek R1 32B's thinking process is hard to beat. It's slower than Qwen 2.5 Coder (the reasoning tokens add up), but the quality on hard problems is noticeably better.

For Maximum Context: Qwen3-Coder-Next (80B MoE)

Released February 2026, this Mixture of Experts model activates only 3B parameters at a time from an 80B total. The result: strong coding performance with a 256K context window, runnable on consumer hardware — a 24GB+ GPU plus plenty of system RAM, since all 80B weights must stay resident even though only 3B are active per token. Ideal for agentic workflows that need to understand entire codebases.

IDE Integration: The Software Side

Continue.dev (Free, Open Source)

The leading open-source Copilot replacement. Works with VS Code and JetBrains IDEs. 31,300+ GitHub stars, used by Siemens and Morningstar in production.

Setup takes 5 minutes:

  1. Install Ollama and pull your models
  2. Install the Continue extension in your IDE
  3. Configure ~/.continue/config.json to point at your local models

Continue supports separate models for autocomplete and chat — run the tiny 1.5B model for tab completions and the big 32B model for chat. Best of both worlds.

Recommended config:

  • Tab autocomplete: Qwen 2.5 Coder 1.5B (fast, lightweight)
  • Chat model: Qwen 2.5 Coder 32B (quality reasoning)
  • Embeddings: nomic-embed-text (for codebase-aware search)
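
With Ollama as the backend, that recommended split might look something like this in `~/.continue/config.json` — a sketch using Continue's JSON config schema; field names have shifted across releases (newer versions moved toward a YAML config), so treat it as illustrative:

```json
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}
```

The key point is the split itself: chat requests go to the 32B entry under `models`, while every keystroke hits the lightweight `tabAutocompleteModel`.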

Aider (Free, Open Source)

Terminal-based AI pair programmer. Aider edits files directly, commits changes to git, and can iterate on tasks autonomously. It works with any Ollama model and supports multi-file editing.

The killer feature: Aider tracks token usage and estimated cost per session, even with local models. Against API pricing, developers report typical costs of $0.01-0.10 per feature; with local models, the marginal cost is effectively zero.

Aider pairs especially well with larger models (32B+) because it relies on the model's ability to produce correct file edits consistently. Smaller models make more formatting errors that break Aider's edit parsing.

Cursor (Local Mode)

Cursor has local model support through its Ollama integration. Point Cursor at your locally running Ollama instance and it uses your local model for completions and chat. The UI is polished — arguably the best developer experience of the three.

The catch: Cursor's best features (multi-file composer, intelligent codebase indexing) work better with their cloud models. The local integration is functional but doesn't get the same optimization love. If you're going fully local, Continue.dev or Aider are better choices.

Cost Comparison: Local vs Cloud Over 12 Months

Let's do the math. Assume a solo developer using AI coding assistance daily.

GitHub Copilot Individual:

  • Monthly: $10
  • 12-month total: $120
  • Quality: Good autocomplete, decent chat (GPT-4 based)

Cursor Pro:

  • Monthly: $20
  • 12-month total: $240
  • Quality: Excellent autocomplete, strong multi-file editing

Claude Code (Max plan):

  • Monthly: $100-200 (usage-dependent)
  • 12-month total: $1,200-2,400
  • Quality: Best-in-class autonomous coding

Local Setup (RTX 3090 + Open Source Stack):

  • Hardware: ~$800 (one-time, used RTX 3090)
  • Electricity: ~$3-5/month (GPU running during work hours)
  • Software: $0 (Ollama + Continue.dev + Aider are free)
  • 12-month total: ~$850
  • Quality: Matches Copilot for autocomplete, matches GPT-4 for chat with Qwen 2.5 Coder 32B

Local Setup (RTX 4090 + Open Source Stack):

  • Hardware: ~$1,600 (one-time)
  • Electricity: ~$5-8/month
  • Software: $0
  • 12-month total: ~$1,700
  • Quality: Faster inference, same model quality as above

Break-even analysis:

Against Copilot ($120/year): The RTX 3090 pays for itself in about 7 years — not a slam dunk on cost alone. But you also get a GPU for LLMs, image gen, and anything else. The real value is the privacy and the model quality (32B local models beat Copilot's suggestions for many tasks).

Against Cursor Pro ($240/year): RTX 3090 break-even in ~3.5 years.

Against Claude Code ($1,200-2,400/year): RTX 3090 break-even in 4-9 months. RTX 4090 in 8-16 months. This is where local really shines — heavy API users save thousands.

The honest take: If you only use basic Copilot autocomplete, the cost savings alone don't justify buying a GPU. But if you're a heavy user of chat, code review, or agentic coding (the $100+/month tier), local hardware pays for itself fast. And you keep the GPU forever.
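
The break-even figures above are simple arithmetic; here is a sketch you can adapt to your own subscription and power costs (the $4/month electricity default is an assumption, not a measurement):

```python
def breakeven_months(hardware_cost: float, cloud_monthly: float,
                     electricity_monthly: float = 4.0) -> float:
    """Months until a one-time hardware purchase beats a subscription,
    counting the GPU's own running cost against the savings."""
    monthly_saving = cloud_monthly - electricity_monthly
    return hardware_cost / monthly_saving

print(round(breakeven_months(800, 150)))   # → 5 (used RTX 3090 vs. $150/mo Claude Code)
print(round(breakeven_months(1600, 200)))  # → 8 (RTX 4090 vs. heavy $200/mo usage)
```

Against Copilot's $10/month, the same function returns multi-year horizons, which is the "honest take" above in numeric form.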

The Entry Build: Code Autocomplete + Basic Chat (~$1,200)

  • GPU: RTX 3060 12GB (~$250 used)
  • CPU: AMD Ryzen 5 7600 (~$180)
  • RAM: 32GB DDR5 (~$80)
  • Storage: 1TB NVMe (~$70)
  • PSU: 650W (~$80)
  • Motherboard + Case: ~$250

Runs: Qwen 2.5 Coder 1.5B (autocomplete at 150+ t/s) + Qwen 2.5 Coder 14B (chat at 20-25 t/s). Handles daily coding assistance for most developers. 32B models won't fit in 12GB of VRAM, and with CPU offload they run too slowly for interactive use.

The Sweet Spot: Full Coding Workflow (~$2,000)

  • GPU: RTX 3090 24GB (~$800 used)
  • CPU: AMD Ryzen 7 7700X (~$220)
  • RAM: 64GB DDR5 (~$160)
  • Storage: 2TB NVMe (~$120)
  • PSU: 850W (~$120)
  • Motherboard + Case: ~$350

Runs: Everything up to Qwen 2.5 Coder 32B at Q4 (35-40 t/s). Full autocomplete + chat + code review + Aider. This is the build most developers should target. The 64GB system RAM matters for Aider and Continue.dev, which keep model context and file caches in memory.

The Power Setup: Maximum Local AI (~$3,200)

  • GPU: RTX 4090 24GB ($1,600) or RTX 5090 32GB ($2,000)
  • CPU: AMD Ryzen 9 7950X (~$400)
  • RAM: 128GB DDR5 (~$300)
  • Storage: 4TB NVMe (~$250)
  • PSU: 1000W (~$160)
  • Motherboard + Case: ~$400

Runs: 32B models at Q8 (near-lossless quality), 70B models at Q4 with partial CPU offload (the 5090's 32GB keeps more layers on the GPU), or multiple models simultaneously. The 128GB of system RAM absorbs the offloaded layers and huge context windows. This is the "never worry about hardware again" build for AI development.

The Mac Path

Apple Silicon is a legitimate alternative for local coding AI, especially if you value portability.

  • MacBook Pro M4 Pro 24GB (~$2,000): Runs Qwen 2.5 Coder 14B at Q4 comfortably. Good for autocomplete and basic chat. Speed is adequate (15-20 t/s for the 14B model).
  • MacBook Pro M4 Max 48GB (~$3,200): Runs Qwen 2.5 Coder 32B at Q4. Full coding workflow on a laptop. Speed is slower than a 4090 desktop (~18-25 t/s) but you get portability.
  • Mac Studio M4 Ultra 192GB (~$6,000+): Runs 70B+ coding models. Overkill for most developers, but if you want to run frontier-class models locally on quiet hardware, nothing else matches it.

Apple Silicon's unified memory means the GPU and CPU share one pool of RAM, so there is no separate VRAM ceiling. A 48GB MacBook can load models that overflow a 24GB discrete GPU, because aside from what macOS reserves for itself, the whole pool is available to the GPU.
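
The same footprint arithmetic explains the comparison: weights scale with parameter count times bits per weight. A sketch — the ~4.8 effective bits for Q4_K_M, ~8.5 for Q8_0, and the 10% runtime overhead are rough approximations, not published figures:

```python
def model_gib(params_billions: float, bits_per_weight: float,
              overhead: float = 1.10) -> float:
    """Approximate weight footprint in GiB, padded ~10% for runtime
    buffers and scratch space (the overhead factor is an assumption)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30

print(round(model_gib(32, 4.8)))  # → 20: a 32B model at Q4 just fits a 24GB card
print(round(model_gib(32, 8.5)))  # → 35: the same model at Q8 needs a 48GB Mac
```

This is why the 48GB M4 Max runs the 32B model at Q4 with headroom for context, and can even hold it at Q8 where a 24GB discrete GPU cannot.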

The trade-off is speed. An M4 Max generates tokens at roughly 40-60% the speed of an RTX 4090 for the same model. For autocomplete, this gap is less noticeable (both feel instant at 1.5B). For 32B chat models, the Mac feels "smooth" while the 4090 feels "fast." See our Apple Silicon benchmarks for exact numbers.

Putting It All Together

The local AI coding stack in 2026 looks like this:

  1. Hardware: 24GB GPU minimum for serious use (RTX 3090 budget, RTX 4090 performance)
  2. Runtime: Ollama for model management and inference
  3. Autocomplete model: Qwen 2.5 Coder 1.5B (always running, nearly zero overhead)
  4. Chat/review model: Qwen 2.5 Coder 32B (loaded on demand, 20GB VRAM)
  5. IDE plugin: Continue.dev (free, open source, excellent integration)
  6. Optional: Aider for agentic multi-file editing

Total software cost: $0. Total monthly cost after hardware: electricity only.

The hardware investment is real — $800-2,000 depending on the GPU. But unlike a subscription, it doesn't disappear when you stop paying. You own the hardware, you own the models, and your code stays on your machine. For developers who use AI coding tools daily, the math works out faster than most people expect.

For GPU selection help, check our complete GPU rankings. For budget builds at every price point, see our budget guide.


