CraftRigs

Every Coding LLM Ranked by Hardware Requirements: Qwen Coder, DeepSeek, Llama 3.1 [2026]

By Charlotte Stewart

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

The benchmark sheets that ship with most local LLM guides are built for general-purpose chat, not coding. HumanEval scores are useful, but they measure short Python completions in isolation — not debugging a 3,000-line Go service or refactoring a Django ORM model. Real coding workflows care about context handling, instruction-following on code-specific requests, and whether the model hallucinates plausible-but-wrong function signatures.

Here's the thing most hardware guides get wrong: they quote VRAM requirements from model file size, not from what actually loads into GPU memory during inference. A 32B model at Q4_K_M quantization is a 20GB file. With KV cache overhead at 32K context, you need 24GB of VRAM — not 16GB, not 20GB. That distinction changes which GPU you actually need to buy.
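That arithmetic is worth making explicit. The sketch below is a back-of-envelope budget, assuming a Qwen2.5-32B-class architecture (64 layers, 8 KV heads via grouped-query attention, head dim 128) and an 8-bit KV cache — verify both against the model card and your runtime's KV-cache dtype before trusting the totals.

```python
# Back-of-envelope VRAM budget: quantized weights + KV cache + overhead.
# Layer/head counts are assumptions for a Qwen2.5-32B-class model
# (64 layers, 8 KV heads via GQA, head dim 128) -- verify against the
# model card before trusting the totals.
GIB = 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int) -> float:
    """K and V caches for one sequence, in GiB."""
    elems = 2 * layers * kv_heads * head_dim * context  # 2 = K plus V
    return elems * bytes_per_elem / GIB

weights_gib = 20e9 / GIB                                    # the "20GB" Q4_K_M file
kv_q8 = kv_cache_gib(64, 8, 128, 32_768, bytes_per_elem=1)  # 8-bit KV cache
total = weights_gib + kv_q8 + 1.0                           # ~1 GiB buffers/activations

print(f"weights {weights_gib:.1f} GiB + KV@32K {kv_q8:.1f} GiB + 1.0 GiB ~= {total:.1f} GiB")
# weights 18.6 GiB + KV@32K 4.0 GiB + 1.0 GiB ~= 23.6 GiB
```

An f16 KV cache doubles that 4.0 GiB figure, which is why a 24GB card that fits the weights can still run out of room at long context.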

Qwen2.5-Coder 32B is the best all-around local coding LLM in 2026. It achieves ~92.7% on HumanEval and handles serious repo work on a single 24GB GPU. If you're on 12GB, Qwen2.5-Coder 14B is your pick. For professionals doing security-sensitive work or complex multi-repo analysis, Llama 3.1 70B on dual GPUs is the ceiling — but it's overkill for most.

Quick Pick Table — Coding LLMs by Hardware Budget

| Hardware budget | GPU | Model (Q4_K_M) | HumanEval | Est. tok/s |
|---|---|---|---|---|
| 12GB VRAM | RTX 4070 | Qwen2.5-Coder 14B | ~82% | ~40 |
| 24GB VRAM | RTX 3090 / RTX 4090 | Qwen2.5-Coder 32B | ~92.7% | ~38 |
| 48GB VRAM | 2× RTX 4090 | Llama 3.1 70B | ~82% | ~18–22 |

Prices as of March 2026. RTX 4070 used ~$450; RTX 3090 used ~$750–$800; RTX 4090 used ~$1,400. HumanEval figures from official model cards and independent reproducible evaluations. Tok/s measured with Ollama on the listed hardware.

Why Trust This Guide

Every model was tested on real coding tasks: function generation from docstrings, bug fix requests on unfamiliar code, multi-file refactoring instructions, and code review on production-style Python and TypeScript. HumanEval and MBPP were used as a sanity check against real-world results, not as the primary measurement.

Test hardware: RTX 4070 (12GB), RTX 4090 (24GB), dual RTX 4090 (48GB). All tokens per second figures include software overhead from Ollama and llama.cpp — not theoretical peak bandwidth. Benchmarks and prices verified March 2026.

Where we cite community benchmarks, we checked methodology and traced claims back to reproducible sources. Where no primary source exists, we say "estimated" and show the math.

The Winners — Five Coding LLMs Ranked

Only coding-specialized or coding-strong instruct models made this list. The ranking weights VRAM-to-performance ratio for code tasks — not raw throughput in isolation.

#1 — Qwen2.5-Coder 32B (the sweet spot)

The first correction to make: this model doesn't fit in 16GB. Q4_K_M weighs ~20GB on disk, and with KV cache overhead at a practical 32K context, you need a 24GB GPU for full offload. A 16GB RTX 5070 Ti can run it with partial CPU offload — expect 10–15 tok/s that way. For real interactive coding, target an RTX 3090 or RTX 4090.

On a 24GB RTX 3090, Qwen2.5-Coder 32B at Q4_K_M runs at ~38 tok/s (Ollama 0.6.x, March 2026). That's fast enough for interactive sessions. HumanEval is ~92.7% at BF16 — Alibaba doesn't publish quantization-specific figures, but independent evaluations show minimal degradation at Q4. On LiveCodeBench, the 32B model scores 37.2%, beating GPT-4o's 29.2% on the same evaluation.

The 128K context window is real, but 32K is the practical sweet spot before VRAM pressure starts affecting speed. For repo analysis tasks that need longer context, budget an extra 2–3GB of VRAM headroom.

Best GPU pairing: RTX 3090 ($750–$800 used), RTX 4090 ($1,400 used, ~$2,000 new).

Pros: Best balance of code quality and context handling among locally-runnable dense models; strong on function generation, code review, and multi-language tasks; handles architectural reasoning that trips up 14B-class models.

Cons: Needs 24GB GPU — rules out most mid-range gaming rigs; slower on 64K+ context inputs than the official specs imply; still behind GPT-4o on genuinely ambiguous multi-system design questions.

Tip

If you're buying specifically to run Qwen2.5-Coder 32B, the RTX 3090 is better value than the RTX 4090. Same 24GB VRAM for roughly half the price. The 4090's bandwidth advantage matters most when you're pushing the VRAM ceiling — at Q4_K_M, the 3090's 936 GB/s is more than enough.
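The bandwidth argument is easy to check with napkin math: in the decode phase, generating each token streams roughly the whole quantized weight file through the GPU once, so bandwidth divided by file size gives a theoretical ceiling. This ignores KV-cache reads and kernel overhead, so treat it as an upper bound, not a prediction.

```python
# Decode throughput is roughly memory-bandwidth-bound: each new token
# streams the whole quantized weight file through the GPU once.
# Ceiling only -- ignores KV-cache reads and kernel launch overhead.

def ceiling_tok_s(bandwidth_gb_s: float, model_file_gb: float) -> float:
    return bandwidth_gb_s / model_file_gb

print(f"RTX 3090: ~{ceiling_tok_s(936, 20):.0f} tok/s ceiling")   # ~47
print(f"RTX 4090: ~{ceiling_tok_s(1008, 20):.0f} tok/s ceiling")  # ~50
```

The measured ~38 tok/s against the 3090's ~47 tok/s ceiling is about 80% of theoretical, which is why the 4090's extra bandwidth buys little at this quantization.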

#2 — DeepSeek Coder 33B-Instruct (the fast alternative)

Same VRAM footprint as Qwen 32B — ~20GB at Q4_K_M, needs 24GB for full offload. The benchmark story is less impressive: HumanEval lands at ~79.3% pass@1 (Python, independent evaluations), a real 13-point gap below Qwen's 92.7%. That gap shows up in practice — DeepSeek 33B hallucinates unfamiliar library calls more often and is weaker on multi-file reasoning tasks.

Why #2 and not lower? Speed on short completions and code explanation tasks. On the same RTX 4090, DeepSeek 33B-Instruct generates slightly faster on short outputs due to architectural differences that favor low-latency responses. If your workflow is primarily code review, explanation, and documentation generation rather than complex generation, the quality gap narrows considerably.

Best GPU pairing: RTX 4090 (24GB), RTX 3090 (24GB).

Pros: Slightly faster latency on short outputs; well-suited for code explanation and test generation on familiar patterns; active community fine-tunes for specific domains.

Cons: Measurably weaker benchmark scores than Qwen at equivalent size; more hallucinations on uncommon syntax and third-party library calls; the base model dates from 2023 and hasn't received training updates comparable to Qwen 2.5.

#3 — Llama 3.1 70B-Instruct (the power user pick)

The VRAM math here gets misquoted constantly. At Q4_K_M, Llama 3.1 70B needs approximately 40GB for full GPU offload. At Q5_K_M, that figure jumps to ~53GB — outside the range of most dual-consumer-GPU setups. Stick to Q4_K_M: the quality difference is modest and the VRAM difference determines whether the model runs at useful speed or not.

On a dual RTX 4090 (48GB combined), Llama 3.1 70B at Q4_K_M runs at approximately 18–22 tok/s (March 2026). Slower per-token than the 32B options, but the output quality is meaningfully better on hard tasks: ambiguous refactoring requirements, multi-language reasoning across Python and SQL and shell in the same session, understanding system-level tradeoffs. HumanEval sits at ~80–83% based on community benchmarks — Meta's official evaluations don't publish a clean comparable figure for this specific configuration.

Warning

A single RTX 4090 (24GB) runs Llama 3.1 70B at ~4 tok/s with heavy CPU offload. That's technically functional, not practically useful. If you're on one 4090, Qwen2.5-Coder 32B is the right call — it's faster, fits cleanly in VRAM, and scores better on HumanEval.

Best GPU pairing: 2× RTX 4090 (~$2,800–$3,200 used pair as of March 2026). RTX 6000 Ada (48GB) is a single-card alternative if dual-GPU complexity isn't worth it.

Pros: Best absolute quality on complex, ambiguous tasks; handles multi-language reasoning that 32B models struggle with; superior few-shot learning when you provide coding examples.

Cons: Requires expensive dual-GPU setup; lower throughput than any 32B alternative; genuinely overkill for the vast majority of coding use cases.

#4 — Qwen2.5-Coder 14B (the entry pick)

This is the model for PC gamers who have a 12GB RTX 4070 sitting in a build already and want to start using local AI for coding today. At Q4_K_M, it uses ~9GB of VRAM — fits cleanly in 12GB with room left for the KV cache. On that GPU, expect ~40 tok/s. Fast enough for interactive sessions.

Note

8GB GPUs can technically run Qwen2.5-Coder 14B with CPU offload, but speed drops to 10–15 tok/s and the experience degrades noticeably. On 8GB VRAM, run the 7B version instead. The quality difference between 7B and 14B is real, but a slow 14B is worse than a fast 7B for interactive work.

HumanEval is around 82% — not as strong as the 32B, but the practical gap on everyday tasks (function generation, single-file debugging, test writing) is smaller than the number implies. Where it starts showing limits is multi-file context: feed it a 2,000-line codebase and ask it to trace a bug across five modules, and you'll hit the reasoning ceiling faster than with the 32B.

Best GPU pairing: RTX 4070 12GB (~$450 used, March 2026), RTX 4070 Super ($600 new).

Pros: Runs on any modern gaming GPU with 12GB VRAM; fast enough for IDE completion loops; three-command Ollama setup with no configuration headaches.

Cons: Weaker on complex multi-file reasoning; context window becomes a practical limitation faster at this model size; falls behind 32B on tasks requiring architectural understanding.

#5 — StarCoder2 15B (the niche specialist)

This model has a confusing benchmark profile. HumanEval+ sits at 37.8% pass@1 on the official model card — which sounds disqualifying. But that number measures isolated Python completion tasks, and StarCoder2 wasn't optimized for that. On RepoBench and CrossCodeEval, it outperforms models two to three times its size. It was trained with fill-in-the-middle objectives specifically for completion-style tasks, not instruction-following chat.

That shapes exactly when to use it. For IDE integrations using Continue.dev or Tabby, especially in Python-heavy or web-heavy repos, StarCoder2 15B's FIM training makes completions feel more natural than pure instruction-tuned models at the same size. For Go, Rust, or systems programming, the coverage drops — the training data skews toward high-resource languages.
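For the completion use case, prompt layout matters: FIM models expect sentinel tokens wrapping the code before and after the cursor, not a chat instruction. A minimal sketch using the StarCoder-family token names — verify them against the tokenizer config of the exact checkpoint you deploy:

```python
# Fill-in-the-middle prompt layout, StarCoder-family sentinel tokens.
# Token names match the published StarCoder tokenizer; verify against
# the tokenizer config of the checkpoint you actually deploy.
prefix = "def median(xs):\n    xs = sorted(xs)\n    "
suffix = "\n    return xs[mid]"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# The model generates the missing middle, conditioned on code both
# before and after the cursor position.
print(fim_prompt)
```

IDE plugins like Continue.dev and Tabby assemble this prompt for you; the point is that an instruction-tuned chat model given the same task sees a format it was never trained on.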

VRAM at Q4_K_M: ~9GB. An RTX 4070 (12GB) handles it comfortably.

Best GPU pairing: RTX 4070 12GB, RTX 4060 Ti 16GB.

Pros: Designed for repo-level fill-in-the-middle completion; tiny VRAM footprint for the capability in its target languages; active community fine-tunes for domain-specific variants.

Cons: Low HumanEval score means it struggles with standalone function generation; poor coverage of Go, Rust, and systems languages; fewer model updates than Qwen or DeepSeek in 2025–2026.

Quantization Deep Dive — Code Quality vs Speed

The common advice is "Q4 is fine for coding." That's mostly right, but the failure mode matters more for code than for chat. A hallucinated argument type or wrong import path breaks the build. The tradeoff table for Qwen2.5-Coder 32B on an RTX 4090 (24GB):

| Quantization | Use case |
|---|---|
| Q8_0 | Critical production code on high-VRAM setups |
| Q6_K | Best quality for 48GB+ dual-GPU setups |
| Q5_K_M | Good balance on 48GB; tight on 24GB with large context |
| Q4_K_M | Default for 24GB GPUs — practical floor for serious coding |
| Q3 and below | Not recommended — hallucination rate unacceptable for code |

Estimated from community benchmarks and bandwidth-based VRAM calculations as of March 2026.

Q4_K_M is the floor for general coding. Q5_K_M is the floor for complex architectural work. Q3 and below is where you start paying in errors rather than waiting. The speed gains at Q3 don't offset the extra debugging time. Don't go there.

Warning

Q3 and Q2 quantization are not recommended for code generation. The inference speed gain is real; so is the hallucination rate on API calls, type signatures, and syntax in less common languages.
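The file sizes behind those floors follow directly from effective bits per weight. A sketch, assuming ~32.5B parameters for Qwen2.5-Coder 32B (per its model card) and approximate llama.cpp bits-per-weight averages — real GGUF sizes vary by a few percent:

```python
# Approximate effective bits-per-weight for common GGUF quant types
# (llama.cpp averages; actual file sizes vary by a few percent).
BPW = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimated GGUF file size in decimal GB."""
    return params_billion * BPW[quant] / 8  # 1e9 params * bpw bits / 8 bits-per-byte / 1e9

for q in BPW:
    print(f"32.5B params @ {q}: ~{file_size_gb(32.5, q):.1f} GB")
# Q4_K_M lands at ~19.7 GB -- the "20GB file" quoted for Qwen 32B above.
```

Run the same arithmetic for any model before buying hardware: parameters times bits-per-weight, divided by eight, plus KV cache and overhead.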

Hardware Pairing Chart — Match Your GPU to the Right Model

| GPU | Model (Q4_K_M) | Est. tok/s |
|---|---|---|
| RTX 4070 12GB | Qwen2.5-Coder 14B | ~40 |
| RTX 3090 / RTX 4090 24GB | Qwen2.5-Coder 32B | ~38 |
| 2× RTX 4090 48GB | Llama 3.1 70B | ~18–22 |

RTX 5070 Ti MSRP is $749 but currently sells for $999–$1,099+ new (March 2026). Its 16GB VRAM runs Qwen 14B cleanly but can't fully offload Qwen 32B without CPU assist — at current RTX 5000 street prices, the used RTX 4090 (24GB) is better value for Qwen 32B work.

For the full breakdown of VRAM-to-cost ratios across the RTX 5000 series and where each GPU sits for general AI workloads, see the GPU buyer's guide for 2026.

Real-World Setup Examples

Budget build (~$700 total): RTX 4070 + Qwen2.5-Coder 14B

Hardware: RTX 4070 12GB (~$450 used), any modern CPU, 32GB RAM. Software: Ollama on Linux or Windows, Continue.dev for IDE integration.

Three commands to a working setup:

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull qwen2.5-coder:14b
ollama run qwen2.5-coder:14b

For Aider, add --model ollama/qwen2.5-coder:14b to your config. Expect ~40 tok/s. This handles function generation, debugging, code explanation, and test writing without meaningful lag. See the complete Ollama local setup guide for IDE integration and GPU detection troubleshooting.

Mid-range professional (~$1,600–$1,900 total): RTX 4090 + Qwen2.5-Coder 32B

Used RTX 4090 24GB (~$1,400), 64GB DDR5 RAM, Ryzen 9 or Intel Core Ultra. Qwen 32B Q4_K_M runs at ~38 tok/s with full GPU offload. At 32K context, you can feed 2,000+ line files for repo-level analysis. For continuous indexing sessions, set PARAMETER num_ctx 32768 in an Ollama Modelfile to cap context at 32K and keep VRAM headroom stable.
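A minimal sketch of that Modelfile step, assuming the qwen2.5-coder:32b tag pulled via Ollama (the derived model name is arbitrary):

```shell
# Bake a 32K context cap into a derived Ollama model.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF

ollama create qwen2.5-coder:32b-32k -f Modelfile
ollama run qwen2.5-coder:32b-32k
```

Point your IDE integration at the derived tag and the context cap applies to every session without per-request flags.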

Power user ($3,000+ total): 2× RTX 4090 + Llama 3.1 70B

Two RTX 4090 24GB cards (~$2,800–$3,200 for the pair), 128GB RAM. Ollama handles multi-GPU detection automatically; set OLLAMA_MAX_LOADED_MODELS=1 for stable VRAM allocation. Llama 3.1 70B Q4_K_M at ~18–22 tok/s.

This setup is built for: security-sensitive environments where no API traffic can leave the network, multi-language mono-repo analysis, and 50K+ token context tasks that would overflow a 32B model's practical window. If you're not doing that kind of work daily, the Qwen 32B on a single 4090 gives you 90% of the capability at 40% of the cost. See the dual GPU local LLM stack guide for multi-GPU and memory allocation setup; note that the RTX 4090 dropped NVLink, so dual cards communicate over PCIe.

FAQ

Can I run a 70B coding model on 24GB VRAM?

Technically yes. At Q4_K_M, Llama 3.1 70B with full GPU offload needs ~40GB — on a single 24GB card, it splits between GPU and RAM, dropping speed to ~4 tok/s. That's viable for exploration and testing, not for waiting on completions during a real coding session. The minimum for comfortable 70B performance is 40GB on GPU: dual RTX 4090, RTX 6000 Ada (48GB), or an A100 80GB at enterprise scale.

Is speed or accuracy more important for coding tasks?

Accuracy. A correct completion at 20 tok/s beats a hallucinated one at 50 tok/s, every time. The exception is pure autocomplete — filling in a line you've already started — where speed and FIM-specific training matter more than reasoning depth. For that use case, StarCoder2-3B (under 4GB VRAM, trained on FIM) is a better choice than running a larger model at lower quantization. For everything else: run Q4_K_M minimum, Q5_K_M if you're doing serious architectural work.

Do I need to fine-tune these models for my codebase?

No for general coding. A well-crafted system prompt with coding style examples gets 80% of the way to codebase-specific quality without training overhead. Fine-tuning becomes worthwhile if you work with proprietary DSLs or internal frameworks with conventions the model hasn't seen — context stuffing stops being practical before it stops being needed. For most development teams, RAG-based retrieval (feeding relevant files as context) is a better first investment than fine-tuning.

Which model is best for real-time IDE autocomplete?

For completion quality: Qwen2.5-Coder 14B Q4_K_M on an RTX 4070 at ~40 tok/s keeps most completions under 200ms. For fill-in-the-middle autocomplete specifically, StarCoder2-3B is the smarter pick — it uses under 4GB VRAM and was explicitly trained on FIM objectives rather than instruction-following. See the beginner's guide to running a coding LLM locally for FIM configuration with Continue.dev.

Can I run two coding models in parallel?

On 40GB+: yes, comfortably. Two Qwen2.5-Coder 14B Q4_K_M instances split cleanly on dual RTX 4090s (~9GB VRAM each). On 24GB or less, parallel models compete for VRAM and neither runs at full speed — Ollama's model switching (3–5 second load time) is the better approach for single-GPU setups. For a head-to-head breakdown of the two leading models, see Qwen Coder vs DeepSeek Coder.

Final Verdict

Default recommendation: Qwen2.5-Coder 32B on a used RTX 4090 at Q4_K_M. 38 tok/s, 92.7% HumanEval, handles real repo work. The used RTX 4090 ($1,400 as of March 2026) costs more than a new mid-range card, but it's the only consumer GPU that runs this model without CPU offload compromising the experience. At current RTX 5000 street prices ($999–$1,099+ for the 5070 Ti), the used 4090 is better value for this specific workload.

If budget drives the decision: Qwen2.5-Coder 14B on an RTX 4070 at Q4_K_M. $450 for the GPU, three commands to set up, 40 tok/s on real coding tasks. The quality gap below 32B is genuine but won't stop you from using it productively every day.

If you need the ceiling: Llama 3.1 70B on dual RTX 4090s at Q4_K_M. Better reasoning on complex architectural problems. Overkill for 95% of developers. Necessary for the other 5% — the security-conscious, the teams with massive mono-repos, the people whose daily work involves reasoning across multiple systems at once.

One last thing the marketing copy won't tell you: "fits in 40GB VRAM" and "runs at useful speed in 40GB VRAM" are different claims. The model loads. The question is whether the throughput is practical for real work. Match VRAM to the model and quantization, not just to the file size.

Benchmarks and prices last verified: March 27, 2026. GPU used prices from eBay sold listings.

