CraftRigs
Architecture Guide

Open-Source LLM GPU Requirements 2026: Llama 4, Qwen 3.5, Mistral

By Charlotte Stewart 10 min read
Every Major Open-Source LLM in 2026: What GPU Do You Need to Run It — guide diagram

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: The 2026 open-source flagships are mostly Mixture-of-Experts models with huge total parameter counts — most don't fit in consumer VRAM as whole files. What actually runs well on one GPU: dense models up to ~27B (Qwen 3.5, Gemma 4) and the MoE mid-tier (Qwen 3.5 35B-A3B) on 16–24GB cards. The giants — Mistral Small 4 (119B), Llama 4 Scout (109B), DeepSeek V4, Kimi K2.5 — need either heavy system-RAM offload or an API. This guide maps every major model to real file sizes and the GPU that handles it.

On this page:


The Quick Model-to-GPU Lookup

Running an open-source LLM locally comes down to one number: the model's file size at the quantization level you'll accept, versus the memory that has to hold it. Use the VRAM calculator for your exact configuration.

ModelArchitectureQ4 file sizeWhat runs itEst. decode speed
Qwen 3.5 9BDense~6.5 GBAny 8GB+ GPU~40–55 tok/s on RTX 5060
Gemma 4 12BDense~7–8 GBAny 12GB GPU~35–50 tok/s on RTX 5060 Ti
Qwen 3.5 27BDense~17 GB24GB card (Q4) or 16GB card (Q3)~33–44 tok/s on RTX 3090
Qwen 3.5 35B-A3BMoE (3B active)~22 GB24GB card fully resident; 16GB with offload100+ tok/s fully resident
Llama 4 ScoutMoE 109B (17B active)65.4 GB64GB+ system RAM + GPU offloadSingle-digit tok/s offloaded
Mistral Small 4MoE 119B (6B active)73.8 GB96GB+ system RAM + GPU offloadLow-teens tok/s offloaded
Qwen 3.5 122B-A10BMoE (10B active)~70 GB96GB+ RAM + offload, or rentalDepends on RAM bandwidth
DeepSeek V4-FlashMoE 284B (13B active)150+ GBMulti-GPU cluster or APIAPI recommended
Kimi K2.5MoE 1T (32B active)Cluster-scaleAPIAPI recommended
DeepSeek V4-ProMoE 1.6T (49B active)Cluster-scaleAPIAPI recommended

File sizes from the published GGUF repositories linked in each model section below. Speed estimates are bandwidth arithmetic (explained at the end), not measurements.


Why 2026 Models Don't Fit Like 2025 Models Did

In 2025, "can I run it" meant comparing a dense model's file size to your VRAM. In 2026 the flagship releases — Llama 4, Mistral Small 4, DeepSeek V4, Kimi K2.5, the large Qwen 3.5 variants — are almost all Mixture-of-Experts. That changes the math in two directions:

Total parameters set the memory bill. A MoE model's full weight file must live somewhere — VRAM, system RAM, or both. Mistral Small 4's 119B parameters produce a 73.8 GB Q4 file no matter how few parameters activate per token.

Active parameters set the speed. During decode, only the routed experts are read per token. Mistral Small 4 activates ~6B parameters per token, so when the whole model sits in fast memory it decodes like a small model. When experts sit in system RAM (the realistic consumer setup, via llama.cpp's MoE offload), decode speed is bounded by your system RAM bandwidth, not your GPU.

The practical consequence: a 16–24GB GPU plus 64–96GB of fast DDR5 is the new enthusiast platform for big MoE models, and our CPU+GPU hybrid inference guide covers the offload mechanics. For everything else, dense models up to ~27B remain the sweet spot for a single card.


The Models, One by One

Qwen 3.5 — The Lineup That Covers Every Tier

Qwen 3.5 (released February 2026, Apache 2.0) ships in 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, and 397B-A17B sizes. Three of them matter for consumer hardware:

  • Qwen 3.5 9B — ~6.5 GB at 4-bit. Fits any modern 8GB+ card with context headroom. The default "first real model" for budget GPUs.
  • Qwen 3.5 27B — ~17 GB at 4-bit, ~14 GB at 3-bit, ~30 GB at 8-bit. The strongest dense model you can host on one consumer card. Q4 wants a 24GB card; 16GB cards run the 3-bit quant with modest context.
  • Qwen 3.5 35B-A3B — MoE, ~22 GB at 4-bit, only 3B parameters active per token. Fully resident on a 24GB card it decodes dramatically faster than the 27B dense model. On 16GB cards, offload the expert layers to system RAM (--n-cpu-moe in llama.cpp; Ollama handles it automatically) and it remains very usable.

The 122B-A10B (~70 GB at Q4) and 397B-A17B variants are rental-or-cluster territory. There is no Qwen 3.5 32B or 200B — if you see those sizes cited, the source confused this family with Qwen 2.5's lineup. Full per-size breakdown in our Qwen 3.5 hardware requirements guide.

Gemma 4 — Google's Consumer-Friendly Family

Gemma 4 (April 2, 2026) ships E2B and E4B edge models, a 12B dense model, a 26B MoE (3.8B active), and a 31B dense model. The 12B at Q4 (~7–8 GB) is an excellent fit for 12GB cards, and the E2B runs on virtually anything — including CPU-only — which makes it the right starting point if you want to learn the pipeline before buying GPU hardware. VRAM details by size in our Gemma 4 hardware requirements article.

Llama 4 Scout — Big MoE, Honest Numbers

Llama 4 Scout (Meta, April 2025) is a 109B-total, 17B-active MoE. The Q4_K_M GGUF is 65.4 GB — this is not a 12GB-VRAM model under any quantization you'd want to use. Realistic consumer hosting means 64GB+ of system RAM with GPU offload, and because 17B parameters activate per token, offloaded decode reads ~10 GB per token from system RAM. On dual-channel DDR5 that puts the arithmetic ceiling in the single digits of tokens per second. Scout is best treated as a capability you rent or a reason to own a high-memory Mac, not a reason to buy a consumer GPU.

Mistral Small 4 — The Efficient Giant

Mistral Small 4 (March 16, 2026, Apache 2.0) is a 119B-total MoE with 6B active parameters per token (128 experts, 4 active) and a 256K context window. The Q4_K_M file is 73.8 GB, so plan on 96GB of system RAM for comfortable offloaded hosting. The 6B active count is what makes it interesting: experts streaming from dual-channel DDR5 (~80–90 GB/s) put the decode ceiling around 20 tok/s, with real-world offloaded results landing in the low teens — slow but genuinely usable for a model of this quality. On unified-memory machines that hold the whole file (96GB+ Apple Silicon), it runs fully resident and considerably faster.

DeepSeek V4 — Rent It

DeepSeek V4 (2026) comes as V4-Pro — 1.6T total parameters, 49B active — and V4-Flash at 284B total, 13B active, both with 1M-token context. Neither is consumer-hostable: even V4-Flash's quantized weights run 150+ GB. If you need V4-class reasoning, use the DeepSeek API or rented cluster compute. No quantization trick changes this math on a 24GB card.

Warning

Don't buy hardware for a model class you can't host. A dual-GPU consumer rig still cannot hold V4-Flash, and the engineering time spent trying is worth more than a year of API usage.

Kimi K2.5 — The Long-Context Specialist

Kimi K2.5 (Moonshot AI) is a 1T-parameter MoE with 32B active parameters and 256K native context, under a Modified MIT license. Like DeepSeek V4 it's cluster-scale — the open weights matter for researchers and inference providers, not for single-GPU builders. Access it via the Moonshot API when you need extreme context.


Hardware Tiers: What to Buy

Choose the GPU based on which model tier you actually need, not which price bracket you can stretch to.

Budget ($300–$600): Dense Models to ~12B

The RTX 5060 8GB at $299 MSRP (448 GB/s) and the RTX 3060 12GB (360 GB/s, used market) cover Qwen 3.5 9B, Gemma 4 12B, and every 7–8B model at Q4 with real speed. The 3060's extra 4GB buys longer context and 12B-class fits; the 5060's newer GDDR7 buys raw decode speed. Our RTX 5060 vs RTX 3060 comparison works through that trade in detail. Used RTX 4070 12GB cards (504 GB/s) are also strong value here when priced near the 5060.

Don't attempt 27B+ on 8GB — extreme quants fit but quality and context suffer. Stay in the 7–14B range where these cards shine.

Mid-Range ($429–$800): 27B Dense or the MoE Mid-Tier

The RTX 5060 Ti 16GB ($429 MSRP) and RTX 5070 Ti 16GB ($749 MSRP, 896 GB/s) are the productivity floor. 16GB runs Qwen 3.5 27B at 3-bit, every 14B at Q5, and — the headline use case — Qwen 3.5 35B-A3B with expert offload. The 5070 Ti's doubled bandwidth over the 5060 Ti makes it the pick if your budget reaches; otherwise the 5060 Ti 16GB delivers the same fits at lower speed.

Pair either card with 64GB of DDR5 and you can also experiment with the offloaded giants (Scout, Mistral Small 4) at patience-required speeds.

High-End ($800–$2,000+): 27B at Full Quality, 70B, and Resident MoE

Three paths:

  • Used RTX 3090 24GB (~$800, 936 GB/s) — still the value king for VRAM. Runs Qwen 3.5 27B at Q4 (~33–44 tok/s estimated), 35B-A3B fully resident, and pairs into a 48GB dual-card rig for Llama 3.3 70B Q4 (42.5 GB file) via llama.cpp --tensor-split or vLLM tensor parallelism over PCIe.
  • RTX 5090 32GB ($1,999 MSRP, 1,792 GB/s) — the fastest single card: 27B Q4 at an estimated 63–84 tok/s ceiling-derived range, 35B-A3B resident with room for long context.
  • Apple Silicon with 96–128GB unified memory — the cleanest consumer host for the 65–75 GB MoE files (Scout, Mistral Small 4), since the whole model sits in unified memory. See our Mac local LLM guide for the bandwidth-by-chip breakdown.

Multi-GPU note: consumer NVLink ended with the RTX 3090. Modern multi-GPU inference runs over PCIe with tensor-parallel software — it works well for fitting bigger models, less well for raw speed scaling.


Quantization Strategy

Inference on consumer GPUs is memory-bandwidth-bound, which means smaller quants are faster, not slower — every byte you shave off the file is a byte you don't read per token.

  • Q4_K_M (default): roughly a quarter of the FP16 file size, which translates directly to ~4× the decode-speed ceiling. Quality loss is imperceptible for chat and general work. Use it everywhere unless you have a reason not to.
  • Q5/Q6: for when you have VRAM headroom and do long-document or precision-sensitive work. Modest quality gain, proportionally slower decode.
  • Q3 and below: a fit-enabler, not a preference. Qwen 3.5 27B at 3-bit (~14 GB) onto a 16GB card is the canonical good use. Below Q3, reasoning quality degrades noticeably — test your own workload before committing.
  • FP16/BF16: for research comparisons only. Locally it wastes memory and quarters your speed ceiling.

Where These Numbers Come From

Every speed figure in this article is an estimate from published specs, not a measurement: the decode ceiling is memory bandwidth (GB/s) ÷ bytes read per token, and real-world results from community benchmarks consistently land at 60–80% of that ceiling. For dense models, bytes per token ≈ the model file size. For MoE models, bytes per token ≈ the active-parameter share of the file — which is why a 22 GB MoE file can decode several times faster than a 17 GB dense file.

File sizes come from the linked GGUF repositories (unsloth, bartowski); model specs from the official releases linked in each section; GPU bandwidth from NVIDIA's specifications; community benchmark context from the llama.cpp performance discussions. Inference engine, context length, and driver version all move results within the range — verify on your own hardware before a purchase decision.


Frequently Asked Questions

Can I quantize any 27B+ model down until it fits on 8GB VRAM?

You can make it fit; you can't make it good. Sub-3-bit quants of mid-size dense models show visible reasoning drift. On 8GB, a well-quantized 9B beats a crushed 27B for almost every workload.

Why is my 16GB card slow on Qwen 3.5 35B-A3B?

The Q4 file (~22 GB) doesn't fit in 16GB, so expert layers spill to system RAM and your decode speed follows your RAM bandwidth, not your GPU's. Faster DDR5 helps; so does keeping more layers on-GPU with careful --n-cpu-moe tuning. Fully resident on a 24GB card, the same model is several times faster.

Is DeepSeek V4 really better than hosting a 70B locally?

Different problems. V4-class reasoning is beyond anything you can host on consumer hardware — that's what APIs are for. A local Llama 3.3 70B or Qwen 3.5 27B covers private, offline, latency-insensitive work. Most builders end up with both: a local daily driver plus an API for frontier-model tasks.

Does PCIe 4.0 vs 5.0 matter for inference?

Not for single-GPU inference — the model loads once and decode traffic stays on-card. It matters somewhat for multi-GPU tensor parallelism and for MoE offload setups where expert weights stream across the bus. Details in our PCIe lanes guide.

Should I overclock my GPU for LLM inference?

No. Memory overclocks buy a few percent of bandwidth at the cost of stability and heat under sustained inference load. Spend the effort on quantization choice instead — moving from Q5 to Q4 is worth more than any overclock.


Updated June 2026. Model specs from official releases: Mistral Small 4, Kimi K2.5, DeepSeek V4, Qwen 3.5, Gemma 4. File sizes from the linked GGUF repositories. Speed figures are bandwidth-arithmetic estimates, not measurements.

llm-benchmarks gpu-requirements local-ai hardware-guide vram-specs

Technical Intelligence, Weekly.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.