CraftRigs
Architecture Guide

Qwen 3.5 Hardware Requirements: VRAM for 9B, 27B & 35B [2026]

By Georgia Thomas

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: Qwen 3.5 9B needs ~6.5GB of VRAM at 4-bit — any 8GB GPU works. The 27B needs ~17GB — that's RTX 3090/RX 7900 XT territory. The 35B-A3B MoE needs ~22GB to load but decodes like a 3B model, making it the speed king on a 24GB card. Qwen 2.5 72B (a different, older family) is a 47GB file at Q4_K_M and needs dual GPUs or 64GB+ unified memory.

Memory requirements per Unsloth's Qwen 3.5 documentation:

Model | 4-bit | 8-bit | BF16 | Minimum sensible GPU (4-bit)
Qwen 3.5 9B | 6.5 GB | 13 GB | 19 GB | RTX 3060 8GB
Qwen 3.5 27B | 17 GB | 30 GB | 54 GB | RTX 3090 24GB / RX 7900 XT 20GB
Qwen 3.5 35B-A3B | 22 GB | 38 GB | 70 GB | RTX 3090 / 4090 / RX 7900 XTX 24GB
Qwen 3.5 122B-A10B | 70 GB | 132 GB | 245 GB | Multi-GPU or 96GB+ unified memory
Qwen 2.5 72B | 47.4 GB (Q4_K_M) | 77.3 GB (Q8_0) | not listed | Dual 24GB GPUs or 64GB+ unified


Qwen 3.5 vs Qwen 2.5: Which Sizes Actually Exist

The two families get mixed up constantly in searches, so let's settle it:

  • Qwen 3.5 (released February 2026, Apache 2.0, natively multimodal, 262K context): 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, and the 397B-A17B flagship. The "A" sizes are sparse Mixture-of-Experts — the 35B-A3B model card lists 35B total parameters with only 3B active per token.
  • Qwen 2.5 (the older, text-only family): 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, plus the Coder fine-tunes up to 32B.

There is no Qwen 2.5 9B or 27B, and no Qwen 3.5 72B. This guide covers the four sizes people actually search for: Qwen 3.5's 9B, 27B, and 35B-A3B, plus Qwen 2.5's 72B.


VRAM Requirements by Model Size and Quantization

The Qwen 3.5 figures below are from Unsloth's documentation; they cover model weights at each precision. Add headroom for KV cache — a few GB at modest context, more if you push toward the 262K native window.
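
To make the headroom rule concrete, here is a minimal Python sketch that picks the largest precision whose weights still leave room for KV cache. The memory figures are the Unsloth numbers quoted in this section; the helper name best_precision and the 1.5 GB default headroom are illustrative assumptions, not measured values, so raise the headroom for longer contexts.

```python
# Weight footprints in GB, copied from the tables in this guide.
WEIGHTS_GB = {
    "Qwen 3.5 9B": {"3-bit": 5.5, "4-bit": 6.5, "6-bit": 9, "8-bit": 13, "BF16": 19},
    "Qwen 3.5 27B": {"3-bit": 14, "4-bit": 17, "6-bit": 24, "8-bit": 30, "BF16": 54},
    "Qwen 3.5 35B-A3B": {"3-bit": 17, "4-bit": 22, "6-bit": 30, "8-bit": 38, "BF16": 70},
}


def best_precision(model: str, vram_gb: float, headroom_gb: float = 1.5):
    """Return the largest precision whose weights plus headroom fit, or None.

    headroom_gb is an assumed allowance for KV cache and runtime overhead.
    """
    budget = vram_gb - headroom_gb
    fitting = [(gb, name) for name, gb in WEIGHTS_GB[model].items() if gb <= budget]
    # max() compares on the weight size first, so the largest fitting quant wins.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for model in WEIGHTS_GB:
        for vram in (8, 12, 16, 20, 24):
            print(f"{model} on {vram}GB: {best_precision(model, vram)}")
```

With these assumptions, an 8GB card lands on the 9B at 4-bit and a 16GB card lands on the 27B at 3-bit, matching the tables.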

Qwen 3.5 9B

Precision | Memory | Notes
3-bit | 5.5 GB | Fits on 8GB cards with context headroom
4-bit | 6.5 GB | The sweet spot; still fits 8GB cards
6-bit | 9 GB | Needs a 12GB card; better output quality
8-bit | 13 GB | 16GB card territory
BF16 | 19 GB | For fine-tuning or research; little reason to use it for inference

Minimum GPU: RTX 3060 8GB for 4-bit. RTX 3060 12GB or better for 6-bit.

The 9B is the "just works" tier, and reviewers consistently flag it as the standout of the family for its size. Any gaming GPU with at least 8GB of VRAM handles 4-bit. If your card has 12GB, the 6-bit quant is worth the extra memory.

Qwen 3.5 27B

Precision | Memory | Notes
3-bit | 14 GB | Tight on 16GB cards; watch KV cache at long context
4-bit | 17 GB | Needs 20GB+ VRAM; RTX 3090 or RX 7900 XT
6-bit | 24 GB | Fills a 24GB card exactly, with zero headroom; 32GB preferred
8-bit | 30 GB | 32GB cards or dual 16GB GPUs
BF16 | 54 GB | Impractical on consumer hardware

Minimum GPU: a 20GB+ card for 4-bit — RX 7900 XT 20GB or RTX 3090 24GB. A 16GB card can run the 3-bit quant, tightly.

The dense 27B is the quality pick for users with 20–24GB cards who want consistent (non-MoE) behavior across all workloads.

Qwen 3.5 35B-A3B (Sparse MoE)

Precision | Memory | Notes
3-bit | 17 GB | Possible on 20GB cards
4-bit | 22 GB | 24GB card minimum: RTX 3090/4090, RX 7900 XTX
6-bit | 30 GB | 32GB+ VRAM or dual GPUs
8-bit | 38 GB | Dual 24GB GPUs or 48GB+ unified memory
BF16 | 70 GB | Multi-GPU or high-memory Apple Silicon

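Minimum GPU: a 24GB card for 4-bit (RTX 3090, RTX 4090, or RX 7900 XTX). The 22GB of weights leaves little room for context, so expect a tight fit. A 20GB card can run the 3-bit quant.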

Here's why this model matters: it loads like a 35B but decodes like a 3B. Only 3B parameters are active per generated token, so the per-token memory read is a fraction of what the dense 27B requires. If your card can hold the weights, the 35B-A3B is both the smarter and the faster model — it's the one Ollama picked to launch its MLX backend on Apple Silicon.

Qwen 2.5 72B

Quantization | File size | Hardware
Q4_K_M | 47.42 GB | Dual 24GB GPUs (48GB total, very tight) or 64GB+ unified memory
Q5_K_M | 54.45 GB | Too large for dual 24GB GPUs; 64GB unified memory
Q8_0 | 77.26 GB | 96GB+ unified memory or server-class hardware

File sizes from the bartowski Qwen2.5-72B-Instruct GGUF repository.

This is where consumer hardware hits its ceiling. Dual RTX 3090/4090 (48GB combined) fits Q4_K_M with almost no headroom — many builders drop to Q4_K_S or IQ4_XS for KV cache room. On a single 24GB card, your option is CPU+GPU hybrid inference (see our llama.cpp hybrid inference guide) at a fraction of the speed. For general VRAM planning, see how much VRAM do you need or the VRAM calculator.
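
For the single-card hybrid case, a rough way to pick how many layers to offload is to divide the GGUF file size evenly across the model's layers. The Python sketch below does that. The helper gpu_layers, the 3 GB headroom, the equal-sized-layers simplification, and the 80-layer count are all assumptions for illustration; check the layer count against the model's config and treat the result as a starting value for llama.cpp's --n-gpu-layers option, not a guarantee.

```python
import math


def gpu_layers(file_gb: float, n_layers: int, vram_gb: float, headroom_gb: float = 3.0) -> int:
    """Estimate how many layers fit in VRAM, assuming every layer is the same size.

    This ignores embeddings and the output head, so it is only a first guess.
    """
    per_layer_gb = file_gb / n_layers
    usable_gb = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 47.42 GB is the Q4_K_M file size from the table above.
    # n_layers=80 is an assumed value; verify it for the model you download.
    print(gpu_layers(file_gb=47.42, n_layers=80, vram_gb=24))
```

If generation runs out of memory at the estimated value, lower the layer count a few at a time; the remaining layers run on the CPU, which is where the speed penalty comes from.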


GPU Recommendations Per Tier

For Qwen 3.5 9B

Best value: RTX 3060 12GB — handles the 6-bit quant with headroom for KV cache, commonly listed used around $200–230.

Best performance: RTX 4060 Ti 16GB — runs 8-bit comfortably and future-proofs toward the 27B's 3-bit quant.

AMD option: RX 7600 8GB for 4-bit (tight) or a 12GB card for 6-bit. ROCm support is solid on Linux for these cards.

For Qwen 3.5 27B and 35B-A3B

Sweet spot: RX 7900 XT 20GB — runs the 27B at 4-bit (17GB) with room to spare; better VRAM per dollar than an RTX 3090 if buying new.

Best single-card option: RTX 3090 24GB used or RTX 4090 24GB new — both run the 27B at 4-bit comfortably and the 35B-A3B at 4-bit (22GB) with modest context.

AMD: RX 7900 XTX 24GB — same capacity story as the 3090, with mature ROCm 6.x support on Linux.

For Qwen 2.5 72B

Dual GPU: Dual RTX 3090 (48GB combined) just fits Q4_K_M. llama.cpp splits the model's layers across both GPUs by default, and its --tensor-split option adjusts the proportion assigned to each card; see our multi-GPU inference guide.

Apple Silicon: 64GB unified memory handles Q4_K_M; 96GB+ handles Q8_0.

AMD Strix Halo (mini PCs): 96GB unified-memory configs handle 72B Q4_K_M fully in GPU-addressable memory once GTT is configured (Linux — see the AMD Ryzen AI Max GTT guide).


Tokens Per Second on Common Hardware

Decode speed on consumer GPUs is limited by memory bandwidth, not compute. That gives you a ceiling you can calculate yourself: bandwidth ÷ bytes read per token. For dense models, bytes per token ≈ the weight file size; real-world results typically land at 60–80% of the ceiling.

Model @ 4-bit | Bytes/token | RTX 3090 (936 GB/s) ceiling | RX 7900 XT (800 GB/s) ceiling
Qwen 3.5 9B (6.5 GB) | ~6.5 GB | ~144 tok/s | ~123 tok/s
Qwen 3.5 27B (17 GB) | ~17 GB | ~55 tok/s | ~47 tok/s
Qwen 3.5 35B-A3B | ~2–3 GB (3B active) | well over 100 tok/s | well over 100 tok/s (3-bit only; the 22GB 4-bit quant exceeds 20GB)
Qwen 2.5 72B (47 GB, dual GPU) | ~47 GB | ~20 tok/s | n/a

The MoE row is the story: because only ~3B parameters activate per token, the 35B-A3B reads a few GB per token instead of 22 — which is how Ollama's MLX backend posts 112 tok/s on Apple Silicon for this exact model.
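
The ceiling arithmetic is easy to reproduce. The following Python sketch applies bandwidth ÷ bytes per token to the figures in the table. The function names are hypothetical, and the 3 GB value for the 35B-A3B is simply the upper end of the ~2–3 GB estimate above, so read its output as a rough bound rather than a benchmark.

```python
def decode_ceiling(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode speed: memory bandwidth divided by data read per token."""
    return bandwidth_gb_s / gb_per_token


def realistic_range(ceiling: float, low: float = 0.6, high: float = 0.8):
    """Apply the 60-80% real-world factor quoted in this section."""
    return ceiling * low, ceiling * high


# Bandwidth figures (GB/s) and per-token reads (GB) taken from the table above.
GPUS = {"RTX 3090": 936, "RX 7900 XT": 800}
MODELS = {
    "Qwen 3.5 9B": 6.5,
    "Qwen 3.5 27B": 17,
    "Qwen 3.5 35B-A3B": 3,  # assumed: upper end of the ~2-3 GB active read
}

if __name__ == "__main__":
    for gpu, bandwidth in GPUS.items():
        for model, gb in MODELS.items():
            ceiling = decode_ceiling(bandwidth, gb)
            lo, hi = realistic_range(ceiling)
            print(f"{gpu} | {model}: ceiling ~{ceiling:.0f} tok/s, typical ~{lo:.0f}-{hi:.0f} tok/s")
```

Swapping in your own card's bandwidth and the file size of the quant you plan to run gives a quick sanity check before you buy or download anything.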


Apple Silicon: MLX vs llama.cpp

Apple Silicon is uniquely suited to the Qwen families because of unified memory — the same pool serves CPU and GPU with no PCIe transfer. Unsloth notes the 35B-A3B fits on a 22GB-class Mac and the 27B on an 18GB-class device at 4-bit.

For Qwen 3.5 specifically, MLX is the fast path — it's the framework behind Ollama's Apple Silicon preview, which roughly doubled decode speed over the old llama.cpp Metal path (58 → 112 tok/s on M5-generation hardware, per Ollama's announcement). Running MLX directly:

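# Install the MLX language-model tooling (Apple Silicon only)
pip install mlx-lm
# Fetch the model from Hugging Face on first run, then generate a completion
mlx_lm.generate --model Qwen/Qwen3.5-35B-A3B --prompt "Explain transformer attention"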

For the runtime trade-offs, see our MLX vs llama.cpp comparison. For Qwen 2.5 72B: 64GB unified memory handles Q4_K_M, and 96GB+ configurations handle Q8_0.


Choosing the Right Tier for Your Hardware

Your hardware | Run this | Why
8GB GPU | Qwen 3.5 9B @ 4-bit | Maximum quality that fits
12GB GPU | Qwen 3.5 9B @ 6-bit | Quality jump, still comfortable
16GB GPU | Qwen 3.5 27B @ 3-bit or 9B @ 8-bit | First taste of the mid tier
20GB GPU | Qwen 3.5 27B @ 4-bit | Dense 27B quality; the 35B-A3B only fits at 3-bit
24GB GPU | Qwen 3.5 35B-A3B @ 4-bit | MoE speed + 35B quality on one card
Dual 24GB / 64GB unified | Qwen 2.5 72B @ Q4_K_M | The classic large-model tier
96GB+ unified | Qwen 3.5 122B-A10B @ 4-bit (70 GB) | Frontier-class, still MoE-fast

One more thing: the Qwen 2.5 Coder models (up to 32B) follow the same VRAM requirements as their base counterparts — for that specific build, see Qwen 2.5 Coder 32B hardware requirements. And for grabbing the right quantized file for your card, see the HuggingFace GGUF download guide.
