TL;DR: Qwen 3.5 9B needs ~6.5GB of VRAM at 4-bit — any 8GB GPU works. The 27B needs ~17GB — that's RTX 3090/RX 7900 XT territory. The 35B-A3B MoE needs ~22GB to load but decodes like a 3B model, making it the speed king on a 24GB card. Qwen 2.5 72B (a different, older family) is a 47GB file at Q4_K_M and needs dual GPUs or 64GB+ unified memory.
Memory requirements per Unsloth's Qwen 3.5 documentation:
| Model | 4-bit | 8-bit | BF16 | Minimum sensible GPU (4-bit) |
|---|---|---|---|---|
| Qwen 3.5 9B | 6.5 GB | 13 GB | 19 GB | RTX 3060 8GB |
| Qwen 3.5 27B | 17 GB | 30 GB | 54 GB | RTX 3090 24GB / RX 7900 XT 20GB |
| Qwen 3.5 35B-A3B | 22 GB | 38 GB | 70 GB | RTX 3090 / 4090 / RX 7900 XTX 24GB |
| Qwen 3.5 122B-A10B | 70 GB | 132 GB | 245 GB | Multi-GPU or 96GB+ unified memory |
| Qwen 2.5 72B | 47.4 GB (Q4_K_M) | 77.3 GB (Q8_0) | — | Dual 24GB GPUs or 64GB+ unified |
On this page:
- Qwen 3.5 vs Qwen 2.5: Which Sizes Actually Exist
- VRAM Requirements by Model Size and Quantization
- GPU Recommendations Per Tier
- Tokens Per Second on Common Hardware
- Apple Silicon: MLX vs llama.cpp
- Choosing the Right Tier for Your Hardware
Qwen 3.5 vs Qwen 2.5: Which Sizes Actually Exist
The two families get mixed up constantly in searches, so let's settle it:
- Qwen 3.5 (released February 2026, Apache 2.0, natively multimodal, 262K context): 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, and the 397B-A17B flagship. The "A" sizes are sparse Mixture-of-Experts — the 35B-A3B model card lists 35B total parameters with only 3B active per token.
- Qwen 2.5 (the older, text-only family): 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, plus the Coder fine-tunes up to 32B.
There is no Qwen 2.5 9B or 27B, and no Qwen 3.5 72B. This guide covers the four sizes people actually search for: Qwen 3.5's 9B, 27B, and 35B-A3B, plus Qwen 2.5's 72B.
VRAM Requirements by Model Size and Quantization
The Qwen 3.5 figures below are from Unsloth's documentation; they cover model weights at each precision. Add headroom for KV cache — a few GB at modest context, more if you push toward the 262K native window.
Qwen 3.5 9B
| Precision | Memory | Notes |
|---|---|---|
| 3-bit | 5.5 GB | Fits on 8GB cards with context headroom |
| 4-bit | 6.5 GB | The sweet spot — still fits 8GB cards |
| 6-bit | 9 GB | Needs a 12GB card; better output quality |
| 8-bit | 13 GB | 16GB card territory |
| BF16 | 19 GB | Fine-tuning or research; no inference reason |
Minimum GPU: RTX 3060 8GB for 4-bit. RTX 3060 12GB or better for 6-bit.
The 9B is the "just works" tier — and reviewers consistently flag it as the standout of the family for its size. Any gaming GPU from the last four years handles 4-bit. If your card has 12GB, the 6-bit quant is worth the extra memory.
Qwen 3.5 27B
| Precision | Memory | Notes |
|---|---|---|
| 3-bit | 14 GB | Tight on 16GB cards — watch KV cache at long context |
| 4-bit | 17 GB | Needs 20GB+ VRAM; RTX 3090 or RX 7900 XT |
| 6-bit | 24 GB | 24GB cards exactly — zero headroom; 32GB preferred |
| 8-bit | 30 GB | 32GB cards or dual 16GB GPUs |
| BF16 | 54 GB | Impractical on consumer hardware |
Minimum GPU: a 20GB+ card for 4-bit — RX 7900 XT 20GB or RTX 3090 24GB. A 16GB card can run the 3-bit quant, tightly.
The dense 27B is the quality pick for users with 20–24GB cards who want consistent (non-MoE) behavior across all workloads.
Qwen 3.5 35B-A3B (Sparse MoE)
| Precision | Memory | Notes |
|---|---|---|
| 3-bit | 17 GB | Possible on 20GB cards |
| 4-bit | 22 GB | 24GB card minimum — RTX 3090/4090, RX 7900 XTX |
| 6-bit | 30 GB | 32GB+ VRAM or dual GPUs |
| 8-bit | 38 GB | Dual 24GB GPUs or 48GB+ unified memory |
| BF16 | 70 GB | Multi-GPU or high-memory Apple Silicon |
Minimum GPU: RTX 3090 24GB (tight at 4-bit) or RTX 4090 24GB.
Here's why this model matters: it loads like a 35B but decodes like a 3B. Only 3B parameters are active per generated token, so the per-token memory read is a fraction of what the dense 27B requires. If your card can hold the weights, the 35B-A3B is both the smarter and the faster model — it's the one Ollama picked to launch its MLX backend on Apple Silicon.
Qwen 2.5 72B
| Quantization | File size | Hardware |
|---|---|---|
| Q4_K_M | 47.42 GB | Dual 24GB GPUs (48GB — very tight), 64GB+ unified memory |
| Q5_K_M | 54.45 GB | Dual 24GB no longer fits; 64GB unified memory |
| Q8_0 | 77.26 GB | 96GB+ unified memory or server-class hardware |
File sizes from the bartowski Qwen2.5-72B-Instruct GGUF repository.
This is where consumer hardware hits its ceiling. Dual RTX 3090/4090 (48GB combined) fits Q4_K_M with almost no headroom — many builders drop to Q4_K_S or IQ4_XS for KV cache room. On a single 24GB card, your option is CPU+GPU hybrid inference (see our llama.cpp hybrid inference guide) at a fraction of the speed. For general VRAM planning, see how much VRAM do you need or the VRAM calculator.
GPU Recommendations Per Tier
For Qwen 3.5 9B
Best value: RTX 3060 12GB — handles the 6-bit quant with headroom for KV cache, commonly listed used around $200–230.
Best performance: RTX 4060 Ti 16GB — runs 8-bit comfortably and future-proofs toward the 27B's 3-bit quant.
AMD option: RX 7600 8GB for 4-bit (tight) or a 12GB card for 6-bit. ROCm support is solid on Linux for these cards.
For Qwen 3.5 27B and 35B-A3B
Sweet spot: RX 7900 XT 20GB — runs the 27B at 4-bit (17GB) with room to spare; better VRAM per dollar than an RTX 3090 if buying new.
Best single-card option: RTX 3090 24GB used or RTX 4090 24GB new — both run the 27B at 4-bit comfortably and the 35B-A3B at 4-bit (22GB) with modest context.
AMD: RX 7900 XTX 24GB — same capacity story as the 3090, with mature ROCm 6.x support on Linux.
For Qwen 2.5 72B
Dual GPU: Dual RTX 3090 (48GB combined) just fits Q4_K_M. llama.cpp's tensor split distributes layers automatically — see our multi-GPU inference guide.
Apple Silicon: 64GB unified memory handles Q4_K_M; 96GB+ handles Q8_0.
AMD Strix Halo (mini PCs): 96GB unified-memory configs handle 72B Q4_K_M fully in GPU-addressable memory once GTT is configured (Linux — see the AMD Ryzen AI Max GTT guide).
Tokens Per Second on Common Hardware
Decode speed on consumer GPUs is limited by memory bandwidth, not compute. That gives you a ceiling you can calculate yourself: bandwidth ÷ bytes read per token. For dense models, bytes per token ≈ the weight file size; real-world results typically land at 60–80% of the ceiling.
| Model @ 4-bit | Bytes/token | RTX 3090 (936 GB/s) ceiling | RX 7900 XT (800 GB/s) ceiling |
|---|---|---|---|
| Qwen 3.5 9B (6.5 GB) | ~6.5 GB | ~144 tok/s | ~123 tok/s |
| Qwen 3.5 27B (17 GB) | ~17 GB | ~55 tok/s | ~47 tok/s |
| Qwen 3.5 35B-A3B | ~2–3 GB (3B active) | well over 100 tok/s | well over 100 tok/s |
| Qwen 2.5 72B (47 GB, dual GPU) | ~47 GB | ~20 tok/s | — |
The MoE row is the story: because only ~3B parameters activate per token, the 35B-A3B reads a few GB per token instead of 22 — which is how Ollama's MLX backend posts 112 tok/s on Apple Silicon for this exact model.
Apple Silicon: MLX vs llama.cpp
Apple Silicon is uniquely suited to the Qwen families because of unified memory — the same pool serves CPU and GPU with no PCIe transfer. Unsloth notes the 35B-A3B fits on a 22GB-class Mac and the 27B on an 18GB-class device at 4-bit.
For Qwen 3.5 specifically, MLX is the fast path — it's the framework behind Ollama's Apple Silicon preview, which roughly doubled decode speed over the old llama.cpp Metal path (58 → 112 tok/s on M5-generation hardware, per Ollama's announcement). Running MLX directly:
pip install mlx-lm
mlx_lm.generate --model Qwen/Qwen3.5-35B-A3B --prompt "Explain transformer attention"
For the runtime trade-offs, see our MLX vs llama.cpp comparison. For Qwen 2.5 72B: 64GB unified memory handles Q4_K_M, and 96GB+ configurations handle Q8_0.
Choosing the Right Tier for Your Hardware
| Your hardware | Run this | Why |
|---|---|---|
| 8GB GPU | Qwen 3.5 9B @ 4-bit | Maximum quality that fits |
| 12GB GPU | Qwen 3.5 9B @ 6-bit | Quality jump, still comfortable |
| 16GB GPU | Qwen 3.5 27B @ 3-bit or 9B @ 8-bit | First taste of the mid tier |
| 20–24GB GPU | Qwen 3.5 35B-A3B @ 4-bit | MoE speed + 35B quality on one card |
| Dual 24GB / 64GB unified | Qwen 2.5 72B @ Q4_K_M | The classic large-model tier |
| 96GB+ unified | Qwen 3.5 122B-A10B @ 4-bit (70 GB) | Frontier-class, still MoE-fast |
One more thing: the Qwen 2.5 Coder models (up to 32B) follow the same VRAM requirements as their base counterparts — for that specific build, see Qwen 2.5 Coder 32B hardware requirements. And for grabbing the right quantized file for your card, see the HuggingFace GGUF download guide.