Is there a Qwen 2.5 9B or 27B model?

No. Qwen 2.5 ships in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. The 9B, 27B, and 35B-A3B sizes belong to Qwen 3.5, released February 2026. If you're searching for 9B or 27B VRAM requirements, you're looking for the Qwen 3.5 family.

Can I run Qwen 3.5 35B-A3B on a single consumer GPU?

Yes. At 4-bit quantization the 35B-A3B needs about 22GB (per Unsloth's documentation), so a 24GB card — RTX 3090, RTX 4090, or RX 7900 XTX — runs it with modest context. Because only 3B parameters are active per token, it also decodes much faster than a dense model of the same file size.

Can I run Qwen 2.5 72B on a single consumer GPU?

No. The Q4_K_M GGUF is 47.42GB, which exceeds any single consumer card. You need dual 24GB GPUs (48GB combined, tight), a 48GB workstation card, 64GB+ Apple Silicon unified memory, or CPU+GPU hybrid inference with llama.cpp at reduced speed.

How does Qwen 3.5 run on Apple Silicon?

Well — and it's the family Ollama chose for its MLX preview. Unsloth notes the 35B-A3B fits on a 22GB-class Mac at 4-bit and the 27B on an 18GB-class device. Ollama's MLX backend published 112 tok/s decode for Qwen3.5-35B-A3B on M5-generation hardware; Macs need more than 32GB unified memory for that path.

Does the Qwen family support function calling and structured output?

Yes. Both Qwen 2.5 and Qwen 3.5 instruction-tuned variants support function calling, JSON mode, and structured output, exposed through Ollama's tool-calling API or llama.cpp server's OpenAI-compatible endpoint. Larger sizes produce more reliable structured output than the small variants.

How does Qwen 2.5 Coder compare to the base models for coding tasks?

Qwen 2.5 Coder is a code-specialized fine-tune of the Qwen 2.5 series (up to 32B) with VRAM requirements identical to base variants at the same parameter count. For pure coding workloads it outperforms the base model; for mixed coding and conversation, the base instruction model is more versatile.

Qwen 3.5 Hardware Requirements: VRAM for 9B, 27B & 35B [2026]

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: Qwen 3.5 9B needs ~6.5GB of VRAM at 4-bit — any 8GB GPU works. The 27B needs ~17GB — that's RTX 3090/RX 7900 XT territory. The 35B-A3B MoE needs ~22GB to load but decodes like a 3B model, making it the speed king on a 24GB card. Qwen 2.5 72B (a different, older family) is a 47GB file at Q4_K_M and needs dual GPUs or 64GB+ unified memory.

Memory requirements per Unsloth's Qwen 3.5 documentation:

Model	4-bit	8-bit	BF16	Minimum sensible GPU (4-bit)
Qwen 3.5 9B	6.5 GB	13 GB	19 GB	RTX 3060 8GB
Qwen 3.5 27B	17 GB	30 GB	54 GB	RTX 3090 24GB / RX 7900 XT 20GB
Qwen 3.5 35B-A3B	22 GB	38 GB	70 GB	RTX 3090 / 4090 / RX 7900 XTX 24GB
Qwen 3.5 122B-A10B	70 GB	132 GB	245 GB	Multi-GPU or 96GB+ unified memory
Qwen 2.5 72B	47.4 GB (Q4_K_M)	77.3 GB (Q8_0)	—	Dual 24GB GPUs or 64GB+ unified

On this page:

Qwen 3.5 vs Qwen 2.5: Which Sizes Actually Exist
VRAM Requirements by Model Size and Quantization
GPU Recommendations Per Tier
Tokens Per Second on Common Hardware
Apple Silicon: MLX vs llama.cpp
Choosing the Right Tier for Your Hardware

Qwen 3.5 vs Qwen 2.5: Which Sizes Actually Exist

The two families get mixed up constantly in searches, so let's settle it:

Qwen 3.5 (released February 2026, Apache 2.0, natively multimodal, 262K context): 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, and the 397B-A17B flagship. The "A" sizes are sparse Mixture-of-Experts — the 35B-A3B model card lists 35B total parameters with only 3B active per token.
Qwen 2.5 (the older, text-only family): 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, plus the Coder fine-tunes up to 32B.

There is no Qwen 2.5 9B or 27B, and no Qwen 3.5 72B. This guide covers the four sizes people actually search for: Qwen 3.5's 9B, 27B, and 35B-A3B, plus Qwen 2.5's 72B.

VRAM Requirements by Model Size and Quantization

The Qwen 3.5 figures below are from Unsloth's documentation; they cover model weights at each precision. Add headroom for KV cache — a few GB at modest context, more if you push toward the 262K native window.

Qwen 3.5 9B

Precision	Memory	Notes
3-bit	5.5 GB	Fits on 8GB cards with context headroom
4-bit	6.5 GB	The sweet spot — still fits 8GB cards
6-bit	9 GB	Needs a 12GB card; better output quality
8-bit	13 GB	16GB card territory
BF16	19 GB	Fine-tuning or research; no inference reason

Minimum GPU: RTX 3060 8GB for 4-bit. RTX 3060 12GB or better for 6-bit.

The 9B is the "just works" tier — and reviewers consistently flag it as the standout of the family for its size. Any gaming GPU from the last four years handles 4-bit. If your card has 12GB, the 6-bit quant is worth the extra memory.

Qwen 3.5 27B

Precision	Memory	Notes
3-bit	14 GB	Tight on 16GB cards — watch KV cache at long context
4-bit	17 GB	Needs 20GB+ VRAM; RTX 3090 or RX 7900 XT
6-bit	24 GB	24GB cards exactly — zero headroom; 32GB preferred
8-bit	30 GB	32GB cards or dual 16GB GPUs
BF16	54 GB	Impractical on consumer hardware

Minimum GPU: a 20GB+ card for 4-bit — RX 7900 XT 20GB or RTX 3090 24GB. A 16GB card can run the 3-bit quant, tightly.

The dense 27B is the quality pick for users with 20–24GB cards who want consistent (non-MoE) behavior across all workloads.

Qwen 3.5 35B-A3B (Sparse MoE)

Precision	Memory	Notes
3-bit	17 GB	Possible on 20GB cards
4-bit	22 GB	24GB card minimum — RTX 3090/4090, RX 7900 XTX
6-bit	30 GB	32GB+ VRAM or dual GPUs
8-bit	38 GB	Dual 24GB GPUs or 48GB+ unified memory
BF16	70 GB	Multi-GPU or high-memory Apple Silicon

Minimum GPU: RTX 3090 24GB (tight at 4-bit) or RTX 4090 24GB.

Here's why this model matters: it loads like a 35B but decodes like a 3B. Only 3B parameters are active per generated token, so the per-token memory read is a fraction of what the dense 27B requires. If your card can hold the weights, the 35B-A3B is both the smarter and the faster model — it's the one Ollama picked to launch its MLX backend on Apple Silicon.

Qwen 2.5 72B

Quantization	File size	Hardware
Q4_K_M	47.42 GB	Dual 24GB GPUs (48GB — very tight), 64GB+ unified memory
Q5_K_M	54.45 GB	Dual 24GB no longer fits; 64GB unified memory
Q8_0	77.26 GB	96GB+ unified memory or server-class hardware

File sizes from the bartowski Qwen2.5-72B-Instruct GGUF repository.

This is where consumer hardware hits its ceiling. Dual RTX 3090/4090 (48GB combined) fits Q4_K_M with almost no headroom — many builders drop to Q4_K_S or IQ4_XS for KV cache room. On a single 24GB card, your option is CPU+GPU hybrid inference (see our llama.cpp hybrid inference guide) at a fraction of the speed. For general VRAM planning, see how much VRAM do you need or the VRAM calculator.

GPU Recommendations Per Tier

For Qwen 3.5 9B

Best value: RTX 3060 12GB — handles the 6-bit quant with headroom for KV cache, commonly listed used around $200–230.

Best performance: RTX 4060 Ti 16GB — runs 8-bit comfortably and future-proofs toward the 27B's 3-bit quant.

AMD option: RX 7600 8GB for 4-bit (tight) or a 12GB card for 6-bit. ROCm support is solid on Linux for these cards.

For Qwen 3.5 27B and 35B-A3B

Sweet spot: RX 7900 XT 20GB — runs the 27B at 4-bit (17GB) with room to spare; better VRAM per dollar than an RTX 3090 if buying new.

Best single-card option: RTX 3090 24GB used or RTX 4090 24GB new — both run the 27B at 4-bit comfortably and the 35B-A3B at 4-bit (22GB) with modest context.

AMD: RX 7900 XTX 24GB — same capacity story as the 3090, with mature ROCm 6.x support on Linux.

For Qwen 2.5 72B

Dual GPU: Dual RTX 3090 (48GB combined) just fits Q4_K_M. llama.cpp's tensor split distributes layers automatically — see our multi-GPU inference guide.

Apple Silicon: 64GB unified memory handles Q4_K_M; 96GB+ handles Q8_0.

AMD Strix Halo (mini PCs): 96GB unified-memory configs handle 72B Q4_K_M fully in GPU-addressable memory once GTT is configured (Linux — see the AMD Ryzen AI Max GTT guide).

Tokens Per Second on Common Hardware

Decode speed on consumer GPUs is limited by memory bandwidth, not compute. That gives you a ceiling you can calculate yourself: bandwidth ÷ bytes read per token. For dense models, bytes per token ≈ the weight file size; real-world results typically land at 60–80% of the ceiling.

Model @ 4-bit	Bytes/token	RTX 3090 (936 GB/s) ceiling	RX 7900 XT (800 GB/s) ceiling
Qwen 3.5 9B (6.5 GB)	~6.5 GB	~144 tok/s	~123 tok/s
Qwen 3.5 27B (17 GB)	~17 GB	~55 tok/s	~47 tok/s
Qwen 3.5 35B-A3B	~2–3 GB (3B active)	well over 100 tok/s	well over 100 tok/s
Qwen 2.5 72B (47 GB, dual GPU)	~47 GB	~20 tok/s	—

The MoE row is the story: because only ~3B parameters activate per token, the 35B-A3B reads a few GB per token instead of 22 — which is how Ollama's MLX backend posts 112 tok/s on Apple Silicon for this exact model.

Apple Silicon: MLX vs llama.cpp

Apple Silicon is uniquely suited to the Qwen families because of unified memory — the same pool serves CPU and GPU with no PCIe transfer. Unsloth notes the 35B-A3B fits on a 22GB-class Mac and the 27B on an 18GB-class device at 4-bit.

For Qwen 3.5 specifically, MLX is the fast path — it's the framework behind Ollama's Apple Silicon preview, which roughly doubled decode speed over the old llama.cpp Metal path (58 → 112 tok/s on M5-generation hardware, per Ollama's announcement). Running MLX directly:

pip install mlx-lm
mlx_lm.generate --model Qwen/Qwen3.5-35B-A3B --prompt "Explain transformer attention"

For the runtime trade-offs, see our MLX vs llama.cpp comparison. For Qwen 2.5 72B: 64GB unified memory handles Q4_K_M, and 96GB+ configurations handle Q8_0.

Choosing the Right Tier for Your Hardware

Your hardware	Run this	Why
8GB GPU	Qwen 3.5 9B @ 4-bit	Maximum quality that fits
12GB GPU	Qwen 3.5 9B @ 6-bit	Quality jump, still comfortable
16GB GPU	Qwen 3.5 27B @ 3-bit or 9B @ 8-bit	First taste of the mid tier
20–24GB GPU	Qwen 3.5 35B-A3B @ 4-bit	MoE speed + 35B quality on one card
Dual 24GB / 64GB unified	Qwen 2.5 72B @ Q4_K_M	The classic large-model tier
96GB+ unified	Qwen 3.5 122B-A10B @ 4-bit (70 GB)	Frontier-class, still MoE-fast

One more thing: the Qwen 2.5 Coder models (up to 32B) follow the same VRAM requirements as their base counterparts — for that specific build, see Qwen 2.5 Coder 32B hardware requirements. And for grabbing the right quantized file for your card, see the HuggingFace GGUF download guide.