The $800 GPU question for Qwen 3.5 35B-A3B has an answer most guides won't give you: 16GB of VRAM isn't enough for Q4 quantization on this model. The Q4_K_M GGUF weighs in at ~22.2GB. The RTX 5070 Ti's 16GB falls short by 6GB.
That doesn't put the model out of reach at this budget — but it does mean picking a strategy instead of following the "just buy an RTX 5070 Ti and pull the model" advice circulating everywhere right now.
TL;DR: For Qwen 3.5 35B-A3B at Q4_K_M, you need 24GB VRAM. At ~$800, that means a used RTX 4090 ($850–$900). If you want a new GPU, the RTX 5070 Ti (16GB GDDR7) runs the model at Q2_K quantization — fully on-GPU, noticeably lower quality than Q4. Neither path matches the sticker price: the 5070 Ti's $749 MSRP is real on paper, but street prices run $1,000–$1,300+ as of March 2026.
The VRAM Math Nobody Gets Right
Qwen 3.5 35B-A3B is a Mixture of Experts model. 35B total parameters, but only ~3B active per token during inference. People hear "3B active" and assume it fits in any GPU with 8GB of VRAM. That's not how it works.
The entire model still lives in VRAM. You can't activate experts that aren't loaded. The MoE architecture makes generation faster and cheaper than a dense 35B — it doesn't make the model smaller.
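Here's the back-of-envelope math behind the ~22.2GB figure from the intro. A rough sketch, assuming Q4_K_M averages about 4.85 effective bits per weight (K-quants mix block precisions, so this is an approximation, not a spec):

```python
# Back-of-envelope VRAM estimate for a quantized GGUF.
# Assumes ~4.85 effective bits/weight for Q4_K_M (approximate;
# K-quants mix 4- and 6-bit blocks, so real files vary slightly).

def gguf_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return total_params * bits_per_weight / 8 / 1e9

weights = gguf_size_gb(35e9, 4.85)
print(f"Q4_K_M weights: ~{weights:.1f} GB")  # ~21.2 GB of weights alone
# Metadata and higher-precision layers push the real file to ~22.2 GB,
# and KV cache plus CUDA overhead add another 1-2 GB on top of that.
```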
At Q4_K_M quantization, the GGUF file is approximately 22.2GB. Ollama's model library lists it at 24GB for the default tag. Here's the full picture across quantization levels for Qwen 3.5 35B-A3B (sizes from Bartowski's GGUF files on Hugging Face, verified March 2026):
| Quantization | File size | Quality vs. FP16 | Fits on 16 GB? |
|---|---|---|---|
| Q4_K_M | ~22.2 GB | ~95% | No |
| IQ3_M | ~15–16 GB | ~90% | Borderline |
| Q2_K | ~12–13 GB | ~82% | Yes |

Q2_K fits on 16GB with room for a 4K context buffer. IQ3_M is right at the edge — whether it loads clean depends on background processes and driver overhead. Don't count on it without testing.
Warning
CPU offloading lets Ollama split the model across VRAM and system RAM when VRAM runs out. It sounds like a solution; in practice, generation drops to 2–6 tok/s because every token forces weight transfers across the PCIe bus. For interactive use, it's nearly unusable. If you see someone claiming their 16GB GPU "runs Qwen 35B," ask how many tokens per second.
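If you suspect offloading, you can check the split directly: `ollama ps` prints a PROCESSOR column with the CPU/GPU percentage, and the same data is exposed on Ollama's local API. A minimal sketch, assuming the default port 11434:

```python
# Check how much of the loaded model is actually in VRAM.
# Uses Ollama's /api/ps endpoint on the default port (11434).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as r:
    for m in json.load(r)["models"]:
        total, vram = m["size"], m["size_vram"]
        pct = 100 * vram / total if total else 0.0
        print(f"{m['name']}: {pct:.0f}% in VRAM "
              f"({vram / 1e9:.1f} of {total / 1e9:.1f} GB)")
# Anything under 100% means CPU offloading is active: expect 2-6 tok/s.
```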
RTX 5070 Ti: The Real Specs and the Real Price
The card is genuinely impressive — but let's lock in the accurate specs, because several roundups have gotten them wrong.
| Spec | RTX 5070 Ti |
|---|---|
| VRAM | 16 GB GDDR7 (not GDDR6X) |
| Memory bandwidth | 896 GB/s |
| CUDA cores | 8,960 |
| TGP (typical) | 300W |
| Peak power | ~356W |
| MSRP | $749 |
| Street price (March 2026) | $1,000–$1,300+ |
(Source: NVIDIA RTX 50-series official specifications page, March 2026)
The GDDR7 memory gives it 896 GB/s of bandwidth — a meaningful step up from the GDDR6X on Ada-generation cards, which helps inference throughput significantly. The 300W TGP is manageable on a standard 750W PSU, but under sustained inference the card can spike to ~356W. Don't underpower this build.
On thermals: 72°C under continuous inference is normal, ~35°C at idle. Stock cooler handles it without issue. This is not an exotic or loud build.
The pricing reality is the bigger issue. Blackwell supply has been constrained since launch. Retail prices for the 5070 Ti have consistently run $250–$550 above MSRP at US retailers through Q1 2026. If you're waiting for $749, you may be waiting until May or June at best — and that's optimistic.
Two Real Paths at ~$800
Given the VRAM constraints and current pricing, these are the two options that actually work.
Path 1: RTX 5070 Ti (16 GB) + Q2_K Quantization
At a realistic purchase price of ~$1,050–$1,100, you get a new-in-box card with a 3-year warranty, CUDA 12.x support, and GDDR7 bandwidth. The trade-off: you're running Qwen 3.5 35B-A3B at Q2_K, which is approximately 82% of FP16 quality per quantization benchmarks.
Q2_K isn't garbage. This is still a 35B-class MoE model with significantly higher reasoning capability than most 7B and 13B dense models. For coding assistance on straightforward problems, long-form summarization, and structured data tasks, Q2_K output is often indistinguishable from Q4 unless you're looking for subtle nuance.
Estimated performance at Q2_K (based on 5070 Ti scoring ~65.5 tok/s on Qwen 2.5 14B Q4_K_M, scaled for model size and quantization overhead): roughly 28–40 tok/s at 2K–4K context. This is a CraftRigs estimate — no published benchmark exists at this exact configuration as of March 2026. Treat it as directional.
Path 2: Used RTX 4090 (24 GB) + Q4_K_M
The RTX 4090 brings 24GB of VRAM, which holds the 22.2GB Q4_K_M model — barely. With ~2GB remaining for KV cache, you need to cap your context window at 2K–4K tokens or expect OOM errors.
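The cap comes from KV-cache growth, which is linear in context length. A rough sizing sketch; the layer and head counts below are hypothetical placeholders, so check the model card for the real architecture:

```python
# Rough KV-cache sizing, to see why ~2 GB of headroom caps context.
# The layer/head numbers are HYPOTHETICAL placeholders -- check the
# Qwen3.5-35B-A3B model card for the actual architecture.

def kv_cache_gb(ctx: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per: int = 2) -> float:
    """fp16 K+V cache size in GB for a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
    return ctx * per_token / 1e9

for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} tokens -> ~{kv_cache_gb(ctx):.2f} GB KV cache")
# With ~2 GB free after loading Q4_K_M, and compute buffers eating
# part of that, a 2K-4K context cap is the realistic ceiling.
```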
Used RTX 4090 pricing runs $850–$900 on eBay and StockX (March 2026). That's above the nominal $800 budget but less than a new 5070 Ti at actual retail. The trade-off is condition uncertainty, no warranty, and unknown history (mining, gaming, machine learning workloads).
Estimated performance at Q4_K_M: the 4090's 1,008 GB/s of memory bandwidth edges out the 5070 Ti's 896 GB/s, and it has far more compute on top. For a 35B MoE model at Q4, expect 35–55 tok/s at 1K–2K context, dropping toward 20–30 tok/s at 4K. Estimated — not yet benchmarked publicly for this specific model at time of writing.
Warning
A used RTX 4090 running a 22.2GB model on 24GB VRAM has almost no headroom. Even at Q4_K_M, keep context under 4K. Ask the seller explicitly about prior workload history — cards that ran crypto mining or 24/7 AI inference have meaningfully higher wear than gaming cards.
Which path wins? For most people: Path 1 (new 5070 Ti + Q2_K). The warranty, newer architecture, and native GDDR7 bandwidth make it the lower-risk buy. Path 2 (used 4090 + Q4_K_M) is better if output quality is non-negotiable — legal drafting, detailed code review, nuanced summarization — and you're comfortable evaluating secondhand hardware.
Why Qwen 3.5 35B-A3B Specifically
Before the setup steps, a quick primer on what makes this model different from Llama 3.1 34B or Mistral 24B — because the architecture affects how you use it.
Qwen 3.5 35B-A3B has 256 total experts. Each token activates 8 routed experts plus 1 shared expert, meaning only ~3B of the 35B parameters are actually computing for any given token. The practical result: inference speed closer to a 10B dense model, with reasoning quality that competes with much larger dense models.
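To make that arithmetic concrete, here's a rough sketch. The split between expert weights and everything else (attention, embeddings, the shared expert) is an assumed figure for illustration; the real per-component counts are on the model card:

```python
# Rough active-parameter math for a 256-expert MoE.
# The 2B "shared" figure (attention + embeddings + shared expert)
# is an ASSUMPTION for illustration, not from the model card.

total_params   = 35e9
shared_params  = 2e9                  # assumed: always-active weights
expert_params  = total_params - shared_params
num_experts    = 256
routed_per_tok = 8                    # routed experts per token

active = shared_params + expert_params * routed_per_tok / num_experts
print(f"~{active / 1e9:.1f}B parameters active per token")  # ~3.0B
```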
Here's how the two compare:

| | Qwen 3.5 35B-A3B (MoE) | Llama 3.1 34B (dense) |
|---|---|---|
| Total parameters | 35B | 34B |
| Q4_K_M GGUF size | ~22.2 GB | ~19–21 GB |
| Generation speed | Faster | Slower |
| Short-prompt latency | Slower | Faster |
| Output consistency | More variable | More stable |
| Reasoning quality | Higher | Lower |

(Source: Qwen/Qwen3.5-35B-A3B Hugging Face model card; Hugging Face MoE architecture documentation)
The short-prompt latency note matters: MoE routing adds overhead that's proportionally more significant on 1–2 sentence prompts. For short back-and-forth chatting, Llama 3.1 34B may feel snappier. For anything requiring multi-step reasoning or detailed analysis, Qwen 3.5 35B-A3B pulls ahead.
Step-by-Step Setup with Ollama
Step 1: Install Ollama and Verify GPU Detection
Download Ollama from ollama.com. After install, start the server:
ollama serve
Watch the terminal output for GPU detection. In a second terminal, run:
nvidia-smi
Confirm your GPU appears with full VRAM listed. If Ollama doesn't detect your GPU, reinstall your NVIDIA drivers (Game Ready or Studio, latest version), restart, and verify nvidia-smi shows the card before launching Ollama again.
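If you'd rather script the check, nvidia-smi's query interface returns the same information in machine-readable form:

```python
# Verify the driver sees the card and its full VRAM before launching Ollama.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,memory.total,driver_version",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())
# Expect one line per GPU, e.g. a 5070 Ti reporting ~16,000+ MiB total.
```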
Step 2: Pull Qwen 3.5 35B-A3B
For the Q4_K_M default (24GB GPU):
ollama pull qwen3.5:35b-a3b
For Q2_K on a 16GB GPU:
ollama pull qwen3.5:35b-a3b-q2_k
Note
The correct Ollama tag is qwen3.5:35b-a3b — the 3.5 is part of the model name. The old qwen:35b-a3b tag doesn't resolve correctly. Ollama defaults to Q4_K_M on the standard pull and uses GGUF format natively — no manual GPTQ conversion needed. First download takes 15–45 minutes depending on connection speed.
Step 3: Start Inference
ollama run qwen3.5:35b-a3b
Initial load takes 20–40 seconds as the model pages into VRAM. Once you see the prompt, monitor VRAM in real time:
nvidia-smi -l 1
On a 16GB GPU at Q2_K, VRAM usage should stabilize at 12–14GB. On a 24GB 4090 at Q4_K_M, expect 22–23GB. If you see CPU memory climbing alongside VRAM, Ollama is offloading — your quantization is too large for on-GPU inference.
Step 4: Measure Your Performance
Use a consistent test prompt across sessions: a 200-token coding task or summarization request works well. Count the output tokens, divide by elapsed seconds. Alternatively, llama-bench from llama.cpp gives repeatable throughput measurements with configurable context lengths.
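For an exact number without hand-timing, Ollama's /api/generate response (with streaming disabled) reports eval_count (tokens generated) and eval_duration (nanoseconds). A minimal sketch against the local API:

```python
# Repeatable throughput check against Ollama's local API.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen3.5:35b-a3b",
        "prompt": "Summarize the trade-offs of MoE vs dense LLMs.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    resp = json.load(r)

# eval_duration is in nanoseconds; eval_count is generated tokens.
tok_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{resp['eval_count']} tokens at {tok_s:.1f} tok/s")
```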
Troubleshooting
Generation is very slow — under 5 tok/s
CPU offloading triggered. nvidia-smi -l 1 will show VRAM maxed out while system RAM climbs. Fix: pull a smaller quantization (qwen3.5:35b-a3b-q2_k), or reduce the context length as described in the next entry.
OOM crash on RTX 4090 with Q4_K_M
KV cache is filling the remaining ~2GB of headroom. Cap the context explicitly. Ollama's run command has no --ctx flag; either set OLLAMA_CONTEXT_LENGTH=2048 in the environment before starting the server, or inside the running session use:
/set parameter num_ctx 2048
GPU not recognized after install
Update NVIDIA drivers, restart the Ollama service, confirm nvidia-smi shows your card. The most common cause on Windows is a driver version mismatch after Ollama install.
Want a chat interface? Open WebUI connects to Ollama's local API at port 11434 and runs in a browser tab. Install via Docker: docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main (note the container serves on port 8080, not 80). Initial setup takes only a few minutes, and once it's running, it's close to ChatGPT in feel.
Verdict
Qwen 3.5 35B-A3B on a 16GB GPU is achievable — just not at Q4_K_M. If you buy an RTX 5070 Ti, you're running Q2_K. The model still performs at a high level for reasoning and analysis tasks, but you're accepting a real quality gap versus what the model is capable of at full quantization.
For the strict $800 budget: a used RTX 4090 at $850–$900 is the better hardware choice for this specific model. You get Q4_K_M quality, 24GB of VRAM, and a faster card overall. The catch is condition uncertainty and zero warranty.
If new-card reliability matters more than Q4 quality, the RTX 5070 Ti at ~$1,050–$1,100 is a fine buy — just go in knowing you're running a lighter quantization. For the full hardware upgrade path beyond 35B models, the jump to 48GB+ territory changes the conversation significantly.
For more on how quantization affects output quality with real examples, the glossary entry walks through Q4 vs Q3 vs Q2 differences with side-by-side outputs.
FAQ
Can the RTX 5070 Ti run Qwen 3.5 35B-A3B? Yes — at Q2_K quantization (~12–13GB), it runs fully on-GPU with headroom for 4K context. For Q4_K_M (~22.2GB), the model doesn't fit in 16GB VRAM without CPU offloading, which drops generation to 2–6 tok/s. If Q4 quality matters to you, you need 24GB VRAM.
How much VRAM does Qwen 3.5 35B-A3B actually need? For Q4_K_M (best quality/size trade-off): 24GB minimum. The GGUF is ~22.2GB and Ollama estimates 24GB for the default pull. For 16GB GPUs, Q2_K fits cleanly; IQ3_M is borderline and may require testing. Verify exact GGUF sizes against Bartowski's Hugging Face page before downloading — quantization sizes update as new methods release.
Does the MoE architecture reduce VRAM requirements? No — not for loading. Qwen 3.5 35B-A3B has 35B total parameters across 256 experts. The entire model loads into VRAM regardless of how many experts activate per token. The MoE benefit is inference speed and quality-per-active-parameter, not VRAM efficiency.
Is the RTX 5070 Ti actually available at $749? Not consistently. MSRP is $749 but street prices in the US run $1,000–$1,300+ as of March 2026, with some listings above $1,600. If you're budgeting for a new 5070 Ti, use $1,050–$1,150 as your working number. A used RTX 4090 at $850–$900 offers more VRAM at a lower real-world cost right now.