
# Qwen 3.5 Gated DeltaNet Explained: What Linear Attention Means for Your GPU in 2026

By Charlotte Stewart · 12 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


Here's a thing that a lot of early Qwen 3.5 coverage got wrong: there is no "DeltaNet variant." No separate download, no checkbox in Ollama, no toggle in vLLM. Gated DeltaNet is the architecture — baked in, trained from scratch, present in every checkpoint from the 0.8B up to the 397B MoE. When you run `ollama pull qwen3.5:27b`, you're already running linear attention.

That changes the hardware question. You're not choosing between two Qwen 3.5 models with different memory profiles. You're asking: how does this architecture change what my GPU can actually do with long contexts?

**TL;DR: Qwen 3.5's Gated DeltaNet hybrid architecture means roughly 75% of its attention layers maintain constant [VRAM](/glossary/vram) cost regardless of context length. On an RTX 4090 (24 GB), this unlocks Qwen 3.5 27B at 64K+ context windows — a single-GPU workload that would require multi-GPU under standard full-attention architectures. If long-document RAG, full-codebase synthesis, or deep research are your primary use cases, this is the reason to consider a Qwen 3.5 upgrade now. For pure chat at under 8K tokens, the difference is negligible.**

## What Is Linear Attention (And Why Standard Attention Fails at Scale)

Standard transformer attention scales quadratically with context length. Add a token at position 8,000, and the model computes how that token relates to all 8,000 tokens that came before it — storing every one of those relationships in an attention weight matrix. Double the context, and you quadruple the work. At 32K tokens, you're storing roughly a billion weight values per attention layer before you've even loaded the model weights.

That's the wall power users hit first. Not reasoning quality, not hardware speed — the VRAM ceiling that makes long-context inference impractical on consumer hardware.

Linear attention — the approach DeltaNet implements — replaces that growing matrix with a fixed-size recurrent state. New tokens update the state via a delta (the difference between what the new token contributes and what's already there), but the state itself doesn't grow. The computation drops from O(n²) to O(n): double the context, double the cost instead of quadruple.
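
The scaling gap is easy to see with a back-of-envelope count (a sketch; the state dimensions are illustrative, not Qwen's actual head sizes):

```python
def attn_weight_count(context_len):
    """Standard attention: one weight per (query, key) pair -> O(n^2)."""
    return context_len * context_len

def linear_state_count(d_k=128, d_v=128):
    """Linear attention: a fixed d_k x d_v state matrix -> O(1) in context."""
    return d_k * d_v

# Doubling the context quadruples standard attention's stored weights...
assert attn_weight_count(16_000) == 4 * attn_weight_count(8_000)
# ...while the recurrent state is the same size at any context length.
assert attn_weight_count(32_000) == 1_024_000_000  # ~1 billion pairs at 32K
assert linear_state_count() == 16_384              # constant, forever
```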

### The Math (Without the Math)

Standard attention at token 32,000: compute 32,000 × 32,000 = 1.024 billion weight pairs, keep all of them accessible.

Linear attention at token 32,000: update a fixed state vector, discard the per-token calculations.

The memory difference is roughly 1 billion stored values vs. a single matrix of fixed size — and that matrix doesn't change dimensions whether you're at token 1,000 or token 100,000.

Where standard attention has a natural retrieval advantage is precise recall of early context — the full weight matrix makes it easy to look back to any token exactly. Linear attention approximates this with its running summary. That approximation is why Qwen 3.5 uses a hybrid: the standard attention layers (every 4th layer, roughly 25% of the architecture) handle precise retrieval, while the DeltaNet layers (the other 75%) carry the scaling benefit.

## How Qwen 3.5 Implements Gated DeltaNet

[DeltaNet](https://arxiv.org/abs/2410.07063) uses delta rules: when a new token arrives, the model computes the difference between what that token wants to contribute and the current state content, then applies a weighted update. It's a learned compression — the model decides during training what's worth keeping and what to fold into existing state.

The "Gated" in Gated DeltaNet adds a per-token learned gate on top of this, controlling how much new information modifies the existing state versus how much old state is preserved. This stabilizes training on long sequences and improves quality retention on short contexts.
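
A toy sketch of that update rule (pure Python, scalar gates; the production kernels use per-head learned gates, chunked matmuls, and normalization this omits):

```python
def gated_delta_step(S, k, v, beta, alpha):
    """One Gated DeltaNet state update (simplified sketch).
    S is a d_k x d_v state (list of lists), k a key, v a value;
    beta is the write strength, alpha the retention gate, both in [0, 1]."""
    d_k, d_v = len(S), len(S[0])
    # What the current state already predicts for this key: pred = S^T k
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # Delta rule: decay the old state, then write only the correction (v - pred)
    return [[alpha * S[i][j] + beta * k[i] * (v[j] - pred[j])
             for j in range(d_v)]
            for i in range(d_k)]

# The state stays d_k x d_v no matter how many tokens stream through it.
S = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(10_000):
    S = gated_delta_step(S, k=[1.0, 0.5], v=[0.3, -0.2], beta=0.5, alpha=0.9)
assert len(S) == 2 and len(S[0]) == 2
```

The key property: the loop runs 10,000 steps and the state never grows, which is exactly why context length stops being a memory variable on these layers.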

> [!NOTE]
> DeltaNet as an architecture predates Qwen 3.5 — the paper appeared at NeurIPS 2024. What makes Qwen 3.5 different is that Alibaba trained a production-scale model with this architecture from scratch across the full model family, rather than running academic experiments on models under 3B parameters.

You can't retrofit this. The architecture affects how attention weights are shaped during training. A Qwen 2.5 checkpoint quantized more aggressively doesn't approximate DeltaNet behavior — the recurrent state structure is learned during pretraining. The Qwen 3.5 long-context advantages are architectural, not post-hoc.

## The Actual Qwen 3.5 Lineup

Before hardware math, the model lineup matters. As of March 2026, confirmed via [Hugging Face](https://huggingface.co/collections/Qwen/qwen35):

The dense lineup tops out at 27B. Estimated Q4_K_M weight sizes run from ~0.5 GB for the 0.8B, through ~1.2 GB and ~2.5 GB for the intermediate sizes, to ~5.5 GB for the 9B and ~14–15 GB for the 27B. Above that, the family switches to MoE: 35B-A3B, 122B, and 397B-A17B.
There is no 70B dense Qwen 3.5 model. The lineup jumps from 27B dense to 35B-A3B MoE (35 billion total parameters, 3 billion active per forward pass). If you've seen benchmarks comparing "Qwen 3.5 70B DeltaNet" against a standard version — those numbers are fabricated. That model doesn't exist.

> [!WARNING]
> MoE "active parameter" VRAM figures are misleading. The 397B-A17B computes with ~17B active parameters per token, but the full checkpoint requires 70+ GB to load. Multi-GPU is required for all large MoE models. Plan around total checkpoint size, not active parameter count.
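
The warning is easy to quantify (a sketch; Q4 weights assumed, and the figures are arithmetic, not published specs):

```python
def checkpoint_gb(total_params_b, bits=4):
    """Resident memory for an MoE checkpoint is driven by TOTAL parameters:
    every expert must be loaded, even though only a few route per token."""
    return total_params_b * 1e9 * bits / 8 / 1024**3

full = checkpoint_gb(397)    # all 397B parameters resident
active = checkpoint_gb(17)   # what the "A17B" label might suggest
assert full > 10 * active    # planning around active params is off by >10x
```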

## VRAM Math: What Long Context Actually Costs

The hybrid architecture (75% DeltaNet, 25% full attention) means VRAM grows at roughly one quarter the rate of a pure transformer. Only the standard attention layers build a [KV cache](/glossary/kv-cache) that scales with context; the DeltaNet layers maintain constant state.

Qwen 3.5 27B uses GQA (Grouped Query Attention) on its full-attention layers, which further reduces KV cache growth. The combination makes long-context inference on a single consumer GPU feasible in a way that Qwen 2.5 72B was not.
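
A rough estimator of that hybrid KV growth (a sketch; the layer count, KV head count, and head dimension are illustrative assumptions, not published Qwen 3.5 dimensions):

```python
def est_vram_gb(context_len, weights_gb=14.9, n_layers=48, full_attn_every=4,
                kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Rough VRAM estimate for a hybrid-attention model at Q4-ish weights.
    Only every `full_attn_every`-th layer stores K and V per token; the
    DeltaNet layers hold fixed-size state, folded into the weight figure."""
    full_layers = n_layers // full_attn_every
    kv_bytes = full_layers * 2 * context_len * kv_heads * head_dim * bytes_per_elem
    return weights_gb + kv_bytes / 1024**3

# KV growth is linear in context, at 1/4 the rate of a pure transformer:
pure_delta = est_vram_gb(65_536, full_attn_every=1) - est_vram_gb(0, full_attn_every=1)
hybrid_delta = est_vram_gb(65_536) - est_vram_gb(0)
assert abs(pure_delta - 4 * hybrid_delta) < 1e-6
```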

**RTX 4090 (24 GB) — Qwen 3.5 27B at Q4_K_M (estimated, March 2026):**

| Context | Total VRAM (est.) |
|---------|-------------------|
| 8K      | ~15.7 GB |
| 16K     | ~16.5 GB |
| 32K     | ~18.0 GB |
| 64K     | ~21.0 GB |
| 128K    | ~27.0 GB |

The attention-memory component of those totals covers KV cache from the ~25% full-attention layers plus fixed DeltaNet state. It grows at roughly one quarter the rate you'd expect from a pure transformer at the same parameter count. At 64K context, 21 GB leaves headroom on a 24 GB RTX 4090. At 128K, you're over — drop to Q3 via [quantization](/glossary/quantization) or restrict to batch size 1.

> [!TIP]
> For sustained 64K-context workloads on an RTX 4090, Q4_K_M is the sweet spot — fits comfortably, minimal quality hit, no need to drop to Q3. If your use case reaches 128K regularly, a dual RTX 3090 setup (48 GB combined) gives you Q8 with full context headroom. (BF16 is off the table even at 48 GB: the 27B's unquantized weights alone run ~54 GB.)

**RTX 4070 (12 GB) — Qwen 3.5 9B:**

At 12 GB, Qwen 3.5 27B doesn't fit. The Q4_K_M weight floor alone (14–15 GB) exceeds available VRAM. The 9B is the ceiling for 12 GB cards:

| Precision | Weights (est.) | Fits in 12 GB? |
|-----------|----------------|----------------|
| Q4_K_M    | ~5.5 GB        | Yes |
| Q8        | ~9.5 GB        | Yes |
| BF16      | ~18 GB         | No  |

At Q8, Qwen 3.5 9B handles 32K context on an RTX 4070 without issue. For 12 GB VRAM users whose primary workload is long-context RAG, the 9B at Q8 is a serious option — GQA and DeltaNet layers together keep memory growth moderate even at extended contexts.

## Real-World Benchmarks: What Qwen 3.5 Actually Scores

Because Gated DeltaNet is baked into the architecture rather than a variant, there's no "standard vs. DeltaNet" comparison to run — this is the model. Published benchmarks from Alibaba/Hugging Face (as of February 2026):

**Qwen 3.5 27B:**
- MMLU-Pro: 86.1
- GPQA Diamond: 85.5
- SWE-bench Verified: 72.4
- LiveCodeBench v6: 80.7
- IFEval: 95.0

**Qwen 3.5 9B:**
- MMLU-Pro: 82.5
- GPQA Diamond: 81.7
- LiveCodeBench: 65.6

The 27B scores are competitive with models that were 70B-class a generation ago. That's not purely a DeltaNet effect — the full training run improved everything — but the architecture allows deploying that quality on hardware that previously couldn't host it at long contexts.

### Tokens Per Second at Scale

DeltaNet doesn't make inference faster per token. What changes is behavior as context grows:

- Standard full attention: tok/s decreases as context length grows (more attention computation per new token)
- Gated DeltaNet layers: tok/s stays roughly constant (fixed-size state update per layer, no growth)

At 8K context, Qwen 3.5 and a comparable Qwen 2.5 model run at similar speeds. At 32K context, the Qwen 2.5 equivalent either hits a memory wall and swaps to system RAM (speed cliff to single digits) or fails outright. Qwen 3.5 27B at Q4_K_M on an RTX 4090 continues at consistent throughput.

> [!WARNING]
> vLLM 0.18.0 introduced a regression that breaks Qwen 3.5 inference for some configurations. As of March 2026, vLLM 0.17.1 is the stable version for Qwen 3.5 deployments. The earlier DeltaNet dtype mismatch bug (issue #35238) was fixed in 0.17.1. Batch throughput on the 27B is notably lower than expected — expect ~20 tok/s for batch inference vs. higher numbers on standard architectures at similar parameter counts (GitHub issue #36010, open as of this writing).

## Who Actually Benefits From This Architecture

**Long-document RAG with chunks over 8K tokens**: This is the use case the architecture was built for. Whole PDF sections, legal documents, research papers fit in context without chunking fragmentation. RAG pipelines that previously lost cross-document relationships because retrieval had to stay under 4K-token chunks can now pull 20K-token segments. See [our long-context RAG build guide](/guides/long-context-rag-llm-build/) for hardware configurations specific to retrieval workloads.

**Full-codebase analysis**: Feeding an entire repository section into context for refactoring, bug identification, or synthesis. Standard architectures either fail at this context length or require layer-by-layer offloading with severe speed penalties. Qwen 3.5 27B at Q4_K_M handles it on a single RTX 4090.

**Research synthesis at 30K+ tokens**: Pulling 10–15 papers into a single context window for comparative analysis, without the context compression artifacts that RAG introduces. The DeltaNet state doesn't degrade as you add more source material.

**Who doesn't benefit much**: Chat and code completion at 8K context or under. The VRAM advantage is small enough to be within measurement noise at short contexts. If your typical session is a 2K-token conversation, Qwen 2.5 72B — which has more raw parameters — might still be a better fit if you have the hardware for it. See [our RTX 4090 vs H100 cost-per-token comparison](/comparisons/rtx-4090-vs-h100-cost-per-token/) if your workload is scaling past what a single consumer GPU handles comfortably.

### Build Recommendations by Use Case

**Long-context RAG or document processing**: RTX 4090 + Qwen 3.5 27B at Q4_K_M. This is the configuration DeltaNet's advantages are most visible on — 64K context on a single GPU, no multi-GPU complexity.

**General coding at 8K context**: RTX 4070 + Qwen 3.5 9B at Q8. The architecture advantage is minimal at short contexts, but you still get the improved reasoning quality of the 3.5 generation. Check our [hardware upgrade ladder](/articles/100-local-llm-hardware-upgrade-ladder/) before buying new hardware just for this.

**Multi-tenant inference service**: Dual RTX 4090 + Qwen 3.5 27B at Q8. Near-lossless quantization, and the combined 48 GB handles long contexts and concurrent requests at lower power cost than equivalent-tier data center hardware. (BF16 doesn't fit here: unquantized 27B weights alone run ~54 GB.)

## How to Run It: vLLM Setup

Qwen 3.5 is available at `Qwen/Qwen3.5-{size}` on Hugging Face. vLLM supports Gated DeltaNet through Flash Linear Attention Triton kernels.

```bash
# Pin to 0.17.1 — 0.18.0 has a regression as of March 2026
pip install vllm==0.17.1

# Serve Qwen 3.5 27B. Unquantized BF16 weights are ~54 GB, so a single
# 24 GB RTX 4090 needs a quantized checkpoint (swap --model for a
# GPTQ/AWQ build) or multi-GPU via --tensor-parallel-size.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B \
  --gpu-memory-utilization 0.95 \
  --max-model-len 65536 \
  --dtype bfloat16
```

Set `--max-model-len` to your actual context ceiling. vLLM reserves memory for the maximum on startup — setting 128K when your VRAM can only hold 64K causes a startup failure before any request is served.
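
One way to pick a safe `--max-model-len` is to budget backwards from VRAM (a sketch; the ~96 KB/token figure is an assumption to confirm by measurement on your own setup):

```python
def max_context_for_vram(free_gb, weights_gb, kv_bytes_per_token):
    """Budget --max-model-len backwards: whatever VRAM is left after the
    weights goes to KV cache. kv_bytes_per_token is measured, not fixed."""
    budget_bytes = (free_gb - weights_gb) * 1024**3
    return max(int(budget_bytes // kv_bytes_per_token), 0)

# 24 GB card at ~95% utilization, ~15 GB of Q4 weights, ~96 KB/token:
ceiling = max_context_for_vram(22.8, 15.0, 96 * 1024)
assert ceiling > 64_000                                  # 64K fits with headroom
assert max_context_for_vram(10.0, 15.0, 96 * 1024) == 0  # weights alone too big
```

If the computed ceiling is below the `--max-model-len` you want, vLLM fails at startup rather than at request time, so it pays to run the arithmetic first.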

### What to Measure After Setup

After deployment, run three measurements before trusting production load:

1. **Peak VRAM at your target context length** — use `nvidia-smi dmon -s um` or `nvitop` while sending a max-length prompt. Should track within ~1 GB of the estimates above.
2. **Tokens per second at 8K, 32K, and 64K** — DeltaNet layers should show steady throughput across context lengths rather than declining sharply. A sharp decline indicates the standard attention layers are dominating or VRAM is spilling.
3. **Consistency over 10 consecutive long-context requests** — DeltaNet state management means no "first request" spike. If you see VRAM climbing over sequential requests, something in the state handling is off.
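
Measurement 2 reduces to a simple pass/fail check (a sketch; the tok/s samples are hypothetical and the 15% tolerance is an arbitrary threshold, not a vLLM metric):

```python
def throughput_holds(samples, tolerance=0.15):
    """samples: [(context_len, tok_per_sec), ...] sorted by context length.
    Passes if tok/s at the longest context stays within `tolerance` of
    tok/s at the shortest -- the flat profile DeltaNet layers should give."""
    first, last = samples[0][1], samples[-1][1]
    return (first - last) / first <= tolerance

# A flat profile passes; a sharp decline (attention-bound or VRAM spill) fails.
assert throughput_holds([(8_192, 42.0), (32_768, 41.0), (65_536, 39.5)])
assert not throughput_holds([(8_192, 42.0), (32_768, 20.0), (65_536, 8.0)])
```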

## Should You Upgrade to Qwen 3.5 Now?

**You're on an RTX 4090 and do any long-context work**: Yes. Qwen 3.5 27B is a free download, meaningfully better at 16K+ context than anything you were running before on the same hardware. No new GPU required.

**You're on an RTX 4070 (12 GB)**: The 27B isn't accessible to you regardless of architecture. Qwen 3.5 9B at Q8 is a real upgrade for quality and handles long contexts better than Qwen 2.5 7B did. Hold on hardware until you have a clear case for 27B.

**You're evaluating a new build specifically for long-context inference**: The RTX 4090 is a better investment now than it was 18 months ago. 24 GB at this architecture gets you 27B with 64K context — that was a multi-GPU requirement under prior architectures. The $/context-window math has improved.

**Your primary workload is short context (8K or under)**: No urgency to switch. Qwen 2.5 72B — more parameters, heavier hardware requirements — might still outperform 27B on pure reasoning tasks at short contexts. Test your specific workload before upgrading.

Alibaba has confirmed DeltaNet in future model releases, including the Qwen 4.0 roadmap direction. This is an architectural bet on a specific approach, not a one-generation experiment.

## Common Misconceptions

"I should download the DeltaNet variant separately": It doesn't exist as a separate download. Every Qwen 3.5 checkpoint is Gated DeltaNet. You have it the moment you pull any Qwen 3.5 model.

"Linear attention is slower than standard attention": At short contexts, speed is essentially identical. At long contexts, the standard attention layers slow down and eventually fail; the DeltaNet layers hold steady. The advantage isn't faster tokens — it's consistent tokens where standard attention can't function.

"I can get the same long-context capability by quantizing Qwen 2.5 more aggressively": Quantization reduces weight precision; it doesn't change how the KV cache scales with context length. A Q2-quantized Qwen 2.5 70B at 32K context still builds a full-size KV cache — with severely degraded weights. The memory wall is architectural. Quantization and DeltaNet solve different problems.
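
The independence is visible in the arithmetic (a sketch; layer count and head dimensions are illustrative for a 70B-class dense transformer, not Qwen 2.5's published config):

```python
def weights_gb(n_params_b, bits):
    """Weight footprint: parameters x bits. This is all quantization touches."""
    return n_params_b * 1e9 * bits / 8 / 1024**3

def kv_cache_gb(context_len, n_layers=80, kv_heads=8, head_dim=128,
                bytes_per_elem=2):
    """Full-attention KV cache: every layer stores K and V for every token.
    Note there is no `bits` argument -- weight quantization never enters."""
    return (2 * n_layers * context_len * kv_heads * head_dim
            * bytes_per_elem) / 1024**3

# Q4 -> Q2 halves the weight footprint...
assert weights_gb(70, 2) == 0.5 * weights_gb(70, 4)
# ...but the 32K-context KV cache is ~10 GB either way.
assert round(kv_cache_gb(32_768)) == 10
```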

"The advantage only matters for very long contexts": True that the benefit is small below ~12K tokens. But there's no downside to using Qwen 3.5 at short contexts either — quality is at minimum equal to prior generation, often better. The architecture doesn't penalize you for not using the full context window.

## FAQ

### Does Qwen 3.5 have a separate DeltaNet variant to download?

No. Gated DeltaNet is the architecture of every Qwen 3.5 model — 0.8B through 397B. There is no separate "standard" checkpoint to compare it against. The confusion comes from early coverage treating DeltaNet as an optional feature when it's actually the base architecture. When you download Qwen/Qwen3.5-27B from Hugging Face, you're getting the hybrid architecture by default.

### What is the largest dense Qwen 3.5 model available?

As of March 2026, Qwen 3.5 27B is the largest dense (non-MoE) model in the lineup. The 35B, 122B, and 397B releases are Mixture-of-Experts architectures. Any source referencing a "Qwen 3.5 70B" is describing a model that does not exist — the lineup jumps from 27B dense directly to 35B-A3B MoE.

### Does linear attention hurt quality vs. standard transformers?

Minimally, and the hybrid design compensates for most of it. The ~25% of layers using standard full attention handle precise context retrieval; the ~75% using Gated DeltaNet carry the scaling benefit without the quadratic cost. Published benchmarks from Alibaba (February 2026) show Qwen 3.5 27B scoring 86.1 on MMLU-Pro and 85.5 on GPQA Diamond — competitive with models from the prior generation that required twice the VRAM.

### Will vLLM run Qwen 3.5 without issues?

Mostly, but version matters. vLLM 0.17.1 is the stable version for Qwen 3.5 as of March 2026. An earlier dtype mismatch bug in DeltaNet layers (issue #35238) was fixed in 0.17.1. vLLM 0.18.0 introduced a new regression that broke Qwen 3.5 for some users — revert to 0.17.1 if you're seeing inference failures after an upgrade. Batch throughput on the 27B is notably lower than non-DeltaNet models at similar parameter counts (issue #36010, open as of this writing).

### Which Qwen 3.5 model fits on an RTX 4070 12 GB?

Qwen 3.5 27B at Q4_K_M requires roughly 14–15 GB for weights alone — above the RTX 4070's 12 GB VRAM limit. The 9B at Q8 fits at around 9.5 GB and handles long contexts well for its size. For 27B on 12 GB VRAM, you'd need to offload layers to system RAM via llama.cpp, which reduces inference speed to a crawl. The RTX 4090 (24 GB) is the practical single-GPU floor for 27B.

