CraftRigs
Architecture Guide

Gemma 4 GPU Sweet Spot: Which Card Handles Every Size

By Charlotte Stewart 8 min read
Gemma 4 GPU Sweet Spot: Which Card Handles Every Size — guide diagram

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

The speculation ended on April 2, 2026: Google shipped Gemma 4 in four sizes — E2B, E4B, a 26B Mixture-of-Experts with 3.8B active parameters, and a 31B dense flagship — and added a multimodal 12B on June 3 (see the official Gemma release notes). The question that matters for builders hasn't changed: which GPU tier covers each size?

TL;DR — Gemma 4 VRAM by size (Q4 quantization):

Gemma 4 sizeApprox. VRAM at Q4GPU tier
E2B~2 GBAnything — phones and edge devices
E4B~3–4 GBAny 6–8 GB card
12B~7–8 GB12 GB (RTX 3060, RTX 4070)
26B MoE (3.8B active)~15–16 GB16 GB cards
31B dense~18–19 GB24 GB (RTX 3090 / 4090)

These are spec-sheet math, not lab measurements: Q4_K_M works out to roughly 0.55–0.6 GB per billion parameters, plus several GB of KV cache at long context. The VRAM calculator runs the numbers for any size, quant, and context length.

The hard VRAM numbers from Gemma 3 are still the best baseline for understanding why this lineup lands where it does — so let's start there and work forward.

On this page:


What Gemma 3 Actually Needs: The Full VRAM Breakdown

Gemma 3 launched in March 2025 across four sizes. The VRAM requirements by precision:

SizeBF16 (native)Q8 quantizedQ4_K_M
1B~2 GB~1.5 GB~0.8 GB
4B~8 GB~5 GB~2.5 GB
12B~24 GB~14 GB~7.5 GB
27B~54 GB~28 GB~16 GB

The 27B at full BF16 precision requires 54GB — that's an H100 or a multi-GPU consumer setup. But quantized? The 27B Q4 fits in 16GB. The Q8 version lands at 28GB, which means a pair of 16GB cards handles it comfortably, and a single 24GB card handles Q4 with VRAM headroom for context.

That 27B Q4 on a 24GB card is what made Gemma 3 genuinely exciting. It's why the RTX 3090 suddenly became interesting again two years after everyone called it dead.

Note

Context window costs VRAM too. At 128K tokens, the KV cache for Gemma 3 27B can chew through several additional gigabytes at full context. If you're running long documents or agentic workflows, plan for a ~4–6GB overhead on top of model weights.


Gemma 3n's Efficiency Tricks — And Why They Matter for Gemma 4

To understand why the Gemma 4 lineup looks the way it does, you have to understand what Gemma 3n did to memory requirements. Because it completely changed the math.

Gemma 3n launched in June 2025 with two variants: E2B (about 5 billion total parameters) and E4B (about 8 billion). The "E" stands for "effective" — and that distinction is the whole story. Gemma 3n E2B runs in roughly 2GB of VRAM. The E4B runs in around 4GB. A 5-billion-parameter model in 2GB. That's not a typo.

Two architectural innovations made this possible.

PLE caching (Per-Layer Embeddings): The embedding parameters — a surprisingly large chunk of total model size — get offloaded to fast local storage or system RAM rather than sitting in VRAM. Only the core transformer weights live in accelerated memory. The 5B E2B model only needs 2GB of VRAM because most of its 5 billion parameters aren't actually in VRAM during inference.

MatFormer (Matryoshka Transformer): Think of Russian nesting dolls. One large trained model contains multiple smaller, functional models nested inside it. At inference time, you can selectively activate only the parameters you need. You get smaller-model speed with larger-model training quality. The E4B became the first sub-10B model to score over 1300 on LMArena — a benchmark score that would have required a 30B+ model just eighteen months earlier.

Google didn't design Gemma 3n for desktop rigs. It was built for phones and edge devices. But those same efficiency innovations don't just disappear when the next generation arrives.

Tip

PLE caching works best when your local storage is fast. If you're building a rig specifically for Gemma 3n (or Gemma 4 variants that inherit this architecture), an NVMe SSD with high sequential read speeds — think 7,000+ MB/s — meaningfully reduces the latency penalty from loading embedding layers off-disk.


What Gemma 4 Actually Shipped

The lineup Google delivered on April 2 (official model overview) tracks the Gemma 3 ladder closely, with one big structural change:

  • E2B / E4B — the Gemma 3n efficiency heritage made it into the main line. Per-layer embedding offloading keeps these in ~2 GB and ~4 GB of VRAM respectively.
  • 12B (added June 3) — unified multimodal, the consumer-GPU mainstream tier at ~7–8 GB Q4.
  • 26B MoE, 3.8B active — the structural surprise. Instead of a bigger dense mid-flagship, Google shipped a Mixture-of-Experts: the weights need ~15–16 GB at Q4, but each token only activates 3.8B parameters, so generation speed lands closer to a 4B model than a 26B one.
  • 31B dense — the workstation flagship, ~18–19 GB at Q4. No 70B+ tier this generation.

The question this page asked before launch — would Gemma 4 bake Gemma 3n's efficiency innovations into the standard lineup — was answered yes, twice over: PLE in the E-series, and the MoE flagship as the bigger efficiency move.

And the prediction that held: the 24GB VRAM tier covers the flagship single-GPU model at Q4 quantization, just like it covered Gemma 3 27B Q4. That math hasn't changed in three generations.


The 24GB Tier: Why RTX 4090 and 3090 Still Win

The 24GB VRAM tier is the sweet spot for local LLMs right now, and it will be the sweet spot for Gemma 4. This isn't a controversial take — it's just what the numbers say.

At 24GB you can run:

  • Gemma 3 27B at Q4 with headroom
  • Gemma 4 31B dense at Q4 with headroom — and the 26B MoE with room to spare
  • 70B models at Q4 if you're willing to offload some layers (tight, but possible with good quantization)

The two GPUs that hit this tier without asking $4,000 from you:

RTX 3090 — $600–800 used, widely covered in 2026 as the best-value local-AI GPU. 24GB GDDR6X, NVLink-capable, proven stability. Draws 350W max, around 250W typical AI workload. Its age shows in raw CUDA throughput — the 4090 generates tokens roughly 40–60% faster — but if your workload is batched or asynchronous, the speed difference matters a lot less than the VRAM difference versus cheaper cards. For a head-to-head against AMD's RDNA 4 alternative, see the RTX 3090 vs RX 9070 XT breakdown.

RTX 4090 — ~$2,200 used, ~$2,700 new as of mid-2026. Same 24GB, meaningfully faster at ~82 TFLOPS BF16 versus the 3090's ~35 TFLOPS. If you're doing interactive inference where response latency matters, the speed gap is real and you'll feel it. The 4090 also runs cooler per watt at AI workloads. Whether that $1,400 premium over a used 3090 is worth it depends entirely on your use case and how much you care about tokens per second.

Caution

Don't buy the RTX 4060 Ti or 4070 if you want the Gemma 4 flagship. The 4060 Ti tops out at 16GB in its best configuration, and the standard 4070 has 12GB. The 31B dense needs ~18–19GB at Q4 — running it on those cards requires aggressive quantization that visibly hurts quality on reasoning tasks. The one new wrinkle: the 26B MoE does fit on 16GB at Q4, so a 4060 Ti 16GB is now a legitimate budget path into big-Gemma territory — just not for the dense flagship.


Dual-GPU Options for 70B-Class Models

Gemma 4 topped out at 31B dense — no 70B tier this generation, so a single 24GB card covers the whole family. Dual-GPU still matters if you want 70B-class models from other families (Llama 3.3, Qwen) alongside Gemma 4.

Two 3090s connected via NVLink gives you 48GB of pooled VRAM. That's enough for a 70B model at Q4 (~40–45GB required) with a few gigabytes to spare. The pair runs about $1,400–1,600 used, needs a 1000W+ PSU, and requires a motherboard with two full x16 slots close enough together for the NVLink bridge. Not every ATX board qualifies — check before buying.

Two 4090s via tensor parallelism (llama.cpp, vLLM) — not NVLink — gets you the same 48GB with far better throughput. The cost is around $4,400–4,600 for the pair used. That's workstation territory.

One important caveat on multi-GPU: if your target model fits on a single card, adding a second makes inference slower, not faster. The PCIe communication overhead costs you 3–10% tokens per second. Multi-GPU is only worth it when the model doesn't fit on one card and you need the VRAM, full stop.


The Launch Didn't Move GPU Prices — They Never Do

This page originally weighed "buy now vs. wait for launch." The launch happened, and the prediction held: GPU prices didn't move. When Gemma 3 dropped in March 2025, the used 3090 market didn't budge. When Llama 4 arrived, same story. Gemma 4's April release followed the pattern — model releases don't deprecate hardware overnight, and the 24GB tier covers every Gemma generation that has ever existed.

The hypothetical worth waiting for — a flagship with a dramatically different memory profile, like a 32B that runs in 8GB — half-arrived in the form of the 26B MoE, and it made the 16GB tier more useful rather than making 24GB cards obsolete. If anything, the launch widened the set of GPUs that make sense.

If you need a rig, buy the 3090 at $600–800 and be running in a week. There is no next launch worth waiting for on the calendar.


The Bottom Line

The 24GB VRAM tier is where Gemma 4 lives for most people who aren't running 70B models. The RTX 3090 at ~$700 used is the value play. The RTX 4090 at ~$2,200 used is the performance play. Both run every Gemma 3 model today, both run every Gemma 4 tier at Q4 quantization, and the April launch made neither one obsolete.

If you want 70B-class models from other families alongside Gemma 4, a dual 3090 setup with NVLink gives you 48GB for around $1,400–1,600. That's the more interesting build, honestly. Covers both current and next-gen comfortably.

Google has been on a roughly twelve-month cadence with Gemma generations. The hardware story hasn't changed between them. Bet on the 24GB tier and buy when you're ready.


See Also

gemma-4 gemma-3 google local-llm vram rtx-3090 rtx-4090 gpu-buying-guide

Technical Intelligence, Weekly.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.