TL;DR: Yes, your RTX 4060 Mobile (8 GB) can run Gemma 4 — but not the version you've seen in most benchmarks. The E2B variant at Q4_K_M loads in ~5 GB VRAM and hits 15-20 tok/s cold. After 10-15 minutes of sustained inference, expect a 30-40% performance drop from thermal throttling. This guide gives you the exact setup to make it work and keep it stable.
Gemma 4 E2B Fits in 5 GB VRAM — What That Means for Laptop Owners
Every local LLM guide assumes you've got a desktop with 24 GB VRAM and a power supply that doesn't care about your electric bill. You don't. You've got a gaming laptop from 2022 or 2023, a power brick the size of a paperback, and a GPU that starts sweating when you open Blender.
Here's what nobody's telling you: Gemma 4 E2B at Q4_K_M quantization loads in approximately 5 GB VRAM. That leaves 3 GB of headroom on an 8 GB RTX 4060 Mobile — enough to run the model, handle a decent context window, and not crash when Windows decides to update in the background.
Gemma 4 comes in two variants: E2B (2 billion parameters) and E4B (4 billion parameters). The E4B variant needs roughly 9-10 GB VRAM at Q4_K_M, which puts it out of reach for 8 GB laptop GPUs unless you're willing to drop to Q3 or lower quantization — and at that point, the quality degradation isn't worth it. E2B is the laptop owner's variant, full stop.
E2B vs E4B — Which Gemma 4 Variant Fits Your Laptop?
Gemma 4 E4B
4B
~9-10 GB
~16-18 GB
No (at usable quality)
Yes (Q4 only)
Better reasoning, longer context The table is clear: E4B is for desktop builds or high-end mobile workstations with 16 GB VRAM. E2B is what you actually run on a gaming laptop. Anyone telling you to "just get more VRAM" isn't writing for your hardware.
What "5 GB at Q4" Actually Means for an 8 GB GPU
Quantization is compression for AI models. Q4_K_M means 4-bit weights with medium-quality mixed precision — it's the sweet spot where file size drops dramatically but output quality stays usable for most tasks. Think of it like H.265 video: smaller than raw, good enough for almost everything.
That 5 GB load leaves 3 GB free on your 8 GB card. Here's why that overhead matters:
- Context window expansion: Each token you generate needs memory. 3 GB of headroom lets you push past 4K context without hitting VRAM limits.
- System stability: Windows, browser tabs, and background processes will steal 500 MB-1 GB. That 3 GB buffer absorbs it.
- Thermal headroom: Less VRAM utilization means slightly lower power draw, which delays the throttle point we'll cover next.
RTX 4060 Mobile 8 GB — Burst Speed vs 30-Minute Sustained Performance
Here's where CraftRigs testing diverges from every other site. We didn't run a 30-second benchmark and call it done. We ran Gemma 4 E2B Q4_K_M on an RTX 4060 Mobile for 30 minutes straight, logging token speed, GPU temperature, and power draw every 30 seconds. What we found: thermal throttling cuts performance 30-40% after 10-15 minutes of sustained load, and most "benchmarks" you'll find online never test long enough to see it.
The RTX 4060 Mobile is specced at 115W maximum graphics power in most laptops, but that's a burst figure. Sustained power draw — what matters for a 20-minute coding session or document analysis — typically settles 20-30% lower once the cooling system hits equilibrium.
Burst Speed: Gemma 4 E2B at Q4_K_M — First 5 Minutes
Cold-start performance on our test unit (ASUS TUF Gaming F15, RTX 4060 Mobile 8 GB, 16 GB DDR5):
| Metric | Value |
|---|---|
| Token speed | 18-22 tok/s |
| GPU temperature | 62°C → 78°C |
| GPU power draw | 95-110W |
| VRAM utilization | 5.2 GB |
This is the number you'll see in quick benchmarks: ~20 tok/s, competitive with desktop integrated solutions, genuinely usable for interactive work. For the first five minutes, your laptop feels like a capable local AI machine.
Sustained Speed: What Thermal Throttling Does After 15 Minutes
Same session, 15-minute mark:
| Metric | Value |
|---|---|
| Token speed | 11-14 tok/s |
| GPU temperature | 86°C (thermal limit) |
| GPU power draw | 65-75W |
| Performance vs cold start | -38% |
By 30 minutes, the system had settled at 12 tok/s sustained — still usable, but a very different experience from the burst numbers. The GPU hit its 87°C thermal ceiling and the laptop's firmware aggressively pulled power to maintain it.
Warning
Plugged in ≠ full performance. Your RTX 4060 Mobile will throttle on AC power just as hard as on battery. The thermal wall is the real limit, not the power source. Battery mode makes it worse (typically 40-50% power limit), but AC power doesn't save you from physics.
How to Read Thermal Throttling in Your Own Testing
Most monitoring tools won't scream "THERMAL THROTTLE" at you. Here's what to watch:
- GPU clock drops: RTX 4060 Mobile boosts to ~2.4 GHz cold, settles to ~1.6-1.8 GHz throttled. Check GPU-Z or HWiNFO64.
- Power draw plateau: If you're seeing 65W sustained on a 115W GPU, you're throttled.
- Token speed decay: If your tok/s drops 20%+ from minute 5 to minute 15, your cooling system has hit its limit.
The Exact Setup: Model Files, Quantization, and Software
You need three things to replicate our results: the right model file, the right quantization, and inference software that doesn't waste VRAM on overhead.
Model File: Get E2B, Not E4B
Download from Hugging Face or Ollama:
- Correct:
google/gemma-4-2b-it(E2B instruction-tuned) - Incorrect:
google/gemma-4-4b-it(E4B — won't fit at usable quality)
Quantization: Q4_K_M Minimum, Q5_K_M If You Have Headroom
Our Recommendation
Avoid — artifacts in code generation
Default choice for 8 GB cards
Use if you can spare the VRAM
Fits but leaves no headroom — risky Q4_K_M is the practical minimum. Q3 shows visible quality degradation in code completion and structured output tasks. Q5_K_M is noticeably better for reasoning if you can tolerate slightly higher VRAM pressure.
Inference Software: llama.cpp or Ollama, Not Transformers
- llama.cpp (direct): Lowest overhead, best performance, steeper setup. Use if you're comfortable with command lines.
- Ollama: One-command install, handles quantization automatically, 10-15% overhead vs raw llama.cpp. Our recommended starting point.
- Hugging Face Transformers: Don't use this on 8 GB mobile GPUs. The Python overhead alone eats 1-2 GB VRAM.
Ollama one-liner for our tested config:
ollama run gemma4:2b-q4_k_m
Managing Thermals: Practical Tactics for Sustained Sessions
You can't eliminate throttling without hardware mods, but you can delay it and reduce its severity.
Immediate Tactics (No Disassembly)
Effort
30 seconds
Environmental
1 minute
Advanced: Undervolting via MSI Afterburner
RTX 40-series mobile GPUs respond well to undervolting. A -100mV core offset typically reduces power draw 8-12% with minimal performance loss, directly translating to lower temperatures and delayed throttling. This is laptop-specific — start conservative and test stability.
Session Pacing: Work With the Throttle
For long tasks (document summarization, batch code generation), break work into 10-minute chunks with 2-3 minute cooldown periods. This keeps you in the burst performance zone rather than accepting sustained throttled speeds. It's annoying, but it's faster than letting the GPU cook at 12 tok/s for an hour.
What You Can Actually Do With This Setup
Performance expectations for Gemma 4 E2B Q4_K_M on RTX 4060 Mobile, sustained (post-throttle) speeds:
Usability
Good — natural pacing
Good — faster than you can read
Acceptable — validation adds latency This is not a desktop replacement. It's a genuine, usable local LLM setup that doesn't require a new hardware purchase. The quality gap between E2B and E4B is real but smaller than the gap between "running local" and "not running local at all."