Your RTX 4090 runs Llama 70B at 4 tokens per second. Meanwhile your second PCIe slot sits empty, collecting dust, mocking you.
That's the reality hitting builders right now. One GPU isn't enough anymore: not for 70B models, not for anything resembling useful inference at sensible quantization levels. And the community has noticed. r/LocalLLaMA crossed 636,000 members this year, and the threads about dual-GPU rigs are multiplying fast. The question isn't whether to go dual. It's which pair makes sense for your workload.
Three separate data points converged in March 2026 that make this the right moment for this guide: Lenovo announced a workstation stuffed with 192GB of GPU VRAM across two cards, Apple shipped the M5 Max MacBook Pro with 128GB of unified memory, and benchmarks drove home that a single RTX 5090 tops out at 32GB, which isn't enough to hold a 70B model at Q4 at all.
The dual-GPU era for local inference isn't coming. It's here.
The VRAM Math Nobody Talks About Plainly
A 70B parameter model at 4-bit quantization (Q4_K_M) needs roughly 43-47GB of VRAM to load. A single RTX 5090 has 32GB. A single RTX 4090 has 24GB. Neither card can hold it.
This is the constraint that's driving everything. And it's not going away—models are getting bigger, not smaller, even as quantization improves.
Here's what dual-GPU configs actually unlock:
| Setup | Pooled VRAM | What fits |
| --- | --- | --- |
| 2× RTX 3090 | 48GB | 70B at Q4, 34B at Q8 |
| 2× RTX 4090 | 48GB | 70B at Q4, 34B at Q8 |
| 2× RTX 5090 | 64GB | 70B at Q5/Q6, ~100B-class at Q4 |
| 2× RTX Pro 6000 Blackwell | 192GB | 70B at Q8, 120B+ at Q4, 405B-class heavily quantized |

The 70B class is the target because it's where local inference stops being a toy and starts being genuinely useful. Llama 3 70B, Qwen 2.5 72B, DeepSeek 67B: these models beat GPT-4 on several benchmarks and they're free to run. A developer making 200+ API requests per day can realistically save $300-500 per month running locally. Two GPUs is the entry ticket.
[!INFO] VRAM rule of thumb: Q4 quantization cuts a model's memory footprint to roughly 0.6GB per billion parameters. So a 70B model at Q4 = ~42GB. At Q8 (much better quality), it's roughly 1.1GB per billion = ~77GB. That's why 48GB barely covers Q4 70B, why 64GB buys headroom for bigger quants and longer context, and why true Q8 on a 70B wants closer to 80GB.
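If you want to sanity-check a specific model against a VRAM budget, the rule of thumb is easy to script. A minimal sketch in shell; the 0.6 and 1.1 GB-per-billion factors are the approximations from the callout above, not exact figures for any particular GGUF file:

# Rough weight footprint at Q4 and Q8, using the rule-of-thumb factors above
PARAMS_B=70   # model size in billions of parameters
awk -v p="$PARAMS_B" 'BEGIN {
  printf "Q4: ~%.0f GB\n", p * 0.6
  printf "Q8: ~%.0f GB\n", p * 1.1
}'

Add a few gigabytes on top for KV cache and context, which is why the real-world number for 70B Q4 lands at 43-47GB rather than the bare ~42GB of weights.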
The PCIe Bandwidth Penalty (And When It Actually Matters)
Here's where most dual-GPU guides get evasive. Splitting a model across two consumer GPUs over PCIe is slower than running it on a single card with equivalent VRAM—sometimes dramatically so.
PCIe Gen 5 x16 gives you about 128 GB/s of bidirectional inter-GPU bandwidth. NVLink 5.0 on Blackwell professional cards does 1.8 TB/s. That's roughly 14x faster for the same data transfer.
For inference on consumer RTX hardware (4090, 5090, 3090), you're effectively stuck with PCIe. NVIDIA dropped NVLink from GeForce after the 3090, and even that card's bridge was an order of magnitude slower than today's NVLink. An RTX 5090 communicates with a second 5090 via PCIe, and the inter-GPU overhead shows up as a performance penalty that scales with how often layers need to pass data between cards.
In practice: splitting Llama 70B Q4 across two RTX 4090s over PCIe x8/x8 drops performance from ~52 tok/s (hypothetical single card with 48GB) to something closer to 18-24 tok/s depending on your CPU and PCIe slot configuration. Still faster than offloading to RAM. Still much better than 4 tok/s when you're cramming a 70B model onto 24GB with layer offloading.
The threshold question: are you running at 4 tok/s today because your 70B model won't fit? Then dual GPU is the upgrade. Are you running small models well on a single card and hoping two cards doubles your throughput? That's not how this works.
Warning
The x8 trap: Most consumer motherboards can't run two PCIe 5.0 x16 slots simultaneously. When you add a second GPU, both slots typically drop to x8. For LLM inference where the model is already loaded into VRAM, this is usually fine. For frequent inter-GPU tensor transfers on huge models, it can become a bottleneck. Check your motherboard's PCIe topology before buying.
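If both cards are already installed, you can check the link each GPU actually negotiated instead of trusting the manual. A quick check with the standard NVIDIA tools (assumes the driver and nvidia-smi are present):

# PCIe generation and lane width each GPU is currently running at
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
# How the GPUs reach each other (PIX/PXB/PHB = PCIe hops, NV# = NVLink)
nvidia-smi topo -m

Note that GPUs downshift their PCIe link at idle, so run the first query while a card is under load to see the real negotiated width.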
Lenovo Just Announced the Professional Version of This Idea
On March 16, 2026, Lenovo launched the ThinkStation P5 Gen 2. It supports up to two NVIDIA RTX Pro 6000 Blackwell Max-Q GPUs, each with 96GB of ECC GDDR7 VRAM. Total: 192GB of GPU memory in a single desktop.
The Xeon 600 processor goes up to 48 cores at 4.9GHz. RAM slots support up to 1TB of DDR5. Four M.2 NVMe slots can hold up to 16TB of flash storage.
This machine can run 405B-class models, heavily quantized, at usable speeds. It's not for everyone (a single RTX Pro 6000 Blackwell config starts around $11,600), but Lenovo building it at all is a signal. When workstation manufacturers are designing dual-GPU chassis around local AI inference, the use case is no longer fringe.
The RTX Pro 6000 Blackwell also supports NVLink, which is the real advantage over consumer GPUs in dual-card setups. Two Pro 6000s talking over NVLink at ~1.8 TB/s don't suffer the same bandwidth penalty as two RTX 5090s over PCIe. The tensor transfers are fast enough that splitting even massive models across both cards doesn't hurt inference throughput the way it does on consumer hardware.
A single RTX Pro 6000 Blackwell running MiniMax-M2.5 (a 139 billion parameter mixture-of-experts model) achieves 68 tokens per second with a 64K context window. Matching that on a single consumer card would require 96GB of VRAM, which no consumer card has. Put two of them together and you're running things that datacenter teams were deploying on A100 clusters eighteen months ago.
Where the M5 Max Fits In This Picture
Apple's M5 Max, which shipped March 11, 2026, is the obvious counterargument to everything above.
128GB of unified memory in a MacBook Pro. 614 GB/s memory bandwidth. A claimed 4x faster LLM prompt processing versus the M4 Max (community benchmarks suggest the real number is closer to 3x, but still meaningful). And Neural Accelerators embedded in every one of the 40 GPU cores—a genuine architectural change from M4 that Apple specifically aimed at inference workloads.
For a single-machine, battery-powered, portable local LLM setup, nothing beats it. You can load and run 70B models at Q4 without worrying about slot configs, PCIe bandwidth, or second power supplies. LM Studio even ran live demos on the M5 Max at launch—Apple isn't being coy about who the target user is.
But here's the comparison that matters: the M5 Max has 614 GB/s of memory bandwidth. A single RTX 5090 has 1,792 GB/s. Two RTX 5090s, even with PCIe overhead, still deliver more tokens per second than the M5 Max on identical models.
If you need local inference at a desk and don't care about portability, two consumer RTX cards still win on throughput. If you want a laptop that can actually run 70B, the M5 Max is the answer, and the only answer in that form factor.
The Two Builds Worth Considering
Budget dual: 2× RTX 4090
Two 4090s give you 48GB of pooled VRAM. Enough for 70B at Q4, enough for 34B at Q8. Used 4090s are running around $1,100-$1,400 right now (the 5090 launch pushed secondhand 4090 prices down), putting the pair at roughly $2,200-$2,800.
Inference at 18-24 tok/s on Llama 70B Q4 is more than usable for coding assistance, document analysis, and agentic workflows where you're not watching every token print. It's not exciting speed, but it's private, it's offline, and it costs $0 per token after hardware.
The catch: check your case. Two 4090s are enormous cards. You need a full-tower chassis with at least 3 slots of separation between the cards, or thermals become a problem fast.
Premium dual: 2× RTX 5090
Two 5090s give you 64GB of pooled VRAM. That's enough to push a 70B model past Q4 to Q5 or Q6 (noticeably better output quality) with room left for context, and it puts 100B-class models at Q4 within reach. Each 5090 runs $1,999 new, so you're at ~$4,000 for the cards.
A single RTX 5090 has the memory bandwidth to push Llama 70B Q4 at roughly 85 tok/s; it just doesn't have the VRAM to hold the model. In a dual configuration over PCIe, expect a different performance profile depending on how much of the model fits in VRAM and how much data has to cross between cards. For models that fit cleanly within 64GB, each card serves its own layers and communication overhead stays low. For this use case, you're getting around 60-75 tok/s on 70B Q4 depending on PCIe topology.
That's fast. That's faster than most API responses when the model is busy.
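Don't take anyone's numbers on faith, including these; llama.cpp ships a benchmark tool that measures your own rig. A minimal sketch, assuming the same quantized model file used in the setup section below:

# Prompt-processing and generation speed with all layers offloaded to GPU
./llama-bench -m ./llama-70b-q4_k_m.gguf -ngl 99
# Same run with an explicit 50/50 split across the two cards, for comparison
./llama-bench -m ./llama-70b-q4_k_m.gguf -ngl 99 -ts 0.5,0.5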
Tip
Layer splitting in llama.cpp: --n-gpu-layers controls how many of the model's layers are offloaded to GPU at all (the rest stay on CPU). With two GPUs, llama.cpp splits the offloaded layers across both cards automatically; set --main-gpu 0 to pick which card holds the scratch buffers and small tensors, or use --tensor-split to set the exact ratio (e.g., 0.5,0.5 for an even split). Ollama handles all of this automatically when it detects multiple GPUs.
Software Setup in 2026
Ollama hit 5 million downloads in January 2026 and now handles multi-GPU detection automatically. Install it, run ollama run llama3:70b, and it will find both GPUs and split accordingly. No configuration required.
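In practice that looks like this; the model tag is Ollama's standard Llama 3 70B build, and the second command just watches VRAM fill on both cards while it loads:

# Pull and run the model; Ollama splits layers across both GPUs on its own
ollama run llama3:70b
# In a second terminal: per-GPU memory usage, refreshed every 2 seconds
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 2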
llama.cpp gives you more control. Useful flags for dual-GPU inference:
./llama-cli \
-m ./llama-70b-q4_k_m.gguf \
--n-gpu-layers 80 \
--main-gpu 0 \
--tensor-split 0.5,0.5 \
-c 4096 \
-n 512
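If you want an API endpoint instead of an interactive session, llama-server takes the same GPU flags. A sketch assuming the same model file and an even split:

./llama-server \
-m ./llama-70b-q4_k_m.gguf \
--n-gpu-layers 80 \
--tensor-split 0.5,0.5 \
--host 127.0.0.1 --port 8080

It exposes an OpenAI-compatible chat completions endpoint, so existing client code can point at localhost instead of a paid API.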
For higher-throughput serving (multiple concurrent users), vLLM's tensor parallelism is the right choice. Set --tensor-parallel-size 2 and it shards each layer's weights across both GPUs at the framework level, which is generally faster than llama.cpp's layer splitting for batched workloads.
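A minimal sketch of that setup; the model name is a placeholder for whichever 70B-class checkpoint you're serving, and --tensor-parallel-size 2 is the flag doing the work:

# OpenAI-compatible server with each layer's weights sharded across both GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192

On a 64GB pool you'd point this at an AWQ or GPTQ quantized build of the model rather than full-precision weights, since vLLM loads the checkpoint at its stored precision.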
LM Studio 0.3+ also supports multi-GPU natively with a GPU selector in the settings panel.
The Verdict
One GPU isn't enough for 70B models. That's the honest summary of where local inference sits right now.
Lenovo building a 192GB dual-GPU workstation for enterprises, Apple putting 128GB into a MacBook Pro, and r/LocalLLaMA filling up with dual-card build logs: these aren't coincidences. The community hit the same wall that hardware manufacturers are now building for. The single-GPU ceiling is real, and the only way over it is more cards or unified memory.
For most builders, two RTX 4090s is the most cost-effective path to running 70B models locally. If you're buying new hardware, two RTX 5090s is the better long-term investment—64GB covers the current generation of open-weight models with room to grow. And if budget isn't the constraint, a single RTX Pro 6000 Blackwell at 96GB skips the PCIe overhead entirely and runs cleaner at higher speeds.
The second GPU slot in your case isn't wasted real estate anymore. Fill it.