**TL;DR: Multi-GPU RTX 3090 scaling works — but only if you use vLLM with tensor parallelism. With llama.cpp in its default mode, adding a 3rd or 4th GPU barely improves generation speed; you're mostly buying VRAM, not throughput. Switch to vLLM, and a 4x RTX 3090 setup at 220W per GPU delivers 39 tok/s on 32B models with real scaling gains at each tier. At current pricing (~$1,058/card used, March 2026), this is primarily compelling if you already own cards and want to scale. Starting from scratch, compare carefully against a single RTX 5090 at $1,999.**
---
There's a version of the RTX 3090 multi-GPU story that's popular on forums: old cards are cheap, buy four of them, run massive models for pennies. That story has gotten more complicated. Used 3090s now trade at around $1,058 on eBay — up 19% in a single month. The framework you run matters enormously. And most guides don't tell you that adding GPUs to llama.cpp's default setup gets you almost nothing in speed.
This guide covers what actually happens — with real numbers, not theoretical peaks — at each scaling tier.
## Quick Comparison: 2x vs 3x vs 4x RTX 3090
All RTX 3090 specs: 24 GB GDDR6X [VRAM](/glossary/vram) on a 384-bit bus, 936 GB/s memory bandwidth, 350W TDP (reference Founders Edition). AIB factory-overclocked cards — which are most of what you'll find used — draw 380-420W under sustained load. Verified against NVIDIA's spec sheet, March 2026.
| GPUs | Approx. System Draw |
|------|---------------------|
| 1x (baseline) | ~450-550W |
| 2x | ~950-1,100W |
| 3x | ~1,400-1,600W |
| 4x | ~1,750-2,100W |
> [!WARNING]
> The llama.cpp default column is the number most guides don't show you. In layer-split mode — the default — adding a 3rd or 4th RTX 3090 produces virtually no speed improvement on 70B inference. Published benchmarks confirm ~16-17 tok/s on Llama 3 70B Q4_K_M across 2x, 4x, and even 6x RTX 3090 setups in layer-split mode. You're adding VRAM capacity, not compute parallelism. The vLLM column is what you get when multi-GPU is actually working. The 4x vLLM number (39 tok/s) is confirmed via a published benchmark on Qwen QwQ-32B at 220W per GPU. 70B vLLM figures for this exact configuration are estimated from scaling patterns — no reproducible single-stream 70B benchmark for 3x or 4x RTX 3090 was publicly available at time of writing.
Used GPU prices sourced from BestValueGPU price tracker and eBay sold listings, March 29, 2026.
## Why RTX 3090s Still Matter — With Honest Pricing
Let's be direct about the pricing situation first, because it changes the math significantly.
Used RTX 3090s are no longer $200-300 each. They've climbed to ~$1,058 per card on eBay as demand for 24 GB VRAM has grown. A 3x build is roughly $3,200 in GPU costs before PSU, case, or platform. An RTX 5090 — NVIDIA's current flagship — costs $1,999 and gives you 32 GB VRAM.
The case for RTX 3090 multi-GPU now depends heavily on your starting point. If you own one or two already, scaling is still compelling: adding two more to reach 3x costs ~$2,100 and gets you 72 GB VRAM. That's 2.25x the VRAM of a single RTX 5090 for the same price as a new one. If you're building from scratch with no existing cards, the math is tighter.
Where RTX 3090 multi-GPU still wins clearly: raw VRAM capacity per dollar. Three of them give you 72 GB — enough to run 120B models with aggressive [quantization](/glossary/quantization), something a single RTX 5090 can't touch. One community builder ran a 5x RTX 3090 rig (120 GB VRAM) on a 120B model and hit 124 tok/s — versus a brand-new NVIDIA DGX Spark at 38.5 tok/s on the same workload. The total GPU cost for that setup was ~$5,000 used.
The RTX 3090's value proposition in 2026 is VRAM capacity at scale. Speed-per-watt goes to the RTX 5090. For the [upgrade path comparison](/comparisons/rtx-3090-vs-rtx-5090-scaling-roi/), see our full ROI breakdown.
## The Bandwidth Bottleneck: NVLink vs PCIe Reality
The RTX 3090 is the only consumer GeForce RTX 30-series card with [NVLink](/glossary/nvlink) support — NVIDIA dropped it from every other 30-series GPU. A retail NVLink bridge is available at Best Buy (~$79). It's a 4-slot-wide bridge, wider than a standard SLI connector.
The bandwidth difference is real: 2-way NVLink provides approximately 600 GB/s bidirectional bandwidth. PCIe 4.0 x16 delivers ~32 GB/s unidirectional per slot — about 19x slower for direct GPU-to-GPU communication. In vLLM throughput tests, a 2x RTX 3090 NVLink configuration reached 715 tok/s batched vs 483 tok/s over PCIe — a 48% improvement.
But here's what the forum posts usually skip: NVLink only connects two cards. A 3rd or 4th RTX 3090 communicates over PCIe regardless of whether you have the bridge installed. At 4 GPUs, the NVLink advantage over PCIe drops from 48% to roughly 9%. The bridge is highly worth buying for a 2x setup. For 3x or 4x builds, it helps the first pair but doesn't change the overall constraint.
PCIe 4.0 x16 handles 3x and 4x setups adequately for most inference workloads — the bottleneck at that scale is usually power and cooling before bandwidth.
### How to Verify Your PCIe Configuration
Run `nvidia-smi -q -d PCIE` to check each GPU's PCIe generation and lane width. You want x16 PCIe 4.0 lanes on every card. Slots running at x8 electrical (even when the slot is physically x16-length) cut bandwidth in half and will throttle multi-GPU performance noticeably.
Also check that CUDA peer access is enabled between your GPUs: `nvidia-smi topo -m` shows the interconnect topology. Direct peer access (PIX or SYS) is required for efficient tensor parallelism in both llama.cpp row-split and vLLM.
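If you'd rather script the check than eyeball the output, here's a minimal sketch. The `--query-gpu` fields and `--format=csv,noheader` are standard `nvidia-smi` options; the parsing helper itself is our own.

```shell
# Warn on any GPU negotiating fewer than 16 PCIe lanes.
# Feed it the CSV that nvidia-smi emits, one "gen, width" line per GPU:
#   nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current \
#       --format=csv,noheader
check_pcie_links() {
    local gpu=0 ok=1
    while IFS=', ' read -r gen width; do
        if [ "$width" -lt 16 ]; then
            echo "GPU $gpu: PCIe gen $gen x$width (want x16)"
            ok=0
        fi
        gpu=$((gpu + 1))
    done
    [ "$ok" -eq 1 ] && echo "all links x16"
}

# On a live system:
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current \
        --format=csv,noheader | check_pcie_links
fi
```

Note that `pcie.link.gen.current` reports the *negotiated* link, which can drop to a lower generation at idle on some boards; check while a workload is running.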
## Token/s Gains at Each Tier — Framework Is Everything
The single most important thing to understand about RTX 3090 multi-GPU scaling: the framework you run determines whether scaling works at all.
**llama.cpp with default `--split-mode layer`**: This splits the model across GPUs by assigning layers to different cards, then runs them sequentially. It doesn't parallelize computation — it pools VRAM. Real-world benchmarks show Llama 3 70B Q4_K_M hitting ~16.57 tok/s at 512 tokens on 2x RTX 3090, ~17.07 tok/s on 4x RTX 3090, and ~17.09 tok/s on 6x RTX 3090. Essentially flat. You're expanding VRAM headroom, not compute throughput.
**llama.cpp with `--split-mode row`**: This enables true [tensor parallelism](/glossary/tensor-parallelism) — weight matrix rows are split across GPUs and computation runs in parallel. It does scale, but inter-GPU synchronization overhead still limits it on PCIe. Community testing puts it behind vLLM in multi-GPU efficiency, especially at 3x and 4x.
**vLLM with `--tensor-parallel-size`**: The right tool for multi-GPU inference. A published benchmark (Himesh P., March 2025) on 4x RTX 3090 using vLLM with tensor parallelism at 220W per GPU shows 39 tok/s output speed on Qwen QwQ-32B, with 353 tok/s total batched throughput. This is confirmed data — but note the model is 32B, not 70B. Larger models will be slower; confirmed 70B vLLM numbers for this exact hardware configuration weren't available at time of writing.
The practical takeaway: if you're adding GPUs to run bigger models faster, you need vLLM. llama.cpp's layer split gets you the VRAM to load the model, but near-zero speed benefit.
### Why Efficiency Drops at 4x
Three compounding factors:
PCIe bandwidth contention increases with every GPU added beyond the NVLink pair. Each card competes for the available lanes. At 4x, you're sharing PCIe bandwidth across cards 3 and 4, and inter-GPU data transfer during tensor-parallel operations becomes a real constraint.
Tensor-parallel synchronization overhead grows non-linearly. Every forward pass requires all-reduce operations across GPUs. At 4x, that's three synchronization points versus one at 2x — and each over PCIe adds latency.
Software scheduling limits are real. vLLM handles 4-GPU distribution better than llama.cpp, but even vLLM has a practical efficiency floor. The confirmed throughput improvement from 2x to 4x is meaningful but not linear.
## When Does the 4th GPU Stop Paying Off?
At current used pricing (~$1,058/card), the 4th RTX 3090 costs more than the 3rd on a cost-per-added-throughput basis, and the efficiency return has already begun to flatten.
**Single-user testing or development:** 3x RTX 3090 handles every major 70B model available today with 72 GB VRAM. The 4th card adds throughput at the cost of meaningfully higher power draw and a harder cooling problem. If your workload is one conversation thread at a time, it's idle most of the day. Skip it.
**Multi-user API serving:** Here's where 4x earns its place. Running 3+ concurrent inference requests saturates 3x capacity quickly. The extra burst headroom from a 4th GPU means you can handle concurrent requests without queuing. The 250-350 tok/s batched throughput of a 4x vLLM setup translates directly to latency under load.
**Batch processing:** Document analysis, prompt-heavy pipelines, content generation at volume — these all benefit from maximum GPU count because throughput matters more than single-stream latency. 4x is the right call.
The honest decision heuristic: run `watch -n 1 nvidia-smi` during your real workloads for a week. If you're hitting 90-100% GPU utilization consistently, add the card. If you're at 60-70% with spikes, optimize your stack first — especially by switching from llama.cpp to vLLM.
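A lighter-weight alternative to watching `nvidia-smi` by hand: log utilization samples over the week, then average them. The query flags below are standard `nvidia-smi` options; the log path, one-minute interval, and 90% threshold are our own assumptions mirroring the heuristic above.

```shell
# Collect: append one utilization sample per GPU every minute, e.g. from cron:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits >> util.log

# Summarize: integer average of all samples in the log file.
avg_utilization() {
    awk '{ sum += $1; n += 1 } END { if (n > 0) printf "%d\n", sum / n }' "$1"
}

# The heuristic above: sustained 90%+ means the extra card would be used.
should_add_gpu() {
    if [ "$(avg_utilization "$1")" -ge 90 ]; then
        echo "add a card"
    else
        echo "optimize the stack first"
    fi
}
```

An average hides spikes, so also glance at the raw log; a workload that sits at 50% with daily 100% bursts is a batching problem, not a hardware one.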
> [!NOTE]
> The power and cooling jump from 3x to 4x is not incremental — it's a category change. Three RTX 3090s in a quality ATX tower with good airflow is manageable. Four requires either exceptional case airflow, undervolting/power limiting all cards, or a server rack chassis. Many builders hit thermal limits at 4x before they hit power budget limits.
## Distributed Inference Setup: llama.cpp Row-Split and vLLM
**For llama.cpp with true tensor parallelism:**
```bash
./llama-cli -m model.gguf \
  -ngl 999 \
  --split-mode row \
  --tensor-split 1,1,1,1 \
  --main-gpu 0 \
  -n 512 \
  -p "your prompt"
```

Key flags:

- `-ngl 999` — force all layers onto GPU; without it, layers fall back to CPU and the tensor split doesn't engage
- `--split-mode row` — weight matrix row distribution; all GPUs compute in parallel (not sequentially like `layer`)
- `--tensor-split 1,1,1,1` — proportion values per GPU (equal split for 4 identical cards)
- `--main-gpu 0` — which GPU handles KV cache and intermediate results
Verify all GPUs are active during inference by watching `nvidia-smi` in a separate terminal. All GPUs should show >50% utilization. If only 1-2 cards are busy, row-split isn't engaging — check peer access with `nvidia-smi topo -m`.
**For vLLM (recommended for 3x and 4x):**

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype float16 \
  --gpu-memory-utilization 0.95
```
vLLM's implementation of tensor parallelism is more optimized than llama.cpp's, particularly for multi-GPU synchronization. For a full walkthrough of both frameworks, including KV cache tuning and batching configuration, see our llama.cpp multi-GPU setup guide.
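Once it's running, the server speaks the OpenAI-compatible HTTP API, so a quick smoke test is a `curl` against `/v1/completions` (port 8000 is vLLM's default). The tiny JSON helper below is our own sketch and does no quote escaping, so keep test prompts free of double quotes.

```shell
# Build a minimal completions request body. Naive quoting: fine for a
# smoke test, but use jq or a real client for production payloads.
make_request() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": 64}\n' "$1" "$2"
}

# Against a running server:
# make_request meta-llama/Llama-3.1-70B-Instruct "Hello" |
#     curl -s http://localhost:8000/v1/completions \
#         -H "Content-Type: application/json" -d @-
```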
### Troubleshooting Multi-GPU Inference
**All GPUs show <5% utilization:** llama.cpp is probably running in layer-split mode. Add `--split-mode row` explicitly — it won't activate automatically. Also verify `-ngl 999` is present.

**Only the first GPU is active:** CUDA peer access isn't enabled between your cards. Run `nvidia-smi topo -m` and look for the topology type — `NV#` means NVLink, `PIX` means PCIe switch, `SYS` means system memory. Any of these supports peer access, but `SYS` is the slowest. If you see `NODE` or worse, your slots are too far apart on the PCIe tree.

**Out-of-memory error on a model that should fit:** AIB RTX 3090 cards sometimes report slightly different available VRAM than the 24 GB spec due to display output reservation. Try dropping GPU memory utilization in vLLM to 0.90. For llama.cpp, reduce context length (`-c 2048` instead of `-c 4096`).

**Speed barely improved adding a 3rd GPU:** Check PCIe slot generation. x8 slots provide half the bandwidth of x16. Consult your motherboard manual and move the 3rd GPU to a full x16 slot if one is available.
## Power Supply and Cooling: The Real Limiting Factor
The 350W TDP is the Founders Edition spec. Most used RTX 3090s are AIB factory-overclocked models — which routinely draw 380-420W under sustained inference load. Peak spikes on some AIB cards have been measured at 464W. Build your power budget around 400W per GPU, not 350W.
| GPUs | Approx. System Total |
|------|----------------------|
| 2x | ~950-1,100W |
| 3x | ~1,400-1,600W |
| 4x | ~1,850-2,100W |

System totals include a mid-range CPU (125-200W), NVMe drives, and cooling. High-core-count workstation CPUs push this higher.
Actual PSU minimums:
- 3x RTX 3090: 1,600W, 80+ Gold minimum. 80+ Platinum preferred for sustained load.
- 4x RTX 3090: 2,000W, 80+ Platinum. A 1,600W PSU running at 90% capacity under continuous AI inference load is not a question of if it fails — it's when.
For detailed PSU sizing calculations including efficiency curves and headroom math, see our power supply guide for multi-GPU AI builds.
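The headroom rule behind those minimums can be sanity-checked with two lines of shell. The ~80% continuous-load ceiling used in the comments is a common builder rule of thumb, not a vendor spec.

```shell
# Percent of PSU rating consumed at a given continuous draw.
psu_load_pct() {
    echo "$(( $1 * 100 / $2 ))"   # args: draw watts, PSU rating watts
}

# 3x AIB cards at ~400 W each plus ~300 W of system on a 1,600 W PSU:
#   psu_load_pct 1500 1600  -> 93  (too hot for continuous inference)
# The same load on a 2,000 W unit:
#   psu_load_pct 1500 2000  -> 75  (inside the ~80% comfort zone)
```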
Cooling notes for each tier:
3x RTX 3090 in a standard full-tower is manageable with strong case airflow — minimum 3 front intake fans and 2 top exhausts. The GPUs need to breathe. Slot spacing matters: if your motherboard allows, skip a slot between GPUs to create an air gap.
4x RTX 3090 in a standard case is genuinely difficult. Many builders who push to 4x end up either undervolting to a 220W power limit per card (which cuts GPU draw by ~35% with modest performance cost) or moving to an open-frame or server chassis. Power limiting to 220W is worth considering — the Himesh benchmarks show competitive throughput at that setting vs full power, with dramatically lower heat output.
> [!TIP]
> Power limiting at 220W per GPU is a legitimate performance strategy, not just a thermal workaround. The 4x RTX 3090 vLLM benchmark at 220W reached 39 tok/s output — a good result for that class of hardware. You keep the throughput while dropping total GPU draw to ~880W instead of ~1,520-1,680W. If cooling is a concern, try 220W before adding more fans.
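Applying the limit takes two commands. `-pm` (persistence mode) and `-pl` (power limit) are standard `nvidia-smi` flags; both need root, and the limit resets on reboot, so wire this into a boot script if you want it permanent. The savings helper is our own arithmetic against the ~400W real-world AIB draw cited above.

```shell
# Cap every detected GPU at a given wattage.
limit_all_gpus() {
    sudo nvidia-smi -pm 1        # keep the driver loaded between jobs
    sudo nvidia-smi -pl "$1"     # no -i index: applies to every GPU
}

# Total GPU watts saved vs. the ~400 W real-world AIB average.
watts_saved() {
    echo "$(( $1 * 400 - $1 * $2 ))"   # args: card count, per-card limit
}
```

Usage would be `limit_all_gpus 220`; `watts_saved 4 220` confirms the four-card saving of 720W that the tip above describes.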
## The Model Landscape in 2026: Why VRAM Capacity Matters
In 2023, 13B models were the practical ceiling for local builders. A single RTX 3090 handled them comfortably.
2026 is different. Llama 3.1 70B is a minimum-viable model for serious work. Qwen 72B, Mistral Large, and their equivalents are the baseline — and none of them fit on a single consumer GPU at quality levels worth running. A single RTX 3090's 24 GB VRAM forces heavy quantization on 70B models, and with llama.cpp's layer-split, you get 5-8 tok/s at lower output quality.
The VRAM ceiling is the constraint. 48 GB (2x) lets you run 70B at Q4_K_M. 72 GB (3x) lets you run 70B at higher quality quantization tiers or push into 100B+ territory with Q4. 96 GB (4x) opens full 120B+ model access.
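Those tiers follow from simple arithmetic: weight file size is roughly parameter count times bits-per-weight divided by eight, and Q4_K_M averages about 4.8 bits per weight. A sketch in integer shell math (bits-per-weight is passed scaled by 10 to avoid floats; KV cache and activations come on top of the figure it returns):

```shell
# Approximate GGUF weight size in GB: params (billions) x bpw / 8.
# bpw is passed x10 so the arithmetic stays integer (Q4_K_M ~= 48).
weight_gb() {
    echo "$(( $1 * $2 / 80 ))"   # args: params in billions, bpw x10
}

# weight_gb 70 48   -> 42  (70B at Q4_K_M: fits in 48 GB with KV room)
# weight_gb 120 48  -> 72  (120B at Q4_K_M: in practice wants the 4x tier,
#                           since 72 GB leaves nothing for KV cache)
```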
Fine-tuning and LoRA training push VRAM requirements higher still. If you're doing any training — even lightweight LoRA — a 3x or 4x setup pays for itself in what becomes possible.
## Decision Framework: Which Tier Is Right for You?
**2x RTX 3090 (~$2,116 used):** Baseline 70B capability. Fits every major model at Q4_K_M. With vLLM, throughput is real. If you're already running a single 3090 and want to reach 70B territory without rebuilding, adding one card is the clearest upgrade. Start here if budget is tight.

**3x RTX 3090 (~$3,174 used):** The sweet spot for power users running 70B+ models daily. 72 GB VRAM gives you model headroom and better quantization options. vLLM tensor parallelism scales meaningfully at this tier. Compare against a single RTX 5090 at $1,999 — the 5090 wins on tokens-per-watt; 3x RTX 3090 wins on raw VRAM capacity.

**4x RTX 3090 (~$4,232 used):** Multi-user inference servers, batch processing, or 100B+ model work. Requires a 2,000W PSU and serious cooling planning. Only build this if you've saturated 3x or have a confirmed concurrent-request workload. Undervolting to 220W per card makes this manageable.
The workload question to ask: how often is your current setup at 100% GPU utilization? Check during your real workflow, not synthetic benchmarks. If the answer is "rarely," optimize your inference stack before adding hardware. If it's "constantly," add a card.
New build from scratch: The calculus is different. At $4,232 for 4x used RTX 3090s vs $1,999 for a new RTX 5090, the RTX 5090 wins on single-card performance and future scalability. The 3x/4x RTX 3090 play makes most sense for owners of existing cards scaling up.
## NVLink and the Next Upgrade Cycle
The RTX 5090 supports NVLink in 2-4 GPU configurations with proper scaling behavior. If you're planning a multi-GPU build from scratch in 2026-2027 and you're not already in the RTX 3090 ecosystem, waiting for RTX 5090-based multi-GPU setups gives you better bandwidth, tokens-per-watt, and a more scalable path beyond 4x GPUs.
For RTX 3090 owners specifically: buy the NVLink bridge for a 2x setup — it's $79 and delivers a real throughput gain. Don't chase NVLink for 3x or 4x; the bridge connects exactly two cards, and the additional GPUs go over PCIe regardless.
The RTX 3090's PCIe ceiling at 4x GPUs is real. Beyond that, you're fighting both bandwidth limits and software scheduler overhead. Four is the practical maximum for this hardware generation.
## Real-World Setup: A 3x RTX 3090 Builder's Walkthrough
**Case:** Full ATX tower with three full-length PCIe x16 slots. Verify your motherboard's lane configuration before buying — some boards run the third PCIe slot at x4 or x8 electrical even if physically x16. That matters for GPU 3's bandwidth.

**Slot placement:** Use every other slot where the motherboard allows. This creates an air gap between cards and reduces GPU-to-GPU heat transfer. Check the manual for slot recommendations when using multiple GPUs — some boards have preferred configurations.

**Power connectors:** Each RTX 3090 requires dual 8-pin connectors (some AIB cards use a 16-pin/12VHPWR adapter). Three cards means six 8-pin or equivalent connections. Use a fully modular PSU — routing six GPU power cables in a mid-tower is tight, and non-modular cables make it worse.

**NVIDIA driver:** Version 550+ detects multi-GPU setups correctly without additional configuration. Install the driver, run `nvidia-smi`, and confirm all three GPUs appear with correct VRAM counts before installing any inference software.

**Thermal paste/pads:** Used RTX 3090s frequently have degraded thermal pads between the die and heatsink. Replacing them (30-45 minutes of teardown per card, thermal pad kit ~$15) typically drops temperatures 5-10°C. Worth doing before sustained AI workloads — a card running hot will throttle and throw off your benchmarks.

**First test:** Before running any 70B model, verify multi-GPU tensor parallelism is working by running llama.cpp with Llama 3.1 8B in row-split mode. You should see all three GPUs at >50% utilization and roughly 3x the tok/s of a single-GPU run. If only one GPU is busy, your peer access or slot configuration needs attention.
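To turn that first test into a number, run the same short generation twice — once pinned to a single card with `CUDA_VISIBLE_DEVICES=0`, once across all three — and compare tok/s. The ratio helper below is our own; a result near 3.0x means row-split is genuinely engaging, while ~1.0x means you're back in layer-split territory.

```shell
# Speedup of the multi-GPU run over the single-GPU baseline.
speedup() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.1fx\n", m / s }'
}

# Example: pin to one GPU, then use all of them, noting tok/s each time:
#   CUDA_VISIBLE_DEVICES=0 ./llama-cli -m llama-3.1-8b.gguf -ngl 999 -n 256 -p "test"
#   ./llama-cli -m llama-3.1-8b.gguf -ngl 999 --split-mode row -n 256 -p "test"
```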
*All GPU pricing in this article was verified against eBay sold listings and BestValueGPU price tracker data as of March 29, 2026. Benchmark data sources: Himesh P. blog (March 2025) for 4x RTX 3090 vLLM benchmarks; GPU-Benchmarks-on-LLM-Inference GitHub repository for llama.cpp multi-GPU layer-split data.*
## FAQ
### How many RTX 3090s do I need to run Llama 3.1 70B locally?
Two is the minimum — 48 GB combined VRAM fits the model at Q4_K_M quantization. With vLLM and tensor parallelism, you'll see real throughput gains over a single card. Three is where daily heavy use becomes comfortable: 72 GB VRAM lets you load 70B at better quantization tiers, and vLLM's 3-way tensor parallelism delivers meaningfully higher throughput than 2-way. If you're using llama.cpp in its default layer-split mode, be aware that adding cards beyond one mostly adds VRAM capacity — not speed.
### Does multi-GPU scaling actually work with llama.cpp?
The answer depends on which split mode you're using. The default `--split-mode layer` runs GPU computations sequentially — published benchmarks show ~16-17 tok/s on Llama 3 70B Q4_K_M regardless of whether you have 2x, 4x, or 6x RTX 3090s. `--split-mode row` enables true tensor parallelism and does scale with additional GPUs, but vLLM's implementation is more optimized and delivers better efficiency at 3x and 4x. If multi-GPU throughput is the goal, vLLM is the right inference stack. For the full configuration details, see our llama.cpp multi-GPU setup guide.
### Does the RTX 3090 support NVLink?
Yes — it's the only consumer RTX 30-series card with NVLink support. NVIDIA sold a retail NVLink bridge for it (~$79 at Best Buy). Two-way NVLink pairing gives roughly 600 GB/s bidirectional bandwidth, and vLLM tests show a 48% throughput improvement with NVLink over PCIe for 2-GPU configurations. The catch: it's 2-way only on consumer hardware. A 3rd or 4th card communicates over PCIe regardless of whether the bridge is installed. For 2x setups, buy the bridge. For 3x or 4x builds, it helps the first pair marginally (+9% at 4 GPUs) but doesn't change the overall constraint.
### What PSU do I need for a 3x or 4x RTX 3090 build?
For 3x: 1,600W minimum, 80+ Gold or Platinum. AIB RTX 3090 cards regularly draw 380-420W under load — not the 350W rated TDP. Three cards at 400W average is 1,200W of GPU draw alone; add CPU and system overhead and you're at 1,500W+ before headroom. For 4x: 2,000W minimum with 80+ Platinum. Running a 1,600W PSU at 90% load under continuous AI inference is a reliability risk. One option worth considering: power-limit each GPU to 220W. vLLM benchmarks show competitive throughput at 220W per card, and it cuts total GPU draw from ~1,600W to ~880W.
### Is a 4th RTX 3090 worth it for single-user inference?
For most single users, no. A 3x RTX 3090 setup handles every major 70B model currently available with headroom for larger models via quantization. The 4th GPU adds throughput but costs ~$1,058 on the used market and meaningfully raises power draw, heat, and PSU requirements. The upgrade math works for multi-user inference serving or batch workloads where your current setup is consistently at 100% utilization. For a single conversation thread, 3x has more capacity than you'll use most days. Start with 3x and monitor actual GPU utilization before committing to a 4th card.