Multi-GPU LLM Inference: How to Split Models Across Two Cards

TL;DR: Multi-GPU inference makes sense when your model is too large for a single GPU's VRAM. Two RTX 3090s give you 48GB combined, which is enough to run 70B models at Q4 entirely in VRAM. The speed gains are real but modest; VRAM capacity is a bigger win than raw throughput.

Most local LLM builds are single-GPU. One card, all the VRAM, simple setup. But once you hit the ceiling of 24GB on a single RTX 3090 or 4090, the next question is obvious: can you just add another GPU and double the VRAM? The answer is yes, with some nuance.

When Multi-GPU Actually Makes Sense

There's one clear reason to go multi-GPU: the model you want to run doesn't fit in a single card's VRAM. That's it. If your target model fits on one GPU, adding a second GPU will not dramatically speed up inference — the overhead of inter-GPU communication often offsets any parallelism gains.

Multi-GPU is the right call when:

You want to run 70B models entirely in VRAM (Q4_K_M weights are ~42GB, so two 24GB cards are the minimum; Q6 and Q8 need roughly 58GB and 75GB)
You're running 34B+ models and want more headroom for context
You're planning ahead for larger models coming in the next year
You're building a shared inference server and need total throughput across multiple users

Multi-GPU is probably not worth it when:

You're running models that already fit on one 24GB card
You want faster inference on small/medium models (a single faster GPU is better)
You're primarily doing CPU-offloaded inference

How Model Splitting Works

When a model is too large for a single GPU, it gets split across multiple GPUs. Each GPU holds some of the model's layers. Inference passes data sequentially through the layers, jumping between GPUs as it goes.
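
As a rough illustration of that layer-wise split, the sketch below assigns contiguous blocks of layers to GPUs in proportion to a set of weights. The function name and the rounding rule are assumptions made up for this example; real runtimes such as llama.cpp use their own assignment logic. The 80-layer count is the figure for Llama 3.1 70B.

```python
def assign_layers(n_layers: int, weights: list[float]) -> list[range]:
    """Split n_layers into contiguous blocks, one per GPU, proportional to weights."""
    total = sum(weights)
    boundaries = [0]
    cumulative = 0.0
    for w in weights:
        cumulative += w
        # Each boundary is the rounded share of layers covered so far.
        boundaries.append(round(n_layers * cumulative / total))
    return [range(boundaries[i], boundaries[i + 1]) for i in range(len(weights))]


# Two equal GPUs, then a 3:1 weighting (hypothetical 24GB + 8GB pair).
even = assign_layers(80, [1, 1])
uneven = assign_layers(80, [3, 1])

print([len(block) for block in even])    # layers per GPU for the even split
print([len(block) for block in uneven])  # layers per GPU for the 3:1 split
```

During generation, a token's activations pass through the first GPU's block, cross to the second GPU once, and finish there.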

There are two ways the GPUs can communicate: NVLink and PCIe.

NVLink:

Dedicated high-bandwidth bridge that connects two GPUs directly
RTX 3090s support NVLink via the NVLink bridge connector
Bandwidth: ~112 GB/s bidirectional on the RTX 3090 (vs PCIe 4.0 x16 at ~32 GB/s per direction)
Latency is dramatically lower than PCIe
Requires a compatible NVLink bridge (~$30–50 extra)
RTX 4090 does NOT support NVLink — this is a genuine downgrade vs 3090 for dual-GPU

PCIe:

Standard connection through the motherboard
Much lower bandwidth than NVLink
Still functional for inference: with a layer split, data crosses between GPUs only at the split point, not continuously
Performance hit vs NVLink is noticeable but not always catastrophic
Works with any dual-GPU setup on any compatible motherboard

For local inference specifically, the difference between NVLink and PCIe is less dramatic than it sounds. Inference is not like training — data transfer between GPUs only happens at layer boundaries, not continuously. In practice, a dual RTX 4090 PCIe setup often performs within 20–30% of a dual 3090 NVLink setup at the same VRAM capacity, despite the bandwidth gap.
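
To put a number on why the link matters less than expected, here is a back-of-the-envelope sketch of how much data crosses a layer boundary. It assumes a layer split, fp16 activations (a runtime may use fp32, which doubles the figure), and the approximate PCIe 4.0 x16 bandwidth quoted above. The hidden size of 8192 is the value for Llama 3.1 70B. Link latency is ignored, so treat the times as lower bounds.

```python
HIDDEN_SIZE = 8192        # Llama 3.1 70B hidden dimension
BYTES_PER_VALUE = 2       # assumption: fp16 activations
PCIE4_X16_GB_S = 32.0     # approximate one-direction bandwidth


def boundary_transfer_bytes(n_tokens: int) -> int:
    """Bytes of activations handed from one GPU to the next for n_tokens."""
    return n_tokens * HIDDEN_SIZE * BYTES_PER_VALUE


def transfer_time_us(n_bytes: int, bandwidth_gb_s: float) -> float:
    """Pure bandwidth time in microseconds (latency not included)."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e6


per_token = boundary_transfer_bytes(1)     # one generated token
prefill = boundary_transfer_bytes(4096)    # a 4096-token prompt

print(f"generation: {per_token} bytes per token, "
      f"{transfer_time_us(per_token, PCIE4_X16_GB_S):.3f} us over PCIe")
print(f"prefill:    {prefill / 2**20:.0f} MiB, "
      f"{transfer_time_us(prefill, PCIE4_X16_GB_S) / 1000:.2f} ms over PCIe")
```

A few kilobytes per generated token is tiny next to the tens of gigabytes of weights each GPU reads from its own VRAM for that same token, which is why generation speed is not dominated by the interconnect.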

Practical Setup: Dual RTX 3090

Two RTX 3090s is the most common dual-GPU configuration for local LLM builders. Here's the actual setup.

Hardware requirements:

Two RTX 3090 24GB cards
Motherboard with dual PCIe x16 slots (physical) — check that it runs x16/x16 or at minimum x16/x8 electrically. Many x16/x16 physical boards only run x8/x8 electrically when both slots are populated.
CPU with enough PCIe lanes: mainstream chips such as the Intel Core i9-13900K or i9-14900K provide 20 PCIe lanes from the CPU, so two GPUs run at x8/x8 at best; AMD Threadripper provides enough lanes for full x16/x16
NVLink bridge if you want peak performance between the two 3090s
PSU rated at minimum 1000W (two 3090s can pull 700W+ peak combined)
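
Once both cards are installed, you can check what link width each slot actually negotiated. The sketch below is one way to do it, assuming nvidia-smi is on your PATH; it uses the standard --query-gpu fields for PCIe generation and width. The helper names and the sample output are hypothetical and only there to show the parsing.

```python
import subprocess

QUERY_FIELDS = "index,name,pcie.link.gen.current,pcie.link.width.current"


def parse_pcie_links(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader` output into one dict per GPU."""
    links = []
    for line in csv_text.strip().splitlines():
        index, name, gen, width = [field.strip() for field in line.split(",")]
        links.append({"index": int(index), "name": name,
                      "gen": int(gen), "width": int(width)})
    return links


def query_pcie_links() -> list[dict]:
    """Run nvidia-smi and return the parsed link info (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_pcie_links(result.stdout)


# Hypothetical sample output, used here so the parser can be tried without a GPU.
SAMPLE = """0, NVIDIA GeForce RTX 3090, 4, 16
1, NVIDIA GeForce RTX 3090, 4, 8"""

sample_links = parse_pcie_links(SAMPLE)
narrow = [gpu["index"] for gpu in sample_links if gpu["width"] < 16]
print("GPUs running below x16:", narrow)
```

On a real machine, call query_pcie_links() instead of parsing the sample. The reported generation can drop at idle for power saving, so check it while the GPUs are under load.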

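Software setup with llama.cpp: llama.cpp supports splitting a model across multiple GPUs natively. The --tensor-split flag sets the proportion of the model assigned to each GPU. By default the split is by layer (--split-mode layer); --split-mode row splits individual tensors across the GPUs instead.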

./llama-cli -m model.gguf --tensor-split 1,1 -ngl 99

The --tensor-split 1,1 tells llama.cpp to split evenly between two equal GPUs. If your GPUs have different VRAM amounts, you can weight accordingly (e.g., --tensor-split 3,1 for a 24GB + 8GB split).

-ngl 99 offloads all layers to GPU (you want this for pure GPU inference).
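
If your cards differ, the ratio can be derived from their VRAM sizes. Here is a minimal Python sketch; the helper names tensor_split_arg and llama_cli_args are made up for illustration, and the code only assembles the command shown above rather than running it. Weighting by total VRAM is a simplification, since a card that also drives your display has less free memory than its spec suggests.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gb: list[int]) -> str:
    """Reduce per-GPU VRAM sizes to the smallest whole-number ratio."""
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(v // divisor) for v in vram_gb)


def llama_cli_args(model_path: str, vram_gb: list[int]) -> list[str]:
    """Build the llama-cli argument list using only the flags from this article."""
    return ["./llama-cli", "-m", model_path,
            "--tensor-split", tensor_split_arg(vram_gb),
            "-ngl", "99"]


print(" ".join(llama_cli_args("model.gguf", [24, 24])))
print(" ".join(llama_cli_args("model.gguf", [24, 8])))
```

The argument list can be passed to subprocess.run as-is if you want to launch llama-cli from a script.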

What you can run with 48GB total:

Llama 3.1 70B at Q4_K_M: fits (~42GB of weights), leaving roughly 6GB for context
Llama 3.1 70B at Q6_K: does not fit fully in VRAM (~58GB of weights); needs a third card or partial CPU offload
Qwen 2.5 72B at Q4: fits
405B models: out of reach at 48GB; even Q2 quantizations run well over 100GB
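
For a quick sanity check before downloading a model, weight size can be estimated from parameter count and bits per weight. The sketch below is an approximation under stated assumptions: the bits-per-weight values are rough averages for llama.cpp quant types and vary by model, and the 4GB reserved for context is a placeholder you should adjust. Actual GGUF file sizes are the authoritative number.

```python
# Approximate average bits per weight; real files vary by architecture.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         context_reserve_gb: float = 4.0) -> bool:
    """True if weights plus an assumed context reserve fit in total VRAM."""
    return weights_gb(params_billion, quant) + context_reserve_gb <= vram_gb


# Llama 3.1 70B has about 70.6 billion parameters.
for quant in APPROX_BITS_PER_WEIGHT:
    size = weights_gb(70.6, quant)
    verdict = "fits" if fits(70.6, quant, 48) else "does not fit"
    print(f"70B {quant}: ~{size:.0f} GB, {verdict} in 48 GB")
```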

Practical Setup: Dual RTX 4090

The RTX 4090 does not support NVLink. This is a deliberate product segmentation decision by NVIDIA. Dual 4090 setups run over PCIe only.

Despite that, dual 4090 setups are still used because the 4090 delivers meaningfully more compute than the 3090 for the same 24GB of VRAM. Its AD102 GPU (Ada Lovelace architecture) is ~40% faster at inference than the GA102 (Ampere) in the 3090.

The tradeoff:

Dual 4090 (PCIe only): 48GB combined, faster per-layer compute, higher inter-GPU latency
Dual 3090 (NVLink available): 48GB combined, slower per-layer compute, lower inter-GPU latency with NVLink bridge

For most local inference use cases, the dual 4090 PCIe setup is equal or slightly faster in practice, because the faster compute of the 4090 compensates for the PCIe overhead.

Cost: Two RTX 4090s at ~$1,400–1,700 used each = $2,800–3,400 just for GPUs. This is a $5,000+ total build.

Real Performance Expectations

Running Llama 3.1 70B Q4_K_M on dual RTX 3090 (NVLink):

Roughly 15–25 tokens/second depending on context length
Context lengths above 8k slow things down significantly
Prompt processing (prefill) benefits more from multi-GPU than generation

For comparison, a single RTX 4090 running a 34B Q4 model:

Roughly 30–45 tokens/second
Smaller model, but much faster and simpler setup

The honest takeaway: multi-GPU gets you access to bigger models, not necessarily faster inference on models that already fit. If speed is the priority and your model fits on 24GB, stick to one fast GPU.
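
The ranges above line up with a simple memory-bandwidth estimate: to generate one token, the weights have to be read from VRAM once, so bandwidth divided by weight size gives a rough ceiling. The sketch assumes a layer split, where the GPUs take turns and their bandwidths do not add up. The bandwidth figures are spec-sheet values (936 GB/s for the RTX 3090, 1008 GB/s for the RTX 4090), and the weight sizes are approximations. Real throughput lands below the ceiling.

```python
def generation_ceiling_tps(weights_gb: float, mem_bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/second if every token reads all weights once."""
    return mem_bandwidth_gb_s / weights_gb


# Llama 3.1 70B Q4_K_M (~42.5 GB) split by layer across two RTX 3090s.
dual_3090_70b = generation_ceiling_tps(42.5, 936.0)

# A 34B Q4 model (~20.6 GB, approximate) on a single RTX 4090.
single_4090_34b = generation_ceiling_tps(20.6, 1008.0)

print(f"70B on dual 3090:   ceiling ~{dual_3090_70b:.0f} tokens/s")
print(f"34B on single 4090: ceiling ~{single_4090_34b:.0f} tokens/s")
```

The second GPU raises how much model you can hold, not how fast each token's weights can be read, which is the same point the numbers above make.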

Common Mistakes

Buying a motherboard with x16/x16 physical slots that only run x8/x8 electrically — check the spec sheet carefully
Forgetting to upgrade the PSU when adding the second GPU, leaving it undersized
Using --tensor-split incorrectly and having all layers load onto one GPU anyway
Not buying the NVLink bridge with dual 3090 setups (it's $30–50 and meaningfully improves bandwidth)