CraftRigs
Architecture Guide

llama.cpp --tensor-split: Running 70B Models Across Multiple GPUs

By Charlotte Stewart · 7 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Here's the thing about running large models on two GPUs: it sounds simple until you actually try it. You've got 48GB of VRAM (dual RTX 3090), the 70B model is 42.5GB, it should just work—but if you're splitting wrong, you'll get bottlenecks, OOM errors on the smaller card, or worse, no speed gain at all because the GPUs are waiting on each other. This guide fixes that.

What --tensor-split Actually Does

llama.cpp's --tensor-split flag distributes the model's transformer layers across multiple GPUs in proportion to the values you give it. Think of it like this: the model has ~80 layers. With --tensor-split 24,16 (a 3:2 ratio), GPU 0 holds the first 48 layers and GPU 1 holds the last 32. During inference, llama.cpp runs each layer on whichever GPU holds it. Simple in theory; the real trick is getting the proportions right so neither card runs out of memory while the other sits half-empty.

Important: --tensor-split does not split individual transformer layers across GPUs. Each layer lives whole on one card. This is different from tensor parallelism in frameworks like vLLM or DeepSpeed, which splits the weight matrices inside each layer. Layer splitting is simpler, which is why llama.cpp supports it natively—but it also means the bottleneck is whichever GPU finishes its work last.

How to Calculate Your Split Ratio

Forget about abstract ratios like --tensor-split 3,1. Use actual VRAM.

The formula is dead simple:

--tensor-split <GPU0_VRAM>,<GPU1_VRAM>

Examples:

  • Dual RTX 3090 (24GB + 24GB): --tensor-split 24,24
  • RTX 4090 + RTX 3090 (24GB + 24GB): --tensor-split 24,24
  • RTX 3090 + RTX 4080 Super (24GB + 16GB): --tensor-split 24,16
  • RTX 3090 + RTX 3060 (24GB + 12GB): --tensor-split 22,10 (conservative; leaves 2GB headroom)
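If you'd rather derive the split string than eyeball it, here's a minimal sketch. The hardcoded values stand in for live output; on a real machine you'd query nvidia-smi instead (it reports MiB rather than GB, which works just as well, since only the ratio matters):

```shell
# Build a --tensor-split argument from per-GPU VRAM totals, one value per line.
# Live version (reports MiB; fine, because only the ratio matters):
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
vram="24
16"                                          # example values in GB (assumed setup)
split=$(printf '%s\n' "$vram" | paste -sd, -)
echo "--tensor-split $split"                 # prints: --tensor-split 24,16
```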

GPU 0 is the first card in your nvidia-smi output. If you're unsure which is which, run:

nvidia-smi -L

It'll list your GPUs in order; GPU 0 is the first entry.

Why VRAM amounts, not abstract ratios?

Because llama.cpp divides layers proportionally. If you have 24GB and 16GB, that's a 3:2 ratio. You want 60% of the layers on the larger card, 40% on the smaller. Using --tensor-split 24,16 gets you there. If you used --tensor-split 3,1 instead (which you'll see in some blogs), you'd pile 75% of layers on one card, which is too much. The numbers you use directly correspond to how llama.cpp distributes the load.
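To make the proportionality concrete, here's the layer arithmetic for an 80-layer model (a back-of-the-envelope sketch; llama.cpp does the real assignment internally and may round differently):

```shell
# Estimate how many of an 80-layer model's layers each split value claims.
layers=80
g0=24; g1=16                                    # i.e. --tensor-split 24,16
total=$((g0 + g1))
echo "GPU 0: $((layers * g0 / total)) layers"   # prints: GPU 0: 48 layers
echo "GPU 1: $((layers * g1 / total)) layers"   # prints: GPU 1: 32 layers
```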

Mixed VRAM Setups and Headroom

Always budget roughly 2GB of headroom per card for activation memory and KV cache overhead. One caveat the numbers hide: the split values are proportions, not hard caps, so on identical cards --tensor-split 22,22 is exactly the same split as --tensor-split 24,24. On equal-VRAM setups, get headroom by trimming context length or stepping down a quantization. On mixed-VRAM setups, lowering the smaller card's number relative to the larger one genuinely shifts layers off the constrained card.

For tighter setups:

  • 24GB + 16GB: --tensor-split 22,14 shifts a few extra layers onto the larger card; --tensor-split 24,16 if you're comfortable pushing the smaller one
  • 24GB + 12GB: --tensor-split 22,10 (biases load away from the 12GB card)
  • 3x 24GB: --tensor-split 24,24,24 (identical cards, so an equal split; manage headroom with -c)
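To see what those tighter numbers actually do, compare the share of layers each split sends to the smaller card (a quick sketch using integer percentages):

```shell
# Percentage of layers the second (smaller) card receives under each split.
for split in "24 16" "22 14"; do
  set -- $split
  echo "$1,$2 puts $(( $2 * 100 / ($1 + $2) ))% of layers on GPU 1"
done
```

This prints 40% for 24,16 and 38% for 22,14: the "tighter" numbers are really a slightly more conservative ratio.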

Full Commands for Common GPU Combos

Here are production-ready commands for the most common setups. These use --n-gpu-layers 99 to fully offload the model to VRAM and -c 4096 for a reasonable context length.

Dual RTX 3090 (48GB total)

./llama-cli -m Llama-3.1-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 24,24 \
  -c 4096 \
  -t 6 \
  -b 512

Fits Llama 3.1 70B Q4_K_M (42.5GB) with room for KV cache. You'll hit around 8–10 tokens/second on token generation with this combo.

RTX 4090 + RTX 3090 (48GB total)

./llama-cli -m Llama-3.1-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 24,24 \
  -c 4096 \
  -t 8

Same as dual 3090. Both cards have identical VRAM, so equal split is optimal.

RTX 3090 + RTX 4080 Super (40GB total)

./llama-cli -m Llama-3.1-70B-Instruct-Q3_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 23,16 \
  -c 2048 \
  -t 8

At 40GB total, the 42.5GB Q4_K_M file doesn't fully fit, so this combo steps down to Q3_K_M (roughly 34GB). The 4080 Super is slightly faster per GPU, but with only 16GB it's the constraint either way; keep context at 2048, and go back to Q4_K_M only if you also drop --n-gpu-layers to spill some layers to the CPU.

Running Mixtral 8x7B on Dual RTX 3090

./llama-cli -m Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 24,24 \
  -c 8192 \
  -t 8 \
  -b 1024

Mixtral 8x7B has ~47B parameters, but the Q4_K_M file is only about 26GB, so it fits on dual 3090s with plenty of room for KV cache. You can push context to 8192 and get strong performance.

Real-World Performance: What You Actually Gain

Here's where the blogs lie to you. Adding a second GPU does not double your speed.

Running Llama 3.1 70B Q4_K_M on:

  • Single RTX 3090: ~5.5 tokens/sec (token generation phase)
  • Dual RTX 3090 with tensor-split: ~8.5–9 tokens/sec

That's a ~55% improvement, not 100%. Part of that gain comes from the baseline: a single 3090 can't hold the full 42.5GB model, so some layers spill to system RAM, and the second GPU eliminates that spill. The real win is access to bigger models. If your model already fits on one card, the speed gain from layer splitting is modest, because the GPUs process layers in sequence rather than in parallel.

Warning

Don't add a second GPU expecting dramatic speed boosts on models that already fit on one card. The real value is running 70B and 100B models that wouldn't fit at all otherwise.

PCIe Slot Configuration Matters (a little)

If both GPUs are in x16/x16 slots, you get the maximum bandwidth. If one is x16 and the other is x4 (which happens on some motherboards), you'll see a small penalty—maybe 2–4% slower—but it won't kill you.

# Check your PCIe link width per GPU
nvidia-smi --query-gpu=index,pcie.link.width.current,pcie.link.width.max --format=csv
# and visually confirm GPU placement in nvidia-smi

If your RTX 3090s have NVLink bridges installed, inter-GPU communication is dramatically faster. Real-world benchmarks show 50% improvement with NVLink on dual 3090s versus PCIe alone. But NVLink bridges are rare and expensive; most builders use PCIe.

Troubleshooting Common Errors

"CUDA error: out of memory on device 1"

The smaller GPU ran out of VRAM. Reduce its allocation in --tensor-split.

# If this fails:
--tensor-split 24,16

# Try this:
--tensor-split 23,15

# Or shift more to the larger card:
--tensor-split 26,14

"Model loads on GPU 0 only"

Check if CUDA_VISIBLE_DEVICES is set:

echo $CUDA_VISIBLE_DEVICES

# If it shows "0", unset it
unset CUDA_VISIBLE_DEVICES

# Then try again
./llama-cli -m model.gguf --tensor-split 24,24 ...

"Zero speed improvement over single GPU"

Run nvidia-smi and confirm both GPUs are under load:

# In another terminal, watch GPU usage
watch -n 1 nvidia-smi

If GPU 1 stays at 0% memory, you've got a device detection issue. Check:

  1. Are both GPUs properly seated?
  2. Run nvidia-smi — do you see both devices listed?
  3. Check your motherboard BIOS — PCIe bifurcation might need enabling for x4 slots

"Model OOMs on both GPUs"

Llama 3.1 70B Q4_K_M needs at least 44GB for comfortable use (42.5GB model + 1–2GB KV cache + overhead). If you're running 70B models on 48GB total, you're at the edge. Try:

  1. Reduce context length: -c 2048 instead of -c 4096
  2. Use a smaller quantization: Q3_K_M instead of Q4_K_M
  3. Offload a few layers to system RAM: set --n-gpu-layers below the full count (e.g. --n-gpu-layers 70) so the rest run on the CPU (slower, but fits)
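As a sketch of the partial-offload fallback, you can build the command with a reduced layer count and inspect it before running (the 70-layer figure is an assumed starting point, not a tested value):

```shell
# Partial offload: keep most layers on the GPUs, let the remainder run on CPU.
ngl=70   # assumed starting point for an ~80-layer model; lower it if you still OOM
cmd="./llama-cli -m Llama-3.1-70B-Instruct-Q4_K_M.gguf --n-gpu-layers $ngl --tensor-split 24,24 -c 2048"
echo "$cmd"
```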

ROCm (AMD) Issues

ROCm's --tensor-split support is unstable on consumer RX cards as of March 2026. It works on AMD Instinct cards but not reliably on RX 7900 XTX or similar consumer GPUs.

If you're on AMD, use:

  • vLLM for multi-GPU inference (tensor parallel, much faster)
  • ExLlamaV2 for single-GPU (fast, but limited to one card)

FAQ

Do I need CPU offload with --tensor-split?

No, they're separate strategies. CPU offload (setting --n-gpu-layers below the model's layer count) spills layers to system RAM, which is much slower. --tensor-split spreads layers across GPUs, which is much faster. Reach for CPU offload only if the model still doesn't fit across your GPUs.

Can I use --tensor-split with three or more GPUs?

Yes. Just extend the list: --tensor-split 24,24,24 for three 24GB cards. The same VRAM ratio principle applies.

Does --tensor-split work with older llama.cpp versions?

It's been in llama.cpp for years, but the implementation has improved. Use a build from 2025 or later. Check your version:

./llama-cli --version

If it's older than March 2025, rebuild from source or update your binary.

Is tensor-split the same as pipeline parallelism?

Close, but the terms get muddled. llama.cpp's default split is layer-based (each whole layer on one GPU), which is what pipeline parallelism also does at the stage level. Tensor parallelism (the default in vLLM) is the different one: it splits the weight matrices inside each layer across GPUs so the cards compute simultaneously. llama.cpp's layer-based approach is simpler, but it means the GPUs mostly take turns.

Should I use --tensor-split or vLLM for my 70B model?

For pure speed on 70B+ models with professional requirements (batched requests, long context), vLLM's tensor parallelism is faster. For single-user, interactive use (like running a chat interface), llama.cpp with --tensor-split is simpler and totally adequate.

The Actual Setup: Step by Step

  1. Check your GPUs:

    nvidia-smi

    Note the VRAM for GPU 0 and GPU 1.

  2. Decide your split ratio using the VRAM amounts from step 1. Write it down.

  3. Download your model as a GGUF file, e.g. Llama 3.1 70B Q4_K_M from bartowski on Hugging Face.

  4. Run the command with your split ratio:

    ./llama-cli -m model.gguf --tensor-split <GPU0>,<GPU1> --n-gpu-layers 99 -c 4096
  5. Monitor both GPUs while it loads:

    watch -n 1 nvidia-smi

    Both should show increasing memory usage. If only one GPU gets memory, something's wrong.

  6. Start a conversation and watch token generation speed. You should hit 7–10 tokens/sec on Llama 3.1 70B with dual 3090s.

The Bottom Line

--tensor-split is how you run 70B and larger models on dual-GPU rigs. The key is using actual VRAM amounts, not abstract ratios. Set --tensor-split to match your cards' VRAM, pair it with --n-gpu-layers 99, and you've got a working multi-GPU setup. You won't get double the speed, but you'll get access to bigger models—which is the real point of multi-GPU anyway.

For more on multi-GPU setups, read our multi-GPU inference guide and the llama.cpp advanced guide for the full flag reference.

llama-cpp multi-gpu inference local-llm cuda
