Tensor Parallelism — Local AI Glossary | CraftRigs

Tensor parallelism is the technique that lets a model too large for any single GPU's VRAM run across several GPUs as if they were one device. Each layer's weight matrices are partitioned across the GPUs, every GPU computes its slice of every layer in lockstep, and the partial results are summed across cards over the interconnect (an all-reduce) after each layer.
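The partition-compute-sum cycle can be sketched in plain Python. This is a toy illustration, not how any runtime implements it: the "GPUs" are list slices, the names matvec, shard_rows, and all_reduce_sum are made up for the example, and the split assumes the input dimension divides evenly by the GPU count.

```python
def matvec(W, x):
    """Compute y = x @ W for x of length d_in and W of shape d_in x d_out."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]


def shard_rows(W, x, num_gpus):
    """Give each simulated GPU a block of W's rows and the matching slice of x."""
    step = len(W) // num_gpus  # assumes len(W) is divisible by num_gpus
    return [(W[g * step:(g + 1) * step], x[g * step:(g + 1) * step])
            for g in range(num_gpus)]


def all_reduce_sum(partials):
    """Element-wise sum of the per-GPU partial outputs (the interconnect step)."""
    return [sum(vals) for vals in zip(*partials)]


W = [[1, 2], [3, 4], [5, 6], [7, 8]]  # one layer's weights, 4 inputs x 2 outputs
x = [1, 0, 2, 1]                      # activations entering the layer

full = matvec(W, x)                                            # single-device result
partials = [matvec(Ws, xs) for Ws, xs in shard_rows(W, x, 2)]  # one partial per GPU
combined = all_reduce_sum(partials)                            # summed across cards

assert combined == full
```

Each simulated GPU stores only half of W, which is where the memory saving comes from; the price is the sum across cards after every layer.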

When You Need It

Tensor parallelism is the typical answer to "I want to run a 70B-class model and a single card doesn't have enough VRAM." A 70B Q4 model needs ~40 GB; two RTX 3090 24 GB cards in tensor-parallel mode give the model ~48 GB to work with, which is enough to hold it with headroom for context.
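The arithmetic behind those figures can be checked with a short sketch. The 4.5 bits per weight is an assumption standing in for a Q4-class quantization (4-bit values plus scaling metadata), and the estimate ignores runtime overhead such as activations and CUDA context, so treat the result as a rough bound rather than a measurement.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in decimal GB."""
    return params_billion * bits_per_weight / 8


def headroom_gb(params_billion, bits_per_weight, num_gpus, vram_gb_each):
    """VRAM left over for KV cache and runtime buffers after loading weights."""
    return num_gpus * vram_gb_each - weight_gb(params_billion, bits_per_weight)


# Assumed effective size of a Q4-class quantization; real formats vary.
Q4_BITS = 4.5

weights = weight_gb(70, Q4_BITS)        # total weight footprint
per_card = weights / 2                  # each card's share under a 2-way split
spare = headroom_gb(70, Q4_BITS, 2, 24)

print(f"weights: {weights:.1f} GB, per card: {per_card:.1f} GB, spare: {spare:.1f} GB")
```

The spare figure is what remains for context; longer contexts consume it quickly, so the headroom depends on how much KV cache the workload needs.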

It is not the only option. The alternatives:

Pipeline parallelism: different layers live on different GPUs, and tokens flow through them sequentially. Less interconnect traffic, but GPUs sit idle waiting on each other at small batch sizes.
CPU offload: some layers live in system RAM. Cheap, but it slows decoding by 5–10× because system memory is far slower than VRAM.
A bigger single GPU: RTX 5090 32 GB, RTX Pro 6000 48 GB, or used A6000 48 GB. Higher cost, simpler setup.
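To make the contrast with pipeline parallelism concrete, here is a minimal sketch of how whole layers might be assigned to GPUs in that scheme. The function name assign_layers and the 80-layer count are illustrative assumptions, not any runtime's API or configuration.

```python
def assign_layers(num_layers, num_gpus):
    """Split layer indices into contiguous stages, one stage per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # spread any remainder evenly
        stages.append(range(start, start + size))
        start += size
    return stages


# Illustrative: an 80-layer model on two GPUs.
stages = assign_layers(80, 2)
for gpu, layers in enumerate(stages):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")
```

Each GPU holds complete layers here, so a token must finish stage 0 before stage 1 can start. Tensor parallelism instead keeps a slice of every layer on every GPU.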

Real-World Scaling

Tensor parallelism does not scale linearly. Two 24 GB cards do not give 2× the throughput of one — interconnect overhead and synchronization eat 15–25% on consumer PCIe Gen 4 platforms. Realistic dual-GPU scaling is 1.6–1.8× single-card throughput at the same model size, and the benefit is mostly that the model now fits at all, not that decoding got faster.
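One way to read the scaling figures is as parallel efficiency: the measured speedup divided by the ideal speedup of one per GPU. The helper below is a hypothetical convenience for that arithmetic, applied to the 1.6–1.8× range quoted above.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_gpus


low = parallel_efficiency(1.6, 2)   # bottom of the quoted dual-GPU range
high = parallel_efficiency(1.8, 2)  # top of the quoted dual-GPU range

print(f"dual-GPU efficiency: {low:.0%} to {high:.0%}")
```

The same function can be applied to any measured speedup, which makes it easier to compare setups with different GPU counts.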

Runtime Support

vLLM has the cleanest tensor-parallel implementation for consumer hardware: set --tensor-parallel-size 2 and it splits the model across two visible CUDA devices. llama.cpp splits a model across GPUs with --tensor-split, which sets each card's share of the model; its default split is by layer, and --split-mode row splits individual tensors across cards instead, but the interconnect efficiency is lower. ExLlamaV2 supports tensor parallelism on Linux for users with NVLink-capable cards.
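For vLLM's Python API, the equivalent of the command-line flag is the tensor_parallel_size argument to the LLM class. The sketch below assumes vLLM is installed and two CUDA devices are visible when load_engine is called; the model name is a placeholder, and the helper names are hypothetical.

```python
def tensor_parallel_kwargs(model_name, num_gpus):
    """Keyword arguments for vllm.LLM; tensor_parallel_size mirrors the CLI flag."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {"model": model_name, "tensor_parallel_size": num_gpus}


def load_engine(model_name, num_gpus=2):
    """Create a vLLM engine sharded across num_gpus devices."""
    # Imported lazily so this file can be read without vLLM or GPUs present.
    from vllm import LLM
    return LLM(**tensor_parallel_kwargs(model_name, num_gpus))


# Placeholder model identifier; substitute a real model path or repository name.
kwargs = tensor_parallel_kwargs("your-org/your-70b-model", 2)
print(kwargs)
```

Calling load_engine with the same arguments would start the sharded engine. Which devices count as visible is controlled by the CUDA_VISIBLE_DEVICES environment variable.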

NVLink between two RTX 3090s narrows the scaling gap meaningfully (some setups see 1.85–1.9×) compared with PCIe-only configurations.