Does PCIe x8 vs x16 affect local LLM inference speed?

Almost never for inference. Once the model is loaded into VRAM, inference happens entirely inside the GPU at VRAM bandwidth speeds — PCIe doesn't participate. Multiple benchmarks show less than 2% difference between PCIe 4.0 x16 and x8 for GPU inference. The only place the gap shows is model loading from disk, where x4 slots become noticeably slow.

What PCIe configuration do I need for a dual-GPU LLM build?

x8/x8 electrical is sufficient for two GPUs in a local LLM rig. When each card serves its own requests, traffic between cards during inference is minimal, and model loading does not saturate x8. If both GPUs share a single model through tensor parallelism, expect some overhead at x8/x8 compared with x16/x16. Avoid x4 on either slot, which creates a loading bottleneck.

When do I need Threadripper for local LLM PCIe lane counts?

For three or more GPUs, or when you want guaranteed x16/x16 on two GPUs. Consumer platforms like AMD X670E and Intel Z890 provide 24–28 CPU lanes — enough for x8/x8 with two GPUs. Going beyond two GPUs typically means lanes get split below x8, which starts to affect model load performance. Threadripper offers 88–160 PCIe lanes depending on the model.

Does PCIe generation (4.0 vs 5.0) matter for local LLMs?

Not meaningfully for inference. PCIe 5.0 x8 provides the same bandwidth as PCIe 4.0 x16 (about 32 GB/s in each direction), provided both the motherboard and the GPU support PCIe 5.0. For model loading, PCIe 5.0 reduces load times slightly, but the difference is seconds on a 40 GB model, and that is a one-time cost per session.

PCIe Lanes for Local LLM Builds: When It Actually Matters

TL;DR: For single-GPU inference where your model fits in VRAM, PCIe bandwidth doesn't matter: run x16 or x8 and you'll see less than a 2% difference in tokens per second. For dual-GPU builds, x8/x8 is fine. Avoid x4 on any GPU you care about, because it creates a real bottleneck during model loading and layer transfers. For three or more GPUs, you need a high-end desktop (HEDT) platform such as Threadripper.

Why People Get Confused About This

The confusion is understandable. PCIe x16 sounds twice as good as x8, and four times better than x4. Marketing materials for high-end motherboards emphasize lane counts. GPU spec sheets list "PCIe 4.0 x16" as a feature. So people assume more lanes = faster AI.

It's mostly wrong — but not entirely.

PCIe Bandwidth vs VRAM Bandwidth

Here's the number that puts it in context.

PCIe 4.0 x16 has a theoretical bandwidth of about 32 GB/s in each direction. PCIe 4.0 x8 is 16 GB/s. PCIe 4.0 x4 is 8 GB/s.

An RTX 4090's VRAM bandwidth is 1,008 GB/s. An RTX 3090 is 936 GB/s. Even a mid-range RTX 4070 does 504 GB/s.
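As a rough sketch of where these figures come from, the snippet below derives theoretical per-direction PCIe bandwidth from the per-lane signalling rate and 128b/130b encoding, then compares it with the VRAM numbers quoted above. The names `pcie_gb_per_s`, `GT_PER_LANE`, and `VRAM_GB_PER_S` are illustrative, and protocol overhead is ignored, so real transfers come in somewhat lower.

```python
# Per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}

# PCIe 3.0 through 5.0 use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gb_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_LANE[gen] * lanes * ENCODING_EFFICIENCY / 8


# VRAM bandwidth figures quoted in the article, in GB/s.
VRAM_GB_PER_S = {"RTX 4090": 1008, "RTX 3090": 936, "RTX 4070": 504}

if __name__ == "__main__":
    for lanes in (16, 8, 4):
        link = pcie_gb_per_s(4, lanes)
        print(f"PCIe 4.0 x{lanes}: {link:.1f} GB/s")
        for gpu, vram in VRAM_GB_PER_S.items():
            print(f"  {gpu} VRAM is roughly {vram / link:.0f}x faster")
```

Even the widest PCIe 4.0 link is more than an order of magnitude slower than the VRAM of any card listed, which is the gap the rest of this section relies on.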

The moment a model is loaded into VRAM and you're running inference, all the math is happening inside the GPU at VRAM speeds. PCIe doesn't participate. Your tokens per second are determined entirely by VRAM bandwidth and compute — not by whether you're on x16 or x8.

This is why multiple published benchmarks show less than 2% performance difference between PCIe 4.0 x16 and x8 for GPU inference tasks. The bottleneck isn't the PCIe link.
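One way to see why the link drops out of the picture: during token generation a dense model reads roughly all of its weights from VRAM once per token, so a crude ceiling on decode speed is VRAM bandwidth divided by model size. The sketch below assumes exactly that and ignores compute, KV-cache reads, and batching. The function name and the 20 GB model size are hypothetical.

```python
def decode_ceiling_tokens_per_s(vram_gb_per_s, weights_gb):
    """Upper bound on decode speed if every weight is read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return vram_gb_per_s / weights_gb


if __name__ == "__main__":
    weights_gb = 20  # hypothetical quantized model that fits in VRAM
    # VRAM bandwidth figures quoted in the article, in GB/s.
    for gpu, vram in {"RTX 4090": 1008, "RTX 3090": 936, "RTX 4070": 504}.items():
        ceiling = decode_ceiling_tokens_per_s(vram, weights_gb)
        print(f"{gpu}: at most about {ceiling:.0f} tokens/s")
```

Nothing in this estimate depends on the PCIe link, which matches the small x16 versus x8 differences the benchmarks report.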

Note

PCIe 5.0 x8 matches PCIe 4.0 x16 in bandwidth (32 GB/s). If both your motherboard and your GPU support PCIe 5.0 and the GPU runs at x8 electrically, you have equivalent throughput to a PCIe 4.0 x16 slot. A PCIe 4.0 card such as the RTX 4090 in that same slot still negotiates a 4.0 link and gets 16 GB/s at x8. The "x8" label can be misleading without knowing the PCIe generation.

When PCIe Bandwidth Actually Matters

Model Loading

Loading a 40GB model from NVMe into system RAM and then into GPU VRAM involves the PCIe bus. At PCIe 4.0 x16, that 40GB transfer completes in roughly 2-3 seconds assuming storage isn't the bottleneck. At x8, maybe 4-5 seconds. At x4, 8-10 seconds.

If you're switching between models frequently, x4 starts to feel sluggish. If you load once and leave it, it's irrelevant.
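The load-time figures above can be approximated with a simple division. The sketch below assumes the link sustains about half of its theoretical rate, a value picked only so the results land in the ranges quoted above, and it optionally caps throughput at the storage speed. The `efficiency` default and the 7 GB/s NVMe figure are assumptions, not measurements.

```python
def load_seconds(model_gb, link_gb_per_s, storage_gb_per_s=None, efficiency=0.5):
    """Rough time to push a model through the slowest stage of the load path.

    efficiency is an assumed fraction of theoretical PCIe bandwidth that is
    actually sustained; measure your own system for a real figure.
    """
    effective = link_gb_per_s * efficiency
    if storage_gb_per_s is not None:
        # The drive can be slower than the GPU link.
        effective = min(effective, storage_gb_per_s)
    return model_gb / effective


if __name__ == "__main__":
    for lanes, link in ((16, 32), (8, 16), (4, 8)):
        cached = load_seconds(40, link)
        from_disk = load_seconds(40, link, storage_gb_per_s=7)  # assumed NVMe speed
        print(f"x{lanes}: {cached:.1f} s from RAM, {from_disk:.1f} s from disk")
```

Under these assumptions a cold load from disk is limited by the drive at x16 and x8, and the GPU link only becomes the slowest stage at x4.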

Multi-GPU Inference (NVLink or Tensor Parallelism)

This is where PCIe lane count matters most. When two GPUs are working together on a single model (splitting layers or running tensor parallelism), they communicate constantly. Unless the cards are joined by an NVLink bridge, the PCIe bus carries that inter-GPU traffic.

Going from x16/x16 to x8/x8 shows a meaningful performance drop on collaborative multi-GPU workloads: typically 8-15% depending on the workload. Going to x16/x4 is worse — the x4 GPU becomes a communication bottleneck.

For single-GPU setups, or two GPUs where each handles separate requests (not sharing a single model), x8/x8 is fine.

CPU Offloading with Frequent Layer Swaps

When offloading, llama.cpp shuttles data between CPU RAM and GPU VRAM through the PCIe bus for each forward pass. If you're doing heavy offloading (20+ layers to CPU) with short context windows and fast inference, PCIe bandwidth becomes a small but real factor. Not the dominant bottleneck — that's RAM bandwidth — but a contributing one at x4.

The Practical Lane Math

Single GPU: Any Slot Works

If you have one GPU, it goes in the top x16 slot. It runs at x16 electrically. You never think about this again.

Dual GPU: Check the Electrical Wiring

The spec sheet saying "two x16 slots" doesn't mean both slots deliver x16 bandwidth simultaneously. When two slots are populated, most consumer boards split available CPU lanes:

x16/x16: Both GPUs at full bandwidth. Rare on consumer boards, requires 32+ CPU lanes. Common on HEDT.
x8/x8: Standard split on X670E/Z790 boards. Fine for dual-GPU LLM inference.
x16/x4: Common on budget boards. Avoid for serious dual-GPU builds.
x16/x0: Some boards disable the second slot entirely in certain configurations, for example when specific M.2 slots are populated. The GPU physically fits but receives no PCIe lanes, so it is never detected. Check your manual.

Look for "PCIe bifurcation" in your motherboard manual. The relevant table shows what speed each slot runs at when both are populated.
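Once the cards are installed, the negotiated link can be checked rather than trusted. The sketch below assumes NVIDIA GPUs with `nvidia-smi` on the PATH and uses its `pcie.link.gen.current` and `pcie.link.width.current` query fields. The helper names `parse_links`, `read_links`, and `below_width` are illustrative, and the parser assumes both fields come back as plain integers.

```python
import csv
import io
import subprocess

# Query fields listed by `nvidia-smi --help-query-gpu`.
QUERY = "index,name,pcie.link.gen.current,pcie.link.width.current"


def parse_links(csv_text):
    """Turn nvidia-smi CSV rows into dicts with integer gen and width."""
    links = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, gen, width = (field.strip() for field in row)
        links.append({"index": int(index), "name": name,
                      "gen": int(gen), "width": int(width)})
    return links


def read_links():
    """Run nvidia-smi and return the current link state of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_links(result.stdout)


def below_width(links, expected_width):
    """Return the GPUs whose negotiated width is narrower than expected."""
    return [gpu for gpu in links if gpu["width"] < expected_width]
```

On a dual-GPU board that should run x8/x8, calling `below_width(read_links(), 8)` would list any card that ended up on a narrower link. The reported generation can drop while a GPU idles to save power, so it is worth checking under load.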

Tip

ASUS ProArt X670E Creator reliably delivers x8/x8 with two GPUs. So does the MSI MEG Z790 ACE. Budget boards often don't — verify before buying. Our motherboard guide covers which specific boards handle dual GPU correctly.

Triple or More GPUs: HEDT Only

Consumer AM5 and LGA 1851 platforms have 24-28 CPU PCIe lanes. Split three ways, you get roughly x8/x8/x4 at best, and the x4 GPU becomes the bottleneck for model loading and inter-GPU traffic.

For three or more GPUs at x8/x8/x8 or better, you need AMD Threadripper 9000 series (TRX50/WRX90) with 88 to 160 PCIe lanes depending on the model, or a server platform. The platform cost jumps significantly, but it's the only way to feed multiple GPUs properly.
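The lane arithmetic can be sketched as a small budgeting exercise. The function below is purely illustrative: it hands every GPU x4, then widens links to x8 and x16 in slot order while the budget allows. Real boards are hard-wired to a few fixed splits, so this shows only why the numbers run out, not what a given board will do. The 20-lane budget assumes a 24-lane CPU with 4 lanes reserved for an NVMe drive.

```python
def split_lanes(gpu_lane_budget, gpu_count):
    """Illustrative lane split: start at x4 each, then widen while lanes remain."""
    if gpu_count < 1:
        raise ValueError("need at least one GPU")
    widths = [4] * gpu_count
    used = sum(widths)
    if used > gpu_lane_budget:
        raise ValueError("not enough lanes for x4 on every GPU")
    for target in (8, 16):
        for i in range(gpu_count):
            extra = target - widths[i]
            if used + extra <= gpu_lane_budget:
                widths[i] = target
                used += extra
    return widths


if __name__ == "__main__":
    # Assumed budgets: a consumer CPU after reserving NVMe lanes, and an HEDT CPU.
    for budget, gpus in ((20, 2), (20, 3), (48, 3)):
        split = "/".join(f"x{w}" for w in split_lanes(budget, gpus))
        print(f"{budget} lanes, {gpus} GPUs: {split}")
```

With the assumed consumer budget the third GPU never gets past x4, while a larger HEDT budget leaves room for x16 on every card.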

Chipset Lanes vs CPU Lanes

One clarification that trips people up: modern motherboards have two sources of PCIe lanes.

CPU lanes: direct connection to the processor. Low latency, full bandwidth. This is what your GPU slots should use.

Chipset lanes: connected through the chipset, which acts as a PCIe switch on the motherboard. Higher latency, often a slower PCIe generation than the CPU lanes (PCIe 3.0 or 4.0), and every chipset device shares a single uplink to the CPU. The second and third M.2 slots on budget boards often use chipset lanes.

For GPUs, always use CPU-connected slots. Chipset slots technically work for GPUs but add latency and reduce bandwidth. Most boards label CPU-connected slots clearly.

Caution

M.2 NVMe slots that share lanes with your GPU can cause bandwidth conflicts on some boards. Populating certain M.2 slots takes lanes away from the adjacent GPU slot, for example dropping it from x16 to x8. Check your motherboard manual's "PCIe sharing" section before installing NVMe drives alongside a GPU.

The Decision Framework

One GPU, model fits in VRAM: PCIe doesn't matter. Use the top CPU-connected x16 slot and move on.

One GPU + heavy CPU offloading: PCIe still doesn't matter much. RAM bandwidth is your bottleneck, not PCIe.

Two GPUs, running separate requests: x8/x8 is fine. Make sure your board actually delivers x8/x8 (not x16/x4) before buying.

Two GPUs sharing a model (tensor parallelism): x8/x8 works but you'll see some overhead. x16/x16 is meaningful here. Consider HEDT if you're serious about this workload.

Three or more GPUs: HEDT. No way around it.
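The framework above is small enough to write down as a lookup. The sketch below simply encodes those five cases. The function name `pcie_recommendation` and its parameters are hypothetical, and the returned strings paraphrase the guidance in this article rather than any benchmark.

```python
def pcie_recommendation(gpu_count, shared_model=False, cpu_offload=False):
    """Map a build description onto the article's PCIe guidance."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    if gpu_count == 1:
        if cpu_offload:
            return "PCIe barely matters; RAM bandwidth is the bottleneck."
        return "PCIe does not matter; use a CPU-connected x16 slot."
    if gpu_count == 2:
        if shared_model:
            return "x8/x8 works with some overhead; x16/x16 on HEDT helps."
        return "x8/x8 is fine; confirm the board is not x16/x4."
    return "HEDT platform required."


if __name__ == "__main__":
    print(pcie_recommendation(1))
    print(pcie_recommendation(1, cpu_offload=True))
    print(pcie_recommendation(2))
    print(pcie_recommendation(2, shared_model=True))
    print(pcie_recommendation(3))
```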