How slow is CPU+GPU hybrid inference compared to full GPU inference?

The CPU layers are the bottleneck, not the GPU layers. A 70B model with 20 layers on GPU and 60 layers on CPU will generate at roughly 1–3 tokens/second. The same model fully on GPU (48GB VRAM) generates at 15–25 t/s. Hybrid inference is usable for non-interactive batch tasks but frustrating for real-time conversation.

How much system RAM do I need for hybrid inference?

CPU layers live in system RAM. For a 70B Q4_K_M model (approximately 43GB total), if 20GB sits in GPU VRAM, the remaining 23GB must fit in RAM. You need system RAM larger than the model size minus your GPU's VRAM. For 70B Q4_K_M: at minimum 32GB RAM with a 24GB GPU; ideally 64GB RAM for comfortable operation.

Does --n-gpu-layers=99 always mean all layers go to GPU?

Yes. Setting --n-gpu-layers to 99 or any number larger than the model's actual layer count forces all layers to GPU. If the model has 80 layers and you set --n-gpu-layers 99, all 80 go to GPU. If VRAM is insufficient, llama.cpp will error rather than silently overflow to RAM.

Can I use hybrid inference with Ollama instead of llama.cpp directly?

Yes. Ollama is built on llama.cpp and passes through the --n-gpu-layers parameter. Set OLLAMA_NUM_GPU_LAYERS or configure it in your Modelfile. However, direct llama.cpp gives you more fine-grained control over layer splits and lets you benchmark different configurations quickly.

What is the minimum GPU VRAM worth using for hybrid inference on a 70B model?

8GB VRAM is the practical floor — at that tier you fit roughly 5-8 layers out of 80 on the GPU, yielding only modest improvement over pure CPU inference. 16GB is where hybrid inference becomes meaningfully faster than CPU-only (35 layers on GPU, ~2-3.5 t/s). 24GB gets you ~55 layers on GPU and ~8-12 t/s — approaching interactive usability for non-real-time tasks. Below 8GB, pure CPU inference is generally comparable and simpler.

Does CPU speed matter much for hybrid inference throughput?

Yes, significantly. CPU layers are the bottleneck, and their compute speed determines how fast those layers process. A Ryzen 9 7950X with AVX-512 support runs CPU layers meaningfully faster than a budget Ryzen 5 5600X. More CPU cores also help — llama.cpp uses the --threads parameter to parallelize CPU layer computation. For hybrid inference on 70B models, a high-core-count CPU (12+ cores) with modern AVX support improves throughput more than the same investment in other components.

llama.cpp CPU+GPU Hybrid Inference: Run 70B on Any VRAM [2026]

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

A 70B model needs roughly 43GB of VRAM at Q4_K_M. A single RTX 4090 has 24GB. The math does not work — unless you split the model across GPU and CPU.

llama.cpp has supported layer offloading since 2023, and it is more useful than most people realize. You cannot get fast inference this way — CPU compute is an order of magnitude slower than GPU for matrix operations. But you can run models that would otherwise require hardware you do not own, at speeds that are tolerable for specific use cases.

This guide explains exactly how layer offloading works, what speeds to expect, how to configure it, and how to get the best performance from the hardware you have.

Quick Summary

--n-gpu-layers N: N=0 is full CPU, N=99 forces all layers to GPU, any value between splits proportionally
Speed penalty: Hybrid inference runs at 1–5 t/s for large models — 5–15x slower than full GPU inference
RAM requirement: CPU layers live in system RAM — 70B Q4_K_M needs up to 43GB total; subtract your GPU VRAM to get RAM requirement

How Layer Offloading Actually Works

A transformer model consists of a stack of layers — typically 32 layers for 7B models, 80 layers for 70B models. Each layer is a self-contained set of weight matrices (attention, MLP, normalization).

llama.cpp's layer offloading assigns each layer to either GPU VRAM or CPU RAM. During inference:

For each layer assigned to GPU: the computation happens in CUDA/ROCm at full GPU speed
For each layer assigned to CPU: the weights are read from RAM and computed using CPU SIMD instructions (AVX2/AVX-512)
Between GPU and CPU layers, data transfers over the PCIe bus

The critical point: inference is sequential through layers. Token generation requires passing through all layers in order, which means the overall speed is dominated by the slowest hardware in the chain — the CPU.

Why This Creates a Throughput Cliff

GPU tokens/second on a 70B model: ~15–25 t/s (fully in 48GB VRAM) CPU tokens/second on a 70B model: ~0.5–1.5 t/s (fully in RAM on a modern Ryzen/Intel)

With 25% of layers on GPU and 75% on CPU, the bottleneck is still the CPU sections. You do not get 25% of GPU speed — you get something much closer to CPU speed with GPU sections reducing total time modestly.

Useful rule of thumb: if less than 50% of layers fit in VRAM, full CPU inference is often comparable in speed to hybrid inference for smaller 70B quantizations. Hybrid inference becomes worth it when you can get 60%+ of layers on GPU.

The `--n-gpu-layers` Parameter

llama-cli \
  --model ./Llama-3.1-70B.Q4_K_M.gguf \
  --n-gpu-layers 20 \
  --ctx-size 4096 \
  --prompt "Explain transformer attention mechanisms"

`--n-gpu-layers` value	Effect
0	All layers on CPU (pure CPU inference)
1–N	Specific number of layers on GPU, rest on CPU
99 (or any large number)

llama.cpp counts from the bottom layers up. With --n-gpu-layers 20 on a 80-layer model, the first 20 layers go to GPU and layers 21–80 run on CPU.

Checking Layer Count for Your Model

llama-cli --model your-model.gguf --verbose 2>&1 | grep "n_layer"

Common layer counts:

7B/8B models: 32 layers
13B/14B models: 40 layers
34B models: 48 layers
70B/72B models: 80 layers

RAM Requirements for Hybrid Inference

When layers run on CPU, their weights live in system RAM (not VRAM). For a 70B Q4_K_M model (~43GB total):

RAM_needed = total_model_size - (VRAM × 0.9)

The 0.9 factor accounts for VRAM overhead from CUDA context, KV cache, and scratch buffers.

RAM Required (70B Q4_K_M)

38–40 GB system RAM

34–36 GB system RAM

28–30 GB system RAM

16–20 GB system RAM

For the 16GB and 24GB GPU scenarios, 32–64GB of system RAM is the practical requirement. Running hybrid inference with only 16GB system RAM on a 16GB GPU card (total 32GB addressable) leaves almost nothing for the OS.

Speed Benchmarks by Offload Configuration

70B Llama 3.1 Q4_K_M on RTX 4060 Ti 16GB (80 total layers)

Notes

Ryzen 9 7950X, AVX2

Marginal improvement

Noticeable but still slow

Best this card can do 70B Llama 3.1 Q4_K_M on RTX 3090/4090 24GB (80 total layers)

Approx Tokens/sec

1.5–3.0 t/s

4.0–6.0 t/s

8.0–12.0 t/s

The diminishing-returns pattern here is clear: going from 20 to 40 GPU layers roughly doubles speed; going from 40 to 60 roughly doubles again. The majority of speed comes from pushing the majority of layers to GPU.

Optimal Configurations for Common GPU Cards

RTX 4060 Ti 16GB — Running 70B

Maximum safe layer count: approximately 35–38 layers at Q4_K_M before VRAM overflow.

llama-cli \
  --model Llama-3.1-70B-Instruct.Q4_K_M.gguf \
  --n-gpu-layers 35 \
  --ctx-size 2048 \
  --threads 8 \
  --prompt-cache ./cache.bin

RAM requirement: 64GB system RAM recommended. 32GB works but leaves the OS tight.

Practical use: 2–3 t/s. Better suited to batch processing (summarizing documents, generating reports) than real-time chat.

RTX 3090 / RTX 4090 24GB — Running 70B

Maximum safe layer count: approximately 55–60 layers at Q4_K_M.

llama-cli \
  --model Llama-3.1-70B-Instruct.Q4_K_M.gguf \
  --n-gpu-layers 58 \
  --ctx-size 4096 \
  --threads 8 \
  --flash-attn

RAM requirement: 32GB system RAM sufficient; 64GB comfortable.

Expected speed: 8–12 t/s. Usable for interactive chat, though a 7B fully on GPU will feel faster.

Dual RTX 4090 48GB — Running 70B Fully on GPU

With two RTX 4090s in tensor parallel mode via llama.cpp:

llama-cli \
  --model Llama-3.1-70B-Instruct.Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 1,1 \
  --ctx-size 8192

Expected speed: 15–25 t/s at 4K context. This is the consumer ceiling for 70B models.

See our guide on multi-GPU LLM inference setup for tensor split configuration details.

Reducing Context to Fit More Layers

Context length directly affects VRAM usage through the KV cache. Cutting context length in half roughly halves KV cache VRAM, freeing room for more GPU layers.

# Instead of default 2048, use 1024 to free ~0.5-1GB VRAM for an extra few layers
llama-cli \
  --model model.gguf \
  --n-gpu-layers 40 \
  --ctx-size 1024

For applications that do not need long context (summarizing paragraphs, answering questions about short documents), --ctx-size 1024 or 1536 frees meaningful VRAM.

Using mmap for Faster CPU Layer Loading

When CPU layers live in RAM, llama.cpp can use memory-mapped files (mmap) to load model weights without copying them into a separate allocation:

llama-cli \
  --model model.gguf \
  --n-gpu-layers 30 \
  --mmap \
  --ctx-size 4096

--mmap is the default in most builds. If your build has it disabled, enabling it reduces RAM usage and speeds up model loading. It does not improve generation speed.

When Hybrid Inference Is and Is Not Worth It

Use hybrid inference when:

You need to run a 70B model for non-real-time tasks (processing documents, batch generation)
You have 16–24GB VRAM and 64GB+ RAM and cannot afford a second GPU
You are fine with 2–8 t/s throughput
You specifically need 70B-level reasoning quality and lower quantizations of smaller models are not sufficient

Skip hybrid inference and use a smaller model when:

You need interactive conversational speed (>10 t/s)
A 13B or 34B model fully in VRAM meets your quality requirements
You have 24GB VRAM — run 27B at Q8_0 fully on GPU instead of 70B hybrid

For pure performance per dollar, running a 34B model fully on a single RTX 3090 (24GB, ~$750 used) outperforms 70B hybrid inference on the same card in most real-world tasks. See our VRAM calculator to compare options before committing to a hybrid setup. For budget build options that avoid hybrid inference entirely, see our cheap $500 local LLM build and $1,200 build guide.