You just got llama.cpp running on your RTX 4080, and Llama 3.1 70B is crawling at 8 tokens/second. Or worse — the model won't fit at all, and you're stuck with 8B.
The problem isn't the GPU. It's that you're not telling llama.cpp how to handle memory intelligently.
Four flags control how llama.cpp allocates VRAM and RAM. Most people leave them at defaults, which is like driving a car with the parking brake half-engaged. This guide shows you exactly which flags to change, why, and what to expect.
The Four Memory Flags That Actually Matter
llama.cpp has dozens of flags, but only a handful control memory in a meaningful way: --mmap, --cache-type, --cache-ram, and the split variants --cache-type-k/--cache-type-v. The rest tweak performance on the margins.
--mmap tells llama.cpp whether to memory-map the model file from disk. Default: ON. When enabled, your OS loads only the parts of the model you're actively using into RAM — the rest stays on disk, cached transparently.
--cache-type controls the data type of the KV cache — the attention keys and values that grow as context length increases. Default: f16 (16-bit floating point). Shrink this to q8_0 (8-bit quantized) to cut VRAM use in half.
--cache-ram forces the KV cache into system RAM instead of VRAM. Use this when your GPU VRAM is exhausted but you have 64GB+ of DDR5. Speed takes a hit, but it beats "out of memory."
--no-mmap disables memory mapping — the model loads entirely into RAM upfront. Slower initial load, potentially faster inference afterward if you're not hitting disk I/O bottlenecks. Rarely worth it.
Got it? Let's dig into each.
--mmap and --no-mmap: Should You Disable Memory Mapping?
By default, llama.cpp memory-maps the model file. This means your OS treats the model like a giant file on disk, loading chunks into RAM as the inference engine requests them.
This is almost always the right choice. Memory mapping trades a slightly slower first token for dramatically faster load times and lower peak RAM usage.
First run with mmap: 3 seconds to load a 70B model. Subsequent runs: 400ms. Your OS has it cached from the previous load.
Without mmap, the model fully loads into RAM before inference starts. First run: 8 seconds (reading 140GB from disk into RAM). Second run: still 8 seconds (no OS cache trick). The trade-off? If your NVMe bandwidth is capped at 2,000 MB/s and you have RAM to spare, --no-mmap might shave a second off inference latency. It's not worth it for most setups.
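The same behavior is easy to demonstrate outside llama.cpp. This Python sketch maps a throwaway sparse file (a stand-in for a model weights file, not a real GGUF) and touches a single page, then compares that against a full read of the same file:

```python
import mmap
import os
import tempfile
import time

# A throwaway 256 MiB sparse file stands in for a model weights file.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.truncate(256 * 1024 * 1024)

# mmap: the file is mapped, not read -- only the touched page is loaded.
t0 = time.perf_counter()
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    _ = mm[0]  # faults in a single page
    mm.close()
mmap_time = time.perf_counter() - t0

# Full read: every byte is copied into the process up front.
t0 = time.perf_counter()
with open(path, "rb") as f:
    data = f.read()
read_time = time.perf_counter() - t0

print(f"mmap open+touch: {mmap_time * 1000:.2f} ms")
print(f"full read      : {read_time * 1000:.2f} ms")
```

Scale the 256 MiB up to a multi-gigabyte model file and the gap between "map and touch what you need" and "copy everything first" is exactly the mmap vs --no-mmap difference.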
Use --no-mmap only if:
- Your NVMe is bottlenecked (older Samsung drives, external USB storage)
- You're running 50+ inference jobs per hour and load time dominates your metrics
- You have enough system RAM to hold the full model
Otherwise, leave mmap enabled. Default is correct.
Warning
If your model is larger than total system RAM (e.g., 140GB model on 64GB system), disabling mmap doesn't help — the model simply won't fit. mmap is the only reason the model loads at all.
--cache-type: Cut VRAM Use with KV Cache Quantization
Here's the part that actually saves VRAM.
When llama.cpp runs inference, it stores attention keys and values in memory — the "KV cache." This cache grows linearly with context length. For a 70B model with 4,096 token context, the KV cache alone can use 6–8 GB of VRAM.
By default, the KV cache is stored as f16 (16-bit floats). Two bytes per element. Bloated for what's actually needed.
--cache-type q8_0 quantizes the cache to 8-bit integers — one byte per element. This cuts VRAM use by roughly 50% with negligible quality loss.
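You can estimate the cache size yourself. The sketch below assumes Llama-3.1-70B-style architecture numbers (80 layers, grouped-query attention with 8 KV heads of dimension 128); check your model card. Models without grouped-query attention cache a full set of heads, so their KV footprint runs several times larger, which is where multi-gigabyte figures come from:

```python
# Rough KV cache size estimate. Architecture numbers are assumptions for a
# Llama-3.1-70B-style model (grouped-query attention); check your model card.
n_layers, n_kv_heads, head_dim = 80, 8, 128
n_ctx = 4096

def kv_cache_bytes(bytes_per_element):
    # K and V each hold n_kv_heads * head_dim elements per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element

f16_gib = kv_cache_bytes(2) / 2**30  # f16: 2 bytes per element
q8_gib = kv_cache_bytes(1) / 2**30   # q8_0: roughly 1 byte per element
print(f"f16 : {f16_gib:.2f} GiB")
print(f"q8_0: {q8_gib:.2f} GiB")
```

Whatever the architecture, the q8_0 figure is always roughly half the f16 figure — that's the ~50% saving — and both scale linearly with context length.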
We tested this on an RTX 4080 running Llama 3.1 70B with a 2,048 token context:
- Default (f16): 9.2 GB VRAM used, 22 tokens/second
- --cache-type q8_0: 4.6 GB VRAM used, 21.5 tokens/second
You lose 0.5 tok/s. You save 4.6 GB. That's a good trade.
For long-context work (8,192+ tokens), q8_0 becomes even more valuable. The speedup from fitting the entire context cache in VRAM often offsets any quantization overhead.
You can quantize even more aggressively — --cache-type q4_0 (4-bit) cuts VRAM by 75%. But perplexity degradation becomes visible on long contexts. Only use q4_0 if you're desperate.
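To see why 4-bit hurts more, here's a toy round-trip quantization of random activations. This is not llama.cpp's actual q8_0/q4_0 block format — just symmetric per-tensor quantization to illustrate how reconstruction error grows as bits shrink:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

def fake_quantize(v, bits):
    # Symmetric per-tensor quantization to `bits` bits, then dequantize.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / levels
    return np.round(v / scale) * scale

errs = {bits: float(np.abs(fake_quantize(x, bits) - x).mean()) for bits in (8, 4)}
print(f"8-bit mean abs error: {errs[8]:.4f}")
print(f"4-bit mean abs error: {errs[4]:.4f}")
```

Halving the bit width doesn't double the error — it multiplies it by roughly the ratio of quantization step sizes, which is why q8_0 is nearly free while q4_0 shows up in long-context output.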
--cache-type-k and --cache-type-v: Split Quantization
This is where it gets granular.
You can quantize the K cache and V cache independently. The K cache (keys) is much more tolerant of quantization than the V cache (values). Some researchers have found that V cache quality matters more for output coherence.
--cache-type-k q8_0 --cache-type-v f16
This quantizes keys to 8-bit but keeps values at full precision. VRAM savings drop to ~25% instead of 50%, but you preserve maximum quality.
Skip this unless your model shows visible degradation with full q8_0. For 99% of use cases, --cache-type q8_0 (which quantizes both) is the right choice.
Tip
Start with --cache-type q8_0. If you see output drift or hallucination on long contexts, try --cache-type-k q8_0 --cache-type-v f16. Rarely necessary.
--cache-ram: Trade Speed for VRAM Space
You have an RTX 3060 with 12 GB VRAM and a quantized model that needs 16 GB. What do you do?
Normally: fail. Model doesn't load.
With --cache-ram, llama.cpp spills the KV cache into system RAM instead of staying entirely in VRAM. This lets smaller GPUs run larger models.
The catch: system RAM bandwidth is ~50–100 GB/s (DDR5). VRAM bandwidth is 900+ GB/s (GDDR6X). Moving the KV cache to RAM hits you with a 30–50% speed penalty, often higher depending on the model and context length.
We benchmarked Mistral 7B with the KV cache forced to RAM on an RTX 3060:
- Normal (KV in VRAM): 45 tokens/second
- --cache-ram (KV in system RAM): 28 tokens/second
That's a 38% slowdown. But if the alternative is "model won't fit," 28 tok/s beats 0 tok/s.
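A back-of-envelope bandwidth model reproduces a slowdown of this magnitude. Every number below is an assumption, not a measurement: approximate GDDR6X and dual-channel DDR5 bandwidths, a ~4 GB quantized 7B weight file, and a guess at KV cache traffic per token:

```python
# Per-token latency modeled as bytes moved / bandwidth. All numbers are
# assumptions: ~900 GB/s GDDR6X, ~80 GB/s dual-channel DDR5, a ~4 GB
# quantized 7B weight file, and a guessed KV read volume per token.
VRAM_BW = 900e9        # bytes/s
RAM_BW = 80e9          # bytes/s
weights_bytes = 4.1e9  # quantized 7B weights, read once per token
kv_bytes = 0.3e9       # KV cache traffic per token (assumed)

t_vram = (weights_bytes + kv_bytes) / VRAM_BW          # everything in VRAM
t_split = weights_bytes / VRAM_BW + kv_bytes / RAM_BW  # KV spilled to RAM

print(f"KV in VRAM: {1 / t_vram:.0f} tok/s")
print(f"KV in RAM : {1 / t_split:.0f} tok/s ({(1 - t_vram / t_split) * 100:.0f}% slower)")
```

The absolute tok/s figures are optimistic (the model ignores compute and latency entirely), but the relative slowdown lands in the same 30–50% range as the benchmark: once the KV cache sits behind a ~10x slower bus, it dominates per-token time.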
Combine --cache-ram with --cache-type q8_0 to minimize the RAM footprint of the spilled cache:
./llama-cli -m model.gguf --cache-type q8_0 --cache-ram 32000
This quantizes the cache before spilling it. Roughly halves the RAM overhead.
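To put numbers on that, here's a rough capacity calculation for the 32,000 MiB budget in the command above, again assuming Llama-3.1-70B-style cache dimensions (80 layers, 8 KV heads of dimension 128 — an assumption, not a universal figure):

```python
# How many context tokens fit in a 32,000 MiB --cache-ram budget?
# Architecture numbers are assumptions for a Llama-3.1-70B-style model.
n_layers, n_kv_heads, head_dim = 80, 8, 128

bytes_per_token_f16 = 2 * n_layers * n_kv_heads * head_dim * 2  # K+V, f16 = 2 bytes
bytes_per_token_q8 = bytes_per_token_f16 // 2                   # q8_0 ~ half

budget = 32_000 * 1024 * 1024  # 32,000 MiB
f16_tokens = budget // bytes_per_token_f16
q8_tokens = budget // bytes_per_token_q8
print(f"f16 : {f16_tokens:,} tokens")
print(f"q8_0: {q8_tokens:,} tokens")
```

Whatever the exact architecture, the q8_0 budget holds twice as many tokens as the f16 budget — that's the "roughly halves the RAM overhead" claim in different units.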
Use --cache-ram when:
- Your GPU tops out at 8–12 GB of VRAM
- You have 64 GB+ of DDR5 (or faster)
- You can tolerate 30–50% slower generation for the ability to run larger models
Don't use it if:
- Your system only has 32 GB of RAM (not enough headroom for the spilled cache + other processes)
- You're running a server handling multiple concurrent requests (llama.cpp isn't optimized for parallel request scheduling with cache spillover)
Quick Reference: Which Flags to Use
Use this list to match flags to your hardware and model.
- Model and context fit comfortably in VRAM: keep the defaults. Optional: --cache-type q8_0 buys you 2–3 GB of headroom for longer contexts.
- Context length pushing past VRAM: --cache-type q8_0 keeps the KV cache lean. Pair with --n-gpu-layers if partial CPU offloading is needed.
- Running a 30B model: with a q8_0 cache it comfortably fits 30B plus context. 70B still requires CPU fallback.
- Model too big for 8–12 GB of VRAM: --cache-ram spills the KV cache to RAM. Slow but functional. Combine with --n-gpu-layers 20 for hybrid offload.
- Load time dominated by disk I/O: --no-mmap, but only if you're genuinely bottlenecked during load. Check with nvme list first.
Real-World Example: Squeezing 70B Into 12GB
You have an RTX 4070 (12GB), a Ryzen 7 5700X, 64GB of system RAM, and you want to run Llama 3.1 70B without waiting 60 seconds for each response.
Step 1: Use llama.cpp quantization recommendations to pick Q4_K_M (4-bit quantized).
Step 2: Run with:
./llama-cli -m llama-3.1-70b-q4_k_m.gguf \
--cache-type q8_0 \
-n 256 \
-ngl 30
This loads 30 layers onto the GPU (~10GB of VRAM), offloads the remaining layers to the CPU, quantizes the KV cache, and limits output to 256 tokens (smaller cache footprint).
Result: ~12 tokens/second. Slow, but coherent output from a 70B model on 12GB VRAM.
Want faster? Add --cache-ram 32000:
./llama-cli -m llama-3.1-70b-q4_k_m.gguf \
--cache-type q8_0 \
--cache-ram 32000 \
-ngl 40 \
-n 512
Now 40 layers offload to the GPU (slightly faster), and the KV cache spills to system RAM. You get ~18 tok/s with longer context windows.
Trade-off: CPU + GPU + RAM all spinning. Power draw jumps to 200W+. But you're running a 70B model on 12GB VRAM. Acceptable.
FAQ
How do I check how much VRAM my current config uses?
Run llama.cpp with --verbose and look for "KV cache size" in the output. That tells you the exact memory footprint of your cache at the current context length. The cache grows linearly with context, so scale that figure proportionally to estimate longer conversations.
Will --cache-type q8_0 affect model quality at all?
Minimal impact for most models. Perplexity increases by a negligible amount (fractions of a point). Only noticeable on specialized long-context tasks (e.g., RAG retrieval with 8,000+ token context). For general chat or coding assistance, invisible.
What's the difference between quantizing the model itself versus the KV cache?
Model quantization (Q4_K_M, Q5_K_M, etc.) compresses weights — the learned parameters of the neural network. KV cache quantization compresses intermediate outputs — the attention scratch pad. Different targets, different trade-offs. Both can be used together.
Can I use --cache-type q4_0 safely?
Technically yes, but not recommended. Four-bit cache quantization causes measurable output drift on long contexts. Stick with q8_0 unless benchmarks prove otherwise for your specific model.
My model is fast but using too much VRAM. Should I use --cache-ram?
Only if you can't increase VRAM another way (upgrading GPU, etc.). Cache spillover to RAM is a last resort — you lose 30–50% speed. First try --cache-type q8_0. That alone cuts cache VRAM by 50% with minimal speed loss.
The bottom line: Master --cache-type q8_0 first (cuts VRAM in half). Add --cache-ram only if you hit the VRAM wall and have spare system RAM. Leave --mmap enabled unless your NVMe is genuinely bottlenecked. Done.
For more on optimizing llama.cpp performance, see our guide to local LLM inference speeds and how much VRAM you actually need for 70B models. For CPU+GPU hybrid setups, check out balancing GPU and CPU offload.