
Why Your VRAM Runs Out Mid-Conversation: The KV Cache Explained

By Georgia Thomas · 6 min read


Quick Summary:

  • Three VRAM consumers: Model weights (fixed), KV cache (grows with conversation length), framework overhead (small, fixed).
  • KV cache scales with context: KV cache memory grows linearly with context length, so doubling your context window setting doubles KV cache memory. A 32K context window can use more VRAM than the model weights themselves.
  • Practical fix first: Drop --ctx-size from 8192 to 2048. This frees 75% of KV cache VRAM with minimal impact on most conversations.

You bought a 24GB GPU. You loaded a 13B model, it took 8GB, you had 16GB free. The first conversation went fine. Then an hour later, mid-conversation, inference stalls and your monitoring tool shows VRAM fully saturated.

The model didn't change. The GPU didn't change. What changed was the conversation.

This is the KV cache, and it's the most commonly misunderstood aspect of local LLM memory management.

What VRAM Is Actually Doing

When you load a local LLM, three distinct things consume VRAM:

1. Model weights (fixed) The quantized model parameters. For a 13B model at Q4_K_M, this is roughly 7-8 GB. It's loaded once and stays there until you unload the model. This is the number everyone knows — it's the file size.

2. KV cache (dynamic — grows with context) This is the culprit. The KV cache stores intermediate computations from the attention mechanism. It grows with every token in the current context window. A fresh context starts at near-zero and expands as the conversation gets longer.

3. Framework overhead (small, fixed) CUDA/ROCm runtime, activation buffers, temporary computation space. Usually 0.5-1 GB depending on runtime and batch size. Small, but worth a line in your planning budget.

The total at any moment: VRAM used = model weights + KV cache (current size) + overhead

At the start of a conversation, KV cache is tiny. After 10,000 tokens, it may rival the model weights in size.

How the KV Cache Grows: The Math

In transformer models, every layer maintains a Key and Value matrix for every token in the current context. These matrices are what "attention" attends to — instead of recomputing them for past tokens on every forward pass (which would be impossibly slow), they're cached.

The formula for KV cache size:

KV cache bytes = 2 × layers × kv_heads × head_dim × seq_len × dtype_bytes

(For models without grouped-query attention, kv_heads equals the attention head count.)

Breaking this down for Llama 3.1 8B at FP16:

  • 2 = Key and Value (both cached)
  • 32 layers
  • 8 KV heads (Llama 3.1 uses grouped-query attention: its 32 query heads share 8 sets of keys and values)
  • 128 head dimension
  • seq_len = number of tokens in context
  • 2 bytes per value (FP16)

At 2,048 tokens: 2 × 32 × 8 × 128 × 2048 × 2 = ~268 MB

At 8,192 tokens: 2 × 32 × 8 × 128 × 8192 × 2 = ~1.1 GB

At 32,768 tokens: 2 × 32 × 8 × 128 × 32768 × 2 = ~4.3 GB
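The formula is easy to script. Here is a minimal sketch (the function name and layout are mine), plugged with Llama 3.1 8B's configuration of 32 layers, 8 KV heads under grouped-query attention, and a 128 head dimension:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x for Key and Value, cached at every layer for every token in context
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama 3.1 8B with an FP16 cache
for ctx in (2048, 8192, 32768):
    gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=ctx) / 1e9
    print(f"{ctx:>6} tokens: {gb:.2f} GB")
```

Swap in your own model's layer count, KV head count, and head dimension (all listed in its config file) to budget for any card.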

For a 13B model, these numbers scale up further: more layers, and, in older architectures like Llama 2 13B that lack grouped-query attention, a full set of KV heads per layer, which multiplies the per-token cache several times over. The pattern holds regardless: KV cache grows linearly with context length. Doubling context doubles KV cache VRAM.

This is why "what's the maximum context window this model supports?" is a different question from "how much VRAM do I have for KV cache?" The model may support 128K context. Your RTX 4070 doesn't have anywhere near enough free VRAM for the KV cache to fill at that length.

A Worked Example: RTX 4090 24GB

The RTX 4090 is the standard benchmark for a high-end consumer LLM rig. 24GB seems like a lot. Here's where it goes:

Scenario A: 13B model, 8,192 context (typical chat)

  • Model weights (Q4_K_M): 7.7 GB
  • KV cache at 8K context: ~3 GB
  • Framework overhead: ~0.8 GB
  • Total: ~11.5 GB — well within 24GB, plenty of headroom

Scenario B: 13B model, 32K context (pasting a document)

  • Model weights: 7.7 GB
  • KV cache at 32K context: ~12 GB
  • Overhead: ~0.8 GB
  • Total: ~20.5 GB — tight. It works, but leaves only a 3.5GB margin; any spillover out of VRAM would badly hurt performance.

Scenario C: 34B model (Q4_K_M), 8K context

  • Model weights: ~20 GB
  • KV cache at 8K: ~6 GB
  • Overhead: ~0.8 GB
  • Total: ~26.8 GB — exceeds 24GB. Won't load with this context setting.

Scenario C fix: 34B model (Q3_K_M), 4K context

  • Model weights: ~15 GB
  • KV cache at 4K: ~3 GB
  • Overhead: ~0.8 GB
  • Total: ~18.8 GB — fits comfortably.

The math shows why context length configuration matters as much as model selection.
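The scenario arithmetic is simple enough to script. A sketch using the figures above (the 0.8 GB overhead default and the function name are illustrative):

```python
def vram_total(weights_gb: float, kv_gb: float, overhead_gb: float = 0.8) -> float:
    # total = model weights + KV cache at the chosen context + runtime overhead
    return round(weights_gb + kv_gb + overhead_gb, 1)

print(vram_total(7.7, 3))   # Scenario A: 11.5 GB, fits in 24 GB
print(vram_total(7.7, 12))  # Scenario B: 20.5 GB, tight
print(vram_total(20, 6))    # Scenario C: 26.8 GB, won't load
print(vram_total(15, 3))    # Scenario C fix: 18.8 GB, comfortable
```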

Practical Fixes: Stretching VRAM Further

Fix 1: Reduce Context Length

This is the first thing to try. Ollama and LM Studio often default to a 2048 or 4096 token context, while many model presets ship with 8192 or more. Reducing it costs you maximum conversation length, not model quality.

In Ollama (ollama run has no --ctx-size flag; context is controlled by the num_ctx parameter, set inside an interactive session):

/set parameter num_ctx 2048

Or via the API:

{
  "model": "llama3.2",
  "options": {"num_ctx": 2048}
}

In llama.cpp:

./llama-server -m model.gguf --ctx-size 2048

Rule of thumb: 2048 tokens handles most single-turn Q&A and short conversations. 4096 handles moderately long conversations or small document chunks. 8192 handles most professional use cases. Only go to 16K+ if you're specifically pasting large documents.
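Because the cache grows linearly, the fraction of KV cache VRAM freed by shrinking the window is just a ratio. A one-liner (name illustrative) confirming the summary's 75% figure:

```python
def kv_freed(old_ctx: int, new_ctx: int) -> float:
    # fraction of KV cache VRAM released by shrinking the context window
    return 1 - new_ctx / old_ctx

print(kv_freed(8192, 2048))  # 0.75: dropping 8192 -> 2048 frees 75%
```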

Fix 2: Quantize the Model More Aggressively

Smaller model weights = more VRAM left for KV cache. On a 16GB GPU:

  • 13B Q4_K_M (7.7GB) + 8K context (3GB) = 10.7GB — fits with room
  • 13B Q5_K_M (9.7GB) + 8K context (3GB) = 12.7GB — still fits
  • 13B Q8_0 (13.5GB) + 8K context (3GB) = 16.5GB — doesn't fit at 8K context; need to drop to 4K

If your use case requires long context, prioritize a smaller quantization to leave VRAM available for KV cache.
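The same trade-off can be checked for any card by subtracting weights and overhead from total VRAM. The weight sizes below are the approximate figures from the list; the helper itself is illustrative:

```python
def kv_budget_gb(vram_gb: float, weights_gb: float, overhead_gb: float = 0.8) -> float:
    # VRAM left for the KV cache once weights and runtime overhead are loaded
    return round(vram_gb - weights_gb - overhead_gb, 1)

quants = {"Q4_K_M": 7.7, "Q5_K_M": 9.7, "Q8_0": 13.5}  # approx 13B weights, GB
for name, weights in quants.items():
    print(f"{name}: {kv_budget_gb(16, weights)} GB left for KV cache")
```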

Fix 3: KV Cache Quantization

Recent builds of llama.cpp support quantizing the KV cache itself, which cuts KV cache VRAM roughly in half with minimal quality impact. Note that quantizing the V cache requires flash attention to be enabled:

./llama-server -m model.gguf \
  --ctx-size 8192 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

Setting KV cache to Q8_0 (instead of FP16) halves KV cache memory. At 8K context on a 13B model, this drops KV cache from ~3GB to ~1.5GB — meaningful on a 16GB or 24GB card.

Q4_0 KV cache quantization is also available and halves it further, but introduces more quality degradation in attention. Q8_0 is the safe default.

Ollama 0.5.x+ supports this via the OLLAMA_KV_CACHE_TYPE environment variable, with flash attention enabled via OLLAMA_FLASH_ATTENTION=1. Check your specific version's documentation.
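The savings scale with the cache's bytes per value. Roughly, treating q8_0 as ~1 byte and q4_0 as ~0.5 bytes per value and ignoring quantization block overhead (an approximation, not exact llama.cpp accounting):

```python
BYTES_PER_VALUE = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}  # approximate

def quantized_kv_gb(fp16_kv_gb: float, cache_type: str) -> float:
    # scale an FP16 KV cache estimate down by the cache dtype's size ratio
    return fp16_kv_gb * BYTES_PER_VALUE[cache_type] / BYTES_PER_VALUE["fp16"]

print(quantized_kv_gb(3.0, "q8_0"))  # 1.5: the ~3 GB example drops to ~1.5 GB
print(quantized_kv_gb(3.0, "q4_0"))  # 0.75
```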

Fix 4: CPU Offload for KV Cache

Some inference runtimes support offloading KV cache to system RAM when GPU VRAM fills up. This works but is significantly slower than GPU-resident KV cache — system RAM bandwidth (50-80 GB/s) vs GPU memory bandwidth (500-1000+ GB/s).

In llama.cpp, this happens automatically when VRAM is exhausted if you're already doing partial GPU offload (via -ngl). For full GPU offload models, there's no automatic fallback — it will error out.

If your system has 64GB+ of fast DDR5 RAM and you need very long context on a model that just barely fits, intentional CPU offload of the KV cache (llama.cpp exposes this as --no-kv-offload) can be a practical compromise for occasional long-context operations.
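A back-of-envelope model of the cost, assuming decode speed is bound by how many bytes must be read per generated token (a simplification that ignores compute and transfer overlap; the names and sample numbers are illustrative):

```python
def decode_tokens_per_sec(weights_gb: float, kv_gb: float,
                          weights_bw: float, kv_bw: float) -> float:
    # each generated token reads all weights plus the whole KV cache once;
    # bandwidths are in GB/s, so the sum of the ratios is seconds per token
    return 1 / (weights_gb / weights_bw + kv_gb / kv_bw)

all_gpu = decode_tokens_per_sec(7.7, 3.0, weights_bw=1000, kv_bw=1000)
kv_in_ram = decode_tokens_per_sec(7.7, 3.0, weights_bw=1000, kv_bw=60)
print(f"all-GPU: {all_gpu:.0f} tok/s, KV in system RAM: {kv_in_ram:.0f} tok/s")
```

Even with the weights still GPU-resident, pushing a few gigabytes of KV cache through ~60 GB/s system RAM dominates the per-token time.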

Fix 5: Flash Attention

Flash Attention is an algorithm that computes attention without materializing the full attention score matrix in memory — it works in tiles, dramatically reducing peak memory use for long contexts. Runtimes with Flash Attention support can handle much longer contexts in the same VRAM.

In llama.cpp, enable it with the --flash-attn flag; in Ollama, via the OLLAMA_FLASH_ATTENTION environment variable (some builds enable it automatically on supported hardware).

This doesn't reduce steady-state KV cache size for persistent context, but it reduces peak memory during the initial prefill phase when processing a long prompt.

Quick Reference: Context Length vs KV Cache VRAM

For a 7B model (approximately):

  • 2,048 tokens: ~0.15 GB
  • 4,096 tokens: ~0.25 GB
  • 8,192 tokens: ~0.55 GB
  • 32,768 tokens: ~2.1 GB

For a 13B model (approximately):

  • 2,048 tokens: ~0.3 GB
  • 4,096 tokens: ~0.6 GB
  • 8,192 tokens: ~1.25 GB
  • 32,768 tokens: ~5 GB

The Planning Checklist

When setting up a new model on your hardware:

  1. Note model file size (= minimum VRAM required)
  2. Decide max context you actually need (not the model's theoretical max)
  3. Estimate KV cache from the table above at your chosen context
  4. Add 0.5-1GB for overhead
  5. Confirm total < your GPU VRAM
  6. If it doesn't fit: reduce context, use more aggressive quantization, or enable KV cache quantization

The pattern: VRAM planning for local LLMs is not "will the model fit?" — it's "will the model fit at the context length I need?"
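The checklist above can be sketched as a single function (the thresholds, names, and messages are illustrative):

```python
def plan_fit(vram_gb: float, weights_gb: float, kv_gb: float,
             overhead_gb: float = 0.8) -> str:
    # steps 1-5: weights + KV cache at your chosen context + overhead vs VRAM
    total = weights_gb + kv_gb + overhead_gb
    if total <= vram_gb:
        return f"fits: {total:.1f} GB of {vram_gb} GB"
    return f"over by {total - vram_gb:.1f} GB: cut context or quantize harder"

print(plan_fit(24, 7.7, 3.0))   # the 13B / 8K scenario
print(plan_fit(24, 20.0, 6.0))  # the 34B / 8K scenario
```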

For picking the right model quantization format to maximize the headroom left for KV cache, see our GGUF vs GPTQ vs AWQ vs EXL2 guide. For hardware that maximizes total available memory, see our AMD Strix Halo mini PC vs Mac Mini comparison. To estimate exactly how much VRAM your model and context combination requires, use our VRAM calculator.

kv-cache vram local-llm context-length memory
