
Why Your VRAM Runs Out Mid-Conversation: The KV Cache Explained

By Georgia Thomas · 6 min read


Quick Summary:

  • Three VRAM consumers: Model weights (fixed), KV cache (grows with conversation length), framework overhead (small, fixed).
  • KV cache scales with context: KV cache memory grows linearly with context length, so doubling your context window setting doubles KV cache memory. A 32K context window can use more VRAM than the model weights themselves.
  • Practical fix first: Drop --ctx-size from 8192 to 2048. This frees 75% of KV cache VRAM with minimal impact on most conversations.

You bought a 24GB GPU. You loaded a 13B model, it took 8GB, you had 16GB free. The first conversation went fine. Then an hour later, mid-conversation, inference stalls and your monitoring tool shows VRAM fully saturated.

The model didn't change. The GPU didn't change. What changed was the conversation.

This is the KV cache, and it's the most commonly misunderstood aspect of local LLM memory management.

What VRAM Is Actually Doing

When you load a local LLM, three distinct things consume VRAM:

1. Model weights (fixed) The quantized model parameters. For a 13B model at Q4_K_M, this is roughly 7-8 GB. It's loaded once and stays there until you unload the model. This is the number everyone knows — it's the file size.

2. KV cache (dynamic — grows with context) This is the culprit. The KV cache stores intermediate computations from the attention mechanism. It grows with every token in the current context window. A fresh context starts at near-zero and expands as the conversation gets longer.

3. Framework overhead (small, fixed) CUDA/ROCm runtime, activation buffers, temporary computation space. Usually 0.5-1 GB depending on runtime and batch size. Small, but worth a line in your planning budget.

The total at any moment: VRAM used = model weights + KV cache (current size) + overhead

At the start of a conversation, KV cache is tiny. After 10,000 tokens, it may rival the model weights in size.

How the KV Cache Grows: The Math

In transformer models, every layer maintains a Key and Value matrix for every token in the current context. These matrices are what "attention" attends to — instead of recomputing them for past tokens on every forward pass (which would be impossibly slow), they're cached.

The formula for KV cache size:

KV cache bytes = 2 × layers × kv_heads × head_dim × seq_len × dtype_bytes

(For models without grouped-query attention, kv_heads equals the attention head count.)

Breaking this down for Llama 3.1 8B at FP16:

  • 2 = Key and Value (both cached)
  • 32 layers
  • 8 KV heads (Llama 3.1 uses grouped-query attention: its 32 query heads share 8 sets of keys and values)
  • 128 head dimension
  • seq_len = number of tokens in context
  • 2 bytes per value (FP16)

At 2,048 tokens: 2 × 32 × 8 × 128 × 2048 × 2 = ~268 MB

At 8,192 tokens: 2 × 32 × 8 × 128 × 8192 × 2 = ~1.1 GB

At 32,768 tokens: 2 × 32 × 8 × 128 × 32768 × 2 = ~4.3 GB
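The formula is easy to script. Here is a minimal sketch (the function name and layout are mine), plugged with Llama 3.1 8B's configuration of 32 layers, 8 KV heads under grouped-query attention, and a 128 head dimension:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x for Key and Value, cached at every layer for every token in context
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama 3.1 8B with an FP16 cache
for ctx in (2048, 8192, 32768):
    gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=ctx) / 1e9
    print(f"{ctx:>6} tokens: {gb:.2f} GB")
```

Swap in your own model's layer count, KV head count, and head dimension (all listed in its config file) to budget for any card.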

For a 13B model, these numbers scale up further: more layers, and, in older architectures like Llama 2 13B that lack grouped-query attention, a full set of KV heads per layer, which multiplies the per-token cache several times over. The pattern holds regardless: KV cache grows linearly with context length. Doubling context doubles KV cache VRAM.

This is why "what's the maximum context window this model supports?" is a different question from "how much VRAM do I have for KV cache?" The model may support 128K context. Your RTX 4070 doesn't have anywhere near enough free VRAM for the KV cache to fill at that length.

A Worked Example: RTX 4090 24GB

The RTX 4090 is the standard benchmark for a high-end consumer LLM rig. 24GB seems like a lot. Here's where it goes:

Scenario A: 13B model, 8,192 context (typical chat)

  • Model weights (Q4_K_M): 7.7 GB
  • KV cache at 8K context: ~3 GB
  • Framework overhead: ~0.8 GB
  • Total: ~11.5 GB — well within 24GB, plenty of headroom

Scenario B: 13B model, 32K context (pasting a document)

  • Model weights: 7.7 GB
  • KV cache at 32K context: ~12 GB
  • Overhead: ~0.8 GB
  • Total: ~20.5 GB — tight. It works, but leaves only a 3.5GB margin; any spillover out of VRAM would badly hurt performance.

Scenario C: 34B model (Q4_K_M), 8K context

  • Model weights: ~20 GB
  • KV cache at 8K: ~6 GB
  • Overhead: ~0.8 GB
  • Total: ~26.8 GB — exceeds 24GB. Won't load with this context setting.

Scenario C fix: 34B model (Q3_K_M), 4K context

  • Model weights: ~15 GB
  • KV cache at 4K: ~3 GB
  • Overhead: ~0.8 GB
  • Total: ~18.8 GB — fits comfortably.

The math shows why context length configuration matters as much as model selection.
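The scenario arithmetic is simple enough to script. A sketch using the figures above (the 0.8 GB overhead default and the function name are illustrative):

```python
def vram_total(weights_gb: float, kv_gb: float, overhead_gb: float = 0.8) -> float:
    # total = model weights + KV cache at the chosen context + runtime overhead
    return round(weights_gb + kv_gb + overhead_gb, 1)

print(vram_total(7.7, 3))   # Scenario A: 11.5 GB, fits in 24 GB
print(vram_total(7.7, 12))  # Scenario B: 20.5 GB, tight
print(vram_total(20, 6))    # Scenario C: 26.8 GB, won't load
print(vram_total(15, 3))    # Scenario C fix: 18.8 GB, comfortable
```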

Practical Fixes: Stretching VRAM Further

Fix 1: Reduce Context Length

This is the first thing to try. Ollama and LM Studio often default to a 2048 or 4096 token context, while many model presets ship with 8192 or more. Reducing it costs you maximum conversation length, not model quality.

In Ollama (ollama run has no --ctx-size flag; context is controlled by the num_ctx parameter, set inside an interactive session):

/set parameter num_ctx 2048

Or via the API:

{
  "model": "llama3.2",
  "options": {"num_ctx": 2048}
}

In llama.cpp:

./llama-server -m model.gguf --ctx-size 2048

Rule of thumb: 2048 tokens handles most single-turn Q&A and short conversations. 4096 handles moderately long conversations or small document chunks. 8192 handles most professional use cases. Only go to 16K+ if you're specifically pasting large documents.
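Because the cache grows linearly, the fraction of KV cache VRAM freed by shrinking the window is just a ratio. A one-liner (name illustrative) confirming the summary's 75% figure:

```python
def kv_freed(old_ctx: int, new_ctx: int) -> float:
    # fraction of KV cache VRAM released by shrinking the context window
    return 1 - new_ctx / old_ctx

print(kv_freed(8192, 2048))  # 0.75: dropping 8192 -> 2048 frees 75%
```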

Fix 2: Quantize the Model More Aggressively

Smaller model weights = more VRAM left for KV cache. On a 16GB GPU:

  • 13B Q4_K_M (7.7GB) + 8K context (3GB) = 10.7GB — fits with room
  • 13B Q5_K_M (9.7GB) + 8K context (3GB) = 12.7GB — still fits
  • 13B Q8_0 (13.5GB) + 8K context (3GB) = 16.5GB — doesn't fit at 8K context; need to drop to 4K

If your use case requires long context, prioritize a smaller quantization to leave VRAM available for KV cache.
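The same trade-off can be checked for any card by subtracting weights and overhead from total VRAM. The weight sizes below are the approximate figures from the list; the helper itself is illustrative:

```python
def kv_budget_gb(vram_gb: float, weights_gb: float, overhead_gb: float = 0.8) -> float:
    # VRAM left for the KV cache once weights and runtime overhead are loaded
    return round(vram_gb - weights_gb - overhead_gb, 1)

quants = {"Q4_K_M": 7.7, "Q5_K_M": 9.7, "Q8_0": 13.5}  # approx 13B weights, GB
for name, weights in quants.items():
    print(f"{name}: {kv_budget_gb(16, weights)} GB left for KV cache")
```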

Fix 3: KV Cache Quantization

Recent builds of llama.cpp support quantizing the KV cache itself, which cuts KV cache VRAM roughly in half with minimal quality impact. Note that quantizing the V cache requires flash attention to be enabled:

./llama-server -m model.gguf \
  --ctx-size 8192 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

Setting KV cache to Q8_0 (instead of FP16) halves KV cache memory. At 8K context on a 13B model, this drops KV cache from ~3GB to ~1.5GB — meaningful on a 16GB or 24GB card.

Q4_0 KV cache quantization is also available and halves it further, but introduces more quality degradation in attention. Q8_0 is the safe default.

Ollama 0.5.x+ supports this via the OLLAMA_KV_CACHE_TYPE environment variable, with flash attention enabled via OLLAMA_FLASH_ATTENTION=1. Check your specific version's documentation.
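The savings scale with the cache's bytes per value. Roughly, treating q8_0 as ~1 byte and q4_0 as ~0.5 bytes per value and ignoring quantization block overhead (an approximation, not exact llama.cpp accounting):

```python
BYTES_PER_VALUE = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}  # approximate

def quantized_kv_gb(fp16_kv_gb: float, cache_type: str) -> float:
    # scale an FP16 KV cache estimate down by the cache dtype's size ratio
    return fp16_kv_gb * BYTES_PER_VALUE[cache_type] / BYTES_PER_VALUE["fp16"]

print(quantized_kv_gb(3.0, "q8_0"))  # 1.5: the ~3 GB example drops to ~1.5 GB
print(quantized_kv_gb(3.0, "q4_0"))  # 0.75
```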

Fix 4: CPU Offload for KV Cache

Some inference runtimes support offloading KV cache to system RAM when GPU VRAM fills up. This works but is significantly slower than GPU-resident KV cache — system RAM bandwidth (50-80 GB/s) vs GPU memory bandwidth (500-1000+ GB/s).

In llama.cpp, this happens automatically when VRAM is exhausted if you're already doing partial GPU offload (via -ngl). For full GPU offload models, there's no automatic fallback — it will error out.

If your system has 64GB+ of fast DDR5 RAM and you need very long context on a model that just barely fits, intentional CPU offload of the KV cache (llama.cpp exposes this as --no-kv-offload) can be a practical compromise for occasional long-context operations.
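A back-of-envelope model of the cost, assuming decode speed is bound by how many bytes must be read per generated token (a simplification that ignores compute and transfer overlap; the names and sample numbers are illustrative):

```python
def decode_tokens_per_sec(weights_gb: float, kv_gb: float,
                          weights_bw: float, kv_bw: float) -> float:
    # each generated token reads all weights plus the whole KV cache once;
    # bandwidths are in GB/s, so the sum of the ratios is seconds per token
    return 1 / (weights_gb / weights_bw + kv_gb / kv_bw)

all_gpu = decode_tokens_per_sec(7.7, 3.0, weights_bw=1000, kv_bw=1000)
kv_in_ram = decode_tokens_per_sec(7.7, 3.0, weights_bw=1000, kv_bw=60)
print(f"all-GPU: {all_gpu:.0f} tok/s, KV in system RAM: {kv_in_ram:.0f} tok/s")
```

Even with the weights still GPU-resident, pushing a few gigabytes of KV cache through ~60 GB/s system RAM dominates the per-token time.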

Fix 5: Flash Attention

Flash Attention is an algorithm that computes attention without materializing the full attention score matrix in memory — it works in tiles, dramatically reducing peak memory use for long contexts. Runtimes with Flash Attention support can handle much longer contexts in the same VRAM.

In llama.cpp, enable it with the --flash-attn flag; in Ollama, via the OLLAMA_FLASH_ATTENTION environment variable (some builds enable it automatically on supported hardware).

This doesn't reduce steady-state KV cache size for persistent context, but it reduces peak memory during the initial prefill phase when processing a long prompt.

Quick Reference: Context Length vs KV Cache VRAM

For a 7B model (approximately):

  • 2,048 tokens: ~0.15 GB
  • 4,096 tokens: ~0.25 GB
  • 8,192 tokens: ~0.55 GB
  • 32,768 tokens: ~2.1 GB

For a 13B model (approximately):

  • 2,048 tokens: ~0.3 GB
  • 4,096 tokens: ~0.6 GB
  • 8,192 tokens: ~1.25 GB
  • 32,768 tokens: ~5 GB

The Planning Checklist

When setting up a new model on your hardware:

  1. Note model file size (= minimum VRAM required)
  2. Decide max context you actually need (not the model's theoretical max)
  3. Estimate KV cache from the table above at your chosen context
  4. Add 0.5-1GB for overhead
  5. Confirm total < your GPU VRAM
  6. If it doesn't fit: reduce context, use more aggressive quantization, or enable KV cache quantization

The pattern: VRAM planning for local LLMs is not "will the model fit?" — it's "will the model fit at the context length I need?"
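The checklist above can be sketched as a single function (the thresholds, names, and messages are illustrative):

```python
def plan_fit(vram_gb: float, weights_gb: float, kv_gb: float,
             overhead_gb: float = 0.8) -> str:
    # steps 1-5: weights + KV cache at your chosen context + overhead vs VRAM
    total = weights_gb + kv_gb + overhead_gb
    if total <= vram_gb:
        return f"fits: {total:.1f} GB of {vram_gb} GB"
    return f"over by {total - vram_gb:.1f} GB: cut context or quantize harder"

print(plan_fit(24, 7.7, 3.0))   # the 13B / 8K scenario
print(plan_fit(24, 20.0, 6.0))  # the 34B / 8K scenario
```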

For picking the right model quantization format to maximize the headroom left for KV cache, see our GGUF vs GPTQ vs AWQ vs EXL2 guide. For hardware that maximizes total available memory, see our AMD Strix Halo mini PC vs Mac Mini comparison. To estimate exactly how much VRAM your model and context combination requires, use our VRAM calculator.

kv-cache vram local-llm context-length memory
