KV Cache
Memory storage for the key-value attention states of all tokens in your current context.
Every time a language model processes a token, it computes, at each layer and attention head, two internal vectors called the key and the value — part of the attention mechanism that lets the model relate tokens to each other. These computations are expensive. Rather than recalculating them for every past token each time a new token is generated, the model caches them in memory. That cache is the KV cache.
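The mechanism can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any real model's implementation: each step appends the new token's key and value to the cache, and attention only ever reads from the cache rather than recomputing past tokens.

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Attend one query vector over all cached keys/values (single head)."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])   # one score per past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over cached tokens
    return weights @ v_cache                      # weighted sum of cached values

head_dim = 4
k_cache = np.empty((0, head_dim))
v_cache = np.empty((0, head_dim))

rng = np.random.default_rng(0)
for step in range(3):                             # generate 3 tokens
    # In a real model, q/k/v come from projecting the new token's hidden state.
    q, k, v = rng.normal(size=(3, head_dim))
    # Append this token's key/value once; old entries are never recomputed.
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    out = attend(q, k_cache, v_cache)

print(k_cache.shape)  # → (3, 4): the cache grows by one row per token
```

The growth is visible in the final shape: one cached key row (and value row) per token processed, per layer, per head — which is exactly why cache size scales linearly with context length.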
As your conversation grows longer, the KV cache grows with it: every token processed so far keeps an entry in the cache. This is the mechanism that lets the model "remember" everything said so far in a conversation — but it also means memory usage grows linearly with context length.
How Much VRAM the KV Cache Consumes
The exact size depends on model architecture (number of layers, KV attention heads, head dimension), cache precision, and context length. As a rough guide for a 7B model with a 16-bit cache:
- 4K context: ~0.5GB KV cache
- 32K context: ~4GB KV cache
- 128K context: ~8–16GB KV cache
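The figures above follow from a simple formula: two tensors (keys and values) per layer, one entry per KV head per token. Here is a small calculator; the example dimensions (32 layers, 8 KV heads of dim 128 for a 7B-class model with grouped-query attention; 80 layers for a 70B-class model) are assumptions typical of Llama-style architectures, not measurements of any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) x layers x KV heads x head dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

GiB = 1024 ** 3

# 7B-class model with GQA: 32 layers, 8 KV heads of dim 128, fp16 cache.
print(kv_cache_bytes(32, 8, 128, 4096) / GiB)     # 4K context   → 0.5 GiB
print(kv_cache_bytes(32, 8, 128, 131072) / GiB)   # 128K context → 16.0 GiB

# 70B-class model: 80 layers, 8 KV heads of dim 128, fp16 cache.
print(kv_cache_bytes(80, 8, 128, 131072) / GiB)   # 128K context → 40.0 GiB
```

Note how strongly architecture matters: a model using full multi-head attention (32 KV heads instead of 8) would need 4x these numbers, which is why real-world figures for the same parameter count vary so widely.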
For larger models, scale proportionally. A 70B model running at 128K context can consume 40–60GB of KV cache alone — before even counting model weights.
This is why supporting long context windows requires far more VRAM than the model size alone suggests.
What Happens When the KV Cache Runs Out
When available VRAM fills up with model weights plus KV cache, one of several things happens depending on your inference software:
- Generation stops with an out-of-memory error
- The context is truncated (oldest tokens dropped)
- The cache spills over to system RAM, causing a severe speed drop
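The truncation strategy can be sketched as a sliding window over cache entries. This is an illustrative sketch, not any particular engine's implementation; real engines must also keep token positions consistent, and some deliberately preserve the first few tokens rather than evicting strictly oldest-first.

```python
from collections import deque

class SlidingKVCache:
    """Toy drop-oldest KV cache holding at most max_tokens (key, value) pairs."""
    def __init__(self, max_tokens):
        # A bounded deque evicts the oldest entry automatically when full.
        self.entries = deque(maxlen=max_tokens)

    def append(self, key, value):
        self.entries.append((key, value))

cache = SlidingKVCache(max_tokens=4)
for t in range(10):                     # process 10 tokens with room for only 4
    cache.append(f"k{t}", f"v{t}")

print([k for k, _ in cache.entries])    # → ['k6', 'k7', 'k8', 'k9']
```

The oldest six tokens are gone: the model can still generate, but it has effectively forgotten the start of the conversation.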
Quantizing the KV Cache
Some inference frameworks (including llama.cpp) support quantizing the KV cache itself — storing the cached values at reduced precision to shrink memory usage. This trades a small amount of quality for significantly longer effective context at the same VRAM budget. It's worth enabling when running long conversations on VRAM-constrained hardware.
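As a concrete illustration, llama.cpp exposes cache-type flags for this. The exact flag names and accepted values vary by version, so treat this as a sketch and check your build's `--help` output:

```shell
# Serve a model with an 8-bit quantized KV cache instead of the fp16 default,
# roughly halving KV memory at the same context length.
# -ctk / -ctv set the cache type for keys and values; in llama.cpp a
# quantized V cache requires flash attention (-fa) to be enabled.
./llama-server -m model.gguf \
  -c 32768 \
  -fa \
  -ctk q8_0 \
  -ctv q8_0
```

Going from 16-bit to 8-bit cache entries means the same VRAM budget holds roughly twice the context, usually with little measurable quality loss; more aggressive 4-bit cache types exist but degrade quality more noticeably.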
Why It Matters for Local AI
If you're running models with large context windows — for document analysis, long coding sessions, or extended conversations — KV cache is often the actual limiting factor, not model size. A 24GB card can run a 13B model comfortably, but fill that context to 128K tokens and you may run out of headroom fast.
Related guides:
- Why your VRAM runs out mid-conversation: the KV cache explained — full breakdown of KV cache math, VRAM tables by context length, and practical fixes.
- How much VRAM do you need for local LLMs? — planning your VRAM budget including KV cache overhead.
- Use the VRAM calculator to estimate total VRAM needed at your target context length.