Prompt Caching
Reusing the computed KV cache state from a previous request's prefix — eliminating redundant compute for repeated system prompts or context.
Prompt caching is an optimization that stores the computed key-value (KV) cache from a request's prefix and reuses it for subsequent requests that begin with the same prefix. The expensive, compute-heavy prefill phase, which processes every input token, happens only once for the cached portion. Later requests sharing the prefix skip prefill for the cached tokens and process only the new suffix before moving on to the cheaper decode phase.
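The control flow can be sketched with a toy, engine-agnostic model (all names here are hypothetical; token counts stand in for the real KV tensors):

```python
# Minimal sketch of prompt caching: store the "KV state" computed for a
# prompt prefix, and charge prefill only for tokens not already covered.

kv_cache = {}  # prefix (as a tuple of token ids) -> cached "KV state"

def prefill(prefix, suffix):
    """Return the number of tokens the prefill step must actually process."""
    if prefix not in kv_cache:
        kv_cache[prefix] = len(prefix)      # compute and store KV for prefix
        return len(prefix) + len(suffix)    # cold start: process everything
    return len(suffix)                      # cache hit: only the new tokens

system_prompt = tuple(range(2000))           # stand-in for a 2,000-token prompt
first  = prefill(system_prompt, (1, 2, 3))   # 2003 tokens processed
second = prefill(system_prompt, (4, 5))      # 2 tokens processed
print(first, second)
```

The second request pays prefill cost only for its two new tokens, which is exactly the saving the surrounding text describes.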
Practical Impact
Consider a deployment where every request starts with the same 2,000-token system prompt. Without prompt caching, those 2,000 tokens are re-processed for every new conversation — a significant compute cost. With prompt caching, that work happens once, and the saved KV state is reused for all subsequent conversations.
The speedup is proportional to the cached prefix length relative to the total context length: a 2,000-token cached system prompt in a 4,000-token context eliminates roughly half of the prefill computation.
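As a quick sanity check, treating prefill cost as roughly linear in the number of tokens processed:

```python
# Back-of-envelope fraction of prefill work saved by the cached prefix,
# under the simplifying assumption that cost scales linearly with tokens.
cached_prefix = 2000
total_context = 4000
fraction_saved = cached_prefix / total_context
print(f"{fraction_saved:.0%} of prefill computation skipped")
```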
Hardware Requirements
Prompt caching requires keeping the cached KV states in memory between requests. For a large system prompt (2,000 tokens) with a 70B model, the cached KV state might be 1–4GB of VRAM. This is a persistent overhead across the lifetime of the server, not per-request.
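The memory figure can be estimated from the model shape. The numbers below are assumptions resembling a 70B-class decoder (80 layers, head dimension 128, fp16 weights); actual sizes depend on the attention variant and precision:

```python
# Hedged estimate of cached KV size for a 2,000-token prefix.

def kv_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2 covers both keys and values; fp16/bf16 is 2 bytes/element.
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

full_mha = kv_bytes(2000, layers=80, kv_heads=64, head_dim=128)
gqa      = kv_bytes(2000, layers=80, kv_heads=8,  head_dim=128)
print(f"full multi-head attention: {full_mha / 2**30:.1f} GiB")
print(f"grouped-query attention:   {gqa / 2**30:.2f} GiB")
```

Grouped-query attention shrinks the cache by the ratio of query heads to KV heads, which is why the persistent overhead spans a wide range like the 1–4GB cited above.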
The trade-off is straightforward: spend VRAM to save compute and latency on repeated prefixes. For single-user personal use, prompt caching is less relevant. For applications with many users sharing the same system prompt, it's a significant efficiency win.
Software Support
vLLM, SGLang, and TGI support prompt caching (often called "prefix caching"). llama.cpp has context reuse features that accomplish similar goals. Anthropic and OpenAI offer prompt caching in their APIs as a cost-reduction feature, which applies the same principle to billed token usage.
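In vLLM, for example, prefix caching can be switched on with a single flag on the OpenAI-compatible server (the model name below is just an example; check the vLLM docs for your version's exact options):

```shell
# Enable automatic prefix caching in vLLM's serving entrypoint.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```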
Relationship to KV Cache
Prompt caching is a specific application of KV cache management. The KV cache stores all intermediate computation for the current context; prompt caching specifically saves and reuses the portion corresponding to a repeated prefix across multiple independent requests.
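One common way engines detect a reusable prefix across independent requests is to hash fixed-size token blocks, chaining each block's hash with its predecessor's so that a block hash uniquely identifies the entire prefix up to that point. The sketch below is illustrative, not any engine's actual implementation, though vLLM uses a similar block-hash scheme internally:

```python
# Chained block hashing: two requests share cached blocks exactly as far as
# their block hashes agree, i.e. as far as their token prefixes agree.

from hashlib import sha256

BLOCK_SIZE = 16

def block_hashes(token_ids):
    """Return one chained hash per full block of tokens."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = sha256(prev + str(block).encode()).digest()
        hashes.append(prev)
    return hashes

a = block_hashes(list(range(64)) + [1, 2, 3])     # shared 64-token prefix
b = block_hashes(list(range(64)) + [9, 9, 9, 9])  # different suffix
shared = sum(1 for x, y in zip(a, b) if x == y)
print(shared)  # number of shared cached blocks
```

Because the hashes are chained, a matching block hash guarantees the whole prefix matches, so the engine can safely reuse the KV blocks up to that point.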