
Context Window

The maximum number of tokens a model can process at once — its working memory.

The context window is the total number of tokens a model can hold in attention at one time. Every token in your prompt, the model's previous responses, any system instructions, and all documents you've pasted in — all of it counts against this limit.

When you hit the context window limit, the model can no longer see earlier parts of the conversation. Depending on the software, it either truncates (silently drops the oldest tokens), returns an error, or shifts the window forward by evicting older messages while keeping the system prompt.
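
Oldest-first truncation is the simplest of these strategies. A minimal sketch, assuming the conversation is already a flat list of token IDs and `window` is the model's limit:

```python
def truncate_to_window(tokens: list[int], window: int) -> list[int]:
    # Keep only the most recent `window` tokens; everything older
    # falls out of the model's view entirely.
    return tokens[-window:]

truncate_to_window([101, 102, 103, 104, 105], window=3)  # → [103, 104, 105]
```

Real runtimes are usually more careful, pinning the system prompt so it survives eviction, but the core idea is the same sliding window.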

Token Count Basics

Tokens aren't exactly words — they're chunks of text roughly 3–4 characters long on average. A useful rule of thumb:

  • 1,000 tokens ≈ 750 words
  • A full page of text ≈ 500–600 tokens
  • A large codebase file ≈ 2,000–10,000 tokens
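
These rules of thumb can be wrapped in a quick character-based estimator. Note that the 4-characters-per-token figure is an average for English text, not a real tokenizer, so treat the result as a ballpark:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic."""
    return max(1, round(len(text) / 4))
```

For precise counts you need the actual tokenizer shipped with the model, since tokenization varies between model families.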

Common context window sizes:

  • 4K tokens: A short document or a few pages of code
  • 32K tokens: Several book chapters or a handful of large source files
  • 128K tokens: A short novel's worth of text (~100,000 words)
  • 1M tokens: Several full-length novels (mostly cloud models; rare locally)

Context Window vs VRAM

Larger context windows require more VRAM because the KV cache grows with context length. This is a hard constraint for local inference:

  • A 7B model with 4K context: fits on 8GB VRAM
  • Same model with 128K context: may need 16–20GB just for the KV cache
  • A 70B model with 128K context: requires enormous VRAM or CPU offloading

Most local rigs don't run models at their maximum theoretical context length. Practical context is limited by available VRAM after loading model weights.
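
The KV-cache growth described above can be estimated directly from a model's attention dimensions. A sketch, assuming an fp16 cache; the layer and head counts below are illustrative values typical of an 8B-class grouped-query-attention model, not taken from any specific model card:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim wide,
    # one entry per token of context, at fp16 (2 bytes) by default.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# 32 layers, 8 KV heads of width 128, 128K-token context:
kv_cache_bytes(32, 8, 128, 131072) / 2**30  # → 16.0 GiB
```

Older architectures without grouped-query attention cache a full set of heads per layer, which can multiply the per-token cost several times over; quantizing the cache to 8-bit halves it.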

Advertised vs Practical Context

Model cards often list a maximum context window, but this is the architectural limit at full precision with unlimited VRAM. Locally, your effective context window depends on how much VRAM remains after the model is loaded. On a 24GB card running a Q4_K_M 13B model (~8GB weights), you have roughly 16GB left for KV cache — enough for 32–64K tokens at this model size, not the theoretical 128K maximum.
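
One way to sanity-check this before downloading a model is to budget VRAM explicitly. A sketch under stated assumptions: the per-token KV cost and the flat overhead allowance for the runtime and activations are hypothetical inputs you would substitute for your own setup:

```python
def max_practical_context(vram_gb: float, weights_gb: float,
                          per_token_kv_bytes: int,
                          overhead_gb: float = 1.5) -> int:
    # VRAM left after weights and runtime overhead, divided by the
    # KV-cache cost of a single token of context.
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 2**30
    return max(0, int(free_bytes // per_token_kv_bytes))

# 24 GB card, ~8 GB of quantized weights, 160 KB/token KV cache:
max_practical_context(24, 8, 160 * 1024)
```

The answer swings widely with attention layout and cache precision, so treat the output as a ceiling estimate rather than a guarantee.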

Why It Matters for Local AI

Context window directly affects what you can do with a local model. Short context limits you to simple Q&A. Long context enables document summarization, multi-file code review, and extended conversations. Matching context requirements to your hardware's VRAM ceiling is essential before buying or building a local LLM rig.