CraftRigs
Technical Report

MSA Memory: The Research That Could Slash VRAM Requirements for Long-Context LLMs

By Charlotte Stewart

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

EverMind's open-sourced MSA memory architecture — released March 19, 2026 — is the most significant inference efficiency research to land this year, and if it scales the way the paper suggests, it could halve the VRAM cost of running long-context models locally.

Long-context inference is expensive. Not in the "requires a frontier-class API" sense, but in the hardware sense: keeping a 100K or 1M token context window alive during inference requires enormous amounts of VRAM and memory bandwidth. It's the primary reason most home-built AI servers are constrained to 4K–32K effective context windows, regardless of what a model's theoretical maximum is.

EverMind's Multi-Scale Attention (MSA) memory architecture, open-sourced yesterday, attacks this constraint directly. The research matters. Here's what it actually does and what it means for local builders.


The KV Cache Problem

To understand why MSA is significant, you need to understand what's currently eating your VRAM during long-context inference.

When a transformer-based LLM processes a sequence of tokens, it builds a key-value (KV) cache at each attention layer. This cache stores the key and value projections of every previous token so the model can "attend" back to them when generating new output. The KV cache is what makes autoregressive generation efficient: without it, the model would have to reprocess the entire context from scratch for every new token generated.

The problem is that KV cache size grows linearly with context length. For a model like Llama 3.1 8B with a 128K context window:

  • At 4K tokens: ~200MB KV cache
  • At 32K tokens: ~1.6GB KV cache
  • At 128K tokens: ~6.4GB KV cache
  • At 1M tokens: ~50GB KV cache (theoretical, most hardware can't sustain this)
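The linear scaling is easy to sanity-check with a back-of-envelope estimator. The defaults below (32 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache) are illustrative, not taken from the paper; real totals shift with the model's attention configuration and any KV-cache quantization, which is why these fp16 figures come out larger than the list above:

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Rough KV cache size: one key vector and one value vector
    per layer, per KV head, per token, at the given precision."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

print(kv_cache_bytes(4_096) / 2**30)    # 0.5 GiB at 4K tokens
print(kv_cache_bytes(131_072) / 2**30)  # 16.0 GiB at 128K: linear in tokens
```

Quantizing the cache itself (runtimes such as llama.cpp offer 8-bit and 4-bit KV cache options) divides these numbers by roughly 2 to 4, which is how the smaller figures in the list above become reachable.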

This is why a 16GB VRAM GPU can technically handle a 128K-context model in GGUF format, but in practice starts degrading well before you reach that limit — the KV cache competes directly with the model weights for available VRAM.

Tip: If you've ever noticed your local LLM getting slower or producing lower-quality outputs toward the end of a long conversation, KV cache pressure is often the culprit. The model is either evicting early context or starting to offload cache to system RAM.


What MSA Memory Does Differently

EverMind's Multi-Scale Attention (MSA) architecture addresses KV cache growth through hierarchical compression across multiple time scales.

The core insight: not all tokens in a long context are equally important to the current generation step. Tokens from 50,000 positions ago contribute very differently to the next predicted token than tokens from 200 positions ago. Current attention mechanisms treat all past tokens with roughly equal resolution, which is computationally expensive and VRAM-intensive.

MSA introduces three distinct memory layers:

Layer 1 — Recent Context (Full Resolution)
The last N tokens (configurable, typically 2K–8K) are stored in a standard full-resolution KV cache. This is where most of the active generation work happens, and it requires full fidelity.

Layer 2 — Compressed Mid-Range Memory
Tokens from the middle distance (8K–100K positions back) are compressed into summary representations using learned compression functions, at a ratio of approximately 8:1 to 16:1. A 64K-token range that would normally require ~3.2GB of KV cache compresses to ~200–400MB without significant quality loss on tasks that require general context retention.

Layer 3 — Episodic Long-Range Memory
Tokens beyond 100K positions are compressed to sparse episodic embeddings — think of them as "chapter summaries" of what happened earlier in the sequence. These require only a fraction of the VRAM of the original KV cache but preserve the gist of earlier context for tasks that require broad narrative coherence.

The model learns when to "reach back" into each layer via a routing mechanism trained jointly with the compression functions. On retrieval-intensive tasks (like "find the paragraph where X was mentioned 800K tokens ago"), MSA selectively decompresses relevant mid-range or episodic memories before attending to them.
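To make the tiering concrete, here is a toy sketch of that three-layer layout. The class, the thresholds, and the mean-pooling "compressor" are my own illustrative stand-ins: the paper uses learned compression functions and a trained router, neither of which is reproduced here.

```python
class TieredKVCache:
    """Toy three-tier store: full-resolution recent window,
    block-pooled mid-range summaries, coarse episodic memory."""

    def __init__(self, recent_window=4096, mid_horizon=100_000, block=8):
        self.recent_window = recent_window  # Layer 1 capacity, in tokens
        self.mid_horizon = mid_horizon      # positions covered by Layers 1+2
        self.block = block                  # mid-range compression ratio (8:1)
        self.recent, self.mid, self.episodic = [], [], []

    @staticmethod
    def _pool(chunk):
        # Stand-in for a learned compressor: elementwise mean over the block.
        dim = len(chunk[0])
        return [sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)]

    def append(self, kv_vec):
        self.recent.append(kv_vec)
        # Layer 1 overflow: fold the oldest block into a mid-range summary.
        while len(self.recent) > self.recent_window:
            self.mid.append(self._pool(self.recent[: self.block]))
            del self.recent[: self.block]
        # Layer 2 overflow: demote the oldest summaries to episodic memory.
        max_mid = (self.mid_horizon - self.recent_window) // self.block
        while len(self.mid) > max_mid:
            self.episodic.append(self.mid.pop(0))

    def stored_vectors(self):
        return len(self.recent) + len(self.mid) + len(self.episodic)
```

A real implementation would also need the routing step (deciding per query which tier to attend over, and selectively decompressing the relevant mid-range blocks), which is where the paper's learned components do the actual work.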


The Numbers From the Paper

EverMind's benchmarks, run on their internal evaluation suite:

KV cache VRAM reduction vs. standard attention:

  • At 32K context: 25%
  • At 128K context: 56%
  • At 512K context: 76%
  • At 1M context: 82%

Quality metrics on the RULER benchmark (long-context evaluation):

  • MSA vs. standard attention at 128K: +2.1% perplexity (minimal degradation)
  • MSA vs. standard attention at 512K: +3.8% perplexity
  • MSA vs. standard attention at 1M: +6.2% perplexity

The quality tradeoff at 1M context is real: a 6.2% perplexity increase is noticeable on precision tasks. But at 128K the degradation is minimal, and cutting the KV cache from 6.4GB to 2.8GB is significant.

Warning: These are paper benchmarks from the team that developed the architecture. Independent reproduction on diverse model families and real-world tasks hasn't happened yet — the paper was released yesterday. Treat the numbers as directionally correct, not precision figures. Production implementations often show different efficiency curves than research benchmarks.


Why This Matters for Local Builders

If MSA (or an architecture like it) gets integrated into mainstream inference runtimes like llama.cpp or Ollama, the practical effect on home AI servers would be substantial.

Scenario 1: You're running a 13B model for conversational AI

Currently, a 32K context window with a 13B Q4_K_M model takes roughly 8–9GB of VRAM (model weights + KV cache). That works on a 16GB card but leaves limited headroom. With MSA-level compression, the same 32K context would take approximately 7GB, and a 64K context might become practical on 16GB hardware that currently can't sustain it.
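A quick budgeting sketch of that scenario. The weight and per-32K cache figures are illustrative round numbers, and applying the paper's 56% reduction (reported at 128K) to a 64K context is my extrapolation, not a measured result:

```python
def vram_estimate_gb(weights_gb, kv_gb_at_32k, ctx_tokens, kv_reduction=0.0):
    """Total VRAM ~= model weights + KV cache scaled linearly with context,
    optionally shrunk by an MSA-style compression factor."""
    kv_gb = kv_gb_at_32k * (ctx_tokens / 32_768) * (1.0 - kv_reduction)
    return weights_gb + kv_gb

# 13B Q4_K_M: ~7.5 GB weights, ~1.3 GB KV cache at 32K (illustrative)
print(vram_estimate_gb(7.5, 1.3, 32_768))                     # ~8.8 GB today
print(vram_estimate_gb(7.5, 1.3, 65_536, kv_reduction=0.56))  # ~8.6 GB at 64K
```

Under those assumptions, doubling the context to 64K with MSA-level compression still lands under 9GB, which is why a 64K window on a 16GB card starts to look plausible.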

Scenario 2: You're running RAG pipelines with large documents

RAG (retrieval-augmented generation) workloads often involve injecting large document chunks into context. MSA's compression of mid-range context is specifically well-suited to this use case — the retrieved documents occupy the mid-range memory tier, compressed efficiently, while the active generation focuses on the recent context layer.

Scenario 3: You want to run 70B models on consumer hardware

This is where it gets speculative but interesting. A 70B model in Q4_K_M takes approximately 40GB of VRAM for weights alone. No consumer GPU handles that without CPU offloading. But if MSA reduces KV cache requirements by 70–80% at long contexts, it reduces the overhead added on top of those 40GB — making dual-GPU setups or APU-based systems (like the ASRock AI BOX-A395) more viable for sustained long-context 70B inference.

For a full picture of how much VRAM you actually need to run 70B models, see our dedicated guide.


The Path to Production

MSA is a research architecture. The gap between research architecture and production runtime integration is non-trivial. Here's the realistic timeline:

Near term (3–6 months): Independent researchers attempt reproduction. If results hold, expect implementations in experimental llama.cpp forks. Integration into the main llama.cpp branch typically takes 2–4 months after a solid reference implementation exists.

Medium term (6–12 months): Ollama and LM Studio begin supporting models fine-tuned with MSA attention. GGUF format extensions may be required to store the compression functions alongside the model weights.

Longer term (12–24 months): Model families trained natively with MSA become available for download. This is where the full efficiency gains materialize — current models would need to be retrained with MSA, not just have it applied post-hoc to standard attention weights.

For hardware buying decisions right now: MSA doesn't change what you should buy today. But it's a strong signal that 16GB VRAM cards will remain relevant longer than the raw parameter count growth curve might suggest. Architecture improvements are extending the useful life of current VRAM tiers.


What to Watch For

  • EverMind's GitHub repository (released alongside the paper) — watch the issues tracker for reproduction attempts from independent researchers
  • llama.cpp issue tracker — any PR tagged "long context" or "KV compression" in the next 60 days likely references or builds on this work
  • Model releases from EverMind — if they release models fine-tuned with MSA natively, these will be worth testing immediately

This is the kind of research that quietly reshapes what hardware is viable over an 18-month window. File it away, watch the reproduction results, and revisit when llama.cpp forks start shipping experimental MSA support.

