
Speculative Decoding

Inference acceleration technique that uses a smaller draft model to generate candidate tokens, verified in parallel by the main model — reducing effective latency.

Speculative decoding is an inference optimization that accelerates text generation by using a small, fast "draft" model to propose multiple tokens at once, then having the larger "target" model verify them in a single forward pass. If the draft tokens match what the target model would have generated, they're accepted and the process jumps ahead multiple tokens at once. Rejected tokens cause the draft to be discarded from that point, and generation continues normally.
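The draft-then-verify loop can be sketched with toy stand-ins for the two models. Here `draft_next` and `target_next` are hypothetical functions replacing real forward passes: the draft is deliberately wrong at some positions so rejections occur, and the loop shows that the output is identical to what the target alone would produce, while needing fewer target passes.

```python
# Toy sketch of greedy speculative decoding. `draft_next` and
# `target_next` are hypothetical stand-ins for real model forward passes.

def target_next(seq):
    # "Expensive" target model: always predicts last token + 1.
    return seq[-1] + 1

def draft_next(seq):
    # "Cheap" draft model: usually agrees with the target, but is
    # deliberately wrong on multiples of 5 to force occasional rejection.
    nxt = seq[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < n_tokens:
        # 1. Draft proposes k candidate tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One target "forward pass" scores every drafted prefix.
        #    (Simulated with a loop here; on a GPU this is one batch.)
        target_calls += 1
        accepted = 0
        for i in range(k):
            if target_next(seq + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        seq += draft[:accepted]
        if accepted < k:
            # On the first mismatch, emit the target's own token, so the
            # output matches plain target-only decoding exactly.
            seq.append(target_next(seq))
    return seq[len(prompt):][:n_tokens], target_calls
```

Running `speculative_decode([0], 12)` produces the same 12 tokens plain autoregressive decoding would, but with 5 target passes instead of 12, because each accepted batch advances the sequence several tokens at once.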

Why This Works

Standard LLM inference is autoregressive — each token is generated one at a time, and each step requires a full forward pass through the model. This is sequential and hard to parallelize. Speculative decoding breaks this constraint: verifying a batch of draft tokens costs roughly the same as generating a single token, because transformer verification is parallelizable across the sequence.

When draft acceptance rates are high (typically 70–90% for related model families), speculative decoding can deliver 2–3x effective speedup on the same hardware.
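One standard way to model this speedup (following the original speculative-sampling analysis): if each draft token is accepted independently with probability α and k tokens are drafted per step, the expected number of tokens emitted per target forward pass is (1 − α^(k+1)) / (1 − α), where the +1 accounts for the target's own token emitted on a rejection. A small sketch under that independence assumption:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target forward pass, assuming each of
    the k drafted tokens is accepted independently with probability alpha
    (a simplifying assumption; real acceptance is correlated)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 draft tokens per step, each target
# pass yields about 3.36 tokens on average -- consistent with the
# observed 2-3x wall-clock speedup once draft-model overhead is paid.
rate = expected_tokens_per_pass(0.8, 4)
```

The gap between the ~3.4x token rate and the ~2–3x wall-clock speedup is the cost of running the draft model itself.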

Hardware Requirements

Speculative decoding requires running two models simultaneously — the draft model and the target model. Both need to fit in VRAM at the same time. Common pairings:

  • Llama 3.1 70B (target) + Llama 3.2 1B (draft) — the 1B adds ~1GB overhead
  • Qwen 32B (target) + Qwen 1.8B (draft)

The draft model must come from the same model family — sharing the target's tokenizer and vocabulary — to achieve high acceptance rates; a Llama draft model doesn't pair well with a Mistral target.
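A rough VRAM budget for a pairing like the first one above can be sketched as weight-memory arithmetic. The bytes-per-parameter figure is an assumption (roughly what 4-bit quantization with overhead costs), and the estimate ignores KV cache and activations:

```python
def weight_vram_gb(params_billion, bytes_per_param):
    """Rough weight-memory estimate in GB (1B params at 1 byte/param = 1 GB).
    Ignores KV cache, activations, and framework overhead."""
    return params_billion * bytes_per_param

# Illustrative numbers only: ~0.55 bytes/param assumed for 4-bit weights
# including quantization overhead.
target_gb = weight_vram_gb(70, 0.55)  # ~38.5 GB for the 70B target
draft_gb = weight_vram_gb(1, 0.55)    # ~0.6 GB for the 1B draft
```

The point of the arithmetic: the draft model's footprint is marginal next to the target's, which is why the pairing is attractive despite running two models at once.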

When It Helps Most

Speculative decoding provides the largest speedup in generation-heavy scenarios: long document summarization, code generation, and extended response generation. For short responses (under 50 tokens), the overhead of speculative initialization reduces the benefit.

It also helps more on memory-bandwidth-limited hardware (most consumer GPUs) than on compute-limited hardware, because the bottleneck in standard inference is memory bandwidth, and speculative decoding amortizes this cost across multiple accepted tokens.
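The bandwidth argument can be made concrete with a back-of-the-envelope latency model. In the bandwidth-bound regime, a forward pass must stream all the weights from VRAM once, regardless of how many token positions it scores, so its time is roughly weight bytes divided by memory bandwidth. The numbers below are illustrative assumptions, not measurements:

```python
def step_time_s(weight_gb, bandwidth_gbps, tokens_in_batch=1):
    # Bandwidth-bound model: the pass streams all weights once, so the
    # token count deliberately does not appear in the estimate --
    # verifying k+1 positions costs about the same as generating one.
    return weight_gb / bandwidth_gbps

# Illustrative: ~40 GB of quantized weights on a GPU with ~1000 GB/s
# of memory bandwidth -> ~40 ms per pass, whether it scores 1 token or 5.
one_token = step_time_s(40, 1000)
five_tokens = step_time_s(40, 1000, tokens_in_batch=5)
```

This is why accepting several draft tokens per pass translates almost directly into wall-clock speedup: the fixed weight-streaming cost is amortized over every accepted token.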

Software Support

llama.cpp, vLLM, and Hugging Face Transformers all support speculative decoding. Ollama support varies by version. Configuration typically requires specifying the draft model path and, in most implementations, the number of tokens to draft per step.
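As one hedged illustration, a llama.cpp server invocation might look like the sketch below. Flag names have changed across llama.cpp versions and the model filenames here are placeholders, so check `llama-server --help` for your build:

```shell
# Sketch only: -m is the target model, -md/--model-draft the draft model,
# --draft-max the number of tokens to draft per step (names vary by version).
llama-server \
  -m llama-3.1-70b-instruct-q4_k_m.gguf \
  -md llama-3.2-1b-instruct-q4_k_m.gguf \
  --draft-max 8
```

Other frameworks expose the same two knobs under different names (for example, Hugging Face Transformers passes the draft as an assistant model to `generate`).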