TL;DR
An acceleration technique that speeds up large language model inference by using a smaller, faster model to draft tokens that are then verified in parallel by the main model.
Speculative Decoding utilizes a draft-then-verify paradigm to minimize the memory bandwidth bottleneck of autoregressive decoding. The lighter draft model generates a sequence of potential tokens quickly, which the target LLM validates in a single forward pass. This maintains the exact same target distribution while achieving substantial latency speedups.
Why this matters for your business
It significantly reduces the operational latency and hosting costs of serving large language models, making real-time interactive AI features much more practical to deploy in production.