TL;DR
An optimization technique that stores the pre-computed attention key-value states of static prompt segments so they do not have to be recomputed, reducing latency and API costs during LLM inference.
Prompt caching works by storing the attention key-value states of frequently used, unchanging prompt segments, such as system instructions, tool schemas, or reference documents. When a new request begins with exactly the same prefix, the model retrieves these pre-computed states from memory instead of processing the entire instruction set from scratch, so only the new, uncached part of the prompt has to be processed. This decreases the Time to First Token (TTFT) and reduces computational overhead, which can yield substantial cost savings on high-volume workloads.
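To make the mechanism concrete, here is a minimal Python sketch of prefix caching. It is a toy model under stated assumptions, not how any particular provider implements it: `fake_kv_states` is a hypothetical stand-in for the real attention computation, `PrefixCache` is an illustrative name, and whitespace splitting stands in for a real tokenizer. The point is only to show that a repeated prefix is computed once and then looked up.

```python
import hashlib
from dataclasses import dataclass, field


def fake_kv_states(tokens):
    """Hypothetical stand-in for computing attention key-value states."""
    return [len(t) for t in tokens]


@dataclass
class PrefixCache:
    """Toy cache keyed by a hash of the exact prefix tokens (illustrative only)."""
    store: dict = field(default_factory=dict)
    tokens_computed: int = 0  # counts tokens that actually had to be processed

    def _key(self, prefix_tokens):
        joined = "\x1f".join(prefix_tokens)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def encode(self, prefix_tokens, query_tokens):
        key = self._key(prefix_tokens)
        hit = key in self.store
        if not hit:
            # Cache miss: process the static prefix once and remember the result.
            self.store[key] = fake_kv_states(prefix_tokens)
            self.tokens_computed += len(prefix_tokens)
        # The user-specific suffix is always processed fresh.
        query_states = fake_kv_states(query_tokens)
        self.tokens_computed += len(query_tokens)
        return self.store[key] + query_states, hit


cache = PrefixCache()
system_prompt = "You are a support agent . Follow the refund policy .".split()

_, first_hit = cache.encode(system_prompt, "Where is my order ?".split())
_, second_hit = cache.encode(system_prompt, "Can I get a refund ?".split())

print(first_hit, second_hit, cache.tokens_computed)
```

In this sketch the first request misses the cache and pays for the full prefix, while the second request reuses the stored states and pays only for its own query tokens. Changing even one token of the prefix produces a different key and therefore a miss, which is why static content is usually placed at the start of the prompt.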
Why this matters for your business
It makes production-grade AI agents and multi-step reasoning systems more commercially viable by cutting prompt-processing costs and keeping responses fast enough for real-time use.