Latency
The time delay between sending a request to a model and receiving its first token in response.
Full Definition
In LLM inference, latency is typically broken into two components: time-to-first-token (TTFT), the delay from request submission to the first output token appearing, and inter-token latency, the time between subsequent tokens during streaming. TTFT is dominated by the prefill computation (processing the entire input prompt in parallel); inter-token latency is dominated by the autoregressive decode step (one token at a time). Latency matters enormously for user experience: sub-200ms TTFT feels instant; above 1 second feels slow. Reducing latency involves hardware optimisation (faster GPUs, custom ASIC chips), batching strategies, caching, and smaller/quantised models.
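The two components described above can be measured directly from any streaming response. The sketch below is a minimal illustration, not any particular vendor's API: `fake_stream` is a hypothetical stand-in that sleeps to mimic prefill and decode delays, and `measure_latency` records TTFT and the gaps between subsequent tokens.

```python
import time
from typing import Iterable, List, Tuple

def measure_latency(token_stream: Iterable[str]) -> Tuple[float, List[float]]:
    """Return (TTFT, inter-token latencies) in seconds for a token stream."""
    start = time.perf_counter()
    ttft = None
    gaps: List[float] = []
    prev = None
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start      # delay to the first token (prefill-bound)
        else:
            gaps.append(now - prev)  # gap between tokens (decode-bound)
        prev = now
    return ttft, gaps

def fake_stream(n_tokens: int = 5, prefill_s: float = 0.05, decode_s: float = 0.01):
    # Hypothetical stand-in for a real streaming API: sleeps simulate
    # the prefill phase and per-token autoregressive decode steps.
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(decode_s)
        yield f"tok{i}"

ttft, gaps = measure_latency(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms; mean inter-token: {sum(gaps) / len(gaps) * 1000:.1f} ms")
```

Against a real API, the same logic applies: start the clock at request submission and timestamp each streamed chunk as it arrives.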
Examples
Claude API achieving 300ms TTFT for a 1,000-token prompt on Anthropic's inference infrastructure.
A voice AI product requiring sub-100ms TTFT to maintain natural conversational rhythm — achievable only with specialised inference engines.
Apply this in your prompts
PromptITIN automatically accounts for latency constraints when building better prompts for you.