
Latency

The time delay between sending a request to a model and receiving its first token in response.

Full Definition

In LLM inference, latency is typically broken into two components: time-to-first-token (TTFT), the delay from request submission to the first output token appearing, and inter-token latency, the time between subsequent tokens during streaming. TTFT is dominated by the prefill computation (processing the entire input prompt in parallel); inter-token latency is dominated by the autoregressive decode step (generating one token at a time). Latency matters enormously for user experience: a TTFT under roughly 200ms feels instant, while anything above 1 second feels slow. Reducing latency involves hardware optimisation (faster GPUs, custom ASIC chips), batching strategies, caching, and smaller or quantised models.
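Both components can be measured directly from a streaming response. A minimal sketch in Python, where the `fake_stream` generator is a hypothetical stand-in for a real streaming API call:

```python
import time

def fake_stream(n_tokens=20, ttft=0.3, gap=0.02):
    """Simulated token stream (stand-in for a real streaming API)."""
    time.sleep(ttft)          # prefill: whole prompt processed before first token
    yield "first"
    for _ in range(n_tokens - 1):
        time.sleep(gap)       # autoregressive decode: one token per step
        yield "tok"

def measure_latency(stream):
    """Return (TTFT, mean inter-token latency) in seconds."""
    start = time.perf_counter()
    ttft = None
    last = start
    gaps = []
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start          # time-to-first-token
        else:
            gaps.append(now - last)     # gap between consecutive tokens
        last = now
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_gap
```

Swapping `fake_stream` for a real client's streaming iterator gives the same two numbers for a live endpoint.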

Examples

1. Claude API achieving 300ms TTFT for a 1,000-token prompt on Anthropic's inference infrastructure.

2. A voice AI product requiring sub-100ms TTFT to maintain natural conversational rhythm — achievable only with specialised inference engines.
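As a back-of-envelope check, perceived end-to-end latency is TTFT plus inter-token latency times the remaining output tokens. A small illustration (the numbers are hypothetical, not measured figures):

```python
def total_latency(ttft_s, inter_token_s, n_output_tokens):
    """Estimated time until the full response has streamed:
    TTFT for the first token, then one decode gap per remaining token."""
    return ttft_s + inter_token_s * (n_output_tokens - 1)

# e.g. 100ms TTFT, 20ms/token, 50-token reply:
# 0.1 + 0.02 * 49 = 1.08 seconds end to end
```

This is why streaming matters: the user sees output after 100ms even though the complete answer takes over a second.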


Related Terms

Inference

The process of running a trained model to generate predictions or responses.

Throughput

The number of tokens or requests an inference system can process per unit of time.

Streaming

Sending model output tokens to the client incrementally as they are generated, rather than all at once.