Throughput
The number of tokens or requests an inference system can process per unit of time.
Full Definition
Throughput measures an inference system's capacity — typically expressed in tokens per second (TPS) or requests per second (RPS). It is the counterpart to latency: optimising for throughput (batching many requests together to maximise GPU utilisation) often increases per-request latency, while optimising for latency (serving each request immediately) reduces throughput. High-throughput systems are critical for production APIs serving many concurrent users. Continuous batching, tensor parallelism, and KV cache management are key throughput optimisation techniques. Throughput and latency must be co-optimised against the application's SLA requirements.
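The throughput/latency trade-off can be sketched numerically with a toy cost model. The constants below are illustrative assumptions, not measurements: each decode step is modelled as a fixed overhead plus a small marginal cost per sequence, so larger batches amortise the overhead and raise throughput while each step (and thus per-request latency) gets slower.

```python
# Toy cost model of batched decoding (illustrative numbers, not benchmarks).
FIXED_MS = 20.0    # assumed per-step overhead (kernel launch, scheduling)
PER_SEQ_MS = 1.0   # assumed marginal cost per sequence in the batch

def step_time_ms(batch_size: int) -> float:
    """Time for one decode step producing one token per sequence."""
    return FIXED_MS + PER_SEQ_MS * batch_size

def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput across the whole batch."""
    return batch_size / step_time_ms(batch_size) * 1000.0

for b in (1, 8, 32):
    print(f"batch={b:2d}  step={step_time_ms(b):5.1f} ms  "
          f"throughput={tokens_per_second(b):6.1f} tok/s")
```

Under this model, batch size 32 yields roughly 13× the throughput of batch size 1, but each decode step takes about 2.5× longer — the trade-off described above in miniature.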
Examples
An inference cluster serving 10,000 tokens per second across all concurrent users, allowing 100 simultaneous long-form generation requests.
Increasing batch size from 1 to 32 in an offline text classification pipeline to maximise GPU utilisation and achieve 20× throughput improvement.
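A measurement harness for the offline-pipeline example above can be sketched as follows. `classify_batch` is a hypothetical stand-in for a real batched model call; with a real model, comparing the printed rates at batch sizes 1 and 32 would quantify the throughput gain.

```python
import time

def classify_batch(texts):
    # Placeholder for a real batched forward pass; returns one label per input.
    return ["label" for _ in texts]

def measure_rps(texts, batch_size):
    """Process all texts in chunks of batch_size and return requests/sec."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        classify_batch(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed

docs = ["example document"] * 256
for b in (1, 32):
    print(f"batch_size={b:2d}: {measure_rps(docs, b):,.0f} requests/sec")
```

With the no-op placeholder the absolute numbers are meaningless; the point is the harness shape — time the whole run, divide request count by elapsed wall-clock time.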