Throughput
The number of tokens or requests an inference system can process per unit of time.
Full Definition
Throughput measures an inference system's capacity — typically expressed in tokens per second (TPS) or requests per second (RPS). It is the counterpart to latency: optimising for throughput (batching many requests together to maximise GPU utilisation) often increases per-request latency, while optimising for latency (serving each request immediately) reduces throughput. High-throughput systems are critical for production APIs serving many concurrent users. Continuous batching, tensor parallelism, and KV cache management are key throughput optimisation techniques. Throughput and latency must be co-optimised against the application's SLA requirements.
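The throughput/latency trade-off can be sketched numerically with a toy cost model. The constants below are illustrative assumptions, not measurements: each decode step is modelled as a fixed overhead plus a small marginal cost per sequence, so larger batches amortise the overhead and raise throughput while each step (and thus per-request latency) gets slower.

```python
# Toy cost model of batched decoding (illustrative numbers, not benchmarks).
FIXED_MS = 20.0    # assumed per-step overhead (kernel launch, scheduling)
PER_SEQ_MS = 1.0   # assumed marginal cost per sequence in the batch

def step_time_ms(batch_size: int) -> float:
    """Time for one decode step producing one token per sequence."""
    return FIXED_MS + PER_SEQ_MS * batch_size

def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput across the whole batch."""
    return batch_size / step_time_ms(batch_size) * 1000.0

for b in (1, 8, 32):
    print(f"batch={b:2d}  step={step_time_ms(b):5.1f} ms  "
          f"throughput={tokens_per_second(b):6.1f} tok/s")
```

Under this model, batch size 32 yields roughly 13× the throughput of batch size 1, but each decode step takes about 2.5× longer — the trade-off described above in miniature.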
Examples
An inference cluster serving 10,000 tokens per second across all concurrent users, allowing 100 simultaneous long-form generation requests.
Increasing batch size from 1 to 32 in an offline text classification pipeline to maximise GPU utilisation and achieve 20× throughput improvement.
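A measurement harness for the offline-pipeline example above can be sketched as follows. `classify_batch` is a hypothetical stand-in for a real batched model call; with a real model, comparing the printed rates at batch sizes 1 and 32 would quantify the throughput gain.

```python
import time

def classify_batch(texts):
    # Placeholder for a real batched forward pass; returns one label per input.
    return ["label" for _ in texts]

def measure_rps(texts, batch_size):
    """Process all texts in chunks of batch_size and return requests/sec."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        classify_batch(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed

docs = ["example document"] * 256
for b in (1, 32):
    print(f"batch_size={b:2d}: {measure_rps(docs, b):,.0f} requests/sec")
```

With the no-op placeholder the absolute numbers are meaningless; the point is the harness shape — time the whole run, divide request count by elapsed wall-clock time.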