Technical

Throughput

The number of tokens or requests an inference system can process per unit of time.

Full Definition

Throughput measures inference system capacity — typically expressed in tokens per second (TPS) or requests per second (RPS). It is the complement to latency: optimising for throughput (batching many requests together to maximise GPU utilisation) often increases per-request latency, while optimising for latency (serving each request immediately) reduces throughput. High-throughput systems are critical for production APIs serving many concurrent users. Continuous batching, tensor parallelism, and KV cache management are key throughput optimisation techniques. Throughput and latency must be co-optimised according to the application's SLA requirements.
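The batching trade-off described above can be sketched with a toy measurement harness. Everything here is illustrative: `process_batch` is a hypothetical stand-in for one model forward pass, not the API of any particular serving framework.

```python
import time

def measure_throughput(process_batch, requests, batch_size):
    """Process `requests` in batches; report tokens/sec and mean per-batch latency.

    `process_batch` is a hypothetical callable standing in for a model
    forward pass; it takes a list of requests and returns the number of
    tokens generated for that batch.
    """
    total_tokens = 0
    latencies = []
    start = time.perf_counter()
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        t0 = time.perf_counter()
        total_tokens += process_batch(batch)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    tps = total_tokens / elapsed                    # throughput: tokens per second
    mean_latency = sum(latencies) / len(latencies)  # latency cost of batching
    return tps, mean_latency
```

Sweeping `batch_size` with a real model in place of `process_batch` makes the tension concrete: larger batches raise `tps` while `mean_latency` grows, and the right operating point is the one that still meets the application's SLA.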

Examples

1

An inference cluster serving 10,000 tokens per second across all concurrent users, allowing 100 simultaneous long-form generation requests.

2

Increasing batch size from 1 to 32 in an offline text classification pipeline to maximise GPU utilisation and achieve 20× throughput improvement.
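The arithmetic behind the second example can be sketched with hypothetical per-batch GPU timings (50 ms for a single item, 80 ms for a batch of 32 — far from linear scaling, because the GPU is underutilised at batch size 1):

```python
def throughput_tps(batch_size, batch_time_s, tokens_per_item):
    # tokens processed per second at a given batch size
    return batch_size * tokens_per_item / batch_time_s

# hypothetical timings: a batch of 1 takes 50 ms, a batch of 32 takes 80 ms
single = throughput_tps(1, 0.050, 128)    # ~2,560 tokens/sec
batched = throughput_tps(32, 0.080, 128)  # ~51,200 tokens/sec
speedup = batched / single                # ~20x
```

The 20× gain comes entirely from the batch time growing far more slowly than the batch size: processing 32× the work costs only 1.6× the time in this sketch.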


Related Terms

Inference

The process of running a trained model to generate predictions or responses from input data.

Latency

The time delay between sending a request to a model and receiving its first token.

API

An Application Programming Interface that lets developers call AI model capabilities.
