Scaling Laws
Empirical relationships describing how model performance improves predictably with more compute, data, or parameters.
Full Definition
Scaling laws, first characterised by Kaplan et al. (2020) and refined in the Chinchilla paper (Hoffmann et al., 2022), describe power-law relationships between model performance (measured as loss) and the three axes of scale: model parameters (N), training data tokens (D), and compute budget (C). The Chinchilla laws show that, for a fixed compute budget, loss is minimised when N and D are scaled proportionally (approximately 20 tokens per parameter). These laws let research labs predict how much performance improvement a larger run will deliver before committing to an expensive training run. They also revealed that many pre-Chinchilla models (including GPT-3) were undertrained relative to their parameter count.
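The compute-optimal split above can be sketched numerically. This uses the common approximation C ≈ 6·N·D FLOPs for training, combined with the Chinchilla rule D ≈ 20·N; both constants are rough approximations rather than exact fitted values from the paper:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget C into parameters N and training tokens D.

    Assumes C = 6 * N * D (a standard approximation) and the
    Chinchilla ratio D = tokens_per_param * N. Then:
        C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")  # ≈ 7.00e10 params, 1.40e12 tokens
```

Plugging in a larger budget shows why the rule matters: 10× the compute implies roughly 3.2× the parameters and 3.2× the tokens, not 10× of either alone.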
Examples
The Chinchilla law predicting that a 70B parameter model should be trained on 1.4 trillion tokens for compute-optimal performance — a ratio of 20 tokens per parameter.
Using scaling law extrapolations to predict that a model trained with 10× the compute budget will achieve a specific perplexity reduction on a validation set.
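The second example can be sketched as a simple power-law fit. The loss values below are hypothetical pilot-run measurements invented for illustration; real labs fit over many carefully controlled training runs:

```python
import math

# Hypothetical (compute FLOPs, validation loss) pairs from small pilot runs.
runs = [(1e19, 3.10), (1e20, 2.75), (1e21, 2.44)]

# A power law L = a * C**(-alpha) is linear in log-log space:
#     log L = log a - alpha * log C
# so fit it with ordinary least squares on the logs.
xs = [math.log(c) for c, _ in runs]
ys = [math.log(l) for _, l in runs]
k = len(runs)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
log_a = my - slope * mx

def predict_loss(compute_flops):
    """Extrapolate the fitted power law to a new compute budget."""
    return math.exp(log_a + slope * math.log(compute_flops))

# Extrapolate to 10x the largest pilot budget before committing to the run.
print(round(predict_loss(1e22), 3))
```

The slope is negative (loss falls as compute grows), and the extrapolated loss at 10× compute drops by a predictable multiplicative factor, which is exactly the kind of pre-commitment estimate the example describes.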
Apply this in your prompts
PromptITIN automatically uses techniques like Scaling Laws to build better prompts for you.