Scaling Laws
Empirical relationships describing how model performance improves predictably with more compute, data, or parameters.
Full Definition
Scaling laws, first characterised by Kaplan et al. (2020) and refined in the Chinchilla paper (Hoffmann et al., 2022), describe power-law relationships between model performance (measured as loss) and the three axes of scale: model parameters (N), training data tokens (D), and compute budget (C). The Chinchilla laws show that, for a fixed compute budget, loss is minimised when N and D are scaled proportionally (approximately 20 tokens per parameter). These laws let research labs predict how much performance improvement a larger run will deliver before committing to an expensive training run. They also revealed that many pre-Chinchilla models (including GPT-3) were undertrained relative to their parameter count.
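The compute-optimal split above can be sketched numerically. This uses the common approximation C ≈ 6·N·D FLOPs for training, combined with the Chinchilla rule D ≈ 20·N; both constants are rough approximations rather than exact fitted values from the paper:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget C into parameters N and training tokens D.

    Assumes C = 6 * N * D (a standard approximation) and the
    Chinchilla ratio D = tokens_per_param * N. Then:
        C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")  # ≈ 7.00e10 params, 1.40e12 tokens
```

Plugging in a larger budget shows why the rule matters: 10× the compute implies roughly 3.2× the parameters and 3.2× the tokens, not 10× of either alone.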
Examples
The Chinchilla law predicting that a 70B parameter model should be trained on 1.4 trillion tokens for compute-optimal performance — a ratio of 20 tokens per parameter.
Using scaling law extrapolations to predict that a model trained with 10× the compute budget will achieve a specific perplexity reduction on a validation set.
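The second example can be sketched as a simple power-law fit. The loss values below are hypothetical pilot-run measurements invented for illustration; real labs fit over many carefully controlled training runs:

```python
import math

# Hypothetical (compute FLOPs, validation loss) pairs from small pilot runs.
runs = [(1e19, 3.10), (1e20, 2.75), (1e21, 2.44)]

# A power law L = a * C**(-alpha) is linear in log-log space:
#     log L = log a - alpha * log C
# so fit it with ordinary least squares on the logs.
xs = [math.log(c) for c, _ in runs]
ys = [math.log(l) for _, l in runs]
k = len(runs)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
log_a = my - slope * mx

def predict_loss(compute_flops):
    """Extrapolate the fitted power law to a new compute budget."""
    return math.exp(log_a + slope * math.log(compute_flops))

# Extrapolate to 10x the largest pilot budget before committing to the run.
print(round(predict_loss(1e22), 3))
```

The slope is negative (loss falls as compute grows), and the extrapolated loss at 10× compute drops by a predictable multiplicative factor, which is exactly the kind of pre-commitment estimate the example describes.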
Apply this in your prompts
PromptITIN automatically uses techniques like Scaling Laws to build better prompts for you.