Training Data

The corpus of examples a model learns from during its training process.

Full Definition

Training data is the collection of examples (text, images, code, or multimodal pairs) used to update a model's weights, typically through gradient descent. For large language models, pretraining data commonly comes from web crawls (Common Crawl), books (Books3, Project Gutenberg), code repositories (GitHub), academic papers (arXiv), and curated sources (Wikipedia). The quality, diversity, and recency of training data directly determine a model's capabilities, biases, and knowledge cutoff. Data contamination, where benchmark test questions appear verbatim in the training set, is a major concern for evaluating models fairly. Increasingly, post-training alignment data (instruction-response pairs and preference data) is treated as a separate category.
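
To make the contamination concern concrete, here is a minimal sketch of an n-gram overlap check, loosely in the spirit of the overlap analyses published alongside large models; the 5-gram window, whitespace tokenization, and example strings are all illustrative choices, not a fixed standard.

```python
# A minimal sketch of an n-gram contamination check. Window size,
# tokenization, and the example strings are illustrative, not standard.

def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, corpus_docs, n=5):
    """Flag a benchmark item whose n-grams appear verbatim in any corpus doc."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

# Hypothetical corpus document that quotes a benchmark question verbatim:
docs = ["Q: What is the capital of France? A: Paris."]
question = "What is the capital of France?"
print(is_contaminated(question, docs))  # True: a 5-gram matches exactly
```

Real contamination pipelines operate at much larger scale (hashed n-grams, suffix arrays, or Bloom filters), but the underlying idea is the same exact-overlap test.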
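
The post-training categories mentioned at the end of the definition can likewise be illustrated with two toy records; the field names below follow conventions common in open instruction-tuning and preference datasets, but the exact schema is hypothetical and varies by project.

```python
# Toy records for the two post-training data types named above.
# Field names are a common convention, not a fixed schema.

instruction_pair = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Training data is the collection of examples a model learns from...",
    "output": "Training data is the corpus of examples a model's weights are fit to.",
}

preference_pair = {
    "prompt": "Explain data contamination in one sentence.",
    "chosen": "Data contamination is when benchmark test items leak into the training set.",
    "rejected": "Contamination is when your data gets dirty.",
}
```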

Examples

1. The Pile, an 825GB open-source training dataset comprising 22 diverse text sources, used to train EleutherAI's GPT-Neo models (see the loading sketch after this list).

2. Large volumes of public GitHub code included in OpenAI Codex's training data, enabling it to write and debug code in more than a dozen programming languages.
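
As a companion to the first example, the sketch below samples a large pretraining corpus by streaming rather than downloading all ~825GB. It assumes the Hugging Face `datasets` library and a hub mirror of The Pile whose records carry a `meta["pile_set_name"]` field; both should be verified against the current hub listing, since the original `EleutherAI/pile` listing has been taken down in the past.

```python
# A sketch of inspecting a large pretraining corpus via streaming.
# Dataset id and `meta` field layout are assumptions; verify on the hub.
from collections import Counter
from datasets import load_dataset

pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

source_counts = Counter()
for record in pile.take(1_000):  # sample the stream, not the full corpus
    source_counts[record["meta"]["pile_set_name"]] += 1

print(source_counts.most_common(5))
```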

Related Terms

Dataset

A structured collection of data examples used to train, validate, or evaluate a model.

Pretraining

The initial phase of training a model on massive text data to learn general language patterns.

Synthetic Data

Training data generated by AI models rather than collected from human-created sources.