Training Data
The corpus of examples a model learns from during its training process.
Full Definition
Training data is the collection of examples (text, images, code, or multimodal pairs) used to update a model's weights through gradient descent. For large language models, pretraining data typically comes from web crawls (Common Crawl), books (Books3, Project Gutenberg), code repositories (GitHub), academic papers (arXiv), and curated sources (Wikipedia). The quality, diversity, and recency of training data directly determine a model's capabilities, biases, and knowledge cutoff. Data contamination, where benchmark test questions appear in the training data, is a major concern for evaluating models fairly. Increasingly, post-training alignment data (instruction-response pairs and preference data) is treated as a separate category.
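Contamination checks are often done with n-gram overlap: if a long n-gram from a benchmark question also appears in a training document, the question is flagged. The sketch below is illustrative, not any lab's actual pipeline; the 13-gram window is one commonly used heuristic, and the function names and naive whitespace tokenizer are assumptions for the example.

```python
def ngrams(text, n=13):
    # Naive whitespace tokenization; real pipelines use the model's tokenizer.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(benchmark_item, training_docs, n=13):
    # Flag the benchmark item if any training document shares an n-gram with it.
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
question = "the quick brown fox jumps over the lazy dog near the river bank today again"
print(is_contaminated(question, corpus))  # True: a 13-gram overlaps the corpus
```

Longer windows reduce false positives from common phrases but can miss lightly paraphrased test items, which is why contamination audits often report results at several n-gram sizes.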
Examples
The Pile, an 825 GB open-source training dataset comprising 22 diverse text sources, used to train EleutherAI's GPT-Neo models.
Public code from GitHub repositories being included in Codex's training data, enabling it to write and debug code in more than a dozen programming languages.
Apply this in your prompts
PromptITIN automatically uses techniques like Training Data to build better prompts for you.