
Dataset

A structured collection of data examples used to train, validate, or evaluate a machine learning model.

Full Definition

A dataset is an organised collection of input-output pairs or raw examples used at different stages of the ML lifecycle: pretraining (raw text), instruction tuning (prompt-response pairs), RLHF (ranked responses), and evaluation (benchmark questions with ground-truth answers). Dataset quality, diversity, and size are among the most important determinants of model capability. Common language model pretraining datasets include Common Crawl, The Pile, and RedPajama. Data curation (removing low-quality content, deduplicating, and ensuring balanced domain and demographic coverage) has become as important as model architecture in determining final model quality.
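The curation steps above can be sketched in a few lines. This is a minimal, illustrative pipeline: exact deduplication via hashing plus a crude length-based quality filter. Production pipelines typically use fuzzy deduplication (e.g. MinHash) and learned quality classifiers instead; the function name and thresholds here are assumptions for the sketch.

```python
import hashlib

def curate(examples, min_words=5):
    """Drop very short examples and exact duplicates (after whitespace/case normalisation)."""
    seen = set()
    kept = []
    for text in examples:
        normalized = " ".join(text.split()).lower()
        if len(normalized.split()) < min_words:
            continue  # too short: likely low-quality
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier example
        seen.add(digest)
        kept.append(text)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # duplicate after normalisation
    "ok",                                             # too short
    "Datasets drive model capability as much as architecture does.",
]
print(curate(corpus))  # keeps the first and last entries
```

The hash set makes exact deduplication O(1) per example, which is why even web-scale pipelines start with this step before moving to more expensive fuzzy matching.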

Examples

1. The MMLU benchmark dataset: 14,000+ multiple-choice questions across 57 academic subjects, used to evaluate model knowledge.

2. Anthropic's HH (Helpful and Harmless) dataset: human-ranked conversation pairs used to train Claude via RLHF.
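Concretely, records at each lifecycle stage have different shapes. The sketch below shows one plausible schema per stage; the field names are illustrative assumptions, not a fixed standard (real datasets such as MMLU or HH each define their own formats).

```python
# Illustrative record shapes; field names are assumptions, not a standard.
pretraining_record = {"text": "Raw web text scraped from a page..."}

instruction_record = {
    "prompt": "Summarize the paragraph below.",
    "response": "A short summary...",
}

rlhf_record = {  # preference pair: a human ranked one response above the other
    "prompt": "Explain overfitting.",
    "chosen": "Overfitting is when a model memorises training data...",
    "rejected": "Overfitting means the model is too good.",
}

eval_record = {  # MMLU-style multiple-choice item with a ground-truth answer
    "question": "Which planet is largest?",
    "choices": ["Earth", "Jupiter", "Mars", "Venus"],
    "answer": 1,  # index of the correct choice
}

print(eval_record["choices"][eval_record["answer"]])
```

Such records are commonly serialised one-per-line as JSONL, which streams well and appends cheaply at pretraining scale.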


Related Terms

Pretraining

The initial phase of training a model on massive text data to learn general language patterns.


Training Data

The corpus of examples a model learns from during its training process.


Synthetic Data

Training data generated by AI models rather than collected from human-created sources.
