Evaluation
The systematic process of measuring AI model quality, safety, and alignment against defined criteria.
Full Definition
Evaluation is the process of rigorously assessing AI model behaviour across capability, safety, and alignment dimensions before and after deployment. Methods include automated benchmarks (standardised test suites with programmatic scoring), human evaluation (human raters judging quality or preference), LLM-as-a-judge (using a more capable model to score outputs), red-teaming (adversarial failure-finding), and online A/B testing. Good evaluation practice requires clear criteria, representative test distributions, contamination controls, and disaggregated analysis across demographic subgroups and edge cases. Evaluation is foundational to responsible AI deployment: you cannot improve what you cannot measure, and you cannot trust systems you have not rigorously tested.
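The automated-benchmark and disaggregated-analysis ideas above can be sketched in a few lines. This is a minimal illustration, not a real benchmark: the questions, answers, subgroup labels, and the `model_answer` stub are all invented for the example, and a real harness would call an actual model and use a standardised dataset.

```python
# Minimal sketch: exact-match benchmark scoring with disaggregated
# (per-subgroup) accuracy. All data and the model stub are illustrative.
from collections import defaultdict

def model_answer(question: str) -> str:
    """Stand-in for a real model call; returns canned answers."""
    canned = {
        "2 + 2": "4",
        "capital of France": "Paris",
        "5 * 3": "16",  # deliberately wrong, to show scoring in action
    }
    return canned.get(question, "")

def evaluate(dataset):
    """Exact-match accuracy, reported overall and per subgroup."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in dataset:
        hit = model_answer(item["question"]) == item["answer"]
        for group in ("overall", item["subgroup"]):
            total[group] += 1
            correct[group] += int(hit)
    return {g: correct[g] / total[g] for g in total}

dataset = [
    {"question": "2 + 2", "answer": "4", "subgroup": "arithmetic"},
    {"question": "5 * 3", "answer": "15", "subgroup": "arithmetic"},
    {"question": "capital of France", "answer": "Paris", "subgroup": "geography"},
]
scores = evaluate(dataset)
```

Reporting only the "overall" number would hide that this toy model is weaker on arithmetic than geography, which is exactly why disaggregated analysis matters.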
Examples
Anthropic's pre-deployment evaluation suite: running each Claude model version against MMLU, HumanEval, and internal safety red-teaming datasets before release.
Using GPT-4 as a judge to score 1,000 candidate responses on helpfulness (1-5) and factual accuracy (1-5), then correlating with human judgements.
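The LLM-as-a-judge workflow in the second example can be sketched as follows. The `judge_model` function here is a hypothetical placeholder (a deterministic dummy, not a real GPT-4 call), and the human ratings are invented; the point is the shape of the pipeline: score every candidate with the judge, then correlate judge scores against human judgements to validate the judge.

```python
# Sketch of the LLM-as-a-judge pattern: a judge scores responses on a
# 1-5 scale, then judge scores are correlated with human ratings.
# judge_model() is a hypothetical stand-in for a real API call.
import math

def judge_model(response: str) -> int:
    """Placeholder judge. In practice this would prompt a strong model
    with a scoring rubric and parse the returned 1-5 score."""
    return min(5, max(1, len(response) % 5 + 1))  # deterministic dummy

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

responses = [
    "Short answer.",
    "A longer, more detailed answer.",
    "Medium answer here.",
]
judge_scores = [judge_model(r) for r in responses]
human_scores = [2, 5, 3]  # illustrative human ratings, same 1-5 scale

r = pearson(judge_scores, human_scores)
```

A high correlation gives some confidence that the judge can stand in for human raters at scale; a low one means the judge's rubric or the judge model itself needs revisiting before trusting its scores.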
Apply this in your prompts
PromptITIN automatically uses techniques like Evaluation to build better prompts for you.