Evaluation
The systematic process of measuring AI model quality, safety, and alignment against defined criteria.
Full Definition
Evaluation is the process of rigorously assessing AI model behaviour across capability, safety, and alignment dimensions before and after deployment. Methods include automated benchmarks (standardised test suites with programmatic scoring), human evaluation (human raters judging quality or preference), LLM-as-a-judge (using a more capable model to score outputs), red-teaming (adversarial failure-finding), and online A/B testing. Good evaluation practice requires clear criteria, representative test distributions, contamination controls, and disaggregated analysis across demographic subgroups and edge cases. Evaluation is foundational to responsible AI deployment: you cannot improve what you cannot measure, and you cannot trust systems you have not rigorously tested.
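The automated-benchmark and disaggregated-analysis ideas above can be sketched in a few lines. This is a minimal illustration, not a real benchmark: the questions, answers, subgroup labels, and the `model_answer` stub are all invented for the example, and a real harness would call an actual model and use a standardised dataset.

```python
# Minimal sketch: exact-match benchmark scoring with disaggregated
# (per-subgroup) accuracy. All data and the model stub are illustrative.
from collections import defaultdict

def model_answer(question: str) -> str:
    """Stand-in for a real model call; returns canned answers."""
    canned = {
        "2 + 2": "4",
        "capital of France": "Paris",
        "5 * 3": "16",  # deliberately wrong, to show scoring in action
    }
    return canned.get(question, "")

def evaluate(dataset):
    """Exact-match accuracy, reported overall and per subgroup."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in dataset:
        hit = model_answer(item["question"]) == item["answer"]
        for group in ("overall", item["subgroup"]):
            total[group] += 1
            correct[group] += int(hit)
    return {g: correct[g] / total[g] for g in total}

dataset = [
    {"question": "2 + 2", "answer": "4", "subgroup": "arithmetic"},
    {"question": "5 * 3", "answer": "15", "subgroup": "arithmetic"},
    {"question": "capital of France", "answer": "Paris", "subgroup": "geography"},
]
scores = evaluate(dataset)
```

Reporting only the "overall" number would hide that this toy model is weaker on arithmetic than geography, which is exactly why disaggregated analysis matters.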
Examples
Anthropic's pre-deployment evaluation suite: running each Claude model version against MMLU, HumanEval, and internal safety red-teaming datasets before release.
Using GPT-4 as a judge to score 1,000 candidate responses on helpfulness (1-5) and factual accuracy (1-5), then correlating with human judgements.
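The LLM-as-a-judge workflow in the second example can be sketched as follows. The `judge_model` function here is a hypothetical placeholder (a deterministic dummy, not a real GPT-4 call), and the human ratings are invented; the point is the shape of the pipeline: score every candidate with the judge, then correlate judge scores against human judgements to validate the judge.

```python
# Sketch of the LLM-as-a-judge pattern: a judge scores responses on a
# 1-5 scale, then judge scores are correlated with human ratings.
# judge_model() is a hypothetical stand-in for a real API call.
import math

def judge_model(response: str) -> int:
    """Placeholder judge. In practice this would prompt a strong model
    with a scoring rubric and parse the returned 1-5 score."""
    return min(5, max(1, len(response) % 5 + 1))  # deterministic dummy

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

responses = [
    "Short answer.",
    "A longer, more detailed answer.",
    "Medium answer here.",
]
judge_scores = [judge_model(r) for r in responses]
human_scores = [2, 5, 3]  # illustrative human ratings, same 1-5 scale

r = pearson(judge_scores, human_scores)
```

A high correlation gives some confidence that the judge can stand in for human raters at scale; a low one means the judge's rubric or the judge model itself needs revisiting before trusting its scores.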
Apply this in your prompts
PromptITIN automatically uses techniques like Evaluation to build better prompts for you.