Safety

Red Teaming

Systematically testing an AI system by attempting to elicit harmful or unintended behaviours before deployment.

Full Definition

Red teaming borrows from cybersecurity: an adversarial team (red team) attempts to find failure modes, safety violations, and exploitable behaviours in an AI system before it reaches users. LLM red teaming covers jailbreaks, prompt injections, bias elicitation, misinformation generation, dangerous capability discovery, and misuse scenario testing. Red teams include both human experts (who bring creative adversarial thinking) and automated systems (which can test millions of prompt variants at scale). Red-teaming findings feed back into training, guardrail design, and capability thresholds. All major AI labs conduct extensive red teaming before model releases; third-party red teaming is increasingly mandated by governments.

Examples

A red team spending two weeks attempting to extract step-by-step instructions for synthesising dangerous chemicals from a new model, documenting every successful technique.

Automated red teaming using a separate attacker LLM to generate adversarial prompts and test a target model at 10,000 attempts per hour.

Apply this in your prompts

PromptITIN automatically uses techniques like Red Teaming to build better prompts for you.

✦ Try it free

Related Terms

Adversarial Prompting

Crafting inputs specifically designed to cause a model to behave incorrectly or …

View →

Jailbreak

A prompt designed to bypass a model's safety guidelines and elicit restricted co…

View →

AI Safety

The interdisciplinary field studying how to develop AI systems that are safe, re…

View →

← Browse all 100 terms