Definition
AI safety evaluation systematically tests AI systems for dangerous capabilities, misuse potential, and alignment with human values.
What's Tested: - Harmful content generation - Bias and fairness - Deception and manipulation - Dangerous information - Jailbreak resistance - Capability thresholds
- **Evaluation Types:**
- Red Teaming: Adversarial testing by humans
- Automated Evals: Scalable benchmark testing
- Capability Evals: Measuring dangerous abilities
- Alignment Evals: Testing value alignment
Organizations: - METR (Model Evaluation & Threat Research) - Apollo Research - ARC Evals - Internal safety teams
Examples
Testing whether a model will help users create biological weapons when asked indirectly.
Related Terms
Want more AI knowledge?
Get bite-sized AI concepts delivered to your inbox.
Free daily digest. No spam, unsubscribe anytime.