A→Z
A2ZAI
Back to Glossary
concepts

AI Safety Evaluation

Testing and measuring AI systems for potential risks and harmful behaviors.

Share:

Definition

AI safety evaluation systematically tests AI systems for dangerous capabilities, misuse potential, and alignment with human values.

What's Tested: - Harmful content generation - Bias and fairness - Deception and manipulation - Dangerous information - Jailbreak resistance - Capability thresholds

  • **Evaluation Types:**
  • Red Teaming: Adversarial testing by humans
  • Automated Evals: Scalable benchmark testing
  • Capability Evals: Measuring dangerous abilities
  • Alignment Evals: Testing value alignment

Organizations: - METR (Model Evaluation & Threat Research) - Apollo Research - ARC Evals - Internal safety teams

Examples

Testing whether a model will help users create biological weapons when asked indirectly.

Want more AI knowledge?

Get bite-sized AI concepts delivered to your inbox.

Free daily digest. No spam, unsubscribe anytime.

Discussion