AI Safety Evaluation

Definition

AI safety evaluation systematically tests AI systems for dangerous capabilities, misuse potential, and alignment with human values.

What's Tested: - Harmful content generation - Bias and fairness - Deception and manipulation - Dangerous information - Jailbreak resistance - Capability thresholds

**Evaluation Types:**
Red Teaming: Adversarial testing by humans
Automated Evals: Scalable benchmark testing
Capability Evals: Measuring dangerous abilities
Alignment Evals: Testing value alignment

Organizations: - METR (Model Evaluation & Threat Research) - Apollo Research - ARC Evals - Internal safety teams

Examples

Testing whether a model will help users create biological weapons when asked indirectly.

Related Terms

AI Alignment

Ensuring AI systems behave in accordance with human values and intentions.

Red Teaming

Testing AI systems by deliberately trying to make them fail or produce harmful outputs.

AI Safety

Research field focused on ensuring AI systems are beneficial and don't cause harm.

Definition

Examples

Related Terms

Want more AI knowledge?

Discussion