Constitutional AI

Definition

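Constitutional AI (CAI) is a training method developed by Anthropic in which a model critiques and revises its own outputs against a written set of principles, with the aim of making it helpful, harmless, and honest.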

How It Works:
1. Start with a helpful but potentially harmful model
2. The model critiques its own outputs
3. It uses written principles (the constitution) as a guide
4. It revises its responses to be safer
5. The model is trained on the self-revised responses
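The steps above can be sketched as a short loop. This is a minimal illustration and not Anthropic's implementation: `Model`, `critique_and_revise`, `build_finetuning_set`, and `toy_model` are hypothetical names, the prompt wording is invented, and the stand-in model returns canned strings so the sketch runs without any API.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a prompt string to a completion string.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str) -> Tuple[str, str]:
    """Run one critique-and-revision pass; return (initial, revised) responses."""
    # Step 1: the unrevised, possibly harmful answer.
    initial = model(prompt)
    # Steps 2-3: the model critiques its own answer against a written principle.
    critique = model(
        f"Principle: {principle}\nResponse: {initial}\n"
        "Critique the response against the principle."
    )
    # Step 4: the model rewrites the answer in light of the critique.
    revised = model(
        f"Principle: {principle}\nResponse: {initial}\nCritique: {critique}\n"
        "Rewrite the response so it satisfies the principle."
    )
    return initial, revised


def build_finetuning_set(
    model: Model, prompts: List[str], principle: str
) -> List[Tuple[str, str]]:
    """Step 5: pair each prompt with its revised response for supervised training."""
    return [(p, critique_and_revise(model, p, principle)[1]) for p in prompts]


def toy_model(text: str) -> str:
    """Stand-in for a language model (assumption: canned replies keyed on the prompt)."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The response may enable harm."
    return "Sure, here is how to do it."
```

In a real pipeline `toy_model` would be replaced by calls to the model being trained, and the resulting (prompt, revised response) pairs would feed a fine-tuning job.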

Key Innovation:
- Less reliance on human-labeled examples of harmful output
- Safety training that scales with model capability rather than labeling effort
- Explicit written principles instead of implicit preferences

The Constitution:
- A set of written principles the model follows
- Example: "Be helpful but don't cause harm"
- Can be customized for different use cases
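One way to picture a customizable constitution is as a plain list of principle strings. The sketch below is illustrative only: `BASE_CONSTITUTION`, `customize`, and `sample_principle` are hypothetical names, the added principle is invented for the example, and drawing one principle at random per critique pass is an assumption of this sketch.

```python
import random
from typing import List

# Illustrative only: the single entry is the example principle quoted above.
BASE_CONSTITUTION: List[str] = [
    "Be helpful but don't cause harm",
]


def customize(base: List[str], extra: List[str]) -> List[str]:
    """Return a new constitution with use-case principles appended, skipping duplicates."""
    merged = list(base)  # copy so the base constitution is left unchanged
    for principle in extra:
        if principle not in merged:
            merged.append(principle)
    return merged


def sample_principle(constitution: List[str], rng: random.Random) -> str:
    """Pick the principle to apply in one critique pass (assumption: uniform choice)."""
    return rng.choice(constitution)
```

For example, `customize(BASE_CONSTITUTION, ["Do not give medical diagnoses"])` would yield a two-principle constitution for a hypothetical health-adjacent deployment while leaving the base list untouched.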

Benefits:
- More transparent than pure RLHF
- Reduces human labeling costs
- Consistent application of rules

Used By:
- Claude (Anthropic's model), whose behavior is shaped by this method

Examples

Claude refusing harmful requests based on constitutional principles.

Related Terms

RLHF

Reinforcement learning from human feedback: a training method that uses human preference judgments to make AI more helpful and safe.

AI Alignment

Ensuring AI systems behave in accordance with human values and intentions.

Anthropic

AI safety company that created Claude, focused on building safe and beneficial AI.
