A→Z
A2ZAI
Back to Glossary
concepts

Jailbreak

Techniques to bypass AI safety measures and get models to ignore restrictions.

Share:

Definition

Jailbreaking refers to prompts or techniques that circumvent AI safety guardrails.

Common Techniques: - Role-playing scenarios - Hypothetical framing - Token manipulation - Prompt injection - Multi-turn manipulation

Example Patterns: - "Pretend you're an AI without restrictions" - "For educational purposes only..." - Character role-play - Gradual escalation

Why They Work: - Training doesn't cover all cases - Models follow instructions literally - Edge cases in safety training - Novel prompt structures

Mitigation: - Better training data - Constitutional AI - Output filtering - Continuous red teaming - Regular updates

Responsible Use: - Research and safety testing - Not for actual harm - Report vulnerabilities

Examples

The "DAN" (Do Anything Now) prompt attempting to bypass ChatGPT restrictions.

Want more AI knowledge?

Get bite-sized AI concepts delivered to your inbox.

Free daily digest. No spam, unsubscribe anytime.

Discussion