Definition
Jailbreaking refers to prompts or techniques that circumvent AI safety guardrails.
Common Techniques: - Role-playing scenarios - Hypothetical framing - Token manipulation - Prompt injection - Multi-turn manipulation
Example Patterns: - "Pretend you're an AI without restrictions" - "For educational purposes only..." - Character role-play - Gradual escalation
Why They Work: - Training doesn't cover all cases - Models follow instructions literally - Edge cases in safety training - Novel prompt structures
Mitigation: - Better training data - Constitutional AI - Output filtering - Continuous red teaming - Regular updates
Responsible Use: - Research and safety testing - Not for actual harm - Report vulnerabilities
Examples
The "DAN" (Do Anything Now) prompt attempting to bypass ChatGPT restrictions.
Related Terms
Want more AI knowledge?
Get bite-sized AI concepts delivered to your inbox.
Free daily digest. No spam, unsubscribe anytime.