model release model_release model_detected builder_relevant

model release | Official | Updated: 14h ago

This system helped us identify this happened for some of our prior Instant and mini models. It additionally affected GPT-5.4 Thinking in less than 0.6% of samples. Out of abundance of caution, we did an in-depth analysis of these cases: they did not seem to reduce

OpenAI, 5/8/2026, 8:19:05 PM

model release | Official | Updated: 17h ago

High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario. https://t.co/JORhSuY4N7

Anthropic, 5/8/2026, 5:52:12 PM

model release | Official | Updated: 17h ago

We experimented with training Claude on examples of safe behavior in scenarios like our evaluation. This had only a small effect, despite being similar to our evaluation. We got further by rewriting the responses to portray admirable reasons for acting safely.

Anthropic, 5/8/2026, 5:52:10 PM

model release | Official | Updated: 17h ago

New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?

Anthropic, 5/8/2026, 5:52:08 PM

model release | Official | Updated: 17h ago

We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong. Read more: https://t.co/0YaRlXhVZb

Anthropic, 5/8/2026, 5:52:09 PM

model release | Official | Updated: 17h ago

We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.

Anthropic, 5/8/2026, 5:52:09 PM