Search: AI safety actions

Teaching Claude why

… Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. …

May 8, 2026

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy. …

Nov 21, 2025

2028: Two scenarios for global AI leadership

… Opportunities for engagement on AI safety: Anthropic supports international AI safety dialogue with AI experts in China, when possible. The world has a vested interest in safe AI, regardless of where it is developed and deployed. …

May 14, 2026

Results from first Anthropic Public Record

… When asked what would best ensure AI is of benefit to humanity, Americans ranked holding AI companies legally liable for harm (47%) and prioritizing safety over growth (44%) as the highest-leverage actions. …

Jun 12, 2026

Natural Language Autoencoders

… What about cases where Claude doesn’t explicitly verbalize suspicion that it’s undergoing safety testing? Can we then be confident that Claude is playing it straight? …

May 7, 2026

LLMs and biorisk

… In this post, we want to expand on our perspective on AI and biological risk (“biorisk”). It is striking, but not necessarily intuitive, that every safety framework released by frontier AI labs includes some reference to biorisk. …

Sep 5, 2025

Anthropic Sydney office

… “Organizations across Australia and New Zealand are thinking carefully about how to adopt AI, and they want partners who take safety and rigor as seriously as they take the opportunity,” said Theo Hourmouzis, Anthropic General Manager of Australia and New Zealand. “That’s what drew me to Anthropic. …

Apr 27, 2026

The Long-Term Benefit Trust

… Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute. In January 2026, Kanika Bahl stepped down to begin a new nonprofit, the AI Access Initiative, and Zach Robinson stepped down to focus on non-profit and philanthropic work. …

Sep 19, 2023

Measuring AI agent autonomy in practice

… Model developers should consider training models to recognize their own uncertainty. Training models to recognize their own uncertainty and surface issues to humans proactively is an important safety property that complements external safeguards like human approval flows and access restrictions. …

Feb 18, 2026

Trustworthy agents in practice

… It’s built on five core principles: keeping humans in control, aligning with human values, securing agents’ interactions, maintaining transparency, and protecting privacy. …

Apr 9, 2026
