Teaching Claude why
… Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. …
… Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. …
… Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy. …
… Opportunities for engagement on AI safety Anthropic supports international AI safety dialogue with AI experts in China, when possible. The world has a vested interest in safe AI, regardless of where it is developed and deployed. …
… When asked what would best ensure AI is of benefit to humanity, Americans ranked holding AI companies legally liable for harm 47% and prioritizing safety over growth 44% as the highest-leverage actions. …
… What about cases where Claude doesn’t explicitly verbalize suspicion that it’s undergoing safety testing? Can we then be confident that Claude is playing it straight? …
… In this post, we want to expand on our perspective on AI and biological risk biorisk . It is striking—but not necessarily intuitive—that every safety framework released by frontier AI labs includes some reference to biorisk. …
… "Organizations across Australia and New Zealand are thinking carefully about how to adopt AI, and they want partners who take safety and rigor as seriously as they take the opportunity,” said Theo Hourmouzis, Anthropic General Manager of Australia and New Zealand . “That's what drew me to Anthropic. …
… Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute . In January 2026, Kanika Bahl stepped down to begin a new nonprofit, the AI Access Initiative , and Zach Robinson stepped down to focus on non-profit and philanthropic work. …
… Model developers should consider training models to recognize their own uncertainty. Training models to recognize their own uncertainty and surface issues to humans proactively is an important safety property that complements external safeguards like human approval flows and access restrictions. …
… It’s built on five core principles: keeping humans in control, aligning with human values, securing agents’ interactions, maintaining transparency, and protecting privacy. …