Showing top 61 results for "Agent safety research"

People also ask

Why does agentic misalignment happen?

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: (1) Our post-training process was accidentally encouraging this behavior with misaligned rewards. (2) This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. …

Teaching Claude why

What safety risks?

If you’re willing to entertain the views outlined above, then it’s not very hard to argue that AI could be a risk to our safety and security. There are two common sense reasons to be concerned. First, it may be tricky to build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers. To use an analogy, it is easy for a chess grandmaster to detect bad moves in a novice but very hard for a novice to detect bad moves in a grandmaster. If we build an AI system that’s significantly more competent than human

Core views on AI safety: When, why, what, and how

Australian government and Anthropic sign MOU for AI safety and research

Today, Anthropic signed a Memorandum of Understanding with the Australian government to cooperate on AI safety research and support the goals of Australia’s National AI Plan. …

Mar 31, 2026

Teaching Claude why

… Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. We use agentic misalignment as a case study to highlight some of the techniques we found to be surprisingly effective. …

May 8, 2026

Core views on AI safety: When, why, what, and how

… Rather than betting on a single possible scenario from the list above, we are trying to develop a research program that could significantly improve things in intermediate scenarios where AI safety research is most likely to have an outsized impact, while also raising the alarm in pessimistic scenar…

Mar 8, 2023

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy. …

Nov 21, 2025

Claude Opus 4.6

… We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. …

Feb 5, 2026

Focus areas for The Anthropic Institute

… Our agenda focuses on four areas for research: Economic diffusion; Threats and resilience; AI systems in the wild; AI-driven R&D. In Core Views on AI Safety, we wrote that doing effective safety research required close contact with frontier AI systems. …

May 7, 2026

Advancing Claude in healthcare and the life sciences

… Claude's Agent SDK has unlocked a step-change in how we operate—converting rigid research processes into adaptive, compliant agents. …

Jan 11, 2026

Claude Fable 5 and Claude Mythos 5

… Safety classifiers: The frontier cybersecurity and research biology capabilities of Mythos-class models mean that they pose a substantial risk of uplift to malicious actors. …

Jun 9, 2026

Introducing Claude Opus 4.5

… Opus 4.5 is also very effective at managing a team of subagents, enabling the construction of complex, well-coordinated multi-agent systems. In our testing, the combination of all these techniques boosted Opus 4.5’s performance on a deep research evaluation by almost 15 percentage points. …

Nov 24, 2025

Anthropic opens Seoul office and announces new partnerships across the Korean AI ecosystem

… Anthropic will provide Claude access to up to 60 NAIRL-affiliated researchers, supporting work on AI safety, model evaluation, alignment, robustness, and broader frontier AI research. …

Jun 17, 2026
