How we contain Claude across products
…1 The second approach to capping the blast radius—and the focus of much of this post—is containment. Rather than supervising what the agent does, we supervise what it’s able…
Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were: (1) Our post-training process was accidentally encouraging this behavior with misaligned rewards. (2) This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it. We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. T
Teaching Claude why
…Public Policy focuses on the areas where Anthropic has defined priorities and perspectives, including model safety and transparency, energy ratepayer protections, infrastructure investments, export controls, and democratic leadership in AI. Sarah Heck…
Discover how Anthropic approaches the development of reliable AI agents. Learn about our research on agent capabilities, safety considerations, and technical framework for building trustworthy AI.
…AI Safety Institute. In January 2026, Kanika Bahl stepped down to begin a new nonprofit, the AI Access Initiative, and Zach Robinson stepped down to focus on non-profit and philanthropic work…
…a trusted method of confirming whether an AI agent’s output actually achieves its goal. Task verifiers give the agent real-time feedback as it explores a codebase, allowing it to iterate…
…Interestingly, the topic of conversation was no longer strictly focused on Claude's excellent coding capabilities. Don't get me wrong, people were still relying on Claude for coding. But somewhere along…
…While increasing numbers of researchers in China’s AI labs and policy community are concerned with AI safety risks, this trend has not translated into safety practices on par with labs in…
…6 Finally, in a world where larger fractions of economic activity are autonomously managed by AI agents, odd scenarios like this could have cascading effects—especially if multiple agents based on similar…
…promptText }], agent: input.agent, model: { providerID, modelID }, }, }); Each sub-reviewer runs in its own OpenCode session with its own agent prompt. The coordinator doesn't see or control what tools the sub…
…we saw sycophantic behavior in 38% of conversations focused on spirituality, and 25% of conversations on relationships. We chose to focus model training efforts on relationship guidance as the domain with the…