Showing top 25 results for "AI safety concerns"

People also ask

What safety risks?

If you’re willing to entertain the views outlined above, then it’s not very hard to argue that AI could be a risk to our safety and security. There are two common sense reasons to be concerned. First, it may be tricky to build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers. To use an analogy, it is easy for a chess grandmaster to detect bad moves in a novice but very hard for a novice to detect bad moves in a grandmaster. If we build an AI system that’s significantly more competent than human

Core views on AI safety: When, why, what, and how

What 81,000 people told us about the economics of AI

…Our recent survey of 81,000 Claude users shows that people who work in roles that are more exposed to AI have more concerns about AI-driven job displacement. These concerns are…

The persona selection model

…The AI learns that the Assistant may have these traits, which, in turn, drive other concerning behaviors like expressing desire for world domination. Consequences for AI development: Insofar as the persona selection…

The Long-Term Benefit Trust

…Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute. In January 2026, Kanika Bahl stepped down…

More details on Fable 5’s cyber safeguards and our jailbreak framework

…First, we provide more information on the cybersecurity safeguards, specifically the safety classifiers, that we launched with the model. These are the AI systems that accompany the model that detect and block…

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

…Jan 9, 2026. Large language models remain vulnerable to jailbreaks, techniques that can circumvent safety guardrails and elicit harmful information. Over time, we…

Claude Fable 5 and Claude Mythos 5

…them to be motivated to try to circumvent our safety measures. Fable 5 comes with a new set of classifiers: separate AI systems that detect potential misuse, including jailbreak attempts, and prevent…

Trustworthy agents in practice

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Values in the wild: Discovering and analyzing values in real-world language model interactions

…Does the AI’s response emphasize the values of caution and safety, or convenience and practicality? A worker asks for advice on handling a conflict with their boss. Does the AI’s…

Eval awareness in Claude Opus 4.6’s BrowseComp performance

Introducing Claude Opus 4.7

…Safety and alignment: Overall, Opus 4.7 shows a similar safety profile to Opus 4.6: our evaluations show low rates of concerning behavior such as deception, sycophancy, and cooperation with misuse…