Measuring AI agent autonomy in practice
…One of the most widely cited capability assessments is METR’s “Measuring AI Ability to Complete Long Tasks,” which estimates that Claude Opus 4.5 can complete tasks with a 50% success…
Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on dev
Introducing Claude Opus 4.8…
…Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku, in decreasing order of performance. [57] In June 2024, it released Claude 3.5 Sonnet. [58] In May 2025, Anthropic released Claude…
…Richard Lawler May 13 AI cybersecurity updates for MDASH, Mythos, and GPT-5.5. On Wednesday, the AISI, which evaluates AI models for the British government, said both Anthropic’s Claude Mythos…
…On the BIRD benchmark, the tool is well ahead of OpenAI’s GPT-5.5 and Anthropic’s Claude Opus 4.6. The task is particularly demanding because the data is multi-layered and complex business relationships have to be taken into account…
Hi HN, I’m one of the builders of Rayline. Rayline is a Claude Code compatible LLM gateway. It intercepts and overrides Claude Code’s internal routing and lets you route subagent calls to different models instead. For exam…
As an Anthropic fanboy (check my prev. comments), this is the first Opus release where I feel like the model is just not pleasant to talk to, not to mention untrustworthy. The two examples for me where I lost confidence in…
I really wanted to see how far I could go. Can I create a meaningful and complex application, big enough, but without knowing the language? I have 18+ years of experience as a software developer. But I have no experience with…
…The default is Sisyphus, the main orchestrator powered by Claude Opus 4.6. It can plan, delegate, and execute tasks with a 32K budget…
…Our survey came around two months after a flurry of discussion about Claude Code and Opus 4.6 that kicked off in late December of 2025. Yet even among interested respondents who…
…Use generally available frontier models to strengthen defenses now. Current frontier models, like Claude Opus 4.6 (and those of other companies), remain extremely competent at finding vulnerabilities, even if they are…
…At the time this data was collected, Claude Sonnet 4 and Claude Opus 4 were the most capable models available, and capabilities have continued to advance. More capable AI brings productivity benefits…
…On Gray Swan's Agent Red Teaming benchmark, which tests susceptibility to prompt injection, Claude Opus 4.7 holds attack success to roughly 0.1% on single attempts, and around 5–6…
…Claude Opus 4.5 (Preview) (1x) 3. Claude Haiku 4.5 (0.33x) 4. Claude Sonnet 4 (1x) 5. GPT-5.1 (1x) 6. GPT-5.1-Codex-Mini (0.33x) 7…