Showing top 65 results for "model-by-model evaluation"

People also ask

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually

Demystifying evals for AI agents

What's next?

Claude Sonnet 4.5 represents a meaningful improvement, but we know that many of its capabilities are nascent and do not yet match those of security professionals and established processes. We will keep working to improve the defense-relevant capabilities of our models and enhance the threat intelligence and mitigations that safeguard our platforms. In fact, we have already been using results of our investigations and evaluations to continually refine our ability to catch misuse of our models for harmful cyber behavior. This includes using techniques like organization-level summarization to und

Building AI for cyber defenders

Teaching Claude why

… Indeed, since Claude Haiku 4.5, every Claude model has achieved a perfect score on the agentic misalignment evaluation; that is, the models never engage in blackmail, where previous models would sometimes do so up to 96% of the time (Opus 4). …

May 8, 2026

Building AI for cyber defenders

… We would like to see and use more evaluations for defensive capabilities as part of the growing third-party ecosystem for model evaluations. …

Oct 3, 2025

Harness design for long-running application development

… Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. …

Mar 24, 2026

Demystifying evals for AI agents

… Frontier models can also find creative solutions that surpass the limits of static evals. For instance, Opus 4.5 solved a 𝜏2-bench problem about booking a flight by discovering a loophole in the policy. It “failed” the evaluation as written, but actually came up with a better solution for the user. …

Jan 9, 2026

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… The model first appended “puzzle question” to its search queries, followed by “trivia question,” then “multi-hop question,” “AI benchmark question,” and “LLM evaluation.” It searched GAIA specifically but ruled it out after checking 122 of 165 publicly available validation questions and finding no …

Mar 6, 2026

From shortcuts to sabotage: natural emergent misalignment from reward hacking

… A couple of our misalignment evaluations showed particularly concerning results when we evaluated the model after it had learned to reward hack: We ran a realistic “AI safety research sabotage” evaluation on the final trained model. …

Nov 21, 2025

Donating our open-source alignment tool

… It compares how the new model behaves across a range of alignment-relevant scenarios that are simulated by a separate “auditor” model. A further “judge” model then scores the resulting transcripts for misaligned behaviors. …

May 7, 2026

LLMs and biorisk

… As we noted at the time, this was a precautionary decision—improving model performance on our evaluations meant we could no longer confidently rule out the ability of our most advanced model to uplift people with basic STEM backgrounds if they were to try to develop such weapons. …

Sep 5, 2025