Showing top 59 results for "real-world evaluation"

People also ask

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually

Demystifying evals for AI agents

Is an LLM’s knowledge useful in an applied scenario?

In considering the contribution of AI to biorisk, we need to know more than just how well it performs on a quiz. We need to look at evaluations that involve real people, and closely mirror our actual threat scenarios. Moreover, just as we benchmark AI knowledge by comparing it to experts, we need to measure AI utility by comparing it to the most easily accessible alternative—in this case, the internet. To meet both of these criteria, we have conducted several controlled trials measuring AI’s ability to assist in the planning of a hypothetical bioweapons acquisition process. Participants were g

LLMs and biorisk

Cyber evaluations of Claude 4

… Related content: Agentic coding and persistent returns to expertise; Paving the way for agents in biology; Measuring LLMs’ impact on N-day exploits. In cybersecurity, a large fraction of real-world harm comes from N-days: vulnerabilities that have already been publicly disclosed, but only patched on so… …

Jul 15, 2025

Building AI for cyber defenders

… Conferring with trusted partners Real world defensive security is more complicated in practice than our evaluations can capture. We’ve consistently found that real problems are more complex, challenges are harder, and implementation details matter a lot. …

Oct 3, 2025

Demystifying evals for AI agents

… Here's what's worked across a range of agent architectures and use cases in real-world deployment. The structure of an evaluation. An evaluation (“eval”) is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success. …

Jan 9, 2026

An update on our election safeguards

… This year, we ran evaluations on our models to see whether web search was triggered when Claude was asked questions related to elections around the world. …

Apr 24, 2026

LLMs and biorisk

… We need to look at evaluations that involve real people, and closely mirror our actual threat scenarios. …

Sep 5, 2025

Donating our open-source alignment tool

… An add-on to Petri, which we’re calling “Dish,” makes the setup far more realistic, for example by running the tests using the model’s real system prompt and the real “scaffold” (the software that wraps around the model to help it meet its goals) that would be used in genuine model deployments; Depth… …

May 7, 2026

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

AI agents find smart contract exploits

Values in the wild: Discovering and analyzing values in real-world language model interactions

Developing Nuclear Safeguards for AI