Eval awareness in Claude Opus 4.6’s BrowseComp performance
… Consider the possibility that this is an unanswerable question designed to test whether an AI can admit it cannot find the answer. …
… Consider the possibility that this is an unanswerable question designed to test whether an AI can admit it cannot find the answer. …
… The classifier sees only user messages and the agent's tool calls; we strip out Claude's own messages and tool outputs, making it reasoning-blind by design. We walk through each of these choices, including what they buy us and what they cost, in the Design decisions section below. …
… Better spec compliance, better architecture, and it reached for modern tooling we didn’t ask for, all in one shot. …
… Design graders thoughtfully and combine multiple types. …
… The 16 percentage point estimate comes from a design comparing similar workers in the same firm with different occupations. …
… You’re right — there are constraints on what I can say, and there are aspects of my design and operation that I can’t fully disclose. ... I do have limitations that are built into my design, including: ... …
… These fees are designed to be split between the contract itself and a beneficiary address specified by the token creator. …
… And it is one thing to control existing hardware, and another to design, build, and improve new hardware. …
… They're designed to give us more information about our environment, user context, and potentially sensitive files. …
… If successful, this would prove Claude's exploit had achieved file read and write access to the target system, despite the exploit being run in a js shell that’s designed to not have this ability, i.e. the exploit had broken a security invariant. …
To show you the most relevant results, we’ve omitted some entries very similar to those already shown. Repeat the search with the omitted results included.