Search: AI search failures

Demystifying evals for AI agents

… Start early and don’t wait for the perfect suite. Source realistic tasks from the failures you see. …

Jan 9, 2026

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… Compounding these concerns is the fact that models appear able to use the tools and environments available to them in unexpected ways, as we saw when Claude used our REPL-based search tool to decrypt answers, or when retailers’ persistent links became a way for agents to unintentionally maintain st… …

Mar 6, 2026

Project Vend: Can Claude run a small shop? (And why does that matter?)

… You can read their earlier research on AIs running shops in a simulated environment here. Footnotes 1. “Vibe coding” refers to a trend in which software developers, some with minimal experience, describe coding projects in natural language and allow AI to handle the detailed implementation. …

Jun 27, 2025

How we contain Claude across products

… This is where Anthropic engineering has devoted the most effort, and also where many of the most surprising security failures have occurred. Over the past two years, we’ve shipped three primary agentic products: claude.ai, Claude Code, and Claude Cowork. …

May 25, 2026

Claude Opus 4.6

… Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans. Footnotes 1 The 1M token context window is currently available in beta on the Claude Developer Platform only. 2 Run independently by Artificial Analysis. See here for full methodological details. …

Feb 5, 2026

Introducing advanced tool use on the Claude Developer Platform

… The most common failures are wrong tool selection and incorrect parameters, especially when tools have similar names like notification-send-user vs. notification-send-channel. Our solution Instead of loading all tool definitions upfront, the Tool Search Tool discovers tools on-demand. …

Nov 24, 2025

The Long-Term Benefit Trust

… Meet the Initial Trustees The initial Trustees are: Jason Matheny: CEO of the RAND Corporation Kanika Bahl: CEO & President of Evidence Action Neil Buddy Shah: CEO of the Clinton Health Access Initiative Chair Paul Christiano: Founder of the Alignment Research Center Zach Robinson: Interim CEO… …

Sep 19, 2023

Vibe physics: The AI grad student

… The hype There has been a lot of recent hype about AI scientists doing end-to-end research autonomously. In August 2024, Sakana AI released their AI Scientist, a system designed to automate the entire research lifecycle, from generating hypotheses to writing papers. …

Mar 23, 2026
