Search: Tooling comparisons

Donating our open-source alignment tool

…We’ve now integrated Petri with our other open-source alignment tool, Bloom, which can perform much more in-depth assessments of specific chosen behaviors (in comparison to Petri’s wider-ranging…

May 7, 2026

Coding agents in the social sciences

…Claude Code is the most common coding agent tool reported, with 86% of users reporting Claude Code use (31% report using Codex, the next most common tool). Adoption is highly uneven Figure…

May 27, 2026

Introducing Sonnet 4.6

… Better spec compliance, better architecture, and it reached for modern tooling we didn’t ask for, all in one shot. …

Feb 17, 2026

How AI Is Transforming Work at Anthropic

… This provides a baseline for team-specific comparisons. …

Dec 2, 2025

Building AI for cyber defenders

… These enable clear comparisons across models, measure the speed of AI progress, and—especially in the case of novel, externally developed evaluations—provide a good metric to ensure that we are not simply teaching to our own tests. …

Oct 3, 2025

Agentic coding and persistent returns to expertise

…Over those seven months, the value of the typical task, which we estimate through a comparison to freelance job postings, rose in almost every kind of work—about 25% on average. Introduction…

Jun 16, 2026

Measuring AI agent autonomy in practice

…an agent is an AI system equipped with tools that allow it to take actions, like running code, calling external APIs, and sending messages to other agents. Studying the tools that…

Feb 18, 2026

Vibe physics: The AI grad student

…I’ve been working with modern machine learning tools for over a decade. My first modern ML paper, from 2016, was an early application of deep learning to particle physics. In a…

Mar 23, 2026

Claude Fable 5 and Claude Mythos 5

… In blinded head-to-head comparisons against Opus-class models, our scientists preferred Mythos’s molecular biology hypotheses ~80% of the time, and have advanced several to experimental evaluation. …

Jun 9, 2026

Harness design for long-running application development

…Contract criterion: Rectangle fill tool allows click-drag to fill a rectangular area with selected tile. Evaluator finding: FAIL. Tool only places tiles at drag start/end points instead of filling the…

Mar 24, 2026
