Donating our open-source alignment tool
…We’ve now integrated Petri with our other open-source alignment tool, Bloom, which can perform much more in-depth assessments of specific chosen behaviors (in comparison to Petri’s wider-ranging…
…Claude Code is the most common coding agent tool reported, with 86% of users reporting Claude Code use (31% report using Codex, the next most common tool). Adoption is highly uneven Figure…
… Better spec compliance, better architecture, and it reached for modern tooling we didn’t ask for, all in one shot. …
… This provides a baseline for team-specific comparisons. …
… These enable clear comparisons across models, measure the speed of AI progress, and—especially in the case of novel, externally developed evaluations—provide a good metric to ensure that we are not simply teaching to our own tests. …
…Over those seven months, the value of the typical task, which we estimate through a comparison to freelance job postings, rose in almost every kind of work—about 25% on average. Introduction…
…an agent is an AI system equipped with tools that allow it to take actions, like running code, calling external APIs, and sending messages to other agents. 1 Studying the tools that…
…I’ve been working with modern machine learning tools for over a decade. My first modern ML paper, from 2016, was an early application of deep learning to particle physics. In a…
… In blinded head-to-head comparisons against Opus-class models, our scientists preferred Mythos’s molecular biology hypotheses ~80% of the time, and have advanced several to experimental evaluation. …
…Contract criterion: Rectangle fill tool allows click-drag to fill a rectangular area with selected tile. Evaluator finding: FAIL. Tool only places tiles at drag start/end points instead of filling the…