Search: Verification/benchmarks

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

… To answer this, the research community has built several benchmarks. …

Apr 29, 2026

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… Unlike in the first case, it did no post-hoc verification. …

Mar 6, 2026

Claude for Financial Services

… Access your critical data sources with direct hyperlinks to source materials for instant verification, all in one platform with expanded capacity for demanding financial workloads. …

Jul 15, 2025

Introducing Claude Opus 4.8

… It builds on Opus 4.7 with improvements across benchmarks, and is a more effective collaborator. …

May 28, 2026

Introducing Claude Opus 4.7

… Across our agentic reasoning over data benchmarks, it is the best-performing Claude model for enterprise document analysis. …

Apr 16, 2026

Partnering with Mozilla to improve Firefox’s security

… The Firefox team highlighted three components of our submissions that were key for trusting our results: accompanying minimal test cases, detailed proofs-of-concept, and candidate patches. We strongly encourage researchers who use LLM-powered vulnerability research tools to include similar evidence of ver… …

Mar 6, 2026

Project Glasswing: An initial update

Introducing Claude Opus 4.5

Natural Language Autoencoders

Agents for financial services