Showing top 6 results for "UI and reliability issues"

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually

Demystifying evals for AI agents

Harness design for long-running application development

… On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. …

Mar 24, 2026

An update on recent Claude Code quality reports

… We've additionally added guidance to our CLAUDE.md to ensure model-specific changes are gated to the specific model they're targeting. For any change that could trade off against intelligence, we'll add soak periods, a broader eval suite, and gradual rollouts so we catch issues earlier. …

Apr 23, 2026

Introducing advanced tool use on the Claude Developer Platform

… Use it when: tool definitions consuming 10K tokens; experiencing tool selection accuracy issues; building MCP-powered systems with multiple servers; 10+ tools available. Less beneficial when: small tool library … budget "travel limit" : exceeded.append { "name": member "name" , "spent": total, "limit": b… …

Nov 24, 2025

Demystifying evals for AI agents

… An overview of approaches for understanding AI agent performance. Method: Automated evals (running tests programmatically without real users). Pros: faster iteration; fully reproducible; no user impact; can run on every commit; tests scenarios at scale without requiring a prod deployment. Cons: requires more u… …

Jan 9, 2026

How AI Is Transforming Work at Anthropic

… This creates a filtering mechanism where Claude handles routine inquiries, leaving colleagues to address more complex, strategic, or context-heavy issues that exceed AI capabilities. “It has reduced my dependence on my team by 80%, but the last 20% is crucial and I go and talk to them.” …

Dec 2, 2025

Claude Code auto mode: a safer way to skip permissions

… In headless mode (claude -p) there is no UI to ask the human, so we instead terminate the process. …

Mar 25, 2026
