Harness design for long-running application development
… On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. …
When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down. The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually
Demystifying evals for AI agents
… We've additionally added guidance to our CLAUDE.md to ensure model-specific changes are gated to the specific model they're targeting. For any change that could trade off against intelligence, we'll add soak periods, a broader eval suite, and gradual rollouts so we catch issues earlier. …
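One way to picture model-gated changes combined with a gradual rollout is the sketch below. It is a hypothetical illustration, not the mechanism the excerpt describes: the function names, the placeholder model identifiers, and the hash-bucket scheme are all assumptions.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets.

    The same user always lands in the same bucket, so raising `percent`
    only ever adds users to the rollout. (Hypothetical helper.)
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def change_applies(model: str, target_model: str, user_id: str, percent: int) -> bool:
    """Apply a change only for the model it targets AND only inside the rollout."""
    return model == target_model and in_rollout(user_id, percent)


# "model-a" and "model-b" are placeholder names, not real model identifiers.
enabled = change_applies("model-a", "model-a", "user-123", percent=100)
blocked = change_applies("model-b", "model-a", "user-123", percent=100)
```

Keeping the model check separate from the rollout check means a change aimed at one model cannot leak to another, whatever the rollout percentage.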
… Use it when: tool definitions consuming 10K tokens; experiencing tool selection accuracy issues; building MCP-powered systems with multiple servers; 10+ tools available. Less beneficial when: small tool library … budget["travel limit"]: exceeded.append({"name": member["name"], "spent": total, "limit": b… …
… An overview of approaches for understanding AI agent performance. Method: Automated evals (running tests programmatically without real users). Pros: faster iteration; fully reproducible; no user impact; can run on every commit; tests scenarios at scale without requiring a prod deployment. Cons: Requires more u… …
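As a rough illustration of "running tests programmatically without real users", the sketch below runs a fixed set of cases against an agent function and reports a pass rate. `EvalCase`, `run_evals`, `pass_rate`, and `toy_agent` are hypothetical names introduced here, and the toy agent is a deterministic stand-in rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against the agent and record pass/fail by case name."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 when there are no cases)."""
    return sum(results.values()) / len(results) if results else 0.0


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for an agent: deterministic, so runs are reproducible."""
    return prompt.upper()


cases = [
    EvalCase("uppercases", "hello", lambda out: out == "HELLO"),
    EvalCase("keeps_length", "abc", lambda out: len(out) == 3),
    EvalCase("adds_greeting", "bob", lambda out: out.startswith("Hi")),
]

results = run_evals(toy_agent, cases)
```

Because the cases are plain data and the runner has no user-facing side effects, a suite like this could run on every commit; the third case fails on purpose to show how a regression would surface in `results`.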
… This creates a filtering mechanism where Claude handles routine inquiries, leaving colleagues to address more complex, strategic, or context-heavy issues that exceed AI capabilities: “It has reduced my dependence on my team by 80%, but the last 20% is crucial and I go and talk to them.” …
… In headless mode (claude -p), there is no UI to ask the human, so we instead terminate the process. …
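The behavior described above can be sketched as a small decision function: ask the human when a UI exists, otherwise stop. This is an assumed illustration, not the tool's actual implementation; `resolve_permission`, its parameters, and the exit code 2 are hypothetical.

```python
import sys
from typing import Callable


def resolve_permission(
    action: str,
    interactive: bool,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True if the human approves `action`.

    In a non-interactive (headless) run nobody can answer the question,
    so the process is terminated instead of guessing. (Hypothetical helper.)
    """
    if not interactive:
        print(f"cannot request approval for {action!r} in headless mode", file=sys.stderr)
        raise SystemExit(2)  # assumed exit code, chosen for illustration
    answer = ask(f"Allow {action}? [y/N] ")
    return answer.strip().lower() == "y"
```

Terminating rather than defaulting to "yes" or "no" keeps a headless run from silently taking an action the human never approved, or from silently skipping one they expected.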