WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
… However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. …