WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
…Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model…
