Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
…The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far…
