Paper page - EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions
…On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness–model combinations, artifact delivery, visual quality, cost, runtime…
