WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
… Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. …