Paper page - WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
…Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex…