HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
…Towards Trustworthy Evaluation of Autonomous Agents (2026)
ClawArena: Benchmarking AI Agents in Evolving Information Environments (2026)
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation (2026)
OccuBench: Evaluating AI Agents…