Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
… Systematically Auditing AI Agent Benchmarks with BenchJack (2026)
Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories (2026)
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use (2026)
MOSAIC-Bench: Measuring Compositional Vulnerability Induct…