Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
…Measuring Reward Hacking in Long-Horizon Coding Agents (2026)
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use (2026)
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on…
