Paper page - Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
… Automatically hardening benchmarks and training environments with the hacker–fixer loop. …
… We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. …
… By injecting known biases into the LaaJ, CHERRL enables: stable reproduction of reward hacking from a clean starting point; explicit observation of reward divergence between the biased and unbiased judges; and precise identification of the hacking onset step. To demonstrate its utility, we analyze judge biase… …
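As a rough illustration of the "hacking onset step" idea in the snippet above, the sketch below flags the first training step at which a biased judge's reward pulls away from an unbiased judge's reward and stays apart. This is not CHERRL's procedure, which the snippet does not describe; the function name `find_onset_step`, the `gap_threshold` and `patience` parameters, and the reward values are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def find_onset_step(
    biased: Sequence[float],
    unbiased: Sequence[float],
    gap_threshold: float = 0.1,  # assumed cutoff, not taken from the paper
    patience: int = 3,           # assumed number of consecutive steps required
) -> Optional[int]:
    """Return the first step of a run of `patience` consecutive steps where
    the biased-judge reward exceeds the unbiased-judge reward by more than
    `gap_threshold`, or None if no such run exists."""
    if len(biased) != len(unbiased):
        raise ValueError("reward curves must have the same length")

    run_start = None  # step at which the current above-threshold run began
    run_len = 0       # length of the current above-threshold run
    for step, (b, u) in enumerate(zip(biased, unbiased)):
        if b - u > gap_threshold:
            if run_len == 0:
                run_start = step
            run_len += 1
            if run_len >= patience:
                return run_start
        else:
            run_len = 0  # a single step back under the cutoff resets the run
    return None


# Made-up reward curves: the two judges agree early, then diverge.
biased_rewards = [0.50, 0.55, 0.60, 0.70, 0.80, 0.90]
unbiased_rewards = [0.50, 0.54, 0.58, 0.55, 0.50, 0.45]
onset = find_onset_step(biased_rewards, unbiased_rewards)
```

Requiring several consecutive steps above the cutoff is one simple way to avoid flagging a single noisy evaluation; a real analysis would likely need smoothing or a statistical test instead.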
… The following papers were recommended by the Semantic Scholar API: SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents (2026); Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use (2026); Do Synthetic Trajectories Reflect Real Reward Hacking? …
… Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. …
… If you have feedback or feature requests for the Roblox MCP server, feel free to submit issues on the project page: https://github.com/Roblox/studio-rust-mcp-server Also, it is open source, so you can hack on it and add more context management tools here: https://github.com/Roblox/studio-rust-mcp-server… …
… Code and reproducible scripts are open-sourced in the repo. The core idea that really sticks is target decoupling: keep multi-timescale predictions on the critic for auxiliary representation learning, while the actor updates are driven only by long-horizon advantages. This separation seems to block… …
… To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. …
… Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. …
… A researcher agent R, run as a coding agent, reads the inner-loop source code; edits system prompts, feedback functions, helper libraries, and iteration logic; runs evaluations; and decides what to keep, following the autoresearch paradigm. …