Search: Verification/benchmarks

Paper page - Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

…math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging…

May 13, 2026

Paper page - ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

…through negative sample projection, maintaining diversity while outperforming existing methods on multiple benchmarks. Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits…

May 5, 2026

Paper page - From Web to Pixels: Bringing Agentic Search into Visual Perception

…We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding…

May 13, 2026

Paper page - Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

…Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across…

May 6, 2026

Paper page - QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

…Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language…

May 27, 2026

Paper page - SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

…Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning…

May 20, 2026

Paper page - CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

…ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints (2026); From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs (2026); Affordance Agent Harness: Verification-Gated Skill…

May 7, 2026

Paper page - TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

…Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning…

May 12, 2026

Paper page - Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

…Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable…