Search: Verification/benchmarks

Paper page - AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

…An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents (2026) PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models (2026) SEQUOR: A Multi-Turn Benchmark for Realistic…

Jun 5, 2026

Paper page - Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

…Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across…

May 6, 2026

Paper page - SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

…Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning…

May 20, 2026

Paper page - SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

…A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026) WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing (2026) Test-Driven AI Agent Definition (TDAD): Compiling Tool…

May 7, 2026

Paper page - TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

…To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE…

Jun 11, 2026

Paper page - QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

…Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language…

May 27, 2026

Paper page - TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

…Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning…

May 12, 2026

Paper page - SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

…Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data…

Jun 1, 2026

Paper page - Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

…Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable…