Search: agentic tooling

Paper page - LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

…Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks (2026); The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break (2026); AJ-Bench: Benchmarking Agent-as…

Jun 1, 2026

Paper page - A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

…AI-generated summary: Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within…

May 8, 2026

Paper page - Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

…We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide…

May 6, 2026

Paper page - Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

…Observability-Driven Automatic Evolution of Coding-Agent Harnesses (2026); Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use (2026); Continual Harness: Online Adaptation for Self-Improving Foundation Agents (2026); RoboPhD…

May 29, 2026

Paper page - BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

…While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a…

May 8, 2026

Paper page - From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

…AI-generated summary: Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these…

May 11, 2026

Paper page - WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

…Computer-Use Agents Learning Professional Skills via Exploration (2026); EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings (2026); OccuBench: Evaluating AI Agents on Real-World…

May 6, 2026
