LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
…Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks (2026)
The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break (2026)
AJ-Bench: Benchmarking Agent-as…