Search: LLM capabilities

Paper page - PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

…Abstract PhysicianBench evaluates LLM agents on real clinical tasks requiring complex, multi-step workflows within electronic health record environments, revealing significant gaps in current agent capabilities. AI-generated summary We introduce…

May 5, 2026

Paper page - IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

…organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at…

May 13, 2026

Paper page - Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

…Experimental results demonstrate that Skills-Coach achieves significant performance improvements in skill capability across a wide range of categories, highlighting its potential to advance the development of more robust and adaptable LLM…

May 6, 2026

Paper page - FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

…Published on Apr 28. Submitted by Amir on Apr 30. Authors: Amir Saeidi et al. Abstract Failure-Aware Meta-Agentic…

Apr 30, 2026

Paper page - LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

…LLMs by scaling test-time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To…

May 13, 2026

Paper page - CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

…Cheng Qian, Jiayu Liu, et al. Abstract Large language models demonstrate limited creative problem-solving abilities when required to repurpose objects based on affordance reasoning, indicating a gap in current AI capabilities for novel…

May 7, 2026

Paper page - HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

…Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this…

May 7, 2026

Paper page - UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

…AI-generated summary As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context…

May 11, 2026

Paper page - PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

…Published on May 13. Submitted by Mikhail Menschikov on May 14. Skoltech. Authors: Mikhail Menschikov et al. Abstract PersonalAI 2.0 enhances…

May 14, 2026

Paper page - RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

…icult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are…