Showing top 10 results for "Tooling comparisons"

Paper page - OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

… It first measures how well different models and agent frameworks handle real downstream tasks — with and without skill augmentation — and then runs controlled, same-task comparisons across community-contributed skills, logging quality alongside token and time cost. …

Jun 1, 2026

Paper page - Function2Scene: 3D Indoor Scene Layout from Functional Specifications

… Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. …

Jun 1, 2026

Paper page - RewardHarness: Self-Evolving Agentic Post-Training

… This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. …

May 14, 2026

Paper page - InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

…Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level…

May 13, 2026

Paper page - KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

…basic spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints, disentangled from perception, language understanding, and application-specific complexity. Empirical evaluation shows that existing methods struggle to…

May 7, 2026

Paper page - REPOT: Recoverable Program-of-Thought via Checkpoint Repair

… We also release Derail-550 — the first benchmark to fix the failure point across recovery methods, so cross-method comparisons become causal rather than correlational. …

May 29, 2026

Paper page - AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

… These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. …

May 28, 2026

Paper page - EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

…Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass…

May 14, 2026

Paper page - Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

…this line up with differentiable simulation and tool-use work, and a direct comparison here could clarify what external planning actually buys on real visual tasks. …

May 1, 2026

Open R1: Update #2

…it also serve as a tool to validate whether the answers are correct? Great work! I have some questions about the values in the performance comparison table. According to DeepSeek's paper…

Feb 6, 2025
