Search: AI agent debugging

Paper page - AcademiClaw: When Students Set Challenges for AI Agents

… The following papers were recommended by the Semantic Scholar API: Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks (2026); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (2026); AlphaEval: Evaluating Agents in Production (2026); ClawBench: C… …

May 5, 2026

Paper page - AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

… But real medical-AI research is a workflow, and failures are often hidden inside that workflow. ❕The solution: AutoMedBench provides workflow-aware evaluation of LLM agents in AutoResearch tasks, covering a wide range of tasks from segmentation, image enhancement, VQA, report generation and l… …

Jun 3, 2026

Paper page - Code World Model Preparedness Report

… Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety (2026); Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework (2026); CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments 2… …

May 6, 2026

Paper page - Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

… Even advanced AI agents rely on message-exchange formats, successively exchanging messages with users, systems, themselves (i.e., chain-of-thought), and tools in a single stream of computation. …

May 13, 2026

Paper page - InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

… The following papers were recommended by the Semantic Scholar API: HATS: Hardness-Aware Trajectory Synthesis for GUI Agents (2026); WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models (2026); GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents (2026)… …

May 1, 2026

Paper page - MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

… The following papers were recommended by the Semantic Scholar API: Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation (2026); MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models (2026); MemFail: Stress-Testing Failure Modes of … …

May 28, 2026

We Got Claude to Fine-Tune an Open Source LLM

… I found the explanation of Hugging Face’s “Skills Training” initiative particularly eye-opening: it lets you use a coding agent such as Claude Code, or other supported agents, to fine-tune large language models, submit GPU jobs, monitor progress and push trained models to the Hub. …

Oct 14, 2025 · ben burtenshaw
