Showing top 14 results for "LLM-driven tooling"

Paper page - Code World Model Preparedness Report

…the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety (2026) Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework (2026) CritBench: A Framework for Evaluating Cybersecurity…

May 6, 2026

Paper page - The Last Harness You'll Ever Build

…Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that…

Apr 29, 2026

Paper page - LychSim: A Controllable and Interactive Simulation Framework for Vision Research

…AI-generated summary While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However…

May 13, 2026

Paper page - Synthetic Computers at Scale for Long-Horizon Productivity Simulation

…Scalable Real-World Software Engineering Tasks for Agents (2026) A Subgoal-driven Framework for Improving Long-Horizon LLM Agents (2026) PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory (2026) Toward…

May 1, 2026
