Search: agentic coding

Paper page - Benchmarking Visual State Tracking in Multimodal Video Understanding

…Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

Jun 3, 2026

Paper page - HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

…Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to…

May 5, 2026

Paper page - Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

…Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen…

Jun 2, 2026

Paper page - FastKernels: Benchmarking GPU Kernel Generation in Production

…We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

May 27, 2026

Paper page - Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

…Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of…

Jun 1, 2026

Paper page - MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

…code RC-7291 → Code_1> Local secure mapping (persistent across sessions): Store the mapping placeholder ↔ original value in a local SQLite DB. Cloud reasoning and memory operations (cloud): The cloud agent…

May 13, 2026

Paper page - LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

…Advancing PRMs for Reinforcing Code Agents (2026) Stabilizing Efficient Reasoning with Step-Level Advantage Selection (2026) Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling (2026) EnvSimBench…

May 11, 2026

Paper page - Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

…Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, limiting their…

Jun 3, 2026

Paper page - Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

…SEIG is an agentic framework that reconstructs 3D scenes from single images by progressively generating executable Blender code, enabling novel-view synthesis, scene…

Jun 2, 2026

Paper page - WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

…Towards Multimodal Web Coding Evaluation for Code Language Models (2026) WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games (2026) WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis (2026…

Jun 4, 2026

Top stories

Paper page - EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Paper page - Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Paper page - Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Paper page - AutoMedBench: Towards Medical AutoResearch with Agentic AI Models