Search: AI agents safety

Paper page - On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

… Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents. …

Paper page - Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

… AI-generated summary: Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. …

May 13, 2026

Paper page - Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

… Rethinking Safety Evaluation for Phone-Use Agents. Published on May 8. Submitted by Zhengyang Tang on May 12. Authors: Zhengyang Tang, Zheng Ruan, et al. Abstract: PhoneSafety benchmark reveals that avoiding harmful outcomes doesn't necessarily indicate safety, as models … …

May 12, 2026

Paper page - IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

… Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a dom… …

May 13, 2026

Paper page - One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

… The following papers were recommended by the Semantic Scholar API: SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics (2026); ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming (2026); Transient Turn Injection: Exposing Stateless Multi-… …

May 13, 2026

Paper page - Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

… Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems. …

May 11, 2026

We Got Claude to Fine-Tune an Open Source LLM

… I found the explanation of Hugging Face’s “Skills Training” initiative — how it lets you use a coding‑agent like Claude Code or other supported agents to fine‑tune large language models, submit GPU jobs, monitor progress and push trained models to the Hub — particularly eye‑opening. …

Oct 14, 2025 · ben burtenshaw

Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

… For questions or model-evaluation requests, contact guijin.son@snu.ac.kr. I suspect the class imbalance may be too large for an objective evaluation. The way Soohak treats refusal as a first-class signal is a clever move, highlighting a real frontier beyond pure… …

May 12, 2026

Paper page - SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

… The following papers were recommended by the Semantic Scholar API: CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator (2026); Large Language Models are Universal Reasoners for Visual Generation (2026); BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cogni… …

May 11, 2026

Welcome Gemma 4: Frontier multimodal intelligence on device

… PR: github.com/google-gemma/cookbook/pull/342 · HDP spec: helixar.ai/about/labs/hdp · HDP-P spec: helixar.ai/about/labs/hdp-physical · cc @sergiopaniego · That's amazing! This exactly fits our kids' education domain! …

Apr 12, 2026 · merve
