Showing top 10 results for "AI agents safety"

huggingface.co › papers › 2605.09063

Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

… For questions or model-evaluation requests, contact guijin.son@snu.ac.kr. I wonder whether the class imbalance is too large for objective evaluation. The way Soohak treats refusal as a first-class signal is a clever move, highlighting a real frontier beyond pure… …

May 12, 2026