Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
…On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late…