SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
…Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false…
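To make the two error types concrete, here is a minimal sketch of how false-positive and false-negative rates could be computed for a binary soundness judge. It is not the benchmark's evaluation code: the function `error_rates`, the `ErrorRates` class, and the toy label lists are all hypothetical and chosen only for illustration. A false positive here means an unsound proposal judged sound; a false negative means a sound proposal judged unsound.

```python
from dataclasses import dataclass


@dataclass
class ErrorRates:
    false_positive_rate: float  # share of unsound proposals judged sound
    false_negative_rate: float  # share of sound proposals judged unsound


def error_rates(gold_sound: list[bool], predicted_sound: list[bool]) -> ErrorRates:
    """Compute both error rates for a binary soundness judge (hypothetical helper)."""
    if len(gold_sound) != len(predicted_sound):
        raise ValueError("gold and predicted label lists must have the same length")

    pairs = list(zip(gold_sound, predicted_sound))
    false_positives = sum(1 for gold, pred in pairs if not gold and pred)
    false_negatives = sum(1 for gold, pred in pairs if gold and not pred)
    unsound_total = sum(1 for gold in gold_sound if not gold)
    sound_total = len(gold_sound) - unsound_total

    # Guard against division by zero when one class is absent.
    fpr = false_positives / unsound_total if unsound_total else 0.0
    fnr = false_negatives / sound_total if sound_total else 0.0
    return ErrorRates(fpr, fnr)


# Hypothetical toy labels, NOT data from the paper.
gold = [True, True, False, False, False, False]
lenient = [True, True, True, True, True, False]      # an optimistic judge
strict = [False, False, False, False, False, True]   # a harsh judge

print(error_rates(gold, lenient))
print(error_rates(gold, strict))
```

On these toy labels the optimistic judge makes only false-positive errors, while the harsh judge makes mostly false-negative errors. Reporting both rates side by side, rather than accuracy alone, makes this kind of trade-off visible.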