Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
…math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator performs comparably to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging…