Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
… To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. …
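
The cross-rollout idea can be illustrated with a minimal sketch. This is not the paper's implementation: the linear value head (`w`, `b`), the pooled per-rollout feature vectors, and the cyclic pairing of rollout i with rollout i+1 are all assumptions made for illustration. The point it shows is that the baseline subtracted from a rollout's reward is computed only from a different rollout of the same prompt, so it does not depend on that rollout's own sampled actions.

```python
import numpy as np

def cross_rollout_advantages(features, rewards, w, b=0.0):
    """Hypothetical sketch of a cross-rollout baseline.

    features: (G, d) array, one pooled internal-state vector per rollout
              of the same prompt (the pooling is an assumption).
    rewards:  (G,) array of scalar rewards, one per rollout.
    w, b:     parameters of an assumed linear value head.
    """
    features = np.asarray(features, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    if features.shape[0] < 2:
        raise ValueError("need at least two rollouts per prompt")

    # Value predicted from each rollout's own internal states.
    values = features @ np.asarray(w, dtype=float) + b

    # Cyclic shift (an assumed pairing): rollout i uses the prediction
    # made from rollout (i + 1) mod G, never from its own states.
    baselines = np.roll(values, -1)

    return rewards - baselines
```

Because the baseline for rollout i is a function of a separately sampled rollout, the usual argument that a baseline independent of the sampled actions leaves the policy-gradient expectation unchanged still applies, even though the features themselves are trajectory-conditioned.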