ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
…Standard policy gradients pick up on this signal immediately, and within a few hundred updates the model converges to generating identical maximum-length paths for every user, with near-zero diversity. High…
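The collapse described above can be reproduced in a toy setting. The sketch below is not the paper's setup: it assumes, hypothetically, that the signal in question is a reward that grows with path length, and it uses a context-free softmax policy over path lengths (so every user trivially receives the same distribution). The names `train_length_policy`, `max_len`, `steps`, and `lr` are illustrative only. For determinism it applies the exact expected policy gradient rather than a sampled estimate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def train_length_policy(max_len=5, steps=500, lr=2.0):
    # ASSUMPTION: reward is proportional to path length. This stands in for
    # the length-favoring signal and is not taken from the paper.
    rewards = [(k + 1) / max_len for k in range(max_len)]
    logits = [0.0] * max_len  # start from a uniform policy over lengths
    entropy_history = []
    for _ in range(steps):
        probs = softmax(logits)
        entropy_history.append(entropy(probs))
        expected_reward = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward w.r.t. logit i:
        #   p_i * (r_i - E[r])
        logits = [
            theta + lr * p * (r - expected_reward)
            for theta, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits), entropy_history


final_probs, entropy_history = train_length_policy()
print("P(max length):", round(final_probs[-1], 3))
print("entropy at start:", round(entropy_history[0], 3))
print("entropy at end:", round(entropy_history[-1], 3))
```

Under these assumptions, probability mass shifts steadily toward the longest path and the policy's entropy falls from its uniform starting value, which mirrors the loss of diversity described in the text. A sampled REINFORCE estimator would follow the same trend with added noise.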