Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
…Sudong Wang, …, Xiaomin Yu, Zuhao Yang, …, Keming Wu, …

Abstract

PRISM addresses distributional drift in multimodal models by inserting a distribution-alignment stage between supervised fine-tuning and reinforcement learning with verifiable rewards…