LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
…Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal…