Paper page - Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
…We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides to stop sampling or to acquire additional…