LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
…This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B…
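
The positive-only strategy described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `combined_reward`, the additive form, the base `outcome_reward` of 1.0, the `rubric_weight` of 0.5, and the assumption that rubric scores lie in [0, 1] are all hypothetical choices made for illustration. The only property taken from the text is that the rubric term is added solely when the final answer is correct.

```python
def combined_reward(
    is_correct: bool,
    rubric_score: float,
    outcome_reward: float = 1.0,   # hypothetical reward for a correct final answer
    rubric_weight: float = 0.5,    # hypothetical weight on the rubric term
) -> float:
    """Positive-only rubric reward (illustrative sketch, not the paper's code).

    The rubric score is assumed to be normalized to [0, 1].
    """
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score is assumed to lie in [0, 1]")

    # Incorrect answers never receive rubric credit, so a response cannot
    # raise its reward by satisfying the rubric while getting the answer wrong.
    if not is_correct:
        return 0.0

    # Among correct answers, the rubric term separates stronger reasoning
    # from weaker reasoning.
    return outcome_reward + rubric_weight * rubric_score


# Usage: two correct responses are ranked by rubric score,
# while an incorrect response with a high rubric score gets no reward.
rewards = [
    combined_reward(True, 0.8),
    combined_reward(True, 0.2),
    combined_reward(False, 0.9),
]
```

Under these assumptions, every correct response scores at least as high as any incorrect one, which is one way to read how gating the rubric on correctness limits reward hacking.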