Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
…To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math…