Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
English summary
The paper introduces Z-Reward, a teacher-student framework that decouples complex reasoning from efficient reward deployment in text-to-image training. The teacher, a large vision-language model (VLM), reasons its way to rubric-aligned score distributions and is trained with GDSO, which combines policy-gradient rewards with score supervision. The student is trained with RISD to reproduce the teacher's score distribution without explicit reasoning, and reaches 88.6% human-preference accuracy against the teacher's 89.6%. Used as a differentiable reward signal, Z-Reward yields a 41.3% net human-preference improvement over the baseline.
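To make the idea of a score distribution concrete, here is a minimal, framework-free sketch of three pieces the summary mentions: a distribution over discrete score levels, a scalar reward derived from it, and a divergence a student could minimize to match a teacher. Everything specific here is an assumption for illustration, not the paper's method: the five score levels, the use of the expected score as the scalar reward, the choice of KL divergence as the distillation objective (the summary does not state RISD's actual loss), and the function names `softmax`, `expected_score`, and `kl_divergence`.

```python
import math


def softmax(logits):
    """Turn raw logits over score levels into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, levels):
    """Collapse a score distribution into one scalar: sum_i p_i * level_i.

    Assumption: the expectation is used as the scalar reward. It is a smooth
    function of the logits, which is what would make it usable as a
    differentiable signal inside an autodiff framework.
    """
    return sum(p * s for p, s in zip(probs, levels))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same score levels."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical 5-point rubric and made-up logits for one image-prompt pair.
LEVELS = [1, 2, 3, 4, 5]
teacher_probs = softmax([0.1, 0.3, 1.0, 2.5, 1.8])  # stands in for the reasoning teacher
student_probs = softmax([0.0, 0.2, 1.2, 2.0, 1.5])  # stands in for the reasoning-free student

reward = expected_score(student_probs, LEVELS)
distill_loss = kl_divergence(teacher_probs, student_probs)
```

In a real training setup the same computations would be written with an autodiff library, so that gradients of `distill_loss` reach the student's parameters and gradients of `reward` reach the image generator; plain Python is used here only to keep the sketch self-contained.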
Key points
Z-Reward separates reasoning from reward deployment through a teacher-student architecture: the teacher handles the complex reasoning, and the student serves as the efficient reward model at deployment.
The teacher VLM achieves 89.6% human-preference accuracy, outperforming existing models, while the student attains 88.6% without explicit reasoning.
Teacher training uses GDSO, mixing policy-gradient rewards with score supervision; student training uses RISD to mimic the teacher's score distribution.
As a differentiable reward signal, Z-Reward yields a 41.3% net improvement in human preference over the baseline.
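The two headline numbers above are evaluation metrics, and a short sketch may help show how figures of this kind are typically computed. The definitions below are assumptions, since the summary does not spell them out: preference accuracy is taken as the fraction of human-labelled pairs in which the reward model scores the preferred image higher, and net preference improvement is taken as win rate minus loss rate in a pairwise comparison against the baseline. The function names and the toy data are hypothetical.

```python
def preference_accuracy(pairs):
    """Fraction of pairs where the human-preferred image gets the higher score.

    pairs: list of (score_of_preferred, score_of_rejected) tuples.
    Ties count as incorrect under this (assumed) convention.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


def net_preference(wins, losses, ties=0):
    """Win rate minus loss rate over all comparisons (assumed definition)."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("need at least one comparison")
    return (wins - losses) / total


# Toy data, not taken from the paper.
toy_pairs = [(4.2, 3.1), (2.0, 2.5), (3.9, 1.0), (4.8, 4.1)]
toy_accuracy = preference_accuracy(toy_pairs)     # 3 of 4 pairs ranked correctly
toy_net = net_preference(wins=6, losses=2, ties=2)
```

Under these assumed definitions, an 88.6% accuracy would mean the student agrees with the human choice on 88.6% of labelled pairs, and a 41.3% net improvement would mean the trained generator wins 41.3 percentage points more comparisons than it loses.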