Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

English summary

The paper introduces Z-Reward, a teacher-student framework for text-to-image training that decouples complex reasoning from efficient reward deployment. The teacher, a large vision-language model (VLM), reasons explicitly to infer rubric-aligned score distributions and is trained with GDSO, which combines policy-gradient rewards with score supervision. The student is trained with RISD to reproduce the teacher's score distribution without explicit reasoning, reaching 88.6% human-preference accuracy against the teacher's 89.6%. Used as a differentiable reward signal, Z-Reward yields a 41.3% net human-preference improvement over the baseline.
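To make the idea of a score distribution concrete, the following minimal sketch shows how a distribution over discrete rubric scores can be collapsed into a smooth expected score, and how a student distribution can be compared with a teacher distribution. It is an illustration only: the five-level score scale, the probability values, and the use of KL divergence as the matching loss are assumptions made here, not the RISD or GDSO objectives, which this summary does not specify.

```python
import math

# Hypothetical rubric levels; the summary does not state the actual score scale.
SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores=SCORES):
    """Expected score under the distribution; smooth in the probabilities."""
    return sum(p * s for p, s in zip(probs, scores))


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student), used here as an assumed distillation loss."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


# Made-up numbers for illustration; these are not results from the paper.
teacher_probs = [0.05, 0.10, 0.20, 0.40, 0.25]
student_logits = [-1.0, 0.0, 0.8, 1.5, 1.0]

student_probs = softmax(student_logits)
distill_loss = kl_divergence(teacher_probs, student_probs)
reward = expected_score(student_probs)

print(f"teacher expected score: {expected_score(teacher_probs):.2f}")
print(f"student expected score: {reward:.2f}")
print(f"KL(teacher || student): {distill_loss:.4f}")
```

Because the expected score is a weighted sum of probabilities, it changes smoothly with the student's logits, which is one plausible way a score distribution could serve as a differentiable reward.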

Key points

Z-Reward separates reasoning from reward deployment through a teacher-student architecture: the teacher handles the complex reasoning, and the student serves as the efficient reward model at deployment.

The teacher VLM achieves 89.6% human-preference accuracy, outperforming existing models, while the student reaches 88.6% without explicit reasoning.

Teacher training uses GDSO, mixing policy-gradient rewards with score supervision; student training uses RISD to mimic the teacher's score distribution.

As a differentiable reward signal, Z-Reward yields a 41.3% net improvement in human preference over the baseline.
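The two kinds of figures quoted in the key points can be illustrated with a short sketch. It assumes that human-preference accuracy is the fraction of image pairs in which the reward model scores the human-preferred image higher, and that a net improvement is the win rate minus the loss rate in a head-to-head comparison. Both definitions are assumptions, since this summary does not define the metrics, and all numbers below are hypothetical.

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the human-preferred image gets the higher score.

    `pairs` is a list of (score_preferred, score_rejected) tuples.
    Ties count as incorrect under this assumed definition.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


def net_preference(wins, ties, losses):
    """Assumed definition: (wins - losses) / total comparisons."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (wins - losses) / total


# Hypothetical reward-model scores for four annotated pairs.
example_pairs = [(4.2, 3.1), (2.5, 3.0), (3.8, 3.8), (4.9, 1.2)]
accuracy = pairwise_accuracy(example_pairs)

# Hypothetical head-to-head outcome counts, not data from the paper.
net = net_preference(wins=50, ties=30, losses=20)

print(f"pairwise accuracy: {accuracy:.1%}")
print(f"net preference: {net:.1%}")
```

Under these definitions, a net figure can be positive even when many comparisons are ties, so it should be read together with the underlying win and loss rates.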
