Source: arXiv | Date: June 10, 2026 | Importance: 4/5

The Role of Feedback Alignment in Self-Distillation

Chinese title: 反馈对齐在自蒸馏中的作用

Abstract

The paper investigates how the design of the context shown to the self-teacher in self-distillation affects reasoning performance. It compares three conditions: a binary reward signal (GRPO), the ground-truth reference solution, and a step-by-step critique aligned with the solver's own reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on average across 12 benchmarks. Per-token advantage analysis shows that step-aligned feedback modifies only the incorrect reasoning steps and leaves correct tokens intact, whereas reference solutions force unnecessary changes at every token. The results indicate that structural alignment between the feedback and the model's reasoning is a critical driver of self-distillation effectiveness.

Key points

The study systematically compares three feedback types for self-distillation: binary reward (GRPO), reference solution, and step-aligned critique.
Step-aligned critique achieves the largest improvement: +16.11 over GRPO and +5.27 over reference-solution conditioning (Avg@12).
Per-token advantage analysis shows that step-aligned feedback corrects only the erroneous reasoning steps and preserves correct behavior.
Conditioning on reference solutions pressures the model to change every token, even correct ones, because an alternative derivation differs from the solver's own trace in phrasing and approach.
Structural alignment between the feedback and the solver's reasoning trace is identified as a key driver of self-distillation effectiveness.
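
The per-token contrast in the key points can be illustrated with a toy sketch. This summary does not give the paper's exact definition of per-token advantage, so the code assumes one common distillation-style signal: the log-probability that the feedback-conditioned self-teacher assigns to each sampled token minus the log-probability the solver assigned. The function names, the tolerance, and all numbers are hypothetical and chosen only to show the qualitative pattern, not to reproduce any reported result.

```python
def per_token_signal(student_logprobs, teacher_logprobs):
    """Assumed per-token signal: teacher log-prob minus student log-prob.

    Both lists hold log-probabilities of the SAME sampled tokens, scored
    once by the solver and once by the feedback-conditioned self-teacher.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both sequences must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def touched_positions(signal, tol=0.05):
    """Indices where the signal magnitude exceeds a (hypothetical) tolerance."""
    return [i for i, a in enumerate(signal) if abs(a) > tol]


# Toy trace of six tokens; tokens 3 and 4 form the erroneous step (made-up data).
student = [-0.2, -0.1, -0.3, -0.4, -0.2, -0.1]

# Self-teacher conditioned on a step-aligned critique: agrees with the solver
# on correct tokens and down-weights only the erroneous step.
teacher_step_aligned = [-0.2, -0.1, -0.3, -2.4, -1.9, -0.1]

# Self-teacher conditioned on a reference solution: prefers a different
# derivation, so every token of the solver's trace is scored lower.
teacher_reference = [-0.9, -0.6, -1.1, -2.0, -1.5, -0.8]

step_signal = per_token_signal(student, teacher_step_aligned)
ref_signal = per_token_signal(student, teacher_reference)

step_touched = touched_positions(step_signal)
ref_touched = touched_positions(ref_signal)

print("step-aligned critique touches tokens:", step_touched)
print("reference solution touches tokens:  ", ref_touched)
```

Under these made-up numbers the step-aligned signal is nonzero only on the erroneous step, while the reference-conditioned signal is nonzero on every token, which mirrors the qualitative claim in the summary.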
