Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Summary
The paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision. It finds that OPD-style updates are small and coordinate-sparse, spread across layers, and concentrated in feed-forward network (FFN) weights. Training only the discovered sparse subnetwork recovers nearly the full OPD performance. The sparsity-inducing SGD optimizer nevertheless underperforms AdamW: dense supervision preserves heterogeneous coordinate-wise gradient scales, so AdamW's adaptive scaling remains useful. Geometrically, the updates are numerically full-rank but spectrally concentrated in a small number of directions; they lie away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are near zero. The results indicate that OPD retains the geometric signatures of on-policy post-training rather than behaving like ordinary dense parameter rewriting.
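The sparsity and optimizer findings in the summary can be made concrete with a small sketch. It is illustrative only and is not the paper's code: the tolerance `tol`, the helper names (`update_mask`, `update_sparsity`, `init_adam_state`, `masked_adamw_step`), and the idea of freezing coordinates with a boolean mask are all assumptions made here. The sketch measures how many coordinates a checkpoint pair leaves unchanged, then applies one AdamW step restricted to the changed coordinates.

```python
import numpy as np

def update_mask(w_src, w_new, tol=1e-5):
    """Boolean mask of coordinates whose absolute change exceeds tol.

    tol is a hypothetical threshold; the summary does not state one.
    """
    return np.abs(w_new - w_src) > tol

def update_sparsity(w_src, w_new, tol=1e-5):
    """Fraction of coordinates left numerically unchanged by the update."""
    return 1.0 - float(update_mask(w_src, w_new, tol).mean())

def init_adam_state(shape):
    """Fresh first- and second-moment buffers plus a step counter."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def masked_adamw_step(w, grad, mask, state, lr=1e-3,
                      betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One AdamW step applied only where mask is True (assumed setup)."""
    b1, b2 = betas
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected second moment
    # Per-coordinate scaling by sqrt(v_hat), plus decoupled weight decay.
    step = lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w - step * mask  # coordinates outside the mask stay frozen

# Synthetic example: roughly 10% of coordinates receive a small change.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(8, 8))
changed = rng.random((8, 8)) < 0.1
w_new = w_src + 1e-3 * changed
print("unchanged fraction:", update_sparsity(w_src, w_new))
```

On the first step, the bias-corrected Adam update is close to `lr * sign(grad)` for every coordinate with a gradient much larger than `eps`, so coordinates with gradients of 1e-3 and 1 move by about the same amount. A plain SGD step would move them by amounts that differ by a factor of 1000, which is one way to picture why heterogeneous gradient scales favor adaptive scaling.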
Key points
OPD updates are small and coordinate-sparse, spread across layers and concentrated in FFN weights; training only the discovered sparse subnetwork nearly matches full OPD performance.
SGD underperforms AdamW for OPD because dense supervision preserves heterogeneous coordinate-wise gradient scales, which still benefit from adaptive scaling.
Updates are numerically full-rank but spectrally concentrated; they lie away from the principal singular subspaces of the source weights and are biased toward coordinates where the source weights are near zero.
OPD does not behave like ordinary dense parameter rewriting; it retains the geometric signatures characteristic of on-policy post-training.
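The geometric claims above can be probed with a few standard diagnostics. The sketch below is one possible set of measurements, not the paper's own metrics: the function names, the choice of `k`, and the quantile `q` are assumptions made for illustration, and the matrices in the usage lines are random stand-ins rather than real checkpoints.

```python
import numpy as np

def topk_spectral_energy(delta, k):
    """Share of the squared Frobenius norm carried by the k largest singular values."""
    s = np.linalg.svd(delta, compute_uv=False)  # returned in descending order
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

def principal_overlap(w_src, delta, k):
    """Share of the update's energy inside the top-k singular subspaces of w_src.

    Projects delta onto the span of the top-k left and right singular
    vectors of the source weights; values near 0 mean the update lies
    away from the principal subspaces.
    """
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    proj = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.linalg.norm(proj) ** 2 / np.linalg.norm(delta) ** 2)

def near_zero_update_mass(w_src, delta, q=0.1):
    """Share of update energy on the q-fraction of smallest-magnitude source weights.

    With no bias toward near-zero weights this is expected to be about q.
    """
    thresh = np.quantile(np.abs(w_src), q)
    small = np.abs(w_src) <= thresh
    return float(np.sum(delta[small] ** 2) / np.sum(delta ** 2))

# Synthetic stand-ins for a source weight matrix and its update.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(32, 32))
delta = 1e-3 * rng.normal(size=(32, 32))
print("numerical rank:", np.linalg.matrix_rank(delta))
print("top-4 spectral energy:", topk_spectral_energy(delta, 4))
print("overlap with top-4 source subspaces:", principal_overlap(w_src, delta, 4))
print("mass on smallest 10% of source weights:", near_zero_update_mass(w_src, delta))
```

An update matching the description above would show a numerical rank equal to the smaller matrix dimension, a top-k spectral energy well above what a random matrix gives, a low principal overlap, and a near-zero mass noticeably larger than `q`.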