On-policy distillation: one of the hottest terms on PapersWithCode [R]
Summary
Niels from Hugging Face announces that on-policy distillation (OPD) has been added to PapersWithCode as a key term. OPD is a post-training technique used in models such as Qwen 3.6, GLM-5.1, and DeepSeek-V4. The method injects hint tokens to discourage specific errors during rollouts, without having to generate new rollouts. The post links a whiteboard explanation video by Sasha Rush and invites suggestions for other methods to add.
Key points
On-policy distillation (OPD) is now listed on PapersWithCode as a hot research term.
OPD is a post-training technique used in recent large models such as Qwen 3.6, GLM-5.1, and DeepSeek-V4.
The method uses hint tokens to penalize specific errors during rollouts, avoiding full regeneration of trajectories.
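The last key point can be illustrated with a minimal sketch. It assumes the commonly described form of on-policy distillation, where the student samples a rollout and a teacher scores each position of that same rollout, with the loss being a per-token reverse KL, KL(student || teacher). The post does not say how the hint tokens are injected, so the sketch further assumes they are added to the teacher's context when it re-scores the existing rollout. All function names and the toy logits are hypothetical, and the three-token vocabulary is only for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def opd_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL over one rollout the student already sampled.

    Assumption: teacher_logits come from scoring the SAME rollout tokens,
    with hint tokens added to the teacher's context. Nothing is regenerated.
    """
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Toy rollout with two positions over a three-token vocabulary.
# Position 0: student and hinted teacher agree.
# Position 1: the hinted teacher strongly prefers token 1, the student is unsure.
student_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

loss, per_token = opd_loss(student_logits, teacher_logits)
print("per-token reverse KL:", per_token)
print("mean loss:", loss)
```

In this toy case the penalty is concentrated at the position where the hinted teacher disagrees with the student, which is the sense in which a hint can target a specific error without a fresh rollout.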