Social | Source: REDDIT MACHINELEARNING | June 4, 2026 | Importance: 4/5

On-policy distillation: one of the hottest terms on PapersWithCode [R]

English summary

Niels from Hugging Face announces that on-policy distillation (OPD) has been added to PapersWithCode as a key term. OPD is a post-training technique used in models such as Qwen 3.6, GLM-5.1, and DeepSeek-V4. According to the post, the method injects hint tokens to discourage specific errors during rollouts, without having to regenerate the rollouts. The post links a whiteboard explanation by Sasha Rush and invites suggestions for other methods to add.

Key points

On-policy distillation (OPD) is now listed on PapersWithCode as a hot research term.
OPD is a post-training technique used in recent large models like Qwen 3.6, GLM-5.1, and DeepSeek-V4.
The method uses hint tokens to penalize specific errors during rollout, avoiding full regeneration of trajectories.
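
The summary does not spell out the training objective, so the following is only a minimal sketch under stated assumptions. It assumes the commonly described form of on-policy distillation, in which tokens sampled by the student are scored against a teacher distribution with a per-token reverse KL, and it treats the hint-conditioned teacher as a hypothetical stand-in represented by fixed logits. The names `reverse_kl` and `opd_loss` are illustrative and do not come from the post or from any library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def opd_loss(student_rollout_logits, teacher_rollout_logits):
    """Mean per-token reverse KL over an existing student rollout.

    Assumption: the teacher logits come from re-scoring the same rollout
    (for example with a hint prepended to the context), so the rollout
    itself is not regenerated.
    """
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_rollout_logits, teacher_rollout_logits)
    ]
    return sum(per_token) / len(per_token), per_token


if __name__ == "__main__":
    # Toy vocabulary of 3 tokens, rollout of 2 positions (made-up numbers).
    student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
    mean_loss, per_token = opd_loss(student, teacher)
    print(mean_loss, per_token)
```

In this toy setup the second position contributes no loss because student and teacher agree there, while the first position is penalized. This gives a dense per-token signal on a fixed rollout, which is one plausible reading of "discouraging specific errors without regenerating rollouts".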
