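Trust Region On-Policy Distillation: Improving the Stability of Large-Model Distillation with Trust Regions and Outlier Estimation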
English Abstract
Trust Region On-Policy Distillation (TrOPD) is proposed to enhance on-policy distillation for large language models by mitigating instability caused by distribution mismatch between teacher and student. The method integrates trust region constraints, outlier estimation for token-level credit assignment, and off-policy guidance to stabilize policy gradients. Experiments show TrOPD outperforms existing on-policy distillation baselines across mathematical reasoning, code generation, and general-domain benchmarks.
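Chinese Abstract (translated)
Trust Region On-Policy Distillation (TrOPD) is proposed to address the policy-gradient instability that teacher-student distribution mismatch causes in on-policy distillation of large language models. TrOPD introduces trust region constraints, uses outlier estimation for token-level credit assignment, and adds off-policy guidance to make optimization more stable. On mathematical reasoning, code generation, and general-domain benchmarks, TrOPD significantly outperforms existing on-policy distillation baselines.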
Key Points
TrOPD addresses the instability that arises in on-policy distillation when the teacher and student output distributions differ substantially.
It combines trust region on-policy learning, outlier estimation for token-level credit assignment, and off-policy guidance.
TrOPD outperforms baselines on mathematical reasoning, code generation, and general-domain tasks.
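This summary does not give TrOPD's actual objective, so the following is only a generic illustration of why a trust region helps on-policy distillation, not the paper's method. It assumes a common formulation in which each student-sampled token is scored by the log-probability gap between teacher and student, and it swaps in a PPO-style clipped ratio as a stand-in for the trust region constraint. The function names `token_rewards` and `clipped_surrogate`, the clip width `eps`, and all numbers are hypothetical; outlier estimation and off-policy guidance are not modeled.

```python
import math


def token_rewards(teacher_logp, student_logp):
    """Per-token signal log p_teacher(y_t) - log p_student(y_t),
    evaluated on tokens sampled from the student (assumed formulation)."""
    return [t - s for t, s in zip(teacher_logp, student_logp)]


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over tokens.

    The probability ratio between the updated and the sampling policy is
    clipped to [1 - eps, 1 + eps], which limits how far one update can move
    the student on any single token. This is a stand-in for a trust region,
    not the constraint used by TrOPD.
    """
    total = 0.0
    for n, o, a in zip(new_logp, old_logp, advantages):
        ratio = math.exp(n - o)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum keeps the pessimistic (lower) estimate.
        total += min(ratio * a, clipped * a)
    return total / len(advantages)


# Toy log-probabilities for three sampled tokens (made-up values).
teacher_logp = [-0.5, -2.0, -0.1]
old_logp = [-1.0, -0.5, -0.1]   # student policy that generated the tokens
new_logp = [-0.5, -1.5, -0.1]   # student policy after a candidate update

advantages = token_rewards(teacher_logp, old_logp)
objective = clipped_surrogate(new_logp, old_logp, advantages)
print(advantages, objective)
```

In this toy case the second token shows a large teacher-student gap, which is the kind of mismatch the abstract blames for unstable gradients. With clipping, the objective cannot be improved further by pushing the ratio on that token beyond the clip range, so a single badly mismatched token has a bounded influence on the update.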