Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
English summary
This paper introduces a data-centric post-training pipeline that applies interpretability protocols to preference datasets, uncovering latent concepts that distinguish preferred from dispreferred model outputs and making them explicit for user feedback. The approach diagnoses undesirable signals such as over-stylization and sycophancy, and mitigates off-target learning by intervening on the learning signal at the concept level. It unifies several interpretability-based training protocols as ways of shaping rewards through feature or data interventions. Empirically, the method amplifies desired properties like safeguards and model personality, turning opaque scalar reward optimization into an auditable process of sculpting the training signal.
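To make the auditing step concrete, the following is a minimal sketch of how one might rank concepts by how strongly they separate preferred from dispreferred outputs. It is not the paper's implementation: the summary does not say how concepts are extracted or scored, so the input format (a dict of concept activations per response), the concept names, and the function `audit_concepts` are all assumptions made for illustration.

```python
from collections import defaultdict


def audit_concepts(pairs):
    """Rank concepts by how strongly they separate chosen from rejected responses.

    `pairs` is a list of (chosen, rejected) tuples. Each element is a dict that
    maps a concept name to an activation value (hypothetical format; a real
    pipeline would obtain these from some interpretability tool).
    Returns (concept, mean_gap) tuples sorted by absolute gap, largest first.
    """
    if not pairs:
        return []
    gap_sums = defaultdict(float)
    for chosen, rejected in pairs:
        # A concept missing from a response counts as activation 0.0.
        for concept in set(chosen) | set(rejected):
            gap_sums[concept] += chosen.get(concept, 0.0) - rejected.get(concept, 0.0)
    n = len(pairs)
    ranked = [(concept, total / n) for concept, total in gap_sums.items()]
    # Sort by descending absolute gap; break ties alphabetically for stability.
    ranked.sort(key=lambda item: (-abs(item[1]), item[0]))
    return ranked


# Toy preference data with made-up concept names.
pairs = [
    ({"markdown_headers": 1.0, "agrees_with_user": 1.0},
     {"markdown_headers": 0.0, "agrees_with_user": 0.0}),
    ({"markdown_headers": 1.0, "cites_evidence": 1.0},
     {"markdown_headers": 0.0, "agrees_with_user": 1.0}),
]
report = audit_concepts(pairs)
for concept, gap in report:
    print(f"{concept}: {gap:+.2f}")
```

In this toy data the stylistic concept has the largest gap, which is the kind of signal a reviewer might flag as over-stylization before any training happens.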
Key points
Proposes a post-training pipeline that uses interpretability to identify latent concepts in preference data, enabling concept-level auditing of what a model will learn.
Diagnoses and mitigates off-target learning behaviors like over-stylization and sycophancy by intervening directly on the learning signal.
Unifies various interpretability-based training methods as feature or data interventions that shape the reward, moving beyond opaque scalar optimization.
Empirically demonstrates amplification of desired properties such as safeguards and model personality, showing the pipeline's practical utility.
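The data-intervention idea in the key points can be illustrated with a small sketch. It assumes, hypothetically, that each response already carries concept activations and that a reviewer has flagged some concepts as off-target. The function `shape_pair_weights`, its threshold, and the concept names are illustrative choices, not the paper's method.

```python
def shape_pair_weights(pairs, flagged, threshold=0.5, downweight=0.0):
    """Return one training weight per preference pair.

    A pair keeps weight 1.0 unless the chosen response exceeds the rejected
    one on any flagged concept by more than `threshold`; such pairs receive
    `downweight` (0.0 drops them from the learning signal entirely).
    All names and defaults here are assumptions made for illustration.
    """
    weights = []
    for chosen, rejected in pairs:
        gaps = (chosen.get(c, 0.0) - rejected.get(c, 0.0) for c in flagged)
        off_target = any(gap > threshold for gap in gaps)
        weights.append(downweight if off_target else 1.0)
    return weights


# Toy data: the first pair is preferred largely for a flagged concept.
pairs = [
    ({"sycophancy": 0.9, "helpfulness": 0.8}, {"sycophancy": 0.1, "helpfulness": 0.7}),
    ({"sycophancy": 0.1, "helpfulness": 0.9}, {"sycophancy": 0.2, "helpfulness": 0.3}),
]
weights = shape_pair_weights(pairs, flagged={"sycophancy"})
print(weights)
```

Dropping a whole pair also discards any legitimate signal it carries, so a finer-grained variant might instead subtract the flagged concept's contribution from the reward. Which of these the paper uses is not stated in this summary.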