Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution
English summary
Curating training data for large language models requires data attribution methods that identify how individual training samples influence model outputs. Traditional influence functions, though effective, are too slow and memory-intensive for large-scale use. The paper proposes Influcoder, which distills gradient influence rankings from decoder models into a dedicated encoder. This yields a fast, cost-effective approach to influence-based data attribution at scale.
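To make "gradient influence ranking" concrete, here is a minimal sketch of one common first-order formulation: each training sample is scored by the dot product of its loss gradient with the gradient of a query sample, and the samples are then sorted by that score. This summary does not state which influence estimator the paper uses, so the gradient-dot-product score, the toy linear model, and all function names below are assumptions made for illustration only.

```python
# Illustrative only: a gradient-dot-product influence score on a toy linear
# model. The estimator and all names here are assumptions, not the paper's method.

def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to the weights w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def influence_scores(w, train, query):
    """Score each training sample by <grad(train sample), grad(query)>."""
    query_grad = grad_squared_loss(w, *query)
    scores = []
    for x, y in train:
        train_grad = grad_squared_loss(w, x, y)
        scores.append(sum(a * b for a, b in zip(train_grad, query_grad)))
    return scores

def rank_by_influence(scores):
    """Indices of training samples, most influential first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

w = [1.0, -1.0]                                   # toy model weights
train = [([1.0, 0.0], 3.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
query = ([1.0, 0.0], 2.0)                         # output we want to attribute

scores = influence_scores(w, train, query)
ranking = rank_by_influence(scores)
print(scores)   # [2.0, 0.0, 1.0]
print(ranking)  # [0, 2, 1]
```

Even in this toy setting every query needs one gradient per training sample. For a decoder with billions of parameters that cost is what makes direct influence computation impractical at scale.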
Key points
Traditional influence functions are effective for data attribution but are slow and storage-intensive on large datasets.
Influcoder distills gradient influence rankings from decoder models into an encoder, retaining the ranking information while improving efficiency.
The method enables fast and cost-effective data attribution at scale, facilitating practical large-scale training data curation.
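The distillation step in the key points can be sketched as a listwise ranking loss: the decoder's influence scores act as the teacher, and a student encoder is trained so that similarities between its embeddings reproduce the teacher's ordering. The summary gives neither the paper's loss nor its encoder architecture, so the softmax cross-entropy objective (in the style of ListNet), the dot-product scoring, and every name below are hypothetical.

```python
import math

# Illustrative only: a listwise distillation loss between teacher influence
# scores and student embedding similarities. Loss choice and names are assumptions.

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def student_scores(query_emb, sample_embs):
    """Student influence proxy: similarity of query and sample embeddings."""
    return [dot(query_emb, emb) for emb in sample_embs]

def listwise_distillation_loss(teacher, student):
    """Cross-entropy between the teacher and student score distributions."""
    p = softmax(teacher)
    q = softmax(student)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.0, 1.0]                 # gradient influence scores (teacher)
query_emb = [1.0, 0.0]                    # hypothetical encoder output for the query

aligned = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # student order matches teacher
misaligned = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]  # top two samples swapped

loss_aligned = listwise_distillation_loss(teacher, student_scores(query_emb, aligned))
loss_misaligned = listwise_distillation_loss(teacher, student_scores(query_emb, misaligned))
print(loss_aligned < loss_misaligned)  # True
```

Under these assumptions, a trained student needs only one embedding per sample and a similarity computation at attribution time, with no per-sample decoder gradients. That is the efficiency gain the key points describe.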