ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
English summary
The paper introduces ALIGNBEAM, a training-free method that transfers safety alignment from an anchor model to a target specialist model at inference time, even when the two models use different vocabularies. At each decoding step, it translates the anchor's logits token by token into the target vocabulary, then uses a small LLM judge to select the safest of K candidate continuations. No model weights are altered, and the safety-utility trade-off can be tuned at deployment. In both cross-vocabulary and same-vocabulary settings, ALIGNBEAM significantly increases refusal rates on adversarial safety benchmarks while preserving task accuracy and keeping inference overhead practical. The results indicate that safety alignment can be transferred between model families at inference time without modifying either model.
Chinese summary
本文提出了ALIGNBEAM,一种无需训练的方法,可在推理时将安全对齐从锚模型迁移到目标专业模型,即使两者词表不同也能工作。该方法在每个解码步骤将锚逻辑值逐Token翻译到目标词表,然后由一个小型LLM评审从K个候选续写中选出最安全的一个。不修改任何模型权重,安全-效用权衡可在部署时调节。在跨词表和同词表的评估中,ALIGNBEAM在对抗性安全基准上将拒绝率大幅提升,同时保持任务准确率和可接受的推理开销。结果表明,安全对齐可以在推理时跨模型家族传递,无需改动模型权重。
Key points
Overcomes the vocabulary-sharing limitation of existing logit-mixing defenses by translating anchor logits token-by-token into the target vocabulary.
通过将锚逻辑值逐Token翻译到目标词表,克服了现有逻辑值混合防御必须共享词表的限制。
Uses a small LLM judge to select the safest continuation among K candidates at each step, with no weight changes needed.
每一步使用小型LLM评审从K个候选续写中挑出最安全的,无需修改任何权重。
The safety-utility trade-off can be tuned at deployment time without retraining.
可在部署时调节安全与效用的权衡,无需重新训练。
Evaluated on cross-vocabulary pairs, ALIGNBEAM substantially raises refusal rates on adversarial benchmarks while preserving task accuracy and keeping inference overhead practical.
在跨词表对上的评估显示,对抗基准上的拒绝率显著提升,同时任务准确率和推理开销保持在实用范围内。
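Illustrative sketch
The key points above describe one decoding step: translate the anchor's logits into the target vocabulary, mix them with the target's logits, and let a judge pick the safest of K candidates. The toy Python sketch below shows one way such a step could look. It is not the paper's implementation. The exact-string token matching, the log-space mixture, the mixing weight alpha, and the keyword-based toy_judge (standing in for the small LLM judge) are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def translate_anchor_probs(anchor_logits: List[float], anchor_vocab: List[str],
                           target_vocab: List[str], floor: float = 1e-9) -> List[float]:
    """Project anchor probabilities onto the target vocabulary.

    Assumption: tokens are matched by exact string; anchor mass on tokens
    that do not exist in the target vocabulary is dropped, then renormalized.
    """
    target_index = {tok: i for i, tok in enumerate(target_vocab)}
    out = [floor] * len(target_vocab)  # floor keeps log() finite later
    for tok, p in zip(anchor_vocab, softmax(anchor_logits)):
        j = target_index.get(tok)
        if j is not None:
            out[j] += p
    total = sum(out)
    return [p / total for p in out]


def mix_distributions(target_logits: List[float], anchor_probs: List[float],
                      alpha: float, floor: float = 1e-9) -> List[float]:
    """Log-space mixture: alpha=0 keeps the target, alpha=1 follows the anchor.

    alpha is the (hypothetical) deployment-time safety-utility knob.
    """
    target_probs = softmax(target_logits)
    mixed = [(1 - alpha) * math.log(max(pt, floor)) + alpha * math.log(pa)
             for pt, pa in zip(target_probs, anchor_probs)]
    return softmax(mixed)


def select_step(prefix: str, target_logits: List[float], anchor_logits: List[float],
                target_vocab: List[str], anchor_vocab: List[str],
                judge: Callable[[str], float], alpha: float = 0.5, k: int = 3) -> str:
    """One decoding step: mix, take the top-k tokens, return the judge's pick."""
    anchor_probs = translate_anchor_probs(anchor_logits, anchor_vocab, target_vocab)
    mixed = mix_distributions(target_logits, anchor_probs, alpha)
    top = sorted(range(len(mixed)), key=lambda i: mixed[i], reverse=True)[:k]
    candidates = [prefix + target_vocab[i] for i in top]
    return max(candidates, key=judge)  # higher judge score = safer


def toy_judge(text: str) -> float:
    """Stand-in for the small LLM judge: rewards a refusal, penalizes compliance."""
    if text.endswith(" Sorry"):
        return 1.0
    if text.endswith(" Sure"):
        return -1.0
    return 0.0


# Toy data: the two models share only some token strings.
target_vocab = [" Sure", " Sorry", " the", " bomb"]
target_logits = [3.0, 1.0, 0.5, 0.0]   # target specialist prefers " Sure"
anchor_vocab = [" Sorry", " I", " Sure", " cannot"]
anchor_logits = [4.0, 2.0, 0.0, 1.0]   # aligned anchor prefers " Sorry"

prefix = "A:"
print(select_step(prefix, target_logits, anchor_logits,
                  target_vocab, anchor_vocab, toy_judge, alpha=0.5, k=3))
```

In this toy setup, alpha=0 with k=1 reproduces the target model's own greedy choice, while raising alpha or k shifts the step toward the refusal token. That is the sense in which a trade-off of this kind could be tuned at deployment without touching either model's weights.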