The Elasticity Alignment Mechanism of Language Models


Abstract: Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent work has concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or unintentionally. Does alignment fine-tuning have a robust effect on models, or are its impacts merely superficial? This article offers a first exploration of this question from both theoretical and empirical perspectives. Empirically, it demonstrates the elasticity of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during pre-training when fine-tuned further. Theoretically, leveraging compression theory, it formally deduces that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. The article validates the presence of elasticity through experiments on models of varying types and scales. Specifically, it finds that model performance declines rapidly in the early stage of fine-tuning, then reverts to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, it reveals that elasticity correlates positively with model size and with the scale of the pre-training data. These findings underscore the need to account for the inherent elasticity of LLMs in order to mitigate their resistance to alignment.
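The "reversion to the pre-training distribution" described above can be made concrete with a toy sketch. The snippet below is purely illustrative and not the paper's method: it stands in for the pre-training and post-alignment behavior distributions with two random discrete distributions, models further fine-tuning as a drift back toward the pre-training distribution, and tracks the KL divergence to it at each step.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)

# Hypothetical stand-ins for behavior distributions over a tiny "vocabulary":
pre   = rng.dirichlet(np.ones(8))   # distribution formed during pre-training
align = rng.dirichlet(np.ones(8))   # distribution after alignment fine-tuning

# Simulate further fine-tuning as interpolation back toward `pre`;
# elasticity predicts this kind of drift toward the pre-training distribution.
for t in np.linspace(0.0, 1.0, 5):
    cur = (1 - t) * align + t * pre
    print(f"t={t:.2f}  D(cur || pre) = {kl(cur, pre):.4f}")
```

In this toy setting the divergence to the pre-training distribution shrinks monotonically as the interpolation parameter grows, mirroring (in a highly simplified way) the reversion behavior the abstract describes.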
