Abstract:
This article aims to establish a systematic understanding of the mechanisms behind policy entropy changes in reinforcement learning for large language model reasoning, focusing on three core research questions: 1) How does policy entropy typically behave during reinforcement learning? 2) Why does it exhibit this behavior? 3) How can we intervene to control entropy and thus achieve a better exploration-exploitation trade-off? The article first documents the phenomenon of entropy collapse and an empirical law connecting entropy with downstream task performance in reinforcement learning. It then shows theoretically that the change in entropy is driven by the covariance between an action's probability and its advantage. Finally, two simple yet effective techniques, Clip-Cov and KL-Cov, are proposed to control policy entropy. Experiments show that these methods encourage exploration, helping the model escape entropy collapse and achieve better downstream performance.
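To make the covariance result and the Clip-Cov/KL-Cov idea concrete, the sketch below shows one way a covariance-based intervention could look in code. This is an illustrative approximation under stated assumptions, not the article's exact formulation: the function name, tensor shapes, `top_frac`, and `kl_weight` are hypothetical, and the squared log-ratio is used only as a simple stand-in for a per-token KL penalty.

```python
# Illustrative sketch (not the article's exact method): compute a per-token
# covariance signal between log-probability and advantage, then damp the update
# on the highest-covariance tokens, in the spirit of Clip-Cov / KL-Cov.
import torch

def covariance_regularized_pg_loss(logprobs, old_logprobs, advantages,
                                    top_frac=0.002, kl_weight=1.0):
    """Hypothetical policy-gradient loss with a KL-Cov-style penalty.

    logprobs     -- current policy log-probs of sampled tokens, shape (N,)
    old_logprobs -- rollout policy log-probs,                   shape (N,)
    advantages   -- per-token advantage estimates,              shape (N,)
    top_frac     -- fraction of tokens treated as "high covariance" (assumed value)
    kl_weight    -- strength of the penalty on those tokens        (assumed value)
    """
    # Per-token contribution to Cov(log pi, A): centered log-prob times centered advantage.
    centered_lp = logprobs.detach() - logprobs.detach().mean()
    centered_adv = advantages - advantages.mean()
    cov_signal = centered_lp * centered_adv

    # Standard importance-weighted policy-gradient surrogate (negated, since we minimize).
    ratio = torch.exp(logprobs - old_logprobs.detach())
    pg_loss = -(ratio * advantages).mean()

    # Select the tokens with the largest covariance signal.
    k = max(1, int(top_frac * cov_signal.numel()))
    _, top_idx = torch.topk(cov_signal, k)

    # KL-Cov-style idea: penalize divergence from the rollout policy on those tokens,
    # discouraging the sharp probability increases that drive entropy down.
    kl_penalty = (logprobs[top_idx] - old_logprobs[top_idx].detach()).pow(2).mean()

    return pg_loss + kl_weight * kl_penalty
```

In this sketch, the penalty targets exactly the tokens whose probability-advantage covariance is largest, since, per the analysis above, that covariance is what drives entropy downward.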