Abstract:
This article aims to establish a systematic understanding of the mechanisms behind policy entropy changes in reinforcement learning for large language model reasoning, focusing on three core research questions: 1) How does policy entropy typically behave during reinforcement learning? 2) Why does it exhibit this behavior? 3) How can we intervene to control entropy and thus achieve a better exploration-exploitation trade-off? The article first documents the phenomenon of entropy collapse and an empirical law connecting entropy with downstream task performance in reinforcement learning. It then shows theoretically that the change in entropy is driven by the covariance between an action's probability and its advantage. Finally, two simple yet effective techniques, Clip-Cov and KL-Cov, are proposed to control policy entropy. Experiments show that these methods encourage exploration, helping the model escape entropy collapse and achieve better downstream performance.
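To make the covariance result and the Clip-Cov/KL-Cov idea concrete, the sketch below shows one way a covariance-based intervention could look in code. This is an illustrative approximation under stated assumptions, not the article's exact formulation: the function name, tensor shapes, `top_frac`, and `kl_weight` are hypothetical, and the squared log-ratio is used only as a simple stand-in for a per-token KL penalty.

```python
# Illustrative sketch (not the article's exact method): compute a per-token
# covariance signal between log-probability and advantage, then damp the update
# on the highest-covariance tokens, in the spirit of Clip-Cov / KL-Cov.
import torch

def covariance_regularized_pg_loss(logprobs, old_logprobs, advantages,
                                    top_frac=0.002, kl_weight=1.0):
    """Hypothetical policy-gradient loss with a KL-Cov-style penalty.

    logprobs     -- current policy log-probs of sampled tokens, shape (N,)
    old_logprobs -- rollout policy log-probs,                   shape (N,)
    advantages   -- per-token advantage estimates,              shape (N,)
    top_frac     -- fraction of tokens treated as "high covariance" (assumed value)
    kl_weight    -- strength of the penalty on those tokens        (assumed value)
    """
    # Per-token contribution to Cov(log pi, A): centered log-prob times centered advantage.
    centered_lp = logprobs.detach() - logprobs.detach().mean()
    centered_adv = advantages - advantages.mean()
    cov_signal = centered_lp * centered_adv

    # Standard importance-weighted policy-gradient surrogate (negated, since we minimize).
    ratio = torch.exp(logprobs - old_logprobs.detach())
    pg_loss = -(ratio * advantages).mean()

    # Select the tokens with the largest covariance signal.
    k = max(1, int(top_frac * cov_signal.numel()))
    _, top_idx = torch.topk(cov_signal, k)

    # KL-Cov-style idea: penalize divergence from the rollout policy on those tokens,
    # discouraging the sharp probability increases that drive entropy down.
    kl_penalty = (logprobs[top_idx] - old_logprobs[top_idx].detach()).pow(2).mean()

    return pg_loss + kl_weight * kl_penalty
```

In this sketch, the penalty targets exactly the tokens whose probability-advantage covariance is largest, since, per the analysis above, that covariance is what drives entropy downward.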