Self-Improvement of Large Reasoning Models
Graphical Abstract
Abstract
As high-quality human data has already been exploited at scale in the construction of pre-training corpora, further scaling such data to improve models has become increasingly difficult. Self-improvement through self-synthesized data has therefore emerged as a mainstream paradigm for enhancing the reasoning capabilities of models. This article reveals the intrinsic connection between self-synthesized data and reinforcement learning, clarifying that traditional synthetic-data methods can be viewed as a special form of reinforcement learning. It introduces two complementary research efforts that represent the technical evolution from offline to online reinforcement learning: the Balanced Self-Taught Reasoner (B-STaR), a self-improvement framework that monitors and dynamically adjusts training configurations to balance exploration and exploitation, providing interpretable insights for dynamic self-training; and SimpleRL-Zoo (Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild), a comprehensive exploration of zero RL training that marks the shift toward online RL methods. Through a systematic study of 10 base models spanning different families and sizes, the work demonstrates the universality and effectiveness of zero RL training, uncovers key design strategies, and highlights phenomena such as the conditional emergence of “aha moments” across different model types.