
Self-Improving of Large Reasoning Models

  • Abstract: As high-quality human-annotated data has already been exploited at scale in the construction of pre-training corpora, further improving models by scaling such data has become increasingly difficult. Self-improvement through self-synthesized data has therefore emerged as one of the mainstream paradigms for strengthening the reasoning capabilities of models. This work first reveals the intrinsic connection between self-synthesized data and reinforcement learning, showing that conventional synthetic-data methods can be viewed as a special form of reinforcement learning. It then introduces two complementary lines of work that represent the technical evolution from offline to online reinforcement learning. The balanced self-taught reasoner (B-STaR) is a self-improvement framework that balances exploration and exploitation: although it still operates within the synthetic-data paradigm, B-STaR monitors and dynamically adjusts its training configurations to keep exploration and exploitation in balance, providing interpretable insight into self-training dynamics. SimpleRL-Zoo (investigating and taming zero reinforcement learning for open base models in the wild) marks the shift from self-training toward online reinforcement learning. Through a systematic study of 10 base models spanning different model families and scales, it demonstrates the generality and effectiveness of zero RL training, identifies its key design strategies, and uncovers phenomena such as the conditional emergence of "aha moments" across model types, providing a foundational framework and empirical validation for understanding and advancing the self-improvement of large reasoning models.
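To make the claimed connection between self-synthesized data and reinforcement learning concrete, one common formalization (a sketch, not quoted from the article; it assumes a binary correctness reward) writes the rejection-sampling fine-tuning objective used by STaR-style methods as

\[
\mathcal{J}_{\mathrm{RFT}}(\theta)
= \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}
\bigl[\, r(x, y)\, \log \pi_{\theta}(y \mid x) \,\bigr],
\qquad r(x, y) \in \{0, 1\}.
\]

Its gradient is the REINFORCE policy-gradient estimator, except that samples come from a frozen behavior policy \(\pi_{\theta_{\mathrm{old}}}\) rather than the current policy; iterating sampling and fine-tuning is therefore a form of offline policy optimization, while online RL methods, as in zero RL training, correspond to setting \(\theta_{\mathrm{old}} = \theta\) at every update.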

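The abstract describes B-STaR only at the level of "monitoring and dynamically adjusting configurations". The sketch below is a minimal, hypothetical illustration of such a self-improvement loop: the choice of adjustable configurations (sampling temperature and a reward threshold), the balance heuristic, and the hooks `sample_model` and `fine_tune` are illustrative assumptions, not the paper's actual procedure.

```python
"""Hypothetical sketch of a B-STaR-style self-improvement loop.

Assumptions not taken from the article: the adjustable "configurations"
are the sampling temperature and a reward threshold, and the balance
score combines query coverage (exploration) with the kept-sample ratio
(exploitation). `sample_model` and `fine_tune` are placeholder hooks.
"""
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Config:
    temperature: float = 1.0       # higher -> more exploration when sampling
    reward_threshold: float = 0.5  # higher -> stricter exploitation filter


def balance_score(kept_per_query: List[int], n_samples: int) -> float:
    """Coverage of solved queries times the average kept ratio on solved queries."""
    solved = [k for k in kept_per_query if k > 0]
    if not solved:
        return 0.0
    coverage = len(solved) / len(kept_per_query)
    quality = sum(solved) / (len(solved) * n_samples)
    return coverage * quality


def self_improve(
    queries: List[str],
    sample_model: Callable[[str, float], List[Tuple[str, float]]],
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    iterations: int = 5,
    n_samples: int = 8,
) -> None:
    cfg = Config()
    for it in range(iterations):
        dataset: List[Tuple[str, str]] = []
        kept_per_query: List[int] = []
        for q in queries:
            # Exploration: sample candidate solutions, each paired with a reward.
            candidates = sample_model(q, cfg.temperature)[:n_samples]
            # Exploitation: keep only candidates whose reward clears the threshold.
            kept = [(q, y) for y, r in candidates if r >= cfg.reward_threshold]
            kept_per_query.append(len(kept))
            dataset.extend(kept)
        score = balance_score(kept_per_query, n_samples)
        # Offline update on self-synthesized data (the "special form of RL" above).
        fine_tune(dataset)
        # Dynamic adjustment: a simple stand-in for the monitoring procedure --
        # explore more when few queries are solved, filter harder when almost
        # every sample is kept.
        coverage = sum(1 for k in kept_per_query if k > 0) / len(kept_per_query)
        if coverage < 0.5:
            cfg.temperature = min(cfg.temperature + 0.1, 1.5)
        elif coverage > 0.9:
            cfg.reward_threshold = min(cfg.reward_threshold + 0.1, 1.0)
        print(f"iteration={it} balance_score={score:.3f} config={cfg}")
```

In this sketch, `sample_model(query, temperature)` is assumed to return (solution, reward) pairs with the reward supplied by an answer checker, and `fine_tune` stands for whatever supervised fine-tuning routine the surrounding system uses; swapping the heuristic adjustment for the paper's actual balance metric would not change the shape of the loop.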