
Reinforcement Learning for Large Language Models: Progress


Abstract: The success of large language models (LLMs) depends not only on their vast scale but, more crucially, on alignment techniques that bring their behavior into line with human expectations. Reinforcement learning from human feedback (RLHF) is the core paradigm for achieving this alignment. This article reviews the developmental trajectory of reinforcement learning techniques for LLMs: first, it analyzes the challenges faced by traditional RLHF methods, represented by the proximal policy optimization (PPO) algorithm, such as high complexity and significant computational overhead; subsequently, it discusses innovations in reinforcement learning algorithms tailored to the characteristics of LLMs, which substantially improve training efficiency while preserving the advantages of online learning, ushering in a new wave of LLM-specific reinforcement learning; finally, it looks ahead to how directions such as reinforcement learning from artificial intelligence feedback (RLAIF) can drive models to achieve self-improvement, and outlines a future trend of symbiotic, mutually reinforcing evolution between reinforcement learning and large models.

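As background for the PPO algorithm named in the abstract, the clipped surrogate objective it maximizes is shown below; this is the standard formulation from Schulman et al. (2017), included for reference rather than taken from the article itself. Writing the probability ratio as $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage estimate as $\hat{A}_t$:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

In RLHF this objective is typically optimized with a reward that combines a learned reward model's score and a per-token KL penalty against a frozen reference policy, so training keeps policy, reference, reward, and value networks in memory simultaneously; this is one concrete source of the complexity and computational overhead the abstract refers to.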
