
Hybrid Attention Mechanism is Efficient and Effective for Deep Thinking

Abstract: Large language models trained with reinforcement learning have acquired strong deep reasoning capabilities, generating long chains of thought to solve complex problems. However, the quadratic complexity of self-attention incurs substantial computational and memory overhead when processing long sequences with many reasoning tokens, severely limiting the efficiency of deep reasoning models in both training and inference. Existing work focuses mainly on post-hoc optimizations that improve inference efficiency, leaving training-stage efficiency largely unaddressed. We observe that the reasoning process in deep-thinking scenarios exhibits locality, which makes hybrid attention mechanisms particularly well suited to it. We therefore convert a full-attention model into a hybrid-attention model with a small amount of post-training and perform deep reasoning training on the converted architecture. On representative deep reasoning benchmarks including AIME, MATH-500, and LiveCodeBench, the 1:1 hybrid attention model matches or exceeds the full-attention model while reducing training time by 22% and key-value cache storage by 46.9% under a 64k-token context window.
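To make the 1:1 hybrid layout concrete, the sketch below interleaves full causal attention with sliding-window attention so that only half of the layers keep a key-value cache that grows with context length. The abstract does not specify which efficient attention variant is paired with full attention, so the sliding-window layers, the `window` size, and all class names here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a 1:1 hybrid attention stack (illustrative only):
# even layers use full causal attention, odd layers use sliding-window causal
# attention, so only half of the layers need a KV cache that grows with length.
import torch
import torch.nn.functional as F


def causal_mask(seq_len, window=None, device=None):
    """Boolean mask where True = attend. window=None gives full causal attention;
    an integer window restricts each query to the most recent `window` positions."""
    i = torch.arange(seq_len, device=device).unsqueeze(1)
    j = torch.arange(seq_len, device=device).unsqueeze(0)
    mask = j <= i                       # causal constraint
    if window is not None:
        mask &= (i - j) < window        # sliding-window constraint
    return mask


class AttentionLayer(torch.nn.Module):
    def __init__(self, dim, num_heads, window=None):
        super().__init__()
        self.num_heads, self.head_dim, self.window = num_heads, dim // num_heads, window
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        mask = causal_mask(t, self.window, x.device)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(y.transpose(1, 2).reshape(b, t, d))


class HybridStack(torch.nn.Module):
    """1:1 interleaving: full attention on even layers, sliding-window on odd layers."""

    def __init__(self, dim=512, num_heads=8, num_layers=8, window=4096):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            AttentionLayer(dim, num_heads, window=None if i % 2 == 0 else window)
            for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)            # residual; norms and MLPs omitted for brevity
        return x


if __name__ == "__main__":
    x = torch.randn(1, 1024, 512)
    print(HybridStack()(x).shape)       # torch.Size([1, 1024, 512])
```

Under a layout like this, at a 64k-token context the window-limited layers cache only a bounded number of recent tokens during decoding, so total KV-cache storage approaches half that of a full-attention stack, which is in the same ballpark as the 46.9% saving reported above.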

     
