Hybrid Attention Mechanism is Efficient and Effective for Deep Thinking
Graphical Abstract
Abstract
Large language models demonstrate strong deep reasoning capabilities by generating long chains of thought for complex problems. However, the quadratic complexity of the self-attention mechanism creates substantial computational and memory overhead when processing long sequences with numerous reasoning tokens, limiting the efficiency of deep reasoning models during both training and inference. While existing work focuses on post-processing optimizations for inference efficiency, training-stage efficiency remains largely unaddressed. We observe that reasoning processes exhibit locality, which makes hybrid attention mechanisms particularly suitable. We convert full attention models into hybrid attention models via minimal post-training and then perform deep reasoning training on this architecture. On benchmarks including AIME, MATH-500, and LiveCodeBench, our 1:1 hybrid attention model achieves performance comparable or superior to full attention models while reducing training time by 22% and key-value cache storage by 46.9% with a 64k context window.
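A minimal sketch of the 1:1 hybrid layout described above, assuming sliding-window attention as the local component; the window size, layer count, and all function names below are illustrative assumptions rather than the paper's actual configuration or implementation.

```python
# Illustrative sketch (not the authors' implementation) of a 1:1 hybrid stack:
# full attention layers alternate with sliding-window (local) attention layers,
# so only half of the layers must retain a full-length key-value cache.
import torch


def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal local mask: each token attends to itself and the previous window-1 tokens."""
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)  # rel[i, j] = j - i
    return (rel <= 0) & (rel > -window)


def causal_mask(seq_len: int) -> torch.Tensor:
    """Standard full causal attention mask."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


def attention(q, k, v, mask):
    """Plain scaled dot-product attention with a boolean keep-mask."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Hypothetical sizes chosen only to make the example run quickly.
num_layers, seq_len, d_head, window = 8, 1024, 64, 128
x = torch.randn(1, seq_len, d_head)
local = sliding_window_mask(seq_len, window)
full = causal_mask(seq_len)

# 1:1 interleaving: even layers local, odd layers full.
for layer in range(num_layers):
    mask = local if layer % 2 == 0 else full
    x = x + attention(x, x, x, mask)  # residual only; real layers add projections and MLPs

# KV-cache intuition: local layers keep at most `window` tokens, full layers keep all tokens.
full_cache = num_layers * seq_len
hybrid_cache = (num_layers // 2) * seq_len + (num_layers // 2) * window
print(f"relative KV cache of the hybrid stack: {hybrid_cache / full_cache:.2%}")
```

The printed ratio only illustrates why a 1:1 mix shrinks the cache; the 46.9% reduction reported in the abstract depends on the model's real layer count, window size, and 64k context length.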