Abstract:
Long-context modeling is crucial for next-generation large language models, yet the quadratic complexity of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining the long-context capabilities of LLMs, but many existing methods struggle to translate theoretical computational reductions into practical speedups, owing to hardware-unfriendly designs and gaps between training and inference. This article presents native sparse attention (NSA), a co-designed approach that integrates algorithmic innovation with hardware-aligned implementation. On the algorithmic side, NSA achieves sparsity through three complementary mechanisms: compression for coarse-grained global context, selection for fine-grained critical tokens, and sliding-window attention for local patterns. On the hardware side, NSA adopts memory-friendly strategies with contiguous memory-access patterns for sparse operations and implements specialized kernels that achieve high hardware utilization. Experimental validation shows that NSA matches or even exceeds full-attention performance across general benchmarks, long-context tasks, and mathematical reasoning, while substantially reducing computation through sparsity. On 64k-length sequences, NSA also delivers significant speedups over full attention across decoding, forward propagation, and backward propagation.
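To make the three-branch design concrete, the following is a minimal, illustrative sketch of a single decoding step: one branch attends over block-mean "compressed" keys, one over tokens from the highest-scoring blocks, and one over a recent sliding window, with their outputs mixed at the end. The function name, block/window sizes, and the fixed equal gating are assumptions for illustration; NSA itself uses learned, query-dependent gates and hardware-optimized kernels rather than this NumPy reference.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nsa_decode_step(q, K, V, block=4, top_k=2, window=8):
    """Toy single-query sketch of NSA's three branches (illustrative only).

    q: (d,) query; K, V: (T, d) cached keys and values.
    Branch 1: attend over block-mean "compressed" keys (global context).
    Branch 2: attend over tokens of the top-k scoring blocks (critical details).
    Branch 3: attend over the last `window` tokens (local patterns).
    """
    T, d = K.shape
    scale = 1.0 / np.sqrt(d)

    # Branch 1: compression -- mean-pool each block of keys/values.
    n_blocks = T // block
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    p_cmp = softmax(Kc @ q * scale)
    out_cmp = p_cmp @ Vc

    # Branch 2: selection -- reuse compressed scores to pick top-k blocks,
    # then attend over the original tokens inside those blocks.
    top_blocks = np.argsort(p_cmp)[-top_k:]
    idx = np.concatenate(
        [np.arange(b * block, (b + 1) * block) for b in top_blocks]
    )
    p_sel = softmax(K[idx] @ q * scale)
    out_sel = p_sel @ V[idx]

    # Branch 3: sliding window over the most recent `window` tokens.
    p_win = softmax(K[-window:] @ q * scale)
    out_win = p_win @ V[-window:]

    # Fixed equal gating for the sketch; NSA learns per-branch gates.
    return (out_cmp + out_sel + out_win) / 3.0
```

Note that selection operates on whole contiguous blocks rather than scattered individual tokens; this block granularity is what makes the memory accesses contiguous and hence hardware-friendly.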