Abstract:
Long-context modeling is crucial for next-generation large language models, yet the quadratic complexity of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining the long-context capabilities of LLMs, but many existing methods struggle to translate theoretical computational reductions into practical speedups, owing to hardware-unfriendly designs and gaps between training and inference. This article presents native sparse attention (NSA), a co-designed approach that integrates algorithmic innovation with hardware-aligned implementation. On the algorithmic side, NSA achieves sparsity through three complementary mechanisms: compression for coarse-grained global context, selection for fine-grained critical tokens, and sliding-window attention for local patterns. On the hardware side, NSA adopts memory-friendly strategies with contiguous memory-access patterns for sparse operations and implements specialized kernels that achieve high hardware utilization. Experimental validation shows that NSA matches or even exceeds full-attention performance across general benchmarks, long-context tasks, and mathematical reasoning, while substantially reducing computation through sparsity. On 64k-length sequences, NSA also delivers significant speedups over full attention across decoding, forward propagation, and backward propagation.
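To make the three-branch design concrete, the following is a minimal, illustrative sketch of a single decoding step: one branch attends over block-mean "compressed" keys, one over tokens from the highest-scoring blocks, and one over a recent sliding window, with their outputs mixed at the end. The function name, block/window sizes, and the fixed equal gating are assumptions for illustration; NSA itself uses learned, query-dependent gates and hardware-optimized kernels rather than this NumPy reference.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nsa_decode_step(q, K, V, block=4, top_k=2, window=8):
    """Toy single-query sketch of NSA's three branches (illustrative only).

    q: (d,) query; K, V: (T, d) cached keys and values.
    Branch 1: attend over block-mean "compressed" keys (global context).
    Branch 2: attend over tokens of the top-k scoring blocks (critical details).
    Branch 3: attend over the last `window` tokens (local patterns).
    """
    T, d = K.shape
    scale = 1.0 / np.sqrt(d)

    # Branch 1: compression -- mean-pool each block of keys/values.
    n_blocks = T // block
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    p_cmp = softmax(Kc @ q * scale)
    out_cmp = p_cmp @ Vc

    # Branch 2: selection -- reuse compressed scores to pick top-k blocks,
    # then attend over the original tokens inside those blocks.
    top_blocks = np.argsort(p_cmp)[-top_k:]
    idx = np.concatenate(
        [np.arange(b * block, (b + 1) * block) for b in top_blocks]
    )
    p_sel = softmax(K[idx] @ q * scale)
    out_sel = p_sel @ V[idx]

    # Branch 3: sliding window over the most recent `window` tokens.
    p_win = softmax(K[-window:] @ q * scale)
    out_win = p_win @ V[-window:]

    # Fixed equal gating for the sketch; NSA learns per-branch gates.
    return (out_cmp + out_sel + out_win) / 3.0
```

Note that selection operates on whole contiguous blocks rather than scattered individual tokens; this block granularity is what makes the memory accesses contiguous and hence hardware-friendly.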