Abstract:
In recent years, the diminishing returns of the “Scaling Law” have prompted the emergence of Reasoning Models, which enhance intelligence by extending context windows and enabling chain-of-thought inference. However, these deep reasoning models place unprecedented demands on artificial intelligence infrastructure: the training phase must support longer sequences, multi-stage distributed parallelism, and RLHF workflows, while the inference phase faces explosive growth in computation, memory, and bandwidth driven by ultra-long inputs and outputs and rapidly growing KV caches. To address these challenges, practitioners have combined tensor, pipeline, expert, and sequence parallelism in training and developed DualPipe to hide communication latency behind computation. In inference, “storage-for-compute” architectures such as Mooncake’s disaggregated prefill/decode design and KTransformers’ compute-intensity-driven offloading enable collaboration across heterogeneous resources. Looking ahead, sparse attention mechanisms (e.g., MoBA and NSA) promise to reduce the $O(n^2)$ complexity of long-sequence processing, offering fresh directions for infrastructure adaptation. This article reviews representative models and system innovations, providing a roadmap for co-designing algorithms and infrastructure.