
In-Memory Computing: Breaking the Computing Power Bottleneck in the LLM Era

Abstract: Against the backdrop of the intelligent transformation of society driven by artificial intelligence (AI), the Internet of Things (IoT), and 5G, large language models (LLMs) with tens of billions to trillions of parameters—represented by GPT-4 and DeepSeek—have become core application engines owing to their high computational density. In data-intensive tasks, however, the traditional von Neumann architecture suffers from the “memory wall” problem caused by frequent data movement, which limits further gains in system performance. Processing-in-memory (PIM) technology has emerged to address this problem. Its core idea is to integrate computing units near or inside memory, bypassing the traditional “compute-store-move” bottleneck and reducing the latency and energy consumption caused by data movement. The development of PIM has gone through two stages: in the early era of small models, research focused on exploring on-chip computing with emerging non-volatile memory (NVM) and static random-access memory (SRAM); in the era of LLMs, it has shifted to PIM centered on high-density, low-cost dynamic random-access memory (DRAM). Typical examples include Samsung's HBM-PIM and SK Hynix's AiM, which unlock the large internal bandwidth of DRAM and greatly reduce data-movement overhead by integrating computing units deeply into DRAM arrays. Current PIM architectures follow two main paradigms. The first is the asynchronous driving mode based on memory-mapped I/O (MMIO), which supports independent programming of PIM cores and offers high flexibility, but requires a complex software stack. The second is the synchronous driving mode based on central processing unit (CPU) instruction-set extensions, which offers better ecosystem compatibility but faces challenges such as insufficient instruction-issue bandwidth and bus contention. Although PIM shows great potential in energy efficiency and parallelism, its large-scale commercial deployment still requires solving a series of key system-level issues, including CPU compatibility, communication bottlenecks, data-layout adaptation, software-hardware ecosystem construction, algorithm co-design, and thermal management. In summary, PIM is a strategic direction for breaking through the computing power bottleneck of LLMs and driving the innovation of computing paradigms. Its successful deployment relies on interdisciplinary cooperation, and it will play a key role in the computing field in the future.
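
The abstract contrasts an MMIO-based asynchronous driving mode with a synchronous mode based on CPU instruction-set extensions. Purely as an illustrative sketch of the former—not the interface of any actual product such as HBM-PIM or AiM—the following C code shows how a host might program a PIM core through memory-mapped control registers and then continue with other work while the kernel runs asynchronously. The base address, register offsets, and GEMV command encoding are all hypothetical.

```c
/* Hypothetical sketch of the MMIO-based (asynchronous) PIM driving mode.
 * All register offsets, the base address, and the command encoding are
 * invented for illustration; real PIM devices define their own interfaces,
 * and the MMIO window would be mapped by a platform driver in practice. */
#include <stdint.h>
#include <stddef.h>

#define PIM_REG_CMD     0x00  /* command register (writing it triggers execution) */
#define PIM_REG_SRC     0x08  /* source operand address inside PIM-attached DRAM  */
#define PIM_REG_DST     0x10  /* destination address                              */
#define PIM_REG_LEN     0x18  /* element count                                    */
#define PIM_REG_STATUS  0x20  /* 0 = busy, 1 = done                               */
#define PIM_CMD_GEMV    0x1   /* e.g., a matrix-vector multiply kernel            */

static volatile uint64_t *pim_reg(uintptr_t base, size_t off) {
    /* Treat the mapped MMIO window as an array of 64-bit device registers. */
    return (volatile uint64_t *)(base + off);
}

/* Host side: program the PIM core through memory-mapped registers,
 * then return immediately; the kernel executes inside the memory device. */
void pim_launch_gemv(uintptr_t base, uint64_t src, uint64_t dst, uint64_t len) {
    *pim_reg(base, PIM_REG_SRC) = src;
    *pim_reg(base, PIM_REG_DST) = dst;
    *pim_reg(base, PIM_REG_LEN) = len;
    *pim_reg(base, PIM_REG_CMD) = PIM_CMD_GEMV;  /* asynchronous kick-off */
}

int pim_is_done(uintptr_t base) {
    /* The CPU polls (or could wait for an interrupt) instead of blocking. */
    return *pim_reg(base, PIM_REG_STATUS) == 1;
}
```

By contrast, in the synchronous driving mode described in the abstract, the same kick-off would be issued as an extended CPU instruction rather than as ordinary stores to a control window, which simplifies integration with the existing software ecosystem at the cost of instruction-issue bandwidth and bus contention.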

     
