Abstract:
Low-batch large language model (LLM) inference at the edge is in growing demand, yet existing near-memory processing architectures are constrained by limited computational capability and thus deliver suboptimal acceleration. To overcome this, we propose H2-LLM, a heterogeneous accelerator based on hybrid bonding. H2-LLM employs an innovative heterogeneous architecture design, enabled by hybrid bonding, to balance computational capability and bandwidth. Furthermore, it introduces a data-centric dataflow abstraction methodology to fully exploit the speedup potential of low-batch inference. Using a design space exploration (DSE) framework, we automatically optimize the architectural configuration. Compared with existing in-die near-memory processing architectures and dataflow implementations, H2-LLM achieves significant improvements in both performance and energy efficiency.