H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
Graphical Abstract
Abstract
To address the demand for low-batch large language model (LLM) inference at the edge, where existing near-memory processing architectures are constrained by limited computational capability and thus deliver suboptimal acceleration, we propose H2-LLM, a heterogeneous accelerator based on hybrid bonding. H2-LLM employs an innovative heterogeneous architecture, enabled by hybrid bonding, that balances computational capability and bandwidth. Furthermore, it introduces a data-centric dataflow abstraction methodology to fully exploit the speedup potential of low-batch inference, and a design space exploration (DSE) framework to automatically optimize the architectural configuration. Compared to existing in-die near-memory processing architectures and dataflow implementations, H2-LLM achieves significant improvements in both performance and energy efficiency.