Abstract:
Low-batch large language model (LLM) inference at the edge is in growing demand, yet existing near-memory processing architectures are constrained by limited computational capability and thus deliver suboptimal acceleration. To overcome this, we propose H2-LLM, a heterogeneous accelerator based on hybrid bonding. H2-LLM employs an innovative heterogeneous architecture design, enabled by hybrid bonding, to balance computational capability and bandwidth. Furthermore, it introduces a data-centric dataflow abstraction methodology to fully exploit the speedup potential of low-batch inference. Using a design space exploration (DSE) framework, we automatically optimize the architectural configuration. Compared with existing in-die near-memory processing architectures and dataflow implementations, H2-LLM achieves significant improvements in both performance and energy efficiency.