Abstract:
Large language models (LLMs) have achieved remarkable success in natural language processing and other domains. However, limited GPU memory in industrial settings forces a trade-off between resource efficiency and model performance when extending LLMs to multiple downstream tasks, limiting their broader deployment. To address this, we propose Compact LLM with Collaboration of Experts (CCoE), a modular and compact multi-expert architecture. CCoE integrates multiple domain-specific experts into a unified LLM efficiently and flexibly, significantly reducing memory overhead. It further employs a rule-based gating mechanism and an expert planning module to enable precise task assignment and effective expert collaboration, thereby supporting complex reasoning. Experiments on five distinct datasets show that CCoE matches the performance of domain-specific LLMs. Compared with existing model ensemble methods, CCoE reduces GPU memory usage by 61.3%; compared with parameter-efficient multi-expert integration approaches, it improves inference throughput by 76.4%.