Abstract:
Large language models (LLMs) have achieved remarkable success in natural language processing and other domains. However, limited GPU memory in industrial settings forces a trade-off between resource efficiency and model performance when extending LLMs to multiple downstream tasks, limiting their broader deployment. To address this, we propose Compact LLM with Collaboration of Experts (CCoE), a modular and compact multi-expert architecture. CCoE integrates multiple domain-specific experts into a unified LLM efficiently and flexibly, significantly reducing memory overhead. It further employs a rule-based gating mechanism and an expert planning module to enable precise task assignment and effective expert collaboration, thereby supporting complex reasoning. Experiments on five distinct datasets show that CCoE matches the performance of domain-specific LLMs. Compared with existing model ensemble methods, CCoE reduces GPU memory usage by 61.3%; compared with parameter-efficient multi-expert integration approaches, it improves inference throughput by 76.4%.