
Jailbreak Attack and Defense of Large Language Models

  • 摘要: 近年来,以ChatGPT、Deepseek-R1为代表的大语言模型接连引发人工智能(artificial intelligence, AI)发展狂潮,AI针对传统领域的渗透不断加速。然而,由于大模型输入内容多样,用户群体广泛,其本身面临着严重的安全风险,其中越狱攻击通过越狱提示诱导模型输出偏见、暴力等严重违规内容,造成大语言模型服务商触犯安全监管法规,已成为当下大模型面临的最具威胁性的安全风险之一。本文详细分析了大模型遭受的越狱攻击安全风险,对当前大模型越狱攻击防御方法进行了梳理,并且探讨了防御方法面临的主要挑战以及未来的可能解决方案。

     

    Abstract: In recent years, large language models (LLMs) represented by ChatGPT and Deepseek-R1 have triggered successive waves of artificial intelligence (AI) development, accelerating AI's penetration into traditional domains. However, due to diverse input content and broad user base, LLMs face significant security risks. Among these, jailbreak attacks represent one of the most critical threats, potentially inducing models to generate harmful content, leading to malicious exploitation and regulatory violations of LLM service providers. This article analyzes the security risks of jailbreak attacks on LLMs, reviews current defense methods, and examines both the challenges and potential solutions in this domain.

     
