The Past and Present of Vision-Language-Action Models

Abstract: In recent years, vision-language-action (VLA) models have rapidly emerged as a focal point of artificial intelligence research. Since 2023, the appearance of systems such as Google’s RT-2 has drawn widespread attention to VLAs’ breakthrough capability to integrate multimodal perception, decision-making, and action execution. Unlike traditional modular pipelines, VLAs inherit the mature large language model and vision-language model paradigms and use an end-to-end architecture to map text and image understanding directly to action generation, markedly enhancing robotic generalization in dynamic environments. As the technology has evolved, the definition of VLA has been continuously broadened: the introduction of diffusion models and flow matching has made action generation more efficient; latent actions serve as tokens that semantically align with the native modalities of vision-language models; hierarchical “fast–slow brain” architectures have been devised to tackle complex tasks; and this diversification of technical routes signals that VLAs have entered a period of rapid development. This “hundred schools of thought contending” landscape reflects both the promise of the technology and its current limitations: most notably, a lack of interpretability, insufficient real-time performance, and the difficulty of training hybrid architectures. Despite the promising outlook, numerous open problems remain for researchers to address. Looking forward, the continued evolution of VLA may well reshape the paradigm of human–machine collaboration. If the bottlenecks in spatial perception and scene exploration can be overcome, robots will not only perform simple repetitive tasks but also explore and interact with the world freely; coupled with swarm-intelligence techniques, for example, fleets of VLA-driven robots could collaborate on complex missions such as disaster relief and surgery. Just as computer vision once gave machines “eyes,” VLA is now endowing them with “hands,” “feet,” and embodied physical knowledge; this transformation, which began with the pursuit of “embodied” capability, will ultimately carry artificial intelligence from virtual assistant to genuine physical partner.
