Abstract:
In recent years, vision-language-action (VLA) models have rapidly emerged as a focal point of artificial intelligence research. Since 2023, systems such as Google's RT-2 have drawn widespread attention to the breakthrough capability of VLAs to integrate multimodal perception, decision-making, and action execution. Unlike traditional modular pipelines, VLAs build on mature large language model and vision-language model paradigms to deliver an end-to-end mapping from text and image understanding directly to action generation, markedly enhancing robotic generalization in dynamic environments. As the field has evolved, the scope of VLA has continuously broadened: diffusion models and flow-matching techniques have greatly improved action-generation efficiency; latent actions serve as tokens semantically aligned with the native modalities of vision–language models; hierarchical "fast–slow brain" architectures have been devised to tackle complex tasks; and the proliferation of diverse technical pathways signals that VLAs have entered a phase of rapid development. This landscape, in which a hundred schools of thought contend, underscores both the promise of the technology and its current limitations, most notably challenges in interpretability, real-time performance, and the training of hybrid architectures. Despite an auspicious outlook, numerous open problems remain for researchers to address. Looking forward, the continued evolution of VLA may well redefine paradigms of human–machine collaboration: by overcoming bottlenecks in spatial perception and physical interaction, robots could transcend simple repetitive tasks to explore and engage with the world autonomously; and, when coupled with swarm-intelligence techniques, fleets of VLA-driven robots might collaborate on complex missions such as disaster relief or surgical assistance. Just as computer vision once gave machines "eyes", VLA is now endowing them with "hands", "feet", and embodied physical knowledge, ushering in a transformation from virtual assistants to genuine physical partners.