Aether: Geometric-Aware Unified World Modeling

Haoyi Zhu; Tong He

doi:10.11991/cccf.202605009

Abstract: Integrating geometric reconstruction with generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This article proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: 4D dynamic reconstruction, action-conditioned video prediction, and goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across the reconstruction, prediction, and planning objectives. Built on video generation models, the framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Thanks to its intrinsic geometric modeling, it also achieves zero-shot generalization in both action-following and reconstruction tasks. Notably, even without real-world data, its reconstruction performance is comparable to or better than that of domain-specific models. In addition, Aether uses camera trajectories as a geometry-informed action space, which enables effective action-conditioned prediction and visual planning. We hope this work inspires the community to explore new frontiers in physically reasonable world modeling and its applications.
