Geometric Action Model for Robot Policy Learning

Abstract

The paper introduces the Geometric Action Model (GAM), which leverages a pretrained geometric foundation model to enhance language-conditioned manipulation in 3D physical environments. GAM splits the foundation model into an observation encoding layer and a future prediction layer, enabling it to predict future tokens from language, proprioception, and action history before decoding them into actions. This 3D-aware approach significantly improves accuracy, robustness, efficiency, and speed over standard 2D vision-language-action models in both simulated and real-robot contact-rich tasks.

Key Points

GAM incorporates a pretrained geometric foundation model to bring 3D spatial understanding into robot manipulation policies.
The foundation model is split into observation encoding and future prediction layers, enabling temporal reasoning beyond static frames.
The model predicts future tokens conditioned on language instructions, proprioception, and action history, then decodes them into executable actions.
GAM outperforms 2D-only vision-language-action models in accuracy, robustness, efficiency, and speed across simulation and real-robot benchmarks.
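
The data flow described in the key points can be pictured with a toy sketch. This is not the paper's implementation: the class name ToyGAM, the layer sizes, the split index, the random linear layers standing in for pretrained weights, and the additive conditioning are all assumptions made only for illustration. It shows one backbone stack split into an encoder part and a predictor part, with the predictor conditioned on language, proprioception, and action history before an action head decodes the result.

```python
import random


def make_layer(n_in, n_out, rng):
    """Hypothetical stand-in for one pretrained layer: a small random linear map."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    """Apply the linear map followed by a ReLU to a flat feature vector."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


class ToyGAM:
    """Toy illustration only; all sizes and the conditioning scheme are assumed."""

    def __init__(self, obs_dim=8, cond_dim=6, hidden=16, action_dim=4,
                 depth=4, split=2, seed=0):
        rng = random.Random(seed)
        # Stand-in for the pretrained geometric foundation model: a layer stack.
        dims = [obs_dim] + [hidden] * depth
        backbone = [make_layer(dims[i], dims[i + 1], rng) for i in range(depth)]
        # Split the same stack into observation-encoding and future-prediction parts.
        self.encoder_layers = backbone[:split]
        self.predictor_layers = backbone[split:]
        # Assumed extra modules: a conditioning projection and an action head.
        self.cond_proj = make_layer(cond_dim, hidden, rng)
        self.action_head = make_layer(hidden, action_dim, rng)

    def encode(self, obs):
        """Map an observation vector to observation tokens (here, one vector)."""
        x = obs
        for layer in self.encoder_layers:
            x = apply_layer(layer, x)
        return x

    def predict_future(self, obs_tokens, language, proprio, action_history):
        """Predict future tokens from observation tokens plus conditioning."""
        cond = apply_layer(self.cond_proj, language + proprio + action_history)
        # Additive conditioning is an assumption, chosen for simplicity.
        x = [o + c for o, c in zip(obs_tokens, cond)]
        for layer in self.predictor_layers:
            x = apply_layer(layer, x)
        return x

    def act(self, obs, language, proprio, action_history):
        """Decode predicted future tokens into an action vector (linear head)."""
        future = self.predict_future(self.encode(obs), language, proprio,
                                     action_history)
        return [sum(w * v for w, v in zip(row, future))
                for row in self.action_head]


model = ToyGAM()
action = model.act(
    obs=[0.5] * 8,              # placeholder observation features
    language=[1.0, 0.0],        # placeholder language embedding
    proprio=[0.1, 0.2],         # placeholder joint state
    action_history=[0.0, 0.0],  # placeholder previous action
)
print(len(action))
```

In a real system the two halves would presumably be transformer blocks operating on token sequences and initialized from pretrained weights; the sketch only mirrors the encode, predict, decode ordering that the summary describes.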
