SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
English summary
SpatialClaw is a training-free framework that improves the spatial reasoning of vision-language models by using code as the action interface. Agents write code to compose and manipulate perception results dynamically, adapting to each task's text and visual observations. This enables flexible, stateful reasoning across diverse 3D and 4D tasks. Without any training, SpatialClaw reaches an average accuracy of 59.9% across multiple benchmarks, outperforming existing spatial agents.
Key points
Training-free framework requiring no parameter updates.
Uses code as the action interface for flexible and stateful spatial reasoning.
Agents can dynamically compose, manipulate, and adapt perception results based on task requirements and observations.
Achieves 59.9% average accuracy across diverse 3D/4D benchmarks, outperforming existing spatial agents.
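To make the idea of code as a stateful action interface concrete, here is a minimal Python sketch. It is not SpatialClaw's implementation: the `CodeActionSession` class, the `locate_object` tool, and the toy scene coordinates are all hypothetical, invented only for illustration. It shows one plausible reading of the key points: each action is a code snippet, perception results are stored as variables, and later snippets reuse them.

```python
import contextlib
import io
import math


# Hypothetical perception tool, stubbed with fixed toy values. A real system
# would call a detector or a depth/pose estimator here.
def locate_object(name):
    """Return a made-up 3D position (x, y, z) in metres for a named object."""
    toy_scene = {"chair": (1.0, 0.0, 2.0), "table": (4.0, 0.0, 6.0)}
    return toy_scene[name]


class CodeActionSession:
    """Runs code snippets against one namespace that persists across steps."""

    def __init__(self, tools):
        # Tools are exposed to the snippets as ordinary Python names.
        self.namespace = {"math": math, **tools}

    def step(self, code):
        """Execute one code action and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue().strip()


def run_demo():
    session = CodeActionSession({"locate_object": locate_object})
    # Step 1: query perception and keep the results as variables.
    session.step("chair = locate_object('chair')\ntable = locate_object('table')")
    # Step 2: a later action reuses the stored state to compose a new quantity.
    return session.step("print(round(math.dist(chair, table), 2))")


if __name__ == "__main__":
    print(run_demo())
```

Because the namespace outlives each step, the second snippet can combine earlier perception results without re-querying them. Note that `exec` on model-generated code is unsafe outside a sandbox; the sketch omits isolation for brevity.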