Source: arXiv | June 12, 2026 | Importance: 4/5

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Abstract

The paper presents SpatialClaw, a training-free framework that uses code execution as the action interface for agentic spatial reasoning. It maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, allowing a VLM-backed agent to write one executable cell per step based on all prior outputs. Evaluated on 20 static and dynamic 3D/4D spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, outperforming the prior spatial agent by 11.2 percentage points. The gains are consistent across six vision-language model backbones from two model families, with no benchmark‑ or model‑specific tuning. The results demonstrate that a flexible, iterative code‑based interface significantly outperforms single‑pass or structured tool‑call designs for open‑ended spatial tasks.
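The stateful-kernel idea can be illustrated with a toy sketch. This is not the paper's implementation: `StatefulKernel`, `run_cell`, and the `distance` primitive are hypothetical names, and the cells are hard-coded here where an agent would normally generate them from earlier outputs. The point it shows is that variables defined in one cell remain available to the next, and that each cell's text output (including errors) is returned for the following step.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: variables survive across cells."""

    def __init__(self, preload=None):
        # One shared namespace for every cell. `preload` plays the role of the
        # pre-loaded frames and primitives (all names here are hypothetical).
        self.namespace = dict(preload or {})

    def run_cell(self, source):
        """Execute one cell and return its captured stdout (or the traceback)."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Errors are returned as text so the loop can continue.
                print(traceback.format_exc())
        return buffer.getvalue()


def distance(p, q):
    """Hypothetical geometry primitive: Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


kernel = StatefulKernel(preload={"distance": distance})

# Cells an agent might write, one per step (hard-coded for illustration).
cells = [
    "chair = (0.0, 0.0, 0.0)\ntable = (3.0, 4.0, 0.0)",
    "d = distance(chair, table)\nprint(f'{d:.1f}')",
]
outputs = [kernel.run_cell(cell) for cell in cells]
```

The second cell reuses `chair` and `table` from the first without redefining them, which is the property a single-pass script or an isolated tool call would not provide.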

Key points

SpatialClaw is a training‑free framework that replaces fixed tool invocations with iterative code‑cell execution, so the agent can freely compose perception and geometry operations.
It maintains a stateful Python kernel pre‑loaded with input frames and specialist primitives, enabling the VLM agent to write each step conditioned on all intermediate text and visual outputs.
Across 20 static and dynamic 3D/4D spatial reasoning benchmarks, SpatialClaw averages 59.9% accuracy, surpassing the previous best spatial agent by 11.2 percentage points.
The improvement is consistent across six VLM backbones from two model families, with no model‑ or benchmark‑specific adaptation.
The study shows that an interface built around stateful code execution handles open‑ended spatial reasoning better than single‑pass code or structured tool‑call interfaces.
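A short calculation makes the headline numbers concrete. It assumes the 11.2-point gain is measured on the same 20-benchmark average as the 59.9% figure; the implied baseline and the relative gain are derived here and are not values reported in the summary.

```python
spatialclaw_avg = 59.9  # average accuracy (%) reported in the summary
gain_points = 11.2      # reported gain, in percentage points

# Assumption: the gain refers to the same 20-benchmark average.
implied_prior_avg = round(spatialclaw_avg - gain_points, 1)

# A percentage-point gain is an absolute difference; the relative
# improvement over the implied baseline is a different, larger number.
relative_gain = round(100 * gain_points / implied_prior_avg, 1)

print(implied_prior_avg, relative_gain)
```

Under that assumption the prior agent would average about 48.7%, which corresponds to a relative improvement of roughly 23%.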
