DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
Summary
DIRECT is a routing framework for embodied Vision-Language Model (VLM) planners that allocates test-time compute per prompt, using the multimodal scene context to decide how much to spend. It examines three scaling axes (chain-of-thought depth, model size, and memory history) and finds that naively scaling test-time compute yields uneven and often diminishing returns. Experiments on VLABench and RoboMME show that DIRECT significantly improves the success–cost Pareto frontier over fixed model selection. In validation on a physical Franka arm in a DROID setup, the router matches or exceeds a stronger model's success rate while cutting average latency by up to 65%. These results indicate that selective compute allocation can deliver frontier-level embodied planning at substantially lower cost.
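The routing idea can be illustrated with a minimal sketch. This is not the method from the paper: it assumes some predictor has already estimated, from the scene context, how likely each compute configuration is to succeed, and it simply picks the cheapest configuration that clears a target. The names `ComputeConfig`, `route`, and `min_success`, and the relative `cost` field, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeConfig:
    """One point along the three scaling axes (hypothetical structure)."""
    name: str
    cot_depth: int       # chain-of-thought depth
    model_size: str      # e.g. "small" or "large"
    memory_frames: int   # how much history is kept
    cost: float          # relative latency or compute cost


def route(predicted_success, configs, min_success=0.8):
    """Return the cheapest config predicted to reach min_success.

    predicted_success maps config name -> estimated success probability
    for the current prompt. If no config reaches the target, fall back
    to the one with the highest predicted success.
    """
    viable = [c for c in configs if predicted_success[c.name] >= min_success]
    if viable:
        return min(viable, key=lambda c: c.cost)
    return max(configs, key=lambda c: predicted_success[c.name])


# Assumed example configurations; the costs are placeholders.
CONFIGS = [
    ComputeConfig("small", cot_depth=0, model_size="small", memory_frames=1, cost=1.0),
    ComputeConfig("large", cot_depth=4, model_size="large", memory_frames=8, cost=4.0),
]
```

Under these assumptions, an easy prompt where both configurations are predicted to succeed is routed to `small`, and a hard prompt where only `large` clears the target is routed to `large`.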
Key points
Introduces DIRECT, a routing framework that allocates test-time compute per prompt in embodied agents based on multimodal scene context.
Identifies three scaling axes (chain-of-thought depth, model size, and memory history) along which added compute yields qualitatively distinct and uneven gains.
Experiments on VLABench and RoboMME show that DIRECT improves the success–cost Pareto frontier over fixed model selection, indicating that naive scaling wastes compute.
Validation on a physical Franka arm achieves a success rate similar to or higher than a stronger model's with up to 65% lower average latency, demonstrating real-world efficiency.
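For readers unfamiliar with the success–cost Pareto frontier mentioned above, the following sketch shows one common way to compute it: a configuration is on the frontier if no other configuration is at least as cheap and at least as successful. The function name `pareto_frontier` and the numbers in `EXAMPLE_POINTS` are illustrative placeholders, not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, cost, success) tuples, sorted by cost.

    A point is kept only if its success is strictly higher than that of
    every point with lower or equal cost.
    """
    frontier = []
    best_success = float("-inf")
    # Sort by cost ascending; on equal cost, put higher success first.
    for name, cost, success in sorted(points, key=lambda p: (p[1], -p[2])):
        if success > best_success:
            frontier.append((name, cost, success))
            best_success = success
    return frontier


# Placeholder values for illustration only (cost in arbitrary units).
EXAMPLE_POINTS = [
    ("cfg_a", 1.0, 0.55),
    ("cfg_b", 4.0, 0.80),
    ("cfg_c", 6.0, 0.78),
    ("cfg_d", 2.0, 0.80),
]

frontier_names = [name for name, _, _ in pareto_frontier(EXAMPLE_POINTS)]
```

With these placeholder values, `cfg_b` and `cfg_c` are dominated by `cfg_d`, which reaches the same or higher success at lower cost. Improving the frontier means finding points of this kind: the same success for less cost, or more success for the same cost.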