11 Researchers Debate the Definition of "World Model" at the Zhiyuan Conference
Summary
At the 2026 Zhiyuan Conference, 11 front-line researchers from institutions including BAAI, Skywork, and Tencent Hunyuan debated what constitutes a true world model. They agreed that none of the current routes (video generation, 3D reconstruction, or vision-language-model-based methods) captures physical causality or handles multi-sensor input. The panelists argued that existing benchmarks overemphasize visual fidelity while neglecting physical correctness and interactive prediction. They named three core bottlenecks: video data lacks precise physical annotations, there is no standardized evaluation of physical understanding, and models cannot yet represent fine-grained physical interactions with enough precision. As a way forward, they advocated moving from next-token prediction to predicting the next physical state, training state and action models jointly, segmenting experience at the scale of events, and verifying models through closed-loop robotic interaction.
Key points
The panelists considered every current system labeled a "world model" insufficient: these systems rely on statistical correlation rather than physical causality, and they can neither handle multi-sensor input nor make precise physical predictions.
Existing benchmarks measure generation quality, not physical understanding. No model can yet reliably complete a basic interactive task such as opening a fridge, and the industry lacks a unified evaluation standard for physical correctness.
Panelists identified three critical deficits: video data lacks precise physical annotation, modeling accuracy falls far short of what high-precision tasks require, and real-robot interaction data for training closed-loop agents is severely scarce.
The proposed path forward emphasizes joint training of next-state prediction and action generation, event-centric variable-length segmentation for efficient encoding, and a shift from passive observation to active interaction, so that models can correct themselves from real-world feedback.
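
Two of the ideas above, event-centric variable-length segmentation and a joint objective over next-state prediction and action generation, can be illustrated with a toy sketch. This is not any panelist's method: the function names, the threshold-based "event" detector, the one-dimensional state, and the `action_weight` parameter are all assumptions made only for illustration.

```python
from typing import List, Sequence


def segment_by_events(states: Sequence[float], threshold: float) -> List[List[float]]:
    """Split a 1-D state trajectory into variable-length segments.

    A new segment starts whenever the change between consecutive states
    exceeds `threshold`. This jump test is a hypothetical stand-in for a
    real event detector.
    """
    if not states:
        return []
    segments = [[states[0]]]
    for prev, cur in zip(states, states[1:]):
        if abs(cur - prev) > threshold:
            segments.append([cur])    # large jump: treat it as an event boundary
        else:
            segments[-1].append(cur)  # small change: stay in the current segment
    return segments


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred_state: Sequence[float],
    true_state: Sequence[float],
    pred_action: Sequence[float],
    true_action: Sequence[float],
    action_weight: float = 1.0,
) -> float:
    """Toy joint objective: next-state error plus weighted action error."""
    return mse(pred_state, true_state) + action_weight * mse(pred_action, true_action)


if __name__ == "__main__":
    trajectory = [0.0, 0.1, 0.2, 1.5, 1.6, 0.1]
    print(segment_by_events(trajectory, threshold=0.5))
    print(joint_loss([1.0, 2.0], [1.0, 4.0], [0.0], [1.0], action_weight=0.5))
```

In this sketch the segment lengths follow the dynamics of the trajectory rather than a fixed frame count, and a single scalar loss couples the state and action errors so that one optimizer step would update both predictions together.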