WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
English summary
WeaveBench is a comprehensive benchmark for evaluating computer-use agents (CUAs) on hybrid-interface tasks, which require both GUI and CLI/code operations. It comprises 114 long-horizon tasks spanning 8 real-world work domains, all evaluated on a real Ubuntu desktop. The benchmark includes a trajectory-aware judge that inspects agent deliverables and detects shortcut behaviors, addressing limitations of traditional evaluation methods. PassRate across the tested model-runtime pairings is only 41.2%, which points to a significant performance gap in long-horizon task orchestration.
Chinese summary
WeaveBench 是一个专为评估跨混合接口操作的计算机使用智能体(CUA)而设计的全面基准,要求同时进行 GUI 和 CLI/代码操作。它包含 114 个长周期任务,覆盖 8 个真实工作领域,并在真实的 Ubuntu 桌面上进行评估。该基准引入了一种轨迹感知评判器,用于检查智能体的交付成果并检测走捷径行为,弥补了传统评估方法的不足。在测试的模型-运行时组合中,通过率仅为 41.2%,暴露了在长周期任务编排方面的显著性能差距。
Key points
A new benchmark, WeaveBench, specifically targets computer-use agents handling hybrid GUI and CLI operations in long-horizon tasks.
新基准 WeaveBench 专门针对处理混合 GUI 和 CLI 操作的长周期任务的计算机使用智能体。
The benchmark contains 114 tasks across 8 real-world domains, evaluated on a real Ubuntu desktop environment.
该基准包含 114 个任务,覆盖 8 个真实世界领域,在真实的 Ubuntu 桌面环境中进行评估。
It features a trajectory-aware judge that inspects deliverables and detects shortcut behaviors, improving evaluation fidelity.
它采用轨迹感知评判器检查交付成果并检测走捷径行为,提升了评估的保真度。
Across the tested model-runtime pairings, PassRate is only 41.2%, revealing a large gap in long-horizon orchestration capabilities.
在测试的模型-运行时组合中,通过率仅为 41.2%,揭示了在长周期编排能力上的巨大差距。
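Illustrative sketch
The idea of a trajectory-aware judge can be sketched in a few lines of Python. This is a minimal illustration, not WeaveBench's actual judge. The names Step, Trajectory, judge, and pass_rate are hypothetical. The sketch also assumes that a "shortcut" means skipping an interface the task requires, and that PassRate is the plain percentage of tasks passed. The summary above states neither.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    interface: str  # hypothetical label: "gui" or "cli"
    action: str     # free-form description of what the agent did


@dataclass
class Trajectory:
    task_id: str
    steps: list = field(default_factory=list)
    # Hypothetical flag: result of inspecting the final deliverable.
    deliverable_ok: bool = False


def judge(traj: Trajectory, required_interfaces: set) -> bool:
    """Pass only if the deliverable is correct AND no required interface was skipped.

    Assumption: a shortcut is modeled as a required interface that never
    appears in the trajectory (e.g. a GUI step bypassed through the CLI).
    """
    used = {step.interface for step in traj.steps}
    took_shortcut = not required_interfaces <= used
    return traj.deliverable_ok and not took_shortcut


def pass_rate(results: list) -> float:
    """Percentage of tasks judged as passed (assumed definition of PassRate)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Illustrative trajectories, not benchmark data.
honest = Trajectory("demo-1", [Step("gui", "export chart"), Step("cli", "run script")], True)
shortcut = Trajectory("demo-2", [Step("cli", "write output file directly")], True)
verdicts = [judge(t, {"gui", "cli"}) for t in (honest, shortcut)]
```

Here the second trajectory produces an acceptable deliverable but never touches the GUI, so a judge that only checked the output would pass it, while this sketch rejects it.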