WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

English summary

WeaveBench is a benchmark for evaluating computer-use agents (CUAs) on tasks that require both GUI and CLI/code operations across hybrid interfaces. It comprises 114 long-horizon tasks spanning 8 real-world work domains, all evaluated on a real Ubuntu desktop. A trajectory-aware judge inspects agent deliverables and detects shortcut behaviors, addressing limitations of traditional evaluation methods. PassRate across the tested model-runtime pairings is only 41.2%, indicating a substantial performance gap in long-horizon task orchestration.
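As a rough illustration of how a pass rate could be computed when the judge both inspects deliverables and flags shortcuts, here is a minimal Python sketch. It is not the benchmark's actual implementation: the `TaskResult` fields, the rule that a flagged shortcut fails the task, and the definition of the rate as a plain fraction of tasks are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskResult:
    """Hypothetical per-task verdict; field names are illustrative only."""
    task_id: str
    deliverable_ok: bool      # assumed: judge found the deliverable correct
    shortcut_detected: bool   # assumed: judge flagged a shortcut in the trajectory


def task_passes(result: TaskResult) -> bool:
    # Assumed rule: a correct deliverable only counts if no shortcut was flagged.
    return result.deliverable_ok and not result.shortcut_detected


def pass_rate(results: Iterable[TaskResult]) -> float:
    """Fraction of tasks that pass, as a value between 0 and 1."""
    results = list(results)
    if not results:
        raise ValueError("pass_rate needs at least one task result")
    return sum(task_passes(r) for r in results) / len(results)


# Made-up verdicts for 114 tasks: 47 clean passes, 7 correct deliverables
# reached via a flagged shortcut, and 60 incorrect deliverables.
example = (
    [TaskResult(f"pass-{i}", True, False) for i in range(47)]
    + [TaskResult(f"shortcut-{i}", True, True) for i in range(7)]
    + [TaskResult(f"fail-{i}", False, False) for i in range(60)]
)
print(f"{pass_rate(example):.1%}")
```

The counts in `example` are invented to show the arithmetic; the summary reports only the 41.2% figure, not the underlying task counts. Under these assumptions, scoring on deliverables alone would count 54 of the 114 tasks as passes, which is the kind of inflation a trajectory-aware check is meant to avoid.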

Key points

A new benchmark, WeaveBench, specifically targets computer-use agents handling hybrid GUI and CLI operations in long-horizon tasks.

The benchmark contains 114 tasks across 8 real-world domains, evaluated on a real Ubuntu desktop environment.

It features a trajectory-aware judge that inspects deliverables and detects shortcut behaviors, improving evaluation fidelity.

Across the tested model-runtime pairings, PassRate is only 41.2%, revealing a large gap in long-horizon orchestration capabilities.
