Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
English summary
Claw-SWE-Bench is a multilingual SWE-bench-style benchmark with 350 issue-resolution instances across 8 languages and 43 repositories. It is designed to compare heterogeneous agent harnesses (claws) fairly through a standardized adapter protocol that fixes prompts, runtime budgets, and patch extraction. A cost-aware Lite subset of 80 instances is provided for faster validation. With the same GLM 5.1 backbone, OpenClaw's Pass@1 rises from 19.1% with a minimal direct-diff adapter to 73.4% with the full adapter, showing that adapter design is essential to harness performance. A sweep over nine models and five harnesses shows that model choice and harness choice shift Pass@1 by about 29 pp and 27 pp, respectively, while total API cost varies substantially even among systems with similar accuracy. The benchmark therefore treats harness architecture and cost as first-class evaluation axes for coding agents.
Chinese summary
Claw-SWE-Bench是一个多语言SWE-bench风格基准测试,包含350个问题解决实例,覆盖8种语言和43个代码库,旨在通过标准化的适配器协议(固定提示、运行时预算、补丁提取)公平地比较不同类型的智能体适配器。提供了经成本感知筛选的80实例Lite子集以加快验证。在相同的GLM 5.1基座模型下,OpenClaw使用最小直接差异适配器的Pass@1仅为19.1%,而使用完整适配器后达到73.4%,表明适配器设计对开源智能体适配器的编码性能至关重要。在九个模型和五个适配器的交叉实验中,模型选择和适配器选择分别独立导致Pass@1约29个百分点和27个百分点的变化,且精度相近的系统在总API成本上可能差异显著。因此该基准将适配器架构和成本作为编码智能体评估的一级维度。
Key points
Introduces Claw-SWE-Bench, a 350-instance multilingual coding benchmark spanning 8 languages and 43 repos, with a standardized adapter protocol for comparing agent harnesses.
提出Claw-SWE-Bench,一个包含350个实例、跨8种语言和43个代码库的多语言编码基准,通过标准化适配器协议比较智能体适配器。
Provides Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure for faster and reproducible validation.
提供Claw-SWE-Bench Lite,一个经成本感知和排序筛选的80实例子集,用于快速且可复现的验证。
Adapter design is critical: with the same GLM 5.1 backbone, OpenClaw's Pass@1 more than triples from 19.1% (minimal direct-diff adapter) to 73.4% (full adapter), a 54.3 pp gain from adapter engineering alone with the model held fixed.
适配器设计至关重要:在相同GLM 5.1基座下,OpenClaw的Pass@1从最小直接差异适配器的19.1%跃升至完整适配器的73.4%,即在模型固定的情况下,仅适配器工程就带来54.3个百分点的提升。
Sweep experiments over nine models and five harnesses show that model choice shifts Pass@1 by 29.4 pp and harness choice by 27.4 pp, while total API cost varies substantially even among systems with similar accuracy, making harness and cost first-class evaluation axes.
交叉实验显示,模型选择导致Pass@1变化29.4个百分点,适配器选择变化27.4个百分点,且精度相近的系统在总API成本上仍可能差异显著,使适配器和成本成为一级评估维度。
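The Pass@1 and percentage-point figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the benchmark's own scoring code: the `RunResult` record, its field names (including the USD cost unit), and the helper functions are hypothetical names introduced for illustration. Only the 19.1% and 73.4% values come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One harness attempt on one benchmark instance (hypothetical record)."""
    instance_id: str
    resolved: bool        # True if the extracted patch passed the instance's tests
    api_cost_usd: float   # API spend for this attempt (unit is an assumption)


def pass_at_1(results):
    """Percentage of instances resolved on a single attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def total_cost(results):
    """Sum of API spend over all attempts, reported alongside accuracy."""
    return sum(r.api_cost_usd for r in results)


def pp_gap(a, b):
    """Absolute difference between two percentages, in percentage points."""
    return round(abs(a - b), 1)


# Figures reported in the summary: same GLM 5.1 backbone, two adapters.
minimal_adapter, full_adapter = 19.1, 73.4
print(pp_gap(full_adapter, minimal_adapter))     # 54.3
print(round(full_adapter / minimal_adapter, 2))  # 3.84

# Toy data (made up) showing how accuracy and cost are tracked together.
toy = [
    RunResult("repo-a-1", True, 0.40),
    RunResult("repo-a-2", False, 0.90),
    RunResult("repo-b-1", True, 0.25),
    RunResult("repo-b-2", True, 0.45),
]
print(pass_at_1(toy))             # 75.0
print(round(total_cost(toy), 2))  # 2.0
```

Keeping cost next to accuracy in the same record reflects the summary's point that two systems with similar Pass@1 can differ widely in total API spend.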