Recursive Agent Harnesses

English summary

The paper introduces and formalizes the Recursive Agent Harness (RAH), a code-first extension of recursive language models in which a parent agent generates executable scripts that spawn, in parallel, full subagent harnesses with filesystem tools, code execution, and planning. In a controlled evaluation on Oolong-Synthetic (199 samples, context lengths up to 4M tokens), RAH with a fixed GPT-5 backbone improves on the Codex coding-agent baseline from 71.75% to 81.36%. With a stronger backbone, Claude Sonnet 4.5, RAH achieves 89.77%, which the authors take as evidence that the gains stem from the harness design rather than from model scaling.

Key points

RAH defines harness recursion: a parent agent generates executable scripts that spawn complete subagent harnesses equipped with tools, extending model-level recursion into code-level orchestration.
Evaluated on Oolong-Synthetic (199 samples across 13 context-length buckets, up to 4M tokens), RAH with a fixed GPT-5 backbone achieves 81.36%, a 9.61 percentage-point improvement over the Codex coding-agent baseline (71.75%).
With the stronger Claude Sonnet 4.5 backbone, RAH reaches 89.77%, which the authors take as evidence that the gains come from the harness design rather than from model capability alone.
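
The harness recursion described in the key points can be illustrated with a small toy sketch. This is not the paper's implementation: no language model is called, and the class `Harness`, its fields `budget` and `fanout`, and the method `solve_directly` are hypothetical names chosen for illustration. The sketch only shows the control flow, assuming a parent that splits an oversized context, spawns child harnesses in parallel, and aggregates their partial answers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Harness:
    """Toy stand-in for an agent harness (hypothetical; no model is called)."""
    budget: int      # max number of lines one harness handles directly
    fanout: int = 4  # number of child harnesses spawned per level
    depth: int = 0   # recursion depth, kept only for inspection

    def __post_init__(self):
        if self.budget < 1 or self.fanout < 2:
            raise ValueError("need budget >= 1 and fanout >= 2 to guarantee termination")

    def solve_directly(self, lines, label):
        # Placeholder for "an agent with tools works on a small context".
        return sum(1 for line in lines if label in line)

    def run(self, lines, label):
        if len(lines) <= self.budget:
            return self.solve_directly(lines, label)
        # Stand-in for the generated script: split the context and spawn children.
        size = -(-len(lines) // self.fanout)  # ceiling division
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        with ThreadPoolExecutor(max_workers=self.fanout) as pool:
            partials = pool.map(
                lambda chunk: Harness(self.budget, self.fanout, self.depth + 1).run(chunk, label),
                chunks,
            )
            # Aggregate the children's partial counts into the parent's answer.
            return sum(partials)


# Synthetic context: every third record carries the label we count.
context = [f"record {i}: label={'spam' if i % 3 == 0 else 'ham'}" for i in range(1000)]
root = Harness(budget=50)
print(root.run(context, "label=spam"))
```

Each level creates its own thread pool, so a parent waiting on its children never starves them of workers. In a real harness the direct-solve step would be a full agent loop with tools rather than a string match.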
