Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
Abstract
This paper introduces RACES, a framework that treats verifiable environments as composable building blocks and automatically fuses them into new training environments when their input and output types align. Starting from 300 base environments and a set of composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT), RL training on the resulting composite environments consistently improves reasoning generalization. On six unseen benchmarks, the average score rises by 3.1 points for DeepSeek-R1-Distill-Qwen-14B (48.2→51.3) and by 2.3 points for Qwen3-14B (58.8→61.1). Training with only 50 base environments reaches performance comparable to using all 300, which demonstrates efficient environment scaling.
Key Points
RACES recursively and automatically composes verifiable environments by matching the codomain (output type) of one environment to the domain (input type) of another, using operators such as SEQUENTIAL and PARALLEL to create diverse training tasks.
RL training on RACES-generated composite environments improves reasoning generalization: across six unseen benchmarks, DeepSeek-R1-Distill-Qwen-14B gains 3.1 points and Qwen3-14B gains 2.3 points.
The method is highly environment-efficient: training with only 50 base environments achieves results comparable to training on the full set of 300 individual environments.
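To make the type-matching idea concrete, here is a minimal Python sketch of how composition by aligned input and output types might look. It is not the RACES implementation: the `Env` class, the `sequential` and `parallel` helpers, and the three toy base environments are all hypothetical names introduced for illustration. In particular, the PARALLEL semantics used here (both environments receive the same input and their outputs are paired) is an assumption, and SORT and SELECT are omitted because the summary does not define them.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Toy verifiable environment (hypothetical): a typed task with a checkable answer."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # reference solution used as ground truth

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is accepted only if it equals the reference output.
        return answer == self.solve(x)


def sequential(a: Env, b: Env) -> Env:
    """Chain a into b; allowed only when a's output type is b's input type."""
    if a.out_type is not b.in_type:
        raise TypeError(f"cannot chain {a.name} -> {b.name}: types do not align")
    return Env(f"SEQ({a.name},{b.name})", a.in_type, b.out_type,
               lambda x: b.solve(a.solve(x)))


def parallel(a: Env, b: Env) -> Env:
    """Assumed semantics: apply both environments to the same input, pair the outputs."""
    if a.in_type is not b.in_type:
        raise TypeError(f"cannot pair {a.name} with {b.name}: input types differ")
    return Env(f"PAR({a.name},{b.name})", a.in_type, tuple,
               lambda x: (a.solve(x), b.solve(x)))


# Three toy base environments (illustrative only).
reverse_str = Env("reverse", str, str, lambda s: s[::-1])
str_len = Env("length", str, int, len)
square = Env("square", int, int, lambda n: n * n)

# Recursive composition: a composite is itself an Env, so it can be composed again.
len_squared = sequential(str_len, square)        # str -> int
deep = sequential(reverse_str, len_squared)      # str -> int
mixed = parallel(len_squared, reverse_str)       # str -> tuple

print(deep.name, deep.solve("hello"))    # SEQ(reverse,SEQ(length,square)) 25
print(mixed.name, mixed.solve("abc"))    # PAR(SEQ(length,square),reverse) (9, 'cba')
print(deep.verify("hello", 25))          # True
```

Because every composite exposes the same interface as a base environment, a small pool of base environments can yield a much larger pool of composite tasks, which is consistent with the environment-efficiency claim above.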