Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Abstract

This paper introduces RACES, a framework that treats verifiable environments as composable building blocks and automatically fuses them into new training environments when their input and output types align. Starting from 300 base environments and a set of composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT), RL training on the resulting composite environments consistently improves reasoning generalization. Experiments show an average gain of 3.1 points on six unseen benchmarks for DeepSeek-R1-Distill-Qwen-14B (48.2→51.3) and 2.3 points for Qwen3-14B (58.8→61.1). Training with only 50 base environments reaches performance comparable to using all 300, which indicates that composition scales environments efficiently.

Key Points

RACES enables recursive, automated composition of verifiable environments by matching the codomain (output type) of one environment to the domain (input type) of another, using operators such as SEQUENTIAL and PARALLEL to create diverse training tasks.

RL training on RACES-generated composite environments improves reasoning generalization: averaged across six unseen benchmarks, DeepSeek-R1-Distill-Qwen-14B gains 3.1 points and Qwen3-14B gains 2.3 points.

The method is highly environment-efficient: using only 50 base environments achieves results comparable to training on the full set of 300 individual environments.

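
The type-matched composition described in the first key point can be illustrated with a minimal Python sketch. This is not the RACES implementation: the `Env` class, the `sequential` and `parallel` helpers, and the three toy environments are all hypothetical names invented for illustration, and the SORT and SELECT operators are omitted because the summary does not describe their semantics. The sketch only shows the idea that two environments fuse when the codomain of one equals the domain of the other, and that a composite is itself an environment that can be composed again.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task with a reference solver."""
    name: str
    domain: type                  # input type
    codomain: type                # output type
    solve: Callable[[Any], Any]   # reference solver used to check answers

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is correct when it matches the reference solver's output.
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Chain two environments; allowed only when the types line up."""
    if first.codomain is not second.domain:
        raise TypeError(
            f"cannot compose {first.name} -> {second.name}: "
            f"{first.codomain.__name__} != {second.domain.__name__}"
        )
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


def parallel(left: Env, right: Env) -> Env:
    """Run two environments side by side on a pair of inputs (assumed semantics)."""
    return Env(
        name=f"PARALLEL({left.name}, {right.name})",
        domain=tuple,
        codomain=tuple,
        solve=lambda pair: (left.solve(pair[0]), right.solve(pair[1])),
    )


# Toy base environments, invented for this example.
count_vowels = Env("count_vowels", str, int, lambda s: sum(c in "aeiou" for c in s))
is_even = Env("is_even", int, bool, lambda n: n % 2 == 0)
reverse = Env("reverse", str, str, lambda s: s[::-1])

# str -> int feeds int -> bool, so these two fuse into one str -> bool task.
vowels_even = sequential(count_vowels, is_even)

# A composite is itself an Env, so composition can be applied recursively.
nested = sequential(reverse, vowels_even)

# Independent sub-tasks bundled into one environment.
both = parallel(reverse, count_vowels)
```

Because each composite carries its own reference solver, it stays verifiable without extra annotation: `vowels_even.verify("lego", True)` checks a model's answer against the chained solvers, while an ill-typed pairing such as `sequential(is_even, count_vowels)` is rejected with a `TypeError`.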