Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

English summary

This paper introduces RACES, a framework that treats verifiable environments as composable building blocks and automatically fuses them into new training environments when their input and output types align. Composite environments are built from 300 base environments using composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT), and RL training on them consistently improves reasoning generalization. Across six unseen benchmarks, the average score rises by 3.1 points for DeepSeek-R1-Distill-Qwen-14B (48.2→51.3) and by 2.3 points for Qwen3-14B (58.8→61.1). Training with only 50 base environments reaches performance comparable to using all 300, demonstrating efficient environment scaling.

Key points

RACES enables recursive, automated composition of verifiable environments by matching the codomain (output type) of one environment to the domain (input type) of another, using operators such as SEQUENTIAL and PARALLEL to create diverse training tasks.

RL training on RACES-generated composite environments improves reasoning generalization: averaged over six unseen benchmarks, DeepSeek-R1-Distill-Qwen-14B gains 3.1 points (48.2→51.3) and Qwen3-14B gains 2.3 points (58.8→61.1).

The method is highly environment-efficient: using only 50 base environments achieves results comparable to training on the full set of 300 individual environments.

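
To make the type-matching idea from the key points concrete, here is a minimal Python sketch. It is not the RACES implementation: the `Env` class, the string type tags, the `sequential` and `parallel` helpers, and the three toy environments are all hypothetical names chosen for illustration, and the semantics assumed here for PARALLEL (both environments receive the same input and the outputs are paired) are a guess rather than something the summary specifies. SORT and SELECT are omitted because the summary does not describe them.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed reference function plus a verifier."""
    name: str
    domain: str                   # type tag of the input, e.g. "list[int]"
    codomain: str                 # type tag of the output
    solve: Callable[[Any], Any]   # reference solution that defines the ground truth

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is accepted only if it matches the reference output exactly.
        return answer == self.solve(x)


def sequential(f: Env, g: Env) -> Env:
    """Chain f then g; allowed only when f's codomain equals g's domain."""
    if f.codomain != g.domain:
        raise TypeError(f"cannot chain {f.name} -> {g.name}: "
                        f"{f.codomain} != {g.domain}")
    return Env(f"SEQUENTIAL({f.name}, {g.name})", f.domain, g.codomain,
               lambda x: g.solve(f.solve(x)))


def parallel(f: Env, g: Env) -> Env:
    """Assumed semantics: run f and g on the same input and pair the outputs."""
    if f.domain != g.domain:
        raise TypeError(f"cannot pair {f.name} with {g.name}: "
                        f"{f.domain} != {g.domain}")
    return Env(f"PARALLEL({f.name}, {g.name})", f.domain,
               f"tuple[{f.codomain}, {g.codomain}]",
               lambda x: (f.solve(x), g.solve(x)))


# Three toy base environments (illustrative only).
sort_env = Env("sort", "list[int]", "list[int]", sorted)
sum_env = Env("sum", "list[int]", "int", sum)
double_env = Env("double", "int", "int", lambda n: 2 * n)

# A composite is itself an Env, so it can be composed again (recursion).
chain = sequential(sequential(sort_env, sum_env), double_env)
pair = parallel(sort_env, sum_env)

print(chain.name)                   # SEQUENTIAL(SEQUENTIAL(sort, sum), double)
print(chain.solve([3, 1, 2]))       # 12
print(chain.verify([3, 1, 2], 12))  # True
print(pair.solve([3, 1, 2]))        # ([1, 2, 3], 6)
```

Because every composite exposes the same `domain`, `codomain`, and `verify` interface as a base environment, the composite stays automatically checkable, which is the property that makes such tasks usable as RL training signals. A mismatched pairing such as `sequential(double_env, sort_env)` is rejected with a `TypeError` in this sketch, mirroring the requirement that types align before two environments are fused.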