Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
English summary
The paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. It first trains a reasoning-aware retriever via gold-relevance distillation, so that candidate contexts are ranked by expected reasoning benefit rather than semantic overlap. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, so that it learns to make use of the reasoning traces in those demonstrations. Analysis shows that reasoning-aware retrieval surfaces complementary solution strategies, giving different problems distinct reasoning scaffolds. On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, suggesting that reasoning-aware retrieval is an improvement orthogonal to advances in reward design or training curricula.
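To make the training signal concrete, here is a minimal sketch of a verifiable outcome reward combined with the group-relative advantage normalization used by GRPO, the baseline named above. The summary does not give the paper's prompt format or reward function, so `build_prompt`, `outcome_reward`, and the exact-match check are hypothetical illustrations, not the authors' implementation.

```python
import statistics


def build_prompt(problem: str, demonstrations: list[str]) -> str:
    """Hypothetical prompt layout: retrieved analogous demonstrations first,
    then the target problem. The real format is not given in the summary."""
    shots = "\n\n".join(f"Example {i + 1}:\n{d}" for i, d in enumerate(demonstrations))
    return f"{shots}\n\nProblem:\n{problem}\nSolution:"


def outcome_reward(predicted_answer: str, reference_answer: str) -> float:
    """Assumed binary verifiable reward: 1.0 on an exact final-answer match."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each rollout's reward is normalized by the
    mean and (population) standard deviation of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of four rollouts for one problem whose reference answer is "42".
answers = ["42", "41", " 42 ", "7"]
rewards = [outcome_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative advantages. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.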
Key points
RA-RFT trains a reasoning-aware retriever via gold-relevance distillation, ranking contexts by expected reasoning benefit, not semantic similarity.
The policy model is fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards.
Reasoning-aware retrieval surfaces complementary solution strategies, providing distinct reasoning scaffolds for different problems.
On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points (Qwen3-1.7B) and 2.8 points (Qwen3-4B), suggesting that its gains are orthogonal to advances in reward design or training curricula.
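The summary does not define average@32. Under the usual reading, it is the fraction of correct answers among 32 sampled responses per problem, averaged over all problems and reported as a percentage. The sketch below computes that quantity under this assumption. The function name `average_at_k` and the toy data (k = 4, two problems) are illustrative only and are not results from the paper.

```python
def average_at_k(correct_by_problem: list[list[bool]]) -> float:
    """Mean per-problem accuracy over k sampled responses, as a percentage.

    correct_by_problem[i][j] is True when the j-th sampled response to
    problem i was verified as correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data: two problems, four samples each (the paper uses 32 samples).
baseline = [[True, False, False, False], [True, True, False, False]]
improved = [[True, True, False, False], [True, True, True, False]]

baseline_score = average_at_k(baseline)
improved_score = average_at_k(improved)
gain_in_points = improved_score - baseline_score
```

An improvement "in points" is the absolute difference between two such percentages, not a relative change.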