DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
English summary
DeepRubric is a data construction framework that inverts the usual pipeline, in which rubrics are generated for an existing query. It first builds an evidence tree by recursively expanding evidence-backed sub-questions from a seed topic, then uses the tree's leaves as atomic, verifiable evaluation targets from which it synthesizes aligned query–rubric pairs. Because the query and the rubric derive from the same leaves, the reward evaluates exactly the information the query requests. Using 9K such query–rubric pairs, the authors train DeepRubric-8B with rubric-based GRPO. The model performs comparably to prior open state-of-the-art deep research models across three benchmarks while requiring roughly 13× fewer RL GPU-hours.
Chinese summary
DeepRubric 是一个数据构建框架,它反转了通常为查询生成评分标准的流程。该框架先从种子主题出发,递归扩展证据支撑的子问题,构建一棵证据树;随后以树的叶子节点作为原子化、可验证的评估目标,合成对齐的查询-评分标准对。由此确保奖励信号准确评估查询所要求的信息。作者利用 9K 条此类样本,以基于评分标准的 GRPO 训练了 DeepRubric-8B,使其在三个基准上的性能与之前开源的最佳深度研究模型持平,而所需的强化学习 GPU 小时仅约 1/13。
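The construction described in the summaries can be pictured with a small sketch. The Python below is an illustrative assumption, not the authors' implementation: EvidenceNode, build_tree, leaves, synthesize_pair, and toy_expand are hypothetical names, the lookup table stands in for whatever evidence-backed sub-question generation the framework actually uses, and the string templates stand in for its query and rubric synthesis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical expander type: maps a question to (sub_question, evidence) pairs.
Expander = Callable[[str], List[Tuple[str, str]]]


@dataclass
class EvidenceNode:
    question: str
    evidence: str
    children: List["EvidenceNode"] = field(default_factory=list)


def build_tree(question: str, evidence: str, expand: Expander, max_depth: int) -> EvidenceNode:
    """Recursively expand evidence-backed sub-questions, down to max_depth levels."""
    node = EvidenceNode(question, evidence)
    if max_depth > 0:
        for sub_question, sub_evidence in expand(question):
            node.children.append(build_tree(sub_question, sub_evidence, expand, max_depth - 1))
    return node


def leaves(node: EvidenceNode) -> List[EvidenceNode]:
    """Collect leaf nodes left to right; these act as atomic evaluation targets."""
    if not node.children:
        return [node]
    collected: List[EvidenceNode] = []
    for child in node.children:
        collected.extend(leaves(child))
    return collected


def synthesize_pair(root: EvidenceNode) -> Tuple[str, List[str]]:
    """Derive the query and the rubric from the same leaves so the two stay aligned."""
    targets = leaves(root)
    query = f"Research '{root.question}' and cover: " + "; ".join(t.question for t in targets)
    rubric = [f"Addresses '{t.question}' consistently with: {t.evidence}" for t in targets]
    return query, rubric


# Toy stand-in for evidence-backed expansion (assumption: a fixed lookup table).
TOY_EXPANSIONS: Dict[str, List[Tuple[str, str]]] = {
    "topic": [("sub-question A", "evidence for A"), ("sub-question B", "evidence for B")],
    "sub-question A": [("sub-question A1", "evidence for A1"), ("sub-question A2", "evidence for A2")],
}


def toy_expand(question: str) -> List[Tuple[str, str]]:
    return TOY_EXPANSIONS.get(question, [])


if __name__ == "__main__":
    root = build_tree("topic", "seed evidence", toy_expand, max_depth=3)
    query, rubric = synthesize_pair(root)
    print(query)
    for item in rubric:
        print("-", item)
```

In this sketch each rubric item corresponds to exactly one leaf that also appears in the query, which is the alignment property the summary emphasizes.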
Key points
DeepRubric reverses query→rubric generation: it first builds an evidence tree, then synthesizes query–rubric pairs from atomic evaluation targets.
DeepRubric 反转了从查询到评分标准的生成流程:先构建证据树,再从原子化评估目标合成查询-评分标准对。
The evidence tree is built by recursively expanding evidence-backed sub-questions; its leaves serve as atomic, verifiable evaluation targets.
证据树通过对证据支撑的子问题递归扩展得到,叶子节点充当可验证的评估目标。
Using 9K constructed query–rubric pairs, DeepRubric-8B, trained with rubric-based GRPO, performs comparably to the prior open SOTA while using roughly 13× fewer RL GPU-hours.
基于 9K 条合成的查询-评分标准对,用 GRPO 训练的 DeepRubric-8B 达到之前开源 SOTA 性能,而 RL GPU 时间仅约为其 1/13。
The work demonstrates that higher-quality rubric supervision significantly improves RL training efficiency for deep research agents.
该工作证明,更高质量的评分标准监督能显著提升深度研究智能体的强化学习训练效率。
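To make the rubric-based GRPO step concrete, the following is a minimal sketch under stated assumptions. The summary does not say how rubric items are scored or aggregated, so rubric_reward (the fraction of rubric items a judge accepts) and keyword_judge (a substring check standing in for the real grader) are hypothetical. The group-relative normalization follows the standard GRPO formulation, in which each sampled response's reward is standardized against the other responses to the same query; using the population standard deviation is a choice made here.

```python
from statistics import mean, pstdev
from typing import Callable, List

# Hypothetical judge type: returns True when the response satisfies one rubric item.
Judge = Callable[[str, str], bool]


def rubric_reward(response: str, rubric: List[str], judge: Judge) -> float:
    """Assumed aggregation: fraction of rubric items the judge marks as satisfied."""
    if not rubric:
        return 0.0
    return sum(1 for item in rubric if judge(response, item)) / len(rubric)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses sampled for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (a choice in this sketch)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_judge(response: str, item: str) -> bool:
    """Toy stand-in for a real grader: the rubric item is a keyword to look for."""
    return item in response


if __name__ == "__main__":
    rubric = ["alpha", "beta"]
    group = ["alpha beta", "alpha", ""]  # three sampled responses to one query
    rewards = [rubric_reward(r, rubric, keyword_judge) for r in group]
    advantages = group_relative_advantages(rewards)
    for response, reward, advantage in zip(group, rewards, advantages):
        print(repr(response), reward, round(advantage, 3))
```

Responses that satisfy more rubric items than the group average receive positive advantages and the rest receive negative ones, so the quality of the rubric directly shapes the policy update.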