Source: arXiv | Date: June 16, 2026 | Importance: 3/5

Context-Aware RL for Agentic and Multimodal LLMs

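Chinese title: 面向智能体与多模态大语言模型的上下文感知强化学习方法 (A Context-Aware Reinforcement Learning Method for Agentic and Multimodal Large Language Models)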

Abstract

The authors propose ContextRL, a context-aware reinforcement learning method that improves long-horizon reasoning and multimodal performance in LLMs. It trains with an indirect objective: the model is rewarded for selecting which of two highly similar contexts supports a given query–answer pair, which promotes fine-grained evidence grounding. Contrastive context data is constructed from coding-agent trajectories (1K pairs) and multimodal images (7K pairs) via condition filtering and generative editing. ContextRL yields average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% on 12 visual question answering benchmarks. Data-augmentation baselines that repurpose the same contrastive data as standard examples show little improvement, confirming that the gains arise from the context-selection objective rather than from added data alone.
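
As a rough illustration of the context-selection objective described in the abstract, the sketch below scores a model's choice between two candidate contexts. Everything in it is an assumption made for illustration and is not taken from the paper: the `ContrastivePair` type, the A/B prompt layout, the random ordering of the two contexts, and the 0/1 reward.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """Hypothetical container: one query-answer pair plus two similar contexts."""
    query: str
    answer: str
    supporting: str  # context that actually supports the answer
    distractor: str  # highly similar context that does not


def build_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, gold_label); the order is shuffled so position carries no signal."""
    contexts = [("supporting", pair.supporting), ("distractor", pair.distractor)]
    rng.shuffle(contexts)
    gold = "A" if contexts[0][0] == "supporting" else "B"
    prompt = (
        f"Question: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A: {contexts[0][1]}\n"
        f"Context B: {contexts[1][1]}\n"
        "Which context supports the answer? Reply with A or B."
    )
    return prompt, gold


def selection_reward(response: str, gold: str) -> float:
    """1.0 if the last standalone A/B in the response matches gold, else 0.0."""
    choices = re.findall(r"\b([AB])\b", response)
    return 1.0 if choices and choices[-1] == gold else 0.0
```

In this sketch the reward depends only on which context the model picks, not on a generated answer, which is what makes the objective indirect.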

Key points

ContextRL uses an auxiliary objective that rewards the model for correctly selecting, from two highly similar contexts, the one that supports a given query-answer pair, rather than supervising only the final answer.
Contrastive context data is built from 1K coding-agent trajectory pairs (via condition filtering) and 7K multimodal image pairs (via generative editing and similarity search).
On 5 long-horizon benchmarks, ContextRL achieves an average gain of +2.2% over standard GRPO; on 12 VQA benchmarks, the average gain is +1.8%.
Data-augmentation baselines that reuse the same contexts as standard training examples yield negligible gains, which isolates the benefit to the context-selection objective itself.
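
The results above are reported relative to GRPO, which scores each sampled response against the other responses drawn for the same prompt instead of using a learned value function. A minimal sketch of that group-relative normalization, applied here to binary context-selection rewards, might look as follows. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions; how ContextRL combines this signal with the ordinary task reward is not specified in this summary.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for several responses to the SAME prompt.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses: three picked the
# supporting context (reward 1.0) and one picked the distractor (0.0).
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct selections receive a positive advantage and the incorrect one a negative advantage. A group in which every response is right (or every response is wrong) yields all-zero advantages, so such pairs contribute no gradient signal.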
