The authors propose ContextRL, a context-aware reinforcement learning method that improves long-horizon reasoning and multimodal performance in LLMs. It uses an indirect objective: the model is rewarded for selecting which of two highly similar contexts supports a given query–answer pair, promoting fine-grained evidence grounding. Contrastive context data is constructed from coding agent trajectories (1K pairs) and multimodal images (7K pairs) via condition filtering and generative editing. ContextRL yields average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% on 12 visual question answering benchmarks. Data-augmentation baselines that repurpose the same contrastive data as standard examples show little improvement, confirming that the gains arise from the context-selection objective rather than from added data alone.
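As a rough illustration of the context-selection objective described above, the sketch below scores a policy's choice between two near-identical contexts and converts a group of such rewards into GRPO-style normalized advantages. The `ContrastivePair` container, the "A"/"B" answer format, and the binary reward are assumptions made for illustration; the summary does not specify the authors' exact reward shaping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container for one contrastive example (not the authors' schema)."""
    query: str
    answer: str
    context_a: str
    context_b: str
    supporting: str  # "A" or "B": which context supports the query-answer pair


def selection_reward(pair: ContrastivePair, model_choice: str) -> float:
    """Assumed binary reward: 1.0 if the policy picked the supporting context, else 0.0."""
    return 1.0 if model_choice.strip().upper() == pair.supporting else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one sampled group (mean-centered, divided by the std)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# Placeholder example: four sampled choices for one contrastive pair.
pair = ContrastivePair(
    query="placeholder query",
    answer="placeholder answer",
    context_a="context with the key detail edited out",
    context_b="context that supports the answer",
    supporting="B",
)
samples = ["A", "b", "A", "B"]
rewards = [selection_reward(pair, s) for s in samples]
advantages = group_advantages(rewards)
```

Because the two contexts differ only in fine details, a policy can earn the reward only by checking which context actually grounds the answer, which is the behavior the objective is meant to encourage.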
This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Twelve pipeline configurations, including nine retrieval-augmented generation (RAG) variants and one protocol-driven agent, were benchmarked on the full retrieval-screening-synthesis workflow. Although retrieval recall reached 90.9% at K=200, no system achieved more than 52.7% recall of the ground-truth included studies, exposing a critical screening bottleneck: current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics instead of a single end-to-end score.
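To make the idea of stage-attributed metrics concrete, the sketch below separates recall at the retrieval stage from recall at the screening stage for a single meta-analysis. The function names and the exact metric definitions (in particular, measuring screening recall only over gold studies that survived retrieval) are assumptions for illustration, not the paper's formal definitions.

```python
def recall(predicted, gold):
    """Fraction of gold items that appear in predicted (0.0 if gold is empty)."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)


def stage_attributed_recall(ranked_ids, screened_in, gold_ids, k):
    """Report recall per stage so a low end-to-end score can be attributed.

    ranked_ids:  retrieval output, best first
    screened_in: document ids the screening step decided to include
    gold_ids:    ground-truth included studies
    """
    top_k = ranked_ids[:k]
    top_k_set = set(top_k)
    # Screening can only keep documents that retrieval surfaced.
    kept = [doc for doc in screened_in if doc in top_k_set]
    # Gold studies that were still reachable after retrieval.
    reachable = set(gold_ids) & top_k_set
    return {
        "retrieval_recall": recall(top_k, gold_ids),
        "screening_recall": recall(kept, reachable),
        "end_to_end_recall": recall(kept, gold_ids),
    }


# Toy example with placeholder ids.
report = stage_attributed_recall(
    ranked_ids=["a", "b", "c", "d", "x"],
    screened_in=["a", "d"],
    gold_ids=["a", "b", "c", "z"],
    k=4,
)
```

In this toy case retrieval surfaces three of the four gold studies, but screening keeps only one of those three, so most of the loss is attributable to screening rather than retrieval.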
KVEraser is a learned method for post-hoc context erasing in long-context LLMs that avoids full recomputation. It replaces only the KV states of the to-be-erased span with learned steering values while keeping the rest of the cache intact. A two-stage training pipeline first pre-trains on generic span-neighbor suppression, then fine-tunes for downstream tasks. On in-domain tasks with 1K–32K context, KVEraser nearly matches the accuracy of full recomputation while increasing latency by only 24%, versus a 17.6× increase for full recomputation. The method also generalizes to unseen long-document QA with harmful distractors, achieving the best performance among approximate methods and a 3–4× speedup over full recomputation.
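The cache edit itself can be pictured with a toy sketch: only the entries for the erased span are swapped out, and every other position is reused as-is. Here the cache is a plain Python list of per-position `(key, value)` tuples and the steering values are hand-written placeholders; in the actual method they are learned, and the one-steering-entry-per-erased-position layout is an assumption of this sketch.

```python
def erase_span(cache, start, end, steering):
    """Return a new cache where positions [start, end) hold steering entries.

    cache:    list of (key, value) tuples, one per token position
    steering: replacement (key, value) tuples, one per erased position (assumed layout)
    """
    if not 0 <= start <= end <= len(cache):
        raise ValueError("span is out of range")
    if len(steering) != end - start:
        raise ValueError("need exactly one steering entry per erased position")
    # Positions outside the span are reused unchanged, so nothing is recomputed.
    return cache[:start] + list(steering) + cache[end:]


# Toy cache with 5 positions and 1-dimensional keys and values.
cache = [((float(i),), (float(i) * 10,)) for i in range(5)]
steering = [((-1.0,), (-1.0,)), ((-2.0,), (-2.0,))]  # placeholders for learned values
patched = erase_span(cache, 1, 3, steering)
```

The sequence length and all positions outside the span are preserved, which is why this kind of edit can be far cheaper than re-encoding the remaining context from scratch.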
DeepRubric is a data construction framework that reverses the typical process of generating rubrics for a given query. It first builds an evidence tree by recursively expanding evidence-backed sub-questions from a seed topic, then uses the tree’s leaves as atomic, verifiable evaluation targets to synthesize aligned query–rubric pairs. This ensures the reward evaluates exactly the information the query requests. Using 9K such query–rubric pairs, the authors train DeepRubric-8B with rubric-based GRPO, achieving performance comparable to prior open state-of-the-art deep research models across three benchmarks while requiring roughly 13× fewer RL GPU-hours.
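The tree-then-rubric direction can be sketched with a small data structure. In the sketch below the tree is written out by hand with placeholder strings, whereas the framework expands sub-questions automatically; the `EvidenceNode` class and the rubric item format are hypothetical and only illustrate how leaves become atomic evaluation targets.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceNode:
    """Hypothetical node: a sub-question plus the evidence that backs it."""
    question: str
    evidence: str
    children: list["EvidenceNode"] = field(default_factory=list)


def collect_leaves(node: EvidenceNode) -> list[EvidenceNode]:
    """Depth-first, left-to-right list of nodes that have no children."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves


def rubric_items(root: EvidenceNode) -> list[dict]:
    """One atomic rubric item per leaf (assumed format)."""
    return [
        {"criterion": leaf.question, "evidence": leaf.evidence}
        for leaf in collect_leaves(root)
    ]


# Hand-built placeholder tree; a real pipeline would expand it recursively.
root = EvidenceNode("seed topic", "", [
    EvidenceNode("sub-question 1", "evidence 1", [
        EvidenceNode("sub-question 1.1", "evidence 1.1"),
        EvidenceNode("sub-question 1.2", "evidence 1.2"),
    ]),
    EvidenceNode("sub-question 2", "evidence 2"),
])
items = rubric_items(root)
```

A query synthesized from the same leaves would then ask for exactly the information these rubric items check.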
ExpRL is an RL-based mid-training method that uses human-written question-answer pairs as reward scaffolds. The reference solutions are hidden from the policy; instead, an LLM judge compares sampled reasoning traces against them to assign dense outcome or process rewards. This reinforces partial progress and useful reasoning behaviors that sparse final-answer rewards often miss. On challenging math tasks, ExpRL yields stronger RL priming than supervised fine-tuning, sparse-reward GRPO, and self-distillation, providing a better initialization for subsequent sparse-reward RL. The method also shows promise in mixed-domain experiments beyond math.
The paper introduces TokenPilot, a dual-granularity context management framework for long-horizon LLM agents that preserves prompt cache continuity while reducing token footprints. It combines a global Ingestion-Aware Compaction component, which stabilizes prompt prefixes and filters environmental noise, with a local Lifecycle-Aware Eviction component, which monitors segment utility and evicts a segment only when its task relevance expires. On PinchBench and Claw-Eval, TokenPilot reduces costs by 61% and 56%, respectively, in isolated mode and by 61% and 87% in continuous mode versus prior systems, while maintaining competitive performance. The method has been integrated into the open-source LightMem2 library.