The authors propose ContextRL, a context-aware reinforcement learning method that improves long-horizon reasoning and multimodal performance in LLMs. It uses an indirect objective: the model is rewarded for selecting which of two highly similar contexts supports a given query–answer pair, promoting fine-grained evidence grounding. Contrastive context data is constructed from coding agent trajectories (1K pairs) and multimodal images (7K pairs) via condition filtering and generative editing. ContextRL yields average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% on 12 visual question answering benchmarks. Data-augmentation baselines that repurpose the same contrastive data as standard examples show little improvement, confirming that the gains arise from the context-selection objective rather than from added data alone.
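The context-selection objective in the ContextRL summary can be illustrated with a small sketch. It assumes a binary reward for picking the supporting context and the usual group-normalized advantage used by GRPO. The function names, the "A"/"B" labels, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def selection_reward(predicted: str, supporting: str) -> float:
    """Return 1.0 if the chosen context label matches the supporting one (assumed binary reward)."""
    return 1.0 if predicted.strip().upper() == supporting.strip().upper() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one query whose supporting context is "A".
samples = ["A", "B", "a ", "B"]
rewards = [selection_reward(s, "A") for s in samples]
advantages = group_advantages(rewards)
```

Samples that select the supporting context receive a positive advantage and the others a negative one. When every sample in a group agrees, all advantages are zero and the group contributes no learning signal.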
The paper introduces the Geometric Action Model (GAM), a language-conditioned manipulation policy that leverages a pretrained geometric foundation model (GFM) to explicitly incorporate 3D geometry for contact-rich tasks. GAM splits the GFM at an intermediate layer: the shallow layers encode observations, and an inserted causal future predictor forecasts future latent tokens from language, proprioception, and action history. The predicted tokens are then processed by the remaining GFM blocks, so a single backbone jointly predicts future scene geometry and robot actions with minimal architectural changes. Across simulation and real-robot benchmarks, GAM is reported to be more accurate, more robust, faster, and more compact than existing foundation-model-scale baselines.
Online RL fine-tuning of pretrained vision-language-action (VLA) policies suffers from sparse binary episode outcomes that conflate viability and efficiency and provide poor per-transition supervision. In addition, naively assigning outcomes across human interventions misattributes credit. The paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for viability and efficiency on distinct data subsets and merges their one-step advantages through a state-adaptive gate. The gate prioritizes viability when success is uncertain and shifts to efficiency only when viability is high. Intervention-aware credit assignment restricts outcome labels to autonomous segments, preventing supervision leakage. On three contact-rich bimanual real-robot tasks, HABC raises success rates from supervised fine-tuning baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
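The gating idea in the HABC summary can be sketched as follows. This is a minimal sketch under stated assumptions: the sigmoid gate, its `threshold` and `sharpness` values, and the clipped exponential weighting (borrowed from common advantage-weighted regression practice) are illustrative choices, not the paper's actual formulation.

```python
import math


def gate(viability: float, threshold: float = 0.8, sharpness: float = 20.0) -> float:
    """Assumed sigmoid gate: near 0 when viability is low, near 1 when it is high."""
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def merged_advantage(adv_viability: float, adv_efficiency: float, viability: float) -> float:
    """Blend the two one-step advantages; the viability term dominates until success looks likely."""
    g = gate(viability)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def bc_weight(advantage: float, beta: float = 1.0, max_weight: float = 20.0) -> float:
    """Clipped exponential weight for a behavior-cloning loss term (assumed form)."""
    return min(math.exp(advantage / beta), max_weight)


# Uncertain state: the merged signal follows the viability critic.
uncertain = merged_advantage(adv_viability=1.0, adv_efficiency=-1.0, viability=0.3)
# Confident state: the merged signal follows the efficiency critic.
confident = merged_advantage(adv_viability=1.0, adv_efficiency=-1.0, viability=0.99)
```

With these assumed settings, a transition that helps viability but hurts efficiency is upweighted in an uncertain state and downweighted once success is already likely.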
This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Ten pipeline configurations (nine retrieval-augmented generation (RAG) variants and one protocol-driven agent) were benchmarked on the full retrieval-screening-synthesis workflow. Although retrieval recall reached 90.9% at K=200, no system recovered more than 52.7% of the ground-truth included studies, exposing a critical screening bottleneck: current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics instead of a single end-to-end score.
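One way to read the stage-attributed metrics in the MetaSyn summary is to split end-to-end recall into a retrieval factor and a screening factor. The sketch below assumes this decomposition; the function name, the returned keys, and the toy study IDs are hypothetical, and the paper's exact metric definitions may differ.

```python
def stage_recalls(relevant: set, retrieved: set, included: set) -> dict:
    """Attribute missed studies to the retrieval stage or the screening stage."""
    reachable = relevant & retrieved   # relevant studies that survived retrieval
    kept = reachable & included        # relevant studies that also survived screening
    return {
        "retrieval_recall": len(reachable) / len(relevant) if relevant else 0.0,
        "screening_recall": len(kept) / len(reachable) if reachable else 0.0,
        "end_to_end_recall": len(kept) / len(relevant) if relevant else 0.0,
    }


# Toy example with hypothetical study IDs.
scores = stage_recalls(
    relevant={1, 2, 3, 4},
    retrieved={1, 2, 3, 9},
    included={1, 9},
)
```

Under this decomposition, end-to-end recall equals retrieval recall times screening recall, so a low final score can be traced to the stage that lost the studies. In the toy example, retrieval keeps three of four relevant studies, but screening keeps only one of those three.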
DeepRubric is a data construction framework that reverses the typical process of generating rubrics for a given query. Instead, it first builds an evidence tree by recursively expanding evidence-backed sub-questions from a seed topic, then uses the tree’s leaves as atomic, verifiable evaluation targets to synthesize aligned query–rubric pairs. This ensures the reward evaluates exactly the information the query requests. Using 9K such query–rubric pairs, the authors train DeepRubric-8B with rubric-based GRPO, achieving performance comparable to the prior open state-of-the-art deep research models across three benchmarks while requiring roughly 13× fewer RL GPU-hours.
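The evidence-tree idea in the DeepRubric summary can be illustrated with a toy data structure whose leaves act as rubric items. The `EvidenceNode` class, the substring check standing in for a real verifier, and the example facts are all assumptions made for illustration; the summary does not specify how leaves are verified.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvidenceNode:
    question: str
    evidence: str = ""  # short verifiable fact; expected to be set on leaves
    children: list[EvidenceNode] = field(default_factory=list)


def leaves(node: EvidenceNode) -> list[EvidenceNode]:
    """Collect leaf nodes, which serve as atomic evaluation targets."""
    if not node.children:
        return [node]
    found: list[EvidenceNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def rubric_reward(response: str, root: EvidenceNode) -> float:
    """Fraction of leaf facts mentioned in the response (substring match is a stand-in checker)."""
    targets = leaves(root)
    text = response.lower()
    hits = sum(1 for t in targets if t.evidence and t.evidence.lower() in text)
    return hits / len(targets)


tree = EvidenceNode("Titanic overview", children=[
    EvidenceNode("When did it sink?", "1912"),
    EvidenceNode("Who built and ran it?", children=[
        EvidenceNode("Where was it built?", "Belfast"),
        EvidenceNode("Which line operated it?", "White Star Line"),
    ]),
])
reward = rubric_reward("It was built in Belfast and sank in 1912.", tree)
```

Because the rubric items are exactly the tree's leaves, a query synthesized from the same tree asks for the same facts that the reward checks.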
The paper introduces TokenPilot, a dual-granularity context management framework for long-horizon LLM agents that preserves prompt cache continuity while reducing token footprints. It contains a global Ingestion-Aware Compaction that stabilizes prompt prefixes and filters environmental noise, and a local Lifecycle-Aware Eviction that monitors segment utility and evicts only when task relevance expires. On PinchBench and Claw-Eval, TokenPilot reduces costs by 61%/56% in isolated mode and 61%/87% in continuous mode versus prior systems, while maintaining competitive performance. The method has been integrated into the open-source LightMem2 library.
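The local eviction rule in the TokenPilot summary can be pictured with a toy model that leaves a fixed prompt prefix untouched and drops only segments whose relevance has expired. Everything below is a hypothetical sketch: the `Segment` fields, the step-based expiry, and the whitespace token count are assumptions, not TokenPilot's or LightMem2's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    last_useful_step: int  # assumed: last agent step at which this segment is still relevant


def evict_expired(segments: list[Segment], current_step: int) -> list[Segment]:
    """Keep segments that are still relevant; order is preserved for the survivors."""
    return [s for s in segments if s.last_useful_step >= current_step]


def token_footprint(prefix: str, segments: list[Segment]) -> int:
    """Crude whitespace token count, used here only as a proxy for cost."""
    return len(prefix.split()) + sum(len(s.text.split()) for s in segments)


prefix = "You are a coding agent."  # stable prefix, never edited by eviction
history = [
    Segment("ls output: a.py b.py c.py", last_useful_step=2),
    Segment("open task: fix failing test in b.py", last_useful_step=10),
]
before = token_footprint(prefix, history)
after = token_footprint(prefix, evict_expired(history, current_step=5))
```

Keeping the prefix byte-identical across steps is what allows a prompt cache to keep matching it, while the expiry rule trims only the history that follows.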