The authors propose ContextRL, a context-aware reinforcement learning method that improves long-horizon reasoning and multimodal performance in LLMs. It uses an indirect objective: the model is rewarded for selecting which of two highly similar contexts supports a given query–answer pair, promoting fine-grained evidence grounding. Contrastive context data is constructed from coding agent trajectories (1K pairs) and multimodal images (7K pairs) via condition filtering and generative editing. ContextRL yields average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% on 12 visual question answering benchmarks. Data-augmentation baselines that repurpose the same contrastive data as standard examples show little improvement, confirming that the gains arise from the context-selection objective rather than from added data alone.
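The context-selection objective described for ContextRL can be illustrated with a small sketch. It assumes a binary reward (1 when the model picks the context that supports the query-answer pair, 0 otherwise) and the group-relative advantage normalization commonly used with GRPO. The function names, the binary reward shape, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def context_selection_reward(chosen: int, supporting: int) -> float:
    """Hypothetical binary reward: 1.0 if the chosen context index is the
    one that actually supports the query-answer pair, else 0.0."""
    return 1.0 if chosen == supporting else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization (assumed form): subtract the group mean and
    divide by the group standard deviation, with eps for numerical safety."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled rollouts for one contrastive pair; context 0 is the supporting one.
choices = [0, 1, 0, 0]
rewards = [context_selection_reward(c, supporting=0) for c in choices]
advantages = group_relative_advantages(rewards)
```

In this toy group, the single rollout that picked the wrong context receives a negative advantage and the three correct rollouts receive positive ones. If every rollout agrees, all advantages are zero.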
Online RL fine-tuning of pretrained vision-language-action (VLA) policies suffers from two problems. Sparse binary episode outcomes conflate viability and efficiency and provide poor per-transition supervision, and naively assigning outcomes across human interventions misattributes credit. The paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for viability and efficiency on distinct data subsets and merges their one-step advantages through a state-adaptive gate: the gate prioritizes viability when success is uncertain and shifts to efficiency only when viability is high. Intervention-aware credit assignment restricts outcome labels to autonomous segments, preventing supervision leakage. On three contact-rich bimanual real-robot tasks, HABC raises success rates from supervised fine-tuning baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
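One way to picture HABC's state-adaptive gate is the sketch below. It assumes a sigmoid gate on the predicted viability (success probability), a convex combination of the two one-step advantages, and an exponential, clipped behavior-cloning weight. The gate form, the `threshold`, `sharpness`, `beta`, and `max_weight` parameters, and all function names are hypothetical and are not specified in the summary.

```python
import math


def gate(viability: float, threshold: float = 0.8, sharpness: float = 20.0) -> float:
    """Hypothetical gate in [0, 1]: near 0 when viability is low or uncertain,
    near 1 once predicted viability clearly exceeds the threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def merged_advantage(a_viability: float, a_efficiency: float, viability: float) -> float:
    """Blend the two critic heads' one-step advantages: the viability head
    dominates unless viability is high, then the efficiency head takes over."""
    g = gate(viability)
    return (1.0 - g) * a_viability + g * a_efficiency


def bc_weight(advantage: float, beta: float = 1.0, max_weight: float = 20.0) -> float:
    """Advantage-weighted behavior cloning weight (assumed exponential form),
    clipped so that a few transitions cannot dominate the loss."""
    return min(math.exp(advantage / beta), max_weight)


# Same pair of advantages, two different viability estimates.
uncertain = merged_advantage(a_viability=1.0, a_efficiency=-1.0, viability=0.2)
confident = merged_advantage(a_viability=1.0, a_efficiency=-1.0, viability=0.99)
```

With low predicted viability the merged advantage follows the viability head, and with high predicted viability it follows the efficiency head, which matches the prioritization the summary describes.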
The paper introduces FusionRS, the first large-scale RGB-infrared-text dataset for remote sensing vision-language learning, built by translating public RGB images into infrared-style counterparts. It provides aligned RGB-IR image pairs with both conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties. The authors train CLIP-style models for RGB-IR-text alignment and fine-tune generative vision-language models for dual-modal captioning. Experiments show FusionRS significantly improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies confirm that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision.
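CLIP-style alignment across RGB, infrared, and text is typically trained with a symmetric contrastive (InfoNCE) loss. The sketch below shows one plausible arrangement, summing the pairwise losses over the three modality pairs. The pairwise-sum design, the temperature value, and the function names are assumptions for illustration, since the summary does not state the exact objective. Embeddings are assumed to be nonzero vectors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a nonzero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(a: Sequence[Vector], b: Sequence[Vector], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE between two batches whose i-th items are paired."""
    a_n = [_normalize(v) for v in a]
    b_n = [_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b_n] for u in a_n]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


def tri_modal_loss(rgb, ir, text, temperature: float = 0.07) -> float:
    """Assumed objective: sum of pairwise losses over RGB-IR, RGB-text, IR-text."""
    return (info_nce(rgb, ir, temperature)
            + info_nce(rgb, text, temperature)
            + info_nce(ir, text, temperature))


aligned = tri_modal_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = tri_modal_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this toy batch the loss is near zero when all three modalities agree and grows sharply when the text embeddings are swapped, which is the signal that pulls paired RGB, IR, and caption embeddings together.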
The paper proposes ROVE, a reinforcement learning framework for post-training vision-language-action (VLA) models on humanoid robots using imperfect human interventions. ROVE employs a human-in-the-loop pipeline to collect real deployment and intervention trajectories, which are often suboptimal. To avoid imitating hesitant or erroneous behaviors, it introduces Optimistic Value Estimation (OVE) to prioritize high-value actions from mixed-quality data. Cross-embodiment human experience videos provide additional supervision for long-tailed failure and recovery modes, improving the critic's advantage signals. In real-world contact-rich and fine-grained tasks, ROVE consistently outperforms experience-learning baselines and improves through multiple rollout-intervention iterations.
The paper introduces NEXIS, a method for identifying heterogeneous treatment effects (HTEs) in controlled experiments by reframing the problem as Markov-blanket discovery on sufficient, aligned multi-modal pre-treatment representations. NEXIS iteratively selects latent interactors with provably consistent selection, avoiding the spurious causal characterizations that arise from unmeasured effect modifiers. The approach is deployed on two anti-poverty programs in Africa, augmenting each with satellite imagery to capture previously unmeasured environmental modifiers. The results yield novel, interpretable prescriptive guidelines for optimizing the programs' next iterations.
Researchers introduce TuneJury, an open instance-level pairwise reward model for text-to-music that predicts preference scores from a text prompt and an audio clip. The model is trained on publicly available human-preference labels including arena votes, metric-alignment pairs, crowdsourced comparisons, and expert aesthetic ratings. Its score margin is well-calibrated on a held-out test split, enabling data filtering via a simple threshold, and it generalizes to out-of-distribution benchmarks. For generators released after training, the paper proposes anchor calibration, a post-hoc Bradley-Terry calibration that recovers agreement efficiently without retraining. The frozen reward drives consistent gains in three downstream tasks: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available open-source on GitHub.
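Two of the uses described for TuneJury, threshold-based data filtering and best-of-N selection, can be sketched with a generic scoring function standing in for the reward model. The Bradley-Terry link below (a logistic function of the score difference) is the standard form of that model. The helper names, the `min_margin` parameter, and the toy scorer are hypothetical, and the paper's anchor calibration procedure is not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def filter_pairs(pairs: Sequence[Tuple[T, T]], score: Callable[[T], float],
                 min_margin: float) -> List[Tuple[T, T]]:
    """Keep only pairs whose absolute score margin reaches the threshold,
    i.e. pairs on which the reward model is reasonably decisive."""
    return [(a, b) for a, b in pairs if abs(score(a) - score(b)) >= min_margin]


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Inference-time best-of-N: return the highest-scoring candidate."""
    return max(candidates, key=score)


# Toy stand-in for a reward model: scores a "clip" by its string length.
toy_score = lambda clip: float(len(clip))

kept = filter_pairs([("aaaa", "a"), ("bb", "cc")], toy_score, min_margin=2.0)
winner = best_of_n(["a", "bbb", "cc"], toy_score)
```

In practice `score` would wrap the reward model evaluated on a text prompt and an audio clip. The toy scorer only demonstrates the control flow.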