This paper derives the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, reducing posterior sampling to a denoising problem at an operator-dependent shifted pivot with anisotropic noise covariance. The method, Exact Posterior Score (EPS), defines a denoising training objective that mirrors standard pretraining, so the model can be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the base model, with no need for likelihood gradients or projections. Evaluated on five linear inverse tasks across FFHQ and ImageNet, EPS surpasses both training-free and training-based baselines on fidelity, perceptual, and distributional metrics while requiring roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.
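The "shifted pivot with anisotropic noise" can be illustrated with a minimal sketch under assumed notation that is not taken from the paper: an interpolant x_t = a*x_0 + b*eps, a measurement y = A*x_0 + sigma*n, and A restricted to a diagonal masking operator (as in inpainting). Completing the square in x_0 gives a precision Lambda = (a^2/b^2) I + A^T A / sigma^2 and a pivot m = Lambda^{-1} (a*x_t/b^2 + A^T y / sigma^2), so that p(x_0 | x_t, y) is proportional to p(x_0) N(m; x_0, Lambda^{-1}). The function name and arguments below are hypothetical, and the paper's own parameterization may differ.

```python
def pivot_and_variance(x_t, y, mask, a, b, sigma):
    """Per-coordinate pivot and noise variance for a diagonal operator A = diag(mask).

    Assumed model (illustrative, not the paper's notation):
        x_t = a * x_0 + b * eps,   y = A x_0 + sigma * n
    """
    pivots, variances = [], []
    for xt_i, y_i, m_i in zip(x_t, y, mask):
        # Precision combines the interpolant term and the measurement term.
        precision = (a * a) / (b * b) + (m_i * m_i) / (sigma * sigma)
        # Information-weighted combination of the noisy state and the measurement.
        numerator = a * xt_i / (b * b) + m_i * y_i / (sigma * sigma)
        pivots.append(numerator / precision)
        variances.append(1.0 / precision)
    return pivots, variances


# Two coordinates: the first is observed (mask 1), the second is not (mask 0).
pivots, variances = pivot_and_variance(
    x_t=[2.0, 2.0], y=[1.0, 0.0], mask=[1.0, 0.0], a=1.0, b=1.0, sigma=0.5
)
print(pivots, variances)
```

In this toy case the observed coordinate is pulled toward the measurement and ends up with a smaller noise variance than the unobserved one, which is what makes the effective denoising noise anisotropic.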
Online RL fine-tuning of pretrained vision-language-action (VLA) policies suffers from two problems. Sparse binary episode outcomes conflate viability and efficiency and provide poor per-transition supervision, and naively assigning outcomes across human interventions credits the wrong transitions. The paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for viability and efficiency on distinct data subsets and merges their one-step advantages via a state-adaptive gate that prioritizes viability when success is uncertain and shifts to efficiency only when viability is high. Intervention-aware credit assignment restricts outcome labels to autonomous segments, preventing supervision leakage. On three contact-rich bimanual real-robot tasks, HABC raises success rates from supervised fine-tuning baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
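The gating and labeling ideas above can be sketched as follows. This is an illustrative reading only: the sigmoid gate, the `threshold`, `sharpness`, `temperature`, and `max_weight` parameters, and all function names are assumptions, not the paper's actual formulation.

```python
import math


def gated_advantage(adv_viability, adv_efficiency, p_success,
                    threshold=0.8, sharpness=10.0):
    """Blend two one-step advantages with a state-dependent gate.

    The gate is near 0 (viability dominates) when the estimated success
    probability is low, and moves toward 1 (efficiency dominates) as it rises.
    The sigmoid form is a hypothetical choice for illustration.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (p_success - threshold)))
    return (1.0 - gate) * adv_viability + gate * adv_efficiency


def bc_weight(advantage, temperature=1.0, max_weight=20.0):
    """Exponentiated, clipped advantage used as a behavior-cloning weight."""
    return min(math.exp(advantage / temperature), max_weight)


def outcome_labels(is_human_step, episode_outcome):
    """Attach the episode outcome only to autonomous transitions.

    Human-intervention steps get None, so the outcome label does not
    leak onto transitions the policy did not produce.
    """
    return [None if human else episode_outcome for human in is_human_step]


# Low estimated success: the blended advantage stays close to the viability term.
print(gated_advantage(adv_viability=1.0, adv_efficiency=-1.0, p_success=0.1))
print(outcome_labels([False, True, False], episode_outcome=1))
```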
DeepRubric is a data construction framework that reverses the typical process of generating rubrics for a given query. It first builds an evidence tree by recursively expanding evidence-backed sub-questions from a seed topic, then uses the tree’s leaves as atomic, verifiable evaluation targets from which it synthesizes aligned query–rubric pairs. This ensures that the reward evaluates exactly the information the query requests. Using 9K such query–rubric pairs, the authors train DeepRubric-8B with rubric-based GRPO, achieving performance comparable to prior open state-of-the-art deep research models across three benchmarks while requiring roughly 13× fewer RL GPU-hours.
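A minimal sketch of the tree-then-rubric idea, assuming a hypothetical `expand` callable that returns (sub-question, evidence) pairs for a question. In a real pipeline that step would involve retrieval and a language model; here it is a toy lookup table with placeholder strings, and the rubric wording is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceNode:
    question: str
    evidence: str = ""
    children: list = field(default_factory=list)


def build_tree(question, evidence, expand, max_depth):
    """Recursively expand a question into evidence-backed sub-questions."""
    node = EvidenceNode(question, evidence)
    if max_depth > 0:
        for sub_question, sub_evidence in expand(question):
            node.children.append(
                build_tree(sub_question, sub_evidence, expand, max_depth - 1)
            )
    return node


def leaves(node):
    """Collect leaf nodes left to right; these act as atomic evaluation targets."""
    if not node.children:
        return [node]
    collected = []
    for child in node.children:
        collected.extend(leaves(child))
    return collected


def to_rubric(root):
    """Turn each leaf's evidence into one rubric item (wording is hypothetical)."""
    return [f"Response covers: {leaf.evidence}" for leaf in leaves(root)]


# Toy stand-in for the expansion step; all strings are placeholders.
TOY_KNOWLEDGE = {
    "seed topic": [("sub-question A", "evidence A"), ("sub-question B", "evidence B")],
    "sub-question A": [("sub-question A1", "evidence A1"), ("sub-question A2", "evidence A2")],
}


def toy_expand(question):
    return TOY_KNOWLEDGE.get(question, [])


root = build_tree("seed topic", "", toy_expand, max_depth=3)
rubric = to_rubric(root)
print(rubric)
```

Because the rubric is derived from the same leaves that would drive query synthesis, every rubric item corresponds to something the query actually asks for.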
ExpRL proposes an RL-based mid-training method that uses human-written question-answer pairs as reward scaffolds. The reference solutions are hidden from the policy; instead, an LLM judge compares sampled reasoning traces against them to assign dense outcome or process rewards. This reinforces partial progress and useful reasoning behaviors that sparse final-answer rewards often miss. On challenging math tasks, ExpRL yields stronger RL priming than supervised fine-tuning, sparse-reward GRPO, and self-distillation, providing a better initialization for subsequent sparse-reward RL. The method also shows promise in mixed-domain experiments beyond math.
This paper benchmarks six deep learning architectures, two zero-shot foundation models (TimesFM), and statistical baselines for forecasting step counts, screen time, and sleep duration from wearable data. Using three public datasets with over 800 participants, the study evaluates performance across 1–8 day horizons. Among trained models, PatchTST leads, with no significant differences among TCN, MLP, and Transformer. The foundation model TimesFM performs on par with or better than trained models in the zero-shot setting, especially when data is scarce, while participant-level fine-tuning reduces RMSE by 16–60%, with sleep benefiting most. This is the first study to jointly compare deep learning, foundation models, and personalization for multi-horizon mobile health forecasting.
The paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. It first trains a reasoning-aware retriever via gold-relevance distillation, so that contexts are ranked by expected reasoning benefit rather than semantic overlap. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, enabling it to make use of their reasoning traces. Analysis shows that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct scaffolding per problem. On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, demonstrating that reasoning-aware retrieval yields gains orthogonal to those from reward design and training curricula.
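For readers unfamiliar with the metric, the sketch below assumes that average@k means accuracy averaged over k sampled responses per problem and then over problems, which is the common usage; the summary itself does not define it. The function name and the toy data are illustrative only.

```python
def average_at_k(correct_per_problem):
    """Mean per-problem accuracy, where each problem has k sampled responses.

    correct_per_problem: list of lists of 0/1 flags, one inner list per problem.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct_per_problem]
    return sum(per_problem) / len(per_problem)


# Toy data with k = 4 samples per problem (the summary's setting uses k = 32).
toy_results = [
    [1, 0, 1, 1],  # 3 of 4 samples correct
    [0, 0, 1, 0],  # 1 of 4 samples correct
]
print(average_at_k(toy_results))
```

Under this reading, a 7.1-point gain means the fraction of correct samples rises by 0.071 on average, which is a smoother signal than a single-sample pass rate.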