This paper introduces RACES, a framework that treats verifiable environments as composable building blocks and automatically fuses them into new training environments when their input-output types align. Starting from 300 base environments and the composition operators SEQUENTIAL, PARALLEL, SORT, and SELECT, the authors show that RL training on the resulting composite environments consistently improves reasoning generalization. Experiments show a 3.1-point average gain on six unseen benchmarks for DeepSeek-R1-Distill-Qwen-14B (48.2→51.3) and a 2.3-point gain for Qwen3-14B (58.8→61.1). Training with only 50 base environments reaches performance comparable to using all 300, demonstrating efficient environment scaling.
Papers | Source: ARXIV | Importance: 4/5
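The type-aligned fusion described in the RACES summary above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Env` class, the `sequential` and `verify` helpers, and the three toy environments are hypothetical names chosen for illustration, and only a SEQUENTIAL-style operator is shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task with a reference solver."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # reference solver used to check answers


def sequential(first: Env, second: Env) -> Optional[Env]:
    """Fuse two environments only if first's output type matches second's input type."""
    if first.out_type is not second.in_type:
        return None  # types do not align, so no composite is produced
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        in_type=first.in_type,
        out_type=second.out_type,
        solve=lambda x: second.solve(first.solve(x)),
    )


def verify(env: Env, task_input: Any, answer: Any) -> bool:
    """Verifiable reward check: the answer must equal the reference solver's output."""
    return env.solve(task_input) == answer


# Toy base environments (assumptions, not taken from the paper).
reverse_env = Env("reverse_list", list, list, lambda xs: list(reversed(xs)))
sum_env = Env("sum_list", list, int, lambda xs: sum(xs))
negate_env = Env("negate", int, int, lambda n: -n)

reverse_then_sum = sequential(reverse_env, sum_env)  # list -> list -> int: aligned
misaligned = sequential(sum_env, reverse_env)        # int output vs list input: rejected
sum_then_negate = sequential(sum_env, negate_env)    # list -> int -> int: aligned
```

In this sketch the composite inherits a verifier for free, because its reference solver is the composition of the base solvers. That is one plausible reading of why composed environments stay verifiable.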
This paper introduces a data-centric post-training pipeline that applies interpretability protocols to preference datasets, uncovering latent concepts that distinguish preferred from dispreferred model outputs and making them explicit for user feedback. The approach diagnoses undesirable signals such as over-stylization and sycophancy, and mitigates off-target learning by intervening on the learning signal at the concept level. It unifies several interpretability-based training protocols as ways of shaping rewards through feature or data interventions. Empirically, the method amplifies desired properties like safeguards and model personality, turning opaque scalar reward optimization into an auditable process of sculpting the training signal.
Papers | Source: ARXIV | Importance: 4/5
This paper reinterprets supervised fine-tuning as a target distribution design problem. The Q-target framework decomposes SFT supervision into two choices: how strongly to rely on the observed token and how to allocate remaining probability mass to alternatives. This unifies many existing SFT variants as implicit selections of the target distribution Q. The authors propose Target-SFT, which constructs the training objective directly from the desired target distribution. Across ten reasoning dataset-model combinations, Target-SFT consistently outperforms conventional SFT and other variants, demonstrating a more fundamental SFT design principle.
Papers | Source: ARXIV | Importance: 4/5
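The two choices in the Q-target summary above (how much mass to put on the observed token, and how to spread the rest over alternatives) can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's Target-SFT objective: the function names, the `alpha` parameter, and the `alt_weights` argument are hypothetical.

```python
import math
from typing import List


def q_target(observed: int, alt_weights: List[float], alpha: float) -> List[float]:
    """Build a target distribution Q over the vocabulary.

    alpha is the mass placed on the observed token; the remaining 1 - alpha is
    shared among the other tokens in proportion to alt_weights (assumed scheme).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        # One-hot target: this is the conventional SFT case.
        return [1.0 if i == observed else 0.0 for i in range(len(alt_weights))]
    alt_total = sum(w for i, w in enumerate(alt_weights) if i != observed)
    if alt_total <= 0.0:
        raise ValueError("alternatives need positive total weight when alpha < 1")
    return [
        alpha if i == observed else (1.0 - alpha) * w / alt_total
        for i, w in enumerate(alt_weights)
    ]


def cross_entropy(q: List[float], log_probs: List[float]) -> float:
    """Loss against an arbitrary target Q: -sum_i Q[i] * log p[i]."""
    return -sum(qi * lp for qi, lp in zip(q, log_probs) if qi > 0.0)


# Model log-probabilities over a toy 4-token vocabulary (made-up values).
log_probs = [math.log(p) for p in (0.1, 0.6, 0.2, 0.1)]

one_hot = q_target(observed=1, alt_weights=[1, 1, 1, 1], alpha=1.0)
smoothed = q_target(observed=1, alt_weights=[1, 1, 1, 1], alpha=0.7)

loss_one_hot = cross_entropy(one_hot, log_probs)
loss_smoothed = cross_entropy(smoothed, log_probs)
```

With `alpha=1.0` the loss reduces to the usual negative log-likelihood of the observed token. With uniform `alt_weights` and `alpha < 1` it gives a label-smoothing-style target. Other choices of `alt_weights` correspond to other ways of allocating the leftover mass.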
The paper investigates how the design of the context used by the self-teacher in self-distillation affects reasoning performance. It compares conditioning on a binary reward signal (GRPO), the ground-truth reference solution, and a step-by-step critique aligned with the solver's own reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on average across 12 benchmarks. Per-token advantage analysis reveals that step-aligned feedback modifies only the incorrect reasoning steps and leaves correct tokens intact, while reference solutions force unnecessary changes at every token. The results demonstrate that structural alignment between feedback and the model's reasoning is a critical driver of self-distillation effectiveness.
Papers | Source: ARXIV | Importance: 4/5
The paper introduces a reinforcement learning post-training method to comprehensively improve interactivity in full-duplex spoken dialogue models. It addresses four axes of interactive behavior: pause handling, turn-taking, backchanneling, and user interruption, using axis-specific reward functions trained on short audio segments extracted from human conversation corpora. An auxiliary LLM-based reward preserves semantic response quality during optimization. The approach is applied to two open-source models, Moshi and PersonaPlex, and demonstrates consistent gains in both offline evaluation with pre-recorded audio and real-time multi-turn dialogue tests.
Papers | Source: ARXIV | Importance: 3/5
This paper studies two underexplored aspects of synthetic data curation for post-training: whether filtering signals are grounded in the source provenance of each generation, and whether rejected samples can be systematically recovered instead of discarded. Using adversarially injected corpora to obtain ground-truth failure labels, the authors show that exact source provenance improves faithfulness gating for stronger judges. They find that hallucination-based and reward-based gates reject largely disjoint sample populations, making both necessary. An adaptive recovery pipeline that combines failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is primarily driven by generator scale, with filtration and recovery contributing meaningfully but secondarily.
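The claim that the two gates reject largely disjoint populations can be checked with a simple overlap measure. The sketch below is a hypothetical illustration, not the paper's pipeline: the field names `faithfulness` and `reward`, the thresholds, and the five toy samples are all made up, and Jaccard overlap is one possible metric among several.

```python
from typing import Callable, Dict, List, Set

Sample = Dict[str, float]


def rejected_ids(samples: List[Sample], gate: Callable[[Sample], bool]) -> Set[int]:
    """Indices of samples that the gate does not let through."""
    return {i for i, s in enumerate(samples) if not gate(s)}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two rejection sets: |a & b| / |a | b|, defined as 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical gates with made-up thresholds.
def faithfulness_gate(s: Sample) -> bool:
    return s["faithfulness"] >= 0.5


def reward_gate(s: Sample) -> bool:
    return s["reward"] >= 0.5


# Toy scores for illustration only; these are not results from the paper.
samples: List[Sample] = [
    {"faithfulness": 0.90, "reward": 0.8},  # passes both gates
    {"faithfulness": 0.20, "reward": 0.9},  # rejected by the faithfulness gate only
    {"faithfulness": 0.95, "reward": 0.1},  # rejected by the reward gate only
    {"faithfulness": 0.10, "reward": 0.2},  # rejected by both gates
    {"faithfulness": 0.80, "reward": 0.7},  # passes both gates
]

faith_rejects = rejected_ids(samples, faithfulness_gate)
reward_rejects = rejected_ids(samples, reward_gate)
overlap = jaccard(faith_rejects, reward_rejects)

# Samples that survive both gates; the rest would be candidates for recovery.
kept = [i for i in range(len(samples)) if i not in (faith_rejects | reward_rejects)]
```

A low overlap value means that dropping either gate would let through samples the other gate does not catch, which matches the summary's point that both gates are needed.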