Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
English summary
Online RL fine-tuning of pretrained vision-language-action (VLA) policies typically relies on sparse binary episode outcomes. These outcomes conflate viability and efficiency and give poor per-transition supervision, and naively assigning them across human interventions produces incorrect credit. The paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for viability and efficiency on distinct data subsets and merges their one-step advantages through a state-adaptive gate. The gate prioritizes viability when success is uncertain and shifts to efficiency only when viability is high. Intervention-aware credit assignment restricts outcome labels to autonomous segments, which prevents supervision from leaking across intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
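The gating idea described above can be illustrated with a minimal Python sketch. The summary does not give the paper's formulas, so everything here is an assumption made for illustration: the sigmoid form of the gate, the names `gate`, `merged_advantage`, and `actor_weight`, the parameters `threshold`, `sharpness`, `temperature`, and `max_weight`, and the clipped exponential weighting (a common choice in advantage-weighted methods, not confirmed as HABC's).

```python
import math


def gate(viability, threshold=0.8, sharpness=10.0):
    """Hypothetical sigmoid gate in [0, 1].

    Close to 0 when the viability estimate is well below `threshold`
    (success is uncertain), close to 1 when it is well above.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def merged_advantage(adv_viability, adv_efficiency, viability, **gate_kwargs):
    """Blend the two one-step advantages with the state-dependent gate.

    Low gate value  -> the viability advantage dominates.
    High gate value -> the efficiency advantage dominates.
    """
    g = gate(viability, **gate_kwargs)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def actor_weight(advantage, temperature=1.0, max_weight=20.0):
    """Assumed clipped exponential weight for one behavior-cloning sample."""
    # Clamp the exponent first so math.exp cannot overflow.
    exponent = min(advantage / temperature, math.log(max_weight))
    return math.exp(exponent)


if __name__ == "__main__":
    # Uncertain state: the merged signal follows the viability advantage.
    print(merged_advantage(adv_viability=0.5, adv_efficiency=-0.3, viability=0.2))
    # High-viability state: the merged signal leans toward efficiency.
    print(merged_advantage(adv_viability=0.5, adv_efficiency=-0.3, viability=0.99))
```

Each transition's behavior-cloning loss would then be scaled by `actor_weight(merged_advantage(...))`. A smooth gate is only one option; a hard threshold on viability would express the same priority ordering.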
Chinese summary
在线强化学习微调预训练的视觉语言动作策略时,稀疏的二元回合结果会混淆可行性与效率,无法提供逐步监督,并且简单地将回合结果分配给含有人工干预的片段会导致错误的信用分配。本文提出分层优势加权行为克隆(HABC),分别训练可行性和效率的评估头,并通过状态自适应门控合并单步优势,在成功不确定时优先关注可行性,仅当可行性高时才转向效率;干预感知的信用分配仅将结果标签赋予自主执行片段,防止监督泄漏。在三个接触密集型双手灵巧操作的真实机器人任务上,HABC将监督微调基线的成功率从36%、44%和12%分别提升至92%、88%和38%。
Key points
Sparse binary success/failure outcomes in VLA fine-tuning conflate viability and efficiency, and naive credit assignment across intervention boundaries misleads training.
VLA微调中的稀疏二元成功/失败结果会混淆可行性与效率,跨干预边界的简单信用分配会误导训练。
HABC uses dual critics for viability and efficiency and a state-adaptive gate that merges their advantages (prioritizing viability when success is uncertain, shifting to efficiency when viability is high); the merged signal is then converted into per-transition actor weights.
HABC使用可行性和效率双评估头,通过状态自适应门控合并优势(不确定时优先可行性,可行性高时转向效率),将合并信号转化为逐步动作权重。
Intervention-aware assignment restricts outcome labels to autonomous segments only, preventing supervision from leaking across human intervention boundaries.
干预感知的分配将结果标签仅赋予自主执行片段,防止人工干预边界引发的监督泄漏。
On three real-robot, contact-rich bimanual tasks, HABC raises success rates over the SFT baselines on every task (36%→92%, 44%→88%, 12%→38%).
在三个真实双手接触密集型任务上,HABC将SFT基线的成功率大幅提升(36%→92%,44%→88%,12%→38%)。
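The intervention-aware assignment in the key points can be sketched as follows. This is a simplified illustration, not the paper's procedure: the per-step flag list `is_human`, the function names, and the rule that every autonomous step receives the episode outcome while intervention steps receive no label are all assumptions. The summary does not say how autonomous segments that end in an intervention are treated.

```python
from typing import List, Optional, Tuple


def autonomous_segments(is_human: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the policy acted alone."""
    segments = []
    start = None
    for t, human in enumerate(is_human):
        if not human and start is None:
            start = t                      # an autonomous segment begins
        elif human and start is not None:
            segments.append((start, t))    # an intervention ends the segment
            start = None
    if start is not None:
        segments.append((start, len(is_human)))
    return segments


def label_autonomous_steps(is_human: List[bool], success: bool) -> List[Optional[int]]:
    """Give the episode outcome to autonomous steps only.

    Intervention steps get None, so the outcome label never supervises
    transitions that a human operator produced.
    """
    outcome = 1 if success else 0
    labels: List[Optional[int]] = [None] * len(is_human)
    for start, end in autonomous_segments(is_human):
        for t in range(start, end):
            labels[t] = outcome
    return labels


if __name__ == "__main__":
    flags = [False, False, True, True, False]  # human takes over at steps 2-3
    print(autonomous_segments(flags))
    print(label_autonomous_steps(flags, success=True))
```

Steps labeled `None` would be excluded from the outcome-supervised critic targets under this assumption.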