ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
English summary
The paper proposes ROVE, a reinforcement learning framework for post-training vision-language-action (VLA) models on humanoid robots using imperfect human interventions. ROVE uses a human-in-the-loop pipeline to collect trajectories from real deployments and human interventions; these trajectories are often suboptimal. To avoid imitating hesitant or erroneous behaviors, it introduces Optimistic Value Estimation (OVE), which prioritizes high-value actions in mixed-quality data. Cross-embodiment human experience videos provide additional supervision for long-tailed failure and recovery modes, improving the critic's advantage signals. On real-world contact-rich and fine-grained manipulation tasks, ROVE consistently outperforms experience-learning baselines and keeps improving over multiple rollout-intervention iterations.
Chinese summary
该论文提出ROVE框架,利用不完美的人类干预对视觉-语言-动作(VLA)模型进行人形机器人操作的强化学习后训练。ROVE通过人在回路的流水线收集实际部署与干预数据,这些轨迹往往次优。为避免模仿犹豫或错误行为,它引入乐观价值估计(OVE),从质量参差不齐的轨迹中优先选择高价值动作。跨具身人类经验视频为长尾故障与恢复模式提供额外监督,改善评价器的优势信号。在真实世界中接触密集和精细操作任务上,ROVE持续优于经验学习基线,并在多次部署-干预迭代中不断提升。
Key points
ROVE is an RL post-training framework for humanoid VLA models that handles imperfect human interventions.
ROVE是面向人形机器人VLA模型的强化学习后训练框架,能处理不完美的人类干预数据。
It uses Optimistic Value Estimation (OVE) to selectively imitate high-value behaviors instead of all demonstrated actions.
利用乐观价值估计(OVE)选择性地模仿高价值行为,而非盲目跟随所有示范动作。
Cross-embodiment human experience videos enhance value estimation for rare failure and recovery scenarios.
跨具身人类经验视频增强了针对罕见故障与恢复场景的价值估计。
On real-world humanoid tasks, ROVE outperforms baselines and keeps improving over repeated rollout-intervention iterations.
在真实人形机器人任务上,ROVE超越基线方法,并通过人类干预循环迭代提升性能。
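Illustrative sketch (not from the paper): the summary does not say how OVE is computed, so the toy example below substitutes expectile regression, a standard technique from offline RL, to show one way a value estimate can be biased toward the better outcomes in mixed-quality data. The estimate is then used to weight trajectories for imitation. Everything here is an assumption made for illustration: the function names `fit_expectile` and `imitation_weights`, the parameters `tau`, `lr`, `steps`, `temperature`, and `clip`, and the use of scalar returns from a single state.

```python
import math


def fit_expectile(returns, tau=0.9, lr=0.1, steps=500):
    """Fit a scalar value v by gradient descent on the expectile loss.

    The loss is weight * (g - v)^2, with weight = tau when the return g
    lies above v and (1 - tau) otherwise. With tau > 0.5, underestimates
    are penalized more, so v is pulled toward the higher returns.
    NOTE: a stand-in for OVE, not the paper's actual objective.
    """
    v = sum(returns) / len(returns)  # start from the plain mean
    for _ in range(steps):
        grad = 0.0
        for g in returns:
            residual = g - v
            weight = tau if residual > 0 else 1.0 - tau
            grad += -2.0 * weight * residual
        v -= lr * grad / len(returns)
    return v


def imitation_weights(returns, value, temperature=1.0, clip=20.0):
    """Exponentiated-advantage weights (hypothetical helper).

    Trajectories whose return exceeds the value estimate get larger
    weights; the clip keeps a single sample from dominating.
    """
    return [min(math.exp((g - value) / temperature), clip) for g in returns]


if __name__ == "__main__":
    # Two poor and two good trajectories from the same state (made-up data).
    returns = [0.0, 0.0, 1.0, 1.0]
    v = fit_expectile(returns, tau=0.9)
    print("optimistic value:", round(v, 3))
    print("weights:", [round(w, 3) for w in imitation_weights(returns, v)])
```

With `tau = 0.5` the fitted value reduces to the plain mean of the returns; raising `tau` moves it toward the best returns in the data, which is the sense in which this estimate is optimistic. Trajectories that fall short of the estimate then receive lower imitation weights.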