ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

Summary

The paper proposes ROVE, a reinforcement learning framework for post-training vision-language-action (VLA) models on humanoid robots using imperfect human interventions. ROVE employs a human-in-the-loop pipeline to collect real deployment and intervention trajectories, which are often suboptimal. To avoid imitating hesitant or erroneous behaviors, it introduces Optimistic Value Estimation (OVE) to prioritize high-value actions from mixed-quality data. Cross-embodiment human experience videos provide additional supervision for long-tailed failure and recovery modes, improving the critic's advantage signals. In real-world contact-rich and fine-grained tasks, ROVE consistently outperforms experience-learning baselines and improves through multiple rollout-intervention iterations.
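The idea of prioritizing high-value actions in mixed-quality data can be illustrated with a small sketch. The summary does not say how OVE is computed, so this example swaps in expectile regression, a known technique from offline RL (used in Implicit Q-Learning), as a stand-in for an optimistic value estimate. It then turns the resulting advantages into clipped exponential imitation weights. All function names, the parameters `tau`, `temperature`, and `max_weight`, and the toy returns are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def expectile_loss(values, targets, tau=0.9):
    """Asymmetric squared error (hypothetical stand-in for an optimistic critic loss).

    Targets above the current value are weighted by tau and targets below it
    by 1 - tau, so tau > 0.5 pulls the estimate toward the better outcomes.
    """
    diff = targets - values
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))


def fit_optimistic_value(returns, tau=0.9, lr=0.1, steps=2000):
    """Fit one scalar value to observed returns by gradient descent on the expectile loss."""
    returns = np.asarray(returns, dtype=float)
    v = float(np.mean(returns))  # start from the ordinary (non-optimistic) mean
    for _ in range(steps):
        diff = returns - v
        weight = np.where(diff > 0, tau, 1.0 - tau)
        grad = -2.0 * np.mean(weight * diff)  # d(loss)/dv
        v -= lr * grad
    return v


def imitation_weights(advantages, temperature=1.0, max_weight=20.0):
    """Clipped exponential advantage weights for a weighted imitation loss."""
    advantages = np.asarray(advantages, dtype=float)
    return np.minimum(np.exp(advantages / temperature), max_weight)


# Toy data (assumed): three hesitant trajectories with zero return, one success.
returns = np.array([0.0, 0.0, 0.0, 1.0])
v_mean = fit_optimistic_value(returns, tau=0.5)  # tau = 0.5 recovers the mean
v_opt = fit_optimistic_value(returns, tau=0.9)   # tau = 0.9 leans toward the success
weights = imitation_weights(returns - v_opt)     # the successful trajectory gets the largest weight
```

With these toy returns, `tau=0.5` leaves the estimate at the mean of 0.25, while `tau=0.9` moves it to about 0.75. Under the optimistic estimate, the three low-return trajectories have negative advantages and are down-weighted rather than imitated uniformly.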

Key points

ROVE is an RL post-training framework for humanoid VLA models that handles imperfect human interventions.

It uses Optimistic Value Estimation (OVE) to selectively imitate high-value behaviors instead of all demonstrated actions.

Cross-embodiment human experience videos enhance value estimation for rare failure and recovery scenarios.

On real-world humanoid tasks, ROVE surpasses baselines and improves iteratively with human intervention loops.
