APPO: Agentic Procedural Policy Optimization
English summary
APPO is an agentic reinforcement learning method that improves multi-turn tool use in large language model agents. It refines branching and credit assignment by operating on fine-grained, token-level decision points rather than coarse, heuristically defined interaction units. Branching locations are selected using token uncertainty and policy-induced likelihood gains, which gives more precise exploration and better credit distribution across branched rollouts. Across 13 benchmarks, APPO consistently outperforms existing agentic RL methods, by about 4 points on average. It also keeps tool calls efficient and preserves behavioral interpretability.
Key points
APPO addresses the 'where to branch' and 'how to attribute credit' problems in agentic RL by operating at token-level decision points instead of coarse units.
It uses token uncertainty and policy-induced likelihood gains to select branching locations, enabling more precise exploration and improving credit distribution.
Experiments on 13 benchmarks show an average performance boost of about 4 points over existing agentic RL methods.
The method keeps tool calls efficient and maintains behavioral interpretability.
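The key points above say that branching locations are chosen from token uncertainty and policy-induced likelihood gains, but this summary does not give the actual scoring rule. The sketch below is one plausible reading, not APPO's published formula: it assumes uncertainty is the Shannon entropy of the next-token distribution, that the likelihood gain is the increase in the sampled token's log-probability under the current policy relative to a reference policy, and that the two are combined by a weighted sum. The names `token_entropy`, `branch_scores`, `select_branch_points`, and `gain_weight` are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_scores(
    step_probs: Sequence[Sequence[float]],
    logp_policy: Sequence[float],
    logp_reference: Sequence[float],
    gain_weight: float = 1.0,
) -> List[float]:
    """Score each token position as a candidate branching point.

    Assumed rule (not taken from the source):
        score = entropy + gain_weight * max(0, logp_policy - logp_reference)
    """
    scores = []
    for probs, lp_new, lp_ref in zip(step_probs, logp_policy, logp_reference):
        # Only count positions where the current policy made the token MORE likely.
        gain = max(0.0, lp_new - lp_ref)
        scores.append(token_entropy(probs) + gain_weight * gain)
    return scores


def select_branch_points(scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring token positions, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy rollout with four token positions (all values are made up for illustration).
step_probs = [
    [0.97, 0.02, 0.01],  # confident step
    [0.40, 0.35, 0.25],  # uncertain step
    [0.90, 0.05, 0.05],  # fairly confident step
    [0.50, 0.50],        # coin-flip step
]
logp_policy = [-0.03, -0.9, -0.1, -0.7]
logp_reference = [-0.05, -1.6, -0.1, -0.7]

scores = branch_scores(step_probs, logp_policy, logp_reference)
branch_points = select_branch_points(scores, k=2)
print(branch_points)  # [1, 3]
```

In this toy rollout, position 1 scores highest because it is both uncertain and shows a likelihood gain, and position 3 follows on entropy alone. Selecting individual tokens in this way, rather than whole tool-call turns, is what the summary means by token-level decision points; how credit is then shared across the branched rollouts is not specified here, so it is left out of the sketch.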