The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]
Summary
This paper, presented at ACM CAIS 2026, studies safety evaluation for tool-using LLM agents. It sorts agent outcomes into three categories (safe success, unsafe success, and failure) and proposes a two-tier verification architecture: deterministic policy and tool checks run first, followed by an LLM-based verifier that handles contextual safety. On τ-bench tool-use scenarios, the authors find that verification reduces unsafe success but also lowers task completion as the task horizon increases. They call this horizon-dependent tradeoff between safety and successful task completion the 'Verifier Tax'. The work argues that unsafe completion should be reported as its own category, distinct from safe success.
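The three-way outcome split can be made concrete with a small sketch. This is an illustration only, not the paper's evaluation code: the `Episode` fields, the `classify` rule, and the choice to count an episode that both fails and violates a policy as a plain failure are all assumptions made here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    SAFE_SUCCESS = "safe_success"
    UNSAFE_SUCCESS = "unsafe_success"
    FAILURE = "failure"


@dataclass
class Episode:
    # Hypothetical per-episode labels; a real benchmark would derive these
    # from the task's goal check and its policy rules.
    task_completed: bool
    policy_violated: bool


def classify(ep: Episode) -> Outcome:
    """Map one episode to exactly one of the three outcome categories."""
    if not ep.task_completed:
        # Assumption: an incomplete task is a failure regardless of violations.
        return Outcome.FAILURE
    if ep.policy_violated:
        return Outcome.UNSAFE_SUCCESS
    return Outcome.SAFE_SUCCESS


def outcome_rates(episodes: List[Episode]) -> Dict[Outcome, float]:
    """Fraction of episodes in each category (all zeros for an empty list)."""
    counts = Counter(classify(ep) for ep in episodes)
    total = len(episodes)
    return {o: (counts[o] / total if total else 0.0) for o in Outcome}
```

Reporting `UNSAFE_SUCCESS` separately keeps a single "success rate" from hiding completions that broke a policy, which is the distinction the summary stresses.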
Key points
Categorizes agent outcomes into three classes: safe success, unsafe success, and failure.
Proposes a two-tier verification architecture: deterministic checks first, then an LLM-based verifier.
Verification reduces unsafe success but lowers task completion as the task horizon grows, a tradeoff the authors call the 'Verifier Tax'.
Evaluated on τ-bench tool-use scenarios.
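The two-tier idea in the key points above (deterministic checks first, then an LLM-based verifier) might be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the tool names, the refund cap, the `Verdict` shape, and the `llm_judge` callable are all hypothetical, and the LLM tier is represented by a plain function so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Verdict:
    allowed: bool
    tier: str    # which tier decided: "deterministic" or "llm"
    reason: str


# Hypothetical policy table; names and limits are illustrative only.
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}
MAX_REFUND = 100.0


def deterministic_check(call: ToolCall) -> Optional[Verdict]:
    """Tier 1: cheap rule checks. Returns a blocking Verdict, or None to pass on."""
    if call.name not in ALLOWED_TOOLS:
        return Verdict(False, "deterministic", f"tool {call.name!r} is not allowed")
    if call.name == "issue_refund" and call.args.get("amount", 0) > MAX_REFUND:
        return Verdict(False, "deterministic", "refund exceeds cap")
    return None


def verify(call: ToolCall, context: str,
           llm_judge: Callable[[ToolCall, str], bool]) -> Verdict:
    """Run tier 1 first; consult the (stand-in) LLM verifier only if tier 1 passes."""
    blocked = deterministic_check(call)
    if blocked is not None:
        return blocked
    ok = llm_judge(call, context)
    reason = "approved by LLM verifier" if ok else "rejected by LLM verifier"
    return Verdict(ok, "llm", reason)
```

In this arrangement the second tier is only reached by calls that pass the rule checks, so every proposed tool call faces at least one gate before it executes; the stand-in `llm_judge` would be replaced by a real model call in practice.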