Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
Abstract
This paper proposes a Bayesian inference framework for auditing frontier AI evaluations using public leaderboard archives such as LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench. It demonstrates that terminal-only performance claims are ambiguous: a single snapshot can be compatible with very different pre-terminal histories, whose times to approach the ceiling differ by a factor of more than three. Synthetic experiments show that a candidate selection-aware frontier model fails at synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, so the audit gates reject its stronger claims. The study introduces an archive-and-adjudication protocol that reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier assertions, giving a rigorous method for interpreting leaderboard data.
Key Points
Public leaderboard archives are selective time series shaped by reporting rules, not neutral terminal records.
Terminal-only snapshot claims are ambiguous: the same observed final performance can arise from histories that take either 23.03 or 75.13 time units to approach the ceiling, a difference of more than a factor of three.
A candidate selection-aware frontier model fails at synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, which exposes its unreliability.
The proposed archive-and-adjudication protocol reconstructs evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.
The analysis draws on real longitudinal data from LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench.
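The terminal-only ambiguity described in the key points can be illustrated with a minimal sketch. It assumes a saturating curve s(t) = c * (1 - exp(-k * t)), which is an illustrative choice and not necessarily the model used in the paper. The snapshot (T_OBS, Y_OBS), the two candidate ceilings, and the 95% threshold are hypothetical values, so the output is not expected to reproduce the 23.03 and 75.13 figures quoted above.

```python
import math


def rate_from_terminal(t_obs, y_obs, ceiling):
    """Rate k such that ceiling * (1 - exp(-k * t_obs)) equals y_obs."""
    if not 0.0 < y_obs < ceiling:
        raise ValueError("the observed score must lie strictly below the ceiling")
    return -math.log(1.0 - y_obs / ceiling) / t_obs


def score(t, ceiling, rate):
    """Score at time t under the assumed saturating curve."""
    return ceiling * (1.0 - math.exp(-rate * t))


def time_to_fraction(rate, fraction=0.95):
    """Time at which the curve reaches the given fraction of its ceiling."""
    return -math.log(1.0 - fraction) / rate


# Hypothetical terminal snapshot: a score of 0.60 observed at t = 20.
T_OBS, Y_OBS = 20.0, 0.60

# Two hypothetical ceilings, each consistent with the same snapshot.
histories = {}
for name, ceiling in (("low_ceiling", 0.65), ("high_ceiling", 0.95)):
    rate = rate_from_terminal(T_OBS, Y_OBS, ceiling)
    histories[name] = {
        "ceiling": ceiling,
        "rate": rate,
        "terminal_score": score(T_OBS, ceiling, rate),
        "time_to_95pct": time_to_fraction(rate, 0.95),
    }

for name, h in histories.items():
    print(f"{name}: terminal={h['terminal_score']:.3f}, "
          f"time to 95% of ceiling={h['time_to_95pct']:.2f}")
```

Both histories pass through the same terminal point, yet they imply different times to approach their ceilings. The snapshot alone therefore cannot distinguish them; only pre-terminal observations can.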