Automated reproducibility assessments in the social and behavioral sciences using large language models
Summary
Researchers evaluated an LLM pipeline on 76 published social and behavioral science studies with predefined claims. In 7 studies the LLM failed to produce a viable effect size estimate. Among the remaining studies, the pipeline recovered the original effect size within ±0.05 Cohen's d in 41% of cases. It reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysts, who achieved 34% effect-size recovery and 74% conclusion agreement. These findings suggest that LLMs can automate and scale reproducibility assessments, providing a foundation for systematic auditing of empirical results.
Key points
An LLM pipeline was tested on 76 published social and behavioral science studies with predefined claims.
Excluding the 7 studies where the LLM failed to produce a viable effect size estimate, the LLM reproduced the original effect size within ±0.05 Cohen's d in 41% of studies.
The LLM matched the original study's qualitative conclusion in 96% of cases, versus 74% for human reanalysts.
Human reanalysts recovered effect sizes in 34% of studies, lower than the LLM's 41% (a figure that excludes the LLM's failed cases).
The findings suggest that LLMs can serve as a scalable, automated tool for systematically auditing empirical results.
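To make the two headline metrics concrete, here is a minimal Python sketch of how an effect-size recovery rate and a conclusion-agreement rate could be computed from a set of reanalyses. It is not the study's pipeline. The `Reanalysis` record, the `summarize` function, and the toy numbers are hypothetical, and the sketch assumes that both rates are computed only over reanalyses that produced a viable estimate (the summary states this for effect-size recovery, but not explicitly for conclusion agreement).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reanalysis:
    """One reanalysis of a study claim (hypothetical record layout)."""
    original_d: float              # Cohen's d reported by the original study
    reproduced_d: Optional[float]  # None when no viable estimate was produced
    same_conclusion: bool          # qualitative conclusion matches the original


def summarize(results, tolerance=0.05):
    """Return (n_failed, recovery_rate, conclusion_rate).

    Assumption: both rates are taken over viable reanalyses only.
    """
    viable = [r for r in results if r.reproduced_d is not None]
    n_failed = len(results) - len(viable)
    if not viable:
        return n_failed, None, None
    # A reanalysis "recovers" the effect when it lands within +/- tolerance.
    recovered = sum(abs(r.reproduced_d - r.original_d) <= tolerance for r in viable)
    agreed = sum(r.same_conclusion for r in viable)
    return n_failed, recovered / len(viable), agreed / len(viable)


# Toy data for illustration only; these are not values from the study.
toy_results = [
    Reanalysis(original_d=0.40, reproduced_d=0.43, same_conclusion=True),
    Reanalysis(original_d=0.40, reproduced_d=0.25, same_conclusion=True),
    Reanalysis(original_d=0.10, reproduced_d=-0.02, same_conclusion=False),
    Reanalysis(original_d=0.30, reproduced_d=None, same_conclusion=False),
]

n_failed, recovery_rate, conclusion_rate = summarize(toy_results)
print(n_failed, recovery_rate, conclusion_rate)
```

In this sketch a reanalysis can agree with the original conclusion without recovering the effect size (the second toy record), which is one way the two rates can diverge as they do in the reported results.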