Automated reproducibility assessments in the social and behavioral sciences using large language models

English summary

Researchers evaluated a large language model (LLM) pipeline on 76 published social and behavioral science studies with predefined claims. Excluding 7 studies where the LLM failed to produce a viable effect size estimate, the pipeline recovered the original effect sizes within ±0.05 Cohen's d in 41% of the remaining studies. It reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysts, who achieved 34% effect-size recovery and 74% conclusion agreement. These findings suggest LLMs can automate and scale reproducibility assessments, providing a foundation for systematic auditing of empirical results.
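
The effect-size recovery criterion can be made concrete with a short sketch. This is only an illustration under stated assumptions, not the study's actual scoring code: the function name `recovery_rate`, the use of `None` to mark a failed run, the inclusive tolerance bound, and the sample values are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

TOLERANCE = 0.05  # Cohen's d tolerance quoted in the summary


def recovery_rate(
    original: Sequence[float],
    reproduced: Sequence[Optional[float]],
    tolerance: float = TOLERANCE,
) -> Tuple[float, int, int]:
    """Return (rate, n_scored, n_failed).

    A reproduced value of None marks a study with no viable estimate
    (an assumed encoding). Such studies are excluded from the denominator,
    mirroring the summary's "excluding 7 studies" wording.
    """
    if len(original) != len(reproduced):
        raise ValueError("original and reproduced must have the same length")

    # Keep only the studies that produced a usable estimate.
    scored = [(o, r) for o, r in zip(original, reproduced) if r is not None]
    n_failed = len(original) - len(scored)
    if not scored:
        raise ValueError("no viable estimates to score")

    # A study counts as recovered when |d_original - d_reproduced| <= tolerance.
    # Treating the bound as inclusive is an assumption.
    hits = sum(1 for o, r in scored if abs(o - r) <= tolerance)
    return hits / len(scored), len(scored), n_failed


# Made-up values for illustration only; these are not data from the study.
original_d = [0.30, 0.50, 0.20, 0.80]
reproduced_d = [0.32, 0.40, None, 0.79]
rate, n_scored, n_failed = recovery_rate(original_d, reproduced_d)
```

With the reported figures, the denominator for the 41% rate would be 76 − 7 = 69 studies.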

Key points

LLM pipeline tested on 76 published studies with predefined claims in social and behavioral sciences.

In 41% of the remaining studies (excluding 7 failures), the LLM recovered original effect sizes within ±0.05 Cohen's d.

LLM matched the original study's qualitative conclusion in 96% of cases, vs. 74% for human reanalysts.

Human reanalysts recovered effect sizes in 34% of studies, lower than the LLM's 41% (which excludes failed cases).

The results suggest LLMs can serve as a scalable tool for automated reproducibility auditing of empirical results.
