First Proof Project's Rigorous Test: AI Still Trails Human Experts at Solving Math Problems

Summary

The First Proof project tested four AI systems on ten original, unpublished research-level math problems that mathematicians wrote specifically for the test. None of the problems had appeared in any model's training data, and the solutions were scored by anonymous expert reviewers from the relevant fields. The AI responses hallucinated frequently and cited no literature at all, leaving every claim unsourced. The evaluation was the first to satisfy three key standards at once: frontier math problems, no training data leakage, and expert human evaluation. Its results confirm that current reasoning models cannot yet match top human mathematicians.

Key Points

Four AI systems attempted ten original research-level math problems that had never appeared in their training data.
The AI responses hallucinated frequently, a persistent problem for large language models.
None of the AI answers included literature citations; no sources were referenced at all.
The evaluation met three strict standards: frontier problems, unseen data, and review by expert mathematicians.
The results show that current AI reasoning models still do not match top human mathematicians.
