iOSWorld: A Benchmark for Personally Intelligent Phone Agents
English summary
iOSWorld is the first interactive benchmark built on a native iOS simulator with a persistent user identity across 26 newly built apps. It includes 133 tasks across single-app, multi-app, and memory/personalization categories, testing agents' ability to reason over personal data. Evaluated models achieve at most 52% overall accuracy, and multi-app tasks prove especially challenging, with accuracy of only 37%. The benchmark is released as open source, including all apps, seeded data, and evaluation code.
Key points
First interactive benchmark on a native iOS simulator with a persistent user identity.
Includes 26 newly built iOS apps with connected personal data like transactions and contacts.
133 tasks in three categories: single-app (27), multi-app (60), and memory/personalization (46).
Best model achieves 52% overall accuracy; accuracy on multi-app tasks is only 37%.
Open-source release includes apps, seeded data, tasks, rubrics, and evaluation code.
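The category counts in the key points can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative and are not taken from the iOSWorld release.

```python
# Task counts per category, as stated in the key points.
# The dictionary keys and variable names are illustrative, not from the benchmark code.
task_counts = {
    "single-app": 27,
    "multi-app": 60,
    "memory/personalization": 46,
}

# The three categories should add up to the reported benchmark size.
total_tasks = sum(task_counts.values())

# Share of the benchmark that each category represents, in percent.
category_share = {
    name: round(100 * count / total_tasks, 1)
    for name, count in task_counts.items()
}

for name, share in category_share.items():
    print(f"{name}: {task_counts[name]} tasks ({share}% of {total_tasks})")
```

Multi-app tasks are the largest category by count, so the low multi-app accuracy weighs heavily on the overall figure.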