This paper introduces EvoArena, a benchmark that evaluates LLM agents under progressive environmental changes across terminal, software, and social domains. Current agents achieve only 39.6% average accuracy on EvoArena. The authors propose EvoMem, a patch-based memory paradigm that records structured update histories so the agent can reason about how the environment has evolved. EvoMem raises EvoArena accuracy by 1.5 points and improves results on the GAIA and LoCoMo benchmarks by 6.1 and 4.8 percentage points, respectively. On chain-level tasks, which require sequences of related subtasks, EvoMem raises accuracy by 3.7 points. Mechanistic analysis shows that EvoMem better preserves complete evolving environment states in its memory evidence.
Papers | Source: arXiv | Importance: 4/5
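To make the patch-based memory idea from the EvoMem summary concrete, here is a minimal sketch of a memory that stores per-step diffs instead of only the latest snapshot. It is not the authors' implementation: the names `Patch`, `PatchMemory`, `observe`, and `state_at` are hypothetical, and the sketch assumes the environment can be represented as a flat dictionary whose values are never `None`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Patch:
    """One recorded change to a single key (hypothetical structure)."""
    step: int
    key: str
    old: Optional[Any]  # None means the key did not exist before
    new: Optional[Any]  # None means the key was removed


@dataclass
class PatchMemory:
    state: Dict[str, Any] = field(default_factory=dict)
    history: List[Patch] = field(default_factory=list)

    def observe(self, step: int, snapshot: Dict[str, Any]) -> List[Patch]:
        """Diff a new snapshot against the stored state and record the patches."""
        patches = []
        for key in sorted(set(self.state) | set(snapshot)):
            old, new = self.state.get(key), snapshot.get(key)
            if old != new:
                patches.append(Patch(step, key, old, new))
        self.history.extend(patches)
        self.state = dict(snapshot)
        return patches

    def state_at(self, step: int) -> Dict[str, Any]:
        """Replay patches up to and including `step` to rebuild an earlier state."""
        state: Dict[str, Any] = {}
        for patch in self.history:  # assumes observe() was called in step order
            if patch.step > step:
                break
            if patch.new is None:
                state.pop(patch.key, None)
            else:
                state[patch.key] = patch.new
        return state


# Illustrative environment snapshots (made-up values).
mem = PatchMemory()
mem.observe(0, {"python": "3.10", "config": "a.yaml"})
mem.observe(1, {"python": "3.12", "config": "a.yaml", "cache": "on"})
mem.observe(2, {"python": "3.12", "cache": "on"})
```

Because the history keeps every change, the agent can ask both what the environment looks like now and what it looked like at an earlier step, which a snapshot-only memory cannot answer.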
The paper presents SpatialClaw, a training-free framework that uses code execution as the action interface for agentic spatial reasoning. It maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, allowing a VLM-backed agent to write one executable cell per step based on all prior outputs. Evaluated on 20 static and dynamic 3D/4D spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, outperforming the prior spatial agent by 11.2 percentage points. The gains are consistent across six vision-language model backbones from two model families, with no benchmark- or model-specific tuning. The results demonstrate that a flexible, iterative code-based interface significantly outperforms single-pass or structured tool-call designs for open-ended spatial tasks.
Papers | Source: arXiv | Importance: 4/5
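The "one executable cell per step over a stateful kernel" interface described for SpatialClaw can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's system: `StatefulKernel`, `run_cell`, the `distance` primitive, and the `points` data are all hypothetical, and a fixed list of cells stands in for the VLM-backed agent.

```python
import contextlib
import io
import math


class StatefulKernel:
    """Keeps one namespace alive across cells, like a notebook kernel (hypothetical)."""

    def __init__(self, preload):
        # Pre-loaded primitives and inputs are visible to every later cell.
        self.namespace = dict(preload)

    def run_cell(self, source: str) -> str:
        """Execute one cell and return its printed output or an error message."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception as exc:
            buffer.write(f"{type(exc).__name__}: {exc}")
        return buffer.getvalue()


def distance(a, b):
    """Toy geometry primitive: Euclidean distance between two 3D points."""
    return math.dist(a, b)


kernel = StatefulKernel({
    "distance": distance,
    "points": {"chair": (0.0, 0.0, 0.0), "table": (3.0, 4.0, 0.0)},  # made-up inputs
})

# A scripted stand-in for the agent: each cell may reuse variables from earlier cells.
cells = [
    "d = distance(points['chair'], points['table'])",
    "print(f'{d:.1f}')",
]
outputs = [kernel.run_cell(cell) for cell in cells]
error_output = kernel.run_cell("undefined_name")
```

The second cell reads `d` from the first, and a failing cell returns its error text instead of crashing the loop, so the next step could react to it.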
The paper presents Agents-K1, an end-to-end pipeline that transforms raw documents into agent-native scientific knowledge graphs. It combines a multimodal parser using a five-module schema to capture entities, evidence, citations, and typed cross-entity relations from full papers, a 4B information-extraction backbone trained with GRPO under a rule-based reward, and a GraphAnything CLI that unifies web search, multimodal graph retrieval, and cross-document traversal. The authors process 2.46 million scientific papers across six subjects to construct Scholar-KG and release a one-million-paper subset. Experiments show superior performance on scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning. The pipeline is extensible to general-domain corpora and schema-conformant data synthesis.
Current tool-augmented LLM agents suffer from an execution-granularity mismatch, as step-wise atomic tool calls expose low-level dataflow and waste context windows. HyperTool proposes a unified MCP-style tool interface where the agent invokes a code block that internally calls multiple tools, manipulates returned values, and passes intermediate results locally, collapsing deterministic subroutines into a single model-visible call. The system is trained on synthesized trajectories from cross-tool compositional tasks and verified in real MCP environments. On the MCP-Universe benchmark, HyperTool raises average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, outperforming GPT-OSS and Kimi-k2.5. The results show that moving beyond step-wise tool calls significantly improves multi-step tool use in agents.
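The granularity argument in the HyperTool summary can be sketched as follows: a single code block calls several tools, passes intermediate values locally, and returns only the final result to the model. This is a minimal illustration, not HyperTool's actual interface. The tool names `search_city` and `get_forecast`, their return values, and the `run_code_block` helper are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical tools with canned return values, standing in for real MCP servers.
TOOLS: Dict[str, Callable] = {
    "search_city": lambda name: {"name": name, "lat": 48.85, "lon": 2.35},
    "get_forecast": lambda lat, lon: [
        {"day": "Mon", "temp_c": 18},
        {"day": "Tue", "temp_c": 23},
    ],
}


def run_code_block(source: str) -> dict:
    """Run one model-written block; only the `result` variable is returned."""
    scope = {"tools": TOOLS}
    exec(source, scope)
    return {"result": scope["result"]}


# One model-visible call replaces two atomic tool calls plus a selection step.
block = """
city = tools["search_city"]("Paris")
forecast = tools["get_forecast"](city["lat"], city["lon"])
result = max(forecast, key=lambda d: d["temp_c"])["day"]
"""
visible = run_code_block(block)
```

The coordinates and the full forecast list never enter the model's context; only `{"result": "Tue"}` does. A real deployment would need sandboxing around `exec`, which this sketch omits.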
The paper presents EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. It argues the key bottleneck has shifted from designing agent workflows to engineering agent environments that amplify productive behaviors and suppress harmful ones. EurekAgent engineers environments across four dimensions: permissions engineering for bounded execution and isolated evaluation, artifact engineering for filesystem and Git-based collaboration, budget engineering for budget-aware exploration, and human-in-the-loop engineering for easy oversight. The system achieves new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, including a novel 26-circle packing solution discovered with under $11 total API cost. Code and results are open-sourced, and the authors call for environment engineering as a core research direction for reliable autonomous research agents.
The paper introduces and formalizes the Recursive Agent Harness (RAH), a code-first extension of recursive language models in which a parent agent generates executable scripts that spawn full subagent harnesses with filesystem tools, code execution, and planning. In a controlled evaluation on Oolong-Synthetic (199 samples, context lengths up to 4M tokens), RAH with a fixed GPT-5 backbone improves on the Codex coding-agent baseline from 71.75% to 81.36%, a gain of 9.61 percentage points. Because the backbone is held fixed in this comparison, the gain is attributed to the harness design rather than model scaling. With Claude Sonnet 4.5 as the backbone, RAH achieves 89.77%.
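The recursive pattern behind RAH can be sketched in a few lines: a parent splits a context that is too long, hands each half to a child, and combines the children's answers. This is only a toy under stated assumptions, not the paper's harness. The function `recursive_run`, the `window` limit, the stub `count_errors` (which replaces an LLM subagent), and aggregation by summation are all hypothetical.

```python
from typing import Callable, List


def recursive_run(
    context: List[str],
    leaf_agent: Callable[[List[str]], int],
    window: int,
) -> int:
    """Answer directly if the context fits in `window` lines, else recurse on halves."""
    if len(context) <= window:
        return leaf_agent(context)
    mid = len(context) // 2
    # Each half goes to a child call, standing in for a spawned subagent.
    left = recursive_run(context[:mid], leaf_agent, window)
    right = recursive_run(context[mid:], leaf_agent, window)
    return left + right  # assumed aggregation: the task is a simple count


def count_errors(lines: List[str]) -> int:
    """Stub leaf agent: counts lines that contain 'ERROR'."""
    return sum("ERROR" in line for line in lines)


# Made-up log lines; `window` must be at least 1 for the recursion to terminate.
log = ["INFO ok", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu", "INFO ok", "INFO ok"]
total = recursive_run(log, count_errors, window=2)
```

No single call ever sees more than `window` lines, which is the property that lets the approach reach contexts far longer than one model call can hold.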