This paper introduces EvoArena, a benchmark that evaluates LLM agents under progressive environmental changes across terminal, software, and social domains. Current agents achieve only 39.6% average accuracy on EvoArena. The authors propose EvoMem, a patch-based memory paradigm that records structured update histories so the agent can reason about how the environment has evolved. EvoMem raises EvoArena accuracy by 1.5 percentage points and improves results on the GAIA and LoCoMo benchmarks by 6.1 and 4.8 percentage points, respectively. On chain-level tasks, which require sequences of related subtasks, EvoMem raises accuracy by 3.7 points. Mechanistic analysis shows that EvoMem better preserves complete evolving environment states in its memory evidence.
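The patch-based memory idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Patch`, `PatchMemory`, `record`, `state_at`, and `history`, as well as the patch schema, are hypothetical. The summary only states that EvoMem records structured update histories.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Patch:
    """One structured update to the environment (hypothetical schema)."""
    step: int
    key: str
    old: Any
    new: Any


@dataclass
class PatchMemory:
    """Append-only history of patches; nothing is overwritten in place."""
    patches: list[Patch] = field(default_factory=list)

    def record(self, step: int, key: str, new: Any) -> None:
        # Assumes patches are recorded in chronological order.
        old = self.state_at(step).get(key)
        self.patches.append(Patch(step, key, old, new))

    def state_at(self, step: int) -> dict[str, Any]:
        """Replay patches up to and including `step` to rebuild the state."""
        state: dict[str, Any] = {}
        for patch in self.patches:
            if patch.step <= step:
                state[patch.key] = patch.new
        return state

    def history(self, key: str) -> list[Patch]:
        """All recorded changes to one key, oldest first."""
        return [p for p in self.patches if p.key == key]


memory = PatchMemory()
memory.record(1, "cli_flag", "--out")
memory.record(2, "cli_flag", "--output")  # the environment changed
before = memory.state_at(1)
after = memory.state_at(2)
```

Because the history is kept rather than overwritten, the agent can answer both "what is the flag now?" and "what was it before the change?" from the same memory.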
Papers | Source: arXiv | Importance: 4/5
The paper presents SpatialClaw, a training-free framework that uses code execution as the action interface for agentic spatial reasoning. It maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, allowing a VLM-backed agent to write one executable cell per step based on all prior outputs. Evaluated on 20 static and dynamic 3D/4D spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, outperforming the prior spatial agent by 11.2 percentage points. The gains are consistent across six vision-language model backbones from two model families, with no benchmark- or model-specific tuning. The results demonstrate that a flexible, iterative code-based interface significantly outperforms single-pass or structured tool-call designs for open-ended spatial tasks.
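A stateful kernel of the kind described above can be sketched in a few lines, assuming a plain `exec`-based design. `StatefulKernel`, `run_cell`, the toy `distance` primitive, and the `points` data are all hypothetical stand-ins; in the real system the cells would be written by the VLM rather than hard-coded.

```python
import contextlib
import io


class StatefulKernel:
    """Runs code cells against one persistent namespace, like a notebook."""

    def __init__(self, preloaded: dict):
        # Preloaded objects stand in for input frames and primitives.
        self.namespace = dict(preloaded)
        self.outputs: list[str] = []

    def run_cell(self, source: str) -> str:
        """Execute one cell and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        printed = buffer.getvalue()
        self.outputs.append(printed)  # prior outputs stay visible to the agent
        return printed


def distance(a, b):
    """Toy geometry primitive (hypothetical): Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


kernel = StatefulKernel({
    "distance": distance,
    "points": {"chair": (0, 0, 0), "table": (3, 4, 0)},
})
# Step 1 stores an intermediate value; step 2 reuses it.
kernel.run_cell("d = distance(points['chair'], points['table'])")
result = kernel.run_cell("print(d)")
```

The second cell works only because the namespace persists between steps, which is the property that distinguishes this interface from independent single-pass tool calls.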
Papers | Source: arXiv | Importance: 4/5
Researchers evaluated an LLM pipeline on 76 published social and behavioral science studies with predefined claims. Excluding 7 studies where the LLM failed to produce a viable effect size estimate, the pipeline recovered the original effect size within ±0.05 Cohen's d in 41% of the remaining 69 studies. It reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysts, who achieved 34% effect-size recovery and 74% conclusion agreement. These findings suggest LLMs can automate and scale reproducibility assessments, providing a foundation for systematic auditing of empirical results.
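The recovery criterion above can be made concrete with a short sketch. It assumes the common pooled-standard-deviation form of Cohen's d for two independent groups; the summary does not say which variant the study used, and the function names `cohens_d` and `recovered` are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def recovered(original_d, reproduced_d, tolerance=0.05):
    """True if the reproduced effect is within the tolerance of the original."""
    return abs(original_d - reproduced_d) <= tolerance


# Denominator used for the reported percentages: 76 studies minus 7 exclusions.
analysed_studies = 76 - 7

d = cohens_d([2, 4, 6], [1, 3, 5])
close_match = recovered(0.42, 0.45)
far_match = recovered(0.42, 0.50)
```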
Current tool-augmented LLM agents suffer from an execution-granularity mismatch, as step-wise atomic tool calls expose low-level dataflow and waste context windows. HyperTool proposes a unified MCP-style tool interface where the agent invokes a code block that internally calls multiple tools, manipulates returned values, and passes intermediate results locally, collapsing deterministic subroutines into a single model-visible call. The system is trained on synthesized trajectories from cross-tool compositional tasks and verified in real MCP environments. On the MCP-Universe benchmark, HyperTool raises average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, outperforming GPT-OSS and Kimi-k2.5. The results show that moving beyond step-wise tool calls significantly improves multi-step tool use in agents.
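The difference between step-wise calls and a single composite call can be sketched as follows. This is an illustration of the general idea only, not HyperTool's interface: the tools `search_flights` and `convert_currency`, the runner `run_code_block`, and the convention of returning a `result` variable are all hypothetical.

```python
def search_flights(origin, destination):
    """Hypothetical tool: returns candidate flights."""
    return [
        {"id": "F1", "price": 320},
        {"id": "F2", "price": 210},
        {"id": "F3", "price": 275},
    ]


def convert_currency(amount, rate):
    """Hypothetical tool: converts an amount at a fixed rate."""
    return round(amount * rate, 2)


TOOLS = {"search_flights": search_flights, "convert_currency": convert_currency}


def run_code_block(source, tools):
    """Run a model-written block with the tools in scope.

    Intermediate values stay local to the block; only `result`
    is returned to the model's context.
    """
    scope = dict(tools)
    exec(source, scope)
    return scope["result"]


# One model-visible call that chains two tools and a local computation.
block = """
flights = search_flights("AAA", "BBB")
cheapest = min(flights, key=lambda f: f["price"])
result = {
    "id": cheapest["id"],
    "price_converted": convert_currency(cheapest["price"], 0.5),
}
"""
observation = run_code_block(block, TOOLS)
```

With step-wise calls, the full flight list and the intermediate minimum would each pass through the model's context; here only the final dictionary does.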
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. Evaluation of 31 embedding models shows large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M) by vocabulary trimming and fine-tuning Multilingual E5 models. Despite size reductions of up to 62%, these open-source models achieve competitive performance with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
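Vocabulary trimming in general means dropping embedding rows for tokens the target language never uses and renumbering the rest. The sketch below shows that general idea on toy data; it is not the authors' procedure, and `trim_vocabulary`, the toy vocabulary, and the one-dimensional embeddings are hypothetical.

```python
def trim_vocabulary(vocab, embeddings, keep_tokens):
    """Keep only the rows for `keep_tokens` and renumber token ids.

    vocab: dict mapping token -> row index in `embeddings`
    embeddings: list of embedding rows (lists of floats)
    keep_tokens: set of tokens observed in the target-language corpus
    """
    # Preserve the original relative order of the surviving tokens.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [token for token, _ in ordered if token in keep_tokens]
    new_vocab = {token: index for index, token in enumerate(kept)}
    new_embeddings = [embeddings[vocab[token]] for token in kept]
    return new_vocab, new_embeddings


vocab = {"<pad>": 0, "the": 1, "mačka": 2, "猫": 3, "pes": 4}
embeddings = [[0.0], [0.1], [0.2], [0.3], [0.4]]
keep = {"<pad>", "mačka", "pes"}

small_vocab, small_embeddings = trim_vocabulary(vocab, embeddings, keep)
```

In multilingual models the token embedding matrix is a large share of the parameters, which is why removing unused rows can shrink a model substantially before any fine-tuning.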
The paper introduces and formalizes the Recursive Agent Harness (RAH), a code-first extension of recursive language models where a parent agent generates executable scripts that spawn full subagent harnesses with filesystem tools, code execution, and planning. Controlled evaluation on Oolong-Synthetic (199 samples, context lengths up to 4M tokens) shows RAH with a fixed GPT-5 backbone improves the Codex coding-agent baseline from 71.75% to 81.36%. With a stronger backbone, Claude Sonnet 4.5, RAH achieves 89.77%, confirming the gains stem from the harness design rather than model scaling.
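The recursive structure can be sketched with a toy aggregation task. This is a simplified illustration, not the RAH implementation: in the paper the parent agent writes the spawning script itself and each subagent is a full harness, whereas here the split is hard-coded and `run_harness`, `solve_leaf`, and `combine` are hypothetical names.

```python
def run_harness(records, budget, solve_leaf, combine):
    """Solve directly if the input fits the budget, else recurse on halves.

    budget: maximum number of records one (sub)agent handles at once.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(records) <= budget:
        return solve_leaf(records)
    middle = len(records) // 2
    # Each half is handed to a child harness with the same interface.
    left = run_harness(records[:middle], budget, solve_leaf, combine)
    right = run_harness(records[middle:], budget, solve_leaf, combine)
    return combine(left, right)


def count_spam(records):
    """Stand-in for a subagent answering over a small context."""
    return sum(1 for record in records if record == "spam")


long_context = ["spam", "ham"] * 50  # 100 records, too many for one agent
total = run_harness(long_context, budget=8, solve_leaf=count_spam,
                    combine=lambda a, b: a + b)
```

No single call ever sees more than `budget` records, yet the combined answer covers the whole input, which is the property that lets such a harness scale to contexts far beyond one model window.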