This paper introduces EvoArena, a benchmark that evaluates LLM agents under progressive environmental changes across terminal, software, and social domains. Current agents achieve only 39.6% average accuracy on EvoArena. The authors propose EvoMem, a patch-based memory paradigm that records structured update histories so that agents can reason about how the environment has evolved. EvoMem raises EvoArena accuracy by 1.5 points and improves performance on the GAIA and LoCoMo benchmarks by 6.1 and 4.8 percentage points, respectively. On chain-level tasks, which require completing sequences of related subtasks, EvoMem raises accuracy by 3.7 points. Mechanistic analysis shows that EvoMem better preserves complete evolving environment states in its memory evidence.
Papers | Source: arXiv | Importance: 4/5
The paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. It first trains a reasoning-aware retriever via gold-relevance distillation, so that contexts are ranked by expected reasoning benefit rather than semantic overlap. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, enabling it to leverage the reasoning traces in those demonstrations. Analysis shows that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct scaffolding per problem. On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, suggesting that reasoning-aware retrieval is orthogonal to improvements in reward design or training curricula.
Papers | Source: arXiv | Importance: 3/5
Mana is a sim-to-real framework that reinterprets dexterous manipulation of articulated tools as an animation problem. It uses a coarse-to-fine pipeline combining procedurally generated grasp keyframes with motion planning and reinforcement learning. Data generation requires only a few mouse clicks to specify functional affordances, taking less than one minute per tool. The method achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation across four articulated tools of different scales and joint types. This demonstrates a scalable approach to a challenging robotics problem.
Papers | Source: arXiv | Importance: 4/5
The paper presents SpatialClaw, a training-free framework that uses code execution as the action interface for agentic spatial reasoning. It maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, allowing a VLM-backed agent to write one executable cell per step based on all prior outputs. Evaluated on 20 static and dynamic 3D/4D spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, outperforming the prior spatial agent by 11.2 percentage points. The gains are consistent across six vision-language model backbones from two model families, with no benchmark- or model-specific tuning. The results demonstrate that a flexible, iterative code-based interface significantly outperforms single-pass or structured tool-call designs for open-ended spatial tasks.
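The "one executable cell per step" loop over a stateful kernel can be illustrated with a toy sketch. This is not SpatialClaw's actual interface: `StatefulKernel`, `scripted_agent`, and `detect` are hypothetical names invented here, the VLM is replaced by a fixed script, and the perception primitive returns fake boxes. The sketch only shows how variables persist across cells and how each cell's output is recorded for the next step.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a persistent kernel: variables survive across cells."""

    def __init__(self, preloaded):
        # Pre-loaded inputs and primitives (hypothetical; e.g. frames, detectors).
        self.namespace = dict(preloaded)
        # (code, output) pairs that the agent can read before writing the next cell.
        self.history = []

    def run_cell(self, code):
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            # Errors are returned as output so the agent could react to them.
            buffer.write(f"{type(exc).__name__}: {exc}")
        output = buffer.getvalue().strip()
        self.history.append((code, output))
        return output


def detect(frame):
    """Hypothetical perception primitive returning a fake (x0, y0, x1, y1) box."""
    return (0, 0, frame["w"] // 2, frame["h"] // 2)


def scripted_agent(history):
    """Placeholder for the VLM: emits the next cell from a fixed script."""
    script = [
        "boxes = [detect(f) for f in frames]",
        "widths = [b[2] - b[0] for b in boxes]",
        "print(max(widths))",
    ]
    return script[len(history)] if len(history) < len(script) else None


frames = [{"w": 640, "h": 480}, {"w": 1280, "h": 720}]
kernel = StatefulKernel({"frames": frames, "detect": detect})

# One cell per step; later cells reuse variables defined by earlier ones.
while (cell := scripted_agent(kernel.history)) is not None:
    kernel.run_cell(cell)

final_answer = kernel.history[-1][1]
```

In a real system the scripted agent would be a model call that conditions on `kernel.history`, which is what makes the interface iterative rather than single-pass.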
Papers | Source: arXiv | Importance: 3/5
This paper studies the theoretical expressiveness of truncated positional encodings (PEs) for graph neural networks, which are commonly used in practice for computational efficiency. It shows that under truncation, previously equivalent PE families (spectral and walk-based) become fundamentally different in expressive power, with truncated spectral PEs losing their advantage and becoming no stronger than the 1-WL test. The authors introduce k-harmonic distances to further compare closely related truncated spectral PEs. Experiments on real-world datasets demonstrate that mixing different truncated PE families yields better performance than using any single family.
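To make "truncation" concrete for the walk-based family, the sketch below computes a commonly used random-walk encoding: each node's probability of returning to itself after 1 to k steps, with k as the truncation length. This is one standard definition and is assumed here for illustration; the paper's exact PE definitions and its k-harmonic distances may differ. The helper names `matmul` and `truncated_walk_pe` are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product for small lists of lists."""
    return [
        [sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def truncated_walk_pe(adj, k):
    """Return-probability features for walk lengths 1..k.

    Assumes an undirected graph with no isolated nodes (every degree > 0).
    """
    n = len(adj)
    # Row-normalized adjacency: one step of a simple random walk.
    transition = [[adj[i][j] / sum(adj[i]) for j in range(n)] for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    features = [[] for _ in range(n)]
    for _ in range(k):
        power = matmul(power, transition)
        for node in range(n):
            # Diagonal entry: probability of being back at the start node.
            features[node].append(power[node][node])
    return features


# Example graph: a cycle on 6 nodes.
n = 6
adj = [[0.0] * n for _ in range(n)]
for i in range(n):
    adj[i][(i + 1) % n] = 1.0
    adj[(i + 1) % n][i] = 1.0

pe = truncated_walk_pe(adj, k=3)
```

On the 6-cycle every node receives the same feature vector, since all nodes are structurally equivalent; a larger k adds more columns but costs more matrix products, which is the efficiency trade-off that motivates truncation.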
Papers | Source: arXiv | Importance: 4/5
Researchers evaluated an LLM pipeline on 76 published social and behavioral science studies with predefined claims. Excluding 7 studies where the LLM failed to produce a viable effect size estimate, the pipeline recovered the original effect size within ±0.05 Cohen's d in 41% of the remaining 69 studies. It reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysts, who achieved 34% effect-size recovery and 74% conclusion agreement. These findings suggest that LLMs can automate and scale reproducibility assessments, providing a foundation for systematic auditing of empirical results.
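The recovery criterion can be sketched as a small function. The ±0.05 tolerance and the exclusion of studies without a viable estimate follow the summary above; the function name `recovery_rate` and the four example pairs are made up for illustration and are not data from the study.

```python
def recovery_rate(pairs, tolerance=0.05):
    """Share of viable reanalyses whose Cohen's d is within `tolerance` of the original.

    `pairs` holds (original_d, reproduced_d) tuples; reproduced_d is None when
    no viable estimate was produced, and such studies leave the denominator.
    """
    viable = [(orig, repro) for orig, repro in pairs if repro is not None]
    recovered = sum(1 for orig, repro in viable if abs(orig - repro) <= tolerance)
    excluded = len(pairs) - len(viable)
    return recovered / len(viable), excluded


# Illustrative values only (not from the paper).
example_pairs = [
    (0.30, 0.32),  # within tolerance
    (0.50, 0.41),  # outside tolerance
    (0.10, None),  # no viable estimate, excluded
    (0.20, 0.24),  # within tolerance
]

rate, excluded = recovery_rate(example_pairs)
```

Here two of the three viable studies count as recovered and one study is excluded, mirroring how the reported 41% is computed over the 69 studies that remain after the 7 exclusions.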