JoyAI-VL-Interaction is an 8B-scale, vision-first model that decides on its own whether to respond or delegate, without waiting for a user prompt, with the aim of reacting to changes in its environment the way a human would. The system processes a continuous video stream for real-time interaction and pairs the model with pluggable ASR/TTS modules and a background brain. In evaluations, human raters preferred it over existing video-call assistants across multiple scenarios. The model and system are open-source, and the authors present them as a new paradigm in interaction modeling for always-on, perceptive agents.
HarnessX is a platform for building composable, adaptive, and evolvable agent runtime harnesses. It introduces compositional primitives and AEGIS, a trace-driven evolution engine that iteratively refines harness design using execution feedback. A substitution algebra replaces static, hand-crafted harnesses and allows dynamic adaptation. Across multiple benchmarks, HarnessX achieved an average improvement of 14.5% over conventional harnesses, which suggests that evolving the runtime interface matters alongside model scaling. The authors plan to release the full codebase.
WeaveBench is a benchmark for evaluating computer-use agents (CUAs) on hybrid interfaces, where tasks require both GUI and CLI/code operations. It comprises 114 long-horizon tasks spanning 8 real-world work domains, all evaluated on a real Ubuntu desktop. A trajectory-aware judge inspects agent deliverables and detects shortcut behaviors, addressing limitations of traditional evaluation methods. The PassRate across tested model-runtime pairings is only 41.2%, highlighting a significant performance gap in long-horizon task orchestration.
Researchers propose FORT, a framework for synthesizing training data for deep search agents that resists shortcut learning. It identifies and mitigates four types of shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. The framework uses trajectory signatures to measure and control shortcut risks during data generation. Experiments show that FORT-generated data leads to improved search agent performance on deep search benchmarks. The accompanying tool, FORT-Searcher, outperforms comparable agents on challenging tasks. Code is available on GitHub.
The paper introduces EvoArena, a benchmark that simulates real-world dynamic changes for LLM agents, and EvoMem, a memory paradigm that models progressive updates and structured memory evolution. Current LLM agents struggle with EvoArena's evolving tasks. EvoMem consistently improves agent performance on EvoArena and also raises accuracy on existing benchmarks such as GAIA and LoCoMo. By recording memory evolution and update histories, EvoMem enables better reasoning about environmental shifts. The work demonstrates the importance of incorporating evolution modeling into both evaluation and memory for effective agent deployment.
The paper presents Claw-SWE-Bench, a benchmark designed to standardize evaluation of OpenClaw-style coding agent harnesses. It provides 350 GitHub issue-resolution instances spanning various programming languages and repositories, along with a Lite version for rapid validation. An adapter protocol is introduced to decouple agent logic from harness execution, and experiments show that adapter choice significantly impacts agent performance. The results highlight the critical role of harness design and cost in fair comparisons, offering a reproducible and cost-effective reference set for coding-agent evaluation.
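To make the idea of decoupling agent logic from harness execution concrete, here is a minimal Python sketch of what an adapter boundary might look like. It is an illustration only, not the Claw-SWE-Bench protocol: the names `Issue`, `AgentAdapter`, `EchoAdapter`, and `run_harness` are all hypothetical, and the real protocol may differ in every detail.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Issue:
    """Hypothetical stand-in for one issue-resolution instance."""
    repo: str
    description: str


class AgentAdapter(Protocol):
    """Hypothetical adapter interface: the harness only sees this method."""

    def solve(self, issue: Issue) -> str:
        """Return a patch (as text) proposed for the given issue."""
        ...


class EchoAdapter:
    """Toy adapter that returns a placeholder patch instead of calling an agent."""

    def solve(self, issue: Issue) -> str:
        return f"# TODO: fix {issue.repo}: {issue.description}\n"


def run_harness(adapter: AgentAdapter, issues: list[Issue]) -> dict[str, str]:
    """Run every issue through the adapter; the harness never touches agent internals."""
    return {issue.repo: adapter.solve(issue) for issue in issues}


if __name__ == "__main__":
    patches = run_harness(EchoAdapter(), [Issue("org/repo", "crash on empty input")])
    print(patches["org/repo"])
```

Because the harness depends only on the `solve` method, swapping one adapter for another changes nothing else in the run, which is what makes it possible to measure the effect of adapter choice in isolation.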