The paper proposes VERITAS, a generator-verifier framework for generalist robot policies. It pairs a pre-trained robot policy (generator) with a gradient-free visual verifier that evaluates actions at inference time, enabling policy steering without additional training. Verified rollouts are then used as supervision for offline fine-tuning, yielding consistent performance gains. The approach matches the efficiency of expert demonstrations but requires no human intervention, highlighting inference-time verification as a scalable mechanism for self-improvement in real-world deployment.
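One simple way to picture inference-time verification is best-of-N selection: draw several candidate actions from the generator, score each with the verifier, and execute the highest-scoring one. The sketch below illustrates that pattern only; the summary does not state which steering rule VERITAS uses. The names `steer_with_verifier`, `keep_verified`, `sample_action`, `score_action`, and `threshold` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Action = Tuple[float, ...]
T = TypeVar("T")


def steer_with_verifier(
    sample_action: Callable[[], Action],     # hypothetical: one draw from the pre-trained policy
    score_action: Callable[[Action], float], # hypothetical: verifier score, higher is better
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Best-of-N steering (an assumed rule, not necessarily the paper's)."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample_action() for _ in range(num_candidates)]
    scores = [score_action(a) for a in candidates]
    # Pick the index of the highest verifier score; no gradients are involved.
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]


def keep_verified(rollouts: Sequence[T], scores: Sequence[float], threshold: float) -> List[T]:
    """Keep rollouts whose verifier score clears a threshold, e.g. as fine-tuning data."""
    return [r for r, s in zip(rollouts, scores) if s >= threshold]
```

Under these assumptions, the rollouts returned by `keep_verified` would play the role of the verified supervision used for offline fine-tuning.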
This paper proposes the > <former, a transformer architecture with wider early and late layers and narrower middle layers, using a parameter-free residual resizing mechanism. Across decoder-only language models from 200M to 2B dense parameters and 3B MoE parameters, > <former consistently outperforms uniform-width baselines on language modeling loss. Under loss-matched scaling, the architecture reduces overall FLOPs by 22% and KV cache memory and I/O cost by 15%. Analysis reveals the bottleneck structure produces qualitatively different representations in residual streams, demonstrating that nonuniform width allocation enables more resource-optimal scaling.
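To make the wide-narrow-wide layout concrete, here is a minimal sketch of a per-layer width schedule together with one possible parameter-free way to change the residual width between layers. Both parts are assumptions for illustration: the summary does not give the actual schedule or the resizing rule, so the linear taper and the truncate/zero-pad resizing below are stand-ins, and the names `hourglass_widths` and `resize_residual` are hypothetical.

```python
from typing import List


def hourglass_widths(num_layers: int, outer: int, inner: int) -> List[int]:
    """Widths equal to `outer` at both ends, tapering linearly to `inner` mid-stack.

    The linear taper is an assumption; any wide-narrow-wide schedule would fit.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        # frac is 1.0 at the first and last layer and 0.0 at the center.
        frac = abs(i - mid) / mid if mid > 0 else 1.0
        widths.append(round(inner + frac * (outer - inner)))
    return widths


def resize_residual(x: List[float], width: int) -> List[float]:
    """Resize a residual vector with no learned parameters.

    Assumed rule: truncate when narrowing, zero-pad when widening.
    """
    if width < 0:
        raise ValueError("width must be non-negative")
    return x[:width] + [0.0] * max(0, width - len(x))
```

For example, `hourglass_widths(5, 8, 4)` produces a five-layer schedule that starts and ends at width 8 and narrows to 4 at the center.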
The paper introduces ReproRepo, a scalable framework for evaluating LLM agents on research reproducibility by using human-raised GitHub issues as naturally occurring supervision. It is instantiated on 1,149 recent machine learning papers from major conferences and tests four frontier model-agent configurations. The best configuration, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for approximately 90% of the papers. Analysis shows agents are particularly effective at identifying visible failures and locating the correct semantic region, though exact bug localization remains a weakness. The code is publicly released.
This paper studies three complexity measures for binary concept classes in learning theory: sign rank, Z₂-index, and list replicability number. It proves that the Z₂-index is bounded above by a linear function of the list replicability number, establishing list replicability as the stronger of the two lower bounds. Using this relationship, the authors obtain a strong separation between sign rank and Z₂-index, resolving a question of Frick, Hosseini, and Vasileuski. Additionally, upper bounds on list replicability are given via the combinatorial measures height and minimum star number, and a composition theorem shows the list replicability number of a product class is at most the sum of the individual numbers.
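The two structural results described above can be written schematically as follows. The notation is assumed for illustration: $\operatorname{ind}_{\mathbb{Z}_2}$ for the Z₂-index, $\mathrm{LR}$ for the list replicability number, and $a, b$ for unspecified constants independent of the class; the summary does not give the constants or the exact definition of the product class.

```latex
% Schematic statements only; a, b are unspecified constants (assumed notation).
\[
  \operatorname{ind}_{\mathbb{Z}_2}(\mathcal{C}) \;\le\; a \cdot \mathrm{LR}(\mathcal{C}) + b
  \qquad \text{for every binary concept class } \mathcal{C},
\]
\[
  \mathrm{LR}(\mathcal{C}_1 \times \mathcal{C}_2) \;\le\; \mathrm{LR}(\mathcal{C}_1) + \mathrm{LR}(\mathcal{C}_2).
\]
```

Read this way, any lower bound obtained from the Z₂-index transfers, up to the linear factor, to a lower bound on the list replicability number.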
EvolveNav is a self-evolving zero-shot object-goal navigation framework that continuously improves during test time by extracting actionable rules from past trajectories into an agentic rule memory. A retrieval strategy based on upper confidence bound selects effective rules by balancing semantic relevance with historical success. A memory-guided preflection module forecasts potential outcomes before action, reducing inefficient exploration. In experiments, EvolveNav outperforms existing zero-shot baselines, improving the success rate by 10.1% while taking fewer unnecessary steps.
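The retrieval idea can be sketched with a standard upper-confidence-bound score that adds a rule's semantic relevance, its empirical success rate, and an exploration bonus for rarely used rules. This is a minimal sketch under stated assumptions, not EvolveNav's actual scoring function: the additive combination, the constant `c`, and the names `Rule`, `ucb_score`, and `retrieve` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float  # assumed: precomputed similarity to the current query, in [0, 1]
    uses: int = 0     # how many times the rule has been retrieved so far
    successes: int = 0  # how many of those uses ended in a successful episode


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Relevance + empirical success rate + UCB-style exploration bonus (assumed form)."""
    mean_success = rule.successes / rule.uses if rule.uses else 0.0
    # The +1 terms keep the bonus finite for rules that have never been used.
    bonus = c * math.sqrt(math.log(total_uses + 1) / (rule.uses + 1))
    return rule.relevance + mean_success + bonus


def retrieve(rules: List[Rule], k: int = 1, c: float = 1.0) -> List[Rule]:
    """Return the k rules with the highest UCB score."""
    total = sum(r.uses for r in rules)
    return sorted(rules, key=lambda r: ucb_score(r, total, c), reverse=True)[:k]
```

With `c = 0` the ranking depends only on relevance and past success; larger values of `c` favor rules that have been tried less often.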
The paper proposes AdaVoMP, a method that predicts dense, spatially varying mechanical properties (Young's modulus, Poisson's ratio, and density) for input 3D objects across representations. It introduces a sparse adaptive voxel structure (SAV) to represent shape and material fields efficiently, and a novel sparse transformer encoder-decoder that autoregressively generates a unique SAV for each input. AdaVoMP reaches a resolution 16³ times higher than VoMP, the prior state of the art, with better accuracy even at lower test-time compute. This allows high-resolution, complex 3D objects to be converted into simulation-ready assets for realistic deformable simulations.