OpenEvidence expressed dissatisfaction with a recent LLM benchmarking study, echoing a broader call for improved benchmarks. The author supports this view and suggests evaluating OpenEvidence on the open and transparent Medmarks benchmark suite.
In this blog post, the author benchmarks retrieval-augmented generation (RAG) pipelines against a deterministic full-scan engine across 100,000 rows for aggregation tasks. The results show that larger context windows do not improve accuracy—they actually make errors harder to detect. The author finds that computation-heavy queries must be routed away from RAG entirely, and builds a system that directs such queries to a deterministic full-scan engine to preserve accuracy.
SocialSource: XImportance: 3/5
The viral study tested medical AI products UpToDate and OpenEvidence—not underlying models—on benchmarks like MedQA and HealthBench, finding them worse than frontier general-purpose models. The author argues this does not prove domain-specific models are inherently inferior; their own comprehensive benchmark shows fine-tuning a frontier model for medicine yields a noticeable boost. Current domain-specific models often lag because they are built on older or weaker open-source base models, not because specialization fails. For example, Baichuan-M4 is cited as a medical-specific model that claims to outperform frontier models. The main takeaway is that adapting strong frontier models into medical tools quickly would produce superior domain-specific systems, but open-source base model progress and adaptation speed remain challenges.
Researchers propose FORT, a framework for synthesizing training data for deep search agents that resists shortcut learning. It identifies and mitigates four types of shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. The framework uses trajectory signatures to measure and control shortcut risks during data generation. Experiments show that FORT-generated data leads to improved search agent performance on deep search benchmarks. The accompanying tool, FORT-Searcher, outperforms comparable agents on challenging tasks. Code is available on GitHub.
The paper introduces EvoArena, a benchmark designed to simulate real-world dynamic changes for LLM agents, and EvoMem, a memory paradigm that models progressive updates and structured memory evolution. Current LLM agents show significant difficulty on EvoArena's evolving tasks. EvoMem consistently improves agent performance on EvoArena and also increases accuracy on existing benchmarks like GAIA and LoCoMo. By recording memory evolution and update histories, EvoMem enables better reasoning about environmental shifts. The work demonstrates the importance of incorporating evolution modeling into both evaluation and memory for effective agent deployment.
TutorialsSource: MARKTECHPOSTImportance: 4/5
Moonshot AI released Kimi K2.7-Code, an open-weight, coding-specialized agentic model under Modified MIT license. It is a Mixture-of-Experts architecture with 1T total parameters, 32B active per token, 384 experts with 8 selected, MLA attention, SwiGLU feed-forward, and a 400M-parameter MoonViT vision encoder. The model supports a 256K-token context window, ships with native INT4 quantization, and enforces mandatory thinking mode with fixed sampling parameters (temperature 1.0, top_p 0.95, n 1). In company-reported benchmarks, K2.7-Code achieves 62.0 on Kimi Code Bench v2 (+21.8% over K2.6), 81.1 on MCP Mark Verified (beating Claude Opus 4.8’s 76.4), and demonstrates approximately 30% lower reasoning-token usage than K2.6, reducing cost and latency in agentic workflows. The 595 GB model weights are available on Hugging Face and can be self-hosted via vLLM, SGLang, or KTransformers; API access uses the kimi-k2.7-code model name with OpenAI-compatible endpoints.