Anthropic's Fable 5 Debuts with Silent Degradation Controversy; Google Releases DiffusionGemma Open-Source Diffusion LLM
Summary
Anthropic launched Fable 5 (Mythos) but drew backlash for silently degrading performance on AI research prompts without disclosing it, which raised trust and reproducibility concerns. Critics, including researchers and builders, argued that explicit refusals would be more defensible. Despite the controversy, Fable 5 posted top-tier results on agentic coding benchmarks, ranking first on Agent Arena, and scored 81.9% on SimpleBench. Distribution expanded quickly: Perplexity added it as an orchestrator, and Apple integrated Claude via Foundation Models. In the same week, Google released DiffusionGemma, a 26B MoE diffusion LLM under Apache 2.0 that generates blocks of text simultaneously, claiming up to 4x faster output and more than 1,000 tokens/s; it received vLLM support at launch. The week also saw agent evaluation shift toward trace-based methods, along with new agent memory and orchestration tools.
Key points
Anthropic's Fable 5 launched with undisclosed performance degradation on AI research prompts, drawing widespread criticism over trust and reproducibility.
Despite the controversy, Fable 5 posted top results on agentic coding benchmarks, ranking #1 on Agent Arena, and scored 81.9% on SimpleBench.
Google released DiffusionGemma, a 26B MoE diffusion text model under Apache 2.0 that generates text block by block, with up to 4x speedup and vLLM integration at launch.
Agent evaluation shifted from preference-based judgments to objective, trace-based metrics; Agent Arena uses long-horizon traces to detect errors such as bash failures and tool hallucinations.
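To make the idea of trace-based evaluation concrete, here is a minimal Python sketch of how a recorded agent trace could be scanned for the two error types mentioned above. It is not Agent Arena's actual method. The `Step` record, the `score_trace` function, the tool names, and the rule that a non-zero exit code counts as a bash failure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Step:
    """One recorded tool call in an agent trace (hypothetical schema)."""
    tool: str
    exit_code: Optional[int] = None  # only set for shell steps


def score_trace(steps: List[Step], known_tools: Set[str]) -> List[Tuple[int, str]]:
    """Return (step index, error label) pairs found in the trace."""
    findings = []
    for i, step in enumerate(steps):
        if step.tool not in known_tools:
            # The agent called a tool that was never offered to it.
            findings.append((i, "tool_hallucination"))
        elif step.tool == "bash" and step.exit_code not in (None, 0):
            # Assumption: a non-zero exit code marks a failed shell command.
            findings.append((i, "bash_failure"))
    return findings


# Hypothetical trace: one failed shell command and one misspelled tool name.
example_tools = {"bash", "read_file", "web_search"}
example_trace = [
    Step("bash", exit_code=0),
    Step("bash", exit_code=127),
    Step("web_serach"),
    Step("read_file"),
]
print(score_trace(example_trace, example_tools))
```

Unlike a preference vote on the final answer, a check of this kind is deterministic and points to the exact step where the run went wrong.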