Google DeepMind’s DiffusionGemma 26B A4B IT is an open-weights multimodal model that uses discrete diffusion to generate text from text, image, and video inputs. It is a mixture-of-experts (MoE) model with 25.2B total and 3.8B active parameters, supports a 256K context window, and achieves over 1,100 tokens per second on NVIDIA H100 GPUs. NVIDIA has quantized the model to NVFP4 precision using its Model Optimizer, and the quantized model is available on Hugging Face for commercial and non-commercial use. The model also features a configurable thinking mode, native function calling, and multilingual support across 35+ languages.
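To put the parameter counts above in perspective, the following back-of-the-envelope sketch computes the active share of the MoE and a rough weight footprint at different precisions. The helper `weight_memory_gb` is hypothetical, and the 4-bit figure assumes every weight is stored at exactly 4 bits with no scaling-factor overhead, which a real NVFP4 checkpoint is unlikely to match exactly. Activations and KV cache are ignored.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: all parameters use the same bit width and there is no
    per-block scale overhead. Activations and KV cache are not counted.
    """
    return params_billion * bits_per_weight / 8


# Figures taken from the summary above.
TOTAL_B = 25.2   # total parameters, in billions
ACTIVE_B = 3.8   # active parameters, in billions

active_share = ACTIVE_B / TOTAL_B

print(f"active share of parameters: {active_share:.1%}")
print(f"weights at 16 bits: {weight_memory_gb(TOTAL_B, 16):.1f} GB")
print(f"weights at 4 bits:  {weight_memory_gb(TOTAL_B, 4):.1f} GB")
```

Under these assumptions, roughly 15% of the parameters are active, and 4-bit storage needs about a quarter of the memory of 16-bit storage.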
DeepSeek v4 Pro achieves top coding scores: 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench. However, CAISI’s multi-domain evaluation places it roughly 8 months behind the US frontier, in contrast to DeepSeek’s own claim of a 2-month gap. The discrepancy is attributed to the difference between narrow coding benchmarks and the broader demands of cybersecurity and abstract reasoning. The frontier has also advanced, with closed models such as Fable 5 recently released. For local users, quantized versions of the model may deliver different real-world agent performance than the full 1.6T-parameter Pro configuration.
A Reddit user posted in r/LocalLLaMA asking for recommendations on the most powerful open-source AI coding model compatible with their hardware. Their system features an AMD Ryzen 7 7700 CPU, an NVIDIA RTX 5070 GPU with 12GB VRAM, and 32GB DDR5 RAM running Windows 11. The intended use cases are writing, coding, and debugging. The post is a straightforward request for model suggestions that fit these specs.
The paper proposes Lookahead Sparse Attention (LSA), a novel inference paradigm based on a Neural Memory Indexer integrated with the DeepSeek-V4 architecture. Instead of retaining the full KV cache, LSA proactively predicts future context needs and preserves only query-critical KV chunks in GPU memory. The indexer is trained independently via a backbone-free, dual-encoder retrieval framework, which avoids loading the full backbone model during training. Across LongBench-v2, LongMemEval, and RULER, FM-DS-V4 compresses the physical KV cache to 13.5% of the full-context baseline while raising average accuracy by 0.6 percentage points. At extreme 500K-token scales, it reduces physical KV cache overhead by over 90% without degrading the backbone’s core reasoning. Code and weights are publicly released on GitHub and Hugging Face.
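The general idea of keeping only query-critical KV chunks can be illustrated with a minimal sketch. This is not the paper's indexer: it assumes each KV chunk and the predicted future query have already been embedded by some dual encoder, scores chunks by dot product, and keeps the top fraction. The function name `select_kv_chunks`, the `keep_fraction` parameter, and the toy embeddings are all hypothetical.

```python
import numpy as np


def select_kv_chunks(chunk_embeddings: np.ndarray,
                     query_embedding: np.ndarray,
                     keep_fraction: float) -> np.ndarray:
    """Return sorted indices of the KV chunks to keep resident.

    Assumptions (not from the paper): chunk_embeddings has shape
    (n_chunks, dim), query_embedding has shape (dim,), and relevance
    is a plain dot product between the two.
    """
    scores = chunk_embeddings @ query_embedding
    # Always keep at least one chunk, and never more than exist.
    n_keep = min(len(scores), max(1, int(round(keep_fraction * len(scores)))))
    # argpartition places the n_keep highest-scoring indices last.
    top = np.argpartition(scores, -n_keep)[-n_keep:]
    return np.sort(top)


# Toy example: four chunks with one-hot embeddings, so each chunk's score
# equals the matching component of the query embedding.
chunks = np.eye(4)
query = np.array([0.1, 0.9, 0.2, 0.8])
kept = select_kv_chunks(chunks, query, keep_fraction=0.5)
print(kept)
```

In this toy case the two highest-scoring chunks (indices 1 and 3) are kept and the rest could be evicted from GPU memory. A real system would also need to decide when to re-score chunks as generation proceeds, which this sketch does not address.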
Lemonade v10.7 introduces local omni-modal chat that supports image generation and editing by combining multiple backends and models. Its LMX-Omni virtual models are now compatible with Open WebUI and other OpenAI clients. The release adds a lemonade bench CLI tool to collect standardized LLM performance data across llama.cpp, FastFlowLM, and vLLM. Cross-vendor support expands with CUDA backends for llama.cpp and stable-diffusion.cpp and a Vulkan backend for sd-cpp, enabling GPU acceleration on AMD, Apple Silicon, NVIDIA, and Intel systems. The project is now organized into six working groups, four led by non-AMD contributors, and this release involved 19 contributors.
Cohere has released North Mini Code, an open-source coding model with 30 billion total parameters and only 3 billion active parameters for efficient inference. It scores 33.4 on the Artificial Analysis Coding Index, making it competitive among similarly sized models. The model is designed for agentic coding tasks and is available under the Apache 2.0 license on Hugging Face under the CohereLabs organization.