A student from India has published a first paper introducing Silia, a novel transformer architecture designed for tiny models under 5 million parameters. Silia replaces the static linear matrices in the Feed-Forward Network (FFN) with an attention mechanism, unifying dynamic information mixing and strong non-linearity into a single operation to save parameters. In experiments, a 0.8M-parameter Silia model matched the loss of a comparably trained GPT-2 (nanoGPT) baseline while using significantly fewer parameters. Training was severely limited by old hardware (3-4 days for a 4M model on a personal PC), so the paper presents only preliminary findings at the sub-10M-parameter scale. The author treats the work as an introduction of the idea, not a final conclusion, and the code is mentioned but not yet openly distributed.
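To give a feel for why swapping the FFN for attention can save parameters, here is a rough count for one transformer block. It is a sketch under stated assumptions, not Silia's actual layer: it assumes a conventional FFN with 4x expansion and assumes the replacement uses the same four projections as ordinary attention, neither of which the summary confirms. The function names and the width `d = 128` are illustrative only.

```python
def attention_params(d_model: int) -> int:
    """Weights in the Q, K, V and output projections (biases ignored)."""
    return 4 * d_model * d_model


def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Weights in a two-layer FFN: d -> expansion*d -> d (biases ignored)."""
    return 2 * expansion * d_model * d_model


d = 128  # hypothetical model width, not taken from the paper

# Conventional block: one attention layer plus one FFN.
standard_block = attention_params(d) + ffn_params(d)

# Assumed variant: the FFN slot is filled by a second attention-style layer.
attention_only_block = 2 * attention_params(d)

print(standard_block, attention_only_block)
```

Under these assumptions the per-block weight count drops from 12·d² to 8·d², a one-third reduction; the real saving depends on details the paper would have to supply.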
The paper proposes Lookahead Sparse Attention (LSA), a novel inference paradigm based on a Neural Memory Indexer integrated with the DeepSeek-V4 architecture. Instead of retaining the full KV cache, LSA proactively predicts future context needs and preserves only query-critical KV chunks in GPU memory. The indexer is trained independently via a backbone-free, dual-encoder retrieval framework, so training does not require loading the full backbone model. Across LongBench-v2, LongMemEval, and RULER, FM-DS-V4 compresses the physical KV cache to 13.5% of the full-context baseline while raising average accuracy by 0.6 percentage points. At extreme 500K-token scales, it cuts physical KV cache overhead by over 90% without degrading the backbone’s core reasoning. Code and weights are publicly released on GitHub and HuggingFace.
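The general idea of keeping only the highest-scoring KV chunks under a memory budget can be sketched as follows. This is a generic top-k selection, not the paper's indexer: the relevance scores are hard-coded placeholders standing in for whatever the Neural Memory Indexer would predict, and `select_chunks` and `keep_fraction` are hypothetical names.

```python
def select_chunks(scores: list[float], keep_fraction: float) -> list[int]:
    """Return indices of the highest-scoring chunks, in original order.

    `scores` holds one relevance score per KV chunk (assumed to come from
    some indexer); `keep_fraction` is the share of chunks that stays in
    GPU memory.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_fraction))
    # Rank chunk indices by score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept chunks stay in sequence.
    return sorted(ranked[:n_keep])


# Placeholder scores for eight chunks; keep a quarter of them.
chunk_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4]
kept = select_chunks(chunk_scores, keep_fraction=0.25)
print(kept)
```

A budget like the reported 13.5% would correspond to `keep_fraction=0.135`; how the scores are predicted ahead of the query is the part this sketch leaves out.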
SCAIL-2 is an open-source model for end-to-end controlled character animation that removes dependence on intermediate pose representations. It was trained on 60K synthetic motion pairs using several teacher models (SCAIL-Preview, Wan-Animate, MoCha) and a Unified Motion Transfer Interface. The model enables animating a reference character from a driving video, supports cross-identity character replacement and multi-character scenarios, and extends to animal-driving. Additionally, it offers zero-shot support for advanced control intermediates like SAM3D-Body mesh rendering.
The paper 'Predictable Compression Failures' (ICML 2026) addresses hallucinations in evidence-grounded QA by modeling order sensitivity as permutation dispersion and deriving an Expectation-level Decompression Law (EDFL). It defines a fixed ISR=1 answer/abstain gate that requires no threshold tuning, achieving 0.0–0.7% hallucination at ~24% abstention and 80.5% accuracy on held-out tests. The newly released ntkMirror implements this gate for local open-weight models in a training-free manner, using order-marginal verification across multiple evidence permutations. A fused kernel speeds up the permutation forward passes by 2.6–10× with bit-identical fp32 results. New hallucination detection benchmarks on Qwen2.5 and Gemma models show AUROC up to 0.96 on SciFact, and the gate raises the grounded fraction from 50% to 75–90% at the cost of dropping 10–20% of valid claims.
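Scoring the same evidence under several orderings and summarizing the spread can be illustrated with a small sketch. It is not ntkMirror's implementation and does not reproduce the ISR gate, whose definition the summary does not give. `order_marginal` and `toy_score` are hypothetical; in practice the scorer would be a model forward pass.

```python
import random
import statistics


def order_marginal(evidence, score_fn, n_perms=8, seed=0):
    """Score the evidence under random orderings.

    Returns (mean, dispersion), where dispersion is the population
    standard deviation of the scores across orderings.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perms):
        # rng.sample with k=len(...) yields one random permutation.
        order = rng.sample(evidence, k=len(evidence))
        scores.append(score_fn(order))
    return statistics.fmean(scores), statistics.pstdev(scores)


def toy_score(order):
    """Hypothetical order-sensitive scorer: prefers the key passage first."""
    return 0.9 if order[0] == "key passage" else 0.6


passages = ["key passage", "background", "distractor"]
mean_score, dispersion = order_marginal(passages, toy_score)
print(mean_score, dispersion)
```

A scorer that ignores order gives zero dispersion; a large spread indicates that the answer depends on how the evidence happened to be arranged.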
A Reddit user conducted a comprehensive benchmark comparing ByteShape and Unsloth quantized versions of the Qwen3.6-35B-A3B model on tool calling tasks. Tests included three KV cache quantizations (f16, q8_0, q4_0) and two context lengths (short, at ~5k tokens, and long, with ~122k filler tokens). Results showed no clear overall winner between the ByteShape and Unsloth quants. The q8_0 KV cache quant was virtually indistinguishable from f16, offering a free lunch, while q4_0 degraded scores slightly. Long context (50% filled context) significantly reduced tool calling performance across all configurations. The best-performing quant was ByteShape GPU-5 (IQ4_XS style), which showed resilience under long context pressure.
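The memory side of the KV cache trade-off can be estimated with simple arithmetic. The sketch below uses the ggml block layouts: f16 stores 2 bytes per element, q8_0 stores 34 bytes per block of 32 elements, and q4_0 stores 18 bytes per block of 32. The layer count, KV head count, and head dimension are placeholders, not the real configuration of Qwen3.6-35B-A3B, and the formula assumes every layer keeps a standard K and V cache.

```python
# (bytes per block, elements per block) for each cache type.
BLOCK_LAYOUT = {
    "f16": (2, 1),
    "q8_0": (34, 32),  # fp16 scale + 32 int8 values
    "q4_0": (18, 32),  # fp16 scale + 32 four-bit values
}


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, cache_type):
    """Approximate KV cache size in bytes for one sequence."""
    block_bytes, block_elems = BLOCK_LAYOUT[cache_type]
    # Factor 2 covers both the K and the V tensors.
    elems = 2 * n_tokens * n_layers * n_kv_heads * head_dim
    return elems * block_bytes // block_elems


# Placeholder model shape, chosen only for illustration.
shape = dict(n_tokens=122_000, n_layers=40, n_kv_heads=2, head_dim=128)

sizes = {t: kv_cache_bytes(cache_type=t, **shape) for t in BLOCK_LAYOUT}
for cache_type, size in sizes.items():
    print(f"{cache_type}: {size / 2**20:.0f} MiB")
```

Whatever the model shape, q8_0 takes about 53% of the f16 footprint and q4_0 about 28%, which is why a q8_0 cache that scores the same as f16 is attractive.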
This Reddit post questions why ternary LLMs like BitNet have not scaled beyond 2B parameters despite initial promise. The author wonders why frontier open-weight AI labs have not adopted ternary approaches. Comments may discuss technical limitations or lack of practical benefits. The post reflects community curiosity about the viability of ternary architectures for large-scale models.