Social
Source: Reddit (LocalLLaMA)
June 11, 2026
Importance: 4/5

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Summary

The paper proposes Lookahead Sparse Attention (LSA), an inference paradigm built on a Neural Memory Indexer integrated with the DeepSeek-V4 architecture. Instead of retaining the full KV cache, LSA predicts which context future queries will need and keeps only the query-critical KV chunks in GPU memory. The indexer is trained independently of the backbone with a decoupled, backbone-free dual-encoder retrieval framework, so the full backbone model never has to be loaded during indexer training. Across LongBench-v2, LongMemEval, and RULER, FlashMemory-DeepSeek-V4 (FM-DS-V4) compresses the physical KV cache to 13.5% of the full-context baseline while raising average accuracy by 0.6 percentage points (absolute). At the extreme 500K-token scale, it reduces physical KV cache overhead by over 90% without degrading the backbone's core reasoning. Code and weights are publicly released on GitHub and HuggingFace.
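
To make the 13.5% figure concrete, here is a back-of-the-envelope sketch of KV cache size for a generic dense-attention transformer. The layer count, KV head count, head dimension, and fp16 storage are hypothetical placeholders chosen for illustration. They are not DeepSeek-V4's actual configuration, and the summary does not describe its cache layout, so treat the absolute numbers as illustrative only.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for a generic dense-attention transformer."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical model shape (an assumption, not taken from the paper).
full = kv_cache_bytes(tokens=500_000, layers=32, kv_heads=8, head_dim=128)

# Reported retained fraction on the long-context benchmarks: 13.5% of full.
retained = full * 0.135

print(f"full cache:     {full / 1e9:.1f} GB")
print(f"retained cache: {retained / 1e9:.1f} GB")
print(f"saved:          {1 - retained / full:.1%}")
```

Under these assumed dimensions the saving is 86.5% by construction. The point of the sketch is only that cache size grows linearly with token count, which is why dropping most chunks matters at 500K tokens.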

Key points

Introduces Lookahead Sparse Attention (LSA) with a Neural Memory Indexer on DeepSeek-V4, predicting future context demands instead of attending to all past tokens.
Uses backbone-free dual-encoder training, so the indexer can be trained independently without loading the massive backbone model into GPU memory.
Compresses the KV cache to 13.5% of the full-context baseline (an 86.5% reduction) on long-context benchmarks while slightly improving average accuracy (+0.6 percentage points).
At a 500K-token context length, reduces physical KV cache overhead by over 90% without harming core reasoning performance.
Code, model weights, and paper are publicly available on GitHub, HuggingFace, and arXiv.
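
The chunk-retention idea in the key points can be illustrated with a minimal sketch of dual-encoder-style selection: score each KV chunk against a query embedding and keep only the top fraction. This is a generic retrieval sketch, not the paper's indexer. The function name `select_chunks`, the toy embeddings, and the dot-product scoring are all assumptions, and the summary does not say how LSA predicts future queries, so the "lookahead" step is not modelled here.

```python
import math


def select_chunks(query_emb, chunk_embs, keep_fraction=0.135):
    """Return indices of the highest-scoring chunks, in original order."""
    # Dual-encoder style scoring: one dot product per (query, chunk) pair.
    scores = [sum(q * c for q, c in zip(query_emb, emb)) for emb in chunk_embs]
    # Round the budget up and keep at least one chunk.
    k = max(1, math.ceil(keep_fraction * len(chunk_embs)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore positional order so retained chunks stay in sequence.
    return sorted(top)


# Toy data: 8 chunks with hypothetical 2-dimensional embeddings.
chunk_embs = [
    [1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0],
    [0.0, -1.0], [0.5, 0.5], [0.2, 0.1], [-0.3, 0.8],
]
query_emb = [1.0, 0.0]

kept = select_chunks(query_emb, chunk_embs, keep_fraction=0.25)
print(kept)
```

In a real system the embeddings would come from trained query and chunk encoders, and only the KV entries of the kept chunks would remain in GPU memory.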
