The Topological Trouble With Transformers
Summary
Google DeepMind's paper "The Topological Trouble With Transformers" argues that the Transformer architecture has a structural flaw in state tracking: as a sequence grows, updates to the model's internal state are pushed into deeper layers of the network, where later processing steps cannot access them, and the result is logical contradiction. The paper illustrates the defect with failures in a number-guessing game and in a "bank" ambiguity test, in which models give contradictory answers even though they disambiguated the word earlier. Chain-of-thought prompting mitigates the problem by externalizing hidden state as visible text, but it is computationally expensive and leaves the underlying architectural limitation in place. The authors advocate shifting attention to recurrent architectures that explicitly pass state along the sequence dimension, such as Mamba, RWKV-7, and DeltaNet, and they suggest two future directions: coarser-grained recurrence, and staged training that moves from feedforward pretraining to recurrent fine-tuning.
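The contrast between depth-bound and sequence-carried state can be made concrete with a toy sketch. The code below is not from the paper. It is a deliberately crude caricature that assumes each dependent state update consumes one layer, so a stack of `num_layers` layers can apply at most `num_layers` updates before its tracked state goes stale. The function names, the modular-counter task, and the `modulus` parameter are all illustrative assumptions.

```python
from typing import List


def recurrent_track(updates: List[int], modulus: int = 5) -> List[int]:
    """Carry one explicit state along the sequence: s_t = (s_{t-1} + u_t) mod m."""
    state = 0
    history = []
    for u in updates:
        state = (state + u) % modulus  # state is handed to the next step directly
        history.append(state)
    return history


def depth_limited_track(updates: List[int], num_layers: int, modulus: int = 5) -> List[int]:
    """Toy caricature (an assumption, not the paper's model): every dependent
    update uses up one layer, so only the first `num_layers` updates are applied.
    After that the tracked state stops changing and becomes stale."""
    state = 0
    history = []
    for t, u in enumerate(updates):
        if t < num_layers:  # depth budget still available
            state = (state + u) % modulus
        history.append(state)  # beyond the budget, the old state is reported
    return history


if __name__ == "__main__":
    updates = [1, 2, 3, 4, 1, 2]
    print(recurrent_track(updates))         # [1, 3, 1, 0, 1, 3]
    print(depth_limited_track(updates, 3))  # [1, 3, 1, 1, 1, 1]
```

In this toy, the recurrent tracker stays correct for any sequence length, while the depth-limited one agrees with it only while the number of updates fits within the layer budget.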
Key Points
The paper identifies a fundamental topological limitation of Transformers: state tracking fails because each updated internal state ends up buried in deep layers, where later processing steps cannot reach it.
In the paper's experiments, models such as Gemini 3 contradict themselves in a guess-the-number game and when resolving word-sense ambiguity, even though they had already disambiguated the word correctly internally.
Chain-of-thought only works around the flaw: it writes hidden state out as text and then reads it back, which is costly and treats the symptom rather than the structure.
The paper advocates recurrent architectures that pass state along the sequence direction (Mamba, RWKV-7, DeltaNet), carrying it explicitly from step to step so that it can be tracked indefinitely.
Future work includes coarser-grained recurrence (for example, at the sentence level), using residual alignment to make training cheaper, and staged training in which feedforward pretraining is followed by recurrent fine-tuning.
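The chain-of-thought workaround described in the key points above, writing state out as text and reading it back, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's experiment: the `cot_track` function, the `state=`/`update=` transcript format, and the modular-counter task are all hypothetical.

```python
from typing import List


def cot_track(updates: List[int], modulus: int = 5) -> List[str]:
    """Toy chain-of-thought: after each update, the running state is written
    into the visible transcript, and the next step reads it back from there."""
    transcript = ["state=0"]
    for u in updates:
        # Read the previous state back from the last emitted text line.
        prev = int(transcript[-1].split("=")[1])
        transcript.append(f"update={u}")
        # Externalize the new state as text so the next step can see it.
        transcript.append(f"state={(prev + u) % modulus}")
    return transcript


if __name__ == "__main__":
    for line in cot_track([1, 2, 3]):
        print(line)
```

The state never has to live in a hidden layer, but the transcript grows by an extra line for every update, which is one way to see why the summary calls this approach costly.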