Student Proposes Silia: A Parameter-Efficient Transformer That Fuses Attention and Feed-Forward Layers
English summary
A student from India has published their first paper, introducing Silia, a novel transformer architecture designed for tiny models under 5 million parameters. Silia replaces the static linear matrices in the Feed-Forward Network (FFN) with an attention mechanism, unifying dynamic information mixing and strong non-linearity in a single operation to save parameters. In experiments, a 0.8M-parameter Silia model matched the loss of a comparably trained GPT-2 (nanoGPT) baseline while using fewer parameters. Training was severely limited by old hardware (3-4 days for a 4M model on a personal PC), so the paper presents only preliminary findings at the sub-10M-parameter scale. The author treats the work as an introduction of the idea, not a final conclusion. The code is mentioned but has not yet been publicly released.
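To make the parameter-saving argument concrete, the sketch below counts the weight-matrix parameters in one standard GPT-2-style block. It does not reproduce Silia's layer, whose exact design is not described in this summary. It only shows where the parameters of a conventional block sit, and therefore how much is at stake when the static FFN matrices are replaced. The function names, the width `d_model = 128`, and the 4x FFN expansion are illustrative assumptions, not figures from the paper. Biases and LayerNorm parameters are ignored.

```python
def standard_block_params(d_model: int, ffn_mult: int = 4) -> dict:
    """Count weight-matrix parameters in one GPT-2-style block.

    Assumption: biases and LayerNorm parameters are ignored, and the FFN
    uses the conventional expansion factor `ffn_mult` (4 in GPT-2).
    """
    # Attention: query, key, value, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # FFN: up-projection (d_model x ffn_mult*d_model) plus down-projection.
    ffn = 2 * ffn_mult * d_model * d_model
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}


def ffn_share(d_model: int, ffn_mult: int = 4) -> float:
    """Fraction of a block's weight parameters held by the static FFN matrices."""
    counts = standard_block_params(d_model, ffn_mult)
    return counts["ffn"] / counts["total"]


# Illustrative width only; not a configuration taken from the paper.
counts = standard_block_params(128)
print(counts)
print(f"FFN share of block parameters: {ffn_share(128):.2%}")
```

With a 4x expansion, the two FFN matrices hold two thirds of a block's weight parameters regardless of width, which is why the FFN is a natural target for a parameter-efficient design. How much of that Silia actually saves depends on the size of its replacement mechanism, which this sketch does not model.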
Key points
Introduces Silia, a transformer variant that fuses attention and FFN by replacing static FFN linear layers with attention, saving parameters.
Targets the underexplored domain of tiny models (≤5M parameters) and reports that a 0.8M-parameter Silia model reaches a loss comparable to a nanoGPT baseline while using fewer parameters.
Experiments were severely constrained by hardware: on a personal PC, the 0.8M model took 8-10 hours to train and the 4M model took 3-4 days, which limited both the scale and the number of trials.
The paper is presented as a preliminary introduction of the idea, not a final conclusion, and the code has not yet been made publicly available.