Social
Source: Reddit LocalLLaMA
June 11, 2026
Importance: 2/5

Student Proposes Silia: A Parameter-Efficient Transformer That Fuses Attention and Feed-Forward Layers

English summary

A student from India has published a first paper introducing Silia, a transformer architecture designed for tiny models of under 5 million parameters. Silia replaces the static linear weight matrices of the feed-forward network (FFN) with an attention mechanism, so that dynamic information mixing and strong non-linearity happen in a single operation, which saves parameters. In the reported experiments, a 0.8M-parameter Silia model matched the loss of a comparably trained GPT-2 (nanoGPT) baseline while using significantly fewer parameters. Training was severely limited by old hardware (3-4 days for a 4M-parameter model on a personal PC), so the paper presents only preliminary findings at the sub-10M-parameter scale. The author frames the work as an introduction of the idea rather than a final conclusion. The code is mentioned but has not yet been publicly released.

Key points

Introduces Silia, a transformer variant that fuses attention and the FFN by replacing the FFN's static linear layers with attention, which saves parameters.
Targets the underexplored regime of tiny models (≤5M parameters) and reports loss comparable to nanoGPT with fewer parameters for a 0.8M-parameter model.
Experiments were severely constrained by hardware (8-10 hours to train the 0.8M model and 3-4 days for the 4M model on a personal PC), which limited the scale and number of trials.
The paper is presented as a preliminary introduction of the idea, not a final conclusion, and the code has not yet been made publicly available.
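
For context on why the FFN is a natural target for parameter savings, the following minimal sketch counts the weights in one standard GPT-2 style block, the kind of block used by the nanoGPT baseline. It does not reproduce Silia, whose exact layer layout is not described in this summary. The function name gpt2_block_params, the width d_model=128, and the 4x FFN expansion are illustrative assumptions, not values taken from the paper.

```python
def gpt2_block_params(d_model: int, ffn_mult: int = 4) -> dict:
    """Count weights and biases in one standard GPT-2 style block.

    Illustrative helper (hypothetical name); this is NOT the Silia block.
    """
    # Attention: fused QKV projection (d -> 3d) plus output projection (d -> d).
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # FFN: up-projection (d -> m*d) and down-projection (m*d -> d).
    hidden = ffn_mult * d_model
    ffn = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two LayerNorms, each with a scale vector and a bias vector.
    norms = 2 * 2 * d_model
    return {"attn": attn, "ffn": ffn, "norms": norms,
            "total": attn + ffn + norms}


# d_model=128 is an assumed width chosen only to land in the tiny-model regime.
counts = gpt2_block_params(d_model=128)
ffn_share = counts["ffn"] / counts["total"]
print(counts)
print(f"FFN share of block parameters: {ffn_share:.1%}")
```

With the usual 4x expansion, the two FFN matrices hold roughly two thirds of a block's parameters, so any design that removes or shares them has room to shrink the model. How much Silia actually saves depends on details reported in the paper itself.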
