One Wide Feedforward is All You Need

Authors: Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan

This paper was accepted at the WMT conference at EMNLP.

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite taking up a significant fraction of the model’s parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN from the decoder layers and sharing a single FFN across the encoder. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
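To make the parameter argument concrete, the following is a rough back-of-the-envelope sketch in Python. It assumes the dimensions commonly cited for Transformer Big (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers) and a two-layer FFN with biases; none of these numbers are stated in the abstract, and the helper names `ffn_params` and `widest_shared_d_ff` are hypothetical. It counts FFN parameters only, ignoring attention and embeddings, so the configurations actually used in the paper may differ.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of a two-layer position-wise FFN (weights plus biases)."""
    first_layer = d_model * d_ff + d_ff       # d_model -> d_ff projection
    second_layer = d_ff * d_model + d_model   # d_ff -> d_model projection
    return first_layer + second_layer


def widest_shared_d_ff(d_model: int, budget: int) -> int:
    """Largest hidden dimension whose single FFN fits within `budget` parameters."""
    # ffn_params is linear in d_ff: (2 * d_model + 1) * d_ff + d_model
    return (budget - d_model) // (2 * d_model + 1)


# Assumed Transformer Big dimensions (not taken from the abstract).
D_MODEL, D_FF = 1024, 4096
ENC_LAYERS = DEC_LAYERS = 6

per_layer = ffn_params(D_MODEL, D_FF)

# Baseline: every encoder and decoder layer owns its own FFN.
baseline = (ENC_LAYERS + DEC_LAYERS) * per_layer

# Reduced model: one FFN shared by all encoder layers, none in the decoder.
shared_only = per_layer

# Scaled-back model: widen the single shared FFN up to the baseline FFN budget.
wide_d_ff = widest_shared_d_ff(D_MODEL, baseline)

print(f"FFN parameters, baseline:       {baseline:,}")
print(f"FFN parameters, one shared FFN: {shared_only:,}")
print(f"Shared-FFN hidden dim at equal budget: {wide_d_ff:,}")
```

Under these assumptions, sharing one FFN and dropping the decoder FFNs leaves one twelfth of the original FFN parameters, which is the budget that the wider shared FFN then reuses.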

Related readings and updates.

MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers

July 2, 2026. Research area: Methods and Algorithms. Conference: ICML.

Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with interpretability of feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as context-free token-wise neural retrieval memory. In detail, we investigate…

Deploying Attention-Based Vision Transformers to Apple Neural Engine

January 5, 2024. Research areas: Computer Vision, Speech and Natural Language Processing.

Motivated by the effective implementation of transformer architectures in natural language processing, machine learning researchers introduced the concept of a vision transformer (ViT) in 2021. This innovative approach serves as an alternative to convolutional neural networks (CNNs) for computer vision applications, as detailed in the paper, An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.
