How Smooth Is Attention?
Authors: Valérie Castin, Pierre Ablin, Gabriel Peyré
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties — which are key to analyzing robustness and expressive power — is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length and of layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length n in any compact set, the Lipschitz constant of self-attention is bounded by √n up to a constant factor, and that this bound is tight for reasonable sequence lengths. When the sequence length n is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound that are independent of n. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
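As a rough illustration of the quantity studied here, the sketch below estimates the local Lipschitz constant of a single unmasked self-attention head at a point X as the spectral norm of its Jacobian. This is our own minimal example, not code from the paper: it uses PyTorch with random Gaussian weights, omits masking, layer normalization, and multiple heads, and the helper names `self_attention` and `local_lipschitz_estimate` are hypothetical.

```python
import torch

def self_attention(X, W_Q, W_K, W_V):
    """Single-head, unmasked self-attention on a sequence X of shape (n, d)."""
    scores = (X @ W_Q) @ (X @ W_K).T / W_Q.shape[1] ** 0.5
    A = torch.softmax(scores, dim=-1)  # (n, n) attention weights
    return A @ (X @ W_V)               # (n, d) output sequence

def local_lipschitz_estimate(X, W_Q, W_K, W_V):
    """Local Lipschitz constant at X, taken as the spectral norm of the
    Jacobian of self-attention, flattened to an (n*d, n*d) matrix."""
    n, d = X.shape
    J = torch.autograd.functional.jacobian(
        lambda Z: self_attention(Z, W_Q, W_K, W_V), X
    )  # shape (n, d, n, d)
    return torch.linalg.matrix_norm(J.reshape(n * d, n * d), ord=2).item()

torch.manual_seed(0)
d = 16
W_Q, W_K, W_V = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
for n in (8, 32, 128):
    X = torch.randn(n, d)  # inputs drawn from a bounded region
    print(f"n = {n:4d}  spectral-norm estimate = "
          f"{local_lipschitz_estimate(X, W_Q, W_K, W_V):.3f}")
```

Tracking this estimate as n grows, for inputs kept in a fixed bounded region, is one simple way to compare empirical behavior against the √n upper bound discussed in the abstract.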