State Spaces Aren’t Enough: Machine Translation Needs Attention
In collaboration with University of Amsterdam
Authors: Ali Vardasbi*, Telmo Pessoa Pires*, Robin M. Schmidt, Stephan Peitz
* = Equal Contributors
Structured State Spaces for Sequences (S4) is a recently proposed sequence model with successful applications in various tasks, e.g., vision, language modeling, and audio. Thanks to its mathematical formulation, it compresses its input to a single hidden state and is able to capture long-range dependencies while avoiding the need for an attention mechanism. In this work, we apply S4 to Machine Translation (MT) and evaluate several encoder-decoder variants on WMT'14 and WMT'16. In contrast with its success in language modeling, we find that S4 lags behind the Transformer by approximately 4 BLEU points and, counter-intuitively, struggles with long sentences. Finally, we show that this gap is caused by S4's inability to summarize the full source sentence in a single hidden state, and that the gap can be closed by introducing an attention mechanism.
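To make the "single hidden state" point concrete, below is a minimal NumPy sketch of a discretized linear state-space recurrence, the core mechanism underlying S4. This is an illustrative toy, not the paper's actual S4 implementation (which uses a structured, HiPPO-initialized state matrix and an equivalent convolutional form for training); the function and parameter names here are hypothetical. The key property it illustrates is that the entire input prefix must be compressed into one fixed-size state vector, which is exactly what the abstract argues breaks down for translation without attention.

```python
import numpy as np

def ssm_recurrence(u, A, B, C, D):
    """Run a (already discretized) linear state-space model over a 1-D input.

        x_k = A @ x_{k-1} + B * u_k   # single hidden state carries all history
        y_k = C @ x_k     + D * u_k
    """
    x = np.zeros(A.shape[0])          # the single fixed-size hidden state
    ys = []
    for u_k in u:                     # O(L) sequential scan, no attention
        x = A @ x + B * u_k           # compress the entire prefix into x
        ys.append(C @ x + D * u_k)
    return np.array(ys)

# Toy usage with random parameters (state size N=4, sequence length L=10).
rng = np.random.default_rng(0)
N, L = 4, 10
A = rng.standard_normal((N, N)) * 0.1
B, C = rng.standard_normal(N), rng.standard_normal(N)
D = rng.standard_normal()
y = ssm_recurrence(rng.standard_normal(L), A, B, C, D)
print(y.shape)  # (10,)
```

However long the input is, the decoder in a pure state-space encoder-decoder only ever sees the final `x`, whereas an attention mechanism lets every decoding step look back at all encoder states.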