The Slingshot Effect: A Late-Stage Optimization Anomaly in Adam-Family of Optimization Methods
Authors: Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua M. Susskind
Adaptive gradient methods, notably Adam, have become indispensable for optimizing neural networks, particularly in conjunction with Transformers. In this paper, we present a novel optimization anomaly, the Slingshot Effect, which manifests during extremely late stages of training. The phenomenon is characterized by cyclic phase transitions between stable and unstable training regimes, evidenced by cyclic behavior in the norm of the last layer's weights. Although the Slingshot Effect is easy to reproduce across a wide range of settings, it does not align with any known optimization theory, underscoring the need for in-depth examination.
Moreover, we make the noteworthy observation that Grokking occurs almost exclusively at the onset of the Slingshot Effect and is absent without it, even in the absence of explicit regularization. This finding suggests a surprising inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of its origin.
Our study sheds light on an intriguing optimization behavior that has significant implications for understanding the inner workings of adaptive gradient methods.
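The diagnostic signal described above, cyclic behavior in the norm of the last layer's weights during very long Adam training, can be monitored with a few lines of instrumentation. The sketch below is not the authors' code; the model, task, and hyperparameters are illustrative placeholders, and the paper's actual experiments use Transformers on algorithmic datasets.

```python
# Minimal sketch (assumed setup, not the paper's implementation): track the
# last-layer weight norm during long Adam training to look for Slingshot cycles.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression task and a small MLP used purely for illustration.
x = torch.randn(256, 16)
y = torch.randn(256, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
last_layer = model[-1]  # the layer whose weight norm is tracked

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

norm_history = []
for step in range(20000):  # run well past convergence: the anomaly is late-stage
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

    norm = last_layer.weight.norm().item()
    norm_history.append(norm)
    if step % 1000 == 0:
        print(f"step {step:6d}  loss {loss.item():.3e}  ||W_last|| {norm:.3f}")

# A Slingshot would appear as repeated sharp growth-and-relaxation cycles in
# norm_history rather than monotone convergence; plotting the history is the
# simplest way to check.
```

In the paper's setting, Grokking (delayed generalization on the held-out set) would be checked against the timing of these norm spikes; this sketch only records the training-side signal.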