The Slingshot Effect: A Late-Stage Optimization Anomaly in Adam-Family of Optimization Methods
Authors: Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua M. Susskind
Adaptive gradient methods, notably Adam, have become indispensable for optimizing neural networks, particularly in conjunction with Transformers. In this paper, we present a novel optimization anomaly, the Slingshot Effect, which manifests during extremely late stages of training. The phenomenon is characterized by cyclic phase transitions between stable and unstable training regimes, evidenced by cyclic behavior in the norm of the last layer's weights. Although the Slingshot Effect is easy to reproduce across a wide range of settings, it does not align with any known optimization theory, underscoring the need for in-depth examination.
Moreover, we make the noteworthy observation that Grokking occurs almost exclusively at the onset of the Slingshot Effect and is absent without it, even in the absence of explicit regularization. This finding suggests a surprising inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of its origin.
Our study sheds light on an intriguing optimization behavior that has significant implications for understanding the inner workings of adaptive gradient methods.
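The diagnostic signal described above, cyclic behavior in the norm of the last layer's weights during very long Adam training, can be monitored with a few lines of instrumentation. The sketch below is not the authors' code; the model, task, and hyperparameters are illustrative placeholders, and the paper's actual experiments use Transformers on algorithmic datasets.

```python
# Minimal sketch (assumed setup, not the paper's implementation): track the
# last-layer weight norm during long Adam training to look for Slingshot cycles.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression task and a small MLP used purely for illustration.
x = torch.randn(256, 16)
y = torch.randn(256, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
last_layer = model[-1]  # the layer whose weight norm is tracked

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

norm_history = []
for step in range(20000):  # run well past convergence: the anomaly is late-stage
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

    norm = last_layer.weight.norm().item()
    norm_history.append(norm)
    if step % 1000 == 0:
        print(f"step {step:6d}  loss {loss.item():.3e}  ||W_last|| {norm:.3f}")

# A Slingshot would appear as repeated sharp growth-and-relaxation cycles in
# norm_history rather than monotone convergence; plotting the history is the
# simplest way to check.
```

In the paper's setting, Grokking (delayed generalization on the held-out set) would be checked against the timing of these norm spikes; this sketch only records the training-side signal.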