Memory-Retaining Finetuning via Distillation
Authors: Zitong Yang, Aonan Zhang, Sam Wiseman, Xiang Kong, Ke Ye, Dong Yin
This paper was accepted at the Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML) Workshop at NeurIPS 2024.
Large language models (LLMs) pretrained on large corpora of internet text possess much of the world's knowledge. Following pretraining, one often conducts continued pretraining to strengthen certain capabilities, such as math and coding, or applies "posttraining" (a.k.a. alignment) techniques to make the models follow users' instructions and align them with human preferences. One challenge during these finetuning stages is that the model can lose pretraining knowledge or forget certain capabilities (e.g., in-context learning ability). Moreover, although strong open-weight LLMs such as Llama 3 exist, neither their pretraining nor their posttraining data is publicly available, making it difficult to mix the finetuning data with the models' own pretraining data as a way to mitigate forgetting. We propose label annealing, a method that mitigates forgetting during finetuning without requiring access to the original pretraining data. Label annealing distills pretraining knowledge during finetuning by adding a KL divergence term to the loss function, regularizing the divergence between the finetuned model's predictions and those of the initial pretrained model. In mathematics and code finetuning, label annealing improves the model's performance in the target domains without sacrificing other capabilities of the pretrained model. In alignment finetuning, our method introduces a smooth tradeoff between instruction-following capability and pretraining knowledge. We complement our empirical investigation with a mathematical model based on overparameterized linear regression that provides geometric intuition for why label annealing helps.
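For concreteness, the sketch below shows one way the described objective could be written in PyTorch: standard cross-entropy on the finetuning labels plus a KL regularizer that pulls the finetuned model's predictions toward those of the frozen pretrained model. The coefficient `lam`, the direction of the KL term, and the function name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def label_annealing_loss(student_logits, teacher_logits, labels, lam=0.5):
    """Finetuning loss with a KL regularizer toward the frozen pretrained model.

    A minimal sketch of the abstract's description; `lam` and the KL
    direction are assumptions, as the paper only states that a KL
    divergence term is added to the finetuning loss.
    """
    # Standard finetuning objective on the target-domain labels.
    ce = F.cross_entropy(student_logits, labels)
    # KL(teacher || student): penalizes drift away from the pretrained
    # model's predictive distribution over the vocabulary.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),  # frozen pretrained model
        log_target=True,
        reduction="batchmean",
    )
    return ce + lam * kl

# Usage sketch: the teacher is a frozen copy of the initial pretrained model.
# with torch.no_grad():
#     teacher_logits = pretrained_model(input_ids).logits
# loss = label_annealing_loss(model(input_ids).logits, teacher_logits, labels)
```

Under this reading, setting `lam` to zero recovers ordinary finetuning, while larger values keep the model closer to its pretrained predictions, which matches the smooth tradeoff between instruction following and pretraining knowledge reported in the abstract.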