View publication

This paper was accepted at the Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML) Workshop at NeurIPS 2024.

Large language models (LLMs) pretrained on large corpora of internet text possess much of the world's knowledge. Following pretraining, one often needs to conduct continued pretraining on certain capabilities, such as math and coding, or "posttraining" (a.k.a., alignment) techniques to make the models follow users' instructions and align them with human preferences. One challenge during these finetuning stages is that the model can lose the pretraining knowledge or forget certain capabilities (e.g., in-context learning ability). Moreover, although there exist strong open-weight LLMs, such as Llama 3, both their pretraining and posttraining data are not open to the public, making it difficult to mix the finetuning data with the models' own pretraining data as a solution for mitigating forgetting. We propose label annealing, a method that mitigates forgetting during finetuning without requiring access to the original pretraining data. Label annealing distills pretraining knowledge during finetuning by adding a KL divergence term in the loss function, regularizing the divergence between the finetuned model's predictions and those of the initial pretrained model. In mathematics and code finetuning, label annealing improves the model's performance in target domains without sacrificing other capabilities of the pretrained model. In alignment finetuning, our method introduces a smooth tradeoff between the instruction-following capability and the pretraining knowledge. We complement our empirical investigation with a mathematical model with overparameterized linear regression that provides geometric intuition as to why label annealing would help.

Related readings and updates.

Promoting Cross-Modal Representations to Improve Multimodal Foundation Models for Physiological Signals

Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration and it is unclear which pretraining strategies are most effective…
See paper details

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to train a small task-specific model for a new target…
See paper details