View publication

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
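To illustrate how a law of this kind can guide compute allocation, the minimal sketch below assumes a hypothetical Chinchilla-style functional form for the teacher and student cross-entropies; the exponents, constants, and the function shapes are placeholder assumptions for illustration, not the fitted law from the paper. It grid-searches the split of a fixed compute budget between teacher and student to minimize the predicted student cross-entropy.

```python
import numpy as np

# Illustrative sketch only: every constant and functional form below is an
# assumption, not the paper's fitted distillation scaling law.

def teacher_cross_entropy(compute_teacher):
    """Hypothetical Chinchilla-style teacher loss as a function of teacher compute."""
    return 1.7 + 25.0 * compute_teacher ** -0.15

def student_cross_entropy(compute_student, teacher_loss):
    """Hypothetical distilled-student loss: improves with student compute and with a
    stronger (lower-loss) teacher. A real fitted law also captures the capacity gap
    and students that outperform their teacher."""
    return teacher_loss + 30.0 * compute_student ** -0.18

def compute_optimal_split(total_compute, n_grid=1000):
    """Grid-search the fraction of a fixed budget allocated to the teacher."""
    teacher_frac = np.linspace(0.05, 0.95, n_grid)
    c_teacher = teacher_frac * total_compute
    c_student = (1.0 - teacher_frac) * total_compute
    predicted = student_cross_entropy(c_student, teacher_cross_entropy(c_teacher))
    best = int(np.argmin(predicted))
    return teacher_frac[best], predicted[best]

frac, loss = compute_optimal_split(total_compute=1e21)  # FLOPs, illustrative budget
print(f"teacher share: {frac:.2f}, predicted student cross-entropy: {loss:.3f}")
```

The trade-off this sketch exposes is the one the "teacher needs training" scenario hinges on: compute spent on the teacher lowers the teacher's cross-entropy but leaves less for the student, and a scaling law makes that allocation decision explicit rather than a guess.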

Figure 1: The distillation scaling law is fitted to students with high cross-entropy for a range of teachers whose cross-entropies are shown on the x-axis. Solid lines represent predicted model behavior for unseen teachers at a given student configuration (interpolation), and dashed lines represent predicted model behavior beyond the seen teachers and for low cross-entropy students (extrapolation). The diagonal black dashed line indicates where student and teacher cross-entropies are equal. Teachers with lower cross-entropy generally produce students with lower cross-entropy, until the capacity gap is reached. As shown, a student can also outperform its teacher.

Figure 1: Student cross-entropy predicted by our distillation scaling law compared with the achieved student cross-entropy. Prediction error is at most ~1%.

Related readings and updates.

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a…
Read more
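To make the memory claim in the excerpt above concrete, the back-of-the-envelope calculation below sizes the materialized logit matrix, which has one entry per (input token, vocabulary item) pair. The batch size, sequence length, vocabulary size, and fp32 logits are assumed values for illustration, not settings from the paper.

```python
# One logit per (input token, vocabulary item) pair.
batch_size = 8          # assumed
seq_len = 4096          # assumed
vocab_size = 128_000    # assumed
bytes_per_logit = 4     # fp32

num_tokens = batch_size * seq_len
logit_bytes = num_tokens * vocab_size * bytes_per_logit
print(f"logit matrix: {logit_bytes / 2**30:.1f} GiB")  # ~15.6 GiB for these values
```

For a small model whose weights, gradients, and optimizer state fit in a few gigabytes, a matrix of this size dominates training memory, which is the imbalance the excerpt describes.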
Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method that bridges this gap by (1) modeling student and teacher output…
Read more