Distillation Scaling Laws
Authors: Dan Busbridge, Amitis Shidani†‡, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
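To make the fitting procedure the abstract refers to concrete, the sketch below shows how a parametric scaling law relating student cross-entropy to student size, distillation tokens, and teacher cross-entropy could be fit with scipy.optimize.curve_fit. The functional form (a Chinchilla-style ansatz with an added teacher term), the variable names N_S, D_S, and L_T, and all numbers are hypothetical placeholders for illustration; they are not the law or coefficients reported in the paper.

```python
# Illustrative sketch only: fitting a hypothetical parametric scaling law for
# student cross-entropy. The ansatz and constants are placeholders, not the
# paper's distillation scaling law.
import numpy as np
from scipy.optimize import curve_fit

def student_loss(X, E, A, alpha, B, beta, c):
    """Hypothetical ansatz: irreducible term, student-capacity term,
    distillation-data term, plus a linear teacher cross-entropy term."""
    N_S, D_S, L_T = X  # student params, distillation tokens, teacher cross-entropy
    return E + A / N_S**alpha + B / D_S**beta + c * L_T

# Synthetic data standing in for measured (teacher, student) training runs.
rng = np.random.default_rng(0)
N_S = 10 ** rng.uniform(8, 10, size=200)      # student parameter counts
D_S = 10 ** rng.uniform(9, 12, size=200)      # distillation token counts
L_T = rng.uniform(2.0, 3.5, size=200)         # teacher cross-entropies
L_S = student_loss((N_S, D_S, L_T), 1.7, 4e2, 0.34, 3e1, 0.28, 0.05)
L_S = L_S + rng.normal(scale=0.01, size=200)  # observation noise

# Fit the six free parameters to the (synthetic) observed student losses.
popt, _ = curve_fit(student_loss, (N_S, D_S, L_T), L_S,
                    p0=[1.5, 1e2, 0.3, 1e1, 0.3, 0.1], maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta", "c"], popt)))
```

Once such a law is fit, compute-optimal teacher/student allocation can be read off by minimizing the predicted student loss subject to a total-compute constraint; the minimization step is omitted here.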
Figure 1: The distillation scaling law fitted to students with high cross-entropy, for teachers spanning a range of cross-entropies (x-axis). Solid lines show predicted behavior for unseen teachers at a given student configuration (interpolation); dashed lines show predicted behavior beyond the seen teachers and for low-cross-entropy students (extrapolation). The diagonal black dashed line marks where student and teacher cross-entropies are equal. Teachers with lower cross-entropy generally produce students with lower cross-entropy, until the capacity gap is reached. As shown, a student can also outperform its teacher.
Figure 1: Student cross-entropy predicted by our distillation scaling law compared with the achieved student cross-entropy. Prediction error is at most ~1%.