Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
Authors: Ilia Mahrooghi†, Aryo Lotfi, Emmanuel Abbe†
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy in which a teacher model predicts each question’s difficulty for the student model. The teacher selects questions of appropriate difficulty, i.e., questions that are neither too easy nor too hard (the Goldilocks principle), while the student is trained with GRPO. By leveraging the student’s performance on seen samples, the teacher continuously adapts to the student’s evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
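To make the sampling loop concrete, below is a minimal sketch of a Goldilocks-style difficulty-aware sampler wrapped around GRPO training. All names here (DifficultyTeacher, grpo_step, the low/high difficulty band) are illustrative assumptions for exposition, not the authors' implementation; in the paper the teacher is itself a model that predicts difficulty, whereas this sketch simply tracks observed student success rates.

```python
import random

# Hypothetical sketch of Goldilocks-style data sampling around GRPO.
# Names and thresholds are illustrative assumptions, not the paper's method.

class DifficultyTeacher:
    """Tracks an estimate of each question's difficulty for the current student."""

    def __init__(self, questions, prior=0.5):
        # difficulty[q] ~ estimated probability that the student fails question q
        self.difficulty = {q: prior for q in questions}

    def update(self, question, student_success_rate):
        # Adapt to the student's evolving ability using observed outcomes,
        # e.g., the fraction of correct rollouts in the last GRPO group.
        self.difficulty[question] = 1.0 - student_success_rate

    def sample_batch(self, batch_size, low=0.2, high=0.8):
        # "Goldilocks" band: skip questions that currently look too easy
        # (difficulty < low) or too hard (difficulty > high) for the student.
        candidates = [q for q, d in self.difficulty.items() if low <= d <= high]
        if len(candidates) < batch_size:
            candidates = list(self.difficulty)  # fall back to the full pool
        return random.sample(candidates, batch_size)


def train(student, questions, grpo_step, num_steps=1000, batch_size=32):
    teacher = DifficultyTeacher(questions)
    for _ in range(num_steps):
        batch = teacher.sample_batch(batch_size)
        # grpo_step is assumed to run one GRPO update on the batch and return,
        # per question, the fraction of sampled completions that earned the
        # (sparse) reward.
        success_rates = grpo_step(student, batch)
        for q, rate in zip(batch, success_rates):
            teacher.update(q, rate)
    return student
```

The intent of the band is that questions the student already solves reliably contribute little gradient signal under sparse rewards, while questions it never solves yield no reward at all; concentrating compute in between is what the Goldilocks principle refers to.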
Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
October 8, 2025 · Research areas: Computer Vision, Methods and Algorithms · Conference: ICLR
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable, off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a…
Distillation Scaling Laws
July 1, 2025 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Conference: ICML
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a…