Paper, July 2025

Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

Large-scale models are routinely trained on a mixture of different data sources. Different data mixtures yield very different downstream performances. We propose a novel architecture that can instantiate one model for each data mixture without having to re-train the model. Our architecture consists of a bank of expert weights, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input histogram. To train this architecture, we sample random histograms, instantiate the corresponding model, and backprop through one batch of data sampled from the corresponding histogram. We demonstrate the promise of our approach to quickly obtain small specialized models on several datasets.
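The training loop described in the abstract can be illustrated with a minimal NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the "model" is a linear regressor, the map from histogram to coefficients is a single linear layer (`router`), histograms are drawn from a flat Dirichlet, and gradients are written by hand. All names and sizes (`experts`, `router`, `instantiate`, `train_step`, `K`, `N_EXPERTS`, `D`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of data domains, experts, and parameters per model.
K, N_EXPERTS, D = 3, 4, 5

# Toy stand-in for the data sources: domain k has its own ground-truth regressor.
true_w = rng.normal(size=(K, D))

experts = 0.1 * rng.normal(size=(N_EXPERTS, D))  # bank of expert weights
router = 0.1 * rng.normal(size=(N_EXPERTS, K))   # histogram -> coefficients (assumed linear)


def instantiate(experts, router, h):
    """Return (theta, alpha): model weights for histogram h and the coefficients used."""
    alpha = router @ h       # one coefficient per expert
    theta = alpha @ experts  # linear combination of the expert weights
    return theta, alpha


def sample_batch(h, batch_size=32):
    """Draw a batch whose domain proportions follow the histogram h."""
    domains = rng.choice(K, size=batch_size, p=h)
    x = rng.normal(size=(batch_size, D))
    y = np.einsum("bd,bd->b", x, true_w[domains])
    return x, y


def train_step(experts, router, lr=0.01):
    """One SGD step: random histogram, instantiate, one batch, backprop. Updates in place."""
    h = rng.dirichlet(np.ones(K))  # random histogram (the Dirichlet prior is an assumption)
    theta, alpha = instantiate(experts, router, h)
    x, y = sample_batch(h)
    residual = x @ theta - y
    loss = float(np.mean(residual ** 2))
    grad_theta = 2.0 * x.T @ residual / len(y)       # gradient w.r.t. the instantiated weights
    grad_experts = np.outer(alpha, grad_theta)       # chain rule through theta = alpha @ experts
    grad_router = np.outer(experts @ grad_theta, h)  # chain rule through alpha = router @ h
    experts -= lr * grad_experts
    router -= lr * grad_router
    return loss


for _ in range(200):
    train_step(experts, router)

# After training, a specialist for a new mixture is obtained without further training.
specialist, _ = instantiate(experts, router, np.array([0.8, 0.2, 0.0]))
```

The point of the sketch is the last line: once the expert bank and the coefficient map are trained, producing a model for a given data mixture costs one matrix-vector product and one weighted sum, and the resulting model has the size of a single expert.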

Figure 1: Training pipeline for the soup-of-experts.

Related readings and updates.

Scaling Laws for Optimal Data Mixtures

September 26, 2025 | Research area: Methods and Algorithms | Conference: NeurIPS

Large foundation models are typically trained on data from multiple domains, with the data mixture—the proportion of each domain used—playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach…

No Need to Talk: Asynchronous Mixture of Language Models

April 10, 2025 | Research areas: Methods and Algorithms, Speech and Natural Language Processing | Conference: ICLR

We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction…
