Optimal Corpus Aware Training for Neural Machine Translation
Authors: Yi-Hsiu Liao, Cheng Shen, Brenda (Zixiaofan) Yang
Corpus Aware Training (CAT), commonly known as the “tagging” approach, leverages valuable corpus metadata during training by injecting corpus information into each training example, and has been found effective in the literature. Models trained with CAT inherently learn the differences in quality, domain, and nuance between corpora directly from data, and can easily switch between inference behaviors. To achieve the best evaluation results, however, CAT models require a group of high-quality data to be pre-defined before training starts, which can be error-prone and inefficient. In this work, we propose Optimal Corpus Aware Training (OCAT), which fine-tunes a CAT pre-trained model by freezing most of the model parameters and tuning only a small set of corpus-related parameters. We show that OCAT is lightweight, resilient to overfitting, and effective in boosting model accuracy. We use the WMT23 English-to-Chinese and English-to-German translation tasks as our test ground and show +3.6 and +1.8 chrF improvements, respectively, over vanilla training. Furthermore, our approach is on par with or slightly better than other state-of-the-art fine-tuning techniques while being less sensitive to hyperparameter settings.
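To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical tag format (`<corpus:NAME>` prepended to the source sentence) and a hypothetical parameter naming scheme in which corpus-related parameters share the prefix `corpus_embedding`; neither detail comes from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def tag_example(source: str, corpus: str) -> str:
    """Prepend a corpus tag to a source sentence (tag format is an assumption)."""
    return f"<corpus:{corpus}> {source}"


def tag_dataset(
    examples: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str]]:
    """Turn (source, target, corpus) triples into tagged (source, target) pairs."""
    return [(tag_example(src, corpus), tgt) for src, tgt, corpus in examples]


def trainable_mask(
    param_names: Iterable[str], prefix: str = "corpus_embedding"
) -> Dict[str, bool]:
    """Mark only corpus-related parameters as trainable; all others stay frozen.

    The prefix is a hypothetical naming convention used for illustration.
    """
    return {name: name.startswith(prefix) for name in param_names}


if __name__ == "__main__":
    data = [
        ("Hello world.", "Hallo Welt.", "news"),
        ("See you soon.", "Bis bald.", "subtitles"),
    ]
    for src, tgt in tag_dataset(data):
        print(src, "->", tgt)

    names = ["encoder.layer0.weight", "corpus_embedding.weight", "decoder.out.bias"]
    print(trainable_mask(names))
```

At inference time, the same `tag_example` helper could be used to pick which corpus behavior the model should imitate. In a real framework, the mask would be applied by disabling gradient updates for every parameter marked `False`.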
Optimal Splitting of Language Models from Mixtures to Specialized Domains
March 23, 2026 | Research areas: Data Science and Annotation, Speech and Natural Language Processing | Workshop at ICLR
This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026.
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality,…
Self Supervision Does Not Help Natural Language Supervision at Scale
March 13, 2023 | Research areas: Computer Vision, Speech and Natural Language Processing | Conference: CVPR
Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [64] have suggested that these approaches can be effectively combined, but most notably their results use small pre-training datasets (<50M samples) and don't effectively reflect the large-scale regime (>100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when…