Paper | July 2026

MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities

Authors: Mingqiao Ye†¶, Zhaochong An†‡¶, Zhitong Gao†, Xian Liu§, Oğuzhan Fatih Kar†, Jesse Allardice, Roman Bachmann†, David Mizrahi, François Fleuret††, Chuan Li‡‡, Amir Zadeh‡‡, Serge Belongie‡, Afshin Dehghan, Amir Zamir†


Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a requirement that arises across multimodal learning and scientific domains such as ecology and astronomy. However, existing any-to-any approaches are typically trained from scratch using encoder–decoder or diffusion architectures, limiting empirical performance and the use of pretrained models. We investigate decoder-only any-to-any multimodal modeling, which treats all modalities symmetrically and supports arbitrary modalities as inputs and outputs without modality-specific heads, losses, or task pipelines. As a consequence of this unified design, the resulting model MODUS naturally enables chained generation through intermediate modalities, cross-modal consistency verification, and analysis of visual representations by combining semantic and reconstruction features. Across a range of benchmarks, MODUS demonstrates strong out-of-the-box performance and flexible multimodal composition within a single model.
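To make the idea of treating all modalities symmetrically more concrete, the following is a minimal sketch of one common way a decoder-only model can do so: every modality is mapped into a single shared token vocabulary, and any combination of inputs and one target is packed into one sequence for next-token prediction. This is an illustration under stated assumptions, not the MODUS implementation. It assumes each modality already has a discrete tokenizer, and the names `Modality`, `SharedVocab`, `encode`, and `pack`, as well as the toy vocabulary sizes, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    vocab_size: int  # codebook size of this modality's tokenizer (assumed discrete)


class SharedVocab:
    """Hypothetical shared vocabulary: one tag token per modality,
    followed by a contiguous id range for each modality's tokens."""

    def __init__(self, modalities):
        self.tag = {m.name: i for i, m in enumerate(modalities)}
        self.sizes = {m.name: m.vocab_size for m in modalities}
        self.offset = {}
        next_id = len(modalities)  # ids below this are reserved for tags
        for m in modalities:
            self.offset[m.name] = next_id
            next_id += m.vocab_size
        self.size = next_id  # total vocabulary size seen by the decoder

    def encode(self, name, local_ids):
        """Shift modality-local token ids into the shared id space."""
        size = self.sizes[name]
        for i in local_ids:
            if not 0 <= i < size:
                raise ValueError(f"token {i} out of range for {name}")
        return [self.offset[name] + i for i in local_ids]

    def pack(self, inputs, target):
        """Build one sequence: tagged inputs, then the tagged target.

        Returns (sequence, loss_mask); the mask is 1 only on target tokens,
        so the same next-token loss applies whatever the target modality is.
        """
        seq, loss_mask = [], []
        for name, ids in inputs:
            chunk = [self.tag[name]] + self.encode(name, ids)
            seq += chunk
            loss_mask += [0] * len(chunk)
        t_name, t_ids = target
        seq.append(self.tag[t_name])  # target tag is part of the prompt
        loss_mask.append(0)
        t_tokens = self.encode(t_name, t_ids)
        seq += t_tokens
        loss_mask += [1] * len(t_tokens)
        return seq, loss_mask


# Toy setup: three modalities with made-up codebook sizes.
vocab = SharedVocab([Modality("rgb", 4), Modality("caption", 3), Modality("depth", 2)])

# The same function covers rgb -> caption and (caption, depth) -> rgb.
rgb_to_caption = vocab.pack([("rgb", [0, 3])], ("caption", [2, 0]))
mixed_to_rgb = vocab.pack([("caption", [1]), ("depth", [1])], ("rgb", [2]))
```

Because the input and target roles are decided only by where a modality appears in the packed sequence, chained generation in this sketch amounts to feeding a generated target back in as an input for the next call, with no modality-specific head or loss involved.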

† EPFL, Lausanne, Switzerland
‡ University of Copenhagen, Copenhagen, Denmark
§ The Chinese University of Hong Kong, Hong Kong, China
¶ Equal contribution
†† University of Geneva, Geneva, Switzerland
‡‡ Lambda AI


