Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a requirement that arises across multimodal learning and scientific domains such as ecology and astronomy. However, existing any-to-any approaches are typically trained from scratch using encoder–decoder or diffusion architectures, which limits both their empirical performance and their ability to reuse pretrained models. We investigate decoder-only any-to-any multimodal modeling, which treats all modalities symmetrically and supports arbitrary modalities as inputs and outputs without modality-specific heads, losses, or task pipelines. As a consequence of this unified design, the resulting model, MODUS, naturally enables chained generation through intermediate modalities, cross-modal consistency verification, and analysis of visual representations by combining semantic and reconstruction features. Across a range of benchmarks, MODUS demonstrates strong out-of-the-box performance and flexible multimodal composition within a single model.
- † EPFL, Lausanne, Switzerland
- ‡ University of Copenhagen, Copenhagen, Denmark
- § The Chinese University of Hong Kong, Hong Kong, China
- ¶ Equal contribution
- †† University of Geneva, Geneva, Switzerland
- ‡‡ Lambda AI