Paper | June 2023

Robustness in Multimodal Learning under Train-Test Modality Mismatch

Authors: Brandon McKinzie, Joseph Cheng, Vaishaal Shankar, Yinfei Yang, Jonathon Shlens, Alexander Toshev

Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave when the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to 1.5× to 4× robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve competitive results of 44.2 mAP on AudioSet 20K.

Related readings and updates.

Promoting Cross-Modal Representations to Improve Multimodal Foundation Models for Physiological Signals

October 28, 2024 | Research area: Methods and Algorithms | Conference: NeurIPS

Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration and it is unclear which pretraining strategies are most effective…

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

October 18, 2024 | Research area: Computer Vision | Conference: NeurIPS

*Equal Contributors

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we significantly expand upon the capabilities of 4M by training it on tens of highly diverse modalities and by performing…
