Learning to Detect Novel and Fine-Grained Acoustic Sequences Using Pretrained Audio Representations
AuthorsVasudha Kowtha, Miquel Espi Marques, Jonathan Huang, Yichi Zhang, Carlos Avendano
AuthorsVasudha Kowtha, Miquel Espi Marques, Jonathan Huang, Yichi Zhang, Carlos Avendano
This work investigates pre-trained audio representations for few shot Sound Event Detection. We specifically address the task of few shot detection of novel acoustic sequences, or sound events, with semantically meaningful temporal structure without assuming access to non-target audio. We develop procedures for pre-training suitable representations and methods that transfer them to our few shot learning scenario. Our experiments evaluate the general purpose utility of our pre-trained representations on AudioSet, and the utility of proposed few shot methods via tasks constructed from real-world acoustic sequences. Our pre-trained embeddings are suitable to the proposed task and enable multiple aspects of our few shot framework.
March 20, 2024research area Computer Vision, research area Speech and Natural Language Processing
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful...
March 4, 2024research area Data Science and Annotation, research area Speech and Natural Language Processingconference EACL
In-context learning with Large Language Models (LLMs) has emerged as a promising avenue of research in Dialog State Tracking (DST). However, the best-performing in-context learning methods involve retrieving and adding similar examples to the prompt, requiring access to labeled training data. Procuring such training data for a wide range of domains and applications is time-consuming, expensive, and, at times, infeasible. While zero-shot learning...