
With the help of creative prompt engineering and in-context learning, large language models (LLMs) are known to generalize well on a variety of text-based natural language processing (NLP) tasks. However, to perform well on spoken language understanding (SLU) tasks, LLMs either need to be equipped with a built-in speech modality or they need to rely on speech-to-text conversion from an off-the-shelf automatic speech recognition (ASR) system. In this work, we focus on the latter setup, where the accuracy of the LLM on SLU tasks is constrained by the accuracy of a frozen ASR system on the given speech input. Specifically, we tackle the task of speech intent classification, where a high word error rate (WER) implies that the LLM may not have the correct textual information to understand the spoken intent. To alleviate this problem, we propose to prompt the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We first explore prompting the LLM with descriptive prompts that explain the concept of n-best lists, to invoke the LLM's emergent abilities to understand the task; we then finetune LoRA adapters on the intent classification task. We demonstrate the efficacy of our approach on a binary device-directed speech detection task as well as on a keyword spotting task on the Google Speech Commands dataset, where systems using n-best-list prompts outperform those using 1-best ASR outputs, thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
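To make the setup concrete, here is a minimal sketch of how an n-best list might be formatted into a descriptive prompt of the kind the abstract describes. The function names (`build_nbest_prompt`, `llm_complete`), the intent labels, and the prompt wording are illustrative assumptions, not the paper's exact template or API.

```python
# Minimal sketch of prompting an LLM with an n-best list of ASR hypotheses.
# `llm_complete` is a hypothetical stand-in for any LLM completion API;
# the descriptive prompt text is illustrative, not the paper's exact wording.

from typing import List, Tuple

# Example labels for a binary device-directed speech detection task (assumed).
INTENTS = ["device-directed", "not-device-directed"]

def build_nbest_prompt(nbest: List[Tuple[str, float]], intents: List[str]) -> str:
    """Format (hypothesis, ASR confidence) pairs into a descriptive prompt
    that explains the concept of an n-best list to the LLM."""
    lines = [
        "An automatic speech recognizer produced several alternative",
        "transcriptions (an n-best list) of a single utterance, ordered from",
        "most to least likely. Individual hypotheses may contain errors.",
        "",
    ]
    for rank, (hyp, score) in enumerate(nbest, start=1):
        lines.append(f"{rank}. {hyp} (confidence: {score:.2f})")
    lines += [
        "",
        "Considering all hypotheses jointly, classify the speaker's intent "
        f"as one of: {', '.join(intents)}.",
        "Intent:",
    ]
    return "\n".join(lines)

# Toy n-best list with plausible ASR errors in the lower-ranked hypotheses.
nbest = [
    ("play the next song", 0.62),
    ("play the next on", 0.21),
    ("lay the next song", 0.09),
]

prompt = build_nbest_prompt(nbest, INTENTS)
# response = llm_complete(prompt)  # hypothetical call to the frozen LLM
print(prompt)
```

The same prompt format could serve as the input for LoRA-adapter finetuning on labeled intent data, so that the model learns to aggregate evidence across hypotheses rather than trusting the 1-best output alone.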

Related readings and updates.

Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

This paper was accepted at the Adaptive Foundation Models (AFM) Workshop at NeurIPS 2024. Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-Directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion…

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when only limited annotated data are available. In…