A Multi-signal Large Language Model for Device-directed Speech Detection
Authors: Dominik Wagner, Alex Churchill, Siddharth Sigtia, Panos Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition system. The audio waveform is represented as a sequence of continuous embeddings by an audio encoder and presented as prefix tokens to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements over text-only and audio-only models of 38.9% and 20.5%, respectively.
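The prefix idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear projection, the dimensions, the random values, and every name below (`project`, `build_prefixed_input`, `token_embedding_table`) are assumptions made for illustration, and the ASR confidence information is left out. The sketch only shows audio-encoder outputs being mapped into the LLM embedding space and placed in front of the embeddings of the ASR text tokens.

```python
import random


def project(vectors, matrix):
    """Multiply each row vector by a (d_in x d_out) matrix."""
    d_in, d_out = len(matrix), len(matrix[0])
    return [
        [sum(v[i] * matrix[i][j] for i in range(d_in)) for j in range(d_out)]
        for v in vectors
    ]


def build_prefixed_input(audio_embeddings, text_token_ids, projection,
                         token_embedding_table):
    """Prepend projected audio embeddings to the text token embeddings.

    Hypothetical helper: the projection and shapes are assumptions,
    not details taken from the paper.
    """
    audio_prefix = project(audio_embeddings, projection)
    text_embeddings = [token_embedding_table[t] for t in text_token_ids]
    return audio_prefix + text_embeddings


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
AUDIO_DIM, LLM_DIM, VOCAB_SIZE = 8, 16, 100  # assumed toy sizes

audio_embeddings = random_matrix(5, AUDIO_DIM, rng)   # 5 audio frames
projection = random_matrix(AUDIO_DIM, LLM_DIM, rng)   # audio -> LLM space
token_embedding_table = random_matrix(VOCAB_SIZE, LLM_DIM, rng)
text_token_ids = [3, 17, 42]                          # toy ASR hypothesis

llm_input = build_prefixed_input(
    audio_embeddings, text_token_ids, projection, token_embedding_table
)
print(len(llm_input), len(llm_input[0]))
```

The resulting sequence has one row per audio frame followed by one row per text token, all in the LLM embedding dimension, which is the form a decoder-only model would need to consume both modalities in a single pass.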
November 5, 2024. Research areas: Human-Computer Interaction; Speech and Natural Language Processing. Workshop at NeurIPS.
This paper was accepted at the Adaptive Foundation Models (AFM) Workshop at NeurIPS 2024.
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-Directed Speech Detection (DDSD) from the follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore the...
December 18, 2023. Research areas: Human-Computer Interaction; Speech and Natural Language Processing. Conference: ICASSP.
Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for example, acoustic, text, and/or automatic speech recognition (ASR) system features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being...