A Multi-signal Large Language Model for Device-directed Speech Detection
Authors: Dominik Wagner, Alex Churchill, Siddharth Sigtia, Panos Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition (ASR) system. The audio waveform is represented as a sequence of continuous embeddings by an audio encoder and presented as a prefix token to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements of 38.9% and 20.5% over text-only and audio-only models, respectively.
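As a rough illustration of this prefix-style fusion, audio features can be projected into the LLM's embedding space and concatenated in front of the embedded ASR transcript. This is a minimal sketch, not the paper's implementation: the GRU encoder, GPT-2 backbone, prefix length, and all dimensions below are assumptions.

```python
# Hypothetical sketch: audio embeddings are mapped into the LLM's embedding
# space and prepended as a prefix to the text-token embeddings.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

llm = GPT2LMHeadModel.from_pretrained("gpt2")   # stand-in for the pretrained LLM
tok = GPT2Tokenizer.from_pretrained("gpt2")

class AudioPrefixEncoder(nn.Module):
    """Maps a waveform-derived feature sequence to LLM-sized prefix embeddings."""
    def __init__(self, feat_dim=80, llm_dim=768, n_prefix=8):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, llm_dim, batch_first=True)
        self.n_prefix = n_prefix

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)                 # (B, T, llm_dim)
        return out[:, -self.n_prefix:, :]        # last n_prefix states as the prefix

encoder = AudioPrefixEncoder()
feats = torch.randn(1, 100, 80)                  # e.g., log-mel filterbank features
prefix = encoder(feats)                          # (1, 8, 768)

# ASR 1-best hypothesis as text input (the paper also uses confidence signals).
ids = tok("play some music", return_tensors="pt").input_ids
text_emb = llm.transformer.wte(ids)              # token embeddings (1, L, 768)

inputs_embeds = torch.cat([prefix, text_emb], dim=1)
logits = llm(inputs_embeds=inputs_embeds).logits
```

Because the task is framed as text generation, the decision can be read off as decoded text (e.g., "directed" vs. "not directed") rather than from a dedicated classification head.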
Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models
November 5, 2024 · Research areas: Human-Computer Interaction, Speech and Natural Language Processing · Workshop at NeurIPS
This paper was accepted at the Adaptive Foundation Models (AFM) Workshop at NeurIPS 2024.
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-Directed Speech Detection (DDSD) of follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore the…
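The abstract is truncated above, so the paper's exact method is not shown here. As a generic illustration only (not necessarily this paper's approach), one way to use an LLM for follow-up DDSD is to prompt it with the transcripts of the first query and the follow-up, then compare the model's likelihood of "yes" vs. "no" answers; the prompt wording and GPT-2 backbone are assumptions.

```python
# Illustrative only: score a follow-up utterance as device-directed by
# comparing next-token probabilities of " yes" vs. " no" after a prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

llm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

def directedness_score(first_query: str, follow_up: str) -> float:
    prompt = (f"User said to the assistant: {first_query}\n"
              f"Then someone said: {follow_up}\n"
              f"Was the second utterance directed at the assistant? Answer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = llm(ids).logits[0, -1]          # next-token distribution
    yes_id = tok(" yes").input_ids[0]
    no_id = tok(" no").input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()                       # P(directed | prompt)

print(directedness_score("set a timer", "make it five minutes"))
```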
Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features
December 18, 2023 · Research areas: Human-Computer Interaction, Speech and Natural Language Processing · Conference: ICASSP
Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for example, acoustic, text, and/or automatic speech recognition (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being…
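Modality dropout, named in the title, is a standard technique for robustness to missing modalities: during training, one modality's features are randomly zeroed so the model learns to classify from whatever remains. A minimal sketch follows, assuming pooled audio and text feature vectors fused by concatenation; the module name, dimensions, and drop rate are illustrative.

```python
# Minimal modality-dropout sketch: randomly zero one modality per example
# during training so the classifier tolerates a missing modality at test time.
import torch
import torch.nn as nn

class ModalityDropoutFusion(nn.Module):
    def __init__(self, audio_dim=256, text_dim=256, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.classifier = nn.Linear(audio_dim + text_dim, 1)  # directedness logit

    def forward(self, audio, text):              # audio: (B, Da), text: (B, Dt)
        if self.training:
            b = audio.size(0)
            drop = (torch.rand(b, 1) < self.p_drop).float()   # drop a modality?
            pick_audio = (torch.rand(b, 1) < 0.5).float()     # which one to drop
            audio = audio * (1.0 - drop * pick_audio)
            text = text * (1.0 - drop * (1.0 - pick_audio))
        fused = torch.cat([audio, text], dim=-1)
        return self.classifier(fused)

model = ModalityDropoutFusion().train()
logits = model(torch.randn(4, 256), torch.randn(4, 256))     # (4, 1)
```

Zeroing (rather than deleting) the dropped features keeps tensor shapes fixed, so the same fusion layer handles complete and incomplete inputs.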