A Multi-signal Large Language Model for Device-directed Speech Detection

Authors: Dominik Wagner, Alex Churchill, Siddharth Sigtia, Panos Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition system. An audio encoder represents the audio waveform as a sequence of continuous embeddings, which is presented as a prefix to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements of 38.9% and 20.5% over text-only and audio-only models, respectively.
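The equal error rate cited above is the operating point at which the false accept rate (non-directed speech classified as device-directed) equals the false reject rate (device-directed speech rejected). The following is a minimal sketch of how such a value could be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumptions (illustrative only): higher scores mean "device-directed",
    labels are 1 for device-directed and 0 otherwise, and both classes
    are present in the data.
    Returns (eer, threshold) at the point where the false accept rate
    and false reject rate are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = labels == 1
    neg = ~pos

    best_gap, best_eer, best_threshold = None, None, None
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[neg])    # non-directed speech that was accepted
        frr = np.mean(~accept[pos])   # device-directed speech that was rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2.0, t
    return float(best_eer), float(best_threshold)

# Toy example with made-up scores, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2f} at threshold {threshold:.1f}")
```

On finite data the two error rates rarely cross exactly, so the sketch reports their average at the closest threshold. Production evaluations often interpolate along the ROC curve instead.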

Related readings and updates.

Adaptive Knowledge Distillation for Device-Directed Speech Detection

August 8, 2025 | Research areas: Methods and Algorithms, Speech and Natural Language Processing | Conference: Interspeech

Device-directed speech detection (DDSD) is a binary classification task that separates the user’s queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving a naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from general…

Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

December 18, 2023 | Research areas: Human-Computer Interaction, Speech and Natural Language Processing | Conference: ICASSP

Device-directed speech detection (DDSD) is the binary classification task of distinguishing queries directed at a voice assistant from side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for example, acoustic, text, and/or automatic speech recognition (ASR) system features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being…
