Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
Authors: Dominik Wagner*, Alex Churchill*, Siddharth Sigtia, Panos Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
*Equal Contributors
This paper was accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2023.
Interactions with virtual assistants often begin with a predefined trigger phrase followed by the user command. To make interactions with the assistant more natural, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We address this task by combining the decoder signals of an automatic speech recognition (ASR) system with acoustic and lexical representations as input features to a large language model (LLM). We are interested in data- and resource-efficient systems that require only a small amount of training data and can potentially run on devices such as smartphones. For this reason, our model is finetuned on a small amount of multimodal data using low-rank adaptation. We compare the proposed system to unimodal models that rely either on lexical or acoustic information only. The effectiveness of our method is analyzed by finetuning decoder-only LLMs with sizes between 3 billion and 13 billion parameters on training data consisting of 10k to 80k utterances. We show that our best multimodal system yields better results than unimodal baselines while using only a fraction of the training data.
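To make the setup concrete, the sketch below (in PyTorch) illustrates the two ideas the abstract combines: a low-rank adapter that keeps the pretrained weights frozen, and a classifier that projects acoustic features and ASR decoder signals into the LLM embedding space and prepends them to the token embeddings before predicting device-directedness. This is a minimal illustration under stated assumptions, not the authors' implementation: all class and parameter names (e.g., LoRALinear, MultimodalDirectednessClassifier, audio_proj, decoder_proj) are hypothetical, the backbone is assumed to map embedding sequences to hidden states, and a small Transformer layer stands in for the 3B-13B decoder-only LLMs used in the paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class MultimodalDirectednessClassifier(nn.Module):
    """Prepends projected acoustic features and ASR decoder signals to the
    token embeddings of a decoder-only backbone, then classifies whether the
    utterance is directed at the device (hypothetical sketch)."""

    def __init__(self, backbone: nn.Module, d_model: int, d_audio: int, d_decoder: int):
        super().__init__()
        self.backbone = backbone  # assumed to map (B, T, d_model) -> (B, T, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)      # acoustic representations
        self.decoder_proj = nn.Linear(d_decoder, d_model)  # ASR decoder signals
        self.head = nn.Linear(d_model, 2)  # directed vs. non-directed

    def forward(self, audio_feats, decoder_feats, token_embeds):
        prefix = torch.cat(
            [self.audio_proj(audio_feats), self.decoder_proj(decoder_feats)], dim=1
        )
        inputs = torch.cat([prefix, token_embeds], dim=1)  # multimodal sequence
        hidden = self.backbone(inputs)
        return self.head(hidden[:, -1, :])  # classify from the final position


# Toy usage: any module mapping (B, T, D) -> (B, T, D) can stand in for the LLM.
backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
backbone.linear1 = LoRALinear(backbone.linear1, rank=4)  # adapt one layer as an example
model = MultimodalDirectednessClassifier(backbone, d_model=64, d_audio=40, d_decoder=8)
logits = model(torch.randn(2, 5, 40), torch.randn(2, 3, 8), torch.randn(2, 7, 64))
```

Because only the low-rank adapter, the projection layers, and the classification head are trainable, the number of updated parameters stays small, which is what makes finetuning on 10k to 80k utterances practical.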
A growing number of consumer devices, including smart speakers, headphones, and watches, use speech as the primary means of user input. As a result, voice trigger detection systems, mechanisms that use voice recognition technology to control access to a particular device or feature, have become an important component of the user interaction pipeline, since they signal the start of an interaction between the user and a device. Because these systems are deployed entirely on-device, several considerations inform their design, such as privacy, latency, accuracy, and power consumption.