Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR
Authors: Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan Niu, Ahmed Tewfik
Accurate prediction of the user's intent to interact with a voice assistant (VA) on a device (e.g., a smartphone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach to predict the user's intention (whether the user is speaking to the device or not) directly from acoustic and textual information encoded at subword tokens, which are obtained via an end-to-end (E2E) ASR model. Modeling the subword tokens directly, rather than phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token has a semantic meaning, in contrast to phoneme-level representations, and (ii) each subword token has a reusable “sub”-word acoustic pattern (that can be used to construct multiple full words), resulting in a much smaller vocabulary space than that of full words. To learn the subword representations for audio-to-intent classification, we extract: (i) acoustic information from an E2E-ASR model, which provides frame-level CTC posterior probabilities for the subword tokens, and (ii) textual information from a pretrained continuous bag-of-words model that captures the semantic meaning of the subword tokens. The key to our approach is that it combines acoustic subword-level posteriors with text information using the notion of positional encoding to account for multiple ASR hypotheses simultaneously. We show that the proposed approach learns robust representations for audio-to-intent classification and correctly prevents 93.3% of unintended user audio from invoking the VA at a 99% true positive rate.
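To make the reported operating point concrete, the following is a minimal sketch of how a figure such as "93.3% of unintended audio mitigated at 99% true positive rate" could be computed from classifier scores. The abstract does not describe the evaluation procedure, so everything here is an assumption for illustration: the function name `mitigation_at_tpr`, the convention that a higher score means "device-directed", and the rule that a clip is accepted when its score is at or above the threshold.

```python
import math


def mitigation_at_tpr(intended_scores, unintended_scores, target_tpr=0.99):
    """Hypothetical helper: pick the highest threshold that still accepts at
    least `target_tpr` of intended clips, then report the fraction of
    unintended clips rejected at that threshold.

    Assumed convention: a clip is accepted (the VA is invoked) when
    score >= threshold.
    """
    if not intended_scores or not unintended_scores:
        raise ValueError("both score lists must be non-empty")

    # Sort intended scores from most to least confident.
    ranked = sorted(intended_scores, reverse=True)

    # Smallest count of intended clips that must be accepted to reach the
    # target TPR (small epsilon guards against floating-point round-up).
    needed = max(1, math.ceil(target_tpr * len(ranked) - 1e-9))
    threshold = ranked[needed - 1]

    # Unintended clips scoring strictly below the threshold are mitigated.
    rejected = sum(1 for s in unintended_scores if s < threshold)
    return threshold, rejected / len(unintended_scores)


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    intended = [0.9, 0.8, 0.7, 0.6]
    unintended = [0.1, 0.65, 0.3, 0.5]
    thr, mitigated = mitigation_at_tpr(intended, unintended, target_tpr=1.0)
    print(f"threshold={thr}, mitigated={mitigated:.2%}")
```

Fixing the true positive rate first and then measuring rejections reflects the trade-off stated in the abstract: intended requests are protected at a chosen level, and the remaining freedom is spent on suppressing unintended invocations.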
UI-JEPA: Towards Active Perception of User Intent Through Onscreen User Activity
September 9, 2024 | Research areas: Human-Computer Interaction, Methods and Algorithms
Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy…
MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants
October 20, 2021 | Research areas: Computer Vision, Speech and Natural Language Processing | Conference: WeCNLP
In a multimodal assistant, where vision is also one of the input modalities, the identification of user intent becomes a challenging task as visual input can influence the outcome. Current digital assistants take spoken input and try to determine the user intent from conversational or device context. So, a dataset, which includes visual input (i.e. images or videos for the corresponding questions targeted for multimodal assistant use cases, is not…