Referring to Screen Texts with Voice Assistants

Authors: Shruti Bhargava, Anand Dhoot, Ing-Marie Jonsson, Hoang Long Nguyen, Alkesh Patel, Hong Yu, Vincent Renkens

Voice assistants help users make phone calls, send messages, create events, navigate, and much more. However, assistants have limited capacity to understand their users’ context. In this work, we take a step toward closing that gap. We explore a new experience that lets users refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. Our focus is reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, a setting similar to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Because consuming pixels directly is costly, our system is designed to rely on text extracted from the UI. Our model is modular, which offers flexibility, improved interpretability, and efficient runtime memory utilization.
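To make the task concrete, here is a minimal rule-based sketch of reference resolution over extracted screen text. It is not the model described above, which is learned and not specified in this abstract. The regular expressions, the keyword tables, and the names `ScreenEntity`, `extract_entities`, and `resolve_reference` are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    text: str       # the matched on-screen string
    category: str   # "phone", "email", or "url"
    order: int      # reading order on screen, 0-based


# Hypothetical, deliberately simple patterns (assumptions, not from the paper).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}

# Hypothetical keyword tables mapping spoken words to categories and positions.
CATEGORY_WORDS = {"number": "phone", "phone": "phone", "email": "email",
                  "link": "url", "url": "url", "website": "url"}
ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}


def extract_entities(screen_text):
    """Find candidate entities in the extracted text, in reading order."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(screen_text):
            found.append((match.start(), match.group(), category))
    found.sort()
    return [ScreenEntity(text, category, i)
            for i, (_, text, category) in enumerate(found)]


def resolve_reference(utterance, entities):
    """Return the entity the utterance refers to, or None if ambiguous."""
    words = utterance.lower().split()
    category = next((CATEGORY_WORDS[w] for w in words if w in CATEGORY_WORDS), None)
    candidates = [e for e in entities if category is None or e.category == category]
    if not candidates:
        return None
    position = next((ORDINALS[w] for w in words if w in ORDINALS), None)
    if position is None:
        # No ordinal given: only safe when exactly one candidate remains.
        return candidates[0] if len(candidates) == 1 else None
    if position >= len(candidates):
        return None
    return candidates[position]


screen = ("Front desk: 415 555 0100\n"
          "Support: 415 555 0199\n"
          "Email sales@example.com or visit https://example.com/contact")
entities = extract_entities(screen)
target = resolve_reference("call the second number", entities)
```

With two phone numbers on screen, "call the second number" resolves to the second one, while "call the number" returns `None` because the reference is ambiguous. That ambiguity among similar on-screen texts is the case the abstract highlights as most interesting.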

Related readings and updates.

Understanding Screen Relationships from Screenshots of Smartphone Applications

March 30, 2022 | Research area: Human-Computer Interaction | Conference: IUI

All graphical user interfaces are composed of one or more screens that may be shown to the user depending on their interactions. Identifying the different screens of an app and understanding the types of changes that happen on them is a challenging task with applications in many areas, including automatic app crawling, playback of app automation macros, and large-scale app dataset analysis. For example, an automated app crawler needs to…

Learning to Rank Intents in Voice Assistants

May 4, 2020 | Research areas: Human-Computer Interaction, Speech and Natural Language Processing | Conference: IWSDS

Voice assistants aim to fulfill user requests by choosing the best intent from multiple options generated by their Automated Speech Recognition and Natural Language Understanding sub-systems. However, voice assistants do not always produce the expected results. This can happen because voice assistants choose from ambiguous intents. User-specific or domain-specific contextual information can reduce the ambiguity of the user request. Additionally,…
