View publication

Recent advances in deep learning and automatic speech recognition have boosted the accuracy of end-to-end speech recognition to a new level. However, recognition of personal content, such as contact names, remains a challenge. In this work, we present a personalization solution for an end-to-end system based on connectionist temporal classification. Our solution uses a class-based language model, in which a general language model provides modeling of the context for named entity classes, and personal named entities are compiled in a separate finite state transducer. We further introduce a phoneme-to-wordpiece model to map rare named entities to more frequent homophonic wordpieces, and also wordpiece prior normalization to bias for rare wordpieces, leading to another 48.9% relative improvement in personal named entity accuracy on top of an already personalized baseline. This work allows our systems to match highly competitive personalized hybrid systems on personal named entity recognition.

Related readings and updates.

Contextualization of ASR with LLM Using Phonetic Retrieval-Based Augmentation

Large language models (LLMs) have shown superb capability of modeling multimodal signals including audio and text, allowing the model to generate spoken or textual response given a speech input. However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to…
See paper details

Noise-robust Named Entity Understanding for Virtual Assistants

Named Entity Understanding (NEU) plays an essential role in interactions between users and voice assistants, since successfully identifying entities and correctly linking them to their standard forms is crucial to understanding the user's intent. NEU is a challenging task in voice assistants due to the ambiguous nature of natural language and because noise introduced by speech transcription and user errors occur frequently in spoken natural…
See paper details