Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization

Authors: Zhihong Lei, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi, Youyuan Zhang, Mirko Hannemann, Man-Hung Siu, Zhen Huang


Recent advances in deep learning and automatic speech recognition have boosted the accuracy of end-to-end speech recognition to a new level. However, recognition of personal content, such as contact names, remains a challenge. In this work, we present a personalization solution for an end-to-end system based on connectionist temporal classification. Our solution uses a class-based language model, in which a general language model provides modeling of the context for named entity classes, and personal named entities are compiled in a separate finite state transducer. We further introduce a phoneme-to-wordpiece model to map rare named entities to more frequent homophonic wordpieces, and also wordpiece prior normalization to bias for rare wordpieces, leading to another 48.9% relative improvement in personal named entity accuracy on top of an already personalized baseline. This work allows our systems to match highly competitive personalized hybrid systems on personal named entity recognition.
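The abstract does not give a formula for wordpiece prior normalization, so the following is only a minimal sketch of one common form of the idea: subtracting a scaled log prior from each frame's wordpiece log posteriors, so that rarely seen wordpieces are penalized less at decoding time. The function name `normalize_by_prior`, the scale `alpha`, the two-entry vocabulary, and all probabilities are hypothetical and chosen for illustration; they are not taken from the paper, and the CTC blank symbol is left out for brevity.

```python
import math


def normalize_by_prior(log_posteriors, log_priors, alpha=1.0):
    """Subtract alpha-scaled log priors from per-frame log posteriors.

    log_posteriors: list of frames, each a list of log P(wordpiece | audio).
    log_priors: list of log P(wordpiece), e.g. estimated from training data.
    alpha: hypothetical scale controlling how strongly rare units are favored.
    """
    return [
        [lp - alpha * prior for lp, prior in zip(frame, log_priors)]
        for frame in log_posteriors
    ]


# Toy vocabulary (assumed): a frequent wordpiece and a rare homophonic one.
vocab = ["_john", "_jhon"]
log_priors = [math.log(0.9), math.log(0.1)]

# A single frame in which the frequent wordpiece narrowly wins.
frame = [math.log(0.55), math.log(0.45)]
best_before = vocab[frame.index(max(frame))]

# After normalization the rare wordpiece receives the higher score.
scores = normalize_by_prior([frame], log_priors, alpha=1.0)[0]
best_after = vocab[scores.index(max(scores))]

print(best_before, best_after)
```

In this toy frame the unnormalized scores favor the frequent wordpiece, while the normalized scores favor the rare one, which is the kind of bias toward rare wordpieces the abstract describes. In a real system the scale would need tuning so that common words are not degraded.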
