
This paper introduces ESPnet-SPK, an open-source toolkit for training and using speaker embedding extractors. Its modular architecture makes it straightforward to build models ranging from the x-vector to SKA-TDNN and to develop variants of them. Many tasks that depend on speaker embeddings still employ outdated ones; the toolkit lets the broader research community adopt advanced speaker embeddings instead, with pre-trained extractors readily available for off-the-shelf use. The toolkit also supports integration with various self-supervised learning features. ESPnet-SPK features over 30 recipes: seven speaker verification recipes, including a reproducible WavLM-ECAPA recipe with an equal error rate (EER) of 0.39% on the Vox1-O benchmark, and recipes for diverse downstream tasks, including text-to-speech and target speaker extraction. It also supports speaker similarity evaluation for singing voice synthesis, among other uses.
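For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an equal error rate could be computed from speaker verification trial scores. It is not ESPnet-SPK's API: the function names `cosine_similarity` and `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for illustration. The EER is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected).

```python
import math


def cosine_similarity(a, b):
    """Score a trial by the cosine similarity of two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores of same-speaker trials (hypothetical input).
    nontarget_scores: scores of different-speaker trials (hypothetical input).
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: impostor trials scoring at or above t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: genuine trials scoring below t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.2, 0.1, 0.4]
print(equal_error_rate(targets, nontargets))  # prints 0.25
```

In this toy case one of four genuine trials and one of four impostor trials fall on the wrong side of the threshold 0.6, giving an EER of 25%. A reported EER of 0.39% means both error rates are about 0.39% at the balancing threshold.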

Related readings and updates.

Improving On-Device Speaker Verification Using Federated Learning With Privacy

Information on speaker characteristics can be useful as side information in improving speaker recognition accuracy. However, such information is often private. This paper investigates how privacy-preserving learning can improve a speaker verification system, by enabling the use of privacy-sensitive speaker data to train an auxiliary classification model that predicts vocal characteristics of speakers. In particular, this paper explores the…

Generating Multilingual Voices Using Speaker Space Translation Based on Bilingual Speaker Data

We present progress towards bilingual Text-to-Speech which is able to transform a monolingual voice to speak a second language while preserving speaker voice quality. We demonstrate that a bilingual speaker embedding space contains a separate distribution for each language and that a simple transform in speaker space generated by the speaker embedding can be used to control the degree of accent of a synthetic voice in a language. The same…