SpeakStream: Streaming Text-to-Speech with Interleaved Data
Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS seem to be oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem, because they need entire utterances to generate stylistic audio. In this paper we present a 'streaming' TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained with next-step prediction on interleaved data generated by force-aligning text transcripts to speech. During inference our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications like conversational AI agents where an LLM can stream text to a TTS system. Results demonstrate that our approach matches the quality of batch TTS systems while enabling streaming capabilities.
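The training data described above can be pictured as a single flat token stream in which each word's text tokens are followed by the speech tokens the forced alignment maps to that word. The sketch below illustrates this interleaving under stated assumptions: the `AlignedWord` structure, the token ids, and the helper name are all illustrative, not the paper's actual data format.

```python
# Minimal sketch of interleaving force-aligned text and speech tokens.
# All names and token ids here are hypothetical illustrations.
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    text_tokens: List[int]    # text-token ids for one word
    speech_tokens: List[int]  # speech-codec token ids the alignment assigns to that word


def interleave(words: List[AlignedWord]) -> List[int]:
    """Emit each word's text tokens immediately before its aligned speech
    tokens, producing one flat sequence a decoder-only model can be
    trained on with next-step prediction."""
    sequence: List[int] = []
    for word in words:
        sequence.extend(word.text_tokens)
        sequence.extend(word.speech_tokens)
    return sequence


words = [
    AlignedWord(text_tokens=[101], speech_tokens=[7, 8, 9]),
    AlignedWord(text_tokens=[102, 103], speech_tokens=[10, 11]),
]
print(interleave(words))  # → [101, 7, 8, 9, 102, 103, 10, 11]
```

Because text tokens always precede the speech tokens they align to, a model trained on such sequences can, at inference time, accept new text tokens as they arrive from an LLM and continue emitting speech tokens without waiting for the full utterance.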
July 14, 2025 · Research areas: Methods and Algorithms; Speech and Natural Language Processing
The rapid progress of foundation models and large language models (LLMs) has fueled significant improvements in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal...
April 5, 2021 · Research area: Speech and Natural Language Processing · Conference: EACL
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap,...