SpeakStream: Streaming Text-to-Speech with Interleaved Data
Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS seem to be oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem, because they need entire utterances to generate stylistic audio. In this paper we present a 'streaming' TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained with next-step prediction on interleaved data generated by force-aligning text transcripts to speech. During inference our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications like conversational AI agents where an LLM can stream text to a TTS system. Results demonstrate that our approach matches the quality of batch TTS systems while enabling streaming capabilities.
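The training data described above can be pictured as a single flat token stream in which each word's text tokens are followed by the speech tokens the forced alignment maps to that word. The sketch below illustrates this interleaving under stated assumptions: the `AlignedWord` structure, the token ids, and the helper name are all illustrative, not the paper's actual data format.

```python
# Minimal sketch of interleaving force-aligned text and speech tokens.
# All names and token ids here are hypothetical illustrations.
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    text_tokens: List[int]    # text-token ids for one word
    speech_tokens: List[int]  # speech-codec token ids the alignment assigns to that word


def interleave(words: List[AlignedWord]) -> List[int]:
    """Emit each word's text tokens immediately before its aligned speech
    tokens, producing one flat sequence a decoder-only model can be
    trained on with next-step prediction."""
    sequence: List[int] = []
    for word in words:
        sequence.extend(word.text_tokens)
        sequence.extend(word.speech_tokens)
    return sequence


words = [
    AlignedWord(text_tokens=[101], speech_tokens=[7, 8, 9]),
    AlignedWord(text_tokens=[102, 103], speech_tokens=[10, 11]),
]
print(interleave(words))  # → [101, 7, 8, 9, 102, 103, 10, 11]
```

Because text tokens always precede the speech tokens they align to, a model trained on such sequences can, at inference time, accept new text tokens as they arrive from an LLM and continue emitting speech tokens without waiting for the full utterance.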
July 14, 2025 · Research areas: Methods and Algorithms; Speech and Natural Language Processing
The rapid progress of foundation models and large language models (LLMs) has fueled significant improvements in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal...
April 5, 2021 · Research area: Speech and Natural Language Processing · Conference: EACL
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap,...