SpeakStream: Streaming Text-to-Speech with Interleaved Data
Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to text-to-speech (TTS) systems remain oddly under-explored, even though they are potentially much simpler. Using traditional TTS systems to convert LLM outputs to audio, however, poses a technical problem: they need entire utterances to generate stylistic audio. In this paper we present a ‘streaming’ TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained using next-step prediction on interleaved data that is generated from forced alignment of text transcripts to speech. During inference, our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications such as conversational AI agents, where an LLM can stream text to a TTS system. Results demonstrate that our approach matches the quality of batch TTS systems while enabling streaming capabilities.
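As a rough illustration of what interleaved training data built from forced alignment might look like, the sketch below groups aligned words into small chunks and follows each text chunk with the speech tokens that cover it. This is a minimal sketch under stated assumptions, not the paper's actual recipe: the `<text>` and `<speech>` marker strings, the `words_per_chunk` parameter, the one-token-per-frame speech representation, and the function name `interleave` are all hypothetical and do not come from the abstract.

```python
from typing import List, Tuple, Union

# Hypothetical alignment record: (word, start_frame, end_frame), end exclusive.
Alignment = Tuple[str, int, int]
Token = Union[str, int]


def interleave(
    alignment: List[Alignment],
    speech_tokens: List[int],
    words_per_chunk: int = 2,
) -> List[Token]:
    """Build one interleaved text/speech sequence from a word-level alignment.

    Assumes one discrete speech token per frame; the marker strings are
    illustrative placeholders, not tokens defined by the paper.
    """
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")

    sequence: List[Token] = []
    for i in range(0, len(alignment), words_per_chunk):
        chunk = alignment[i:i + words_per_chunk]

        # Text part: the words of this chunk.
        sequence.append("<text>")
        sequence.extend(word for word, _, _ in chunk)

        # Speech part: the frames spanned by this chunk's words.
        start_frame = chunk[0][1]
        end_frame = chunk[-1][2]
        sequence.append("<speech>")
        sequence.extend(speech_tokens[start_frame:end_frame])
    return sequence


# Toy example: three aligned words over six speech frames.
example_alignment = [("hello", 0, 2), ("world", 2, 5), ("again", 5, 6)]
example_tokens = [10, 11, 12, 13, 14, 15]
example_sequence = interleave(example_alignment, example_tokens)
```

A decoder-only model trained with next-step prediction on sequences of this shape sees text and the matching audio side by side, which is what would let it start emitting speech before the full utterance text has arrived. The chunk size trades latency against how much upcoming text the model can condition on.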
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
May 22, 2026; research areas: Computer Vision, Data Science and Annotation; conference: CVPR
Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model’s…
Streaming Models for Joint Speech Recognition and Translation
April 5, 2021; research area: Speech and Natural Language Processing; conference: EACL
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap,…