Pre-Trained Foundation Model Representations to Uncover Breathing Patterns in Speech
Authors: Vikramjit Mitra, Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, Erdrin Azemi
The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and modulated by the vocal tract, and such actions are interspersed with moments of breathing in air (inhalation) to refill the lungs. Respiratory rate (RR) is a vital metric used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measuring RR (the number of breaths one takes in a minute) require specialized equipment or training. Studies have demonstrated that machine learning algorithms can estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective way to measure this vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate RR from speech segments obtained from subjects speaking into a close-talking microphone device. Data were collected from N=26 individuals, where the ground-truth RR was obtained through commercial-grade chest belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as WAV2VEC2, can be used to estimate the respiration time series with low root-mean-squared error and a high correlation coefficient, when compared with the baseline. The model-driven time series can be used to estimate RR with a low mean absolute error (MAE) ≈ 1.6 breaths/min.
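As a rough illustration of the last step, the sketch below derives RR from a respiration time series and scores it with MAE. It is not the authors' pipeline: the abstract does not say how RR is read off the predicted time series, so the mean-crossing counter used here is an assumption. The function names (`estimate_rr`, `mean_absolute_error`, `synth_breathing`), the 25 Hz sampling rate, and the synthetic sinusoidal signal are all hypothetical and stand in for the Conv-LSTM output.

```python
import math


def estimate_rr(signal, sample_rate_hz):
    """Estimate respiratory rate in breaths/min.

    Assumption: one breath corresponds to one upward crossing of the
    signal mean. Real chest-belt or model-predicted signals would
    normally need smoothing before this step.
    """
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    # Count transitions from below the mean to at-or-above the mean.
    crossings = sum(
        1 for prev, cur in zip(centered, centered[1:]) if prev < 0 <= cur
    )
    duration_min = len(signal) / sample_rate_hz / 60.0
    return crossings / duration_min


def mean_absolute_error(estimates, references):
    """Mean absolute error between estimated and reference RR values."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)


def synth_breathing(rr_bpm, duration_s=60, sample_rate_hz=25):
    """Hypothetical stand-in for a respiration time series: a clean sinusoid."""
    n = int(duration_s * sample_rate_hz)
    freq_hz = rr_bpm / 60.0
    # The small phase offset keeps samples from landing exactly on zero.
    return [
        math.sin(2 * math.pi * freq_hz * i / sample_rate_hz - 0.5)
        for i in range(n)
    ]


if __name__ == "__main__":
    reference_rr = [12, 15, 20]
    estimated_rr = [estimate_rr(synth_breathing(rr), 25) for rr in reference_rr]
    print("estimated RR:", estimated_rr)
    print("MAE (breaths/min):", mean_absolute_error(estimated_rr, reference_rr))
```

On clean synthetic input the estimate matches the reference exactly. On real signals the counting step is sensitive to noise, so a band-pass filter or a peak detector with a minimum inter-breath distance would be a more robust choice.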
December 3, 2024, research area Methods and Algorithms, research area Speech and Natural Language Processing, conference NeurIPS
Multi-modal large language models (MLLMs) have enabled numerous advances in understanding and reasoning in domains like vision, but we have not yet seen this broad success for time-series. Although prior works on time-series MLLMs have shown promising performance in time-series forecasting, very few works show how an LLM could be used for time-series reasoning in natural language. We propose a novel multi-modal time-series LLM approach that...
August 6, 2021, research area Health, research area Methods and Algorithms, conference EMBC
Respiratory rate (RR) is a clinical metric used to assess overall health and physical fitness. An individual's RR can change due to normal activities like physical exertion during exercise or due to chronic and acute illnesses. Remote estimation of RR offers a cost-effective method to track disease progression and cardio-respiratory fitness over time. This work investigates a model-driven approach to estimate RR from short audio segments obtained...