Research Highlights

Multimodal Signal Processing of Couples' Behavior

Behavioral observation is a common practice for researchers and practitioners in psychology, for instance in the study of marital and family interactions. The research and therapeutic paradigm in this domain often involves the collection of audiovisual observations of the subjects in focus (e.g., couples or families), which are then manually coded for behaviors of interest by trained evaluators.

In our recent work, we argue that applying appropriate signal processing and machine learning techniques has the potential to both reduce the cost and increase the consistency of this coding process. We automatically analyze interactions of married couples and extract audio-, video-, and transcription-based behavioral cues. These low- and intermediate-level descriptors are then shown to be predictive of high-level behaviors as coded by trained evaluators. More…
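
To make the idea concrete, here is a minimal sketch, assuming precomputed per-session descriptors and a binary evaluator code; the synthetic features, the label definition, and the logistic-regression classifier are illustrative assumptions, not the pipeline used in the work.

```python
# Minimal sketch (not the authors' exact pipeline): predicting a session-level
# behavior code from low-level multimodal descriptors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical per-session descriptors: e.g., pitch statistics (audio),
# head-motion statistics (video), and word-category counts (transcripts),
# already aggregated into one fixed-length vector per session.
n_sessions, n_features = 100, 12
X = rng.normal(size=(n_sessions, n_features))

# Hypothetical binary code from trained evaluators (purely synthetic here).
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n_sessions) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```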

Computational Speech Production Modeling using Multimodal Articulatory Data

MRI-TIMIT, a recently acquired large corpus of real-time magnetic resonance image sequences of the vocal tract during speech production, provides us with a unique platform for systematically studying articulatory dynamics.

Compared to previously collected articulatory datasets, e.g., those acquired via articulography or X-rays, MRI-TIMIT is a rich source of information about the entire vocal tract rather than only certain articulatory landmarks, and it has the potential to keep growing in size to cover a wide variety of speakers and speaking styles. Apart from the real-time MRI data, our work also investigates articulatory representations and speech production models based on data captured by Electromagnetic Articulography, X-rays and the X-ray microbeam. More…
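
As a small illustration of working with such articulatory trajectories, the sketch below derives basic kinematic descriptors (velocity and speed) from a sampled articulator trace via finite differences; the synthetic tongue-tip trajectory and the 200 Hz sampling rate are assumptions standing in for real EMA or MRI-derived data.

```python
# Minimal sketch, under an assumed data layout: simple kinematic descriptors
# from a 2-D articulator trajectory sampled at a fixed rate.
import numpy as np

fs = 200.0                      # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)
# Synthetic (x, y) tongue-tip trajectory standing in for real sensor data.
traj = np.stack([np.cos(2 * np.pi * 3 * t), 0.5 * np.sin(2 * np.pi * 3 * t)], axis=1)

vel = np.gradient(traj, 1.0 / fs, axis=0)   # per-dimension velocity (units/s)
speed = np.linalg.norm(vel, axis=1)         # tangential speed

print(f"peak speed: {speed.max():.2f}, mean speed: {speed.mean():.2f}")
```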

Tracking Emotion based on Body Language, Audiovisual Cues and Context

Human expressive interactions are characterized by an ongoing unfolding of verbal and nonverbal cues. Such cues convey the interlocutor’s emotional state, which is continuous and varies in intensity and clarity over time.

In our current work, we examine the emotional content of body language cues describing a participant’s posture, relative position and approach/withdraw behaviors during improvised affective interactions, and show that they reflect changes in the participant’s activation and dominance levels. Furthermore, we introduce a framework for tracking changes in emotional states during an interaction using a statistical mapping between the observed audiovisual cues and the underlying user state. Our approach shows promising results for tracking changes in activation and dominance. More…
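
One simple way to realize such tracking, though not necessarily the statistical mapping used in this work, is a one-dimensional Kalman filter that smooths noisy, cue-derived estimates of a continuous dimension such as activation; the noise parameters and synthetic observations below are illustrative assumptions.

```python
# Minimal sketch: tracking a continuous emotion dimension (e.g., activation)
# with a 1-D Kalman filter over noisy cue-derived observations.
import numpy as np

rng = np.random.default_rng(1)

T = 200
true_state = np.cumsum(rng.normal(scale=0.05, size=T))      # slowly drifting activation
observations = true_state + rng.normal(scale=0.4, size=T)   # cue-based noisy estimates

q, r = 0.05 ** 2, 0.4 ** 2   # process and observation noise variances (assumed)
x_hat, p = 0.0, 1.0          # initial state estimate and variance
estimates = []
for z in observations:
    # Predict: random-walk state model.
    p += q
    # Update with the new cue-derived observation.
    k = p / (p + r)          # Kalman gain
    x_hat += k * (z - x_hat)
    p *= (1 - k)
    estimates.append(x_hat)

rmse = np.sqrt(np.mean((np.array(estimates) - true_state) ** 2))
print(f"tracking RMSE: {rmse:.3f}")
```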

Robust Long Speech-Text Alignment

Long speech-text alignment can facilitate large-scale study of rich spoken language resources that have recently become widely accessible, e.g., collections of audiobooks or multimedia documents. For such resources, conventional Viterbi-based forced alignment often proves inadequate, mainly due to mismatches between the audio and the text and/or noisy audio. We have developed SailAlign, an open-source software toolkit for robust long speech-text alignment that circumvents these limitations. It implements an adaptive, iterative speech recognition and text alignment scheme that allows for the processing of very long (and possibly noisy) audio and is robust to transcription errors. More…
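
The sketch below illustrates only the anchoring idea behind such iterative alignment, not the SailAlign implementation itself: long exact matches between a noisy recognition hypothesis and the reference text are kept as anchors, and the remaining gaps would be sent to a further, adapted recognition pass; the toy word sequences and the minimum anchor length are assumptions.

```python
# Minimal sketch of anchor-based long alignment (not the SailAlign toolkit).
from difflib import SequenceMatcher

reference = "the quick brown fox jumps over the lazy dog near the river bank".split()
hypothesis = "the quick brown fox um jumps over a lazy dog near the river".split()

matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
# Keep only sufficiently long exact matches as reliable anchors (length assumed).
anchors = [m for m in matcher.get_matching_blocks() if m.size >= 3]

print("anchors (reference span -> hypothesis span):")
for m in anchors:
    print(f"  ref[{m.a}:{m.a + m.size}] -> hyp[{m.b}:{m.b + m.size}]:",
          " ".join(reference[m.a:m.a + m.size]))

# Unanchored reference regions would go to another, adapted recognition pass.
covered = set()
for m in anchors:
    covered.update(range(m.a, m.a + m.size))
gaps = [i for i in range(len(reference)) if i not in covered]
print("reference word indices still to align:", gaps)
```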

Audiovisual Speech Inversion

We are interested in recovering aspects of the vocal tract’s geometry and dynamics from the speech signal, a problem referred to as speech inversion. Traditional audio-only speech inversion techniques are inherently ill-posed, since the same speech acoustics can be produced by multiple articulatory configurations.

To alleviate the ill-posedness of the audio-only inversion process, we propose an inversion scheme which also exploits visual information from the speaker’s face. The complex audiovisual-to-articulatory mapping is approximated by an adaptive piecewise linear model. More…
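
A minimal sketch of a (non-adaptive) piecewise linear mapping is shown below: the audiovisual feature space is partitioned with k-means and one linear regressor is fitted per region; the data dimensions, the number of regions, and the synthetic data are illustrative assumptions rather than the model proposed in the work.

```python
# Minimal sketch of a piecewise linear audiovisual-to-articulatory mapping.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

n, d_av, d_art = 500, 10, 4          # samples, audiovisual dim, articulatory dim (assumed)
X = rng.normal(size=(n, d_av))       # stand-in audiovisual features
Y = np.tanh(X @ rng.normal(size=(d_av, d_art)))  # stand-in articulatory targets

k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
regressors = [LinearRegression().fit(X[km.labels_ == j], Y[km.labels_ == j]) for j in range(k)]

def invert(av_features):
    """Map audiovisual features to articulatory parameters, piecewise linearly."""
    labels = km.predict(av_features)
    out = np.empty((len(av_features), d_art))
    for j in range(k):
        mask = labels == j
        if mask.any():
            out[mask] = regressors[j].predict(av_features[mask])
    return out

pred = invert(X)
print(f"training RMSE: {np.sqrt(np.mean((pred - Y) ** 2)):.3f}")
```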

Audiovisual Speech Recognition

Audiovisual speech recognition refers to the problem of recognizing speech using both the acoustic signal and visual cues from the speaker’s face, i.e., lipreading information. We have developed highly adaptive multimodal fusion rules based on uncertainty compensation that are compatible with both synchronous and asynchronous multimodal interaction architectures. Further, our work on AAM-based face representations yields highly informative visual speech features that can be extracted in real time. More…
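
As a simplified stand-in for such fusion rules, the sketch below combines per-frame audio and visual class scores with weights inversely proportional to each stream’s estimated uncertainty; the score arrays and uncertainty values are illustrative assumptions.

```python
# Minimal sketch of uncertainty-weighted audiovisual stream fusion.
import numpy as np

rng = np.random.default_rng(3)

n_frames, n_classes = 6, 5
audio_loglik = rng.normal(size=(n_frames, n_classes))   # stand-in audio class scores
video_loglik = rng.normal(size=(n_frames, n_classes))   # stand-in visual class scores

# Per-frame uncertainty estimates, e.g., from feature-enhancement variances
# (audio) or tracking confidence (video); here just positive placeholders.
audio_var = rng.uniform(0.5, 2.0, size=n_frames)
video_var = rng.uniform(0.5, 2.0, size=n_frames)

# Inverse-uncertainty weights, normalized per frame.
w_a = (1.0 / audio_var) / (1.0 / audio_var + 1.0 / video_var)
w_v = 1.0 - w_a

fused = w_a[:, None] * audio_loglik + w_v[:, None] * video_loglik
print("per-frame decisions:", fused.argmax(axis=1))
```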