Visual features for the MOCHA database, as provided by the Computer Vision, Speech Communication & Signal Processing (CVSP) Group at the National Technical University of Athens, Greece.
CVSP group web site: http://cvsp.cs.ntua.gr

Overview:
--------------------
The MOCHA (Multichannel Articulatory) database was compiled by the Department of Speech and Language Sciences at Queen Margaret University College and the Department of Linguistics at the University of Edinburgh. It is available online at http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html

The database currently comprises sound, EMA (Electromagnetic Articulography), laryngograph, electropalatograph and video recordings of two speakers (one male and one female) uttering a set of 460 English sentences. For details on the database please refer to (Wrench and Hardcastle, 2000).

While the other recorded modalities have been exploited in various contexts (Richmond et al. 2003; Toda et al. 2008; Toutios and Margaritis 2008), the video footage had so far remained unused. To also use the video for audiovisual speech inversion (Katsamanis et al. 2008, 2009), we processed the raw recordings of the female speaker (fsew0) and extracted visual features based on face active appearance modeling (Cootes et al. 2001).

The video was first segmented and labeled by automatically aligning the pretranscribed audio data (available with the MOCHA database) with the audio tracks extracted from the video files. Shape and texture features were then extracted at 25 Hz (one feature vector per video frame), after face detection, tracking and active appearance model fitting, using the AAM fitting algorithm described in (Papandreou and Maragos, 2008). Please check the AAMtools webpage (http://cvsp.cs.ntua.gr/software/AAMtools) for further information on the modeling process and software. In total, 12 shape and 27 texture features were extracted from each frame.
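Each frame thus yields 39 features (12 shape + 27 texture). Since the per-utterance feature files are stored in binary HTK format (see Practical details below), they can also be read outside MATLAB. The following is a minimal Python sketch of our own (it is not part of the distributed tools), assuming the standard HTK header layout: nSamples (int32), sampPeriod (int32, in 100 ns units), sampSize (int16, bytes per frame) and parmKind (int16), all big-endian, followed by float32 frame data.

```python
import struct
import numpy as np

def read_htk(path):
    """Read an HTK-format binary feature file.

    Returns (features, samp_period, parm_kind), where features is an
    (nSamples, nDims) float array.
    """
    with open(path, "rb") as f:
        # 12-byte big-endian header: int32, int32, int16, int16.
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", f.read(12))
        n_dims = samp_size // 4  # 4 bytes per float32 coefficient
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    return data.reshape(n_samples, n_dims), samp_period, parm_kind

def write_htk(path, feats, samp_period=400000, parm_kind=9):
    """Write features as an HTK file (parm_kind 9 = USER-defined)."""
    n_samples, n_dims = feats.shape
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_samples, samp_period,
                            4 * n_dims, parm_kind))
        f.write(np.asarray(feats, dtype=">f4").tobytes())
```

For the visual features described here, one would expect nSamples*156-byte bodies (39 float32 coefficients per frame) and a sample period of 400000 (40 ms at 25 Hz, in HTK's 100 ns units).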
Synchronization with the EMA features was verified both visually and quantitatively using canonical correlation analysis (Katsamanis et al. 2009).

Practical details:
-------------------------------
Features are stored in binary HTK format (Young et al. 2000), in a separate file per utterance. File naming follows the MOCHA conventions; the suffix is .aam. The accompanying script demonstrates how to import the MOCHA visual features into the MATLAB environment.

Please check http://cvsp.cs.ntua.gr/research/inversion for a further description of our related research and for updates on papers, datasets and software.

References:
-----------------------
A. Wrench and W. Hardcastle, "A multichannel articulatory speech database and its application for automatic speech recognition," in Proc. 5th Seminar on Speech Production, Kloster Seeon, Bavaria, 2000, pp. 305-308. [Online]. Available: http://www.cstr.ed.ac.uk/artic

K. Richmond, S. King, and P. Taylor, "Modelling the uncertainty in recovering articulation from acoustics," Computer Speech and Language, vol. 17, pp. 153-172, 2003.

T. Toda, A. W. Black, and K. Tokuda, "Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model," Speech Communication, vol. 50, pp. 215-227, 2008.

A. Toutios and K. Margaritis, "Estimating electropalatographic patterns from the speech signal," Computer Speech and Language, vol. 22, no. 4, pp. 346-359, October 2008.

A. Katsamanis, G. Papandreou, and P. Maragos, "Face active appearance modeling and speech acoustic information to recover articulation," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 3, pp. 411-422, March 2009.

A. Katsamanis, G. Papandreou, and P. Maragos, "Audiovisual-to-articulatory speech inversion using active appearance models for the face and hidden Markov models for the dynamics," in Proc. IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP-2008), Las Vegas, NV, U.S.A., Mar.-Apr. 2008.

T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, June 2001.

G. Papandreou and P. Maragos, "Adaptive and constrained algorithms for inverse compositional active appearance model fitting," in Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (CVPR-2008), Anchorage, AK, June 2008.

S. Young et al., The HTK Book (for HTK Version 3.0), University of Cambridge, 2000. [Online]. Available: http://htk.eng.cam.ac.uk/docs/docs.shtml

For reprints of our papers, visit the CVSP group web site: http://cvsp.cs.ntua.gr