Visual features for the MOCHA database, as provided by the Computer Vision, Speech Communication & Signal Processing (CVSP) Group at the National Technical University of Athens, Greece.
CVSP group web site: http://cvsp.cs.ntua.gr

Overview:
--------------------
The MOCHA (Multichannel Articulatory) database was compiled by the Department of Speech and Language Sciences at Queen Margaret University College and the Department of Linguistics at the University of Edinburgh. It is available online at http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html

The database currently comprises sound, EMA (Electromagnetic Articulography), laryngograph, electropalatograph and video recordings of two speakers (one male and one female) uttering a set of 460 English sentences. For details on the database please refer to (Wrench and Hardcastle, 2000).

While the other recorded modalities have been exploited in various contexts (Richmond et al. 2003; Toda et al. 2008; Toutios and Margaritis 2008), the video footage had so far remained unused. To also use the video for audiovisual speech inversion (Katsamanis et al. 2008, 2009), we processed the raw recordings of the female speaker (fsew0) and extracted visual features based on face active appearance modeling (Cootes et al. 2001).

The video was first segmented and labeled by automatically aligning the pretranscribed audio data (available with the MOCHA database) with the audio tracks extracted from the video files. Shape and texture features were then extracted at 25 Hz (one feature vector per video frame), after face detection, tracking and active appearance model fitting, using the AAM fitting algorithm described in (Papandreou and Maragos, 2008). Please check the AAMtools webpage (http://cvsp.cs.ntua.gr/software/AAMtools) for further information on the modeling process and software. In total, 12 shape and 27 texture features were extracted from each frame.
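Each frame thus yields 39 features (12 shape + 27 texture). Since the per-utterance feature files are stored in binary HTK format (see Practical details below), they can also be read outside MATLAB. The following is a minimal Python sketch of our own (it is not part of the distributed tools), assuming the standard HTK header layout: nSamples (int32), sampPeriod (int32, in 100 ns units), sampSize (int16, bytes per frame) and parmKind (int16), all big-endian, followed by float32 frame data.

```python
import struct
import numpy as np

def read_htk(path):
    """Read an HTK-format binary feature file.

    Returns (features, samp_period, parm_kind), where features is an
    (nSamples, nDims) float array.
    """
    with open(path, "rb") as f:
        # 12-byte big-endian header: int32, int32, int16, int16.
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", f.read(12))
        n_dims = samp_size // 4  # 4 bytes per float32 coefficient
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    return data.reshape(n_samples, n_dims), samp_period, parm_kind

def write_htk(path, feats, samp_period=400000, parm_kind=9):
    """Write features as an HTK file (parm_kind 9 = USER-defined)."""
    n_samples, n_dims = feats.shape
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_samples, samp_period,
                            4 * n_dims, parm_kind))
        f.write(np.asarray(feats, dtype=">f4").tobytes())
```

For the visual features described here, one would expect nSamples*156-byte bodies (39 float32 coefficients per frame) and a sample period of 400000 (40 ms at 25 Hz, in HTK's 100 ns units).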
Synchronization with the EMA features was verified both visually and quantitatively using canonical correlation analysis (Katsamanis et al. 2009).

Practical details:
-------------------------------
Features are stored in binary HTK format (Young et al. 2000), in a separate file per utterance. File naming follows the MOCHA conventions; the suffix is .aam. The accompanying script demonstrates how to import the MOCHA visual features into the MATLAB environment.

Please check http://cvsp.cs.ntua.gr/research/inversion for a further description of our related research and for updates on papers, datasets and software.

References:
-----------------------
A. Wrench and W. Hardcastle, "A multichannel articulatory speech database and its application for automatic speech recognition," in Proc. 5th Seminar on Speech Production, Kloster Seeon, Bavaria, 2000, pp. 305-308. [Online]. Available: http://www.cstr.ed.ac.uk/artic

K. Richmond, S. King, and P. Taylor, "Modelling the uncertainty in recovering articulation from acoustics," Computer Speech and Language, vol. 17, pp. 153-172, 2003.

T. Toda, A. W. Black, and K. Tokuda, "Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model," Speech Communication, vol. 50, pp. 215-227, 2008.

A. Toutios and K. Margaritis, "Estimating electropalatographic patterns from the speech signal," Computer Speech and Language, vol. 22, no. 4, pp. 346-359, October 2008.

A. Katsamanis, G. Papandreou, and P. Maragos, "Face active appearance modeling and speech acoustic information to recover articulation," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 3, pp. 411-422, March 2009.

A. Katsamanis, G. Papandreou, and P. Maragos, "Audiovisual-to-articulatory speech inversion using active appearance models for the face and hidden Markov models for the dynamics," in Proc. IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP-2008), Las Vegas, NV, U.S.A., Mar.-Apr. 2008.

T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, June 2001.

G. Papandreou and P. Maragos, "Adaptive and constrained algorithms for inverse compositional active appearance model fitting," in Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (CVPR-2008), Anchorage, AK, June 2008.

S. Young et al., The HTK Book (for HTK Version 3.0), University of Cambridge, 2000. [Online]. Available: http://htk.eng.cam.ac.uk/docs/docs.shtml

For reprints of our papers, visit the CVSP group web site: http://cvsp.cs.ntua.gr