WP1 : Environment & Sensor Robustness
WP objectives
The objectives for this WP are to:
- Investigate strategies to improve the robustness of speech recognition against noise.
- Assess the improvement that can be gained from optimal integration of advanced sensors (multi-microphone and audio-visual speech capture).
WP duration
The work package will begin at month 1 and finish at month 36.
WP milestones
The workplan is structured around four main milestones:
M1.1, m3 Selection of suitable benchmark database(s), allocation of work between partners
M1.2, m10 Completion of baseline experiments on environment & sensor robustness
M1.3, m21 Completion of phase I experiments on environment & sensor robustness
M1.4, m33 Completion of phase II experiments on environment & sensor robustness
WP tasks
The WP is divided into two main tasks:
Task 1 : Sensor integration and independence (m1-36)
The objective is to go beyond the state of the art in speech capture and to maximize its impact on speech recognition performance. Two subjects will be addressed: multi-microphone speech capture (see subject 1) and audio-visual speech capture (see subject 5). Most of the task resources will be devoted to the design of multi-microphone sound capture oriented toward recognition. It has been observed that the improvement in recognition rate is not directly related to the improvement in signal-to-noise ratio provided, for example, by a microphone array. The results of this task will be used in the integration phase of the project to maximize the performance gains when multi-microphone arrays are introduced in the fixed platform. The activities on audio-visual processing are meant to assess, on a theoretical basis, the real benefit of this method compared with efficient audio-only speech processing. If audio-visual processing is shown to improve performance in our environments (for example, through better speech/non-speech discrimination in difficult environments), and if we do not achieve the required performance with audio-only processing, this alternative may be considered for further system integration. Within this task, partners will collaborate in two main fields:
- Direct extraction of robust features from a multi-microphone array without reconstructing the signal.
- Analysis of audio-visual (multi-modal) recognition of speech compared to single-microphone sound capture and multi-microphone sound capture.
Task 2 : Noise independence (m1-36)
This task is clearly the backbone of the project. It deals directly with improving robustness against noise through new ways of processing the audio speech signal and of modelling the acoustic properties of speech. In this task we will address three main subtasks that cover all the steps involved in speech recognition processing:
- Acoustic feature extraction (see subjects 2 and 7).
- Acoustic models and improved recognition principles (see subjects 3 and 4).
- Noise cancellation techniques (see subject 6).
The results of this task will be exploited for the specification of the speech recognition system integrated on both fixed and mobile platforms.
Description of work
_Subject 1 : Multi-Microphone Systems._ Multi-microphone systems can, in principle, discriminate between directive sources and spatially diffuse disturbances.
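As a rough illustration of how an array can favour a directive source, the sketch below implements a basic delay-and-sum beamformer. It is not the design proposed for this task: the function name `delay_and_sum`, the restriction to integer sample delays and the assumption that the delays toward the source are known in advance are all illustrative assumptions.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average them.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   delays[i] is the (assumed known, non-negative) number of
              samples by which the target source arrives late at mic i.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    # Number of samples available on every channel once aligned.
    usable = n - max(delays)
    out = []
    for t in range(usable):
        aligned = [ch[t + d] for ch, d in zip(channels, delays)]
        out.append(sum(aligned) / len(channels))
    return out


# Hypothetical two-microphone example: the same source reaches the second
# microphone two samples later than the first.
source = [1.0, 2.0, 3.0, 4.0, 5.0]
mic0 = source + [0.0, 0.0]
mic1 = [0.0, 0.0] + source
enhanced = delay_and_sum([mic0, mic1], [0, 2])
```

Once aligned, the copies of the directive source add coherently, whereas disturbances that are uncorrelated between microphones are attenuated by the averaging.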
_Subject 2 : Advanced Signal Processing._ We propose to develop advanced signal processing models and robust algorithms, and to extract the related acoustic signal features, describing two types of nonlinear phenomena in speech: (i) modulations in formant resonances and (ii) turbulence in speech, described through its geometry (via fractals) and its nonlinear dynamics (via chaos).
_Subject 3 : Missing Data Approach (MDA) for Robust Speech Recognition._ The human auditory system copes relatively easily with missing data and can handle simultaneous signals. This is not the case for an automatic speech recognition system.
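One common way of letting a recognizer tolerate missing data is marginalization: feature components flagged as unreliable are simply left out of the likelihood computation. The text above does not prescribe this technique, so the sketch below is only an assumption about what a missing data approach could look like. It assumes a diagonal-covariance Gaussian model and a binary reliability mask; `masked_log_likelihood` and its parameters are hypothetical names.

```python
import math


def masked_log_likelihood(x, mean, var, reliable):
    """Log-likelihood of x under a diagonal Gaussian, using only the
    components marked reliable (the others are marginalized out)."""
    total = 0.0
    for xi, mi, vi, ok in zip(x, mean, var, reliable):
        if ok:
            total += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


# Hypothetical frame whose second component is swamped by noise.
frame = [0.0, 100.0]
with_mask = masked_log_likelihood(frame, [0.0, 0.0], [1.0, 1.0], [True, False])
without_mask = masked_log_likelihood(frame, [0.0, 0.0], [1.0, 1.0], [True, True])
```

With the mask, the corrupted component no longer penalizes the model that matches the reliable part of the frame, so `with_mask` is much larger than `without_mask`.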
_Subject 4 : Improved recognition principles_
Probabilistic Graphical Modelling:
To improve robustness, it is crucial to develop new probabilistic models capable of capturing all speech features and of making the best use of the available data.
Segment models:
In an effort to improve robustness, particularly under noisy conditions, we will develop new generalized HMM schemes for speech recognition.
_Subject 5 : Multi-Modal Features._ We also want to investigate multi-modal features. Commercial speech recognition systems are uni-modal, i.e. they use only features extracted from the speech signal to perform recognition. We propose to combine audio and visual cues and perform audio-visual speech recognition.
_Subject 6 : Noise cancellation._ Reduction of the effects of noise can be addressed at different levels of the speech representation.
_Subject 7 : Nonlinear feature normalization._ The acoustic mismatch between the training and test data degrades the performance of Automatic Speech Recognition (ASR) systems.
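A simple instance of nonlinear feature normalization is histogram equalization, in which each feature value is mapped, through its rank, onto a fixed reference distribution, so that training and test features end up with the same distribution. The sketch below is a minimal illustration under stated assumptions and not the method planned for this subject: the standard normal reference, the rank-to-probability rule `(rank + 0.5) / n` and the name `histogram_equalize` are all choices made for the example.

```python
from statistics import NormalDist


def histogram_equalize(values):
    """Map one feature component onto a standard normal reference by rank.

    values: list of values of the same feature over an utterance.
    Returns a list of the same length, in the original order.
    """
    n = len(values)
    # Indices of the values sorted in increasing order (ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist()  # assumed reference distribution: N(0, 1)
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Empirical CDF value at the centre of the rank interval.
        out[i] = reference.inv_cdf((rank + 0.5) / n)
    return out


# Hypothetical feature track and a monotonically distorted copy of it.
clean = [10.0, 30.0, 20.0]
distorted = [v ** 3 for v in clean]
same = histogram_equalize(clean) == histogram_equalize(distorted)
```

Because only ranks are used, any strictly increasing distortion of the feature leaves the normalized output unchanged, which is the property that a linear mean and variance normalization cannot provide.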