WP2 : User Robustness
WP objectives
This WP addresses the second objective of the project, namely robustness against speaker variability. Depending on the level of distortion between the user's voice and the averaged voices, different strategies can be used to adapt the recognition system. For a small mismatch between the user and the voices used to train the system, we expect that an improvement in speech processing will lead directly to better speaker independence. When the mismatch is stronger, the machine must progressively learn from the human voice and adapt its global reference system to the specificities of this voice. In this work package we will investigate, through three different tasks, three types of adaptation techniques that correspond to an increasing level of voice mismatch with the references. The last task of the work package will be devoted to the study of robustness in the management of the user interaction.
WP duration
The work package will begin at month 1 and finish at month 36.
WP milestones
The workplan is structured around four main milestones :
M2.1, m3 Selection of suitable benchmark database(s)
M2.2, m10 Completion of baseline experiments on user robustness
M2.3, m21 Completion of phase I experiments on user robustness
M2.4, m33 Completion of phase II experiments on user robustness
WP tasks
The WP is divided into four main tasks :
Task 1 : Speaker independence (m1-36)
The target of this task is to identify the main methods for building recognition systems that can directly accept a broad range of users, without an adaptation step. The expected result is the capability to offer the best recognition performance directly to most of the population. Within the frame of our project, this work will be closely related to the WP1 task 2 activities, as we will evaluate how an optimisation in speech processing could lead to improved speaker independence. Consequently, the work will consist in adapting and evaluating the concept of advanced signal and speech processing with respect to the objective of speaker independence, and in finding the optimal trade-off between the robustness improvement obtained through speech processing and robustness toward users.
Task 2 : Speaker adaptation (m1-36)
In this second task we will investigate techniques that allow a system to be adapted rapidly to a user. Adaptation techniques have become an important tool in today’s large vocabulary speech recognition systems, since they can significantly improve recognition performance under mismatched conditions (accented or "low" non-native speech). Most of the adaptation techniques that have appeared in the literature belong to one of two main categories: transformation-based and Bayesian. In HIWIRE we intend to go beyond the state of the art for these two methods (subject 1). Another very interesting field we intend to investigate is adaptation to the speaker directly on the speech signal (subject 2).
Task 3 : Non native speech (m1-36)
Non-native speech is a domain that covers very distinct realities, since it depends strongly on the native language of the speaker and, above all, on the speaker's familiarity with the command language. In this task we will investigate solutions for the difficult cases in which the speaker adaptation proposed in task 2 cannot be effective because of a large mismatch between the phoneme set used by the speaker and the phoneme set of the system. Our work will be oriented toward phonetic adaptation (subjects 3 and 4).
Task 4 : Robust interaction mechanisms (m1-36)
In order to create flexible, user-friendly and effective speech-based interfaces it is necessary to incorporate robust language processing and dialogue management system components. Only in the simplest situation can the inputs and outputs from the core speech engines be mapped easily and directly onto the appropriate application functions. In more complex applications, users find it difficult to remember what they can say, so a process is required that is able to interpret the meaning of an input utterance in the context of the functionality of the application. Similarly, only the simplest applications enable a user to define their objective in one single utterance. More usually, the user provides information in a piecemeal fashion, and a dialogue management component is needed in order to manage the turn-by-turn interaction.
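The piecemeal, turn-by-turn interaction described above can be illustrated with a very small slot-filling loop. This is only a sketch under stated assumptions: the class name `SlotFillingDialogue`, the slot names and the `("ask", ...)` / `("execute", ...)` action format are hypothetical, and the language-understanding step that turns an utterance into slot/value pairs is assumed to exist upstream.

```python
from dataclasses import dataclass, field


@dataclass
class SlotFillingDialogue:
    """Hypothetical minimal dialogue manager: collects required slots over turns."""
    required_slots: tuple
    filled: dict = field(default_factory=dict)

    def update(self, parsed_input):
        """Merge the slots understood in one user turn and return the next action."""
        for slot, value in parsed_input.items():
            if slot in self.required_slots:
                self.filled[slot] = value
        # Ask for the first slot that is still missing, if any.
        missing = [s for s in self.required_slots if s not in self.filled]
        if missing:
            return ("ask", missing[0])
        # All information has been gathered: hand over to the back-end.
        return ("execute", dict(self.filled))
```

For instance, with required slots `("action", "object")`, a first turn that only supplies the action leads the manager to ask for the object, and a second turn supplying it triggers execution.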
Previous research by the partners in this project has found that it is useful to invoke the concept of ‘agency’ as a wrapper for these key components of a robust next-generation conversational interface. This is important, not only in order to recognise the close interactions between these components, but also because it is the conversational agent that provides the interface to the ‘back-end’ application.
Advanced speech-based applications can be categorised according to one or more of the following ‘back-end’ functions:
- command & control (C2) applications which involve performing and responding to actions in the application environment; the back-end is analogous to a piece of machinery.
- information applications in which knowledge/data is stored and accessed by voice; the back-end is ‘structured’ information as in a formal SQL-type database or ‘unstructured’ information as in a web page or an arbitrary piece of text.
- communication applications in which messaging information is transmitted, received and transformed between different channels and modes; the back-end may be another user.
The first of these covers traditional C2 applications, the second covers a wide range of ‘infomedia’ type applications such as searching an archive, telephone enquiry systems, information kiosks or voice web browsing, and the third covers mode conversions such as speech-to-text, text-to-speech and speech-to-speech (spoken language translation).
Description of Work
_Subject 1 : Acoustic Model Adaptation to Speakers._ Two main speaker adaptation techniques are widely used : transformation-based adaptation (for example MLLR, maximum likelihood linear regression) and Bayesian adaptation (MAP, maximum a posteriori).
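As a point of reference for the Bayesian family, the standard MAP update of Gaussian means interpolates between the speaker-independent mean and the speaker's data, weighted by how much data each Gaussian has seen. The sketch below assumes that frame-level occupation posteriors are already available; the function name `map_adapt_means` and the prior weight `tau` (often called a relevance factor) are illustrative choices, not values taken from this proposal.

```python
import numpy as np


def map_adapt_means(prior_means, frames, posteriors, tau=10.0):
    """MAP re-estimation of Gaussian means (sketch).

    prior_means: (M, D) speaker-independent means
    frames:      (T, D) adaptation feature vectors
    posteriors:  (T, M) occupation probability of each Gaussian for each frame
    tau:         assumed prior weight; larger values stay closer to the prior
    """
    occupancy = posteriors.sum(axis=0)      # (M,) soft frame count per Gaussian
    weighted_sum = posteriors.T @ frames    # (M, D) posterior-weighted sum of frames
    return (tau * prior_means + weighted_sum) / (tau + occupancy)[:, None]
```

A Gaussian that receives no adaptation data keeps its prior mean, which is why MAP behaves well with large amounts of data but adapts slowly when data are scarce.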
Improvement thanks to hierarchical structures:
The basic principle of MLLR adaptation is to adapt the parameters of the acoustic models by calculating one or several transformations from the speaker specific data.
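As an illustration of this principle, a single global transformation of the Gaussian means can be estimated in closed form. The sketch below assumes identity covariances, so that the maximum likelihood estimate reduces to a weighted least-squares problem, and assumes that occupation posteriors are already available from an alignment. The function names `estimate_global_mllr` and `apply_mllr` are hypothetical and do not belong to any toolkit.

```python
import numpy as np


def estimate_global_mllr(means, frames, posteriors):
    """Estimate one affine mean transform W = [b A] (sketch, identity covariances).

    means:      (M, D) speaker-independent Gaussian means
    frames:     (T, D) speaker-specific adaptation frames
    posteriors: (T, M) occupation probability of each Gaussian for each frame
    Returns W of shape (D, D + 1) such that the adapted mean is W @ [1, mu].
    """
    n_gauss = means.shape[0]
    xi = np.hstack([np.ones((n_gauss, 1)), means])   # extended means [1, mu]
    occupancy = posteriors.sum(axis=0)               # (M,)
    weighted_sum = posteriors.T @ frames             # (M, D)
    z = weighted_sum.T @ xi                          # (D, D+1)
    g = (xi * occupancy[:, None]).T @ xi             # (D+1, D+1), symmetric
    # W = z @ inv(g); solved as a linear system rather than by explicit inversion.
    return np.linalg.solve(g, z.T).T


def apply_mllr(means, w):
    """Apply the transform to every mean: mu' = A mu + b."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ w.T
```

With several transformations, the same estimate would be computed separately for each group of Gaussians sharing a transform; the matrix `g` is only invertible when the group has seen enough data, which is one motivation for organising the transformations hierarchically.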
Bayes optimal classification criterion:
This criterion suggests that a weighted average of the posterior probabilities over the possible values of the parameters can be used, where the weights are the conditional probabilities of the test data given the model parameters.
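The weighted average can be made concrete with a small numerical sketch. It assumes a finite set of candidate parameter values with a uniform prior over them, so that the normalised weights are proportional to the likelihood of the test data under each candidate; the function name `bayes_averaged_posteriors` is hypothetical.

```python
import numpy as np


def bayes_averaged_posteriors(class_posteriors, data_log_likelihoods):
    """Average class posteriors over candidate parameter values (sketch).

    class_posteriors:     (K, C) posterior of each class under each of K candidates
    data_log_likelihoods: (K,)   log-likelihood of the test data under each candidate
    Returns the (C,) likelihood-weighted average of the posteriors.
    """
    log_w = np.asarray(data_log_likelihoods, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()
    return weights @ np.asarray(class_posteriors, dtype=float)
```

The decision is then taken on the averaged posteriors rather than on the posteriors of a single point estimate of the parameters.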
_Subject 2 : Speaker Adaptation on the Speech Signal Level._ Another point that we would like to investigate is the possibility to perform adaptation at the speech signal level.
_Subject 3 : Phonetic Models for Non-Native Speech._ The non-native case can be viewed as an extreme case of speaker adaptation.
_Subject 4 : Principal component analysis for non native and accented speech._ A major limitation of all current generation speech technology is the amount of effort involved in developing versions for different languages.
_Subject 5 : Robust interaction management._
The key technologies required for implementing a conversational interface agent are:
- spatio-temporal processing
_Subject 6 : Speaker clustering._ To build an efficient system for a large range of users, one avenue to explore is a clustering of the speakers used in the training stage.
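One simple way to picture such a clustering is k-means over one vector per training speaker. This is a sketch under assumptions, not the method proposed here: it supposes that each speaker is summarised by a fixed-length vector (for example an average feature vector), uses squared Euclidean distance, and starts from a deterministic farthest-point initialisation. The function name `cluster_speakers` is hypothetical.

```python
import numpy as np


def cluster_speakers(speaker_vectors, n_clusters, n_iters=50):
    """Group training speakers with plain k-means (sketch).

    speaker_vectors: (S, D) one summary vector per training speaker
    Returns (labels, centroids) with shapes (S,) and (n_clusters, D).
    """
    x = np.asarray(speaker_vectors, dtype=float)

    def squared_distances(centroids):
        return ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

    # Farthest-point initialisation: start from the first speaker, then
    # repeatedly add the speaker farthest from all centroids chosen so far.
    centroids = x[:1].copy()
    while centroids.shape[0] < n_clusters:
        nearest = squared_distances(centroids).min(axis=1)
        centroids = np.vstack([centroids, x[np.argmax(nearest)]])

    for _ in range(n_iters):
        labels = squared_distances(centroids).argmin(axis=1)
        updated = centroids.copy()
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                updated[k] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated

    labels = squared_distances(centroids).argmin(axis=1)
    return labels, centroids
```

One acoustic model could then be trained per cluster, and a new user assigned to the cluster whose centroid is closest.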