r4 - 06 Dec 2006 - 15:25:32 - Main.gpapan

MUSCLE WP5 Workplan

Index
Task 1: State-of-the-Art Evaluation and Roadmap
Task 2: Cross-Modal Interaction in Multimedia Problems
    Subtask 2.1: Audio-Visual Interaction for Speech Recognition
    Subtask 2.2: Cross-Modal Interaction in Human-Computer Interfaces
Task 3: Cross-Modal Integration for Multimedia Analysis and Recognition
    Subtask 3.1: Video Analysis and Integration of Asynchronous Time-evolving Modalities
    Subtask 3.2: Combining Text and Vision for Semantic Labeling of Image Data
    Subtask 3.3: Integrated Multimedia Content Analysis
    Subtask 3.4: Integration-Fusion Methods
Task 4: Dissemination of the results

Task 1: State-of-the-Art Evaluation and Roadmap

The partners will first collaborate in the review and evaluation of the current state-of-the-art in the areas spanned by the scientific and technological objectives of this WP. This task also includes an assessment of the expertise of the NOE participants in fields related to this WP. The partners will then identify the most promising problems and methodologies and dynamically update their research efforts toward the general directions outlined by the NOE objectives. A further goal of this task is to establish a roadmap and an appropriate set of criteria to assess and advise the WP on how best to achieve its objectives. Examples of such criteria include relevance to objectives, appropriate application of and reference to relevant external results, coverage of the targeted technological areas and harmonisation with relevant standardisation work.

Finally, all the partners involved in the WP will establish an efficient system of communication, mainly by electronic means, will pursue common or related technical directions within the NOE objectives, and will ensure the production of high-quality technical documentation and the synchronisation and assessment of the achievements of the WP.

Task 2: Cross-Modal Interaction in Multimedia Problems

The main objective of this task is to explore the interaction of multiple modalities, e.g. vision, speech-audio, and text, in cases where one modality is directly affected by the others and when the goal is to improve performance over single modality processing. All possible combinations of modalities and their interactions are of research interest to this NOE. However, for brevity of exposition, next we outline only two such specific problem areas: the first involves interaction between the speech and the vision modality; the second deals with many interacting modalities, i.e., vision, text, speech, and tactile information.

Subtask 2.1: Audio-Visual Interaction for Speech Recognition

Commercial automatic speech recognition (ASR) systems are uni-modal, i.e., they use only features extracted from the speech signal to perform recognition. We propose to investigate the interactive combination of audio and visual cues to perform audio-visual speech recognition, a.k.a. “lip-reading”. Recent research in this area has shown that significant improvements (error-rate reduction) can be achieved by using multimodal features. Research in this NOE will address the interaction between the vision and speech directions:

(i) Lip reading requires the solution of several image analysis and computer vision tasks, such as geometric feature detection, extracting the “region-of-interest” (mouth region) on the speaker’s face image, estimating the lip boundary, and comparing its shape with the standard shapes corresponding to various phonemes. To advance such tasks we propose to perform research on new, improved methods related to the areas of object-oriented lattice geometric filtering and detection, as well as active contours/regions driven by geometric partial differential equations (PDEs) and implemented via level set methods using fast numerical algorithms. Shape priors and statistical models will also be explored to increase the robustness of these approaches.

(ii) Research will also focus on extracting robust ASR-related features from the detected mouth region based on the interaction between the speech and the visual information, as well as on the fusion of the audio-visual features, that is, how to combine features from the audio and video streams and weight them based on their relative information content for the speech recognition task.

Overall, we expect these advances to yield substantial improvements in ASR performance.
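To make the audio-visual fusion step of Subtask 2.1 concrete, the following is a minimal sketch of one simple way to weight two streams: a linear combination of per-class log-likelihoods. This is an illustrative assumption, not the method fixed by the workplan. The function names, the weight parameter and the example scores are all hypothetical.

```python
import math


def fuse_stream_scores(audio_loglik, visual_loglik, audio_weight):
    """Combine per-class log-likelihoods from two streams.

    audio_weight lies in [0, 1]; the visual stream receives 1 - audio_weight.
    A weight of 1.0 reduces to audio-only recognition.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_loglik[label]
        + visual_weight * visual_loglik[label]
        for label in audio_loglik
    }


def recognize(audio_loglik, visual_loglik, audio_weight):
    """Return the class label with the highest fused score."""
    fused = fuse_stream_scores(audio_loglik, visual_loglik, audio_weight)
    return max(fused, key=fused.get)


# Hypothetical scores: in noise, /m/ and /n/ are acoustically confusable,
# but the closed lips of /m/ make them visually distinct.
audio = {"m": math.log(0.45), "n": math.log(0.55)}
visual = {"m": math.log(0.80), "n": math.log(0.20)}

audio_only = recognize(audio, visual, audio_weight=1.0)
audio_visual = recognize(audio, visual, audio_weight=0.5)
```

In this toy case the audio-only decision is "n", while equal weighting of the two streams switches the decision to "m". In practice the weight would be chosen from an estimate of each stream's reliability, for example the acoustic noise level.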

Subtask 2.2: Cross-Modal Interaction in Human-Computer Interfaces

Natural human-computer interaction is becoming a reality due to recent advances in speech recognition, natural language processing, object/motion detection-tracking in vision sensors, and tactile sensors. However, to reach the vision of a natural and efficient human-machine interaction, especially as we transition from the desktop to the mobile computer, several advances are still needed both in the core technology areas (e.g., speech recognition and synthesis, visual object tracking and recognition) and in the user interface areas (speech, vision, graphics, text, tactile). Combining such different modalities to improve overall performance becomes a significant technical challenge in this case, where the modalities can interact strongly. In this NOE we will focus our research work on investigating several cross-media interaction scenarios; examples include: (1) speech synthesis and recognition of faces or other visual objects, (2) speech recognition and image synthesis, (3) text I/O and speech or image modalities, (4) tactile and visual object tracking. Research in this task will be in strong collaboration with a broad, challenging application outlined in the WP on human-computer interaction: developing a multi-modal natural language interface to a WWW search engine. In this application the user will be able to perform web queries using natural language or speech input to the system. An important part of this system will be the ability to summarize the web query results and create visualizations of such results. This will require research related to this WP on: (i) the interactions among and integration of inputs from various sources (including different modalities, user profiles and past user history), and (ii) the presentation of search results via multimedia outputs (combining text, audio, graphics and visualization of search results).

Task 3: Cross-Modal Integration for Multimedia Analysis and Recognition

Nowadays we are witnessing a rapid explosion of multimedia data, produced by a variety of sources including video cameras, TV and other digital entertainment, digital audio-visual libraries, and the multimodal web. This explosion makes it increasingly difficult to find relevant information, e.g. on the web, which has spurred enormous efforts to develop tools for automatic semantic analysis of multimedia content. Most of these efforts, however, concentrate on using the available textual information and ignore other types of information. The multimedia explosion also poses several ambitious technical challenges, two of which are: (i) natural access to and interaction with multimedia databases, and (ii) analyzing and recognizing objects/events and human behavior in surveillance or sports indexing by processing combined video-audio-text data. As such, the results of the task will contribute directly to the WP on Integration.

The general objective of this task is to explore the integration-fusion of multiple modalities toward the goals of analyzing multimedia content and recognizing entities, given the information provided by several cues including visual, audio, speech, text, and tactile information. As in Task 2, we shall outline research directions in a few broad problem areas.

Subtask 3.1: Video Analysis and Integration of Asynchronous Time-evolving Modalities

Video processing is usually done separately on sound and on images. One main reason for this is that audio/speech processing and image processing belong to different scientific fields. However, the solution of many video analysis tasks can be improved and made more robust by integrating these two modalities and possibly text. Research will be performed on integrating many pieces of information that can be extracted from a video, such as sound and image segmentation, speaker segmentation, face (and other object) recognition, and speech transcription. Possible applications include video segmentation, event detection, and video summarization.

Another major problem in cross-media integration is to find a common model that is able to represent the various media and that also allows the user's tasks to be expressed. Major difficulties exist because the various media work with different temporal granularities, i.e. they are not temporally coherent, and provide very different kinds of data. For example, image and audio contents are not necessarily simultaneously present in the video data but may appear at different time instants. The work in this NOE will investigate ways of efficiently integrating such asynchronous modalities.
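As an illustration of the asynchrony problem, the sketch below pairs events detected in the audio stream with the nearest event in the video stream, provided the two lie within a tolerance window. This is only one simple alignment strategy assumed for the example. The `Event` type, the `max_lag` parameter and the sample events are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: float    # seconds from the start of the video
    modality: str  # e.g. "audio" or "video"
    label: str     # what the single-modality detector reported


def pair_within_window(audio_events, video_events, max_lag):
    """Pair each audio event with the nearest video event in time.

    A pair is kept only if the two events are at most max_lag seconds apart,
    so events with no counterpart in the other modality stay unpaired.
    """
    video_sorted = sorted(video_events, key=lambda e: e.time)
    video_times = [e.time for e in video_sorted]
    pairs = []
    for a in sorted(audio_events, key=lambda e: e.time):
        i = bisect_left(video_times, a.time)
        # The nearest video event is either just before or just after a.time.
        candidates = video_sorted[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda v: abs(v.time - a.time), default=None)
        if best is not None and abs(best.time - a.time) <= max_lag:
            pairs.append((a, best))
    return pairs


audio_events = [
    Event(3.0, "audio", "applause"),
    Event(10.2, "audio", "speaker:A"),
]
video_events = [
    Event(2.5, "video", "goal"),
    Event(11.0, "video", "face:A"),
    Event(30.0, "video", "scene cut"),
]

pairs = pair_within_window(audio_events, video_events, max_lag=1.0)
paired_labels = [(a.label, v.label) for a, v in pairs]
```

Here the speaker is heard 0.8 s before the face appears, yet the two detections are still associated. The scene cut at 30.0 s has no audio counterpart and remains unpaired.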

Subtask 3.2: Combining Text and Vision for Semantic Labeling of Image Data

The research goal here is to use structural and textual information for the semantic interpretation of image data on the Web. Achieving this goal is a challenging task considering the current state-of-the-art. Such technologies would be useful for a full semantic analysis of web content, which currently ignores anything beyond text. We assume that the input to our framework is an image within a web page. The output should be a semantic representation of the image. The expressiveness of the semantic representation will significantly affect the complexity of the semantic interpretation. We propose to perform the research in stages, starting from rough categorization and continuing with more complex representations such as those developed by the semantic web community. The research will also attempt to establish strong relations between concepts in (domain-specific) ontologies and the visual appearance of image regions or configurations of regions, by building “visual profiles” of concepts and by investigating hybrid kernels. These relations will be exploited for the automatic annotation of images, for hybrid (image and text) interactive search and for hybrid clustering. Another issue is to develop the automatic suggestion of textual annotations, which could be derived from visual similarities between annotated and non-annotated images or from object detection.
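The idea of suggesting annotations from visual similarity can be sketched as a nearest-neighbour vote: keywords of the most visually similar annotated images are proposed for a non-annotated one. The example assumes that each image is already described by a numeric feature vector; the feature values, the cosine similarity measure and the function names are hypothetical choices for illustration.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suggest_annotations(query_features, annotated, k=2, top_n=2):
    """Suggest keywords for a non-annotated image.

    annotated is a list of (feature_vector, keywords) pairs. The k most
    similar annotated images vote for their keywords, each vote weighted
    by the similarity of that image to the query.
    """
    scored = [
        (cosine_similarity(query_features, features), keywords)
        for features, keywords in annotated
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    votes = Counter()
    for similarity, keywords in scored[:k]:
        for keyword in keywords:
            votes[keyword] += similarity
    return [keyword for keyword, _ in votes.most_common(top_n)]


# Hypothetical three-dimensional visual features and keyword sets.
annotated = [
    ([0.9, 0.1, 0.0], {"beach", "sea"}),
    ([0.8, 0.2, 0.1], {"sea", "boat"}),
    ([0.0, 0.1, 0.9], {"forest"}),
]
query = [0.85, 0.15, 0.05]

top_suggestion = suggest_annotations(query, annotated, k=2, top_n=1)
```

In this toy data set the keyword "sea" is shared by both of the nearest neighbours and is therefore the top suggestion. Suggestions of this kind would be offered to a human annotator rather than accepted automatically.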

Subtask 3.3: Integrated Multimedia Content Analysis

The knowledge of a typical multimedia content analysis system consists of several definitions of entities, expressed in a typical representational framework (e.g. ontologies). The objective of this task is the analysis, design and implementation of algorithms and tools that recognize these entities in a given multimedia document or user interface environment by fusing the information provided by single-modality recognizers, each tuned on a single cue drawn from the available visual cues, audio-speech cues, and/or text. The process of multimedia content analysis will be the following. A time-line will be defined, representing the time sequence of the narrative world. Each single-cue recognizer will insert the entities it recognizes at the corresponding time instants of the time-line (described in a typical descriptive language and recorded in the time-line). Then, the cross-modal recognition will analyze the time-line and extract the final recognized entities, while taking into account the specific context inferred from the user actions and/or the environmental conditions.
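The time-line process described above can be sketched as follows. Each single-cue recognizer contributes detections at time instants, and a cross-modal step then decides on one entity per instant. The fusion rule used here (summing confidences per entity and applying a threshold) is an assumption made for the example; the workplan does not prescribe a rule. All names and values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    time: int          # index of the time instant on the time-line
    entity: str        # entity reported by the recognizer
    cue: str           # which single-cue recognizer produced it
    confidence: float  # recognizer confidence in [0, 1]


def build_timeline(detections):
    """Group the detections of all recognizers by time instant."""
    timeline = defaultdict(list)
    for detection in detections:
        timeline[detection.time].append(detection)
    return timeline


def fuse_timeline(timeline, threshold):
    """Pick one entity per time instant from the pooled detections.

    Confidences for the same entity are summed across cues. The best
    entity is kept only if its summed score reaches the threshold.
    """
    recognized = {}
    for time in sorted(timeline):
        scores = defaultdict(float)
        for detection in timeline[time]:
            scores[detection.entity] += detection.confidence
        entity, score = max(scores.items(), key=lambda item: item[1])
        if score >= threshold:
            recognized[time] = entity
    return recognized


detections = [
    Detection(0, "anchor", "visual", 0.6),
    Detection(0, "anchor", "audio", 0.5),
    Detection(0, "reporter", "text", 0.7),
    Detection(1, "reporter", "visual", 0.4),
]

recognized = fuse_timeline(build_timeline(detections), threshold=0.8)
```

At instant 0 the text recognizer alone favours "reporter", but the agreement of the visual and audio cues outweighs it. At instant 1 a single weak detection is not enough to pass the threshold, so no entity is reported.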

Subtask 3.4: Integration-Fusion Methods

Several approaches will be examined and used for the above cross-modal integration problems, including intelligent systems, statistical analysis and stochastic models. The intelligent-system architectures are composed of a subsymbolic and a symbolic part, which interact with each other during training, operation and adaptation. Over the past years, neural network approaches have been successfully combined with fuzzy and symbolic processing techniques. Hybrid connectionist-symbolic models constitute a promising approach towards developing more robust and versatile intelligent systems. The main focus of such models is the relation of symbolic descriptions of multimedia objects, in terms of rules and conventional representations, to subsymbolic numeric descriptions used by neural networks for learning and adaptivity. Research will focus on the real challenge of integrating learning methods with complex representations.

Whereas the common modalities of multimedia are speech, vision and text, none of these sensory modes in raw format may be particularly suited for the integration. Thus, semantic context and expert knowledge as a 'virtual' mode will also be explored, since it has great potential for integrating the other modes. In video analysis, work will also be done on the use of extended hidden Markov models (HMMs) to integrate all the pieces of information that can be extracted from a video (sound and video segmentation, speaker segmentation, face recognition, speech transcription). HMMs are already widely used in sound and speech processing and are a good candidate for a global representation of video streams, since they should be able to integrate any source of mono- or multi-modal information.

Further, for various cross-modal integration problems, additional data fusion methods will also be examined, including Bayesian inference, hierarchical models, and directed acyclic graphs, with computations carried out using tree-structured algorithms, EM, MCMC, and particle filters (sequential Monte Carlo).
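As a small illustration of how an HMM can integrate several sources of information, the sketch below runs standard Viterbi decoding over observations that carry one symbol per modality. It assumes that the modalities are conditionally independent given the hidden state, so their emission log-probabilities simply add. This independence assumption, the two-state model and all probabilities are hypothetical; the "extended" HMMs mentioned above may well differ.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence for multi-modal observations.

    Each observation is a dict {modality: symbol}. The emission score of a
    state is the sum of its per-modality log-probabilities, i.e. the
    modalities are assumed conditionally independent given the state.
    """
    def emit(state, obs):
        return sum(log_emit[state][m][symbol] for m, symbol in obs.items())

    best = {s: log_start[s] + emit(s, observations[0]) for s in states}
    back_pointers = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + emit(s, obs)
            pointers[s] = p
        back_pointers.append(pointers)

    # Trace the best path backwards from the most likely final state.
    path = [max(states, key=best.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def log_table(probs):
    """Convert a dict of probabilities into log-probabilities."""
    return {key: math.log(p) for key, p in probs.items()}


states = ["speech", "music"]
log_start = log_table({"speech": 0.5, "music": 0.5})
log_trans = {
    "speech": log_table({"speech": 0.8, "music": 0.2}),
    "music": log_table({"speech": 0.2, "music": 0.8}),
}
log_emit = {
    "speech": {
        "audio": log_table({"voiced": 0.7, "tonal": 0.3}),
        "video": log_table({"face": 0.8, "no_face": 0.2}),
    },
    "music": {
        "audio": log_table({"voiced": 0.4, "tonal": 0.6}),
        "video": log_table({"face": 0.3, "no_face": 0.7}),
    },
}
observations = [
    {"audio": "voiced", "video": "face"},
    {"audio": "tonal", "video": "face"},
    {"audio": "tonal", "video": "no_face"},
    {"audio": "tonal", "video": "no_face"},
]

path = viterbi(observations, states, log_start, log_trans, log_emit)
```

At the second time step the audio symbol alone would favour "music", but the visible face and the transition model keep the decoded segment as "speech"; the decoded path switches to "music" only once both modalities agree.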

Task 4: Dissemination of the results

To gain wide applicability and to promote visibility, the partners involved in this WP will pay special attention to dissemination activities. This task will focus on dissemination of the underlying scientific and technological ideas and results via demonstrations at public events and on the WWW, presentations at international conferences, and publications in scientific journals. It is also anticipated that the scientific and technological results will be continuously monitored and assessed, so that potential project partners and other research groups (in Europe and outside Europe) will be informed and attracted for possible future collaboration.