Computer Vision, Speech Communication & Signal Processing Group

Multisensory Video Processing and Learning
for Human-Robot Interaction

Tutorial 14 Title: Multisensory Video Processing and Learning for Human-Robot Interaction

Abstract: In many human-robot interaction (HRI) application areas that require multisensory processing and recognition, multimodality occurs naturally and cross-modal integration increases performance. This tutorial addresses the multisensory spatio-temporal processing of visual information and its fusion with the speech/audio modality, as applied to two HRI areas: assistive and social robotics. Our coverage will include theory, algorithms and a rich variety of integrated applications for specific groups such as elderly users and children. This area poses many challenges, including how familiar these users are with new technologies and the need for domain-specific datasets to train user-oriented models. Modern assistive and social HRI requires multimodal communication through speech, gestures and human movements, enhancing the classic interaction based only on spoken commands. This tutorial will present state-of-the-art work on multisensory and visual processing, along with machine learning models that can be trained effectively with a relatively small amount of data, which is very important when dealing with elderly users and children.

Moreover, in our information society, multimodal and multisensory content is expanding very rapidly, with huge volumes of multimedia content being created continuously. As a result, multimodal processing technologies have become increasingly relevant. Computer vision techniques, despite recent advances, still lag significantly behind human ability in understanding real-life scenes and performing demanding robotic tasks. Motivated by the multimodal way humans perceive their environment, researchers have successfully used complementary information sources in many applications, such as human action recognition, where audio-visual cues pose many challenges at the level of features, information stream modeling and fusion.

Afterwards, we will focus on the major application area, Human-Robot Interaction, for social, edutainment and healthcare applications, including audio-gestural command recognition and multi-view human action recognition.

Related papers and current results can be found at http://cvsp.cs.ntua.gr and http://robotics.ntua.gr.

Date/Time: Sunday, September 22, 2019; 14:00-17:30

Presenters

Petros Maragos
Petros Koutras

Primary Contact: Petros Maragos
IRAL-CVSP, National Technical Univ. of Athens,
Zografou campus, Athens 15773
maragos@cs.ntua.gr
Phone: +30 210772-2360, Fax: +30 210772-3397

Tutorial Slides

Introduction: Multisensory Video Processing and Learning For Human-Robot Interaction
Part 1: Spatio-Temporal Visual Processing
Part 2: Audio-Visual Processing, Fusion and Perception
Part 3 and 4: Audio-Visual HRI: Methodology and Applications in Assistive Robotics
Part 5: Audio-Visual HRI in Social Robotics for Child-Robot Interaction
List of References

This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call Research - Create - Innovate (project code: T1EDK-01248, i-Walk).

ICIP 2019
