Computer Vision, Speech Communication & Signal Processing Group

ICIP 2019

Multisensory Video Processing and Learning
for Human-Robot Interaction

Tutorial 14 Title: Multisensory Video Processing and Learning for Human-Robot Interaction

Abstract: In many human-robot interaction (HRI) application areas that need multisensory processing and recognition, multimodality occurs naturally and cross-modal integration improves performance. This tutorial addresses the multisensory spatio-temporal processing of visual information and its fusion with the speech/audio modality, as applied to two HRI areas: assistive and social robotics. Our coverage will include theory, algorithms and a rich variety of integrated applications for specific user groups such as elderly users and children. The area poses many challenges, including the degree of familiarity of these users with new technologies and the domain-specific datasets required for training user-oriented models. Modern assistive and social HRI requires multimodal communication through speech, gestures and human movements, which enhances the classic interaction based on spoken commands alone. This tutorial will present state-of-the-art work on multisensory and visual processing, as well as machine learning models that can be trained effectively with a relatively small amount of data, which is very important when dealing with elderly users and children.

Moreover, our information society is witnessing a very rapid expansion of multimodal and multisensory content, with huge volumes of multimedia content being created continuously. As a result, multimodal processing technologies have become increasingly relevant. Computer vision techniques, despite recent advances, still lag significantly behind human ability in understanding real-life scenes and performing demanding robotic tasks. Motivated by the multimodal way humans perceive their environment, complementary information sources have been used successfully in many applications, such as human action recognition, where audio-visual cues pose many challenges at the level of features, information stream modeling and fusion.

Afterwards, we will focus on the major application area, Human-Robot Interaction, for social, edutainment and healthcare applications, including audio-gestural command recognition and multi-view human action recognition.

Related papers and current results can be found in and

Date/Time: Sunday, September 22, 2019; 14:00-17:30


Petros Maragos
Petros Koutras

Primary Contact: Petros Maragos
IRAL-CVSP, National Technical Univ. of Athens,
Zografou campus, Athens 15773
Phone: +30 210772-2360, Fax: +30 210772-3397

Tutorial Slides

Introduction: Multisensory Video Processing and Learning For Human-Robot Interaction
Part 1: Spatio-Temporal Visual Processing
Part 2: Audio-Visual Processing, Fusion and Perception
Parts 3 and 4: Audio-Visual HRI: Methodology and Applications in Assistive Robotics
Part 5: Audio-Visual HRI in Social Robotics for Child-Robot Interaction
List of References


This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call Research - Create - Innovate (project code: T1EDK-01248, i-Walk).

Last modified: Wednesday, 02 October 2019 | Created by Nassos Katsamanis and George Papandreou