Computer Vision, Speech Communication & Signal Processing Group
NTUA | ECE

Audio-Visual Saliency Modeling: Eye-tracking Data



Database Overview

For experimental evaluation with eye-tracking data, and since only a few databases with audiovisual eye-tracking data exist, we collected such data for two databases: SumMe [1] and ETMD [2]. The SumMe database contains 25 unstructured videos, while ETMD contains 12 videos drawn from six different Hollywood movies; together they comprise 37 videos totaling approximately 2 hours and 171,000 frames. The participants and the videos were each split into two equal groups, with each group of participants viewing one group of videos; thus, each video was seen by 10 different subjects. The subjects were recruited through the National Technical University of Athens, with ages ranging from 23 to 55 (mean 35). Almost all subjects were naive as to the purposes of the experiment, and all had normal vision. The videos ranged from 38 to 388 s in length and were converted from their original sources to MOV format.

Figure 1: Sample video frames with eye-tracking data from SumMe database, along with the distribution of eye-tracking data for the whole video.

Data collection procedure

Eye movements were monitored binocularly with an SR Research EyeLink 2000 desktop-mounted eye-tracker at a 1000 Hz sampling rate. Videos were displayed on a 1600 x 900 monitor at a 90 cm distance from the viewer, and audio was delivered in stereo through headphones. A chin and head rest was used during the experiment to minimize viewer movement and avoid repeated calibration. Presentation was controlled using the SR Research Experiment Builder software. Participants were informed only that they would watch some videos and that they should avoid moving during video playback. The order of the clips was randomized across participants. The whole experimental procedure took approximately 90 min per participant, including instructions, calibration, testing, and short breaks when needed.

Regarding calibration, a 13-point binocular calibration preceded the experiment. Before each video, if the central fixation error exceeded a predefined threshold of 0.5°, a full calibration was repeated. The central fixation marker also served as a cue for the participant and offered an optional break point in the procedure. After the central fixation check, the start of each trial was triggered manually. Regarding post-processing, the raw 1000 Hz eye-tracking recordings were downsampled to match each video's frame rate. Figures 1 and 2 show, for the SumMe and ETMD databases respectively, one sample frame per video with the corresponding eye-tracking data superimposed, along with the distribution of eye-tracking data over the whole video. The data are publicly released and can be downloaded using the link below.
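The page does not specify how the 1000 Hz recordings were mapped to frames; the sketch below shows one plausible scheme in Python, averaging all raw gaze samples that fall inside each frame's time window (the function name and default parameters are hypothetical, not part of the released toolchain):

    import numpy as np

    def gaze_per_frame(samples, sample_rate=1000.0, fps=25.0):
        """Average the raw (x, y) gaze samples inside each frame's window."""
        t = np.arange(len(samples)) / sample_rate        # sample timestamps (s)
        n_frames = int(np.ceil(t[-1] * fps))
        frame_idx = np.minimum((t * fps).astype(int), n_frames - 1)
        out = np.full((n_frames, 2), np.nan)             # NaN where no samples
        for f in range(n_frames):
            mask = frame_idx == f
            if mask.any():
                out[f] = samples[mask].mean(axis=0)      # mean gaze in frame f
        return out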

Figure 2: Sample video frames with eye-tracking data from ETMD database, along with the distribution of eye-tracking data for the whole video.

Data annotation

SumMe database


- 25 unstructured videos from YouTube and similar sources (two of them without audio)
- eye-tracking data from 10 viewers
- free viewing
- video list:
Air_Force_One.mp4
Base_jumping.mp4
Bearpark_climbing.mp4
Bike_Polo.mp4
Bus_in_Rock_Tunnel.mp4
car_over_camera.mp4
Car_railcrossing.mp4
Cockpit_Landing.mp4
Cooking.mp4
Eiffel_Tower.mp4
Excavators_river_crossing.mp4
Fire_Domino.mp4
Jumps.mp4
Kids_playing_in_leaves.mp4
Notre_Dame.mp4
Paintball.mp4
paluma_jump.mp4
playing_ball.mp4
Playing_on_water_slide.mp4
Saving_dolphines.mp4
Scuba.mp4
Statue_of_Liberty.mp4
St_Maarten_Landing.mp4
Uncut_Evening_Flight.mp4
Valparaiso_Downhill.mp4

ETMD database


- 12 Hollywood movie videos
- eye-tracking data from 10 viewers
- free viewing
- video list:
CHI_1_color.avi
CHI_2_color.avi
CRA_1_color.avi
CRA_2_color.avi
DEP_1_color.avi
DEP_2_color.avi
FNE_1_color.avi
FNE_2_color.avi
GLA_1_color.avi
GLA_2_color.avi
LOR_1_color.avi
LOR_2_color.avi

Eye-tracking data structure


Folder: audio_stereo
- Contains the stereo audio tracks in WAV format (16-bit, 44,100 Hz)

Folder: audio_mono
- Contains the one-channel audio tracks in WAV format (16-bit, 44,100 Hz); the same tracks as above, converted to mono (see the sketch below)
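The exact stereo-to-mono conversion is not stated; a plausible reproduction is a simple average downmix of the two channels, sketched here (file names are illustrative):

    import numpy as np
    from scipy.io import wavfile

    # Read a 16-bit stereo track, average the two channels, write mono.
    rate, stereo = wavfile.read("audio_stereo/Base_jumping.wav")  # (N, 2) int16
    mono = stereo.astype(np.float32).mean(axis=1)
    wavfile.write("audio_mono/Base_jumping.wav", rate, mono.astype(np.int16))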

Folder: video
- Contains the videos in their original format for both the SumMe and ETMD databases

Folder: eyetracking
- Contains the collected eye-tracking data as a MATLAB structure, all_videos.mat

eye_data_all.(video_name) --> 3D array: 2 x Nframes x Nparticipants

1st dimension: (x,y) coordinates
2nd dimension: frame number
3rd dimension: participant number (from 1 to 10)

For example, eye_data_all.Base_jumping(:,456,4) returns the (x,y) coordinates of the eye-tracking data for frame 456 as viewed by the 4th participant.
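For users working outside MATLAB, a minimal Python sketch for reading the structure with scipy (assuming all_videos.mat is stored in a non-v7.3 MAT format, which scipy.io.loadmat can read):

    import scipy.io as sio

    mat = sio.loadmat("all_videos.mat", squeeze_me=True, struct_as_record=False)
    eye_data_all = mat["eye_data_all"]      # MATLAB struct -> object with fields

    gaze = eye_data_all.Base_jumping        # 2 x Nframes x Nparticipants array
    x, y = gaze[:, 455, 3]                  # frame 456, participant 4 (0-based)
    print(f"participant 4, frame 456: x={x}, y={y}")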

Download

You can download the database by clicking this link:

[Download Database]

If you use the corpus, please cite:

A. Tsiami, P. Koutras, A. Katsamanis, A. Vatakis and P. Maragos, "A behaviorally inspired fusion approach for computational audiovisual saliency modeling", Signal Processing: Image Communication, 2019, pp. 186-200.

For more information, please email Antigoni Tsiami (antsiami@cs.ntua.gr).

References

[1] M. Gygli, H. Grabner, H. Riemenschneider and L. Van Gool, "Creating summaries from user videos", in Proc. European Conf. on Computer Vision, 2014, pp. 505-520.

[2] P. Koutras and P. Maragos, "A perceptually based spatio-temporal computational framework for visual saliency estimation", Signal Processing: Image Communication, 2015, pp. 15-31.

Last modified: Friday, 28 January 2022 | Created by Nassos Katsamanis and George Papandreou