2022
Conference Papers
-
Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation
Michal Balazia, Philipp Müller, Ákos Levente Tánczos, August von Liechtenstein, François Brémond
Proceedings of the 30th ACM International Conference on Multimedia, pp. 70–79, 2022.
Body language is an eye-catching social signal and its automatic analysis can significantly advance artificial intelligence systems to understand and actively participate in social interactions. While computer vision has made impressive progress in low-level tasks like head and body pose estimation, the detection of more subtle behaviors such as gesturing, grooming, or fumbling is not well explored. In this paper we present BBSI, the first set of annotations of complex Bodily Behaviors embedded in continuous Social Interactions in a group setting. Based on previous work in psychology, we manually annotated 26 hours of spontaneous human behavior in the MPIIGroupInteraction dataset with 15 distinct body language classes. We present comprehensive descriptive statistics on the resulting dataset as well as results of annotation quality evaluations. For automatic detection of these behaviors, we adapt the Pyramid Dilated Attention Network (PDAN), a state-of-the-art approach for human action detection. We perform experiments using four variants of spatial-temporal features as input to PDAN: Two-Stream Inflated 3D CNN, Temporal Segment Networks, Temporal Shift Module and Swin Transformer. Results are promising and indicate great room for improvement in this difficult task. Representing a key piece in the puzzle towards automatic understanding of social behavior, BBSI is fully available to the research community.
Paper: balazia22_mm.pdf
Paper Access: https://doi.org/10.1145/3503161.3548363
@inproceedings{balazia22_mm, author = {Balazia, Michal and M\"{u}ller, Philipp and T\'{a}nczos, \'{A}kos Levente and Liechtenstein, August von and Br\'{e}mond, Fran\c{c}ois}, title = {Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation}, year = {2022}, url = {https://doi.org/10.1145/3503161.3548363}, doi = {10.1145/3503161.3548363}, booktitle = {Proceedings of the 30th ACM International Conference on Multimedia}, pages = {70--79} }
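Note: the detection approach described in the abstract feeds pre-extracted spatio-temporal clip features (I3D, TSN, TSM or Swin Transformer) into a temporal detection network. The snippet below is only a minimal illustrative sketch of such a detection head (a stack of dilated 1-D convolutions over a feature sequence producing per-timestep scores for the 15 behaviour classes), not the authors' PDAN implementation; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class DilatedTemporalHead(nn.Module):
    """Toy temporal detection head over pre-extracted clip features.

    Input:  (batch, feat_dim, T) feature sequence, e.g. one vector per clip.
    Output: (batch, num_classes, T) per-timestep class logits.
    """
    def __init__(self, feat_dim=1024, hidden=256, num_classes=15, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers, in_ch = [], feat_dim
        for d in dilations:
            # padding == dilation keeps the temporal length unchanged for kernel size 3
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=d, dilation=d), nn.ReLU()]
            in_ch = hidden
        self.backbone = nn.Sequential(*layers)
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):
        return self.classifier(self.backbone(x))

# Random stand-in for a sequence of clip features from one recording.
features = torch.randn(2, 1024, 300)      # (batch, feat_dim, timesteps)
logits = DilatedTemporalHead()(features)  # (2, 15, 300)
print(logits.shape)
```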
Technical Reports
-
MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions
Philipp Müller, Dominik Schiller, Dominike Thomas, Michael Dietz, Hali Lindsay, Patrick Gebhard, Elisabeth André, Andreas Bulling
arXiv:2209.09578, pp. 1–6, 2022.
Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel detection and agreement estimation from backchannels in group conversations. This paper describes the MultiMediate challenge and presents a novel set of annotations consisting of 7234 backchannel instances for the MPIIGroupInteraction dataset. Each backchannel was additionally annotated with the extent to which it expresses agreement towards the current speaker. In addition to an analysis of the collected annotations, we present baseline results for both challenge tasks.
@techreport{mueller22_arxiv, title = {MultiMediate'22: Backchannel Detection and Agreement Estimation in Group Interactions}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Dietz, Michael and Lindsay, Hali and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2022}, pages = {1--6}, url = {http://arxiv.org/abs/2209.09578} }
2021
Conference Papers
-
MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation
Philipp Müller, Dominik Schiller, Dominike Thomas, Guanhua Zhang, Michael Dietz, Patrick Gebhard, Elisabeth André, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 4878–4882, 2021.
Artificial mediators are promising to support human group conversations but at present their abilities are limited by insufficient progress in group behaviour analysis. The MultiMediate challenge addresses, for the first time, two fundamental group behaviour analysis tasks in well-defined conditions: eye contact detection and next speaker prediction. For training and evaluation, MultiMediate makes use of the MPIIGroupInteraction dataset consisting of 22 three- to four-person discussions as well as of an unpublished test set of six additional discussions. This paper describes the MultiMediate challenge and presents the challenge dataset including novel fine-grained speaking annotations that were collected for the purpose of MultiMediate. Furthermore, we present baseline approaches and ablation studies for both challenge tasks.
Paper: mueller21_mm.pdf
@inproceedings{mueller21_mm, title = {MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Zhang, Guanhua and Dietz, Michael and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2021}, pages = {4878--4882}, doi = {10.1145/3474085.3479219}, booktitle = {Proc. ACM Multimedia (MM)} }
2020
Conference Papers
-
Anticipating Averted Gaze in Dyadic Interactions
Philipp Müller, Ekta Sood, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1-10, 2020.
We present the first method to anticipate averted gaze in natural dyadic interactions. The task of anticipating averted gaze, i.e. that a person will not make eye contact in the near future, remains unsolved despite its importance for human social encounters as well as a number of applications, including human-robot interaction or conversational agents. Our multimodal method is based on a long short-term memory (LSTM) network that analyses non-verbal facial cues and speaking behaviour. We empirically evaluate our method for different future time horizons on a novel dataset of 121 YouTube videos of dyadic video conferences (74 hours in total). We investigate person-specific and person-independent performance and demonstrate that our method clearly outperforms baselines in both settings. As such, our work sheds light on the tight interplay between eye contact and other non-verbal signals and underlines the potential of computational modelling and anticipation of averted gaze for interactive applications.
@inproceedings{mueller20_etra, title = {Anticipating Averted Gaze in Dyadic Interactions}, author = {Müller, Philipp and Sood, Ekta and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391332}, pages = {1-10} }
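Note: as a rough illustration of the kind of model the abstract describes, the sketch below shows a small LSTM that consumes a window of per-frame multimodal features and outputs the probability that gaze will be averted at a chosen future horizon. Feature dimensions, window length and frame rate are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class AvertedGazeAnticipator(nn.Module):
    """Toy LSTM classifier: given a window of multimodal features
    (facial cues, speaking activity, ...), predict the probability that
    gaze will be averted at a chosen future horizon."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, window_len, feat_dim)
        _, (h_n, _) = self.lstm(x)         # last hidden state summarises the window
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)

# One prediction per 3-second window (assuming 30 fps, i.e. 90 frames),
# for a fixed anticipation horizon such as one second into the future.
window = torch.randn(4, 90, 32)            # (batch, frames, features)
p_averted = AvertedGazeAnticipator()(window)
print(p_averted)                            # probabilities in [0, 1]
```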
2019
Conference Papers
-
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 274-278, 2019.
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards real-world emergent leadership detection.
@inproceedings{mueller19_icmi, title = {Emergent Leadership Detection Across Datasets}, author = {M{\"{u}}ller, Philipp and Bulling, Andreas}, year = {2019}, pages = {274-278}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, doi = {10.1145/3340555.3353721} }
-
Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage
Philipp Müller, Daniel Buschek, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
Automatic saliency-based recalibration is promising for addressing calibration drift in mobile eye trackers but existing bottom-up saliency methods neglect users’ goal-directed visual attention in natural behaviour. By inspecting real-life recordings of egocentric eye tracker cameras, we reveal that users are likely to look at their phones once these appear in view. We propose two novel automatic recalibration methods that exploit mobile phone usage: The first builds saliency maps using the phone location in the egocentric view to identify likely gaze locations. The second uses the occurrence of touch events to recalibrate the eye tracker, thereby enabling privacy-preserving recalibration. Through in-depth evaluations on a recent mobile eye tracking dataset (N=17, 65 hours) we show that our approaches outperform a state-of-the-art saliency approach for the automatic recalibration task. As such, our approach improves mobile eye tracking and gaze-based interaction, particularly for long-term use.
@inproceedings{mueller19_etra, title = {Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage}, author = {M{\"{u}}ller, Philipp and Buschek, Daniel and Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319918}, pages = {1--9} }
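Note: the touch-based recalibration idea can be illustrated with a minimal sketch: when the user taps the phone, their gaze is assumed to be at the touch location, so the average offset between the tracker's gaze estimates and the projected touch positions gives a drift correction. The function and data below are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def estimate_drift_correction(gaze_estimates, touch_locations):
    """Estimate a constant drift offset from matched (gaze, touch) pairs.

    gaze_estimates:  (N, 2) gaze points in scene-camera pixels at touch time.
    touch_locations: (N, 2) touch positions projected into the same image.
    Returns the 2-D offset to subtract from future gaze estimates.
    """
    return np.mean(np.asarray(gaze_estimates) - np.asarray(touch_locations), axis=0)

# Hypothetical example: the tracker has drifted by roughly (12, -8) pixels.
rng = np.random.default_rng(0)
touches = rng.uniform(200, 800, size=(50, 2))
gaze = touches + np.array([12.0, -8.0]) + rng.normal(0, 3, size=(50, 2))

offset = estimate_drift_correction(gaze, touches)
corrected = gaze - offset
print(offset)  # approximately [12, -8]
```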
Technical Reports
-
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
arXiv:1905.02058, pp. 1–5, 2019.
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.
Paper Access: https://arxiv.org/abs/1905.02058
@techreport{mueller19_arxiv, title = {Emergent Leadership Detection Across Datasets}, author = {M{\"{u}}ller, Philipp and Bulling, Andreas}, year = {2019}, pages = {1--5}, url = {https://arxiv.org/abs/1905.02058} }
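Note: the cross-dataset setting described above amounts to training on one corpus and testing on the other instead of cross-validating within a single dataset. The sketch below illustrates such an evaluation loop with scikit-learn; the feature matrices and labels are random placeholders for the pose and eye-contact features, not the authors' pipeline.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder per-participant feature matrices (e.g. aggregated pose and
# eye-contact statistics) and emergent-leader labels for two corpora.
datasets = {
    "PAVIS":                (rng.normal(size=(64, 20)), rng.integers(0, 2, 64)),
    "MPIIGroupInteraction": (rng.normal(size=(78, 20)), rng.integers(0, 2, 78)),
}

# Train on one dataset, test on the other (and vice versa).
for train_name, test_name in [("PAVIS", "MPIIGroupInteraction"),
                              ("MPIIGroupInteraction", "PAVIS")]:
    X_tr, y_tr = datasets[train_name]
    X_te, y_te = datasets[test_name]
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"train on {train_name}, test on {test_name}: accuracy = {acc:.2f}")
```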
2018
Conference Papers
-
Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour
Philipp Müller, Michael Xuelin Huang, Xucong Zhang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 31:1-31:10, 2018.
Eye contact is one of the most important non-verbal social cues and fundamental to human interactions. However, detecting eye contact without specialized eye tracking equipment poses significant challenges, particularly for multiple people in real-world settings. We present a novel method to robustly detect eye contact in natural three- and four-person interactions using off-the-shelf ambient cameras. Our method exploits that, during conversations, people tend to look at the person who is currently speaking. Harnessing the correlation between people’s gaze and speaking behaviour therefore allows our method to automatically acquire training data during deployment and adaptively train eye contact detectors for each target user. We empirically evaluate the performance of our method on a recent dataset of natural group interactions and demonstrate that it achieves a relative improvement over the state-of-the-art method of more than 60%, and also improves over a head pose based baseline.
Paper: mueller18_etra.pdf
@inproceedings{mueller18_etra, author = {M{\"{u}}ller, Philipp and Huang, Michael Xuelin and Zhang, Xucong and Bulling, Andreas}, title = {Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {31:1-31:10}, doi = {10.1145/3204493.3204549} }
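Note: the core idea above (listeners tend to look at whoever is currently speaking) can be used to label gaze data automatically: frames in which exactly one other person speaks are weakly labelled with that speaker as the likely gaze target, and a per-person eye contact detector is then trained on these labels. The sketch below illustrates only this weak-labelling step; the function and toy data are hypothetical, not the paper's implementation.

```python
import numpy as np

def weak_gaze_labels(speaking, target):
    """Derive weak per-frame gaze-target labels for one participant.

    speaking: (num_frames, num_people) binary speaking activity.
    target:   index of the participant whose gaze we want to label.
    Returns an array of assumed gaze targets (person index) per frame,
    or -1 when no single other person is speaking.
    """
    labels = np.full(speaking.shape[0], -1, dtype=int)
    for t, frame in enumerate(speaking):
        speakers = [p for p in np.flatnonzero(frame) if p != target]
        if len(speakers) == 1:                 # exactly one other speaker
            labels[t] = speakers[0]            # assume the target looks at them
    return labels

# Toy example: 4 people, 6 frames; person 2 speaks in frames 0-2, person 0 in frames 4-5.
speaking = np.zeros((6, 4), dtype=int)
speaking[0:3, 2] = 1
speaking[4:6, 0] = 1
print(weak_gaze_labels(speaking, target=1))    # [2 2 2 -1 0 0]
```
-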
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior
Philipp Müller, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Conference on Intelligent User Interfaces (IUI), pp. 153-164, 2018.
Rapport, the close and harmonious relationship in which interaction partners are "in sync" with each other, was shown to result in smoother social interactions, improved collaboration, and improved interpersonal outcomes. In this work, we are the first to investigate automatic prediction of low rapport during natural interactions within small groups. This task is challenging given that rapport only manifests in subtle non-verbal signals that are, in addition, subject to influences of group dynamics as well as inter-personal idiosyncrasies. We record videos of unscripted discussions of three to four people using a multi-view camera system and microphones. We analyse a rich set of non-verbal signals for rapport detection, namely facial expressions, hand motion, gaze, speaker turns, and speech prosody. Using facial features, we can detect low rapport with an average precision of 0.7 (chance level at 0.25), while incorporating prior knowledge of participants’ personalities can even achieve early prediction without a drop in performance. We further provide a detailed analysis of different feature sets and the amount of information contained in different temporal segments of the interactions.
Paper: mueller18_iui.pdf
@inproceedings{mueller18_iui, title = {Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior}, author = {M{\"{u}}ller, Philipp and Huang, Michael Xuelin and Bulling, Andreas}, year = {2018}, pages = {153-164}, booktitle = {Proc. ACM International Conference on Intelligent User Interfaces (IUI)}, doi = {10.1145/3172944.3172969} }
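Note: the reported average precision of 0.7 against a chance level of 0.25 can be read as a ranking metric, with the chance level corresponding to the fraction of positive (low-rapport) examples. The sketch below shows how such an evaluation is computed with scikit-learn on made-up scores; it is an illustration, not the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)

# Made-up binary labels (1 = low rapport) with roughly 25% positives,
# and made-up classifier scores that are somewhat higher for positives.
y_true = rng.random(200) < 0.25
scores = y_true * rng.normal(0.6, 0.3, 200) + (~y_true) * rng.normal(0.3, 0.3, 200)

ap = average_precision_score(y_true, scores)
chance = y_true.mean()   # AP of a random ranking is, in expectation, the positive rate
print(f"average precision: {ap:.2f} (chance level: {chance:.2f})")
```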
2017
Conference Papers
-
The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions
Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, Michel Valstar
Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 350–359, 2017.
We present a novel multi-lingual database of natural dyadic novice-expert interactions, named NoXi, featuring screen-mediated dyadic human interactions in the context of information exchange and retrieval. NoXi is designed to provide spontaneous interactions with emphasis on adaptive behaviors and unexpected situations (e.g. conversational interruptions). A rich set of audio-visual data, as well as continuous and discrete annotations are publicly available through a web interface. Descriptors include low level social signals (e.g. gestures, smiles), functional descriptors (e.g. turn-taking, dialogue acts) and interaction descriptors (e.g. engagement, interest, and fluidity).
Paper: cafaro17_icmi.pdf
@inproceedings{cafaro17_icmi, title = {The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions}, author = {Cafaro, Angelo and Wagner, Johannes and Baur, Tobias and Dermouche, Soumia and Torres, Mercedes Torres and Pelachaud, Catherine and André, Elisabeth and Valstar, Michel}, year = {2017}, booktitle = {Proceedings of the 19th ACM International Conference on Multimodal Interaction}, doi = {10.1145/3136755.3136780}, pages = {350--359} }
2015
Conference Papers
-
Emotion recognition from embedded bodily expressions and speech during dyadic interactions
Philipp Müller, Sikandar Amin, Prateek Verma, Mykhaylo Andriluka, Andreas Bulling
Proc. International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 663-669, 2015.
Previous work on emotion recognition from bodily expressions focused on analysing such expressions in isolation, of individuals or in controlled settings, from a single camera view, or required intrusive motion tracking equipment. We study the problem of emotion recognition from bodily expressions and speech during dyadic (person-person) interactions in a real kitchen instrumented with ambient cameras and microphones. We specifically focus on bodily expressions that are embedded in regular interactions and background activities and recorded without human augmentation to increase naturalness of the expressions. We present a human-validated dataset that contains 224 high-resolution, multi-view video clips and audio recordings of emotionally charged interactions between eight couples of actors. The dataset is fully annotated with categorical labels for four basic emotions (anger, happiness, sadness, and surprise) and continuous labels for valence, activation, power, and anticipation provided by five annotators for each actor. We evaluate vision and audio-based emotion recognition using dense trajectories and a standard audio pipeline and provide insights into the importance of different body parts and audio features for emotion recognition.
@inproceedings{mueller15_acii, title = {Emotion recognition from embedded bodily expressions and speech during dyadic interactions}, author = {M{\"{u}}ller, Philipp and Amin, Sikandar and Verma, Prateek and Andriluka, Mykhaylo and Bulling, Andreas}, year = {2015}, pages = {663-669}, doi = {10.1109/ACII.2015.7344640}, booktitle = {Proc. International Conference on Affective Computing and Intelligent Interaction (ACII)} }
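Note: the recognition pipeline above combines pre-computed video features (dense trajectories) with audio features. The sketch below shows simple feature-level fusion by concatenation followed by a linear classifier; the feature matrices, dimensions and split are random stand-ins, not the authors' pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
EMOTIONS = ["anger", "happiness", "sadness", "surprise"]

# Stand-ins for per-clip descriptors: encoded dense-trajectory video features
# and aggregated audio features for 224 clips, plus random emotion labels.
video_feats = rng.normal(size=(224, 400))
audio_feats = rng.normal(size=(224, 60))
labels = rng.integers(0, len(EMOTIONS), 224)

# Feature-level fusion by concatenation, then a linear classifier.
X = np.hstack([video_feats, audio_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```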