Backchannels serve important meta-conversational purposes such as signalling attention or indicating agreement. They can be expressed in a variety of ways, ranging from vocal behaviour (“yes”, “ah-ha”) to subtle nonverbal cues like head nods or hand movements. The backchannel detection sub-challenge focuses on classifying whether a participant in a group interaction expresses a backchannel at a given point in time. Challenge participants will be required to perform this classification based on a 10-second context window of audiovisual recordings of the whole group.
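As an illustration of the context-window setup, a 10-second audiovisual window ending at a classification timestamp could be sliced from synchronized streams along the following lines. This is a minimal sketch: the sampling rates, array shapes, and function name are assumptions for illustration, not specifics of the challenge data.

```python
import numpy as np

# Hypothetical sampling rates -- the actual challenge recordings may differ.
AUDIO_SR = 16_000   # audio samples per second
VIDEO_FPS = 25      # video frames per second
WINDOW_S = 10       # context window length in seconds

def slice_context_window(audio, video, t_end):
    """Return the 10-second audio/video context ending at time t_end (seconds).

    audio: 1-D array of audio samples
    video: 4-D array of frames, shape (n_frames, H, W, C)
    """
    a_end = int(round(t_end * AUDIO_SR))
    v_end = int(round(t_end * VIDEO_FPS))
    a_start = max(0, a_end - WINDOW_S * AUDIO_SR)
    v_start = max(0, v_end - WINDOW_S * VIDEO_FPS)
    return audio[a_start:a_end], video[v_start:v_end]

# Example: 60 seconds of dummy data, window ending at t = 30 s.
audio = np.zeros(60 * AUDIO_SR)
video = np.zeros((60 * VIDEO_FPS, 8, 8, 3))
a_win, v_win = slice_context_window(audio, video, 30.0)
print(a_win.shape[0], v_win.shape[0])  # 160000 audio samples, 250 video frames
```

The same slicing applies to all sub-challenges that use the 10-second context window; only the prediction target differs.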
A key function of backchannels is the expression of agreement or disagreement towards the current speaker. Artificial mediators need access to this information to understand the group structure and to intervene before potential escalations. In this sub-challenge, participants will address the task of automatically estimating the amount of agreement expressed in a backchannel. As in the backchannel detection sub-challenge, a 10-second audiovisual context window containing views of all interactants will be provided.
Eye Contact Detection (MultiMediate’21 sub-challenge)
We define eye contact as a discrete indication of whether a participant is looking at another participant’s face, and if so, who this other participant is. Video and audio recordings over a 10-second context window will be provided as input to give temporal context for the classification decision. Eye contact has to be detected for the last frame of the 10-second context window.
Next Speaker Prediction (MultiMediate’21 sub-challenge)
In the next speaker prediction sub-challenge, approaches need to predict which members of the group will be speaking at a future point in time. As in the eye contact detection sub-challenge, video and audio recordings over a 10-second context window will be provided as input. Based on this information, approaches need to predict the speaking status of each participant one second after the end of the context window.
Evaluation of Participants’ Approaches
Training and validation data for each sub-challenge can be downloaded at multimediate-challenge.org/Dataset/. Evaluation of participants’ approaches will then be performed remotely on our side on the unpublished test portion of the dataset. We will provide baseline implementations along with pre-computed features to minimise the overhead for participants. The test set will be released two weeks before the challenge deadline. Participants will in turn submit their predictions via email to Michael Dietz ().
We will evaluate approaches with the following metrics: accuracy for backchannel detection and eye contact detection, mean squared error for agreement estimation from backchannels, and unweighted average recall for next speaker prediction.
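The three metrics can be sketched in plain Python; the toy labels below are made up for illustration and are not taken from the challenge data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between continuous targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_average_recall(y_true, y_pred):
    """Recall computed per class, then averaged with equal weight per class,
    so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)

acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])                    # 0.75
mse = mean_squared_error([0.2, 0.8], [0.1, 0.9])              # approx. 0.01
uar = unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1])   # (2/3 + 1) / 2
```

Unlike plain accuracy, unweighted average recall is robust to class imbalance, which is why it is a common choice for tasks such as next speaker prediction where speaking frames are much rarer than silent ones.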
Rules for participation
* The competition is team-based. A single person can only be part of a single team.
* Each team will have 5 evaluation runs on the test set (per sub-challenge).
* Additional datasets can be used, but they need to be publicly available.
* The Organisers will not participate in the challenge.
* For awarding certificates for 1st, 2nd, and 3rd place in each sub-challenge, we will only consider approaches that are described in accepted papers submitted to the ACM MM Grand Challenge track.
* The evaluation servers will be open until the paper submission deadline (18 June 2022).
* The test set (without labels) will be provided to participants 2 weeks before the challenge deadline. It is not allowed to manually annotate the test set.