Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks

Transcript
Page 1

Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks

Sergio Escalera, Petia Radeva, Jordi Vitrià, Xavier Baró and Bogdan Raducanu

Page 2

Outline

1. Introduction

2. Audio-visual cue extraction and fusion

3. Social Network extraction and analysis

4. Experimental Results

5. Conclusions and future work

Page 3

1. Introduction

- Social interactions play a very important role in people’s daily lives.

- Present trend: analysis of human behavior based on electronic communication (SMS, e-mail, chat)

- New trend: analysis of human behavior based on nonverbal communication (social signals)

- Quantification of social signals represents a powerful cue to characterize human behavior: facial expression, hand and body gestures, focus of attention, voice prosody, etc.

Page 4

Social Network Analysis (SNA) has been developed as a tool to model social interactions in terms of a graph-based structure:

- ‘Nodes’ represent the ‘actors’: persons, communities, institutions, etc.

- ‘Links’ represent a specific type of interdependency: friendship, familiarity, business transactions, etc.

A common way to characterize the information ‘encoded’ in a social network is to use several centrality measures.

Page 5

Our contribution:

- In this work, we propose an integrated framework for the extraction and analysis of a social network from multimodal (A/V) dyadic interactions*

- Its main advantage is that it is based on a completely non-intrusive technology

- First: we perform speech segmentation through an audio/visual fusion scheme

- In the audio domain, speech is detected through clustering of audio features

- In the visual domain, speech is detected through differential-based feature extraction from the segmented mouth region

- The fusion scheme is based on stacked sequential learning

*We used a set of videos belonging to the New York Times’ Bloggingheads opinion blog. The videos depict two persons talking about different subjects in front of a webcam

Page 6

Block-diagram representation of our integrated framework

- Second: To quantify the dyadic interaction, we used the ‘Influence Model’, whose states encode previously integrated audio-visual data

- Third: The Social Network is extracted based on the estimated influences* and its properties are characterized based on several centrality measures.

* The use of the term ‘influence’ is inspired by the previous work of Choudhury:

T. Choudhury, “Sensing and Modelling Human Networks”, Ph.D. Thesis, MIT Media Lab, 2003.

Page 7

2. Audio-visual cue extraction and fusion

• Audio cue – Description:

• First 12 MFCC coefficients

• Signal energy

• Temporal cepstral derivatives (Δ and Δ²; see the sketch below)
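For illustration, a minimal sketch of how such a descriptor could be computed with librosa; the file name, sampling rate and framing defaults are assumptions, not the authors' exact configuration:

    # Hypothetical audio descriptor: 12 MFCCs + log energy, with delta and
    # delta-delta derivatives. File name and librosa defaults are assumptions.
    import librosa
    import numpy as np

    y, sr = librosa.load("dyadic_interaction.wav", sr=16000)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)      # first 12 MFCCs
    log_energy = np.log(librosa.feature.rms(y=y) + 1e-10)   # frame log energy

    feats = np.vstack([mfcc, log_energy])                   # 13 features/frame
    delta = librosa.feature.delta(feats)                    # temporal derivative
    delta2 = librosa.feature.delta(feats, order=2)          # acceleration

    descriptor = np.vstack([feats, delta, delta2])          # (39, n_frames)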

Page 8

• Audio cue – Diarization process (sketched below):

• Segmentation – coarse segmentation according to the Generalized Likelihood Ratio (GLR) between consecutive windows

• Clustering – agglomerative hierarchical clustering with a BIC (Bayesian Information Criterion) stopping scheme

• Segment boundaries are adjusted at the end
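A minimal sketch of the GLR change detector and the ΔBIC merging criterion in their textbook form, assuming each window is modelled as a single full-covariance Gaussian over its MFCC frames; the penalty weight lam is an assumption:

    # GLR between two consecutive windows, and the BIC-penalized variant used
    # as a stopping criterion for agglomerative clustering (textbook form).
    import numpy as np

    def _logdet_cov(x):
        # x: (n_frames, n_features); needs more frames than dimensions.
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]

    def glr(x1, x2):
        """High GLR suggests a speaker change between the two windows."""
        n1, n2 = len(x1), len(x2)
        merged = np.vstack([x1, x2])
        return 0.5 * ((n1 + n2) * _logdet_cov(merged)
                      - n1 * _logdet_cov(x1) - n2 * _logdet_cov(x2))

    def delta_bic(x1, x2, lam=1.0):
        """Merge two clusters while delta_bic < 0; stop otherwise."""
        d = x1.shape[1]
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(len(x1) + len(x2))
        return glr(x1, x2) - penalty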

Page 9

• Visual cue – Description (sketched below):

• Face segmentation based on the Viola-Jones detector

• Mouth region segmentation

• Vector of HOG descriptors for the mouth region
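A minimal sketch of this visual descriptor, assuming OpenCV's Viola-Jones cascade and scikit-image's HOG; the mouth crop heuristic (lower third of the face box) and the HOG parameters (chosen here to yield 32 oriented features, matching the experimental setup on Page 15) are assumptions:

    # Face detection -> mouth crop -> 32-dimensional HOG descriptor.
    import cv2
    from skimage.feature import hog

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def mouth_hog(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        if len(faces) == 0:
            return None
        x, y, w, h = faces[0]
        # Crude mouth region: lower third of the detected face box.
        mouth = cv2.resize(gray[y + 2 * h // 3:y + h, x:x + w], (32, 32))
        # 2x2 cells x 8 orientations x 1x1 blocks = 32 oriented features.
        return hog(mouth, orientations=8, pixels_per_cell=(16, 16),
                   cells_per_block=(1, 1))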

Page 10

• Visual cue – Classification:

• Non-speech class modelling

• One-class Dynamic Time Warping (DTW) based on the following dynamic programming equation
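The slide's equation itself is not reproduced here; in its standard textbook form, the DTW recurrence over two sequences x and y, which the one-class classifier presumably instantiates, reads:

    % Standard DTW recurrence: d(x_i, y_j) is the local frame distance,
    % D(i, j) the accumulated alignment cost.
    D(i, j) = d(x_i, y_j) + \min\{\, D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \,\}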

Page 11

• Fusion scheme – Stacked sequential learning (suitable for problems characterized by long runs of identical labels)

• Fusion of the audio and visual modalities

• Determining the temporal relations of both feature sets for learning a two-stage classifier (based on AdaBoost; see the sketch below)
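A minimal sketch of the two-stage scheme with scikit-learn's AdaBoost, on placeholder data; the window width, feature layout and the use of scikit-learn are assumptions, not the authors' settings:

    # Stage 1 classifies frames; stage 2 sees a temporal window of stage-1
    # scores, exploiting the long runs of identical speech/non-speech labels.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 39))               # placeholder fused features
    y = (rng.random(200) > 0.5).astype(int)      # placeholder speech labels

    def stack_window(scores, w=5):
        """Concatenate each frame's score with its w neighbours per side."""
        padded = np.pad(scores, w, mode="edge")
        return np.stack([padded[i:i + 2 * w + 1] for i in range(len(scores))])

    base = AdaBoostClassifier(n_estimators=50).fit(X, y)
    scores = base.predict_proba(X)[:, 1]

    X_stacked = np.hstack([X, stack_window(scores)])
    meta = AdaBoostClassifier(n_estimators=50).fit(X_stacked, y)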

Page 12

3. Social Network extraction and analysis

- The Influence Model (IM) is a tool introduced for the quantification of interacting processes using a coupled Hidden Markov Model (HMM)

- In the case of social interaction, the states of the IM encode the automatically extracted audio-visual features

Influence Model architecture (figure): the coupling parameters α_ij represent the ‘influences’ (see the dynamics below)
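For reference, the IM dynamics in their standard form (following Asavathiratham's formulation, as used in Choudhury's thesis cited on Page 6): the next-state distribution of participant i's chain is a convex combination of pairwise Markov transitions, where α_ij quantifies how much participant j influences participant i:

    % Influence Model dynamics for C chains (participants),
    % with alpha_ij >= 0 and sum_j alpha_ij = 1 for each i.
    P(S_i^{t+1} \mid S_1^{t}, \dots, S_C^{t})
        = \sum_{j=1}^{C} \alpha_{ij} \, P(S_i^{t+1} \mid S_j^{t})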

Page 13

- The construction of the Social Network is based on the estimated ‘influence’ values

- A directed link between two nodes A and B (designated by A → B) implies that ‘A has influence over B’

- The SNA is based on several centrality measures (illustrated in the sketch after this list):

- degree centrality (indegree and outdegree)

- Refers to the number of direct connections with other persons

- closeness centrality

- Refers to how easily a person can communicate with the other persons

- betweenness centrality

- Refers to the relevance of a person as a ‘bridge’ between two sub-groups of the network

- eigenvector centrality

- Refers to the importance of a person in the network
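A minimal sketch of this SNA step with networkx, on a hypothetical four-person network; the participants and edges are illustrative (an edge A → B reads ‘A has influence over B’):

    # Directed influence graph and the four centrality measures.
    import networkx as nx

    G = nx.DiGraph([("p1", "p2"), ("p2", "p1"),
                    ("p1", "p3"), ("p3", "p4"), ("p4", "p1")])

    indegree = nx.in_degree_centrality(G)       # how many influence this person
    outdegree = nx.out_degree_centrality(G)     # how many this person influences
    closeness = nx.closeness_centrality(G)      # ease of communication with others
    betweenness = nx.betweenness_centrality(G)  # 'bridge' role between subgroups
    eigenvector = nx.eigenvector_centrality(G)  # importance via important peers

    print(max(eigenvector, key=eigenvector.get))  # most 'important' participant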

Page 14

4. Experimental Results

- We collected a subset of videos from the New York Times’ Bloggingheads opinion blog

- We used 17 videos from 15 persons

- The videos depict two persons having a conversation in front of their webcams on different topics (politics, economy, ...)

- The conversations have an informal character, and sometimes frequent interruptions occur

Snapshot from a video

Page 15

- Audio features

- The audio stream has been analyzed using sliding windows of 25 ms with an overlap factor of 50%

- Each window is characterized by 13 features (12 MFCC + E), complemented with Δ and Δ²

- The shortest length of a valid audio segment was set to 2.5 s

- Video features

- 32 oriented features (corresponding to the mouth region) have been extracted using the HOG descriptor

- The length of the DTW sequences has been set to 18 frames (which corresponds to 1.5 s, i.e. a frame rate of 12 fps)

- Fusion process

- Stacked sequential learning was used to fuse the audio-visual features

- AdaBoost was chosen as the classifier

Page 16

Visual and audio-visual speaker segmentation accuracy

Page 17

The extracted social network, showing the participants’ labels and influence directions

Page 18

Centrality measures table

Page 19

5. Conclusions and future work

- We presented an integrated framework for the automatic extraction and analysis of a social network from implicit input (multimodal dyadic interactions), based on the integration of audio/visual features

- In the future, we plan to extend the current work to study the problem of social interaction at a larger scale and in different scenarios

- Starting from the premise that people's lives are more structured than they might seem a priori, we plan to study long-term interactions between persons, with the aim of discovering the underlying behavioral patterns present in our day-to-day existence