
Learning Interactions and Relationships between Movie Characters
SUPPLEMENTARY MATERIAL

Anna Kukleva1,2 Makarand Tapaswi1 Ivan Laptev1

1 Inria Paris, France  2 Max-Planck-Institute for Informatics, Saarbrucken, Germany
[email protected], {makarand.tapaswi,ivan.laptev}@inria.fr

We provide additional analysis of our task and models, including confusion matrices, prediction examples for all our models, the skewed distribution of the number of samples for our classes, and diagrams depicting how we grouped the interaction and relationship classes.

A. Impact of Modalities

We analyze the impact of modalities by presenting qualitative examples where using multiple modalities helps predict the correct interactions. The qualitative results presented here refer to the quantitative performance indicated in Table 1 of the main paper. Fig. 1 shows that using dialog can help to improve predictions, while Fig. 2 demonstrates the necessity of visual clip information and highlights that the two modalities are complementary. Fig. 3 shows that focusing on tracks (visual representations in which the two characters appear) provides further improvements to our model. Furthermore, Fig. 4 shows the top-5 interaction classes that benefit most from using additional modalities.

Analyzing modalities. We also analyze the two models trained on only visual or only dialog cues (first two rows of Table 1). Some interactions can be recognized only with visual (v) features: rides 63% (v) / 0% (d), walks 29% (v) / 0% (d), runs 26% (v) / 0% (d); while others only with dialog (d) cues: apologizes 0% (v) / 66% (d), compliments 0% (v) / 26% (d), agrees 0% (v) / 25% (d).

Interactions that achieve non-zero accuracy with both modalities are: hits 64% (v) / 5% (d), greets 12% (v) / 57% (d), explains 25% (v) / 51% (d).

Additionally, the top-5 predicted classes for visual cues are asks 77%, hits 64%, rides 63%, watches 49%, talks on phone 41%; and for dialog cues they are asks 75%, apologizes 66%, greets 57%, explains 51%, watches 30%. As asks is the most common class, and watches is the second most common, these interactions work well with both modalities.
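The per-class comparison above can be reproduced mechanically. The following is a minimal sketch that sorts the classes quoted in this section by which modality recognizes them; the accuracy values are copied from the text, while the dictionary layout and the helper name `modality_group` are assumptions made for illustration.

```python
# Per-class accuracies (%) quoted in the text: (visual-only, dialog-only).
ACCURACY = {
    "rides": (63, 0), "walks": (29, 0), "runs": (26, 0),
    "apologizes": (0, 66), "compliments": (0, 26), "agrees": (0, 25),
    "hits": (64, 5), "greets": (12, 57), "explains": (25, 51),
}


def modality_group(visual_acc, dialog_acc):
    """Hypothetical helper: name the modality (or modalities) with non-zero accuracy."""
    if visual_acc > 0 and dialog_acc > 0:
        return "both"
    if visual_acc > 0:
        return "visual only"
    if dialog_acc > 0:
        return "dialog only"
    return "neither"


# Collect class names under each modality group, preserving the order above.
groups = {}
for name, (visual_acc, dialog_acc) in ACCURACY.items():
    groups.setdefault(modality_group(visual_acc, dialog_acc), []).append(name)

for label, names in groups.items():
    print(f"{label}: {', '.join(names)}")
```

A zero threshold mirrors the wording of the text ("non-zero accuracy"); a stricter threshold could be substituted if small accuracies should count as unrecognized.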

B. Joint Interaction and Relationships

Confusion matrices. Fig. 5 shows the confusion matrices for the top-15 most commonly occurring interactions on the validation and test sets. We see that multiple dialog-based interactions (e.g. talks to, informs, and explains) are often confused. We also present confusion matrices for relationships in Fig. 6. A large part of the confusion is due to the lack of sufficient data to model the tail of relationship classes.

Qualitative examples. Related to Table 2 of the main paper, Fig. 7 shows some examples where interaction predictions improve by jointly learning to model both interactions and relationships. Similarly, Fig. 8 shows how relationship classification benefits from our multi-task training setup.
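To make the construction of a top-k confusion matrix concrete, here is a minimal sketch in plain Python. It assumes one ground-truth label and one predicted label per sample; the function name `top_k_confusion` and the toy labels are invented for illustration and are not taken from the dataset or the authors' code.

```python
from collections import Counter


def top_k_confusion(y_true, y_pred, k):
    """Count (ground truth, prediction) pairs among the k most frequent true classes.

    Samples whose true or predicted label falls outside the top-k are skipped.
    """
    classes = [c for c, _ in Counter(y_true).most_common(k)]
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(y_true, y_pred):
        if truth in index and pred in index:
            matrix[index[truth]][index[pred]] += 1
    return classes, matrix


# Toy labels, made up for illustration only.
y_true = ["asks", "asks", "asks", "informs", "informs", "explains", "kisses"]
y_pred = ["asks", "informs", "asks", "explains", "informs", "informs", "kisses"]

classes, matrix = top_k_confusion(y_true, y_pred, k=2)
for name, row in zip(classes, matrix):
    print(name, row)
```

Rows correspond to ground truth and columns to predictions, matching the axis labels of Fig. 5 and Fig. 6.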

C. Examples for Who is Interacting

Empirical evaluation shows that knowledge about the relationship is important for localizing the pair of characters (Table 6 of the main paper). In Fig. 9, we illustrate an example where the dad walks into a room, sees his daughter with someone, and asks questions (see figure caption for details).

Finally, in Fig. 10, we show an example where the model is able to correctly predict all components (interaction class, relationship type and the pair of tracks) in a complex situation with more than 2 people appearing in the clip.

D. Dataset Analysis

Fig. 11 and Fig. 12 show normalized distributions for the number of samples in each class for the train, validation and test sets of interactions and relationships respectively. As can be seen, the most common classes appear many more times than the others. Data from a complete movie belongs to exactly one of the three train/val/test sets, to avoid biasing the model towards the plot and the characters' behaviour. Notably, this means that the relative ratios between the number of samples per class are not necessarily consistent across the sets, making the dataset and task even more challenging.
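The two ideas in this paragraph, assigning whole movies to a single split and normalizing class counts within each split, can be sketched as follows. The movie names, labels and split assignment are toy values invented for illustration, and `normalized_distribution` is a hypothetical helper, not the authors' code.

```python
from collections import Counter


def normalized_distribution(labels):
    """Return the fraction of samples per class, so the values sum to 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy (movie, interaction label) samples, made up for illustration.
samples = [
    ("movie_a", "asks"), ("movie_a", "asks"), ("movie_a", "watches"),
    ("movie_b", "asks"), ("movie_b", "kisses"),
    ("movie_c", "watches"),
]

# Each movie is assigned to exactly one split, so no movie leaks across splits.
split_of_movie = {"movie_a": "train", "movie_b": "val", "movie_c": "test"}

labels_by_split = {}
for movie, label in samples:
    labels_by_split.setdefault(split_of_movie[movie], []).append(label)

dist = {split: normalized_distribution(labels)
        for split, labels in labels_by_split.items()}

for split, d in dist.items():
    print(split, d)
```

Because splitting happens at the movie level, the per-class fractions can differ between splits, which is the inconsistency the paragraph describes.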

In the main paper, we described our approach to group over 300 interactions into 101 classes, and over 100 relationships into 15. We use radial tree diagrams to depict the groupings for interaction and relationship labels, visualized in Fig. 13 and 14 respectively.
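Such a grouping can be thought of as a lookup table from fine-grained labels to grouped classes. The sketch below uses three groups read off the Fig. 13 diagram; the dictionary layout, the helper name `to_group`, and the choice to return unknown labels unchanged are assumptions for illustration, not the authors' implementation.

```python
# Three example groups read off Fig. 13: grouped class -> fine-grained labels.
GROUPS = {
    "leaves": ["leaves", "leave together", "gets out of the house"],
    "kisses": ["kisses", "tries to kiss"],
    "hits": ["hits", "fights", "slaps", "attacks", "punches", "hurts", "kicks"],
}

# Invert the table so each fine-grained label points to its grouped class.
FINE_TO_GROUP = {
    fine: group for group, members in GROUPS.items() for fine in members
}


def to_group(label):
    """Hypothetical helper: map a fine-grained label to its grouped class.

    Labels missing from the table are returned unchanged (an assumption).
    """
    return FINE_TO_GROUP.get(label, label)


print(to_group("leave together"))  # grouped under "leaves"
print(to_group("slaps"))           # grouped under "hits"
```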


Dialog: - Focker, I'm not gonna tell you again! Jinx cannot flush the toilet. He's a cat, for chrissakes! The animal doesn't even have thumbs, Focker.

Input modalities: Visual. Top-5 predictions: watches 0.71, asks 0.70, informs 0.66, orders 0.64, explains 0.64.

Input modalities: Visual + Textual. Top-5 predictions: explains 0.78, talks to 0.72, reminds 0.68, informs 0.67, shows 0.64.

Figure 1: Improvement in prediction of interactions by including textual modality in addition to visual. The model learns to recognize subtle differences between interactions based on dialog. The example is from Meet the Parents (2000).

Dialog:
- What a bruiser.
- The Cardinals just refuse to go quietly into the desert night. I'm just trying to be a little poetic.
- A tough hit on Tidwell. I'd have made a touchdown. That's his sixth catch tonight. A first down at the 19 yard line.
- Another savage hit on Tidwell. They're really working Tidwell.

Input modalities: Textual. Top-5 predictions: explains 0.73, hears 0.65, watches 0.65, informs 0.65, argues 0.64.

Input modalities: Textual + Visual. Top-5 predictions: watches 0.75, explains 0.67, hears 0.67, talks to 0.62, plays 0.58.

Figure 2: Improvement in prediction of interactions by including visual modality in addition to textual. The top-5 predicted interactions reflect the impact of visual input rather than relying only on the dialog. The example is from Jerry Maguire (1996).

Dialog: - Hey, Jackie. - Wanna play, Pops? Let's play.

Input modalities: Visual + Textual. Top-5 predictions: informs 0.70, suggests 0.69, greets 0.69, watches 0.67, asks 0.67.

Input modalities: Visual + Textual + Tracks. Top-5 predictions: watches 0.69, greets 0.67, suggests 0.67, informs 0.66, asks 0.64.

Figure 3: Improvement in prediction of interactions by including the pair of tracks modality in addition to visual and textual cues. The model can concentrate its attention on visual cues for the two people of interest instead of looking only at the clip level. The example is from Meet the Parents (2000).


Panel (vis) -> (vis + txt), x-axis: # improved samples (0 to 30). Classes shown: suggests, asks, watches, informs, explains.

Panel (txt) -> (vis + txt), x-axis: # improved samples (0 to 10). Classes shown: runs, informs, explains, kisses, watches.

Panel (vis + txt) -> (vis + txt + tr), x-axis: # improved samples (0 to 10). Classes shown: talks to, explains, hears, informs, watches.

Figure 4: Each plot shows the 5 interaction classes with the largest number of improved instances when an additional modality is included. Specifically, the x-axis denotes the number of samples in which interaction prediction performance improves. Left: From only visual clip representation to visual and textual. As expected, using dialogues in addition to video frames boosts performance for classes that rely on dialog, e.g. explains, informs. Middle: From only textual clip representation to visual and textual. Visual clip representations influence classes such as kisses and runs, during which people usually do not talk (dialog modality filled with zeros). Right: Finally, including all three modalities (visual, textual, tracks) improves performance over using visual and textual. Track pair localization improves recognition of classes typically used in group activities.
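A count of improved samples per class, as plotted in Fig. 4, could be computed along the following lines. This sketch assumes that a sample counts as improved when the model without the extra modality predicts the wrong class and the model with it predicts the right one; the source does not spell out the exact criterion. The function name and the toy labels are invented for illustration.

```python
from collections import Counter


def improved_per_class(y_true, pred_before, pred_after):
    """Count, per true class, samples that go from wrong to correct.

    Assumption: "improved" means incorrect before and correct after
    adding the modality, judged on the top-1 prediction.
    """
    improved = Counter()
    for truth, before, after in zip(y_true, pred_before, pred_after):
        if before != truth and after == truth:
            improved[truth] += 1
    return improved


# Toy labels, made up for illustration only.
y_true = ["explains", "explains", "informs", "asks", "kisses", "explains"]
pred_before = ["watches", "explains", "asks", "asks", "watches", "informs"]
pred_after = ["explains", "explains", "informs", "watches", "kisses", "explains"]

improved = improved_per_class(y_true, pred_before, pred_after)
print(improved.most_common(5))  # the five classes with the most improved samples
```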

Figure 5: Confusion matrices for top-15 most common interactions for validation set (left) and test set (right). Model corresponds to the “Int. only” performance of 26.1% shown in Table 2. Numbers on the right axis indicate number of samples for each class.

VAL panel, class order on the Prediction and Ground Truth axes: stranger, friend, lover, colleague, sibling, acquaintance, parent, worker, child, boss, kbr, customer, manager, ex-lover, enemy.

TEST panel, class order on the Prediction and Ground Truth axes: stranger, friend, colleague, acquaintance, lover, parent, worker, child, manager, customer, boss, kbr, enemy, sibling, ex-lover.

Figure 6: Confusion matrices for all relationships for validation set (left) and test set (right). Model corresponds to the “Rel. only” performance of 26.8% shown in Table 2. Numbers on the right axis indicate number of samples for each class.


Top example. Dialog: - If anybody else wants to come with me, this is a chance for something real, and fun, and inspiring in this godforsaken business, and we will do it together. Who's coming with me? Who's coming with me? Who's coming with me besides Flipper here? This is embarrassing.
Interactions only model, top-5 predictions: watches 0.64, suggests 0.61, explains 0.61, asks 0.58, informs 0.54.
Joint (Ints + Rels) model, top-5 predictions: asks 0.64, watches 0.62, assures 0.59, explains 0.57, suggests 0.51.
Relationship: colleagues (represented as a collection of N clips with the same two people).

Bottom example. No dialog.
Interactions only model, top-5 predictions: watches 0.70, leaves 0.66, greets 0.65, walks 0.59, stays 0.59.
Joint (Ints + Rels) model, top-5 predictions: leaves 0.62, greets 0.61, watches 0.58, stays 0.50, walks 0.50.
Relationship: friends (represented as a collection of N clips with the same two people).

Figure 7: We show examples where training to predict interactions and relationships jointly helps improve the performance of interactions. Top: In the example from Jerry Maguire (1996), the joint model looks at several clips between Dorothy and Jerry and is able to reason about them being colleagues. This in turn helps refine the interaction prediction to asks. Bottom: In the example from Four Weddings and a Funeral (1994), the model observes several clips from the entire movie where Charles and Tom are friends, and reasons that the interaction should be leaves (which contains the leave together class). Note that there is no dialog for this clip.


Top example. Interactions across clips: complains, kisses, informs.
Relationships only model, top-3 predictions: stranger 0.58, lover 0.49, parent 0.48.
Joint (Rels + Ints) model, top-3 predictions: lover 0.63, stranger 0.62, parent 0.55.

Bottom example. Interactions across clips: explains, explains.
Relationships only model, top-3 predictions: stranger 0.59, worker 0.59, boss 0.58.
Joint (Rels + Ints) model, top-3 predictions: customer 0.66, manager 0.62, stranger 0.60.

Figure 8: We show examples where training to predict interactions and relationships jointly helps improve the performance of relationships. Top: In the movie Four Weddings and a Funeral (1994), clips between Bernard and Lydia exhibit a variety of interactions (e.g. kisses) that are more typical between lovers than strangers. Bottom: In the movie The Firm (1993), Frank and Mitch meet only once for a consultation, and are involved in two clips with the same interaction label explains. Our model is able to reason about this interaction, and it encourages the relationship to be customer and manager, instead of stranger.


Clip dialog: - Not in my own den. What are you two doing in here?

Possible track pairs: (Pam, None), (None, Pam), (Greg, Pam), (Pam, Greg), (Greg, Jack), (Pam, Jack), (Jack, Greg), (Jack, Pam), (Jack, None), (None, Jack), (Greg, None), (None, Greg).

Given: interaction asks, relationship parent. Predict: Who? Prediction: (Jack, Pam).

Figure 9: We illustrate an example from the movie Meet the Parents (2000) where a father (Jack) walks into a room while his daughter (Pam) and the guy (Greg) are kissing. Our goal is to predict the two characters when the interaction and relationship labels are provided. In this particular example, we see that Dad asks Pam a question (What are you two doing in here?). Note that their relationship is encoded as (Pam → child → Jack), or equivalently, (Jack → parent → Pam). When searching for the pair of characters with a given interaction asks and relationship as parent, our model is able to focus on the question at the clip level as it is asked by Jack in the interaction, and correctly predict (Jack, Pam) as the ordered character pair. Note that our model not only considers all possible directed track pairs (e.g. (Greg, Pam) and (Pam, Greg)) between characters, but also singleton tracks (e.g. (Jack, None)) to deal with situations when a person is absent due to failure in tracking or does not appear in the scene.
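The candidate set described in the caption, all directed pairs plus singleton tracks, can be enumerated with a few lines of Python. This is a minimal sketch: the function name `candidate_pairs` and the use of `None` for a missing character are assumptions for illustration, and the ordering of the output is arbitrary.

```python
from itertools import permutations


def candidate_pairs(characters):
    """Enumerate ordered character pairs, plus singletons padded with None."""
    # Directed pairs: (a, b) and (b, a) are distinct candidates.
    directed = list(permutations(characters, 2))
    # Singletons cover a character whose partner has no track in the clip.
    singletons = []
    for name in characters:
        singletons.append((name, None))
        singletons.append((None, name))
    return directed + singletons


pairs = candidate_pairs(["Jack", "Pam", "Greg"])
print(len(pairs))
for pair in pairs:
    print(pair)
```

For three characters this yields 6 directed pairs and 6 singletons, which agrees with the 12 possible track pairs mentioned in the caption of Fig. 10.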


Clip dialog:
- Oh, hi. Hi, Jess.
- Uh, this is my work friend, David.
- David is an accountant.
- David, this is Jessica, my babysitter.
- Uh So, you know, everything looks great.
- See you at work.
- Yeah, see you at work.

Possible track pairs: (Jess, David), (David, Jess), (Jess, Emily), (Emily, David), (Emily, Jess), (David, Emily), (Jess, None), (None, Jess), (David, None), (None, David), (Emily, None), (None, Emily).

Joint prediction: track pair (Emily, Jess), interaction introduces, relationship boss.

Figure 10: We present an example where our model is able to correctly and jointly predict all three components: track pair, interaction class and relationship type for the clip obtained from the movie Crazy, Stupid, Love (2011). This clip contains three characters, which leads to 12 possible track pairs (including singletons to deal with situations when a person is absent due to failure in tracking or does not appear in the scene). The model is able to correctly predict the two characters, their order, interaction and relationship. In this case, Emily introduces David to Jess. Jess is also her hired babysitter, and thus their relationship is: Emily is boss of Jess.


Bar chart. y-axis: Normalized #samples. Legend: train, val, test. x-axis classes, left to right: asks, watches, explains, informs, talks to, suggests, orders, greets, hears, talks about, answers, leaves, compliments, stays, reassures, walks, kisses, yells, argues, agrees, hugs, talks on the phone, disagrees with, gives, shows, plays, eats with, helps, runs, hits, sits, congratulates, meets, teases, rides, admits, follows, apologizes, introduces, assures, works with, accuses, jokes, laughs, refuses, complains, reminds, convinces, calls, ignores, runs from, encourages, approaches, insults, holds, pushes, shoots, leads, discusses, threatens, dances with, takes, reveals, comes home, smiles at, looks for, restrains, wakes up, announces, brings, notices, reads, lies, thinks, fishes, tries, commits crime, wishes, saves, remembers, visits, photographs, takes care of, shushes, walks away from, receives news from, opens, hides from, states, gossips, passes by, points at, throws, checks, puts, pulls, sings to, sends away, repeats, has sex, obeys.

Figure 11: Distribution of interaction labels in train/val/test sets. Sorted in descending order based on the train set.

Bar chart. y-axis: Normalized #samples. Legend: train, val, test. x-axis classes, left to right: stranger, colleague, friend, lover, boss, enemy, worker, acquaintance, parent, kbr, manager, child, customer, sibling, ex-lover.

Figure 12: Distribution of relationship labels in train/val/test sets. Sorted in descending order based on the train set.


Radial tree diagram rooted at INTERACTIONS: each of the 101 grouped classes (inner circle) branches into the fine-grained labels it absorbs (outer circle). For example, leaves covers leaves, leave together and gets out of the house; kisses covers kisses and tries to kiss; hits covers hits, fights, slaps, attacks, punches, hurts and kicks.

Figure 13: Diagram depicting how we group 324 interaction classes (outer circle) into 101 (inner circle). Best seen on the screen.


Radial tree diagram rooted at RELATIONSHIPS: each of the 15 grouped classes (inner circle: stranger, friend, colleague, lover, parent, boss, sibling, kbr, enemy, customer, child, acquaintance, worker, manager, ex-lover) branches into the fine-grained labels it absorbs (outer circle). For example, the inner class kbr is attached to the outer label knows by reputation.

Figure 14: Diagram depicting how we group 107 relationship classes (outer circle) into 15 (inner circle).
