How Much Does Audio Matter to Recognize Egocentric Object Interactions?

Alejandro Cartas¹, Jordi Luque², Petia Radeva¹, Carlos Segura² and Mariella Dimiccoli³

¹ University of Barcelona    ² Telefonica I+D, Research, Spain

³ Institut de Robòtica i Informàtica Industrial (CSIC-UPC)

Abstract

Sounds are an important source of information on our daily interactions with objects. For instance, a significant number of people can discern the temperature of water being poured just by using the sense of hearing¹. However, only a few works have explored the use of audio for the classification of object interactions, either in conjunction with vision or as a single modality. In this preliminary work, we propose an audio model for egocentric action recognition and explore its usefulness on the parts of the problem (noun, verb, and action classification). Our model achieves a competitive result in terms of verb classification (34.26% accuracy) on a standard benchmark with respect to vision-based state-of-the-art systems, while using a comparatively lighter architecture.

¹ For more information on this fun fact, please refer to experiment 1 in [18].

1. Introduction

Human experience of the world is inherently multimodal [5, 3]. We employ different senses to perceive new information, both passively and actively, as we explore our environment and interact with it. In particular, object manipulations almost always have an associated sound (e.g., opening a tap), and we naturally learn and exploit these associations to recognize object interactions by relying on the available sensory information (audio, vision, or both). Recognizing object interactions from visual information has a long history of research in Computer Vision [19]. In contrast, audio has been comparatively little explored in this context, and most works have focused on auditory scene classification [10, 14, 15, 16]. Only more recently has audio been used in conjunction with visual information for scene classification [2] and object interaction recognition [1, 12].

All works prior to the introduction of Convolutional Neural Networks (CNNs) for audio classification [10, 14, 15, 16] shared a common pipeline: first extracting a time-frequency representation of the audio signal, such as the mel-spectrogram [10], and then classifying it with methods like Random Forests [14] or Support Vector Machines (SVMs) [16]. More recently, Rakotomamonjy and Gasso [15] proposed to work directly on the image of the spectrogram, instead of its coefficients, to extract audio features. More specifically, they used histograms of gradients (HOGs) of the spectrogram image as features, on top of which they applied an SVM classifier. The idea of using the spectrogram image as input to a CNN to learn features in an end-to-end fashion was first proposed in [13].

Later on, Owens et al. [12] presented a model that takes as input a video, without audio, of a person scratching or touching different materials, and generates the corresponding synchronized sounds. In their model, the sequential visual information is processed by a CNN (AlexNet [9]) connected to a Long Short-Term Memory (LSTM) [8], and the problem is treated as a regression of an audio cochleagram. Aytar et al. [2] proposed to perform auditory scene classification by using transfer learning from visual CNNs in an unsupervised fashion. In [6], a two-stream neural network learns semantically meaningful words and phrases at the spectral feature level from a natural input image and its spoken captions, without relying on speech recognition or text transcriptions.

Given the indisputable importance of sounds in human and machine perception, this work aims at understanding the role of audio in egocentric action recognition and, in particular, at unveiling when audio and visual features provide complementary information in the specific domain of egocentric object interactions in a kitchen.

2. Audio-based Classification

Our model is a VGG-11 [17] neural network that takes as input an audio spectrogram (see Fig. 1).



[Figure: a video segment's audio spectrogram is fed to a VGG-11 network whose convolutional blocks are followed by two 4096-unit fully connected layers and a 26-way output.]

Figure 1: Audio-based classification overview.
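As a rough illustration of the pipeline in Fig. 1, the following PyTorch sketch adapts torchvision's stock VGG-11 to a single-channel spectrogram input with a 26-way verb output. The helper name build_audio_vgg11 is ours, and the 331 × 248 input size comes from the preprocessing described below; this is a sketch under those assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg11

def build_audio_vgg11(num_classes=26):
    """VGG-11 adapted to a single-channel log-spectrogram input (sketch)."""
    model = vgg11(weights=None)
    # The stock network expects 3-channel RGB images; swap the first
    # convolution for a 1-channel one so it accepts a spectrogram.
    model.features[0] = nn.Conv2d(1, 64, kernel_size=3, padding=1)
    # Resize the last fully connected layer to the number of target
    # classes (26 verb categories here; 71 would be used for nouns).
    model.classifier[-1] = nn.Linear(4096, num_classes)
    return model

if __name__ == "__main__":
    model = build_audio_vgg11(num_classes=26)
    # One 331 x 248 log-spectrogram (frequency bins x time frames).
    spectrogram = torch.randn(1, 1, 331, 248)
    print(model(spectrogram).shape)  # torch.Size([1, 26])
```

The adaptive average pooling inside torchvision's VGG makes the 331 × 248 input work without further changes to the fully connected layers.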

[Figure: histogram (black) and cumulative percentage (red) of video segment durations, with the 4-second threshold marked. Duration statistics in seconds: min 0.5, mode 1.0, median 1.78, mean 3.38, std. dev. 5.04, max 145.16.]

Figure 2: Statistics and histogram of the time duration of the video segments in the EPIC Kitchens dataset training split, with a bin size of half a second.

The spectrogram only considers the first four seconds of a short video segment. To determine this time interval, we computed the duration statistics of the video segments in the filtered training split. We also computed the duration histogram and its cumulative percentage, shown in black and red, respectively, in Fig. 2. As can be observed from the threshold line in yellow, setting the audio window size to 4 seconds completely covers 80.697% of all video segments in a single window. When a video segment lasts less than four seconds, its spectrogram is zero-padded.
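A minimal sketch of this coverage check, assuming the per-segment durations in seconds are available from the annotations (function names are ours):

```python
import numpy as np

def window_coverage(durations_sec, window_sec=4.0):
    """Fraction of segments that fit entirely inside one audio window."""
    durations_sec = np.asarray(durations_sec, dtype=float)
    return float((durations_sec <= window_sec).mean())

def duration_histogram(durations_sec, bin_size=0.5):
    """Histogram and cumulative percentage of segment durations (cf. Fig. 2)."""
    durations_sec = np.asarray(durations_sec, dtype=float)
    bins = np.arange(0.0, durations_sec.max() + bin_size, bin_size)
    counts, edges = np.histogram(durations_sec, bins=bins)
    cumulative_pct = 100.0 * np.cumsum(counts) / counts.sum()
    return counts, edges, cumulative_pct
```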

We extracted the audio from all video segments using a sampling frequency of 16 kHz², as it covers most of the band audible to the average person [7]. Since the audio from the videos has two channels, we merged them into a single signal by computing their mean. From this signal we computed the short-time Fourier transform (STFT) [11], as we are interested in noises rather than in human voices. The STFT used a Hamming window of length roughly 30 ms with a time overlap of 50%. For convenience with the input size of our CNN, we used a frame length of 661 samples. The spectrogram is the squared magnitude of the STFT of the signal. Subsequently, we took the logarithm of the spectrogram in order to reduce its range of values. Finally, all spectrograms were normalized. The size of the input spectrogram image is 331 × 248.

² We used FFMPEG for the audio extraction: http://www.ffmpeg.org.
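A possible implementation of this preprocessing with librosa [11] is sketched below. The exact hop length, normalization scheme, and padding order are assumptions, since the paper only reports the frame length, the overlap, and the final 331 × 248 size.

```python
import librosa
import numpy as np

SR = 16000            # audio sampling rate
N_FFT = 661           # frame length from the paper (gives 331 frequency bins)
HOP = N_FFT // 2      # ~50% overlap (exact hop length is an assumption)
CLIP_SECONDS = 4.0    # audio window length

def log_spectrogram(path):
    """Load a segment, mix to mono, and build a normalized log power spectrogram."""
    # librosa.load with mono=True averages the two channels, matching the
    # mean of both channels described above.
    y, _ = librosa.load(path, sr=SR, mono=True, duration=CLIP_SECONDS)
    stft = librosa.stft(y, n_fft=N_FFT, win_length=N_FFT,
                        hop_length=HOP, window="hamming")
    spec = np.log(np.abs(stft) ** 2 + 1e-10)        # log power spectrogram
    # The normalization scheme is not specified in the paper; zero mean and
    # unit variance is one plausible choice.
    spec = (spec - spec.mean()) / (spec.std() + 1e-10)
    # Zero-pad along the time axis when the segment is shorter than 4 s.
    target_frames = 1 + int(CLIP_SECONDS * SR) // HOP
    if spec.shape[1] < target_frames:
        spec = np.pad(spec, ((0, 0), (0, target_frames - spec.shape[1])))
    return spec
```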

3. Experiments

The objective of our experiments was to determine the classification performance obtained by leveraging the audio modality in an egocentric action recognition task. We used the EPIC Kitchens dataset [4] in our experiments. Each video segment in the dataset shows a participant performing one specific cooking-related action. A labeled action consists of a verb plus a noun, for example, "cut potato" or "wash cup". We used accuracy as the performance metric in our experiments. Moreover, as a comparison with using only visual information, we show the results obtained on the official test split of the dataset.

Dataset  The EPIC Kitchens dataset includes 432 videos egocentrically recorded by 32 participants in their own kitchens while cooking or preparing food. Each video was divided into segments in which the person performs one specific action (a verb plus a noun). The total number of verb and noun categories in the dataset is 125 and 352, respectively. This dataset is currently being used in an egocentric action recognition challenge³. Thus, we only used the labeled training split and filtered it according to [4], i.e., keeping only the verbs and nouns that have more than 100 instances in the split.

³ https://epic-kitchens.github.io/2018



Table 1: Performance comparison with EPIC Kitchens challenge baseline results. See text for more information.

                          Top-1 Accuracy           Top-5 Accuracy           Avg Class Precision      Avg Class Recall
                          VERB   NOUN   ACTION     VERB   NOUN   ACTION     VERB   NOUN   ACTION     VERB   NOUN   ACTION
S1  Chance/Random         12.62   1.73  00.22      43.39  08.12  03.68      03.67  01.15  00.08      03.67  01.15  00.05
    TSN (RGB)             45.68  36.80  19.86      85.56  64.19  41.89      61.64  34.32  09.96      23.81  31.62  08.81
    TSN (FLOW)            42.75  17.40  09.02      79.52  39.43  21.92      21.42  13.75  02.33      15.58  09.51  02.06
    TSN (FUSION)          48.23  36.71  20.54      84.09  62.32  39.79      47.26  35.42  10.46      22.33  30.53  08.83
    Ours                  34.26  08.60  03.28      75.53  25.54  11.49      12.04  02.36  00.55      11.54  04.56  00.89
S2  Chance/Random         10.71  01.89  00.22      38.98  09.31  03.81      03.56  01.08  00.08      03.56  01.08  00.05
    TSN (RGB)             34.89  21.82  10.11      74.56  45.34  25.33      19.48  14.67  04.77      11.22  17.24  05.67
    TSN (FLOW)            40.08  14.51  06.73      73.40  33.77  18.64      19.98  09.48  02.08      13.81  08.58  02.27
    TSN (FUSION)          39.40  22.70  10.89      74.29  45.72  25.26      22.54  15.33  05.60      13.06  17.52  05.81
    Ours                  32.09  08.13  02.77      68.90  22.43  10.58      11.83  02.92  00.86      10.25  04.93  01.90

[Figure: normalized confusion matrix over the 26 verb classes (adjust, check, close, cut, dry, empty, fill, flip, insert, mix, move, open, peel, pour, press, put, remove, scoop, shake, squeeze, take, throw, turn, turn-off, turn-on, wash); plot title: "Verb classification Acc. 39.39%".]

Figure 3: Normalized verb confusion matrix.

                     Verb      Noun      Action
Chance/Random        13.11%     2.48%     0.75%
Ours (Top-1)         39.39%    13.41%    10.16%
Ours (Top-5)         81.99%    35.60%    26.36%

Table 2: Action recognition accuracy for all our experiments.

This filtering results in 271 videos comprising 22,018 segments, with 26 verb and 71 noun categories. The resulting distribution of action classes is highly unbalanced.
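The filtering step could be sketched with pandas as follows, assuming the training annotations are available as a CSV with per-segment verb_class and noun_class columns (the file and column names here are assumptions, not part of the paper):

```python
import pandas as pd

# File and column names are assumptions about the EPIC Kitchens annotations.
labels = pd.read_csv("EPIC_train_action_labels.csv")

def frequent(series, min_count=100):
    """Class ids that occur more than `min_count` times in the training split."""
    counts = series.value_counts()
    return set(counts[counts > min_count].index)

verbs = frequent(labels["verb_class"])
nouns = frequent(labels["noun_class"])

# Keep only segments whose verb and noun classes are both frequent enough.
filtered = labels[labels["verb_class"].isin(verbs)
                  & labels["noun_class"].isin(nouns)]
print(len(filtered), "segments kept")
```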

Dataset split  For the different experiments (verb, noun, and action), we divided the labeled data into training, validation, and test splits considering all the participants. In all cases, the proportions of the validation and test splits were 10% and 15%, respectively. For the verb and noun experiments, the splits were obtained by randomly stratifying the data, since each category has more than 100 samples. In the case of the action experiment, the class imbalance was taken into account for the data split as follows. At least one sample of each category was put in the training split. If a category had at least two samples, one of them went to the test split. The rest of the categories were randomly stratified across all splits.
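A sketch of the verb/noun splitting with scikit-learn's stratified train_test_split is given below; the special handling of rare action categories described above is omitted, and the function and variable names are ours.

```python
from sklearn.model_selection import train_test_split

def stratified_splits(segment_ids, labels, val_frac=0.10, test_frac=0.15, seed=0):
    """Split segments into train/val/test with label-stratified sampling."""
    # First carve out the 15% test split.
    train_ids, test_ids, train_y, test_y = train_test_split(
        segment_ids, labels, test_size=test_frac, stratify=labels,
        random_state=seed)
    # The validation split is taken from what remains, sized so that it
    # amounts to 10% of the full set.
    val_size = val_frac / (1.0 - test_frac)
    train_ids, val_ids, _, _ = train_test_split(
        train_ids, train_y, test_size=val_size, stratify=train_y,
        random_state=seed)
    return train_ids, val_ids, test_ids
```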

Training  For all our experiments we used stochastic gradient descent (SGD) to train our network, with a momentum of 0.9 and a batch size of 6. The learning rates and numbers of training epochs were: 5 × 10⁻⁶ for 79 epochs for the verb experiment; 2.5 × 10⁻⁶ for 129 epochs for the noun experiment; and 1.75 × 10⁻⁶ for 5 epochs for the action experiment.
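A minimal PyTorch training setup with these hyperparameters might look as follows; it reuses the hypothetical build_audio_vgg11 helper from the architecture sketch and assumes a DataLoader that yields (spectrogram, label) batches of size 6.

```python
import torch

# Hyperparameters reported for the verb experiment; the noun and action
# experiments change only the learning rate and the number of epochs.
model = build_audio_vgg11(num_classes=26)   # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=5e-6, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """One pass over a DataLoader of (spectrogram, verb_label) batches."""
    model.train()
    for spectrograms, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(spectrograms), targets)
        loss.backward()
        optimizer.step()
```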

Results and discussion  The accuracy for all our experiments is shown in Table 2, which also reports the random-classification accuracy baseline for the dataset splits described above. We calculated the random-classification accuracy over N categories as

    acc = \sum_{i}^{N} p_i^{\mathrm{train}} \cdot p_i^{\mathrm{test}}     (1)

where p_i^{\mathrm{train}} and p_i^{\mathrm{test}} are the occurrence probabilities of class i in the train and test splits, respectively. For comparison, we also present the test results on the seen (S1) and unseen (S2) splits of the EPIC Kitchens challenge in Table 1. These results were obtained using the networks trained for the verb and noun experiments.
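Equation (1) can be computed directly from the label lists of the two splits; a small NumPy sketch (function name ours):

```python
import numpy as np

def chance_accuracy(train_labels, test_labels):
    """Random-classifier accuracy as in Eq. (1): sum over classes of the product
    of the class occurrence probabilities in the train and test splits."""
    train_labels = np.asarray(train_labels)
    test_labels = np.asarray(test_labels)
    classes = np.union1d(train_labels, test_labels)
    p_train = np.array([(train_labels == c).mean() for c in classes])
    p_test = np.array([(test_labels == c).mean() for c in classes])
    return float(np.sum(p_train * p_test))
```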

The overall results indicate a good performance when using audio alone for verb classification. However, the models fail to recognize categories that do not produce sound, such as flip (verb) and heat (noun), as seen in the confusion matrices in Figs. 3 and 4. In the case of verbs, the model also fails on conceptually close verbs, such as the pairs turn-on/open and turn-off/close. In the case of noun classification, the model incorrectly predicts objects made of similar materials; for example, the categories can and pan are both metallic.



[Figure: normalized confusion matrix over the 71 noun classes; plot title: "Noun classification Acc. 13.41%".]

Figure 4: Normalized noun confusion matrix.

We consider that an object may produce different sounds depending on how it is manipulated, and this may help to better discriminate verbs performed on objects that are visually ambiguous from an egocentric perspective. For instance, knives and peelers are visually similar objects that could lead to misclassification of the verbs cut and peel applied to nouns like carrot, but their sounds were mostly correctly classified, as seen in the confusion matrices in Figs. 3 and 4.

4. Conclusion

We presented an audio-based model for egocentric action classification trained on the EPIC Kitchens dataset. We analyzed the results on the splits we made from the training set and compared them with the vision-based test baselines. The obtained results show that audio alone achieves a good performance on verb classification (34.26% accuracy). This suggests that audio could complement visual sources on the same task in a multimodal manner. Further work will focus directly on this research line.

References

[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, 2017.
[2] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
[3] Ruth Campbell. The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493):1001–1010, 2007.
[4] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In European Conference on Computer Vision (ECCV), 2018.
[5] Francesca Frassinetti, Nadia Bolognini, and Elisabetta Ladavas. Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental Brain Research, 147(3):332–343, 2002.
[6] David F. Harwath, Antonio Torralba, and James R. Glass. Unsupervised learning of spoken language with visual context. In NIPS, 2016.
[7] Henry Heffner and Rickye Heffner. Hearing ranges of laboratory animals. Journal of the American Association for Laboratory Animal Science: JAALAS, 46:20–2, 02 2007.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[10] David Li, Jason Tam, and Derek Toub. Auditory scene classification using machine learning techniques. AASP Challenge, 2013.
[11] Brian McFee, Colin A. Raffel, Dawen Liang, Daniel Patrick Whittlesey Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. 2015.
[12] Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2405–2413, 2016.
[13] K. J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Sep. 2015.
[14] Karol J. Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1015–1018, New York, NY, USA, 2015. ACM.
[15] A. Rakotomamonjy and G. Gasso. Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):142–153, Jan 2015.
[16] Gerard Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for auditory scene classification. 2013.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Carlos Velasco, Russ Jones, Scott King, and Charles Spence. The sound of temperature: What information do pouring sounds convey concerning the temperature of a beverage? Journal of Sensory Studies, 28(5):335–345, 2013.
[19] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. A comprehensive survey of vision-based human action recognition methods. Sensors, 19(5), 2019.
