How Much Does Audio Matter to Recognize Egocentric Object Interactions?
Alejandro Cartas1, Jordi Luque2, Petia Radeva1, Carlos Segura2 and Mariella Dimiccoli3
1 University of Barcelona   2 Telefonica I+D, Research, Spain
3 Institut de Robotica i Informatica Industrial (CSIC-UPC)
Abstract
Sounds are an important source of information about our daily interactions with objects. For instance, a significant proportion of people can discern the temperature of water being poured just by using the sense of hearing1. However, only a few works have explored the use of audio for the classification of object interactions, either in conjunction with vision or as a single modality. In this preliminary work, we propose an audio model for egocentric action recognition and explore its usefulness on the parts of the problem (noun, verb, and action classification). Our model achieves a competitive result in terms of verb classification (34.26% accuracy) on a standard benchmark with respect to vision-based state-of-the-art systems, while using a comparatively lighter architecture.
1. Introduction

Human experience of the world is inherently multimodal [5, 3]. We employ different senses to perceive new information, both passively and actively, as we explore our environment and interact with it. In particular, object manipulations almost always have an associated sound (e.g., opening a tap), and we naturally learn and exploit these associations to recognize object interactions by relying on the available sensory information (audio, vision, or both). Recognizing object interactions from visual information has a long history of research in Computer Vision [19]. In contrast, audio has been comparatively little explored in this context, and most works have focused on auditory scene classification [10, 14, 15, 16]. Only more recently has audio been used in conjunction with visual information for scene classification [2] and object interaction recognition [1, 12].
All works prior to the introduction of Convolutional Neural Networks (CNNs) for audio classification [10, 14, 15, 16] shared a common pipeline: first extracting a time-frequency representation from the audio signal, such as the mel-spectrogram [10], and then classifying it with methods like Random Forests [14] or Support Vector Machines (SVMs) [16]. More recently, Rakotomamonjy and Gasso [15] proposed to work directly on the image of the spectrogram, instead of its coefficients, to extract audio features. More specifically, they used histograms of oriented gradients (HOGs) of the spectrogram image as features, on top of which they applied an SVM classifier. The idea of using the spectrogram image as input to a CNN to learn features in an end-to-end fashion was first proposed in [13].

1 For more information on this fun fact, please refer to experiment 1 in [18].
Later on, Owens et al. [12] presented a model that takes as input a video, without audio, capturing a person scratching or touching different materials, and generates the corresponding synchronized sounds. In their model, the sequential visual information is processed by a CNN (AlexNet [9]) connected to a Long Short-Term Memory (LSTM) [8], and the problem is treated as a regression of an audio cochleagram. Aytar et al. [2] proposed to perform auditory scene classification by using transfer learning from visual CNNs in an unsupervised fashion. In [6], a two-stream neural network learns semantically meaningful words and phrases at the spectral feature level from a natural input image and its spoken captions, without relying on speech recognition or text transcriptions.
Given the indisputable importance of sounds in human and machine perception, this work aims at understanding the role of audio in egocentric action recognition, and in particular at unveiling when audio and visual features provide complementary information in the specific domain of egocentric object interactions in a kitchen.
2. Audio-based Classification
arXiv:1906.00634v1 [cs.CV] 3 Jun 2019

Our model is a VGG-11 [17] neural network that takes an audio spectrogram as input. This spectrogram covers only the first four seconds of a short video segment. To determine this time interval, we computed duration statistics for the video segments in the filtered training split. We also computed the duration histogram and cumulative percentage, shown in black and red, respectively, in Fig. 2. As the yellow threshold line shows, setting the audio window size to 4 seconds completely covers 80.697% of all video segments within a single window. When a video segment lasts less than four seconds, zero padding is applied to its spectrogram.

Figure 1: Audio-based classification overview. A VGG-11 network classifies the audio spectrogram extracted from a video segment.

Figure 2: Statistics and histogram of the time duration of the video segments in the EPIC Kitchens dataset training split with a bin size of half a second. (Duration statistics in seconds: Min 0.5, Mode 1.0, Median 1.78, Mean 3.38, Std. Dev. 5.04, Max 145.16. Axes: time in seconds vs. number of segments, with cumulative percentage overlaid.)
We extracted the audio from all video segments using a sampling frequency of 16 kHz2, as it covers most of the band audible to the average person [7]. Since the audio from the videos has two channels, we merged them into one signal by computing their mean value. From this signal we computed the short-time Fourier transform (STFT) [11], as we are interested in noises rather than in human voices. The STFT used a Hamming window of length equal to roughly 30 ms with a time overlap of 50%. For convenience of the input size of our CNN, we used a frame length of 661 samples. The spectrogram is the squared magnitude of the STFT of the signal. Subsequently, we took the logarithm of the spectrogram in order to reduce its range of values. Finally, all spectrograms were normalized. The size of the input spectrogram image is 331 × 248.

2 We used FFMPEG for the audio extraction, http://www.ffmpeg.org.
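The preprocessing steps above (Hamming-windowed STFT with 50% overlap, squared magnitude, log compression, normalization) can be sketched in NumPy. The paper used librosa [11]; the function name, the epsilon constants, and the min-max normalization scheme below are my assumptions, since the paper does not specify its exact normalization.

```python
import numpy as np

def log_spectrogram(signal, frame_len=661, overlap=0.5):
    """Sketch of the paper's spectrogram pipeline (name is mine):
    Hamming-windowed STFT with 50% frame overlap, squared magnitude,
    log compression, then min-max normalization."""
    hop = int(frame_len * (1 - overlap))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)   # one STFT frame per row
    spec = np.abs(spectrum) ** 2             # squared magnitude
    log_spec = np.log(spec + 1e-10)          # compress the dynamic range
    # min-max normalization to [0, 1] (one plausible choice)
    log_spec -= log_spec.min()
    log_spec /= log_spec.max() + 1e-10
    return log_spec.T  # (freq_bins, time_frames), freq_bins = frame_len // 2 + 1

# A mono 4-second clip at 16 kHz yields 331 frequency bins,
# matching the 331 rows of the paper's input image.
clip = np.random.randn(4 * 16000)
spec = log_spectrogram(clip)
print(spec.shape[0])  # → 331
```

Note that with these exact parameters the number of time frames differs from the paper's 248 columns; the authors likely used a slightly different hop or padding convention, which the text does not detail.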
3. Experiments

The objective of our experiments was to determine the classification performance obtained by leveraging the audio modality on an egocentric action recognition task. We used the EPIC Kitchens dataset [4] in our experiments. Each video segment in the dataset shows a participant performing one specific cooking-related action. A labeled action consists of a verb plus a noun, for example, "cut potato" or "wash cup". We used accuracy as the performance metric in our experiments. Moreover, as a comparison with using only visual information, we show the results obtained on the official test split of the dataset.
Dataset The EPIC Kitchens dataset includes 432 videos egocentrically recorded by 32 participants in their own kitchens while cooking or preparing food. Each video was divided into segments in which the person performs one specific action (a verb plus a noun). The total numbers of verb and noun categories in the dataset are 125 and 352, respectively. This dataset is currently being used in an egocentric action recognition challenge3. We therefore used only the labeled training split and filtered it according to [4], i.e. keeping only the verbs and nouns that have more than 100 instances in the split. This results in 271 videos with 22,018 segments, and 26 verb and 71 noun categories. The resulting distribution of action classes is highly unbalanced.

3 https://epic-kitchens.github.io/2018

Table 1: Performance comparison with the EPIC Kitchens challenge baseline results. See text for more information.

                     |   Top-1 Accuracy    |   Top-5 Accuracy    | Avg Class Precision |  Avg Class Recall
                     | VERB  NOUN  ACTION  | VERB  NOUN  ACTION  | VERB  NOUN  ACTION  | VERB  NOUN  ACTION
S1 Chance/Random     | 12.62  1.73  00.22  | 43.39 08.12  03.68  | 03.67 01.15  00.08  | 03.67 01.15  00.05
   TSN (RGB)         | 45.68 36.80  19.86  | 85.56 64.19  41.89  | 61.64 34.32  09.96  | 23.81 31.62  08.81
   TSN (FLOW)        | 42.75 17.40  09.02  | 79.52 39.43  21.92  | 21.42 13.75  02.33  | 15.58 09.51  02.06
   TSN (FUSION)      | 48.23 36.71  20.54  | 84.09 62.32  39.79  | 47.26 35.42  10.46  | 22.33 30.53  08.83
   Ours              | 34.26 08.60  03.28  | 75.53 25.54  11.49  | 12.04 02.36  00.55  | 11.54 04.56  00.89
S2 Chance/Random     | 10.71 01.89  00.22  | 38.98 09.31  03.81  | 03.56 01.08  00.08  | 03.56 01.08  00.05
   TSN (RGB)         | 34.89 21.82  10.11  | 74.56 45.34  25.33  | 19.48 14.67  04.77  | 11.22 17.24  05.67
   TSN (FLOW)        | 40.08 14.51  06.73  | 73.40 33.77  18.64  | 19.98 09.48  02.08  | 13.81 08.58  02.27
   TSN (FUSION)      | 39.40 22.70  10.89  | 74.29 45.72  25.26  | 22.54 15.33  05.60  | 13.06 17.52  05.81
   Ours              | 32.09 08.13  02.77  | 68.90 22.43  10.58  | 11.83 02.92  00.86  | 10.25 04.93  01.90

Figure 3: Normalized verb confusion matrix (verb classification accuracy: 39.39%).

Table 2: Action recognition accuracy for all our experiments.

                   Verb     Noun     Action
Chance/Random      13.11%   2.48%    0.75%
Ours (Top-1)       39.39%   13.41%   10.16%
Ours (Top-5)       81.99%   35.60%   26.36%
Dataset split For the different experiments (verb, noun, and action), we divided the given labeled data into training, validation, and test splits considering all the participants. In all cases, the data proportions for the validation and test splits were 10% and 15%, respectively. For the verb and noun experiments, the splits were obtained by randomly stratifying the data, because each category has more than 100 samples. In the case of the action experiment, the class imbalance was considered for the data split as follows. At least one sample of each category was put in the training split. If a category had at least two samples, one of them went to the test split. The remaining samples were randomly stratified across all splits.
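The imbalance-aware action split described above can be sketched as follows. This is a plausible reading of the procedure, not the authors' code; the function and variable names are mine, and the per-class application of the 10%/15% proportions to the remaining samples is an assumption.

```python
import random
from collections import defaultdict

def action_split(samples, val_frac=0.10, test_frac=0.15, seed=0):
    """Imbalance-aware split: `samples` is a list of
    (segment_id, action_label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for seg, label in samples:
        by_class[label].append(seg)
    train, val, test = [], [], []
    for label, segs in by_class.items():
        rng.shuffle(segs)
        train.append(segs[0])        # at least one sample of each class in train
        if len(segs) >= 2:
            test.append(segs[1])     # a second sample, if any, goes to test
        rest = segs[2:]              # the rest follow the target proportions
        n_val = round(len(rest) * val_frac)
        n_test = round(len(rest) * test_frac)
        val += rest[:n_val]
        test += rest[n_val:n_val + n_test]
        train += rest[n_val + n_test:]
    return train, val, test
```

A convenient property of this scheme is that every action class appears in the training split, even classes with a single sample.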
Training For all our experiments we used the stochastic gradient descent (SGD) optimization algorithm to train our network, with a momentum of 0.9 and a batch size of 6. The learning rates and numbers of training epochs for each experiment were: 5 × 10⁻⁶ during 79 epochs for verb; 2.5 × 10⁻⁶ during 129 epochs for noun; and 1.75 × 10⁻⁶ during 5 epochs for action.
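For reference, a single SGD-with-momentum update using the verb experiment's hyperparameters looks like the following. This is the standard momentum formulation; the paper does not spell out which variant its framework used, so treat this as a sketch.

```python
def sgd_momentum_step(weights, grads, velocity, lr=5e-6, momentum=0.9):
    """One SGD-with-momentum update (standard formulation):
    the velocity accumulates a decaying sum of past gradients."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

w, v = [0.0, 0.0], [0.0, 0.0]
g = [1.0, -1.0]
w, v = sgd_momentum_step(w, g, v)  # first step: w = -lr * g elementwise
```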
Results and discussion The accuracy performance for all experiments is shown in Table 2, which also presents the random classification accuracy baseline computed from the dataset splits described above. We calculated the random classification accuracy over N categories as

acc = \sum_{i=1}^{N} p_i^{train} \cdot p_i^{test},  (1)

where p_i^{train} and p_i^{test} are the occurrence probabilities of class i in the train and test splits, respectively. For comparison, we also present the test results on the seen (S1) and unseen (S2) splits from the EPIC Kitchens challenge in Table 1. These results were obtained using the networks trained for the verb and noun experiments.
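Eq. (1) is a direct sum over classes and can be transcribed as follows (the function name is mine):

```python
from collections import Counter

def chance_accuracy(train_labels, test_labels):
    """Random-baseline accuracy of Eq. (1): the probability that a label
    drawn from the training class distribution matches the class of a
    random test example, summed over all classes."""
    n_train, n_test = len(train_labels), len(test_labels)
    p_train = Counter(train_labels)
    p_test = Counter(test_labels)
    return sum((p_train[c] / n_train) * (p_test[c] / n_test)
               for c in p_train)

# Two perfectly balanced classes give the familiar 50% chance level.
print(chance_accuracy(["a", "b"] * 50, ["a", "b"] * 10))  # → 0.5
```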
The overall results indicate a good performance using audio alone for verb classification. Additionally, the models fail to recognize categories that do not produce sound, such as flip (verb) and heat (noun), as seen in the confusion matrices in Figs. 3 and 4. In the case of verbs, the model also fails on conceptually close verbs, like the pairs turn-on/open and turn-off/close. In the case of noun classification, the model incorrectly predicts objects made of similar materials; for example, the categories can and pan are both metallic.
Figure 4: Normalized noun confusion matrix (noun classification accuracy: 13.41%).
We consider that an object may produce different sounds depending on how it is manipulated, and this may help to better discriminate the verb performed on an object that may be visually ambiguous from an egocentric perspective. For instance, knives and peelers are visually similar objects that could lead to action misclassification between the verbs cut and peel used on nouns like carrot, but their sounds were mostly correctly classified, as seen in the confusion matrices in Figs. 3 and 4.
4. Conclusion
We presented an audio-based model for egocentric action classification trained on the EPIC Kitchens dataset. We analyzed the results on the splits we made from the training set and compared them with the visual test baseline. The obtained results show that audio alone achieves a good performance on verb classification (34.26% accuracy). This suggests that audio could complement visual sources on the same task in a multimodal manner. Further work will focus directly on this research line.
References

[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, 2017.
[2] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
[3] Ruth Campbell. The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493):1001–1010, 2007.
[4] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-Kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
[5] Francesca Frassinetti, Nadia Bolognini, and Elisabetta Ladavas. Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental Brain Research, 147(3):332–343, 2002.
[6] David F. Harwath, Antonio Torralba, and James R. Glass. Unsupervised learning of spoken language with visual context. In NIPS, 2016.
[7] Henry Heffner and Rickye Heffner. Hearing ranges of laboratory animals. Journal of the American Association for Laboratory Animal Science (JAALAS), 46:20–22, 2007.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[10] David Li, Jason Tam, and Derek Toub. Auditory scene classification using machine learning techniques. AASP Challenge, 2013.
[11] Brian McFee, Colin A. Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. 2015.
[12] Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2405–2413, 2016.
[13] K. J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2015.
[14] Karol J. Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1015–1018, New York, NY, USA, 2015. ACM.
[15] A. Rakotomamonjy and G. Gasso. Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):142–153, 2015.
[16] Gerard Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for auditory scene classification. 2013.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Carlos Velasco, Russ Jones, Scott King, and Charles Spence. The sound of temperature: What information do pouring sounds convey concerning the temperature of a beverage? Journal of Sensory Studies, 28(5):335–345, 2013.
[19] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. A comprehensive survey of vision-based human action recognition methods. Sensors, 19(5), 2019.