How Much Does Audio Matter to Recognize Egocentric Object Interactions?
Alejandro Cartas1, Jordi Luque2, Petia Radeva1, Carlos Segura2 and Mariella Dimiccoli3
1 University of Barcelona   2 Telefonica I+D, Research, Spain
3 Institut de Robotica i Informatica Industrial (CSIC-UPC)
Abstract
Sounds are an important source of information about our daily interactions with objects. For instance, a significant proportion of people can discern the temperature of water being poured just by using the sense of hearing1. However, only a few works have explored the use of audio for the classification of object interactions, either in conjunction with vision or as a single modality. In this preliminary work, we propose an audio model for egocentric action recognition and explore its usefulness on the parts of the problem (noun, verb, and action classification). Our model achieves a competitive result in terms of verb classification (34.26% accuracy) on a standard benchmark with respect to vision-based state-of-the-art systems, while using a comparatively lighter architecture.
1. Introduction

Human experience of the world is inherently multimodal [5, 3]. We employ different senses to perceive new information, both passively and actively, as we explore our environment and interact with it. In particular, object manipulations almost always have an associated sound (e.g., opening a tap), and we naturally learn and exploit these associations to recognize object interactions by relying on the available sensory information (audio, vision, or both). Recognizing object interactions from visual information has a long history of research in Computer Vision [19]. In contrast, audio has been comparatively little explored in this context, and most works have focused on auditory scene classification [10, 14, 15, 16]. Only more recently has audio been used in conjunction with visual information for scene classification [2] and object interaction recognition [1, 12].
All works prior to the introduction of Convolutional Neural Networks (CNNs) for audio classification [10, 14, 15, 16] shared a common pipeline: first extracting a time-frequency representation from the audio signal, such as the mel-spectrogram [10], and then classifying it with methods like Random Forests [14] or Support Vector Machines (SVMs) [16]. More recently, Rakotomamonjy and Gasso [15] proposed to work directly on the image of the spectrogram, instead of its coefficients, to extract audio features. More specifically, they used histograms of oriented gradients (HOGs) of the spectrogram image as features, on top of which they applied an SVM classifier. The idea of using the spectrogram image as input to a CNN to learn features in an end-to-end fashion was first proposed in [13].

1 For more information on this fun fact, please refer to experiment 1 in [18].
Later on, Owens et al. [12] presented a model that takes as input a video, without audio, capturing a person scratching or touching different materials, and generates the corresponding synchronized sounds. In their model, the sequential visual information is processed by a CNN (AlexNet [9]) connected to a Long Short-Term Memory (LSTM) [8], and the problem is treated as a regression of an audio cochleagram. Aytar et al. [2] proposed to perform auditory scene classification by using transfer learning from visual CNNs in an unsupervised fashion. In [6], a two-stream neural network learns semantically meaningful words and phrases at the spectral feature level from a natural input image and its spoken captions, without relying on speech recognition or text transcriptions.
Given the indisputable importance of sounds in human and machine perception, this work aims at understanding the role of audio in egocentric action recognition, and in particular at unveiling when audio and visual features provide complementary information in the specific domain of egocentric object interactions in a kitchen.
2. Audio-based Classification
arXiv:1906.00634v1 [cs.CV] 3 Jun 2019

Our model is a VGG-11 [17] neural network that takes an audio spectrogram as input. This spectrogram covers only the first four seconds of a short video segment. To determine this time interval, we computed duration statistics for the video segments in the filtered training split. We also computed the duration histogram and cumulative percentage, shown in black and red, respectively, in Fig. 2. As the yellow threshold line shows, setting the audio window size to 4 seconds completely covers 80.697% of all video segments within a single window. When a video segment lasts less than four seconds, zero padding is applied to its spectrogram.

Figure 1: Audio-based classification overview. A VGG-11 network classifies the audio spectrogram extracted from a video segment.

Figure 2: Statistics and histogram of the time duration of the video segments in the EPIC Kitchens dataset training split with a bin size of half a second. (Duration statistics in seconds: Min 0.5, Mode 1.0, Median 1.78, Mean 3.38, Std. Dev. 5.04, Max 145.16. Axes: time in seconds vs. number of segments, with cumulative percentage overlaid.)
We extracted the audio from all video segments using a sampling frequency of 16 kHz2, as it covers most of the band audible to the average person [7]. Since the audio from the videos has two channels, we merged them into one signal by computing their mean value. From this signal we computed the short-time Fourier transform (STFT) [11], as we are interested in noises rather than in human voices. The STFT used a Hamming window of length equal to roughly 30 ms with a time overlap of 50%. For convenience of the input size of our CNN, we used a frame length of 661 samples. The spectrogram is the squared magnitude of the STFT of the signal. Subsequently, we took the logarithm of the spectrogram in order to reduce its range of values. Finally, all spectrograms were normalized. The size of the input spectrogram image is 331 × 248.

2 We used FFMPEG for the audio extraction, http://www.ffmpeg.org.
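The preprocessing steps above (Hamming-windowed STFT with 50% overlap, squared magnitude, log compression, normalization) can be sketched in NumPy. The paper used librosa [11]; the function name, the epsilon constants, and the min-max normalization scheme below are my assumptions, since the paper does not specify its exact normalization.

```python
import numpy as np

def log_spectrogram(signal, frame_len=661, overlap=0.5):
    """Sketch of the paper's spectrogram pipeline (name is mine):
    Hamming-windowed STFT with 50% frame overlap, squared magnitude,
    log compression, then min-max normalization."""
    hop = int(frame_len * (1 - overlap))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)   # one STFT frame per row
    spec = np.abs(spectrum) ** 2             # squared magnitude
    log_spec = np.log(spec + 1e-10)          # compress the dynamic range
    # min-max normalization to [0, 1] (one plausible choice)
    log_spec -= log_spec.min()
    log_spec /= log_spec.max() + 1e-10
    return log_spec.T  # (freq_bins, time_frames), freq_bins = frame_len // 2 + 1

# A mono 4-second clip at 16 kHz yields 331 frequency bins,
# matching the 331 rows of the paper's input image.
clip = np.random.randn(4 * 16000)
spec = log_spectrogram(clip)
print(spec.shape[0])  # → 331
```

Note that with these exact parameters the number of time frames differs from the paper's 248 columns; the authors likely used a slightly different hop or padding convention, which the text does not detail.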
3. Experiments

The objective of our experiments was to determine the classification performance obtained by leveraging the audio modality on an egocentric action recognition task. We used the EPIC Kitchens dataset [4] in our experiments. Each video segment in the dataset shows a participant performing one specific cooking-related action. A labeled action consists of a verb plus a noun, for example, "cut potato" or "wash cup". We used accuracy as the performance metric in our experiments. Moreover, as a comparison with using only visual information, we show the results obtained on the official test split of the dataset.
Dataset The EPIC Kitchens dataset includes 432 videos egocentrically recorded by 32 participants in their own kitchens while cooking or preparing food. Each video was divided into segments in which the person performs one specific action (a verb plus a noun). The total numbers of verb and noun categories in the dataset are 125 and 352, respectively. This dataset is currently being used in an egocentric action recognition challenge3. We therefore used only the labeled training split and filtered it according to [4], i.e. keeping only the verbs and nouns that have more than 100 instances in the split. This results in 271 videos with 22,018 segments, and 26 verb and 71 noun categories. The resulting distribution of action classes is highly unbalanced.

3 https://epic-kitchens.github.io/2018

Table 1: Performance comparison with the EPIC Kitchens challenge baseline results. See text for more information.

                     |   Top-1 Accuracy    |   Top-5 Accuracy    | Avg Class Precision |  Avg Class Recall
                     | VERB  NOUN  ACTION  | VERB  NOUN  ACTION  | VERB  NOUN  ACTION  | VERB  NOUN  ACTION
S1 Chance/Random     | 12.62  1.73  00.22  | 43.39 08.12  03.68  | 03.67 01.15  00.08  | 03.67 01.15  00.05
   TSN (RGB)         | 45.68 36.80  19.86  | 85.56 64.19  41.89  | 61.64 34.32  09.96  | 23.81 31.62  08.81
   TSN (FLOW)        | 42.75 17.40  09.02  | 79.52 39.43  21.92  | 21.42 13.75  02.33  | 15.58 09.51  02.06
   TSN (FUSION)      | 48.23 36.71  20.54  | 84.09 62.32  39.79  | 47.26 35.42  10.46  | 22.33 30.53  08.83
   Ours              | 34.26 08.60  03.28  | 75.53 25.54  11.49  | 12.04 02.36  00.55  | 11.54 04.56  00.89
S2 Chance/Random     | 10.71 01.89  00.22  | 38.98 09.31  03.81  | 03.56 01.08  00.08  | 03.56 01.08  00.05
   TSN (RGB)         | 34.89 21.82  10.11  | 74.56 45.34  25.33  | 19.48 14.67  04.77  | 11.22 17.24  05.67
   TSN (FLOW)        | 40.08 14.51  06.73  | 73.40 33.77  18.64  | 19.98 09.48  02.08  | 13.81 08.58  02.27
   TSN (FUSION)      | 39.40 22.70  10.89  | 74.29 45.72  25.26  | 22.54 15.33  05.60  | 13.06 17.52  05.81
   Ours              | 32.09 08.13  02.77  | 68.90 22.43  10.58  | 11.83 02.92  00.86  | 10.25 04.93  01.90

Figure 3: Normalized verb confusion matrix (verb classification accuracy: 39.39%).

Table 2: Action recognition accuracy for all our experiments.

                   Verb     Noun     Action
Chance/Random      13.11%   2.48%    0.75%
Ours (Top-1)       39.39%   13.41%   10.16%
Ours (Top-5)       81.99%   35.60%   26.36%
Dataset split For the different experiments (verb, noun, and action), we divided the given labeled data into training, validation, and test splits considering all the participants. In all cases, the data proportions for the validation and test splits were 10% and 15%, respectively. For the verb and noun experiments, the splits were obtained by randomly stratifying the data, because each category has more than 100 samples. In the case of the action experiment, the class imbalance was considered for the data split as follows. At least one sample of each category was put in the training split. If a category had at least two samples, one of them went to the test split. The remaining samples were randomly stratified across all splits.
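The imbalance-aware action split described above can be sketched as follows. This is a plausible reading of the procedure, not the authors' code; the function and variable names are mine, and the per-class application of the 10%/15% proportions to the remaining samples is an assumption.

```python
import random
from collections import defaultdict

def action_split(samples, val_frac=0.10, test_frac=0.15, seed=0):
    """Imbalance-aware split: `samples` is a list of
    (segment_id, action_label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for seg, label in samples:
        by_class[label].append(seg)
    train, val, test = [], [], []
    for label, segs in by_class.items():
        rng.shuffle(segs)
        train.append(segs[0])        # at least one sample of each class in train
        if len(segs) >= 2:
            test.append(segs[1])     # a second sample, if any, goes to test
        rest = segs[2:]              # the rest follow the target proportions
        n_val = round(len(rest) * val_frac)
        n_test = round(len(rest) * test_frac)
        val += rest[:n_val]
        test += rest[n_val:n_val + n_test]
        train += rest[n_val + n_test:]
    return train, val, test
```

A convenient property of this scheme is that every action class appears in the training split, even classes with a single sample.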
Training For all our experiments we used the stochastic gradient descent (SGD) optimization algorithm to train our network, with a momentum of 0.9 and a batch size of 6. The learning rates and numbers of training epochs for each experiment were: 5 × 10⁻⁶ during 79 epochs for verb; 2.5 × 10⁻⁶ during 129 epochs for noun; and 1.75 × 10⁻⁶ during 5 epochs for action.
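For reference, a single SGD-with-momentum update using the verb experiment's hyperparameters looks like the following. This is the standard momentum formulation; the paper does not spell out which variant its framework used, so treat this as a sketch.

```python
def sgd_momentum_step(weights, grads, velocity, lr=5e-6, momentum=0.9):
    """One SGD-with-momentum update (standard formulation):
    the velocity accumulates a decaying sum of past gradients."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

w, v = [0.0, 0.0], [0.0, 0.0]
g = [1.0, -1.0]
w, v = sgd_momentum_step(w, g, v)  # first step: w = -lr * g elementwise
```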
Results and discussion The accuracy performance for all experiments is shown in Table 2, which also presents the random classification accuracy baseline computed from the dataset splits described above. We calculated the random classification accuracy over N categories as

acc = \sum_{i=1}^{N} p_i^{train} \cdot p_i^{test},  (1)

where p_i^{train} and p_i^{test} are the occurrence probabilities of class i in the train and test splits, respectively. For comparison, we also present the test results on the seen (S1) and unseen (S2) splits from the EPIC Kitchens challenge in Table 1. These results were obtained using the networks trained for the verb and noun experiments.
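Eq. (1) is a direct sum over classes and can be transcribed as follows (the function name is mine):

```python
from collections import Counter

def chance_accuracy(train_labels, test_labels):
    """Random-baseline accuracy of Eq. (1): the probability that a label
    drawn from the training class distribution matches the class of a
    random test example, summed over all classes."""
    n_train, n_test = len(train_labels), len(test_labels)
    p_train = Counter(train_labels)
    p_test = Counter(test_labels)
    return sum((p_train[c] / n_train) * (p_test[c] / n_test)
               for c in p_train)

# Two perfectly balanced classes give the familiar 50% chance level.
print(chance_accuracy(["a", "b"] * 50, ["a", "b"] * 10))  # → 0.5
```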
The overall results indicate a good performance using audio alone for verb classification. Additionally, the models fail to recognize categories that do not produce sound, such as flip (verb) and heat (noun), as seen in the confusion matrices in Figs. 3 and 4. In the case of verbs, the model also fails on conceptually close verbs, like the pairs turn-on/open and turn-off/close. In the case of noun classification, the model incorrectly predicts objects made of similar materials; for example, the categories can and pan are both metallic.
Figure 4: Normalized noun confusion matrix (noun classification accuracy: 13.41%).
We consider that an object may produce different sounds depending on how it is manipulated, and this may help to better discriminate the verb performed on an object that may be visually ambiguous from an egocentric perspective. For instance, knives and peelers are visually similar objects that could lead to action misclassification between the verbs cut and peel used on nouns like carrot, but their sounds were mostly correctly classified, as seen in the confusion matrices in Figs. 3 and 4.
4. Conclusion
We presented an audio-based model for egocentric action classification trained on the EPIC Kitchens dataset. We analyzed the results on the splits we made from the training set and compared them with the visual test baseline. The obtained results show that audio alone achieves a good performance on verb classification (34.26% accuracy). This suggests that audio could complement visual sources on the same task in a multimodal manner. Further work will focus directly on this research line.
References

[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, 2017.
[2] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
[3] Ruth Campbell. The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493):1001–1010, 2007.
[4] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-Kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
[5] Francesca Frassinetti, Nadia Bolognini, and Elisabetta Ladavas. Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental Brain Research, 147(3):332–343, 2002.
[6] David F. Harwath, Antonio Torralba, and James R. Glass. Unsupervised learning of spoken language with visual context. In NIPS, 2016.
[7] Henry Heffner and Rickye Heffner. Hearing ranges of laboratory animals. Journal of the American Association for Laboratory Animal Science (JAALAS), 46:20–22, 2007.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[10] David Li, Jason Tam, and Derek Toub. Auditory scene classification using machine learning techniques. AASP Challenge, 2013.
[11] Brian McFee, Colin A. Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. 2015.
[12] Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2405–2413, 2016.
[13] K. J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2015.
[14] Karol J. Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1015–1018, New York, NY, USA, 2015. ACM.
[15] A. Rakotomamonjy and G. Gasso. Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):142–153, 2015.
[16] Gerard Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for auditory scene classification. 2013.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Carlos Velasco, Russ Jones, Scott King, and Charles Spence. The sound of temperature: What information do pouring sounds convey concerning the temperature of a beverage? Journal of Sensory Studies, 28(5):335–345, 2013.
[19] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. A comprehensive survey of vision-based human action recognition methods. Sensors, 19(5), 2019.