Multimodal Emotion and Stress Recognition


Research Collection

Doctoral Thesis

Multimodal Emotion and Stress Recognition

Author(s): Kappeler-Setz, Cornelia

Publication Date: 2012

Permanent Link: https://doi.org/10.3929/ethz-a-007316923

Rights / License: In Copyright - Non-Commercial Use Permitted


Diss. ETH No. 20086

Multimodal Emotion and Stress Recognition

A dissertation submitted to

ETH ZURICH

for the degree of Doctor of Sciences

presented by

CORNELIA KAPPELER-SETZ

MSc EEIT ETH Zurich
born September 27, 1979

citizen of Dintikon AG and Schwyz SZ, Switzerland

accepted on the recommendation of

Prof. Dr. Gerhard Tröster, examiner
Prof. Dr. Ulrike Ehlert, co-examiner

2012


Cornelia Kappeler-Setz
Multimodal Emotion and Stress Recognition
Diss. ETH No. 20086

First edition 2012
Published by ETH Zurich, Switzerland

ISBN 978-3-909386-24-6

Printed by Lulu.com

© Cornelia Kappeler-Setz 2012 (if not stated otherwise)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the author. Copyright for parts of this publication may be transferred without notice.


Acknowledgments

First of all, I would like to thank my academic supervisor, Prof. Dr. Gerhard Tröster, for giving me the opportunity to write this dissertation at the Wearable Computing Lab. I thank him for his support and for providing me with an excellent research infrastructure at the Electronics Laboratory. I enjoyed the innovative spirit in his group.

I would also like to thank Prof. Dr. Ulrike Ehlert for co-examining and reviewing my PhD dissertation. She enabled a prolific collaboration between the Institute of Psychology of the University of Zurich and the Wearable Computing Lab, and facilitated the interdisciplinary research of my thesis. Here, I would also like to thank Dr. Roberto La Marca from the Institute of Psychology of the University of Zurich for sharing his psychological perspective, which was new to me, and for his manifold psychological advice. It was an interesting and inspiring collaboration.

A very special thank you is addressed to Dr. Bert Arnrich for his academic support, his advice in technical and personal matters, and for proofreading my work.

I enjoyed a very encouraging and pleasant atmosphere, which is due to my two office mates Johannes Schumm and Christina Strohrmann. I would like to thank them for their friendship and for many fruitful discussions.

I am grateful to the members of the Wearable Computing Lab, who made me feel welcome in the group. Thank you all for participating in my experiments! Special thanks for interesting and entertaining lunch breaks go to Corinne Mattmann, Patric Strasser, Peter Kaspar, and Yuriy Fedoryshyn.

I would also like to thank my semester and master students, Claudia Lorenz, Rafael Schönenberger, Basem Dokhan, and Philip Omlin, for their work, which helped me achieve some of the results of this thesis. Special thanks go to Sonja Stüdli and Silvio Unternährer for their support with the MIST experiment.

Furthermore, I would like to thank Ruth Zähringer, our secretary, for her help with all kinds of problems and for always having a sympathetic ear for me.

Finally, I would like to thank my husband Roman, as well as my old and new family. This work would not have been possible without their encouragement and support.

Zurich, May 2012 CORNELIA KAPPELER-SETZ


To my husband


Contents

Abstract

Zusammenfassung

1 Introduction
  1.1 Motivation for Emotion and Stress Recognition
  1.2 Research Contributions
  1.3 Thesis Outline

2 State-of-the-Art in Emotion and Stress Research
  2.1 Definitions
  2.2 Emotion Theories
    2.2.1 The Nascence of Emotions according to James/Lange, Cannon and Schachter/Singer
    2.2.2 The Specificity of Emotions
    2.2.3 Stress
    2.2.4 Stress in Relation to Emotion Theories
  2.3 Structure Models for Emotion Description
  2.4 Emotion and Stress Elicitation
  2.5 Automatic Emotion Recognition
  2.6 Automatic Stress Recognition

3 Signals, Features, and Classification Methods
  3.1 Signals and Features
    3.1.1 Electrocardiogram (ECG)
    3.1.2 Electrodermal Activity (EDA)
    3.1.3 Facial Electromyogram (EMG)
    3.1.4 Electrooculogram (EOG)
    3.1.5 Respiration
    3.1.6 Finger Temperature
    3.1.7 Acceleration and Movement
    3.1.8 Speech
  3.2 Classification and Clustering Algorithms
    3.2.1 Nearest Class Classifier (NCC)
    3.2.2 Discriminant Analysis
    3.2.3 Support Vector Classifiers and Support Vector Machines (SVM)
    3.2.4 Self Organizing Maps (SOM)
    3.2.5 Subtractive Clustering
  3.3 Methods for Handling Missing Feature Values

4 Emotion Recognition using a Standardized Experiment Setup
  4.1 Emotion Recognition from Speech using Self Organizing Maps
    4.1.1 Data Collection
    4.1.2 Evaluation Methods
    4.1.3 Results
    4.1.4 Conclusion
  4.2 Ensemble Classifier Systems for Handling Missing Data
    4.2.1 Experiment
    4.2.2 Evaluation Methods
    4.2.3 Methods for Handling Missing Feature Values
    4.2.4 Results
    4.2.5 Discussion
    4.2.6 Conclusion

5 Emotion Recognition Using a Naturalistic Experiment Setup
  5.1 Introduction and Motivation
  5.2 Stress Experiment
  5.3 Discriminating Stress from Cognitive Load Using EDA
    5.3.1 Evaluation Methods
    5.3.2 Results
    5.3.3 Discussion
    5.3.4 Conclusion
  5.4 Discriminating Stress from Cognitive Load Using Pressure Data
    5.4.1 Pressure Mat
    5.4.2 Evaluation Methods
    5.4.3 Results
    5.4.4 Conclusion
  5.5 Discriminating Stress from Cognitive Load Using Acceleration Sensors
    5.5.1 Acceleration Sensors
    5.5.2 Feature Calculation
    5.5.3 Classification Methods
    5.5.4 Results
    5.5.5 Conclusion
  5.6 Multimodal Classification
    5.6.1 Evaluation Methods
    5.6.2 Results
    5.6.3 Conclusion
  5.7 Generalization to a “Real-Life” Office Experiment
    5.7.1 Office Experiment
    5.7.2 Evaluation Methods
    5.7.3 Results
    5.7.4 Conclusion

6 Elicitation, Labeling, and Recognition of Emotions in Real-Life
  6.1 Introduction and Motivation
  6.2 Data Collection
  6.3 Questionnaire Analysis
    6.3.1 Which are the most prominent emotions elicited by soccer watching (1a)?
    6.3.2 Do the emotions experienced during the match depend on the outcome of the match (victory or defeat) (1b)?
    6.3.3 How do the emotions elicited by soccer watching compare to emotions elicited by standardized film stimuli (1c)?
  6.4 Automatic Label Generation
    6.4.1 Employed Ticker Data
    6.4.2 Arousal and Valence Labeling using Internet Ticker Categories
    6.4.3 Arousal Labeling using Internet Ticker Texts
    6.4.4 Label Summary
  6.5 Feature Calculation
    6.5.1 Feature Selection based on Statistical Analysis
  6.6 Arousal and Valence Recognition
    6.6.1 Single Modality Classification
    6.6.2 Multimodal Classification
    6.6.3 Stress Classification
  6.7 Discussion
    6.7.1 Elicited emotions
    6.7.2 Association between ticker categories and emotions
    6.7.3 Labeling
    6.7.4 Classification
    6.7.5 Suggestions for Future Data Collection
  6.8 Conclusion

7 Conclusion and Outlook
  7.1 Summary of Achievements
  7.2 Outlook: The Future of Emotion Recognition

A Additional Information on Feature Calculation
  A.1 Quantile Definition
  A.2 Parameters for Pitch Frequency Calculation
  A.3 Parameters for Intensity Calculation

B Additional Information on Classification
  B.1 Derivation of Discriminant Functions for LDA and QDA
  B.2 Support Vector Classifiers and Support Vector Machines
  B.3 Parameters for Subtractive Clustering

Glossary

Curriculum Vitae


Abstract

This work aims at automatically determining a person’s emotional state by means of several sensor modalities. Since emotion and stress recognition is most beneficial in unconstrained, everyday life, we gradually moved from a standardized laboratory experiment to a real-life setting.

Data loss due to artifacts is a frequent phenomenon in practical applications. However, such artifacts usually do not affect all the recorded signals at the same time. Discarding the entire feature vector (i.e. all signals), if only a single signal is corrupted, results in a substantial loss of data. This problem has rarely been addressed in previous work on emotion recognition from physiological signals. We therefore investigated two methods for handling missing feature values, in order to reduce artifact-induced data loss: imputation (missing data is replaced) and reduced-feature models (classification models are reduced such that only the data of valid signals is used). For the reduced-feature models approach, a separate classifier was trained for each signal modality. To obtain a classification result, the results of the single-modality classifiers were fused by majority or confidence voting.
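
As a minimal illustration of the reduced-feature approach (the function and the classifier interface below are assumptions for this sketch, not the implementation used in this thesis), per-modality decisions can be fused while simply skipping modalities whose signals are corrupted:

    from collections import Counter

    def fuse_by_majority(modality_classifiers, modality_features):
        """Majority-vote fusion over single-modality classifiers (illustrative).

        modality_classifiers: dict, modality name -> trained classifier with a
                              predict(features) method (assumed interface).
        modality_features:    dict, modality name -> feature vector, or None if
                              the signal was corrupted by artifacts.
        """
        votes = [clf.predict(modality_features[name])
                 for name, clf in modality_classifiers.items()
                 if modality_features.get(name) is not None]
        if not votes:
            raise ValueError("no artifact-free modality available")
        # the emotion label receiving the most votes wins
        return Counter(votes).most_common(1)[0][0]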

To test the methods for handling missing feature values, the five emotions amusement, anger, contentment, neutral, and sadness were elicited in 20 subjects by standardized films, while six physiological signals (ECG, EMG, EOG, EDA, respiration, and finger temperature) were recorded. Results showed that classifier fusion increases the recognition accuracy by up to 16.3% in comparison to a single classifier that uses the features of all signal modalities simultaneously. Moreover, 100% of the data could be analyzed, even though only 47% of the data was completely artifact-free.

Next, a more naturalistic emotion elicitation technique was chosen. Using a standardized but interactive laboratory protocol, which resembles a stressful situation of an office worker, mental and social stress was elicited in 33 subjects. Our goal was to distinguish stress from mild cognitive load using physiological and activity signals. The signals were first evaluated separately in order to find suitable features for distinguishing between stress and cognitive load. A separate classifier was then trained for each signal. Similar to the previous experiment, the different signal modalities were combined using classifier fusion.

Analysis of the EDA data showed that the distributions of the EDA peak height and the instantaneous peak rate carry information about the stress level of a person. Analysis of the acceleration data revealed that body language also contains information about stress. Specific stress-related acceleration features have been identified for the head, the right hand, and the feet.

Since different modalities convey different “kinds” of information about stress, they complement each other. A classifier that fuses six modalities (head, right hand, left leg, right leg, heart, EDA) by majority voting yielded an accuracy of 84.6%. This accuracy is 6% higher than the accuracy reached by the best single modality classifier.

Finally, the generalization of the chosen features and classifiers to a "real-world" stress situation was investigated in a small office experiment. Classifiers were trained on the data of the laboratory stress experiment described above and tested on the data of the office experiment. The results indicated that single-modality LDA classifiers using the chosen features exhibit good generalization capabilities. When combining the different modalities, 100% accuracy for distinguishing stress from cognitive load was achieved with majority voting.

Going one step further, we investigated emotions in everyday life. Twenty-two subjects watched a soccer game of their favorite team playing during the World Cup. Internet live-ticker data were used to label the recorded physiological and movement data. An unsupervised classifier was trained that used the ticker text to predict the corresponding ticker category (e.g. a goal). From the high accuracy (75% for three ticker categories), we concluded that the ticker category is related to the arousal expressed in the ticker text. Using questionnaires, we further investigated which emotions are elicited and whether specific emotions can be associated with the different ticker categories.

The emotions elicited by soccer watching cover each quadrant of the arousal-valence space, and their intensities are comparable to the intensities achieved with standardized film stimuli. Furthermore, emotions that are difficult to elicit with other, ethically uncritical, elicitation techniques are strongly elicited by soccer watching. As expected, positive emotions predominated for a victory of the favorite team, whereas negative emotions predominated for defeats. A recognition accuracy of 79.2% was achieved in discriminating events of high arousal from game minutes without special incidents, and 87.1% was achieved in discriminating “positive” from “negative” events.


Zusammenfassung

The goal of this work is to automatically recognize the emotional state of a person by means of sensors. For emotion and stress recognition, experiments with body-worn sensors were conducted, in which we gradually moved from a standardized laboratory experiment to a situation as close to everyday life as possible.

Signal artifacts (e.g. caused by movement) can render physiological data unusable. Such artifacts are particularly frequent in practical applications. However, they usually do not occur in all signals at the same time. Discarding the entire data record (i.e. all signals) whenever an artifact occurs in a single signal leads to a substantial loss of data. This problem has received little attention in previous work on emotion recognition from physiological signals. In this thesis, we therefore investigated two methods for avoiding the data loss caused by signal artifacts: “imputation” (missing data is replaced) and “reduced-feature models” (classification models are “reduced” such that they only use data from valid signals). For the reduced-feature models method, a separate classifier was trained for each signal modality. To obtain a final classification decision, the results of the individual classifiers were combined (either by majority voting or by using the classifiers’ confidence in their decisions).

To validate the methods for avoiding artifact-induced data loss, the four emotions amusement, anger, contentment, and sadness, as well as an emotionally neutral state, were elicited in an experiment with 20 participants using standardized film clips. Six physiological signals were recorded (ECG, EMG, EOG, electrodermal activity, respiration, and finger temperature). The emotion recognition rate achieved by combining several classifiers was up to 16.3% higher than that of a single classifier (which used the features of all signal modalities simultaneously). Although only 47% of the data records were completely artifact-free, the reduced-feature models still allowed all records to be used for the analysis.

In the next step, a more naturalistic technique for emotion elicitation was used. Using a standardized, interactive experiment protocol, 33 test subjects were put under mental and social stress. The stress situation was chosen to resemble that of an office worker. Our goal was to distinguish stress from mild cognitive load with the help of physiological and movement signals. The signals were first evaluated separately in order to find suitable signal features for distinguishing stress from cognitive load. A classifier was trained for each signal. Similar to the previous experiment, the classification results of the individual classifiers were combined.

The analysis of the electrodermal activity (EDA) showed that the height and the rate of the EDA peaks are relevant for distinguishing stress from cognitive load. The analysis of the movement signals revealed that body language also differs between stress and cognitive load. From this observation we were able to derive specific stress-related movement features for the head, the right hand, and the feet.

Different signal modalities complement each other because they contain different information about stress. By combining the classification results of six classifiers (one each for the modalities head, right hand, left leg, right leg, heart, and electrodermal activity) through majority voting, a classification accuracy of 84.6% was achieved. The combination of the six modalities yielded a 6% higher accuracy than a classification based on a single modality.

To test the generalization capability of the chosen features and classifiers in a real office environment, we conducted a small experiment in the office. Classifiers trained on the data of the laboratory stress experiment described above were applied to the data of the office experiment. The single-modality LDA classifiers with the chosen features showed good generalization capability. Combining several modalities even resulted in a classification accuracy of 100% for distinguishing stress from cognitive load.

In a further step, we investigated emotions in everyday life. For this purpose, 22 test subjects watched a televised soccer match of their favorite team during the 2010 FIFA World Cup. Internet live-ticker data were used to annotate the recorded physiological and movement data. A classifier was trained that could determine the corresponding event category (e.g. a goal) from the live-ticker text. From the high classification accuracy (75% for three event categories), we concluded that the event categories of the ticker entries are related to the emotional arousal expressed in the text. Using questionnaires, we then investigated which emotions are elicited by watching a soccer match and whether specific emotions can be associated with the event categories.

The emotions elicited by watching the soccer matches were spread over the entire plane spanned by the emotion dimensions “arousal” and “valence”. The intensity of the emotions elicited by soccer was comparable to the intensity of emotions elicited by standardized film clips. Moreover, we observed that emotions which can sometimes only be reliably elicited with ethically critical methods were easily evoked by watching a soccer match. As expected, positive emotions predominated when the favorite team won the match, and negative emotions when it lost. For the automatic discrimination of high-arousal events from game minutes without special incidents, a classification accuracy of 79.2% was achieved. The discrimination of positive from negative events resulted in a classification accuracy of 87.1%.


1 Introduction

This chapter gives a brief explanation and motivation for automatic emotion and stress recognition and highlights the research contributions of this thesis.

1.1 Motivation for Emotion and Stress Recognition

This work aims at automatically determining a person’s emotional state by means of several sensor modalities. Emotions and stress lead to bodily reactions and to changes in behavior. These reactions and behavioral changes can be captured by appropriate sensors. The sensor readings in turn can be used to infer the emotional state of the subject.

The principal approach of supervised emotion recognition is shown in Fig. 1.1. The subject is equipped with sensors (e.g. physiological sensors, acceleration sensors, microphone), which measure the reaction of the subject to an emotional stimulus. Specific features are calculated from each recorded sensor signal, e.g. the mean heart rate from the electrocardiogram (ECG). Because many different features and feature variants can be calculated, the number of features can grow quickly. To select the most expressive features, visual inspection or automatic feature selection is used. Based on the selected features, an emotion classifier determines the emotional state of the subject. To create such an emotion classifier, a labeled data set is needed that contains the selected features and the corresponding “true emotion” of the subject (i.e. the label). In the following, we will call this procedure classifier training.
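
A schematic Python sketch of this pipeline is given below (the feature function, the use of scikit-learn's univariate feature selection, and the linear discriminant classifier are assumptions chosen for illustration, not the exact tool chain of this thesis):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def mean_heart_rate(r_peak_times):
        """Example feature: mean heart rate (bpm) from the R-peak times of an ECG."""
        rr_intervals = np.diff(r_peak_times)      # seconds between successive beats
        return 60.0 / rr_intervals.mean()

    def train_emotion_classifier(feature_matrix, labels, n_features=10):
        """Automatic feature selection followed by classifier training."""
        k = min(n_features, feature_matrix.shape[1])
        selector = SelectKBest(f_classif, k=k)    # keep the most expressive features
        selected = selector.fit_transform(feature_matrix, labels)
        classifier = LinearDiscriminantAnalysis().fit(selected, labels)
        return selector, classifier               # both are needed at test time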

There are different possibilities to obtain the “true emotion” when recording a data set for emotion classifier training. It is either assumed to be given by the experiment setup (e.g. a sad film clip presented to the test subject), by questionnaires (e.g. by asking the subjects which emotions they experienced), or by external observation (e.g. by asking experts which emotion they hear or see).

Once an emotion classifier is trained, it can automatically recognize the emotion of a subject without knowing their true emotion, with a certain accuracy. To determine this accuracy, the classifier is fed with another data set of calculated features, for which the true emotion of the subject is known but not given to the classifier. By comparing the emotion recognized by the classifier with the true emotion, the accuracy can be calculated by dividing the number of correct decisions by the total number of decisions. This procedure is referred to as classifier testing.
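
Classifier testing then reduces to counting correct decisions; a minimal sketch with illustrative labels:

    def classification_accuracy(true_emotions, recognized_emotions):
        """Accuracy = number of correct decisions / total number of decisions."""
        correct = sum(t == r for t, r in zip(true_emotions, recognized_emotions))
        return correct / len(true_emotions)

    # e.g. classification_accuracy(["sad", "angry", "neutral"],
    #                              ["sad", "neutral", "neutral"]) == 2 / 3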

The principle of automatic recognition is not limited to emotional states. It works analogously for stress. Instead of an emotional stimulus (cf. Fig. 1.1), a stress-inducing stimulus is applied to the subject. A stress classifier is trained based on a data set of features that is labeled with the true stress level experienced by the subject.

Emotion and stress recognition (or “affective computing”, to use a broader term) are rather young disciplines that are gaining more and more interest. There are mainly two communities that could profit from emotion and stress recognition: the Computer Sciences and the Social and Psychological Sciences.

Social and Psychological Sciences  Examples where automatic emotion and stress recognition could be beneficial include:

• Affective disorders (e.g. major depression, bipolar disorder) are characterized by persistent or episodic exaggeration of mood states [6]. Emotion recognition could provide the therapist with more insight into the daily variations of the patient’s state and might help him to improve treatment.

• Chronic stress can suppress the immune function [141], increase infectious disease susceptibility [49], and may lead to depression [41]. If an electronic “Personal Health Trainer” kept track of stressful situations at work, it might help create a better “Work-Life-Balance” and thus be used for the prevention of stress, depression, and burn-out.

• Sociologists are interested in how people interact with each other and in the processes behind these interactions. Emotions play an important role in the development of relationships [101] and in group behavior [15]. Automatic emotion recognition could enrich long-term studies conducted “out-of-the-lab” [211].

Figure 1.1: Procedure of supervised emotion recognition. For training the emotion classifier, a data set of features recorded during emotional experiences is needed. This data set needs to be labeled with the true emotions of the subject.


Computer Sciences  Many applications for emotion recognition can be found in the field of Human-Computer Interaction (HCI). By including emotions, HCI shall become more natural, i.e. more similar to human-human interactions, where information is not only transmitted by the semantic content of words but also by emotional signaling, e.g. in prosody, facial expression, and gesture [51, 11]. Possible applications of emotion recognition in HCI include:

• Call center application: The computer recognizes if a user is frustrated or confused and connects him to a human operator [121].

• Emotional music player: The software selects the music according to the user’s emotional state / mood [45, 17, 96].

• Teleconference support: Emotional expressions in social interaction convey important information about the others’ emotions, beliefs, and intentions [100]. A part of this information is lost when the facial expression of the peer is not visible, e.g. during a teleconference. Cowie [51] therefore suggests displaying information about the participants’ emotional states on the screen.

• Computer-aided learning can be improved if the computer is informed when a user is bored or frustrated [97, 159].

• Media content tagging: Manual video editing is time-consuming and laborious. Analyzing the emotional content of videos might help to identify interesting sequences automatically [125, 202].

As outlined by the application examples above, emotion and stress recognition is most beneficial in unconstrained, everyday life. In this work, we will therefore gradually move from standardized, laboratory experiments to a real-life setting.

We will limit ourselves to investigating the emotional content in speech, physiological signals, and movement data. Video recordings will only serve for documentation purposes. On the one hand, emotion recognition from video-taped faces represents a research area of its own [76]. On the other hand, facial video recordings are difficult to obtain in mobile settings and are critical regarding privacy.

1.2 Research Contributions

Being a young field, emotion and stress recognition faces many different challenges. Our work contributes to the following:

• Data loss due to artifacts is a frequent problem in practical applications, especially when measuring physiological signals with unobtrusive sensors [176]. This work therefore presents a method for handling data loss caused by artifacts.

• Existing studies on stress recognition mostly use a form of “mental stress”. Aiming at an experiment that is close to a real-life office situation, we will use a combination of mental and psychosocial stress for our experiments. Furthermore, we will try to distinguish stress from cognitive load rather than distinguishing between stress and rest (doing nothing). Moreover, we will identify specific stress-related features for EDA and acceleration.


• Existing studies often focus on a single kind of sensor, e.g. on physiological signals. However, it is expected that using several sensor modalities will yield better recognition accuracies than concentrating on a single modality [103]. This work makes a contribution in this direction by combining physiological signals with behavioral indicators measured by acceleration.

• Previous work, e.g. [34, 110, 215, 124], has often investigated emotions and stress in standardized, laboratory experiments. This work gradually moves from standardized, laboratory experiments (Ch. 4) to a close-to-real-life setting (Ch. 6). We will show how soccer watching can be used to elicit naturalistic emotions and how the intensities of the elicited emotions compare to the intensities achieved with standardized stimuli. We will also highlight the difficulties and give practical advice for setting up real-life experiments.

• For real-life emotion experiments, the occurrence of specific events is not predictable. Therefore, the events need to be labeled. When aiming at recording a large amount of data, manual labeling of the data becomes time-consuming. For soccer matches, a large amount of live ticker data is available on the Internet. We will show how this data can be used for automatic labeling.

1.3 Thesis Outline

Chapter 2 gives an overview of emotion and stress theories, emotion elicitation and annotation techniques, and existing work on automatic stress and emotion recognition.

Chapter 3 presents the methods used in this work, including the measured signals, the feature calculation, the classification algorithms, and the methods used for handling missing feature values.

In Chapter 4, two studies using standardized emotional stimuli are presented. In the first study, XY-fused Kohonen Networks were employed to recognize discrete emotions based on speech features. In the second study, film clips elicited five discrete emotions, while six physiological signals were recorded. The benefits of employing methods for handling missing feature values are highlighted.

A naturalistic elicitation technique is presented in Chapter 5. Using a standardized but interactive laboratory protocol that resembles a stressful situation of an office worker, mental and social stress was elicited in 33 subjects. Our goal was to distinguish stress from mild cognitive load using physiological and activity signals. After evaluating the signals separately and determining suitable features, the signal modalities were combined using classifier fusion. The generalization of the chosen features and classifiers to a "real-world" stress situation is illustrated by a small office experiment.

Emotion recognition in an everyday life situation is investigated in Chapter 6. Twenty-two subjects watched a soccer game of their favorite team playing during the World Cup. Using questionnaires, we examined which emotions were elicited and whether specific emotions can be associated with ticker categories.

Chapter 7 provides conclusions and an outlook.


2 State-of-the-Art in Emotion and Stress Research

After introducing the terms emotion and stress, this chapter gives an overview of emotion and stress theories, emotion elicitation and annotation techniques, and existing work on automatic stress and emotion recognition.

2.1 Definitions

Definition of Emotion  To date, there is no consensual definition of the term emotion. Kleinginna and Kleinginna collected 92 definitions of emotion to suggest a consensual definition that includes the different aspects of emotion:

“Emotion is a complex set of interactions among subjective and objective factors, mediated by neural/hormonal systems, which can (a) give rise to affective experiences such as feelings of arousal, pleasure/displeasure; (b) generate cognitive processes such as emotionally relevant perceptual effects, appraisals, labeling processes; (c) activate widespread physiological adjustments to the arousing conditions; and (d) lead to behavior that is often, but not always, expressive, goal-directed, and adaptive.” ([106], p. 355)

Schmidt-Atzert further emphasizes that an emotion is a state (characterized by a beginning, a duration, an end, and a certain course over time), as opposed to a mood, which is of longer duration and lower intensity, shows less variation, and is not related to an explicit stimulus [173]. The term feeling is usually employed as a subcategory of emotion and refers to the subjective experience of an emotion.

Definition of Stress  Similar to emotion, there is no common definition of stress. We therefore give two exemplary definitions with different emphases:


“Stress may be defined as a real or interpreted threat to the physiological or psychological integrity of an individual that results in physiological and/or behavioral responses.” ([133], p. 508)

“Work-related stress is a pattern of reactions that occurs when workers are presented with work demands that are not matched to their knowledge, skills or abilities, and which challenge their ability to cope.” ([74], p. 2)

The term stress usually bears a negative connotation, without any further specification of the quality of the associated subjective experience.

2.2 Emotion Theories

The existing emotion theories can be categorized into four families:

Evolutionary Theories: Focus on the development of emotions during evolution and their universality across cultures, age groups, and species (e.g. Darwin [59], Ekman [70])

Behavioristic Theories: Focus on the question how certain stimuli generate emotional reactions and how such reactions can be learned (conditioned) (e.g. Watson [206])

Cognitive-physiological Theories: Focus on physiological changes induced by emotions (e.g. James [95], Schachter [167])

Cognitive Theories: Focus on the relationship between the appraisal of a certain situation (regarding personal implications) and the corresponding emotional experience (e.g. Lazarus [117, 118], Scherer [169])

In the following, we detail the latter two families because of their relevance to our aim of designing an automatic emotion classifier. The cognitive-physiological theories focus on effects of emotions that are measurable with physiological sensors and on the specificity of emotions, i.e. the question whether there is a characteristic physiological “pattern” associated with each emotion. The cognitive theories are introduced in Sec. 2.2.4 because they originate from stress research and are thus closely related to our work in Chapter 5.

2.2.1 The Nascence of Emotions according to James/Lange, Cannon and Schachter/Singer

Early in emotion research, James and Lange stated independently that an emotional stimulus induces bodily changes and that the feeling of these changes as they occur is the emotion [95]. Emotions are thus seen as a consequence of physiological changes and not as their cause. Furthermore, James supposed that different emotions are associated with distinct, unique bodily expressions.

Cannon criticized the James/Lange theory in several respects [40]. He argued that similar bodily changes can arise in different emotional and even unemotional situations and that bodily changes are thus not emotion-specific. Furthermore, he argued that physiological changes are not a necessary condition for experiencing emotions. This argument was based on the observation that dogs showed emotional behavior despite a severed connection between the internal organs and the central nervous system [185]. That physiological changes are not a necessary condition for experiencing emotions was later confirmed for humans: Cobos et al. found no reduction of emotional feelings in patients with spinal cord injuries [48]. Cannon therefore stated in his alternative theory that emotions originate from activity in the subcortical centers of the brain and that bodily changes and emotional experience follow almost simultaneously [40].

Schachter and Singer included parts of both views in their two-factor theory, which combines physiological arousal (i.e. the body) and cognition (i.e. the brain) [167]. They suggested that an emotional stimulus triggers unspecific physiological arousal. Depending on the cognitive appraisal of the situation, this perceived physiological arousal is then “mapped” to a specific emotion. An example: a person sees a bear (stimulus) and is startled (physiologically aroused). At the same time, cognitions (“Bears can kill people.” “The bear is not behind a fence.”) cause the person to appraise the situation as dangerous and to label his/her state of physiological arousal as ‘fear’. In particular, the theory states that “given a state of physiological arousal for which an individual has no immediate explanation, he will label this state and describe his feelings in terms of the cognitions available to him”, which implies that “the same state of physiological arousal could be labeled ‘joy’ or ‘fury’ or ‘jealousy’” ([167], p. 398).

In summary, we can say that both the body and the brain are involved in the nascence of emotion. The question whether bodily changes are emotion-specific is discussed next.

2.2.2 The Specificity of Emotions

In the context of emotion theories, one point of discussion concerns the specificity of emotions, i.e. whether there is a specific physiological “pattern” associated with each emotion. This is an important question for our work: without such patterns, computers will not be able to distinguish emotions based on physiological signals.

An overview of the existing work on emotion specificity is given by Cacioppo et al. in [37] and summarized in this section. Physiological changes include facial muscle movements, autonomic nervous system (ANS) activity, and brain activity.

2.2.2.1 Facial muscle movements

Cacioppo et al. [37] conclude that the muscle activity of the zygomaticus major (the “smiling muscle” on the cheek) and the orbicularis oculi (the ring-shaped muscle around the eye, used to close the eyelid) varies as a function of positivity, whereas the activity of the corrugator supercilii (a small muscle at the eyebrow, located towards the nose) varies as a function of negativity.


2.2.2.2 Brain activity

The left anterior region of the brain was found to be related to approach-related emotions (e.g. happiness, anger), whereas the right anterior region was related to avoidance-related emotions (e.g. sadness, fear) [37].

2.2.2.3 ANS Activity

Autonomic differences between emotion pairs  As early as 1983, Ekman et al. provided first evidence for emotion differentiation by ANS activity [143]. They investigated the emotions anger, fear, sadness, happiness, surprise, and disgust by measuring heart rate, left- and right-hand finger temperatures, skin resistance, and forearm flexor muscle tension. Two techniques were used for emotion induction: the directed facial action task, where subjects were instructed to contract a set of facial muscles that are specific for a certain emotion, and imagery, where subjects were asked to relive a past emotional experience.

Ekman found significant autonomic differences between groups of emotions, as depicted in Fig. 2.1. For the directed facial action task, three subgroups of emotions could be identified using heart rate and finger temperature: (1) happiness, surprise, and disgust (low heart rate), (2) fear and sadness (high heart rate and low finger temperature), and (3) anger (high heart rate and high finger temperature). However, this “pattern” was not confirmed for the reliving emotions task. When reliving emotions, significant differences between emotions were only found for skin resistance and not for heart rate or finger temperature. As depicted in Fig. 2.1b, sadness resulted in lower skin resistance (which means higher skin conductance) than the other three negative emotions.
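
The subgrouping found for the directed facial action task (Fig. 2.1a) can be written as a simple decision rule; the zero thresholds below are placeholders for the Θi of Fig. 2.1, since only relative changes, not absolute cut-off values, are reported:

    def ekman_dfa_subgroup(delta_heart_rate, delta_finger_temperature,
                           hr_threshold=0.0, ft_threshold=0.0):
        """Emotion subgroups of Ekman et al.'s directed facial action task.
        Inputs are changes relative to baseline; thresholds are illustrative."""
        if delta_heart_rate <= hr_threshold:
            return {"happiness", "surprise", "disgust"}   # low heart rate
        if delta_finger_temperature <= ft_threshold:
            return {"fear", "sadness"}      # high heart rate, low finger temperature
        return {"anger"}                    # high heart rate, high finger temperature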

Even though a large number of studies have been performed since 1983, the “results are far from definite regarding emotion-specific autonomic patterning” ([37], p. 180) and contradictions exist. To alleviate this problem, Cacioppo et al. performed meta-analyses of existing studies including 22 physiological variables [36, 37]. The meta-analyses resulted in a list of emotion pairs that significantly differed in one or more physiological variables. These emotion pairs, which had been investigated in at least two independent studies considered in [36, 37], are shown in Table 2.1.

Three of the physiological variables in Table 2.1 were sufficient to create a decision tree with pure leaf nodes, as depicted in Fig. 2.2. With appropriate thresholds, this decision tree could already serve as an emotion classifier. However, apart from the three physiological variables used for the decision tree, Table 2.1 lists eight additional physiological variables, which also showed significant differences between emotion pairs (especially between anger and fear). Therefore, these additional physiological variables can be used as supplementary variables to make classifiers more stable.

Autonomic differences between emotion groups  Besides the pairwise emotion comparisons, the specificity of positive and negative emotions was also investigated in [37]. Negative emotions (anger, fear, disgust, sadness) were generally associated with stronger ANS responses than positive emotions (happiness).


Figure 2.1: Decision tree visualizing significant autonomic differences between emotions found by Ekman et al. [143] for two emotion induction techniques: directed facial action (a) and reliving emotion tasks (b). HR: heart rate, FT: finger temperature, SCL: skin conductance level. Θi: thresholds. The ∆ in front of each variable indicates relative changes.

Figure 2.2: Decision tree derived from the significant autonomic differences of Table 2.1 [37]. The ∆ in front of each variable indicates relative changes. Results for surprise originate from [36]. HR: heart rate, DBP: diastolic blood pressure, SCL: skin conductance level, θi: thresholds with θ1 > θ2 > θ3 and θ4 > θ5.


Table 2.1: Significant autonomic differences between emotions reported in [37] and [36]. Results for surprise are not mentioned in the newer study [37] and were therefore taken from the text in [36].

Physiological variables are relative to baseline.

Heart
  Heart Rate: Anger > Disgust; Anger > Happiness; Anger > Surprise; Fear > Anger; Fear > Disgust; Fear > Happiness; Fear > Sadness; Happiness > Surprise; Happiness > Disgust; Sadness > Surprise; Sadness > Disgust
  Cardiac Output: Fear > Anger
  Stroke Volume: Fear > Anger

Blood vessels (vascular measures)
  Finger Pulse Volume: Anger > Fear; Happiness > Fear
  Total Peripheral Resistance: Anger > Fear
  Face Temperature: Anger > Fear

Blood pressure
  Diastolic Blood Pressure (DBP): Anger > Fear; Anger > Happiness; Anger > Sadness; Sadness > Happiness
  Systolic Blood Pressure (SBP): Sadness > Happiness

Electrodermal Activity
  Skin Conductance Level (SCL): Disgust > Happiness; Disgust > Surprise; Fear > Surprise; Sadness > Fear
  Number of Nonspecific Skin Conductance Responses: Anger > Fear; Fear > Sadness

Respiration
  Respiration Rate: Sadness > Fear


Conclusion for emotion recognition  Even though the presented decision tree in Fig. 2.1 and the results in Table 2.1 suggest emotion-specific autonomic patterning, newer studies and meta-analyses have reported partly contradicting evidence, as discussed by Larsen et al. in [115]. Still, based on the review of the existing literature, we were confident that automatic pattern recognition techniques could be used to distinguish between emotions based on physiological signals, especially when considering several signals simultaneously.

2.2.3 Stress

If a human organism is in danger or injured, various physiological changes occur, which are denoted by the term stress reaction [150]. The stress reaction was first described by Hans Selye [179, 180]. He observed three stages of adaptation of the organism: the alarm reaction, the stage of resistance, and the stage of exhaustion. During the alarm reaction, physiological changes occur, including e.g. hypoglycemia, incipient development of gastric ulcers, and increased secretion of adrenocortical hormones (e.g. cortisol). During the stage of resistance, the physiological changes of the alarm reaction disappear or are reversed, which indicates the adaptation of the organism to the new conditions. However, if the stressor continues to affect the organism, resistance is lost and the symptoms of the alarm reaction reappear due to exhaustion.

Regarding physiological changes, two components are involved in a typical stress reaction [38]: the hypothalamic-pituitary-adrenal (HPA) axis and the sympathetic nervous system, which forms part of the autonomic nervous system (together with the parasympathetic nervous system). Besides the invasive assessment by blood sampling, the stress reaction of the sympathetic nervous system can be investigated non-invasively by measuring autonomic parameters, e.g. the heart rate, the blood pressure, or the electrodermal activity.

McEwen describes the long-term effects of the physiological responses to stress in [134]. Normally, the stress reaction is initiated by a stressor, sustained for an appropriate amount of time, and then turned off when the stress-eliciting situation is over. Inactivation turns the physiological variables back to baseline values. However, inefficient inactivation can lead to allostatic load, “which is the wear and tear that results from chronic overactivity or underactivity of allostatic systems” ([134], p. 171). McEwen describes four conditions that lead to allostatic load:

1. Frequent stress, i.e. repeated “hits” from multiple stressors

2. Lack of adaptation to repeated stress induced by the same type of stressor (e.g. lack of habituation to public speaking)

3. Prolonged response due to inability to shut off the stress response after the stress has terminated

4. Inadequate (i.e. too weak) response that leads to compensatory hyperactivity of another physiological system. For example, if the cortisol response is too low, the secretion of inflammatory cytokines increases.

The magnitude of allostatic load depends on many factors, including the type of stressor, genetic predispositions, previous experiences, and individual coping capabilities [67]. The accumulation of allostatic load can eventually lead to disease [134, 67], e.g. hypertension, myocardial infarction, vital exhaustion, or burn-out.

2.2.4 Stress in Relation to Emotion Theories

Lazarus’ emotion theory [118] is based on his previous research on stress [117]. He stated that the “stress emotions” (anger, fear, resignation) originate from appraisals of the current situation (relevance, personal harm or benefit) and appraisals of coping¹ options and prospects [117, 118]. Depending on the action tendency resulting from the chosen coping strategy, specific “psychophysiological response patterns” arise to prepare the organism for the action.

Along a similar line, Scherer has developed a definition and a rather complex component model of emotion [169]. One aspect of emotion is the so-called response synchronization. Emotions prepare certain responses to an eliciting event, based on the appraisal of the implications that this event is supposed to have for the subject. Therefore, “all or most of the subsystems of the organism must contribute to response preparation. The resulting massive mobilization of resources must be coordinated, a process which can be described as response synchronization.” ([169], p. 701). Scherer states that response synchronization can in principle be measured empirically, by gathering data from as many organismic subsystems as possible, including ANS data, facial and vocal expression, gesture, and questionnaire data. Therefore, to capture the response synchronization, we chose to record several physiological signals including facial expression in Chapter 4 and physiological signals combined with movement in Chapter 5.

¹ Lazarus and Folkman defined coping as “constantly changing cognitive and behavioral efforts to manage specific external and/or internal demands that are appraised as taxing or exceeding the resources of the person” ([116], p. 141).

2.3 Structure Models for Emotion Description

Since natural language has many words to describe emotions, standardized techniques for emotion description and annotation are necessary to conduct repeatable experiments. The existing concepts for emotion description can be categorized into discrete and continuous models. In discrete models, the emotions are represented by discrete states. The following six emotions are often considered: anger, disgust, fear, happiness, sadness, and surprise. Ekman and Friesen showed that facial expressions of these emotions are universal, i.e. they are identified correctly across many different cultures [71]. Ekman later proposed a larger set of 15 “basic” emotions [70].

To model the large variety of naturally occurring emotions, continuous models have emerged, in which emotions are described on a varying number of continuous, orthogonal axes [172, 163, 52]:

• Activation / Arousal: Disposition to act, calm vs. excited

• Valence: Positive versus negative, pleasant versus unpleasant

• Dominance: Dominant versus submissive

The axes used most often are activation/arousal and valence. Russell’s circumplex model of affect combines the discrete with the continuous approach by placing words describing discrete emotions in a plane spanned by the valence and arousal axes [162].
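
Such a continuous description can be represented by a simple data structure; the scaling to [-1, 1] and the example coordinates below are illustrative choices in the spirit of Russell's circumplex, not values taken from the cited literature:

    from dataclasses import dataclass

    @dataclass
    class EmotionAnnotation:
        valence: float          # -1 = very unpleasant, +1 = very pleasant
        arousal: float          # -1 = calm,            +1 = excited
        dominance: float = 0.0  # optional third axis: submissive vs. dominant

    # Illustrative placements of two discrete emotions in the arousal-valence plane
    contentment = EmotionAnnotation(valence=0.7, arousal=-0.5)
    anger = EmotionAnnotation(valence=-0.7, arousal=0.8)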

In this work, we employed questionnaires to assess emotion dimensions (arousal and valence) as well as discrete emotions.

2.4 Emotion and Stress Elicitation

When conducting experiments, a suitable method for emotion (or stress) elicitation is needed. In terms of the investigated emotions, pure (strong) emotions are usually considered, which are often played by actors [73, 32, 13]. In this way, it is possible to obtain a relatively large amount of unambiguously labeled data within a rather short period of time, and a large range of emotions can be covered [12]. However, actors may tend to exaggerate the signs of emotion, and it is thus questionable whether the results obtained from these studies can be generalized to weaker, more subtle emotional states [52]. Furthermore, emotions in everyday life are often mixed, not pure.

Apart from acting, the following methods for eliciting emotions, moods, and stress are known from psychology: pictures, e.g. the International Affective Picture System (IAPS) [28], movies [161], music [68], directed facial action tasks [72], social psychological methods (involving cover stories) [84, 60, 105], and social interactions [158]. The advantages and disadvantages of these methods are listed in Table 2.2.

In computer science, emotional states have been elicited with Wizard-of-Oz systems, e.g. a driving simulation with a speech-driven navigation system [174], or with games [34]. In such settings, naturalistic emotions can be elicited while the range of emotional states still remains constrained.

In this work, we gradually move from standardized experiments to a real-life setting by using the following emotion elicitation techniques: acting, standardized film clips, a naturalistic stress task involving a deceptive cover story, and real-life soccer watching.

2.5 Automatic Emotion Recognition

Signal modalities that have been used to automatically detect emotions include facial expression [144, 114, 33], speech [35, 57, 4], movement [42, 20, 61], and physiological signals. Since the goal of this work was to exploit physiological signals in combination with other sensors, this section is structured as follows: First, studies that recognize emotions exclusively from physiological signals are presented. Second, studies that recognize emotions from a combination of physiological and other sensors are summarized.


Table 2.2: Comparison of emotion elicitation techniques [161, 173]

Pictures
Standardized stimuli: International Affective Picture System (IAPS) [28].
Advantages: Wide range of emotions; simple, highly standardized experiment setup allows for replication.
Disadvantages: Short duration of emotions (~6 s). Some emotions (e.g. fear, anger) are difficult to induce because subjects are not personally threatened.

Movies
Standardized stimuli: Recommended Film Clips for Eliciting Discrete Emotions [161]: Amusement, anger, disgust, fear, plain neutral, pleasant neutral, sadness, surprise.
Advantages: Wide range of emotions; emotions develop over time (~1-10 min); simple, highly standardized experiment setup allows for replication; possibility to elicit cognitively sophisticated emotional states such as nostalgia.
Disadvantages: Some emotions (fear, anger) are difficult to induce because subjects are not personally threatened. “Willing suspense of disbelief” is needed to experience emotion [161].

Music
Standardized stimuli: Recommended classical music pieces for inducing happy and sad moods [68].
Advantages: Emotions develop over time (15-20 min); simple, highly standardized experiment setup allows for replication.
Disadvantages: Music taste might influence experienced emotions. Moods (positive or negative) rather than discrete emotions.

Social psychological methods
Standardized stimuli: Procedures and cover stories for anger, joy/sadness, sympathy in [84]. Montreal Imaging Stress Task [60] and the Trier Social Stress Test [105] for stress induction.
Advantages: Masking of experiment purpose prevents responses that are due to demand characteristics and thus fosters realistic responses [84]. Emotions occur in social context [84]. Effective technique for anger [84].
Disadvantages: Time-consuming and complex experiment setup (development of convincing cover story, training of experimenters, thorough debriefing) [84]. Might be ethically critical [160].

Social interaction
Standardized stimuli: Recommended experiment procedure to invoke a conflict discussion in two interacting subjects (e.g. couples) [158].
Advantages: Since the emotions occur in social context (i.e. in an ongoing emotional relationship), the elicited emotions closely resemble the emotions that occur in everyday life [158]. Temporal characteristics of emotions can be investigated [158].
Disadvantages: Time-consuming and complex experiment setup; limited standardization (experimenter cannot totally control interactions, participants may change topic); recorded emotions have to be labeled using video excerpts [158].

Directed Facial Action (DFA) Task
Standardized stimuli: Directions to contract facial muscles to express emotions [72].
Advantages: Wide range of emotions, highly standardized experiment setup.
Disadvantages: Low ecological validity (does not resemble an everyday life situation).


Emotion Recognition Studies Employing Physiological Signals In [154], a single actor tried to elicit and experience each of eight emotional states, daily, over multiple weeks. The emotions anger, hate, grief, platonic love, romantic love, joy, reverence, and neutral could be recognized with 81% accuracy by using four physiological signals: The electromyogram (EMG) along the masseter, blood volume pressure, electrodermal activity (EDA) and respiration.

Lisetti and Nasoz [34] performed a study with 29 participants. Emotional film clips in combination with difficult mathematical questions were used to induce sadness, anger, surprise, fear, amusement and frustration. The physiological signals included EDA, heart rate and skin temperature. Among three classifiers, a neural network yielded the best recognition accuracy of 84.1%.

Kreibig et al. [164] employed film clips to elicit the three emotions fear, neutral and sadness. An electrocardiogram (ECG), the EDA and the respiration were recorded from 34 subjects. An LDA classifier reached a recognition accuracy of 69%.

The four emotional states joy, pleasure, sadness and anger were investigated in [93]. A single subject was considered. The emotional states were induced by music selected by the test subject himself. The recorded signals included ECG, EDA, EMG measured at the neck, and respiration. Various classifiers and dimension reduction techniques were compared. Linear discriminant function classification with sequential forward selection resulted in a recognition accuracy of 92% for the four emotions. Classifications of emotion groups resulted in 88.6% for positive/negative and in 96.6% for low/high arousal. In a subsequent work, the authors of [93] extended their analysis to three subjects [104]. The emotion induction technique and the measured signals remained the same as in [93]. An accuracy of 95% was reached for subject-dependent and 70% for subject-independent classification.

For more studies, the reader is referred to a list of collected results presented in [34].

Emotion Recognition Studies Combining Physiological Signals with Other Sensors A combination of speech with physiological signals was investigated in a “Who wants to be a millionaire?” game in [103]. Four emotions, each lying in one of the quadrants of the arousal-valence space, were elicited. Using blood volume pulse (BVP), EMG, EDA, respiration, temperature, and speech of 3 subjects, accuracies of 92%, 75% and 69% were achieved for subject-dependent and 55% for subject-independent LDA classification.

A multimodal approach in a seating environment was applied in [3]. A combination of pressure sensors attached to the seat, EDA, and features extracted from a camera system were used to predict frustration during a learning task for 24 subjects. The achieved recognition accuracy was 79%.

Research Directions Most of the above-mentioned studies focused on a single kind of sensor, e.g. on physiological signals. However, it is expected that using several sensor modalities will yield better recognition accuracies than concentrating on a single modality. This work makes a contribution in this direction by combining physiological signals with behavioral indicators measured by acceleration.


Previous work on emotion recognition from physiology has rarely addressed the problem of missing feature values. However, data loss due to artifacts is frequently encountered in practical applications, and missing feature values thus represent a serious problem. It is current practice to discard the corresponding episodes when encountering artifacts [34] or to use imputation [98]. In our work, we will compare mean value imputation with a novel reduced-feature model technique.

2.6 Automatic Stress Recognition

In [87], ECG, EMG of the trapezius (shoulder), EDA and respiration were recorded during a real-life driving task. The three levels of driving stress (rest, highway, and city driving) could be classified with an accuracy of over 97% across multiple drivers and driving days.

In [215], the authors used an interactive “Paced Stroop Test” where subjects had to select the font color of a word shown on the screen. The word itself named a color. The authors tried to distinguish the data segments with matching meaning and font color (non-stress) from the segments with mismatching meaning and font color (stress). With features from EDA, BVP, pupil diameter and skin temperature, recorded from 32 subjects, a Support Vector Machine (SVM) reached a recognition accuracy of 90.1% with 20-fold cross-validation. Without the pupil diameter feature, the accuracy dropped to 61.45%.

Facial EMG, respiration, EDA and ECG were used in [99] for detecting five emotional states, i.e. high stress, low stress, disappointment, euphoria and neutral. Data was recorded from simulated car races. Data of a first race was used for training and data of a second race for testing. An SVM using radial basis functions reached an accuracy of 86% for a single driver.

A different approach was chosen in [124]: instead of classifying discrete states, the authors tried to estimate a continuous variable (the stress level). Mental arithmetic and an alphabetic task were used to induce a stress response. Facial, physiological (heart rate, skin temperature, EDA), behavioral and performance data (e.g. error rate) were used in a Dynamic Bayesian Network in order to estimate the stress level of five subjects. The workload was taken as ground truth stress level. The correlation between the ground truth and the estimated stress level was high (between 0.79 and 0.92 for the 5 subjects). How much the performance data contributed to the high correlation is not reported.

Research Directions The above-mentioned studies mostly use a form of “mental stress”. Aiming at an experiment that is close to a real-life office situation, we used a combination of mental and psychosocial stress for automatic stress detection in Chapter 5 of this work. Furthermore, we tried to discriminate stress from cognitive load rather than distinguishing between rest (doing nothing) and stress.


3

Signals, Features, and Classification Methods

In this chapter, the methods used for the studies presented in the subsequent chapters are introduced. The methods include the measured signals, the calculation of features, classification algorithms, and methods for handling missing data.

3.1 Signals and Features

In this section, we will introduce the signals and features we used to detect emotions and stress. Some standard features (such as mean, standard deviation, minimum, maximum, and range) were calculated from most of the recorded sensor signals, whereas some special features were only calculated from a single signal. This section only describes the signal-specific features in detail. If not mentioned otherwise, Matlab software [197] was used for feature calculation.

3.1.1 Electrocardiogram (ECG)

The electrocardiogram (ECG) measures the electrical activity of the heart muscle. Each heart beat originates in the sinoatrial node, which generates an electrical impulse, i.e. an action potential. This impulse propagates through the heart muscle and causes the contraction of the heart. The accumulation of action potentials traveling along the heart muscle fibers generates electrical potential fluctuations, which can be measured by electrodes placed on the skin. The resulting signal is a recurring pattern showing the heart activity, as schematically depicted in Figure 3.1.

The characteristic segments and peaks of the ECG pattern are named by the letters P to T. The R-peak represents the most prominent attribute of the ECG and its time stamp can be precisely determined.


Figure 3.1: Schematic diagram of an ECG signal with peak and segment names [9].

Thus, the heart rate can be calculated by computing the frequency of the R-peaks. The intervals between the R-peaks are named RR-intervals (see Fig. 3.2a).

3.1.1.1 R-Peak Detection

To detect the R-peaks in the ECG signal, an algorithm based on that of Pan, Hamilton and Tompkins was used [146, 83]. Corresponding C and Matlab software is made available under the GNU public license by G.D. Clifford at [46]. Clifford also provides a method to refine the detected R-peaks [2]: By comparing each calculated RR-interval with the RR-intervals in its vicinity, implausible intervals are detected and removed.

3.1.1.2 Heart Rate (HR) Calculation

For each RR-interval RRi, a heart rate value in beats per minute (bpm) was calculated according to Eq. 3.1:

hr_i = 60 \cdot \frac{1}{RR_i} \qquad (3.1)

The heart rate values hri were assembled to a heart rate vector hr = [hr1, hr2, . . . , hrn]. The heart rate vector was then used to calculate standard statistical features (e.g. mean, maximum) for specific time windows (e.g. for 1 minute in Sec. 6.5).
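As an illustration, the heart rate vector of Eq. 3.1 and a few of the standard statistical features can be computed with a minimal Matlab sketch, assuming rr is a vector of RR-intervals in seconds for one analysis window; the variable names are illustrative and not taken from the thesis code.

    % Heart rate features from RR-intervals (Eq. 3.1); rr is given in seconds.
    hr = 60 ./ rr;                                        % instantaneous heart rate in bpm
    hr_features = [mean(hr), std(hr), min(hr), max(hr), max(hr) - min(hr)];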


Figure 3.2: (a) Example of a sequence of RR-intervals. (b) Normalized spectrum of the RR-intervals, calculated by the Lomb-Scargle Periodogram [126, 166]. The HRV parameter “lf” is determined by calculating the power in the low frequency range (0.04-0.15 Hz), whereas “hf” is determined by calculating the power in the high frequency range (0.15-0.4 Hz).


3.1.1.3 Heart Rate Variability (HRV)

Even though each heart contraction is generated at the heart itself, the heart rate is also modulated by the two branches of the autonomous nervous system. Whereas parasympathetic innervation (via the nervus vagus) decelerates the heart rate, sympathetic innervation accelerates it. These influences of the autonomous nervous system lead to oscillations in the interval between consecutive heartbeats, a phenomenon called heart rate variability (HRV) [193].

The two branches of the autonomous nervous system exhibit different latencies in affecting the heart rate (2-3 s for the sympathetic, 1 s for the parasympathetic nervous system). When analyzing heart rate variability, the influence of the sympathetic and parasympathetic nervous system can therefore be separated by means of frequency analyses.

A decrease in HRV has been observed for acute stress in [130]. Bleil et al. [21] could relate negative affect (including trait depression, anxiety and anger) to the inverse of high-frequency HRV, even when considering only healthy subjects (i.e. subjects not suffering from depressive or anxiety disorders). Along the same lines, a recent study showed an increase in HRV with increasing social connectedness and positive emotions [109].

3.1.1.4 HRV Feature Calculation

Several measures for assessing HRV in the time domain and frequency domain are recommended in [193]. Table 3.1 contains the measures used as HRV features in this work.

All the frequency-domain features were calculated from the power spectral density of the RR-intervals. To calculate the power spectral density of the RR-intervals, the Lomb-Scargle Periodogram [126, 166] was employed, see Fig. 3.2b. The Lomb-Scargle Periodogram is a method for spectral estimation from unevenly sampled data and has been used for spectral analysis of HRV in [47].
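As an illustration, the time-domain measures of Table 3.1 can be computed directly from the RR-intervals with the minimal Matlab sketch below, assuming rr is a vector of RR-intervals in milliseconds; the frequency-domain features (lf, hf, lf/hf) additionally require a spectral estimate of the RR-series such as the Lomb-Scargle Periodogram. The variable names are illustrative.

    % Time-domain HRV features of Table 3.1; rr contains RR-intervals in ms.
    drr   = diff(rr);                            % differences between adjacent RR intervals
    sdnn  = std(rr);                             % standard deviation of all RR intervals
    rmssd = sqrt(mean(drr.^2));                  % RMS of successive differences
    pnn50 = sum(abs(drr) > 50) / numel(rr);      % pairs differing by >50 ms / total intervals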

3.1.2 Electrodermal Activity (EDA)

The human skin is responsible for protecting the body, for sensing, and for regulating the water and the thermal balance [168]. The skin is composed of three layers including the epidermis, the dermis and the subcutis (see Fig. 3.3). The sweat glands are composed of a secretory part and an excretory duct which brings the sweat to the skin surface. Due to the NaCl ions contained in the sweat, the conductivity of the skin increases when sweating.

The sweat glands (as well as the skin blood vessels) are exclusively innervated by the sympathetic nervous system [25]. The number of active sweat glands thus increases with sympathetic activation. This makes the skin conductance - which is related to the number of active sweat glands [58] - an ideal measure for sympathetic activation and therefore for the stress reaction (see also Sec. 2.2.3). In contrast, other physiological measures - e.g. the heart rate, see Sec. 3.1.1 - are influenced by both the sympathetic and the parasympathetic nervous system.


Table 3.1: HRV features used in this work

sdnn (time): standard deviation of all RR intervals
rmssd (time): root mean square (RMS) of differences between adjacent RR intervals
pnn50 (time): number of pairs of adjacent RR intervals differing by more than 50 ms divided by the total number of intervals
triangular index (time): total number of all RR intervals divided by the height of the histogram of all intervals, measured on a discrete scale with bins of 7.8125 ms = 1/128 s
lf (frequency): power in the low frequency range (0.04-0.15 Hz)
hf (frequency): power in the high frequency range (0.15-0.4 Hz)
lf/hf (frequency): power in the low frequency range divided by power in the high frequency range

Figure 3.3: Cross section of the skin with skin layers, sweat glands and hairs [129]


Figure 3.4: (a) EDA signal example, smoothed using a sliding-window mean filter with a window size of 1.2 s. The EDA signal consists of a slowly changing part (Skin Conductance Level: SCL), which is overlaid by short, fast conductance changes, called Skin Conductance Responses (SCRs). (b) High-pass filtered EDA signal (phase) with threshold and detected peaks. The phase signal was smoothed using a sliding-window mean filter with a window size of 1.2 s. The adaptive threshold is proportional to the RMS values of the EDA phase signal, calculated by a 4.7 s wide sliding window.


The EDA is usually measured at the palmar sites of the hands or the feet, where the density of sweat glands is highest (> 2000/cm²). The measured skin conductance signal consists of a slowly changing part which is overlaid by short, fast conductance changes (i.e. the phasic part), as depicted in Fig. 3.4a.

Different parameters can be extracted from the two parts:

Skin Conductance Level (SCL): The skin conductance level (SCL) denotes the slowly changing part of the EDA signal. It is a measure for general psychophysiological activation [38] and can vary substantially between individuals [200]: Average values lie between 3 and 15 µS, but values below 2 µS and over 20 µS are not uncommon [200].

Phasic parameters: Depending on the causality of the short-term conductance changes (also denoted as “peaks” in the following), two different types are distinguished:

a) Skin Conductance Response (SCR): If the peak occurs in reaction to a stimulus (e.g. a startle event) it is called (specific) skin conductance response. It appears between 1.5 and 6.5 seconds after the stimulus. Features used to describe the characteristics of an SCR include the amplitude of the SCR, the latency (between stimulus and SCR onset) and the recovery time, see Fig. 3.5.

b) Non-Specific Skin Conductance Response (NS.SCR): NS.SCRs exhibit the same or a very similar shape as specific SCRs, but occur “spontaneously” without any external stimulus. However, there is evidence that NS.SCRs are related to “internal stimuli”, i.e. cognitive processes [139, 55]. The test subjects in [139] rated their current concerns, negative emotions, subjective arousal, and inner speech to be more intense at times when NS.SCRs occurred, compared to electrodermal non-responding periods. The frequency and the mean amplitude of NS.SCRs are therefore considered as measures for psychophysiological activation [25]. Similar to the SCL, the NS.SCRs are subject to inter-individual variation [53].

Physiological processes involved in EDA signal evolution The typical shape of the EDA signal is due to physiological processes. The following description of these processes is based on [26].

We first explain the mechanisms involved in the evolution of the SCL. The outer layer of the epidermis (i.e. the stratum corneum in Fig. 3.3) shows a relatively high resistance, while the resistance of the inner layers of the epidermis, as well as the resistance of the dermis and the subcutis, are low. Between the stratum corneum and the other parts of the skin resides a layer which is nearly impermeable for water and thus also for ions. It therefore acts as an electric barrier.

If a sweat gland is activated, the sweat rises in the duct until it reaches the epidermis. Due to the pressure in the duct and diffusion processes, the sweat penetrates into the epidermis. The stratum corneum absorbs the sweat like a sponge, until - when the sweating is strong enough - the surface of the skin becomes wet.


Figure 3.5: Ideal Skin Conductance Reaction (SCR) with typically computed features [182]. © 2010 IEEE

Due to the NaCl ions contained in the sweat, the current can now flow through the stratum corneum. The conductivity of the skin thus rises with the amount of sweat absorbed in the stratum corneum [39].

In addition to the amount of sweat absorbed in the stratum corneum, the sweat ducts themselves and their filling levels also influence the measured skin conductance [168]: When the sweat rises in the duct and reaches the skin surface, an electrical connection between the skin surface and the dermis is established, which generates an additional path for the current to flow from one electrode to the other. The conductivity of the skin therefore also increases with the number of filled sweat ducts.

The physiological mechanisms described so far explain the slowly changing part of the EDA signal, i.e. the SCL. An explanation of the SCRs is given in [66] (cited after [26]): The upper part of those sweat ducts which are not completely filled with sweat is collapsed. If the pressure in the ducts rises, the exits of the ducts are suddenly opened and the sweat reaches the skin surface, thereby generating a fast increase in skin conductance, i.e. an SCR or an NS.SCR. If the sweat secretion ceases, the sweat ducts collapse again and the skin conductance is reduced slowly as the sweat is absorbed by the stratum corneum.

Psychological processes involved in EDA In an extensive summary of early EDA research in relation to stress [25], Boucsein shows that the SCL and the NS.SCRs are sensitive and valid indicators for the course of a stress reaction, whereas other physiological measures (e.g. the heart rate) do not show equal sensitivity. The Lazarus Group showed that the SCL (and the heart rate) increased significantly during the presentation of cruel or disgusting films [117, 120]. Nomikos showed in [140] that even the expectation of an aversive event (an electrical stimulus in this case) could elicit a similar reaction in SCL as the event itself. Several authors [24, 112] investigated how involuntary interruptions in the work flow due to prolonged computer system response times influenced the EDA. An increase of NS.SCRs for long system response times could be demonstrated. Others also showed an increase in skin conductivity during mental stress [94].


3.1.2.1 EDA Measuring Device

For all our experiments, the Emotion Board was used to measure EDA. It was developed at the Wearable Computing Lab. at ETH Zurich and is shown in Fig. 3.6.

The employed measurement principle is referred to as the exosomatic quasi-constant voltage method [153]. Hereby, a constant voltage (500 mV) is applied to one electrode, leading to a current flowing through the skin to the other electrode, see Fig. 3.7. Measuring the voltage at the reference resistor allows us to directly determine the skin conductance. To eliminate high-frequency noise, a 2nd order low-pass filter with a cut-off frequency of fc = 5 Hz is applied before A/D conversion of the measured signal (referred to as level in the following). An example of a recorded EDA level signal is shown in Fig. 3.4a.

Applying an additional high-pass filter (2nd order, fc = 0.05 Hz) yields the phasic part of the EDA signal (referred to as phase in the following). For further noise reduction, this signal is once more low-pass filtered (2nd order, fc = 5 Hz), amplified and fed to the A/D converter. An EDA phase signal example is shown in Fig. 3.4b. It was recorded simultaneously with the EDA level signal shown in Fig. 3.4a.

A Bluetooth wireless link is used to transfer the EDA data. The finger straps of the Emotion Board incorporate snap fasteners for attaching commercially available electrodes. The electrodes are attached to the middle phalanges of the left index and middle finger, see Fig. 3.6.

Refer to [177] for more information on the Emotion Board.

3.1.2.2 Peak Detection in Phase Signal

To identify SCRs and NS.SCRs, the peaks in the EDA phase signal needed to be detected, see Fig. 3.4b. This was achieved by an adaptive threshold proportional to the RMS of the EDA phase signal. The RMS was calculated by a 4.7 s long sliding window. From the detected peaks, the heights of the peaks and the time intervals between consecutive peaks were calculated. In the following, we will refer to the reciprocal values of the time intervals by the term instantaneous peak rate.
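The following Matlab sketch illustrates this kind of peak detection, assuming phase is the high-pass filtered EDA signal sampled at fs Hz. The proportionality factor of the adaptive threshold and the variable names are assumptions made for illustration; they are not taken from the thesis implementation.

    % Adaptive-threshold peak detection in the EDA phase signal.
    phase = phase(:);
    w     = round(4.7 * fs);                               % 4.7 s sliding RMS window
    rms_p = sqrt(conv(phase.^2, ones(w,1)/w, 'same'));     % sliding RMS of the phase signal
    th    = 2 * rms_p;                                     % threshold proportional to the RMS (factor assumed)
    d = diff([0; phase > th; 0]);                          % contiguous regions above the threshold
    starts = find(d == 1);  stops = find(d == -1) - 1;
    peak_idx = zeros(numel(starts), 1);
    for k = 1:numel(starts)
        [~, rel] = max(phase(starts(k):stops(k)));         % one peak per region
        peak_idx(k) = starts(k) + rel - 1;
    end
    peak_height = phase(peak_idx);                         % peak heights
    inst_rate   = fs ./ diff(peak_idx);                    % instantaneous peak rate (reciprocal intervals)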

3.1.2.3 Quantiles as Specific EDA Features

Quantiles q(p), calculated from the EDA peak height and from the instantaneous peak rate, were used as special EDA features. An intuitive explanation of quantiles is given here. For a formal description, refer to Appendix A.1.

Quantiles are closely related to percentiles. A percentile is the value of a variable below which a certain percentage of observations falls [208]. For example, the 30th percentile is a threshold such that 30% of the observed data has smaller values than this threshold. While percentiles are defined by percentages, quantiles are defined by probabilities, i.e. q(p = 0.3) is a threshold such that the probability that the variable has smaller values than this threshold equals 0.3. The 30th percentile is thus essentially the same as the estimated 0.3-quantile q(p = 0.3).
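As a simple illustration, a nearest-rank quantile estimate can be computed as follows; v is assumed to be a vector of observations (e.g. EDA peak heights) and p a probability between 0 and 1. The exact estimator used in this work is defined in Appendix A.1.

    % Nearest-rank estimate of the quantile q(p) of the observations in v.
    vs = sort(v(:));
    q  = vs(max(1, ceil(p * numel(vs))));   % value below which roughly a fraction p of the data lies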


Figure 3.6: Emotion Board: Unobtrusive, wearable device for measuring EDA [182], © 2010 IEEE. Size: 41 x 67 mm. Number of electronic parts: 74. Power consumption: 182 mW.

Figure 3.7: Analog part of the Emotion Board with amplifiers and filters, adapted from [177].


3.1.3 Facial Electromyogram (EMG)

The electrical activity of a muscle can be measured using electromyography. The measured signal, called electromyogram (EMG), originates from action potentials generated by the somatic nervous system. The action potentials travel along the muscle fibers and lead to muscle contraction. Using surface electrodes applied to the skin, a superposition of action potentials from nearby muscle fibers is measured, thus indicating the activity of the underlying muscle. The amplitude of an EMG signal is typically between a µV and a few mV [200, 16].

Figure 3.8: Example of an EMG signal recorded at the zygomaticus major (smiling muscle), showing three muscle contractions. Top: Raw EMG signal. Bottom: Filtered EMG signal (14th order Butterworth high-pass filter with fc = 10 Hz). Muscle contraction leads to an increase in signal amplitude.

3.1.3.1 EMG Feature Calculation

The facial EMG is sensitive to low-frequency artifacts due to eye movements, eye blinks, activity of neighboring muscles, and swallowing [27]. We therefore first filtered the EMG signal using a Butterworth high-pass filter (14th order). Since the primary energy in surface EMG signals lies roughly between 10 and 200 Hz [81], the cut-off frequency fc was chosen at 10 Hz¹.

Figure 3.8 shows an example of an EMG signal recorded at the zygomaticus major (smiling muscle) during three consecutive muscle contractions.

¹ The Butterworth filter was designed using Matlab’s Filter Design and Analysis Tool [197]. Exact settings are: End of stopband Fstop = 5 Hz; beginning of passband Fpass = 10 Hz; stopband attenuation Astop = 80 dB; passband ripple Apass = 1 dB; options: minimum order, match stopband exactly.


The top plot shows the raw signal, whereas the bottom plot shows the high-pass filtered signal. During muscle contraction, the EMG signal exhibits a higher amplitude than during muscle relaxation. The EMG signal power was therefore chosen as EMG feature: With x being a vector of EMG values of length N and xi being the i-th value of x, the EMG signal power is calculated according to

\mathrm{power}(x) = \frac{1}{N} \sum_{i=1}^{N} |x_i|^2 \qquad (3.2)
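As an illustration, the filtering and the power feature can be sketched in Matlab as follows, assuming emg is a raw facial EMG segment sampled at fs Hz. The thesis uses a 14th-order high-pass designed with the Filter Design and Analysis Tool; the sketch below uses a lower-order Butterworth design (Signal Processing Toolbox) for compactness and numerical robustness.

    % High-pass filtering and EMG power feature (Eq. 3.2).
    fc = 10;                                   % cut-off frequency in Hz
    [b, a] = butter(4, fc / (fs/2), 'high');   % 4th-order Butterworth high-pass (order assumed)
    emg_hp = filtfilt(b, a, emg);              % zero-phase filtering of the raw EMG
    emg_power = mean(abs(emg_hp).^2);          % signal power according to Eq. 3.2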

For more information regarding EMG, including guidelines for data collection and analysis, refer to [81]. For effects of electrode placement on facial EMG, refer to [194].

The relationship between facial EMG measurements and emotions is described in Sec. 2.2.2.

3.1.4 Electrooculogram (EOG)

Electrooculography is a technique to measure the resting potential of the retina. It is mainly used for eye movement analysis. The electrooculogram (EOG) is measured between two electrodes attached at the right and left side of the eye (for the horizontal eye movements) or below and above the eye (for the vertical eye movements).

In the vertical EOG, the blinks are visible, as shown in Fig. 3.9. Eye blinks are related to startle events [77]. Moreover, the blink rate can be affected by the emotional state of a person [145].

Figure 3.9: Example of a vertical EOG signal with detected blinks indicated by green rectangles.

3.1.4.1 Blink Rate Calculation

The template matching algorithm proposed in [31] was used to detect the blinks in the vertical EOG signal. The blink rate (in blinks per minute) was calculated by


Eq. 3.3 and used as EOG feature:

\mathrm{BlinkRate} = \frac{\text{number of blinks}}{T} \cdot 60 \qquad (3.3)

with T indicating the duration of the recorded EOG signal in seconds.

3.1.5 Respiration

With each breath, the thorax and the abdomen expand. These movements can be measured by an expansion sensor wound around the thorax. In this work, a novel strain sensor [132] was integrated into an airplane seat belt, see Fig. 4.10a. Refer to [176] for further information on the respiration sensor and the corresponding recording hardware.

A normal respiration rate at rest is between 12 and 18 breaths per minute [187]. Respiration can be used as a complementary measure to detect artifacts in other physiological signals (e.g. in EDA [200]) or to calculate respiratory sinus arrhythmia (the acceleration of the heart rate during inhalation and the deceleration during exhalation) [82]. However, respiration can also serve as an independent measure for emotion: An overview of how the breathing rhythm is related to emotions with respect to brain processes is given by Homma and Masaoka in [89]. They showed, among other things, an increase of the respiration rate during anticipatory anxiety that was unrelated to oxygen consumption [131].

3.1.5.1 Respiration Rate Calculation

An exemplary respiration signal, recorded with the seat-belt-integrated respiration sensor, is shown in Fig. 3.10a. When the subjects moved on the airplane seat, artifacts occurred, as marked by the red area in Fig. 3.10a.

The respiration rate is determined by detecting the peaks in the respiration signal. Our peak detection algorithm is based on [92] and has shown good peak-detection performance in the presence of motion artifacts [62]. The algorithm works as follows: To determine the points of maximal inhalation (indicated by the black arrows in Fig. 3.10a), the respiration signal was first negated and mean-filtered using a sliding window of 1 s. The resulting signal r is depicted in blue in Fig. 3.10b. A shifted version of r (red line in Fig. 3.10b) was then used as adaptive threshold th to detect the inhalation peaks (red * in Fig. 3.10b). The threshold signal th at time ti can be calculated by:

th(t_i) = r(t_i - \Delta t_1) + \sqrt{\frac{1}{2 \cdot F_s \cdot \Delta t_2} \sum_{t_j = t_i - \Delta t_1 - \Delta t_2}^{t_i - \Delta t_1 + \Delta t_2} \left( r(t_j) - \bar{r}_j \right)^2} \qquad (3.4)

with \bar{r}_j = \frac{1}{2 \cdot F_s \cdot \Delta t_2} \sum_{t_k = t_i - \Delta t_1 - \Delta t_2}^{t_i - \Delta t_1 + \Delta t_2} r(t_k). Fs denotes the sampling frequency of the respiration signal. The first term of Eq. 3.4 indicates a right-shift by ∆t1 = 1.5 s, whereas the second term represents a sliding RMS with a window size of 2·∆t2 = 7 s, which results in an upward shift.


Figure 3.10: (a) Raw respiration signal recorded with a seat-belt-integrated respiration sensor [176]. The measured respiration signal is inversely proportional to the expansion of the abdomen, as indicated by the inhalation and exhalation arrows. A motion artifact disturbs the signal between 90 and 97 s. (b) Negated, mean-filtered (1 s window size) respiration signal with the shifted signal version used as threshold. Stars indicate detected peaks.


The respiration peaks were determined by calculating the maximum value of r in the regions where r > th. The red stars in Figure 3.10b indicate the detected peaks.

For every interval RIi between two consecutive respiration peaks, a respiration rate value in breaths per minute was calculated according to

rr_i = 60 \cdot \frac{1}{RI_i} \qquad (3.5)

The respiration rate values rri were assembled to a respiration rate vector rr = [rr1, rr2, . . . , rrn]. The respiration rate vector was used to calculate standard statistical features (e.g. mean, maximum) for specific time windows (e.g. for the 3-minute film clips in Sec. 4.2).
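The peak detection of Eq. 3.4 and the respiration rate of Eq. 3.5 can be sketched in Matlab as follows, assuming resp is the raw belt signal sampled at fs Hz. The window and shift lengths follow the text; the implementation details and variable names are illustrative, not the original code.

    % Respiration peak detection (Eq. 3.4) and respiration rate (Eq. 3.5).
    r  = -resp(:);                                          % negate: maxima correspond to inhalation
    w1 = round(1.0 * fs);                                   % 1 s mean filter
    r  = conv(r, ones(w1,1)/w1, 'same');
    w2 = round(7.0 * fs);                                   % 2*dt2 = 7 s window for the sliding RMS
    mu = conv(r, ones(w2,1)/w2, 'same');
    sd = sqrt(max(conv(r.^2, ones(w2,1)/w2, 'same') - mu.^2, 0));
    s  = round(1.5 * fs);                                   % right-shift by dt1 = 1.5 s
    th = [repmat(r(1) + sd(1), s, 1); r(1:end-s) + sd(1:end-s)];   % adaptive threshold (Eq. 3.4)
    d = diff([0; r > th; 0]);                               % regions where r exceeds the threshold
    starts = find(d == 1);  stops = find(d == -1) - 1;
    pk = zeros(numel(starts), 1);
    for k = 1:numel(starts)
        [~, rel] = max(r(starts(k):stops(k)));              % maximum of r within each region
        pk(k) = starts(k) + rel - 1;
    end
    rr_bpm = 60 * fs ./ diff(pk);                           % breaths per minute (Eq. 3.5)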

3.1.6 Finger Temperature

Peripheral body temperature can be measured with a temperature sensor attached to the finger. The skin temperature depends on the blood flow in the underlying blood vessels, which is regulated by the sympathetic nervous system. When sympathetic fibers are activated, the blood vessels constrict, the blood flow reduces and the skin temperature decreases.

In several experiments [143, 123], Ekman and Levenson showed average increases of finger temperature between 0.1°C and 0.2°C due to anger. For fear, finger temperature decreased between 0.01°C and 0.08°C. Moreover, normal subjects can voluntarily increase or decrease their finger temperature by 0.5-1°C when receiving temperature biofeedback [80].

In our experiments, we calculated standard statistical features (e.g. mean, slope) for the finger temperature signals recorded during specific time windows (e.g. during the 3-minute film clips in Sec. 4.2).

3.1.7 Acceleration and Movement

Ekman investigated the differences between nonverbal affective cues given by the head and by the body [69]. A series of photographs were taken during standardized stress interviews. Three sets of pictures were created, one showing only the head, one showing only the body up to the neck, and one showing the whole person. The three sets were shown to three groups of judges, who were asked to rate each picture on the scales pleasantness-unpleasantness (valence), sleep-tension (arousal), and attention-rejection. Results showed that head nonverbal cues (i.e. mainly facial expression) primarily contain information on the kind of affect being experienced, whereas body movements are related to the level of arousal or the degree of intensity of an affective experience [69].

In [137], subjects of different age groups reliably identified four emotions (happy, sad, angry, neutral) by watching body movements and gestures portrayed by actors (with faces blurred and voice muted). Moreover, the four emotions were associated with 6 movement dimensions (smooth-jerky, stiff-loose, soft-hard, slow-fast, expanded-contracted, no action-a lot of action) as follows:


• Happiness: relatively jerky, loose, fast, hard, expanded, and full of action.

• Anger: very jerky, stiff, fast, hard, expanded, and full of action.

• Sadness: very smooth, loose, slow, soft, contracted, inactive.

• Neutral: relatively smooth, loose, slow, soft, very contracted, and inactive.

Movement, postures and gestures of a person can be measured in three different ways: by acceleration sensors or inertial measurement systems, by pressure sensors (e.g. placed on a seat) or by video-based motion tracking systems. A pressure mat combined with other sensors was used in [61] to create an intelligent tutoring system, whereas video-taped gestures were used for automatic emotion recognition in [42, 20].

3.1.7.1 Acceleration Sensors

In our work, acceleration sensor nodes developed at the Wearable Computing Lab. at ETH Zurich were used [10]. They incorporate a 3-axis acceleration sensor and transmit the recorded data via Bluetooth, see Fig. 3.11a.

3.1.7.2 Calculation of Sensor Orientation

To describe the spatial orientation of an acceleration sensor, three angles are used: the pitch, the roll and the yaw. The names of the angles originate from aeronautics, as illustrated in Fig. 3.11b. Because an acceleration sensor always measures the earth's gravitation, the pitch and the roll angles can be calculated from the acceleration signal vector a = [ax, ay, az] by

Rotation around y-axis: \mathrm{pitch} = -\arcsin\left( \frac{a_x}{\|\mathbf{a}\|} \right) \qquad (3.6)

Rotation around x-axis: \mathrm{roll} = -\arctan\left( \frac{a_y}{a_z} \right) \qquad (3.7)

with \|\mathbf{a}\| = \sqrt{a_x^2 + a_y^2 + a_z^2} denoting the absolute value of the acceleration vector. (The yaw angle cannot be determined from acceleration signals.)

Standard statistical features (e.g. mean, std) were calculated from the pitch, the roll and the absolute acceleration ‖a‖, for specific time windows (e.g. for the 47 s excerpt of the MIST experiment data in Sec. 5.5.2).
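A minimal Matlab sketch of Eqs. 3.6 and 3.7 follows, assuming acc is an N-by-3 matrix of acceleration samples [ax ay az] for one time window; the feature selection in the last line is only an example.

    % Pitch and roll angles from a 3-axis acceleration signal (Eqs. 3.6 and 3.7).
    a_abs = sqrt(sum(acc.^2, 2));                 % absolute acceleration per sample
    pitch = -asin(acc(:,1) ./ a_abs);             % rotation around the y-axis (Eq. 3.6)
    roll  = -atan(acc(:,2) ./ acc(:,3));          % rotation around the x-axis (Eq. 3.7)
    feat  = [mean(pitch), std(pitch), mean(roll), std(roll), mean(a_abs), std(a_abs)];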

3.1.8 Speech

Figure 3.12 shows the process of speech feature calculation used in Sec. 4.1. All the speech features were calculated for entire sentences. The features were either derived directly from the speech signal waveform or from so-called contours. Contours can be viewed as features that are calculated for a series of time steps using a sliding window. In this work, we chose the pitch frequency and the intensity contours for speech feature calculation, because they were often used in other speech emotion recognition studies [51, 175, 90].


Figure 3.11: (a) Acceleration sensor node [10]: PCB: 47 x 22 x 7 mm, powered by a rechargeable Li-ion battery (3.7 V, 600 mAh, 37 x 27 x 6 mm). Weight: 22 g. (b) Visualization of the three axes of an acceleration sensor “attached” to an airplane and the corresponding angles. The pitch indicates whether the plane is rising or sinking, the roll indicates an inclination of the wing and the yaw indicates the direction (e.g. west). Figure adapted from [148].

Figure 3.12: Process of speech feature calculation: The features were calculated for entire sentences. They were either calculated from the speech signal itself or from contours that are derived from the speech signal.


Moreover, features calculated from the pitch frequency and the intensity have resulted in higher emotion recognition accuracies than features calculated from other contours [113, 5].

In the following, we first describe the calculation of the pitch frequency and intensity contours (cf. Fig. 3.13), before listing the features used in this work.

Pitch frequency contour calculation: In voiced speech, the vocal cords generate a quasi-periodic signal, which is characterized by a fundamental frequency F0 and its harmonics (2F0, 3F0, . . .) [149]. The fundamental frequency is perceived as pitch [142]. The terms pitch and fundamental frequency are therefore often used interchangeably in the speech processing literature [88]. According to [88], F0 can range between 50 and 800 Hz, depending on the speaker. The fundamental period (i.e. the smallest repeating cycle of the signal) is defined by T0 = 1/F0 [149].

To determine F0 of the speech signals recorded in our experiments (cf. Sec. 4.1), we have used Praat [23], a free software for acoustic analysis. Since F0 varies over time due to intonation, Praat first splits the speech signal into a series of overlapping windows (usually called frames in speech processing). For each frame, several pitch candidates (15 by default) and their strengths are determined using autocorrelation or normalized cross-correlation. The normalized cross-correlation for a frame of a speech signal s is defined as [191]:

\chi(k) = \chi(-k) = \frac{\sum_{j=m}^{N+m-1} s(j) \cdot s(j+k)}{\sqrt{\sum_{j=m}^{N+m-1} s^2(j) \cdot \sum_{j=m}^{N+m-1} s^2(j+k)}}

with m indicating the starting index of the frame and N the number of samples in the correlation window. The variable k is called lag index. For a periodic signal, χ(k) results in a strong peak when the lag index k corresponds to the fundamental period T0. Therefore, the normalized cross-correlation function is calculated for all lag indices Fs/F0,max ≤ k ≤ Fs/F0,min, with F0,min and F0,max indicating the minimally and maximally expected pitch, respectively, and Fs being the sampling rate of the speech signal. The lag indices yielding the 14 largest cross-correlation values are chosen as pitch candidates, i.e. F0,cand_l = Fs/k_cand_l with strength χ(k_cand_l), 1 ≤ l ≤ 14.

Unvoiced and silent speech signal sections are not periodic, and therefore not characterized by a fundamental frequency. To account for this fact, an “unvoiced candidate” is added to the 14 pitch candidates.

Given the 15 candidates for each frame, the Viterbi algorithm [198, 155] is used to determine the best path through the candidate “space” and to finally assign a single pitch frequency to each signal frame. The parameters of the Viterbi algorithm and the corresponding values chosen for our experiments are listed in Appendix A.2.

An example of a pitch frequency contour is shown in Fig. 3.13. For unvoiced and silent frames, no pitch frequency can be calculated, which results in “gaps” in the pitch frequency contour. For more information on Praat’s pitch calculation algorithm, refer to [22, 207].
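To illustrate the candidate search, the following Matlab sketch estimates the strongest pitch candidate of a single frame via the normalized cross-correlation, restricted to the overlapping samples of the frame and its shifted copy. The search range (75-600 Hz), the DC removal and the variable names are assumptions made for illustration; Praat additionally keeps 14 candidates plus an unvoiced candidate and runs a Viterbi search over all frames.

    % Strongest pitch candidate of one frame via normalized cross-correlation.
    f0min = 75;  f0max = 600;                     % assumed pitch search range in Hz
    s  = frame(:) - mean(frame);                  % remove DC offset (assumed preprocessing)
    N  = numel(s);
    ks = floor(fs / f0max) : ceil(fs / f0min);    % candidate lag indices
    chi = zeros(size(ks));
    for ii = 1:numel(ks)
        k  = ks(ii);
        s1 = s(1:N-k);  s2 = s(1+k:N);            % overlapping parts of the frame
        chi(ii) = (s1' * s2) / sqrt((s1' * s1) * (s2' * s2) + eps);
    end
    [strength, best] = max(chi);                  % strength of the best candidate
    f0 = fs / ks(best);                           % pitch candidate in Hz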


Figure 3.13: Raw audio signal of the sentence “Hat sundig pron you venzy” (this is a pseudo sentence, the words have no meaning) (top) and corresponding pitch frequency (middle) and intensity contour (bottom). For unvoiced and silent frames no pitch frequency can be calculated.


Intensity contour calculation: An example of an intensity contour is depicted in Fig. 3.13 (bottom). It was also calculated using Praat [23].

To cancel the DC offset, Praat first subtracts the mean value of the speech signal from the speech signal s. The resulting signal s' is squared and convolved with a Gaussian analysis window w(n), which results in the intensity contour. With a window size W, the intensity at time step n is thus calculated by

I(n) = \sum_{j = n - \frac{W}{2}}^{n + \frac{W}{2}} \left( s'(j) \right)^2 \cdot w(n - j) \qquad (3.8)

Praat adapts the window size W according to the parameter “minimum pitch” mp, such that W = 3.2/mp. We calculated mp from the pitch frequency contour according to mp = min(F0). See Appendix A.3 for more details.
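A minimal Matlab sketch of the intensity contour of Eq. 3.8 follows, assuming s is the speech signal sampled at fs Hz and mp the “minimum pitch” in Hz. The standard deviation of the Gaussian window (W/6) is an assumption; Praat’s exact window parameters may differ.

    % Intensity contour (Eq. 3.8): squared, DC-free signal smoothed by a Gaussian window.
    sp = s(:) - mean(s);                       % cancel the DC offset
    W  = round(3.2 / mp * fs);                 % window size in samples (W = 3.2/mp seconds)
    n  = (-floor(W/2):floor(W/2))';
    w  = exp(-0.5 * (n / (W/6)).^2);           % Gaussian analysis window (width assumed)
    w  = w / sum(w);                           % normalize the window weights
    intensity = conv(sp.^2, w, 'same');        % Eq. 3.8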

Calculated Features: The speech features were finally calculated based on the unprocessed speech signal, the pitch frequency contour, and the intensity contour (cf. Fig. 3.12), which resulted in 25 features per recorded sentence. The features are listed in Table 3.2.

Table 3.2: Twenty-five speech features, calculated for each recorded sentence. q(0.25) and q(0.75) represent quantiles, c.f. Sec. 3.1.2.3.

Unprocessed speech signal: mean, std, min, max, range
Pitch frequency contour: mean, std, min, max, range, q(0.25), q(0.75)
Pitch frequency contour: absolute slope mean, rising slope maximum, rising slope mean, falling slope minimum, falling slope mean
Pitch frequency contour: ratio of voiced to total frames, mean duration of unvoiced and silent signal segments
Intensity contour: mean, std, min, max, q(0.25), q(0.75)

3.2 Classification and Clustering Algorithms

This section describes the employed classification algorithms, adapted from [64] and [85]. For all methods, we assume to have a set of N training data pairs (xk, yk), k = 1, . . . , N, with xk ∈ Rd a d-dimensional feature vector and yk ∈ {c1, c2, . . . , cnc} the labels for nc different classes. It is important to note that the more parameters need to be estimated for a certain classifier, the more training data is necessary [156].


3.2.1 Nearest Center Classifier (NCC)

For each class ci, i = 1, . . . , nc, the Nearest Center Classifier (NCC) calculates a mean vector µi from the training data samples xk belonging to the class ci:

\mu_i = \frac{1}{N_i} \sum_{k \in c_i} x_k \qquad (3.9)

Ni denotes the number of training data belonging to class ci. Given the nc mean vectors, the class membership of a new feature vector x' is calculated by:

C(x') = \arg\min_{i=1,\dots,nc} |x' - \mu_i|
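A minimal Matlab sketch of the NCC follows, assuming X is an N-by-d matrix of training feature vectors, y a vector of class labels and xnew a 1-by-d test vector; the variable names are illustrative.

    % Nearest Center Classifier: class means (Eq. 3.9) and nearest-center decision.
    classes = unique(y);
    dist = zeros(numel(classes), 1);
    for i = 1:numel(classes)
        mu_i = mean(X(y == classes(i), :), 1);   % class mean vector (Eq. 3.9)
        dist(i) = norm(xnew - mu_i);             % distance of xnew to the class center
    end
    [~, idx] = min(dist);
    chat = classes(idx);                         % predicted class: nearest center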

3.2.2 Discriminant Analysis

The idea of a discriminant classifier is to transform the training data such that the within-class variance becomes small and the between-class variance becomes large.

A discriminant classifier can be described by a set of discriminant functions δi(x), i = 1, . . . , nc, for nc classes. A classifier assigns a feature vector x to class ci if the discriminant function that belongs to class ci yields the largest value, i.e.

C(x) = i \quad \text{if} \quad \delta_i(x) = \max_{j} \delta_j(x) \qquad (3.10)

For a minimum error rate classifier, the posterior probability can serve as discriminant function, i.e. δi(x) = P(ci|x). Hence, the class with the maximum posterior probability will be chosen by the classifier.

In linear and quadratic discriminant analysis (LDA and QDA), the state-conditional probability density functions of x, p(x|ci), are assumed to be multivariate normal. With this assumption and some calculus given in Appendix B.1, the quadratic discriminant functions for QDA result in:

\delta_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2} \ln(|\Sigma_i|) + \ln(P(c_i)) \qquad (3.11)

where µi is the mean vector (c.f. Eq. 3.9) and Σi the covariance matrix of the distribution of the d-dimensional feature vector x for class ci. |Σi| is the determinant of the covariance matrix and P(ci) denotes the prior probability for class ci.

In the special case where the covariance matrices are assumed to be equal for all classes, i.e. Σi = Σ ∀i, Eq. 3.11 can be simplified and results in linear discriminant functions for LDA:

\delta_i(x) = x^T \Sigma^{-1} \mu_i - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln(P(c_i)) \qquad (3.12)

Note that - assuming a two-class problem and a feature space dimension d ≫ 2 - the number of parameters that need to be estimated from the training data for LDA is about half the number needed for QDA.

Since both LDA and QDA perform well on diverse classification tasks [136], including emotion recognition [104], Hastie concluded that “it seems that whatever exotic tools are the rage of the day, we should always have available these two simple tools” ([85], p. 111).
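The linear discriminant functions of Eq. 3.12 can be sketched in Matlab as follows, assuming X is an N-by-d training matrix, y a label vector and xnew a 1-by-d test vector. The softmax over the discriminants yields the posterior estimates that are later used as confidence measures in Sec. 3.3; the variable names are illustrative and not taken from the thesis code.

    % LDA with pooled covariance (Eqs. 3.9 and 3.12) and posterior estimates for xnew.
    classes = unique(y);  nc = numel(classes);
    [N, d] = size(X);
    Sigma = zeros(d);  mu = zeros(nc, d);  prior = zeros(nc, 1);
    for i = 1:nc
        Xi = X(y == classes(i), :);
        mu(i, :) = mean(Xi, 1);                       % class mean (Eq. 3.9)
        Sigma = Sigma + (size(Xi,1) - 1) * cov(Xi);   % accumulate within-class scatter
        prior(i) = size(Xi, 1) / N;                   % prior probability P(ci)
    end
    Sigma = Sigma / (N - nc);                         % pooled covariance estimate
    delta = zeros(nc, 1);
    for i = 1:nc
        delta(i) = xnew * (Sigma \ mu(i,:)') ...
                 - 0.5 * mu(i,:) * (Sigma \ mu(i,:)') + log(prior(i));   % Eq. 3.12
    end
    post = exp(delta - max(delta));  post = post / sum(post);   % posterior estimates (confidences)
    [~, idx] = max(delta);
    chat = classes(idx);                              % class with the largest discriminant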

3.2.3 Support Vector Classifiers and Support Vector Machines (SVM)

Support Vector Classifiers and Support Vector Machines (SVM) are binary classifiers. Given a set of N training data pairs (xk, yk), k = 1, . . . , N, with xk ∈ Rd a d-dimensional feature vector and yk ∈ {−1, 1} the labels for the two classes, the support vector algorithm tries to find the optimal separating hyperplane, i.e. the one that maximizes the margin between the two classes, c.f. Fig. 3.14.

A hyperplane can be described by:

\{ x : f(x) = \beta_0 + \beta^T x = 0 \} \qquad (3.13)

If β is a unit vector, i.e. ‖β‖ = 1, the signed distance between any point xk and the hyperplane is given by f(xk) = β0 + βT xk. Since the class labels yk are coded by -1 and 1, the class membership of xk can be determined by applying the sign function to the distance to the hyperplane:

C(x_k) = \mathrm{sign}(f(x_k)) = \mathrm{sign}(\beta_0 + \beta^T x_k) \qquad (3.14)

Such a classifier is called Support Vector Classifier. If two classes are separable, it is possible to find a hyperplane which satisfies yk · f(xk) > 0 ∀k. However, more than one such hyperplane exists. To find the “optimal” hyperplane, the distance between the closest training data samples of either class and the hyperplane (i.e. M in Fig. 3.14) needs to be maximized, which can be stated as an optimization problem:

\max_{\beta, \beta_0} M \qquad (3.15)

subject to \qquad (3.16)

y_k(\beta_0 + \beta^T x_k) \ge M, \quad k = 1, \dots, N \qquad (3.17)

\|\beta\| = 1 \qquad (3.18)

This optimization problem can be transformed into a convex optimization problem and be solved analytically. The inequality constraint 3.17 ensures that the distance between each training data sample and the hyperplane is at least M. The training data samples for which yk(β0 + βT xk) = M define the optimal hyperplane and are called the support vectors. The margin between the two classes thus amounts to 2·M. Fig. 3.14 shows an example of a separable training data set, the optimal hyperplane, the margin and three support vectors.

The support vector classifier is only able to generate linear decision boundaries, whereas Support Vector Machines (SVM) transform the data into a higher-dimensional space to generate non-linear boundaries.


Figure 3.14: Two-dimensional classification problem with two classes (black and white points). The solid line indicates the optimal hyperplane found by the support vector classifier. The dashed lines indicate the boundaries of the 2·M-wide margin. The three points on this margin are called support vectors. Adapted from [56].

Support Vector Machines are described in Appendix B.2.

3.2.4 Self Organizing Maps (SOM)

In the following, we briefly introduce the basic principle of Self Organizing Maps (SOM) and XY-fused Kohonen Networks. Originally, the main purpose of SOMs was to visualize a high-dimensional feature space on a two-dimensional grid of nodes while preserving the topological relationships of the original feature space on the two-dimensional display [108].

Basic SOM Principles: A SOM consists of a set of units that are spatially ordered in a two-dimensional grid (Fig. 3.15). In order to map the feature space to the SOM, we assume having a feature vector x_stim = [x1, x2, . . . , xd]^T ∈ Rd. Each unit i in the SOM is equipped with a weight vector wi = [wi1, wi2, . . . , wid]^T ∈ Rd, which has the same dimension as the feature vector. The image of the feature vector x_stim on the SOM grid is defined as the SOM unit s that matches best with x_stim using a distance measure D:

s = \arg\min_i D(x_{stim}, w_i) \qquad (3.19)

To preserve the topological relationships of the feature space on the two-dimensional SOM grid, the weight vectors wi are adapted in a training phase. During the training, the N training feature vectors x_stim,k, k = 1, . . . , N (also called “input vectors”) are presented to the SOM in random order.


Figure 3.15: Training of the SOM. The training input x_stim is given to the SOM and the winning node s is determined. The weights of all nodes are adapted depending on the distance to the node s. Node r is one of the nodes in the closest neighborhood region.

For each SOM unit, the distance between its weight vector and the presented input vector is calculated. The SOM unit reaching the highest similarity (i.e. the smallest distance) is assigned to be the winner s. After finding the winner, the weights of all nodes of the SOM are adapted depending on the distance to the node s. Given the input vector x_stim,k at time t, the update of the weight vector wi of the node i is calculated by

w_i(t+1) = w_i(t) + h_{si}(t) \left[ x_{stim,k}(t) - w_i(t) \right] \qquad (3.20)

using the Gaussian neighborhood function

h_{si}(t) = \alpha(t) \cdot \exp\left( \frac{-\|r_s - r_i\|^2}{2 \sigma^2(t)} \right) \qquad (3.21)

with a learning rate 0 < α(t) < 1, σ(t) the standard deviation of the Gaussian, and with location vectors rs, ri ∈ R² of the winning node s and the node i. The Gaussian form of the neighborhood function implies that the smaller the distance of a node to the winning unit, the larger the modification of its weight. The weights of the winning unit and its neighbors thus become slightly more similar to the presented input vector x_stim. This is illustrated in Fig. 3.15.
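One training step of Eqs. 3.19-3.21 can be sketched in Matlab as follows, assuming W is an M-by-d matrix of unit weight vectors, R an M-by-2 matrix of unit grid positions, x a 1-by-d input vector, and alpha_t and sigma_t the current learning rate and neighborhood width; the Euclidean distance is used for D. The names are illustrative.

    % One SOM training step: winner search and weight update (Eqs. 3.19-3.21).
    [~, s] = min(sum((W - x).^2, 2));                                % winning unit (Eq. 3.19)
    h = alpha_t * exp(-sum((R - R(s,:)).^2, 2) / (2 * sigma_t^2));   % neighborhood function (Eq. 3.21)
    W = W + h .* (x - W);                                            % weight update for all units (Eq. 3.20)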

Once the training of the SOM is finished, it can be used to predict the class membership of a new, unlabeled feature vector. The new feature vector is presented to the trained SOM and the winning unit is determined according to Eq. 3.19. The class memberships of the input vectors that were mapped to this particular node during training are then used to predict the class membership of the new feature vector by “majority voting”.



Extension to Supervised SOMs: In recent years, supervised variants of SOMs were developed, which incorporate the class membership of the input vectors into the learning process. In this work we focus on the XY-fused Kohonen Network, which uses an additional grid of nodes to map the class information [135]. For each input vector, the corresponding class is represented by a binary output vector (e.g. [0, 0, 1, 0] to indicate that the input vector belongs to the third out of four classes). The additional grid responsible for the class information is called Ymap, whereas the Xmap represents the grid responsible for the input vectors. An XY-fused Kohonen Network exploits the similarities in both the input map Xmap and the output map Ymap. During the training, a “fused” distance measure is used that is based on a weighted sum of distances between an input vector and the units in the Xmap, and distances between the corresponding output vector and the units in the Ymap. The conjoint winning unit is determined by minimizing the fused distance measure. Both maps are updated simultaneously according to the standard SOM formalism: the input vector is used to update the Xmap, whereas the corresponding output vector is used to update the Ymap.

The procedure for predicting the class membership of a new, unlabeled feature vector starts with presenting the new feature vector to the trained XY-fused Kohonen Network. The position of the winning unit s in the Xmap is determined and used to look up the class membership of the corresponding unit in the Ymap: the maximum value of the winning unit’s weight vector (in the Ymap) determines the predicted class membership (e.g. with w_sY = [0.1, 0.2, 0.8, 0.4], membership of class 3 is predicted).

3.2.5 Subtractive Clustering

Consider a set of N training data samples xk, k = 1, . . . , N, with xk ∈ Rd a d-dimensional feature vector. Subtractive clustering, proposed by Chiu [44], is an unsupervised² algorithm for estimating the number and the location of cluster centers from the N feature vectors.

Initially, each feature vector is considered as a potential cluster center. The potential of a feature vector xk to be a cluster center is calculated based on the distances to all other training data samples:

P_k = \sum_{i=1}^{N} e^{-\alpha \|x_k - x_i\|^2} \qquad (3.22)

with α = 4/r_a² and r_a a positive constant. The more training data samples are in the neighborhood of the feature vector xk, the higher is the potential of xk to be a cluster center.

² Unsupervised means that the algorithm does not need labeled training data to perform the clustering.


The constant r_a defines the radius of the neighborhood; training data samples outside this radius contribute little to the potential of xk.

After calculating the potential for every feature vector, the feature vector with the highest potential P1* = max_k Pk is chosen as the first cluster center x1*. The potential of each feature vector is then recalculated using

P_{k,new} = P_k - P_1^* \, e^{-\beta \|x_k - x_1^*\|^2} \qquad (3.23)

with β = 4/r_b² and r_b a positive constant. This means that the potential of each feature vector is reduced depending on its distance to the first cluster center. The feature vector with the highest potential P_{k,new} is then chosen as the second cluster center.

This procedure (i.e. recalculation of the potentials using Eq. 3.23 and choosing the next cluster center based on the maximum potential) is repeated until a stopping criterion is reached. The stopping criterion is based on a trade-off between the potential of a new cluster center and the distance from already selected cluster centers. Refer to Appendix B.3 or to [44] for the formulas.
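A minimal Matlab sketch of the procedure follows, assuming X is an N-by-d matrix of feature vectors and ra, rb positive radii. The stopping rule used here (accepting centers until the remaining maximum potential falls below a fixed fraction of the first potential) is a simplification of the criterion in Appendix B.3 / [44]; the fraction and variable names are assumptions.

    % Subtractive clustering (Eqs. 3.22 and 3.23) with a simplified stopping rule.
    alpha = 4 / ra^2;  beta = 4 / rb^2;
    N = size(X, 1);
    P = zeros(N, 1);
    for i = 1:N
        P(i) = sum(exp(-alpha * sum((X - X(i,:)).^2, 2)));   % potential of x_i (Eq. 3.22)
    end
    centers = [];
    [Pstar, idx] = max(P);  P1 = Pstar;
    while Pstar > 0.15 * P1                                  % simplified stopping criterion (fraction assumed)
        centers = [centers; X(idx, :)];                      % accept the current cluster center
        P = P - Pstar * exp(-beta * sum((X - X(idx,:)).^2, 2));   % subtract potential (Eq. 3.23)
        [Pstar, idx] = max(P);
    end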

3.3 Methods for Handling Missing Feature Values

Since data loss due to artifacts is frequently encountered in practical applications, missing feature values represent a serious problem for classification, which needs to be addressed but has so far gained little attention in emotion recognition. In [165], several courses of action for handling missing feature values are summarized: Discard the corrupted training or test data samples (in practice, this means to discard a large amount of data), acquire missing feature values (infeasible in automatic emotion recognition), impute missing feature values or employ reduced-feature models. The latter two are investigated in this work. They are illustrated in Fig. 3.16.

Imputation uses an estimate of the missing feature value [165]. A simple possibility used in practice is to substitute a missing feature value with the corresponding mean value of the uncorrupted training samples. Imputation is needed if the applied classifier employs a feature whose value is missing.
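
A minimal sketch of this mean-value imputation, assuming missing values are marked as NaN in NumPy feature matrices (the marking convention and the function name are assumptions of the sketch, not part of the original processing chain):

```python
import numpy as np

def mean_impute(X_train, X_test):
    """Replace NaN-marked missing feature values by the column-wise mean
    computed over the uncorrupted (non-NaN) training samples."""
    col_means = np.nanmean(X_train, axis=0)
    X_train_filled = np.where(np.isnan(X_train), col_means, X_train)
    X_test_filled = np.where(np.isnan(X_test), col_means, X_test)
    return X_train_filled, X_test_filled
```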

The reduced-feature models technique represents an alternative approach [165]: instead of imputation, a new model (i.e. a new classifier) is trained that employs only the available features. A simple way to create a reduced-feature model is to use ensemble classifier systems, as described in the following paragraph.

Ensemble classifier systems Ensemble classifier systems consist of several classifiers whose decisions are fused to arrive at a final decision [151]. The idea behind this is to consult "several experts". It can be compared to the natural behavior of humans to seek a second (or third) opinion before making an important decision. The ensemble classifier system used in this work consists of one classifier per signal modality; each signal modality thus represents an "expert".


Figure 3.16: Three strategies for handling missing feature values.


If a missing feature value is encountered in a certain signal modality, no classifier is trained for that modality and the final decision is based only on the signal modalities that are artifact-free.

Numerous ways to combine the decisions of several classifiers exist. We have chosen and evaluated the following two approaches:

1. Majority voting: The class that receives the highest number of votes is chosen. If several classes receive an equal number of votes, the corresponding classifiers are identified as candidate classifiers and the classifier with the highest confidence among them provides the deciding vote.

2. Confidence voting: The class of the classifier with the highest confidence is chosen.

For the two described classifier fusion schemes, a suitable underlying classifier is needed. It has to be able to handle unequal class distributions and to generate a probabilistic output that can be used as a confidence value. Linear and Quadratic Discriminant Analysis (LDA and QDA, see Sec. 3.2.2) fulfill the mentioned conditions: Discriminant Analysis generates an estimate of the posterior probability for each potential class and selects the class that exhibits the largest posterior probability. The estimated posterior probability for the selected class was taken as confidence measure for classifier fusion.
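
The following sketch illustrates how such a per-modality fusion could look; it is not the code used in this work. It assumes one fitted classifier per modality exposing a predict_proba method (as, for example, scikit-learn's discriminant analysis classifiers do), that at least one modality per sample is artifact-free, and that the dictionary layout and function name are illustrative choices.

```python
import numpy as np

def ensemble_predict(modality_features, modality_models, scheme="confidence"):
    """Fuse the decisions of per-modality classifiers.

    modality_features: dict  modality name -> 1D feature vector,
        or None if that modality is corrupted for this sample.
    modality_models:   dict  modality name -> fitted classifier with predict_proba.
    """
    votes, confidences = [], []
    for name, x in modality_features.items():
        if x is None:                                 # corrupted modality: no vote
            continue
        proba = modality_models[name].predict_proba(x.reshape(1, -1))[0]
        votes.append(int(np.argmax(proba)))           # predicted class
        confidences.append(float(np.max(proba)))      # posterior used as confidence

    if scheme == "confidence":
        return votes[int(np.argmax(confidences))]

    # majority voting; ties are broken by the most confident candidate classifier
    counts = np.bincount(votes)
    candidates = np.flatnonzero(counts == counts.max())
    best_conf = {c: max(cf for v, cf in zip(votes, confidences) if v == c)
                 for c in candidates}
    return max(best_conf, key=best_conf.get)
```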


4 Emotion Recognition using a Standardized Experiment Setup

In this chapter, two studies using standardized emotional stimuli are presented. In the first study, standard sentences were used to express a set of emotions. XY-fused Kohonen Networks were employed to analyze the emotions based on speech features. In the second study, film clips elicited five emotions (amusement, anger, contentment, neutral, sadness) in 20 subjects, while six physiological signals (ECG, EMG, EOG, EDA, respiration, finger temperature) were recorded. The different signal modalities were first evaluated separately and then combined. Since part of the data was affected by artifacts, two methods for handling missing feature values (imputation and reduced-feature models) were investigated in combination with two classifier fusion approaches.

4.1 Emotion Recognition from Speech using Self-Organizing Maps1

In this section we show how Self-Organizing Maps (SOMs) can be employed to detect discrete emotions from speech. Speech was chosen because a large range of discrete emotions can be expressed using standard sentences known from the literature [12]. SOMs were chosen because of their ability to visualize high-dimensional data in two dimensions. In our case, calculating the speech features presented in Sec. 3.1.8 resulted in a 25-dimensional feature space. A SOM was then used to map this 25-dimensional feature space onto a two-dimensional grid of nodes while preserving the topological relationships of the original feature space [108]. This means that data samples that are close to each other in the original feature space will also be close to each other in the two-dimensional node grid, whereas distant data samples in the original feature space will be distant in the node grid.

1based on [7]


The SOM thus serves as a means of investigating similarities and differences between emotions.

There has been an ongoing discussion about whether actors tend to exaggerate the signs of emotions (e.g. [13, 199]). For that reason, subjects inexperienced in professional acting were recruited for our study.

4.1.1 Data Collection

Ten healthy subjects (7 undergraduate students, 1 PhD student, 1 postdoctoral research assistant, and 1 administrative staff member; 5 male, 5 female) were recruited for the experiment. By pronouncing two standard sentences, they expressed the following 6 emotions: disgust, happiness, cold anger, boredom, pride, and desperation. The two standard sentences are composed of phonemes from several Indo-European languages and have been suggested and used in [12]:

• “Hat sundig pron you venzy”

• “Fee gott laish jonkill gosterr”

At the beginning of the experiment, the subjects were instructed how to express the emotions: for each emotion, 16 recordings of professional actors expressing the emotion by speaking one of the standard sentences were concatenated and played to the subjects. The recordings originated from the stimulus set2 of [12]. After listening to the professional actors, the subjects were asked to put themselves into the corresponding emotional state and to pronounce the standard sentences themselves. Each sentence could be rehearsed as many times as desired. Finally, each sentence was recorded twice using a standard headset microphone.

4.1.2 Evaluation Methods

Based on previous work [19, 121], 25 features were calculated for each recorded sentence, see Table 3.2. The 25 features were used to train XY-fused Kohonen Networks, a variant of Self-Organizing Maps described in Sec. 3.2.4.

4.1.3 Results

In Fig. 4.1, an XY-fused Kohonen Network that was trained with voice recordings of all subjects is shown. It can be observed that most nodes contain samples from several emotions, which renders an accurate discrimination of different emotions impossible.

There are several reasons why the training of the XY-fused Kohonen Network did not result in more homogeneous nodes. On the one hand, speech features depend not only on the expressed emotion but also on the speaker. We therefore expected more homogeneous clusters when training the XY-fused Kohonen Network for a single subject.

2The stimulus set used is based on research conducted by Klaus Scherer, Harald Wallbott, Rainer Banse & Heiner Ellgring. Detailed information on the production of the stimuli can be found in [12].


Fig. 4.2 shows such a network. As expected, samples of different emotions were mapped onto different nodes.

Another reason for the inhomogeneous nodes of the XY-fused Kohonen Network in Fig. 4.1 could be similarities between certain emotions. For example, pride and happiness were often mapped to the same cluster. We therefore investigated two emotions that differ strongly in emotional arousal: desperation and boredom. The trained XY-fused Kohonen Network is shown in Fig. 4.3. It can be observed that most of the samples of boredom were mapped onto the upper right part of the network, whereas the samples of desperation were mostly mapped onto the lower left part, resulting in good separability.

4.1.4 Conclusion

We have investigated emotion detection from speech data of subjects inexperienced in professional acting. A total of 25 features was calculated for each sentence and used as input for an XY-fused Kohonen Network. In the resulting network, most nodes contained samples from several emotions. An accurate discrimination of different emotions based on the data of all subjects was therefore not possible. Using data from a single subject, however, showed the expected result: samples from different emotions were mapped onto separate nodes.

Moreover, considering only two emotions that differ regarding their emotional arousal also resulted in two clearly separated regions in the XY-fused Kohonen Network.

Even though this was only a small feasibility study, it illustrates two pertinent challenges of emotion recognition:

1. Inter-individual differences

2. Similarities between emotions.

The findings illustrate on the one hand that a person-dependent discrimination of emotions from speech data might be feasible. On the other hand, a general person-independent model might be able to discriminate emotions that differ strongly in the level of arousal, even for subjects inexperienced in professional acting.

4.2 Emotion Recognition from Physiology using Ensemble Classifier Systems for Handling Missing Feature Values3

The study presented in this section was conducted in the context of the European FP6 project SEAT, which aimed at increasing the comfort of airplane passengers by integrating physiological sensors into the seat [152]. When choosing the sensors, we aimed at minimizing the discomfort induced by the sensor attachment [178]. Such unobtrusive sensors are generally prone to artifacts, which leads to a lower signal quality compared to signals recorded with conventional sensors [176].

3based on [184] © 2009 IEEE


Figure 4.1: XY-fused Kohonen Network trained with sentences of all subjects. The emotion classes are as follows: Disgust (2), Happiness (3), Cold anger (6), Boredom (7), Pride (8), Desperation (14).

This issue is not restricted to our airplane seat but is relevant for any application where unobtrusive sensors are employed.

The remainder of this chapter is based on work published in [183] and [184]. It describes how the methods proposed in Sec. 3.3 can be employed to deal with artifacts in an emotion recognition scenario.

Often, the artifacts do not occur in all physiological signals simultaneously. Nevertheless, if a single modality fails, the entire feature vector (including uncorrupted features from the remaining signal modalities) is usually discarded. This results in a substantial amount of unusable data for classifier training: as an example, 20% of the data in [34] was partly corrupt and had to be discarded. Moreover, predicting emotions at run-time becomes impossible if only a single signal modality that was used to train the classifier fails.

Missing feature values represent a serious problem in practical applications that employ physiological signals. However, this problem has so far gained little attention in emotion recognition from physiology. We therefore tested two methods for handling missing feature values in combination with two classifier fusion approaches, as introduced in Sec. 3.3.

4.2.1 Experiment

This section describes the emotion recognition experiment used to show the performance of the proposed methods for handling missing feature values.


Figure 4.2: Left: subject expressing her emotions by speaking standard sentences. Right: XY-fused Kohonen Network trained with sentences uttered by the subject shown on the left. The emotion classes are: Disgust (2), Happiness (3), Cold anger (6), Boredom (7), Pride (8), Desperation (14).

4.2.1.1 Emotion Elicitation

The emotions to be recognized were chosen according to the two most frequently used emotion dimensions: arousal and valence (see also Sec. 2.3 and [51]). One discrete emotion in each quadrant of the arousal-valence space plus neutral were selected, as shown in Figure 4.4: amusement (high arousal, pos. valence), anger (high arousal, neg. valence), contentment (low arousal, pos. valence), sadness (low arousal, neg. valence) and neutral (medium arousal, zero valence).

The success of emotion elicitation was tested by a computerized version of the Self-Assessment Manikin (SAM) questionnaire and the Geneva Emotion Wheel (GEW) [14, 170], see Fig. 6.1. The SAM is shown in Fig. 4.5. The two rows measure the valence and the arousal dimensions of the emotions, respectively. The questionnaire can easily be filled in with two mouse clicks.

The results of the GEW evaluation were not used for this study. However, the GEW results will be employed to compare film emotion elicitation with a soccer game stimulus in Chapter 6.

Preparatory study for film clip selection: Film clips were employed as the emotion elicitation technique. Based on the literature [161, 164], two potential film clips were selected for each of the five emotions, see Table 4.1. We evaluated the film clips in a preparatory study, and the film clip that was better suited for eliciting the corresponding emotion was chosen for the main experiment. The procedure of the preparatory study is described in the following.

The 14 subjects (13 male, 1 female)4 who participated in the preparatory study

4The gender distribution for selecting the films in the preparatory study is different from the gender distribution of the subjects participating in the main study. A balanced gender distribution in the preparatory study might have resulted in a different film clip selection. This represents a limitation of this study.


Figure 4.3: XY-fused Kohonen Network trained with sentences of all subjects but taking only the emotions Boredom (7) and Desperation (14) into account.

were divided into two groups. Each group watched one set of film clips (see Table 4.1). The order of film clips was chosen such that negative and positive emotions alternated, and the order of emotions was the same for both groups. The subjects were sitting alone in an airplane seat and the light in the laboratory was dimmed. In order to enhance the subjects' attention to the film clips, video glasses were used for presenting the clips (see Fig. 4.6). Since the questionnaires were filled in with the mouse, there was no need to take off the glasses during the entire experiment. A two-minute recovery phase was introduced before each film clip. The screen went black and the subjects were requested to clear their minds and calm down. In case it was necessary to know the story of the film in order to understand the film clip, a short introduction text was shown before the clip started.

Table 4.1: Film sets presented to two groups of subjects during the preparatory study: Films marked with ∗ originate from [161]. "John Q" was suggested in [164]. Durations in minutes and seconds are given in brackets. Films set in italics were selected for the main experiment. © 2009 IEEE

             Film set 1                         Film set 2
Sadness      Lion King∗ (2'12")                 John Q (4'05")
Amusement    When Harry met Sally∗ (2'56")      Madagascar (3'34")
Neutral      Fireplace (1'49")                  Fireplace (1'49")
Anger        The Magdalena Sisters (2'31")      Cry Freedom∗ (2'13")
Contentment  Winged Migration (2'52")           Alaska: Spirit of the Wild (2'30")



Figure 4.4: Emotions to be recognized in arousal-valence space. © 2009 IEEE

Figure 4.5: Self-assessment manikin (SAM) used as digital questionnaire, modified from [128]. First row: valence, second row: arousal. For evaluating the film clips, a number was assigned to each SAM picture, resulting in a scale ranging from -2 to 2 for both arousal and valence (e.g. arousal = 2 means highest arousal). © 2009 IEEE


After each film clip, the subjects were asked to fill in the SAM and the GEW questionnaires. They were instructed to answer the questions according to the emotion they had experienced themselves and not according to the actors' emotions in the film scene.

Figure 4.6: Participant wearing video glasses. (Note: Physiological signals were only recorded during the main study, not during the preparatory study.) © 2009 IEEE

The film clips for the main experiment were chosen based on the evaluation of the SAM questionnaires from the preparatory study. To evaluate the SAM, a number was assigned to each SAM picture (see Fig. 4.5), which resulted in a scale ranging from -2 to 2 for both arousal and valence (e.g. arousal = 2 means highest arousal, valence = 2 means most positive valence). The arousal and valence SAM ratings of all the participants of the preparatory study are visualized in Fig. 4.7 and 4.8. The two films eliciting the same emotion are contained in the same subplot. The sizes of the blue and the red circles indicate how many subjects selected the same arousal and valence values: the larger the circle, the more subjects chose the same values. The blue and red crosses indicate the mean arousal and valence SAM ratings of all the participants of the preparatory study who had watched the corresponding film clip. The green x indicates the desired target value (e.g. [−2, 2] for anger). For each emotion, the film clip that exhibited the smaller distance to the target value was selected for the main experiment. The chosen film clips are typeset in italics in Table 4.1 and plotted in red in Fig. 4.7. The neutral film was the same for both groups of the preparatory study; the corresponding SAM ratings are shown in Fig. 4.8.
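
As an illustration of this selection rule (not the evaluation scripts used for the study), the following sketch picks, for each emotion, the candidate clip whose mean (arousal, valence) rating lies closest to the target point; the clip names and numeric ratings in the usage comment are invented placeholders.

```python
import numpy as np

# Target (arousal, valence) points per emotion on the -2..2 SAM scale.
TARGETS = {"anger": (2, -2), "amusement": (2, 2),
           "sadness": (-2, -2), "contentment": (-2, 2)}

def select_clip(emotion, candidate_means):
    """Return the candidate clip whose mean (arousal, valence) SAM rating
    has the smallest Euclidean distance to the emotion's target point."""
    target = np.array(TARGETS[emotion])
    return min(candidate_means,
               key=lambda clip: np.linalg.norm(np.array(candidate_means[clip]) - target))

# Example with invented ratings:
# select_clip("anger", {"clip_A": (1.2, -1.5), "clip_B": (0.8, -1.1)})  # -> "clip_A"
```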

4.2.1.2 Experiment Procedure

Twenty participants (12 male, 8 female) were recruited for the main experiment. The mean age of all subjects was 28.6 years (std = 10.90), 24.58 (std = 1.38) for the men and 34.63 (std = 15.83)5 for the women.


Figure 4.7: SAM results of the preparatory study for 8 emotion-eliciting film clips: (a) anger films, (b) amusement films, (c) sadness films, (d) contentment films. The sizes of the blue and the red circles are proportional to the number of subjects who chose the corresponding arousal and valence values. Crosses indicate the mean arousal and valence SAM ratings of all the participants of the preparatory study who had watched the corresponding film clip. Green x-es indicate the desired target values (e.g. [−2, 2] for anger). For each emotion, the film clip that exhibited the smaller distance to the target value was selected for the main experiment.


Figure 4.8: Comparison of the SAM ratings of the two subject groups participating in the preparatory study for the neutral film clip (fireplace).

The ethnic background of all the participants was Western European. Before the experiment, the participants received instructions and signed a consent form. They were paid 25 CHF for their participation of about 1.25 hours.

The course of the experiment is shown in Fig. 4.9. The five film clips determined during the preparatory study (see Table 4.1) were presented to the subjects. The order of the clips was the same as in the preparatory study and constant for all subjects. After each film clip, the subjects had to fill in a SAM and a GEW questionnaire. The procedure for the main experiment was essentially the same as for the preparatory study, with a few slight adjustments: The starting recovery time was increased to 10 minutes to give the body enough time to calm down. Potential agitation due to previous activities or due to the attachment of the sensors should not influence the data recorded during the first film clip. The recovery time between two film clips was increased from two to three minutes. The subjects were again asked to clear their minds and not think about anything emotionally activating during these recovery phases while the screen of the video glasses was dark. The subjects were allowed to close their eyes.

5The mean age for the women was higher than for the men, because two women were about 60 whereas the remaining subjects were about 25.

Figure 4.9: Course of the film emotion experiment. Film clips are depicted in orange. I: Introductory text to improve the understanding of the film clips. Q: Computerized SAM and GEW questionnaires.


To indicate that the next film clip was about to start, three short, gentle beeps were played when the recovery phase was over. To keep the procedure the same for each film clip, an introduction text was included for each of the clips. This had the additional benefit that a potential startle effect due to the beeps had subsided by the time the film clip started. The subjects, however, reported that they did not feel startled by the beeps, or only briefly.

4.2.2 Evaluation Methods

This section describes the measured signals, the corresponding features, and the evaluation methods.

4.2.2.1 Signals

Respiration and EDA were recorded with custom electronics integrated into the airplane seat [178], see Fig. 4.10a. The respiration was measured by a strain sensor attached to the seat belt [176, 62]. The EDA was measured using finger straps attached to the middle phalanges of the left index and middle fingers (Fig. 4.10a), which were connected to the airplane seat. The finger straps incorporated reusable dry Ag/AgCl electrodes from Brainclinics [29]. Like the emotion board presented in Sec. 3.1.2.1, the electronic circuits used for EDA measurement directly filter and split the EDA signal into level and phase [176].

The ECG, the vertical EOG, the EMG of the musculus zygomaticus (muscle between mouth and eye, contracted when smiling) and the finger temperature were measured by the commercial Mobi device of TMS International [1], depicted in Fig. 4.11. Disposable, pre-gelled Ag/AgCl electrodes (Type H124SG) from Brainclinics [29] were used to measure ECG, EMG, and EOG. The electrode placement is shown in Fig. 4.10b.

For detecting eye movements, the horizontal component of the EOG would be necessary in addition to the vertical component. However, since the eye movements were expected to be influenced by the movements of the actors in the film rather than by emotions, only the vertical EOG was recorded and the blinks analyzed.

4.2.2.2 Features

A total of 51 features was calculated from the six signals, as described in Sec. 3.1:

• ECG: From heart rate: Min, max, mean, mean derivative, parameters of a linear fit. HRV parameters (see Table 3.1): sdnn, rmssd, pnn50, triangular index, lf, hf, lf/hf.

• EDA: From level: Min, max, mean, mean derivative, parameters of a linear fit. From phase: Mean instantaneous peak rate, mean peak height, sum of the peak heights / duration of the signal, q(0.25), q(0.5), q(0.75), q(0.85), and q(0.95) of the peak height and of the instantaneous peak rate (c.f. Sec. 3.1.2.3 for an explanation of the quantiles q(p)).

• Respiration: From respiration rate: Mean, std, min, max, max/min, parameters of a linear fit. From peak heights: Mean, median (= q(0.5)) and std.



Figure 4.10: (a) Measurement of EDA and respiration with electronics integrated into the airplane seat [178, 176]. Measurement of finger temperature with Mobi [1]. (b) Placement of disposable Ag/AgCl electrodes for measuring ECG, EMG of the zygomaticus major (smiling muscle), and vertical EOG with the Mobi [1], see Fig. 4.11.

Figure 4.11: Mobi device from TMS International [1] for physiological signal acquisition.


• Temperature: Mean, std, min, max, max/min, parameters of a linear fit.

• EMG: Power.

• EOG: Blink rate.

The features were calculated for the entire duration of each film clip (emotion phase) and for the preceding recovery phase. The features calculated during the recovery phases served as the baseline values for calculating relative features by division or subtraction. The size of the resulting feature matrix was 100 x 51, i.e. (20 subjects x 5 emotions) x 51 features.

Relative features were considered for two reasons:

1. The physiology of the body might change over time, i.e. the baseline might shift during the course of the experiment. Computing the features with respect to the preceding recovery phase alleviates this problem.

2. Relative features reduce the effect of interindividual differences in physiology.
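
A minimal sketch of how such baseline-relative features can be derived from the film-clip and recovery-phase features is given below. It is not the feature extraction code of this work; which features are normalized by subtraction and which by division is passed in as a mask, since that choice is feature-specific.

```python
import numpy as np

def relative_features(emotion_feats, recovery_feats, subtract_mask):
    """Turn absolute features into baseline-relative ones.

    emotion_feats, recovery_feats: (n_samples, n_features) arrays holding the
        features of each film clip and of its preceding recovery phase.
    subtract_mask: boolean mask marking features normalized by subtraction;
        all remaining features are normalized by division.
    """
    rel = np.empty_like(emotion_feats, dtype=float)
    rel[:, subtract_mask] = emotion_feats[:, subtract_mask] - recovery_feats[:, subtract_mask]
    rel[:, ~subtract_mask] = emotion_feats[:, ~subtract_mask] / recovery_feats[:, ~subtract_mask]
    return rel
```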

Detection of invalid feature values: The recorded data set contained corrupted segments due to technical problems and artifacts. The artifacts included: (1) the EDA signal reached the saturation limit of the amplifier during the course of the experiment; (2) R-peaks in the ECG signal were erroneously detected because of a high T-wave or due to motion artifacts; (3) blinking was not visible in the EOG signal due to dry skin.

The corrupted segments were identified and marked by visual inspection for (1) and (3) and automatically for (2): after R-peak detection, any RR-interval was removed if it deviated by more than 20% from the previous interval accepted as being true (refer to [2, 46] for the detailed algorithm). If more than 5 RR-intervals were removed during an experiment phase (i.e. during a film clip or during a recovery phase), all the corresponding ECG features were declared invalid.
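
A simplified sketch of this plausibility rule is shown below; the detailed algorithm is given in [2, 46]. The sketch assumes the first interval to be valid and treats the 20% tolerance and the limit of 5 removals as parameters.

```python
import numpy as np

def clean_rr_intervals(rr, tol=0.20, max_removed=5):
    """Remove RR-intervals deviating by more than 20% from the last accepted
    interval and report whether the phase is still considered valid
    (at most 5 intervals removed)."""
    accepted, removed = [rr[0]], 0          # first interval assumed valid (sketch assumption)
    for interval in rr[1:]:
        if abs(interval - accepted[-1]) <= tol * accepted[-1]:
            accepted.append(interval)
        else:
            removed += 1
    return np.array(accepted), removed <= max_removed
```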

This procedure resulted in 96% valid feature values for EOG, 78% for EDA, 64% for ECG and 100% for the remaining modalities (see Table 4.2). Only 47% of the feature vectors were totally artifact-free, i.e. contained valid features for all modalities. If no methods for handling missing feature values had been employed, we would have had to discard the remaining 53% of the data, even if only one of the six modalities contained an artifact.

4.2.3 Methods for Handling Missing Feature Values

The methods for handling missing feature values have been described in detail in Sec. 3.3. In this experiment, imputation was compared with the reduced-feature models technique. The reduced-feature models consisted of ensemble classifier systems employing majority or confidence voting.

For the ensemble classifier systems, a suitable underlying classifier was needed: it had to be able to handle unequal class distributions and to generate a probabilistic output that could be used as a confidence value. Since little data was available in our case, we opted for a simple classifier with few parameters to be estimated.


Table 4.2: Percentage of valid feature values per modality. Last row: Only 47% of the feature vectors were totally artifact-free, i.e. contained valid features for all modalities. Prediction of emotion is thus impossible during more than 50% of the time if missing feature values are not handled.

Signal modality       Valid feature values
ECG                   64%
EDA                   78%
Vertical EOG          96%
EMG                   100%
Respiration           100%
Finger temperature    100%
All                   47%

Linear and Quadratic Discriminant Analysis (LDA and QDA) with diagonal covariance matrix estimation fulfill all the mentioned conditions: Discriminant Analysis generates an estimate of the posterior probability for each potential class and selects the class that exhibits the largest posterior probability. The estimated posterior probability for the selected class was taken as confidence measure.
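
As an illustration of how such a posterior-based confidence can be obtained with diagonal covariance estimation, the sketch below implements a class-conditional Gaussian classifier with per-class diagonal covariances (the QDA-like case; pooling the variances across classes would give the LDA-like case). It is a sketch under these assumptions, not the classifier implementation used in this work.

```python
import numpy as np

class DiagonalGaussianClassifier:
    """Gaussian classifier with per-class diagonal covariance estimates.
    The maximum posterior probability doubles as the confidence value."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict_proba(self, X):
        # log-likelihood of a diagonal Gaussian plus log prior, per class
        log_lik = np.stack([
            -0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
            for m, v in zip(self.means_, self.vars_)], axis=1)
        log_post = log_lik + np.log(self.priors_)
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def predict_with_confidence(self, X):
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)], proba.max(axis=1)
```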

4.2.4 Results

4.2.4.1 Success of Emotion Elicitation

First, the success of emotion elicitation was tested. For all participants of the main experiment, the mean and the median values of arousal and valence were calculated based on the SAM questionnaires. The result is depicted in Figure 4.12. Empty symbols indicate the mean values, whereas filled symbols indicate the median values.

Overall, the resulting pattern was as expected, except for amusement, where we expected high instead of medium values for arousal. On the one hand, this might be due to a misunderstanding of the SAM arousal axis, where the "explosion" might have been interpreted as negative by the subjects. However, since we explained to the subjects that high arousal can arise for both negative (e.g. boiling with rage) and positive (e.g. jumping for joy) emotions, such a misinterpretation should not have taken place. On the other hand, the medium arousal rating of amusement might also be related to the observation stated in [37] that negative emotions are generally associated with stronger ANS responses than positive emotions. This would be consistent with a higher arousal rating for anger than for amusement.

4.2.4.2 Classification Results

This section describes the emotion recognition results. The first paragraph compares the accuracies of single-modality classifiers with the accuracy of a multimodal classifier. The second paragraph investigates the proposed methods for handling missing feature values.


Figure 4.12: Arousal and valence mean values (empty symbols) and median values (filled symbols), calculated based on the SAM questionnaires of the main experiment.

All the classifiers presented in the following were evaluated in a leave-one-person-out cross-validation.

Modality comparison: We first compared the accuracies of single-modality classifiers with the accuracy of a multimodal classifier that uses all six modalities6. To allow a fair comparison, the analysis was performed on a reduced, artifact-free data set, i.e. a data set containing valid feature values for all the modalities.

Figure 4.13 shows the results for discriminating 5 classes using LDA and QDA. When ranking the modalities according to the accuracies, the same sequence results for both LDA and QDA: Electrodermal Activity clearly represents the best single modality, followed by finger temperature. Using all the modalities, however, yielded the best result.

Methods for handling missing feature values: We will now present how imputation techniques and classifier fusion methods handle the case of missing feature values. All six signal modalities were used for this analysis.

The results for all 5 classes and for 4 classes (all except neutral) are presented in Table 4.3. The table compares two kinds of ensemble classifier systems, one employing majority voting and one employing confidence voting for classifier fusion,

6Note: The multimodal classifier used here is not an ensemble classifier system but a single classifier, which considers all features simultaneously. No imputation was necessary, because the modality comparison was performed on an artifact-free subset of the data.


Figure 4.13: Accuracies of single-modality classifiers compared to a classifier using all modalities. The multimodal classifier yields the best results: 53.2% for LDA and 51.1% for QDA.

Table 4.3: Comparison of ensemble classifier systems and single classifiers (with imputation) for LDA and QDA. Single classifiers consider all the features simultaneously. The results are presented for 5 and for 4 classes (neutral excluded). Classifier fusion yields considerable benefits.

Fusion Method      Classifier  Imputation  Accuracy (5 classes)  Accuracy (4 classes)
None               LDA         Yes         45.0%                 52.5%
Majority voting    LDA         No          41.0%                 51.3%
Confidence voting  LDA         No          47.0%                 58.8%
None               QDA         Yes         35.0%                 40.0%
Majority voting    QDA         No          49.0%                 56.3%
Confidence voting  QDA         No          48.0%                 56.3%


with a single classifier employing imputation. The single classifier considers all the features simultaneously. The comparison of the three methods is shown separately for the two classification methods (LDA and QDA) and for the 4-class and the 5-class problem.

For each comparison in Table 4.3, it can be seen that at least one of the ensemble classifier systems yielded a considerable benefit in comparison to the corresponding single classifier using all the features with imputation. In the best case (4 classes, QDA), the increase in accuracy amounted to 16.3%.

Another interesting observation is that confidence voting always performed better than majority voting for LDA, while the tendency was reversed for QDA. The best classifier fusion technique thus seems to depend on the underlying classifier.

The results for arousal and valence classification are shown in Table 4.4. The neutral emotion was omitted for this analysis. The results again indicate that ensemble classifier systems outperform single classifiers that use all the features (with imputation). Unlike most other work, e.g. [93], a higher accuracy was reached for valence than for arousal classification. This is probably due to the EMG measurement of the zygomaticus major, which is a reliable smiling detector.

A problem of our majority voting scheme was identified during our analyses: since the majority voting is initially performed without considering the confidence values, it can happen that several "weak" classifiers agree on a wrong decision and thereby dominate another classifier that might exhibit a high confidence value. A possibility to circumvent this problem would be to use the confidence values as weights during the majority voting process.

4.2.5 Discussion

In this work, we did not employ any feature selection. The maximum achieved accuracies (53.2% for 5 classes, 58.8% for 4 classes, 67.5% for arousal and 73.8% for valence) are therefore lower than reported in comparable studies, e.g. [34].

The problem of missing feature values is not limited to emotion recognition,

Table 4.4: Arousal and valence classification (2 classes): Comparison of ensemble classifier systems and single classifiers with imputation for LDA and QDA. Single classifiers consider all the features simultaneously.

Fusion Method      Classifier  Imputation  Accuracy (Arousal)  Accuracy (Valence)
None               LDA         Yes         62.5%               71.3%
Majority voting    LDA         No          67.5%               70.0%
Confidence voting  LDA         No          61.3%               73.8%
None               QDA         Yes         57.5%               60.0%
Majority voting    QDA         No          58.8%               68.8%
Confidence voting  QDA         No          60.0%               68.8%


but is relevant for any application where the physiological signals are recorded with unobtrusive sensors, e.g. included in seats or clothing. Our methods can thus be beneficial for many applications. However, the problem of missing feature values may be more important in the field of emotion recognition than in other fields, because the effort for recording emotional data is high: whereas e.g. data for activity recognition can relatively easily be collected from many people and repeatedly for the same person, this is not possible for emotional data: a single film can only be shown once to a single subject. Moreover, whereas the activity and thus the label of the recorded data (e.g. "opening door") is unambiguous, this is not necessarily the case for emotions. Often, post-experimental labeling by several observers is necessary, involving a significant effort. Clearly, with such laborious procedures, data loss due to artifacts is worse than if the lost data can easily be regained in a repeated recording session.

The question arises whether our recognition results are generalizable to other emotional situations. A limitation of our study is that we kept the order of the films constant for all subjects. We can therefore not exclude an effect of time on the physiological signals. On the other hand, the film clips were separated by recovery phases and the features were calculated in relation to the recovery phase that preceded each film clip. We therefore expect time effects to be minimal.

The normalization of the features by the preceding recovery phase might, however, cause another problem for generalization to a "real-world" application. It implies that various baselines would have to be measured throughout the day to calculate the relative features.

One might also object that our classifiers might have detected characteristics of the film, e.g. attention, rather than emotional effects. For this reason, we did not include eye movement characteristics in the analysis, which were expected to depend mainly on the "movement patterns" of the particular film rather than on emotion. Nevertheless, the objection cannot be refuted. However, a similar study has trained a classifier on the data of one film set and tested it on data recorded using a different film set (eliciting the same emotions) [110]. Their results showed good generalization of the classifier, which implies that the induced changes in physiology were not due to the characteristics of the films, but were a consequence of the experienced emotion.

4.2.6 Conclusion

In practical applications, data loss due to artifacts occurs frequently. Many of these artifacts can be detected automatically by plausibility analyses (e.g. unrealistic RR-intervals). Often the artifacts do not occur in all physiological signals simultaneously, and discarding all the feature vectors containing invalid feature values results in a substantial amount of unusable data. In our experiment, more than half of the data would have been lost if no strategy to handle missing feature values had been employed. With the proposed methods we were able to analyze 100% of the data.

Two methods for handling missing feature values in combination with two classifier fusion approaches have been investigated. Classifier fusion has been shown to increase the recognition accuracy.


A maximum increase in accuracy of 16.3% was observed when comparing an ensemble classifier system to a single classifier using imputation. Whether majority or confidence voting performed better depended on the underlying classifier.

Our experiment also showed that using all the modalities yielded better results than using a single modality. EDA was the best modality, followed by finger temperature and facial EMG of the zygomaticus major.


5 Emotion Recognition Using a Naturalistic Experiment Setup

In the previous chapter, standardized emotional stimuli were used to induce discrete, basic emotions. In this chapter, a more naturalistic approach was chosen that resembles a stressful situation of an office worker. Using a standardized but interactive laboratory protocol, mental and social stress was elicited in 33 subjects. Our goal was to distinguish stress from mild cognitive load using physiological and activity signals. After evaluating the signals separately and determining suitable features, the signal modalities were combined using classifier fusion. Finally, the generalization of the chosen features and classifiers to a "real-world" stress situation was tested in a small office experiment.

5.1 Introduction and Motivation

The problem of work-related stress is of growing interest, especially in Western countries. It is associated with several illnesses, such as cardiovascular diseases and musculoskeletal disorders (e.g. back pain) [74]. These diseases lead to absence from work, resulting in high economic costs [189]. If an electronic Personal Health Trainer could alert us to stressful situations during office work, it could help create a better "work-life balance".

Since we are aiming at an application in the office environment, a suitable experiment had to be found. Existing studies often use mental workload as the stress-eliciting factor, e.g. [122, 124, 215, 171]. This is an important stress component, especially for professions such as air traffic controllers [213] or electricity dispatchers [181]. However, there are additional contributing factors, including performance pressure and social threat by superiors or colleagues [43]. A stress test that incorporates all these factors was therefore employed. The work presented in this chapter was conducted in cooperation with the Institute of Psychology, Clinical Psychology and Psychotherapy of the University of Zurich.


The professional psychologists adapted an existing stress task [60], which belongs to the category of "social psychological methods" presented in Table 2.2. Social psychological methods employ deception and mask the real purpose of the experiment. This leads to a high ecological validity of the observed responses, which means that the observed responses are expected to be similar to those occurring outside of the laboratory in a similar situation [84].

The chapter is organized as follows: First, the stress experiment is described. Then, the different sensor modalities are evaluated separately regarding their potential for distinguishing between stress and cognitive load. Next, we investigate whether the combination of several sensor modalities by classifier fusion further improves the classification accuracy. The chapter concludes with a small experiment in the office, demonstrating the generalization of the trained classifiers to a "real-world" stress situation.

5.2 Stress Experiment1

The Montreal Imaging Stress Task (MIST, [60]) is a standardized computer-based task consisting of a stress and a control condition: the stress condition combines mental arithmetic problems under time pressure with social-evaluative threat, whereas the control condition consists of mental arithmetic with neither time pressure nor social evaluation, which is similar to relaxed working on a computer. The original test was created in order to evaluate effects of psychosocial stress on physiology and brain activation by functional magnetic resonance imaging (fMRI), and has been shown to induce a moderate stress response in salivary cortisol [60]. For our purposes, the test was slightly modified for use outside of an fMRI environment by professional psychologists in agreement with the inventors of the MIST. The procedure was authorized by Zurich's cantonal ethics commission.

For the study, 33 healthy male subjects (mean age: 24.06±4.56) were recruited. They received monetary compensation (80 CHF) for participating in two sessions of two hours. The participants were confronted with mental stress and social evaluative threat during one session (the stress condition) and with mild cognitive load during another session (the cognitive load condition). Subjects participated in one of the conditions first (randomly assigned) and in the other two weeks later. As a cover story, the subjects were told that they were taking part in an experiment investigating the relationship between cognitive performance and physiological characteristics.

Both conditions consisted of three experiment phases, denoted by "Baseline", "MIST" and "Recovery" in Fig. 5.1; only the MIST phase was different for the two conditions, as explained in the following.

The experimental schedule of the stress condition was:

1. Instructions and signing of consent form.

2. Habituation and Baseline phase (20 minutes): Questionnaires and reading magazines.

1based on [182] © 2010 IEEE


Figure 5.1: Experiment procedures for the stress (A) and the cognitive load condition (B), adapted from [182]. Subjects participated in one of the conditions first (randomly assigned) and in the other two weeks later. *: Synchronization events to synchronize data from independent recording systems. D: Debriefing

3. S1: Cognitive stress I (4 minutes): Mental arithmetic under time pressure and with performance evaluation (as described below).

4. S2: First feedback phase inducing mild social stress: Methodological questions.

5. S3: Cognitive stress II (4 minutes): Like S1 but with additional pressure not to fail again.

6. S4: Second feedback phase inducing strong social stress: Personal questions.

7. S5: Combination of strong social and "cognitive" stress (4 minutes): Like S3 but with the experiment leader observing the subject.

8. Recovery phase: Questionnaires and reading magazines. The recovery phase lasted for 1 hour. However, only the first 20 minutes were used for feature calculation and classification.

9. D: Debriefing including explanation of the true purpose of the experiment.

During the MIST phases S1, S3, and S5 of the stress condition, the subjects performed mental arithmetic tasks on a computer. The program adapted the difficulty and the time limit such that the subjects could only solve 45-50% of the arithmetic problems correctly. This corresponds to a stressful office situation in which the worker is presented with work requirements that do not match his capabilities.

Fig. 5.2b shows the screen display during the stress condition. In addition to the arithmetic task and the rotary dial for response submission, a time bar showed the remaining time for solving the task, and "Timeout" was displayed when the time had elapsed. When subjects gave an answer within the given time, the feedback "right" or "wrong" was displayed, depending on the correctness of the answer. A color bar showed a comparison between the individual performance and the performance of a simulated, representative comparison population.



Figure 5.2: (a) Subject equipped with wearable sensor systems: (1, 2, 5, and 6) acceleration sensors, (3) electrodermal activity (EDA) sensor, (4) pressure mat. The LifeShirt used to record ECG and breathing is hidden under the subject's own shirt. (b) MIST screen during stress condition [182]. © 2010 IEEE

The subjects were told that the experiment would not work if their performance did not reach the green range of the bar and that the experiment leader would supervise the whole experiment remotely. This represented a social evaluative threat. Additionally, after each block of mental arithmetic, the subjects received feedback from the experiment assistant or from the experiment leader regarding their performance. This is similar to a real-life situation in which a superior complains about one's productivity.

During the first feedback phase, only mild social stress was induced: the experiment assistant told the subject that his performance was not as good as expected and asked questions regarding methodological aspects of the task and the computer (e.g. "Is there a problem with the response submission on the keyboard?"), thus attributing the "problems" to the task itself and not to the capabilities of the subject.

During the second feedback phase, strong social stress was induced: the experiment assistant called her supervisor, the experiment leader, who came and asked questions referring to problematic personal characteristics (c.f. Fig. 5.13, upper left corner). The questions included: "Did you sleep badly?", "Did you ever have math problems at school?", "Did you drink alcohol or take drugs?". The experiment leader also pointed out the high cost involved in such studies and requested the subjects to make more effort. She stayed in the room behind the subject and observed him during the last mental arithmetic session. By comments like "Are you stressed?" she tried to increase the stress level even more. After two thirds of the time, the experiment leader acted resigned, told the experiment assistant to continue the mental arithmetic, and left the room.

For the cognitive load condition, the schedule of the MIST phases (C1-C5 in


Fig. 5.1) was analogous to the schedule described above. However, during the mental arithmetic phases, there was no time limitation for solving the tasks and no social evaluation. The color and time bars were thus not displayed. This corresponds to mild cognitive load. During the feedback phases, the experiment assistant initiated a friendly conversation by asking neutral questions like "How did you come to participate in this study?" and "What are you studying?".

Recorded signals: Figure 5.2a shows a test subject wearing all the employed sensors. The ECG and the breathing were recorded with the commercial LifeShirt System 200 [212], whereas the CONFORMat system [196], developed by Tekscan, was used to obtain the seat pressure distribution. Head, arm, and leg acceleration as well as the EDA were measured by custom wearable devices (cf. Sec. 3.1.7 and 3.1.2.1). The experiment phases were annotated by the experiment assistant.

Prior to evaluation, the data of the different recording systems needed to be synchronized. This was achieved by including several synchronization events in the experimental protocol, as indicated by the stars in Fig. 5.1.

5.3 Discriminating Stress from Cognitive Load Using EDA2

This section describes the analysis of the Electrodermal Activity (EDA), which is an adequate measure for sympathetic activation (cf. Sec. 3.1.2). The Emotion Board presented in Sec. 3.1.2.1 was used for recording the EDA at a sampling rate of 16 Hz with reusable dry Ag/AgCl electrodes from Brainclinics [29]. The electrodes were attached to the middle phalanges of the left index and middle fingers, see Fig. 5.2a and 3.6. In case of dry skin, conductive gel was applied. The skin conductivity of subjects with wet skin exceeded the measurement range of 5 µS of the original Emotion Board. A version of the Emotion Board with reduced gain was therefore used for subjects with wet skin.

5.3.1 Evaluation Methods

This section describes the employed evaluation methods, consisting of

1. Preprocessing of the raw data

2. Feature calculation for the experiment phases "Baseline", "MIST", "Recovery", and the concatenation of MIST and Recovery ("MIST+Recovery")

3. Classification of the conditions “stress” and “cognitive load”

5.3.1.1 Preprocessing

First, the two raw signals delivered by the Emotion Board (the EDA level and the EDA phase signal, cf. Fig. 3.7) were smoothed using a sliding-window mean filter with window sizes of 0.2 s and 1.2 s, respectively.

2based on [182] © 2010 IEEE


An example of a smoothed EDA level signal is shown in Fig. 5.3.

Next, the peaks in the EDA phase signal were detected as described in Sec. 3.1.2.2 and used to compute the peak heights and the instantaneous peak rates (see Fig. 5.4).
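
A minimal sketch of this preprocessing step, assuming the 16 Hz sampling rate stated above, is given below. It uses a generic SciPy peak detector in place of the detector of Sec. 3.1.2.2, assumes the instantaneous peak rate to be the inverse of the inter-peak interval, and treats the peak threshold as an illustrative placeholder.

```python
import numpy as np
from scipy.signal import find_peaks

FS = 16  # sampling rate of the Emotion Board in Hz

def smooth(signal, window_s):
    """Sliding-window mean filter; window length given in seconds."""
    win = max(1, int(round(window_s * FS)))
    return np.convolve(signal, np.ones(win) / win, mode="same")

def eda_peaks(phase_signal, min_height=0.0):
    """Detect peaks in the (smoothed) EDA phase signal and derive peak heights
    and instantaneous peak rates (here: 1 / inter-peak interval)."""
    idx, props = find_peaks(phase_signal, height=min_height)
    heights = props["peak_heights"]
    inst_rate = FS / np.diff(idx)       # peaks per second between consecutive peaks
    return heights, inst_rate

# Usage, following the window sizes given above:
# level = smooth(raw_level, 0.2); phase = smooth(raw_phase, 1.2)
```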

Figure 5.3: Smoothed EDA level signal, recorded during a stress condition of the experiment, adapted from [182].

Figure 5.4: Zoom into MIST phase S3 of Fig. 5.3: EDA level signal (below) and EDA phase signal (above). The red stars indicate the automatically detected peaks [182]. © 2010 IEEE

5.3.1.2 Descriptive Statistics for Finding Meaningful Features

Because our goal was to distinguish between stress and cognitive load, we needed to find features that are different for stress and for cognitive load.


To find such features, we visually explored the cumulative distributions of the EDA peak height and of the instantaneous peak rate. The cumulative distributions of the peak height during the different experimental phases "Baseline", "MIST", and "Recovery" are presented in Fig. 5.5 for all subjects. The curves indicate the percentage of peaks with a height smaller than the value q_ph on the x-axis. The curves for the MIST and the Recovery phases in the stress recording (red continuous and red dash-dotted lines) show values larger than all the other curves. This means that more small peaks occurred during the MIST and the Recovery phases of the stress condition than during all the other phases. The cumulative distributions of the peak height for different experimental phases thus carry information to distinguish these phases.

To describe the observed differences in the cumulative distributions, the following quantiles (cf. Sec. 3.1.2.3) of the peak height and of the instantaneous peak rate were calculated for each experiment phase ("Baseline", "MIST", "Recovery", and "MIST+Recovery"), experiment condition ("stress", "cognitive load") and all subjects: q(0.25), q(0.5), q(0.75), q(0.85), and q(0.95). The labels q_ph1 − q_ph5 in Fig. 5.5 indicate the five calculated quantiles for the MIST phase of the stress condition.
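
Computing these quantile features is straightforward; the sketch below (illustrative only, not the evaluation code of this study) derives them from the peak heights and instantaneous peak rates of one experiment phase.

```python
import numpy as np

QUANTILES = (0.25, 0.5, 0.75, 0.85, 0.95)

def quantile_features(peak_heights, inst_peak_rates, quantiles=QUANTILES):
    """Return q(0.25)...q(0.95) of the peak height and of the instantaneous
    peak rate for one experiment phase of one subject."""
    q_ph = np.quantile(peak_heights, quantiles)
    q_pr = np.quantile(inst_peak_rates, quantiles)
    return np.concatenate([q_ph, q_pr])
```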


Figure 5.5: Cumulative distribution of the peak height (in ADC values) for the three experiment phases (Baseline, MIST, Recovery) during stress and cognitive load for all subjects. The labels q_ph1 − q_ph5 indicate the quantiles q(0.25), q(0.5), q(0.75), q(0.85), and q(0.95), respectively, for the MIST phase of the stress condition [182]. © 2010 IEEE

5.3.1.3 Calculated EDA Features

The following 16 features were calculated for each of the three experiment phases ("Baseline", "MIST", and "Recovery"), as well as for the concatenation of "MIST" and "Recovery":

• EDA level: Mean, max, min, slope (calculated by a linear regression)

• EDA peak height: Mean, q(0.25), q(0.5), q(0.75), q(0.85), and q(0.95)


• Instantaneous peak rate: Mean, q(0.25), q(0.5), q(0.75), q(0.85), and q(0.95)

We denote the above features as non-relative features.

Baseline-related Features As described in Sec. 3.1.2, the EDA level and peaks can vary substantially between individuals, which may impair classification performance. The features calculated during the baseline (BL) were therefore used to calculate relative features to reduce the effects of inter-individual differences. We defined the following relative features (given for the MIST phase here):

• EDA level: Mean(MIST)-mean(BL), max(MIST)-mean(BL), min(MIST)-mean(BL), slope(MIST)/slope(BL)

• EDA peak height: Mean(MIST)/mean(BL), q(0.25)(MIST)/q(0.25)(BL), q(0.5)(MIST)/q(0.5)(BL), q(0.75)(MIST)/q(0.75)(BL), q(0.85)(MIST)/q(0.85)(BL), and q(0.95)(MIST)/q(0.95)(BL)

• Instantaneous peak rate: Mean(MIST)/mean(BL), q(0.25)(MIST)/q(0.25)(BL), q(0.5)(MIST)/q(0.5)(BL), q(0.75)(MIST)/q(0.75)(BL), q(0.85)(MIST)/q(0.85)(BL), and q(0.95)(MIST)/q(0.95)(BL)

The relative features for the recovery phase and for the concatenation of the recovery and the MIST phases are defined analogously.

5.3.1.4 Feature Selection and Classification

As a performance index for classification, we chose the accuracy of a "leave-one-person-out cross-validation". This means that a classifier is trained with the data of all subjects except one and tested on the data of the subject that was left out for training. This procedure is repeated for all subjects, and the accuracy is calculated by dividing the number of correctly classified test data samples by the total number of test data samples. The method is similar to the "leave-one-out cross-validation" [64].

The leave-one-person-out cross-validation was performed for all possible feature combinations, and the combination that yielded the highest accuracy was chosen for each classifier.
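
The following sketch illustrates this exhaustive search combined with the leave-one-person-out evaluation (feasible here because only 16 features per phase were used). It is illustrative only: the function names and the use of scikit-learn's LDA as the example classifier are assumptions, not the original evaluation scripts.

```python
import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lopo_accuracy(X, y, subjects, feat_idx, make_clf=LinearDiscriminantAnalysis):
    """Leave-one-person-out accuracy for a given feature subset."""
    feat_idx = list(feat_idx)
    correct = 0
    for s in np.unique(subjects):
        train, test = subjects != s, subjects == s
        clf = make_clf().fit(X[train][:, feat_idx], y[train])
        correct += np.sum(clf.predict(X[test][:, feat_idx]) == y[test])
    return correct / len(y)

def best_feature_combination(X, y, subjects):
    """Exhaustive search over all non-empty feature combinations; returns the
    combination with the highest leave-one-person-out accuracy."""
    n_feat = X.shape[1]
    best_acc, best_combo = 0.0, None
    for k in range(1, n_feat + 1):
        for combo in combinations(range(n_feat), k):
            acc = lopo_accuracy(X, y, subjects, combo)
            if acc > best_acc:
                best_acc, best_combo = acc, combo
    return best_acc, best_combo
```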

The following classification methods were applied:

• Linear Discriminant Analysis (LDA), c.f. Sec. 3.2.2.

• Support Vector Machines (SVM) with linear, quadratic, polynomial and radial basis function (rbf) kernels, c.f. Sec. 3.2.3.

• Nearest Class Center (NCC) algorithm, c.f. Sec. 3.2.1.
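The following sketch illustrates the evaluation procedure described above: a leave-one-person-out cross-validation combined with a complete search over all feature combinations. Off-the-shelf scikit-learn estimators are used here as stand-ins for the classifiers of Sec. 3.2, and the variable names X, y, and groups (feature matrix, class labels, subject identifiers) are illustrative:

    from itertools import combinations

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
    from sklearn.neighbors import NearestCentroid
    from sklearn.svm import SVC

    classifiers = {
        'LDA': LinearDiscriminantAnalysis(),
        'SVM (lin)': SVC(kernel='linear'),
        'SVM (quad)': SVC(kernel='poly', degree=2),
        'SVM (polyn)': SVC(kernel='poly', degree=3),
        'SVM (rbf)': SVC(kernel='rbf'),
        'NCC': NearestCentroid(),
    }

    def best_feature_combination(X, y, groups, clf):
        """Complete search over all feature subsets; the accuracy of each subset
        is estimated by a leave-one-person-out cross-validation."""
        logo = LeaveOneGroupOut()
        best_acc, best_subset = 0.0, None
        for k in range(1, X.shape[1] + 1):
            for subset in combinations(range(X.shape[1]), k):
                acc = cross_val_score(clf, X[:, subset], y,
                                      groups=groups, cv=logo).mean()
                if acc > best_acc:
                    best_acc, best_subset = acc, subset
        return best_acc, best_subset

With the 16 features used here, the complete search evaluates 2^16 - 1 = 65535 feature subsets per classifier, which is feasible but computationally expensive.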

Data from 32 subjects were evaluated, since one subject had to be excluded from the analysis due to strong artifacts in the EDA signals. The artifacts were caused by a poor fixation of the Emotion Board hardware to the arm, which resulted in a loose connection when the subject was moving.



Figure 5.6: (a) Cross-validation classification results for distinguishing between “stress” and “cognitive load” using relative features. For each classifier and experiment phase, the classification accuracy for the best feature combination is shown. The maximum accuracy of 81.3% was achieved by the SVM classifier with quadratic kernel when using the concatenation of the “MIST” and the “Recovery” phases for feature calculation [182] © 2010 IEEE. (b) Cross-validation classification results for distinguishing between “stress” and “cognitive load” using non-relative features. For each classifier and experiment phase, the classification accuracy for the best feature combination is shown. The maximum accuracy of 82.8% was achieved by the LDA when using the concatenation of the “MIST” and the “Recovery” phases for feature calculation [182] © 2010 IEEE. The feature combinations employed by the classifiers numbered with (1) - (10) are given in Table 5.1.


Table 5.1: Feature combinations for the classifiers that achieved the maximum accuracy for the phases “MIST”, “Recovery” and “MIST+Recovery” when using relative or non-relative features [182] © 2010 IEEE. The numbers (1) - (10) refer to the corresponding accuracies shown in Fig. 5.6. The last column gives the number of these ten classifiers whose best feature set contained the feature. p. h. stands for peak height and inst. p. r. for instantaneous peak rate.

Classifiers (1) - (10):
(1) SVM (quad.), MIST, relative          (6) SVM (polyn.), Recovery, relative
(2) SVM (polyn.), MIST, relative         (7) SVM (lin.), Recovery, non-relative
(3) SVM (lin.), MIST, non-relative       (8) SVM (polyn.), Recovery, non-relative
(4) SVM (rbf), MIST, non-relative        (9) SVM (quad.), MIST+Rec., relative
(5) LDA, Recovery, relative              (10) LDA, MIST+Rec., non-relative

Feature                    Nr. of times selected
Mean EDA level             4
Mean EDA p. r.             0
Mean EDA p. h.             6
q(0.25) p. h.              2
q(0.5) p. h.               9
q(0.75) p. h.              2
q(0.85) p. h.              6
q(0.95) p. h.              1
q(0.25) inst. p. r.        3
q(0.5) inst. p. r.         0
q(0.75) inst. p. r.        2
q(0.85) inst. p. r.        7
q(0.95) inst. p. r.        6
Max EDA level              2
Min EDA level              1
EDA level slope            5


5.3.2 Results

In this section, we present the classification results. Our aim was to distinguish between the two classes “stress” and “cognitive load”. As described above, we used six different sets of features (relative and non-relative features, both calculated for the phases “MIST”, “Recovery”, and “MIST+Recovery”) as input for six different classifiers (LDA, 4 SVMs with different kernel types, NCC). For each feature set and classifier, the best feature combination (i.e. the feature combination yielding the maximum classification accuracy) was determined using a “leave-one-person-out cross-validation”. The resulting 36 maximum classification accuracies are shown in Fig. 5.6. Fig. 5.6a and Fig. 5.6b contain the results for the relative features and for the non-relative features, respectively.

It can be observed that all classification methods performed in a similar range (all accuracies were between 73% and 82.8%). The maximum accuracy over all feature sets and classifiers was 82.8%. It was achieved by the LDA when using non-relative features calculated from the concatenation of the “MIST” and the “Recovery” phases as classifier input.

Table 5.1 shows the feature combinations for the classifiers that achieved the maximum accuracy for each of the phases “MIST”, “Recovery” and “MIST+Recovery”, when using either relative or non-relative features. The corresponding accuracies are marked by the numbers (1) - (10) in Fig. 5.6. The number in the last column of Table 5.1 indicates how many times a certain feature was chosen; this number is thus an indicator of how relevant the feature is for distinguishing between stress and cognitive load.

The feature selected most often was the 0.5-quantile (i.e. the median) of the peak height. The 0.85-quantile of the instantaneous peak rate was the second most frequently chosen feature. The third rank is shared by three features, two of which are quantiles. The mean and the median of the instantaneous peak rate were never chosen. Overall, the upper-range quantiles (i.e. q(0.85) and q(0.95)) were relevant for the instantaneous peak rate, whereas the mean and the median (q(0.5)) were relevant for the peak height.

The feature combination that reached the overall classification performance maximum (with LDA and “MIST+Recovery”) includes the mean EDA peak height, q(0.5), q(0.75), and q(0.85) of the peak height, q(0.85) and q(0.95) of the instantaneous peak rate, and the EDA slope.

5.3.3 Discussion

5.3.3.1 Experiment

To the best of our knowledge, the experiment is a novelty from two perspectives: First, we discriminated stress from cognitive load rather than distinguishing between a stress and a rest condition. Second, most studies addressing stress recognition [215, 122, 124, 171] have investigated a form of “mental stress” rather than the combination of mental and psychosocial stress employed in our work.


5.3.3.2 Features

EDA features were used to train various classifiers to discriminate stress from cognitive load. We found that the classifiers performed better with non-relative features than when using relative features. For a practical system, this means that the calculation of the Baseline features is not necessary and that no calibration procedure is needed.

A further conclusion drawn from the best feature combinations (see Table 5.1) is that quantiles are suitable features to distinguish stress from cognitive load: Four of the five features selected most often by the best classifiers were quantile features (cf. last column of Table 5.1).

An interesting and unexpected observation was that taking only the “MIST” phase for feature calculation and classification yielded slightly lower accuracies than taking the “Recovery” phase or the concatenation of both. The “Recovery” phase thus seems to contain useful information for distinguishing stress from mild cognitive load. The explanation could be twofold:

1. EDA parameters do not only change at the exact time of a stress-inducing event, but also due to the expectation of an aversive event [140], or due to current concerns, negative emotion, subjective arousal, or inner speech [139]. Since the debriefing took place only after the recovery phase (see Fig. 5.1), the subjects might still have been concerned about their insufficient performance during the experiment or might even have feared another confrontation with the experiment leader. These concerns might have influenced the EDA parameters measured during the recovery phase of the stress condition.

2. The frequency of skin conductance responses is influenced both by mental load and by motor activity due to fast typing (keystroke intervals < 300 ms) [107]. To select the answers to the mental arithmetic tasks quickly, the subjects had to manipulate the rotary dial by fast keyboard strokes. Therefore, the effects of stress on EDA might have been masked by the effects of fast typing during the “MIST” phase. Such a masking might have reduced the classification accuracy for the “MIST” phase with respect to the accuracy obtained with features from the Recovery phase.

5.3.3.3 Classification methods

Among all classifiers, the maximum accuracy of 82.8% was achieved by LDA when using the concatenation of the “MIST” and the “Recovery” phases for feature calculation. Before evaluating the data, we had expected that, due to the large inter-individual differences in EDA parameters, the more complex SVM classifiers, which can model non-linear class boundaries, would perform better than LDA. However, the superiority of LDA over more complex models is consistent with the findings of others: A similar study on emotion recognition from physiological signals [104] found LDA to perform better than more complex multilayer perceptrons and better than the non-linear k-nearest neighbor method.

The superiority of LDA over more complex models can be explained as follows:

1. Hastie [85] mentions a comparative study [136] where LDA was among the top three classifiers for 7 of the 22 compared classification tasks. Hastie attributes this observation to the bias-variance trade-off3. He argues that, even though in many cases the decision boundary might not be linear, one can nevertheless accept the bias resulting from a linear boundary, because the linear boundary can be estimated with smaller variance than a more complex boundary.

2. In our case, the available training data amounts to 31 subjects × 2 conditions for a leave-one-person-out cross-validation. For an LDA classifier employing 7 features, such as the LDA classifier that resulted in the maximum accuracy of 82.8%, a lower limit of 14 × 2 training samples is recommended in [156]. Our training data is thus sufficient for training the LDA. However, for more complex classification methods, the lower limit on the required training data increases [156], because more parameters need to be estimated from the training data. It is therefore likely that the amount of training data in our case was not sufficient to train a more complex SVM.

5.3.3.4 Limitation

A limitation of this study pertains to generalization: We have performed the cross-validation for all possible feature combinations and have selected the combination that yielded the highest classification accuracy for every classifier. In the following, we refer to the corresponding feature combination with the term “optimal feature set”, and to this kind of feature selection with “complete search feature selection”. A known problem of the optimal feature set is its potentially poor generalization to new data [157]. The reason is that the optimal feature set is highly optimized to the training data, in our case the data recorded from the 32 study participants. This is one of the reasons why later in this chapter we chose the features manually based on visualization and interpretation (c.f. Sec. 5.5).

5.3.4 Conclusion

We conclude from our experiment that the monitoring of EDA allows a discrimination between cognitive load and stress with an accuracy larger than 80% with leave-one-person-out cross-validation and a complete search feature selection; all investigated classification methods performed in a similar range, with the maximum of 82.8% achieved by LDA. We have found that the distributions of the EDA peak height and the instantaneous peak rate carry information about the stress level of a person. Quantiles are suitable features to describe these distributions. Moreover, when performing stress recognition based on EDA, not only the stress phase itself but also the recovery phase should be taken into account.

3The bias-variance trade-off describes the trade-off between a complex model, which fits the training data well and thus shows small bias but potentially large variance on the test data, and a less complex model, which fits the training data less well and thus shows small variance but potentially large bias on the test data. A trade-off between the two extremes will therefore lead to the best classification performance on the test data.


5.4 Discriminating Stress from Cognitive Load Using Pressure Data4

In this section we summarize our efforts to discriminate between stress and cognitive load using data from a pressure mat mounted on the seat. Self Organizing Maps (SOMs) and XY-fused Kohonen Networks were used for classification.

5.4.1 Pressure Mat

In order to record the pressure distribution we employed the CONFORMat pressure mat developed by Tekscan [196]. The mat consists of 1024 sensing elements, which cover a sensing area of 47.1 cm × 47.1 cm. The sensors can measure pressures up to 34 kPa.

The pressure mat was attached to the seat cushion and the backrest of an office chair. The pressure was recorded with a sampling frequency of 25 Hz. Only the pressure data of the seat cushion was analyzed.

5.4.2 Evaluation Methods

5.4.2.1 Feature Calculation

For each time frame of the pressure recording, the center of pressure (CoP) was calculated based on the 32 × 32 = 1024 sensor elements. In Fig. 5.7 two exemplary frames including the CoP for two sitting postures are shown. To characterize the movement of the subjects, the frequency spectrum of the absolute value of the CoP was calculated for the MIST phase of the stress condition and for the MIST phase of the cognitive load condition. Two exemplary spectra are shown in Fig. 5.8.

In a second step, the spectra were divided into 20 frequency bands of equal width and the mean value of each band was calculated. As a result, we obtained a 20-dimensional feature vector for each subject and condition (i.e. “stress” and “cognitive load”), respectively.
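A minimal sketch of this feature extraction, assuming frames is a sequence of 32 × 32 pressure matrices sampled at 25 Hz; the thesis does not specify the exact spectral estimator, so a plain FFT magnitude spectrum is used here for illustration:

    import numpy as np

    def center_of_pressure(frame):
        """Center of pressure of one 32 x 32 pressure frame (sensor indices
        weighted by the measured pressure values)."""
        rows, cols = np.indices(frame.shape)
        total = frame.sum()
        return np.array([(rows * frame).sum() / total,
                         (cols * frame).sum() / total])

    def pressure_band_features(frames, n_bands=20):
        """20-dimensional feature vector: mean magnitude of the spectrum of the
        absolute CoP value in 20 frequency bands of (nearly) equal width."""
        cop = np.array([center_of_pressure(f) for f in frames])
        cop_abs = np.linalg.norm(cop, axis=1)                   # absolute value of the CoP
        spectrum = np.abs(np.fft.rfft(cop_abs - cop_abs.mean()))
        bands = np.array_split(spectrum, n_bands)
        return np.array([band.mean() for band in bands])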

5.4.2.2 Classification using Self Organizing Maps

For classification, SOMs and XY-fused Kohonen Networks were employed. These methods have been explained in Section 3.2.4.
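As an illustration of the general idea, the sketch below trains a plain SOM with the classic on-line update rule and labels each map unit by a majority vote of the training samples mapped to it. It is a simplified stand-in, not the XY-fused Kohonen Network of Sec. 3.2.4, and all parameter values are illustrative:

    import numpy as np

    def train_som(X, grid=(7, 7), epochs=200, lr0=0.5, sigma0=2.0, seed=0):
        """Minimal rectangular SOM trained with the classic on-line update rule."""
        rng = np.random.default_rng(seed)
        n_units = grid[0] * grid[1]
        weights = X[rng.choice(len(X), size=n_units)].astype(float).copy()
        # grid coordinates of every unit, used by the neighbourhood function
        coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
        n_steps = epochs * len(X)
        for step in range(n_steps):
            x = X[rng.integers(len(X))]
            decay = np.exp(-step / n_steps)
            lr, sigma = lr0 * decay, sigma0 * decay
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))   # best matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))                  # neighbourhood weights
            weights += lr * h[:, None] * (x - weights)
        return weights

    def label_units(weights, X, y):
        """Label every unit with the majority class of the samples mapped to it
        (units that never become best matching unit keep the label -1)."""
        bmus = np.argmin(((X[:, None, :] - weights[None]) ** 2).sum(-1), axis=1)
        labels = np.full(len(weights), -1)
        for u in np.unique(bmus):
            classes, counts = np.unique(y[bmus == u], return_counts=True)
            labels[u] = classes[np.argmax(counts)]
        return labels

    def som_predict(weights, labels, X):
        bmus = np.argmin(((X[:, None, :] - weights[None]) ** 2).sum(-1), axis=1)
        return labels[bmus]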

5.4.3 Results

The 20-dimensional feature vectors served as input vectors x_stim for the SOM and the XY-fused Kohonen Network, respectively. Table 5.2 shows the resulting classification accuracies including the standard deviations for different SOM grid sizes. The highest accuracy of 73.75% for discriminating stress from cognitive load was achieved with a 7x7 XY-fused Kohonen Network.

4© 2010 IEEE. Portions reprinted, with permission, from [8]



Figure 5.7: Two frames recorded with a Tekscan pressure mat attached to the seat cushion of a chair for the two sitting postures “leaning right” and “leaning left” [8]. Based on the 32 × 32 = 1024 sensor elements, the center of pressure was computed (black circle). © 2010 IEEE

Table 5.2: Classification accuracies for distinguishing between the two classes “stress” and “cognitive load” using SOMs and XY-fused Kohonen Networks. The results were obtained by a leave-one-person-out cross-validation [8]. © 2010 IEEE

Method      Grid size   Classification Accuracy (%)
SOM         3x3         61.25 ± 3.48
SOM         5x5         68.39 ± 3.04
SOM         7x7         70.89 ± 2.92
XY-fused    3x3         66.25 ± 2.59
XY-fused    5x5         71.43 ± 2.23
XY-fused    7x7         73.75 ± 2.53

5.4.4 Conclusion

We have shown how SOMs and XY-fused Kohonen Networks can be employed to discriminate stress from cognitive load. In a leave-one-person-out cross-validation an overall discrimination accuracy of 73.75% could be achieved with an XY-fused Kohonen Network. This provides evidence that a person-independent discrimination of stress from cognitive load is feasible in an office scenario when using only seat pressure data.

Compared to the existing work presented in Sec. 2.5 and 2.6, 73.75% seems to be a rather low accuracy. However, one should keep in mind that our subjects had to concentrate on the computer screen most of the time, and it is therefore plausible that they remained relatively still on their chair. For a different scenario, e.g. stress detection during driving, we would expect better results because the movement of the steering wheel would also be visible in the seat pressure data.



Figure 5.8: Spectra of the absolute value of the center of pressure recorded during the MIST phases for one subject. (a) cognitive load condition; (b) stress condition. The spectra differ considerably between the cognitive load and the stress condition [8]. © 2010 IEEE

5.5 Discriminating Stress From Cognitive Load Using Acceleration Sensors

In this section we summarize our efforts to discriminate between stress and cognitive load using acceleration data. After determining suitable acceleration features, different classifiers were trained and compared with classifiers employing ECG and EDA features.

5.5.1 Acceleration Sensors

To record movement, the acceleration sensor nodes described in Sec. 3.1.7 were used. The sensor nodes were placed on the head, at the right wrist, and at the ankles, as shown in Fig. 5.2a. Data were sampled at 64 Hz.

5.5.2 Feature Calculation

5.5.2.1 Acceleration axes definition

The axes of all acceleration sensors were defined such that the x-axis pointed horizontally forward (i.e. in the viewing direction of the subject for head and feet, parallel to the arms when they were lying on the table, as depicted in Fig. 5.9). Consequently, the y-axis pointed to the left and the z-axis upward.

With this definition, the pitch and the roll angles described in Sec. 3.1.7 can be interpreted as described in Table 5.3. Note that the yaw angle, e.g. a turned head, cannot be measured with this sensor setup; additional magnetometers would be necessary for this task.

Figure 5.9: Definition of the axes of the acceleration sensors.
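The exact angle definitions are given in Sec. 3.1.7; as an illustration, one common convention estimates pitch and roll from the gravity components of the acceleration signal. The sign conventions below are assumptions and have to be matched to Table 5.3:

    import numpy as np

    def pitch_roll(ax, ay, az):
        """Pitch and roll in degrees, estimated from the gravity components of a
        3-axis accelerometer (x forward, y left, z up); valid for static or
        slowly moving sensors."""
        pitch = np.degrees(np.arctan2(ax, np.sqrt(ay ** 2 + az ** 2)))
        roll = np.degrees(np.arctan2(ay, az))
        return pitch, roll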

5.5.2.2 Selection of experiment phases to be evaluated

The body posture and the kind of movements are influenced by

1. The amount of stress: Since body movements are related to the level of arousal or the degree of intensity of an affective experience [69], we expected the subjects to gesture more during e.g. the confrontation with the experiment leader in experiment phase S4 than during the gentle conversation with the experiment assistant in phase C4 (refer to Sec. 5.2 and Fig. 5.1 for the description of the experiment phases).

2. The current activity of the subject: During the mental arithmetic phases (C1, C3, C5, S1, S3, S5), the hands were placed on the keyboard for typing, whereas the hands could move freely during the feedback phases (C2, C4, S2, S4).

We therefore divided the MIST phase into sub-phases. The expected amount of stress and its effect on movement for the sub-phases are shown in Table 5.4. Since the effect of stress on movement is expected to be strongest towards the end of the MIST, the phases S4 and S5 were chosen for feature calculation and compared with the corresponding phases C4 and C5, respectively.


Table 5.3: Movement description of pitch and roll angles

Head
  Pitch: Nodding; increasing values mean dropping the head forward.
  Roll: Bending the head to the left or right; increasing values mean bending right.

Right Hand (RH)
  Pitch: Lift or drop the hand; decreasing values mean lifting the hand up.
  Roll: Rotation of the wrist; increasing values mean a clockwise rotation (from the subject’s perspective, when the arm is parallel to the table, as depicted in Fig. 5.2a).

Right leg (RL)
  Pitch: Swinging the leg back and forth; increasing values mean swinging backward (i.e. reducing the knee angle).
  Roll: Twisting the knee to the right or left (e.g. when putting one leg over the other); increasing values mean a clockwise twist (from the subject’s perspective).

Left leg (LL)
  Pitch: As right leg. Roll: As right leg.

Table 5.4: Expected amount of movement during the different experiment phases, depending on the stress level and the activity. Expectations are based on the literature [137, 69]. SL: Stress level; BL: Baseline; Rec.: Recovery; He: Head movement; Ha: Hand movement; Le: Leg movement. Since the effect of stress on movement is expected to be strongest towards the end of the MIST, the phases C4, S4, C5, and S5 were chosen for acceleration feature calculation and subsequent stress classification.

                  Cognitive load condition        Stress condition
Activity          Phase   SL    He   Ha   Le      Phase   SL     He     Ha     Le
Reading           BL      -     *    *    *       BL      -      *      *      *
Typing            C1      *     -    *    -       S1      **     -      **     -
Conversation      C2      -     *    *    *       S2      **     **     **     *
Typing            C3      *     -    *    -       S3      ***    *      ***    -
Conversation      C4      -     *    *    *       S4      ****   ****   ****   *
Typing            C5      *     -    *    -       S5      ****   **     ***    -
Reading           Rec.    -     *    *    *       Rec.    **     *      *      *



Figure 5.10: Result of the subtractive clustering algorithm, followed by NCC: Subtractive clustering was applied to the 3-dimensional signal consisting of the filtered, normalized pitch and roll (recorded at the right leg during the phases S4 and S5), and the normalized time. Left: Clusters in the pitch-roll space; red crosses indicate cluster centers. Right: Pitch and roll signals over time with cluster membership indicated by different colors.

5.5.2.3 Acceleration features

The following features were calculated for the 3 acceleration axes ax, ay, az, for the absolute value ‖a‖, for the pitch, and for the roll: mean, std, min, max. This resulted in 24 features per acceleration sensor. Additionally, we were interested in the number of different postures of the head, the right hand and the feet (e.g. head up/down). To quantify the number of postures, two cluster features were calculated using the following procedure:

1. Smoothing and downsampling of the pitch and roll signals using a sliding-window median filter with a window size of 1 s and 0% overlap (resulting in a sampling rate of 1 Hz for the filtered signals). Downsampling was employed to reduce the computation time for the clustering in step 3.

2. Normalization of the filtered pitch and roll signals such that -180° and 180° were mapped to the values 0 and 1, respectively.

3. Subtractive clustering (see Sec. 3.2.5) applied to the 3-dimensional signal s composed of the filtered, normalized pitch, the filtered, normalized roll, and the normalized time. (Note: Oscillations between two clusters that were close to each other in the pitch-roll space were prevented by their separation in time.) The parameters needed for the subtractive clustering algorithm (cf. Sec. 3.2.5 and B.3) were: ra = [0.1, 0.1, 0.2], rb = 1.25 · ra, ε̄ = 0.5, and ε = 0.15. The subtractive clustering algorithm determines the number of clusters and the locations of the cluster centers.

4. Assign each data point of the 3-dimensional signal s to the closest cluster center using NCC (see Sec. 3.2.1). Fig. 5.10 shows an example of the final clustering result.

5. Compute the number of clusters and the number of changes from one cluster to another per time, and add these two “cluster features” to the set of 24 already calculated acceleration features (a minimal sketch of steps 1, 2, 4, and 5 is given after this list).
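A minimal sketch of steps 1, 2, 4, and 5 of this procedure, assuming the cluster centers have already been determined by subtractive clustering (the helper name subtractive_clustering is hypothetical; pitch and roll are given in degrees at 64 Hz):

    import numpy as np

    def preprocess(pitch, roll, fs=64):
        """Steps 1 and 2: 1 s median filter with 0% overlap (downsampling to 1 Hz)
        and normalization of the angles and the time axis to [0, 1]."""
        win = int(fs)
        n = len(pitch) // win
        p = np.array([np.median(pitch[i * win:(i + 1) * win]) for i in range(n)])
        r = np.array([np.median(roll[i * win:(i + 1) * win]) for i in range(n)])
        t = np.arange(n, dtype=float)
        return np.column_stack([(p + 180) / 360, (r + 180) / 360, t / t[-1]])

    def cluster_features(s, centers, duration_s):
        """Steps 4 and 5: assign every sample of s to its nearest cluster center
        (NCC) and derive the two cluster features."""
        labels = np.argmin(((s[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        n_clusters = len(centers)
        changes_per_s = np.count_nonzero(np.diff(labels)) / duration_s
        return n_clusters, changes_per_s

    # s = preprocess(pitch, roll)
    # centers = subtractive_clustering(s, ra=[0.1, 0.1, 0.2])  # hypothetical helper, cf. Sec. 3.2.5
    # n_clusters, changes_per_s = cluster_features(s, centers, duration_s=47)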

5.5.2.4 Window size for feature calculation

The procedure for feature calculation and classification employed in Sec. 5.5 and 5.6 is visualized in Fig. 5.11. Since movement is influenced by both the amount of stress and the activity of the subject (e.g. typing or conversation), feature calculation and classification were first performed separately for the feedback and for the mental arithmetic phases. Two kinds of stress classifiers were thus trained, one distinguishing the phase S4 from the phase C4, and one distinguishing the phase S5 from the phase C5. This section describes how we selected the parts of the phases S4, C4, S5, and C5 that were used for feature calculation.

The duration of the second feedback phase of the cognitive load condition was shorter than that of the corresponding feedback phase of the stress condition (C4: 86 s ± 38 s, S4: 260 s ± 33 s). This duration difference was due to the experiment setup: During the phase C4, the experiment assistant just asked a few questions, whereas during the stress phase S4, a longer discussion with the experiment leader about the subject’s performance took place. The phase S4 was therefore longer than the phase C4, which would represent a problem for the classification of stress (S4) and cognitive load (C4) if the entire phases S4 and C4 were used for the feature calculation. The problem arises from the fact that the time dimension was included in the calculation of the cluster features. The cluster features thus contain implicit information about the duration of the underlying acceleration signal. A systematic difference in the duration of the two compared phases C4 and S4 would therefore be modeled by any classifier. This would in turn lead to overestimated classification accuracies, because such a classifier would merely model the temporal differences in our experiment setup instead of the information that is related to stress.

To avoid this problem, we calculated the acceleration features for the stress and the cognitive load condition based on signal sections of equal duration. Because the shortest feedback phase C4 in the data set lasted 47 seconds, the section length was set to 47 seconds. The last 47 seconds of the second feedback phase (S4 or C4, respectively) and 47 seconds of the last mental arithmetic phase (S5 or C5, respectively) were thus used for feature calculation, feature selection and classification, see Fig. 5.11.

After investigating the feedback and the mental arithmetic phases separately, we were also interested in whether there is a “course over time” in the stress reaction that could be modeled by a suitable classifier. We therefore also extracted feature sets for the last 5 minutes of the MIST phase, which included the end of S4 and the entire S5 sub-phase for the stress condition, and the end of C4 and the entire C5 sub-phase for the cognitive load condition. More details are given in the classification Section 5.5.3.

5.5.2.5 Feature selection

This section describes why we had to employ feature selection and how it was done.


Figure 5.11: Procedure for feature calculation and classification employed in Sec. 5.5 and 5.6. Since movement is influenced by both the amount of stress and the activity of the subject (e.g. typing or conversation), feature calculation and classification were first performed separately for the feedback and for the mental arithmetic phases. Two kinds of stress classifiers were thus trained, one distinguishing the phase S4 from the phase C4 (violet, orange), and one distinguishing the phase S5 from the phase C5 (green, yellow).


Some acceleration sensors had slipped during the experiment due to loose attachment. These data sets were excluded from classification, which resulted in a total of 26 data sets from 26 subjects that could be used for classification. The number of training data samples available in the leave-one-person-out cross-validation thus amounted to 25 subjects × 2 conditions = 50. With 50 training data samples and 2 classes, not more than 12 features should be used for LDA classifier training (and even fewer for more complex methods) [156]. A selection of the 26 acceleration features was therefore necessary to reduce the number of features.

As mentioned in Sec. 5.3.3, computing the classification accuracy for all possible combinations of features and then selecting the best combination results in a high accuracy, but renders a generalization of the results to new data unlikely [157]. In this section we therefore avoided the complete search feature selection and selected the features manually: We first plotted the features using scatter plots and investigated whether the expected differences between cognitive load and stress, as hypothesized in Table 5.4, could be observed. The features that showed a visible, plausible difference between stress and cognitive load were then selected for subsequent classification.

5.5.2.6 Comparison with EDA and ECG Data

To compare stress recognition from acceleration signals with stress recognition from physiological data, we also calculated the following features for ECG and EDA:

• EDA features: From level: slope (calculated by a linear regression); from peak height: q(0.25), q(0.5); from instantaneous peak rate: q(0.25), q(0.5)5

• ECG features: From heart rate: max, mean, slope (calculated by a linear regression); HRV features: pnn50, sdnn, lf, hf

For the EDA peak detection, an amplitude criterion of 0.01 µS was applied (as recommended in [25] for off-line computer analysis), i.e. peaks with amplitudes smaller than 0.01 µS were discarded.

For calculating the lf and hf features, the RR-intervals were first detrended by the smoothness priors method proposed in [192], before calculating the spectrum of the RR-intervals.

For further details on feature calculation from EDA and ECG signals refer to Sec. 3.1.1 and 3.1.2.
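As an illustration of two of these features, the sketch below computes sdnn and pnn50 from a series of RR intervals and applies the 0.01 µS amplitude criterion to a list of detected EDA peak amplitudes. The variable names are illustrative, and the lf/hf computation with smoothness priors detrending [192] is omitted:

    import numpy as np

    def hrv_time_features(rr_ms):
        """SDNN and pNN50 from a series of RR intervals given in milliseconds."""
        successive_diffs = np.abs(np.diff(rr_ms))
        sdnn = np.std(rr_ms, ddof=1)
        pnn50 = 100.0 * np.count_nonzero(successive_diffs > 50.0) / len(successive_diffs)
        return sdnn, pnn50

    def valid_eda_peaks(peak_amplitudes, min_amplitude_uS=0.01):
        """Apply the 0.01 uS amplitude criterion: discard smaller EDA peaks."""
        peak_amplitudes = np.asarray(peak_amplitudes)
        return peak_amplitudes[peak_amplitudes >= min_amplitude_uS]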

5.5.3 Classification Methods

Our aim was to automatically distinguish between the two classes “stress” and “cognitive load”, using the classifiers presented in this section. As mentioned above, we first trained stress classifiers for the feedback phases and stress classifiers for the mental arithmetic phases, separately. In a second step, we considered the last five minutes of the MIST to train HMM classifiers. The last five minutes of the MIST include both feedback and mental arithmetic. HMM classifiers were employed because they model a “course over time”.

5Note: The EDA feature set mentioned here resulted in better classification accuracy than the EDA feature set mentioned in Sec. 5.3.2. This might be due to the fact that the evaluated experimental phases (S4, C4, S5, C5) were not the same here as in Sec. 5.3.2, where the entire MIST and Recovery phases were used for feature calculation.

The following classifiers were thus trained and tested:

• Discriminant analysis with 47 s windows: LDA and QDA classifiers were employed. Stress classifiers were trained separately for the feedback and for the mental arithmetic experiment phases, respectively, using features calculated from the 47 s signal excerpts depicted in Fig. 5.11. Two kinds of stress classifiers were thus trained, one distinguishing the phase S4 from the phase C4, and one distinguishing the phase S5 from the phase C5.

• Hidden Markov Models (HMM): A Hidden Markov Model (HMM) models a sequence of states using a sequence of observations (i.e. features) and is therefore suitable to describe a “course over time”. Fig. 5.12 shows how the feature sequences were calculated in our experiment: We extracted the last 5 minutes of the MIST phase, which includes the end of S4 and the entire S5 sub-phase for the stress condition, and the end of C4 and the entire C5 sub-phase for the cognitive load condition. A 47 s sliding window with 50% overlap was then used to calculate the feature sequences. Two linear HMMs with Gaussian outputs λs and λc were trained for the stress and for the cognitive load condition, respectively. After training the HMMs, the test feature sequences were classified into “stress” or “cognitive load” according to the log-likelihood that the test feature sequences were generated by one of the HMMs [155]: if log(P(test sequence | λs)) > log(P(test sequence | λc)) then “stress”, else “cognitive load”.

• Discriminant analysis with 5 min windows: To investigate whether the HMM modeling of a time course is beneficial regarding classification accuracy, we compared the classification accuracies of the HMMs with the accuracies of discriminant classifiers. To train the discriminant classifiers, we extracted the features from the last 5 minutes of the MIST phases.

All the presented classification results were calculated by a leave-one-person-out cross-validation.
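A minimal sketch of the HMM-based decision rule described above, using the GaussianHMM implementation of the hmmlearn package. For brevity, the default fully connected topology is used instead of the linear (left-to-right) topology of the thesis, which would additionally require constraining the transition matrix; seq is a 2-D array of sliding-window features:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_class_hmm(sequences, n_states):
        """Fit one Gaussian-output HMM on a list of 2-D feature sequences
        (one sequence of sliding-window features per training subject)."""
        X = np.concatenate(sequences)
        lengths = [len(s) for s in sequences]
        model = GaussianHMM(n_components=n_states, covariance_type='diag')
        return model.fit(X, lengths)

    def classify_sequence(seq, hmm_stress, hmm_cognitive):
        """Decision rule from the text: compare the log-likelihoods of the test
        sequence under the two class models."""
        return ('stress' if hmm_stress.score(seq) > hmm_cognitive.score(seq)
                else 'cognitive load')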

5.5.4 Results

This section describes the results of the feature visualization and interpretation, as well as the classification results.

5.5.4.1 Feature Visualization, Interpretation, and Selection

The 26 acceleration features that had been calculated from the 47 s excerpts of the stress and the cognitive load conditions depicted in Fig. 5.11 were first visually inspected and interpreted by scatter plots. A subset of features was then selected for subsequent classification based on two criteria:

1. A difference between stress and cognitive load was visible from the scatter plots.


Figure 5.12: Procedure for feature calculation and classification using hidden Markov models (HMM).

2. The observed differences were in line with our expectations (c.f. Table 5.4) and could be explained based on previous findings in the literature.

Furthermore, a Wilcoxon signed rank test was applied to each chosen feature to test whether there was a statistically significant (i.e. p < 0.05) difference between “stress” and “cognitive load”.
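The test can be reproduced, for example, with scipy, where stress and cognitive are the paired per-subject values of one feature (illustrative variable names):

    from scipy.stats import wilcoxon

    # stress and cognitive hold the paired per-subject values of one feature
    statistic, p_value = wilcoxon(stress, cognitive)
    significant = p_value < 0.05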

In the following, we will interpret the scatter plots of the features that showed visible differences between stress and cognitive load during the feedback phases (S4, C4) and/or during the mental arithmetic phases (S5, C5). To ease understanding, Fig. 5.13 shows four snapshots of the experiment phases S4, C4, S5, and C5.

Figure 5.13: Snapshots of the four experiment phases compared and evaluated in this section. The rows indicate the two experiment conditions, whereas the columns indicate the two kinds of phases (feedback and mental arithmetic). S4: Stress-inducing feedback and personal questions by the experiment leader. C4: Neutral feedback and questions by the experiment assistant. S5: Mental arithmetic under time pressure with the experiment leader observing and commenting. C5: Mental arithmetic without time limitation and without social evaluation. Note: Since we were interested in the differences between stress and cognitive load, the features calculated during the phase S4 were always compared with the features calculated during C4, whereas the features calculated during the phase S5 were always compared with the features calculated during C5.

The Head Figure 5.14 shows scatter plots of the selected head features for the feedback phases (S4 and C4) in the left column, and for the mental arithmetic phases (S5 and C5) in the right column. The plots are numbered with 1-6. Each feature is shown once per experiment phase. The two experiment conditions “stress” and “cognitive load” are marked with red crosses and blue x-es, respectively. We looked for differences between the two conditions for both the feedback (S4 vs. C4) and the mental arithmetic phases (S5 vs. C5). The following was observed for the feedback phases:

• Figure 5.14, plot 1: The mean pitch is significantly smaller for the stress than for the cognitive load condition (p = 0.0071), which implies that the subjects looked more upward during the stress-inducing feedback (S4) than during the neutral feedback (C4). This can be explained by the experiment setup (see Fig. 5.13): the experiment leader, who only appeared in the stress condition (S4), was standing, and the subjects looked up to her while discussing. During the corresponding phase of the cognitive load condition (C4), the subjects discussed with the experiment assistant, who was sitting next to the subject. The mean pitch is therefore mainly influenced by this specific experimental setting and not related to stress. It was therefore not used for subsequent classification.

• Figure 5.14, plot 3: The standard deviation of the pitch is significantly larger for the stress than for the cognitive load condition (p < 0.001). This means that there were more pronounced “nodding movements” during the stress-inducing feedback (S4) than during the neutral feedback (C4).

• Figure 5.14, plot 3: The standard deviation of the roll angle is larger for the stress than for the cognitive load condition (p = 0.0999), which indicates more sideways head movement during stress-inducing feedback (S4) than during neutral feedback (C4).

• Figure 5.14, plot 5: There were more changes between head posture clusters for stress-inducing feedback (S4) than for neutral feedback (C4) (p = 0.0888).

Altogether, we can conclude that there was more head movement during the stress-inducing than during the neutral feedback phase. This observation was consistent with our expectations: Because “anger” is characterized by more and faster movements than “neutral” (cf. Sec. 3.1.7 and [137]), we expected the subjects to become nervous, to start arguing with the experiment leader, and therefore to use more pronounced head movement during S4 than during C4.

The head features for the third mental arithmetic phases are shown on the right side of Fig. 5.14. The following is observed:

• Figure 5.14, plot 2: As expected, there is little difference between the mean pitch of the stress and of the cognitive load condition (p = 0.2619). For both conditions (phases S5 and C5), the subject’s gaze was focused on the screen.

• Figure 5.14, plot 4: Contrary to the feedback phase described above, the standard deviation of the pitch was smaller for the stress than for the cognitive load condition when considering the mental arithmetic phases (however, not significantly: p = 0.3795). This means that the “nodding movements” were slightly more pronounced during cognitive load (C5) than during stress (S5). We assume that the subjects were very concentrated on the arithmetic tasks during the stress condition and thus moved less than during the cognitive load condition.

• Figure 5.14, plot 4: There is no difference between the stress and the cognitive load condition for the standard deviation of the roll angle (p = 0.5621), agreeing with the assumption of little movement during mental arithmetic.

• Figure 5.14, plot 6: There were more changes between head posture clusters for stress than for cognitive load (p = 0.0541). This observation seems to contradict the two aforementioned observations for the standard deviations of pitch and roll. However, looking at the maximum absolute acceleration depicted in plot 2, a few “outliers” can be identified for the stress condition (result of the Wilcoxon signed rank test for the maximum absolute acceleration: p = 0.1970). Since a large value of the maximum absolute acceleration indicates a fast upward or lateral head movement, we argue that a subgroup of the subjects reacted to stress by “throwing” their head. Such a movement results in two cluster changes but, if it is very short, it might not be reflected in the standard deviation features, because the standard deviation was calculated for the entire 47 s signal excerpt.

Figure 5.14: Scatter plot of the head acceleration features calculated during the second feedback phase S4 and C4 (left column) and the third mental arithmetic phase S5 and C5 (right column). Red +: stress condition, one + per subject; blue x: cognitive load condition, one x per subject. Red triangle: median value for stress condition; black triangle: median value for cognitive load condition. abs = ‖a‖ with a = [ax, ay, az]: acceleration signal vector, g: gravity constant

Altogether, we conclude that the subjects generally moved their head less during the mental arithmetic than during the feedback phases, because they were focused on the screen. A subgroup of subjects might have reacted to stress by “throwing” their head, which would be consistent with the “very jerky movements” observed for anger in [137]. However, video annotation of the data would be necessary to confirm this.

For classification in Sec. 5.5.4.2, the following features were chosen based on the observations described above: std of pitch and roll, the maximum absolute value of the acceleration, and the number of changes between clusters.

The number of clusters depicted in Figure 5.14 (plots 5 and 6) showed equal median values for the stress and the cognitive load condition. This feature was therefore not selected for classification. The mean pitch was also excluded, because it was mainly influenced by the experimental setting (standing experiment leader).

The Right Hand Fig. 5.15 and 5.16 show the selected features that were calculated from the acceleration signal recorded at the right hand. For the visualization, one subject had to be excluded because the sensor was attached loosely and had slipped during the experiment. The mental arithmetic phases (S5 and C5) were considered first:

• Fig. 5.15, plot 2 shows the minimum and the maximum of the absolute value of the hand acceleration. If the hand is resting, the absolute value of the measured acceleration is constant and equals gravity (1 g = 9.81 m/s²). If there is an abrupt movement in any direction, the absolute acceleration either decreases or increases. There were higher maximal and lower minimal values during the stress (S5) than during the cognitive load condition (C5). The Wilcoxon signed rank test was significant with p = 0.0016 for the minimum and p = 0.0056 for the maximum absolute acceleration. A possible reason might be that some of the subjects threw their hands up in resignation during the stress condition.

• Fig. 5.15, plot 4: The standard deviation of the absolute value of the hand acceleration was significantly larger for stress than for cognitive load (p = 0.0013). This is plausible since the subjects were expected to type faster during stress (S5) than during cognitive load (C5).

• Figure 5.16, plot 2: The standard deviations of both pitch (p < 0.001) and roll (p = 0.0346) were significantly larger for stress (S5) than for cognitive load (C5), which also indicates generally more hand movement during stress.

We conclude that there was more overall hand movement during stress than during cognitive load, which is probably due to faster typing on the keyboard. Furthermore, there were more “abrupt gestures” during stress than during cognitive load, which might have been due to the subjects throwing their hands in resignation.


As for the head, this observation is in line with the “very jerky movements” found for anger in [137]. Video annotation would, however, be necessary for an exact analysis of the hand gestures.

The features for the feedback phase (S4 and C4) are shown on the left side of Fig. 5.15 and 5.16:

• Figure 5.15, plot 1: The minimum and maximum absolute acceleration show no difference between the stress-inducing feedback phase S4 and the neutral feedback phase C4 (p = 0.2699 for min, p = 0.3309 for max), indicating no difference in “abrupt gestures”.

• Figure 5.15, plot 3: The standard deviation of the absolute hand acceleration was larger for stress-inducing feedback (S4) than for neutral feedback (C4) (p = 0.0854), indicating more overall hand movement during stress.

• Figure 5.16, plot 1: The roll and pitch features do not differ between the phases S4 and C4 (p = 0.5372 for pitch std, and p = 0.8811 for roll std).

• Figure 5.16, plot 3: There were slightly more clusters identified during the stress phase S4 than during the phase C4 (p = 0.5911), as well as more changes between clusters (p = 0.1416). This means that more “hand postures” were adopted during stress-inducing feedback than during neutral feedback, and that more changes between “hand postures” occurred during stress-inducing feedback.

We conclude that there was slightly more overall hand movement during the stress-inducing feedback phase than during the corresponding neutral feedback phase C4 of the cognitive load condition.

All the features described above were used for classification in Sec. 5.5.4.2: std, min and max of absolute acceleration, std of pitch and roll, number of clusters, and number of cluster changes.

The Legs We looked at both legs jointly. Four subjects were excluded from the visualization due to a broken sensor. The most consistent differences between the stress and the cognitive load condition were found for the orientation of the legs, rather than for the amount of movement. The corresponding features are depicted in Fig. 5.17. The following is observed:

• Figure 5.17, plots 1 and 2: For both the mental arithmetic and the feedback phase, the mean pitch of the right leg is higher for the stress than for the cognitive load condition (mental arithmetic, plot 2: p = 0.0076; feedback, plot 1: p = 0.0076). An increasing pitch means that the leg is pulled backwards (i.e. the knee angle decreases). Consistently, the maximal pitch of the right leg is also higher for the stress than for the cognitive load condition (mental arithmetic, plot 2: p = 0.0614; feedback, plot 1: p = 0.0011).

• Figure 5.17, plots 5 and 6: The maximum pitch of the left leg is higher during the stress than during the cognitive load condition (feedback, plot 5: p = 0.0282; mental arithmetic, plot 6: p = 0.1329). This means that the left leg was also pulled more backwards during stress than during the cognitive load condition.



Figure 5.15: Scatter plot of the right hand (RH) acceleration features during the second feedback phase S4 and C4 (left column) and the third mental arithmetic phase S5 and C5 (right column). Red +: stress condition, one + per subject; blue x: cognitive load condition, one x per subject. Red triangle: median value for stress condition; black triangle: median value for cognitive load condition. abs = ‖a‖ with a = [ax, ay, az]: acceleration signal vector, g: gravity constant. For better visibility, plot 2 was zoomed in: one sample of each class lies outside the displayed range.



Figure 5.16: Scatter plot of the right hand (RH) acceleration features during the second feedback phase S4 and C4 (left column) and the third mental arithmetic phase S5 and C5 (right column). Red +: stress condition, one + per subject; blue x: cognitive load condition, one x per subject. Red triangle: median value for stress condition; black triangle: median value for cognitive load condition.


• Figure 5.17, plots 3 and 4: The mean and maximum roll of the right leg were significantly larger for stress than for cognitive load during the mental arithmetic phase (plot 4, p = 0.0045 for mean, p = 0.0314 for max). This indicates a “clockwise turn” of the leg, e.g. when crossing the legs. This behavior was less pronounced during the feedback phase (plot 3, p = 0.5235 for mean roll, p = 0.1567 for max roll).

We conclude that due to the stress, the subjects tended to tuck up their legs. Some also crossed the right leg. These observations might indicate a more tense body posture during the stress than during the cognitive load condition. This would be consistent with the “stiff movements” observed for anger or with the “contracted movements” observed for sadness in [137].

All the features described above were used for classification in Sec. 5.5.4.2: mean and max of pitch and roll from the right leg, max of pitch from the left leg.

5.5.4.2 Single Modality Classification Results

All the classifiers presented in the following were trained to distinguish between the two classes “stress” and “cognitive load”.

Discriminant analysis with 47s windows: The feedback (C4 vs. S4) and mental arithmetic (C5 vs. S5) phases were first investigated separately using features calculated from the 47 s signal excerpts depicted in Fig. 5.11.

We have investigated the following classifiers: conventional LDA (with full covariance matrix, simply referred to by the term “LDA” in the following), LDA with diagonal covariance matrix, conventional QDA (with full covariance matrix, simply referred to by the term “QDA” in the following), and QDA with diagonal covariance matrix. Among these four methods, the LDA with diagonal covariance matrix showed the best classification results (using leave-one-person-out cross-validation), as shown in Table 5.5. Generally, LDA performed better than the more complex QDA, which is consistent with the EDA classification results presented and discussed in Sec. 5.3.3.
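The four variants can be sketched as follows: scikit-learn provides LDA and QDA with full covariance matrices, GaussianNB is equivalent to a QDA with diagonal per-class covariances, and the small DiagonalLDA class below is a hand-rolled stand-in (assuming equal class priors) that uses a pooled diagonal covariance:

    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.naive_bayes import GaussianNB

    class DiagonalLDA:
        """Gaussian classifier with a pooled, diagonal covariance matrix
        (equal class priors assumed)."""
        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
            centered = np.concatenate([X[y == c] - m
                                       for c, m in zip(self.classes_, self.means_)])
            self.var_ = centered.var(axis=0)       # pooled within-class variances
            return self

        def predict(self, X):
            d2 = ((X[:, None, :] - self.means_[None]) ** 2 / self.var_).sum(-1)
            return self.classes_[np.argmin(d2, axis=1)]

    classifiers = {
        'LDA': LinearDiscriminantAnalysis(),           # full pooled covariance
        'LDA (dcm)': DiagonalLDA(),                    # diagonal pooled covariance
        'QDA': QuadraticDiscriminantAnalysis(),        # full per-class covariances
        'QDA (dcm)': GaussianNB(),                     # diagonal per-class covariances
    }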

For comparison with the acceleration classifiers, Table 5.5 also contains the classification results obtained with the ECG and EDA features introduced in Sec. 5.5.2.6. The ECG and EDA features were calculated during the same time windows as the acceleration features.

The accuracies of the discriminant classifiers trained with features from the 47 s excerpts of the feedback phases (C4 vs. S4) and the mental arithmetic phases (C5 vs. S5) are given in the corresponding 47 s window entries of Table 5.5. Even though the ECG classifier yielded the maximum accuracy (78.8% when considering the mental arithmetic phases C5 and S5), the classification results for the different limbs show that body language measured by acceleration also contains information about stress. For the feedback phase, the LDA classifier (with diagonal covariance matrix) using head features even reached a higher accuracy than all the EDA classifiers and the same accuracy as the best ECG classifier, i.e. 71.2%. Overall, the results obtained by the acceleration classifiers were comparable to those obtained with the pressure mat in Sec. 5.4.3.



Figure 5.17: Scatter plot of the leg acceleration features during the second feedback phase S4 and C4 (left column) and the third mental arithmetic phase S5 and C5 (right column). Red +: stress condition, one + per subject; blue x: cognitive load condition, one x per subject. Red triangle: median value for stress condition; black triangle: median value for cognitive load condition. RL: Right leg, LL: Left leg.


Table 5.5: Single modality classification results obtained with leave-one-person-out cross-validation, for different experiment phases and classifiers. Data from 26 subjects was used. For comparison, we added the results for ECG and EDA features calculated during the same time phases. For each modality and experiment phase, the best of the four discriminant classifiers corresponds to the maximum of the four listed accuracies. dcm: diagonal covariance matrix, * calculated by a linear regression, HR: heart rate, HRV: heart rate variability, p.h.: peak height, p.r.: peak rate

ECG (features: HR: max, mean, slope*; HRV: pnn50, sdnn, lf, hf)
  S4/C4 (Feedback, 47 s):     LDA (dcm) 69.2%, LDA 67.3%, QDA (dcm) 71.2%, QDA 63.5%
  S5/C5 (Arithmetic, 47 s):   LDA (dcm) 78.8%, LDA 75.0%, QDA (dcm) 75.0%, QDA 75.0%
  S4-S5/C4-C5 (5 min):        LDA (dcm) 76.9%, LDA 73.1%, QDA (dcm) 75.0%, QDA 76.9%
  S4-S5/C4-C5 (5 min, HMM):   3-state 73.1%, 5-state 75.0%

EDA (features: slope*; from p.h.: q(0.25), q(0.5); from inst. p.r.: q(0.25), q(0.5))
  S4/C4 (Feedback, 47 s):     LDA (dcm) 65.4%, LDA 69.2%, QDA (dcm) 61.5%, QDA 57.7%
  S5/C5 (Arithmetic, 47 s):   LDA (dcm) 69.2%, LDA 65.4%, QDA (dcm) 67.3%, QDA 65.4%
  S4-S5/C4-C5 (5 min):        LDA (dcm) 73.1%, LDA 65.4%, QDA (dcm) 63.5%, QDA 65.4%
  S4-S5/C4-C5 (5 min, HMM):   3-state 59.6%, 5-state 57.7%

Head (features: std(pitch), std(roll), # of cluster changes/time, max(abs))
  S4/C4 (Feedback, 47 s):     LDA (dcm) 71.2%, LDA 67.3%, QDA (dcm) 67.3%, QDA 61.5%
  S5/C5 (Arithmetic, 47 s):   LDA (dcm) 69.2%, LDA 71.2%, QDA (dcm) 65.4%, QDA 65.4%
  S4-S5/C4-C5 (5 min):        LDA (dcm) 59.6%, LDA 65.4%, QDA (dcm) 44.2%, QDA 44.2%
  S4-S5/C4-C5 (5 min, HMM):   3-state 65.4%, 5-state 65.4%

Right Hand (features: std(abs), min(abs), max(abs), std(pitch), std(roll), # of clusters, # of cluster changes/time)
  S4/C4 (Feedback, 47 s):     LDA (dcm) 59.6%, LDA 48.1%, QDA (dcm) 55.8%, QDA 51.9%
  S5/C5 (Arithmetic, 47 s):   LDA (dcm) 69.2%, LDA 59.6%, QDA (dcm) 61.5%, QDA 53.5%
  S4-S5/C4-C5 (5 min):        LDA (dcm) 65.4%, LDA 55.8%, QDA (dcm) 59.6%, QDA 53.8%
  S4-S5/C4-C5 (5 min, HMM):   3-state 65.4%, 5-state 77.9%

Right Leg (features: mean(pitch), max(pitch), mean(roll), max(roll))
  S4/C4 (Feedback, 47 s):     LDA (dcm) 65.4%, LDA 61.5%, QDA (dcm) 57.7%, QDA 59.6%
  S5/C5 (Arithmetic, 47 s):   LDA (dcm) 63.5%, LDA 67.3%, QDA (dcm) 57.7%, QDA 57.7%
  S4-S5/C4-C5 (5 min):        LDA (dcm) 65.4%, LDA 65.4%, QDA (dcm) 57.7%, QDA 55.8%
  S4-S5/C4-C5 (5 min, HMM):   3-state 59.6%, 5-state 63.5%

Left Leg (features: max(pitch))
  S4/C4 (Feedback, 47 s):     LDA (dcm) 67.3%, LDA 67.3%, QDA (dcm) 65.4%, QDA 65.4%
  S5/C5 (Arithmetic, 47 s):   LDA (dcm) 57.7%, LDA 57.7%, QDA (dcm) 55.8%, QDA 55.8%
  S4-S5/C4-C5 (5 min):        LDA (dcm) 57.7%, LDA 57.7%, QDA (dcm) 63.5%, QDA 63.5%
  S4-S5/C4-C5 (5 min, HMM):   3-state 61.5%, 5-state 57.7%


It may come as a surprise that the EDA classification accuracies obtained based on the 47 s time windows (Table 5.5) were considerably lower than the EDA accuracy of 82.8% obtained in Sec. 5.3.2 based on the entire MIST and Recovery phases with a complete search feature selection. The following could be the reasons:

1. The 47 s time windows might be too short for distinguishing “stress” from “cognitive load” using EDA. The EDA signal in Fig. 5.3 shows that the EDA changes range from the end of the Baseline to about the middle of the Recovery phase. With the 47 s time windows, we only capture a small fraction of these changes. Moreover, the number of expected NS.SCRs during a resting phase is 3-7 responses/min, and 10-15 responses/min during an activation phase [25]. This means that the EDA features describing the distribution of the peak height and peak rate are based on only a few EDA peaks when using 47 s time windows. This leads to high variance in the features.

2. In this section, we did not employ a complete search for the optimal feature set, but chose the features manually. It might well be that a classification accuracy around 80% could be obtained for EDA if the optimal feature set for the 47 s EDA signals were calculated and used to train the classifiers. However, such classifiers tend to show poor generalization to new data [157]. Since we planned to test our stress classifiers on another data set (recorded in the office, see Sec. 5.7), we were more interested in good generalization than in the highest obtainable classification accuracy for the MIST experiment.

Results of HMM classification in comparison to LDA: The last column of Table 5.5 shows the results for HMM classification using a 3-state and a 5-state HMM.

The 5-state HMM classifiers tend to yield higher accuracies than the 3-state HMM classifiers. The achieved accuracies are, however, comparable to the LDA accuracies, with two exceptions:

1. The accuracy of the 5-state HMM classifier for the right hand sensor was more than 10% better than any of the corresponding discriminant classifiers in Column 5. We attribute this higher accuracy of the HMM to the fact that the subjects were typing faster during the stress phase S5 than during the cognitive load phase C5.

2. For the EDA data, the accuracies of the HMM classifiers were clearly worse than the accuracies of the discriminant classifiers. Two mechanisms could be responsible for this result:

(a) Little information might be contained in the sequence of EDA features, which might be due to differences in the individual stress reaction of different subjects.

(b) As already mentioned in the previous paragraph, the EDA changes range from Baseline to Recovery. The 5 minutes might thus still be too short to capture the EDA changes relevant to stress. Moreover, since the size of the sliding window used to calculate the feature sequences for the HMM is 47s, the EDA peak height and peak rate features are calculated based on a few peaks and thus the variance in the corresponding features is high.


Because the HMM generally did not perform better than the discriminant classifiers, we conclude that the information related to stress is not contained in the sequence of states, with the exception of the right hand sensor.
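The HMM classification scheme described above can be reproduced with a few lines of code. The sketch below, written with the third-party hmmlearn package, fits one Gaussian-emission HMM per class on sequences of window features and labels a test sequence with the class whose model yields the higher log-likelihood. The data loading, function names, and the diagonal emission covariance are assumptions; only the 3- and 5-state configurations are taken from the text.

```python
# Minimal sketch of per-class HMM sequence classification (data loading assumed).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_class_hmm(sequences, n_states=5):
    """Fit one Gaussian-emission HMM on all training sequences of one class.

    sequences: list of arrays, each of shape (n_windows, n_features),
               e.g. acceleration features from consecutive 47 s windows.
    """
    X = np.vstack(sequences)                 # concatenate all sequences
    lengths = [len(s) for s in sequences]    # hmmlearn needs the sequence lengths
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)
    return model

def classify(sequence, hmm_stress, hmm_cognitive):
    """Assign the label whose HMM explains the 2-D feature sequence best."""
    return ("stress" if hmm_stress.score(sequence) > hmm_cognitive.score(sequence)
            else "cognitive load")
```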

5.5.5 Conclusion

We have shown that body language measured by acceleration sensors contains information about stress. However, the acceleration signals can be biased by activities (e.g. typing or having a conversation). We therefore performed the analysis separately for different activities.

The suitable features depended on the sensor placement. We have identified specific stress-related acceleration features for the head, the right hand and the feet.

Due to the low amount of training data, a feature selection had to be employed. We have selected the features manually, based on two criteria:

1. A difference between stress and cognitive load was visible from the scatter plots.

2. The observed differences were in line with our expectations and could be explained based on previous findings in the literature.

In this way, we could also identify a feature (the mean pitch of the head) which clearly did not differ because of the stress experienced by the subject, but due to the specific experiment setup (talking to a standing person in the stress condition vs. talking to a sitting person in the cognitive load condition). Using the mean pitch of the head to train an LDA classifier (with diagonal covariance matrix) would have resulted in a slightly higher classification accuracy (73.1% instead of 71.2%), but this classifier would probably have shown poor performance on new data recorded during a different experiment.

Comparing the four acceleration sensor placements, a maximum accuracy of 71.2% was achieved using features from the head for the LDA (with diagonal covariance matrix), and 77.9% with a 5-state HMM using features of the right hand. The high accuracy achieved by the HMM for the right hand can be explained by the capability of the HMM to model courses over time, e.g. faster typing.

When considering all modalities (i.e. acceleration, ECG, and EDA), simple LDA with diagonal covariance matrix was competitive with the more complex QDA and HMM. LDA (with diagonal covariance matrix) achieved the overall maximum accuracy of 78.8% when using ECG features extracted from the mental arithmetic phases S5 and C5.

For the future, it would be interesting to identify specific expressive gestures and postures, like throwing up the hands or different types of crossed legs, to further refine stress recognition based on acceleration data.

5.6 Multimodal Classification

So far, we have used single modalities (e.g. EDA or acceleration) to discriminate stress from cognitive load. In this section, we will combine different modalities for classification. As in Section 4.2, we use majority and confidence voting for classifier fusion to create a multimodal classifier.

5.6.1 Evaluation Methods

The employed features were the same as in the previous section and are listed in Table 5.5. Because the acceleration features are influenced by the activity of the subject (e.g. typing), the features and classifiers were again calculated separately for 47s excerpts of the second feedback phase (S4 and C4, respectively) and of the last mental arithmetic phase (S5 and C5, respectively, see Fig. 5.1). Data of 26 subjects were included in the analysis.

Separate classifiers were first generated for the following modalities: ECG, EDA, head, right hand, right leg and left leg. Since LDA with diagonal covariance matrices had yielded the best classification results in Sec. 5.5.4.2, LDA classifiers with diagonal covariance matrices were used here. Majority and confidence voting, as described in Sec. 3.3, were then employed for classifier fusion to create multimodal classifiers. The multimodal classifiers are shown as orange and yellow boxes in Fig. 5.11.
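To make the two fusion rules concrete, the sketch below combines the outputs of the single-modality classifiers: majority voting counts hard class labels, while confidence voting sums the class posteriors and picks the class with the largest total. The example arrays and the tie-breaking behavior are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def majority_vote(labels):
    """labels: hard decisions (0/1) of the single-modality classifiers."""
    votes = np.bincount(labels, minlength=2)
    return int(np.argmax(votes))  # ties resolve to the lower class index here

def confidence_vote(posteriors):
    """posteriors: array of shape (n_classifiers, n_classes) with class probabilities."""
    return int(np.argmax(posteriors.sum(axis=0)))

# Illustrative example with six modality classifiers
# (ECG, EDA, head, right hand, right leg, left leg):
hard_labels = np.array([1, 0, 1, 1, 0, 1])
probs = np.array([[0.3, 0.7], [0.6, 0.4], [0.45, 0.55],
                  [0.2, 0.8], [0.7, 0.3], [0.45, 0.55]])
print(majority_vote(hard_labels), confidence_vote(probs))   # both predict class 1
```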

5.6.2 Results

The results for the single modalities were shown in Table 5.5, whereas the results for multimodal classification are shown in Table 5.6. Only the combinations that yielded higher accuracies than the best single classifier are reported. The best single classifier for the second feedback phase (S4 vs. C4) resulted in 71.2% accuracy (using head features), whereas the best single classifier for the last mental arithmetic phase (S5 vs. C5) resulted in 78.8% (using ECG features). For both phases, higher accuracies were reached when employing classifier fusion. This implies that different modalities convey different "kinds" of information about stress and therefore complement each other.

For the mental arithmetic phase (S5 vs. C5), the maximum accuracy of 84.6% was achieved when combining all 6 modalities using majority voting, whereas the best classifier for the feedback phase (S4 vs. C4) included data from the ECG, the head and the left leg and achieved 78.8%.

5.6.3 Conclusion

Different modalities convey different "kinds" of information about stress and therefore complement each other. Thus, the classification accuracy of the multimodal classifier was about 6% higher than the accuracy of the best single modality classifier (84.6% vs. 78.8% for the mental arithmetic phase).

Table 5.6: Cross-validation results of classifier fusion (LDA with diagonal covariance matrices). Only the combinations which yielded higher accuracies than the best single classifier are shown. Note that majority (maj.) and confidence voting (conf.) yield the same result for the combination of 2 modalities. RH = Right hand, RL = Right leg, LL = Left leg, Acc. = accuracy. Classifier fusion increases recognition accuracy. Best single classifiers: 71.2% (head) for the feedback phase (S4/C4) and 78.8% (ECG) for the mental arithmetic phase (S5/C5).

Modalities | Feedback (S4/C4): Acc. | Δ% | Mental arithmetic (S5/C5): Acc. | Δ%
ECG+EDA | 75% | +3.8% | |
ECG+Head | | | 82.7% | +3.9%
ECG+RL | | | 80.8% | +2%
ECG+EDA+Head | 75% (maj) | +3.8% | |
ECG+EDA+RL | 73.1% (conf) | +1.8% | |
ECG+EDA+RL | | | 82.7% (maj) | +3.9%
ECG+Head+RL | 76.9% (maj) | +5.7% | |
ECG+Head+LL | 78.8% (maj) | +7.6% | |
ECG+Head+LL | | | 82.7% (conf) | +3.9%
ECG+RH+RL | | | 80.8% (conf) | +2%
ECG+RH+LL | | | 80.8% (maj) | +2%
ECG+RL+LL | | | 80.8% (conf) | +2%
EDA+Head+RH | 73.1% (maj) | +1.8% | |
EDA+Head+RL | 76.9% (maj) | +5.7% | |
EDA+Head+LL | 76.9% (maj) | +5.7% | |
EDA+RL+LL | 75% (maj) | +3.8% | |
ECG+EDA+Head+RL | 73.1% (maj) | | |
ECG+EDA+RH+RL | | | 82.7% (maj) | +3.9%
ECG+EDA+RH+LL | | | 82.7% (maj) | +3.9%
ECG+EDA+RL+LL | 73.1% (conf) | +1.8% | |
ECG+EDA+RL+LL | 73.1% (maj) | +1.8% | 80.8% (maj) | +2%
ECG+Head+RH+RL | | | 80.8% (maj) | +2%
ECG+Head+RH+LL | | | 80.8% (maj) | +2%
ECG+Head+RL+LL | 75% (maj) | +3.8% | |
ECG+RH+RL+LL | | | 80.8% | +2%
EDA+Head+RH+RL | 73.1% (maj) | +1.8% | |
EDA+Head+RL+LL | 76.9% (maj) | +5.7% | |
EDA+RH+RL+LL | 73.1% (maj) | +1.8% | |
ECG+EDA+Head+RH+RL | 73.1% (maj) | +1.8% | 82.7% (maj) | +3.9%
ECG+EDA+Head+RH+LL | 75% (maj) | +3.8% | 80.8% (maj) | +2%
ECG+EDA+Head+RL+LL | 75% (maj) | +3.8% | |
ECG+Head+RH+RL+LL | 75% (maj) | +3.8% | 80.8% (maj) | +2%
EDA+Head+RH+RL+LL | 75% (maj) | +3.8% | |
All | 76.9% (maj) | +5.7% | 84.6% (maj) | +5.8%

5.7 Generalization to a "Real-Life" Office Experiment

So far, all the analyses for stress recognition were performed on the data of the laboratory MIST experiment. Since our ultimate goal is to develop a "personal stress detector", one could ask whether the results obtained so far are transferable to a "real-life" office stress situation. Testing our classifiers on a data set of a different experiment is important to verify that the trained classifiers have been trained on characteristics of stress and cognitive load rather than on characteristics of the specific experiment setup (e.g. a difference in the duration of the compared experiment phases).

In the MIST experiment, deception was employed and the real purpose of the experiment was masked for the subjects [60]. Due to this procedure, the observed reactions of the subjects are expected to be similar to those occurring outside of the laboratory in a similar situation [84]. A stress classifier that was trained on the MIST data should therefore yield similar accuracies as for the MIST when tested on data recorded in a similar stress situation.

Moreover, the MIST experiment combines several stress factors, i.e. mental workload (1) under time (2) and performance pressure (3), and social-evaluative threat (4). These stress factors are relevant in "real life":

1. Mental workload is a predominant stress factor for professions such as air traffic controllers [213] or electricity dispatchers [181].

2. Time pressure: According to a survey of the European Union, work intensity (i.e. working at high speed or working to tight deadlines) has increased in most European countries over the last two decades, which had a negative impact on workers' well-being [75].

3. Performance pressure: A study among general practitioners, lawyers, engineers, teachers, nurses and life insurance personnel showed that performance pressure and work-family conflicts were perceived to be the most stressful aspects of work [43]. The reported intensity of stress depended on the profession: teachers reported most stress, followed by lawyers, nurses, engineers, insurance agents and medical doctors.

4. Social-evaluative threat: In [43], nurses, teachers, and engineers reported high levels of stress due to poor social relations at the workplace (superiors and colleagues).

To test our classifiers on data obtained from a real-life experiment, we designed a small office experiment with three subjects. Classifiers were trained on the data of the laboratory MIST experiment and applied to the data of the three subjects participating in the office experiment.

5.7.1 Office Experiment

Three ETH PhD students (all male, age: 26), who had started their PhD studies recently and were unaware of the investigation on stress patterns in the laboratory, were asked to participate in an office experiment about methodological issues of physiological recording in real life. They were told that physiological signals and questionnaires should be recorded during normal office work. The same measurement setup as for the laboratory stress experiment (see Fig. 5.1) was used, except that the commercial "Alive ECG and Activity Monitor" from Alive Technologies [195] was used to collect ECG data. The ECG was recorded with disposable Ag/AgCl electrodes placed on the chest. The experiment took place in the subject's own office room. The procedure was as follows (cf. Fig. 5.18):

• O1: Baseline I: Reading articles of own choice (e.g. computer magazine, GEO) during 10 minutes to "calm down"; afterwards, reading a predefined article from a computer magazine during 10 minutes.

• O2: Control condition: Discussion with the experiment leader about the article. To avoid stress, the subjects were told that the content of the discussion was of no importance since we were just interested in the physiological activation due to speaking.

• O3: Recovery I and Baseline II: Reading articles of own choice.

• O4: Stress condition: The experiment was interrupted by the professor (i.e. the supervisor of the PhD), who claimed he had to talk very urgently to the subject. The experiment leader finally agreed to continue the experiment after the conversation and left the room. The professor had prepared individual stress-inducing topics for each of the three subjects. Two subjects had recently finished their master thesis at our institute. The professor therefore discussed their performance and the mark, thereby asking more critical questions than usual. The third subject was informed that his envisioned project with a company could not take place (which was fortunately a lie).

• O5: Recovery II: The experiment leader entered the room and told the subject that the experiment would take a bit longer than originally planned. The subject again read articles of his own choice.

After the experiment, the subjects were debriefed by explaining the real aim of the experiment.

5.7.2 Evaluation Methods

LDA classifiers with diagonal covariance matrices were trained on the data of the laboratory MIST experiment (26 subjects) and tested on the data of the three subjects participating in the office experiment. The procedure for feature calculation and classification is illustrated in Fig. 5.18.

The features used for this analysis were the same as in Sec. 5.5, Table 5.5. Since the training features from the laboratory MIST experiment had been calculated using 47s signal excerpts, the features for the office experiment were also calculated on 47s extracts from the control (O2) and the stress condition (O4).
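A minimal sketch of the cross-dataset evaluation is given below: an LDA classifier with a pooled, diagonal covariance estimate is fitted on the MIST window features and applied unchanged to the office features. The class DiagLDA and the placeholder variable names are assumptions (a diagonal-covariance LDA is not directly available in common libraries), and equal class priors are assumed.

```python
import numpy as np

class DiagLDA:
    """LDA with a pooled, diagonal covariance estimate (equal class priors assumed)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # simple pooled estimate: average of the per-class variances, one value per feature
        self.var_ = np.mean([X[y == c].var(axis=0) for c in self.classes_], axis=0)
        return self

    def predict(self, X):
        # Gaussian log-density with a shared diagonal covariance -> linear decision rule
        scores = [-0.5 * np.sum((X - m) ** 2 / self.var_, axis=1) for m in self.means_]
        return self.classes_[np.argmax(scores, axis=0)]

# Hypothetical usage: train on MIST features, test on office features.
# X_mist, y_mist = ...        # 47 s window features, stress / cognitive-load labels
# X_office, y_office = ...    # control (O2) and stress (O4) windows of the office study
# clf = DiagLDA().fit(X_mist, y_mist)
# accuracy = np.mean(clf.predict(X_office) == y_office)
```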

5.7.3 Results

For each signal modality, two LDA classifiers (with diagonal covariance matrices) were trained, one with the data of the second feedback phase (C4 and S4 of the MIST experiment, respectively) and another one with the data of the mental arithmetic phase (C5 and S5 of the MIST experiment, respectively). The data of the office experiment was used to test the two classifiers.


Figure 5.18: Procedure for feature calculation and classification employed to test the generalization abilities of the classifiers trained with the MIST laboratory data in Sec. 5.5. For each signal modality, two LDA classifiers were trained - one with the data of the feedback phases C4 and S4 of the MIST, and another with the data of the mental arithmetic phases C5 and S5 of the MIST. The data of the office experiment was used to test the trained classifiers.


Table 5.7: Stress classification results (2 classes) for LDA classifiers trained on the MIST experiment data and tested on the office experiment data.

Modality | Features | Training data from Feedback (47s) | Training data from Mental arithmetics (47s)
ECG | From heart rate: max, mean, slope (calculated by a linear regression); HRV features: pnn50, sdnn, lf, hf | 83.3% | 83.3%
EDA | From level: slope (calculated by a linear regression); from peak height: q(0.25), q(0.5); from instantaneous peak rate: q(0.25), q(0.5) | 100% | 66.6%
Head | std(pitch), std(roll), # of cluster changes/time, max(abs) | 83.3% | 50%
Right Hand | std(abs), min(abs), max(abs), std(pitch), std(roll), # of clusters, # of cluster changes/time | 66.7% | 50%
Right Leg | mean(pitch), max(pitch), mean(roll), max(roll) | 83.3% | 83.3%
Left Leg | max(pitch) | 100% | 100%

Table 5.8: Stress classification results (2 classes) for multimodal classifiers trained on the MIST experiment data and tested on the office experiment data. Classifier fusion increases recognition accuracy.

Fusion Method | Training data from Feedback (47s) | Training data from Mental arithmetics (47s)
Majority | 100% | 83.3%
Confidence | 83.3% | 66.7%


Table 5.7 shows the classification results. Using training data from the MIST feedback phase yielded better results than using training data from the MIST mental arithmetic phase, especially for the "Head" and "Right Hand" classifiers. This is plausible, because our office experiment only included discussion and no computer work. The office test data was thus more similar to the MIST feedback than to the MIST mental arithmetic phase.

Majority and confidence voting were used to combine the single modality classifiers (cf. Sec. 3.3) that had been trained with the MIST data. The multimodal classifiers were again tested on the data of the office experiment. The results are shown in Table 5.8. The multimodal classifier that was trained with data from the MIST feedback phase achieved 100% accuracy with majority voting.

5.7.4 Conclusion

Our results showed good generalization of the chosen features and classifiers from the laboratory MIST experiment to our small office field study. This indicates that the MIST resembles a "real-life" stress situation and that data gained in the laboratory can be used to design systems for the "real world". However, the results also showed that the activity of the subject has to be taken into account.


6 Elicitation, Labeling, and Recognition of Emotions in Real-Life

In the previous chapters, we used standardized experiment protocols. As opposed to those, we move to a less standardized, real-life scenario of soccer matches in this chapter. Soccer, particularly during important tournaments such as the World Cup, is well known to elicit strong emotions in humans and is therefore predestined to be used for emotion recognition. In real-life emotion experiments, the occurrence of specific non-predictable, non-repeatable events needs to be labeled for each experiment. This can be done by the experiment leader, by the subject, by an observer (e.g. using video analysis), or preferably automatically. In this chapter, we present a new method for automatic event labeling from Internet data. By investigating questionnaires and analyzing categories of live-ticker data, we were able to identify emotion dimensions (arousal and valence) that can directly be associated with certain ticker event categories. Finally, classifiers were used to distinguish between arousal and non-events, based on the identified ticker categories.

6.1 Introduction and Motivation

To develop emotion recognition systems, databases containing "adequate training and test material are as fundamental as feature extraction" (Cowie et al., [51], p. 66). A considerable part of this chapter hence deals with the acquisition and the automatic annotation of adequate training and test data. According to Cowie et al. [51], the recorded material should

• be natural rather than acted,

• represent a wide range of emotional behavior, not exclusively archetypal emotions,

• cover a wide range of different subjects and preferably cultures,


• be annotated with the emotional content.

In the following, we will present a technique for recording natural everyday emotional data that fulfills the requirements stated above. According to [173], watching a sporting event belongs to the category of emotion-eliciting, everyday situations, with the advantage that it is known when the sporting event takes place. Watching live TV soccer matches was therefore chosen as emotion elicitation technique. Soccer matches are especially suited to elicit emotions for the following reasons:

Natural everyday emotions: A soccer match elicits intense naturalistic emotions, especially when the spectators favor one of the teams [50, 204].

Wide range of emotions: Previous research suggests that a wide range of emotions, such as happiness, pleasure, sadness, and anger, are elicited in sports spectators [186], including soccer spectators [102].

Wide range of potential subjects: Soccer is the most popular sport worldwide1 and thus covers the best possible range of different subjects from almost any culture. Even though the majority of soccer fans are men, women become more and more involved and accepted as "authentic fans", too [50].

Ethics: Deliberate elicitation of strong negative emotions by deceiving the test subjects can be ethically critical even with debriefing [160]. When watching a soccer match, the subjects are not deceived but knowingly and willingly take the risk of experiencing negative emotions.

Masking: When watching soccer, it is socially acceptable to show emotions such as grief, joy and anger, especially for men [201]. This fact facilitates the display of natural, unmasked emotions.

Technical issues: Because the subjects are in an indoor location, power consumption of the recording devices and video documentation for later annotation are technically uncritical. Moreover, since the subjects assume a sitting posture, artifacts due to motion are limited.

Annotation: Standardized stimuli (e.g. emotional films) can be reused many times for different subjects to collect a large amount of training and test data for emotion recognition. In contrast, real-life stimuli (e.g. a goal) are non-predictable and non-repeatable and need to be labeled for each run of the experiment (e.g. for each match). Soccer as a real-life stimulus is particularly attractive, because soccer matches are increasingly often labeled by the media and these ticker labels are freely available on the Internet (e.g. www.blick.ch).

Soccer watching is known to induce intense subjective emotions [102, 54]. Park et al. [147] investigated the emotions experienced by spectators when watching 30s winning and losing scenes of their national soccer team. The scenes were selected from video recordings of the most recent World Cup. Brain activation measured by fMRI indicated that winning scenes produced positive emotional responses, whereas losing scenes produced negative emotional responses. Despite this evidence, watching soccer has not been evaluated yet as a technique for collecting a large amount of everyday, natural emotional data.

1 http://mostpopularsports.net/

Previous studies have only investigated soccer emotions over entire matches [102, 54, 204] and not for specific events. It is yet unclear which emotions are elicited by certain events. This chapter thus provides a pioneering feasibility study to determine which emotions are elicited, whether they can be associated with specific match events, and whether they can be recognized automatically from physiological signals.

The following questions are therefore addressed in the remainder of this chapter:

1. Can watching soccer matches on TV be used as an emotion elicitation technique?

(a) Which are the most prominent emotions elicited by soccer watching?

(b) Do the emotions experienced during the match depend on the outcome of the match (i.e. victory or defeat of the favorite team)?

(c) How do the emotions elicited by soccer watching compare to emotions elicited by standardized film stimuli?

2. Can the recorded physiological data be labeled automatically

(a) using ticker categories? This implies the question whether specific emotions can be associated with specific ticker categories.

(b) using the ticker text?

3. How does classification in this real-life scenario compare to the previously presented standardized experiments?

6.2 Data Collection

Subjects and Experiment: In the experiment, we recorded ECG, EDA, movement, and questionnaire data during 9 soccer matches of the World Cup 2010. The experiment was run indoors, either in a meeting room at ETH Zurich or at a private home. Twenty-two subjects (students and PhD students, 17 male, 5 female, age 28.2 ± 2.96 years) watched a live TV soccer match of their favorite team playing. There were no restrictions on the test subjects except for being a supporter of one of the teams. Except for one subject (a Brazilian with a German husband, supporting Germany against Australia), the subjects all supported their national team. ECG data was recorded for a maximum of three subjects and EDA data for a maximum of two subjects per match, due to hardware availability. There were no restrictions in the scenario, allowing other spectators to join the test subjects. A detailed overview of the experiment is given in Table 6.1.


Table 6.1: World Cup soccer matches and recorded data. Group size indicates the number of spectators viewing the match. A small group included 5 or less spectators. ECG data was recorded from maximally 3 subjects and EDA data from maximally 2 subjects per match, due to hardware availability. GEW: Geneva Emotion Wheel questionnaire

Match | # of subjects and favorite team | Match result | GEW data | ECG data | EDA data | Group size | Location
Switzerland-Spain | 1 for Spain, 1 for Switzerland | 1:0 | 2 | 2 | 1 | Large | ETH
Germany-Australia | 2 for Germany | 4:0 | 1 | 2 | 1 | Small | Home
Germany-Serbia | 3 for Germany | 0:1 | 3 | 3 | 2 | Large | ETH
Switzerland-Chile | 3 for Switzerland | 0:1 | 3 | 3 | 2 | Large | ETH
Germany-Ghana | 2 for Germany | 1:0 | 2 | 2 | 2 | Small | ETH
Italy-Slovakia | 3 for Italy | 2:3 | 3 | 3 | 2 | Large | ETH
Switzerland-Honduras | 3 for Switzerland | 0:0 | 3 | 3 | 2 | Small | Home
Germany-England (last sixteen) | 1 for Germany | 4:1 | 1 | 1 | 1 | Small | Home
Germany-Spain (semifinal) | 3 for Germany | 0:1 | 3 | 3 | 1 | Large | ETH
Total | 22 | | 21 | 22 | 14 | |

Physiological recording devices: Three Zephyr bioharnesses were used to record the ECG at 250 Hz [214]. These devices also contain an acceleration sensor delivering activity measurements (i.e. acceleration magnitude) and posture (i.e. forward and backward inclination of the torso), with 1 Hz sampling rate. The EDA data was recorded with two Emotion Boards (see Sec. 3.1.2.1, [177]) at 22 Hz.

Questionnaires: To investigate which emotions were induced in the soccer spectators, the Geneva Emotion Wheel (GEW) [14, 170] was used. The GEW is an instrument for measuring emotional reactions to objects, events or situations2. Twenty emotion families are arranged in a wheel shape (see Fig. 6.1). After each half-time of the soccer match (45 min.), the subjects were asked to name all the emotions they had felt or were still feeling by choosing the intensity of the corresponding emotion families, e.g. "Happiness/Joy". The intensity of each emotion is given by the distance to the center of the wheel. To simplify matters, we will in the following refer to an emotion family as an "emotion".

In addition to the emotions generally induced during soccer watching, we were interested in whether specific emotions can be associated with specific events of the match. For this purpose, the subjects were asked to name the situation of their strongest emotions (marked in the outermost or second-outermost circle of the GEW, see Fig. 6.1), and to try to specify this situation in time, e.g. "I felt angry when the referee sent XY off." In the following, we will refer to these questions by the term "emotion-event assignment questions".

2 The original GEW is available at http://www.affective-sciences.org/researchmaterial

Figure 6.1: Geneva Emotion Wheel, adapted for digital use from [14]. This version was used in Sec. 4.2.

Labeling: During the experiment, the experiment leader hand-labeled events (goals, chances and fouls) and the emotional behavior of the subjects. Additionally, live ticker data published on the Internet was used for subsequent labeling, as described later in Sec. 6.4.2 and 6.4.3.

6.3 Questionnaire Analysis

First, we investigated whether soccer watching can be used to elicit emotions, and, if so, which emotions are reliably elicited (first question posed in the introduction, Sec. 6.1). These questions can be answered by the evaluation of the GEW questionnaires.

6.3.1 Which are the most prominent emotions elicited by soccer watching (1a)?

To determine the most prominent emotions occurring during soccer watching, the GEW questionnaires of all participants were summarized by calculating the mean value of each emotion over all the questionnaires (9 matches, 21 subjects). The mean values of the emotions are shown in Fig. 6.2. Since the GEW was filled in after each half time of the match, the two half times are shown separately with yellow and red bars. The scale ranges from 0 (emotion not selected) to a maximum value of 5 (emotion strongly felt). The emotions achieving a mean score > 1.4 (i.e. lying above the third quartile q(0.75), cf. Sec. 3.1.2.3) are: involvement/interest, disappointment/regret, irritation/anger, worry/fear, sadness/despair, amusement/laughter, and feeling disburdened/relief. This result shows that a wide range of different emotions, covering each quadrant of the "arousal-valence" space (cf. Sec. 2.3 and Fig. 4.4), can be elicited by watching soccer.

The emotions experienced by our test subjects are consistent with the literature: Sloan et al. [186] identified anger, discouragement, sadness, irritability, happiness, satisfaction, and pleasure in basketball spectators, whereas Kerr et al. [102] found boredom, anger, sullenness, humiliation, resentment, and relaxation in soccer spectators.

6.3.2 Do the emotions experienced during the match depend on the outcome of the match (victory or defeat) (1b)?

According to related work, a victory of the favorite team induces more positive and less negative emotions than a defeat [102, 54, 204].

Figure 6.3 shows again the mean value of each emotion for our GEW questionnaires, but now the results are separated according to a victory or a defeat of the favorite team. The first plot was generated using data of the five subjects whose favorite team had won the match. For the second plot, data of the remaining 16 subjects was used. For 13 subjects, their favorite team was defeated. For the other three subjects, the match resulted in a tie. However, since a tie was not sufficient to reach the next round of the tournament, the data of the tie matches were included in the defeats.

Figure 6.2: Mean values of elicited emotions of 21 subjects, according to the GEW. Yellow: first half time of the match. Red: second half time.

Figure 6.3: Mean values of elicited emotions, separated by match outcome. Top: Data of 5 subjects whose favorite team won. Bottom: Data of the remaining 16 subjects (favorite team defeated or tie result).

For the 7 emotions determined to be most prominent in the previous Section 6.3.1, statistical analyses were performed using the following tests:

1. The Kruskal-Wallis test [111] is a statistical test that determines whether data of independent groups originate from the same distribution. The test is based on ranks (i.e. non-parametric) and is suitable to compare groups of different sizes. It was therefore used for comparisons between victories and defeats.

2. The Wilcoxon signed ranks test [210] is a non-parametric, rank-based test to compare paired data (i.e. dependent groups of data). Since every subject filled in the GEW twice, the data of the first and the second half time are dependent, and the Wilcoxon signed ranks test was therefore appropriate. The two-sided version of the test was used. A minimal usage sketch of both tests follows below.
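The sketch below shows how the two tests can be applied with SciPy; the rating arrays are placeholders for the GEW intensities, not the recorded data.

```python
from scipy.stats import kruskal, wilcoxon

# Placeholder GEW intensity ratings (0-5) for one emotion family.
victory_ratings = [4, 3, 5, 4, 4]                                   # 5 subjects, favorite team won
defeat_ratings = [1, 2, 1, 0, 2, 1, 3, 1, 2, 0, 1, 2, 1, 0, 2, 1]   # 16 subjects

# Independent groups of unequal size -> Kruskal-Wallis test
h_stat, p_groups = kruskal(victory_ratings, defeat_ratings)

# Paired ratings of the same subjects (first vs. second half time) -> Wilcoxon signed rank test
first_half = [2, 3, 1, 4, 2, 3]
second_half = [3, 5, 2, 5, 4, 4]
w_stat, p_paired = wilcoxon(first_half, second_half, alternative="two-sided")
```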

6.3.2.1 Statistical Results

Comparison between victories and defeats: Table 6.2 shows the statistical results for the comparison of emotions induced in the 5 subjects experiencing a victory and the 16 subjects experiencing a defeat or a tie. The comparison is based on the GEW questionnaires of the second half time.

According to the Kruskal-Wallis tests, victories led to significantly higher GEW ratings of amusement/laughter and feeling disburdened/relief (with a significance level of p = 0.05). On the other hand, disappointment/regret and irritation/anger were significantly stronger in subjects experiencing a defeat. For the remaining three emotions, the Kruskal-Wallis tests did not show any significant differences between victories and defeats.

Overall, we can conclude that a defeat elicits more negative and a victory more positive emotions. This confirms the findings of Kerr et al. [102], who found losing fans to score significantly higher than winning fans on boredom, anger, sullenness, humiliation, and resentment, and lower on relaxation.

Comparison between half times: For labeling and subsequent emotion recognition, it is important to know whether the intensities of the elicited emotions depend on temporal characteristics of the match. As can be seen in Figure 6.3, the GEW emotion ratings seem to be different for the two half times of the match (see e.g. feeling disburdened/relief for winning subjects and disappointment/regret for losing subjects). Wilcoxon signed ranks tests to compare match half times revealed the following:

Victories: For the victories, no significant differences between the first and the second half time were found for any of the 7 emotions (with a significance level of p = 0.05). Others have made similar observations: In [102], levels of pleasant and unpleasant emotions of soccer fans did not change significantly for winning fans (except for boredom). In the basketball fans investigated in [186], negative emotions did not change for a win, whereas positive emotions only increased for a difficult win, but not for an easy win. Our data set contained both difficult (1:0, 1:0) and easy wins (4:0, 4:1); thus, the easy wins might have dominated the statistical result. The lack of significance for the victories might, however, also be due to the small sample size (5 subjects) and deserves further investigation in future experiments.

Defeats and ties: Table 6.3 shows the results of the Wilcoxon signed ranks test to compare the half times of defeat and tie matches. From the first to the second half time, the negative emotions disappointment/regret and sadness/despair showed a significant increase, irritation/anger showed a trend to increase (p = 0.087), and the positive emotions involvement/interest and amusement/laughter showed a significant decrease. These changes are as expected from previous work: In [102], levels of pleasant and unpleasant emotions changed significantly for losing fans. In [186], negative emotions such as anger, discouragement, sadness, and irritability increased, whereas positive emotions such as happiness, satisfaction, and pleasure decreased after a loss. Since worry/fear is a negative emotion, we had expected higher intensities during the second half time than during the first. In contrast, the Wilcoxon signed rank test revealed the opposite (see Table 6.3). This observation is probably due to the recency effect [138], which states that recent events are more likely to be remembered than distant events. As the second GEW questionnaire was recorded after the end of the match, subjects might rather have remembered feeling sad and disappointed after the final whistle than feeling worried before (when the match was still running).

Table 6.2: Comparison of emotions elicited in the 5 subjects experiencing a victory and the 16 subjects experiencing a defeat or tie, based on the GEW questionnaires of the second half time. The emotion intensity relation between victories and defeats was determined using the mean ranks resulting from the Kruskal-Wallis test (mv: mean rank for victory, md: mean rank for defeat). * indicates statistical significance with p < 0.05, and ** with p < 0.01, respectively.

Emotion | Kruskal-Wallis test: p-value, mv, md | Relation
Involvement / Interest | p = 0.8933, mv = 10.7, md = 11.1 |
Amusement / Laughter | p = 0.0078**, mv = 16.4, md = 9.3 | victory > defeat
Feeling disburdened / Relief | p = 0.0079**, mv = 16.6, md = 9.3 | victory > defeat
Sadness / Despair | p = 0.6031, mv = 9.8, md = 11.4 |
Worry / Fear | p = 0.2996, mv = 13, md = 10.4 |
Disappointment / Regret | p = 0.0079**, mv = 4.9, md = 12.9 | victory < defeat
Irritation / Anger | p = 0.014*, mv = 5.1, md = 12.8 | victory < defeat

Table 6.3: Statistical results for the comparison of GEW emotion intensities during the first and the second half time for 16 subjects whose favorite team was defeated (ties included). A two-sided Wilcoxon signed rank test with a significance level of 0.05 was used. *: p < 0.05, **: p < 0.01. The direction of change in emotion intensity from the first to the second half time (increase or decrease) was determined by a subsequent one-sided test and is consistent with the changes observed in Fig. 6.3.

Emotion | p-value | Change
Involvement / Interest | p = 0.007** | Decrease
Amusement / Laughter | p = 0.031* | Decrease
Feeling disburdened / Relief | p = 0.125 | (Increase)
Sadness / Despair | p = 0.013* | Increase
Worry / Fear | p = 0.0024** | Decrease
Disappointment / Regret | p = 0.012* | Increase
Irritation / Anger | p = 0.087 | (Increase)

Overall, we can conclude that negative emotions increase and positive emotions decrease in response to a defeat of the favorite team.

6.3.3 How do the emotions elicited by soccer watching compare to emotions elicited by standardized film stimuli (1c)?

Finally, we investigated whether soccer matches elicit emotions of comparable intensity as the standardized film emotion elicitation technique presented in Sec. 4.2. We therefore compared the GEW data of the soccer experiment with the GEW data recorded during the film experiment, as shown in Fig. 6.4. The target emotions of the films were: amusement, contentment, sadness and anger. Since the target emotions included both positive and negative emotions, the comparisons in Fig. 6.4 are shown separately for positive and negative emotions. Moreover, since the reported negative emotions were stronger during the second half time of the matches (cf. Sec. 6.3.2.1), only the GEW data of the second half time is shown.

Statistical comparisons of the film target emotion GEW scores with the GEW scores recorded in the soccer experiment for "amusement/laughter", "enjoyment/pleasure"3, "sadness/despair", and "irritation/anger" showed no significant differences between elicitation techniques, as all p-values of the Kruskal-Wallis tests were > 0.1. This leads to the conclusion that a soccer match can induce emotions of similar intensities as those elicited by films.

Fig. 6.4 also illustrates that the range of soccer-elicited emotions is similar to the range of the film-elicited emotions. Interestingly, even ethically delicate negative emotions such as anger and disappointment can be elicited efficiently by a soccer game.

6.4 Automatic Label Generation

Because events during a soccer match (e.g. a goal) are non-predictable, the physiological data recorded during a soccer match needs to be labeled during or after the match. The labeling needs to be repeated for every recorded soccer match. This can become very time-consuming, especially if the labeling is done by hand. This section therefore describes two different methods for generating labels automatically based on Internet ticker data. The first method employs ticker categories, whereas the second uses ticker texts.

6.4.1 Employed Ticker Data

Swiss live ticker data published at www.blick.ch was used for labeling. Sportradar AG4 kindly provided us with the underlying raw data. Five exemplary ticker entries are shown in Fig. 6.5. A ticker entry is composed of a symbol indicating the event category, and the corresponding text. For easier referencing, we have assigned a name to each category, as shown in Fig. 6.5 on the right. The investigated categories are: goals, "megachances", small chances in which the ball was kept by the goalkeeper, small chances in which the goal was missed, and red cards.

6.4.2 Arousal and Valence Labeling using Internet Ticker Categories

The first labeling method is based on the ticker categories. For automatic generation of unambiguous emotion labels from the ticker categories, each ticker category would have to be associated with a single emotion or with one of the quadrants in the arousal-valence space. To the best of our knowledge, such associations between match events (i.e. ticker categories) and emotions have not been reported in the literature, because previous studies have only investigated soccer emotions over entire matches and not for specific events [102, 54, 204]. Therefore, we developed a hypothesis regarding the relationship between match events belonging to different ticker categories, emotions, and emotion dimensions (i.e. arousal and valence, cf. Sec. 2.3), as listed in Table 6.4.

3 Contentment is not contained in the GEW questionnaire's emotions. For the comparison of film and soccer data, we used "enjoyment/pleasure", since this is the emotion that yielded the highest GEW score for the contentment film.

4 www.sportradar.ag

Figure 6.4: Emotions elicited by soccer watching compared to emotions elicited by film clips in Sec. 4.2. Top: Mean GEW scores of the second half time for the 5 subjects whose favorite team won, compared to mean GEW scores of the amusement and contentment films. Bottom: Mean GEW scores of the second half time for the remaining 16 subjects (favorite team defeated or tie result), compared to mean GEW scores of the sadness and anger films.

Figure 6.5: Exemplary ticker entries for different events. Left: Symbol indicating the event category. (Note: The event category is encoded as a number in the raw ticker data. Based on this number, the corresponding picture is displayed on the website.) Middle: Ticker text. Right: Category name we use in this chapter to refer to the different event categories.

Table 6.4: Hypothesized relationship between match events (i.e. ticker categories), emotions, and emotion dimensions (i.e. arousal and valence)

Ticker category | Arousal hypothesis | Valence hypothesis | Emotion hypothesis
Goals of own team | High | Positive | Joy
Goals of adversary | High | Negative | Anger
Missed "megachances" of own team | High | Negative | Anger
Missed "megachances" of adversary | High | Positive | Relief
Small chances of own team, held by keeper | Low | Negative | Disappointment or weak anger
Small chances of own team, missed goal | Low | Negative | Disappointment or weak anger
Small chances of adversary, held by keeper | Low | Positive | Weak relief
Small chances of adversary, missed goal | Low | Positive | Weak relief
Red cards received by own team | High | Negative | Strong anger
Red cards received by adversary | High | Positive | Malicious joy

Each match event is marked by a certain ticker category (e.g. "megachance") and elicits a certain amount of arousal and a degree of valence. For each ticker category, we formed a hypothesis regarding the amount of arousal and the degree of valence that is elicited by match events belonging to this ticker category. Figure 6.6 shows the expected location of match events of different ticker categories in arousal-valence space.

6.4.2.1 Relationship between Single Emotions and Ticker Categories

To test whether specific emotions can be associated with specific ticker categories (question 2a in Sec. 6.1), the test subjects had been asked to assign strongly experienced emotions to specific match events ("emotion-event assignment questions", explained in Sec. 6.2). Using their answers combined with the ticker texts, the emotion ratings were manually associated with specific match events. For each ticker category, we counted how many times the subjects had assigned an event of this category to a specific emotion. This resulted in an "emotion distribution" for each ticker category, as shown in Fig. 6.7 and summarized in Table 6.5. Looking at the emotion distributions, we already have to conclude that the ticker categories cannot unambiguously be associated with single emotions.

6.4.2.2 Relationship between Arousal-Valence Quadrants and Ticker Categories

In a next step, we investigated which ticker categories can be associated with approximate levels of arousal or valence, based on the "emotion distributions" in Fig. 6.7. The result is shown in Table 6.6 and was used for ticker category labeling as described in the following Section 6.4.2.3.

Table 6.5: Summary of the results in Fig. 6.7, considering only the emotions that achieved a fraction of 20% or more.


Figure 6.6: Expected location of match events of different ticker categories in arousal-valence space, see also Table 6.4.

Figure 6.7: Emotions assigned to match events of different ticker categories, based on the answers to the "emotion-event assignment questions" of 21 subjects. The ticker categories cannot be associated with single emotions.


Table 6.6: Relationships between match events belonging to different ticker categories, emotion dimensions (i.e. arousal and valence), and discrete emotions, found from the GEW evaluations. Only emotions achieving a fraction of 20% or more according to Fig. 6.7 are listed in the last column. Categories marked with * were labeled "arousal" and used for classification in Sec. 6.6.

Ticker category | Arousal association | Valence association | GEW emotion association
Goals of own team * | Medium - High | Positive | Involvement/Interest, Happiness/Joy, Feeling disburdened/Relief
Goals of adversary * | Medium - High | Negative | Irritation/Anger, Shame/Embarrassment, Disappointment/Regret
Missed "megachances" of own team * | Medium - High | Mainly negative | Irritation/Anger, Disappointment/Regret
Missed "megachances" of adversary | Medium - High | Negative and neutral | Worry/Fear, Astonishment/Surprise
Small chances of own team, held by keeper | Not mentioned by subjects | |
Small chances of own team, missed goal | Not mentioned by subjects | |
Small chances of adversary, held by keeper | Seldom mentioned by subjects | |
Small chances of adversary, missed goal | Not mentioned by subjects | |
Red cards received by own team * | Medium - High | Negative | Irritation/Anger, Disappointment/Regret, Worry/Fear
Red cards received by adversary | Not contained in data set | |


6.4.2.3 Ticker Category Labeling

The “ticker category labels” were defined based on two criteria:

1. The number of times the subjects mentioned an event of the category, see Table 6.7.

2. The arousal and valence levels of the emotions associated with the category, see Fig. 6.7 and Table 6.6.

The match events belonging to different ticker categories were labeled as follows:

Goals of own team: This is the category mentioned most often (20 times). Note that since some of the goals were associated with more than one emotion, the number of mentions in the last column of Table 6.7 exceeds the number of goal occurrences in the second column. As expected, the goals of the own team induced purely positive emotions (Fig. 6.7). The arousal level of the associated emotions was medium-high. Resulting label: "arousal", positive.

Goals of adversary: The data set contained 20 goals of the adversary, which were mentioned 9 times. Associated valence: negative; arousal level: medium-high. Resulting label: "arousal", negative.

Missed "megachances" of the own team: The data set contained 90 "megachances" of the own team, which were mentioned 18 times. As expected, the induced emotions were mainly negative, but "amusement/laughter" and "happiness/joy" were also mentioned (Fig. 6.7). We therefore did not assign any valence label to the missed "megachances" of the own team. The arousal level of the associated emotions was medium-high. Resulting label: "arousal".

Missed "megachances" of the adversary: The data set contained 79 "megachances" of the adversary, which were mentioned only 8 times. According to our hypothesis, the missed chances of the adversary were expected to induce positive emotions. In contrast, the questionnaires showed that the subjects were worried and disappointed about their own team and surprised by the good performance of the adversary (which might also imply being negatively surprised about the bad defense performance of the own team). The expected emotion "relief" was never associated with a missed chance of the adversary. Due to this observation, and because this category was seldom mentioned by the subjects, the missed "megachances" of the adversary were excluded from ticker category labeling.

Small chances: Of the 86 small chances in total, the subjects only mentioned three. Since the subjects were only asked to specify the situations involving intense emotions, it is not possible to conclude whether the small chances induced (weak) emotions or none. The small chances were therefore excluded from ticker category labeling.

Red cards received: A red card given to a player of the favorite team represents an additional interesting match event type. Purely negative emotions were induced, anger being the most frequent. The arousal level of the associated emotions was medium-high. Resulting label: "arousal", negative.


Table 6.7: Occurrences of game events (according to the live ticker) and number of times the 21 subjects associated the events with a specific emotion marked in the GEW.

Event category | Number of occurrences | Number of times mentioned
Goals of own team | 17 | 20
Goals of adversary | 20 | 9
Missed "megachances" of own team | 90 | 18
Missed "megachances" of adversary | 79 | 8
Small chances of own team, held by keeper | 23 | 0
Small chances of own team, missed goal | 22 | 0
Small chances of adversary, held by keeper | 16 | 3
Small chances of adversary, missed goal | 25 | 0
Red cards received by own team | 6 | 7
Red cards received by adversary | 0 | 0


Other events of strong emotions, which were mentioned by the subjects but could not be labeled based on the ticker data, are: fouls, an offside goal of the favorite team, emotions associated with longer time spans (e.g. begin, end) and socially induced emotions (laughter, anger for being teased by another subject).

An interesting observation made during the evaluation is worth mentioning here: Goals and "megachances" of the own team were mentioned more often than goals and chances of the adversary. The subjects thus seemed to be more involved in the actions of their favorite team.

Finally, we conclude that only extreme emotional events could be identified when questioning the subjects after each half time. Small chances as well as "megachances" of the adversary were seldom mentioned and were therefore not labeled. Nevertheless, events such as "goals", "megachances" of the favorite team, and "red cards" can reliably be associated with arousal. Regarding valence, "own goals" are clearly positive, whereas "adversary goals" are clearly negative.

6.4.3 Arousal Labeling using Internet Ticker Texts

In our case, each ticker entry was associated with a ticker category. This ticker category was directly used for labeling as described in the previous section. However, we also observed that the ticker author expressed his level of arousal by using capital letters, by repeating letters and by using a large number of exclamation marks. We therefore investigated whether this text data can be exploited for automatic labeling: We trained an unsupervised fuzzy c-means clustering (FCM) classifier [65, 18] that predicts the three ticker categories goal, "megachance" and small chance exclusively from three ticker text features: the number of letter repetitions, the number of exclamation marks, and the ratio of uppercase to lowercase letters, see Fig. 6.8.
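The three text features can be computed with standard string processing. The sketch below is an assumed re-implementation; in particular, counting a letter repetition as a run of three or more identical letters is a guess at the original definition.

```python
import re

def ticker_text_features(text: str):
    """Return (letter repetitions, exclamation marks, uppercase ratio) for a ticker entry."""
    # runs of 3 or more identical letters, e.g. "GOOOAL" -> one repetition run
    letter_repetitions = len(re.findall(r"([A-Za-z])\1{2,}", text))
    exclamation_marks = text.count("!")
    upper = sum(ch.isupper() for ch in text)
    lower = sum(ch.islower() for ch in text)
    uppercase_ratio = upper / lower if lower else float(upper > 0)
    return letter_repetitions, exclamation_marks, uppercase_ratio

print(ticker_text_features("GOOOAL!!! What a strike!"))   # (1, 4, 0.7)
```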

The number of exclamation marks and the uppercase ratio are shown in Fig. 6.9. The figure shows that the different ticker categories are well separated in the feature space. Using a 10-fold cross-validation, the FCM classifier achieved an accuracy of 75% in predicting the three classes "goal", "megachance" and "small chance".
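For completeness, a compact fuzzy c-means implementation (standard Bezdek updates with fuzzifier m = 2) is sketched below; in practice, a library implementation could be used instead. The random initialization, the stopping criterion, and the final mapping of clusters to ticker categories are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per sample
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]        # weighted cluster means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)          # membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Hypothetical usage on the three ticker text features (one row per ticker entry):
# centers, U = fuzzy_c_means(np.asarray(features, dtype=float))
# hard_labels = U.argmax(axis=1)    # crisp cluster assignment per ticker entry
```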

Based on the FCM classification result, we concluded that the ticker category is strongly related to the arousal level expressed by the ticker author in his text. Therefore, the class predictions of the FCM classifier were used to create "arousal-text-labels" for emotion recognition: Events identified as a "goal" or a "megachance" by the FCM classifier were labeled "arousal", as illustrated by the orange box in Fig. 6.8.

6.4.4 Label Summary

Based on the observations made in Sec. 6.4.2.3 and Sec. 6.4.3, three sets of labels were finally generated:


Figure 6.8: Generation of arousal “ticker-text-labels”.


Figure 6.9: Ticker text features: uppercase ratio and number of exclamation marks. The color indicates the ticker event category. The different ticker events are well separated in the feature space.

6.4.4.1 Arousal Labels

1. Arousal ticker category labels: All goals + "megachances" missed by the favorite team + red cards received by a player of the favorite team were labeled as "arousal", which resulted in 89 game minutes labeled as "arousal".

2. Arousal ticker text labels: Only those events were labeled as "arousal" that were identified as a "goal" or a "megachance" by the FCM classifier presented in Sec. 6.4.3, cf. Fig. 6.8. This also resulted in 89 game minutes labeled as "arousal".

All game minutes for which no match events were known before, during, or after the game minute were labeled as "non-events", resulting in 127 game minutes.

6.4.4.2 Valence Labels

The valence labels are solely based on the ticker categories: The goals of the own team were labeled "positive". The goals of the adversary and the red cards given to a player of the favorite team were labeled "negative". This resulted in 15 game minutes labeled "positive", and 16 game minutes labeled "negative".

6.5 Feature Calculation

According to the ticker label time resolution of one minute, we calculated all features for one-minute windows. The following 34 features were calculated for the data of the 14 subjects with available EDA recordings:

• ECG: From heart rate: Min, max, mean, slope (calculated by a linear regression). HRV parameters (see Table 3.1): sdnn, rmssd, pnn50, triangular index, lf, hf.


• EDA: From level: Min, max, mean, std, slope (calculated by a linear regression). From peak height: q(0.25), q(0.5), q(0.75), q(0.85), q(0.95); from instantaneous peak rate: q(0.25), q(0.5), q(0.75), q(0.85), q(0.95).

• Activity/Posture: From acceleration magnitude: mean, median, max, std; from posture (i.e. torso inclination): mean, median, min, max, std.

As in Sec. 5.5.2.6, an amplitude criterion of 0.01 µS was used to calculate the EDA peaks, and the smoothness priors method [192] was used to detrend the RR-intervals before calculating the HRV parameters lf and hf.
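Two of the listed feature types are sketched below: the slope of a one-minute trace obtained by a linear regression, and quantiles of EDA peak heights. Detecting peaks with scipy.signal.find_peaks and mapping the 0.01 µS amplitude criterion onto its prominence parameter is an assumption about how the criterion translates to this function; the helper names are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks

def slope_feature(signal, fs):
    """Slope of a signal segment (per second), estimated by a first-order regression."""
    t = np.arange(len(signal)) / fs
    slope, _intercept = np.polyfit(t, signal, 1)
    return slope

def eda_peak_quantiles(eda, quantiles=(0.25, 0.5, 0.75, 0.85, 0.95)):
    """Quantiles of EDA peak heights; peaks below 0.01 uS prominence are ignored."""
    _peaks, props = find_peaks(eda, prominence=0.01)
    heights = props["prominences"]
    if len(heights) == 0:
        return np.zeros(len(quantiles))     # no peaks in this window
    return np.quantile(heights, quantiles)
```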

Feature normalization: To reduce inter-individual differences, the features were normalized by subtracting the median value. For a feature f(t_i), with t_i indicating the game minute, the normalized feature resulted in:

fnorm(ti) = f(ti)− mediant∈{1,2,...90}

f(t)
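A minimal sketch of this normalization, assuming the features of one subject are stored as a (game minutes x features) array; the function name is ours.

```python
import numpy as np

def normalize_per_subject(features):
    """Subtract, per feature, the median over all game minutes of one subject
    (the normalization defined above)."""
    return features - np.median(features, axis=0, keepdims=True)
```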

6.5.1 Feature Selection based on Statistical Analysis

Feature selection based on statistical analysis was applied as follows (a code sketch of the procedure is given after the valence feature list below):

Feature selection for arousal: We used the Wilcoxon signed rank test to find features that showed significant differences between "arousal" and "non-events" (using the data of all subjects). The features with p < 0.05 were included in the reduced feature set, resulting in:

• ECG: HR: max (p = 0.0002), mean (p = 0.0002); HRV: sdnn (p = 0.0203)

• EDA: From level: min (p = 0.0006), max (p = 0.0002), mean (p = 0.0002), std (p = 0.0031). From EDA peak height: q(0.75) (p = 0.0353), q(0.85) (p = 0.0166), q(0.95) (p = 0.0134).

• Activity/Posture: From acceleration magnitude: mean (p = 0.0134), median (p = 0.0266), max (p = 0.0085), std (p = 0.0085); from posture: max (p = 0.0419), std (p = 0.0203).

Feature selection for valence: In 7 of the 9 games, only one of the teams had scored a goal (cf. Table 6.1). The data was therefore not paired, and we thus used the Kruskal-Wallis test [111]. Statistically significant (i.e. p < 0.05) differences between "positive" and "negative" events resulted in the reduced feature set for valence:

• ECG: HR: max (p = 0.0297); HRV: pnn50 (p = 0.0168), sdnn (p = 0.0020), rmssd (p = 0.0177).

• EDA: From level: max (p = 0.0009), std (p = 0.0023)

• Activity/Posture: None of the activity and posture features reached statistical significance; we thus chose the feature yielding the minimal p-value: mean posture (p = 0.0578).
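The selection procedure referenced above can be sketched as follows. The snippet assumes that, for the paired arousal comparison, one value per subject and condition is available per feature (one plausible reading of "using the data of all subjects"), while the valence samples are unpaired; the helper name and data layout are ours.

```python
import numpy as np
from scipy.stats import wilcoxon, kruskal

def select_features(arousal, nonevent, positive, negative, names, alpha=0.05):
    """Return the reduced arousal and valence feature sets.
    `arousal`/`nonevent` are paired arrays of shape (n_subjects, n_features);
    `positive`/`negative` are unpaired arrays of shape (n_events, n_features)."""
    arousal_set, valence_set = [], []
    for j, name in enumerate(names):
        _, p_arousal = wilcoxon(arousal[:, j], nonevent[:, j])   # paired test
        _, p_valence = kruskal(positive[:, j], negative[:, j])   # unpaired test
        if p_arousal < alpha:
            arousal_set.append(name)
        if p_valence < alpha:
            valence_set.append(name)
    return arousal_set, valence_set
```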


6.6 Arousal and Valence Recognition

The sections above were of a preparatory nature, covering the acquisition and annotation of the data. With the automatically annotated data, we can proceed to the actual emotion recognition, the final goal of this chapter.

For classification, the data of the 14 subjects with available EDA recordings were used.

6.6.1 Single Modality Classification

The three modalities (EDA, heart, activity) were first evaluated separately using leave-one-person-out cross-validation. The ticker categories were used for labeling. The results for arousal and valence classification are shown in Table 6.8a and Table 6.8b, respectively; for each modality, the table lists the classification accuracy when employing all features of the modality as well as the accuracy when employing the reduced feature set. The features selected for the reduced feature set are marked with *.
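A leave-one-person-out evaluation can be sketched as follows. The snippet uses scikit-learn's LeaveOneGroupOut with plain LDA as a stand-in for the four discriminant classifiers of Table 6.8, and it averages the per-subject accuracies; the original evaluation may have pooled predictions differently.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_person_out_accuracy(X, y, subject_ids):
    """Train on all subjects but one, test on the held-out subject,
    and return the mean accuracy over all held-out subjects."""
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))
```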

When comparing the results obtained with all features to those obtained with the reduced feature sets in Table 6.8a and Table 6.8b, we observed that feature selection yields an increase in classification accuracy. For valence classification, accuracy increases of 10% and more were observed, which is due to the low amount of labeled valence data (15+16 game minutes).

Consistent with our previous experiments in Sec. 4.2 and Ch. 5, LDA yielded higher accuracies than QDA in most cases. This was not surprising for valence classification, because the amount of labeled valence data was not sufficient to reliably estimate all the QDA parameters [156]. However, LDA still outperformed QDA in arousal classification, even though much more labeled data (89+127 game minutes) was available.

In this experiment, EDA was the best modality for arousal classification, whereas the ECG outperformed the other modalities for valence classification. A possible explanation is the sympathetic activation that is linked to high arousal (recall that EDA is exclusively influenced by the sympathetic nervous system), whereas events differing in valence are rather reflected in parasympathetic activation, measured by the heart rate variability parameters. In line with this assumption, three HRV parameters reached statistical significance for valence differentiation in Sec. 6.5.1, whereas only one HRV parameter was significant for arousal differentiation.

The activity and posture features seem to be more related to arousal (72.2% accuracy) than to valence (61.3% accuracy).

6.6.1.1 Comparison of Arousal Labeling Techniques

Next, we compare the labeling techniques by comparing the classification accuracies achieved with the two label sets (ticker categories and ticker texts), see Table 6.9. The maximum classification accuracy was obtained with data labeled by ticker texts. The arousal level expressed by the ticker author through his texts was thus consistent with the arousal level experienced by the spectators. We conclude that Internet ticker data can be used to generate automatic arousal labels, even if no categorization is available.


Table 6.8: Single modality classification accuracies with leave-one-person-out cross-validation, for two classes of arousal (a) and two classes of valence (b). Ticker categories were used to label the data. For each modality, accuracies are given for the classifier employing all features and for the classifier employing only the statistically significant features (marked with *), listed in the order LDA(dcm) / LDA / QDA(dcm) / QDA. LDA: Linear Discriminant Analysis, QDA: Quadratic Discriminant Analysis, dcm: diagonal covariance matrix, HR: heart rate, HRV: heart rate variability, p.h.: peak height, inst. p.r.: instantaneous peak rate, acc.: acceleration.

ECG (HR: max*, min, mean*, slope; HRV: pnn50, sdnn*, rmssd, lf, hf, triangular index)
    All features:         67.6% / 68.1% / 68.1% / 62.0%
    Significant features: 70.8% / 70.8% / 59.4% / 66.6%

EDA (from EDA level: min*, max*, mean*, std*, slope; from p.h.: q(0.25), q(0.5), q(0.75)*, q(0.85)*, q(0.95)*; from inst. p.r.: q(0.25), q(0.5), q(0.75), q(0.85), q(0.95))
    All features:         75.0% / 70.8% / 69.9% / 67.6%
    Significant features: 73.1% / 72.7% / 70.4% / 69.4%

Activity/Posture (from acc. magnitude: mean*, median*, max*, std*; from posture: mean, median, min, max*, std*)
    All features:         72.7% / 69.0% / 66.7% / 62.0%
    Significant features: 72.2% / 71.3% / 71.3% / 63.4%

(a) Arousal classification based on ticker category labels

ECG (HR: max*, min, mean, slope; HRV: pnn50*, sdnn*, rmssd*, lf, hf, triangular index)
    All features:         74.2% / 67.7% / 64.5% / 32.3%
    Significant features: 83.9% / 77.4% / 83.9% / 45.2%

EDA (from EDA level: min, max*, mean, std*, slope; from p.h.: q(0.25), q(0.5), q(0.75), q(0.85), q(0.95); from inst. p.r.: q(0.25), q(0.5), q(0.75), q(0.85), q(0.95))
    All features:         67.7% / 64.5% / 48.4% / -
    Significant features: 67.7% / 74.2% / 58.1% / 64.5%

Activity/Posture (from acc. magnitude: mean, median, max, std; from posture: mean*, median, min, max, std)
    All features:         67.7% / 45.2% / 54.8% / 48.4%
    Significant features: 61.3% / 61.3% / 58.1% / 58.1%

(b) Valence classification based on ticker category labels


Figure 6.10: Classification chain for emotion recognition from soccer data.


6.6.2 Multimodal Classification

To build a multimodal classifier, we combined the three modalities using majority and confidence voting. The classification results are shown in Table 6.10. A maximum accuracy of 79.2% was reached for arousal, and 87.1% for valence classification with diagonal covariance LDA. This result is competitive with the 85% arousal and 70% valence classification accuracy achieved with EDA, ECG and activity features in [86].
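One common way to implement the two fusion schemes is sketched below: majority voting takes the most frequent class label across the modality classifiers, and confidence voting sums the per-modality class posteriors and picks the largest total. The function names are ours, and the exact confidence measure used in this work may differ from the posterior sum shown here.

```python
import numpy as np

def majority_vote(labels):
    """Fuse per-modality class predictions; `labels` has shape (n_modalities, n_samples)."""
    labels = np.asarray(labels)
    fused = []
    for column in labels.T:  # one column per sample
        values, counts = np.unique(column, return_counts=True)
        fused.append(values[np.argmax(counts)])
    return np.array(fused)

def confidence_vote(posteriors):
    """Fuse per-modality class posteriors of shape (n_modalities, n_samples, n_classes)
    by summing them and taking the most confident class."""
    return np.argmax(np.asarray(posteriors).sum(axis=0), axis=1)
```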

6.6.3 Stress Classification

Wilbert-Lampen, Leistner et al. have found a significant increase in the incidence of cardiovascular events in association with matches involving the German soccer team during the FIFA World Cup, held in Germany in 2006 [209]. The authors hypothesized that these additional emergencies were triggered by emotional stress induced by soccer watching. Therefore, we trained stress classifiers using the laboratory stress data of Sec. 5.5 and tested them on the soccer arousal data. The classifier using ECG data yielded an accuracy of 61.2%, whereas the EDA classifier resulted in 59%. The activity data was not investigated because the recording systems used in the stress experiment were not the same as the ones used in the soccer experiment.


Table 6.9: Comparison of labeling techniques (ticker category labels vs. ticker text labels) for single modality arousal classification (2 classes). Accuracies are listed in the order LDA(dcm) / LDA / QDA(dcm) / QDA. LDA: Linear Discriminant Analysis, QDA: Quadratic Discriminant Analysis, dcm: diagonal covariance matrix, HR: heart rate, HRV: heart rate variability, p.h.: peak height, acc.: acceleration.

ECG (HR: max, mean; HRV: sdnn)
    Ticker categories: 70.8% / 70.8% / 59.4% / 66.6%
    Ticker texts:      70.8% / 72.2% / 70.4% / 65.3%

EDA (from EDA level: min, max, mean, std; from p.h.: q(0.75), q(0.85), q(0.95))
    Ticker categories: 73.1% / 72.7% / 70.4% / 69.4%
    Ticker texts:      75.0% / 75.5% / 70.4% / 68.5%

Activity/Posture (from acc. magnitude: mean, median, max, std; from posture: max, std)
    Ticker categories: 72.2% / 71.3% / 71.3% / 63.4%
    Ticker texts:      74.1% / 69.9% / 71.8% / 64.8%

Table 6.10: Cross-validation results for multimodal classification. Note that majority and confidence voting yield the same results for the combination of two modalities. Accuracies are listed in the order LDA(dcm) / LDA. Act.: Activity/Posture.

Labels:                           Arousal              Arousal            Valence
                                  (ticker category)    (ticker text)      (ticker category)
Best single classifier:           EDA (73.1%)          EDA (75.5%)        ECG (83.9%)

ECG+EDA (confidence/majority)     78.7% / 76.9%        79.2% / 77.3%      87.1% / 83.9%
ECG+Act. (confidence/majority)    72.7% / 69.9%        73.1% / 73.6%      77.4% / 77.4%
EDA+Act. (confidence/majority)    73.6% / 73.1%        75.9% / 75.0%      71.0% / 77.4%
ECG+EDA+Act. (confidence)         74.5% / 75.9%        75.9% / 76.4%      87.1% / 87.1%
ECG+EDA+Act. (majority)           76.4% / 75.9%        77.3% / 75.9%      80.6% / 80.6%


When feeding all the game minutes to the ECG stress classifier, the classifier predicted "stress" for nearly all game minutes of the matches "Switzerland-Spain", "Italy-Slovakia", and "Germany-Spain". These were exactly the matches that were full of tension until the end. We conclude that the classifier trained with the laboratory stress data is able to predict stress in other, real-life scenarios (e.g. soccer watching).

6.7 Discussion

6.7.1 Elicited emotions

Our results show that a wide range of different emotions, covering each quadrant of the "arousal-valence" space, can be elicited by watching soccer. Moreover, a defeat of the favorite team elicited more negative and a victory more positive emotions. This is consistent with previous research [186, 102].

We have found that the GEW ratings of soccer-induced emotions were comparable to the ratings achieved with standardized film stimuli. It needs to be considered, however, that the soccer and film studies involved different groups of subjects and different situations (a single film vs. a culmination of several match events). Moreover, movies are an illusion of reality requiring a "willing suspension of disbelief" [161], whereas the soccer game is real. For certain subjects, soccer may therefore elicit stronger emotions than a movie can. Our conclusion that soccer games are suitable emotion elicitors therefore rests on solid ground.

6.7.2 Association between ticker categories and emotions

The emotion-assignment questionnaire analysis in Sec. 6.4.2.3 showed that the ticker categories are not associated with single emotions. Moreover, the arousal and valence of the associated emotions can vary within categories (see Table 6.6). Possible reasons could be:

1. Personal characteristics: Individuals differ in their emotional experience depending on their personal characteristics (e.g. gender, age) and personality traits. Our study participants were young, with little variation in age (28.2 ± 2.96 years). We thus assume the age influence to be small. However, gender might play a role: On the one hand, women tend to experience more positive and more negative emotions than men [30]. On the other hand, European men are more likely to be soccer fans than European women [50] and therefore identify more with the team, which in turn leads to more intense affective reactions [204].

2. Personality traits: The emotional reactions of individuals are moderated by their personality: Individuals scoring high in neuroticism are more likely to experience negative emotions and to view the world negatively [205]. On the other hand, individuals scoring high in extroversion experience more positive emotions [127]. Differences in personality traits between two subjects can therefore lead to a different emotional experience even if the emotional stimulus, e.g. a missed chance of the favorite team, is exactly the same.

3. Appraisal of the emotion-eliciting stimulus: As described in Sec. 2.2.4, emotions depend on appraisals of the current situation, regarding:

(a) Relevance: In our experiment, the subjects wanted their favorite team to win the match. However, not all match events were equally relevant for this objective. If, e.g., the match score was already 3:0, the fourth goal was not very relevant. The joy of the spectators in response to the fourth goal was therefore likely to be less intense than when their team scored first. This is supported by the work of Sloan [186], who found no change in positive emotions in basketball spectators for "easy wins".

(b) Harm or benefit: The "harm or benefit" appraisal dimension determines the valence of the elicited emotion [173]. E.g., for a missed chance of the favorite team, some subjects might regard it as hindering their aim of their favorite team winning the match and would therefore experience a negative emotion. Others, however, might appraise it as a good sign that a player of their favorite team nearly scored, and would experience a positive emotion. Such appraisal differences might be responsible for the association of both positive and negative emotions with the category "missed chances of the own team" observed in Fig. 6.7.

(c) Coping: Depending on the person and the situation, people use different coping strategies to deal with an emotion-eliciting situation. Folkman and Lazarus stated eight scales of coping, including "confrontative coping" and "accepting responsibility" [119, 78]. If, e.g., a red card is given to a player of the favorite team, a subject scoring high in "confrontative coping" and low in "accepting responsibility" is more likely to blame the referee and to react with anger. In contrast, a subject scoring low in "confrontative coping" and high in "accepting responsibility" is more likely to acknowledge the fault of the player and to react with disappointment or worry. The red cards were therefore assigned to anger, disappointment, or worry by our test subjects, see Fig. 6.7.

4. Team identification: Wann et al. observed that low-identification basketball fans showed less intense negative emotions to a defeat of their college's team than high-identification fans [204]. They argue that this is due to low-identification fans decreasing their association with the team after a defeat to protect their self-esteem [203], a process called "cutting off reflected failure" [188]. Going further, Crisp et al. [54] observed that not only the intensity of negative emotions varies depending on group identification, but also the quality of negative emotions: Following match losses, low-identification soccer fans felt sad but not angry, whereas high-identification fans felt angry but not sad. This is consistent with our observation of both anger and sadness in response to goals of the adversary team (see Fig. 6.7). Moreover, the decrease in interest during matches ending in a defeat might reflect the mechanism of "cutting off reflected failure" (see Fig. 6.3).


5. Primacy and recency effects: Since the GEW questionnaires were filled in after the match half times, the remembered events suffer from primacy and recency effects [138], i.e. early and late events are more likely to be remembered.

6.7.3 Labeling

We could successfully demonstrate that automatic labeling of "arousal", "positive", and "negative" events using Internet ticker data is feasible. This enables the recording of a large amount of emotional data without cumbersome manual annotation. The availability of such data sets is crucial for the further development of automatic emotion recognition systems [63].

Healey [86] has investigated emotions in real life by collecting EDA, ECG, and activity data. The test subjects had to annotate the emotions they experienced during the day on a mobile phone. However, since subjects often delayed annotation, complex post-processing of the labels involving two raters was necessary to extract sequences of data where the experimenters were sure that a certain emotion was experienced. In contrast, our approach for arousal and valence labeling is automatic, with a time resolution of one minute.

Since our achieved classification accuracies are comparable with Healey's (79.2% vs. 85% [86] for arousal, 87.1% vs. 70% [86] for valence), we conclude that our automatic labeling scheme is competitive with manual annotation.

6.7.4 Classification

Table 6.11 shows a comparison of the maximum emotion recognition accuracies achieved in our experiments involving different emotion elicitation techniques: films (Sec. 4.2), stress (Ch. 5), and soccer.

A higher arousal recognition accuracy was obtained in the soccer experiment than in the film experiment. This result suggests that the emotions elicited by soccer events were stronger than those elicited by films.

The valence classification accuracy for the soccer events was 20% higher than for the film stimuli. We attribute this observation to the effect of team identification: Because the soccer experiment participants identified with their favorite team, they rejoiced and suffered with them. On the other hand, the duration of the film clips (~3 minutes) was probably too short to make the film study participants identify with the film characters in a similar way.

Table 6.11: Comparison of maximum emotion recognition accuracies of our experiments with different emotion elicitation techniques: films, MIST (stress), and soccer.

Elicitation technique    Arousal/Stress    Valence
Films                    73.8%             67.5%
MIST                     82.8%             -
Soccer                   79.2%             87.1%

Wilhelm has stated that "real-life stress can be much more intense than observed in the laboratory" ([211], p. 554). Higher arousal recognition accuracies might therefore be expected for the soccer experiment than for the stress experiment, which was, however, not the case. We assume the following mechanisms to be responsible:

• Imprecise arousal labels: Based on the GEW evaluation, we have assigned an arousal label to all of the events belonging to the categories "goal", "megachances missed by the favorite team", and "red card". However, this might result in wrong labels for some cases or subjects, due to the psychological factors described above (Sec. 6.7.2). Moreover, the ticker categories were all associated with both high arousal (e.g. anger, involvement) and medium arousal (e.g. disappointment, relief) emotions (Fig. 6.7), which might imply variance in the data labeled "arousal".

• Imprecise non-event labels: Even though we defined the "non-events" such that no events were known before, during, or after the corresponding game minute, we cannot rule out the possibility that previous events or the anticipation of future events influenced the emotional and physiological state of the subjects [216, 140], which induces variance in the data labeled "non-events".

• Temporal resolution: The ticker events are associated with minutes of play (i.e. game minutes). However, the corresponding emotional state might reach its maximum before or after this minute of play, which leads to temporal imprecision of the labels.

6.7.5 Suggestions for Future Data Collection

Even though the ticker categories were associated with levels of arousal, they were not associated with single emotions. To solve this issue (and to further refine the arousal labels), the following is recommended for future experiments:

1. Instead of the GEW, a smaller questionnaire including only the emotions of interest can be used to save time.

2. Arousal Labeling:

(a) During each half time, the ticker texts of goals, megachances, and (own) red cards are collected. After the half time, the subjects are asked to rate the emotions they had experienced in response to each of those events. To avoid subjects being influenced by the (possibly emotional) text of the ticker author, the ticker texts could be paraphrased before presenting them to the subjects.

(b) An alternative for label collection might be: Record the TV broadcast of the match, replay the goal, megachance, and red card scenes after the match, and ask the subjects to rate the emotions.


3. Since non-events are likely to be influenced by anticipation and continuation of other events, non-events would best be recorded a day before or after the game, e.g. while watching a documentary.

Note: Asking for specific events (as suggested in points 2a and 2b) would eliminate primacy and recency effects in remembering the events. But these effects would still play a role in subjects remembering their exact emotions at the time of the event. To prevent this problem, the subjects would have to annotate their emotions during the match, which would however distract them from the match. Moreover, the cognitive processes involved in filling in the questionnaires during the match might influence the physiological signals. Therefore, we consider annotation after the match half times inevitable.

6.8 Conclusion

We conclude that soccer watching represents a suitable technique to elicit naturalistic emotions. The induced emotions cover each quadrant of the arousal-valence space and their intensities are comparable to the intensities achieved with standardized film stimuli.

Furthermore, emotions that are difficult to elicit with other, ethically uncritical, techniques, such as anger, disappointment, and feeling disburdened/relief, are strongly elicited by soccer watching.

We have successfully demonstrated that automatic arousal and valence labeling using Internet ticker data is feasible, which facilitates the creation of databases for developing emotion recognition systems. Both the ticker category and the ticker text are suitable for arousal labeling.

Due to individual and situational differences, the ticker categories could not be associated with single emotions. This issue might be avoided in future experiments by asking the subjects for more specific annotations.

Using a multimodal classifier (ECG+EDA) with diagonal covariance LDA, a recognition accuracy of 79.2% was achieved in discriminating "arousal" from "non-events", and 87.1% was achieved in discriminating "positive" from "negative" events. Adding activity/posture as a third modality did not further improve recognition accuracy.


7

Conclusion and Outlook

7.1 Summary of Achievements

This work aimed at automatically recognizing stress and emotions by means of several sensor modalities. Since emotion recognition is a young discipline, it still faces many challenges and open questions. Our work has contributed to the following:

• Data loss due to artifacts is a frequent problem in practical applications, especially when measuring physiological signals with unobtrusive sensors. Previous work on emotion recognition from physiology has rarely addressed this problem. Usually the entire feature vector (including features from all signals) is discarded if only a part of it is corrupted, which results in a substantial loss of data. We have addressed this problem by two methods for handling missing data (imputation and reduced-feature models) in combination with two classifier fusion approaches (majority and confidence voting). With this method, 100% of the data could be analyzed, even though only 47% of the data was artifact-free.

• Existing studies often focus on a single sensor modality, e.g. on physiological signals. However, it can be expected that using several sensor modalities instead of only one would yield better recognition accuracies. This work made a contribution by combining different physiological signals with facial expression in Chap. 4, and with behavioral indicators measured by acceleration in Chap. 5. Results showed that classifier fusion (including facial EMG, ECG, EOG, EDA, respiration and finger temperature) increased the recognition accuracy in comparison to a single classifier by up to 16.3%. A classifier that fused physiological with behavioral data (heart, EDA, head, right hand, left leg, right leg) by majority voting reached 6% higher accuracy than the best single modality classifier.

• Existing studies on stress recognition mostly use a form of "mental stress". Aiming at an experiment that is close to a real-life office situation, we used a combination of mental and psychosocial stress for our experiments. Furthermore, we distinguished stress from cognitive load rather than distinguishing between stress and rest (doing nothing). We have identified specific stress-related features for EDA and for acceleration sensors attached at different places of the body.

• For any recognition experiment, the question arises whether the features and methods generalize to a "real-life" setting. We therefore trained classifiers on the data of the laboratory stress experiment and applied them to the data of three subjects participating in an office stress experiment. The results demonstrated excellent generalization capabilities. When combining the different modalities, 100% accuracy was achieved with majority voting.

• Previous work has mainly investigated emotions and stress in standardized laboratory experiments. This work gradually moved from standardized laboratory experiments to a real-life setting. We have successfully used soccer watching as a technique to elicit naturalistic emotions. The induced emotions cover each quadrant of the arousal-valence space and their intensities are comparable to the intensities achieved with standardized film stimuli. Furthermore, emotions that are difficult to elicit with other, ethically uncritical, techniques are strongly elicited by soccer watching. As expected, positive emotions predominated for a victory of the favorite team, whereas negative emotions predominated for defeats.

• For real-life emotion experiments, the occurrence of specific events is not predictable. Therefore, the events need to be labeled. This can be done online by the experiment leader or the subject, offline by video analysis, or automatically. When aiming at recording a large amount of data, manual labeling of the data becomes time-consuming. For soccer matches, a large amount of live ticker data is available on the Internet, which can be used to label the recorded data. We have shown that both the ticker category and the ticker text can be used for automatic labeling.

• In our real-life soccer experiment, classifiers were trained to distinguish between events of high arousal and game minutes without special incidents. Using a multimodal classifier (ECG+EDA) with diagonal covariance LDA, a recognition accuracy of 79.2% was achieved. An accuracy of 87.1% was achieved in discriminating "positive" from "negative" events.

7.2 Outlook: The Future of Emotion Recognition

The recognition accuracies achieved in the laboratory under standardized conditions (e.g. [34]) are promising. However, we face the problem that accuracies can drop as the experiment setups become less standardized and more "real-life". The ultimate goal of emotion recognition is to develop an emotion recognition technique which reliably works in real life, such that it can be used for commercial applications. The following challenges, which appear when moving from laboratory conditions to "real-life" situations, have to be tackled by the community. In our work, we preferred, whenever possible, the option that was compatible with "real-life", at the cost of a lower accuracy.

Individual Differences: Every person is different and thus reacts in a different way to the same stimulus. Currently, person-dependent training is used to overcome these individual differences. In the future, automatic classifier adaptation might be employed, as proposed for gesture recognition in [79].

Lack of Reliable Baseline: In laboratory experiments, physiological variables are usually measured with respect to a preceding baseline period. However, such a baseline is difficult to acquire in "real life". Even though it might be obtained by summoning the user to "sit still and relax" once or twice a day, the true baseline will shift during the day. Another approach, adopted in this work, is to employ features that are less affected by baseline drift, e.g. standard deviations or slopes.

Segmentation and Labeling: Before classification, a manual selection of relevant time frames is usually employed, e.g. based on certain experiment phases. This is not possible in real life. To the best of our knowledge, all previous experiments resulting in high accuracies rely on manual segmentation and labeling. Our soccer experiment is unique in the sense that it does not rely on manual labeling by the experimenter, but on automatic labeling by freely available Internet ticker data. It may be argued that the soccer experiment profits from this rather unique opportunity of online data and is therefore not generally applicable. But we believe that with the fast progress of the web and online data storage, this technique could be applied in many other experiments. Nevertheless, segmentation and labeling remains an important obstacle to be overcome for emotion recognition in "real life".

Stimulus Variance: Recent studies by Stemmler [190] and Larsen [115] argue that measured physiological signals do not only depend on the emotion itself but also on the situational context. In a laboratory experiment, the stimuli are exactly the same across all subjects. This was the case for our film emotion experiment. In contrast, the soccer experiment included a large variance of stimulation (some games were very hectic, others were rather boring). Furthermore, the evolution of a goal over time is different for each goal, and the elicited emotion depends on the current score. Such variations are typical in "real life" and cannot be neglected.

Unclear Physiological Origin of the Emotions: The somatovisceral afference model of emotion (SAME) proposed by Cacioppo in [37] suggests that the very same physiological pattern can be experienced as emotion A or emotion B, depending on the central processes in the brain. Since we cannot measure the brain processes in an unobtrusive way, Cacioppo's model would lead to the conclusion that it is impossible to reach a perfect discrimination of emotions based on physiological signals. Therefore, to discriminate emotions showing a similar physiological pattern, behavioral and contextual measures must be added.

Lack of Sufficient Data: Because the collection of emotional data involves a lot of work, the databases used in emotion recognition studies usually contain few independent observations. If the number of sensors and calculated features is large, the danger of over-fitting arises. Simple methods, such as the diagonal-covariance LDA used in this work, are less affected by this problem than complex methods. In order for complex methods to exploit their potential, large databases are needed. Publicly available databases would enable the usage of more complex methods, the comparison of different algorithms, and the employment of automatic feature selection techniques.


Appendix A

Additional Information on Feature Calculation

This appendix gives additional information on feature calculation. First, a mathematical formulation of the quantiles, which were used as EDA features, is given. The remaining sections list the parameters used for calculating the pitch frequency and intensity contours of speech in praat [23].

A.1 Quantile Definition

Quantiles were used as EDA features. This section provides a mathematical formulation.

Quantiles (q(p)): Before giving the formulas, we repeat the intuitive explanation of quantiles. Quantiles are closely related to percentiles. A percentile is the value of a variable below which a certain percentage of observations falls [208]. For example, the 30th percentile is a threshold such that 30% of the observed data has smaller values than this threshold. While percentiles are defined by percentages, quantiles are defined by probabilities. The 30th percentile is thus the same as the estimated 0.3-quantile q(p = 0.3), if the underlying calculation algorithms are the same.

More formally, for a continuous random variable Z with cumulative distribution function F_Z(z) = P(Z \le z), the p-quantile is defined as [91]

\[ Q(p) = \inf\{z : F_Z(z) \ge p\}, \qquad 0 < p < 1. \tag{A.1} \]

Given a set of independent observations of the random variable \{z_1, \ldots, z_N\}, the quantiles can be estimated in several ways. The following estimation method is employed in Matlab (which corresponds to definition 5 in [91]): First, the observations are placed in ascending order z_{(1)}, \ldots, z_{(k)}, \ldots, z_{(N)} such that z_{(1)} \le z_{(2)} \le \ldots \le z_{(k)} \le \ldots \le z_{(N)}. Each observation z_{(k)} is then taken as the quantile for the corresponding probability p_k = (k - \tfrac{1}{2})/N (Eq. A.2). For values of p which cannot be associated with a particular observation (i.e. they lie "between" observations), a linear interpolation is used to find the corresponding quantile (Eq. A.3). For values of p below p_1 and above p_N, the minimum and the maximum of the observations are used, respectively (Eq. A.4, A.5). The quantile estimation thus results in:

\begin{align}
q(p) &= z_{(k)} && \text{for } p = p_k = \frac{k - \frac{1}{2}}{N},\; k = 1, 2, \ldots, N \tag{A.2}\\
q(p) &= z_{(k)} + \frac{p - p_k}{p_{k+1} - p_k}\,\bigl(z_{(k+1)} - z_{(k)}\bigr) && \text{for } p_k < p < p_{k+1} \tag{A.3}\\
q(p) &= z_{(1)} && \text{for } p < p_1 \tag{A.4}\\
q(p) &= z_{(N)} && \text{for } p > p_N \tag{A.5}
\end{align}
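The estimator of Eq. A.2-A.5 can be written compactly. The sketch below (our own helper) relies on the fact that linear interpolation over the support points p_k, clipped to the sample minimum and maximum, reproduces all four cases; recent NumPy versions provide the same estimate via np.quantile(z, p, method="hazen").

```python
import numpy as np

def quantile_hf5(z, p):
    """Quantile estimate following Eq. A.2-A.5: support points p_k = (k - 0.5)/N,
    linear interpolation in between, clipping to min/max outside [p_1, p_N]."""
    z = np.sort(np.asarray(z, dtype=float))
    N = z.size
    pk = (np.arange(1, N + 1) - 0.5) / N
    return float(np.interp(p, pk, z))  # np.interp clips to z[0] / z[-1] outside pk
```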

A.2 Parameters for Pitch Frequency Calculation

To set the parameters for pitch frequency calculation, two sound samples from each of the 14 emotion categories in [12] were selected. The parameters were then chosen by comparing the graphic output of praat with the pitch perceived by the listener. Table A.1 shows the selected parameters for the pitch frequency contour calculation.

A.3 Parameters for Intensity Calculation

To compute the intensity contour in praat, the speech signal values are first squared and then convolved with a Gaussian analysis window (a Kaiser-20 window, with sidelobes below -190 dB). Praat adapts the effective duration W of this analysis window according to the parameter "minimum pitch" mp, such that W = 3.2/mp. We calculated mp from the pitch frequency contour according to mp = min(F0). Table A.2 shows all the parameters and the selected values used for the intensity contour calculation.


Table A.1: Parameters for pitch frequency contour calculation (parameter meanings from [23]). The value used in our investigations is given at the end of each entry.

• Method: There are two different algorithms to determine the pitch frequency contour, the autocorrelation and the cross-correlation method. Value used: cross-correlation method (cc).

• Time step: Time step between two consecutive frames for which a pitch frequency value is calculated. Value used: 0.01 s.

• Pitch floor: Forces the algorithm to not recruit pitch frequency candidates below this frequency. Value used: 40 Hz.

• Pitch ceiling: Forces the algorithm to not recruit pitch frequency candidates above this frequency. Value used: 500 Hz.

• Maximum number of candidates: Determines how many pitch frequency candidates will be recruited maximally per sample. Value used: 15 (default).

• Silence threshold: Frames that do not contain amplitudes above the silence threshold (relative to the global maximum amplitude) are considered probably silent. Value used: 0.035.

• Voicing threshold: The strength of the unvoiced candidate, relative to the maximum possible cross-correlation. In order to increase the number of unvoiced decisions, this value has to be increased. Value used: 0.24.

• Octave cost: Degree of favoring high-frequency candidates, relative to the maximum possible cross-correlation. To favour the recruitment of high-frequency candidates more, this value has to be increased. Value used: 0.01 per octave (default).

• Octave-jump cost: Degree of disfavoring pitch changes, relative to the maximum possible cross-correlation. To decrease the number of large frequency jumps, this value must be increased. Value used: 0.35 (default).

• Voiced/unvoiced cost: Degree of disfavoring voiced/unvoiced transitions, relative to the maximum possible cross-correlation. In order to decrease the number of voiced/unvoiced transitions, this value must be increased. Value used: 0.15.

• Option "Very accurate": If "no" (default), a Hanning window with a physical length of 3/(pitch floor) is used; if "yes", a Gaussian window with a physical length of 6/(pitch floor) is used, which results in a higher frequency resolution. Value used: no (results in a window size of 0.075 s).


Table A.2: Parameters for intensity contour calculation (parameter meanings from [23]). The value used in our investigations is given at the end of each entry.

• Minimum pitch mp: Parameter which defines the window size of the Gaussian analysis window. It needs to be set low enough in order to avoid pitch-synchronous intensity modulations in the resulting intensity contour. Value used: calculated from the pitch frequency contour, i.e. min F0.

• Time step: Time step between two consecutive frames for which an intensity value is calculated. Value used: 0.01 s.

• Option "Subtract mean": Option to subtract the mean pressure (i.e. the DC offset) before computing the intensity. Value used: yes (default).


Appendix B

Additional Information on Classification

B.1 Derivation of Discriminant Functions for LDA and QDA

A classifier can be described by a set of discriminant functions \delta_i(x), i = 1, \ldots, n_c for n_c classes. A classifier assigns a feature vector x to class c_i if the corresponding discriminant function yields the largest value, i.e.

\[ C(x) = i \quad \text{if} \quad \delta_i(x) = \max_j \delta_j(x) \tag{B.1} \]

For a minimum error rate classifier, the posterior probability can serve as discriminant function, i.e. \delta_i(x) = P(c_i|x). Hence, the class with the maximum posterior probability will be chosen by the classifier. The discriminant functions for this classifier are not unique: any set of discriminant functions \gamma_i(x) = f(\delta_i(x)), with f(\cdot) a monotonically increasing function, will yield the same decisions as the original classifier using \delta_i(x). The following discriminant functions can therefore be used interchangeably:

\begin{align}
\delta_i(x) &= P(c_i|x) = \frac{p(x|c_i)\,P(c_i)}{p(x)} \tag{B.2}\\
\delta_i(x) &= p(x|c_i)\,P(c_i) \tag{B.3}\\
\delta_i(x) &= \ln(p(x|c_i)) + \ln(P(c_i)) \tag{B.4}
\end{align}

where P(c_i) denotes the prior probability of class c_i and p(x) the evidence, i.e. the probability of observing x. Since p(x) is constant over all \delta_i(x) for a given observation x, p(x) in Eq. B.2 can be omitted, which results in Eq. B.3. To obtain Eq. B.4, the natural logarithm is applied to Eq. B.3.

In linear and quadratic discriminant analysis (LDA and QDA), the state-conditional probability density functions of x, p(x|c_i), are assumed to be multivariate normal:

\[ p(x|c_i) \sim N(\mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right] \tag{B.5} \]

where \mu_i is the mean vector and \Sigma_i the covariance matrix of the distribution of the d-dimensional feature vector x for class c_i, and |\Sigma_i| is the determinant of the covariance matrix.

Inserting Eq. B.5 into Eq. B.4 yields:

\[ \delta_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma_i|) + \ln(P(c_i)) \tag{B.6} \]

Since \frac{d}{2}\ln(2\pi) is a constant term, it can be omitted. The quadratic discriminant functions for QDA thus result in:

\[ \delta_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\ln(|\Sigma_i|) + \ln(P(c_i)) \tag{B.7} \]

In the special case where the covariance matrices are assumed to be equal for all classes, i.e. \Sigma_i = \Sigma \;\forall i, Eq. B.7 can be simplified: the terms \frac{1}{2}\ln(|\Sigma_i|) and x^T \Sigma^{-1} x become independent of the class and can be omitted. This results in the linear discriminant functions for LDA:

\[ \delta_i(x) = x^T \Sigma^{-1} \mu_i - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln(P(c_i)) \tag{B.8} \]
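For completeness, the discriminant functions of Eq. B.7 and B.8 translate directly into code. The sketch below (our own helpers) assumes that the class means, covariance matrices and priors have already been estimated from training data.

```python
import numpy as np

def qda_discriminants(x, means, covs, priors):
    """Evaluate the QDA discriminant functions of Eq. B.7 for one feature vector x."""
    scores = []
    for mu, sigma, prior in zip(means, covs, priors):
        diff = x - mu
        scores.append(-0.5 * diff @ np.linalg.solve(sigma, diff)
                      - 0.5 * np.log(np.linalg.det(sigma))
                      + np.log(prior))
    return np.array(scores)

def lda_discriminants(x, means, cov, priors):
    """Evaluate the LDA discriminant functions of Eq. B.8 (shared covariance matrix).
    A diagonal-covariance (dcm) variant is obtained by estimating `cov` as a diagonal matrix."""
    cov_inv = np.linalg.inv(cov)
    return np.array([x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(prior)
                     for mu, prior in zip(means, priors)])

# The predicted class is the argmax of the discriminant scores (Eq. B.1).
```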

B.2 Support Vector Classifiers and Support Vector Machines

This appendix provides additional details on Support Vector Classifiers and Support Vector Machines (SVM).

If the training data is not linearly separable, the optimization problem stated for the Support Vector Classifier in Eq. 3.15-3.18 has no solution. To allow some points to be on the "wrong" side of the margin, the constraint 3.18 needs to be adapted by slack variables \xi = (\xi_1, \xi_2, \ldots, \xi_N):

\[ y_k(\beta_0 + \beta^T x_k) \ge M(1 - \xi_k), \qquad \xi_k \ge 0, \qquad \sum_{k=1}^{N} \xi_k \le \text{constant} \]

Support Vector Classifiers are only able to generate linear decision boundaries. Support Vector Machines (SVM), on the other hand, transform the training data into a higher-dimensional space using a set of D basis functions (with D > d). A d-dimensional training sample x_k is thus transformed to \phi(x_k) = [\phi_1(x_k), \phi_2(x_k), \ldots, \phi_D(x_k)].


The training of an SVM is analogous to the training of the Support Vector Classifier, i.e. an optimal separating (linear) hyperplane f(x) = \beta_0 + \beta^T \phi(x) = 0 is determined in the D-dimensional space, and the new classification rule is C(x_k) = \mathrm{sign}(f(x_k)). The linear boundaries in the D-dimensional space then correspond to non-linear boundaries in the original d-dimensional space.

During SVM training, the transformation function \phi(x) is not needed explicitly; knowledge of the scalar product \langle\phi(x), \phi(x')\rangle is sufficient. For the scalar products, the following kernel functions were used in this work:

• d'-dimensional polynomial with d' = 1, 2, 3:

\[ K(x, x') = \langle x, x' \rangle \cdot \bigl(1 + \langle x, x' \rangle\bigr)^{d'-1} \tag{B.9} \]

• Radial basis functions (with \sigma = 1):

\[ K(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \tag{B.10} \]
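The two kernels of Eq. B.9 and B.10 read directly as code (our own helper names). Note that off-the-shelf SVM implementations such as scikit-learn's SVC use the more common polynomial form (gamma * <x, x'> + r)^degree, which differs slightly from Eq. B.9.

```python
import numpy as np

def polynomial_kernel(x, x_prime, d_prime=2):
    """Polynomial kernel of Eq. B.9 with degree parameter d' in {1, 2, 3}."""
    s = float(np.dot(x, x_prime))
    return s * (1.0 + s) ** (d_prime - 1)

def rbf_kernel(x, x_prime, sigma=1.0):
    """Radial basis function kernel of Eq. B.10 (sigma = 1 in this work)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
```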

B.3 Parameters for Subtractive Clustering

Subtractive clustering is an unsupervised algorithm for estimating the number and the location of cluster centers from numerical data, proposed by Chiu [44]. In this work, subtractive clustering was used to calculate acceleration cluster features, as described in Sec. 5.5.2.

The basic idea of subtractive clustering has been introduced in Sec. 3.2.5. In this section, we provide the formulas for the stopping criterion of the subtractive clustering algorithm as proposed in [44].

Given the j-th cluster center x*_j and its potential P*_j, the following stopping criterion is used:

if P*_j > \bar{\varepsilon} P*_1
    Accept x*_j as a cluster center and continue.
else if P*_j < \underline{\varepsilon} P*_1
    Reject x*_j and end the clustering process.
else
    Let d_min be the shortest distance between x*_j and the previously selected cluster centers.
    if d_min / r_a + P*_j / P*_1 >= 1
        Accept x*_j as a cluster center and continue.
    else
        Reject x*_j and set the potential of x*_j to 0. Select the data point with the next highest potential as the new x*_j and re-test.
    end
end

The parameter \bar{\varepsilon} represents the threshold for the potential above which the point x*_j is definitely accepted as a new cluster center, whereas \underline{\varepsilon} represents the threshold below which x*_j is definitely rejected. If the potential lies in the "grey zone" between the two thresholds, the decision whether to accept x*_j as a new cluster center is based on a trade-off between its potential P*_j and its distance from the already selected cluster centers. In Sec. 5.5.2, we chose \bar{\varepsilon} = 0.5 and \underline{\varepsilon} = 0.15, following the recommendation of Chiu [44].
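The acceptance test above maps directly onto a small helper; the sketch below is our own formulation and returns a symbolic decision rather than modifying the potentials in place.

```python
import numpy as np

def accept_center(candidate, potential, first_potential, centers, r_a,
                  eps_upper=0.5, eps_lower=0.15):
    """Decide the fate of a candidate cluster center in subtractive clustering:
    'accept', 'reject_and_stop', or 'retry' (discard the candidate and test the
    point with the next highest potential)."""
    ratio = potential / first_potential
    if ratio > eps_upper:
        return "accept"
    if ratio < eps_lower:
        return "reject_and_stop"
    d_min = min(np.linalg.norm(np.asarray(candidate) - np.asarray(c)) for c in centers)
    if d_min / r_a + ratio >= 1.0:
        return "accept"
    return "retry"
```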


Nomenclature

ANS Autonomic nervous system

bpm beats per minute

BVP Blood Volume Pulse

CoP Center of Pressure

dcm diagonal covariance matrix

ECG Electrocardiogram

EDA Electrodermal activity

EMG Electromyography

EOG Electrooculography

FCM Fuzzy C-Means (clustering)

FT Finger temperature

GEW Geneva Emotion Wheel

IAPS International Affective Picture System

LDA Linear Discriminant Analysis

LH Left Hand

LL Left Leg

NCC Nearest Class Center (classifier)

NS.SCR Non-specific Skin Conductance Response

QDA Quadratic Discriminant Analysis

RH Right Hand

RL Right Leg

RMS Root mean square

SCL Skin Conductance Level

SOM Self-Organizing Map


SVM Support Vector Machine

Zygomaticus Major Smiling muscle


Bibliography

[1] TMS International: http://www.tmsi.com. (cited on pages 55 and 56.)

[2] Characterizing artefact in the normal human 24-hour RR time series to aid identification and artificial replication of circadian variations in human beat to beat heart rate using a simple threshold (2002). (cited on pages 18 and 57.)

[3] A. Kapoor, W. Burleson, R. W. Picard. Automatic prediction of frustration. Int J Hum-Comput Stud, 65(8): 724–736 (2007). (cited on page 15.)

[4] A. Madan, A. S. Pentland. VibeFones: Socially Aware Mobile Phones. In Tenth IEEE International Symposium on Wearable Computers, pp. 109–112 (2006). (cited on page 13.)

[5] A. Álvarez, I. Cearreta, J. M. López, A. Arruti, E. Lazkano, B. Sierra, & N. Garay. A comparison using different speech parameters in the automatic emotion recognition using feature subset selection based on evolutionary algorithms. In Proceedings of the 10th international conference on Text, speech and dialogue, pp. 423–430. TSD'07, Springer-Verlag, Berlin, Heidelberg. ISBN 3-540-74627-7, 978-3-540-74627-0 (2007). (cited on page 34.)

[6] American Psychiatric Association. Diagnostic and statistical manual of mental disorders (4th ed., text rev.) (2000). (cited on page 2.)

[7] B. Arnrich, C. Setz, R. La Marca, G. Tröster, & U. Ehlert. Self Organizing Maps for Affective State Detection. In Machine Learning for Assistive Technologies (2010). (cited on page 45.)

[8] B. Arnrich, C. Setz, R. La Marca, G. Tröster, & U. Ehlert. What does your chair know about your stress level? IEEE Transactions on Information Technology in Biomedicine, 14(2): 207–214 (2010). (cited on pages 78, 79, and 80.)

[9] A. Atkielski. Schematic diagram of normal sinus rhythm for a human heart as seen on ECG. Wikimedia Commons, Public Domain license. URL http://commons.wikimedia.org/wiki/File:SinusRhythmLabels.png (2006). (cited on page 18.)

[10] M. Bächlin, D. Roggen, & G. Tröster. Context-aware platform for long-term lifestyle management and medical signal analysis. In Proc. of the 2nd Sensation Int. Conf. (2007). (cited on pages 32 and 33.)

[11] T. Balomenos, A. Raouzaiou, S. Ioannou, A. Drosopoulos, K. Karpouzis, & S. Kollias. Emotion Analysis in Man-Machine Interaction Systems, 3361/2005 of Lecture Notes in Computer Science. Springer (2005). (cited on page 3.)


[12] R. Banse & K. R. Scherer. Acoustic profiles in vocal emotion expression. J Pers Soc Psychol, 70(3): 614–636 (1996). (cited on pages 13, 45, 46, and 146.)

[13] T. Bänziger & K. Scherer. Using Actor Portrayals to Systematically Study Multimodal Emotion Expression: The GEMEP Corpus. In ACII '07: Proceedings of the Second International Conference on Affective Computing and Intelligent Interaction, pp. 476–487. Springer (2007). (cited on pages 13 and 46.)

[14] T. Bänziger, V. Tran, & K. R. Scherer. The Emotion Wheel - A tool for the verbal report of emotional reactions. Poster presented at the conference of the International Society of Research on Emotion, Bari, Italy (2005). (cited on pages 49, 112, and 113.)

[15] S. G. Barsade. The Ripple Effect: Emotional Contagion and Its Influence on Group Behavior. Administrative Science Quarterly, 47(4): 644–675 (2002). (cited on page 2.)

[16] J. V. Basmajian & C. J. De Luca. Muscles Alive: Their Functions Revealed by Electromyography. Williams & Wilkins, 5th edn. ISBN 068300414X (1985). (cited on page 27.)

[17] D. Bernhardt & P. Robinson. Interactive control of music using emotional body expressions. In CHI '08 extended abstracts on Human factors in computing systems, pp. 3117–3122. CHI EA '08, ACM, New York, NY, USA. ISBN 978-1-60558-012-8 (2008). (cited on page 3.)

[18] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA. ISBN 0306406713 (1981). (cited on page 127.)

[19] M. Bhatti, Y. Wang, & L. Guan. A neural network approach for human emotion recognition in speech. In Circuits and Systems, 2004. ISCAS '04. Proceedings of the 2004 International Symposium on, Vol. 2, pp. II-181–4 (2004). (cited on page 46.)

[20] N. Bianchi-Berthouze & A. Kleinsmith. A categorical approach to affective gesture recognition. Connection Science, 15(4): 259–269 (2003). (cited on pages 13 and 32.)

[21] M. E. Bleil, P. J. Gianaros, J. R. Jennings, J. D. Flory, & S. B. Manuck. Trait negative affect: toward an integrated model of understanding psychological risk for impairment in cardiac autonomic function. Psychosom Med, 70: 328–337 (2008). (cited on page 20.)

[22] P. Boersma. Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound. Institute of Phonetic Sciences, University of Amsterdam, Proceedings, 17: 97–110 (1993). (cited on page 34.)

[23] P. Boersma & D. Weenink. Praat: doing phonetics by computer (Version 4.5.19) [Computer program] (retrieved May 23, 2007, from http://www.praat.org/) (2007). (cited on pages 34, 36, 145, 147, and 148.)

[24] W. Boucsein. Psychophysiological investigation of stress induced by temporal factors in human-computer interaction. In M. Frese, E. Ulich, & W. Dzida (eds.), Psychological issues of human-computer interaction in the work place, pp. 163–181. North-Holland Publishing Co., Amsterdam. ISBN 0-444-70318-7 (1987). (cited on page 24.)

[25] W. Boucsein. Electrodermal Activity. Plenum Press, New York. ISBN 0306442140 (1992). (cited on pages 20, 23, 24, 86, and 99.)

[26] W. Boucsein. Physiologische Grundlagen und Messmethoden der dermalen Aktivität. In F. Rösler (ed.), Enzyklopädie der Psychologie, Bereich Psychophysiologie, Band 1: Grundlagen und Methoden der Psychophysiologie, pp. 551–623. Hogrefe, Göttingen (2001). (cited on pages 23 and 24.)

[27] A. v. Boxtel. Facial EMG as a tool for inferring affective states. In Proceedings of Measuring Behavior, pp. 104–108 (2010). (cited on page 27.)

[28] M. M. Bradley & P. J. Lang. The International Affective Picture System (IAPS) in the Study of Emotion and Attention. In J. A. Coan & J. J. B. Allen (eds.), The handbook of emotion elicitation and assessment, chap. 2, pp. 29–46. Oxford University Press, New York (2007). (cited on pages 13 and 14.)

[29] Brainclinics. URL http://www.brainclinics.com. (cited on pages 55 and 69.)

[30] L. R. Brody & J. A. Hall. Gender and Emotion in Context. In M. Lewis, J. Haviland-Jones, & L. Barrett (eds.), Handbook of Emotions, Third Edition. Guilford Publications. ISBN 9781609180447 (2010). (cited on page 135.)

[31] A. Bulling, D. Roggen, & G. Tröster. It's in your eyes: towards context-awareness and mobile HCI using wearable EOG goggles. In Proceedings of the 10th international conference on Ubiquitous computing, pp. 84–93. UbiComp '08, ACM, New York, NY, USA. ISBN 978-1-60558-136-1 (2008). (cited on page 28.)

[32] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, & B. Weiss. A Database of German Emotional Speech. In Interspeech 2005, pp. 1517–1520 (2005). (cited on page 13.)

[33] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Min Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In ICMI '04: Proc. of the 6th Int. Conf. on Multimodal interfaces, pp. 205–211 (2004). (cited on page 13.)

[34] C. L. Lisetti, F. Nasoz. Using noninvasive wearable computers to recognize human emotions from physiological signals. EURASIP J Appl Signal Process, 2004(1): 1672–1687 (2004). (cited on pages 4, 13, 15, 16, 48, 61, and 142.)

[35] C. Yu, P. M. Aoki, A. Woodruff. Detecting user engagement in everyday conversations. In ICSLP '04: 8th Int. Conf. on Spoken Language Processing, Vol. 2, pp. 1329–1332 (2004). (cited on page 13.)

[36] J. T. Cacioppo, G. G. Berntson, D. J. Klein, & K. M. Poehlmann. Psychophysiology of Emotion Across the Life Span. Annual Review of Gerontology and Geriatrics, 17: 27–74 (1997). (cited on pages 8, 9, and 10.)


[37] J. T. Cacioppo, G. G. Berntson, J. T. Larsen, K. M. Poehlmann, & T. A. Ito. Handbook of Emotions (2nd ed.), chap. The Psychophysiology of Emotion, pp. 173–191. The Guilford Press, New York, 2nd edn. (2000). (cited on pages 7, 8, 9, 10, 58, and 143.)

[38] J. T. Cacioppo, L. G. Tassinary, & G. G. Berntson (eds.). Handbook of psychophysiology. University Press, Cambridge, 3rd edn. (2007). (cited on pages 11 and 23.)

[39] S. D. Campbell, K. K. Kraning, E. G. Schibli, & S. T. Momii. Hydration characteristics and electrical resistivity of stratum corneum using a noninvasive four-point microelectrode method. J Invest Dermatol, 69: 290–295 (1977). (cited on page 24.)

[40] W. B. Cannon. The James-Lange theory of emotion: A critical examination and an alternative theory. American Journal of Psychology, 39: 106–124 (1927). (cited on pages 6 and 7.)

[41] A. Caspi, K. Sugden, T. E. Moffitt, A. Taylor, I. W. Craig, H. Harrington, J. McClay, J. Mill, J. Martin, A. Braithwaite, & R. Poulton. Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science, 301: 386–389 (2003). (cited on page 2.)

[42] G. Castellano, S. Villalba, & A. Camurri. Recognising Human Emotions from Body Movement and Gesture Dynamics. In A. Paiva, R. Prada, & R. Picard (eds.), Affective Computing and Intelligent Interaction, ACII, 4738 of Lecture Notes in Computer Science, pp. 71–82. Springer Berlin / Heidelberg. 10.1007/978-3-540-74889-2-7 (2007). (cited on pages 13 and 32.)

[43] K. B. Chan, G. Lai, Y. C. Ko, & K. W. Boey. Work stress among six professional groups: the Singapore experience. Social Science & Medicine, 50(10): 1415–1432 (2000). (cited on pages 65 and 103.)

[44] S. Chiu. Fuzzy Model Identification based on cluster estimation. Journal of Intelligent Fuzzy Systems, 2: 267–278 (1994). (cited on pages 41, 42, 151, and 152.)

[45] J.-W. Chung & S. G. Vercoe. The affective remixer: personalized music arranging. In CHI '06: CHI '06 extended abstracts on Human factors in computing systems, pp. 393–398. ACM Press, New York, NY, USA (2006). (cited on page 3.)

[46] G. D. Clifford. Open Source Code: http://www.mit.edu/~gari/CODE/ECGtools/. (cited on pages 18 and 57.)

[47] G. D. Clifford. Advanced Methods & Tools for ECG Data Analysis, chap. ECG Statistics, Noise, Artifacts, and Missing Data, pp. 55–99. Artech House (2006). (cited on page 20.)

[48] P. Cobos, M. Sanchez, N. Perez, & J. Vila. Brief report: Effects of spinal cord injuries on the subjective component of emotions. Cognition & Emotion, 18(2): 281–287 (2004). (cited on page 7.)

[49] S. Cohen, E. Frank, W. J. Doyle, D. P. Skoner, B. S. Rabin, & J. M. Gwaltney. Types of stressors that increase susceptibility to the common cold in healthy adults. Health Psychol, 17: 214–223 (1998). (cited on page 2.)


[50] S. I. R. C. commissioned by Canon". Football Passions. Report on Websiteof Social Issues Research Centre. URL http://www.sirc.org/football/

football_passions.shtml (2008). (cited on pages 110 and 135.)

[51] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, & J. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1): 32–80 (2001). (cited on pages 3, 32, 49, and 109.)

[52] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, & J. G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1): 32–80 (2001). (cited on pages 12 and 13.)

[53] A. Crider. Electrodermal Response Lability-Stability: Individual Difference Correlates. In J.-C. Roy, W. Boucsein, D. C. Fowles, & J. H. Gruzelier (eds.), Progress in Electrodermal Research, 249 of NATO ASI Series, pp. 173–186. Springer US. ISBN 978-1-4615-2864-7 (1993). (cited on page 23.)

[54] R. J. Crisp, S. Heuston, M. J. Farr, & R. N. Turner. Seeing Red or Feeling Blue: Differentiated Intergroup Emotions and Ingroup Identification in Soccer Fans. Group Processes & Intergroup Relations, 10: 9–26 (2007). (cited on pages 110, 111, 114, 119, and 136.)

[55] H. D. Critchley. Electrodermal responses: what happens in the brain. Neuroscientist, 8: 132–142 (2002). (cited on page 23.)

[56] Cyc. Graphic showing the maximum separating hyperplane and the margin. Wikimedia Commons, Public Domain license. URL http://commons.wikimedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png (2011). (cited on page 39.)

[57] D. Neiberg, K. Elenius, & K. Laskowski. Emotion Recognition in Spontaneous Speech Using GMMs. In ICSLP '06: 9th Int. Conf. on Spoken Language Processing, pp. 809–812 (2006). (cited on page 13.)

[58] C. Darrow. The rationale for treating the change in galvanic skin response as a change in conductance. Psychophysiology, 1: 31–38 (1964). (cited on page 20.)

[59] C. Darwin. The expression of the emotions in man and animals. John Murray, London, 1st edn. (1872). (cited on page 6.)

[60] K. Dedovic, R. Renwick, N. K. Mahani, V. Engert, S. J. Lupien, & J. C. Pruessner. The Montreal Imaging Stress Task: using functional imaging to investigate the effects of perceiving and processing psychosocial stress in the human brain. Journal of Psychiatry & Neuroscience, 30(5): 319–325 (2005). (cited on pages 13, 14, 66, and 103.)

[61] S. D'Mello, S. Craig, B. Gholson, S. Franklin, R. Picard, & A. Graesser. Integrating Affect Sensors in an Intelligent Tutoring System. In Affective Interactions: The Computer in the Affective Loop Workshop at 2005 International Conference on Intelligent User Interfaces, pp. 7–13 (2005). (cited on pages 13 and 32.)

[62] B. Dokhan, C. Setz, B. Arnrich, & G. Tröster. Monitoring passenger's breathing - a feasibility study. In Swiss Society of Biomedical Engineering Annual Meeting. Neuchâtel, Switzerland (2007). (cited on pages 29 and 55.)

[63] E. Douglas-Cowie. Editorial: Data and Databases. In P. Petta, R. Cowie, & C. Pelachaud (eds.), Emotion-Oriented Systems - The Humaine Handbook, pp. 163–166. Springer (2011). (cited on page 137.)

[64] R. O. Duda, P. E. Hart, & D. G. Stork. Pattern Classification. Wiley-Interscience Publication (2000). (cited on pages 36 and 72.)

[65] J. C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, 3(3): 32–57 (1973). (cited on page 127.)

[66] R. Edelberg. Electrodermal mechanisms: A critique of the two-effector hypothesis and a proposed replacement. In J. Roy, W. Boucsein, D. Fowles, & J. Gruzelier (eds.), Progress in electrodermal research: From physiology to psychology, pp. 7–30. New York (1993). (cited on page 24.)

[67] U. Ehlert. Das endokrine System. In U. Ehlert & R. von Känel (eds.), Psychoendokrinologie und Psychoimmunologie, pp. 3–36. Springer. ISBN 9783642169632 (2010). (cited on page 12.)

[68] E. Eich, J. T. W. Ng, D. Macaulay, A. D. Percy, & I. Grebneva. Combining Music With Thought to Change Mood. In J. A. Coan & J. J. B. Allen (eds.), Handbook of emotion elicitation and assessment, chap. 8, pp. 124–136. Series in affective science, Oxford University Press (2007). (cited on pages 13 and 14.)

[69] P. Ekman. The differential communication of affect by head and body cues. Journal of Personality and Social Psychology, 2: 726–735 (1965). (cited on pages 31, 81, and 82.)

[70] P. Ekman. Handbook of Cognition and Emotion, chap. Basic Emotions, pp. 45–60. John Wiley & Sons, Ltd. (1999). (cited on pages 6 and 12.)

[71] P. Ekman, W. V. Friesen, M. O'Sullivan, A. Chan, I. Diacoyanni-Tarlatzis, K. Heider, R. Krause, W. A. LeCompte, T. Pitcairn, P. E. Ricci-Bitti, K. R. Scherer, M. Tomita, & A. Tzavaras. Universals and cultural differences in the judgments of facial expressions of emotion. Journal of Personality and Social Psychology, 53: 712–717 (1987). (cited on page 12.)

[72] P. Ekman. The Directed Facial Action Task: Emotional Responses Without Appraisal. In J. A. Coan & J. J. B. Allen (eds.), The handbook of emotion elicitation and assessment, chap. 3, pp. 47–53. Oxford University Press, New York (2007). (cited on pages 13 and 14.)

[73] I. S. Engberg & A. V. Hansen. Documentation of the Danish Emotional Speech Database DES. URL http://cpk.auc.dk/~tb/speech/Emotions/des.pdf. Aalborg (1996). (cited on page 13.)

[74] European Foundation for the Improvement of Living and Working Conditions. Work-related stress. URL http://www.eurofound.europa.eu/ewco/reports/TN0502TR01/TN0502TR01.pdf. (cited on pages 6 and 65.)

[75] European Foundation for the Improvement of Living and Working Conditions. Changes over time - First findings from the fifth European Working Conditions Survey. URL http://www.eurofound.europa.eu/pubdocs/2010/74/en/3/EF1074EN.pdf (2010). (cited on page 103.)

[76] B. Fasel & J. Luettin. Automatic facial expression analysis: a survey. Pattern Recognition, 36(1) (2003). (cited on page 3.)

[77] D. L. Filion, M. E. Dawson, & A. M. Schell. The psychological significance of human startle eyeblink modification: a review. Biol Psychol, 47: 1–43 (1998). (cited on page 28.)

[78] S. Folkman & R. S. Lazarus. Manual for the Ways of Coping Questionnaire. Consulting Psychologists Press, Palo Alto, CA (1988). (cited on page 136.)

[79] K. Förster, A. Biasiucci, R. Chavarriaga, J. del R. Millán, D. Roggen, & G. Tröster. On the use of brain decoded signals for online user adaptive gesture recognition systems. In Proceedings of the Eighth International Conference on Pervasive Computing (2010). (cited on page 143.)

[80] R. R. Freedman. Physiological mechanisms of temperature biofeedback. Applied Psychophysiology and Biofeedback, 16: 95–115. 10.1007/BF01000184 (1991). (cited on page 31.)

[81] A. J. Fridlund & J. T. Cacioppo. Guidelines for human electromyographic research. Psychophysiology, 23: 567–589 (1986). (cited on pages 27 and 28.)

[82] P. Grossman, J. van Beek, & C. Wientjes. A Comparison of Three Quantification Methods for Estimation of Respiratory Sinus Arrhythmia. Psychophysiology, 27(6): 702–714 (1990). (cited on page 29.)

[83] P. S. Hamilton & W. J. Tompkins. Quantitative investigation of QRS detection rules using the MIT/BIH arrhythmia database. IEEE Trans Biomed Eng, 33(12): 1157–1165 (1986). (cited on page 18.)

[84] E. Harmon-Jones, D. M. Amodio, & L. R. Zinner. Social Psychological Methods of Emotion Elicitation. In J. A. Coan & J. J. B. Allen (eds.), The handbook of emotion elicitation and assessment, chap. 6, pp. 91–105. Oxford University Press, New York (2007). (cited on pages 13, 14, 66, and 103.)

[85] T. Hastie, R. Tibshirani, & J. H. Friedman. The Elements of Statistical Learning. Springer, 2nd edn. ISBN 978-0-387-84857-0 (2009). (cited on pages 36, 38, and 76.)

[86] J. Healey, L. Nachman, S. Subramanian, J. Shahabdeen, & M. E. Morris. Out of the Lab and into the Fray: Towards Modeling Emotion in Everyday Life. In Pervasive, pp. 156–173 (2010). (cited on pages 133 and 137.)

[87] J. A. Healey & R. W. Picard. Detecting stress during real-world driving tasks using physiological sensors. IEEE Transactions on Intelligent Transportation Systems, 6(2): 156–166 (2005). (cited on page 16.)

[88] W. J. Hess. Pitch and Voicing Determination of Speech with an Extension Toward Music Signals. In J. Benesty, M. Sondhi, & Y. Huang (eds.), Springer handbook of speech processing. Springer Handbook Series, Springer. ISBN 9783540491255 (2008). (cited on page 34.)

[89] I. Homma & Y. Masaoka. Breathing rhythms and emotions. Exp Physiol, 93: 1011–1021 (2008). (cited on page 29.)

[90] M. Hoque, M. Yeasin, & M. Louwerse. Robust Recognition of Emotion from Speech. In J. Gratch, M. Young, R. Aylett, D. Ballin, & P. Olivier (eds.), Intelligent Virtual Agents, 4133 of Lecture Notes in Computer Science, pp. 42–53. Springer Berlin / Heidelberg. ISBN 978-3-540-37593-7 (2006). (cited on page 32.)

[91] R. J. Hyndman & Y. Fan. Sample Quantiles in Statistical Packages. The American Statistician, 50(4): 361–365 (1996). (cited on page 145.)

[92] TMS International (TMSi). Measuring respiration using the Cardio-Respiratory belt. URL http://www.tmsi.com/?id=27. (cited on page 29.)

[93] J. Wagner, J. Kim, & E. André. From Physiological Signals to Emotions: Implementing and Comparing Selected Methods for Feature Extraction and Classification. In ICME 2005. Amsterdam, Netherlands (2005). (cited on pages 15 and 61.)

[94] S. C. Jacobs, R. Friedman, J. D. Parker, G. H. Tofler, A. H. Jimenez, J. E. Muller, H. Benson, & P. H. Stone. Use of skin conductance changes during mental stress testing as an index of autonomic arousal in cardiovascular research. American Heart Journal, 128: 1170–1177 (1994). (cited on page 24.)

[95] W. James. What is an emotion? Mind, 9: 188–205 (1884). (cited on page 6.)

[96] J. H. Janssen, E. L. van den Broek, & J. H. Westerink. Personalized affective music player. In J. Cohn, A. Nijholt, & M. Pantic (eds.), Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, pp. 472–477. IEEE Computer Society (2009). (cited on page 3.)

[97] A. Kapoor, W. Burleson, & R. W. Picard. Automatic prediction of frustration. In International Journal of Human Computer Studies (2007). (cited on page 3.)

[98] A. Kapoor & R. W. Picard. Multimodal affect recognition in learning environments. In Multimedia '05: Proc. of the 13th annual ACM Int. Conf. on Multimedia, pp. 677–682. ACM, New York, USA. ISBN 1-59593-044-2 (2005). (cited on page 16.)

[99] C. D. Katsis, G. Ganiatsas, & D. I. Fotiadis. An integrated telemedicine platform for the assessment of affective physiological states. Diagnostic Pathology, 1: 16 (2006). (cited on page 16.)

[100] D. Keltner & J. Haidt. Social Functions of Emotions at Four Levels of Analysis. Cognition & Emotion, 13(5): 505–521 (1999). (cited on page 3.)

[101] D. Keltner & A. M. Kring. Emotion, social function, and psychopathology. Review of General Psychology, 2(3): 320–342 (1998). (cited on page 2.)

[102] J. H. Kerr, G. V. Wilson, I. Nakamura, & Y. Sudo. Emotional dynamics of soccer fans at winning and losing games. Personality and Individual Differences, 38(8): 1855–1866 (2005). (cited on pages 110, 111, 114, 116, 118, 119, and 135.)

[103] J. Kim & E. André. Emotion Recognition Using Physiological and Speech Signal in Short-Term Observation. In Perception and Interactive Technologies, pp. 53–64. LNAI 4201, Springer-Verlag Berlin Heidelberg (2006). (cited on pages 4 and 15.)

[104] J. Kim & E. André. Emotion Recognition Based on Physiological Changes in Music Listening. IEEE Trans Pattern Anal Mach Intell, 30(12): 2067–2083 (2008). (cited on pages 15, 37, and 76.)

[105] C. Kirschbaum, K. M. Pirke, & D. H. Hellhammer. The 'Trier Social Stress Test' – a tool for investigating psychobiological stress responses in a laboratory setting. Neuropsychobiology, 28: 76–81 (1993). (cited on pages 13 and 14.)

[106] P. R. Kleinginna & A. M. Kleinginna. A categorized list of emotion definitions, with suggestions for a consensual definition. Motivation and Emotion, 5(4): 345–379 (1981). (cited on page 5.)

[107] O. Kohlisch & F. Schaefer. Physiological changes during computer tasks: responses to mental load or to motor demands? Ergonomics, 39(2): 213–224 (1996). (cited on page 76.)

[108] T. Kohonen. Self-organizing maps, 30 of Springer Series in Information Sciences. Springer. ISBN 3-540-67921-9 (2001). (cited on pages 39 and 45.)

[109] B. E. Kok & B. L. Fredrickson. Upward spirals of the heart: autonomic flexibility, as indexed by vagal tone, reciprocally and prospectively predicts positive emotions and social connectedness. Biol Psychol, 85: 432–436 (2010). (cited on page 20.)

[110] V. Kolodyazhniy, S. D. Kreibig, J. J. Gross, W. T. Roth, & F. H. Wilhelm. An affective computing approach to physiological emotion specificity: toward subject-independent and stimulus-independent classification of film-induced emotions. Psychophysiology, 48: 908–922 (2011). (cited on pages 4 and 62.)

[111] W. Kruskal & W. Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, pp. 583–621 (1952). (cited on pages 116 and 130.)

[112] W. Kuhmann, W. Boucsein, F. Schaefer, & J. Alexander. Experimental investigation of psychophysiological stress-reactions induced by different system response times in human-computer interaction. Ergonomics, 30: 933–943 (1987). (cited on page 24.)

[113] O. W. Kwon, K. Chan, J. Hao, & T. W. Lee. Emotion Recognition by Speech Signals. In Eighth European Conference on Speech Communication and Technology. ISCA (2003). (cited on page 34.)

[114] L. C. De Silva, T. Miyasato, & R. Nakatsu. Facial emotion recognition using multi-modal information. In ICICS '97: Int. Conf. on Information, Communications and Signal Processing, pp. 397–401 (1997). (cited on page 13.)

[115] J. T. Larsen, G. G. Berntson, K. M. Poehlmann, T. A. Ito, & J. T. Cacioppo. Handbook of Emotions (3rd ed.), chap. The Psychophysiology of Emotion, pp. 180–195. The Guilford Press, New York, 3rd edn. (2006). (cited on pages 11 and 143.)

[116] R. Lazarus & S. Folkman. Stress, appraisal, and coping. Springer Series, Springer. ISBN 9780826141910 (1984). (cited on page 12.)

[117] R. S. Lazarus. Psychological stress and the coping process. McGraw Hill, New York (1966). (cited on pages 6, 12, and 24.)

[118] R. S. Lazarus. Progress on a cognitive-motivational-relational theory of emotion. Am Psychol, 46: 819–834 (1991). (cited on pages 6 and 12.)

[119] R. S. Lazarus. Coping theory and research: past, present, and future. Psychosom Med, 55: 234–247 (1993). (cited on page 136.)

[120] R. S. Lazarus & E. M. Opton. The study of psychological stress: A summary of theoretical formulations and empirical findings. In C. D. Spielberger (ed.), Anxiety and behavior. Academic Press, New York (1966). (cited on page 24.)

[121] C. M. Lee & S. S. Narayanan. Toward Detecting Emotions in Spoken Dialogs. IEEE Transactions on Speech and Audio Processing, 13 (2005). (cited on pages 3 and 46.)

[122] M.-h. Lee, G. Yang, H.-K. Lee, & S. Bang. Development stress monitoring system based on Personal Digital Assistant (PDA). In 26th IEEE EMBS Annual International Conference, pp. 2364–2367. San Francisco (2004). (cited on pages 65 and 75.)

[123] R. W. Levenson, P. Ekman, & W. V. Friesen. Voluntary facial action generates emotion-specific autonomic nervous system activity. Psychophysiology, 27: 363–384 (1990). (cited on page 31.)

[124] W. Liao, W. Zhang, Z. Zhu, & Q. Ji. A Real-Time Human Stress Monitoring System Using Dynamic Bayesian Network. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops, p. 70. Washington. ISBN 0-7695-2372-2-3 (2005). (cited on pages 4, 16, 65, and 75.)

[125] X.-S. Hua, L. Lu, & H.-J. Zhang. AVE - Automated Home Video Editing. In Proceedings of the eleventh ACM international conference on Multimedia, pp. 490–497. ACM Press (2003). (cited on page 3.)

[126] N. R. Lomb. Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science, 39(2): 447–462 (1976). (cited on pages 19 and 20.)

[127] R. E. Lucas & F. Fujita. Factors Influencing the Relation Between Extraversion and Pleasant Affect. Journal of Personality and Social Psychology, 79(6): 1039–1056 (2000). (cited on page 135.)

[128] M. M. Bradley & P. J. Lang. Measuring emotion: The self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25: 49–59 (1994). (cited on page 51.)

[129] Madhero88. Skin layers. Wikimedia Commons, Creative Commons Attribution-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-sa/3.0/deed.en). URL http://commons.wikimedia.org/wiki/File:Skin_layers.png (2011). (cited on page 21.)

[130] R. La Marca, P. Waldvogel, H. Thorn, M. Tripod, P. H. Wirtz, J. C. Pruessner, & U. Ehlert. Association between Cold Face Test-induced vagal inhibition and cortisol response to acute stress. Psychophysiology, 48: 420–429 (2011). (cited on page 20.)

[131] Y. Masaoka & I. Homma. The effect of anticipatory anxiety on breathing and metabolism in humans. Respir Physiol, 128: 171–177 (2001). (cited on page 29.)

[132] C. Mattmann, F. Clemens, & G. Tröster. Sensor for Measuring Strain in Textile. Sensors (MDPI), 8(6): 3719–3732 (2008). (cited on page 29.)

[133] B. McEwen. Encyclopedia of Stress, chap. Stress, definition and concepts, pp. 508–509. Academic Press, San Diego (2000). (cited on page 6.)

[134] B. S. McEwen. Protective and damaging effects of stress mediators. New England Journal of Medicine, 338: 171–179 (1998). (cited on pages 11 and 12.)

[135] W. Melssen, R. Wehrens, & L. Buydens. Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83: 99–113 (2006). (cited on page 41.)

[136] D. Michie, D. J. Spiegelhalter, C. C. Taylor, & J. Campbell (eds.). Machine learning, neural and statistical classification. Ellis Horwood, Upper Saddle River, NJ, USA. ISBN 0-13-106360-X (1994). (cited on pages 37 and 76.)

[137] J. Montepare, E. Koff, D. Zaitchik, & M. Albert. The Use of Body Movements and Gestures as Cues to Emotions in Younger and Older Adults. Journal of Nonverbal Behavior, 23: 133–152. 10.1023/A:1021435526134 (1999). (cited on pages 31, 82, 90, 92, 93, and 96.)

[138] B. B. Murdock. The serial position effect of free recall. Journal of Experimental Psychology, 64(5): 482–488 (1962). (cited on pages 118 and 137.)

[139] R. Nikula. Psychological correlates of nonspecific skin conductance responses. Psychophysiology, 28: 86–90 (1991). (cited on pages 23 and 76.)

[140] M. S. Nomikos, E. Opton, J. R. Averill, & R. S. Lazarus. Surprise versus suspense in the production of stress reaction. Journal of Personality and Social Psychology, 8: 204–208 (1968). (cited on pages 24, 76, and 138.)

[141] A. O'Leary. Stress, emotion, and human immune function. Psychol Bull, 108: 363–382 (1990). (cited on page 2.)

[142] Encyclopædia Britannica Online. Pitch. URL http://www.britannica.com/EBchecked/topic/1357164/pitch. (cited on page 34.)

[143] P. Ekman, R. W. Levenson, & W. V. Friesen. Autonomic nervous system activity distinguishes among emotions. Science, 221: 1208–1210 (1983). (cited on pages 8, 9, and 31.)

[144] P. Ekman & W. V. Friesen. Unmasking the face. A guide to recognizing emotions from facial clues. Englewood Cliffs, New Jersey: Prentice-Hall (1975). (cited on page 13.)

[145] D. Palomba, M. Sarlo, A. Angrilli, A. Mini, & L. Stegagno. Cardiac responses associated with affective processing of unpleasant film stimuli. International Journal of Psychophysiology, 36(1): 45–57 (2000). (cited on page 28.)

[146] J. Pan & W. J. Tompkins. A Real Time QRS Detection Algorithm. In Biomedical Engineering (1985). (cited on page 18.)

[147] H. J. Park, R. X. Li, J. Kim, S. W. Kim, D. H. Moon, M. H. Kwon, & W. J. Kim. Neural correlates of winning and losing while watching soccer matches. Int J Neurosci, 119: 76–87 (2009). (cited on page 110.)

[148] G. Patkar. Photo I have taken at the 2005 Dubai Airshow of the new Airbus A380 painted in full Emirates Airlines Colors. Wikimedia Commons, Public Domain license. URL http://commons.wikimedia.org/wiki/File:Emirates_A380_2.JPG (2009). (cited on page 33.)

[149] B. Pfister & T. Kaufmann. Sprachverarbeitung: Grundlagen und Methoden der Sprachsynthese und Spracherkennung (Springer-Lehrbuch), chap. Darstellung und Eigenschaften des Sprachsignals, pp. 37–55. Springer Publishing Company, Incorporated. ISBN 3540759093, 9783540759096 (2008). (cited on page 34.)

[150] J. P. J. Pinel. Biopsychologie. Spektrum, Heidelberg, 2nd edn. (2001). (cited on page 11.)

[151] R. Polikar. Ensemble Based Systems in Decision Making. IEEE Circuits and Systems Magazine, 6(3): 21–45 (2006). (cited on page 42.)

[152] SEAT Project. URL http://www.wearable.ethz.ch/research/groups/context/seat. Sponsored by the European Union under contract Nr. AST5-CT-2006-030958. (cited on page 47.)

[153] R. Piacentini. Emotions at Fingertips, Revealing Individual Features in Galvanic Skin Response signals [dissertation] (2004). (cited on page 25.)

[154] R. W. Picard, E. Vyzas, & J. Healey. Toward Machine Emotional Intelligence: Analysis of Affective Physiological State. IEEE Trans Pattern Anal Mach Intell, 23(10): 1175–1191 (2001). (cited on page 15.)

[155] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257–286 (1989). (cited on pages 34 and 87.)

[156] S. J. Raudys & A. K. Jain. Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Trans Pattern Anal Mach Intell, 13: 252–264 (1991). (cited on pages 36, 77, 86, and 131.)

[157] J. Reunanen. A Pitfall in Determining the Optimal Feature Subset Size. In Proc. of the Fourth International Workshop on Pattern Recognition in Information Systems, pp. 176–185 (2004). (cited on pages 77, 86, and 99.)

[158] N. A. Roberts, J. L. Tsai, & J. A. Coan. Emotion Elicitation Using Dyadic Interaction Tasks. In J. A. Coan & J. J. B. Allen (eds.), The handbook of emotion elicitation and assessment, chap. 7, pp. 106–123. Oxford University Press, New York (2007). (cited on pages 13 and 14.)

[159] M. M. T. Rodrigo & R. S. Baker. Coarse-grained detection of student frustration in an introductory programming course. In Proceedings of the fifth international workshop on Computing education research workshop, pp. 75–80. ICER '09, ACM, New York, NY, USA. ISBN 978-1-60558-615-1 (2009). (cited on page 3.)

[160] L. Ross, M. R. Lepper, & M. Hubbard. Perseverance in self-perception and social perception: biased attributional processes in the debriefing paradigm. J Pers Soc Psychol, 32: 880–892 (1975). (cited on pages 14 and 110.)

[161] J. Rottenberg, R. R. Ray, & J. J. Gross. Emotion elicitation using films. In J. A. Coan & J. J. B. Allen (eds.), The handbook of emotion elicitation and assessment, chap. 1, pp. 9–28. Oxford University Press, New York (2007). (cited on pages 13, 14, 49, 50, and 135.)

[162] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161–1178 (1980). (cited on page 13.)

[163] J. A. Russell & A. Mehrabian. Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3): 273–294 (1977). (cited on page 12.)

[164] S. D. Kreibig, F. H. Wilhelm, W. T. Roth, & J. J. Gross. Cardiovascular, electrodermal, and respiratory response patterns to fear- and sadness-inducing films. Psychophysiology, 44: 787–806 (2007). (cited on pages 15, 49, and 50.)

[165] M. Saar-Tsechansky & F. Provost. Handling Missing Values when Applying Classification Models. J Mach Learn Res, 8: 1623–1657 (2007). (cited on page 42.)

[166] J. D. Scargle. Studies in astronomical time series analysis. II - Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal, 263: 835–853 (1982). (cited on pages 19 and 20.)

[167] S. Schachter & J. E. Singer. Cognitive, social, and physiological determinants of emotional state. Psychol Rev, 69: 379–399 (1962). (cited on pages 6 and 7.)

[168] R. Schandry. Lehrbuch der Psychophysiologie: körperliche Indikatoren psychischen Geschehens. Psychologie Verlagsunion, München (1989). (cited on pages 20 and 24.)

[169] K. R. Scherer. What are emotions? And how can they be measured? Social Science Information, 44(4): 695–729 (2005). (cited on pages 6 and 12.)

[170] K. R. Scherer. What are emotions? And how can they be measured? Social Science Information, 44(4): 695–729 (2005). (cited on pages 49 and 112.)

[171] S. Scherer, H. Hofmann, M. Lampmann, M. Pfeil, S. Rhinow, F. Schwenker, & G. Palm. Emotion Recognition from Speech: Stress Experiment. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08). European Language Resources Association (ELRA), Marrakech, Morocco. ISBN 2-9517408-4-0. http://www.lrec-conf.org/proceedings/lrec2008/ (2008). (cited on pages 65 and 75.)

[172] H. Schlosberg. Three dimensions of emotion. Psychol Rev, 61: 81–88 (1954). (cited on page 12.)

[173] L. Schmidt-Atzert. Lehrbuch der Emotionspsychologie. Kohlhammer, 1st edn. (1996). (cited on pages 5, 14, 110, and 136.)

[174] B. Schuller. Automatische Emotionserkennung aus sprachlicher und manueller Interaktion. Ph.D. thesis, Technische Universität München (2006). (cited on page 13.)

[175] B. Schuller, G. Rigoll, & M. Lang. Hidden Markov model-based speech emotion recognition. In Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, 2, pp. II-1–4, vol. 2 (2003). (cited on page 32.)

[176] J. Schumm. Quality assessment of physiological signals during ambulatory measurements. Dissertation, ETH Zurich (2010). (cited on pages 3, 29, 30, 48, 55, and 56.)

[177] J. Schumm, M. Bächlin, C. Setz, B. Arnrich, D. Roggen, & G. Tröster. Effect of movements on the electrodermal response after a startle event. Methods of Information in Medicine, 47(3): 186–191 (2008). (cited on pages 25, 26, and 112.)

[178] J. Schumm, C. Setz, M. Bächlin, M. Bächler, B. Arnrich, & G. Tröster. Unobtrusive Physiological Monitoring in an Airplane Seat. Personal and Ubiquitous Computing, 14(6): 541–550 (2010). (cited on pages 47, 55, and 56.)

[179] H. Selye. A syndrome produced by diverse nocuous agents. Nature, 138: 32 (1936). (cited on page 11.)

[180] H. Selye. Stress and the general adaptation syndrome. British Medical Journal, 1: 1383–1392 (1950). (cited on page 11.)

[181] M. Seracin & R. Iordache. Research regarding the elaboration of methods and procedures (indicators and techniques) for diagnostic analysis/assessment of neuropsychological load, and the identification of risk factors in activities and professions with major exposure to risk. URL protectiamuncii.ro/pdfs/Mental_stress_EN.pdf (2004). (cited on pages 65 and 103.)

[182] C. Setz, B. Arnrich, J. Schumm, R. La Marca, G. Tröster, & U. Ehlert. Discriminating Stress from Cognitive Load Using a Wearable EDA Device. IEEE Transactions on Information Technology in Biomedicine, 14(2): 410–417 (2010). (cited on pages 24, 26, 66, 67, 68, 69, 70, 71, 73, and 74.)

[183] C. Setz, J. Schumm, C. Lorenz, B. Arnrich, & G. Tröster. Combining Worthless Sensor Data. In Measuring Mobile Emotions Workshop at MobileHCI (2009). (cited on page 48.)

[184] C. Setz, J. Schumm, C. Lorenz, B. Arnrich, & G. Tröster. Using ensemble classifier systems for handling missing data in emotion recognition from physiology: One step towards a practical system. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, pp. 1–8 (2009). (cited on pages 47 and 48.)

[185] C. S. Sherrington. Experiments on the value of vascular and visceral factors for the genesis of emotion. Proceedings of the Royal Society London, 66: 390–403 (1900). (cited on page 7.)

[186] L. R. Sloan. The function and impact of sports and fans: A review of contemporary research. In J. H. Goldstein (ed.), Sports, games, and play: Social and psychological viewpoints, 1, pp. 219–262. Erlbaum Associates, Hillsdale, NJ (1979). (cited on pages 110, 114, 118, 135, and 136.)

[187] M. Smith & V. Ball. Cardiovascular/respiratory physiotherapy. Elsevier Health Sciences. ISBN 9780723425953 (1998). (cited on page 29.)

[188] C. R. Snyder, M. Lassegard, & C. E. Ford. Distancing After Group Success and Failure: Basking in Reflected Glory and Cutting Off Reflected Failure. Journal of Personality and Social Psychology, 51(2): 382–388 (1986). (cited on page 136.)

[189] State Secretariat for Economic Affairs. Die Kosten von Stress in der Schweiz. URL http://www.radix.ch/d/data/data_423.pdf. (cited on page 65.)

[190] G. Stemmler. Physiological processes during emotion. In P. Philippot & R. S. Feldman (eds.), The regulation of emotion, chap. 2, pp. 33–70. Erlbaum, Mahwah, NJ (2004). (cited on page 143.)

[191] D. Talkin. A robust algorithm for pitch tracking (RAPT). In W. B. Kleijn & K. K. Paliwal (eds.), Speech Coding and Synthesis. Elsevier (1995). (cited on page 34.)

[192] M. Tarvainen, P. Ranta-aho, & P. Karjalainen. An advanced detrending method with application to HRV analysis. Biomedical Engineering, IEEE Transactions on, 49(2): 172–175 (2002). (cited on pages 86 and 130.)

[193] Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. Heart rate variability: standards of measurement, physiological interpretation, and clinical use. European Heart Journal, 17: 354–381 (1996). (cited on page 20.)

[194] L. G. Tassinary, J. T. Cacioppo, & T. R. Geen. A psychometric study of surface electrode placements for facial electromyographic recording: I. The brow and cheek muscle regions. Psychophysiology, 26: 1–16 (1989). (cited on page 28.)

[195] Alive Technologies. Alive Technologies Products: Alive Heart and Activity Monitor. URL http://www.alivetec.com/products.htm. (cited on page 103.)

[196] Tekscan. CONFORMat system for R&D. URL http://www.tekscan.com/Conformat-pressure-measurement. (cited on pages 69 and 78.)

[197] The MathWorks, Inc. MATLAB (R2008b). Version 7.7.0.471. (cited on pages 17 and 27.)

[198] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13: 260–269 (1967). (cited on page 34.)

[199] T. Vogt, E. André, & J. Wagner. Affect and Emotion in Human-Computer Interaction, chap. Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realisation, pp. 75–91. Springer-Verlag, Berlin, Heidelberg. ISBN 978-3-540-85098-4 (2008). (cited on page 46.)

[200] G. Vossel & H. Zimmer. Psychophysiologie. Urban-Taschenbücher, Kohlhammer. ISBN 9783170126220 (1998). (cited on pages 23, 27, and 29.)

[201] C. Walton, A. Coyle, & E. Lyons. Death and football: An analysis of men's talk about emotions. British Journal of Social Psychology, pp. 401–416 (2004). (cited on page 110.)

[202] J. Wang, E. Chng, C. Xu, H. Lu, & X. Tong. Identify Sports Video Shots with "Happy" or "Sad" Emotions. Multimedia and Expo, IEEE International Conference on, pp. 877–880 (2006). (cited on page 3.)

[203] D. L. Wann & N. R. Branscombe. Die-Hard and Fair-Weather Fans: Effects of Identification on BIRGing and CORFing Tendencies. Journal of Sport and Social Issues, 14: 103–117 (1990). (cited on page 136.)

[204] D. L. Wann, T. J. Dolan, K. K. McGeorge, & J. A. Allison. Relationships between spectator identification and spectators' perceptions of influence, spectators' emotions, and competition outcome. Journal of Sport & Exercise Psychology (JSEP), 16: 347–364 (1994). (cited on pages 110, 111, 114, 119, 135, and 136.)

[205] D. Watson & L. A. Clark. Negative affectivity: the disposition to experience aversive emotional states. Psychol Bull, 96: 465–490 (1984). (cited on page 135.)

[206] J. B. Watson & R. Rayner. Conditioned emotional reactions. Journal of Experimental Psychology, 3: 1–14 (1920). (cited on page 6.)

[207] D. Weenink. Speech Signal Processing with Praat. URL http://www.fon.hum.uva.nl/david/sspbook/sspbook.pdf (2011). (cited on page 34.)

[208] Wikipedia. Percentile - Wikipedia, The Free Encyclopedia. URL http://en.wikipedia.org/w/index.php?title=Percentile&oldid=425594467. [Online; accessed 18-May-2011] (2011). (cited on pages 25 and 145.)

[209] U. Wilbert-Lampen, D. Leistner, S. Greven, T. Pohl, S. Sper, C. Volker, D. Guthlin, A. Plasse, A. Knez, H. Kuchenhoff, & G. Steinbeck. Cardiovascular events during World Cup soccer. N Engl J Med, 358: 475–483 (2008). (cited on page 133.)

[210] F. Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6): 80–83 (1945). (cited on page 116.)

[211] F. H. Wilhelm & P. Grossman. Emotions beyond the laboratory: theoretical fundaments, study design, and analytic strategies for advanced ambulatory assessment. Biol Psychol, 84: 552–569 (2010). (cited on pages 2 and 138.)

[212] F. H. Wilhelm, W. T. Roth, & M. A. Sackner. The LifeShirt. An advanced system for ambulatory measurement of respiratory and cardiac function. Behav Modif, 27: 671–691 (2003). (cited on page 69.)

[213] H. Zeier. Workload and psychophysiological stress reactions in air traffic controllers. Ergonomics, 37(3): 525–539 (1994). (cited on pages 65 and 103.)

[214] Zephyr Technology. URL http://www.zephyr-technology.com (2011). (cited on page 112.)

[215] J. Zhai & A. Barreto. Stress detection in computer users based on digital signal processing of noninvasive physiological variables. In 28th IEEE EMBS Annual International Conference, pp. 1355–1358. New York (2006). (cited on pages 4, 16, 65, and 75.)

[216] D. Zillmann. Attribution and misattribution of excitatory reactions. In J. Harvey, W. Ickes, & R. Kidd (eds.), New directions in attribution research, pp. 335–368. Vol. 2 in Wiley Monographs in Applied Econometrics Series, L. Erlbaum Associates, Hillsdale, NJ. ISBN 9780470263723 (1978). (cited on page 138.)

Publications

2011

[1] Towards Long Term Monitoring of Electrodermal Activity in Daily Life. Cornelia Setz, Franz Gravenhorst, Bert Arnrich and Gerhard Tröster, in: Special Issue on "Mental Health and the Impact of Ubiquitous Technologies" of Personal and Ubiquitous Computing, 2011.

[2] Design, implementation and evaluation of a seat-integrated multimodal sensor system. Bert Arnrich, Cornelia Setz, Johannes Schumm, and Gerhard Tröster, in book: Sensor Fusion: Foundation and Applications, INTECH, 2011

2010

[3] Discriminating Stress from Cognitive Load Using a Wearable EDA Device. Cornelia Setz, Bert Arnrich, Johannes Schumm, Roberto La Marca, Gerhard Tröster and Ulrike Ehlert (2010), in: IEEE Transactions on Information Technology in Biomedicine, 14:2(410-417)

[4] Pervasive Technologies in Airplane Seats: Towards Safer and Stress-Free Air Travel. Cornelia Setz, Johannes Schumm, Dominique Favre, Christian Liesen, Bert Arnrich, Daniel Roggen and Gerhard Tröster, video and accompanying paper for The Eighth International Conference on Pervasive Computing, Helsinki, Finland, 2010

[5] Towards Long Term Monitoring of Electrodermal Activity in Daily Life. Cornelia Setz, Johannes Schumm, Martin Kusserow, Bert Arnrich and Gerhard Tröster, in: 5th International Workshop on Ubiquitous Health and Wellness (UbiHealth 2010), 2010

[6] Unobtrusive Physiological Monitoring in an Airplane Seat. Johannes Schumm, Cornelia Setz, Marc Bächlin, Marcel Bächler, Bert Arnrich and Gerhard Tröster (2010), in: Personal and Ubiquitous Computing, 14:6(541-550)

[7] What does your chair know about your stress level? Bert Arnrich, Cornelia Setz, Roberto La Marca, Gerhard Tröster and Ulrike Ehlert (2010), in: IEEE Transactions on Information Technology in Biomedicine, 14:2(207-214)

[8] Self Organizing Maps for Affective State Detection. Bert Arnrich, Cornelia Setz, Roberto La Marca, Gerhard Tröster and Ulrike Ehlert, in: Machine Learning for Assistive Technologies, 2010

2009

[9] Combining Worthless Sensor Data. Cornelia Setz, Johannes Schumm, Claudia Lorenz, Bert Arnrich and Gerhard Tröster, in: Measuring Mobile Emotions Workshop at MobileHCI, 2009

[10] Using Ensemble Classifier Systems for Handling Missing Data in Emotion Recognition from Physiology: One Step Towards a Practical System. Cornelia Setz, Johannes Schumm, Claudia Lorenz, Bert Arnrich and Gerhard Tröster, in: Proceedings of the 2009 3rd International Conference on Affective Computing and Intelligent Interaction, ACII, pages 187-194, 2009

[11] Unsupervised Monitoring of Sitting Behavior. Bernd Tessendorf, Bert Arnrich, Johannes Schumm, Cornelia Setz and Gerhard Tröster, in: Proceedings of EMBC09, 2009

2008

[12] Application of model predictive control to a cascade of river power plants. Cornelia Setz, A. Heinrich, Ph. Rostalski, G. Papafotiou and M. Morari, in: 17th IFAC World Congress, Seoul, Korea, 2008

[13] Activity and emotion recognition to support early diagnosis of psychiatric diseases. D. Tacconi, O. Mayora, Paul Lukowicz, Bert Arnrich, Cornelia Setz, Gerhard Tröster and C. Haring, in: Proceedings of 2nd International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health), 2008

[14] Effect of movements on the electrodermal response after a startle event. Johannes Schumm, Marc Bächlin, Cornelia Setz, Bert Arnrich, Daniel Roggen and Gerhard Tröster (2008), in: Methods of Information in Medicine, 47:3(186-191)

[15] Effect of movements on the electrodermal response after a startle event. Johannes Schumm, Marc Bächlin, Cornelia Setz, Bert Arnrich, Daniel Roggen and Gerhard Tröster, in: Proceedings of 2nd International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health), 2008

2007

[16] Monitoring passenger's breathing - a feasibility study. Basem Dokhan, Cornelia Setz, Bert Arnrich and Gerhard Tröster, in: Swiss Society of Biomedical Engineering Annual Meeting, Neuchâtel, Switzerland, 2007