StressSense: Detecting Stress in Unconstrained Acoustic Environments using Smartphones

Hong Lu, Intel Lab, [email protected]
Mashfiqui Rabbi, Cornell University, [email protected]
Gokul T. Chittaranjan, EPFL, [email protected]
Denise Frauendorfer, University of Neuchâtel, [email protected]
Marianne Schmid Mast, University of Neuchâtel, [email protected]
Andrew T. Campbell, Dartmouth College, [email protected]
Daniel Gatica-Perez, Idiap and EPFL, [email protected]
Tanzeem Choudhury, Cornell University, [email protected]

ABSTRACT
Stress can have long-term adverse effects on individuals' physical and mental well-being. Changes in the speech production process are one of many physiological changes that happen during stress. Microphones, embedded in mobile phones and carried ubiquitously by people, provide the opportunity to continuously and non-invasively monitor stress in real-life situations. We propose StressSense for unobtrusively recognizing stress from human voice using smartphones. We investigate methods for adapting a one-size-fits-all stress model to individual speakers and scenarios. We demonstrate that the StressSense classifier can robustly identify stress across multiple individuals in diverse acoustic environments: using model adaptation StressSense achieves 81% and 76% accuracy for indoor and outdoor environments, respectively. We show that StressSense can be implemented on commodity Android phones and run in real time. To the best of our knowledge, StressSense represents the first system to consider voice-based stress detection and model adaptation in diverse real-life conversational situations using smartphones.

Author Keywords
mHealth, stress, sensing, user modeling, model adaptation

ACM Classification Keywords
H.1.2 User/Machine Systems; I.5 Pattern Recognition; J.3 Life and Medical Sciences: Health.

General Terms
Algorithms, Design, Human Factors, Performance


INTRODUCTION
Stress is a universally experienced phenomenon in our modern lives. According to a 2007 study by the American Psychological Association, three quarters of Americans experience stress-related symptoms [1]. Studies have shown that stress can play a role in psychological or behavioral disorders, such as depression and anxiety [2]. The amount of cumulative stress in daily life may have broad consequences for societal well-being, as stress-causing events have a negative impact upon daily health and mood [2] and also contribute significantly to health care costs [3].

Because stress imparts negative public health consequences, it is advantageous to consider automatic and ubiquitous methods for stress detection. Ubiquitous stress detection can help individuals become aware of and manage their stress levels. Meanwhile, distributed stress monitoring may give health professionals the ability to examine the extent and severity of stress across populations.

Many physiological symptoms of stress may be measured with sensors, e.g., by chemical analysis, skin conductance readings, electrocardiograms, etc. However, such methods are inherently intrusive upon daily life, as they require direct interaction between users and sensors. We therefore seek less intrusive methods to monitor stress. Researchers have widely acknowledged that human vocal production is influenced by stress [4, 5, 6, 7]. This makes the human voice a potential source for nonintrusive stress detection. In this paper, we suggest that smartphones and their microphones are an optimal computer-sensor combination for the unobtrusive identification of daily stress.

To be operational in real life, a voice-based stress classifier needs to deal with both the diverse acoustic environments encountered every day and individual variability in how stress is expressed. Most existing research relating stress and speech has focused on a single acoustic environment using high-quality microphones. This paper presents a method for detecting the occurrence of stress using smartphone microphones and adapting universal models of stress to specific individuals or scenarios using Maximum A Posteriori


(MAP) adaptation. The contributions of this paper are as follows. We experimentally show that: 1) Stress in the human voice can be detected using smartphones in real-life acoustic environments that involve both indoor and outdoor conversational data. 2) A universal stress model can be robustly adapted to specific individual users, thereby increasing the accuracy across a population of users. 3) A stress model can be adapted to unseen environments, thereby lowering the cost of training stress models for different scenarios. 4) The proposed stress classification pipeline can run in real time on off-the-shelf Android smartphones.

BACKGROUND AND RELATED WORK
In this section, we describe how stress and stressors are modeled and discuss the differences between stress and emotion detection from speech. We also discuss individual variability in portraying stress and why adaptation or personalization of models is necessary.

What is Stress?
In general terms, stress is the reaction of an organism to a change in its equilibrium. However, a more quotidian definition is that stress is the tension one experiences in response to a threat. When a person is able to cope successfully with stress they experience eustress, the opposite of distress. Stress can therefore have positive or negative outcomes, depending on a person's coping ability and the severity of the stressor. Stressors may be real or imagined; an event that produces stress for one individual may have no effect on another individual.

Stressors have been quantified in terms of "hassles" and "life events," the former referring to short-term intraday stress-causing events and the latter referring to larger events of sparser and more momentous occasion [8]. The two main categories of stressors are physical and psychological. Physical stressors are those which pose a threat to a person's physical equilibrium, such as a roller coaster ride, a physical altercation, or deprivation of sleep. A psychological stressor can be an event that threatens a person's self-esteem, e.g., the pressure of solving a difficult mental task within a time limit.

Stress is a subjective phenomenon, and stressors are the observable or imagined events and stimuli that cause stress. In this paper, we focus on cognitive stress and estimating the frequency of stressors, rather than the severity of stress. The reasons for this are severalfold. No universal metric exists for stress severity in the psychology literature. By definition, stress cannot exist without stressors. While it is difficult to objectively compare stress across individuals, research has shown that there exist sets of stressors that are shared by people. Finally, we simplify our objective by not determining the type of stress that an individual is experiencing, or their coping ability. We believe that mobile detection of the frequency of stressors in one's life is a more practical and realistic goal than determining the severity of stress.

People often react to stress emotionally. For instance, a person undergoing distress from an argument may experience anger or sadness. Stress detection in speech is therefore often wrapped up in speech emotion detection. Much of the literature treats stress and emotion as conjugate factors for speech production. While emotion detection in speech has been examined by a significant scientific population, speech under stress has received focused attention from a smaller group of researchers. We specifically aim at modeling speech under stress, rather than emotional expression. In our view, a critical difference between stress and emotion detection is that stress can always be linked to a stressor, whereas it is not always possible to establish causal relationships between emotions and events.

Modeling and Detecting Stress From Speech
Pioneering work on voice analysis and cognitive stress was done in 1968, when amplitude, fundamental frequency, and spectrograms were analyzed for speech under task-induced stress [4]. It was found that neutral listeners could perceive differences in voice recordings of subjects undergoing stress-activating cognitive tasks.

A large body of research on stress detection from speech centered around the Speech Under Simulated and Actual Stress (SUSAS) dataset [9]. Evidence from experiments with SUSAS suggests that pitch plays a prominent role in stress. Other important features are based on energy, spectral characteristics of the glottal pulse, phonetic variations (speaking rate), and spectral slope. Nonlinear features based on the Teager Energy Operator [10, 11] have also shown promising discriminative capabilities, especially for talking styles such as "anger" and "loud" [12].

Fernandez and colleagues performed voice-based stress classification in a scenario involving both physically and physiologically induced stress [13]. Four subjects were asked to answer mathematical questions while driving a car simulator at two different speeds and with two different response intervals. Several models were tested on two levels of speech analysis: intra-utterance and whole utterance. The best results were found by using a mixture of hidden Markov models at the intra-utterance level. Although this work is a comprehensive study of several models for speech under stress, the data collection is done in a quiet environment using a high-quality professional microphone.

Paltial et al. propose a Gaussian mixture model based framework for physical stress detection [14]. Stress is induced by exercising on a stair-stepper at 9-11 miles per hour. They investigate the effect of the number of speakers in the training set, which consists of all female subjects. The data is recorded in a single acoustic environment using a professional-grade microphone. AdaBoost is applied to combine the mel-scale cepstral coefficients and the Teager energy operator to achieve a 73% classification accuracy with a generic stress model.

Mobile Stress Detection
There has been growing interest in inferring stress and emotion from mobile phone sensor data. Chang, Fisher, and Canny describe a speech analysis library for mobile phone voice classification in [15]. They implement an efficient voice feature extraction library for mobile phones and show


the processing is able to run on off-the-shelf mobile phones in real time. However, their work does not consider stress classification in mobile scenarios, mixed acoustic contexts, or any form of adaptation. No data is collected from mobile phones; the evaluation is done solely on the SUSAS data set. EmotionSense [16] presents a multi-sensor mobile system for emotion detection. The system is specifically designed for social psychology researchers, and includes several parameterizations toward that end. A universal background model (UBM) is trained on all of the emotion categories of an artificial dataset of portrayed emotions. Specific emotion classifiers are subsequently generated by maximum a posteriori (MAP) adaptation from the UBM to specific emotions in the dataset. Unlike our work, EmotionSense does not train its emotion classifiers on actual user data. Rather, the classifiers are trained from the artificial dataset and kept static. Again, there is no verification of the classifiers' accuracies on audio collected in the wild, nor is there any personalization of the universal models to individual users.

As Scherer [17] states, there is "little evidence for a general acoustic stress profile" because the correlation between voice and human emotion is subject to large differences between individuals. Humans respond to the same stressors in different ways, and they also respond differently from each other depending on their coping ability, coping style, and personality. It is therefore imperative to develop a method of stress detection that can model individual speakers in different scenarios. This evidence leads us to believe that no universal model of stress will be effective for all individuals or all scenarios. However, it is cumbersome to learn a new model of stress for every individual or scenario. It is therefore advantageous to have a means for adapting models. Finally, most stress detection research has not considered diverse acoustic environments. We believe it is important to evaluate classifiers trained in more realistic acoustic environments, and through microphones in existing mobile devices.

EXPERIMENTAL SETUP

Data Collection
We study stress associated with the cognitive load experienced by a participant during a job interview as an interviewee and while conducting a marketing task as an employee. We also consider a neutral task where participants are not under stress. These three tasks were designed with the help of behavioral psychologists. Following SUSAS and many previous studies [9, 4, 13, 14], we assume that the subject's voice is stressed once the stressor is present, and that reading without a stressor is neutral. Participants were recruited through flyers advertised at multiple locations within a university campus and through messages shared using social media sites. Data is collected from a total of 14 participants (10 females, 4 males). The mean age was 22.86 years and participants had on average little experience with job interviews (mean experience of 2.86 on a scale from 1 (no experience at all) to 5 (a lot of experience)). Thirteen participants were undergraduate students in different domains such as geology, psychology, biology, and law. One participant was a PhD student. The data collection is done in three phases:

Figure 1: (a) the setup of the interview room; (b) the setup of data collection; (c) and (d) maps of the university and city centre, respectively, with yellow markers indicating locations where participants had conversations with other people.

1. Job Interviews: Job interviews of the participants are conducted indoors, as shown in Fig. 1(a). The subjects are informed that they first have to go through the job interview, and that whether they will be hired for the marketing task depends on their interview performance. The interview scenario comprises a structured interview with 8 questions in French; the translation is as follows: i) Can you describe yourself in a few words? ii) What is the motivation for you to apply for this job? iii) What is the meaning and importance of scientific research to you? iv) Can you give an example that shows your competence to communicate? v) Can you give an example of persuading people? vi) Can you give an example of working conscientiously? vii) Can you give an example of mastering a stressful situation? viii) Can you tell me your strengths and weaknesses? This set of questions is designed to test the subjects' competence for the tasks, especially the last four questions. Audio is continuously collected using a Google Nexus One Android smartphone and a microcone¹ microphone array. In addition to the audio data, video cameras record the interviewer and interviewee.

2. Marketing Jobs: When the interview is complete, participants are briefed about the marketing job that they have been hired to conduct. The marketing task involves recruiting new participants from the general public for other studies that are frequently conducted at a local university. Each participant is rewarded 200 CHF for about 4 hours of work. The marketing task provides us with the opportunity to study participants executing this real-world job at two different locations: (i) at a university campus and (ii) in the center of a city, as shown in Fig. 1(c) and Fig. 1(d), respectively. The participants are given flyers with contact information for the study and a paper form to collect the contact information of interested people. Participants conducting recruitment campaigns wear two Nexus One smartphones positioned at two different places on

¹ http://www.dev-audio.com


their body, as shown in Fig. 1(b). Participants were informed that their remuneration for the job would have a fixed and a performance-based component; the performance-based component would depend on the number of participants they recruited. This was done to motivate the participants to perform the outdoor task. For each participant, the marketing job was designed to be carried out in four independent sessions on different days. However, data for all 4 sessions are not available for all subjects due to technical issues with data collection or subjects failing to attend all the sessions.

Figure 2: Average increase in skin conductance (difference in conductance, µS) for the indoor neutral, outdoor neutral, interview, and marketing job tasks.

3. Neutral Task: In addition to capturing stressed audio, we also collect audio data from neutral scenarios where participants are not stressed. In the neutral scenarios, participants had to read both indoors and outdoors. The reading materials are simple stories that are often used to study different accents in languages. Compared to the job interview situation or to the recruitment task, there was no performance expected of the participants. Thus, there was no stress induction during this situation and participants were therefore unlikely to be stressed.

The study is done over multiple days: one for the interview, one for the neutral task, and two or more for the marketing task. All audio data is collected using Nexus One phones at 8 kHz, 16-bit PCM using a custom Android program that runs in the background. The same program also continuously records accelerometer data and collects GPS data every 3 minutes. During all the tasks, every participant wore a wrist band called Affectiva², which includes a galvanic skin response (GSR) sensor used for ground truth, as discussed next.

Ground-truth of Stress from GSR Sensor:
During stress, the human body goes into an alert mode resulting in increased skin conductance, which can be measured by a GSR sensor [2]. We used the Affectiva wrist band to collect skin conductance data. As an external sensor, the GSR sensor is only used for ground-truth purposes. For each session, there was a 5-minute calibration phase for the GSR sensor during which participants wear the sensor and relax. This calibration phase provides the baseline GSR readings (i.e., skin conductance) for each session. To measure changes in stress level for each session, we compute the difference between the average baseline readings of the calibration phase and the average GSR readings during the actual tasks. Multiple sessions of outdoor marketing job recordings are averaged. Figure 2 shows the average increase of GSR readings for the different types of tasks. Clearly, the increase in GSR readings is higher for the marketing and job interview sessions than for both neutral scenarios (indoor or outdoor). This result suggests that participants are reacting to stressors induced during both the job interview and marketing tasks.

² http://www.affectiva.com/
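As a concrete illustration of this ground-truth computation, the sketch below (a minimal example; the arrays and variable names are hypothetical, not part of the original system) computes the per-session increase over the calibration baseline and averages the outdoor sessions:

```python
import numpy as np

def gsr_increase(baseline_readings, task_readings):
    """Increase in skin conductance (uS) relative to the calibration baseline."""
    return np.mean(task_readings) - np.mean(baseline_readings)

# Hypothetical example: several outdoor marketing sessions for one participant,
# each given as a (baseline, task) pair of conductance readings in uS.
sessions = [
    (np.array([1.1, 1.2, 1.0]), np.array([3.4, 3.9, 4.1])),
    (np.array([0.9, 1.0, 1.1]), np.array([2.8, 3.1, 3.0])),
]
per_session = [gsr_increase(b, t) for b, t in sessions]
print("average increase over sessions: %.2f uS" % np.mean(per_session))
```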

STRESSSENSE CLASSIFICATION MODEL
In what follows, we detail the initial data preprocessing steps and then discuss our StressSense features, classification framework, and model adaptation.

Data preprocessing
Prior to stress classification, the audio from the recording sessions is preprocessed using a two-step process: (i) the audio is classified into voice and non-voice regions and (ii) the individual speakers are segmented based on the output of the voice classification. This process is carried out differently for the indoor and outdoor scenarios since, indoors, we could leverage the additional microphone array that was part of the instrumented room, as shown in Figure 1(a).

In the indoor setting, information from the microcone microphone array is used to segment the speakers. The microcone comprises 7 microphones in an array and records audio at 16 kHz in 16-bit PCM. The microcone manufacturer provides software to do automatic speaker turn segmentation from the recorded audio. We time-align the microcone and Android smartphone audio by identifying the first peak in the cross-correlation between the center channel of the microcone device and the audio recorded on the phone.
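The alignment step can be illustrated with a short sketch. Assuming two mono signals at the same sample rate (the function and signal names below are hypothetical, and the highest correlation peak is used as a simplification), the lag is estimated from the cross-correlation:

```python
import numpy as np

def estimate_lag(reference, recording):
    """Estimate the offset (in samples) of `recording` relative to `reference`
    from the peak of their full cross-correlation."""
    xcorr = np.correlate(recording, reference, mode="full")
    # Output index i corresponds to lag k = i - (len(reference) - 1).
    return int(np.argmax(xcorr) - (len(reference) - 1))

# Hypothetical usage: align the phone audio to the microcone center channel.
fs = 8000
t = np.arange(fs) / fs
center_channel = np.sin(2 * np.pi * 440 * t)
phone_audio = np.concatenate([np.zeros(1200), center_channel])  # starts 0.15 s late
print(estimate_lag(center_channel, phone_audio))  # ~1200 samples
```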

For the outdoor audio, we do not have the advantage of the microphone array. We use a different classifier to segment voice and non-voice regions that has been demonstrated to be robust in noisy outdoor environments [18, 19, 20]. In our test on 4 minutes of labeled audio data including human speech, the classifier yields an accuracy of 83.7%, with 90% precision and 84% recall. This 4 minutes of audio is acquired from our outdoor dataset with high environmental noise, and the above performance of the classifier is compatible with the results found in [18, 19, 20]. For speaker segmentation, the participants wear two smartphones during data collection: one attached to their shoulder and another to their waist, as shown in Fig. 1(b). In the future, we envision utilizing multiple microphones embedded within a single smartphone. To align the two audio streams, mutual information between the voiced regions is used. Direct cross-correlation between the audio streams from the microphones is avoided due to the noisy outdoor environments (even different rubbing patterns of the phones at different positions on the body can cause confusing differences). Mutual information has been successfully used for alignment and conversation detection in outdoor scenarios [19]. Upon alignment, an energy comparison between the waist and shoulder audio is used for speaker segmentation. We exploit the fact that audio energy is inversely proportional to the distance from the microphone. If person A and person B are in a conversation, with person A instrumented with two phones, A's mouth is at a much shorter distance from the shoulder microphone than from the waist microphone.


Feature                Description
Pitch std              standard deviation of pitch
Pitch range            difference of max and min pitch
Pitch jitter           perturbation in pitch
Spectral centroid      centroid frequency of the spectrum
High frequency ratio   ratio of energy above 500 Hz
Speaking rate          rate of speech
MFCCs                  cepstral representation of the voice
TEO-CB-AutoEnv         Teager Energy Operator based nonlinear transformation

Table 1: StressSense Acoustic Features

On the other hand, A's microphones are almost equidistant from B's mouth. Thus, when B is speaking, the energy ratio between A's two microphones will be close to one, while when person A is speaking, the energy ratio between the shoulder and waist microphones will be greater than one.

Therefore, the audio energy ratio can be used as a discriminative feature for segmenting the subject from his conversational partners. We use receiver operating characteristic (ROC) analysis to find an appropriate threshold on the energy ratio to differentiate between speakers. We test this threshold-based classifier on a total of 20 minutes of manually labeled audio data, with labels indicating which speaker is speaking and when. This 20 minutes of audio data is gathered from two separate conversations in our outdoor dataset that involved distinct participants. On this test data, the classifier yields an accuracy of 81% with 81% precision and 95% recall. After running the speaker segmentation algorithm on our outdoor data set, we find that in most cases participants talk for 10-20 minutes in each session, with the exception of one participant who only talks for 4 minutes.
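A minimal sketch of this energy-ratio segmentation is given below, assuming time-aligned voiced audio from the shoulder and waist phones; the fixed threshold is a hypothetical placeholder, whereas the paper selects it via ROC analysis:

```python
import numpy as np

FRAME = 256  # 32 ms at 8 kHz, matching the paper's frame size

def frame_energy(signal, frame=FRAME):
    """Per-frame energy over non-overlapping frames."""
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame).astype(float)
    return np.sum(frames ** 2, axis=1)

def segment_wearer(shoulder, waist, threshold=1.5):
    """Label each frame True if the instrumented subject is speaking.

    `threshold` is hypothetical; the paper chooses it with ROC analysis.
    """
    ratio = frame_energy(shoulder) / (frame_energy(waist) + 1e-10)
    return ratio > threshold

# Usage (hypothetical aligned recordings from the two phones):
# wearer_frames = segment_wearer(shoulder_audio, waist_audio)
```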

StressSense Features
The most widely investigated acoustic feature for stress is pitch (F0). Pitch reflects the fundamental frequency of vocal cord vibration during speech production. Normally, the mean, standard deviation, and range of pitch increase when somebody is stressed [7, 21, 22, 23, 24], while the pitch jitter in voice usually decreases [25, 26]. The degree of change varies from person to person depending on the person's experience and arousal level.

It is known that the locations of formants across the spectrum differ between stressed and neutral speech. Generally, the distribution of spectral energy shifts toward higher frequencies when somebody is stressed. This change is reflected in the spectral centroid, which increases during stressed speech, and more energy concentrates at frequencies above 500 Hz [25, 26]. In addition to changes in the spectral energy distribution, previous studies also show an increase in speaking rate under stressed conditions [24, 26]. We adopt the speaking rate estimator proposed in [27], which derives the speaking rate directly from the speech waveform. The processing window is set to one second for better dynamics. More recently, studies on stress analysis show that a feature based on a nonlinear speech production model is very useful in stress detection. The TEO-CB-AutoEnv feature [12] is based on a multi-resolution analysis of the Teager Energy profile. It is designed to characterize the level of regularity in the segmented TEO response and can capture variations in

excitation characteristics including pitch and its harmonics [12, 28, 13]. It is robust to the intensity of the recorded audio and to background noise. We adopt the 17-dimension TEO-CB-AutoEnv proposed in [29], which is an extension of the version proposed in [12]. MFCCs are a set of acoustic features modeling the human auditory system's nonlinear response to different spectral bands. They are a compact representation of the short-term power spectrum of a sound and are widely used in speech analysis, such as speech recognition and speaker recognition. As a generic speech feature, MFCCs have been used to detect stressed speech in a number of prior studies [12, 14]. We use 20-dimension MFCCs with the DC component removed, since it corresponds to the intensity of the sound.

Another widely investigated category of features for stress classification is intensity-based features, such as the mean, range, and variability of intensity. Intensity-based features require consistent control over ambient noise and over both the distance and orientation of the microphone with respect to the speaker. In a mobile, ubiquitous setting, it is impractical to make such strong assumptions about the orientation, body placement, and environmental context. Therefore, we do not adopt intensity-based features. Also, the intensity of the audio clips is normalized to ensure the stress classification is not affected by the different intensity levels found in different sessions.

Table 1 summarizes the acoustic features used by StressSense. In the feature extraction stage, the audio samples captured by the phone are divided into frames. Each frame is 256 samples (32 ms). It is known that human speech consists of voiced speech and unvoiced speech [30]. Voiced speech is generated from periodic vibrations of the vocal cords and includes mostly vowels. In contrast, unvoiced speech does not involve the vocal cords and generally includes consonants. We use only voiced frames for analysis, i.e., features are only computed from voiced frames of speech. The reason for this is twofold. First, the pitch and TEO-CB-AutoEnv features can only be reliably extracted from voiced frames. Second, voiced speech contains more energy than unvoiced speech and is thus more resilient to ambient noise. We use the method introduced in [31], where zero crossing rate and spectral entropy are used to select voiced frames. The final dimensionality of the extracted feature vector is 42.
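To make the framing and voiced-frame selection concrete, here is a rough sketch using zero crossing rate and spectral entropy in the spirit of [31]; the thresholds are hypothetical placeholders rather than the values used by the authors:

```python
import numpy as np

FS, FRAME = 8000, 256  # 8 kHz audio, 32 ms frames as in the paper

def zero_crossing_rate(frame):
    """Fraction of adjacent samples with a sign change."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def spectral_entropy(frame):
    """Entropy (bits) of the normalized power spectrum of one frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    p = spectrum / (np.sum(spectrum) + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

def voiced_mask(audio, zcr_max=0.25, entropy_max=5.5):
    """Crude per-frame voiced/unvoiced decision; voiced speech tends to have
    a low zero crossing rate and low spectral entropy. Thresholds are hypothetical."""
    n = len(audio) // FRAME
    frames = audio[: n * FRAME].reshape(n, FRAME).astype(float)
    zcr = np.array([zero_crossing_rate(f) for f in frames])
    ent = np.array([spectral_entropy(f) for f in frames])
    return (zcr < zcr_max) & (ent < entropy_max)
```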

StressSense Classification

Classification Framework.
Our classification framework uses Gaussian Mixture Models (GMMs) with diagonal covariance matrices. We use one GMM for each of the two classes, i.e., stressed speech and neutral speech. The framework makes decisions according to the likelihood function p(X|λ) of each class with equal priors, where X is a feature vector and λ(w, µ, Σ) is a GMM model with weight, mean, and covariance matrix parameters. We choose 16 as the number of components after evaluating the Akaike Information Criterion (AIC) for several models on the subjects. To avoid overfitting the training data, the variance limiting technique [32] is applied with a standard expectation maximization (EM) algorithm to train the GMMs. To initialize the EM algorithm, k-means is used to set the initial means and


variances of the GMM components. We investigate three training schemes to model stress: a universal model, a personalized model, and a speaker-adapted universal model. In what follows, we discuss these three schemes in more detail (a sketch of the underlying two-GMM classifier is given after the list):

• Universal model uses a one-size-fits-all approach, where one universal stress classifier is trained for all users. It is the most widely adopted scheme in many mobile inferencing systems due to its simplicity. The universal classifier is static after deployment.

• Personalized model uses a completely speaker-dependent approach. It requires model training on each individual user's own data and generates a speaker-dependent model for each user. Clearly, this scheme is superior to the universal model in terms of performance, but its usability and scalability greatly limit its application. Each user has to label their own data and train their own stress model before they can start using the system. The cumbersome bootstrapping phase and the significant computational resources required render this scheme infeasible in most practical applications. We study this scheme mainly to use it as a reference for the best achievable performance.

• Model adaptation represents a middle ground between the previous two schemes. All users start with a universal stress classification model, which in turn gets adapted to each individual user for better performance as more personal data becomes available while users carry the phone. We design two adaptation methods: i) supervised adaptation, where a user explicitly contributes labeled data for adaptation; and ii) unsupervised adaptation, which leverages the self-training [33] technique to utilize unlabeled data (we refer to this type of adaptation as self-train in the remainder of the paper). Both methods use the Maximum A Posteriori algorithm to adapt the universal model, but differ in the way they assign samples for adaptation.
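Before turning to adaptation, the following sketch illustrates the two-GMM classification framework described above (16 diagonal-covariance components per class, equal priors, decision by likelihood comparison). It uses scikit-learn as a stand-in for the authors' Matlab/C implementation; the regularization term, which plays the role of variance limiting, is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmm(features, n_components=16, seed=0):
    """One diagonal-covariance GMM per class; k-means initialization is sklearn's default."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          init_params="kmeans", reg_covar=1e-4, random_state=seed)
    gmm.fit(features)
    return gmm

def classify(stressed_gmm, neutral_gmm, x):
    """Per-sample decision with equal priors by comparing log-likelihoods (1 = stressed)."""
    return (stressed_gmm.score_samples(x) > neutral_gmm.score_samples(x)).astype(int)

# Hypothetical usage with 42-dimensional StressSense feature vectors:
# stressed_gmm = train_class_gmm(X_stressed)   # X_stressed: shape (n, 42)
# neutral_gmm  = train_class_gmm(X_neutral)
# y_pred = classify(stressed_gmm, neutral_gmm, X_test)
```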

Model Adaptation
In order to adapt the parameters of a GMM, we use the Maximum A Posteriori (MAP) method developed in [34] for speaker verification. The paradigm of MAP adaptation for speaker recognition is similar to ours for stress detection. Given the variation in stress parameters for individuals and different scenarios (e.g., noisy outdoors), we expect that the universal model of stress may be adapted to individual speakers and scenarios. MAP adaptation is a non-iterative process and is therefore performed only once for each new set of observations. It is a modification of the Maximization step in the EM algorithm, and it attempts to maximize the posterior probabilities of the unimodal Gaussian components given new observations. We begin with a universal GMM model λ and new training observations (from individuals or scenarios), X = {x_1, ..., x_T}, where T is the number of observations. For Gaussian component i, the component's posterior probability given observation x_t is

\[ p(i \mid x_t) = \frac{w_i\, p_i(x_t)}{\sum_{j=1}^{M} w_j\, p_j(x_t)} \quad (1) \]

Using the posterior probability of component i, we can calculate the sufficient statistics for the weight, mean, and variance:

\[ n_i = \sum_{t=1}^{T} p(i \mid x_t), \qquad E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} p(i \mid x_t)\, x_t, \qquad E_i(x x') = \frac{1}{n_i} \sum_{t=1}^{T} p(i \mid x_t)\, x_t x_t' \quad (2) \]

The updated parameters of the MAP-adapted GMM are calculated using α as follows:

\[ \alpha_i = n_i / (n_i + r) \quad (3) \]
\[ \hat{w}_i = \left[ \alpha_i n_i / T + (1 - \alpha_i)\, w_i \right] \gamma \quad (4) \]
\[ \hat{\mu}_i = \alpha_i E_i(x) + (1 - \alpha_i)\, \mu_i \quad (5) \]
\[ \hat{\sigma}_i^2 = \alpha_i E_i(x x') + (1 - \alpha_i)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2 \quad (6) \]

where r is a relevance factor that determines how much weight the original model should retain. We set r to 16, as suggested by [34]. γ is a scaling coefficient that ensures \(\sum_{i=1}^{M} \hat{w}_i = 1\).
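A minimal sketch of this MAP update (Eqs. 1-6) is shown below, assuming the universal model is a fitted scikit-learn GaussianMixture with diagonal covariances; the variance floor and the explicit recomputation of the precision factors are implementation details not specified in the paper:

```python
import numpy as np

def map_adapt(gmm, X, r=16.0):
    """MAP-adapt a fitted sklearn GaussianMixture (covariance_type="diag")
    with new observations X, following Eqs. (1)-(6); r is the relevance factor."""
    T = X.shape[0]
    post = gmm.predict_proba(X)                     # Eq. (1): p(i|x_t), shape (T, M)
    n = post.sum(axis=0) + 1e-10                    # Eq. (2): n_i
    Ex = (post.T @ X) / n[:, None]                  # Eq. (2): E_i(x)
    Exx = (post.T @ (X ** 2)) / n[:, None]          # Eq. (2): E_i(xx'), diagonal terms only
    alpha = n / (n + r)                             # Eq. (3)
    w = alpha * n / T + (1.0 - alpha) * gmm.weights_            # Eq. (4) before scaling
    gmm.weights_ = w / w.sum()                                  # gamma: weights sum to one
    mu = alpha[:, None] * Ex + (1.0 - alpha[:, None]) * gmm.means_              # Eq. (5)
    var = (alpha[:, None] * Exx
           + (1.0 - alpha[:, None]) * (gmm.covariances_ + gmm.means_ ** 2)
           - mu ** 2)                                                           # Eq. (6)
    gmm.means_, gmm.covariances_ = mu, np.maximum(var, 1e-6)   # hypothetical variance floor
    gmm.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_) # keep scoring consistent (diag case)
    return gmm
```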

The two adaptation schemes, supervised adaptation and self-train adaptation (which is unsupervised), differ in how they acquire new training samples. In the case of the supervised scheme, new training observations are labeled explicitly by the user. Once the user labels new data, MAP is applied to adapt the universal model. However, the data annotation process is tedious and error-prone; thus, labeled data might be scarce. The self-train scheme leverages only unlabeled data without any user input. It reuses the predicted label and confidence statistics generated by the universal stress model during the inference process to select new training samples. Given the imperfection of the universal stress model, it is obviously not wise to completely trust its predictions. Therefore, the self-train method determines whether a data sample is suitable for adaptation according to the confidence level of the inference and uses high-confidence samples. Furthermore, in most conditions, stress exhibits temporal dependence. Successive voice samples are very likely captured under the same stress status; therefore, the likelihoods of the data samples from the GMM models should be relatively stable. An abrupt change in likelihood sometimes indicates a change of stress condition, but it can also be an outlier. Therefore, the system applies a low-pass filter to the output of each GMM to smooth the likelihood:

\[ l_t = \alpha\, l_{t-1} + (1 - \alpha)\, l'_t \]

where l′ is the original likelihood estimated by the GMM and l is the smoothed likelihood. The weight α determines the balance between the history and the incoming data.

Once the likelihood function is estimated, the system uses an entropy-based confidence score to estimate the quality of the classification. For each data sample, the entropy of the normalized likelihoods indicates the confidence level of the inference result. The entropy is computed on the smoothed likelihoods [l_1, l_2] for the stress and neutral class, respectively:

\[ \mathrm{Entropy} = -\sum_{i=1}^{2} (l_i / Z)\,\log (l_i / Z), \qquad Z = \sum_{i=1}^{2} l_i \]


Table 2: Stress classification on indoor data

             Feature set A                        Feature set A+B                      Feature set A+B+C
             precision recall  f-score accuracy   precision recall  f-score accuracy   precision recall  f-score accuracy
universal    73.1%     58.9%   64.4%   68.6%      69.6%     70.0%   69.5%   69.6%      70.5%     74.8%   72.2%   71.3%
unsupervised 74.9%     63.4%   68.1%   70.8%      75.8%     72.1%   73.7%   74.3%      78.8%     76.8%   77.5%   77.8%
supervised   72.7%     70.3%   71.1%   71.9%      78.5%     80.1%   79.2%   79.0%      79.1%     85.0%   81.8%   81.1%
personalized 72.5%     71.9%   72.1%   72.2%      77.1%     83.1%   79.9%   79.1%      80.5%     87.1%   83.6%   82.8%

Table 3: Stress classification on outdoor data

             Feature set A                        Feature set A+B                      Feature set A+B+C
             precision recall  f-score accuracy   precision recall  f-score accuracy   precision recall  f-score accuracy
universal    66.1%     56.0%   60.1%   63.6%      65.7%     63.6%   64.2%   65.5%      67.0%     64.2%   65.2%   66.5%
unsupervised 67.7%     53.4%   58.8%   63.8%      68.3%     62.4%   64.7%   67.1%      76.7%     59.1%   65.7%   70.5%
supervised   66.7%     62.6%   64.2%   65.5%      72.0%     72.0%   71.7%   71.8%      75.6%     75.9%   75.5%   75.5%
personalized 66.6%     65.1%   65.7%   66.1%      72.6%     78.3%   75.2%   74.0%      76.0%     82.4%   78.9%   77.9%

where Z is a normalization term. If the universal classifier has high confidence in the prediction, the entropy will be low. Otherwise, if the normalized likelihoods of the two classes are quite close to each other, the entropy will be high. An entropy threshold is used to control whether a data sample is selected for adaptation. Because unlabeled data is cheap and abundant, the system can use a tight threshold to ensure data quality. Once the adaptation data set is selected, the MAP algorithm is applied to adapt the universal model as in the supervised scheme.
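The self-train sample selection can be sketched as follows, assuming per-window (non-negative) likelihoods from the stressed and neutral GMMs; the smoothing weight and entropy threshold are hypothetical values, not the ones used by the authors:

```python
import numpy as np

def smooth(likelihoods, alpha=0.9):
    """Low-pass filter a sequence of per-window likelihoods."""
    out = np.empty(len(likelihoods), dtype=float)
    out[0] = likelihoods[0]
    for t in range(1, len(likelihoods)):
        out[t] = alpha * out[t - 1] + (1.0 - alpha) * likelihoods[t]
    return out

def select_confident(lik_stress, lik_neutral, entropy_threshold=0.4):
    """Return a mask of high-confidence windows and their predicted labels
    (0 = stressed, 1 = neutral), using the two-class entropy of the
    smoothed, normalized likelihoods."""
    l = np.stack([smooth(lik_stress), smooth(lik_neutral)], axis=1)
    p = l / (l.sum(axis=1, keepdims=True) + 1e-12)
    entropy = -np.sum(p * np.log(p + 1e-12), axis=1)
    labels = np.argmax(l, axis=1)
    keep = entropy < entropy_threshold
    return keep, labels
```

The selected windows and their predicted labels would then be fed to the MAP update sketched earlier, exactly as labeled data is in the supervised scheme.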

We test the effectiveness of the two adaptation schemes in two use cases: adapting the universal model to individual speakers, and adapting a universal model trained in one scenario to another. A detailed evaluation is presented in the next section.

STRESSSENSE EVALUATION
We conduct three experiments using the dataset and stress models described earlier. First, we evaluate the importance and effectiveness of different vocal features for different acoustic environments. Next, we study the success of stress models trained and tested in specific scenarios, i.e., either fully indoors or fully outdoors. Finally, we investigate the stress classification performance under mixed acoustic environments. The amount of neutral data for each subject is about 3 minutes for both the indoor and outdoor scenarios. The amount of stress data for the indoor scenario ranges between 4-8 minutes depending on how talkative the subject is during the 10-minute interview session. The average amount of indoor stress data is 4.11 minutes. For each subject, we also use 4 minutes of speech segments from different outdoor recruiting sessions. For all experiments, the universal models are tested using the leave-one-user-out cross validation method. In cases where adaptation is performed, the target user's data is equally partitioned into an adaptation set and a test set. The adaptation set for a target user is used to customize the universal model learned using training examples that exclude data from that user. Adaptation is done in two ways: (i) supervised adaptation, where the adaptation set includes labels, and (ii) self-train, where the adaptation set is unlabeled. The personalized model is evaluated using five-fold cross validation.
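A sketch of this evaluation protocol, assuming a per-sample array of user ids, might look like the following; the 50/50 adaptation/test split and leave-one-user-out structure follow the text, while the function names and shuffling seed are ours:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_user_out(X, y, users, seed=0):
    """Yield (train, adapt, test) index sets: the universal model is trained on
    all other users, then the held-out user's data is split equally into an
    adaptation set and a test set."""
    rng = np.random.default_rng(seed)
    for train_idx, held_out in LeaveOneGroupOut().split(X, y, groups=users):
        held_out = rng.permutation(held_out)
        half = len(held_out) // 2
        yield train_idx, held_out[:half], held_out[half:]

# Hypothetical usage:
# for train_idx, adapt_idx, test_idx in leave_one_user_out(X, y, user_ids):
#     ...  # train universal GMMs on train_idx, MAP-adapt on adapt_idx, score on test_idx
```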

Stress Classification in Individual Scenarios
In the first experiment, we measure the relevance of the different features using information gain based feature ranking [35]. We treat the indoor and outdoor scenarios as two separate data sets. Table 4 shows the top 10 features in each scenario. It is clear that pitch features and speaking rate are

ranked as the most predictive in both scenarios. MFCC is more relevant in the indoor environment, while TEO-CB-AutoEnv is more useful in outdoor environments. This finding is in line with what Fernandez and Picard observed in [13]; that is, the Teager energy operator is robust to ambient noise. Even though MFCC is able to capture valuable stress-related information, it is sensitive to ambient noise by its nature as a generic acoustic feature. Therefore it is less discriminative outdoors. Spectral features (i.e., high frequency energy ratio and frequency centroid) are more discriminative in outdoor scenarios, but not indoors. Since MFCC is able to capture information similar to the spectral features, the contribution of spectral features is lower in the indoor scenario.

Rank  Indoor             Outdoor
1     pitchStd           pitchStd
2     speakingRate       speakingRate
3     pitchRange         pitchRange
4     MFCC4              TEO-CB-AutoEnv17
5     MFCC3              HighFrequencyRatio
6     MFCC14             TEO-CB-AutoEnv7
7     TEO-CB-AutoEnv2    MFCC2
8     MFCC15             TEO-CB-AutoEnv10
9     MFCC19             MFCC1
10    MFCC1              Centroid

Table 4: Feature ranking in different environments
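For reference, a ranking of this kind can be approximated with the mutual information between each feature and the stress label; the sketch below uses scikit-learn's estimator as a stand-in for the information-gain ranking of [35]:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, names):
    """Rank features by estimated mutual information with the stress label
    (a proxy for information-gain ranking)."""
    scores = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order]

# Hypothetical usage:
# ranking = rank_features(X_indoor, y_indoor, feature_names)
# print(ranking[:10])
```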

To study the impact of the different features on classification, we divide the 42 features into three groups: A) pitch-based features and the two spectral features, B) the TEO-CB-AutoEnv feature set, and C) the MFCC feature set. We add them one by one in the indoor and outdoor stress classification tasks. Table 2 and Table 3 show how performance changes as the feature set grows. Adding TEO-CB-AutoEnv to feature set A improves the classification performance significantly, particularly for the outdoor scenario. MFCC provides a small additional improvement of 3% on top of feature sets A and B. To reproduce speech intelligibly, at least the information of three formants is required [36]. Therefore, our feature sets A and B cannot reconstruct the speech, but with C it is possible. In our future work, we will consider a better privacy-sensitive feature set that does not use MFCC.

Figure 3 and Figure 4 show the accuracy for each of the subjects for each model when all the features are used. Note that subject 1's indoor data was corrupted due to a cellular connectivity issue during the experiment, so we leave that data out of the indoor experiment. The same problem was experienced by subject 5 in the outdoor case, which was excluded from the outdoor experiment. As shown in the plots, the universal model is penalized for its one-size-fits-all philosophy.


Figure 3: Accuracy of the indoor scenario for each subject (universal, self-train, supervised, and personalized models).

Figure 4: Accuracy of the outdoor scenario for each subject (universal, self-train, supervised, and personalized models).

Figure 5: Supervised adaptation, indoor: f-score and accuracy versus the amount of data used for adaptation.

Figure 6: Supervised adaptation, outdoor: f-score and accuracy versus the amount of data used for adaptation.

Figure 7: StressSense implementation component diagram (Sensing, Admission Control, Inference: microphone buffer queue, sound detection, speech detection, feature extraction, and GMM classification across two processing threads).

The universal stress model provides the lowest accuracy: 71.3% for the indoor scenario and 66.6% for the outdoor scenario. For comparison, [14] achieved 72.63% accuracy with only female speakers on data from a controlled indoor environment using professional microphones. The personalized model provides the highest accuracy: 82.9% for the indoor scenario and 77.9% for the outdoor scenario.

The two adaptation schemes are intended to adapt a universal model to a new user when limited training data is available. None of the new user's data is part of the universal model's training data set. From Table 2 and Table 3, it is clear that each of the adaptation methods performs well above the universal stress model. Not surprisingly, the supervised adaptation scheme, which takes advantage of user input, does a better job than the unsupervised self-train scheme. In both cases, the supervised adaptation scheme provides a 10% boost in accuracy, and is only about 2% lower than the personalized model. Therefore supervised adaptation is a practical alternative to the personalized scheme when it is impractical to collect enough labeled training data for user-dependent models. We conduct another experiment to study the optimal amount of labeled data required for supervised adaptation. In this experiment, 20% of the data is held out as a test set, and we increase the adaptation set from 10% to 80%. Figure 5 and Figure 6 show that the performance of the supervised adapted model improves as more data is used for adaptation. For both the indoor and outdoor scenarios, the increase in stress classification performance levels off when approximately 40% of the data is used for adaptation. Generally speaking, using about 2 minutes of labeled data increases the accuracy by about 8.3%, with minimal increase in accuracy thereafter.

The self-train scheme is fully unsupervised. In comparison to the universal scheme, the self-train scheme provides a performance increase of 6.5% and 4.0% for the indoor and outdoor scenarios, respectively. The self-train scheme is more effective in the indoor case because the indoor universal classifier works better than the outdoor one. Unlike the supervised adaptation scheme, which receives new label information from the user, the self-train scheme relies on the class information encoded in the universal models. Therefore, it is sensitive to the performance of the original stress model that adaptation starts with.

As expected, stress detection in an uncontrolled outdoor environment is more challenging than in the indoor environment. Even though human speech is the dominant sound in our data sets, the impact of different contexts is not negligible. In the following section, we investigate the effectiveness of cross-scenario model adaptation, i.e., whether a stress model trained in a given acoustic environment can be adapted to perform well in another, very different acoustic environment.

Cross-Scenario Model Adaptation
To test how well a stress model trained in one scenario performs in a different scenario, we apply the indoor model to outdoor data and the outdoor model to indoor data. The universal and personalized models are tested directly, while the adaptation models are adapted from the universal model of the original scenario using adaptation data from the new scenario. We use a 50% split between adaptation and test sets.

Table 5 shows the performance when the stress classifier trained using indoor data is applied to unconstrained outdoor data. Compared to Table 2, both the universal and personalized classifiers perform poorly and are incapable of properly handling the noisier outdoor data.


Table 5: Indoor model tested outdoor

            precision  recall  f-score  accuracy
universal   67.6%      28.2%   38.9%    57.8%
self-train  78.1%      26.4%   38.4%    60.1%
supervised  78.1%      70.1%   72.9%    74.8%
personal    65.3%      39.4%   47.7%    59.7%

Table 6: Outdoor model tested indoor

            precision  recall  f-score  accuracy
universal   64.8%      60.8%   62.2%    63.6%
self-train  66.5%      59.0%   61.7%    64.6%
supervised  74.1%      77.3%   75.5%    74.9%
personal    77.4%      77.7%   77.5%    77.4%

Figure 8: Runtime Benchmark

Component            Avg. Runtime (sec)
                     Nexus S    Galaxy Nexus
admission control    0.005      0.003
feature extraction   1.95       1.20
classification       0.27       0.22
full pipeline        2.23       1.43

Consequently, the self-train adaptation also provides limited performance gain, but supervised adaptation is able to increase the accuracy by 17%. On the other hand, models trained on unconstrained outdoor data work better in the controlled indoor environment. Table 6 shows the performance of stress classification when the outdoor stress classifiers are applied to indoor data. Due to the change of environment, the performance of the outdoor classifier is lower than in Table 3, where the outdoor classifiers are tested on their native outdoor data. However, the performance drop is moderate compared to when the indoor model is applied to outdoor data. It is likely that classifiers trained on real-world data model speech under stress more precisely than classifiers trained in controlled environments and are more resilient to context changes. Supervised adaptation can further improve the accuracy by 10%. This result suggests that real-world data is more important than data collected in a controlled environment and leads to more robust classifiers. When it is too difficult to collect a large real-world dataset, a small amount of labeled data (e.g., 2 minutes) from the new environment can make a big difference.

STRESSSENSE PROTOTYPE IMPLEMENTATION
Figure 7 shows our current proof-of-concept implementation of the StressSense classification pipeline. Our results show that it is feasible to implement a computationally demanding stress classification system on off-the-shelf smartphones. Insights and results from this initial implementation will serve as a basis for the further development and release of the StressSense App. The StressSense prototype is implemented on the Android 4.0 platform. The StressSense software comprises approximately 5,000 lines of code and is a mixture of C, C++ and Java. The most computationally demanding signal processing and feature extraction algorithms (for the features listed in Table 1) are written in C and C++ and interfaced with Java using JNI wrappers. The speaking rate is computed by the cross-compiled Enrate library [37]. Java is used to build an Android application which allows us to access the microphone data and construct a simple GUI to drive the app. Training is done offline (future work will consider online training); the offline server-side training code is implemented primarily in Matlab.

The StressSense software components and their interactions are shown in Figure 7. In the current pipeline implementation, the processing comprises several computational stages with increasing computational cost. Each stage triggers the next, more computationally demanding stage on an on-demand basis, yielding an efficient pipeline. Using the standard Android API, we collect 8 kHz, 16-bit, mono audio samples from the microphone. The PCM-formatted data is placed in a circular queue of buffers, with each buffer in the queue holding one processing window. Each processing window includes 40 non-overlapping frames. Each frame contains 256

16-bit audio samples. Once a buffer is full, it is provided to the sound detector component. If the sound detector classifies the data as non-silence, the data proceeds to the voice detector to determine whether the incoming sound is human speech. We use sound and voice detection algorithms based on [31]. If human speech is present in the data, then stress detection is applied. In this case, a stress/non-stress inference result is made for each processing window (every 1.28 sec). The current implementation does not consider speaker segmentation due to the technical difficulty of recording from a second external microphone on current Android phones (we intend to study the use of multiple mics on phones as well as Bluetooth headsets in future work).
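The staged admission control can be sketched as follows; the detector callables are placeholders for the sound, voice, and stress classifiers described in the text, and the toy detectors in the usage comment are assumptions:

```python
import numpy as np

FRAME, FRAMES_PER_WINDOW = 256, 40   # 1.28 s windows of 8 kHz audio, as in the prototype

def process_window(window, is_sound, is_voice, classify_stress):
    """Staged admission control: cheaper detectors gate the expensive classifier."""
    if not is_sound(window):          # silence: stop at the admission-control stage
        return "silence"
    if not is_voice(window):          # non-speech sound: skip stress inference
        return "non-voice"
    return "stressed" if classify_stress(window) else "neutral"

# Hypothetical usage with toy detectors:
# result = process_window(np.zeros(FRAME * FRAMES_PER_WINDOW),
#                         is_sound=lambda w: np.mean(w ** 2) > 1e-4,
#                         is_voice=lambda w: True,
#                         classify_stress=lambda w: False)
```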

StressSense is a CPU-bound application due to the need to continuously run a classification pipeline. Note, however, that the full classification pipeline is only fully engaged when necessary, i.e., in the presence of human voice; otherwise it runs at a low duty cycle. The current prototype is optimized for lower CPU usage at the cost of a larger memory footprint. To best understand the cost of running StressSense on a phone, we conducted a detailed component benchmark test, as shown in Table 8. We implement and benchmark StressSense on two Android phones: the Samsung Nexus S and the Samsung Galaxy Nexus. The Nexus S comes with a single-core 1 GHz Cortex-A8 CPU, while the newer Galaxy Nexus has a dual-core 1.2 GHz Cortex-A9 CPU. For a fair comparison, the runtime shown in Table 8 is measured with a single processing thread on both phones. When the whole pipeline is fully engaged, neither phone is able to process the data in real time with a single core: the Nexus S and Galaxy Nexus take 2.23 s and 1.43 s, respectively, to process a window (1.28 s) of audio data. However, the Galaxy Nexus is able to achieve real-time processing when using two processing threads in parallel on its two CPU cores. During full operation, the CPU usage is 93%-97% and 46%-55% for the Nexus S (one CPU core) and Galaxy Nexus (two CPU cores), respectively. The computational power in newer generation phones is critical for real-time operation of StressSense. The memory usage is 7.8 MB on the Nexus S, whereas the memory usage on the Galaxy Nexus is 15 MB due to the memory used by the extra processing thread. In terms of power consumption, the average current draw on a Galaxy Nexus is 53.64 mA when the recorded audio is not voice (the pipeline stops at the admission control stage in this case); when the recorded audio is human speech (the full StressSense pipeline engages), the average current draw increases to 182.90 mA. With the standard 1750 mAh battery, a Galaxy Nexus will last for 32.6 hours and 9.6 hours in the above two cases, respectively. Note that the increased availability of multi-core phones opens the opportunity to implement more sophisticated pipelines in the future. Our implementation is one of the first that exploits dual cores for continuous sensing; we show that this complex pipeline is

Page 10: StressSense: Detecting Stress in Unconstrained Acoustic ...publications.idiap.ch/downloads/papers/2012/Lu_UBICOMP_2012.pdf · tween stress and emotion detection from speech. ... ologically

capable of running continuously in real-time without signifi-cant degradation of the phone’s user experience. We believefuture quad-core phones will lead to increased performanceand make continuous sensing applications more common-place in our daily lives.
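For completeness, the reported lifetimes follow directly from dividing the rated battery capacity by the measured average current draw:

\[
\frac{1750~\text{mAh}}{53.64~\text{mA}} \approx 32.6~\text{h}
\qquad \text{and} \qquad
\frac{1750~\text{mAh}}{182.90~\text{mA}} \approx 9.6~\text{h}.
\]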

CONCLUSION AND FUTURE WORK
We presented an adaptive method for detecting stress in diverse scenarios using the microphone on smartphones. Unlike physiological sensors, microphones do not require contact with the human body and are ubiquitous in mobile phones. Using a non-iterative MAP adaptation scheme for Gaussian mixture models, we demonstrated that it is feasible to customize a universal stress model to different users and different scenarios using only a few new data observations and at low computational overhead. Our proof-of-concept software demonstrates that StressSense can run on off-the-shelf smartphones in real time. In our current work, we conducted an initial study of cognitive-load-related stress in job interviews and outdoor job execution tasks; for neutral voice, we used reading data. In the real world, however, the stress and neutral scenarios are far more diverse. As part of future work, and building on our initial prototype implementation, we plan to design, deploy, and evaluate a StressSense Android app that harvests a diverse range of stress and neutral speech data from phone calls and conversations occurring in the wild from a heterogeneous population of users. Each of the StressSense classification models evaluated in this paper presents tradeoffs in terms of training and the burden on users. We plan to base our future StressSense application release on an adaptive pipeline that uses the self-train model for speaker adaptation and the supervised adaptation model for environment adaptation.
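As a reminder of the form such an update takes, the canonical non-iterative MAP adaptation of Reynolds et al. [34] moves each Gaussian component toward the new observations in proportion to the amount of data it is responsible for; for the component means,

\[
n_i = \sum_{t=1}^{T} \Pr(i \mid x_t), \qquad
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t, \qquad
\hat{\mu}_i = \alpha_i\, E_i(x) + (1 - \alpha_i)\, \mu_i, \quad
\alpha_i = \frac{n_i}{n_i + r},
\]

where \(\mu_i\) is the mean of the universal model, \(x_1, \dots, x_T\) are the new adaptation observations, and \(r\) is a relevance factor controlling how strongly the adapted model follows the new data. This is shown only as a recap of the standard formulation in [34].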

ACKNOWLEDGMENTS
This work is supported in part by the SNSF SONVB Sinergia project, NSF IIS award #1202141, and the Intel Science and Technology Center for Pervasive Computing. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of any funding body. Special thanks go to Andy Sarroff for his early contributions to this work.

REFERENCES
1. http://www.apa.org/pubs/info/reports/2007-stress.doc.
2. S. Cohen, R.C. Kessler, and L.U. Gordon. Measuring stress: A guide for health and social scientists. Oxford University Press, USA, 1997.
3. A. Perkins. Saving money by reducing stress. Harvard Business Review, 72(6):12, 1994.
4. Michael H. L. Hecker, Kenneth N. Stevens, Gottfried von Bismarck, and Carl E. Williams. Manifestations of task-induced stress in the acoustic speech signal. The Journal of the Acoustical Society of America, 44(4):993–1001, 1968.
5. K.R. Scherer. Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99(2):143–165, 1986.
6. K.R. Scherer. Dynamics of Stress: Physiological, Psychological and Social Perspectives, chapter 9: Voice, Stress, and Emotion, pages 157–179. Plenum Press, New York, 1986.
7. F.J. Tolkmitt and K.R. Scherer. Effect of experimentally induced stress on vocal parameters. Journal of Experimental Psychology: Human Perception and Performance, 12(3):302–313, 1986.
8. Allen D. Kanner, James C. Coyne, Catherine Schaefer, and Richard S. Lazarus. Comparison of two modes of stress measurement: Daily hassles and uplifts versus major life events. Journal of Behavioral Medicine, 4(1):1–39, 1981.
9. John H. L. Hansen. SUSAS. Linguistic Data Consortium, Philadelphia, 1999.
10. H. Teager. Some observations on oral air flow during phonation. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(5):599–601, 1980.
11. J.F. Kaiser. Some useful properties of Teager's energy operators. In Proc. ICASSP '93, volume 3, pages 149–152, 1993.
12. G. Zhou, J.H.L. Hansen, and J.F. Kaiser. Nonlinear feature based classification of speech under stress. IEEE Transactions on Speech and Audio Processing, 9(3):201–216, 2001.
13. Raul Fernandez and Rosalind W. Picard. Modeling drivers' speech under stress. Speech Communication, 40(1–2):145–159, 2003.
14. S.A. Patil. Detection of speech under physical stress: Model development, sensor selection, and feature fusion. 2008.
15. Keng-hao Chang, Drew Fisher, and John Canny. AMMON: A speech analysis library for analyzing affect, stress, and mental health on mobile phones. In Proceedings of PhoneSense 2011, 2011.
16. Kiran K. Rachuri, Mirco Musolesi, Cecilia Mascolo, Peter J. Rentfrow, Chris Longworth, and Andrius Aucinas. EmotionSense: A mobile phones based adaptive platform for experimental social psychology research. In Proc. UbiComp '10, pages 281–290, 2010.
17. K.R. Scherer, D. Grandjean, T. Johnstone, G. Klasmeyer, and T. Banziger. Acoustic correlates of task load and stress. In Proc. ICSLP 2002, pages 2017–2020, 2002.
18. S. Basu. A linked-HMM model for robust voicing and speech detection. In Proc. ICASSP '03, volume 1, page I-816. IEEE, 2003.
19. D. Wyatt, T. Choudhury, and J. Bilmes. Conversation detection and speaker segmentation in privacy sensitive situated speech data. In Proceedings of Interspeech, pages 586–589, 2007.
20. T. Choudhury and S. Basu. Modeling conversational dynamics as a mixed memory Markov process. In Proc. of Intl. Conference on Neural Information Processing Systems (NIPS), 2004.
21. J.P. Henry. Stress, neuroendocrine patterns, and emotional response. 1990.
22. K.R. Scherer. Voice, stress, and emotion. Dynamics of Stress: Physiological, Psychological and Social Perspectives, pages 157–179, 1986.
23. Robert Ruiz, Emmanuelle Absil, Bernard Harmegnies, Claude Legros, and Dolors Poch. Time- and spectrum-related variabilities in stressed speech under laboratory and real conditions. Speech Communication, 20(1–2):111–129, 1996.
24. J.H.L. Hansen. Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communication, 20(1–2):151–173, 1996.
25. William A. Jones Jr. An Evaluation of Voice Stress Analysis Techniques in a Simulated AWACS Environment. Air Force Institute of Technology, Wright-Patterson AFB, School of Engineering, 1990.
26. Lambert M. Surhone, Miriam T. Timpledon, and Susan F. Marseken. Voice Stress Analysis, July 2010.
27. N. Morgan and E. Fosler. Speech recognition using on-line estimation of speaking rate. Conference on Speech, 1997.
28. John Hansen and Sanjay Patil. Speech under stress: Analysis, modeling and recognition. In Christian Muller, editor, Speaker Classification I, volume 4343 of Lecture Notes in Computer Science, pages 108–137. Springer Berlin/Heidelberg, 2007.
29. John H. L. Hansen, Wooil Kim, Mandar Rahurkar, Evan Ruzanski, and James Meyerhoff. Robust emotional stressed speech detection using weighted frequency subbands. EURASIP Journal on Advances in Signal Processing, 2011:1–10, 2011.
30. J. Saunders. Real-time discrimination of broadcast speech/music. In Proc. ICASSP '96, volume 2, pages 993–996. IEEE, 1996.
31. H. Lu, A. Bernheim Brush, B. Priyantha, A. Karlson, and J. Liu. SpeakerSense: Energy efficient unobtrusive speaker identification on mobile phones. In Proc. Pervasive 2011, pages 188–205, 2011.
32. D.A. Reynolds and R.C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72–83, 1995.
33. D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, 1995.
34. Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41, 2000.
35. M.A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, pages 1437–1447, 2003.
36. R.E. Donovan. Trainable speech synthesis. PhD thesis, Cambridge University Engineering Department, 1996.
37. N. Morgan, E. Fosler, and N. Mirghafori. Speech recognition using on-line estimation of speaking rate. In Fifth European Conference on Speech Communication and Technology, 1997.