
SocialSense: A Collaborative Mobile Platform for

Speaker and Mood Identification

Mohsin Y Ahmed, Sean Kenkeremath, and John Stankovic

University of Virginia, Computer Science Department

Charlottesville, Virginia, USA {mya5dm, stk4zn, stankovic}@virginia.edu

Abstract. We present SocialSense, a collaborative smartphone-based speaker and mood
identification and reporting system that uses a user's voice to detect and log his/her
speaking and mood episodes. SocialSense works collaboratively with nearby phones that
are running the app, periodically exchanging speaking and mood vectors with the other
users present in a social interaction setting, thus keeping track of the global speaking
episodes of all users together with their moods. In addition, it uses a novel
event-adaptive dynamic classification scheme for speaker identification, which updates
the speaker classification model every time one or more users enter or leave the
scenario, ensuring an up-to-date classifier based on user presence. Evaluation of the
dynamic classifiers shows that SocialSense improves speaker identification accuracy by
30% compared to traditional static speaker identification systems, and yields a 10% to
43% performance boost in various noisy environments. SocialSense also improves mood
classification accuracy by 4% to 20% compared to the baseline approaches. Energy
consumption experiments show that the daily device lifetime is between 10 and 14 hours.

Keywords: social interaction, assisted living, depression, smartphone.

1 Introduction

Speaker identification systems based on in-home/on-body/smartphone microphones

are used for various applications such as voice based authentication, home health

care, security, and daily activity monitoring. Given the pervasive use of smartphones

in everyday life, the smartphone is an exceptionally suitable, unobtrusive platform for speaker

identification that avoids the overhead of on-body or contextual sensors. Besides speaker

identification, speaker mood detection is another important problem in human interac-

tion studies and social psychology research. The challenges of smartphone based

speaker identification and mood detection include preserving user privacy, maintain-

ing identification accuracy, accurate operation of the system irrespective of

smartphone location, resilience against ambient noise and operating under energy

constraints.

Both speaker and mood identification are part of a bigger and important health

sensing problem, detection of human social interaction, which is an important indica-

tor of mental and physical health in people of all ages. Regular good social interaction

brings many health benefits including reduced risk for cardiovascular and Alzheimer's

disease, some cancers, osteoporosis and rheumatoid arthritis, steady blood pressure

and reduced risk of depression and other mental disorders. On the other hand, social

isolation leads to loneliness, depression, and physical inactivity, and overall to

a greater risk of death for older people. Therefore, a system able to detect people's

social interactions and mood would be greatly beneficial for caregivers to more accu-

rately diagnose and treat patients suffering from psychological disorders.

We present SocialSense, a collaborative smartphone based speaker and mood iden-

tification and reporting system which logs user speaking and mood episodes from

his/her voice. A person can be uniquely identified by his smartphone Bluetooth ID.

After SocialSense detects its user's speaking episode and mood, it broadcasts a mes-

sage containing the user ID, the speaking episode timestamp, and corresponding

mood to all neighboring phones via Bluetooth broadcasting. Thus every phone logs a

global scenario of the social interaction environment. In a nutshell, SocialSense can

answer the following questions:

- When is the phone user speaking?
- What is the mood of the user while speaking?
- With whom is the user speaking? Who else is present nearby?
- When are the other persons in the environment speaking?
- What are the moods of the other persons while they speak?

Besides detecting the smartphone user's mood, understanding the moods of other

persons present in a social interaction is an important indicator of the global mood,

hence the quality of the social interaction. Having this feature, SocialSense can poten-

tially be used to demonstrate and verify the effect of mood contagion, i.e. how multi-

ple individuals in a social interaction reach a mood convergence [1]. Using the idea of

mood contagion, SocialSense can be used as a recommender system where it can

recommend happy persons as potential conversation partners of sad persons to cheer

them up. The actual use of SocialSense for mood contagion is outside the scope of

this work.

Our prime target for SocialSense is assisted living facilities for the elderly, where
the prevalence and magnitude of depression are of major concern. More than 1 million
Americans currently reside in assisted living facilities. Studies have found that 20%
to 24% of assisted living residents have symptoms of major or minor depression, which
is likely to cause physical, cognitive, and social impairment, as well as delayed
recovery from medical illness and surgery. Alarmingly, many depressed older adults die
by suicide. Among men aged 75 and over, the suicide rate is 37.4 per 100,000
population. Several diagnostic barriers exist for the screening and treatment of
depression in assisted living facilities, including a lack of regulatory requirements,
privacy concerns, cost, and misinterpretation of depression. It is suggested that
assisted living staff (nurses, therapists, medical directors) should proactively assess
residents for depressive syndromes instead of relying on self-reporting of mood changes
by the residents. SocialSense can be used as an automated diagnostic tool to monitor
the mood and social interactions of assisted living residents, where each resident is
provided with a smartphone running our system [16].

Since SocialSense can capture the global scenario of a social interaction setting, it

can be used as a data collection system for various social psychology and human in-

teraction research. In addition, SocialSense incorporates a dynamic event-driven clas-

sification scheme for speaker identification. New people can enter into a social inter-

action while some people may leave at any time. SocialSense periodically refreshes

its Bluetooth neighbor set and whenever it detects a change in the set, i.e., some peo-

ple entered or left, it recreates the classification model based on the new neighbor set.

For this purpose, it imports the user training feature files from the newly arrived

phones to re-compute the classification model.

The main contributions of SocialSense are:

- An unobtrusive voice-based speaker identification and mood detection system that uses
  the user's smartphone. It does not use any on-body or contextual sensors, which
  contributes to mobility and user-friendliness.

- A practical, easy, and short training scheme to train a phone to detect a person's
  own speaking episodes. One key novelty of SocialSense is that it avoids the need for
  exhaustive training by all users in a social interaction setting and still accurately
  detects all speakers through collaboration among the phones.

- Privacy support for users. No audio samples are recorded or stored on the phone;
  features are extracted in real time and removed from the system after classification.
  The original audio signal cannot be reconstructed from SocialSense at a later time.

- SocialSense's voice-based mood detection module in every phone is conventional;
  however, through collaboration among the phones, it can detect the mood of the
  members of a conversation group and the change in a person's mood when he/she
  switches between conversation groups, i.e., demonstrate mood contagion. This novel
  idea has not been explored before, and such a system would be invaluable for further
  experiments on social mood dynamics. Also, by using a random forest classifier for
  mood detection instead of the GMM and SVM classifiers used in baseline systems
  [7, 8], our system achieves a 4% to 20% increase in accuracy over the baselines.

- Support for real-life environments where new people enter and existing people leave
  the social interaction environment. SocialSense periodically refreshes its Bluetooth
  neighbor set to detect such changes in the environment.

- A novel dynamic event-driven classification scheme, in which speaker identification
  is performed using an up-to-the-moment classifier based on the users currently
  present in the scenario. This yields an average 30% increase in classification
  accuracy compared to static classification.

- Evaluation in noisy environments, performed by injecting various types of artificial
  noise to simulate real-life noisy environments. The results demonstrate that
  SocialSense improves speaker identification accuracy in noise by 10% to 43%,
  depending on the type of noise, and mood detection accuracy in noise by 33% compared
  to the state-of-the-art systems. SocialSense was trained with noise to yield these
  performance boosts, which has not been done in the baseline approaches.

2 Related Work

Many of the existing speaker identification systems require the total number of speak-

ers to be static, and they employ static classification schemes so that each speaker

needs to train the system beforehand, which makes them less realistic [2, 3].

SocialWeaver [4] uses a multi-level classification for speaker identification. The first

level uses energy histogram classifiers while the second level uses a GMM based

classifier. Neary [5] uses similarity of sound environment to detect conversational

fields. These energy and loudness based approaches have greater error in noisy condi-

tions and they fail if there is a person present in the scenario without his phone.

SpeakerSense [6] is a speaker identification platform built on a heterogeneous multiprocessor

architecture. It attempts to reduce training overhead by training from real-life

events such as phone calls and one-to-one conversations, but does not evaluate the

system in noisy environments. Also, it requires the total number of speakers to be

static and does not support realistic dynamic environments where speakers enter and

leave on the fly.

There are a number of existing systems which detect user's mood from voice.

EmotionSense [7] provides dual systems for speaker identification and emotion detection

from the user's voice using Gaussian mixture models. The system in [8] provides SVM-based

classifiers that recognize human emotional states from utterances. However, these

systems can only capture mood of a single person or entity and, therefore, are not

suitable for social psychology experiments where a system would need to know

moods of everyone in a social interaction. Also, there is no evidence that these sys-

tems would operate well under real life noisy environments.

Besides speaker identification and mood detection, there have been systems which

detect other aspects of social interaction using different modalities. Some of the exist-

ing work on social interactions uses only on-body sensors such as accelerometers,

gyroscopes, GPS, microphones, and cameras. Pierluigi et al. [9] built a badge having

a triaxial accelerometer and a JPEG camera which is used to detect the presence of

other people. Crowd++ [18] estimates the number of people talking in a certain place

using unsupervised machine learning on audio captured by smartphones.

CenceMe [10] can automatically detect activities of individuals and share the sensing

results through social networks.

Another type of work uses ambient sensors. [11] uses a sociometric badge

equipped with infrared transmitter/receiver and microphone which senses and models

human networks. In [12] four video cameras and audio collectors are placed in public

areas such as the dining room, living room and hallway which can detect high-level

social interactions among people such as greeting, standing conversation, and walking

together.

We compare SocialSense with some state-of-the-art smartphone based sensing sys-

tems in table 1.

Table 1. Comparison of State-of-the-art

System | Operations | Classifiers used | Results
EmotionSense [7] | Speaker identification, mood detection | GMM | 90% speaker ID accuracy, 70% mood detection accuracy
SpeakerSense [6] | Speaker identification | GMM | 95% speaker identification accuracy
SocialWeaver [4] | Speaker identification, conversation group clustering | Loudness histogram, GMM | 90% speaker ID accuracy, 70-90% accuracy for conversation clustering
Neary [5] | Detect conversational fields, i.e., detecting multiple persons who are in a conversation | No classifier | 96.6% precision and 67.9% recall achieved in a controlled experiment
Qiang et al. [13] | User activity, speaker ID, proximity, location | Naive Bayes, Discriminant, Boosted tree, Bagged tree | 92% accurate speech detection
SocialSense | Speaker identification, mood detection, mood contagion sensing | Logistic regression, Random forest | 94% speaker ID accuracy, 90% speaker ID accuracy in noise, 80% mood detection accuracy, 76% mood detection accuracy in noise

3 SocialSense System Design and Operation

The assumption behind SocialSense is that every user in a social interaction setting
carries his/her own phone with the SocialSense app running on it. However, if one or
more persons are present without their phones, only their speaking and mood episodes
will remain undetected and unreported, while all other users' speaking and mood
episodes will be detected and broadcast as usual, with no error introduced by the
missing phones.

Figure 1 shows the system diagram. The SocialSense app runs continuously in

each phone listening to audio streams. Silent frames are detected by comparing each

frame’s energy to a threshold, and filtered from further processing to save energy.

Each phone periodically updates its phone-set within its Bluetooth proximity range

(~10 m). The system must meet the energy constraints of mobile devices

to be usable in realistic scenarios. SocialSense is capable of running for

10 to 14 hours continuously on smartphones and tablets, which is sufficient for its

usage as a healthcare, research and data collection tool in assisted living. SocialSense

is made up of a number of modules described in the following sections.
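
The energy-threshold silence filter described above can be sketched as follows. This is a minimal illustration, not the SocialSense implementation: the function names, the frame length, and the threshold value of 1e-4 are assumptions chosen for the example, and frame energy is taken to be the mean squared sample amplitude.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (samples assumed to lie in [-1.0, 1.0])."""
    return sum(s * s for s in frame) / len(frame)


def drop_silent_frames(frames, threshold=1e-4):
    """Keep only frames whose energy exceeds the threshold (value is hypothetical)."""
    return [frame for frame in frames if frame_energy(frame) > threshold]


quiet = [0.001] * 160        # near-silent frame, energy about 1e-6
voiced = [0.1, -0.1] * 80    # louder frame, energy about 1e-2
kept = drop_silent_frames([quiet, voiced, quiet])
print(len(kept))             # only the voiced frame is passed on
```

Only frames that pass this check would go on to feature extraction, which is where the energy saving comes from.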

3.1 Phone-set Formation

A phone's phone-set is defined as the set of phones running the SocialSense app situ-

ated within the Bluetooth proximity range from that phone. This module running in

every phone refreshes its phone-set periodically (generally every 30s) to keep the

most recent neighboring phones in its phone-set. The refresh interval is chosen to be

neither so short that it triggers redundant phone-set discovery nor so long that it

misses significant changes in the phone-set, given the realities of human social

interaction. All members of a phone-set are assumed to be close enough to participate in a

conversation. Conversely, phones not belonging to the phone set are assumed to be

not participating in a conversation.
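
A minimal sketch of the periodic phone-set comparison, assuming each refresh yields a set of neighbor identifiers. The identifiers and function names are hypothetical, and the Bluetooth discovery step itself is not modeled.

```python
REFRESH_INTERVAL_S = 30  # refresh period stated in the text


def phone_set_change(previous, current):
    """Return (arrived, departed) between two phone-set snapshots."""
    previous, current = set(previous), set(current)
    return current - previous, previous - current


def has_changed(previous, current):
    """True when at least one phone entered or left between two refreshes."""
    arrived, departed = phone_set_change(previous, current)
    return bool(arrived or departed)


before = {"phone-A", "phone-B", "phone-C"}
after = {"phone-A", "phone-C", "phone-D"}
arrived, departed = phone_set_change(before, after)
print(sorted(arrived), sorted(departed))
```

A non-empty `arrived` or `departed` set is the condition under which the classifier would need to be rebuilt.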

Fig. 1. SocialSense block diagram

3.2 Speaking Episode Detection Module

This module determines whether a voice segment belongs to the phone user or not.

The speaker identification is a binary classification problem where every non-silent

audio segment must be classified into one of two classes: "phone user's voice" or

"anything else" (e.g. others' voice or ambient noise). It uses a dynamic logistic regres-

sion based classifier, which can be easily trained by the user (or support personnel in

assisted living). The user trains the speaker classification system by speaking for 60

seconds in front of the phone in a normal tone and at normal loudness. This simple, easy-to-use,

and short self-training scheme allows the classifier to be updated with the latest

voice samples of the user. In assisted living facilities, this training will be done by the

staff.

Some existing smartphone based speaker identification systems classify speakers

based on the loudness of the perceived audio signal [4], [13]. The hypothesis behind

those works is that a user's voice is louder on his own phone at any given time

instant than on any neighboring phone (as the user is

supposed to be the closest person to his phone). However, this scheme does not work well

in noisy environments, nor in situations where a person without a phone is

talking with people who have their phones. In the latter case, when the person without

a phone is talking, his voice will be loudest on the phone of whoever is closest to

him, so that phone will incorrectly take him for its user

and classify the segment as positive. Other systems like SpeakerSense [6]

require training a speaker model for each individual who needs to be recognized, thus

incurring large training overhead and resulting in complex, power-hungry classifiers.

SocialSense, on the other hand, uses a simple logistic regression based binary clas-

sifier with very little training overhead using 39 MFCC (mel-frequency cepstral coef-

ficient) features. The phone-user (or staff) can train the system easily by speaking for

60 seconds in front of the phone in a normal tone and volume to create a speaker

classification model. Because a person's voice may change with his

physiological state, this easy-to-use training scheme can be repeated so that the system detects when its

user is speaking irrespective of his voice quality, in the presence of noise, and even

when a person without a phone is present in the scenario. Unlike volume

based systems, SocialSense does not fail when a user is present without his phone.

Only his speaking and mood episodes remain undetected, but the systems in other

users' phones work fine. The presence of a user without his phone does not incur any

error or failure in the overall system operation.
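
The binary "phone user's voice" versus "anything else" decision can be illustrated with a small logistic regression trained by stochastic gradient descent. This is a sketch under stated assumptions: real MFCC extraction is omitted, synthetic 39-dimensional Gaussian vectors stand in for the features, and the function names, learning rate, and epoch count are hypothetical rather than taken from SocialSense.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, labels, epochs=50, lr=0.1):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def is_phone_user(w, b, x):
    """Classify one feature vector: True means 'phone user's voice'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


rng = random.Random(0)
DIM = 39  # number of MFCC features stated in the text


def synthetic(center, n):
    """Stand-in feature vectors drawn around a class center (not real MFCCs)."""
    return [[rng.gauss(center, 1.0) for _ in range(DIM)] for _ in range(n)]


user, others = synthetic(1.0, 40), synthetic(-1.0, 40)
w, b = train_logistic(user + others, [1] * 40 + [0] * 40)
print(is_phone_user(w, b, [1.0] * DIM), is_phone_user(w, b, [-1.0] * DIM))
```

A model of this size has only 40 parameters, which is consistent with the low training overhead the text emphasizes.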

3.3 Mood Detection Module

Detecting speaker mood in a mobile platform is a major challenge in this work. If a

voice segment has been classified as a user's voice by the speaking episode detection

module, this module further determines the user's mood (happy, sad, angry, neutral)

from his voice. Then it generates a speaking and mood vector consisting of the start-

ing and stopping timestamp of the user's speaking episode and mood during that

speaking episode. This module extracts 39 MFCC coefficients from each user utterance

window and calculates 9 different statistics on each coefficient, yielding

39 x 9 = 351 audio features. These statistics are: geometric mean, harmonic mean,

arithmetic mean, range, skewness, standard deviation, z-scored average, moment and

kurtosis. The MFCC coefficients combined with these statistics carry a large amount

of prosodic and energy based information correlated to emotion. It then uses these

features to train a random forest classifier from the EMA emotional utterance dataset

[15] for detecting mood.
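
The layout of the 351-dimensional mood feature vector (9 statistics for each of 39 coefficients) can be sketched as below. Several details are not specified in the text and are assumptions made only for this illustration: the geometric and harmonic means are taken over coefficient magnitudes (MFCC values can be negative), "moment" is taken to be the second central moment, and "z-scored average" is taken to be the mean absolute z-score. The input window is synthetic.

```python
import math

EPS = 1e-9  # assumption: guards the logarithm and division for zero-valued entries


def nine_stats(values):
    """Nine summary statistics of one coefficient track (several are assumed forms)."""
    n = len(values)
    mean = sum(values) / n

    def central(k):
        return sum((v - mean) ** k for v in values) / n

    std = math.sqrt(central(2))
    mags = [abs(v) + EPS for v in values]            # assumption: use magnitudes
    geometric = math.exp(sum(math.log(m) for m in mags) / n)
    harmonic = n / sum(1.0 / m for m in mags)
    skewness = central(3) / std ** 3 if std > 0 else 0.0
    kurtosis = central(4) / std ** 4 if std > 0 else 0.0
    # Assumption: "z-scored average" read as the mean absolute z-score.
    z_avg = sum(abs(v - mean) / std for v in values) / n if std > 0 else 0.0
    # Assumption: "moment" read as the second central moment.
    return [geometric, harmonic, mean, max(values) - min(values),
            skewness, std, z_avg, central(2), kurtosis]


def mood_features(window):
    """window: list of frames, each holding 39 coefficients -> flat feature list."""
    features = []
    for c in range(len(window[0])):
        features.extend(nine_stats([frame[c] for frame in window]))
    return features


# Synthetic window of 100 frames x 39 coefficients (not real MFCC data).
window = [[math.sin(0.1 * t * (c + 1)) for c in range(39)] for t in range(100)]
features = mood_features(window)
print(len(features))  # 39 coefficients x 9 statistics
```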

3.4 Message-Exchange Module

SocialSense forms a Bluetooth network among all members of a phone-set. When a

phone has a speaking and mood vector to send, it broadcasts the vector using flooding

over the network. It has an incoming thread and an outgoing thread to handle incoming

and outgoing messages, respectively. It also maintains a message queue: new vectors

to be broadcast are enqueued, and the outgoing thread sends them

one by one from the queue.
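
The queue-plus-outgoing-thread pattern can be sketched with Python's standard `queue` and `threading` modules. The vector fields, the `None` shutdown sentinel, and the use of `list.append` in place of a Bluetooth broadcast are all assumptions made for the example.

```python
import queue
import threading


def run_outgoing(outbox, send):
    """Send queued vectors in FIFO order until a None sentinel arrives."""
    while True:
        vector = outbox.get()
        if vector is None:
            break
        send(vector)


outbox = queue.Queue()
sent = []  # stands in for the Bluetooth broadcast in this sketch
worker = threading.Thread(target=run_outgoing, args=(outbox, sent.append))
worker.start()

# Hypothetical speaking and mood vectors: user ID, start/stop timestamps, mood.
outbox.put({"user": "phone-A", "start": 0.0, "stop": 4.5, "mood": "happy"})
outbox.put({"user": "phone-A", "start": 9.0, "stop": 12.0, "mood": "neutral"})
outbox.put(None)  # ask the outgoing thread to stop
worker.join()
print([v["mood"] for v in sent])
```

Decoupling classification from transmission in this way lets the detection modules enqueue a vector and continue without waiting on the radio.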

3.5 Dynamic Event-Driven Classification Module

For speaker identification, the logistic regression classifier uses a positive training file

to keep training samples from the phone's user, and uses another negative training file

to keep training samples from all other users. During startup of a conversation,

SocialSense broadcasts its local positive training file to all neighbors which they use

for their negative training. If there are 4 phones in the scenario, each phone uses its

local file for positive training and 3 other files received from others for negative train-

ing. The phone-set discovery process triggers every 30 seconds to refresh the phone-

set. If there is a change in the phone-set during a periodic phone-set refresh (an old

user leaves or a new user enters), an event is triggered. When the event triggers, each

phone broadcasts its positive training file over the network and updates its negative

training set using only the files received from phones present in the current scenario, and

then rebuilds the classifier for speaker identification. This improves classification

accuracy by 30% on average compared to static training and keeps the training

process for each user simple, as shown in the evaluation section.
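
A minimal sketch of how one phone might assemble its training data after a phone-set change: its own file supplies the positive class and the files of the phones currently in range supply the negative class. The dictionary layout, the identifiers, and the two-dimensional vectors are hypothetical stand-ins for the exchanged training feature files.

```python
def build_training_set(own_id, training_files, phone_set):
    """Return (samples, labels): own samples are 1, in-range neighbors are 0."""
    samples = list(training_files[own_id])
    labels = [1] * len(samples)
    for phone_id in sorted(phone_set):
        if phone_id == own_id:
            continue
        neighbor = training_files.get(phone_id, [])
        samples.extend(neighbor)
        labels.extend([0] * len(neighbor))
    return samples, labels


# Hypothetical two-dimensional vectors standing in for MFCC feature files.
training_files = {
    "A": [[1.0, 0.9], [0.8, 1.1]],
    "B": [[-1.0, 0.2]],
    "C": [[0.1, -1.0], [0.2, -0.9]],
    "D": [[-0.7, -0.8]],
}

# Before the refresh B is present; afterwards B has left and D has arrived.
_, labels_before = build_training_set("A", training_files, {"A", "B", "C"})
samples_after, labels_after = build_training_set("A", training_files, {"A", "C", "D"})
print(labels_before, labels_after)
```

After the rebuild, samples from the departed phone no longer appear in the negative class, which is the property the dynamic scheme relies on.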

4 Evaluation

The evaluation consists of multiple parts. First, we evaluate how accurate SocialSense

is in identifying speakers. Then we evaluate the effectiveness of mood identification.

We have done these evaluations in quiet and noisy (artificially injected) environments,

showing that training with noise for noisy environments yields a substantial increase in

performance. We have demonstrated the impact of window size, amount of training

data, and dynamic classification for speaker identification. We also compare our

results with some state-of-the-art solutions.

4.1 Speaker Identification Evaluation

We have evaluated the performance of SocialSense's speaker classification module in

terms of the classification accuracy, which is the overall correctness of the model. The

data for these evaluations are taken from voice segments collected from 7 persons.

There were 4 females and 3 males among them. A 1.5 hour long conversation on var-

ious random topics between two of these females was recorded by us. Another 6 con-

versations, each around 5 minutes in length, between a male and a female, were col-

lected from the internet. We collected 3 solo speech recordings from the remaining 3

persons for 10 minutes each. We extracted individual voice recordings from each of

these 7 speakers separately from these recorded conversations and simulated 2, 3, 4

and 5 person conversations from these. We performed all the speaker identification

experiments on these simulated conversations. For example, to simulate 3-person

conversations, we trained the logistic regression classifier with one person as

the positive class and the other two as the negative class, rotating the positive role

across all 3 members, for all 35 possible selections of 3 persons from the set of 7

persons.
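
The count of simulated 3-person configurations can be checked by enumeration. The speaker labels are placeholders; the sketch only reproduces the combinatorics: 35 groups of 3 drawn from 7 speakers, each with 3 choices of positively trained speaker.

```python
from itertools import combinations

speakers = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]  # placeholder names

# Every selection of 3 speakers out of 7.
groups = list(combinations(speakers, 3))

# Within each group, each member takes the positive role once.
configs = [
    (positive, tuple(s for s in group if s != positive))
    for group in groups
    for positive in group
]
print(len(groups), len(configs))  # 35 105
```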

Training Size. Intuitively speaker identification accuracy increases with the increase

of training data, as the classifier can encode more information with a longer training.

This phenomenon is shown in figure 2. The training and testing data were taken from

voice samples collected from 7 persons, with 2, 3, 4 and 5 person simulated conversa-

tions. The accuracy for 3 separate window sizes is shown for training up to 180 se-

conds.

Fig. 2. Speaker identification accuracy vs. (a) Amount of training data; (b) Window size

As we can see from figure 2(a), there is a sharp increase in accuracy between 30s

and 60s of training, and beyond that the accuracy increases slowly. Also the accuracy

is highest for a 5s window size. These values are the lower bounds on the training

data needed to accurately identify the speaker on the phones, i.e. a minimum of 60

seconds of training is required with a minimum of a 5 second window size.

Window Size. This test was conducted on 2 person conversations. One person was

trained as positive while the other was trained as negative for 60 seconds. The win-

dow size was varied from 1 to 8 seconds and each window was classified using the

logistic regression classifier.

Results from figure 2(a) and 2(b) suggest that a window size of 5 to 7 seconds

is optimal for speaker identification. Window sizes of 3-4 seconds yield an accuracy

of 86-89%, which is acceptable. Window sizes of 5-7 seconds can be used for

warm conversations where each speaker talks for a long time before switching turns,

while a 3-4 second window can be used for cold conversations with frequent turn-

takings with short speaking duration in each turn.

Effect of Noise. Noise is a very important and realistic issue to consider to evaluate a

smartphone based speaker identification system. It is very likely that users will move

with their phones to different places (both indoor and outdoor) engaging in social

interactions. Therefore, the system must be able to correctly identify its speaker under

various types of noise.

Evaluation has been done to test the effect of artificial noise on speaker recognition

accuracy. These tests were also done using 2 person conversations collected from 7

speakers. We used Audacity [14], an open-source sound editor, to inject

artificial white and Brown noise into voice samples, and observed classification accu-

racy under different levels of noise. White noise is quite similar to television static or

the humming of an air conditioner and Brown noise is similar to gusty wind. There-

fore, these artificial noises can simulate real indoor and outdoor noisy environments.
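
The noise injection step can be approximated in code as follows. This is a sketch and not Audacity's generator: white noise is drawn uniformly within the given amplitude, Brown noise is modeled as a random walk (integrated white noise) rescaled to the same peak amplitude, and the stand-in speech signal is a sine wave. All function names are hypothetical.

```python
import math
import random


def white_noise(n, amplitude, rng):
    """Uniform white noise bounded by the given peak amplitude."""
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]


def brown_noise(n, amplitude, rng):
    """Random walk (integrated white noise), rescaled to the given peak amplitude."""
    walk, level = [], 0.0
    for _ in range(n):
        level += rng.uniform(-1.0, 1.0)
        walk.append(level)
    peak = max(abs(v) for v in walk) or 1.0
    return [amplitude * v / peak for v in walk]


def inject(signal, noise):
    """Add noise sample by sample and clip the result to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s + z)) for s, z in zip(signal, noise)]


rng = random.Random(0)
speech = [0.5 * math.sin(0.05 * t) for t in range(1000)]  # stand-in for a voice sample
white = white_noise(len(speech), 0.1, rng)   # 0.1 is the maximum amplitude used in the text
brown = brown_noise(len(speech), 0.1, rng)
noisy_white = inject(speech, white)
noisy_brown = inject(speech, brown)
```

Because the random walk accumulates its steps, its energy is concentrated at low frequencies, which is why Brown noise sounds like a rumble while white noise sounds like a hiss.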

Fig. 3. Effect of noise on speaker recognition accuracy: (a) With similar train-test sets; (b) With different train-test sets

Figure 3(a) shows the effect of white and Brown noise on SocialSense speaker
recognition accuracy with similar train and test sets. Without noise, the accuracy is
100%, while at maximum noise it degrades to 82%, a drop of 18 percentage points.
However, because the train and test sets are similar in this case, this is not a
realistic scenario. Figure 3(b) shows the effect of noise with different train and test
sets. Here, the best accuracy without noise is 90.2% and the worst accuracy is only
33%, a drop of about 57 percentage points, which demonstrates how the system will fail
in the presence of noise if no measure is taken.

Fig. 4. Effect of noise on speaker recognition accuracy, with training in noise

Fig. 5. Effect of dynamic training vs. generic training

It is a design characteristic of SocialSense that a phone user can have his own

phone trained for detecting his speaking episodes. This adds a lot of flexibility to the

system. In noisy environments, the user can have his phone trained in noise to en-

hance speaker identification accuracy. Because of the short training session and each

user needing to train only himself (as opposed to other systems where all users need to

train every phone), the training overhead is low.

Figure 4 shows the effect of noise when the training is done in noise as well. Even in
the worst noise conditions, the accuracy drops only to 76% for white noise and 89% for
Brown noise, which is a 43% performance boost for white noise and a 10% boost for Brown
noise relative to training without noise. We limited the noise amplitude to 0.1 in
these experiments, as this level is commensurate with extremely noisy real-life
environments.

Effect of Dynamic Classification. Because of the dynamic event-driven classifica-

tion scheme in SocialSense, every phone is trained with a precise negative training set

comprised of the voices of all other persons present in the social interaction setting.

The phones update their training files by message exchange whenever a new person

enters or leaves the Bluetooth range. Without dynamic classification, every phone

would have to use a generic negative training set consisting of voices from arbitrary

persons, since no a priori knowledge of the users is available.

We used voice samples from 5 persons for precise training, with 1 trained as posi-

tive and other 4 trained as negative. The testing samples had voices from all 5 per-

sons. For generic training, we used a separate voice collection from 3 people (The

EMA dataset [15]) for negative training, and used the same test set as precise training.

The performance comparison of precise and generic training is shown in figure 5,

which shows a significant classification improvement (30% on average) due to dy-

namic training. Consequently, this novel aspect of our solution results in a major per-

formance improvement.

Worst Case Analysis of Dynamic Speaker Classification. The phone-set refresh

process triggers once in every 30 seconds. If there is a change in the phone-set imme-

diately after a refresh process, all the phones will stay with outdated classifiers for 30

seconds in the worst case. There are 2 cases to consider: i) some new phones arriving,

ii) some existing phones leaving. In the first case, if some new phones arrive right

after the refresh process, they will remain unknown to the existing phones for 30 se-

conds until the next refresh process. In this time, the newly arrived phones cannot

send or receive any vectors, so their social interactions will not be logged. Also, dur-

ing this time, the newly arrived phones will have a blank negative set, so all persons'

speaking episodes will be considered as positive in these phones. To avoid classifica-

tion errors due to the initialization in the newly arrived phones, voice segments during

this initialization window are ignored. The second case, where some existing phones

leave right after the refresh process, is less complicated than the first case. In this

case, all the remaining phones will stay with redundant negative sets, containing training

samples from people who are no longer present. However, there will be no classification error

in these phones, unlike the first case. In both cases, the situation returns to normal within

at most 30 seconds, after the next phone-set refresh.

4.2 Mood Detection Evaluation

Because of the difficulty of obtaining real-life data for mood evaluation, we performed
both training and testing on the Electromagnetic Articulography (EMA) dataset [15],
which contains 680 acted utterances of a number of sentences in 4 different emotions
(anger, happiness, sadness, and neutrality) by 3 speakers. We used 3 different
classifiers to model each mood using MFCC features with 9 statistics (351 features in
total): naive Bayes, random forest, and decision tree. We also varied the acoustic
window size from 1 to 10 seconds.

Fig. 6. (a) Effect of audio sample length on emotion recognition accuracy with various classifiers; (b) Confusion matrix for emotion recognition with random forest classifier

The random forest classifier yielded the best 10-fold cross-validation results, as

shown in figure 6(a). This classifier resulted in a 4% to 10% increase in accuracy

compared to the baseline EmotionSense [7] with varying window size, and a 20%

increase for speaker independent model compared to [8]. The results for the baselines

are taken from corresponding existing works. The figure demonstrates that mood

classification accuracy increases with increasing window sizes, however beyond 6s

window size it becomes stable and doesn't change much. The confusion matrix for the

random forest classifier is shown in figure 6(b).
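
Overall accuracy and per-mood recall follow directly from a confusion matrix. The counts below are transcribed from the confusion matrix that accompanies figure 6(b) in this section (rows are the real mood, columns the classified mood); the variable names are illustrative.

```python
MOODS = ["Angry", "Happy", "Neutral", "Sad"]

# Rows: real mood; columns: classified as (same order as MOODS).
CONFUSION = [
    [144, 19, 1, 2],
    [32, 110, 4, 18],
    [4, 11, 125, 16],
    [1, 4, 23, 154],
]

total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(MOODS)))
accuracy = correct / total

# Recall per mood: correctly classified utterances over all utterances of that mood.
recall = {mood: CONFUSION[i][i] / sum(CONFUSION[i]) for i, mood in enumerate(MOODS)}

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.798
for mood in MOODS:
    print(f"{mood}: recall = {recall[mood]:.3f}")
```

The resulting overall accuracy of about 80% agrees with the mood detection figure listed for SocialSense in Table 1, and the per-mood recall shows that happy utterances are the ones most often misclassified, mostly as angry.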

Confusion matrix of Fig. 6(b) (rows: real mood; columns: classified as)

Real mood   Angry   Happy   Neutral   Sad
Angry         144      19         1     2
Happy          32     110         4    18
Neutral         4      11       125    16
Sad             1       4        23   154

Fig. 7. Effect of noise and improvement with training with noise for mood classification

Similar to the speaker identification module, we evaluated the performance of the mood detection module under noise. White noise was found to be more detrimental than brown noise, so we ran this experiment with white noise only. We injected white noise with amplitudes varying from 0.02 to 0.1 into the mood dataset. We trained on mood utterances without noise and tested on utterances with noise.
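Noise injection of this kind can be sketched as follows. This section does not state how the amplitude is defined or which intermediate amplitudes were used. The sketch assumes samples normalized to [-1, 1], interprets the amplitude as the peak value of uniformly distributed white noise, and steps through the range in increments of 0.02. The function name `add_white_noise`, the synthetic tone, and the seed are illustrative.

```python
import math
import random


def add_white_noise(samples, amplitude, seed=None):
    """Return a copy of `samples` with uniform white noise added.

    Assumes float samples normalized to [-1.0, 1.0]; `amplitude` is taken
    as the peak noise value, and results are clipped to the valid range.
    """
    rng = random.Random(seed)
    noisy = []
    for sample in samples:
        noise = rng.uniform(-amplitude, amplitude)
        noisy.append(max(-1.0, min(1.0, sample + noise)))
    return noisy


# One second of a 440 Hz tone at 8 kHz stands in for a clean utterance.
clean = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]

# Assumed amplitude steps between the stated limits of 0.02 and 0.1.
noisy_sets = {
    amplitude: add_white_noise(clean, amplitude, seed=1)
    for amplitude in (0.02, 0.04, 0.06, 0.08, 0.1)
}
```

In the first experiment only the test utterances would pass through such a function; in the training-with-noise variant the training utterances would be processed in the same way.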

As expected, the performance dropped drastically, as shown in figure 7. However, similar to the speaker identification module, we then trained on the mood dataset with noise and performed a 10-fold cross-validation, which yielded a 33% performance boost at the worst-case noise amplitude of 0.1, as shown in figure 7. Existing mood detection systems have not evaluated the possibility of training with noise, which in our case yielded a significant increase in performance.

4.3 Energy

SocialSense consumes energy in two ways: (i) idle listening, and (ii) running various modules and classifiers once a speech episode is identified. Experiments were run to determine the lifetime of tablets and smartphones running SocialSense. The energy cost is lowest when SocialSense is idle listening and there are no speech episodes to process. In this best case, our experiments showed that the Nexus 7 tablet ran for 14 hours and the HTC One smartphone ran for 12 hours. When SocialSense is actively processing speech episodes, 5 modules consume the majority of the energy: (i) acoustic processing and feature extraction, (ii) the logistic regression speaker identification classifier, (iii) the random forest mood detection classifier, (iv) speech and mood vector transmission and reception, and (v) periodic phone-set refresh, training file exchange, and classification model file recreation. In a second set of experiments, we modified the system to run all these modules continuously, as if there were continuous speech. This is the worst case with regard to energy cost. In these experiments the Nexus 7 tablet ran for 12 hours (down from 14 hours) and the HTC One smartphone ran for 10 hours (down from 12 hours). Consequently, SocialSense can operate for 12 to 14 hours on a tablet and for 10 to 12 hours on a smartphone. This demonstrates that SocialSense can indeed be used as a healthcare device in assisted living, since such devices can be charged overnight.
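The two measured endpoints can be combined into a rough estimate for intermediate workloads. This is a back-of-the-envelope sketch rather than a result from the paper: it assumes that the battery drains at one constant rate during idle listening and at another constant rate during speech processing, so that the drain rates (in battery fractions per hour) mix linearly with the fraction of time spent in speech. The function name `estimated_lifetime_hours` and the 25% speech fraction are hypothetical.

```python
def estimated_lifetime_hours(idle_hours, active_hours, speech_fraction):
    """Estimate battery lifetime for a given fraction of time with speech.

    idle_hours:      measured lifetime with idle listening only
    active_hours:    measured lifetime with all modules running continuously
    speech_fraction: assumed share of time spent processing speech (0..1)
    """
    if not 0.0 <= speech_fraction <= 1.0:
        raise ValueError("speech_fraction must be between 0 and 1")
    idle_rate = 1.0 / idle_hours      # battery fraction drained per idle hour
    active_rate = 1.0 / active_hours  # battery fraction drained per active hour
    mixed_rate = (1.0 - speech_fraction) * idle_rate + speech_fraction * active_rate
    return 1.0 / mixed_rate


# Measured endpoints from the experiments: (idle lifetime, worst-case lifetime).
DEVICES = {"Nexus 7": (14.0, 12.0), "HTC One": (12.0, 10.0)}

for name, (idle, active) in DEVICES.items():
    hours = estimated_lifetime_hours(idle, active, speech_fraction=0.25)
    print(f"{name}: about {hours:.1f} h at 25% speech")
```

By construction the estimate equals the measured best case at a speech fraction of 0 and the measured worst case at 1, and lies between them otherwise.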

5 Discussion

SocialSense detects the speaking episodes and mood of a user and, through collaboration, imports the speaking episodes and moods of neighboring users as well. A user interface can be built upon this fine-grained information to show the social interaction history of a user within a particular time frame. Such a user interface would be able to display a user's common conversation partners, his amount of participation and engagement during a conversation with a particular partner, his mood during a conversation (and hence his mood at that time of that particular day), the change in his mood over time or with a change of conversation partner, and so on. Many of these quantities are of interest to psychologists when they treat a potentially depressed patient and ask him relevant questions. Patients' answers are often vague, confusing, and erroneous because they usually do not remember their social interaction history and mood over long periods. SocialSense can eliminate the need for these oral questionnaires, and hence avoid the associated errors, because it logs a user's social interaction data together with his moods. Therefore, this system can be used in places such as assisted living facilities, where depression and related psychological disorders are common among the occupants.
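How such a summary could be derived from logged data is sketched below. The log format (a list of `(time_slot, speaker, mood)` records merged from all phones), the user names, and the function `summarize_interactions` are hypothetical; this section does not specify how the speaking and mood vectors are stored.

```python
from collections import Counter, defaultdict


def summarize_interactions(log, user):
    """Summarize one user's participation, partners, and dominant mood.

    log: iterable of (time_slot, speaker, mood) records (hypothetical format).
    """
    speaking = Counter()
    moods = defaultdict(Counter)
    for _slot, speaker, mood in log:
        speaking[speaker] += 1
        moods[speaker][mood] += 1
    total = sum(speaking.values())
    # Other speakers in the same log, most frequent first.
    partners = [(name, count) for name, count in speaking.most_common()
                if name != user]
    user_moods = moods.get(user)
    return {
        "participation": speaking[user] / total if total else 0.0,
        "partners": partners,
        "dominant_mood": user_moods.most_common(1)[0][0] if user_moods else None,
    }


log = [
    (0, "alice", "happy"), (1, "bob", "neutral"), (2, "alice", "happy"),
    (3, "carol", "sad"), (4, "bob", "neutral"), (5, "alice", "neutral"),
]
summary = summarize_interactions(log, "alice")
print(summary)
```

Grouping the same records by day or by partner before calling the function would give the per-day and per-partner views described above.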

Robustness. Since we argue that SocialSense is usable among the elderly in assisted living facilities, we are aware that the elderly are prone to forgetfulness and may sometimes forget to carry their phones during a social interaction. Although SocialSense is most accurate when every person carries his smartphone, so that everybody's speaking episodes and moods are detected, the system does not break down if this assumption is violated. If a person does not carry his phone during a social interaction, his own speaking and mood episodes will remain undetected and unreported, and others will not have his information for a complete mood contagion analysis. All the other persons' speaking and mood episodes will still be detected and reported correctly. This is a major system design enhancement compared to volume-based systems [4, 13], which fail when one or more persons forget to carry their phones. It is also important to note that an overall diagnosis involves many conversations over multiple days, so some missing information when smartphones are forgotten or turned off does not necessarily cause problems.

Training in Assisted Living. SocialSense's easy-to-use individual training scheme and its adaptability to noisy environments make it well suited for use in assisted living. We have shown in the evaluation section that the system needs only 60 seconds of training to work in any particular environment. Assisted living residents generally spend specific times of their day in specific locations (e.g., mornings in the hall room, noon in the lunch room, afternoons in the garden). The assisted living support person can train the smartphone for each of these common environments. If a resident moves to a new location where the system needs to be retrained because of different noise levels, the support person can easily do the training with 60 seconds of data.

Mood Contagion. Using SocialSense, it is possible to detect not only the mood of an individual user but also the moods of the others present in the social interaction setting. To the best of our knowledge, no system has yet been built that can detect such a global mood. Thus, SocialSense can be used as a platform to conduct and verify experiments on mood contagion, a psychological process by which a group of people engaged in a social interaction reaches emotional convergence, i.e., they all have similar feelings after a certain time even though their initial feelings may differ. It is hypothesized that interventions based on knowledge of mood contagion can be used to help treat depression in the elderly.

"In-Phone" vs. "In-Cloud" Scheme. We adopted an "in-phone" processing scheme

as opposed to "in-cloud" processing as in [17]. The term "in-phone" means that all data acquisition, feature extraction, and classification are performed on the phone itself. A reasonable alternative is an "in-cloud" solution, where unprocessed raw data (conversation recordings) or semi-processed data (features) are sent to a central server where a web service performs further processing and classification. However, the "in-cloud" approach requires Internet connectivity over Wi-Fi or 3G, which is not always available and is sometimes unreliable. To handle unreliable connections, various buffering and upload schemes have to be developed. A high-speed 3G/4G connection also imposes an additional operating cost for each phone. The "in-phone" approach is cheaper, better supports mobility, and can be used even when residents are away from the assisted living facility.

Concurrent Speaking Episodes. In the experiments described above, we assumed that users did not speak concurrently. In reality, speakers do speak concurrently on some occasions, so we also evaluated how the system performs in that situation. Ideally, when two or more users are speaking concurrently, each of their systems should detect its own user's speaking episodes and log them as "speaking" in the individual phone. We performed experiments with 4 speakers (2 male, 2 female), with two concurrent speakers at a time, for all 6 possible pairs. As expected, the system performance degraded. On average, SocialSense was 55% accurate in detecting a particular user's speaking episode when two users were speaking concurrently. While this sounds low, the result applies only to the portion of a speaking episode in which there is actual concurrency, e.g., when two people start speaking at the same time (but then one usually backs off) or when someone interrupts a speaker.
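The count of 6 pairs follows from choosing 2 of the 4 speakers. A small sketch enumerates the pairs; the speaker labels `M1`, `M2`, `F1`, and `F2` are hypothetical placeholders for the two male and two female participants.

```python
from itertools import combinations

# Hypothetical labels for the 2 male and 2 female speakers.
speakers = {"M1": "male", "M2": "male", "F1": "female", "F2": "female"}

# Every unordered pair of distinct speakers: C(4, 2) = 6.
pairs = list(combinations(speakers, 2))

# Pairs made up of one male and one female speaker.
mixed = [pair for pair in pairs if speakers[pair[0]] != speakers[pair[1]]]

print(len(pairs), len(mixed))  # 6 4
for first, second in pairs:
    print(first, second)
```

Of the six pairs, four are mixed-gender and two are same-gender, so an average over all pairs weights mixed-gender overlap more heavily.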

6 Conclusion

This paper presents the design, implementation, and evaluation of SocialSense, a collaborative mobile platform for speaker identification, mood detection, and mood contagion detection from users' voices. Aside from its ability to recognize speaker and mood with significant accuracy, we have demonstrated its performance relative to the amount of training data and the window size, culminating in a benchmarking of the optimal values of these parameters. We provide empirical evidence that SocialSense performs well under various noisy environments when trained with noise, using an easy-to-use training scheme. Also, with a dynamic classification scheme, SocialSense is 30% more accurate in speaker identification than generic training with static classification. SocialSense is 4% to 20% more accurate in speaker-independent mood sensing than the baseline state-of-the-art mood sensing systems. It was also shown that the SocialSense lifetime on various devices is between 10 and 14 hours.

Acknowledgments. This work was supported, in part, by NSF Grants CNS-1319302

and CNS-1239483, and a gift from PARC, Palo Alto. We cordially thank the review-

ers for their insightful comments and suggestions.

References

1. Neumann, R., Strack, F. (2000) Mood Contagion: The automatic transfer of mood between

persons. Journal of Personality and Social Psychology, Vol. 79, No. 2, pp. 211-223.

2. Reynolds, D. A. (1995) Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, Vol. 17, Issue 1-2, pp. 91-108.

3. Reynolds, D. A., Rose, R. C. (1995) Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, Vol. 3, Issue 1, pp. 72-83.

4. Luo, C., Chan, M. C. (2013) SocialWeaver: collaborative inference of human conversation

networks using smartphones. 11th ACM Conference on Embedded Networked Sensor Sys-

tems (SenSys), Roma, Italy.

5. Nakakura, T., Sumi, Y., Nishida, T. (2009) Neary: conversation field detection based on

similarity of auditory situation. 10th Workshop on Mobile Computing Systems and Appli-

cations (HotMobile), Santa Cruz, California, USA.

6. Lu, H., Brush, B., Priyantha, B., Karlson, A. K., Liu, J. (2011) SpeakerSense: Energy effi-

cient unobtrusive speaker identification on mobile phones. IEEE Pervasive Computing and

Communication (PerCom), Seattle, Washington, USA.

7. Rachuri, K., Musolesi, M., Mascolo, C., Rentfrow, P. J., Longworth, C., Aucinas, A.

(2010) EmotionSense: A mobile phone based adaptive platform for experimental social

psychology research. ACM International Joint Conference on Pervasive and Ubiquitous

Computing (UbiComp), Copenhagen, Denmark.

8. Yu, C., Aoki, P. M., Woodruff, A. (2004) Detecting user engagement in everyday conver-

sations. 8th International Conference on Spoken Language Processing, South Korea.

9. Casale, P., Pujol, O., Radeva, P. (2009) Face-to-face social activity detection using data

collected with a wearable device. 4th Iberian Conference on Pattern Recognition, Portugal.

10. Miluzzo, E., Lane, N. D., Fodor, K., Peterson, R., Lu, H., Musolesi, M., Eisenman, S. B.,

Zheng, X., Campbell, A. T. (2008) Sensing meets mobile social networks: the design, im-

plementation and evaluation of the CenceMe application. 6th ACM Conference on Em-

bedded Networked Sensor Systems (SenSys), Raleigh, North Carolina, USA.

11. Choudhury, T. (2004) Sensing and modeling human networks. Ph.D. Thesis, Program in Media Arts and Sciences, Massachusetts Institute of Technology.

12. Chen, D., Yang, J., Malkin, R., Wactlar, H. D. (2007) Detecting social interactions of the

elderly in a nursing home environment. ACM Transactions on Multimedia Computing,

Communications and Applications, Vol. 3, No. 1, pp. 1–22.

13. Li, Q., Chen, S., Stankovic, J. A. (2013) Multi-modal in-person interaction monitoring us-

ing smartphone and on-body sensors, IEEE International Conference on Body Sensor

Networks, Cambridge, MA, USA.

14. Audacity. http://audacity.sourceforge.net/.

15. Kim, J., Lee, S., Narayanan, S. S. (2010) An exploratory study of manifolds of emotional speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, USA.

16. Stefanacci, R. G. (2008) How big an issue is depression in assisted living? Assisted Living Consult, Vol. 4, No. 4, pp. 30-35.

17. Miluzzo, E., Cornelius, C. T., Ramaswamy, A., Choudhury, T., Liu, Z., Campbell, A. T.

(2010) Darwin phones: the evolution of sensing and inference on mobile phones. 8th Inter-

national Conference on Mobile Systems, Applications, and Services (MobiSys), San Fran-

cisco, California, USA.

18. Xu, C. et al. (2013) Crowd++: unsupervised speaker count with smartphones. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), Zurich, Switzerland.
