
A Graphical Model for Online Auditory Scene Modulation Using EEG Evidence for Attention

Marzieh Haghighi, Mohammad Moghadamfalahi, Student Member, IEEE, Murat Akcakaya, Member, IEEE, Barbara G. Shinn-Cunningham, and Deniz Erdogmus, Senior Member, IEEE

Abstract—Recent findings indicate that brain interfaces have the potential to enable attention-guided auditory scene analysis and manipulation in applications such as hearing aids and augmented/virtual environments. Specifically, noninvasively acquired electroencephalography (EEG) signals have been demonstrated to carry some evidence regarding which of multiple synchronous speech waveforms the subject attends to. In this paper we demonstrate that: (1) using data- and model-driven cross-correlation features yields competitive binary auditory attention classification results with at most 20 seconds of EEG from 16 channels, or even a single well-positioned channel; (2) a model calibrated using equal-energy speech waveforms competing for attention can perform well at estimating attention in closed-loop unbalanced-energy speech waveform situations, where the speech amplitudes are modulated by the estimated attention posterior probability distribution; (3) such a model performs even better if it is corrected (linearly, in this instance) for the dependency of the EEG evidence on the speech weights in the mixture; (4) calibrating a model based on population EEG could result in acceptable performance for new individuals/users; therefore EEG-based auditory attention classifiers may generalize across individuals, leading to reduced or eliminated calibration time and effort.

Index Terms—EEG, auditory attention detection, brain interface

I. INTRODUCTION

MOST listeners solve the cocktail party problem (CPP) of attending to one sound in the presence of competing sounds with ease [1], [2]. Healthy adult listeners achieve this by selectively attending to discriminating auditory features of the desired source, such as spectral profile, harmonicity, spatial position, and temporal modulation, in order to differentiate it from other sources in the auditory scene [3], [4], [5], [6]. Listeners with hearing loss have difficulty focusing selective attention [7], leading to a need for hearing aids to assist communication in settings that may exhibit the CPP. Recent research findings indicate that neural activity reveals auditory cortical mechanisms of selective attention in the CPP using spatial [8], [9], [10], [11] and spectral or pitch [12] cues of sound sources. (For more information, please see reviews on this topic [13].)

Cortical responses have been shown to entrain to the temporal envelope of attended speech [14], [15], [16].

M. Haghighi is with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, 02115 USA (e-mail: [email protected]).

M. Moghadamfalahi and D. Erdogmus are with Northeastern University. M. Akcakaya is with the University of Pittsburgh. B. G. Shinn-Cunningham is with Boston University.

These effects can be seen in non-invasive magnetoencephalography (MEG) [17], [18], [19] and electroencephalography (EEG) [15], as well as during invasive electrocorticography (ECoG) [20]. These observations have led to efforts to use recorded cortical measurements to reconstruct an attended speech stimulus [21]. Recent studies have quantified the quality of speech reconstructed from such recordings, and have explored how the temporal properties of cortical responses track those of both attended and unattended speech signals [22], [23], [24], [25], [26].

Differences in cortical responses to attended versus unattended sound sources can be used to develop algorithms that determine which source a listener is attending to in the CPP. For example, using EEG [22], [23], MEG [24], or ECoG [25], a stimulus reconstructed from cortical recordings correlates more strongly with an attended source than with an unattended source. Other approaches for categorizing auditory attention during the CPP have also been proposed and tested. In [27], a biophysically inspired state-space model that tracked auditory sources using 160 MEG neural measurements was able to categorize attentional focus on the order of a few seconds. In [28], three types of discriminant features were used to train and test a linear classifier that aims to identify attended and unattended sound sources using EEG; they achieved high classification accuracy using 20 seconds of 128-channel EEG data.

Previously, we achieved successful classification of attended versus unattended speech in a two-speaker scenario using 60 seconds of single-channel EEG [29] and 20 seconds of 16-channel EEG measurements [30]. In that earlier study, we also tested the idea of attention-estimate-based modulation of source amplitudes in order to assess the possibility of closed-loop auditory scene modulation and estimation of attention. The introduced system performed well in real-time testing, using 20 seconds of data for inference [30].

In this study, we extend our previously introduced EEG-assisted sound source modulation framework in two novel ways: (1) a recursive maximum a posteriori (MAP, i.e., Bayesian with unit cost for errors and zero cost for correct decisions) auditory attention inference procedure based on a generative signal model relating sound envelopes to EEG; in contrast, the previous model used MAP inference with class-conditional likelihoods of features based on EEG/envelope cross-correlation sequences; and (2) a probabilistic model for recursive MAP auditory attention inference extended with the temporal dynamics of attention in a given context, as well as the impact of sound source powers on attention.


These augmentations make the model more appropriate for auditory attention tracking in a closed-loop system where (estimated) source amplitudes may be modulated.

The new signal-modeling approach did not change performance significantly in test cases where conditions were similar to the calibration conditions, but when the auditory scene deviated from the calibration conditions, the model-based approach allowed us to employ a corrective action that resulted in improved performance. The model-based approach also achieved competitive accuracy with a feature vector of lower dimensionality. In addition, the performance of individual-specific and population-based calibration of attention inference models is compared using both cross-correlation and signal-modeling-based feature extraction approaches; the results demonstrate that while individual-specific calibration performs better, subject-to-subject transfer of models is viable, and future models can exploit this observation by using hierarchical models of EEG (sessions < individuals < population) to achieve transfer-learning benefits toward reduced calibration effort.

II. PARTICIPANTS AND DATA ACQUISITION

A. EEG Acquisition and Preprocessing

Ten volunteers (5 male, 5 female), between the ages of 25 and 30 years, with no known history of hearing impairment or neurological problems, participated in this study, which followed a Northeastern IRB-approved protocol. EEG signals were recorded at 256 Hz using a g.USBamp biosignal amplifier with active g.Butterfly electrodes and cap application from g.Tec (Graz, Austria). Sixteen EEG channels (P1, PZ, P2, CP1, CPZ, CP2, CZ, C3, C4, T7, T8, FC3, FC4, F3, F4, and FZ according to the International 10/10 System) were selected to capture auditory-related brain activity over the scalp. Signals were filtered by the built-in analog bandpass ([0.5, 60] Hz) and notch (60 Hz) filters. The acoustic envelopes of the speech stimulus signals were calculated using the Hilbert transform, and then both the EEG measurements and the speech envelopes were filtered by a linear-phase bandpass filter ([1.5, 10] Hz). Then, t_x seconds of EEG and acoustic envelope signals following every stimulus and time-locked to the stimulus onset were extracted. This dataset has been used in our previous work with a different analysis approach [30].
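The following is a minimal preprocessing sketch in Python, assuming the parameters stated above (256 Hz sampling, Hilbert-transform envelopes, [1.5, 10] Hz band-pass); the FIR filter length and the zero-phase (forward-backward) application are illustrative choices, not details taken from the paper.

```python
# Illustrative preprocessing sketch (assumed parameters follow the text:
# 256 Hz sampling, Hilbert envelope, [1.5, 10] Hz linear-phase band-pass).
import numpy as np
from scipy.signal import hilbert, firwin, filtfilt

FS = 256  # Hz, amplifier sampling rate

def speech_envelope(speech):
    """Acoustic envelope via the magnitude of the Hilbert transform."""
    return np.abs(hilbert(speech))

def bandpass(x, fs=FS, lo=1.5, hi=10.0, numtaps=513):
    """Linear-phase FIR band-pass, applied forward-backward (zero phase)."""
    taps = firwin(numtaps, [lo, hi], pass_zero=False, fs=fs)
    return filtfilt(taps, [1.0], x, axis=-1)

# Example: filter a 20-second trial of 16-channel EEG and one speech envelope.
eeg = np.random.randn(16, 20 * FS)                 # placeholder EEG (channels x samples)
env = speech_envelope(np.random.randn(20 * FS))    # placeholder speech waveform
eeg_f, env_f = bandpass(eeg), bandpass(env)
```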

B. Experimental Paradigm

The experimental paradigm is summarized in Figure 1. Each subject completed one calibration session and one online session. Calibration sessions lasted approximately 30 minutes and consisted of 60 trials of 20-second-long stimulus presentations separated by 4-second-long rest periods. In each trial, two speech waveforms were presented simultaneously and diotically (both sounds playing in both ears) as auditory stimuli through earphones. Speech waveforms were selected from audio books of literary novels. One male and one female speaker narrated their stories in each trial for all sessions. The session was divided into two subsessions of 30 trials.

In each subsession, subjects were instructed to direct their attention to one of the speakers. For example, if a subject was requested to direct his/her attention to the male speaker in the first subsession, the attention was on the female speaker in the second subsession. The order of the attended speaker was randomized across subjects. The target speaker was indicated to the subjects by displaying the letter F or M on a monitor during each trial. In all speech waveforms, silent portions longer than 0.2 seconds were truncated to 0.2 seconds. The amplitude of each speech waveform was scaled to yield equal energy in calibration trials.

The online session consisted of 10 sequences, each 2 minutes long. Each sequence contained 6 trials of 20 seconds each; the speech waveform energies were normalized such that each trial had equal energy for both speakers. Before each sequence, the letter F or M was displayed on the monitor to indicate the designated target speech waveform for that sequence. In each sequence, the speech mixture weights were initialized to be equal (0.5 vs 0.5), and they were updated 5 times (after each 20-second-long segment, as in the calibration trials) during 0.5-second-long pauses in which the inference calculations were carried out.


Figure 1. Experimental paradigm visualization: [First Row] Calibration session; [Second Row] Online (quasi-real-time) session. The direction of the arrows indicates the designated attended speaker.

III. PROPOSED METHOD

A. System Framework Overview

The overarching BCI framework used in this work was previously introduced [30], and it consists of three main modules: Digital Signal Processing (DSP), Automatic Gain Control (AGC), and Auditory Attention Inference. The system takes a mixture of sounds from the auditory environment as input, modifies the power of each source sound, and produces a new mixture as output. In anticipated applications, such as hearing aids or augmented/virtual reality, the output of this system would be delivered to the user's ear, closing the loop in online scenarios. The DSP module estimates individual/independent sound sources from the mixture of sounds in the environment (which could involve blind source separation or adaptive interference cancellation in a realistic setting). In this work, we assume that the estimated sources, i.e., the outputs of the DSP module based on blind source separation, are available.


Figure 2. EEG-based Auditory Scene Modulation

The Auditory Attention Inference system estimates the probability of attention on each specific sound source using EEG measurements and the estimated sound sources. The Gain Controller system takes the estimated probabilities from the Attention Inference system and modifies the gain (power) of each specific sound. As emphasized in the introduction, we extend the probabilistic framework for the Attention Inference module in this manuscript. The details of the extended Attention Inference and AGC modules are provided in the following section.

B. Auditory Attention Inference and AGC Modules

We employ the graphical model presented in Figure 3 to build the Auditory Attention Inference module. In this graphical model, A_n represents the unknown attention of the subject during the nth trial; c is the contextual prior defined over the subject's attention; ε_n is the EEG evidence obtained in response to the attended source during the nth trial; and w_n is the weight vector that modulates the sound sources.

The graph illustrated in Figure 3 extends the model presented in earlier work [30]. The previous model assumed that the unknown attention of the subject was the only factor that affected the EEG evidence, and that the attention of the subject depended only on a contextual prior. The current model relaxes this assumption to include the dependence of the attention A_n on the attention at the previous trial, A_{n-1}, and on the weights that modulate the sound sources during the nth trial, w_n. As the weights modulating different sound sources will differ, this difference may affect the attention of a subject; therefore, we think this extension of the graphical model is essential to account for the effect of modulation in intent inference.

Figure 3. Proposed graphical model (nodes: the contextual prior c, attention states A_{n-1} and A_n, modulation weights w_n and w_{n+1}, and EEG evidence ε_n).

Using the graphical model presented in Figure 3, for each trial the conditional posterior distribution of the subject's attention A_n, conditioned on the EEG evidence ε_n, the prior attention A_{n-1}, and the modulation weights w_n, can be computed: p(A_n | A_{n-1}, ε_n, w_n, c). Under certain independence assumptions, this posterior simplifies to

$$p(A_n \mid A_{n-1}, \varepsilon_n, \mathbf{w}_n, c) \propto \frac{p(\varepsilon_n \mid A_n)\, p(A_n \mid \mathbf{w}_n)\, p(A_n \mid A_{n-1}, c)}{p(A_n)} \qquad (1)$$

where

• p(ε_n | A_n) is the conditional distribution of the EEG evidence conditioned on the attended sound source. This conditional distribution is calculated based on EEG evidence features extracted and learned for each sound source (class), as explained in Section III-D.

• p(A_n | w_n) is the conditional distribution of the attention conditioned on the weights that modulate the sound sources. In the decision framework, this conditional probability models how the relative power of each sound source influences the probability of attending to that specific source. In Section IV-C, we describe how we develop an approximate model for this conditional distribution.

• p(A_n | A_{n-1}, c) is the conditional distribution of the attention in the nth trial conditioned on the attention in the previous trial and the context prior. Depending on context, the transition of attention from one state to another can be prompt or slow. In our experiments, we assume that this conditional distribution is uniform due to the lack of any contextual evidence.

• p(A_n) is the prior distribution on the attention of a subject during the nth trial. In our experiments, since the trials are independent of each other, we assume a uniform prior over the attention during each trial.

Next, we use the posterior distribution computed in (1) to calculate the weights for the (n+1)th trial, w_{n+1}, which are used by the AGC module to modulate the sound sources. More specifically, let S_n = (s_{1,n}, ..., s_{i,n}, ..., s_{M,n}) be a matrix containing the original sound sources presented during the nth trial, such that each column corresponds to a different sound source, with M the number of sound sources. Denote Ŝ_n = (ŝ_{1,n}, ..., ŝ_{i,n}, ..., ŝ_{M,n}) as the matrix containing the estimated sound sources, obtained for example after blind source separation. Moreover, w_n = (w_{1,n}, ..., w_{i,n}, ..., w_{M,n}) is the vector of gain control weights, with w_{i,n} the gain of the ith estimated sound source, and e_n is the vector of EEG measurements collected during the nth trial. Note that, according to this notation, A_n = i indicates that the attention is on the ith sound source. Initially, the system presents all sound sources with equal weights; but as e_n is observed and the EEG evidence ε_n is extracted from e_n and Ŝ_n, the posterior distribution of the attention is updated with the observed EEG evidence, and the AGC module updates the weights of the sound sources as a function of this updated posterior distribution. We assume that the updated weight, in general, is a function of past and present weights and attention posterior distributions.
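A minimal sketch of the posterior update in (1) for a two-source scene follows; the likelihood values below are placeholders, and the uniform transition and prior terms reflect the assumptions stated above.

```python
# Minimal sketch of the attention-posterior update in Eq. (1), two sources.
# All densities/likelihood values here are placeholders.
import numpy as np

def attention_posterior(eps_lik, w_lik, trans_lik, prior):
    """p(A_n | A_{n-1}, eps_n, w_n, c) up to normalization.

    eps_lik[i]  : p(eps_n | A_n = i)        (EEG evidence likelihood)
    w_lik[i]    : p(A_n = i | w_n)          (weight-dependent term)
    trans_lik[i]: p(A_n = i | A_{n-1}, c)   (temporal/contextual term)
    prior[i]    : p(A_n = i)
    """
    post = eps_lik * w_lik * trans_lik / prior
    return post / post.sum()

# Example with uniform transition and prior, as assumed in the experiments.
eps_lik = np.array([0.8, 0.3])
uniform = np.array([0.5, 0.5])
print(attention_posterior(eps_lik, uniform, uniform, uniform))
```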


Since, in this paper, we do not focus on developing strategies for how the weights should be controlled, as in previous work [30] the experiments below use weights that are instantaneously obtained by applying a saturating linear function to the current attention posterior distribution:

$$\mathbf{w}_{n+1} \propto p(A_n \mid A_{n-1}, \varepsilon_n, \mathbf{w}_n, c), \qquad (2)$$

with the weights limited to the range [w_min, w_max]:

$$w_{i,n+1} = \begin{cases} w_{\max} & \text{if } p(A_n = i \mid A_{n-1}, \varepsilon_n, \mathbf{w}_n, c) \geq w_{\max}, \\ w_{\min} & \text{if } p(A_n = i \mid A_{n-1}, \varepsilon_n, \mathbf{w}_n, c) \leq w_{\min}, \\ p(A_n = i \mid A_{n-1}, \varepsilon_n, \mathbf{w}_n, c) & \text{otherwise.} \end{cases}$$

These limitations on the weight range were imposed to ensure the audibility of all sources, to enable mistake correction in the event of algorithm/human errors, and to allow shifting attention if desired.
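A small sketch of this AGC rule: the next-trial weights follow the attention posterior, clipped to [w_min, w_max]. The bounds 0.25 and 0.75 are the values reported for the online sessions; the clipping is a hedged reading of the saturating linear function described above.

```python
# Sketch of the AGC rule in Eq. (2): weights track the attention posterior,
# clipped so that every source stays audible.
import numpy as np

W_MIN, W_MAX = 0.25, 0.75   # online-session bounds reported in the paper

def update_weights(posterior, w_min=W_MIN, w_max=W_MAX):
    """Map the attention posterior to next-trial source gains."""
    return np.clip(np.asarray(posterior, dtype=float), w_min, w_max)

print(update_weights([0.9, 0.1]))   # -> [0.75 0.25]
print(update_weights([0.6, 0.4]))   # -> [0.6  0.4 ]
```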

C. Signal Modeling for Feature Extraction

We propose a signal model that treats the EEG as a function of the estimated sound sources. This model is inspired by previous results indicating a high correlation between the EEG and the envelopes of the (attended and unattended) sound sources at various time lags [30]. Specifically, EEG measurements were correlated with both attended and unattended speech envelopes. The correlation patterns were similar, except that the correlation between the EEG and the unattended source was delayed and had a lower amplitude compared to the correlation between the EEG and the attended source. Based on these observations, we suggest the following signal model:

$$e_n(t) = w_{i,n}\, \hat{s}_{i,n} * h^{+}_{i,n}(t) + \sum_{j \neq i} w_{j,n}\, \hat{s}_{j,n} * h^{-}_{j,n}(t) + n(t), \qquad (3)$$

where sound source i is attended while the others, j ≠ i, are unattended; h^-_{j,n}(t) is the impulse response function for the unattended sound source with index j, and h^+_{i,n}(t) is the impulse response function for the attended source. As defined previously, ŝ_{i,n} is the estimated speech envelope of the attended source and ŝ_{j,n} is that of unattended source j. Finally, we assume that n(t) is (temporally) white Gaussian noise with zero mean and covariance E(nn^T) = σ²I.
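An illustrative simulation of the forward model in (3) for two sources follows; the impulse responses and envelopes below are random placeholders, not fitted or recorded data.

```python
# Sketch of the forward model in Eq. (3) for a two-source scene at one channel:
# weighted envelopes convolved with class-specific impulse responses plus
# white Gaussian noise.
import numpy as np

def model_eeg(env_att, env_unatt, w_att, w_unatt, h_plus, h_minus, sigma=0.1):
    """Generate model EEG from weighted, filtered source envelopes."""
    n = len(env_att)
    return (w_att * np.convolve(env_att, h_plus)[:n]
            + w_unatt * np.convolve(env_unatt, h_minus)[:n]
            + sigma * np.random.randn(n))

fs = 256
env_a = np.abs(np.random.randn(20 * fs))          # placeholder attended envelope
env_u = np.abs(np.random.randn(20 * fs))          # placeholder unattended envelope
h_plus, h_minus = np.random.randn(32), np.random.randn(32)   # L = 32 taps (0.125 s)
eeg_hat = model_eeg(env_a, env_u, 0.5, 0.5, h_plus, h_minus)
```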

D. Feature Extraction and Classification

This section describes how to extract the EEG evidence ε using the proposed signal model and the EEG calibration data. Specifically, we discuss how to learn the class (sound-source) conditional EEG evidence distribution p(ε_n | A_n), which is used by the Auditory Attention Inference module to compute the posterior distribution of attention as described in (1). Assuming there are N sources, first, using leave-one-out cross-validation over the calibration data, we learn the impulse responses h^i = [h^-_1, ..., h^+_i, ..., h^-_j, ..., h^-_N] through least-squares estimation, assuming that the ith source is the attended (target) source and the others, j ≠ i, are unattended (distractors). For a two-speaker scenario, h^1 = [h^+_1, h^-_2] and h^2 = [h^-_1, h^+_2] are estimated assuming that speaker 1 and speaker 2 are the attended source, respectively. The details of the estimation are given in Appendix A. These impulse response functions are then used in (3) to estimate the EEG, and the correlation coefficient between the measured (preprocessed/filtered) EEG and the estimated EEG (from the signal model assuming the ith source is the target, using h^i) is calculated for each EEG channel. Accordingly, in a two-speaker scenario, we define x_ch = [ρ^{A=1}_{e_ch, ê_ch}, ρ^{A=2}_{e_ch, ê_ch}]^T as a 2 × 1 vector for each channel, where ρ^{A=i}_{e_ch, ê_ch} for i = 1, 2 is the correlation between the estimated and raw EEG when the ith source is the attended source. Assuming the number of EEG channels is N_c, the final feature vector for each trial, x = (x_1, ..., x_ch, ..., x_{N_c})^T, is a 32 × 1 vector resulting from the concatenation of the features of all EEG channels.

Once the feature vector is extracted for each trial, we use regularized discriminant analysis (RDA) for dimensionality reduction. RDA defines a quadratic projection of the high-dimensional features onto a one-dimensional evidence value. Assuming there are only two sound sources, if the distributions of the features for both classes (EEG features corresponding to the attended and unattended sound sources) were Gaussian, then this projection would be the log-likelihood ratio, which minimizes the Bayes risk. For each EEG feature vector, we denote the RDA projection as s_RDA(x), and use these projected values as the EEG evidence ε. The details of the RDA projection are provided in our previous work [30]. Once the EEG evidence is extracted for class i, that is, when the ith sound source is the target (A = i), we learn p(ε = ε | A = i) using kernel density estimation. We use a Gaussian kernel whose bandwidth is estimated using Silverman's rule of thumb.
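A minimal sketch of this evidence pipeline under the assumptions above: per-channel correlations between measured and model-predicted EEG for each attended-source hypothesis, and class-conditional Gaussian KDEs with Silverman's bandwidth. The one-dimensional RDA projection is omitted here and would sit between these two steps.

```python
# Sketch of the SM evidence pipeline: correlation features per channel and
# class-conditional KDEs (Silverman bandwidth) for p(eps | A = i).
import numpy as np
from scipy.stats import gaussian_kde

def sm_features(eeg, eeg_hat_1, eeg_hat_2):
    """eeg, eeg_hat_*: (channels x samples) arrays; returns a (2*channels,) vector."""
    feats = []
    for ch in range(eeg.shape[0]):
        feats.append(np.corrcoef(eeg[ch], eeg_hat_1[ch])[0, 1])
        feats.append(np.corrcoef(eeg[ch], eeg_hat_2[ch])[0, 1])
    return np.array(feats)

def fit_evidence_densities(evidence, labels):
    """evidence: 1-D array of projected scores; labels: attended class per trial."""
    return {i: gaussian_kde(evidence[labels == i], bw_method="silverman")
            for i in np.unique(labels)}
```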

IV. EXPERIMENTAL RESULTS

All analyses are performed on data from the calibration (source energies equal in each trial) and online (source energies in each trial modulated based on attention inference) sessions. The calibration session data will also be referred to as the equal-energy/weight dataset (DT1), and the online session data as the modulated-energy/weight dataset (DT2). Within the equal-energy or modulated-energy datasets, 5-fold cross-validation is employed to estimate classification performance, which is quantified using the area under the ROC curve (AUC). Classification performance (AUC) obtained by calibrating (training) with the equal-energy dataset and testing on the modulated-energy dataset is also analyzed. In all analyses, both cross-correlation (CC) and signal model (SM) based features are used. For both CC and SM features, the parameters of the extraction method are optimized as explained in Appendix B.
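A sketch of the evaluation protocol follows: 5-fold cross-validated AUC on trial-level feature vectors. The regularized quadratic discriminant used below is a generic stand-in for the paper's RDA-plus-KDE classifier, not the authors' implementation.

```python
# Sketch of 5-fold cross-validated AUC on trial-level features.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, n_splits=5, seed=0):
    """X: (trials x features), y: attended-class labels. Returns mean AUC."""
    aucs = []
    cv = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    for tr, te in cv.split(X, y):
        clf = QuadraticDiscriminantAnalysis(reg_param=0.1).fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
    return float(np.mean(aucs))
```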

A. Single-Channel Classifier is Competitive with the 16-Channel Classifier.

Using the selected order of h, the SM feature vector x_ch is formed as described in Section III-D. Using these EEG-channel-specific features individually, attended-source posteriors are evaluated. Figure 4 (a) shows the median 5-fold cross-validation AUC (across 10 participants) for each EEG channel in the form of a topographic map for the equal-energy dataset (dataset 1). Figure 4 (b) shows the median AUC obtained by learning the model on the equal-energy dataset (dataset 1) and testing it on the modulated-energy dataset (dataset 2).



Figure 4. Topographic map of classification performance over the scalp for classifying attended versus unattended speakers; median values over all participants. (a) Cross-validation on dataset 1 (the equal-weights dataset) using SM and CC features. (b) Model trained on dataset 1 and validated on dataset 2 (the variable-weights dataset) using SM and CC features.


To complement these channel-specific results, Table I shows the 5-fold cross-validation AUC for CC and SM features on the equal-energy dataset (DT1) when using the best channel and when using all 16 channels. This table indicates that while the best scalp locations for EEG acquisition may be subject-specific, a small number of well-positioned electrodes can be competitive with many channels, which is a significant practical consideration to be explored further.

Table I
AUC FOR CLASSIFICATION USING THE SINGLE BEST CHANNEL VERSUS ALL 16 CHANNELS FOR EACH SUBJECT.

Participant | Best (CC)  | All (CC) | Best (SM)  | All (SM)
1           | Fz: 0.94   | 0.92     | F3: 0.92   | 0.93
2           | C3: 0.93   | 0.89     | C3: 0.92   | 0.91
3           | C4: 0.99   | 0.97     | C4: 0.99   | 0.97
4           | FC3: 0.87  | 0.87     | FC4: 0.81  | 0.92
5           | F4: 0.88   | 0.93     | C4: 0.71   | 0.64
6           | T7: 0.92   | 0.88     | Cz: 0.88   | 0.87
7           | C3: 0.84   | 0.84     | T7: 0.82   | 0.84
8           | C4: 0.90   | 0.88     | F4: 0.88   | 0.92
9           | CPz: 0.95  | 0.94     | F4: 0.94   | 0.93
10          | C3: 0.89   | 0.84     | C3: 0.82   | 0.93

B. AUC Improves with Increasing Trial Length

Different lengths of EEG are used to infer the attended speech source, varying the length of the EEG segment from 2 seconds to 20 seconds. Figure 5 shows the median classifier AUC (across 10 participants) when using all 16 channels, with both CC and SM features; shaded areas are 90% confidence intervals. As expected, accuracy increases when longer EEG evidence windows (more samples) are used for inference.

Figure 5. Attended speech classifier AUC increases with trial length using both CC and SM features, according to 5-fold cross-validation in the (top) equal- and (middle) modulated-energy datasets, as well as (bottom) when calibrating with equal-energy trials and testing on modulated-energy trials. Median curves and 90% confidence intervals are shown.

C. Compensating for Sound Source Power Variation Improves Classification Accuracy

The assumption that EEG-evidence statistics are invariant under changes in the relative power of the input sources is not realistic. In calibration, the energies of both speech sources were kept equal in each trial. However, in online sequences where source amplitudes are modulated based on the attention inference posterior, the relative energies of the sources deviate from being equal in each trial. When a classifier calibrated on the equal-energy dataset is tested on the online dataset, in which trial energies are modulated by weights controlled by the inference posteriors, the EEG feature distribution drifts away from the calibration model and, as expected, performance drops. One could design a calibration session that explores various source energy levels across trials to sample this parameter space sufficiently, but in practice this may lead to prohibitively long calibration sessions and would be undesirable. A model-based approach enables corrections for these effects to be incorporated into feature extraction, so that calibration procedures can remain reasonable in terms of time and effort while performance in online use is maintained.

Figure 6 shows, with data pooled from the online (modulated-energy) datasets of the population of 10 participants, the probability of attending to the male or female speaker as estimated by the classifier (using SM or CC features) versus the modulation weight for the corresponding trial. The averages of these probabilities in weight bins, along with the best line fits, indicate that the classifier's estimates of the attention probabilities depend on source energy. Accounting for this effect of modulation weights on the classifier's assessment of the EEG is expected to improve performance.

As a simple first-order approximation, using p(A_n = i | w_n) = a w_{i,n} + b derived from the linear least-squares fits indicated above, the dependence of attention on source weights is approximately corrected in this preliminary study using p(ε_n | A_n) p(A_n | w_n). Clearly, this is not a proper posterior, since the likelihood of attention given the weight is not a proper probability distribution function; further modeling effort must be put into this component in the future. With this simple corrective action, classification AUC improves in a statistically significant fashion, as demonstrated in Figure 7.
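A sketch of this first-order correction under the assumptions above: a line is fitted to the binned (weight, estimated attention probability) averages, and the resulting linear term multiplies the EEG evidence likelihood. The coefficients shown are placeholders, not the fitted values.

```python
# Sketch of the linear weight correction p(A_n = i | w_n) ~ a * w_{i,n} + b.
import numpy as np

def fit_weight_term(weights, attend_probs):
    """Least-squares line through (weight, estimated attention probability) pairs."""
    a, b = np.polyfit(weights, attend_probs, deg=1)
    return a, b

def corrected_score(eps_lik, w, a, b):
    """Unnormalized corrected score p(eps | A = i) * (a*w_i + b) per source."""
    return eps_lik * np.clip(a * np.asarray(w) + b, 1e-6, None)

# Placeholder binned averages (not the paper's fitted values).
a, b = fit_weight_term(np.array([0.25, 0.5, 0.75]), np.array([0.35, 0.5, 0.65]))
print(corrected_score(np.array([0.8, 0.3]), [0.6, 0.4], a, b))
```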



Figure 6. For SM (top) and CC (bottom) features, the (estimated) probability of attending to the male speaker (left) and the female speaker (right) versus the modulation weight of the respective speaker in each trial, shown as a (green) scatter plot. The concentration of weights at 0.25, 0.5, and 0.75 is due to the initial weights being equal at 0.5 and the lower and upper bounds of the weights being set to w_min = 0.25 and w_max = 0.75, respectively. The (blue) stars indicate average probabilities in each of eleven weight intervals, while the (red) lines indicate the best linear least-squares fits to these averages.


D. Population Classifier is Competitive with Individual-Specific Classifiers.

Subject-to-subject generalization of classification accuracy is a desirable feature in brain interfaces, as it would allow pooling data from multiple users to calibrate a population classifier, or using such a population model as the prior for an individual classifier, in order to eliminate or reduce calibration time for a new user. Using a leave-one-subject-out cross-validation scheme, the performance of a population classifier is tested. Training the classifier using calibration (equal-energy) data from all subjects but one, and evaluating it on the equal-energy data from the left-out subject, reveals that such subject-to-subject generalization is feasible. Validation AUCs of population classifiers obtained by leaving each subject out during calibration are reported in Table II. In contrast, as an estimate of individual-specific classifier performance, 5-fold cross-validation AUCs for classifiers calibrated on the equal-weight dataset of each participant are evaluated and included in the table. The process is repeated with SM and CC features for comparison.


Figure 7. AUC and 90% confidence intervals with and without the linear approximate p(A_n | w_n) correction for both SM and CC features. The numbers above the two bars for each subject indicate the p-value for the null hypothesis that the AUC after correction is less than or equal to the AUC before correction. For SM features, the linear corrective term improves AUC in a statistically significant fashion for almost all subjects, while for CC features this is not the case. This result indicates that the signal-model-based approach may be improved further and generalize with appropriate corrective terms superimposed on calibration models with limited source-energy variability.

Table II
PERFORMANCE (AUC) OF THE POPULATION CLASSIFIER VERSUS THE INDIVIDUAL CLASSIFIER ON THE EQUAL-WEIGHTS DATASET (DATASET 1).

Participant                  | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10
Population classifier (SM)   | 0.86 | 0.78 | 0.97 | 0.85 | 0.72 | 0.84 | 0.78 | 0.86 | 0.68 | 0.81
Individual classifier (SM)   | 0.93 | 0.91 | 0.97 | 0.92 | 0.64 | 0.87 | 0.84 | 0.92 | 0.93 | 0.93
Population classifier (CC)   | 0.93 | 0.94 | 0.95 | 0.88 | 0.85 | 0.85 | 0.87 | 0.89 | 0.86 | 0.90
Individual classifier (CC)   | 0.92 | 0.89 | 0.97 | 0.87 | 0.93 | 0.88 | 0.84 | 0.88 | 0.94 | 0.84

While we acknowledge that this process focused only on equal-weight calibration datasets, these results could be considered an upper bound on performance expectations when models are calibrated on equal-energy trials and tested online with modulated-energy trials, based on the previously discussed results. Qualitatively, the results indicate that population data is representative of individual data, and a hierarchical model that considers the individual as an instance of the population could lead to a calibration procedure that allows effective evidence pooling from multiple users to reduce calibration time or improve calibration quality for a particular individual.
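A sketch of the leave-one-subject-out evaluation of the population classifier follows; the classifier is again a generic regularized-discriminant stand-in for the paper's pipeline.

```python
# Sketch of leave-one-subject-out evaluation of a population classifier.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

def loso_auc(subject_data):
    """subject_data: list of (X, y) tuples, one per subject. Returns per-subject AUCs."""
    aucs = []
    for held_out in range(len(subject_data)):
        X_tr = np.vstack([X for i, (X, _) in enumerate(subject_data) if i != held_out])
        y_tr = np.concatenate([y for i, (_, y) in enumerate(subject_data) if i != held_out])
        X_te, y_te = subject_data[held_out]
        clf = QuadraticDiscriminantAnalysis(reg_param=0.1).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return aucs
```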

V. CONCLUSION

EEG has been demonstrated to carry useful evidence regarding auditory attention in the presence of multiple competing speech waveforms. A data-driven cross-correlation feature and a signal-model-based feature are considered in two scenarios. The results demonstrate that dense electrode arrays as used in previous reports are not necessary; competitive binary auditory attention classification results can be obtained with at most 20 seconds of EEG from 16 channels, or even a single well-positioned channel.


It is also shown that a model calibrated using equal-energy speech waveforms can perform well in closed-loop unbalanced-energy speech waveform conditions, where the speech amplitudes are modulated by the estimated attention posterior probability distribution. Further analysis demonstrates that such a model performs even better if it is corrected (in this case, linearly) to account for the dependency of the EEG evidence on the speech weights in the mixture. Finally, the results indicate that calibrating a model based on population EEG could yield acceptable performance for new individuals/users; more interestingly, population data can be pooled to form a prior model for individual classifiers, thereby reducing calibration time significantly for new users. The results presented in this paper contribute to the field of auditory-attention-driven manipulation of auditory scenes in hearing aid and virtual/augmented reality applications.

APPENDIX A
SIGNAL MODEL DERIVATIONS

In this appendix we describe the estimation method for the model parameter h defined in equation (3). Define L as the order (number of taps) of the signal model, N as the number of time samples in each trial (both for the sound-source envelopes and the EEG signal), e as the EEG time samples, and s^1 and s^2 as the sound-source envelope time samples for sources 1 and 2, respectively. Then, according to our model, we can write

$$
\underbrace{\begin{bmatrix} e_{L+1} \\ e_{L+2} \\ \vdots \\ e_{N} \end{bmatrix}}_{\mathbf{e}}
=
\underbrace{\begin{bmatrix} s^1_1 & s^1_2 & \cdots & s^1_L \\ s^1_2 & s^1_3 & \cdots & s^1_{L+1} \\ \vdots & & & \vdots \\ s^1_{N-L} & \cdots & \cdots & s^1_N \end{bmatrix}}_{S^1}
\underbrace{\begin{bmatrix} h^1_1 \\ h^1_2 \\ \vdots \\ h^1_L \end{bmatrix}}_{\mathbf{h}^1}
+
\underbrace{\begin{bmatrix} s^2_1 & s^2_2 & \cdots & s^2_L \\ s^2_2 & s^2_3 & \cdots & s^2_{L+1} \\ \vdots & & & \vdots \\ s^2_{N-L} & \cdots & \cdots & s^2_N \end{bmatrix}}_{S^2}
\underbrace{\begin{bmatrix} h^2_1 \\ h^2_2 \\ \vdots \\ h^2_L \end{bmatrix}}_{\mathbf{h}^2}
+
\underbrace{\begin{bmatrix} n_{L+1} \\ n_{L+2} \\ \vdots \\ n_{N} \end{bmatrix}}_{\mathbf{n}}
$$

Then, the model for the nth trial is $\mathbf{e}_n = C_n \mathbf{h} + \mathbf{n}_n$, where $C_n = \begin{bmatrix} S^1_n & S^2_n \end{bmatrix}$ and $\mathbf{h} = \begin{bmatrix} \mathbf{h}^1 \\ \mathbf{h}^2 \end{bmatrix}$.

Afterward, we define the following optimization problem to estimate the parameter values that minimize the sum of squared estimation errors over all trials:

$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}} \sum_n \left\| \mathbf{e}_n - C_n \mathbf{h} \right\|_2^2$$

The solution to our convex optimization problem is

$$\hat{\mathbf{h}} = \left(\sum_n C_n^\top C_n\right)^{-1} \sum_n C_n^\top \mathbf{e}_n.$$

In our model, the parameter values depend on the target sound source; hence, we solve the above optimization problem twice to estimate h: (1) using trials in which speaker 1 is the target (h^1) and (2) using trials in which speaker 2 is the target (h^2).
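A compact sketch of this pooled least-squares estimate for a single EEG channel in the two-speaker case follows; trial data are assumed to be provided as (EEG, envelope-1, envelope-2) arrays of equal length.

```python
# Sketch of the pooled least-squares fit in Appendix A (single channel, two
# speakers): build the lagged-envelope design matrix C_n = [S1_n, S2_n] per
# trial and solve the accumulated normal equations for h = [h1; h2].
# L = 32 taps matches the 0.125 s impulse-response length chosen in Appendix B.
import numpy as np

def lag_matrix(env, L):
    """Rows are length-L sliding windows of the envelope (N - L rows)."""
    N = len(env)
    return np.stack([env[k:k + L] for k in range(N - L)])

def estimate_h(trials, L=32):
    """trials: list of (eeg, env1, env2) 1-D arrays of equal length per trial."""
    A = np.zeros((2 * L, 2 * L))
    b = np.zeros(2 * L)
    for eeg, env1, env2 in trials:
        C = np.hstack([lag_matrix(env1, L), lag_matrix(env2, L)])
        e = eeg[L:]                      # targets aligned with the lagged rows
        A += C.T @ C
        b += C.T @ e
    h = np.linalg.solve(A, b)
    return h[:L], h[L:]                  # (h1, h2)
```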

APPENDIX B
MODEL ORDER SELECTION

The model order L of the proposed model in Section III-C needs to be optimized for best performance. Moreover, we need to determine the optimal time-window duration [0, τ_max] for the CC features. In this appendix we present the effect of these parameters on classification performance, using cross-validation, in Figure 8. We believe that the comparison of the different methods in this manuscript is fair because the presented results are obtained at the optimal values of these parameters for each method. More specifically, we selected the parameter values for which the average of the AUCs over the three types of analysis shown in Figure 8 is maximized.

In Figure 8, red curves show the median of cross-validation performance on dataset 1 over all participants; shaded areas around each curve show the corresponding standard error. Green curves show the median of cross-validation performance on dataset 2 over all participants, and blue curves show the result of applying the model learned on dataset 1 to dataset 2.

Figure 8. Median classification performance (AUC) over participants versus (a) the impulse response length of the SM model (h length, in seconds, or L samples), and (b) τ_max in seconds for extracting CC features in the [0, τ_max] range.

Based on these criteria, the impulse response length is set to 0.125 seconds (L = 32) and τ_max = 0.275 seconds.
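A sketch of this selection rule follows: candidate values are scored by the average AUC over the three analyses, with the evaluation functions left as placeholders standing in for the full pipelines.

```python
# Sketch of the parameter-selection rule: pick the model order L (or tau_max)
# that maximizes the average AUC over the three analyses (CV on DT1, CV on DT2,
# train on DT1 / test on DT2).
import numpy as np

def select_model_order(candidates, eval_cv_dt1, eval_cv_dt2, eval_dt1_to_dt2):
    """Each eval_* callable maps a candidate parameter value to a median AUC."""
    scores = {L: np.mean([eval_cv_dt1(L), eval_cv_dt2(L), eval_dt1_to_dt2(L)])
              for L in candidates}
    return max(scores, key=scores.get)

# e.g. select_model_order([8, 16, 32, 64], f1, f2, f3) with the paper's data
# would return 32 (0.125 s at 256 Hz), per the criterion above.
```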

ACKNOWLEDGMENT

This work is supported by NSF (CNS-1136027, IIS-1149570, CNS-1544895), NIDLRR (90RE5017-02-01), and NIH (R01DC009834).

REFERENCES

[1] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.

[2] A. W. Bronkhorst, "The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions," Acta Acustica united with Acustica, vol. 86, no. 1, pp. 117–128, 2000.

[3] W. A. Yost, "The cocktail party problem: Forty years later," Binaural and Spatial Hearing in Real and Virtual Environments, pp. 329–347, 1997.

[4] W. A. Yost and S. Sheft, "Auditory perception," in Human Psychophysics. Springer, 1993, pp. 193–236.

[5] A. W. Bronkhorst, "The cocktail-party problem revisited: early processing and selection of multi-talker speech," Attention, Perception, & Psychophysics, vol. 77, no. 5, pp. 1465–1487, 2015.

[6] B. G. Shinn-Cunningham, "Object-based auditory and visual attention," Trends in Cognitive Sciences, vol. 12, no. 5, pp. 182–186, 2008.


[7] B. G. Shinn-Cunningham and V. Best, "Selective attention in normal and impaired hearing," Trends in Amplification, 2008.

[8] L. Dai and B. G. Shinn-Cunningham, "Contributions of sensory coding and attentional control to individual differences in performance in spatial auditory selective attention tasks," Frontiers in Human Neuroscience, vol. 10, 2016.

[9] A. K. Lee, S. Rajaram, J. Xia, H. Bharadwaj, E. Larson, M. Hamalainen, and B. G. Shinn-Cunningham, "Auditory selective attention reveals preparatory activity in different cortical regions for selection based on source location and source pitch," Frontiers in Neuroscience, vol. 6, p. 190, 2013.

[10] J. C. Middlebrooks and P. Bremen, "Spatial stream segregation by auditory cortical neurons," The Journal of Neuroscience, vol. 33, no. 27, pp. 10986–11001, 2013.

[11] J. Dong, H. S. Colburn, and K. Sen, "Cortical transformation of spatial processing for solving the cocktail party problem: a computational model," eNeuro, vol. 3, no. 1, pp. ENEURO–0086, 2016.

[12] K. T. Hill and L. M. Miller, "Auditory attentional control and selection during cocktail party listening," Cerebral Cortex, p. bhp124, 2009.

[13] A. K. Lee, E. Larson, R. K. Maddox, and B. G. Shinn-Cunningham, "Using neuroimaging to understand the cortical mechanisms of auditory selective attention," Hearing Research, vol. 307, pp. 111–120, 2014.

[14] N. Ding and J. Z. Simon, "Cortical entrainment to continuous speech: functional roles and interpretations," Frontiers in Human Neuroscience, vol. 8, 2014.

[15] S. J. Aiken and T. W. Picton, "Human cortical responses to the speech envelope," Ear and Hearing, vol. 29, no. 2, pp. 139–157, 2008.

[16] Y.-Y. Kong, A. Mullangi, and N. Ding, "Differential modulation of auditory responses to attended and unattended speech in different listening conditions," Hearing Research, vol. 316, pp. 73–81, 2014.

[17] E. Ahissar, S. Nagarajan, M. Ahissar, A. Protopapas, H. Mahncke, and M. M. Merzenich, "Speech comprehension is correlated with temporal response patterns recorded from auditory cortex," Proceedings of the National Academy of Sciences, vol. 98, no. 23, pp. 13367–13372, 2001.

[18] H. Luo and D. Poeppel, "Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex," Neuron, vol. 54, no. 6, pp. 1001–1010, 2007.

[19] D. A. Abrams, T. Nicol, S. Zecker, and N. Kraus, "Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech," The Journal of Neuroscience, vol. 28, no. 15, pp. 3958–3965, 2008.

[20] K. V. Nourski, R. A. Reale, H. Oya, H. Kawasaki, C. K. Kovach, H. Chen, M. A. Howard, and J. F. Brugge, "Temporal envelope of time-compressed speech represented in the human auditory cortex," The Journal of Neuroscience, vol. 29, no. 49, pp. 15564–15574, 2009.

[21] B. N. Pasley, S. V. David, N. Mesgarani, A. Flinker, S. A. Shamma, N. E. Crone, R. T. Knight, and E. F. Chang, "Reconstructing speech from human auditory cortex," PLoS Biology, vol. 10, no. 1, p. e1001251, 2012.

[22] J. A. O'Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, "Attentional selection in a cocktail party environment can be decoded from single-trial EEG," Cerebral Cortex, p. bht355, 2014.

[23] B. Mirkovic, S. Debener, M. Jaeger, and M. De Vos, "Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications," Journal of Neural Engineering, vol. 12, no. 4, p. 046007, 2015.

[24] N. Ding and J. Z. Simon, "Neural coding of continuous speech in auditory cortex during monaural and dichotic listening," Journal of Neurophysiology, vol. 107, no. 1, pp. 78–89, 2012.

[25] N. Mesgarani and E. F. Chang, "Selective cortical representation of attended speaker in multi-talker speech perception," Nature, vol. 485, no. 7397, pp. 233–236, 2012.

[26] W. Biesmans, N. Das, T. Francart, and A. Bertrand, "Auditory-inspired speech envelope extraction methods for improved EEG-based auditory attention detection in a cocktail party scenario," 2016.

[27] S. Akram, A. Presacco, J. Z. Simon, S. A. Shamma, and B. Babadi, "Robust decoding of selective auditory attention from MEG in a competing-speaker environment via state-space modeling," NeuroImage, vol. 124, pp. 906–917, 2016.

[28] C. Horton, R. Srinivasan, and M. D'Zmura, "Envelope responses in single-trial EEG indicate attended speaker in a cocktail party," Journal of Neural Engineering, vol. 11, no. 4, p. 046015, 2014.

[29] M. Haghighi, M. Moghadamfalahi, H. Nezamfar, M. Akcakaya, and D. Erdogmus, "Toward a brain interface for tracking attended auditory sources," in Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on. IEEE, 2016, pp. 1–5.

[30] M. Haghighi, M. Moghadamfalahi, M. Akcakaya, and D. Erdogmus, "EEG-assisted Modulation of Sound Sources in the Auditory Scene," ArXiv e-prints, Nov. 2016.

Marzieh Haghighi received the B.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2011. She is currently a Ph.D. student in the Electrical and Computer Engineering Department of Northeastern University and a research assistant in the Cognitive Systems Laboratory, Boston, MA, USA. Her research interests include brain-computer interfaces, machine learning, and statistical signal processing for biomedical and neural applications.

Mohammad Moghadamfalahi (S'14) received the Ph.D. degree in electrical engineering from Northeastern University, Boston, MA, USA, in 2016, and the B.Sc. degree in electrical engineering from Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran, in 2008. He currently works at Honeywell Laboratories as a Research and Development Engineer. His current research interests are signal processing and machine learning for neural and aerospace applications.

Murat Akcakaya (S'07-M'12) received the B.Sc. degree from the Electrical and Electronics Engineering Department of Middle East Technical University, Ankara, Turkey, in 2005, and the M.Sc. and Ph.D. degrees in electrical engineering from Washington University in St. Louis, MO, in May and December 2010, respectively. He is currently an Assistant Professor in the Electrical and Computer Engineering Department of the University of Pittsburgh. His research interests are in the area of statistical signal processing and machine learning with applications in non-invasive brain-computer interface design. Dr. Akcakaya was a winner of the student paper contest awards at the 2010 IEEE Radar Conference, the 2010 IEEE Waveform Diversity and Design Conference, and the 2010 Asilomar Conference on Signals, Systems and Computers.

Barbara Shinn-Cunningham received the Sc.B. in Electrical Engineering from Brown University. She earned both her master's degree and Ph.D. in Electrical and Computer Engineering from the Massachusetts Institute of Technology. She worked at Bell Communications Research, MIT Lincoln Laboratory, and Sensimetrics before joining the faculty at BU. She is an auditory neuroscientist best known for her work on attention and the cocktail party problem, sound localization, and the effects of room acoustics and reverberation on hearing.

Deniz Erdogmus received B.S. degrees in EE and Mathematics in 1997, and an M.S. in EE in 1999, from the Middle East Technical University in Turkey. He received his Ph.D. in ECE from the University of Florida in 2002, where he stayed as a postdoctoral research associate until 2004. He was an Assistant Professor of Biomedical Engineering at the Oregon Health and Science University until 2008. He then joined Northeastern University, where he is currently an Associate Professor in the Electrical and Computer Engineering Department. His research focuses on statistical signal processing and machine learning for contextual signal/image/data analysis, with applications in cyberhuman systems including brain-computer interfaces and technologies that collaboratively improve human performance. He has served as an associate editor and program committee member for a number of journals and conferences in these areas, including IEEE Signal Processing Letters and the IEEE Transactions on Signal Processing, Biomedical Engineering, and Neural Networks.