Neural Encoding of Speech in Auditory Cortex
Transcript of Neural Encoding of Speech in Auditory Cortex
Neural Encoding of Speech in Auditory Cortex
Jonathan Z. SimonDepartment of BiologyDepartment of Electrical & Computer EngineeringInstitute for Systems Research
University of Maryland
University College London, 22 June 2015http://www.isr.umd.edu/Labs/CSSL/simonlab
AcknowledgementsCurrent (Simon Lab & Affiliates)
Francisco CervantesNatalia LapinskayaMahshid NajafiAlex PresaccoKrishna PuvvadaLisa UiblePeng Zan
Past (Simon Lab & Affiliate Labs)Nayef AhmarSahar AkramMurat AytekinClaudia BoninMaria ChaitMarisel Villafane Delgado Kim DrnecNai Ding Victor Grau-SerratJulian JenkinsDavid KleinLing Ma
Kai Sum LiHuan Luo Raul RodriguezBen WalshJuanjuan XiangJiachen Zhuo
CollaboratorsPamela AbshireSamira AndersonBehtash BabadiCatherine CarrMonita ChatterjeeAlain de CheveignéDidier DepireuxMounya ElhilaliBernhard EnglitzJonathan FritzCindy MossDavid PoeppelShihab Shamma
Funding NIH (NIDCD, NIA, NIBIB); USDA
Past Postdocs & Visitors Aline Gesualdi Manhães Dan HertzYadong Wang
Undergraduate Students Abdulaziz Al-Turki Nicholas AsendorfSonja BohrElizabeth CamengaCorinne CameronJulien DagenaisKatya DombrowskiKevin HoganKevin KahnAlexandria MillerIsidora RanovadovicAndrea ShomeMadeleine VarmerBen Walsh
Outline• Magnetoencephalography (MEG)
• Cortical Representations of Speech
- Encoding vs. Decoding
- Attended vs. Unattended Speech
• Work in Progress
- Attentional Dynamics
- Aging and the Cocktail Party Problem
- Foreground vs. Background
Magnetoencephalography• Non-invasive, Passive, Silent
Neural Recordings
• Simultaneous Whole-Head Recording (~200 sensors)
• Sensitivity• high: ~100 fT (10–13 Tesla)• low: ~104 – ~106 neurons
• Temporal Resolution: ~1 ms
• Spatial Resolution• coarse: ~1 cm• ambiguous
Neural Signals & MEG
tissue
CSF
skull
scalpB
MEG
VEEG
recordingsurface
currentflow
orientationof magneticfield
MagneticDipolarField
Projection
•Direct electrophysiological measurement•not hemodynamic•real-time
•No unique solution for distributed source
Photo by Fritz Goro
•Measures spatially synchronized cortical activity
•Fine temporal resolution (~ 1 ms)•Moderate spatial resolution (~ 1 cm)
Time Course of MEG Responses
BroadbandNoise
Auditory Evoked Responses
• MEG Response Patterns Time-Locked to Stimulus Events
• Robust
• Strongly Lateralized
Pure Tone
Component Analysis• Each component has both
spatial and temporal profile
• Data driven, e.g., PCA, ICA, DSS
• DSS: ordered by trial-to-trial reproducibility
• ➔ Spatial Filter, e.g. for single trials
• Can analyze temporal processing separately from anatomical origin Särelä & Valpola (2005)
de Cheveigné & Simon, J. Neurosci. Methods (2008)
MEG Responses
AuditoryModel
to Speech Modulations
Ding & Simon, J Neurophysiol (2012) “Spectro-Temporal Response Function”
(up to ~10 Hz)
MEG Responses Predicted by STRF Model
Linear Kernel = STRF
Ding & Simon, J Neurophysiol (2012)Zion-Golumbic et al., Neuron (2013)
Neural Reconstruction of Speech Envelope
2 s
stimulus speech envelopereconstructed stimulus speech envelope
Reconstruction accuracy comparable to single unit & ECoG recordings
(up to ~ 10 Hz)
MEG Responses
...
DecoderSpeech Envelope
Ding & Simon, J Neurophysiol (2012)Zion-Golumbic et al., Neuron (2013)
Neural Reconstruction of Speech Envelope
2 s
stimulus speech envelopereconstructed stimulus speech envelope
Reconstruction accuracy comparable to single unit & ECoG recordings
(up to ~ 10 Hz)
MEG Responses
...
DecoderSpeech Envelope
Ding & Simon, J Neurophysiol (2012)Zion-Golumbic et al., Neuron (2013)
Neural Reconstruction of Speech Envelope
2 s
stimulus speech envelopereconstructed stimulus speech envelope
Reconstruction accuracy comparable to single unit & ECoG recordings
(up to ~ 10 Hz)
MEG Responses
...
DecoderSpeech Envelope
Neural Representation of Speech: Temporal
Speech in Noise
Ding & Simon, J Neuroscience (2013)
Speech in Noise
Ding & Simon, J Neuroscience (2013)
Speech in Noise: Results
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
B Reconstruction Accuracy
corre
latio
n
SNR (dB)Q +6 +2 −3 −6 −9
0
.1
.2
0 25 50 75 1000
.1
.2
.3
C Correlation with Intelligiblity
intelligiblity (%)
reco
nstru
ctio
n ac
cura
cy
Ding & Simon, J Neuroscience (2013)
Speech in Noise: Results
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
B Reconstruction Accuracy
corre
latio
n
SNR (dB)Q +6 +2 −3 −6 −9
0
.1
.2
0 25 50 75 1000
.1
.2
.3
C Correlation with Intelligiblity
intelligiblity (%)
reco
nstru
ctio
n ac
cura
cy
Ding & Simon, J Neuroscience (2013)
Speech in Noise: Results
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
B Reconstruction Accuracy
corre
latio
n
SNR (dB)Q +6 +2 −3 −6 −9
0
.1
.2
0 25 50 75 1000
.1
.2
.3
C Correlation with Intelligiblity
intelligiblity (%)
reco
nstru
ctio
n ac
cura
cy
Ding & Simon, J Neuroscience (2013)
Speech in Noise: Results
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
B Reconstruction Accuracy
corre
latio
n
SNR (dB)Q +6 +2 −3 −6 −9
0
.1
.2
0 25 50 75 1000
.1
.2
.3
C Correlation with Intelligiblity
intelligiblity (%)
reco
nstru
ctio
n ac
cura
cy
Ding & Simon, J Neuroscience (2013)
Speech in Noise: Results
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
B Reconstruction Accuracy
corre
latio
n
SNR (dB)Q +6 +2 −3 −6 −9
0
.1
.2
0 25 50 75 1000
.1
.2
.3
C Correlation with Intelligiblity
intelligiblity (%)
reco
nstru
ctio
n ac
cura
cy
+6 dB
-6 dB1 s
A Neural Reconstruction ofUnderlying Speech Envelope
B Reconstruction Accuracy
corre
latio
n
SNR (dB)Q +6 +2 −3 −6 −9
0
.1
.2
0 25 50 75 1000
.1
.2
.3
C Correlation with Intelligiblity
intelligiblity (%)
reco
nstru
ctio
n ac
cura
cyacross Subjects
Ding & Simon, J Neuroscience (2013)
Noise-Vocoded Speech
Ding, Chatterjee & Simon, NeuroImage (2014)
“in noise” = +3 dB SNR
Noise-Vocoded Speech: Results
• Cortical entrainment to natural speech robust to noise• Cortical entrainment to vocoded speech is not• Not explainable by passive envelope tracking mechanisms
- noise vocoding does not directly affect the stimulus envelope
Noise-Vocoded Speech: Results
Cortical Speech Representations
• Neural Representations: Encoding & Decoding
• Linear models: Useful & Robust
• Speech Envelope only (as seen by MEG)
• Envelope Rates: ~ 1 - 10 Hz
Alex Katz, The Cocktail Party
The Cocktail Party
Alex Katz, The Cocktail Party
The Cocktail Party
Alex Katz, The Cocktail Party
The Cocktail Party
Alex Katz, The Cocktail Party
The Cocktail Party
Alex Katz, The Cocktail Party
The Cocktail Party
speech
competing speech
Experiments
speech
competing speech
Experiments
reverberation
Experiments in Progress
speech
competing speech
Experiments in Progress
olderlistener
speech
competing speech
Experiments in Progress
competing speech
speech
competing speech
Two Competing Speakers
Selective Neural Encoding
Selective Neural Encoding
Selective Neural Encoding
Unselective vs. Selective Neural Encoding
Unselective vs. Selective Neural Encoding
Selective Neural Encoding
Stream-Specific Representation
grand average over subjects
representative subject
Identical Stimuli!
reconstructed from MEG
attended speech envelopes
reconstructed from MEG
attending tospeaker 1
attending tospeaker 2
Ding & Simon, PNAS (2012)
Stream-Specific Representation
grand average over subjects
representative subject
Identical Stimuli!
reconstructed from MEG
attended speech envelopes
reconstructed from MEG
attending tospeaker 1
attending tospeaker 2
Ding & Simon, PNAS (2012)
Single Trial Speech Reconstruction
Ding & Simon, PNAS (2012)
Single Trial Speech Reconstruction
Ding & Simon, PNAS (2013)
Overall Speech Reconstruction
0.2
0
0.1
corre
latio
n
attended speechreconstruction
backgroundreconstruction
attended speech background
Distinct neural representations for different speech streams
Invariance Under Relative Loudness Change?
Invariance Under Relative Loudness Change?
Invariance under Relative Loudness Change
attended
backgroundcorr
elat
ion
.1
.2
-8 -5 0 5 8Speaker Relative Intensity (dB)
Neural Results
• Neural representation invariant to relative loudness change
• Stream-based Gain Control, not stimulus-based
Forward STRF Model
Spectro-Temporal Response Function (STRF)
Forward STRF Model
Spectro-Temporal Response Function (STRF)
STRF Results
•STRF separable (time, frequency)•300 Hz - 2 kHz dominant carriers•M50STRF positive peak•M100STRF negative peak
TRF
•M100STRF strongly modulated by attention, but not M50STRF
attended
.2
.5
1
3
0 100 200
Background
fre
qu
en
cy (
kH
z)
.2
.5
1
3
0 100 200
Attended
time (ms) time (ms)
background
STRF Results
•STRF separable (time, frequency)•300 Hz - 2 kHz dominant carriers•M50STRF positive peak•M100STRF negative peak
TRF
•M100STRF strongly modulated by attention, but not M50STRF
attended
.2
.5
1
3
0 100 200
Background
fre
qu
en
cy (
kH
z)
.2
.5
1
3
0 100 200
Attended
time (ms) time (ms)
background
STRF Results
•STRF separable (time, frequency)•300 Hz - 2 kHz dominant carriers•M50STRF positive peak•M100STRF negative peak
TRF
•M100STRF strongly modulated by attention, but not M50STRF
attended
.2
.5
1
3
0 100 200
Background
fre
qu
en
cy (
kH
z)
.2
.5
1
3
0 100 200
Attended
time (ms) time (ms)
background
Neural Sources
RightLeft
anterior
posterior
medial
M50STRFM100STRFM100
•M100STRF source near (same as?) M100 source: Planum Temporale
•M50STRF source is anterior and medial to M100 (same as M50?): Heschl’s Gyrus
5 mm
•PT strongly modulated by attention, but not HG
Cortical Object-Processing Hierarchy
0 100 200 400time (ms)
0
attendedbackground
Attentional Modulation
0 100 200 400
0
time (ms)
clean
-5 dB-8 dB
Influence of Relative Intensity
0 dB5 dB8 dB
•M100STRF strongly modulated by attention, but not M50STRF.•M100STRF invariant against acoustic changes.•Objects well-neurally represented at 100 ms, but not 50 ms.
Studies In Progress
• Attentional Dynamics
• Aging & Neural Representations of Speech
• Neural Representations of the Background
Attentional DynamicsAttend to Speaker 1
Attend to Speaker 2
Prob
abilit
y of
atte
ndin
g S
peak
er 1
Attentional DynamicsAttend to Speaker 1
Switch Attention
Attend to Speaker 2
Prob
abilit
y of
atte
ndin
g S
peak
er 1
Younger vs. Older Listeners
withCompetingSpeaker
Spee
ch R
econ
stru
ctio
n
withCompetingSpeaker
Older AdultsYounger Adults
In Quiet
Integration window (ms)
In Quiet
Younger vs. Older Listeners
withCompetingSpeaker
Spee
ch R
econ
stru
ctio
n
withCompetingSpeaker
Older AdultsYounger Adults
In Quiet
Integration window (ms)
In Quiet
speech
competing speech
Three Competing Speakers
competing speech
Foreground vs. Background
Foreground vs. Background
Foreground vs. Background
Foreground vs. Background
Foreground vs. Background
Foreground vs. Background
Foreground vs. Background
−0.2
0
0.2
0.4
Noise FloorIndividual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams
Backgrounds vs. Background
NoiseFloor
ForegroundIndividualBackgrounds
FusedBackground
Individual Speech Streams
Integration Window over Late Times Only
−0.2
0
0.2
0.4
Noise FloorIndividual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams
Backgrounds vs. Background
NoiseFloor
IndividualBackgrounds
FusedBackground
Foreground
Individual Speech Streams
Integration Window over Late Times Only
−0.2
0
0.2
0.4
Noise FloorIndividual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams
Backgrounds vs. Background
NoiseFloor
IndividualBackgrounds
FusedBackground
Foreground
Individual Speech Streams
Integration Window over Late Times Only
Backgrounds vs. BackgroundWhy not?
Speaker 1
Two Speakers
Speaker 2
Stimulus Background
MEG Response
Backgrounds vs. BackgroundWhy not?
Speaker 1
Two Speakers
Speaker 2
Stimulus Background
MEG Response
Backgrounds vs. BackgroundWhy not?
Speaker 1
Two Speakers
Speaker 2
Stimulus Background
MEG Response
Backgrounds vs. BackgroundWhy not?
Speaker 1
Two Speakers
Speaker 2
Stimulus Background
MEG Response
−0.2
0
0.2
0.4
Noise FloorIndividual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams
Backgrounds vs. Background
NoiseFloor
IndividualBackgrounds
FusedBackground
Foreground
Individual Speech Streams
Integration Window over Late Times Only
−0.2
0
0.2
0.4
Noise FloorIndividual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams
Backgrounds vs. Background
NoiseFloor
IndividualBackgrounds
FusedBackground
Foreground
Individual Speech Streams
Integration Window over Late Times Only
Backgrounds vs. Background
−0.2
0
0.2
0.4
Noise FloorSummed Individual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams II
NoiseFloor
IndividualBackgrounds
Summed
FusedBackground
Foreground
Individual Speech Streams
Integration Window over Late Times Only
Backgrounds vs. Background
−0.2
0
0.2
0.4
Noise FloorSummed Individual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams II
NoiseFloor
IndividualBackgrounds
Summed
FusedBackground
Foreground
Individual Speech Streams
+
+
Integration Window over Late Times Only
Backgrounds vs. Background
−0.2
0
0.2
0.4
Noise FloorSummed Individual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams II
NoiseFloor
IndividualBackgrounds
Summed
FusedBackground
Foreground
Individual Speech Streams
+
+
Integration Window over Late Times Only
Backgrounds vs. Background
−0.2
0
0.2
0.4
Noise FloorSummed Individual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams II
NoiseFloor
IndividualBackgrounds
Summed
FusedBackground
Foreground
Individual Speech Streams
?
+
+
Integration Window over Late Times Only
Backgrounds vs. Background
−0.2
0
0.2
0.4
Noise FloorSummed Individual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams II
NoiseFloor
IndividualBackgrounds
Summed
FusedBackground
Foreground
Individual Speech Streams
?
+
+
Integration Window over Late Times Only
Backgrounds vs. Background
Individual Backgrounds Summed
Fuse
d Ba
ckgr
ound
0 0.1 0.2 0.3
0
0.1
0.2
0.3
Background Representations
individual backgrounds reconstruction
fuse
d ba
ckgr
ound
reco
nstru
ctio
n
+
Integration Window over Late Times Only
Backgrounds vs. Background
Individual Backgrounds Summed
Fuse
d Ba
ckgr
ound
0 0.1 0.2 0.3
0
0.1
0.2
0.3
Background Representations
individual backgrounds reconstruction
fuse
d ba
ckgr
ound
reco
nstru
ctio
n
+
Integration Window over Late Times Only
Backgrounds vs. Background
Individual Backgrounds Summed
Fuse
d Ba
ckgr
ound
0 0.1 0.2 0.3
0
0.1
0.2
0.3
Background Representations
individual backgrounds reconstruction
fuse
d ba
ckgr
ound
reco
nstru
ctio
n
+
High latency areas (PT) represent fused background with better fidelity than individual backgrounds(p = 0.012)
Integration Window over Late Times Only
Foreground vs. Background
−0.2
0
0.2
0.4
Noise FloorIndividual BackgroundsFused BackgroundForeground
reco
nstru
ctio
n ac
cura
cy
Individual Speech Streams
ForegroundIndividualBackgrounds
Individual Speech Streams
Integration Window over Late Times Only
Foreground vs. BackgroundEarly vs. Late
0 0.1 0.2 0.3−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35Core Representations
backgrounds reconstruction
fore
grou
nd re
cons
truct
ion
0 0.1 0.2 0.3−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35Higher Order Representations
backgrounds reconstruction
fore
grou
nd re
cons
truct
ion
p > 0.10 p < 1e-11
Early (HG) Late (PT)
Background Background
Fore
grou
nd
Foreground vs. BackgroundEarly vs. Late
0 0.1 0.2 0.3−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35Core Representations
backgrounds reconstruction
fore
grou
nd re
cons
truct
ion
0 0.1 0.2 0.3−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35Higher Order Representations
backgrounds reconstruction
fore
grou
nd re
cons
truct
ion
p > 0.10 p < 1e-11
Early (HG) Late (PT)
Background Background
Fore
grou
nd
Foreground vs. BackgroundEarly vs. Late
0 0.1 0.2 0.3−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35Core Representations
backgrounds reconstruction
fore
grou
nd re
cons
truct
ion
0 0.1 0.2 0.3−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35Higher Order Representations
backgrounds reconstruction
fore
grou
nd re
cons
truct
ion
p > 0.10 p < 1e-11
Early (HG) Late (PT)
HG represents attended and unattended speech with almost equal fidelity
Background Background
Fore
grou
nd
Summary• Cortical representations of speech- representation of envelope (up to ~10 Hz)
• Cortical Processing Hierarchy: Consistent with being neural representation of auditory perceptual object
• Object representation at 100 ms latency (PT), but not by 50 ms (HG)
• Preliminary evidence for - PT: additional fused background representation
- HG: almost equal representations
Thank You