
Phonology and Phonetics 17

Editor

Aditi Lahiri

De Gruyter Mouton

Lexical Representation: A Multidisciplinary Approach

edited by

Gareth Gaskell
Pienie Zwitserlood

De Gruyter Mouton

ISBN 978-3-11-022492-4

e-ISBN 978-3-11-022493-1

ISSN 1861-4191

Library of Congress Cataloging-in-Publication Data

Lexical representation : a multidisciplinary approach / edited by Gareth Gaskell, Pienie Zwitserlood.
p. cm. - (Phonology and phonetics; 17)
Includes bibliographical references and index.
ISBN 978-3-11-022492-4 (alk. paper)
1. Lexicology. 2. Psycholinguistics. 3. Grammar, Comparative and general - Word formation. 4. Grammar, Comparative and general - Morphology. I. Gaskell, M. Gareth. II. Zwitserlood, Pienie.
P326.L386 2011
401'.9-dc22
2011013270

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/New York

Printing: Hubert & Co. GmbH & Co. KG, Göttingen. Printed on acid-free paper.

Printed in Germany

www.degruyter.com

This book is dedicated to William Marslen-Wilson


Recognizing words from speech: The perception-action-memory loop

David Poeppel and William Idsardi

1. Conceptual preliminaries

1.1. Terminological

The failure to be sufficiently careful about terminological distinctions has resulted in some unnecessary confusion, especially when considering the neurobiological literature. For example, the term speech perception has unfortunately often been used interchangeably with language comprehension. We reserve the term language comprehension for the computational subroutines that occur subsequent to the initial perceptual analyses. In particular, language comprehension can be mediated by ear, eye, or touch. The linguistic system can be engaged by auditory input (speech), visual input (text or sign), and tactile input (Braille). In other words, the processes that underlie language comprehension build on sensorimotor input processes that appear to be, at least in part, independent. While this point may seem pedantic, the literature contains numerous reports that do not respect these distinctions and that conflate operations responsible for distinct aspects of perception and comprehension.

We focus here on speech perception proper, the perceptual analysis of auditory input. Importantly, further distinctions must be considered. There are at least three experimental approaches grouped under the rubric 'speech perception,' and because they differ in the structure of the input, the perceptual subroutines under investigation, and the putative endpoint of the computations, it is important to be cognizant of these distinctions, too.

(a) Most research on speech perception refers to experimentation on specific contrasts across individual speech sounds, i.e., sub-lexical/pre-lexical units of speech. Subjects may be presented with single vowels or single syllables and asked to execute particular tasks, such as discrimination or identification. In a typical study, subjects listen to consonant-vowel (CV) syllables drawn from an acoustic continuum - for example, series exemplifying the /ra/-/la/ tongue-shape contrast or the /bi/-/pi/ voicing contrast - and are asked upon presentation of a single token to identify the stimulus category. This research strategy focuses on sublexical properties of speech and typically examines questions concerning the nature of categorical perception in speech (e.g., Liberman 1996), the phonemic inventory of speakers/listeners of different languages (e.g., Harnsberger 2000), perceptual magnet effects (e.g., Kuhl et al. 2007), the changes associated with first (e.g., Eimas et al. 1971) and second language learning (e.g., Flege and Hillenbrand 1986), phonotactic constraints (e.g., Dupoux et al. 1999; Kabak and Idsardi 2007), the role of distinctive features (e.g., Kingston 2003), and other issues productively addressed at the pre-lexical level of analysis.

This work has been immensely productive in the behavioral literature and is now prominent in the cognitive neurosciences. For example, using fMRI, several teams have examined regionally specific hemodynamic effects when subjects execute judgments on categorically varying stimuli (Blumstein, Myers, and Rissman 2005; Liebenthal et al. 2005; Raizada and Poldrack 2007). These studies aim to show that there are regions responding differentially to signals that belong to different categories, or that are speech versus non-speech. Interestingly, no simple answer has resulted from even rather similar studies, with temporal, parietal, and frontal areas all implicated. Similarly, electrophysiological methods (EEG, MEG) have been used to probe the phonemic inventories of speakers of different languages. For example, Näätänen et al. (1997) were able to show subtle neurophysiological distinctions that characterize the vowel inventories of Finnish versus Estonian speakers. Kazanina, Phillips, and Idsardi (2006), discussed further below, used MEG data to illustrate how language-specific contrasts (Russian versus Korean), including allophonic distinctions, can be quantified neurophysiologically.

Despite its considerable influence, it must be acknowledged that this research program has noteworthy limitations. For example, a disproportionately large number of studies examine categorical perception as well as the notion of 'rapid temporal processing', all typically based on plosive contrasts (especially voice-onset time, VOT). While syllables with plosive onsets are admittedly fascinating in their acoustic complexity (and VOT is easily manipulated), a rich variety of other phenomena at the pre-lexical level have not been well explored. Moreover, these types of studies are 'maximally ecologically invalid': experimenters present single, sub-lexical pieces of speech in experimental settings that require 'atypical' attention to particular features - and by and large engage no further linguistic processing, even designing studies with non-words so as to preclude as much as possible any interference from other linguistic levels of analysis. The results obtained are therefore in danger of masking or distorting the processes responsible for ecologically natural speech perception. Speakers/listeners do not consciously attend to sub-lexical material, and therefore the interpretation of these results, especially in the context of neurobiological findings, requires a great deal of caution, especially since task effects are known to modulate normal reactivity in dramatic ways.

(b) A second line of research investigates speech perception through the lens of spoken word recognition. These studies have motivated a range of lexical access models (for instance, lexical access from spectra, Klatt 1979, 1989; instantiations of the cohort model, e.g., Gaskell and Marslen-Wilson 2002; the neighborhood activation model, Luce and Pisoni 1998; continuous mapping models, Allopenna et al. 1998; and others) and have yielded critical information regarding the structure of mental/neural representations of lexical material. Behavioral research has made many significant contributions to our understanding and has been extensively reviewed prior to the advent of cognitive neuroscience techniques (see, for example, the influential edited volumes by Marslen-Wilson 1989 and Altmann 1990). Typical experimental manipulations include lexical decision, naming, gating, and priming. Recognizing single spoken words is considerably more natural than performing unusual tasks on sub-lexical material. Some models, such as the influential TRACE model (McClelland and Elman 1986), view featural and lexical access as fully integrated; others argue for more cascaded operations.

Some important cognitive neuroscience contributions in this domain have been made by Blumstein and colleagues, who have examined aspects of spoken word recognition using lesion and imaging data (e.g., Misiurski et al. 2005; Prabhakaran et al. 2006; Utman, Blumstein, and Sullivan 2001). The data support a model in which superior temporal areas mediate acoustic-phonetic analyses, temporo-parietal areas perform the mapping to phonological-lexical representations, and frontal areas (specifically the inferior frontal gyrus) play a role in resolving competition (i.e., deciding) between alternatives when listeners are confronted with noisy or underspecified input. The effect of lexical status on speech-sound categorization has been investigated extensively in the behavioral literature (typically in the context of evaluating top-down effects), and Blumstein and colleagues, using voicing continua with word or non-word endpoints, have recently extended this work using fMRI (Myers and Blumstein 2008). They demonstrate that fMRI data show dissociations between functionally 'earlier' effects in the temporal lobes (related to perceptual analyses) and putatively 'later,' downstream decision processes implicating frontal lobe structures. A behavioral task that has been used productively in studies of lexical representation is repetition priming, and Gagnepain et al. (2008) used word and non-word repetition priming to elucidate which cortical structures are specifically sensitive to the activation of lexical entries. Bilateral superior temporal sulcus and superior temporal gyrus (STS, STG) are particularly prominent, suggesting that the mapping to lexical information occurs in cortical regions slightly more ventral than perceptual computations (and bilaterally; cf. Hickok and Poeppel 2000, 2004, 2007). Finally, subtle theoretical proposals about lexical representation have recently been tested in electrophysiological studies. Eulitz and colleagues (Friedrich, Eulitz, and Lahiri 2006), for example, have used lexical decision designs to support underspecification models of lexical representation.


(c) A third way in which speech perception is examined is in the context of recognizing spoken sentences and assessing their intelligibility. In such studies, participants are presented with sentences (sometimes containing acoustic manipulations) and are asked to provide an index of intelligibility, for example by reporting key words or providing other metrics that reflect performance. Understanding spoken sentences is, naturally, a critical goal because it is the perceptual task we most want to explain - but there is a big price to pay for using this type of ecologically natural material. In using sentential stimuli, it becomes exceedingly difficult to isolate input-related perceptual processes per se (imagine teasing out effects of activation, competition, and selection à la Marslen-Wilson), because presentation of sentences necessarily entails lexical processes, syntactic processes, and both lexical semantic and compositional semantic processes - and therefore engages numerous 'top-down' factors that demonstrably play a critical role in the overall analysis of spoken input.

Cognitive neuroscience methodologies have been used to test intelligibility at the sentence level as well. In a series of PET and fMRI studies, for example, Scott and colleagues have shown that anterior temporal lobe structures, especially anterior STS, play a privileged role in mediating intelligibility (e.g., Scott et al. 2000). Electrophysiological techniques have also been used to study sentence-level speech intelligibility, and Luo and Poeppel (2007) have argued that phase information in the cortical signal of a particular frequency, the theta band, is particularly closely related to and modulated by the acoustics of sentences.

In summary, the locution 'speech perception' has been used in at least three differing ways. Important attributes of the neurocognitive system underlying speech and language have been discovered using all three approaches discussed. This brief outline serves to remind the reader that it is challenging to isolate the relevant perceptual computations. Undoubtedly, we need to turn to all types of experimental approaches to obtain a full characterization. For example, to understand the nature of distinctive features for perception and representation, experimentation at the sub-phonemic, phonemic, or syllabic levels will be critical; to elucidate how words are represented and accessed, research on spoken-word recognition is essential; and it goes without saying that we cannot do without an understanding of the comprehension of spoken sentences. Here, we take speech perception to refer to a specific set of computational subroutines (discussed in more detail in section 1.3 below): speech perception comprises the set of operations that take as input continuously varying acoustic waveforms made available at the auditory periphery and that generate as output those representations (morphemic, lexical) that serve as the data structures for subsequent operations mediating comprehension. More colloquially, our view can be caricatured as the collection of operations that lead from vibrations in the periphery to abstractions in cortex (see Figure 1).

1.2. Methodological


Brain science needs gadgets, and practically every gadget usable on humans has been applied to speech and lexical access. There are two types of approaches that the consumer of the literature should know: 'direct' techniques using electrical or magnetic measurement devices and 'indirect' recording using hemodynamically based measurements as proxies for brain activity. The different methods are suited to address different kinds of questions about speech and language, and the careful alignment of research question with technique should be transparent.

The electrical and electromagnetic techniques directly measure different aspects of neuronal activity. Electrophysiological approaches applied to spoken-language recognition range from, on the one hand, very invasive studies with high spatial resolving power - single-unit recording in animals investigating the building blocks underpinning phonemic representation (Engineer et al. 2008; Rauschecker, Tian, and Hauser 1995; Schroeder et al. 2008; Steinschneider et al. 1994; Young 2008) and pre-surgical subdural grid recording in epilepsy patients (e.g., Boatman 2004; Crone et al. 2001) - to, on the other hand, noninvasive recording using electroencephalography (EEG/ERP) and magnetoencephalography (MEG). These methods share the high temporal resolution (on the order of milliseconds) appropriate for assessing perceptual processes as they unfold in real time, but the methods differ greatly in the extent to which one can identify localized processes. Insofar as one has mechanistic processing models/hypotheses that address how speech is represented and processed in neuronal tissue, electrophysiological techniques are critical. Spoken language unfolds quickly, with acoustic signal changes in the millisecond range having specific consequences for perceptual classification. Accordingly, these techniques are necessary to zoom (in time) into such granular changes. Moreover, although many aspects of speech cannot be addressed in animal models (for example lexical representation), the single-unit and local-field-potential (LFP) animal work informs us about how single neurons and neuronal ensembles encode complex auditory signals. Thus, although the perceptual endgame is not the same for ferrets and francophones, some of the subroutines that constitute perception can be studied effectively using animal models.


The hemodynamic techniques, principally fMRI and PET, and more recently NIRS (near infra-red spectroscopy), have been used extensively since the late 1980s to study speech perception (Binder et al. 2000; Blumstein et al. 2005; Burton, Small, and Blumstein 2000; Meyer et al. 2005; Obleser et al. 2007; Raettig and Kotz 2008; Scott and Wise 2004). The major advantages - especially of fMRI - are its spatial resolution and, now, the ubiquitous availability of the machines. It is now possible to detect activity differentially at a spatial scale of a millimeter and better, and therefore these noninvasive recordings are approaching a scale that is familiar from animal studies (roughly the scale of cortical columns) (Bandettini 2003; Logothetis 2008). However, the temporal resolution is limited, roughly to changes occurring over hundreds of milliseconds (i.e., about a word or so). The main contribution of these approaches is to our understanding of the functional anatomy (see Section 3). Note, also, that these techniques provide a 'spatial answer' - requiring as a hypothesis a 'spatial question.' While the contribution of hemodynamic imaging to anatomy is considerable, questions about representation - and especially online processing - are difficult to address using such methods. Recent reviews of fMRI, in particular, emphasize the need to complement such data with electrophysiological recordings (Logothetis 2008). As a leading neuroscientist and imaging expert, Nikos Logothetis, points out, "fMRI is a measure of mass action. You almost have to be a professional moron to think you're saying something profound about the neural mechanisms. You're nowhere close to explaining what's happening, but you have a nice framework, an excellent starting point" (http://www.sciencenews.org/view/feature/id/50295/title/Trawling_the_brain).

Historically, neuropsychological data have been the most widely available; consequently, deficit-lesion correlation research forms the basis for the functional anatomy of speech sound processing as we conceive it here (see Section 3). In recent years, the reversible (in)activation of neuronal tissue using transcranial magnetic stimulation (TMS) has received much attention, although as yet few studies have investigated speech (and those that have done so have yielded very dodgy results, e.g., D'Ausilio et al. 2009). The careful dissection of deficits and associated lesions has played a big role in establishing some of the key insights of current models, including that speech perception is more bilaterally mediated than common textbook wisdom holds to be true, and that frontal areas contribute to perceptual abilities under certain task configurations (see, e.g., work by Blumstein for elaboration). Neuropsychological data establish both (a) that speech processing clearly dissociates from language processing as well as from other parts of auditory cognition (Poeppel 2001) and (b) that the classical view that the left temporal lobe subsumes speech and language comprehension is dramatically underspecified.

While these school-marmish reminders regarding the benefits and limitations of techniques may seem irritating and perhaps even obvious, it is remarkable how often research is insensitive to crucial methodological limitations, thereby furthering interpretations that are untenable given the origin of the data. Insofar as we seek a theoretically sound, formally explicit, and neuronally realistic model of spoken language processing and the brain, a thoughtful consideration of which techniques answer which questions is essential.

1.3. 'Function-o-logical'

The perspective summarized here has been developed in recent pieces (Poeppel and Hackl 2008; Poeppel, Idsardi, and van Wassenhove 2008; Poeppel and Monahan 2008). What we hope to provide is a serviceable definition for the cognitive neuroscience of speech perception that links various interrelated questions from acoustics to phonology to lexical access. Figure 1, from Poeppel et al. (2008), summarizes what we take to be the problem.

The starting point for the perceptual-computational system is the acoustic signal, a continuously varying waveform that encodes information on different timescales (Fig. 1a). For example, the amplitude envelope of the signal correlates well with properties of the syllabic structure of an utterance; the fine structure of the signal, in contrast, carries information over shorter timescales (including features and segments). This input array must ultimately be transformed into a series of discrete segments that constitute a morpheme/word. Because we believe the key goal to be the identification of words, specifying the format of lexical representation is necessary. Moreover, the morphemes/words must be stored in a format that permits them to enter into subsequent linguistic computation (including, e.g., combinatoric operations that underlie language comprehension); identifying a word is not nearly enough - the listener must be able to connect it formally (i.e., in terms of representational specifications, such as noun, determiner, etc., in whatever neural code it is specified in) to its neighboring environment, e.g., to perform whatever phonological, morphological, syntactic, or semantic operation the situation demands. We adopt the view, developed in linguistic research over the last half century - and implicit since the Phoenicians invented an alphabet - that such lexical representations are represented as a series of segments that themselves are made up of bundles of distinctive features (Fig. 1d; see Section 2 for more motivation); we explicitly will also allow other parallel representations, e.g., syllables.


Figure 1. From waveforms to words. Continuously varying acoustic signals (a) are analyzed in the afferent auditory pathway, ultimately to be represented as neural versions of spectrograms in bilateral auditory cortex (b). Based on this high-resolution auditory representation, we hypothesize that a 'primal sketch' - based on multi-time resolution analysis - is constructed (c). The perceptual endgame is the identification of words, which we take to be represented in memory as sequences of segments that are themselves comprised of bundles of distinctive features (d). From Poeppel et al. 2008.


The input waveform (representation R1) is analyzed by the auditory periphery and is presumably represented in auditory cortex by neurons with sophisticated spectro-temporal receptive field properties (STRFs). One can think of this as a neural version of a spectrogram, albeit one composed of numerous mini-spectrograms with specializations for certain spectro-temporal patterns (Fig. 1b), such as the characteristic convergent second and third formant trajectories near velars (Stevens 1998). This type of representation (R2) is most likely a property of neurons in the auditory cortex, and it does not differentiate between speech and non-speech signals. Moreover, given the highly conserved nature of mammalian auditory cortex, these representations are very likely shared with other species, and consequently these representations can be investigated using animal models and single-cell recording approaches. Based on this initial (high resolution) auditory cortical pattern, multiple representations on different scales are constructed, in parallel. In this next step, 'auditory primitives' are built out of early auditory cortical elements, with one key feature being the time scale of the new representations. This third type of representation (R3) must be of a granularity that permits mappings (linking operations) from the encoding of simple acoustic properties in early auditory cortical areas to speech primitives in more downstream areas (arguably including STG and STS). We conjecture that these intermediate representations encompass at least two subtypes (temporal primitives) commensurate with syllabic and segmental durations (Boemio et al. 2005; Giraud et al. 2007; Poeppel 2001, 2003; Poeppel et al. 2008). The initial cortical representation is fractionated into (at least) two streams, and concurrent multi-time resolution analysis then lies at the basis of subsequent processing. The specific nature of R3 is a critical research question, and we have characterized the question as arriving at a 'primal sketch' for speech perception (Fig. 1c), akin to Marr's famous hypothesis about intermediate representations for object recognition; one possibility for the primal sketch is the PFNA coarse coding (plosive-fricative-nasal-approximant), discussed below. The final, featurally specified representation (R4) constitutes the format that is the endpoint of perception - but which is also the set of instructions for articulation. As discussed further below, the loop between perception, memory, and action is enabled because the representational format used for words in memory, distinctive features, allows the mapping both from the input to words (identify features) and from words to action (features are in motoric coordinates).
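
Read as a processing pipeline, these representational commitments can be summarized in type-signature form. The gloss below is ours, not the chapter's; every type and function name is an illustrative stand-in:

```python
# Our gloss of the R1 -> R4 pipeline as stage signatures. All types and
# names are illustrative stand-ins, not an implementation from the chapter.

from typing import List, Sequence

Waveform = Sequence[float]          # R1: continuous input at the periphery
Spectrogram = List[List[float]]     # R2: STRF-style auditory-cortex analysis
PrimalSketch = str                  # R3: e.g., a PFNA string such as "PFA..."
FeatureBundles = List[frozenset]    # R4: segments as distinctive-feature sets

def r2_auditory_analysis(wave: Waveform) -> Spectrogram: ...
def r3_primal_sketch(spec: Spectrogram) -> PrimalSketch: ...   # multi-time scales
def r4_segments(sketch: PrimalSketch, spec: Spectrogram) -> FeatureBundles: ...
```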

Obviously, a major goal now must be to look for a Hegelian synthesis for these various antitheses, i.e., levels of representation with competing structures and affordances. In particular, how is it that we have so much solid evidence for both cohorts and neighborhoods, whose guiding assumptions seem irreconcilable? What kind of system is this that illustrates both phonetic specificity (a surface property of speech sounds) and phonological underspecification (a generalization following from a highly abstract code)? Again, we believe that in order to understand these rather confusing results we need to draw further distinctions, and we offer up a modest proposal in order to have our exemplar cake and eat it too. Echoing a line from Cutler and Fay (1982), we agree that there is "one mental lexicon, phonologically arranged." But the operative word here is "arranged". We envision a 3-step process that offers a place for each of the kinds of findings. The first step is a coarse-coding of the signal into universal categories (akin if not identical to Stevens' (1998) landmarks). For concreteness let us say that this code is just the speech stream coded into four categories (PFNA: plosives, fricatives, nasals and approximants). Preliminary modeling of English-like lexicons suggests that this coding yields pools of words of approximately the same size as the usual lexical neighborhoods and with a fair overlap between various pools and neighborhoods. Within these pools we can now conduct a directed left-to-right search using contextually defined featural definitions (i.e., the cues for [labial] within [nasal] are different than those within [plosive], and differ language to language). Moreover, this search can be guided by the differences amongst the words in the active pool using analysis-by-synthesis and Bayesian inference (see below). Finally, once the best-available word-form has been selected, the contents of that lexical item are examined, compared to the memory trace of the incoming signal, and verified to in fact be the word we're looking for. Since the lexical entry contains a great deal of information (morphology, syntax, semantics, pragmatics, usage) there is little harm or cost (and much benefit) in storing a detailed phonetic summary of the form's pronunciation (though we would prefer a model-based statistical summary to an exemplar cloud). In sum, we get to the entry via a coarse-coded search with subsequent directed refinement, but the choice needs to be verified to be accepted. Thus we expect (eventually) to see in the time-course of word-recognition early effects of coarse-coding followed later by exemplar-like effects of lexical item phonetic specificity, even if our current methods are perhaps too crude to pick up this distinction.

One way to think about the challenge is to consider the analogy to visual object recognition. Research there has attempted to identify which intermediate representations can link the early cortical analyses over small spatial receptive fields (edges or Gabor patches, or other early visual primitives) with the representation of objects. There have been different approaches to intermediate representations, but every computational theory, either explicitly or implicitly, acknowledges the need for them. The more traditional hypothesis - a mapping from acoustic to phonetic to phonological representations - is no longer central to the problem as we define it (although the mapping from R1/R2 to R3 to R4 is reminiscent of similar challenges). The multiple levels of representation we envision are simultaneous representations on different time-scales corresponding to different linguistic 'views' of the speech material.
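
The first, coarse-coding step of the 3-step proposal above can be made concrete with a toy sketch. The lexicon, the segment-to-class table, and all names below are our own illustrative stand-ins, not the authors' model:

```python
# Toy sketch of step 1 of the proposal: coarse-code word forms into PFNA
# skeletons and group the lexicon into candidate pools. The mini-lexicon and
# the segment-to-class table are invented for illustration.

from collections import defaultdict

MAJOR_CLASS = {  # segment -> PFNA class (vowels and liquids as approximants)
    "p": "P", "t": "P", "k": "P", "b": "P", "d": "P", "g": "P",
    "s": "F", "z": "F", "f": "F", "v": "F",
    "m": "N", "n": "N",
    "a": "A", "e": "A", "i": "A", "o": "A", "u": "A", "l": "A", "r": "A",
}

def pfna(word: str) -> str:
    """Collapse a segment string into its coarse PFNA skeleton."""
    return "".join(MAJOR_CLASS[seg] for seg in word)

def build_pools(lexicon):
    """Partition the lexicon by PFNA skeleton (step 1: coarse coding)."""
    pools = defaultdict(list)
    for word in lexicon:
        pools[pfna(word)].append(word)
    return pools

pools = build_pools(["pat", "bad", "kit", "mat", "sat", "fan"])
# "pat", "bad", and "kit" all share the skeleton "PAP": the code is coarse
# by design, so a pool resembles a lexical neighborhood.
print(pools[pfna("pat")])   # ['pat', 'bad', 'kit']
# Step 2 (directed left-to-right featural search within the pool) and step 3
# (verification against the memory trace) would then winnow this pool.
```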


(i.e" ,in terms of representational specifications, such as noun, determiner,etc" In whatever neural code that it is specified in) to its neighboring envi­ronment, ~.g" to p~rform whatever phonological, morphological, syntactic,?r s~ma~tI~operatIOn the situation demands. We adopt the view, developedIn hn~u~stic :esearch over the last half century - and implicit since thePhoenIcians Invent:d an alphabet - that such lexical representations arerepr~s~nte? as a senes o~ segments that themselves are made up of bundleso~ ~lstInc~lve features (Fig Id; see Section 2 for more motivation); we ex­phcltly Will also allow other parallel representations, e.g., syllables.

7

2. Linguistic bases of speech perception

2.1. Features


Because most modern societies are literate and often familiar with a language with an alphabetic script, there is a tendency to identify speech perception with the perception of whole, single speech segments (phones or phonemes) - the amount of speech generally captured by a single letter in an alphabetic script. However, segmental phonemes are not the smallest units of representation; they are composed of distinctive features, which connect articulatory goals with auditory patterns and provide a discrete, modality- and task-neutral representation suitable for storage in long-term memory (see Jakobson, Fant and Halle 1952, for the original proposals, and Halle 2002, for a spirited defense of this position; see Mielke 2007, for a contrasting view). For example, the feature [+round] encodes a speech sound component that in articulation involves rounding the lips through the innervation of the orbicularis oris muscle, and on the auditory side a region of speech with a downward sweep of all of the formants (when formant transitions are available), or diffuse spectra (in stop bursts and fricatives). The features thus are the basis of the translation (coordinate transformation) between acoustic-space and articulator-space, and moreover are part of the long-term memory representations for the phonological content of morphemes, forming the first memory-action-perception loop.
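
As a concrete gloss on this representational claim - a minimal sketch of ours, with an invented feature inventory rather than anything from the chapter - segments can be modeled as bundles (sets) of feature values:

```python
# A minimal sketch (illustrative only): segments as bundles of distinctive
# features. The articulatory/acoustic glosses paraphrase the text above.

FEATURE_CORRELATES = {
    # feature: (articulatory goal, auditory pattern)
    "+round": ("lip rounding (orbicularis oris)",
               "downward sweep of all formants; diffuse spectra in bursts"),
    "+voice": ("vocal fold vibration", "low-frequency periodicity"),
}

# A segment is just a bundle of feature values: a modality-neutral long-term
# memory code usable both for perception and for articulation.
SEGMENTS = {
    "u": frozenset({"+syllabic", "+round", "+back", "+high"}),
    "i": frozenset({"+syllabic", "-round", "-back", "+high"}),
    "p": frozenset({"-syllabic", "-voice", "+labial", "-continuant"}),
    "b": frozenset({"-syllabic", "+voice", "+labial", "-continuant"}),
}

def contrast(seg1: str, seg2: str) -> frozenset:
    """Return the feature values distinguishing two stored segments."""
    return SEGMENTS[seg1] ^ SEGMENTS[seg2]

print(contrast("p", "b"))   # {'-voice', '+voice'}: the pair differs only in voicing
```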

Phonetic features come in two kinds: articulator-bound and articulator-free. The articulator-bound features (such as [+round]) can only be executed by a particular muscle group. In contrast, the articulator-free, or "manner," features, which (simplifying somewhat) specify the degree of constriction at the narrowest point in the vocal tract, can be executed by any of several muscles along the vocal tract. Specifying the degree of constriction defines the sonority scale, and thus the major classes of segments: plosives (with complete constriction), fricatives (with constrictions sufficiently narrow to generate turbulent noise), sonorants (including nasals, with little constriction), and glides and vowels (i.e., approximants, with virtually no constriction). Moreover, as noted above, this division suggests a computational technique for calculating R2 and R3: build a set of major-class detectors from R1 representations (Stevens 2002; Juneja and Espy-Wilson 2008). To a crude first approximation, this consists of detectors for quasi-silent intervals (plosives), regions with significant amounts of aperiodicity (fricatives), regions with only one significant resonance (nasals), and regions with general formant structure (approximants, which can then be sub-classified). These definitions are plausibly universal, and all such detectors are also plausibly ecologically useful for non-speech tasks (such as predator or prey detection), and thus should be amenable to investigation with animal models, and are good candidates for hard-wired circuits. Once the major class is detected, dedicated sub-routines particular to the recovered class are invoked to subsequently identify the contemporaneous articulator-bound features. In this way, features such as [+round] may have context-sensitive acoustic definitions, such as diffuse falling spectra in stop bursts, a relatively low spectral zero in nasals, and lowered formants in vowels.
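
The detector idea can be caricatured in a few lines. This sketch is ours; the per-frame acoustic measurements are assumed to be given, and the thresholds are arbitrary placeholders:

```python
# Caricature of articulator-free "major class" detection (PFNA). Inputs are
# per-frame acoustic measurements (assumed given); thresholds are invented.

def classify_frame(energy: float, aperiodicity: float,
                   n_resonances: int) -> str:
    """Map crude acoustic properties of one frame to a PFNA class."""
    if energy < 0.05:                 # quasi-silent interval
        return "P"                    # plosive (closure)
    if aperiodicity > 0.6:            # significant turbulent noise
        return "F"                    # fricative
    if n_resonances == 1:             # a single dominant resonance
        return "N"                    # nasal
    return "A"                        # approximant (full formant structure)

def coarse_code(frames) -> str:
    """Collapse a run of frames into a PFNA skeleton, merging repeats."""
    labels = [classify_frame(*f) for f in frames]
    return "".join(label for i, label in enumerate(labels)
                   if i == 0 or label != labels[i - 1])

# e.g. frames for something like "pa": closure, burst/noise, then a vowel
print(coarse_code([(0.01, 0.2, 3), (0.3, 0.8, 3), (0.9, 0.1, 3)]))  # "PFA"
```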


2.2. Groupings

Even though the individual features are each tracked as a separate stream (like instruments in an orchestra), identification of the streams of phonetic features by themselves is not sufficient to adequately capture the linguistically structured representations. The features must be temporally coordinated, akin to the control exerted by the conductor. Speech-time is quantized into differently-sized chunks of time. There are two critically important chunk-sizes that seem universally instantiated in spoken languages: segments and syllables. Temporal coordination of distinctive features overlapping for relatively brief amounts of time (10-80 ms) comprises segments; longer coordinated movements (100-500 ms) constitute syllabic prosodies. For instance, "we" and "you" differ in the synchronization of [+round]: in "we," rounding coincides with the initial glide; in "you," the rounding is on the vowel; and in "wu" rounding covers both segments. This first aggregation of features must somehow ignore various coarticulation and imprecise articulation effects, which can lead to phantom (excrescent) segments, as can be seen in pronunciations of "else" which rhyme with "welts" (familiar to Tom Lehrer fans). At the syllable level, English displays alternating patterns of weak and strong syllables, a distinction which affects the pronunciation of the segments within the syllables, with weak syllables having reduced articulations along several dimensions. It is possible that groupings of other sizes (morae, feet) are also relevant; certainly linguistic theories postulate menageries of such chunks. We believe that the syllable level may begin to be calculated from the major-class detectors outlined in the previous section; typologically, language syllable structure seems to be almost exclusively characterized by sonority, with the articulator-bound features playing little role in the definition of the constitution of syllables. We hypothesize that the parallel sketch of major-class based syllables and the elaboration of segments via the identification of articulator-bound features offers a potential model for the synthesis of the so-far irreconcilable findings for cohort and neighborhood models of lexical access.

2.3. Predictable changes in pronunciation: phonological processes

Speech is highly variable. One of the goals of distinctive feature theory is to try to identify higher-order invariants in the speech signal that correlate with the presence of particular features, like the [+round] example above (Perkell and Klatt 1986). However, even if we had a perfect theory of phonetic distinctive features, there is variation in the pronunciation of features and segments due to surrounding context, starting with co-articulation effects inherent in the inertial movements of the articulators in the mouth. The net result of these patterned variations in pronunciation is that we are willing to consider disparate pronunciations to be instances of "the same speech sound" because we can attribute the differences in pronunciation to the surrounding context of speech material. A particularly easy way to observe this phenomenon is to consider different forms of the same word which arise through morphological operations like prefixation and suffixation. The 't's in "atom" and "atomic" are not pronounced the same way - "atom" is homophonous with "Adam" for many speakers of American English, whereas "atomic" has a portion homophonous with the name "Tom." In technical parlance, the 't' in "atom" is flapped, whereas the 't' in "atomic" is aspirated. This is by no means an unusual case. Every known language has such contextually determined pronunciations (allophonic variation) that do not affect the meanings of the words, and which, for the purpose of recovering the words, appear to be extra noise for the listener. Even worse, languages pick and choose which features they employ for storing forms in memory. English, for example, considers the difference between [l] and [r], [±lateral], to be contrastive, so that "rip" and "lip" are different words, as are "more" and "mole." Korean, on the other hand, treats this difference as allophonic, a predictable aspect of the position of the segment in the syllable: the word for water is "mul" but the term for freshwater is "muri." That is, for Koreans [l] and [r] are contextual pronunciations of the same sound - they use [r] before vowels and [l] before consonants and at the ends of words.
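
Stated as a rule, the Korean pattern is a small context-dependent mapping from one stored category to two surface forms. The sketch below is illustrative only (segments are single characters, the underlying liquid is written 'L', and the vowel test is deliberately naive):

```python
# Sketch of the Korean liquid allophony described above: /l/ and /r/ are one
# stored category (written "L"), pronounced [r] before vowels and [l]
# elsewhere (before consonants and word-finally).

VOWELS = set("aeiou")

def realize_liquid(segments: list, i: int) -> str:
    """Return the surface form of an underlying liquid at position i."""
    next_is_vowel = i + 1 < len(segments) and segments[i + 1] in VOWELS
    return "r" if next_is_vowel else "l"

def pronounce(underlying: str) -> str:
    segs = list(underlying)
    return "".join(realize_liquid(segs, i) if s == "L" else s
                   for i, s in enumerate(segs))

print(pronounce("muL"))    # "mul"  -- liquid word-finally -> [l]
print(pronounce("muLi"))   # "muri" -- liquid before a vowel -> [r]
```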

Recent MEG studies (Kazanina et al. 2006) have confirmed that listeners do systematically ignore allophonic differences (Sapir 1933). In a mismatch design, Kazanina and colleagues compared the behavioral and neural responses of Russian and Korean speakers to items containing "ta" or "da". The difference in the feature [voice] between "t" and "d" is significant (contrastive) in Russian, as it serves to distinguish pairs of words such as "dom" (house) and "tom" (volume). In Korean, however, this difference is predictable, with "d" occurring only between sonorants, as can be seen in the word "totuk", meaning 'thief', pronounced "toduk" (and spelled that way in the McCune-Reischauer romanization system). In this word the second "t" is pronounced as a "d" because it is flanked by vowels (similar to the English flapping rule). Subjects listened to sequences of items in which one of the two types (for instance "da") was much more frequent (the "standard"); the other item (the "deviant", here "ta") occurred 13% of the time. Neural responses to the items in different roles were compared (i.e., "ta" as standard was compared with "ta" as deviant). Russian speakers showed a reliable difference in their responses to standards and deviants, indicating that they detected the deviant items in a stream of standards. Conversely, Korean speakers showed no differences, suggesting that they form a perceptual equivalence class for "t" and "d", mapping these two sounds onto the same abstract representation.

Similar phonological processes can also change the pronunciation of otherwise contrastive speech sounds. For instance, in English "s" and "z" are contrastive, differing in the feature [voice], as can be seen in the minimal pair "seal" and "zeal." However, the plural ending is pronounced either "s" or "z" depending on the adjacent sound: "cats" and "dogz". English listeners are sensitive to this sequential patterning rule, showing longer reaction times and differences in the neural responses when "s" and "z" are cross-spliced into incongruent positions, *"utz", *"uds" (Hwang et al. submitted). Thus, in later morphological computations, contrastive sounds are organized into higher-level equivalence classes displaying functional identity (such as the plural ending).

Phonological perception thus requires the identification of major class features and articulator-bound features, the coordination of these features into segment-sized units and larger chunks, and the identification of equivalence classes of features and segments at various levels of abstraction, with this stage of processing culminating in the identification of a stored word form, which can then be mapped back out by the motor system in pronunciation.

3. Cortical basis of speech perception: fractionating information flow across a distributed functional anatomy

Historically, the reception of speech is most closely associated with the discoveries by the German neurologist Wernicke. Based on his work, popularized in virtually every textbook since the early 20th century, it was hypothesized that posterior aspects of the left hemisphere, in particular the left superior temporal gyrus (STG), were responsible for the analysis of speech input. Perception (modality-specific) and comprehension (modality-independent) were not distinguished, and so the left temporal lobe has become the canonical speech perception region. Because speech perception dissociates clearly from the auditory perception of non-speech, as well as from more central comprehension processes (see, for example, data from pure word deafness, reviewed in Poeppel 2001; Stefanatos, Gershkoff, and Madigan 2005), the search for a 'speech perception area' is, in principle, reasonable (and the search for such a specialized region has, of course, yielded a rich and controversial research program in the case of face recognition and the fusiform face area). However, data deriving from lesion studies, brain imaging, and intracranial studies have converged on a model in which a distributed functional anatomy is more likely.

The major ingredients in this model are two concurrent processing streams. Early auditory areas, bilaterally, are responsible for creating a high-resolution spectro-temporal representation. In the terminology developed in Section 1, the afferent auditory pathway from the periphery to cortex executes the set of transformations from R1 to R2. In superior temporal cortex, two parallel pathways originate. One pathway, the ventral stream, is primarily involved in the mapping from sound to lexical meaning. We hypothesize that lexical representations per se (the mappings from concepts to lexical-phonological entries) 'reside' in middle temporal gyrus (MTG) (for a recent analysis of lexical processing and MTG based on neurophysiology, see Lau, Phillips, and Poeppel 2008), and the cortical regions that are part of the ventral stream perform the operations that transform acoustic signals into a format that can make contact with these long-term representations.

One crucial cortical region involved in the mapping from sound structure to lexical representation is the superior temporal sulcus (STS). Areas in the STS appear to be executing some of the essential computations that generate the speech primitives. Our examination of both lesion and imaging data suggests that the ventral pathway is bilateral. Importantly, the left and right contributions are overlapping but not identical; for example, the fractionation of the auditory signal into temporal primitives of different granularity (i.e., different time-scales) occurs differentially on the two sides (Boemio et al. 2005; Giraud et al. 2007). In short, the ventral pathway itself can be divided into concurrent processing streams that deal preferentially with different 'linguistic views' of the input signal, ideally mapping directly onto the parallel linguistically motivated representations for segments and syllables.

The second pathway is the dorsal stream. The areas comprising this stream perform the mapping from sensory (or perhaps phonological) representations to articulatory and motor representations (R4). Various parts of the dorsal stream lie in the frontal lobe, including in premotor areas as well as in the inferior frontal gyrus. One critical new region that is motivated - and now identified - by research on this model is area SPT (Sylvian parietal-temporal; see Hickok et al. 2003; Pa and Hickok 2008). Acoustic information is represented in a different coordinate system than articulatory information, and thus the mapping from acoustic to motor requires a coordinate transformation. Moreover, the SPT "sensorimotor interface" provides the substrate for dealing with working memory demands as well. Because the distributed functional anatomy has been described at length elsewhere (Hickok and Poeppel 2000, 2004, 2007), we aim here simply to emphasize two core features of the model that have stimulated the formulation of new hypotheses about the organization of the speech perception system: first, there are two segregated processing streams, each of which has functionally relevant subdivisions; second, the organization of the ventral stream is bilateral, unlike the striking lateralization often observed in language processing. On our view, it is the dorsal stream, principally involved in production, which exhibits cerebral dominance. The ventral stream, on the other hand, is asymmetric but has important contributions from both hemispheres. The model highlights the distributed nature of the cortical fields underlying speech processing. Moreover, the model illustrates the perception-memory-action loop that we have described. The loop at the basis of speech processing 'works' because of the nature of the shared currency that forms the basis of the representation. We contend that this distinctive featural representation is one that permits natural mappings from input to memory to output.

Figure 2. Dual stream model (Hickok and Poeppel 2007). A functional anatomic model of speech perception that hypothesizes speech perception to be executed by two concurrent processing streams. The green box is the starting point, the auditory input. The ventral stream (pink) mediates the mapping from sounds to lexical-semantic representations. The dorsal stream (blue) provides the neuronal infrastructure for the mapping from sound analysis to articulation. For discussion, see Hickok and Poeppel (2000, 2004, 2007). For colour version see separate plate.

4. Analysis by synthesis: an old algorithm, resuscitated

The recurrent theme has been the memory-action-perception (MAP) loop. We pointed out that the necessity of being able to both produce and perceive speech leads to the necessity of coordinate transformations between acoustic-space and articulatory-space, and economy considerations dictate that we look for a memory architecture that would enable and facilitate these transformations. Distinctive features, as developed by Jakobson, Fant and Halle in the 1950s, have such desirable characteristics, with both acoustic and articulatory definitions, and provide a basis for long-term memory representations. These (static) representational considerations are mirrored in the computational algorithms for (dynamic) speech production and perception; the algorithms are similarly intertwined. Such a system was proposed at the beginning of modern speech perception research, in the analysis-by-synthesis approach of MacKay (1951) and Halle and Stevens (1962). Bever and Poeppel (2010) and Poeppel and Monahan (2010) review this approach. Here we have the MAP loop writ both large and small. At each step, speech perception takes the form of a guess at an analysis, the subsequent generation of a predicted output form, and a correction to that guess based on an error signal generated by the comparison of the predicted output form against the incoming signal. The initial hypotheses ('guesses') are generated based on the current state and the smallest bit of input processed (say a 30 ms sample of waveform), and whatever additional information may have been used to predict the signal (the prior, in a Bayesian framework). The initial set of guesses that trigger synthesis will be large, but at each subsequent processing step the set of supported guesses gets smaller; therefore the set of synthesized representations gets smaller, and therefore verification or rejection based on subsequent input gets quicker. Although little empirical data exist to date that test these (old) ideas, recent studies on audiovisual speech perception support analysis by synthesis and illustrate how the amount of prior information modulates (in time) cortical responses to speech sounds (van Wassenhove, Grant, and Poeppel 2005).
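
One step of such a loop can be rendered schematically in a Bayesian idiom. This is our sketch, not the authors' algorithm: synthesize() and similarity() stand in for a real forward model and comparison metric, and candidates are assumed to carry a prior and a predicted signal:

```python
# Schematic analysis-by-synthesis step: generate candidate analyses from the
# prior, synthesize their predicted signals, and rescore the candidates by
# prediction error on the incoming sample. All functions are placeholders.

import math

def synthesize(candidate, t):
    """Forward model: predicted acoustic sample for a candidate at time t."""
    return candidate["predicted"][t]          # stub lookup

def similarity(predicted, observed):
    """Higher when the predicted sample matches the incoming signal."""
    return math.exp(-abs(predicted - observed))

def update(candidates, observed, t):
    """One MAP-loop step: weight each guess by its prediction error."""
    scored = [(c, c["prior"] * similarity(synthesize(c, t), observed))
              for c in candidates]
    total = sum(score for _, score in scored) or 1.0
    # The posterior becomes the prior for the next (roughly 30 ms) sample;
    # poorly supported guesses drop out, so the candidate set shrinks.
    return [dict(c, prior=score / total)
            for c, score in scored if score / total > 0.01]
```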

Figure 3. Analysis by synthesis. The processing stream (from left to right) is assumed to reflect detailed interactions between samples from the input, hypotheses (guesses), synthesized candidates, and error correction. There is a continuous alignment of bottom-up information and top-down regulated synthesis and candidate set narrowing until the target is acquired (Poeppel et al. 2008).

The appeal of the analysis-by-synthesis view is threefold. First, it provides a link to motor theories of perception, albeit at a more abstract level. Motor theories in their most direct form are not well supported empirically (for a recent discussion, see Hickok 2010), but the hypothesis that some of the computations underlying motoric action play a perceptual role is worth exploring. There is an intriguing connection between perception and production, repeatedly observed in many areas of perception, and narrowly perceptual or motoric theories seem not to be successful at capturing the observed phenomena. Second, analysis-by-synthesis for speech links to interesting new work in other domains of perception. For example, new research on visual object recognition supports the idea (Yuille and Kersten 2006), and the concept has been examined in depth in work on sentence comprehension by Bever (Townsend and Bever 2001). There is a close connection between the analysis-by-synthesis program and Bayesian models of perception (a point also made by Hinton and Nair 2005), and thereby also a link to more tenable accounts of what mirror neurons are (Kilner, Friston, and Frith 2007). Third, this approach provides a natural bridge to concepts gaining influence in systems neuroscience. The view that the massive top-down architectural connectivity in cortex forms the basis for generating and testing expectations (at every level of analysis) is gaining credibility, and predictive coding is widely observed using different techniques. In our view, the type of computational infrastructure afforded by the analysis-by-synthesis research program provides a way to investigate speech perception in the service of lexical access in a way that naturally links the computational, algorithmic, and implementational levels advocated by Marr.

Research on the cognitive science and cognitive neuroscience of speech perception seems to us a productive approach to investigate questions about cognition and its neural basis more generally. The psychological models are increasingly detailed and well articulated, and they facilitate a principled investigation of how the brain computes with complex representations. For the development of these important models, we owe a debt of gratitude.


References

Allopenna, Paul D., James S. Magnuson, and Michael K. Tanenhaus. 1998. Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language 38: 419-439.

Altmann, Gerry T. M. (ed.). 1990. Cognitive Models of Speech Processing. Cambridge, MA: MIT Press.

Bandettini, Peter A. 2003. Functional MRI. In Handbook of Neuropsychology, Jordan Grafman and Ian H. Robertson (eds.). The Netherlands: Elsevier.

Bever, Thomas and David Poeppel. in press. Analysis by synthesis: A (re-)emerging program of research for language and vision. Biolinguistics.

Binder, Jeffrey R., Julie A. Frost, T. A. Hammeke, P. S. F. Bellgowan, J. A. Springer, J. N. Kaufman, and E. T. Possing. 2000. Human temporal lobe activation by speech and nonspeech sounds. Cerebral Cortex 10(5): 512-528.

Blumstein, Sheila E., Emily B. Myers, and Jesse Rissman. 2005. The perception of voice onset time: an fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience 17(9): 1353-1366.

Boatman, Dana. 2004. Cortical bases of speech perception: evidence from functional lesion studies. Cognition 92(1-2): 47-65.

Boemio, Anthony, Stephen Fromm, Allen Braun, and David Poeppel. 2005. Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nature Neuroscience 8(3): 389-395.

Burton, Martha W., Steven L. Small, and Sheila E. Blumstein. 2000. The role of segmentation in phonological processing: an fMRI investigation. Journal of Cognitive Neuroscience 12(4): 679-690.

Crone, Nathan E., Dana Boatman, Barry Gordon, and Lei Hao. 2001. Induced electrocorticographic gamma activity during auditory perception (Brazier Award-winning article, 2001). Clinical Neurophysiology 112(4): 565-582.

Cutler, Anne and David A. Fay. 1982. One mental lexicon, phonologically arranged: Comments on Hurford's comments. Linguistic Inquiry 13: 107-113.

D'Ausilio, Alessandro, Friedemann Pulvermüller, Paola Salmas, Ilaria Bufalari, Chiara Begliomini, and Luciano Fadiga. 2009. The motor somatotopy of speech perception. Current Biology 19(5): 381-385.

Dupoux, Emmanuel, Kazuhiko Kakehi, Yuki Hirose, Christophe Pallier, and Jacques Mehler. 1999. Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance 25: 1568-1578.

Eimas, Peter D., Einar R. Siqueland, Peter Jusczyk, and James Vigorito. 1971. Speech perception in infants. Science 171: 303-306.

Engineer, Crystal T., Claudia A. Perez, Ye Ting H. Chen, Ryan S. Carraway, Amanda C. Reed, Jai A. Shetake, Vikram Jakkamsetti, Kevin Q. Chang, and Michael P. Kilgard. 2008. Cortical activity patterns predict speech discrimination ability. Nature Neuroscience 11(5): 603-608.

Flege, James E. and James M. Hillenbrand. 1986. Differential use of temporal cues to the /s/-/z/ contrast by native and non-native speakers of English. Journal of the Acoustical Society of America 79(2): 508-517.

Friedrich, Claudia K., Carsten Eulitz, and Aditi Lahiri. 2006. Not every pseudoword disrupts word recognition: an ERP study. Behavioral and Brain Functions 2: 36.

Gagnepain, Pierre, Gael Chetelat, Brigitte Landeau, Jacques Dayan, Francis Eustache, and Karine Lebreton. 2008. Spoken word memory traces within the human auditory cortex revealed by repetition priming and functional magnetic resonance imaging. Journal of Neuroscience 28(20): 5281-5289.

Gaskell, Gareth and William D. Marslen-Wilson. 2002. Representation and competition in the perception of spoken words. Cognitive Psychology 45(2): 220-266.

Giraud, Anne-Lise, Andreas Kleinschmidt, David Poeppel, Torben E. Lund, Richard S. J. Frackowiak, and Helmut Laufs. 2007. Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron 56(6): 1127-1134.

Halle, Morris. 2002. From Memory to Speech and Back. Berlin: Mouton de Gruyter.

Halle, Morris and Kenneth N. Stevens. 1962. Speech recognition: A model and a program for research. IRE Transactions on Information Theory.

Harnsberger, James D. 2000. A cross-language study of the identification of non-native nasal consonants varying in place of articulation. Journal of the Acoustical Society of America 108(2): 764-783.

Hickok, Gregory. 2010. The role of mirror neurons in speech perception and action word semantics. Language and Cognitive Processes.

Hickok, Gregory, Brad Buchsbaum, Colin Humphries, and Tugan Muftuler. 2003. Auditory-motor interaction revealed by fMRI: speech, music, and working memory in area Spt. Journal of Cognitive Neuroscience 15(5): 673-682.

Hickok, Gregory and David Poeppel. 2000. Towards a functional neuroanatomy of speech perception. Trends in Cognitive Sciences 4(4): 131-138.

Hickok, Gregory and David Poeppel. 2004. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92(1-2): 67-99.

Hickok, Gregory and David Poeppel. 2007. The cortical organization of speech processing. Nature Reviews Neuroscience 8(5): 393-402.

Hinton, Geoffrey and Vinod Nair. 2005. Inferring motor programs from images of handwritten digits. Proceedings of NIPS 2005.

Hwang, So-One, Phillip J. Monahan, William Idsardi, and David Poeppel. submitted. The perceptual consequences of voicing mismatch in obstruent consonant clusters.

Jakobson, Roman, Gunnar Fant, and Morris Halle. 1952. Preliminaries to Speech Analysis. Cambridge, MA: MIT Press.

Juneja, Amit and Carol Espy-Wilson. 2008. A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition. Journal of the Acoustical Society of America 123(2): 1154-1168.

Kabak, Baris and William Idsardi. 2007. Perceptual distortions in the adaptation of English consonant clusters: Syllable structure or consonantal contact constraints? Language and Speech 50: 23-52.

Kazanina, Nina, Colin Phillips, and William Idsardi. 2006. The influence of meaning on the perception of speech sounds. Proceedings of the National Academy of Sciences USA 103(30): 11381-11386.

Kilner, J. M., Karl J. Friston, and C. D. Frith. 2007. The mirror-neuron system: A Bayesian perspective. NeuroReport 18(6): 619-623.

Kingston, John. 2003. Learning foreign vowels. Language and Speech 46: 295-349.

Klatt, Dennis. 1979. Speech perception: A model of acoustic phonetic analysis and lexical access. Journal of Phonetics 7: 279-342.

Klatt, Dennis. 1989. Review of selected models of speech perception. In Lexical Representation and Process, William Marslen-Wilson (ed.), 169-226. Cambridge, MA: MIT Press.

Kuhl, Patricia, Barbara T. Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson. 2007. Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B 363: 979-1000.

Lau, Ellen, Colin Phillips, and David Poeppel. 2008. A cortical network for semantics: (de)constructing the N400. Nature Reviews Neuroscience 9: 920-933.

Liberman, Alvin M. 1996. Speech: A Special Code. Cambridge, MA: MIT Press.

Liebenthal, Einat, Jeffrey R. Binder, Stephanie M. Spitzer, Edward T. Possing, and David A. Medler. 2005. Neural substrates of phonemic perception. Cerebral Cortex 15(10): 1621-1631.

Logothetis, Nikos K. 2008. What we can do and what we cannot do with fMRI. Nature 453(7197): 869-878.

Luce, Paul A. and David B. Pisoni. 1998. Recognizing spoken words: the neighborhood activation model. Ear and Hearing 19(1): 1-36.

Luo, Huan and David Poeppel. 2007. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54(6): 1001-1010.

MacKay, Donald M. 1951. Mindlike behavior in artefacts. British Journal for the Philosophy of Science 2: 105-121.

Marslen-Wilson, William (ed.). 1989. Lexical Representation and Process. Cambridge, MA: MIT Press.

McClelland, James L. and Jeff Elman. 1986. The TRACE model of speech perception. Cognitive Psychology 18: 1-86.

Meyer, Martin, Stefan Zysset, Yves D. von Cramon, and Kai Alter. 2005. Distinct fMRI responses to laughter, speech, and sounds along the human peri-sylvian cortex. Cognitive Brain Research 24(2): 291-306.

Mielke, Jeff. 2007. The Emergence of Distinctive Features. Oxford: Oxford University Press.

Misiurski, Cara, Sheila E. Blumstein, Jesse Rissman, and Daniel Berman. 2005. The role of lexical competition and acoustic-phonetic structure in lexical processing: evidence from normal subjects and aphasic patients. Brain and Language 93(1): 64-78.

Raettig, Tim and Sonja A. Kotz. 2008. Auditory processing of different types of pseudo-words: an event-related fMRI study. Neuroimage 39(3): 1420-1428.

Raizada, Rajiv D. and Russell A. Poldrack. 2007. Selective amplification of stimulus differences during categorical processing of speech. Neuron 56(4): 726-740.

Rauschecker, Joseph P., Biao Tian, and Marc Hauser. 1995. Processing of complex sounds in the macaque nonprimary auditory cortex. Science 268(5207): 111-114.

Sapir, Edward. 1933. La réalité psychologique des phonèmes. Journal de Psychologie Normale et Pathologique. Reprinted as "The psychological reality of phonemes" in D. Mandelbaum (ed.), Selected Writings in Language, Culture and Personality. Berkeley: University of California Press.

Schroeder, Charles E., Peter Lakatos, Yoshinao Kajikawa, Sarah Partan, and Aina Puce. 2008. Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences 12(3): 106-113.

Scott, Sophie K., C. Catrin Blank, Stuart Rosen, and Richard J. S. Wise. 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123 (Pt 12): 2400-2406.

Scott, Sophie K. and Richard J. S. Wise. 2004. The functional neuroanatomy of prelexical processing in speech perception. Cognition 92(1-2): 13-45.

Stefanatos, Gerry A., Arthur Gershkoff, and Sean Madigan. 2005. On pure word deafness, temporal processing, and the left hemisphere. Journal of the International Neuropsychological Society

11 :456-470Steinschneider, Mitchell, Charles E.. Schroeder, Joseph C. Arezzo, and Herbert G.

Vaughan Jf1994 Speech-evoked activity in primary auditory cortex: effects of voice

onset time Electroencephalography and Clinical Neurophysiology

92(1): 30-43Stevens, Kenneth N ..

1998 Acoustic Phonetics Cambridge MA: MIT Press.2002 Toward a model for lexical access based on acoustic landmarks and

distinctive features Journal oj the Acoustical Society ojAmerica III

(4): 1872-1891 ..Townsend, David J, and Thomas G.. Bever

2001 Sentence Comprehension. The Integration of Habits and RulesCambridge: MIT Press

b

194 David Poeppel and William Idsardi

Myers, Emily B. and Sheila E. Blumstein2008 The neural bases of the lexical effect: anfMRI investigation.

bral Cortex 18(2): 278-88.Naatanen, Risto, Anne Lehtokoski, Mietta Lennes, Made Cheour, Minna Huotilai~

nen, Antti Iivonen, Martti Vainio, Paavo Alku, Risto 1.. I1nlorlielni.Aavo Luuk, Juri AIlik, Janne Sinkkonen, and Kimmo Alho

1997 Language-specific phoneme representations revealed by electricmagnetic brain responses. Nature 385(6615): 432-4 ..

Obleser, Jonas, 1.. Zimmermann, John Van Meter and JosefP Rauschecker2007 Multiple stages of auditory speech perception reflected in

related FMRI.. Cerebral Cortex 17(10): 2251-7.Pa, Judy and Gregory Hickok

2008 A parietal-temporal sensory-motor integration area for thevocal tract: evidence fiom an fMRI study of skilled musicians.. Neu­ropsychologia 46(1): 362-8.

Perkell, Joseph and Dennis Klatt1986 Invariance and Variability in Speech Processes. Hillsdale NJ: Erl­

baum.Poeppel, David

2001 Pure word deafness and the bilateral processing of the speech code.Cognitive Science 21 (5): 679-693.

2003 The analysis of speech in different temporal integration windows:cerebral lateralization as 'asymmetric sampling in time' SpeechCommunication 41: 245-255.

Poeppel, David and Martin Hackl2008 The architecture of speech perception. In 1. Pomerantz (Ed.), Topics

in Integrative Neuroscience From Cells to Cognition, CambridgeUniversity Press

Poeppel, David and Phillip 1. Monahan2008 Speech perception: Cognitive foundations and cortical implementa­

tion .. Current Directions in Psychological Science 17(2)..2010 Feedforward and Feedback in Speech Perception: Revisiting Analy­

sis-by-Synthesis. Language and Cognitive Processes ..Poeppel, David, William 1. ldsardi, and Virginie van Wassenhove

2008 Speech perception at the interface of neurobiology and linguistics.Philosophical Transactions oj the Royal Society ofLondon B Biolog­ical Sciences 363(1493): 1071-86 ..

Prabhakaran, Ranjani, Sheila E Blumstein, Emily R Myers, Emmette Hutchison,and Brendan Britton

2006 An event-related fMRI investigation of phonological-lexical compe­tition Neuropsychologia 44(12): 2209-21.

Recognizing words from speech 195

...

196 David Poeppel and William Idsardi

Utman, Jennifer Aydelott, Sheila E. Blumstein, and Kelly Sullivan2001 Mapping from sound to meaning: reduced lexical activation in

ca's aphasics. Brain and Language 79(3): 444-72.van Wassenhove, Virginie, Kenneth W. Grant, and David Poeppel

2005 Visual speech speeds up the neural processing of auditoryProceedings ofthe National Academy ojSciences USA 102(4): 16.

Young, Eric D.2008 Neural representation of spectral and temporal information

speech.. Philosophical Transactions ofthe Royal Society ojLondonBiological Sciences 363(1493): 923-45.

Yuille, Alan and Daniel Kersten2006 Vision as Bayesian inference: analysis by synthesis? Trends in Cog~

nitive Sciences 10(7): 301-8.

Brain structures underlying lexical processing of speech: Evidence from brain imaging

Matthew H. Davis and Jennifer M. Rodd

1. The neural foundations of lexical processing of speech

A mental lexicon that links the form of spoken words to their associated meanings and syntactic functions has long been seen as central to speech comprehension. As should be apparent from other papers in this volume, William Marslen-Wilson's experimental and theoretical contributions have been supremely influential in guiding research on the psychological and computational properties of the lexicon. With recent developments in functional brain imaging, methods now exist to map these processes onto neuroanatomical pathways in the human brain.

In the current paper we will argue that theoretical proposals made in various iterations of the Cohort account remain just as relevant to this neuroscientific endeavour as they were for previous generations of researchers in the psychological and cognitive sciences. Here we will review recent brain imaging work on lexical processing in the light of these theoretical principles.

Accounts of the neural processes underlying spoken word recognition have converged on the proposal that brain regions centred on the superior temporal gyrus (STG) are critical for pre-lexical processing of spoken words. That is, this region is engaged in the acoustic-phonetic processes that provide the input for later lexical and semantic analysis of the speech signal. In the neuropsychological literature, cases of pure word deafness (an isolated impairment of speech perception in the presence of intact linguistic skills in other modalities and intact perception of non-speech sounds) are typically observed following bilateral lesions to these superior temporal regions (Saffran, Marin, and Komshian 1976; Stefanatos, Gershkoff, and Madigan 2005). Functional imaging studies that compare neural responses to speech sounds and acoustically matched sounds that do not evoke a speech percept reveal differential responses in superior temporal gyrus regions that surround but do not include primary auditory cortex (Davis and Johnsrude 2003; Scott et al. 2000; Uppenkamp et al. 2006; Vouloumanos et al. 2001). Although activation for speech greater than non-speech is often seen bilaterally, those studies that focus on phonological aspects of speech processing
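The subtraction logic behind such speech-versus-non-speech contrasts can be made concrete with a minimal sketch. The Python fragment below is illustrative only and is not drawn from any of the studies cited here: the subject and voxel counts, the effect sizes, and the uncorrected significance threshold are all assumptions, and a real analysis would first fit a within-subject general linear model and correct for multiple comparisons.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_voxels = 16, 1000  # hypothetical dimensions

# Hypothetical per-subject condition estimates (e.g., GLM betas) for
# speech and acoustically matched non-speech stimuli.
beta_speech = rng.normal(0.5, 1.0, (n_subjects, n_voxels))
beta_nonspeech = rng.normal(0.0, 1.0, (n_subjects, n_voxels))

# Voxelwise paired t-test: where does speech exceed matched non-speech?
t, p = stats.ttest_rel(beta_speech, beta_nonspeech, axis=0)
speech_selective = (t > 0) & (p < 0.001)  # uncorrected threshold, sketch only

print(f"{speech_selective.sum()} of {n_voxels} voxels show speech > non-speech")

The design point the sketch illustrates is that "activation for speech" in such studies is always a relative quantity, defined by the contrast with an acoustically matched baseline rather than by the raw response to speech itself.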