
Differential Representation of Articulatory Gestures and Phonemes in Precentral and Inferior Frontal Gyri

Emily M. Mugler1, Matthew C. Tate2, Karen Livescu3, Jessica W. Templer1, Matthew A. Goldrick4, and Marc W. Slutzky1,5,6

Departments of Neurology1, Neurosurgery2, Linguistics4, Physiology5, and Physical Medicine & Rehabilitation6, Northwestern University, Chicago, IL 60611
Toyota Technological Institute3 at Chicago, Chicago, IL 60637

Abbreviated title: Cortical Representation of Gestures and Phonemes

Submitting and Corresponding author:
Marc W. Slutzky, MD, PhD
Northwestern University, Dept. of Neurology
303 E. Superior Ave.; Lurie 8-121
Chicago, IL 60611
[email protected]
312-503-4653 (ph)

Number of pages: 46
Number of figures, tables, multimedia and 3D models (separately): Figures: 5; Tables: none; Multimedia: none; 3D models: none
Number of words for Abstract, Introduction, and Discussion (separately): Abstract: 165; Introduction: 649; Discussion: 1,496


    Conflict of Interest 24

    The authors declare no competing financial interests. 25


    Acknowledgements 27

    We thank Robert D. Flint, Griffin Milsap, Weiran Wang, our EEG technologists, and our 28

    participants. This work was supported in part by the Doris Duke Charitable Foundation (Clinical 29

    Scientist Development Award, grant #2011039), a Northwestern Memorial Foundation Dixon 30

    Translational Research Award (including partial funding from NIH National Center for 31

    Advancing Translational Sciences, UL1TR000150 and UL1TR001422), NIH grants 32

    F32DC015708 and R01NS094748, and NSF grant #1321015. 33


    ABSTRACT 36

    Speech is a critical form of human communication and is central to our daily lives. Yet, despite 37

    decades of study, an understanding of the fundamental neural control of speech production remains 38

    incomplete. Current theories model speech production as a hierarchy from sentences and phrases 39

    down to words, syllables, speech sounds (phonemes) and the actions of vocal tract articulators used 40

    to produce speech sounds (articulatory gestures). Here, we investigate the cortical representation of 41

    articulatory gestures and phonemes in ventral precentral and inferior frontal gyri in men and women. 42

    Our results indicate that ventral precentral cortex represents gestures to a greater extent than 43

    phonemes, while inferior frontal cortex represents both gestures and phonemes. These findings 44

    suggest that speech production shares a common cortical representation with that of other types of 45

    movement, such as arm and hand movements. This has important implications both for our 46

    understanding of speech production and for the design of brain machine interfaces to restore 47

    communication to people who cannot speak. 48


    SIGNIFICANCE STATEMENT 50

    Despite being studied for decades, the production of speech by the brain is not fully understood. In 51

particular, the most elemental parts of speech, speech sounds (phonemes) and the movements of vocal tract articulators used to produce these sounds (articulatory gestures), have both been hypothesized to be encoded in motor cortex. Using direct cortical recordings, we found evidence that

    primary motor and premotor cortices represent gestures to a greater extent than phonemes. Inferior 55

    frontal cortex (part of Broca’s area) appears to represent both gestures and phonemes. These findings 56

    suggest that speech production shares a similar cortical organizational structure with movement of 57

    other body parts. 58


    INTRODUCTION 60

    Speech is composed of individual sounds, called segments or (hereafter) phonemes (Bakovic, 61

    2014), that are produced by coordinated movements of the vocal tract (e.g., lips, tongue, velum, and 62

    larynx). However, it is not certain exactly how these movements are planned. For example, during 63

    speech planning, phonemes are coarticulated—the vocal tract actions (constrictions or releases), or 64

    articulatory gestures, that comprise a given phoneme change based on neighboring phonemes in the 65

    uttered word or phrase (Whalen, 1990). While the dynamic properties of these gestures, which are 66

    similar to articulator kinematics, have been extensively studied (Westbury, 1990; Nam et al., 2010; 67

    Bocquelet et al., 2016; Bouchard et al., 2016; Carey and McGettigan, 2016), there is no direct 68

    evidence of gestural representations in the brain. 69

Recent models of speech production propose that articulatory gestures combine to create acoustic outputs, i.e., phonemes and phoneme groupings such as syllables (Browman and Goldstein, 1992; Guenther et al., 2006). Guenther et al. (2006) hypothesized that ventral premotor cortex (PMv) and inferior frontal gyrus (IFG, part of Broca's area) preferentially represent (groupings of) phonemes and that ventral motor cortex (M1v) preferentially represents gestures. This

    hypothesis is analogous to limb motor control, in which premotor cortices preferentially encode 75

    reach targets and M1 encodes reaching details (Hocherman and Wise, 1991; Shen and Alexander, 76

    1997; Hatsopoulos et al., 2004; Pesaran et al., 2006). However, the model’s hypothesized 77

    localizations of speech motor control were based on indirect evidence from behavioral studies 78

    (Ballard et al., 2000), non-speech articulator movements (Penfield and Roberts, 1959; Fesl et al., 79

    2003) and fMRI studies of syllables (Riecker et al., 2000; Guenther et al., 2006; Ghosh et al., 2008; 80

Tourville et al., 2008). None of the modalities used in these studies had a sufficient combination of


    temporal and spatial resolution to provide definitive information about where and how gestures and 82

    phonemes are encoded. 83

    Electrocorticography (ECoG) has enabled identification of neural activity with high spatial and 84

    temporal resolution during speech production (Kellis et al., 2010; Pei et al., 2011b; Bouchard et al., 85

    2013; Mugler et al., 2014b; Slutzky 2018). High gamma activity (70-200 Hz) in ECoG from ventral 86

    precentral gyrus (PCG, encompassing M1v and PMv) corroborated Penfield’s original somatotopic 87

    mappings of the articulators (Penfield and Boldrey, 1937) and roughly correlated with phoneme 88

    production (Bouchard et al., 2013; Lotte et al., 2015; Ramsey et al., 2017), as well as the manner and 89

    place of articulation (Bouchard et al., 2013; Lotte et al., 2015). Mugler et al. demonstrated that single 90

    instances of phonemes can be identified during word production using ECoG from PCG (Mugler et 91

    al., 2014b). However, the ability to decode phonemes from these areas was rather limited, which 92

    suggests that phonemes may not optimally characterize the representation of these cortical areas. 93

    Some evidence exists that cortical activations producing phonemes differ depending on the context of 94

    neighboring phonemes (Bouchard and Chang, 2014; Mugler et al., 2014a). Moreover, incorporating 95

    probabilistic information of neighboring phonemes improves the ability to decode phonemes from 96

    PCG (Herff et al., 2015). Therefore, these areas might demonstrate predominant representation for 97

    gestures, not phonemes. However, no direct evidence of gestural representation in the brain has yet 98

    been demonstrated. 99

    Here, we used ECoG from PCG and IFG to classify phonemes and gestures during spoken word 100

    production. We hypothesized that posterior PCG (approximate M1v) represents the movements, and 101

    hence gestures, of speech articulators. We first examined the ability to determine the positions of 102

    phonemes and gestures within words using ECoG. We next compared the relative performances of 103

    gesture and phoneme classification in each cortical area. Finally, we used a special case of 104


    contextual variance — allophones, in which the same phoneme is produced with different 105

    combinations of gestures — to highlight more distinctly the gestural vs. phonemic predominance in 106

    each area. The results indicate that gestures are the predominant elemental unit of speech production 107

    represented in PCG, while both phonemes and gestures appear to be more weakly represented in IFG, 108

    with gestures still slightly more predominant. 109


    MATERIALS AND METHODS 111

    Subject Pool 112

    Nine adults (mean age 42, 5 female) who required intraoperative ECoG monitoring during 113

    awake craniotomies for glioma removal volunteered to participate in a research protocol during 114

    surgery. We excluded subjects with tumor-related symptoms affecting speech production (as 115

    determined by neuropsychological assessment), and non-native English speakers, from the study. 116

    All tumors were located at least two gyri (~2-3 cm) away from the recording electrodes. As per 117

    standard of care, subjects were first anesthetized with low doses of propofol and remifentanil, 118

    then awoken for direct cortical stimulation mapping. All experiments were performed after 119

    cortical stimulation, hence during experiments, no general anesthesia had been administered for 120

    at least 45 min; no effects on speech articulation were detected. Subjects provided informed 121

    consent for research, and the Institutional Review Board at Northwestern University approved 122

    the experimental protocols. 123

    Electrode grid placement was determined using both anatomical landmarks and functional 124

    responses to direct cortical stimulation. Electrode grids were placed to ensure coverage of areas 125

    that produced movements of the articulators when stimulated. ECoG grid placement varied 126

    slightly with anatomy but consistently covered targeted areas of ventral posterior precentral 127


    gyrus (pPCG, the posterior half of the gyrus, roughly equivalent to M1v), ventral anterior PCG 128

    (aPCG, the anterior half of the gyrus, roughly equivalent to PMv), and inferior frontal gyrus pars 129

    opercularis (IFG), usually aligning along the Sylvian fissure ventrally. We defined our locations 130

    purely by anatomy to be conservative, since it was impossible to define them functionally in 131

    vivo, but with the intention of estimating M1v and PMv. We confirmed grid location with 132

    stereotactic procedure planning, anatomical mapping software (Brainlab), and intraoperative 133

    photography (Hermes et al., 2010). 134


    Data Acquisition 136

    A 64-electrode, 8x8 ECoG grid (Integra, 4-mm spacing) was placed over the cortex and 137

    connected to a Neuroport data acquisition system (Blackrock Microsystems, Inc.). Both stimulus 138

    presentation and data acquisition were facilitated through a quad-core computer running a 139

    customized version of BCI2000 software (Schalk et al., 2004). Acoustic energy from speech was 140

    measured with a unidirectional lapel microphone (Sennheiser) placed near the patient’s mouth. 141

    Microphone signal was wirelessly transmitted directly to the recording computer (Califone), 142

    sampled at 48 kHz, and synchronized to the neural signal recording. 143

    All ECoG signals were bandpass-filtered from 0.5-300 Hz and sampled at 2 kHz. Differential 144

    cortical recordings compared to a reference ECoG electrode were exported for analysis with an 145

    applied bandpass filter (0.53 - 300 Hz) with 75 µV sensitivity. Based on intraoperative 146

    photographs and Brainlab reconstructions of array coordinates, electrodes in the posterior and 147

    anterior halves of the precentral gyrus were assigned to pPCG and aPCG, respectively, while 148

    those anterior to the precentral sulcus and ventral to the middle frontal sulcus were assigned to 149

    IFG. Data will be made available upon request to the senior author. 150
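For reference, an equivalent offline band-pass stage could be implemented as in the following Python/SciPy sketch; the filter order and zero-phase application are assumptions, since only the passband (0.53-300 Hz) is specified above.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    FS = 2000  # Hz, ECoG sampling rate

    def export_bandpass(ecog, low_hz=0.53, high_hz=300.0, order=4, fs=FS):
        """Zero-phase band-pass comparable to the 0.53-300 Hz export filter described above.
        ecog: (n_samples, n_electrodes) array of referenced ECoG."""
        sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
        return sosfiltfilt(sos, ecog, axis=0)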


    Experimental Protocol 152

    We presented words in randomized order on a screen at a rate of 1 every 2 seconds, in blocks 153

    of 4.5 minutes. Subjects were instructed to read each word aloud as soon as it appeared. Subjects 154

    were surveyed regarding accent and language history, and all subjects included here were native 155

    English speakers. All subjects completed at least 2 blocks, and up to 3 blocks. 156

    All word sets consisted of monosyllabic words and varied depending on subject and 157

    anatomical grid coverage. Stimulus words were chosen for their simple phonological structure, 158

    phoneme frequency, and phoneme variety. Many words in the set were selected from the 159

    Modified Rhyme Test (MRT), consisting of monosyllabic words with primarily consonant-160

    vowel-consonant (CVC) structure (House et al., 1963). The frequency of phonemes within the 161

    MRT set roughly approximates the phonemic frequency in American English (Mines et al., 162

1978). The MRT was then supplemented with additional CVC words to incorporate all General American English phonemes into the word set with a more uniform phoneme incidence. Mean

    word duration was 520 ms. Consonant cluster allophone words contained initial stop consonants; 165

    each allophone example included a voiced, a voiceless, and a consonant cluster allophone word 166

(for example, “bat”, “pat”, and “spat”; Buchwald and Miozzo, 2011).
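As an illustration of the stimulus schedule, the sketch below builds randomized word sequences at the stated rate (one word every 2 s, 4.5-minute blocks); the word list here is a hypothetical stand-in containing only the allophone triplet named above plus placeholder CVC items.

    import random

    # Hypothetical word list: the allophone triplet from the text plus placeholder CVC items.
    WORDS = ["bat", "pat", "spat", "tab", "cat", "man", "fan", "dot"]

    def build_blocks(words=WORDS, n_blocks=3, block_s=270.0, isi_s=2.0, seed=0):
        """Return randomized word sequences, one per block (135 presentations per 4.5-min block)."""
        rng = random.Random(seed)
        n_per_block = int(block_s / isi_s)
        blocks = []
        for _ in range(n_blocks):
            seq = []
            while len(seq) < n_per_block:
                pool = list(words)
                rng.shuffle(pool)
                seq.extend(pool)
            blocks.append(seq[:n_per_block])
        return blocks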


    Signal Processing 169

    We examined normalized activity in the high gamma (70-290 Hz) band, since this band is 170

    highly informative about limb motor (Crone et al., 2001; Mehring et al., 2004; Chao et al., 2010; 171

    Flint et al., 2012a; Flint et al., 2012b; Flint et al., 2017), speech (Crone et al., 2001; Pei et al., 172

    2011a; Bouchard et al., 2013; Ramsey et al., 2017), and somatosensory activity (Ray et al., 173


    2008), and correlates with ensemble spiking activity (Ray and Maunsell, 2011) and blood 174

    oxygenation level dependent (BOLD) activity (Logothetis et al., 2001; Hermes et al., 2012). 175

    ECoG signals were first re-referenced to a common average of all electrodes in the time domain. 176

    We used the Hilbert transform to isolate band power in 8 linearly distributed 20-Hz wide sub-177

    bands within the high gamma band that avoided the 60 Hz noise harmonics and averaged them to 178

    obtain the high gamma power. We then normalized each electrode’s high gamma band power 179

    changes, by subtracting the median and dividing by the interquartile range, to create frequency 180

    features for each electrode. 181
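A minimal Python sketch of this feature extraction follows; the specific 20-Hz sub-band edges, the filter design, and the Hilbert-amplitude implementation are assumptions, since only the general procedure is described above.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    FS = 2000  # Hz, ECoG sampling rate

    # Assumed set of eight 20-Hz sub-bands within high gamma that skips the
    # 60-Hz line-noise harmonics at 120, 180, and 240 Hz.
    SUB_BANDS = [(70, 90), (90, 110), (130, 150), (150, 170),
                 (190, 210), (210, 230), (250, 270), (270, 290)]

    def high_gamma_features(ecog, fs=FS, bands=SUB_BANDS):
        """ecog: (n_samples, n_electrodes). Returns normalized high gamma power."""
        # Common-average re-reference in the time domain.
        ecog = ecog - ecog.mean(axis=1, keepdims=True)

        powers = []
        for lo, hi in bands:
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="bandpass")
            band = filtfilt(b, a, ecog, axis=0)
            powers.append(np.abs(hilbert(band, axis=0)) ** 2)  # analytic amplitude -> power
        hg = np.mean(powers, axis=0)  # average across sub-bands

        # Robust per-electrode normalization: subtract median, divide by interquartile range.
        med = np.median(hg, axis=0)
        iqr = np.percentile(hg, 75, axis=0) - np.percentile(hg, 25, axis=0)
        return (hg - med) / iqr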

    To create features in the time domain, we segmented normalized high gamma values for each 182

    electrode in 50-ms time bins from 300 ms prior to and 300 ms after onset of each event 183

    (phoneme or gesture). This was far enough in advance of event onset to capture most relevant 184

    information in IFG, which starts about 300 ms before word onset (Flinker et al., 2015). This 185

    created discrete, event-based trials that summarized the time-varying neural signal directly 186

    preceding and throughout production of each phoneme or gesture. Time windows for allophone 187

    feature creation were shorter (-300 ms to 100 ms) to further reduce the effect of coarticulation on 188

    the allophone classification results. The phonemes that were classified in allophone analysis (/p/, 189

    /b/, /t/, /d/, /k/, and /ɡ/) were all plosives (stop consonants) and all had durations of less than 100 190

    ms, so we were able to use this shorter window without losing information about the phonemes. 191

    This is in contrast to the direct classification of phonemes and gestures, which included 192

    phonemes such as /m/ and /n/ that were longer in duration; hence we used activity up to 300 ms 193

    after onset to capture this information. 194
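A sketch of this event-based segmentation is given below; it assumes the normalized high gamma array from the previous sketch and event onset times already converted to sample indices.

    import numpy as np

    def event_features(hg, onset_samples, window_ms=(-300, 300), fs=2000, bin_ms=50):
        """Average normalized high gamma in consecutive 50-ms bins around each event onset.
        hg: (n_samples, n_electrodes); returns (n_events, n_bins * n_electrodes)."""
        bin_len = int(fs * bin_ms / 1000)
        start = int(fs * window_ms[0] / 1000)
        stop = int(fs * window_ms[1] / 1000)
        n_bins = (stop - start) // bin_len

        trials = []
        for onset in onset_samples:
            seg = hg[onset + start: onset + stop]                # (samples, electrodes)
            seg = seg.reshape(n_bins, bin_len, -1).mean(axis=1)  # (bins, electrodes)
            trials.append(seg.ravel())
        return np.array(trials)

    # Phoneme/gesture classification used (-300, +300) ms; allophone analysis used (-300, +100) ms:
    # X_allo = event_features(hg, onsets, window_ms=(-300, 100))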


    Phoneme and gesture labeling 196


    Following standard practices, we used visual and auditory inspection of auditory spectral 197

    changes to manually label the onset of each phoneme in the speech signal (Mugler et al., 2014b). 198

    For plosives, phoneme onset was marked by acoustic release. For fricatives, phoneme onset was 199

    marked by the onset of aperiodic noise. For sonorants and vowels, onset was marked by changes 200

    to spectral properties. To label gesture onset times, acoustic-articulatory inversion was used on 201

    the audio recordings of subjects. This technique maps articulator trajectories from acoustic data, 202

    using a model that accounts for subject- and utterance-specific differences in production. We 203

    used an acoustic articulatory inversion (AAI) model, described in (Wang et al., 2015), based on a 204

    deep neural network trained on data from the University of Wisconsin X-ray Microbeam corpus 205

    (Westbury et al., 1990), with missing articulatory data filled in using the data imputation model 206

of Wang et al. (2014). This model performed well in predicting articulator positions in data from the

    corpus that were not used in training (i.e., in cross-validation), with a root-mean squared error of 208

    only 1.96 mm averaged over all articulators. This error was smaller than that reported in similar 209

    studies, including a study that used AAI to then drive a speech synthesizer, in which an error of 210

    2.5 mm still produced synthesized speech that was recognizable a high percentage of the time 211

    (Bocquelet et al., 2016). Moreover, we simulated this error by adding Gaussian noise with mean 212

    of 0 and standard deviation of 1.96 to the position and velocity estimates from AAI and 213

    computed the error in gestural time estimates in 2 subjects. We found that this amount of noise 214

    translated to an error of 5.2±9.8 ms (mean±SD) in time, which was far smaller than our time bins 215

    used for decoding. While there could be some discrepancies in applying this model to patients in 216

    an operating room, possibly with dry mouths, lying on their side, even an error of 5 mm per 217

    articulator translated in simulation to errors of only 5.3±13 ms in timing. Even if there were 218

    errors bigger than this, the resulting errors in timing would bias the decoding performance results 219


    to be poorer for gestures, rather than better. Thus, any discrepancies in gestural timing due to the 220

    limits of AAI would not affect our results. 221
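The noise simulation described above can be sketched as follows; `detect_gesture_onsets` stands for an onset-detection routine such as the one sketched after the next paragraph, and pairing clean and noisy onsets by index is a simplification.

    import numpy as np

    AAI_RMSE_MM = 1.96  # cross-validated articulator position error reported above
    rng = np.random.default_rng(0)

    def onset_timing_error(position, velocity, fs, detect_gesture_onsets, n_sims=100):
        """Propagate Gaussian AAI error (SD 1.96 mm) into gesture onset-time error (ms)."""
        clean = np.asarray(detect_gesture_onsets(position, velocity, fs))
        errors = []
        for _ in range(n_sims):
            noisy_pos = position + rng.normal(0.0, AAI_RMSE_MM, position.shape)
            noisy_vel = velocity + rng.normal(0.0, AAI_RMSE_MM, velocity.shape)
            noisy = np.asarray(detect_gesture_onsets(noisy_pos, noisy_vel, fs))
            n = min(len(clean), len(noisy))
            errors.extend(1000.0 * np.abs(noisy[:n] - clean[:n]))  # seconds -> ms
        return np.mean(errors), np.std(errors)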

    We used AAI to generate articulator positions of the lips, tongue tip, and tongue body at a 222

    time resolution of 10 ms (Figure 1). The lip aperture was defined as the Euclidean combination 223

    of vertical and horizontal positions in the sagittal plane, and tongue apertures were defined using 224

    vertical position. Position trajectories were smoothed with a Gaussian kernel of 50 ms. The 225

    onsets of each gesture (closure, critical closure, and release) were defined from the position and 226

    velocity traces as in (Pouplier and Goldstein, 2010). In brief, gesture onset time was defined as 227

    the moment the articulator had surpassed 20% of the difference between minimum velocity 228

    preceding movement and maximum velocity during gesture formation. For plosives, onset of 229

    gesture release (e.g. tongue tip release) was set to phoneme onset time. Since AAI does not 230

    provide laryngeal or velar information, the Task Dynamic (TADA) model of inter-articulator 231

    coordination was used to generate expected velar gesture onset times (Saltzman and Munhall, 232

    1989; Nam et al., 2012). This model (TADA) is not speaker-specific, so the onset times were 233

    scaled proportionally by the ratio of the default word duration (from TADA) to the actual 234

    duration of each word. We used these onset times for each event in the speech signal to segment 235

    ECoG features. 236
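The onset criterion can be sketched in Python as below; the peak-finding step and the reading of the 50-ms Gaussian kernel as the smoothing sigma are assumptions, and the TADA-derived velar onsets are not covered here.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def detect_gesture_onsets(position, velocity=None, fs=100.0, smooth_ms=50.0, frac=0.2):
        """Velocity-threshold onset detection for one articulator aperture trajectory.
        position is sampled at `fs` Hz (AAI frames every 10 ms -> fs = 100). Returns onset times (s)."""
        pos = gaussian_filter1d(position, sigma=smooth_ms / 1000.0 * fs)
        vel = np.abs(np.gradient(pos) * fs) if velocity is None else np.abs(velocity)

        onsets = []
        # Candidate gestures: local velocity peaks (a simplifying assumption).
        peaks = [i for i in range(1, len(vel) - 1) if vel[i - 1] < vel[i] >= vel[i + 1]]
        for pk in peaks:
            trough = pk
            while trough > 0 and vel[trough - 1] <= vel[trough]:
                trough -= 1                       # walk back to the preceding velocity minimum
            thresh = vel[trough] + frac * (vel[pk] - vel[trough])
            idx = trough + int(np.argmax(vel[trough:pk + 1] >= thresh))
            onsets.append(idx / fs)               # first sample exceeding 20% of the velocity range
        return onsets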


    Intra-word position classification 238

    We analyzed how cortical high gamma activity varies with the context of phonemic and 239

    gestural events (i.e., coarticulation) in two subjects producing consonant-vowel-consonant 240

    words. We used the high gamma activity on each electrode individually to classify whether each 241

    consonant phoneme or gesture was the initial or final consonant in each word. The coarticulation 242


    of speech sounds means that phonemes are not consistently associated with one set of gestures 243

    across intra-word positions. Therefore, we predicted that if gestures characterize the 244

    representational structure of a cortical area, the cortical activity associated with a phoneme 245

    should vary across word positions. In contrast, because gestures characterize speech movements 246

    that do not vary with context, the cortical activity associated with a gesture should also be 247

    context-invariant. Therefore, we did not expect to be able to classify a gesture’s position with 248

    better than chance accuracy. For this analysis, we included three types of gestures (closures of 249

    tongue tip, tongue body, or lips) and their associated phonemes. To reduce the likelihood of 250

    including cortical activity related to production of neighboring events (e.g., vowel-related 251

    phonemes or gestures) in our classification, we only used the high gamma activity immediately 252

    surrounding event onset (from 100 ms before to 50 ms after, in 25-ms time bins) to classify intra-253

    word position from individual electrodes. We classified initial vs. final position using linear 254

    discriminant analysis (LDA; with 10x10 cross-validation repeats), since there were only 6 255

    features for each classifier. 256
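A compact version of this single-electrode position classifier, using scikit-learn as a stand-in for the LDA and 10x10 cross-validation described above:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    def position_accuracy(hg_electrode, onset_samples, is_final, fs=2000):
        """Classify initial vs. final intra-word position from one electrode's high gamma,
        using six 25-ms bins spanning -100 ms to +50 ms around event onset."""
        bin_len = int(0.025 * fs)
        start = int(-0.100 * fs)
        X = []
        for onset in onset_samples:
            seg = hg_electrode[onset + start: onset + start + 6 * bin_len]
            X.append(seg.reshape(6, bin_len).mean(axis=1))
        X, y = np.array(X), np.array(is_final).astype(int)

        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
        return cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv).mean()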

To quantify the significance (and effect size) of our results, we examined the discriminability index d´ between accuracy (% correct) of phonemic or gestural position and chance accuracy. d´ between two groups is defined as the difference of their means divided by their pooled standard deviation. For example, d´ = (μ_g − μ_p) / √[(n_g σ_g² + n_p σ_p²) / (n_g + n_p)], where μ_g is the mean of gestural position accuracy, n_g is the number of gesture instances minus one, σ_g is the standard deviation of gesture position accuracy, and the same symbols with subscript p stand for phonemes. Mean values of d´

    were computed from electrodes that were related to the corresponding gesture type. This was 263

    determined by classifying among all gestures (except larynx) using the high gamma activity from 264

    each individual electrode, in 25-ms time bins, from 100 ms before to 50 ms after gesture or 265


    phoneme onset as features and classifying using classwise principal component analysis (see 266

    below). Each electrode was designated as being related to the gesture which was classified most 267

    accurately. 268
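The discriminability index defined above reduces to a few lines of code, for example:

    import numpy as np

    def d_prime(acc_g, acc_p):
        """d' between two sets of accuracies (e.g., gestural vs. phonemic position accuracy,
        or observed vs. chance accuracy), using the pooled-SD formula given above."""
        acc_g, acc_p = np.asarray(acc_g, dtype=float), np.asarray(acc_p, dtype=float)
        n_g, n_p = len(acc_g) - 1, len(acc_p) - 1           # instance counts minus one
        pooled_var = (n_g * acc_g.var(ddof=1) + n_p * acc_p.var(ddof=1)) / (n_g + n_p)
        return (acc_g.mean() - acc_p.mean()) / np.sqrt(pooled_var)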


    Event Classification and Statistical Analysis 270

    To obtain more detailed information about the encoding of each cortical area, we also used 271

    ECoG high gamma activity to classify which phoneme or gesture was being uttered at each event 272

    onset. We classified consonant phonemes and all gestures except for larynx. We limited our 273

    phoneme/gesture classification analysis to consonant phonemes for two reasons. First, the TADA 274

    model assumes that the larynx (or glottis) is closed by default (Browman and Goldstein, 1992), 275

    which makes it very difficult if not impossible to assign meaningful onset (closure) times to this 276

    gesture, which is present in all vowels. In addition, we wished to avoid influence of 277

    coarticulation of neighboring phonemes. Therefore, we removed vowels and /s/ phonemes, as 278

    well as the larynx closing gesture, from the analysis. To ensure sufficient accuracy of our 279

    classification models, we only included phonemes with at least 15 instances, resulting in roughly 280

    the same number of phoneme classes as gesture classes (average of 15.2 phonemes across 281

    subjects). The phonemes most commonly included were 282

    {/p/,/b/,/m/,/f/,/d/,/t/,/n/,/l/,/r/,/ɡ/,/k/,/v/,/j/}. We classified 12 gestures—lips (open, close, 283

    critical), tongue tip (open, close, critical), tongue body (open, close, critical) and velum (open, 284

    close, critical) in all subjects. 285

    Due to the large number of potential features and relatively low number of trials, we used 286

    classwise principal component analysis (CPCA) to reduce dimensionality of the input feature 287

    space and hence reduce the risk of overfitting. CPCA performs PCA on each class separately, 288

    which enables dimensionality reduction while preserving class-specific information (Das and 289


    Nenadic, 2009; Das et al., 2009). For each class, the procedure chose a feature subspace 290

    consisting of all components with eigenvalues larger than the mean of the nonzero eigenvalues 291

    (Das and Nenadic, 2009). Linear discriminant analysis (LDA) was then used to determine the 292

    feature subspace with the most information about the classes. The high gamma features were 293

    then projected into this subspace and LDA was used to classify the data (Slutzky et al., 2011; 294

    Flint et al., 2012b). We used one-versus-the rest classification, in which one event class was 295

    specified, and events not in that class were combined into a “rest” group. We reported only the 296

    accuracy of classifying a given class (for example, in /p/ vs. the rest, we reported the accuracy of 297

    classifying the /p/ class, but not the “rest” class), to avoid bias in accuracy due to the imbalance 298

    in “one” and “rest” class sizes. We used 10-fold cross-validation with randomly-selected test 299

    sets (making sure that at least some of the target events were in each test set) to compute 300

    classification performance. We repeated the 10-fold cross-validation 10 times (i.e., re-selected 301

    random test sets 10 times), for a total of 100 folds. Chance classification accuracies were 302

    determined by randomly shuffling event labels 200 times and re-classifying. We created an 303

    overall performance for each subject as a weighted average of all the events; the performance of 304

    each phoneme or gesture was weighted by the probability of that phoneme or gesture in the data 305

    set. The Wilcoxon signed-rank test was used for all statistical comparisons reported. 306
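A simplified Python sketch of this pipeline is shown below. It keeps the per-class PCA component-selection rule and the one-vs-rest LDA with repeated cross-validation and label-shuffled chance estimates, but it is not the full CPCA procedure of Das and Nenadic (2009); the subspace construction here is an approximation.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import RepeatedStratifiedKFold

    def cpca_like_basis(X, y):
        """Per-class PCA: keep components whose eigenvalues exceed the mean of that class's
        nonzero eigenvalues, and stack them into a single projection basis."""
        comps = []
        for c in np.unique(y):
            pca = PCA().fit(X[y == c])
            ev = pca.explained_variance_
            comps.append(pca.components_[ev > ev[ev > 0].mean()])
        return np.vstack(comps)                          # (n_kept_components, n_features)

    def one_vs_rest_accuracy(X, labels, target, n_shuffles=200, seed=0):
        """Accuracy for the target class only (one vs. the rest), plus shuffled-label chance."""
        rng = np.random.default_rng(seed)
        y = (labels == target).astype(int)

        def cv_accuracy(y_):
            cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
            correct, total = 0, 0
            for tr, te in cv.split(X, y_):
                W = cpca_like_basis(X[tr], y_[tr])
                clf = LinearDiscriminantAnalysis().fit(X[tr] @ W.T, y_[tr])
                pred = clf.predict(X[te] @ W.T)
                mask = y_[te] == 1                       # score the target class only
                correct += int((pred[mask] == 1).sum())
                total += int(mask.sum())
            return correct / total

        accuracy = cv_accuracy(y)
        chance = np.mean([cv_accuracy(rng.permutation(y)) for _ in range(n_shuffles)])
        return accuracy, chance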


    Allophone classification 308

    Four participants read aloud a specific set of spoken, monosyllabic words from the speech 309

    control literature that included allophones to amplify the distinction between phonemic and 310

    gestural representation in specific cortical areas (Buchwald and Miozzo, 2011). Allophones are 311

    different pronunciations of the same phoneme in different contexts within words, which reflect 312


    the different gestures being used to produce that phoneme (Browman and Goldstein, 1992). For 313

    example, consonant phonemes are produced differently when isolated at the beginning of a word 314

    (e.g., the /t/ in “tab”, which is voiceless) compared to when they are part of a cluster at the 315

    beginning of a word (e.g., the /t/ in “stab”, which is acoustically more similar to a voiced /d/, 316

    Figure 5A). Using word sets with differing initial consonant allophones (either CVC or CCVC 317

    in organization) enabled us to dissociate more directly the production of phonemes from the 318

    production of gestures. This can be thought of as changing the mapping between groups of 319

    gestures and an allophone, somewhat analogous to limb motor control studies that used artificial 320

    visual rotations to change the mapping between reach target and kinematics to assess cortical 321

    representation (Wise et al., 1998; Paz et al., 2003). 322

    We trained separate classifiers (CPCA with LDA, as in the prior section) for voiceless and 323

    voiced consonants (VLC and VC, respectively), and tested their performance in decoding both 324

    the corresponding isolated allophone (VLC or VC) and the corresponding consonant cluster 325

    allophone (CClA). For example, we built classifiers of /t/ (vs. all other consonants) and /d/ (vs. 326

    all other consonants) and tested them in classifying the /t/ in words starting with “st”. 327
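The train/test scheme for the allophone analysis can be sketched as follows; a plain LDA stands in for the CPCA+LDA classifier described above, and feature extraction is assumed to have been done as in the earlier sketches.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def allophone_transfer(X_train, labels_train, X_ccla, vlc="t", vc="d"):
        """Train voiceless (e.g., /t/) and voiced (e.g., /d/) one-vs-rest classifiers, then test
        each on consonant cluster allophone (CClA) trials such as the /t/ in "st" words."""
        results = {}
        for name, target in (("VLC", vlc), ("VC", vc)):
            y = (labels_train == target).astype(int)
            clf = LinearDiscriminantAnalysis().fit(X_train, y)
            results[name] = clf.predict(X_ccla).mean()   # fraction of CClA trials accepted
        return results   # high "VLC" -> phoneme-like behavior; high "VC" -> gesture-like behavior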


    RESULTS 329

    We simultaneously recorded ECoG from PCG and IFG (pars opercularis) and speech audio 330

    during single word, monosyllabic utterances by 9 human participants (8 with left hemispheric 331

    recordings) undergoing functional mapping during awake craniotomies for resection of brain tumors 332

    (Figure 2). 333


    Phoneme-related, but not gesture-related, cortical activity varies with intra-word position 335


    We first analyzed how high gamma activity varies with position of phonemes and gestures within 336

    words. We found that the high gamma activity in pPCG and aPCG did not change with intra-word 337

    position of the gesture (examples in Figure 3A, right). In contrast, when aligned to phoneme onset, 338

    high gamma activity in pPCG and aPCG did vary with intra-word position (Figure 3A, left). Figure 339

    3B shows an example of classification of tongue body and tongue tip closure position from all 340

    electrodes that predominantly encoded those gestures (based on single-electrode decoding of all 341

    gesture types – see Methods). Gesture classification accuracies were not larger than chance, while 342

    accuracies of classifying associated phonemes ({/k/,/ɡ/} for tongue body and {/t/,/d/,/l/,/n/,/s/} for 343

    tongue tip) were indeed larger than chance. To quantify the accuracy of classification compared to 344

    chance over electrodes, we computed the discriminability index d´ on each electrode (examples in 345

    Figure 3C). d´ is the difference of means (in this case, between phoneme or gesture position and 346

    chance accuracy) divided by the pooled SD (see Methods); a d´ value of >1 is considered large. We 347

    computed the mean d´ over all electrodes in pPCG and aPCG that were modulated with lip, tongue 348

    tip, or tongue body gestures (see Methods). We found that, over all of these electrodes in both 349

    subjects, d´ was large for the associated phonemes (2.3±0.6, mean±SEM) and no different from zero 350

    for gestures (-0.06±0.6). We also examined all electrodes in pPCG and aPCG, regardless of 351

    modulation, and found similar results: d´ was large for phonemes (2.7±0.3) and no different from 352

    zero for gestures (0.2±0.3). Thus, cortical activity for gestures did not vary with context, while 353

    cortical activity for phonemes varied substantially across contexts. 354


    pPCG, aPCG, and IFG more accurately represent gestures than phonemes 356

    To further investigate sublexical representation in the cortex, we used high gamma activity from 357

    8 participants to classify which phoneme or gesture was being uttered at each event onset. We first 358


classified phonemes and gestures separately using recordings that combined all precentral gyrus

    electrodes (pPCG/aPCG). Combined pPCG/aPCG (PCG for short) activity classified gestures with 360

    significantly higher accuracy than phonemes: 63.7±3.4% vs. 41.6±2.2% (mean±SEM across subjects, 361

    p=0.01) as seen in Figure 4A. Gestural representations remained significantly dominant over 362

    phonemes after subtracting the chance decoding accuracy for each type (mean 34.3±3.4% vs. 363

    17.5±2.2%, p=0.008; Figure 4B). 364

    M1v, PMv, and IFG have been theorized to contribute differently to speech production, 365

    movements, and preparation for speech. We therefore investigated the representation of each 366

    individual area by performing gesture and phoneme classification using the ensemble of electrodes 367

    from each cortical area, in each subject, separately. Classification performance of both types 368

    increased moving from anterior to posterior areas. In each area, gestures were classified with greater 369

    accuracy than phonemes (IFG: 48.8±6.8% vs. 39.1±5.6%, p = 0.03; aPCG: 58.3±3.6% vs. 370

    40.7±2.1%, p = 0.016; pPCG: 62.6±2.2% vs. 47.3±2.0%, p = 0.008; Figure 4C). This predominance 371

    remained after subtracting chance accuracy across subjects (IFG: 17.9±6.4%, p = 0.016, aPCG: 372

    25.3±12.0%, p = 0.008, pPCG: 27.7±16.4%, p = 0.016; Figure 4D). The difference was significant in 373

    pPCG and aPCG, but not in IFG, when using Bonferroni correction for multiple comparisons. The 374

    difference in accuracy was not due to gestures having a greater incidence than phonemes 375

    (mean±SEM of 61±13 vs. 147±44 instances per phoneme and gesture, respectively), as significant 376

    differences remained when we performed decoding on a dataset with maximum numbers of gesture 377

    and phoneme instances matched (data not shown). To quantify the difference further, we computed 378

    d´ between accuracies of gestures and phonemes in each area. The d´ values in pPCG and aPCG were 379

    both very high (3.6 and 2.9), while that in IFG was slightly less (2.0), suggesting decreased gestural 380

    predominance in IFG than in pPCG or aPCG. 381


    Allophone classification supports predominance of gestural representations 383

    In four participants, we used word sets emphasizing consonant allophones (voiced, voiceless, and 384

    clustered with /s/) to amplify the distinction between phonemic and gestural representations. The /t/ 385

    in “st” words was acoustically more similar to, and produced with high gamma activity more like, a 386

/d/ in pPCG electrodes, and more like a solitary initial /t/ in aPCG and IFG (Figure 5A and 5B). We

    investigated the extent to which consonant cluster allophones (CClAs) behaved more similarly to 388

    voiceless consonants (VLCs) or to voiced consonants (VCs) in each area. If CClAs were classified 389

    with high performance using the voiceless classifier (Figure 5C, blue rectangle), we would infer that 390

    phonemes were the dominant representation. If CClAs were classified with high performance using 391

    the voiced classifier, we would infer that gestures were the dominant representation (Figure 5C, 392

    orange rectangle). If CClAs were classified with low performance by both classifiers (Figure 5C, 393

    green rectangle), it would suggest that the CClA were a distinct category, produced differently from 394

    the voiced and from the voiceless allophone. 395

    Cluster consonants behaved less like the phoneme and more like the corresponding gesture when 396

    moving from anterior to posterior in the cortex (Figure 5D and 5E). For example, in IFG and aPCG, 397

    the CClAs behaved much more like the VLC phonemes than they did in pPCG (p=0.6, 0.5, and 0.008 398

and d´=0.1, 0.2, and 0.4 in IFG, aPCG, and pPCG, respectively, for performance of the VLC classifier

    on VLCs vs. CClAs). The CClAs behaved more like the VC phonemes in pPCG than in aPCG and 400

    IFG (d´=0.4, 0.7, and 0.3 in IFG, aPCG, and pPCG, respectively), although there was still some 401

    difference in pPCG between CClA performance and VC performance. The CClAs were produced 402

    substantially more like VC phonemes than like VLC phonemes in pPCG, which implies that pPCG 403


    predominantly represents gestures. The difference between CClAs and VC phonemes suggests that 404

    the cluster allophones may represent another distinct speech sound category. 405


    DISCUSSION 407

    We investigated the representation of articulatory gestures and phonemes in precentral and 408

    inferior frontal cortices during speech production. Activity in these areas revealed the intra-word 409

    position of phonemes but not the position of gestures. This suggests gestures provide a more 410

    parsimonious, and more accurate, description of what is encoded in these cortices. Gesture 411

    classification significantly outperformed phoneme classification in pPCG and aPCG, and in 412

    combined PCG, and trended toward better performance in IFG. Consonants in clusters behaved 413

    more similarly to the consonant that shared more similar gestures (voiced), rather than the consonant 414

    that shared the same phoneme (voiceless) in more posterior areas; this relationship tended to reverse 415

    in more anterior areas. Together, these results indicate that cortical activity in PCG (M1v and PMv), 416

    but not IFG, represents gestures to a greater extent than phonemes during production. 417

    This is the most direct evidence of gesture encoding in speech motor cortices. This evidence 418

    supports theoretical models incorporating gestures in speech production, such as the Task Dynamic 419

    model of inter-articulator coordination (TADA) and the Directions-Into-Velocities of Articulators 420

    (DIVA) model (Saltzman and Munhall, 1989; Guenther et al., 2006; Hickok et al., 2011). DIVA, in 421

    particular, hypothesizes that gestures are encoded in M1v. These results also suggest that models not 422

    incorporating gestures, instead proposing that phonemes are the immediate output from motor cortex 423

    to brainstem motor nuclei, may be incomplete (Levelt, 1999; Levelt et al., 1999; Hickok, 2012b). 424

The phenomenon of coarticulation, i.e., that phoneme production is affected by planning and

    production of neighboring phonemes, has long been established using kinematic, physiologic (EMG), 426


    and acoustic methods (Öhman, 1966; Kent, 1977; Whalen, 1990; Magen, 1997; Denby et al., 2010; 427

    Schultz and Wand, 2010). Our results showing discrimination of intra-word phoneme position and 428

    differences in allophone encoding confirm the existence of phoneme coarticulation in cortical 429

    activity as well. Bouchard and colleagues first demonstrated evidence of PCG representation of 430

    coarticulation during vowel production (Bouchard and Chang, 2014). Our results demonstrate 431

    cortical representation of coarticulation during consonant production. Some have suggested that 432

    coarticulation can be explained by the different gestures that are used when phonemes are in different 433

    contexts (Browman and Goldstein, 1992; Buchwald, 2014). Since gestures can be thought of as a 434

    rough estimate of articulator movements, our results demonstrating gesture encoding corroborate a 435

    recent study of isolated vowel production showing that PCG encodes the kinematics of articulators to 436

    a greater extent than the acoustic outputs (Conant et al., 2018). 437

    The use of allophones enabled us to dissociate the correlation between phonemes and gestures, as 438

    a single consonant phoneme is produced differently in the different allophones. In pPCG, the CClAs 439

    did not behave like either the VLC phonemes or VC phonemes, though they were more similar to VC 440

    phonemes. This suggests that the CClAs are produced differently than either VCs or VLCs. It is also 441

    possible that there may have been some features in the CClAs that were related to /s/ production, in 442

    the time from 300 to 200 ms before plosive onset, that affected the results. Overall, these results 443

    support previous findings: prior to release of the laryngeal constriction, the CClAs are hypothesized 444

    to be associated with a laryngeal gesture that is absent in VC phonemes (Browman and Goldstein, 445

    1992; Cho et al., 2014). Thus, it is not surprising that we observed this difference in classification 446

    between CClAs and VCs (Figure 5D). These results, therefore, still support a gestural representation 447

    in M1v as well as in PMv and IFG. 448


    This study provides a deeper look into IFG activity during speech production. The role of IFG in 449

    speech production to date has been unclear. Classically, based on lesion studies and electrical 450

    stimulation, the neural control of speech production was described as starting in the inferior frontal 451

gyrus, with low-level, non-speech movements elicited in primary motor cortex (M1v; Broca, 1861;

    Penfield and Rasmussen, 1949). The classical view that IFG was involved in word generation 453

    (Broca, 1861) has been contradicted by more recent studies. Electrical stimulation sites causing 454

speech arrest were located almost exclusively in the ventral PCG (Tate et al., 2014). Other recent

    studies have provided conflicting imaging evidence in IFG of phoneme production (Wise et al., 456

    1999), syllables (Indefrey and Levelt, 2004), and syllable-to-phoneme sequencing and timing 457

    (Gelfand and Bookheimer, 2003; Papoutsi et al., 2009; Flinker et al., 2015; Flinker and Knight, 2016; 458

    Long et al., 2016). Flinker et al. showed that IFG was involved in articulatory sequencing (Flinker et 459

al., 2015). The comparable classification performance for gestures and phonemes using IFG activity

    suggests that there is at least some information in IFG related to gesture production. While our results 461

    cannot completely address IFG’s function due to somewhat limited electrode coverage (mainly pars 462

    opercularis) and experimental design (monosyllabic words likely limited IFG activation and 463

    classification performance somewhat), they do provide evidence for gesture representation in IFG. 464

    These results imply that speech production cortices share a similar organization to limb-related 465

    motor cortices, despite clear differences between the neuroanatomy of articulator and limb 466

    innervation (e.g., cranial nerve compared to spinal cord innervation). In this analogy, gestures 467

    represent articulator positions at discrete times (Guenther et al., 2006), while phonemes can be 468

    considered speech targets. Premotor and posterior parietal cortices preferentially encode the targets 469

    of reaching movements (Hocherman and Wise, 1991; Shen and Alexander, 1997; Pesaran et al., 470

    2002; Hatsopoulos et al., 2004; Pesaran et al., 2006), while M1 preferentially encodes reach 471


    trajectories (Georgopoulos et al., 1986; Moran and Schwartz, 1999), force (Evarts, 1968; Scott and 472

    Kalaska, 1997; Flint et al., 2014), or muscle activity (Kakei et al., 1999; Morrow and Miller, 2003; 473

    Cherian et al., 2013; Oby et al., 2013). This suggests that M1v predominantly represents articulator 474

    kinematics and/or muscle activity; detailed measurements of articulator positions are starting to 475

    demonstrate this (Bouchard et al., 2016; Conant et al., 2018). While we found that gesture 476

    representations predominated over phonemic representations in all 3 areas, there was progressively 477

    less predominance in aPCG and IFG, which could suggest a rough hierarchy of movement-related 478

    information in the cortex (although phonemic representations can also be distributed throughout the 479

    cortex; (Cogan et al., 2014). We also found evidence for encoding of gestures and phonemes in both 480

    dominant and non-dominant hemispheres, which corroborates prior evidence of bilateral encoding of 481

    sublexical speech production (Bouchard et al., 2013; Cogan et al., 2014). The homology with limb 482

    motor areas is perhaps not surprising, since Broca’s area is thought to be homologous to premotor 483

    areas in apes (Mendoza and Merchant, 2014). This analogous organization suggests that 484

    observations from studies of limb motor control may be extrapolated to other parts of motor and 485

    premotor cortices. 486

    As in limb movements, sensory feedback is important in speech production (Hickok, 2012a). 487

    However, it is unlikely that auditory or somatosensory feedback accounts for the relative 488

    representations of gestures and phonemes observed here. Motor cortical activity during listening is 489

    organized based on acoustics, rather than based on articulators (Cheung et al., 2016); thus any effect 490

    of auditory feedback would be to improve phoneme performance. Somatosensory feedback’s 491

    contribution to activity should be limited by the very short amount of time after events included in 492

    location and allophone analyses. Overall, consistent findings across multiple types of analyses 493


strongly favor gestural predominance. Possible sensory contributions to speech production representations are an important area for future research.

    Brain machine interfaces (BMIs) could substantially improve quality-of-life of individuals who 496

    are paralyzed from neurological disorders. Just as understanding the cortical control of limb 497

    movements has led to advances in motor BMIs, a better understanding of the cortical control of 498

    speech will likely improve the ability to decode speech directly from the motor cortex. A speech 499

    BMI that could directly decode attempted speech would be more efficient than, and could 500

    dramatically increase the communication rate over, current slow and often tedious methods for this 501

    patient population (e.g., eye trackers, gaze communication boards, and even the most recent spelling-502

based BMIs; Brumberg et al., 2010; Chen et al., 2015; Pandarinath et al., 2017). While we can use

    ECoG to identify words via phonemes (Mugler et al., 2014b), these results suggest that gestural 504

    decoding would outperform phoneme decoding in BMIs using M1v/PMv activity. The decoding 505

    techniques used here would require modification for closed-loop implementation, though signatures 506

    related to phoneme production have been used for real-time control of simple speech sound-based 507

    BMIs (Leuthardt et al., 2011; Brumberg et al., 2013). Also, analysis of preparatory (premotor) neural 508

    activity of speech production, which our study was not designed to examine, would be important to 509

    investigate for speech BMI control. Overall, improving our understanding of the cortical control of 510

    articulatory movements advances us toward viable BMIs that can decode intended speech 511

    movements in real-time. 512

    Understanding the cortical encoding of sublexical speech production could also improve 513

    identification of functional speech motor areas. More rapid and/or accurate identification of these 514

    areas using ECoG could help to make surgeries for epilepsy or brain tumors more efficient, and 515

    possibly safer, by reducing operative time and number of stimuli and better defining areas to avoid 516


    resecting (Schalk et al., 2008; Roland et al., 2010; Korostenskaja et al., 2013). These results therefore 517

    guide future investigations into development of neurotechnology for speech communication and 518

    functional mapping. 519


    REFERENCES 523

    Bakovic E (2014) Phonology and phonological theory. In: The Oxford handbook of language 524

    production (Goldrick MA, Ferreira V, Miozzo M, eds), pp 199-209. Oxford: Oxford 525

    University Press. 526

    Ballard KJ, Granier JP, Robin DA (2000) Understanding the nature of apraxia of speech: Theory, 527

    analysis, and treatment. Aphasiology 14:969-995. 528

    Bocquelet F, Hueber T, Girin L, Savariaux C, Yvert B (2016) Real-time control of an 529

    articulatory-based speech synthesizer for brain computer interfaces. PLoS Computational 530

    Biology 12:e1005119. 531

    Bouchard KE, Chang EF (2014) Control of spoken vowel acoustics and the influence of phonetic 532

    context in human speech sensorimotor cortex. Journal of Neuroscience 34:12662-12677. 533

    Bouchard KE, Mesgarani N, Johnson K, Chang EF (2013) Functional organization of human 534

    sensorimotor cortex for speech articulation. Nature 495:327-332. 535

    Bouchard KE, Conant DF, Anumanchipalli GK, Dichter B, Chaisanguanthum KS, Johnson K, 536

    Chang EF (2016) High-resolution, non-invasive imaging of upper vocal tract articulators 537

    compatible with human brain recordings. Plos One 11:e0151327. 538

    Broca P (1861) Remarques sur le siège de la faculté du langage articule suivies d’une 539

    observation d’aphemie. Bull Soc Anat Paris 6:330-357. 540

    Browman CP, Goldstein L (1992) Articulatory phonology: an overview. Phonetica 49:155-180. 541

    Brumberg JS, Guenther FH, Kennedy PR (2013) An auditory output Brain–computer interface 542

    for speech communication. In: Brain-Computer Interface Research, pp 7-14: Springer. 543

    Brumberg JS, Nieto-Castanon A, Kennedy PR, Guenther FH (2010) Brain-computer interfaces 544

    for speech communication. Speech Communication 52:367-379. 545


    Buchwald A (2014) Phonetic processing. New York, NY: Oxford University Press. 546

    Buchwald A, Miozzo M (2011) Finding levels of abstraction in speech production: evidence 547

    from sound-production impairment. Psychological Science 22:1113-1119. 548

    Carey D, McGettigan C (2016) Magnetic resonance imaging of the brain and vocal tract: 549

    applications to the study of speech production and language learning. Neuropsychologia 550

    98:1-11. 551

Chao ZC, Nagasaka Y, Fujii N (2010) Long-term

    asynchronous decoding of arm motion using electrocorticographic signals in monkeys. 553

    Frontiers in Neuroengineering 3:3. 554

    Chen X, Wang Y, Nakanishi M, Gao X, Jung T-P, Gao S (2015) High-speed spelling with a 555

    noninvasive brain–computer interface. Proceedings of the National Academy of Sciences 556

    112:E6058-67. 557

    Cherian A, Fernandes HL, Miller LE (2013) Primary motor cortical discharge during force field 558

    adaptation reflects muscle-like dynamics. Journal of Neurophysiology 110:768-783. 559

    Cheung C, Hamilton LS, Johnson K, Chang EF (2016) The auditory representation of speech 560

    sounds in human motor cortex. eLife 5:1-19. 561

    Cho T, Lee Y, Kim S (2014) Prosodic strengthening on the/s/-stop cluster and the phonetic 562

    implementation of an allophonic rule in English. Journal of Phonetics 46:128-146. 563

    Cogan GB, Thesen T, Carlson C, Doyle W, Devinsky O, Pesaran B (2014) Sensory-motor 564

    transformations for speech occur bilaterally. Nature 507:94-98. 565

    Conant DF, Bouchard KE, Leonard MK, Chang EF (2018) Human sensorimotor cortex control 566

    of directly-measured vocal tract movements during vowel production. Journal of 567

    Neuroscience:2382-2317. 568


    Crone NE, Hao L, Hart J, Boatman D, Lesser RP, Irizarry R, Gordon B (2001) 569

    Electrocorticographic gamma activity during word production in spoken and sign 570

    language. Neurology 57:2045-2053. 571

    Das K, Nenadic Z (2009) An efficient discriminant-based solution for small sample size 572

    problem. Pattern Recognition 42:857-866. 573

    Das K, Rizzuto DS, Nenadic Z (2009) Mental state estimation for brain--computer interfaces. 574

    IEEE Transactions on Biomedical Engineering 56:2114-2122. 575

    Denby B, Schultz T, Honda K, Hueber T, Gilbert JM, Brumberg JS (2010) Silent speech 576

    interfaces. Speech Communication 52:270-287. 577

    Evarts EV (1968) Relation of pyramidal tract activity to force exerted during voluntary 578

    movement. Journal of Neurophysiology 31:14-27. 579

    Fesl G, Moriggl B, Schmid UD, Naidich TP, Herholz K, Yousry TA (2003) Inferior central 580

    sulcus: Variations of anatomy and function on the example of the motor tongue area. 581

    NeuroImage 20:601-610. 582

    Flinker A, Knight RT (2016) A cool approach to probing speech cortex. Neuron 89:1123-1125. 583

    Flinker A, Korzeniewska A, Shestyuk AY, Franaszczuk PJ, Dronkers NF, Knight RT, Crone NE 584

    (2015) Redefining the role of Broca’s area in speech. Proceedings of the National 585

    Academy of Sciences 112:2871-2875. 586

    Flint RD, Ethier C, Oby ER, Miller LE, Slutzky MW (2012a) Local field potentials allow 587

    accurate decoding of muscle activity. Journal of Neurophysiology 108:18-24. 588

    Flint RD, Lindberg EW, Jordan LR, Miller LE, Slutzky MW (2012b) Accurate decoding of 589

    reaching movements from field potentials in the absence of spikes. Journal of Neural 590

    Engineering 9:046006. 591


    Flint RD, Wang PT, Wright ZA, King CE, Krucoff MO, Schuele SU, Rosenow JM, Hsu FP, Liu 592

    CY, Lin JJ, Sazgar M, Millett DE, Shaw SJ, Nenadic Z, Do AH, Slutzky MW (2014) 593

    Extracting kinetic information from human motor cortical signals. Neuroimage 101:695-594

    703. 595

    Flint RD, Rosenow J, Tate MC, Slutzky MW (2017) Continuous decoding of human grasp 596

    kinematics using epidural and subdural signals. Journal of Neural Engineering 14: 597

    016005. 598

    Gelfand JR, Bookheimer SY (2003) Dissociating neural mechanisms of temporal sequencing and 599

    processing phonemes. Neuron 38:831-842. 600

    Georgopoulos A, Schwartz A, Kettner R (1986) Neuronal population coding of movement 601

    direction. Science 233:1416-1419. 602

    Ghosh SS, Tourville JA, Guenther FH (2008) A neuroimaging study of premotor lateralization 603

    and cerebellar involvement in the production of phonemes and syllables. Journal of 604

    Speech, Language, and Hearing Research 51:1183-1202. 605

    Guenther FH, Ghosh SS, Tourville Ja (2006) Neural modeling and imaging of the cortical 606

    interactions underlying syllable production. Brain and Language 96:280-301. 607

    Hatsopoulos N, Joshi J, O'Leary JG (2004) Decoding continuous and discrete motor behaviors 608

    using motor and premotor cortical ensembles. Journal of Neurophysiology 92:1165-1174. 609

    Herff C, Heger D, de Pesters A, Telaar D, Brunner P, Schalk G, Schultz T (2015) Brain-to-text: 610

    decoding spoken phrases from phone representations in the brain. Frontiers in 611

    Neuroscience 9:1-11. 612


Hermes D, Miller KJ, Noordmans HJ, Vansteensel MJ, Ramsey NF (2010) Automated electrocorticographic electrode localization on individually rendered brain surfaces. Journal of Neuroscience Methods 185:293-298.
Hermes D, Miller KJ, Vansteensel MJ, Aarnoutse EJ, Leijten FSS, Ramsey NF (2012) Neurophysiologic correlates of fMRI in human motor cortex. Human Brain Mapping 33:1689-1699.
Hickok G (2012a) The cortical organization of speech processing: feedback control and predictive coding in the context of a dual-stream model. Journal of Communication Disorders 45:393-402.
Hickok G (2012b) Computational neuroanatomy of speech production. Nature Reviews Neuroscience 13:135-145.
Hickok G, Houde J, Rong F (2011) Sensorimotor integration in speech processing: computational basis and neural organization. Neuron 69:407-422.
Hocherman S, Wise SP (1991) Effects of hand movement path on motor cortical activity in awake, behaving rhesus monkeys. Experimental Brain Research 83:285-302.
House AS, Williams C, Hecker MHL, Kryter KD (1963) Psychoacoustic speech tests: a modified rhyme test. The Journal of the Acoustical Society of America 35:1899.
Indefrey P, Levelt WJ (2004) The spatial and temporal signatures of word production components. Cognition 92:101-144.
Kakei S, Hoffman DS, Strick PL (1999) Muscle and movement representations in the primary motor cortex. Science 285:2136-2139.
Kellis S, Miller K, Thomson K, Brown R, House P, Greger B (2010) Decoding spoken words using local field potentials recorded from the cortical surface. Journal of Neural Engineering 7:056007.
Kent R (1977) Coarticulation in recent speech production. Journal of Phonetics 5:15-133.
Korostenskaja M, Wilson AJ, Rose DF, Brunner P, Schalk G, Leach J, Mangano FT, Fujiwara H, Rozhkov L, Harris E, Chen P-C, Seo J-H, Lee KH (2013) Real-time functional mapping with electrocorticography in pediatric epilepsy: comparison with fMRI and ESM findings. Clinical EEG and Neuroscience 45:205-211.
Leuthardt EC, Gaona C, Sharma M, Szrama N, Roland J, Freudenberg Z, Solis J, Breshears J, Schalk G (2011) Using the electrocorticographic speech network to control a brain-computer interface in humans. Journal of Neural Engineering 8:036004.
Levelt WJM (1999) Models of word production. Trends in Cognitive Sciences 3:223-232.
Levelt WJM, Roelofs A, Meyer AS (1999) A theory of lexical access in speech production. Behavioral and Brain Sciences 22:1-75.
Logothetis NK, Pauls J, Augath M, Trinath T, Oeltermann A (2001) Neurophysiological investigation of the basis of the fMRI signal. Nature 412:150-157.
Long MA, Katlowitz KA, Svirsky MA, Clary RC, Byun TMA, Majaj N, Oya H, Howard MA, Greenlee JDW (2016) Functional segregation of cortical regions underlying speech timing and articulation. Neuron 89:1187-1193.
Lotte F, Brumberg JS, Brunner P, Gunduz A, Ritaccio AL, Guan C, Schalk G (2015) Electrocorticographic representations of segmental features in continuous speech. Frontiers in Human Neuroscience 9.
Magen HS (1997) The extent of vowel-to-vowel coarticulation in English. Journal of Phonetics 25:187-205.
Mehring C, Nawrot MP, de Oliveira SC, Vaadia E, Schulze-Bonhage A, Aertsen A, Ball T (2004) Comparing information about arm movement direction in single channels of local and epicortical field potentials from monkey and human motor cortex. Journal of Physiology-Paris 98:498-506.
Mendoza G, Merchant H (2014) Motor system evolution and the emergence of high cognitive functions. Progress in Neurobiology 122:73-93.
Mines MA, Hanson BF, Shoup JE (1978) Frequency of occurrence of phonemes in conversational English. Language and Speech 21:221-241.
Moran DW, Schwartz AB (1999) Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology 82:2676-2692.
Morrow MM, Miller LE (2003) Prediction of muscle activity by populations of sequentially recorded primary motor cortex neurons. Journal of Neurophysiology 89:2279-2288.
Mugler EM, Goldrick M, Slutzky MW (2014a) Cortical encoding of phonemic context during word production. In: Conference Proceedings of the IEEE Engineering in Medicine and Biology Society.
Mugler EM, Patton JL, Flint RD, Wright ZA, Schuele SU, Rosenow J, Shih JJ, Krusienski DJ, Slutzky MW (2014b) Direct classification of all American English phonemes using signals from functional speech motor cortex. Journal of Neural Engineering 11:035015.
Nam H, Mitra V, Tiede MK, Saltzman E, Goldstein L, Espy-Wilson C, Hasegawa-Johnson M (2010) A procedure for estimating gestural scores from articulatory data. The Journal of the Acoustical Society of America 127:1851.
Nam H, Mitra V, Tiede M, Hasegawa-Johnson M, Espy-Wilson C, Saltzman E, Goldstein L (2012) A procedure for estimating gestural scores from speech acoustics. The Journal of the Acoustical Society of America 132:3980-3989.
Oby ER, Ethier C, Miller LE (2013) Movement representation in the primary motor cortex and its contribution to generalizable EMG predictions. Journal of Neurophysiology 109:666-678.
Öhman SE (1966) Coarticulation in VCV utterances: spectrographic measurements. The Journal of the Acoustical Society of America 39:151-168.
Pandarinath C, Nuyujukian P, Blabe CH, Sorice BL, Saab J, Willett FR, Hochberg LR, Shenoy KV, Henderson JM (2017) High performance communication by people with paralysis using an intracortical brain-computer interface. eLife 6:e18554.
Papoutsi M, de Zwart JA, Jansma JM, Pickering MJ, Bednar JA, Horwitz B (2009) From phonemes to articulatory codes: an fMRI study of the role of Broca's area in speech production. Cerebral Cortex 19:2156-2165.
Paz R, Boraud T, Natan C, Bergman H, Vaadia E (2003) Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience 6:882-890.
Pei X, Barbour DL, Leuthardt EC, Schalk G (2011a) Decoding vowels and consonants in spoken and imagined words using electrocorticographic signals in humans. Journal of Neural Engineering 8:046028.
Pei X, Leuthardt EC, Gaona CM, Brunner P, Wolpaw JR, Schalk G (2011b) Spatiotemporal dynamics of electrocorticographic high gamma activity during overt and covert word repetition. NeuroImage 54:2960-2972.
Penfield W, Boldrey E (1937) Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation. Brain 60:389-443.
Penfield W, Rasmussen T (1949) Vocalization and arrest of speech. Archives of Neurology and Psychiatry 61:21-27.
Penfield W, Roberts L (1959) Speech and brain mechanisms. Princeton: Princeton University Press.
Pesaran B, Nelson MJ, Andersen RA (2006) Dorsal premotor neurons encode the relative position of the hand, eye, and goal during reach planning. Neuron 51:125-134.
Pesaran B, Pezaris JS, Sahani M, Mitra PP, Andersen RA (2002) Temporal structure in neuronal activity during working memory in macaque parietal cortex. Nature Neuroscience 5:805-811.
Pouplier M, Goldstein L (2010) Intention in articulation: articulatory timing in alternating consonant sequences and its implications for models of speech production. Language and Cognitive Processes 25:616-649.
Ramsey N, Salari E, Aarnoutse E, Vansteensel M, Bleichner M, Freudenburg Z (2017) Decoding spoken phonemes from sensorimotor cortex with high-density ECoG grids. NeuroImage.
Ray S, Maunsell JH (2011) Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLoS Biology 9:e1000610.
Ray S, Crone NE, Niebur E, Franaszczuk PJ, Hsiao SS (2008) Neural correlates of high-gamma oscillations (60-200 Hz) in macaque local field potentials and their potential implications in electrocorticography. Journal of Neuroscience 28:11526-11536.
Riecker A, Ackermann H, Wildgruber D, Meyer J, Dogil G, Haider H, Grodd W (2000) Articulatory/phonetic sequencing at the level of the anterior perisylvian cortex: a functional magnetic resonance imaging (fMRI) study. Brain and Language 75:259-276.
Roland J, Brunner P, Johnston J, Schalk G, Leuthardt EC (2010) Passive real-time identification of speech and motor cortex during an awake craniotomy. Epilepsy & Behavior 18:123-128.
Saltzman EL, Munhall KG (1989) A dynamical approach to gestural patterning in speech production. Ecological Psychology:333-382.
Schalk G, McFarland DJ, Hinterberger T, Birbaumer N, Wolpaw JR (2004) BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE Transactions on Biomedical Engineering 51:1034-1043.
Schalk G, Leuthardt EC, Brunner P, Ojemann JG, Gerhardt LA, Wolpaw JR (2008) Real-time detection of event-related brain activity. NeuroImage 43:245-249.
Schultz T, Wand M (2010) Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication 52:341-353.
Scott S, Kalaska J (1997) Reaching movements with similar hand paths but different arm orientations. I. Activity of individual cells in motor cortex. Journal of Neurophysiology 77:826-852.
Shen L, Alexander GE (1997) Preferential representation of instructed target location versus limb trajectory in dorsal premotor area. Journal of Neurophysiology 77:1195-1212.
Slutzky MW (2018) Brain-machine interfaces: powerful tools for clinical treatment and neuroscientific investigations. The Neuroscientist. doi: 10.1177/1073858418775355.
Slutzky MW, Jordan LR, Lindberg EW, Lindsay KE, Miller LE (2011) Decoding the rat forelimb movement direction from epidural and intracortical field potentials. Journal of Neural Engineering 8:036013.
Tate MC, Herbet G, Moritz-Gasser S, Tate JE, Duffau H (2014) Probabilistic map of critical functional regions of the human cerebral cortex: Broca's area revisited. Brain 137:2773-2782.
Tourville JA, Reilly KJ, Guenther FH (2008) Neural mechanisms underlying auditory feedback control of speech. NeuroImage 39:1429-1443.
Wang W, Arora R, Livescu K, Bilmes JA (2015) Unsupervised learning of acoustic features via deep canonical correlation analysis. In: Proceedings of ICASSP.
Wang WR, Arora R, Livescu K (2014) Reconstruction of articulatory measurements with smoothed low-rank matrix completion. In: 2014 IEEE Spoken Language Technology Workshop (SLT), pp 54-59.
Westbury J (1990) X-ray microbeam speech production database. The Journal of the Acoustical Society of America 88:S56.
Westbury J, Milenkovic P, Weismer G, Kent R (1990) X-ray microbeam speech production database. The Journal of the Acoustical Society of America 88:S56.
Whalen DH (1990) Coarticulation is largely planned. Haskins Laboratories Status Report on Speech Research SR-101/102:149-176.
Wise RJ, Greene J, Buchel C, Scott SK (1999) Brain regions involved in articulation. Lancet 353:1057-1061.
Wise S, Moody S, Blomstrom K, Mitz A (1998) Changes in motor cortical activity during visuomotor adaptation. Experimental Brain Research 121:285-299.

Figure Legends

Figure 1. Defining phoneme and articulatory gesture onsets. (A) Cerebral cortex of participant S5 with recorded regions of speech motor cortex highlighted: IFG (green), aPCG (blue), and pPCG (purple). (B) Vocal tract with positions of the lips, tongue body, and tongue tip during production of a single word. Each trace represents the position, at 10-ms intervals, generated by the acoustic-articulatory inversion model, from word onset (green) to word offset (magenta; see corresponding colors in (C)). (C) Example audio signal, and corresponding audio spectrogram, from S5 with labeled phonemic event onsets (blue vertical lines) mapped to vocal tract articulatory gesture positions. Target apertures for each articulatory gesture action are marked from open (open circle), to critical (half-filled circle), to closed (filled circle). Note that the larynx has the opposite open/close orientation because its default configuration is assumed to be near closure (vibrating; Browman and Goldstein, 1992). Also note that while the initial and final consonants are associated with a specific velum-closing action, the vowel does not specify such a gesture (thus the state of the velum during the vowel depends on the surrounding gestures).

Figure 2. Electrode array locations for all 9 subjects. The top schematic shows the approximate area of cortex (rectangle) displayed for each subject. Shaded areas represent the different cortical areas: IFG (green), aPCG (blue), and pPCG (purple). Note that Subject 2 was implanted in the right hemisphere, so the anterior-posterior direction is reversed. IFG electrodes in Subject 9 were excluded because they were too close to the tumor margin. CS = central sulcus; SF = Sylvian fissure.

Figure 3. Variation of cortical activity with intra-word position of phonemes and gestures. Phoneme-related activity changes with context, while gesture-related activity does not. (A) Mean (±SD, shaded areas) high gamma activity on two electrodes in subject S5, aligned to onset of the phoneme (left) or gesture (right) event. Activity is separated into all instances of each event (/t/ or /k/ for phonemes, tongue tip closure (TTC) or tongue body closure (TBC) for gestures) occurring either at the beginning of a word (light green) or at the end of a word (dark green). Gray dashed lines mark the [-100, 50] ms interval around onset. (B) An example of classification accuracy (mean ± 95% CI) of intra-word position on electrodes related to either the tongue body (e56; left, same as bottom plots in A) or the tongue tip (e46; right, same as top plots in A) in subject S5, for phonemes (blue) and gestures (red). Gestural position classification does not outperform chance (gray), while phonemic position classification performs significantly better than chance. (C) Spatial distribution of d′ for the differences between phonemic or gestural position accuracy and chance. On tongue-tip- and tongue-body-related electrodes (outlined), phonemic position accuracy is much higher than chance, whereas gestural position accuracy is not. Shaded areas correspond to cortical areas as in Figure 2A.
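For reference, d′ in this and the following legends presumably denotes the standard discriminability (sensitivity) index between two distributions of accuracies; a conventional form, which may differ from the exact estimator used in this study, is

$$ d' = \frac{\mu_1 - \mu_2}{\sqrt{\tfrac{1}{2}\left(\sigma_1^2 + \sigma_2^2\right)}} $$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the two accuracy distributions being compared (e.g., phonemic-position accuracy versus chance in panel C).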


Figure 4. Classification of phonemes and gestures. (A) Mean (±SEM over subjects) classification accuracy of phonemes (blue squares) and gestures (red circles) using combined aPCG and pPCG activity. Shown are both raw accuracy (left; dotted lines indicate chance accuracy) and accuracy relative to chance (right). Gestures were classified significantly (*) more accurately than phonemes. (B) Classification accuracy for phonemes (blue) and gestures (red) using activity from IFG, aPCG, and pPCG separately, for subject S5 (left, ±SD) and the population mean (right, ±SEM). (C) Accuracy relative to chance in each area for subject S5 (left) and the population mean (right). Gesture classification was significantly more accurate than phoneme classification in pPCG and aPCG (*). (D) d′ values (mean±SEM over subjects) between gesture and phoneme accuracies in each area. Source data are included for A and B-D.

Figure 5. Classification of consonant allophones using ECoG from each cortical area. (A) Examples of audio waveforms, averaged spectrograms, and gestures for an allophone set ({/t/, /st/, /d/}) aligned to vowel onset (black vertical line). Only the trajectories of articulators that differ across these phonemes are depicted (TT: tongue tip; Lx: larynx; filled circle: closed, open circle: open, half-filled circle: partial closure (critical)). Colors throughout the figure represent VLC (/t/, blue), VC (/d/, orange), and CClA (/st/, gray). (B) Examples of normalized high gamma activity (mean±SE) at 3 electrodes during /t/, /d/, and /st/ production in subject S5. Allophone onset is at time 0. One electrode from each cortical area is shown. CClA activity (gray) in these IFG and aPCG electrodes is more similar to the VLC (blue), especially around time 0, while in pPCG it is more similar to the VC (orange). (C) Schematic depicting three different idealized performance patterns in a single cortical area. Solid circles denote classification performance for VLCs (blue) and VCs (orange) using their respective classifiers. Gray-filled circles denote CClA classification performance using the VLC (blue outline) and VC (orange outline) classifiers. High CClA performance (close to that of the respective solid color) would indicate that the allophone behaved more like the VLC or VC than like other consonants in the data set. Blue rectangle: CClA performed similarly to the VLC; orange rectangle: CClA performed similarly to the VC; green rectangle: CClA performed differently from both VLCs and VCs. (D) Classification performance (mean±SEM across subjects and allophone sets) in each cortical area of VLCs and CClAs on the voiceless classifiers, and of VCs and CClAs on the voiced classifiers. In pPCG, CClAs perform much worse on the VLC classifiers than VLCs do, while the gap is much smaller in IFG and aPCG. The opposite trend occurs for CClA performance on the VC classifiers. (E) d′ values (mean±SEM across subjects and sets) between singlet consonant performance and allophone consonant performance for each area; larger values indicate greater discriminability. Blue circles: VLC vs. CClA performance using VLC classifiers; orange circles: VC vs. CClA performance using VC classifiers. In summary, CClAs perform more like VLCs and less like VCs moving from posterior to anterior. Source data are included for (D).
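The cross-classification comparison summarized in panels C-E can be illustrated with a short sketch. The code below is purely illustrative: synthetic Gaussian feature vectors stand in for the ECoG high gamma features, and a generic linear discriminant classifier stands in for whatever classifier was actually used; all variable names, class centers, and trial counts are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_features = 40  # stand-in for flattened (electrodes x time bins) high gamma features


def make_trials(center, n_trials=60, spread=1.0):
    """Draw synthetic single-trial feature vectors around a class center."""
    return center + spread * rng.standard_normal((n_trials, n_features))


# Hypothetical class centers: a voiceless consonant (VLC, e.g. /t/), a voiced
# consonant (VC, e.g. /d/), and the remaining singlet consonants in the set.
vlc_center = rng.standard_normal(n_features)
vc_center = rng.standard_normal(n_features)
other_centers = [rng.standard_normal(n_features) for _ in range(4)]


def build_classifier(target_center):
    """Train a one-vs-rest LDA: target singlet consonant vs. all other singlets."""
    x_target = make_trials(target_center)
    x_other = np.vstack([make_trials(c) for c in other_centers])
    x = np.vstack([x_target, x_other])
    y = np.concatenate([np.ones(len(x_target), dtype=int),
                        np.zeros(len(x_other), dtype=int)])
    return LinearDiscriminantAnalysis().fit(x, y)


vlc_clf = build_classifier(vlc_center)
vc_clf = build_classifier(vc_center)

# Cluster allophone (CClA, e.g. the /t/ in /st/), modeled here (arbitrarily)
# as lying halfway between the VLC and VC centers.
ccla_trials = make_trials(0.5 * (vlc_center + vc_center))

# CClA "performance" on each singlet classifier: the fraction of CClA trials
# that each classifier accepts as its own target consonant.
print("CClA on VLC classifier:", vlc_clf.predict(ccla_trials).mean())
print("CClA on VC classifier:", vc_clf.predict(ccla_trials).mean())
```

The point of the sketch is only the comparison itself: train one classifier per singlet consonant, then ask how often the cluster allophone tokens are accepted by the voiceless versus the voiced classifier; an asymmetry of that kind, computed area by area, is what panels C-E summarize across subjects and allophone sets.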

