kellen_thesis

Modeling Human Phonotactic

Judgments Using

Classifiers

A Thesis

Presented to the

Faculty of

San Diego State University

In Partial Fulfillment

of the Requirements for the Degree

Master of Arts

in

Linguistics

by

Kellen C. Stephens

Spring 2013

SAN DIEGO STATE UNIVERSITY

The Undersigned Faculty Committee Approves the

Thesis of Kellen C. Stephens:

Modeling Human Phonotactic

Judgments Using

Classifiers

Jean-Marc Gawron, ChairDepartment of Linguistics and Asian/Middle Eastern Languages

Robert MaloufDepartment of Linguistics and Asian/Middle Eastern Languages

Chris WerryDepartment of Rhetoric and Writing Studies

Approval Date

iii

Copyright c© 2013by

Kellen C. Stephens

iv

DEDICATION

This work is dedicated to my loving girlfriend, Courtney. Without you none of this wouldhave been possible.

v

ABSTRACT OF THE THESIS

Modeling Human PhonotacticJudgments Using

Classifiersby

Kellen C. StephensMaster of Arts in Linguistics

San Diego State University, 2013

This work investigates the problem of modeling human phonotactic acceptabilityjudgments. In particular, the question of which type of model is best suited to this task:models based on the probability of the phone segments in a particular word (phonotacticprobability) or models based on the number of words that share phone sequences with aparticular word (neighborhood density). The models developed in this work were based onclassifiers, a type of machine learning model that is able to determine whether a new data itembelongs to a particular class. Two type of classifiers were used, the perceptron: a linearclassifier that is based on the number of times phone sequences occur (essentially aprobability based model) and k Nearest Neighbor (kNN), a non-linear classifier that is basedon the number of existing data items that are similar to a new data item (essentially aneighborhood density model). Two types of kNN models were developed: one based on twoclasses (like the perceptron models) and one based on multiple classes. The quality of thesemodels was evaluated by comparing their judgments against the judgments of human subjectson a short phonotactic acceptability task. Human subjects were asked to rate a set of fiftynonce words from 1 to 7 based on their assessment of the phonotactic acceptability of thesewords. Comparison was calculated using the Spearman correlation. The two-class classifiermodels (based on perceptron and kNN) performed by far the best and both performedapproximately equally well under optimal conditions, while the multi-class kNN modelperformed considerably worse. The results did not indicate that one type of model(phonotactic probability or neighborhood density) performed better than the other. However,it is possible that both phonotactic probability and neighborhood density effects areinfluencing the performance of both types of models. And finer grained differences mayemerge using a phonotactic acceptability task that employs more sophisticated means tocontrol for both types of effects.

vi

TABLE OF CONTENTSPAGE

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

LIST OF TABLES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 MACHINE LEARNING COMPONENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 The Vector Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 The Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4 Non-Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.5 The k-Nearest Neighbor Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.6 The Clustered Neighborhood Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.6.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.6.2 Agglomerative Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.7 Summary of Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 METHODOLOGY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.1 Classifier Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.2 Feature Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2 Verification of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3 Clustered Neighborhood Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4 Additional Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.5 The Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6 Summary of Models Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.6.1 Perceptron Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.6.2 kNN Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

vii

3.6.3 Clustered Neighborhood Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.6.4 N-Phone Conditional Probability Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.7 Summary of Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1 Perceptron Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1.1 Uniphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1.2 Biphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1.3 Triphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.1.4 All Phone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.5 Summary of Perceptron Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.2 kNN Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.2.1 Uniphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.2.2 Biphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2.3 Triphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2.4 All Phone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2.5 kNN Results Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.3 Clustered Neighborhood Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.4 Probability Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5 DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.1 Addressing Some Loose Ends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.1.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.1.2 kNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.1.3 Clustered Neighborhood Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.2 Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

APPENDIX

TEST DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

viii

LIST OF TABLESPAGE

Table 3.1. Summary of Negative Training Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Table 3.2. Uniphone Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Table 3.3. Biphone Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Table 3.4. Triphone Features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Table 3.5. All Phone Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Table 3.6. Biphone Clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Table 3.7. Triphone Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Table 3.8. Uniphone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

Table 3.9. Biphone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

Table 3.10. Triphone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Table 3.11. All Phone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Table 3.12. Uniphone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Table 3.13. Biphone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Table 3.14. Triphone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Table 3.15. All Phone Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Table 3.16. Clustered Neighborhood Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Table 3.17. N-Phone Conditional Probability Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Table 4.1. Perceptron Experiment 1 Uniphone Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39




Table 4.5. Perceptron Experiment 1 Biphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42




Table 4.9. Perceptron Experiment 1 Triphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Table 4.10. Perceptron Experiment 2 Triphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


ix


Table 4.13. Perceptron Experiment 1 All Phone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49




Table 4.17. kNN Experiment 1 Uniphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54




Table 4.21. kNN Experiment 1 Biphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60




Table 4.25. kNN Experiment 1 Triphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67




Table 4.29. kNN Experiment 1 All Phone Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74




Table 4.33. Clustered Neighborhood Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Table 4.34. Probability Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Table 5.1. Perceptron Experiment1 Biphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Table 5.2. Perceptron Experiment4 Biphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Table 5.3. Perceptron Experiment4 Triphone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Table 5.4. Perceptron Experiment 1 All Phone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

Table 5.5. Perceptron Experiment 4 All Phone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Table 5.6. Extended Clustered Neighborhood Results vs Original ClusteredNeighborhood Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Table 5.7. Basic Neighborhood Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Table A.1. Test Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

x

LIST OF FIGURESPAGE

Figure 2.1. A linearly unseparable problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Figure 5.1. kNN biphone results varying k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

Figure 5.2. kNN triphone results varying k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Figure 5.3. kNN all phone results varying k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Figure 5.4. kNN all phone results varying k extended. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

xi

ACKNOWLEDGEMENTSI would like to thank my thesis committee for their time and comments. In particular, I

would like to thank Dr. Gawron for his continued support and guidance, and for giving me theopportunity to work on research.

I would also like to thank my mother and grandfather for their continued support. Andmy girlfriend Courtney for her love and patience.

1

CHAPTER 1

INTRODUCTION

The domain of phonotactics is generally concerned with describing the sequences ofsounds that are allowed in a language, in other words, sequences of sounds that speakers ofthat language find acceptable. In early phonological theories, this distinction was thought tobe categorical: a word would either be acceptable or not, with no middle ground. However,many researchers have found that phonotactic acceptability is gradient [2, 15, 21]. Interest inmodeling human phonotactic judgments increased following the publication of [9]. Colemanand Pierrehumbert developed a probabilistic context-free grammar (PCFG) of the onsets andrhymes of English syllables. The model was trained on 48,580 English monosyllabic anddisyllabic words, each parsed according to their onset-rhyme structure. The model was testedon 116 nonce words developed for an earlier phonotactic acceptability study using humanjudgments [8]. The probabilities produced by the model were taken by Coleman andPierrehumbert to represent the phonotactic acceptability of the nonce words. The authorscompared the acceptability scores produced by their phonotactic PCFG to the humanjudgments for the same data obtained in the earlier study [8]. There was significant correlationbetween the human acceptability judgments and those produced by the PCFG model. Thissuggests that the PCFG model is capturing a lot of the information that represents humanphonotactic knowledge, and that gradient acceptability is a large part of this knowledge.

Coleman and Pierrehumbert found that acceptability scores were highly dependent ontwo factors: phonotactic probability and neighborhood density. Phonotactic probability refersto the frequency that a segment or sequence of segments occurs in a language. Segments orsequences of segments that occur more often will have higher phonotactic probability.Phonotactic probability is defined over sub-sequences of a word; the sequences are typicallybiphone sequences (sequences of two segments). The biphone probability of the sequence ab

is typically defined to be the conditional probability of b given a, Pr(b | a). A number ofmethods have been used to derive the probability of a word from the probabilities of itsnphone sequences, like the joint probability of the nphone sequences in the word or theaverage probability of the nphone sequences in the word (Listing 1, adapted from [1]).Average probabilities have typically been preferred in the literature [2, 49].

2

Probability of abcd

a Joint Probability of abcd

Pr(abcd) = Pr(a | #)× Pr(b | a)× Pr(c | b)× Pr(d | c)× Pr(# | d) (1.1)

b Average Probability of abcd

Pr(abcd) =Pr(a | #) + Pr(b | a) + Pr(c | b) + Pr(d | c) + Pr(# | d)

|abcd |(1.2)

Neighborhood density refers to the number of other words that are phonotacticallysimilar to a given word. A word that is phonotactically similar to a large number of otherwords will have a high neighborhood density. The neighborhood density is typically definedto be the number of words that differ from a given word by one edit distance; that is, wordsthat can be derived from a given word using one addition, deletion, or subtraction. Forexample, the nonce word [strajb] has neighbors like strike, stride, tribe, strobe, etc.

Frisch et al. [15] confirmed these results with a more elaborate and controlled data set.Frisch et al. [15] used the phonotactic PCFG model architecture developed by Coleman andPierrehumbert [9] to construct a data set that controlled for length and constituent probability.The length of each example varied from one to four syllables, and constituent probabilitiesembodied the full range of possibilities, from very high to very low. This created a widevariety of possible word forms, considerably more varied than the Coleman andPierrehumbert [9] data. The authors used this data in three phonotactic acceptability tasksinvolving human subjects, a rating task (on a 1-7 scale, similar to Coleman and Pierrehumbert[9]), a word recall task, and an accept-reject task, in which subjects were asked to decidecategorically whether a nonce word is acceptable. In all three tasks, Frisch et al. [15] foundphonotactic probability to be a significant factor, just like Coleman and Pierrehumbert [9].

As a comparison to the phonotactic PCFG, the authors constructed a neighborhoodbased model, but this model is somewhat different than the more standard definition that weintroduced above. The authors consider the neighborhood of a word to be all of the words thatshare at least 66% of its segments in the appropriate positions. For example, in this schemethe words /[email protected]@t/ and /[email protected]@t/ would be a part of the same neighborhood. A nonce wordwas scored by the log of the number of neighbors that it has. The authors report that thismodel did not perform significantly worse than the PCFG model on the rating task. Thissupports the conclusions of Coleman and Pierrehumbert [9]: that the acceptability ratings areclosely related to both phonotactic probability and neighborhood density. Although, theauthors do note that phonotactic probability and neighborhood density are highly correlated,

3

since high probability sequences are high probability because they occur in many words, andthus have a dense neighborhood.

In a more sophisticated study of the interaction between phonotactic probability andneighborhood density in determining phonotactic acceptability, Bailey and Hahn [2] foundboth factors to be significant to determining phonotactic acceptability. Indeed, neighborhooddensity was found to be the more significant factor. Bailey and Hahn [2] proposed theGeneralized Neighborhood Model (GNM) as a robust model of neighborhood density. It isbased on the standard neighborhood model that we introduced above, but extends theneighborhood range to include words n edit distances away, also incorporating informationabout the severity of adding, removing, and changing segments. This enables the GNM toscore words with no neighbors a single edit distance away, a well-known weakness of thetraditional neighborhood density model. For example, a nonce word like drusp has noneighbors a sigle edit distance away. Thus the traditional neighborhood model would not beable to distinguish between drusp, a reasonably acceptable word in English, and bzarshk, aclearly unacceptable word in English.1 Both would receive low scores from the traditionalneighborhood model. The GNM, on the other hand, would find many neighbors just a few editdistances away from drusp (trust, dusk, rusk, truss, dust, etc.), giving it a much higher scorethan bzarshk, which has few, if any, neighbors within a few edit distances.

The GNM is a log-linear model that classifies new examples based on their similarityto actual words. The log-linear form of the classification rule, Equation 1.3, is closely relatedto the linear models introduced in the next chapter.

score(w) =∑w∈NN

weightw ×e−di,w

s (1.3)

Where NN is the set of neighbors, weightw is the weight associated with word w, di ,wis the distance between the unknown item i and the real word w (calculated as edit distance),and s is a model parameter that determines how much influence very similar neighbors haveon the score. The result is a classification rule that favors dense neighborhoods with a highdegree of similarity.

Bailey and Hahn [2] compared this model to a stochastic model based on averageprobability, as defined in Listing 1. As we mentioned above, they found that both phonotacticprobability and neighborhood density play a significant role in determining phonotacticacceptability, but that the effect of neighborhood density was measurably greater, leadingthem to conclude that neighborhood density is the more important of the two factors.

1This example was adapted from [1]

4

Other research, however, has suggested that phonotactic probability may in fact be themore significant factor to determining phonotactic acceptability for nonce words in particular.Vitevitch and Luce [49, 50, 51] found that human subjects repeated high probability noncewords with dense neighborhoods more quickly than low probability nonce words with sparseneighborhoods, as might be expected. Interestingly they found the converse to be true foractual words. In other words, human subjects repeated actual words with low probability andsparse neighborhoods more quickly than actual words with high probability and denseneighborhoods. This is unexpected based on the other results that we have discussed.Vitevitch and Luce hypothesize that humans have two levels of phonotactic processing toaccount for their seemingly contradictory result: a lexical level, where whole words areprocessed, and a sub-lexical level, where phonetic sequences are processed. Of course there issome interaction between the two levels, but nonce words are primarily processed at thesub-lexical level, the domain of phonotactic probabilities. Actual words, on the other hand,are primarily processed at the lexical level, the domain of neighborhood density. Processing atthe lexical level requires more time, because other, similar words also receive input activationfrom the unknown word and lexical information must be accessed. Thus, in the view ofVitevitch and Luce, since nonce words are primarily processed at the sub-lexical level,phonotactic probabilities should be the more important factor.

Further support for the precedence of phonotactic probability in the determination ofphonotactic acceptability of nonce words comes from Albright [1]. In the comparison ofphonotactic probability and neighborhood density models (among them the GNM), theprobability models consistently produced better ratings over the entire range of acceptability.This was particularly true for nonce words that contained low probability sequences, despitethe improvements of the GNM. The lexical models were unable to score nonce words withextremely rare, but acceptable sequences, making them indistinguishable from nonce wordsthat are completely unacceptable. This is certainly not a desirable quality in a phonotacticacceptability model, since human respondents routinely make this distinction. The probabilitymodels, on the other hand, performed much better, in large part due to their ability to handlelow probability/low density sequences.

As Albright [1] points out the differences between his results and the results of Baileyand Hahn [2] could have arisen due to differences in the data. Human subjects may have hadmore difficulty rating the Bailey and Hahn [2] data since, on the whole, the Bailey and Hahn[2] data fall near the middle of the acceptability scale, without representative examples of theextreme values. Albright [1] suspects these ratings may be skewed since the human subjectshad no representative extreme values on which to base their judgments. In addition to this,Bailey and Hahn [2] included real words in their acceptability task. The ratings for the real

5

words were not included in the final results. However, using real words has been shown toaffect the outcome of this type of experiment. Shademan [43] demonstrated that the perceivedneighborhood density effect can be affected by the number of actual words included in the testdata. Shademan [43] theorizes that including real words in the test data activates the lexicallevel of processing to a greater degree (recall that Vitevitch and Luce claim that nonce wordsare primarily processed at the sub-lexical level), leading to an exaggerated neighborhooddensity effect.

On the other hand, since neighborhood density models have extreme difficulty scoringwords near the bottom end of the scale, they may be entirely unsuited for this type ofmodeling. Albright [1] points out a major weakness of the neighborhood density models. It isentirely possible for a nonce word to contain a major phonotactic violation and still have alarge neighborhood. Albright [1] gives an example: [sru:] has a number of neighbors only oneedit distance away. For example, brew, crew, drew, grew, roux, screw, shrew, etc., but itclearly contains a highly unacceptable sequence in English, [sr]. A neighborhood densitymodel, like the GNM, would rate [sru:] on par with a nonce word like [fru:], which isundeniably more acceptable in English. This entails that neighborhood density models areessentially unable to account for overt phonotactic violations (like [sr]), which would seem tobe an essential quality in any phonotactic acceptability model.

The present work takes up the problem laid out by Albright [1], that is, a comparisonof phonotactic probability and neighborhood density models, but from a very differentperspective. Some of the techniques that we used were first developed in information retrieval(see for example [34]), but have since become widely used in a number of different domains.We used a type of model commonly known in machine learning as a classifier to developmodels that are essentially equivalent to the phonotactic probability and neighborhood densitymodels that we discussed above, albeit in a very different form. The first series of models thatwe developed are based on the perceptron algorithm, a very simple linear classifier. Thesemodels can be equated to phonotactic probability models, even in the very simplistic form thatwe will discuss, because their outcome is proportional to the frequency of occurences of asegment or sequence of segments. The other models that we will discuss are based on thekNN algorithm, a very simple non-linear classifier. These models can be equated toneighborhood density models, since their outcome is based on similarity to known examples.We applied these models to a novel phonotactic acceptability task and compared the results tothe results of human respondents on the same task [9, 15].

There is ample motivation to model this problem using classifiers. In fact, the GNM ofBailey and Hahn [2] is a type of classifier, and is closely related to the perceptron classifiersthat we develop in this work. Basically, a classifier is a type of machine learning model that is

6

able to decide whether or not a new data item belongs to a particular class of like items. Inphonotactic modeling, we might use classes that represent phonotactic neighborhoods, oreven classes that simply represent ’acceptable’ and ’unacceptable’. Classifiers have beenapplied to many different types of problems, both in- and outside of Linguistics. They areoften found to perform as well, or better, than other types of models for these problems.Linear classifiers have been applied to a number of problems in Natural Language Processing,for example, part-of-speech tagging [40, 47, 48], natural language parsing [31, 36, 38], toname just a few. Classifiers have also proven successful in Theoretical and Psycholinguistics:the dative shift in English [5], modeling stress patterns [19], and developing theoreticgrammars that are capable of handling gradience [18, 22, 24, 25]. In fact, classifiers areparticularly well-suited to the task of grammatical gradience since most base their decisionson some notion of a real-valued score, which makes modeling gradient phenomena quitenatural. This of course is one of the motivating factors behind this study.

The remainder of this thesis proceeds as follows: In the next chapter, we introduce thelearning algorithms that we used for this paper, one linear, the perceptron, and and onenon-linear, kNN. We also discuss certain aspects of the kNN-derived clustered neighborhoodmodel, that we develop further in later chapters. The clustered neighborhood model is aninterpretation of a neighborhood model based on clustering the training data. It is, at least inspirit, a more faithful interpretation of a neighborhood model that the two-class kNN modelsthat we develop. In the third chapter, we introduce the major components of our models andhow they were trained. We also develop the clustered neighborhood model further, noting anumber of similarities to the GNM. We also introduce our test data and our procedures forcollecting human responses. In the fourth chapter, we review our results. The fifth chapteraddresses some issues with our models and discusses the work as a whole. Finally, the sixthchapter summarizes our work.

7

CHAPTER 2

MACHINE LEARNING COMPONENTS

We saw in the last chapter that there is some disagreement in the literature as towhether phonotactic probability or neighborhood density models are better for modelinghuman phonotactic judgments. Early models made little attempt to separate the two [9, 15].Later research focused on separating the two factors, usually by comparing the performanceof phonotactic probability and neighborhood density models on the same test data. Bailey andHahn [2] found that neighborhood density models, particularly the GNM, performed better ontheir data. In a follow-up study, Albright [1] found just the opposite: that phonotacticprobability models have the edge in performance. In this chapter, we introduce the learningalgorithms we used to train our versions of phonotactic probability and neighborhood densitymodels: the perceptron and kNN. The models that we use are typically called classifiers. Webriefly saw in the last chapter that a classifier is simply a model that can predict whether apiece of data belongs to a particular class. Despite this apparent simplicity however, classifiersinclude a large class of models of great power and flexibility. Predictions are usually based onsome notion of shared features. If a new data point shares a lot of features with the membersof a particular class, there is a good chance that it too belongs to that class. The features useddepend a lot on the problem. We will have more to say about features in the next chapter, first,however, we will introduce the classifiers themselves.

The chapter begins with an introduction to the Vector Space Model, the basic conceptthat underlies classification. We discuss the distinction between linear and non-linearclassifiers, and describe the classifiers underlying our phonotactic probability andneighborhood density models, the perceptron and kNN, respectively. We also discuss thebias-variance tradeoff, a common concern in machine learning, and how it can affect classifiermodeling. Finally, we give the basic architecture of the clustered neighborhood model, whichis based on agglomerative clustering, a technique for grouping data into subsets based onsimilarity. This technique represents a considerable departure from previous interpretations ofthe neighborhood density model, but is conceptually very similar. We conclude with asummary of the chapter.

2.1 THE VECTOR SPACE MODELThe Vector Space Model is essential to most modern classification algorithms. The

basic idea is that the problem space can be mapped into high-dimensional vector space. Thus,

8

each data point can be represented in terms of numerical features as a vector, which has someposition in vector space. Vectors that are similar should fall into similar regions of vectorspace, and hopefully these regions do not overlap with one another. This simple idea makesclassification possible. These regions represent different classes. Thus, all vectors that fallinto similar regions of vector space are most likely members of the same class. This conceptis known as the contiguity hypothesis [34].

We represent our data in terms of features as a high-dimensional vector; each featurerepresents a separate dimension in the vector space. The features used largely depend on theproblem at hand and can vary a great deal. For example, in the phonotactic models presentedin this paper, the features we use represent phonological segments, or sequences of segments.But other types of problems may call for different types of features, like words in textclassification [26, 27, 28] or color statistics in image classification [44]. In linear models likethe perceptron, each feature is associated with two real numbered values, a magnitude and aweight. Again, just like the features themselves, there is a great deal of flexibility regardingthe weighting schemes used. However, different weighting schemes are more suitable fordifferent problems. For example, a family of weighting schemes known as TF-IDF (TermFrequency-Inverse Document Frequency) are quite common in text classification. A wordfeature’s weight is derived from the number of training documents the feature occurs in (theInverse Document Frequency, the weight) multiplied by the number of times the featureoccurs in the current example document (Term Frequency, the magnitude). However, TF-IDFwould make little sense in the context of deciding phonotactic acceptability. We used a simplebinary scheme in our models. A feature receives a value of 1 when it is present in an exampleand a value of 0 if the feature is not present. All features are therefore considered equally inthe evaluation, which is a reasonable starting point. We leave more sophisticated weightingschemes for future research.

As we have seen, the vector space model offers a great deal of flexibility. It allows usto incorporate a large amount of information into a model with little overhead. New featurescan be added and others removed relatively easily. There are few practical limits on thenumber of features used, since vectors are typically sparse in practice. That is, only thefeatures with non-zero weights are represented in the vector. This allows us to use as manyfeatures in our models as necessary to adequately describe the problem, since most of thefeatures will not be represented in a given vector. Conceiving of a problem in terms of thevector space model offers a great deal of power and flexibility, while (generally) remainingefficient in both learning and classification.

9

2.2 LINEAR CLASSIFIERSWhen we consider a problem in terms of the vector space model, the Contiguity

Hypothesis suggests the basic ideas underlying classification. If a group of vectors falls withinone region of the vector space and another group of vectors falls within a separate region ofthe vector space, we should be able to learn a model that will be able to predict which of thesetwo groups, or classes, a new vector is most similar to based on where it is located in vectorspace. The machine learning algorithms that are typically used to learn such models areknown as classifiers. There are a number of classifier options available and each works in aslightly different way, but they all attempt to accomplish the same goal: with a high degree ofaccuracy. Essentially, a classifier learns to model where the vectors representative of eachclass are located in vector space. It can then compare new examples to this model todetermine which class the new example belongs to based on where it is located relative to theregions associated with each class.

Mathematically, the simplest type of classifier, based on the type of model that itlearns, is called a linear classifier. Linear classifiers are so called because the classificationdecisions that they make are based on some kind of linear combination of features, in otherwords a linear model. For a two class problem with only two dimensions (vectors with onlytwo features), this is simply a straight line separating the two classes. All of the examples ofone class fall on one side of this line and all of the examples of the other class fall on the otherside. In higher dimensions, this line becomes what is called a hyperplane, but both are definedby a linear equation, which is presented as Equation 2.1.

−→w T−→x = b (2.1)

Here, −→w is a weight vector, or model, learned during training, −→x is the examplevector, and the scalar b is the decision threshold. In classification, we compare the weightvector −→w to the example vector −→x by computing their dot product. We then compare thisvalue to the decision threshold b. In a two class problem, for example, if the result of the dotproduct is larger than b, the example belongs to one class; if it is less than b, the examplebelongs to the other class. In general, an example is assigned to a class C if −→w T−→x > b andassigned to class C ′ if −→w T−→x ≤ b. It is actually possible to simplify Equation 2.1 even further,and doing so gives a good example of the flexibility of vector space models. We can factor binto the weight vector −→w as a constant feature, which simplifies Equation 2.1 to −→w T−→x = 0.Thus, an example is assigned to class C if −→w T−→x > 0 and class C’ if −→w T−→x < 0.

Despite this apparent simplicity, computing the model parameters −→w and b canbecome quite complex. First of all, linear classifiers are not suitable for all problems. It isentirely possible that no linear hyperplane exists that will accurately separate the data, in other

10

words, the data are not linearly separable. When a separating hyperplane does exist, there arein fact an infinite number of possible separating hyperplanes. Which one we get depends on anumber of factors. The most important of these is of course the learning algorithm that weuse. Different linear classifiers compute the model parameters −→w and b in different ways.This often has an effect on the model (the linear separator) that is produced. For certainclassifiers, like the perceptron, even the structure of the training data can affect the classifierthat is learned. Other classifiers, like the Support Vector Machine (SVM), use verysophisticated means to ensure that the separator learned from training is less sensitive toidiosyncrasies of the training data and is in fact optimal. Finally, real data for most problemsare rarely perfectly separable by a linear hyperplane. Some examples are bound to fall on thewrong side of the hyperplane. These are noise vectors. If the classifier pays too muchattention to noise, it can produce a linear separator that essentially memorizes the data. This isknown as overfitting. In the worst case, too much noise could prevent the classifier fromfinding a separator at all. In either case, these models could not be applied to new data.

The task of learning an accurate model can be quite complex, but most modernclassifiers are capable of producing reasonably accurate models given a suitable problem. Wewill look at one linear classifier in more detail: the perceptron. It is quite simple, yet managesto model phonotactic judgments reasonably well. Our perceptron models are comparable tothe phonotactic probability models that have been proposed in the literature [1, 2, 9, 21],although it may not seem to be at first pass. The weights in the model that is produced aredirectly proportional to the frequency of occurence of our features: segments and sequencesof segments, just like the phonotactic probability models. Thus, we can encode a stochasticmodel implicitly, albeit in a somewhat different form, that is comparable to the phonotacticprobability models.

2.3 THE PERCEPTRON ALGORITHMThe perceptron algorithm dates back to the early days of artificial intelligence. It was

originally developed in the late fifties and early sixties [41], but is still used in more modernguises today, like the structured perceptron [10], MIRA (margin infused relaxed algorithm)[36], and the multi-layer perceptron (neural networks) [42]. While these more moderninterpretations are more robust and often make better generalizations overall, they are stillbased on the relatively simple principles of the original perceptron.

Learning with the perceptron is quite simple. As with any other linear classifier, thegoal of perceptron learning is to compute a weight vector −→w such that all of the examples ofclass C are on one side of the decision hyperplane and all of the examples of class C ′ are onthe other side. More formally, we want to find the −→w such that for all −→x c in class C,

11

−→w T−→x c > 0 and for all −→x c′ in class C ′, −→w T−→x c′ ≤ 0. During training the goal is to find the −→wthat minimizes the objective function: the Perceptron Criterion, presented as Equation 2.2.

EP (−→w ) = −

∑ı∈M

−→w T−→x ıCı

Cı = +1 if xı ∈ C

Cı = −1 if xı ∈ C ′

(2.2)

Where M is the set of incorrectly classified examples, −→x ı and Cı are the feature vectorand class for example ı, respectively. The perceptron criterion gives the total error rate for −→wover the training data. The goal then of perceptron learning is to find the −→w that minimizes theperceptron criterion. This is an example of gradient-descent learning: with each iteration, −→wis moved in the direction that minimizes the error rate. Minimizing the error rate on thetraining set is equivalent to maximizing the accuracy of the classifier on the training set, soactual implementations of the Perceptron Criterion vary. We describe our method below.

Despite these variations, perceptron learning is fairly standard. The learning algorithmiterates through the training data, classifying each example in turn. The classification decisionis determined by the dot product of the weight vector and an example vector. The result iscompared to 0 (using the equation we derived from Equation 2.1). The class of an examplevector depends on whether the dot product is positive or negative. If the example is classifiedcorrectly, no changes to the weight vector are required. If the example is classified incorrectly,however, the weight vector must be adjusted, because the hyperplane that it describes does notadequately separate the training data. The adjustments made to the weight vector depend onhow the example should have been classified. If the example, −→x ı, should have been positivebut was classified negative, −→x ı is added to −→w . On the other hand, if −→x ı should have beennegative but was classified positive, −→x ı is subtracted from −→w . The weight vector is adjustedeach time it classifies a training example incorrectly. This is known as online learning: ratherthan minimizing the error rate on the entire training set at once (a batch algorithm, like theSVM), the perceptron minimizes the error of a single example at a time. As one mightimagine, adjusting the weight vector to correctly classify one example may lead to themisclassification of other examples that the model had correctly classified in previousiterations. Despite this however, the perceptron algorithm is guaranteed to find a linearseparator if one exists in a finite number of iterations (numerous proofs of convergence existfor the perceptron, see [41] or [37], to name a few). The Perceptron Convergence Theoremdoes not guarantee however that the optimal solution will be found. Which model is producedcan depend on a number of factors, like the order of the presentation of the training data. Thishas to do with the bias-variance tradeoff that we will discuss later in this chapter. Despite this

12

however, the perceptron is easy to implement and often performs reasonably well on linearlyseparable problems.

Algorithm 1 The Perceptron Algorithm1: function DECISION(−→x ,−→w ) . Decision Function2: if −→w · −→x > 0 then3: return +1

4: else5: return −16: end if7: end function

8: −→w ← −→0 . Initialize Variables

9: while not converged do . Perceptron Learning Algorithm10: d← DECISION(−→x j,

−→w )

11: if CLASS(−→x j) = d then12: continue13: else if CLASS(−→x j) = +1 ∧ d = −1 then14: −→w [θ]← −→w [θ]− 1 . Decrement Threshold Feature15: −→w ← −→w +−→x j

16: else if CLASS(−→x j) = −1 ∧ d = +1 then17: −→w [θ]← −→w [θ] + 1 . Increment Threshold Feature18: −→w ← −→w −−→x j

19: end if20: end while

The perceptron algorithm (adapted from [33]) is presented as Algorithm 1. Theweight vector, which includes the threshold feature −→w [θ], is initialized to

−→0 . The learning

algorithm iterates through the training data, calling the decision function on each example.The decision function computes the dot product of the weight vector, −→w , and an example, −→x .If the result is larger than the threshold, the decision function classifies −→x in the +1 class,otherwise −→x is classified in the −1 class. When an example is misclassified, adjustments aremade to the weight vector as discussed above. Our implementation continues to iterate overthe training data until the decrease in the error rate between iterations is less than a 0.1 for tenconsecutive iterations. The perceptron then returns its model.

13

2.4 NON-LINEAR CLASSIFIERSWe have mentioned several times in this chapter that it is possible that a problem

might not be linearly separable, in which case a linear classifier would not be able to find asolution. In Figure 2.1 (adapted from [34]), for example, no linear classifier would be able toclassify the circular region correctly. This type of problem requires a more powerful type oflearning algorithm (or more dimensions) to model correctly. If we model the data in twodimensions (as in Figure 2.1) a non-linear classifier would be required. Non-linear classifiersare able to model much more complex decision surfaces, like the circular region in Figure 2.1.This increased power, however, comes at a price. To understand why, we must understand thebias-variance tradeoff.

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Figure 2.1. A linearly unseparable problem.

As the name suggests, the bias-variance tradeoff has to do with two factors that affectthe performance of a machine learning model, all else being equal: bias and variance. Wehave already seen an example of how variance can affect a machine learning model. Wementioned that the model produced by the perceptron can vary depending on certain aspectsof the data, like the order of the presentation of the examples. Thus the perceptron has highervariance than a learning algorithm that can guarantee an optimal solution, like the SVM. Thisis also true of many non-linear learning methods, and has to do with the allowed memory

14

capacity of the learning method. The SVM takes steps to manage the memory capacity of themodel, which prevents overfitting. The perceptron and many non-linear models do not takethese steps to manage memory capacity. In other words, they are able memorize more aboutthe training data. This is especially true for non-linear classifiers, since they are capable oflearning complex non-linear decision surfaces. This can result in particular sensitivity to noisedata, examples that do not conform to the general model. A model that pays too muchattention to the noise data may not generalize well to new data. Again, this is an example ofoverfitting the training data.

Bias, on the other hand, has more to do with accuracy than consistency. A learningmethod has high bias if it produces classifiers that consistently classify certain examplesincorrectly. We have seen a few examples of this already. Linear models will have high biaswhen they encounter a problem like the one illustrated in Figure 2.1. Linear models willconsistently classify the data points in the circular region incorrectly, since they are unable tomodel the complex, non-linear decision surface. The most accurate learning algorithmseffectively balance bias and variance.

2.5 THE K-NEAREST NEIGHBOR ALGORITHMA relatively simple example of a non-linear classifier, and the basis of our clustered

neighborhood density models, is k Nearest Neighbor (kNN). Rather than makingclassification decisions based on some combination of the features in an example, kNNcompares the example to other examples directly, making its classification decision on thenew example’s similarity to other examples. The new example is assigned to the class of themajority of its nearest neighbors, which is highly reminiscent of the neighborhood densitymodels we discussed in the previous chapter. We developed two types of models based onkNN: the first is a standard kNN model using two classes, which we trained on the same dataas the perceptron. The second, the clustered neighborhood model, is a more faithfulinterpretation of a neighborhood density model as described in [1, 2] and is based on amulticlass version of kNN.

The principles underlying kNN are very simple; however, it consistently performs wellon a number of problems and can easily be generalized to more than two classes. A fact thatwe take advantage of in our clustered neighborhood model. kNN is based on vector similaritytechniques that are widely used in information retrieval. In fact, similar techniques are used inpopular search engines. During training, kNN simply memorizes the training data. Thisconstitutes the model. During classification, kNN compares each example in the training datato the example being classified using some type of similarity (or equivalently distance) metric.Training vectors that are the most similar (or closest if we are using a distance metric) to anew example vector are its nearest neighbors. The classification decision is based on the

15

classes of the new example’s k nearest neighbors. There are several ways to implement thisdecision. It can be as simple as assigning the new example to the majority class of its nearestneighbors. However, we used a different method known as weighting by similarity [34]. Inthis scheme, the new example is assigned to the class of its most similar nearest neighbors.We sum the example vector’s similarity scores for each class, and the class with the largestsum is chosen. Similarity scores are typically based on cosine similarity, but we also used anumber of other similarity measures, which we describe below. However, these alternatesimilarity measures can be computationally more costly.

k can be any positive integer, but odd values are generally preferred, since this reducesthe possibility of ties. The best choice for k can depend on the problem, but odd values lessthan 10 are not uncommon. In our experiments, k was equal to 7. For our data, this allowed usto include most of the highest similarity neighbors for the vast majority of our test data.Larger values for k did not improve performance significantly, and as k became really large,practical run time increased dramatically.

Algorithm 2 kNN learning and classification algorithms1: function TRAIN KNN(D) . Train kNN2: D′ ← PROCESS(D)

3: return D′

4: end function

5: function KNN-CLASSIFY(D′, k, d) . Classify with kNN6: Sk ← COMPUTENEARESTNEIGHBORS(D′, k, d)

7: C ← FINDMAJORITY(Sk)

8: return C9: end function

The kNN algorithm is presented as Algorithm 2. It operates just as we described in theprevious paragraph. During training, the training data is simply processed and returned. Forus, this means converting our data into a feature vector representation. Assuming our trainingdata is already processed, this results in O(n) complexity for training, where n is the size ofthe training set. The real cost comes during classification. The heart of the classify function isthe ComputeNearestNeighbors function which finds the test examples k nearest neighborsusing some kind of similarity metric. Finding the nearest neighbors for a single example is onthe order of O(nv), where n is the size of the training set and v is the length of the examplevector. This can be quite slow if the training set is large. Classifying more than one example

16

adds another factor to the run time: we must also consider the number of examples in the testset, O(nvt).

Needless to say, classification with kNN can be quite slow compared to theperceptron. It is rarely used in serious applications for this reason. There are, however, severalmethods to improve the performance of kNN. We have incorporated an inverted index into ourkNN implementation. The inverted index is another idea borrowed from information retrieval.Essentially, it is an index of our feature set. Each feature is listed along with all of theexamples that it occurs in. So, using an inverted index, kNN only has to compare a testexample with training vectors that share at least one feature with the test example. This canmean a considerable speed up in classification. For us, this can mean reducing the number oftraining examples that we must compare to a test example by up to 90%, which represents anenormous improvement.

We mentioned above that the ComputeNearestNeighbors function uses some type ofsimilarity metric to determine vector similarity. There are a number of possible similaritymetrics to choose from, among these we also include the closely related distance metrics (alsosee [11, 12, 13] for an alternative but closely related approach). We have selected five for ourexperiments. Four similarity metrics: dot product, cosine similarity, the Dice coefficient [14],and the Jaccard coefficient [23]; and one distance metric: Euclidean distance. Both similarityand distance metrics measure the similarity between two vectors, but they do so in differentways. A higher score indicates greater similarity when using a similarity metric. On the otherhand, a lower score indicates greater similarity when using a distance metric. Distancemetrics, as the name suggest, measure the distance between two vectors in vector space. Twovectors that are similar are closer together in vector space and so receive a lower score from adistance metric. Similarity and distance measures are the heart of kNN and using differentmeasures can have an effect on the behavior of kNN. Certain measures may improveperformance, while others may hinder it. We review the measures that we used in more detailbelow:

Similarity

• Dot Product

−→v 1 · −→v 2 =n∑

i=1

v1iv2i

This is the same dot product that was mentioned in conjunction with theperceptron. It is, in fact, a measure of the co-linearity of two vectors. Thiscan be a reasonable indication of their similarity. As we have seen, dotproduct only considers the features that the two vectors have in common, so

17

it can be quite efficient.

• Cosine Similarity

Simcos(−→v 1,−→v 2) =

−→v 1

‖−→v 1‖·−→v 2

‖−→v 2‖

Cosine similarity is just dot product normalized by Euclidean length, orEuclidean norm. The Euclidean norm of a vector −→v is equal to

√−→v T−→v .Normalizing a vector for length remedies a few problems, which we willdiscuss in the next chapter in the section on normalization.

• Dice Coefficient

Dice(−→v 1,−→v 2) =

2(−→v 1 · −→v 2)

‖−→v 1‖2 + ‖−→v 2‖2

The Dice Coefficient applies a different type of normalization to dot product.It uses a type of normalization called Dice Family Normalization by Gawronand Stephens [16]. It was originally developed by Dice [14] to find thesimilarity between two sets. The intuition carries over to vectors quitenaturally. The dot product in the numerator represents the amount ofinformation that is shared by two vectors, −→v 1 and −→v 2. The 2 is requiredbecause some of the information overlaps between the two vectors. Thedenominator represents the total amount of information that overlapsbetween the two vectors.

• Jaccard Coefficient

Jaccard(−→v 1,−→v 2) =

|S1

⋂S2|

|S1

⋃S2|

=−→v 1 · −→v 2

‖−→v 1‖2 + ‖−→v 2‖2

Jaccard is closely related to Dice Coefficient. And much like Dice, Jaccardwas originally developed to compute the similarity of sets [23]. The intuitionis very similar to Dice. The numerator represents the shared informationbetween two vectors, and the denominator represents the total amount ofinformation present in both vectors. The denominator is very similar to Diceexcept that the shared information is subtracted from the total amount ofinformation in the vectors. Jaccard and Dice have been shown to performvery similarly [16], and this was also the case in our experiments.

18

Distance

• Euclidean Distance

‖−→v 1 −−→v 2‖ =√

(−→v 1 −−→v 2) · (−→v 1 −−→v 2)

The Euclidean Distance between two vectors is equivalent to the length of theline segment between them in Euclidean space (real-valued vector space).Euclidean Distance is the Euclidean norm of the difference between twovectors. In other words, the length of the line segment that runs between theends of these two vectors. As we mentioned above, the more similar twovectors are, the lower this value will be.

Unlike the perceptron, kNN returns a classification but does not return any type ofscore. In order to compare the predictions of the kNN models to the judgments of the humanrespondents, we need some type of score that we can interpret as our standard kNN’sacceptability judgment. We used a separate scoring scheme for the clustered neighborhoodmodel, which we will describe in detail below. For the standard kNN model, we developed ascoring scheme that is derived from the test example’s nearest neighbors. However, slightvariations were required for the similarity and distance metrics. To derive a score for thesimilarity measures, we weighted the similarity scores between the test example and each ofits nearest neighbors by the class, {+1,−1}, of that neighbor. The score of the test example,then, is the arithmetic average of these values. So for example, say we have a test example x,with two nearest neighbors: example A, in class +1, and example B, in class −1. To find ouracceptability score for x, we weight the similarity scores between x and its neighbors by theirclass. So, if x’s similarity score with vector A is 3, and with vector B is 4, the acceptability ofx is +1(3)+−1(4)

2= −0.5. Using this method, less acceptable test examples receive lower scores

(< 0), and more acceptable examples receive higher scores (> 0).Deriving a score for distance measures required a slight variation. The above

formulation does not make as much sense for distance measures, because higher scoresalways represent greater distance. We do not want a test example that has been classified +1

to receive a higher score than an example that has been classified as −1 when they are thesame distance from the training example. Here +1 represents a more acceptable word and −1represents a less acceptable word. Ideally, we would like training vectors that are equidistantfrom the test vector to receive equal scores. We accomplished this by simply taking theaverage distance between a test example and its nearest neighbors. A score of 0.0 representsthe highest possible acceptability score in this scheme, higher scores represent lower levels ofacceptability. This scoring scheme will result in negative correlations with our humanrespondants, since higher scores represent greater acceptability in our human surveys. This

19

simply means that we will have to look at the absolute value of the correlation whenconsidering Euclidean Distance.

2.6 THE CLUSTERED NEIGHBORHOOD MODELIn this section, we will commence our two-part discussion of the clustered

neighborhood model. We begin with a basic discussion of the structure of the model and themethods that we used to develop training data for the clustered neighborhood model. In thenext chapter, we discuss the specifics of the training data and how the model was trained.

2.6.1 The ModelAs we mentioned above, the clustered neighborhood model is based on a multiclass

version of kNN, and represents somewhat of a departure from the two models that we havediscussed so far. We used a technique called agglomerative clustering to find groups of similarexamples in the training data. These groups represent our neighborhoods, which we used asclasses in the multiclass kNN model. The scoring scheme for the clustered neighborhoodmodel is based on these neighborhoods. The multiclass version of kNN underlying theclustered neighborhood model classifies an example exactly the same way that our standardkNN model does, weighting by similarity. Weighting by similarity has a particular advantagefor the clustered neighborhood model. The overall number of votes is not as significant as theoverall similarity of these votes. A new example may not have as many nearest neighbors in asmaller neighborhood as it does in a larger neighborhood, but it could be much more similarto the examples in the smaller neighborhood. Weighting by similarity allows the clusteredneighborhood model to select the more similar, smaller neighborhood when it is appropriateto do so. This would not be possible in a simple voting scheme.

The scoring scheme that we used with the clustered neighborhood model is alsosomewhat different than that used with the standard kNN models. The score is based on anexample’s neighborhood rather than its nearest neighbors. The score is the sum of thesimilarity scores of the example with all of the members of its neighborhood. This schemefavors possible words with a high degree of similarity and dense neighborhoods, just like theGNM of Bailey and Hahn [2]. We will describe the training and development aspects of theclustered neighborhood model in more detail in the next chapter. First, we will discuss themethod we used to construct the phonotactic neighborhoods for the clustered neighborhoodmodel.

2.6.2 Agglomerative ClusteringAgglomerative Clustering is the bottom-up form of hierarchical clustering. As the

name implies, hierarchical clustering algorithms produce a hierarchy of clusters, as opposed

20

to flat clustering algorithms, like K-means, which produce sets of clusters. Another asset isthat we are not required to know the number of clusters in advance, unlike flat clustering,which requires the number of clusters as a parameter. In hierarchical clustering, the number ofclusters that are found is a property of the data itself, rather than an a priori assumption.

Agglomerative clustering is so called because it is a bottom-up process. The data allbegin in individual clusters, called singleton clusters. These singleton clusters are combinedin successive iterations until all of the data is included in a single hierarchical cluster.

There are several similarity measures that are commonly used in agglomerativehierarchical clustering. The group average measure is typically suggested for most problems[34], but we found that the single-link method, also known as the nearest point algorithm,produced the highest quality neighborhood clusters. The decision to merge two clusters insingle-link clustering depends only on their most similar members. Essentially, we merge thetwo clusters that are closest together. This idea is presented in Equation 2.3.

SIM-Single(c1, c2) = mini∈C1,j∈C2

Distance(i, j) (2.3)

However, this decision only depends on the two most similar members, none of theother members of either cluster are considered. This can produce clusters that are notapproximately spherical which is generally the desired shape for a cluster [34].

In order to use this hierarchical clustering as training data for the clusteredneighborhood model, we require sets of clusters. These will become the neighborhoods in ourmodel. We thus need to cut the hierarchy at some point, producing disjoint sets of clusters likeflat clustering. There are several methods for accomplishing this, but the basic idea in allcases is to produce clusters with a high intra-cluster similarity (similarity among members ofthe same cluster), which is at or above some prespecified threshold. We used a method knownas inconsistent flat clustering. In this method, a node in the hierarchical cluster and all of itsdescendants must have an inconsistency coefficient less than or equal to the prespecifiedthreshold value. The inconsistency coefficient compares the height of each link in thehierarchy with the average height of other links at the same level in the hierarchy [35]. Highervalues for the inconsistency coefficient for a link indicate lower similarity. For our data, athreshold of around 0.7 produced the best results.

2.7 SUMMARY OF CHAPTERIn this chapter, we introduced the machine learning aspects of this work and the basic

concepts of classification. The vector space model gives us a convenient and efficient way torepresent data, as a feature vector, and it leads quite naturally to the concept of classificationdue to the contiguity hypothesis. The classifiers that were used in this work, the perceptronand kNN, were also introduced, as well as the basic architecture of the clustered

21

neighborhood model. We also discussed the differences between linear and non-linearclassifiers and how this difference can affect classifier performance through the bias-variancetradeoff. We reviewed several similarity functions that are commonly used with kNN: dotproduct, cosine similarity, the Dice coefficient, Jaccard Coefficient, and Euclidean distance.We also introduced the scoring metrics we developed for use with the standard kNN modelsas well as the clustered neighborhood model. Finally, we introduced agglomerative clustering,a hierarchical clustering technique that we used to find neighborhoods for the clusteredneighborhood model.

22

CHAPTER 3

METHODOLOGY

In the last chapter, we introduced the learning algorithms we used to build phonotacticmodels of English (the perceptron, two-class kNN, and the multiclass kNN-based clusteredneighborhood model) and how they work. The learning algorithm is, of course, only one partof a classifier model. There are other pieces that are just as important. This chapter introducesthe other parts of a classifier model: namely, the training data and the feature encoding. Theproper training data and feature encoding are critical to learning a successful classifier.Assuming that a problem is linearly separable, if our training data is not representative of theproblem space, or, as is more often the case, not as representative as we think it is, a classifiermodel will never be able to learn a good approximation of the separator(s). The same can besaid for the feature encoding. Choosing the wrong feature encoding can blur the distinctionbetween classes, again making it more difficult to find an accurate separator. It is important tochoose a feature encoding that includes as much of the right kind of information as possible,but determining what the right kind of information is usually involves some experimentation.

In addition to the training data and feature encodings, we also describe the clusteredneighborhood model in more detail, as well as several more traditional probabilistic modelsthat we developed for the sake of comparison. The chapter goes on to describe the test datathat we developed to evaluate our models. This test data was given to a group of humansubjects, and the results were compared to our phonotactic models using the Spearmancorrelation. Finally, we conclude with a summary of the chapter.

3.1 TRAINING DATAIn this section, we discuss the training data that we developed to train our models. In

particular, we discuss the training data itself and how it was created. We go on to discuss thefeature encodings and normalization methods that we used to produce the feature vectorrepresentations of the training data that our models were actually trained on.

3.1.1 Classifier DataClassifiers require both positive and negative training data to learn a model. Our

positive data came from the Carnagie Mellon University (CMU) Pronunciation Dictionary.The CMU Pronunciation Dictionary includes 127, 008 words encoded in the DARPAbetphonetic alphabet, as well as a number of punctuation marks and special characters which wedid not include in our positive training data. The CMU dictionary uses 70 total phoneme

23

symbols: 24 consonant symbols and 46 vowel symbols. The vowel symbols indicate stresswith a number (0, 1, or 2) following the symbol when the symbol is used in a word (’AE1’,for example). Without stress indicators, there are only 15 distinct vowel symbols. Stressindicators were included in our representation; however, we made no special considerations ofstress beyond this in our models. The length of the words in the CMU Dictionary varieswidely, from a single segment (’a’: a) to twenty segments (’deinstitutionalization’:diInst@tuS@n@l@zeS@n). Syllabification was not annotated in the data, and we did not use anyadditional software to syllabify the data. Our desire was to train models based only on thefeatures immediately apparent in the data.

Since binary classifiers (like the perceptron and two-class kNN) require negative data,we must construct negative data to produce our models. Since humans receive very littlenegative data as part of their input during language acquisition, the use of negative data inmodeling acquisition is not uncontroversial. Conceptually, what we are proposing here issome version of internally generated negative data.We suggest that during phonotacticlearning, negative data (that is, impossible phonotactic sequences in the learner’s language)are generated internally as a point of comparison to the positive data that the learnerencounters in their environment. This idea is highly reminiscent of the generator inOptimality Theory [39]. For a given input form, the generator produces all of thelinguistically possible variations of that form, without reference to phonotactic structure, andthen constraints are used to select the best output form. In Optimality Theory, the generatorplays a similar role in language learning. The generator is applied to a known output form andessentially produces negative data, from which the language learner is able to deduce thecorrect input form and constraint ranking. The role of negative data in our experiments isanalogous. Our experiments indicate that the most successful forms of negative data are basedon knowledge independent generation; that is, we do not need to make any assumptions aboutwhat the learner knows at the beginning of the learning process.

We generated our negative data randomly based on the CMU data. We experimentedwith several different methods of generating negative data. In each case, we generated127, 000 negative examples, the same number of examples that are in the CMU Dictionary.After each negative example was generated it was compared to the words in the CMUDictionary to ensure that we did not accidentally generate a positive example. Since we areoperating under a closed-world assumption, we assume that this ensures that we are notaccidentally generating any actual English words. This may not necessarily be a realisticassumption, but it is necessary for practical reasons. We generated negative training datausing four different methods.

24

1. Experiment 1The first set of negative data, which we refer to as [experiment1], was generated almostcompletely at random. A word length for each negative example was randomly selectedfrom the empirical distribution of word lengths found in the CMU Dictionary (from 1 to20 segments). Phoneme symbols were then selected at random until the specified wordlength was reached, with no other restrictions. This method, of course, does notgenerate examples that are particularly reminiscent of English words (for example,>dZmoEO and

>ajaæ>au, but it is a reasonable method to start with.

2. Experiment 2The second set of negative data, [experiment2], was based on the conditionaldistribution of consonants and vowels (CV distribution) in the CMU Dictionary. Weconverted each word into a consonant-vowel representation, in which all consonants arerepresented by ’C’ and all vowels are represented by ’V’. ’start’ and ’end’ symbols werealso added to each word. We constructed a conditional probability distribution over thedata in this consonant-vowel representation. This gives us the probability that a type ofsegment, either a consonant or a vowel, will occur given the preceding segment. Foreach negative example, we first generated a consonant-vowel template, a stringconsisting of ’C’s and ’V’s, based on the consonant-vowel distribution. We thenrandomly generated a segment for each position in the template, a consonant in a ’C’position and a vowel in a ’V’ position.

For clarity, we will construct a short example. We begin with the start symbol, ’start’.We then randomly select a ’C’ or a ’V’ according to the CV distribution. Say wegenerated a ’C’. We then select another symbol by the same process. This timehowever, the ’end’ symbol can also be selected, which signifies the end of the word. Forthe purposes of our example, say we generated a ’V’ next. This process continues untilthe ’end’ symbol is selected. So, assuming the next symbol that we generated was the’end’ symbol, our template would be ’CV’. Next, we select segments at random to fillthe positions of the template, a consonant for a ’C’ position and a vowel for a ’V’position. In this example, we might generate an [m] for the ’C’ position and a [@] for the’V’ position, resulting in [m@]. This method results in examples that bear somewhatmore resemblance to English words than the examples generated by the previousmethod, like T and tNuSsl@nOd.

3. Experiment 3The third set of negative data was generated directly from the positive data. Onesegment of each positive example was changed at random. This of course results in

25

negative data that is more similar to the positive data than either of the two previousnegative data sets. This method generates words like vkns@l (from ’vancil’ apparently aproper name) and g@~ldz (from ”world’s”). We did not make the requirement thatconsonants be substituted for consonants and vowels for vowels. This gives rise to somedistinctly non-English phone sequences, as are evident in the examples above.

4. Experiment 4Finally, the fourth negative data set, [experiment4], includes negative examples fromeach of the three other data sets in roughly equal parts. This data set provides the mostcoverage of negative data, which could prove helpful to the model.

Each of these negative data sets was combined with the positive data in turn, yieldingfour total training sets. The negative data are summerized in Table 3.1. The order thatexamples appeared in each training set was completely random. Ten percent of each of thefour training sets was withheld for development and testing, so each training set consisted of228, 615 examples. We encoded these training sets into a number of different featureencodings, which we will detail below, to produce the actual training data used to train ourclassifier models.

3.1.2 Feature EncodingsAs we mentioned above, the proper feature encoding is essential to successful

classification. Choosing the correct feature representation for the data could mean thedifference between learning an accurate classifier and learning a classifier that performs nobetter, or possibly worse, than chance. We experimented with four basic feature sets. Wemade four variations of each of these feature sets: whether vowels were included in therepresentation (if not, all vowels were mapped to a single feature), and whether positionalinformation was included in the representation (if so, the feature for a segment includes itsposition in the word that it occurred in). These ideas are motivated by ideas from textclassification (see [28] for example) that are designed to reduce the size of the feature spacewithout losing too much significant information, thereby (hopefully) producing a bettermodel. Numbers, for example, are often mapped to a single feature in text classification, anddocuments are typically represented as sets of words (’bag-of-words’ model). Thesetechniques are designed to reduce the total number of features in the model. Although,clearly, we do lose some of the information that is available in the data. For many problemsthough, these techniques can improve classification performance.

The best feature sets strike a balance between the amount of information included inthe model and the size of the feature space. As the amount of information included in themodel increases, the number of features required to represent this information also increases.

26

Table 3.1. Summary of Negative Training Data Generation

Method Example (in IPA)[experiment1] Random Generation:

1. Select number of seg-ments

2. Select symbols until re-quired number of seg-ments is reached

>dZmoEO>ajaæ >aw

[experiment2] Consonant-Vowel Distribu-tion:

1. Create conditionalprobability distribu-tion over ’C’ and ’V’symbols

2. Select symbols accord-ing to this distributionuntil ’end’ symbol isselected

TtNuSsl@nOd

[experiment3] Segment Substitution:One symbol in each positiveexample is replaced by a ran-domly selected symbol. Con-sonants are replaced by con-sonants, and vowels are re-placed by vowels

vkns@lg@~ldz

[experiment4] All types:

1. One third of the nega-tive data are generatedaccording to the proce-dure for [experiment1]



27

This could mean an exponential increase in the size of the the feature space. Larger models,with more features, require more data to train, which may not always be possible. As modelsbecome larger, it becomes less likely that we will get good estimates of the proper weights forall of our features using the same amount of training data. We thus run the risk of overfittingthe training data, producing a model that is extremely accurate when tested on the data that itwas trained on, but that does not generalize at all to new data. So finding ways to reduce thesize of the model, and thus learn better weight values for a lager proportion of our featurespace, can mean important gains in performance.

The four basic feature representations are based on sequences of segments. In allcases, the basic feature depends on whether or not positional information is included. In thefirst, the basic feature is the single segment (the ’uniphone’ representation). If positionalinformation is included, the basic uniphone feature is something like (1,’s’), which representsthe segment [s] in the first position of the word. If not, the basic feature is simply the segmentitself (eg ’s’). Without positional information, a word is simply represented as the set ofsegments that appear in it. The second basic feature representation is based on thetwo-segment sequence (the ’biphone’ representation). If positional information is included,the basic feature is something like (2, (’s’,’t’)). 2 If not, we simply use the biphone itself(’s’,’t’). The third basic feature representation is based on the three-segment sequence (the’triphone’ representation). The basic features look like (3, (’s’,’t’,’r’)) with positionalinformation and simply (’s’,’t’,’r’) without positional information. The final featurerepresentation combines the uniphone, biphone, and triphone representations into a singlerepresentation. We termed this the ’all’ representation. This approach incorporates a kind ofback-off into the model, which can make the resulting models more robust and accurate (see[36] for a similar approach to dependency parsing). A summary of the basic features is givenin Tables 3.2 - 3.5.

Table 3.2. Uniphone Features

Feature Variation Total Features Exampleuniphone Without Position

Without vowels 26 ’t’, ’Vow’With vowels 70 ’t’, ’@1’

With PositionWithout vowels 550 (1,’t’), (1, ’Vow’)

With vowels 1,400 (1,’t’), (1,’@1’)

2In the biphone and triphone sets, features in the first position, eg (1, (’start’),x)), always include the ’start’symbol. The triphone representation uses two start symbols: ’start1’ and ’start2’.

28

Table 3.3. Biphone Features

Feature Variation Total Features Examplebiphone Without Position

Without vowels 675 (’t’,’Vow’)With vowels 5,040 (’t’,’@1’)

With PositionWithout vowels 14,850 (1,(’start’,’t’))

With vowels 110,880 (2, (’t’,’@1’))

Table 3.4. Triphone Features

Feature Variation Total Features Exampletriphone Without Position

Without vowels 18,275 (’t’,’Vow’,’r’)With vowels 363,020 (’t’,’@1’,’r’)

With PositionWithout vowels 402,050 (2,(’t’,’Vow’,’r’))

With vowels 7,986,440 (2, (’@1’,’r’,’end’))

Table 3.5. All Phone FeaturesFeature Variation Total Features Example

all Without PositionWithout vowels 18,975 [(’t’,’Vow’,’r’),(’t’,’Vow’),

(’Vow’,’r’),(’t’),(’Vow’),(’r’)]

With vowels 368,130 [(’t’,’@1’,’r’),(’t’,’@1’),(’@1’,’r’),(’t’),(’@1’),(’r’)]

With PositionWithout vowels 417,450 [(2,(’t’,’Vow’,’r’)),

(2,(’t’,’Vow’)),(3,(’Vow’,’r’)),(2,’t’),(3,’Vow’), (4,’r’)]

With vowels 8,098,720 [(2,(’t’,’@1’,’r’)),(2,(’t’,’@1’)),(3,(’@1’,’r’),(2,’t’), (3,’@1’),(4,’r’)]

29

Back-off is a common approach to dealing with sparse data representations. The basicidea is simple. When the model encounters a feature that it has never seen before, it can backoff to another representation of the same data that it has seen before. Back-off is commonlyused in ngram models (see for example [30]). In our triphone models, for example, the modelnever encounters most of the possible features during training, so the model never learns aweight value for those features. This leaves large gaps in our larger models. The features cannever be considered in the classification decision. This of course could lead to inaccurateclassification. By including the uniphone and biphone representations along with the triphonerepresentation, we are attempting to fill in some of these gaps. If the model encounters a wordwith a triphone feature that it has never seen, like (’t’,’s’,’r’), it can back-off to the biphonerepresentations (’t’,’s’) and (’s’,’r’), which may have weights and may be able to be consideredin the classification decision. Failing that, of course, the model can back-off to the uniphonerepresentation. This gives the model a much better chance of including all of the relevantinformation in the classification decision, which can improve classification.

3.1.3 NormalizationWe also experimented with different forms of vector normalization. Vectors are often

different sizes. Certain similarity functions, like dot product, the basis of the perceptron, showa bias toward longer vectors. The dot product of two longer vectors may be larger than the dotproduct of two shorter vectors, making the two longer vectors appear more similar than theymay actually be. Gawron and Stephens [16] refer to this as a length bias. Two long vectorsmay have more features in common than two short vectors, regardless of whether thesefeatures are relevant or not, simply because they have more features. The length bias can leadto incorrect classification. Large numbers of irrelevant features may cause a new vector toseem more similar to the incorrect class, resulting in incorrect classification.

Normalization is employed to neutralize the length bias. We consider the length of thevector in similarity calculation, thus reducing its effect on the outcome. We used severalmethods for normalization in our experiments. For the perceptron, we tried both L1-3 andL2-normalization.4 Vector normalization can also be incorporated into kNN, depending on thesimilarity function used. Cosine Similarity, Dice, and Jaccard all employ some form ofnormalization, while dot product and Euclidean distance do not.

3.2 VERIFICATION OF RESULTSThe results of the classifier models (perceptron and two-class kNN) were verified

using an SVM implementation, SVMLight, developed by Thorsten Joachims [29]. The SVM

3In L1-normalization, the vector is normalized by its the sum of its components:−→x∑−→x

4In L2-normalization, the vector is normalized by its length:−→x‖−→x ‖ =

−→x√−→x T−→x

30

models were trained and tested on the same data used for our classifier models. This step wasprimarily intended to verify that our implementations were working correctly. As such, theSVM results are not presented in this work. However, the success of the perceptron modelssets up an interesting comparison with the SVM. We leave this to future work, since our SVMresults are not fully developed.

3.3 CLUSTERED NEIGHBORHOOD MODELTo construct the strict neighborhood model, we randomly selected one tenth of the

examples from the CMU Dictionary (the positive data used for the classifier models). Thiscame to 5,705 randomly selected words. This restriction on the size of the training data is dueto the memory requirements of our clustering method, which we implemented using the SciPypython package. This data was clustered using the single-link method, in both the biphoneand triphone representations5, producing 4,153 clusters using the biphone representation and4,223 clusters using the triphone representation. These clusters constitute the neighborhoodsin the clustered neighborhood model, and thus the training data for the underlying kNNmodel. New examples are assigned to one of these neighborhoods and scored using themethod described in the previous chapter.

Clustering the biphone data produced some reasonable clusters. Many data pointsformed singleton clusters, but this is not surprising, since, as we saw in Chapter 1, manywords have very small or non-existent neighborhoods. The bigram representation producedclusters like those in Table 3.6.

These clusters are quite reasonable. The clusters seemed to be formed based on wordinitial and word final sequences, with word initial sequences being the most important.Length is certainly a factor, clusters always consist of words of approximately the samenumber of segments. This is likely due to the similarity function underlying the clusteringalgorithm, Euclidean distance, which is not normalized and thus sensitive to vector length.However, in this case, this seems to be the effect that we want, as the same effect would ariseusing the GNM or other more traditional neighborhood models based on edit distance: Morechanges are required to derive a long word from a short word than to derive a short word fromanother short word.

The clusters based on triphones are somewhat less impressive. Cluster membershipseems much more restricted than the biphone representation. Clusters were generally smaller.The triphone clusters seem highly dependent on final sequences, unlike the biphone clusters,which were dependent on both initial and final sequences, with initial sequences being more

5Both the biphone and triphone representations included vowels but not position

31

Table 3.6. Biphone Clusters

4

hU1d@0ls@0n (hudelson)

hE1nd@~0s@0n (henderson)

226

ri0bI1ldI0N (rebuilding)

ri0bju1kI0N (rebuking)

ri0z@1ltI0N (resulting)

ri0zI1stI0N (resisting)

69

s@0prE1S@0n (suppression)

sO1ltsm@0n (saltsman)

sE1lI0gm@0n (seligman)

swa1nI0g@0n (swanigan)

swI1rI0N@0n (swearengen)

996

skæ1lpI0N (scalping)

skre1pI0N (scraping)

skwO1kI0N (squawking)

skwE1rI0N (squaring)

skw@~1tI0N (squirting)

skwi1zI0N (squeezing)

2936

pæ1ts (pat’s)

p@1ts (putts)

p>aj1ps (pipes)

pi1ps (peeps)

pi1ts (peet’s)

pu1ps (poops)

important. Word length is again an important factor, words of the same number of segmentstended to form clusters. Several of the largest clusters are depicted in Table 3.7.

These clusters are much more reminiscent of a neighborhood model based onneighbors a single edit distance away [49]. There is also more pronounced evidence ofchaining, the major weakness of single-link clustering. Each of the large triphone clusters inTable 3.7 show evidence of chaining. Point-wise similarity between particular clustermembers is high but overall intercluster similarity may be much lower. For example, cluster662: (blæ1k, blE1k), (plæ1k, pl@1k), and (blæ1k, plæ1k) are very similar. Each pair is one edit

32

Table 3.7. Triphone Clusters

214

bl>aj1 (bligh)

blo1 (blow)

dr>aj1 (dry)

I2m>aj1 (imai)

662

blæ1k (black)

blE1k (blech)

plæ1k (plack)

pl@1k (pluck)

272

bo1lz (boals)

ni1lz (neal’s)

pO1lz (paul’s)

sO1lz (salls)

so1lz (seoul’s)

s>oj1lz (soil’s)

distance away. The same cannot be said about (blE1k, plæ1k), (blE1k, pl@1k), or (blæ1k,pl@1k). This is evidence of a mild chaining effect. (blE1k, plæ1k), (blE1k, pl@1k), and (blæ1k,pl@1k) are still quite similar. Overall, the quality of the biphone clusters seems better than thetriphone clusters.

3.4 ADDITIONAL MODELSIn addition to the perceptron, two-class kNN, and clustered neighborhood models, we

also constructed several more standard probability models like the one described in [1] and [2]for a more complete comparison. These models are based on biphone and triphoneconditional probabilities. The probabilities of entire words were calculated as described in thefirst chapter: We tried models based on both joint and average probability. The averageprobability was computed as the geometric mean of the sequence probabilities, followingBailey and Hahn [2].

3.5 THE EXPERIMENTThe basic design of the experiment draws heavily on the work of [1, 2, 9, 15]. The

quality of phonotactic acceptability models is typically evaluated by comparing the judgmentsproduced by the models to those of human respondents. We have done the same. We

33

developed a short test set based on the data developed by Bailey and Hahn [2] and the facts ofEnglish phonotactics described in [20]. Twenty-five single-syllable nonce forms were selectedat random from the Bailey and Hahn [2] data. We made slight alterations to each of thesenonce forms to give rise to nonce forms with more serious phonotactic violations. Thisprocess produced nonce forms with phonotactic violations like inappropriate vowel tenseness([miksTs], [stempts]), and questionable onsets or codas, either by violating sonority sequenceconstraints ([bnIlk], [tl@sp]) or by using segments that do not occur in a particular position(NIN). This produced a test set of 50 total nonce forms, drawn from the full range of theacceptability scale, from completely unacceptable to completely acceptable. As Albright [1]points out, the lack of phonotactically unacceptable test examples was a short-coming ofBailey and Hahn [2], and may have biased their results in favor of the neighborhood basedmodels. The test data is given in the Appendix (Table A.1).

We tested our phonotactic models on this test set. The score that a test examplereceived from a phonotactic model is taken to represent its acceptability according to themodel. For the sake of comparison, we gave the same test to a group of human subjects. Forthe human evaluation, the test data was divided into two sets, neither set included bothversions of a single nonce form from the original set of 25. We created two versions of eachof these two sets. The order in which the examples appeared was random in both cases. Thesetest data were presented in written survey format to 23 lower-level graduate and upper-levelundergraduate students recruited from lower-level graduate courses in Linguistics. Theexamples were presented in standard IPA. Participants were asked to rate each form on a scalefrom 1 to 7, where 1 represents a totally unacceptable English word and 7 represents a totallyacceptable English word (following [9, 15]). One pair of words was excluded immediately,swOntS and swontS. This was due to the fact that only one of these forms, swOntS, waspresented on the human test by mistake.

Human responses largely followed the pattern suggested in the literature. Generally,there was a clear division between the original nonce forms derived from Bailey and Hahn [2]and the altered version. This was true in all but one case, but there is evidence that this is theresult of a reasonably consistent misinterpretation of the vowel in the example. [stempts]received a surprisingly high average score (6.375) compared to its counterpart, [stImpts](5.53846). We believe many respondents confused the tense vowel [e] for the lax vowel [E],resulting in the higher score for [stempts]. We excluded both examples from the analysis forthis reason. Clearly then, the written survey format has introduced some variance into ourhuman sample. But potentially less variance than an oral presentation (see [51] for manypotential confounding factors that arise from an oral presentation).

34

Overall, the human survey results were largely as expected, encompassing the fullrange of acceptability values. Ratings were most consistent at the extreme ends of theacceptability scale. Variance was typically less than 1.00 for the examples that subjects rated2 or less or 6 or more. Responses were much more varied nearer the middle of the scale. Thehuman responses were averaged over all subjects to create an average acceptability vectorrepresenting the human judgments. Responses that were more than two standard deviationsfrom the mean were not counted when computing mean responses. The averaged humanvector was compared to the results from each of our phonotactic models in turn using theSpearman correlation [9, 17]. This allows us to compare the performance of the differentmodels directly. Better models will have higher correlation values with the human results.

3.6 SUMMARY OF MODELS USEDIn this section, we present a tabular summary of the models that we developed for this

work.

3.6.1 Perceptron ModelsA summary of the perceptron models is presented in Tables 3.8 - 3.11.

Table 3.8. Uniphone Models

Basic Feature Variation Normalizationuniphone Without Position None, L1, L2

With/Without Vowels

With PositionWith/Without Vowels

Table 3.9. Biphone Models

Basic Feature Variation Normalizationbiphone Without Position None, L1, L2

With/Without Vowels


35

Table 3.10. Triphone Models

Basic Feature Variation Normalizationtriphone Without Position None, L1, L2

With/Without Vowels


Table 3.11. All Phone ModelsBasic Feature Variation Normalizationall Without Position None, L1, L2

With/Without Vowels


3.6.2 kNN ModelsA summary of the kNN models is presented in Tables 3.12 - 3.15.

Table 3.12. Uniphone Models

Basic Feature Variation Similarity Functionuniphone Without Position Dot Product

With/Without Vowels Cosine SimilarityDice Coefficient

With Position Jaccard CoefficientWith/Without Vowels Euclidean Distance

Table 3.13. Biphone Models

Basic Feature Variation Similarity Functionbiphone Without Position Dot Product



36

Table 3.14. Triphone Models

Basic Feature Variation Similarity Functiontriphone Without Position Dot Product



Table 3.15. All Phone ModelsBasic Feature Variation Similarity Functionall Without Position Dot Product



3.6.3 Clustered Neighborhood ModelsA summary of the clustered neighborhood models is presented in Table 3.16.

Table 3.16. Clustered Neighborhood Models

Basic Feature Variation Similarity Functionbiphone without position/ with vowels Cosine Similaritytriphone without position/ with vowels Cosine Similarity

3.6.4 N-Phone Conditional Probability ModelsA summary of the probability models is presented in Table 3.17.

Table 3.17. N-Phone Conditional Probability Models

Basic Feature Variation CombinationMethod

biphone without position/ with vowels joint probability,average probability

triphone without position/ with vowels joint probability,average probability

37

3.7 SUMMARY OF CHAPTERIn this chapter, we described the data that we used to train our classifier models: the

training data and the feature representations. We trained our classifier models using the CMUDictionary and negative data we generated using several different methods. We useduniphone, biphone, and triphone representations, as well as a representation that included allthree, the ’all phone’ models. We also varied whether vowels and positional information wereincluded in feature representations. We experimented with several forms of vectornormalization to reduce the length bias inherent in unnormalized similarity measures like DotProduct. The clustered neighborhood model was described in more detail and we reviewedthe clustered data that the model is based on. We also described the test data and how it wasused to compare the performance of the various models that we experimented with. Finally,we included a series of tables summarizing the models used in this work.

38

CHAPTER 4

RESULTS

This chapter reviews the correlation results of the models that we experimented with.6

There was a great deal of variability in performance depending on the type of model and thetype of features used. The best classifier models, those based on the perceptron and kNN,performed better than the clustered neighborhood model and the standard probability models.The standard probability model performs reasonably well, yet not to the level of the classifiermodels. The clustered neighborhood model lagged somewhat behind the other types ofmodels, but this may be due in large part to the limited amount of data that it was trained on(an issue we will address in the next chapter). The chapter will proceed as follows: First, wewill review the results for the classifier based models, the perceptron and kNN models, notinghow performance is affected by varying the type of negative data and the basic featurerepresentation. We then discuss the results from the basic clustered neighborhood model, inboth the biphone and triphone representations. Finally, we discuss the standard biphone andtriphone probabilitiy models. We conclude with a summary of the chapter.

4.1 PERCEPTRON RESULTSThis section presents the correlation results for the perceptron models. Results are

presented by feature type and ordered by the underlying training data used.

4.1.1 Uniphone ResultsOverall, the uniphone models performed poorly compared to the other perceptron

models, the most successful models only managing to produce correlations aroundr ≈ 0.29, p ≈ 0.19 and none produced a significant correlation. The best models wereproduced by the [experiment1] and [experiment4] negative data, see Table 3.1. Correlationswere quite varied, even within the same negative data group. Though all were near the lowend of the spectrum. Even at the lower end of the scale, variation in features andnormalization clearly make a difference.

The [experiment1] negative data produced the highest score for the uniphone modelsoverall with the L2-normalized non-positional model with vowels (r = 0.2937, p = 0.1903).The other [experiment1] non-positional models were less successful, scoring generally lowerthan their positional counterparts, the notable exception being the L2-normalized positional

6The results presented here have not been corrected for multiple comparisons, which introduces the possibilityof error in the results. This issue will be addressed in future work.

39

model with vowels, which only produced a meagre correlation of r = 0.1213 (p = 0.3611).In nearly all cases, the models with vowels performed better than their non-vowelcounterparts. Normalization was usually beneficial, although the best type of normalization touse seems to depend on the particular feature variation. The non-positional models benefitedmost from L2-normalization, while L1-normalization had a slight edge for the positionalmodels. However, both types of normalization improved performance approximately the sameamount using the positional features without vowels (r ≈ 0.19, p ≈ 0.28). Among thebiggest surprises is the performance drop-off of the L2-normalized model when both positionand vowels were included in the feature representation. This is particularly surprising sinceL2-normalization improved the non-positional model with vowels so much; and performedcomparably with L1-normalization in the positional model without vowels. In this case, evenno normalization is preferable to L2-normalization. The [experiment1] results are presentedin Table 4.1, rho values are given along with their p values in parentheses.

Table 4.1. Perceptron Experiment 1 Uniphone Results

Feature Data Variation NormalizationNone L1 L2

uniphone experiment1 Without PositionWithout Vowels 0.0697 0.1060 0.1339

(0.4193) (0.3782) (0.3474)With Vowels 0.1364 0.1981 0.2937

(0.3446) (0.2796) (0.1903)

With PositionWithout Vowels 0.0865 0.1907 0.1954

(0.4002) (0.2872) (0.2824)With Vowels 0.2220 0.2535 0.1213

(0.2559) (0.2260) (0.1903)

The [experiment2] models produced results in a narrower range than the [experiment1]models. The best model scored only r = 0.2234 (p = 0.2546), somewhat lower than the best[experiment1] models. The gap between the performance of the positional and non-positionalmodels that we saw for the [experiment1] models is somewhat more modest for the[experiment2] models. Both positional and non-positional models performed comparably.The disparity between the models with and without vowels is also somewhat less pronounced.This is most apparent in the L1-normalized models. The L1-normalized models consistentlyperform better with vowels. L2-normalization is only effective for the positional models;however, both are among the best performing [experiment2] models (r ≈ 0.21, p ≈ 0.26).

40

The non-positional models are among the worst performing of the [experiment2] models(r ≈ 0.11, p ≈ 0.37). The [experiment2] results are presented in Table 4.2.




(0.3370) (0.3248) (0.3673)With Vowels 0.1419 0.2234 0.1091

(0.3386) (0.2546) (0.3747)


(0.3133) (0.3084) (0.2614)With Vowels 0.1207 0.2166 0.2112

(0.3618) (0.2611) (0.2665)

The [experiment3] negative data produced the worst performing uniphone modelsoverall. None of the unnormalized models were able to produce a correlation above r = 0.1.Using some type of normalization improves performance somewhat, but these are only on parwith the mid range of the [experiment1] and [experiment2] models. The non-positionalmodels performed better, particularly those using some type of normalization. And here,vowels improved the L2-normalized model (from r = 0.1028, p = 0.2896 without vowels tor = 0.1442, p = 0.3362 with vowels). The positional models were in general quite poor. Thebest, the L1-normalized model with vowels, managed a correlation of onlyr = 0.1254 (p = 0.3517). The [experiment3] results are presented in Table 4.3.

The best [experiment4] models produced correlations that were somewhat higher thanthe best [experiment1] models (up to r = 0.2904, p = 0.1932). The positional models were,on the whole, somewhat better than the non-positional models. Here, only the L1-normalizedmodels managed even a very mild correlation (r ≈ 0.17, p ≈ 0.30 both with and withoutvowels). All of the positional models without vowels performed reasonably well compared tothe other uniphone models (above r = 0.24, p ≈ 0.2), but only the L2-normalized modelimproves at all when vowels are included in the feature representation (up to r = 0.2759,

p = 0.2057), but this is of course only a very slight improvement. The [experiment4] resultsare presented in Table 4.4.

41




(0.4030) (0.2896) (0.3818)With Vowels 0.0668 0.1442 0.2100

(0.4226) (0.3362) (0.2677)

With PositionWithout Vowels -0.0179 0.0418 0.0100

(0.5208) (0.4515) (0.4883)With Vowels 0.0433 0.1299 0.0339

(0.4496) (0.3517) (0.4607)




(0.4203) (0.3078) (0.4855)With Vowels 0.0921 0.1755 0.0914

(0.3939) (0.3029) (0.3946)

With PositionWithout Vowels 0.2391 0.2904 0.0.2567

(0.2394) (0.1932) (0.2231)With Vowels 0.1880 0.1254 0.2759

(0.2899) (0.3457) (0.2057)

4.1.2 Biphone ResultsOn the whole, the biphone models are much better than the uniphone models. In fact,

the best performing biphone models were among the best models overall. Much like theuniphone results, the [experiment1] and [experiment4] data produced the best models,producing correlations as high as r = 0.7381 (p = 0.0048). The [experiment2] and[experiment3] models were somewhat worse, reaching a maximum of r = 0.5365

(p = 0.0444). Although, the [experiment3] models were not quite as far behind, relatively, asthey were using the uniphone representation.

The [experiment1] models were among the best performing of all of the models thatwe developed, producing correlations as high as r = 0.7168, p = 0.0067. The positionalmodels slightly outperformed their non-positional counterparts, much like the uniphone

42

models. Much like the uniphone models, including vowels typically improved performance.When both position and vowels were included, normalization had virtually no effect. In theother cases, however, L1-normalization proved most effective. The [experiment1] results arepresented in Table 4.5.

Table 4.5. Perceptron Experiment 1 Biphone Results


biphone experiment1 Without PositionWithout Vowels 0.5808 0.6341 0.5379

(0.0304) (0.0181) (0.0439)With Vowels 0.6077 0.6827 0.6446

(0.0237) (0.0103) (0.0161)


(0.0277) (0.0127) (0.0217)With Vowels 0.7168 0.7156 0.7059

(0.0065) (0.0067) (0.0076)

As we mentioned already, the [experiment2] models did not perform quite as well asthe [experiment1] models, the best only managing a correlation of r = 0.5037 (p = 0.0571).The range of scores was quite narrow compared to the [experiment1] and [experiment4]models, similar to the [experiment2] uniphone models. Interestingly, position, vowels, andnormalization do not seem to have a consistent effect across model types. Vowels improvedthe normalized models without position, particularly the L2-normalized version (fromr = 0.2384, p = 0.0942 without vowels to r = 0.4197, p = 0.0994 with vowels), but had theopposite effect on the unnormalized models (down from r = 0.4193, p = 0.0996 withoutvowels to r = 0.2964, p = 0.1881 with vowels). Again, the best non-positional models werethe L1-normalized versions, producing the highest scoring [experiment2] model: thenon-positional models without vowels (r= 0.5037, p = 0.0571). All of the positional modelsperformed similarly, the models with vowels having a slight edge. When vowels wereincluded, normalization had very little effect, much like the [experiment1] positional modelswith vowels. L1-normalization was again the most effective. The [experiment2] results arepresented in Table 4.6.

The performance of the [experiment3] models was quite similar to the [experiment2]models. This contrasts with the [experiment3] uniphone models, which produced the worstuniphone scores. Again, results did not seem particularly affected by varying position and

43




(0.0996) (0.0755) (0.0943)With Vowels 0.2965 0.5037 0.4197

(0.1881) (0.0571) (0.0994)


(0.1090) (0.1531) (0.1238)With Vowels 0.4343 0.4550 0.4380

(0.0910) (0.0798) (0.0889)

vowels. The unnormalized and L2-normalized models were somewhat improved by includingpositional information, but had very little effect on the L1-normalized version. Includingvowels was not consistently effective. Vowels improved the unnormalized models, both withand without position (from r = 0.3221, p = 0.1670 to r = 0.3922, p = 0.1164 withoutposition, and from r = 0.3856, p = 0.1207 to r = 0.4151, p = 0.1021 with position).However, among the normalized models, only the positional L1-normalized model showedany improvement, and this was very mild (from r = 0.5025, p = 0.0576 to r = 0.5285,

p = 0.0470). The other normalized models, both with and without position, actuallydecreased in performance when vowels were included. When vowels were not includednormalization offered consistent improvements. Again, L1-normalization worked well inmost cases, although L2-normalization proved most effective when both position and vowelswere included. Overall, the normalized [experiment3] models are somewhat more successfulthan the [experiment2] models, a somewhat unexpected result based on the uniphone results.Although both remain quite far behind the [experiment1] and [experiment4] models. The[experiment3] results are presented in Table 4.7.

The performance of the [experiment4] models was quite varied, in a similar pattern tothe [experiment1] models. However, the best [experiment4] models were somewhat betterthan the best [experiment1] models. The positional models were consistently better than theirnon-positional counterparts. Including vowels also improved performance in all cases, muchlike the [experiment1] models. All of the positional models with vowels performed quitereasonably, with the slight edge going to the L1-normalized version. Normalization improvedperformance in some cases, but only L1-normalization offered consistent improvements. Infact, L1-normalization in conjunction with position and vowels produced the best performingbiphone model, with r = 0.7381 (p = 0.0048). L2-normalization was only particularly

44




(0.1670) (0.0466) (0.0803)With Vowels 0.3922 0.4807 0.4201

(0.1164) (0.0672) (0.0991)


(0.1207) (0.0576) (0.0444)With Vowels 0.4151 0.5295 0.4420

(0.1021) (0.0470) (0.0867)

effective when position was not included, but even so, did not reach the performance of theL1-normalized model. The [experiment4] results are presented in Table 4.8.




(0.1852) (0.0349) (0.0702)With Vowels 0.4498 0.6058 0.5568

(0.0826) (0.0241) (0.0376)


(0.0454) (0.0227) (0.0423)With Vowels 0.6977 0.7381 0.6768

(0.0085) (0.0048) (0.0111)

4.1.3 Triphone ResultsThe performance of the triphone models was generally on par with the biphone

models. The best triphone models produced correlations as high as r = 0.7384 (p = 0.0047),in the same range as the best biphone models. The results by negative data type followed avery similar pattern to the biphone results overall. However, the slight edge probably goes tothe triphone models. The triphone representations produced more consistent results acrossrepresentations. This is particularly true for the [experiment1] and [experiment4] models. It

45

would seem that in these cases the increased information available in the trigramrepresentation reduces the effects of the variation of vowels, position, and normalization onthe performance of the models. In fact, the non-positional models without vowels arebeginning to come to the fore. The best triphone models are of this type. The reason for this ismost likely the size of the triphone models. The basic triphone models, without position orvowels, are themselves quite large (18,275 features). Adding more information, as we do byadding vowels and positional information, can increase the size of the model exponentially(the triphone model with both vowel and positional information has 7,986,440 features). Thetraining data that we have is simply not large enough to train these extremely large modelsadequately and performance begins to drop off.

The [experiment1] triphone models were on the whole competitive with the[experiment1] biphone models. For the first time, although the difference is slight, thenon-positional models perform better than the positional models. Including vowels onlyimproves some of the positional models. The L1-normalized model performs approximatelythe same with and without vowels (r ≈ 0.64, p ≈ 0.0165) in all cases, but in general did notimprove performance dramatically. L1-normalization improved the non-positional modelswithout vowels. In all other cases, normalization offers only minor, if any, improvements.Again, however, much like the biphone [experiment1] results, L2-normalization is the mosteffective for the positional model with vowels. Even so, the performance of the positionalmodels was generally below that of the non-positional models. The [experiment1] results arepresented in Table 4.9.

Table 4.9. Perceptron Experiment 1 Triphone Results


triphone experiment1 Without PositionWithout Vowels 0.6696 0.6940 0.6899

(0.0121) (0.0089) (0.0094)With Vowels 0.5469 0.6389 0.5515

(0.0408) (0.0172) (0.0393)


(0.0204) (0.0161) (0.0219)With Vowels 0.6427 0.6390 0.6772

(0.0164) (0.0171) (0.0110)

46

The [experiment2] models were again among the worst of the triphone models, justlike the [experiment2] biphone and uniphone models. The best performing model onlymanaged a correlation of r = 0.4851 (p = 0.0652), marginally below the best performing[experiment2] biphone model. As we have mentioned, this slight dip in performance betweenthe biphone and triphone models is likely due to the exponential increase in the size of thefeature space. Correlation results were generally more consistent than the [experiment2]biphone models, similar to the [experiment1] triphone models. Position had only a very minoreffect on performance, only improving the unnormalized model with vowels (fromr = 0.4314, p = 0.0926 without vowels to r = 0.4851, p = 0.0652 with vowels). In all othercases, including position offers no improvement or even decreases performance. Includingvowels, on the other hand consistently improved performance, for both the positional andnon-positional models. Normalization did not prove consistently effective across modelvariations. In fact, only the non-positional models with vowels and the positional modelswithout vowels showed any improvement using normalization. In the other cases,normalization was less effective. The unnormalized models actually performed as well as, orbetter than, their normalized counterparts. The [experiment2] results are presented inTable 4.10.




(0.0996) (0.0972) (0.1100)With Vowels 0.4314 0.4842 0.4673

(0.0926) (0.0656) (0.0736)


(0.1319) (0.1159) (0.1157)With Vowels 0.4851 0.4588 0.4652

(0.0652) (0.0779) (0.0747)

The [experiment3] triphone results were quite similar to the [experiment2] triphoneresults, with the best [experiment3] model producing a correlation of r = 0.5304 (p =

0.0466), somewhat better than the best performing [experiment2] model. Position made verylittle difference overall, only slightly improving the unnormalized model without vowels(from r = 0.4314, p = 0.0926 to r = 0.5279, p = 0.0470). In all other cases, the positional

47

and non-positional models performed approximately the same. Including vowels in thefeature representation was also not particularly effective. The models without vowelsperformed somewhat better in all cases. Normalization was only particularly effective for thenon-positional model with vowels, and here L2-normalization had the slight advantage(r = 0.5291, p = 0.0471). In the other cases, normalization was less effective, onlyproducing minor benefits, if any. Normalization had virtually no effect on the positionalmodels without vowels, which were the best performing [experiment3] models, all in therange of r ≈ 0.53 (p ≈ 0.05). The [experiment3] results are presented in Table 4.11.




(0.0926) (0.0573) (0.0471)With Vowels 0.3970 0.4483 0.4176

(0.1133) (0.0833) (0.1006)


(0.0470) (0.0484) (0.0466)With Vowels 0.4135 0.4407 0.4210

(0.1031) (0.0875) (0.0986)

The [experiment4] models produced the highest correlations using the triphone featurerepresentation. Also much like the biphone results, the best performing [experiment4]triphone models produced correlations above r = 0.70 (p ≤ 0.01). Much like we saw in the[experiment1] results, the non-positional models were somewhat more effective than thepositional models, particularly when vowels were not included in the feature representation.Including vowels in the feature representation was generally not effective, except for theunnormalized positional model, which was slightly improved by including vowels (fromr = 0.5920, p = 0.0275 to r = 0.6109, p = 0.0229). Even with this slight improvement, thebest positional models are still somewhat behind the non-positional models. The pattern is ofcourse similar to that found in the [experiment1] models: the more economic featurerepresentation proves to be the most effective. Normalization had some effect for the modelswithout vowels, both those with and without position, but was less effective when vowelswere included in the feature representation. The results of [experiment4] are presented inTable 4.12.

48




(0.0110) (0.0047) (0.0078)With Vowels 0.6383 0.6274 0.6343

(0.0173) (0.0194) (0.0180)


(0.0275) (0.0133) (0.0145)With Vowels 0.6109 0.5636 0.5836

(0.0229) (0.0355) (0.0295)

4.1.4 All Phone ResultsThe all phone feature representation produced the best models in our experiments,

producing correlations above r = 0.77 (p ≤ 0.003), somewhat better than the biphone andtriphone models. The all phone results followed a similar pattern to the biphone and triphoneresults. The [experiment1] and [experiment4] models produced the best results, with the[experiment2] and [experiment3] models lagging behind somewhat. The [experiment2] and[experiment3] results were quite similar overall, but the difference between the top scoring ofthese models was somewhat larger than we saw for the biphone and triphone results. Thebenefits of reducing the size of the feature space are evident for most negative data types. Thebest performing models were all non-positional models without vowels (producing theaforementioned correlations above r ≈ 0.77, (p < 0.003). The best performing positionalmodel, on the other hand, produced a slightly lower correlation of r = 0.723 (p = 0.0059).However, in all cases, including backoff features improved performance across all negativedata types.

The [experiment1] negative data produced the best models in our experiments. Again,the non-positional models outperformed the positional models, much like the [experiment1]triphone results. Including vowels in the feature representation was only effective for thenormalized positional models, producing mild improvements in both cases. In most othercases, including vowels slightly decreased performance. Normalization generally improvedperformance. Normalization was effective when neither vowels nor position were included inthe feature representation (producing correlations in the range of r ≈ 0.78, p ≈ 0.002), thehighest correlation of all of the models that we developed). In the other cases,L1-normalization had a slight edge. Although, the best of these, the positional model without

49

vowels (r= 0.7235, p = 0.0059) is somewhat behind the normalized non-positional models.This is markedly different than the [experiment1] triphone models, which were largelyunaffected by normalization. This is most likely due to the increased length of the all phonevectors, which include the triphone, biphone, and uniphone representations, and so are muchlonger. Normalization helps to neutralize the effect that vector length has on the resultingacceptability score. The [experiment1] results are presented in Table 4.13.

Table 4.13. Perceptron Experiment 1 All Phone Results


All Phone experiment1 Without PositionWithout Vowels 0.6658 0.7713 0.7853

(0.0126) (0.0027) (0.0021)With Vowels 0.5663 0.6656 0.6192

(0.0347) (0.0127) (0.0211)


(0.0684) (0.0113) (0.0175)With Vowels 0.6064 0.7235 0.6769

(0.0240) (0.0059) (0.0111)

The [experiment2] all phone models were again the lowest performing of the all phonemodels, just like the [experiment2] biphone and triphone models. The performance of the[experiment2] models were marginally better than their triphone counterparts, the bestperforming producing a correlation of r = 0.5249 (p = 0.0487). Including position was onlyeffective when vowels were also included (from r ≈ 0.41, p ≈ 0.1 without position tor ≈ 0.51, p ≈ 0.05 with position). The positional models with vowels were, in fact, the bestperforming [experiment2] models. Without vowels, including position was less effective,producing slightly lower correlations of r ≈ 0.44 (p ≈ 0.09). Similarly without position,vowels were also less effective. The non-positional models without vowels were slightlybetter than the non-positional models with vowels. Normalization generally had very littleeffect. The only exception being L2-normalized non-positional model without vowels, whichwas slightly worse than the other non-positional models without vowels (r = 0.4398,p = 0.0879 for the L2-normalized model compared to r ≈ 0.48, p ≈ 0.06 for the othernon-positional models without vowels). In all other cases, the normalized and unnormalizedmodels performed approximately the same, which is quite similar to the triphone[experiment2] models. The [experiment2] results are presented in Table 4.14.

50




(0.0684) (0.0659) (0.0879)With Vowels 0.4187 0.4290 0.4070

(0.1000) (0.0939) (0.1070)


(0.0591) (0.0895) (0.0833)With Vowels 0.4991 0.5249 0.5029

(0.0590) (0.0487) (0.0574)

The slight advantage that the [experiment3] biphone and triphone models had over the[experiment2] models has extended somewhat using the all phone representation. The[experiment3] all phone models performed slightly better in most cases, producingcorrelations as high as r = 0.6143 (p = 0.0222). The exception being the positional modelswithout vowels, which performed worse than their [experiment2] counterparts, managingcorrelations of only r ≈ 0.43 (p ≈ 0.09). Again, much like the [experiment1] results, themost economic feature representation performed the best: the non-positional models withoutvowels, which produced the best correlations of the [experiment3] group (r ≈ 0.6, p ≈ 0.02).The positional models suffered somewhat, the best producing a slightly lower correlation ofr = 0.5641 (p = 0.0353). Including vowels in the feature representation was not helpful inany case, but the decrease in performance was most noticeable for the normalizednon-positional models. Normalization was only particularly effective for the models withoutvowels. This was even more pronounced for the non-positional models, which increased fromr = 0.4989 (p = 0.0591) without normalization to around r ≈ 0.61 (p ≈ 0.02) with eitherform of normalization). When position was included, L2-normalization showed a distinctadvantage. The [experiment3] results are presented in Table 4.15.

The [experiment4] models were among the best of the all phone models, reaching thesame level of performance as the [experiment1] models (r ≈ 0.77, p ≈ 0.0025). This is asimilar pattern to what we have seen in the biphone and triphone results. The models that usedthe most economic feature set (the non-positional models without vowels) were again the bestperformers, particularly when combined with L1-normalization. Just like the [experiment1]and [experiment3] results, including position in the feature representation did not improveperformance. The best positional model (again the L2-normalized model without vowels)produced a slightly lower correlation of r = 0.6937 (p = 0.0089). Vowels decreased the

51




(0.0591) (0.0222) (0.0233)With Vowels 0.4530 0.4738 0.4686

(0.0809) (0.0704) (0.0730)


(0.0704) (0.0581) (0.0353)With Vowels 0.4405 0.4375 0.4279

(0.0875) (0.0892) (0.0946)

performance of both the positional and non-positional models, this difference being morepronounced for the normalized non-positional models (down from r ≈ 0.76, p ≈ 0.003

without position to r ≈ 0.59, p ≈ 0.025 with position). Normalization was only particularlyeffective for the non-positional models without vowels. Here, the L1-normalized models had aslight edge over the L2-normalized models (r = 0.7753, p = 0.0025 using L1-normalizationcompared to r = 0.7422, p = 0.0044 using L2-normalization). Normalization was lesseffective in the other cases. Once again, however, L2-normalization produced a slightimprovement for the positional model with vowels (up to r = 0.6937, p = 0.0089), much likethe [experiment3] results. The [experiment4] results are presented in Table 4.16.




(0.0188) (0.0025) (0.0044)With Vowels 0.5673 0.6178 0.5787

(0.0344) (0.0214) (0.0311)


(0.0130) (0.0135) (0.0089)With Vowels 0.6371 0.6123 0.5938

(0.0175) (0.0226) (0.0271)

52

4.1.5 Summary of Perceptron ResultsOverall, the perceptron results were reasonably good. Performance is significantly

(p < 0.003) improved over the uniphone representation using the biphone, triphone, and allphone representations. The performance of these representations was not significantlydifferent (the best models in each group produced correlations above r = 0.70. Moreinterestingly, the biphone, triphone, and all phone models performed best using differentfeature variations. The biphone models performed the best when both vowels and positionalinformation were included in the model, while the triphone and all phone models performedbest when vowels and positional information were not included in the model. As we havementioned, this is likely due to the large sizes of the triphone and all phone models. Byremoving vowels and position from the model, we decrease its size, giving us a betterestimation of the separating hyperplane. Since the biphone models are generally smaller, theincreased information is actually beneficial.

There is also a clear pattern based on the type of negative data used. The models basedon the [experiment1] and [experiment4] data are clearly superior to the models based on the[experiment2] and [experiment3] data. This is true across the uniphone, biphone, triphone,and all phone representations. The difference, however, was only marginally significant(p < 0.05) within each feature set. However, the [experiment1] and [experiment4] dataclearly offer an advantage over the other data types. In fact, the [experiment4] dataconsistently offers a marginal advantage. Although the difference between the performance ofthe [experiment1] and [experiment4] based models is not significant.

4.2 KNN RESULTSIn this section, we present the correlation results for the kNN models. This section

follows the same organization as the Perceptron results section.

4.2.1 Uniphone ResultsThe kNN uniphone models performed surprisingly well, producing correlations as

high as r = 0.4386 (p = 0.0885), substantially better than the perceptron uniphone models.The uniphone models produced a very wide range of correlation scores; while some weresurprisingly high, many were comparable to their perceptron counterparts. In all cases, thepositional models proved to be better than the non-positional models. The distribution of theuniphone results by negative data does not follow the same patterns that we found in theperceptron results. The positional models with vowels were consistently among the highestscoring uniphone models in all cases. This was also generally true of the positional modelswithout vowels, the exception being the [experiment3] models based on the similarityfunctions (dot product, cosine, Dice, and Jaccard). The Euclidean distance models do not

53

suffer the same deficiencies. Altering the underlying similarity or distance function clearlyhas some effect, though the division is primarily tripartite: between dot product, thenormalized similarity functions (cosine, Dice, and Jaccard), and Euclidean distance.Euclidean distance often outperformed the similarity functions.

The highest scoring [experiment1] uniphone models performed surprisingly wellcompared to the perceptron uniphone models. More information clearly improvesperformance. The positional models, in general, outperformed the non-positional models. Thepositional models were also improved by including vowels. There is a clear division betweenthe dot product models, the normalized similarity models, and the Euclidean distance models.The normalized similarity models generally performed approximately the same in each case.The dot product models typically scored somewhat below the normalized similarity models,but followed the same general pattern: the dot product models improved in the same situationsas the normalized similarity models. The Euclidean distance models typically performedbetter than the similarity based models. Although, when both position and vowels wereincluded in the model both the normalized similarity based models and the Euclidean distancemodels were able to produce correlations as high as r = 0.4387 (p = 0.0885). This wassomewhat of a surprise considering the perceptron uniphone results. Including vowelsimproved both the positional and the non-positional similarity based models, although thepositional models had a distinct advantage. The Euclidean distance based models were lesssensitive to the inclusion of vowels in the feature representation. Including position, however,was more effective. When position was included, the Euclidean distance models were amongthe highest scoring [experiment1] models, both with and without vowels. The [experiment1]results are presented in Table 4.17

The [experiment2] models were again surprisingly good compared to the[experiment2] perceptron uniphone models. The best performing models were in the samerange as the best [experiment1] models (r ≈ 0.39, p ≈ 0.13). The distribution of results issomewhat different from the [experiment1] results. The normalized similarity metrics wereagain not as effective in their non-positional form. Including vowels even decreased theirperformance, which is the opposite of what we found for the [experiment1] non-positionalmodels. Both dot product and Euclidean distance performed better than the normalizedsimilarity metrics in their non-positional form. The non-positional dot product modelsperformed nearly as well as their positional counterparts. On the other side of the spectrumwere the non-positional Euclidean distance models, which could not be correlated with thehuman results. Including position was generally effective. However, only the unnormalizedmetrics (dot product and Euclidean distance) performed as well as their [experiment1]counterparts. The [experiment2] results are presented in Table 4.18.

54

Tabl

e4.

17.k

NN

Exp

erim

ent1

Uni

phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

-0.0

498

0.04

550.

0525

0.05

250.

2182

(0.5

579)

(0.4

471)

(0.4

390)

(0.4

390)

(0.2

586)

With

Vow

els

0.07

560.

1366

0.12

440.

1179

0.21

82(0

.412

5)(0

.344

4)

(0.3

578)

(0.3

649)

(0.2

596)

With

Posi

tion

With

outV

owel

s0.

2730

0.33

080.

3227

0.32

760.

4080

(0.2

083)

(0.1

602)

(0.1

666)

(0.1

627)

(0.1

064)

With

Vow

els

0.34

390.

4256

0.43

300.

4358

0.43

87(0

.150

2)(0

.095

9)(0

.091

7)(0

.090

1)(0

.088

5)

55

Tabl

e4.

18.k

NN

Exp

erim

ent2

Uni

phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.25

090.

2010

0.20

100.

2010

-(0

.228

3)(0

.276

7)(0

.276

7)(0

.276

7)-

With

Vow

els

0.30

810.

1191

0.12

260.

1226

-(0

.178

3)(0

.363

6)

(0.3

597)

(0.3

597)

-

With

Posi

tion

With

outV

owel

s0.

2714

0.22

750.

2347

0.23

720.

4070

(0.2

097)

(0.2

505)

(0.2

436)

(0.1

071)

(0.1

070)

With

Vow

els

0.36

560.

3795

0.37

820.

3808

0.42

39(0

.134

4)(0

.124

8)(0

.125

7)(0

.123

9)(0

.096

9)

56

The [experiment3] models were relatively similar to the [experiment1] models. Thenon-positional models performed quite poorly in general, only producing correlations aroundr ≈ 0.1 (p ≈ 0.38). The non-positional Euclidean distance models produced scores in thesame range, unlike their [experiment1] counterparts. Vowels marginally improved the modelsbased on the normalized similarity metrics without position. However, including vowels didnot have the same effect on non-positional Euclidean distance or dot product models. Theperformance of the dot product model decreased substantially when vowels were included(down from r = 0.1871, p = 0.2908 to r = −0.0964, p = 0.6110). The Euclidean distancemodel performed equally well with and without vowels, although somewhat below theperformance of their [experiment1] counterparts (r ≈ 0.1, p ≈ 0.39). Position did notuniversally improve performance. Without vowels, the positional similarity based modelsperformed just as poorly as their non-positional counterparts. Performance was substantiallyimproved when vowels were included. This was particularly true for the normalized similaritybased models, which each improved to around r ≈ 0.4 (p ≈ 0.1). The positional dot productmodels were again somewhat behind this mark, while the positional Euclidean distancemodels performed in the same range, both with and without vowels (r ≈ 0.4, p ≈ 0.1),among the best performing of the [experiment3] group. The [experiment3] results arepresented in Table 4.19.

The [experiment4] results were somewhat different from the results of the othernegative data types. The normalized similarity based models without position all performedapproximately the same, both with and without vowels (r ≈ 0.15, p ≈ 0.32). Again, the dotproduct and Euclidean distance based models followed different patterns. The non-positionaldot product model improved substantially when vowels were included (from r = −0.0235,p = 0.5274 to r = 0.2656, p = 0.2150). In fact, the non-positional dot product model withvowels performed the best of all of the non-positional [experiment3] models. Again, as in[experiment2], the non-positional Euclidean distance models were not able to be correlatedwith the human results. The positional models, on the whole, performed quite similarly to the[experiment1] positional models. Again only the similarity based models were particularlyaffected by including vowels. Without vowels the normalized similarity based models wereall in the range of r ≈ 0.28 (p ≈ 0..2), with the positional dot product model somewhatbehind (r = 0.2271, p = 0.2509). Including vowels improved performance substantially ineach case, improving scores to around r ≈ 0.39 (p ≈ 0.11). As we have seen several timesalready, the positional Euclidean distance models were virtually unaffected by vowels, bothproducing correlations in the range of r ≈ 0.4 (p ≈ 0.1). The Euclidean distance modelswere, once again, the best of the [experiment4] group. The [experiment4] results arepresented in Table 4.20.

57

Tabl

e4.

19.k

NN

Exp

erim

ent3

Uni

phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.18

71-0

.003

10.

0133

-0.0

002

0.10

01(0

.290

8)(0

.503

6)(0

.484

4)(0

.500

2)(0

.394

8)W

ithVo

wel

s0.

3081

-0.0

964

0.11

120.

0835

0.08

01(0

.178

3)(0

.611

0)

(0.3

723)

(0.4

036)

(0.4

075)

With

Posi

tion

With

outV

owel

s0.

0453

0.07

460.

0739

0.07

390.

4174

(0.4

474)

(0.4

137)

(0.4

145)

(0.1

993)

(0.1

007)

With

Vow

els

0.24

390.

3967

0.41

200.

4172

0.40

43(0

.234

9)(0

.113

5)(0

.104

0)(0

.100

9)(0

.108

7)

58

Tabl

e4.

20.k

NN

Exp

erim

ent4

Uni

phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

-0.0

235

0.16

360.

1664

0.16

37-

(0.5

274)

(0.3

153)

(0.3

124)

(0.3

153)

-W

ithVo

wel

s0.

2656

0.14

520.

1547

0.15

55-

(0.2

150)

(0.3

350

)(0

.324

8)(0

.324

0)-

With

Posi

tion

With

outV

owel

s0.

2271

0.29

330.

2706

0.27

680.

4006

(0.2

509)

(0.1

993)

(0.2

105)

(0.2

049)

(0.1

110)

With

Vow

els

0.39

140.

3902

0.39

220.

3983

0.41

20(0

.116

9)(0

.117

7)(0

.116

4)(0

.112

5)(0

.104

0)

59

4.2.2 Biphone ResultsThe biphone representation improved performance in nearly all cases. The best

performing models producing correlations as high as r ≈ 0.57 (p ≈ 0.03). However, this doesnot even approach the performance of the best perceptron biphone models. Results are quitevaried, similar to the uniphone models, and for the most part followed a similar distribution tothe uniphone results. Position and vowels did not consistently improve performance.Improvements were highly dependent on the similarity or distance function used and thenegative data type. In most cases though, including both position and vowels produced thebest models. When both position and vowels were included, correlations were generallyabove r ≈ 0.5 (p < 0.06), even reaching as high as r = 0.5751 (p = 0.0321). Again, the[experiment1] models were the best performing, with the [experiment2] and [experiment4]models not far behind, while the [experiment3] models lagged behind. The interestingexception to this was the [experiment3] positional Euclidean distance model without vowels,which produced the second highest biphone score (r = 0.5545, p = 0.0383). Indeed, for themost part, all of the Euclidean distance models were at or near the top of each group, andwere by far the most consistent.

As we mentioned, the [experiment1] data produced the highest scoring biphonemodels. In general all of the [experiment1] models performed as well as or better than theircounterparts from the other negative data types. Overall, the biphone representation offeredsubstantial improvements over the uniphone representation for the models based on the[experiment1] data. Position improved performance in all cases, particularly for thenormalized similarity metrics. Vowels were only particularly helpful for the similarity basedmodels when position was also included. When combined with the normalized similarityfunctions, including both position and vowels produced the highest scoring biphone models(r ≈ 0.57, p ≈ 0.03). On the other hand, including vowels typically decreased theperformance of the non-positional models based on the normalized similarity metrics.However, the non-positional dot product model was slightly improved when vowels wereincluded. The Euclidean distance models followed a completely different pattern. Thenon-positional Euclidean distance models performed equally well with and without vowels,just like the [experiment1] uniphone models. However, including vowels in the positionalmodel decreased performance. This is unlike the models based on the similarity metrics,which always performed better when both position and vowels were included in the featurerepresentation. The [experiment1] results are presented in Table 4.21.

The performance of the [experiment2] models was somewhat below that of the[experiment1] models. The best were only able to produce correlations around r = 0.5

(p ≈ 0.05). The models based on the normalized similarity metrics performed best when both

60

Tabl

e4.

21.k

NN

Exp

erim

ent1

Bip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.37

710.

3615

0.35

980.

3797

0.48

57(0

.126

4)(0

.137

3)(0

.138

5)(0

.124

7)(0

.064

4)W

ithVo

wel

s0.

3993

0.31

450.

3088

0.31

820.

4778

(0.1

118)

(0.1

731

)(0

.177

7)(0

.170

2)(0

.068

6)

With

Posi

tion

With

outV

owel

s0.

4323

0.44

940.

4494

0.45

490.

5310

(0.0

921)

(0.0

828)

(0.0

828)

(0.0

799)

(0.0

454)

With

Vow

els

0.52

010.

5751

0.57

290.

5723

0.50

59(0

.050

4)(0

.032

1)(0

.032

7)(0

.032

9)(0

.032

9)

61

vowels and positional information were included in the underlying feature representation.Including vowels offered the most improvements. In fact, without vowels all of thenormalized similarity models performed quite poorly. The dot product and Euclidean distancemodels were again quite different. The non-positional dot product models performed equallywell with and without vowels (r ≈ 0.33, p ≈ 0.15). However, when positional informationwas included, there was a substantial difference between the model with vowels and themodel without vowels. The model without vowels was substantially worse than the modelwith vowels (r = 0.0946, p = 0.3910 without vowels compared to r = 0.4691, p = 0.0727

with vowels). The Euclidean distance model performed quite a bit differently than thesimilarity based models. The Euclidean distance models performed much better withoutvowels, both when positional information was included (r = 0.5087, p = 0.0550) and when itwas not (r = 0.4142, p = 0.1026). With vowels, performance was somewhat more modest(r = 0.3783, p = 0.1257 without position and r = 0.4175, p = 0.1007 with position). The[experiment2] results are presented in Table 4.22.

The [experiment3] results followed a similar pattern to the [experiment2] results.Again, we find distinct patterns based on the similarity or distance function used. Thenormalized similarity models performed quite poorly without vowels, regardless of whetherposition was included or not. Although, including position made a slight improvement(r < 0.07, p < 0.48 without position compared to r < 0.18, p < 0.35 with position). Whenvowels were included, performance improved, but was still below that of the [experiment2]models (up to r ≈ 0.29, p ≈ 0.19 without position and r ≈ 0.3982, p ≈ 0.11 with position).The dot product models followed a very similar pattern to the [experiment2] dot productmodels. The non-positional models performed similarly with and without vowels(r = 0.2361, p = 0.2423 without vowels and r = 0.2669, p = 0.2137 with vowels). Whenposition was included, there was a substantial difference between the model with vowels andthe model without. Without vowels, the dot product model only produced a correlation ofr = 0.1324 (p = 0.3490). Including vowels improved performance substantially (up tor = 0.2967, p = 0.1878). The Euclidean distance models were reasonably good compared tothe other [experiment3] models. The Euclidean distance models were improved by includingposition both with (r = 0.4665, p = 0.0740) and without vowels (r = 0.5545, p = 0.0383).But including vowels decreased performance, particularly when positional information wasalso included. However, all of the Euclidean distance models performed better than theirsimilarity based counterparts. In fact, the positional model without vowels produced thesecond highest score of all of the biphone models (r = 0.5545, p = 0.0383). The[experiment3] results are presented in Table 4.23.

62

Tabl

e4.

22.k

NN

Exp

erim

ent2

Bip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.33

510.

1561

0.10

820.

1104

0.41

42(0

.156

9)(0

.323

4)(0

.375

7)(0

.373

3)(0

.102

6)W

ithVo

wel

s0.

3322

0.30

850.

2807

0.29

840.

3783

(0.1

591)

(0.1

772

)(0

.201

6)(0

.186

3)(0

.125

7)

With

Posi

tion

With

outV

owel

s0.

0946

0.14

200.

1406

0.14

300.

5087

(0.3

910)

(0.3

385)

(0.3

400)

(0.3

375)

(0.0

550)

With

Vow

els

0.46

910.

5063

0.50

310.

5095

0.41

75(0

.072

7)(0

.056

0)(0

.057

4)(0

.054

7)(0

.100

7)

63

Tabl

e4.

23.k

NN

Exp

erim

ent3

Bip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.23

610.

0106

0.05

790.

0695

0.44

12(0

.242

3)(0

.487

6)(0

.482

8)(0

.419

5)(0

.087

2)W

ithVo

wel

s0.

2669

0.28

790.

2920

0.29

200.

4278

(0.2

137)

(0.1

953)

(0.1

918)

(0.1

918)

(0.0

946)

With

Posi

tion

With

outV

owel

s0.

4323

0.44

940.

4494

0.45

490.

5310

(0.0

921)

(0.0

828)

(0.0

828)

(0.0

799)

(0.0

454)

With

Vow

els

0.52

010.

5751

0.57

290.

5723

0.50

59(0

.050

4)(0

.032

1)(0

.032

7)(0

.032

9)(0

.032

9)

64

The [experiment4] results were somewhat idiosyncratic compared to the other biphoneresults. The normalized similarity models again performed best when both position andvowels were included in the underlying feature representation (r ≈ 0.53, p ≈ 0.04), whichwere the best scores produced by the [experiment4] models. Individually, both position andvowels made some improvements. The dot product models, on the other hand, followed avery different pattern. Without position, including vowels decreased performance (down tor = 0.2611, p = 0.2190 with vowels from r = 0.4572, p = 0.0787 without vowels). Withposition, including vowels had the opposite effect (up to r = 0.5032, p = 0.0573 with vowelsfrom r = 0.2605, p = 0.2195 without vowels). The performance of the Euclidean distancemodels was also somewhat different than any of the other [experiment4] models. Includingposition improved performance somewhat, but including vowels had no effect(r ≈ 0.45, p ≈ 0.08 without position and r ≈ 0.49, p ≈ 0.06 with position). This issomewhat different than the other biphone Euclidean distance models, which generallyperformed worse when vowels were included. The [experiment4] results are presented inTable 4.24.

4.2.3 Triphone ResultsThe triphone representation produced a wide range of results, some slightly better than

their biphone counterparts, while others performed no better than their uniphone equivalents.Performance seemed more dependent on the negative data type than the uniphone or biphonemodels. Much like the biphone results, position and vowels did not consistently improveperformance, and improvements were highly dependent on the underlying similarity ordistance function. The normalized similarity functions (cosine, Dice, and Jaccard) tended toperform similarly in most cases. The dot product and Euclidean distance models againfollowed their own patterns. Dot product was highly dependent on the feature representation,ranging from virtually no correlation at all (r = 0.0215, p = 0.4750) to among the mostcompetitive triphone models (r = 0.5908, p = 0.0278). Euclidean distance on the other hand,was relatively more consistent across data and feature variations (from r = 0.3632,p = 0.1361 to r = 0.5495, p = 0.0400).

The [experiment1] data again produced the highest scoring models. The performanceof the normalized similarity models followed a similar pattern to their biphone counterparts.Including position improved performance in all cases. Including vowels, on the other hand,was only effective if position was also included. With both position and vowels, thenormalized similarity metrics produced some of the best performing triphone models(r ≈ 0.58, p ≈ 0.02). As we mentioned, the dot product and Euclidean distance modelsfollowed their own patterns. The non-positional dot product models were competitive with the

65

Tabl

e4.

24.k

NN

Exp

erim

ent4

Bip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.45

720.

1993

0.21

100.

2248

0.45

08(0

.078

7)(0

.278

4)(0

.266

7)(0

.253

2)(0

.082

1)W

ithVo

wel

s0.

2611

0.28

470.

2558

0.26

170.

4586

(0.2

190)

(0.1

980

)(0

.223

8)(0

.218

4)(0

.078

0)

With

Posi

tion

With

outV

owel

s0.

2605

0.32

630.

3262

0.33

320.

4950

(0.2

195)

(0.1

637)

(0.1

637)

(0.1

583)

(0.0

607)

With

Vow

els

0.50

320.

5311

0.52

880.

5354

0.49

70(0

.057

3)(0

.046

4)(0

.047

2)(0

.044

8)(0

.060

0)

66

other metrics, particularly when vowels were included, which produced one of the highestscoring non-positional models (r = 0.4714, p = 0.0716). The positional models weresomewhat different. Without vowels, the dot product model was somewhat behind the othersimilarity metrics (r = 0.3695, p = 0.1317). With vowels, however, the dot product modelproduced one of the highest scores using the [experiment1] data (r = 5908, p = 0.0278). TheEuclidean distance models were the highest scoring non-positional models. Including vowelsimproved performance (up to r = 0.4802, p = 0.0675). This is quite different from biphoneEuclidean distance models, which typically performed the same or worse when vowels wereincluded. The positional Euclidean distance models followed this more familiar pattern. Themodel without vowels performed slightly better (r = 0.5388, p = 0.0436) than the modelwith vowels (r = 0.5011, p = 0.0582). The [experiment1] results are presented in Table 4.25.

The [experiment2] triphone results were overall similar to their biphone counterparts.The normalized similarity based models performed more or less like their biphonecounterparts. Including position was not universally effective. Indeed, position only improvedthe performance of the normalized similarity models when vowels were also included. In fact,the models without vowels performed marginally worse when positional information wasincluded (from r ≈ 0.15, p ≈ 0.32 without position to r ≈ 0.13, p ≈ 0.35 with position).The normalized similarity models were substantially better when vowels were included,particularly when position was also included, a pattern that we have seen a number of timesfor the normalized similarity based models. Without position, the models with vowelsproduced correlations around r ≈ 0.31 (p ≈ 0.17) for Dice and Jaccard, and r = 0.2656

(p = 0.2149) for the cosine model. This improved to r ≈ 0.45 (p ≈ 0.08) in all cases whenposition was also included. The dot product models performed similarly to the othersimilarity based models. The non-positional model was substantially improved by includingvowels (from r = 0.1248, p = 0.3464 without vowels to r = 0.3483, p = 0.1470 withvowels).The positional models were also quite similar to the normalized similarity basedmodels. Without vowels, the dot product model performed very poorly (r = 0.0980,

p = 0.3872), slightly worse than the normalized similarity models. When vowels wereincluded, performance increased dramatically (up to r = 0.4116, p = 0.1043). As we mightexpect, the Euclidean distance models followed a different pattern. The non-positional modelwas improved by including vowels (from r = 0.3632, p = 0.1361 without vowels tor = 0.4062, p = 0.1075 with vowels). The positional model, on the other hand, was muchbetter without vowels, producing the highest correlation using the [experiment2] data(r = 0.4902, p = 0.0629). This dropped to r = 0.4066 (p = 0.1073) when vowels were alsoincluded. The [experiment2] results are presented in Table 4.26.

67

Tabl

e4.

25.k

NN

Exp

erim

ent1

Trip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.38

220.

3971

0.38

530.

4273

0.42

66(0

.123

0)(0

.113

3)(0

.114

4)(0

.094

9)(0

.095

3)W

ithVo

wel

s0.

4714

0.38

020.

3580

0.36

210.

4802

(0.0

716)

(0.1

243

)(0

.139

9)(0

.136

9)(0

.067

5)

With

Posi

tion

With

outV

owel

s0.

3695

0.44

940.

4494

0.46

490.

5388

(0.1

317)

(0.0

828)

(0.0

828)

(0.0

748)

(0.0

436)

With

Vow

els

0.59

080.

5833

0.58

770.

5934

0.50

11(0

.027

8)(0

.027

2)(0

.028

6)(0

.027

1)(0

.058

2)

68

Tabl

e4.

26.k

NN

Exp

erim

ent2

Trip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.12

480.

1507

0.15

600.

1517

0.36

32(0

.346

4)(0

.329

1)(0

.323

4)(0

.328

0)(0

.136

1)W

ithVo

wel

s0.

3483

0.26

560.

3083

0.31

680.

4062

(0.1

470)

(0.2

149

)(0

.178

1)(0

.171

2)(0

.107

5)

With

Posi

tion

With

outV

owel

s0.

0980

0.12

980.

1298

0.13

440.

4902

(0.3

872)

(0.3

518)

(0.3

518)

(0.3

468)

(0.0

629)

With

Vow

els

0.41

160.

4494

0.44

450.

4581

0.40

66(0

.104

3)(0

.082

8)(0

.085

4)(0

.078

3)(0

.107

3)

69

The [experiment3] results followed a reasonably similar pattern to the [experiment2]results. Although, overall, they were somewhat worse than their biphone counterparts, muchlike the [experiment2] triphone results. Including position did not affect the normalizedsimilarity models as much as including vowels. Without vowels, the normalized similaritymodels were exceedingly poor, whether position was included or not (r = 0.0307, p = 0.4643

with cosine, and r ≈ 0.1, p ≈ 0.49 for Dice and Jaccard when position was not included,which improved to r ≈ 0.13, p ≈ 0.34 for cosine and Dice, and r = 0.1696, p = 0.3090 forJaccard). With vowels, performance was substantially improved in both cases, particularlywhen position was also included (r ≈ 0.30, p ≈ 0.18 without position and r ≈ 0.38, p ≈ 0.12

with position). The dot product models follow a reasonably similar pattern to the othersimilarity based models. Without vowels, both the positional and non-positional modelsperformed quite poorly (both around r ≈ 0.02, p ≈ 0.47). With vowels, scores improved (upto r = 0.3263, p = 0.1637 without position and r = 0.3596, p = 0.1387 with position). TheEuclidean distance models performed similarly to the Euclidean distance models based on theother data sets. The non-positional model was improved by including vowels (fromr = 0.3734, p = 0.1290 to r = 0.4470, p = 0.0840). However, much like we saw in the[experiment1] and [experiment2] Euclidean distance models, the performance of thepositional models was marginally decreased when vowels were included (down tor = 0.4900, p = 0.0630 with vowels from r = 0.5494, p = 0.0400 with vowels). The[experiment3] results are presented in Table 4.27.

The [experiment4] triphone results were similar to the [experiment4] biphone resultsin that they did not follow the same patterns that we have seem for the other data sets. In fact,the [experiment4] triphone models, on the whole, performed approximately the same as the[experiment4] biphone models. The performance of the normalized similarity based modelswas quite varied. Without position, they were quite poor, performing worse than both the dotproduct and Euclidean distance models (with a maximum of r = 0.3127, p = 0.1745 withoutvowels and a maximum of r = 0.2602, p = 0.2198 with vowels). Without vowels, the cosineand Jaccard models perform quite similarly (r ≈ 0.31, p ≈ 0.17), while the Dice model wassomewhat behind (r = 0.2697, p = 0.2113). Including vowels, decreased performance in allcases. Interestingly there was some difference between the normalized similarity functions inthis case. The cosine model decreased to r = 0.2143 (p = 0.2634), the Dice model tor = 0.2479 (p = 0.2311), and Jaccard to r = 0.2602 (p = 0.2198). Including positionimproved the performance of the normalized similarity models, and evened out the slightdifferences between them. Without vowels, all produced scores around r ≈ 0.35 (p ≈ 0.14).With vowels, this improved to r ≈ 0.52 (p ≈ 0.05). Again, the dot product and Euclideandistance models followed much different patterns. Without position, the dot product models

70

Tabl

e4.

27.k

NN

Exp

erim

ent3

Trip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.02

150.

0307

0.00

860.

0160

0.37

34(0

.475

0)(0

.464

3)(0

.490

0)(0

.481

4)(0

.129

0)W

ithVo

wel

s0.

3263

0.29

480.

3004

0.30

240.

4470

(0.1

637)

(0.1

894

)(0

.184

7)(0

.183

1)(0

.084

0)

With

Posi

tion

With

outV

owel

s0.

0263

0.13

850.

1392

0.16

960.

5495

(0.4

695)

(0.3

423)

(0.3

415)

(0.3

090)

(0.0

400)

With

Vow

els

0.35

960.

3897

0.38

610.

3848

0.49

00(0

.138

7)(0

.118

1)(0

.120

4)(0

.121

3)(0

.063

0)

71

were the best of the similarity based models, producing correlations around r ≈ 0.37

(p ≈ 0.13), both with and without vowels. As we have seen in many other cases, includingpositional information only improved the dot product model when vowels were also included.In fact, the positional dot product model was considerably worse than its non-positionalcounterparts (r = 0.2864, p = 0.1966). The dot product model performed best when bothposition and vowels were included, producing the highest scoring model using the[experiment4] data (r = 0.5482, p = 0.0403). The Euclidean distance models performedquite a bit differently from both the similarity based models and the other Euclidean distancemodels that we have seen. The non-positional model was improved by including vowels(from r = 0.3822, p = 0.1230 without vowels to r = 0.4741, p = 0.0703 with vowels).Although, both were among the best non-positional models using the [experiment4] data.Scores improved with position, whether or not vowels were also included (r = 0.4920,p = 0.0621 without vowels and r = 0.5136, p = 0.0530 with vowels). This is distinctlyunlike the results that we saw with other triphone Euclidean distance models that includedposition, which tended to score lower when both position and vowels were included. The[experiment4] results are presented in Table 4.28.

4.2.4 All Phone ResultsThe overall pattern of the all phone results was very similar to the triphone results.

Although the best performing all phone models produced the highest scores of all of the kNNmodels. Again, the models based on the [experiment1] and [experiment4] data performed thebest, producing correlations as high as r = 0.6913 (p = 0.0092). The [experiment2] and[experiment3] models produced correlations somewhat below this mark. This was lessnoticeable for the Euclidean distance models, which were generally more consistent over allof the data sets. The similarity based models were not quite so robust and showed a distinctpreference for the [experiment1] data. Including position and vowels in the featurerepresentation was not consistently effective. And again, the effects were highly dependent onthe data set and the particular feature representation used. The normalized similarity modelstended to perform similarly in most cases. The dot product and Euclidean distance modelsagain followed their own distinct patterns.

Much like the biphone and triphone results, the [experiment1] data produced thehighest scoring models. The performance of the normalized similarity models followedsimilar patterns. Including position improved performance in all cases. Without position, allof the normalized similarity models performed approximately the same, both with(r ≈ 0.37, p ≈ 0.13) and without vowels (r ≈ 0.37, p ≈ 0.13). Although, the Jaccard modelwithout vowels performed marginally better than cosine similarity or Dice coefficient.Including positional information improved performance in most cases. Without vowels, the

72

Tabl

e4.

28.k

NN

Exp

erim

ent4

Trip

hone

Res

ults

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.37

750.

3127

0.26

970.

3100

0.38

22(0

.126

2)(0

.174

5)(0

.211

3)(0

.176

8)(0

.123

0)W

ithVo

wel

s0.

3678

0.21

430.

2479

0.26

020.

4741

(0.1

329)

(0.2

634

)(0

.231

1)(0

.219

8)(0

.070

3)

With

Posi

tion

With

outV

owel

s0.

2864

0.34

950.

3494

0.35

470.

4920

(0.1

966)

(0.1

460)

(0.1

461)

(0.1

461)

(0.1

423)

With

Vow

els

0.54

820.

5248

0.51

740.

5174

0.52

99(0

.040

3)(0

.048

7)(0

.051

6)(0

.051

6)(0

.046

8)

73

normalized similarity models produced correlations in the range of r ≈ 0.61 (p ≈ 0.02).Including vowels improved these results considerably (up to r ≈ 0.678, p ≈ 0.0108 forcosine and Dice, and r = 0.6913, p = 0.0092 for Jaccard). The dot product and Euclideandistance models displayed a different pattern. The dot product models without positionperformed virtually the same regardless of whether vowels were included (r ≈ 0.40,p ≈ 0.11). Including position substantially improved performance, particularly when vowelswere also included (r = 0.5229, p = 0.0494 without vowels and r = 0.6032, p = 0.0247

with vowels). The Euclidean distance models followed the same general pattern that we sawfor the triphone Euclidean distance models using the [experiment1] data. The non-positionalmodel improved when vowels were included (from r = 0.4979, p = 0.595 without vowels tor = 0.5310, p = 0.0464) with vowels. Both were by far the best performing non-positional[experiment1] models. Including position improved performance in both cases. However,when both vowels and position were included performance decreased (from r = 0.6477,p = 0.0156 without vowels to r = 0.5807, p = 0.0305 with vowels). This is similar to thetriphone Euclidean distance models using the [experiment1] data. The [experiment1] resultsare presented in Table 4.29.

The [experiment2] results were somewhat unique in that they were generally quitepoor. The normalized similarity models were terrible without position, both with and withoutvowels. Without vowels, the normalized similarity models only produced marginalcorrelations (r ≈ 0.01, p ≈ 0.48). Scores were slightly improved when vowels wereincluded, but improvements were not consistent for all of the normalized similarity models.The cosine model only improved to r = 0.0590 (p = 0.4316). The Dice model improvedslightly more, up to r = 0.0753 (p = 0.4129); and the Jaccard model improved the most, upto r = 0.1007 (p = 0.3842). The positional models were substantially better. Without vowels,scores improved to r ≈ 0.25 (p ≈ 0.22). When vowels were included, scores improveddramatically (up to r ≈ 0.54, p ≈ 0.04), among the top scoring models using the[experiment2] data. The dot product models behaved differently of course. Without position,the dot product models were considerably better than the other similarity based models.Without vowels, the dot product model produced a correlation of r = 0.1332 (p = 0.3482).This improved to r = 0.3055 (p = 0.1805) when vowels were included in the underlyingfeature representation. Including position improved performance, but not to the level of thenormalized similarity models (up to r = 0.1996, p = 0.2781 without vowels and r = 0.4768,

p = 0.0691 with vowels). The Euclidean distance models were more consistent. However,they do not follow the same pattern that we saw for the [experiment1] Euclidean distancemodels. Both non-positional models are easily the best of the [experiment2]

74

Tabl

e4.

29.k

NN

Exp

erim

ent1

All

Phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.40

620.

3750

0.37

350.

3952

0.49

79(0

.107

6)(0

.127

9)(0

.129

0)(0

.114

5)(0

.059

5)W

ithVo

wel

s0.

4001

0.36

160.

3575

0.35

430.

5310

(0.1

114)

(0.1

372

)(0

.140

1)(0

.142

6)(0

.046

4)

With

Posi

tion

With

outV

owel

s0.

5229

0.60

150.

6105

0.61

220.

6477

(0.0

494)

(0.0

251)

(0.0

230)

(0.0

227)

(0.0

156)

With

Vow

els

0.60

320.

6787

0.67

880.

6913

0.58

07(0

.024

7)(0

.010

8)(0

.010

8)(0

.009

2)(0

.030

5)

75

non-positional models. Without vowels, the Euclidean distance model produced a correlationof r = 0.4116 (p = 0.1042). Surprisingly, however, when vowels were included, this scoredecreased to r = 0.3447 (p = 0.1496). Position improved performance in both cases. Withoutvowels, the Euclidean distance model was on par with the best scoring [experiment2] models(r = 0.5335, p = 0.0455). When vowels were included performance decreased to r = 0.4668

(p = 0.0739). The [experiment2] results are presented in Table 4.30.The [experiment3] results had a number of similarities to the [experiment2] results.

The normalized similarity models were again quite terrible without position. In fact, thenon-positional normalized similarity models were even worse than their [experiment2]counterparts. Without vowels, all of the normalized similarity models produced negativecorrelations. The cosine model was the worst, only producing a correlation of r = −0.0812(p = 0.5938). The other normalized similarity models (Dice and Jaccard) were onlymarginally better, producing correlations in the range of r ≈ −0.03 (p ≈ 0.53). When vowelswere included, performance improved somewhat. Still however, none of the normalizedsimilarity models produced a correlation above 0.1. With vowels, the cosine and Jaccardmodels produced correlations in the range of r ≈ 0.09 (p ≈ 0.40), while the Dice model wassomewhat behind the others (r = 0.0758, p = 0.4123). When position was included, scoresimproved. Without vowels, however, this improvement was marginal (up to r ≈ 0.14,p ≈ 0.34). When both position and vowels were included, the normalized similarity functionsproduced the best scoring [experiment3] similarity based models. The Dice and Jaccardmodels produced correlations in the range of r ≈ 0.46 (p ≈ 0.077). The cosine model wassomewhat behind this mark (r = 0.4351, p = 0.0905). The dot product results were alsoquite unique compared to the other all phone dot product models. This was true in particularfor the non-positional models. Without vowels, the dot product model performed essentiallythe same as its [experiment2] counterpart (r = 0.1448, p = 0.3356). When vowels wereincluded, performance dropped to r = 0.0420 (p = 0.4512), even worse than the normalizedsimilarity models. Position did not consistently improve performance. Without vowels, thepositional dot product model only managed a correlation of r = 0.1112 (p = 0.3724), roughlyon par with the normalized similarity models. The Euclidean distance models are easily thebest performing [experiment3] models in all cases. The non-positional model was improvedby including vowels (from r = 0.3972, p = 0.1131 without vowels to r = 0.4533, p = 0.0807

with vowels). Position improved performance in both cases. However, much like theEuclidean distance models based on the other data sets, including vowels and position in thefeature representation decreased performance. Without vowels, the positional Euclideandistance model produced the best score using the [experiment3] data (r = 0.6065,

76

Tabl

e4.

30.k

NN

Exp

erim

ent2

All

Phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.13

320.

0168

0.01

900.

0154

0.41

16(0

.348

2)(0

.480

4)(0

.477

9)(0

.482

1)(0

.104

2)W

ithVo

wel

s0.

3055

0.05

900.

0753

0.10

070.

3447

(0.1

805)

(0.4

316

)(0

.412

9)(0

.384

2)(0

.149

6)

With

Posi

tion

With

outV

owel

s0.

1996

0.24

960.

2517

0.25

760.

5335

(0.2

781)

(0.2

298)

(0.2

276)

(0.2

222)

(0.0

455)

With

Vow

els

0.47

680.

5450

0.53

450.

5521

0.46

68(0

.069

1)(0

.041

5)(0

.045

1)(0

.039

1)(0

.073

9)

77

p = 0.0239). This dropped to r = 0.5197 (p = 0.0507) when vowels were included in theunderlying feature representation. The [experiment3] results are presented in Table 4.31.

The [experiment4] results were again somewhat unique compared to the other allphone results, and were among the best performing of the kNN models. Although, they werestill somewhat behind the best performing [experiment1] models. The normalized similaritymodels followed a reasonably similar pattern to the [experiment1] models. Without positionthey did not perform quite as well as the [experiment4] Euclidean distance models (r ≈ 0.29,p ≈ 0.19). When vowels were included, performance decreased (down to r ≈ 0.21,p ≈ 0.26). Position improved performance substantially, particularly when vowels were alsoincluded in the underlying feature representation. Without vowels, the cosine and Dicemodels performed similarly, producing correlations in the range of r ≈ 0.47 (p ≈ 0.07). TheJaccard model was marginally better (r = 0.4948, p = 0.0609). With vowels, the normalizedsimilarity models were among the best [experiment4] models (r ≈ 0.61, p ≈ 0.02). The dotproduct models followed the same basic pattern of the [experiment1] dot product models. Thenon-positional models performed essentially the same with and without vowels. Although,there was a slight decrease in performance when vowels were included (from r = 0.3622,p = 0.1368 without vowels to r = 0.3486, p = 0.1367 with vowels). Again, includingposition improved performance, but not to the same level as the normalized similarity models.Without vowels, the positional dot product model produced a correlation of r = 0.4466

(p = 0.0842). Performance improved slightly with vowels, up to r = 0.4909 (p = 0.0626),but still not to the level of the top scoring [experiment4] models. The Euclidean distancemodels were again among the best performing and most consistent. The Euclidean distanceresults followed a similar pattern to the [experiment2] Euclidean distance results. Withoutposition, the Euclidean distance models were easily the best performing models, both withand without vowels. When vowels were not included, the non-positional model produced acorrelation of r = 0.5003 (p = 0.0585). Including vowels decreased performance somewhat(down to r = 0.4798, p = 0.0676). Position improved performance. However, once again,including both vowels and position hurt the Euclidean distance models. Without vowels, thepositional model produced the highest score for the [experiment4] Euclidean distance models(r = 0.6045, p = 0.0244). When vowels were included, performance decreased slightly(down to r = 0.5711, p = 0.0332). The [experiment4] results are presented in Table 4.32.

4.2.5 kNN Results SummaryThe kNN results were quite different from the perceptron results. Performance was

much more consistent between feature representations (uniphone, biphone, triphone, and allphone). The best uniphone models are much better than their perceptron counterparts. The

78

Tabl

e4.

31.k

NN

Exp

erim

ent3

All

Phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.14

48-0

.081

2-0

.034

4-0

.023

10.

3972

(0.3

356)

(0.5

938)

(0.5

400)

(0.5

269)

(0.1

131)

With

Vow

els

0.04

200.

0899

0.07

580.

0924

0.45

33(0

.451

2)(0

.396

3)(0

.412

3)(0

.393

4)(0

.080

7)

With

Posi

tion

With

outV

owel

s0.

1112

0.13

570.

1346

0.14

520.

6065

(0.3

724)

(0.3

453)

(0.1

947)

(0.3

350)

(0.0

239)

With

Vow

els

0.31

990.

4351

0.21

400.

4612

0.51

97(0

.168

8)(0

.090

5)(0

.263

8)(0

.076

7)(0

.050

7)

79

Tabl

e4.

32.k

NN

Exp

erim

ent4

All

Phon

eR

esul

ts

Feat

ure

Dat

aV

aria

tion

Func

tion

Dot

Prod

uct

Cos

ine

Dic

eJa

ccar

dE

uclid

ean

unip

hone

expe

rim

ent1

With

outP

ositi

onW

ithou

tVow

els

0.36

220.

2874

0.28

860.

3008

0.50

03(0

.136

8)(0

.195

8)(0

.194

7)(0

.184

4)(0

.058

5)W

ithVo

wel

s0.

3486

0.21

560.

2140

0.20

300.

4798

(0.1

367)

(0.2

622

)(0

.263

8)(0

.274

7)(0

.067

6)

With

Posi

tion

With

outV

owel

s0.

4466

0.47

250.

4784

0.49

480.

6045

(0.0

842)

(0.0

711)

(0.0

683)

(0.0

609)

(0.0

244)

With

Vow

els

0.49

090.

6177

0.61

710.

6187

0.57

11(0

.062

6)(0

.021

4)(0

.021

6)(0

.021

2)(0

.033

2)

80

best kNN uniphone model produced a score of r = 0.4358 (p = 0.0901), compared tor = 0.2937 (p = 0.1903) for the best perceptron uniphone model. While it is not a significantdifference, it is substantial. In fact, only the best all phone models are significantly better thanthe best kNN uniphone models. Although, overall performance generally increases when amore informative feature representation is used. As the feature representation becomes moreinformative, the increase in performance becomes larger. Thus using the biphonerepresentation improved performance over the uniphone representation (up to r = 0.5751,p = 0.0321); likewise the triphone representation improved performance over the biphonerepresentation (up to r = 0.5934, p = 0.0271); and so on for the all phone representation (upto r = 0.6913, p = 0.0092). Similarly, using feature variations with more information provedmost effective. The best kNN models were based on feature sets that included both vowelsand positional information. This was true regardless of the basic feature, whether uniphone,biphone, triphone, or all phone. Again, the [experiment1] and [experiment4] data sets provedto be the most effective, just as we saw using the perceptron. However, even the all phonemodels that included both vowels and positional information, using the [experiment1] data(scoring as high as r = 0.6913 (p = 0.0092)) did not manage to perform quite as well as thebest perceptron models (which scored as high as r = 0.7853, p = 0.0021). The difference isnot significant, but the perceptron models do have a distinct advantage.

Varying the underlying function (dot product, cosine similarity, Dice Coefficient,Jaccard Coefficient, or Euclidean distance) had some effect on performance, with dot productand Euclidean distance based models performing quite differently than the normalizedsimilarity models (cosine, Dice, and Jaccard). The normalized similarity functions performvery similarly in most cases. They perform best when both position and vowels were includedin the underlying feature representation. Ultimately, the normalized similarity functionsproduced the highest scoring kNN models (r = 0.6787− 0.6913, p = 0.0108− 0.0305),using the all phone representation with vowels and position and the [experiment1] data. Thedot product based models behaved somewhat differently. The dot product models did notproduce scores as high as the normalized similarity models. However, they were reasonablycompetitive when position was not included, often scoring higher than their normalizedsimilarity counterparts in these cases. This was particularly true using the triphone and allphone feature representations. Finally, Euclidean distance produced the most consistentmodels across variations, typically producing some of the best results in each case. Although,the best Euclidean distance model did not perform quite as well as the best (normalizedsimilarity) kNN models, it was not far behind (r = 0.6477, p = 0.0156). Unlike thesimilarity based models, Euclidean distance performed best when position was included but

81

vowels were not. The differences between the performance of these functions was rarelysignificant, however.

4.3 CLUSTERED NEIGHBORHOOD RESULTSThe clustered neighborhood models did not perform particularly well compared to the

perceptron and kNN models. However, the clustered neighborhood results reported here arebased on only a small fraction of the data used to train the perceptron and kNN models.Considering this fact, the clustered biphone results perform reasonably well, on par with thekNN uniphone models. Performance, however, is highly dependent on the set of examples thatwe randomly select for training data, due to the memory limitations of our clustering method.

The dot product model produces the best score (r = 0.3781, p = 0.1258), which issurprising based on the kNN results. The normalized similarity models perform almostidentically, a familiar characteristic from the kNN results. Each scored in the range ofr ≈ 0.34 (p ≈ 0.15). The Euclidean distance model did not perform particularly well, onlymanaging an absolute correlation of r = 0.1784 (p = 0.2999).

The triphone representation did not prove to be effective for the clusteredneighborhood model. The triphone Euclidean distance model is marginally worse than itsbiphone counterpart, only producing an absolute correlation of r = 0.1107 (p = 0.3729).However, it was the only triphone based model that produced a correlation. The similaritybased models did not manage to produce correlation scores at all. This is likely the result ofthe small amount of data used to train the strict neighborhood model. Many, if not all, of thetest examples were assigned to neighborhoods with only a single member which they did notshare any features with, resulting in an acceptability score of 0.0 for all test examples. This isdespite the fact that the similarity based models selected the same classes as the Euclideandistance models. How, then, did the Euclidean distance model manage to produce acorrelation score? It is due to the fact that Euclidean distance does not depend on sharedfeatures, unlike the similarity based models. An acceptability score can be computed whetheror not two vectors have any features in common. Thus, although the triphone Euclideandistance model suffers from the same shared feature problem as the similarity based models, itis still able to produce acceptability scores for the test examples. Of course, however, due tothe larger number of unknown features in the triphone model, it did not perform as well as thebiphone models. The clustered neighborhood results are presented in Table 4.33.

4.4 PROBABILITY RESULTSMuch like the clustered neighborhood models, the biphone representation produced

the best probability models. The model based on joint biphone probability slightly edged outthe model based on geometric mean (joint: r = 0.4842, p = 0.0656; average: r = 0.4419,

82

Tabl

e4.

33.C

lust

ered

Nei

ghbo

rhoo

dR

esul

ts

Feat

ure

Var

iatio

nFu

nctio

nD

otPr

oduc

tC

osin

eD

ice

Jacc

ard

Euc

lidea

nbi

phon

eW

ithou

tPos

ition

,0.

3781

0.34

160.

3422

0.34

25(−

)0.1784

With

Vow

els

(0.1

258)

(0.1

520)

0.15

15(0

.151

2)(0

.058

5)

trip

hone

With

outP

ositi

on,

--

--

(−)0.1107

With

Vow

els

--

--

(0.3

729)

83

p = 0.0868). This difference of course is not significant. The triphone models were extremelypoor. The joint probability model produced virtually no correlation (r = 0.0660, p = 0.4235),while the average probability model produced a strongly negative correlation (r = −0.5758,p = 0.9681). The is an extremely poor result. A negative correlation in this case indicates amodel that makes the opposite prediction to our human respondents. The probability resultsare presented in Table 4.34.

Table 4.34. Probability Results

Feature Variation MethodAverage Joint

biphone Without Position, 0.4419 0.4841With Vowels (0.0868) (0.0656)

triphone Without Position, −0.5758 0.0660With Vowels (0.9681) (0.4235)

4.5 CHAPTER SUMMARYThis chapter presented the correlation results of the models that we developed. There

was considerable variation in performance between models, depending on the particulars ofthe training data (data set, feature representation, etc.) and the type of model (perceptron,kNN, clustered neighborhood, or probability). The classifier models (perceptron and kNN)produced the best results, particularly using the all phone representation. Interestingly, whilethe best perceptron and kNN models had a lot of similarities (both used the [experiment1]data, the all phone feature set, and some form of normalization), they were most effective withdifferent feature variations. The perceptron performed best without vowels or position, inother words, using the all phone feature set with the lowest number of features. The kNNmodel, on the other hand, performed best when both vowels and position were included (theall phone feature set with the most features). This is most likely due to the fundamentallydifferent approaches to the problem embodied by the perceptron and kNN. The perceptroncreates a model based on the proportions of the features in the training data, in our casesegments or sequences of segments. This abstracts away from the training examplesthemselves. Since the model is based on the proportions of the features in the trainingexamples, rather than the actual examples themselves, removing vowels and positionalinformation can thus give us better estimates of the proportions of the features in the largermodels (like the all phone model). kNN, on the other, hand models the training examplesdirectly. Similarity scores between two examples can be calculated more accurately whenboth examples include as much information as possible. In this case, removing vowels and

84

positional information blurs many of the distinctions between examples, thus making moreexamples seem more similar. This in turn, makes classification more difficult. The model isless able to make the fine distinctions that accurate classification requires. Thus, it is evidentwhy the (larger) perceptron models perform the best when vowels and positional informationis removed and why the kNN models perform best when both are included.

The clustered neighborhood and probability models were quite different from theclassifier models, and did not perform nearly as well overall. The best perceptron modelswere significantly better than both (p < 0.0059), while the best kNN model was onlysignificantly better than the clustered neighborhood model (p = 0.0158). In both cases, thetriphone representation was not effective, and both suffer from feature sparsity. There aremany weaknesses to the clustered neighborhood model. Most notably, that it is not entirelyclear how well it approximates a neighborhood based model like the GNM of Bailey andHahn [2], a topic that we will return to in the next chapter. It is also very sensitive to therandom selection of the training data, since the data sample is so small. This is reflected by itsperformance compared to the two-class kNN models. Much like the clustered neighborhoodmodels, the probability based models also performed better using the biphone feature set. Ineach case, the triphone representation substantially hurt performance. However, the biphoneprobability models were somewhat better than their clustered neighborhood models.

Although our classifier-based correlate to a probability model outperforms theneighborhood based models (kNN and the clustered neighborhood model), we cannot say atthis point that our results confirm the results of Albright [1]. A number of questions stillremain. For example, why do the best perceptron models perform so much better that themore traditional probability models? For that matter, why do the best kNN models, sincethese are essentially neighborhood based models, perform nearly as well as the bestperceptron models? While we have produced some impressive results, at this point we cannotbe sure that they show what we have set out to show. We delve deeper into these issues andothers in the next chapter.

85

CHAPTER 5

DISCUSSION

In this chapter, we will first discuss a few loose ends that have come up during thecourse of our research, many of which we have mentioned at various points during the courseof this work. We experimented with variations on the perceptron, kNN, and clusteredneighborhood models that we discussed in previous chapters. First we will investigate themajor weakness of the perceptron: its variance. We mentioned in the second chapter that theperceptron is known to be sensitive to certain idiosyncrasies of the training data. To explorethis further, we have produced five variations of the top-performing perceptron models fromthe previous chapter to investigate the effect that varying the presentation of the training datahas on model performance. We also briefly mentioned in the second chapter that the choice ofk in the kNN model can potentially have an effect on performance. We made severalvariations of the top-performing kNN models presented in the previous chapter to investigatethe effect that varying the choice of k has on model performance. Finally, we extend theclustered neighborhood model to include the entire set of positive training data that was usedto train the other types of models. This of course addresses one of the obvious weaknesses ofthe clustered neighborhood model. As a point of comparison, we also constructed a moretraditional neighborhood model based on edit distance, which we briefly discussed in the firstchapter. The chapter concludes with a discussion of our work as a whole, its overallcontributions and a number of its weaknesses.

5.1 ADDRESSING SOME LOOSE ENDSIn this section we address some remaining questions concerning the models that we

have presented. In particular, we address the variance of the perceptron models, the choice ofk for the kNN models, and the reduced training set we used to train the clusteredneighborhood model.

5.1.1 PerceptronAs we mentioned in the second chapter, the perceptron has a lot of variance compared

to other, more sophisticated linear classifiers like the SVM. We should expect, then, that theorder of the training data could have a greater effect on the overall quality of the modelsproduced by the perceptron. To investigate this, we produced five variations of the originaltraining data. These variations were constructed randomly and were checked against both the

86

original training data and the other random variations to ensure that each version was unique.These data were encoded using the six top-scoring feature encodings and data sets from theresults presented in the previous chapter. Thus we constructed five new versions of the[experiment1] biphone positional model with vowels, the [experiment4] biphone positionalmodel with vowels, the [experiment4] triphone non-positional model without vowels, the[experiment1] all phone non-positional model without vowels, and the all phone[experiment4] non-positional model without vowels. The results of these experiments arepresented below (Versions 1-5) along with the results of the original model (Version 0).Averages were calculated using the Fischer transformation [45].

There was a great deal of variation in the results when we simply altered the order ofthe presentation of the training examples. Certain versions of the training data produced muchhigher scores, while others were much lower. In general, the biphone representation provedmost consistent, using both the [experiment1] and [experiment4] data, and produced scores ashigh as r = 0.8497 (p = 0.0004). This represents a considerable improvement over theoriginal biphone results. For the most part, using some type of normalization produced betterand more consistent results. L1-normalization seems to be generally preferable for thebiphone representation, producing scores in the range of r = 0.72− 0.80 (p = 0.0058−0.0014) range. Although, the highest scoring model used L2-normalization. When comparedwith the results of the other versions of the training data, the original ordering of the trainingdata produced models near the lower end of the scale. The [experiment1] results are presentedin Table 5.1, and the [experiment4] results are presented in Table 5.2.

Table 5.1. Perceptron Experiment1 Biphone Results

Experiment1 With Position With Vowels none L1-norm L2-normVersion 0 0.7168 0.7153 0.7059

(0.0065) (0.0067) (0.0076)Version 1 0.7180 0.7408 0.7003

(0.0064) (0.0046) (0.0082)Version 2 0.7424 0.7547 0.7368

(0.0044) (0.0036) (0.0048)Version 3 0.6467 0.7416 0.7051

(0.0157) (0.0045) (0.0077)Version 4 0.6939 0.8028 0.7626

(0.0089) (0.0014) (0.0032)Version 5 0.7478 0.8026 0.7560

(0.0041) (0.0015) (0.0036)Average 0.7125 0.7616 0.7288

87

Table 5.2. Perceptron Experiment4 Biphone Results

Experiment4 With Position With Vowels none L1-norm L2-normVersion 0 0.6977 0.7381 0.6768

(0.0085) (0.0048) (0.0111)Version 1 0.7604 0.7774 0.8497

(0.0033) (0.0024) (0.0004)Version 2 0.6834 0.7888 0.7452

(0.0102) (0.0020) (0.0042)Version 3 0.7547 0.7729 0.7512

(0.0036) (0.0026) (0.0039)Version 4 0.6953 0.7457 0.6541

(0.0087) (0.0042) (0.0145)Version 5 0.7197 0.7245 0.7002

(0.0063) (0.0058) (0.0082)Average 0.7199 0.7589 0.7372

Much like we saw in the previous chapter, the models based on the triphonerepresentation did not generally perform as well as those based on the biphone representation.As we discussed, this is likely due to the size of the triphone feature space compared to thenumber of training examples in our training data. Varying the order of the presentation of thetraining examples produced lower scoring models than the original version (in the r = 0.57−0.70 (p = 0.03− 0.007) range. Normalization did not consistently improve performance,although it was helpful in some cases. Once again, L1-normalization generally produced thebest results. So it would seem that the original triphone models were near the top of the rangeof possible triphone models. The [experiment4] triphone results are presented in Table 5.3.

Table 5.3. Perceptron Experiment4 Triphone Results

Experiment 4 Without Position Without Vowels none L1-norm L2-normVersion 0 0.6778 0.7384 0.7036

(0.0110) (0.0047) (0.0078)Version 1 0.5784 0.6281 0.6093

(0.0312) (0.0193) (0.0233)Version 2 0.5808 0.7008 0.6283

(0.0304) (0.0081) (0.00192)Version 3 0.5749 0.6390 0.5306

(0.0321) (0.0171) (0.0466)Version 4 0.6659 0.7079 0.6732

(0.0126) (0.0074) (0.0116)Version 5 0.5741 0.6397 0.6444

(0.0324) (0.0170) (0.0161)Average 0.6107 0.6779 0.6346

88

Much like the triphone variation results, the original all phone models seemed to be atthe top of the range. However, the all phone representation produced much more consistentresults across training data versions than the triphone representation. Results were generallymore consistent, particularly when some form of normalization was used, than even thebiphone results. Although, the best all phone models did not perform quite as well as the bestbiphone models. Using the [experiment1] data, the L1-normalized models produced results inthe range of r = 0.71− 0.77 (p = 0.0066− 0.0023). The L2-normalized models producedslightly higher results in the range of r = 0.72− 0.78 (p = 0.0057− 0.0021). Theunnormalized models were somewhat behind, in the range of r = 0.63− 0.71 (p = 0.0179−0.0067). Clearly, normalization is important for the all phone representation, regardless of thedata used, simply due to the shear length of the all phone vectors. These results aresummarized in Table 5.4.


Experiment1 Without Position Without Vowels none L1-norm L2-normVersion 0 0.6658 0.7713 0.7853

(0.0126) (0.0027) (0.0021)Version 1 0.6347 0.7155 0.7698

(0.0179) (0.0066) (0.0027)Version 2 0.6498 0.7226 0.7555

(0.0152) (0.0060) (0.0036)Version 3 0.7182 0.7398 0.7571

(0.0064) (0.0046) (0.0035)Version 4 0.7011 0.7337 0.7843

(0.0081) (0.0051) (0.0021)Version 5 0.7145 0.7789 0.7260

(0.0067) (0.0023) (0.0057)Average 0.6820 0.7446 0.7637

The models produced using the [experiment4] data did not perform quite as well as the[experiment1] models. In fact, the [experiment4] models were generally worse than weexpected given the original results. In most cases, normalization improved performance,although often not as much as for the original version. Using the [experiment4] data, theL1-normalized models produced results in the range of r = 0.66− 0.71 (p = 0.0135−0.0064). L2-normalization produced a slightly wider range of results r = 0.62− 0.76

(p = 0.0202− 0.0031). And of course the unnormalized models produced more modestresults, in the range of r = 0.59− 0.70 (p = 0.0268− 0.0081). Interestingly, it seems theorder of the presentation of the training data could have an effect on the choice ofnormalization. This can be seen in Version 5 of the [experiment4] data, where the

89

unnormalized model outperforms the L2-normalized model, and is not too far behind theL1-normalized model. These results are presented in Table 5.5.


Experiment4 Without Position Without Vowels none L1-norm L2-normVersion 0 0.6304 0.7753 0.7422

(0.0188) (0.0025) (0.0044)Version 1 0.5949 0.6604 0.7180

(0.0268) (0.0135) (0.0064)Version 2 0.6109 0.6753 0.7638

(0.0229) (0.0113) (0.0031)Version 3 0.6133 0.6852 0.6860

(0.0224) (0.0100) (0.0099)Version 4 0.7013 0.7185 0.6774

(0.0081) (0.0064) (0.0110)Version 5 0.6583 0.6701 0.6235

(0.0138) (0.0120) (0.0202)Average 0.6363 0.6998 0.7047

The perceptron models clearly have high variance. The biphone models performed thebest across variations, producing the best results when averaged across variations. This isreminiscent of results reported in [1] and [49, 50, 51], which suggested the biphone basedmodels tend to perform the best. Although, none of these studies investigated the performanceof a model comparable to our all phone model. Interestingly, it seems that the order of thepresentation of the original training data produced triphone and all phone models near the topof their respective performance ranges, while it produced only average or below averagebiphone models. This suggests that the optimal presentation of the training data to theperceptron can be affected by the feature representation that is used. A particular ordering thatproduces good results using one feature representation likely may not produce good resultsusing another feature representation. Because of the high variance, the raw results are difficultto compare with the other types of models (kNN, clustered neighborhood, and probability).However, the averages that we have produced for the top scoring models in this chapter arelikely closer to the actual expected value. In fact, the average correlation produced by the[experiment1] model is very close to the correlation produced by an SVM model trained onthe original biphone data using the same features during our model verification process. TheSVM of course produces models in which the order of the presentation of the training data isirrelevant.

90

5.1.2 kNNThe major parameter of a kNN model is of course k, the number of nearest neighbors

used to decide the class of an unclassified example. Our original models used k = 7. Thisnumber was chosen primarily because our test examples had, on average, in the area of 5 to 10nearest neighbors (using the biphone representation). However, in this chapter, we elected tovary k in a small range to see if this would have any effect on performance. We varied thevalue of k from 5 to 30 in increments of 5 to determine the effect on overall performance. Weselected the top-performing kNN models from our original experiments in much the sameway we did for the perceptron above. Thus we produced new versions of the positional[experiment1] biphone model with vowels, the positional [experiment1] triphone model withvowels, and the positional [experiment1] all phone model with vowels (recall that the bestperforming kNN models all included both position and vowels). We limited this investigationto the normalized similarity functions (cosine similarity, Dice coefficient, and JaccardCoefficient), because these functions produced the highest scoring models overall. The resultsof this experiment are presented below.

The biphone models performed the best when k is between 5 and 10 inclusive,indicating that k = 7 was a very reasonable choice. Performance began to decrease when kwas greater than 10, which was expected. However, performance begins to increase again as kapproaches 30. This was less expected; however performance never reached the levels that wesaw when k was between 5 and 10. Between k = 13 and k = 20, the normalized similarityfunctions performed virtually indistinguishably. Nearer the ends of the spectrum (k < 13 andk > 20), there were some interesting differences. The model based on cosine similarityperformed the best up to k ≈ 11, at which point, it became virtually indistinguishable fromthe model based on Dice Coefficient. The model based on the Jaccard Coefficient performedthe worst in these situations, scoring nearly 0.1 less than the cosine model when 5 ≤ k ≤ 10

and as much as 0.05 less when k = 30. The Dice models typically scored between the cosinemodel and the Jaccard model. When k = 5, the Dice and Jaccard models performed identically(r ≈ 0.57). As k increased, the Dice model approached the performance of the cosine model.The performance of the Dice model increased dramatically between k = 5 and k = 10. Afterthis point, the cosine and Dice models perform almost identically. Although, the cosine modelperformed slightly better than the Dice model when k > 27. These results are depicted inFigure 5.1.

The triphone models followed a very similar pattern. However, performance began todecrease when k was greater than 5. Indeed, the triphone were better with k = 5 than thetriphone results reported in the previous chapter. The performance of the triphone modelscontinued to decline very gradually until k was greater than 20, at which point performance

91

0 10 20 30 40 50 60 70k Value

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

Corr

ela

tion (

r)CosineDiceJaccard

Figure 5.1. kNN biphone results varying k.

began to gradually increase. This trend continued as k approached 30. The performance of allof the normalized similarity functions was very similar up until k ≥ 28. At this point point thenormalized similarity functions perform slightly differently. The cosine model again performsslightly better than the other two, the Jaccard model, slightly worse, and the Dice model inbetween the others. In general however, the cosine and Dice models perform virtually thesame. The Jaccard model performed somewhat differently in several cases. It diverged fromthe other two models very slightly between k = 5 and k = 10, and around k = 15 and k = 20. Ingeneral however, all three models performed about the same in most cases. These results aredepicted in Figure 5.2.

Unlike the biphone and triphone models, increasing k consistently improvedperformance by as much as 0.09. This was an entirely unexpected result, but it makes sensewhen we consider the number of features in the all phone models. Since there areexponentially more features in the all phone models than the biphone or triphone models, andthese features are redundant, more examples will share features. Thus it is not surprising thata given example will have more nearest neighbors than in the biphone or triphone models.Thus increasing the number of nearest neighbors we consider, increases the quality of thedecision made by the all phone model. In addition to the very different shape of the graph, thesimilarity functions also followed a somewhat different pattern of performance. The cosineand Dice models again perform quite similarly in most cases (with the exception of k ≈ 15,where the Dice model performed slightly worse than the cosine model. The Jaccard model onthe other hand, performed slightly better than the other two, unlike the pattern that we saw inthe biphone and triphone results. The difference in performance was greatest around k = 10;

92

0 10 20 30 40 50 60 70k Value

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

Corr

ela

tion (

r)CosineDiceJaccard

Figure 5.2. kNN triphone results varying k.

however the Jaccard model maintained a very slight advantage as k approached 30. Theseresults are presented in Figure 5.3.

0 10 20 30 40 50 60 70k Value

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

Corr

ela

tion (

r)

CosineDiceJaccard

Figure 5.3. kNN all phone results varying k.

To investigate this trend further, we extended the range of k that we considered for theall phone models, from 5 to 30 up to 5 to 70. Unfortunately, but perhaps not surprisingly,beyond k = 30, performance began to drop off, but at a very gradual rate, ultimately ending uparound r ≈ 0.715 (p ≈ 0.0006) with k = 70. When the results are extended beyond k = 30, wesee a much greater variation in performance among the three functions. The performance of

93

each model decreased after k = 30, however the rate of this decrease was different for eachfunction. The Jaccard model was consistently better than the models based on the other twofunctions. The gap in performance was greatest between k ≈ 40 and k ≈ 60. Above k = 60,the gap in performance was somewhat more modest, although the Jaccard model stillperformed better than the other two models. The performance of the Dice model decreasedquickly between 30 ≤ k ≤ 35. After this, the performance of the Dice model is reasonablyconsistent until k = 60. At this point, there was another dramatic decrease in performancefollowed by a very slight increase as k approached 70. The cosine model performedsomewhat less consistently than the other two models between k = 30 and k = 55. We see thesame dramatic decrease in performance between k = 30 and k = 35 that we saw in the case ofthe Dice model. However, performance increased slightly as k approached 40. This wasfollowed by a more gradual decrease between k = 40 and k = 50. After k = 50, there wasanother slight increase, at which point, the performance of the cosine model leveled off.These results are presented in Figure 5.4.

0 10 20 30 40 50 60 70k Value

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

Corr

ela

tion (

r)

CosineDiceJaccard

Figure 5.4. kNN all phone results varying k extended.

Varying k had some very interesting effects on the performance of the various kNNmodels. We found that, while k = 7 was a reasonably good choice for the biphone basedmodels, the other types of models (triphone and all phone) performed better with alternativechoices for k. In the case of the triphone models, the best choice of k was closer to k = 5,while the all phone models produced the best results with a much larger choice of k, closer tok = 30. We suspect that the triphone models performed best with a lower choice of k due tothe extreme sparsity of features suffered by the triphone models. With fewer features known

94

to the model, a smaller set of nearest neighbors, likely with higher similarity between them,produced the best performing triphone models. As k increased, the overall similarity betweenthe nearest neighbors decreased, making it more difficult for the triphone models to produceaccurate results. As we discussed, the benefit to the all phone models of a larger k is likelydue to the opposite reason. The larger number of redundant features increased the number ofhigh similarity nearest neighbors for any given unclassified example. So increasing the size ofthe set of nearest neighbors also increased the accuracy of the all phone models. In fact, usingk = 30 increased the performance of the all phone models nearly to the level of the bestperforming perceptron models (even to the level of the average biphone results reportedearlier in this chapter), erasing the gap reported in the previous chapter. Although, the overallperformance of the all phone models does not reach the same level as some of the raw,unaveraged perceptron models, the averaged scores are a much better indicator of the actualperformance of the perceptron models due to high variance.

5.1.3 Clustered Neighborhood ModelWe saw in the previous chapter that the clustered neighborhood model was generally

worse than the other types of models that we tested. We suspected that this lack ofperformance could, at least in part, be due to the fact that the clustered neighborhood modelswere based on only as small portion of the training data used to train the other types ofmodels, only around 5,000+ examples. Recall that this restriction was due to the memoryrequirements of the clustering method that we used, a fairly unsophisticated implementationbased on the SciPy Python package. To test this, we extended the clustered neighborhoodmodel to include the entirety of the training data. Since we still faced the memory limitation,we were unable to cluster all 200,000+ examples outright. Instead, we classified theremaining training examples using the original clustering we discussed in Chapter 3. Weassumed that the original clustering was a reasonable representation of the data. Each of theremaining training examples was thus classified into one of the 4,153 neighborhoods (classes)that were produced by the original clustering of the data. We then used this classification ofthe training data to train an extended clustered neighborhood model. We trained the extendedclustered neighborhood model on the entire training set (just like the other types of models),and we applied this model to our test set. We limited this experiment to the biphone data,since the triphone versions were unable to produce reasonable results. The results producedby the extended clustered neighborhood model are presented in Table 5.6. As we can see,extending the clustered neighborhood model improved performance substantially in nearly allcases. The best performing models produced results on par with the biphone joint probabilitymodels, a considerable improvement over the original clustered neighborhood. The similarity

95

Tabl

e5.

6.E

xten

ded

Clu

ster

edN

eigh

borh

ood

Res

ults

vsO

rigi

nal

Clu

ster

edN

eigh

borh

ood

Res

ults

Feat

ure

Var

iatio

nFu

nctio

nD

otPr

oduc

tC

osin

eD

ice

Jacc

ard

Euc

lidea

nbi

phon

eW

ithou

tPos

ition

,0.

4419

0.47

280.

4700

0.46

950.

1210

With

Vow

els

(0.0

868)

(0.0

709)

(0.0

723)

(0.0

725)

(0.3

615)

biph

one

With

outP

ositi

on,

0.37

810.

3416

0.34

220.

3425

(−)0.1784

With

Vow

els

(0.1

258)

(0.1

520)

0.15

15(0

.151

2)(0

.058

5)

96

based models all performed reasonably well (r > 0.44, p > 0.087), with the normalizedsimilarity models scoring as high as (r ≈ 0.47, p ≈ 0.07). The Euclidean distance model, onthe other hand, still performed quite badly (r = 0.1210, p = 0.3615). While this is somewhatbetter than the original model, it is certainly below expectation. It is unclear why theEuclidean distance model performs so poorly in the clustered neighborhood case, and it maybe related to the scoring function that we used for the actual classification.

Despite this improvement, as we mentioned in the previous chapter, we can still not becompletely sure that our kNN-based clustered neighborhood model is really comparable toother neighborhood based models that have been proposed [2, 32]. To address this issue, weconstructed a simple neighborhood model based on edit distance in the style of Luce [32]. Webriefly described models of this type in the first chapter. Basically, the neighborhood of aword is defined to include all of the words that differ from it by one edit distance. Weconstructed this simple neighborhood model in order to compare its performance to theperformance of our clustered neighborhood model. This will give us a better indication ofhow well our clustered neighborhood model compares to other neighborhood based models inthe literature. The results of this experiment are presented in Table 5.7.

Table 5.7. Basic Neighborhood Results

Basic Neighborhood Model Score0.5284

(0.0474)

Surprisingly, this simple model performs quite well, somewhat better than thengram-based probability models we discussed in the previous chapter and, more importantly,better than our clustered neighborhood model. However, the performance of the basicneighborhood model and the clustered neighborhood model are in roughly the same range.This is an indication that our clustered neighborhood model is a reasonable, albeit imperfect,approximation of the basic neighborhood model. Although, it is unlikely, in its current state,that our clustered neighborhood model could approach the performance of the GNM of Baileyand Hahn [2]. Despite this, there are a number of improvements that could be made to theclustered neighborhood model. It is quite likely that improving the initial clustering wouldimprove overall performance, since, after all, we are assuming that the original clustering(based on only 5,000+ examples) is sufficient to account for all of the data. This is unlikelysince the original clustering is based on only a small fraction of the total training data, whichis clearly a weakness of the approach. Clustering all of the data, however, would require muchmore sophisticated clustering techniques (particularly in the area of memory management)than the ones presented here. Another possibility is the clustering algorithm itself. We used

97

simple single-link clustering due to the fact that, on the surface at least, it produced the bestclusters of the methods that we tried. Unfortunately, it is entirely possible that this wassomething of a happy accident due to the relatively small amount of data that we wereworking with. We may have just randomly selected a set of training examples that werereasonably similar, making it appear that single-link clustering was producing high qualityclusters. When, in fact, these clusters do not actually generalize to the rest of the dataparticularly well. However, there are a large number of other clustering algorithms availablethat may perform better. We leave these questions to future research.

5.2 DISCUSSIONWe have argued that the perceptron and kNN (here kNN-based models include the

clustered neighborhood models) produce models that are conceptually similar toprobability-based and neighborhood-based models, respectively. Indeed, on the face of it, theydo: the perceptron models are based on the number of times a sequence occurs in a givencontext (essentially probability), and kNN models are based on the number of sequences ithas encountered that are similar to a given sequence in some way (essentially aneighborhood). As a point of comparison, we also presented results for conceptually morestandard models: the n-phone probability models and the basic neighborhood model.However, based on the results presented in this and the previous chapter, we are unable toconclude that a probability based model is in any way superior to a neighborhood basedmodel, which was our original intent (following [1]). In our experiments, both types ofmodels performed essentially equally well. In both groups, the perceptron and two-class kNNgroup and the n-phone probability and clustered neighborhood group (we can include thebasic neighborhood model in this group as well), the neighborhood based models performedessentially as well as the probability based models. Granted, the best probability basedmodels (the all phone perceptron models in the previous chapter and the biphone perceptronmodels in this chapter) did produce slightly better results than the neighborhood based models(the all phone kNN models), these differences disappear when the perceptron results areaveraged over different versions of the perceptron training set, resulting in equal performancefor both types of models. This fact, in conjunction with the fact that the SVM producedmodels that performed nearly identically to the averaged perceptron results and the extendedall phone kNN results, suggests that scores in the range of r ≈ 0.78− 0.80 could be an upperlimit for this problem due to the somewhat inconsistent nature of phonotactic acceptability.

In addition to this, it is not entirely clear that probability and neighborhood effectshave been properly and effectively separated in any of our models. We could very well beseeing a mixed effect of both probability and neighborhood factors in each model, though theexact nature and extent of this effect are unclear. This is well known to be a very difficult

98

problem and may very well require more sophisticated techniques for test development thanthe ones presented here (cf [6]).

Despite these weaknesses, we have seen a number of very interesting results over thecourse of this work. The most notable of these is the large gap between the performance of theclassifier based models (the perceptron and kNN) and the more conceptually traditionalmodels (the n-phone probability, the basic neighborhood, and even the clusteredneighborhood models). The classifier based models performed substantially better than themore traditional models. This seems most likely due to the use of negative data in theclassifier based models. The use of negative data, particularly the use of knowledge-freenegative data (like the [experiment1] negative data, which assumes no knowledge of thephonotactic system in question) improves performance substantially, both over models thatmake use of no negative data and those that use other forms of negative data with some typeof knowledge built in (like the [experiment2] negative data). This is perhaps a surprisingresult, but makes a reasonably good case for the idea that humans make use of some form ofimplicit negative data when learning the phonological system of their native language (cf [7]).Of course, this is a highly contentious issue within the Linguistics community, but the ideathat some form of implicit negative data is involved in linguistic learning is not an entirelynew one. This idea is well attested in Optimality Theory (see for example the ConstraintDemotion Algorithm [46], or the Gradual Learning Algorithm [3, 4]. In both cases, negativedata are an integral part of the learning process. In our experiments, the use of negative data(particularly of the knowledge free variety) improves performance substantially over the moretraditional models which use only positive data.

In any case, classifiers, even the relatively simple ones used in this work, seemparticularly well suited for this type of problem. In combination with the correct form ofnegative data (ideally randomly generated, knowledge-free negative data) and the properfeature encoding, the two-class classifier-based models (the perceptron and kNN)outperformed all of the other model types that we tested. As far as we know, no otherresearchers have applied a truly discriminative classification approach to this problem, despitethe fact that it is clearly a fruitful approach. The closest are perhaps Hayes and Wilson [22],who applied a Maximum Entropy model (MaxEnt) to the problem of phonotactic learning.The main difference between their work and the work presented here is the feature encodingsthat were used. Hayes and Wilson [22] developed some reasonably complex, linguisticallymotivated feature encodings. Whereas, we limited our feature representations to features thatwere directly observable in the data. Such a limitation is not uncommon, and in fact is oftendesirable (cf [36]). So, although we were not able to determine whether probability orneighborhood based models are better suited to the problem of phonotactic learning, we

99

certainly showed that classifier-based models in general are well suited to this type of learningproblem.

Comparing our work to the work of other researchers that we have mentioned can bequite difficult due to the wide variety of approaches that have been utilized in the study of thisproblem. Our approach is perhaps most similar to the approach used by Coleman andPierrehumbert [9]. However, the results that they present are limited to p (significance) values.Of the four models that were discussed in their work, the best, based on the natural logarithmof the probability of a word, produced a p-value less than 0.001, which is slightly better thanour best performing models. Two of their models, one based on the probability of a word andthe other based on the probability of the worst part of a word, produced p-values less than0.01, which is comparable to many of our models. However, direct comparison with theresults of Coleman and Pierrehumbert [9] is confounded due to the fact that no actualcorrelation scores were reported. Several other questions surround these results as well.Namely, their test data and procedures. While the basic construction of our test data wasclosely based on the procedures described in [9], our test data was designed to include a widevariety of acceptability levels, while the Coleman and Pierrehumbert [9] data was morelimited in scope. It is likely that our test data was considerably more challenging than that ofColeman and Pierrehumbert [9], so despite the slightly lower scores, it is likely that ourmodels are in fact performing better overall. However, we are unable to say for sure withoutconstructing their CFG-based model and applying it to our test data. Another consideration isthe number of human participants in the study, which could certainly affect the resultingcorrelations and p-values. The Coleman and Pierrehumbert [9] correlation results were basedon a human sample of only six participants, roughly a third of the number used in our ownstudy. This small pool of human participants could very well have artificially inflated thecorrelation results presented in [9]. Without constructing their CFG-based model and testingit on our own data, it is difficult to say for sure.

The other important cases ([1] and [2]) are considerably more difficult to compare toour own work. Both approach the problem in terms of how much variance can be explainedby a particular type of model, an approach which we did not take. So, while direct comparisonis impossible, it is still possible to compare the over-arching themes of our results. Bothworks suggested that one particular type of model is superior to the other for this problem (beit probability in the case of Albright [1] or neighborhood in the case of Bailey and Hahn [2]).However, our results suggest that both perform equally well under the right conditions: theoptimal negative data type and feature encoding, and the choice of k in the case of kNN.While there are, admittedly, a number of weaknesses in our work, it is quite clear that, in thecase of a classifier based approach, probability based and neighborhood based models are

100

equally well suited to the learning problem. Again, further research is required to decide thisfor sure. It seems likely that the performance of both our probability based (perceptron) andneighborhood based (kNN) classifier models are approaching an upper bound for thisproblem, so we would suggest neither is superior to the other under their respective optimalconditions. The differences in performance between the two types of models could be anartifact of the test data or the overall quality of the models tested, and not in fact the result ofthe superiority of one type of model over the other. This, then, remains an open question.

5.3 CHAPTER SUMMARYThis chapter first presented a number of variations on the models presented in the

previous chapters. First, we varied the order of the presentation of the training examples to theperceptron, since the perceptron is well known to be sensitive to training data presentation.We found this to be the case for our data as well. We produced five random variations of thetraining data that we tested using the negative data and feature sets of the top-scoringperceptron models presented in the previous chapter. We found that varying the order ofpresentation of the training data had a large effect on the quality of the models produced.Some were much better than the originals, reaching scores as high as r = 0.8497

(p = 0.0004) (the [experiment4] positional biphone model with vowels), while others weremuch worse than the original (the [experiment4] non-positional triphone model withoutvowels, which dropped to r = 0.5306 (p = 0.0466). We averaged these results using theFischer Transformation, which gave us a much better approximation of the actual expectedvalue of these models, giving us a much better basis for comparison.

We then went on to discuss the effect of varying k in our kNN models. We varied kfrom 5 to 30 inclusive by intervals of five. Again, we made variations of the six top-scoringkNN models presented in the previous chapter. The biphone and triphone based modelsperformed more or less as expected, with the original k value of 7 being very close to optimal.Performance began to drop off when k was between 5 and 10. However, somewhatunexpectedly, performance began to increase slightly as k approached 30. Although thisincrease did not improve performance above the level achieved when k was between 5 and 10.More unexpectedly, the performance of the all phone models increased as k approached 30.This does make sense though, when we consider that the larger number of features used in theall phone representations. Since more features are used, there are thus more examples withfeatures in common, making the larger neighborhood quite reasonable. To see how far thistrend continued, we extended the range up to 70. However, performance began to decreasewhen k was larger than 30, suggesting that 30 is roughly the optimal choice for k in the allphone models.

101

Finally, we extended the clustered neighborhood model to include all of the positiveexamples used to train the other types of models. Since our clustering method is alreadymemory bound, we accomplished this by using the original clustering to classify theremaining examples, assuming that the original clustering is representative of the data as awhole. While this method did improve performance, there is considerable reason to believethat the original clustering is not fully representative of the data as a whole, and that greaterimprovements would be possible using alternative clustering methods and more sophisticatedmemory management techniques.

As a point of comparison, we constructed a more traditional neighborhood modelbased on edit distance: the basic neighborhood model. This model simply counts the numberof training examples that are one edit distance away from a given test example. This modelperforms surprisingly well, outperforming both our extended clustered neighborhood modeland the more traditional bigram probability models. However, the results of the basicneighborhood model are in the same range as the results of the extended clusteredneighborhood model (the difference between them is of course not significant). For thisreason, we consider the clustered neighborhood model in its present form to be a reasonableapproximation to the basic neighborhood model. However, it is clearly not a goodapproximation of the GNM of Bailey and Hahn [2]. We have already discussed a number ofmethods that could lead to substantial improvements in the performance of the clusteredneighborhood model.

The chapter concludes with a general discussion of the work as a whole, both itsstrengths and its weaknesses. While we were not able to determine conclusively whetherprobability or neighborhood based models are superior, we did find that two-class classifierbased models were quite successful on this problem. The two-class classifier modelssubstantially outperformed the other types of models. The type of negative data and thefeature encoding are critical to the success of the two-class classifier models. The[experiment1] negative data was consistently the most successful, typically producing thehighest scores for a given feature representation and model type. This fact that our modelsperform the best using the [experiment1] negative data is significant because the[experiment1] negative data was completely randomly generated, in other words, knowledgefree. The other types of negative data were generated using some form of knowledge aboutthe language to some extent. The [experiment2] negative data incorporates knowledge aboutthe distribution of consonants and vowels in English, while the [experiment3] dataincorporates a certain amount of knowledge about particular words (since the [experiment3]data was generated directly from actual words). This finding lends support to the idea, notunlike proposals found in Optimality Theory, that some form of internally generated negative

102

data is utilized in the human phonotactic learning process. On the other hand, there is apossibility that our test set is somehow biased towards models based on completely randomlygenerated negative data (the [experiment1] data). However, we feel that this is unlikely, sincethe test set was designed without reference to any particular type of negative training data, andthe negative test examples bear little resemblance to the randomly generated negative data.Although this question cannot be fully answered without applying these models to a variety ofdifferent test sets. We leave this experiment to future work. The feature set used also had anoticeable affect on the performance of the models. The best performing feature encodingused the biphone and all phone feature sets. Based on the results presented in this chapter, itseems that the biphone feature set, including both position and vowels, is perhaps superior.Although, the performance of the all phone model without position or vowels was not farbehind.

103

CHAPTER 6

SUMMARY

The final chapter summarizes the work as a whole. In addition, we present somesuggestions for future research based on the work presented here.

6.1 SUMMARYIn Chapter 1, we introduced the basic principles of phonotactic modeling, our central

concern in this work, as well as many of the key results in this area of research. Research intophonotactic modeling was largely inspired by the work of Coleman and Pierrehumbert [9],and the work presented here is no different. Coleman and Pierrehumbert [9] developed aseries of CFG-based phonotactic models of English. They compared the performance of thesemodels on a phonotactic acceptability task to the performance of human subjects on the sametask. The task was simply to rate a series of English nonce words (from 1 to 7) based on theirlevel of phonotactic acceptability (how well a nonce word conforms to English phonotactics).They found that their phonotactic models performed quite well when compared to theperformance of the human subjects (the best model producing a correlation that wassignificant to p ≤ 0.001). Coleman and Pierrehumbert [9] concluded that the high level ofperformance was due to both the phonotactic probability (the probability of a nonce wordcalculated by the combined probability of its phone sequences) and the neighborhood density(the number of real words that share similar phone sequences). However, they made no effortto determine which factor had more influence.

Subsequent studies made concentrated efforts to explore whether phonotacticprobability or neighborhood density was more significant to the result. Bailey and Hahn [2]compared the performance of the CFG-style models of Coleman and Pierrehumbert [9] to aneighborhood based model they dubbed the GNM (Generalized Neighborhood Model) andconducted a similar experiment. Based on their results, Bailey and Hahn [2] concluded thatthey GNM accounted for more of the variance in their results than the probabilistic modelsdid, and thus concluded that neighborhood based models (in particular their GNM) are betterfor modeling human phonotactic behavior. These models and results were later re-examinedby Albright [1], who concluded just the opposite, that probability based models performedbetter for phonotactic modeling. Albright [1] found that the neighborhood based models (likethe GNM) consistently had difficulty with perfectly acceptable nonce words that containedrare phone sequences, and so had small or non-existent neighborhoods. This results in these

104

words receiving the same scores as nonce words that contain overtly unacceptable phonotacticsequences, clearly an undesirable result. Albright [1] performed a similar experiment to [2]using a different data set and found, contrary to earlier findings, that the probability basedmodels were able to explain more of the variance in the results than the neighborhood basedmodels. He attributed this discrepancy to the data sets, however, he notes that more research isnecessary before more definite conclusions can be made.

Our work continues along this same vein, however using much different methods. Weaimed to compare the performance of probability and neighborhood based models usingclassifiers, a considerable departure from earlier studies. We used two types of classifiers: theperceptron, which we contend is a relative of a probability based model, since a perceptronmodel in this context is based on the number of times a phone sequence occurs in a particularenvironment; and kNN, which we contend is a corollary of a neighborhood model, since akNN model is based on the number of other words that are similar to a given word and howsimilar they are. To compare the performance of these models we constructed a phonotacticacceptability task similar to the one used by Coleman and Pierrehumbert [9] and based on thedata used by Bailey and Hahn [2]. We presented this task to a number of human subjects andcompared their performance to the performance of our models on the same task. Using thistechnique, we set out to determine whether probability based or neighborhood based modelsare better in the area of phonotactic modeling.

In the second chapter, we introduced classifiers and the principles underlyingclassification in more detail, in particular the vector space model. The vector space modelallows us to convert our data into a series of feature vectors that have some position in vectorspace. We are able to perform classification based on the simple idea that similar vectors willfall in similar regions of the vector space. Thus, we consider that vectors that fall in similarregions of vector space are members of the same class. And, we are thus able to trainclassifier models to determine the position of a new vector in the vector space, and therefore,its class. The application of this concept to phonotactic modeling is quite straightforward. Werequire two classes: one that represents acceptable words and one that represents unacceptablewords, which are represented by distinct regions in vector space. A new example will fall intoone of these regions, and we can thus determine which class it belongs to. We used two verydifferent types of classifiers, to determine which performs better, as we have mentioned.

The perceptron is an example of a type of classifier known as a linear classifierbecause it makes its decisions based on a linear combination of the feature vectors. This linearcombination of features takes into account the number of times a particular feature (in thiscontext, a phone sequence) occurs, essentially a probabilistic model. We also constructed anumber of models based on the kNN algorithm. kNN is an example of a type of classifier

105

known as a non-linear classifier, generally a much more powerful type of model than a linearmodel like the perceptron. kNN makes its classification decisions based on the similarity of anew example to other examples that it has already seen, assigning it to the class with thelargest number of the most similar examples (essentially a neighborhood model). This type ofclassification is much different than the perceptron, and allows kNN to deal with problemsthat the perceptron cannot handle since it is capable of finding complex non-linear decisionsurfaces. However, this increased power comes at a price: non-linear models often requiremore training data since their models are more complex and are well known to have highvariance, since they may pay too much attention to noise vectors (vectors that are notconsistent with the rest of the training data). Despite these well-known problems kNN hasbeen shown to perform very well on a number of problems.

In addition to the perceptron and kNN, we also briefly introduced our clusteredneighborhood model, a different type of neighborhood model that we constructed based onthe kNN algorithm in a multi-class form. We clustered the training data using a techniqueknown as agglomerative clustering to construct phonotactic neighborhoods. We trained amulti-class kNN model using this clustering and applied it to the same phonotacticacceptability task. We discussed this model in more detail in the third chapter.

In the third chapter, we introduced our methodology. We began the chapter bydiscussing the training data that we used. All of our training data was based on the CMUPronunciation Dictionary, which includes 127,008 examples encoded in the DARPAbetphonetic alphabet. Since classifiers require both positive and negative examples to produce amodel, we also experimented with different types of negative data generation. The negativedata is intended to simulate some form of internally generated negative data that might beutilized in the human phonotactic learning process. We generated four types of negative data:one completely randomly generated ([experiment1]), one based on the conditional probabilitydistribution of vowels and consonants found in the CMU Pronunciation Dictionary, one basedon randomly substituting a random segment for a segment in an actual word found in theCMU Pronunciation Dictionary, and one using an equal number of all three of these types ofnegative data.

We also introduced the feature encodings that we used for our feature vectors, again,we used four different feature sets. Each feature set is based on phone sequences: one featureset was based on simple uniphone sequences (uniphone), one on biphone sequences (biphone),one on triphone sequences (triphone), and one using a combination of all three (all phone). Inaddition to these basic features, we made a number of slight variations, namely varyingwhether or not vowels and positional information were included in the feature representation.This resulted in four different features sets for each type of basic feature (uniphone, biphone,

106

triphone, and all phone). We also discussed several forms of normalization that we employedto reduce the effect of vector length on the final classification decision.

We then introduced the clustered neighborhood model in more detail. We clustered arandom selection of the positive data from the CMU Pronunciation Dictionary usingagglomerative clustering and used the resulting clustering as training data for the clusteredneighborhood model. This produced around 4,000 clusters which we used as our phonotacticneighborhoods. We trained two different versions, one based on the biphone representationand the other based on the triphone representation. The clustered neighborhood models weretrained directly from these clusterings and then applied to our test set. As a point ofcomparison, we also constructed several more traditional probability models based onbiphone and triphone sequences.

We developed a test set based on the data of Bailey and Hahn [2]. We selected 25nonce words from the test set that they used and altered them according to the phonotacticconstraints outlined in [20]. This produced a test set of 50 examples: 25 of which arecomparatively more acceptable and 25 of which are less acceptable. The examples encompassthe full range of possible acceptability values, from completely acceptable to completelyunacceptable. We applied our models to this test set and compared their results to the resultsof a group of human subjects on the same test using the Spearman correlation (cf [9]).

In chapter 4, we reviewed our results. The perceptron and two-class kNN modelsperformed the best by far, producing correlations as high as r = 0.7853 (p = 0.0021) for theperceptron and r = 0.6913 (p = 0.0092) for kNN. The best perceptron models outperformedthe best kNN models by a slight margin, however this difference was not significant. Varyingthe properties of the models (the feature set, whether vowels or positional information wasincluded, and normalization) did have an impact on model performance. Both types of modelsperformed the best using the all phone feature set, indicating that incorporating some form ofbackoff into the models does improve performance. However, while the improvements usingthe all phone feature set were large in some cases, particularly for kNN, they were notsignificant. Including vowels and position had different effects depending on the type ofmodel. The perceptron models typically performed the best without vowels or position,particularly as the number of features increased (in the triphone and all phone models). ThekNN models, on the other hand, typically performed better when both vowels and positionwere included in the feature representation. This is unsurprising given the differences betweenhow each type of model makes its classification decisions. The perceptron abstracts awayfrom the individual examples, making its decision based on the location of a separatinghyperplane. An accurate representation of the separating hyperplane requires learning aweight for as many features in the model as possible. Thus reducing the overall number of

107

features makes it much more likely that the perceptron will learn reasonable weights for moreof the features, producing a better model. kNN, on the other hand, bases its classificationdecisions directly on other examples. So, including as much information as possible in thefeature representation (in this case including both position and vowels) gives a better basis forcomparison between two examples, making classification more accurate. Finally, both typesof classifiers were generally aided by some form of normalization. L1-normalizationgenerally proved most effective for the perceptron, although L2-normalization was also quiteeffective for the larger feature sets when vowels and position were not represented. Likewise,the best performing kNN models generally used one of the normalized similarity functions(cosine similarity, Dice coefficient, and Jaccard Coefficient). Although, the Euclideandistance models, which are not normalized, often proved more consistent if somewhat behind.

The more traditional models did not perform quite as well in our experiments. Thebest of these were the biphone probability models, which produced correlations ofr = 0.4419 (p = 0.0868) using average probability to combine individual biphoneprobabilities and r = 0.4842 (p = 0.0656) using joint probability. The clusteredneighborhood models performed the worst of all the models that we tested, only managing acorrelation of r = 0.3781 (p = 0.1258) in the best case (although this version of the clusteredneighborhood models suffers from a lack of initial training data). The best perceptron modelsare significantly better than both of these models, while the best kNN model was onlysignificantly better than the clustered neighborhood models. This difference in performancebetween the classifier models and the more traditional models seems largely due to theinclusion of negative training data. The classifiers were better able to model the data becausethey also included an accurate representation of phonotactically unacceptable data, producingan overall better picture of the phonotactic space. Many have argued that the use of negativedata is unrealistic since children are rarely exposed to explicit negative data during thelanguage learning process. However, we contend that the type of knowledge-free negativedata that proved the most effective in our experiments could potentially be generatedinternally during language learning as a point of comparison, thereby producing betterlinguistic generalizations overall.

In the fifth chapter, we addressed some of the weaknesses of our models. In particularthe reasonably high variance of the perceptron, the proper choice of k for kNN, and the lackof training data for the clustered neighborhood model. We also addressed the issue of whetherthe clustered neighborhood model was a reasonable approximation of a neighborhood basedmodel. We found that indeed the performance of the perceptron models can vary greatlydepending simply on the order of the presentation of the training data. But when averagedover a number of different versions, correlations approach the correlations produced by a

108

support vector machine, which is known to produce an optimal decision surface. Interestingly,when multiple versions were averaged, the biphone models proved to be the best scoringmodels, surpassing the mark produced by the all phone models in chapter 4. Although thedifferences were very minor and certainly not significant. This indicates essentially that thereis very little difference between the biphone and all phone models in terms of performance,possibly giving the edge to the biphone models since they use fewer features.

The effects of the choice of k in the kNN model were primarily limited to the all phonemodels. The value of k = 7 that was used in chapter 4 was quite reasonable for the biphoneand triphone models. The biphone models performed best when k was between 5 and 10,making 7 a very reasonable choice. The triphone model, on the other hand, performed betterwhen k was nearer to 5, suggesting that performance may have improved slightly if k wascloser to 5. However, improvement would be very minor in this case since k = 5 and k = 7 areso close. The all phone model performed better when k was much higher, around k = 30. Thismakes sense, because the number of features in the all phone model is so much larger than ineither the biphone and triphone representations and these features are represented in threedifferent ways (the all phone model included features in the uniphone, biphone, and triphonerepresentations). This makes the number of possible nearest neighbors much higher since it ismore likely that a new example will share features with a greater number of training examples.

The original incarnation of the clustered neighborhood model described throughoutthis work suffers from some obvious weaknesses, most notably a lack of training data, since itwas trained on only a random set of 5,000+ examples chosen from the complete training data.Adding all of the examples to the training set did improve performance, up to the same level asthe biphone probability models (r ≈ 0.47, p ≈ 0.07). However, the way that we incorporatedthe remaining examples is perhaps less than satisfactory, and we expect a version of theclustered neighborhood model based on a clustering of the full training data would probablyperform much better than the version presented in this chapter. However, such an approachwould require more sophisticated clustering and memory management techniques than theones presented here. We also constructed a more traditional neighborhood model based onedit distance as a point of comparison. Somewhat surprisingly, this basic neighborhood modeloutperformed both the clustered neighborhood models and the biphone probability models bya small margin. However, the difference was not significant, and we contend that the clusteredneighborhood model in its present form is a reasonable approximation of this basicneighborhood model. Although, it is likely that the clustered neighborhood model in itspresent form would perform considerably worse than the GNM of Bailey and Hahn [2].

While our classifier based models were quite successful, we were unable to detect anyparticular differences between our correlates of probability and neighborhood based models.

109

Both performed approximately equally well in our experiments. More sophisticated testdevelopment techniques may be required to divine any differences between the two, if anyreal differences indeed exist. And in fact, it may be that both probability and neighborhoodbased effects play a role in phonotactic acceptability. And these effects may not be easilyseparable. We leave these questions to future research. The significance levels of our resultsare reasonably comparable to those reported in the literature, in particular the results ofColeman and Pierrehumbert [9]. Although, in truth, the models presented in our work may infact perform better, since the human sample that we used was nearly three times larger thanthat used by Coleman and Pierrehumbert [9].

6.2 SUGGESTIONS FOR FUTURE WORKThere are a number of questions that remain unresolved in our work. Most notably, the

very question that we aimed to address. We were unable to show that either probability orneighborhood based models are superior. We believe that this remains an open question. Wehave mentioned that the most fruitful possibility to separate the two types of effects may bemore sophisticated test development procedures that account for and isolate both effects. It isa very real possibility that a more sophisticated test set may not reveal any difference, despitewhat has been mentioned in the literature. However, we leave this to future research.

Perhaps the most interesting questions that our work has raised surround the clusteredneighborhood model. Clearly, its current form leaves a lot to be desired. However, there are anumber of possible avenues of improvement. A more sophisticated clustering technique couldyield large improvements. First, using a clustering method that is able to cluster all of the dataat the same time would be a particular advantage, and would remove some of the unnecessaryassumptions that we have had to make. In particular, that the original clustering based on thesmaller training set was representative of the training data as a whole. Such a technique wouldalso eliminate the variance in the model caused by randomly selecting the smaller trainingsample. Ideally such a technique would also make no assumptions about the number ofclusters (neighborhoods) in the training data. In addition to this, using a different clusteringalgorithm may also prove effective. While single-link clustering produced quite reasonableclusters, there are a number of much more sophisticated algorithms that may prove moreeffective.

A direct comparison between our classifier models and the GNM of Bailey and Hahn[2] would also be highly desirable. It is not entirely clear how well our models compare to theGNM. A direct comparison would, of course, resolve this issue. However, while such acomparison would be very interesting, it would not lead to any particular insight into thequestion of whether a probability or neighborhood based model is superior. Although, itwould lead to a better understanding of exactly where our work fits in with the literature.

110

While our classifier models performed quite well, they are based on some ratherunsophisticated learning algorithms, the perceptron and kNN. There are a large number ofmore sophisticated and modern learning algorithms that could be applied. This may lead tosome improvements. However, it is possible, as we noted in the previous chapter, that ourmodels reached an upper limit of performance for this problem, since phonotacticacceptability is often an unintuitive and inconsistent problem. The average correlations for thehuman respondents were r = 0.74733 for the first test and r = 0.83325 for the second test. Ourbest models are not too far off of this mark, although the human-to-human correlations couldimprove with more rigorous and controlled test presentation. This remains to be seen.

111

BIBLIOGRAPHY

[1] A. Albright. Gradient phonological acceptability as a grammatical effect. Unpublishedpaper: under revision, 2007.

[2] T. M. Bailey and U. Hahn. Determinants of wordlikeness: Phonotactics or lexicalneighborhoods? J. Memory and Language, 44:568–591, 2001.

[3] P. Boersma. How we learn variation, optionality, and probability. In R. J. J. H. van Son,editor, Proceedings of the Institue of Phonetic Sciences, volume 21, pages 43–58.University of Amsterdam, Amsterdam, The Netherlands, 1997.

[4] P. Boersma and B. Hayes. Empirical tests of the gradual learning algorithm. Linguistic

Inquiry, 32:45–86, 2001.

[5] J. Bresnan, A. Cueni, T. Nikitina, and R. H. Baayen. Predicting the dative alternation. InG. Bourne, I. Kraemer, and J. Zwarts, editors, Cognitive Foundations of Interpretation,pages 69–94. Royal Netherlands Academy of Science, Amsterdam, Netherlands, 2007.

[6] J. Chandlee. The role of frequency in phonotactic acceptability judgments. CognitiveScience Graduate Student Conference. Unpulished student conference, 2012.

[7] A. Clark and S. Lappin. Computational learning theory and language acquisition. InR. Kempson, N. Asher, and T. Fernando, editors, Handbook of the Philosophy of

Science, volume 14: Philosophy of Linguistics, pages 445–475. Elsevier, Oxford, 2012.

[8] J. S. Coleman. The psychological reality of language-specific constraints. FourthPhonology Meeting. Unpublished conference, 1996.

[9] J. S. Coleman and J. Pierrehumbert. Stochastic phonological grammars andacceptability. In J.S. Coleman, editor, Third Meeting of the ACL Special Interest Group

in Computational Phonology, pages 49–56. Association for Computational Linguistics,Somerset, NJ, 1997.

[10] M. Collins. Discriminative training for hidden markov models: Theory and experimentswith perceptron algorithms. In J. Hajic and Y. Matsumoto, editors, Proceedings of the

ACL-02 Conference on Empirical Methods in Natural Language Processing, volume 10,pages 1–8. Association for Computational Linguistics, Philadelphia, PA, 2002.

[11] W. Daelemans and A. van den Bosch. Memory-based language processing. CambridgeUniversity Press, Cambridge, 2005.

112

[12] W. Daelemans and A. van den Bosch. Memory-based learning. In A. Clark, C. Fox, andS. Lappin, editors, Handbook of Computational Linguistics and Natural Language

Processing, pages 154–179. Wiley-Blackwell, Oxford, 2010.

[13] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. Timbl: Tilburgmemory based learner, version 6.3, reference guide. Technical Report 10-01, ILK andCLiPS Research Groups, Tilburg University, Tilburg, Netherlands, 2010.

[14] L. R. Dice. Measures of the amount of ecologic association between species. Ecology,26:297–302, 1945.

[15] S. A. Frisch, N. R. Large, and D. B. Pisoni. Perception of wordlikeness: Effects ofsegment probability and legnth on the processing of nonwords. J. Memory and

Language, 42:481–496, 2000.

[16] J. M. Gawron and K. Stephens. Semantic word similarity revisited. Unpublished paper,2011.

[17] J. M. Gawron, D. Gupta, K. Stephens, M. H. Tsou, B. Spitzberg, and L. An. Usinggroup membership markers for group identification in web texts. In J. Breslin,N. Ellison, J. Shanahan, and Z. Tufekci, editors, Proceedings of the 6th International

AAAI Conference on Weblogs and Social Media (ICWSM), pages 467–470. AAAI,Dublin, Ireland, 2012.

[18] S. Goldwater and M. Johnson. Learning OT constraint rankings using a maximumentropy model. In J. Spenader, A. Eriksson, and O. Dahl, editors, Proceedings of the

Workshop on Variation within Optimality Theory, pages 111–120. Stockholm University,Stockholm, Sweden, 2003.

[19] P. Gupta and D. S. Touretzky. Connectionist models and linguistic theory: Investigationsof stress systems in language. Congitive Science, 18:1–50, 1994.

[20] M. Hammond. Underlying representation in optimality theory. In I. Roca, editor,Derivation and Constraints in Phonology, pages 349–366. Clarendon Press, Oxford,1997.

[21] M. Hammond. Gradience, phonotactics, and the lexicon in English phonology.International J. English Studies, 4:1–24, 2004.

[22] B. Hayes and C. Wilson. A maximum entropy model of phonotactics and phonotacticlearning. Linguistic Inquiry, 39:379–440, 2008.

113

[23] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11:37–50,1912.

[24] G. Jager. Maximum entropy models and stochastic optimality theory. In A. Zaenen,J. Simpson, T. Holloway King, J. Grimshaw, J. Maling, and C. Manning, editors,Architectures, Rules, and Preferences. Variations on Themes by Joan W. Bresnan, pages467–479. CSLI Publications, Stanford, CA, 2007.

[25] G. Jager and A. Rosenbach. The winner takes it all - almost: Cumulativity ingrammatical variation. Linguistics, 44:937–971, 2006.

[26] T. Joachims. Text categorization with support vector machines: Learning with manyrelevant features. In C. Nedellec and C. Rouveirol, editors, Proceedings of the European

Conference on Machine Learning, pages 137–142. Springer-Verlag, Chemnitz,Germany, 1998.

[27] T. Joachims. Making large-scale svm learning practical. In B. Scholkopf, C. Burges, andA. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 41–56.MIT Press, Cambridge, MA, 1999.

[28] T. Joachims. Learning to classify text using support vector machines. Kluwer, Norwell,MA, 2002.

[29] T. Joachims. Svm light v. 6.02, 2011. http://svmlight.joachims.org/, accessed Oct. 2011.

[30] D. Jurafsky and J. H. Martin. Speech and language processing. Prentic Hall, UpperSaddle River, NJ, 2009.

[31] T. Kudo and Y. Matsumoto. Japanese dependency structure analysis based on supportvector machines. In H. Schutze and K. Y. Su, editors, Proceedings of the Joint SIGDAT

Conference on Empirical Methods in Natural Language Processing and Very Large

Corpora, pages 18–25. Association for Computation Linguistics, Hong Kong, 2000.

[32] R. D. Luce. Response times: Their role in inferring elementary mental organization.Oxford University Press, Oxford, 1986.

[33] C. Manning and H. Schutze. Foundations of statistical natural language processing.MIT Press, Cambridge, MA, 1999.

[34] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to information retrieval.Cambridge University Press, New York, NY, 2009.

114

[35] Mathworks. Inconsistency coefficient, 2012.http://www.mathworks.com/help/stats/inconsistent.html, accessed Mar. 2012.

[36] R. McDonald. Discriminitive learning and spanning tree algorithms for dependencyparsing. Phd dissertation, University of Pennsylvania, Philadelphia, PA, 2006.

[37] M. L. Minsky and S. A. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.

[38] J. Nivre. An efficient algorithm for projective parsing. In H. Bunt, J. Carroll, andG. Satta, editors, Proceedings of the 8th International Workshop on Parsing

Technologies (IWPT), pages 149–160. Association for Computational Linguistics,Nancy, France, 2003.

[39] A. Prince and P. Smolensky. Optimality theory: Constraint interaction in generative

grammar. Blackwell, Hoboken, NJ, 1993/2004.

[40] A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In E. Brill andK. Church, editors, Proceedings of the Conference on Empirical Methods in Natural

Language Processing, pages 133–142. Association for Computation Linguistics,Philadelphia, PA, 1996.

[41] F. Rosenblatt. Principles of neurodynamics: Perceptrons and the theory of brain

mechanisms. Spartan, Buffalo, NY, 1962.

[42] D. E. Rumelhart and J. L. McClelland, editors. Parallel distributed processing:

Explorations in the microstructure of cognition, volume 1: Foundations. MIT Press,Cambridge, MA, 1986.

[43] S. Shademan. Grammar and analogy in phonotactic well-formedness judgments. Phddissertation, UCLA, Los Angeles, CA, 2007.

[44] C. Shang and D. Barnes. Combining support vector machines and information gainranking for classification of mars McMurdo panorama images. In W. C. Sui, editor,Proceedings of the 2010 IEEE 17th International Conference on Image Processing,pages 1061–1064. IEEE, Hong Kong, 2010.

[45] N. Silver and W. Dunlap. Averaging correlation coefficients: should Fisher’s Ztransformation be used? J. Applied Psychology, 72:146–148, 1987.

[46] B. Tesar and P. Smolensky. The learnability of optimality theory: An algorithm andsome basic complexity results. Technical Report CU-CS-678-93, University ofColorado, Boulder, Boulder, CO, 1993.

115

[47] K. Toutanova and C. D. Manning. Enriching the knowledge sources used in a maximumentropy part-of-speech tagger. In H. Schutze and K. Y. Su, editors, Proceedings of the

Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and

Very Large Corpora, pages 63–70. Association for Computation Linguistics, HongKong, 2000.

[48] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech taggingwith a cyclic dependency network. In M. Hearst and M. Ostendorf, editors, Proceedings

of HLT-NAACL 2003, pages 252–259. Association for Computation Linguistics,Edmonton, Canada, 2003.

[49] M. S. Vitevitch and P. A. Luce. When words compete: Levels of processing in spokenword recognition. Psychological Science, 9:325–329, 1998.

[50] M. S. Vitevitch and P. A. Luce. Probabilistic phonotactics and neighborhood activationin spoken word recognition. J. Memory and Language, 40:374–408, 1999.

[51] M. S. Vitevitch and P. A. Luce. Increases in phonotactic probability facilitate spokennonword repetition. J. Memory and Language, 52:193–204, 2005.

116

APPENDIX

TEST DATA

117

TEST DATA

Table A.1. Test DataBailey and Hahn Data Derived DatablEsk lbEskbrOntS bpOndZbrIlk bnIlkdwESt dwrEStdZInT dZrInTkwEnT kwEnrTmIksTs miksTsplImf plIfmsfim sfsImSEndZ SEvdZskOlf skOlfbskep skekskræN skNæNslOntS lOtSnspEtS spnEtSstImpts stemptsstr

>ajb stm

>ajb

swIsp wsIspswOntS swontSTrINk rTINktrEltS tpEltSTrEltS TrelTtr@sp tl@spvIN NINzIntS zIptS

kellen_thesis

Documents

Transcript of kellen_thesis