Criticisms of dialectometry, esp. Levenshtein-based...


Page 1: Criticisms of dialectometry, esp. Levenshtein-based worknerbonne/outgoing/talks/discussion-2011.pdfBinary codes capable of correcting deletions, insertions and reversals. Doklady Adademii

Criticisms of dialectometry, esp. Levenshtein-based work

Measure is too insensitive, 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription: what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  the hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?

Let’s use these criticisms to inspect novel developments.

John Nerbonne j.nerbonne@rug 1/25


Sensitivity of measure

Binary segment distances too rough
Frequent concern in Groningen (Heeringa Diss., 2004)

Segment distances based on phonetic features, phonological features, canonical spectrograms
High correlation with rough measures when compared at aggregate (varietal) level
But no substantial improvement in aggregate distance measures (validation wrt dialect speakers’ judgments)
compare height measurements in in., cm, mm, µm

Difficult problem in general, due inter alia to fine detail in atlases, e.g., 1,300 different vowels in LAMSAS
New procedure (Martijn Wieling) induces segment weights from data


Inducing segment distances

Sound correspondences were obtained using the Levenshtein algorithm with a Pointwise Mutual Information procedure (Wieling et al., 2009; included in RuG/L04)

Levenshtein algorithm:

l  E  I  k  -  @  n
l  i  -  k  h  8  n
   1  1     1  1
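A minimal sketch of the unit-cost Levenshtein distance that underlies the alignment above, assuming each transcription is a list of segment symbols and every insertion, deletion, or substitution costs 1 (the binary 0/1 segment differences the critics object to). The function name `levenshtein` is chosen for this illustration only; it is not the RuG/L04 implementation.

```python
def levenshtein(source, target):
    """Unit-cost edit distance between two sequences of segments."""
    # previous[j] is the distance between the source prefix processed
    # so far and the first j segments of target.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # deleting the first i source segments
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s == t else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# The two transcriptions from the slide, one symbol per segment.
first = "l E I k @ n".split()
second = "l i k h 8 n".split()
distance = levenshtein(first, second)
```

For the two transcriptions on the slide the distance is 4, which corresponds to the four unit costs shown under the alignment.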

Segment distances based on Pointwise Mutual Information:

$$\mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\, p(y)} \right)$$
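The PMI formula can be sketched as follows. This is an illustration under stated assumptions: the probabilities are estimated as relative frequencies from a single list of aligned segment pairs, the name `pmi_table` is hypothetical, and the example pairs are made up rather than taken from any atlas. The procedure in RuG/L04 may estimate the probabilities differently, and the step that turns PMI values into segment distances is not shown.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """Return PMI(x, y) for every segment pair seen in the alignments."""
    pair_counts = Counter(aligned_pairs)
    left_counts = Counter(x for x, _ in aligned_pairs)
    right_counts = Counter(y for _, y in aligned_pairs)
    total = len(aligned_pairs)
    table = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total            # p(x, y)
        p_x = left_counts[x] / total    # p(x)
        p_y = right_counts[y] / total   # p(y)
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Hypothetical aligned segment pairs (assumption, for illustration only).
pairs = [("a", "a")] * 3 + [("a", "e")] + [("e", "e")] * 4
table = pmi_table(pairs)
```

Pairs that co-occur more often than chance predicts get a positive PMI, and pairs that co-occur less often get a negative one.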


Evaluating segment weight induction

Evaluation with respect to alignment correctness, which is more sensitive than aggregate correlations with judgments

50% less error using alignments with induced weights
Competitive with sophisticated techniques from bioinformatics (pair Hidden Markov Models)
Future project: evaluate the segment weights against linguistic criteria, compare weights induced from different data sets

Wieling, Prokic, & Nerbonne “Evaluating the pairwise string alignment of pronunciations” LaTeCH-SHELT&R, 2009.
Wieling, Margaretha & Nerbonne “Inducing Phonetics from Dialect Variation” Submitted to Journal of Phonetics Jan., 2011.


Phonetic/Phonological conditioning

Example: Some Dutch varieties lose final-syllable schwas, but only when /n/ follows. The [@]:[∅] correspondence occurs, but not always.

Solution (partial): apply the Levenshtein algorithm not to the sequence of phonetic symbols but instead to BIGRAMS

l  o  p  @  n
l  o  p  m̩
         1  1

#l  lo  op  p@  @n  n#
#l  lo  op  ∅   pm̩  m̩#
            1   1   1

Heeringa et al. (2006): bigram-based routine correlates more highly with speakers’ judgments.
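The bigram variant can be sketched by padding each transcription with a word-boundary marker and running the same unit-cost edit distance over the resulting bigrams. This is a sketch, not the routine of Heeringa et al. (2006): the helper names are hypothetical, `#` is assumed as the boundary symbol, and the syllabic diacritic on the final nasal is omitted.

```python
def levenshtein(source, target):
    """Unit-cost edit distance between two sequences of items."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s == t else 1)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def bigrams(segments, boundary="#"):
    """Overlapping segment bigrams, with a boundary marker at both ends."""
    padded = [boundary] + list(segments) + [boundary]
    return [a + b for a, b in zip(padded, padded[1:])]


standard = "l o p @ n".split()
dialect = "l o p m".split()

unigram_distance = levenshtein(standard, dialect)
bigram_distance = levenshtein(bigrams(standard), bigrams(dialect))
```

On the slide's example the unigram distance is 2 and the bigram distance is 3. Because each bigram carries its neighbouring segment, the loss of schwa is recorded together with the following nasal rather than as a context-free [@]:[∅] correspondence.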


Transcription vs. acoustics

Levenshtein distance indeed relies on sequences of transcription symbols
Simply using frequently sampled spectrograms is too rough (e.g., male-female differences dominate)
Vowel databases are attractive for dialectological study and are usually studied via formant tracking
Formant tracking is too unreliable (25% error) to apply automatically
How to maintain the dialectometric program, focusing on large samples of variation?


Leinonen’s work on Swedish

Data source: SweDia (Eriksson, 2004)
107 sites, 12 speakers/site
half men, half women
half about 27 yr. old, half about 65
19 vowels were the focus of analysis, 5 repetitions each

Challenge: automatically assess vowel quality
Formant tracking too error prone for automatic application

Applied technique due to Pols, Jacobi
Band-pass filtered vowel spectra, lowest filter dropped for women
Separate PCAs for men and women, reducing sex differences

An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects: prize (25,000 Swedish kronor), 2010, Royal Gustav Adolf Academy for Swedish Culture


Leinonen’s older vs. younger speakers


Where’s the sociolinguistics?

Techniques are general, and may be applied to variation due to geography, social status, or native language (accents)
Sociolinguistics has created fewer large data repositories
But some work exists

Leinonen’s work on age and sex
Two papers on the effects of borders
Ongoing work on foreign accents in English
  George Mason’s The speech accent archive
Ongoing work combining geographic and social explanans


Linking dialect classification to linguistic basis

Dialectometry’s strength lies in aggregating over many linguistic variables, using the sum (or average) of the variables’ differences to characterize relations between varieties
Individual items are no longer recognizable in the sums!
Lots of attempts to identify characteristic variables

Heeringa (2004) found items whose distances correlated with major dimensions in MDS
Shackleton (2005, 2007): PCA
Nerbonne (2006): factor analysis
Prokic (2007) developed measures of correspondence strength

All post hoc analyses!


Co-clustering Varieties and Features

Important method in investigating dialectal variation: cluster similar (dialectal) varieties together
Goal: cluster varieties and linguistic features simultaneously
New research: bipartite spectral clustering
  Based on the spectrum of a graph


Generating a bipartite graph from alignments

A bipartite graph is a graph whose vertices can be divided into two disjoint sets such that every edge connects a vertex from one set to a vertex in the other set. Vertices within a set are not connected.

From the alignments, we extract the frequency of each sound correspondence for each variety (compared to a reference site)

We generate a bipartite graph of varieties v and sound correspondences s

There is an edge between v_i and s_j iff freq(s_j in v_i) > 0
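The edge rule can be sketched as a small function that turns correspondence frequencies into a binary adjacency matrix, with one row per variety and one column per sound correspondence. The function name and the frequency counts below are hypothetical and serve only to illustrate the rule freq > 0.

```python
def bipartite_adjacency(frequencies, varieties, correspondences):
    """Binary matrix: entry [i][j] is 1 iff correspondence j occurs in variety i."""
    return [
        [1 if frequencies.get((v, s), 0) > 0 else 0 for s in correspondences]
        for v in varieties
    ]


varieties = ["Appelscha", "Zoutkamp", "Kerkrade"]
correspondences = ["[a]/[i]", "[r]/[x]", "[k]/[x]"]

# Hypothetical frequency counts (assumption, not atlas data).
frequencies = {
    ("Appelscha", "[a]/[i]"): 7,
    ("Appelscha", "[r]/[x]"): 2,
    ("Zoutkamp", "[r]/[x]"): 4,
    ("Zoutkamp", "[k]/[x]"): 1,
    ("Kerkrade", "[k]/[x]"): 5,
}

A = bipartite_adjacency(frequencies, varieties, correspondences)
```

Only the presence of a correspondence matters here, so a count of 7 and a count of 1 both yield an edge.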


Example of a bipartite graph A

            [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha      1        1        1        0        0        0
Oudega         1        1        1        0        0        0
Zoutkamp       0        0        1        1        0        0
Kerkrade       0        0        0        1        1        1
Appelscha      0        0        0        1        1        1


Example of co-clustering a bipartite graph

Based on the adjacency matrix A:

            [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha      1        1        1        0        0        0
Oudega         1        1        1        0        0        0
Zoutkamp       0        0        1        1        0        0
Kerkrade       0        0        0        1        1        1
Appelscha      0        0        0        1        1        1

We can calculate the eigenvectors (of the Laplacian) of A:
λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T
λ3 = .53, x = [.12 .12 -.7 .12 .12 .25 .25 -.33 -.33 .25 .25]^T


Example of co-clustering a bipartite graph

To cluster in k = 2 groups, we use:
λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T

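We obtain the following co-clustering:

[Figure: the bipartite graph with each vertex labelled by its entry in the eigenvector. Varieties: -0.32, -0.32, 0, 0.32, 0.32. Sound correspondences: -0.34, -0.34, -0.23, 0.23, 0.34, 0.34.]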
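The second eigenvector used for the k = 2 split can be reproduced with a short sketch. It assumes the generalised eigenproblem L x = λ D x (L the graph Laplacian, D the degree matrix), computed in the style of Dhillon (2001) from the degree-normalised matrix D1^(-1/2) A D2^(-1/2). Power iteration with deflation is swapped in here for a library eigen-solver to keep the example dependency-free; it is not necessarily how the authors computed their values, and the function name is hypothetical.

```python
import math

# Adjacency matrix from the slide: rows are the five varieties,
# columns the six sound correspondences.
A = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]


def second_eigenpair(A, iterations=200):
    """Second-smallest eigenpair of L x = lambda D x for a bipartite graph."""
    rows, cols = len(A), len(A[0])
    d1 = [sum(row) for row in A]                                    # variety degrees
    d2 = [sum(A[i][j] for i in range(rows)) for j in range(cols)]   # correspondence degrees
    # Degree-normalised matrix An = D1^(-1/2) A D2^(-1/2).
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(cols)]
          for i in range(rows)]
    # Trivial left singular vector (singular value 1), removed by deflation.
    norm = math.sqrt(sum(d1))
    u1 = [math.sqrt(d) / norm for d in d1]
    u = [float(i + 1) for i in range(rows)]                         # arbitrary start vector
    for _ in range(iterations):
        dot = sum(a * b for a, b in zip(u, u1))
        u = [a - dot * b for a, b in zip(u, u1)]                    # project out u1
        v = [sum(An[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        u = [sum(An[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        length = math.sqrt(sum(a * a for a in u))
        u = [a / length for a in u]
    v = [sum(An[i][j] * u[i] for i in range(rows)) for j in range(cols)]
    sigma = math.sqrt(sum(b * b for b in v))                        # second singular value
    v = [b / sigma for b in v]
    # Map back to the generalised eigenvector and scale it to unit length.
    z = [u[i] / math.sqrt(d1[i]) for i in range(rows)]
    z += [v[j] / math.sqrt(d2[j]) for j in range(cols)]
    length = math.sqrt(sum(a * a for a in z))
    return 1.0 - sigma, [a / length for a in z]


lambda2, x = second_eigenpair(A)
# For k = 2, split vertices by sign; entries near zero are left unassigned (0).
clusters = [0 if abs(a) < 1e-6 else (1 if a > 0 else -1) for a in x]
```

For this matrix the sketch gives λ2 ≈ .057 and entries that round to the values on the slide (the overall sign of an eigenvector is arbitrary). Zoutkamp's entry is zero, so the sign split alone does not assign it to either group.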


Results: {2,3,4} clusters of varieties


Results: {2,3,4} clusters of sound correspondences

Some sound correspondences specific for the Frisian area

Reference  [2]  [2]  [a]  [o]  [u]  [x]  [x]  [r]
Frisian    [I]  [i]  [i]  [E]  [E]  [j]  [z]  [x]

Some sound correspondences specific for the Limburg area

Reference  [r]  [r]  [k]  [n]  [n]  [w]
Limburg    [ö]  [K]  [x]  [ö]  [K]  [f]

Some sound correspondences specific for the Low Saxon area

Reference  [@]  [@]  [@]  [-]  [a]
Low Saxon  [m]  [N]  [ð]  [P]  [e]


Discussion

Bipartite spectral graph partitioning detects the linguistic basis for the dialectal grouping when used to simultaneously cluster varieties and sound correspondences

Further experimentation:
Use of standard and protolanguage correspondences as reference varieties
Measures proposed to identify the importance of sound correspondences (published)
Various frequency thresholds

M. Wieling & J. Nerbonne (2011) “Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features” Computer Speech and Language 25, 700-715.


Gabon Bantu

[Figure: plot titled “Bantu”]

Area: Bantu
Data: 53 sites, 160 words
Source: Van der Veen, Lyon
Note: Late settlement


Bulgaria

[Figure: plot titled “Bulgaria”]

Area: Bulgaria
Data: 482 sites, 54 words
Source: Stoykov’s atlas
Note: Long Turkish domination


Germany

[Figure: plot titled “Germany”]

Area: Germany
Data: 186 sites, 201 words
Source: Kleiner Deutscher Lautatlas (Göschel)


LAMSAS / Lowman

[Figure: plot titled “LAMSAS / Lowman”]

Area: Eastern Seaboard, US
Data: 357 sites, 145 words
Source: Mid & South Atlantic, LAMSAS (Lowman)
Note: Settlement in last 400 yr.


The Netherlands

[Figure: plot titled “The Netherlands”]

Area: The Netherlands
Data: 424 sites, 562 words
Source: Goeman-Taeldeman-van Reenen Atlas


Norway

[Figure: plot titled “Norway”]

Area: Norway
Data: North Wind & Sun, 15 sites, 58 words
Source: www.ling.hf.ntnu.no/nos


References

Inderjit Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. ACM New York.

Ton Goeman and Johan Taeldeman. 1996. Fonologie en morfologie van de Nederlandse dialecten. Een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48:38–59.

Wilbert Heeringa. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. Ph.D. thesis, Rijksuniversiteit Groningen.

Vladimir Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 164:845–848.

Martijn Wieling and John Nerbonne. 2009. Bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology. In: Monojit Choudhury (ed.) Proceedings of the TextGraphs-4 Workshop at the 47th Meeting of the Association for Computational Linguistics, August 2009, Singapore. Available via http://www.martijnwieling.nl.

Martijn Wieling, Jelena Prokic, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In: Lars Borin and Piroska Lendvai (eds.) Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R 2009) Workshop at the 12th Meeting of the European Chapter of the Association for Computational Linguistics. Athens, 30 March 2009, pp. 26-34. Available via http://www.martijnwieling.nl.
