Criticisms of dialectometry, esp. Levenshtein-based work

Measure is too insensitive: 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription: what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  The hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?

Let’s use these criticisms to inspect novel developments.

John Nerbonne j.nerbonne@rug

Sensitivity of measure

Binary segment distances too rough
Frequent concern in Groningen (Heeringa Diss., 2004)

  Segment distances based on phonetic features, phonological features, canonical spectrograms
  High correlation with rough measures when compared at aggregate (varietal) level
  But no substantial improvement in aggregate distance measures (validation wrt dialect speakers’ judgments)
  Compare height measurements in in., cm, mm, µm

Difficult problem in general, due inter alia to fine detail in atlases, e.g., 1,300 different vowels in LAMSAS
New procedure (Martijn Wieling) induces segment weights from data


Inducing segment distances

Sound correspondences were obtained by combining the Levenshtein algorithm with a Pointwise Mutual Information procedure (Wieling et al., 2009; included in RuG/L04)

Levenshtein algorithm:

l E I k @ n
l i k h 8 n

Operation costs: 1 1 1 1 (total 4)
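The distance for the pair above can be reproduced with a minimal sketch of the unit-cost Levenshtein algorithm. The function name `levenshtein` and the choice to treat each space-separated symbol as one segment are assumptions made for illustration; this is not the RuG/L04 implementation.

```python
def levenshtein(source, target):
    """Unit-cost edit distance between two sequences of segments."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j segments of `target`.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s == t else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# The two transcriptions from the slide, one symbol per segment.
a = "l E I k @ n".split()
b = "l i k h 8 n".split()
distance = levenshtein(a, b)
```

With binary (0/1) segment costs, `distance` is 4 for this pair, which agrees with the four unit costs shown on the slide.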

Segment distances based on Pointwise Mutual Information:

\[
\mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\, p(y)} \right)
\]
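As a rough illustration of the formula, the sketch below estimates PMI from a list of aligned segment pairs. The function name `pmi_scores`, the toy counts, and the use of separate marginal counts for the two sides of the alignment are all assumptions; the published procedure (Wieling et al., 2009) may estimate the probabilities differently and additionally turns the scores into segment distances.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """PMI for every segment pair seen in a list of aligned (x, y) pairs."""
    pair_counts = Counter(aligned_pairs)
    total = sum(pair_counts.values())
    x_counts, y_counts = Counter(), Counter()
    for (x, y), count in pair_counts.items():
        x_counts[x] += count
        y_counts[y] += count
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total
        p_x = x_counts[x] / total
        p_y = y_counts[y] / total
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical aligned pairs, invented purely for illustration.
pairs = (
    [("@", "@")] * 6 + [("@", "8")] * 2
    + [("n", "n")] * 7 + [("n", "8")] * 1
)
scores = pmi_scores(pairs)
```

Pairs that are aligned more often than their individual frequencies predict receive a positive score; in this toy data `("@", "8")` scores higher than `("n", "8")`.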


Evaluating segment weight induction

Evaluation with respect to alignment correctness—more sensitive than aggregate correlations with judgments

50% less error using alignments with induced weights
Competitive with sophisticated techniques from bioinformatics (pair Hidden Markov Models)
Future project: evaluate the segment weights against linguistic criteria, compare weights induced from different data sets

Wieling, Prokic, & Nerbonne “Evaluating the pairwise string alignment of pronunciations” LaTeCH-SHELT&R, 2009.
Wieling, Margaretha & Nerbonne “Inducing Phonetics from Dialect Variation” Submitted to Journal of Phonetics Jan., 2011.


Phonetic/Phonological conditioning

Example: Some Dutch varieties lose final-syllable schwas, but only when /n/ follows. The [@]:[∅] correspondence occurs, but not always.

Solution (partial): apply Levenshtein algorithm not to sequence of phonetic symbols but instead to BIGRAMS

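Segment-based alignment:

l o p @ n
l o p m
Operation costs: 1 1 (total 2)

Bigram-based alignment:

#l lo op p@ @n n#
#l lo op ∅ -m m#
Operation costs: 1 1 1 (total 3)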

Heeringa et al. (2006): bigram-based routine correlates more highly with speakers’ judgments.
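The bigram idea can be sketched as follows. The helper names `to_bigrams` and `edit_distance`, the `#` boundary symbol taken from the example, and the unit cost per differing bigram are assumptions for illustration; the bigrams here are built from the unaligned strings, and Heeringa et al. (2006) may weight bigram operations differently.

```python
def to_bigrams(segments, boundary="#"):
    """Turn a segment sequence into overlapping bigrams with word boundaries."""
    padded = [boundary] + list(segments) + [boundary]
    return [left + right for left, right in zip(padded, padded[1:])]


def edit_distance(source, target):
    """Unit-cost Levenshtein distance over arbitrary sequence items."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j - 1] + (0 if s == t else 1),  # substitution
                previous[j] + 1,                         # deletion
                current[j - 1] + 1,                      # insertion
            ))
        previous = current
    return previous[-1]


with_schwa = "l o p @ n".split()
without_schwa = "l o p m".split()

segment_distance = edit_distance(with_schwa, without_schwa)
bigram_distance = edit_distance(to_bigrams(with_schwa), to_bigrams(without_schwa))
```

Because each bigram carries its neighbouring segment, the loss of schwa before /n/ is counted as a different event from the loss of schwa in another context, which is how the representation brings in some phonological conditioning.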


Transcription vs. acoustics

Levenshtein distance indeed relies on sequences
Simply using frequently sampled spectrograms is too rough (e.g., male-female differences dominate)
Vowel databases are attractive for dialectological study; vowels are usually studied via formant tracking
Formant tracking is too unreliable (25% error) to apply automatically
How to maintain the dialectometric program, focusing on large samples of variation?


Leinonen’s work on Swedish

Data source: SveDia (Eriksson, 2004)
  107 sites, 12 speakers/site
  half men, half women
  half about 27 yr. old, half about 65
  19 vowels were the focus of analysis, 5 repetitions each

Challenge: automatically assess vowel quality
  Formant tracking too error prone for automatic application

Applied technique due to Pols, Jacobi
  Band-pass filtered vowel spectra, lowest filter dropped for women
  Separate PCAs for men and women, reducing sex differences

An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects: Prize (25,000 Sw. Kronor), 2010, Royal Gustav Adolf Academy for Swedish Culture


Leinonen’s older vs. younger speakers


Where’s the sociolinguistics?

Techniques are general, and may be applied to variation due to geography, social status, or native language (accents)
Sociolinguistics has created fewer large data repositories
But some work exists

  Leinonen’s work on age and sex
  Two papers on the effects of borders
  Ongoing work on foreign accents in English
    George Mason’s The speech accent archive
  Ongoing work combining geographic and social explanans


Linking dialect classification to linguistic basis

Dialectometry’s strength lies in aggregating over many linguistic variables, using sum (or average) of variables’ differences to characterize relations between varieties
Individual items no longer recognizable in sums!
Lots of attempts to identify characteristic variables

  Heeringa (2004) found items whose distances correlated with major dimensions in MDS
  Shackleton (2005, 2007): PCA
  Nerbonne (2006): factor analysis
  Prokic (2007) developed measures of correspondence strength

All post hoc analyses!
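The aggregation step named at the top of this slide, and the reason individual items disappear in it, can be shown with a small sketch. The function name `aggregate_distance` and the per-item difference values are hypothetical and chosen only to make the point.

```python
def aggregate_distance(item_distances):
    """Average the per-item differences between two varieties."""
    observed = [d for d in item_distances if d is not None]  # skip missing items
    if not observed:
        raise ValueError("no items were observed for this pair of varieties")
    return sum(observed) / len(observed)


# Hypothetical per-item differences for two pairs of varieties.
pair_one = [1, 0, 0, 1]   # differs on the first and last item
pair_two = [0, 1, 1, 0]   # differs on the two middle items
same_aggregate = aggregate_distance(pair_one) == aggregate_distance(pair_two)
```

Both pairs receive the same aggregate value although they differ on entirely different items, which is why the characteristic variables have to be recovered by a separate analysis.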


Co-clustering Varieties and Features

Important method in investigating dialectal variation: cluster similar (dialectal) varieties together
Goal: cluster varieties and linguistic features simultaneously
New research: Bipartite spectral clustering

Based on the spectrum of a graph


Generating a bipartite graph from alignments

A bipartite graph is a graph whose vertices can be divided into two disjoint sets such that every edge connects a vertex from one set to a vertex in the other set. Vertices within a set are not connected.

From the alignments, we extract the number of sound correspondences for each variety (compared to a reference site)

We generated a bipartite graph of varieties v and sound correspondences s

There is an edge between v_i and s_j iff freq(s_j in v_i) > 0
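The edge rule above can be written down directly. In this sketch the function name `bipartite_edges`, the nested-dictionary layout, and the frequency counts are hypothetical; only the rule "edge iff frequency > 0" comes from the slide.

```python
def bipartite_edges(frequencies):
    """Return (variety, correspondence) edges for every non-zero frequency."""
    return {
        (variety, correspondence)
        for variety, counts in frequencies.items()
        for correspondence, freq in counts.items()
        if freq > 0
    }


# Hypothetical counts of sound correspondences per variety.
frequencies = {
    "Appelscha": {"[a]/[i]": 4, "[r]/[x]": 2, "[k]/[x]": 0},
    "Kerkrade": {"[a]/[i]": 0, "[r]/[x]": 0, "[k]/[x]": 7},
}
edges = bipartite_edges(frequencies)
```

Every edge joins a variety to a sound correspondence and never two vertices of the same kind, so the result is bipartite by construction. Note that the rule discards the size of the count and keeps only whether the correspondence occurs.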


Example of a bipartite graph A

            [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha      1        1        1        0        0        0
Oudega         1        1        1        0        0        0
Zoutkamp       0        0        1        1        0        0
Kerkrade       0        0        0        1        1        1
Appelscha      0        0        0        1        1        1


Example of co-clustering a bipartite graph

Based on the adjacency matrix A:

            [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha      1        1        1        0        0        0
Oudega         1        1        1        0        0        0
Zoutkamp       0        0        1        1        0        0
Kerkrade       0        0        0        1        1        1
Appelscha      0        0        0        1        1        1

We can calculate the eigenvectors (of the Laplacian) of A:

λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T
λ3 = .53, x = [.12 .12 -.7 .12 .12 .25 .25 -.33 -.33 .25 .25]^T


To cluster in k = 2 groups, we use:

λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T

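We obtain the following co-clustering:

[Figure: the bipartite graph with each vertex labelled by its eigenvector entry. Varieties: -0.32, -0.32, 0, 0.32, 0.32. Sound correspondences: -0.34, -0.34, -0.23, 0.23, 0.34, 0.34.]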
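The second eigenvalue quoted above can be reproduced from the example matrix with a short sketch. It assumes the formulation of Dhillon (2001), in which the second singular vectors of the degree-normalised matrix give the same solution as the second generalised eigenvector of the Laplacian, with λ2 = 1 - σ2. Power iteration stands in for a library SVD routine, and splitting by sign stands in for a k-means step; the function name `second_singular_pair` and both simplifications are assumptions for illustration.

```python
import math

# Adjacency matrix from the example: rows are varieties,
# columns are sound correspondences.
A = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]


def second_singular_pair(A, iterations=200):
    """Return (sigma2, z_rows, z_cols) for the degree-normalised matrix."""
    m, n = len(A), len(A[0])
    d1 = [sum(row) for row in A]                             # variety degrees
    d2 = [sum(A[i][j] for i in range(m)) for j in range(n)]  # feature degrees
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n)]
          for i in range(m)]
    total = sum(d1)
    # Trivial top right-singular vector (sigma = 1); it is projected out.
    v1 = [math.sqrt(d / total) for d in d2]

    def deflate(v):
        dot = sum(a * b for a, b in zip(v, v1))
        return [a - dot * b for a, b in zip(v, v1)]

    def normalise(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    v = normalise(deflate([float(j + 1) for j in range(n)]))
    for _ in range(iterations):
        u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)]
        v = normalise(deflate(w))
    u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma2 = math.sqrt(sum(a * a for a in u))
    u = [a / sigma2 for a in u]
    # Rescale by the degrees to obtain the generalised eigenvector.
    z_rows = [u[i] / math.sqrt(d1[i]) for i in range(m)]
    z_cols = [v[j] / math.sqrt(d2[j]) for j in range(n)]
    return sigma2, z_rows, z_cols


sigma2, z_rows, z_cols = second_singular_pair(A)
lambda2 = 1 - sigma2
# Sign split for k = 2; the third variety sits at about zero, so its side is arbitrary.
variety_side = [z > 0 for z in z_rows]
feature_side = [z > 0 for z in z_cols]
```

The vector is determined only up to sign and scale, so the entries will not match the slide digit for digit, but their pattern does: the first two varieties fall on one side together with the first three correspondences, the last two varieties fall on the other side with the last three, and the third variety lies between them.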


Results: {2,3,4} clusters of varieties


Results: {2,3,4} clusters of sound correspondences

Some sound correspondences specific for the Frisian area

Reference  [2]  [2]  [a]  [o]  [u]  [x]  [x]  [r]
Frisian    [I]  [i]  [i]  [E]  [E]  [j]  [z]  [x]

Some sound correspondences specific for the Limburg area

Reference  [r]  [r]  [k]  [n]  [n]  [w]
Limburg    [ö]  [K]  [x]  [ö]  [K]  [f]

Some sound correspondences specific for the Low Saxon area

Reference  [@]  [@]  [@]  [-]  [a]
Low Saxon  [m]  [N]  [ð]  [P]  [e]


Discussion

Bipartite spectral graph partitioning detects the linguistic basis for the dialectal grouping when used to simultaneously cluster varieties and sound correspondences

Further experimentation:
  Use of standard and protolanguage correspondences as reference varieties
  Measures proposed to identify the importance of sound correspondences (published)
  Various frequency thresholds

M. Wieling & J. Nerbonne (2011) “Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features” Computer Speech and Language 25, 700-715.


Gabon Bantu

[Figure: Bantu]

Area: Bantu
Data: 53 sites, 160 words
Source: Van der Veen, Lyon
Note: Late settlement


Bulgaria

[Figure: Bulgaria]

Area: Bulgaria
Data: 482 sites, 54 words
Source: Stoykov’s atlas
Note: Long Turkish domination


Germany

[Figure: Germany]

Area: Germany
Data: 186 sites, 201 words
Source: Kleiner Deutscher Lautatlas (Goschel)


LAMSAS / Lowman

[Figure: LAMSAS / Lowman]

Area: Eastern Seaboard, US
Data: 357 sites, 145 words
Source: Mid & South Atlantic, LAMSAS (Lowman)
Note: Settlement in last 400 yr.


The Netherlands

[Figure: The Netherlands]

Area: The Netherlands
Data: 424 sites, 562 words
Source: Goeman-Taeldeman-van Reenen Atlas


Norway

[Figure: Norway]

Area: Norway
Data: North Wind & Sun, 15 sites, 58 words
Source: www.ling.hf.ntnu.no/nos


References

Inderjit Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. ACM New York.

Ton Goeman and Johan Taeldeman. 1996. Fonologie en morfologie van de Nederlandse dialecten. Een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48:38–59.

Vladimir Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 164:845–848.

Wilbert Heeringa. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. Ph.D. thesis, Rijksuniversiteit Groningen.

Martijn Wieling and John Nerbonne. 2009. Bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology. In: Monojit Choudhury (ed.) Proceedings of the TextGraphs-4 Workshop at the 47th Meeting of the Association for Computational Linguistics, August 2009, Singapore. Available via http://www.martijnwieling.nl.

Martijn Wieling, Jelena Prokic, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In: Lars Borin and Piroska Lendvai (eds.) Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R 2009) Workshop at the 12th Meeting of the European Chapter of the Association for Computational Linguistics. Athens, 30 March 2009, pp. 26-34. Available via http://www.martijnwieling.nl.

