Post on 27-Nov-2020
Criticisms of dialectometry, esp. Levenshtein-based work
Measure is too insensitive, 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription; what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  the hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?
Let’s use these criticisms to inspect novel developments.
John Nerbonne j.nerbonne@rug 1/25
Sensitivity of measure
Binary segment distances too rough
Frequent concern in Groningen (Heeringa Diss., 2004)
  Segment distances based on phonetic features, phonological features, canonical spectrograms
  High correlation with rough measures when compared at aggregate (varietal) level
  But no substantial improvement in aggregate distance measures (validation wrt dialect speakers’ judgments)
  (compare height measurements in in., cm, mm, µm)
Difficult problem in general, due inter alia to fine detail in atlases, e.g., 1,300 different vowels in LAMSAS
New procedure (Martijn Wieling) induces segment weights from data
Inducing segment distances
Sound correspondences were obtained using the Levenshtein algorithm with a Pointwise Mutual Information procedure (Wieling et al., 2009; included in RuG/L04)
Levenshtein algorithm:
l  E  I  k  -  @  n
l  i  -  k  h  8  n
   1  1     1  1
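The alignment above can be reproduced with a plain Levenshtein routine. This is a minimal sketch with unit costs; the parameter names `sub_cost` and `indel_cost` are illustrative assumptions, not part of RuG/L04, and no vowel/consonant constraint is applied.

```python
def levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance between two segment sequences.

    sub_cost is a hypothetical hook: a function (x, y) -> cost.
    By default identical segments cost 0 and different segments cost 1.
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,          # delete x
                curr[j - 1] + indel_cost,      # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute or match
            ))
        prev = curr
    return prev[-1]


# The two transcriptions from the slide, one character per segment.
distance = levenshtein(list("lEIk@n"), list("likh8n"))
print(distance)
```

Passing a different `sub_cost` function is one way graded or induced segment distances could replace the binary 0/1 costs.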
Segment distances based on Pointwise Mutual Information:
\[
\mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\, p(y)} \right)
\]
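The PMI formula can be sketched directly from counts of aligned segment pairs. This is an illustration under stated assumptions: probabilities are estimated as relative frequencies, pairs are treated as ordered, and the function name `pmi_table` and the toy input are hypothetical. The full iterative procedure of Wieling et al. (2009) is not reproduced here.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """Return PMI(x, y) for every segment pair seen in aligned_pairs.

    aligned_pairs is a list of (x, y) tuples taken from alignments.
    p(x, y) is the relative frequency of the pair among all pairs;
    p(x) and p(y) are relative frequencies among all aligned segments.
    """
    pair_counts = Counter(aligned_pairs)
    seg_counts = Counter()
    for x, y in aligned_pairs:
        seg_counts[x] += 1
        seg_counts[y] += 1
    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())
    table = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = seg_counts[x] / n_segs
        p_y = seg_counts[y] / n_segs
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Toy input (not real dialect data): mostly identical pairs, a few [a]:[e].
toy_pairs = [("a", "a")] * 6 + [("e", "e")] * 6 + [("a", "e")] * 2
for pair, value in sorted(pmi_table(toy_pairs).items()):
    print(pair, round(value, 3))
```

Pairs that co-occur more often than chance get a positive PMI; rare correspondences get a negative one, which is what makes PMI usable as a basis for segment distances.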
Evaluating segment weight induction
Evaluation with respect to alignment correctness—more sensitive than aggregate correlations with judgments
50% less error using alignments with induced weights
Competitive with sophisticated bio-informatic techniques (pair Hidden Markov Models)
Future project: evaluate the segment weights against linguistic criteria, compare weights induced from different data sets
Wieling, Prokic, & Nerbonne “Evaluating the pairwise string alignment of pronunciations” LaTeCH-SHELT&R, 2009.
Wieling, Margaretha & Nerbonne “Inducing Phonetics from Dialect Variation” Submitted to Journal of Phonetics Jan., 2011.
Phonetic/Phonological conditioning
Example: Some Dutch varieties lose final-syllable schwas, but only when /n/ follows. The [@]:[∅] correspondence occurs, but not always.
Solution (partial): apply Levenshtein algorithm not to sequence of phonetic symbols but instead to BIGRAMS
l  o  p  @  n
l  o  p  ∅  m
         1  1

#l  lo  op  p@  @n  n#
#l  lo  op  p∅  ∅m  m#
            1   1   1
Heeringa et al. (2006): bigram-based routine correlates more highly with speakers’ judgments.
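The bigram idea can be sketched as follows. Two assumptions are made here: bigrams are formed from each transcription padded with the boundary symbol `#`, and bigrams are compared with unit costs. The helper names are hypothetical, and Heeringa et al. (2006) may weight bigram differences differently.

```python
def bigrams(segments, boundary="#"):
    """Turn a segment sequence into overlapping bigrams with word boundaries."""
    padded = [boundary] + list(segments) + [boundary]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]


def edit_distance(a, b):
    """Unit-cost Levenshtein distance over arbitrary token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute or match
            ))
        prev = curr
    return prev[-1]


unigram_cost = edit_distance(list("lop@n"), list("lopm"))
bigram_cost = edit_distance(bigrams("lop@n"), bigrams("lopm"))
print(bigrams("lop@n"), bigrams("lopm"))
print(unigram_cost, bigram_cost)
```

Because each bigram carries its neighbour, the schwa loss is only counted as identical across word pairs when the following segment context is also the same, which is the conditioning the slide is after.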
Transcription vs. acoustics
Levenshtein distance indeed relies on sequences
Simply using frequently sampled spectrograms too rough (e.g., male-female differences dominate)
Vowel databases are attractive for dialectological study, usually studied via formant tracking
Formant tracking too unreliable (25% error) to apply automatically
How to maintain dialectometric program, focusing on large samples of variation?
Leinonen’s work on Swedish
Data source: SveDia (Eriksson, 2004)
  107 sites, 12 speakers/site
  half men, half women
  half about 27 yr. old, half about 65
  19 vowels were focus of analysis, 5 repetitions each
Challenge: automatically assess vowel quality
  Formant tracking too error prone for automatic application
Applied technique due to Pols, Jacobi
  Band-pass filtered vowel spectra, lowest filter dropped for women
  Separate PCAs for men and women, reducing sex differences
An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects
  Prize (25,000 Sw. Kronor), 2010, Royal Gustav Adolf Academy for Swedish Culture
Leinonen’s older vs. younger speakers
Where’s the sociolinguistics?
Techniques are general, and may be applied to variation due to geography, social status, or native language (accents)
Sociolinguistics has created fewer large data repositories
But some work exists
  Leinonen’s work on age and sex
  Two papers on the effects of borders
  Ongoing work on foreign accents in English
    George Mason’s The speech accent archive
  Ongoing work combining geographic and social explanans
Linking dialect classification to linguistic basis
Dialectometry’s strength lies in aggregating over many linguistic variables, using sum (or average) of variables’ differences to characterize relations between varieties
Individual items no longer recognizable in sums!
Lots of attempts to identify characteristic variables
  Heeringa (2004) found items whose distances correlated with major dimensions in MDS
  Shackleton (2005, 2007): PCA
  Nerbonne (2006): factor analysis
  Prokic (2007) developed measures of correspondence strength
All post hoc analyses!
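A toy sketch of why individual items disappear in the aggregate: the function below averages per-item distances for a pair of sites, and two quite different item profiles give the same aggregate. The function name and the numbers are made up for illustration and are not taken from any atlas.

```python
def aggregate_distance(item_distances):
    """Mean of per-item distances for one pair of sites.

    Items with missing data (None) are skipped, an assumption made here
    for convenience.
    """
    observed = [d for d in item_distances if d is not None]
    if not observed:
        raise ValueError("no comparable items for this pair of sites")
    return sum(observed) / len(observed)


# Hypothetical per-word distances for two different pairs of sites.
pair_ab = [1, 0, 0, 1]
pair_cd = [0, 1, 1, 0]

print(aggregate_distance(pair_ab), aggregate_distance(pair_cd))
```

Both pairs end up with the same aggregate although they differ on every single item, which is why the linguistic basis has to be recovered by a separate analysis.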
Co-clustering Varieties and Features
Important method in investigating dialectal variation: cluster similar (dialectal) varieties together
Goal: cluster varieties and linguistic features simultaneously
New research: Bi-partite spectral clustering
  Based on the spectrum of a graph
Generating a bipartite graph from alignments
A bipartite graph is a graph whose vertices can be divided into two disjoint sets where every edge connects a vertex from one set to a vertex in the other set. Vertices within a set are not connected.
From the alignments, we extract the number of sound correspondences for each variety (compared to a reference site)
We generated a bipartite graph of varieties v and sound correspondences s
There is an edge between v_i and s_j iff freq(s_j in v_i) > 0
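The edge rule above can be sketched in a few lines. The input format (a nested dictionary of counts), the function names, and the toy counts are assumptions made for illustration only.

```python
def bipartite_edges(freq):
    """Return the edge set {(variety, correspondence)} with freq > 0.

    freq maps each variety to a dict of sound-correspondence counts,
    as might be extracted from alignments against a reference site.
    """
    return {
        (variety, corr)
        for variety, counts in freq.items()
        for corr, count in counts.items()
        if count > 0
    }


def biadjacency(freq):
    """Binary matrix with one row per variety and one column per correspondence."""
    varieties = sorted(freq)
    corrs = sorted({c for counts in freq.values() for c in counts})
    edges = bipartite_edges(freq)
    matrix = [[1 if (v, c) in edges else 0 for c in corrs] for v in varieties]
    return varieties, corrs, matrix


# Hypothetical counts for two sites; not real data.
toy_freq = {
    "site_A": {"[a]/[i]": 3, "[r]/[x]": 1, "[k]/[x]": 0},
    "site_B": {"[r]/[x]": 2, "[k]/[x]": 5},
}
varieties, corrs, matrix = biadjacency(toy_freq)
print(varieties, corrs, matrix)
```

Raising the `count > 0` test to a higher cut-off is one simple way to experiment with frequency thresholds.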
Example of a bipartite graph A
           [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha     1        1        1        0        0        0
Oudega        1        1        1        0        0        0
Zoutkamp      0        0        1        1        0        0
Kerkrade      0        0        0        1        1        1
Appelscha     0        0        0        1        1        1
Example of co-clustering a bipartite graph
Based on the adjacency matrix A:
           [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha     1        1        1        0        0        0
Oudega        1        1        1        0        0        0
Zoutkamp      0        0        1        1        0        0
Kerkrade      0        0        0        1        1        1
Appelscha     0        0        0        1        1        1
We can calculate the eigenvectors (of the Laplacian) of A:
λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T
λ3 = .53, x = [.12 .12 -.7 .12 .12 .25 .25 -.33 -.33 .25 .25]^T
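The eigenvector computation can be sketched with NumPy. The slide does not say which Laplacian is meant; this sketch assumes the generalized problem L x = λ D x with L = D - W, where W is the full adjacency matrix of the bipartite graph built from A, and it scales each eigenvector to unit length. The function name is hypothetical.

```python
import numpy as np

# Matrix from the slide: rows are varieties, columns are sound correspondences.
A = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)


def bipartite_spectrum(A):
    """Eigenpairs of L x = lambda D x for the bipartite graph given by A."""
    n_rows, n_cols = A.shape
    n = n_rows + n_cols
    # Full adjacency matrix: varieties first, then sound correspondences.
    W = np.zeros((n, n))
    W[:n_rows, n_rows:] = A
    W[n_rows:, :n_rows] = A.T
    degrees = W.sum(axis=1)  # assumes no vertex has degree zero
    L = np.diag(degrees) - W
    # Symmetric form D^{-1/2} L D^{-1/2} has the same eigenvalues.
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, y = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    x = d_inv_sqrt[:, None] * y         # back-transform to generalized eigenvectors
    x /= np.linalg.norm(x, axis=0)      # unit length per eigenvector
    return eigvals, x


eigvals, eigvecs = bipartite_spectrum(A)
print(np.round(eigvals[1:3], 3))
print(np.round(eigvecs[:, 1], 2))
```

Under these assumptions the printed values should agree with the slide's λ2, λ3 and second eigenvector, up to an overall sign flip of the vector, since the sign of an eigenvector is arbitrary.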
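Example of co-clustering a bipartite graph
To cluster in k = 2 groups, we use:
λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T
We obtain the following co-clustering:
Varieties: Appelscha -0.32, Oudega -0.32, Zoutkamp 0, Kerkrade 0.32, Appelscha 0.32
Sound correspondences: [a]/[i] -0.34, [2]/[i] -0.34, [r]/[x] -0.23, [k]/[x] 0.23, [r]/[ö] 0.34, [r]/[K] 0.34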
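For k = 2, the grouping can be read off the second eigenvector. The sketch below simply splits the nodes by the sign of their value, which is a stand-in chosen here for simplicity; Dhillon (2001) runs k-means on the eigenvector values instead. The values are those printed on the slide, with varieties first and sound correspondences after them; the label for the fifth row is a placeholder because the slide repeats the name Appelscha.

```python
# (node, value) pairs in the order of the eigenvector on the slide.
nodes = [
    ("Appelscha", -0.32), ("Oudega", -0.32), ("Zoutkamp", 0.0),
    ("Kerkrade", 0.32), ("row 5 (labelled Appelscha)", 0.32),
    ("[a]/[i]", -0.34), ("[2]/[i]", -0.34), ("[r]/[x]", -0.23),
    ("[k]/[x]", 0.23), ("[r]/[ö]", 0.34), ("[r]/[K]", 0.34),
]


def split_by_sign(nodes, tol=1e-9):
    """Group nodes by the sign of their eigenvector entry.

    Entries within tol of zero are reported separately, since the
    eigenvector alone does not assign them to either side.
    """
    groups = {"negative": [], "positive": [], "undecided": []}
    for name, value in nodes:
        if value < -tol:
            groups["negative"].append(name)
        elif value > tol:
            groups["positive"].append(name)
        else:
            groups["undecided"].append(name)
    return groups


groups = split_by_sign(nodes)
for side, members in groups.items():
    print(side, members)
```

Each side contains both varieties and sound correspondences, which is the point of co-clustering: the correspondences grouped with a set of varieties indicate the linguistic basis of that group.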
Results: {2,3,4} clusters of varieties
Results: {2,3,4} clusters of sound correspondences
Some sound correspondences specific for the Frisian area
Reference  [2]  [2]  [a]  [o]  [u]  [x]  [x]  [r]
Frisian    [I]  [i]  [i]  [E]  [E]  [j]  [z]  [x]

Some sound correspondences specific for the Limburg area
Reference  [r]  [r]  [k]  [n]  [n]  [w]
Limburg    [ö]  [K]  [x]  [ö]  [K]  [f]

Some sound correspondences specific for the Low Saxon area
Reference  [@]  [@]  [@]  [-]  [a]
Low Saxon  [m]  [N]  [ð]  [P]  [e]
Discussion
Bipartite spectral graph partitioning detects the linguistic basis for the dialectal grouping when used to simultaneously cluster varieties and sound correspondences
Further experimentation:
  Use of standard and protolanguage correspondences as reference varieties
  Measures proposed to identify the importance of sound correspondences (published)
  Various frequency thresholds
M. Wieling & J. Nerbonne (2011) “Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features” Computer Speech and Language 25, 700-715.
Gabon Bantu
[Figure: plot titled "Bantu"]
Area: Bantu
Data: 53 sites, 160 words
Source: Van der Veen, Lyon
Note: Late settlement
Bulgaria
[Figure: plot titled "Bulgaria"]
Area: Bulgaria
Data: 482 sites, 54 words
Source: Stoykov’s atlas
Note: Long Turkish domination
Germany
[Figure: plot titled "Germany"]
Area: Germany
Data: 186 sites, 201 words
Source: Kleiner Deutscher Lautatlas (Goschel)
LAMSAS / Lowman
[Figure: plot titled "LAMSAS / Lowman"]
Area: Eastern Seaboard, US
Data: 357 sites, 145 words
Source: Mid & South Atlantic, LAMSAS (Lowman)
Note: Settlement in last 400 yr.
The Netherlands
[Figure: plot titled "The Netherlands"]
Area: The Netherlands
Data: 424 sites, 562 words
Source: Goeman-Taeldeman-van Reenen Atlas
Norway
[Figure: plot titled "Norway"]
Area: Norway
Data: North Wind & Sun, 15 sites, 58 words
Source: www.ling.hf.ntnu.no/nos
References
Inderjit Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. ACM New York.
Ton Goeman and Johan Taeldeman. 1996. Fonologie en morfologie van de Nederlandse dialecten. Een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48:38–59.
Vladimir Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 164:845–848.
Wilbert Heeringa. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. Ph.D. thesis, Rijksuniversiteit Groningen.
Martijn Wieling and John Nerbonne. 2009. Bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology. In: Monojit Choudhury (ed.) Proceedings of the TextGraphs-4 Workshop at the 47th Meeting of the Association for Computational Linguistics, August 2009, Singapore. Available via http://www.martijnwieling.nl.
Martijn Wieling, Jelena Prokic, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In: Lars Borin and Piroska Lendvai (eds.) Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R 2009) Workshop at the 12th Meeting of the European Chapter of the Association for Computational Linguistics. Athens, 30 March 2009, pp. 26-34. Available via http://www.martijnwieling.nl.