Criticisms of dialectometry, esp. Levenshtein-based work

- Measure is too insensitive, 0/1 segment differences
- Too little attention to phonetic/phonological conditioning
- Too reliant on transcription: what about acoustics?
- Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
- “Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  - the hint is that it may be all smoke & mirrors
- So what? Isn’t this all just confirming what we knew earlier?

Let’s use these criticisms to inspect novel developments.

John Nerbonne j.nerbonne@rug 1/25

Sensitivity of measure

- Binary segment distances too rough
- Frequent concern in Groningen (Heeringa Diss., 2004)
  - Segment distances based on phonetic features, phonological features, canonical spectrograms
  - High correlation with rough measures when compared at aggregate (varietal) level
  - But no substantial improvement in aggregate distance measures (validation wrt dialect speakers’ judgments)
  - compare height measurements in in., cm, mm, µm
- Difficult problem in general, due inter alia to fine detail in atlases, e.g., 1,300 different vowels in LAMSAS
- New procedure (Martijn Wieling) induces segment weights from data

Inducing segment distances

Sound correspondences were obtained using the Levenshtein algorithm combined with a Pointwise Mutual Information procedure (Wieling et al., 2009; included in RuG/L04)

Levenshtein algorithm:

l E I k @ n
l i k h 8 n

Four edit operations, each of cost 1 (total distance 4)
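As a minimal sketch, assuming the binary (0/1) segment costs that the first slide criticizes, the distance between the two transcriptions above can be computed with the standard dynamic-programming recurrence. The function name `levenshtein` and the list-of-symbols encoding are illustrative choices and are not taken from RuG/L04.

```python
def levenshtein(source, target):
    """Edit distance with unit costs for insertion, deletion, and substitution."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # deleting the first i source segments
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s == t else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Segments are kept as list items so that multi-character symbols stay intact.
word_a = "l E I k @ n".split()
word_b = "l i k h 8 n".split()
print(levenshtein(word_a, word_b))  # 4
```

Every mismatch costs the same here, which is exactly the insensitivity that induced segment weights are meant to address.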

Segment distances based on Pointwise Mutual Information:

\[
\mathrm{PMI}(x, y) = \log_2\left(\frac{p(x, y)}{p(x)\,p(y)}\right)
\]
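The PMI formula can be illustrated with a short sketch. It assumes that p(x, y) is estimated as the relative frequency of the aligned pair (x, y), and that p(x) and p(y) are the relative frequencies of each segment on its side of the alignments; the slide does not state how the probabilities are estimated. The function name `pmi_table` and the toy counts are invented for illustration.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """Return PMI(x, y) in bits for every aligned segment pair that occurs."""
    pair_counts = Counter(aligned_pairs)
    total = sum(pair_counts.values())
    left_counts = Counter(x for x, _ in aligned_pairs)
    right_counts = Counter(y for _, y in aligned_pairs)
    table = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total
        p_x = left_counts[x] / total
        p_y = right_counts[y] / total
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Hypothetical aligned segment pairs (invented counts, not real dialect data).
pairs = [("a", "a")] * 6 + [("a", "e")] * 2 + [("e", "e")] * 6 + [("e", "a")] * 2
scores = pmi_table(pairs)
print(scores[("a", "e")])  # -1.0
```

A positive score means the pair is aligned more often than independence would predict, a negative score less often. How the scores are converted into segment distances is not specified on the slide.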

Evaluating segment weight induction

Evaluation with respect to alignment correctness, which is more sensitive than aggregate correlations with judgments

- 50% less error using alignments with induced weights
- Competitive with sophisticated bioinformatic techniques (pair Hidden Markov Models)
- Future project: evaluate the segment weights against linguistic criteria, compare weights induced from different data sets

Wieling, Prokic, & Nerbonne “Evaluating the pairwise string alignments of pronunciations” LaTeCH-SHELT&R, 2009.

Wieling, Margaretha & Nerbonne “Inducing Phonetics from Dialect Variation” Submitted to Journal of Phonetics Jan., 2011.

Phonetic/Phonological conditioning

Example: Some Dutch varieties lose final-syllable schwas, but only when /n/ follows. The [@]:[∅] correspondence occurs, but not always.

Solution (partial): apply Levenshtein algorithm not to sequence of phonetic symbols but instead to BIGRAMS

l o p @ n
l o p m

#l lo op p@ @n n#
#l lo op ∅ -m

Heeringa et al. (2006): bigram-based routine correlates more highly with speakers’ judgments.
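The bigram idea can be sketched as follows, assuming that words are padded with the boundary symbol `#` as in the example above and that two bigrams either match exactly (cost 0) or not at all (cost 1). That all-or-nothing cost is a simplification made here; it is not claimed to be the weighting used by Heeringa et al. (2006). The helper names `bigrams` and `levenshtein` are illustrative.

```python
def bigrams(segments, boundary="#"):
    """Turn a segment sequence into overlapping bigrams, padded with word boundaries."""
    padded = [boundary] + list(segments) + [boundary]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


def levenshtein(source, target):
    """Edit distance with unit costs; works on any sequence of comparable items."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(previous[j - 1] + (0 if s == t else 1),  # substitute/match
                               previous[j] + 1,                         # delete
                               current[j - 1] + 1))                     # insert
        previous = current
    return previous[-1]


word_a = "l o p @ n".split()
word_b = "l o p m".split()
unigram_distance = levenshtein(word_a, word_b)
bigram_distance = levenshtein(bigrams(word_a), bigrams(word_b))
print(unigram_distance, bigram_distance)  # 2 3
```

Because each bigram carries its neighbour, the loss of schwa is only counted as "the same" change when it happens in the same context, which is what makes the comparison sensitive to conditioning.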

Transcription vs. acoustics

- Levenshtein distance indeed relies on sequences of transcription symbols
- Simply using frequently sampled spectrograms too rough (e.g., male-female differences dominate)
- Vowel databases are attractive for dialectological study, usually studied via formant tracking
- Formant tracking too unreliable (25% error) to apply automatically
- How to maintain dialectometric program, focusing on large samples of variation?

Leinonen’s work on Swedish

- Data source: SveDia (Eriksson, 2004)
  - 107 sites, 12 speakers/site
  - half men, half women
  - half about 27 yr. old, half about 65
  - 19 vowels were focus of analysis, 5 repetitions each
- Challenge: automatically assess vowel quality
  - Formant tracking too error prone for automatic application
- Applied technique due to Pols, Jacobi
  - Band-pass filtered vowel spectra, lowest filter dropped for women
  - Separate PCAs for men and women, reducing sex differences

An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects. Prize (25,000 Sw. Kronor), 2010, Royal Gustav Adolf Academy for Swedish Culture

Leinonen’s older vs. younger speakers

Where’s the sociolinguistics?

- Techniques are general, and may be applied to variation due to geography, social status, or native language (accents)
- Sociolinguistics has created fewer large data repositories
- But some work exists
  - Leinonen’s work on age and sex
  - Two papers on the effects of borders
  - Ongoing work on foreign accents in English
    - George Mason’s The speech accent archive
  - Ongoing work combining geographic and social explanans

Linking dialect classification to linguistic basis

- Dialectometry’s strength lies in aggregating over many linguistic variables, using sum (or average) of variables’ differences to characterize relations between varieties
- Individual items no longer recognizable in sums!
- Many attempts to identify characteristic variables
  - Heeringa (2004) found items whose distances correlated with major dimensions in MDS
  - Shackleton (2005, 2007): PCA
  - Nerbonne (2006): factor analysis
  - Prokic (2007) developed measures of correspondence strength

All post hoc analyses!

Co-clustering Varieties and Features

- Important method in investigating dialectal variation: cluster similar (dialectal) varieties together
- Goal: cluster varieties and linguistic features simultaneously
- New research: Bi-partite spectral clustering
  - Based on the spectrum of a graph

Generating a bipartite graph from alignments

A bipartite graph is a graph whose vertices can be divided into two disjoint sets where every edge connects a vertex from one set to a vertex in the other set. Vertices within a set are not connected.

From the alignments, we extract the number of sound correspondences for each variety (compared to a reference site)

We generated a bipartite graph of varieties v and sound correspondences s

There is an edge between v_i and s_j iff freq(s_j in v_i) > 0
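The edge rule can be sketched as a binary adjacency matrix built from correspondence counts. The function name `adjacency_matrix` and the frequency values below are hypothetical and serve only to illustrate the freq > 0 criterion.

```python
def adjacency_matrix(frequencies, varieties, correspondences):
    """Rows are varieties, columns are sound correspondences; 1 marks an edge."""
    return [
        [1 if frequencies.get((v, s), 0) > 0 else 0 for s in correspondences]
        for v in varieties
    ]


varieties = ["Appelscha", "Zoutkamp"]
correspondences = ["[a]/[i]", "[r]/[x]", "[k]/[x]"]

# Hypothetical counts of each correspondence per variety (invented numbers).
frequencies = {
    ("Appelscha", "[a]/[i]"): 12,
    ("Appelscha", "[r]/[x]"): 3,
    ("Zoutkamp", "[r]/[x]"): 5,
    ("Zoutkamp", "[k]/[x]"): 1,
}

A = adjacency_matrix(frequencies, varieties, correspondences)
print(A)  # [[1, 1, 0], [0, 1, 1]]
```

Note that the rule discards how often a correspondence occurs; a count of 1 and a count of 12 produce the same edge.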

Example of a bipartite graph A

            [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha      1        1        1        0        0        0
Oudega         1        1        1        0        0        0
Zoutkamp       0        0        1        1        0        0
Kerkrade       0        0        0        1        1        1
Appelscha      0        0        0        1        1        1

Example of co-clustering a bipartite graph

Based on the adjacency matrix A:

            [a]/[i]  [2]/[i]  [r]/[x]  [k]/[x]  [r]/[ö]  [r]/[K]
Appelscha      1        1        1        0        0        0
Oudega         1        1        1        0        0        0
Zoutkamp       0        0        1        1        0        0
Kerkrade       0        0        0        1        1        1
Appelscha      0        0        0        1        1        1

We can calculate the eigenvectors (of the Laplacian) of A:

λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T

λ3 = .53, x = [.12 .12 -.7 .12 .12 .25 .25 -.33 -.33 .25 .25]^T

To cluster in k = 2 groups, we use:

λ2 = .057, x = [-.32 -.32 0 .32 .32 -.34 -.34 -.23 .23 .34 .34]^T
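The k = 2 step can be sketched in dependency-free Python, following the normalized formulation of Dhillon (2001): take the second singular vectors of D1^(-1/2) A D2^(-1/2), rescale them, and read the two groups off the signs. Several things here are assumptions of this sketch rather than the authors' procedure: the deflated power iteration (chosen to avoid external libraries), the sign-based split, and the function name `second_singular_pair`. The fifth row keeps the matrix values from the source, where its label repeats "Appelscha".

```python
import math


def second_singular_pair(A, iterations=200):
    """Return (lambda2, z_rows, z_cols) for the bipartite graph with adjacency A.

    Assumes every row and every column of A contains at least one nonzero entry.
    """
    rows, cols = range(len(A)), range(len(A[0]))
    d1 = [sum(A[i]) for i in rows]                    # degree of each variety
    d2 = [sum(A[i][j] for i in rows) for j in cols]   # degree of each correspondence
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in cols] for i in rows]

    # The leading right singular vector (singular value 1) is known in closed form.
    scale = math.sqrt(sum(d2))
    v1 = [math.sqrt(d) / scale for d in d2]

    def deflate(v):
        dot = sum(a * b for a, b in zip(v, v1))
        return [a - dot * b for a, b in zip(v, v1)]

    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    v = normalise(deflate([float(j + 1) for j in cols]))  # arbitrary start vector
    for _ in range(iterations):
        u = [sum(An[i][j] * v[j] for j in cols) for i in rows]   # u = An v
        v = [sum(An[i][j] * u[i] for i in rows) for j in cols]   # v = An^T u
        v = normalise(deflate(v))
    u = [sum(An[i][j] * v[j] for j in cols) for i in rows]
    sigma = math.sqrt(sum(x * x for x in u))          # second singular value
    # Rescale to the generalised eigenvector of the graph Laplacian.
    z_rows = [u[i] / (sigma * math.sqrt(d1[i])) for i in rows]
    z_cols = [v[j] / math.sqrt(d2[j]) for j in cols]
    return 1.0 - sigma, z_rows, z_cols


A = [
    [1, 1, 1, 0, 0, 0],  # Appelscha
    [1, 1, 1, 0, 0, 0],  # Oudega
    [0, 0, 1, 1, 0, 0],  # Zoutkamp
    [0, 0, 0, 1, 1, 1],  # Kerkrade
    [0, 0, 0, 1, 1, 1],  # fifth row (labelled "Appelscha" again in the source)
]
lam2, z_rows, z_cols = second_singular_pair(A)
print(round(lam2, 3))  # 0.057
```

The computed eigenvalue matches the λ2 = .057 given above. The entries of `z_rows` and `z_cols` are scaled differently from the slide's vector and may be negated as a whole, so only their signs are comparable: the first two varieties and first three correspondences fall on one side, the last two varieties and last three correspondences on the other. Zoutkamp sits at (numerically) zero, so a pure sign split leaves it on the boundary.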

We obtain the following co-clustering:

Results: {2,3,4} clusters of varieties

Results: {2,3,4} clusters of sound correspondences

Some sound correspondences specific for the Frisian area

Reference  [2] [2] [a] [o] [u] [x] [x] [r]
Frisian    [I] [i] [i] [E] [E] [j] [z] [x]

Some sound correspondences specific for the Limburg area

Reference  [r] [r] [k] [n] [n] [w]
Limburg    [ö] [K] [x] [ö] [K] [f]

Some sound correspondences specific for the Low Saxon area

Reference  [@] [@] [@] [-] [a]
Low Saxon  [m] [N] [ð] [P] [e]

Discussion

Bipartite spectral graph partitioning detects the linguistic basis for the dialectal grouping when used to simultaneously cluster varieties and sound correspondences

Further experimentation:
- Use of standard and protolanguage correspondences as reference varieties
- Measures proposed to identify the importance of sound correspondences (published)
- Various frequency thresholds

M. Wieling & J. Nerbonne (2011) “Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features” Computer Speech and Language 25, 700-715.

Gabon Bantu

Area: Bantu
Data: 53 sites, 160 words
Source: Van der Veen, Lyon
Note: Late settlement

Bulgaria

Area: Bulgaria
Data: 482 sites, 54 words
Source: Stoykov’s atlas
Note: Long Turkish domination

Germany

Area: Germany
Data: 186 sites, 201 words
Source: Kleiner Deutscher Lautatlas (Göschel)

LAMSAS / Lowman

Area: Eastern Seaboard, US
Data: 357 sites, 145 words
Source: Mid & South Atlantic, LAMSAS (Lowman)
Note: Settlement in last 400 yr.

The Netherlands

Area: The Netherlands
Data: 424 sites, 562 words
Source: Goeman-Taeldeman-van Reenen Atlas

Norway

Area: Norway
Data: North Wind & Sun, 15 sites, 58 words
Source: www.ling.hf.ntnu.no/nos

References

Inderjit Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. ACM New York.

Ton Goeman and Johan Taeldeman. 1996. Fonologie en morfologie van de Nederlandse dialecten. Een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48:38–59.

Vladimir Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 164:845–848.

Wilbert Heeringa. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. Ph.D. thesis, Rijksuniversiteit Groningen.

Martijn Wieling and John Nerbonne. 2009. Bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology. In: Monojit Choudhury (ed.) Proceedings of the TextGraphs-4 Workshop at the 47th Meeting of the Association for Computational Linguistics, August 2009, Singapore. Available via http://www.martijnwieling.nl.

Martijn Wieling, Jelena Prokic, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In: Lars Borin and Piroska Lendvai (eds.) Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH - SHELT&R 2009) Workshop at the 12th Meeting of the European Chapter of the Association for Computational Linguistics. Athens, 30 March 2009, pp. 26-34. Available via http://www.martijnwieling.nl.
