Talk at Institut Jean Nicod on 6 October 2010
Phylogenetic methods of language diversification
Robin J. Ryder
CEREMADE – Paris Dauphine and CREST – ENSAE
Work done at the Department of Statistics, University of Oxford, under the supervision of Geoff K. Nicholls
www.slideshare.net/robinryder
What to expect
Past attempts: Swadesh and glottochronology
Background from Evolutionary Biology
Modern methods (a sample) + criticisms
Application to dating of Proto-Indo-European
Before we start...
Statistics: additional insight alongside the comparative method
None of these models represents the truth. Nonetheless, they can provide us with information.
Please interrupt me!
What Statistics add
Quantitative estimates
Estimation of uncertainty
Model testing
Automatization
Swadesh and glottochronology
200/100 word list
Compares 2 languages (c=fraction of shared cognates)
Assumes that r, the fraction of cognates retained after 1000 years, is constant across all languages (86%)
Infers age t of Most Recent Common Ancestor
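As a rough illustration of the calculation on this slide, the sketch below uses the classic glottochronological formula t = ln c / (2 ln r), with t in millennia. It assumes r is the per-language retention rate over 1000 years, so that two languages separated for t millennia are expected to share a fraction r^(2t) of cognates. The function name and the example value c = 0.74 are made up for illustration.

```python
import math

def glottochronology_age(c, r=0.86):
    """Estimated time since the most recent common ancestor, in millennia.

    Assumes each language independently retains a fraction r of the word
    list per 1000 years, so the expected shared fraction is c = r ** (2 * t).
    """
    if not 0.0 < c <= 1.0:
        raise ValueError("c must be in (0, 1]")
    if not 0.0 < r < 1.0:
        raise ValueError("r must be in (0, 1)")
    return math.log(c) / (2.0 * math.log(r))

# Hypothetical pair of languages sharing 74% of cognates on the list
age = glottochronology_age(0.74)
print(f"about {age * 1000:.0f} years")
```

The estimate is a single number with no measure of uncertainty, which is one of the criticisms raised later in the talk.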
I you (singular) he we you (plural) they this that here there who what where when how not all many some few other one two three four five big long wide thick heavy small short narrow thin woman man (adult male) man (human being) child wife
husband mother father animal fish bird dog louse snake worm tree forest stick fruit seed leaf root bark flower grass rope skin meat blood bone fat (n.) egg horn tail feather hair head ear eye nose mouth tooth tongue fingernail foot
leg knee hand wing belly guts neck back breast heart liver drink eat bite suck spit vomit blow breathe laugh see hear know think smell fear sleep live die kill fight hunt hit cut split stab scratch dig swim fly (v.)
walk come lie sit stand turn fall give hold squeeze rub wash wipe pull push throw tie sew count say sing play float flow freeze swell sun moon star water rain river lake sea salt stone sand dust
earth cloud fog sky wind snow ice smoke fire ashes burn road mountain red green yellow white black night day year warm cold full new old good bad rotten dirty straight round sharp dull smooth wet dry correct near
far right left at in with and if because name
Bergsland & Vogt (1962)
Found different rates for different pairs of languages: Old Norse and Icelandic, Georgian and Mingrelian, Armenian and Old Armenian
Discredited Glottochronology
Sankoff (1973): sample selection bias, no estimation of uncertainty
Fair criticism
Bad observation protocol from Swadesh
Does not apply so much to modern methods
Genetics 101
Genetic information is stored in DNA
DNA uses 4 letters: A, C, T and G
DNA transmission
Phylogenetics
A: TTGCAATCCG
B: TAGCAATCCG
C: CTGCAATACG
D: CTGCAATAGA
Compare different possible trees
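One simple way to compare candidate trees for the four sequences A to D is to count the minimum number of changes each tree requires (Fitch parsimony). This is a swapped-in illustration, not the Bayesian method described later in the talk; the function names and the tree encoding (nested pairs of leaf names) are assumptions made for this sketch.

```python
def fitch_site_cost(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is either a leaf name or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site_cost(tree[0], states)
    right_set, right_cost = fitch_site_cost(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one extra change is needed at this node
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, sequences):
    """Sum the per-site Fitch costs over all aligned positions."""
    length = len(next(iter(sequences.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in sequences.items()}
        total += fitch_site_cost(tree, site)[1]
    return total

sequences = {
    "A": "TTGCAATCCG",
    "B": "TAGCAATCCG",
    "C": "CTGCAATACG",
    "D": "CTGCAATAGA",
}

# The three possible groupings of four taxa
candidate_trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {label: parsimony_score(tree, sequences)
          for label, tree in candidate_trees.items()}
for label, score in scores.items():
    print(label, score)
```

With these sequences, grouping A with B and C with D needs 5 changes, against 7 for either alternative, so that grouping is preferred under this criterion.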
Charles Darwin
« The formation of different languages and of distinct species, and the proofs that both have developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation. »
Similarities between genes and languages
Characteristic            Genetics                     Linguistics
Discrete units            Genes, nucleotides           Lexical, morphosyntactic and/or phonological traits
Transmission              Transcription                Learning, imitation
Horizontal transmission   Viruses, hybridization...    Borrowing, creoles...
Change                    Point mutation, indels...    Vowel shift, innovations, word loss
As in genetics, a tree model is relevant for certain types of linguistic data.
Indo-European languages
Questions
Topology
Internal ages
Age of the root: 6000-6500 BP or 8000-9500 BP?
(BP=Before Present)
Core vocabulary
100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...
Borrowing is possible (non-tree-like change), but:
“Easy” to detect
Uncommon
Does not introduce systematic bias
Data coding
Old English: stierfþ
Old High German: stirbit, touwit
Avestan: miriiete
Old Church Slavonic: umĭretŭ
Latin: moritur
Oscan: ?
Cognacy classes:
1. {stierfþ, stirbit}
2. {touwit}
3. {miriiete, umĭretŭ, moritur}
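The coding above can be turned into the binary (0/1) data used by the models later in the talk. The sketch below is one possible encoding, with one column per cognacy class and "?" for a language with no attested form; the dictionary layout and the function name are assumptions made for illustration.

```python
# Cognacy classes for the meaning shown on the slide.
# None marks a language with no attested form (Oscan).
cognacy = {
    "Old English": {1},
    "Old High German": {1, 2},
    "Avestan": {3},
    "Old Church Slavonic": {3},
    "Latin": {3},
    "Oscan": None,
}

def to_binary_rows(cognacy):
    """Return the sorted class labels and one 0/1/'?' row per language."""
    classes = sorted(set().union(
        *(present for present in cognacy.values() if present is not None)
    ))
    rows = {}
    for language, present in cognacy.items():
        if present is None:
            rows[language] = ["?"] * len(classes)  # missing data
        else:
            rows[language] = [1 if k in present else 0 for k in classes]
    return classes, rows

classes, rows = to_binary_rows(cognacy)
for language, row in rows.items():
    print(f"{language:22s}", *row)
```

Old High German has two forms for this meaning, so its row contains two 1s; a language can belong to several cognacy classes at once.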
Data coding (2)
Specialist linguists make cognacy judgments
Eliminate known borrowing
Only do this for languages which are known to be related
Data
Indo-European languages
Core vocabulary (Swadesh 100 or 200)
Two data sets
Dyen et al. (1997): 87 languages, mostly modern
Ringe et al. (2002): 24 languages, mostly ancient
Constraints
Constraints on parts of the topology
Constraints on some internal ages
We use these constraints to infer rates and other ages
Using models from biology
First attempts: Gray & Jordan (2000), Gray & Atkinson (2003)
Biological models make assumptions which do not apply to languages
Gray & Atkinson (2003): tree of 87 Indo-European languages obtained using lexical data and the MrBayes package (Huelsenbeck & Ronquist).
Selection of criticisms
Multiple births
Missing data
Rate heterogeneity
Description of the model (1)
Traits are born at rate λ
Trait instances die at rate μ
λ and μ are constants
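A minimal simulation sketch of this birth-death process along a single branch, assuming constant rates and using a standard event-by-event (Gillespie-style) scheme. Each new trait receives a fresh identifier, so a trait that has died is never born again. The function name, the parameter values and the integer identifiers are illustrative assumptions.

```python
import random

def simulate_traits(lam, mu, duration, rng, initial=()):
    """Simulate trait births (rate lam) and deaths (rate mu per trait)."""
    alive = set(initial)
    next_id = max(alive, default=-1) + 1
    t = 0.0
    while True:
        total_rate = lam + mu * len(alive)
        if total_rate == 0:
            return alive  # nothing can happen any more
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > duration:
            return alive
        if rng.random() < lam / total_rate:
            alive.add(next_id)  # birth of a brand-new trait
            next_id += 1
        else:
            alive.remove(rng.choice(sorted(alive)))  # death of one trait

rng = random.Random(1)
sizes = [len(simulate_traits(2.0, 0.5, 20.0, rng)) for _ in range(300)]
mean_size = sum(sizes) / len(sizes)
print(mean_size)
```

Over a long branch the average number of traits should settle near λ/μ (here 2.0 / 0.5 = 4), whatever the starting set.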
Description of the model (2)
Catastrophes occur at rate ρ
At a catastrophe, each trait dies with probability κ and a Poisson(ν) number of new traits is born.
λ/μ=ν/κ: the number of traits is constant on average.
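The balance condition can be checked with a line of arithmetic: if a language has N traits on average, a catastrophe leaves N(1 − κ) + ν on average, which equals N exactly when N = ν/κ; the birth-death process between catastrophes keeps the average at λ/μ. The sketch below verifies this numerically with made-up parameter values.

```python
def expected_traits_after_catastrophe(n_before, kappa, nu):
    """Expected trait count after a catastrophe: survivors plus newborns."""
    return n_before * (1.0 - kappa) + nu

# Illustrative values only
lam, mu, kappa = 2.0, 0.5, 0.2
nu = kappa * lam / mu          # chosen so that lam / mu == nu / kappa

n_stationary = lam / mu        # average trait count between catastrophes
n_after = expected_traits_after_catastrophe(n_stationary, kappa, nu)
print(n_stationary, n_after)
```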
Description of the model (3)
Observation model: each data point (0s and 1s) is missing with probability ξ
Some traits are not observed and are therefore deleted from the data
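A small sketch of this observation step applied to a binary matrix (rows are languages, columns are traits). It assumes a simple registration rule, namely that a trait is kept only if at least one observed entry is a 1; the actual rule used in the analysis may differ, and the function name is hypothetical.

```python
import random

def observe(matrix, xi, rng):
    """Mask each entry with probability xi, then drop unobserved traits."""
    masked = [["?" if rng.random() < xi else value for value in row]
              for row in matrix]
    n_traits = len(matrix[0]) if matrix else 0
    # Assumed rule: keep a column only if some language shows an observed 1
    keep = [j for j in range(n_traits)
            if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in keep] for row in masked]

full_data = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
]
observed = observe(full_data, xi=0.2, rng=random.Random(0))
for row in observed:
    print(*row)
```

Because unobserved traits are silently deleted, the likelihood has to account for this deletion; otherwise the rates would be biased.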
Registration process
Posterior distribution
Likelihood calculations
Statistical inference (MCMC)
Fit the model to the data
Trees that make the data likely
Move across the tree space; try thousands of possible trees
Obtain a sample of plausible trees and dates
Samples weighted by quality of fit to data
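To give a flavour of the sampling idea, here is a toy Metropolis sampler for a single parameter, the death rate μ, rather than for whole trees. It assumes made-up data (172 of 200 traits surviving over one millennium), a flat prior on μ > 0, and a survival probability of exp(−μt) per trait; all names and values are illustrative, and the real analysis explores trees, dates and rates jointly.

```python
import math
import random

def log_posterior(mu, survived, total, duration):
    """Log posterior of mu, up to a constant, under a flat prior on mu > 0."""
    if mu <= 0:
        return -math.inf
    log_p_survive = -mu * duration
    log_p_die = math.log(-math.expm1(-mu * duration))
    return survived * log_p_survive + (total - survived) * log_p_die

def metropolis(survived, total, duration, n_steps, step, rng, start=0.5):
    """Random-walk Metropolis: propose a nearby value, accept or reject."""
    mu = start
    current = log_posterior(mu, survived, total, duration)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, survived, total, duration)
        # Always accept better fits; accept worse fits with some probability
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples

samples = metropolis(172, 200, 1.0, n_steps=6000, step=0.03,
                     rng=random.Random(0))
kept = samples[1000:]  # discard the early, start-dependent part of the chain
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

The spread of the retained samples gives a direct estimate of uncertainty, which is what the single-number glottochronology estimate lacked.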
Synthetic data
[Figure: true tree (~40 words/language) and consensus tree]
Synthetic data (2)
[Figure: death rate (μ)]
Influence of borrowing
[Figure: true tree (~40 words/language, borrowing: 10%) and consensus tree]
Influence of borrowing (2)
[Figure: true tree (~40 words/language, borrowing: 50%) and consensus tree]
Influence of borrowing (3)
Topology is reconstructed correctly
Dates are underestimated for high levels of borrowing
[Figure: root age and death rate (μ), borrowing: 50%]
Detecting borrowing
Confirmed: hardly any borrowing!
Cross-validation
[Figure: root age]
Conclusions
Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.
Applicable to a variety of linguistic and cultural data sets
TraitLab: it's free!
Questions
otázky
spørgsmåler
vragen
questions
Fragen
domande
pytania
questões
întrebări
вопросы
vprašanja
preguntes
preguntas
frågor
vrae
spurningar
quaestiones
ερωτήσεις
въпроси
kesses
spørsmåler
kláusimai
запитанні
سوال
प्रश्न
cwestiwnau
References
Ryder & Nicholls (2011), Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European, JRSS C
Nicholls (2008), Horses or farmers? The tower of Babel and confidence in trees, Significance
Nicholls & Gray (2008), Dated ancestral trees from binary trait data and their application to the diversification of languages, JRSS B
Gray & Atkinson (2003), Language-tree divergence times support the Anatolian theory of Indo-European origin, Nature
Gray & Jordan (2000), Language trees support the express-train sequence of Austronesian expansion, Nature
Bergsland & Vogt (1962), On the validity of glottochronology, Current Anthropology
Sankoff (1973), Mathematical developments in lexicostatistic theory, Current Trends in Linguistics
Ryder (2010), Phylogenetic models of language diversification, DPhil thesis, University of Oxford