Talk at Institut Jean Nicod on 6 October 2010

Phylogenetic methods of language diversification

Robin J. Ryder

CEREMADE – Paris Dauphine and CREST – ENSAE

Work done at the Department of Statistics, University of Oxford, under the supervision of Geoff K. Nicholls

www.slideshare.net/robinryder

What to expect

Past attempts: Swadesh and glottochronology

Background from Evolutionary Biology

Modern methods (a sample) + criticisms

Application to dating of Proto-Indo-European

Before we start...

Statistics: additional insight alongside the comparative method

None of these models represent the truth. Nonetheless, they can provide us with information.

Please interrupt me!

What statistics adds

Quantitative estimates

Estimation of uncertainty

Model testing

Automatization

Swadesh and glottochronology

200/100 word list

Compares 2 languages (c=fraction of shared cognates)

Assumes r = fraction of cognates retained after 1000 years is constant for all languages (86%)

Infers age t of Most Recent Common Ancestor
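The inference step can be made concrete with the classical glottochronology formula, t = ln c / (2 ln r), with t in millennia. The formula itself is not spelled out on the slide; the function name and the example value of c below are illustrative assumptions.

```python
import math

def glottochronology_age(c, r=0.86):
    """Age (in millennia) of the most recent common ancestor of two languages.

    c: fraction of shared cognates between the two word lists (0 < c <= 1).
    r: retention rate per 1000 years, assumed constant for all languages.
    Each lineage independently retains a fraction r**t of the ancestral
    words, so the expected shared fraction is c = r**(2*t).
    """
    if not 0 < c <= 1:
        raise ValueError("c must lie in (0, 1]")
    return math.log(c) / (2 * math.log(r))

# Hypothetical example: two languages sharing 74% of their cognates.
age = glottochronology_age(0.74)
print(f"Estimated split: about {age * 1000:.0f} years ago")
```

Note that the function returns a single number with no measure of uncertainty, which is one of the criticisms raised later in the talk.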

I, you (singular), he, we, you (plural), they, this, that, here, there, who, what, where, when, how, not, all, many, some, few, other, one, two, three, four, five, big, long, wide, thick, heavy, small, short, narrow, thin, woman, man (adult male), man (human being), child, wife, husband, mother, father, animal, fish, bird, dog, louse, snake, worm, tree, forest, stick, fruit, seed, leaf, root, bark, flower, grass, rope, skin, meat, blood, bone, fat (n.), egg, horn, tail, feather, hair, head, ear, eye, nose, mouth, tooth, tongue, fingernail, foot, leg, knee, hand, wing, belly, guts, neck, back, breast, heart, liver, drink, eat, bite, suck, spit, vomit, blow, breathe, laugh, see, hear, know, think, smell, fear, sleep, live, die, kill, fight, hunt, hit, cut, split, stab, scratch, dig, swim, fly (v.), walk, come, lie, sit, stand, turn, fall, give, hold, squeeze, rub, wash, wipe, pull, push, throw, tie, sew, count, say, sing, play, float, flow, freeze, swell, sun, moon, star, water, rain, river, lake, sea, salt, stone, sand, dust, earth, cloud, fog, sky, wind, snow, ice, smoke, fire, ashes, burn, road, mountain, red, green, yellow, white, black, night, day, year, warm, cold, full, new, old, good, bad, rotten, dirty, straight, round, sharp, dull, smooth, wet, dry, correct, near, far, right, left, at, in, with, and, if, because, name

Bergsland & Vogt (1962)

Found different rates for different pairs of languages: Old Norse and Icelandic, Georgian and Mingrelian, Armenian and Old Armenian

Discredited Glottochronology

Sankoff (1973): sample selection bias, no estimation of uncertainty

Fair criticism

Bad observation protocol from Swadesh

Does not apply so much to modern methods

Genetics 101

Genetic information is stored in DNA

DNA uses 4 letters: A, C, T and G

DNA transmission

Phylogenetics

A: TTGCAATCCG
B: TAGCAATCCG
C: CTGCAATACG
D: CTGCAATAGA

Compare different possible trees
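One simple way to compare candidate trees for the four sequences A to D is maximum parsimony: count the minimum number of letter changes each tree requires and prefer the tree needing the fewest. The sketch below uses Fitch's algorithm for that count. Parsimony is chosen here only because it is short to write down; it is an assumption of this example, not the model-based approach presented later in the talk.

```python
# The four aligned sequences from the slide.
SEQS = {
    "A": "TTGCAATCCG",
    "B": "TAGCAATCCG",
    "C": "CTGCAATACG",
    "D": "CTGCAATAGA",
}

def fitch_site(tree, site):
    """Return (state_set, changes) for one alignment column on a nested-tuple tree."""
    if isinstance(tree, str):                      # leaf: its observed letter
        return {SEQS[tree][site]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch_site(child, site) for child in tree
    )
    common = left_set & right_set
    if common:                                     # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree):
    """Minimum number of substitutions needed to explain all columns on this tree."""
    n_sites = len(next(iter(SEQS.values())))
    return sum(fitch_site(tree, s)[1] for s in range(n_sites))

# The three possible groupings of four sequences into two pairs.
TREES = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(tree) for name, tree in TREES.items()}
best = min(scores, key=scores.get)
print(scores, "-> most parsimonious:", best)
```

With these sequences the grouping AB|CD needs 5 changes and the other two need 7, so grouping A with B and C with D is preferred.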

Charles Darwin

« The formation of different languages and of distinct species, and the proofs that both have developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation. »

Similarities between genes and languages

Characteristic | Genetics | Linguistics
Discrete units | Genes, nucleotides | Lexical, morphosyntactic and/or phonological traits
Transmission | Transcription | Learning, imitation
Horizontal transmission | Viruses, hybridization... | Borrowing, creoles...
Change | Point mutation, indels... | Vowel shift, innovations, word loss

As in genetics, a tree model is relevant for certain types of linguistic data.

Indo-European languages

Questions

Topology

Internal ages

Age of the root: 6000-6500 BP or 8000-9500 BP?

(BP=Before Present)

Core vocabulary

100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...

Borrowing is possible (non-tree-like change), but:

“Easy” to detect

Uncommon

Does not introduce systematic bias

Data coding

Old English: stierfþ

Old High German: stirbit, touwit

Avestan: miriiete

Old Church Slavonic: umĭretŭ

Latin: moritur

Oscan: ?

Cognacy classes:

1. {stierfþ, stirbit}

2. {touwit}

3. {miriiete, umĭretŭ, moritur}
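The cognacy classes above can be turned into binary traits: one column per class, with 1 if the language has a word in that class, 0 if not, and "?" where the form is unknown (Oscan here). The sketch below shows one possible layout; the variable names and the use of `None` for a missing form are assumptions of this example.

```python
# Attested forms for one meaning, as listed above; None marks a missing form.
forms = {
    "Old English": ["stierfþ"],
    "Old High German": ["stirbit", "touwit"],
    "Avestan": ["miriiete"],
    "Old Church Slavonic": ["umĭretŭ"],
    "Latin": ["moritur"],
    "Oscan": None,
}
cognacy_classes = [
    {"stierfþ", "stirbit"},
    {"touwit"},
    {"miriiete", "umĭretŭ", "moritur"},
]

def code_language(words):
    """One entry per cognacy class: 1 = present, 0 = absent, '?' = unknown."""
    if words is None:
        return ["?"] * len(cognacy_classes)
    return [int(any(w in cls for w in words)) for cls in cognacy_classes]

matrix = {lang: code_language(words) for lang, words in forms.items()}
for lang, row in matrix.items():
    print(f"{lang:20s}", *row)
```

Old High German gets two 1s because it has words in two different classes for the same meaning.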

Data coding (2)

Specialist linguists make cognacy judgments

Eliminate known borrowing

Only do this for languages which are known to be related

Data

Indo-European languages

Core vocabulary (Swadesh 100 or 200)

Two data sets

Dyen et al. (1997): 87 languages, mostly modern

Ringe et al. (2002): 24 languages, mostly ancient

Constraints

Constraints on parts of the topology

Constraints on some internal ages

We use these constraints to infer rates and other ages

Using models from biology

First attempts: Gray & Jordan (2000), Gray & Atkinson (2003)

Biological models make assumptions which do not apply to languages

Gray and Atkinson (2003): tree of 87 Indo-European languages obtained using lexical data and the MrBayes package (Huelsenbeck & Ronquist).

Selection of criticisms

Multiple births

Missing data

Rate heterogeneity

Description of the model (1)

Traits are born at rate λ

Trait instances die at rate μ

λ and μ are constants
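The birth and death dynamics along a single lineage can be simulated event by event. The sketch below is a minimal Gillespie-style simulation under the stated assumptions (constant λ and μ, no catastrophes, no tree); the parameter values are illustrative only. A standard property of this process is that the number of traits settles around λ/μ.

```python
import random

def simulate_trait_count(lam, mu, duration, n_start=0, rng=random):
    """Simulate the number of traits present in one language.

    lam: birth rate of new traits; mu: death rate of each trait instance.
    Returns the number of traits alive after `duration` time units.
    """
    t, n = 0.0, n_start
    while True:
        total_rate = lam + n * mu            # births + deaths of the n live traits
        t += rng.expovariate(total_rate)     # waiting time to the next event
        if t > duration:
            return n
        if rng.random() < lam / total_rate:
            n += 1                           # a new trait is born
        else:
            n -= 1                           # one existing trait dies

rng = random.Random(1)
lam, mu = 5.0, 0.2                           # illustrative values only
counts = [simulate_trait_count(lam, mu, duration=50.0, rng=rng) for _ in range(1000)]
mean = sum(counts) / len(counts)
print(f"mean number of traits: {mean:.1f} (lam/mu = {lam / mu:.1f})")
```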


Catastrophes occur at rate ρ

At a catastrophe, each trait dies with probability κ and a Poisson(ν) number of new traits are born.

λ/μ=ν/κ: the number of traits is constant on average.
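The balance condition can be checked with a line of arithmetic: if a language holds λ/μ traits on average, a catastrophe leaves (1 − κ)·λ/μ of them and adds ν new ones on average, which equals λ/μ again exactly when ν = κλ/μ. The sketch below verifies this numerically; the parameter values and function name are illustrative assumptions.

```python
import math

def expected_traits_after_catastrophe(n_before, kappa, nu):
    """Each of the n_before traits survives with probability 1 - kappa,
    and a Poisson(nu) number of new traits (mean nu) are born."""
    return (1 - kappa) * n_before + nu

lam, mu, kappa = 5.0, 0.2, 0.3      # illustrative values only
nu = kappa * lam / mu               # the balance condition lam/mu = nu/kappa
equilibrium = lam / mu              # mean trait count under births and deaths alone

after = expected_traits_after_catastrophe(equilibrium, kappa, nu)
print(f"before: {equilibrium:.2f}, after: {after:.2f}")
assert math.isclose(after, equilibrium)
```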


Observation model: each data point (0s and 1s) is missing with probability ξ

Some traits are not observed and are therefore deleted from the data
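The observation model can be sketched as two steps applied to a languages × traits matrix of 0s and 1s: mask entries at random, then drop traits that are never seen. Dropping exactly the columns with no observed 1 is an assumption of this sketch, as are the toy matrix and the function name.

```python
import random

def observe(true_matrix, xi, rng):
    """Apply the observation model to a languages x traits 0/1 matrix.

    Each entry is replaced by '?' with probability xi; then every trait
    (column) with no observed 1 is dropped, since such a trait would never
    be recorded in the data.
    """
    masked = [[("?" if rng.random() < xi else x) for x in row] for row in true_matrix]
    n_traits = len(true_matrix[0])
    kept = [j for j in range(n_traits) if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in kept] for row in masked]

rng = random.Random(0)
# Hypothetical true data: 4 languages x 6 traits.
true_matrix = [
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
data = observe(true_matrix, xi=0.2, rng=rng)
for row in data:
    print(*row)
```

Because unobserved traits are deleted, the likelihood has to be conditioned on a trait being observed at all; otherwise the birth rate would be underestimated.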

Registration process

Posterior distribution

Likelihood calculations

Statistical inference (MCMC)

Fit the model to the data

Trees that make the data likely

Move across the tree space; try thousands of possible trees

Obtain a sample of plausible trees and dates

Samples weighted by quality of fit to data
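The idea can be illustrated on a deliberately tiny problem. The sketch below is a random-walk Metropolis sampler for a single unknown, the split age t of two languages, given that k of n words are still shared and that each word dies at a known rate μ in each lineage. This toy model, its uniform prior and all numbers are assumptions made for illustration; the real analysis moves across whole trees, dates and rates.

```python
import math
import random

def log_posterior(t, k, n, mu, t_max):
    """Log posterior (up to a constant) of the split age t, in millennia."""
    if not 0 < t < t_max:                       # uniform prior on (0, t_max)
        return -math.inf
    x = 2 * mu * t                              # P(word survives in both) = exp(-x)
    return -k * x + (n - k) * math.log(-math.expm1(-x))

def metropolis(k, n, mu, t_max=10.0, n_iter=20000, step=0.3, seed=0):
    rng = random.Random(seed)
    t = t_max / 2                               # arbitrary starting point
    current = log_posterior(t, k, n, mu, t_max)
    samples = []
    for _ in range(n_iter):
        proposal = t + rng.gauss(0, step)       # symmetric random-walk proposal
        candidate = log_posterior(proposal, k, n, mu, t_max)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            t, current = proposal, candidate
        samples.append(t)
    return samples[n_iter // 5:]                # discard the first 20% as burn-in

samples = sorted(metropolis(k=110, n=200, mu=0.15))
mean = sum(samples) / len(samples)
low, high = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(f"posterior mean {mean:.2f}, 95% interval ({low:.2f}, {high:.2f}) millennia")
```

The output is a sample of plausible ages rather than one number, so an interval comes for free; values of t that fit the data better simply appear more often in the sample.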

Synthetic data

Figure: true tree (~40 words/language) and consensus tree

Synthetic data (2)

Death rate (μ)

Influence of borrowing

Figure: true tree (~40 words/language, borrowing: 10%) and consensus tree

Influence of borrowing (2)

Figure: true tree (~40 words/language, borrowing: 50%) and consensus tree

Influence of borrowing (3)

Topology is reconstructed correctly

Dates are underestimated for high levels of borrowing

Figure: root age and death rate (μ), borrowing: 50%

Detecting borrowing

Confirmed: hardly any borrowing!

Cross-validation

Root age

Conclusions

Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.

Applicable to a variety of linguistic and cultural data sets

TraitLab: it's free!

Questions

otázky

spørgsmåler

vragen

questions

Fragen

domande

pytania

questões

întrebări

вопросы

vprašanja

preguntes

preguntas

frågor

vrae

spurningar

quaestiones

ερωτήσεις

въпроси

kesses

spørsmåler

kláusimai

запитанні

سوال

प्रश्न

cwestiwnau

References

Ryder & Nicholls (2011), Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European, JRSS C

Nicholls (2008), Horses or farmers? The tower of Babel and confidence in trees, Significance

Nicholls & Gray (2008), Dated ancestral trees from binary trait data and their application to the diversification of languages, JRSS B

Gray & Atkinson (2003), Language-tree divergence times support the Anatolian theory of Indo-European origin, Nature

Gray & Jordan (2000), Language trees support the express-train sequence of Austronesian expansion, Nature

Bergsland & Vogt (1962), On the validity of glottochronology, Current Anthropology

Sankoff (1973), Mathematical developments in lexicostatistic theory, Current Trends in Linguistics

Ryder (2010), Phylogenetic models of language diversification, DPhil thesis, University of Oxford
