Talk at Institut Jean Nicod on 6 October 2010

Transcript of Talk at Institut Jean Nicod on 6 October 2010

Page 1: Talk at Institut Jean Nicod on 6 October 2010

Phylogenetic methods of language diversification

Robin J. Ryder, CEREMADE – Paris Dauphine and CREST – ENSAE

Work done at the Department of Statistics, University of Oxford, under the supervision of Geoff K. Nicholls

www.slideshare.net/robinryder

Page 3: Talk at Institut Jean Nicod on 6 October 2010

What to expect

Past attempts: Swadesh and glottochronology

Background from Evolutionary Biology

Modern methods (a sample) + criticisms

Application to dating of Proto-Indo-European

Page 4: Talk at Institut Jean Nicod on 6 October 2010

Before we start...

Statistics: additional insight alongside the comparative method

None of these models represent the truth. Nonetheless, they can provide us with information.

Please interrupt me!

Page 5: Talk at Institut Jean Nicod on 6 October 2010

What Statistics add

Quantitative estimates

Estimation of uncertainty

Model testing

Automatization

Page 6: Talk at Institut Jean Nicod on 6 October 2010

Swadesh and glottochronology

Word list of 200 or 100 meanings

Compares 2 languages (c = fraction of shared cognates)

Assumes the retention rate r (fraction of cognates retained after 1000 years) is constant for all languages (86%)

Infers age t of Most Recent Common Ancestor
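The inference step can be written out with the classical glottochronological formula, t = ln(c) / (2 ln(r)), with t in millennia. The sketch below uses the slide's 86% retention rate; the function name and input checks are illustrative and not taken from the talk.

```python
import math


def glottochronology_age(c: float, r: float = 0.86) -> float:
    """Estimated age (in millennia) of the most recent common ancestor.

    c: fraction of shared cognates between the two languages (0 < c <= 1)
    r: fraction of cognates retained by one language after 1000 years
    """
    if not 0.0 < c <= 1.0:
        raise ValueError("c must be in (0, 1]")
    if not 0.0 < r < 1.0:
        raise ValueError("r must be in (0, 1)")
    # Both lineages lose cognates independently, hence the factor 2.
    return math.log(c) / (2.0 * math.log(r))


# Two languages sharing 74% of the list are dated to about 1000 years,
# because 0.86 ** 2 is roughly 0.74.
age = glottochronology_age(0.86 ** 2)
```

The estimate depends entirely on r being the same for every language and every period, which is the assumption Bergsland & Vogt (1962) challenged.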

Page 7: Talk at Institut Jean Nicod on 6 October 2010

I you (singular) he we you (plural) they this that here there who what where when how not all many some few other one two three four five big long wide thick heavy small short narrow thin woman man (adult male) man (human being) child wife

husband mother father animal fish bird dog louse snake worm tree forest stick fruit seed leaf root bark flower grass rope skin meat blood bone fat (n.) egg horn tail feather hair head ear eye nose mouth tooth tongue fingernail foot

leg knee hand wing belly guts neck back breast heart liver drink eat bite suck spit vomit blow breathe laugh see hear know think smell fear sleep live die kill fight hunt hit cut split stab scratch dig swim fly (v.)

walk come lie sit stand turn fall give hold squeeze rub wash wipe pull push throw tie sew count say sing play float flow freeze swell sun moon star water rain river lake sea salt stone sand dust

earth cloud fog sky wind snow ice smoke fire ashes burn road mountain red green yellow white black night day year warm cold full new old good bad rotten dirty straight round sharp dull smooth wet dry correct near

far right left at in with and if because name

Page 8: Talk at Institut Jean Nicod on 6 October 2010

Bergsland & Vogt (1962)

Found different rates for different pairs of languages: Old Norse and Icelandic, Georgian and Mingrelian, Armenian and Old Armenian

Discredited Glottochronology

Sankoff (1973): sample selection bias, no estimation of uncertainty

Fair criticism

Bad observation protocol from Swadesh

Applies much less to modern methods

Page 9: Talk at Institut Jean Nicod on 6 October 2010

Genetics 101

Genetic information is stored in DNA

DNA uses 4 letters: A, C, T and G

Page 10: Talk at Institut Jean Nicod on 6 October 2010

DNA transmission

Page 14: Talk at Institut Jean Nicod on 6 October 2010

Phylogenetics

A: TTGCAATCCG

B: TAGCAATCCG

C: CTGCAATACG

D: CTGCAATAGA

Page 15: Talk at Institut Jean Nicod on 6 October 2010

Compare different possible trees
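One simple way to compare trees is to count how many mutations each one requires. This is maximum parsimony, used here only as an illustration; it is not the Bayesian method presented later in the talk. With four sequences there are three possible unrooted topologies. The sketch scores each with Fitch's algorithm on the sequences A to D from the Phylogenetics slide; the function names are hypothetical.

```python
SEQS = {
    "A": "TTGCAATCCG",
    "B": "TAGCAATCCG",
    "C": "CTGCAATACG",
    "D": "CTGCAATAGA",
}

# The three unrooted topologies for four taxa, written as two sister pairs.
TOPOLOGIES = {
    "AB|CD": ("AB", "CD"),
    "AC|BD": ("AC", "BD"),
    "AD|BC": ("AD", "BC"),
}


def fitch_merge(left, right):
    """Fitch step: return (state set of the parent, number of changes added)."""
    common = left & right
    if common:
        return common, 0
    return left | right, 1


def quartet_score(seqs, pair1, pair2):
    """Minimum number of changes needed on the tree (pair1),(pair2)."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        s1, c1 = fitch_merge({seqs[pair1[0]][i]}, {seqs[pair1[1]][i]})
        s2, c2 = fitch_merge({seqs[pair2[0]][i]}, {seqs[pair2[1]][i]})
        _, c3 = fitch_merge(s1, s2)
        total += c1 + c2 + c3
    return total


scores = {name: quartet_score(SEQS, p1, p2) for name, (p1, p2) in TOPOLOGIES.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

On these sequences the tree grouping A with B and C with D needs 5 changes, against 7 for each of the other two, so it is preferred under this criterion.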

Page 17: Talk at Institut Jean Nicod on 6 October 2010

Charles Darwin

« The formation of different languages and of distinct species, and the proofs that both have developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation. »

Page 18: Talk at Institut Jean Nicod on 6 October 2010

Similarities between genes and languages

Characteristic | Genetics | Linguistics
Discrete units | Genes, nucleotides | Lexical, morphosyntactic and/or phonological traits
Transmission | Transcription | Learning, imitation
Horizontal transmission | Viruses, hybridization... | Borrowing, creoles...
Change | Point mutation, indels... | Vowel shift, innovations, word loss

As in genetics, a tree model is relevant for certain types of linguistic data.

Page 19: Talk at Institut Jean Nicod on 6 October 2010

Indo-European languages

Page 20: Talk at Institut Jean Nicod on 6 October 2010

Questions

Topology

Internal ages

Age of the root: 6000-6500 BP or 8000-9500 BP?

(BP=Before Present)

Page 21: Talk at Institut Jean Nicod on 6 October 2010

Core vocabulary

100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...

Borrowing is possible (non-tree-like change), but:

“Easy” to detect

Uncommon

Does not introduce systematic bias

Page 22: Talk at Institut Jean Nicod on 6 October 2010

Data coding

Old English: stierfþ

Old High German: stirbit, touwit

Avestan: miriiete

Old Church Slavonic: umĭretŭ

Latin: moritur

Oscan: ?

Cognacy classes:

1. {stierfþ, stirbit}

2. {touwit}

3. {miriiete, umĭretŭ, moritur}
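The cognacy classes above can be turned into binary traits, one per class: a language gets 1 if it has a word in that class, 0 if not, and a missing value if no word is recorded. The sketch below does this for the slide's example; the "?" marker for missing data and the function name are illustrative choices, not the format of any particular data set.

```python
# Words recorded for the meaning shown on the slide; None means no data ("?").
WORDS = {
    "Old English": ["stierfþ"],
    "Old High German": ["stirbit", "touwit"],
    "Avestan": ["miriiete"],
    "Old Church Slavonic": ["umĭretŭ"],
    "Latin": ["moritur"],
    "Oscan": None,
}

COGNACY_CLASSES = [
    {"stierfþ", "stirbit"},
    {"touwit"},
    {"miriiete", "umĭretŭ", "moritur"},
]


def code_binary(words, classes):
    """Return {language: [one entry per cognacy class]} with entries 1, 0 or '?'."""
    matrix = {}
    for language, forms in words.items():
        if forms is None:
            # Missing data: we cannot say which classes are present or absent.
            matrix[language] = ["?"] * len(classes)
        else:
            matrix[language] = [
                int(any(form in cls for form in forms)) for cls in classes
            ]
    return matrix


matrix = code_binary(WORDS, COGNACY_CLASSES)
for language, row in matrix.items():
    print(f"{language:22s}", row)
```

Old High German has two words for the meaning, so it scores 1 in two classes; a single meaning can therefore contribute several traits.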

Page 23: Talk at Institut Jean Nicod on 6 October 2010

Data coding (2)

Specialist linguists make cognacy judgments

Eliminate known borrowing

Only do this for languages which are known to be related

Page 24: Talk at Institut Jean Nicod on 6 October 2010

Data

Indo-European languages

Core vocabulary (Swadesh 100 or 200)

Two data sets

Dyen et al. (1997): 87 languages, mostly modern

Ringe et al. (2002): 24 languages, mostly ancient

Page 25: Talk at Institut Jean Nicod on 6 October 2010

Constraints

Constraints on parts of the topology

Constraints on some internal ages

We use these constraints to infer rates and other ages

Page 27: Talk at Institut Jean Nicod on 6 October 2010

Using models from biology

First attempts: Gray & Jordan (2000), Gray & Atkinson (2003)

Biological models make assumptions which do not apply to languages

Page 29: Talk at Institut Jean Nicod on 6 October 2010

Gray & Atkinson (2003): tree of 87 Indo-European languages obtained using lexical data and the MrBayes package (Huelsenbeck & Ronquist).

Page 30: Talk at Institut Jean Nicod on 6 October 2010

Selection of criticisms

Multiple births

Missing data

Rate heterogeneity

Page 31: Talk at Institut Jean Nicod on 6 October 2010

Description of the model (1)

Traits are born at rate λ

Trait instances die at rate μ

λ and μ are constants
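A minimal simulation sketch of this birth-death process along a single branch, assuming births form a Poisson process of rate λ and each trait instance has an independent exponential lifetime of rate μ. The numerical values are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_branch(n_start, lam, mu, t, rng):
    """Number of traits present at the end of a branch of length t.

    n_start: traits present at the top of the branch
    lam: birth rate of new traits, mu: death rate of each trait instance
    """
    # A trait present at the start survives the whole branch with prob exp(-mu * t).
    survivors = sum(rng.random() < math.exp(-mu * t) for _ in range(n_start))

    # Births arrive as a Poisson process: exponential gaps between births.
    born_and_alive = 0
    birth_time = rng.expovariate(lam)
    while birth_time < t:
        # A trait born at birth_time must survive the remaining t - birth_time.
        if rng.random() < math.exp(-mu * (t - birth_time)):
            born_and_alive += 1
        birth_time += rng.expovariate(lam)
    return survivors + born_and_alive


rng = random.Random(1)
lam, mu = 2.0, 0.1  # hypothetical rates; the equilibrium mean is lam / mu = 20
counts = [simulate_branch(20, lam, mu, 5.0, rng) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
```

Starting from λ/μ traits, the average count stays close to λ/μ: deaths of old traits are balanced by births of new ones.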

Page 32: Talk at Institut Jean Nicod on 6 October 2010

Description of the model (2)

Catastrophes occur at rate ρ

At a catastrophe, each trait dies with probability κ and Poiss(ν) traits are born.

λ/μ=ν/κ: the number of traits is constant on average.
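The balance condition can be checked with one line of arithmetic: if a language holds n traits on average, a catastrophe leaves (1 − κ)·n of them and adds ν new ones. The sketch below is a worked check rather than part of the model code, and the values are hypothetical.

```python
def expected_traits_after_catastrophe(n_before, kappa, nu):
    """Expected trait count just after a catastrophe.

    Each of the n_before traits dies with probability kappa,
    and Poisson(nu) new traits are born (mean nu).
    """
    return (1.0 - kappa) * n_before + nu


lam, mu, kappa = 2.0, 0.1, 0.25   # hypothetical values
equilibrium = lam / mu            # mean trait count without catastrophes
nu = kappa * lam / mu             # chosen so that lam / mu == nu / kappa
after = expected_traits_after_catastrophe(equilibrium, kappa, nu)
```

With ν = κλ/μ the expected count after the catastrophe equals the count before it; any other ν would make catastrophes shift the average number of traits.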

Page 33: Talk at Institut Jean Nicod on 6 October 2010

Description of the model (3)

Observation model: each data point (0s and 1s) is missing with probability ξ

Some traits are not observed and are therefore deleted from the data
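A sketch of this observation step on a small binary matrix (rows are languages, columns are traits). It assumes that a trait is dropped when no language shows an observed 1 for it; the talk does not spell out the exact rule, so treat this as one plausible reading.

```python
import random


def observe(matrix, xi, rng):
    """Mask each entry with probability xi, then drop unobserved traits.

    matrix: list of rows of 0/1 values (one row per language)
    Returns rows whose entries are 0, 1 or '?' (missing).
    """
    masked = [["?" if rng.random() < xi else value for value in row] for row in matrix]
    n_traits = len(matrix[0]) if matrix else 0
    # Assumption: a trait is kept only if at least one language displays it.
    kept = [j for j in range(n_traits) if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in kept] for row in masked]


full = [
    [1, 0, 0],
    [0, 0, 1],
]
nothing_missing = observe(full, xi=0.0, rng=random.Random(0))
everything_missing = observe(full, xi=1.0, rng=random.Random(0))
```

Because dropped traits are absent from the data rather than recorded as zeros, the likelihood has to condition on a trait having been observed at all.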

Page 34: Talk at Institut Jean Nicod on 6 October 2010

Registration process

Page 38: Talk at Institut Jean Nicod on 6 October 2010

Posterior distribution

Page 39: Talk at Institut Jean Nicod on 6 October 2010

Likelihood calculations

Page 40: Talk at Institut Jean Nicod on 6 October 2010

Statistical inference (MCMC)

Fit the model to the data

Trees that make the data likely

Move across the tree space; try thousands of possible trees

Obtain a sample of plausible trees and dates

Samples weighted by quality of fit to data
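The idea can be shown on a deliberately tiny problem. The actual analysis moves across the space of trees; the sketch below replaces that with a single unknown, the divergence time t of two languages, and a toy likelihood in which each of n meanings is still shared with probability r^(2t). Everything here (model, prior, step size, data) is an assumption made for illustration.

```python
import math
import random


def log_posterior(t, k, n, r, t_max):
    """Log posterior of t (millennia) with a uniform prior on (0, t_max)."""
    if not 0.0 < t < t_max:
        return -math.inf
    p = r ** (2.0 * t)  # probability that a meaning is still shared
    if p >= 1.0:
        return -math.inf
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, r=0.86, t_max=20.0, n_iter=20000, step=0.3, seed=0):
    """Random-walk Metropolis sampler; returns samples after burn-in."""
    rng = random.Random(seed)
    t = 1.0
    current = log_posterior(t, k, n, r, t_max)
    samples = []
    for _ in range(n_iter):
        proposal = t + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, k, n, r, t_max)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            t, current = proposal, candidate
        samples.append(t)
    return samples[n_iter // 5:]  # discard the first 20% as burn-in


samples = metropolis(k=60, n=100)  # hypothetical data: 60 of 100 meanings shared
posterior_mean = sum(samples) / len(samples)
```

The sampler spends more time at values of t that fit the data well, so the retained samples give both a point estimate and its uncertainty.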

Page 41: Talk at Institut Jean Nicod on 6 October 2010

Synthetic data

[Figure: true tree (~40 words/language) and consensus tree]

Page 42: Talk at Institut Jean Nicod on 6 October 2010

Synthetic data (2)

[Figure: death rate (μ)]

Page 43: Talk at Institut Jean Nicod on 6 October 2010

Influence of borrowing

[Figure: true tree (~40 words/language, borrowing: 10%) and consensus tree]

Page 44: Talk at Institut Jean Nicod on 6 October 2010

Influence of borrowing (2)

[Figure: true tree (~40 words/language, borrowing: 50%) and consensus tree]

Page 45: Talk at Institut Jean Nicod on 6 October 2010

Influence of borrowing (3)

Topology is reconstructed correctly

Dates are underestimated for high levels of borrowing

[Figure: root age and death rate (μ), borrowing: 50%]

Page 46: Talk at Institut Jean Nicod on 6 October 2010

Detecting borrowing

Confirmed: hardly any borrowing!

Page 59: Talk at Institut Jean Nicod on 6 October 2010

Cross-validation

Page 60: Talk at Institut Jean Nicod on 6 October 2010

Root age

Page 61: Talk at Institut Jean Nicod on 6 October 2010

Conclusions

Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.

Applicable to a variety of linguistic and cultural data sets

TraitLab: it's free!

Page 62: Talk at Institut Jean Nicod on 6 October 2010

Questions

otázky

spørgsmåler

vragen

questions

Fragen

domande

pytania

questões

întrebări

вопросы

vprašanja

preguntes

preguntas

frågor

vrae

spurningar

quaestiones

ερωτήσεις

въпроси

kesses

spørsmåler

kláusimai

запитанні

سوال

प्रश्न

cwestiwnau

Page 63: Talk at Institut Jean Nicod on 6 October 2010

References

Ryder & Nicholls (2011), Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European, JRSS C

Nicholls (2008), Horses or farmers? The tower of Babel and confidence in trees, Significance

Nicholls & Gray (2008), Dated ancestral trees from binary trait data and their application to the diversification of languages, JRSS B

Gray & Atkinson (2003), Language-tree divergence times support the Anatolian theory of Indo-European origin, Nature

Gray & Jordan (2000), Language trees support the express-train sequence of Austronesian expansion, Nature

Bergsland & Vogt (1962), On the validity of glottochronology, Current Anthropology

Sankoff (1973), Mathematical developments in lexicostatistic theory, Current Trends in Linguistics

Ryder (2010), Phylogenetic models of language diversification, DPhil thesis, University of Oxford