Phylogenetic inference in linguistics and biology: similarities and differences Dmitry Leshchiner MIPT - Moscow Institute of Physics and Technology (State University)

  • Phylogenetic inference in linguistics and biology:

    similarities and differences

    Dmitry Leshchiner

    MIPT - Moscow Institute of Physics and Technology (State University)

  • Romance languages – words of Swadesh list

    • Cognates distinct from majority are colored; borrowings shown in red

    Language             | bark       | belly   | bird    | dog   | feather | hair    | knee
    Romanian             | scoarţă    | burtă   | pasăre  | câine | peánă   | păr     | genunchi
    Logudorese Sardinian | cothichìna | frente  | pudzone | cane  | prumma  | pìlos   | benuccru
    Standard Italian     | corteccia  | ventre  | uccello | cane  | penna   | capello | ginocchio
    Venice Venetian      | scòrsa     | pansa   | oxèɫo   | can   | pena    | cavéi   | zenòcio
    Valencia Catalan     | escorça    | panxa   | pardal  | gos   | ploma   | cabell  | genoll
    Castilian Spanish    | corteza    | tripa   | pájaro  | perro | pluma   | pelo    | rodilla
    Standard Portuguese  | casca      | ventre  | ave     | cão   | pena    | cabelo  | joelho
    Galician             | corteza    | barriga | ave     | can   | pluma   | cabelo  | xeonllo
    Provençal Occitan    | rusco      | vèntre  | aucèu   | chin  | plumo   | péu     | ginoun
    Standard French      | écorce     | ventre  | oiseau  | chien | plume   | cheveux | genou

  • Phylogenetic tree of 48 Romance languages

    • Romanian 150 BC –

    • Sardinian 1 AD –

    • 125 AD –

    • (Italian – Catalan) 300 AD –

    • Catalan

    • Northern Italian 600 AD –

    • Ligurian, Lombard

    • Piemontese, Emiliano

    • Venetian, Dalmatian

    • Central & Southern Italian

    • (Iberian Latin) 700 AD –

    • Spanish

    • Portuguese

    • French – Ladin – Romansh

    • French – Occitan 900 AD – …

    • Ladin – Romansh 300 AD – ...

  • Languages used to build the tree

  • The origin of historical linguistics

    • Rasmus Rask (22 Nov 1787 – 14 Nov 1832) – Danish

    • Franz Bopp (14 Sept 1791 – 23 Oct 1867) – German

    • August Schleicher (19 Feb 1821 – 6 Dec 1868) – German

  • Indo-European languages map

  • Catch more flies with honey than with vinegar

    • Here are some samples of the Proto-IE lexicon

      • About 3000 terms are known

    • There are two IE terms for ‘honey’, and several for fly/insect

    • Fundamental concepts, such as ‘eat’ and ‘heart’, have a single term each

    Language | heart | honey | honey | fly (n.) | fly (n.) | to eat
    Hittite | ker/kard- | milit- | | | | ad-/ed-
    Sanskrit | hr̥d, hr̥dáḥ | | mádhu | mathuṇa 'bug' | | átti
    Armenian | sirt | meɫr, meɫu | | mathil 'louse' | mun, mnoy 'mosquito' | utem
    Greek | καρδίᾱ | μέλι, -τος | μέθυ, -υος 'wine' | | μυῖα (*μυσι̯α) | ἔδω
    Latin | cor, cordis | mel, mellis | medus, -ūs 'honey wine' | | musca | edō
    Gothic | haírto | miliþ | | maþa 'worm' | | itan
    English | heart | mele-, mil-dēaw 'honey-dew, nectar' | mead | moth | | eat
    Russian | сердце | | мёд | мотыль | муха (*mousā) | ем, есть

  • The origin of modern historical linguistics

    • Karl Brugmann (16 Mar 1849 – 29 Jun 1919) – German

    • Berthold Delbrück (26 Jul 1842 – 3 Jan 1922) – German

      • Grundriß der vergleichenden Grammatik der indogermanischen Sprachen

    • Neogrammarians (Junggrammatiker)

    • Ferdinand de Saussure (26 Nov 1857 – 22 Feb 1913) – Swiss

  • Regularity of sound correspondences

    • A diachronic sound change affects all words in which its environment is met and everywhere leads to the same result

    • Knowing the series of sound correspondences allows one to reconstruct the ancestral sounds (and the proto-forms of words)

    Language | heart | honey | honey | fly (n.) | fly (n.) | mouse | to eat
    *Proto-IE | *k'erd- | *melit- | *medhu- | *mat- (-th-) | *mūs- | *mūs- | *ed-
    Hittite | ker/kard- | milit- | | | | | ad-/ed-
    Sanskrit | hr̥d, hr̥dáḥ | | mádhu | mathuṇa 'bug' | | mū́ṣ- | átti
    Armenian | sirt | meɫr, meɫu | | mathil 'louse' | mun, mnoy 'mosquito' | mukn | utem
    Greek | καρδίᾱ | μέλι, -τος | μέθυ, -υος 'wine' | | μυῖα (*μυσι̯α) | μῦς, μυός | ἔδω
    Latin | cor, cordis | mel, mellis | medus, -ūs 'honey wine' | | musca | mūs, mūris | edō
    Gothic | haírto | miliþ | | maþa 'worm' | | | itan
    English | heart | mele-, mil-dēaw 'honey-dew, nectar' | mead | moth | | mouse | eat
    Russian | сердце | | мёд | мотыль | муха (*mousā) | мышь | ем, есть
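
    The regularity principle can be made concrete with a minimal Python sketch. The rule (word-initial k becomes h) and the input forms are toy placeholders chosen for illustration, not reconstructions from the slides, and `apply_sound_change` is a hypothetical helper name.

```python
import re

def apply_sound_change(words, pattern, replacement):
    """Apply one sound change to every word in which its environment is met."""
    rule = re.compile(pattern)
    return [rule.sub(replacement, word) for word in words]

# Toy rule (an assumption for illustration): word-initial k > h.
toy_words = ["kerd", "kap", "melit", "akra"]
changed = apply_sound_change(toy_words, r"^k", "h")
print(changed)
```

    The rule fires in every word that starts with k and leaves the others untouched, including the word with a non-initial k, because the environment is part of the rule.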

  • Stages of the language reconstruction

    • Compare basic vocabulary and establish (some) prospective sound correspondences

    • Establish a preliminary kinship hypothesis and check the divergence!

      • The whole job below has to be done successively, going up level by level, for all the groups!...

    • Look for semantically close lexical items obeying the correspondences

    • Collect sufficient statistics and establish supported correspondences

    • Clean up the comparisons based on supported correspondences

    • Reconstruct the proto-sounds based on the corpus of comparisons

    • Compile the complete etymological dictionary for the proto-language
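
    The "collect sufficient statistics" step can be sketched as a simple tally. The sketch assumes the compared words are already segmented and aligned position by position, which is the hard part in practice. The aligned pairs, the toy languages, and the support threshold are all made up for illustration.

```python
from collections import Counter

def count_correspondences(aligned_pairs):
    """Tally how often each (sound_a, sound_b) pair occurs in aligned word pairs."""
    counts = Counter()
    for word_a, word_b in aligned_pairs:
        if len(word_a) != len(word_b):
            raise ValueError("aligned words must have equal length")
        counts.update(zip(word_a, word_b))
    return counts

def supported(counts, min_support):
    """Keep only correspondences attested at least `min_support` times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical, pre-aligned segment lists for two toy languages A and B.
pairs = [
    (["p", "a", "t"], ["f", "a", "d"]),
    (["p", "e", "d"], ["f", "e", "t"]),
    (["t", "u"], ["d", "u"]),
    (["k", "a", "p"], ["h", "a", "f"]),
]
counts = count_correspondences(pairs)
print(supported(counts, min_support=2))
```

    Correspondences seen only once (here k : h) stay out of the supported set until more comparisons confirm them, which mirrors the "clean up the comparisons" step.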

  • Nostratic dictionary (Moscow, 1971-1984)

    • Vladislav Illich-Svitych (12 Sept 1934 – 22 Aug 1966)

    • Was the first to rigorously apply the Neogrammarian method to the “long-range” comparison of languages

    • Proposed and proved the kinship of six great families of “Nostratic” languages of the Old World: Indo-European, Uralic, Altaic, Dravidian, Kartvelian, and Semito-Hamitic [Afro-Asiatic]

  • World’s language macro-families map

    • 7,097 living languages (per ethnologue.com)

    • Counts by region:

    • Europe – 287

    • Asia – 2296

    • Pacific – 1313

    • Africa – 2139

    • Americas – 1062

    http://www.ethnologue.com/

  • A new breakthrough

    • Sergei Starostin (24 Mar 1953 – 30 Sept 2005)

    • Precise reconstruction of the Old Chinese sound system

    • The reconstruction of North Caucasian

    • Proposed and proved the Sino-Caucasian kinship

    • Proved an Altaic origin of Japanese

    • Developed the lexicostatistical dating technique

    • Proposed the existence of the Borean macro-family

  • Provisional phylogenetic tree of Borean

  • Examples of Borean cognate sets

    • ‘smear, fat’

    **Eurasiatic *ć`ämV 'smear'
    *Indo-European *smē- 'to oil'
    Greek σμάω 'to wipe'
    Latin macula 'stain'
    *Altaic *sĕme (-a) 'fat'
    *Turkic *semiŕ 'fat'
    Turkish semiz 'fat'
    Literary Manchu semsu 'inner fat'
    Korean sam 'placenta'
    *Proto-Uralic *čamče 'skin layer'
    Saam (Lapp) cuoʒ'ʒå 'membrane'
    *Proto-Kartvelian *cem-
    Georgian cm-; cm-el- 'fat'
    **Afro-Asiatic *sim-an- ~ *sin-am-
    *Semitic *šam(-an)- 'fat, oil'
    Hebrew šämän 'fat, oil'
    Egyptian smy 'fat milk, cream'

    **Sino-Caucasian *cwä́̆jŋĕ 'liver; gall'
    *North Caucasian *c_wä̆jmĕ 'gall; anger'
    Chechen stim 'gall'
    Adyghe gʷǝħa-gʷǝ-ź 'anger'
    *Sino-Tibetan *sĭn (? *sĭŋ) 'liver'
    *Chinese 辛 *sin 'bitter, pungent'
    Modern Beijing Chinese xīn
    Tibetan mćhin 'liver'
    Burmese sańh 'liver'
    *Yenisseian *seŋ 'liver'
    Burushaski *-sán 'spleen'
    Basque *[beHa]-sum(a) 'bile, gall'
    **Austric *ʔcVʔń 'salty, bitter'
    *Austroasiatic cVŋ 'bitter'
    *Austronesian *qasiN 'saltiness'

  • Examples of Borean cognate sets

    • ‘to know’

    **Eurasiatic *g(w)enV 'to know'
    *Indo-European *g'en[o]-, *g'nō- 'to know'
    Hittite kanes- 'to know'
    Sanskrit jānā́ti 'to know, apprehend'
    Armenian canauth 'to know'
    Greek γιγνώσκω 'I know'
    Latin nōsco (gnōsco) 'to know'
    *Germanic *kunnan 'know, be able'
    Gothic kunnan 'know'
    English know
    Russian знать 'to know'
    *Kartvelian *gen-/gn- 'to understand'
    **Afroasiatic *ki(ha)n- 'know, learn'
    *Semitic *kVhVn- 'act as a priest'
    Phoenician khn 'priest'
    Hebrew kōhēn 'priest'
    Egyptian čny 'learn'

    **Sino-Caucasian *=alg[w]Ăn (?) 'to know, ask'
    *North Caucasian *=alg[w]Ăn 'to speak'
    Avar gal- 'to speak, talk'
    Lak =uk:i- 'to count; to read'
    Abkhaz a-ga-rá 'to be heard'
    Hurrian kul- 'to say, to pronounce solemnly'
    *Sino-Tibetan *khān (~ *gh-) 'see, look, know'
    *Chinese 看 *khān(s) 'look, see, regard'
    Modern Beijing Chinese kàn
    Tibetan mkhan 'teacher, professor'
    **Austric *KVN 'think'
    *Austroasiatic *KVń 'think'
    *Bantu *-gàn- 'think'
    *Macro-Khoisan *kVʔV[ŋ] 'to know'
    *Proto-Bushman *!hã
    *Proto-Khoe *!ʔã́
    Sandawe khéʔé
    Hadza kaʔasa

  • …Imagine the depth of time!...

    Imagine the depth of time when these languages separated! ... Polish and Russian separated so long ago! Now think how long ago [this happened to] Kurlandic! Think when [this happened to] Latin, Greek, German, and Russian! Oh, great antiquity!

    • Mikhail Lomonosov (drafts for Russian Grammar, published 1755)

    …well, it’s even earlier than he perhaps thought:

    Polish and Russian – 1500 yrs. ago…

    Kurlandic [Latvian] and Russian – 3500 yrs. ago…

    Latin, Greek, German – and Russian: 5000 yrs. ago…

  • Methods of language family dating

    • Swadesh 100-word list (100 meanings of “universal” core vocabulary)

    • Measuring core vocabulary preservation

    • Proposed by Morris Swadesh (1909 – 1967) in the 1950s

    • He intended to use it for constant-rate decay dating (“glottochronology”…)

    • …but it does not work like this!

    • One of the more recent manifestations of that:

      • R.D. Gray & Q.D. Atkinson, Nature 426, 435-439 (2003)

      • They got unreasonably deep IE divergence times by applying ‘biological’ clock laws

      • But they still defend it, and many people believe them

  • Starostin’s empirical version of law of decay

    • Swadesh’s formula (coincidence rate is square of retention rate…):

    • Starostin’s formula (two corrections – ‘aging of words’; stable “core”):

    Starostin, Sergei. Comparative-Historical Linguistics and Lexicostatistics. In: Time Depth in Historical Linguistics, v. 1. Cambridge: The McDonald Institute for Archaeological Research, 2000, pp. 223-265.
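
    The two formulas on this slide were images and did not survive in the transcript. The Python sketch below uses the forms in which they are usually quoted, so treat it as an assumption rather than a copy of the slide: Swadesh's c = r^(2t), and Starostin's t = sqrt(ln c / (-2·λ·sqrt(c))) for a pair of languages. The constants (r = 0.86 per millennium for the 100-item list, λ = 0.05) are likewise the commonly cited values, not values taken from the slide.

```python
import math

SWADESH_RETENTION = 0.86  # assumed retention per millennium, 100-item list
STAROSTIN_LAMBDA = 0.05   # assumed replacement constant per millennium

def swadesh_time(c, r=SWADESH_RETENTION):
    """Split time in millennia from c = r**(2*t), i.e. t = ln(c) / (2*ln(r))."""
    return math.log(c) / (2.0 * math.log(r))

def starostin_time(c, lam=STAROSTIN_LAMBDA):
    """Split time in millennia, t = sqrt(ln(c) / (-2*lam*sqrt(c))), as usually quoted."""
    return math.sqrt(math.log(c) / (-2.0 * lam * math.sqrt(c)))

# c is the coincidence rate: the share of list items that are cognate in both languages.
for c in (0.9, 0.5, 0.1):
    print(f"c={c:.1f}  Swadesh: {swadesh_time(c):5.2f}  Starostin: {starostin_time(c):5.2f}")
```

    In the Swadesh form the exponent 2t is where "coincidence rate is the square of retention rate" comes from: each of the two lineages is assumed to retain items independently at rate r^t.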

  • 20 years later, now making sense of it?

    • (…And making it time-invariant, too)

    • Coincidence rate is not the square of the retention rate: retentions are correlated!

    • A model of “coordinated decay” – to take into account the effect of ‘synchronization’ of lexical replacement

    M.E. Vasilyev, A.Yu. Militarev. Glottochronology in Comparative-Historical Linguistics and the Models of Linguistic Divergence. In: Orientalia et Classica, Vol. XIX. Moscow, 2008, pp. 509-536 (Ch. 2, pp. 523 ff.)

    M. Gell-Mann, D. Leshchiner. (2008, unpublished)
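
    Why correlated retentions break the "square of retention rate" rule can be shown with a small calculation. This is not the coordinated-decay model from the references above. It only assumes one simple source of correlation: each meaning has its own retention probability, the same in both lineages. The probabilities are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def swadesh_coincidence(retention_probs):
    """Square of the average retention rate (what the Swadesh formula presumes)."""
    return mean(retention_probs) ** 2

def coincidence_with_shared_stability(retention_probs):
    """Expected share of items kept by both languages when each item has
    its own retention probability, common to both lineages."""
    return mean([p * p for p in retention_probs])

# Hypothetical per-item retention probabilities: two stable items, two unstable.
probs = [0.95, 0.90, 0.60, 0.50]
print(swadesh_coincidence(probs))
print(coincidence_with_shared_stability(probs))
```

    The second number is never smaller than the first (the mean of squares is at least the square of the mean), so reading an observed coincidence rate through the squared-retention formula underestimates how much each language has actually replaced.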

  • Comparing two formulas

    • Starostin’s formula gives much higher breakup ages when split is deep

    [Chart: Time divergence as function of coincidence rate. Horizontal axis from 0 to 1.2, vertical axis from 0 to 16. Two curves: Starostin's formula and Coordinated decay.]

  • Still much to do

    • (but nobody does…)

    • Calibrate and validate the formulae on empirical data

    • Make adjustments to the model, if needed

    • Work out a clustering procedure consistent with the model

    • Work out a direct dating technique, with a hierarchical procedure for it

    • Work out a procedure to deal with data of ancient fixation

    • Learn to deal with decay speed variance and with regional effects

    • Learn to properly deal with data sparsity at large divergence times

  • Things to envy (and not to envy) with biology

    • Incredible amount of data and of effort invested

    • Rigorous use of data analysis techniques

      • (however, often not making full use of the data)

    • In particular, Markov chain Monte Carlo modeling would be desirable to have

    • They don’t deal with correlated-retention issues (though these may exist)!

    • They don’t normally deal with ancient genetic data (though this may change)

    • They normally do not use a directional assumption in modeling (we do)

      • Therefore, our use of ‘rooted trees’ is a constant source of contention

      • We use UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
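
    For readers unfamiliar with UPGMA, here is a minimal pure-Python sketch of the procedure. The four taxa and their distances are hypothetical, and `upgma` is an illustrative helper, not the software used for the trees in this talk. The distance between two clusters is the plain arithmetic mean over all leaf pairs, and each merge is placed at half that distance, which is what yields a rooted tree.

```python
from itertools import combinations

def upgma(labels, dist):
    """UPGMA clustering.

    labels: list of leaf names.
    dist:   dict {(a, b): distance}, one entry per unordered pair of leaves.
    Returns (tree, merges): a nested-tuple rooted tree and a list of
    (subtree_a, subtree_b, height) in merge order.
    """
    def leaf_distance(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    # Each cluster is (subtree, list_of_leaves).
    clusters = [(name, [name]) for name in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for (i, (_, la)), (j, (_, lb)) in combinations(enumerate(clusters), 2):
            # Unweighted average over all leaf pairs between the two clusters.
            avg = sum(leaf_distance(x, y) for x in la for y in lb) / (len(la) * len(lb))
            if best is None or avg < best[0]:
                best = (avg, i, j)
        avg, i, j = best
        (ta, la), (tb, lb) = clusters[i], clusters[j]
        merges.append((ta, tb, avg / 2))  # node height is half the distance
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((ta, tb), la + lb))
    return clusters[0][0], merges

# Hypothetical distances between four taxa.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree, merges = upgma(["A", "B", "C", "D"], distances)
print(tree)
print([height for _, _, height in merges])
```

    UPGMA implicitly assumes that all lineages change at the same rate, which is why it produces a root and node heights at all, and also why it is sensitive to the decay-speed variance mentioned on the "Still much to do" slide.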

  • Phylogenetic inference principles in linguistics

    • Tree model (no hybridization, creolization, regional influence)

      • Those things exist, but they operate on top of the primary genetic relationship

    • “Coding characters” are word roots filling predefined usage slots

    • No issues of homology definition (gene alignment) here!

    • Maximum parsimony is used in etymology but not in lexicostatistics
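
    The idea of "word roots filling predefined usage slots" can be sketched as follows. Each language gets one cognate-class code per meaning, and the coincidence rate of two languages is the share of slots with matching codes. The letter codes are this sketch's own annotation of a few forms from the Romance table (an assumption, not the author's coding), and three slots are far too few for real use.

```python
from itertools import combinations

# Cognate-class codes per meaning slot; the letters are arbitrary labels.
coded = {
    "Romanian":   {"dog": "A", "hair": "P", "knee": "G"},  # câine, păr, genunchi
    "Portuguese": {"dog": "A", "hair": "C", "knee": "G"},  # cão, cabelo, joelho
    "Catalan":    {"dog": "B", "hair": "C", "knee": "G"},  # gos, cabell, genoll
    "Spanish":    {"dog": "D", "hair": "P", "knee": "R"},  # perro, pelo, rodilla
}

def coincidence_rate(a, b):
    """Share of meaning slots filled in both lists whose roots are cognate."""
    slots = [slot for slot in a if slot in b]
    return sum(a[slot] == b[slot] for slot in slots) / len(slots)

for x, y in combinations(coded, 2):
    print(f"{x}-{y}: {coincidence_rate(coded[x], coded[y]):.2f}")
```

    Because the slots are fixed in advance, comparing two languages never requires an alignment step, which is the point of the "no issues of homology definition" bullet.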

  • Biological inference procedures

    • Population divergence

      • In principle, it’s possible to cluster populations and assign most given individuals to one (it’s quite robust given the amount of data that exists)

      • In practice, start with a given assignment to populations (based on few samples and, mostly, geographic distribution) and work with frequency averages

      • Make dating conclusions based on a constant genetic drift rate

    • Species divergence (molecular phylogenetics)

      • Use a small sample of available characters (for convenience, and for a given calibration)

      • Sequencing of about 1000 base pairs is common

      • Use the constant drift rate assumption

      • Use maximum parsimony for cladistics
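
    As an illustration of what parsimony scoring means, the sketch below counts the minimum number of state changes a rooted binary tree needs for one character, using Fitch's algorithm. The slide only names maximum parsimony in general. Fitch's algorithm, the four taxa, and the single-site data are choices made for this example.

```python
def fitch(tree, states):
    """Return (state_set, changes) for a rooted binary tree of nested tuples.

    Leaves are names looked up in `states`; internal nodes are (left, right).
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical one-site data for four taxa.
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, site)[1], fitch(tree_2, site)[1])
```

    A parsimony search would repeat this count over all characters and prefer the tree with the smallest total, here the one that groups A with B.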

  • Conclusions

    • Geneticists resolve issues that linguists have not even started to address

      • We need to learn from them

    • On the other hand, linguists have (and will have) to resolve issues that geneticists are not even aware of

      • So they don’t need to learn from us

    • And that last circumstance is rather unfortunate, in fact

      • It creates no incentive to work jointly or even to listen attentively

      • And so it seriously hampers mutual understanding and impedes progress

    • But I expect linguistics to survive eventually (perhaps, in the long run)
