Post on 31-Jul-2020
BIOINFORMATICS TRICKS FOR THE ANALYSIS OF
ANCIENT DNA
TATIANA TATARINOVA
WHAT HAS BEEN SEQUENCED?
KHAZARIA: WHERE AND WHEN?
• HOW THE WISE OLEG NOW MAKES READY
• TO TAKE REVENGE ON THE FOOLISH KHAZARS*:
• THEIR VILLAGES AND FIELDS, FOR A VIOLENT RAID,
• HE HAS DOOMED TO SWORDS AND FIRES.
*The Khazars were a nomadic people who once lived in the south of Russia.
KHAZARIAN PUZZLE
The Khazars were first mentioned by several Arabic historians in the VIII century AD and last in the XIII century, as one of the peoples conquered by Batu Khan
CLAIMS OF ASHKENAZI CONNECTION?
ARTHUR KOESTLER, THE THIRTEENTH TRIBE: THE KHAZAR EMPIRE AND ITS HERITAGE. HUTCHINSON OF
LONDON, LONDON, 1976
Lev Gumilev, Discovery of Khazaria
No written sources from Khazaria other than three manuscripts in ancient Hebrew
One of the rulers was called Joseph
Jewish artefacts
Legends about Jewish practices
THE MISSING LINK OF JEWISH EUROPEAN ANCESTRY: CONTRASTING THE RHINELAND AND THE KHAZARIAN
HYPOTHESES, BY ERAN ELHAIK, GENOME BIOLOGY AND EVOLUTION, 2013
SO, WHO IS MR. KHAZAR?
aDNA may provide answers to this historic riddle
Ingredient 1: high quality input
Stepped grave vs Niche grave
ANCIENT DNA (ADNA) IS THUS EXPECTED TO REVOLUTIONIZE EVOLUTIONARY
GENETICS IN THE SAME MANNER THAT SYSTEMATIC APPROACH TO ANALYSIS OF
FOSSIL RECORDS REVOLUTIONIZED PALEONTOLOGY: IT IS A DIRECT WINDOW INTO
THE PAST ‒ A “TIME CAPSULE”.
RECENTLY DNA SAMPLES WERE OBTAINED FROM NEANDERTHAL, DENISOVA,
MAMMOTH, PALEO-HORSE, ANCIENT SEEDS ETC.
Many of these questions are addressed in this paper:
Toward high-resolution population genomics using archaeological samples
Irina Morozova, Pavel Flegontov, Alexander Mikheyev, Hosseinali Asgharian, Petr Ponomarenko, Vladimir Klyuchnikov, GaneshPrasad ArunKumar, Sergey Bruskin, Egor Prokhortchouk, Yuriy Gankin, Evgeny Rogaev, Yuri Nikolsky, Ancha Baranova, Eran Elhaik, Tatiana V. Tatarinova, DNA Research 2016
Ingredient 2: high-quality
sequencing
GENOTYPING
A large but limited number of high-quality calls. 1 million SNPs can be genotyped for $100. Error rate (wrong calls) <1% (reported by 23andMe).
SEQUENCING
Potentially, every position in a genome is studied. However, quality is variable and lower than for the SNP chip (0.1% error is achieved for 75% of bases). Some areas require read depths of 100 or more.
Der Sarkissian et al. 2015
http://mammoth.psu.edu/hair.html
QUALITY ASSESSMENT
• NUMBER OF SNPS 124,780,238
• QUALITY (Q) (3.01, 226.77)
• MEAN Q 14.14, MEDIAN Q 7.80
• AVERAGE DEPTH OF COVERAGE 5
• COVERED 1-2% OF GENOME
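To put the quality figures above in perspective, here is a minimal sketch that converts a quality score into an error probability. It assumes Q is Phred-scaled (Q = -10·log10 p), as in the QUAL field of a VCF file; the slides do not state the scale, so this is an assumption.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a call is wrong, assuming q is Phred-scaled."""
    return 10 ** (-q / 10)


# Q values copied from the quality summary above.
for label, q in [("min Q", 3.01), ("median Q", 7.80),
                 ("mean Q", 14.14), ("max Q", 226.77)]:
    p = phred_to_error_probability(q)
    print(f"{label:9s} Q={q:7.2f}  error probability={p:.3g}")
```

Under this assumption, the median Q of 7.80 corresponds to roughly a 17% chance that a call is wrong, and the minimum Q of 3.01 to about 50%.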
Consider 300 Bronze Age genomes published in 2014-2015
• Allentoft et al. 2015 (RISE*)
• Haak, Lazaridis et al. Nature 2015 (I0*)
• Gamba et al. Nature Communications 2014 (I1*)
• Mathieson et al. 2015 (I*)
SYSTEMATIC ERRORS
Samples I and RISE: both sequenced on HiSeq
RISE: whole genome; I: targeted
SO, WE DEAL WITH POOR QUALITY
• LOW COVERAGE
• POOR QUALITY
• INSUFFICIENT NUMBER OF SNPS PER INDIVIDUAL
• DIFFERENT GROUPS GET DIFFERENT RESULTS FROM THE SAME SAMPLES
• INDIVIDUAL SNPS CANNOT BE TRUSTED!
APPROACH: AGGREGATION
• ADMIXTURE
• GPS
• PATHWAYS
• USING PROBABILITY TO MODEL
• LOCALLY AGGREGATED ANCESTRY RLAI
To infer population structure from genotype data, it is necessary to first reduce the
dimensionality of the dataset due to the thousands of SNPs it encompasses.
From SNPs to Admixture
Thousands of SNPs
Sample     NorthEastAsian  Mediterranean  SouthAfrican  SouthWestAsian  NativeAmerican  Oceanian  SouthEastAsian  NorthernEuropean  SubsaharanAfrican
HGDP00985  0.5253          0.0202         0             0.2222          0.0404          0.0101    0.0101          0.1717            0
HGDP01094  0.04            0.04           0             0.03            0.83            0         0.01            0.05              0
HGDP00982  0.0102          0.1531         0.0306        0.0714          0.0408          0         0.0102          0.2041            0.4796
ADMIXTURE
Admixture proportions in geographically adjacent populations, such as Italian and Greeks, and populations sharing similar history, like British and Germans, are similar.
GPS ORIGIN PREDICTION
ΔGEO = α × ΔGEN + β
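The linear relation between geographic and genetic distance can be illustrated with an ordinary least-squares fit. This is only a sketch: the distance pairs below are invented for illustration, and the helper names `fit_line` and `predict_geo` are hypothetical, not part of the published GPS code.

```python
def fit_line(gen, geo):
    """Least-squares estimates of alpha and beta in geo = alpha * gen + beta."""
    n = len(gen)
    mean_gen = sum(gen) / n
    mean_geo = sum(geo) / n
    sxx = sum((x - mean_gen) ** 2 for x in gen)
    sxy = sum((x - mean_gen) * (y - mean_geo) for x, y in zip(gen, geo))
    alpha = sxy / sxx
    beta = mean_geo - alpha * mean_gen
    return alpha, beta


def predict_geo(delta_gen, alpha, beta):
    """Convert a genetic distance into an expected geographic distance."""
    return alpha * delta_gen + beta


# Invented toy data: genetic distances and geographic distances (km)
# between pairs of reference populations.
delta_gen = [0.05, 0.10, 0.20, 0.30]
delta_geo = [300.0, 550.0, 1050.0, 1550.0]

alpha, beta = fit_line(delta_gen, delta_geo)
print(f"alpha={alpha:.1f}, beta={beta:.1f}")
print(f"predicted distance for dGEN=0.15: {predict_geo(0.15, alpha, beta):.0f} km")
```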
APPLICATION OF GPS TO ADNA (BRONZE AGE)
30 OUT OF 100 BRONZE AGE SAMPLES (ALLENTOFT ET AL. 2015) HAD OVER 500 ANCESTRY INFORMATIVE MARKERS.
WE APPLIED THE GPS ALGORITHM TO FIND THE CLOSEST MODERN POPULATION.
GPS accurately assigned:
• ~100% of all individuals to their continental regions
• 80% of all individuals to their country of origin
• 60% of all individuals to their inner-country region
PCA
[Figure: stacked bars of admixture proportions (axis 0% to 100%) for ancient samples I1280, RISE568, I0550, I0246, RISE546, I0060, I0115, I0235, RISE240, RISE154, I0805, RISE562, I1530, I1281, I1303 and modern populations British, Tatars, Poland, Chuvashs, Bulgarians, Ashkenazi_Poland, Northern Caucasian, Nogais, Sephardic Jews. Components: NorthEastAsian, Mediterranean, SouthAfrican, SouthWestAsian, NativeAmerican, Oceanian, SouthEastAsian, NorthernEuropean, SubsaharanAfrican]
SNPS PER PATHWAYS
Changes in biological pathways during 6,000 years of civilization in Europe, Chekalin et al., 2018, Molecular Biology and Evolution
• READMIX WAS DEVELOPED TO TREAT INDIVIDUALS OF MIXED ORIGIN; IT REPRESENTS AN INDIVIDUAL AS A LINEAR COMBINATION OF ADMIXTURE VECTORS OF REFERENCE POPULATIONS
• EXAMPLE: 30% BRITISH + 10% RUSSIAN + 60% CHINESE
• $P = a_1 r_{s_1} + a_2 r_{s_2} + \dots + a_p r_{s_p} + \text{error}$
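As a small worked example of the linear combination above, the sketch below mixes reference admixture vectors with the 30/10/60 weights and measures how far a target vector is from that mix. The three-component reference vectors are invented for illustration (the talk uses nine components), and `mix` and `residual` are hypothetical helper names, not the reAdmix API.

```python
def mix(weights, references):
    """Weighted sum of reference admixture vectors."""
    k = len(next(iter(references.values())))
    return [sum(w * references[pop][j] for pop, w in weights.items())
            for j in range(k)]


def residual(target, weights, references):
    """Sum of absolute differences between the target and the mixed vector."""
    approx = mix(weights, references)
    return sum(abs(t - a) for t, a in zip(target, approx))


# Invented 3-component admixture vectors for three reference populations.
references = {
    "British": [0.10, 0.80, 0.10],
    "Russian": [0.20, 0.70, 0.10],
    "Chinese": [0.90, 0.05, 0.05],
}
weights = {"British": 0.3, "Russian": 0.1, "Chinese": 0.6}

individual = mix(weights, references)
print([round(x, 3) for x in individual])
print(round(residual([0.60, 0.33, 0.07], weights, references), 3))
```

A search procedure such as the one described on the following slides would vary the set of populations and the weights to make this residual small.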
More complex cases?
reAdmix
HOW IT WORKS
• WE ASSUME RIGHT AWAY THAT THE GIVEN ANCIENT PROPORTIONS CONTAIN ERROR
• START WITH A GUESS POPULATION
• ADD/REMOVE POPULATIONS TO ACHIEVE OPTIMAL FIT
• CONDITIONAL OPTIMIZATION (SUCH AS “I KNOW THAT THERE WAS A JEWISH ANCESTOR SOMEWHERE IN MY PEDIGREE”)
READMIX APPROACH
Aim: to find the smallest subset of modern populations whose combined admixture components are similar to those of the individual within a small tolerance margin.
The algorithm consists of three phases:
1. Iteratively build the first candidate solution and improve it.
2. Generate the predefined number M of additional candidate solutions randomly and apply Differential Evolution (DEEP).
3. Identify the populations that have stable membership in the solution across the set, that is, are part of the solution in at least 75% of cases.
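Phase 3 can be sketched as a simple count over candidate solutions. The candidate sets below are invented for illustration, and `stable_populations` is a hypothetical helper name; only the 75% threshold comes from the slide.

```python
from collections import Counter


def stable_populations(solutions, threshold=0.75):
    """Populations that appear in at least `threshold` of the candidate solutions."""
    counts = Counter(pop for solution in solutions for pop in set(solution))
    n = len(solutions)
    return sorted(pop for pop, c in counts.items() if c / n >= threshold)


# Invented candidate solutions, e.g. from repeated optimization runs.
candidates = [
    {"Yakut", "Saami", "Abkhaz"},
    {"Yakut", "Saami"},
    {"Yakut", "Saami", "Kets"},
    {"Yakut", "Abkhaz"},
]
print(stable_populations(candidates))
```

Here Yakut appears in 4 of 4 candidates and Saami in 3 of 4, so both pass the threshold, while Abkhaz (2 of 4) and Kets (1 of 4) do not.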
Let $R=\{r_i\}_{i=1..I}$ be the set of modern populations, where $r_i=(r_{i,1},\dots,r_{i,K})$ and $K$ is the dimension ($K=9$).
We seek two sets $S=(s_1,\dots,s_p)$ and $A=(a_1,\dots,a_p)$, where $s_i$ are the indices of modern populations and $a_i$ are the coefficients of modern populations in the approximation
$$T \approx \sum_{i=1}^{p} a_i\, r_{s_i}$$
of test vector $T$.
SOHN ET AL. (2012) BENCHMARK
• 2 COMPONENTS
• 4 COMPONENTS
4-dim space: European, African, Native American and East Asian
Color coding: red = European, green = African, yellow = Native American, blue = East Asian, white = unassigned
reAdmix
RLAI (ROBUST INFERENCE OF LOCAL ANCESTRY)
RLAI METHOD
In every window find the most similar position
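The slide gives only the one-line idea, so the following is a generic sketch of window-based ancestry assignment, not the RLAI implementation. It assumes haplotypes encoded as strings of 0/1 alleles, one representative haplotype per reference population, fixed-size windows, and a plain mismatch count as the similarity measure; all of these are assumptions.

```python
def mismatches(a, b):
    """Number of positions at which two equally long allele strings differ."""
    return sum(x != y for x, y in zip(a, b))


def window_ancestry(sample, references, window):
    """Label each fixed-size window with the most similar reference population.

    Ties go to the population listed first in `references`.
    """
    labels = []
    for start in range(0, len(sample), window):
        end = start + window
        best = min(
            references,
            key=lambda pop: mismatches(sample[start:end], references[pop][start:end]),
        )
        labels.append(best)
    return labels


# Invented toy haplotypes: the sample switches ancestry halfway through.
references = {"A": "00000000", "B": "11111111"}
sample = "00010111"
print(window_ancestry(sample, references, window=4))
```

Aggregating over a window means a single wrong allele (as in each window of the toy sample) does not change the label, which is the point of working with windows rather than individual SNPs.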
COMPARISON WITH OTHERS
• LAMP: PROBABILITY OF A SEGMENT TO BELONG TO A SPECIFIC POPULATION
• LAMP-ANC: MODIFICATION OF LAMP, SKIPPING ESTIMATION OF ANCESTRAL ALLELES, THEREFORE MORE RELIABLE
• RFMIX: TREATS ORIGIN AS A HIDDEN PARAMETER
COMPARISON
• RFMIX HAS THE HIGHEST ACCURACY FOR THE MIXES EUROPE-JAPAN AND
EUROPE-AFRICA. TRIPLE MIXES SHOW A DROP IN QUALITY
RLAI
• RLAI SHOWS ACCURACY ABOVE 0.9 FOR ALL MIXES INCLUDING TRIPLE
RLAI ACCURACY AS A FUNCTION OF GENERATIONS
ZOOMING IN AND OUT
Unique ID: I0047
Part: Tooth
Archaeological culture: Central_LNBA
Reference: Haak, Lazaridis et al. Nature 2015
Date (2-sigma): 2111-1891 cal BCE
Min Age: 4037
Max Age: 3952
Location: Halberstadt-Sonntagsfeld
Country: Germany
Lat: 51.89
Lon: 11.04
Coverage: 1.655
SNPs: 836,247
Sex: F
#reads: 17,431,013
mtDNA haplogroup: V9
% endogenous: 0.449
CORDED WARE ANALYSIS
Das et al. 2016; Behar et al. 2013
Sample  Bone                          Gender  Age    Century   Location                  Race
67      Left humerus                  M       35-40  IX        Martynovsky district      mongoloid
166     Left femur                    F       25-30  VIII-IX   Martynovsky district      mongoloid
531     Right tibia and left ulna     M       35-40  VIII-IX   Dubovsky district         europoid
619     Left femur                    M       35-40  VII-VIII  Dubovsky district         mongoloid
656     Right tibia                   M       30-35  VII-VIII  Dubovsky district         europoid (?)
1251    Left humerus                  M       40     IX        Zimovnikovsky district    undefined
1564    Left tibia                    M       25-35  VIII-IX   Belokalitvinsky district  europoid (?)
1566    Right humerus                 M       35-40  VIII-IX   Belokalitvinsky district  undefined
1986    Right humerus and left tibia  M       35-45  VIII-X    Orlovsky district         europoid (?)
NINE 8TH-9TH CENTURY GENOMES OF KHAZARS
• DNA extraction conducted in two labs independently
• Sequencing performed by Dr. Mikheyev (OIST)
• Test all samples on MiSeq and the best samples on HiSeq
• 0.32-0.48 of human genome covered
• Average depth of coverage ~0.75X
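For orientation, the relation between mean depth and the fraction of the genome covered can be sketched with the idealized Poisson (Lander-Waterman) model, in which reads land uniformly at random. That uniformity is an assumption that ancient DNA rarely satisfies, and the slides do not say how the depth figure was computed, so this is a rough reference point only.

```python
import math


def expected_fraction_covered(mean_depth):
    """P(depth >= 1) when per-base depth is Poisson with the given mean."""
    return 1.0 - math.exp(-mean_depth)


def depth_for_fraction(fraction):
    """Mean depth at which the Poisson model predicts the given covered fraction."""
    return -math.log(1.0 - fraction)


print(f"0.75X -> {expected_fraction_covered(0.75):.3f} of genome covered")
for f in (0.32, 0.48):
    print(f"{f:.2f} covered -> about {depth_for_fraction(f):.2f}X under the model")
```

Under this model a mean depth of 0.75X would cover about 53% of the genome, somewhat more than the 0.32-0.48 reported above, which is what one would expect if reads are not spread uniformly.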
BIOINFORMATICS PROCEDURES
• USING MULTIPLE PIPELINES IN PARALLEL
• PALEOMIX (SCHUBERT ET AL., CHARACTERIZATION OF ANCIENT AND MODERN GENOMES BY SNP DETECTION AND PHYLOGENOMIC AND METAGENOMIC ANALYSIS USING PALEOMIX. NAT PROTOC. 2014)
• MAPDAMAGE, SCHMUTZI, ANGSD, FASTQC, CUTADAPT, FOLLOWED BY GATK
• PILEUPCALLER (HTTP://STEPHANSCHIFFELS.DE/SOFTWARE/)
MTDNA
Sample 67 166 531 619 656 1251 1564 1566 1986
Coverage 30.69 62.11 5.43 7.51 30.86 71.07 86.44 31.29 38.52
Haplogroup D4e5 C4 X2e2 H1a3 C4a1 H5b H13c1 D4b1a1a C4a1c
Using BAM Analysis Kit
YDNA
• 619 - Q
• 1986 - R1A
• 1251 - R1A
• 656 - C3
NGSADMIX ANALYSIS
ANCESTRY INFORMATIVE MARKERS ADMIXTURE
Sample  Q20/DP2  Q30/DP2  Q20/DP3  Q30/DP3
1251    7057     6715     1347     1140
1566    7404     7115     1380     1158
1564    3448     3274     512      439
166     6538     6289     1113     927
1986    10049    9572     2273     1886
656     877      858      57       47
531     389      385      8        7
67      1166     1152     79       75
619     1041     1036     37       35
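The four columns above appear to count markers that pass a quality and a depth threshold. The sketch below shows that kind of filter, assuming "Q20 DP2" means quality of at least 20 and read depth of at least 2; the call records are invented and `count_passing` is a hypothetical helper.

```python
def count_passing(calls, min_q, min_dp):
    """Number of marker calls with quality >= min_q and depth >= min_dp."""
    return sum(1 for c in calls if c["q"] >= min_q and c["dp"] >= min_dp)


# Invented calls at ancestry informative markers: quality and read depth.
calls = [
    {"q": 35, "dp": 3},
    {"q": 25, "dp": 2},
    {"q": 31, "dp": 2},
    {"q": 15, "dp": 5},
    {"q": 22, "dp": 3},
]

for min_q, min_dp in [(20, 2), (30, 2), (20, 3), (30, 3)]:
    n = count_passing(calls, min_q, min_dp)
    print(f"Q{min_q} DP{min_dp}: {n} markers")
```

As in the table, raising the depth threshold from 2 to 3 removes far more markers than raising the quality threshold from 20 to 30 when coverage is low.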
GPS ANALYSIS, MODERN AND ANCIENT SAMPLES AS REFERENCE
GPS algorithm: Elhaik, Tatarinova et al (2014)
SAMPLE  NEAREST MODERN  DISTANCE  NEAREST ANCIENT      DISTANCE
1251    Tajik           0.18      Steppe MLBA          0.06
1564    Lebanese        0.16      Levant BA            0.19
1566    Yakut           0.04      Pazyryk IA (Altai)   0.27
166     Evenk           0.09      Pazyryk IA (Altai)   0.52
1986    Shor            0.10      Pazyryk IA (Altai)   0.14
531     Ishkasim        0.19      Early Sarmatian IA   0.17
619     Turkmen         0.26      Pazyryk IA (Altai)   0.40
656     Kazakh          0.29      Pazyryk IA (Altai)   0.41
67      Khanty          0.12      Pazyryk IA (Altai)   0.16
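Finding the nearest reference population amounts to a minimum-distance search in admixture space. The sketch below uses Euclidean distance on invented three-component vectors; the distance measure, the vectors and the population names are assumptions for illustration, and the published GPS algorithm (Elhaik, Tatarinova et al. 2014) does more than this single step.

```python
import math


def nearest_population(sample, references):
    """Return the reference population closest to the sample and its distance."""
    best = min(references, key=lambda pop: math.dist(sample, references[pop]))
    return best, math.dist(sample, references[best])


# Invented admixture vectors (three components instead of nine).
references = {
    "PopA": [0.7, 0.2, 0.1],
    "PopB": [0.1, 0.8, 0.1],
    "PopC": [0.3, 0.3, 0.4],
}
sample = [0.6, 0.3, 0.1]

name, distance = nearest_population(sample, references)
print(name, round(distance, 3))
```

Running the same search against a modern and an ancient reference panel gives the two "nearest" columns of a table like the one above.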
READMIX ANALYSIS, MODERN REFERENCE
Sample  Populations (proportions)
1251    Turkmen (0.494), Abkhazian (0.142), Belarusian (0.238), Yizu (0.127)
1564    Druze (0.649)
1566    Yakut (1)
166     Yakut (0.637), Even Sakha (0.339)
1986    Yakut (0.368), Saami (0.428), Abhaz (0.167)
531     Yaghnobi (Tajikistan) (0.506), Kets (0.053), Kurmi (0.202), Selkup (0.239)
619     Egypt (0.315), Yakut (0.183), Azeri (0.206), Yizu (0.297)
656     Mongolian (0.369), Even Sakha (0.252), Egypt (0.193), Yizu (0.187)
67      Yakut (0.623)
READMIX, ANCIENT REFERENCE
Sample  Populations (proportions)
1251    Steppe Eneolithic (0.624), Anatolia Neolithic (0.149), SE Iberia CA (0.124), Zevakino Chilikta IA (0.103)
1564    Peloponnese Neolithic (0.595)
1566    Pazyryk IA (1.000)
166     Pazyryk IA (1.000)
1986    Pazyryk IA (0.832)
531     Beaker Central Europe (0.388), Armenia MLBA (0.217), Yamnaya Ukraine (0.258), Maros.SG (0.138)
619     Beaker Central Europe (0.489), Pazyryk IA (0.237), Anatolia Neolithic (0.175), Peloponnese Neolithic (0.099)
656     Beaker Central Europe (0.443), Armenia MLBA (0.299), Pazyryk IA (0.193)
67      Pazyryk IA (0.875)
Ancestry and demography and descendants of Iron Age nomads of the Eurasian Steppe, Martina Unterlander et al, Nature Comm 2017
F3 OUTGROUP
166 1564
OVERLAP WITH THE ASHKENAZI GENOME CONSORTIUM MARKERS
• SEQUENCING AN ASHKENAZI REFERENCE PANEL SUPPORTS POPULATION-TARGETED PERSONAL GENOMICS AND ILLUMINATES JEWISH AND EUROPEAN ORIGINS,
SHAI CARMI, KEN Y. HUI,…, ITSIK PE’ER, NATURE COMMUNICATIONS VOLUME 5, ARTICLE NUMBER: 4835 (2014)
Sample                                                                 1251     1564     1566     166      1986     531      619      656      67
Total                                                                  3.1E+09  3.1E+09  3.1E+09  3.1E+09  3.1E+09  3.1E+09  3.1E+09  3.1E+09  3.1E+09
Overlapping positions (out of 953 mutations in known Ashkenazi genes)  247      190      301      282      327      21       23       48       70
Same allele as in Ashkenazi database                                   1        0        3        1        5        0        0        0        0
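The counts in this table are easier to judge as rates. The short calculation below divides, for each sample, the number of positions carrying the same allele as the Ashkenazi database by the number of overlapping positions; the numbers are copied from the table, and `match_rate` is a hypothetical helper.

```python
samples = ["1251", "1564", "1566", "166", "1986", "531", "619", "656", "67"]
overlapping = [247, 190, 301, 282, 327, 21, 23, 48, 70]
same_allele = [1, 0, 3, 1, 5, 0, 0, 0, 0]


def match_rate(matches, covered):
    """Share of covered positions that carry the database allele."""
    return matches / covered if covered else float("nan")


rates = {s: match_rate(m, c) for s, m, c in zip(samples, same_allele, overlapping)}
pooled = match_rate(sum(same_allele), sum(overlapping))

for s in samples:
    print(f"{s:>5s}: {rates[s]:.4f}")
print(f"pooled: {sum(same_allele)}/{sum(overlapping)} = {pooled:.4f}")
```

Pooled across all nine samples, 10 of 1,509 covered positions (about 0.7%) carry the database allele; the highest single-sample rate is about 1.5%, in sample 1986.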
ASHKENAZIM AND KHAZARS
1. NO SIGNIFICANT ASHKENAZI GENETIC AFFINITY WAS DETECTED IN ANY OF THE SEQUENCED INDIVIDUALS
2. ALL OF THE STUDIED KHAZARS, EVEN THOSE WITH SIGNIFICANT CAUCASIAN ANCESTRY, HAD SIGNIFICANT
ASIATIC NUCLEAR GENETIC CONTRIBUTIONS, WHICH ARE MISSING FROM PRESENT-DAY JEWISH POPULATIONS
3. WHILE LOCAL WOMEN WERE RECRUITED INTO ASHKENAZI COMMUNITIES, NONE OF THE IDENTIFIED
MITOCHONDRIAL HAPLOTYPES ARE COMMON IN PRESENT-DAY ASHKENAZI JEWS
4. THE EUROPEAN GENETIC COMPONENTS OF THE KHAZARS DERIVE FROM THE CAUCASUS TRIBES THAT WERE
UNDER CONTROL OF THE KHAGANATE, RATHER THAN FROM MORE DISTANT LEVANTINE POPULATIONS MORE
CLOSELY RELATED TO ASHKENAZI AND SEPHARDIC JEWS. WHILE JEWS PROBABLY LIVED IN THE TERRITORY OF
THE KHAZAR KHAGANATE ALONG WITH CHRISTIANS, MUSLIMS AND PAGANS, IT SEEMS UNLIKELY THAT THEY
FORMED ITS RULING CLASSES, WHICH WERE DOMINATED BY STEPPE NOMADS FROM THE EAST, AND THUS THE
KHAZARS WERE NOT LIKELY PROGENITORS OF THE ASHKENAZIM.
CONCLUSIONS
• USE MULTIPLE METHODS FOR ANALYSIS FOR VALIDATION
• THERE WERE TWO GROUPS OF KHAZARS, EUROPEAN AND
ASIAN, BOTH GROUPS MIXED
• KHAZARS WERE PROBABLY NOT THE DIRECT ANCESTORS OF
ASHKENAZI
• NEED MORE MONEY FOR MORE GENOMES
ACKNOWLEDGEMENTS
• OKINAWA: ALEXANDER MIKHEYEV
• ROSTOV: IGOR KORNIENKO, ELENA BATYEVA,
VLADIMIR KLYUCHNIKOV
• PETERSBURG: YURI ORLOV, IVAN DMITRIEVSKY
• TOMSK: ALEXEI ZARUBIN
• MOSCOW: NIKITA MOSHKOV