-Strings Seguindo o raciocínio anterior, nomes de strings ...
Statistical/computational phylogeneticsbouchard/pub/phylo-oxford-lecture.pdf · Evolution of...
Transcript of Statistical/computational phylogeneticsbouchard/pub/phylo-oxford-lecture.pdf · Evolution of...
Statistical/computational phylogenetics
Alexandre Bouchard-Côté
Department of Statistics,
University of British Columbia
Organization
• Background in phylogenetics
• Non-standard application areas:
• Clonal evolution inside an individual cancer tumour
• Reconstructing ancient languages
• Current research on Divide-and-Conquer Sequential Monte Carlo methods
Background on phylogenetics
Notation/background on trees
root
EDCBA }
}Observations (typically)
Hidden sequences (typically)
Clock trees:
ab c
a+b = c d+b = c a = d ...
Non-clock tree: remove additivity restrictions on branch lengths
A
B
CD
Clade: group containing a taxa and all its descendants. (I will restrict them to leaves, e.g. {D,E})
Branch lengths
d
root?
v ∈ V
Stochastic process on a tree
Model: String-valued Continuous Time Markov Chain
Evolution of strings is denoted by MBranch length is T = T6.Strings {Xt : 0 � t � T}; Xt has a countable infinite domain of all possiblemolecular sequences.
0 T1
T2
T3
Branch length
Strings
TA CGx1
TA Cx2
TG Cx3
TG CTGx4
TG CTG Ax5
CTG Ax6
T4
T5
T6
L. Wang and A. Bouchard-Cote (UBC) Harnessing Non-Local Evolutionary Events for Tree Infer. June 25, 2012 6 / 16
Model: String-valued Continuous Time Markov Chain
Evolution of strings is denoted by MBranch length is T = T6.Strings {Xt : 0 � t � T}; Xt has a countable infinite domain of all possiblemolecular sequences.
0 T1
T2
T3
Branch length
Strings
TA CGx1
TA Cx2
TG Cx3
TG CTGx4
TG CTG Ax5
CTG Ax6
T4
T5
T6
L. Wang and A. Bouchard-Cote (UBC) Harnessing Non-Local Evolutionary Events for Tree Infer. June 25, 2012 6 / 16
Model: String-valued Continuous Time Markov Chain
Evolution of strings is denoted by MBranch length is T = T6.Strings {Xt : 0 � t � T}; Xt has a countable infinite domain of all possiblemolecular sequences.
0 T1
T2
T3
Branch length
Strings
TA CGx1
TA Cx2
TG Cx3
TG CTGx4
TG CTG Ax5
CTG Ax6
T4
T5
T6
L. Wang and A. Bouchard-Cote (UBC) Harnessing Non-Local Evolutionary Events for Tree Infer. June 25, 2012 6 / 16
Model: String-valued Continuous Time Markov Chain
Evolution of strings is denoted by MBranch length is T = T6.Strings {Xt : 0 � t � T}; Xt has a countable infinite domain of all possiblemolecular sequences.
0 T1
T2
T3
Branch length
Strings
TA CGx1
TA Cx2
TG Cx3
TG CTGx4
TG CTG Ax5
CTG Ax6
T4
T5
T6
L. Wang and A. Bouchard-Cote (UBC) Harnessing Non-Local Evolutionary Events for Tree Infer. June 25, 2012 6 / 16
• Evolution modeled by a stochastic process
• Discrete data: Continuous time Markov chains (CTMC)
• Continuous data: Diffusions
Phylogenetics: challenges
reality /data
estimated tree
biological/clinical
discovery
reality probabilistic model
tree estimator
approximation algorithm
computer implementation
estimated tree
modeling errors
statistical issues intractability, approximation/
optimization errors
bugs!
Where can this go wrong?
reality probabilistic model
• Common assumption: we have replicates (loci/sites) of the evolutionary stochastic process (often iid given tree)
• Identifiability/consistency: can we find the tree given ‘infinite data’
• What does infinite data mean?
• # loci/sites
• # leaves
• sequencing depth
• Efficiency: how fast will the error decay as we add more data? (cf. computational efficiency)
probabilistic model
tree estimator
statistical issues
• Earlier methods: parsimony, distance-based
• State-of-the-art: likelihood-based (maximum likelihood and Bayesian method)
• key idea: score many hypothetical trees, marginalizing over unknown ancestral states
probabilistic model
tree estimator
Sequence Length200 400 600 800 1000 1200 1400 1600
Tim
e (s
econ
ds)
0
200
400
600
800fastDNAml MrBayes-CH2TREE-PUZZLE PAUP*-ML NJ
(a) Taxa = 20, Height = 0.3
Sequence Length200 400 600 800 1000 1200 1400 1600
Tim
e (s
econ
ds)
0
500
1000
1500
2000
2500fastDNAml MrBayes-CH2TREE-PUZZLE PAUP*-ML NJ
(b) Taxa = 20, Height = 1.0
Sequence Length200 400 600 800 1000 1200 1400 1600
Tim
e (s
econ
ds)
0
2000
4000
6000
8000
10000fastDNAml MrBayes-CH2 PAUP*-ML NJ
(c) Taxa = 40, Height = 0.3
Sequence Length200 400 600 800 1000 1200 1400 1600
Tim
e (s
econ
ds)
0
2000
4000
6000
8000
10000
12000 fastDNAml MrBayes-CH2PAUP*-ML NJ
(d) Taxa = 40, Height = 1.0
Fig. 1. Execution time of the phylogenetic methods as a function ofthe sequence length for various values of taxa and height.
Sequence Length
200 400 600 800 1000 1200 1400 1600
RF
rate
(%)
10
20
30
40
50
60fastDNAml MrBayes-CH2 TREE-PUZZLE PAUP*-ML NJ
(a) Taxa = 20, Height = 0.3
Sequence Length
200 400 600 800 1000 1200 1400 1600
RF
rate
(%)
10
20
30
40
50
60
70
80fastDNAml MrBayes-CH2 TREE-PUZZLE PAUP*-ML NJ
(b) Taxa = 20, Height = 1.0
Sequence Length
200 400 600 800 1000 1200 1400 1600
RF
rate
(%)
10
15
20
25
30
35
40
45 fastDNAml MrBayes-CH2PAUP*-MLNJ
(c) Taxa = 40, Height = 0.3
Sequence Length200 400 600 800 1000 1200 1400 1600
RF
rate
(%)
10
20
30
40
50
60
70
80 fastDNAml MrBayes-CH2PAUP*-ML NJ
(d) Taxa = 40, Height = 1.0
Fig. 2. Average RF rate of the phylogenetic methods as a functionof the sequence length for various values of taxa and height.
A
B
CD
AA
B
B
C
C
D
D
Williams and Moret, 03
• Statistically efficient tree reconstruction methods are generally computationally inefficient
• Parallelization, distribution
• Approximation methods:
• MCMC
• Sequential Monte Carlo
• Annealing
tree estimator
approximation algorithm
Application: Inferring the Phylogeny of Cancer Cells
Cancer and
evolution
• DNA repair pathways damaged => accelerated mutation rates
• Accelerated rate + clonal expansion => evolution within a tumour
Two types of
mutations
• Copy Number Alterations (CNA)
• Single Nucleotide Variant (SNV)
Copy number alterationsWhole
chromosome
Parent cancer
cell
Child cancer
cell
Local aberration
Single nucleotide variants (SNVs)
• Like SNPs, but arise in tumour instead of already in healthy cells (somatic instead of germline)
Accumulation of SNVs
genome at generation 1genome at generation 2
...
Tim
e
loci
Mathematical simplification: at most one single nucleotide hit per locus (‘infinite site model’, p2 << p when p is small)
=> binary state for SNVs
Motivations for studying this process
• Clinical:
• Evolution => Adaptation => Relapse
• Scientific:
• Evolution in the fitness landscape far from a local optimum
Bayesian non-parametric clonal inference
Trying to identify the ‘nodes’ of the tree
Simplification (pedagogical)
Simplification: for now, 1 chromosome, no CNA
loci
one cell
Observation: ‘bulk data’
loci
several cells
Count data (#reads that do not match ref genome)e.g. Bin(2/5, 5023)
311519085023
Observation: ‘bulk data’
loci
Count data (#reads that do not match ref genome)e.g. Bin(2/5, 5023)
311519085023
tree
??
Mutation (partial) order from clonal prevalence
Example: one mutation in 70% of cells, another in 60%
Consequence of infinite site assumption & pigeon hole principle:
Simplification (pedagogical)
Simplification: for now, 1 chromosome, no CNA
loci
one cell
Complex, aberrant copy number profiles
etc
Deconvolution of bulk data
Supplementary Figures
Supplementary Figure 1: a) A hypothetical phylogenetic tree generated by clonal expansion via the accu-mulation of mutations (stars). Unlike traditional phylogenetic trees internal nodes (clones) in the tree maycontribute to the observed data, not just the leaf nodes. b) Hypothetical observed cellular prevalences forthe mutations in tree. Mutations occurring higher up the tree always have a greater cellular prevalence thantheir descendants (the same statement need not be true about variant allelic prevalence because of theeffect of genotype). Note that the green, blue and purple mutations occur at the same cellular prevalencebecause they always co-occur in the clones of the tree. {fig:clonal˙evolution}
VariantPopulation
Normal Population
Reference Population
ChromosomeMutation
Supplementary Figure 2: Simplified structure of a sample submitted for sequencing. Here we consider thesample with respect to a single mutation (stars). With respect to this mutation we can separate the cells inthe sample into three populations: the normal population consists of all normal cells (circular), the referencepopulation consists of cancer cells (irregular) which do not contain the mutation and the variant populationconsists of all cancer cells with the mutation. To simplify the model we assume all the cells within eachpopulation share the same genotype. For example all cells in the variant population in this case have thegenotype AABB i.e. two copies of the reference allele, A, and two copies of the variant allele, B. Note thatthe fraction of cancer cells from the variant population is the cellular prevalence of the mutation which is610 = 0.6 in this example. Due to the effect of heterogeneity and genotype the expected fraction of readscontaining the variant allele (variant allelic prevalence) in this example would be 6·4· 24
2·2+4·3+6·4 = 0.3.{fig:pyclone˙populations}
16
Variant allelic prevalence = # variant strands / # strands = 12/40 (‘observed’)
Variant cellular prevalence = # variant cells / # cells = 6/12 (inferred)
Example of data
Figure 1: SNV phylogeny of 7 HGSOvCa patients
(a) Anatomic sites sampled for whole genome sequencing in 7 HGSOvCa patients. Phylogeny inferred from SNVs, with previously nominatedputative driver genes annotated on branches. Branch lengths represent counts of the number of SNVs originating on each branch. Branches areannotated with the number of SNVs lost along the branch.
ROv3 ROv2 ROv1 ROv4 SBwl Om1
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
4626
10714 65 113
3
8
ROv1 ROv4 ROv3 ROv2
15000
10000
5000
0
SNV
coun
t
9175 82
38
Om2 Om1 ROv1 ROv2
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
17 6
52 9
ROv2 ROv1 Adnx Om1
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
3538 49
81
ROv1 ROv3 ROv2 ROv4 LPvS
5000
4000
3000
2000
1000
0
SNV
coun
t
141957
1323
47
LOv1 RPvM BrnM
8000
7000
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
23 98
Om2 ROv1 Om1 LOv2 LOv1
4000
3000
2000
1000
0
SNV
coun
t
280
2413 13 2
11
TP53
CDK12
NAV3
TP53
TET2
TP53
STAG2ELF3
TP53
TP53
USP9X
TP53 TP53
FAT3
SETBP1
c b a d f e b a c d b a d c e a d c b
1 3 2 a c b d e a d c b
Patient 1 Patient 2 Patient 3 Patient 9
Patient 7 Patient 4 Patient 10
link
16
4 biopsies (can be temporal or spatial)Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}
31
Each curve shows the allele prevalence of a mutation
(depth not shown)
Clustering: motivation
Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}
31
Clustering: motivation
Cluster ≈ edge of the tree
cluster 1cluster 2
cluster 3
Inferential questions
• Cluster mutations
• Cellular prevalence of each sub-population.
Supplementary Figures
Supplementary Figure 1: a) A hypothetical phylogenetic tree generated by clonal expansion via the accu-mulation of mutations (stars). Unlike traditional phylogenetic trees internal nodes (clones) in the tree maycontribute to the observed data, not just the leaf nodes. b) Hypothetical observed cellular prevalences forthe mutations in tree. Mutations occurring higher up the tree always have a greater cellular prevalence thantheir descendants (the same statement need not be true about variant allelic prevalence because of theeffect of genotype). Note that the green, blue and purple mutations occur at the same cellular prevalencebecause they always co-occur in the clones of the tree. {fig:clonal˙evolution}
VariantPopulation
Normal Population
Reference Population
ChromosomeMutation
Supplementary Figure 2: Simplified structure of a sample submitted for sequencing. Here we consider thesample with respect to a single mutation (stars). With respect to this mutation we can separate the cells inthe sample into three populations: the normal population consists of all normal cells (circular), the referencepopulation consists of cancer cells (irregular) which do not contain the mutation and the variant populationconsists of all cancer cells with the mutation. To simplify the model we assume all the cells within eachpopulation share the same genotype. For example all cells in the variant population in this case have thegenotype AABB i.e. two copies of the reference allele, A, and two copies of the variant allele, B. Note thatthe fraction of cancer cells from the variant population is the cellular prevalence of the mutation which is610 = 0.6 in this example. Due to the effect of heterogeneity and genotype the expected fraction of readscontaining the variant allele (variant allelic prevalence) in this example would be 6·4· 24
2·2+4·3+6·4 = 0.3.{fig:pyclone˙populations}
16
cluster 1cluster 2
cluster 3
Challenges
• Noisy data
• High level of uncertainty, partial identifiability
• Many data types and sources of prior information=> Good fit for Bayesian non-parametrics
PyClone• Input: bulk ultra-deep
sequencing data [+ some prior info]
• Output: posterior distribution over clustering and cellular prevalences
• Based on the Dirichlet Process mixture model
H
H0 ↵
a↵, b↵
�n
bnmdn
m nm ⇡n
m
tm
s
as, bs
n = 1...N
m = 1...M
� ⇠ Gamma(a�, b�)
H0 = Uniform([0, 1]M )
H |�, H0 ⇠ DP(�, H0)
�n|H ⇠ H
�nm|�n
m ⇠ Categorical(�nm)
�nm = (gn
m,N, gnm,R, gn
m,V)
either
bnm|dn
m, �nm, �n
m, tm ⇠ Binomial(dnm, �(�n
m, �nm, tm))
or
s|a, b ⇠ Gamma(as, bs)
bnm|dn
m, �nm, �n
m, tm, s ⇠ BetaBinomial(dnm, �(�n
m, �nm, tm), s)
where
�(�, �, t) =
(1�t)c(gN)Z µ(gN) + t(1��)c(gR)
Z µ(gR)+
t�c(gV)Z µ(gV)
Z = (1 � t)c(gN) + t(1 � �)c(gR)+
t�c(gV)
Supplementary Figure 3: Probabilistic graphical representation of the PyClone model. The model assumesthe observed count data for the nth mutation is dependent on the cellular prevalence of the mutation aswell as the state of the normal, reference and variant populations. The cellular prevalence of mutationn across the M samples, �n, is drawn from a Dirichlet Process (DP) prior to allow mutations to clusterand the number of clusters to be inferred. For brevity we show the multi-sample version of PyClone whichgeneralises the single sample case (M=1). We also show the model with either the Binomial or BetaBinomial emission densities. For all analyses conducted in this paper we set vague priors of a↵ = 1, b↵ =
10
�3 for the DP concentration parameter ↵ and as = 1, bs = 10
�4 for the Beta Binomial precision parameters. The Gamma distributions are parametrised in terms of the shape, a, and rate, b, parameters (seeSection 1.1). {fig:pgm}
17
Roth et al. (2014) Nature Methods
PyClone: main parameter
Supplementary Figures
Supplementary Figure 1: a) A hypothetical phylogenetic tree generated by clonal expansion via the accu-mulation of mutations (stars). Unlike traditional phylogenetic trees internal nodes (clones) in the tree maycontribute to the observed data, not just the leaf nodes. b) Hypothetical observed cellular prevalences forthe mutations in tree. Mutations occurring higher up the tree always have a greater cellular prevalence thantheir descendants (the same statement need not be true about variant allelic prevalence because of theeffect of genotype). Note that the green, blue and purple mutations occur at the same cellular prevalencebecause they always co-occur in the clones of the tree. {fig:clonal˙evolution}
VariantPopulation
Normal Population
Reference Population
ChromosomeMutation
Supplementary Figure 2: Simplified structure of a sample submitted for sequencing. Here we consider thesample with respect to a single mutation (stars). With respect to this mutation we can separate the cells inthe sample into three populations: the normal population consists of all normal cells (circular), the referencepopulation consists of cancer cells (irregular) which do not contain the mutation and the variant populationconsists of all cancer cells with the mutation. To simplify the model we assume all the cells within eachpopulation share the same genotype. For example all cells in the variant population in this case have thegenotype AABB i.e. two copies of the reference allele, A, and two copies of the variant allele, B. Note thatthe fraction of cancer cells from the variant population is the cellular prevalence of the mutation which is610 = 0.6 in this example. Due to the effect of heterogeneity and genotype the expected fraction of readscontaining the variant allele (variant allelic prevalence) in this example would be 6·4· 24
2·2+4·3+6·4 = 0.3.{fig:pyclone˙populations}
16
Variant allelic prevalence = # variant strands / # strands = 12/40 (‘observed’)
Variant cellular prevalence = # variant cells / # cells = 6/12 (inferred)
ϕn: cellular prevalence vector of mutation n
PyClone: main parameterVariant allelic prevalence = # variant strands / # strands = 12/40 (‘observed’)
Variant cellular prevalence = # variant cells / # cells = 6/12 (inferred)
ϕn: cellular prevalence vector of mutation n
Note: the cellular prevalence varies across anatomical samples
ϕn =(ϕn ,..., ϕn)1 M
Figure 1: SNV phylogeny of 7 HGSOvCa patients
(a) Anatomic sites sampled for whole genome sequencing in 7 HGSOvCa patients. Phylogeny inferred from SNVs, with previously nominatedputative driver genes annotated on branches. Branch lengths represent counts of the number of SNVs originating on each branch. Branches areannotated with the number of SNVs lost along the branch.
ROv3 ROv2 ROv1 ROv4 SBwl Om1
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
4626
10714 65 113
3
8
ROv1 ROv4 ROv3 ROv2
15000
10000
5000
0
SNV
coun
t
9175 82
38
Om2 Om1 ROv1 ROv2
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
17 6
52 9
ROv2 ROv1 Adnx Om1
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
3538 49
81
ROv1 ROv3 ROv2 ROv4 LPvS
5000
4000
3000
2000
1000
0
SNV
coun
t
141957
1323
47
LOv1 RPvM BrnM
8000
7000
6000
5000
4000
3000
2000
1000
0
SNV
coun
t
23 98
Om2 ROv1 Om1 LOv2 LOv1
4000
3000
2000
1000
0
SNV
coun
t
280
2413 13 2
11
TP53
CDK12
NAV3
TP53
TET2
TP53
STAG2ELF3
TP53
TP53
USP9X
TP53 TP53
FAT3
SETBP1
c b a d f e b a c d b a d c e a d c b
1 3 2 a c b d e a d c b
Patient 1 Patient 2 Patient 3 Patient 9
Patient 7 Patient 4 Patient 10
link
16
DP mixture model
Parametric mixture model:
Part 2: Basics of Dirichlet processes 2-3
and let G = ⇡�
✓1 +(1�⇡)�✓2 . Note that G : F
⌦
! [0, 1] and that G(! ) = 1, in other words, G is distribution.Since it is constructed from random variables, it is a random distribution. Here is an example of a realizationof G:
Mea
n pa
ram
eter
Variance parameter
As it is apparent from this figure, the prior over G has support over discrete distributions with two pointmasses.
This is where the Dirichlet process comes in: it is a prior, denoted by DP, with a support over discrete witha countably infinite number of point masses. If G ⇠ DP, this means that a realization from G will look like:
Mea
n pa
ram
eter
Variance parameter
In the next section, we define Dirichlet process more formally.
2.3 First definition of Dirichlet processes
A Dirichlet process has two parameters:
1. A positive real number ↵
0
> 0, called the concentration parameter. We will see later that it can beinterpreted as a precision (inverse variance) parameter.
2. A distribution G
0
: F⌦
! [0, 1] called the base measure. We will see later that it can be interpreted asa mean parameter.
Recall that by the Kolmogorov consistency theorem, in order to guarantee the existence of a stochasticprocess on a probability space (! 0
,F⌦
0), it is enough to provide a consistent definition of what the marginalsof this stochastic process are. As the name suggest, in the case of a Dirichlet process, the marginals areDirichlet distributions:
Definition 2.1 (Dirichlet Process [2]) Let ↵
0
, G
0
be of the types listed above. We say that G : F⌦
0 !(F
⌦
! [0, 1]) is distributed according to the Dirichlet process distribution, denoted by G ⇠ DP(↵0
, G
0
), if for
all measurable partitions of ! , (A1
, . . . , A
K
) (this means that A
k
are events, A
k
2 F , that they are disjoint,
and that their union is equal to ! ), we have:
(G(A1
), G(A2
), . . . , G(AK
)) ⇠ Dir(↵0
G
0
(A1
),↵0
G
0
(A2
), . . . ,↵0
G
0
(AK
)).
Parameter for each datapoint (mutation) obtained by sampling the discrete measure
on the right
Cellular freq. for anatomical sample 2
Cellular
freq. fo
r
anatom
ical sa
mple 1
DP mixture model:
Part 2: Basics of Dirichlet processes 2-3
and let G = ⇡�
✓1 +(1�⇡)�✓2 . Note that G : F
⌦
! [0, 1] and that G(! ) = 1, in other words, G is distribution.Since it is constructed from random variables, it is a random distribution. Here is an example of a realizationof G:
Mea
n pa
ram
eter
Variance parameter
As it is apparent from this figure, the prior over G has support over discrete distributions with two pointmasses.
This is where the Dirichlet process comes in: it is a prior, denoted by DP, with a support over discrete witha countably infinite number of point masses. If G ⇠ DP, this means that a realization from G will look like:
Mea
n pa
ram
eter
Variance parameter
In the next section, we define Dirichlet process more formally.
2.3 First definition of Dirichlet processes
A Dirichlet process has two parameters:
1. A positive real number ↵
0
> 0, called the concentration parameter. We will see later that it can beinterpreted as a precision (inverse variance) parameter.
2. A distribution G
0
: F⌦
! [0, 1] called the base measure. We will see later that it can be interpreted asa mean parameter.
Recall that by the Kolmogorov consistency theorem, in order to guarantee the existence of a stochasticprocess on a probability space (! 0
,F⌦
0), it is enough to provide a consistent definition of what the marginalsof this stochastic process are. As the name suggest, in the case of a Dirichlet process, the marginals areDirichlet distributions:
Definition 2.1 (Dirichlet Process [2]) Let ↵
0
, G
0
be of the types listed above. We say that G : F⌦
0 !(F
⌦
! [0, 1]) is distributed according to the Dirichlet process distribution, denoted by G ⇠ DP(↵0
, G
0
), if for
all measurable partitions of ! , (A1
, . . . , A
K
) (this means that A
k
are events, A
k
2 F , that they are disjoint,
and that their union is equal to ! ), we have:
(G(A1
), G(A2
), . . . , G(AK
)) ⇠ Dir(↵0
G
0
(A1
),↵0
G
0
(A2
), . . . ,↵0
G
0
(AK
)).
sameCellular freq. for
anatomical sample 2
Cellular
freq. fo
r
anatom
ical sa
mple 1
DP mixture
• n: index of the SNV loci
• m: index of sample (biopsy)
• ϕn: cellular prevalence vector of mutation n
• d: depth (total # of reads)
• b: number of mutant
H
H0 ↵
a↵, b↵
�n
bnmdn
m nm ⇡n
m
tm
s
as, bs
n = 1...N
m = 1...M
� ⇠ Gamma(a�, b�)
H0 = Uniform([0, 1]M )
H |�, H0 ⇠ DP(�, H0)
�n|H ⇠ H
�nm|�n
m ⇠ Categorical(�nm)
�nm = (gn
m,N, gnm,R, gn
m,V)
either
bnm|dn
m, �nm, �n
m, tm ⇠ Binomial(dnm, �(�n
m, �nm, tm))
or
s|a, b ⇠ Gamma(as, bs)
bnm|dn
m, �nm, �n
m, tm, s ⇠ BetaBinomial(dnm, �(�n
m, �nm, tm), s)
where
�(�, �, t) =
(1�t)c(gN)Z µ(gN) + t(1��)c(gR)
Z µ(gR)+
t�c(gV)Z µ(gV)
Z = (1 � t)c(gN) + t(1 � �)c(gR)+
t�c(gV)
Supplementary Figure 3: Probabilistic graphical representation of the PyClone model. The model assumesthe observed count data for the nth mutation is dependent on the cellular prevalence of the mutation aswell as the state of the normal, reference and variant populations. The cellular prevalence of mutationn across the M samples, �n, is drawn from a Dirichlet Process (DP) prior to allow mutations to clusterand the number of clusters to be inferred. For brevity we show the multi-sample version of PyClone whichgeneralises the single sample case (M=1). We also show the model with either the Binomial or BetaBinomial emission densities. For all analyses conducted in this paper we set vague priors of a↵ = 1, b↵ =
10
�3 for the DP concentration parameter ↵ and as = 1, bs = 10
�4 for the Beta Binomial precision parameters. The Gamma distributions are parametrised in terms of the shape, a, and rate, b, parameters (seeSection 1.1). {fig:pgm}
17
ϕ1 ϕ2
311519085023
bd
b|d,ϕ1 ~BeBin
Posterior inference
• MCMC: standard auxiliary variable method for non-conjugate models (‘algorithm 8’)
• Issue: scaling to large # of mutations
• Latest work with Andy Roth and Arnaud Doucet: a particle MCMC split merge algorithm
Example of clustering with the model so far
Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}
31
Genotypes• n: index of the SNV loci
• m: index of sample (biopsy)
• ϕn: cellular prevalence vector of mutation n
• d: depth (total # of reads)
• b: number of mutant alleles
• s: overdispersion parameter
• ψ: genotype (∈ {AB, AAB, AABB, ...})
H
H0 ↵
a↵, b↵
�n
bnmdn
m nm ⇡n
m
tm
s
as, bs
n = 1...N
m = 1...M
� ⇠ Gamma(a�, b�)
H0 = Uniform([0, 1]M )
H |�, H0 ⇠ DP(�, H0)
�n|H ⇠ H
�nm|�n
m ⇠ Categorical(�nm)
�nm = (gn
m,N, gnm,R, gn
m,V)
either
bnm|dn
m, �nm, �n
m, tm ⇠ Binomial(dnm, �(�n
m, �nm, tm))
or
s|a, b ⇠ Gamma(as, bs)
bnm|dn
m, �nm, �n
m, tm, s ⇠ BetaBinomial(dnm, �(�n
m, �nm, tm), s)
where
�(�, �, t) =
(1�t)c(gN)Z µ(gN) + t(1��)c(gR)
Z µ(gR)+
t�c(gV)Z µ(gV)
Z = (1 � t)c(gN) + t(1 � �)c(gR)+
t�c(gV)
Supplementary Figure 3: Probabilistic graphical representation of the PyClone model. The model assumesthe observed count data for the nth mutation is dependent on the cellular prevalence of the mutation aswell as the state of the normal, reference and variant populations. The cellular prevalence of mutationn across the M samples, �n, is drawn from a Dirichlet Process (DP) prior to allow mutations to clusterand the number of clusters to be inferred. For brevity we show the multi-sample version of PyClone whichgeneralises the single sample case (M=1). We also show the model with either the Binomial or BetaBinomial emission densities. For all analyses conducted in this paper we set vague priors of a↵ = 1, b↵ =
10
�3 for the DP concentration parameter ↵ and as = 1, bs = 10
�4 for the Beta Binomial precision parameters. The Gamma distributions are parametrised in terms of the shape, a, and rate, b, parameters (seeSection 1.1). {fig:pgm}
17
ABB AB AABB
Informed priors over genotypes
• Goal: reduce the space of possible genotypes
• Key idea: we can also use germline SNPs to get extra information on the copy number at a loci
• Hijack standard micro-array platforms to get log intensity in the neighborhood of the SNV of interest
SNV (cancer)
SNP (from your parents)
Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}
31
Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}
31
Clustering of mutations
Without genotypes With genotypes
Full output (consensus)
Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}
31
How to validate?
• Cross validation
• Synthetic data (in silico and in vitro)
2 | ADVANCE ONLINE PUBLICATION | NATURE METHODS
BRIEF COMMUNICATIONS
variation events. Third, Bayesian nonparametric clustering is used to discover groupings of mutations and the number of groups simultaneously. This obviates fixing the number of groups a priori and allows for cellular prevalence estimates to reflect uncertainty in this parameter. Fourth, multiple samples from the same cancer may be analyzed jointly to leverage the scenario in which clonal populations are shared across samples. Simulated data sets sys-tematically illustrate improvements in performance of each of the novel modeling components (Supplementary Figs. 12 and 13).
We first evaluated our approach on idealized data sets (Fig. 1) produced from physical mixtures of DNA extracted from four 1000 Genomes Project samples11,12 (Online Methods). Each mix-ture contained DNA in approximate proportions of 0.01, 0.05, 0.20 and 0.74. Specific single-nucleotide variants (SNVs) were ampli-fied using PCR and then sequenced deeply on the Illumina MiSeq platform. This data set simulates a hierarchically related popula-tion with ground truth for quantitative benchmarking (Fig. 1c). We selected positions with variants present in exactly one case,
e f
1.0
b
0.9
0.8
0.7
V-m
easu
re
0.6
0.5
IBMM
IBBMM
Bin-TCN
Bin-PCN
BeBin-
TCN
BeBin-
PCN
1.0
0.8
0.6
Var
iant
alle
lic p
reva
lenc
e
0.4
0A B C D
Mixture
Genotype
0.2
AB (n = 3)BB (n = 3)
1.0
c
0.8
0.6
Exp
ecte
d ce
llula
rpr
eval
ence
0.4
0A B C D
Mixture
0.2
Variant case(s)
NA12878NA18507NA19240NA18507NA19240
NA12156
NA12878NA18507NA19240NA12156NA12878NA18507NA19240
1.0
a
0.9
0.8
0.7
V-m
easu
re
0.6
0.5
IBMM
IBBMM
Bin-TCN
Bin-PCN
BeBin-
TCN
BeBin-
PCN
d1.0
0.8
0.6
IBB
MM
(in
ferr
ed)
cellu
lar
prev
alen
ce
0.4
0A B C D
Mixture
0.2
Cluster
1.0 (n = 2)2.0 (n = 2)3.0 (n = 3)4.0 (n = 3)5.0 (n = 3)6.0 (n = 3)7.0 (n = 2)8.0 (n = 3)9.0 (n = 7)10.0 (n = 7)11.0 (n = 1)12.0 (n = 1)
1.0
0.8
0.6
0.4
0A B C D
Mixture
0.2
BeB
in-P
CN
(in
ferr
ed)
cellu
lar
prev
alen
ce
Cluster
3.0 (n = 4)4.0 (n = 6)
2.0 (n = 3)
6.0 (n = 3)
1.0 (n = 6)
5.0 (n = 14)
7.0 (n = 1)
Figure 1 | Comparison of clustering performance for the mixture of normal-tissue data sets. (a,b) Comparison of methods (10 replicates of 37 mutations) when analyzing a four-mixture experiment separately (a) or jointly (b). IBMM, infinite binomial (Bin) mixture model; IBBMM, infinite beta-binomial (BeBin) mixture model; TCN, with total copy-number priors; PCN, with parental copy-number priors. (c) Expected cellular prevalence of each cluster across the four-mixture experiments. (d,e) Inferred cellular prevalences and clustering using the IBBMM model (d) and PyClone BeBin-PCN model (e). Clusters with homozygous single-nucleotide variants (SNVs) or an equal number of homo- and heterozygous SNVs are indicated by solid lines; clusters with heterozygous SNVs, by dashed lines. (f) Variant allelic prevalence for mutations assigned to cluster 1 by the PyClone BeBin-PCN model. Dashed lines represent heterozygous SNVs; solid lines represent homozygous SNVs. Whiskers (a,b) indicate 1.5× the interquartile range, red bars indicate the median and boxes represent the interquartile range. Error bars (d,e) indicate the mean s.d. (using 9,000 post-‘burn-in’ samples (Online Methods)) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. (d–f) The number of mutations n in each cluster is shown in the legends in parentheses.
a1.0
0.8
0.6
0.4
0.2
0
Var
iant
alle
licpr
eval
ence
A B C D
1.0
0.8
0.6
0.4
0.2
0(Inf
erre
d) c
ellu
lar
prev
alen
ce
A B C D
Cluster1 (n = 4)2 (n = 20)3 (n = 2)4 (n = 6)5 (n = 1)6 (n = 1)7 (n = 9)8 (n = 2)9 (n = 4)
c1:18661418
8:133905955X: 1496390371:1106553972:31589849
2:1726480703:50391111
4:1538313667:44280313
11:4676081612:5036858416:29675929
17:766281019:115692932:1968915508:95186221X:635793247:28858772
1:22605559510:2644631216:61687701X:129283520
Absent Present Unknown
dAbsentPresentUnknown
ControlCancerNormal
Cell type
BeB
in-P
CN
IBB
MM
gDN
A 1
Nuc
leus
41
Nuc
leus
4N
ucle
us 3
9N
ucle
us 3
6N
ucle
us 3
5N
ucle
us 3
4N
ucle
us 3
3N
ucle
us 3
2N
ucle
us 3
1N
ucle
us 9
Nuc
leus
3N
ucle
us 5
Nuc
leus
29
Nuc
leus
22
Nuc
leus
20
Nuc
leus
2N
ucle
us 1
8N
ucle
us 1
7N
ucle
us 1
6
Nuc
leus
14
Nuc
leus
15
Nuc
leus
13
Nuc
leus
10
Nuc
leus
27
Nuc
leus
7N
ucle
us 8
Nuc
leus
1N
ucle
us 1
9
Nuc
leus
30
Nuc
leus
6
IBBMM
b1.0
0.8
0.6
0.4
0.2
0
Var
iant
alle
licpr
eval
ence
A B CSample
D
1.0
0.8
0.6
0.4
0.2
0(Inf
erre
d) c
ellu
lar
prev
alen
ce
A B CSample
D
Cluster1 (n = 25)2 (n = 9)3 (n = 5)4 (n = 7)5 (n = 2)6 (n = 1)
BeBin-PCN
Figure 2 | Joint analysis of multiple samples from high-grade serous ovarian cancer 2. (a,b) Left, variant allelic prevalence for each mutation, color coded by predicted cluster, using IBBMM (a) and PyClone with the BeBin-PCN model (b) to jointly analyze the four samples. Right, inferred cellular prevalence for each cluster using IBBMM (a) and BeBin-PCN methods (b). As in Figure 1, the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutations in the cluster. (c) Presence or absence of variant alleles at target loci in single cells from sample B. Loci with fewer than 40 reads covering them are colored gray. Predicted clusters for each method are shown on the left, with white cells indicating nonsomatic control positions. Row labels indicate hg19 chromosome and chromosome coordinate separated by a colon. (d) Presence or absence of IBBMM clusters in single cells from sample B. Clusters were deemed present if any mutation in the cluster was present. gDNA, bulk genomic DNA control. Error bars (a,b) indicate the mean s.d. (using 50,000 post-‘burn-in’ samples) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. The number of mutations n in each cluster is shown in the legends in parentheses.
Single-cell sequencing
loci
cell
2 | ADVANCE ONLINE PUBLICATION | NATURE METHODS
BRIEF COMMUNICATIONS
variation events. Third, Bayesian nonparametric clustering is used to discover groupings of mutations and the number of groups simultaneously. This obviates fixing the number of groups a priori and allows for cellular prevalence estimates to reflect uncertainty in this parameter. Fourth, multiple samples from the same cancer may be analyzed jointly to leverage the scenario in which clonal populations are shared across samples. Simulated data sets sys-tematically illustrate improvements in performance of each of the novel modeling components (Supplementary Figs. 12 and 13).
We first evaluated our approach on idealized data sets (Fig. 1) produced from physical mixtures of DNA extracted from four 1000 Genomes Project samples11,12 (Online Methods). Each mix-ture contained DNA in approximate proportions of 0.01, 0.05, 0.20 and 0.74. Specific single-nucleotide variants (SNVs) were ampli-fied using PCR and then sequenced deeply on the Illumina MiSeq platform. This data set simulates a hierarchically related popula-tion with ground truth for quantitative benchmarking (Fig. 1c). We selected positions with variants present in exactly one case,
e f
1.0
b
0.9
0.8
0.7
V-m
easu
re
0.6
0.5
IBMM
IBBMM
Bin-TCN
Bin-PCN
BeBin-
TCN
BeBin-
PCN
1.0
0.8
0.6
Var
iant
alle
lic p
reva
lenc
e
0.4
0A B C D
Mixture
Genotype
0.2
AB (n = 3)BB (n = 3)
1.0
c
0.8
0.6
Exp
ecte
d ce
llula
rpr
eval
ence
0.4
0A B C D
Mixture
0.2
Variant case(s)
NA12878NA18507NA19240NA18507NA19240
NA12156
NA12878NA18507NA19240NA12156NA12878NA18507NA19240
1.0
a
0.9
0.8
0.7
V-m
easu
re
0.6
0.5
IBMM
IBBMM
Bin-TCN
Bin-PCN
BeBin-
TCN
BeBin-
PCN
d1.0
0.8
0.6
IBB
MM
(inf
erre
d) c
ellu
lar
prev
alen
ce
0.4
0A B C D
Mixture
0.2
Cluster
1.0 (n = 2)2.0 (n = 2)3.0 (n = 3)4.0 (n = 3)5.0 (n = 3)6.0 (n = 3)7.0 (n = 2)8.0 (n = 3)9.0 (n = 7)10.0 (n = 7)11.0 (n = 1)12.0 (n = 1)
1.0
0.8
0.6
0.4
0A B C D
Mixture
0.2
BeB
in-P
CN
(inf
erre
d)ce
llula
r pre
vale
nce Cluster
3.0 (n = 4)4.0 (n = 6)
2.0 (n = 3)
6.0 (n = 3)
1.0 (n = 6)
5.0 (n = 14)
7.0 (n = 1)
Figure 1 | Comparison of clustering performance for the mixture of normal-tissue data sets. (a,b) Comparison of methods (10 replicates of 37 mutations) when analyzing a four-mixture experiment separately (a) or jointly (b). IBMM, infinite binomial (Bin) mixture model; IBBMM, infinite beta-binomial (BeBin) mixture model; TCN, with total copy-number priors; PCN, with parental copy-number priors. (c) Expected cellular prevalence of each cluster across the four-mixture experiments. (d,e) Inferred cellular prevalences and clustering using the IBBMM model (d) and PyClone BeBin-PCN model (e). Clusters with homozygous single-nucleotide variants (SNVs) or an equal number of homo- and heterozygous SNVs are indicated by solid lines; clusters with heterozygous SNVs, by dashed lines. (f) Variant allelic prevalence for mutations assigned to cluster 1 by the PyClone BeBin-PCN model. Dashed lines represent heterozygous SNVs; solid lines represent homozygous SNVs. Whiskers (a,b) indicate 1.5× the interquartile range, red bars indicate the median and boxes represent the interquartile range. Error bars (d,e) indicate the mean s.d. (using 9,000 post-‘burn-in’ samples (Online Methods)) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. (d–f) The number of mutations n in each cluster is shown in the legends in parentheses.
a1.0
0.8
0.6
0.4
0.2
0
Var
iant
alle
licpr
eval
ence
A B C D
1.0
0.8
0.6
0.4
0.2
0(Infe
rred
) cel
lula
rpr
eval
ence
A B C D
Cluster1 (n = 4)2 (n = 20)3 (n = 2)4 (n = 6)5 (n = 1)6 (n = 1)7 (n = 9)8 (n = 2)9 (n = 4)
c1:18661418
8:133905955X: 1496390371:1106553972:31589849
2:1726480703:50391111
4:1538313667:44280313
11:4676081612:5036858416:29675929
17:766281019:115692932:1968915508:95186221X:635793247:28858772
1:22605559510:2644631216:61687701X:129283520
Absent Present Unknown
dAbsentPresentUnknown
ControlCancerNormal
Cell type
BeB
in-P
CN
IBB
MM
gDN
A 1
Nuc
leus
41
Nuc
leus
4N
ucle
us 3
9N
ucle
us 3
6N
ucle
us 3
5N
ucle
us 3
4N
ucle
us 3
3N
ucle
us 3
2N
ucle
us 3
1N
ucle
us 9
Nuc
leus
3N
ucle
us 5
Nuc
leus
29
Nuc
leus
22
Nuc
leus
20
Nuc
leus
2N
ucle
us 1
8N
ucle
us 1
7N
ucle
us 1
6
Nuc
leus
14
Nuc
leus
15
Nuc
leus
13
Nuc
leus
10
Nuc
leus
27
Nuc
leus
7N
ucle
us 8
Nuc
leus
1N
ucle
us 1
9
Nuc
leus
30
Nuc
leus
6
IBBMM
b1.0
0.8
0.6
0.4
0.2
0
Var
iant
alle
licpr
eval
ence
A B CSample
D
1.0
0.8
0.6
0.4
0.2
0(Infe
rred
) cel
lula
rpr
eval
ence
A B CSample
D
Cluster1 (n = 25)2 (n = 9)3 (n = 5)4 (n = 7)5 (n = 2)6 (n = 1)
BeBin-PCN
Figure 2 | Joint analysis of multiple samples from high-grade serous ovarian cancer 2. (a,b) Left, variant allelic prevalence for each mutation, color coded by predicted cluster, using IBBMM (a) and PyClone with the BeBin-PCN model (b) to jointly analyze the four samples. Right, inferred cellular prevalence for each cluster using IBBMM (a) and BeBin-PCN methods (b). As in Figure 1, the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutations in the cluster. (c) Presence or absence of variant alleles at target loci in single cells from sample B. Loci with fewer than 40 reads covering them are colored gray. Predicted clusters for each method are shown on the left, with white cells indicating nonsomatic control positions. Row labels indicate hg19 chromosome and chromosome coordinate separated by a colon. (d) Presence or absence of IBBMM clusters in single cells from sample B. Clusters were deemed present if any mutation in the cluster was present. gDNA, bulk genomic DNA control. Error bars (a,b) indicate the mean s.d. (using 50,000 post-‘burn-in’ samples) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. The number of mutations n in each cluster is shown in the legends in parentheses.
Issue: loss of SNVs
Application: reconstructing ancient languages
Examples of language change
Beowulf (~ 8th - 11th century) *dekm Proto-Indo-European-6000 years before present?
tehun Proto-Germanic-2000 years before present
tȳne ætsomneten
tien
‘ten altogether’
Old English
(Modern) English
-1000 years before present
present
exp
�⇧
⇤⌥
(s,s�)⇥d
⇥�, T (s, s�)⇤ � ⇥ ⌅� � ⌅(t)⌅22 � A(⇥)
⇥⌃
⌅ ⇤t(d)
Examples of language change
*dekm Proto-Indo-European> 6000 years before present?
tehun Proto-Germanic2000 years before present
ten
tien Old English
(Modern) English
1000 years before present
present
Deletion: /i/ > ∅
Insertion: /e/ > /ie/; Deletion: /h/ > ∅
Substitutions: /d/ > /t/, /k/ > /h/,...
exp
�⇧
⇤⌥
(s,s�)⇥d
⇥�, T (s, s�)⇤ � ⇥ ⌅� � ⌅(t)⌅22 � A(⇥)
⇥⌃
⌅ ⇤t(d)
exp
�⇧
⇤⌥
(s,s�)⇥d
⇥�, T (s, s�)⇤ � ⇥ ⌅� � ⌅(t)⌅22 � A(⇥)
⇥⌃
⌅ ⇤t(d)
Example of data
Austronesian dataset:
Size: 706 languages, 150k word forms in IPA
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ikaProto-Oceanic *ika *mataku
...
...
[Greenhill et al, ’08]
Oceanic language
HawaiianSamoanTonganMaori
Task: comparative reconstruction
‘fish’ Hawaiian iʔa
Samoan iʔa
Tongan ika
Maori ika
Task: comparative reconstruction
Proto-Oceanic
‘fish’ Hawaiian iʔa
Samoan iʔa
Tongan ika
Maori ika
Task: comparative reconstruction
Proto-Oceanic
‘fish’ POc *ika
*k > ʔ‘fish’
Hawaiian iʔa
Samoan iʔa
Tongan ika
Maori ika
Task: comparative reconstruction
Proto-Oceanic
‘fish’ POc *iʔa
or
‘fish’ POc *ika
*ʔ > k
‘fish’ Hawaiian iʔa
Samoan iʔa
Tongan ika
Maori ika
Task: comparative reconstruction
Proto-Oceanic
‘fish’ POc *iʔa
‘fish’ POc *ika
‘fish’ Hawaiian iʔa
Samoan iʔa
Tongan ika
Maori ika
or
...
Previous work/main stream approach: manually
reconstructQuestion: Can we leverage machine learning methods?
Can we harness more languages?
‘fish’ Hawaiian iʔa
Samoan iʔa
Tongan ika
Maori ika
Geser ikan
Rapanui ika
Nukuoro iga
Niue ika
Task 2: cognate inference
Fact: New words are invented & old words fall out of usage
Consequence: not all words forms for a given concept will have cousins in related languages.
Hawaiian (G) iʔa makaʔu
Samoan (S) iʔa mataʔu
Tongan (T) ika manavahē
Maori (M) ika mataku
Column: ‘cognate
set’ (think as a cluster)
‘fish’ ‘fear’ ‘fear’
Task 2: cognate inference
Hawaiian iʔa, makaʔu, ...
Samoan iʔa, makaʔu, ...
Tongan ika, manavahē, ...
Maori ika, mataku, ...
Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ika manavahē
Maori ika mataku
or ...Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ika manavahē
Maori ika mataku
Task 3: tree inference
HawaiianSamoanTonganMaori
HawaiianSamoanTonganMaori
or ...
Model-based approach
X1
X2X3
X4 X5
X2|X1 � Mut(X1; �)
Xi|Xpar(i) � Mut(Xpar(i); �)
X3|X1 � Mut(X1; �)
...
Input: modern wordsProbabilistic
models of change
Hawaiian
Samoan
Tongan
Maori
Hawaiian
Samoan
Tongan
Maori
or...
‘fish’
POc *i!a
‘fish’
POc *ika
or
...
‘fish’ (1)
Hawaiian i!a
Samoan i!a
Tongan ika
Marshallese yapil
or
‘fish’ (1) ‘fish’ (2)
Hawaiian i!a
Samoan i!a
Tongan ika
Marshallese yapil
...
Reconstruction
Cognates
Trees
‘fish’ ‘fear’
Hawaiian i!a maka!u
Samoan i!a mata!u
Tongan ika
Motivation
Why reconstruct?
! Can answer a large number of questions about our past
! Learn about ancient populations’ migrations
! Decipherment of ancient scripts
Motivation: typology
! What are the statistical regularities of sound change?
! Role of functional load in sound change
! Useful for synchronic typology as well
! Is an observed pattern the result of structural constraints, or the result of common descent?
Motivation: machine translation (MT)
! Long-term challenge: usable machine translation between all pairs of languages
! Existing MT rely on large corpora in both languages in the pair,
! in particular, bitexts (e.g. europarl)
! Concrete step: large scale cognate detection
Background
International Phonetic Alphabet
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ikaProto-Oceanic ika mataku
Instead of {A,C,G,T}, here the alphabet is a set of basic sound units (phonemes):
!
"
œ
#
$
% &
'
(
)
*
+
,
-
f
.
gd
e
/
b
a
n
o
m
0
k
h
i
1
v
u
2
t
s
3
ø
qp 4
5
6 7
z
y
8
x
!
"
œ
#
$
% &
'
(
)
*
+
,
-
f
.
gd
e
/
b
a
n
o
m
0
k
h
i
1
v
u
2
t
s
3
ø
qp 4
5
6 7
z
y
8
x
Sound changes
*dekm Proto-Indo-European-6000 years before present?
tehun Proto-Germanic-2000 years before present
ten
tien Old English
(Modern) English
-1000 years before present
present
English ten
Danish ti
Greek deka
Italian dieci
French dix
Sanskrit dasan
Proto-Indo-European
Proto-Germanic
/d/ > /t/
Regularity in sound change
English ten tooth
Danish ti tand
Greek deka donti
Italian dieci dente
French dix dent
Sanskrit dasan danta
Proto-Indo-European
/d/ > /t/ in word initial position
! Probabilistic models of inventory changes? ! Open computational
problemSWEDISH
BAFFA_PASHTOLITHUANIAN
FRENCHHINDI
DUTCHASSAMESE
BRETONITALIAN
BHOIPURICORNISHCZECH
GUJARATIBENGALISPANISH
EASTERN_FARSIGILAKIWELSH
PORTUGUESEKURDISH_CENTRALSTANDARD_GERMAN
SARDINIANBELARUSIAN
LATVIANPOLISH
ENGLISHALSATIAN
BULGARIANROMANIANCATALAN
BERNESE_GERMANBANNU_PASHTO
RUSSIAN
3 5 8 C E L N S T X Z a b c d e f g h i j k l m n o p q r s t u v w x y z
0.0
0.2
0.4
0.6
0.8
1.0
Modelling phonological change
iʔa iʔa ika ika
Simplifying assumptions
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ika
Maori ika mataku
POc
H. S. T. M.
Fixed tree Fixed cognate sets
iʔa iʔa ika ika
Graphical model
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ika
Maori ika mataku
POc
H. S. T. M.
Graphical model
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ika
Maori ika matakumakaʔu makaʔu makaku
POc
H. S. T. M.
Graphical model
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ika
Maori ika mataku
POc
H. S. T. M.
Key distribution: Models diachronic word mutation
X
Y
Y |X � Mut(X)
Modeling string mutation
! What kind of string mutations need to be captured?
! Substitution
*k > ʔ‘fish’ ‘voice’
Hawaiian iʔa leo
Samoan iʔa leo
Tongan ika leʔo
Maori ika reo
! Deletion (& insertions)
*ʔ > Ø
! Contextualized change
*ʔ > Ø / V _ V
taŋgis
taŋgi
taŋgi aŋgi
taŋgis
angi
String transducer
t a
a
ŋ
n g
i
i
s#
#
#
#
‘to cry’
t a
a
ŋ
n g
i
i
s#
#
#
#?
ŋ
String transducer
}
t a
a
ŋ
n g
i
i
s#
#
#
#?
ŋ
θS
String transducer
θS : Substitution/Deletion Parameters
t a
a
ŋ
n g
i
i
s#
#
#
#
ŋ
n ?
θS
θI
String transducer
θI : Insertion Parameters
t a
a
ŋ
n g
i
i
s#
#
#
#
ŋ
n g
θS
θI
?
String transducer
Bayesian inference
X1, X2|X3, X4, X5
Computational bottleneck
X1
X2X3
X4 X5
More difficult: string-valued expectations
Parameters fitting: can be approached using standard tools POc
H. S. T. M.
θθ θθ θθ
λ
Various MCMC algorithms
T A G G
T A G C
T A C G C
C A G G
C A C G G
T A G G
T A G C
T A C G C
C A G G
C A C G G
Gibbs sampling
Ancestry resampling
Thin vertical section
Sequential Monte Carlo• SMC: An alternative to
MCMC
• Approximates posterior with particles and weights
• D&C SMC: generalization more suitable to phylogenies
π1 π2 π3
π7
...
π6 π5
π4 π3 π2 π1
Standard SMC samplers
Divide and Conquer
(D&C) SMC samplers• tinyurl.com/divide-and-conquer-smc
• Fredrik Lindsten, Adam M. Johansen, Christian A. Naesseth, Bonnie Kirkpatrick, Thomas B. Schön, John Aston, and Alexandre Bouchard-Côté. (2016) Divide-and-Conquer with Sequential Monte Carlo. JCSG, In Press.
D&C SMC: notation
...
...
yl
yr
xr
xl
xt
xl
xr
xtyt
1. Assume inductively we have &
...
......
2. Sample from the product measure:
(xil, x
ir) ⇠ ⇡l ⇥ ⇡r
⇡t
⇡l
⇡r
- amounts to two independent multinomial sampling steps
D&C (SIR): ⇡l ⇡r
1. Assume inductively...
...
......
2. Sample from the product measure:
⇡t
⇡l
⇡r
3. Propose (extend):
(xil, x
ir)
x
it|(xi
l, xir) ⇠ qt(·|xi
l, xir)
Concatenate:x
it =
(xit,�x
il�,�x
ir�)
?
D&C (SIR):
1. Assume inductively...
...
......
2. Sample from the product measure
⇡t
⇡l
⇡r
3. Propose (extend)
(xil, x
ir)
4. Reweigh:
x
it
w
it =
⇡t(xit)
⇡l(xil)⇡r(xi
r)
1
qt(xit|xi
l, xir)
D&C (SIR):
1. Assume inductively...
...
......
2. Sample from the product measure
⇡t
⇡l
⇡r
3. Propose (extend)
4. Reweigh
Repeat for each particle
D&C (SIR):
Potential applications• Inference over non-local/catastrophic events (explicit
sound change operators)
• Joint cognate inference
• Statistical multiple sequence alignment
• Relaxed clock models
• Many other models that would otherwise have to be handled with very expensive methods such as Approximate Bayesian Computation
Numerical examples
Validation methodology
Evaluation criterion: Edit distance Smallest number of substitutions, insertions and deletions needed to go from one string to the other
Holding-out the manual reconstructions
e.g.: d( /ika/ , /ga/ ) = 2 /ika/ ➔ /iga/ ➔ /ga/
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ikaProto-Oceanic *ika *mataku
‘fish’ ‘fear’Hawaiian iʔa makaʔu
Samoan iʔa mataʔu
Tongan ikaProto-Oceanic ??? ???
Datasets used for reconstruction
! Austronesian languages dataset ! 706 languages, 150k words ! 17 annotated reconstructions ! Ideal for testing large scale reconstruction ! Has continued to grow since the time experiments were
performed
[Greenhill et al. 08]
Comparisons in large phylogenies
Mean error (edit distance)
Comparisons on Proto-Oceanic and Proto-Austronesian
0
0.125
0.250
0.375
0.500
Proto-Oceanic Proto-Austronesian
YakanBajoMapunSamalS ias iInabaknonSubanunSinSubanonSioWes ternBukManoboWes tManoboIliaManoboDibaManoboAtadManoboAtauManoboTigwManoboKalaManoboSaraBinukidMaranaoIranunSasakBaliMamanwaIlonggoHiligaynon
WarayWarayButuanonTausugJoloSurigaononAklanonBisCebuanoKalaganMansakaTagalogAnt
BikolNagaC
TagbanwaKa
TagbanwaAb
BatakPalaw
PalawanBat
BanggiSWPalawano
HanunooPalauanSinghiDayakBakat
Katingan
DayakNgaju
KadorihMerinaM
ala
Maanyan
Tunjung
Kerinc iMelayuSara
Melayu
OganLowMalay
Indones ian
BanjareseM
MalayBahas
MelayuBrun
Minangkaba
PhanRangCh
ChruHainanCham
Moken
IbanGayoIdaan
BunduDusun
TimugonM
ur
KelabitBar
Bintulu
BerawanLon
Belait
KenyahLong
MelanauMuk
Lahanan
EngganoMal
EngganoBan
Nias
TobaBatak
Madurese
IvatanBasc
Itbayat
Babuyan
Iraralay
Itbayaten
Isamorong
Ivasay
Imorod
Yami
SambalBoto
Kapampanga
IfugaoBayn
KallahanKa
KallahanKe
Inibaloi
Pangasinan
KakidugenI
IlongotKak
IfugaoAmga
IfugaoBata
BontokGuin
BontocGuin
KankanayNo
Balangaw
KalingaGui
ItnegBinon
Gaddang
AttaPamplo
IsnegDibag
Agta
DumagatCas
Ilokano
Yogya
OldJavanes
WesternMalayoPolynesian
BugineseSo
Maloh
Makassar
TaeSToraja
SangirTabu
Sangir
SangilSara
Bantik
Kaidipang
GorontaloH
BolaangMon
Popalia
Bonerate
Wuna
MunaKatobu
Tontemboan
Tonsea
BanggaiWdi
Baree
Wolio
Mori
KayanUmaJu
Bukat
PunanKelai
Wampar
Yabem
NumbamiSib
Mouk
Aria
LamogaiMul
Amara
Sengseng
KaulongAuV
Bilibil
Megiar
Gedaged
Matukar
Biliau
Mengen
Maleu
Kove
Tarpia
Sobei
KayupulauK
Riwo
KairiruKis
WogeoAli
Auhelawa
Saliba
Suau
Maisin
Gumawana
Diodio
Molima
Ubir
Gapapaiwa
Wedau
Dobuan
Misima
Kilivila
Gabadi
Doura
Kuni
Roro
Mekeo
Lala
Vilirupu
Motu
MagoriSout
Kandas
Tanga
Siar
Kuanua
Patpatar
Roviana
Simbo
Nduke
Kusaghe
Hoava
LunggaLuqa
Kubokota
Ughele
Ghanongga
Vangunu
Marovo
MbarekeSo
los
Taiof
Teop
NehanHape
Nis sanHaku
Sis ingga
BabatanaTu
VarisiGhon
VarisiRi
rioSengga
BabatanaAv
Vaghua
Babatana
BabatanaLo
BabatanaKa
LungaLungaMono
MonoAluTorauUruava
MonoFauro
Bilur
ChekeHolo
MaringeKm
a
MaringeTat
NggaoPoro
MaringeLel
Zabana
Kia
LaghuS
amas
Blabla
ngaBlabla
ngaGKok
otaKilokak
aYsBanoniMadara
Tungag
TungKara
WestNalikTiangTigakL ihir
Sungl
MadakLam
asBarok
MaututuNakan
aiBilLakala
iVitu
YapeseMbugho
tuDhBugotu
Talis eMoliTalis eMala
GhariNdiGhariTolo
Talis eMalango
GhariNggaeMbirao
GhariTandaTalis ePoleGhariNggerTalis eKooGhariNgini
Lengo
LengoGhaim
LengoParipNggela
ToambaitaKwai
LangalangaKwaio
MbaengguuLau
Mbaelelea
KwaraaeSol
Fataleka
LauWalade
LauNorth
Longgu
SaaUkiNiMa
SaaSaaVill
Oroha
SaaUlawa
Saa
Dorio
AreareMaas
AreareWaia
SaaAuluVil
BauroBaroo
BauroHaunu
Fagani
KahuaMami
SantaCatal
Aros iOneib
Aros iTawat
Aros i
Kahua
BauroPawaV
FaganiAguf
FaganiRihu
SantaAna
Tawaroga
TannaSouth
Kwamera
Lenakel
SyeErroman
Ura
ArakiSouth
Merei
Ambrym
Sout
Mota
PeteraraMa
PaameseSou
Mwotlap
Raga
SouthEfate
Orkon
Nguna
Namakir
Nati
Nahavaq
Nese
Tape
AnejomAnei
SakaoPortO
Avava
Neveei
Naman
Puluwatese
PuloAnnan
Woleaian
Satawalese
ChuukeseAK
Mortlockes
Carolinian
Chuukese
Sonsoroles
SaipanCaro
PuloAnna
Woleai
Mokilese
Pingilapes
Ponapean
Kiribati
Kusaie
Marshalles
Nauru
Nelemwa
Jawe
Canala
Iaai
Nengone
Dehu
Rotuman
WesternFij
Maori
Tahiti
Penrhyn
Tuamotu
Manihiki
Rurutuan
Tahitia
nMo
Rarotongan
Tahitia
nth
Marquesan
Mangareva
Hawaiian
RapanuiEas
Samoan
Bellona
Tikopia
Emae
Anuta
Rennellese
FutunaAniw
IfiraMeleM
UveaWest
FutunaEast
VaeakauTau
Takuu
Tuvalu
Sikaiana
Kapingamar
Nukuoro
Luangiua
Pukapuka
Tokelau
UveaEast
Tongan
Niue
Fijia
nBau
Teanu
Vano
Tanema
Buma
Asumboa
Tanimbili
Nembao
Seimat
Wuvulu
Lou
Nauna
Leipon
SivisaTita
Likum
Levei
Loniu
Mussau
KasiraIrah
Giman
Buli
Mor
Minyaifuin
BigaMisool
As
Waropen
Numfor
Marau
AmbaiYapen
WindesiWan
NorthBabar
DaweraDawe
Dai
TelaMasbua
Imroing
Emplawas
EastMasela
Serili
CentralMas
SouthEastB
WestDamar
TaranganBa
UjirNAru
NgaiborSAr
Geser
Watubela
ElatKeiBes
HituAmbon
SouAmanaTe
Amahai
Paulohi
Alune
MurnatenAl
Bonfia
Werinama
BuruNamrol
Soboyo
Mamboru
Manggarai
Ngadha
Pondok
Kodi
WejewaTana
Bima
Soa
Savu
Wanukaka
GauraNggau
Eas tSumban
Baliledo
Lamboya
L ioFloresT
Ende
Nage
Kambera
PalueNitun
Anakalang
Sekar
PulauArgun
RotiTerman
Atoni
Mambai
TetunTerik
Kemak
Lamalerale
LamaholotI
S ika
Kedang
Erai
Talur
Aputai
Perai
Tugun
Iliun
Serua
Nila
Teun
Kisar
Roma
Eas tD
amar
Letinese
Yamd
ena
KeiTan
imba
Selaru
Koiwai
Iria
Chamo
rro
Lampun
g
Komerin
g
Sunda
RejangR
eja
TboliTa
gab
Tagabil
i
Koronad
alB
Sarangan
iB
BilaanKor
o
BilaanSar
a
CiuliAtaya
SquliqAtay
Sediq
Bunun
ThaoFavorla
ng
RukaiPaiwan
Kanakanabu
SaaroaTsouPuyumaKavalanBasaiCentralAmiS irayaPazehSais iat
YakanBajoMapunSamalS ias iInabaknonSubanunSinSubanonSioWes ternBukManoboWes tManoboIliaManoboDibaManoboAtadManoboAtauManoboTigwManoboKalaManoboSaraBinukidMaranaoIranunSasakBaliMamanwaIlonggoHiligaynon
WarayWarayButuanonTausugJoloSurigaononAklanonBisCebuanoKalaganMansakaTagalogAnt
BikolNagaC
TagbanwaKa
TagbanwaAb
BatakPalaw
PalawanBat
BanggiSWPalawano
HanunooPalauanSinghiDayakBakat
Katingan
DayakNgaju
KadorihMerinaM
ala
Maanyan
Tunjung
Kerinc iMelayuSara
Melayu
OganLowMalay
Indones ian
BanjareseM
MalayBahas
MelayuBrun
Minangkaba
PhanRangCh
ChruHainanCham
Moken
IbanGayoIdaan
BunduDusun
TimugonM
ur
KelabitBar
Bintulu
BerawanLon
Belait
KenyahLong
MelanauMuk
Lahanan
EngganoMal
EngganoBan
Nias
TobaBatak
Madurese
IvatanBasc
Itbayat
Babuyan
Iraralay
Itbayaten
Isamorong
Ivasay
Imorod
Yami
SambalBoto
Kapampanga
IfugaoBayn
KallahanKa
KallahanKe
Inibaloi
Pangasinan
KakidugenI
IlongotKak
IfugaoAmga
IfugaoBata
BontokGuin
BontocGuin
KankanayNo
Balangaw
KalingaGui
ItnegBinon
Gaddang
AttaPamplo
IsnegDibag
Agta
DumagatCas
Ilokano
Yogya
OldJavanes
WesternMalayoPolynesian
BugineseSo
Maloh
Makassar
TaeSToraja
SangirTabu
Sangir
SangilSara
Bantik
Kaidipang
GorontaloH
BolaangMon
Popalia
Bonerate
Wuna
MunaKatobu
Tontemboan
Tonsea
BanggaiWdi
Baree
Wolio
Mori
KayanUmaJu
Bukat
PunanKelai
Wampar
Yabem
NumbamiSib
Mouk
Aria
LamogaiMul
Amara
Sengseng
KaulongAuV
Bilibil
Megiar
Gedaged
Matukar
Biliau
Mengen
Maleu
Kove
Tarpia
Sobei
KayupulauK
Riwo
KairiruKis
WogeoAli
Auhelawa
Saliba
Suau
Maisin
Gumawana
Diodio
Molima
Ubir
Gapapaiwa
Wedau
Dobuan
Misima
Kilivila
Gabadi
Doura
Kuni
Roro
Mekeo
Lala
Vilirupu
Motu
MagoriSout
Kandas
Tanga
Siar
Kuanua
Patpatar
Roviana
Simbo
Nduke
Kusaghe
Hoava
LunggaLuqa
Kubokota
Ughele
Ghanongga
Vangunu
Marovo
MbarekeSo
los
Taiof
Teop
NehanHape
Nis sanHaku
Sis ingga
BabatanaTu
VarisiGhon
VarisiRi
rioSengga
BabatanaAv
Vaghua
Babatana
BabatanaLo
BabatanaKa
LungaLungaMono
MonoAluTorauUruava
MonoFauro
Bilur
ChekeHolo
MaringeKm
a
MaringeTat
NggaoPoro
MaringeLel
Zabana
Kia
LaghuS
amas
Blabla
ngaBlabla
ngaGKok
otaKilokak
aYsBanoniMadara
Tungag
TungKara
WestNalikTiangTigakL ihir
Sungl
MadakLam
asBarok
MaututuNakan
aiBilLakala
iVitu
YapeseMbugho
tuDhBugotu
Talis eMoliTalis eMala
GhariNdiGhariTolo
Talis eMalango
GhariNggaeMbirao
GhariTandaTalis ePoleGhariNggerTalis eKooGhariNgini
Lengo
LengoGhaim
LengoParipNggela
ToambaitaKwai
LangalangaKwaio
MbaengguuLau
Mbaelelea
KwaraaeSol
Fataleka
LauWalade
LauNorth
Longgu
SaaUkiNiMa
SaaSaaVill
Oroha
SaaUlawa
Saa
Dorio
AreareMaas
AreareWaia
SaaAuluVil
BauroBaroo
BauroHaunu
Fagani
KahuaMami
SantaCatal
Aros iOneib
Aros iTawat
Aros i
Kahua
BauroPawaV
FaganiAguf
FaganiRihu
SantaAna
Tawaroga
TannaSouth
Kwamera
Lenakel
SyeErroman
Ura
ArakiSouth
Merei
Ambrym
Sout
Mota
PeteraraMa
PaameseSou
Mwotlap
Raga
SouthEfate
Orkon
Nguna
Namakir
Nati
Nahavaq
Nese
Tape
AnejomAnei
SakaoPortO
Avava
Neveei
Naman
Puluwatese
PuloAnnan
Woleaian
Satawalese
ChuukeseAK
Mortlockes
Carolinian
Chuukese
Sonsoroles
SaipanCaro
PuloAnna
Woleai
Mokilese
Pingilapes
Ponapean
Kiribati
Kusaie
Marshalles
Nauru
Nelemwa
Jawe
Canala
Iaai
Nengone
Dehu
Rotuman
WesternFij
Maori
Tahiti
Penrhyn
Tuamotu
Manihiki
Rurutuan
Tahitia
nMo
Rarotongan
Tahitia
nth
Marquesan
Mangareva
Hawaiian
RapanuiEas
Samoan
Bellona
Tikopia
Emae
Anuta
Rennellese
FutunaAniw
IfiraMeleM
UveaWest
FutunaEast
VaeakauTau
Takuu
Tuvalu
Sikaiana
Kapingamar
Nukuoro
Luangiua
Pukapuka
Tokelau
UveaEast
Tongan
Niue
Fijia
nBau
Teanu
Vano
Tanema
Buma
Asumboa
Tanimbili
Nembao
Seimat
Wuvulu
Lou
Nauna
Leipon
SivisaTita
Likum
Levei
Loniu
Mussau
KasiraIrah
Giman
Buli
Mor
Minyaifuin
BigaMisool
As
Waropen
Numfor
Marau
AmbaiYapen
WindesiWan
NorthBabar
DaweraDawe
Dai
TelaMasbua
Imroing
Emplawas
EastMasela
Serili
CentralMas
SouthEastB
WestDamar
TaranganBa
UjirNAru
NgaiborSAr
Geser
Watubela
ElatKeiBes
HituAmbon
SouAmanaTe
Amahai
Paulohi
Alune
MurnatenAl
Bonfia
Werinama
BuruNamrol
Soboyo
Mamboru
Manggarai
Ngadha
Pondok
Kodi
WejewaTana
Bima
Soa
Savu
Wanukaka
GauraNggau
Eas tSumban
Baliledo
Lamboya
L ioFloresT
Ende
Nage
Kambera
PalueNitun
Anakalang
Sekar
PulauArgun
RotiTerman
Atoni
Mambai
TetunTerik
Kemak
Lamalerale
LamaholotI
S ika
Kedang
Erai
Talur
Aputai
Perai
Tugun
Iliun
Serua
Nila
Teun
Kisar
Roma
Eas tD
amar
Letinese
Yamd
ena
KeiTan
imba
Selaru
Koiwai
Iria
Chamo
rro
Lampun
g
Komerin
g
Sunda
RejangR
eja
TboliTa
gab
Tagabil
i
Koronad
alB
Sarangan
iB
BilaanKor
o
BilaanSar
a
CiuliAtaya
SquliqAtay
Sediq
Bunun
ThaoFavorla
ng
RukaiPaiwan
Kanakanabu
SaaroaTsouPuyumaKavalanBasaiCentralAmiS irayaPazehSais iat
Baseline: sampling from the observed wordsThis work
Distance between two linguists’ reconstructions
[Blust ’93, Pawley ’09] [Blust ’99]
Examples of reconstruction
Known Modern Languages Reconstructed Ancestors
Gloss Fijian Pazeh Melanau Inabaknon Manual Automated �
star kalokalo mintol biten bitu’on *bituqen *bituqen 0to hold taura ma:raP magem kumkom *gemgem *gemgemhouse vale xumaP lebuP ruma *öumaq *öumaqbird manumanu aiam manuk manok *qayam *qayamto cut, hack tata ta:tatak tutek hadhad *taöaq *taöaqat e N/A gaP N/A *i *iwhat? cava Paxai uaP inew ay *nanu *anu 1this oqo Pimini itew yayto *ini *aniwind cagi var@ paNay bariyo *bali *beliu 2
Figure 1: Examples for each types of data processed by our system. The main input is theIPA representation of the pronunciation of basic concepts for a collection of related languages.This is shown in the phonology section of the table on the left. Optionally, our system can alsouse partial or complete clustering of these word forms into cognate sets. The cognates sets,shown in the Cognates section of the left table, are clusters of word forms all descending from aproto-word form (they can be equivalently represented by a binary matrix where there an entryis equal to one if two word are in the same cognate set, and zero otherwise). Our system canalso hypothesize or refine these cognate sets when the information is missing. In the table onthe right, we show a random sample of reconstructions produced by the system, stratified byLevenstein distance (�) to a reference reconstruction, in this case the reconstruction of (8).More detailed reconstructions can be found in (2) [[ref to sup mat]]. The baseline consists inpicking one of the modern forms uniformly at random.
came French /SaZ/ ‘change’. Still, several factors obscure these regularities. In particular, sound
changes are often context sensitive: Latin /kor/ ‘heart’ with French reflex /kœK/ illustrates that
the place of articulation change occurred only before the vowel /a/.
In this paper, we present the first automated system capable of large-scale reconstruction of
protolanguages. This system is based on a probabilistic model of sound change, building on re-
cent work on the reconstruction of ancestral sequences in computational biology (9,10). Several
groups have recently explored how methods from computational biology can be applied to prob-
lems in historical linguistics, but this work has focused on identifying the relationships between
languages (as might be expressed in a phylogeny) rather than reconstructing the languages them-
selves (11–15). Much of this work has been based on binary cognate matrices, which discard
all information about the form that words take, simply indicating whether they are cognate. To
illustrate the difference between the representation of the data used by these models and ours, it
3
Stratified sampling by edit
distance
Colors here indicate
cognate sets