Statistical/computational phylogeneticsbouchard/pub/phylo-oxford-lecture.pdf · Evolution of...

Statistical/computational phylogenetics

Alexandre Bouchard-Côté

Department of Statistics,

University of British Columbia

Organization

• Background in phylogenetics

• Non-standard application areas:

• Clonal evolution inside an individual cancer tumour

• Reconstructing ancient languages

• Current research on Divide-and-Conquer Sequential Monte Carlo methods

Background on phylogenetics

Notation/background on trees

root

EDCBA }

}Observations (typically)

Hidden sequences (typically)

Clock trees:

ab c

a+b = c d+b = c a = d ...

Non-clock tree: remove additivity restrictions on branch lengths

A

B

CD

Clade: group containing a taxa and all its descendants. (I will restrict them to leaves, e.g. {D,E})

Branch lengths

d

root?

v ∈ V

Stochastic process on a tree

Model: String-valued Continuous Time Markov Chain

Evolution of strings is denoted by MBranch length is T = T6.Strings {Xt : 0 � t � T}; Xt has a countable infinite domain of all possiblemolecular sequences.

0 T1

T2

T3

Branch length

Strings

TA CGx1

TA Cx2

TG Cx3

TG CTGx4

TG CTG Ax5

CTG Ax6

T4

T5

T6

L. Wang and A. Bouchard-Cote (UBC) Harnessing Non-Local Evolutionary Events for Tree Infer. June 25, 2012 6 / 16



0 T1

T2

T3

Branch length

Strings

TA CGx1

TA Cx2

TG Cx3

TG CTGx4

TG CTG Ax5

CTG Ax6

T4

T5

T6




0 T1

T2

T3

Branch length

Strings

TA CGx1

TA Cx2

TG Cx3

TG CTGx4

TG CTG Ax5

CTG Ax6

T4

T5

T6




0 T1

T2

T3

Branch length

Strings

TA CGx1

TA Cx2

TG Cx3

TG CTGx4

TG CTG Ax5

CTG Ax6

T4

T5

T6


• Evolution modeled by a stochastic process

• Discrete data: Continuous time Markov chains (CTMC)

• Continuous data: Diffusions

Phylogenetics: challenges

reality /data

estimated tree

biological/clinical

discovery

reality probabilistic model

tree estimator

approximation algorithm

computer implementation

estimated tree

modeling errors

statistical issues intractability, approximation/

optimization errors

bugs!

Where can this go wrong?

reality probabilistic model

• Common assumption: we have replicates (loci/sites) of the evolutionary stochastic process (often iid given tree)

• Identifiability/consistency: can we find the tree given ‘infinite data’

• What does infinite data mean?

• # loci/sites

• # leaves

• sequencing depth

• Efficiency: how fast will the error decay as we add more data? (cf. computational efficiency)

probabilistic model

tree estimator

statistical issues

• Earlier methods: parsimony, distance-based

• State-of-the-art: likelihood-based (maximum likelihood and Bayesian method)

• key idea: score many hypothetical trees, marginalizing over unknown ancestral states

probabilistic model

tree estimator

Sequence Length200 400 600 800 1000 1200 1400 1600

Tim

e (s

econ

ds)

0

200

400

600

800fastDNAml MrBayes-CH2TREE-PUZZLE PAUP*-ML NJ

(a) Taxa = 20, Height = 0.3

Sequence Length200 400 600 800 1000 1200 1400 1600

Tim

e (s

econ

ds)

0

500

1000

1500

2000

2500fastDNAml MrBayes-CH2TREE-PUZZLE PAUP*-ML NJ

(b) Taxa = 20, Height = 1.0

Sequence Length200 400 600 800 1000 1200 1400 1600

Tim

e (s

econ

ds)

0

2000

4000

6000

8000

10000fastDNAml MrBayes-CH2 PAUP*-ML NJ

(c) Taxa = 40, Height = 0.3

Sequence Length200 400 600 800 1000 1200 1400 1600

Tim

e (s

econ

ds)

0

2000

4000

6000

8000

10000

12000 fastDNAml MrBayes-CH2PAUP*-ML NJ

(d) Taxa = 40, Height = 1.0

Fig. 1. Execution time of the phylogenetic methods as a function ofthe sequence length for various values of taxa and height.

Sequence Length

200 400 600 800 1000 1200 1400 1600

RF

rate

(%)

10

20

30

40

50

60fastDNAml MrBayes-CH2 TREE-PUZZLE PAUP*-ML NJ

(a) Taxa = 20, Height = 0.3

Sequence Length

200 400 600 800 1000 1200 1400 1600

RF

rate

(%)

10

20

30

40

50

60

70

80fastDNAml MrBayes-CH2 TREE-PUZZLE PAUP*-ML NJ

(b) Taxa = 20, Height = 1.0

Sequence Length

200 400 600 800 1000 1200 1400 1600

RF

rate

(%)

10

15

20

25

30

35

40

45 fastDNAml MrBayes-CH2PAUP*-MLNJ

(c) Taxa = 40, Height = 0.3

Sequence Length200 400 600 800 1000 1200 1400 1600

RF

rate

(%)

10

20

30

40

50

60

70

80 fastDNAml MrBayes-CH2PAUP*-ML NJ

(d) Taxa = 40, Height = 1.0

Fig. 2. Average RF rate of the phylogenetic methods as a functionof the sequence length for various values of taxa and height.

A

B

CD

AA

B

B

C

C

D

D

Williams and Moret, 03

• Statistically efficient tree reconstruction methods are generally computationally inefficient

• Parallelization, distribution

• Approximation methods:

• MCMC

• Sequential Monte Carlo

• Annealing

tree estimator

approximation algorithm

Application: Inferring the Phylogeny of Cancer Cells

Cancer and

evolution

• DNA repair pathways damaged => accelerated mutation rates

• Accelerated rate + clonal expansion => evolution within a tumour

Two types of

mutations

• Copy Number Alterations (CNA)

• Single Nucleotide Variant (SNV)

Copy number alterationsWhole

chromosome

Parent cancer

cell

Child cancer

cell

Local aberration

Single nucleotide variants (SNVs)

• Like SNPs, but arise in tumour instead of already in healthy cells (somatic instead of germline)

Accumulation of SNVs

genome at generation 1genome at generation 2

...

Tim

e

loci

Mathematical simplification: at most one single nucleotide hit per locus (‘infinite site model’, p2 << p when p is small)

=> binary state for SNVs

Motivations for studying this process

• Clinical:

• Evolution => Adaptation => Relapse

• Scientific:

• Evolution in the fitness landscape far from a local optimum

Bayesian non-parametric clonal inference

Trying to identify the ‘nodes’ of the tree

Simplification (pedagogical)

Simplification: for now, 1 chromosome, no CNA

loci

one cell

Observation: ‘bulk data’

loci

several cells

Count data (#reads that do not match ref genome)e.g. Bin(2/5, 5023)

311519085023

Observation: ‘bulk data’

loci

Count data (#reads that do not match ref genome)e.g. Bin(2/5, 5023)

311519085023

tree

??

Mutation (partial) order from clonal prevalence

Example: one mutation in 70% of cells, another in 60%

Consequence of infinite site assumption & pigeon hole principle:

Simplification (pedagogical)

Simplification: for now, 1 chromosome, no CNA

loci

one cell

Complex, aberrant copy number profiles

etc

Deconvolution of bulk data

Supplementary Figures

Supplementary Figure 1: a) A hypothetical phylogenetic tree generated by clonal expansion via the accu-mulation of mutations (stars). Unlike traditional phylogenetic trees internal nodes (clones) in the tree maycontribute to the observed data, not just the leaf nodes. b) Hypothetical observed cellular prevalences forthe mutations in tree. Mutations occurring higher up the tree always have a greater cellular prevalence thantheir descendants (the same statement need not be true about variant allelic prevalence because of theeffect of genotype). Note that the green, blue and purple mutations occur at the same cellular prevalencebecause they always co-occur in the clones of the tree. {fig:clonal˙evolution}

VariantPopulation

Normal Population

Reference Population

ChromosomeMutation

Supplementary Figure 2: Simplified structure of a sample submitted for sequencing. Here we consider thesample with respect to a single mutation (stars). With respect to this mutation we can separate the cells inthe sample into three populations: the normal population consists of all normal cells (circular), the referencepopulation consists of cancer cells (irregular) which do not contain the mutation and the variant populationconsists of all cancer cells with the mutation. To simplify the model we assume all the cells within eachpopulation share the same genotype. For example all cells in the variant population in this case have thegenotype AABB i.e. two copies of the reference allele, A, and two copies of the variant allele, B. Note thatthe fraction of cancer cells from the variant population is the cellular prevalence of the mutation which is610 = 0.6 in this example. Due to the effect of heterogeneity and genotype the expected fraction of readscontaining the variant allele (variant allelic prevalence) in this example would be 6·4· 24

2·2+4·3+6·4 = 0.3.{fig:pyclone˙populations}

16

Variant allelic prevalence = # variant strands / # strands = 12/40 (‘observed’)

Variant cellular prevalence = # variant cells / # cells = 6/12 (inferred)

Example of data

Figure 1: SNV phylogeny of 7 HGSOvCa patients

(a) Anatomic sites sampled for whole genome sequencing in 7 HGSOvCa patients. Phylogeny inferred from SNVs, with previously nominatedputative driver genes annotated on branches. Branch lengths represent counts of the number of SNVs originating on each branch. Branches areannotated with the number of SNVs lost along the branch.

ROv3 ROv2 ROv1 ROv4 SBwl Om1

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

4626

10714 65 113

3

8

ROv1 ROv4 ROv3 ROv2

15000

10000

5000

0

SNV

coun

t

9175 82

38

Om2 Om1 ROv1 ROv2

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

17 6

52 9

ROv2 ROv1 Adnx Om1

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

3538 49

81

ROv1 ROv3 ROv2 ROv4 LPvS

5000

4000

3000

2000

1000

0

SNV

coun

t

141957

1323

47

LOv1 RPvM BrnM

8000

7000

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

23 98

Om2 ROv1 Om1 LOv2 LOv1

4000

3000

2000

1000

0

SNV

coun

t

280

2413 13 2

11

TP53

CDK12

NAV3

TP53

TET2

TP53

STAG2ELF3

TP53

TP53

USP9X

TP53 TP53

FAT3

SETBP1

c b a d f e b a c d b a d c e a d c b

1 3 2 a c b d e a d c b

Patient 1 Patient 2 Patient 3 Patient 9

Patient 7 Patient 4 Patient 10

link

16

4 biopsies (can be temporal or spatial)Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}

31

Each curve shows the allele prevalence of a mutation

(depth not shown)

Clustering: motivation

Supplementary Figure 20: Joint analysis of multiple samples from high grade serous ovarian cancer (HG-SOC) 1. Panels a) and c) show the variant allelic prevalence for each mutation color coded by predictedcluster using the IBBMM and PyClone with BeBin-PCN model to jointly analyse the four samples. Panelsb) and d) show the inferred cellular prevalence for each cluster using the IBBMM and BeBin-PCN methods.As in Figure 1 the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutationsin the cluster. The number of mutations n in each cluster is shown in the legend in parentheses. {fig:ith˙case1}

31

Clustering: motivation

Cluster ≈ edge of the tree

cluster 1cluster 2

cluster 3

Inferential questions

• Cluster mutations

• Cellular prevalence of each sub-population.



VariantPopulation

Normal Population


ChromosomeMutation



16

cluster 1cluster 2

cluster 3

Challenges

• Noisy data

• High level of uncertainty, partial identifiability

• Many data types and sources of prior information=> Good fit for Bayesian non-parametrics

PyClone• Input: bulk ultra-deep

sequencing data [+ some prior info]

• Output: posterior distribution over clustering and cellular prevalences

• Based on the Dirichlet Process mixture model

H

H0 ↵

a↵, b↵

�n

bnmdn

m nm ⇡n

m

tm

s

as, bs

n = 1...N

m = 1...M

� ⇠ Gamma(a�, b�)

H0 = Uniform([0, 1]M )

H |�, H0 ⇠ DP(�, H0)

�n|H ⇠ H

�nm|�n

m ⇠ Categorical(�nm)

�nm = (gn

m,N, gnm,R, gn

m,V)

either

bnm|dn

m, �nm, �n

m, tm ⇠ Binomial(dnm, �(�n

m, �nm, tm))

or

s|a, b ⇠ Gamma(as, bs)

bnm|dn

m, �nm, �n

m, tm, s ⇠ BetaBinomial(dnm, �(�n

m, �nm, tm), s)

where

�(�, �, t) =

(1�t)c(gN)Z µ(gN) + t(1��)c(gR)

Z µ(gR)+

t�c(gV)Z µ(gV)

Z = (1 � t)c(gN) + t(1 � �)c(gR)+

t�c(gV)

Supplementary Figure 3: Probabilistic graphical representation of the PyClone model. The model assumesthe observed count data for the nth mutation is dependent on the cellular prevalence of the mutation aswell as the state of the normal, reference and variant populations. The cellular prevalence of mutationn across the M samples, �n, is drawn from a Dirichlet Process (DP) prior to allow mutations to clusterand the number of clusters to be inferred. For brevity we show the multi-sample version of PyClone whichgeneralises the single sample case (M=1). We also show the model with either the Binomial or BetaBinomial emission densities. For all analyses conducted in this paper we set vague priors of a↵ = 1, b↵ =

10

�3 for the DP concentration parameter ↵ and as = 1, bs = 10

�4 for the Beta Binomial precision parameters. The Gamma distributions are parametrised in terms of the shape, a, and rate, b, parameters (seeSection 1.1). {fig:pgm}

17

Roth et al. (2014) Nature Methods

PyClone: main parameter



VariantPopulation

Normal Population


ChromosomeMutation



16

Variant allelic prevalence = # variant strands / # strands = 12/40 (‘observed’)


ϕn: cellular prevalence vector of mutation n

PyClone: main parameterVariant allelic prevalence = # variant strands / # strands = 12/40 (‘observed’)


ϕn: cellular prevalence vector of mutation n

Note: the cellular prevalence varies across anatomical samples

ϕn =(ϕn ,..., ϕn)1 M

Figure 1: SNV phylogeny of 7 HGSOvCa patients

(a) Anatomic sites sampled for whole genome sequencing in 7 HGSOvCa patients. Phylogeny inferred from SNVs, with previously nominatedputative driver genes annotated on branches. Branch lengths represent counts of the number of SNVs originating on each branch. Branches areannotated with the number of SNVs lost along the branch.

ROv3 ROv2 ROv1 ROv4 SBwl Om1

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

4626

10714 65 113

3

8

ROv1 ROv4 ROv3 ROv2

15000

10000

5000

0

SNV

coun

t

9175 82

38

Om2 Om1 ROv1 ROv2

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

17 6

52 9

ROv2 ROv1 Adnx Om1

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

3538 49

81

ROv1 ROv3 ROv2 ROv4 LPvS

5000

4000

3000

2000

1000

0

SNV

coun

t

141957

1323

47

LOv1 RPvM BrnM

8000

7000

6000

5000

4000

3000

2000

1000

0

SNV

coun

t

23 98

Om2 ROv1 Om1 LOv2 LOv1

4000

3000

2000

1000

0

SNV

coun

t

280

2413 13 2

11

TP53

CDK12

NAV3

TP53

TET2

TP53

STAG2ELF3

TP53

TP53

USP9X

TP53 TP53

FAT3

SETBP1

c b a d f e b a c d b a d c e a d c b

1 3 2 a c b d e a d c b

Patient 1 Patient 2 Patient 3 Patient 9

Patient 7 Patient 4 Patient 10

link

16

DP mixture model

Parametric mixture model:

Part 2: Basics of Dirichlet processes 2-3

and let G = ⇡�

✓1 +(1�⇡)�✓2 . Note that G : F

⌦

! [0, 1] and that G(! ) = 1, in other words, G is distribution.Since it is constructed from random variables, it is a random distribution. Here is an example of a realizationof G:

Mea

n pa

ram

eter

Variance parameter

As it is apparent from this figure, the prior over G has support over discrete distributions with two pointmasses.

This is where the Dirichlet process comes in: it is a prior, denoted by DP, with a support over discrete witha countably infinite number of point masses. If G ⇠ DP, this means that a realization from G will look like:

Mea

n pa

ram

eter

Variance parameter

In the next section, we define Dirichlet process more formally.

2.3 First definition of Dirichlet processes

A Dirichlet process has two parameters:

1. A positive real number ↵

0

> 0, called the concentration parameter. We will see later that it can beinterpreted as a precision (inverse variance) parameter.

2. A distribution G

0

: F⌦

! [0, 1] called the base measure. We will see later that it can be interpreted asa mean parameter.

Recall that by the Kolmogorov consistency theorem, in order to guarantee the existence of a stochasticprocess on a probability space (! 0

,F⌦

0), it is enough to provide a consistent definition of what the marginalsof this stochastic process are. As the name suggest, in the case of a Dirichlet process, the marginals areDirichlet distributions:

Definition 2.1 (Dirichlet Process [2]) Let ↵

0

, G

0

be of the types listed above. We say that G : F⌦

0 !(F

⌦

! [0, 1]) is distributed according to the Dirichlet process distribution, denoted by G ⇠ DP(↵0

, G

0

), if for

all measurable partitions of ! , (A1

, . . . , A

K

) (this means that A

k

are events, A

k

2 F , that they are disjoint,

and that their union is equal to ! ), we have:

(G(A1

), G(A2

), . . . , G(AK

)) ⇠ Dir(↵0

G

0

(A1

),↵0

G

0

(A2

), . . . ,↵0

G

0

(AK

)).

Parameter for each datapoint (mutation) obtained by sampling the discrete measure

on the right

Cellular freq. for anatomical sample 2

Cellular

freq. fo

r

anatom

ical sa

mple 1

DP mixture model:

Part 2: Basics of Dirichlet processes 2-3

and let G = ⇡�

✓1 +(1�⇡)�✓2 . Note that G : F

⌦

! [0, 1] and that G(! ) = 1, in other words, G is distribution.Since it is constructed from random variables, it is a random distribution. Here is an example of a realizationof G:

Mea

n pa

ram

eter

Variance parameter

As it is apparent from this figure, the prior over G has support over discrete distributions with two pointmasses.

This is where the Dirichlet process comes in: it is a prior, denoted by DP, with a support over discrete witha countably infinite number of point masses. If G ⇠ DP, this means that a realization from G will look like:

Mea

n pa

ram

eter

Variance parameter

In the next section, we define Dirichlet process more formally.

2.3 First definition of Dirichlet processes

A Dirichlet process has two parameters:

1. A positive real number ↵

0

> 0, called the concentration parameter. We will see later that it can beinterpreted as a precision (inverse variance) parameter.

2. A distribution G

0

: F⌦

! [0, 1] called the base measure. We will see later that it can be interpreted asa mean parameter.

Recall that by the Kolmogorov consistency theorem, in order to guarantee the existence of a stochasticprocess on a probability space (! 0

,F⌦

0), it is enough to provide a consistent definition of what the marginalsof this stochastic process are. As the name suggest, in the case of a Dirichlet process, the marginals areDirichlet distributions:

Definition 2.1 (Dirichlet Process [2]) Let ↵

0

, G

0

be of the types listed above. We say that G : F⌦

0 !(F

⌦

! [0, 1]) is distributed according to the Dirichlet process distribution, denoted by G ⇠ DP(↵0

, G

0

), if for

all measurable partitions of ! , (A1

, . . . , A

K

) (this means that A

k

are events, A

k

2 F , that they are disjoint,

and that their union is equal to ! ), we have:

(G(A1

), G(A2

), . . . , G(AK

)) ⇠ Dir(↵0

G

0

(A1

),↵0

G

0

(A2

), . . . ,↵0

G

0

(AK

)).

sameCellular freq. for

anatomical sample 2

Cellular

freq. fo

r

anatom

ical sa

mple 1

DP mixture

• n: index of the SNV loci

• m: index of sample (biopsy)

• ϕn: cellular prevalence vector of mutation n

• d: depth (total # of reads)

• b: number of mutant

H

H0 ↵

a↵, b↵

�n

bnmdn

m nm ⇡n

m

tm

s

as, bs

n = 1...N

m = 1...M



H |�, H0 ⇠ DP(�, H0)

�n|H ⇠ H

�nm|�n


�nm = (gn

m,N, gnm,R, gn

m,V)

either

bnm|dn

m, �nm, �n


m, �nm, tm))

or


bnm|dn

m, �nm, �n


m, �nm, tm), s)

where

�(�, �, t) =

(1�t)c(gN)Z µ(gN) + t(1��)c(gR)

Z µ(gR)+

t�c(gV)Z µ(gV)

Z = (1 � t)c(gN) + t(1 � �)c(gR)+

t�c(gV)


10



17

ϕ1 ϕ2

311519085023

bd

b|d,ϕ1 ~BeBin

Posterior inference

• MCMC: standard auxiliary variable method for non-conjugate models (‘algorithm 8’)

• Issue: scaling to large # of mutations

• Latest work with Andy Roth and Arnaud Doucet: a particle MCMC split merge algorithm

Example of clustering with the model so far


31

Genotypes• n: index of the SNV loci

• m: index of sample (biopsy)

• ϕn: cellular prevalence vector of mutation n

• d: depth (total # of reads)

• b: number of mutant alleles

• s: overdispersion parameter

• ψ: genotype (∈ {AB, AAB, AABB, ...})

H

H0 ↵

a↵, b↵

�n

bnmdn

m nm ⇡n

m

tm

s

as, bs

n = 1...N

m = 1...M



H |�, H0 ⇠ DP(�, H0)

�n|H ⇠ H

�nm|�n


�nm = (gn

m,N, gnm,R, gn

m,V)

either

bnm|dn

m, �nm, �n


m, �nm, tm))

or


bnm|dn

m, �nm, �n


m, �nm, tm), s)

where

�(�, �, t) =

(1�t)c(gN)Z µ(gN) + t(1��)c(gR)

Z µ(gR)+

t�c(gV)Z µ(gV)

Z = (1 � t)c(gN) + t(1 � �)c(gR)+

t�c(gV)


10



17

ABB AB AABB

Informed priors over genotypes

• Goal: reduce the space of possible genotypes

• Key idea: we can also use germline SNPs to get extra information on the copy number at a loci

• Hijack standard micro-array platforms to get log intensity in the neighborhood of the SNV of interest

SNV (cancer)

SNP (from your parents)


31


31

Clustering of mutations

Without genotypes With genotypes

Full output (consensus)


31

How to validate?

• Cross validation

• Synthetic data (in silico and in vitro)

2 | ADVANCE ONLINE PUBLICATION | NATURE METHODS

BRIEF COMMUNICATIONS

variation events. Third, Bayesian nonparametric clustering is used to discover groupings of mutations and the number of groups simultaneously. This obviates fixing the number of groups a priori and allows for cellular prevalence estimates to reflect uncertainty in this parameter. Fourth, multiple samples from the same cancer may be analyzed jointly to leverage the scenario in which clonal populations are shared across samples. Simulated data sets sys-tematically illustrate improvements in performance of each of the novel modeling components (Supplementary Figs. 12 and 13).

We first evaluated our approach on idealized data sets (Fig. 1) produced from physical mixtures of DNA extracted from four 1000 Genomes Project samples11,12 (Online Methods). Each mix-ture contained DNA in approximate proportions of 0.01, 0.05, 0.20 and 0.74. Specific single-nucleotide variants (SNVs) were ampli-fied using PCR and then sequenced deeply on the Illumina MiSeq platform. This data set simulates a hierarchically related popula-tion with ground truth for quantitative benchmarking (Fig. 1c). We selected positions with variants present in exactly one case,

e f

1.0

b

0.9

0.8

0.7

V-m

easu

re

0.6

0.5

IBMM

IBBMM

Bin-TCN

Bin-PCN

BeBin-

TCN

BeBin-

PCN

1.0

0.8

0.6

Var

iant

alle

lic p

reva

lenc

e

0.4

0A B C D

Mixture

Genotype

0.2

AB (n = 3)BB (n = 3)

1.0

c

0.8

0.6

Exp

ecte

d ce

llula

rpr

eval

ence

0.4

0A B C D

Mixture

0.2

Variant case(s)

NA12878NA18507NA19240NA18507NA19240

NA12156

NA12878NA18507NA19240NA12156NA12878NA18507NA19240

1.0

a

0.9

0.8

0.7

V-m

easu

re

0.6

0.5

IBMM

IBBMM

Bin-TCN

Bin-PCN

BeBin-

TCN

BeBin-

PCN

d1.0

0.8

0.6

IBB

MM

(in

ferr

ed)

cellu

lar

prev

alen

ce

0.4

0A B C D

Mixture

0.2

Cluster

1.0 (n = 2)2.0 (n = 2)3.0 (n = 3)4.0 (n = 3)5.0 (n = 3)6.0 (n = 3)7.0 (n = 2)8.0 (n = 3)9.0 (n = 7)10.0 (n = 7)11.0 (n = 1)12.0 (n = 1)

1.0

0.8

0.6

0.4

0A B C D

Mixture

0.2

BeB

in-P

CN

(in

ferr

ed)

cellu

lar

prev

alen

ce

Cluster

3.0 (n = 4)4.0 (n = 6)

2.0 (n = 3)

6.0 (n = 3)

1.0 (n = 6)

5.0 (n = 14)

7.0 (n = 1)

Figure 1 | Comparison of clustering performance for the mixture of normal-tissue data sets. (a,b) Comparison of methods (10 replicates of 37 mutations) when analyzing a four-mixture experiment separately (a) or jointly (b). IBMM, infinite binomial (Bin) mixture model; IBBMM, infinite beta-binomial (BeBin) mixture model; TCN, with total copy-number priors; PCN, with parental copy-number priors. (c) Expected cellular prevalence of each cluster across the four-mixture experiments. (d,e) Inferred cellular prevalences and clustering using the IBBMM model (d) and PyClone BeBin-PCN model (e). Clusters with homozygous single-nucleotide variants (SNVs) or an equal number of homo- and heterozygous SNVs are indicated by solid lines; clusters with heterozygous SNVs, by dashed lines. (f) Variant allelic prevalence for mutations assigned to cluster 1 by the PyClone BeBin-PCN model. Dashed lines represent heterozygous SNVs; solid lines represent homozygous SNVs. Whiskers (a,b) indicate 1.5× the interquartile range, red bars indicate the median and boxes represent the interquartile range. Error bars (d,e) indicate the mean s.d. (using 9,000 post-‘burn-in’ samples (Online Methods)) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. (d–f) The number of mutations n in each cluster is shown in the legends in parentheses.

a1.0

0.8

0.6

0.4

0.2

0

Var

iant

alle

licpr

eval

ence

A B C D

1.0

0.8

0.6

0.4

0.2

0(Inf

erre

d) c

ellu

lar

prev

alen

ce

A B C D

Cluster1 (n = 4)2 (n = 20)3 (n = 2)4 (n = 6)5 (n = 1)6 (n = 1)7 (n = 9)8 (n = 2)9 (n = 4)

c1:18661418

8:133905955X: 1496390371:1106553972:31589849

2:1726480703:50391111

4:1538313667:44280313

11:4676081612:5036858416:29675929

17:766281019:115692932:1968915508:95186221X:635793247:28858772

1:22605559510:2644631216:61687701X:129283520

Absent Present Unknown

dAbsentPresentUnknown

ControlCancerNormal

Cell type

BeB

in-P

CN

IBB

MM

gDN

A 1

Nuc

leus

41

Nuc

leus

4N

ucle

us 3

9N

ucle

us 3

6N

ucle

us 3

5N

ucle

us 3

4N

ucle

us 3

3N

ucle

us 3

2N

ucle

us 3

1N

ucle

us 9

Nuc

leus

3N

ucle

us 5

Nuc

leus

29

Nuc

leus

22

Nuc

leus

20

Nuc

leus

2N

ucle

us 1

8N

ucle

us 1

7N

ucle

us 1

6

Nuc

leus

14

Nuc

leus

15

Nuc

leus

13

Nuc

leus

10

Nuc

leus

27

Nuc

leus

7N

ucle

us 8

Nuc

leus

1N

ucle

us 1

9

Nuc

leus

30

Nuc

leus

6

IBBMM

b1.0

0.8

0.6

0.4

0.2

0

Var

iant

alle

licpr

eval

ence

A B CSample

D

1.0

0.8

0.6

0.4

0.2

0(Inf

erre

d) c

ellu

lar

prev

alen

ce

A B CSample

D

Cluster1 (n = 25)2 (n = 9)3 (n = 5)4 (n = 7)5 (n = 2)6 (n = 1)

BeBin-PCN

Figure 2 | Joint analysis of multiple samples from high-grade serous ovarian cancer 2. (a,b) Left, variant allelic prevalence for each mutation, color coded by predicted cluster, using IBBMM (a) and PyClone with the BeBin-PCN model (b) to jointly analyze the four samples. Right, inferred cellular prevalence for each cluster using IBBMM (a) and BeBin-PCN methods (b). As in Figure 1, the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutations in the cluster. (c) Presence or absence of variant alleles at target loci in single cells from sample B. Loci with fewer than 40 reads covering them are colored gray. Predicted clusters for each method are shown on the left, with white cells indicating nonsomatic control positions. Row labels indicate hg19 chromosome and chromosome coordinate separated by a colon. (d) Presence or absence of IBBMM clusters in single cells from sample B. Clusters were deemed present if any mutation in the cluster was present. gDNA, bulk genomic DNA control. Error bars (a,b) indicate the mean s.d. (using 50,000 post-‘burn-in’ samples) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. The number of mutations n in each cluster is shown in the legends in parentheses.

Single-cell sequencing

loci

cell

2 | ADVANCE ONLINE PUBLICATION | NATURE METHODS

BRIEF COMMUNICATIONS

variation events. Third, Bayesian nonparametric clustering is used to discover groupings of mutations and the number of groups simultaneously. This obviates fixing the number of groups a priori and allows for cellular prevalence estimates to reflect uncertainty in this parameter. Fourth, multiple samples from the same cancer may be analyzed jointly to leverage the scenario in which clonal populations are shared across samples. Simulated data sets sys-tematically illustrate improvements in performance of each of the novel modeling components (Supplementary Figs. 12 and 13).

We first evaluated our approach on idealized data sets (Fig. 1) produced from physical mixtures of DNA extracted from four 1000 Genomes Project samples11,12 (Online Methods). Each mix-ture contained DNA in approximate proportions of 0.01, 0.05, 0.20 and 0.74. Specific single-nucleotide variants (SNVs) were ampli-fied using PCR and then sequenced deeply on the Illumina MiSeq platform. This data set simulates a hierarchically related popula-tion with ground truth for quantitative benchmarking (Fig. 1c). We selected positions with variants present in exactly one case,

e f

1.0

b

0.9

0.8

0.7

V-m

easu

re

0.6

0.5

IBMM

IBBMM

Bin-TCN

Bin-PCN

BeBin-

TCN

BeBin-

PCN

1.0

0.8

0.6

Var

iant

alle

lic p

reva

lenc

e

0.4

0A B C D

Mixture

Genotype

0.2

AB (n = 3)BB (n = 3)

1.0

c

0.8

0.6

Exp

ecte

d ce

llula

rpr

eval

ence

0.4

0A B C D

Mixture

0.2

Variant case(s)

NA12878NA18507NA19240NA18507NA19240

NA12156

NA12878NA18507NA19240NA12156NA12878NA18507NA19240

1.0

a

0.9

0.8

0.7

V-m

easu

re

0.6

0.5

IBMM

IBBMM

Bin-TCN

Bin-PCN

BeBin-

TCN

BeBin-

PCN

d1.0

0.8

0.6

IBB

MM

(inf

erre

d) c

ellu

lar

prev

alen

ce

0.4

0A B C D

Mixture

0.2

Cluster

1.0 (n = 2)2.0 (n = 2)3.0 (n = 3)4.0 (n = 3)5.0 (n = 3)6.0 (n = 3)7.0 (n = 2)8.0 (n = 3)9.0 (n = 7)10.0 (n = 7)11.0 (n = 1)12.0 (n = 1)

1.0

0.8

0.6

0.4

0A B C D

Mixture

0.2

BeB

in-P

CN

(inf

erre

d)ce

llula

r pre

vale

nce Cluster

3.0 (n = 4)4.0 (n = 6)

2.0 (n = 3)

6.0 (n = 3)

1.0 (n = 6)

5.0 (n = 14)

7.0 (n = 1)

Figure 1 | Comparison of clustering performance for the mixture of normal-tissue data sets. (a,b) Comparison of methods (10 replicates of 37 mutations) when analyzing a four-mixture experiment separately (a) or jointly (b). IBMM, infinite binomial (Bin) mixture model; IBBMM, infinite beta-binomial (BeBin) mixture model; TCN, with total copy-number priors; PCN, with parental copy-number priors. (c) Expected cellular prevalence of each cluster across the four-mixture experiments. (d,e) Inferred cellular prevalences and clustering using the IBBMM model (d) and PyClone BeBin-PCN model (e). Clusters with homozygous single-nucleotide variants (SNVs) or an equal number of homo- and heterozygous SNVs are indicated by solid lines; clusters with heterozygous SNVs, by dashed lines. (f) Variant allelic prevalence for mutations assigned to cluster 1 by the PyClone BeBin-PCN model. Dashed lines represent heterozygous SNVs; solid lines represent homozygous SNVs. Whiskers (a,b) indicate 1.5× the interquartile range, red bars indicate the median and boxes represent the interquartile range. Error bars (d,e) indicate the mean s.d. (using 9,000 post-‘burn-in’ samples (Online Methods)) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. (d–f) The number of mutations n in each cluster is shown in the legends in parentheses.

a1.0

0.8

0.6

0.4

0.2

0

Var

iant

alle

licpr

eval

ence

A B C D

1.0

0.8

0.6

0.4

0.2

0(Infe

rred

) cel

lula

rpr

eval

ence

A B C D

Cluster1 (n = 4)2 (n = 20)3 (n = 2)4 (n = 6)5 (n = 1)6 (n = 1)7 (n = 9)8 (n = 2)9 (n = 4)

c1:18661418

8:133905955X: 1496390371:1106553972:31589849

2:1726480703:50391111

4:1538313667:44280313

11:4676081612:5036858416:29675929

17:766281019:115692932:1968915508:95186221X:635793247:28858772

1:22605559510:2644631216:61687701X:129283520

Absent Present Unknown

dAbsentPresentUnknown

ControlCancerNormal

Cell type

BeB

in-P

CN

IBB

MM

gDN

A 1

Nuc

leus

41

Nuc

leus

4N

ucle

us 3

9N

ucle

us 3

6N

ucle

us 3

5N

ucle

us 3

4N

ucle

us 3

3N

ucle

us 3

2N

ucle

us 3

1N

ucle

us 9

Nuc

leus

3N

ucle

us 5

Nuc

leus

29

Nuc

leus

22

Nuc

leus

20

Nuc

leus

2N

ucle

us 1

8N

ucle

us 1

7N

ucle

us 1

6

Nuc

leus

14

Nuc

leus

15

Nuc

leus

13

Nuc

leus

10

Nuc

leus

27

Nuc

leus

7N

ucle

us 8

Nuc

leus

1N

ucle

us 1

9

Nuc

leus

30

Nuc

leus

6

IBBMM

b1.0

0.8

0.6

0.4

0.2

0

Var

iant

alle

licpr

eval

ence

A B CSample

D

1.0

0.8

0.6

0.4

0.2

0(Infe

rred

) cel

lula

rpr

eval

ence

A B CSample

D

Cluster1 (n = 25)2 (n = 9)3 (n = 5)4 (n = 7)5 (n = 2)6 (n = 1)

BeBin-PCN

Figure 2 | Joint analysis of multiple samples from high-grade serous ovarian cancer 2. (a,b) Left, variant allelic prevalence for each mutation, color coded by predicted cluster, using IBBMM (a) and PyClone with the BeBin-PCN model (b) to jointly analyze the four samples. Right, inferred cellular prevalence for each cluster using IBBMM (a) and BeBin-PCN methods (b). As in Figure 1, the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutations in the cluster. (c) Presence or absence of variant alleles at target loci in single cells from sample B. Loci with fewer than 40 reads covering them are colored gray. Predicted clusters for each method are shown on the left, with white cells indicating nonsomatic control positions. Row labels indicate hg19 chromosome and chromosome coordinate separated by a colon. (d) Presence or absence of IBBMM clusters in single cells from sample B. Clusters were deemed present if any mutation in the cluster was present. gDNA, bulk genomic DNA control. Error bars (a,b) indicate the mean s.d. (using 50,000 post-‘burn-in’ samples) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. The number of mutations n in each cluster is shown in the legends in parentheses.

Issue: loss of SNVs

Application: reconstructing ancient languages

Examples of language change

Beowulf (~ 8th - 11th century) *dekm Proto-Indo-European-6000 years before present?

tehun Proto-Germanic-2000 years before present

tȳne ætsomneten

tien

‘ten altogether’

Old English

(Modern) English

-1000 years before present

present

exp

�⇧

⇤⌥

(s,s�)⇥d

⇥�, T (s, s�)⇤ � ⇥ ⌅� � ⌅(t)⌅22 � A(⇥)

⇥⌃

⌅ ⇤t(d)

Examples of language change

*dekm Proto-Indo-European> 6000 years before present?

tehun Proto-Germanic2000 years before present

ten

tien Old English

(Modern) English

1000 years before present

present

Deletion: /i/ > ∅

Insertion: /e/ > /ie/; Deletion: /h/ > ∅

Substitutions: /d/ > /t/, /k/ > /h/,...

exp

�⇧

⇤⌥

(s,s�)⇥d

⇥�, T (s, s�)⇤ � ⇥ ⌅� � ⌅(t)⌅22 � A(⇥)

⇥⌃

⌅ ⇤t(d)

exp

�⇧

⇤⌥

(s,s�)⇥d

⇥�, T (s, s�)⇤ � ⇥ ⌅� � ⌅(t)⌅22 � A(⇥)

⇥⌃

⌅ ⇤t(d)

Example of data

Austronesian dataset:

Size: 706 languages, 150k word forms in IPA

‘fish’ ‘fear’Hawaiian iʔa makaʔu

Samoan iʔa mataʔu

Tongan ikaProto-Oceanic *ika *mataku

...

...

[Greenhill et al, ’08]

Oceanic language

HawaiianSamoanTonganMaori

Task: comparative reconstruction

‘fish’ Hawaiian iʔa

Samoan iʔa

Tongan ika

Maori ika


Proto-Oceanic


Samoan iʔa

Tongan ika

Maori ika


Proto-Oceanic

‘fish’ POc *ika

*k > ʔ‘fish’

Hawaiian iʔa

Samoan iʔa

Tongan ika

Maori ika


Proto-Oceanic

‘fish’ POc *iʔa

or

‘fish’ POc *ika

*ʔ > k


Samoan iʔa

Tongan ika

Maori ika


Proto-Oceanic

‘fish’ POc *iʔa

‘fish’ POc *ika


Samoan iʔa

Tongan ika

Maori ika

or

...

Previous work/main stream approach: manually

reconstructQuestion: Can we leverage machine learning methods?

Can we harness more languages?


Samoan iʔa

Tongan ika

Maori ika

Geser ikan

Rapanui ika

Nukuoro iga

Niue ika

Task 2: cognate inference

Fact: New words are invented & old words fall out of usage

Consequence: not all words forms for a given concept will have cousins in related languages.

Hawaiian (G) iʔa makaʔu

Samoan (S) iʔa mataʔu

Tongan (T) ika manavahē

Maori (M) ika mataku

Column: ‘cognate

set’ (think as a cluster)

‘fish’ ‘fear’ ‘fear’

Task 2: cognate inference

Hawaiian iʔa, makaʔu, ...

Samoan iʔa, makaʔu, ...

Tongan ika, manavahē, ...

Maori ika, mataku, ...

Hawaiian iʔa makaʔu

Samoan iʔa mataʔu

Tongan ika manavahē

Maori ika mataku

or ...Hawaiian iʔa makaʔu

Samoan iʔa mataʔu

Tongan ika manavahē

Maori ika mataku

Task 3: tree inference



or ...

Model-based approach

X1

X2X3

X4 X5

X2|X1 � Mut(X1; �)

Xi|Xpar(i) � Mut(Xpar(i); �)

X3|X1 � Mut(X1; �)

...

Input: modern wordsProbabilistic

models of change

Hawaiian

Samoan

Tongan

Maori

Hawaiian

Samoan

Tongan

Maori

or...

‘fish’

POc *i!a

‘fish’

POc *ika

or

...

‘fish’ (1)

Hawaiian i!a

Samoan i!a

Tongan ika

Marshallese yapil

or

‘fish’ (1) ‘fish’ (2)

Hawaiian i!a

Samoan i!a

Tongan ika

Marshallese yapil

...

Reconstruction

Cognates

Trees

‘fish’ ‘fear’

Hawaiian i!a maka!u

Samoan i!a mata!u

Tongan ika

Motivation

Why reconstruct?

! Can answer a large number of questions about our past

! Learn about ancient populations’ migrations

! Decipherment of ancient scripts

Motivation: typology

! What are the statistical regularities of sound change?

! Role of functional load in sound change

! Useful for synchronic typology as well

! Is an observed pattern the result of structural constraints, or the result of common descent?

Motivation: machine translation (MT)

! Long-term challenge: usable machine translation between all pairs of languages

! Existing MT rely on large corpora in both languages in the pair,

! in particular, bitexts (e.g. europarl)

! Concrete step: large scale cognate detection

Background

International Phonetic Alphabet


Samoan iʔa mataʔu

Tongan ikaProto-Oceanic ika mataku

Instead of {A,C,G,T}, here the alphabet is a set of basic sound units (phonemes):

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

!

"

œ

#

$

% &

'

(

)

*

+

,

-

f

.

gd

e

/

b

a

n

o

m

0

k

h

i

1

v

u

2

t

s

3

ø

qp 4

5

6 7

z

y

8

x

Sound changes

*dekm Proto-Indo-European-6000 years before present?

tehun Proto-Germanic-2000 years before present

ten

tien Old English

(Modern) English

-1000 years before present

present

English ten

Danish ti

Greek deka

Italian dieci

French dix

Sanskrit dasan

Proto-Indo-European

Proto-Germanic

/d/ > /t/

Regularity in sound change

English ten tooth

Danish ti tand

Greek deka donti

Italian dieci dente

French dix dent

Sanskrit dasan danta

Proto-Indo-European

/d/ > /t/ in word initial position

! Probabilistic models of inventory changes? ! Open computational

problemSWEDISH

BAFFA_PASHTOLITHUANIAN

FRENCHHINDI

DUTCHASSAMESE

BRETONITALIAN

BHOIPURICORNISHCZECH

GUJARATIBENGALISPANISH

EASTERN_FARSIGILAKIWELSH

PORTUGUESEKURDISH_CENTRALSTANDARD_GERMAN

SARDINIANBELARUSIAN

LATVIANPOLISH

ENGLISHALSATIAN

BULGARIANROMANIANCATALAN

BERNESE_GERMANBANNU_PASHTO

RUSSIAN

3 5 8 C E L N S T X Z a b c d e f g h i j k l m n o p q r s t u v w x y z

0.0

0.2

0.4

0.6

0.8

1.0

Modelling phonological change

iʔa iʔa ika ika

Simplifying assumptions


Samoan iʔa mataʔu

Tongan ika

Maori ika mataku

POc

H. S. T. M.

Fixed tree Fixed cognate sets

iʔa iʔa ika ika

Graphical model


Samoan iʔa mataʔu

Tongan ika

Maori ika mataku

POc

H. S. T. M.

Graphical model


Samoan iʔa mataʔu

Tongan ika

Maori ika matakumakaʔu makaʔu makaku

POc

H. S. T. M.

Graphical model


Samoan iʔa mataʔu

Tongan ika

Maori ika mataku

POc

H. S. T. M.

Key distribution: Models diachronic word mutation

X

Y

Y |X � Mut(X)

Modeling string mutation

! What kind of string mutations need to be captured?

! Substitution

*k > ʔ‘fish’ ‘voice’

Hawaiian iʔa leo

Samoan iʔa leo

Tongan ika leʔo

Maori ika reo

! Deletion (& insertions)

*ʔ > Ø

! Contextualized change

*ʔ > Ø / V _ V

taŋgis

taŋgi

taŋgi aŋgi

taŋgis

angi

String transducer

t a

a

ŋ

n g

i

i

s#

#

#

#

‘to cry’

t a

a

ŋ

n g

i

i

s#

#

#

#?

ŋ

String transducer

}

t a

a

ŋ

n g

i

i

s#

#

#

#?

ŋ

θS

String transducer

θS : Substitution/Deletion Parameters

t a

a

ŋ

n g

i

i

s#

#

#

#

ŋ

n ?

θS

θI

String transducer

θI : Insertion Parameters

t a

a

ŋ

n g

i

i

s#

#

#

#

ŋ

n g

θS

θI

?

String transducer

Bayesian inference

X1, X2|X3, X4, X5

Computational bottleneck

X1

X2X3

X4 X5

More difficult: string-valued expectations

Parameters fitting: can be approached using standard tools POc

H. S. T. M.

θθ θθ θθ

λ

Various MCMC algorithms

T A G G

T A G C

T A C G C

C A G G

C A C G G

T A G G

T A G C

T A C G C

C A G G

C A C G G

Gibbs sampling

Ancestry resampling

Thin vertical section

Sequential Monte Carlo• SMC: An alternative to

MCMC

• Approximates posterior with particles and weights

• D&C SMC: generalization more suitable to phylogenies

π1 π2 π3

π7

...

π6 π5

π4 π3 π2 π1

Standard SMC samplers

Divide and Conquer

(D&C) SMC samplers• tinyurl.com/divide-and-conquer-smc

• Fredrik Lindsten, Adam M. Johansen, Christian A. Naesseth, Bonnie Kirkpatrick, Thomas B. Schön, John Aston, and Alexandre Bouchard-Côté. (2016) Divide-and-Conquer with Sequential Monte Carlo. JCSG, In Press.

D&C SMC: notation

...

...

yl

yr

xr

xl

xt

xl

xr

xtyt

1. Assume inductively we have &

...

......

2. Sample from the product measure:

(xil, x

ir) ⇠ ⇡l ⇥ ⇡r

⇡t

⇡l

⇡r

- amounts to two independent multinomial sampling steps

D&C (SIR): ⇡l ⇡r

1. Assume inductively...

...

......

2. Sample from the product measure:

⇡t

⇡l

⇡r

3. Propose (extend):

(xil, x

ir)

x

it|(xi

l, xir) ⇠ qt(·|xi

l, xir)

Concatenate:x

it =

(xit,�x

il�,�x

ir�)

?

D&C (SIR):


...

......

2. Sample from the product measure

⇡t

⇡l

⇡r

3. Propose (extend)

(xil, x

ir)

4. Reweigh:

x

it

w

it =

⇡t(xit)

⇡l(xil)⇡r(xi

r)

1

qt(xit|xi

l, xir)

D&C (SIR):


...

......

2. Sample from the product measure

⇡t

⇡l

⇡r

3. Propose (extend)

4. Reweigh

Repeat for each particle

D&C (SIR):

Potential applications• Inference over non-local/catastrophic events (explicit

sound change operators)

• Joint cognate inference

• Statistical multiple sequence alignment

• Relaxed clock models

• Many other models that would otherwise have to be handled with very expensive methods such as Approximate Bayesian Computation

Numerical examples

Validation methodology

Evaluation criterion: Edit distance Smallest number of substitutions, insertions and deletions needed to go from one string to the other

Holding-out the manual reconstructions

e.g.: d( /ika/ , /ga/ ) = 2 /ika/ ➔ /iga/ ➔ /ga/


Samoan iʔa mataʔu

Tongan ikaProto-Oceanic *ika *mataku


Samoan iʔa mataʔu

Tongan ikaProto-Oceanic ??? ???

Datasets used for reconstruction

! Austronesian languages dataset ! 706 languages, 150k words ! 17 annotated reconstructions ! Ideal for testing large scale reconstruction ! Has continued to grow since the time experiments were

performed

[Greenhill et al. 08]

Comparisons in large phylogenies

Mean error (edit distance)

Comparisons on Proto-Oceanic and Proto-Austronesian

0

0.125

0.250

0.375

0.500

Proto-Oceanic Proto-Austronesian

YakanBajoMapunSamalS ias iInabaknonSubanunSinSubanonSioWes ternBukManoboWes tManoboIliaManoboDibaManoboAtadManoboAtauManoboTigwManoboKalaManoboSaraBinukidMaranaoIranunSasakBaliMamanwaIlonggoHiligaynon

WarayWarayButuanonTausugJoloSurigaononAklanonBisCebuanoKalaganMansakaTagalogAnt

BikolNagaC

TagbanwaKa

TagbanwaAb

BatakPalaw

PalawanBat

BanggiSWPalawano

HanunooPalauanSinghiDayakBakat

Katingan

DayakNgaju

KadorihMerinaM

ala

Maanyan

Tunjung

Kerinc iMelayuSara

Melayu

OganLowMalay

Indones ian

BanjareseM

MalayBahas

MelayuBrun

Minangkaba

PhanRangCh

ChruHainanCham

Moken

IbanGayoIdaan

BunduDusun

TimugonM

ur

KelabitBar

Bintulu

BerawanLon

Belait

KenyahLong

MelanauMuk

Lahanan

EngganoMal

EngganoBan

Nias

TobaBatak

Madurese

IvatanBasc

Itbayat

Babuyan

Iraralay

Itbayaten

Isamorong

Ivasay

Imorod

Yami

SambalBoto

Kapampanga

IfugaoBayn

KallahanKa

KallahanKe

Inibaloi

Pangasinan

KakidugenI

IlongotKak

IfugaoAmga

IfugaoBata

BontokGuin

BontocGuin

KankanayNo

Balangaw

KalingaGui

ItnegBinon

Gaddang

AttaPamplo

IsnegDibag

Agta

DumagatCas

Ilokano

Yogya

OldJavanes

WesternMalayoPolynesian

BugineseSo

Maloh

Makassar

TaeSToraja

SangirTabu

Sangir

SangilSara

Bantik

Kaidipang

GorontaloH

BolaangMon

Popalia

Bonerate

Wuna

MunaKatobu

Tontemboan

Tonsea

BanggaiWdi

Baree

Wolio

Mori

KayanUmaJu

Bukat

PunanKelai

Wampar

Yabem

NumbamiSib

Mouk

Aria

LamogaiMul

Amara

Sengseng

KaulongAuV

Bilibil

Megiar

Gedaged

Matukar

Biliau

Mengen

Maleu

Kove

Tarpia

Sobei

KayupulauK

Riwo

KairiruKis

WogeoAli

Auhelawa

Saliba

Suau

Maisin

Gumawana

Diodio

Molima

Ubir

Gapapaiwa

Wedau

Dobuan

Misima

Kilivila

Gabadi

Doura

Kuni

Roro

Mekeo

Lala

Vilirupu

Motu

MagoriSout

Kandas

Tanga

Siar

Kuanua

Patpatar

Roviana

Simbo

Nduke

Kusaghe

Hoava

LunggaLuqa

Kubokota

Ughele

Ghanongga

Vangunu

Marovo

MbarekeSo

los

Taiof

Teop

NehanHape

Nis sanHaku

Sis ingga

BabatanaTu

VarisiGhon

VarisiRi

rioSengga

BabatanaAv

Vaghua

Babatana

BabatanaLo

BabatanaKa

LungaLungaMono

MonoAluTorauUruava

MonoFauro

Bilur

ChekeHolo

MaringeKm

a

MaringeTat

NggaoPoro

MaringeLel

Zabana

Kia

LaghuS

amas

Blabla

ngaBlabla

ngaGKok

otaKilokak

aYsBanoniMadara

Tungag

TungKara

WestNalikTiangTigakL ihir

Sungl

MadakLam

asBarok

MaututuNakan

aiBilLakala

iVitu

YapeseMbugho

tuDhBugotu

Talis eMoliTalis eMala

GhariNdiGhariTolo

Talis eMalango

GhariNggaeMbirao

GhariTandaTalis ePoleGhariNggerTalis eKooGhariNgini

Lengo

LengoGhaim

LengoParipNggela

ToambaitaKwai

LangalangaKwaio

MbaengguuLau

Mbaelelea

KwaraaeSol

Fataleka

LauWalade

LauNorth

Longgu

SaaUkiNiMa

SaaSaaVill

Oroha

SaaUlawa

Saa

Dorio

AreareMaas

AreareWaia

SaaAuluVil

BauroBaroo

BauroHaunu

Fagani

KahuaMami

SantaCatal

Aros iOneib

Aros iTawat

Aros i

Kahua

BauroPawaV

FaganiAguf

FaganiRihu

SantaAna

Tawaroga

TannaSouth

Kwamera

Lenakel

SyeErroman

Ura

ArakiSouth

Merei

Ambrym

Sout

Mota

PeteraraMa

PaameseSou

Mwotlap

Raga

SouthEfate

Orkon

Nguna

Namakir

Nati

Nahavaq

Nese

Tape

AnejomAnei

SakaoPortO

Avava

Neveei

Naman

Puluwatese

PuloAnnan

Woleaian

Satawalese

ChuukeseAK

Mortlockes

Carolinian

Chuukese

Sonsoroles

SaipanCaro

PuloAnna

Woleai

Mokilese

Pingilapes

Ponapean

Kiribati

Kusaie

Marshalles

Nauru

Nelemwa

Jawe

Canala

Iaai

Nengone

Dehu

Rotuman

WesternFij

Maori

Tahiti

Penrhyn

Tuamotu

Manihiki

Rurutuan

Tahitia

nMo

Rarotongan

Tahitia

nth

Marquesan

Mangareva

Hawaiian

RapanuiEas

Samoan

Bellona

Tikopia

Emae

Anuta

Rennellese

FutunaAniw

IfiraMeleM

UveaWest

FutunaEast

VaeakauTau

Takuu

Tuvalu

Sikaiana

Kapingamar

Nukuoro

Luangiua

Pukapuka

Tokelau

UveaEast

Tongan

Niue

Fijia

nBau

Teanu

Vano

Tanema

Buma

Asumboa

Tanimbili

Nembao

Seimat

Wuvulu

Lou

Nauna

Leipon

SivisaTita

Likum

Levei

Loniu

Mussau

KasiraIrah

Giman

Buli

Mor

Minyaifuin

BigaMisool

As

Waropen

Numfor

Marau

AmbaiYapen

WindesiWan

NorthBabar

DaweraDawe

Dai

TelaMasbua

Imroing

Emplawas

EastMasela

Serili

CentralMas

SouthEastB

WestDamar

TaranganBa

UjirNAru

NgaiborSAr

Geser

Watubela

ElatKeiBes

HituAmbon

SouAmanaTe

Amahai

Paulohi

Alune

MurnatenAl

Bonfia

Werinama

BuruNamrol

Soboyo

Mamboru

Manggarai

Ngadha

Pondok

Kodi

WejewaTana

Bima

Soa

Savu

Wanukaka

GauraNggau

Eas tSumban

Baliledo

Lamboya

L ioFloresT

Ende

Nage

Kambera

PalueNitun

Anakalang

Sekar

PulauArgun

RotiTerman

Atoni

Mambai

TetunTerik

Kemak

Lamalerale

LamaholotI

S ika

Kedang

Erai

Talur

Aputai

Perai

Tugun

Iliun

Serua

Nila

Teun

Kisar

Roma

Eas tD

amar

Letinese

Yamd

ena

KeiTan

imba

Selaru

Koiwai

Iria

Chamo

rro

Lampun

g

Komerin

g

Sunda

RejangR

eja

TboliTa

gab

Tagabil

i

Koronad

alB

Sarangan

iB

BilaanKor

o

BilaanSar

a

CiuliAtaya

SquliqAtay

Sediq

Bunun

ThaoFavorla

ng

RukaiPaiwan

Kanakanabu

SaaroaTsouPuyumaKavalanBasaiCentralAmiS irayaPazehSais iat

YakanBajoMapunSamalS ias iInabaknonSubanunSinSubanonSioWes ternBukManoboWes tManoboIliaManoboDibaManoboAtadManoboAtauManoboTigwManoboKalaManoboSaraBinukidMaranaoIranunSasakBaliMamanwaIlonggoHiligaynon

WarayWarayButuanonTausugJoloSurigaononAklanonBisCebuanoKalaganMansakaTagalogAnt

BikolNagaC

TagbanwaKa

TagbanwaAb

BatakPalaw

PalawanBat

BanggiSWPalawano

HanunooPalauanSinghiDayakBakat

Katingan

DayakNgaju

KadorihMerinaM

ala

Maanyan

Tunjung

Kerinc iMelayuSara

Melayu

OganLowMalay

Indones ian

BanjareseM

MalayBahas

MelayuBrun

Minangkaba

PhanRangCh

ChruHainanCham

Moken

IbanGayoIdaan

BunduDusun

TimugonM

ur

KelabitBar

Bintulu

BerawanLon

Belait

KenyahLong

MelanauMuk

Lahanan

EngganoMal

EngganoBan

Nias

TobaBatak

Madurese

IvatanBasc

Itbayat

Babuyan

Iraralay

Itbayaten

Isamorong

Ivasay

Imorod

Yami

SambalBoto

Kapampanga

IfugaoBayn

KallahanKa

KallahanKe

Inibaloi

Pangasinan

KakidugenI

IlongotKak

IfugaoAmga

IfugaoBata

BontokGuin

BontocGuin

KankanayNo

Balangaw

KalingaGui

ItnegBinon

Gaddang

AttaPamplo

IsnegDibag

Agta

DumagatCas

Ilokano

Yogya

OldJavanes

WesternMalayoPolynesian

BugineseSo

Maloh

Makassar

TaeSToraja

SangirTabu

Sangir

SangilSara

Bantik

Kaidipang

GorontaloH

BolaangMon

Popalia

Bonerate

Wuna

MunaKatobu

Tontemboan

Tonsea

BanggaiWdi

Baree

Wolio

Mori

KayanUmaJu

Bukat

PunanKelai

Wampar

Yabem

NumbamiSib

Mouk

Aria

LamogaiMul

Amara

Sengseng

KaulongAuV

Bilibil

Megiar

Gedaged

Matukar

Biliau

Mengen

Maleu

Kove

Tarpia

Sobei

KayupulauK

Riwo

KairiruKis

WogeoAli

Auhelawa

Saliba

Suau

Maisin

Gumawana

Diodio

Molima

Ubir

Gapapaiwa

Wedau

Dobuan

Misima

Kilivila

Gabadi

Doura

Kuni

Roro

Mekeo

Lala

Vilirupu

Motu

MagoriSout

Kandas

Tanga

Siar

Kuanua

Patpatar

Roviana

Simbo

Nduke

Kusaghe

Hoava

LunggaLuqa

Kubokota

Ughele

Ghanongga

Vangunu

Marovo

MbarekeSo

los

Taiof

Teop

NehanHape

Nis sanHaku

Sis ingga

BabatanaTu

VarisiGhon

VarisiRi

rioSengga

BabatanaAv

Vaghua

Babatana

BabatanaLo

BabatanaKa

LungaLungaMono

MonoAluTorauUruava

MonoFauro

Bilur

ChekeHolo

MaringeKm

a

MaringeTat

NggaoPoro

MaringeLel

Zabana

Kia

LaghuS

amas

Blabla

ngaBlabla

ngaGKok

otaKilokak

aYsBanoniMadara

Tungag

TungKara

WestNalikTiangTigakL ihir

Sungl

MadakLam

asBarok

MaututuNakan

aiBilLakala

iVitu

YapeseMbugho

tuDhBugotu

Talis eMoliTalis eMala

GhariNdiGhariTolo

Talis eMalango

GhariNggaeMbirao

GhariTandaTalis ePoleGhariNggerTalis eKooGhariNgini

Lengo

LengoGhaim

LengoParipNggela

ToambaitaKwai

LangalangaKwaio

MbaengguuLau

Mbaelelea

KwaraaeSol

Fataleka

LauWalade

LauNorth

Longgu

SaaUkiNiMa

SaaSaaVill

Oroha

SaaUlawa

Saa

Dorio

AreareMaas

AreareWaia

SaaAuluVil

BauroBaroo

BauroHaunu

Fagani

KahuaMami

SantaCatal

Aros iOneib

Aros iTawat

Aros i

Kahua

BauroPawaV

FaganiAguf

FaganiRihu

SantaAna

Tawaroga

TannaSouth

Kwamera

Lenakel

SyeErroman

Ura

ArakiSouth

Merei

Ambrym

Sout

Mota

PeteraraMa

PaameseSou

Mwotlap

Raga

SouthEfate

Orkon

Nguna

Namakir

Nati

Nahavaq

Nese

Tape

AnejomAnei

SakaoPortO

Avava

Neveei

Naman

Puluwatese

PuloAnnan

Woleaian

Satawalese

ChuukeseAK

Mortlockes

Carolinian

Chuukese

Sonsoroles

SaipanCaro

PuloAnna

Woleai

Mokilese

Pingilapes

Ponapean

Kiribati

Kusaie

Marshalles

Nauru

Nelemwa

Jawe

Canala

Iaai

Nengone

Dehu

Rotuman

WesternFij

Maori

Tahiti

Penrhyn

Tuamotu

Manihiki

Rurutuan

Tahitia

nMo

Rarotongan

Tahitia

nth

Marquesan

Mangareva

Hawaiian

RapanuiEas

Samoan

Bellona

Tikopia

Emae

Anuta

Rennellese

FutunaAniw

IfiraMeleM

UveaWest

FutunaEast

VaeakauTau

Takuu

Tuvalu

Sikaiana

Kapingamar

Nukuoro

Luangiua

Pukapuka

Tokelau

UveaEast

Tongan

Niue

Fijia

nBau

Teanu

Vano

Tanema

Buma

Asumboa

Tanimbili

Nembao

Seimat

Wuvulu

Lou

Nauna

Leipon

SivisaTita

Likum

Levei

Loniu

Mussau

KasiraIrah

Giman

Buli

Mor

Minyaifuin

BigaMisool

As

Waropen

Numfor

Marau

AmbaiYapen

WindesiWan

NorthBabar

DaweraDawe

Dai

TelaMasbua

Imroing

Emplawas

EastMasela

Serili

CentralMas

SouthEastB

WestDamar

TaranganBa

UjirNAru

NgaiborSAr

Geser

Watubela

ElatKeiBes

HituAmbon

SouAmanaTe

Amahai

Paulohi

Alune

MurnatenAl

Bonfia

Werinama

BuruNamrol

Soboyo

Mamboru

Manggarai

Ngadha

Pondok

Kodi

WejewaTana

Bima

Soa

Savu

Wanukaka

GauraNggau

Eas tSumban

Baliledo

Lamboya

L ioFloresT

Ende

Nage

Kambera

PalueNitun

Anakalang

Sekar

PulauArgun

RotiTerman

Atoni

Mambai

TetunTerik

Kemak

Lamalerale

LamaholotI

S ika

Kedang

Erai

Talur

Aputai

Perai

Tugun

Iliun

Serua

Nila

Teun

Kisar

Roma

Eas tD

amar

Letinese

Yamd

ena

KeiTan

imba

Selaru

Koiwai

Iria

Chamo

rro

Lampun

g

Komerin

g

Sunda

RejangR

eja

TboliTa

gab

Tagabil

i

Koronad

alB

Sarangan

iB

BilaanKor

o

BilaanSar

a

CiuliAtaya

SquliqAtay

Sediq

Bunun

ThaoFavorla

ng

RukaiPaiwan

Kanakanabu

SaaroaTsouPuyumaKavalanBasaiCentralAmiS irayaPazehSais iat

Baseline: sampling from the observed wordsThis work

Distance between two linguists’ reconstructions

[Blust ’93, Pawley ’09] [Blust ’99]

Examples of reconstruction

Known Modern Languages Reconstructed Ancestors

Gloss Fijian Pazeh Melanau Inabaknon Manual Automated �

star kalokalo mintol biten bitu’on *bituqen *bituqen 0to hold taura ma:raP magem kumkom *gemgem *gemgemhouse vale xumaP lebuP ruma *öumaq *öumaqbird manumanu aiam manuk manok *qayam *qayamto cut, hack tata ta:tatak tutek hadhad *taöaq *taöaqat e N/A gaP N/A *i *iwhat? cava Paxai uaP inew ay *nanu *anu 1this oqo Pimini itew yayto *ini *aniwind cagi var@ paNay bariyo *bali *beliu 2

Figure 1: Examples for each types of data processed by our system. The main input is theIPA representation of the pronunciation of basic concepts for a collection of related languages.This is shown in the phonology section of the table on the left. Optionally, our system can alsouse partial or complete clustering of these word forms into cognate sets. The cognates sets,shown in the Cognates section of the left table, are clusters of word forms all descending from aproto-word form (they can be equivalently represented by a binary matrix where there an entryis equal to one if two word are in the same cognate set, and zero otherwise). Our system canalso hypothesize or refine these cognate sets when the information is missing. In the table onthe right, we show a random sample of reconstructions produced by the system, stratified byLevenstein distance (�) to a reference reconstruction, in this case the reconstruction of (8).More detailed reconstructions can be found in (2) [[ref to sup mat]]. The baseline consists inpicking one of the modern forms uniformly at random.

came French /SaZ/ ‘change’. Still, several factors obscure these regularities. In particular, sound

changes are often context sensitive: Latin /kor/ ‘heart’ with French reflex /kœK/ illustrates that

the place of articulation change occurred only before the vowel /a/.

In this paper, we present the first automated system capable of large-scale reconstruction of

protolanguages. This system is based on a probabilistic model of sound change, building on re-

cent work on the reconstruction of ancestral sequences in computational biology (9,10). Several

groups have recently explored how methods from computational biology can be applied to prob-

lems in historical linguistics, but this work has focused on identifying the relationships between

languages (as might be expressed in a phylogeny) rather than reconstructing the languages them-

selves (11–15). Much of this work has been based on binary cognate matrices, which discard

all information about the form that words take, simply indicating whether they are cognate. To

illustrate the difference between the representation of the data used by these models and ours, it

3

Stratified sampling by edit

distance

Colors here indicate

cognate sets

Statistical/computational phylogeneticsbouchard/pub/phylo-oxford-lecture.pdf · Evolution of...

Documents

Transcript of Statistical/computational phylogeneticsbouchard/pub/phylo-oxford-lecture.pdf · Evolution of...