Reconstruction of species trees from gene trees using...

Post on 15-May-2020

11 views 0 download

Transcript of Reconstruction of species trees from gene trees using...

Reconstruction of species trees from gene trees using ASTRAL

Siavash Mirarab University of California, San Diego (ECE)

sour

ce: h

ttp://

ww

w.ev

ogen

eao.

com

/

in collaboration with Warnow lab (UIUC)

Phylogenomics

2

More data (#genes)

Inference error

“gene” here refers to a portion of the genome (not a functional gene)

gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000gene 1

Orangutan

Gorilla

Chimpanzee

Human

Gene tree discordance

3

Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human

gene1000gene 1

Gene tree discordance

3

OrangutanGorilla ChimpHuman

The species tree

A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human

gene1000gene 1

Gene tree discordance

3

OrangutanGorilla ChimpHuman

The species tree

A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human

Causes of gene tree discordance include:• Incomplete Lineage Sorting (ILS) • Duplication and loss • Horizontal Gene Transfer (HGT)

gene1000gene 1

Incomplete Lineage Sorting (ILS)

• A random process related to the coalescence of alleles across various populations

4

Tracing alleles through generations

Incomplete Lineage Sorting (ILS)

• A random process related to the coalescence of alleles across various populations

4

Tracing alleles through generations

Incomplete Lineage Sorting (ILS)

• A random process related to the coalescence of alleles across various populations

• Omnipresent: possible for every tree

• Likely for short branches or large population sizes

4

Tracing alleles through generations

MSC and Identifiability• A statistical model called multi-species coalescent

(MSC) can generate ILS.

5

MSC and Identifiability• A statistical model called multi-species coalescent

(MSC) can generate ILS.

• Any species tree defines a unique distribution on the set of all possible gene trees

5

MSC and Identifiability• A statistical model called multi-species coalescent

(MSC) can generate ILS.

• Any species tree defines a unique distribution on the set of all possible gene trees

• In principle, the species tree can be identified despite high discordance from the gene tree distribution

5

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene tree estimation pipelines

6

There are also other very good approaches: co-estimation (e.g., *BEAST),

site-based (e.g., SVDQuartets, SNAPP)

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene tree estimation pipelines

6

Statistically inconsistent [Roch and Steel, 2014]

Can be statistically consistent given true gene treesData

Error

Error-free gene trees

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene tree estimation pipelines

6

Statistically inconsistent [Roch and Steel, 2014]

Can be statistically consistent given true gene trees

STAR, STELLS, GLASS, BUCKy (population tree), MP-EST, NJst (ASTRID), … ASTRAL, ASTRAL-II, ASTRAL-III

Data

Error

Error-free gene trees

ASTRAL used by the biologists• Plants: Wickett, et al., 2014, PNAS • Birds: Prum, et al., 2015, Nature • Xenoturbella, Cannon et al., 2016, Nature • Xenoturbella, Rouse et al., 2016, Nature • Flatworms: Laumer, et al., 2015, eLife • Shrews: Giarla, et al., 2015, Syst. Bio. • Frogs: Yuan et al., 2016, Syst. Bio. • Tomatoes: Pease, et al., 2016, PLoS Bio. • Angiosperms: Huang et al., 2016, MBE • Worms: Andrade, et al., 2015, MBE

AST

RA

L-II

AST

RA

L

This talk: ASTRAL• Outlines of the method

• Accuracy in simulation studies

• With strong model violations

• The impact of

• Fragmentary sequence data

• Sampling multiple individuals

8

Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

9

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

θ2=15% θ3=15%θ1=70%Gorilla

Orang.

Chimp

Human

d=0.8

Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

9

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

θ2=15% θ3=15%θ1=70%Gorilla

Orang.

Chimp

Human

d=0.8

The most frequent gene tree = The most likely species tree

Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

9

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

θ2=15% θ3=15%θ1=70%Gorilla

Orang.

Chimp

Human

d=0.8

The most frequent gene tree = The most likely species tree

shorter branches ⇒ more discordance ⇒ a harder species tree reconstruction problem

10

Orang.Gorilla ChimpHuman Rhesus

More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

10

Orang.Gorilla ChimpHuman Rhesus

1. Break gene trees into (n 4 ) quartets of species

2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees

Some tools (e.g.. BUCKy-p [Larget, et al., 2010])

More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

10

Orang.Gorilla ChimpHuman Rhesus

1. Break gene trees into (n 4 ) quartets of species

2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees

Some tools (e.g.. BUCKy-p [Larget, et al., 2010])

More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

Alternative: weight all 3(n

4 ) quartet topologies by their frequency

and find the optimal tree

Maximum Quartet Support Species Tree

• Optimization problem:

11

Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

the set of quartet trees induced by T

a gene treeScore(T ) =

mX

1

|Q(T ) \Q(ti)|

Maximum Quartet Support Species Tree

• Optimization problem:

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly

11

Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

the set of quartet trees induced by T

a gene treeScore(T ) =

mX

1

|Q(T ) \Q(ti)|

ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]

• Solve the problem exactly using dynamic programming

• Constrains the search space to make large datasets feasible

• The constrained version remains statistically consistent

• Running time: polynomially increases with the number of genes and the number of species

12

ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]

• Solve the problem exactly using dynamic programming

• Constrains the search space to make large datasets feasible

• The constrained version remains statistically consistent

• Running time: polynomially increases with the number of genes and the number of species

• ASTRAL-II:

• Increased the search space

• Improved the running time

• Can handle polytomies (lack of resolution) in input gene trees

12

Simulation study

13

Truegenetrees Sequencedata

Es�matedspeciestree

Finch Falcon Owl Eagle Pigeon

Es�matedgenetreesFinch Owl Falcon Eagle Pigeon

True(model)speciestree

Simulation study

• Evaluate using the FN rate: the percentage of branches in the true tree that are missing from the estimated tree

13

Truegenetrees Sequencedata

Es�matedspeciestree

Finch Falcon Owl Eagle Pigeon

Es�matedgenetreesFinch Owl Falcon Eagle Pigeon

True(model)speciestree

Number of species impacts estimation error in the species tree

14

1000 genes, “medium” levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015]

ASTRAL: accurate and scalable

15

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

1000 genes, “medium” levels of ILS, simulated species trees[S. Mirarab, T. Warnow, 2015]

ASTRAL: accurate and scalable

15

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

1000 genes, “medium” levels of ILS, simulated species trees[S. Mirarab, T. Warnow, 2015]

Comparison to concatenation: depends on the ILS level

16

200 species, recent ILS

1000 genes 200 genes 50 genes

0%

10%

20%

30%

10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)

Spec

ies

tree

topo

logi

cal e

rror (

FN) ASTRAL−II

NJstCA−ML

more ILS more ILS more ILSL M H L M H L M H

Comparison to concatenation: depends on the ILS level

16

200 species, recent ILS

1000 genes 200 genes 50 genes

0%

10%

20%

30%

10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)

Spec

ies

tree

topo

logi

cal e

rror (

FN) ASTRAL−II

NJstCA−ML

more ILS more ILS more ILSL M H L M H L M H

Horizontal Gene Transfer (HGT)[R. Davidson et al., BMC Genomics. 16 (2015)]

~70%~55% ~35% ~30% discordance

Model violation: the simulated discordance is due to both ILS and randomly distributed HGT

50 species, 10 genes

Horizontal Gene Transfer (HGT)[R. Davidson et al., BMC Genomics. 16 (2015)]

~70%~55%

~35% ~30% discordance

Randomly distributed HGT is tolerated with enough genes

50 species, varying # genes

# genes

UCSD Work 1. Statistical support 2. ASTRAL-III

• Better running time • GPU and CPU Parallelism • Multi-Individual datasets

3. Impact of fragmentary data

19

Branch support• Traditional approach:

Multi-locus bootstrapping (MLBS)

• Slow: requires bootstrapping all genes(e.g., 100×m ML trees)

• Inaccurate and hard to interpret [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015]

• We can do better!

20

[Mirarab et al., Sys bio, 2014]

Local posterior probability• Recall quartet frequencies follow a multinomial distribution

• P ( topology seen in m1 / m gene trees is the species tree ) = P ( θ1 > 1/3 )

21

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

m 2 = 63 m3 = 57m 1 = 80

θ2 θ3θ1

Local posterior probability• Recall quartet frequencies follow a multinomial distribution

• P ( topology seen in m1 / m gene trees is the species tree ) = P ( θ1 > 1/3 )

• Can be analytically solved

• We implemented this idea in astral and called the measure “the local posterior probability” (localPP)

21

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

m 2 = 63 m3 = 57m 1 = 80

θ2 θ3θ1

Quartet support v.s. localPP

22

!3Background: How many targets are enough? The number of loci required for confidently resolving a phylogenetic relationship depends on its “hardness”. Discordant gene trees due to incomplete lineage sorting (ILS) can make branches of a species tree very difficult to resolve. ILS is a function of the branch length and population size. This means that shorter branches or larger population sizes increase ILS, and can make species tree reconstruction more difficult. The co-PI has recently developed a new approach for estimating the support of a branch in a coalescent-based framework based on the proportion of gene trees that support quartet topologies in gene trees[31]. This new approach is analogous to finding the probability that a three-sided die is loaded towards each facet by observing outcomes of many tosses. We can also ask the reverse question: if we have an estimate of the degree of bias of a die, how many tosses are needed to confidently find its loaded side. Similarly, this approach[31] sheds light on the relationship between the number of genes required to achieve a level of support for a branch of a certain length in coalescent units (Fig. 3). While the co-PI’s method of calculating support, which is implemented in ASTRAL [32, 33], gives the basic mathematical framework, more method development is needed (see below).

Research Approach Specimens Suitable specimens of more than 80 species of Sabellidae and >120 Terebelliformia species have been collected from localities around the world in the last 15 years by the PI. Further collecting will be undertaken for some key taxa for a few more transcriptomes and for targeted DNA capture. We also have agreements from other experts in Sabellidae and Terebelliformia to provide other needed ethanol-preserved specimens for targeted capture. We will obtain ~ 50% of the known sabellid diversity and ~25% of terebelliforms for targeted capture sequencing (~500 species total). Our plan is to sample all genera, particularly focusing on their type species. This should allow for the generation of robust phylogenies, from which we will revise the taxonomy. Specimens will be vouchered at the SIO Benthic Invertebrate Collection and biodiversity information curated on the Encyclopedia of Life. Sequencing (transcriptomes and targeted capture of DNA) A targeted capture approach will be used with an appropriate number of loci (see below) to generate robust phylogenies of Sabellidae and Terebelliformia. Two different sets of targets will be designed from transcriptomes, across Sabellidae and Terebelliformia repectively. We have already generated new transcriptomes for nine sabellids, a serpulid and a fabriciid (bold terminals in Fig. 4), and nine Terebelliformia (bold terminals in Fig. 5) and combined this data with the few publicly available transcriptomes. Based on direct sequencing data already obtained for several genes for Sabellidae and Terebelliformia [Rouse in prep.; Stiller et al. in prep.], our sampling arguably spans the extant diversity of each clade. Several more transcriptomes will be needed to ensure the targeted capture methods will be robust. Methods for generating these transcriptomes will follow those previously used by us[6, 34]. How many targets are needed? A targeted capture pipeline typically starts from a large number of loci sequenced from a small number of species using genome-wide approaches (e.g., transcriptomics)[7, 8, 35]. This initial dataset is then used to select a smaller subset of loci for the targeted capture phase, which involves a larger set of species. To reduce the cost and effort, one would want to minimize the number of loci in the second phase, as long as sufficient loci are selected to confidently resolve relationships of interest. To date the number of loci has been determined in an ad hoc manner, either by the ultra conserved elements discovered[7, 8] or limitations of the targeted capture technology[13]. Initial analyses of our two datasets demonstrates how the number of

Fig. 3. Impact of number of genes (colors) and amount of ILS (x-axis) on support (posterior probability) for a branch (y-axis). Under the coalescent model, for four species separated with an unrooted branch of length d, the probability of a gene tree being identical to the species tree is 1-2/3e-d (we use this as a measure of ILS). We measure support using local pp [31] for various numbers of gene trees (assuming no gene tree estimation error). More ILS (ie., low 1-2/3e-d) requires more genes to get high support.

quartet frequency

Increased number of genes (m ) ⇒ increased support

Decreased discordance ⇒ increased support

23

localPP is more accurate than bootstrapping

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

Avian simulated dataset (48 taxa, 1000 genes)

the ROC curves show (fig. 5), for the same number offalse positives branches, local posterior probabilities result inbetter recall than MLBS. This pattern is more pronouncedfor shorter alignments, which have increased gene treeerror. For example, for the 250 bp model condition, if wechoose a support threshold that results in 0.01 FPR, with localposterior values, we still recover 84% of correct branches,whereas with MLBS, the same FPR results in retaining70% of correct branches. Thus, for a desired level of precision,better recall can be obtained using local posteriorprobabilities.

Branch LengthBranch length accuracy on the avian dataset was a function ofgene tree estimation error whether ASTRAL or MP-EST wasused (table 4). With true gene trees, branch length log error

FIG. 4. Branch length accuracy on the A-200 dataset with Medium ILS. See supplementary figures S7 and S8, Supplementary Material online forlow and high ILS. The estimated branch length is plotted against the true branch length in log scale (base 10). Blue line: a fitted generalizedadditive model with smoothing (Wood 2011).

Table 3. Branch Length Accuracy on the A-200 Dataset.

Dataset n Log Err RMSE

True gt Est. gt True gt Est. gt

Low ILS 1,000 0.10 0.42 5.57 6.75Low ILS 200 0.16 0.44 6.22 6.99Low ILS 50 0.25 0.48 6.84 7.29Med ILS 1,000 0.03 0.20 0.22 0.86Med ILS 200 0.07 0.22 0.44 0.91Med ILS 50 0.13 0.26 0.74 1.05High ILS 1,000 0.06 0.15 0.03 0.13High ILS 200 0.11 0.18 0.07 0.15High ILS 50 0.18 0.24 0.14 0.19

NOTE.—Logarithmic error (Log Err) and RMSE are shown for true species treesscored with true gene trees or estimated gene trees (Est. gt).

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00False Positive Rate

Rec

all

MLBS Local PP

FIG. 5. ROC curve for the avian dataset based on MLBS and local PPPPsupport values. Boxes show different numbers of sites per gene (con-trolling gene tree estimation error).

Fast Coalescent-Based Support Computation . doi:10.1093/molbev/msw079 MBE

9

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

[Sayyari and Mirarab, MBE, 2016]

100Xfaster

ASTRAL can also estimate internal branch lengths

24

True gene trees

1500 1000

500 250

−7.5

−5.0

−2.5

0.0

2.5

−7.5

−5.0

−2.5

0.0

2.5

−7.5 −5.0 −2.5 0.0 2.5 −7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)

estim

ated

bra

nch

leng

th (l

og s

cale

)

−7.5

−5.0

−2.5

0.0

2.5

−7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)

estim

ated

bra

nch

leng

th (l

og s

cale

)

With true gene trees, ASTRAL correctly estimates BL

ASTRAL can also estimate internal branch lengths

24

True gene trees

1500 1000

500 250

−7.5

−5.0

−2.5

0.0

2.5

−7.5

−5.0

−2.5

0.0

2.5

−7.5 −5.0 −2.5 0.0 2.5 −7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)

estim

ated

bra

nch

leng

th (l

og s

cale

)

−7.5

−5.0

−2.5

0.0

2.5

−7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)

estim

ated

bra

nch

leng

th (l

og s

cale

)

was only 0.06, corresponding to branches that are about 14%shorter or longer than the true branch. As gene tree estima-tion error increases with reduced number of sites (see table 1for gene tree error statistics), the branch length error also

increases. Thus, while 1,500 bp genes give 0.17 log error,250 bp genes result in 0.59 error, which corresponds tobranches that are on average 3.9 times too short or long.Moreover, unlike true gene trees, the error in branch lengthsestimated based on estimated gene trees is biased towardunderestimation (fig. 6), a pattern that increases in intensitywith shorter alignments.

ASTRAL and MP-EST have similar branch length accuracymeasured by log error for highly accurate gene trees, butASTRAL has an advantage with increased gene tree error(supplementary fig. S10, Supplementary Material online andtable 4). Error measured by root mean squared error (RMSE)(which emphasizes the accuracy of long branches) is compa-rable for the two methods, but MP-EST has a slight advantagegiven accurate gene trees.

Biological DatasetsFor each biological dataset, we show MLBS support and thelocal posterior probabilities, computed based on RAxML genetrees available from respective publications. We also collapsegene tree branches with <33% bootstrap support and usethese collapsed gene trees to draw local posterior probabili-ties. For ease of discussion, we show local posterior probabil-ities as percentages and refer to them simply as posterior orcollapsed posterior (for values based on collapsed gene trees).We discuss the confidence in important branches in eachtree.

1KP. Three of the key relationships studied by Wickett et al.(2014) are the sister branch to land plants, the base of theangiosperms, and the relationship among Bryophytes (horn-worts, liverworts, and mosses). In the ASTRAL tree, manybranches have full support regardless of the measure of thesupport used, but the remaining branches reveal interestingpatterns (fig. 7). The sister relationship betweenZygnematales and land plants receives a moderate 80% BS,but has 100% posterior. Wickett et al. (2014) also recoveredthis relationship by concatenation of various data partitions.There are 12 other branches that have collapsed posteriorsthat are at least 10% higher than BS (fig. 7); no branch hassubstantially higher BS than collapsed posterior. Collapsedposterior for monophyly of Bryophytes and for Amborellaas sister to other angiosperms are 100% (compared with97% and 93% BS, respectively).

When we collapse low-support branches in gene trees,posterior goes up for several branches: nine branches haveimprovements of 10% or more, and only two branches havecomparable reductions. An interesting case is Coleochaetalesas sister to Zygnematalesþ land plants, which has only 62%BS and 61% posterior, but has 100% collapsed posterior.Finally, note that several branches have low posterior, evenafter collapsing.

Our estimated branch lengths are short for several nodes.For example, the branch that unites (ChloranthalesþMagnoliids) and Eudicots has a length of 0.14 in coalescentunits. Other branches that have been historically hard to re-cover also tend to have short branches; however, these arenot necessarily extremely short branches that would

FIG. 6. ASTRAL branch length accuracy on the avian dataset. Logtransformed estimated branch lengths are shown versus true branchlengths, and a generalized additive model is fitted to the data. Onebranch with length 10"6 is trimmed out here, but full results, includ-ing MP-EST, is shown in supplementary figure S9, SupplementaryMaterial online.

Table 4. Branch Length Accuracy for the Avian Dataset.

No. of sites Log Err RMSE

ASTRAL MPEST ASTRAL MPEST

True gt. 0.06 (0.10) 0.07 (0.11) 0.44 (0.44) 0.30 (0.30)1,500 0.17 (0.20) 0.14 (0.18) 0.83 (0.83) 0.70 (0.70)1,000 0.22 (0.27) 0.22 (0.25) 1.08 (1.07) 1.01 (1.00)500 0.37 (0.42) 0.42 (0.46) 1.65 (1.64) 1.65 (1.64)250 0.59 (0.63) 0.81 (0.84) 2.25 (2.24) 2.28 (2.26)

NOTE.—Logarithmic error and root mean squared error are shown for true speciestrees scored with true gene trees or estimated gene trees with various numbers ofsites using ASTRAL and MP-EST. An extremely short branch with length 10"6 wasremoved from the calculations, but error including that branch is shownparenthetically.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

10

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

low gene tree error

High gene tree error

Moderate g.t. error

Medium g.t. error

With true gene trees, ASTRAL correctly estimates BLWith error-prone estimated gene trees, ASTRAL underestimates BL

Running time complexity & ASTRAL-III

(unpublished work)

25

Maryam Rabiee Hashemi Chao ZhangJohn Yin

ASTRAL-III new features• Improved running time with a single CPU

• We have ~3-5X running time improvement, and guaranteed polynomial running time

• ~8 hours for 1000 genes, 1000 species

• Handling datasets with multiple individuals

• GPU and CPU parallelism

26

Parallelism

Can infer trees with 10,000 species & 400 genes in a day27

GPU (1000 species)

GPU (500 species)

0

5

10

15

20

25

1 2 4 8 12 16 20 24# cores

Spee

d−up

1000 species

500 species

# CPU cores (comet)

Multiple individuals• What if we sample multiple

individuals from each species?

• In recently diverged species individuals may have different trees for each gene

• Sampling multiple individuals may provide useful information

• The ASTRAL approach can be extended to these types of data

28

Multiple individuals helpful?

29

0.0

0.1

0.2

0.3

Variable effort500 genes; 200 species

Fixed effortgenes x inds = 1000; 200 species

Fixed effort500 genes; species x inds = 200

Spec

ies

tree

erro

r (R

F di

stan

ce) 1 ind.

2 ind.

5 ind.

Yes, it marginally helps accuracy

Simulations with very shallow trees (500K generations in total)

Multiple individuals helpful?

29

0.0

0.1

0.2

0.3

Variable effort500 genes; 200 species

Fixed effortgenes x inds = 1000; 200 species

Fixed effort500 genes; species x inds = 200

Spec

ies

tree

erro

r (R

F di

stan

ce) 1 ind.

2 ind.

5 ind.

Yes, it marginally helps accuracy

But not if sequencing effort is kept fixed

Simulations with very shallow trees (500K generations in total)

Multiple individuals helpful?

29

0.0

0.1

0.2

0.3

Variable effort500 genes; 200 species

Fixed effortgenes x inds = 1000; 200 species

Fixed effort500 genes; species x inds = 200

Spec

ies

tree

erro

r (R

F di

stan

ce) 1 ind.

2 ind.

5 ind.

Yes, it marginally helps accuracy

But not if sequencing effort is kept fixed

Simulations with very shallow trees (500K generations in total)

Impact of fragmentary data

30

Erfan Sayyari James B. Whitfield

Two types of missing data• Missing genes: a taxon

misses the gene fully (taxon occupancy)

• Fragmentary data: the species is partially sequenced for some genes

31

Figure from Hosner et al, 2016

Does missing data matter?• Missing genes:

• Evidence suggests summary methods are often robust

32

Does missing data matter?• Missing genes:

• Evidence suggests summary methods are often robust

• Fragmentary data:

• Less extensively studied

• Hosner et al. indicated:

• Fragments matter

• They suggest removing genes with many fragments

32

Our study• Empirical data:

• An insect dataset of 144 species and 1478 genes

• 600 million years of evolution

• Transcriptomics, so plenty of missing data of both types

• Simulated data:

• 100 taxon, medium ILS, 1000 genes

• Added fragmentation with patterns similar to the insect data

33

Fragmentary data hurt gene trees and species trees

34

0.25

0.50

0.75

1.00

no−filtering 10 20 25 33 50 66 75 80 orig−seqFiltering method

NR

F D

ista

nce

with fragments

without fragments

Species tree error: 2.8% errorG

ene

tree

erro

r

Fragmentary data hurt gene trees and species trees

Fragmentary data dramatically increase

gene tree and the species tree error

34

0.25

0.50

0.75

1.00

no−filtering 10 20 25 33 50 66 75 80 orig−seqFiltering method

NR

F D

ista

nce

with fragments

without fragments

Species tree error: 2.8% errorG

ene

tree

erro

r

Species tree error: 7.0% error

How do we solve this?• Hosner et al. suggested removing entire genes.

• This is too conservative in our opinion.

35

How do we solve this?• Hosner et al. suggested removing entire genes.

• This is too conservative in our opinion.

• We simply remove the fragmentary sequence from the gene and keep the rest of the gene

• What is fragmentary?

• Anything that has at most X% of sites

• We study different X values

35

Filtering helps to some level (simulations)

36

●●

●●

●● ● ●

0.25

0.50

0.75

1.00

no−filtering 10 20 25 33 50 66 75 80 orig−seqFiltering method

NR

F D

ista

nce

Gene tree error Species tree error

How about real data?

37

38

no-filtering

Dipt

era

Mec

opte

ra

Siph

onap

tera

Lepi

dopt

era

Trich

opte

ra

Cole

opte

ra

Stre

psip

tera

Neur

opte

rida

Hym

enop

tera

Psoc

odea

Hem

ipte

ra

Thys

anop

tera

Blat

tode

a

Man

tode

a

Gry

llobl

atto

dea

Orth

opte

raPh

asm

ida

Embi

idin

a

Plec

opte

ra

Derm

apte

ra

Ephe

mer

opte

ra

Odo

nata

Thys

anur

a

Arch

aeog

nath

a

Dipl

ura

Colle

mbo

laIm

porta

nt

Ingr

oup A B C

B/C D

D/B/

C E F G H I J K L M N P

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

Prop

ortio

n of

rele

vant

gen

e tre

es

Strongly Supported Weakly Supported Weakly Reject Strongly Reject

66% filtering

Gene trees seem to improve (less gene tree discordance)

38

no-filtering

Dipt

era

Mec

opte

ra

Siph

onap

tera

Lepi

dopt

era

Trich

opte

ra

Cole

opte

ra

Stre

psip

tera

Neur

opte

rida

Hym

enop

tera

Psoc

odea

Hem

ipte

ra

Thys

anop

tera

Blat

tode

a

Man

tode

a

Gry

llobl

atto

dea

Orth

opte

raPh

asm

ida

Embi

idin

a

Plec

opte

ra

Derm

apte

ra

Ephe

mer

opte

ra

Odo

nata

Thys

anur

a

Arch

aeog

nath

a

Dipl

ura

Colle

mbo

laIm

porta

nt

Ingr

oup A B C

B/C D

D/B/

C E F G H I J K L M N P

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

Prop

ortio

n of

rele

vant

gen

e tre

es

Strongly Supported Weakly Supported Weakly Reject Strongly Reject

66% filtering

Gene trees seem to improve (less gene tree discordance)

The species tree also improves

39

IngroupABC

B/CD

D/B/CEFGHIJKLMNP

no filter

ing

20

25 33 50

66 75 80

Supported0% 50% 100%

Weakly Rejected(<95%)

StronglyRejected(>=95%)

b)

• ASTRAL trees differed with concatenation initially

• After filtering, ASTRAL and concatenation were very similar

• Differences on controversial nodes

ASTRAL …• Reconstructs species trees from gene trees with both

• high accuracy

• scalability to large datasets • robust to some levels of model violations

• Like any other summary method, remains sensitive to gene tree estimation error

• try best to obtain highly accurate gene trees

• removing fragmentary data seems to help

40

Future of ASTRAL• Can it be changed to use characters directly?

possible but slow for binary characters …

• More broadly, can alternative scalable methods be developed for better gene tree estimation?

41

Future of ASTRAL• Can it be changed to use characters directly?

possible but slow for binary characters …

• More broadly, can alternative scalable methods be developed for better gene tree estimation?

• ASTRAL scales to 10K leaves. We have 90K bacterial genomes.

• Can we scale further?

41

 S.M.  Bayzid   Théo    Zimmermann

Erfan  Sayyari Shubhanshu    Shekhar

Maryam Rabiee Hashemi

Chao ZhangTandy  Warnow

John Yin

James  B.  Whitfield

Theoretical sample complexity results

How many genes are enough to reconstruct the tree?

43

●●

●●●●●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

0

10000

20000

30000

40000

50000

60000

70000

010

020

030

0

1 f

# ge

nes

ε0.050.10.20.33

BalancedCaterpillarTheoretical

Gene trees seem to improve (bootstrap)

44

c)

●●

● ● ●

● ●

●●

●●

●●

●● ●

10%

20%

30%

40%

50%

no−filtering20 25 33 50 66 75 80Fragment filtering threshold

Perc

ent

Bootstrap support ● ● ● ●=0 <=33 >=75 =100

d)

Gene trees seem to improve (less gene tree discordance)

45

e)

● ●●

●● ● ●

0.00

0.25

0.50

0.75

1.00

no-filtering 20 25 33 50 66 75 90Filtering method

FN d

istan

ce b

etwe

en s

pecie

s an

d ge

ne tr

ees

0.6 0.7 0.8 0.9

occupancy

Impact of gene tree error (using true gene trees)

46

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Medium ILSLow ILS High ILS

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.41e−06

1e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Impact of gene tree error (using true gene trees)

• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error

46

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Medium ILSLow ILS High ILS

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.41e−06

1e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

ASTRAL-I versus ASTRAL-II

47

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5

Medium ILSLow ILS High ILS

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5200 species, deep ILS

ASTRAL-I versus ASTRAL-II

47

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5

Medium ILSLow ILS High ILS

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5200 species, deep ILS

Dynamic programming

48

A L-A

L

Dynamic programming

48

A

A-A’A’A-A’A’A-A’A’

L-A

• Recursively break subsets of species into smaller subsets

L

Dynamic programming

48

A

A-A’A’A-A’A’A-A’A’

L-A

A-A’

A’

L-A

• Recursively break subsets of species into smaller subsets

• : Compare u against input gene trees and compute quartets from gene trees satisfied by u

u

L

w(u)

Dynamic programming

48

A

A-A’A’A-A’A’A-A’A’

L-A

A-A’

A’

L-A

• Recursively break subsets of species into smaller subsets

• : Compare u against input gene trees and compute quartets from gene trees satisfied by u

Exponential

u

L

w(u)

Constrained version

49

A

A-A’A’A-A’A’A-A’A’

L-A

L

A1

A4

A8

A5

A2

A6

A3A7

A9

A10

Constrained version

49

A

A-A’A’A-A’A’A-A’A’

L-A

L

A1

A4

A8

A5

A2

A6

A3A7

A9

A10

• Restrict “branches” in the species tree to a given constraint set X.

Deep ILS

50

1000 genes 200 genes 50 genes

10%

20%

30%

10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IINJstCA−ML

200 species, simulated species trees, deep ILS [Mirarab and Warnow, ISMB, 2015]