ASTRAL Tutorial - Tandy Warnow

ASTRAL Tutorial

• Instructor: Siavash Mirarab, [email protected]

• Github site: https://github.com/smirarab/ASTRAL

• Email: [email protected]

Gene tree discordance

[Figure: the species tree and example gene trees (gene 1 to gene 1000) on Orangutan, Gorilla, Chimp, and Human.]

Causes of gene tree discordance include:
• Duplication and loss
• Horizontal Gene Transfer (HGT) and Hybridization
• Incomplete Lineage Sorting (ILS)

[Figure: the generative model. The species tree (Orangutan, Gorilla, Chimp, Human) produces gene trees under the gene evolution model (MSC); each gene tree produces sequence data (alignments) under the sequence evolution model (GTR).]

Two-step approach

Step 1: infer gene trees (e.g., ML)

Step 2: gene tree summarization

[Figure: sequence alignments, one per gene, are each turned into a gene tree (step 1); the gene trees are then summarized into a species tree on Orangutan, Gorilla, Chimp, and Human (step 2).]

ASTRAL

• Input: Unrooted gene trees
  • Can have missing data
  • Can have polytomies
  • Can have multiple individuals per species
• Output: The estimated unrooted species tree
  • Will have branch lengths in coalescent units on internal branches (not super accurate)
  • Will have a measure of support called localPP

A bit on the input

Typical pipeline to produce input

1. Find orthologous parts of the genomes for species of interest
2. Align sequences per ortholog using your favorite MSA method (e.g., PASTA, UPP, MAFFT, etc.)
3. Infer “gene trees” per ortholog using Maximum Likelihood methods (e.g., RAxML or FastTree)

Optionally: perform various filtering steps to remove errors

[Figure: the pipeline from samples through sample preparation, sequencing (reads), assembly (putative genomes), and ortholog gene mining (using WGA, annotation, HMMs, etc.) to one MSA per gene (gene 1 to gene 1000). Error sources marked along the pipeline: contamination, sequencing error, misassembly, fragmentation; mis-annotation, hidden paralogy, mis-alignment, fragmentation; alignment error.]

Should you filter?

• Filtering genes based on missing data?
  • Generally not beneficial [Molloy and Warnow, 2018]
• Filtering genes based on gene tree estimation error?
  • Depends on conditions [Molloy and Warnow, 2018]
• Filtering fragmentary sequences while keeping the gene?
  • Often beneficial [Sayyari, Whitfield, and Mirarab, 2018]
• Filtering super long branches helps (talk today by Uyen Mai)

bestML versus MLBS

• Multi-locus Bootstrapping (MLBS) is possible
• See -b option here: https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#multi-locus-bootstrapping
• You need a file with bootstrap replicates for each gene tree
• MLBS is not suggested, as it seems to degrade accuracy compared to simply using ML gene trees.

Bioinformatics, 2014, Mirarab et al.

Low support branches

• Does it help to contract branches with low support?
  • Yes, but only for very low support branches
  • Mostly helps in the presence of low support gene trees
  • More genes allow for more filtering

Simulations: 100 taxa, SimPhy; ILS: around 46% true discordance; FastTree gene trees, support from bootstrapping

[Figure: simulation results plotted against the number of genes, all replicates.]

BMC Bioinformatics, 2018, Zhang et al.
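To make "contracting a low support branch" concrete, here is a minimal Python sketch. The tree encoding is an assumption made only for this illustration (a leaf is a string, an internal node is a `(support, children)` pair), and the threshold of 10 is a hypothetical value, not a recommendation from the slides. Real analyses would normally use a tree library on Newick files instead.

```python
def contract_low_support(node, threshold):
    """Return a copy of `node` in which every internal branch whose
    support is below `threshold` is collapsed into its parent."""
    if isinstance(node, str):              # a leaf has no branch to contract
        return node
    support, children = node
    new_children = []
    for child in children:
        child = contract_low_support(child, threshold)
        is_internal = not isinstance(child, str)
        if is_internal and child[0] is not None and child[0] < threshold:
            # Collapse: attach the grandchildren directly to this node,
            # which creates (or enlarges) a polytomy.
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)


# Hypothetical gene tree; the root carries no support value (None).
gene_tree = (None, ["Rhesus",
                    (80, ["Orangutan",
                          (7, ["Gorilla",
                               (95, ["Human", "Chimp"])])])])

contracted = contract_low_support(gene_tree, threshold=10)
print(contracted)
```

The branch with support 7 disappears and Orangutan, Gorilla, and the Human+Chimp clade become a polytomy, which is the kind of unresolved gene tree ASTRAL accepts as input.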

A bit on the output

Unrooted quartets under MSC model

For a quartet (4 species), the most probable unrooted quartet tree (among the gene trees) is the unrooted species tree topology (Allman, et al. 2010)

[Figure: the three possible unrooted quartet topologies on Orang., Gorilla, Chimp, and Human, with probabilities θ1=70%, θ2=15%, θ3=15%, for a species tree with internal branch length d=0.8.]

The most frequent gene tree = The most likely species tree
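The percentages in the figure follow from the standard MSC result for four taxa: with internal branch length d in coalescent units, the topology matching the species tree has probability 1 − (2/3)e^(−d), and each of the two alternatives has probability (1/3)e^(−d). A small Python check (the function name is made up for this example):

```python
import math


def quartet_probabilities(d):
    """Probabilities of the three unrooted quartet topologies under the MSC
    for an internal species tree branch of length d (coalescent units).
    The first value belongs to the topology that matches the species tree."""
    minor = math.exp(-d) / 3.0
    return 1.0 - 2.0 * minor, minor, minor


theta1, theta2, theta3 = quartet_probabilities(0.8)
print(f"{theta1:.0%} {theta2:.0%} {theta3:.0%}")   # prints "70% 15% 15%"
```

As d shrinks to 0 all three probabilities approach 1/3, which is why very short branches are hard to resolve.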

More than 4 species

For 5 or more species, the unrooted species tree topology can be different from the most probable gene tree (called “anomaly zone”) (Degnan, 2013)

[Figure: a species tree on Orang., Gorilla, Chimp, Human, and Rhesus.]

1. Break gene trees into $\binom{n}{4}$ quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees

Some tools take this approach (e.g., BUCKy-p [Larget, et al., 2010])

Alternative: weight all $3\binom{n}{4}$ quartet topologies by their frequency and find the optimal tree
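As a quick illustration of where the counts come from, the sketch below enumerates the three possible unrooted topologies for every set of four species. The function name and the tuple encoding of a topology are assumptions made for this example only.

```python
from itertools import combinations
from math import comb


def quartet_topologies(taxa):
    """Yield the three unrooted topologies ab|cd, ac|bd, ad|bc
    for every 4-element subset of `taxa`."""
    for a, b, c, d in combinations(sorted(taxa), 4):
        yield ((a, b), (c, d))
        yield ((a, c), (b, d))
        yield ((a, d), (b, c))


taxa = ["Orang.", "Gorilla", "Chimp", "Human", "Rhesus"]
topologies = list(quartet_topologies(taxa))
print(comb(len(taxa), 4), len(topologies))   # prints "5 15"
```

The number of quartet topologies grows as n^4, so any method that weights all of them explicitly becomes expensive quickly as species are added.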

Maximum Quartet Support Species Tree

• Optimization problem (NP-Hard): find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{i=1}^{k} |Q(T) \cap Q(t_i)|$$

where $Q(T)$ is the set of quartet trees induced by $T$ and $t_i$ is a gene tree.

• Statistically consistent under the multi-species coalescent model when solved exactly
• ASTRAL: an exact solution using dynamic programming, and a constrained version that can run on large datasets
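The score can be computed by brute force on tiny examples, which helps in understanding what is being optimized. The sketch below is not how ASTRAL computes it (ASTRAL uses dynamic programming and never lists quartets one by one); the nested-tuple tree encoding and all function names are assumptions made for this illustration.

```python
from itertools import combinations


def leaf_sets(tree, out):
    """Append the leaf set below every internal node to `out`
    and return the leaf set below `tree` itself."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def induced_quartets(tree):
    """Set of resolved quartet topologies induced by `tree`.
    A topology ab|cd is stored as frozenset({{a, b}, {c, d}})."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    quartets = set()
    for four in combinations(sorted(all_leaves), 4):
        q = frozenset(four)
        for clade in clades:
            inside = q & clade
            if len(inside) == 2:           # this branch splits the quartet 2|2
                quartets.add(frozenset([inside, q - inside]))
                break                      # unresolved quartets add nothing
    return quartets


def quartet_score(species_tree, gene_trees):
    """Number of quartet trees shared between the species tree
    and each gene tree, summed over all gene trees."""
    q_species = induced_quartets(species_tree)
    return sum(len(q_species & induced_quartets(t)) for t in gene_trees)


species = (((("Human", "Chimp"), "Gorilla"), "Orang."), "Rhesus")
genes = [
    (((("Human", "Chimp"), "Gorilla"), "Orang."), "Rhesus"),   # same topology
    (((("Human", "Gorilla"), "Chimp"), "Orang."), "Rhesus"),   # Human+Gorilla
]
print(quartet_score(species, genes))   # prints 8 (5 shared + 3 shared)
```

Gene trees with missing species or polytomies simply contribute fewer quartets, which is why both are acceptable as input.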

Beyond topology, ASTRAL can …

• estimate length of internal branches in coalescent units: # generations / population size
• estimate a measure of branch support called local posterior probability
  • The probability of the branch being correct assuming gene trees are generated by the MSC model
• test for polytomies
• annotate species tree branches with quartet-based measures of discordance
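One way to see why quartet frequencies carry branch length information: under the MSC, the quartet topology matching the species tree appears with probability 1 − (2/3)e^(−d), so an observed frequency can be inverted to give d. The Python sketch below does only this inversion. It is an illustration under that assumption, not ASTRAL's implementation, and the function name and the example counts are hypothetical.

```python
import math


def internal_branch_length(dominant_count, total_count):
    """Estimate an internal branch length in coalescent units by solving
    z = 1 - (2/3) * exp(-d) for d, where z is the observed frequency of
    the dominant quartet topology around the branch."""
    z = dominant_count / total_count
    if z <= 1.0 / 3.0:
        return 0.0          # no signal beyond a polytomy
    if z >= 1.0:
        return math.inf     # no discordance observed; length is unbounded
    return -math.log(1.5 * (1.0 - z))


# Hypothetical counts: 700 of 1000 gene tree quartets support the branch.
print(round(internal_branch_length(700, 1000), 2))   # prints 0.8
```

Because the estimate saturates as the frequency approaches 1, long branches cannot be estimated precisely from gene tree topologies alone.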

A bit on the search space

ASTRAL versions

• ASTRAL-I (<v. 4.7.3) restricts the search space to combinations of bipartitions seen in gene trees.
  • This makes it fast, but it remains statistically consistent
• ASTRAL-II (<v. 5.1.0) increased the search space heuristically
  • Improved the accuracy at the expense of running time
  • Can handle polytomies in input gene trees
• ASTRAL-III (>v. 5.1.1) changed the search space again for a better running time versus accuracy trade-off
  • Improved running time for unresolved trees; makes it feasible to remove very low support branches from gene trees

Modifying the Search Space

• For small enough datasets, you can run ASTRAL in *exact* mode, which will find the globally optimal tree! (-x option)

• ASTRAL’s default usage will not run in exact mode, and will constrain the search space largely using the gene trees.

• It may be beneficial to *expand* the search space, using the “-e” or “-f” options (see tutorial at GitHub site). In particular, you can add species trees you’ve estimated using other methods, such as concatenation, SVDquartets, or species trees that other studies have suggested.
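To get a feel for why an unconstrained search is only realistic for small datasets, it helps to count the candidate trees: the number of unrooted binary trees on n leaves is (2n−5)!! = 3·5·…·(2n−5). The sketch below only computes that count. ASTRAL's exact mode relies on dynamic programming and does not enumerate trees, so this is an illustration of how fast the space grows, not a cost model; the function name is made up.

```python
def num_unrooted_binary_trees(n):
    """Number of distinct unrooted binary tree topologies on n labeled
    leaves, i.e. (2n-5)!! = 3 * 5 * ... * (2n-5), for n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):   # odd numbers 3 .. 2n-5
        count *= factor
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

For 4 and 5 species this gives 3 and 15 trees; for 10 species it is already 2,027,025.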

A bit on the running time

Empirical running time as a function of the number of genes

Zhang et al. Page 13 of 34

[Figure 5 plot: running time (minutes, log scale from $2^0$ to $2^{10}$) versus #Genes ($2^8$ to $2^{14}$) for ASTRAL2 and ASTRAL3, in two panels (500 and 1500). Fitted line slopes shown on the plot: 2.27, 2.09, 2.29, 2.06.]

Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-II and ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers of genes (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in the log/log space and line slopes are shown. ASTRAL-II did not finish on $2^{14}$ genes in 48 hours.

... branches has minimal impact on the discordance (eight discordant branches with binned MP-EST instead of nine). However, contracting low support branches with 3%–33% thresholds dramatically reduces the discordance with the reference tree (2, 2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and 33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The remaining differences are among the branches that are deemed unresolved by Jarvis et al. and change among the reference trees as well [5]. Contracting at 50% and 75% thresholds, however, increases discordance to five and six branches, respectively. Thus, consistent with simulations, contracting very low support branches seems to produce the best results, when judged by similarity with the reference trees. To summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with all major relations in Jarvis et al., including the novel Columbea group, whereas the unresolved tree missed important clades (Fig. 4).

3.2 RQ2: Running time improvements

We study the improvements in running time as various parameters change.

3.2.1 Varying the number of genes (k)

We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing the number of genes from $2^8$ to $2^{14}$ and forcing X to be the same for both versions to enable comparing impacts of improved weight calculation. We allow each replicate run to take up to two days. ASTRAL-II is not able to finish on the dataset with $k = 2^{14}$, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the running time over ASTRAL-II and the extent of the improvement depends on k (Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With $2^{13}$ genes, the largest value where both versions could run, ASTRAL-III finishes on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover, fitting a line to the average running time in the log-log scale graph reveals that on this dataset, the running time of ASTRAL-III on average grows as $O(k^{2.08})$, which is better than that of ASTRAL-II at $O(k^{2.28})$, and both are better than the theoretical worst case, which is $O(k^{2.726})$.

Avian simulations:
• n = 48 species
• k = 256 to 16,384 gene trees
• Relatively high ILS
• Low/moderate gene tree error
• 4 replicates per dot

Empirical running time seems to increase close to $O(k^2)$

17 hours
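The exponents quoted above come from fitting a straight line in log/log space: if running time grows as c·k^a, then log(time) = log(c) + a·log(k), so the slope of the fitted line is the exponent a. A minimal Python sketch follows. The data points are synthetic, generated from an exact quadratic law purely to show the mechanics; they are not measurements from the paper.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log2(y) against log2(x)."""
    lx = [math.log2(x) for x in xs]
    ly = [math.log2(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Synthetic example: k = 2^8 .. 2^14 and a made-up law time = 1e-6 * k^2.
ks = [2 ** e for e in range(8, 15)]
minutes = [1e-6 * k ** 2 for k in ks]

slope = loglog_slope(ks, minutes)
print(round(slope, 3))   # prints 2.0 for this synthetic quadratic data
```

A slope near 2 means that doubling the number of genes roughly quadruples the running time; with an exponent of 2.08, doubling k multiplies the time by about 4.2.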

Empirical running time as a function of the number of species

Zhang et al. Page 34 of 34

[Figure S8 plots: size of X (10k to 1M, log scale) versus #Species (200 to 10000), fitted slope 1.09; running time in minutes (1 to 1000, log scale) versus #Species (200 to 10000), fitted slope 1.83.]

Figure S8 Empirical running time of ASTRAL-III with n. Average running time is shown for ASTRAL-III for datasets with varying n. Averages are over 20 replicates. One replicate of 2000 species dataset could not finish in 2 days and is removed from the analysis. Note that these datasets have factors other than n that change as well (e.g., the amount of ILS, etc.). Thus, these running times should be treated as ball-park estimates. Finally, we note that on the 10,000 dataset, we have only 2 replicates and not 20.

Simphy simulations:
• n = 200 to 10,000 species
• k = 1000 gene trees
• Varying ILS
• Varying gene tree error
• 20 replicates everywhere, except for 10000 that has only 2 replicates

Empirical running time seems to increase close to $O(n^{1.8})$

52 hours

Parallelization and scaling limits

• We have now developed ASTRAL-MP to use parallelism.
• Can analyze datasets with 10,000 species and 1000 genes in less than a day given 24 cores and a GPU
• https://github.com/smirarab/ASTRAL/tree/MP-similarity

[Figure: speedup versus ASTRAL-III as a function of the number of CPU cores (1, 2, 4, 6, 8, 12, 16, 20, 24), with series "No GPU" and "1 GPU". Values labeled on the plot: 4, 31, 73, 108.]

Examples

Simple run

java -jar ~/workspace/ASTRAL/astral.5.6.1.jar -i 1KP-genetrees.tre -o 1KP-speciestrees.tre 2> 1KP-log.txt

• Runs ASTRAL version 5.6.1 on 1KP-genetrees.tre and saves the results in 1KP-speciestrees.tre, with logs saved to 1KP-log.txt.
• The dataset is from Wickett, Mirarab, et al. PNAS (2014), Phylotranscriptomic Analysis of the Origins and Early Diversification of Land Plants.
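If you prefer to launch the same run from a script, a thin Python wrapper is one option. This is a sketch under assumptions: the helper names are invented here, only the `-i` and `-o` options from the command above are used, and the jar path must point to wherever ASTRAL lives on your machine.

```python
import os
import shlex
import subprocess


def astral_command(jar, gene_trees, output):
    """Build the argument list for the simple ASTRAL run shown above."""
    return ["java", "-jar", os.path.expanduser(jar),
            "-i", gene_trees, "-o", output]


def run_astral(jar, gene_trees, output, log_path):
    """Run ASTRAL and send its log (written to stderr) to `log_path`.
    Requires Java and the ASTRAL jar; not called in this sketch."""
    with open(log_path, "w") as log:
        subprocess.run(astral_command(jar, gene_trees, output),
                       stderr=log, check=True)


cmd = astral_command("~/workspace/ASTRAL/astral.5.6.1.jar",
                     "1KP-genetrees.tre", "1KP-speciestrees.tre")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names, and `check=True` makes a failed run raise an error instead of passing silently.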

Open the tree in FigTree

• ASTRAL trees are unrooted.
• You will have to root them.
• You may have to open 2-3 times in FigTree (not sure why)

[Figure: the ASTRAL species tree shown in FigTree, with species names (e.g., Amborella_trichopoda, Arabidopsis_thaliana, Zea_mays) as tip labels. Scale: length in coalescent units.]


• It is often better to look at the tree as a cladogram.

[Figure: the same species tree drawn as a cladogram.]


Branch support

[Figure: the cladogram with a support value between 0 and 1 on each internal branch; most values are 1, and a few are much lower (for example 0.39 and 0.42).]

By default, nodes are labelled with localPP, the local posterior probability of the branch [Sayyari and Mirarab, 2016].
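To list the weakly supported branches without reading them off the picture, the support labels can be pulled out of the Newick text. This is a rough sketch under stated assumptions: the toy tree below is invented, the 0.95 cut-off is arbitrary, and the regular expression assumes the support value directly follows a closing parenthesis and that no taxon name contains parentheses. A real Newick parser would be more robust.

```python
import re

# A number that immediately follows ")" is taken to be a support label.
SUPPORT_AFTER_CLADE = re.compile(r"\)(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def branch_supports(newick):
    """Return every internal support label found in the Newick string."""
    return [float(value) for value in SUPPORT_AFTER_CLADE.findall(newick)]


def low_support(newick, threshold=0.95):
    """Return the supports that fall below an (arbitrary) threshold."""
    return [s for s in branch_supports(newick) if s < threshold]


# Invented toy tree in the shape "(clade)support:length".
toy_tree = "((A,B)0.94:0.52,((C,D)1:1.3,E)0.39:0.08,F);"

print(branch_supports(toy_tree))
print(low_support(toy_tree))
```

The sketch reports only the values, not which clade each belongs to; for that, a tree library would be the better tool.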


The log (1KP-log.txt) includes …

This is ASTRAL version 5.6.1
424 trees read from 1KP-genetrees.tre
…
Number of taxa: 103 (103 species)
…
Taxon occupancy: {Pseudolycopodiella_caroliniana=282, Kadsura_heteroclita=234, Entransia_fimbriata=294, …
…
Number of Clusters after addition by greedy: 11043
…
Final quartet score is: 339023690
Final normalized quartet score is: 0.8946722590876793
…
ASTRAL finished in 36.372 secs

• "Number of taxa: 103" is the number of individuals; "(103 species)" is the number of species.

• "Number of Clusters after addition by greedy" is the search space size.

• "Final quartet score" is the raw quartet score; "Final normalized quartet score" is the normalized quartet score.
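When many runs are compared, these numbers can be pulled from the log rather than read by eye. The sketch below is only illustrative: the excerpt contains the log lines shown on the slide with the elided parts dropped, and the dictionary keys and the function name summarize_log are made-up names. If a line occurs more than once in a real log, only the first match is kept.

```python
import re

# Lines copied from the slide; the parts elided there are omitted here.
LOG_EXCERPT = """\
This is ASTRAL version 5.6.1
424 trees read from 1KP-genetrees.tre
Number of taxa: 103 (103 species)
Number of Clusters after addition by greedy: 11043
Final quartet score is: 339023690
Final normalized quartet score is: 0.8946722590876793
ASTRAL finished in 36.372 secs
"""

# key -> (pattern with one capture group, converter)
PATTERNS = {
    "gene_trees": (r"(\d+) trees read from", int),
    "individuals": (r"Number of taxa: (\d+)", int),
    "species": (r"Number of taxa: \d+ \((\d+) species\)", int),
    "clusters": (r"Number of Clusters after addition by greedy: (\d+)", int),
    "quartet_score": (r"Final quartet score is: (\d+)", int),
    "normalized_quartet_score":
        (r"Final normalized quartet score is: ([0-9.]+)", float),
}


def summarize_log(text):
    """Collect the first match of each pattern into a dictionary."""
    summary = {}
    for key, (pattern, convert) in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            summary[key] = convert(match.group(1))
    return summary


summary = summarize_log(LOG_EXCERPT)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For a real run, the text would come from reading 1KP-log.txt instead of the embedded excerpt.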


Annotate (score) a given tree and compute quartet support

• Scores 1KP-speciestrees.tre against the gene trees in 1KP-genetrees.tre and annotates each branch with its quartet score (-t 1), saving the result in 1KP-speciestrees-qs.tre.

• Check out other annotation options at https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#extensive-branch-annotations

java -jar ~/workspace/ASTRAL/astral.5.6.1.jar -i 1KP-genetrees.tre -q 1KP-speciestrees.tre -o 1KP-speciestrees-qs.tre -t 1


Quartet scores as branch-specific estimate of gene tree discordance

[Figure: the species tree with each internal branch labelled by the quartet score for the main resolution, as a percentage; the values shown range from about 33 to nearly 100.]


Annotate (score) a given tree and compute quartet support

• Scores 1KP-speciestrees.tre against the gene trees in 1KP-genetrees.tre and annotates each branch with the quartet scores of its main resolution and of the two alternative resolutions (-t 8), saving the result in 1KP-speciestrees-qsall.tre.

java -jar ~/workspace/ASTRAL/astral.5.6.1.jar -i 1KP-genetrees.tre -q 1KP-speciestrees.tre -o 1KP-speciestrees-qsall.tre -t 8


Quartet scores of alternative resolutions


Quartet scores for all three resolutions around the branch. Example: 52%, 21%, 27%

[Figure: the species tree with each internal branch labelled with three quartet scores in the form q1=0.43;q2=0.27;q3=0.31 (main resolution first, then the two alternatives).]


These annotations are hard to read directly on a tree of this size.
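One way around this is to pull the labels out of the annotated tree file and tabulate them. The sketch below assumes the labels appear in the text as [q1=…;q2=…;q3=…], as on the slide; the five sample labels are copied from the slide, while the function names and the 0.10 margin are arbitrary choices for illustration. It flags branches whose main resolution is barely ahead of the best alternative.

```python
import re

QUARTET_LABEL = re.compile(r"\[q1=([0-9.]+);q2=([0-9.]+);q3=([0-9.]+)\]")


def quartet_supports(text):
    """Return one (q1, q2, q3) tuple per annotated branch found in text."""
    return [tuple(float(x) for x in found)
            for found in QUARTET_LABEL.findall(text)]


def contested(supports, margin=0.10):
    """Branches where q1 exceeds the best alternative by less than margin."""
    return [q for q in supports if q[0] - max(q[1], q[2]) < margin]


# Sample labels taken from the slide; a real input would be the tree file.
labels = (
    "[q1=0.43;q2=0.27;q3=0.31] [q1=0.37;q2=0.34;q3=0.29] "
    "[q1=0.77;q2=0.1;q3=0.13] [q1=0.97;q2=0.01;q3=0.03] "
    "[q1=0.33;q2=0.34;q3=0.32]"
)

supports = quartet_supports(labels)
for q1, q2, q3 in contested(supports):
    print(f"q1={q1:.2f}  q2={q2:.2f}  q3={q3:.2f}")
```

Like any regex over Newick text, this loses the link between a label and its clade, so it is a quick triage step rather than a replacement for a proper visualization.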


DiscoVista: visualizing discordance

• https://github.com/esayyari/DiscoVista

• Sayyari, et al. “DiscoVista: Interpretable Visualizations of Gene Tree Discordance.” MPE 122 (2018): 110–15.


Handling multiple individuals

• Handling multiple individuals is supported (the relevant paper has been in review for a long time).

• The mapping between names in gene trees and names in the species tree should be provided using the -a option. Two formats are supported (either can be used):

Format 1:

cat: siamesecat12,persiancat17,wildcat
dog: dog1,labrador,bulldog-1,bulldog-2
horse: pony,shire-2,mustang110

Format 2:

cat 3 siamesecat12 persiancat17 wildcat
dog 4 dog1 labrador bulldog-1 bulldog-2
horse 3 pony shire-2 mustang110
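Since both layouts carry the same information, a mapping file can be checked or converted before it is passed to -a. The following is a minimal sketch, not ASTRAL code: the function names are invented, the format is detected simply by looking for a colon on the line, and names are assumed to contain no colons, commas, or spaces.

```python
def parse_mapping(text):
    """Parse either layout into {species: [individual, ...]}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" in line:
            # "species: ind1,ind2,..."
            species, names = line.split(":", 1)
            individuals = [n.strip() for n in names.split(",") if n.strip()]
        else:
            # "species count ind1 ind2 ..."
            species, count, *individuals = line.split()
            if int(count) != len(individuals):
                raise ValueError(
                    f"{species}: expected {count} names, "
                    f"found {len(individuals)}")
        mapping[species.strip()] = individuals
    return mapping


def to_colon_format(mapping):
    """Write the 'species: a,b,c' layout."""
    return "\n".join(f"{sp}: {','.join(names)}"
                     for sp, names in mapping.items())


def to_count_format(mapping):
    """Write the 'species N a b c' layout."""
    return "\n".join(f"{sp} {len(names)} {' '.join(names)}"
                     for sp, names in mapping.items())


COLON_STYLE = """cat: siamesecat12,persiancat17,wildcat
dog: dog1,labrador,bulldog-1,bulldog-2
horse: pony,shire-2,mustang110"""

COUNT_STYLE = """cat 3 siamesecat12 persiancat17 wildcat
dog 4 dog1 labrador bulldog-1 bulldog-2
horse 3 pony shire-2 mustang110"""

print(to_count_format(parse_mapping(COLON_STYLE)))
```

The count check in the second layout is the main reason to parse at all: a miscounted line is easy to write by hand and is caught here before the file reaches ASTRAL.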


Main publications

• Mirarab, Siavash, Rezwana Reaz, Md. Shamsuzzoha Bayzid, Théo Zimmermann, M. S. Swenson, and Tandy Warnow. “ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation.” Bioinformatics 30, no. 17 (2014): i541–48. https://doi.org/10.1093/bioinformatics/btu462.

• Mirarab, S., and T. Warnow. “ASTRAL-II: Coalescent-Based Species Tree Estimation with Many Hundreds of Taxa and Thousands of Genes.” Bioinformatics 31, no. 12 (2015). https://doi.org/10.1093/bioinformatics/btv234.

• Zhang, Chao, Maryam Rabiee, Erfan Sayyari, and Siavash Mirarab. “ASTRAL-III: Polynomial Time Species Tree Reconstruction from Partially Resolved Gene Trees.” BMC Bioinformatics 19, no. S6 (2018): 153. https://doi.org/10.1186/s12859-018-2129-y.

• Sayyari, Erfan, and Siavash Mirarab. “Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies.” Molecular Biology and Evolution 33, no. 7 (2016): 1654–68. https://doi.org/10.1093/molbev/msw079.



Some other ASTRAL-related papers

• Testing for polytomies in species trees using quartet frequencies (Sayyari and Mirarab). Genes (2018). (Note: implemented in option -t 10.)

• Filtering loci is not beneficial! (Molloy and Warnow), Systematic Biology (2018)

• ASTRAL consistent under models of missing data (Nute, Molloy, Chou, and Warnow). BMC Genomics (2018)

• SIESTA improves ASTRAL (Vachaspati and Warnow). BMC Genomics (2018)

• Visualizing Discordance using DiscoVista (Sayyari, Whitfield, and Mirarab). Molecular Phylogenetics and Evolution (2018)

• Fragmentary sequences can negatively impact ASTRAL trees (Sayyari, Whitfield, and Mirarab). Molecular Biology and Evolution (2017)

• How many genes does ASTRAL need? (Shekhar, Roch, and Mirarab). Transactions on Computational Biology and Bioinformatics (2017)

• Using ASTRAL as a supertree method (Vachaspati and Warnow). Bioinformatics (2017)

• Performance under ILS and HGT (Davidson, Vachaspati, Mirarab, and Warnow). BMC Genomics (2015).



For more info

• Contact us: Tandy Warnow, [email protected], Siavash Mirarab, [email protected]

• Software available at Github site: https://github.com/smirarab/ASTRAL

• See tutorial and README at GitHub site: https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md

• Email: [email protected]

• More related papers at http://tandy.cs.illinois.edu/papers-all.html and http://eceweb.ucsd.edu/~smirarab/publications.html
