New Tools for Visualizing Genome Evolution Lutz Hamel Dept. of Computer Science and Statistics...

25
New Tools for Visualizing Genome Evolution Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island J. Peter Gogarten Dept. of Molecular and Cell Biology University of Connecticut
  • date post

    18-Dec-2015
  • Category

    Documents

  • view

    213
  • download

    0

Transcript of New Tools for Visualizing Genome Evolution Lutz Hamel Dept. of Computer Science and Statistics...

New Tools for VisualizingGenome Evolution

Lutz HamelDept. of Computer Science and StatisticsUniversity of Rhode Island

J. Peter GogartenDept. of Molecular and Cell BiologyUniversity of Connecticut

Motivation

Early life on Earth has left a variety of traces that can be utilized to reconstruct the history of life the fossil and geological records information retained in living organisms

Our research focuses on how information can be gained from the molecular record: information about the history of life that is retained

in the structure and sequences of macromolecules found in extant organisms

The analyses of the mosaic nature of genomes using phylogenetics will be a key ingredient to unravel the life's early history.

Relevance to NASA

The analyses are relevant in the context of NASA's Origin theme: Understand the origin and evolution of life on

Earth.

We address questions that are central to NASA's Astrobiology program: Understand how past life on Earth interacted

with its changing planetary and Solar System environment.

Understand the evolutionary mechanisms and environmental limits of life.

Phylogenetics

Phylogenetics (Greek: phylon = race and genetic = birth) is the taxonomical classification of organisms based on how closely they are related in terms of evolutionary differences.

Phylogenetic progression as envisioned by Darwin

From C. DarwinOrigin of Species

EuglenaTrypanosoma

Zea

Paramecium

Dictyostelium

EntamoebaNaegleria

Coprinus

Porphyra

Physarum

HomoTritrichomonas

Sulfolobus

ThermofilumThermoproteus

pJP 27pJP 78

pSL 22pSL 4

pSL 50

pSL 12

E.coli

Agrobacterium

Epulopiscium

AquifexThermotoga

Deinococcus

Synechococcus

Bacillus

Chlorobium

Vairimorpha

Cytophaga

HexamitaGiardia

mitochondria

chloroplast

Haloferax

Methanospirillum

Methanosarcina

Methanobacterium

ThermococcusMethanopyrus

Methanococcus

ARCHAEA BACTERIA

EUCARYA

Encephalitozoon

Thermus

EM 17 Marine group 1

RiftiaChromatium

ORIGIN

Treponema

Fig. modified from Norman Pace

SS

U-r

RN

A T

ree

of L

if e

Phylogenetics: Classic View

All genes are inherited from ancestor.

Branching reflects speciation events.

Evolutionary tree follows very closely the SSU-rRNA tree.

However…

Science, 280 p.672ff (1998)

Aquifex is assignedto different branches ofthe tree…

Horizontal Gene Transfer (HGT) leads to Mosaic Genomes, where different parts of the genome have different histories.

(a) concordant genes, (b) according to 16S (and other conserved genes) (c) according to phylogenetically discordant genes

Gophna, U., Doolittle, W.F. & Charlebois, R.L.: Weighted genome trees: refinements and applications. J. Bacteriol. (in press)

A Revised Tree of Life

Modified from:Bill Martin (1999)BioEssays 21, 99-104

Evolutionary Processes Analogous to the Ones Proposed to Occur in the Microbial World

+

=

+ =

Cartoons from Science Made Stupid, T. Weller, 1986.from http://www.besse.at/sms/

Visualizing Phylogenies

Visualize the relation of four organisms at a time

three unrooted trees plot the support of

various genes for each of the tree topologies in an equilateral triangle

OrthologousGene Families

Visualizing Phylogenies

Synechocystis sp. (cyanobact.)

Chlorobium tepidum (GSB)

Rhodobacter capsulatus (-prot)

Rhodopseudomonas palustris (-prot)

Constructing the Visualization

Download four genomes

(genome quartet)

[a.a.sequences]

Download four genomes

(genome quartet)

[a.a.sequences]

“BLAST” every genome

against every other genome

“BLAST” every genome

against every other genome

Select top hit

of every BLAST search

Select top hit

of every BLAST search

Detect quartets of orthologs

Detect quartets of orthologs

Align quartets of orthologuesusing ClustalW

Align quartets of orthologuesusing ClustalW

Calculate maximum-likelihood values and posterior

probabilities for all three tree topologies

Calculate maximum-likelihood values and posterior

probabilities for all three tree topologies

Convert probabilities(barycentric coordinates)

into Cartesian coordinates

Convert probabilities(barycentric coordinates)

into Cartesian coordinates

Plot all points onto

equilateral triangle

Plot all points onto

equilateral triangle

Visualizing Five Genomes

Five genomes => fifteen unrooted trees

Rather than triangle - dekapentagon

A: ArchaeoglobusS: SulfolobusY: YeastR: RhodobacterB: Bacillus

A: ArchaeoglobusS: SulfolobusY: YeastR: RhodobacterB: Bacillus

Zhaxybayeva O, Hamel L, Raymond J, Gogarten JP: Visualization of Phylogenetic Content of Five Genomes with Dekapentagonal Maps. Genome Biology 2004, 5:R20

Visualizing Multiple Genomes

Number of Genomes Number of Unrooted Trees

3 1

4 3

5 15

6 105

7 945

8 10395

9 135135

10 2027025

20 2.22E+020

30 8.69E+036

40 1.31E+055

50 2.84E+074

60 5.01E+094

70 5.00E+115

80 2.18E+137

For comparison the universe contains only about 1089 protons and has an age of about 5*1017 seconds or 5*1029 picoseconds.

Given this explosion, plotting all possible relationships as unrootedtrees is impossible.

Visualizing Multiple Genomes: SOMs

SOM Self-Organizing Map An artificial neural network approach to clustering

we are looking for clusters of genes which favor certain tree topologies

Advantages over other clustering approaches: No a priori knowledge of how many clusters to

expect Explicit summary of commonalities and differences

between clusters Cluster membership is not exclusive – a gene can

indicate membership in multiple clusters at the same time

Visually appealing representation

T. Kohonen, Self-organizing maps, 3rd ed. Berlin ; New York: Springer, 2001.

Visualization with SOMs

1211

11/44

15

9

Training a SOM

Tree #1 (T1) … Tree #k (Tk)

Support value vector for a set #1 of orthologous genes

P11 … P1k

Support value vector for a set #2 of orthologous genes

P21 … P2k

… … … …

Support value vector for a set #n of orthologous genes

Pn1 … Pnk

x

y

k

ittc ii

,)()(minarg mx

itthtt iciii )],()([)()1( mxmm

. if

, if 0

ic

ichci

SOM Regression Equations:

- learning rate - neighborhood distance

mi

Data SetSOM Neural Elements

SOM Visualization

Anatomy of a Trained SOM

In our case k = 15 for the fifteen tree topologies.

1 - 5

6 - 10

11 - 15

Visualizing a Larger Number of Genomes

13 gamma-proteobacterial genomes (258 putative orthologs):

•E.coli•Buchnera•Haemophilus•Pasteurella•Salmonella•Yersinia pestis (2 strains)•Vibrio•Xanthomonas (2 sp.)•Pseudomonas•Wigglesworhtia•Vibrio Cholerae

There are 13,749,310,575

possible unrooted tree topologies for 13 genomes

switch to bipartitions

Bipartition of a Phylogenetic Tree

Bipartition – a division of a phylogenetic tree into two parts that are connected by a single branch. It divides a dataset into two groups, but it does not consider the relationships within each of the two groups.

95Here 95 represents thebootstrap support for theinternal branch.

The number of bipartitions for N genomes is equal to 2(N-1)-N-1.

Bipartitions: Lento Plot & SOM

Strongly supported bipartitions in SOM

Strongly supported bipartitions

Conclusions & Future Work

Self-Organizing Maps seem to be an effective way to visualize mosaic genome evolution.

Corroborate findings with other methodologies

Scalable In the Future:

Larger data sets Locally Linear Embedding (LLE)

Acknowledgements

Olga A. ZhaxybayevaDept. of Biochemistry and Molecular BiologyDalhousie University

Maria PoptsovaDept. of Molecular and Cell BiologyUniversity of Connecticut