Maria Poptsova
University of Connecticut, Dept. of Molecular and Cell Biology
August 18, 2006, Stanford University, CA
AUTOMATED ASSEMBLY OF GENE FAMILIES AND DETECTION OF HORIZONTALLY
TRANSFERRED GENES
Superfamily of ATP synthases for 317 taxa of bacteria and archaea
Outline:
Tree of Life
• 16S rRNA tree
• Rooting the Tree of Life
• How Tree-like is Organismal Evolution?
• What is Horizontal Gene Transfer?
• What is Organismal Lineage in light of HGT?
Automated Methods of Assembling Orthologous Gene Families
• Reciprocal BLAST Hit Method: problems with paralogs
• BranchClust: Phylogenetic Algorithm for Assembling Gene Families
Methods of HGT Detection
• Overview of methods for HGT detection: AU Test, Symmetric Difference of Robinson and Foulds, Bipartition Analysis
• HGT in silico experiments
Cenancestor as placed by ancient duplicated genes (ATPases, signal recognition particles, EF)
To Root
• strictly bifurcating
• no reticulation
• only extant lineages
• based on a single molecular phylogeny
• branch length is not proportional to time
The Tree of Life according to SSU ribosomal RNA (+)
Fig.: SSU-rRNA Tree of Life with the three domains ARCHAEA, BACTERIA, and EUCARYA and the ORIGIN marked; scale bar: 0.1 changes per nt. Leaves include representative genera (e.g. Homo, Zea, E. coli, Bacillus, Sulfolobus, Methanococcus), organelles (mitochondria, chloroplast), and environmental clones (pJP, pSL). Labels on the tree: CPS, V/A-ATPase, Prolyl RS, Lysyl RS, Mitochondria, Plastids.
Fig. modified from Norman Pace
What is HGT?
Genes can be passed vertically: from an ancestor to its descendants
Genes can also be passed horizontally: exchange of genes between different species
HGT stands for Horizontal Gene Transfer
Science 280, p. 672ff (1998)
Horizontal Gene Transfer Mosaic Genomes
How Tree-like is Organismal Evolution?
Escherichia coli, strain CFT073, uropathogenic
Escherichia coli, strain EDL933, enterohemorrhagic
Escherichia coli K12, strain MG1655, laboratory strain
Welch RA, et al.
Proc Natl Acad Sci U S A. 2002; 99:17020-4
“… only 39.2% of their combined (nonredundant) set of proteins actually are common to all three strains.”
How many common genes?
What is an “organismal lineage” in light of horizontal gene transfer?
Over very short time intervals an organismal lineage can be defined as the majority consensus of genes.
Organismal Lineage
Rope as a metaphor to describe an organismal lineage (Gary Olsen)
Individual fibers = genes that travel for some time in a lineage.
While no individual fiber (gene) present at the beginning might be present at the end, the rope (or the organismal lineage) nevertheless has continuity.
However, the genome as a whole will acquire the character of the incoming genes (the rope turns solidly red over time).
From: Bill Martin (1999) BioEssays 21, 99-104
Selection of Orthologous Gene Families
All automated methods for assembling sets of orthologous genes are based on sequence similarities.
BLAST hits:
• Reciprocal BLAST hit method
• Triangular circular BLAST significant hits (COG, or Clusters of Orthologous Groups)
• Sequence identity of 30% and greater (SCOP database)
• Similarity complemented by HMM-profile analysis (Pfam database)
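Reciprocal BLAST Hit Method
often fails in the presence of paralogs
Fig.: four genes (1, 2, 3, 4) linked by reciprocal best BLAST hits form 1 gene family; with a paralog 2’ present, the method yields 0 gene families.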
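The failure of the reciprocal BLAST hit method in the presence of a paralog can be sketched as follows. This is a minimal illustration, not the procedure used in the talk: the gene names (g1, g2, g2p, ...), the `best_hit` dictionary, and the function `rbh_family` are hypothetical, and best hits are assumed to be already extracted from BLAST output.

```python
from itertools import combinations


def rbh_family(seed, genomes, best_hit):
    """Assemble one family by reciprocal best hits, or return None.

    seed     -- (genome, gene) pair used as the starting point
    best_hit -- hypothetical lookup: (gene, target genome) -> best hit there
    """
    seed_genome, seed_gene = seed
    members = {seed_genome: seed_gene}
    for genome in genomes:
        if genome != seed_genome:
            hit = best_hit.get((seed_gene, genome))
            if hit is None:
                return None
            members[genome] = hit
    # Every pair of members must pick each other as best hit.
    for ga, gb in combinations(genomes, 2):
        a, b = members[ga], members[gb]
        if best_hit.get((a, gb)) != b or best_hit.get((b, ga)) != a:
            return None
    return members


genomes = [1, 2, 3, 4]
genes = {1: "g1", 2: "g2", 3: "g3", 4: "g4"}

# No paralogs: every gene's best hit in another genome is its ortholog.
best_hit = {(genes[a], b): genes[b] for a in genomes for b in genomes if a != b}

# Genome 2 gains a paralog g2p that gene g1 happens to hit best.
with_paralog = dict(best_hit)
with_paralog[("g1", 2)] = "g2p"
for b in (1, 3, 4):
    with_paralog[("g2p", b)] = genes[b]

print(rbh_family((1, "g1"), genomes, best_hit))      # one family of four genes
print(rbh_family((1, "g1"), genomes, with_paralog))  # None: no family selected
```

In the second call gene g3 still prefers g2, so the pair (g2p, g3) is not reciprocal and the whole family is rejected.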
Case of 2 bacteria and 2 archaea species
ATP-A (catalytic subunit), ATP-B (non-catalytic subunit), ATP-F
Fig.: best BLAST hits among ATP-A and ATP-B of Escherichia coli, Bacillus subtilis, Methanosarcina mazei, and Sulfolobus solfataricus, shown in two panels, one of them including ATP-F.
Neither ATP-A nor ATP-B is selected by the RBH method
Families of ATP-synthases
Fig.: Phylogenetic Tree of ATP-A, ATP-B, and ATP-F sequences from Escherichia coli, Bacillus subtilis, Methanosarcina mazei, and Sulfolobus solfataricus, grouped into Family of ATP-A, Family of ATP-B, and Family of ATP-F.
BranchClust Algorithm
www.bioinformatics.org/branchclust
Fig.: dataset of N genomes (genome 1, genome 2, genome 3, ..., genome N); BLAST hits of genome i against the dataset lead to a superfamily tree.
BranchClust Algorithm
Root positions
Fig. panels: Superfamily of penicillin-binding protein (13 gamma proteobacteria); Superfamily of DNA-binding protein (13 gamma proteobacteria)
Comparison of the best BLAST hit method and BranchClust algorithm
Number of selected families (A: Archaea, B: Bacteria):

Number of taxa   Reciprocal best BLAST hit   BranchClust
2A 2B            80                          414 (all complete)
13B              236                         409 (263 complete, 409 with n8)
16B 14A          12                          126 (60 complete, 126 with n24)
BranchClust Algorithm
ATP-synthases: Examples of Clustering
Fig. panels: 13 gamma proteobacteria; 30 taxa: 16 bacteria and 14 archaea; 317 bacteria and archaea
BranchClust Algorithm
Typical Superfamily for 30 taxa (16 bacteria and 14 archaea)
Fig.: superfamily tree with clusters labeled 59:30, 33:19, 53:26, 55:21, 37:19, 36:21
BranchClust Algorithm
Gene Annotation
------------ CLUSTER 1 -----------
------------ FAMILY ------------
>gi|27904705| peptidoglycan synthetase FtsI [Buchnera aphidicola str. Bp (Baizongia pistaciae)]
>gi|26246017| Peptidoglycan synthetase ftsI precursor [Escherichia coli CFT073]
>gi|16273058| penicillin-binding protein 3 [Haemophilus influenzae Rd KW20]
>gi|15602001| FtsI [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599614| penicillin-binding protein 3 [Pseudomonas aeruginosa PAO1]
>gi|16763512| division specific transpeptidase [Salmonella typhimurium LT2]
>gi|15642404| penicillin-binding protein 3 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490961| hypothetical protein WGLp212 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21230194| penicillin-binding protein 3 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241544| penicillin-binding protein 3 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837394| penicillin binding protein 3 [Xylella fastidiosa 9a5c]
>gi|16120877| penicillin-binding protein 3 [Yersinia pestis CO92]
>gi|22127506| peptidoglycan synthetase [Yersinia pestis KIM]
COMPLETE: 13
>>>>> IN-PARALOGS -----------
>gi|16765177| putative penicillin-binding protein 3 [Salmonella typhimurium LT2]
>gi|15597468| penicillin-binding protein 3A [Pseudomonas aeruginosa PAO1]
------------ CLUSTER 2 -----------
------------ FAMILY ------------
>gi|26246616| Penicillin-binding protein 2 [Escherichia coli CFT073]
>gi|16272007| penicillin-binding protein 2 [Haemophilus influenzae Rd KW20]
>gi|15603789| Pbp2 [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599198| penicillin-binding protein 2 [Pseudomonas aeruginosa PAO1]
>gi|16764017| cell elongation-specific transpeptidase [Salmonella typhimurium LT2]
>gi|15640966| penicillin-binding protein 2 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490921| hypothetical protein WGLp172 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21232896| penicillin-binding protein 2 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241430| penicillin-binding protein 2 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837913| penicillin binding protein 2 [Xylella fastidiosa 9a5c]
>gi|16122817| penicillin-binding protein 2 [Yersinia pestis CO92]
>gi|22125081| peptidoglycan synthetase, penicillin-binding protein 2 [Yersinia pestis KIM]
INCOMPLETE: 12
>>>>> IN-PARALOGS -----------
>gi|16765252| putative penicillin-binding protein [Salmonella typhimurium LT2]
Superfamily of penicillin-binding protein for 13 gamma proteobacteria
BranchClust Algorithm
Implementation and Usage
The BranchClust algorithm is implemented in Perl, uses the BioPerl module for parsing trees, and is freely available at http://bioinformatics.org/branchclust
Required:
1. BioPerl module for parsing trees, Bio::TreeIO
2. Taxa recognition file gi_numbers.out, which must be present in the current directory. To create this file, see the "Taxa recognition file" section on the website.
Usage:
At the command line type:
# perl branch_clust.pl <tree-file> <MANY>
Output: families.list, clusters.out, clusters.log
How to do batch processing:
An example of a wrapper can be found on the website.
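Batch processing could look roughly like the sketch below. It is not the wrapper from the website: the Python functions, the `*.tre` file pattern, and the per-tree output folders are assumptions made for illustration. Only the command line and the three output file names come from the slide.

```python
import shutil
import subprocess
from pathlib import Path

# Output file names as listed on the slide.
OUTPUTS = ("families.list", "clusters.out", "clusters.log")


def build_command(tree_file, many):
    """Command line from the slide: perl branch_clust.pl <tree-file> <MANY>."""
    return ["perl", "branch_clust.pl", str(tree_file), str(many)]


def run_batch(tree_dir, many, pattern="*.tre"):
    """Run BranchClust on every tree file in tree_dir.

    The file pattern is a hypothetical naming convention.
    """
    for tree_file in sorted(Path(tree_dir).glob(pattern)):
        subprocess.run(build_command(tree_file, many), check=True)
        # Assumption: each run writes its output files into the current
        # directory, so they are moved aside before the next run.
        out_dir = Path(tree_file.stem + "_branchclust")
        out_dir.mkdir(exist_ok=True)
        for name in OUTPUTS:
            if Path(name).exists():
                shutil.move(name, str(out_dir / name))


print(build_command("fam1.tre", 8))
```

Calling `run_batch("trees", 8)` would then process every tree in a hypothetical `trees` directory, provided Perl, BioPerl, `branch_clust.pl`, and `gi_numbers.out` are available in the working directory.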
BranchClust Algorithm
Data Flow
1. Download n complete genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria) in fasta format (*.faa)
2. Put all n genomes in one database
3. Take one starting genome and BLAST it against the database consisting of n genomes
4. Parse the BLAST output with the requirement that all n taxa be present. Result: superfamilies
5. Align with ClustalW
6. Reconstruct the superfamily tree: ClustalW (quick distance method) or Phyml (Maximum Likelihood)
7. Parse with BranchClust. Result: gene families
Why do we need gene families?
How many genes are common between different species?
Do all the common genes share the common history?
How do we reconstruct the tree of life?
How can we detect genes that were horizontally transferred?
Methods of HGT Detection
• Parametric methods
  • GC-content analysis
  • analysis of single nucleotide composition (SNC) and dinucleotide composition (DNC)
  • codon usage bias
  • other measures based on sequence composition
• Phylogenetic methods
  • AU test
  • SPR metric (NP-hard problem)
  • Symmetric difference of Robinson and Foulds
  • Bipartition spectrum analysis
  • Quartet spectrum analysis
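As an illustration of the simplest parametric approach, GC-content analysis, the sketch below flags genes whose GC content deviates strongly from the genome-wide mean. The z-score rule, the threshold of 1.5, and the toy sequences are assumptions chosen for the example; they are not taken from the talk.

```python
from statistics import mean, pstdev


def gc_content(sequence):
    """Fraction of G and C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def atypical_genes(genes, z_threshold=1.5):
    """Return names of genes whose GC content is an outlier.

    genes       -- mapping: gene name -> DNA sequence (toy data below)
    z_threshold -- illustrative cutoff on the absolute z-score
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sigma = pstdev(gc.values())
    if sigma == 0:
        return []
    return sorted(n for n, v in gc.items() if abs(v - mu) / sigma > z_threshold)


toy_genome = {
    "gene1": "ATGC",
    "gene2": "AGCT",
    "gene3": "TACG",
    "gene4": "GATC",
    "alien": "GGCC",  # made-up gene with unusually high GC content
}
print(atypical_genes(toy_genome))
```

A flagged gene is only a candidate: atypical composition can have causes other than transfer, and transferred genes from donors with similar composition are not flagged.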
AU test
The AU test, or approximately unbiased test of phylogenetic tree selection, was proposed for assessing the confidence of tree selection (Hidetoshi Shimodaira, 2002).
The AU test produces for each tree a number ranging from zero to one (P1 and P2 for two trees). This number is the probability value (P-value) that the tree is the true tree. The greater the P-value, the greater the probability that the tree is the true tree.
If P1 belongs to the genome tree and P2 to the gene tree, then the AU test provides the probability that the genome tree is the true tree for the gene family.
One would expect that a tree with a different topology would have a small P-value.
Accepted requirement for HGT detection: P-value < 1E-2 to 1E-4
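Applying the P-value requirement to a set of gene families amounts to a simple filter, sketched below. The AU P-values are assumed to have been computed beforehand by a separate program; the family names and values here are made up for illustration.

```python
def hgt_candidates(au_pvalues, alpha=1e-2):
    """Families whose AU P-value for the genome tree is below alpha.

    au_pvalues -- mapping: gene family -> AU P-value of the genome tree
                  evaluated on that family's alignment (hypothetical data)
    """
    return sorted(fam for fam, p in au_pvalues.items() if p < alpha)


# Made-up P-values for four families.
pvalues = {"fam1": 0.62, "fam2": 0.004, "fam3": 0.00002, "fam4": 0.03}

for alpha in (1e-2, 1e-4):
    print(alpha, hgt_candidates(pvalues, alpha))
```

The stricter threshold keeps fewer candidates, which trades detected transfers against false positives.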
Some Metrics to Compare Tree Topologies
• SPR distance
• Symmetric Difference of Robinson and Foulds (bipartition distance)
• Quartet distance
SPR metric: Subtree Pruning and Regrafting
The SPR metric is very hard computationally. There are no tools available to calculate the difference in tree topology by the number of SPR operations required to transform one tree into another.
Note: there is Robert Beiko’s program. Also, the SPR distance alone would not consider support levels.
That is why bipartitions come into the scene.
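The symmetric difference of Robinson and Foulds counts the bipartitions that occur in one tree but not in the other. A minimal sketch follows, assuming trees are given as nested tuples of leaf names (a hypothetical representation chosen for brevity; real analyses would parse Newick files). The two example trees are made up.

```python
def leaves(tree):
    """Set of leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of a tree, treated as unrooted.

    Each split is stored as the side that does not contain a fixed
    anchor leaf, so both orientations map to the same set.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree1, tree2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "E"), "C"), ("D", "B"))
print(robinson_foulds(t1, t2))
```

Unlike the SPR distance, this quantity needs only set operations on the two bipartition lists.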
Phylogenetic information present in genomes
Break information into small quanta of information (bipartitions or embedded quartets)
Spectral Analyses of Phylogenetic Data
Analyze spectra to detect transferred genes and plurality consensus.
BIPARTITION OF A PHYLOGENETIC TREE
Bipartition (or split): a division of a phylogenetic tree into two parts that are connected by a single branch. It divides a dataset into two groups, but it does not consider the relationships within each of the two groups.
Number of non-trivial bipartitions for N genomes is equal to 2^(N-1) - N - 1.
**...***..
*...**.*.*
Bipartitions can be divided into conflicting and non-conflicting:
non-conflicting (can coexist in one tree)
**...***..
conflicting (cannot coexist in one tree)
**...*...*
Fig.: two trees with leaf orders A B C D E and A E C D B
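Whether two bipartitions can coexist in one tree can be checked directly on the star/dot patterns, written with one character per taxon. The sketch below uses the standard compatibility criterion (two splits are compatible when at least one of the four pairwise intersections of their sides is empty); the function names are hypothetical.

```python
def starred(pattern):
    """Positions of the taxa marked with '*' in a bipartition pattern."""
    return {i for i, ch in enumerate(pattern) if ch == "*"}


def conflicting(p, q):
    """True if two bipartitions cannot coexist in one tree."""
    if len(p) != len(q):
        raise ValueError("patterns must cover the same taxa")
    a, b = starred(p), starred(q)
    everyone = set(range(len(p)))
    # The four intersections of the sides of the two splits.
    parts = (a & b, a - b, b - a, everyone - (a | b))
    return all(parts)  # conflict only if none of them is empty


print(conflicting(".**.....", "***....."))      # nested groups: compatible
print(conflicting(".**.....", "*.*....."))      # overlapping groups: conflict
print(conflicting("**...***..", "**...*...*"))  # ten-taxon example: conflict
```

Nested or disjoint groups always leave one intersection empty, which is why they can sit on different branches of the same tree.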
The Tree Drawing Process
Try to infer phylogeny
“Likely” Trees
Choose / make consensus
“Best” Tree
Resampling, e.g. bootstrapping
Resampling simulates examining extra sequence from the original data
Obtaining Bootstrap Support for Branches
Fig.: tree with bootstrap support values 65, 100, 75, 100, 75 on its branches
Now bipartitions have weights:
ABCDEFGH
.**..... 65
***..... 75
...**... 75
.....*** 100
......** 100
*.*..... 75
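The weight of a bipartition is the percentage of bootstrap replicates in which it appears. A minimal sketch, assuming each bootstrapped tree has already been reduced to a set of bipartition patterns (the four replicates below are made up):

```python
from collections import Counter


def bootstrap_support(replicates):
    """Percentage of replicates that contain each bipartition.

    replicates -- list with one set of bipartition patterns per
                  bootstrapped tree (hypothetical, pre-extracted input)
    """
    counts = Counter()
    for splits in replicates:
        counts.update(set(splits))
    total = len(replicates)
    return {split: 100.0 * n / total for split, n in counts.items()}


# Four made-up replicates over taxa ABCDEFGH.
replicates = [
    {".**.....", "......**"},
    {".**.....", "......**"},
    {".**.....", "......**"},
    {"*.*.....", "......**"},
]
for split, support in sorted(bootstrap_support(replicates).items()):
    print(split, support)
```

With 100 replicates, as used in the talk, the counts are the support values directly.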
Data Flow
1. For every gene family, align sequences (ClustalW)
2. For every gene family, reconstruct the Maximum Likelihood (ML) tree and generate 100 bootstrap samples (phyml)
3. For every gene family, extract bipartition information from each bootstrapped tree, and compose a bipartition matrix
4. Bipartition matrix is generated:

        **…..   *.*….   *…*….   *….*….   etc.
fam1    71      34      0       99
fam2    56      2       0       76
fam3    1       99      1       99
…
famN    34      99      0       99

5. Do “Lento” plot analysis
Results:
• select gene families
• majority consensus bipartitions
• detected conflicts = HGT events
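The last steps of this data flow can be sketched as follows: build the bipartition matrix, take the bipartitions that are well supported in most families as the consensus, and flag families that strongly support a conflicting bipartition. This is a simplified illustration, not the analysis used in the talk; the 70% threshold, the majority rule, and the three toy families over five taxa are assumptions.

```python
def conflicts(p, q):
    """True if two star/dot bipartitions cannot coexist in one tree."""
    a = {i for i, c in enumerate(p) if c == "*"}
    b = {i for i, c in enumerate(q) if c == "*"}
    rest = set(range(len(p))) - (a | b)
    return bool(a & b) and bool(a - b) and bool(b - a) and bool(rest)


def bipartition_matrix(family_support):
    """Rows: families; columns: every bipartition seen in any family."""
    columns = sorted({s for sup in family_support.values() for s in sup})
    rows = {fam: [sup.get(s, 0) for s in columns]
            for fam, sup in family_support.items()}
    return columns, rows


def detect_conflicts(family_support, threshold=70):
    """Return consensus bipartitions and families that contradict them."""
    columns, rows = bipartition_matrix(family_support)
    n = len(rows)
    consensus = [s for j, s in enumerate(columns)
                 if sum(r[j] >= threshold for r in rows.values()) > n / 2]
    flagged = {}
    for fam, row in rows.items():
        bad = [s for j, s in enumerate(columns)
               if row[j] >= threshold
               and any(conflicts(s, c) for c in consensus)]
        if bad:
            flagged[fam] = bad
    return consensus, flagged


# Made-up bootstrap support values for three families, taxa ABCDE.
support = {
    "fam1": {"**...": 99, "...**": 90},
    "fam2": {"**...": 85, "...**": 76},
    "fam3": {"*.*..": 99, "...**": 99},
}
print(detect_conflicts(support))
```

Here fam3 strongly supports grouping A with C, which contradicts the consensus grouping of A with B, so it is reported as a putative transfer.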
13 gamma proteobacteria
1. Buchnera aphidicola str. Bp (Baizongia pistaciae)
2. Escherichia coli CFT073
3. Haemophilus influenzae Rd KW20
4. Pasteurella multocida subsp. multocida str. Pm70
5. Pseudomonas aeruginosa PAO1
6. Salmonella typhimurium LT2
7. Vibrio cholerae O1 biovar eltor str. N16961
8. Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis
9. Xanthomonas campestris pv. campestris str. ATCC 33913
10. Xanthomonas axonopodis pv. citri str. 306
11. Xylella fastidiosa 9a5c
12. Yersinia pestis KIM
13. Yersinia pestis CO92
1.37E+10 possible unrooted tree topologies
“Lento”-plot of 34 supported bipartitions (out of 4082 possible)
13 gamma-proteobacterial genomes (258 putative orthologs):
• E. coli
• Buchnera
• Haemophilus
• Pasteurella
• Salmonella
• Yersinia pestis (2 strains)
• Vibrio
• Xanthomonas (2 sp.)
• Pseudomonas
• Wigglesworthia
There are 13,749,310,575
possible unrooted tree topologies for 13 genomes
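Both counts quoted for 13 genomes can be reproduced with a few lines. The sketch uses the standard formula (2n-5)!! for the number of unrooted bifurcating topologies and the bipartition count 2^(n-1) - n - 1; the function names are hypothetical.

```python
def unrooted_topologies(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


def nontrivial_bipartitions(n):
    """Number of possible non-trivial bipartitions of n taxa."""
    return 2 ** (n - 1) - n - 1


print(unrooted_topologies(13))      # 13749310575
print(nontrivial_bipartitions(13))  # 4082
```

The contrast between the two numbers is the practical argument for working with bipartitions: the space of splits stays small enough to tabulate, while the space of whole trees does not.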
Consensus Tree and Horizontally Transferred Gene
Phylogeny of putatively transferred gene (virulence factor homologs (mviN))
only 258 genes analyzed
Consensus clusters of eight significantly supported bipartitions
What are the Actual HGT Rates?
Are the detected transfers mainly false positives, or are they the tip of an iceberg of many transfer events, most of which go undetected by current methods?
Here we explore how well these methods perform using in silico transfers between the leaves of a gamma proteobacterial phylogeny.
HGT in silico: Testing Methods of Detection
• AU test
• Symmetric Difference of Robinson and Foulds
• Bipartition Analysis
HGT in silico: AU test
236 families, 13 gamma proteobacteria
Fig.: AU Test: Number of Detected Conflicts. x-axis: Logarithm of Significance Level (-4 to 0); y-axis: Number of Conflicts (0 to 250).
Only two families out of 236 showed a conflict at the significance level of 5 × 10^-4; 5 conflicts were found at the significance level of 0.01, and 26 conflicts at the significance level of 0.05.
HGT in silico: AU test
236 families, 13 gamma proteobacteria
Fig. labels: Escherichia coli, Xylella fastidiosa, Pseudomonas aeruginosa, Vibrio cholerae
AU-value = 10^-4, i.e. log(AU-value) = -4
Fig.: two histograms of log(AU-value), x-axis from -40 to 0. In one, only 10% of AU-values are less than 10^-4; in the other, 86% of AU-values are less than 10^-4.
HGT in silico: AU test, Power of Detection
Fig. panels: Significance level < 1e-4; Significance level < 1e-2
HGT in silico: Robinson Foulds Metric
Fig.: Distribution of the number of different bipartitions in the original dataset. x-axis: Number of different bipartitions (1 to 13); y-axis: 0 to 100.
Power of Detection
Significance level < 1e-2
HGT in silico: Bipartition Analysis
Power of Detection
Fig. panels: Bootstrap support >70%; Bootstrap support >90%
Acknowledgements
Prof. Peter Gogarten, University of Connecticut
Gogarten Lab: Pascal Lapierre, Gregory Fournier, Alireza Ghodsi Senejani, Holly E. Gardner, Tim Harlow, Kristen Swithers, Kaiyuan Shi
NSF Microbial Genetics
NASA Exobiology & AISR Programs