Global Classification of (Plant) Proteins across Multiple Species
Kerr Wall, Jim Leebens-Mack, Naomi Altman, Victor Albert, Dawn Field, Hong Ma, Claude dePamphilis
Global Classification of Proteins
• The protein classification problem
• A method for global classification
• “Bootstrap” support for global classification
• Structure within clusters
• Structure between clusters
• Results from complete proteome classification: Arabidopsis, Oryza and Populus
The protein classification problem
• Genomic sequence can be translated into protein sequence but …
• The function of most proteins is unknown.
• Protein classification is used to:
– infer protein folding structure
– infer protein function
– infer evolutionary relationships
Similarity of Protein Sequence
FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---
FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---
FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP
FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---
FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ
FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP
• Each row represents a different protein.
• Each letter represents an amino acid.
• Each “-” represents a gap: a position that is empty in this sequence but occupied in a different protein in this set.
• In closely related proteins, the distance between proteins is the number of mismatches.
• In distantly related species, the sequences are given a score – often the probability that a random sequence matches as well (e.g. BLAST E-value)
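As a minimal sketch of the mismatch count described above for closely related proteins, the following compares two rows of the alignment shown on the slide. The function name `mismatch_distance` is hypothetical, and treating a gap opposite a residue as a mismatch is an assumption; the slides do not say how gaps are scored.

```python
# Aligned rows copied from the slide; "-" marks a gap.
ALIGNED = [
    "FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---",
    "FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---",
    "FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP",
    "FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---",
    "FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ",
    "FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP",
]


def mismatch_distance(a, b):
    """Count aligned positions where two rows differ.

    Assumption: a gap opposite a residue counts as one mismatch.
    """
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    return sum(1 for x, y in zip(a, b) if x != y)


# Pairwise distance matrix for the six rows.
DISTANCES = [[mismatch_distance(a, b) for b in ALIGNED] for a in ALIGNED]
```

For example, the third and sixth rows differ at five aligned positions, so `DISTANCES[2][5]` is 5.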
Inferring Evolutionary Relationships
Main methods:
• Statistical phylogeny based on sequence alignment and evolutionary models
– requires a high degree of sequence similarity
– good alignments use slow algorithms and often a lot of manual intervention
• Manual curation
– requires a large amount of manual intervention
– can incorporate sequence, folding structure and function
These methods work well for hundreds of genes.
Global Classification of Proteins
Very high throughput:
Arabidopsis 26,207
Rice 57,915
Poplar 45,555
Total 129,677
Our goal: The joint classification of all known plant proteins using a “scaffold” derived from the 3 completely sequenced species
A method for global classification
• Clustering based on a similarity (or distance) matrix is commonly used.
• Our similarity matrix is 129,677 x 129,677 so we need:
– A quick method for computing distance (BLAST E-values are often used; we use -log(E-value) as the similarity measure)
– A quick method for clustering (sparse matrix computations are often used)
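The -log(E-value) similarity can be sketched as follows, stored sparsely so that pairs without a hit take no space. This is an illustration under stated assumptions, not the project's pipeline: the names `similarity` and `similarity_graph` are hypothetical, base 10 is assumed for the logarithm, and the cap used when BLAST reports an E-value of 0 is an arbitrary choice.

```python
import math

E_CUTOFF = 1e-5          # E-value cutoff quoted in the slides
MAX_SIMILARITY = 200.0   # assumed cap for E-values reported as 0


def similarity(e_value, cutoff=E_CUTOFF, cap=MAX_SIMILARITY):
    """Return -log10(E), or 0.0 for hits weaker than the cutoff."""
    if e_value > cutoff:
        return 0.0
    if e_value <= 0.0:
        return cap  # log10(0) is undefined, so fall back to the cap
    return min(-math.log10(e_value), cap)


def similarity_graph(hits):
    """Build a sparse, symmetric similarity map.

    `hits` is an iterable of (query, subject, e_value) triples.
    When both directions of a pair are present, the larger
    similarity is kept (an assumption).
    """
    graph = {}
    for query, subject, e_value in hits:
        weight = similarity(e_value)
        if weight == 0.0:
            continue  # truncated to zero: never stored
        for a, b in ((query, subject), (subject, query)):
            row = graph.setdefault(a, {})
            row[b] = max(row.get(b, 0.0), weight)
    return graph
```

Only pairs that pass the cutoff are stored, which is what keeps a 129,677 x 129,677 matrix tractable.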
TribeMCL Clustering Algorithm
Predicted protein sequences from the fully sequenced genomes of Arabidopsis thaliana Columbia (26,207) and Oryza sativa japonica (57,915) were downloaded from TIGR. Populus trichocarpa (45,555) was downloaded from JGI.
All sequences were compared against each other using BLASTp 2.4 with an E-value cutoff of 1x10^-5.
The TribeMCL package was used to predict putative protein families at low, medium, and high stringencies (I = 1.2, 3, 5).
The results are stored at http://www.floralgenome.org/cgi-bin/tribedb/tribe.cgi
TribeMCL Method
Enright, Van Dongen and Ouzounis (2002)
• Similarity is measured by
-log10(BLAST E-value)
• Clustering is done by MCL Method
Suppose S is the similarity matrix.
1. Normalize the rows of S to sum to 1.
2. Raise each entry to the power r > 1 (r is the “stringency”) and renormalize to obtain S(r).
3. Take a “Markov step”: replace S(r) with S(r)’S(r).
4. Iterate to convergence.
MCL Algorithm
van Dongen, 2000
It is very fast because low similarities are truncated to zero and sparse matrix methods can then be used.
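The iteration can be illustrated with a small pure-Python sketch. It follows the usual formulation of MCL rather than the exact step order listed above: the matrix is column-normalized, each round squares the matrix (the Markov step) and then applies the entrywise power r with renormalization. The self-loop weight, the truncation threshold, the fixed iteration count and the way clusters are read off (connected components of the surviving entries) are all assumptions made for illustration. Dense lists are used for readability; a real run on this many proteins would need sparse matrices.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(similarity, r=2.0, iterations=50, prune=1e-9):
    """Run a fixed number of MCL rounds on a square similarity matrix."""
    n = len(similarity)
    m = [list(row) for row in similarity]
    for i in range(n):
        if m[i][i] == 0:
            m[i][i] = max(m[i]) or 1.0  # assumed self-loop weight
    m = normalize_columns(m)
    for _ in range(iterations):
        # Markov step (expansion): square the matrix.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entrywise power r, truncate tiny entries, renormalize.
        m = [[x ** r if x ** r >= prune else 0.0 for x in row] for row in m]
        m = normalize_columns(m)
    return m


def clusters(m, threshold=1e-6):
    """Group nodes linked by any surviving entry (connected components)."""
    n = len(m)
    label = list(range(n))

    def find(x):
        while label[x] != x:
            x = label[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                label[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input: two tight groups joined by one weak link (nodes 2 and 3).
EXAMPLE = [
    [0, 10, 10, 0, 0],
    [10, 0, 10, 0, 0],
    [10, 10, 0, 1, 0],
    [0, 0, 1, 0, 10],
    [0, 0, 0, 10, 0],
]
RESULT = clusters(mcl(EXAMPLE, r=2.0))
```

On this toy matrix the weak link between nodes 2 and 3 is cut, so `RESULT` holds the two groups `[0, 1, 2]` and `[3, 4]`.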
A Heuristic for MCL
We take a random walk on the graph described by the similarity matrix
BUT
After each step we weaken the links between distant nodes and strengthen the links between nearby nodes
Graphic from van Dongen, 2000
[Figure: similarity matrix (similarity values 16, 40, 60) and cluster pattern at convergence as a function of r; panels for r = 2.0, 2.6, 2.8, 2.9]
• Small groups break apart first.
• The pattern is quite robust to changes in the similarity of the green region.
[Figure: similarity matrix (similarity values 16, 40, 50, 60) and cluster pattern at convergence as a function of r; panels for r = 2.0, 2.6, 2.8, 3.1]
• At r=3.6 all units separate.
• The additional similarity indicated by pink has a profound effect.
[Figure: similarity matrix (similarity values 30, 40, 60) and cluster pattern at convergence as a function of r; panels for r = 2.0, 2.6, 2.7, 2.8]
• More strongly connecting the “background” disrupts the pattern until r=2.7, after which we quickly cycle through the pattern (2.9 turns the center group into singletons and 3.0 turns everything into singletons).
[Figure: similarity matrix (similarity values 16, 30, 60) and cluster pattern at convergence as a function of r; panels for r = 2.0, 2.1, 2.3]
• Weakening the within-cluster similarity accelerates the breakdown into singletons.
[Figure: similarity matrix (similarity values 25, 30, 60) and cluster pattern at convergence as a function of r; panels for r = 2.0, 2.3]
• Strengthening the “background” while weakening the within-cluster similarity makes it difficult to pick out the clusters.
Some Summary Statistics for the Clusters

Protein Set                   Number of Proteins   Number of Clusters at r=3   Percent of Singletons
Arabidopsis                   26,207               11,467 (44%)                69%
Arabidopsis + Rice            84,122               28,175 (33%)                68%
Arabidopsis + Rice + Poplar   129,677              35,873 (28%)                67%

% Singletons
Cluster    ATH    Rice   Poplar
ATH        30%    -      -
+Rice      17%    25%    -
+Poplar    12%    24%    15%
Comparing Tribes to Phylogenetic Trees from Sequence Alignment
Tribes for large gene families show some, but not complete, correspondence to inferred phylogenetic relationships. Tribes containing MADS genes, formed at low, medium and high stringencies, are mapped onto a recently published Arabidopsis MADS gene phylogeny (Martinez-Castilla & Alvarez-Buylla 2003).
Comparisons with curated gene families
• Added tribe information to TAIR’s gene families
– www.floralgenome.org/cgi-bin/tair/tair.cgi
– E.g. Cytochrome P450
“Bootstrap” Support for Clusters
To determine the stability of the clusters, we need some type of perturbation of the system. We use the “0.632 jackknife” instead of the bootstrap (as we want a set of unique proteins).
We clustered 100 samples, each a random selection of 63.2% of the proteins.
For each tribe, we count “1” each time all of the tribe’s genes that were selected for the bootstrap sample are clustered together.
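The counting rule can be sketched as below. This is an illustration only: `components` is a toy stand-in for the real clustering step (it simply takes connected components of a hypothetical edge list), and all function names are invented. One assumption goes beyond the slides: a resample in which none of a tribe's genes were selected is not counted as a trial for that tribe.

```python
import random


def components(nodes, edges):
    """Toy clustering: connected components among the sampled nodes."""
    nodes = set(nodes)
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for a, b in edges:
        if a in nodes and b in nodes:
            parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def is_supported(tribe, sample_clusters):
    """True if the sampled members of `tribe` all fall in one cluster."""
    touched = [c for c in sample_clusters if c & tribe]
    return len(touched) == 1


def jackknife_support(proteins, tribes, cluster_fn,
                      n_samples=100, fraction=0.632, seed=0):
    """Count, per tribe, the resamples in which it stays together."""
    rng = random.Random(seed)
    k = round(fraction * len(proteins))
    counts = [0] * len(tribes)
    trials = [0] * len(tribes)
    for _ in range(n_samples):
        sample = rng.sample(sorted(proteins), k)  # without replacement
        sample_clusters = cluster_fn(sample)
        for t, tribe in enumerate(tribes):
            if tribe & set(sample):  # assumption: skip absent tribes
                trials[t] += 1
                counts[t] += is_supported(tribe, sample_clusters)
    return counts, trials


# Toy data: a-b-c form a chain, d-e form a pair.
EDGES = [("a", "b"), ("b", "c"), ("d", "e")]
PROTEINS = ["a", "b", "c", "d", "e"]
TRIBES = [{"a", "b", "c"}, {"d", "e"}]
COUNTS, TRIALS = jackknife_support(
    PROTEINS, TRIBES, lambda s: components(s, EDGES), n_samples=20)
```

In the toy data the chain a-b-c falls apart whenever b is left out of a resample, so its count can drop below its number of trials, while the pair d-e is supported in every resample that contains either member.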
From Tribes to Phylogenetics
• Within each tribe of 3 or more proteins we can do hierarchical clustering using the similarity matrix (Harlow, Gogarten, Ragan, 2004), or form a careful alignment and build a phylogenetic tree.
• We can also form SuperTribes by clustering the tribes. Because we still have a large set of objects to cluster, we continue to use MCL.
• Within a SuperTribe, we can do hierarchical clustering.
• The SuperTribe for the MADS family shown earlier includes all the MADS sequences.
Single Linkage TribeMCL
• Define the distance between tribes as the smallest pairwise E-value.
• Use TribeMCL on the resulting similarity matrix.
• Use hierarchical clustering within supertribes.
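A minimal sketch of the single-linkage rule for tribes: take the smallest E-value over all member pairs and convert it to the same -log10 similarity used for proteins. The function name `tribe_similarity`, the dictionary layout of `e_values`, and the cap applied when the smallest E-value is 0 are assumptions for illustration.

```python
import math


def tribe_similarity(tribe_a, tribe_b, e_values, cap=200.0):
    """Single linkage between two tribes.

    `e_values` maps (protein, protein) pairs to BLAST E-values;
    pairs with no recorded hit are simply absent. Returns 0.0 when
    no member pair has a hit.
    """
    best = None
    for p in tribe_a:
        for q in tribe_b:
            e = e_values.get((p, q), e_values.get((q, p)))
            if e is not None and (best is None or e < best):
                best = e  # keep the smallest pairwise E-value
    if best is None:
        return 0.0
    if best <= 0.0:
        return cap  # assumed cap, since log10(0) is undefined
    return min(-math.log10(best), cap)


# Hypothetical hits between members of three tribes.
E_VALUES = {("a1", "b1"): 1e-8, ("a2", "b2"): 1e-30, ("a1", "c1"): 1e-6}
TRIBE_A, TRIBE_B, TRIBE_C = {"a1", "a2"}, {"b1", "b2"}, {"c1"}
```

The resulting tribe-by-tribe similarities can then be fed to the same clustering step that was used on proteins.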
[Figure: Single Linkage TribeMCL, followed by hierarchical clustering or phylogenetic trees]
Floral Genome Project and Plant Protein Classification
Use of the Global Classification
• Project goal is to understand the evolution of flowers.
• Data have been collected to various degrees of intensity on 15 non-model species across the phylogeny of flowering plants and merged with data from other projects.
• PlantTribes will be used to assist in placing these proteins into families to infer evolutionary relationships.
And many thanks to:
• Kerr Wall – FGP Bioinformatics (PSU)
• Claude dePamphilis – FGP PI (PSU)
• Jim Leebens-Mack – FGP Project Director (PSU)
• Hong Ma – FGP co-PI (PSU)
• Victor Albert – collaborator (U. Oslo)
• Dawn Field – collaborator (Oxford U.)
And FGP collaborators at PSU, UFL and Cornell.
And especially
NSF – Plant Genome Research Program