Bioinformatic Analysis of Protein Families
Daniil G. Naumoff
Laboratory of Bioinformatics
State Institute for Genetics and Selection of Industrial Microorganisms (Gos NII Genetika)
Moscow, Russia
The International Nucleotide Sequence Database Collaboration (INSDC)
• GenBank at NCBI: http://www.ncbi.nlm.nih.gov/Genbank/
• EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk/embl/
• DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/
Corresponding protein databases: GenPept, UniProtKB/TrEMBL, and DDBJ
Curated protein database Swiss-Prot: http://au.expasy.org/sprot/
Three-dimensional (3D) structures of proteins
PDB: http://www.pdb.org/pdb/home/home.do (database)
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/ (classification)
http://www.ebi.ac.uk/embl/Services/DBStats/
http://www.genomesonline.org/gold_statistics.htm
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
Search of homologues
BLOSUM-62 matrix
http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
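As a rough illustration of how a substitution matrix such as BLOSUM-62 is used when comparing homologues, the sketch below scores an ungapped alignment by summing per-position scores. Only a small subset of BLOSUM-62 entries is included; the dictionary name, the function names, and the two example peptides are hypothetical and are not taken from the slides.

```python
# A small subset of BLOSUM-62 scores (the matrix is symmetric).
# Only the residue pairs needed for the example are listed here.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
}


def pair_score(a, b):
    """Return the substitution score for residues a and b (order-independent)."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]  # raises KeyError for pairs outside the subset


def ungapped_score(seq1, seq2):
    """Sum substitution scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# Hypothetical peptides: conservative replacements still score positively.
score = ungapped_score("KIDLV", "RVELI")
print(score)
```

Real searches also use gap penalties and the full 20 x 20 matrix; this sketch only shows the per-position lookup.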
Overprediction is annotation of sequences at a greater level of functional specificity than available evidence supports.
A Protein Family Analysis (http://zbio.net/bio/001/003.html)
- Select a protein
- Determine the domain structure of the selected protein
- Select a domain to be analyzed
- Has the protein domain family been annotated in a database?
- Update the family list or search for homologous domains
- Check each "atypical" sequence (it will probably be edited or removed)
- Preliminary division into subfamilies
- Multiple sequence alignment (consensus?)
- Phylogenetic analysis
- Phylogenetic tree visualization
- Subfamily structure
- Interfamily relationships (superfamilies, clans, etc.)
- 2D and 3D analysis (prediction)
Number of annotated protein domain families
Database | Address | Date | Number of families
HOMSTRAD | http://www-cryst.bioc.cam.ac.uk/homstrad/ | 4 Jan 2010 | 1032
CATH 3.3 | http://www.cathdb.info/ | 4 Jan 2010 | 10019
SCOP 1.75 | http://scop.mrc-lmb.cam.ac.uk/scop/ | June 2009 | 3902
COG | http://www.ncbi.nlm.nih.gov/COG/grace/uni.html | - | 4872
KOG | http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi | - | 4852
Pfam 24.0 | http://pfam.janelia.org/ | October 2009 | 11912
PUMA2 | http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi | - | 8575
InterPro 24.0 | http://www.ebi.ac.uk/interpro/ | 16 December 2009 | 11082
ADDA | http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb | - | 15333
51,778 domain families (+ 158,798 singletons) according to Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res. 2006, 34(3):1066-1080.
13,511 SMOG domains according to Sadreyev & Grishin (BMC Struct Biol, 2006)
ADDA - Automatic Domain Decomposition Algorithm: http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb/form_browse
33,879 domain families (79,965 if redundant sequences were used) according to Heger A, Holm L. Exhaustive enumeration of protein domain families. J Mol Biol. 2003, 328(3):749-767.
Let’s use this protein as a query sequence for BLAST
BLAST results (Descriptions)
E-value < 0.01 or 0.001
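The E-value cutoff can be applied programmatically to BLAST output. The sketch below assumes the hits were saved in the standard 12-column tabular format (BLAST+ `-outfmt 6`), where the subject identifier is column 2 and the E-value is column 11; the function name and the three example hit lines are hypothetical.

```python
def filter_hits(tabular_lines, max_evalue=0.001):
    """Return (subject_id, evalue) pairs for hits with E-value below the cutoff.

    Assumes tab-separated BLAST tabular output with the default 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue < max_evalue:
            kept.append((subject_id, evalue))
    return kept


# Hypothetical hits, for illustration only.
example = [
    "query1\thitA\t62.5\t400\t150\t3\t1\t400\t5\t404\t1e-150\t520",
    "query1\thitB\t28.1\t310\t223\t9\t20\t329\t15\t324\t4e-05\t48.1",
    "query1\thitC\t24.0\t120\t91\t4\t200\t319\t40\t159\t0.35\t30.2",
]

for subject_id, evalue in filter_hits(example, max_evalue=0.001):
    print(subject_id, evalue)
```

Hits that pass the cutoff are still only candidates; each "atypical" sequence should be checked by hand, as the workflow above requires.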
BLAST results (Graphic overview)
Domain I Domain II Domain III
Domain structure of proteins of the GH27 family, according to Naumoff D.G. Phylogenetic analysis of α-galactosidases of the GH27 family. Molecular Biology (Engl Transl), 2004, 38(3):388-399. PDF: http://bioinform.genetika.ru/members/Naumoff/MB2004E.pdf
Domain architectures shown:
- GH27N GH27C
- GH27N
- GH27N GH27C CBM13
- GH27N GH27C CBM6
- GH27N GH27C CBM6 CBM13
- GH27N CBM13 GH27C
- NEW1 GH27N CBM13 GH27C
- NEW1 GH27N GH27C
- NEW2 NEW1 GH27N GH27C
- GH27N GH27C NEW3 NEW2
- GH27N GH27C NEW3
- GH27N GH27C Dockerin
- GH27N GH27C CBM1 CE1
Legend:
- GH27N: N-terminal domain of GH27 family
- GH27C: C-terminal domain of GH27 family
- CE1: CE1 domain of carbohydrate esterases
- CBM1: carbohydrate-binding module CBM1
- CBM6: carbohydrate-binding module CBM6
- CBM13: carbohydrate-binding module CBM13
- Dockerin: Dockerin I domain
- NEW1, NEW2, NEW3: uncharacterized domains (one of them is labeled NPCBM)
Universal Protein Domain Databases: ADDA, InterPro, PUMA2, Pfam, KOG, COG, SCOP, CATH, HOMSTRAD
Databases of individual protein families (http://www.oxfordjournals.org/nar/database/subcat/3/10)
Sequence-Based Classification of the Carbohydrate-Active Enzymes at the CAZy server (www.cazy.org/)
• Glycoside Hydrolases (including transglycosidases) => 118 GH families (14 clans)
• Glycosyltransferases => 92 GT families
• Polysaccharide Lyases => 21 PL families
• Carbohydrate Esterases => 16 CE families
• Carbohydrate-Binding Modules => 59 CBM families
Family GH72 of Glycoside Hydrolases (http://www.cazy.org/GH72.html)
Multiple Sequence Alignment:
– Automatic (ClustalW or ClustalX)
  >50% of sequence identity
  only one domain
  no protein fragments
– Manual (BioEdit) (take into account BLAST pairwise sequence alignment!)
  <30% of sequence identity
  long insertions / deletions
  facultative N-terminal part
Local dissimilarities of very similar sequences:
– Local frameshift
– Exon-intron structure
– Stop codon
BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html)
Phylip (http://evolution.gs.washington.edu/phylip.html)
Maximum Parsimony (ProtPars)
Distance program (Neighbor-Joining)
An infile for the Phylip package programs
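A minimal sketch of how such an infile could be written, assuming the classic sequential PHYLIP layout: a header line with the number of sequences and the number of aligned positions, then one line per sequence whose name occupies exactly 10 characters. The function name is hypothetical, and the short aligned fragments are invented for illustration (only the sequence labels come from the GH97 trees).

```python
def to_phylip(alignment):
    """Format an alignment (dict: name -> aligned sequence) as sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    # Header: number of sequences, then number of aligned positions.
    lines = [f"{len(alignment):5d} {n_sites:5d}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # Names are padded with blanks to exactly 10 characters.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


# Invented aligned fragments; gaps are written as '-'.
alignment = {
    "97A1_BACTH": "MKTLLAGW-DEV",
    "97B1_PRERU": "MKSLIAGWQDEI",
    "97C1_LEIXY": "MRTLL-GWQEEV",
}

text = to_phylip(alignment)
print(text)

# The text would then be saved under the name the programs look for, e.g.:
# with open("infile", "w") as handle:
#     handle.write(text)
```

Names longer than 10 characters are rejected rather than truncated, because truncation can silently produce duplicate labels in the resulting tree.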
Maximum Parsimony (protpars.exe from the Phylip package)
Phylogenetic tree visualization: TreeView program (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
Tree styles: slanted cladogram, radial, rectangular cladogram, phylogram
Subfamily criteria (for glycosidases)
1. Pairwise sequence similarity (>30% of identity)
2. Order of sequence appearance during BLAST search (members of the same subfamily always appear at the top of BLAST results)
3. Monophyletic status
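Criterion 1 can be checked directly from a pairwise alignment. The sketch below assumes one common definition of identity (identical residues divided by the number of columns in which neither sequence has a gap); other denominators are also in use, so the exact percentage depends on that choice. The function name and the two aligned fragments are hypothetical. Criteria 2 and 3 need the BLAST hit order and the phylogenetic tree, and are not covered here.

```python
def percent_identity(aligned1, aligned2, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned1, aligned2):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Invented aligned fragments, for illustration only.
identity = percent_identity("MKV-LAGT", "MRVQLA-T")
meets_criterion_1 = identity > 30.0  # threshold from criterion 1
print(round(identity, 1), meets_criterion_1)
```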
The maximum parsimony phylogenetic tree of family GH97
[Figure: tree with node support values (maximum 1000) and five labeled groups: Subfamily 97a, Subfamily 97b, Subfamily 97c, Subfamily 97d, and Subfamily 97e. α-Glucosidase activity (EC 3.2.1.20) is marked on the tree.]
The neighbor-joining phylogenetic tree of family GH97
[Figure: tree with node support values and five labeled groups; EC 3.2.1.20 is marked on the tree. Sequences in tree order:]
Subfamily 97e: 97E1_RHOBA, 97E1_BACTH
Subfamily 97c: 97C1_LEIXY, 97C1_PRERU, 97C2_BACTH, 97C1_MICDE, 97C2_MICDE, 97C1_BACTH, 97C2_PRERU, 97C3_PRERU
Subfamily 97d: 97D1_CAUCR, 97D1_XANCA, 97D1_XANAX
Subfamily 97b: 97B1_MICDE, 97B1_BACTH, 97B4_BACTH, 97B1_PRERU, 97B2_PRERU, 97B1_BACFR, 97B3_BACTH, 97B2_BACFR, 97B2_BACTH
Subfamily 97a: 97A1_HALMA, 97A1_PRERU, 97A1_PREIN, 97A1_TANFO, 97A1_BACTH, 97A1_BACFR, 97A1_UNBAC, 97A2_BACTH, 97A1_SALRU, 97A2_BACFR, 97A3_BACTH, 97A1_AZOVI, 97A8_ENSEQ, 97A5_ENSEQ, 97A4_ENSEQ, 97A3_ENSEQ, 97A7_ENSEQ, 97A6_ENSEQ, 97A1_ERYLI, 97A1_NOVAR, 97A1_XANCA, 97A1_XANAX, 97A1_MICDE, 97A1_SHEON, 97A2_ENSEQ, 97A1_ENSEQ
The neighbor-joining phylogenetic tree of the α-galactosidase superfamily
[Figure: tree with node support values and labeled clusters GH27, GH31, GH36A, GH36B, GH36C, and GH36D.]
Families of the α-galactosidase superfamily and family GH97
[Table: families GH27, GH31, GH36A, GH36B, GH36C, GH36D, and GH97 compared by clan, known enzymatic activities, molecular mechanism, COG/KOG, and origin.]
Clan: GH-D for GH27, GH31, GH36A, GH36B, GH36C, and GH36D; none for GH97
Known enzymatic activities (all entries): EC 2.4.1.x, EC 2.4.1.67, EC 2.4.1.82, EC 3.2.1.10, EC 3.2.1.20, EC 3.2.1.22, EC 3.2.1.48, EC 3.2.1.49, EC 3.2.1.84, EC 3.2.1.88, EC 3.2.1.94, EC 4.2.2.13
Molecular mechanism (entries in row order): Retaining, Retaining, Retaining, Not known, Not known, Not known; one further entry reads "Retaining, Inverting"
COG/KOG (all entries): COG1501, KOG1065, COG3345, KOG2366, None
Origin (all taxa listed):
- Eukaryota: Alveolata, Entamoebidae, Euglenozoa, Fungi, Metazoa, Mycetozoa, Rhodophyta, Viridiplantae
- Eubacteria: Acidobacteria, Actinobacteria, Bacteroidetes, Cyanobacteria, Deinococcus, Fibrobacteres, Firmicutes, Planctomycetes, Proteobacteria, Spirochaetes, Thermotogales, Thermus, Verrucomicrobia
- Archaea: Crenarchaeota, Euryarchaeota
Clans of Glycoside Hydrolases
Clan | Families (GH) | Optical Configuration | Tertiary Structure
GH-A | 1, 2, 5, 10, 17, 26, 30, 35, 39, 42, 50, 51, 53, 59, 72, 79, 86, 113 | retention (equatorial orientation) | (β/α)8-barrel
GH-B | 7, 16 | retention (equatorial orientation) | β-jelly roll
GH-C | 11, 12 | retention (equatorial orientation) | β-jelly roll
GH-D | 27, 31, 36 | retention (axial orientation) | (β/α)8-barrel
GH-E | 33, 34, 83, 93 | retention (equatorial orientation) | 6-fold β-propeller
GH-F | 43, 62 | inversion (equatorial orientation) | 5-fold β-propeller
GH-G | 37, 63 | inversion (axial orientation) | (α/α)6
GH-H | 13, 70, 77 | retention (axial orientation) | (β/α)8-barrel
GH-I | 24, 46, 80 | inversion (equatorial orientation) | α+β
GH-J | 32, 68 | retention (β-furanoside) | 5-fold β-propeller
GH-K | 18, 20, 85 | retention (equatorial orientation) | (β/α)8-barrel
GH-L | 15, 65 | inversion (axial orientation) | (α/α)6
GH-M | 8, 48 | inversion (equatorial orientation) | (α/α)6
GH-N | 28, 49 | inversion (axial orientation) | (β)3-solenoid
Rigden DJ. Iterative database searches demonstrate that glycoside hydrolase families 27, 31, 36, and 66 share a common evolutionary origin with family 13. FEBS Lett. 2002, 523(1-3):17-22.
Clans: GH-D, GH-H
Nagano N, Porter CT, Thornton JM. The (β/α)8 glycosidases: sequence and structure analyses suggest distant evolutionary relationships. Protein Eng. 2001, 14(11):845-855.
Clans: GH-H, GH-A, GH-K, ?
Screenshot of PSI Protein Classifier
Naumoff DG, Carreras M. PSI Protein Classifier: a new program automating PSI-BLAST search results. Molecular Biology (Engl Transl), 2009, 43(4):652-664.
A hierarchical classification of the (β/α)8-type glycosyl hydrolases
A hierarchical structure of the β-fructosidase (furanosidase) superfamily
furanosidase superfamily
  clan GH-J
    GH32: GH32a, GH32b, GH32c, GH32d
    GH68: GH68a, GH68b
  clan GH-F
    GH43: GH43a, GH43b, GH43c, GH43d, GH43e, GH43f, GH43g
    GH62
  GHLP
The Secondary Structure Prediction
– 3D-PSSM (http://www.sbg.bio.ic.ac.uk/~3dpssm/)
– GOR IV (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html)
– nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html)
– PredictProtein (http://www.embl-heidelberg.de/predictprotein/predictprotein.html)
– Hydrophobic cluster analysis (HCA)
The Tertiary Structure Prediction
– The SWISS-MODEL modeling server (http://swissmodel.expasy.org/)
Phylogenetic Analysis of a Protein Family
– The first stage of a study
  Prediction of the 3D structure and domain structure of the protein
  Prediction of the active center and of residues for site-directed mutagenesis
  Prediction of the enzymatic activities
– The only part of a study (bioinformatics)
– The final stage of a study (interpretation of the experimental results)
Comparing the phylogenetic trees built for each domain of a protein makes it possible to reveal the evolutionary history of that protein, namely the roles of gene duplication, loss, fusion, and horizontal transfer.