Databases of homologous gene families: new developments and web interfaces. Equipe Bioinformatique et Génomique Evolutive Laboratoire de Biométrie et Biologie Evolutive Université Claude Bernard - Lyon 1 Simon Penel, Julien Grassot, Laurent Duret, Manolo Gouy, Guy Perrière. Pôle Bio-Informatique Lyonnais

Page 1: Databases of homologous gene families: new developments and web interfaces. Equipe Bioinformatique et Génomique Evolutive Laboratoire de Biométrie et Biologie.

Databases of homologous gene families: new developments and web interfaces.

Equipe Bioinformatique et Génomique Evolutive

Laboratoire de Biométrie et Biologie Evolutive

Université Claude Bernard - Lyon 1

Simon Penel, Julien Grassot, Laurent Duret, Manolo Gouy, Guy Perrière.

Pôle Bio-Informatique Lyonnais

Page 2:

Homologous Genes Databases

Research fields:
• Proteome/genome comparative analysis
• Phylogenetic studies
• Orthology/paralogy relationship assignments
• Development of generalist and specialised databases
– HOVERGEN: families of homologous vertebrate genes
– HOBACGEN: families of homologous bacterial genes
– NureBase, RTKdb, Hoppsigen, Mitalib, ...

Identification of important regions in genomic sequences
Evolution at the molecular level
Species phylogeny
Function prediction

Page 3:

The HoGenom database: Homologous Genes Families of fully Sequenced Organisms

European project TEMBLOR

• Extension of HOVERGEN and HOBACGEN to all organisms for which the complete genome sequence has been determined
• Structured under the ACNUC (M. Gouy) retrieval system: flat files & index files
• Integrates:
– Protein multiple alignments
– Phylogenetic trees
– Taxonomic data
– Nucleic and protein sequences
– Sequence annotations

Page 4:

Building of HoGenom

• Selection of the protein sequences of fully sequenced organisms on the EBI proteome site
• Sequence comparison with BLAST on the whole sequence dataset
• Clustering of the sequences into gene families on the basis of sequence similarity (transitive association)
• Addition of the gene family information to the protein sequence annotations (ACNUC protein database)
• EMBL cross-reference calculations and nucleotide sequence selection
• Addition of the gene family information to the EMBL/GenBank nucleotide annotations (ACNUC nucleotide database)

For each family:
– Protein alignments
– Phylogenetic trees
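The clustering step by transitive association can be sketched as single-linkage grouping: two sequences end up in the same family whenever a chain of pairwise similarity hits connects them. The sketch below is only an illustration under assumptions. The slide does not give the similarity criteria, so the input is assumed to be a list of pairs that already passed whatever BLAST thresholds are used, and the function name and sequence identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_by_transitive_association(sequence_ids, similar_pairs):
    """Group sequences into families by single linkage (union-find).

    similar_pairs is assumed to hold only pairs that already passed the
    similarity criteria; those criteria are not specified in the slides.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Walk up to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    families = defaultdict(list)
    for s in sequence_ids:
        families[find(s)].append(s)
    return sorted(sorted(members) for members in families.values())


# Hypothetical identifiers and hits, for illustration only.
ids = ["seqA", "seqB", "seqC", "seqD", "seqE"]
hits = [("seqA", "seqB"), ("seqB", "seqC")]
families = cluster_by_transitive_association(ids, hits)
print(families)
```

Here seqA and seqC share a family although no direct hit links them, which is what transitive association implies; seqD and seqE, with no hits, remain orphan sequences.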

Page 5:

Hogenprot: Q9DCD0

ID   Q9DCD0     PRELIMINARY;      PRT;   483 AA.
AC   Q9DCD0;
DT   01-JUN-2001 (TrEMBLrel. 17, Created)
DT   01-JUN-2001 (TrEMBLrel. 17, Last sequence update)
DT   01-MAR-2002 (TrEMBLrel. 20, Last annotation update)
DE   0610042A05RIK PROTEIN.
GN   0610042A05RIK.
OS   Mus musculus (Mouse).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX   NCBI_TaxID=10090
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=C57BL/6J; TISSUE=KIDNEY;
RX   MEDLINE=21085660; PubMed=11217851;
RA   Kawai J., Shinagawa A., Shibata K., Yoshino M., Itoh M., Ishii Y., ----
RA   Hayashizaki Y.;
RT   "Functional annotation of a full-length mouse cDNA collection.";
RL   Nature 409:685-690(2001).
CC   -!- CATALYTIC ACTIVITY: 6-PHOSPHO-D-GLUCONATE + NADP(+) = D-RIBULOSE
CC       5-PHOSPHATE + CO(2) + NADPH.
CC   -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.
CC   -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC       FAMILY.
CC   -!- GENE_FAMILY: HBG000005 [ FAMILY / ALN / TREE ]
DR   EMBL; AK002894; BAB22439.1; -.
DR   HSSP; P00349; 2PGD.
DR   MGD; MGI:1914101; 0610042A05Rik.
DR   InterPro; IPR001744; 6PGD.
DR   Pfam; PF00393; 6PGD; 1.
DR   PRINTS; PR00076; 6PGDHDRGNASE.
DR   PROSITE; PS00461; 6PGD; 1.
DR   PRODOM; Q9DCD0.
DR   SWISS-2DPAGE; Q9DCD0.
KW   NADP; Oxidoreductase; Pentose shunt.
FT   DOMAIN        5     60       PRODOM:2001.3:PD001594 134
FT   DOMAIN       63    296       PRODOM:2001.3:PD001025 91
FT   DOMAIN      316    469       PRODOM:2001.3:PD001549 79
SQ   SEQUENCE   483 AA;  53247 MW;  CD0A3F72EEC2831E CRC64;

Protein sequence annotations
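Since the family identifier is stored as a GENE_FAMILY comment in the CC lines of the entry, it can be recovered with a simple text scan. The following is a minimal sketch, assuming the entry is available as plain text with one line code per line as in the SWISS-PROT flat-file layout; the function name is hypothetical and this is not the ACNUC retrieval mechanism itself.

```python
import re

# A few lines of the Q9DCD0 entry from the slide, one line code per line.
ENTRY = """\
ID   Q9DCD0     PRELIMINARY;      PRT;   483 AA.
AC   Q9DCD0;
CC   -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.
CC   -!- GENE_FAMILY: HBG000005 [ FAMILY / ALN / TREE ]
DR   EMBL; AK002894; BAB22439.1; -.
"""


def gene_family(entry_text):
    """Return the family identifier found in the CC lines, or None."""
    for line in entry_text.splitlines():
        if line.startswith("CC"):
            match = re.search(r"GENE_FAMILY:\s*(\S+)", line)
            if match:
                return match.group(1)
    return None


family_id = gene_family(ENTRY)
print(family_id)
```

Restricting the search to CC lines avoids picking up the same string if it ever appeared in another field of the entry.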

Page 6:

Hogennucl: AK002894.PE1

AK002894.PE1         Location/Qualifiers
FT   CDS_pept        76..1527
FT                   /codon_start=1
FT                   /db_xref="MGD:MGI:1914101"
FT                   /db_xref="SWISS-PROT:Q9DCD0"
FT                   /note="data source:SPTR, source key:P52209, evidence:ISS"
FT                   /note="homolog to 6-PHOSPHOGLUCONATE DEHYDROGENASE,
FT                   DECARBOXYLATING (EC 1.1.1.44)"
FT                   /note="putative"
FT                   /transl_table=1
FT                   /gene_family="HBG000005"
FT                   /protein_id="BAB22439.1"
FT                   /translation="MAQADIALIGLAVMGQNLILNMNDHGFVVCAFNRTVSKVDDFLAN
FT                   EAKGTKVVGAQSLKDMVSKLKKPRRVILLVKAGQAVDDFIEKLVPLLDTGDIIIDGGNS
FT                   EYRDTTRRCRDLKAKGILFVGSGVSGGEEGARYGPSLMPGGNKEAWPHIKAIFQAIAAK
FT                   VGTGEPCCDWVGDEGAGHFVKMVHNGIEYGDMQLICEAYHLMKDVLGMRHEEMAQAFEE
FT                   WNKTELDSFLIEITANILKYRDTDGKELLPKIRDSAGQKGTGKWTAISALEYGMPVTLI
FT                   GEAVFARCLSSLKEERVQASQKLKGPKVVQLEGSKKSFLEDIRKALYASKIISYAQGFM
FT                   LLRQAATEFGWTLNYGGIALMWRGGCIIRSVFLGKIKDAFERNPELQNLLLDDFFKSAV
FT                   DNCQDSWRRVISTGVQAGIPMPCFTTALSFYDGYRHEMLPANLIQAQRDYFGAHTYELL
FT                   TKPGEFIHTNWTGHGGSVSSSSYNA"
     atggcccaag ctgacattgc actgatcgga ctggctgtca tgggccagaa cttaattttg        60
     aacatgaatg atcatggatt tgtggtctgt gctttcaata ggacagtctc caaagtcgat       120

….

     ccctgcttca ctactgccct ctccttctat gatgggtaca gacacgagat gctgccagca      1320
     aacctcatcc aggctcaacg ggattacttt ggggctcaca cctatgaact cttaaccaaa      1380
     ccgggagaat ttatccacac caactggacg ggccacgggg gcagtgtgtc atcctcttca      1440
     tacaatgcct ag                                                          1452
//

Nucleotide sequence annotations
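The nucleotide and protein entries can be checked against each other with a little arithmetic: the CDS location 76..1527 spans 1452 nucleotides, that is 484 codons, which matches the 483 amino acids of Q9DCD0 plus one stop codon (the sequence ends in "tag"). A small sketch of that check, with a hypothetical helper that only handles simple "start..end" locations:

```python
def cds_length(location):
    """Length in nucleotides of a simple 'start..end' location.

    Complemented or joined locations are not handled in this sketch.
    """
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1


nucleotides = cds_length("76..1527")        # location from the CDS_pept feature
codons, remainder = divmod(nucleotides, 3)
protein_length = 483                        # from the ID line of Q9DCD0

print(nucleotides, codons, remainder)
print(codons == protein_length + 1)         # one extra codon for the stop
```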

Page 7:

HoGenom ACNUC contents

8th September 2003

HoGenom proteins: 423,577 sequences

HoGenom nucleotide sequences: 448,582 CDS

117 fully sequenced organisms

Data source

Protein data from EBI: non-redundant complete proteome sets (SWISS-PROT, TrEMBL, TrEMBLnew), http://www.ebi.ac.uk/proteome, June 2003

Genomic data from EMBL, June 2003

Page 8:

117 organisms

423 577 protein sequences

[Figure: pie charts; values shown: 10, 16 and 91 organisms; 31%, 9% and 60%]

• Arabidopsis thaliana (plant)
• Caenorhabditis elegans (nematode)
• Drosophila melanogaster (fly)
• Encephalitozoon cuniculi (microsporidia)
• Guillardia theta (alga)
• Homo sapiens (man)
• Mus musculus (mouse)
• Rattus norvegicus (rat)
• Saccharomyces cerevisiae (yeast)
• Schizosaccharomyces pombe (fungus)

Page 9:

41 907 families

423 577 protein sequences

Sequences belonging to a family: 305 514 (72%)

Orphan sequences: 115 373 (27%)

Page 10:

Access to HoGenom is available at the PBIL: http://pbil.univ-lyon1.fr/

Web page of HoGenom : http://pbil.univ-lyon1.fr/databases/hogenom.html

Page 11:

Databases Access on the Web

Two main WWW interfaces

• WWW Query
– Multiple query on sequences (Guy Perrière)
– Multiple query on families
– http://pbil.univ-lyon1.fr/search/query_fam.php

• Cross Taxa
– Search of families according to complex taxonomic criteria
– Selection of families
– http://pbil.univ-lyon1.fr/search/cross_fam.php

Page 12:

Cross Taxa: Selection of Gene Families

Example: selecting families of animal-specific genes

A list of families
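The animal-specific example amounts to a set condition on the taxa represented in each family: some taxa must be present and others absent. The sketch below illustrates that idea only. It is not how Cross Taxa is implemented, and the family identifiers, taxon names and function name are all hypothetical.

```python
def select_families(family_taxa, required, excluded=frozenset()):
    """Return the families containing every required taxon and no excluded one.

    family_taxa maps a family identifier to the set of taxa represented
    in that family (a hypothetical, simplified data model).
    """
    selected = []
    for family_id, taxa in family_taxa.items():
        if required <= taxa and not (excluded & taxa):
            selected.append(family_id)
    return sorted(selected)


# Hypothetical families, for illustration only.
family_taxa = {
    "FAM1": {"Metazoa"},
    "FAM2": {"Metazoa", "Fungi"},
    "FAM3": {"Bacteria"},
    "FAM4": {"Metazoa", "Bacteria"},
}

animal_specific = select_families(
    family_taxa,
    required={"Metazoa"},
    excluded={"Fungi", "Viridiplantae", "Bacteria", "Archaea"},
)
print(animal_specific)
```

Only FAM1 is kept: FAM2 and FAM4 contain animal sequences but also sequences from excluded taxa, and FAM3 contains no animal sequence at all.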

Page 13:

√ A list of families

Page 14:

Display family

Page 15:

Family Page

Page 16:

Page 17:

Application to other databases

Any sequence database can be structured under ACNUC and queried with WWW-Query

Currently available:
• SWISS-PROT
• EMBL
• GenBank
• etc.

Any family database can be structured under ACNUC and queried with WWW-Query and Cross-Taxa

For example, an ACNUC version of the HAMAP database developed by SWISS-PROT is currently available at the PBIL

Page 18:

Example: sequence Q8ZY16 in NiceProt, with cross-references to HAMAP-ACNUC and HOBACGEN

Cross-references with external databases

1 sequence → associated family

Display the family, alignment and phylogenetic tree associated with a sequence accession number via a URL link:

http://pbil.univ-lyon1.fr/cgi-bin/acnuc-link-ac2fam?db=HAMAPprot&query=Q8ZY16
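An external database that wants to point at a family page only needs to fill in the two query parameters visible in the link above, db and query. A minimal sketch of building such a link follows; the function name is hypothetical, and HAMAPprot is the only database value confirmed by the slide.

```python
from urllib.parse import urlencode

# Script address and parameter names are taken from the example link.
BASE_URL = "http://pbil.univ-lyon1.fr/cgi-bin/acnuc-link-ac2fam"


def family_link(db, accession):
    """Build the cross-reference URL for a sequence accession number."""
    return BASE_URL + "?" + urlencode({"db": db, "query": accession})


link = family_link("HAMAPprot", "Q8ZY16")
print(link)
```

Using urlencode rather than plain string concatenation keeps the link valid if an identifier ever contains characters that need escaping.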

Page 19:

Acknowledgements

People from BBE:
• Laurent Duret
• Manolo Gouy
• Julien Grassot
• Simon Penel
• Guy Perrière

SWISS-PROT group:
• Alexandre Gattiker

This project is supported by:
• the European Commission (TEMBLOR)
• the Rhône-Alpes region (Projet Thématiques Prioritaires)