Bioinformatics Data Analysis: InterPro

EBI is an Outstation of the European Molecular Biology Laboratory. 9 April 2008 InterPro data pipelines Antony Quinn

description

Overview of data analysis in the European Bioinformatics Institute's InterPro database [http://www.ebi.ac.uk/interpro]

Transcript of Bioinformatics Data Analysis: InterPro

Page 1: Bioinformatics Data Analysis: InterPro

EBI is an Outstation of the European Molecular Biology Laboratory. 

9 April 2008

InterPro data pipelines

Antony Quinn

Page 2: Bioinformatics Data Analysis: InterPro

Outline

• Onion
  • Main InterPro pipeline: predict protein families, domains, sites
  • A lot

• CluSTr
  • Automatic classification of proteins based on sequence similarity
  • A little

InterPro data pipelines: What’s in it for me? 9 April 2008

Page 3: Bioinformatics Data Analysis: InterPro

Not this...

Page 4: Bioinformatics Data Analysis: InterPro

Mission: to explore strange new proteins…

Onion + Protein Sequence = Prediction of Functional Annotation

InterPro

Protein families, domains, repeats and sites

Page 5: Bioinformatics Data Analysis: InterPro

Requirements

• Handle all member databases and algorithms
  • HMMER (e.g. Gene3D, PANTHER)
  • Regular expressions (PROSITE)
  • SignalP
  • TMHMM
  • BLAST (PIRSF)
  • FingerPRINTScan (PRINTS)

• Fast
• Wide and deep coverage
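The "Regular expressions (PROSITE)" requirement can be illustrated with a small sketch. It translates a PROSITE pattern into a Python regular expression and scans a sequence with it. This is not Onion's code: the function name `prosite_to_regex` and the test sequence are made up for illustration, and the translation covers only the common pattern elements (it does not handle anchors inside brackets such as `[G>]`).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression.

    Covers the common elements: x (any residue), [..] (any of),
    {..} (none of), (n) / (n,m) repeats, and < / > terminal anchors.
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        # {P} means "anything except P": becomes [^P]
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "2 to 4 residues": becomes x{2,4}
        element = element.replace("(", "{").replace(")", "}")
        # Lower-case x is the wildcard; residues are upper-case letters
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# PS00001, the N-glycosylation site pattern
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)  # N[^P][ST][^P]

# Hypothetical sequence, used only to show the scan
hits = [m.start() for m in re.finditer(regex, "MKNGSAANPSA")]
print(hits)  # [2]
```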

Page 6: Bioinformatics Data Analysis: InterPro

Design

• UniParc
  • Solves mapping problem
  • Sequential IDs
  • Comprehensive – many DBs, all sequences

• Method Archive
  • Minimise calculations
  • Read flat files once

• Decoupled analysis and post-processing

Page 7: Bioinformatics Data Analysis: InterPro

The Trinity

Onion

UniParc

Method Archive

Member database methods

Protein sequences

Page 8: Bioinformatics Data Analysis: InterPro

New sequences

UniParc

Onion

New sequences

Run against all methods

UniParc

Method Archive

Page 9: Bioinformatics Data Analysis: InterPro

Member database release

Onion

UniParc

Method archive

Method Archive

Run new and changed methods against all sequences

Advantages:

• If only post-processing or cut-off changed – only run that part

• No change – no need to rerun

Methods added, changed or deleted
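The decision made at a member database release can be sketched as follows. This is an illustration of the idea on this slide, not the Method Archive's actual logic: the `MethodVersion` class, its two checksum fields and the action names are hypothetical, and comparing checksums is just one way to detect a change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MethodVersion:
    # Hypothetical fields: a digest of the model itself, and a digest of
    # its cut-offs and post-processing settings.
    model_checksum: str
    postproc_checksum: str


def plan_update(old: Optional[MethodVersion], new: Optional[MethodVersion]) -> str:
    """Decide how much work a method needs after a member database release."""
    if new is None:
        return "delete_results"            # method deleted
    if old is None or old.model_checksum != new.model_checksum:
        return "rerun_analysis"            # method added or model changed
    if old.postproc_checksum != new.postproc_checksum:
        return "rerun_postprocessing"      # only cut-off or post-processing changed
    return "nothing"                       # no change, no need to rerun


previous = MethodVersion("abc", "cutoff-v1")
print(plan_update(previous, MethodVersion("abc", "cutoff-v2")))  # rerun_postprocessing
```

Keeping analysis and post-processing as separate steps is what makes the third branch possible: the raw results are reused and only the cheaper step runs again.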

Page 10: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

UniParc

HMM flat file

Profile flat file

FPrint flat file

New release of model database – search new and changed models against all of UniParc

anthill

Page 11: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

FASTA file

1000s of model files

anthill

UniParc

HMM flat file

Profile flat file

FPrint flat file

Page 12: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

FASTA file

1000s of model files

anthill

bsub submission cmds

UniParc

HMM flat file

Profile flat file

FPrint flat file

Page 13: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

FASTA file

1000s of model files

LSF

anthill

bsub submission cmds

UniParc

HMM flat file

Profile flat file

FPrint flat file

Page 14: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

FASTA file

1000s of model files

LSF

anthill

bsub submission cmds

output files (raw results)

SQL*Loader file

Parse, reformat

UniParc

HMM flat file

Profile flat file

FPrint flat file

Page 15: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

FASTA file

1000s of model files

LSF

anthill

bsub submission cmds

output files (raw results)

SQL*Loader file

Parse, reformat

Load

ONION

Raw results table

UniParc

HMM flat file

Profile flat file

FPrint flat file

Page 16: Bioinformatics Data Analysis: InterPro

“Deluge” mode (manual)

FASTA file

1000s of model files

LSF

anthill

bsub submission cmds

output files (raw results)

SQL*Loader file

Parse, reformat

Load

ONION

Raw results table

post-processing

Final results table

UniParc

HMM flat file

Profile flat file

FPrint flat file
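The "bsub submission cmds" step of the diagram can be sketched as one LSF job per model file, each searching the FASTA file exported from UniParc. This is a minimal sketch under stated assumptions, not the commands Onion generates: `hmmsearch` stands in for whichever search program a member database needs, and the file and directory names are hypothetical. `-J` (job name) and `-o` (output file) are standard `bsub` options.

```python
import shlex
from pathlib import PurePosixPath


def bsub_commands(model_files, fasta_file, out_dir):
    """Build one bsub command line per model file.

    Each job searches a single model against the whole FASTA file and
    writes its raw results to its own output file.
    """
    commands = []
    for model in model_files:
        name = PurePosixPath(model).stem                  # e.g. PF00001
        out_file = str(PurePosixPath(out_dir) / f"{name}.out")
        search = ["hmmsearch", str(model), str(fasta_file)]  # assumed search program
        commands.append(shlex.join(["bsub", "-J", name, "-o", out_file] + search))
    return commands


# Hypothetical file names, for illustration only
for cmd in bsub_commands(["models/PF00001.hmm", "models/PF00002.hmm"],
                         "uniparc.fasta", "results"):
    print(cmd)
```

Splitting by model file keeps every job independent, so LSF can schedule thousands of them in parallel and a failed job can be resubmitted on its own.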

Page 17: Bioinformatics Data Analysis: InterPro

“Drip” mode (automatic)

UniParc

New sequences – search all models every 4 minutes

anthill

extract new sequences

HMM flat file

Profile flat file

FPrint flat file

Page 18: Bioinformatics Data Analysis: InterPro

“Drip” mode (automatic)

UniParc

LSF

anthill

bsub submission cmds

output files (raw results)

extract new sequences

HMM flat file

Profile flat file

FPrint flat file

Page 19: Bioinformatics Data Analysis: InterPro

“Drip” mode (automatic)

UniParc

LSF

anthill

bsub submission cmds

output files (raw results)

Parse, reformat and load

extract new sequences

ONION

Raw results table

post-processing

Final results table

HMM flat file

Profile flat file

FPrint flat file
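The "extract new sequences" step benefits from UniParc's sequential IDs: everything above the last ID already analysed is new. The sketch below shows that idea with plain integers as IDs. It is an assumption-laden illustration, not the production query: the function names, the in-memory list standing in for UniParc, and the integer ID format are all hypothetical.

```python
def extract_new_sequences(sequences, last_seen_id):
    """Return sequences with an ID above last_seen_id, plus the new high-water mark.

    `sequences` is an iterable of (numeric_id, sequence) pairs; in a real
    pipeline this would be a database query rather than a list.
    """
    new = sorted((seq_id, seq) for seq_id, seq in sequences if seq_id > last_seen_id)
    high_water_mark = new[-1][0] if new else last_seen_id
    return new, high_water_mark


def to_fasta(records):
    """Format (id, sequence) pairs as FASTA text for the search programs."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in records)


# Hypothetical data: sequences 1 and 2 were analysed in an earlier cycle
store = [(1, "MKT"), (2, "MAA"), (3, "MLS"), (4, "MGG")]
new, mark = extract_new_sequences(store, last_seen_id=2)
print(to_fasta(new), end="")
print(mark)  # 4
```

Running this on a timer and persisting the high-water mark between runs gives the automatic behaviour described on the slide: each cycle only touches sequences that arrived since the previous one.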

Page 20: Bioinformatics Data Analysis: InterPro

pirsf

pantherScore assignment

HMMER

Pfam TIGRFAM SMART SUPERFAMILY GENE3D PIRSF PANTHER

GA cut-off

TC cut-off

E-value cut-off

E-value cut-off

AM filter

clan

nested

threshold

(kinase)

domainFinder

sequence

Oracle (raw data)

Oracle (refined data)

The refinery
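The refinery turns raw hits into refined hits by applying each member database's own rules. The simplest of those rules, a score or E-value cut-off, can be sketched as below. This is a generic illustration only: the `RawHit` and `CutOff` classes, the method names and the numbers are hypothetical, and the filters specific to individual databases (clan, nested, domainFinder and so on) are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class RawHit:
    method: str      # model accession
    protein: str     # sequence identifier
    score: float     # bit score reported by the search program
    evalue: float    # E-value reported by the search program


@dataclass(frozen=True)
class CutOff:
    # Either limit may be absent; a hit must pass every limit that is set.
    min_score: Optional[float] = None
    max_evalue: Optional[float] = None


def refine(raw_hits: Iterable[RawHit], cutoffs: Dict[str, CutOff]) -> List[RawHit]:
    """Keep only the raw hits that pass their method's cut-off."""
    kept = []
    for hit in raw_hits:
        cutoff = cutoffs.get(hit.method)
        if cutoff is None:
            continue  # no rule known for this method: do not promote the hit
        if cutoff.min_score is not None and hit.score < cutoff.min_score:
            continue
        if cutoff.max_evalue is not None and hit.evalue > cutoff.max_evalue:
            continue
        kept.append(hit)
    return kept


# Hypothetical methods, proteins and thresholds
cutoffs = {"M1": CutOff(min_score=25.0), "M2": CutOff(max_evalue=1e-3)}
raw = [
    RawHit("M1", "P1", 30.0, 1e-8),
    RawHit("M1", "P2", 20.0, 1e-2),
    RawHit("M2", "P1", 12.0, 1e-5),
    RawHit("M2", "P3", 15.0, 0.5),
]
print([(h.method, h.protein) for h in refine(raw, cutoffs)])  # [('M1', 'P1'), ('M2', 'P1')]
```

Because the raw hits stay in their own table, changing a cut-off only means running a filter like this again, not repeating the searches.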

Page 21: Bioinformatics Data Analysis: InterPro

Onion vs InterProScan

• Similarities
  • Software: HMMER, TMHMM, SignalP
  • Models: Pfam, Gene3D, PRINTS, etc.

• Differences
  • Internal use only
  • Decoupled analysis and post-processing
  • Java + database
  • Faster

Page 22: Bioinformatics Data Analysis: InterPro

Limitations

• Database design
  • Inflexible – single member DB version
  • Redundant

• Tight coupling
  • Internal
    • Difficult to test/debug
  • External
    • Oracle
    • LSF
    • File system

Page 23: Bioinformatics Data Analysis: InterPro

Plans

• Merge InterProScan
  • Single code base = reduced maintenance cost
  • Java (Java 5? Spring? Maven?)
  • Database (Oracle, Derby?, Hibernate, Java stored procs?)

• Testable
  • JUnit
  • Continuous integration?

• API
  • Java (web services?)
  • Oracle: views, stored procs

Page 24: Bioinformatics Data Analysis: InterPro

What’s in it for me?

• UniProt curators
  • On-demand sequence analysis?

• Ensembl production
  • InterPro hits
  • Pre- or post-UniParc?

Page 25: Bioinformatics Data Analysis: InterPro

CluSTr

• Input: UniProtKB, IPI, Ensembl Human – 6 million sequences
• Output:
  • Similarity scores (Smith-Waterman) – 3.5 billion
  • Clusters (single linkage, aka nearest neighbour)
  • Orthologues (best reciprocal hit) – 627 species
  • Every 3 weeks (UniProt cycle)

• Availability: Oracle, web app, FTP (sims + GO mappings)
• Customers
  • integr8 (orthologues)
  • Druggable Genome (similarities)

• Potential
  • Set-based analyses
  • Similarities on-demand
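The two CluSTr outputs derived from similarity scores can be sketched in a few lines. The code below is a toy illustration and not CluSTr's implementation: the function names, the score threshold, the protein names and the scores are all invented, and a dictionary of pairwise scores stands in for the Smith-Waterman results. Single linkage is done with a union-find structure; orthologue candidates are pairs that are each other's best hit across two species.

```python
def single_linkage(scores, threshold):
    """Cluster proteins: any pair scoring >= threshold ends up in one cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())


def best_reciprocal_hits(scores, species):
    """Return pairs from different species that are each other's best hit."""
    best = {}  # (protein, other species) -> (score, partner)
    for (a, b), score in scores.items():
        for p, q in ((a, b), (b, a)):
            if species[p] == species[q]:
                continue
            key = (p, species[q])
            if key not in best or score > best[key][0]:
                best[key] = (score, q)
    pairs = set()
    for (p, _), (_, q) in best.items():
        back = best.get((q, species[p]))
        if back is not None and back[1] == p:
            pairs.add(tuple(sorted((p, q))))
    return sorted(pairs)


# Invented scores between two human (h) and two mouse (m) proteins
scores = {("h1", "m1"): 90, ("h1", "m2"): 40, ("h2", "m1"): 50,
          ("h2", "m2"): 80, ("h1", "h2"): 60}
species = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse"}

print(single_linkage(scores, threshold=70))   # [['h1', 'm1'], ['h2', 'm2']]
print(single_linkage(scores, threshold=55))   # [['h1', 'h2', 'm1', 'm2']]
print(best_reciprocal_hits(scores, species))  # [('h1', 'm1'), ('h2', 'm2')]
```

Lowering the threshold from 70 to 55 merges everything through the single h1/h2 link, which is the chaining behaviour that gives single linkage its "nearest neighbour" name.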

Page 26: Bioinformatics Data Analysis: InterPro

Acknowledgements

• InterPro
  • Robert Petryszak (Dark Side)
  • Craig McAnulla (Onion)
  • John Maslen (CluSTr)
  • Beat Ramseier (Method Archive)
  • Sarah Hunter (Management)

• integr8
  • Paul Kersey (CluSTr)

• A Team
  • Tracy Mumford
  • Kerry Smith

Thank you