Bioinformatics Data Analysis: InterPro
EBI is an Outstation of the European Molecular Biology Laboratory.
9 April 2008
InterPro data pipelines
Antony Quinn
Outline
• Onion
  • Main InterPro pipeline: predict protein families, domains, sites
  • A lot
• CluSTr
  • Automatic classification of proteins based on sequence similarity
  • A little
InterPro data pipelines: What’s in it for me? 9 April 2008
Not this...
Mission: to explore strange new proteins…
Onion + Protein Sequence = Prediction of Functional Annotation
InterPro
Protein families, domains, repeats and sites
Requirements
• Handle all member databases and algorithms
  • HMMER (e.g. Gene3D, PANTHER)
  • Regular expressions (PROSITE)
  • SignalP
  • TMHMM
  • BLAST (PIRSF)
  • FingerPRINTScan (PRINTS)
• Fast
• Wide and deep coverage
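As an illustration of the "Regular expressions (PROSITE)" requirement, here is a minimal sketch of translating a PROSITE pattern into a Python regular expression and scanning a sequence with it. The function names are hypothetical and this is not Onion's own code; it covers only the common pattern elements (x, [..], {..}, repeat counts and the < and > anchors), not anchors placed inside brackets.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into Python regular-expression syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):       # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):         # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):         # repeat count such as x(3) or x(2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":                # any residue
            core = "."
        elif element.startswith("{"):     # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                             # single residue or [ABC] class
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every match, overlaps included."""
    # A lookahead lets the scan report matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]
```

For example, the N-glycosylation site pattern N-{P}-[ST]-{P} becomes the regular expression N[^P][ST][^P].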
Design
• UniParc
  • Solves mapping problem
  • Sequential IDs
  • Comprehensive – many DBs, all sequences
• Method Archive
  • Minimise calculations
  • Read flat files once
• Decoupled analysis and post-processing
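The UniParc points above (one archive, sequential IDs, many source databases) can be sketched as a non-redundant sequence store. This is a toy model under stated assumptions: the class name, the in-memory dictionaries and the choice of SHA-256 as the checksum are all hypothetical, not UniParc's actual schema.

```python
import hashlib


class SequenceArchive:
    """Toy non-redundant archive: one sequential ID per distinct sequence."""

    def __init__(self):
        self._id_by_checksum = {}   # checksum -> sequential ID
        self._xrefs = {}            # sequential ID -> {(source_db, accession)}
        self._next_id = 1

    def register(self, source_db: str, accession: str, sequence: str) -> int:
        """Store a cross-reference; identical sequences share one ID."""
        checksum = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        seq_id = self._id_by_checksum.get(checksum)
        if seq_id is None:          # first time this sequence has been seen
            seq_id = self._next_id
            self._next_id += 1
            self._id_by_checksum[checksum] = seq_id
            self._xrefs[seq_id] = set()
        self._xrefs[seq_id].add((source_db, accession))
        return seq_id

    def xrefs(self, seq_id: int):
        """All source-database accessions that map to this sequence."""
        return sorted(self._xrefs[seq_id])

    def new_since(self, last_seen_id: int):
        """Sequential IDs make 'what is new?' a simple comparison."""
        return [i for i in sorted(self._xrefs) if i > last_seen_id]
```

Because each distinct sequence is analysed once under its own ID, results can be mapped back to every source accession without recalculation.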
The Trinity
Onion
UniParc
Method Archive
Member database methods
Protein sequences
New sequences
• New sequences arrive in UniParc
• Onion runs the new sequences against all methods in the Method Archive
Member database release
• Methods added, changed or deleted in the Method Archive
• Onion runs new and changed methods against all sequences in UniParc
Advantages:
• If only post-processing or cut-off changed – only run that part
• No change – no need to rerun
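The release logic above can be sketched as a comparison between the archived methods and a new member database release. The record layout (a model checksum plus a cut-off per method accession) and the function name are assumptions made for illustration only.

```python
def plan_rerun(archived: dict, release: dict) -> dict:
    """Decide what to do for each method in a new member database release.

    Both arguments map a method accession to a record such as
    {"model_checksum": "...", "cutoff": 25.0} (hypothetical layout).
    """
    plan = {"search": [], "postprocess_only": [], "unchanged": [], "delete": []}
    for accession, new in release.items():
        old = archived.get(accession)
        if old is None or old["model_checksum"] != new["model_checksum"]:
            # New or changed model: search it against all sequences.
            plan["search"].append(accession)
        elif old["cutoff"] != new["cutoff"]:
            # Same model, new cut-off: reuse raw results, redo post-processing.
            plan["postprocess_only"].append(accession)
        else:
            # No change: no need to rerun.
            plan["unchanged"].append(accession)
    plan["delete"] = [acc for acc in archived if acc not in release]
    for accessions in plan.values():
        accessions.sort()
    return plan
```

Only the methods listed under "search" cost cluster time; the others reuse the raw results already stored.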
“Deluge” mode (manual)
New release of model database – search new and changed models against all of UniParc
• UniParc → FASTA file
• Model flat files (HMM, Profile, FPrint) → 1000s of model files
• anthill: bsub submission cmds → LSF
• LSF → output files (raw results)
• Parse, reformat → SQL*Loader file
• Load → ONION raw results table
• Post-processing → final results table
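The "bsub submission cmds" step can be sketched as one LSF job per model file, each searching the FASTA file. This is an assumption-laden example: the file names and the function are hypothetical, and hmmsearch stands in for whichever search program a given member database needs. It only builds the command strings and does not submit anything.

```python
import shlex
from pathlib import PurePosixPath


def bsub_commands(fasta_file, model_files, output_dir, queue=None):
    """Build one bsub command line per model file (nothing is executed)."""
    commands = []
    for model in model_files:
        name = PurePosixPath(model).stem                    # job name from file name
        out = str(PurePosixPath(output_dir) / (name + ".out"))
        args = ["bsub", "-J", name, "-o", out]              # job name, output file
        if queue:
            args += ["-q", queue]                           # optional LSF queue
        args += ["hmmsearch", model, fasta_file]            # <hmmfile> <seqdb>
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    for line in bsub_commands("uniparc.fasta", ["models/M0001.hmm"], "raw"):
        print(line)
```

Each job writes its raw results to its own output file, which keeps parsing and loading independent of the search itself.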
“Drip” mode (automatic)
New sequences – search all models every 4 minutes
• UniParc → extract new sequences
• Models: HMM, Profile and FPrint flat files
• anthill: bsub submission cmds → LSF
• LSF → output files (raw results)
• Parse, reformat and load → ONION raw results table
• Post-processing → final results table
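A minimal sketch of the drip cycle, assuming sequences carry sequential IDs so that "new" means "ID greater than the last one processed". The function names and the callback-based design are hypothetical; the 240-second default mirrors the 4-minute interval on the slide.

```python
import time


def drip_cycle(fetch_new_sequences, submit, last_seen_id):
    """Run one cycle: extract new sequences and submit them for analysis."""
    new = fetch_new_sequences(last_seen_id)   # list of (sequence ID, sequence)
    if not new:
        return last_seen_id                   # nothing new this cycle
    submit(new)                               # e.g. write FASTA, submit jobs
    return max(seq_id for seq_id, _ in new)


def run_drip(fetch_new_sequences, submit, last_seen_id=0,
             interval_seconds=240, cycles=None, sleep=time.sleep):
    """Repeat drip_cycle every interval_seconds; cycles=None runs forever."""
    done = 0
    while cycles is None or done < cycles:
        last_seen_id = drip_cycle(fetch_new_sequences, submit, last_seen_id)
        done += 1
        if cycles is None or done < cycles:
            sleep(interval_seconds)
    return last_seen_id
```

Passing the sleep function as a parameter is a design choice that lets the loop be exercised in a test without waiting.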
The refinery
• Input: Oracle (raw data) from HMMER searches of Pfam, TIGRFAM, SMART, SUPERFAMILY, GENE3D, PIRSF and PANTHER
• Post-processing steps shown: GA cut-off, TC cut-off, E-value cut-off (twice), AM filter, clan, nested, threshold (kinase), domainFinder, pirsf, pantherScore, assignment, sequence
• Output: Oracle (refined data)
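The cut-off steps of the refinery can be sketched as a filter from raw hits to refined hits. This is an illustration under assumptions: the record fields, the method accessions and the cut-off values are hypothetical, and the real post-processing (clan, nested, domainFinder and so on) is not modelled. Score cut-offs stand for GA or TC style thresholds, where the bit score must reach the value; E-value cut-offs require the E-value to be at or below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawHit:
    sequence_id: int
    method: str      # member database method accession
    score: float     # bit score reported by the search
    evalue: float


def passes_cutoff(hit: RawHit, cutoff) -> bool:
    """cutoff is (kind, value) with kind 'score' (GA/TC style) or 'evalue'."""
    kind, value = cutoff
    if kind == "score":
        return hit.score >= value
    if kind == "evalue":
        return hit.evalue <= value
    raise ValueError("unknown cut-off kind: " + repr(kind))


def refine(raw_hits, cutoffs: dict):
    """Keep only hits whose method has a cut-off and which satisfy it."""
    return [
        hit for hit in raw_hits
        if hit.method in cutoffs and passes_cutoff(hit, cutoffs[hit.method])
    ]
```

Because the raw hits stay untouched, changing a cut-off only requires calling refine again, which is the benefit of decoupling analysis from post-processing.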
Onion vs InterProScan
• Similarities
  • Software: HMMER, TMHMM, SignalP
  • Models: Pfam, Gene3D, PRINTS, etc.
• Differences
  • Internal use only
  • Decoupled analysis and post-processing
  • Java + database
  • Faster
Limitations
• Database design
  • Inflexible – single member DB version
  • Redundant
• Tight coupling
  • Internal
  • External
    • Oracle
    • LSF
    • File system
• Difficult to test/debug
Plans
• Merge InterProScan
  • Single code base = reduced maintenance cost
  • Java (Java 5? Spring? Maven?)
  • Database (Oracle, Derby?, Hibernate, Java stored procs?)
• Testable
  • JUnit
  • Continuous integration?
• API
  • Java (web services?)
  • Oracle: views, stored procs
What’s in it for me?
• UniProt curators
  • On-demand sequence analysis?
• Ensembl production
  • InterPro hits
  • Pre- or post-UniParc?
CluSTr
• Input: UniProtKB, IPI, Ensembl Human – 6 million sequences
• Output:
  • Similarity scores (Smith-Waterman) – 3.5 billion
  • Clusters (single linkage, aka nearest neighbour)
  • Orthologues (best reciprocal hit) – 627 species
  • Every 3 weeks (UniProt cycle)
• Availability: Oracle, web app, FTP (sims + GO mappings)
• Customers
  • integr8 (orthologues)
  • Druggable Genome (similarities)
• Potential
  • Set-based analyses
  • Similarities on-demand
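The two CluSTr outputs named above, single-linkage clusters and best-reciprocal-hit orthologues, can be sketched from a list of pairwise similarity scores. This is a small in-memory illustration with hypothetical function names and toy identifiers; it says nothing about how CluSTr handles billions of scores.

```python
def single_linkage_clusters(similarities, threshold):
    """Group proteins linked by any chain of scores >= threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b, score in similarities:
        root_a, root_b = find(a), find(b)   # registers both proteins
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b
    clusters = {}
    for member in parent:
        clusters.setdefault(find(member), set()).add(member)
    return sorted(sorted(members) for members in clusters.values())


def best_reciprocal_hits(similarities, species_of):
    """Pairs from different species that are each other's best-scoring hit."""
    best = {}   # (protein, other species) -> (score, best partner)
    for a, b, score in similarities:
        for query, target in ((a, b), (b, a)):
            if species_of[query] == species_of[target]:
                continue
            key = (query, species_of[target])
            if key not in best or score > best[key][0]:
                best[key] = (score, target)
    pairs = set()
    for (query, _), (_, target) in best.items():
        back = best.get((target, species_of[query]))
        if back is not None and back[1] == query:
            pairs.add(tuple(sorted((query, target))))
    return sorted(pairs)
```

Single linkage merges two clusters as soon as one pair of members is similar enough, which is why it is also called nearest neighbour clustering.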
Acknowledgements
• InterPro
  • Robert Petryszak (Dark Side)
  • Craig McAnulla (Onion)
  • John Maslen (CluSTr)
  • Beat Ramseier (Method Archive)
  • Sarah Hunter (Management)
• integr8
  • Paul Kersey (CluSTr)
• A Team
  • Tracy Mumford
  • Kerry Smith
Thank you