INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

High Throughput Bioinformatics analysis on the Grid

EMBnet/CNB http://www.es.embnet.org/

Scientific Workshop, AGM'06

Helsinki, Finland

Grid Workshop, SC '05, Seattle WA, USA 2

Summary

HT analysis on the Grid

GROCK architecture

GROCK as Web Service

Thanks

Lessons Learnt

So long and thanks for all the fish!

Why do we want HT?

• The short answer
  – To perform many analyses efficiently
• The long answer
  – To run multi-process jobs
      Evolutionary bootstraps
      Docking
      Image processing...
  – To run many processes
      High number of users
      High number of problems
        • Modelling
        • Function prediction
        • Structure prediction...

GROCK goal

• Why do we want High-Throughput docking?
  ● find best matches between two molecular structures
  ● for a probe molecule against all molecules in a database
  ● drug against protein
    ● Identify drug function, predict secondary effects
  ● protein against proteins
    ● Identify protein interactions, build interaction networks
  ● protein against drugs
    ● Identify candidate drugs for therapy
  ● Beyond a single organism
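The screening idea above (one probe scored against every database entry, keeping the best matches) can be sketched in a few lines. This is only an illustration: `screen`, `toy_score` and the toy database are hypothetical names, and the scoring function is a stand-in for a real docking program, not GROCK's method.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def screen(probe: str,
           database: Dict[str, str],
           score: Callable[[str, str], float],
           top_n: int = 10) -> List[Tuple[float, str]]:
    """Score the probe against every database entry and keep the best matches."""
    scored = ((score(probe, target), name) for name, target in database.items())
    return heapq.nlargest(top_n, scored)


def toy_score(probe: str, target: str) -> float:
    """Stand-in for a docking run: counts the characters both strings share."""
    return float(len(set(probe) & set(target)))


# Hypothetical three-entry "database"; real entries would be 3D structures.
toy_db = {"protA": "ACDE", "protB": "XYZ", "protC": "ACXY"}
hits = screen("ACD", toy_db, toy_score, top_n=2)
print(hits)
```

Because each probe/target pair is scored independently, the loop body is the natural unit to hand to a Grid job.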

So, what? Is it any good?

To tell you the truth:

In and of itself, it is of limited interest

Beg to disagree

Pharmaceutical companies have been using something like 3D-QSAR for years

With considerable success

Come on! Do I need to tell you this? Really?

• You should never blindly trust a computer.
  – Predictions must be verified
  – Predictions must be put in perspective
  – Predictions are but a small part of a larger protocol
• It is difficult to get access to pharmacological data
  – Unless you are a Pharma
• GROCK should be part of a larger ensemble

GROCK in context

• Predicting protein interaction networks
  – HT protein interaction predictions (HT-GROCK? Whoa!)
  – Experimental validation
      Proteomic analysis of experimental results
  – Systems Biology modelling
  – Analyze macromolecular assemblies (e.g. 3D-EM)
• Predicting new drugs
  – Build protein models / Analyze protein structure
  – Identify putative targets (3D-QSAR, GROCK, WISDOM)
  – Screen using QSAR
  – Predict possible effects (GROCK, HT-GROCK? Re-Whoa!)
  – Experimental validation

Attacking current needs

• GROCK is a tool that makes 3D molecular screening:
  ● Easy through a simple, intuitive web interface
  ● More reliable than pharmacophores: uses 3D docking methods
  ● Versatile: uses standard software and data
  ● Efficient: thanks to the Grid (EGEE)
  ● Integrable in other programs as a Web Service (SOAP or XML-RPC)
  ● And is GPL!

A Real Time example

• Just for fun: Let's run a screening of aspirin against a small test database
  ● Connect to GROCK server
  ● Upload aspirin
  ● Select options
  ● Run

GROCK: match explorer

For each pair
  ● show 10 best
  ● 3D coords
  ● PNG
  ● JPEG
  ● PS
  ● PDF
  ● VRML1
  ● VRML2
  ● Jmol

Aspirin (acetylsalicylic acid)

● Induces its effect through phospholipase A2
  ● Which is not on the search subset itself (sic)
● But has many other effects
  ● on G protein signalling
  ● modulates hormone stimulated cyclic AMP production
  ● protects against neurotoxicity
  ● is used in dyslipidaemias
  ● affects pulmonary surfactant
  ● etc... (check PubMed).

Caveats

● Molecular databases are noisy
  – Plenty of room for enhancement
  – ...by Biology/Chemistry Structuralists
● Meaningless molecules are included
  – E.g. irrelevant molecules from uninteresting organisms
  – Data reduction by representative clustering
● Meaningful molecules may be excluded
  – E.g. by substitution of a relevant protein by an irrelevant relative
● 3D matching is approximate
  – E.g. meaningful info not included (like water or ion molecules)
● Users MUST exercise thoughtful criticism
  – Just like with any other theoretical tool

Next: Architecture

✔ GROCK: HT docking on the Grid

GROCK architecture

GROCK as Web Service

Lessons learnt

Thanks

So long and thanks for all the fish!

GROCK: architecture

• Design:
  – User
  – Web Server
  – Web service
  – Grid front-end
  – Grid back-end
• Advantages:
  – Secure
  – Fail safe
  – Efficient
  – GENERIC
• To be done:
  – Make restartable

Avoiding “Death eaters”

GROCK design

• Command line application
• WS wrapper
• WWW interface
• Provision for easy expansion
  – Plugin mechanism to add new databases (PDB, HIC-UP, ZINC)
  – Plugin mechanism to add new methods (GRAMM, 3D-DOCK)
  – Well defined plugin interfaces (roll your own)
• GROCK builds on other tools
  – Result browser relies on remote WS for generating output
  – Generic docking methods
• GROCK may be used to build other tools
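One way to picture the plugin mechanism described above is a pair of registries, one for databases and one for docking methods, behind a small and well defined interface. The sketch below is an assumption made for illustration only: the registry names, decorators and toy plugins are hypothetical and do not reproduce GROCK's actual plugin interface.

```python
from typing import Callable, Dict, Iterable

# Hypothetical registries: one per plugin kind mentioned on the slide.
DATABASES: Dict[str, Callable[[], Iterable[str]]] = {}
METHODS: Dict[str, Callable[[str, str], float]] = {}


def database_plugin(name: str):
    """Register a function that returns the entries of a molecular database."""
    def register(func):
        DATABASES[name] = func
        return func
    return register


def method_plugin(name: str):
    """Register a function that scores a probe against a single target."""
    def register(func):
        METHODS[name] = func
        return func
    return register


@database_plugin("toy-db")
def toy_db() -> Iterable[str]:
    # Stand-in for a real source such as PDB or ZINC.
    return ["ACDE", "XYZ"]


@method_plugin("toy-dock")
def toy_dock(probe: str, target: str) -> float:
    # Stand-in for a real docking program such as GRAMM.
    return float(len(set(probe) & set(target)))


def run(probe: str, database: str, method: str) -> Dict[str, float]:
    """Look up both plugins by name and score the probe against every entry."""
    dock = METHODS[method]
    return {target: dock(probe, target) for target in DATABASES[database]()}


print(run("ACD", "toy-db", "toy-dock"))
```

Adding a new database or method then means writing one function and registering it under a name; the driver code in `run` does not change.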

Next: WS

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

GROCK as Web Service

Lessons learnt

Thanks

So long and thanks for all the fish!

GROCK as a Web Service

– Callable using SOAP or XML-RPC
– Provides its own description and WSDL when invoked with no parameters
    User-friendly, human readable
– Provides meta-data about itself
    Source code
    Usage info
    Bibliography
– Job monitoring
    Asynchronous Web Service
    Dynamic
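The self-describing behaviour listed above can be illustrated with Python's standard XML-RPC machinery. This is a sketch under stated assumptions: the method name `grock.submit`, its parameters and the description text are invented for the example, and the request is dispatched in-process (through the standard library's `_marshaled_dispatch` helper) rather than sent to a real GROCK server.

```python
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCDispatcher

# Hypothetical description text returned when no parameters are given.
DESCRIPTION = "GROCK sketch: call grock.submit(probe, database) to start a screening."


def submit(probe=None, database=None):
    """Hypothetical entry point: describe the service when called with no parameters."""
    if probe is None and database is None:
        return DESCRIPTION
    return {"probe": probe, "database": database, "accepted": True}


dispatcher = SimpleXMLRPCDispatcher(allow_none=True)
dispatcher.register_function(submit, "grock.submit")


def call(method, *params):
    """Marshal an XML-RPC request, dispatch it in-process, unmarshal the reply."""
    request = xmlrpc.client.dumps(params, methodname=method, allow_none=True)
    response = dispatcher._marshaled_dispatch(request.encode("utf-8"))
    (result,), _ = xmlrpc.client.loads(response)
    return result


print(call("grock.submit"))                        # no parameters: description
print(call("grock.submit", "aspirin", "test-db"))  # parameters: request accepted
```

Against a deployed service the same calls would go through `xmlrpc.client.ServerProxy` with the server's URL instead of the in-process dispatcher.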

An asynchronous WS

When invoked, GROCK returns an opaque key that may be used to query it for status and output info:

Keys are generated at random with enough entropy to make them difficult to guess

The key is actually a ‘session ID’ that uniquely identifies a given job request in the file store.

GROCK uses the key to retrieve job status and output
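The key-based protocol described above can be sketched as follows. Everything here is an assumption for illustration: the function names are hypothetical, an in-memory dictionary stands in for the file store, and the key length (32 random bytes) is an arbitrary choice rather than GROCK's actual setting.

```python
import secrets
from typing import Dict, Optional

# In-memory stand-in for the file store mentioned on the slide.
JOBS: Dict[str, dict] = {}


def submit(probe: str, database: str) -> str:
    """Record a job request and return an opaque, hard-to-guess key."""
    key = secrets.token_urlsafe(32)  # 32 bytes from the OS random source
    JOBS[key] = {"probe": probe, "database": database,
                 "status": "queued", "output": None}
    return key


def status(key: str) -> str:
    """Report the job status; unknown keys reveal nothing about other jobs."""
    job = JOBS.get(key)
    return job["status"] if job else "unknown"


def finish(key: str, output: str) -> None:
    """Called by the back-end when the job completes."""
    JOBS[key].update(status="done", output=output)


def output(key: str) -> Optional[str]:
    """Return the stored output, or None if the job is unknown or unfinished."""
    job = JOBS.get(key)
    return job["output"] if job else None


key = submit("aspirin", "test-db")
print(status(key))
finish(key, "10 matches")
print(status(key), output(key))
```

The caller never sees anything but the key, so it can poll at its own pace while the job runs on the Grid.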

Next: Lessons Learnt

✔ GROCK: HT docking on the Grid✔ GROCK architecture✔ GROCK as Web Service

Lessons learnt

Thanks

So long and thanks for all the fish!

Future directions

• Add support for additional docking methods
  – DOCK5 (MPI), AutoDock, others
• Add support for other databases
  – HIC-Up
  – ZINC subsets
• Exploit Grid distributed storage system
  – Needed for truly massive jobs (e.g. drug screening)
• Apply architecture to other problems (evolution, 3D reconstruction, high-throughput *)

Next steps

• Extend pharmainformatics work
  – Molecular modelling (YaMI: MODELLER)
      Already under way
  – Molecular Dynamics (AMBER, TINKER, NAMD)
      In collaboration with Raul Isea (RIB), Paulino Gomez-Puertas (CBM)...
  – Cheminformatics (MPQC, NWChem, Car-Parrinello, DFT)
      If still needed
• Extend interactions work
  – 3D-EM analysis of macromolecular assemblies (analysis restarted in February 2006)
  – Xmipp (in-house open source package)
  – In collaboration with 3D-EM NoE
  – Start easy, with the heaviest and most used applications

Lessons learned

• YaMI v7 (Yet another Modeller Interface)
• GridGRAMM
  – Running a single process takes longer
  – But may be worth the wait
  – Don't let anybody mislead you: The Grid is a source of raw computing power. Period.
• HT Docking
  – All you need is a tight loop, et voilà!
  – Really!
  – However...

Component Based Architecture

• Extending GROCK to use additional dockers
• Extending GROCK to use distributed storage
• Extending GROCK to run in non-EGEE environments
• Shows the relevance of choosing appropriate interfaces
• GROCK, YaMI, GridGRAMM themselves require NEW, well thought out interfaces
• Job execution DOES NOT
  – DRMAA is a standard batch submission API, defined by the DRMAA-WG
  – Joined DRMAA-WG in February 2006
  – Goal: Define a DRMAA binding for PHP
  – Build a DRMAA binding for EGEE

Our Advice

• Program using a standard API: DRMAA
  – Do it once, run on SGE, Condor, GridWay, etc...
• Use third party work whenever possible
  – To save effort and increase portability
  – Remember: Don't overdo it! KISS!
• Define plugin interfaces (and document them)
  – For extensibility
• Define WS invocation interface (and document it)
  – For integration into other frameworks
• And finally program a trivial loop (always document)
  – Don't be too worried about performance
  – It will be simple, fast and short
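The advice above (write against one submission interface, then run a trivial loop) might look roughly like this. The sketch does not use a real DRMAA binding: `JobSession` is a hypothetical interface that only imitates the submit-then-wait shape of such an API, and `LocalSession` runs jobs as local processes so the example stays self-contained.

```python
import subprocess
import sys
from typing import Dict, List, Protocol


class JobSession(Protocol):
    """Hypothetical, DRMAA-inspired interface: submit a job, then wait for it."""

    def run_job(self, command: List[str]) -> str: ...

    def wait(self, job_id: str) -> int: ...


class LocalSession:
    """Backend that runs each job as a local process.

    A Grid or batch-system backend would offer the same two methods.
    """

    def __init__(self) -> None:
        self._jobs: Dict[str, subprocess.Popen] = {}

    def run_job(self, command: List[str]) -> str:
        proc = subprocess.Popen(command, stdout=subprocess.DEVNULL)
        job_id = str(proc.pid)
        self._jobs[job_id] = proc
        return job_id

    def wait(self, job_id: str) -> int:
        # Block until the job ends and return its exit code.
        return self._jobs.pop(job_id).wait()


def screen(session: JobSession, probe: str, targets: List[str]) -> Dict[str, int]:
    """The trivial loop: submit one job per target, then collect exit codes."""
    # The command is a placeholder for a real docking executable.
    job_ids = {
        target: session.run_job(
            [sys.executable, "-c", "import sys; print(sys.argv[1:])", probe, target])
        for target in targets
    }
    return {target: session.wait(job_id) for target, job_id in job_ids.items()}


exit_codes = screen(LocalSession(), "aspirin", ["protA", "protB"])
print(exit_codes)
```

Only the session class knows where jobs run, so moving from a workstation to a cluster or the Grid means swapping the backend while `screen` stays untouched.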

Current work and next steps

• Build DRMAA API for EGEE
  – So that next steps are easier
• Think about best architecture for data distribution
  – So it is intuitive, effective and simple
• Go ahead
  – Molecular modelling
  – Molecular dynamics
  – Molecular reconstruction
  – Macromolecular assembly analysis by 3D-EM
  – Cheminformatics (if not done yet)

Next: Middleware classes

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

✔ GROCK as Web Service

✔ Lessons learnt

Thanks

So long and thanks for all the fish!

We wish to thank

• YOU ALL
  – for being here, your help, encouragement, feedback and support
  – and not falling asleep
• The TEAM at CNB
  – Biocomputing
      José M. Carazo, Carlos Pérez-Roca, Enrique de Andrés, Natalia Jiménez, Sjors Schëres, Alfredo
  – Bioinformatics
      José R. Valverde, David J. García
• THE EU for EGEE

Next: That's all folks!

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

✔ GROCK as Web Service

✔ PHP middleware

✔ LCG middleware

✔ Thanks

So long and thanks for all the fish!

Any questions?