INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org Life Sciences Applications José R. Valverde EMBnet/CNB


jr

• PhD in Medicine and Surgery
  – You already guessed ;-)
  – Forensic Dr.
  – Molecular Biologist
  – Exobiologist, IVF-ET...
• Computer Scientist
  – Bioinformatician
• Currently
  – EMBnet node manager

EMBnet: provides support, training, resources and services to the Life Sciences

Daydreaming

• One day we would like to
  – Go to the doctor
  – Have a blood sample taken
  – Get a personalized diagnosis and therapy
• We might also love to live in a world with
  – No hunger
  – No pollution
  – Biodiversity
  – Lots of fun!

Hey you! Yes! YOU!

• And you?
  – What would YOU like?

The Human Genome

• Laid out the basis for
  – Genomics
  – Proteomics
  – Transcriptomics
  – High-throughput structure analysis
• Sets out the basis for the “Databang!”
  – Genome sequencing in many organisms
  – Genomic analysis implies a quantum leap: several orders of magnitude forward
  – New experimental approaches
  – Genome sequencing overnight(?) for less than 1K€/genome!

Oh, no! Not me again!

• What about YOU?
  – Would you like your genome sequenced?

Think twice

The Databang!

• Data growth in Molecular Biology
  – Exponential until recently (2x every year)
  – Greater than exponential lately (2x every 8 months, 6 months...)
  – The worst is still to come
• With a doubling time of less than 6 months
  – You miss half the knowledge gathered in all of Human History if you lag a few months behind!
• Experimental work
  – Classical: one gene
  – Modern: one genome
• Forget the classical Databank
• Welcome the new Databang!
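
The "half the knowledge" claim follows from simple doubling arithmetic. The sketch below assumes idealized exponential growth with a fixed doubling time (real databank growth is messier); the function name is only illustrative.

```python
def fraction_missed(lag_months: float, doubling_months: float) -> float:
    """Share of all data accumulated so far that appeared during the lag.

    Assumes pure exponential growth with a constant doubling time.
    """
    return 1.0 - 2.0 ** (-lag_months / doubling_months)


# With a 6-month doubling time, see how much is new after various lags.
for lag in (1, 3, 6, 12):
    share = fraction_missed(lag, doubling_months=6)
    print(f"lag of {lag:2d} months: {share:.0%} of the data is new to you")
```

With a 6-month doubling time, a 6-month lag means half of everything ever collected arrived while you were away, and a 12-month lag means three quarters.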

Beyond Molecular Biology

• Medicine
  – Homo sum nihil humani a me alienum puto. (Terence) ;-)
  – Knowledge application
  – Knowledge INTEGRATION: medical records, neurology, immunology, etc.
• Pharmacology
  – Drug identification
  – Drug testing
• Biotechnology
• Chemistry
• You name it!

Pursue your dreams!

• Never, ever give up! But how?
• Your doctor will need to analyse your whole genome
  – and compare it against population standards
• Your pharmacist will need to find the best drug
  – Out of millions
• Engineers will need to understand Life
  – From molecules to populations
  – And how to modify it ecologically
• Shorthands/rules/laws will need to be drawn up

Meaning what?

• Huge amounts of raw power at the fingertips...
  – Of many professionals
  – To store replicated data (security, accessibility, efficiency...)
  – To analyze vast amounts of data
• A scaffold of knowledge
  – Built stepwise on top of prior knowledge (molecules, cells, tissues, organisms, populations, ecosystems)
  – From many sources (Biology, Medicine, Industry, etc.)
  – By many professionals (all over the World)
• Tight security
  – To protect data (personal, corporate) from abuse

Getting there

• A component based architecture
  – Multifaceted, multiheaded, multihosted
  – Integrable, deals with complexity
• A lot of power
  – HPC systems
  – Grid systems
• Politics
  – Security in the face of tremendous stress forces: corporate, political, private, moral, ethical

Component Architecture

• This is Science, man!
  – There should be no barriers to collaboration
• Object Oriented Web Services (and CORBA, .Net, etc.)
• BioMOBY (www.biomoby.org)
  – Web Services based
  – Workflows with Taverna (under reevaluation)
  – Distributed development
  – Integration with the Grid (MyGrid, UK)
  – Examples of BioMOBY applications: sequence conversion, protein structure modelling, sequence comparison, gene finding...

Component layers

• Low level: System('xxx')
  – Examples: CGIs
• Middleware: Submit('xxx')
  – Examples: PHP:Grid, DRMAA
• Application: Analyze('xxx'), Decide('xxx'), Predict('xxx')
  – Examples: DOCK-ws, BLAST-ws, MODELLER-ws
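
The three layers can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names mirror the slide's System/Submit/Analyze, the "analysis program" is a one-line GC-content script, and the middleware simply runs jobs locally where a real deployment would hand them to a queue or grid client.

```python
import subprocess
import sys


def system(argv):
    """Low level: run a program locally and return what it printed."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout


def submit(argv, runner=system):
    """Middleware: hand the job to an execution backend.

    The default runner is the local machine; a batch queue or grid
    client would be swapped in here (hypothetical, not shown).
    """
    return runner(argv)


# Hypothetical stand-in for a real analysis program: prints GC content.
GC_PROGRAM = (
    "import sys; s = sys.argv[1].upper(); "
    "print((s.count('G') + s.count('C')) / len(s))"
)


def analyze(sequence, runner=system):
    """Application: answer a domain question without exposing job details."""
    output = submit([sys.executable, "-c", GC_PROGRAM, sequence], runner=runner)
    return float(output)
```

A caller only sees `analyze("GGCCAT")`; where and how the job runs is decided one layer down, which is what lets the backend change without touching the application.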

Base services: Blast

• Different interfaces with different calling conventions

• Dynamically changing with each new version / sysadmin / fashion

Derived services

☞ Call upon existing servers on remote systems

☞ Might be called from servers on remote sites

Distributed data queries

• Distributed DBMS (SRS Federation, www.srsfed.org)
• Store databases distributed/replicated over central nodes
• Distribute database processing to hosting servers
• Distribute database queries transparently from distributive front ends
• User data
• Find the best way to store/access
• Data collection into databases
• Test distributed collection/storage
• Systems

Good problem for ☞ HPC, gridification
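
The idea of a front end that distributes a query transparently can be sketched as a scatter-gather: each node searches the data it hosts and the front end merges the partial answers. This is a toy under stated assumptions, not how SRS Federation works; the node names, accession numbers and in-memory dictionaries are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for databanks hosted on different nodes.
NODES = {
    "node-a": {"P12345": "kinase", "Q99999": "protease"},
    "node-b": {"P54321": "kinase inhibitor"},
    "node-c": {"O00001": "transporter"},
}


def query_node(name, term):
    """Run the query where the data lives and return only the hits."""
    return [(name, acc, desc) for acc, desc in NODES[name].items() if term in desc]


def federated_query(term):
    """Front end: scatter the query to every node, then gather and merge."""
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda node: query_node(node, term), NODES))
    return sorted(hit for hits in partial for hit in hits)


print(federated_query("kinase"))
```

The user issues one query and never learns which node held which record; only the hits travel over the network, not the databases.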

HPC

• MPI, queues
• Good for massively parallel jobs
  – e.g. Molecular Dynamics on MareNostrum
  – e.g. 3D reconstruction on MareNostrum
• Very expensive
• Good for embarrassingly parallel jobs
  – But so is the Grid
• Good for communication dependent jobs
  – Large messages
  – Many messages
• Don't get me wrong: there is a lot of life in HPC

Classical problems

• There are still huge problems with huge demands on compute power

• Structure refinement (X-ray, NMR, Microscopy...)
• Structure prediction (Homology, Threading, MD)
• Structure analysis (docking, MD/QM simulations, QSAR, 3D-QSAR)
• Many others
• Coarse and fine grain computation
• Benefit from distributed computing
• Farm / cluster / grid / supercomputers
• PVM/MPI implementations may exist

Bird flu M2

Marenostrum

The life science program will take advantage of the Supercomputer to get a deeper understanding of the behavior of living organisms.

Research lines
• Genomic analysis
• Data mining
• Systems Biology
• Prediction of protein fold
• Molecular interactions

Xmipp

Grid computing

• Affordable
• Good for embarrassingly parallel jobs
  – Wisdom, GROCK, 3D analysis, HMMER...
• Appropriate for parallel jobs (big clusters)
  – MPI, MD, etc.
• Appropriate for distributed data
  – Medical imaging, databases, etc.
• Good for HT and HP (highly popular) tasks
  – EMBOSS, EMBRACE, etc.

High throughput data

• New experimental approaches generate pervasive hyper-exponential data streams

• Processing requires massive computing power

• Currently beyond reach of common developers

• There are some solutions

• Parallel processing in Molecular Structure

• But the vast majority of applications are still single threaded monolithic processes

• And developers are used to it!

Processing HT data

• Distribute computation over as many nodes as possible:

• Supercomputing centres
• Departmental servers
• Workstations
• PDAs, mobiles, commodity appliances
• Fridges, toasters, etc... as they become available

• Bring developers in

• MPI, PVM are powerful but lack programmers

• OO is intuitive and widely available
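
Spreading independent tasks over very unequal machines needs some rule for who gets how much. One simple possibility, sketched here as an assumption rather than anything the talk prescribes, is to split the task list in proportion to each node's relative capacity (largest-remainder rounding). The node names and capacity figures are invented.

```python
def partition(tasks, capacities):
    """Split tasks among nodes in proportion to their relative capacity.

    capacities maps a node name to any positive number (relative speed).
    Fractional shares are rounded with the largest-remainder method so
    that every task is assigned exactly once.
    """
    total = sum(capacities.values())
    quotas = {node: len(tasks) * cap / total for node, cap in capacities.items()}
    shares = {node: int(quota) for node, quota in quotas.items()}
    leftover = len(tasks) - sum(shares.values())
    by_remainder = sorted(quotas, key=lambda n: quotas[n] - shares[n], reverse=True)
    for node in by_remainder[:leftover]:
        shares[node] += 1

    plan, start = {}, 0
    for node, count in shares.items():
        plan[node] = tasks[start:start + count]
        start += count
    return plan


# Hypothetical pool: one big machine, one server, one workstation.
plan = partition(list(range(10)), {"supercomputer": 6, "server": 3, "workstation": 1})
for node, chunk in plan.items():
    print(node, chunk)
```

This only works because the tasks are independent; jobs that must exchange messages belong on the HPC side of the fence.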

GROCK: HT docking

• Why do we want easy High-Throughput docking?
  – Find the best matches between two molecular structures, for a probe molecule against all molecules in a database
  – Drug against proteins: identify drug function, predict side effects
  – Protein against proteins: identify protein interactions, build interaction networks
  – Protein against drugs: identify candidate drugs for therapy
  – Beyond a single organism

Match molecule vs. database

• Sort pairs by energy
• For each pair
  – Save 1000 best matches
  – Show 10 best for exploration
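
The bookkeeping described above can be sketched in a few lines. This is not GROCK's code: the function names and the (energy, pose_id) representation are assumptions, and so is the convention that a lower energy means a better match.

```python
import heapq


def best_matches(poses, keep=1000, show=10):
    """Return (saved, shown) for one probe/target pair.

    poses is an iterable of (energy, pose_id) tuples; lower energy is
    assumed to mean a better match.
    """
    saved = heapq.nsmallest(keep, poses)  # sorted, best (lowest) energy first
    return saved, saved[:show]


def rank_pairs(results):
    """Order targets by the energy of their best saved match.

    results maps a target name to its non-empty saved list.
    """
    return sorted(results, key=lambda target: results[target][0][0])
```

Using `heapq.nsmallest` avoids sorting every candidate when only the top 1000 are kept, which matters when a single pair yields many more poses than that.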

Other EGEE examples

• GATE: Geant4 application for tomographic emission
• CDSS: Clinical decision support system
• GPS@: Genomics web portal
• SIMRI3D: Magnetic resonance image simulator
• gPTM3D: Interactive radiography visualization
• WISDOM: Docking platform for tropical diseases
• Pharmacokinetics: contrast agent diffusion in MRI
• Bronze standard: evaluation of medical imaging algorithms
• SPLATCHE: Genome evolution modelling
• Mammogrid project
• HealthGrid, EMBRACE, etc...

A distribution architecture

[Diagram labels: Users, Web servers, Front Ends, Back Ends]

Security

• Access control and authentication
  – CAs
• Tru$t
  – VOs
• Encryption (e.g. parrot/perroquet)
• Usage/access policies
• SOCIAL POLITICS (e.g. France)
  – Patient privacy is sacred
• PRIVATE INTERESTS (e.g. Pharma and Biotech)
• Criminal abuse
  – Crackers

Science, Medicine, etc...

• Back to collaboration
  – We need to stand on the shoulders of giants (and dwarfs as well)
  – We need to share information
• Can we rely on personal certificates?
  – Services need server certificates
  – How do we deal with multiple access to private data?
• Does it make sense?
  – Research groups
  – Research projects
  – Doctor(s) and patient(s)
• A brave new World
  – For you to morph

Kudos to

• YOU ALL
  – for being here, your help, encouragement, feedback and support
  – and not falling asleep
• The TEAM at CNB
  – Biocomputing: José M. Carazo, Carlos Pérez-Roca, Enrique de Andrés, Natalia Jiménez, Sjors Schëres, Alfredo
  – Bioinformatics: José R. Valverde, David J. García
• The NA4 Biomed task force
• The EU for EGEE and EGEE-II

Any questions?