Grids and Biology
Professor Carole Goble
University of Manchester, UK
BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK
28th October 2002
Grids and Biology
• A take on the Grid
• Issues in Bioinformatics for Grid
• Various BioGrids
• Applicability of Grid to Biology
• Reality check
What is the Grid?
“Grid computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation ... we review the "Grid problem", which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations.”
From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke
What is the Grid?
• Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
• On-demand, ubiquitous access to computing, data, and services
• New capabilities constructed dynamically and transparently from distributed services
• No central location, no central control, no existing trust relationships, little predetermination
• Uniformity for pooling resources
• Virtual pools of resources: databases, clusters…
Biology as a Grid Application
• Informational science
• Large scale
• Distributed
• No one organisation owns it all
Motivation
[Figure: timeline 1990, 2000, 2010. Labels: ESTs; Combinatorial Chemistry; Human Genome; Pharmacogenomics; Metabolic Pathways; Computational Load; Genome Data; Moore's Law]
BioMedical Computation [Rick Stevens, Argonne Labs]
[Figure labels: Experiments; Computation; Large-scale Genome Sequencing; Genome Sequences; Assembled Genomes; Genes and Gene Structures; Genes, Proteins, RNAs, and other Biomolecules; Biochemical Pathways & Processes; Simulation of Metabolic and Signal Transduction Pathways; Cellular & Developmental Processes; Tissue and Organismal Physiology; Ecological Processes and Populations; Sequence Variation of Populations; Reconstructing Phylogeny, Homology, and Comparative Approaches; Predicting Protein Sequence; Simulating and Understanding Gene Expression Networks; Predicting Three-Dimensional Structures of Proteins and RNAs; Predicting Functions; Predicting Catalysis, Molecular Dynamics; Structures of Multi-molecular Complexes; Predicting Effects of Variation; Morphogenesis and Development]
Biomedical Data: High Complexity and Large Scale
...atcgaattccaggcgtcacattctcaattcca... billions
DNA sequences, alignments
MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...
Proteins: sequence, 2º structure, 3º structure
Hundred thousands
Protein-protein interactions: metabolism, pathways, receptor-ligand, 4º structure
millions
billions
Polymorphism and variants: genetic variants, individual patients, epidemiology
Physiology: cellular biology, biochemistry, neurobiology, endocrinology, etc.
millions
millions
ESTs: expression patterns, large-scale screens
Genetics and maps: linkage, cytogenetic, clone-based
[Rick Stevens, Argonne Labs]
BioGrid Projects
• EUROGRID BioGRID
• Asia Pacific BioGRID
• North Carolina BioGrid
• Bioinformatics Research Network
• Osaka University BioGrid
• Indiana University BioArchive BioGrid
• myGrid
• BioSim
• e-Protein
• ObiGrid
Today’s Grid
A Single System Image
• Transparent wide-area access to large data banks
• Transparent wide-area access to applications on heterogeneous platforms
• Transparent wide-area access to processing resources
Security, certification, single sign-on authentication, AAA
• Grid Security Infrastructure
Data access, transfer & replication
• GridFTP, Giggle
Computational resource discovery, allocation and process creation
• GRAM, Unicore, Condor-G
Immediate benefits
• Uniform file views of directories, regardless of platform
• Grid-based data transfer libraries for faster access to large files, reducing the need for mirror-site servers. Replication to support mirroring
• Grid APIs provide a job manager with metadata about the services available to the user
• Evaluate the quality of service providers based on factors that may include more than just server performance and availability
• Grid-aware applications: split sequence reference libraries among several servers, where BLAST comparisons can be conducted in parallel
• Shielding users from a variety of low-level computing problems that they would otherwise have to address themselves
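The idea of splitting a sequence reference library among several servers and comparing in parallel can be sketched as below. This is only an illustration under stated assumptions: the shared k-mer score is a toy stand-in for a real BLAST run, local threads stand in for remote servers, and every name (`split_library`, `search_shard`, `parallel_search`, `LIBRARY`) is hypothetical rather than part of any Grid toolkit.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_library(library, n_shards):
    """Deal (name, sequence) records round-robin into n_shards lists."""
    shards = [[] for _ in range(n_shards)]
    for i, record in enumerate(library):
        shards[i % n_shards].append(record)
    return shards


def search_shard(query, shard, k=3):
    """Score each record in one shard by the k-mers it shares with the query.

    Toy stand-in for the comparison one server would run on its shard.
    """
    q = kmers(query, k)
    return [(name, len(q & kmers(seq, k))) for name, seq in shard]


def parallel_search(query, library, n_shards=3, top=2):
    """Search all shards concurrently and merge the partial hit lists."""
    shards = split_library(library, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial = list(pool.map(lambda s: search_shard(query, s), shards))
    hits = [hit for part in partial for hit in part]
    # Best score first; ties broken by name so the result is deterministic.
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits[:top]


# Hypothetical four-record reference library.
LIBRARY = [
    ("seqA", "ATCGAATTCCAGG"),
    ("seqB", "GGGGGGGGGGGGG"),
    ("seqC", "CGAATTCCTTTTT"),
    ("seqD", "TTTTTTTTTTTTT"),
]

if __name__ == "__main__":
    print(parallel_search("ATCGAATTCC", LIBRARY))
```

The merge step is the only place where the shards meet, which is why the library can be partitioned freely as long as per-record scores do not depend on the rest of the library. Real BLAST statistics such as E-values depend on total database size, so a real deployment would need to account for that.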
Grid Landscape
Computationally Intensive
Data Intensive
Collaborative
Visualisation
Knowledge Intensive
Classical Grids
• Classical Grids emphasise sharing of physical resources.
• Existing Grid middleware (e.g. Globus, Condor, Unicore) allows resource discovery, resource allocation, data movement, certification …
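Resource discovery and resource allocation can be pictured with a small matchmaking sketch. It is a simplified illustration only, not the API of Globus, Condor or Unicore: the `Resource` and `Job` records, the site names, and the "most free CPUs wins" policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A hypothetical advertised compute resource."""
    name: str
    free_cpus: int
    os: str


@dataclass
class Job:
    """A hypothetical job request with its requirements."""
    name: str
    cpus: int
    os: str


def discover(registry, job):
    """Resource discovery: list the resources that can satisfy the job."""
    return [r for r in registry if r.os == job.os and r.free_cpus >= job.cpus]


def allocate(registry, job):
    """Resource allocation: reserve CPUs on the match with the most free CPUs.

    Returns the chosen resource name, or None when nothing matches.
    """
    matches = discover(registry, job)
    if not matches:
        return None
    best = max(matches, key=lambda r: r.free_cpus)
    best.free_cpus -= job.cpus
    return best.name


if __name__ == "__main__":
    registry = [
        Resource("cluster-a", 4, "linux"),
        Resource("cluster-b", 16, "linux"),
        Resource("smp-c", 64, "irix"),
    ]
    print(allocate(registry, Job("blast", 8, "linux")))
    print(allocate(registry, Job("md", 12, "linux")))
```

Keeping discovery separate from allocation mirrors the way the slide lists them as distinct middleware functions: a client can inspect the candidates before committing to one.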
High Performance Bioinformatics Software
[Jack da Silva, NCSC, Paracel]
European DataGrid
Managed access to specialist remote resources
• Access portal for biomolecular modeling resources
• Interfaces that enable chemists and biologists to submit work to HPC facilities
• Visualization of the electrostatic field generated by a molecule
Dr Krzysztof Nowinski (ICM)
Biogrid system
[Figure labels: 1000Base-T x 12; Myrinet-2000; 1000Base-SX; Data Grid Disk: Express5800/140Ra-4 x 3; Grid system 1: Express5800/ISS for PC-Cluster, Xeon 2.2G x 8 + Management node 1; Grid system 2: NEC Blade Server, 78 node (156 CPU); Flat Neighborhood networks; SCORE Management Station (x 2); Connected to Grid system 3]
Remote control of instruments
• Sharing of UHVEM (Ultra High Voltage Electron Microscopy) in Osaka University with NCMIR (National Center for Microscopy and Imaging Research)
• 3 million electron volts: the most powerful microscope
[Figure labels: UHVEM (Osaka, Japan); Osaka University; JGN; Tokyo XP; APAN; TransPAC; STAR TAP (Chicago); vBNS; SDSC (UC San Diego); NCMIR (San Diego)]
Home Computers Evaluate AIDS Drugs
• Community = 1000s of home computer users
• Philanthropic computing vendor (Entropia)
• Research group (Scripps)
• Common goal = advance AIDS research
From Steve Tuecke 12 Oct. 01
Matlab
Matlab and toolboxes for mathematical computation, analysis, visualization, and algorithm development:
CROSS PLATFORM / OS
“MATLAB is an intuitive language and a technical computing environment. It provides core mathematics and advanced graphical tools for data analysis, visualization, and algorithm and application development. With more than 600 mathematical, statistical, and engineering functions, engineers and scientists rely on the MATLAB environment for their technical computing needs.”
(www.mathworks.com)
Geodise release in November [email protected]
BioSim -- Molecular simulations as a tool for protein structure analysis
Overall vision – simulation as an integral component of structural genomics
Needs both capacity (many systems) and capability (large systems - HPCx)
Molecular Dynamics database (distributed)
synchrotron
MD database
novel biology…
compute GRID
[Sansom]
Grid Landscape
Computationally Intensive
Data Intensive
Collaborative
Visualisation
Knowledge Intensive
Visualization + Bioinformatics [Rick Stevens, Argonne Labs]
[Figure labels: Metabolic Reconstruction; Function Assignment; Stoichiometric Representation & Flux Analysis; Dynamic Simulation; Network Visualization Tools; Genome Visualization Tools; Whole Cell Visualizations; Image/Spectra Augmentations; Interactive Stoichiometric Graphical Tools; Laboratory Verification; Visualization Environment; Bioinformatic Analysis Tools; Whole Genome Analysis; Microbiology & Biochemistry; Enzymatic Constants; Metabolic ***; Proteomics]
X-ray microtomography
• Scientific discovery can be enhanced by closely coupling computation and experiment.
• Simulation, visualization and data gathering coupled
• X-ray microtomography produces 3D X-ray attenuation maps of specimens at a microscopic level
• Expensive synchrotron beam time resources optimally used to obtain sufficient resolution for simulation
Interactive Steering
Enables controlled simulation using knowledge and skills of trained scientist.
•User steers calculation from laptop
•Controlled steering on supercomputers
•Visualization and computation use large scale machines accessed via Grid.
Scalable molecular dynamics
• Structure of a protein in a fluid medium
• Calculation takes into account forces between protein and ambient medium (in this case water molecules)
• Run on the world's largest academic computer, LeMieux at PSC (6 Tflops theoretical peak)
Grid Landscape
Computationally Intensive
Data Intensive
Collaborative
Visualisation
Knowledge Intensive
UCSF
UIUC
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
http://www.ks.uiuc.edu/Research/biocore/
Grid Landscape: DATA!!
Computationally Intensive
Data Intensive
Collaborative
Visualisation
Knowledge Intensive
Information Weaving and Question Answering
• Large amounts of different kinds of data & many applications.
• Highly heterogeneous: different types, algorithms, forms, implementations, communities, service providers
• High autonomy.
• Highly complex and inter-related, & volatile.
Annotation Pipeline [Mike Sternberg]
[Figure labels: sequences; SCOP; CATH; PDB; NRPROT; proteome sequences; PDB hit; no PDB hit; TM, CC, LC, SIG & MOTIFS; PSIBLAST & HMMs; structure-based function prediction; structural and functional annotation; INTERPRO; 3D modelling x 2; fold recognition x 2]
myGrid
Personalised extensible environments for data-intensive in silico experiments in biology
RASMOL
Straightforward discovery, interoperation, deployment & sharing of services
Service-oriented architecture
Integration and Information Workflow & Databases
Experimentation Provenance, propagating change, personalisation
For bioinformaticians who are building tools and using or providing services
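The pairing of workflow and provenance named above can be illustrated with a minimal sketch. This is not myGrid's design or API: the step functions, the `run_workflow` helper and the short SHA-256 fingerprints are assumptions chosen only to show how each workflow step could leave a record linking its input to its output.

```python
import hashlib


def digest(value):
    """Short, stable fingerprint of a value, used to link provenance records."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def run_workflow(steps, data):
    """Run (name, function) steps in order and record provenance for each.

    Returns the final result and a list of provenance records.
    """
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append(
            {"step": name, "input": digest(data), "output": digest(result)}
        )
        data = result
    return data, provenance


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G and T."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical two-step in silico experiment.
STEPS = [
    ("reverse_complement", reverse_complement),
    ("gc_fraction", gc_fraction),
]

if __name__ == "__main__":
    result, provenance = run_workflow(STEPS, "ATGC")
    print(result)
    for record in provenance:
        print(record)
```

Because each record's output fingerprint equals the next record's input fingerprint, the chain can be walked backwards from a result to the data it came from, which is the kind of question that propagating change and provenance tracking have to answer.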
High Throughput Computing Services
Distributed Data Engineering: Data Registration, Data Normalisation, Data Quality
Information Structuring: Information Integration & Composition, Semantics & Domain-based Ontologies, Sharing
Grid-based Knowledge Discovery: Grid-based Data Mining, Collaborative Visualisation
DiscoveryNet
High Throughput Sensing (HTS) Applications
Large-scale Dynamic Real-time Decision Support
Large-scale Dynamic System Knowledge Discovery
Grid Basic Infrastructure: Globus/Condor/SRB
Utilising Grid Infrastructure for HT Computing
Based on Kensington Discovery Platform
Based on Globus & ORB Infrastructure
http://www.discovery-on-the.net/
Bio Chip Applications
Protein-folding chips: SNP chips, Diff. Gene chips using LFII
Protein-based fluorescent micro arrays
1-1000 10-1000 >10000
Data Quality, Visualisation, Structuring, Clustering, Distributed Dynamic Knowledge Management
Grid Evolution
1st Generation Grid
• Computationally intensive, file access/transfer
• Bag of various heterogeneous protocols & toolkits
• Recognises internet, ignores Web
• Academic teams
2nd Generation Grid
• Data intensive -> knowledge intensive
• Services-based architecture
• Recognises Web and Web services
• Global Grid Forum
• Industry participation
We are here!
[Figure: Nodes connected over a Gigabit IP Network]
Geographically (e.g. UKGrid)
Grid Middleware
A Grid of resources, not just compute resources but databases, digital libraries, instruments, workflows, documents …
MouseGrid
NovartisGrid
BioSimGrid
These configurations are dynamic
Resources discovered, combined, used and disbanded as and when needed or available.
A Grid vs The Grid
Physical
Logical
A configuration of resources
Not just compute services but databases, digital libraries, instruments, workflows, documents …
services
Web Services
Grid Technology
Grid Services
Open Grid Services Architecture (OGSA)
Bio Services
Domain Oriented Services
Basic BioGrid Services
Grid Resource Services
Common Services
Base Services
Fabric Services
• Drug Discovery
• Microbial Engineering
• Molecular Ecology
• Oncology Research

• Integrated Databases
• Sequence Analysis
• Protein Interactions
• Cell Simulation

• Compute Services
• Pipeline Services
• Data Archive Service
• Database Hosting
• Workflow Enactment
• Event notification
What We Need to Create
• Grid Bio applications enablement software layer
  • Provides applications access to Grid services
  • Provides OS independent services
• Grid enabled version of bioinformatics data management tools (e.g. DL, SRS, etc.)
  • Need to support virtual databases via Grid services
  • Grid support for commercial databases
• Bioinformatics applications “plug-in” modules
  • End user tools for a variety of domains
  • Support major existing Bio IT platforms
Requirements for the BioGrid
• Open and extendable architecture
  • Enable tie in to service stack at appropriate points
  • Not just access via Portals
• Leverage scripting tools in wide use for Bioinformatics
  • Create BioGrid services bindings for Perl and Python
• Address data federation and integration
  • Leverage work of IBM, Lion BioSciences, DAS, BioMOBY, etc.
• Match the biology workflow and tool chain
  • Create high-level BioGrid services to address critical stages in existing workflow
  • Support composability of new BioGrid tools with existing tool chain elements
Some BioGrid Challenges
• Scalable human bioinformatics expertise
  • Best people working on the important problems
  • Exploit collaboration technology to create world class teams
• Robust local bioinformatics computing environment
  • Best systems administrators and high-end technologies
  • Embed local resources into the Grid via portal technologies
• Access to leading edge bioinformatics software and databases customized to user needs
  • Core content from top scientists and developers
  • Integrated access to biological databases
• Worldwide access to robust computing and database infrastructure
  • Leverage Grid technology to provide worldwide access
  • Integrate purpose built systems and service providers
Reality Checks!!
• The Technology is Ready
  • Not true: it’s emerging
  • Building middleware, advancing standards, developing dependability
  • Building demonstrators
  • The computational grid is in advance of the data intensive middleware
  • Integration and curation are probably the obstacles
  • But!! It doesn’t have to be all there to be useful.
• We know how we will use grid services
  • No: disruptive technology
  • Lower the barriers of entry.
Reality Checks!!
• It’s the only game
  • Not true: I3C, BioMOBY, bioDAS, OMG LSR
  • Grid and Web service merge makes integration likely.
• One Size Fits All
  • Not true
  • Addressed by a minimum set of composable virtual services, but starting with Globus
• It’s only for “big” science
  • No: “small” science collaborates too!
• Biology is not unique!
  • AstroGrid
• Not a silver bullet!
  • It’s just middleware, not magic
• Data quality
• Content management of databases (controlled vocabularies)
• Provenance and versioning policies
• Appropriate use of tools
• Computational inaccessibility of free text annotation
• Database accessibility through means other than point and click web interfaces
Independent of the Grid!
Life Sciences Grid (LSG)
http://people.cs.uchicago.edu/~dangulo/LSG/