1
The myGrid Project
Professor Chris Greenhalgh, University of Nottingham
2
• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture
• Targeted at Tool Developers, Bioinformaticians and Service Providers
Newcastle
Nottingham
Manchester
Southampton
Hinxton
Sheffield
3
Philosophy
• Openness
  – open source
  – open world of services
  – open to wider eScience context
  – open to user feedback
  – open to third party metadata
• Collection of components for assembly
  – Pick and mix
4
Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
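Swiss-Prot flat-file entries such as MURA_BACSU normally place each field on its own line, prefixed with a two-letter line code (ID, DE, GN, OS, KW and so on). As a minimal sketch of how a tool might pick such a record apart, the following groups lines by line code. The helper name `group_by_line_code` and the abbreviated `RECORD` text are illustrative assumptions, not part of myGrid.

```python
from collections import defaultdict

# Abbreviated copy of the MURA_BACSU entry, one field per line (illustrative).
RECORD = """\
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP
"""


def group_by_line_code(text):
    """Group flat-file lines by their two-letter line code (hypothetical helper)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank lines and indented sequence data
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    # Continuation lines with the same code are joined into one string.
    return {code: " ".join(values) for code, values in fields.items()}


entry = group_by_line_code(RECORD)
print(entry["ID"].split()[0])
print(entry["KW"])
```

Joining repeated codes into one string suits multi-line descriptions (DE); a fuller parser would treat feature (FT) lines as separate records.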
5
Use Scenarios
Graves’ Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and Gene expression analysis
• Services from Japan, Hong Kong, various sites in UK
Williams-Beuren Syndrome
• Microdeletion of ~1.5 Mbases on Chromosome 7 (~155 Mbases)
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and Gene expression analysis
• Services from USA, Japan, various sites in UK
6
Williams-Beuren Syndrome Microdeletion
[Figure: physical map of the Williams-Beuren Syndrome region. Chr 7 ~155 Mb; microdeletion ~1.5 Mb at 7q11.23.]
Genes shown: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14
Repeat block copies: C-cen, C-mid, A-cen, B-mid, B-cen, A-mid, B-tel, A-tel, C-tel
Block A: STAG3, PMS2L
Block B: GTF2IP, NCF1P, GTF2IRD2P
Block C: FKBP6T, POM121, NOLR1
Patient deletions: WBS, SVAS
Physical Map: CTA-315H11, CTB-51J22, Gap
7
Manually filling a genomic gap
• Numerous web-based services (e.g. BLAST, RepeatMasker)
• Cutting and pasting
• Large number of steps
• Frequently repeated – info now rapidly added to public databases
• Don’t always get results
• Time consuming
• Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive
• Mundane
• Much knowledge remains undocumented
Therefore: the bioinformatician does the analysis
8
WBS Workflows:
[Figure: workflow diagram; the recoverable steps and their outputs are listed below.]
Input: GenBank Accession No -> GenBank Entry -> Seqret -> Nucleotide seq (Fasta)
Steps on the nucleotide sequence:
• GenScan: Coding sequence
• sixpack: 6 ORFs
• restrict: Restriction enzyme map
• cpgreport: CpG Island locations and %
• RepeatMasker: Repetitive elements
• prettyseq: Translation/sequence file. Good for records and publications
• ncbiBlastWrapper: Blastn vs nr, est databases
• transeq: Amino Acid translation
Steps on the amino acid translation:
• epestfind: Identifies PEST seq
• pscan: Identifies FingerPRINTS
• pepstats: MW, length, charge, pI, etc
• pepcoil: Predicts Coiled-coil regions
• Pepwindow? Octanol?: Hydrophobic regions
• SignalP, TargetP, PSORTII: Predicts cellular location
• InterPro, PFAM, Prosite, Smart: Identifies functional and structural domains/motifs
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
Also shown: URL inc GB identifier; Query nucleotide sequence -> RepeatMasker -> ncbiBlastWrapper -> Sort for appropriate sequences only
Legend: Pink: outputs/inputs of a service. Purple: tailor-made services. Green: EMBOSS Soaplab services. Yellow: Manchester Soaplab services. Grey: unknowns.
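A workflow of this kind can be read as a dependency graph: each service runs once the outputs it consumes are available. The sketch below is a minimal illustration of that idea using Python's standard `graphlib` module (Python 3.9+). The step names come from the slide, but the edges in `DEPENDS_ON` are an assumed reading of the diagram, and the code is not how FreeFluo or Taverna schedule work.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose output it consumes.
# These edges are assumptions for illustration; the arrows of the
# original diagram are not recoverable from the transcript.
DEPENDS_ON = {
    "Seqret": {"GenBank entry"},
    "RepeatMasker": {"Seqret"},
    "GenScan": {"RepeatMasker"},
    "restrict": {"Seqret"},
    "cpgreport": {"Seqret"},
    "transeq": {"GenScan"},
    "pepstats": {"transeq"},
    "pepcoil": {"transeq"},
}


def execution_order(depends_on):
    """Return one valid order in which the steps could be enacted."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(order)
```

Steps that share no dependency (here `restrict` and `cpgreport`) may appear in either order, which is also where an enactment engine could run them in parallel.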
9
Workflow approach: in-silico experiments
Williams-Beuren Syndrome
• Manually: takes two days (+) including analysis
• Now takes 30 mins to produce results and half a day for analysis
• Manually: analysis is done as the experiment is performed
• Workflow: analysis is done at the end of the experiment
• Therefore need good result co-ordination for back-tracking
10
(e-)Scientists…
• …Experiment
  – Can workflow be used as an experimental method?
  – How many times has this experiment been run?
• …Analyze
  – How do we manage the results to draw conclusions from them?
  – How reliable are these results?
• …Collaborate
  – Can we share workflows, results, metadata etc?
• …Publish
  – Can we link to these workflows and results from our papers?
• …Review
  – Can I find, comprehend and review your work?
  – How was that result derived?
11
myGrid Service Stack
[Figure: layered architecture diagram; the recoverable labels are listed below.]
Layers: Applications, Core services, External services
Audiences: Bioinformaticians, Tool Providers, Service Providers
Fabric: Web Service (Grid Service) communication fabric
Components: AMBIT Text Extraction Service, Provenance, Personalisation, Event Notification, Gateway, Service and Workflow Discovery, myGrid Information Repository, Ontology Mgt, Metadata Mgt, Work bench, Taverna, Talisman, Native Web Services, SoapLab, GowLab, Web Portal, Legacy apps, Registries, Ontologies, Views, FreeFluo Workflow Enactment Engine, OGSA-DQP Distributed Query Processor
14
FreeFluo Features
• Control flow, iteration and data flow
• Data sets and nested flows
• Configurable failure handling
• Incorporated Life Science Id resolution
• Provenance and status reporting
• Type and data management
• Plug-ins
• User notification
• Data entry wizard
• Libraries of SHIM services
• Libraries of workflows
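"Configurable failure handling" is not spelled out on the slide. One plausible reading, sketched here purely as an assumption, is a per-step policy of retrying a service a set number of times and then falling back to alternate services. The function `invoke_with_policy` and the two stand-in services are hypothetical and do not reflect the FreeFluo API.

```python
def invoke_with_policy(primary, alternates=(), retries=1):
    """Try `primary` up to `retries + 1` times, then each alternate once."""
    errors = []
    for service in (primary, *alternates):
        attempts = retries + 1 if service is primary else 1
        for _ in range(attempts):
            try:
                return service()
            except Exception as exc:  # a real engine would be more selective
                errors.append(exc)
    raise RuntimeError(f"all services failed after {len(errors)} attempts")


def unreachable_blast():
    """Stand-in for a service that is currently down."""
    raise ConnectionError("service unreachable")


def mirror_blast():
    """Stand-in for an equivalent service hosted elsewhere."""
    return "BLAST report from mirror"


print(invoke_with_policy(unreachable_blast, alternates=[mirror_blast], retries=2))
```

The fallback only makes sense when redundant, equivalent services exist, which is the situation the Domain Services slide describes.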
15
Domain Services
• Native WSDL Web services
  – DDBJ, NCBI BLAST, PathPort, BioMOBY
• Wrapped legacy services
  – SoapLab
  – GowLab
• Web pages as web services
  – One button wrapping
  – Leveraged the EMBOSS Suite
  – ~159 services
• Lots of them and lots of redundant services
• The joys of firewalls and licensing
EBI Support agreed to support Soaplab services as core business
http://industry.ebi.ac.uk/soaplab/
For each application: CreateJob, Run, WaitFor, GetResults, Destroy
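The five operations listed for each application describe a job lifecycle. The sketch below walks one job through that lifecycle against an in-memory stand-in. `FakeAnalysisService`, `run_analysis` and `gc_content` are hypothetical names invented for illustration; they mirror the operation names on the slide but are not the real Soaplab interface.

```python
class FakeAnalysisService:
    """In-memory stand-in for a wrapped tool; not the real Soaplab interface."""

    def __init__(self, tool):
        self.tool = tool      # callable doing the actual work
        self.jobs = {}        # job id -> job state
        self.next_id = 0

    def create_job(self, inputs):
        self.next_id += 1
        self.jobs[self.next_id] = {"inputs": inputs, "status": "CREATED"}
        return self.next_id

    def run(self, job_id):
        job = self.jobs[job_id]
        job["results"] = self.tool(**job["inputs"])
        job["status"] = "COMPLETED"

    def wait_for(self, job_id):
        # A remote service would block or poll here; the fake finishes in run().
        return self.jobs[job_id]["status"]

    def get_results(self, job_id):
        return self.jobs[job_id]["results"]

    def destroy(self, job_id):
        del self.jobs[job_id]


def run_analysis(service, **inputs):
    """Drive one job through create, run, wait, get results, destroy."""
    job_id = service.create_job(inputs)
    try:
        service.run(job_id)
        status = service.wait_for(job_id)
        if status != "COMPLETED":
            raise RuntimeError(f"job {job_id} ended with status {status}")
        return service.get_results(job_id)
    finally:
        service.destroy(job_id)  # always release the job's state


def gc_content(sequence):
    """Toy analysis: fraction of G and C bases in a nucleotide sequence."""
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


service = FakeAnalysisService(gc_content)
print(run_analysis(service, sequence="AAGCTTTTCT"))
```

Putting `destroy` in a `finally` clause means a failed run does not leave job state behind on the service.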
16
Two+ Paths
Core functionality
• Services: Soaplab and Gowlab
• Workflow enactment engine: FreeFluo
• Workflow workbench: Taverna
• Data integration: OGSA-DQP
• Information model & management
Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining
In between
• Event notification
• Gateway
17
Drilling Down: myGrid and Semantics
• Workflow and service discovery
  – Prior to and during enactment
  – Semantic registration
• Workflow assembly
  – Semantic service typing of inputs and outputs
• Provenance of workflows and other entities
• Experimental metadata glue
• Use of RDF, RDFS, DAML+OIL/OWL
  – Instance store, ontology server, reasoner
  – Materialised vs at point of delivery reasoning
• myGrid Information Model
18
Provenance (1)
[Figure: three-level provenance model; the recoverable labels are listed below.]
Levels: Organisation level provenance, Process level provenance, Data/knowledge level provenance
Entities: Project, Experiment design, Workflow design, Workflow run, Person, Organisation, Process, Service, Event, Data item
Relationships:
• instanceOf
• partOf
• componentProcess, e.g. web service invocation of BLAST @ NCBI
• componentEvent, e.g. completion of a web service invocation at 12.04pm
• runBy, e.g. BLAST @ NCBI
• run for
• data derivation, e.g. output data derived from input data
• knowledge statements, e.g. similar protein sequence to
User can add templates to each workflow process to determine links between data items.
19
GI        Accession    Score  Description
19747251  AC005089.3   831    Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
15145617  AC073846.6   815    Homo sapiens BAC clone RP11-622P13 from 7, complete sequence
15384807  AL365366.20  46.1   Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence
7717376   AL163282.2   44.1   Homo sapiens chromosome 21 segment HS21C082
16304790  AL133523.5   44.1   Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence
34367431  BX648272.1   44.1   Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119)
5629923   AC007298.17  44.1   Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence
34533695  AK126986.1   44.1   Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486
20377057  AC069363.10  44.1   Homo sapiens chromosome 17, clone RP11-104J23, complete sequence
4191263   AL031674.1   44.1   Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence
17977487  AC093690.5   44.1   Homo sapiens BAC clone RP11-731I19 from 2, complete sequence
17048246  AC012568.7   44.1   Homo sapiens chromosome 15, clone RP11-342M21, complete sequence
14485328  AL355339.7   44.1   Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence
5757554   AC007074.2   44.1   Homo sapiens PAC clone RP3-368G6 from X, complete sequence
4176355   AC005509.1   44.1   Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence
2829108   AF042090.1   44.1   Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence

>gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAGGAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTCAAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCTGTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG
[Figure: RDF provenance graph for a BLAST report; the recoverable labels are listed below.]
Nodes: urn:lsid:taverna:datathing:15, urn:lsid:taverna:datathing:13, service invocation, workflow invocation, workflow definition, experiment definition, project, person, group, service description, organisation
Types (rdf:type): ..BLAST_Report, ..nucleotide_sequence
Edge labels: ..similar_sequences_to, ..created_by, ..described_by, ..run_during, ..invocation_of, ..part_of, ..works_for, ..author, ..run_for, ..masked_sequence_of, ..filtered_version_of
Legend: A: Relationship BLAST report has with other items in the repository. B: Other classes of information related to BLAST report.
RDF Rules
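The provenance graph can be thought of as a set of subject, predicate, object statements that are followed link by link to answer "how was that result derived?". The sketch below holds a few such statements in plain Python tuples. The identifiers and predicate names come from the slide, but which nodes each edge joins is an assumption, and the helpers `objects` and `trace` are hypothetical; a real deployment would use an RDF store.

```python
# Provenance statements as (subject, predicate, object) triples.
# Identifiers and predicate names come from the slide; the endpoints of
# each edge are assumed for illustration.
BLAST_REPORT = "urn:lsid:taverna:datathing:15"
QUERY_SEQ = "urn:lsid:taverna:datathing:13"

TRIPLES = [
    (BLAST_REPORT, "rdf:type", "BLAST_Report"),
    (QUERY_SEQ, "rdf:type", "nucleotide_sequence"),
    (BLAST_REPORT, "similar_sequences_to", QUERY_SEQ),
    (BLAST_REPORT, "created_by", "service invocation"),
    ("service invocation", "run_during", "workflow invocation"),
    ("workflow invocation", "invocation_of", "workflow definition"),
    ("workflow definition", "author", "person"),
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def trace(start):
    """Follow outgoing edges from `start` and return each reachable node once."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(o for s, _, o in TRIPLES if s == node)
    return seen


print(objects(BLAST_REPORT, "created_by"))
print(trace(BLAST_REPORT))
```

Tracing from the BLAST report reaches both the query sequence and the person who authored the workflow, which is the kind of back-tracking the review questions ask for.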
20
Information Model v2
• Resources and Identifiers
• People, teams and organizations
• Representing the e-science process
• Experimental methods for e-science
• Scientific data and the life-science identifier
  – Types
  – Identifier Types
  – Values and Documents
• Provenance information
• Annotation and Argumentation
In the middle of deployment
Bioinformatics middleware – domain neutral
[Figure: UML class diagram; the recoverable labels are listed below.]
Classes: Agent, ExperimentInstance, LabBookView, SubjectObject, Resources.Resource, ProgrammeResource, Study, Programme, Operations.Operation, ExperimentDesign, Investigation, PeopleAndTeams.Person, StudyRole, StudyParticipation, scmInvestigator
Attributes as shown:
• LabBookView: +name:String, +rule:String
• Resources.Resource: +getId:URIString
• ProgrammeResource: +name:String
• Study: +name:String, +description:String, +startTime:DateTime, +endTime:DateTime, +status:String
• StudyRole: +roleName:String, +description:String
Association labels: uses, contains, selected studies, method, acts in, labBooks, has participants, participates in, uses method, has instances (multiplicities 1, 0..*, 1..*)
21
LSIDs
• LSID provides a uniform naming scheme.
• LSID Resolver guarantees to resolve to same data object.
• LSID Authority dishes them out.
• Also returns metadata of object.
• Used throughout myGrid as an object naming device.
• myGrid Repository acts as an LSID Authority.
• LSID allows universal access to results for collaboration, as well as for review.
• RDF+LSID explains the context of results, and provides guidance for further investigations.
Pioneered by myGrid
I3C / IBM / EBI proposal for a Life Science Identifier
http://www.i3c.org/wgr/ta/resources/lsid/docs/
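An LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing revision, as in `urn:lsid:taverna:datathing:15`. The sketch below splits an identifier into those parts. The `LSID` tuple and `parse_lsid` function are illustrative names, and the check is deliberately minimal rather than a full implementation of the specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID URN into its components, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    return LSID(*parts[2:])


print(parse_lsid("urn:lsid:taverna:datathing:15"))
```

Resolving an identifier to data or metadata is the job of the authority named in the first component; this sketch only handles the naming syntax.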
23
In a nutshell
Pre-Prototype
Prototype 1
Experimental
Web-based
Requirements gathering
Architectural workout
All services represented
NetBeans workbench
API-based integration
Info Repository oriented
XML-based process provenance
Workflow enactment engine
Prototype 2
Second generation services
Reworked information model
Open information management
Life Science Identifiers
RDF based provenance
Taverna workbench
Web-based portal
Demo at ISMB 2003
Full paper and demo at ISMB 2004
GSK deployment
Real biology
24
To Dos
• Improve results management
• Deployment of mIR
• Portal for finding workflows, launching & monitoring workflows, launching Taverna, browsing results
• Deploying publicly accessible semantic registry
• Reinstate service discovery during enactment
• Large scale data throughput workflow engine
• Event notification on services
• Using provenance graphs for impact analysis
• Hiding LSIDs
• Lexicons for concept names
• Hardening semantic discovery
• Ambient Text
• Er..Security
• Etc…
• “myGrid in a box”
25
Ongoing/Future Activities
• myGrid-in-a-box
• Technical follow-ons
  – Best practice (6) and OMII (FreeFluo, Taverna, Event notification) bids
• Research follow-ons
  – Semantic Grids, Data Grids, Workflow, Provenance services
  – PhD students
• Science follow-ons
  – Life Sciences: ISPIDER, e-Fungi
  – Clinical: PsyGrid, CLEF-II
  – PhD students
• Networking
  – Link-up with BIRN/SEEK/GEON (SDSC) & SCEC/GriPhyN (ISI, USC)
26
Wrap Up
• Managed the transition from generic middleware development to practical, useful day-to-day services
  – Real users (plural) fundamental to that
• End to end support for an entire scenario
  – A broad view of the e-Science process
• Show stoppers for practical adoption are not sexy technical showstoppers
  – Can I incorporate my favourite service?
  – Can I manage the results?
• Tapping into (de facto) standards and communities to leverage others’ results and tools
  – LSID, Haystack, Pedro…
• http://www.mygrid.org.uk
27
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net
28
myGrid People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker