Dr Mark Greenwood University of Manchestermurli/GridSummerSchool2004/presentations/My… · •...


A myGrid Project Tutorial

Dr Mark Greenwood
University of Manchester

With considerable help from Justin Ferris, Peter Li, Phil Lord, Chris Wroe, Carole Goble and the rest of the myGrid team.


• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture
• Targeted at Tool Developers, Bioinformaticians and Service Providers

[Map of partner sites: Newcastle, Nottingham, Manchester, Southampton, Hinxton, Sheffield]


myGrid People

Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.

Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK

Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire

Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)

Collaborators
• Keith Decker


Roadmap - start

[Diagram labels: services, data]


Philosophy
• Openness
  – open source
  – open world of services
  – open to wider eScience context
  – open to user feedback
  – open to third party metadata
• Collection of components for assembly
  – Pick and mix


Tenet I
• High level Middleware services for data intensive resource interoperation for Bioinformatics
  – Information Grid not computational Grid
• Exploratory, ad hoc
• For individuals
• In silico experiment as workflow
• Distributed query processing
• Information Management


Tenet II
• High level services for e-Science experimental management
  – Provenance
  – Event notification
  – Personalisation
• Sharing knowledge and sharing components
  – Scientific discovery is personal & global.
  – Federated third party registries for workflows and services
  – Workflow and service discovery for reuse and repurposing

[Diagram: Registry with the operations Register, Find and Annotate]
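
The three Registry operations named on this slide (Register, Find, Annotate) can be pictured with a toy in-memory registry. This is only a sketch in Python: the class ToyRegistry, its method names and the example entries are hypothetical and are not the myGrid registry API. The point it illustrates is that annotations are held separately from the registration, so a third party can add metadata without touching the provider's entry.

```python
from collections import defaultdict


class ToyRegistry:
    """Hypothetical in-memory registry; not the myGrid registry API."""

    def __init__(self):
        self._entries = {}                    # name -> endpoint description
        self._annotations = defaultdict(set)  # name -> set of (author, term)

    def register(self, name, endpoint):
        """A provider publishes a service or workflow under a name."""
        self._entries[name] = endpoint

    def annotate(self, name, author, term):
        """Anyone, including a third party, attaches a descriptive term."""
        if name not in self._entries:
            raise KeyError(f"unknown entry: {name}")
        self._annotations[name].add((author, term))

    def find(self, term):
        """Return the sorted names of entries annotated with the term."""
        return sorted(
            name
            for name, notes in self._annotations.items()
            if any(t == term for _, t in notes)
        )


registry = ToyRegistry()
# Endpoints below are made-up placeholders.
registry.register("blast_wrapper", "http://example.org/blast?wsdl")
registry.register("repeat_masker", "http://example.org/rm?wsdl")
registry.annotate("blast_wrapper", "provider", "similarity_search")
registry.annotate("blast_wrapper", "third_party", "nucleotide")
registry.annotate("repeat_masker", "third_party", "nucleotide")

print(registry.find("nucleotide"))
```

Keeping the author with each annotation is a deliberate choice in this sketch: it lets a user decide later whose descriptions to trust.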


Tenet III
• Open Source and Open Services
  – No control or influence over service providers
• Open to third party metadata and services
• Open extensible architecture
  – Assemble your own components
  – Designed to work together
  – Toolkit

[Diagram of toolkit components: Freefluo WfEE, Taverna, View, UDDI registry, Event Notification, mIR, Pedro, Semantic Discovery, Info. Model, Soaplab, Gateway & Portal, LSID, Haystack Provenance Browser]


Tenet IV
• (Web) Service architecture
  – Publication, discovery, interoperation, composition, decommissioning of myGrid services
  – WS-I -> OGSA / WSRF
• Metadata driven
  – Ontologies
  – Common information model
  – Semantic Web technologies
    • RDF, OWL


Tenet V

Middleware for
• Tool Developers
• Bioinformaticians
• Service Providers
• Biologists are indirectly supported by the portals and apps these develop.


Roadmap

[Diagram labels: run workflows, discover services, data management; services, workflows, data]


Data-intensive bioinformatics

ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
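
Records like the one on this slide are line-oriented: each line begins with a two-letter code (ID, DE, OS and so on) and the residues follow the SQ line. A minimal Python sketch of reading such a record is given here. The function name parse_flat_record and the shortened record text are illustrative assumptions; a real parser handles many more line types and edge cases.

```python
def parse_flat_record(text):
    """Group the lines of a Swiss-Prot style record by their two-letter code.

    Returns (fields, sequence): fields maps a line code such as "DE" to the
    list of its line contents; sequence is the residues after the SQ line
    with spaces removed.
    """
    fields = {}
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence and line.startswith(" "):
            residues.append(line.replace(" ", ""))
            continue
        code, _, content = line.partition(" ")
        fields.setdefault(code, []).append(content.strip())
        in_sequence = code == "SQ"
    return fields, "".join(residues)


# Shortened copy of the record on the slide (first sequence line only).
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
"""

fields, sequence = parse_flat_record(RECORD)
print(fields["OS"][0])    # organism line
print(len(fields["OC"]))  # number of classification lines
print(len(sequence))      # residues read from the shortened record
```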


Use Scenarios

Graves’ Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and Gene expression analysis
• Services from Japan, Hong Kong, various sites in UK

Williams-Beuren Syndrome
• Microdeletion of 1.55 Mbases on Chromosome 7
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and Gene expression analysis
• Services from USA, Japan, various sites in UK


Manually filling a genomic gap

Two major steps:
• Extend into the gap: Similarity searches; RepeatMasker, BLAST
• Characterise the new sequence: NIX, Interpro, etc…

• Numerous web-based services (e.g. BLAST, RepeatMasker)
• Cutting and pasting between screens
• Large number of steps
• Frequently repeated – info now rapidly added to public databases
• Don’t always get results
• Time consuming
• Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive
• Mundane
• Much knowledge remains undocumented
• Bioinformatician does the analysis
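
The manual procedure on this slide, chaining tools by cutting and pasting between web pages, is what a workflow automates. A minimal Python sketch of the idea follows: each step is a function over the data gathered so far, and the runner keeps every intermediate result by name rather than leaving it in a lab book. The step functions are hypothetical stand-ins that do trivial string work; they do not call RepeatMasker or BLAST.

```python
def mask_repeats(data):
    """Stand-in for a repeat-masking service: lower-case bases become N."""
    return "".join("N" if ch.islower() else ch for ch in data["sequence"])


def similarity_search(data):
    """Stand-in for a similarity search: only reports lengths."""
    masked = data["masked"]
    return {
        "query_length": len(masked),
        "unmasked": len(masked) - masked.count("N"),
    }


# A workflow is an ordered list of (output name, step function).
WORKFLOW = [
    ("masked", mask_repeats),
    ("hits", similarity_search),
]


def run_workflow(steps, inputs):
    """Run the steps in order, keeping every intermediate result by name."""
    data = dict(inputs)
    log = []
    for name, step in steps:
        data[name] = step(data)
        log.append((step.__name__, name))
    return data, log


results, log = run_workflow(WORKFLOW, {"sequence": "ACGTacgtACGT"})
print(results["masked"])
print(results["hits"])
print(log)
```

Because the step list is plain data, the same sequence of steps can be rerun unchanged when the public databases are updated, which is the "frequently repeated" case in the list above.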


WBS Workflows

[Workflow diagram, flattened in extraction; the recoverable labels follow.]

Input and retrieval: GenBank Accession No, URL inc GB identifier, GenBank Entry, Seqret, Nucleotide seq (Fasta)

Nucleotide sequence services
• GenScan: Coding sequence
• sixpack: 6 ORFs
• prettyseq: Translation/sequence file. Good for records and publications
• restrict: Restriction enzyme map
• cpgreport: CpG Island locations and %
• RepeatMasker: Repetitive elements
• ncbiBlastWrapper: Blastn vs nr, est databases
• transeq: Amino Acid translation

Protein sequence services
• epestfind: Identifies PEST seq
• pepcoil: Predicts Coiled-coil regions
• pepstats: MW, length, charge, pI, etc
• pscan: Identifies FingerPRINTS
• SignalP, TargetP, PSORTII: Predicts cellular location
• InterPro, PFAM, Prosite, Smart: Identifies functional and structural domains/motifs
• Pepwindow? Octanol?: Hydrophobic regions
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr

Other labels: ORFs; RepeatMasker; Query nucleotide sequence; ncbiBlastWrapper; Sort for appropriate Sequences only

Colour key
• Pink: Outputs/inputs of a service
• Purple: Tailor-made services
• Green: Emboss soaplab services
• Yellow: Manchester soaplab services
• Grey: Unknowns


Graves’ Disease Bioinformatics

[Diagram, flattened in extraction; the recoverable labels follow.]

Annotation Pipeline: What is known about my candidate gene?
• Medline, OMIM, GO, BLAST, EMBL, DQP, Query

Genotype Assay Design System: Is this SNP present in my samples?
• Primer Design: Emboss Eprimer application in SoapLab
• Selection of restriction enzyme: Emboss Restrict in SoapLab
• Restriction Fragment Length Polymorphism experiment
• Use primers designed by myGrid to amplify region flanking SNP on the gene

3D Protein Structure: What is the structure of the protein product encoded by my candidate gene?
• Query PDB & display protein structure
• Obtain information about protein & extract information about active site
• Determine whether coding SNP affects the active site of the protein

Other labels: Candidate gene pool, Gene ID, SNP, Talisman, PDB, Swiss-Prot, AMBIT, Interpro

Peter Li (1), Claire Jennings (2), Simon Pearce (2) and Anil Wipat (1) (2003). (1) School of Computing Science and (2) Institute of Human Genetics, University of Newcastle-upon-Tyne.


Experiment life cycle

[Cycle diagram with the following stages]
• Forming experiments
• Executing and monitoring experiments
• Managing lifecycle, provenance and results of experiments
• Sharing services & experiments
• Discovering and reusing experiments and resources
• Personalisation


(e-)Scientists…
• …Experiment
  • Can workflow be used as an experimental method?
  • How many times has this experiment been run?
• …Analyze
  • How do we manage the results to draw conclusions from them?
  • How reliable are these results?
• …Collaborate
  • Can we share workflows, results, metadata etc?
• …Publish
  • Can we link to these workflows and results from our papers?
• …Review
  • Can I find, comprehend and review your work?
  • How was that result derived?


Collections of Tasks

[Diagram labels]
• Tasks: Finding, Description, Service Discovery, Enactment, Building, Workflow, Provenance, Storage, Data Management, Querying, Domain Tasks
• Roles: Service Providers, Bioinformaticians, Scientists, Annotation providers


[Architecture diagram, flattened in extraction; the recoverable labels follow.]
• Components: Registry, mIR, Discovery View, Haystack Provenance Browser, FreeFluo Enactor, Taverna WF Builder, Pedro Annotation tool, Ontology Store, Soaplab, WSDL, Others
• Roles: Annotation providers, Scientists, Bioinformaticians, Service Providers
• Interactions: Interface Description, Annotation/description, Query & Retrieve, Workflow Execution, Store data/knowledge, invoking, Querying/sharing/federating/registering, Data descriptions, Vocabulary


myGrid Service Stack

[Layered architecture diagram, flattened in extraction; the recoverable labels follow.]
• Fabric: Web Service (Grid Service) communication fabric
• Layers: Applications, Core services, External services
• Audiences: Bioinformaticians, Tool Providers, Service Providers
• Components: Work bench, Taverna, Talisman, Web Portal, Gateway, Views, Personalisation, Provenance, Event Notification, Service and Workflow Discovery, myGrid Information Repository, Ontology Mgt, Metadata Mgt, FreeFluo Workflow Enactment Engine, OGSA-DQP Distributed Query Processor, AMBIT Text Extraction Service, Native Web Services, SoapLab, GowLab, Legacy apps, Registries, Ontologies


Two+ Paths

Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management

Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining

In between
• Event notification
• Gateway


myGrid Service Stack

[The layered architecture diagram is shown again: Applications, Core services and External services over the Web Service (Grid Service) communication fabric.]


Run the Workflow

Viewing intermediate results


Run the Workflow


Drilling Down: myGrid and Semantics

• Workflow and service discovery
  – Prior to and during enactment
  – Semantic registration
• Workflow assembly
  – Semantic service typing of inputs and outputs
• Provenance of workflows and other entities
• Experimental metadata glue
• Use of RDF, RDFS, DAML+OIL/OWL
  – Instance store, ontology server, reasoner
  – Materialised vs at point of delivery reasoning.
• myGrid Information Model
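
Semantic typing of inputs and outputs means two services can be linked when the output concept of one is the same as, or more specific than, the input concept of the next. The Python sketch below illustrates that check with a tiny hand-written is-a hierarchy. The concept names and the can_link helper are hypothetical; a real deployment would ask an ontology reasoner rather than walk a dictionary.

```python
# Hypothetical is-a hierarchy: child concept -> parent concept.
IS_A = {
    "protein_sequence": "sequence",
    "nucleotide_sequence": "sequence",
    "sequence": "bioinformatics_data",
    "blast_report": "bioinformatics_data",
}


def ancestors(concept):
    """Yield the concept itself, then each more general concept."""
    while concept is not None:
        yield concept
        concept = IS_A.get(concept)


def can_link(output_concept, input_concept):
    """True if data of output_concept fits where input_concept is expected."""
    return input_concept in ancestors(output_concept)


print(can_link("protein_sequence", "sequence"))
print(can_link("sequence", "protein_sequence"))
print(can_link("blast_report", "sequence"))
```

The check is directional on purpose: a service that accepts any sequence can take a protein sequence, but not the other way round.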


Semantic Discovery

View annotations on workflow

Pedro data capture tool

Drag a workflow entry into the explorer pane and the workflow loads.

Drag a service/workflow to the scavenger window for inclusion into the workflow.


Tutorial focus

Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management

Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining

In between
• Event notification
• Gateway


Roadmap

1. Describe services
2. Discover services
3. Write & run workflows
4. Provenance & data management

[Diagram labels: Registry, Taverna workbench, LSID authorities, services, workflows, data]
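
The LSID authorities in this roadmap issue Life Science Identifiers, which are URNs of the form urn:lsid:authority:namespace:object with an optional revision at the end. A small Python sketch of splitting one apart is shown here. The parse_lsid helper and the example identifier are illustrative assumptions, not identifiers issued by myGrid.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID URN into its parts; raise ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Made-up identifiers, for illustration only.
print(parse_lsid("urn:lsid:example.org:workflow_results:12345:2"))
print(parse_lsid("URN:LSID:example.org:workflow_results:12345"))
```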


Sessions on Details
• Workflows - hands on with Taverna
• Semantics
• Timetable – split sessions
  – Session 1
    • Group 1 – hands on (Swanson)
    • Group 2 – semantics (Newhaven)
  – Teabreak (short)
  – Session 2
    • Group 1 – semantics (Newhaven)
    • Group 2 – hands on (Swanson)
  – Discussions and Conclusions


Questions?

http://www.mygrid.org.uk

http://taverna.sf.net

http://freefluo.sf.net/