Web Services, Workflows & Taverna
Superglue for the Semantic Web
Tom Oinn – EMBL-EBI, [email protected]
http://mygrid.org.uk
http://taverna.sf.net

Page 2: ppt

Who are we?
myGrid
  An EPSRC funded ‘eScience Pilot Project’
  Based across multiple sites in the UK
Taverna
  A tethered spin-off of the myGrid project
  Aimed at producing powerful tools to complement the basic research work
EBI Hinxton Campus

Page 3: ppt

What is Taverna?
Allows scientists to graphically construct complex processes in the form of workflows
What is a workflow?
  Set of activities that make up a process
  Definitions about how data moves between these activities
The user specifies what to do but not how to do it
Insulates users from the complexity of distributed computing
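
As an illustration of "activities plus definitions of how data moves between them", here is a minimal Python sketch. It is not Taverna's workflow format or engine; the function `run_workflow`, the link tuple layout and the two toy activities are all assumptions made up for this example. The user lists what should happen and how data flows, and the runner works out the order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(activities, links, inputs):
    """Run a toy workflow (hypothetical helper, not Taverna's engine).

    activities: name -> function taking keyword arguments
    links:      (source, target, port) triples describing data movement
    inputs:     workflow input name -> value
    """
    # Each activity depends on the activities that feed it.
    deps = {name: set() for name in activities}
    for source, target, _port in links:
        if source in activities:
            deps[target].add(source)

    values = dict(inputs)
    # The runner, not the user, decides the execution order.
    for name in TopologicalSorter(deps).static_order():
        kwargs = {port: values[source]
                  for source, target, port in links if target == name}
        values[name] = activities[name](**kwargs)
    return values


COMPLEMENT = str.maketrans("acgt", "tgca")

# Two illustrative activities; the names are assumptions for this sketch.
activities = {
    "reverse_complement": lambda seq: seq.translate(COMPLEMENT)[::-1],
    "gc_fraction": lambda seq: (seq.count("g") + seq.count("c")) / len(seq),
}
links = [
    ("sequence", "reverse_complement", "seq"),
    ("reverse_complement", "gc_fraction", "seq"),
]
results = run_workflow(activities, links, {"sequence": "acatttctac"})
print(results["reverse_complement"], results["gc_fraction"])
```

The activities are declared in a dictionary with no ordering, so the links alone determine that `reverse_complement` runs before `gc_fraction`.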

Page 4: ppt

Looks a bit like this…

Page 5: ppt

myGrid, Taverna and WBS
One of several early adopters of Taverna
Manchester based group working on Williams-Beuren Syndrome in the medical genetics department
Workflows written by life scientists not computer scientists
Following slides stolen at the last minute from Hannah Tipney at Manchester!

Page 6: ppt

Williams-Beuren Syndrome (WBS)
Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
Haploinsufficiency of the region results in the phenotype
Multisystem phenotype – muscular, nervous, circulatory systems
Characteristic facial features
Unique cognitive profile
Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
Outgoing personality, friendly nature, ‘charming’

Page 7: ppt

Williams-Beuren Syndrome Microdeletion
[Figure: physical map of chromosome 7 (~155 Mb) with the ~1.5 Mb region at 7q11.23 expanded. Repeat blocks A, B and C each appear in centromeric (cen), middle (mid) and telomeric (tel) copies. Clones CTA-315H11 and CTB-51J22 and a sequence gap are marked.]
Genes labelled on the map: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14
Block A: STAG3, PMS2L
Block C: FKBP6T, POM121, NOLR1
Block B: GTF2IP, NCF1P, GTF2IRD2P
Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164

Page 8: ppt

[Figure: flow diagram of the manual analysis pipeline. Recoverable labels follow.]
GenBank Accession No -> GenBank Entry -> Seqret -> Nucleotide seq (Fasta)
Nucleotide sequence analysis
  GenScan: coding sequence, ORFs
  sixpack: 6 ORFs
  restrict: restriction enzyme map
  cpgreport: CpG island locations and %
  RepeatMasker: repetitive elements
  prettyseq: translation/sequence file. Good for records and publications
  ncbiBlastWrapper: Blastn vs nr, est databases
  transeq: amino acid translation
Protein sequence analysis
  epestfind: identifies PEST seq
  pscan: identifies FingerPRINTS
  pepstats: MW, length, charge, pI, etc
  pepcoil: predicts coiled-coil regions
  SignalP, TargetP, PSORTII: predicts cellular location
  InterPro: identifies functional and structural domains/motifs
  Pepwindow? Octanol?: hydrophobic regions
  BlastWrapper: tblastn vs nr, est, est_mouse, est_human databases. Blastp vs nr
Other labels
  URL inc GB identifier
  Query nucleotide sequence -> RepeatMasker -> BLASTwrapper -> Sort for appropriate sequences only
  RepeatMasker
  TF binding prediction, promoter prediction, regulation element prediction: identify regulatory elements in genomic sequence
  Experiment

Page 9: ppt

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Analysis via ‘Cut and Paste’

Page 10: ppt

Workflows
[Figure: three workflow diagrams, labelled A, B and C]
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence

Page 11: ppt

The Biological Results

[Figure: map of the gap region. Labels: CTA-315H11, CTB-51J22, ELN, WBSCR14, and clones RP11-622P13, RP11-148M21, RP11-731K22.]
314,004bp extension
All nine known genes identified (40/45 exons identified): CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28
Four workflow cycles totalling ~10 hours
The gap was correctly closed and all known features identified

Page 12: ppt

Different Kinds of Services
Pure web services are not always the solution
  Abstraction Level?
  Typing?
  Description?
  Data Volumes?
Taverna employs a hybrid architecture which includes web services amongst other components

Page 13: ppt

Complex Invocation Patterns
E.g. Soaplab – has a typical factory pattern

‘create job’, ‘set parameter’, ‘run task’, ‘wait’, ‘get results’, ‘destroy task’.

Multiple web service calls per conceptual operation

Handled in Taverna by embedding this invocation pattern within a Soaplab processor.
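
The idea of hiding a multi-call pattern behind one conceptual operation can be sketched in Python as below. This is not the Soaplab API or Taverna's Soaplab processor: `InMemoryJobService` is a hypothetical local stand-in whose method names simply mirror the six steps quoted above, and `invoke` is an assumed wrapper name.

```python
class InMemoryJobService:
    """Hypothetical stand-in for a job-style remote service (not Soaplab's API)."""

    def __init__(self, tool):
        self._tool = tool      # the analysis the service wraps
        self._jobs = {}
        self._next_id = 0
        self.calls = []        # records each call so the pattern is visible

    def create_job(self):
        self.calls.append("create job")
        self._next_id += 1
        self._jobs[self._next_id] = {"params": {}, "result": None}
        return self._next_id

    def set_parameter(self, job_id, name, value):
        self.calls.append("set parameter")
        self._jobs[job_id]["params"][name] = value

    def run_task(self, job_id):
        self.calls.append("run task")
        job = self._jobs[job_id]
        job["result"] = self._tool(**job["params"])

    def wait(self, job_id):
        # The stand-in runs synchronously, so there is nothing to wait for.
        self.calls.append("wait")

    def get_results(self, job_id):
        self.calls.append("get results")
        return self._jobs[job_id]["result"]

    def destroy_task(self, job_id):
        self.calls.append("destroy task")
        del self._jobs[job_id]


def invoke(service, **parameters):
    """One conceptual operation that hides the six underlying calls."""
    job_id = service.create_job()
    try:
        for name, value in parameters.items():
            service.set_parameter(job_id, name, value)
        service.run_task(job_id)
        service.wait(job_id)
        return service.get_results(job_id)
    finally:
        # Always release the server-side job, even if a step fails.
        service.destroy_task(job_id)


service = InMemoryJobService(lambda sequence: len(sequence))
length = invoke(service, sequence="acatttctac")
print(length, service.calls)
```

The workflow author sees only `invoke`; the job lifecycle, including cleanup on failure, stays inside the wrapper.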

Page 14: ppt

Large Data Sets
No explicit limit to message size in WS specs but…
Most common toolkits equally terrible at handling large data.
WS Standards for bulk data transfer insufficiently mature or lacking interoperability.
Transfer references across WS calls, transfer actual data ‘out of band’
More info from Jon later, handled in Taverna via a Styx Grid Service plugin.
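
A rough Python sketch of passing references instead of data follows. It does not use Styx or any Taverna plugin; `ReferenceStore` and `count_bases_service` are hypothetical names, and a temporary directory stands in for whatever out-of-band channel actually carries the bytes.

```python
import hashlib
import os
import tempfile


class ReferenceStore:
    """Hypothetical out-of-band store: bulk data lives here, not in messages."""

    def __init__(self, directory):
        self.directory = directory

    def put(self, data: bytes) -> str:
        # The content hash doubles as a short reference to the data.
        ref = hashlib.sha256(data).hexdigest()
        with open(os.path.join(self.directory, ref), "wb") as handle:
            handle.write(data)
        return ref

    def get(self, ref: str) -> bytes:
        with open(os.path.join(self.directory, ref), "rb") as handle:
            return handle.read()


def count_bases_service(store, ref):
    """Stand-in for a service call: the message carries only the reference."""
    data = store.get(ref)  # the bulk transfer happens outside the message
    return {"length": len(data)}


store = ReferenceStore(tempfile.mkdtemp())
sequence = b"acgt" * 250_000          # about 1 MB of sequence data
ref = store.put(sequence)
reply = count_bases_service(store, ref)
print(len(ref), reply)
```

However large the sequence grows, the value passed between services stays a 64-character string.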

Page 15: ppt

Service Description
WS standards fail to address the description of a service.
Registries – UDDI is an old standard and predates work on semantic description
BioMoby and myGrid include Semantic Description and Discovery components.
  Search for services by task, by input or by past involvement in another workflow
  Essential for AI assisted workflow construction
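
Searching by task or by input can be illustrated with a small annotated registry in Python. This is not the BioMoby or myGrid discovery interface; `ServiceDescription`, `find_services` and the annotation terms (`task`, `input_type`, `output_type` and their values) are assumptions invented for the sketch. Only the tool names are taken from the pipeline slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical semantic annotation attached to a service."""
    name: str
    task: str
    input_type: str
    output_type: str


# Illustrative annotations; the vocabulary is assumed, not a real ontology.
REGISTRY = [
    ServiceDescription("transeq", "translation",
                       "nucleotide_sequence", "protein_sequence"),
    ServiceDescription("restrict", "restriction_mapping",
                       "nucleotide_sequence", "restriction_map"),
    ServiceDescription("pepstats", "protein_statistics",
                       "protein_sequence", "statistics_report"),
]


def find_services(registry, task=None, input_type=None):
    """Return names of services matching every criterion that was given."""
    return [
        service.name
        for service in registry
        if (task is None or service.task == task)
        and (input_type is None or service.input_type == input_type)
    ]


by_input = find_services(REGISTRY, input_type="nucleotide_sequence")
by_task = find_services(REGISTRY, task="protein_statistics")
print(by_input, by_task)
```

Because each description also records an output type, the same data could be used to suggest which service can follow another, which is the kind of lookup assisted workflow construction would need.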

Page 16: ppt

Multiple Service Types
[Figure legend] BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue)

Page 17: ppt

Taverna Demo
There should be a live demo of the Workflow Workbench here…

Page 18: ppt

Obtaining Taverna
Taverna is available under the LGPL from our project site on Sourceforge.net
  http://taverna.sourceforge.net

Release 1.0 as of the 20th Jan 2005 (after twelve beta releases)

Includes online and downloadable user manual, examples etc.

Support via project mailing lists

Page 19: ppt

myGrid and WBS People!
Core
Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pockock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
Robin McEntire (GSK)
Collaborators
Keith Decker

Page 20: ppt

Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project

Particular thanks to the other members of the Taverna project, http://taverna.sf.net