DCC Keynote 2007

Curating Services and Workflows: The Good, the Bad and the Ugly. A Personal Story in the Small. Professor Carole Goble, The University of Manchester, UK [email protected] Keynote: 3rd International Digital Curation Conference, Washington DC, 11-13 December 2007

description

A keynote given on experiences in curating workflows and web services. 3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge", 11-13 December 2007, Renaissance Hotel, Washington DC, USA

Transcript of DCC Keynote 2007

Page 1: DCC Keynote 2007

Curating Services and Workflows

The Good, the Bad and the Ugly: A Personal Story in the Small

Professor Carole Goble, The University of Manchester, UK [email protected]

Keynote: 3rd International Digital Curation Conference, Washington DC, 11-13 December 2007

Page 2: DCC Keynote 2007
Page 3: DCC Keynote 2007

ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

[GSK]
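
A record like the one on this slide is line-oriented: each line starts with a two-letter code (ID, DE, OS, ...), and the residues follow the SQ line. As a rough illustration of what a script has to do before it can use such data, here is a minimal sketch of a parser. The function name `parse_flat_file` and the shortened sample record are assumptions made for this example; it is not a complete parser for the real format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Shortened sample based on the slide's record (illustrative only).
SAMPLE = """
ID   MURA_BACSU STANDARD; PRT; 429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG
"""


def parse_flat_file(text: str) -> Tuple[Dict[str, List[str]], str]:
    """Group lines by their two-letter code; collect residues after SQ."""
    fields: Dict[str, List[str]] = defaultdict(list)
    residues: List[str] = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.strip() == "//":          # record terminator, if present
            break
        if in_sequence:
            residues.append(line.replace(" ", ""))
            continue
        code, _, rest = line.partition(" ")
        fields[code].append(rest.strip())
        if code == "SQ":
            in_sequence = True
    return dict(fields), "".join(residues)


fields, sequence = parse_flat_file(SAMPLE)
```

Here `fields["OC"]` holds both classification lines and `sequence` holds the residues with the column spacing removed.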

Page 4: DCC Keynote 2007

Programmatic Interfaces to Services (Web Services not Web Sites)

Your Script

ServiceRegistry

Web Service

SeqFetchService

BLAT Service

BLAST Service

SeqFetchService

GO Service

Adapted from Lincoln Stein

Your Workflow

Your Application

Interface Description Document

WSDL WADL

European Bioinformatics Institute API submissions have risen to 3,166,901 for 2007 (Sarah Hunter)
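
The picture on this slide (a script asks a registry for a service, then calls the services programmatically and chains them) can be sketched in a few lines. This is a minimal in-memory sketch: the `ServiceRegistry` class, the task names and the two stand-in services are hypothetical, and no real WSDL document or network call is involved.

```python
from typing import Callable, Dict

Service = Callable[[str], str]


class ServiceRegistry:
    """Toy registry: maps a task name to a callable 'service'."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, task: str, service: Service) -> None:
        self._services[task] = service

    def lookup(self, task: str) -> Service:
        if task not in self._services:
            raise KeyError(f"no service registered for task {task!r}")
        return self._services[task]


# Stand-ins for remote services (hypothetical behaviour, no network).
def seq_fetch(accession: str) -> str:
    return {"MURA_BACSU": "MEKLNIAGGD"}.get(accession, "")


def blast(sequence: str) -> str:
    return f"hits-for:{sequence}"


def your_script(registry: ServiceRegistry, accession: str) -> str:
    """The 'Your Script' box: discover services, then chain them."""
    fetch = registry.lookup("fetch_sequence")
    search = registry.lookup("similarity_search")
    return search(fetch(accession))


registry = ServiceRegistry()
registry.register("fetch_sequence", seq_fetch)
registry.register("similarity_search", blast)
result = your_script(registry, "MURA_BACSU")
```

The script never names a concrete service; it only names the task it needs, which is why the quality of the registry's descriptions matters so much.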

Page 5: DCC Keynote 2007


[Mark Wilkinson, 2006]

Page 6: DCC Keynote 2007

Viva la Workflows!

• Workflows describe the scientist's in silico experiment
– Link together and cross-reference data in different repositories
– Mechanism for interoperating.
– And that includes publications!
• Remote, third party, external applications and services
– Accessible to the workflow machinery
– And that includes data and publications!
• Results management
– Semantic metadata annotation of data
– Provenance tracking of results
• Sharing and replicating know-how
– Reuse of workflows

Page 7: DCC Keynote 2007

myGrid Taverna Workflow Workbench

http://www.mygrid.org.uk

Page 8: DCC Keynote 2007

41000+ downloads, 40 per day since June 2006.
Ranked 210 in SourceForge activity (06 06 07)
Open Source Development
Used throughout the world
Systems biology – SysMo Consortium
Proteomics
Gene/protein annotation, Microarray data analysis, Medical image analysis
Heart simulations, High throughput screening, Phenotypical studies, Phylogeny
Plants, Mouse, Human
Astronomy, Music, Geography
Text mining
And Curation….

Page 9: DCC Keynote 2007

Because software needs curating too.

http://www.omii.ac.uk

Manchester, Southampton, Edinburgh, European Bioinformatics Institute

Page 10: DCC Keynote 2007

Automated Curation using Workflows
• Coordinating data mirroring refreshes
• Refreshing Data warehouses
– e-Fungi, ISPIDER
• Rebuilding lost databases
– tGRAP when collapsed picked up by Nijmegen and rebuilt using workflows over two days.
• Text mining
– Very, very popular.
• Workflows instead of data curation?
– Data regenerated on demand.
– Curate the workflow and not the data?

Bas Vroling, Gert Vriend, CMBI NCMLS UMC Nijmegen

Page 11: DCC Keynote 2007

Workflows are reading publications.
Workflows are processing the data.
Workflows are part of curation pipelines
Workflows are another form of outcome to publish and curate alongside data and publications

Page 12: DCC Keynote 2007

Workflows are….
…provenance of data
…general technique for describing and enacting a process, like a script or a protocol or a method
…precise, unambiguous and transparent protocols and records.
…often complex, so they need explaining.
…often challenging and expensive to develop.
…know-how and best practice.
…collaborations.
…valuable first class scientific assets in their own right.

• Services are steps in the workflow, and a workflow can be deployed as a service. They are “Social Networks” of services. More on this later….
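
The point that services are steps in a workflow, and that a workflow can itself be deployed as a service, is essentially function composition. The sketch below assumes a hypothetical `Workflow` class whose instances are callable, so a whole workflow can be dropped into another workflow as a single step; the step functions are illustrative stand-ins, not Taverna's model.

```python
from typing import Callable, List

Step = Callable[[str], str]


class Workflow:
    """An ordered list of steps. Instances are callable, so a workflow
    can be used as a step (a 'service') inside another workflow."""

    def __init__(self, name: str, steps: List[Step]) -> None:
        self.name = name
        self.steps = steps
        self.log: List[str] = []  # crude provenance: names of steps that ran

    def __call__(self, data: str) -> str:
        for step in self.steps:
            data = step(data)
            # Workflows expose .name; plain functions expose .__name__.
            self.log.append(getattr(step, "name", getattr(step, "__name__", "?")))
        return data


def tag(sequence: str) -> str:
    return f"seq:{sequence}"


clean = Workflow("clean", [str.strip, str.upper])   # inner workflow
report = Workflow("report", [clean, tag])           # uses 'clean' as a service
output = report("  acgt ")
```

The log kept by each workflow is a very small stand-in for the provenance record mentioned on the slide.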

Page 13: DCC Keynote 2007

“We need to curate methods as well as data. With the new large scale data sets process matters as much as content and we are rubbish at curating, capturing and reusing it. Much of what we now rely on is processed, not raw data. We have strategies for curating the raw data - indeed multiple standards.

Thus, in life sciences we have a gaping void in our curation. We need standards, need places to put methods, and places to allow re-use.”

Professor Andy Brass, Bioinformatics

Page 14: DCC Keynote 2007


Towards Reproducible Science (with Reproducible Scientific Objects)

Page 15: DCC Keynote 2007

Trypanosomiasis in Cattle
• Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
• Systematic and comprehensive automation. Elimination of user bias.

Fisher P et al. A systematic strategy for large-scale analysis of genotype–phenotype correlations: identification of candidate genes involved in African trypanosomiasis, Nucleic Acids Research, 2007, 1–9

A PhD student. Paul Fisher.

Page 16: DCC Keynote 2007

Recycling, Reuse, Repurposing
• A Trypanosomiasis in Cattle workflow (by Paul) reused without change for Trichuris muris Infection (by Jo).
• Identified the biological pathways believed to be involved in the ability of mice to expel the parasite.
• Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. Users add value.

Page 17: DCC Keynote 2007

Kepler

Triana

BPEL

Ptolemy II

Scientific memes. Scientific viruses. Increasing numbers.

Page 18: DCC Keynote 2007

Aerospace Engine Design

90% of design is variant design
70% of information is taken from previous designs

Source: Silvia Wong, University of Southampton, UK

Page 19: DCC Keynote 2007


Institutional Archive

LocalWebPublisher

Holdings

Digital Library

Graduate Students

Undergraduate Students

Virtual Learning Environment

e-Experimentation

e-Scientists

Technical Reports

Reprints

Peer-Reviewed Journal & Conference Papers

Preprints & Metadata

Certified Experimental Results & Analyses

Data, Metadata & Ontologies Workflows

Adapted from the eBank project

Page 20: DCC Keynote 2007

If I had (well) curated services and workflows I could….
• Browse around and see what is out there and stop reinventing the wheel.
• Find a service based on what it does (or was meant to do), and what it consumes as inputs and produces as outputs, and what it uses, or because it matches (somehow) something I have already
• Understand how it works and when it works
• Know where there are exact copies or similar services I can use as alternates
• Know whether I have permission to use it, or have the set up to use it.

Page 21: DCC Keynote 2007

If I had (well) curated services and workflows I could….
• Understand how to operate it, configure it correctly with some examples and defaults, invoke it and handle all the error stuff, and predict performance properties
• Know how expensive it might be to use (financially or performance)
• Know when and by whom it was created, its version history and track its versions
• Know what other people think of it, how popular it is and who else uses it and how
• Know how reliable it is, whether it still works and whether it keeps changing.

Page 22: DCC Keynote 2007

If I had (well) curated services and workflows I could….
• Get intelligent help with using it in my application, like when building workflows
• Validate it
• Know how it can be chained with others
• Find services that can mediate the mismatches between other services.
• Automagically match it up with others to automagically create new ones
• Call it from an application or a web browser

Page 23: DCC Keynote 2007

A definition for me [based on wikipedia]
• Digital curation is about maintaining and adding value to a trusted body of digital assets for current and future use by, and on behalf of, a community.
• It is a long term process where those assets are managed, cleaned up and corrected, associated with metadata, annotated and discussed, and appropriately preserved or reliably disposed of.
• Assets are used, we hope
– By applications and scientists who had anticipated using them.
– By applications and scientists that had not, or in ways that were unanticipated.

http://en.wikipedia.org/wiki/Digital_curation

Page 24: DCC Keynote 2007

e-Scientists in the Cloud
• Individual life scientists, in under-resourced labs, using other people’s applications, with little systems support.
• Consumers are providers.
• Exploratory.
• A distributed, disconnected community of scientists.

Page 25: DCC Keynote 2007

Hypo Science©Virtual Laboratories

Science in the Small by the Many

© Peter Murray-Rust

Page 26: DCC Keynote 2007

Global Services in the Cloud
• Independent third party world-wide service providers of applications, tools and data sets. In the Cloud. Hosted at the originator’s site.
• Local applications, tools and datasets. My copies of third party services.
• Special shim services.
• Decoupled providers and consumers.
• 3500 service operations

Page 27: DCC Keynote 2007

But Surely ….
…Can’t I just Google (or Woogle) for a service?
• The clustalw program from Emboss is called ‘emma’
…Can’t I look at its WSDL document?
• Input0:string, Output0: string
• What does SeqRet actually do?
• Liberal use of polymorphic capabilities
• What about the ones that are not Web Services?
…Can’t I look at its documentation?
• Ahem. We have to try them to find out what they do…

Page 28: DCC Keynote 2007

Writing Reusable stuff is HARD

Predicting the unknown required by the unknown.

Services in the Wild are frequently Rubbish.

Scientists and Developers are naughty.

Page 29: DCC Keynote 2007

Applications and Scientists need a Curated Registry of Services

Note: Registry, not repository
Services are hosted elsewhere

(Just having a workflow system isn’t enough)

Page 30: DCC Keynote 2007

Service Curation
• 3500+ service operations
• 600+ annotated by full-time curator.
• myGrid Ontology
• Annotation and curation pipeline
• Curation tools
• Feta and Find-O-Matic discovery tools
• There are others:
– DAS Registry
– BioMOBY Central

Since 2002

Page 31: DCC Keynote 2007

Building Annotation Commodities

Object
Service Endpoint
Workflow
file
etc

Annotation Model
Functional
Operational
Provenance
Reputation

Descriptions
Ontologies
Controlled vocabulary
Tags
Folksonomy
Free text

Layered, Enrichment, Augmentation Annotation model
Uses Semantic Web technologies - OWL and RDFS

The perspective of the scientist
Managed, centralised curation process

700+ class domain ontology

Service Ontology
3500+ Services

Page 32: DCC Keynote 2007

Volatility and Decay
BioNanny
• Services are not deposited and preserved.
• They are referred to.
• Constant, silent churn and flux.
• No SLA to be stable or standard.
• Constantly need tending or else they go bad and stale.
– SeqHound, BioMART API
• Rapid metadata heart-beat, especially on operational metadata. Like minutes.
• (cf. IVOA service validation, DAS).
• Workflow decay
• Not Fix, File, Forget
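
The "rapid metadata heart-beat" on operational metadata can be pictured as a loop that probes every registered endpoint and records the outcome, so that decay shows up in the registry rather than in a failed workflow run. The following is only a sketch: `OperationalRecord`, `heartbeat`, the three-failure threshold and the example.org endpoints are all assumptions for illustration, and the probe is injected so that no real network access is needed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OperationalRecord:
    endpoint: str
    history: List[bool] = field(default_factory=list)  # True = probe succeeded

    @property
    def status(self) -> str:
        if not self.history:
            return "unknown"
        if self.history[-1]:
            return "alive"
        # Three consecutive failures: treat the service as gone stale.
        recent = self.history[-3:]
        return "stale" if len(recent) == 3 and not any(recent) else "failing"


def heartbeat(records: Dict[str, OperationalRecord],
              probe: Callable[[str], bool]) -> None:
    """One heart-beat: probe every endpoint and append the outcome."""
    for record in records.values():
        try:
            ok = probe(record.endpoint)
        except Exception:
            ok = False  # a probe that blows up counts as a failure
        record.history.append(ok)


records = {
    "seqfetch": OperationalRecord("http://example.org/seqfetch"),
    "blast": OperationalRecord("http://example.org/blast"),
}
down = {"http://example.org/blast"}  # pretend this endpoint has died


def fake_probe(endpoint: str) -> bool:
    return endpoint not in down


for _ in range(3):  # three beats, which in practice would be minutes apart
    heartbeat(records, fake_probe)

statuses = {name: record.status for name, record in records.items()}
```

In a real deployment the probe would call the service with a known-good test input, since an endpoint that answers is not necessarily one that still works.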

Page 33: DCC Keynote 2007

One size does not fit all…
• Scientist - Finding
– Simple classifications on a few properties. Smart tools. “Coarse grained”. Simple Ontology.
– Decision Support
• Automation – Validation and Execution “fine grained”
– Rich metadata for automatic service configuration, invocation, debugging, repair, automated composition
– Decision making.

Page 34: DCC Keynote 2007

Increasing value
Increased automation
Better understanding

Investment (cost, effort)

Scripted tool invocation

Guided workflow construction

Basic ‘discovery’ style service annotations

Knowledge driven visualization

Workflow validation

Semantically enriched data

Automated Workflow Construction

Guided workflow reuse

Dynamic Service Substitution

Manual use of tools, web pages

Naïve workflow systems

Folksonomy Tagging

Ontology Curation

Service Configuration

output{score} is_distance_between pair {input{sequence a}, input{sequence b}}

‘myalignscript.pl’

‘A tool to compare multiple protein structures’

performs_task : alignment

input_type{seq_a} : sequence
…
output_type{score} : d_value
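
The annotation levels on this slide (a bare script name, a free-text description, a task term, typed inputs and outputs) can be held in one record whose layers are filled in progressively. The sketch below is an assumption-laden illustration: the `ServiceAnnotation` fields and the `find` function are hypothetical names, and `opaque_service` is an invented entry that mimics the "Input0:string, Output0: string" problem from the earlier slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ServiceAnnotation:
    name: str
    description: str = ""                          # free text
    tags: Set[str] = field(default_factory=set)    # folksonomy tags
    performs_task: Optional[str] = None            # controlled vocabulary term
    input_types: Dict[str, str] = field(default_factory=dict)
    output_types: Dict[str, str] = field(default_factory=dict)


def find(catalogue: List[ServiceAnnotation],
         task: Optional[str] = None,
         consumes: Optional[str] = None,
         produces: Optional[str] = None) -> List[str]:
    """Return names of services matching every criterion that was given."""
    hits = []
    for service in catalogue:
        if task is not None and service.performs_task != task:
            continue
        if consumes is not None and consumes not in service.input_types.values():
            continue
        if produces is not None and produces not in service.output_types.values():
            continue
        hits.append(service.name)
    return hits


catalogue = [
    ServiceAnnotation(
        name="myalignscript.pl",
        description="A tool to compare multiple protein structures",
        performs_task="alignment",
        input_types={"seq_a": "sequence", "seq_b": "sequence"},
        output_types={"score": "d_value"},
    ),
    # Hypothetical un-curated service: only what its interface gives away.
    ServiceAnnotation(
        name="opaque_service",
        input_types={"Input0": "string"},
        output_types={"Output0": "string"},
    ),
]

by_task = find(catalogue, task="alignment")
by_signature = find(catalogue, consumes="string", produces="string")
```

Searching by task finds the annotated script; searching by the raw signature finds only the opaque service and says nothing about what it does, which is the gap the richer layers are meant to close.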

Page 35: DCC Keynote 2007

Progressive Curation
Just enough, Just in time

Jam today and Jam tomorrow

Gain

Pain

Very BAD

Good, but Unlikely

Just right

Page 36: DCC Keynote 2007

Applications and Scientists needed a Curated Repository of Workflows

Find a workflow like this one that I can edit to do something else. That’s really hard.

Page 37: DCC Keynote 2007

Workflow Glass Boxes
• Social Networks of Services
– Is it dependent on a service I don’t have access to, or that is deprecated or unreliable?
• Nesting and fragments of workflows
– Workflow networks
• Service Diagnostics
– Popularity, Co-use and clustering
– Quality of Service
• Service Curation
– Automate service annotation
– Debug service annotations
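
Treating a workflow as a glass box means its service dependencies, including those of nested workflows, can be listed and checked against what the registry knows. A minimal sketch follows, assuming a workflow is represented as a nested list of service names and that the registry exposes a simple status table; the status values shown are illustrative, not real monitoring data.

```python
from typing import Dict, List, Set, Union

# A part is either a service name or a nested workflow (a list of parts).
Part = Union[str, List["Part"]]


def services_used(workflow: List[Part]) -> Set[str]:
    """Collect every service name, descending into nested workflows."""
    found: Set[str] = set()
    for part in workflow:
        if isinstance(part, list):
            found |= services_used(part)
        else:
            found.add(part)
    return found


def problems(workflow: List[Part], status: Dict[str, str]) -> Dict[str, str]:
    """Map each problematic service to its status. A service missing
    from the status table is reported as 'unknown'."""
    report: Dict[str, str] = {}
    for name in sorted(services_used(workflow)):
        state = status.get(name, "unknown")
        if state != "ok":
            report[name] = state
    return report


# Illustrative status table (assumed values).
status = {"SeqFetch": "ok", "BLAST": "ok", "SeqHound": "deprecated"}
workflow: List[Part] = ["SeqFetch", ["BLAST", "SeqHound"], "GO"]
report = problems(workflow, status)
```

The same traversal, run over a whole repository, also yields the co-use and popularity counts listed under service diagnostics.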

Page 38: DCC Keynote 2007

Curation Sweatshop
• Steady increase in numbers of services and workflows
• Time-consuming and expensive.
• Annotation and the Ontologies
• Choosing, Adding value. Monitoring.
• Should we instead enable suppliers to add value?

Franck Tanoh

Katy Wolstencroft

Our hard working (real) curators: notice how tired they look

Page 39: DCC Keynote 2007

Automated Curation
• Operational:
– Monitoring information services, dial home diagnostics from applications, customer reports
• Reputation and Provenance:
– Recommendations and ratings
• Functional:
– Text mining and parsing files and documents (if any)
– Incidental metadata through use.
– Annotation derivation from sound workflows and rich service descriptions of inputs and outputs
– Not perfect, but a help!

Needs lots of infrastructure

Needs lots of seeding and reviewing

Page 40: DCC Keynote 2007

Local Libraries and Warehouses of Workflows trapped in their enterprises or platforms

Page 41: DCC Keynote 2007


Tryps Twiki World

Wikis are where data lives….

Page 42: DCC Keynote 2007

• Picture of workflow in Flickr – evidence of social tagging and networking

Page 43: DCC Keynote 2007


Page 44: DCC Keynote 2007

myExperiment.org is…
• A bazaar for any and all kinds of workflows.
• A community social network for community annotation and general gossip.
• A gateway to other publishing environments.
• A federated repository.
• Publish self-describing encapsulated myExperiment Objects.
• Not workflows; Scientific Objects!
• e-Crystals, Social science, Astronomy, Geography, Music
• (A platform for launching workflows.)

Since Feb 2007

Page 45: DCC Keynote 2007
Page 46: DCC Keynote 2007

Encapsulated myExperiment Objects.
• A single or collection of workflows with instructions and examples
• A workflow with its inputs and the products of executing it (including logs), perhaps multiple times
• Chemistry data from instruments, coupled with blogged log book entries
• A collection of all the digital items associated with one experiment, including EMOs
• A reproducible article with workflows and data

Virtual Exchange Format
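
One way to picture an encapsulated object is as a manifest that names its parts by reference and says what role each part plays. The sketch below is purely illustrative: the `EMO` and `Part` classes, the role names, the example.org URIs and the JSON layout are assumptions made here, and this is not the myExperiment exchange format or an OAI-ORE serialisation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Part:
    uri: str    # where the part lives; parts are referenced, not copied
    role: str   # e.g. "workflow", "input", "output", "log"


@dataclass
class EMO:
    title: str
    parts: List[Part] = field(default_factory=list)

    def add(self, uri: str, role: str) -> None:
        self.parts.append(Part(uri, role))

    def by_role(self, role: str) -> List[str]:
        return [part.uri for part in self.parts if part.role == role]

    def to_json(self) -> str:
        """Self-describing manifest that can travel with the object."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


emo = EMO("Workflow run with inputs and products")
emo.add("http://example.org/workflows/1", "workflow")
emo.add("http://example.org/data/input.fasta", "input")
emo.add("http://example.org/runs/1/log.txt", "log")
manifest = emo.to_json()
```

Because the parts are held by reference, the manifest alone cannot answer what happens when a part moves or changes, which is where the harder questions about such objects begin.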

Page 47: DCC Keynote 2007

Encapsulated myExperiment Objects.
• Open Archives Initiative
– Object Reuse and Exchange (OAI-ORE) – compound object information and standardised and interoperable mechanisms
• W3C Open Linked Data Initiative
• Reproducible Scientific Objects

Virtual Exchange Format

Page 48: DCC Keynote 2007

EMO Challenges
• What happens when the parts are scattered across multiple stores?
• What happens if someone updates a part?
• How will my EMO be discovered on the Web?
• How can I work with an EMO offline?
• What is the provenance of the EMO and its parts?
• What happens if a part is unavailable?
• How do I send an EMO by email?
• Can I turn an EMO into a tarball?
• Can I archive an EMO to a CDROM?
• If I delete this file will it break anyone’s EMOs?
• How do I trust an EMO?
• How do I handle an EMO RESTfully?
• Can my EMO link to objects outside the EMO?

Page 49: DCC Keynote 2007


Not just Workflows, Not just Biology

Chemistry - eCrystals

Social Science

Astronomy

Music

Files and Documents

Logs and Blogs

Ontologies

Data

Page 50: DCC Keynote 2007


Respect Cautious Collaboration….


Community web site, federated repository.

Multiple and My.

Publish what I want when I want within the group I want.

Mixed identity regimes: an identity authority

OAI-PMH.

Open Archives Initiative. http://www.openarchives.org/

The CombeChem project. http://www.combechem.org/

cloud

enterprise

personal

laboratory

project

Page 51: DCC Keynote 2007

A Gateway + more User Participation

Tryps team already has a wiki
Mash up with Facebook and workflow hosting apps. Bring functionality to the user.

Cooperate! Don’t Control.

The Research Information Centre
British Library and Microsoft

Figure courtesy Savas Parastatidis, Microsoft

Page 52: DCC Keynote 2007


Page 53: DCC Keynote 2007

Apologies to Larson

Page 54: DCC Keynote 2007

From me-Science to we-Science
• Tribal bonding and sharing
• Crossing Tribal Boundaries
• Across communities and disciplines (MIT)
• “Intellectual Fusion” & “Swarming”; breaking down silos
• Understanding outside my expertise. E.g. sources of error
• Metadata challenges.
• Social challenges.

Page 55: DCC Keynote 2007

Curation by the Monks

Curation by the Masses

Automated Curation

refine, validate

refine, validate

Curation by Developers

seed seed

refine, validate

seed

A Change in the World

Page 56: DCC Keynote 2007

Challenges - where to start?

If we thought about them hard we wouldn’t have done it. So we didn’t.

It’s, er, my experiment.

National Centre for e-Social Science

Page 57: DCC Keynote 2007

User Participation for Content and Functionality
• Adoption depends on lots of shared services and workflows
• and enabling Scientists to add value through applications and collaborative tagging
• The Selfish Scientist –
• e-Science is me-Science
• Incentive models for Scientists to share?

Page 58: DCC Keynote 2007

Workflow Versioning and Sharing
• We expect workflow versioning.
• We encourage workflow evolution by the developers and others.
• Versions to be re-pooled.
• Ownership
• Sharing
• Permissions
• Separate update of workflow from update of metadata.
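
Separating the update of a workflow from the update of its metadata means that tagging or rating an entry should not create a new workflow version, and publishing a new version should not reset the annotations. A minimal sketch under those assumptions follows; `WorkflowEntry` and its fields are hypothetical names, not the myExperiment data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorkflowEntry:
    owner: str
    versions: List[str] = field(default_factory=list)       # workflow definitions
    metadata: Dict[str, str] = field(default_factory=dict)  # tags, ratings, notes
    metadata_revision: int = 0

    def publish(self, definition: str) -> int:
        """Add a new workflow version; metadata is left untouched."""
        self.versions.append(definition)
        return len(self.versions)  # 1-based version number

    def annotate(self, key: str, value: str) -> None:
        """Update metadata; the workflow versions are left untouched."""
        self.metadata[key] = value
        self.metadata_revision += 1


entry = WorkflowEntry(owner="alice")
v1 = entry.publish("workflow definition, first draft")
entry.annotate("tag", "trypanosomiasis")
entry.annotate("rating", "5")
v2 = entry.publish("workflow definition, revised")
```

After these calls the entry has two workflow versions and two metadata revisions, and the counters moved independently of each other.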

Page 59: DCC Keynote 2007

Workflow Versioning and Sharing
• Control in the hands of the developers.
• Is this flexible enough?
• Sense of Ownership. IP. Authorship attribution. Copyright.
• Provenance propagation.
• Validation, Safety, Trust.
• When does a workflow get changed so much it’s no longer the same workflow?

Page 60: DCC Keynote 2007

More Challenges
• Privacy, Copyright, IP
• Incentives to share, collaboratively curate and behave.
– Altruism, mischief, self-interest
– Credit, reputation, fame, impact. Me-Science.
– Expectations – suppose it’s wrong? Will I get sued?
– Scientists are naughty too.
• Quality control.
– Palpability, buyer beware, memes are tricky things. Community Trust models. Policing. Auto-checking? Shaming?
• Sustainability leverages
– The Open Source Development Model
– On young peoples’ endless enthusiasm to share.
• Better tooling.

Page 61: DCC Keynote 2007

Keep your Users Close
Web 2.0 Style development
• Perpetual Beta
• Users Add Value

Parties

HackFests

Advocates

Guinea Pigs

Page 62: DCC Keynote 2007

Do we still need curators?

“Hell is other people’s metadata”

Page 63: DCC Keynote 2007

Yes!
• Open tagging, folksonomies, blogging, profiles, recommendations, Social network analysis and e-tracking, workflow analytics.
• Deafened by the Shouting
• Overseeing but not Controlling. Review and add value.
• Tagging -> Structured Pipeline
• Reconcile Creative Freewheeling with need to Organise.
– Impedance mismatch between research activities and the recording of research data. Dynamic Scientists vs Prescriptive Platform
• Ontology dictatorship.
– Reconciling managed ontologies with emergent folksonomies. Encourage Tagging with Ontologies.
• Metadata Creep: multi-form, multiple-descriptions
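
The "Tagging -> Structured Pipeline" idea, reconciling emergent folksonomy tags with a managed ontology, can be sketched as a lookup that maps free tags onto controlled terms and queues the rest for a curator to review. Everything named here (`normalise`, `reconcile`, the synonym table and its terms) is a hypothetical illustration rather than the myGrid ontology or its tooling.

```python
from typing import Dict, Iterable, List, Tuple


def normalise(tag: str) -> str:
    """Fold case and punctuation so near-identical tags compare equal."""
    return tag.strip().lower().replace("-", " ").replace("_", " ")


def reconcile(tags: Iterable[str],
              synonyms: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Split tags into those mapped to an ontology term and those left
    for a curator to look at."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for tag in tags:
        term = synonyms.get(normalise(tag))
        if term is None:
            unmapped.append(tag)
        else:
            mapped[tag] = term
    return mapped, unmapped


# Assumed synonym table: normalised tag -> controlled term.
synonyms = {
    "alignment": "alignment",
    "seq alignment": "alignment",
    "blast": "similarity_search",
}
mapped, for_review = reconcile(["Seq-Alignment", "BLAST", "cool stuff"], synonyms)
```

The curator's job in this picture is to oversee rather than control: review the unmapped queue and extend the synonym table, instead of annotating every item by hand.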

Page 64: DCC Keynote 2007


Pay as you Go, Emergent Curation

Gain

Pain

Very BAD

Good, but Unlikely

Just right

Folksonomy Tagging

Hard Core Ontology Curation

Page 65: DCC Keynote 2007

Must be careful to avoid technology seduction

Computer people want to do interesting stuff; curators want stability and reliability; users want simplicity.

Smart tools and good interfaces often outwit clever techniques.

Bummer.

However….

Page 66: DCC Keynote 2007

Model Flexibility
• Semantic Web!
– Flexibility of RDF
– Incrementality of OWL
– Self description
– Reasoning when needed
– Open Linked Data, SKOS
• Open Archives Initiative
– Object Reuse and Exchange (OAI-ORE) – compound object information and standardised and interoperable mechanisms

Page 67: DCC Keynote 2007

Metadata Middleware
• Annotations are First Class Citizens
• A technology independent metadata abstraction layer. Natively supported by the middleware infrastructure.
• S-OGSA Framework from the Semantic Grid.
• Semantic Bindings Management.

Page 68: DCC Keynote 2007


Curation Design Patterns

• http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

1. The Long Tail

2. Data is the Next Intel Inside

3. Users Add Value

4. Network Effects by Default

5. Some Rights Reserved

6. The Perpetual Beta

7. Cooperate, Don't Control

8. Beyond a Single Device

Page 69: DCC Keynote 2007

SMARTER Curation
• Selective – ROI
• Mass community annotation – cooperate don’t control. Harness people cycles and network effects.
• Automate – Derive. Harness compute cycles and network effects.
• React – to changes, automate responses
• Timely – just in time
• Expedient – just enough
• Review – seed, oversee & refine rather than control

• Changes in model support and infrastructure
• Changes in work practice – if it’s a problem, it’s a people problem.

Page 70: DCC Keynote 2007

Credits
• David De Roure
• Matt Lee
• David Withers
• Don Cruickshank
• Jiten Bhagat
• David Newman
• Mark Borkum
• Danius Michaelides
• Ed Zaluska
• Jeremy Frey
• Simon Coles
• Marco Roos
• Rob Procter
• Alex Voss
• Duncan Hull
• Paul Fisher
• Antoon Goderis
• Katy Wolstencroft
• Franck Tanoh
• Robert Stevens
• Martin Senger
• Khalid Belhajjame
• Andy Brass
• Norman Paton
• Rodrigo Lopez (EBI)
• Tom Oinn (EBI)
• Pinar Alper, Phil Lord, Chris Wroe
• Mark Wilkinson (BioMOBY)
• Savas Parastatidis (Microsoft)
• Alan Williams, Stuart Owen, June Finch, Stian Soiland,
• Kaixuan Wang, Oscar Corcho
• And the rest of myGrid and OntoGrid

Page 71: DCC Keynote 2007

For More Information
• myExperiment:
– http://myexperiment.org
– David De Roure [email protected]
• myGrid: Taverna and WS4LS Catalogue
– http://www.mygrid.org.uk
• SoapLab:
– http://soaplab.sourceforge.net/soaplab2/
• OntoGrid: Semantic middleware
– http://www.semanticgrid.org