DCC Keynote 2007
Curating Services and Workflows
The Good, the Bad and the Ugly
A Personal Story in the Small
Professor Carole Goble
The University of Manchester, [email protected]
Keynote: 3rd International Digital Curation Conference, Washington DC, 11-13 December 2007
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
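The record above follows a flat-file layout in which each line begins with a two-letter line code (ID, DE, GN, OS and so on). As a minimal sketch of what a script has to do before it can use such a record, the Python below groups lines by their code. The function name `parse_flat_record` and the shortened sample text are assumptions made for illustration; they are not part of the talk.

```python
from collections import defaultdict

# Shortened copy of the record shown on the slide (illustration only).
SAMPLE = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP
"""


def parse_flat_record(text):
    """Group the lines of one record by their two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        code, body = line[:2], line[5:].strip()
        if code.strip():        # a coded line such as "DE   ..."
            fields[code].append(body)
        elif body:              # an indented sequence line
            sequence.append(body.replace(" ", ""))
    return dict(fields), "".join(sequence)


fields, seq = parse_flat_record(SAMPLE)
print(fields["ID"][0].split()[0])               # entry name
print(fields["KW"][0].rstrip(".").split("; "))  # keywords as a list
print(len(seq))                                 # residues in the sample
```

Even this small parser has to know the meaning of each line code in advance, which is the kind of knowledge a curated description would carry.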
[GSK]
4
Programmatic Interfaces to Services (Web Services not Web Sites)
[Diagram labels: Your Script, Your Workflow, Your Application; Service Registry; Web Service; SeqFetch Service, BLAT Service, BLAST Service, GO Service; Interface Description Document (WSDL, WADL). Adapted from Lincoln Stein.]
European Bioinformatics Institute API submissions have risen to 3,166,901 for 2007 (Sarah Hunter)
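The idea on this slide, a script that asks a registry where a service lives and then calls it through a programmatic interface, can be sketched as follows. Everything here is a stand-in: `ServiceRegistry`, the `seqfetch` entry and the in-memory sequence table are hypothetical, and a real client would call a remote endpoint described by WSDL or WADL rather than a local function.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServiceEntry:
    name: str
    description: str
    invoke: Callable[[str], str]   # stand-in for a remote call


class ServiceRegistry:
    """A toy registry: it holds references to services, not the services' data."""

    def __init__(self) -> None:
        self._entries: Dict[str, ServiceEntry] = {}

    def register(self, entry: ServiceEntry) -> None:
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> ServiceEntry:
        return self._entries[name]


# Hypothetical local data standing in for a sequence database.
_SEQUENCES = {"MURA_BACSU": "MEKLNIAGGDSLNGTVHISG"}


def seq_fetch(identifier: str) -> str:
    return _SEQUENCES[identifier]


registry = ServiceRegistry()
registry.register(ServiceEntry("seqfetch", "Fetch a sequence by identifier", seq_fetch))

# "Your Script": look the service up, then call it.
service = registry.lookup("seqfetch")
print(service.invoke("MURA_BACSU"))
```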
5
[Mark Wilkinson, 2006]
• Workflows describe the scientist's in silico experiment
– Link together and cross reference data in different repositories
– Mechanism for interoperating.
– And that includes publications!
• Remote, third party, external applications and services
– Accessible to the workflow machinery
– And that includes data and publications!
• Results management
– Semantic metadata annotation of data
– Provenance tracking of results
• Sharing and replicating know-how
– Reuse of workflows
Viva la Workflows!
myGrid Taverna Workflow Workbench
http://www.mygrid.org.uk
41000+ downloads, 40 per day since June 2006. Ranked 210 in SourceForge activity (06 06 07)
Open Source Development
Used throughout the world
Systems biology – SysMo Consortium
Proteomics
Gene/protein annotation, Microarray data analysis, Medical image analysis
Heart simulations, High throughput screening, Phenotypical studies, Phylogeny
Plants, Mouse, Human
Astronomy, Music, Geography
Text mining
And Curation….
Because software needs curating too.
http://www.omii.ac.uk
Manchester, Southampton, Edinburgh, European Bioinformatics Institute
10
Automated Curation using Workflows
• Coordinating data mirroring refreshes
• Refreshing Data warehouses
– e-Fungi, ISPIDER
• Rebuilding lost databases
– tGRAP, when it collapsed, was picked up by Nijmegen and rebuilt using workflows over two days.
• Text mining
– Very, very popular.
• Workflows instead of data curation?
– Data regenerated on demand.
– Curate the workflow and not the data?
Bas Vroling, Gert Vriend, CMBI NCMLS UMC Nijmegen
11
Workflows are reading publications.
Workflows are processing the data.
Workflows are part of curation pipelines.
Workflows are another form of outcome to publish and curate alongside data and publications.
12
Workflows are….
…provenance of data
…a general technique for describing and enacting a process, like a script or a protocol or a method
…precise, unambiguous and transparent protocols and records.
…often complex, so they need explaining.
…often challenging and expensive to develop.
…know-how and best practice.
…collaborations.
…valuable first class scientific assets in their own right.
• Services are steps in the workflow, and a workflow can be deployed as a service. They are “Social Networks” of services. More on this later….
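The last point, that services are steps in a workflow and a workflow can itself be deployed as a service, can be sketched in a few lines. This is an illustration under assumptions, not how Taverna is implemented: the `Workflow` class and the three toy services are hypothetical, and the log is only a crude stand-in for provenance.

```python
from typing import Callable, List

Service = Callable[[str], str]


class Workflow:
    """A linear workflow: each step is a service, and the workflow is callable too."""

    def __init__(self, steps: List[Service]) -> None:
        self.steps = steps
        self.log: List[str] = []   # a crude provenance record

    def __call__(self, data: str) -> str:
        for step in self.steps:
            data = step(data)
            # Nested workflows have no __name__, so fall back to the class name.
            name = getattr(step, "__name__", type(step).__name__)
            self.log.append(f"{name} -> {data}")
        return data


# Hypothetical services.
def strip_whitespace(seq: str) -> str:
    return "".join(seq.split())


def to_upper(seq: str) -> str:
    return seq.upper()


def reverse(seq: str) -> str:
    return seq[::-1]


clean = Workflow([strip_whitespace, to_upper])
# A workflow has the same call signature as a service,
# so it can be used as a step inside another workflow.
pipeline = Workflow([clean, reverse])
print(pipeline("mek lni"))
print(pipeline.log)
```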
13
“We need to curate methods as well as data. With the new large scale data sets process matters as much as content and we are rubbish at curating, capturing and reusing it. Much of what we now rely on is processed, not raw data. We have strategies for curating the raw data - indeed multiple standards. Thus, in life sciences we have a gaping void in our curation. We need standards, need places to put methods, and places to allow re-use.”
Professor Andy Brass, Bioinformatics
14
Towards Reproducible Science (with Reproducible Scientific Objects)
15
Trypanosomiasis in Cattle
• Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
• Systematic and comprehensive automation. Elimination of user bias.
Fisher P et al. A systematic strategy for large-scale analysis of genotype–phenotype correlations: identification of candidate genes involved in African trypanosomiasis, Nucleic Acids Research, 2007, 1–9
A PhD student. Paul Fisher.
16
Recycling, Reuse, Repurposing
• A Trypanosomiasis in Cattle workflow (by Paul) reused without change for Trichuris muris Infection (by Jo).
• Identified the biological pathways believed to be involved in the ability of mice to expel the parasite.
• Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. Users add value.
Kepler
Triana
BPEL
Ptolemy II
Scientific memes. Scientific viruses.
Increasing numbers.
Aerospace Engine Design
90% of design is variant design
70% of information is taken from previous designs
Source: Silvia Wong, University of Southampton, UK
19
[Diagram, adapted from the eBank project. Labels: Institutional Archive; Local Web; Publisher Holdings; Digital Library; Graduate Students; Undergraduate Students; Virtual Learning Environment; e-Experimentation; e-Scientists; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata; Certified Experimental Results & Analyses; Data, Metadata & Ontologies; Workflows.]
20
If I had (well) curated services and workflows I could….
• Browse around and see what is out there and stop reinventing the wheel.
• Find a service based on what it does (or was meant to do), what it consumes as inputs and produces as outputs, and what it uses, or because it matches (somehow) something I have already
• Understand how it works and when it works
• Know where there are exact copies or similar services I can use as alternates
• Know whether I have permission to use it, or have the set up to use it.
21
If I had (well) curated services and workflows I could….
• Understand how to operate it, configure it correctly with some examples and defaults, invoke it and handle all the error stuff, and predict performance properties
• Know how expensive it might be to use (financially or performance)
• Know when and by whom it was created, its version history and track its versions
• Know what other people think of it, how popular it is and who else uses it and how
• Know how reliable it is, whether it still works and whether it keeps changing.
22
If I had (well) curated services and workflows I could….
• Get intelligent help with using it in my application, like when building workflows
• Validate it
• Know how it can be chained with others
• Find services that can mediate the mismatches between other services.
• Automagically match it up with others to automagically create new ones
• Call it from an application or a web browser
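Two of the wishes above, knowing whether services can be chained and finding a service that mediates a mismatch, reduce to comparing declared input and output types. The sketch below assumes each service carries a single typed input and output; `ServiceDescription`, the type names and the `report_to_ids` shim are hypothetical examples, not entries from the myGrid registry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    input_type: str
    output_type: str


def can_chain(upstream: ServiceDescription, downstream: ServiceDescription) -> bool:
    """True when the upstream output can feed the downstream input directly."""
    return upstream.output_type == downstream.input_type


def find_shim(upstream: ServiceDescription,
              downstream: ServiceDescription,
              shims: List[ServiceDescription]) -> Optional[ServiceDescription]:
    """Return the first shim that bridges the two services, or None."""
    for shim in shims:
        if can_chain(upstream, shim) and can_chain(shim, downstream):
            return shim
    return None


blast = ServiceDescription("blast", "fasta_sequence", "blast_report")
align = ServiceDescription("align", "sequence_id_list", "alignment")
shims = [ServiceDescription("report_to_ids", "blast_report", "sequence_id_list")]

print(can_chain(blast, align))
shim = find_shim(blast, align, shims)
print(shim.name if shim else "no shim found")
```

The comparison only works if the types mean something, which is why a description such as "Input0: string, Output0: string" is of no help here.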
23
A definition for me [based on wikipedia]
• Digital curation is about maintaining and adding value to a trusted body of digital assets for current and future use by, and on behalf of, a community.
• It is a long term process where those assets are managed, cleaned up and corrected, associated with metadata, annotated and discussed, and appropriately preserved or reliably disposed of.
• Assets are used, we hope
– By applications and scientists who had anticipated using them.
– By applications and scientists that had not, or in ways that were unanticipated.
http://en.wikipedia.org/wiki/Digital_curation
24
e-Scientists in the Cloud
• Individual life scientists, in under-resourced labs, using other people’s applications, with little systems support.
• Consumers are providers.
• Exploratory.
• A distributed, disconnected community of scientists.
Hypo Science©Virtual Laboratories
Science in the Small by the Many
© Peter Murray-Rust
26
Global Services in the Cloud
• Independent third party world-wide service providers of applications, tools and data sets. In the Cloud. Hosted at the originator’s site.
• Local applications, tools and datasets. My copies of third party services.
• Special shim services.
• Decoupled providers and consumers.
• 3500 service operations
27
But Surely ….
…Can’t I just Google (or Woogle) for a service?
• The clustalw program from Emboss is called ‘emma’
…Can’t I look at its WSDL document?
• Input0: string, Output0: string
• What does SeqRet actually do?
• Liberal use of polymorphic capabilities
• What about the ones that are not Web Services?
…Can’t I look at its documentation?
• Ahem. We have to try them to find out what they do…
Writing Reusable stuff is HARD
Predicting the unknown required by the unknown.
Services in the Wild are frequently Rubbish.
Scientists and Developers are naughty.
Applications and Scientists need a Curated Registry of Services
Note: Registry, not repository
Services are hosted elsewhere
(Just having a workflow system isn’t enough)
30
Service Curation
• 3500+ service operations
• 600+ annotated by full-time curator.
• myGrid Ontology
• Annotation and curation pipeline
• Curation tools
• Feta and Find-O-Matic discovery tools
• There are others:
– DAS Registry
– BioMOBY Central
Since 2002
31
Building Annotation Commodities
Object: Service Endpoint, Workflow, file, etc.
Annotation Model: Functional, Operational, Provenance, Reputation
Descriptions: Ontologies, Controlled vocabulary, Tags, Folksonomy, Free text
Layered, Enrichment, Augmentation Annotation model
Uses Semantic Web technologies - OWL and RDFS
The perspective of the scientist
Managed, centralised curation process
700+ class domain ontology
Service Ontology
3500+ Services
32
Volatility and Decay
BioNanny
• Services are not deposited and preserved.
• They are referred to.
• Constant, silent churn and flux.
• No SLA to be stable or standard.
• Constantly need tending or else they go bad and stale.
– SeqHound, BioMART API
• Rapid metadata heart-beat, especially on operational metadata. Like minutes.
• (cf. IVOA service validation, DAS).
• Workflow decay
• Not Fix, File, Forget
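The "rapid metadata heart-beat" can be pictured as a regular check over operational metadata: a service needs tending if its last probe failed or if nobody has probed it recently. The sketch below is an assumption about how such a check might look, not a description of BioNanny; the `Heartbeat` record, the service names and the ten-minute threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Heartbeat:
    last_checked: datetime
    was_alive: bool


def needs_attention(status: Dict[str, Heartbeat],
                    now: datetime,
                    max_age: timedelta) -> List[str]:
    """Return services that failed their last probe or have not been probed recently."""
    flagged = []
    for name, beat in status.items():
        stale = now - beat.last_checked > max_age
        if stale or not beat.was_alive:
            flagged.append(name)
    return sorted(flagged)


now = datetime(2007, 12, 11, 12, 0)
status = {
    "service_a": Heartbeat(now - timedelta(minutes=2), True),
    "service_b": Heartbeat(now - timedelta(minutes=2), False),  # last probe failed
    "service_c": Heartbeat(now - timedelta(hours=3), True),     # metadata gone stale
}
print(needs_attention(status, now, max_age=timedelta(minutes=10)))
```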
33
One size does not fit all…
• Scientist - Finding
– Simple classifications on a few properties. Smart tools. “Coarse grained”. Simple Ontology.
– Decision Support
• Automation – Validation and Execution “fine grained”
– Rich metadata for automatic service configuration, invocation, debugging, repair, automated composition
– Decision making.
34
[Chart. Axes: Increasing value (increased automation, better understanding); Investment (cost, effort). Capabilities shown: Manual use of tools, web pages; Scripted tool invocation; Naïve workflow systems; Basic ‘discovery’ style service annotations; Guided workflow construction; Guided workflow reuse; Knowledge driven visualization; Workflow validation; Semantically enriched data; Service Configuration; Dynamic Service Substitution; Automated Workflow Construction. Annotation styles shown: Folksonomy Tagging; Ontology Curation.]
Example annotations shown on the chart:
output{score} is_distance_between pair {input{sequence a}, input{sequence b}}
‘myalignscript.pl’
‘A tool to compare multiple protein structures’
performs_task : alignment
input_type{seq_a} : sequence … output_type{score} : d_value
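The annotations on this slide (`performs_task`, `input_type`, `output_type`) are what make "find a service based on what it does" possible. A minimal sketch of such a lookup follows. The first entry reuses the slide's example; its `seq_b` input, the second entry and the function `find_services` are assumptions added for illustration, and this is not the Feta query interface.

```python
annotations = [
    {
        "name": "myalignscript.pl",
        "free_text": "A tool to compare multiple protein structures",
        "performs_task": "alignment",
        "input_type": {"seq_a": "sequence", "seq_b": "sequence"},  # seq_b assumed
        "output_type": {"score": "d_value"},
    },
    {   # hypothetical second entry
        "name": "fetch_by_id",
        "free_text": "Retrieve a sequence record",
        "performs_task": "retrieval",
        "input_type": {"id": "identifier"},
        "output_type": {"record": "sequence"},
    },
]


def find_services(entries, task=None, consumes=None, produces=None):
    """Filter on whichever criteria are given; None means 'do not care'."""
    hits = []
    for entry in entries:
        if task is not None and entry["performs_task"] != task:
            continue
        if consumes is not None and consumes not in entry["input_type"].values():
            continue
        if produces is not None and produces not in entry["output_type"].values():
            continue
        hits.append(entry["name"])
    return hits


print(find_services(annotations, task="alignment"))
print(find_services(annotations, produces="sequence"))
```

A name or a free-text description alone supports none of these queries reliably, which is the gap between the cheap and the expensive ends of the chart.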
35
Progressive Curation
Just enough, Just in time
Jam today and Jam tomorrow
[Chart of Gain against Pain, with regions labelled: Very BAD; Good, but Unlikely; Just right]
Applications and Scientists needed a Curated Repository of Workflows
Find a workflow like this one that I can edit to do something else. That’s really hard.
37
Workflow Glass Boxes
• Social Networks of Services
– Is it dependent on a service I don’t have access to, or one that is deprecated or unreliable?
• Nesting and fragments of workflows
– Workflow networks
• Service Diagnostics
– Popularity, Co-use and clustering
– Quality of Service
• Service Curation
– Automate service annotation
– Debug service annotations
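Treating a workflow as a glass box means being able to open it, follow any nested workflows, and ask whether the services underneath are still usable. The following sketch shows one way to do that, assuming workflows are recorded as lists of the services and sub-workflows they call. The workflow names, service names and status values are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical workflows: each lists the services and nested workflows it uses.
workflows: Dict[str, List[str]] = {
    "annotate_genes": ["seq_fetch", "blast", "go_lookup"],
    "candidate_genes": ["annotate_genes", "pathway_lookup"],
}
service_status: Dict[str, str] = {
    "seq_fetch": "ok",
    "blast": "ok",
    "go_lookup": "deprecated",
    "pathway_lookup": "unreliable",
}


def services_used(name: str,
                  workflows: Dict[str, List[str]],
                  _seen: Optional[Set[str]] = None) -> Set[str]:
    """Expand nested workflows into the set of services they finally depend on."""
    seen = set() if _seen is None else _seen
    if name not in workflows:   # a plain service
        return {name}
    if name in seen:            # guard against cyclic nesting
        return set()
    seen.add(name)
    used: Set[str] = set()
    for part in workflows[name]:
        used |= services_used(part, workflows, seen)
    return used


def problems(name: str,
             workflows: Dict[str, List[str]],
             status: Dict[str, str]) -> Dict[str, str]:
    """Map each dependency that is not 'ok' to its recorded status."""
    return {
        service: status.get(service, "unknown")
        for service in sorted(services_used(name, workflows))
        if status.get(service, "unknown") != "ok"
    }


print(problems("candidate_genes", workflows, service_status))
```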
38
Franck Tanoh
Katy Wolstencroft
Our hard working (real) curators: notice how tired they look
Curation Sweatshop
• Steady increase in numbers of services and workflows
• Time-consuming and expensive.
• Annotation and the Ontologies
• Choosing, Adding value. Monitoring.
• Should we instead enable suppliers to add value?
39
Automated Curation
• Operational:
– Monitoring information services, dial home diagnostics from applications, customer reports
• Reputation and Provenance:
– Recommendations and ratings
• Functional:
– Text mining and parsing files and documents (if any)
– Incidental metadata through use.
– Annotation derivation from sound workflows and rich service descriptions of inputs and outputs
– Not perfect, but a help!
Needs lots of infrastructure
Needs lots of seeding and reviewing
Local Libraries and Warehouses of Workflows trapped in their enterprises or platforms
41
Tryps Twiki World
Wikis are where data lives….
42
• Picture of workflow in Flickr – evidence of social tagging and networking
43
44
myExperiment.org is…
• A bazaar for any and all kinds of workflows.
• A community social network for community annotation and general gossip.
• A gateway to other publishing environments.
• A federated repository.
• Publish self-describing encapsulated myExperiment Objects.
• Not workflows; Scientific Objects!
• e-Crystals, Social science, Astronomy, Geography, Music
• (A platform for launching workflows.)
Since Feb 2007
46
Encapsulated myExperiment Objects.
• A single workflow or a collection of workflows with instructions and examples
• A workflow with its inputs and the products of executing it (including logs), perhaps multiple times
• Chemistry data from instruments, coupled with blogged log book entries
• A collection of all the digital items associated with one experiment, including EMOs
• A reproducible article with workflows and data
Virtual Exchange Format
47
Encapsulated myExperiment Objects.
• Open Archives Initiative
– Object Reuse and Exchange (OAI-ORE)
– compound object information and standardised and interoperable mechanisms
• W3C Open Linked Data Initiative
• Reproducible Scientific Objects
Virtual Exchange Format
48
EMO Challenges
• What happens when the parts are scattered across multiple stores?
• What happens if someone updates a part?
• How will my EMO be discovered on the Web?
• How can I work with an EMO offline?
• What is the provenance of the EMO and its parts?
• What happens if a part is unavailable?
• How do I send an EMO by email?
• Can I turn an EMO into a tarball?
• Can I archive an EMO to a CDROM?
• If I delete this file will it break anyone’s EMOs?
• How do I trust an EMO?
• How do I handle an EMO RESTfully?
• Can my EMO link to objects outside the EMO?
24/5/2007 | myExperiment | Slide 48
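Some of the questions above become concrete once an EMO is given a manifest that lists its parts. The sketch below packs parts and a manifest into an in-memory tarball and then answers "if I delete this file will it break anyone's EMOs?" by scanning manifests. The JSON manifest layout, the function names and the sample files are assumptions for illustration; this is not the OAI-ORE serialisation or the myExperiment format.

```python
import io
import json
import tarfile


def pack_emo(title, parts, external_links=()):
    """Pack parts (filename -> bytes) plus a JSON manifest into an in-memory .tar.gz."""
    manifest = {
        "title": title,
        "parts": sorted(parts),
        "external": list(external_links),   # links to objects outside the EMO
    }
    members = dict(parts)
    members["manifest.json"] = json.dumps(manifest, indent=2).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_manifest(archive_bytes):
    """Read the manifest back out of a packed EMO."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return json.load(tar.extractfile("manifest.json"))


def emos_broken_by_deleting(filename, manifests):
    """Return the titles of EMOs that list this file as a part."""
    return [m["title"] for m in manifests if filename in m["parts"]]


archive = pack_emo(
    "tryps-run-1",
    {"workflow.xml": b"<workflow/>", "inputs.txt": b"Daxx\n"},
    external_links=["http://www.mygrid.org.uk"],
)
manifest = read_manifest(archive)
print(manifest["parts"])
print(emos_broken_by_deleting("inputs.txt", [manifest]))
```

The tarball answers the email and CDROM questions cheaply, but the external links show where it stops: parts held in other stores can still change or vanish.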
49
Not just Workflows, Not just Biology
Chemistry - eCrystals
Social Science
Astronomy
Music
Files and Documents
Logs and Blogs
Ontologies
Data
51
Respect Cautious Collaboration….
Community web site, federated repository.
Multiple and My.
Publish what I want when I want within the group I want.
Mixed identity regimes: an identity authority
OAI-PMH.
Open Archives Initiative. http://www.openarchives.org/
The CombeChem project. http://www.combechem.org/
cloud
enterprise
personal
laboratory
project
52
A Gateway + more User Participation
Tryps team already has a wiki
Mash up with Facebook and workflow hosting apps.
Bring functionality to the user.
Cooperate! Don’t Control.
The Research Information Centre
British Library and Microsoft
Figure courtesy Savas Parastatidis, Microsoft
53
Apologies to Larson
55
From me-Science to we-Science
• Tribal bonding and sharing
• Crossing Tribal Boundaries
• Across communities and disciplines (MIT)
• “Intellectual Fusion” & “Swarming”; breaking down silos
• Understanding outside my expertise. E.g. sources of error
• Metadata challenges.
• Social challenges.
56
[Diagram: Curation by the Monks, Curation by the Masses, Automated Curation and Curation by Developers, connected by arrows labelled “seed” and “refine, validate”]
A Change in the World
Challenges - where to start?
If we thought about them hard we wouldn’t have done it. So we didn’t.
It’s, er, my experiment.
National Centre for e-Social Science
58
User Participation for Content and Functionality
• Adoption depends on lots of shared services and workflows
• and enabling Scientists to add value through applications and collaborative tagging
• The Selfish Scientist –
• e-Science is me-Science
• Incentive models for Scientists to share?
59
• We expect workflow versioning.
• We encourage workflow evolution by the developers and others.
• Versions to be re-pooled.
• Ownership
• Sharing
• Permissions
• Separate update of workflow from update of metadata.
Workflow Versioning and Sharing
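The point about separating an update of the workflow from an update of its metadata can be illustrated with a record that counts the two independently. This is a sketch under assumptions, not the myExperiment data model; `WorkflowRecord` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorkflowRecord:
    owner: str
    versions: List[str] = field(default_factory=list)       # workflow definitions
    metadata: Dict[str, str] = field(default_factory=dict)  # title, tags, ...
    metadata_revision: int = 0

    def add_version(self, definition: str) -> int:
        """Store a new workflow definition; the metadata is left untouched."""
        self.versions.append(definition)
        return len(self.versions)

    def update_metadata(self, **changes: str) -> int:
        """Change the description without creating a new workflow version."""
        self.metadata.update(changes)
        self.metadata_revision += 1
        return self.metadata_revision


record = WorkflowRecord(owner="paul")
record.add_version("<workflow v1/>")
record.update_metadata(title="Candidate gene analysis")
record.update_metadata(tags="trypanosomiasis")
print(len(record.versions), record.metadata_revision)
```

Keeping the counters apart means a curator can improve a description many times without implying that the workflow itself has changed.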
60
• Control in the hands of the developers.
• Is this flexible enough?
• Sense of Ownership. IP. Authorship attribution. Copyright.
• Provenance propagation.
• Validation, Safety, Trust.
• When does a workflow get changed so much it’s no longer the same workflow?
Workflow Versioning and Sharing
61
More Challenges
• Privacy, Copyright, IP
• Incentives to share, collaboratively curate and behave.
– Altruism, mischief, self-interest
– Credit, reputation, fame, impact. Me-Science.
– Expectations – suppose it’s wrong? Will I get sued?
– Scientists are naughty too.
• Quality control.
– Palpability, buyer beware, memes are tricky things. Community Trust models. Policing. Auto-checking? Shaming?
• Sustainability leverages
– The Open Source Development Model
– On young people’s endless enthusiasm to share.
• Better tooling.
62
Keep your Users Close
Web 2.0 Style development
• Perpetual Beta
• Users Add Value
Parties
HackFests
Advocates
Guinea Pigs
Do we still need curators?
“Hell is other people’s metadata”
64
Yes!
• Open tagging, folksonomies, blogging, profiles, recommendations, Social network analysis and e-tracking, workflow analytics.
• Deafened by the Shouting
• Overseeing but not Controlling. Review and add value.
• Tagging -> Structured Pipeline
• Reconcile Creative Freewheeling with need to Organise.
– Impedance mismatch between research activities and the recording of research data. Dynamic Scientists vs Prescriptive Platform
• Ontology dictatorship.
– Reconciling managed ontologies with emergent folksonomies. Encourage Tagging with Ontologies.
• Metadata Creep: multi-form, multiple-descriptions
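The "Tagging -> Structured Pipeline" step, where a curator reconciles emergent tags with a managed ontology, might look like the sketch below. Tags that map to an ontology term are counted under that term, and the rest are queued for review. The mapping table, the term names and the sample tags are hypothetical; they are not taken from the myGrid ontology.

```python
from collections import Counter

# Hypothetical mapping from free tags to terms in a managed ontology.
TAG_TO_TERM = {
    "blast": "sequence_similarity_search",
    "similarity search": "sequence_similarity_search",
    "alignment": "sequence_alignment",
    "clustalw": "sequence_alignment",
}


def normalise(tag):
    """Lower-case a tag and collapse runs of whitespace."""
    return " ".join(tag.lower().split())


def reconcile(tags):
    """Split tags into counts per ontology term and counts of unmapped tags."""
    terms = Counter()
    unmapped = Counter()
    for tag in tags:
        key = normalise(tag)
        if key in TAG_TO_TERM:
            terms[TAG_TO_TERM[key]] += 1
        else:
            unmapped[key] += 1
    return terms, unmapped


tags = ["BLAST", "blast ", "Alignment", "ClustalW", "my fave", "Similarity  Search"]
terms, unmapped = reconcile(tags)
print(dict(terms))
print(dict(unmapped))   # left for a curator to review
```

The unmapped pile is where the curator still earns their keep: frequent entries suggest either a missing ontology term or a new synonym for the table.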
65
Pay as you Go, Emergent Curation
[Chart of Gain against Pain, running from Folksonomy Tagging to Hard Core Ontology Curation, with regions labelled: Very BAD; Good, but Unlikely; Just right]
Must be careful to avoid technology seduction
Computer people want to do interesting stuff; curators want stability and reliability; users want simplicity.
Smart tools and good interfaces often outwit clever techniques.
Bummer.
However….
67
Model Flexibility
• Semantic Web!
– Flexibility of RDF
– Incrementality of OWL
– Self description
– Reasoning when needed
– Open Linked Data, SKOS
• Open Archives Initiative
– Object Reuse and Exchange (OAI-ORE)
– compound object information and standardised and interoperable mechanisms
68
Metadata Middleware
• Annotations are First Class Citizens
• A technology independent metadata abstraction layer. Natively supported by the middleware infrastructure.
• S-OGSA Framework from the Semantic Grid.
• Semantic Bindings Management.
69
Curation Design Patterns
• http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
1. The Long Tail
2. Data is the Next Intel Inside
3. Users Add Value
4. Network Effects by Default
5. Some Rights Reserved
6. The Perpetual Beta
7. Cooperate, Don't Control
8. Beyond a Single Device
70
SMARTER Curation
• Selective – ROI
• Mass community annotation – cooperate don’t control. Harness people cycles and network effects.
• Automate – Derive. Harness compute cycles and network effects.
• React – to changes, automate responses
• Timely – just in time
• Expedient – just enough
• Review – seed, oversee & refine rather than control
• Changes in model support and infrastructure
• Changes in work practice – if it’s a problem, it’s a people problem.
71
Credits
• David De Roure
• Matt Lee
• David Withers
• Don Cruickshank
• Jiten Bhagat
• David Newman
• Mark Borkum
• Danius Michaelides
• Ed Zaluska
• Jeremy Frey
• Simon Coles
• Marco Roos
• Rob Procter
• Alex Voss
• Duncan Hull
• Paul Fisher
• Antoon Goderis
• Katy Wolstencroft
• Franck Tanoh
• Robert Stevens
• Martin Senger
• Khalid Belhajjame
• Andy Brass
• Norman Paton
• Rodrigo Lopez (EBI)
• Tom Oinn (EBI)
• Pinar Alper, Phil Lord, Chris Wroe
• Mark Wilkinson (BioMOBY)
• Savas Parastatidis (Microsoft)
• Alan Williams, Stuart Owen, June Finch, Stian Soiland
• Kaixuan Wang, Oscar Corcho
• And the rest of myGrid and OntoGrid
72
For More Information
• myExperiment:
– http://myexperiment.org
– David De Roure [email protected]
• myGrid: Taverna and WS4LS Catalogue
– http://www.mygrid.org.uk
• SoapLab:
– http://soaplab.sourceforge.net/soaplab2/
• OntoGrid: Semantic middleware
– http://www.semanticgrid.org