SemTech West 2011 - Digital Provenance

Implementing Digital Provenance on the World Wide Web Using Semantic Web Technology

Gregory Joiner*, Douglas Reid, Raytheon BBN Technologies, {gjoiner,dreid}@bbn.com

June 9th, 2011


Digital Provenance overview presentation given at SemTech 2011 by Greg Joiner of Raytheon BBN Technologies.

Transcript of SemTech West 2011 - Digital Provenance

Page 1: SemTech West 2011 - Digital Provenance

Implementing Digital Provenance on the World Wide Web Using Semantic Web Technology

Gregory Joiner*, Douglas Reid

Raytheon BBN Technologies

{gjoiner,dreid}@bbn.com

June 9th, 2011

Page 2: SemTech West 2011 - Digital Provenance

First…Some Administrivia!

• Updated slides are located on SlideShare at: http://slidesha.re/lqCHWd

• This presentation is not “Technical – Intermediate.”
  – I wanted to reach the maximum number of users.
  – There was not enough time to provide both an overview and technical instruction.

• Feel free to interrupt me anytime with questions!


Page 3: SemTech West 2011 - Digital Provenance

Goals of this Talk

• Learn what digital provenance is

• Understand why it is important

• Know what is currently being done by whom

• Have a starting point for implementing provenance in your semantic web applications

• Be passionate about digital provenance!


Page 4: SemTech West 2011 - Digital Provenance

Agenda

• Part 1: An Introduction to Digital Provenance
  – What is Digital Provenance
  – National Cyber Leap Year Summit

• Part 2: Digital Provenance Use Cases
  – Everyday Web Browsing
  – Contradictory, Time-Sensitive Information
  – Closed Network Provenance

• Part 3: Where Are We Now?
  – W3C Provenance Work
  – Review of the Current State-of-the-Art

• Part 4: Digital Provenance Tool Development
  – Why SemWeb is Perfect for Digital Provenance
  – Open Source and Standards Compliance
  – Securing Provenance Metadata
  – Additional Design Considerations


Page 5: SemTech West 2011 - Digital Provenance

Part 1: An Introduction to Digital Provenance


Page 6: SemTech West 2011 - Digital Provenance

What is Digital Provenance

• Provenance is defined by Webster’s Dictionary as “the origin or source of something” – mainly pertaining to art or architectural artifacts

• Digital Provenance is metadata that establishes the chain-of-custody information needed for users to make trust decisions about digital data

• Digital Provenance Metadata can describe any type of electronic data at any granularity level from entire web sites to single files to even individual assertions within a webpage or document


Page 7: SemTech West 2011 - Digital Provenance

What is Digital Provenance

Types of Digital Provenance Metadata include:

• Bibliographical Information – Provides a list of all of the sources behind a document or assertion

• Chain-of-Custody Information – Provides a history of the different people and/or systems that have handled the document or assertion

• Proof / Justification Information – Documents the logical steps followed to make an assertion

• Trust Information – Provides a quantifiable metric to measure and compare the trustworthiness of one document or assertion to another.
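The four kinds of metadata listed above can be pictured as fields of a single record. The following is a minimal sketch only: the class names, field names, and the 0-to-1 trust scale are assumptions made for illustration and are not taken from the presentation or from any provenance standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CustodyEvent:
    """One hand-off in the chain of custody (hypothetical structure)."""
    handler: str    # person or system that handled the content
    action: str     # e.g. "authored", "edited", "republished"
    timestamp: str  # ISO 8601 text, kept as a string for simplicity


@dataclass
class ProvenanceRecord:
    """Groups the four kinds of provenance metadata for one document or assertion."""
    subject: str                                                # what is being described
    sources: List[str] = field(default_factory=list)            # bibliographical information
    custody: List[CustodyEvent] = field(default_factory=list)   # chain-of-custody information
    justification: List[str] = field(default_factory=list)      # proof / justification steps
    trust: float = 0.0                                          # assumed trust metric in [0, 1]


def more_trusted(a: ProvenanceRecord, b: ProvenanceRecord) -> ProvenanceRecord:
    """Return the record with the higher trust score (ties favor the first argument)."""
    return a if a.trust >= b.trust else b


# Illustrative usage with made-up URIs and scores.
wire_story = ProvenanceRecord(
    subject="http://example.org/wire-story",
    sources=["http://example.org/press-release"],
    custody=[CustodyEvent("reporter", "authored", "2011-06-09T09:00:00Z")],
    trust=0.9,
)
blog_post = ProvenanceRecord(subject="http://example.org/blog-post", trust=0.4)
print(more_trusted(wire_story, blog_post).subject)
```

A single number is the simplest way to make two records comparable; a real tool would also have to say how that number is derived from the other three kinds of metadata.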


Page 8: SemTech West 2011 - Digital Provenance

National Cyber Leap Year Summit

• Convened in 2009 as a response to the President’s call to secure the nation’s cyber infrastructure and charged with identifying the “game-changing” technologies needed to secure cyberspace

• Identified Digital Provenance as one of those technologies because it enables the identification, authentication, and reputation of entities and objects with appropriate granularity at many layers of the protocol hierarchy.


Page 9: SemTech West 2011 - Digital Provenance

Part 2: Digital Provenance Use Cases


Page 10: SemTech West 2011 - Digital Provenance

Everyday Web Browsing

• Scenario: People often rely on the Internet for advice on important subjects, such as health or finance, and frequently make key decisions based on web content alone. This is especially true for mobile users, who lack the bandwidth and display room to investigate the provenance on their own.

• Solution: By dynamically marking the trustworthiness of web content, users can quickly determine what data they can trust so they can make more informed decisions.


Page 11: SemTech West 2011 - Digital Provenance

Contradictory, Time-Sensitive Information

• Scenario: When breaking news happens, content re-publishers and end users are often forced to choose between contradictory reports. For example, after the tragic shooting in Arizona in January 2011, some websites claimed Rep. Giffords was dead while others correctly reported that she was still alive.

• Solution: By providing a standard way to view and compare the bibliographical and chain-of-custody information of the conflicting articles, users can make an informed decision on which one to trust.


Page 12: SemTech West 2011 - Digital Provenance

Closed Network Provenance

• Scenario: Even in a closed network, users frequently have to decide whether to trust existing content. This is often the case within the Intelligence Community and Department of Defense, where certain time-sensitive tasks allow assumptions to be made that other tasks cannot. For example, the use of lethal force against a target requires more concrete evidence than other, less irreversible actions.

• Solution: By providing analysts with a complete list of the assumptions and justifications behind a given assertion, they can determine whether or not they can use that assertion in their analysis.
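One way to picture the solution is to treat each assertion as a node that points to the assertions supporting it, then walk that structure to list every assumption underneath. The sketch below is illustrative only: the `Assertion` class, its fields, and the sample statements are hypothetical and do not come from any specific analysis tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assertion:
    """A claim plus the claims offered in support of it (hypothetical model)."""
    statement: str
    is_assumption: bool = False  # True when the claim is taken on faith
    supports: List["Assertion"] = field(default_factory=list)


def collect_assumptions(assertion: Assertion) -> List[str]:
    """Return every assumption reachable from `assertion`, depth-first, left to right."""
    found: List[str] = []
    seen = set()          # guards against shared or cyclic support links
    stack = [assertion]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.is_assumption:
            found.append(node.statement)
        # Reverse so the first supporting claim is visited first.
        stack.extend(reversed(node.supports))
    return found


# Illustrative usage with made-up statements.
clock_ok = Assertion("The sensor clock is accurate", is_assumption=True)
sensor = Assertion("A sensor recorded the closure at 08:00", supports=[clock_ok])
witness = Assertion("The witness report is reliable", is_assumption=True)
claim = Assertion("The bridge was closed this morning", supports=[sensor, witness])

print(collect_assumptions(claim))
# ['The sensor clock is accurate', 'The witness report is reliable']
```

An analyst with a strict evidence threshold could reject any assertion whose list is non-empty, while a time-critical task might accept a short list.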


Page 13: SemTech West 2011 - Digital Provenance

Additional Use Cases

• License and Contract Compliance

• Public Policy Conformance

• Assigning Credit and Blame to Information

• Many more were identified by the W3C Provenance Incubator Group and are located at: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases


Page 14: SemTech West 2011 - Digital Provenance

Part 3: Where Are We Now?


Page 15: SemTech West 2011 - Digital Provenance

W3C Provenance Work

• Provenance Interchange Working Group
  – Chartered through Oct 2012, based on the Incubator Group’s findings
  – Formed to “support the widespread publication and use of provenance information of Web documents, data, and resources”
  – Will publish Recommendations to define a language for exchanging provenance information (PIL) among applications

• Provenance Interchange Language (PIL) design goals
  – Be applicable to any resource
  – Provide a low barrier to entry to facilitate widespread adoption
  – Provide a small, extensible core model
  – Draw from existing vocabularies and ontologies

• Deliverables
  – Conceptual Model, Formal Model, Formal Semantics, Accessing and Querying Provenance, XML Serialization, Best Practice Cookbook, Primer


Page 16: SemTech West 2011 - Digital Provenance

W3C’s work (cont.)

• Key Recommendations for PIL
  – Standard way to represent, at a minimum, three basic entities
    1. A handle (URI) to refer to an object
    2. A person/entity that the object is attributed to
    3. A processing step done by a person/entity to an object
  – Mechanism to access provenance-related information addressed by other standards
    • Licensing information of an object
    • Digital signature for the object
    • Digital signature for the provenance records
  – Standard way for sites to make provenance information about their content available to other parties in a selective manner, and for others to access that information
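The three basic entities in the first recommendation (an object handle, an attributed person/entity, and a processing step) can be sketched as subject-predicate-object triples. This is only an illustration: the `ex:` predicate names below are invented placeholders, since the slides do not give PIL's terms, and plain Python tuples stand in for a real RDF library.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical predicate names, NOT terms from PIL or any W3C vocabulary.
ATTRIBUTED_TO = "ex:attributedTo"
APPLIED_TO = "ex:appliedTo"
PERFORMED_BY = "ex:performedBy"


def describe(object_uri: str, agent_uri: str, step_uri: str) -> List[Triple]:
    """Build the minimal set of statements linking an object, an agent, and a step."""
    return [
        (object_uri, ATTRIBUTED_TO, agent_uri),  # entities 1 and 2: handle and attribution
        (step_uri, APPLIED_TO, object_uri),      # entity 3: the step acted on the object
        (step_uri, PERFORMED_BY, agent_uri),     # entity 3: and was done by the agent
    ]


def agents_for(triples: List[Triple], object_uri: str) -> List[str]:
    """List, in sorted order, every agent the object is attributed to."""
    return sorted({o for s, p, o in triples if s == object_uri and p == ATTRIBUTED_TO})


# Illustrative usage with made-up URIs.
triples = describe(
    "http://example.org/report",
    "http://example.org/people/alice",
    "http://example.org/steps/edit-1",
)
print(agents_for(triples, "http://example.org/report"))
# ['http://example.org/people/alice']
```

Using URIs for all three entities is what lets other sites add their own statements about the same object without coordinating in advance.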


Page 17: SemTech West 2011 - Digital Provenance

Review of the Current State-of-the-Art: Representation

• Existing Provenance Vocabularies/Ontologies
  – Dublin Core: “Librarian” vocabulary capturing bibliographical information
  – Provenir Ontology: Upper-level ontology for use in SemWeb applications
  – Provenance Vocabulary: Captures data using the Linked Data principles
  – Proof Markup Language (PML): “Full-Featured” interlingua that describes basic provenance metadata plus justification and trust information
  – Others: Changeset Vocabulary, PREMIS, SWAN Provenance Ontology, Semantic Web Publishing Vocabulary, and WOT Schema

• Concrete mapping specified between existing ontologies
  – The Open Provenance Model (OPM) was chosen as a reference vocabulary since it is a general and broad model that encompasses many aspects of provenance
  – W3C Incubator Group formally encoded the mappings according to the Simple Knowledge Organization System (SKOS) vocabulary


Page 18: SemTech West 2011 - Digital Provenance

Review of the Current State-of-the-Art: Implementation

• News aggregation scenario
  – Content tracking (Memetracker, Spinn3r & BlogTracker, influence studies)
  – Explicit provenance (trackbacks / pingbacks, Twitter’s Retweet)
  – Licensing (Creative Commons, Google Books Rights Registry)

• Disease outbreak scenario
  – Data provenance (human-readable changelogs, database research)
  – Workflow provenance (Taverna/Pegasus, Inference Web, ZOOM)
  – Justification for policy (ad-hoc user effort)

• Business Contract scenario
  – Tracking design (VisTrails)
  – Computer-aided Design (Design Rationale editor (DRed), IBIS software)


Page 19: SemTech West 2011 - Digital Provenance

State-of-the-Art (cont.): Gaps

• Content
  – No mechanism to refer to the identity/derivation of an information object
  – No guidance on granularity for description of complex objects
  – No common standard for exposing/expressing provenance information
  – No standard for versioning and publishing updates
  – No standard to characterize suitability of provenance info for proof

• Management
  – No standard for linking provenance between sites
  – No guidance on combining existing standards to provide provenance
  – No guidance for exposing provenance info on the Web
  – No proven approaches to manage scale
  – No standard way to ensure only essential non-confidential provenance is released


Page 20: SemTech West 2011 - Digital Provenance

State-of-the-Art (cont.): More Gaps

• Use
  – No clear understanding of how to relate provenance at different levels of abstraction
  – No general solutions to understand provenance published on the Web
  – No standard to enable provenance integration/comparison
  – No broadly applicable methodology for making trust judgments based on provenance when presented with information of varying quality
  – No existing mechanism to check compliance with laws, regulations or contracts
  – No means to resolve conflicts in provenance data


Page 21: SemTech West 2011 - Digital Provenance

Part 4: Digital Provenance Tool Development


Page 22: SemTech West 2011 - Digital Provenance

Why SemWeb is Perfect for Digital Provenance

• Semantic Web Technologies allow data to be shared and reused in a manner that is more flexible and integratable than traditional knowledge representations.

• The Web Ontology Language (OWL) allows deeper context to be encoded in the digital provenance metadata, which enables the capture of more complex information in a standard, well-specified format.

• With the provenance metadata in a machine-readable format, powerful automated information processing becomes possible, which can provide additional provenance knowledge.

• By semantically tagging the digital provenance metadata, it can be dynamically linked to supporting (or contradicting) information to provide a more complete chain-of-custody picture.
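As a rough illustration of how automated processing can yield additional provenance knowledge, the sketch below infers every indirect source of an item from a set of direct "derived from" links. It is a hand-written stand-in for what a reasoner could conclude if the relation were declared transitive in OWL. The function name, the dictionary layout, and the sample data are assumptions made for this example.

```python
from typing import Dict, Set


def all_sources(derived_from: Dict[str, Set[str]], item: str) -> Set[str]:
    """Return every direct or indirect source of `item`.

    `derived_from` maps each item to the set of items it was directly derived from.
    """
    result: Set[str] = set()
    stack = list(derived_from.get(item, ()))
    while stack:
        source = stack.pop()
        if source in result:
            continue  # already visited; also prevents looping on cycles
        result.add(source)
        stack.extend(derived_from.get(source, ()))
    return result


# Illustrative usage: only the direct links are recorded explicitly.
links = {
    "blog-post": {"news-article"},
    "news-article": {"press-release", "interview"},
}
print(sorted(all_sources(links, "blog-post")))
# ['interview', 'news-article', 'press-release']
```

The blog post never mentions the press release, yet the link is recoverable from the recorded metadata, which is the kind of extra knowledge a machine-readable format makes available.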


Page 23: SemTech West 2011 - Digital Provenance

Why Digital Provenance is Perfect for SemWeb


Provenance helps complete the path to the top of the Semantic Web layer cake and to Tim Berners-Lee’s (TBL’s) SemWeb nirvana.

Page 24: SemTech West 2011 - Digital Provenance

Open Source and Standards Compliance

• As explained in the National Cyber Leap Year Summit’s Co-Chairs’ Report, establishing standards early in the development process is crucial to achieving the rapid, widespread community acceptance that any digital provenance tool needs to be successful.

• Therefore, Digital Provenance tools should comply with and even inform the emerging W3C standards discussed earlier in this presentation.

• Furthermore, since digital provenance tools impose an additional time burden on both content developers and end users, they should be available at little to no cost to further encourage acceptance.


Page 25: SemTech West 2011 - Digital Provenance

Securing Provenance Metadata

• Provenance metadata that is not signed or secured is susceptible to tampering and therefore cannot realistically be trusted.

• Confidentiality and integrity controls that are consistent with a wide variety of security models are crucial to creating a successful digital provenance solution.
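The integrity half of this point can be illustrated with a keyed hash over a provenance record. This is a minimal sketch under stated assumptions: it uses an HMAC with a shared secret from Python's standard library as a simple stand-in for a true public-key digital signature, and the record fields, the key, and the function names are invented for the example.

```python
import hashlib
import hmac
import json


def sign(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON form of the record."""
    # Sorting keys and fixing separators makes the serialization deterministic.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, key: bytes, signature: str) -> bool:
    """Check the tag in constant time; False means the record or the tag was altered."""
    return hmac.compare_digest(sign(record, key), signature)


# Illustrative usage with a made-up record and key.
key = b"demo-shared-secret"
record = {"subject": "http://example.org/report", "attributedTo": "alice"}
tag = sign(record, key)

tampered = dict(record, attributedTo="mallory")
print(verify(record, key, tag))    # True
print(verify(tampered, key, tag))  # False
```

A shared secret only proves integrity to parties who hold the key. Publishing provenance on the open Web would call for public-key signatures so that anyone can verify a record without being able to forge one, and confidentiality would need separate controls.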


Page 26: SemTech West 2011 - Digital Provenance

Additional Design Considerations

• It is crucial that any digital provenance tool supports the creation, processing, and rendering of digital provenance metadata at all stages of the content creation lifecycle.

• Since users will require provenance information at many different levels of detail, successful digital provenance tools will be configurable to allow content creators and users to create and view the metadata at any granularity level.


Page 27: SemTech West 2011 - Digital Provenance

Key Takeaways

• Provenance is key to the future success of the Web and is the final piece of the Semantic Web puzzle.

• The U.S. government has identified digital provenance as one of the important “game changing” cyber security technologies.

• Important W3C work is already underway.

• You can start thinking about and incorporating provenance in your application right now.


Page 28: SemTech West 2011 - Digital Provenance

For More Information

• Authors
  – Greg Joiner, gjoiner@bbn.com, 703-284-1259
  – Douglas Reid, dreid@bbn.com, 703-284-1291

• National Cyber Leap Year Report
  – Co-Chairs’ Report: http://bit.ly/6NO05g
  – Participants’ Ideas Report: http://bit.ly/7HmjQ8

• W3C Provenance Interchange Working Group
  – www.w3.org/2011/prov


Page 29: SemTech West 2011 - Digital Provenance

Questions
