
Knowledge Streams: Stream Processing of Semantic Web Content

Mike Dean, Principal Engineer

Raytheon BBN Technologies
mdean@bbn.com

Assumptions

• Technology
  – Intermediate
  – Familiarity with RDF and OWL
• Interest in
  – Stream processing
  – Scalability

Presenter Background

• Principal Engineer at Raytheon BBN Technologies (1984-present)
• Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005)
  – Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL
• Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present)
• Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups
  – Co-editor of the W3C OWL Reference
• Local co-chair for ISWC 2009
• Other SemTech presentations
  – Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher)
  – Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher)
  – Use of SWRL for Ontology Translation (2008)
  – Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler)
  – How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus (2009)
  – Finding a Good Ontology: The Open Ontology Repository Initiative (2010, w/ Peter Yim and Todd Schneider)

Outline

• Motivation
• Vision
• Building Blocks
• Demonstration

Motivations

• Timeliness
• Performance

Timeliness

• Streaming minimizes latency
  – Processing elements see events as they occur
  – Resources are expended only when an event occurs
• This is in contrast to polling
  – Latency averages half the polling interval (an event arriving at a uniformly random time within a polling interval of length T waits T/2 on average)
  – Resources are expended on every poll
  – Popular web syndication mechanisms such as RSS and Atom involve polling

Performance

• Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access
  – Analogous to XML SAX vs. DOM
• For suitable applications, this can be 10x faster than loading all statements into memory or a KB (see the sketch below)
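For concreteness, here is a minimal streaming-parse sketch using Apache Jena's RIOT StreamRDF callback interface (a newer API than the ARP parser mentioned on the next slide); the counting logic and command-line input are illustrative assumptions, not part of the talk. Each statement is handled as it is parsed and is never accumulated into an in-memory model or KB.

```java
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDFBase;

public class StreamingStatementCounter {
    public static void main(String[] args) {
        // Callback sink: each triple is handled as it is parsed and then discarded;
        // nothing is accumulated into an in-memory model or KB.
        StreamRDFBase sink = new StreamRDFBase() {
            long count = 0;

            @Override
            public void triple(Triple triple) {
                count++;                        // per-statement processing goes here
            }

            @Override
            public void finish() {
                System.out.println("Statements seen: " + count);
            }
        };
        RDFParser.source(args[0]).parse(sink);  // RDF file or URL given on the command line
    }
}
```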


2 Streaming Stories

• dumpont of OpenCyc (circa 2003)
  – HTML-based ontology visualization tool periodically bogged down the daml.org server
  – Reimplementation using the event-based Jena ARP parser yielded 10x performance and scalability improvements
• Billion Triples Challenge 2009
  – Streaming analysis of the 2009 corpus was performed at an overall rate of 103K statements/sec on a Mac laptop with a portable external disk
  – Compare to loading 10-20K statements/second on a server

Stream Processing Examples

• Unix pipes
• Dataflow architectures
• StreamBase
• IBM System S / InfoSphere Streams

Vision: Knowledge Streams

[Architecture diagram] Data sources (sensor networks, sensors, imagery, RSS, IM, a gazetteer, Semantic Web content, and databases) feed distribution and processing elements (aggregation, persistent queries, augmentation, context filters, alerts, correlation, translation, inference, CEP, NLP, and distribution), which deliver results to users, communities of interest, and an archive.

• Persistent pipelines
  – Streams of statements comprising object subgraphs
  – URI naming allows drill-down
  – Provenance, timestamps
• Processing elements
  – Consume and produce subgraphs
  – Multiple functions may be combined

Goals

• Web-scale
  – Decentralized among multiple sites
  – Heterogeneous implementations
• Long-lived, persistent connections
  – User accountability
• Introspection over the processing network for control and optimization
  – E.g. aggregating subscriptions
  – Balance with security, privacy, and autonomy concerns

Building Blocks

• RDF content
• Existing stream processing frameworks
• Workflow systems
• Publish/subscribe message-oriented middleware

RDF Payloads

• Malleable data
  – Standards-based graph structure
  – Can easily add, remove, and transform statements (see the sketch below)
• Self-describing
  – Unique naming via URIs
  – References to vocabularies and ontologies
• Potential for inference
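As a rough illustration of that malleability (not code from the talk; the example.org namespace, property names, and literal values are made up), statements can be added to, removed from, and rewritten in an Apache Jena model without a fixed schema:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class MalleableGraph {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Made-up vocabulary, purely for illustration
        String ns = "http://example.org/stream#";
        Resource tweet  = model.createResource(ns + "tweet1");
        Property author = model.createProperty(ns + "author");
        Property seenAt = model.createProperty(ns + "seenAt");

        // Add statements
        model.add(tweet, author, model.createResource(ns + "user42"));
        model.add(tweet, seenAt, "2011-06-08T09:00:00Z");

        // Transform: drop the old timestamp and assert a corrected one
        model.removeAll(tweet, seenAt, null);
        model.add(tweet, seenAt, "2011-06-08T10:00:00Z");

        model.write(System.out, "TURTLE");
    }
}
```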


Workflow Systems

• Graphical environments for developing processing pipelines
  – Yahoo Pipes, DERI Pipes, SPARQLMotion
  – Nice user interfaces for development and execution

http://pipes.deri.org

Semantic Complex Event Processing

• Complex Event Processing
  – One of the leading edges of rules technology
  – Formal specification of higher-level events in terms of lower-level events
    • E.g. alert if the moving average increases 15% within a 10-minute window (see the sketch after this list)
  – Engine can be compiled/optimized for a specific rule set
  – High-volume deployments in finance and other industries
  – Most implementations focus on self-contained tuples
• Semantic Complex Event Processing
  – Enrich CEP using Semantic Web technology
  – Emerging topic at recent conferences
• Early implementations
  – Wrappers around open source CEP engines
  – Native implementation
• Provides a powerful set of operators and engines for Knowledge Streams
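To make the moving-average rule above concrete, here is a hand-rolled sliding-window sketch in plain Java rather than a CEP engine; the 10-minute window, 15% threshold, and the notion of a numeric price stream are assumptions carried over from the example, and timestamps are assumed to arrive in order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hand-rolled sketch (no CEP engine): alert when the moving average of a
 * price stream is at least 15% higher than it was roughly ten minutes earlier.
 */
public class MovingAverageAlert {
    private static final long WINDOW_MS = 10 * 60 * 1000L;
    private static final double THRESHOLD = 1.15;

    private final Deque<double[]> samples  = new ArrayDeque<>();  // {timestamp, price}
    private final Deque<double[]> averages = new ArrayDeque<>();  // {timestamp, moving average}
    private double sum = 0;

    public void onPrice(long timestampMs, double price) {
        samples.addLast(new double[] {timestampMs, price});
        sum += price;

        // Expire samples and previously computed averages that fell out of the window
        while (samples.peekFirst()[0] < timestampMs - WINDOW_MS) {
            sum -= samples.removeFirst()[1];
        }
        while (!averages.isEmpty() && averages.peekFirst()[0] < timestampMs - WINDOW_MS) {
            averages.removeFirst();
        }

        double avg = sum / samples.size();
        if (!averages.isEmpty() && avg >= THRESHOLD * averages.peekFirst()[1]) {
            System.out.println("ALERT: moving average up >= 15% within 10 minutes: " + avg);
        }
        averages.addLast(new double[] {timestampMs, avg});
    }
}
```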


Implementation Approach

• Well-defined APIs for implementing operators (sketched below)
• Operator execution containers
  – Could encapsulate existing engines
• Start with manual processing network configuration, then automate
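A hypothetical sketch of what such an operator API might look like (the interface and container names are invented for illustration, not taken from the talk): each operator consumes one object subgraph and emits zero or more subgraphs, so a container can chain operators whether they wrap an existing engine or are implemented natively.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.jena.rdf.model.Model;

/** Hypothetical operator API: consume one subgraph, emit zero or more. */
interface StreamOperator {
    List<Model> process(Model subgraph);
}

/** Hypothetical container running a manually configured operator chain. */
class OperatorPipeline {
    private final List<StreamOperator> operators;

    OperatorPipeline(List<StreamOperator> operators) {
        this.operators = operators;
    }

    /** Push one incoming subgraph through every operator in order. */
    List<Model> onSubgraph(Model subgraph) {
        List<Model> current = Collections.singletonList(subgraph);
        for (StreamOperator op : operators) {
            List<Model> next = new ArrayList<>();
            for (Model m : current) {
                next.addAll(op.process(m));   // an operator may drop, keep, or split a subgraph
            }
            current = next;
        }
        return current;                       // subgraphs to publish downstream
    }
}
```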


Use Cases

• Dissemination of metadata for new satellite imagery
• Social network changes
• Alerting of friends’ new publications
• …

Demo

• Processing using DERI Pipes with new operators
  – Ingest of #SemTechBiz tweets using the Twitter Streaming API
  – Conversion of JSON to RDF
  – Mapping to the SIOC vocabulary using SWRL rules
  – Enrichment by matching Twitter @handles with contacts
  – Persistent buffering using Java Message Service (see the sketch below)
  – Monitoring
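A rough sketch of the JMS buffering step, assuming an ActiveMQ broker on localhost and an invented topic name (neither is specified in the demo): each object subgraph is serialized as RDF and published to a topic, so downstream consumers can read it at their own pace.

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class KnowledgeStreamPublisher {
    public static void main(String[] args) throws Exception {
        // Assumed local ActiveMQ broker and topic name, for illustration only
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("knowledge.streams.semtechbiz");
        MessageProducer producer = session.createProducer(topic);

        // One message per object subgraph, serialized as RDF (Turtle here)
        String rdfPayload = "@prefix sioc: <http://rdfs.org/sioc/ns#> .\n"
                + "<http://example.org/tweet/1> a sioc:Post .";
        TextMessage message = session.createTextMessage(rdfPayload);
        producer.send(message);

        connection.close();
    }
}
```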
