Post on 08-May-2015
On the need for a W3C community group on RDF Stream Processing
ISWC2013 Workshop on Ordering and Reasoning, Sydney, 22/10/2013
Oscar Corcho
ocorcho@fi.upm.es, ocorcho@localidata.com
@ocorcho
http://www.slideshare.net/ocorcho/
Disclaimer…
This presentation expresses my own view, which is not necessarily that of the rest of the group (although I hope it is similar).
Acknowledgements
• All those from whom I have “stolen” slides, material and ideas
  • Emanuele Della Valle
  • Daniele Dell’Aglio
  • Marco Balduini
  • Jean Paul Calbimonte
• And many others who have already started contributing…
Why set up a community group?
Heterogeneity
• In RDF Stream models (timestamps, events, time intervals, triple-based, graph-based…)
• In RDF Stream query languages (windows, stream selection, CEP-based operators…)
• In implementations (RDF native, query rewriting, continuous query registration, scalability, static vs streaming data…)
• In operational semantics (tick, window content, report)
You may think that we do not like heterogeneity…
But at least I love it…
• However, we need to tell people what to expect from each system, and smooth over the differences when they are not crucial…
The solution…
• Let’s create a W3C community group…
  • To understand those differences better
  • The requirements on which we are based
  • And explain to others
  • …
  • And maybe get some “recommendation” out
The W3C RDF Stream Processing Community Group
• http://www.w3.org/community/rsp/
W3C RSP Community Group mission
“The mission of the RDF Stream Processing Community Group (RSP) is to define a common model for producing, transmitting and continuously querying RDF Streams. This includes extensions to both RDF and SPARQL for representing streaming data, as well as their semantics. Moreover this work envisions an ecosystem of streaming and static RDF data sources whose data can be combined through standard models, languages and protocols. Complementary to related work in the area of databases, this Community Group looks at the dynamic properties of graph-based data, i.e., graphs that are produced over time and which may change their shape and data over time.”
Use cases
• We have started collecting them
• And I hope that by the end of my talk you will consider contributing some more…
A template to describe use cases (I)
• Streaming Information
  • Type: Environmental data: temperatures, pressures, salinity, acidity, fluid velocities, etc.
  • Nature:
    • Relational stream: yes
    • Text stream: no
  • Origin: Data is produced by sensors in oil wells and on oil and gas platform equipment. Each oil platform has an average of 400.000.
  • Frequency of update:
    • From sub-second to minutes
    • In triples/minute: [10000-10] t/min
  • Quality: It varies, due to instrument/sensor issues
  • Management/access
    • Technology in use: Dedicated (relational and proprietary) stores
    • Problems: The ability of users to access data from different sources is limited by an insufficient description of the context
    • Means of improvement: Add context (metadata) to the data so that it becomes meaningful, and use reasoning techniques to process that metadata
A template to describe use cases (II)
• [optional] Static Information required to interpret the streaming information
  • Type: Topology of the sensor network, position of each sensor, the descriptions of the oil platform
  • Origin: Oil and gas production operations
  • Dimension:
    • 100s of MB as PostGIS dump
    • In triples: 10^8
  • Quality: Good
  • Management/access
    • Technology in use: RDBMS, proprietary technologies
    • Available Ontologies and Vocabularies: Reference Semantic Model (RSM), based on ISO 15926
A tale of four heterogeneities
Heterogeneity #1: Representing RDF Streams
What is an RDF stream?
• Several possibilities:
  • An RDF stream is an infinite sequence of timestamped events (triples or graphs), where timestamps are non-decreasing
      … <event_i, t_i>, <event_{i+1}, t_{i+1}>, <event_{i+2}, t_{i+2}>, …
  • An RDF stream is an infinite sequence of triple occurrences <<s,p,o>, t_α, t_ω>, where <s,p,o> is an RDF triple and t_α and t_ω are the start and end of the interval
• How are timestamps assigned?
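The two candidate definitions above can be sketched as plain Python data types. This is only an illustration: the names `TimestampedEvent`, `TripleOccurrence` and `non_decreasing`, as well as the use of integer timestamps, are assumptions made for this sketch and do not come from any RSP system or from the community group.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class TimestampedEvent:
    """First definition: an event (one triple or a whole graph) with one timestamp."""
    triples: Tuple[Triple, ...]
    t: int


@dataclass(frozen=True)
class TripleOccurrence:
    """Second definition: a triple that holds over the interval [t_start, t_end]."""
    triple: Triple
    t_start: int
    t_end: int


def non_decreasing(events: Iterable[TimestampedEvent]) -> Iterator[TimestampedEvent]:
    """Pass events through, rejecting any timestamp lower than its predecessor."""
    previous = None
    for event in events:
        if previous is not None and event.t < previous:
            raise ValueError(f"timestamp {event.t} follows {previous}")
        previous = event.t
        yield event
```

Equal timestamps are accepted on purpose, since the first definition only asks for non-decreasing order. A question such as "which meetings last less than 5 minutes?" needs the second type, because a single timestamp carries no duration.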
Some examples…
• What would be the best/possible RDF stream representation for the following types of problems?
  • Does Alice meet Bob before Carl?
  • Who does Carl meet first?
  • How many people has Alice met in the last 5 minutes?
  • Does Diana meet Bob and then Carl within 5 minutes?
  • Which meetings last less than 5 minutes?
  • Which meetings have conflicts?
[Figure: events e1 to e4 drawn on a time axis t with ticks at 1, 3, 6 and 9. e1 = :alice :isWith :bob; e2 = :alice :isWith :carl; e3 = :bob :isWith :diana; e4 = :diana :isWith :carl.]
Data types for semantic streams - Summary
• Multiple notions of RDF stream proposed
  • Ordered sequence (implicit timestamp)
  • One timestamp per triple (point-in-time semantics)
  • Two timestamps per triple (interval-based semantics)
• Comparison between existing approaches

  System                 Data item  Time model     # of timestamps
  INSTANS                triple     Implicit       0
  C-SPARQL               triple     Point in time  1
  SPARQLstream           triple     Point in time  1
  CQELS                  triple     Point in time  1
  Sparkwave              triple     Point in time  1
  Streaming Linked Data  RDF graph  Point in time  1
  ETALIS                 triple     Interval       2

• More investigation is required to agree on an RDF stream model
Heterogeneity #2: RDF Stream processors
Existing RDF Stream Processing systems
• C-SPARQL: RDF Store + Stream processor
  • Combined architecture
• CQELS: Implemented from scratch. Focus on performance
  • Native + adaptive joins for static data and streaming data
• CQELS-Cloud: Reusing Storm
  • Paper presentation on Thursday
[Figure: three architectures. C-SPARQL: the static part of a C-SPARQL query goes to the RDF store and the streaming part to the stream processor, producing continuous results. CQELS: a CQELS query is evaluated by a native RSP, producing continuous results. CQELS-Cloud: a translator turns a CQELS query into a Storm topology, producing continuous results.]
Existing RSP systems
• EP-SPARQL: Complex-event detection
  • SEQ, EQUALS operators
• SPARQLStream: Ontology-based stream query answering
  • Virtual RDF views, using R2RML mappings
  • SPARQL stream queries over the original data streams
• INSTANS: RETE-based evaluation
[Figure: two architectures. EP-SPARQL: a translator passes an EP-SPARQL query to a Prolog engine, producing continuous results. SPARQLStream: a rewriter uses R2RML mappings to pass a SPARQLStream query to a DSMS/CEP, producing continuous results.]
Query languages for semantic streams - Summary
• Different architectural choices
  • It is not clear when each choice is best for which type of use case
  • Wrappers over existing systems
    • C-SPARQL, ETALIS, SPARQLstream, CQELS-Cloud
    • Better reliability and maintainability?
  • Native implementations
    • CQELS, Streaming Linked Data, INSTANS
    • Better scalability: optimizations that are not possible in other systems
• Different operational semantics
  • See later
Heterogeneity #3: Querying RDF Streams
Querying data streams (from CQL to SPARQL-X)
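[Figure: the CQL model and its RDF counterpart. Streams are infinite, unbounded bags of timestamped elements <s, τ>; relations are finite bags, given by a mapping T → R from time to relations. Three classes of operators connect them: stream-to-relation (S2R), relation-to-relation (R2R) and relation-to-stream (R2S). For RDF: an RDF stream is turned into a relation R(t) by S2R window operators, SPARQL operators act on it as R2R, and R2S operators produce an RDF stream again.]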
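The S2R, R2R and R2S operator classes can be illustrated with a small pipeline. This is a minimal sketch under stated assumptions: the function names `time_window`, `select_predicate` and `to_stream` are invented for this example, the window covers the half-open interval (now - width, now], and a simple predicate filter stands in for a full SPARQL operator.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
StreamItem = Tuple[Triple, int]  # (triple, timestamp)


def time_window(stream: List[StreamItem], now: int, width: int) -> List[Triple]:
    """S2R: the finite bag of triples with a timestamp in (now - width, now]."""
    return [triple for triple, t in stream if now - width < t <= now]


def select_predicate(relation: List[Triple], predicate: str) -> List[Triple]:
    """R2R: a stand-in for a SPARQL operator, here a filter on the predicate."""
    return [triple for triple in relation if triple[1] == predicate]


def to_stream(relation: List[Triple], now: int) -> List[StreamItem]:
    """R2S: attach the evaluation time to every element of the relation."""
    return [(triple, now) for triple in relation]


stream = [
    ((":alice", ":isWith", ":bob"), 1),
    ((":alice", ":knows", ":carl"), 3),
    ((":bob", ":isWith", ":diana"), 6),
]
relation = time_window(stream, now=6, width=5)
answers = to_stream(select_predicate(relation, ":isWith"), now=6)
```

Whether the window boundaries are open or closed is exactly the kind of detail on which systems may differ, so the choice made here should be read as one option among several.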
RDF
Output: relation
• Case 1: the output is a set of timestamped mappings

  RSP queries:
    SELECT ?a ?b … FROM … WHERE …
    CONSTRUCT {?a :prop ?b} FROM … WHERE …
  Output as bindings: ?a … ?b … [t1], ?a … ?b …, ?a … ?b … [t3], ?a … ?b … [t5], ?a … ?b … [t7]
  Output as triples: <… :prop …> [t1], <… :prop …>, <… :prop …> [t3], <… :prop …> [t5], <… :prop …> [t7]
Output: stream
• Case 2: the output is a stream
  • R2S operators

  RSP query:
    CONSTRUCT RSTREAM {?a :prop ?b} FROM … WHERE …
  Output stream: … <… :prop …> [t1] <… :prop …> [t1] <… :prop …> [t3] <… :prop …> [t5] <… :prop …> [t7] …
ISTREAM: streams out the data in the last step that was not in the previous step
DSTREAM: streams out the data in the previous step that is not in the last step
RSTREAM: streams out all the data in the last step
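The three R2S operators reduce to set operations between two consecutive evaluation steps. The sketch below assumes set semantics (no duplicates) for simplicity, whereas the relations in the CQL model are bags; the function names simply mirror the operator names.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def rstream(previous: Set[Triple], last: Set[Triple]) -> Set[Triple]:
    """Everything in the last step, regardless of the previous step."""
    return set(last)


def istream(previous: Set[Triple], last: Set[Triple]) -> Set[Triple]:
    """Only what is in the last step and was not in the previous step."""
    return last - previous


def dstream(previous: Set[Triple], last: Set[Triple]) -> Set[Triple]:
    """Only what was in the previous step and is no longer in the last step."""
    return previous - last


a = (":alice", ":isWith", ":bob")
b = (":alice", ":isWith", ":carl")
c = (":bob", ":isWith", ":diana")
previous_step, last_step = {a, b}, {b, c}
```

With these two steps, RSTREAM emits b and c, ISTREAM emits only c, and DSTREAM emits only a.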
Other operators
• Sequence operators and CEP world
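[Figure: events e1 to e4 of a stream S on a time axis with ticks at 1, 3, 6 and 9, illustrating a sequence and a simultaneous occurrence.]
SEQ: joins e_{ti,tf} and e'_{ti',tf'} if e' occurs after e
EQUALS: joins e_{ti,tf} and e'_{ti',tf'} if they occur simultaneously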
OPTIONALSEQ, OPTIONALEQUALS: Optional join variants
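The SEQ and EQUALS joins can be sketched over interval-stamped events. The exact conditions are assumptions made for this illustration: "occurs after" is read as "starts strictly after the other event ends", and "simultaneously" as "same start and same end". The `Event` type and the function names are hypothetical and are not EP-SPARQL or ETALIS code.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    name: str
    t_start: int  # ti in the notation above
    t_end: int    # tf in the notation above


def seq(left: List[Event], right: List[Event]) -> List[Tuple[Event, Event]]:
    """Join e with e' when e' starts strictly after e ends (assumed reading)."""
    return [(e, f) for e in left for f in right if f.t_start > e.t_end]


def equals(left: List[Event], right: List[Event]) -> List[Tuple[Event, Event]]:
    """Join e with e' when both cover exactly the same interval (assumed reading)."""
    return [
        (e, f)
        for e in left
        for f in right
        if (e.t_start, e.t_end) == (f.t_start, f.t_end)
    ]


e1 = Event("e1", 1, 3)
e2 = Event("e2", 6, 9)
e3 = Event("e3", 1, 3)
```

Under these readings, e1 and e2 are in sequence while e1 and e3 are simultaneous. The optional variants would additionally keep left-hand events that find no partner.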
Query languages for semantic streams - Summary
• Comparison between existing approaches

  System                 S2R                       R2R               Time-aware               R2S
  INSTANS                Based on time events      SPARQL update     Based on time events     Ins only
  C-SPARQL Engine        Logical and triple-based  SPARQL 1.1 query  timestamp function       Batch only
  SPARQLstream           Logical and triple-based  SPARQL 1.1 query  no                       Ins, batch, del
  CQELS                  Logical and triple-based  SPARQL 1.1 query  no                       Ins only
  Sparkwave              Logical                   SPARQL 1.0        no                       Ins only
  Streaming Linked Data  Logical and graph-based   SPARQL 1.1        no                       Batch only
  ETALIS                 no                        SPARQL 1.0        SEQ, PAR, AND, OR,       Ins only
                                                                     DURING, STARTS, EQUALS,
                                                                     NOT, MEETS, FINISHES

• Is it time to converge on a standard?

Query languages for semantic streams - Issues
• Different syntax for S2R operator
• Semantics of query languages is similar, but not identical
• Lack of R2S operator in some cases
• Different support for time-aware operators
Classification of existing systems
Heterogeneity #4: Operational Semantics
Operational Semantics
[Figure: stream S with items S1 to S4 on a time axis t with ticks at 1, 3, 6 and 9, carrying the triples :bob :isIn :hall, :bob :isIn :kitchen, :alice :isIn :hall and :alice :isIn :kitchen.]
Where are both alice and bob in the last 5s?
System 1: :hall [5] :kitchen [10]
System 2: :hall [3] :kitchen [9]
Both correct?
ISWC 2013 evaluation track: "On Correctness in RDF stream processor benchmarking" by Daniele Dell’Aglio, Jean-Paul Calbimonte, Marco Balduini, Oscar Corcho and Emanuele Della Valle
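One way in which two systems could both be right is sketched below. Everything specific here is an assumption made for illustration: the assignment of triples to the timestamps 1, 3, 6 and 9, the window (now - 5, now], and the two reporting policies (evaluate every 5 seconds versus evaluate on every arrival). The sketch does not claim to describe how any actual RSP engine behaves.

```python
from typing import List, NamedTuple, Tuple


class Obs(NamedTuple):
    person: str
    room: str
    t: int  # seconds


# Hypothetical assignment of the four triples to the four ticks.
STREAM = [
    Obs("bob", "hall", 1),
    Obs("alice", "hall", 3),
    Obs("bob", "kitchen", 6),
    Obs("alice", "kitchen", 9),
]
WIDTH = 5


def shared_rooms(stream: List[Obs], now: int, width: int = WIDTH) -> List[str]:
    """Rooms where both alice and bob were observed within (now - width, now]."""
    rooms = {}
    for obs in stream:
        if now - width < obs.t <= now:
            rooms.setdefault(obs.room, set()).add(obs.person)
    return sorted(room for room, people in rooms.items() if {"alice", "bob"} <= people)


def report_periodically(stream: List[Obs], width: int = WIDTH, until: int = 10) -> List[Tuple[str, int]]:
    """Time-driven policy: evaluate the window every `width` seconds."""
    return [
        (room, now)
        for now in range(width, until + 1, width)
        for room in shared_rooms(stream, now, width)
    ]


def report_on_arrival(stream: List[Obs], width: int = WIDTH) -> List[Tuple[str, int]]:
    """Data-driven policy: evaluate the window whenever a new item arrives."""
    return [(room, obs.t) for obs in stream for room in shared_rooms(stream, obs.t, width)]
```

Under these assumptions the periodic policy yields the answers of System 1 and the arrival-driven policy those of System 2, although both evaluate the same query over the same data.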
Conclusions…
Next steps in the community group…
• Agree on an RDF model?
  • Metamodel?
  • Timestamps in graphs?
  • Timestamp intervals
  • Compatibility with normal (static) RDF
• Additional operators for SPARQL?
  • Windows (not only time-based?)
  • CEP operators
  • Semantics
• Go Web
  • Volatile URIs
  • Serialization: terse, compact
  • Protocols: HTTP, WebSockets?