Graph-Parallel Querying of RDF with GraphX [1]schaetzl/talks/S2X_Big...31.08.2015 S2X:...

36
S2X Graph-Parallel Querying of RDF with GraphX [1] Alexander Schätzle , Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen University of Freiburg, Germany [email protected] http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/ Big-O(Q) 2015 Hawaii, USA

Transcript of Graph-Parallel Querying of RDF with GraphX [1]schaetzl/talks/S2X_Big...31.08.2015 S2X:...

S2X

Graph-Parallel Querying of RDF with GraphX [1]

Alexander Schätzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen

University of Freiburg, Germany

[email protected]

http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/

Big-O(Q) 2015 – Hawaii, USA

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 2

Motivation

Semantic Web has arrived in real-world applications(not only academia & research)

• Web-scale semantic data makes single machine solutions infeasible

Hadoop ecosystem has become de-facto

standard for Big Data applications

Our Idea: Use it for Semantic Web purposes as well

• Main reasons:

1) Web-scale semantic data requires solutions that scale out

2) Industry has settled on Hadoop (or Hadoop-style) architectures

superior cost-benefit ratio compared to specialized infrastructures

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 3

Motivation

RDF/SPARQL-on-Hadoop

• Existing solutions are based on relational-style data models

implemented on data-parallel frameworks

• Hence they do not exploit the inherent graph-like structure of RDF

Parallel Graph Processing Systems

• e.g. Pregel, Giraph, GraphLab

• Tailored towards iterative graph algorithms

• But: Not integrated with Hadoop or hard to combine with other frameworks

Data movement or duplication

S2X (SPARQL on Spark with GraphX)

• Spark GraphX allows to combine graph-parallel and data-parallel computation

• GraphX used for graph pattern matching

• Other SPARQL operators implemented in Spark

Spark GraphX

Apache Spark

• General-purpose in-memory cluster computing system

• Based on Resilient Distributed Datasets (RDD)

- distributed, fault-tolerant collection of elements

- kept in memory and partitioned across the cluster

• Jobs are modeled as Directed Acyclic Graphs (DAG) of tasks

- each tasks runs on a horizontal partition of an RDD

GraphX

• Spark API for graphs and graph-parallel computation

• Uses a Property Graph Data Model (represented by two RDDs)

• Vertex-Centric view of graphs (similar to Pregel)

• Adopts Bulk-Synchronous Parallel (BSP) execution model

- vertex programs run concurrently in a sequence of Supersteps

• Vertex-Cut partitioning

- evenly assigns edges to machines, vertices can span multiple machines

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 4

RDF Property Graph Model

Property Graph (PG)

• 𝑃𝐺 𝑃 = 𝑉, 𝐸, 𝑃

• 𝑉 = 𝑠𝑒𝑡 𝑜𝑓 𝑣𝑒𝑟𝑡𝑖𝑐𝑒𝑠, 𝐸 = 𝑠𝑒𝑡 𝑜𝑓 𝑑𝑖𝑟𝑒𝑐𝑡𝑒𝑑 𝑒𝑑𝑔𝑒𝑠

• 𝑃 = 𝑃𝑉 , 𝑃𝐸 = 𝑐𝑜𝑙𝑙𝑒𝑐𝑡𝑖𝑜𝑛 𝑜𝑓 𝑣𝑒𝑟𝑡𝑒𝑥 𝑎𝑛𝑑 𝑒𝑑𝑔𝑒 𝑝𝑟𝑜𝑝𝑒𝑟𝑡𝑖𝑒𝑠 𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒

RDF Graph (G)

• 𝐺 = 𝑡1, … , 𝑡𝑛 = 𝑠𝑒𝑡 𝑜𝑓 𝑅𝐷𝐹 𝑡𝑟𝑖𝑝𝑙𝑒𝑠

• 𝑅𝐷𝐹 𝑡𝑟𝑖𝑝𝑙𝑒 𝑡 = 𝑠𝑢𝑏𝑗𝑒𝑐𝑡, 𝑝𝑟𝑒𝑑𝑖𝑐𝑎𝑡𝑒, 𝑜𝑏𝑗𝑒𝑐𝑡

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 5

i j

𝑃𝐸 𝑖, 𝑗 = 𝑘, 𝑣

𝑃𝑉 𝑖 = 𝑘, 𝑣 𝑃𝑉 𝑗 = 𝑘, 𝑣

𝐼𝐷𝑖 𝐼𝐷𝑗

s op

IRIs = global identifierse.g. dbpedia:Leonardo_da_Vinci

RDF Property Graph Model

RDF to PG - Example

• 𝐺 = 𝐴, 𝑘𝑛𝑜𝑤𝑠, 𝐵 , 𝐴, 𝑙𝑖𝑘𝑒𝑠, 𝐵 , 𝐴, 𝑙𝑖𝑘𝑒𝑠, 𝐶 , 𝐵, 𝑘𝑛𝑜𝑤𝑠, 𝐶

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 6

A B

knows

C

likes

knows

likes

𝐼𝐷𝐴 = 𝑣𝐴 𝐼𝐷𝐵 = 𝑣𝐵 𝐼𝐷𝐶 = 𝑣𝐶

𝑃𝐸 𝑣𝐴, 𝑣𝐶 . 𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑃𝐸 𝑣𝐴, 𝑣𝐵 . 𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑃𝐸 𝑣𝐴, 𝑣𝐵 . 𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑃𝐸 𝑣𝐵 , 𝑣𝐶 . 𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑃𝑉 𝑣𝐴 . 𝑙𝑎𝑏𝑒𝑙 = 𝐴 𝑃𝑉 𝑣𝐵 . 𝑙𝑎𝑏𝑒𝑙 = 𝐵 𝑃𝑉 𝑣𝐶 . 𝑙𝑎𝑏𝑒𝑙 = 𝐶

SPARQL

Standard Query Language for RDF

(Solution) Mapping

• 𝜇 ∶ 𝑉 → 𝑅𝐷𝐹 𝑇𝑒𝑟𝑚𝑠

• Result of a triple pattern 𝑡𝑝 is a bag of mappings

Ω𝑡𝑝 = 𝜇 𝑑𝑜𝑚 𝜇 = 𝑣𝑎𝑟𝑠 𝑡𝑝 , 𝜇 𝑡𝑝 ∈ 𝐺

• Result of a basic graph pattern 𝑏𝑔𝑝 = 𝑡𝑝1, … , 𝑡𝑝𝑚 is the merge of

compatible mappings (Join on common variables)

Ω𝑏𝑔𝑝 = Ω𝑡𝑝1 ⋈ … ⋈ Ω𝑡𝑝𝑚

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 7

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

triple pattern

basic graph pattern (BGP)

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 8

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 9

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

𝜇11 𝐴 𝐵

𝜇12 𝐵 𝐶

Ω𝑡𝑝1

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 10

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

𝜇11 𝐴 𝐵

𝜇12 𝐵 𝐶

Ω𝑡𝑝1

? 𝑨 ?𝑩

𝜇21 𝐴 𝐵

𝜇22 𝐴 𝐶

Ω𝑡𝑝2

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 11

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

𝜇11 𝐴 𝐵

𝜇12 𝐵 𝐶

Ω𝑡𝑝1

? 𝑨 ?𝑩

𝜇21 𝐴 𝐵

𝜇22 𝐴 𝐶

Ω𝑡𝑝2

?𝑩 ?𝑪

𝜇31 𝐴 𝐵

𝜇32 𝐵 𝐶

Ω𝑡𝑝3

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 12

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

𝜇11 𝐴 𝐵

𝜇12 𝐵 𝐶

Ω𝑡𝑝1

? 𝑨 ?𝑩

𝜇21 𝐴 𝐵

𝜇22 𝐴 𝐶

Ω𝑡𝑝2

?𝑩 ?𝑪

𝜇31 𝐴 𝐵

𝜇32 𝐵 𝐶

Ω𝑡𝑝3

⋈ ⋈

Ω𝑏𝑔𝑝 = 𝜇 ∶ ? 𝐴 → 𝐴, ? 𝐵 → 𝐵, ? 𝐶 → 𝐶

S2X Query Processing

Basic Idea:

• Every vertex stores the variables of a query where it is a candidate for

• Vertices exchange their candidate sets for validation

Match Candidate:

• 𝑚𝑎𝑡𝑐ℎ𝐶 = 𝑣, ? 𝑣𝑎𝑟, 𝜇, 𝑡𝑝

• 𝑣 is a candidate for ? 𝑣𝑎𝑟 with mapping 𝜇 derived from triple pattern 𝑡𝑝

Match Set:

• Set of all match candidates of a vertex

• Match set is stored as a property of the vertex 𝑃𝑉 𝑣 .𝑚𝑎𝑡𝑐ℎ𝑆

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 13

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠𝑡𝑝 = (?𝐴 𝑘𝑛𝑜𝑤𝑠 ?𝐵)

𝑣𝐴, ? 𝐴, 𝜇: ?𝐴 → 𝐴, ?𝐵 → 𝐵 , 𝑡𝑝 𝑣𝐵, ? 𝐵, 𝜇: ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 , 𝑡𝑝

S2X Query Processing

BGP Matching in a Nutshell

• Superstep 1:

- Determine all possible match candidates

Iterate over all edges and match with all triple patterns at once

- Yields an overestimation of the BGP result

• Following Supersteps:

- Validate Local Match Sets

Does a vertex have all required match candidates for a variable?

- Validate Remote Match Sets

Are the neighbors of vertex still the required candidates?

- Exchange Match Sets with Neighbor Vertices

- Repeat until no changes occur

• Result Generation (solution mappings):

- Collect and merge (join) all match sets to produce the final mappings

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 14

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 16

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 17

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 18

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 19

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝐵 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 20

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝐵 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

Overestimation!

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 21

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

𝐷𝑜𝑒𝑠 ?𝐵 , ? 𝐵 → 𝐴,… , 𝑡𝑝1 𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!

Validation of Local Match Sets:

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 22

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2 𝐷𝑜𝑒𝑠 ?𝐴 , ? 𝐴 → 𝐴, ?𝐵 → 𝐶 , 𝑡𝑝1 𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!

Validation of Local Match Sets:

𝐷𝑜𝑒𝑠 ?𝐵 , ? 𝐵 → 𝐴,… , 𝑡𝑝1 𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 23

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐴 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2

? 𝐵 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

Removed!

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 24

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

Exchange

Exchange

BGP Matching - Example

Superstep 3

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 25

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

𝐷𝑜𝑒𝑠 ?𝐵 , ?𝐵 → 𝐴, ? 𝐶 → 𝐵 , 𝑡𝑝3𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!

Validation of

Remote Match Sets:

BGP Matching - Example

Superstep 3

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 26

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3

? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

Removed!No Changes! No Changes!

BGP Matching - Example

Superstep 3

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 27

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

Exchange

Exchange

BGP Matching - Example

Superstep 4

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 28

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴

𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵

𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1

? 𝐵 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2

? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

? 𝑣𝑎𝑟 𝜇 𝑡𝑝

? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3

No Changes! No Changes!

Experiments

Small Cluster with low-end configuration

• 10 machines (1 master, 9 worker), 2 disks each, Gigabit network

• 24 GB RAM for Spark, 60% assigned for graph caching(typical Hadoop nodes currently ≥ 256 GB RAM, ≥ 12 disks, ≥ 10 Gigabit network)

• CDH 5.3, Spark 1.2.0, Pig 0.12

WatDiv Benchmark [2]

• Combines e-commerce scenario with a kind of social network

• All different kinds of query shapes (star, linear, snowflake, complex)

• Scale Factors:

- 10 (10K users, ~0.1M vertices)

- 100 (100K users, ~1M vertices)

- 1000 (1M users, ~10M vertices)

Compared with PigSPARQL [3]

• SPARQL Query Processing Baseline for Hadoop

• Proven to be competitive to other Hadoop research approaches

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 29

Experiments

Geometric Mean per Query Group

• S2X outperforms PigSPARQL by up to an order of magnitude

• Runtime of S2X strongly depends on result size (better for small output)

• Runtime can be highly influenced by adjusting parameters of Spark/GraphX

• GraphX currently quite “memory-heavy” (early stage of development)

• Very good preliminary results but still much room for improvement in S2X

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 30

S L SF C

S2X 7,62 9,62 17,98 52,89

PigSPARQL 43,08 62,19 88,09 127,34

0

20

40

60

80

100

120

140

Overall Runtime

S L SF C

result 1,98 2,92 8,03 23,49

BGP 5,62 6,66 9,93 29,04

0

10

20

30

40

50

60

S2X - BGP Matching vs. Result Generation

Summary

S2X (SPARQL on Spark with GraphX)

• SPARQL Query Processor for Hadoop based on Spark GraphX

• Property Graph representation for RDF

• Combines graph-parallel and data-parallel computation

- graph-parallel: BGP matching

- data-parallel: BGP solution mapping generation + other SPARQL operators

Experiments

• Very good preliminary results

- Up to an order of magnitude performance improvement compared to a

state-of-the-art SPARQL Processor baseline for Hadoop

• GraphX is in an early stage of development

- Many improvements can be expected in future GraphX versions

• Many possible improvements for S2X

- Incremental exchange of match sets I/O reduction

- Optimization of graph partitioning avoid unbalanced workloads

- Incremental BGP matching for selective queries Reduction of Overestimation

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 31

Thank You

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 32

Thank you for your attention!

Questions?

References

[1] Alexander Schätzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen:

S2X: Graph-Parallel Querying of RDF with GraphX. In: Big-O(Q) 2015

[2] Günes Aluc, Olaf Hartig, M. Tamer Özsu, Khuzaima Daudjee:

Diversified Stress Testing of RDF Data Management Systems. In: ISWC 2014, pp. 197-212

[3] Alexander Schätzle, Martin Przyjaciel-Zablocki, Thomas Hornung, Georg Lausen:

PigSPARQL: A SPARQL Query Processing Baseline for Big Data. In: ISWC 2013 Posters

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 33

Experiments

Star Queries

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 34

10 100 1000

result 0,70 1,60 6,87

BGP 1,93 3,73 24,66

0

5

10

15

20

25

30

35

S2X - BGP Matching vs. Result Generation

10 100 1000

S2X 2,63 5,33 31,53

PigSPARQL 41,43 41,28 46,74

0

5

10

15

20

25

30

35

40

45

50

Geometric Mean

Experiments

Linear Queries

• Strong variation in runtimes (min: ~14s, max: ~348s)

• For the long running queries, result generation takes more time than BGP

matching, e.g. 348s = (124s, 224s) many results

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 35

10 100 1000

S2X 2,21 5,23 77,25

PigSPARQL 61,45 61,23 63,93

0

10

20

30

40

50

60

70

80

Geometric Mean

10 100 1000

result 0,55 1,64 27,41

BGP 1,65 3,58 49,85

0

10

20

30

40

50

60

70

80

S2X - BGP Matching vs. Result Generation

Experiments

Snowflake Queries

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 36

10 100 1000

S2X 5,84 12,99 76,63

PigSPARQL 85,47 85,21 93,85

0

10

20

30

40

50

60

70

80

90

100

Geometric Mean

10 100 1000

result 2,58 6,20 32,39

BGP 3,26 6,79 44,25

0

10

20

30

40

50

60

70

80

S2X - BGP Matching vs. Result Generation

Experiments

Complex Queries

• Queries produce many results

for some queries, result generation takes more time than BGP matching

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 37

10 100 1000

S2X 17,53 41,17 204,96

PigSPARQL 116,13 123,76 143,66

0

30

60

90

120

150

180

210

Geometric Mean

10 100 1000

result 8,09 20,98 76,40

BGP 9,44 20,19 128,57

0

30

60

90

120

150

180

210

S2X - BGP Matching vs. Result Generation