Graph-Parallel Querying of RDF with GraphX [1]schaetzl/talks/S2X_Big...31.08.2015 S2X:...
Transcript of Graph-Parallel Querying of RDF with GraphX [1]schaetzl/talks/S2X_Big...31.08.2015 S2X:...
S2X
Graph-Parallel Querying of RDF with GraphX [1]
Alexander Schätzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen
University of Freiburg, Germany
http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/
Big-O(Q) 2015 – Hawaii, USA
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 2
Motivation
Semantic Web has arrived in real-world applications(not only academia & research)
• Web-scale semantic data makes single machine solutions infeasible
Hadoop ecosystem has become de-facto
standard for Big Data applications
Our Idea: Use it for Semantic Web purposes as well
• Main reasons:
1) Web-scale semantic data requires solutions that scale out
2) Industry has settled on Hadoop (or Hadoop-style) architectures
superior cost-benefit ratio compared to specialized infrastructures
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 3
Motivation
RDF/SPARQL-on-Hadoop
• Existing solutions are based on relational-style data models
implemented on data-parallel frameworks
• Hence they do not exploit the inherent graph-like structure of RDF
Parallel Graph Processing Systems
• e.g. Pregel, Giraph, GraphLab
• Tailored towards iterative graph algorithms
• But: Not integrated with Hadoop or hard to combine with other frameworks
Data movement or duplication
S2X (SPARQL on Spark with GraphX)
• Spark GraphX allows to combine graph-parallel and data-parallel computation
• GraphX used for graph pattern matching
• Other SPARQL operators implemented in Spark
Spark GraphX
Apache Spark
• General-purpose in-memory cluster computing system
• Based on Resilient Distributed Datasets (RDD)
- distributed, fault-tolerant collection of elements
- kept in memory and partitioned across the cluster
• Jobs are modeled as Directed Acyclic Graphs (DAG) of tasks
- each tasks runs on a horizontal partition of an RDD
GraphX
• Spark API for graphs and graph-parallel computation
• Uses a Property Graph Data Model (represented by two RDDs)
• Vertex-Centric view of graphs (similar to Pregel)
• Adopts Bulk-Synchronous Parallel (BSP) execution model
- vertex programs run concurrently in a sequence of Supersteps
• Vertex-Cut partitioning
- evenly assigns edges to machines, vertices can span multiple machines
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 4
RDF Property Graph Model
Property Graph (PG)
• 𝑃𝐺 𝑃 = 𝑉, 𝐸, 𝑃
• 𝑉 = 𝑠𝑒𝑡 𝑜𝑓 𝑣𝑒𝑟𝑡𝑖𝑐𝑒𝑠, 𝐸 = 𝑠𝑒𝑡 𝑜𝑓 𝑑𝑖𝑟𝑒𝑐𝑡𝑒𝑑 𝑒𝑑𝑔𝑒𝑠
• 𝑃 = 𝑃𝑉 , 𝑃𝐸 = 𝑐𝑜𝑙𝑙𝑒𝑐𝑡𝑖𝑜𝑛 𝑜𝑓 𝑣𝑒𝑟𝑡𝑒𝑥 𝑎𝑛𝑑 𝑒𝑑𝑔𝑒 𝑝𝑟𝑜𝑝𝑒𝑟𝑡𝑖𝑒𝑠 𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒
RDF Graph (G)
• 𝐺 = 𝑡1, … , 𝑡𝑛 = 𝑠𝑒𝑡 𝑜𝑓 𝑅𝐷𝐹 𝑡𝑟𝑖𝑝𝑙𝑒𝑠
• 𝑅𝐷𝐹 𝑡𝑟𝑖𝑝𝑙𝑒 𝑡 = 𝑠𝑢𝑏𝑗𝑒𝑐𝑡, 𝑝𝑟𝑒𝑑𝑖𝑐𝑎𝑡𝑒, 𝑜𝑏𝑗𝑒𝑐𝑡
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 5
i j
𝑃𝐸 𝑖, 𝑗 = 𝑘, 𝑣
𝑃𝑉 𝑖 = 𝑘, 𝑣 𝑃𝑉 𝑗 = 𝑘, 𝑣
𝐼𝐷𝑖 𝐼𝐷𝑗
s op
IRIs = global identifierse.g. dbpedia:Leonardo_da_Vinci
RDF Property Graph Model
RDF to PG - Example
• 𝐺 = 𝐴, 𝑘𝑛𝑜𝑤𝑠, 𝐵 , 𝐴, 𝑙𝑖𝑘𝑒𝑠, 𝐵 , 𝐴, 𝑙𝑖𝑘𝑒𝑠, 𝐶 , 𝐵, 𝑘𝑛𝑜𝑤𝑠, 𝐶
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 6
A B
knows
C
likes
knows
likes
𝐼𝐷𝐴 = 𝑣𝐴 𝐼𝐷𝐵 = 𝑣𝐵 𝐼𝐷𝐶 = 𝑣𝐶
𝑃𝐸 𝑣𝐴, 𝑣𝐶 . 𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑃𝐸 𝑣𝐴, 𝑣𝐵 . 𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑃𝐸 𝑣𝐴, 𝑣𝐵 . 𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑃𝐸 𝑣𝐵 , 𝑣𝐶 . 𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑃𝑉 𝑣𝐴 . 𝑙𝑎𝑏𝑒𝑙 = 𝐴 𝑃𝑉 𝑣𝐵 . 𝑙𝑎𝑏𝑒𝑙 = 𝐵 𝑃𝑉 𝑣𝐶 . 𝑙𝑎𝑏𝑒𝑙 = 𝐶
SPARQL
Standard Query Language for RDF
(Solution) Mapping
• 𝜇 ∶ 𝑉 → 𝑅𝐷𝐹 𝑇𝑒𝑟𝑚𝑠
• Result of a triple pattern 𝑡𝑝 is a bag of mappings
Ω𝑡𝑝 = 𝜇 𝑑𝑜𝑚 𝜇 = 𝑣𝑎𝑟𝑠 𝑡𝑝 , 𝜇 𝑡𝑝 ∈ 𝐺
• Result of a basic graph pattern 𝑏𝑔𝑝 = 𝑡𝑝1, … , 𝑡𝑝𝑚 is the merge of
compatible mappings (Join on common variables)
Ω𝑏𝑔𝑝 = Ω𝑡𝑝1 ⋈ … ⋈ Ω𝑡𝑝𝑚
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 7
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
triple pattern
basic graph pattern (BGP)
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 8
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 9
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? 𝑨 ?𝑩
𝜇11 𝐴 𝐵
𝜇12 𝐵 𝐶
Ω𝑡𝑝1
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 10
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? 𝑨 ?𝑩
𝜇11 𝐴 𝐵
𝜇12 𝐵 𝐶
Ω𝑡𝑝1
? 𝑨 ?𝑩
𝜇21 𝐴 𝐵
𝜇22 𝐴 𝐶
Ω𝑡𝑝2
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 11
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? 𝑨 ?𝑩
𝜇11 𝐴 𝐵
𝜇12 𝐵 𝐶
Ω𝑡𝑝1
? 𝑨 ?𝑩
𝜇21 𝐴 𝐵
𝜇22 𝐴 𝐶
Ω𝑡𝑝2
?𝑩 ?𝑪
𝜇31 𝐴 𝐵
𝜇32 𝐵 𝐶
Ω𝑡𝑝3
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 12
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? 𝑨 ?𝑩
𝜇11 𝐴 𝐵
𝜇12 𝐵 𝐶
Ω𝑡𝑝1
? 𝑨 ?𝑩
𝜇21 𝐴 𝐵
𝜇22 𝐴 𝐶
Ω𝑡𝑝2
?𝑩 ?𝑪
𝜇31 𝐴 𝐵
𝜇32 𝐵 𝐶
Ω𝑡𝑝3
⋈ ⋈
Ω𝑏𝑔𝑝 = 𝜇 ∶ ? 𝐴 → 𝐴, ? 𝐵 → 𝐵, ? 𝐶 → 𝐶
S2X Query Processing
Basic Idea:
• Every vertex stores the variables of a query where it is a candidate for
• Vertices exchange their candidate sets for validation
Match Candidate:
• 𝑚𝑎𝑡𝑐ℎ𝐶 = 𝑣, ? 𝑣𝑎𝑟, 𝜇, 𝑡𝑝
• 𝑣 is a candidate for ? 𝑣𝑎𝑟 with mapping 𝜇 derived from triple pattern 𝑡𝑝
Match Set:
• Set of all match candidates of a vertex
• Match set is stored as a property of the vertex 𝑃𝑉 𝑣 .𝑚𝑎𝑡𝑐ℎ𝑆
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 13
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠𝑡𝑝 = (?𝐴 𝑘𝑛𝑜𝑤𝑠 ?𝐵)
𝑣𝐴, ? 𝐴, 𝜇: ?𝐴 → 𝐴, ?𝐵 → 𝐵 , 𝑡𝑝 𝑣𝐵, ? 𝐵, 𝜇: ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 , 𝑡𝑝
S2X Query Processing
BGP Matching in a Nutshell
• Superstep 1:
- Determine all possible match candidates
Iterate over all edges and match with all triple patterns at once
- Yields an overestimation of the BGP result
• Following Supersteps:
- Validate Local Match Sets
Does a vertex have all required match candidates for a variable?
- Validate Remote Match Sets
Are the neighbors of vertex still the required candidates?
- Exchange Match Sets with Neighbor Vertices
- Repeat until no changes occur
• Result Generation (solution mappings):
- Collect and merge (join) all match sets to produce the final mappings
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 14
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 16
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 17
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 18
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 19
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝐵 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 20
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝐵 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
Overestimation!
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 21
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
𝐷𝑜𝑒𝑠 ?𝐵 , ? 𝐵 → 𝐴,… , 𝑡𝑝1 𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!
Validation of Local Match Sets:
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 22
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2 𝐷𝑜𝑒𝑠 ?𝐴 , ? 𝐴 → 𝐴, ?𝐵 → 𝐶 , 𝑡𝑝1 𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!
Validation of Local Match Sets:
𝐷𝑜𝑒𝑠 ?𝐵 , ? 𝐵 → 𝐴,… , 𝑡𝑝1 𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 23
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐴 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐶 𝑡𝑝2
? 𝐵 ? 𝐴 → 𝐵, ? 𝐵 → 𝐶 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
Removed!
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 24
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
Exchange
Exchange
BGP Matching - Example
Superstep 3
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 25
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
𝐷𝑜𝑒𝑠 ?𝐵 , ?𝐵 → 𝐴, ? 𝐶 → 𝐵 , 𝑡𝑝3𝑒𝑥𝑖𝑠𝑡 𝑖𝑛 𝑣𝐴 ⇒ 𝑁𝑂!
Validation of
Remote Match Sets:
BGP Matching - Example
Superstep 3
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 26
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐶 ?𝐵 → 𝐴, ? 𝐶 → 𝐵 𝑡𝑝3
? 𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
Removed!No Changes! No Changes!
BGP Matching - Example
Superstep 3
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 27
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
Exchange
Exchange
BGP Matching - Example
Superstep 4
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 28
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
𝑣𝐴𝑙𝑎𝑏𝑒𝑙 = 𝐴
𝑣𝐵𝑙𝑎𝑏𝑒𝑙 = 𝐵
𝑣𝐶𝑙𝑎𝑏𝑒𝑙 = 𝐶
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
𝑙𝑎𝑏𝑒𝑙 = 𝑙𝑖𝑘𝑒𝑠𝑙𝑎𝑏𝑒𝑙 = 𝑘𝑛𝑜𝑤𝑠
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐴 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
?𝐵 ?𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝1
? 𝐵 ? 𝐴 → 𝐴, ? 𝐵 → 𝐵 𝑡𝑝2
? 𝐵 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
? 𝑣𝑎𝑟 𝜇 𝑡𝑝
? 𝐶 ?𝐵 → 𝐵, ? 𝐶 → 𝐶 𝑡𝑝3
No Changes! No Changes!
Experiments
Small Cluster with low-end configuration
• 10 machines (1 master, 9 worker), 2 disks each, Gigabit network
• 24 GB RAM for Spark, 60% assigned for graph caching(typical Hadoop nodes currently ≥ 256 GB RAM, ≥ 12 disks, ≥ 10 Gigabit network)
• CDH 5.3, Spark 1.2.0, Pig 0.12
WatDiv Benchmark [2]
• Combines e-commerce scenario with a kind of social network
• All different kinds of query shapes (star, linear, snowflake, complex)
• Scale Factors:
- 10 (10K users, ~0.1M vertices)
- 100 (100K users, ~1M vertices)
- 1000 (1M users, ~10M vertices)
Compared with PigSPARQL [3]
• SPARQL Query Processing Baseline for Hadoop
• Proven to be competitive to other Hadoop research approaches
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 29
Experiments
Geometric Mean per Query Group
• S2X outperforms PigSPARQL by up to an order of magnitude
• Runtime of S2X strongly depends on result size (better for small output)
• Runtime can be highly influenced by adjusting parameters of Spark/GraphX
• GraphX currently quite “memory-heavy” (early stage of development)
• Very good preliminary results but still much room for improvement in S2X
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 30
S L SF C
S2X 7,62 9,62 17,98 52,89
PigSPARQL 43,08 62,19 88,09 127,34
0
20
40
60
80
100
120
140
Overall Runtime
S L SF C
result 1,98 2,92 8,03 23,49
BGP 5,62 6,66 9,93 29,04
0
10
20
30
40
50
60
S2X - BGP Matching vs. Result Generation
Summary
S2X (SPARQL on Spark with GraphX)
• SPARQL Query Processor for Hadoop based on Spark GraphX
• Property Graph representation for RDF
• Combines graph-parallel and data-parallel computation
- graph-parallel: BGP matching
- data-parallel: BGP solution mapping generation + other SPARQL operators
Experiments
• Very good preliminary results
- Up to an order of magnitude performance improvement compared to a
state-of-the-art SPARQL Processor baseline for Hadoop
• GraphX is in an early stage of development
- Many improvements can be expected in future GraphX versions
• Many possible improvements for S2X
- Incremental exchange of match sets I/O reduction
- Optimization of graph partitioning avoid unbalanced workloads
- Incremental BGP matching for selective queries Reduction of Overestimation
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 31
Thank You
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 32
Thank you for your attention!
Questions?
References
[1] Alexander Schätzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen:
S2X: Graph-Parallel Querying of RDF with GraphX. In: Big-O(Q) 2015
[2] Günes Aluc, Olaf Hartig, M. Tamer Özsu, Khuzaima Daudjee:
Diversified Stress Testing of RDF Data Management Systems. In: ISWC 2014, pp. 197-212
[3] Alexander Schätzle, Martin Przyjaciel-Zablocki, Thomas Hornung, Georg Lausen:
PigSPARQL: A SPARQL Query Processing Baseline for Big Data. In: ISWC 2013 Posters
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 33
Experiments
Star Queries
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 34
10 100 1000
result 0,70 1,60 6,87
BGP 1,93 3,73 24,66
0
5
10
15
20
25
30
35
S2X - BGP Matching vs. Result Generation
10 100 1000
S2X 2,63 5,33 31,53
PigSPARQL 41,43 41,28 46,74
0
5
10
15
20
25
30
35
40
45
50
Geometric Mean
Experiments
Linear Queries
• Strong variation in runtimes (min: ~14s, max: ~348s)
• For the long running queries, result generation takes more time than BGP
matching, e.g. 348s = (124s, 224s) many results
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 35
10 100 1000
S2X 2,21 5,23 77,25
PigSPARQL 61,45 61,23 63,93
0
10
20
30
40
50
60
70
80
Geometric Mean
10 100 1000
result 0,55 1,64 27,41
BGP 1,65 3,58 49,85
0
10
20
30
40
50
60
70
80
S2X - BGP Matching vs. Result Generation
Experiments
Snowflake Queries
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 36
10 100 1000
S2X 5,84 12,99 76,63
PigSPARQL 85,47 85,21 93,85
0
10
20
30
40
50
60
70
80
90
100
Geometric Mean
10 100 1000
result 2,58 6,20 32,39
BGP 3,26 6,79 44,25
0
10
20
30
40
50
60
70
80
S2X - BGP Matching vs. Result Generation
Experiments
Complex Queries
• Queries produce many results
for some queries, result generation takes more time than BGP matching
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 37
10 100 1000
S2X 17,53 41,17 204,96
PigSPARQL 116,13 123,76 143,66
0
30
60
90
120
150
180
210
Geometric Mean
10 100 1000
result 8,09 20,98 76,40
BGP 9,44 20,19 128,57
0
30
60
90
120
150
180
210
S2X - BGP Matching vs. Result Generation