Querying Distributed RDF Graphs: The Effects of Partitioning Poster
Anthony Potter, Boris Motik, Ian Horrocks
Challenges
Create a distributed, cloud-based DBMS for large RDF graphs. The two main challenges are:
• How to partition data across a cluster
• How to answer queries over a cluster
Current Approaches
Hashing
• A hash function maps each triple to a partition element, often by subject.
• Groups triples with the same attribute value.
Graph-based approaches
• Exploit the graph structure of RDF data.
• Group triples that are close together in the graph.
N-hop duplication
• Can be used in conjunction with any partitioning scheme.
• Includes all triples that are within n “hops” of the triples in the original partition element.
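The two baseline ideas above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular system: the function names are hypothetical, triples are plain (subject, predicate, object) tuples of strings, and a "hop" is assumed to follow edges in either direction, which the poster does not specify.

```python
import zlib


def subject_hash_partition(triples, n):
    """Assign each triple to one of n partition elements by hashing its subject."""
    parts = [set() for _ in range(n)]
    for s, p, o in triples:
        # crc32 is deterministic across runs, unlike Python's built-in hash() on str.
        parts[zlib.crc32(s.encode("utf-8")) % n].add((s, p, o))
    return parts


def n_hop_extend(part, triples, hops):
    """Add every triple within `hops` steps of the resources already in `part`.

    Assumption: a hop may follow an edge in either direction.
    """
    extended = set(part)
    for _ in range(hops):
        # Resources currently present in the partition element.
        frontier = {r for s, _, o in extended for r in (s, o)}
        # Pull in every triple that touches one of those resources.
        extended |= {t for t in triples if t[0] in frontier or t[2] in frontier}
    return extended
```

Subject hashing never duplicates a triple, whereas each round of `n_hop_extend` can only grow a partition element, which is where the storage overhead of n-hop schemes comes from.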
Evaluation of Partitioning Scheme
• Calculate the percentage of answers that exist on a single partition element.
• Calculate how much duplication of data is created.
Partitioning Scheme
Aims
1. Maximise the number of local answers to star queries.
2. Achieve partition quality similar to, or better than, that of schemes employing n-hop duplication.
3. Minimise duplication.
The quality of partitions critically depends on the structure of the data and the anticipated query workload. We make the following assumptions:
Assumption 1. Subject–subject joins are common.
Assumption 2. Queries often constrain variables to elements of classes; that is, they often contain atoms of the form ⟨?x, rdf:type, class⟩.
Assumption 3. Joins involving resources representing classes are uncommon; that is, queries rarely contain atoms ⟨?x1, rdf:type, ?y⟩ ∧ ⟨?x2, rdf:type, ?y⟩.
Assumption 4. Joins on resources that are literals are uncommon; that is, if a query contains atoms ⟨?x1, :R, ?y⟩ ∧ ⟨?x2, :S, ?y⟩, it is unlikely that a query answer will map variable ?y to a literal.
Assumption 5. The number of schema triples in G is small, so all schema triples can be replicated in each partition element.
Distributed Query Answering
The key to our approach to distributed query answering is the introduction of a wildcard resource:
Wildcard
• Represents all resources external to the partition element.
• Tracks the connections between partition elements.
• Signals when an answer is non-local.
• Reduces the number of intermediate answers generated.
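To make the role of the wildcard concrete, here is a small Python sketch. It rests on an assumption the poster does not spell out: that a partition element keeps every triple touching its vocabulary and replaces each endpoint outside the vocabulary with the wildcard. The names `project` and `star_answers` are hypothetical, and the star-query evaluator is deliberately naive.

```python
from collections import defaultdict
from itertools import product

WILDCARD = "*"


def project(triples, vocab):
    """Assumed projection: keep triples touching vocab, mask outside endpoints."""
    element = set()
    for s, p, o in triples:
        if s in vocab or o in vocab:
            element.add((s if s in vocab else WILDCARD,
                         p,
                         o if o in vocab else WILDCARD))
    return element


def star_answers(element, predicates):
    """Evaluate the star query <?x, p1, ?y1> AND ... AND <?x, pk, ?yk> on one element.

    Returns (binding, is_local) pairs; an answer is flagged non-local
    as soon as any variable is bound to the wildcard.
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in element:
        by_subject[s][p].append(o)
    answers = []
    for s, by_pred in by_subject.items():
        # One object per predicate; an empty list for any predicate yields no answer.
        for objs in product(*(by_pred.get(p, []) for p in predicates)):
            binding = (s, *objs)
            answers.append((binding, WILDCARD not in binding))
    return answers
```

Because an element is a set, several external neighbours reached through the same predicate collapse into a single wildcard triple, which is one way to read the claim that the wildcard reduces the number of intermediate answers.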
Contributions
1. Novel data partitioning scheme that reduces distributed computing on common queries and reduces storage overhead.
2. Introduction of a wildcard resource that tracks connections between partition elements.
3. Analysis of how the choice of data partitioning scheme affects the amount of network communication needed.
Example
Evaluation of Partitioning Scheme
LUBM: Percentage of Local Answers
System   Q2        Q8        Q9        Q11       Q12       Qc
RDFox    100.00%   100.00%   100.00%   100.00%   100.00%   100.00%
SHAPE    100.00%   100.00%   100.00%   100.00%   100.00%   100.00%
Hash     0.44%     4.96%     0.23%     5.80%     0.00%     0.04%

SP2B: Percentage of Local Answers
System   Q4        Q5        Q6        Q7        Q8
RDFox    95.95%    73.00%    99.90%    92.41%    91.45%
SHAPE    95.23%    9.72%     100.00%   41.97%    73.72%
Hash     0.01%     0.77%     0.25%     0.08%     0.26%

Storage Overhead
Dataset  RDFox     SHAPE     Hash
LUBM     3.60%     84.23%    0.00%
SP2B     0.60%     38.63%    0.00%
Partitioning Procedure
Step 1. Compute the undirected graph G′ by removing from G all schema triples and triples containing class and literal resources (i.e., all triples of the form ⟨s, rdf:type, o⟩ and ⟨s, p, l⟩ with l a literal), and by treating each remaining triple ⟨s, p, o⟩ as an undirected edge connecting s and o.
Step 2. Partition the nodes of G′ into n disjoint sets using min-cut graph partitioning (e.g., using METIS), and let V_1′, ..., V_n′ be the resulting vocabularies.
Step 3. Extend each V_i′ to V_i⋆ = V_i′ ∪ {r | r occurs in a schema triple in G}.
Step 4. Extend each V_i⋆ to V_i = V_i⋆ ∪ {o | ⟨s, p, o⟩ ∈ G and s ∈ V_i⋆}.
Step 5. Calculate [G]_{V_i} for each V_i, and set G = {[G]_{V_1}, ..., [G]_{V_n}}.
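Steps 1 to 4 of the procedure can be sketched in Python as follows. Several things here are assumptions for illustration only: the names `build_vocabularies` and `chunk_partitioner` are hypothetical, the schema triples and literals are passed in explicitly, the nodes of G′ are taken to be the endpoints of the remaining edges, and `chunk_partitioner` is a trivial placeholder that ignores the edges; it is not min-cut partitioning, and a real run would swap in a tool such as METIS. Step 5 is left out because the poster does not define the projection [G]_{V_i}.

```python
RDF_TYPE = "rdf:type"


def chunk_partitioner(nodes, edges, n):
    """Placeholder for Step 2: split the node list into n contiguous chunks.

    NOT a min-cut partitioner; it ignores `edges` entirely.
    """
    size = -(-len(nodes) // n)  # ceiling division
    return [set(nodes[i * size:(i + 1) * size]) for i in range(n)]


def build_vocabularies(triples, schema, literals, n, node_partitioner=chunk_partitioner):
    """Compute the vocabularies V_1, ..., V_n of Steps 1-4."""
    # Step 1: drop schema triples, rdf:type triples and triples with literal
    # objects; keep each remaining triple as an undirected edge (s, o).
    edges = [(s, o) for (s, p, o) in triples
             if (s, p, o) not in schema and p != RDF_TYPE and o not in literals]
    nodes = sorted({r for edge in edges for r in edge})

    # Step 2: partition the nodes of G' into n disjoint sets.
    base = node_partitioner(nodes, edges, n)

    # Step 3: add every resource occurring in a schema triple.
    schema_resources = {r for t in schema for r in t}
    starred = [set(v) | schema_resources for v in base]

    # Step 4: add the objects of all triples whose subject is already included.
    return [v | {o for (s, p, o) in triples if s in v} for v in starred]
```

Step 4 is what keeps star queries local: once a subject is in a vocabulary, every object it points to is added as well, so all of its outgoing triples can be matched on one partition element.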
Conclusions
• The same partition quality can be achieved without n-hop duplication.
• Our partitioning scheme has very small storage overhead.
RDF Graph Partitioning
Star Queries
• A query shape common to many applications
• Can be answered fully locally
• No network communication required
Evaluation
RDFox achieves a greater percentage of local answers against two common benchmarks, with minimal storage overhead.
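The two measures reported here can be computed along the following lines. The exact formulas are not reproduced in this transcript, so the definitions below are assumptions: an answer counts as local if all the triples it matches sit in one partition element, and storage overhead is the number of extra stored triples relative to the size of the original graph. Both function names are hypothetical.

```python
def local_answer_percentage(answer_triples, elements):
    """Percentage of answers whose matched triples all lie in a single element.

    answer_triples: list of sets of triples, one set per query answer.
    elements: list of sets of triples, one set per partition element.
    """
    if not answer_triples:
        return 0.0
    local = sum(1 for matched in answer_triples
                if any(matched <= element for element in elements))
    return 100.0 * local / len(answer_triples)


def storage_overhead_percentage(original, elements):
    """Extra triples stored across all elements, relative to the original graph."""
    stored = sum(len(element) for element in elements)
    return 100.0 * (stored - len(original)) / len(original)
```

Under this reading, a scheme that stores every triple exactly once has an overhead of 0%, which matches the figures reported for subject hashing.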
Wildcard *
• New resource that represents all resources external to the partition element.
• Shows the connections to other partition elements.
• Form of extreme graph summarisation.
• Helps reduce the number of intermediate answers.
Conclusions
Our approach:
• Greater percentage of local answers compared to competitors
• Minimal storage overhead
• Recognises local answers
Future Work
• Develop efficient query answering scheme with wildcards
• Implement distributed DBMS on top of RDFox
• Colour of edge represents which partition element it belongs to.
• Subject hashing (left) is common and simple.
• Groups all triples with the same subject.
• Graph-based partitioning (right) groups triples that are close in the graph.
• Our data partitioning scheme is an optimised graph-based scheme.
Local Answer
A local answer can be evaluated fully on a single machine:
• No network communication
• Fast evaluation
[Figure: the same RDF graph partitioned by subject hashing (left) and by a graph-based scheme (right); legend: Partition element 1, Partition element 2]
RDFox: Our approach
SHAPE: Semantic hash partitioning with n-hop duplication
Hash: Subject hashing
Aims
• Maximise number of local answers to common queries
• Recognise local answers
• Minimise storage overhead
• Subject hashing groups triples with the same subject
• Graph-based partitioning groups triples close in the graph
• Many approaches to partitioning:
  - Subject hashing
  - Graph-based
  - Semantic hashing
• Partitioning scheme affects the number of local answers
• We propose a novel graph-based partitioning scheme:
  - Minimal storage overhead
  - Common (star) queries fully local
Introduction of a new wildcard resource:
• Represents all external resources
• Tracks connections between partition elements
Used in our novel query answering scheme:
• Reduces the number of intermediate answers
• Recognises local answers