Querying Distributed RDF Graphs: The Effects of Partitioning Poster


Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and efficient query processing. Existing data partitioning schemes are commonly based on hashing or graph partitioning techniques. The latter techniques split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate efficient query answering, considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and we show that our query answering scheme can efficiently answer many queries.


Querying Distributed RDF Graphs: The Effects of Partitioning
Anthony Potter, Boris Motik, Ian Horrocks

Challenges

Create a distributed, cloud-based DBMS for large RDF graphs. The two main challenges are:
•  How to partition data across a cluster
•  How to answer queries over a cluster

Current Approaches

Hashing
•  Hashing function maps triple to partition element, often by subject.
•  Groups triples with the same attribute value.
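A minimal sketch of subject hashing, assuming triples are represented as Python tuples `(s, p, o)` of strings. The function name `subject_hash_partition` and the choice of `zlib.crc32` as the hash are illustrative assumptions, not something the poster specifies.

```python
import zlib
from collections import defaultdict


def subject_hash_partition(triples, n):
    """Assign each (s, p, o) triple to one of n partition elements by its subject."""
    parts = defaultdict(set)
    for s, p, o in triples:
        # crc32 is deterministic across runs, unlike Python's built-in hash() on str.
        index = zlib.crc32(s.encode("utf-8")) % n
        parts[index].add((s, p, o))
    return dict(parts)


# Hypothetical example data.
triples = [
    ("alice", "rdf:type", "Student"),
    ("alice", "takes", "db"),
    ("bob", "rdf:type", "Student"),
    ("db", "taughtBy", "carol"),
]
parts = subject_hash_partition(triples, 2)
```

Every triple lands in exactly one element, so this scheme duplicates nothing, and all triples sharing a subject end up together.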

Graph-based approaches
•  Exploit the graph structure of RDF data.
•  Group triples close together in the graph.

N-hop duplication
•  Can be used in conjunction with any partitioning scheme.
•  Includes all triples that are within n “hops” of triples in original partition element.
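As a rough sketch of n-hop duplication: the version below assumes an undirected notion of "hop", where one hop adds every triple that shares a subject or object with a triple already in the element. Published schemes differ on direction and on what counts as a hop, so treat `n_hop_expand` as a hypothetical illustration only.

```python
def n_hop_expand(element, graph, n):
    """Return `element` plus all triples of `graph` reachable within n undirected hops."""
    expanded = set(element)
    for _ in range(n):
        # Resources (subjects and objects) currently present in the element.
        frontier = {r for s, _p, o in expanded for r in (s, o)}
        # Add every triple that touches one of those resources.
        expanded |= {t for t in graph if t[0] in frontier or t[2] in frontier}
    return expanded


# Hypothetical chain a -> b -> c -> d.
graph = {("a", "p", "b"), ("b", "p", "c"), ("c", "p", "d")}
one_hop = n_hop_expand({("a", "p", "b")}, graph, 1)
```

The extra triples are copies of data that already lives in another partition element, which is where the storage overhead of n-hop schemes comes from.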

Evaluation of Partitioning Scheme
•  Calculate the percentage of answers that exist on a single partition element.
•  Calculate how much duplication of data is created.


Partitioning Scheme

Aims
1.  Maximise the number of local answers to star queries.
2.  Achieve similar, or better, partition quality than schemes employing n-hop duplication.
3.  Minimise duplication.

The quality of partitions critically depends on the structure of the data and the anticipated query workload. We make the following assumptions:

Assumption 1. Subject–subject joins are common.
Assumption 2. Queries often constrain variables to elements of classes, that is, they often contain atoms of the form ⟨?x, rdf:type, class⟩.
Assumption 3. Joins involving resources representing classes are uncommon, that is, queries rarely contain atoms ⟨?x1, rdf:type, ?y⟩ ∧ ⟨?x2, rdf:type, ?y⟩.
Assumption 4. Joins on resources that are literals are uncommon, that is, if a query contains atoms ⟨?x1, :R, ?y⟩ ∧ ⟨?x2, :S, ?y⟩, it is unlikely that a query answer will map variable ?y to a literal.
Assumption 5. The number of schema triples in G is small, so all schema triples can be replicated in each partition element.

Distributed Query Answering

The key to our approach to distributed query answering is the introduction of a wildcard resource:

Wildcard
•  Represents all resources external to partition element.
•  Tracks the connections between partition elements.
•  Signals when an answer is non-local.
•  Reduces number of intermediate answers generated.
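One possible reading of the wildcard idea, sketched under stated assumptions: a partition element keeps each triple that mentions at least one resource from its own vocabulary, and any subject or object outside that vocabulary is replaced by a single placeholder `"*"`. The names `project` and `is_local`, and this exact projection rule, are hypothetical and are not taken from the poster.

```python
WILDCARD = "*"  # assumed placeholder for "some resource stored elsewhere"


def project(graph, vocabulary):
    """Keep triples touching `vocabulary`; mask external subjects/objects with the wildcard."""
    projected = set()
    for s, p, o in graph:
        if s in vocabulary or o in vocabulary:
            masked_s = s if s in vocabulary else WILDCARD
            masked_o = o if o in vocabulary else WILDCARD
            projected.add((masked_s, p, masked_o))
    return projected


def is_local(answer):
    """An answer (variable -> resource mapping) is local if no variable hit the wildcard."""
    return WILDCARD not in answer.values()


# Hypothetical example: the element owns resources a and b only.
graph = {("a", "knows", "b"), ("b", "knows", "c"), ("c", "knows", "d")}
element = project(graph, {"a", "b"})
```

Under this reading, a binding to `"*"` tells the server that the match continues on another partition element, so the answer cannot be reported as complete without further communication.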

Contributions

1.  Novel data partitioning scheme that reduces distributed computing on common queries and reduces storage overhead.

2.  Introduction of a wildcard resource that tracks connections between partition elements.

3.  Analysis of how the choice of data partitioning scheme affects the amount of network communication needed.

Example

Evaluation of Partitioning Scheme

LUBM: Percentage of Local Answers

System  Q2       Q8       Q9       Q11      Q12      Qc
RDFox   100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
SHAPE   100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
Hash    0.44%    4.96%    0.23%    5.80%    0.00%    0.04%

SP2B: Percentage of Local Answers

System  Q4       Q5       Q6       Q7       Q8
RDFox   95.95%   73.00%   99.90%   92.41%   91.45%
SHAPE   95.23%   9.72%    100.00%  41.97%   73.72%
Hash    0.01%    0.77%    0.25%    0.08%    0.26%

Storage Overhead

Benchmark  RDFox   SHAPE    Hash
LUBM       3.60%   84.23%   0.00%
SP2B       0.60%   38.63%   0.00%

Partitioning Procedure

Step 1. Compute the undirected graph G′ by removing from G all schema triples and triples containing class and literal resources (i.e., all triples of the form ⟨s, rdf:type, o⟩ and ⟨s, p, l⟩ with l a literal), and by treating each remaining triple ⟨s, p, o⟩ as an undirected edge connecting s and o.
Step 2. Partition the nodes of G′ into n disjoint sets using min-cut graph partitioning (e.g., using METIS), and let V1′, ..., Vn′ be the resulting vocabularies.
Step 3. Extend each Vi′ to Vi⋆ = Vi′ ∪ {r | r occurs in a schema triple in G}.
Step 4. Extend each Vi⋆ to Vi = Vi⋆ ∪ {o | ⟨s, p, o⟩ ∈ G and s ∈ Vi⋆}.
Step 5. Calculate [G]Vi for each Vi, and set G = {[G]V1, ..., [G]Vn}.
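Steps 1, 3 and 4 can be sketched as follows. This is an illustration under several assumptions: triples are `(s, p, o)` tuples of strings, the caller supplies the set of schema triples and an `is_literal` predicate, and the node sets from Step 2 are passed in ready-made rather than computed (no graph partitioner is called here). Step 5 is left out because the poster does not define [G]Vi. All function names are hypothetical.

```python
RDF_TYPE = "rdf:type"


def build_undirected_graph(graph, schema_triples, is_literal):
    """Step 1: drop schema, rdf:type and literal-valued triples; keep undirected edges {s, o}."""
    edges = set()
    for s, p, o in graph:
        if (s, p, o) in schema_triples or p == RDF_TYPE or is_literal(o):
            continue
        edges.add(frozenset((s, o)))
    return edges


def extend_vocabularies(graph, schema_triples, node_sets):
    """Steps 3 and 4: add schema resources, then the objects of all triples whose subject is owned."""
    schema_resources = {r for triple in schema_triples for r in triple}
    vocabularies = []
    for nodes in node_sets:
        v_star = set(nodes) | schema_resources                      # Step 3
        v = v_star | {o for s, _p, o in graph if s in v_star}       # Step 4
        vocabularies.append(v)
    return vocabularies


# Hypothetical example data.
schema = {("Student", "rdfs:subClassOf", "Person")}
graph = schema | {
    ("alice", "rdf:type", "Student"),
    ("alice", "knows", "bob"),
    ("alice", "name", '"Alice"'),
    ("bob", "knows", "carol"),
}
edges = build_undirected_graph(graph, schema, lambda r: r.startswith('"'))
vocabularies = extend_vocabularies(graph, schema, [{"alice"}, {"bob", "carol"}])
```

Step 4 is what pulls the class `Student`, the literal `"Alice"` and the neighbour `bob` into the vocabulary that owns `alice`, so every triple with subject `alice` can be represented in full within one partition element.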

Conclusions
•  The same partition quality can be achieved without n-hop duplication.
•  Our partitioning scheme has very small storage overhead.

RDF Graph Partitioning

Star Queries
•  Common query to many applications
•  Can be answered fully locally
•  No network communication required
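To illustrate why star queries need no communication, the sketch below assumes that a star query is a set of triple patterns sharing one subject variable, and that every triple with a given subject is stored in the same partition element. The function `star_subjects` and its pattern format, `(predicate, object_or_None)`, are hypothetical.

```python
def star_subjects(triples, patterns):
    """Return subjects that match every (predicate, object_or_None) pattern; None is a variable."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, set()).add((p, o))

    def matches(edges, pattern):
        p, o = pattern
        return any(ep == p and (o is None or eo == o) for ep, eo in edges)

    return {
        s for s, edges in by_subject.items()
        if all(matches(edges, pattern) for pattern in patterns)
    }


# Hypothetical partition elements, grouped by subject.
element_1 = {("alice", "rdf:type", "Student"), ("alice", "takes", "db")}
element_2 = {("bob", "rdf:type", "Student"), ("bob", "takes", "ai"),
             ("carol", "rdf:type", "Professor")}
query = [("rdf:type", "Student"), ("takes", None)]

# Each server evaluates the query on its own element; the results are simply unioned.
local_answers = star_subjects(element_1, query) | star_subjects(element_2, query)
```

Because each subject's triples sit together, the union of the per-element results equals the result of running the same query over the whole graph.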

Evaluation

RDFox achieves a greater percentage of local answers against two common benchmarks, with minimal storage overhead.

Wildcard *  

•  New resource that represents all resources external to the partition element.

•  Shows the connections to other partition elements.

•  Form of extreme graph summarisation.

•  Helps reduce the number of intermediate answers.

Conclusions

Our approach:
•  Greater percentage of local answers compared to competitors
•  Minimal storage overhead
•  Recognises local answers

Future Work

•  Develop efficient query answering scheme with wildcards

•  Implement distributed DBMS on top of RDFox

•  Colour of edge represents which partition element it belongs to.

•  Subject hashing (left) is common and simple.

•  Groups all triples with the same subject.

•  Graph-based partitioning (right) groups triples that are close in the graph.

•  Our data partitioning scheme is an optimised graph-based scheme.

Local Answer

A local answer can be evaluated fully on a single machine:

•  No network communication

•  Fast evaluation

[Figure panels: Subject Hashing, Graph-based. Legend: Partition element 1, Partition element 2.]

RDFox: Our approach
SHAPE: Semantic hash partitioning with n-hop duplication
Hash: Subject hashing

Aims

•  Maximise number of local answers to common queries

•  Recognise local answers

•  Minimise storage overhead

•  Subject hashing groups triples with the same subject

•  Graph-based partitioning groups triples close in the graph

•  Many approaches to partitioning:
   §  Subject hashing
   §  Graph-based
   §  Semantic hashing

•  Partitioning scheme affects the number of local answers

•  We propose a novel graph-based partitioning scheme:
   §  Minimal storage overhead
   §  Common (star) queries fully local

Introduction of a new wildcard resource:

•  Represents all external resources

•  Tracks connections between partition elements

Used in our novel query answering scheme:

•  Reduces the number of intermediate answers

•  Recognises local answers