Horton+: A Distributed System for Processing Declarative Reachability Queries over Partitioned Graphs Mohamed Sarwat (Arizona State University) Sameh Elnikety (Microsoft Research) Yuxiong He (Microsoft Research) Mohamed Mokbel (University of Minnesota)

Page 1:

Horton+: A Distributed System for Processing Declarative Reachability Queries over Partitioned Graphs

Mohamed Sarwat (Arizona State University)

Sameh Elnikety (Microsoft Research)

Yuxiong He (Microsoft Research)

Mohamed Mokbel (University of Minnesota)

Page 2:

Motivation
• Social network
• Queries
– Find Alice’s friends
– How Alice & Ed are connected
– Find Alice’s photos with friends

[Figure: social graph of Bob, Hillary, Alice, Chris, David, France, Ed, and George, shown first on its own and then extended with the nodes Photo1 through Photo8.]

Page 3:

Data Model
• Attributed multi-graph
• Node
– Represents an entity
– ID, type, attributes
• Edge
– Represents a binary relationship
– Type, direction, weight, attributes

[Figure: the social graph of people and photos. A second diagram shows a Manages edge between Alice and Bob as the application sees it (App) and as Horton holds it, with the two directed entries Manages> and <Manages.]
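The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Node`, `Edge`, and `Graph`, the `adjacency` layout, and the choice that Alice manages Bob are hypothetical and are not taken from Horton+ itself. The sketch records each edge at both endpoints, which is one way to read the `Manages>` and `<Manages` labels in the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    type: str
    attrs: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    type: str
    weight: float = 1.0
    attrs: dict = field(default_factory=dict)


class Graph:
    """Attributed multi-graph: parallel edges between two nodes are allowed."""

    def __init__(self):
        self.nodes = {}      # node id -> Node
        self.adjacency = {}  # node id -> list of (edge, neighbour id, marker)

    def add_node(self, node):
        self.nodes[node.id] = node
        self.adjacency.setdefault(node.id, [])

    def add_edge(self, edge):
        # Record the edge at both endpoints so it can be followed either way:
        # ">" marks the outgoing side, "<" marks the incoming side.
        self.adjacency[edge.src].append((edge, edge.dst, ">"))
        self.adjacency[edge.dst].append((edge, edge.src, "<"))


g = Graph()
g.add_node(Node("Alice", "Person"))
g.add_node(Node("Bob", "Person"))
g.add_edge(Edge("Alice", "Bob", "Manages"))  # direction chosen for illustration

alice_view = [(e.type + d, n) for e, n, d in g.adjacency["Alice"]]
bob_view = [(d + e.type, n) for e, n, d in g.adjacency["Bob"]]
```

Storing both directions lets a traversal start from either end of a relationship, which matters later when a plan is evaluated right to left.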

Page 4:

Horton+ Contributions
1. Defining reachability queries formally
2. Introducing graph operators for a distributed graph engine
3. Developing a query optimizer
4. Evaluating the techniques experimentally

Page 5:

Graph Reachability Queries
• Query is a regular expression
– Sequence of node and edge predicates
1. Hello world in reachability
» Photo-Tags-‘Alice’
» Search for a path with node: type=Photo, edge: type=Tags, node: id=‘Alice’
2. Attribute predicate
» Photo{date.year=‘2012’}-Tags-‘Alice’
3. Or
» (Photo | video)-Tags-‘Alice’
4. Closure for paths of arbitrary length
» ‘Alice’(-Manages-Person)*
» Kleene star to find Alice’s org chart
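To make the "sequence of node and edge predicates" concrete, here is a minimal sketch of a parser for the linear examples only (items 1 and 2). It is not the Horton+ parser: the function name `parse_linear`, the token pattern, and the tuple output format are assumptions made for illustration, straight quotes stand in for the typographic quotes on the slide, and the Or and closure forms are deliberately left out.

```python
import re

# A quoted node id, or a type name with an optional {attr='value'} predicate.
TOKEN = re.compile(
    r"(?P<id>'[^']*')"
    r"|(?P<type>\w+)(?:\{(?P<attr>[\w.]+)='(?P<val>[^']*)'\})?"
)


def parse_linear(query):
    """Split a linear query into alternating node and edge predicates."""
    preds = []
    for pos, part in enumerate(query.split("-")):
        m = TOKEN.fullmatch(part)
        if m is None:
            raise ValueError(f"cannot parse predicate: {part!r}")
        kind = "node" if pos % 2 == 0 else "edge"
        if m.group("id"):
            pred = {"id": m.group("id").strip("'")}
        else:
            pred = {"type": m.group("type")}
            if m.group("attr"):
                pred[m.group("attr")] = m.group("val")
        preds.append((kind, pred))
    return preds


hello = parse_linear("Photo-Tags-'Alice'")
dated = parse_linear("Photo{date.year='2012'}-Tags-'Alice'")
```

Because the sketch splits on `-`, it would mis-handle ids or attribute values that contain a hyphen; a real grammar would need a proper tokenizer.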

Page 6:

Declarative Query Language

Declarative:
Photo-Tags-‘Alice’

Navigational:
Foreach( n1 in graph.Nodes.SelectByType(Photo) ) {
  Foreach( n2 in n1.GetNeighboursByEdgeType(Tags) ) {
    If( n2.id == ‘Alice’ ) {
      return path(n1, Tags, n2)
    }
  }
}

Page 7:

Comparison to SQL & SPARQL

SQL RL

• SQL
• SPARQL
– Pattern matching
» Find sub-graph in a bigger graph

Page 8:

Compile into Algebraic Query Plan

‘Alice’-Tags-Photo
[Figure: state machine with states S0, S1, S2, S3 and transitions labeled ‘Alice’, Tags, Photo.]

‘Alice’(-Manages-Person)*
[Figure: state machine with states S0, S1, S2 and transitions labeled ‘Alice’, Manages, Person.]
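The two state machines on this slide can be sketched as plain transition tables. The following is an illustrative reconstruction, not the Horton+ compiler: the function names `compile_linear` and `compile_closure` and the dictionary encoding are assumptions, and the choice of S1 as the accepting state of the closure query is inferred from the figure rather than stated on the slide.

```python
def compile_linear(preds):
    """Chain of states: state i moves to state i + 1 on predicate i."""
    transitions = {i: [(pred, i + 1)] for i, pred in enumerate(preds)}
    accepting = {len(preds)}
    return transitions, accepting


def compile_closure(prefix, loop):
    """Compile prefix(loop)*: the loop body returns to the state after prefix."""
    transitions, accepting = compile_linear(prefix)
    anchor = len(prefix)  # state reached after the prefix, also accepting
    state = anchor
    next_free = anchor + 1
    for k, pred in enumerate(loop):
        if k == len(loop) - 1:
            target = anchor        # last loop predicate closes the cycle
        else:
            target = next_free     # intermediate state inside the loop
            next_free += 1
        transitions.setdefault(state, []).append((pred, target))
        state = target
    return transitions, accepting


# 'Alice'-Tags-Photo: S0 -> S1 -> S2 -> S3
linear = compile_linear(["'Alice'", "Tags", "Photo"])

# 'Alice'(-Manages-Person)*: S0 -> S1, then S1 -> S2 -> S1 repeatedly
closure = compile_closure(["'Alice'"], ["Manages", "Person"])
```

In the closure machine every return to S1 corresponds to one more level of Alice's org chart, so paths of any length are accepted without enumerating lengths in the query.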

Page 9:

Centralized Query Execution

‘Alice’-Tags-Photo
[Figure: state machine with states S0, S1, S2, S3 and transitions labeled ‘Alice’, Tags, Photo.]

Breadth First Search

Answer Paths:
‘Alice’-Tags-Photo1
‘Alice’-Tags-Photo8

[Figure: the social graph of people and photos.]
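A breadth-first evaluation of ‘Alice’-Tags-Photo on a single machine might look like the sketch below. The tiny graph is made up for illustration: only the two answer paths (Photo1 and Photo8) come from the slide, while the other edges, the `node_type` and `edges` tables, and the helper names are assumptions.

```python
from collections import deque

# Hypothetical fragment of the social graph.
node_type = {
    "Alice": "Person", "Bob": "Person",
    "Photo1": "Photo", "Photo2": "Photo", "Photo8": "Photo",
}
edges = {
    "Alice": [("Tags", "Photo1"), ("Tags", "Photo8"), ("FriendOf", "Bob")],
    "Bob": [("Tags", "Photo2"), ("FriendOf", "Alice")],
}


def node_matches(pred, node):
    """A quoted predicate matches a node id; otherwise it matches a type."""
    if pred.startswith("'"):
        return node == pred.strip("'")
    return node_type[node] == pred


def bfs(preds):
    """preds alternates node, edge, node, ...; paths use the same layout."""
    frontier = deque([[n] for n in node_type if node_matches(preds[0], n)])
    answers = []
    while frontier:
        path = frontier.popleft()
        if len(path) == len(preds):   # every predicate has been matched
            answers.append(path)
            continue
        edge_pred, node_pred = preds[len(path)], preds[len(path) + 1]
        for edge_type, nbr in edges.get(path[-1], []):
            if edge_type == edge_pred and node_matches(node_pred, nbr):
                frontier.append(path + [edge_type, nbr])
    return answers


answers = bfs(["'Alice'", "Tags", "Photo"])
```

Each queue entry is a partial path, and its length doubles as the position in the predicate sequence, which plays the role of the current state in the state machine.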

Page 10:

Distributed Query Execution

‘Alice’-Tags-Photo-Tags-‘Bob’

[Figure: the social graph of people and photos, divided into Partition 1 and Partition 2.]

Page 11:

Distributed Query Execution

‘Alice’-Tags-Photo-Tags-‘Bob’
[Figure: FSM with states S0 through S5 and transitions labeled ‘Alice’, Tags, Photo, Tags, ‘Bob’.]

[Figure: the partitioned social graph with the execution shown in three steps across Partition 1 and Partition 2: Step 1 visits Alice, Step 2 visits Photo1 and Photo8, Step 3 visits Bob.]
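The three steps can be imitated with a small single-process simulation in which each partition owns some nodes and partial paths are handed to whichever partition owns the next node. This is a sketch under assumptions, not the Horton+ engine: the placement of nodes in `owner`, the edge list, and the round-based message loop are all hypothetical, and real partitions would exchange these paths through a communication library.

```python
# Hypothetical placement of nodes on two partitions.
owner = {"Alice": 1, "Photo1": 1, "Photo8": 2, "Bob": 2}
node_type = {"Alice": "Person", "Bob": "Person",
             "Photo1": "Photo", "Photo8": "Photo"}
edges = {
    "Alice": [("Tags", "Photo1"), ("Tags", "Photo8")],
    "Photo1": [("Tags", "Alice")],
    "Photo8": [("Tags", "Alice"), ("Tags", "Bob")],
    "Bob": [("Tags", "Photo8")],
}


def node_matches(pred, node):
    """A quoted predicate matches a node id; otherwise it matches a type."""
    if pred.startswith("'"):
        return node == pred.strip("'")
    return node_type[node] == pred


def run(preds):
    """Return (answer paths, number of paths sent across partitions)."""
    # Every partition starts by scanning the nodes it owns.
    inbox = {p: [] for p in set(owner.values())}
    for node, part in owner.items():
        inbox[part].append([node])
    answers, remote = [], 0
    while any(inbox.values()):
        outbox = {p: [] for p in inbox}
        for part, paths in inbox.items():
            for path in paths:
                # The owner of the last node checks that node's predicate.
                if not node_matches(preds[len(path) - 1], path[-1]):
                    continue
                if len(path) == len(preds):
                    answers.append(path)
                    continue
                for edge_type, nbr in edges[path[-1]]:
                    if edge_type == preds[len(path)]:
                        outbox[owner[nbr]].append(path + [edge_type, nbr])
                        if owner[nbr] != part:
                            remote += 1
        inbox = outbox
    return answers, remote


answers, remote = run(["'Alice'", "Tags", "Photo", "Tags", "'Bob'"])
```

With this placement, one path crosses from Partition 1 to Partition 2 when Alice reaches Photo8, and one candidate crosses back toward Alice before being rejected, so the count of remote sends depends directly on how the graph is partitioned.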

Page 12:

Architecture: Distributed Execution Engine

[Figure: a query is compiled into a query plan and optimized, and the plan operators are then processed across Partition 1, Partition 2, ..., Partition N. Each partition runs an execution engine on top of a communication library. The output is the set of result paths.]

Page 13:

Algebraic Operators
1. Select
– Find set of starting nodes
2. Traverse
– Traverse graph to construct paths
3. Join
– Construct longer paths

‘Alice’-Tags-Photo
[Figure: state machine with states S0, S1, S2, S3 and transitions labeled ‘Alice’, Tags, Photo.]
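As a rough illustration of how the three operators compose, the sketch below implements them over lists of paths. The signatures of `select`, `traverse`, and `join`, the path layout, and the sample graph are assumptions made for this example and should not be read as the Horton+ operator interfaces.

```python
# Hypothetical fragment of the social graph.
node_type = {"Alice": "Person", "Bob": "Person",
             "Photo1": "Photo", "Photo8": "Photo"}
edges = {
    "Alice": [("Tags", "Photo1"), ("Tags", "Photo8")],
    "Photo1": [("Tags", "Alice")],
    "Photo8": [("Tags", "Alice"), ("Tags", "Bob")],
    "Bob": [("Tags", "Photo8")],
}


def select(node_pred):
    """Select: find the set of starting nodes, each as a one-node path."""
    return [[n] for n in node_type if node_pred(n)]


def traverse(paths, edge_type, node_pred):
    """Traverse: extend each path by one matching edge and node."""
    return [
        path + [et, nbr]
        for path in paths
        for et, nbr in edges.get(path[-1], [])
        if et == edge_type and node_pred(nbr)
    ]


def join(left, right):
    """Join: glue paths whose end node equals the other path's start node."""
    return [l + r[1:] for l in left for r in right if l[-1] == r[0]]


def is_photo(n):
    return node_type[n] == "Photo"


# 'Alice'-Tags-Photo evaluated with Select followed by Traverse.
left = traverse(select(lambda n: n == "Alice"), "Tags", is_photo)

# Photo-Tags-'Bob', then joined with the paths above on the shared Photo.
right = traverse(select(is_photo), "Tags", lambda n: n == "Bob")
joined = join(left, right)
```

The join here is a nested loop for brevity; an index on the shared node would be the natural refinement for larger intermediate results.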

Page 14:

Plan Enumeration for Query Optimization
• Query: ‘Mike’-Tags-Photo-Tags-Person-FriendOf-‘Mike’
• Example plans
1. Left to right
» ‘Mike’-Tags-Photo-Tags-Person-FriendOf-‘Mike’
2. Right to left
» ‘Mike’-FriendOf-Person-Tags-Photo-Tags-‘Mike’
3. Split then join
» (‘Mike’-FriendOf-Person) ⋈ (Person-Tags-Photo-Tags-‘Mike’)
4. Split then join
» (‘Mike’-FriendOf-Person-Tags-Photo) ⋈ (Photo-Tags-‘Mike’)
5. …

Page 15:

Enumeration Algorithm

Query: Q[1, n] = N_1 E_1 N_2 E_2 … N_{n-1} E_{n-1} N_n
Selectivity of query Q[i,j]: Sel(Q[i,j])
Minimum cost of query Q[i,j]: F(Q[i,j])

Apply dynamic programming
• Store intermediate results of all F(Q[i,j]) pairs
• Complexity: O(n^3)

F(Q[i,j]) = min{ SequentialCost_LR(Q[i,j]),
                 SequentialCost_RL(Q[i,j]),
                 min_{i<k<j} ( F(Q[i,k]) + F(Q[k,j]) + Sel(Q[i,k]) * Sel(Q[k,j]) ) }

Base step: F(Q_i) = F(N_i) = cost of matching predicate N_i
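The recurrence can be exercised with a short memoized implementation. The slide does not define SequentialCost or Sel, so the cost model below is purely a stand-in: `card` (nodes matching each predicate), a uniform `fanout`, the geometric `seq_cost`, and the `sel` estimate are all invented for illustration, as are the numbers in the example. Only the shape of the recurrence and the base step follow the slide.

```python
from functools import lru_cache


def seq_cost(start_card, hops, fanout):
    """Toy sequential cost: paths produced at each hop, summed over hops."""
    return sum(start_card * fanout ** h for h in range(1, hops + 1))


def best_plan(card, fanout):
    """Return (minimum cost, plan description) for node predicates N1..Nn."""
    n = len(card)

    def sel(i, j):
        # Toy estimate of the number of paths matching Q[i,j].
        return min(card[i], card[j]) * fanout ** (j - i)

    @lru_cache(maxsize=None)
    def F(i, j):
        if i == j:
            return card[i], f"N{i + 1}"  # base step: match one predicate
        options = [
            (seq_cost(card[i], j - i, fanout), f"LR[{i + 1},{j + 1}]"),
            (seq_cost(card[j], j - i, fanout), f"RL[{i + 1},{j + 1}]"),
        ]
        for k in range(i + 1, j):  # split points strictly between i and j
            (cost_l, plan_l), (cost_r, plan_r) = F(i, k), F(k, j)
            options.append(
                (cost_l + cost_r + sel(i, k) * sel(k, j),
                 f"({plan_l} JOIN {plan_r})")
            )
        return min(options)

    return F(0, n - 1)


# Five node predicates with selective endpoints and unselective middles.
cost, plan = best_plan(card=[1, 1000, 1000, 1000, 1], fanout=10)
```

There are O(n^2) sub-queries Q[i,j] and each tries O(n) split points, which gives the O(n^3) bound on the slide. Under these made-up numbers the cheapest plan evaluates both halves from their selective ends and joins them in the middle.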

Page 16:

Experimental Evaluation

Graphs
• Real dataset (codebook graph: 4M nodes, 14M edges, 20 types)
• Synthetic dataset (RMAT graph: 1024M nodes, 5120M edges)

Machines
• Commodity servers
• Intel Core 2 Duo 2.26 GHz, 16 GB RAM

Page 17:

Query Workload

Q1: Short
Find the person who committed checkin 400 and the WorkItemRevisions it modifies:
Person-Committer-Checkin{id=400}-Modifies-WorkItemRevision

Q2: Selective
Find Dave’s checkins that modified a WorkItem created by Tim:
‘Dave’-Committer-Checkin-Modifies-WorkItem-CreatedBy-‘Tim’

Q3: Report
For each checkin, find the person (and his/her manager) who committed it, as well as all the work items and their WebURLs that are modified by that checkin:
Person-Manages-Person-Committer-Checkin-Modifies-WorkItemRevision-Modifies-WorkItem-Links-WebURL

Q4: Closure
Retrieve all checkins committed by any employee in Dave’s organizational chart (working under him):
‘Dave’(-Manages-Person)*-Checkin

Page 18:

Query Execution Time (Small Graph)

Page 19:

Query Execution Time
• RMAT graph: does not fit in one server, 1024M nodes, 5120M edges
• 16 partition servers
• Execution time dominated by computation

Query   Total Execution   Communication   Computation
Q1      47.588 sec        0.723 sec       46.865 sec
Q2      06.294 sec        0.693 sec       05.601 sec
Q3      92.593 sec        1.258 sec       91.325 sec

Page 20:

Query Optimization
• Synthetic graphs
– Vary graph size
• Centralized (1 server)
• Execution time for queries Q1, Q2, Q3

Page 21:

Horton+ Contributions
1. Defining reachability queries formally
2. Introducing graph operators for a distributed graph engine
3. Developing a query optimizer
4. Evaluating the techniques experimentally