GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing...

21
GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks) UC BERKELEY

Transcript of GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing...

Page 1: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

GraphFrames: An Integrated API for Mixing Graph and Relational QueriesAnkur DaveUC Berkeley AMPLab

Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

UC BERKELEY

Page 2: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

+ Graph Queries

2016Apache Spark + GraphFrames

Trend: Unified Graph Analysis

+ Graph Algorithms

2013Apache Spark + GraphX

Relational Queries

2009Spark

Page 3: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Graph Algorithms vs. Graph Queries

≈x

PageRank

Alternating Least Squares

Graph Algorithms Graph Queries

Page 4: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators

Editor 1 Editor 2 Article 1 Article 2

Article 1

Article 2

Editor 1

Editor 2

same day} same day}

Page 5: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank

// Iterate until convergence wikipedia.pregel(sendMsg = { e =>

e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>

0.15 + 0.85 * msgSum})

Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")

.select("*","e11.date – e21.date".as("d1"),"e12.date – e22.date".as("d2"))

.sort("d1 + d2".desc).take(10)

Page 6: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Separate SystemsGraph Algorithms Graph Queries

Page 7: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Raw Wikipedia

< / >< / >< / >XML

Text Table

Edit GraphEdit Table

Frequent Collaborators

Problem: Mixed Graph AnalysisHyperlinks PageRank

Article Text

User Article

Vandalism Suspects

User User

User Article

Page 8: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Solution: GraphFrames

Graph Algorithms Graph Queries

Spark SQL

GraphFrames API

Pattern Query Optimizer

Page 9: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

GraphFrames API• Unifies graph algorithms, graph queries, and relational operations (DataFrames)• Designed for interactive use• Available in Scala, Java, and Python

class GraphFrame {def vertices: DataFramedef edges: DataFrame

def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit

def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...

}

Page 10: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Implementation

Parsed Pattern

Logical Plan

Materialized Views

Optimized Logical Plan

DataFrameResult

Query String

Graph–RelationalTranslation Join Elimination

and Reordering

Spark SQL

View SelectionGraph

AlgorithmsGraphX

Page 11: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Graph–Relational Translation

B

D

A

C

Existing Logical PlanOutput: A,B,C

Src Dst

⋈C=Src

Edge Table

ID Attr

Vertex Table

⋈D=ID

Page 12: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Join Elimination

Src Dst1 21 32 32 5

Edges

ID Attr1 A2 B3 C4 D

VerticesSELECT src, dstFROM edges INNER JOIN vertices ON src = id;

Unnecessary join

can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:

SELECT src, dst FROM edges;

Page 13: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Materialized View Selection

GraphX: Triplet view enabled efficient message-passing algorithms

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A C

B C

C D

Graph

+

UpdatedPageRanks

B

A

C

D

A

Page 14: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Materialized View Selection

GraphFrames: User-defined views enable efficient graph queries

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A CB CC D

Graph

User-Defined Views

PageRank

CommunityDetection

Graph Queries

Page 15: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Join Reordering

A → B B → A

⋈A, BB → D

C→ B⋈BB → E⋈B

C→ D⋈BC→ E⋈C, D

⋈C, EExample Query

Left-Deep Plan Bushy Plan

A → B B → A

⋈A, B

B → D C→ B

⋈B

B → E⋈B⋈B⋈B, C

User-Defined View

Page 16: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Query Planning AlgorithmDynamic programming algorithm based on:J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

1. Considers all left-deep plans, and a subset of bushy plans• Bushy plans to explore are chosen using layered-DAG and cycle-detection heuristics

2. Considers using each view that is exactly equivalent to a plan subtree• Result: Selects the largest of multiple hierarchically contained views

Page 17: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

EvaluationFaster than Neo4j for unanchored pattern queries

0

0.5

1

1.5

2

2.5

GraphFrames Neo4j

Que

ry la

tenc

y, s

Anchored Pattern Query

01020304050607080

GraphFrames Neo4j

Que

ry la

tenc

y, s

Unanchored Pattern Query

Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

Page 18: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

0

1

2

3

4

5

6

7

GraphFrames GraphX Naïve Spark

Per-i

tera

tion

runt

ime,

s

PageRank Performance

Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

Page 19: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

EvaluationRegistering the right views can greatly improve performance for some queries

Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

Page 20: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Future Work• Suggest views automatically• Exploit attribute-based partitioning in optimizer• Code generation for single node

Page 21: GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur Dave UC Berkeley AMPLab Joint work with AlekhJindal

Try It Out!Released as a Spark Package at:

https://github.com/graphframes/graphframesThanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

[email protected]