Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016
GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing...
Transcript of GraphFrames: An Integrated API for Mixing Graph and ...GraphFrames: An Integrated API for Mixing...
GraphFrames: An Integrated API for Mixing Graph and Relational QueriesAnkur DaveUC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)
UC BERKELEY
+ Graph Queries
2016Apache Spark + GraphFrames
Trend: Unified Graph Analysis
+ Graph Algorithms
2013Apache Spark + GraphX
Relational Queries
2009Spark
Graph Algorithms vs. Graph Queries
≈x
PageRank
Alternating Least Squares
Graph Algorithms Graph Queries
Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators
Editor 1 Editor 2 Article 1 Article 2
⇓
Article 1
Article 2
Editor 1
Editor 2
same day} same day}
Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank
// Iterate until convergence wikipedia.pregel(sendMsg = { e =>
e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>
0.15 + 0.85 * msgSum})
Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")
.select("*","e11.date – e21.date".as("d1"),"e12.date – e22.date".as("d2"))
.sort("d1 + d2".desc).take(10)
Separate SystemsGraph Algorithms Graph Queries
Raw Wikipedia
< / >< / >< / >XML
Text Table
Edit GraphEdit Table
Frequent Collaborators
Problem: Mixed Graph AnalysisHyperlinks PageRank
Article Text
User Article
Vandalism Suspects
User User
User Article
Solution: GraphFrames
Graph Algorithms Graph Queries
Spark SQL
GraphFrames API
Pattern Query Optimizer
GraphFrames API• Unifies graph algorithms, graph queries, and relational operations (DataFrames)• Designed for interactive use• Available in Scala, Java, and Python
class GraphFrame {def vertices: DataFramedef edges: DataFrame
def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit
def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...
}
Implementation
Parsed Pattern
Logical Plan
Materialized Views
Optimized Logical Plan
DataFrameResult
Query String
Graph–RelationalTranslation Join Elimination
and Reordering
Spark SQL
View SelectionGraph
AlgorithmsGraphX
Graph–Relational Translation
B
D
A
C
Existing Logical PlanOutput: A,B,C
Src Dst
⋈C=Src
Edge Table
ID Attr
Vertex Table
⋈D=ID
Join Elimination
Src Dst1 21 32 32 5
Edges
ID Attr1 A2 B3 C4 D
VerticesSELECT src, dstFROM edges INNER JOIN vertices ON src = id;
Unnecessary join
can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:
SELECT src, dst FROM edges;
Materialized View Selection
GraphX: Triplet view enabled efficient message-passing algorithms
Vertices
B
A
C
D
Edges
A B
A C
B C
C D
A
B
Triplet View
A C
B C
C D
Graph
+
UpdatedPageRanks
B
A
C
D
A
Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
Vertices
B
A
C
D
Edges
A B
A C
B C
C D
A
B
Triplet View
A CB CC D
Graph
User-Defined Views
PageRank
CommunityDetection
…
Graph Queries
Join Reordering
A → B B → A
⋈A, BB → D
C→ B⋈BB → E⋈B
C→ D⋈BC→ E⋈C, D
⋈C, EExample Query
Left-Deep Plan Bushy Plan
A → B B → A
⋈A, B
B → D C→ B
⋈B
B → E⋈B⋈B⋈B, C
User-Defined View
Query Planning AlgorithmDynamic programming algorithm based on:J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.
1. Considers all left-deep plans, and a subset of bushy plans• Bushy plans to explore are chosen using layered-DAG and cycle-detection heuristics
2. Considers using each view that is exactly equivalent to a plan subtree• Result: Selects the largest of multiple hierarchically contained views
EvaluationFaster than Neo4j for unanchored pattern queries
0
0.5
1
1.5
2
2.5
GraphFrames Neo4j
Que
ry la
tenc
y, s
Anchored Pattern Query
01020304050607080
GraphFrames Neo4j
Que
ry la
tenc
y, s
Unanchored Pattern Query
Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.
EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation
0
1
2
3
4
5
6
7
GraphFrames GraphX Naïve Spark
Per-i
tera
tion
runt
ime,
s
PageRank Performance
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.
EvaluationRegistering the right views can greatly improve performance for some queries
Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.
Future Work• Suggest views automatically• Exploit attribute-based partitioning in optimizer• Code generation for single node
Try It Out!Released as a Spark Package at:
https://github.com/graphframes/graphframesThanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.