GraphFrames: Graph Queries in Apache Spark lierranli/publications/GraphFrames-Spark...GraphFrames:...

download GraphFrames: Graph Queries in Apache Spark lierranli/publications/GraphFrames-Spark...GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh

of 20

  • date post

    26-May-2018
  • Category

    Documents

  • view

    212
  • download

    0

Embed Size (px)

Transcript of GraphFrames: Graph Queries in Apache Spark lierranli/publications/GraphFrames-Spark...GraphFrames:...

  • GraphFrames: Graph Queries in Apache Spark SQL

    Ankur DaveUC Berkeley AMPLab

    Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

  • + Graph Queries

    2016Apache Spark + GraphFrames

    GraphFrames (2016)

    + Graph Algorithms

    2013Apache Spark + GraphX

    Relational Queries

    2009Spark

  • Graph Algorithms vs. Graph Queries

    x

    PageRank

    Alternating Least Squares

    Graph Algorithms Graph Queries

  • Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators

    Editor 1 Editor 2 Article 1 Article 2

    Article 1

    Article 2

    Editor 1

    Editor 2

    same day} same day}

  • Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank

    // Iterate until convergence wikipedia.pregel(sendMsg = { e =>

    e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>

    0.15 + 0.85 * msgSum})

    Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")

    .select("*","e11.date e21.date".as("d1"),"e12.date e22.date".as("d2"))

    .sort("d1 + d2".desc).take(10)

  • Separate SystemsGraph Algorithms Graph Queries

  • Raw Wikipedia

    < / >< / >< / >XML

    Text Table

    Edit GraphEdit Table

    Frequent Collaborators

    Problem: Mixed Graph AnalysisHyperlinks PageRank

    Article Text

    User Article

    Vandalism Suspects

    User User

    User Article

  • Solution: GraphFrames

    Graph Algorithms Graph Queries

    Spark SQL

    GraphFrames API

    Pattern Query Optimizer

  • GraphFrames API Unifies graph algorithms, graph queries, and DataFrames Available in Scala, Java, and Python

    class GraphFrame {def vertices: DataFramedef edges: DataFrame

    def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit

    def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...

    }

  • Implementation

    Parsed Pattern

    Logical Plan

    Materialized Views

    Optimized Logical Plan

    DataFrameResult

    Query String

    GraphRelationalTranslation Join Elimination

    and Reordering

    Spark SQL

    View SelectionGraph

    AlgorithmsGraphX

  • GraphRelational Translation

    B

    D

    A

    C

    Existing Logical PlanOutput: A,B,C

    Src Dst

    C=Src

    Edge Table

    ID Attr

    Vertex Table

    D=ID

  • Materialized View Selection

    GraphX: Triplet view enabled efficient message-passing algorithms

    Vertices

    B

    A

    C

    D

    Edges

    A B

    A C

    B C

    C D

    A

    B

    Triplet View

    A C

    B C

    C D

    Graph

    +

    UpdatedPageRanks

    B

    A

    C

    D

    A

  • Materialized View Selection

    GraphFrames: User-defined views enable efficient graph queries

    Vertices

    B

    A

    C

    D

    Edges

    A B

    A C

    B C

    C D

    A

    B

    Triplet View

    A CB CC D

    Graph

    User-Defined Views

    PageRank

    CommunityDetection

    Graph Queries

  • Join Elimination

    Src Dst1 21 32 32 5

    Edges

    ID Attr1 A2 B3 C4 D

    VerticesSELECT src, dstFROM edges INNER JOIN vertices ON src = id;

    Unnecessary join

    can be eliminated if tables satisfy referential integrity, simplifying graphrelational translation:

    SELECT src, dst FROM edges;

  • Join Reordering

    A B B A

    A, B B DC BB

    B EBC DB

    C EC, DC, EExample Query

    Left-Deep Plan Bushy Plan

    A B B A

    A, B

    B D C B

    B

    B EBBB, C

    User-Defined View

  • EvaluationFaster than Neo4j for unanchored pattern queries

    0

    0.5

    1

    1.5

    2

    2.5

    GraphFrames Neo4j

    Que

    ry la

    tenc

    y, s

    Anchored Pattern Query

    01020304050607080

    GraphFrames Neo4j

    Que

    ry la

    tenc

    y, s

    Unanchored Pattern Query

    Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

  • EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

    0

    1

    2

    3

    4

    5

    6

    7

    GraphFrames GraphX Nave Spark

    Per-i

    tera

    tion

    runt

    ime,

    s

    PageRank Performance

    Per-iteration performance on web-Google, single 8-core machine. Nave Spark uses Scala RDD API.

  • EvaluationRegistering the right views can greatly improve performance for some queries

    Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

  • Future Work Suggest views automatically Exploit attribute-based partitioning in optimizer Code generation for single node

  • Try It Out!Released as a Spark Package at:

    https://github.com/graphframes/graphframesThanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

    ankurd@eecs.berkeley.edu