GraphX: Graph Analytics on Spark


  • date post

    26-Feb-2016
  • Category

    Documents


description

GraphX: Graph Analytics on Spark. Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin. Developed at the UC Berkeley AMPLab. AMPCamp: August 29, 2013. Graphs are essential to data mining and machine learning: identify influential people and information, find communities.

Transcript of GraphX: Graph Analytics on Spark

Slide 1

GraphX: Graph Analytics on Spark
Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin
Developed at the UC Berkeley AMPLab
AMPCamp: August 29, 2013

Graphs are Essential to Data Mining and Machine Learning
  • Identify influential people and information
  • Find communities
  • Understand people's shared interests
  • Model complex data dependencies

Predicting Political Bias
[Figure: a graph of posts; a few are labeled Liberal or Conservative and the rest are unknown (?)]
  • Conditional Random Field
  • Belief Propagation

Triangle Counting
Count the triangles passing through each vertex:

Measures cohesiveness of local community
  • More triangles: stronger community
  • Fewer triangles: weaker community
[Figure: example graph with labels 1, 2, 3, 4]

Collaborative Filtering
[Figure: Users and Items connected by Ratings]

Many More Graph Algorithms
  • Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization, SVD
  • Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
  • Semi-supervised ML: Graph SSL, CoEM
  • Graph Analytics: PageRank, Single Source Shortest Path, Triangle-Counting, Graph Coloring, K-core Decomposition, Personalized PageRank
  • Classification: Neural Networks, Lasso

Structure of Computation
[Figure: Data-Parallel computation maps the rows of a Table to a Result; Graph-Parallel computation follows a Dependency Graph]

The Graph-Parallel Abstraction
  • A user-defined vertex program runs on each vertex
  • The graph constrains interaction along edges
      • Using messages (e.g., Pregel [PODC09, SIGMOD10])
      • Through shared state (e.g., GraphLab [UAI10, VLDB12])
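As a rough illustration of the message-passing style described above, the sketch below runs a vertex program in synchronous supersteps on an in-memory graph. It is plain Scala, not the Pregel or GraphX API; the object name `VertexProgramSketch`, the function `minLabelPropagation`, and the choice of minimum-label propagation (which finds connected components) are assumptions made for this example.

```scala
object VertexProgramSketch {
  type Id = Int

  /** Runs synchronous supersteps until no vertex changes its label. */
  def minLabelPropagation(vertices: Set[Id], edges: Seq[(Id, Id)]): Map[Id, Id] = {
    // Treat edges as undirected: a vertex can message neighbors in both directions.
    val neighbors: Map[Id, Seq[Id]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    // Every vertex starts with its own id as its label.
    var labels: Map[Id, Id] = vertices.map(v => (v, v)).toMap
    var changed = true
    while (changed) {
      // Message phase: each vertex sends its current label along its edges.
      val inbox: Map[Id, Seq[Id]] = labels.toSeq
        .flatMap { case (v, label) => neighbors.getOrElse(v, Seq.empty).map(n => (n, label)) }
        .groupBy(_._1)
        .map { case (v, msgs) => v -> msgs.map(_._2) }

      // Vertex program: keep the smallest label seen so far.
      val next: Map[Id, Id] = labels.map { case (v, label) =>
        (v, (label +: inbox.getOrElse(v, Seq.empty)).min)
      }
      changed = next != labels
      labels = next
    }
    labels
  }
}
```

Each vertex only ever reads messages that arrived along its own edges, which is the sense in which the graph constrains interaction.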

Parallelism: run multiple vertex programs simultaneously

By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.

Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles

  System      Machines   Runtime
  Hadoop      1536       423 minutes   [WWW11]
  GraphLab    64         15 seconds

1000 x Faster

[WWW11] S. Suri and S. Vassilvitskii, Counting triangles and the curse of the last reducer, WWW11
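The quantity being benchmarked, the number of triangles passing through each vertex, can be sketched on a single machine as follows. This is a plain-Scala illustration under the assumption of an undirected graph given as an edge list; it is not the GraphLab or Hadoop implementation measured above, and the names `TriangleCountSketch` and `trianglesPerVertex` are made up for the example.

```scala
object TriangleCountSketch {
  type Id = Int

  /** Returns, for each vertex that has at least one edge, the number of triangles through it. */
  def trianglesPerVertex(edges: Seq[(Id, Id)]): Map[Id, Int] = {
    // Build undirected neighbor sets; self-loops are dropped and duplicate edges collapse.
    val clean = edges.filter { case (a, b) => a != b }
    val neighbors: Map[Id, Set[Id]] =
      (clean ++ clean.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2).toSet }

    neighbors.map { case (v, nbrs) =>
      // A triangle through v is a pair of v's neighbors (a, b) that are themselves connected.
      val connectedPairs = for {
        a <- nbrs.toSeq
        b <- nbrs.toSeq
        if a < b && neighbors(a).contains(b)
      } yield (a, b)
      v -> connectedPairs.size
    }
  }
}
```

The work per vertex grows with the square of its degree, which is one reason high-degree vertices dominate the cost on social graphs such as the Twitter follower graph.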

Specialized Graph Systems
[Figure: system logos, including Pregel]
  • APIs to capture complex graph dependencies
  • Exploit graph structure to reduce communication and computation

Why GraphX?

The Bigger Picture: Time Spent in Data Pipeline
[Figure: pipeline stages Graph Creation, Graph Algorithms, and Post Proc., with labels GraphLab, Hadoop, Vertices, and Edges]

Limitations of Specialized Graph-Parallel Systems
  • No support for construction and post-processing
  • Not interactive
  • Requires maintaining multiple platforms
Spark excels at these!

GraphX Unifies Data-Parallel and Graph-Parallel Systems
  • Spark Table API: RDDs, fault tolerance, and task scheduling
  • GraphLab Graph API: graph representation and execution
  • Graph construction, computation, and post-processing: one system for the entire graph pipeline

Enable Joining Tables and Graphs
[Figure: pipeline with labels User Data, Product Ratings, Friend Graph, ETL, Product Rec. Graph, Join, Inf., Prod. Rec.; regions marked Tables and Graphs]

The GraphX Resilient Distributed Graph

Vertex table:

  Id          Attribute (V)
  rxin        (Stu., Berk.)
  jegonzal    (PstDoc, Berk.)
  franklin    (Prof., Berk)
  istoica     (Prof., Berk)

Edge table:

  SrcId       DstId       Attribute (E)
  rxin        jegonzal    Friend
  franklin    rxin        Advisor
  istoica     franklin    Coworker
  franklin    jegonzal    PI

[Figure: the corresponding graph, with vertices labeled R, J, F, I]

    class Graph[V, E] {
      // Table Views -----------------
      def vertices: RDD[(Id, V)]
      def edges: RDD[(Id, Id, E)]
      def triplets: RDD[((Id, V), (Id, V), E)]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def filterV(p: (Id, V) => Boolean): Graph[V, E]
      def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
      def mapV[T](m: (Id, V) => T): Graph[T, E]
      def mapE[T](m: Edge[V, E] => T): Graph[V, T]

      // Joins ----------------------------------------
      def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
      def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

      // Computation ----------------------------------
      def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                                reduceF: (T, T) => T,
                                direction: EdgeDir): Graph[T, E]
    }
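The `joinV` signature returns `Graph[(V, Opt[T]), E]`, which suggests left-outer-join semantics: every vertex is kept, and vertices with no matching row in the table get an empty option. The sketch below shows that behavior on plain Scala maps. It is an assumption about the intended semantics, not the GraphX implementation; `JoinVerticesSketch` is a hypothetical name, and `Option` stands in for the slide's `Opt`.

```scala
object JoinVerticesSketch {
  type Id = String

  /** Left outer join: every vertex survives; ids missing from tbl are paired with None. */
  def joinV[V, T](vertices: Map[Id, V], tbl: Map[Id, T]): Map[Id, (V, Option[T])] =
    vertices.map { case (id, attr) => (id, (attr, tbl.get(id))) }
}
```

Rows in the table whose id is not a vertex are ignored, so the join can never add vertices to the graph.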

GraphX API

Aggregate Neighbors
Map-Reduce for each vertex
[Figure: for vertex A with neighbors B and C, mapF is applied to edges A-B and A-C to produce a1 and a2, which are then combined by reduceF(a1, a2)]

Example: Oldest Follower
What is the age of the oldest follower for each user?

    val followerAge = graph.aggNbrs(
      e => e.src.age,   // MapF
      max(_, _),        // ReduceF
      InEdges).vertices
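The map-then-reduce pattern behind the oldest-follower query can be sketched on an in-memory edge list. This is plain Scala rather than the GraphX API: `Vertex`, `Edge`, `EdgeDir`, and the curried `aggregateNeighbors` below are simplified stand-ins invented for the example, and an edge from `src` to `dst` is assumed to mean "src follows dst".

```scala
object AggregateNeighborsSketch {
  type Id = String

  // Hypothetical stand-ins for the slide's vertex, Edge, and EdgeDir types.
  final case class Vertex(id: Id, age: Int)
  final case class Edge(src: Vertex, dst: Vertex)
  sealed trait EdgeDir
  case object InEdges extends EdgeDir
  case object OutEdges extends EdgeDir

  /** Applies mapF to every edge, groups the results by receiving vertex, and folds them with reduceF. */
  def aggregateNeighbors[T](edges: Seq[Edge], direction: EdgeDir)(mapF: Edge => T)(
      reduceF: (T, T) => T): Map[Id, T] =
    edges
      .map { e =>
        val target = direction match {
          case InEdges  => e.dst.id // aggregate over edges arriving at the vertex
          case OutEdges => e.src.id // aggregate over edges leaving the vertex
        }
        (target, mapF(e))
      }
      .groupBy(_._1)
      .map { case (id, vs) => id -> vs.map(_._2).reduce(reduceF) }
}
```

Calling `aggregateNeighbors(edges, InEdges)(e => e.src.age)((a, b) => math.max(a, b))` yields the age of the oldest follower for every vertex that has at least one follower; vertices without in-edges are simply absent from the result.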

[Figure: example graph whose vertex labels read 23, 42, 30, 19, 75, 16]

We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!

Performance Optimizations
  • Replicate and co-partition vertices with edges
      • GraphLab (PowerGraph) style vertex-cut partitioning
  • Minimize communication by avoiding edge data movement in JOINs
  • In-memory hash index for fast joins

Early Performance
[Figure: early performance results]

In Progress Optimizations
  • Byte-code inspection of user functions
      • E.g., if mapF does not need edge data, we can rewrite the query to delay the join
  • Execution strategies optimizer
      • Scan edges, randomly accessing vertices
      • Scan vertices, randomly accessing edges

Current Implementation
[Figure: implementation stack with labels Spark (relational operators), GraphX, Pregel (20), GraphLab (20), PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40)]

Demo
Reynold Xin

    // Load pages and links, then build and cache the graph.
    val vertices = spark.textFile("hdfs://path/pages.csv")
    val edges = spark.textFile("hdfs://path/to/links.csv")
      .map(line => new Edge(line.split('\t')))
    val g = new Graph(vertices, edges).cache
    println(g.vertices.count)
    println(g.edges.count)

    // Keep only vertices whose third tab-separated field is "Berkeley".
    val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")

    // Run 10 iterations of PageRank on the filtered graph.
    val ranks = Analytics.pageRank(g1, numIter = 10)
    println(ranks.vertices.sum)
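The demo calls `Analytics.pageRank(g1, numIter = 10)` without showing what each iteration does. The sketch below is one common formulation of fixed-iteration PageRank, written in plain Scala over an edge list. It is not the code behind `Analytics.pageRank`: the reset probability of 0.15, the unnormalized ranks starting at 1.0, and the names `PageRankSketch` and `resetProb` are all assumptions for illustration.

```scala
object PageRankSketch {
  type Id = Int

  /** Runs a fixed number of PageRank iterations over directed edges (src links to dst). */
  def pageRank(edges: Seq[(Id, Id)], numIter: Int, resetProb: Double = 0.15): Map[Id, Double] = {
    val vertices: Set[Id] = edges.flatMap { case (s, d) => Seq(s, d) }.toSet
    val outDegree: Map[Id, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

    // Unnormalized variant: every vertex starts with rank 1.0.
    var ranks: Map[Id, Double] = vertices.map(v => (v, 1.0)).toMap
    for (_ <- 1 to numIter) {
      // Each edge carries an equal share of its source's rank to its destination.
      val contribs: Map[Id, Double] = edges
        .map { case (s, d) => (d, ranks(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }
      // New rank: a constant reset term plus the damped sum of incoming shares.
      ranks = vertices.map(v => (v, resetProb + (1 - resetProb) * contribs.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

Each iteration is itself a map over edges followed by a per-vertex sum, the same shape as the aggregateNeighbors primitive.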

Summary
  • Graph-parallel primitives on Spark.
  • Currently slower than GraphLab, but
      • No need for specialized systems
      • Easier ETL, and easier consumption of output
      • Interactive graph data mining
  • Future work will bring performance closer to specialized engines.
[Slide annotation: Sub-second]

Status
  • Currently finalizing the APIs
  • Feedback wanted: http://bit.ly/graph-api
  • Also working on improving system performance
  • Will be part of Spark 0.9

Questions?
jegonzal@eecs.berkeley.edu
rxin@eecs.berkeley.edu

Backup slides

Vertex Cut Partitioning
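Only the title of this backup slide survives in the transcript. As a rough sketch of the idea, a vertex cut assigns every edge to exactly one partition and replicates a vertex on each partition that holds one of its edges. The code below uses a simple hash of the edge to pick a partition; that placement rule and the names `VertexCutSketch`, `partitionEdges`, and `replicas` are assumptions for illustration, not the partitioner used by GraphX or PowerGraph.

```scala
object VertexCutSketch {
  type Id = Int

  /** Assigns each edge to exactly one partition; returns partition id -> edges. */
  def partitionEdges(edges: Seq[(Id, Id)], numParts: Int): Map[Int, Seq[(Id, Id)]] =
    edges.groupBy { case (src, dst) => java.lang.Math.floorMod((src, dst).hashCode, numParts) }

  /** For each vertex, the set of partitions that need a replica of it. */
  def replicas(parts: Map[Int, Seq[(Id, Id)]]): Map[Id, Set[Int]] =
    parts.toSeq
      .flatMap { case (p, es) => es.flatMap { case (s, d) => Seq(s -> p, d -> p) } }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

Edges never move or duplicate under this scheme; only vertex data is copied, and the size of each vertex's replica set is what drives communication cost.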

aggregateNeighbors

Example: Vertex Degree
[Figure: graph with vertices A through F, annotated A: 5, B: 0, C: 0, D: 0, E: 0, F: 0]
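The vertex-degree example named in the backup slides fits the same map-and-reduce pattern as aggregateNeighbors: map every edge to 1 and sum per vertex. The sketch below computes in-degree on a plain edge list; `VertexDegreeSketch` and `inDegree` are hypothetical names, and the code does not use the GraphX API.

```scala
object VertexDegreeSketch {
  type Id = String

  /** Counts in-edges per vertex: map each edge to 1, then sum per destination. */
  def inDegree(edges: Seq[(Id, Id)]): Map[Id, Int] =
    edges
      .map { case (_, dst) => (dst, 1) }                 // map step: every edge contributes 1
      .groupBy(_._1)
      .map { case (v, ones) => v -> ones.map(_._2).sum } // reduce step: addition
}
```

Vertices with no incoming edges do not appear in the result, so a caller that needs explicit zeros would have to fill them in from the vertex table.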

Specialized Graph Systems
  • Shared State (GraphLab) [UAI10, VLDB12]
  • Messaging (Pregel) [PODC09, SIGMOD10]
  • Many Others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, ...

The Challenge
  • Expressive graph computation primitives implementable on Spark
  • Leveraging advanced properties and engine extensions to make these primitives fast
  • An optimizer for choosing execution strategies
  • Controlled data partitioning
  • New index-based access methods and operators