Graph Data Management
Analysis and Optimization of Graph Data Frameworks
presented by Fynn Leitow
Overview
1) Introduction
a) Motivation
b) Application for big data
2) Choice of algorithms
3) Choice of frameworks
a) Framework implementation
4) Framework analysis
a) Performance comparison
b) Optimization techniques
5) Conclusion & Criticism
Motivation
➢How to implement graph analysis algorithms on huge graphs?
➢How do they perform?
➢How about parallel computing?
Application for big data
➢Social Networks & Web of Data
➢extremely large & dynamic
➢can’t be handled by legacy programs
Social Networks
➢vertices: people, pictures, videos
➢edges: relations among nodes (friendship, follower)
➢scale-free: follow power-law distribution (few vertices have high popularity)
Web of Data
➢Large-scale structured data by governments, researchers, companies
➢Publishing principles:
○ A unique URI (Uniform Resource Identifier) per resource
○ publication at this URI in RDF triples (Resource Description Framework; statements similar to the entity-relationship model but more general)
○ links to similar online resources
➢RDF example: “The sky has the color blue”
subject: “the sky”
predicate: “has”
object: “the color blue”
Typical queries
➢ social networks:
○ point updates (of vertices, adding edges)
○ transitive closures (“other people you might know”, ...)
○ betweenness queries (“common friends”, shortest path, ...)
➢ Web of data:
○ bulk inserts
○ joins
○ logical inference (deductions)
All men are mortal.
Socrates is a man.
Therefore, Socrates is mortal.
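The deduction above can be sketched as a tiny forward-chaining loop over (subject, predicate, object) triples. This is only an illustration: the predicate names `is_a` and `is`, the rule format and the function `infer` are hypothetical and not taken from any RDF library.

```python
# Facts as (subject, predicate, object) triples, in the style of the RDF example.
facts = {("Socrates", "is_a", "man")}

# One hypothetical rule: whatever "is_a man" also "is mortal".
# A rule maps a premise (predicate, object) to a conclusion (predicate, object).
rules = [(("is_a", "man"), ("is", "mortal"))]


def infer(facts, rules):
    """Apply the rules repeatedly until no new triple can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (if_pred, if_obj), (then_pred, then_obj) in rules:
            for subject, pred, obj in list(derived):
                new_fact = (subject, then_pred, then_obj)
                if pred == if_pred and obj == if_obj and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived


closure = infer(facts, rules)
print(("Socrates", "is", "mortal") in closure)  # True
```

The loop stops once a full pass adds no new triple, which is the fixed point a deductive engine would compute.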
Objective of the paper
➢ Analyse existing graph frameworks - also for machine learning!
➢ native, hand-optimized implementation as reference point
➢ suggest ways to narrow the performance gap
Algorithms
1. PageRank (statistics)
2. Breadth-first search (graph traversal)
3. Triangle Counting (statistics)
4. Collaborative filtering (machine learning)
1. PageRank (site popularity)
➢how many links go to this site?
➢technically: probability for a random walk to end on this vertex
t - iteration
r - probability for random jump
E - set of directed edges
degree(j) - number of outgoing edges
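With these symbols, the PageRank update is usually written as below. This is a reconstruction from the symbol list in the form used by the main paper (Satish et al., 2014); the notation PR^t(i) for the rank of vertex i at iteration t is an assumption, and E denotes the set of directed edges.

```latex
PR^{t+1}(i) = r + (1 - r) \sum_{j \mid (j,i) \in E} \frac{PR^{t}(j)}{\mathrm{degree}(j)}
```

Each vertex j splits its current rank evenly over its outgoing edges, and vertex i adds up the shares arriving over its incoming edges.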
2. Breadth-first search (BFS) - Graph traversal
➢calculates “distance” (smallest number of edges) from start to any other vertex
➢Distance(start) initialized to zero, all others to infinity
➢iteratively computes for neighboring vertices:
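The per-vertex update is presumably Distance(i) = min over edges (j, i) of Distance(j) + 1. A minimal level-by-level sketch in Python, assuming the graph is given as a dict of neighbor lists (the names `graph` and `bfs_distances` are illustrative):

```python
import math


def bfs_distances(graph, start):
    """Return the smallest number of edges from start to every vertex.

    graph: dict mapping each vertex to a list of its neighbors (assumed format).
    """
    distance = {v: math.inf for v in graph}
    distance[start] = 0
    frontier = [start]
    while frontier:
        next_frontier = []
        for j in frontier:
            for i in graph[j]:
                # Distance(i) = min(Distance(i), Distance(j) + 1)
                if distance[j] + 1 < distance[i]:
                    distance[i] = distance[j] + 1
                    next_frontier.append(i)
        frontier = next_frontier
    return distance


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"], "e": []}
print(bfs_distances(graph, "a"))
```

Vertex `e` is unreachable from `a`, so its distance stays at infinity.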
3. Triangle Counting - graph statistics
➢Triangle := two neighboring vertices are both neighbors of a common third vertex
➢Algorithm:
○ each vertex shares its neighborhood list with its neighbors
○ each neighbor compares the received list with its own: every overlapping vertex closes a triangle
[Figure: triangle formed by vertices i, j and k]
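The neighborhood-overlap idea can be sketched with set intersections. This is a single-machine illustration, not any framework's implementation; it assumes an undirected graph stored as a dict of neighbor sets with comparable vertex ids.

```python
def count_triangles(neighbors):
    """Count triangles in an undirected graph.

    neighbors: dict mapping each vertex to the set of its neighbors (assumed format).
    """
    triangles = 0
    for i in neighbors:
        for j in neighbors[i]:
            if i < j:
                # Common neighbors k of i and j close a triangle (i, j, k).
                # Requiring k > j counts each triangle exactly once.
                common = neighbors[i] & neighbors[j]
                triangles += sum(1 for k in common if k > j)
    return triangles


neighbors = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(count_triangles(neighbors))  # 2
```

The two triangles in the example are (0, 1, 2) and (1, 2, 3).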
4. Collaborative filtering - machine learning
➢Recommendation system: predicts user ratings based on an incomplete set of (user, item) ratings
(a) Given a rating matrix R, find P and Q so that the product PQ approximates R as well as possible (R ≈ PQ).
(b) Equivalent graph view: a bipartite graph (users, items) with edge weights R_{u,v}
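One common way to fit P and Q is stochastic gradient descent over the known ratings, i.e. over the weighted edges of the bipartite graph. The sketch below is a generic illustration under assumed hyperparameters (`k`, `lr`, `reg`, `steps`), not the implementation benchmarked in the paper. Each user u gets a vector P[u] and each item v a vector Q[v], and the prediction for R_{u,v} is their dot product.

```python
import random


def predict(P, Q, u, v):
    """Predicted rating: dot product of user vector P[u] and item vector Q[v]."""
    return sum(pu * qv for pu, qv in zip(P[u], Q[v]))


def factorize(ratings, num_users, num_items, k=2, steps=500, lr=0.01, reg=0.02, seed=0):
    """Fit the factors on (user, item, rating) triples with plain SGD."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(num_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(num_items)]
    for _ in range(steps):
        for u, v, r_uv in ratings:
            err = r_uv - predict(P, Q, u, v)
            for f in range(k):
                pu, qv = P[u][f], Q[v][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qv - reg * pu)
                Q[v][f] += lr * (err * pu - reg * qv)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error on the known ratings."""
    total = sum((r_uv - predict(P, Q, u, v)) ** 2 for u, v, r_uv in ratings)
    return (total / len(ratings)) ** 0.5


ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
P0, Q0 = factorize(ratings, 3, 3, steps=0)  # random starting point
P1, Q1 = factorize(ratings, 3, 3)           # after training
print(rmse(ratings, P0, Q0), rmse(ratings, P1, Q1))
```

Once trained, `predict(P1, Q1, u, v)` gives an estimate for a (user, item) pair that has no rating yet.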
Overview
➢We focus on PageRank and BFS
| Algorithm | Application | Graph type | Vertex property | Message size (Bytes/edge) |
|---|---|---|---|---|
| PageRank | Graph statistics | Directed, unweighted edges | Double (pagerank) | Constant (8) |
| Breadth First Search | Graph traversal | Undirected, unweighted edges | Int (distance) | Constant (4) |
| Collaborative Filtering | Machine learning | Bipartite graph; undirected, weighted edges | Array of Doubles (p_u or q_v) | Constant (8K) |
| Triangle Counting | Graph statistics | Directed, unweighted edges | Long (# triangles) | Variable (0-10^6) |
Frameworks
GraphLab
Combinatorial BLAS (CombBLAS)
SociaLite
Galois
Giraph
Explanation: Framework & Ninja gap
➢Framework: layered structure indicating what kind of programs can/should be built and how they should interrelate
➢can include programs, specify interfaces, offer programming tools,...
➢Ninja performance gap: “Performance gap between naively written C++/C code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors” [1]
Choice of Frameworks:
Galois (University of Texas at Austin)
Combinatorial BLAS (UCSB)
SociaLite (Stanford University)
➢Galois: parallel computing framework for irregular data structures
Overview
| Framework | Programming model | Language | Graph partitioning | Communication layer |
|---|---|---|---|---|
| GraphLab | Vertex | C++ | 1-D (vertex partitioning) | Sockets |
| CombBLAS | Sparse (adjacency) matrix | C++ | 2-D (edge partitioning) | MPI |
| SociaLite | Datalog (declarative, deductive database tables) | Java | 1-D | Sockets |
| Galois | Task-based | C/C++ | N/A | N/A |
| Giraph (Hadoop) | Vertex | Java | 1-D | Netty |
PageRank - vertex programming
➢Graphlab & Giraph, similar for Galois
➢runs on a single vertex and communicates with adjacent vertices
➢Vertex program for one iteration:
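A minimal Python sketch of what such a vertex program could look like, assuming a Pregel-style model in which each vertex receives the shares sent by its in-neighbors and sends its own share along its outgoing edges. The function names and the small single-process driver are hypothetical; this is not GraphLab or Giraph API code.

```python
def pagerank_vertex_program(incoming, out_neighbors, r=0.15):
    """One PageRank step for a single vertex.

    incoming: rank shares received from in-neighbors in the previous step.
    Returns the new rank and the (target, share) messages for the next step.
    """
    new_rank = r + (1 - r) * sum(incoming)
    share = new_rank / len(out_neighbors) if out_neighbors else 0.0
    messages = [(target, share) for target in out_neighbors]
    return new_rank, messages


def run_supersteps(out_edges, iterations=20, r=0.15):
    """Tiny driver that plays the role of the framework (illustrative only)."""
    ranks = {v: 1.0 for v in out_edges}
    # Initial messages: every vertex sends an equal share of its starting rank.
    inbox = {v: [] for v in out_edges}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(ranks[v] / len(targets))
    for _ in range(iterations):
        next_inbox = {v: [] for v in out_edges}
        for v in out_edges:
            ranks[v], messages = pagerank_vertex_program(inbox[v], out_edges[v], r)
            for target, share in messages:
                next_inbox[target].append(share)
        inbox = next_inbox
    return ranks


out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = run_supersteps(out_edges)
print(ranks)
```

The vertex program only sees its own inbox and its own outgoing edges, which is what lets a framework distribute the vertices across machines.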
PageRank - sparse matrix (CombBLAS)
➢single iteration of PageRank:
A - adjacency matrix
p_t - vector of PageRank values
p̃_t = p_t / [vector of vertex out-degrees]
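In matrix form, one iteration is presumably the following, reconstructed from the definitions above in the form used by the main paper. The out-degree vector d, the element-wise division symbol and the convention A_{ji} = 1 for an edge from j to i are assumptions; r is the probability of a random jump.

```latex
\tilde{p}_t = p_t \oslash d, \qquad p_{t+1} = r\,\mathbf{1} + (1 - r)\, A^{T} \tilde{p}_t
```

Row i of A^T picks out the in-neighbors of vertex i, so the product sums exactly the rank shares that arrive over incoming edges.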
PageRank - declarative (SociaLite)
➢Head: PageRank of vertex n for iteration t+1 is sum of two rules:
➢1st: constant term, 2nd: normalized values from iteration t
➢InEdge[n](s): vertices in 1st, neighbors in 2nd column
➢second version optimized for distributed machines
BFS - vertex programming (GraphLab, Giraph)
➢Initialize Distance to zero for the start vertex and to infinity for all other vertices
➢Continue until there are no more updates
Breadth First Search - sparse matrix (CombBLAS)
➢matrix-vector multiplication in each iteration:
v = A^T s
v - non-zeros indicate start vertices for next iteration
A - adjacency matrix
s - starting vertices
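A small Python sketch of BFS as repeated matrix-vector products. For brevity it uses a dense 0/1 matrix and Boolean AND/OR in place of multiply/add. Masking out already visited vertices is an added assumption needed to make the iteration terminate on graphs with cycles.

```python
def bfs_levels_matvec(adj, start):
    """BFS where each level is one product v = A^T s over Boolean AND/OR.

    adj: dense 0/1 adjacency matrix as a list of rows, adj[j][i] == 1 for an edge j -> i.
    Returns the BFS level of every vertex, or -1 if it is unreachable.
    """
    n = len(adj)
    level = [-1] * n
    level[start] = 0
    s = [0] * n
    s[start] = 1
    depth = 0
    while any(s):
        depth += 1
        # v[i] = OR over j of (adj[j][i] AND s[j]), i.e. row i of A^T times s.
        v = [int(any(adj[j][i] and s[j] for j in range(n))) for i in range(n)]
        # Keep only vertices without a level yet; they form the next frontier.
        s = [1 if v[i] and level[i] == -1 else 0 for i in range(n)]
        for i in range(n):
            if s[i]:
                level[i] = depth
    return level


# Edges: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
adj = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(bfs_levels_matvec(adj, 0))  # [0, 1, 1, 2]
```

A sparse-matrix library would store only the non-zeros of A and s, so each product touches only the edges leaving the current frontier.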
Breadth First Search - SociaLite
➢1st rule handles source, 2nd recursively follows neighbors
Breadth First Search - Galois
➢Work lists are maintained & processed in parallel for each level by Galois
Framework Analysis
Datasets - Performance on single and multiple nodes
Experimental Setup
➢Twitter & Yahoo Music are large enough to require multiple nodes (4 nodes; 16 for triangle counting)
➢Intel Xeon E5-2697, 24 cores @ 2.7 GHz, 64 GB DRAM / node
➢Real-world data follow a power-law distribution, so the synthetic data do as well

| Dataset | FB, Wiki, Livejournal | Netflix | Twitter | Yahoo Music | Synthetic Graph500 (64 nodes) | Synthetic Collaborative Filtering (64 nodes) |
|---|---|---|---|---|---|---|
| # vertices (k = 10^3) | 3 - 5 M | 480 k users, 18 k movies | 62 M | 1.0 M users, 0.6 M items | 537 M | 63 M users, 1.3 M items |
| # edges (M = 10^6) | 42 - 85 M | 99 M ratings | 1468 M | 253 M ratings | 8539 M | 16743 M ratings |
Performance results (single node)
➢native fastest
➢Galois very fast (single node only)
➢Giraph slow
➢synthetic data in line with real-world results
Performance (single node)
➢CombBLAS, GraphLab and SociaLite perform well on average
➢CombBLAS ran out of memory on Triangle Counting (A² calculations)
Performance (multiple nodes, synthetic)
➢“weak scaling”: graph data per node is kept constant
➢horizontal line denotes perfect scaling
Performance (multiple nodes, real-world / combined)
Observations
➢Again: Native best, Giraph worst, CombBLAS out of memory (OOM).
➢GraphLab & SociaLite perform well on triangle counting because their data structures are optimized for union operations
➢CombBLAS performs well for the other three algorithms
➢GraphLab drops off for PageRank & BFS due to network bottleneck
Further Analysis: Resource monitoring
➢CPU utilization (higher is better)
➢Peak network transfer rate (higher is better)
➢memory footprint (lower is better)
➢network data volume (lower is better)
=> find out why certain trends are observed
Resource monitoring
➢Giraph - only 16 % CPU utilization due to memory requirements for each worker
➢PageRank: limited by network traffic for all frameworks, which explains the performance differences
Optimization
➢ Data Structure
○ 2x : bit vectors (exploit bit-level parallelism in hardware)
○ compressed sparse-row (CSR) format for adjacency list
➢ Data Compression
○ 2x : bit vectors + delta coding (store differences rather than actual values)
➢ Overlap of Computation and Communication
○ 1.2-2x : start computation before receiving whole message, chunk large messages
➢ Message passing mechanisms
○ 2.5-3x : use MPI (message passing interface) instead of TCP sockets
○ 2x : use multiple sockets between two nodes
➢ Partitioning schemes (1-d, 2-d, vertex-cut,...)
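Two of the data-layout ideas above, the CSR adjacency format and delta coding, can be sketched in a few lines of Python. The function names are illustrative, and the sketch shows only the layout, not the bit-level packing or the speedups quoted above.

```python
def to_csr(num_vertices, edges):
    """Build CSR arrays (row_offsets, column_indices) from (src, dst) edges.

    The neighbors of vertex v are column_indices[row_offsets[v]:row_offsets[v + 1]].
    """
    row_offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_offsets[src + 1] += 1
    for v in range(num_vertices):
        row_offsets[v + 1] += row_offsets[v]  # prefix sum over the out-degrees
    column_indices = [0] * len(edges)
    next_slot = row_offsets[:-1].copy()
    for src, dst in sorted(edges):
        column_indices[next_slot[src]] = dst
        next_slot[src] += 1
    return row_offsets, column_indices


def delta_encode(sorted_ids):
    """Store the first id, then the differences between consecutive ids."""
    return [b - a for a, b in zip([0] + sorted_ids[:-1], sorted_ids)]


def delta_decode(deltas):
    """Rebuild the original ids by accumulating the differences."""
    ids, total = [], 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
row_offsets, column_indices = to_csr(4, edges)
print(row_offsets, column_indices)               # [0, 2, 3, 5, 5] [1, 2, 2, 0, 3]
print(delta_encode([1000, 1003, 1004, 1010]))    # [1000, 3, 1, 6]
```

The small differences produced by delta coding need fewer bits than the full vertex ids, which is where the reduction in data volume comes from.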
Optimization - PageRank & BFS
➢optimization performed for the native algorithm
➢BFS: bit vectors for the list of already visited vertices
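A bit vector used as the visited list can be sketched as follows, here inside a breadth-first traversal over a CSR graph. The `BitVector` class and `bfs_order` are illustrative names, and the sketch is not the hand-optimized native code from the paper.

```python
class BitVector:
    """Fixed-size set of vertex ids, packed eight per byte."""

    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


def bfs_order(row_offsets, column_indices, start):
    """Traverse a CSR graph level by level, marking visited vertices in a bit vector."""
    n = len(row_offsets) - 1
    visited = BitVector(n)
    visited.set(start)
    order, frontier = [start], [start]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in column_indices[row_offsets[u]:row_offsets[u + 1]]:
                if not visited.test(v):
                    visited.set(v)
                    next_frontier.append(v)
        order.extend(next_frontier)
        frontier = next_frontier
    return order


# CSR arrays for the edges 0->1, 0->2, 1->2, 2->0, 2->3
row_offsets = [0, 2, 3, 5, 5]
column_indices = [1, 2, 2, 0, 3]
print(bfs_order(row_offsets, column_indices, 0))  # [0, 1, 2, 3]
```

One bit per vertex keeps the visited list small: the four vertices here fit in a single byte.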
Criticism & Summary
Criticism
➢Spark-based GraphX is not included (7x slower than GraphLab on PageRank)
➢For frameworks without pre-implemented code, the authors' own implementation might be too good or too bad and distort the results.
✓ Creates value for end users and framework developers alike
✓ Methods and algorithms are well explained without being too technical
Summary
➢vertex-based: every vertex is an instance, communicates non-iterative
➢Galois: best single node results. Giraph: slow
➢Optimization: bit vectors, delta coding, CSR, Overlap, MPI
⇒Reduce the performance gap through recommended changes and let the end user choose by preference.
Sources
[1] Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? (Satish et al.; 2012)
Images: Giraph, GraphLab, SociaLite, Galois
Introductory Paper: Graph Data Management Systems for New Application Domains (Cudré-Mauroux, Elnikety; 2011)
Main Paper, other tables and figures: Navigating the maze of graph analytics frameworks using massive graph datasets (Satish et al.; SIGMOD 2014)