Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere!
Distributed graph computing with Spark GraphX
Andrea Iacono
Agenda:
Graph definitions and usages
GraphX introduction
Pregel
Code examples
The main focus will be the programming model
The code is available at: https://github.com/andreaiacono/TalkGraphX
Questions for the audience:
- Who knows what a graph is?
- Who has ever used one?
- Who knows the most common algorithms (BFS, DFS, Dijkstra)?
- Who knows Scala?
A graph is a set of vertices and edges that connect them:
Graphs are used for modeling very different domains.
Edge
Vertex
Vertices and edges
Networks
Triangle counting for clustering. Commercial interest: targeted offers to groups with the same interests.
Routing
Vertices = intersections. Edges = roads. Shortest-path algorithm (Dijkstra), where edges carry several weights: typically distance, traffic, toll payments, etc.
PageRank
Pages = vertices. Edges = incoming links. Every outgoing edge has a weight tied to the weight of its vertex; the greater the sum of the values of the incoming edges, the greater the weight of the vertex. Iterative algorithm.
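The iterative idea behind PageRank can be sketched in plain Scala, without Spark. This is only a simplified illustration: the `PageRankSketch` object, the `links` map, the fixed number of iterations and the damping factor of 0.85 are all assumptions of this sketch, and pages without outgoing links simply lose their rank here.

```scala
object PageRankSketch {
  // `links` maps each page to the pages it links to (hypothetical input format).
  def pageRank(links: Map[String, Seq[String]],
               iterations: Int = 20,
               damping: Double = 0.85): Map[String, Double] = {
    val pages = (links.keySet ++ links.values.flatten).toSeq
    val n = pages.size
    // Start with the same rank for every page.
    var ranks: Map[String, Double] = pages.map(p => p -> 1.0 / n).toMap

    for (_ <- 1 to iterations) {
      // Each page splits its current rank evenly among its outgoing links.
      val contributions: Seq[(String, Double)] = for {
        (page, targets) <- links.toSeq
        target <- targets
      } yield target -> (ranks(page) / targets.size)

      // The new rank of a page grows with the sum of its incoming contributions.
      val incoming: Map[String, Double] =
        contributions.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }

      ranks = pages.map { p =>
        p -> ((1 - damping) / n + damping * incoming.getOrElse(p, 0.0))
      }.toMap
    }
    ranks
  }
}
```

For example, with `Map("a" -> Seq("c"), "b" -> Seq("c"), "c" -> Seq("a"))` the page `b`, which has no incoming links, ends up with the lowest rank.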
Artificial Intelligence / Minimum spanning tree
Definitions
Undirected
Directed
Directed / undirected
Definitions
Connected
Disconnected
Connected / disconnected
Definitions
K5
K2,3
Complete
Bipartite (and complete)
K is the standard notation for this kind of graph. A bipartite graph is useful for e-commerce, where you have all the user nodes, each of which can buy any of the product nodes.
Definitions
Cyclic
Acyclic
Cyclic / acyclic (i.e. without cycles)
Definitions
Multigraph
Pseudograph
Multigraph: there can be several edges with the same source and the same destination. Pseudograph: an edge can have the same vertex as both source and destination.
Definitions
An undirected acyclic connected graph is a tree!
When I said that graphs are everywhere, this is the main reason why!
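The tree definition above can be checked mechanically. The following is a minimal plain-Scala sketch, assuming an undirected graph given as a set of integer vertex ids and a list of vertex pairs; `TreeCheck` and `isTree` are names made up for this example. It relies on the fact that a connected undirected graph is acyclic exactly when it has |V| - 1 edges.

```scala
object TreeCheck {
  // Edges are undirected pairs of vertex ids (representation assumed for this sketch).
  def isTree(vertices: Set[Int], edges: Seq[(Int, Int)]): Boolean =
    if (vertices.isEmpty) false
    else {
      // Adjacency lists, with every undirected edge stored in both directions.
      val neighbours: Map[Int, Seq[Int]] =
        edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
          .groupBy(_._1)
          .map { case (v, pairs) => v -> pairs.map(_._2) }

      // Collect everything reachable from the current frontier.
      @annotation.tailrec
      def reach(frontier: Set[Int], seen: Set[Int]): Set[Int] =
        if (frontier.isEmpty) seen
        else {
          val next = frontier.flatMap(v => neighbours.getOrElse(v, Seq.empty)) -- seen
          reach(next, seen ++ next)
        }

      val start = vertices.head
      val connected = reach(Set(start), Set(start)) == vertices
      // Connected and |E| = |V| - 1 means there is no cycle.
      connected && edges.size == vertices.size - 1
    }
}
```

A path such as `isTree(Set(1, 2, 3), Seq((1, 2), (2, 3)))` passes the check, while closing it into a triangle or leaving a vertex unreachable makes it fail.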
What's wrong with MapReduce?
Every MapReduce run reads the initial data from disk (e.g. HDFS), computes the results and then stores them back on disk; since most graph algorithms are iterative, this means that on every iteration the whole dataset must be read from and written to disk.
It's better to use a distributed dataflow framework
Here we are talking about large graphs, which do not fit in the RAM of a single machine.
GraphX is a graph processing system built on top of Apache Spark
Graph processing systems represent graph structured data as a property graph, which associates user-defined properties with each vertex and edge.
The Spark storage abstraction called Resilient Distributed Datasets (RDDs) enables applications to keep data in memory, which is essential for iterative graph algorithms.
RDDs permit user-defined data partitioning, and the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement. This is essential for encoding partitioned graphs.
Excerpt from "GraphX: Graph Processing in a Distributed Dataflow Framework": https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf
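As an illustration of the property graph and of user-chosen partitioning, here is a minimal sketch that builds a small GraphX graph from two RDDs. The vertex names, edge weights, the `PropertyGraphSketch` object and the choice of `EdgePartition2D` are assumptions made up for this example; it needs Spark and GraphX on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

object PropertyGraphSketch {
  def build(sc: SparkContext): Graph[String, Int] = {
    // Vertex property: a user name (sample data invented for this sketch).
    val users: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    // Edge property: an integer weight (also invented).
    val follows: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(2L, 3L, 2), Edge(3L, 1L, 4)))

    Graph(users, follows)
      .partitionBy(PartitionStrategy.EdgePartition2D) // user-chosen edge partitioning
      .cache()                                        // keep the graph in memory across iterations
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("property-graph-sketch")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      // A triplet exposes an edge together with the properties of both endpoints.
      graph.triplets
        .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
        .collect()
        .foreach(println)
    } finally sc.stop()
  }
}
```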
The graph represented is a multi-pseudograph. Internal representation?
GraphX / Spark software stack
(image source: Spark site)
Unlike Spark, which offers APIs in Scala, Java and Python, GraphX offers its API only in Scala; the other languages should, however, become available in the near future.
Graph partitioning
Issues with distributed computing
(image source: GraphX Programming Guide)
Graph Databases
Storage
Query Language
Transactions
Examples: Neo4j, OrientDB, Titan
Graph Processing Systems
APIs for traversing and processing
Better performance (in-memory data)
Examples: GraphX, Giraph, GraphLab
Gremlin graph query language (TinkerPop): Gremlin is a DSL for traversing property graphs.
Neo4j uses Cypher (proprietary) as its native query language.
Titan is a graph database that supports the following storage backends:
- Cassandra (column)
- HBase (column)
- BerkeleyDB (key-value)
Pregel
Pregel is a computational model designed by Google (https://kowshik.github.io/JPregel/pregel_paper.pdf). It consists of a sequence of supersteps that runs until termination. In each superstep, every vertex can:
modify its state or that of its outgoing edges
receive the messages sent to it during the previous superstep
send messages to other vertices (they will be received in the next superstep)
vote to halt
When a vertex votes to halt, it becomes inactive; if it receives a message in a later superstep, the framework wakes it up and sets its state back to active. The computation stops when all the vertices have voted to halt and no messages are in transit; alternatively, a maximum number of iterations can be set.
Edges do not perform any computation.
When writing algorithms, you have to think like a vertex.
Pregel sample
Image source: Pregel paper
Let's imagine we have a value for each vertex and we want to find the maximum value in the whole graph. With this computational model, the idea is that we have to propagate information between the nodes. In each superstep, every vertex that has received a value higher than its own sends it to all its neighbours. When no vertex changes any more, the algorithm terminates.
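The maximum-value example can be simulated in plain Scala, without Spark, to make the superstep loop explicit. This is only a sketch of the vertex-centric idea, not the code from the talk: `MaxValueSupersteps`, `run` and `send` are hypothetical names, and `neighbours` is assumed to list the outgoing neighbours of each vertex, all of which must appear in `values`.

```scala
object MaxValueSupersteps {
  // Messages sent by the active vertices, grouped by receiving vertex.
  private def send(active: Set[Int],
                   state: Map[Int, Int],
                   neighbours: Map[Int, Seq[Int]]): Map[Int, Seq[Int]] =
    active.toSeq
      .flatMap(v => neighbours.getOrElse(v, Seq.empty).map(n => n -> state(v)))
      .groupBy(_._1)
      .map { case (n, msgs) => n -> msgs.map(_._2) }

  def run(values: Map[Int, Int], neighbours: Map[Int, Seq[Int]]): Map[Int, Int] = {
    var state = values
    // First superstep: every vertex sends its own value to its neighbours.
    var inbox = send(state.keySet, state, neighbours)
    while (inbox.nonEmpty) {
      // A vertex changes (and stays active) only if it received a higher value.
      val changed = inbox.collect {
        case (v, msgs) if msgs.max > state(v) => v -> msgs.max
      }
      state = state ++ changed
      // Only the vertices that changed send messages; the others have halted.
      inbox = send(changed.keySet, state, neighbours)
    }
    state
  }
}
```

On a directed ring 1 -> 2 -> 3 -> 4 -> 1 with values 3, 6, 2 and 1, every vertex ends up holding 6, and the loop stops as soon as no vertex changes.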
GraphX implementation of Pregel
GraphX uses three functions for implementing Pregel:
vprog: the vertex program, run on each vertex that receives a message; it takes the incoming message and computes the new vertex value
sendMsg: the function used for sending messages to other vertices
mergeMsg: a function that takes two incoming messages and merges them into a single message
Unlike Google's Pregel, the GraphX implementation of Pregel:
leaves message construction out of the vertex program, to allow a more efficient distributed execution
permits access to the attributes of both vertices of an edge while building the messages
constrains message sending to the graph structure (only to neighbours)
mergeMsg must be commutative and associative. Commutative: 2 + 3 == 3 + 2. Associative: (2 + 3) + 4 == 2 + (3 + 4).
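The three functions can be put together in a small sketch that propagates the maximum vertex value with the GraphX `pregel` operator. This is an illustrative example under assumptions, not the code from the talk: the `PregelMaxSketch` object, the `Int` vertex values and the use of `Int.MinValue` as initial message are choices made here. Messages travel only along the edge direction, so every vertex receives the global maximum only if it is reachable from the vertex holding it.

```scala
import org.apache.spark.graphx.{EdgeTriplet, Graph, VertexId}

object PregelMaxSketch {
  // vprog: keep the larger of the current value and the merged incoming message.
  def vprog(id: VertexId, value: Int, msg: Int): Int = math.max(value, msg)

  // sendMsg: runs per edge and can read the attributes of both endpoints;
  // a message is sent only when the source holds a higher value than the destination.
  def sendMsg(t: EdgeTriplet[Int, Int]): Iterator[(VertexId, Int)] =
    if (t.srcAttr > t.dstAttr) Iterator((t.dstId, t.srcAttr)) else Iterator.empty

  // mergeMsg: max is commutative and associative, as required.
  def mergeMsg(a: Int, b: Int): Int = math.max(a, b)

  // The initial message Int.MinValue leaves every vertex value unchanged.
  def maxEverywhere(graph: Graph[Int, Int]): Graph[Int, Int] =
    graph.pregel(Int.MinValue)(vprog, sendMsg, mergeMsg)
}
```

Because `sendMsg` returns an empty iterator when nothing needs to change, vertices stop receiving messages once the maximum has reached them, and the computation ends without setting a maximum number of iterations.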
GraphX Pregel communication diagram
When to use GraphX
GraphX is well suited for algorithms that:
respect the neighborhood structure
GraphX is NOT well suited for algorithms that:
need interaction between distant vertices
change the structure of the graph
Algorithms out of the box (as of Spark v1.5.1):
- Connected Components
- Label Propagation
- PageRank
- SVD++
- Shortest Paths
- Strongly Connected Components
- Triangle Count
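A few of these built-in algorithms can be called directly on a graph. The sketch below wraps three of them; the `BuiltInAlgorithmsSketch` object, its method names and the tolerance value are assumptions of this example. On Spark 1.5.x, `triangleCount` expects edges in canonical orientation (source id smaller than destination id) and a graph partitioned with `partitionBy`, which is why the call is preceded by a partitioning step.

```scala
import org.apache.spark.graphx.{Graph, PartitionStrategy, VertexId}

object BuiltInAlgorithmsSketch {
  // Each vertex is labelled with the smallest vertex id of its connected component.
  def components(graph: Graph[Int, Int]): Map[VertexId, VertexId] =
    graph.connectedComponents().vertices.collect().toMap

  // PageRank run until the ranks change by less than `tol` (value chosen arbitrarily here).
  def ranks(graph: Graph[Int, Int], tol: Double = 0.001): Map[VertexId, Double] =
    graph.pageRank(tol).vertices.collect().toMap

  // Number of triangles passing through each vertex.
  def triangles(graph: Graph[Int, Int]): Map[VertexId, Int] =
    graph
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount()
      .vertices.collect().toMap
}
```

The results are collected to the driver as plain maps only to keep the example small; on a large graph the resulting vertex RDDs would normally be processed further in a distributed way.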
GraphX API / Functions over RDD
Now some code!
Same example but with Pregel
Questions & Answers
Andrea Iacono
The code is available at: https://github.com/andreaiacono/TalkGraphX
JetBrains prize draw
MILAN 20/21.11.2015
MILAN 20/21.11.2015 - Andrea Iacono
Leave your feedback on Joind.in!
https://m.joind.in/event/codemotion-milan-2015