Graphs are everywhere! Distributed graph computing with Spark GraphX

Graphs are everywhere!

Distributed graph computing with Spark GraphX

Andrea Iacono

Agenda:

Graph definitions and usages

GraphX introduction

Pregel

Code examples

The main focus will be the programming model

The code is available at:https://github.com/andreaiacono/TalkGraphX

Questions to the audience:

- Who knows what a graph is?
- Who has ever used one?
- Who knows the most used algorithms (BFS, DFS, Dijkstra)?
- Who knows Scala?

A graph is a set of vertices and edges that connect them:

Graphs are used for modeling very different domains.

Edge

Vertex

Vertices and edges
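As a minimal sketch of the definition above, a graph can be held as an adjacency map built from a list of edges. This is plain Python for illustration, not GraphX code; the function name `build_adjacency` and the sample vertex labels are assumptions.

```python
from collections import defaultdict

def build_adjacency(edges, directed=False):
    """Map each vertex to the set of vertices it has an edge to."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst]  # touch the key so dst is listed even without outgoing edges
        if not directed:
            adjacency[dst].add(src)
    return dict(adjacency)

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
graph = build_adjacency(edges)
print(sorted(graph["c"]))  # the neighbours of vertex c
```

An adjacency map makes "who are my neighbours?" a constant-time lookup, which is the question most graph algorithms ask.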

Networks

Triangle counting is used for clustering. Commercial interest: targeted offers to groups that share the same interests.
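Triangle counting can be sketched on a single machine as follows. This is an illustrative plain-Python version, not the distributed GraphX implementation; the name `count_triangles` is an assumption, and edges are treated as undirected.

```python
from itertools import combinations

def count_triangles(edges):
    """Count the triangles in an undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    found = 0
    for neighbours in adjacency.values():
        # Two neighbours of the same vertex that are also linked close a triangle.
        for a, b in combinations(neighbours, 2):
            if b in adjacency[a]:
                found += 1
    return found // 3  # each triangle is seen once from each of its three corners

friends = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]
print(count_triangles(friends))
```

Vertices that take part in many triangles sit in tightly knit groups, which is why the count is a common clustering signal.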

Routing

Vertices = intersections. Edges = roads. Shortest path algorithm (Dijkstra), where edges carry several weights: typically distance, traffic, toll payment, etc.
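A minimal sketch of Dijkstra's algorithm on a small road network where each edge carries more than one weight. The `roads` data, the weight names `distance` and `toll`, and the `weight` parameter are all assumptions made for illustration.

```python
import heapq

def dijkstra(adjacency, source, weight="distance"):
    """Return the cheapest known cost from source to every reachable vertex."""
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, vertex = heapq.heappop(queue)
        if cost > best.get(vertex, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbour, weights in adjacency.get(vertex, {}).items():
            candidate = cost + weights[weight]
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best

# Intersections are vertices, roads are directed edges with two weights each.
roads = {
    "A": {"B": {"distance": 4, "toll": 0}, "C": {"distance": 1, "toll": 2}},
    "B": {"D": {"distance": 1, "toll": 0}},
    "C": {"B": {"distance": 2, "toll": 0}, "D": {"distance": 6, "toll": 0}},
    "D": {},
}
print(dijkstra(roads, "A"))          # minimise distance
print(dijkstra(roads, "A", "toll"))  # minimise tolls instead
```

Choosing which weight to minimise changes the answer: the shortest route to B goes through C, while the toll-free route goes there directly.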

Page Rank

Pages = vertices. Edges = incoming links. Every outgoing edge carries a weight tied to that of its source vertex; the greater the sum of the values of the incoming edges, the greater the weight of the vertex. The algorithm is iterative.
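The iterative idea can be sketched as follows. This is a simplified single-machine PageRank, not GraphX's implementation; the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    vertices = set(links) | {t for targets in links.values() for t in targets}
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        incoming = {v: 0.0 for v in vertices}
        dangling = 0.0
        for v in vertices:
            targets = links.get(v, [])
            if targets:
                share = rank[v] / len(targets)  # weight carried by each outgoing edge
                for t in targets:
                    incoming[t] += share
            else:
                dangling += rank[v]  # pages without links spread their rank evenly
        rank = {
            v: (1 - damping) / len(vertices)
            + damping * (incoming[v] + dangling / len(vertices))
            for v in vertices
        }
    return rank

web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # the page with the highest rank
```

Page `a` receives links from both `c` and `d`, so it ends up with the highest rank, while `d`, which nobody links to, ends up with the lowest.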

Artificial Intelligence / Minimum spanning tree

Definitions

Undirected

Directed

Directed / undirected

Definitions

Connected

Disconnected

Connected / Disconnected
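Whether an undirected graph is connected can be checked with a breadth-first search from any vertex. This is a small sketch; the name `is_connected` is an assumption.

```python
from collections import deque

def is_connected(vertices, edges):
    """True if every vertex can be reached from any other (undirected graph)."""
    vertices = set(vertices)
    if not vertices:
        return True
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == vertices  # connected only if the search reached everything

print(is_connected("abcd", [("a", "b"), ("b", "c"), ("c", "d")]))
print(is_connected("abcd", [("a", "b"), ("c", "d")]))
```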

Definitions

K5

K2,3

Complete

Bipartite (and complete)

K is the standard notation for this kind of graph. A bipartite graph is useful for e-commerce, where you have a set of user nodes, each of which can buy any of the product nodes.
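The two graphs named above can be generated in a few lines, which also makes their edge counts easy to verify. The function names and the `user` / `product` labels are assumptions chosen to match the e-commerce example.

```python
from itertools import combinations, product

def complete_graph(n):
    """K_n: every pair of distinct vertices is joined by an edge."""
    return list(combinations(range(n), 2))

def complete_bipartite_graph(m, n):
    """K_{m,n}: every vertex on one side is joined to every vertex on the other."""
    users = [("user", i) for i in range(m)]
    products = [("product", j) for j in range(n)]
    return list(product(users, products))

print(len(complete_graph(5)))               # edges in K5
print(len(complete_bipartite_graph(2, 3)))  # edges in K2,3
```

K_n has n(n-1)/2 edges and K_{m,n} has m·n edges, so K5 has 10 and K2,3 has 6.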

Definitions

Cyclic

Acyclic

Cyclic / Acyclic (i.e. without cycles)
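For a directed graph, the cyclic / acyclic distinction can be tested with a depth-first search that marks the vertices currently on the search path. This is a recursive sketch suitable for small graphs; the name `has_cycle` is an assumption.

```python
def has_cycle(adjacency):
    """adjacency maps each vertex to the vertices its outgoing edges point to."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {v: WHITE for v in adjacency}

    def visit(vertex):
        colour[vertex] = GREY
        for neighbour in adjacency.get(vertex, ()):
            state = colour.get(neighbour, WHITE)
            if state == GREY:
                return True  # edge back to a vertex on the current path
            if state == WHITE and visit(neighbour):
                return True
        colour[vertex] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in adjacency)

print(has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
print(has_cycle({"a": ["b", "c"], "b": ["c"], "c": []}))
```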

Definitions

Multigraph

Pseudograph

Multigraph: several edges may share the same source and the same destination. Pseudograph: an edge may have the same vertex as both source and destination.
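Both properties are easy to detect from an edge list. This is a small sketch that treats edges as directed (source, destination) pairs; the name `classify` is an assumption.

```python
from collections import Counter

def classify(edges):
    """Return (has_parallel_edges, has_self_loops) for a list of (src, dst) pairs."""
    counts = Counter(edges)
    has_parallel_edges = any(count > 1 for count in counts.values())
    has_self_loops = any(src == dst for src, dst in edges)
    return has_parallel_edges, has_self_loops

print(classify([(1, 2), (1, 2), (2, 3)]))  # parallel edges: a multigraph
print(classify([(1, 2), (2, 2)]))          # a self-loop: a pseudograph
```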

Definitions

An undirected acyclic connected graph is a tree!

When I said that graphs are everywhere, this is the main reason why!
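The tree definition can be checked mechanically: an undirected graph with V vertices is a tree when it has exactly V - 1 edges and none of them closes a cycle. The sketch below uses a union-find structure; the name `is_tree` is an assumption.

```python
def is_tree(vertices, edges):
    """True if the undirected graph is connected and acyclic."""
    vertices = list(vertices)
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # both ends already connected: this edge closes a cycle
        parent[root_u] = root_v
    return True  # V - 1 edges without cycles imply the graph is connected

print(is_tree("abcd", [("a", "b"), ("a", "c"), ("c", "d")]))
print(is_tree("abcd", [("a", "b"), ("b", "c"), ("c", "a")]))
```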

What's wrong with MapReduce?

Every run of MapReduce reads from disk (e.g. HDFS) the initial data, computes the results and then stores them on disk; since most algorithms on graphs are iterative, this means that for every iteration the whole data must be read and written from/to disk.

It's better to use a distributed dataflow framework

Here we are talking about very large graphs, which do not fit in the RAM of a single machine.

GraphX is a graph processing system built on top of Apache Spark

Graph processing systems represent graph structured data as a property graph, which associates user-defined properties with each vertex and edge.

The Spark storage abstraction called Resilient Distributed Datasets (RDDs) enables applications to keep data in memory, which is essential for iterative graph algorithms.

RDDs permit user-defined data partitioning, and the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement. This is essential for encoding partitioned graphs.

Excerpt from "GraphX: Graph Processing in a Distributed Dataflow Framework": https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf

The graph represented is a multi-pseudograph. (Internal representation?)
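To make the property graph idea concrete, here is a minimal sketch in plain Python: vertices and edges each carry a user-defined property, and parallel edges are allowed. The `Edge` class, the sample data, and the `triplets` helper are assumptions for illustration, loosely inspired by the triplet view that GraphX exposes; this is not the GraphX API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: int   # id of the source vertex
    dst: int   # id of the destination vertex
    attr: str  # user-defined edge property

# Vertex id -> user-defined vertex property.
vertices = {1: "Alice", 2: "Bob", 3: "Carol"}

# Two edges share the same source and destination, as a multigraph permits.
edges = [Edge(1, 2, "follows"), Edge(2, 3, "follows"), Edge(1, 2, "likes")]

def triplets(vertices, edges):
    """Join every edge with the properties of its two endpoints."""
    for edge in edges:
        yield vertices[edge.src], edge.attr, vertices[edge.dst]

for source, relation, target in triplets(vertices, edges):
    print(source, relation, target)
```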

GraphX / Spark software stack

(image source: Spark site)

Unlike Spark, which offers APIs in Scala, Java and Python, GraphX offers its API only in Scala; support for the other languages should, however, become available in the near future.

Graph partitioning

Issues with distributed computing

(image source: GraphX Programming Guide)

Graph Databases

Storage

Query Language

Transactions

Examples: Neo4j, OrientDB, Titan

Graph Processing Systems

APIs for traversing and processing

Better performance (in-memory data)

Examples: GraphX, Giraph, GraphLab

Gremlin graph query language (TinkerPop): Gremlin is a DSL for traversing property graphs.

Neo4j uses (proprietary) Cypher as its native query language.

Titan is a graph database that supports the following storage backends:
- Cassandra (column)
- HBase (column)
- BerkeleyDB (key-value)

Pregel

Pregel is a computational model designed by Google (https://kowshik.github.io/JPregel/pregel_paper.pdf). A computation consists of a sequence of supersteps that runs until termination. In each superstep, every vertex can:

modify its state or that of its outgoing edges

receive the messages sent to it during the previous superstep

send messages to other vertices (they will be received in the next superstep)

vote to halt

When a vertex votes to halt, it becomes inactive; if it receives a message in a later superstep, the framework wakes it up and sets its state back to active. The computation stops when all the vertices have voted to halt and no messages are in transit; alternatively, a maximum number of iterations can be set.

Edges don't have any computation.

When writing algorithms, you have to think like a vertex.

Pregel sample

Image source: Pregel paper

Suppose we have a value for each vertex and want to find the maximum value in the whole graph. With this computational model, the idea is to propagate information among the vertices. In each superstep, every vertex that has received a value higher than its own sends it to all its neighbours. When no vertex changes any more, the algorithm terminates.
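The maximum-value example can be simulated on a single machine to see the supersteps at work. This is a plain-Python sketch of the model, not Pregel or GraphX code; the function name `pregel_max`, the `max_supersteps` parameter, and the four-vertex chain are assumptions.

```python
def pregel_max(values, neighbours, max_supersteps=100):
    """Propagate the maximum vertex value; returns (final values, supersteps run)."""
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbours[v]:
            inbox[n].append(values[v])
    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming message: the vertex stays halted
            highest = max(messages)
            if highest > values[v]:
                values[v] = highest  # the state changed, so tell the neighbours
                for n in neighbours[v]:
                    outbox[n].append(highest)
            # otherwise the vertex votes to halt by sending nothing
        inbox = outbox
        supersteps += 1
    return values, supersteps

# A chain a - b - c - d with one value per vertex.
values = {"a": 3, "b": 6, "c": 2, "d": 1}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = pregel_max(values, neighbours)
print(final_values, steps)
```

On this chain every vertex ends with the value 6 after four supersteps: the maximum needs two hops to reach `d`, and one more superstep passes before all vertices fall silent.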

GraphX implementation of Pregel

GraphX uses three functions for implementing Pregel:

vprog: the vertex program computed for each vertex that receives the incoming message and computes a new vertex value

sendMsg: the function used for sending messages to other vertices

mergeMsg: a function that takes two incoming messages and merges them into a single message

Unlike Google's Pregel, the GraphX implementation of Pregel:

leaves message construction out of the vertex program, which allows a more efficient distributed execution

permits access to the attributes of both vertices of an edge while building the messages

constrains message sending to the graph structure (only to neighbours)

mergeMsg must be commutative and associative. Commutative: 2 + 3 = 3 + 2. Associative: (2 + 3) + 4 = 2 + (3 + 4).
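To show how the three functions fit together, here is a single-machine sketch in plain Python that computes single-source shortest paths. It only mirrors the roles of vprog, sendMsg and mergeMsg; the function signatures, the `pregel` driver and the sample graph are assumptions, not the GraphX Scala API.

```python
from functools import reduce

INF = float("inf")

def vprog(vertex_id, distance, message):
    """Vertex program: keep the shorter of the current and the received distance."""
    return min(distance, message)

def send_msg(src_id, src_dist, dst_id, dst_dist, weight):
    """Runs per edge and sees both endpoints; offers the destination a shorter path."""
    if src_dist + weight < dst_dist:
        return [(dst_id, src_dist + weight)]
    return []

def merge_msg(a, b):
    """Combine two messages for the same vertex; min is commutative and associative."""
    return min(a, b)

def pregel(vertices, edges, initial_msg, max_iterations=20):
    # Superstep 0: every vertex runs vprog on the initial message.
    state = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    active = set(state)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, weight in edges:
            if src not in active and dst not in active:
                continue  # skip edges whose endpoints received nothing last round
            for target, msg in send_msg(src, state[src], dst, state[dst], weight):
                inbox.setdefault(target, []).append(msg)
        if not inbox:
            break  # no messages in flight: the computation is over
        merged = {vid: reduce(merge_msg, msgs) for vid, msgs in inbox.items()}
        for vid, msg in merged.items():
            state[vid] = vprog(vid, state[vid], msg)
        active = set(merged)
    return state

# Vertex 1 is the source (distance 0); every other vertex starts at infinity.
vertices = {1: 0.0, 2: INF, 3: INF, 4: INF}
edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
distances = pregel(vertices, edges, INF)
print(distances)
```

Because `merge_msg` is commutative and associative, the messages for a vertex can be combined in any order and on any partition before they reach the vertex program, which is what makes the merge safe to distribute.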

GraphX Pregel communication diagram

When to use GraphX

GraphX is well suited for algorithms that:

respect the neighborhood structure

GraphX is NOT well suited for algorithms that:

need interaction among distant vertices

change the structure of the graph

Algorithms out of the box (as of Spark v1.5.1):

- Connected Components
- Label Propagation
- PageRank
- SVD++
- Shortest Paths
- Strongly Connected Components
- Triangle Count

GraphX API / Functions over RDD

Now some code!

Same example but with Pregel

Questions & Answers

Andrea Iacono

The code is available at: https://github.com/andreaiacono/TalkGraphX

JetBrains prize draw

MILAN 20/21.11.2015

MILAN 20/21.11.2015 - Andrea Iacono


Leave your feedback on Joind.in!

https://m.joind.in/event/codemotion-milan-2015
