Gelly in Apache Flink Bay Area Meetup

30
Vasia Kalavri Apache Flink PMC, PhD student @KTH @vkalavri - [email protected] August 27, 2015 Gelly: Large-Scale Graph Processing with Apache Flink

Transcript of Gelly in Apache Flink Bay Area Meetup

Page 1: Gelly in Apache Flink Bay Area Meetup

Vasia KalavriApache Flink PMC, PhD student @KTH

@vkalavri - [email protected] 27, 2015

Gelly: Large-Scale Graph Processing

with Apache Flink

Page 2: Gelly in Apache Flink Bay Area Meetup

Overview

- Why Graph Processing with Flink?

- Gelly: Flink’s Graph Processing API

- Iterative Graph Processing in Gelly

- What’s next, Gelly?

2

Page 3: Gelly in Apache Flink Bay Area Meetup

Why Graph Processing with Flink?

Page 4: Gelly in Apache Flink Bay Area Meetup

4

Page 5: Gelly in Apache Flink Bay Area Meetup

A data analysis pipeline

load cleancreate graph

analyze graph

clean transformload result

clean transformload

5

Page 6: Gelly in Apache Flink Bay Area Meetup

A more user-friendly pipeline

load

load result

load

6

Page 7: Gelly in Apache Flink Bay Area Meetup

General-purpose or specialized?

- fast application development and deployment

- easier maintenance- non-intuitive APIs

- time-consuming- use, configure and

integrate different systems

- hard to maintain- rich APIs and features

general-purpose specialized

what about performance?

7

Page 8: Gelly in Apache Flink Bay Area Meetup

Native Iterations

● the runtime is aware of the iterative execution● no scheduling overhead between iterations● caching and state maintenance are handled automatically

Caching Loop-invariant DataPushing work“out of the loop”

Maintain state as index

8

Page 9: Gelly in Apache Flink Bay Area Meetup

Flink Iteration Operators

Iterate IterateDelta

9

Input

Iterative Update Function

Result

Rep

lace

Workset

IterativeUpdate Function

Result

Solution Set

State

Page 10: Gelly in Apache Flink Bay Area Meetup

Many iterative algorithms exhibit sparse computational dependencies and asymmetric convergence

Delta Iterations

10

Page 11: Gelly in Apache Flink Bay Area Meetup

Example: Connected Components

11

Only vertices that change their value are in the workset for the next iteration.

Page 12: Gelly in Apache Flink Bay Area Meetup

Beyond Iterations

12

Performance & Scalability● own memory management● own serialization framework● operate on binary data when possibleAutomatic Optimizations● choose best execution strategy● cache invariant data

Page 13: Gelly in Apache Flink Bay Area Meetup

Gellythe Flink Graph API

Page 14: Gelly in Apache Flink Bay Area Meetup

● Java & Scala Graph APIs on top of Flink● Ships already with Flink 0.9 (Beta)● Can be seamlessly mixed with the DataSet

Flink API→ easily implement applications that use both record-based and graph-based analysis

Meet Gelly

14

Page 15: Gelly in Apache Flink Bay Area Meetup

Gelly in the Flink stack

15

Scala API(batch and streaming)

Java API(batch and streaming)

ML library Gelly

Runtime and Execution Environments

Data storage

Table API Python API

Transformations and Utilities

Iterative Graph Processing

Graph Library

Page 16: Gelly in Apache Flink Bay Area Meetup

Hello, Gelly!

// create a graph from a Dataset of Vertices and a DataSet of Edges

Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);

16

// create a graph from a Collection of Edges and initialize the Vertex valuesGraph<String, Long, Double> graph = Graph.fromCollection(edges,

new MapFunction<String, Long>() {

public Long map(String vertexId) { return 1l; }

}, env);

vertex ID type

vertex value type

edge value type

edges is a Java collection

assigned vertex value

Page 17: Gelly in Apache Flink Bay Area Meetup

● Graph Properties○ getVertexIds○ getEdgeIds○ numberOfVertices○ numberOfEdges○ getDegrees○ isWeaklyConnected○ ...

Available Methods

● Transformations○ map, filter, join○ subgraph, union,

difference○ reverse, undirected○ getTriplets

● Mutations○ add vertex/edge○ remove vertex/edge

17

Page 18: Gelly in Apache Flink Bay Area Meetup

Example: mapVertices

18

// increment each vertex value by one

Graph<Long, Long, Long> updatedGraph = graph.mapVertices(

new MapFunction<Vertex<Long, Long>, Long>() {

public Long map(Vertex<Long, Long> value) {

return value.getValue() + 1;

}

});

4

28

5

53

17

4

5

Page 19: Gelly in Apache Flink Bay Area Meetup

Example: subGraph

19

Graph<Long, Long, Long> subgraph = graph.subgraph(

new FilterFunction<Vertex<Long, Long>>() {

public boolean filter(Vertex<Long, Long> vertex) {

// keep only vertices with positive values

return (vertex.getValue() > 0);

}

},

new FilterFunction<Edge<Long, Long>>() {

public boolean filter(Edge<Long, Long> edge) {

// keep only edges with negative values

return (edge.getValue() < 0);

}

})

Page 20: Gelly in Apache Flink Bay Area Meetup

- Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

Neighborhood Methods

3

47

4

4

graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);

3

97

4

5

20

Page 21: Gelly in Apache Flink Bay Area Meetup

Iterative Graph Processing

Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations● vertex-centric (similar to Pregel, Giraph)● gather-sum-apply (similar to PowerGraph)

21

Page 22: Gelly in Apache Flink Bay Area Meetup

MessagingFunction: what messages to send to other vertices

VertexUpdateFunction: how to update a vertex value, based on the received messages

The workset contains only active vertices (received a msg in the previous superstep)

Vertex-centric Iterations

22

Page 23: Gelly in Apache Flink Bay Area Meetup

Vertex-centric SSSPshortestPaths = graph.runVertexCentricIteration(

new DistanceUpdater(), new DistanceMessenger()).getVertices();

DistanceUpdater: VertexUpdateFunction DistanceMessenger: MessagingFunction

23

sendMessages(K key, Double newDist) {

for (Edge edge : getOutgoingEdges()) {

sendMessageTo(edge.getTarget(),

newDist + edge.getValue());

}

updateVertex(Vertex<K, Double> vertex, MessageIterator msgs) {

Double minDist = Double.MAX_VALUE; for (double msg : msgs) { if (msg < minDist) minDist = msg; } if (vertex.getValue() > minDist) setNewVertexValue(minDist);}

Page 24: Gelly in Apache Flink Bay Area Meetup

Gather: how to transform each of the edges and neighbors of each vertex

Sum: how to aggregate the partial values of Gather to a single value

Apply: how to update the vertex value, based on the new aggregate and the current value

The workset contains the active vertices (user-defined activation)

Gather-Sum-Apply Iterations

24

Page 25: Gelly in Apache Flink Bay Area Meetup

Gather-Sum-Apply SSSPshortestPaths = graph.runGatherSumApplyIteration(new CalculateDistances(),

new ChooseMinDistance(), new UpdateDistance()).getVertices();

CalculateDistances: Gather

25

gather(Neighbor<Double, Double> neighbor) {return neighbor.getNeighborValue() + neighbor.getEdgeValue();

}

ChooseMinDistance: Sum

ChooseMinDistance: Apply

sum(Double newValue, Double currentValue) {return Math.min(newValue, currentValue);

}

apply(Double newDistance, Double oldDistance) {if (newDistance < oldDistance) { setResult(newDistance); }

}

Page 26: Gelly in Apache Flink Bay Area Meetup

Iteration Configuration

● Parallelism● Aggregators● Broadcast Variables● Messaging Direction (IN, OUT, ALL)● Access vertex degrees and number of

vertices

26

Page 27: Gelly in Apache Flink Bay Area Meetup

Vertex-Centric or GSA?

27

vertex-centric GSA

when message creation is independent & expensive (parallelized in Gather)

when the graph is skewed (~ 1.5x faster)

if update is associative-commutative

when all messages are needed for the update

when a vertex needs to send messages to non-neighbors

Page 28: Gelly in Apache Flink Bay Area Meetup

● PageRank*● Single Source Shortest Paths*● Label Propagation● Weakly Connected Components*● Community DetectionUpcoming: Triangle Count, HITS, Affinity Propagation, Graph Summarization graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor));

*: both vertex-centric and GSA implementations

Library of Algorithms

28

Page 29: Gelly in Apache Flink Bay Area Meetup

What’s next, Gelly?

● Integration with the Flink Streaming API● Specialized Operators for Skewed Graphs● Partitioning Methods● More graph processing models: neighborhood-centric,

partition-centric

Check our Roadmap!https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly

29

Page 30: Gelly in Apache Flink Bay Area Meetup

Feeling Gelly? Try it!

● Grab the Flink master: https://github.com/apache/flink● Read the Gelly programming guide: https://ci.apache.

org/projects/flink/flink-docs-master/libs/gelly_guide.html● Follow the Gelly School tutorials (Beta): http://gellyschool.com● Read our blog post: https://flink.apache.

org/news/2015/08/24/introducing-flink-gelly.html● Come to Flink Forward (October 12-13, Berlin): http://flink-

forward.org/

30