Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.


Page 1: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Graph Algorithms with MapReduce, Chapter 5

Thanks to Jimmy Lin slides

Page 2: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Topics

• Introduction to graph algorithms and graph representations

• Single Source Shortest Path (SSSP) problem
– Refresher: Dijkstra’s algorithm
– Breadth-First Search with MapReduce

• PageRank

Page 3: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

What’s a graph?

• G = (V,E), where
– V represents the set of vertices (nodes)
– E represents the set of edges (links)
– Both vertices and edges may contain additional information

• Different types of graphs:
– Directed vs. undirected edges
– Presence or absence of cycles

• Graphs are everywhere:
– Hyperlink structure of the Web
– Physical structure of computers on the Internet
– Interstate highway system
– Social networks

Page 4: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Some Graph Problems

• Finding shortest paths
– Routing Internet traffic and UPS trucks
• Finding minimum spanning trees
– Telco laying down fiber
• Finding max flow
– Airline scheduling
• Identifying “special” nodes and communities
– Breaking up terrorist cells, spread of avian flu
• Bipartite matching
– Monster.com, Match.com
• And of course... PageRank

Page 5: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Graphs and MapReduce

• Graph algorithms typically involve:
– Performing computation at each node
– Processing node-specific data, edge-specific data, and link structure
– Traversing the graph in some manner

• Key questions:
– How do you represent graph data in MapReduce?
– How do you traverse a graph in MapReduce?

Page 6: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Representing Graphs

• G = (V, E)
– A poor representation for computational purposes

• Two common representations
– Adjacency matrix
– Adjacency list

Page 7: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Adjacency Matrices

Represent a graph as an n x n square matrix M
– n = |V|
– M_ij = 1 means a link from node i to j

    1 2 3 4
1   0 1 0 1
2   1 0 1 1
3   1 0 0 0
4   1 0 1 0

[Figure: the corresponding directed graph on nodes 1, 2, 3, 4]

Page 8: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Adjacency Matrices: Critique

• Advantages:
– Naturally encapsulates iteration over nodes
– Rows correspond to outlinks, columns to inlinks
• Disadvantages:
– Lots of zeros for sparse matrices
– Lots of wasted space

Page 9: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Adjacency Lists

Take adjacency matrices… and throw away all the zeros

    1 2 3 4
1   0 1 0 1
2   1 0 1 1
3   1 0 0 0
4   1 0 1 0

1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
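
The "throw away all the zeros" step can be sketched in a few lines of Python. This is only an illustration using the matrix from the slide; the helper name `matrix_to_adjacency_list` is made up for this example.

```python
# Adjacency matrix from the slide: M[i][j] == 1 means a link from node i+1 to node j+1.
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def matrix_to_adjacency_list(matrix):
    """Keep only the non-zero entries of each row (hypothetical helper)."""
    return {
        i + 1: [j + 1 for j, cell in enumerate(row) if cell]
        for i, row in enumerate(matrix)
    }


adjacency = matrix_to_adjacency_list(M)
for node, outlinks in adjacency.items():
    print(f"{node}: {', '.join(map(str, outlinks))}")
```

Each row of the matrix becomes one list of outlinks, so the storage cost drops from n x n cells to one entry per edge.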

Page 10: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Adjacency Lists: Critique

• Advantages:
– Much more compact representation
– Easy to compute over outlinks
– Graph structure can be broken up and distributed

• Disadvantages:
– Much more difficult to compute over inlinks
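
To see why inlinks are harder, note that finding who links to a node requires a pass over every list. A minimal sketch, assuming the four-node adjacency list from the earlier slide and a hypothetical helper named `invert`:

```python
from collections import defaultdict

# Outlinks per node, as in the adjacency list slide.
adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def invert(adjacency):
    """Build inlink lists by scanning every outlink list once (hypothetical helper)."""
    inlinks = defaultdict(list)
    for source, targets in adjacency.items():
        for target in targets:
            inlinks[target].append(source)
    return dict(inlinks)


inlinks = invert(adjacency)
print(inlinks[1])  # every node that links to node 1
```

Reading the outlinks of one node touches a single list, while reading its inlinks needs the whole graph to be scanned or inverted first.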

Page 11: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Single Source Shortest Path

• Problem: find shortest path from a source node to one or more target nodes

• “Graph search algorithm that solves the single-source shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree” (Wikipedia)

• First, a refresher: Dijkstra’s algorithm
– Single machine

Page 12: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph. Distance label shown: 0 (source). Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 13: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance label shown: 0 (source). Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 14: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance labels shown: 0, 10, 5. Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 15: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance labels shown: 0, 8, 5, 14, 7. Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 16: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance labels shown: 0, 8, 5, 13, 7. Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 17: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance labels shown: 0, 8, 5, 9, 7. Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 18: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Final distance labels shown: 0, 8, 5, 9, 7. Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR
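
For reference, a single-machine Dijkstra can be sketched as below. The figure itself does not survive in this transcript, so the edge list is an assumption: it is the textbook graph as commonly drawn, with the same ten edge weights, but labeled s, t, x, y, z rather than n0 to n4. The priority-queue formulation with `heapq` is one common way to write the algorithm, not necessarily the one on the slides.

```python
import heapq

# Assumed edge list (labels s, t, x, y, z are illustrative, not from the slides).
graph = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "y": {"t": 3, "x": 9, "z": 2},
    "x": {"z": 4},
    "z": {"x": 6, "s": 7},
}


def dijkstra(graph, source):
    """Shortest distances from source, assuming nonnegative edge weights."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry; a shorter distance was already settled
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


distances = dijkstra(graph, "s")
print(distances)
```

On this assumed graph the final distances are 0, 8, 5, 9 and 7, the same set of values shown in the last frame of the example.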

Page 19: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Single Source Shortest Path

• Problem: find shortest path from a source node to one or more target nodes

• Single processor machine: Dijkstra’s Algorithm
• MapReduce: parallel Breadth-First Search (BFS)
– How to do it? First simplify the problem!

Page 20: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Finding the Shortest Path

• First, consider equal edge weights
• Solution to the problem can be defined inductively
• Here’s the intuition:
– DistanceTo(startNode) = 0
– For all nodes n directly reachable from startNode, DistanceTo(n) = 1
– For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)

Page 21: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Finding the Shortest Path

• This strategy advances the “known frontier” by one hop
– Subsequent iterations include more reachable nodes as frontier advances
– Multiple iterations are needed to explore entire graph
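
The frontier idea can be sketched on a single machine before moving to MapReduce. The function below is an illustrative assumption (its name and the small graph are not from the slides); each pass of the `while` loop corresponds to one hop of the frontier.

```python
def bfs_distances(adjacency, start):
    """Hop counts from start, advancing the known frontier one hop per pass."""
    distance = {start: 0}
    frontier = {start}
    hops = 0
    while frontier:
        hops += 1
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                if neighbor not in distance:  # first visit is the shortest with equal weights
                    distance[neighbor] = hops
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return distance


# Small example graph (the four-node adjacency list used earlier).
adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
distances = bfs_distances(adjacency, 1)
print(distances)
```

With equal edge weights a node's distance is fixed the first time the frontier reaches it, which is why no node is ever revisited here.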

Page 22: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Visualizing Parallel BFS

[Figure: graph whose nodes are labeled 1 to 4, marking the hop at which the frontier reaches them]

Page 23: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Termination

• Does the algorithm ever terminate?
– Eventually, all nodes will be discovered, all edges will be considered (in a connected graph)
• When do we stop?
– When distances at every node no longer change at next frontier

Page 24: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Next Step to Solving

• Next
– No longer assume the distance along each edge is 1

Page 25: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Weighted Edges

• Now add positive weights to the edges
• Simple change: points-to list in map task includes a weight w for each pointed-to node
– emit (p, D + w_p) instead of (p, D + 1) for each node p

Page 26: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance label shown: 0 (source). Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 27: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Multiple Iterations Needed

• This MapReduce task advances the “known frontier” by one hop
– Subsequent iterations include more reachable nodes as frontier advances
– Multiple iterations are needed to explore entire graph
– Each iteration is a MapReduce task
– Final output of one MapReduce task is input to the next iteration
– Feed output back into the same MapReduce task

Page 28: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Assume d = 1

Page 29: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

From Intuition to Algorithm

• What info does the map task require?
– A map task receives (k,v)
  • Key: node n
  • Value:
    – D (distance from start)
    – points-to (adjacency list of nodes reachable from n)

• What does the map task do?
– Computes distances
– Emits (p, D + w_p) for each p ∈ points-to: makes sure the current distance is carried into the reducer
– Emits graph structure of node n (n, struct), which contains the current shortest distance to node n

Page 30: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

From Intuition to Algorithm

• What info does the reduce task require?
– The reduce task gathers possible distances to a given p
• What does the reduce task do?
– Selects the minimum one

Page 31: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Algorithm

• Assume adjacency list has information about edges and distances!!

Page 32: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

class Mapper
    method MAP(nid n, node N)
        d ← N.Distance
        Emit(nid n, N)                          // Pass along graph structure
        for all nodeid m ∈ N.AdjacencyList do
            Emit(nid m, d + w)                  // Emit distances to reachable nodes (w = weight of edge n → m)

class Reducer
    method REDUCE(nid m, [d1, d2, ...])
        dmin ← ∞
        M ← ∅
        for all d ∈ [d1, d2, ...] do
            if IsNode(d) then
                M ← d                           // Recover graph structure
            else if d < dmin then               // Look for shorter distance
                dmin ← d
        if M.Distance > dmin then               // Update shortest distance
            M.Distance ← dmin
            Increment counter for driver
        Emit(nid m, node M)
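
One way to get a feel for the pseudocode is to simulate a single iteration in plain Python. This is a sketch under stated assumptions, not Hadoop code: the shuffle is imitated with a dictionary, nodes are plain dicts with hypothetical `distance` and `adjacency` fields, `isinstance(value, dict)` stands in for IsNode, and the three-node graph is invented for illustration.

```python
from collections import defaultdict

INF = float("inf")


def map_node(nid, node):
    """Mapper: pass the node along, then emit a tentative distance per outlink."""
    yield nid, node  # graph structure
    if node["distance"] < INF:  # assumption: unreached nodes emit nothing, which only saves traffic
        for target, weight in node["adjacency"].items():
            yield target, node["distance"] + weight


def reduce_node(nid, values):
    """Reducer: recover the node, keep the minimum distance, report whether it changed."""
    d_min, node = INF, None
    for value in values:
        if isinstance(value, dict):  # plays the role of IsNode
            node = value
        elif value < d_min:
            d_min = value
    updated = 0
    if d_min < node["distance"]:
        node = {**node, "distance": d_min}
        updated = 1  # would be a counter increment in a real job
    return nid, node, updated


def run_iteration(graph):
    """One simulated MapReduce job: map, group by key (shuffle), reduce."""
    grouped = defaultdict(list)
    for nid, node in graph.items():
        for key, value in map_node(nid, node):
            grouped[key].append(value)
    new_graph, counter = {}, 0
    for nid, values in grouped.items():
        _, node, updated = reduce_node(nid, values)
        new_graph[nid] = node
        counter += updated
    return new_graph, counter


# Hypothetical graph: a -> b (10), a -> c (5), c -> b (3); source is a.
graph = {
    "a": {"distance": 0, "adjacency": {"b": 10, "c": 5}},
    "b": {"distance": INF, "adjacency": {}},
    "c": {"distance": INF, "adjacency": {"b": 3}},
}
after_one, changed_one = run_iteration(graph)
after_two, changed_two = run_iteration(after_one)
print(after_one["b"]["distance"], after_two["b"]["distance"])
```

After the first iteration b is reached directly at distance 10; the second iteration finds the cheaper route through c, which is the weighted-edge behavior the slides describe.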

Page 33: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Map Algorithm

• Line 2. N holds the node’s adjacency list and its current (shortest) distance

• Line 4. Emits a (k,v) in which v is the current node’s info; there is only one of these per node, because each node is assumed to be assigned to one mapper

• Line 6. Emits a different type of (k,v), which carries only the distance to a neighbor, not an adjacency list

• Shuffle sends all (k,v) with the same k to the same reducer

Page 34: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Reduce Algorithm

• Line 2. Will have different types of (k,v) as input
• Line 5. Determine the type of each (k,v): is the value an adjacency list (Node structure)?
• Line 6. If v is not an adjacency list (Node structure) then it is a distance; find the shortest
• Only 1 IsNode as far as I can tell
• Line 9. Determine if there is a new shortest distance
• Line 10. Update current shortest; increment a counter used to determine whether to stop

Page 35: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Shortest path – one more thing

• Only finds shortest distances, not the shortest path

• Is this true?
– Do we have to use backpointers to retrace the shortest path?
– NO
– Emit paths along with distances; each node has its shortest path accessible at all times
• Most paths relatively short, uses little space

Page 36: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Weighted edges: Finds Minimum?

• Discover node r
• Discovered shortest D to p, and shortest D to r goes through p
• Maybe there is a path through q to r that is shorter, but the path lies outside the current search frontier
– Not true if D = 1, since a shortest path cannot lie outside the search frontier: it would be a longer path
• Have found shortest path within frontier
• Will discover shortest path as frontier expands
• With sufficient iterations, eventually discover shortest distance

Page 37: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance label shown: 0 (source). Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 38: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Termination

• Does this ever terminate?
– Yes! Eventually, no better distances will be found. When distances stay the same, we stop
– Checking of termination must occur outside of MapReduce
– Driver program submits MR job to iterate algorithm, checks whether termination condition is met
– Hadoop provides Counters, which the driver can read outside MapReduce
  • Driver determines after the reducers finish whether it is done
  • In shortest path, reducers count each change to a min distance and pass the count to the driver
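
The driver logic can be sketched as a plain loop. Everything here is an assumption for illustration: `relax_all_edges` is a stand-in for one MapReduce job (it is not a Hadoop API), the returned `updates` value plays the role of the counter, and the five-node chain is a made-up input.

```python
INF = float("inf")


def relax_all_edges(edges, dist):
    """Stand-in for one MapReduce job: returns new distances and the update counter."""
    best = {}
    for source, target, weight in edges:  # "map": one candidate per edge
        candidate = dist[source] + weight
        if candidate < best.get(target, INF):
            best[target] = candidate
    new_dist, updates = dict(dist), 0
    for target, candidate in best.items():  # "reduce": keep improvements only
        if candidate < dist[target]:
            new_dist[target] = candidate
            updates += 1
    return new_dist, updates


def driver(edges, nodes, source, max_iterations=100):
    """Resubmit the job until the counter reports that no distance changed."""
    dist = {node: INF for node in nodes}
    dist[source] = 0
    iterations = 0
    while iterations < max_iterations:
        dist, updates = relax_all_edges(edges, dist)
        iterations += 1
        if updates == 0:  # termination condition checked outside the job
            break
    return dist, iterations


# Hypothetical chain n1 -> n2 -> n3 -> n4 -> n5 with unit weights.
nodes = ["n1", "n2", "n3", "n4", "n5"]
edges = [("n1", "n2", 1), ("n2", "n3", 1), ("n3", "n4", 1), ("n4", "n5", 1)]
dist, iterations = driver(edges, nodes, "n1")
print(dist, iterations)
```

On a chain of five nodes the loop runs four iterations that each change one distance, plus a fifth that changes nothing and triggers the stop.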

Page 39: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Iterations

• How many iterations needed to compute shortest distance to all nodes?
– Diameter of graph, or greatest distance between any pair of nodes
– Small for many real-world problems: 6 degrees of separation
  • For global social network: 6 MapReduce iterations

Page 40: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Fig. 5.6 needs how many iterations for n1-n6?

Worst case?

need (#nodes – 1)

Page 41: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Comparison to Dijkstra

• Dijkstra’s algorithm is more efficient
– At any step it only pursues edges from the minimum-cost path inside the frontier
• MapReduce explores all paths in parallel
– Brute force: wastes time
– Divide and conquer
– Except at the search frontier, it repeats the same computations within the frontier
– Throw more hardware at the problem

Page 42: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

General Approach

• MapReduce is adept at manipulating graphs
– Store graphs as adjacency lists

• Graph algorithms with MapReduce:
– Each map task receives a node and its outlinks
– Map task computes some function of the link structure, emits value with target as the key
– Reduce task collects keys (target nodes) and aggregates

• Iterate multiple MapReduce cycles until some termination condition
– Remember to “pass” graph structure from one iteration to next

Page 43: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Another example – Random Walks Over the Web

• Model:
– User starts at a random Web page
– User randomly clicks on links, surfing from page to page (may also teleport to a completely different page)
• How frequently will a page be encountered during this surfing?
• This is PageRank
– Probability distribution over nodes in a graph, representing the likelihood that a random walk over the graph will arrive at a particular node

Page 44: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

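PageRank: Defined

Given page n with in-bound links L(n), where
– C(m) is the out-degree of m
– P(m) is the page rank of m
– α is the probability of a random jump
– |G| is the total number of nodes in the graph

\[ P(n) = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)} \]

[Figure: node n with in-bound links from pages m1, ..., mn]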

Page 45: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Computing PageRank

• Properties of PageRank
– Can be computed iteratively
– Effects at each iteration are local

• Sketch of algorithm:
– Start with seed (P_i) values
– Each page distributes (P_i) “credit” to all pages it links to
– Each target page adds up “credit” from multiple in-bound links to compute (P_{i+1})
– Iterate until values converge

Page 46: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Computing PageRank

• What does map do?• What does reduce do?

Page 47: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

PageRank MapReduce

• Fig. 5.7
• Begins with 5 nodes splitting 1.0 -> 0.2 each
• Each node must split its 0.2 among its outgoing links (map)
• Then add up all incoming values (reduce)
• Each iteration is one MapReduce job

Page 50: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

PageRank in MapReduce

Map: distribute PageRank “credit” to link targets

...

Reduce: gather up PageRank “credit” from multiple sources to compute new PageRank value

Iterate until convergence
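
The map and reduce roles can be simulated in a few lines of Python. This is a simplified sketch: it ignores random jumps and dangling nodes, keeps the graph in memory instead of passing it through the shuffle, and uses a hypothetical five-node graph (not necessarily the one in the figures) in which every node has at least one outlink.

```python
from collections import defaultdict

# Hypothetical five-node graph: node -> list of link targets.
graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


def map_page(nid, rank, outlinks):
    """Map: split this page's rank evenly among its link targets."""
    share = rank / len(outlinks)
    for target in outlinks:
        yield target, share


def reduce_page(nid, contributions):
    """Reduce: add up the credit arriving from all in-bound links."""
    return nid, sum(contributions)


def pagerank_iteration(graph, ranks):
    """One simulated MapReduce job over the whole graph."""
    grouped = defaultdict(list)
    for nid, outlinks in graph.items():
        for target, share in map_page(nid, ranks[nid], outlinks):
            grouped[target].append(share)
    return dict(reduce_page(nid, grouped[nid]) for nid in graph)


ranks = {nid: 1.0 / len(graph) for nid in graph}  # each of 5 nodes starts with 0.2
ranks = pagerank_iteration(graph, ranks)
print(ranks)
```

Because no node in this graph is dangling, the ranks still sum to 1 after the iteration; calling `pagerank_iteration` repeatedly corresponds to the "iterate until convergence" loop.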

Page 51: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Convergence to end PageRank

• Stop when few changes (some tolerance for precision errors) or reached fixed number of iterations
• Driver checks for convergence
• How many iterations needed for PageRank to converge, e.g. if 322 M edges?
– Fewer than expected
– 52 iterations

Page 52: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dangling nodes and random jumps

• Must redistribute mass lost at dangling nodes (no outgoing edges, so mass is lost)
– 3 approaches to determine missing mass
  • Count dangling nodes and multiply by constant
  • Emit special key, handle special key with logic
  • Write as side data, sum across all map tasks
– Next, redistribute missing mass m across all nodes
  • Compute final page rank p' where α is the random jump probability

\[ p' = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \left( \frac{m}{|G|} + p \right) \]

• Need 2 MapReduce jobs for one iteration: 1 to distribute mass across edges, the other to take care of lost mass
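
The second pass can be sketched as a pure function over the ranks produced by the first pass. The function name, the three-node example and the value α = 0.15 are assumptions chosen for illustration; only the formula comes from the slide.

```python
def redistribute(ranks, missing_mass, alpha):
    """Second pass: p' = alpha * (1/|G|) + (1 - alpha) * (m/|G| + p)."""
    size = len(ranks)
    return {
        nid: alpha * (1.0 / size) + (1.0 - alpha) * (missing_mass / size + p)
        for nid, p in ranks.items()
    }


# Hypothetical ranks after the first pass: 0.2 of the mass sat at a dangling node and was lost.
after_first_pass = {"a": 0.5, "b": 0.3, "c": 0.0}
missing = 1.0 - sum(after_first_pass.values())
final = redistribute(after_first_pass, missing, alpha=0.15)  # alpha value is an assumption
print(final, sum(final.values()))
```

Spreading the missing mass evenly before applying the random jump restores a total of 1, so the result is again a probability distribution.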

Page 53: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

PageRank

• Assume honest users
• No spider trap: an infinite chain of pages that all link to a single page to inflate its PageRank
• PageRank is only one of thousands of features used in ranking web pages

Page 54: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Issues with Graph processing

• No global data structures can be used
• Local computation on each node, results passed to neighbors
• With multiple iterations, convergence on global graph
• Amount of intermediate data is on the order of the number of edges
– Worst case?
– O(n^2) for dense graph

Page 55: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Issues with Graph processing

• Role of combiner?

Page 57: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

PageRank in MapReduce

Map: distribute PageRank “credit” to link targets

...

Reduce: gather up PageRank “credit” from multiple sources to compute new PageRank value

Iterate until convergence

Page 58: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Dijkstra’s Algorithm Example

[Figure: weighted directed graph on nodes n0 to n4. Distance label shown: 0 (source). Edge weights: 10, 5, 2, 3, 2, 1, 9, 7, 4, 6.]

Example from CLR

Page 59: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Issues with Graph processing

• Combiners only useful if can do partial aggregation
– Only if multiple nodes are processed by an individual mapper and point to the same nodes
– Otherwise combiner not useful

• Assume we have a mapper process more than one node
– How to assign nodes (partition graph) so combiners are useful?
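
Partial aggregation can be illustrated with a tiny stand-in for a combiner. The `combine` function and the sample records are hypothetical; the point is only that records sharing a key within one mapper's output collapse into one, using `sum` for PageRank credit and `min` for shortest-path distances.

```python
from collections import defaultdict


def combine(mapper_output, aggregate):
    """Collapse one mapper's records that share a key (hypothetical combiner stand-in)."""
    grouped = defaultdict(list)
    for key, value in mapper_output:
        grouped[key].append(value)
    return [(key, aggregate(values)) for key, values in grouped.items()]


# One mapper processed two nodes that both link to n3.
pagerank_output = [("n3", 0.1), ("n4", 0.1), ("n3", 0.2)]
combined_ranks = combine(pagerank_output, sum)

distance_output = [("n3", 7), ("n4", 2), ("n3", 5)]
combined_distances = combine(distance_output, min)

print(combined_ranks, combined_distances)
```

Three records shrink to two in each case. If every target key in a mapper's output were distinct, the combiner would pass all records through unchanged, which is why the partitioning of nodes across mappers matters.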

Page 60: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Issues with Graph processing

• Desirable to partition graph so there are many intra-component links and few inter-component links

• Consider a social network
– Partitioning heuristics
– Order nodes by:
  • Last name?
  • Zip code?
  • Language spoken?
  • School?
– So that people in the same partition are connected

Page 61: Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides.

Summary

• Graph structure represented with adjacency list

• Map over nodes, pass partial results to nodes on adjacency list, partial results aggregated for each node in reducer

• Graph structure passed from mapper to reducer, output in same form as input

• Algorithms iterative, under control of non-MapReduce driver checking for termination at end of each iteration
