Large Scale Centrality Measures in Apache Flink and Apache ... · Large Scale Centrality Measures...

Master Thesis Thesis Advisor: Sebastian Schelter, Research Associate Thesis Supervisor: Prof. Dr. Markl, Volker Large Scale Centrality Measures in Apache Flink and Apache Giraph Submitted by Janani Chakkaradhari 5 September 2014


Page 1: Large Scale Centrality Measures in Apache Flink and Apache ... · Large Scale Centrality Measures in Apache Flink and Apache Giraph Submitted by ... Apache Flink (Runtime vs Edges)

Master Thesis

Thesis Advisor: Sebastian Schelter, Research Associate

Thesis Supervisor: Prof. Dr. Markl, Volker

Large Scale Centrality Measures in

Apache Flink and Apache Giraph

Submitted by

Janani Chakkaradhari

5 September 2014


Centrality measures identify the most central nodes in a network


Targeted advertising minimizes the resources and effort required for marketing


Centrality measures were used to identify the head of the terrorist network behind the 9/11 attacks (Krebs, 2002)


Different notions of the “most central nodes” (Freeman, 1977)

• Degree
• Closeness
• Betweenness


Real-world networks are very large and sparse (Barabási, 2004)


Big Data platforms to analyze large networks


Related work on parallel computation of centrality measures

Kang, U., et al. propose two novel algorithms for computing closeness and betweenness in the paper “Centralities in Large Networks: Algorithms and Observations” and implement them in Hadoop.

• Effective Closeness algorithm
  - an approximate technique for closeness

• LineRank algorithm
  - random-walk betweenness


Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?

• Evaluating the performance of each of these two platforms

[Figure: Closeness & Betweenness of large networks; Parallel data processing platforms; Apache Flink]


Programming model of Apache Giraph & Apache Flink for iterative graph processing

• Apache Giraph: a vertex-centric model for iterative graph processing

• Apache Flink: offers special iteration operators

[Figure: vertices V1, V2, V3 shown at superstep i, superstep i+1, and superstep i+2. Sources: Stephan Ewen, 2014; Sebastian Schelter, 2012]
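The superstep picture above can be made concrete with a small simulation. The following is a minimal sketch of a vertex-centric loop, not Giraph's actual API: the function `run_supersteps`, the three-vertex chain, and the "propagate the maximum value" task are all assumptions chosen for illustration. In each superstep a vertex reads the messages sent to it in the previous superstep, possibly updates its state, and sends messages to its neighbours; the computation stops when no messages are in flight.

```python
def run_supersteps(values, neighbors, max_supersteps=30):
    """Propagate the maximum vertex value through the graph, superstep by superstep."""
    values = dict(values)
    # Superstep 0: every vertex sends its own value to all of its neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in neighbors[v]:
            inbox[n].append(val)
    supersteps = 1
    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages, the vertex stays inactive
            best = max(messages)
            if best > values[v]:
                values[v] = best  # update the vertex state
                for n in neighbors[v]:
                    outbox[n].append(best)  # notify neighbours of the change
        inbox = outbox
        supersteps += 1
    return values, supersteps


# Hypothetical chain V1 - V2 - V3 with arbitrary start values.
NEIGHBORS = {"V1": ["V2"], "V2": ["V1", "V3"], "V3": ["V2"]}
START = {"V1": 3, "V2": 6, "V3": 1}
final_values, used_supersteps = run_supersteps(START, NEIGHBORS)
```

Because a vertex only sends messages when its value changed, the amount of work shrinks from one superstep to the next.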


[Figure: Flink iteration operators. Bulk Iteration: initial dataset, iterative function, result. Delta Iteration: initial solution set, initial workset, result.]


Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
  1. Computation logic
  2. Implementation

• Evaluating the performance of each of these two platforms


Iterative computation of the Effective Closeness algorithm

• Shortest paths between nodes: the algorithm progressively counts the nodes reached at each step (each step corresponds to one more hop of shortest-path length)

• Sum of the shortest paths: (2 x 1) + (2 x 2) + (1 x 3) = 9

[Figure: 2 nodes reached in Step 1, 2 nodes in Step 2, 1 node in Step 3]
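The step-wise counting can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the sum above refers to vertex 1 of the six-vertex example graph used in the Flink illustration of this deck (edges 1-2, 1-5, 2-3, 2-5, 3-4, 4-5, 4-6), which reproduces the counts 2, 2, 1.

```python
def nodes_reached_per_step(adj, source):
    """Return how many new nodes are reached at step 1, step 2, ... from source."""
    visited = {source}
    frontier = {source}
    counts = []
    while frontier:
        next_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    next_frontier.add(v)
        if next_frontier:
            counts.append(len(next_frontier))
        frontier = next_frontier
    return counts


def sum_of_shortest_paths(adj, source):
    """Weight the number of nodes first reached at each step by the step number."""
    counts = nodes_reached_per_step(adj, source)
    return sum(step * n for step, n in enumerate(counts, start=1))


# Undirected example graph (assumed to be the one behind the slide's numbers).
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

counts_for_vertex_1 = nodes_reached_per_step(ADJ, 1)
total_for_vertex_1 = sum_of_shortest_paths(ADJ, 1)
```

A node first reached at step k is at shortest-path distance k, so multiplying each count by its step number and summing gives the sum of shortest-path lengths.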


Computation of Effective Closeness in Apache Giraph

[Figure: example graph annotated with the values computed by Effective Closeness in Apache Giraph]


Illustration of Effective Closeness in Apache Flink using Delta iteration (1/4)

• Vertices – initial workset and solution set

• Edges – pairs of source and destination ids

Vertices:

vid  bit_string
1    1 0 0 0 0 0
2    0 1 0 0 0 0
3    0 0 1 0 0 0
4    0 0 0 1 0 0
5    0 0 0 0 1 0
6    0 0 0 0 0 1

Edges:

src  des
1    2
1    5
2    3
5    4
4    6
2    1
5    1
3    2
4    5
6    4
2    5
3    4
4    3
5    2


Illustration of Effective Closeness in Apache Flink using Delta iteration (2/4)

The workset (vid, bit_string) is joined with the edges (src, des) on vid = src (⋈ vid=src). Each matching edge emits a record (des, bit_string):

des  bit_string
2    1 0 0 0 0 0
5    1 0 0 0 0 0
3    0 1 0 0 0 0
4    0 0 0 0 1 0
6    0 0 0 1 0 0
1    0 1 0 0 0 0
1    0 0 0 0 1 0
2    0 0 1 0 0 0
5    0 0 0 1 0 0
4    0 0 0 0 0 1
5    0 1 0 0 0 0
4    0 0 1 0 0 0
3    0 0 0 1 0 0
2    0 0 0 0 1 0


Illustration of Effective Closeness in Apache Flink using Delta iteration (3/4)

The emitted (des, bit_string) records are grouped by vertex id (𝛾𝑠𝑟𝑐) and combined with Bit-OR.

Grouped records:

des  bit_string
1    0 1 0 0 0 0
1    0 0 0 0 1 0
2    0 0 1 0 0 0
2    1 0 0 0 0 0
2    0 0 0 0 1 0
5    0 0 0 1 0 0
5    1 0 0 0 0 0
5    0 1 0 0 0 0
3    0 1 0 0 0 0
3    0 0 0 1 0 0
4    0 0 0 0 0 1
4    0 0 0 0 1 0
4    0 0 1 0 0 0
6    0 0 0 1 0 0

Updated result in current iteration:

des  bit_string
1    1 1 0 0 1 0
2    1 1 1 0 1 0
3    0 1 1 1 0 0
4    0 0 1 1 1 1
5    1 1 0 1 1 0
6    0 0 0 1 0 1


Illustration of Effective Closeness in Apache Flink using Delta iteration (4/4)

The updated result of the current iteration is joined with the solution set (the previous iteration's result) on vid = des (⋈ vid=des):

vid  previous bit_string  prev count  updated bit_string  current count
1    1 0 0 0 0 0          0           1 1 0 0 1 0         2
2    0 1 0 0 0 0          0           1 1 1 0 1 0         3
3    0 0 1 0 0 0          0           0 1 1 1 0 0         2
4    0 0 0 1 0 0          0           0 0 1 1 1 1         3
5    0 0 0 0 1 0          0           1 1 0 1 1 0         3
6    0 0 0 0 0 1          0           0 0 0 1 0 1         1

Termination condition check:

  if (prev count != current count)
    emit the updated nodes => next workset
  else
    keep calm (emit nothing)

Next workset (every vertex changed in this iteration):

vid  bit_string   count
1    1 1 0 0 1 0  2
2    1 1 1 0 1 0  3
3    0 1 1 1 0 0  2
4    0 0 1 1 1 1  3
5    1 1 0 1 1 0  3
6    0 0 0 1 0 1  1
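The four illustration steps can be replayed outside Flink. The following is a plain-Python sketch of the same dataflow, not Flink code: `delta_iteration` and its helpers are hypothetical names, and the bit strings are exact bit sets as in the illustration, so the approximate technique of the original algorithm is not modelled. Counts here include the vertex's own bit, which shifts every count by one compared with the slide but leaves the changed/unchanged test identical.

```python
from collections import defaultdict

# Directed edge list (src, des) copied from the illustration.
EDGES = [(1, 2), (1, 5), (2, 3), (5, 4), (4, 6), (2, 1), (5, 1),
         (3, 2), (4, 5), (6, 4), (2, 5), (3, 4), (4, 3), (5, 2)]
NUM_VERTICES = 6


def bit_of(vid):
    # Vertex i owns the i-th bit from the left, matching the slide's bit strings.
    return 1 << (NUM_VERTICES - vid)


def to_bit_string(mask):
    return format(mask, "0{}b".format(NUM_VERTICES))


def delta_iteration(edges, max_iterations=20):
    solution = {v: bit_of(v) for v in range(1, NUM_VERTICES + 1)}
    workset = dict(solution)           # initially every vertex is in the workset
    path_sum = {v: 0 for v in solution}
    history = []                       # solution set after each iteration
    for step in range(1, max_iterations + 1):
        if not workset:
            break                      # empty workset: the iteration terminates
        # JOIN workset with edges on vid = src, emit (des, bit_string).
        emitted = [(des, workset[src]) for src, des in edges if src in workset]
        # REDUCE: group by des and combine the bit strings with bitwise OR.
        combined = defaultdict(int)
        for des, bits in emitted:
            combined[des] |= bits
        # JOIN with the solution set on vid = des and keep only changed vertices.
        next_workset = {}
        for vid, bits in combined.items():
            updated = solution[vid] | bits
            prev_count = bin(solution[vid]).count("1")
            curr_count = bin(updated).count("1")
            if prev_count != curr_count:
                # Nodes first seen in this step are exactly `step` hops away.
                path_sum[vid] += step * (curr_count - prev_count)
                next_workset[vid] = updated
        solution.update(next_workset)
        workset = next_workset
        history.append({v: to_bit_string(m) for v, m in solution.items()})
    return path_sum, history


path_sum, history = delta_iteration(EDGES)
```

After the first iteration `history[0]` holds the bit strings shown in the "updated result" table, and `path_sum[1]` ends at 9, the sum of shortest paths for vertex 1.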


Illustration of Effective Closeness in Apache Flink using delta iteration

[Figure: dataflow of the delta iteration with the operators JOIN, REDUCE, and JOIN, organised into a step function and an update function]


Summary of Effective Closeness implementation

• Both implementations reduce the amount of data to be processed in successive iterations

• Hence both computing models for Effective Closeness exploit the sparse nature of real-world graphs


Idea behind LineRank Algorithm

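• Betweenness is computed from the importance scores of the edges incident to a node

[Figure: example graph G (nodes 1, 2, 3; edges a, b, c, d, e) and its line graph L(G) (nodes a, b, c, d, e). PageRank is computed on L(G) by power iteration, $r = T^k r_0$, which yields the eigenvector (rank) of the nodes in L(G). Kang et al., 2011]

• Problem: the line graph L(G) is larger than the original graph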
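As a reminder of what the power iteration $r = T^k r_0$ does, here is a minimal sketch in plain Python. The 2x2 column-stochastic matrix is a hypothetical example, not a matrix from the thesis, and the helper names are made up for illustration.

```python
def mat_vec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, r0, k):
    """Apply the matrix k times to r0, renormalising so the entries sum to 1."""
    r = list(r0)
    for _ in range(k):
        r = mat_vec(matrix, r)
        total = sum(r)
        r = [x / total for x in r]
    return r


# Hypothetical column-stochastic transition matrix (each column sums to 1).
T = [[0.9, 0.5],
     [0.1, 0.5]]
r = power_iteration(T, [0.5, 0.5], k=50)
```

For this matrix the iteration converges to the dominant eigenvector (5/6, 1/6). Every step is a single matrix-vector product, which is why the size of the matrix (here, of L(G)) matters so much.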


Challenges in implementing LineRank in Apache Giraph

• Two-step matrix-vector multiplication in the power iteration using two sparse matrices (incoming and outgoing edges)

• The vertex state value in LineRank is an edge score, which contradicts the vertex-centric computation model

• How can the two-stage matrix-vector multiplication be achieved in Giraph?

$v \leftarrow L(G)\,v \;\Leftrightarrow\; v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$
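The equivalence between the one-step and the two-step product can be checked on the three-node example graph. This is a sketch under assumptions: the edge directions (a: 1→2, b: 2→1, c: 1→3, d: 3→1, e: 2→3) follow the S(G) and T(G) incidence tables in the backup slides, the input vector is arbitrary, and all helper names are hypothetical.

```python
# Edge id -> (source node, target node); directions are an assumption (see above).
EDGES = {"a": (1, 2), "b": (2, 1), "c": (1, 3), "d": (3, 1), "e": (2, 3)}
NODES = [1, 2, 3]
EDGE_IDS = sorted(EDGES)


def incidence(kind):
    """Edges x nodes 0/1 matrix: S(G) for kind='source', T(G) for kind='target'."""
    pos = 0 if kind == "source" else 1
    return [[1 if EDGES[e][pos] == n else 0 for n in NODES] for e in EDGE_IDS]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def mat_mul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


S = incidence("source")
T = incidence("target")


def two_step(v1):
    v2 = mat_vec(transpose(S), v1)   # v2 <- S(G)^T v1 : one value per node
    v3 = mat_vec(T, v2)              # v3 <- T(G) v2   : one value per edge
    return v3


def direct(v1):
    line_graph = mat_mul(T, transpose(S))  # materialises L(G), edges x edges
    return mat_vec(line_graph, v1)


v1 = [1, 2, 3, 4, 5]  # arbitrary edge scores for a, b, c, d, e
```

Both routes give the same edge scores, but the two-step route only touches the two edges-by-nodes matrices and never builds the edges-by-edges matrix L(G).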


Proposed solution for implementing the LineRank algorithm in Apache Giraph (1/2)

• Illustration of “think like a vertex”

• Let us compute the step v2 in the first iteration for our example graph

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$

[Figure: example graph with nodes 1, 2, 3 and edges a, b, c, d, e]


Proposed solution for implementing the LineRank algorithm in Apache Giraph (2/2)

Pseudo-code

• The current state of the vertex is assigned the result of computing v2

• The messages exchanged in the iteration are treated as the edge scores v3

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$


Illustration of the proposed solution to implement LineRank in Apache Giraph

[Figure: input graph with nodes 1, 2, 3, followed by the vertex values and messages at superstep 0 (values 0.2 and 0.1) and superstep 1 (values 0.15, 0.3, and 0.1), computed with $v_2 \leftarrow S(G)^T v_1,\; v_3 \leftarrow T(G)\,v_2$]


Implementation of LineRank in Apache Flink


Summary of LineRank implementation

• Two-step matrix-vector multiplication is hard to implement in Apache Giraph

• Remodeling the LineRank computation in Apache Giraph requires in-depth knowledge at both the platform level and the algorithmic level

• Programming computationally intensive iterative algorithms with Apache Flink is simple and flexible


Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
  1. Computation logic
  2. Implementation

• Evaluating the performance of each of these two platforms


Evaluation – Dataset


Evaluating scalability of Effective Closeness in Apache Giraph & Apache Flink (Runtime vs Edges)

[Figure: runtime vs. number of edges]

*Fixed number of parallel tasks


Evaluating scalability of LineRank in Apache Giraph & Apache Flink (Runtime vs Edges)

[Figure panels: "LineRank in Flink: Runtime vs Number of cores"; "LineRank in Giraph: Runtime vs Number of cores"]

*Fixed graph data


Evaluation – Comparing the performance of Apache Giraph and Apache Flink

[Figure panels: LineRank; Effective Closeness. No. of cores = 15]


Evaluation – Comparing the performance of Apache Giraph and Apache Flink

• Apache Giraph incorporates hash-based aggregations

• Apache Flink uses a sort-based technique for aggregations

• Apache Giraph has an efficient mechanism for estimating memory requirements
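The difference between the two aggregation strategies can be shown on a toy list of (key, value) records. This is a generic sketch of hash-based versus sort-based grouping, not the internals of either system; function names and data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(records):
    """Hash-based: one pass, one hash-table entry per distinct key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


def sort_aggregate(records):
    """Sort-based: sort by key first, then sum each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


# Hypothetical (vertex id, partial score) records.
RECORDS = [(2, 1), (1, 4), (2, 3), (3, 5), (1, 1)]
```

Both functions return the same totals. The hash variant avoids the sort but needs the whole table of keys in memory, while the sort variant pays for ordering the records before it can aggregate them.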


Conclusion

• The implementations of Effective Closeness exploit the sparse nature of real-world graphs

• The programming model of Apache Giraph is not flexible for computations that involve multi-step matrix-vector multiplication, whereas Apache Flink is more flexible for these computations

• Efficient optimizations in Apache Giraph make it perform better than Apache Flink

• The implementations of these algorithms are intended as a contribution to the Apache Flink open source community


Future Work

• This work can be extended to evaluate computation-intensive centrality algorithms on other parallel data processing systems such as Apache Spark and Distributed GraphLab


References

[1] Kang, U., et al. "Centralities in Large Networks: Algorithms and Observations." SDM. Vol. 2011. 2011.

[2] Schelter, Sebastian. "Introducing Apache Giraph for Large Scale Graph Processing." slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing, 2012.

[3] Krebs, Valdis E. "Mapping networks of terrorist cells." Connections 24.3 (2002): 43-52.

[4] Ewen, Stephan, et al. "Spinning fast iterative data flows." Proceedings of the VLDB Endowment 5.11 (2012): 1268-1279.

[5] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.


[6] Freeman, Linton C. "A set of measures of centrality based on betweenness." Sociometry (1977): 35-41.

[7] Ewen, Stephan. "Stratosphere, Next-Gen Data Analytics Platform." Hadoop Summit Europe, 2014.

[8] Barabási, Albert-László, and Zoltán N. Oltvai. "Network biology: understanding the cell's functional organization." Nature Reviews Genetics 5.2 (2004): 101-113.


Backup Slides


Summary of Effective Closeness implementation

• Both implementations reduce the amount of data to be processed in successive iterations

• Hence both computing models for Effective Closeness exploit the sparse nature of real-world graphs

[Figure: highly connected vs. less connected]


LineRank Dataflow in Apache Flink


Final step in the proposed solution to implement LineRank in Apache Giraph

• Aggregating the computed edge scores (incoming and outgoing edges)

• The computation of v2 represents the aggregation of incoming edge scores

[Figure: example graph G with nodes 1, 2, 3 and edges a, b, c, d, e]

Bet(1) = R(a)+R(b)+R(c)+R(d)

Bet(2) = R(a)+R(b)+R(e)

Bet(3) = R(c)+R(d)+R(e)
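The three sums above follow one rule: a node's score is the total of the scores R of all edges that start or end at it. A minimal sketch, assuming the edge directions from the S(G)/T(G) incidence tables of this deck and using made-up edge scores (the function name is hypothetical as well):

```python
# Edge id -> (source node, target node); directions are an assumption.
EDGES = {"a": (1, 2), "b": (2, 1), "c": (1, 3), "d": (3, 1), "e": (2, 3)}


def betweenness_from_edge_scores(edges, scores):
    """Sum the score of every incident edge (incoming or outgoing) per node."""
    bet = {}
    for edge_id, (src, dst) in edges.items():
        for node in (src, dst):
            bet[node] = bet.get(node, 0.0) + scores[edge_id]
    return bet


# Hypothetical edge scores R(a)..R(e), chosen so each sum is easy to trace.
R = {"a": 1, "b": 2, "c": 4, "d": 8, "e": 16}
bet = betweenness_from_edge_scores(EDGES, R)
```

With these scores, node 1 collects R(a)+R(b)+R(c)+R(d) = 15, node 2 collects R(a)+R(b)+R(e) = 19, and node 3 collects R(c)+R(d)+R(e) = 28, matching the pattern of the three formulas.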


LineRank algorithm computes the random-walk betweenness without constructing the line graph

• L(G) is decomposed into two sparse matrices

  – Source Incidence Matrix S(G) [Outgoing edges]

  – Target Incidence Matrix T(G) [Incoming edges]

  – $L(G) = T(G)\,S(G)^T$

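[Figure: example graph with nodes 1, 2, 3 and edges a, b, c, d, e]

S(G)                 T(G)
     1  2  3              1  2  3
a    1  0  0         a    0  1  0
b    0  1  0         b    1  0  0
c    1  0  0         c    0  0  1
d    0  0  1         d    1  0  0
e    0  1  0         e    0  0  1

$$
T(G)\,S(G)^T =
\begin{pmatrix} 0&1&0 \\ 1&0&0 \\ 1&0&0 \\ 0&0&1 \\ 0&0&1 \end{pmatrix}
\begin{pmatrix} 1&0&0&1&0 \\ 0&1&0&0&1 \\ 0&0&1&0&0 \end{pmatrix}
=
\begin{pmatrix} 0&1&0&0&1 \\ 1&0&0&1&0 \\ 1&0&0&1&0 \\ 0&0&1&0&0 \\ 0&0&1&0&0 \end{pmatrix}
$$

Power iteration in LineRank

Referred from Kang et al. [1]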