Mining Billion-node Graphs: Patterns, Generators and Tools


Page 1: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Mining Billion-node Graphs: Patterns, Generators and Tools

Christos Faloutsos

CMU

Page 2: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 2

Our goal:

Open source system for mining huge graphs:

PEGASUS project (PEta GrAph mining System)

• www.cs.cmu.edu/~pegasus

• code and papers

NSA'10

Page 3: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 3

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools

• Problem#3: Scalability

• Conclusions

NSA'10

Page 4: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 4

Graphs - why should we care?

Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

NSA'10

Page 5: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 5

Graphs - why should we care?

• IR: bi-partite graphs (doc-terms)

• web: hyper-text graph

• ... and more:

[Figure: bipartite graph linking documents D1 ... DN to terms T1 ... TM]

NSA'10

Page 6: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 6

Graphs - why should we care?

• network of companies & board-of-directors members

• ‘viral’ marketing

• web-log (‘blog’) news propagation

• computer network security: email/IP traffic and anomaly detection

• ....

NSA'10

Page 7: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 7

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs
– Static graphs
– Weighted graphs
– Time evolving graphs

• Problem#2: Tools

• Problem#3: Scalability

• Conclusions

NSA'10

Page 10: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 10

Problem #1 - network and graph mining

• What does the Internet look like?
• What does Facebook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?

– To spot anomalies (rarities), we have to discover patterns

– Large datasets reveal patterns/anomalies that may be invisible otherwise…

NSA'10

Page 11: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 11

Graph mining

• Are real graphs random?

NSA'10

Page 12: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 12

Laws and patterns
• Are real graphs random?
• A: NO!!
– Diameter
– in- and out- degree distributions
– other (surprising) patterns

• So, let’s look at the data

NSA'10

Page 13: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 13

Solution# S.1

• Power law in the degree distribution [SIGCOMM99]

[Plot: log(degree) versus log(rank) for internet domains; fitted slope -0.82; labeled points: att.com, ibm.com]
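A slope such as the -0.82 above is typically estimated with a least-squares line fit on log-log axes. The sketch below is a minimal illustration on synthetic data, not the SIGCOMM99 measurement; the function name `fit_loglog_slope` and the data are assumptions made for this example.

```python
import math

def fit_loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data (assumption): degree = 1000 * rank^(-0.82), with no noise.
ranks = list(range(1, 201))
degrees = [1000.0 * r ** -0.82 for r in ranks]
slope, intercept = fit_loglog_slope(ranks, degrees)
```

On real, noisy degree data the fitted slope depends on binning and on how the tail is handled, so a plain line fit like this one should be read as a first look only.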

NSA'10

Page 14: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 14

Solution# S.2: Eigen Exponent E

• A2: power law in the eigenvalues of the adjacency matrix

[Plot: eigenvalue versus rank of decreasing eigenvalue; exponent = slope, E = -0.48; data from May 2001]

NSA'10

Page 15: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 15

Solution# S.2: Eigen Exponent E

• [Mihail, Papadimitriou ’02]: slope is ½ of rank exponent

[Plot: eigenvalue versus rank of decreasing eigenvalue; exponent = slope, E = -0.48; data from May 2001]

NSA'10

Page 16: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 16

But:

How about graphs from other domains?

NSA'10

Page 17: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 17

More power laws:

• web hit counts [w/ A. Montgomery]

[Plot: Web Site Traffic; count (log scale) versus in-degree (log scale); Zipf; labels: users, sites, ‘ebay’]

NSA'10

Page 18: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 18

epinions.com
• who-trusts-whom [Richardson + Domingos, KDD 2001]
[Plot: count versus (out) degree; labeled point: a user who trusts 2000 people]

NSA'10

Page 19: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

And numerous more

• # of sexual contacts

• Income [Pareto] – ‘80-20 distribution’

• Duration of downloads [Bestavros+]

• Duration of UNIX jobs (‘mice and elephants’)

• Size of files of a user

• …

• ‘Black swans’

Page 20: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 20

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs
– Static graphs
  • degree, diameter, eigen,
  • triangles
  • cliques
– Weighted graphs
– Time evolving graphs
• Problem#2: Tools

Page 21: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 21

Solution# S.3: Triangle ‘Laws’

• Real social networks have a lot of triangles

NSA'10

Page 22: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 22

Solution# S.3: Triangle ‘Laws’

• Real social networks have a lot of triangles
– Friends of friends are friends

• Any patterns?

NSA'10

Page 23: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 23

Triangle Law: #S.3 [Tsourakakis ICDM 2008]

[Plots for three datasets: ASN, HEP-TH, Epinions]
X-axis: # of triangles a node participates in
Y-axis: count of such nodes

NSA'10

Page 25: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 25

Triangle Law: #S.4 [Tsourakakis ICDM 2008]

[Plots for datasets labeled SN, Reuters, Epinions]
X-axis: degree
Y-axis: mean # triangles
n friends -> ~n^1.6 triangles

NSA'10

Page 27: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 27

Triangle Law: Computations [Tsourakakis ICDM 2008]

But: triangles are expensive to compute (3-way join; several approx. algos)

Q: Can we do that quickly?
A: Yes!

$\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$, where $\lambda_i$ are the eigenvalues of the adjacency matrix

(and, because of skewness, we only need the top few eigenvalues!)
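The eigenvalue identity can be checked on a small graph. The sketch below assumes NumPy is available; the function name and the `top_k` parameter are made up for illustration. It computes a full eigendecomposition for clarity, whereas at scale one would compute only the top few eigenvalues with an iterative solver.

```python
import numpy as np

def triangles_from_eigenvalues(adjacency, top_k=None):
    """Triangle count of an undirected graph as (1/6) * sum(lambda_i ** 3).

    top_k=None uses every eigenvalue (exact up to rounding).
    top_k=k keeps only the k eigenvalues of largest magnitude (approximate).
    """
    eigenvalues = np.linalg.eigvalsh(adjacency)      # symmetric matrix, real eigenvalues
    if top_k is not None:
        order = np.argsort(-np.abs(eigenvalues))     # largest magnitude first
        eigenvalues = eigenvalues[order[:top_k]]
    return float(np.sum(eigenvalues ** 3) / 6.0)

# Complete graph on 4 nodes: every 3-subset is a triangle, so 4 triangles.
k4 = np.ones((4, 4)) - np.eye(4)
exact = triangles_from_eigenvalues(k4)
approx = triangles_from_eigenvalues(k4, top_k=1)
```

For this graph the eigenvalues are 3, -1, -1, -1, so the full sum gives 24/6 = 4 while the top eigenvalue alone gives 27/6 = 4.5. On a graph this small and regular the truncation error is large; the slide's accuracy claim relies on the skewed spectra of real graphs.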

details

NSA'10

Page 28: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 28

Triangle Law: Computations [Tsourakakis ICDM 2008]

1000x+ speed-up, >90% accuracy

details

NSA'10

Page 29: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes

B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.

C. Faloutsos (CMU) 29NSA'10

Page 31: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes
• Eigenvectors of adjacency matrix
– equivalent to singular vectors (symmetric, undirected graph)
[Figure: N x N adjacency matrix]

Page 32: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes

• EE plot:
• Scatter plot of scores of u1 vs u2
• One would expect
– Many points @ origin
– A few scattered ~randomly
[Plot axes: u1 (1st principal component) versus u2 (2nd principal component)]

Page 33: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes

• EE plot:
• Scatter plot of scores of u1 vs u2
• One would expect
– Many points @ origin
– A few scattered ~randomly
[Plot: u1 versus u2, annotated with a 90° angle]

Page 34: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes - pervasiveness

•Present in mobile social graph across time and space

•Patent citation graph

34C. Faloutsos (CMU)NSA'10

Page 35: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes - explanation

Near-cliques, or near-bipartite-cores, loosely connected

35C. Faloutsos (CMU)NSA'10

Page 37: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

EigenSpokes - explanation

Near-cliques, or near-bipartite-cores, loosely connected

So what?
• Extract nodes with high scores
• high connectivity
• Good “communities”

spy plot of top 20 nodes
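The near-clique explanation can be tried on an idealized toy graph. The sketch below assumes NumPy and uses two fully disconnected cliques plus isolated nodes, which is a simplification of "loosely connected"; all function names are hypothetical. Each of the two leading eigenvectors then has weight on only one clique, so every node lies on one of the two axes of the EE plot.

```python
import numpy as np

def top_two_eigenvectors(adjacency):
    """Eigenpairs with the two largest-magnitude eigenvalues of a symmetric matrix."""
    values, vectors = np.linalg.eigh(adjacency)      # eigenvalues in ascending order
    order = np.argsort(-np.abs(values))[:2]
    return values[order], vectors[:, order]

def cliques_plus_isolated(clique_sizes, n_isolated):
    """Adjacency matrix of disjoint cliques followed by isolated nodes."""
    n = sum(clique_sizes) + n_isolated
    adjacency = np.zeros((n, n))
    start = 0
    for size in clique_sizes:
        block = slice(start, start + size)
        adjacency[block, block] = 1.0
        start += size
    np.fill_diagonal(adjacency, 0.0)                 # no self-loops
    return adjacency

adjacency = cliques_plus_isolated([6, 5], n_isolated=9)
values, u = top_two_eigenvectors(adjacency)

# EE-plot coordinates of node i are (u[i, 0], u[i, 1]).
on_a_spoke = np.abs(u[:, 0] * u[:, 1]) < 1e-9        # at most one coordinate is non-zero
community = np.flatnonzero(np.abs(u[:, 0]) > 0.1)    # nodes with high scores on u1
```

Here `community` recovers the first clique (nodes 0 to 5). The 0.1 cut-off is arbitrary and only suits this toy input.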

37C. Faloutsos (CMU)NSA'10

Page 38: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Bipartite Communities!

magnified bipartite community

patents from same inventor(s)

cut-and-paste bibliography!

38C. Faloutsos (CMU)NSA'10

Page 39: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 39

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs
– Static graphs
  • degree, diameter, eigen,
  • triangles
  • cliques
– Weighted graphs
– Time evolving graphs
• Problem#2: Tools

Page 40: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 40

Observations on weighted graphs?

• A: yes - even more ‘laws’!

M. McGlohon, L. Akoglu, and C. Faloutsos Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

NSA'10

Page 41: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 41

Observation W.1: Fortification

Q: How do the weights of nodes relate to degree?

NSA'10

Page 42: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 42

Observation W.1: Fortification

More donors, more $ ?

[Figure: donation edges weighted $10, $5 and $7 into candidate nodes ‘Reagan’ and ‘Clinton’]

Page 43: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Observation W.1: Fortification: Snapshot Power Law

• Weight: super-linear on in-degree
• exponent ‘iw’: 1.01 < iw < 1.26

[Plot: Orgs-Candidates; in-weights ($) versus edges (# donors); e.g. John Kerry, $10M received from 1K donors]

More donors, even more $
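Read as a formula, the snapshot power law says that the total in-weight of a node grows as a power of its in-degree. The symbols W (in-weight) and d (in-degree) are introduced here for illustration; the slides give only the exponent range.

```latex
W \propto d^{\,iw}, \qquad 1.01 < iw < 1.26
\quad\Longrightarrow\quad
\frac{W(2d)}{W(d)} = 2^{\,iw} \in \left(2^{1.01},\, 2^{1.26}\right) \approx (2.01,\, 2.39)
```

Under this reading, a candidate with twice as many donors would be expected to receive somewhat more than twice the money.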

NSA'10

Page 44: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 44

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs
– Static graphs
– Weighted graphs
– Time evolving graphs

• Problem#2: Tools

• …

NSA'10

Page 45: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 45

Problem: Time evolution
• with Jure Leskovec (CMU -> Stanford)
• and Jon Kleinberg (Cornell – sabb. @ CMU)

NSA'10

Page 47: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 47

T.1 Evolution of the Diameter

• Prior work on Power Law graphs hints at slowly growing diameter:
– diameter ~ O(log N)
– diameter ~ O(log log N)

• What is happening in real data?

• Diameter shrinks over time

NSA'10

Page 48: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 48

T.1 Diameter – “Patents”

• Patent citation network

• 25 years of data

• @1999
– 2.9 M nodes
– 16.5 M edges

[Plot: diameter versus time [years]]

NSA'10

Page 50: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 50

T.2 Temporal Evolution of the Graphs

• N(t) … nodes at time t
• E(t) … edges at time t
• Suppose that N(t+1) = 2 * N(t)
• Q: what is your guess for E(t+1) =? 2 * E(t)
• A: over-doubled!
– But obeying the ‘Densification Power Law’

NSA'10

Page 51: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 51

T.2 Densification – Patent Citations

• Citations among patents granted

• @1999
– 2.9 M nodes
– 16.5 M edges
• Each year is a datapoint

[Plot: E(t) versus N(t); fitted slope 1.66]
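As a worked example of the Densification Power Law, assume E(t) grows as N(t) raised to the fitted exponent 1.66. That proportionality is the usual reading of such a slope, and the helper name is made up for this sketch. Doubling the nodes then multiplies the edges by 2^1.66, roughly 3.16, rather than by 2.

```python
def edge_growth_factor(node_growth_factor, exponent):
    """Factor by which E grows when N grows by node_growth_factor, assuming E ~ N**exponent."""
    return node_growth_factor ** exponent

doubling = edge_growth_factor(2.0, 1.66)   # nodes double, exponent from the patent plot
linear = edge_growth_factor(2.0, 1.0)      # what plain doubling of edges would look like
```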

NSA'10

Page 52: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 52

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs
– Static graphs
– Weighted graphs
– Time evolving graphs

• Problem#2: Tools

• …

NSA'10

Page 53: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 53

More on Time-evolving graphs

M. McGlohon, L. Akoglu, and C. Faloutsos Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

NSA'10

Page 54: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 54

Observation T.3: NLCC behavior

Q: How do NLCC’s emerge and join with the GCC?

(‘NLCC’ = non-largest conn. components)

– Do they continue to grow in size?

– or do they shrink?

– or stabilize?

NSA'10

Page 55: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 55

Observation T.3: NLCC behavior

• After the gelling point, the GCC takes off, but NLCC’s remain ~constant (actually, oscillate).

[Plot: IMDB; CC size versus time-stamp]

Page 56: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 56

Timing for Blogs

• with Mary McGlohon (CMU)

• Jure Leskovec (CMU->Stanford)

• Natalie Glance (now at Google)

• Mat Hurst (now at MSR)

[SDM’07]

NSA'10

Page 57: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 57

T.4 : popularity over time

Post popularity drops-off – exponentially?

[Plot: # in links versus lag (days after post: 1, 2, 3); lag is measured from the post time @t to @t + lag]

NSA'10

Page 59: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 59

T.4 : popularity over time

Post popularity drops-off – exponentially?
POWER LAW!
Exponent? -1.6
• close to -1.5: Barabasi’s stack model
• and like the zero-crossings of a random walk

[Plot: # in links (log) versus days after post (log); fitted slope -1.6]

NSA'10

Page 60: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 60

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools
– CenterPiece Subgraphs; G-Ray
– OddBall (anomaly detection)
– PEGASUS

• Problem#3: Scalability

• Conclusions

NSA'10

Page 61: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

CenterPiece Subgraphs

• Hanghang TONG et al, KDD’06

C. Faloutsos (CMU) 61NSA'10

Page 62: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Center-Piece Subgraph Discovery [Tong+ KDD 06]

Original Graph

Q: Who is the most central node wrt the black nodes?

(e.g., master-mind criminal, common advisor/collaborator, etc)

[Figure: input graph with query nodes A, B, C]

Page 63: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Center-Piece Subgraph Discovery [Tong+ KDD 06]

Q: How to find hub for the query nodes?
A: Combine proximity scores (RWR)

[Figure: Input: original graph with query nodes A, B, C; Output: CePS, with the CePS node highlighted]
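One way to read "combine proximity scores (RWR)" is sketched below: run a random walk with restart (RWR) from each query node, multiply the per-query scores (an AND-style combination), and pick the best-scoring non-query node. This is a toy illustration, not the CePS algorithm from the paper, which extracts a whole subgraph. The restart probability 0.15, the iteration count and all names are assumptions.

```python
def rwr_scores(adj, seed, restart=0.15, iterations=100):
    """Random walk with restart from `seed` on an undirected graph.

    adj maps each node to a list of neighbours; every node needs at least one.
    Returns the approximate stationary visiting probability of each node.
    """
    score = {node: 0.0 for node in adj}
    score[seed] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for node, neighbours in adj.items():
            share = (1.0 - restart) * score[node] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        nxt[seed] += restart                  # jump back to the seed
        score = nxt
    return score

def center_piece(adj, queries):
    """Multiply the per-query RWR scores and return the best non-query node."""
    per_query = [rwr_scores(adj, q) for q in queries]
    combined = {}
    for node in adj:
        if node in queries:
            continue
        product = 1.0
        for scores in per_query:
            product *= scores[node]
        combined[node] = product
    return max(combined, key=combined.get), combined

# Toy graph (assumption): a hub linked to all three query nodes, plus a leaf off A.
adj = {
    "A": ["hub", "x"], "B": ["hub"], "C": ["hub"],
    "hub": ["A", "B", "C"], "x": ["A"],
}
best, combined = center_piece(adj, ["A", "B", "C"])
```

The product rewards nodes that are close to every query node at once, which is why the leaf `x`, close only to A, loses to `hub`.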

Page 64: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

CePS: Example (AND Query)

DBLP co-authorship network: 400,000 authors, 2,000,000 edges

Code at: http://www.cs.cmu.edu/~htong/soft.htm

Page 65: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

CePS: Example (AND Query)

[Figure: co-authorship subgraph with nodes R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Heikki Mannila, Christos Faloutsos, Padhraic Smyth, Corinna Cortes, Daryl Pregibon; numeric edge labels omitted]

DBLP co-authorship network: 400,000 authors, 2,000,000 edges

Code at: http://www.cs.cmu.edu/~htong/soft.htm

Page 66: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 66

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools
– CenterPiece Subgraphs; G-Ray
– OddBall (anomaly detection)
– PEGASUS

• Problem#3: Scalability

• Conclusions

NSA'10

Page 67: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Graph X-Ray: Fast Best-Effort Pattern Matching in Large Attributed Graphs

Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad

KDD’07

Page 68: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

[Figure: G-Ray example]
Input: Attributed Data Graph; Query Graph (node attributes: Accountant, CEO, Manager, SEC)
Output: Matching Subgraph

Page 69: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

69

Effectiveness: star-query

Query Result

NSA'10 C. Faloutsos (CMU)

Page 70: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 70

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools
– CenterPiece Subgraphs
– OddBall (anomaly detection)
• Problem#3: Scalability - PEGASUS

• Conclusions

NSA'10

Page 71: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

OddBall: Spotting Anomalies in Weighted Graphs

Leman Akoglu, Mary McGlohon, Christos Faloutsos

Carnegie Mellon University

School of Computer Science

PAKDD 2010, Hyderabad, India

Page 72: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Main idea

For each node,

• extract ‘ego-net’ (=1-step-away neighbors)

• Extract features (#edges, total weight, etc etc)

• Compare with the rest of the population

C. Faloutsos (CMU) 72NSA'10

Page 73: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

What is an egonet?

ego

73

egonet

C. Faloutsos (CMU)NSA'10

Page 74: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Selected Features

• Ni: number of neighbors (degree) of ego i
• Ei: number of edges in egonet i
• Wi: total weight of egonet i
• λw,i: principal eigenvalue of the weighted adjacency matrix of egonet i
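The first three features can be computed directly from a weighted adjacency structure. The sketch below is a minimal illustration with hypothetical helper names and a made-up toy graph; it leaves out the eigenvalue feature and the comparison against the rest of the population.

```python
def undirected(weighted_edges):
    """Build a symmetric adjacency dict {u: {v: weight}} from (u, v, weight) triples."""
    adj = {}
    for u, v, w in weighted_edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def egonet_features(adj, ego):
    """Return (N, E, W): neighbour count, edge count and total weight of the ego's egonet."""
    members = set(adj[ego]) | {ego}           # the ego plus its 1-step neighbours
    n_edges = 0
    total_weight = 0.0
    for u in members:
        for v, w in adj[u].items():
            if v in members and u < v:        # count each undirected edge once
                n_edges += 1
                total_weight += w
    return len(adj[ego]), n_edges, total_weight

graph = undirected([
    ("a", "b", 1.0), ("a", "c", 2.0), ("a", "d", 1.0),
    ("b", "c", 3.0),                          # edge between two neighbours of "a"
    ("d", "e", 5.0),                          # leaves the egonet of "a", so it is not counted
])
features = egonet_features(graph, "a")
```

With N = 3 neighbours, a pure star would have E = 3 and a full clique E = 6; the toy ego sits in between with E = 4.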

74C. Faloutsos (CMU)NSA'10

Page 75: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Near-Clique/Star

75NSA'10 C. Faloutsos (CMU)

Page 77: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 77

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools
– CenterPiece Subgraphs
– OddBall (anomaly detection)
• Problem#3: Scalability - PEGASUS

• Conclusions

NSA'10

Page 78: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 78

Outline – Algorithms & results

(status per algorithm; columns: Centralized | Hadoop/PEGASUS)
• Degree Distr.: old | old
• Pagerank: old | old
• Diameter/ANF: old | DONE
• Conn. Comp: old | DONE
• Triangles: DONE
• Visualization: STARTED

NSA'10

Page 79: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

HADI for diameter estimation

• Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10

• Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)

• Our HADI: linear on E (~10B)
– Near-linear scalability wrt # machines
– Several optimizations -> 5x faster
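For contrast with HADI, the naive route can be sketched as one breadth-first search per node. This is not HADI itself; it only shows the all-pairs work that becomes prohibitive at a billion nodes. The 90% quantile used for the "effective diameter" is a common convention assumed here, and the function names are hypothetical.

```python
from collections import deque

def hop_counts(adj):
    """counts[h] = number of ordered pairs (u, v), u != v, at BFS distance exactly h."""
    counts = {}
    for src in adj:                           # one full BFS per node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[dist[v]] = counts.get(dist[v], 0) + 1
                    queue.append(v)
    return counts

def effective_diameter(adj, quantile=0.9):
    """Smallest h such that at least `quantile` of the reachable pairs are within h hops."""
    counts = hop_counts(adj)
    total = sum(counts.values())
    reached = 0
    for h in sorted(counts):
        reached += counts[h]
        if reached >= quantile * total:
            return h
    return 0                                  # graph without any edges

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # toy path graph 0-1-2-3
counts = hop_counts(path)
diameter = effective_diameter(path)
```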

C. Faloutsos (CMU) 79NSA'10

Page 80: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• Largest publicly available graph ever studied.

[Plot: count versus radius; annotations: ??; 19+? [Barabasi+]]

Page 81: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• Largest publicly available graph ever studied.

[Plot: count versus radius; annotations: 19+? [Barabasi+]; 14 (dir.); ~7 (undir.)]

Page 82: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• effective diameter: surprisingly small.
• Multi-modality: probably mixture of cores.

82C. Faloutsos (CMU)NSA'10

Page 84: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Radius Plot of GCC of YahooWeb.

84C. Faloutsos (CMU)NSA'10

Page 85: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Running time - Kronecker and Erdos-Renyi graphs with billions of edges.

details

Page 87: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Generalized Iterated Matrix Vector Multiplication (GIMV)

C. Faloutsos (CMU) 87

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up).

NSA'10

Page 88: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Generalized Iterated Matrix Vector Multiplication (GIMV)

C. Faloutsos (CMU) 88

• PageRank
• proximity (RWR)
• Diameter
• Connected components
• (eigenvectors,
• Belief Prop.
• … )

Matrix–vector multiplication (iterated)
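The idea of expressing these tasks as a generalized, iterated matrix-vector multiplication can be sketched as follows. The three user-supplied operations (called `combine2`, `combine_all` and `assign` in this sketch) are one reading of the generalization, since the slides do not spell them out. The code runs on a single machine with plain dicts and says nothing about the Hadoop implementation.

```python
def gim_v_step(matrix, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step.

    matrix: dict {(i, j): m_ij} of non-zero entries; vector: dict {i: v_i}.
    The new v_i is assign(v_i, combine_all([combine2(m_ij, v_j) for each non-zero m_ij])).
    """
    partial = {i: [] for i in vector}
    for (i, j), m_ij in matrix.items():
        partial[i].append(combine2(m_ij, vector[j]))
    return {i: assign(vector[i], combine_all(partial[i])) for i in vector}

def connected_components(nodes, edges):
    """Label every node with the smallest node id in its component."""
    matrix = {}
    for u, v in edges:                        # symmetric 0/1 adjacency matrix
        matrix[(u, v)] = 1
        matrix[(v, u)] = 1
    labels = {n: n for n in nodes}
    while True:
        new_labels = gim_v_step(
            matrix, labels,
            combine2=lambda m_ij, v_j: m_ij * v_j,
            combine_all=lambda xs: min(xs, default=float("inf")),
            assign=min,
        )
        if new_labels == labels:              # fixed point reached
            return labels
        labels = new_labels

components = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Replacing the three operations with multiply, sum and overwrite gives the ordinary matrix-vector product that a PageRank-style iteration builds on; a damping term would be added inside `combine_all`.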

NSA'10

details

Page 89: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

89

Example: GIM-V At Work

• Connected Components

Size

Count

C. Faloutsos (CMU)NSA'10

Page 90: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

90

Example: GIM-V At Work

• Connected Components

Size

Count

C. Faloutsos (CMU)NSA'10

~0.7B singleton nodes

Page 92: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

92

Example: GIM-V At Work

• Connected Components

[Plot: count versus size of connected components]
• 300-size cmpt X 500. Why?
• 1100-size cmpt X 65. Why?

C. Faloutsos (CMU)NSA'10

Page 93: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

93

Example: GIM-V At Work

• Connected Components

Size

Count

suspicious financial-advice sites (not existing now)

C. Faloutsos (CMU)NSA'10

Page 94: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

94

GIM-V At Work
• Connected Components over Time
• LinkedIn: 7.5M nodes and 58M edges

Stable tail slope after the gelling point

C. Faloutsos (CMU)NSA'10

Page 96: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 96

Triangles : Computations [Tsourakakis ICDM 2008]

But: triangles are expensive to compute (3-way join; several approx. algos)

Q: Can we do that quickly?
A: Yes!

$\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$, where $\lambda_i$ are the eigenvalues of the adjacency matrix

(and, because of skewness, we only need the top few eigenvalues!)

Mentioned already

NSA'10

Page 97: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 97

Triangle Law: #1 [Tsourakakis ICDM 2008]

[Plots for three datasets: ASN, HEP-TH, Epinions]
X-axis: # of triangles a node participates in
Y-axis: count of such nodes

Mentioned already

NSA'10

Page 99: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

Visualization: ShiftR

• Supporting Ad Hoc Sensemaking: Integrating Cognitive, HCI, and Data Mining Approaches. Aniket Kittur, Duen Horng (‘Polo’) Chau, Christos Faloutsos, Jason I. Hong. Sensemaking Workshop at CHI 2009, April 4-5. Boston, MA, USA.

C. Faloutsos (CMU) 99NSA'10

Page 101: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 101

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools

• Problem#3: Scalability

• (additional topics, skipped)

• Conclusions

NSA'10

Page 102: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 102

Other topics - part#1 - tools

• Community detection – how many?
– Cross-Associations [Chakrabarti+, KDD 2004]
• Time-evolving graphs
– Tensors [Sun+, KDD’06], [Kolda+ ICDM’05]
– GraphScope [Sun+, KDD’07]
• Graph compression
– CUR decomposition [Sun+ SDM’07]

NSA'10

Page 105: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

105

Tensors

• Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’)

PARAFAC tensor decomposition (generalization of SVD)

[Figure: author x keyword matrices stacked over the years 1990, 1991, 1992, approximated (~) by a sum (+) of components]

Page 106: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 106

Other topics – part#2 - generators

• Kronecker [PKDD’05];

• Random Typing [Akoglu+, PKDD’09]

NSA'10

Page 107: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

NSA'10 C. Faloutsos (CMU) 107

Kronecker Product – a Graph
• One of the most realistic generators, with provable properties
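The basic construction can be sketched with a small 0/1 initiator matrix that is repeatedly Kronecker-multiplied with itself. This is the deterministic version only. The initiator below and the function names are assumptions made for the example; the generator in the cited work has further variants that are not covered here.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a)) for k in range(len(b))
    ]

def kronecker_power(initiator, k):
    """k-th Kronecker power of the initiator adjacency matrix (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kronecker(result, initiator)
    return result

# Assumed initiator: a 3-node path with self-loops (7 non-zero entries).
initiator = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
g2 = kronecker_power(initiator, 2)
n_nodes = len(g2)
n_entries = sum(sum(row) for row in g2)
```

Each power multiplies the node count by 3 and the non-zero count by 7, so the entries grow faster than the nodes. In that sense the construction densifies by design.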

Page 108: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 108

Other topics - part#3 – virus propagation

• Epidemic threshold for SIS: depends only on first eigenvalue of adjacency matrix

• [Chakrabarti+, TISSEC’07]

• Immunization policies [Tong+, under review]

• Drinking water sensor placement [KDD’07]
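The epidemic-threshold bullet can be made concrete. In the form usually quoted from the cited work, an SIS infection with per-link infection rate beta and recovery rate delta dies out when beta/delta is below 1/λ1, where λ1 is the largest eigenvalue of the adjacency matrix. The slide states only the dependence on λ1; the symbols, the inequality as written here and the function names are assumptions of this sketch, which also assumes NumPy.

```python
import numpy as np

def epidemic_threshold(adjacency):
    """1 / lambda_1 for a symmetric adjacency matrix."""
    largest = np.linalg.eigvalsh(adjacency)[-1]   # eigvalsh returns ascending eigenvalues
    return 1.0 / largest

def dies_out(adjacency, beta, delta):
    """True if beta / delta is below the threshold 1 / lambda_1."""
    return beta / delta < epidemic_threshold(adjacency)

# Complete graph on 5 nodes: lambda_1 = 4, so the threshold is 0.25.
k5 = np.ones((5, 5)) - np.eye(5)
threshold = epidemic_threshold(k5)
```

With these toy numbers, beta = 0.1 and delta = 0.5 give a ratio of 0.2, which is below the threshold, while beta = 0.2 gives 0.4, which is above it.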

NSA'10

Page 109: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

More info

Tutorial on graph mining: KDD’09 (w/ Gary Miller and C. Tsourakakis)
www.cs.cmu.edu/~christos/TALKS/09-KDD-tutorial/

Tutorial on tensors: SIGMOD’07 (w/ T. Kolda and J. Sun)
www.cs.cmu.edu/~christos/TALKS/SIGMOD-07-tutorial/

NSA'10 C. Faloutsos (CMU) 109

Page 110: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 110

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Tools

• Problem#3: Scalability

• (additional topics, skipped)

• Conclusions

NSA'10

Page 111: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 111

OVERALL CONCLUSIONS – low level:

• Several new patterns (fortification, triangle-laws, conn. components, etc)

• New tools:

– CenterPiece Subgraphs, G-Ray, anomaly detection (OddBall)

• Scalability: PEGASUS / hadoop

NSA'10

Page 112: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 112

OVERALL CONCLUSIONS – high level

• Large datasets may reveal patterns/outliers that would be invisible otherwise

• Terrific opportunities

– Large datasets, easily(*) available PLUS

– s/w and h/w developments

• Promising collaborations between DB/Sys, AI/Stat, sociology, marketing, epidemiology, ++

NSA'10

Page 113: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 113

References

• Leman Akoglu, Christos Faloutsos: RTG: A Recursive Realistic Graph Generator Using Random Typing. ECML/PKDD (1) 2009: 13-28

• Deepayan Chakrabarti, Christos Faloutsos: Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1): (2006)

NSA'10

Page 114: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 114

References

• Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jure Leskovec, Christos Faloutsos: Epidemic thresholds in real networks. ACM Trans. Inf. Syst. Secur. 10(4): (2008)

• Deepayan Chakrabarti, Jure Leskovec, Christos Faloutsos, Samuel Madden, Carlos Guestrin, Michalis Faloutsos: Information Survival Threshold in Sensor and P2P Networks. INFOCOM 2007: 1316-1324

NSA'10

Page 115: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 115

References

• Christos Faloutsos, Tamara G. Kolda, Jimeng Sun: Mining large graphs and streams using matrix and tensor tools. Tutorial, SIGMOD Conference 2007: 1174

NSA'10

Page 116: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 116

References

• T. G. Kolda and J. Sun. Scalable Tensor Decompositions for Multi-aspect Data Mining. In: ICDM 2008, pp. 363-372, December 2008.

NSA'10

Page 117: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 117

References

• Jure Leskovec, Jon Kleinberg and Christos Faloutsos: Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, KDD 2005 (Best Research paper award).

• Jure Leskovec, Deepayan Chakrabarti, Jon M. Kleinberg, Christos Faloutsos: Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. PKDD 2005: 133-145

NSA'10

Page 118: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 118

References

• Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM, Minneapolis, Minnesota, Apr 2007.

• Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos, GraphScope: Parameter-free Mining of Large Time-evolving Graphs ACM SIGKDD Conference, San Jose, CA, August 2007

NSA'10

Page 119: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

References

• Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond streams and graphs: dynamic tensor analysis. KDD 2006: 374-383

NSA'10 C. Faloutsos (CMU) 119

Page 120: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 120

References

• Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan, Fast Random Walk with Restart and Its Applications, ICDM 2006, Hong Kong.

• Hanghang Tong, Christos Faloutsos, Center-Piece Subgraphs: Problem Definition and Fast Solutions, KDD 2006, Philadelphia, PA

NSA'10

Page 121: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 121

References

• Hanghang Tong, Christos Faloutsos, Brian Gallagher, Tina Eliassi-Rad: Fast best-effort pattern matching in large attributed graphs. KDD 2007: 737-746

NSA'10

Page 122: Mining Billion-node Graphs: Patterns, Generators and Tools

CMU SCS

C. Faloutsos (CMU) 122

Project info: www.cs.cmu.edu/~pegasus

Akoglu, Leman

Chau, Polo

Kang, U

McGlohon, Mary

Tsourakakis, Babis

Tong, Hanghang

Prakash, Aditya

NSA'10

Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, INTEL, HP