Algorithms for Large Data Sets, Ziv Bar-Yossef, Lecture 7, May 14, 2006

1

Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 7

May 14, 2006

http://www.ee.technion.ac.il/courses/049011

2

Web Structure I: Power Laws and Small World Phenomenon

3

Outline

Power laws
The preferential attachment model
Small-world networks
The Watts-Strogatz model

4

Observed Phenomena

Few multi-billionaires, but many with modest income [Pareto, 1896]

Few frequent words, but many infrequent words [Zipf, 1932]

Few “mega-cities” but many small towns [Zipf, 1949]

Few web pages with high degree, but many with low degree [Kumar et al, 99] [Barabási & Albert, 99]

All the above obey power laws.

5

Power Law (Pareto) Distribution

Pr[X ≥ x] = (k/x)^α for x ≥ k
α > 0: shape parameter (“slope”)
k > 0: location parameter
Ex: (k = $1000, α = 2)
  1/100 earn ≥ $10,000
  1/10,000 earn ≥ $100,000
  1/1,000,000 earn ≥ $1,000,000
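The three figures in the example can be checked numerically. The sketch below assumes the tail form Pr[X ≥ x] = (k/x)^α, which is the form the example's numbers imply; the function name `pareto_tail` is illustrative only.

```python
def pareto_tail(x, k, alpha):
    """Pr[X >= x] for a Pareto distribution with location k and shape alpha."""
    if x < k:
        return 1.0  # all of the probability mass lies at or above k
    return (k / x) ** alpha


if __name__ == "__main__":
    # Location k = 1000 (dollars) and shape alpha = 2, as in the example.
    for income in (10_000, 100_000, 1_000_000):
        print(income, pareto_tail(income, k=1000, alpha=2))
```

Each tenfold increase in income divides the tail probability by 10^α = 100.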

6

Power Law Properties

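PDF: f(x) = α k^α / x^(α+1) for x ≥ k
Infinite mean for α ≤ 1
Infinite variance for α ≤ 2
When X is discrete, Pr[X = x] ≈ α k^α / x^(α+1), i.e. Pr[X = x] is proportional to 1/x^(α+1)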

7

Power Law Graphs

[Figure: the same power law shown on a linear-scale plot and on a log-log plot; on the log-log plot it is a straight line with slope = -α]
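The straight line on the log-log plot can be reproduced without plotting. This is a minimal sketch, assuming the tail Pr[X ≥ x] = (k/x)^α is the quantity on the vertical axis; the helper `loglog_slope` and the sample points are illustrative.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


k, alpha = 1000.0, 2.0
xs = [k * 2 ** i for i in range(1, 11)]   # sample points above k
tail = [(k / x) ** alpha for x in xs]     # Pr[X >= x] at those points
slope = loglog_slope(xs, tail)
print(slope)
```

Since log Pr[X ≥ x] = α log k - α log x, the fitted slope equals -α up to rounding.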

8

Scale-Free Distributions

Power laws are invariant to scale
Ex: (k = arbitrary, α = 2)
  1/100 earn ≥ 10k
  1/10,000 earn ≥ 100k
  1/1,000,000 earn ≥ 1000k
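Scale invariance can be stated as: the ratio Pr[X ≥ c·x] / Pr[X ≥ x] depends only on the factor c, not on x or on the unit k. The sketch below checks this, assuming the tail form Pr[X ≥ x] = (k/x)^α; the names `tail` and `tail_ratio` are illustrative.

```python
def tail(x, k, alpha=2.0):
    """Pr[X >= x] for a power law with location k and shape alpha."""
    return (k / x) ** alpha if x >= k else 1.0


def tail_ratio(x, c, k, alpha=2.0):
    """Pr[X >= c*x] / Pr[X >= x]: how much rarer values c times larger are."""
    return tail(c * x, k, alpha) / tail(x, k, alpha)


# The same factor c = 10 at very different starting points x (all >= k).
ratios = [tail_ratio(x, c=10, k=1.0) for x in (1.0, 50.0, 2000.0, 1e6)]
print(ratios)
```

Every ratio is c^(-α) = 1/100, whatever the starting point.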

9

Heavy Tailed Distributions

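In many “classical” distributions, the tail Pr[X ≥ x] decays exponentially (or faster) in x: “light tail”
  Ex: normal, exponential

In power law distributions, the tail Pr[X ≥ x] = (k/x)^α decays only polynomially in x: “heavy tail”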

10

Zipf’s Law

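Size of r-th largest city is proportional to 1/r^θ (classically θ ≈ 1)
Equivalent to a power law:
  X = size of a city
  Change variables: the r-th largest of N cities has size s proportional to r^(-θ), so r is proportional to s^(-1/θ) and Pr[X ≥ s] = r/N is proportional to s^(-1/θ), a power law with α = 1/θ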
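The change of variables can be checked on synthetic data. This sketch assumes city sizes that follow Zipf's law exactly, size(r) = C / r^θ; the symbol θ, the constant C = 1e6, and the function name are assumptions made for illustration. It uses the fact that the r-th largest of N sizes has exactly r sizes at or above it.

```python
import math


def tail_exponent_from_ranks(sizes):
    """Estimate alpha in Pr[X >= s] ~ s^(-alpha) from a list of sizes.

    Fits a least-squares line to log(rank / N) against log(size) and
    returns the negated slope.
    """
    ordered = sorted(sizes, reverse=True)
    n = len(ordered)
    xs = [math.log(s) for s in ordered]
    ys = [math.log((r + 1) / n) for r in range(n)]  # empirical Pr[X >= s]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den


theta = 0.5  # hypothetical Zipf exponent
sizes = [1e6 / r ** theta for r in range(1, 1001)]
alpha = tail_exponent_from_ranks(sizes)
print(alpha)
```

For exact Zipf data the recovered tail exponent is 1/θ.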

11

Power Laws and the Internet

Web Graph
  In- and out-degrees (in slope: ~2.1, out slope: ~2.7) [Kumar et al. 99, Barabási & Albert 99, Broder et al 00]
  Sizes of connected components [Broder et al 00]
  Website sizes [Huberman & Adamic 99]

Internet graph
  Degrees [Faloutsos, Faloutsos & Faloutsos 99]
  Eigenvalues [Mihail & Papadimitriou 02]

Traffic
  Number of visits to websites

12

Power Laws and Graphs

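If X is a random web page, then the probability that X has in-degree d is proportional to 1/d^2.1 (and similarly for the out-degree, with exponent ~2.7)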

What random graph model explains this phenomenon?

13

Erdős-Rényi Random Graphs

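G(n,p)
  n: size of the graph (fixed)
  p: edge existence probability (fixed)
  Every pair u,v is connected by an edge with probability p.

Theorem [Erdős & Rényi, 60]
  For any node x in G(n,p), the degree of x is binomially distributed: Pr[deg(x) = k] = C(n-1, k) p^k (1-p)^(n-1-k). It concentrates around np and its tail decays exponentially, not as a power law.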
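A quick experiment shows how tightly degrees in G(n,p) cluster around their mean, in contrast to a power law. This is a minimal sketch using only the standard library; the function name `gnp_degrees` and the parameters n = 1000, p = 0.01 are illustrative choices.

```python
import random
from itertools import combinations


def gnp_degrees(n, p, seed=0):
    """Sample G(n,p) and return the list of node degrees."""
    rng = random.Random(seed)
    deg = [0] * n
    for u, v in combinations(range(n), 2):
        if rng.random() < p:  # each pair gets an edge independently
            deg[u] += 1
            deg[v] += 1
    return deg


n, p = 1000, 0.01
deg = gnp_degrees(n, p)
mean_deg = sum(deg) / n
max_deg = max(deg)
print(mean_deg, max_deg)
```

The mean degree lands close to (n-1)p, and even the largest degree is only a small multiple of it, so there are no very high-degree nodes of the kind seen on the web.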

14

Preferential Attachment [Barabási & Albert 99]

A novel random graph model

Initialization: graph starts with a single node with two self loops.

Growth: At every step a new node v is added to the graph. v has a self loop and connects to one neighbor.

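Preferential attachment: v connects to u with probability proportional to u's current indegree: Pr[v connects to u] = indeg(u) / (sum of indeg(w) over all nodes w)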

The rich get richer / The winner takes it all
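The model is short enough to simulate directly. The sketch below is one possible reading of the slide, under a stated assumption: each new node picks its neighbor among the nodes that existed before it, with probability proportional to their indegree. The function name and the step count are illustrative.

```python
import random
from collections import Counter


def preferential_attachment(steps, seed=0):
    """Return the list of indegrees after `steps` steps of the model.

    Step 1 is the initial node with two self loops (indegree 2).
    Assumption: each later node v chooses its neighbor u among the
    nodes already present, with probability proportional to indeg(u).
    """
    rng = random.Random(seed)
    indeg = [2]
    # One entry per unit of indegree, so a uniform draw from this list
    # picks node u with probability indeg(u) / (total indegree).
    endpoints = [0, 0]
    for v in range(1, steps):
        u = rng.choice(endpoints)
        indeg[u] += 1        # edge v -> u
        endpoints.append(u)
        indeg.append(1)      # v's own self loop
        endpoints.append(v)
    return indeg


indeg = preferential_attachment(20_000)
counts = Counter(indeg)
frac = {k: counts[k] / len(indeg) for k in (1, 2, 3)}
print(frac, max(indeg))
```

Most nodes keep indegree 1 while a few early nodes accumulate very large indegrees, which is the "rich get richer" effect.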

15

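Why Does it Work?

X_k(t): # of nodes whose indegree = k after t steps (there are t nodes and the total indegree is 2t, so a given node of indegree k is chosen with probability k/(2t))

k = 1: Expected growth: E[X_1(t+1)] - X_1(t) = 1 - X_1(t)/(2t)

k > 1: Expected growth: E[X_k(t+1)] - X_k(t) = (k-1)·X_{k-1}(t)/(2t) - k·X_k(t)/(2t)

16

Why Does it Work? (2)

Fact: After sufficiently many steps, X_k(t)/t reaches a “steady state”.

c_k = value of X_k(t)/t at the steady state.
Since at steady state X_k(t) = c_k·t, the expected growth E[X_k(t+1)] - X_k(t) equals c_k.
Hence, c_1 = 1 - c_1/2 and, for k > 1, c_k = (k-1)·c_{k-1}/2 - k·c_k/2.

Therefore: c_1 = 2/3 and c_k = c_{k-1}·(k-1)/(k+2) for k > 1.

17

Why Does it Work? (3)

Then: c_k = c_1 · Π_{j=2..k} (j-1)/(j+2)
And: Π_{j=2..k} (j-1)/(j+2) = 6/(k(k+1)(k+2))
Therefore: c_k = 4/(k(k+1)(k+2)) ~ 4/k^3, a power law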
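The steady-state recurrence can be unrolled exactly with rational arithmetic. This sketch assumes the balance equations c_1 = 1 - c_1/2 and c_k = (k-1)·c_{k-1}/2 - k·c_k/2 for this model, which give c_1 = 2/3 and c_k = c_{k-1}·(k-1)/(k+2); it then compares the result with the closed form 4/(k(k+1)(k+2)). The function name is illustrative.

```python
from fractions import Fraction


def steady_state(kmax):
    """Unroll c_k = c_{k-1} * (k-1)/(k+2), starting from c_1 = 2/3."""
    c = {1: Fraction(2, 3)}
    for k in range(2, kmax + 1):
        c[k] = c[k - 1] * Fraction(k - 1, k + 2)
    return c


c = steady_state(50)
closed_form = {k: Fraction(4, k * (k + 1) * (k + 2)) for k in c}
print(c[1], c[2], c[3], c == closed_form)
```

Because c_k behaves like 4/k^3 for large k, the steady-state fraction of nodes with indegree k follows a power law.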

18

Six Degrees of Separation [Stanley Milgram, 67]

“Random starters” in Nebraska, Kansas, etc.
Destinations: in Boston
Intermediaries send postcards to Milgram
Findings: average of 6 postcards
“Conclusion”: every two people in the US are connected by a path of length ~6

19

Small-World Networks

Average diameter: length of shortest path from u to v, averaged over all pairs u,v

Clustering coefficient: fraction of neighbors of v that are neighbors of each other, averaged over all v

Small-world network: a sparse graph with average diameter O(log n) and a constant clustering coefficient
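Both quantities are easy to compute on a small graph. This is a minimal sketch, assuming an undirected graph stored as a dict that maps each node to the set of its neighbors; the function names are illustrative, nodes with fewer than two neighbors are counted as having clustering 0, and unreachable pairs are left out of the path-length average.

```python
from collections import deque
from itertools import combinations


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def clustering_coefficient(adj):
    """Average over v of the fraction of neighbor pairs of v that are adjacent."""
    acc = 0.0
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue  # counted as 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        acc += links / (d * (d - 1) / 2)
    return acc / len(adj)


triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_path_length(triangle_with_tail),
      clustering_coefficient(triangle_with_tail))
```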

20

The Web as a Small World Network

Low diameter
  Study of a synthetic web graph model [Albert, Jeong, Barabási 99]
    Average diameter of the Web is ~19
    Grows logarithmically with size of the Web
  Study of a large crawl [Broder et al 00]
    Average diameter of the SCC is ~16
    Maximum diameter of the SCC is ≥ 28
  Diameter of host graph [Adamic 99]
    Average diameter of SCC: ~4

High clustering coefficient
  Clustering coefficient of host graph [Adamic 99]
    Clustering coefficient: ~0.08 (compared to 0.001 in a comparable random graph)

21

Model for Small-World Networks [Watts & Strogatz 98]

One extreme: random networks
  Low diameter
  Low clustering coefficient

Other extreme: “regular” networks (e.g., a lattice)
  High clustering coefficient
  High diameter

Small-world: interpolation between the two
  Low diameter
  High clustering coefficient
  Regularity: social networking
  Randomness: individual interests

22

Random Network

The model:
  n vertices
  Every pair u,v is connected by an edge with probability p = d/n

Properties:
  Expected number of edges: ~dn/2
  Graph is connected w.h.p. (for d ≥ c·ln n with a constant c > 1)
  Diameter: O(log n) w.h.p.
  Clustering coefficient: ~p = d/n = o(1)

23

Ring Lattice

The model:
  n vertices on a circle
  Every vertex has d neighbors: the d/2 vertices to its right and the d/2 vertices to its left

Properties:
  Number of edges: dn/2
  Graph is connected
  Diameter: O(n/d)
  Clustering coefficient: 3(d - 2) / (4(d - 1)), which approaches 3/4 for large d
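The ring lattice properties can be verified on a small instance. The sketch below builds the lattice as a dict of neighbor sets and compares its measured clustering coefficient with the known closed form 3(d - 2) / (4(d - 1)); the function names and the values n = 30, d = 6 are illustrative.

```python
from itertools import combinations


def ring_lattice(n, d):
    """n nodes on a circle, each joined to its d/2 nearest nodes on each side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            w = (v + i) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def clustering_coefficient(adj):
    """Average over v of the fraction of neighbor pairs of v that are adjacent."""
    acc = 0.0
    for v, nbrs in adj.items():
        deg = len(nbrs)
        if deg < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        acc += links / (deg * (deg - 1) / 2)
    return acc / len(adj)


n, d = 30, 6
lattice = ring_lattice(n, d)
num_edges = sum(len(nbrs) for nbrs in lattice.values()) // 2
measured = clustering_coefficient(lattice)
predicted = 3 * (d - 2) / (4 * (d - 1))
print(num_edges, measured, predicted)
```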

24

Random Rewiring

Start from a ring lattice
for i = 1 to d/2 do
  for v = 1 to n do
    Pick i-th clockwise nearest neighbor of v
    With probability p, replace this neighbor by a random vertex
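The rewiring loop translates almost line for line into code. This is a sketch rather than a reference implementation: the function name is illustrative, and it assumes (as in the original Watts-Strogatz construction) that the random replacement vertex may not create a self loop or a duplicate edge.

```python
import random


def watts_strogatz(n, d, p, seed=0):
    """Ring lattice on n nodes with d neighbors each, rewired with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            w = (v + i) % n
            adj[v].add(w)
            adj[w].add(v)
    for i in range(1, d // 2 + 1):
        for v in range(n):
            w = (v + i) % n  # i-th clockwise nearest neighbor of v
            if w not in adj[v] or rng.random() >= p:
                continue
            # Assumption: no self loops and no duplicate edges.
            candidates = [u for u in range(n) if u != v and u not in adj[v]]
            if not candidates:
                continue
            u = rng.choice(candidates)
            adj[v].remove(w)
            adj[w].remove(v)
            adj[v].add(u)
            adj[u].add(v)
    return adj


graph = watts_strogatz(n=50, d=6, p=0.3, seed=1)
print(sum(len(nbrs) for nbrs in graph.values()) // 2)
```

Rewiring moves one endpoint of an edge and never adds or deletes edges, so the edge count stays at dn/2 for every p.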

25

Analysis

If p = 0, ring lattice
  High clustering coefficient
  High diameter

If p = 1, random network
  Logarithmic diameter
  Low clustering coefficient

However,
  Diameter goes down rapidly as p grows
  Clustering coefficient goes down slowly as p grows

Therefore, for small p, we get a small-world network.
  Logarithmic diameter
  High clustering coefficient
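A small experiment illustrates the two trends. The sketch below rewires a ring lattice at p = 0, 0.05 and 1 and measures average path length and clustering coefficient for each. All function names and the parameters n = 200, d = 8 are illustrative, and the rewiring assumes no self loops or duplicate edges.

```python
import random
from collections import deque
from itertools import combinations


def rewired_ring(n, d, p, rng):
    """Ring lattice with each lattice edge rewired with probability p."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            adj[v].add((v + i) % n)
            adj[(v + i) % n].add(v)
    for i in range(1, d // 2 + 1):
        for v in range(n):
            w = (v + i) % n
            if w in adj[v] and rng.random() < p:
                free = [u for u in range(n) if u != v and u not in adj[v]]
                if free:
                    u = rng.choice(free)
                    adj[v].remove(w)
                    adj[w].remove(v)
                    adj[v].add(u)
                    adj[u].add(v)
    return adj


def avg_path_length(adj):
    """Mean BFS distance over all ordered pairs of connected nodes."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def clustering(adj):
    """Average fraction of adjacent neighbor pairs (0 for degree < 2)."""
    acc = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            acc += links / (k * (k - 1) / 2)
    return acc / len(adj)


rng = random.Random(1)
results = {}
for p in (0.0, 0.05, 1.0):
    g = rewired_ring(200, 8, p, rng)
    results[p] = (avg_path_length(g), clustering(g))
    print(p, results[p])
```

In runs of this kind, a few random shortcuts at p = 0.05 shorten paths sharply while most local triangles survive, which is the small-world regime described above.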

26

End of Lecture 7
