15-826: Multimedia Databases and Data Mining€¦ · Galaxies (Sloan Digital Sky Survey w/ B....
Transcript of 15-826: Multimedia Databases and Data Mining€¦ · Galaxies (Sloan Digital Sky Survey w/ B....
15-826 C. Faloutsos
1
CMU SCS
15-826: Multimedia Databases
and Data Mining
Fractals - introduction
C. Faloutsos
15-826 Copyright: C. Faloutsos (2007) 2
CMU SCS
Outline
Goal: ‘Find similar / interesting things’
• Intro to DB
• Indexing - similarity search
• Data Mining
15-826 Copyright: C. Faloutsos (2007) 3
CMU SCS
Indexing - Detailed outline
• primary key indexing
• secondary key / multi-key indexing
• spatial access methods
– z-ordering
– R-trees
– misc
• fractals
– intro
– applications
• text
15-826 C. Faloutsos
2
15-826 Copyright: C. Faloutsos (2007) 4
CMU SCS
Intro to fractals - outline
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples and tools
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 Copyright: C. Faloutsos (2007) 5
CMU SCS
Road end-points of
Montgomery county:
•Q1: how many d.a. for an
R-tree?
•Q2 : distribution?
•not uniform
•not Gaussian
•no rules??
Problem #1: GIS - points
15-826 Copyright: C. Faloutsos (2007) 6
CMU SCS
Problem #2 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B.
Nichol)
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
"elliptical.cut.dat""spiral.cut.dat"
- ‘spiral’ and ‘elliptical’
galaxies
(stores and households ...)
- patterns?
- attraction/repulsion?
- how many ‘spi’ within r
from an ‘ell’?
15-826 C. Faloutsos
3
15-826 Copyright: C. Faloutsos (2007) 7
CMU SCS
Problem #3: traffic
• disk trace (from HP - J.
Wilkes); Web traffic - fit
a model
• how many explosions to
expect?
• queue length distr.?time
# bytes
15-826 Copyright: C. Faloutsos (2007) 8
CMU SCS
Problem #3: traffic
time
# bytes
Poisson
indep.,
ident. distr
15-826 Copyright: C. Faloutsos (2007) 9
CMU SCS
Problem #3: traffic
time
# bytes
Poisson
indep.,
ident. distr
15-826 C. Faloutsos
4
15-826 Copyright: C. Faloutsos (2007) 10
CMU SCS
Problem #3: traffic
time
# bytes
Poisson
indep.,
ident. distr
Q: Then, how to generate
such bursty traffic?
15-826 Copyright: C. Faloutsos (2007) 11
CMU SCS
Common answer:
• Fractals / self-similarities / power laws
• Seminal works from Hilbert, Minkowski,
Cantor, Mandelbrot, (Hausdorff, Lyapunov,
Ken Wilson, …)
15-826 Copyright: C. Faloutsos (2007) 12
CMU SCS
Road map
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples and tools
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 C. Faloutsos
5
15-826 Copyright: C. Faloutsos (2007) 13
CMU SCS
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...zero area;
infinite length!
15-826 Copyright: C. Faloutsos (2007) 14
CMU SCS
Definitions (cont’d)
• Paradox: Infinite perimeter ; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58...
15-826 Copyright: C. Faloutsos (2007) 15
CMU SCS
Dfn of fd:
ONLY for a perfectly self-similar point set:
=log(n)/log(f) = log(3)/log(2) = 1.58
...zero area;
infinite length!
15-826 C. Faloutsos
6
15-826 Copyright: C. Faloutsos (2007) 16
CMU SCS
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of aline?
• A: 1 (= log(2)/log(2)!)
15-826 Copyright: C. Faloutsos (2007) 17
CMU SCS
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of aline?
• A: 1 (= log(2)/log(2)!)
15-826 Copyright: C. Faloutsos (2007) 18
CMU SCS
Intrinsic (‘fractal’) dimension
• Q: dfn for a given set
of points?
42
33
24
15
yx
15-826 C. Faloutsos
7
15-826 Copyright: C. Faloutsos (2007) 19
CMU SCS
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of aline?
• A: nn ( <= r ) ~ r^1
(‘power law’: y=x^a)
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd== slope of (log(nn) vslog(r) )
15-826 Copyright: C. Faloutsos (2007) 20
CMU SCS
Intrinsic (‘fractal’) dimension
• Algorithm, to estimate it?
Notice
• avg nn(<=r) is exactly
tot#pairs(<=r) / N
including ‘mirror’ pairs
15-826 Copyright: C. Faloutsos (2007) 21
CMU SCS
Sierpinsky triangle
log( r )
log(#pairs
within <=r )
1.58
== ‘correlation integral’
15-826 C. Faloutsos
8
15-826 Copyright: C. Faloutsos (2007) 22
CMU SCS
Observations:
• Euclidean objects have integer fractal
dimensions
– point: 0
– lines and smooth curves: 1
– smooth surfaces: 2
• fractal dimension -> roughness of the
periphery
15-826 Copyright: C. Faloutsos (2007) 23
CMU SCS
Important properties
• fd = embedding dimension -> uniform
pointset
• a point set may have several fd, depending
on scale
15-826 Copyright: C. Faloutsos (2007) 24
CMU SCS
Important properties
• fd = embedding dimension -> uniform
pointset
• a point set may have several fd, depending
on scale
2-d
15-826 C. Faloutsos
9
15-826 Copyright: C. Faloutsos (2007) 25
CMU SCS
Important properties
• fd = embedding dimension -> uniform
pointset
• a point set may have several fd, depending
on scale
1-d
15-826 Copyright: C. Faloutsos (2007) 26
CMU SCS
Important properties
0-d
15-826 Copyright: C. Faloutsos (2007) 27
CMU SCS
Road map
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples and tools
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 C. Faloutsos
10
15-826 Copyright: C. Faloutsos (2007) 28
CMU SCS
Cross-roads of
Montgomery county:
•any rules?
Problem #1: GIS points
15-826 Copyright: C. Faloutsos (2007) 29
CMU SCS
Solution #1
A: self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws
(y=x^a, F=C*r^(-2))
• avg#neighbors(<= r )
= r^D
log( r )
log(#pairs(within <= r))
1.51
15-826 Copyright: C. Faloutsos (2007) 30
CMU SCS
Solution #1
A: self-similarity
• avg#neighbors(<= r )
~ r^(1.51)
log( r )
log(#pairs(within <= r))
1.51
15-826 C. Faloutsos
11
15-826 Copyright: C. Faloutsos (2007) 31
CMU SCS
Examples:MG county
• Montgomery County of MD (road end-
points)
15-826 Copyright: C. Faloutsos (2007) 32
CMU SCS
Examples:LB county
• Long Beach county of CA (road end-points)
15-826 Copyright: C. Faloutsos (2007) 33
CMU SCS
Solution#2: spatial d.m.
Galaxies ( ‘BOPS’ plot - [sigmod2000])
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
"elliptical.cut.dat""spiral.cut.dat"
log(#pairs)
log(r)
15-826 C. Faloutsos
12
15-826 Copyright: C. Faloutsos (2007) 34
CMU SCS
Solution#2: spatial d.m.
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
log(r)
log(#pairs within <=r )
spi-spi
spi-ell
ell-ell
- 1.8 slope
- plateau!
- repulsion!
15-826 Copyright: C. Faloutsos (2007) 35
CMU SCS
Spatial d.m.
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
log(r)
log(#pairs within <=r )
spi-spi
spi-ell
ell-ell
- 1.8 slope
- plateau!
- repulsion!
15-826 Copyright: C. Faloutsos (2007) 36
CMU SCS
Spatial d.m.
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
r1r2
r1
r2
Heuristic on choosing # of
clusters
15-826 C. Faloutsos
13
15-826 Copyright: C. Faloutsos (2007) 37
CMU SCS
Spatial d.m.
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
log(r)
log(#pairs within <=r )
spi-spi
spi-ell
ell-ell
- 1.8 slope
- plateau!
- repulsion!
15-826 Copyright: C. Faloutsos (2007) 38
CMU SCS
Spatial d.m.
1
10
100
1000
10000
100000
1e+06
1e+07
1e+08
1e+09
1e+10
1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100
"ell-ell.points.ns""spi-spi.points.ns"
"spi.dat-ell.dat.points"
log(r)
log(#pairs within <=r )
spi-spi
spi-ell
ell-ell
- 1.8 slope
- plateau!
-repulsion!!
-duplicates
15-826 Copyright: C. Faloutsos (2007) 39
CMU SCS
Solution #3: traffic
• disk traces: self-similar:
time
#bytes
15-826 C. Faloutsos
14
15-826 Copyright: C. Faloutsos (2007) 40
CMU SCS
Solution #3: traffic
• disk traces (80-20 ‘law’ = ‘multifractal’)
time
#bytes
20% 80%
15-826 Copyright: C. Faloutsos (2007) 41
CMU SCS
80-20 / multifractals20 80
0
200000
400000
600000
800000
1e+06
1.2e+06
0 2e+06 4e+06 6e+06 8e+06 1e+07
num
ber
of K
b
time (in milliseconds)
15-826 Copyright: C. Faloutsos (2007) 42
CMU SCS
80-20 / multifractals
20• p ; (1-p) in general
• yes, there are
dependencies
80
15-826 C. Faloutsos
15
15-826 Copyright: C. Faloutsos (2007) 43
CMU SCS
More on 80/20: PQRS
• Part of ‘self-* storage’ project
time
cylinder#
15-826 Copyright: C. Faloutsos (2007) 44
CMU SCS
More on 80/20: PQRS
• Part of ‘self-* storage’ project
p q
r s
q
r s
15-826 Copyright: C. Faloutsos (2007) 45
CMU SCS
Solution#3: traffic
Clarification:
• fractal: a set of points that is self-similar
• multifractal: a probability density function
that is self-similar
Many other time-sequences are
bursty/clustered: (such as?)
15-826 C. Faloutsos
16
15-826 Copyright: C. Faloutsos (2007) 46
CMU SCS
Example:
• network traffic
0
200000
400000
600000
800000
1e+06
1.2e+06
0 2e+06 4e+06 6e+06 8e+06 1e+07
num
ber
of K
b
time (in milliseconds)
http://repository.cs.vt.edu/lbl-conn-7.tar.Z
15-826 Copyright: C. Faloutsos (2007) 47
CMU SCS
Web traffic
• [Crovella Bestavros, SIGMETRICS’96]
Chronological time (slots of 1000 sec)
Byte
s
0 1000 2000 3000
05*1
0^6
10^7
1.5
*10^7
2*1
0^7
2.5
*10^7
3*1
0^7
Chronological time (slots of 100 sec)
Byte
s
23300 23400 23500 23600
02*1
0^6
4*1
0^6
6*1
0^6
8*1
0^6
Chronological time (slots of 10 sec)
Byte
s
234200 234250 234300 234350 234400 234450
0100000
200000
300000
400000
500000
Chronological time (slots of 1 sec)
Byte
s
2342600 2342650 2342700 2342750 2342800 2342850 2342900
050000
100000
150000
Figure 2: Tra�c Bursts over Four Orders of Magnitude; Upper Left: 1000, Upper Right: 100, Lower Left: 10, and Lower Right:1 Second Aggegrations. (Actual Transfers)
this e�ect is by visually inspecting time series plots of tra�cdemand.
In Figure 2 we show four time series plots of the WWWtra�c induced by our reference traces. The plots are producedby aggregating byte tra�c into discrete bins of 1, 10, 100, or1000 seconds.
The upper left plot is a complete presentation of the entiretra�c time series using 1000 second (16.6 minute) bins. Thediurnal cycle of network demand is clearly evident, and dayto day activity shows noticeable bursts. However, even withinthe active portion of a single day there is signi�cant burstiness;this is shown in the upper right plot, which uses a 100 secondtimescale and is taken from a typical day in the middle of thedataset. Finally, the lower left plot shows a portion of the 100second plot, expanded to 10 second detail; and the lower rightplot shows a portion of the lower left expanded to 1 seconddetail. These plots show signi�cant bursts occurring at thesecond-to-second level.
4.2.2 Statistical Analysis
We used the four methods for assessing self-similarity describedin Section 2: the variance-time plot, the rescaled range (orR/S) plot, the periodogram plot, and the Whittle estimator.We concentrated on individual hours from our tra�c series, soas to provide as nearly a stationary dataset as possible.
To provide an example of these approaches, analysis of asingle hour (4pm to 5pm, Thursday 5 Feb 1995) is shown inFigure 3. The �gure shows plots for the three graphical meth-ods: variance-time (upper left), rescaled range (upper right),and periodogram (lower center). The variance-time plot is lin-
ear and shows a slope that is distinctly di�erent from -1 (whichis shown for comparison); the slope is estimated using regres-sion as -0.48, yielding an estimate for H of 0.76. The R/S plotshows an asymptotic slope that is di�erent from 0.5 and from1.0 (shown for comparision); it is estimated using regressionas 0.75, which is also the corresponding estimate of H. Theperiodogram plot shows a slope of -0.66 (the regression line isshown), yielding an estimate of H as 0.83. Finally, the Whittleestimator for this dataset (not a graphical method) yields anestimate of H = 0:82 with a 95% con�dence interval of (0.77,0.87).
As discussed in Section 2.1, the Whittle estimator is theonly method that yields con�dence intervals on H, but short-range dependence in the timeseries can introduce inaccura-cies in its results. These inaccuracies are minimized by m-aggregating the timeseries for successively large values of m,and looking for a value of H around which the Whittle esti-mator stabilizes.
The results of this method for four busy hours are shown inFigure 4. Each hour is shown in one plot, from the busiest hourin the upper left to the least busy hour in the lower right. Inthese �gures the solid line is the value of the Whittle estimateof H as a function of the aggregation level m of the dataset.The upper and lower dotted lines are the limits of the 95%con�dence interval on H. The three level lines represent theestimate of H for the unaggregated dataset as given by thevariance-time, R-S, and periodogram methods.
The �gure shows that for each dataset, the estimate of Hstays relatively consistent as the aggregation level is increased,and that the estimates given by the three graphical methods
1000 sec; 100sec
10sec; 1sec
15-826 Copyright: C. Faloutsos (2007) 48
CMU SCS
Tape accesses
time
Tape#1 Tape# N
# tapes needed, to
retrieve n records?
(# days down, due to
failures / hurricanes /
communication
noise...)
15-826 C. Faloutsos
17
15-826 Copyright: C. Faloutsos (2007) 49
CMU SCS
Tape accesses
LD16
rdb50.out
rdb60.out
rdb75.out
rdb85.out
rdb86.out
rdb87.out
rdb90.out
LD16.avg
avg # blocks
records
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
0.00 50.00 100.00 150.00 200.00
time
Tape#1 Tape# N
# tapes retrieved
# qual. records
50-50 = Poisson
real
15-826 Copyright: C. Faloutsos (2007) 50
CMU SCS
Road map
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More tools and examples
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 Copyright: C. Faloutsos (2007) 51
CMU SCS
A counter-intuitive example
• avg degree is, say 3.3
• pick a node at random
– guess its degree,
exactly (-> “mode”)
degree
count
avg: 3.3
15-826 C. Faloutsos
18
15-826 Copyright: C. Faloutsos (2007) 52
CMU SCS
A counter-intuitive example
• avg degree is, say 3.3
• pick a node at random
– guess its degree,
exactly (-> “mode”)
• A: 1!!
degree
count
avg: 3.3
15-826 Copyright: C. Faloutsos (2007) 53
CMU SCS
A counter-intuitive example
• avg degree is, say 3.3
• pick a node at random- what is the degreeyou expect it to have?
• A: 1!!
• A’: very skewed distr.
• Corollary: the mean ismeaningless!
• (and std -> infinity (!))
degree
count
avg: 3.3
15-826 Copyright: C. Faloutsos (2007) 54
CMU SCS
Rank exponent R• Power law in the degree distribution
[SIGCOMM99]
internet domains
log(rank)
log(degree)
-0.82
att.com
ibm.com
15-826 C. Faloutsos
19
15-826 Copyright: C. Faloutsos (2007) 55
CMU SCS
More tools
• Zipf’s law
• Korcak’s law / “fat fractals”
15-826 Copyright: C. Faloutsos (2007) 56
CMU SCS
A famous power law: Zipf’s
law
• Q: vocabulary word frequency in a document
- any pattern?
aaron zoo
freq.
15-826 Copyright: C. Faloutsos (2007) 57
CMU SCS
A famous power law: Zipf’s
law
• Bible - rank vs
frequency (log-log)
log(rank)
log(freq)
“a”
“the”
15-826 C. Faloutsos
20
15-826 Copyright: C. Faloutsos (2007) 58
CMU SCS
A famous power law: Zipf’s
law
• Bible - rank vs
frequency (log-log)
• similarly, in many
other languages; for
customers and sales
volume; city
populations etc etclog(rank)
log(freq)
15-826 Copyright: C. Faloutsos (2007) 59
CMU SCS
A famous power law: Zipf’s
law
•Zipf distr:
freq = 1/ rank
•generalized Zipf:
freq = 1 / (rank)^a
log(rank)
log(freq)
15-826 Copyright: C. Faloutsos (2007) 60
CMU SCS
Olympic medals (Sidney):
y = -0.9676x + 2.3054
R2 = 0.9458
0
0.5
1
1.5
2
2.5
0 0.5 1 1.5 2
Series1
Linear (Series1)
rank
log(#medals)
15-826 C. Faloutsos
21
15-826 Copyright: C. Faloutsos (2007) 61
CMU SCS
Olympic medals (Sidney’00,
Athens’04):
log( rank)
log(#medals)
0
0.5
1
1.5
2
2.5
0 0.5 1 1.5 2
athens
sidney
15-826 Copyright: C. Faloutsos (2007) 62
CMU SCS
TELCO data
# of service units
count of
customers
‘best customer’
15-826 Copyright: C. Faloutsos (2007) 63
CMU SCS
SALES data – store#96
# units sold
count of
products
“aspirin”
15-826 C. Faloutsos
22
15-826 Copyright: C. Faloutsos (2007) 64
CMU SCS
More power laws: areas –
Korcak’s law
Scandinavian lakes
Any pattern?
15-826 Copyright: C. Faloutsos (2007) 65
CMU SCS
More power laws: areas –
Korcak’s law
Scandinavian lakes
area vs
complementary
cumulative count
(log-log axes) 0
2
4
6
8
10
0 2 4 6 8 10 12 14
Nu
mb
er
of
reg
ion
s
Area
Area distribution of regions (SL dataset)
-0.85*x +10.8
log(count( >= area))
log(area)
15-826 Copyright: C. Faloutsos (2007) 66
CMU SCS
More power laws: Korcak
1
10
100
1000
1 10 100 1000 10000 100000
Num
ber
of is
lands
Area
Area distribution of Japan Archipelago
Japan islands;
area vs cumulative
count (log-log axes)log(area)
log(count( >= area))
15-826 C. Faloutsos
23
15-826 Copyright: C. Faloutsos (2007) 67
CMU SCS
(Korcak’s law: Aegean islands)
15-826 Copyright: C. Faloutsos (2007) 68
CMU SCS
Korcak’s law & “fat fractals”
How to generate such regions?
15-826 Copyright: C. Faloutsos (2007) 69
CMU SCS
Korcak’s law & “fat fractals”Q: How to generate such regions?
A: recursively, from a single region
15-826 C. Faloutsos
24
15-826 Copyright: C. Faloutsos (2007) 70
CMU SCS
so far we’ve seen:
• concepts:
– fractals, multifractals and fat fractals
• tools:
– correlation integral (= pair-count plot)
– rank/frequency plot (Zipf’s law)
– CCDF (Korcak’s law)
15-826 Copyright: C. Faloutsos (2007) 71
CMU SCS
so far we’ve seen:
• concepts:
– fractals, multifractals and fat fractals
• tools:
– correlation integral (= pair-count plot)
– rank/frequency plot (Zipf’s law)
– CCDF (Korcak’s law)
same
info
15-826 Copyright: C. Faloutsos (2007) 72
CMU SCS
Road map
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More tools and examples
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 C. Faloutsos
25
15-826 Copyright: C. Faloutsos (2007) 73
CMU SCS
Other applications: Internet
• How does the internet look like?
CMU
15-826 Copyright: C. Faloutsos (2007) 74
CMU SCS
Other applications: Internet
• How does the internet look like?
• Internet routers: how many neighbors
within h hops?
CMU
15-826 Copyright: C. Faloutsos (2007) 75
CMU SCS
(reminder: our tool-box:)
• concepts:
– fractals, multifractals and fat fractals
• tools:
– correlation integral (= pair-count plot)
– rank/frequency plot (Zipf’s law)
– CCDF (Korcak’s law)
15-826 C. Faloutsos
26
15-826 Copyright: C. Faloutsos (2007) 76
CMU SCS
Internet topology
• Internet routers: how many neighbors
within h hops?
Reachability function:
number of neighbors
within r hops, vs r (log-
log).
Mbone routers, 1995log(hops)
log(#pairs)
2.8
15-826 Copyright: C. Faloutsos (2007) 77
CMU SCS
More power laws on the
Internet
degree vs rank, for Internet domains
(log-log) [sigcomm99]
log(rank)
log(degree)
-0.82
15-826 Copyright: C. Faloutsos (2007) 78
CMU SCS
More power laws - internet
• pdf of degrees: (slope: 2.2 )
1
10
100
1000
10000
1 10 100
"981205.out"exp(8.11393) * x ** ( -2.20288 )Log(count)
Log(degree)
-2.2
15-826 C. Faloutsos
27
15-826 Copyright: C. Faloutsos (2007) 79
CMU SCS
Even more power laws on the
Internet
Scree plot for Internet domains (log-
log) [sigcomm99]
log(i)
log( i-th eigenvalue)
1
10
100
1 10 100
"971108.internet.svals"exp(3.57926) * x ** ( -0.471327 )
0.47
15-826 Copyright: C. Faloutsos (2007) 80
CMU SCS
Fractals & power laws:
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
15-826 Copyright: C. Faloutsos (2007) 81
CMU SCS
More apps: Brain scans
• Oct-trees; brain-scans
7
8
9
10
11
12
13
14
15
16
4 4.5 5 5.5 6 6.5 7
log
2(
octr
ee
le
ave
s)
level k
"oct-count"x*2.63871-2.9122
octree levels
Log(#octants)
2.63 =
fd
15-826 C. Faloutsos
28
15-826 Copyright: C. Faloutsos (2007) 82
CMU SCS
More apps: Medical images
[Burdett et al, SPIE ‘93]:
• benign tumors: fd ~ 2.37
• malignant: fd ~ 2.56
15-826 Copyright: C. Faloutsos (2007) 83
CMU SCS
More fractals:
• cardiovascular system: 3 (!)
• lungs: 2.9
15-826 Copyright: C. Faloutsos (2007) 84
CMU SCS
Fractals & power laws:
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
15-826 C. Faloutsos
29
15-826 Copyright: C. Faloutsos (2007) 85
CMU SCS
More fractals:
• Coastlines: 1.2-1.58
1 1.1
1.3
15-826 Copyright: C. Faloutsos (2007) 86
CMU SCS
15-826 Copyright: C. Faloutsos (2007) 87
CMU SCS
More fractals:
• the fractal dimension for the Amazon river
is 1.85 (Nile: 1.4)
[ems.gphys.unc.edu/nonlinear/fractals/examples.html]
15-826 C. Faloutsos
30
15-826 Copyright: C. Faloutsos (2007) 88
CMU SCS
More fractals:
• the fractal dimension for the Amazon river
is 1.85 (Nile: 1.4)
[ems.gphys.unc.edu/nonlinear/fractals/examples.html]
15-826 Copyright: C. Faloutsos (2007) 89
CMU SCS
More power laws
• Energy of earthquakes (Gutenberg-Richter
law) [simscience.org]
log(freq)
magnitudeday
amplitude
15-826 Copyright: C. Faloutsos (2007) 90
CMU SCS
Fractals & power laws:
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
15-826 C. Faloutsos
31
15-826 Copyright: C. Faloutsos (2007) 91
CMU SCS
More fractals:
stock prices (LYCOS) - random walks: 1.5
1 year 2 years
15-826 Copyright: C. Faloutsos (2007) 92
CMU SCS
Even more power laws:
• Income distribution (Pareto’s law)
• size of firms
• publication counts (Lotka’s law)
15-826 Copyright: C. Faloutsos (2007) 93
CMU SCS
Even more power laws:
library science (Lotka’s law of publication
count); and citation counts:
(citeseer.nj.nec.com 6/2001)
1
10
100
100 1000 10000
log c
ount
log # citations
’cited.pdf’
log(#citations)
log(count)
Ullman
15-826 C. Faloutsos
32
15-826 Copyright: C. Faloutsos (2007) 94
CMU SCS
Even more power laws:
• web hit counts [w/ A. Montgomery]
Web Site Traffic
log(freq)
log(count)
Zipf
“yahoo.com”
15-826 Copyright: C. Faloutsos (2007) 95
CMU SCS
Fractals & power laws:
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
15-826 Copyright: C. Faloutsos (2007) 96
CMU SCS
Power laws, cont’d
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
log indegree
- log(freq)
from [Ravi Kumar,
Prabhakar Raghavan,
Sridhar Rajagopalan,
Andrew Tomkins ]
15-826 C. Faloutsos
33
15-826 Copyright: C. Faloutsos (2007) 97
CMU SCS
Power laws, cont’d
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
log indegree
log(freq)
from [Ravi Kumar,
Prabhakar Raghavan,
Sridhar Rajagopalan,
Andrew Tomkins ]
15-826 Copyright: C. Faloutsos (2007) 98
CMU SCS
“Foiled by power law”
• [Broder+, WWW’00]
“The anomalous bump at 120
on the x-axis
is due a large clique
formed by a single spammer”
(log) in-degree
(log) count
15-826 Copyright: C. Faloutsos (2007) 99
CMU SCS
Power laws, cont’d
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
• length of file transfers [Crovella+Bestavros
‘96]
• duration of UNIX jobs [Harchol-Balter]
15-826 C. Faloutsos
34
15-826 Copyright: C. Faloutsos (2007) 100
CMU SCS
Even more power laws:
• Distribution of UNIX file sizes
• web hit counts [Huberman]
15-826 Copyright: C. Faloutsos (2007) 101
CMU SCS
Road map
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples and tools
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 Copyright: C. Faloutsos (2007) 102
CMU SCS
What else can they solve?
• separability [KDD’02]
• forecasting [CIKM’02]
• dimensionality reduction [SBBD’00]
• non-linear axis scaling [KDD’02]
• disk trace modeling [PEVA’02]
• selectivity of spatial/multimedia queries
[PODS’94, VLDB’95, ICDE’00]
• ...
15-826 C. Faloutsos
35
15-826 Copyright: C. Faloutsos (2007) 103
CMU SCS
Settings for fractals:
Points; areas (-> fat fractals), eg:
15-826 Copyright: C. Faloutsos (2007) 104
CMU SCS
Settings for fractals:
Points; areas, eg:
• cities/stores/hospitals, over earth’s surface
• time-stamps of events (customer arrivals,
packet losses, criminal actions) over time
• regions (sales areas, islands, patches of
habitats) over space
15-826 Copyright: C. Faloutsos (2007) 105
CMU SCS
Settings for fractals:
• customer feature vectors (age, income,
frequency of visits, amount of sales per
visit)
‘good’ customers
‘bad’ customers
15-826 C. Faloutsos
36
15-826 Copyright: C. Faloutsos (2007) 106
CMU SCS
Some uses of fractals:
• Detect non-existence of rules (if points are
uniform)
• Detect non-homogeneous regions (eg., legal
login time-stamps may have different fd
than intruders’)
• Estimate number of neighbors / customers /
competitors within a radius
15-826 Copyright: C. Faloutsos (2007) 107
CMU SCS
Multi-Fractals
Setting: points or objects, w/ some value, eg:
– cities w/ populations
– positions on earth and amount of gold/water/oil
underneath
– product ids and sales per product
– people and their salaries
– months and count of accidents
15-826 Copyright: C. Faloutsos (2007) 108
CMU SCS
Use of multifractals:
• Estimate tape/disk accesses
– how many of the 100 tapes contain my 50
phonecall records?
– how many days without an accident?
time
Tape#1 Tape# N
15-826 C. Faloutsos
37
15-826 Copyright: C. Faloutsos (2007) 109
CMU SCS
Use of multifractals
• how often do we exceed the threshold?
time
#bytes
Poisson
15-826 Copyright: C. Faloutsos (2007) 110
CMU SCS
Use of multifractals cont’d
• Extrapolations for/from samples
time
#bytes
15-826 Copyright: C. Faloutsos (2007) 111
CMU SCS
Use of multifractals cont’d
• How many distinct products account for
90% of the sales?20% 80%
15-826 C. Faloutsos
38
15-826 Copyright: C. Faloutsos (2007) 112
CMU SCS
Road map
• Motivation – 3 problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples and tools
• Discussion - putting fractals to work!
• Conclusions – practitioner’s guide
• Appendix: gory details - boxcounting plots
15-826 Copyright: C. Faloutsos (2007) 113
CMU SCS
Conclusions
• Real data often disobey textbook
assumptions (Gaussian, Poisson,
uniformity, independence)
15-826 Copyright: C. Faloutsos (2007) 114
CMU SCS
Conclusions - cont’d
Self-similarity & power laws: appear in many
cases
Bad news:
lead to skewed distributions
(no Gaussian, Poisson,
uniformity, independence,
mean, variance)
Good news:
• ‘correlation integral’for separability
• rank/frequency plots
• 80-20 (multifractals)• (Hurst exponent,
• strange attractors,
• renormalization theory,
• ++)
15-826 C. Faloutsos
39
15-826 Copyright: C. Faloutsos (2007) 115
CMU SCS
Conclusions
• tool#1: (for points) ‘correlation integral’:
(#pairs within <= r) vs (distance r)
• tool#2: (for categorical values) rank-
frequency plot (a’la Zipf)
• tool#3: (for numerical values) CCDF:
Complementary cumulative distr. function
(#of elements with value >= a )
15-826 Copyright: C. Faloutsos (2007) 116
CMU SCS
Practitioner’s guide:• tool#1: #pairs vs distance, for a set of objects,
with a distance function (slope = intrinsic
dimensionality)
log(hops)
log(#pairs)
2.8
log( r )
log(#pairs(within <= r))
1.51
internet
MGcounty
15-826 Copyright: C. Faloutsos (2007) 117
CMU SCS
Practitioner’s guide:
• tool#2: rank-frequency plot (for categorical
attributes)
log(rank)
log(degree)
-0.82
internet domainsBible
log(freq)
log(rank)
15-826 C. Faloutsos
40
15-826 Copyright: C. Faloutsos (2007) 118
CMU SCS
Practitioner’s guide:• tool#3: CCDF, for (skewed) numerical
attributes, eg. areas of islands/lakes, UNIX
jobs...)
0
2
4
6
8
10
0 2 4 6 8 10 12 14
Nu
mb
er
of
reg
ion
s
Area
Area distribution of regions (SL dataset)
-0.85*x +10.8
log(count( >= area))
log(area)
scandinavian lakes
15-826 Copyright: C. Faloutsos (2007) 119
CMU SCS
Resources:
• Software for fractal dimension
– http://www.cs.cmu.edu/~christos
15-826 Copyright: C. Faloutsos (2007) 120
CMU SCS
Books
• Strongly recommended intro book:
– Manfred Schroeder Fractals, Chaos, Power
Laws: Minutes from an Infinite Paradise W.H.
Freeman and Company, 1991
• Classic book on fractals:
– B. Mandelbrot Fractal Geometry of Nature,
W.H. Freeman, 1977
15-826 C. Faloutsos
41
15-826 Copyright: C. Faloutsos (2007) 121
CMU SCS
References
• [vldb95] Alberto Belussi and Christos Faloutsos,Estimating the Selectivity of Spatial Queries Using the`Correlation' Fractal Dimension Proc. of VLDB, p. 299-310, 1995
• [Broder+’00] Andrei Broder, Ravi Kumar , FarzinMaghoul1, Prabhakar Raghavan , Sridhar Rajagopalan ,
Raymie Stata, Andrew Tomkins , Janet Wiener, Graph
structure in the web , WWW’00
• M. Crovella and A. Bestavros, Self similarity in Worldwide web traffic: Evidence and possible causes ,SIGMETRICS ’96.
15-826 Copyright: C. Faloutsos (2007) 122
CMU SCS
References
– [ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger,
D.V. Wilson, On the Self-Similar Nature of Ethernet
Traffic, IEEE Transactions on Networking, 2, 1, pp 1-
15, Feb. 1994.
– [pods94] Christos Faloutsos and Ibrahim Kamel,
Beyond Uniformity and Independence: Analysis of R-
trees Using the Concept of Fractal Dimension, PODS,
Minneapolis, MN, May 24-26, 1994, pp. 4-13
15-826 Copyright: C. Faloutsos (2007) 123
CMU SCS
References
– [vldb96] Christos Faloutsos, Yossi Matias and Avi
Silberschatz, Modeling Skewed Distributions Using
Multifractals and the `80-20 Law’ Conf. on Very Large
Data Bases (VLDB), Bombay, India, Sept. 1996.
15-826 C. Faloutsos
42
15-826 Copyright: C. Faloutsos (2007) 124
CMU SCS
References
– [vldb96] Christos Faloutsos and Volker Gaede Analysis
of the Z-Ordering Method Using the Hausdorff Fractal
Dimension VLD, Bombay, India, Sept. 1996
– [sigcomm99] Michalis Faloutsos, Petros Faloutsos and
Christos Faloutsos, What does the Internet look like?
Empirical Laws of the Internet Topology, SIGCOMM
1999
15-826 Copyright: C. Faloutsos (2007) 125
CMU SCS
References
– [icde99] Guido Proietti and Christos Faloutsos, I/O
complexity for range queries on region data stored
using an R-tree International Conference on Data
Engineering (ICDE), Sydney, Australia, March 23-26,
1999
– [sigmod2000] Christos Faloutsos, Bernhard Seeger,
Agma J. M. Traina and Caetano Traina Jr., Spatial Join
Selectivity Using Power Laws, SIGMOD 2000
15-826 Copyright: C. Faloutsos (2007) 126
CMU SCS
Appendix - Gory details
• Bad news: There are more than one fractal
dimensions
– Minkowski fd; Hausdorff fd; Correlation fd;
Information fd
• Great news:
– they can all be computed fast!
– they usually have nearby values
15-826 C. Faloutsos
43
15-826 Copyright: C. Faloutsos (2007) 127
CMU SCS
Fast estimation of fd(s):
• How, for the (correlation) fractal dimension?
• A: Box-counting plot:
log( r )
rpi
log(sum(pi ^2))
15-826 Copyright: C. Faloutsos (2007) 128
CMU SCS
Definitions
• pi : the percentage (or count) of points in
the i-th cell
• r: the side of the grid
15-826 Copyright: C. Faloutsos (2007) 129
CMU SCS
Fast estimation of fd(s):
• compute sum(pi^2) for another grid side, r’
log( r )
r’
pi’
log(sum(pi ^2))
15-826 C. Faloutsos
44
15-826 Copyright: C. Faloutsos (2007) 130
CMU SCS
Fast estimation of fd(s):
• etc; if the resulting plot has a linear part, its
slope is the correlation fractal dimension D2
log( r )
log(sum(pi ^2))
15-826 Copyright: C. Faloutsos (2007) 131
CMU SCS
Definitions (cont’d)
• Many more fractal dimensions Dq (related toRenyi entropies):
)log(
)log(
1)log(
)log(
1
1
1r
ppD
qr
p
qD
ii
q
i
q
∂
∂=
≠∂
∂
−=
∑
∑
15-826 Copyright: C. Faloutsos (2007) 132
CMU SCS
Hausdorff or box-counting fd:
• Box counting plot: Log( N ( r ) ) vs Log ( r)
• r: grid side
• N (r ): count of non-empty cells
• (Hausdorff) fractal dimension D0:
)log(
))(log(0
r
rND
∂
∂−=
15-826 C. Faloutsos
45
15-826 Copyright: C. Faloutsos (2007) 133
CMU SCS
Definitions (cont’d)
• Hausdorff fd:
r
log(r)
log(#non-empty cells)
D0
15-826 Copyright: C. Faloutsos (2007) 134
CMU SCS
Observations
• q=0: Hausdorff fractal dimension
• q=2: Correlation fractal dimension
(identical to the exponent of the number
of neighbors vs radius)
• q=1: Information fractal dimension
15-826 Copyright: C. Faloutsos (2007) 135
CMU SCS
Observations, cont’d
• in general, the Dq’s take similar, but not
identical, values.
• except for perfectly self-similar point-sets,
where Dq=Dq’ for any q, q’
15-826 C. Faloutsos
46
15-826 Copyright: C. Faloutsos (2007) 136
CMU SCS
Examples:MG county
• Montgomery County of MD (road end-
points)
15-826 Copyright: C. Faloutsos (2007) 137
CMU SCS
Examples:LB county
• Long Beach county of CA (road end-points)
15-826 Copyright: C. Faloutsos (2007) 138
CMU SCS
Conclusions
• many fractal dimensions, with nearby
values
• can be computed quickly
(O(N) or O(N log(N))
• (code: on the web)