Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works
INTERFACE 2008
Scan Statistics on Enron Hypergraphs
Youngser Park∗, Carey E. Priebe∗, David J. Marchette†
∗Johns Hopkins University, Baltimore, MD
†Naval Surface Warfare Center, Dahlgren, VA
May 24, 2008, Durham, NC
Citation
C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park, “Scan Statistics on Enron Graphs,” SIAM International Conference on Data Mining, Workshop on Link Analysis, Counterterrorism and Security, Newport Beach, California, April 23, 2005.
C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park, “Scan Statistics on Enron Graphs,” Computational and Mathematical Organization Theory, Vol. 11, No. 3, pp. 229-247, 2005.
http://www.ams.jhu.edu/~priebe/sseg.html
Introduction
Scan Statistics
  Scan Statistics on Graphs
  Scan Statistics and Time Series
Hypergraphs
  Definition
  Scan Statistics on Hypergraphs
Experiments
  Enron Graphs
  Experiments 1
  Experiments 2
Discussion and Future Works
Introduction
Problem: Time series of graphs (e.g., communication graphs and social networks) are becoming increasingly common, and methods for statistical inference on them are required.
Objective: To extend the theory of scan statistics to hypergraphs, in order to perform change-point / anomaly detection in graphs and in time series thereof.
Hypotheses:
  H0: homogeneity
  HA: local subregion of excessive activity
Scan Statistics
“Moving window analysis”: scan a small “window” (scan region) over the data, calculating some locality statistic for each window; e.g.,
• number of events for a point pattern,
• average pixel value for an image,
• ...
Scan statistic ≡ maximum of the locality statistics.
If the maximum of the observed locality statistics is large, then we can infer that there exists a subregion of excessive activity.
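The moving-window idea can be sketched for a one-dimensional point pattern. This is only an illustrative sketch: the function name `scan_statistic`, the half-open window convention, the fixed step size, and the event positions are assumptions made for this example, not taken from the talk.

```python
from bisect import bisect_left

def scan_statistic(events, window, step):
    """Slide a window of fixed width over event positions.

    Returns (maximum count, start of the first window achieving it).
    The window [start, start + window) is half-open (an assumption).
    """
    events = sorted(events)
    if not events:
        return 0, None
    best_count, best_start = 0, None
    start = events[0]
    while start <= events[-1]:
        # Locality statistic: number of events inside the current window.
        count = bisect_left(events, start + window) - bisect_left(events, start)
        if count > best_count:
            best_count, best_start = count, start
        start += step
    return best_count, best_start

# Hypothetical event positions with a dense cluster near 4.
events = [0.5, 1.0, 4.0, 4.2, 4.3, 4.9, 9.0]
m, where = scan_statistic(events, window=1.0, step=0.5)
print(m, where)
```

A large maximum relative to what homogeneity would produce is the evidence for a subregion of excessive activity; assessing "large" requires a null distribution, which this sketch does not attempt.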
Scan Statistics on Graphs
directed graph (digraph): $D = (V, A)$
order: $|V(D)|$
size: $|A(D)|$
neighborhood: $k$th-order neighborhood of $v$: $N_k[v; D] = \{w \in V(D) : d(v, w) \le k\}$
scan region (example: induced subdigraph): $\Omega(N_k[v; D])$
locality statistic (example: size): $\Psi_k(v) = |A(\Omega(N_k[v; D]))|$
scan statistic (“scale-specific”): $M_k(D) = \max_{v \in V(D)} \Psi_k(v)$
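A minimal Python sketch of these definitions follows. It assumes that the distance d(v, w) is the directed shortest-path length along arcs, and it represents the digraph as a set of (tail, head) pairs; the function names and the small example digraph are hypothetical. Read literally, the size of the induced subdigraph is 0 for k = 0 in a loop-free digraph, so the example uses k ≥ 1.

```python
from collections import deque

def neighborhood(arcs, v, k):
    """Closed k-th order neighborhood N_k[v; D] via breadth-first search.

    Assumes d(v, w) is the directed shortest-path length from v to w.
    """
    out = {}
    for u, w in arcs:
        out.setdefault(u, set()).add(w)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond distance k
        for w in out.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def locality_statistic(arcs, v, k):
    """Psi_k(v): number of arcs in the subdigraph induced by N_k[v; D]."""
    region = neighborhood(arcs, v, k)
    return sum(1 for u, w in arcs if u in region and w in region)

def scan_statistic(vertices, arcs, k):
    """M_k(D): maximum locality statistic over all vertices."""
    return max(locality_statistic(arcs, v, k) for v in vertices)

# Hypothetical digraph on four vertices.
vertices = ["a", "b", "c", "d"]
arcs = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "a")}
print(scan_statistic(vertices, arcs, 1), scan_statistic(vertices, arcs, 2))
```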
Example of Scan Statistics
[Figure: example graph with nested scan regions marked by color; the drawing and the colors are not recoverable from the extracted text.]
scan (k)   locality
0          4
1          4
2          8
3          10
Scan Statistics and Time Series
• Let $D_t$, $t = 1, \ldots, t_{\max}$, be a time series of directed graphs.
• Scan region: induced subgraph of the $k$-neighborhood: $\Omega(N_k(v; D_t))$.
• Locality statistic: $\Psi_{k,t}(v) = \mathrm{size}(\Omega(N_k(v; D_t)))$.
• Scan statistic: $M_{k,t} = \max_v \mathrm{size}(\Omega(N_k(v; D_t)))$.
• Let $\tau$ be an integer (temporal window).
Scan Statistics and Time Series
Vertex Standardization
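• We want to standardize the vertices (so that “loud” vertices don’t drown out “quiet” ones).
• Vertex-dependent standardized locality statistic:
  $$\widetilde{\Psi}_{k,t}(v) = \frac{\Psi_{k,t}(v) - \mu_{k,t,\tau}(v)}{\max(\sigma_{k,t,\tau}(v), 1)}$$
• $\mu_{k,t,\tau}(v) = \frac{1}{\tau} \sum_{t'=t-\tau}^{t-1} \Psi_{k,t'}(v)$
• $\sigma^2_{k,t,\tau}(v) = \frac{1}{\tau-1} \sum_{t'=t-\tau}^{t-1} \left(\Psi_{k,t'}(v) - \mu_{k,t,\tau}(v)\right)^2$
• $\widetilde{M}_{k,t} = \max_v \widetilde{\Psi}_{k,t}(v)$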
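The vertex standardization can be sketched in Python as below. The data layout (a dict mapping each vertex to its list of raw locality statistics indexed by time), the function names, and the two example series are assumptions for illustration. `statistics.stdev` uses the divisor τ − 1, matching the variance formula above, and needs τ ≥ 2.

```python
from statistics import mean, stdev

def standardized_locality(history, current):
    """Standardize the current locality statistic of one vertex.

    history: raw values for t' = t - tau, ..., t - 1 (at least two values).
    current: raw value at time t.
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation, divisor tau - 1
    # The floor of 1 keeps near-constant vertices from exploding.
    return (current - mu) / max(sigma, 1.0)

def standardized_scan(series, t, tau):
    """Return (standardized scan statistic, arg-max vertex) at time t >= tau."""
    scores = {v: standardized_locality(x[t - tau:t], x[t]) for v, x in series.items()}
    v_star = max(scores, key=scores.get)
    return scores[v_star], v_star

# Hypothetical raw locality statistics: one "loud" and one "quiet" vertex.
series = {
    "loud": [100, 104, 96, 100, 102],
    "quiet": [2, 2, 2, 2, 12],
}
score, vertex = standardized_scan(series, t=4, tau=4)
print(score, vertex)
```

In this toy input the raw maximum belongs to the "loud" vertex, while the standardized maximum belongs to the "quiet" one, whose jump is large relative to its own recent history.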
Scan Statistics and Time Series
Normalizing the Scan Statistic
• If we want to detect anomalies, we need to detrend.
• Temporally-normalized scan statistics:
  $$S_{k,t} = \frac{\widetilde{M}_{k,t} - \mu_{k,t,\ell}}{\max(\sigma_{k,t,\ell}, 1)}$$
  where $\mu_{k,t,\ell}$ and $\sigma_{k,t,\ell}$ are the running mean and standard deviation of $\widetilde{M}_{k,t}$ based on the most recent $\ell$ time steps.
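A short sketch of the temporal normalization follows. It assumes that "the most recent ℓ time steps" means the ℓ steps strictly before t, mirroring the vertex standardization window; that choice, the function name, and the example series are assumptions rather than details stated in the talk.

```python
from statistics import mean, stdev

def temporally_normalized(m, ell):
    """Detrend a series of scan statistics.

    m: list of scan statistic values indexed by time.
    ell: length of the running window (assumed to cover t - ell, ..., t - 1).
    Returns a dict t -> S_t for t = ell, ..., len(m) - 1.
    """
    s = {}
    for t in range(ell, len(m)):
        recent = m[t - ell:t]
        s[t] = (m[t] - mean(recent)) / max(stdev(recent), 1.0)
    return s

# Hypothetical scan statistic series with a spike at t = 6.
m = [3, 4, 3, 5, 4, 4, 20, 5]
s = temporally_normalized(m, ell=5)
t_star = max(s, key=s.get)  # time of the largest normalized value
print(t_star, s[t_star])
```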
Scan Statistics and Time Series
Some Examples
[Plot: raw scan statistics (y-axis, 0 to 300) vs. time (mm/yy, 11/98 to 03/02); series: size, scan0, scan1, scan2.]
Figure: Time series scan statistics for weekly Enron email graphs.
Scan Statistics and Time Series
Some Examples
[Plot: standardized scan statistics (y-axis, 0 to 150) vs. time (mm/yy, 11/98 to 03/02); series: scan0, scan1, scan2.]
Figure: Time series of standardized scan statistics $\widetilde{M}_{k,t}(G)$ for $k = 0, 1, 2$.
Scan Statistics and Time Series
Some Examples
[Plot: three panels (scan0, scan1, scan2), each with y-axis from −2 to 8 vs. time (mm/yy, 02/01 to 06/01).]
Figure: $S_{k,t}$, temporally-normalized scan statistics, on a zoomed-in time series of Enron email graphs.
Definition of Hypergraph
• A graph in which generalized edges (called hyperedges) may connect more than two vertices.
• A hypergraph $H = (V, E)$ consists of a set of vertices $V = \{v_1, \ldots, v_n\}$ and a set of hyperedges $E = \{e_1, \ldots, e_m\}$, with $e_i \neq \emptyset$ and $e_i \subseteq V$ for $i = 1, \ldots, m$ [Berge89].
[Berge89] C. Berge, Hypergraphs: Combinatorics of Finite Sets, North-Holland, 1989.
Example of Hypergraph
[Figure: hypergraph on vertices v1, v2, v3, v4 with hyperedges e1 to e6.]
Incidence Matrix
      e1  e2  e3  e4  e5  e6
v1     1   1   0   0   0   1
v2     0   0   1   1   0   0
v3     1   0   1   0   1   1
v4     0   1   0   1   1   1
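To make the incidence matrix concrete, the sketch below converts it into explicit hyperedge vertex sets. The helper name `hyperedges_from_incidence` and the choice of `frozenset` as the hyperedge representation are assumptions for this example.

```python
vertices = ["v1", "v2", "v3", "v4"]
# Rows are vertices, columns are hyperedges e1..e6, as on the slide.
incidence = [
    [1, 1, 0, 0, 0, 1],  # v1
    [0, 0, 1, 1, 0, 0],  # v2
    [1, 0, 1, 0, 1, 1],  # v3
    [0, 1, 0, 1, 1, 1],  # v4
]

def hyperedges_from_incidence(vertices, incidence):
    """Map each column of the incidence matrix to the set of its vertices."""
    n_edges = len(incidence[0])
    return {
        f"e{j + 1}": frozenset(v for v, row in zip(vertices, incidence) if row[j])
        for j in range(n_edges)
    }

hyperedges = hyperedges_from_incidence(vertices, incidence)
for name, members in hyperedges.items():
    print(name, sorted(members))
```

Columns e1 to e5 each contain two vertices, so they are ordinary edges; e6 contains three vertices and is the only proper hyperedge in this example.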
Scan Statistics on Hypergraphs
hypergraph: $H = (V, E)$
order: $\mathrm{order}(H) = |V| = n$
size: $\mathrm{size}(H) = |E| = m$
neighborhood (1st-order): $N_1(v, H) = \bigcup_{e_i \in E :\, v \in e_i} e_i$
neighborhood ($k$th-order): $N_k(v, H) = \bigcup_{u \in N_{k-1}(v, H)} N_1(u, H)$ for $k \ge 2$
induced subgraph: $\Omega(N_k(v, H))$, with hyperedge set $E_k = \{e_i \in E : e_i \subset N_k(v, H)\}$
Scan Statistics on Hypergraphs
hypergraph: $H = (V, E)$
locality statistic: $\Psi_k(v, H) = \mathrm{size}(\Omega(N_k(v, H)))$, for $k \ge 1$
locality statistic (vertex-dependent standardized): $\widetilde{\Psi}_{k,t}(v, H) = \frac{\Psi_{k,t}(v, H) - \mu_{k,t,\tau}(v)}{\max(\sigma_{k,t,\tau}(v), 1)}$
scan statistic (“scale-specific”): $M_k(H) = \max_{v \in V(H)} \Psi_k(v, H)$
scan statistic (standardized): $\widetilde{M}_{k,t}(H) = \max_v \widetilde{\Psi}_{k,t}(v, H)$
scan statistic (temporally-normalized): $S_{k,t}(H) = \frac{\widetilde{M}_{k,t}(H) - \mu_{k,t,\ell}}{\max(\sigma_{k,t,\ell}, 1)}$
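The hypergraph neighborhood and raw locality statistic can be sketched as follows, using the six hyperedges of the example incidence matrix. Two assumptions are made: a hyperedge belongs to the scan region when it is contained in the neighborhood (reading ⊂ as "subset of or equal to"), and the function names are hypothetical. The values printed follow this literal set-based reading and are not claimed to reproduce any table in the talk.

```python
# Hyperedges e1..e6 of the example hypergraph, as vertex sets.
hyperedges = [
    {"v1", "v3"}, {"v1", "v4"}, {"v2", "v3"},
    {"v2", "v4"}, {"v3", "v4"}, {"v1", "v3", "v4"},
]

def n1(v, edges):
    """First-order neighborhood: union of all hyperedges containing v."""
    result = set()
    for e in edges:
        if v in e:
            result |= e
    return result

def nk(v, edges, k):
    """k-th order neighborhood (k >= 1), built by repeated expansion."""
    current = n1(v, edges)
    for _ in range(k - 1):
        expanded = set()
        for u in current:
            expanded |= n1(u, edges)
        current = expanded
    return current

def locality(v, edges, k):
    """Number of hyperedges contained in the k-th order neighborhood of v."""
    region = nk(v, edges, k)
    return sum(1 for e in edges if e <= region)  # <= is the subset test

def scan(vertices, edges, k):
    """Scale-specific scan statistic: maximum locality over vertices."""
    return max(locality(v, edges, k) for v in vertices)

vertices = ["v1", "v2", "v3", "v4"]
print([locality(v, hyperedges, 1) for v in vertices], scan(vertices, hyperedges, 1))
```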
Scan Statistics on Hypergraphs
[Figure: the example hypergraph on vertices v1, v2, v3, v4 with hyperedges e1 to e6.]
locality statistic
      Ψ0  Ψ1  Ψ0  Ψ1
v1     2   3   4   5
v2     2   3   2   3
v3     3   5   6   7
v4     3   5   6   7
v1 ≠ v2
Enron Graphs
• Energy company famous for “creative accounting” measures to boost stock value.
• Email sent and received between executives at Enron over a period of about 2 years.
• 150 executives (184 email addresses – some duplication).
• From-To pairs extracted from the email headers to construct a communications graph:
  • Each graph covers one week (non-overlapping).
  • Vertices correspond to email addresses.
  • An edge between u and v if u sent an email with v in the To or CC field during the week.
  • Duplicates not counted.
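The construction of weekly graphs can be sketched as below. The record layout (timestamp, sender, list of To/CC recipients), the function name, and the sample messages are hypothetical. The hyperedge rule, one hyperedge per message containing the sender and all recipients, is an assumption made for this sketch; the slides do not spell out how the hypergraphs are built.

```python
from collections import defaultdict
from datetime import datetime

def weekly_structures(messages, origin):
    """Group messages into non-overlapping weeks counted from `origin`.

    Returns (graphs, hypergraphs):
      graphs[week]      -> set of (sender, recipient) arcs
      hypergraphs[week] -> set of frozensets, one per distinct message group
                           (ASSUMPTION: hyperedge = sender plus all recipients)
    """
    graphs = defaultdict(set)
    hypergraphs = defaultdict(set)
    for when, sender, recipients in messages:
        week = (when - origin).days // 7
        for r in recipients:
            graphs[week].add((sender, r))  # sets drop duplicate pairs
        hypergraphs[week].add(frozenset({sender, *recipients}))
    return dict(graphs), dict(hypergraphs)

# Hypothetical header records: (timestamp, sender, To/CC recipients).
messages = [
    (datetime(2001, 5, 1, 9, 0), "a", ["b", "c"]),
    (datetime(2001, 5, 2, 10, 0), "a", ["b"]),
    (datetime(2001, 5, 9, 8, 0), "b", ["a"]),
]
graphs, hypergraphs = weekly_structures(messages, origin=datetime(2001, 5, 1))
print(graphs)
print(hypergraphs)
```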
Experiments 1
Detection by raw scan statistics
[Plot: raw scan statistics (y-axis, 0 to 1200) vs. time (mm/yy, 11/98 to 04/02); series: hsize, size, hscan0, scan0, hscan1, scan1, hscan2, scan2.]
Figure: Time series scan statistics for weekly Enron email graphs.
Experiments 1
Detection by raw scan statistics
[Plot: rank(argmax) (y-axis, 50 to 150) vs. time (mm/yy, 11/98 to 04/02).]
Figure: $k_t$ vs. $t$ for $\Psi_{1,t}(G)$ and $\Psi_{1,t}(H)$.
Experiments 2
Detection by normalized scan statistics
[Plot: standardized scan statistics (y-axis, 0 to 300) vs. time (mm/yy, 11/98 to 04/02); series: hscan1, scan1.]
Figure: Time series of standardized scan statistics $\widetilde{M}_{1,t}(G)$ and $\widetilde{M}_{1,t}(H)$.
Experiments 2
Detection by normalized scan statistics
[Plot: two panels, S1(G) and S1(H), each with y-axis from −2 to 8 vs. time (mm/yy, 02/01 to 06/01).]
Figure: Time series of temporally-normalized scan statistics $S_{1,t}(G)$ and $S_{1,t}(H)$. It shows that $t^* = 130$ and $v^* = \arg\max_v \Psi_{1,t^*=130}(H) = 76$.
Experiments 2
Comparison of scan statistics
[Scatter plot, w130 = 05/01: Ψ1(H) (y-axis, 0 to 500) vs. Ψ1(G) (x-axis, 0 to 150); points labeled by employee number.]
Figure: Locality statistics $\Psi^H_1$ on hypergraph as a function of $\Psi_1$ on graph at week 130 (= May, 2001).
Experiments 2
Comparison of scan statistics
[Plot: two panels, Ψ1(79, H) for employee #79 and Ψ1(97, H) for employee #97, each with y-axis from 0 to 500 vs. time (mm/yy, 02/01 to 06/01).]
Experiments 2
Comparison of scan statistics
[Scatter plot, w130 = 05/01: Ψ~1(H) (y-axis, 0 to 300) vs. Ψ~1(G) (x-axis, −2 to 6); points labeled by employee number.]
Figure: Standardized scan statistics $\widetilde{\Psi}^H_1$ on hypergraph as a function of $\widetilde{\Psi}_1$ on graph at week 130 (= May, 2001).
Experiments 2
Comparison of scan statistics
[Plot: two panels, Ψ~1(17, H) for employee #17 (y-axis, 0 to 60) and Ψ~1(76, H) for employee #76 (y-axis, 0 to 300), each vs. time (mm/yy, 02/01 to 06/01).]
Discussion / Future Works
• Weighted/Directed Hypergraph
• Content Analysis
• Real-time Data