@let@token INTERFACE 2008 - Scan Statistics on Enron...

30
Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works INTERFACE 2008 Scan Statistics on Enron Hypergraphs Youngser Park , Carey E. Priebe , David J. Marchette Johns Hopkins University, Baltimore, MD Naval Surface Warfare Center, Dahlgren, VA Carey E. Priebe David J. Marchette May 24, 2008 Durham, NC 1 / 30

Transcript of @let@token INTERFACE 2008 - Scan Statistics on Enron...

Page 1: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

INTERFACE 2008

Scan Statistics on Enron Hypergraphs

Youngser Park∗, Carey E. Priebe∗, David J. Marchette†

∗Johns Hopkins University, Baltimore, MD†Naval Surface Warfare Center, Dahlgren, VA

Carey E.

Priebe

David J.

Marchette

May 24, 2008Durham, NC

1 / 30

Page 2: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Citation

C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park,“Scan Statistics on Enron Graphs,”SIAM International Conference on Data Mining,Workshop on Link Analysis, Counterterrorism and Security,Newport Beach, California, April 23, 2005

C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park,“Scan Statistics on Enron Graphs,”Computational and Mathematical Organization Theory,Vol. 11, No. 3, pp. 229-247, 2005.

http://www.ams.jhu.edu/∼priebe/sseg.html

2 / 30

Page 3: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Introduction

Scan StatisticsScan Statistics on GraphsScan Statistics and Time Series

HypergraphsDefinitionScan Statistics on Hypergraphs

ExperimentsEnron GraphsExperiments 1Experiments 2

Discussion and Future Works

3 / 30

Page 4: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Introduction

Problem: Time series of graphs are becoming more and morecommon, e.g., communication graphs, socialnetworks, etc., and methods for statistical inferencesare required.

Objective: To extend a theory of scan statistics on hypergraphsto perform change point / anomaly detection ingraphs and in time series thereof.

Hypotheses: H0: homogeneityHA: local subregion of excessive activity

4 / 30

Page 5: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics

“moving window analysis”:

to scan a small “window” (scan region) over data,calculating some locality statistic for each window;e.g.,

• number of events for a point pattern,

• average pixel value for an image,

• ...

scan statistic ≡ maximum of locality statistic:

If maximum of observed locality statistics is large,then the inference can be made thatthere exists a subregion of excessive activity!

5 / 30

Page 6: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics on Graphs

directed graph (digraph): D = (V, A)

order: |V(D)|

size: |A(D)|

neighborhood: kth order neighborhood of v:Nk[v; D] = w ∈ V(D) : d(v, w) 6 k

scan region: (example: induced subdigraph): Ω(Nk[v; D])

locality statistic: (example: size): Ψk(v) = |A(Ω(Nk[v; D]))|

scan statistic: (“scale specific”) Mk(D) = maxv∈V(D) Ψk(v)

6 / 30

Page 7: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Example of Scan Statistics

b

b

b

b

b

b

bb

b

b

scan color locality

0 • 41 •+ • 42 •+ •+ • 83 •+ •+ •+ • 10

7 / 30

Page 8: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics and Time Series

• Let Dt t = 1, . . . , tmax be a time series of directed graphs.

• Scan region: induced subgraph of k-neighborhood:Ω(Nk(v; Dt)).

• Locality statistic: Ψk,t(v) = size(Ω(Nk(v; Dt))).

• Scan statistic: Mk,t = maxv(size(Ω(Nk(v; Dt)))).

• Let τ be an integer (temporal window).

8 / 30

Page 9: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics and Time Series

Vertex Standardization

• We want to standardize the vertices (“loud” vertices don’tdrown out “quiet” ones).

• Vertex-dependent standardized locality statistic:

Ψk,t(v) =Ψk,t(v) − µk,t,τ(v)

max(σk,t,τ(v), 1)

• µk,t,τ(v) = 1τ

∑t−1t ′=t−τ

Ψk,t ′(v)

• σ2k,t,τ(v) = 1

τ−1

∑t−1t ′=t−τ

(Ψk,t ′(v) − µk,t,τ(v))2 .

• Mk,t = maxv Ψk,t(v).

9 / 30

Page 10: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics and Time Series

Normalizing the Scan Statistic

• If we want to detect anomalies, we need to detrend.

• temporally-normalized scan statistics:

Sk,t =Mk,t − µk,t,ℓ

max(σk,t,ℓ, 1)

where µk,t,ℓ and σk,t,ℓ are the running mean and standard

deviation of Mk,t based on the most recent ℓ time steps.

10 / 30

Page 11: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Scan Statistics and Time SeriesSome Examples

050

100

150

200

250

300

time (mm/yy)

raw

sca

n st

atis

tics

11/98 05/99 10/99 04/00 10/00 04/01 09/01 03/02

sizescan0scan1scan2

Figure: Time series scan statistics for weekly Enron email graphs.

Page 12: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Scan Statistics and Time SeriesSome Examples

050

100

150

time (mm/yy)

stan

dard

ized

sca

n st

atis

tics

11/98 05/99 10/99 04/00 10/00 04/01 09/01 03/02

scan0scan1scan2

Figure: Time series of standardized scan statistics Mk,t(G) for k = 0, 1, 2.

Page 13: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Scan Statistics and Time SeriesSome Examples

−2

02

46

8

time (mm/yy)

scan

0

02/01 03/01 04/01 05/01 06/01

−2

02

46

8

time (mm/yy)

scan

1

02/01 03/01 04/01 05/01 06/01

−2

02

46

8

time (mm/yy)

scan

2

02/01 03/01 04/01 05/01 06/01

Figure: Sk,t, temporally-normalized scan statistics, on zoomed in timeseries of Enron email graphs.

Page 14: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Introduction

Scan StatisticsScan Statistics on GraphsScan Statistics and Time Series

HypergraphsDefinitionScan Statistics on Hypergraphs

ExperimentsEnron GraphsExperiments 1Experiments 2

Discussion and Future Works

14 / 30

Page 15: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Definition of Hypergraph

• A graph in which generalized edges (called hyperedges) mayconnect more than two vertices.

• A hypergraph H = (V,E) consists of a set of verticesV = v1, · · · , vn and a set of hyperedges E = e1, · · · , em,with ei 6= ∅ and ei ⊆ V for i = 1, · · · , m [Berge89].

C. Berge,Hypergraphs: Combinatorics of Finite Sets,North-Holland, 1989.

15 / 30

Page 16: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Example of Hypergraph

v1

v4 v3

v2

e1e2

e5

e3e4

e6

Incidence Matrixe1 e2 e3 e4 e5 e6

v1 1 1 0 0 0 1v2 0 0 1 1 0 0v3 1 0 1 0 1 1v4 0 1 0 1 1 1

16 / 30

Page 17: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics on Hypergraphs

hypergraph: H = (V,E)

order: order(H) = |V| = n,

size: size(H) = |E| = m,

neighborhood: (1st-order) N1(v, H) =⋃

v∈ei,ei∈Eei,

neighborhood: (kth-order) Nk(v, H) =⋃

v∈Nk−1(v,H) N1(v, H) fork > 2,

induced subgraph: Ω(Nk(v, H)), where Ek = ei ∈ E : ei ⊂ Nk,

17 / 30

Page 18: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics on Hypergraphs

hypergraph: H = (V,E)

locality statistic: Ψk(v, H) = size(Ω(Nk(v, H))), for k > 1,

locality statistic: (vertex-dependent standardized)

Ψk,t(v, H) =Ψk,t(v, H) − µk,t,τ(v)

max(σk,t,τ(v), 1)

scan statistic: (“scale-specific”) Mk(H) = maxv∈V(H) Ψk(v, H).

scan statistic: (standardized) Mk,t(H) = maxv Ψk,t(v, H).

scan statistic: (temporally-normalized)

Sk,t(H) =Mk,t(H) − µk,t,ℓ

max(σk,t,ℓ, 1)

18 / 30

Page 19: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Scan Statistics on Hypergraphs

v1

v4 v3

v2

e1e2

e5

e3e4

e6

locality statisticΨ0 Ψ1 Ψ0 Ψ1

v1 2 3 4 5v2 2 3 2 3v3 3 5 6 7v4 3 5 6 7

v1 6= v2

19 / 30

Page 20: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Introduction

Scan StatisticsScan Statistics on GraphsScan Statistics and Time Series

HypergraphsDefinitionScan Statistics on Hypergraphs

ExperimentsEnron GraphsExperiments 1Experiments 2

Discussion and Future Works

20 / 30

Page 21: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Enron Graphs

• Energy company famous for “creating accounting” measuresto boost stock value.

• Email sent and received between executives at Enron over aperiod of about 2 years.

• 150 executives (184 email addresses – some duplication).

• From-To pairs extracted from the headers of the email toconstruct a communications graph:

• Each graph covers one week (non-overlapping).• Vertices correspond to email addresses.• An edge between u and v if v sent an email with v in the To or

CC field during the week.• Duplicates not counted.

21 / 30

Page 22: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 1Detection by raw scan statistics

020

040

060

080

010

0012

00

time (mm/yy)

raw

sca

n st

atis

tics

11/9

8

01/9

9

03/9

9

06/9

9

08/9

9

10/9

9

01/0

0

03/0

0

05/0

0

08/0

0

10/0

0

12/0

0

02/0

1

05/0

1

07/0

1

09/0

1

12/0

1

02/0

2

04/0

2

hsizesizehscan0scan0hscan1scan1hscan2scan2

Figure: Time series scan statistics for weekly Enron email graphs.

Page 23: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 1Detection by raw scan statistics

5010

015

0

time(mm/yy)

rank

(arg

max

)

11/9

8

01/9

9

03/9

9

06/9

9

08/9

9

10/9

9

01/0

0

03/0

0

05/0

0

08/0

0

10/0

0

12/0

0

02/0

1

05/0

1

07/0

1

09/0

1

12/0

1

02/0

2

04/0

2

Figure: kt vs. t for Ψ1,t(G) and Ψ1,t(H).

Page 24: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 2Detection by normalized scan statistics

050

100

150

200

250

300

time(mm/yy)

stan

dard

ized

sca

n st

atis

tics

11/9

8

01/9

9

03/9

9

06/9

9

08/9

9

10/9

9

01/0

0

03/0

0

05/0

0

08/0

0

10/0

0

12/0

0

02/0

1

05/0

1

07/0

1

09/0

1

12/0

1

02/0

2

04/0

2

hscan1scan1

Figure: Time series of standardized scan statistics M1,t(G) and M1,t(H).

Page 25: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 2Detection by normalized scan statistics

−2

0

2

4

6

8

time(mm/yy)

S1(G

)

02/01 03/01 04/01 05/01 06/01

−2

0

2

4

6

8

time(mm/yy)

S1(H

)

02/01 03/01 04/01 05/01 06/01

Figure: Time series of temporally-normalized scan statistics S1,t(G) and

S1,t(H). It shows that t∗ = 130 and v∗ = arg maxv Ψ1,t∗=130(H) = 76.

Page 26: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 2Comparison of scan statistics

0 50 100 150

010

020

030

040

050

0

w130=05/01Ψ1(G)

Ψ1(H

)

19121315202526293235363843455372748688105109110112114118122123134136143151171173175178182184

4

16

19

21

22

27

41

44

49

55

627082

8485

98

103107

130

132

137138

142

144

153

155165167

176

183

3

67892101131145

174

8

14

24

4754

6973

778789120

139

140

154

159

179

180

46

5691

149

181

11

57

6090

96

117

124125

150

152

313348507176

99

111

168

177

23

8094157

160

34

58

65

95

104121

126133

161

10

37

128

13581169

93

17242

100

102

2

17

116166

1840

63

61156

162

158

163

68129

564

7

52

127

146

170

30

79

119

11366

97

51

106

148

164

67

39

141

115

28 59

75

147 108

83

Figure: Locality statistics ΨH1 on hypergraph as a function of Ψ1 on

graph at week 130 (= May, 2001).

Page 27: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 2

Comparison of scan statistics

0

100

200

300

400

500

time(mm/yy)

Ψ1(7

9, H

)

02/01 03/01 04/01 05/01 06/01

employee #79

0

100

200

300

400

500

time(mm/yy)

Ψ1(9

7, H

)

02/01 03/01 04/01 05/01 06/01

employee #97

Page 28: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 2Comparison of scan statistics

−2 0 2 4 6

050

100

150

200

250

300

w130=05/01Ψ~1(G)

Ψ~1(H

)

92 23142159168110271434105961781431091256214482261541271551792016512114170

1771241711301496417494164

1673

255713315660471129149146132

98

241131611491315293235363843455372748688118122123134136151173175182184

138

85119141486116670661585884115

120

104

71

2815352592210015281101374221

10254

107

176

163561668

1941

4455103137

145

183

67116129

89

789514012116212810830

33

1063977963131

6

978

87

93111

31

148

18069135

77

73

139

50

147181

11

172

157

80

51

117

99

18

46

1607583

10

40

150

5

90

2

126

76

169

65

17

837997

Figure: Standardized scan statistics ΨH1 on hypergraph as a function of

Ψ1 on graph at week 130 (= May, 2001).

Page 29: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Experiments 2

Comparison of scan statistics

0

20

40

60

time(mm/yy)

Ψ~1(1

7, H

)

02/01 03/01 04/01 05/01 06/01

employee #17

0

50

100

150

200

250

300

time(mm/yy)

Ψ~1(7

6, H

)

02/01 03/01 04/01 05/01 06/01

employee #76

Page 30: @let@token INTERFACE 2008 - Scan Statistics on Enron ...parky/Enron/hgraph-interface08-handout.pdf · Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future

Outline Introduction Scan Statistics Hypergraphs Experiments Discussion and Future Works

Discussion / Future Works

• Weighted/Directed Hypergraph

• Content Analysis

• Real-time Data

30 / 30