Information-theoretic Learning and Data Mining


Transcript of Information-theoretic Learning and Data Mining

Page 1: Information-theoretic Learning and Data Mining

Information-theoretic Learning and Data Mining

Kenji Yamanishi

The University of Tokyo, JAPAN

Oct. 18th, 2009

Page 2

Contents

1. Information-theoretic Learning based Novelty Detection

2. Theory of Dynamic Model Selection

3. Applications of DMS to Data Mining

3-1. Masquerade Detection

3-2. Failure Detection

3-3. Topic Dynamics Detection

3-4. Network Structure Mining

4. Summary

Page 3

1. Information-theoretic Learning and Novelty Detection

Page 4

Data Mining Issues

Pattern Discovery

Anomaly Detection

Data Mining = Machine Learning + Pre-/Post-Processing

Distributed and Heterogeneous Data

Knowledge Discovery

Marketing, CRM, SCM, ERP, Community Mgmt.

Data

Risk Management

Fraud Detection, Failure Detection, Compliance Mgmt., Security

Web logs, RFID, cellular, media, sensor data

Dynamic and Non-stationary Data

Future Prediction

[Figure: sensor time series over 2004/8/18 12:00 - 2004/8/21 12:00 (values 0-300); legend: measured values, predicted/estimated values, upper limit, threshold]

Mining from large amounts of dynamic and heterogeneous data is a critical issue.

Page 5

Information-theoretic Learning and Data Mining

Non-stationary Dynamic Information Sources

Stochastic Complexity Minimization ..... Minimum Description Length (MDL) Principle (Data Compression and Summarization)

Dynamics of Statistical Models

Structural Changes

Novelty Detection

Structured Information

Dynamic Data

Structural changes from dynamic data are discovered using stochastic complexity minimization

Page 6

2. Theory of Dynamic Model Selection

Page 7

Concept of Dynamic Model Selection

Issue: How do you track changes of statistical models under non-stationary environments?

[Diagram: over time, Model 1 is followed by Model 2]

Data: x1, …, xt-1, xt, …, xn

Model: M1, …, M1, M2, …, M2

Selecting a model sequence that explains the data best.

Methodologies:

■ Discounting Learning

■ Windowing

■ Dynamic Model Selection (DMS)
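The first of these methodologies can be illustrated with a toy sketch (not from the slides; the discount rate r and the mean-tracking setup are illustrative assumptions): discounting learning decays the influence of old data exponentially, so the estimate follows a level shift in the stream instead of averaging it away.

```python
def discounted_mean(xs, r=0.2):
    """Return the sequence of discounted (exponentially decayed) mean
    estimates over the stream xs. Larger r forgets the past faster."""
    est = None
    out = []
    for x in xs:
        est = x if est is None else (1.0 - r) * est + r * x
        out.append(est)
    return out

# A stream with a level shift at t = 50: the estimate tracks the shift.
stream = [0.0] * 50 + [10.0] * 50
ests = discounted_mean(stream, r=0.2)
```

Windowing achieves a similar effect by simply dropping data outside a sliding window; DMS, described next, instead selects an explicit sequence of models.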

Page 8

[Diagram: over time, Pattern 1 (cd, ls) is joined by Pattern 2 (sh, ps, vi); #patterns: 1 → 2]

Ex.1: UNIX Command Pattern Analysis

Tracking changes of the mixture size leads to detection of a new pattern

Probabilistic model (Mixture Model)

1 emacs 2 cp 3 bash 4 perl 5 :

1 cd 2 ls 3 emacs 4 platex 5 :

UNIX command sequence

1 cd 2 ls 3 emacs 4 platex 5 :

1 emacs 2 make 3 ./a.exe 4 : 5 :

Page 9

Ex.2: Curve Fitting

Static Model Selection

Dynamic Model

Selection

K: degree of polynomial

K=2 → K=4 → K=5

The degree of the curve changes over time.

Tracking the curve degree as it changes over time.
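As a rough illustration of selecting a polynomial degree by code-length minimization (a sketch only: the two-part code with a BIC-style ((k+1)/2) log n parameter penalty and the variance floor are assumptions, not the DMS criterion itself), the selected degree tracks the data segment by segment:

```python
import numpy as np

def best_degree(x, y, max_deg=6):
    """Pick the polynomial degree minimizing a two-part code length:
    (n/2) log(RSS/n) for the data plus ((k+1)/2) log n for the parameters."""
    n = len(x)
    best_k, best_len = 0, float("inf")
    for k in range(max_deg + 1):
        coef = np.polyfit(x, y, k)
        rss = float(np.sum((np.polyval(coef, x) - y) ** 2))
        # Floor the fitted variance so exact fits stay finite.
        code_len = 0.5 * n * np.log(max(rss / n, 1e-12)) + 0.5 * (k + 1) * np.log(n)
        if code_len < best_len:
            best_k, best_len = k, code_len
    return best_k

# Two segments with different true degrees: the selected degree follows.
x = np.linspace(-1.0, 1.0, 200)
deg_a = best_degree(x, 2.0 * x**2 - 1.0)  # quadratic segment
deg_b = best_degree(x, x**4)              # quartic segment
```

Applying this per segment (or per window) is the static-selection baseline; DMS additionally charges a code length for each change of degree.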

Page 10

Related Work

• Static model selection

AIC [Akaike 1973], BIC [Schwarz 1978], MDL [Rissanen 1978], MML [Wallace & Freeman 1987]

– Under stationarity assumption

• Tracking the best expert [Herbster & Warmuth 1998]

Derandomization [Vovk 1997] [Cesa-Bianchi & Lugosi 2006]

– Best model (=“Best expert”) may change over time

– Cannot track the sequence of best models itself

・ MDL-Based Dynamic Model Selection under non-stationarity assumption [Yamanishi and Maruyama 2007]

・Switching [van Erven, Grünwald & de Rooij 2008]

-Convergence as with MDL and convergence rate as with AIC

・Changing dependency detection [Fearnhead & Liu 2007]

Page 11

MDL-based Information Criterion for DMS

Evaluating goodness of model sequence from theMinimum Description Length (MDL) Principle [Rissanen 1978]

Code length required by A to encode x^n and k^n under the prefix condition

Data Sequence

Model Sequence

Minimize w.r.t. k^n for a given x^n

・ Batch DMS

・ Sequential DMS

(Kraft’s Inequality)
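As a brief aside on the prefix condition (an illustrative example; the codeword assignment is made up), Kraft's inequality, the sum over k of 2^(-len(k)) being at most 1, is what allows code lengths to be read as -log probabilities:

```python
# A made-up prefix code over four models.
code = {"M1": "0", "M2": "10", "M3": "110", "M4": "111"}

def is_prefix_free(codewords):
    """Check that no codeword is a prefix of another.
    For sorted binary strings, checking adjacent pairs suffices."""
    ws = sorted(codewords)
    return not any(b.startswith(a) for a, b in zip(ws, ws[1:]))

# Kraft sum: sum of 2^(-codeword length); at most 1 for any prefix code.
kraft_sum = sum(2.0 ** -len(w) for w in code.values())
```

Here the code is complete (the Kraft sum equals 1), so each length corresponds exactly to -log2 of a probability.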

Page 12

Predictive Distribution

The dynamics of model changes are described using a predictive distribution.

Page 13

Switching Distribution

Page 14

Description of Model Transition

[Diagram: the model index jumps from k to k+1 at a change point over time; a predictive code length is accumulated for each segment, plus a code length for the model transition]

Page 15

Assumption: the model transits probabilistically.

Model transition probability: P(k_t | k_{t-1} : ε), where ε is a parameter.

DMS (Dynamic Model Selection) Criterion:

ℓ(x^n : k^n) = Σ_{t=1}^{n} { -log P(x_t | x^{t-1} : k_t) } + Σ_{t=1}^{n} { -log P(k_t | k_{t-1} : ε̂_t) }

(the first sum is the predictive code length for the data; the second is the predictive code length for the model sequence; ε̂_t is an estimate of ε)

Input: x^n = x_1, …, x_n

Output: k^n = k_1, …, k_n minimizing ℓ(x^n : k^n)

The transition probability is:

P(k_t | k_{t-1} : ε) =
  1 - ε,  if k_t = k_{t-1}
  ε/2,   if k_t = k_{t-1} + 1 and k_{t-1} ≠ K
  ε/2,   if k_t = k_{t-1} - 1 and k_{t-1} ≠ 1

[Diagram: chain over model indices k-1, k, k+1; self-transition probability 1 - ε, transition to each neighbor with probability ε/2]
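The criterion above can be sketched in code (a toy sketch under assumptions: a fixed ε rather than the estimated ε̂_t, and a hypothetical Gaussian predictive code length `nlp`):

```python
import math

def trans_prob(k, prev, eps=0.1, K=5):
    """P(k_t | k_{t-1} : eps): stay with probability 1 - eps,
    move to a neighboring model index with probability eps/2 each."""
    if k == prev:
        return 1.0 - eps
    if abs(k - prev) == 1 and 1 <= k <= K:
        return eps / 2.0
    return 0.0

def dms_code_length(xs, ks, neg_log_pred, eps=0.1, K=5):
    """DMS criterion: data code length plus model-transition code length."""
    total = sum(neg_log_pred(x, k) for x, k in zip(xs, ks))
    for prev, k in zip(ks, ks[1:]):
        p = trans_prob(k, prev, eps, K)
        total += -math.log(p) if p > 0.0 else float("inf")
    return total

# Hypothetical predictive code length: model k predicts N(k, 1) (constants dropped).
def nlp(x, k):
    return 0.5 * (x - k) ** 2

xs = [1.0] * 10 + [2.0] * 10
switching = dms_code_length(xs, [1] * 10 + [2] * 10, nlp)  # switch once
staying = dms_code_length(xs, [1] * 20, nlp)               # never switch
```

Switching once at the change point yields a shorter total code length than keeping a single model, which is exactly how the criterion detects a model change.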

Page 16

Theorem 1 [Yamanishi and Maruyama, IEEE Trans. IT 2007]: There exists a batch DMS algorithm (DMS1) for switching distributions that runs in time O(Kn²), and its total code length is upper bounded by the model-sequence complexity plus the data complexity for switching.

Coding Bound for Batch DMS for Switching Dist.

[Diagram: trellis over model states 1, …, K; at each time and each state, an optimal path is selected from the path set chosen at the previous time]

POINT 1: Optimal path search based on dynamic programming

POINT 2: Krichevsky-Trofimov estimation of model transition probabilities

Bound: model sequence complexity + data complexity for switching
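POINT 1 can be sketched as a Viterbi-style dynamic program (an illustrative simplification, not the paper's DMS1: ε is fixed instead of being estimated by the Krichevsky-Trofimov method, and the per-model predictive code length `nlp` is hypothetical):

```python
import math

def batch_dms(xs, K, neg_log_pred, eps=0.1):
    """Viterbi-style dynamic program over model indices 1..K:
    minimizes data code length plus transition code length."""
    n = len(xs)
    INF = float("inf")
    cost = [[INF] * (K + 1) for _ in range(n)]
    back = [[0] * (K + 1) for _ in range(n)]
    for k in range(1, K + 1):
        cost[0][k] = neg_log_pred(xs[0], k)
    for t in range(1, n):
        for k in range(1, K + 1):
            for prev in range(1, K + 1):
                if k == prev:
                    trans = -math.log(1.0 - eps)
                elif abs(k - prev) == 1:
                    trans = -math.log(eps / 2.0)
                else:
                    continue  # only stay-or-neighbor moves carry probability
                c = cost[t - 1][prev] + trans + neg_log_pred(xs[t], k)
                if c < cost[t][k]:
                    cost[t][k], back[t][k] = c, prev
    # Trace the optimal model sequence backwards.
    k = min(range(1, K + 1), key=lambda j: cost[n - 1][j])
    ks = [k]
    for t in range(n - 1, 0, -1):
        k = back[t][k]
        ks.append(k)
    return ks[::-1]

# Hypothetical predictive code length: model k predicts N(k, 1).
def nlp(x, k):
    return 0.5 * (x - k) ** 2

xs = [1.0] * 10 + [2.0] * 10
path = batch_dms(xs, 3, nlp)
```

On this toy stream the program recovers a model sequence that switches exactly at the change point.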

Page 17

Sequential DMS Problem

Within a window of size B ending at time t, select, from the model sequence set restricted to the window (a, b), the model sequence so that the DMS criterion, i.e. the predictive code length of the model sequence plus the data code length for the switching distribution, is minimum.

[Diagram: sliding window of size B ending at time t]

[Yamanishi and Sakurai WITMSE 2010]

Page 18

3. Applications of DMS to Data Mining

Page 19

Information leak

1 emacs 2 cp 3 bash 4 perl 5 :

time

1 cd 2 ls 3 emacs 4 platex 5 :

Program coding

emacs cp

make

Structural changes

Computer usage logs

1 cd 2 ls 3 emacs 4 platex 5 :

1 emacs 2 make 3 ./a.exe 4 : 5 :

Latent information

Program coding

emacs cp

make

cp **.exe

Tracking the emergence of a new pattern leads to the discovery of a masquerader’s behavior

3-1. Masquerade Detection

Page 20

command session y_j = (ps, cd, ls, cp, …) = (y_1, …, y_{T_j}), length T_j

y_1, …, y_M: command patterns are modeled using an HMM mixture. We wish to detect masquerades by tracking changes in the mixture size.

Model (mixture of K components):

P(y_j | Θ) = Σ_{k=1}^{K} π_k P_k(y_j | θ_k)

Each behavior pattern is represented by a first-order HMM:

P_k(y_j) = Σ_{x_1, …, x_{T_j}} γ_k(x_1) Π_{t=2}^{T_j} a_k(x_t | x_{t-1}) Π_{t=1}^{T_j} b_k(y_t | x_t)

(x_1, …, x_{T_j}): hidden states ⇒ Markov

K: number of different behavior patterns (mixture size)

γ_k: initial prob., a_k: transition prob., b_k: event prob.

[Diagram: sessions arriving over time]

[Maruyama & Yamanishi, ITW 2004]
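The HMM mixture likelihood above can be sketched with the standard forward algorithm (the two components `h1`, `h2`, their parameters, and the tiny command alphabet are all hypothetical):

```python
def hmm_forward(obs, init, trans, emit):
    """Forward algorithm: P(obs) under one HMM given as
    (init[i], trans[i][j], emit[i][symbol])."""
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][y]
                 for j in range(n)]
    return sum(alpha)

def mixture_likelihood(obs, weights, hmms):
    """P(obs) = sum over k of pi_k * P_k(obs) for an HMM mixture."""
    return sum(w * hmm_forward(obs, *h) for w, h in zip(weights, hmms))

# Two hypothetical two-state HMMs over a tiny command alphabet.
h1 = ([1.0, 0.0],
      [[0.9, 0.1], [0.1, 0.9]],
      [{"cd": 0.5, "ls": 0.5, "rm": 0.0},
       {"cd": 0.0, "ls": 0.2, "rm": 0.8}])
h2 = ([0.5, 0.5],
      [[0.5, 0.5], [0.5, 0.5]],
      [{"cd": 0.2, "ls": 0.2, "rm": 0.6},
       {"cd": 0.1, "ls": 0.1, "rm": 0.8}])
p = mixture_likelihood(["cd", "ls", "ls"], [0.7, 0.3], [h1, h2])
```

Each component scores a session; the mixing weights π_k weight how common each behavior pattern is.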

Page 21

Flow of Behavior Change Detection

Session stream (sessions over time) →

① On-line discounting learning algorithm for each mixture model (HMM mixtures with K = 1, …, Kmax; learning parameters)

② Dynamic model selection: selection of the optimal mixture size K* → emerging pattern identification

③ Universal-test-statistic-based scoring with a dynamically optimized threshold: anomaly score computation → anomalous behavior detection

[Diagram: anomaly score and mixture size plotted over time against a threshold]

Page 22

[Figure: detected mixture size (0-3) over session index 0-1400]

Masquerade data (t=820,…,849)

Pattern 1

file→rdistd

cat→rdistd

touch→tcsh

df→tcsh

rshd→rdistd

tcsh→rshd

rdistd→tcsh


Pattern 2

cat→post

cat→file

ln→ln

generic→ln

m→generic

file→post

awk→cat


t=820

Ex. User30

T-DMS1 detects the masquerader's command pattern at t=820.

Top 7 transitions of each cluster (k=2 at t=820)

Masquerade Pattern Detection

The masquerader's pattern deviates greatly from the normal user's pattern (mainly operating "remote shells").

Page 23

3-2. Failure Detection

Columns: ID, Time stamp, Event, Severity, Att1, Att2, Message

## Nov 13 00:06:23: ERR bridge: !brdgursrv: queue is full. discarding a message.

## Nov 13 10:15:00: WARN: INTR: ether2atm: Ethernet Slot 2L/1 Lock-Up!!

## Nov 13 10:15:10: WARN: INTR: ether2atm: Ethernet Slot 2L/2 Lock-Up!!

## Nov 13 10:15:20: WARN: INTR: ether2atm: Ethernet Slot 2L/3 Lock-Up!!

■ Syslog: a sequence of events collected using the BSD syslog protocol

■ Includes information crucial for detecting network failures

y1, …, yM: syslog patterns are also modeled using an HMM mixture. We wish to identify emerging failure patterns by tracking changes in the mixture size.

[Yamanishi & Maruyama, KDD 2005]

Page 24

Emerging Pattern Identification

Identify a new pattern by detecting the change point of mixture size

■ Goal: Track a new behavior pattern caused by a network failure

The component with smallest occurrence probability corresponds to a new pattern

lock-up failure
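A minimal sketch of this identification step (illustrative only; the size sequence and weight values are made up): the change points of the selected mixture size mark candidate emergence times, and the lightest component is taken as the new pattern.

```python
def size_change_points(sizes):
    """Time indices where the selected mixture size changes."""
    return [t for t in range(1, len(sizes)) if sizes[t] != sizes[t - 1]]

def emerging_component(weights):
    """Index of the component with the smallest mixing weight,
    taken as the candidate for the newly emerged pattern."""
    return min(range(len(weights)), key=lambda k: weights[k])

sizes = [1, 1, 1, 2, 2, 3, 3]    # mixture sizes selected over time
cps = size_change_points(sizes)  # candidate emergence times
new_idx = emerging_component([0.55, 0.38, 0.07])
```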

Page 25

[Diagram: inquiries and claims are clustered into topics; before: Topic 1, Topic 2, Topic 3; after DMS: Topic 1, Topic 2, Topic 4, Topic 5]

DMS: tracking changes of topic organization

Clustering → Topic Emergence Detection

3-3. Topic Dynamics Detection

Text Processing at Contact Center

[Chart: activity number of each topic over time; Topics 1 and 2 persist while Topic 5 emerges]

[Morinaga & Yamanishi, KDD 2004]

Page 26

[Diagram: each dot is an input document; each p.d.f. component of the Gaussian mixture is a main topic; on-line learning over time detects an emerging component]

Dynamic Topic Modeling with Gaussian Mixture
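A rough sketch of the on-line discounting idea for a Gaussian mixture (a simplified stand-in, not the discounting algorithm of the paper: one-dimensional data, fixed variance, and the discount rate r are all assumptions):

```python
import math

def gmm_online_step(x, weights, means, var=1.0, r=0.05):
    """One discounted on-line update of mixture weights and means
    for a 1-D Gaussian mixture with fixed variance."""
    dens = [w * math.exp(-0.5 * (x - m) ** 2 / var)
            for w, m in zip(weights, means)]
    z = sum(dens)
    resp = [d / z for d in dens]  # posterior responsibility of each component
    # Discounted updates: old statistics decay at rate (1 - r).
    weights = [(1 - r) * w + r * g for w, g in zip(weights, resp)]
    means = [m + r * g * (x - m) for m, g in zip(means, resp)]
    return weights, means

# A stream of documents mapped near 5.0: the matching component dominates.
w, m = [0.5, 0.5], [0.0, 5.0]
for _ in range(200):
    w, m = gmm_online_step(5.0, w, m)
```

A component whose weight grows from near zero under such updates is a candidate emerging topic.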

Page 27

[Figure: topic activity over dates 3/23-5/22; topic "Transfer" and topic "Failure of service X" are shown, with the period during which the topics "Service ZZZ" and "failure" are active marked; the emergence of a new topic and the disappearance of an old topic are visible; label: "C42 is main."]

Topic structure changes in the contact center text stream.

Topic Dynamics Detection at Contact Center

Page 28

[Chart: number of main topics (0-50) over time, 2/23-5/13]

• The total number of inquiries at the contact center is 1,202.

Changes of number of main topics over time

Changes of Topic Number

Page 29

3-4. Network Structure Mining

Time evolving

Tracking network hierarchy changes leads to new hub detection

Ex.: SNS networks. Detection of influencers, criminal groups, new communities.

Other examples: physical networks, coauthor networks, co-purchase networks.

Time evolving

Emergence of new hubs

Emergence of new hierarchies

Page 30

Time evolving Time evolving

Change of partitioning structure

Emergence of a new cluster

Graph Structure Change Detection

Tracking a new graph cluster leads to discovery of a new community

[Diagram: a time series of correlation matrices over nodes ①-⑦; the partitioning structure changes and a new cluster emerges]

[Hirose, Yamanishi, Nakata, Fujimaki KDD2009]
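As a toy stand-in for tracking partitioning structure (connected components via union-find instead of the eigen-equation-compression method of the paper; the two snapshots are made up):

```python
def components(n, edges):
    """Count connected components of an n-node graph via union-find."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return len({find(i) for i in range(n)})

# Snapshot 1: one cluster; snapshot 2: a dropped edge splits off a new cluster.
c1 = components(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
c2 = components(5, [(0, 1), (1, 2), (3, 4)])
```

A jump in the number of clusters between snapshots signals a change of partitioning structure, the event the method above is designed to detect in correlation graphs.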

Page 31

4. Summary

• Novelty detection from dynamic data is a challenging issue in data mining.

• Dynamic model selection based on the MDL principle is a novel method for detecting structural changes in a data stream.

• Applications cover masquerade detection, failure detection, topic emergence detection, and network structure mining; the method is applicable to a wide range of areas.

Page 32

References

1) K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne: On-line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery, 8(3):275-300, May 2004.

2) J. Takeuchi and K. Yamanishi: A Unifying Framework for Detecting Outliers and Change-points from Time Series. IEEE Transactions on Knowledge and Data Engineering, 18(4):482-492, 2006.

3) S. Morinaga and K. Yamanishi: Tracking Dynamics of Topic Trends Using a Finite Mixture Model. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), ACM Press, 2004.

4) K. Yamanishi and Y. Maruyama: Dynamic Syslog Mining for Network Failure Monitoring. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pp. 499-508, ACM Press, 2005.

5) K. Yamanishi and Y. Maruyama: Dynamic Model Selection with Its Applications to Novelty Detection. IEEE Transactions on Information Theory, 53(6):2180-2189, June 2007.

6) S. Hirose, K. Yamanishi, T. Nakata, and R. Fujimaki: Network Anomaly Detection Based on Eigen Equation Compression. Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), 2009.

7) K. Yamanishi and E. Sakurai: Extensions and Probabilistic Analysis of Dynamic Model Selection. 2010 Workshop on Information Theoretic Methods in Science and Engineering (WITMSE 2010), Tampere, Finland, August 2010.