
Classification and Novel Class Detection in Data Streams

Mehedy Masud¹, Latifur Khan¹, Jing Gao², Jiawei Han², and Bhavani Thuraisingham¹

¹ Department of Computer Science, University of Texas at Dallas

² Department of Computer Science, University of Illinois at Urbana Champaign

This work was funded in part by


Presentation Overview

Stream Mining Background

Novel Class Detection – Concept Evolution

Data Streams

Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records

Data Stream Classification

Uses past labeled data to build classification model

Predicts the labels of future instances using the model

Helps decision making

[Figure: data stream classification workflow. Labels: network traffic, firewall, classification model, attack traffic, block and quarantine, benign traffic, server, expert analysis and labeling, model update.]

Introduction

Challenges

Infinite length
Concept-drift
Concept-evolution (emergence of novel class)
Recurrence (seasonal) class

ICDM 2012, Brussels, Belgium 12/11/2012

Infinite Length

Impractical to store and use all historical data
◦ Requires infinite storage
◦ And running time

Concept-Drift

[Figure: a data chunk with positive and negative instances, showing the previous hyperplane, the current hyperplane, and the instances that are victims of concept-drift.]

Concept-Evolution

[Figure: scatter plots of + and - instances over axes x and y, with thresholds x1, y1, y2 and regions A, B, C, D; one panel adds a cluster of novel class instances (X).]

Classification rules:

R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +

R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -

Existing classification models misclassify novel class instances
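To make the misclassification concrete, rules R1 and R2 can be written as a small function. This is only a sketch: the threshold values and the test points are hypothetical, chosen for illustration, and are not taken from the slides.

```python
def classify(x, y, x1, y1, y2):
    """Apply rules R1 and R2. Returns '+' or '-' (None only on a threshold)."""
    # R1
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return "+"
    # R2
    if (x > x1 and y > y2) or (x < x1 and y > y1):
        return "-"
    return None  # exactly on a threshold; the rules leave this case open


# Hypothetical thresholds, for illustration only.
X1, Y1, Y2 = 5.0, 3.0, 6.0

print(classify(7.0, 2.0, X1, Y1, Y2))      # '+'
print(classify(2.0, 8.0, X1, Y1, Y2))      # '-'
# A point far away from all training data still receives an existing label.
print(classify(100.0, 100.0, X1, Y1, Y2))  # '-'
```

Because the two rules cover the whole plane apart from the thresholds, the model has no way to answer "none of the above", which is why a novel class instance is always assigned to an existing class.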

Background: Ensemble of Classifiers

[Figure: an input (x, ?) is passed to classifiers C1, C2, and C3; their individual outputs (+, +, -) are combined by voting into the ensemble output (+).]
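The voting step in the figure can be sketched as a plain majority vote. The three classifiers below are hypothetical stand-ins (simple threshold functions), not the models used in the talk.

```python
from collections import Counter


def ensemble_predict(classifiers, x):
    """Collect one vote per classifier and return (majority label, all votes)."""
    votes = [clf(x) for clf in classifiers]
    label, _ = Counter(votes).most_common(1)[0]
    return label, votes


# Hypothetical classifiers C1, C2, C3, for illustration only.
c1 = lambda x: "+" if x[0] > 0 else "-"
c2 = lambda x: "+" if x[1] > 0 else "-"
c3 = lambda x: "+" if x[0] + x[1] > 3 else "-"

label, votes = ensemble_predict([c1, c2, c3], (1, 1))
print(votes)  # ['+', '+', '-']
print(label)  # '+'
```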

Background: Ensemble Classification of Data Streams

Divide the data stream into equal sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3

[Figure: data chunks D1 to D6, classifiers C1 to C5 trained from them, and the ensemble (C1, C2, C3) making predictions on D4. Legend: labeled chunk, unlabeled chunk.]

Addresses infinite length and concept-drift

Note: Di may contain data points from different classes
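The chunk-based scheme can be sketched as follows. The slides do not say which base learner is used or how "best" is measured, so this sketch assumes a nearest-centroid base learner and ranks classifiers by accuracy on the latest labeled chunk; all function names and the toy data are hypothetical.

```python
from collections import Counter


def train_centroid_classifier(chunk):
    """chunk: list of (features, label). Returns a predict(x) function."""
    sums, counts = {}, Counter()
    for x, y in chunk:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        # Label of the nearest class centroid (squared Euclidean distance).
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def accuracy(model, chunk):
    return sum(model(x) == y for x, y in chunk) / len(chunk)


def update_ensemble(ensemble, labeled_chunk, L=3):
    """Train one classifier on the new chunk, then keep the best L overall."""
    candidates = ensemble + [train_centroid_classifier(labeled_chunk)]
    candidates.sort(key=lambda m: accuracy(m, labeled_chunk), reverse=True)
    return candidates[:L]


def ensemble_predict(ensemble, x):
    return Counter(m(x) for m in ensemble).most_common(1)[0][0]


def make_chunk(shift):
    # Toy one-dimensional drift: both classes move right by `shift`.
    return [((0.0 + shift, 0.0), "-"), ((1.0 + shift, 0.0), "-"),
            ((4.0 + shift, 0.0), "+"), ((5.0 + shift, 0.0), "+")]


ensemble = []
for shift in [0.0, 0.5, 1.0, 1.5, 2.0]:
    ensemble = update_ensemble(ensemble, make_chunk(shift))

print(len(ensemble))                          # 3
print(ensemble_predict(ensemble, (6.5, 0.0)))  # '+'
```

Only L classifiers are held in memory at any time, which is how the scheme copes with infinite length; re-ranking on the newest chunk is what lets it follow concept-drift.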

Examples of Recurrence and Novel Classes

Twitter Stream – a stream of messages

Each message may be given a category or “class”
◦ based on the topic

Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.

Among these
◦ “Election 2012” or “Hurricane Sandy” are novel classes because they are new events.

Also
◦ “Halloween” is a recurrence class because it “recurs” every year.

Concept-Evolution and Feature Space

[Figure: the scatter plots and classification rules R1 and R2 from the Concept-Evolution slide, repeated: + and - instances over axes x and y with thresholds x1, y1, y2 and regions A, B, C, D, plus a cluster of novel class instances (X).]

Existing classification models misclassify novel class instances

Prior work

Novel Class Detection – Prior Work

Three steps:

◦ Training and building decision boundary

◦ Outlier detection and filtering

◦ Computing cohesion and separation

Training: Creating Decision Boundary

[Figure: two panels over axes x and y with thresholds x1, y1, y2 and regions A, B, C, D. Panel labels: raw training data (+ and - instances); clusters are created; pseudopoints.]

Addresses infinite length problem

• Training is done chunk-by-chunk (one classifier per chunk)

• An ensemble of classifiers is used for classification
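One way to picture the step from raw training data to pseudopoints is sketched below. The slides only say that clusters are created and summarized; the use of k-means, and the choice to store a centroid, a radius, and a count per cluster, are assumptions made here for illustration.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (clusters, centroids) with matching indices."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Each centroid becomes the mean of its current members.
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return clusters, centroids


def build_pseudopoints(points, k):
    """Summarize each non-empty cluster; the raw points can then be discarded."""
    clusters, centroids = kmeans(points, k)
    return [
        {
            "centroid": c,
            "radius": max(math.dist(p, c) for p in members),
            "count": len(members),
        }
        for members, c in zip(clusters, centroids)
        if members
    ]


# Hypothetical training data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
for pp in build_pseudopoints(points, k=2):
    print(pp)
```

Under this assumption the union of the pseudopoint hyperspheres plays the role of the decision boundary, and memory use depends on the number of clusters rather than on the number of training instances.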

Outlier Detection and Filtering

[Figure: test instances x plotted over axes x and y with thresholds x1, y1, y2 and regions A, B, C, D.]

Test instance inside decision boundary: not an outlier

Test instance outside decision boundary: raw outlier, or Routlier

A test instance x is checked against each model in the ensemble of L models (M1, M2, ..., ML), and the Routlier results are combined with AND:
◦ True: x is a filtered outlier (Foutlier), a potential novel class instance
◦ False: x is an existing class instance

Routliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
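The filtering logic above can be sketched in a few lines. The sketch assumes that each model is a list of pseudopoints with a centroid and a radius, and that "outside the decision boundary" means outside every pseudopoint hypersphere; that representation and the toy ensemble are assumptions, not details given on the slide.

```python
import math


def is_routlier(model, x):
    """Raw outlier: x lies outside every pseudopoint of this one model."""
    return all(math.dist(x, pp["centroid"]) > pp["radius"] for pp in model)


def is_foutlier(ensemble, x):
    """Filtered outlier: x is a raw outlier for ALL models (logical AND)."""
    return all(is_routlier(model, x) for model in ensemble)


# Hypothetical ensemble of two models, for illustration only.
m1 = [{"centroid": (0.0, 0.0), "radius": 1.0}]
m2 = [{"centroid": (0.0, 0.0), "radius": 2.0},
      {"centroid": (5.0, 5.0), "radius": 1.0}]
ensemble = [m1, m2]

print(is_foutlier(ensemble, (0.5, 0.0)))  # False: inside both models
print(is_foutlier(ensemble, (1.5, 0.0)))  # False: Routlier for m1 only
print(is_foutlier(ensemble, (9.0, 0.0)))  # True: Routlier for every model
```

The second call shows the point of the AND: an instance that one model rejects but another accepts is treated as an existing class instance, which screens out some outliers caused by drift or noise.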

Computing Cohesion & Separation

a(x) = mean distance from an Foutlier x to the instances in $\lambda_{o,q}(x)$, the q nearest neighbors of x among the other Foutliers

b_c(x) = mean distance from x to the instances in $\lambda_{c,q}(x)$, the q nearest neighbors of x in existing class c

b_min(x) = minimum among all b_c(x) (e.g. b+(x) in figure)

q-Neighborhood Silhouette Coefficient (q-NSC):

$$\text{q-NSC}(x) = \frac{b_{\min}(x) - a(x)}{\max\left(b_{\min}(x),\ a(x)\right)}$$

If q-NSC(x) is positive, it means x is closer to the other Foutliers than to any existing class.

[Figure: an Foutlier x among + and - instances, with its neighborhoods $\lambda_{o,5}(x)$, $\lambda_{+,5}(x)$, and $\lambda_{-,5}(x)$ and the distances a(x), b+(x), and b-(x).]
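A direct computation of q-NSC from the definitions above might look like the following. It is a sketch that works on raw instances with Euclidean distance; the helper names and the toy data are hypothetical, and the case where both a(x) and b_min(x) are zero is not handled.

```python
import math


def mean_dist_to_q_nearest(x, points, q):
    """Mean distance from x to its q nearest neighbors among `points`."""
    nearest = sorted(math.dist(x, p) for p in points)[:q]
    return sum(nearest) / len(nearest)


def q_nsc(x, other_foutliers, class_instances, q):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x)).

    other_foutliers: Foutliers other than x.
    class_instances: dict mapping each existing class to its instances.
    """
    a = mean_dist_to_q_nearest(x, other_foutliers, q)
    b_min = min(
        mean_dist_to_q_nearest(x, pts, q) for pts in class_instances.values()
    )
    return (b_min - a) / max(b_min, a)


# Toy data, for illustration only.
x = (0.0, 0.0)
foutliers = [(1.0, 0.0), (0.0, 1.0)]              # a(x) = 1
classes = {"+": [(4.0, 0.0), (0.0, 4.0)],         # b+(x) = 4
           "-": [(10.0, 0.0), (0.0, 10.0)]}       # b-(x) = 10
print(q_nsc(x, foutliers, classes, q=2))          # 0.75
```

The value lies in [-1, 1]; here it is positive because x sits closer to the other Foutliers (cohesion) than to the nearest existing class (separation).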

Limitation: Recurrence Class

[Figure: a stream of chunks drawn in three rows (chunk0, chunk1, ..., chunk49, chunk50; chunk51, chunk52, ..., chunk99, chunk100; chunk101, chunk102, ..., chunk149, chunk150), with the labels Stream, Novel, and Recurrence.]

Why Are Recurrence Classes Forgotten?

Divide the data stream into equal sized chunks
◦ Train a classifier from whole data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while

[Figure: data chunks D1 to D6, classifiers C1 to C5 trained from them, and the ensemble (C1, C2, C3) making predictions on D4. Legend: labeled chunk, unlabeled chunk.]

Addresses infinite length and concept-drift
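The forgetting effect can be shown with a toy simulation. As a simplification, each chunk-level model is represented only by the set of classes present in its training chunk, and the ensemble drops its oldest model when a new one arrives; the slides say "best L", so replacing the oldest is an assumption made here to keep the sketch short.

```python
from collections import deque


def classes_known_before_each_chunk(chunk_classes, L=3):
    """For every chunk, record which classes the ensemble knows on arrival."""
    ensemble = deque(maxlen=L)  # the oldest model falls out automatically
    history = []
    for classes in chunk_classes:
        history.append(set().union(*ensemble))
        ensemble.append(set(classes))  # model trained from the whole chunk
    return history


# Class B appears early, disappears for three chunks, then recurs.
stream = [{"A", "B"}, {"A", "B"}, {"A"}, {"A"}, {"A"}, {"A", "B"}]
history = classes_known_before_each_chunk(stream, L=3)

print(sorted(history[4]))  # ['A', 'B']: one model still remembers B
print(sorted(history[5]))  # ['A']: B has been forgotten when it recurs
```

When B returns in the last chunk, no model in the ensemble has seen it, so it would be reported as a novel class rather than recognized as a recurring one.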

Proposed method

CLAM: The Proposed Approach

CLAss Based Micro-Classifier Ensemble

[Figure: CLAM workflow. The latest labeled chunk from the stream is used for training a new model, which updates the ensemble M (keeps all classes). The latest unlabeled instance goes through outlier detection: if it is not an outlier, it is classified using M (existing class); if it is an outlier, it goes to buffering and novel class detection.]

Training and Updating

• Each chunk is first separated into different classes
• A micro-classifier is trained from each class’s data
• Each micro-classifier replaces one existing micro-classifier
• A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
• C such MCEs constitute the whole ensemble, E
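The class-based update can be sketched as below. Several details are assumptions made for illustration: a micro-classifier is reduced to a single centroid with a radius, a new micro-classifier replaces the oldest one in the MCE of its own class, and the function names are hypothetical.

```python
import math
from collections import defaultdict, deque


def train_micro_classifier(points):
    """Summarize one class's data from one chunk (assumed representation)."""
    centroid = tuple(sum(col) / len(points) for col in zip(*points))
    radius = max(math.dist(p, centroid) for p in points)
    return {"centroid": centroid, "radius": radius}


def update_clam(E, labeled_chunk, L=3):
    """E maps each class to its MCE, a bounded queue of micro-classifiers."""
    by_class = defaultdict(list)
    for x, y in labeled_chunk:          # separate the chunk into classes
        by_class[y].append(x)
    for label, points in by_class.items():
        mce = E.setdefault(label, deque(maxlen=L))
        mce.append(train_micro_classifier(points))  # evicts oldest if full
    return E


E = {}
update_clam(E, [((0.0, 0.0), "A"), ((1.0, 0.0), "A"),
                ((5.0, 5.0), "B"), ((6.0, 5.0), "B")])
for _ in range(5):                      # class B is absent for five chunks
    update_clam(E, [((0.0, 0.0), "A"), ((1.0, 1.0), "A")])

print(sorted(E))                 # ['A', 'B']
print(len(E["A"]), len(E["B"]))  # 3 1
```

Because replacement happens only inside the MCE of the same class, chunks that contain no instances of B leave B's micro-classifiers untouched, so the class is still known if it recurs.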


Outlier Detection and Classification

• A test instance x is first classified with each micro-classifier ensemble

• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)

• If all ensembles flag x as an outlier, then it is buffered and sent to the novel class detector

• Otherwise, the partial outputs are combined and a class label is predicted
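The decision flow in these bullets can be sketched as follows. The slide does not define the partial output Yr or how the outputs are combined, so this sketch assumes that Yr is the distance from x to the nearest micro-classifier centroid in the MCE and that the class with the smallest Yr wins; the data structures are hypothetical.

```python
import math


def mce_output(mce, x):
    """Return (partial output Yr, outlier flag) for one MCE."""
    dists = [math.dist(x, mc["centroid"]) for mc in mce]
    outlier = all(d > mc["radius"] for d, mc in zip(dists, mce))
    return min(dists), outlier


def classify_or_buffer(E, x, buffer):
    """E maps class label -> MCE. Returns a label, or None if x is buffered."""
    outputs = {label: mce_output(mce, x) for label, mce in E.items()}
    if all(flag for _, flag in outputs.values()):
        buffer.append(x)  # held back for the novel class detector
        return None
    # Combine the partial outputs: here, the smallest distance wins.
    return min(outputs, key=lambda label: outputs[label][0])


# Hypothetical ensemble with one micro-classifier per class.
E = {"A": [{"centroid": (0.0, 0.0), "radius": 1.0}],
     "B": [{"centroid": (5.0, 5.0), "radius": 1.0}]}
buffer = []

print(classify_or_buffer(E, (0.5, 0.0), buffer))    # 'A'
print(classify_or_buffer(E, (20.0, 20.0), buffer))  # None
print(buffer)                                       # [(20.0, 20.0)]
```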

Evaluation

Competitors:
◦ CLAM (CL) – proposed work
◦ SCANR (SC) [1] – prior work
◦ ECSMiner (EM) [2] – prior work
◦ Olindda [3]-WCE [4] (OW) – another baseline

Datasets: Synthetic, KDD Cup 1999 & Forest covertype

1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM ’11, pages 1176–1181, Dec. 2011.

2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. In Preprints, IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859–874, 2011.

3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM symposium on Applied computing, pages 976–980, 2008.

4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 226–235, Washington, DC, USA, Aug. 2003. ACM.

Overall Error

[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]

Number of Recurring Classes vs Error

Error vs Drift and Chunk Size

Summary Table

Conclusion

Detect Recurrence
Improved Accuracy
Running Time
Reduced Human Interaction
Future work: use other base learners

QUESTIONS?

THANKS