Classification and Novel Class Detection in Data Streams

Mehedy Masud1, Latifur Khan1, Jing Gao2, Jiawei Han2, and Bhavani Thuraisingham1

1Department of Computer Science, University of Texas at Dallas

2Department of Computer Science, University of Illinois at Urbana-Champaign

This work was funded in part by

Presentation Overview

Stream Mining Background

Novel Class Detection – Concept Evolution

Data Streams
Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records

Data Stream Classification
Uses past labeled data to build a classification model
Predicts the labels of future instances using the model
Helps decision making
[Figure: network traffic passes through the classification model. Attack traffic goes to the firewall (block and quarantine); benign traffic goes to the server. Expert analysis and labeling feed the model update.]

Introduction
Challenges
Infinite length
Concept-drift
Concept-evolution (emergence of novel class)
Recurrence (seasonal) class

ICDM 2012, Brussels, Belgium, 12/11/2012

Infinite Length
Impractical to store and use all historical data
◦ Requires infinite storage
◦ Requires infinite running time

Concept-Drift
[Figure: a data chunk of positive and negative instances with the previous hyperplane and the current hyperplane; the instances lying between the two are the victims of concept-drift.]

Concept-Evolution
[Figure: two panels over a feature space with axes x and y, split by the thresholds x1, y1, and y2 into regions A, B, C, and D that hold the + and - instances. In one panel a group of novel class instances (X) appears.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances
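Rules R1 and R2 can be written directly as code, which shows why a novel class goes unnoticed: every point off the thresholds receives either + or -. This is a minimal sketch; the threshold values and the sample points are illustrative assumptions, not taken from the slides.

```python
def classify(x, y, x1, y1, y2):
    """Apply rules R1 and R2; x1, y1, y2 are the thresholds from the figure."""
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return "+"  # rule R1
    if (x > x1 and y > y2) or (x < x1 and y > y1):
        return "-"  # rule R2
    return None     # exactly on a threshold: the rules do not cover it


# Illustrative threshold values (assumed for this sketch).
X1, Y1, Y2 = 5.0, 3.0, 6.0

existing_plus = classify(2.0, 1.0, X1, Y1, Y2)
existing_minus = classify(7.0, 8.0, X1, Y1, Y2)

# A point that really belongs to a new, third class still gets an old label,
# because the rules have no way to answer "none of the above".
novel_label = classify(7.0, 5.9, X1, Y1, Y2)
print(existing_plus, existing_minus, novel_label)
```

The rules partition the feature space between the existing classes only, so detecting a novel class needs a separate notion of "outside everything seen so far".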

Background: Ensemble of Classifiers
[Figure: the input (x, ?) is given to classifiers C1, C2, and C3; their individual outputs (+, +, -) are combined by voting into the ensemble output (+).]
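The voting step in the figure can be sketched as a simple majority vote. The three toy classifiers below are assumptions chosen only so that the individual outputs are +, +, - as in the figure.

```python
from collections import Counter


def ensemble_predict(classifiers, x):
    """Return the majority label and the individual votes for instance x."""
    votes = [clf(x) for clf in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label, votes


# Toy classifiers (hypothetical), each a function from a 2-D point to a label.
c1 = lambda x: "+" if x[0] > 0 else "-"
c2 = lambda x: "+" if x[1] > 0 else "-"
c3 = lambda x: "+" if x[0] + x[1] > 3 else "-"

label, votes = ensemble_predict([c1, c2, c3], (1.0, 1.0))
print(votes, "->", label)
```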

Background: Ensemble Classification of Data Streams
Divide the data stream into equal sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
[Figure: labeled chunks D1, D2, and D3 train classifiers C1, C2, and C3, which form the ensemble used for prediction on the unlabeled chunk D4. As D4 and then D5 become labeled, C4 and C5 are trained and enter the ensemble, and prediction moves on to D5 and D6.]
Addresses infinite length and concept-drift
Note: Di may contain data points from different classes
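The chunk-based update can be sketched as follows. Two things here are assumptions rather than details from the slides: the toy one-dimensional threshold learner, and the choice to rank the "best" classifiers by accuracy on the newest labeled chunk.

```python
def train(chunk):
    """Toy 1-D learner (assumed): threshold midway between the class means."""
    pos = [x for x, y in chunk if y == "+"]
    neg = [x for x, y in chunk if y == "-"]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    high = "+" if mean_pos > mean_neg else "-"
    low = "-" if high == "+" else "+"
    return lambda x: high if x > threshold else low


def accuracy(model, chunk):
    return sum(model(x) == y for x, y in chunk) / len(chunk)


def update_ensemble(ensemble, chunk, L=3):
    """Train on the newest labeled chunk and keep the best L classifiers."""
    candidates = ensemble + [train(chunk)]
    candidates.sort(key=lambda m: accuracy(m, chunk), reverse=True)
    return candidates[:L]


def make_chunk(shift):
    """Four labeled points; `shift` moves the concept (simulated drift)."""
    return [(shift + 0.0, "-"), (shift + 0.5, "-"),
            (shift + 2.0, "+"), (shift + 2.5, "+")]


ensemble = []
for shift in [0.0, 0.0, 0.0, 4.0]:  # D1..D3 share a concept, D4 has drifted
    ensemble = update_ensemble(ensemble, make_chunk(shift), L=3)

# After D4 the drift-aware classifier ranks first and one stale one is dropped.
print(len(ensemble), ensemble[0](5.0))
```

Only L classifiers are ever stored, which bounds memory on an unbounded stream, and re-ranking on the newest chunk lets classifiers that fit an outdated concept drop out.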

Examples of Recurrence and Novel Classes

Twitter Stream – a stream of messages
Each message may be given a category or “class”
◦ based on the topic
Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.
Among these
◦ “Election 2012” and “Hurricane Sandy” are novel classes because they are new events.
Also
◦ “Halloween” is a recurrence class because it “recurs” every year.

Concept-Evolution and Feature Space

[Figure: the feature-space illustration from the Concept-Evolution slide is repeated here: regions A, B, C, and D bounded by x1, y1, and y2, classification rules R1 and R2, and the novel class instances (X) that existing classification models misclassify.]

Novel Class Detection – Prior Work

Prior work

Three steps:

◦ Training and building decision boundary

◦ Outlier detection and filtering

◦ Computing cohesion and separation

Training: Creating Decision Boundary

[Figure: two panels over the same feature space (axes x and y, thresholds x1, y1, y2, regions A, B, C, D). Left: raw training data. Right: clusters are created and shown as pseudopoints.]
Addresses infinite length problem
• Training is done chunk-by-chunk (one classifier per chunk)
• An ensemble of classifiers is used for classification
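The step from raw training data to pseudopoints can be sketched as clustering followed by summarization. The slides do not say which clustering method or which summary fields are used, so the small k-means loop with a fixed initialization and the stored fields (centroid, radius, weight) are assumptions of this sketch.

```python
import math


def kmeans(points, k, iters=10):
    """Tiny k-means; uses the first k points as initial centroids (assumed)."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters


def to_pseudopoints(points, k):
    """Summarize each cluster; the raw points can then be discarded."""
    pseudopoints = []
    for cluster in kmeans(points, k):
        if not cluster:
            continue
        centroid = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        radius = max(math.dist(p, centroid) for p in cluster)
        pseudopoints.append(
            {"centroid": centroid, "radius": radius, "weight": len(cluster)}
        )
    return pseudopoints


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
pseudopoints = to_pseudopoints(data, k=2)
for pp in pseudopoints:
    print(pp)
```

Storing a few summaries per chunk instead of the raw instances is what keeps memory bounded; the region covered by the summaries can then serve as the decision boundary.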

Outlier Detection and Filtering

[Figure: a test instance x is checked against each of the L models M1, M2, ..., ML in the ensemble. Each model reports whether x is an Routlier, and the results are combined with AND. True: x is a filtered outlier (Foutlier), a potential novel class instance. False: x is an existing class instance.]
Test instance inside decision boundary: not an outlier
Test instance outside decision boundary: raw outlier, or Routlier
Routliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
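The filtering logic above can be sketched as two nested checks. The representation of a model as a list of (centroid, radius) pairs is an assumption of this sketch; the AND across the L models follows the figure.

```python
import math


def is_routlier(x, model):
    """Raw outlier: x lies outside every hypersphere of this one model."""
    return all(math.dist(x, centroid) > radius for centroid, radius in model)


def is_foutlier(x, ensemble):
    """Filtered outlier: x is an Routlier for all L models (logical AND)."""
    return all(is_routlier(x, model) for model in ensemble)


# Two toy models (assumed); the second has also seen a region near (5, 5).
m1 = [((0.0, 0.0), 1.0)]
m2 = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]
ensemble = [m1, m2]

inside = is_foutlier((0.5, 0.0), ensemble)    # inside both boundaries
filtered = is_foutlier((5.0, 5.2), ensemble)  # Routlier for m1 only
far_away = is_foutlier((9.0, 9.0), ensemble)  # Routlier for every model
print(inside, filtered, far_away)
```

The second test point shows the purpose of the AND: an instance that one model has never seen but another model covers is treated as an existing class instance, not as a novel class candidate.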

Computing Cohesion & Separation

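$\lambda_{c,q}(x)$ = the $q$ nearest neighbors of $x$ within class $c$; $\lambda_{o,q}(x)$ = the $q$ nearest neighbors of $x$ among the Foutliers
$a(x)$ = mean distance from an Foutlier $x$ to the instances in $\lambda_{o,q}(x)$
$b_c(x)$ = mean distance from $x$ to the instances in $\lambda_{c,q}(x)$
$b_{\min}(x)$ = minimum among all $b_c(x)$ (e.g. $b_+(x)$ in the figure)
q-Neighborhood Silhouette Coefficient (q-NSC):
$$q\text{-NSC}(x) = \frac{b_{\min}(x) - a(x)}{\max\left(b_{\min}(x),\, a(x)\right)}$$
If q-NSC(x) is positive, it means x is closer to the Foutliers than to any existing class.
[Figure: an Foutlier x with its neighborhoods $\lambda_{o,5}(x)$, $\lambda_{+,5}(x)$, and $\lambda_{-,5}(x)$ for q = 5, and the distances a(x), b+(x), and b-(x).]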
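The q-NSC computation can be sketched as below, using plain Euclidean distance and brute-force nearest-neighbor search. The data layout (a list of Foutliers and a dict of labeled instances per existing class) and the sample points are assumptions of this sketch.

```python
import math


def mean_dist_to_q_nearest(x, points, q):
    """Mean distance from x to its q nearest neighbors among `points`."""
    nearest = sorted(math.dist(x, p) for p in points)[:q]
    return sum(nearest) / len(nearest)


def q_nsc(x, foutliers, classes, q):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x))."""
    others = [p for p in foutliers if p != x]         # exclude x itself
    a = mean_dist_to_q_nearest(x, others, q)          # cohesion
    b_min = min(mean_dist_to_q_nearest(x, pts, q)     # separation
                for pts in classes.values())
    return (b_min - a) / max(b_min, a)


classes = {
    "+": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "-": [(20.0, 0.0), (21.0, 0.0), (20.0, 1.0)],
}

# A tight group of Foutliers far from both classes: strongly positive q-NSC.
group = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (9.0, 10.0)]
cohesive = q_nsc((10.0, 10.0), group, classes, q=3)

# An Foutlier sitting next to the + class but far from the other Foutliers.
stray_group = [(0.5, 0.5), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
stray = q_nsc((0.5, 0.5), stray_group, classes, q=3)
print(round(cohesive, 3), round(stray, 3))
```

The value lies in [-1, 1]: it approaches 1 when the Foutliers are tightly grouped and well separated from every existing class, and turns negative when x is nearer to an existing class than to the other Foutliers.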

Limitation: Recurrence Class

[Figure: the stream drawn as three rows of chunks (chunk0 ... chunk50, chunk51 ... chunk100, chunk101 ... chunk150), with the labels Novel and Recurrence marking where a class appears in the stream.]

Why Are Recurrence Classes Forgotten?
Divide the data stream into equal sized chunks
◦ Train a classifier from the whole data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while
[Figure: the chunk-and-ensemble diagram from the background slide: labeled chunks D1 to D5 train classifiers C1 to C5, the ensemble holds three classifiers at a time, and prediction is made on the next unlabeled chunk (D4, D5, D6).]
Addresses infinite length and concept-drift
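The forgetting effect can be reproduced with a toy ensemble in which each "model" is just the set of classes present in its training chunk. As a simplifying assumption, this sketch keeps the most recent L models instead of the best L; the class names are illustrative.

```python
from collections import deque


def known_classes(ensemble):
    """Union of the classes that at least one model in the ensemble has seen."""
    return set().union(*ensemble) if ensemble else set()


L = 3
ensemble = deque(maxlen=L)  # appending to a full deque discards the oldest

# Classes present in five consecutive labeled chunks (illustrative).
chunk_classes = [
    {"news", "sports"},
    {"news", "sports", "Halloween"},  # the seasonal class appears once
    {"news", "sports"},
    {"news", "sports"},
    {"news", "sports"},
]

history = []
for classes in chunk_classes:
    ensemble.append(frozenset(classes))
    history.append(known_classes(ensemble))

for i, known in enumerate(history, start=1):
    print(f"after chunk {i}: {sorted(known)}")
```

Once the only model trained on "Halloween" leaves the ensemble, a later reappearance of that class would be reported as novel even though it had been seen before.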

CLAM: The Proposed Approach

Proposed method
CLAss Based Micro-Classifier Ensemble
[Figure: the latest labeled chunk from the stream is used for training a new model, which updates the ensemble M (keeps all classes). Each latest unlabeled instance goes through outlier detection. Not outlier: classify using M (existing class). Outlier: buffering and novel class detection.]

Training and Updating

• Each chunk is first separated into different classes
• A micro-classifier is trained from each class’s data
• Each micro-classifier replaces one existing micro-classifier
• A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
• C such MCEs constitute the whole ensemble, E
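The training and updating steps above can be sketched as follows. Several details are assumptions, since the slides do not state them: a micro-classifier is reduced to a single (centroid, radius) summary, and the micro-classifier that gets replaced is simply the oldest one in that class's MCE.

```python
import math
from collections import defaultdict, deque


def train_micro_classifier(points):
    """Summarize one class's data in a chunk as (centroid, radius) (assumed)."""
    centroid = tuple(sum(c) / len(points) for c in zip(*points))
    radius = max(math.dist(p, centroid) for p in points)
    return (centroid, radius)


def update(E, chunk, L=3):
    """Separate the chunk by class, then update one MCE per class."""
    by_class = defaultdict(list)
    for x, label in chunk:
        by_class[label].append(x)
    for label, points in by_class.items():
        mce = E.setdefault(label, deque(maxlen=L))  # at most L per class
        mce.append(train_micro_classifier(points))  # drops the oldest if full
    return E


first = [((0.0, 0.0), "news"), ((1.0, 0.0), "news"),
         ((9.0, 9.0), "Halloween"), ((9.0, 10.0), "Halloween")]
later = [((0.0, 0.0), "news"), ((1.0, 0.0), "news")]

E = {}
for chunk in [first, later, later, later, later]:
    update(E, chunk, L=3)

print({label: len(mce) for label, mce in E.items()})
```

Because replacement happens inside each class's own MCE, chunks that lack a class never push that class's micro-classifiers out, so the ensemble keeps all classes.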

Outlier Detection and Classification

• A test instance x is first classified with each micro-classifier ensemble

• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)

• If all ensembles flag x as an outlier, then it is buffered and sent to the novel class detector

• Otherwise, the partial outputs are combined and a class label is predicted
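The four steps above can be sketched as follows. The form of the partial output (distance to the nearest centroid in that class's MCE) and the rule for combining partial outputs (pick the closest class among those that did not flag x) are assumptions of this sketch, as is the (centroid, radius) representation of a micro-classifier.

```python
import math


def mce_output(x, mce):
    """Return (partial output, outlier flag) for one class's MCE."""
    dists = [(math.dist(x, centroid), radius) for centroid, radius in mce]
    partial = min(d for d, _ in dists)          # assumed partial output
    is_outlier = all(d > r for d, r in dists)   # outside every micro-classifier
    return partial, is_outlier


def classify(x, E, buffer):
    """Predict a label, or buffer x when every MCE flags it as an outlier."""
    outputs = {label: mce_output(x, mce) for label, mce in E.items()}
    if all(flag for _, flag in outputs.values()):
        buffer.append(x)  # handed to the novel class detector later
        return None
    candidates = {label: p for label, (p, flag) in outputs.items() if not flag}
    return min(candidates, key=candidates.get)


# Toy ensemble E (assumed): one MCE per class, each a list of micro-classifiers.
E = {
    "+": [((0.0, 0.0), 1.5)],
    "-": [((5.0, 5.0), 1.5)],
}
buffer = []

label_near_plus = classify((0.5, 0.5), E, buffer)
label_near_minus = classify((5.0, 4.0), E, buffer)
label_far = classify((20.0, 20.0), E, buffer)
print(label_near_plus, label_near_minus, label_far, buffer)
```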

Evaluation
Competitors:
◦ CLAM (CL) – proposed work
◦ SCANR (SC) [1] – prior work
◦ ECSMiner (EM) [2] – prior work
◦ Olindda [3]-WCE [4] (OW) – another baseline
Datasets: Synthetic, KDD Cup 1999 & Forest covertype

1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. “Detecting recurring and novel classes in concept-drifting data streams,” in Proc. ICDM ’11, Dec. 2011, pp. 1176–1181.

2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. “Classification and novel class detection in concept-drifting data streams under time constraints,” IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6), 2011, pp. 859–874.

3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. “Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks,” in Proc. 2008 ACM Symposium on Applied Computing, 2008, pp. 976–980.

4. H. Wang, W. Fan, P. S. Yu, and J. Han. “Mining concept-drifting data streams using ensemble classifiers,” in Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, Aug. 2003, pp. 226–235.

Overall Error

[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]

Number of Recurring Classes vs Error

Error vs Drift and Chunk Size

Summary Table

Conclusion
Detect Recurrence
Improved Accuracy
Running Time
Reduced Human Interaction
Future work: use other base learners

QUESTIONS?

THANKS
