Novelty Detection & One-Class SVM (OCSVM)
Transcript of Novelty Detection & One-Class SVM (OCSVM)
Outline
Introduction
Quantile Estimation
OCSVM – Theory
OCSVM – Application to Jet Engines
Novelty Detection is
An unsupervised learning problem (the data are unlabeled)
The identification of new or unknown data or signals that a machine learning system was not aware of during training
Example 1
[Figure: a cluster of points labeled "Normal", with three isolated points each labeled "Novel"]
So what seems to be the problem?
It's a 2-class problem: "Normal" vs. "Novel".
The Problem is
That "all positive examples are alike, but each negative example is negative in its own way".
Example 2
Suppose we want to build a classifier that recognizes web pages about "pickup sticks".
How can we collect training data? We can surf the web and pretty easily assemble a sample to be our collection of positive examples.
What about negative examples? The negative examples are… the rest of the web, that is, ~("pickup sticks web page").
So the negative examples come from an unknown number of negative classes.
Applications
Many exist:
Intrusion detection
Fraud detection
Fault detection
Robotics
Medical diagnosis
E-Commerce
And more…
Possible Approaches
Density Estimation: estimate a density based on the training data, then threshold the estimated density for test points.
Quantile Estimation: estimate a quantile of the distribution underlying the training data: for a fixed constant $\nu \in (0,1]$, attempt to find a small set $S$ such that $\Pr(x \in S) \ge 1 - \nu$. Then check whether test points fall inside or outside $S$.
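The density-estimation route can be sketched in a few lines. This is an illustrative sketch, not code from the slides; the Gaussian data, the bandwidth, and the 5th-percentile threshold are arbitrary choices:

```python
# Density-estimation approach to novelty detection: fit a kernel density
# estimate on training data, then flag test points whose estimated density
# falls below a threshold.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # "normal" data

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# Threshold at the 5th percentile of training log-densities, so roughly 5%
# of the training points would themselves be flagged as low-density.
log_dens_train = kde.score_samples(X_train)
threshold = np.percentile(log_dens_train, 5)

X_test = np.array([[0.0, 0.0],    # near the bulk of the data
                   [6.0, 6.0]])   # far away -> novel
is_novel = kde.score_samples(X_test) < threshold
print(is_novel)   # expect [False, True]
```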
Quantile Estimation (QE)
A quantile function with respect to $(P, \lambda, H)$ is defined as:
$$U(\alpha) = \inf\{\lambda(C) \mid P(C) \ge \alpha,\ C \in H\}, \qquad 0 < \alpha \le 1$$
$H$ – a class of measurable subsets of $X$.
$\lambda : H \to \mathbb{R}$ – a real-valued function.
$C(\alpha)$ denotes the $C \in H$ that attains the infimum.
Quantile Estimation (QE)
The empirical quantile function is defined as above, with $P$ replaced by the empirical distribution:
$$P^m_{emp}(C) = \frac{1}{m}\sum_{i=1}^{m} I(x_i \in C)$$
$C^m(\alpha)$ denotes the $C \in H$ that attains the infimum on the training set.
Thus the goal is to estimate $C(\alpha)$ through $C^m(\alpha)$.
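For intuition, here is a one-dimensional instance of empirical quantile estimation, taking $H$ to be the class of closed intervals and $\lambda(C)$ the interval length (an illustrative choice for this sketch, not the SV class used later):

```python
# Empirical quantile estimation with H = closed intervals on the real line
# and lambda(C) = interval length: find the shortest interval containing at
# least a fraction alpha of the sample.
import numpy as np

def shortest_interval(x, alpha):
    """Shortest interval [lo, hi] covering at least ceil(alpha*m) points."""
    x = np.sort(np.asarray(x, dtype=float))
    m = len(x)
    k = int(np.ceil(alpha * m))          # points the interval must contain
    widths = x[k - 1:] - x[: m - k + 1]  # width of every k-point window
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=1000)
lo, hi = shortest_interval(sample, alpha=0.9)
covered = np.mean((sample >= lo) & (sample <= hi))
print(lo, hi, covered)   # covered >= 0.9 by construction
```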
Quantile Estimation
Choosing $H$ and $\lambda$ intelligently is important:
On one hand, a large class $H$ yields many small sets that contain a fraction $\alpha$ of the training examples.
On the other hand, if we allowed just any set, the chosen set could consist of only the training points, giving poor generalization.
Complex vs. Simple
[Figure: a complex region vs. a simple region, both estimated from the same training sample]
Support Vector Method for Novelty Detection
Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, John Platt
Problem Formulation
Suppose we are given a training sample drawn from an underlying distribution $P$.
We want to estimate a "simple" subset $S \subseteq X$ such that for a test point $x$ drawn from $P$, $\Pr(x \in S) \ge 1 - \nu$, with $\nu \in (0,1]$.
We approach the problem by trying to estimate a function $f$ which is positive on $S$ and negative on the complement.
The SV Approach to QE
The class $H$ is defined as the set of half-spaces in a feature space $F$ (via a kernel $k$).
Here we define $\lambda(C_w) = \|w\|^2$, where $C_w = \{x \mid f_w(x) \ge \rho\}$, and $(w, \rho)$ are respectively a weight vector and an offset parameterizing a hyperplane in $F$.
"Hey, just a second"
If we use hyperplanes and offsets, doesn't it mean we separate the "positive" sample? But separate from what?
From the Origin
OCSVM
[Figure: training points mapped by $\Phi(x)$, separated from the origin by a hyperplane with normal $w$ at distance $\rho/\|w\|$]
OCSVM
To separate the data set from the origin we solve the following quadratic program:
$$\min_{w \in F,\ \xi \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho$$
$$\text{subject to } \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i, \qquad \xi_i \ge 0$$
$\nu$ serves as a penalizer, like "C" in the 2-class SVM (recall that $0 < \nu \le 1$).
Notice that no "y"s are incorporated in the constraints, since there are no labels.
OCSVM
The decision function is therefore:
$$f(x) = \operatorname{sgn}(\langle w, \Phi(x)\rangle - \rho)$$
Since the slack variables $\xi_i$ are penalized in the objective function, we can expect that if $w$ and $\rho$ solve the problem, then $f$ will equal $+1$ for most examples in the training set, while $\|w\|$ still stays small.
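In practice this program is solved by off-the-shelf libraries; scikit-learn's `OneClassSVM` implements this formulation, with its `nu` parameter playing the role of $\nu$. A minimal sketch (the Gaussian data and the `gamma`/`nu` values are arbitrary choices for illustration):

```python
# Fit a one-class SVM and evaluate its decision f(x) = sgn(<w, Phi(x)> - rho):
# +1 means "normal", -1 means "novel".
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))

ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

preds = ocsvm.predict(np.array([[0.0, 0.0],    # inside the data cloud
                                [5.0, 5.0]]))  # far from all training points
print(preds)   # expect [ 1 -1]

# As the slide argues, most training points come out +1.
frac_normal = np.mean(ocsvm.predict(X_train) == 1)
print(frac_normal)
```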
OCSVM
Using multipliers $\alpha_i, \beta_i \ge 0$ we get the Lagrangian:
$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho - \sum_i \alpha_i\big(\langle w, \Phi(x_i)\rangle - \rho + \xi_i\big) - \sum_i \beta_i \xi_i$$
OCSVM
Setting the derivatives of $L$ with respect to $w, \xi, \rho$ to 0 yields:
1) $w = \sum_i \alpha_i \Phi(x_i)$
2) $\alpha_i = \frac{1}{\nu m} - \beta_i \le \frac{1}{\nu m}$, $\quad \sum_i \alpha_i = 1$
OCSVM
Eq. 1 transforms $f(x)$ into a kernel expansion:
$$f(x) = \operatorname{sgn}\Big(\sum_i \alpha_i k(x_i, x) - \rho\Big)$$
Substituting eq. 1 & 2 into $L$ yields the dual problem:
$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{ij} \alpha_i \alpha_j k(x_i, x_j)$$
$$\text{subject to } 0 \le \alpha_i \le \frac{1}{\nu m}, \qquad \sum_i \alpha_i = 1$$
The offset $\rho$ can be recovered by exploiting that for any $\alpha_i$ with $0 < \alpha_i < \frac{1}{\nu m}$, the corresponding pattern $x_i$ satisfies:
$$\rho = \langle w, \Phi(x_i)\rangle = \sum_j \alpha_j k(x_j, x_i)$$
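The kernel expansion can be checked numerically against a fitted model. In scikit-learn's `OneClassSVM`, `dual_coef_` holds the $\alpha_i$ of the support vectors and `-intercept_` corresponds to the offset $\rho$ (the data and parameters below are illustrative):

```python
# Verify f(x) = sum_i alpha_i k(x_i, x) - rho against sklearn's own
# decision_function.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
gamma = 0.5
model = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.1).fit(X)

rho = -model.intercept_[0]                     # offset of the hyperplane
X_test = rng.normal(0.0, 1.0, size=(10, 2))

# Manual kernel expansion over the support vectors
K = rbf_kernel(X_test, model.support_vectors_, gamma=gamma)
manual = K @ model.dual_coef_[0] - rho
sk_vals = model.decision_function(X_test)
print(np.max(np.abs(manual - sk_vals)))   # should be ~0
```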
ν-Property
Assume the solution of the primal problem satisfies $\rho \ne 0$. The following statements hold:
$\nu$ is an upper bound on the fraction of outliers.
$\nu$ is a lower bound on the fraction of SVs.
With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of outliers (under certain conditions on $P(x)$ and the kernel).
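The ν-property can be observed empirically. A sketch with arbitrary data and parameters (the solver works to a numerical tolerance, so the bounds hold only approximately):

```python
# Check that nu upper-bounds the outlier fraction and lower-bounds the
# support-vector fraction on a training sample.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(1000, 2))

nu = 0.2
model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)

frac_outliers = np.mean(model.predict(X) == -1)   # points with f(x) = -1
frac_svs = len(model.support_) / len(X)           # support-vector fraction
print(frac_outliers, frac_svs)   # expect frac_outliers <= nu <= frac_svs
```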
Results – USPS ("0")
[Histogram: x axis – SVM output magnitude, y axis – frequency]
For ν = 50%, we get: 50% SVs, 49% outliers.
For ν = 5%, we get: 6% SVs, 4% outliers.
OCSVM – Shortcomings
Implicitly assumes that the "negative" data lie around the origin.
Completely ignores "negative" data, even when such data partially exist.
Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
Paul Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis
Intro
Jet engines undergo pass-off tests before they can be delivered to the customer.
Through vibration tests, an engine's "vibration signature" can be extracted.
While normal vibration signatures are common, we may be short of abnormal signatures.
Or even worse, the engine under test may exhibit a type of abnormality which has never been seen before.
Feature Selection
Vibration gauges are attached to the engine's case.
The engine under test is slowly accelerated from idle to full speed and decelerated back to idle.
The vibration signal is then recorded.
The final feature is calculated as a weighted average of the vibration over 10 different speed ranges, yielding a 10-D vector.
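The slides do not give the exact weighting used, so as a rough sketch one might bin the vibration amplitude into 10 equal-width speed bands and average within each band; the function name and binning scheme below are hypothetical:

```python
# Hypothetical sketch of a 10-D "vibration signature": average the vibration
# amplitude within each of 10 equal-width speed bands spanning idle to full
# speed. The real weighting in Hayton et al. may differ.
import numpy as np

def vibration_signature(speed, amplitude, n_ranges=10):
    """Average amplitude within each of n_ranges equal-width speed bands."""
    edges = np.linspace(speed.min(), speed.max(), n_ranges + 1)
    # np.digitize assigns each sample to a band index 0..n_ranges-1
    band = np.clip(np.digitize(speed, edges) - 1, 0, n_ranges - 1)
    return np.array([amplitude[band == b].mean() for b in range(n_ranges)])

rng = np.random.default_rng(0)
speed = np.linspace(0.3, 1.0, 2000)              # idle -> full speed (normalized)
amplitude = 1.0 + 0.1 * rng.standard_normal(2000)
sig = vibration_signature(speed, amplitude)
print(sig.shape)   # (10,)
```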
Algorithm
Slightly more general than the regular OCSVM.
In addition to the "normal" data points $X = \{x_1, \ldots, x_m\}$, we take into account some abnormal points $Z = \{z_1, \ldots, z_t\}$.
Rather than separating from the origin, we separate from the mean of $Z$.
Primal Form
$$\min_{w \in F,\ \xi \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho$$
$$\text{subject to } \Big\langle w,\ \Phi(x_i) - \frac{1}{t}\sum_n \Phi(z_n)\Big\rangle \ge \rho - \xi_i, \qquad \xi_i \ge 0$$
and the decision function is
$$f(x) = \operatorname{sgn}\Big(\Big\langle w,\ \Phi(x) - \frac{1}{t}\sum_n \Phi(z_n)\Big\rangle - \rho\Big)$$
Dual Form
$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{ij} \alpha_i \alpha_j \big(k(x_i, x_j) - q_i - q_j + \bar{q}\big)$$
where
$$q_j = \frac{1}{t}\sum_n k(z_n, x_j)$$
and
$$\bar{q} = \frac{1}{t^2}\sum_{n,p} k(z_n, z_p)$$
subject to
$$0 \le \alpha_i \le \frac{1}{\nu m}, \qquad \sum_i \alpha_i = 1$$
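The modified dual amounts to the standard dual with a "centered" kernel $k(x_i,x_j) - q_i - q_j + \bar{q}$, i.e. the feature map shifted by the mean of $\Phi(Z)$. One way to sketch this is to precompute that kernel and hand it to a standard OCSVM (the data and parameters are illustrative assumptions; scikit-learn's precomputed-kernel interface is real):

```python
# Modified OCSVM via a precomputed, Z-centered kernel matrix.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))    # "normal" engines
Z = rng.normal(4.0, 0.5, size=(20, 2))     # a few known-abnormal engines
gamma = 0.5

K_xx = rbf_kernel(X, X, gamma=gamma)
q = rbf_kernel(X, Z, gamma=gamma).mean(axis=1)   # q_i = (1/t) sum_n k(z_n, x_i)
q_bar = rbf_kernel(Z, Z, gamma=gamma).mean()     # (1/t^2) sum_{n,p} k(z_n, z_p)
K_mod = K_xx - q[:, None] - q[None, :] + q_bar   # centered Gram matrix (still PSD)

model = OneClassSVM(kernel="precomputed", nu=0.1).fit(K_mod)
frac_normal = np.mean(model.predict(K_mod) == 1)
print(frac_normal)   # most training engines accepted
```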
2D Toy Example
Training Data
99 normal engines were used as training data.
40 normal engines were used as validation data.
23 abnormal engines were used as test data.
Standard OCSVM Results
Modified OCSVM Results