Novelty Detection & One-Class SVM (OCSVM)
Transcript of Novelty Detection & One-Class SVM (OCSVM)
Outline
Introduction
Quantile Estimation
OCSVM – Theory
OCSVM – Application to Jet Engines
Novelty Detection is
An unsupervised learning problem (the data are unlabeled)
The identification of new or unknown data or signals that a machine learning system was not aware of during training
Example 1
[Figure: a cluster of points labeled "Normal", with three isolated points each labeled "Novel"]
So what seems to be the problem?
It's a 2-class problem: "Normal" vs. "Novel".
The Problem is
That "all positive examples are alike, but each negative example is negative in its own way".
Example 2
Suppose we want to build a classifier that recognizes web pages about "pickup sticks".
How can we collect training data? We can surf the web and pretty easily assemble a sample to be our collection of positive examples.
What about negative examples? The negative examples are… the rest of the web, that is, ~("pickup sticks web page").
So the negative examples come from an unknown number of negative classes.
Applications
Many exist:
Intrusion detection
Fraud detection
Fault detection
Robotics
Medical diagnosis
E-Commerce
And more…
Possible Approaches
Density Estimation: estimate a density based on the training data, then threshold the estimated density for test points.
Quantile Estimation: estimate a quantile of the distribution underlying the training data: for a fixed constant $\nu \in (0,1]$, attempt to find a small set $S$ such that $\Pr(x \in S) \ge 1 - \nu$. Then check whether test points fall inside or outside $S$.
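The density-estimation route can be sketched in a few lines. This is an illustrative sketch, not code from the slides; the Gaussian data, the bandwidth, and the 5th-percentile threshold are arbitrary choices:

```python
# Density-estimation approach to novelty detection: fit a kernel density
# estimate on training data, then flag test points whose estimated density
# falls below a threshold.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # "normal" data

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# Threshold at the 5th percentile of training log-densities, so roughly 5%
# of the training points would themselves be flagged as low-density.
log_dens_train = kde.score_samples(X_train)
threshold = np.percentile(log_dens_train, 5)

X_test = np.array([[0.0, 0.0],    # near the bulk of the data
                   [6.0, 6.0]])   # far away -> novel
is_novel = kde.score_samples(X_test) < threshold
print(is_novel)   # expect [False, True]
```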
Quantile Estimation (QE)
A quantile function with respect to $(P, \lambda, H)$ is defined as:
$$U(\alpha) = \inf\{\lambda(C) \mid P(C) \ge \alpha,\ C \in H\}, \qquad 0 < \alpha \le 1$$
$H$ – a class of measurable subsets of $X$.
$\lambda : H \to \mathbb{R}$ – a real-valued function.
$C(\alpha)$ denotes the $C \in H$ that attains the infimum.
Quantile Estimation (QE)
The empirical quantile function is defined as above, with $P$ replaced by the empirical distribution:
$$P^m_{emp}(C) = \frac{1}{m}\sum_{i=1}^{m} I(x_i \in C)$$
$C^m(\alpha)$ denotes the $C \in H$ that attains the infimum on the training set.
Thus the goal is to estimate $C(\alpha)$ through $C^m(\alpha)$.
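For intuition, here is a one-dimensional instance of empirical quantile estimation, taking $H$ to be the class of closed intervals and $\lambda(C)$ the interval length (an illustrative choice for this sketch, not the SV class used later):

```python
# Empirical quantile estimation with H = closed intervals on the real line
# and lambda(C) = interval length: find the shortest interval containing at
# least a fraction alpha of the sample.
import numpy as np

def shortest_interval(x, alpha):
    """Shortest interval [lo, hi] covering at least ceil(alpha*m) points."""
    x = np.sort(np.asarray(x, dtype=float))
    m = len(x)
    k = int(np.ceil(alpha * m))          # points the interval must contain
    widths = x[k - 1:] - x[: m - k + 1]  # width of every k-point window
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=1000)
lo, hi = shortest_interval(sample, alpha=0.9)
covered = np.mean((sample >= lo) & (sample <= hi))
print(lo, hi, covered)   # covered >= 0.9 by construction
```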
Quantile Estimation
Choosing $H$ and $\lambda$ intelligently is important:
On one hand, a large class $H$ yields many small sets that contain a fraction $\alpha$ of the training examples.
On the other hand, if we allowed just any set, the chosen set could consist of only the training points, giving poor generalization.
Complex vs. Simple
[Figure: a complex region vs. a simple region, both estimated from the same training sample]
Support Vector Method for Novelty Detection
Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, John Platt
Problem Formulation
Suppose we are given a training sample drawn from an underlying distribution $P$.
We want to estimate a "simple" subset $S \subseteq X$ such that for a test point $x$ drawn from $P$, $\Pr(x \in S) \ge 1 - \nu$, with $\nu \in (0,1]$.
We approach the problem by trying to estimate a function $f$ which is positive on $S$ and negative on the complement.
The SV Approach to QE
The class $H$ is defined as the set of half-spaces in a feature space $F$ (via a kernel $k$).
Here we define $\lambda(C_w) = \|w\|^2$, where $C_w = \{x \mid f_w(x) \ge \rho\}$, and $(w, \rho)$ are respectively a weight vector and an offset parameterizing a hyperplane in $F$.
"Hey, just a second"
If we use hyperplanes and offsets, doesn't it mean we separate the "positive" sample? But separate from what?
From the Origin
OCSVM
[Figure: training points mapped by $\Phi(x)$, separated from the origin by a hyperplane with normal $w$ at distance $\rho/\|w\|$]
OCSVM
To separate the data set from the origin we solve the following quadratic program:
$$\min_{w \in F,\ \xi \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho$$
$$\text{subject to } \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i, \qquad \xi_i \ge 0$$
$\nu$ serves as a penalizer, like "C" in the 2-class SVM (recall that $0 < \nu \le 1$).
Notice that no "y"s are incorporated in the constraints, since there are no labels.
OCSVM
The decision function is therefore:
$$f(x) = \operatorname{sgn}(\langle w, \Phi(x)\rangle - \rho)$$
Since the slack variables $\xi_i$ are penalized in the objective function, we can expect that if $w$ and $\rho$ solve the problem, then $f$ will equal $+1$ for most examples in the training set, while $\|w\|$ still stays small.
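In practice this program is solved by off-the-shelf libraries; scikit-learn's `OneClassSVM` implements this formulation, with its `nu` parameter playing the role of $\nu$. A minimal sketch (the Gaussian data and the `gamma`/`nu` values are arbitrary choices for illustration):

```python
# Fit a one-class SVM and evaluate its decision f(x) = sgn(<w, Phi(x)> - rho):
# +1 means "normal", -1 means "novel".
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))

ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

preds = ocsvm.predict(np.array([[0.0, 0.0],    # inside the data cloud
                                [5.0, 5.0]]))  # far from all training points
print(preds)   # expect [ 1 -1]

# As the slide argues, most training points come out +1.
frac_normal = np.mean(ocsvm.predict(X_train) == 1)
print(frac_normal)
```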
OCSVM
Using multipliers $\alpha_i, \beta_i \ge 0$ we get the Lagrangian:
$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho - \sum_i \alpha_i\big(\langle w, \Phi(x_i)\rangle - \rho + \xi_i\big) - \sum_i \beta_i \xi_i$$
OCSVM
Setting the derivatives of $L$ with respect to $w, \xi, \rho$ to 0 yields:
1) $w = \sum_i \alpha_i \Phi(x_i)$
2) $\alpha_i = \frac{1}{\nu m} - \beta_i \le \frac{1}{\nu m}$, $\quad \sum_i \alpha_i = 1$
OCSVM
Eq. 1 transforms $f(x)$ into a kernel expansion:
$$f(x) = \operatorname{sgn}\Big(\sum_i \alpha_i k(x_i, x) - \rho\Big)$$
Substituting eq. 1 & 2 into $L$ yields the dual problem:
$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{ij} \alpha_i \alpha_j k(x_i, x_j)$$
$$\text{subject to } 0 \le \alpha_i \le \frac{1}{\nu m}, \qquad \sum_i \alpha_i = 1$$
The offset $\rho$ can be recovered by exploiting that for any $\alpha_i$ with $0 < \alpha_i < \frac{1}{\nu m}$, the corresponding pattern $x_i$ satisfies:
$$\rho = \langle w, \Phi(x_i)\rangle = \sum_j \alpha_j k(x_j, x_i)$$
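The kernel expansion can be checked numerically against a fitted model. In scikit-learn's `OneClassSVM`, `dual_coef_` holds the $\alpha_i$ of the support vectors and `-intercept_` corresponds to the offset $\rho$ (the data and parameters below are illustrative):

```python
# Verify f(x) = sum_i alpha_i k(x_i, x) - rho against sklearn's own
# decision_function.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
gamma = 0.5
model = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.1).fit(X)

rho = -model.intercept_[0]                     # offset of the hyperplane
X_test = rng.normal(0.0, 1.0, size=(10, 2))

# Manual kernel expansion over the support vectors
K = rbf_kernel(X_test, model.support_vectors_, gamma=gamma)
manual = K @ model.dual_coef_[0] - rho
sk_vals = model.decision_function(X_test)
print(np.max(np.abs(manual - sk_vals)))   # should be ~0
```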
ν-Property
Assume the solution of the primal problem satisfies $\rho \ne 0$. The following statements hold:
$\nu$ is an upper bound on the fraction of outliers.
$\nu$ is a lower bound on the fraction of SVs.
With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of outliers (under certain conditions on $P(x)$ and the kernel).
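The ν-property can be observed empirically. A sketch with arbitrary data and parameters (the solver works to a numerical tolerance, so the bounds hold only approximately):

```python
# Check that nu upper-bounds the outlier fraction and lower-bounds the
# support-vector fraction on a training sample.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(1000, 2))

nu = 0.2
model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)

frac_outliers = np.mean(model.predict(X) == -1)   # points with f(x) = -1
frac_svs = len(model.support_) / len(X)           # support-vector fraction
print(frac_outliers, frac_svs)   # expect frac_outliers <= nu <= frac_svs
```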
Results – USPS ("0")
[Histogram: x axis – SVM output magnitude, y axis – frequency]
For ν = 50%, we get: 50% SVs, 49% outliers.
For ν = 5%, we get: 6% SVs, 4% outliers.
OCSVM – Shortcomings
Implicitly assumes that the "negative" data lie around the origin.
Completely ignores "negative" data, even when such data partially exist.
Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
Paul Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis
Intro
Jet engines undergo pass-off tests before they can be delivered to the customer.
Through vibration tests, an engine's "vibration signature" can be extracted.
While normal vibration signatures are common, we may be short of abnormal signatures.
Or even worse, the engine under test may exhibit a type of abnormality which has never been seen before.
Feature Selection
Vibration gauges are attached to the engine's case.
The engine under test is slowly accelerated from idle to full speed and decelerated back to idle.
The vibration signal is then recorded.
The final feature is calculated as a weighted average of the vibration over 10 different speed ranges, yielding a 10-D vector.
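The slides do not give the exact weighting used, so as a rough sketch one might bin the vibration amplitude into 10 equal-width speed bands and average within each band; the function name and binning scheme below are hypothetical:

```python
# Hypothetical sketch of a 10-D "vibration signature": average the vibration
# amplitude within each of 10 equal-width speed bands spanning idle to full
# speed. The real weighting in Hayton et al. may differ.
import numpy as np

def vibration_signature(speed, amplitude, n_ranges=10):
    """Average amplitude within each of n_ranges equal-width speed bands."""
    edges = np.linspace(speed.min(), speed.max(), n_ranges + 1)
    # np.digitize assigns each sample to a band index 0..n_ranges-1
    band = np.clip(np.digitize(speed, edges) - 1, 0, n_ranges - 1)
    return np.array([amplitude[band == b].mean() for b in range(n_ranges)])

rng = np.random.default_rng(0)
speed = np.linspace(0.3, 1.0, 2000)              # idle -> full speed (normalized)
amplitude = 1.0 + 0.1 * rng.standard_normal(2000)
sig = vibration_signature(speed, amplitude)
print(sig.shape)   # (10,)
```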
Algorithm
Slightly more general than the regular OCSVM.
In addition to the "normal" data points $X = \{x_1, \ldots, x_m\}$, we take into account some abnormal points $Z = \{z_1, \ldots, z_t\}$.
Rather than separating from the origin, we separate from the mean of $Z$.
Primal Form
$$\min_{w \in F,\ \xi \in \mathbb{R}^m,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho$$
$$\text{subject to } \Big\langle w,\ \Phi(x_i) - \frac{1}{t}\sum_n \Phi(z_n)\Big\rangle \ge \rho - \xi_i, \qquad \xi_i \ge 0$$
and the decision function is
$$f(x) = \operatorname{sgn}\Big(\Big\langle w,\ \Phi(x) - \frac{1}{t}\sum_n \Phi(z_n)\Big\rangle - \rho\Big)$$
Dual Form
$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{ij} \alpha_i \alpha_j \big(k(x_i, x_j) - q_i - q_j + \bar{q}\big)$$
where
$$q_j = \frac{1}{t}\sum_n k(z_n, x_j)$$
and
$$\bar{q} = \frac{1}{t^2}\sum_{n,p} k(z_n, z_p)$$
subject to
$$0 \le \alpha_i \le \frac{1}{\nu m}, \qquad \sum_i \alpha_i = 1$$
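The modified dual amounts to the standard dual with a "centered" kernel $k(x_i,x_j) - q_i - q_j + \bar{q}$, i.e. the feature map shifted by the mean of $\Phi(Z)$. One way to sketch this is to precompute that kernel and hand it to a standard OCSVM (the data and parameters are illustrative assumptions; scikit-learn's precomputed-kernel interface is real):

```python
# Modified OCSVM via a precomputed, Z-centered kernel matrix.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))    # "normal" engines
Z = rng.normal(4.0, 0.5, size=(20, 2))     # a few known-abnormal engines
gamma = 0.5

K_xx = rbf_kernel(X, X, gamma=gamma)
q = rbf_kernel(X, Z, gamma=gamma).mean(axis=1)   # q_i = (1/t) sum_n k(z_n, x_i)
q_bar = rbf_kernel(Z, Z, gamma=gamma).mean()     # (1/t^2) sum_{n,p} k(z_n, z_p)
K_mod = K_xx - q[:, None] - q[None, :] + q_bar   # centered Gram matrix (still PSD)

model = OneClassSVM(kernel="precomputed", nu=0.1).fit(K_mod)
frac_normal = np.mean(model.predict(K_mod) == 1)
print(frac_normal)   # most training engines accepted
```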
2D Toy Example
Training Data
99 normal engines were used as training data.
40 normal engines were used as validation data.
23 abnormal engines were used as test data.
Standard OCSVM Results
Modified OCSVM Results