Transcript of slide10 - Anomaly Detection [Qi Liu] [Compatibility Mode]
Data Mining Tasks
Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Anomaly/Outlier Detection
What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data.
A natural implication is that anomalies are relatively rare.
One in a thousand occurs often if you have lots of data.
Context is important, e.g., freezing temperatures in July.
Anomalies can be important or a nuisance:
A 10-foot-tall 2-year-old
Unusually high blood pressure
Importance of Anomaly Detection
Ozone Depletion History
In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.
Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies
Data from different classes: measuring the weights of oranges, but a few grapefruit are mixed in
Natural variation: unusually tall people
Data errors: a 200-pound 2-year-old
Distinction Between Noise and Anomalies
Noise is erroneous, perhaps random, values or contaminating objects
Weight recorded incorrectly
Grapefruit mixed in with the oranges
Noise doesn’t necessarily produce unusual values or objects
Noise is not interesting
Anomalies may be interesting if they are not a result of noise
Noise and anomalies are related but distinct concepts
General Issues: Number of Attributes
Many anomalies are defined in terms of a single attribute: height, shape, color
Can be hard to find an anomaly using all attributes: noisy or irrelevant attributes; an object may be anomalous only with respect to some attributes
However, an object may not be anomalous in any one attribute
General Issues: Anomaly Scoring
Many anomaly detection techniques provide only a binary categorization
An object is an anomaly or it isn't; this is especially true of classification-based approaches
Other approaches assign a score to all points: this score measures the degree to which an object is an anomaly, and allows objects to be ranked
In the end, you often need a binary decision: should this credit card transaction be flagged? Still useful to have a score
How many anomalies are there?
Other Issues for Anomaly Detection
Find all anomalies at once or one at a time: swamping, masking
Evaluation: how do you measure performance? Supervised vs. unsupervised situations
Efficiency
Context: e.g., a professional basketball team
Variants of Anomaly Detection Problems
Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t
Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores
Given a data set D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
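These three variants differ only in how anomaly scores are consumed. A minimal sketch, where all function names and the toy nearest-distance score are illustrative, not from the slides:

```python
def above_threshold(scores, t):
    """Variant 1: all points with anomaly score greater than threshold t."""
    return [x for x, s in scores.items() if s > t]

def top_n(scores, n):
    """Variant 2: the n points with the largest anomaly scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def score_of(score_fn, D, x):
    """Variant 3: anomaly score of a test point x with respect to D."""
    return score_fn(D, x)

# Toy scoring function: distance to the nearest point of D.
def nearest_distance(D, x):
    return min(abs(x - d) for d in D)

D = [1.0, 1.1, 0.9, 1.2, 8.0]
# Leave-one-out scores for the points of D itself
scores = {x: nearest_distance([d for d in D if d != x], x) for x in D}
```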
Model-Based Anomaly Detection
Build a model for the data and see which points don't fit well
Unsupervised:
Anomalies are those points that don't fit well, or that distort the model
Examples: statistical distribution, clusters, regression, geometric, graph
Supervised:
Anomalies are regarded as a rare class
Need to have training data
Additional Anomaly Detection Techniques
Proximity-based: anomalies are points far away from other points; can detect this graphically in some cases
Density-based: low-density points are outliers
Pattern matching: create profiles or templates of atypical but important events or objects; algorithms to detect these patterns are usually simple and efficient
Graphical Approaches
Boxplots or scatter plots
Limitations: not automatic, subjective
Convex Hull Method
Extreme points are assumed to be outliers
Use the convex hull method to detect extreme values
What if the outlier occurs in the middle of the data?
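The hull idea can be sketched with Andrew's monotone chain algorithm (my own helper names, not from the slides). Hull vertices are the extreme-value candidates, which is exactly why an outlier sitting in the middle of the data is missed:

```python
def cross(o, a, b):
    # z-component of (a - o) x (b - o); positive means counter-clockwise turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# The interior point (2, 2) can never appear on the hull, so a central
# anomaly is invisible to this method: the limitation noted above.
data = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
extremes = convex_hull(data)
```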
Statistical Approaches
Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to a probability distribution model of the data.
Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
Apply a statistical test that depends on:
Data distribution
Parameters of the distribution (e.g., mean, variance)
Number of expected outliers (confidence limit)
Issues:
Identifying the distribution of a data set (heavy-tailed distributions)
Number of attributes
Is the data a mixture of distributions?
Normal Distributions
[Figures: probability density of a one-dimensional Gaussian and of a two-dimensional Gaussian (axes x and y, color scale = probability density)]
Grubbs' Test
Detects outliers in univariate data
Assumes data comes from a normal distribution
Detects one outlier at a time: remove the outlier, and repeat
H0: there is no outlier in the data
HA: there is at least one outlier

Grubbs' test statistic: $G = \frac{\max_i |X_i - \bar{X}|}{s}$

Reject H0 if: $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$
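A pure-Python sketch of the iterated test (illustrative names; the t-quantile in the rejection rule would normally come from `scipy.stats.t.ppf`, so here the critical value is passed in by the caller and the one used below is only a placeholder, not a tabulated Grubbs value):

```python
import statistics

def grubbs_statistic(data):
    """G = max_i |X_i - mean| / s for univariate data (assumed ~ normal)."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)           # sample standard deviation
    g, idx = max((abs(x - mean) / s, i) for i, x in enumerate(data))
    return g, idx

def grubbs_remove(data, critical_value):
    """Repeat Grubbs' test: while G exceeds the critical value,
    remove the offending point and record it as an outlier."""
    data = list(data)
    outliers = []
    while len(data) > 2:
        g, idx = grubbs_statistic(data)
        if g <= critical_value:          # H0 (no outlier) not rejected
            break
        outliers.append(data.pop(idx))
    return data, outliers

values = [5, 6, 5, 7, 6, 5, 6, 30]
g, idx = grubbs_statistic(values)
```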
Statistical-based: Likelihood Approach
Assume the data set D contains samples from a mixture of two probability distributions:
M (majority distribution)
A (anomalous distribution)
General approach:
Initially, assume all the data points belong to M
Let Lt(D) be the log likelihood of D at time t
For each point xt that belongs to M, move it to A:
Let Lt+1(D) be the new log likelihood
Compute the difference, Δ = Lt(D) − Lt+1(D)
If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A
Statistical-based: Likelihood Approach
Data distribution: D = (1 − λ) M + λ A
M is a probability distribution estimated from the data
Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
A is initially assumed to be a uniform distribution
Likelihood at time t:

$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$

$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$
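A runnable sketch of the procedure, with M a Gaussian re-estimated from its current members and A uniform over the data range, as on the previous slides. The code tests the likelihood gain from moving a point to A, i.e. the magnitude of the Δ above; all names are mine:

```python
import math

def gauss_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(M, A, lam, lo, hi):
    """LL(D) = |M| log(1-lam) + sum_M log P_M + |A| log lam + sum_A log P_A,
    with M ~ Gaussian fit to its current members and A ~ Uniform[lo, hi]."""
    mu = sum(M) / len(M)
    sigma = max(1e-6, math.sqrt(sum((x - mu) ** 2 for x in M) / len(M)))
    ll = len(M) * math.log(1 - lam) + sum(gauss_logpdf(x, mu, sigma) for x in M)
    ll += len(A) * (math.log(lam) - math.log(hi - lo))   # uniform log-density
    return ll

def detect(D, lam=0.1, c=1.0):
    """Move each point to A in turn; keep it there if the log-likelihood
    gain exceeds the threshold c (the |Delta| of the slide)."""
    lo, hi = min(D), max(D)
    M, A = list(D), []
    for x in list(M):
        M.remove(x)
        A.append(x)
        gain = (log_likelihood(M, A, lam, lo, hi)
                - log_likelihood(M + [x], A[:-1], lam, lo, hi))
        if gain <= c:                  # not anomalous enough: move back to M
            A.pop()
            M.append(x)
    return A

anomalies = detect([10.1, 9.9, 10.0, 10.2, 9.8, 50.0])
```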
Strengths/Weaknesses of Statistical Approaches
Firm mathematical foundation
Can be very efficient
Good results if the distribution is known
In many cases, the data distribution may not be known
For high-dimensional data, it may be difficult to estimate the true distribution
Anomalies can distort the parameters of the distribution
Distance-Based Approaches
Several different techniques:
An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
Some statistical definitions are special cases of this
The outlier score of an object is the distance to its kth nearest neighbor
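The kth-nearest-neighbor score is a few lines of code (names are illustrative; note the O(n²) distance computation the later slide complains about):

```python
import math

def knn_outlier_scores(points, k):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(pts, k=2)
```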
One Nearest Neighbor - One Outlier
[Figure: points colored by outlier score (distance to the nearest neighbor); the isolated point D has the highest score]
One Nearest Neighbor - Two Outliers
[Figure: with two outliers close to each other near D, each one's nearest-neighbor distance, and hence its outlier score, stays small]
Five Nearest Neighbors - Small Cluster
[Figure: outlier scores based on the distance to the fifth nearest neighbor; the small cluster near D now receives high scores]
Five Nearest Neighbors - Differing Density
[Figure: outlier scores based on the distance to the fifth nearest neighbor, on data with clusters of differing density]
Strengths/Weaknesses of Distance-Based Approaches
Simple
Expensive: O(n²)
Sensitive to parameters
Sensitive to variations in density
Distance becomes less meaningful in high-dimensional space
Density-Based Approaches
Density-based outlier: the outlier score of an object is the inverse of the density around the object.
Can be defined in terms of the k nearest neighbors:
One definition: inverse of the distance to the kth neighbor
Another definition: inverse of the average distance to the k neighbors
DBSCAN definition
If there are regions of different density, this approach can have problems
Relative Density
Consider the density of a point relative to that of its k nearest neighbors
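One concrete reading of this, assuming the inverse-average-distance density from the previous slide (helper names are mine):

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i."""
    return sorted((j for j in range(len(points)) if j != i),
                  key=lambda j: math.dist(points[i], points[j]))[:k]

def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    return k / sum(math.dist(points[i], points[j]) for j in knn(points, i, k))

def relative_density(points, i, k):
    """Density of point i divided by the average density of its neighbors."""
    nbrs = knn(points, i, k)
    return density(points, i, k) / (sum(density(points, j, k) for j in nbrs) / k)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
# Outlier score = inverse of the relative density
scores = [1 / relative_density(pts, i, k=2) for i in range(len(pts))]
```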
Relative Density Outlier Scores
[Figure: relative-density outlier scores; C: 6.85, D: 1.40, A: 1.33]
Density-based: LOF Approach
For each point, compute the density of its local neighborhood
Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
Outliers are points with the largest LOF values
[Figure: in the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers]
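A compact implementation of the standard LOF definition (k-distance, reachability distance, local reachability density); this is my own sketch, not code from the slides, and it omits the usual tie handling:

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for each point (standard Breunig et al. definition)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors and k-distance of every point
    nbrs = [sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
            for i in range(n)]
    kdist = [d[i][nbrs[i][-1]] for i in range(n)]

    def lrd(i):
        # local reachability density: inverse of the mean reachability
        # distance reach(i, j) = max(k-distance(j), d(i, j)) over i's neighbors
        return k / sum(max(kdist[j], d[i][j]) for j in nbrs[i])

    # LOF(i) = mean lrd of i's neighbors divided by lrd(i); ~1 for inliers
    return [sum(lrd(j) for j in nbrs[i]) / (k * lrd(i)) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(pts, k=2)
```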
Strengths/Weaknesses of Density-Based Approaches
Simple
Expensive: O(n²)
Sensitive to parameters
Density becomes less meaningful in high-dimensional space
Clustering‐Based Approaches
Clustering-based outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster
For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
For density-based clusters, an object is an outlier if its density is too low
For graph-based clusters, an object is an outlier if it is not well connected
Other issues include the impact of outliers on the clusters and the number of clusters
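A sketch of the prototype-based case: score each point by its distance to the closest centroid, then (as on the "relative distance" slide that follows) divide by the cluster's median distance. Centroids are assumed precomputed, e.g. by k-means; all names are illustrative:

```python
import math
import statistics

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def centroid_distance_scores(points, centroids):
    """Outlier score = distance to the closest centroid."""
    labels = assign(points, centroids)
    return labels, [math.dist(p, centroids[c]) for p, c in zip(points, labels)]

def relative_scores(points, centroids):
    """Distance to the closest centroid divided by the median such distance
    within the point's cluster (compensates for cluster spread)."""
    labels, scores = centroid_distance_scores(points, centroids)
    med = {c: statistics.median(s for s, l in zip(scores, labels) if l == c)
           for c in set(labels)}
    return [s / med[l] for s, l in zip(scores, labels)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]                         # (5, 5) belongs to no cluster strongly
cents = [(0.5, 0.5), (10.5, 10.5)]    # assumed precomputed, e.g. by k-means
rel = relative_scores(pts, cents)
```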
Distance of Points from Closest Centroids
[Figure: distance of each point from its closest centroid as outlier score; C: 4.6, D: 0.17, A: 1.2]
Relative Distance of Points from Closest Centroid
[Figure: relative distance of each point from its closest centroid as outlier score; C: 76.9, D: 15.0, A: 13.1]
Strengths/Weaknesses of Cluster-Based Approaches
Simple
Many clustering techniques can be used
Can be difficult to decide on a clustering technique
Can be difficult to decide on number of clusters
Outliers can distort the clusters
Co-anomaly Event Detection in Multiple Temperature Series
[Slides 36-39: figures only]