LÊ VĂN QUỐC ANH Overview of Anomaly Detection in Time Series Data.
LÊ VĂN QUỐC ANH
Overview of Anomaly Detection in Time Series Data
Outline
• Introduction
• Anomaly detection approaches
  • Classification based
  • Nearest Neighbor Based
  • Predictive
  • Window-Based
  • Disk Aware Discord Discovery
  • Other approaches
• Comments
• Conclusion
• References
Introduction
Time series data problems:
• Similarity search
• Classification
• Clustering
• Motif discovery
• Anomaly/novelty detection
• Visualization
* [Keogh]
Problem Definition
Anomaly/novelty detection refers to the problem of finding patterns in data that do not conform to expected behavior
Problem Definition (cont.)
Finding discords in large scale time series
[V. Chandola]
Applications
• Intrusion detection for cyber-security
• Fraud detection for credit cards
• Fault detection in safety-critical systems
• Industrial damage detection
• Medical and public health anomaly detection
• Stock market analysis
• …
Very simple technique: Match the data to known patterns
Existing anomaly detection techniques
• Classification based
• Nearest Neighbor Based
• Predictive
• Window-Based
• Disk Aware Discord Discovery
• Other techniques
Classification based approaches
Learn a model from a set of labeled data instances, then classify a test instance into one of the classes using the learned model.
Operate in two phases:
• Training phase: learn a model from the training data
• Testing phase: classify a test instance as normal or anomalous
Assumption: A classifier that can distinguish between normal and anomalous classes can be learned in the given feature space.
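The two phases can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier over two hand-picked features (mean and standard deviation); the helper names `features`, `train`, and `classify` and the toy data are hypothetical and are not taken from the slides.

```python
from statistics import mean, pstdev

def features(series):
    """Map a time series to a small feature vector (mean, standard deviation)."""
    return (mean(series), pstdev(series))

def train(labeled_series):
    """Training phase: compute one feature centroid per class label."""
    by_label = {}
    for series, label in labeled_series:
        by_label.setdefault(label, []).append(features(series))
    return {
        label: tuple(mean(dim) for dim in zip(*vectors))
        for label, vectors in by_label.items()
    }

def classify(model, series):
    """Testing phase: assign the label of the closest centroid."""
    x = features(series)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(model, key=lambda label: sq_dist(model[label]))

# Toy labeled data (assumed for illustration only).
training_data = [
    ([1.0, 1.1, 0.9, 1.0], "normal"),
    ([1.2, 1.0, 1.1, 0.9], "normal"),
    ([1.0, 5.0, 0.2, 4.1], "anomalous"),
    ([0.5, 4.5, 0.1, 5.2], "anomalous"),
]
model = train(training_data)
print(classify(model, [1.0, 0.9, 1.1, 1.0]))
print(classify(model, [0.8, 4.8, 0.3, 4.9]))
```

Any of the classifier families listed for this approach could replace the centroid rule; the train-then-test structure stays the same.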
Classification based approaches (cont.)
Some techniques:
• Neural Networks based
• Bayesian Networks based
• Support Vector Machines based
• Rule based
Classification based approaches (cont.)
Advantages:
• Can distinguish between instances belonging to different classes
• Testing phase is fast
Disadvantages:
• Have to assign a label to each test instance
• Rely on the availability of accurate labels for the various normal classes
Nearest Neighbor Based
Assumption: Normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors.
Requires a distance measure defined between two data instances.
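One common way to turn this assumption into a score is to use the distance to the k-th nearest neighbor. The sketch below assumes Euclidean distance and k = 2; the function names and the sample points are illustrative choices, not part of the original slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_anomaly_scores(instances, k=2, dist=euclidean):
    """Score each instance by the distance to its k-th nearest neighbor."""
    scores = []
    for i, a in enumerate(instances):
        dists = sorted(dist(a, b) for j, b in enumerate(instances) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points in a dense neighborhood and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_anomaly_scores(data, k=2)
most_anomalous = max(range(len(data)), key=scores.__getitem__)
print(most_anomalous, scores[most_anomalous])
```

The `dist` parameter is the only place where the data type matters, which is why the choice of distance measure dominates the behavior of this family of techniques.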
Nearest Neighbor Based (cont.)
Advantages:
• Purely data driven
Disadvantages:
• If the data has normal instances without enough close neighbors, or anomalies with enough close neighbors, the technique fails to label them correctly
• Performance relies heavily on the distance measure
• Defining a distance measure between instances can be challenging when the data is complex
Predictive techniques
Forecast the next observation in the time series using a statistical model and the time series observed so far, then compare the forecast with the actual observation to determine whether an anomaly has occurred.
Some techniques: regression, autoregression (AR), ARMA, ARIMA, SVR (Support Vector Regression)
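As a hedged illustration of the forecast-and-compare idea, the sketch below fits a first-order autoregressive model without intercept and flags observations whose one-step forecast error exceeds three residual standard deviations. The model order, the threshold, and the names `fit_ar1` and `detect` are assumptions made for this example.

```python
from statistics import pstdev

def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

def detect(series, phi, sigma, n_sigmas=3.0):
    """Return indices whose one-step forecast error exceeds n_sigmas * sigma."""
    return [
        t for t in range(1, len(series))
        if abs(series[t] - phi * series[t - 1]) > n_sigmas * sigma
    ]

# Fit on a history assumed to be anomaly free.
history = [1.0, 0.9, 0.82, 0.73, 0.66, 0.6, 0.53, 0.48, 0.43, 0.39]
phi = fit_ar1(history)
residuals = [history[t] - phi * history[t - 1] for t in range(1, len(history))]
sigma = pstdev(residuals)

# The jump at index 3 breaks the decay pattern learned from the history.
observed = [1.0, 0.9, 0.81, 2.5, 2.25, 2.03]
print(detect(observed, phi, sigma))
```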
Predictive techniques (cont.)
Advantages:
• Provide a statistically justifiable solution for anomaly detection if the assumptions regarding the underlying data distribution hold true
Disadvantages:
• Rely on the assumption that the data is generated from a particular distribution
Window-Based
Extract fixed length (w) windows from a test time series, and assign an anomaly score to each window. The per-window scores are then aggregated to obtain the anomaly score for the test time series.
Some proposed techniques:
• HOT SAX
• AWDD
• WAT
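The generic window-based scheme (extract windows, score each one, aggregate) might look like the sketch below. It assumes that a window is scored by its Euclidean distance to the nearest window of a reference series and that scores are aggregated by averaging; both choices and all names are illustrative, not prescribed by the slides.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def windows(series, w):
    """All sliding windows of fixed length w."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def series_score(test, reference, w):
    """Score every test window, then average the per-window scores."""
    refs = windows(reference, w)
    scores = [min(euclid(win, ref) for ref in refs) for win in windows(test, w)]
    return sum(scores) / len(scores), scores

reference = [0, 1, 0, 1, 0, 1, 0, 1]
normal_test = [1, 0, 1, 0, 1, 0]
anomalous_test = [0, 1, 0, 4, 0, 1]

print(series_score(normal_test, reference, w=3))
print(series_score(anomalous_test, reference, w=3))
```

Other aggregations, such as the maximum window score, fit the same structure by changing only the last line of `series_score`.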
HOT SAX
[Eamonn Keogh, Jessica Lin, Ada Fu]
• Finds the most unusual time series subsequence (the discord)
• Improves the BFDD algorithm (Brute Force Discord Discovery) with heuristic ordering
• Uses SAX for discretization
Brute Force Algorithm
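The figure for this slide did not survive extraction. The sketch below is one plausible rendering of brute-force discord discovery, assuming Euclidean distance and the usual non-self-match condition |p - q| >= n; the names `brute_force_discord` and `euclid` and the toy series are chosen for illustration.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_discord(series, n):
    """Return (location, distance) of the length-n subsequence whose
    nearest non-self match is farthest away."""
    count = len(series) - n + 1
    best_dist, best_loc = -1.0, None
    for p in range(count):                    # outer loop: discord candidate
        nearest = math.inf
        for q in range(count):                # inner loop: potential neighbors
            if abs(p - q) >= n:               # ignore overlapping (self) matches
                nearest = min(nearest, euclid(series[p:p + n], series[q:q + n]))
        if nearest != math.inf and nearest > best_dist:
            best_dist, best_loc = nearest, p
    return best_loc, best_dist

series = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
print(brute_force_discord(series, n=4))
```

Both loops visit every subsequence, so the cost grows quadratically with the length of the series, which is what the heuristic orderings aim to reduce.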
Heuristic Discord Discovery
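The figure for this slide is also missing. As a hedged sketch, heuristic discord discovery can be written as the brute-force search with caller-supplied loop orderings and early abandoning of the inner loop: once a candidate has a neighbor closer than the best discord distance so far, it cannot be the discord. The orderings used below (a lucky first guess, natural inner order) are placeholders for real heuristics.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def heuristic_discord(series, n, outer_order, inner_order):
    """Discord search with custom loop orders and early abandoning.
    Returns (location, distance, number of distance computations)."""
    best_dist, best_loc, calls = -1.0, None, 0
    for p in outer_order:
        nearest = math.inf
        for q in inner_order(p):
            if abs(p - q) < n:            # skip overlapping (self) matches
                continue
            calls += 1
            d = euclid(series[p:p + n], series[q:q + n])
            if d < best_dist:             # p cannot be the discord: abandon it
                break
            nearest = min(nearest, d)
        else:
            # Inner loop completed without abandoning p.
            if nearest != math.inf and nearest > best_dist:
                best_dist, best_loc = nearest, p
    return best_loc, best_dist, calls

series = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
n = 4
count = len(series) - n + 1
outer = [4] + [p for p in range(count) if p != 4]   # assumed good guess first
loc, dist, calls = heuristic_discord(series, n, outer, lambda p: range(count))
print(loc, dist, calls)
```

The result matches an exhaustive search regardless of the orderings; good orderings only change how many distance computations are needed.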
The two data structures for Inner and Outer heuristics
[Keogh]
AWDD technique
M. Chuah, F. Fu (2006)
• AWDD: Adaptive Window Based Discord Discovery
• Applied to ECG time series
AWDD technique (cont.)
Advantages:
• Uses adaptive rather than fixed windows
Disadvantages:
• Deals only with ECG datasets
WAT technique
Y. Bu et al. (2006)
• WAT: Wavelet and Augmented Trie
• Applies the Haar wavelet transform and then symbol word mapping to the raw time series, building a prefix tree for the inner and outer loop heuristics
• A subsequence can be viewed at different resolutions
• The first symbol of each word gives the lowest resolution for each subsequence
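To show why the first coefficient corresponds to the lowest resolution, here is a small sketch of a Haar decomposition. It assumes the averaging variant (pairwise averages and half-differences, without the sqrt(2) normalization) and an input length that is a power of two; WAT's symbol mapping and trie construction are not shown.

```python
def haar_transform(values):
    """Haar decomposition: [overall average, coarsest detail, ..., finest details]."""
    details_all = []
    current = list(values)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details_all = details + details_all   # coarser details go in front
        current = averages
    return current + details_all

coeffs = haar_transform([9, 7, 3, 5])
print(coeffs)
```

Reading the coefficients from left to right refines the view of the subsequence: the first value is its overall average, and each further level adds finer detail.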
WAT technique (cont.)
Advantages:
• Requires 2 parameters (1 of them intuitive)
• Better performance than HOT SAX
Disadvantages:
• Assumes the coefficients follow a Gaussian distribution
• Assumes that the data reside in main memory
DADD technique
DADD: Disk Aware Discord Discovery (2008) [Yankov, Keogh and Rebbapragada]
• Finds unusual time series in terabyte-sized datasets on secondary memory
• The algorithm has two phases:
  • Phase 1, candidate selection: given a threshold r, finds a set containing all discords at distance at least r from their nearest neighbor
  • Phase 2, discord refinement: removes all false discords from the candidate set
A candidate selection phase
procedure [C] = DC Selection(S, r)
in:  S: disk resident data set of time series
     r: discord defining range
out: C: list of discord candidates
C = {S1}
for i = 2 to |S| do
    isCandidate = true
    for ∀Cj ∈ C do
        if (Dist(Si, Cj) < r) then
            C = C \ Cj
            isCandidate = false
        end if
    end for
    if (isCandidate) then
        C = C ∪ Si
    end if
end for
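The candidate selection pseudocode can be sketched in Python as follows. This is an in-memory illustration only: `S` is a plain list instead of a disk-resident dataset, candidates are stored as indices, and the Euclidean `euclid` helper and toy data are assumptions.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_candidates(S, r, dist=euclid):
    """Phase 1: one pass over S that keeps indices of possible discords."""
    candidates = [0]
    for i in range(1, len(S)):
        is_candidate = True
        kept = []
        for j in candidates:
            if dist(S[i], S[j]) < r:
                is_candidate = False      # S[i] has a neighbor within r
            else:
                kept.append(j)            # S[j] survives this comparison
        candidates = kept                 # candidates within r of S[i] are dropped
        if is_candidate:
            candidates.append(i)
    return candidates

S = [[0, 0], [0, 1], [10, 10], [0, 0.5], [20, 0]]
print(select_candidates(S, r=3))
```

In this toy run index 3 remains a candidate even though it lies close to indices 0 and 1, because those two had already been dropped when it was read. Such false positives are what the refinement phase removes.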
A discord refinement phase
procedure [C, C.dist] = DC Refinement(S, C, r)
in:  S: disk resident dataset of time series
     C: discord candidates set
     r: discord defining range
out: C: list of discords
     C.dist: list of NN distances to the discords
for j = 1 to |C| do
    C.distj = ∞
end for
for ∀Si ∈ S do
    for ∀Cj ∈ C do
        if Si == Cj then
            continue
        end if
        d = EarlyAbandon(Si, Cj, C.distj)
        if (d < r) then
            C = C \ Cj
            C.dist = C.dist \ C.distj
        else
            C.distj = min(C.distj, d)
        end if
    end for
end for
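A hedged Python sketch of the refinement phase follows. It again uses an in-memory list and index-based candidates; `early_abandon_dist` is a hypothetical stand-in for EarlyAbandon that returns infinity once the partial squared distance exceeds the current nearest-neighbor distance, and the candidate list passed in is assumed to come from a selection pass with the same r.

```python
import math

def early_abandon_dist(a, b, limit):
    """Euclidean distance, or infinity as soon as it must exceed `limit`."""
    limit_sq = limit * limit
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > limit_sq:
            return math.inf
    return math.sqrt(total)

def refine_candidates(S, candidates, r):
    """Phase 2: second pass over S. Returns {candidate index: NN distance}."""
    nn_dist = {j: math.inf for j in candidates}
    for i, s in enumerate(S):
        for j in list(nn_dist):           # copy keys: entries may be deleted
            if i == j:
                continue
            d = early_abandon_dist(s, S[j], nn_dist[j])
            if d < r:
                del nn_dist[j]            # false discord: neighbor within r
            else:
                nn_dist[j] = min(nn_dist[j], d)
    return nn_dist

S = [[0, 0], [0, 1], [10, 10], [0, 0.5], [20, 0]]
discords = refine_candidates(S, candidates=[2, 3, 4], r=3)
print(discords)
```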
DADD technique (cont.)
Advantages:
• Requires only two linear scans of the disk with a tiny buffer of main memory
• Very simple to implement
Disadvantages:
• Depends on the threshold r
Proposed approach
• Use Vector Quantization for discretization
• Improve the BFDD algorithm with an ordering heuristic
Using histogram model
[Figure: codebook generation (s = 16), series encoding into codeword symbols (a to p), and series transformation into a histogram representation.]
Similarity measure
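$$S_{HM}(q,t) = \frac{1}{1 + dis(q,t)}$$
with
$$dis(q,t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$
where $f_{i,t}$ and $f_{i,q}$ are the frequencies of codeword $i$ ($i = 1, 2, \ldots, s$) in the time series $t$ and $q$.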
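A small sketch of the histogram-model similarity, assuming the measure reads S_HM(q, t) = 1 / (1 + dis(q, t)) with dis(q, t) equal to the sum over codewords of |f_it - f_iq| / (1 + f_it + f_iq). The four-symbol codebook, the encoded strings, and the helper names are made up for illustration.

```python
from collections import Counter

def histogram(encoded, codebook):
    """Frequency of each codeword in an encoded series."""
    counts = Counter(encoded)
    return [counts[c] for c in codebook]

def dis(f_q, f_t):
    """Histogram distance: sum of |f_t - f_q| / (1 + f_t + f_q) over codewords."""
    return sum(abs(t - q) / (1 + t + q) for q, t in zip(f_q, f_t))

def s_hm(f_q, f_t):
    """Histogram-model similarity in (0, 1]; 1 means identical histograms."""
    return 1 / (1 + dis(f_q, f_t))

codebook = "abcd"                      # assumed codebook of size s = 4
f_q = histogram("aabc", codebook)      # [2, 1, 1, 0]
f_t = histogram("abdd", codebook)      # [1, 1, 0, 2]
print(s_hm(f_q, f_q), s_hm(f_q, f_t))
```

Because only codeword frequencies are compared, two series with the same symbols in a different order receive similarity 1 under this measure.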
Using multiple resolutions
• Codebook (6,60)
• Codebook (16,30)
For each resolution
• Start with the lowest resolution and a single group containing all subsequences
• For each resolution, groups with more than one subsequence are split based on a threshold r
• Stop when every group has one subsequence or the highest resolution is reached
Improve BFDD
• Outer loop heuristic: groups with the smallest subsequence counts are considered first
• Inner loop heuristic: when the i-th subsequence is considered in the outer loop, all subsequences in the same group are considered first in the inner loop
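The two heuristics can be sketched as a function that derives loop orders from a grouping of subsequence indices. How the groups are built is outside this sketch; the name `loop_orders` and the sample grouping are assumptions.

```python
def loop_orders(groups):
    """Derive outer and inner loop orders from groups of subsequence indices."""
    group_of = {idx: group for group in groups for idx in group}
    # Outer loop: members of the smallest groups come first.
    outer = [idx for group in sorted(groups, key=len) for idx in group]

    def inner(p):
        # Inner loop: subsequences from p's own group first, then the rest.
        same = [q for q in group_of[p] if q != p]
        rest = [q for q in outer if q not in group_of[p]]
        return same + rest

    return outer, inner

groups = [[0, 1, 2, 3], [4], [5, 6]]   # hypothetical grouping result
outer, inner = loop_orders(groups)
print(outer)
print(inner(5))
```

Subsequences in small groups are likely to be unusual, so visiting them early raises the best-so-far discord distance quickly, while visiting same-group members first in the inner loop tends to find a close neighbor early and lets non-discords be abandoned sooner.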
References
[1] E. Keogh, J. Lin, A. Fu. HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005), November 27-30, 2005, pp. 226-233.
[2] D. Yankov, E. Keogh, U. Rebbapragada. Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets, 2008.
[3] E. Keogh. Mining Shape and Time Series Databases with Symbolic Representations. Tutorial at the 13th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2007), August 12-15, 2007.
[4] J. Lin, E. Keogh, A. Fu, H. Van Herle. Approximations to Magic: Finding Unusual Medical Time Series. The 18th IEEE International Symposium on Computer-Based Medical Systems, pp. 329-334, 2005.
[5] M. Chuah, F. Fu. ECG Anomaly Detection via Time Series Analysis. Technical Report LU-CSE-07-001, 2007.
References (cont.)
[6] V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos. A Multiresolution Symbolic Representation of Time Series. In Proc. of the 21st International Conference on Data Engineering (ICDE 2005), April 5-8, 2005, pp. 668-679.
[7] V. Chandola, D. Cheboli, V. Kumar. Detecting Anomalies in a Time Series Database. Technical Report TR 09-004, 2009.
[8] Y. Bu, T-W Leung, A. Fu, E. Keogh, J. Pei, S. Meshkin. WAT: Finding Top-K Discords in Time Series Database. In Proc. of the 2007 SIAM International Conference on Data Mining (SDM'07), Minneapolis, MN, USA, April 26-28, 2007.
[9] Q. Wang, V. Megalooikonomou. A Dimensionality Reduction Technique for Efficient Time Series Similarity Analysis. Information Systems 33, 115–132, 2008.
[10] H. B. Kekre, T. K. Sarode. Fast Codebook Search Algorithm for Vector Quantization using Sorting Technique. International Conference on Advances in Computing, Communication and Control (ICAC3'09), 2009.
Thank you!