Analytics for large-scale time series and event data
Transcript of Analytics for large-scale time series and event data
![Page 1: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/1.jpg)
1
Building Anomaly Detection For Large Scale Analytics
Ira Cohen, Chief Data Scientist
16th May, 2016
![Page 2: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/2.jpg)
2
Outline
Anomaly detection? Why do I need it?
Design principles for Anomaly Detection
What is anomaly detection?
Anomaly Detection Methods
The Anodot System
![Page 3: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/3.jpg)
3
Why Anomaly Detection?
![Page 4: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/4.jpg)
4
Detecting the Unknowns Saves Time + Money
Industrial IoT: Proactive maintenance. Detecting issues in factories/machines
Web Services: Detecting business incidents + unknown business opportunities
Machine Learning: Closing the “Machine Learning” loop. Tracking and detecting “unknowns” not modeled during training
Security: Detection of unknown breach/attack patterns
![Page 5: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/5.jpg)
5
Business Incidents - More go undetected as the business grows
![Page 6: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/6.jpg)
6
Detecting Business Incidents: Metric Driven Detection
Business: e.g., purchases per product, conversions per campaign…
Business generation: leads, visitors, usage, engagements. Per geo, user segment, page, browser, device…
App: performance, errors, usability. Per class, method, feature…
Infra utilization/state: middleware, network, system. Per host, database, switch…
![Page 7: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/7.jpg)
7
Detecting Business Incidents: Metric Driven Detection
Drop in # of visitors
Decrease in ad conversion on Android
Price glitch – increase in purchases / decrease in revenue
![Page 8: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/8.jpg)
8
Manual Detection of Business Incidents
Dashboards
Setting alerts with thresholds
![Page 9: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/9.jpg)
9
Manual Solutions: Drowning in a “Sea of Data”
MISSED INCIDENTS
FALSE ALARMS
GENUINE ALERTS
Too many parameters to set thresholds
Too much data to analyze in real time
![Page 10: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/10.jpg)
10
What is Anomaly Detection?
![Page 11: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/11.jpg)
11
Find the Anomaly
![Page 12: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/12.jpg)
12
Anomaly Detection
• Ill posed problem
• What is an anomaly?
![Page 13: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/13.jpg)
13
Anomaly Detection in Time Series Signals
Unexpected change of temporal pattern of one or more
time series signals.
![Page 14: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/14.jpg)
14
Anomaly detection: Design Principles
![Page 15: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/15.jpg)
15
Anomaly Detection: Design Considerations
Timeliness: Real time vs. retroactive detection
Scale: 100s vs. millions of metrics
Rate of change: Adaptive vs. offline learning
Conciseness: Univariate vs. multivariate methods
Well defined incidents? Supervised vs. unsupervised methods
![Page 16: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/16.jpg)
16
Timeliness: Real time vs. Retroactive Detection
Real time decision making
• Reduction in visitors/revenues: check for bugs
• Increase in product purchase: increase inventory
• Increase in ad conversion w/o increase in impressions: check for fraud
Non-real time decision making
• Capacity planning
• Marketing budget allocations
• Data cleaning
• Scheduled maintenance
![Page 17: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/17.jpg)
17
Timeliness: Real time vs. Retroactive Detection
Real time decision making
• Online learning: cannot iterate over the data
• More prone to False Positives
• Scales more easily
Non-real time decision making
• Batch learning: can iterate over the data
• Easier to remove False Positives
• Poor scaling
![Page 18: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/18.jpg)
18
Rate of change
Constant change
• Most common case
• Requires adaptive algorithms
Very slow change
• “Closed” systems, e.g., airplanes, large machinery
• Learn once and apply the model for a long time
![Page 19: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/19.jpg)
19
Conciseness of Anomalies
Univariate Anomaly Detection
• Learn normal model for each metric
• Anomaly detection at the metric level
• Easier to scale
• Causes anomaly storms: can’t see the forest for the trees
• Easier to model many types of behaviors
Multivariate Anomaly Detection
• Learn single model for all metrics
• Anomaly detection of complete incident
• Hard to scale
• Hard to interpret the anomaly
• Often requires metric behaviour to be homogeneous
Hybrid approach
• Learn normal model for each metric
• Combine anomalies to single incidents if metrics are related
• Scalable
• Can combine multiple types of metric behaviours
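The hybrid approach can be sketched roughly as follows. This is only an illustration, not the method used by any particular system: it assumes per-metric anomalies arrive as hypothetical `(metric, start, end)` tuples, that "related" is supplied as a set of metric pairs, and that two anomalies belong to the same incident when their metrics are related and their time ranges overlap.

```python
def group_anomalies(anomalies, related):
    """Group per-metric anomalies into incidents.

    anomalies: list of (metric, start, end) tuples (hypothetical format).
    related:   set of frozenset({metric_a, metric_b}) pairs (assumed given).
    """
    parent = list(range(len(anomalies)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (m1, s1, e1) in enumerate(anomalies):
        for j in range(i + 1, len(anomalies)):
            m2, s2, e2 = anomalies[j]
            overlap = s1 <= e2 and s2 <= e1
            linked = m1 == m2 or frozenset((m1, m2)) in related
            if overlap and linked:
                parent[find(i)] = find(j)

    incidents = {}
    for i, anomaly in enumerate(anomalies):
        incidents.setdefault(find(i), []).append(anomaly)
    return list(incidents.values())


# Illustrative data only.
anomalies = [
    ("visitors", 10, 20),
    ("revenue", 12, 25),
    ("cpu.host7", 11, 15),
    ("revenue", 100, 110),
]
related = {frozenset(("visitors", "revenue"))}
for incident in group_anomalies(anomalies, related):
    print(incident)
```

Each metric keeps its own model, so the detection step scales like the univariate case; only the much smaller set of detected anomalies is compared pairwise.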
![Page 20: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/20.jpg)
20
Well defined incidents?
Yes: Supervised methods
• Requires a well defined set of incidents to identify
• Learning a model to classify samples as normal or abnormal
• Requires labeled examples of anomalies
• Cannot detect new types of incidents
No: Unsupervised methods
• Learning a normal model only
• Statistical test to detect anomalies
• Can detect any type of anomaly, known or unknown
Semi-Supervised methods
• Use a few labelled examples to improve detection of unsupervised methods.
• Or use unsupervised detection for unknown cases, supervised detection to classify already known cases.
![Page 21: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/21.jpg)
21
Anomaly Detection Methods
![Page 22: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/22.jpg)
22
Unsupervised Anomaly Detection
General scheme
Step 1: Model the normal behavior of the metric(s) using a statistical model.
Step 2: Devise a statistical test to determine if samples are explained by the model.
Step 3: Apply the test for each sample. Flag as anomaly if it does not pass the test.
![Page 23: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/23.jpg)
23
Very Simple Model
[Figure: normal distribution centered at μ; 68% of samples fall within 1σ, 95.4% within 2σ, 99.7% within 3σ]
Assume normal behavior follows the Normal distribution
Estimate the average and standard deviation over all samples
Test: any sample x with |x - average| > 3 * standard deviation is abnormal
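A minimal sketch of this very simple model, using only the Python standard library. The function name, the parameter `k`, and the sample data are illustrative assumptions; the slide only specifies the 3-standard-deviation test.

```python
import statistics


def three_sigma_anomalies(samples, k=3.0):
    """Return indices of samples farther than k standard deviations from the mean."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    if std == 0:
        return []  # a constant signal has no outliers under this test
    return [i for i, x in enumerate(samples) if abs(x - mean) > k * std]


# Illustrative data: a flat signal with one spike at the end.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9,
            10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.9, 10.1, 25.0]
print(three_sigma_anomalies(readings))
```

Note that the spike itself inflates the estimated mean and standard deviation, which is one reason this model is only a starting point.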
![Page 24: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/24.jpg)
24
A single model does not fit them all!
Smooth (stationary)
Irregular sampling
Multi Modal
Sparse
Discrete
“Step”
![Page 25: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/25.jpg)
25
Metric types distribution
Based on 50,000,000 metrics sampled from dozens of companies
Nearly constant, 2%
Discrete, 15%
Sparse, 3%
Multi Modal, 5%
Smooth, 38%
Irregular sampling, 37%
All
Industries
![Page 26: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/26.jpg)
26
Example: The importance of modeling seasonality
Single seasonal pattern
![Page 27: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/27.jpg)
27
Example: The importance of modeling seasonality
Multiple seasonal patterns (“Amplitude modulation”)
![Page 28: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/28.jpg)
28
Example: The importance of modeling seasonality
Multiple seasons – Additive signals
![Page 29: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/29.jpg)
29
Seasonality Distribution
Season: 24 hours, 69%
Season: Weekly, 26%
Season: 3 hours, 2%
Season: 12 hours, 1%
Season: 2 hours, 1%
Season: 1 hour, 1%
Season: 6 hours, 0.5%
Season: 4 hours, 0.2%
Season: 5 hours, 0.1%
Note: Only 14% of the metrics have a season
![Page 30: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/30.jpg)
30
Example Methods to detect seasonality
Finding maximums in the autocorrelation of the signal
• Computationally expensive
• More robust to gaps
Finding maximum(s) in the Fourier transform of the signal
• Challenging to detect low frequency seasons
• Challenging to discover multiple seasons
• Sensitive to missing data
Exhaustive search based on a cost function
• Computationally expensive
• Robust to gaps
• Challenging to discover multiple seasons
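The autocorrelation method can be sketched as below. This is a simplified illustration for evenly sampled data with no gaps; the function names, the `threshold` of 0.5, and the choice to return only the single strongest peak are assumptions, not details from the talk.

```python
import math


def autocorrelation(signal, lag):
    """Autocorrelation of the signal at the given lag (biased estimator)."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal)
    if var == 0:
        return 0.0
    cov = sum((signal[i] - mean) * (signal[i + lag] - mean) for i in range(n - lag))
    return cov / var


def detect_season(signal, min_lag=2, max_lag=None, threshold=0.5):
    """Return the lag of the strongest autocorrelation peak above threshold, or None."""
    if max_lag is None:
        max_lag = len(signal) // 2
    acf = [autocorrelation(signal, lag) for lag in range(max_lag + 2)]
    best_lag, best_value = None, threshold
    for lag in range(min_lag, max_lag + 1):
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_peak and acf[lag] > best_value:
            best_lag, best_value = lag, acf[lag]
    return best_lag


# Illustrative data: 10 days of hourly samples with a daily cycle.
hourly = [math.sin(2 * math.pi * t / 24) for t in range(240)]
print(detect_season(hourly))
```

Computing every lag this way costs time quadratic in the signal length, which matches the "computationally expensive" remark above.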
![Page 31: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/31.jpg)
31
Real time detection @ scale = Online learning algorithms
1. Initialize model
2. For each new sample, test if it is an anomaly
3. Update model parameters with each new sample
![Page 32: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/32.jpg)
32
Example Online Models/Algorithms
• Simple Moving Average
• Double/Triple exponential (Holt-Winters)
• Kalman Filters + ARIMA and variations
• Single exponential forgetting
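As one example from this family, double exponential smoothing (Holt's linear method) tracks a level and a trend and can be updated one sample at a time. The sketch below is a generic textbook form; the function name, the initialization from the first two samples, and the default `alpha` and `beta` values are illustrative assumptions.

```python
def holt_one_step_forecasts(samples, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts for samples[2:] using double exponential smoothing."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    # Assumed initialization: level from the second sample, trend from the first difference.
    level = samples[1]
    trend = samples[1] - samples[0]
    forecasts = []
    for x in samples[2:]:
        forecasts.append(level + trend)  # prediction made before seeing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


# Illustrative data: a perfectly linear signal is forecast without error.
print(holt_one_step_forecasts([5.0, 7.0, 9.0, 11.0, 13.0]))
```

For anomaly detection, the residual between each sample and its forecast would then be compared against a threshold, for instance a multiple of the running standard deviation of past residuals.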
![Page 33: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/33.jpg)
33
Example: Simple exponential forgetting (Normal distribution model)
Define alpha, the forgetting factor
Compute initial average, sumOfSquares using initial samples
For each new sample, x[t]:
    If |x[t] - average[t-1]| > 3 * Stddev[t-1]:
        Flag x[t] as an anomalous sample
    average[t] = alpha*x[t] + (1-alpha)*average[t-1]
    sumOfSquares[t] = alpha*x[t]^2 + (1-alpha)*sumOfSquares[t-1]
    Stddev[t] = sqrt(sumOfSquares[t] - average[t]^2)
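The pseudocode above translates fairly directly into Python. In this sketch the class name and the default `alpha` are assumptions, and `sumOfSquares` is initialized as the mean of the squared initial samples so that it is on the same scale as the exponentially weighted update.

```python
import math


class ExponentialForgettingDetector:
    """Online 3-sigma detector with exponentially forgotten mean and variance."""

    def __init__(self, initial_samples, alpha=0.05, k=3.0):
        n = len(initial_samples)
        self.alpha = alpha  # forgetting factor
        self.k = k          # threshold in standard deviations
        self.average = sum(initial_samples) / n
        # Assumption: start from the mean of squares of the initial samples.
        self.sum_of_squares = sum(x * x for x in initial_samples) / n

    @property
    def stddev(self):
        variance = self.sum_of_squares - self.average ** 2
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero

    def update(self, x):
        """Test x against the current model, then fold it into the model."""
        is_anomaly = abs(x - self.average) > self.k * self.stddev
        self.average = self.alpha * x + (1 - self.alpha) * self.average
        self.sum_of_squares = (self.alpha * x * x
                               + (1 - self.alpha) * self.sum_of_squares)
        return is_anomaly


# Illustrative usage.
detector = ExponentialForgettingDetector([10, 11, 9, 10, 12, 8, 10, 11, 9, 10])
print(detector.update(10.5))
print(detector.update(20.0))
```

The model needs only two numbers per metric and constant work per sample, which is what makes this kind of model practical at large metric counts.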
![Page 34: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/34.jpg)
34
Update rate with online models: Avoiding pitfalls
What should be the learning rate?
Too Slow
Too Fast
![Page 35: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/35.jpg)
35
Update rate with online models: Avoiding pitfalls
What should be the learning rate?
“Al Dente”
Auto tuning required!
![Page 36: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/36.jpg)
36
Update rate with online models: Avoiding pitfalls
How to update a model when there is an anomaly?
Strategy A: Update as usual
Most of the
anomaly is missed
![Page 37: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/37.jpg)
37
Update rate with online models: Avoiding pitfalls
How to update a model when there is an anomaly?
Strategy B: Adapt the learning rate
Full anomaly captured
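One possible way to adapt the learning rate, sketched under assumptions: the slides do not say how the rate is adapted, so here the model simply switches to a much smaller forgetting factor (`anomaly_alpha`, a hypothetical parameter) for samples that are flagged. Setting `anomaly_alpha` equal to `alpha` reproduces Strategy A for comparison.

```python
import math


def detect_with_adaptive_rate(samples, warmup, alpha=0.1, anomaly_alpha=0.001, k=3.0):
    """Flag anomalies, updating the model more slowly on anomalous samples."""
    init = samples[:warmup]
    average = sum(init) / len(init)
    mean_sq = sum(x * x for x in init) / len(init)
    flags = []
    for x in samples[warmup:]:
        std = math.sqrt(max(mean_sq - average ** 2, 0.0))
        is_anomaly = abs(x - average) > k * std
        # Assumed rule: learn slowly from samples that look anomalous.
        rate = anomaly_alpha if is_anomaly else alpha
        average = rate * x + (1 - rate) * average
        mean_sq = rate * x * x + (1 - rate) * mean_sq
        flags.append(is_anomaly)
    return flags


# Illustrative data: a stable baseline followed by a sustained jump.
baseline = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
series = baseline + [30] * 8
print(detect_with_adaptive_rate(series, warmup=10))                     # Strategy B
print(detect_with_adaptive_rate(series, warmup=10, anomaly_alpha=0.1))  # Strategy A
```

With the usual update, the first anomalous sample widens the model so much that the following samples no longer fail the test; with the slower rate the model stays close to the pre-anomaly baseline.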
![Page 38: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/38.jpg)
38
Batch models
1. Collect historical samples
2. Segment samples into similarly behaving segments
3. Cluster segments according to some similarity measure
4. Mark as anomalies segments that are in small or no clusters
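The four batch steps can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: fixed-length windows stand in for "similarly behaving segments", each segment is summarized by its mean and standard deviation, and a simple greedy distance-threshold clustering stands in for the clustering methods a real system would use.

```python
import math
import statistics


def batch_anomalous_segments(samples, segment_len, radius, min_cluster_size=2):
    """Return indices of segments that end up in clusters smaller than min_cluster_size."""
    # Step 2 (assumed): cut the history into fixed-length segments.
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples) - segment_len + 1, segment_len)]
    features = [(statistics.fmean(s), statistics.pstdev(s)) for s in segments]

    # Step 3 (assumed): greedy clustering; a segment joins the first cluster
    # whose first member lies within `radius` in feature space.
    clusters = []
    for idx, feature in enumerate(features):
        for cluster in clusters:
            if math.dist(feature, features[cluster[0]]) <= radius:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])

    # Step 4: segments in small clusters are marked as anomalies.
    return sorted(idx for c in clusters if len(c) < min_cluster_size for idx in c)


# Illustrative data: five segments, the third one shifted upwards.
history = [10, 11, 9, 10] * 2 + [30, 31, 29, 30] + [10, 11, 9, 10] * 2
print(batch_anomalous_segments(history, segment_len=4, radius=2.0))
```

Because the whole history is available, the data can be revisited as often as needed, at the cost of recomputing when new data arrives.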
![Page 39: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/39.jpg)
39
Example Batch Anomaly Detection Methods
Multi-model distributions:
• Gaussian models
• Generalized mixture models
One-class SVM
PCA
Clustering methods (K-Means, DBSCAN, Mean-Shift)
MOST COMMON IN USE
Hidden Markov Models
![Page 40: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/40.jpg)
40
Anomaly detection methods - examples
| NAME | ADAPTIVE? | REALTIME? | SCALABLE? | UNI/MULTI VARIATE |
|---|---|---|---|---|
| Holt-Winters | Yes | Yes | Yes | Univariate |
| ARIMA + Kalman | Yes | Yes | Yes | Both |
| HMM | No | Yes | No | Multivariate |
| GMM | No | No | No | Both |
| DBScan | No | No | No | Multivariate |
| K-Means | No | No | No | Multivariate |
![Page 41: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/41.jpg)
41
Large scale anomaly detection – the Anodot system
![Page 42: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/42.jpg)
42
Automatic Anomaly Detection in five Steps: The Anodot Way
1. Metrics Collection – Universal, scale to millions
2. Normal behavior learning
3. Abnormal behavior learning
4. Behavioral Topology Learning
5. Feedback Based Learning
![Page 43: Analytics for large-scale time series and event data](https://reader034.fdocuments.net/reader034/viewer/2022051318/58a128ae1a28abb91b8b6d47/html5/thumbnails/43.jpg)
43
Large Scale Anomaly Detection System Architecture
[Architecture diagram. Component labels: Customer DS, Agent, Anodotd, REST, Kafka Events Queue, Online Base Line Learning, Aggregator, Anomaly Grouping, Signals Correlation Map, Real-Time Rollups Store, Cassandra, Elasticsearch, DWH, S3, Hadoop, Hive, Offline Learning, Management & Portal, Anodot-Web, WebApp, User Mgmt, RDBMS]