Page 1:

Online Machine Learning with Distributed In-memory Clusters

Arshak Navruzyan, VP Product Argyle Data

Acknowledgements: Nina Zumel, PhD – Win-Vector LLC

Page 2:

Contents

Business use-cases

- DevOps / Continuous Deployment

- Quality of Service (QOS)

- Quality of Experience (QoE)

Machine learning methods for anomaly detection

- Nearest Neighbor (NN)

- Isolation Forest (iForest)

- Random Forests (RF)

Page 3:

Ecommerce operator use case

Large distributed search farm logs usage data into an in-memory grid

- Mixed measurements

- Continuous: query latency, resource utilization, etc.

- Categorical: client IP, server IP, code version, etc.

- ~1-3 TB of search logs every 12 hours

Find the anomalies

- Data isn’t easy to characterize due to size

- Anomalies are across multiple variables (combination of server, code version, latency)

- No labeled data is available

- A high rate of false-positives at this scale is itself a flood of data

- Very few ML methods operate at this scale

Page 4:

In-memory grid provides a live view of the data

Distributed SQL store

- Read / write optimized

- Automatic placement (sharding)

- Fast ingest (million inserts per sec.)

- Fast aggregations (billion rows per sec.)

- Holds ~30 days of data online

- Insert also sets a time-to-live (TTL); a hypothetical sketch follows
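Below is a minimal, hypothetical sketch of what such a TTL-setting insert might look like from Python. The slides name neither the grid nor its API, so the DB-API-style driver, the search_logs table, and the ttl_seconds column are all illustrative assumptions.

```python
import time

# Hypothetical sketch: the grid and its API are not named in the slides.
# Assumes a DB-API style connection and a search_logs table where each
# row carries its own time-to-live, letting the store expire data and
# keep ~30 days online.
THIRTY_DAYS_S = 30 * 24 * 3600

def log_search_event(conn, server_ip, client_ip, code_version, latency_ms):
    conn.cursor().execute(
        "INSERT INTO search_logs "
        "(ts, server_ip, client_ip, code_version, latency_ms, ttl_seconds) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (time.time(), server_ip, client_ip, code_version, latency_ms,
         THIRTY_DAYS_S),
    )
    conn.commit()
```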

Simple monitoring tool

- D3.js-based horizon graphs

- Socket.io / Node.js

Page 5:

Isn’t there an open source project that does this?

Etsy’s continuous deployment problem

- 1.5B page views, 1M users, $117M in goods sold

- 250+ committers, everyone deploys!

- 30+ deploys to production a day

- ~8 commits per deploy

- How do you NOT break production?

Page 6:

How do you monitor an environment like Etsy’s?

Usual suspects: IT monitoring tools like Ganglia and Nagios

- But … “Not all things that break throw errors”

Etsy’s Kale

- StatsD – StatsD::increment(“foo.bar”) (a Python equivalent is sketched after this list)

- Skyline – Real-time anomaly detection system

- Oculus – Anomaly correlation system using dynamic time warping
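The StatsD::increment call above is from Etsy’s PHP client. A rough Python equivalent, assuming the community statsd package and a daemon on localhost:8125 (neither is specified in the slides), would be:

```python
from statsd import StatsClient  # community Python StatsD client (assumption)

# Assumed setup: a StatsD daemon listening on localhost:8125.
statsd = StatsClient(host="localhost", port=8125)

statsd.incr("foo.bar")                 # counter, like StatsD::increment("foo.bar")
statsd.timing("search.latency", 320)   # timer, in milliseconds
```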

Page 7:

Skyline & Oculus

Page 8:

Did they solve the problem?

Skyline’s Anomaly Detection

- Basic principle: “Metric is anomalous if its latest datapoint is over three standard deviations above the moving average” (sketched in code after this list)

- Ensemble of methods: tail average, median absolute deviation, Grubbs’ test, standard deviation from moving average, least squares (3-sigma), Kolmogorov–Smirnov test, etc.

- Results get better with the ensemble technique but are still very noisy:

- Non-normal distributions

- Spike influence

- Periodicity / seasonality
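A minimal sketch of that three-sigma rule (the function name and window size are illustrative). Spikes inside the trailing window inflate the standard deviation, and non-normal or seasonal data break the rule’s assumptions, which is why the results stay noisy:

```python
import numpy as np

def three_sigma_anomalous(series, window=60):
    """Skyline-style rule sketch: flag the latest datapoint if it sits
    more than three standard deviations above the moving average of the
    trailing window. Skyline actually votes an ensemble of such tests."""
    tail = np.asarray(series[-window - 1:-1], dtype=float)  # trailing window
    mean, std = tail.mean(), tail.std()
    return bool(std > 0 and series[-1] > mean + 3 * std)
```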

Page 9:

Machine learning based anomaly detection

“Benchmarking Algorithms for Detecting Anomalies in Large Datasets” – Uriel Carrasquilla (2010)

Page 10:

Nearest-neighbor approach

N1 and N2 are regions of normal behaviour

Points o1 and o2 are anomalies

Points in region O3 are anomalies

Advantages

- No need for assumptions about the data distribution

- No need for pre-labelled anomalies

- Supports categorical as well as continuous variables

Drawbacks

- Computationally expensive – quadratic in data volume, since every point has to be compared to every other point (see the sketch below)

- Existing implementations are batch-oriented

[Figure: 2-D scatter plot over axes X and Y showing normal regions N1 and N2, anomalous points o1 and o2, and a small anomalous cluster O3]

“Anomaly Detection: A Tutorial” - A. Banerjee, V. Chandola, V. Kumar, J. Srivastava (SIAM International Conference on Data Mining 2008)
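To make the quadratic cost concrete, here is a minimal sketch of distance-based (k-NN) anomaly scoring on a numeric matrix; the function name and k are illustrative, and categorical attributes would need a suitable distance measure:

```python
import numpy as np

def knn_anomaly_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbour,
    so isolated points score high. The scan of every point against all
    others is what makes the naive method quadratic in data volume."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)       # distances to all points
        d[i] = np.inf                              # exclude the point itself
        scores[i] = np.partition(d, k - 1)[k - 1]  # k-th smallest distance
    return scores                                  # O(n^2) overall
```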

Page 11:

Avoiding the quadratic computation time of NN

Bay & Schwabacher, SIGKDD 2003

- “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule”

Only anomalies are compared to every point in the data set

- If anomalies are rare, points only get compared to a small constant number of points
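A minimal sketch of that pruning idea, assuming Euclidean distance and illustrative parameters (the real ORCA implementation processes the data in blocks and is far more engineered):

```python
import numpy as np

def orca_top_outliers(X, k=5, m=10, seed=0):
    """Bay & Schwabacher pruning sketch: the cutoff is the weakest k-NN
    distance among the current top-m outliers. Once a candidate's running
    k-NN distance drops below the cutoff it can never re-enter the top-m
    (more comparisons only shrink that distance), so it is pruned. With a
    randomized scan order, ordinary points are pruned after a handful of
    comparisons, giving near-linear behaviour in practice."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))   # randomized scan order is essential
    X = X[order]
    top, cutoff = [], 0.0             # top holds (score, original_index)
    for i, x in enumerate(X):
        neigh = []                    # k smallest distances found so far
        for j, y in enumerate(X):
            if i == j:
                continue
            neigh = sorted(neigh + [float(np.linalg.norm(x - y))])[:k]
            if len(neigh) == k and neigh[-1] < cutoff:
                break                 # pruned: cannot make the top-m
        else:
            if len(neigh) == k:       # survived the full scan: true k-NN distance
                top = sorted(top + [(neigh[-1], int(order[i]))], reverse=True)[:m]
                if len(top) == m:
                    cutoff = top[-1][0]
    return top                        # the m strongest outliers
```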

Challenges that remain

- Batch learner won’t work for our scale

- How would this work in a sharded environment?

- Linear time is still long for “Big Data”

Page 12:

Nearest-neighbor as an online learner

ORCA concepts can be extended to online learning

- Randomly pull a small sample of the data that is representative of a time period

- Compare the streaming data (latest observations) to the representative set

- Representative set moves with time (sliding window) so that noon is compared to typical noon; this addresses periodicity (see the sketch below)
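A minimal sketch of this online variant, with illustrative window and sampling parameters:

```python
from collections import deque
import numpy as np

def online_nn_scores(stream, window=10_000, sample_rate=0.01, k=5, seed=0):
    """NN detection as an online learner: keep a small random sample of
    recent observations as the reference set (a sliding window, so noon
    traffic is compared against typical recent noons) and score each
    arrival by the distance to its k-th nearest reference point."""
    rng = np.random.default_rng(seed)
    reference = deque(maxlen=window)   # old samples fall off the window
    for x in stream:
        x = np.asarray(x, dtype=float)
        score = None                   # None until the reference set warms up
        if len(reference) >= k:
            d = np.linalg.norm(np.asarray(reference) - x, axis=1)
            score = float(np.partition(d, k - 1)[k - 1])
        if rng.random() < sample_rate: # keep a small representative sample
            reference.append(x)
        yield score                    # high score = anomalous arrival
```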

Other enhancements

- Every node needs the full reference set

- Distributed cache moves reference set between nodes

- Locality-sensitive hashing to cut the number of comparisons (sketched below)
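A minimal random-hyperplane LSH sketch (the slides do not say which LSH family was used); an arriving point is compared only against candidates in its own bucket rather than the whole reference set:

```python
from collections import defaultdict
import numpy as np

class HyperplaneLSH:
    """Random-hyperplane LSH: points whose projections onto a fixed set of
    random directions have the same sign pattern share a bucket, so nearby
    points tend to collide and become each other's candidates."""

    def __init__(self, dim, n_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_planes, dim))
        self.buckets = defaultdict(list)

    def _key(self, x):
        return tuple(bool(v) for v in (self.planes @ x > 0))  # sign pattern

    def add(self, x):
        self.buckets[self._key(x)].append(x)

    def candidates(self, x):
        return self.buckets[self._key(x)]  # compare against these only
```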

Page 13:

Random partitioning approach

“Isolation Forest” – Liu et al., IEEE ICDM (2008)

When building a tree, anomalies are likely to be isolated close to the root, whereas normal points appear deeper in the tree structure (sketched in code below)

No need to profile normal data points

No distance or density measures
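A minimal sketch of the isolation idea, evaluating random trees lazily for one point (the published iForest builds trees on subsamples and normalizes the depth, which is omitted here):

```python
import numpy as np

def isolation_depth(x, X, rng, depth=0, max_depth=20):
    """Split the data on a random attribute at a random cut point and
    recurse into x's side. Anomalies, being few and different, get
    isolated after only a few splits, so a shallow depth flags them."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    attr = rng.integers(X.shape[1])              # random attribute
    lo, hi = X[:, attr].min(), X[:, attr].max()
    if lo == hi:                                 # cannot split further
        return depth
    cut = rng.uniform(lo, hi)                    # random cut point
    keep = X[:, attr] < cut if x[attr] < cut else X[:, attr] >= cut
    return isolation_depth(x, X[keep], rng, depth + 1, max_depth)

def iforest_score(x, X, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    # Shallower average isolation depth = more anomalous, hence the negation.
    return -float(np.mean([isolation_depth(x, X, rng) for _ in range(n_trees)]))
```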

Gaps

- No support for categorical attributes

- Lacks explanatory power

- No obvious way to turn it into an online method

Page 14:

Ensemble of trees as an online learner

Random Forest (RF) basics

- Build an ensemble of many randomized trees

- Random selection (with replacement) of observations for each tree (in-bag / out-of-bag)

- Random selection of variables at each split (typically the square root of the number of attributes)

“Online Random Forests” (ORF) algorithm of Saffari et al., ICCV-OLCV 2009

- How do you perform bagging online? (Oza, “Online Bagging and Boosting”)

- In batch bagging, the number of times a tree sees a given observation is Binomial(N, 1/N), which approaches Poisson (λ = 1) for large N – so online bagging trains each tree on each arriving observation k times, with k ~ Poisson(1) (sketched below)

- How do you grow random trees on-the-fly?

- Continuously measure the information gain (or Gini impurity) of a potential split
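A minimal sketch of the Poisson trick, assuming a base learner with a hypothetical incremental update(x, y) method (e.g. an online tree that accumulates candidate-split statistics, as in ORF):

```python
import numpy as np

class OnlineBagging:
    """Oza-style online bagging sketch: in batch bagging a bootstrap sample
    contains a given observation Binomial(N, 1/N) times, which approaches
    Poisson(lambda = 1) as N grows. Online bagging mimics this by showing
    each arriving observation to each tree k times, with k ~ Poisson(1)."""

    def __init__(self, make_tree, n_trees=25, seed=0):
        self.rng = np.random.default_rng(seed)
        self.trees = [make_tree() for _ in range(n_trees)]

    def partial_fit(self, x, y):
        for tree in self.trees:
            for _ in range(self.rng.poisson(1.0)):
                tree.update(x, y)   # hypothetical incremental update

    def predict(self, x):
        votes = [tree.predict(x) for tree in self.trees]
        return max(set(votes), key=votes.count)   # majority vote
```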

Page 15:

Offline vs. Online Training

Page 16:

To summarize

Machine learning can produce more accurate anomaly detection than the stats approach

Nearest Neighbor and Random Forests can be adapted into online learners

In-memory distributed processing can improve the performance of such algorithms

Supervised methods for classification and regression will work similarly (we think …)

Page 17:

Thank you!

www.linkedin.com/in/arshak