Anomaly Detection Deep Learning for...Anomaly detection approaches can be categorized in terms of...

Cloudera Fast ForwardFF12 · February 2020

Deep Learning forAnomaly Detection

https://www.cloudera.com/products/fast-forward-labs-research.html

1. Introduction

Applications of Anomaly Detection2. Background

Anomaly Detection ApproachesSupervised LearningUnsupervised learningSemi-supervised learningEvaluating Models: Accuracy Is Not EnoughAnomaly Detection as Learning Normal BehaviorApproaches to Modeling Normal BehaviorWhy Use Deep Learning for Anomaly Detection?What Can Go Wrong?

3. Deep Learning for Anomaly Detection

AutoencodersVariational AutoencodersGenerative Adversarial NetworksSequence-to-Sequence ModelsOne-Class Support Vector MachinesAdditional ConsiderationsHow to Decide on a Modeling Approach?

4. Prototype

DatasetsBenchmarking ModelsWeb Application Prototypes

5. Landscape

Open Source Tools and FrameworksAnomaly Detection as a Service

6. Ethics

Diversity MattersExplainability

Additional Use CasesInnovation

7. Conclusion

“An outlier is an observation which deviates so much from otherobservations as to arouse suspicions that it was generated by a differentmechanism.”

– Hawkins, Identification of Outliers (1980)

Anomalies, often referred to as outliers, abnormalities, rare events, or deviants,are data points or patterns in data that do not conform to a notion of normalbehavior. Anomaly detection, then, is the task of finding those patterns in datathat do not adhere to expected norms, given previous observations. Thecapability to recognize or detect anomalous behavior can provide highly usefulinsights across industries. Flagging unusual cases or enacting a plannedresponse when they occur can save businesses time, costs, and customers.Hence, anomaly detection has found diverse applications in a variety ofdomains, including IT analytics, network intrusion analytics, medical diagnostics,financial fraud protection, manufacturing quality control, marketing and socialmedia analytics, and more.

Applications of Anomaly DetectionWe’ll begin by taking a closer look at some possible use cases, before divinginto different approaches to anomaly detection in the next chapter.

IntroductionCHAPTER 1

Network Intrusion DetectionNetwork security is critical to running a modern viable business, yet allcomputer systems suffer from security vulnerabilities which are both technicallydifficult and economically punishing to resolve once exploited. Business ITsystems collect data about their own network traffic, user activity in the system,types of connection requests, and more. While most activity will be benign androutine, analysis of this data may provide insights into unusual (anomalous)activity within the network after and ideally before a substantive attack. Inpractice, the damage and cost incurred right after an intrusion event escalatesfaster than most teams are able to mount an effective response. Thus, itbecomes critical to have special-purpose intrusion detection systems (IDSs) inplace that can surface potential threat events and anomalous probing early andin a reliable manner.

Medical DiagnosisIn many medical diagnosis applications, a variety of data points (e.g., X-rays,MRIs, ECGs) indicative of health status are collected as part of diagnosticprocesses. Some of these data points are also collected by end-user medicaldevices (e.g., glucose monitors, pacemakers, smart watches). Anomalydetection approaches can be applied to highlight situations of abnormal

Anomaly detection is relevant to several usecases - Network intrusion

detection, Medical diagnosis, Fraud detection and manufacturing defect

detection.

readings that may be indicative of health conditions or precursor signals ofmedical incidents.

Fraud DetectionIn 2018, fraud was estimated to have a global financial cost of over £3.89 trillion(about $5 trillion USD). Within the financial services industry, it is critical forservice providers to quickly and correctly identify and react to fraudulenttransactions. In the most straightforward cases, a transaction may be identifiedas fraudulent by comparison to the historical transactions by a given party or toall other transactions occurring within the same time period for a peer group.Here, fraud can be cast as a deviation from normal transaction data andaddressed using anomaly detection approaches. Even as financial fraud isfurther clustered into card-based, check-based, unauthorized account access-based, or authorized payment-based categories, the core concepts ofbaselining an individual’s standard behavior and looking for signals of unusualactivity apply.

Manufacturing Defect DetectionWithin the manufacturing industry, an automated approach to the task ofdetecting defects, particularly in items manufactured in large volumes, is vital toquality assurance (QA). This task can be cast as an anomaly detection exercise,where the goal is to identify manufactured items that significantly or evenslightly differ from ideal (normal) items that have passed QA tests. The amountof acceptable deviation is determined by the company and customers, as wellas industry and regulatory standards.

Use-Case Example

ScoleMans has been making drywall panels for years and is the leadingregional supplier for construction companies. Occasionally, some of thewall panels they produce have defects: cracks, chips at the edges,paint/coating issues, etc. As part of their QA process, they capture bothRGB and thermal images of each produced panel, which their QAengineers use to flag defective units. They want to automate this process;i.e., develop tools that automatically identify these defects. This ispossible using a deep anomaly detection model. In particular, ScoleManscan use an autoencoder or GAN-based model built with convolutional

http://www.crowe.ie/wp-content/uploads/2019/08/The-Financial-Cost-of-Fraud-2019.pdf

neural network blocks (see Chapter 3. Deep Learning for AnomalyDetection for more information) to create a model of normal data based onimages of normal panels. This model can then be used to tag new imagesas normal or abnormal.

Similarly, the task of predictive maintenance can be cast as an anomalydetection problem. For example, anomaly detection approaches can be appliedto data from machine sensors (vibrations, temperature, drift, and more), whereabnormal sensor readings can be indicative of impending failures.

As these examples suggest, anomaly detection is useful in a variety of areas.Detecting and correctly classifying as anomalous something previously unseenis a challenging problem that has been tackled in many different ways over theyears. While there are many approaches, the traditional machine learning (ML)techniques are suboptimal when it comes to high-dimensional data andsequence datasets, because they fail to capture the complex structures in thedata.

This report, with its accompanying prototype, explores deep learning-basedapproaches that first learn to model normal behavior and then exploit thisknowledge to identify anomalies. While they’re capable of yielding remarkableresults on complex and high-dimensional data, there are several factors thatinfluence the choice of approach when building an anomaly detectionapplication. We survey the options, highlighting their pros and cons.

In this chapter, we provide an overview of approaches to anomaly detectionbased on the type of data available, how to evaluate an anomaly detectionmodel, how each approach constructs a model of normal behavior, and whydeep learning models are valuable. We conclude with a discussion of pitfallsthat may be encountered while deploying these models.

Anomaly Detection ApproachesAnomaly detection approaches can be categorized in terms of the type of dataneeded to train the model. In most use cases, it is expected that anomaloussamples represent a very small percentage of the entire dataset. Thus, evenwhen labeled data is available, normal data samples are more readily availablethan abnormal samples. This assumption is critical for most applications today.In the following sections, we touch on how the availability of labeled dataimpacts the choice of approach.

Supervised LearningWhen learning with supervision, machines learn a function that maps inputfeatures to outputs based on example input-output pairs. The goal ofsupervised anomaly detection algorithms is to incorporate application-specificknowledge into the anomaly detection process. With sufficient normal andanomalous examples, the anomaly detection task can be reframed as aclassification task where the machines can learn to accurately predict whether agiven example is an anomaly or not. That said, for many anomaly detection usecases the proportion of normal versus anomalous examples is highlyimbalanced; while there may be multiple anomalous classes, each of themcould be quite underrepresented.

BackgroundCHAPTER 2

This approach assumes that one has labeled examples for all types ofanomalies that could occur and can correctly classify them. In practice, this isusually not the case, as anomalies can take many different forms, with novelanomalies emerging at test time. Thus, approaches that generalize well and aremore effective at identifying previously unseen anomalies are preferable.

Unsupervised learningWith unsupervised learning, machines do not possess example input-outputpairs that allow it to learn a function that maps the input features to outputs.Instead, they learn by finding structure within the input features. Because, asmentioned previously, labeled anomalous data is relatively rare, unsupervisedapproaches are more popular than supervised ones in the anomaly detectionfield. That said, the nature of the anomalies one hopes to detect is often highlyspecific. Thus, many of the anomalies found in a completely unsupervisedmanner could correspond to noise, and may not be of interest for the task athand.

An illustration of supervised learning.

Semi-supervised learningSemi-supervised learning approaches represent a sort of middle ground,employing a set of methods that take advantage of large amounts of unlabeleddata as well as small amounts of labeled data. Many real-world anomalydetection use cases are well suited to semi-supervised learning, in that thereare a huge number of normal examples available from which to learn, butrelatively few examples of the more unusual or abnormal classes of interest.Following the assumption that most data points within an unlabeled dataset arenormal, one can train a robust model on an unlabeled dataset and evaluate itsperformance (and tune the model’s parameters) using a small amount of labeled

data.[1]

An illustration of unsupervised learning.

This hybrid approach is well suited to applications like network intrusiondetection, where one may have multiple examples of the normal class andsome examples of intrusion classes, but new kinds of intrusions may arise overtime.

To give another example, consider X-ray screening for aviation or bordersecurity. Anomalous items posing a security threat are not commonlyencountered and can take many forms. In addition, the nature of any anomalyposing a potential threat may evolve due to a range of external factors.Exemplary data of anomalies can therefore be difficult to obtain in any usefulquantity.

An illustration of semi-supervised learning

Such situations may require the determination of abnormal classes as well asnovel classes, for which little or no labeled data is available. In cases like these,a semi-supervised classification approach that enables detection of both knownand previously unseen anomalies is an ideal solution.

Evaluating Models: Accuracy IsNot EnoughAs mentioned in the previous section, in anomaly detection applications it isexpected that the distribution between the normal and abnormal class(es) maybe highly skewed. This is commonly referred to as the class imbalance problem.A model that learns from such skewed data may not be robust; it may beaccurate when classifying examples within the normal class, but perform poorlywhen classifying anomalous examples.

For example, consider a dataset consisting of 1,000 images of luggage passingthrough a security checkpoint. 950 images are of normal pieces of luggage, and50 are abnormal. A classification model that always classifies an image as

Exemplary data in certain applications can be difficult to obtain

normal can achieve high overall accuracy for this dataset (95%), even though itsaccuracy rate for classifying abnormal data is 0%.

Such a model may also misclassify normal examples as anomalous (falsepositives, FP), or misclassify anomalous examples as normal ones (falsenegatives, FN). As we consider both of these types of errors, it becomesobvious that the traditional accuracy metric (total number of correctclassifications divided by total classifications) is insufficient in evaluating theskill of an anomaly detection model.

Two important metrics have been introduced that provide a better measure ofmodel skill: precision and recall. Precision is defined as the number of truepositives (TP) divided by the number of true positives plus the number of falsepositives (FP), while recall is the number of true positives divided by thenumber of true positives plus the number of false negatives (FN). Depending onthe use case or application, it may be desirable to optimize for either precisionor recall.

Optimizing for precision may be useful when the cost of failure is low, or toreduce human workload. Optimizing for high recall may be more appropriatewhen the cost of a false negative is very high; for example, in airport security,where it is better to flag many items for human inspection (low cost) in order toavoid the much higher cost of incorrectly admitting a dangerous item onto aflight. While there are several ways to optimize for precision or recall, themanner in which a threshold is set can be used to reflect the precision andrecall preferences for each specific use case.

You now have an idea of why an unsupervised or semi-supervisedapproach to anomaly detection is desirable, and what metrics are best touse for evaluating these models. In the next section, we focus on semi-supervised approaches and discuss how they work.

Anomaly Detection as LearningNormal BehaviorThe underlying strategy for most approaches to anomaly detection is to firstmodel normal behavior, and then exploit this knowledge to identify deviations

(anomalies). This approach typically falls under the semi-supervised learningcategory and is accomplished through two steps in the anomaly detection loop.The first step, referred to as the training step, involves building a model ofnormal behavior using available data. Depending on the specific anomalydetection method, this training data may contain both normal and abnormal datapoints, or only normal data points (see Chapter 3. Deep Learning for AnomalyDetection for additional details on anomaly detection methods). Based on thismodel, an anomaly score is then assigned to each data point that represents ameasure of deviation from normal behavior.

The second step in the anomaly detection loop, the test step, introduces theconcept of threshold-based anomaly tagging. Based on the range of scoresassigned by the model, one can select a threshold rule that drives the anomalytagging process; e.g., scores above a given threshold are tagged as anomalies,while those below it are tagged as normal. The idea of a threshold is valuable, asit provides the analyst an easy lever with which to tune the “sensitivity” of theanomaly tagging process. Interestingly, while most methods for anomaly

The training step in the anomaly detection loop: based on data (which

may or may not contain abnormal samples), the anomaly detection model

learns a model of normal behavior which it uses to assign anomaly

scores.

detection follow this general approach, they differ in how they model normalbehavior and generate anomaly scores.

To further illustrate this process, consider the scenario where the task is todetect abnormal temperatures (e.g., spikes), given data from the temperaturesensors attached to servers in a data center. We can use a statistical approachto solve this problem (see the table in the following section for an overview ofcommon methods). In step 1, we assume the samples follow a normaldistribution, and we can use sample data to learn the parameters of thisdistribution (mean and variance). We assign an anomaly score based on asample’s deviation from the mean and set a threshold (e.g., any value more than3 standard deviations from the mean is an anomaly). In step 2, we then tag allnew temperature readings and generate a report.

The test step in the anomaly detection loop.

Approaches to Modeling NormalBehaviorGiven the importance of the anomaly detection task, multiple approaches haveRbeen proposed and rigorously studied over the last few decades. To provide ahigh-level summary, we categorize the more popular techniques into four mainareas: clustering, nearest neighbor, classification, and statistical. For a detailed

survey of existing techniques, see [2]). The following table provides a summaryof the assumptions and anomaly scoring strategies employed by approacheswithin each category, and some examples of each.

AnomalyDetectionMethod

Assumptions Anomaly Scoring Notable Examples

Clustering Normal data points belong to a cluster (orlie close to its centroid) in the data whileanomalies do not belong to any clusters.

Distance from nearestcluster centroid

Self-organizing maps (SOMs), k-means clustering, expectationmaximization (EM)

NearestNeighbour

Normal data instances occur in denseneighborhoods while anomalous data arefar from their nearest neighbors

Distance from _k_thnearest neighbour

k-nearest neighbors (KNN)

Anomaly scoring

AnomalyDetectionMethod

Assumptions Anomaly Scoring Notable Examples

Classification A classifier can be learned whichdistinguishes between normal andanomalous with the given feature space

Labeled data exists for normal andabnormal classes

A measure of classifierestimate (likelihood)that a data pointbelongs to the normalclass

One-class support vectormachines (OCSVMs)

Statistical Given an assumed stochastic model,normal data instances fall in high-probability regions of the model whileabnormal data points lie in low-probabilityregions

Probability that a datapoint lies in a high-probability region inthe assumeddistribution

Regression models (ARMA,ARIMA)

Deeplearning

Given an assumed stochastic model,normal data instances fall in high-probability regions of the model whileabnormal data points lie in low-probabilityregions

Probability that a datapoint lies in a high-probability region inthe assumeddistribution

autoencoders, sequence-to-sequence models, generativeadversarial networks (GANs),variational autoencoders (VAEs)

The majority of these approaches have been applied to univariate time seriesdata; a single data point generated by the same process at various time steps(e.g., readings from a temperature sensor over time); and assume linearrelationships within the data. Examples include k-means clustering, ARMA,ARIMA, etc. However, data is increasingly high-dimensional (e.g., multivariatedatasets, images, videos), and the detection of anomalies may require the jointmodeling of interactions between each variable. For these sorts of problems,deep learning approaches (the focus of this report) such as autoencoders,VAEs, sequence-to-sequence models, and GANs present some benefits.

Why Use Deep Learning forAnomaly Detection?Deep learning approaches, when applied to anomaly detection, offer severaladvantages. First, these approaches are designed to work with multivariate andhigh dimensional data. This makes it easy to integrate information from multiplesources, and eliminates challenges associated with individually modelinganomalies for each variable and aggregating the results. Deep learningapproaches are also well-adapted to jointly modeling the interactions betweenmultiple variables with respect to a given task and - beyond the specification ofgeneric hyperparameters (number of layers, units per layer, etc.) - deep learningmodels require minimal tuning to achieve good results.

Performance is another advantage. Deep learning methods offer the opportunityto model complex, nonlinear relationships within data, and leverage this for theanomaly detection task. The performance of deep learning models can also

potentially scale with the availability of appropriate training data, making themsuitable for data-rich problems.

What Can Go Wrong?There are a proliferation of algorithmic approaches that can help one tackle ananomaly detection task and build solid models, at times even with just normalsamples. But do they really work? What could possibly go wrong? Here aresome of the issues that need to be considered:

Contaminated normal examplesIn large-scale applications that have huge volumes of data, it’s possible thatwithin the large unlabeled dataset that’s considered the normal class, a smallpercentage of the examples may actually be anomalous, or simply be poortraining examples. And while some models (like a one-class SVM or isolationforest) can account for this, there are others that may not be robust to detectinganomalies.

Computational complexityAnomaly detection scenarios can sometimes have low latency requirements;i.e., it may be necessary to be able to speedily retrain existing models as newdata becomes available, and perform inference. This can be computationallyexpensive at scale, even for linear models for univariate data. Deep learningmodels also incur additional compute costs to estimate their large number ofparameters. To address these issues, it is recommended to explore trade-offsthat balance the frequency of retraining and overall accuracy.

Human supervisionOne major challenge with unsupervised and semi-supervised approaches isthat they can be noisy and may generate a large amount of false positives. Inturn, false positives incur labor costs associated with human review. Giventhese costs, an important goal for anomaly detection systems is to incorporatethe results of human review (as labels) to improve model quality.

Definition of anomalyIn many data domains, the boundary between normal and anomalous behavioris not precisely defined and is continually evolving. Unlike in other task domainswhere dataset shift occurs sparingly, anomaly detection systems shouldanticipate and account for (frequent) changes in the distribution of the data. Inmany cases, this can be achieved by frequent retraining of the models.

Threshold selectionThe process of selecting a good threshold value can be challenging. In a semi-supervised setting (the approaches covered above), one has access to a pool oflabeled data. Using these labels, and some domain expertise, it is possible todetermine a suitable threshold. Specifically, one can explore the range ofanomaly scores for each data point in the validation set and select as athreshold the point that yields the best performance metric (accuracy, precision,recall). In the absence of labeled data, and assuming that most data points arenormal, one can use statistics such as standard deviation and percentiles toinfer a good threshold.

InterpretabilityDeep learning methods for anomaly detection can be complex, leading to theirreputation as black box models. However, interpretability techniques such asLIME (see our previous report, “Interpretability”) and Deep SHAP provideopportunities for analysts to inspect their behavior and make them moreinterpretable.

https://clients.fastforwardlabs.com/ff06/report

https://arxiv.org/abs/1705.07874

In this chapter, we will review a set of relevant deep learning modelarchitectures and how they can be applied to the task of anomaly detection. Asdiscussed in Chapter 2. Background, anomaly detection involves first teaching amodel to recognize normal behavior, then generating anomaly scores that canbe used to identify anomalous activity.

The deep learning approaches discussed here typically fall within a family ofencoder-decoder models: an encoder that learns to generate an internalrepresentation of the input data, and a decoder that attempts to reconstruct theoriginal input based on this internal representation. While the exact techniquesfor encoding and decoding vary across models, the overall benefit they offer isthe ability to learn the distribution of normal input data and construct a measureof anomaly respectively.

AutoencodersAutoencoders are neural networks designed to learn a low-dimensionalrepresentation, given some input data. They consist of two components: anencoder that learns to map input data to a low-dimensional representation(termed the bottleneck), and a decoder that learns to map this low-dimensionalrepresentation back to the original input data. By structuring the learningproblem in this manner, the encoder network learns an efficient “compression”function that maps input data to a salient lower-dimensional representation,such that the decoder network is able to successfully reconstruct the originalinput data. The model is trained by minimizing the reconstruction error, which isthe difference (mean squared error) between the original input and thereconstructed output produced by the decoder. In practice, autoencoders havebeen applied as a dimensionality reduction technique, as well as in other usecases such as noise removal from images, image colorization, unsupervisedfeature extraction, and data compression.

Deep Learning for AnomalyDetection

CHAPTER 3

It is important to note that the mapping function learned by an autoencoder isspecific to the training data distribution. That is, an autoencoder will typicallynot succeed at reconstructing data that is significantly different from the data ithas seen during training. As we will see later in this chapter, this property oflearning a distribution-specific mapping (as opposed to a generic linearmapping) is particularly useful for the task of anomaly detection.

Modeling Normal Behavior and AnomalyScoringApplying an autoencoder for anomaly detection follows the general principle offirst modeling normal behavior and subsequently generating an anomaly scorefor each new data sample. To model normal behavior, we follow a semi-supervised approach where we train the autoencoder on normal data samples.This way, the model learns a mapping function that successfully reconstructsnormal data samples with a very small reconstruction error. This behavior isreplicated at test time, where the reconstruction error is small for normal datasamples, and large for abnormal data samples. To identify anomalies, we use the

The components of an autoencoder.

reconstruction error score as an anomaly score and flag samples withreconstruction errors above a given threshold.

This process is illustrated in the figure above. As the autoencoder attempts toreconstruct abnormal data, it does so in a manner that is weighted towardnormal samples (square shapes). The difference between what it reconstructsand the input is the reconstruction error. We can specify a threshold and flaganomalies as samples with a reconstruction error above a given threshold.

Variational AutoencodersA variational autoencoder (VAE) is an extension of the autoencoder. Similar toan autoencoder, it consists of an encoder and a decoder network component,but it also includes important changes in the structure of the learning problemto accommodate variational inference. As opposed to learning a mapping fromthe input data to a fixed bottleneck vector (a point estimate), a VAE learns amapping from an input to a distribution, and learns to reconstruct the originaldata by sampling from this distribution using a latent code. In Bayesian terms,the prior is the distribution of the latent code, the likelihood is the distributionof the input given the latent code, and the posterior is the distribution of the

The use of autoencoders for anomaly detection.

latent code, given our input. The components of a VAE serve to derive goodestimates for these terms.

The encoder network learns the parameters (mean and variance) of adistribution that outputs a latent code vector Z, given the input data (posterior).In other words, one can draw samples of the bottleneck vector that“correspond” to samples from the input data. The nature of this distribution canvary depending on the nature of the input data (e.g., while Gaussiandistributions are commonly used, Bernoulli distributions can be used if theinput data is known to be binary). On the other hand, the decoder learns adistribution that outputs the original input data point (or something really closeto it), given a latent bottleneck sample (likelihood). Typically, an isotropicGaussian distribution is used to model this reconstruction space.

The VAE model is trained by minimizing the difference between the estimateddistribution produced by the model and the real distribution of the data. Thisdifference is estimated using the Kullback-Leibler divergence, which quantifiesthe distance between two distributions by measuring how much information islost when one distribution is used to represent the other. Similar toautoencoders, VAEs have been applied in use cases such as unsupervisedfeature extraction, dimensionality reduction, image colorization, imagedenoising, etc. In addition, given that they use model distributions, they can beleveraged for controlled sample generation.

The probabilistic Bayesian components introduced in VAEs lead to a few usefulbenefits. First, VAEs enable Bayesian inference; essentially, we can now samplefrom the learned encoder distribution and decode samples that do not explicitlyexist in the original dataset, but belong to the same data distribution. Second,VAEs learn a disentangled representation of a data distribution; i.e., a single unitin the latent code is only sensitive to a single generative factor. This allowssome interpretability of the output of VAEs, as we can vary units in the latentcode for controlled generation of samples. Third, a VAE provides true probabilitymeasures that offer a principled approach to quantifying uncertainty whenapplied in practice (for example, the probability that a new data point belongs tothe distribution of normal data is 80%).

Modeling Normal Behavior and AnomalyScoringSimilar to an autoencoder, we begin by training the VAE on normal data samples.At test time, we can compute an anomaly score in two ways. First, we can drawsamples of the latent code Z from the encoder given our input data, samplereconstructed values from the decoder using Z, and compute a meanreconstruction error. Anomalies are flagged based on some threshold on thereconstruction error.

Alternatively, we can output a mean and a variance parameter from the decoder,and compute the probability that the new data point belongs to the distributionof normal data on which the model was trained. If the data point lies in a low-density region (below some threshold), we flag that as an anomaly. We can dothis because we’re modeling a distribution as opposed to a point estimate.

A variational autoencoder.

Generative Adversarial NetworksGenerative adversarial networks (GANs[3]) are neural networks designed tolearn a generative model of an input data distribution. In their classicformulation, they’re composed of a pair of (typically feed-forward) neuralnetworks termed a generator, G, and discriminator, D. Both networks are trainedjointly and play a competitive skill game with the end goal of learning thedistribution of the input data, X.

The generator network G learns a mapping from random noise of a fixeddimension (Z) to samples X_ that closely resemble members of the input datadistribution. The discriminator D learns to correctly discern real samples thatoriginated in the source data (X) from fake samples (X_) that are generated by G.At each epoch during training, the parameters of G are updated to maximize itsability to generate samples that are indistinguishable by D, while the parametersof D are updated to maximize its ability to to correctly discern true samples Xfrom generated samples X_. As training progresses, G becomes proficient at

Anomaly scoring with a VAE: we output the mean reconstruction

probability (i.e., the probability that a sample belongs to the normal

data distribution).

producing samples that are similar to X, and D also upskills on the task ofdistinguishing real from fake samples.

In this classic formulation of GANs, while G learns to model the sourcedistribution X well (it learns to map random noise from Z to the sourcedistribution), there is no straightforward approach that allows us to harness thisknowledge for controlled inference; i.e., to generate a sample that is similar to agiven known sample. While we can conduct a broad search over the latentspace with the goal of recovering the most representative latent noise vectorfor an arbitrary sample, this process is compute-intensive and very slow inpractice.

To address these issues, recent research studies have explored newformulations of GANs that enable just this sort of controlled adversarial

inference by introducing an encoder network, E (BiGANs[4]) with applications in

anomaly detection (See GANomaly[5][6]). In simple terms, the encoder learnsthe reverse mapping of the generator; it learns to generate a fixed vector Z_,given a sample. Given this change, the input to the discriminator is alsomodified; the discriminator now takes in pairs of input that include the latentrepresentation (Z, and Z_), in addition to the data samples (X and X_). The

A traditional GAN.

encoder E is then jointly trained with the generator G; G learns an induceddistribution that outputs samples of X given a latent code Z, while E learns aninduced distribution that outputs Z, given a sample X.

Again, the mappings learned by components in the GAN are specific to the dataused in training. For example, the generator component of a GAN trained onimages of cars will always output an image that looks like a car, given any latentcode. At test time, we can leverage this property to infer how different a giveninput sample is from the data distribution on which the model was trained.

Modeling Normal Behavior and AnomalyScoringTo model normal behavior, we train a BiGAN on normal data samples. At the endof the training process, we have an encoder E that has learned a mapping fromdata samples (X) to latent code space (Z_), a discriminator D that has learned todistinguish real from generated data, and a generator G that has learned amapping from latent code space to sample space. Note that these mappings arespecific to the distribution of normal data that has been seen during training. Attest time, we perform the following steps to generate an anomaly score for a

A BiGAN.

given sample X. First, we obtain a latent space value Z_ from the encoder givenX, which is fed to the generator and yields a sample X_. Next, we can computean anomaly score based on the reconstruction loss (difference between X andX_) and the discriminator loss (cross entropy loss or feature differences in thelast dense layer of the discriminator, given both X and X_).

Sequence-to-Sequence ModelsSequence-to-sequence models are a class of neural networks mainly designedto learn mappings between data that are best represented as sequences. Datacontaining sequences can be challenging as each token in a sequence mayhave some form of temporal dependence on other tokens; a relationship thathas to be modeled to achieve good results. For example, consider the task oflanguage translation where a sequence of words in one language needs to bemapped to a sequence of words in a different language. To excel at such a task,a model must take into consideration the (contextual) location of eachword/token within the broader sentence; this allows it to generate anappropriate translation (See our previous report on Natural Language Processingto learn more about this area.)

A BiGAN applied to the task of anomaly detection.

https://clients.fastforwardlabs.com/ff11/report

On a high level, sequence-to-sequence models typically consist of an encoder,E, that generates a hidden representation of the input tokens, and a decoder, D,that takes in the encoder representation and sequentially generates a set ofoutput tokens. Traditionally, the encoder and decoder are composed of longshort-term memory (LSTM) blocks, that are particularly suitable for modelingtemporal relationships within input data tokens.

While sequence-to-sequence models excel at modeling data with temporaldependence, they can be slow during inference; each individual token in themodel output is sequentially generated at each time step, where the totalnumber of steps is the length of the output token.

We can use this encoder-decoder structure for anomaly detection by revisingthe sequence-to-sequence model to function like an autoencoder, training themodel to output the same tokens as the input, shifted by 1. This way, theencoder learns to generate a hidden representation that allows the decoder toreconstruct input data that is similar to examples seen in the training dataset.

Modeling Normal Behavior and AnomalyScoring

A sequence-to-sequence model.

To identify anomalies, we take a semi-supervised approach where we train thesequence-to-sequence model on normal data. At test time, we can thencompare the difference (mean squared error) between the output sequencegenerated by the model and its input. As in the approaches discussedpreviously, we can use this value as an anomaly score.

One-Class Support VectorMachinesIn this section, we discuss one-class support vector machines (OCSVMs), anon-deep learning approach to classification that we will use later (see Chapter4. Prototype) as a baseline.

Traditionally, the goal of classification approaches is to help distinguish betweendifferent classes, using some training data. However, consider a scenario wherewe have data for only one class, and the goal is to determine whether test datasamples are similar to the training samples. OCSVMs were introduced for exactlythis sort of task: novelty detection, or the detection of unfamiliar samples. SVMshave proven very popular for classification, and they introduced the use ofkernel functions to create nonlinear decision boundaries (hyperplanes) byprojecting data into a higher dimension. Similarly, OCSVMs learn a decisionfunction which specifies regions in the input data space where the probabilitydensity of the data is high. An OCSVM model is trained with varioushyperparameters:

nu specifies the fraction of outliers (data samples that do not belong to ourclass of interest) that we expect in our data.kernel specifies the kernel type to be used in the algorithm; examplesinclude RBF, polynomial (poly), and linear. This enables SVMs to use anonlinear function to project the input data to a higher dimension.gamma is a parameter of the RBF kernel type that controls the influence ofindividual training samples; this affects the “smoothness” of the model.

Modeling Normal Behavior and AnomalyScoringTo apply OCSVM for anomaly detection, we train an OCSVM model using normaldata, or data containing a small fraction of abnormal samples. Within mostimplementations of OCSVM, the model returns an estimate of how similar a datapoint is to the data samples seen during training. This estimate may be thedistance from the decision boundary (the separating hyperplane), or a discreteclass value (+1 for data that is similar and -1 for data that is not). Either type ofscore can be used as an anomaly score.

An OCSVM classifier learns a decision boundary around data seen during

training.

Additional ConsiderationsIn practice, applying deep learning relies on a few data and modelingassumptions. This section explores them in detail.

Anomalies as Rare EventsFor the training approaches discussed thus far, we operate on the assumptionof the availability of “normal” labeled data, which is then used to teach a modelrecognize normal behavior. In practice, it is often the case that labels do notexist or can be expensive to obtain. However, it is also a common observationthat anomalies (by definition) are relatively infrequent events and thereforeconstitute a small percentage of the entire event dataset (for example, theoccurrence of fraud, machine failure, cyberattacks, etc.). Our experiments (seeChapter 4. Prototype for more discussion) have shown that the neural networkapproaches discussed above remain robust in the presence of a smallpercentage of anomalies (less than 10%). This is mainly because introducing asmall fraction of anomalies does not significantly affect the network’s model of

At test time, An OCSVM model classifies data points outside the learned

decision boundary as anomalies (assigned class of -1).

normal behavior. For scenarios where anomalies are known to occur sparingly,our experiments show that it’s possible to relax the requirement of assemblinga dataset consisting only of labeled normal samples for training.

Discretizing Data and Handling StationarityTo apply deep learning approaches for anomaly detection (as with any othertask), we need to construct a dataset of training samples. For problem spaceswhere data is already discrete, we can use the data as is (e.g., a dataset ofimages of wall panels, where the task is to find images containing abnormalpanels). When data exists as a time series, we can construct our dataset bydiscretizing the series into training samples. Typically this involves slicing thedata into chunks with comparable statistical properties. For example, given aseries of recordings generated by a data center temperature sensor, we candiscretize the data into daily or weekly time slices and construct a datasetbased on these chunks. This becomes our basis for anomaly comparison (e.g.,the temperature pattern for today is anomalous compared to patterns for thelast 20 days). The choice of the discretization approach (daily, weekly, averages,etc.) will often require some domain expertise; the one requirement is thateach discrete sample be comparable. For example, given that temperatures mayspike during work hours compared to non-work hours in the scenario we’reconsidering, it may be challenging to discretize this data by hour as differenthours exhibit different statistical properties.

This notion of constructing a dataset of comparable samples is related to theidea of stationarity. A stationary series is one in which properties of the data(mean, variance) do not vary with time. Examples of non-stationary data includedata containing trends (e.g., rising global temperatures) or with seasonality (e.g.,hourly temperatures within each day). These variations need to be handledduring discretization. We can remove trends by applying a differencing functionto the entire dataset. To handle seasonality, we can explicitly includeinformation on seasonality as a feature of each discrete sample; for instance, todiscretize by hour, we can attach a categorical variable representing the hour ofthe day. A common misconception regarding the application of neural networkscapable of modeling temporal relationships such as LSTMs is that theyautomatically learn/model properties of the data useful for predictions(including trends and seasonality). However, the extent to which this is possibleis dependent on how much of this behavior is represented in each trainingsample. For example, to automatically account for trends or patterns across theday, we can discretize data by hour with an additional categorical feature forhour of day, or discretize by day (24 features for each hour).

Temperature readings for a data center over several days can be

discretized (sliced) into daily 24-hour readings and labeled (0 for a

normal average daily temperature, 1 for an abnormal temperature) to

construct a dataset.

Note: For most machine learning algorithms, it is a requirement that samples beindependent and identically distributed. Ensuring we construct comparablesamples (i.e., handle trends and seasonality) from time series data allows us tosatisfy the latter requirement, but not the former. This can affect modelperformance. In addition, constructing a dataset in this way raises thepossibility that the learned model may perform poorly in predicting outputvalues that lie outside the distribution (range of values) seen during training;i.e., if there is a distribution shift. This greatly amplifies the need to retrain themodel as new data arrives, and complicates the model deployment process. Ingeneral, discretization should be applied with care.

Selecting a ModelThere are several factors that can influence the primary approach taken when itcomes to detecting anomalies. These include the data properties (time seriesvs. non-time series, stationary vs. non-stationary, univariate vs. multivariate,low-dimensional vs. high-dimensional), latency requirements, uncertaintyreporting, and accuracy requirements. More importantly, deep learning methodsare not always the best approach! To provide a framework for navigating thisspace, we offer the following recommendations:

Data Properties

Time series data: As discussed in the previous section, it is important tocorrectly discretize the data and to handle stationarity before training amodel. In addition, for discretized data with temporal relationships, the useof LSTM layers as part of the encoder or decoder can help model theserelationships.

Univariate vs Multivariate: Deep learning methods are well suited to datathat has a wide range of features: they’re recommended for high-dimensional data, such as images, and work well for modeling theinteractions between multiple variables. For most univariate datasets, linearmodels (see Chandola et al.) are both fast and accurate and thus typicallypreferred.

Business Requirements

https://dl.acm.org/doi/10.1145/1541880.1541882

Latency: Deep learning models are slower than linear models. For scenariosthat include high volumes of data and have low latency requirements, linearmodels are recommended (e.g., for detecting anomalies in authenticationrequests for 200,000 work sites, with each machine generating 500requests per second).

Accuracy: Deep learning approaches tend to be robust, providing betteraccuracy, precision, and recall.

Uncertainty: For scenarios where it is a requirement to provide a principledestimate of uncertainty for each anomaly classification, deep learningmodels such as VAEs and BiGANs are recommended.

How to Decide on a ModelingApproach?Given the differences between the deep learning methods discussed above(and their variants), it can be challenging to decide on the right model. Whendata contains sequences with temporal dependencies, a sequence-to-sequence model (or architectures with LSTM layers) can model theserelationships, yielding better results. For scenarios requiring principledestimates of uncertainty, generative models such as a VAE and GAN basedapproaches are suitable. For scenarios where the data is images, AEs, VAEs andGANs designed with convolution layers are suitable.

A summary of steps for selecting an approach to anomaly detection.

The following table highlights the pros and cons of the different types ofmodels, to give you an idea under what kind of scenarios they arerecommended.

Model Pros Cons

AutoEncoder Flexible approach to modeling complex non-linear patterns in data

Does not support variational inference(estimates of uncertainty)

Requires a large dataset for training

VariationalAutoEncoder

Supports variational inference (probabilisticmeasure of uncertainty)

Requires a large amount of training data,training can take a while

GAN(BiGAN)

Supports variational inference (probabilisticmeasure of uncertainty)

Use of discriminator signal allows better learning

of data manifold[7] (useful for high dimensionalimage data).

GANs trained in semi-supervised learning modehave shown great promise, even with very few

labeled data[8]

Requires a large amount of training data, andlonger training time (epochs) to arrive at stable

results[9]

Training can be unstable (GAN modecollapse)

Sequence-to-SequenceModel

Well suited for data with temporal components(e.g., discretized time series data)

Slow inference (compute scales withsequence length which needs to be fixed)

Training can be slowLimited accuracy when data contains features

with no temporal dependenceSupports variational inference (probabilistic

measure of uncertainty)

One ClassSVM

Does not require a large amount of dataFast to trainFast inference time

Limited capacity in capturing complexrelationships within data

Requires careful parameter selection (kernel,nu, gamma) that need to be carefully tuned.

Does not model a probability distribution,harder to compute estimates of confidence.

In this section, we provide an overview of the data and experiments used toevaluate each of the approaches mentioned in Chapter 3. Deep Learning forAnomaly Detection. We also introduce two prototypes we built to demonstrateresults from the experiments and how we designed each prototype.

DatasetsKDDThe KDD network intrusion dataset is a dataset of TCP connections that havebeen labeled as normal or representative of network attacks.

“A connection is a sequence of TCP packets starting and ending at somewell-defined times, between which data flows to and from a source IPaddress to a target IP address under some well-defined protocol.”

These attacks fall into four main categories - denial of service, unauthorizedaccess from a remote machine, unauthorized access to local superuserprivileges, and surveillance, e.g., port scanning. Each TCP connection isrepresented as a set of attributes or features (derived based on domainknowledge) pertaining to each connection such as the number of failed logins,connection duration, data bytes from source to destination, etc. The dataset iscomprised of a training set (97278 normal traffic samples, 396743 attack trafficsamples) and a test set (63458 normal packet samples, 185366 attack trafficsamples). To make the data more realistic, the test portion of the datasetcontains 14 additional attack types that are not in the train portion; thus, a goodmodel should generalize well and detect attacks unseen during training.

ECG5000The ECG5000 dataset contains examples of ECG signals from a patient. Eachdata sample, which corresponds to an extracted heartbeat containing 140

PrototypeCHAPTER 4

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

http://www.timeseriesclassification.com/description.php?Dataset=ECG5000

points, has been labeled as normal or being indicative of heart conditionsrelated to congestive heart failure. Given an ECG signal sample, the task is topredict if it is normal or abnormal. ECG5000 is well-suited to a prototype for afew reasons: it is visual (signals can be visualized easily) and it is based on realdata associated with a concrete use case (heart disease detection). While thetask itself is not extremely complex, the data is multivariate (140 values persample, which allows us to demonstrate the value of a deep model), but smallenough to rapidly train and run inference.

Benchmarking ModelsWe sought to compare each of the models discussed earlier using the KDDdataset. We preprocessed the data to keep only 18 continuous features (Note:this slightly simplifies the problem and results in differences from similarbenchmarks on the same dataset). Feature scaling (0-1 minmax scaling) is alsoapplied to the data; scaling parameters are learned from training data and thenapplied to test data. We then trained each model using normal samples (97,278samples) and evaluated it on a random subset of the test data (8000 normalsamples and 2000 normal samples).

Method Encoder Decoder Other Parameters

PCA NA NA 2 Component PCA

OCSVM NA NA Kernel: Rbf, Outlier fraction: 0.01;gamma: 0.5. Anomaly score as distance fromdecision boundary.

Autoencoder 2 hidden layers[15, 7]

2 hidden layers [15, 7] Latent dimension: 2Batch size:256Loss: Mean squared error

VariationalAutoencoder

2 hidden layers[15, 7]

2 hidden layers [15, 7] Latent dimension: 2 Batch size: 256 Loss: Mean squared error + KLdivergence

Sequence toSequence Model

1 hidden layer,[10]

1 hidden layer [20] Bidirectional LSTMs Batch size: 256 Loss: Mean squared error

BidirectionalGAN

Encoder: 2 hiddenlayers [15, 7], Generator: 2hidden layers [15,7]

Generator: 2 hidden layers [15, 7]Discriminator: 2 hidden layers [15, 7]

Latent dimension: 32 Loss: Binary Cross Entropy Learning rate: 0.1

We implemented each model using comparable parameters (see the tableabove) that allow us to benchmark them in terms of training and inference (totaltraining time to best accuracy, inference time), storage (size of weights, number

of parameters), and performance (accuracy, precision, recall). The deep learningmodels (AE, VAE, Seq2seq, BiGAN) were implemented in Tensorflow (kerasapi); each model was trained till best accuracy measured on the same validationdataset, using the Adam optimizer, batch size of 256 and a learning rate of 0.01.OCSVM was implemented using the Sklearn OCSVM library using the non-linearrbf kernel and parameters (nu=0.01 and gamma=0.5). Results from PCA (usingthe the sum of the projected distance of a sample on all eigenvectors as theanomaly score) are also included. Additional details on the parameters for eachmodel are summarized in the table below for reproducibility. These experimentswere run on an Intel® Xeon® CPU @ 2.30GHz and a NVIDIA T4 GPU (applicableto the deep models).

Training, Inference, Storage

Method Model Size (KB) Inference Time (Seconds) # of Parameters Total Training Time (Seconds)

BiGAN 47.945 1.26 714 111.726

Autoencoder 22.008 0.279 842 32.751

OCSVM 10.77 0.029 NA 0.417

VAE 23.797 0.391 858 27.922

Seq2Seq 33.109 400.131 2741 645.448

PCA 1.233 0.003 NA 0.213

Each model is compared in terms of inference time on the entire test dataset,total training time to peak accuracy, number of parameters (deep models), andmodel size.

As expected, a linear model like PCA is both fast to train and fast for inference.This is followed by OCSVM, autoencoders, variational autoencoders, BiGAN, andsequence-to-sequence models in order of increasing model complexity. The

Comparison of anomaly detection models in terms of model size, inference

time and training time.

https://keras.io/optimizers/

https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

GAN-based model required the most training epochs to achieve stable results;this is in part due to a known stability issue associated with GANs. Thesequence-to-sequence model is particularly slow for inference given thesequential nature of the decoder.

For each of the deep models, we store the network weights to disk andcompute the size of weights. For OCSVM and PCA, we serialize the model todisk using pickle and compute the size of each model file. These values arehelpful in estimating memory and storage costs when deploying these modelsin production.

Performance

Method ROC AUC Accuracy Precision Recall F1 score F2 Score

BiGAN 0.972 0.962 0.857 0.973 0.911 0.947

Autoencoder 0.963 0.964 0.867 0.968 0.914 0.945

OCSVM 0.957 0.949 0.906 0.83 0.866 0.844

VAE 0.946 0.93 0.819 0.836 0.827 0.832

Seq2Seq 0.919 0.829 0.68 0.271 0.388 0.308

PCA 0.771 0.936 0.977 0.699 0.815 0.741


https://docs.python.org/3/library/pickle.html

For each model, we use labeled test data to first select a threshold that yieldsthe best accuracy and then report on metrics such as f1, f2, precision, and recallat that threshold. We also report on ROC (area under the curve) to evaluate theoverall skill of each model. Given that the dataset we use is not extremelycomplex (18 features), we see that most models perform relatively well. Deepmodels (BiGAN, AE) are more robust (precision, recall, ROC AUC), compared toPCA and OCSVM. The sequence-to-sequence model is not particularly

Histogram for the distribution of anomaly scores assigned to the test

data for each model. The red vertical line represents a threshold value

that yields the best accuracy.

competitive, given the data is not temporal. On a more complex dataset (e.g.,images), we expect to see (similar to existing research), more pronouncedadvantages in using a deep learning model.

Web Application PrototypesWe built two prototypes that demonstrate results and insights from ourexperiments. The first prototype, Blip, is built on on the KDD dataset used inthe experiments above and is a visualization of the performance of fourapproaches to anomaly detection. The second prototype, Anomagram, is aninteractive explainer that focuses on the autoencoder model, and the resultsfrom applying it to detecting anomalies in ECG data.

Blip

Blip plays back and visualizes the performance of four different algorithms on asubset of the KDD network intrusion dataset. Blip dramatizes the analogydetection process and builds user intution about the trade-offs involved.

The Blip prototype.


https://blip.fastforwardlabs.com/

http://anomagram.fastforwardlabs.com/#/

https://blip.fastforwardlabs.com/

The concept of an anomaly is easy to visualize: something that doesn’t look thesame. The conceptual simplicity of it actually makes the prototype’s job tricker.If we show you a dataset where the anomalies are easy to spot, it’s not clearwhat you need an algorithm for. Instead, we want to place you in a situationwhere the data is complicated enough, and streaming in fast enough, that thebenefits of an algorithm are clear. Often in a data visualization, you want toremove complexity; in Blip, we wanted to preserve it, but place it in context. Wedid this by including at the top a terminal-like view of the connection datacoming in. The speed and number of features involved make the usefulness ofan algorithm, which can operate at a higher speed and scale than a human, clear.

Directly below the terminal-like streaming data, we show performance metricsfor each of the algorithms. These metrics include accuracy, recall, and precision.The three different measurements hint at the trade-offs involved in choosing ananomaly detection algorithm. You will want to prioritize different metricsdepending on the situation. Accuracy is a measure of how often the algorithm isright. Prioritizing precision will minimize false positives, while focusing on recallwill minimize false negatives. By showing the formulas for each of thesemetrics, updated in real-time with the streaming data, we build intuition abouthow the different measures interact.

The terminal section shows the data streaming in.

The strategy section shows the performance of the algorithms across

different metrics.

In the visualizations, we want to give the user a feel for how each algorithmperforms across the various metrics. If a connection is classified by thealgorithm as an anomaly, it is stacked on the left; if it is classified as normal, it isplaced on the right. The ground truth is indicated by the color: red for anomaly,black for normal. In a perfectly performing algorithm, the left side would becompletely red and the right completely black. An algorithm that has lots of falsenegatives (low recall) will have a higher density of red mixed in with the black onthe right side. A low precision performance will show up as lots of black mixedinto the left side. The fact that each connection gets its own spot makes thescale of the dataset clear (versus the streaming-terminal view where oldconnections quickly leave the screen). Our ability to quickly assess visualdensity makes it easier to get a feel for what differences in performance metricsacross algorithms really mean.

One difficulty in designing the prototype was figuring out when to reveal theground truth (visualized as the red or black color). In a real-world situation, youwould not know the truth as it came in (you woudn’t need an algorithm then).Early versions of the prototype experimented with only revealing the color afterclassification. Ultimately, we decided that because there is already a lothappening in the prototype, a delayed reveal pushed the complexity a step toofar. Part of this is because the prototype shows the performance of fourdifferent algorithms. If we were showing only one, we’d have more room toanimate the truth reveal. We decided to reveal the color truth at the start to

The last section visualizes the performance of each algorithm.

strengthen the visual connection between the connection data as shown in theterminal and in each of the algorithm visualizations.

Prototype II - AnomagramThis section describes Anomagram - an interactive web based experiencewhere the user can build, train and evaluate an autoencoder to detectanomalous ECG signals. It utilizes the ECG5000 dataset mentioned above.

UX Goals for Anomagram

Anomagram is designed as part of a growing area of interactive visualizations(see Neural Network Playground, ConvNet Playground, GANLab, GAN dissection,etc.) that help communicate technical insights on how deep learning modelswork. It is entirely browser-based and implemented in Tensorflow.js. This way,users can explore live experiments with no installations required. Importantly,Anomagram moves beyond the use of toy/synthetic data and situates learningwithin the context of a concrete task (anomaly detection for ECG data). Theoverall user experience goals for Anomagram are summarized as follows:

1: Provide an introduction to autoencoders and how they can be applied to thetask of anomaly detection. This is achieved via the Introduction module (seescreenshot below). This entails providing definitions of concepts(reconstruction error, thresholds, etc.) paired with interactive visualizations thatdemonstrate concepts (e.g., an interactive visualization for inference on testdata, a visualization of the structure of an autoencoder, a visualization of errorhistograms as training progresses, etc.).

http://anomagram.fastforwardlabs.com/#/

https://playground.tensorflow.org/

http://convnetplayground.fastforwardlabs.com/

https://poloclub.github.io/ganlab/

https://gandissect.csail.mit.edu/

2: Provide an interactive, accessible experience that supports technical learningby doing. This is mostly accomplished within the Train a Model module (seescreenshot below) and is designed for users interested in additional technicaldepth. It entails providing a direct manipulation interface that allows the user tospecify a model (add/remove layers and units within layers), modify modelparameters (training steps, batchsize, learning rate, regularizer, optimizer),modify training/test data parameters (data size, data composition), train themodel, and evaluate model performance (visualization of accuracy, precision,recall, false positive, false negative, ROC, etc. metrics) as each parameter ischanged. Who should use Anomagram? Anyone interested to learn aboutautoencoders and anomaly detection in an accessible way. Anomagram is alsouseful for educators (as a tool to support guided discussion of the topic), entrylevel data scientists, and non-ML experts (citizen data scientists, softwaredevelopers, designers).

Introduction module view.

Interface Affordances and Insights

This section discusses some explorations the user can perform withAnomagram, and some corresponding insights.

Craft (Adversarial) Input: Anomalies by definition can take many different andpreviously unseen forms. This makes the assessment of anomaly detectionmodels more challenging. Ideally, we want the user to conduct their ownevaluations of a trained model, e.g., by allowing them to upload their own ECGdata. In practice, this requires the collection of digitized ECG data with similarpreprocessing (heartbeat extraction) and range as the ECG5000 dataset used intraining the model. This is challenging. The next best way to allow testing onexamples contributed by the user is to provide a simulator — hence the drawyour ECG data feature. This provides an (html) canvas on which the user candraw signals and observe the model’s behaviour. Drawing strokes are convertedto an array, with interpolation for incomplete drawings (total array size=140) andfed to the model. While this approach has limited realism (users may not have

Train a Model module view.

sufficient domain expertise to draw meaningful signals), it provides anopportunity to craft various types of (adversarial) samples and observe themodel’s performance.

Insights: The model tends to expect reconstructions that are close to the meanof normal data samples. Using the Draw your ECG data feature, the user candraw (adversarial) examples of input data and observe modelpredictions/performance.

Visually Compose a Model: Users can intuitively specify an autoencoderarchitecture using a direct manipulation model composer. They can add layersand add units to layers using clicks. The architecture is then used to specify themodel’s parameters each time the model is compiled. This follows a similarapproach used in “A Neural Network Playground”[3]. The model composerconnector lines are implemented using the leaderline library. Relevant lines areredrawn or added as layers are added or removed from the model.

Insights: There is no marked difference between a smaller model (1 layer) and alarger model (e.g., 8 layers) for the current task. This is likely because the task isnot especially complex (a visualization of PCA points for the ECG datasetsuggests it is linearly separable). Users can visually compose the autoencodermodel — add remove layers in the encoder and decoder. To keep the encoderand decoder symmetrical, add/remove operations on either are mirrored.

Effect of Learning Rate, Batchsize, Optimizer, Regularization: The user canselect from 6 optimizers (Adam, Adamax, Adadelta, Rmsprop, Momentum, Sgd),various learning rates, and regularizers (l1, l2, l1l2).

Insights: Adam reaches peak accuracy with less steps compared to otheroptimizers. Training time increases with no benefit to accuracy as batchsize isreduced (when using Adam). A two layer model will quickly overfit on the data;adding regularization helps address this to some extent.

Effect of Threshold Choices on Precision/Recall: Earlier in this report (seeChapter 2. Evaluating Models: Accuracy Is Not Enough ) we highlight theimportance of metrics such as precision and recall and why accuracy is notenough. To support this discussion, the user can visualize how thresholdchoices impact each of these metrics.

Insights: As threshold changes, accuracy can stay the same but, precision andrecall can vary. This further illustrates how the threshold can be used by an

analyst as a lever to reflect their precision/recall preferences.

Effect of Data Composition: We may not always have labeled normal data totrain a model. However, given the rarity of anomalies (and domain expertise), wecan assume that unlabeled data is mostly comprised of normal samples.However, this assumption raises an important question - does modelperformance degrade with changes in the percentage of abnormal samples inthe dataset? In the Train a Model section, you can specify the percentage ofabnormal samples to include when training the autoencoder model.

Insights: We see that with 0% abnormal data, the model AUC is ~96%. Great! At30% abnormal sample composition, AUC drops to ~93%. At 50% abnormal datapoints, there is just not enough information in the data that allows the model tolearn a pattern of normal behaviour. It essentially learns to reconstruct normaland abnormal data well and mse is no longer a good measure of anomaly. At thispoint, model performance is only slightly above random chance (AUC of 56%).

This chapter provides an overview of the landscape of currently available opensource tools and service vendor offerings available for anomaly detection, andconsiders the trade-offs as well as when to use each.

Open Source Tools andFrameworksSeveral popular open source machine learning libraries and packages in Pythonand R include implementations of algorithmic techniques that can be applied toanomaly detection tasks. Useful algorithms (e.g., clustering, OCSVMs, isolationforests) also exist as part of general-purpose frameworks like scikit-learnthat do not cater specifically to anomaly detection. In addition, genericpackages for univariate time series forecasting (e.g., Facebook’s Prophet) havebeen applied widely to anomaly detection tasks where anomalies are identifiedbased on the difference between the true value and a forecast.

In this section, our focus is on comprehensive toolboxes that specificallyaddress the task of anomaly detection.

Python Outlier Detection (PyOD)PyOD is an open source Python toolbox for performing scalable outlierdetection on multivariate data. It provides access to a wide range of outlierdetection algorithms, including established outlier ensembles and more recentneural network-based approaches, under a single, well-documented API.

PyOD offers several distinct advantages:

It gives access to over 20 algorithms, ranging from classical techniques suchas local outlier factor (LOF) to recent neural network architectures such asautoencoders and adversarial models.

LandscapeCHAPTER 5

https://facebook.github.io/prophet/

https://github.com/yzhao062/pyod

It implements combination methods for merging the results of multipledetectors and outlier ensembles, which are an emerging set of models.It includes a unified API, detailed documentation, and interactive examplesacross all algorithms for clarity and ease of use.All models are covered by unit testing with cross-platform continuousintegration, code coverage, and code maintainability checks.Optimization instruments are employed when possible: just-in-time (JIT)compilation and parallelization are enabled in select models for scalableoutlier detection.It’s compatible with both Python 2 and 3 across major operating systems.

Seldon’s Anomaly Detection PackageSeldon.io is known for its open source ML deployment solution for Kubernetes,which can in principle be used to serve arbitrary models. In addition, the Seldonteam has recently released alibi-detect, a Python package focused onoutlier, adversarial, and concept drift detection. The package aims to provideboth online and offline detectors for tabular data, text, images, and time seriesdata. The outlier detection methods should allow the user to identify global,contextual, and collective outliers.

Seldon has identified anomaly detection as a sufficiently important capability towarrant dedicated attention in the framework, and has implemented severalmodels to use out of the box. The existing models include sequence-to-sequence LSTMs, variational autoencoders, spectral residual models for timeseries, Gaussian mixture models, isolation forests, Mahalanobis distance, andothers. Examples and documentation are provided.

In the Seldon Core architecture, anomaly detection methods may beimplemented as either models or input transformers. In the latter case, they canbe composed with other data transformations to process inputs to anothermodel. This nicely illustrates one role anomaly detection can play in MLsystems: flagging bad inputs before they pass through the rest of a pipeline.

R PackagesThe following section reviews R packages that have been created for anomalydetection. Interestingly, most of them deal with time series data.

https://github.com/SeldonIO

https://github.com/SeldonIO/alibi-detect

Twitter’s AnomalyDetection package

Twitter’s AnomalyDetection is an open source R package that can be used toautomatically detect anomalies. It is applicable across a wide variety of contexts(for example, anomaly detection in system metrics after a new software release,user engagement after an A/B test, or problems in econometrics, financialengineering, or the political and social sciences). It can help detect both globaland local anomalies as well as positive/negative anomalies (i.e., a point-in-timeincrease/decrease in values).

The primary algorithm, Seasonal Hybrid Extreme Studentized Deviate (S-H-ESD),builds upon the Generalized ESD test for detecting anomalies, which can beeither global or local. This is achieved by employing time series decompositionand using robust statistical metrics; i.e., median together with ESD. In addition,for long time series (say, 6 months of minutely data), the algorithm employspiecewise approximation.

The package can also be used to detect anomalies in a vector of numericalvalues when the corresponding timestamps are not available, and it providesrich visualization support. The user can specify the direction of anomalies andthe window of interest (such as the last day or last hour) and enable or disablepiecewise approximation, and the x- and y-axes are annotated to assist visualdata analysis.

anomalize package

The anomalize package, open sourced by Business Science, performs timeseries anomaly detection that goes inline with other Tidyverse packages (orpackages supporting tidy data).

Anomalize has three main functions:

Decompose separates out the time series into seasonal, trend, andremainder components.Anomalize applies anomaly detection methods to the remaindercomponent.Recompose calculates upper and lower limits that separate the “normal”data from the anomalies.

tsoutliers package

https://github.com/twitter/AnomalyDetection

https://github.com/business-science/anomalize

https://www.tidyverse.org/

This package implements a procedure based on the approach described inChen and Liu (1993) for automatic detection of outliers in time series. Timeseries data often undergoes nonsystematic changes that alter the dynamics ofthe data, either transitorily or permanently. The approach considers innovationaloutliers, additive outliers, level shifts, temporary changes, and seasonal levelshifts while fitting a time series model.

Numenta’s HTM (Hierarchical TemporalMemory)Research organization Numenta has introduced a machine intelligenceframework for anomaly detection called Hierarchical Temporal Memory (HTM). Atthe core of HTM are time-based learning algorithms that store and recalltemporal patterns. Unlike most other ML methods, HTM algorithms learn time-based patterns in unlabeled data on a continuous basis. They are robust tonoise and high-capacity, meaning they can learn multiple patternssimultaneously. The HTM algorithms are documented and available through theopen source Numenta Platform for Intelligent Computing (NuPIC). They’reparticularly suited to problems involving streaming data and to identifyingunderlying patterns in data change over time, subtle patterns, and time-basedpatterns.

One of the first commercial applications to be developed using NuPIC is Grok, atool that performs IT analytics, giving insight into IT systems to identify unusualbehavior and reduce business downtime. Another is Cortical.io, which enablesapplications in natural language processing.

The NuPIC platform also offers several tools, such as HTM Studio and NumentaAnomaly Benchmark (NAB). HTM Studio is a free desktop tool that enablesdevelopers to find anomalies in time series data without the need to program,code, or set parameters. NAB is a novel benchmark for evaluating andcomparing algorithms for anomaly detection in streaming, real-time applications.It is composed of over 50 labeled real-world and artificial time series data files,plus a novel scoring mechanism designed for real-time applications.

Besides this, there are example applications available on NuPIC that includesample code and whitepapers for tracking anomalies in the stock market, roguebehavior detection (finding anomalies in human behavior), and geospatialtracking (finding anomalies in objects moving through space and time).

https://www.rdocumentation.org/packages/tsoutliers/versions/0.6-8

https://www.researchgate.net/publication/243768707_Joint_Estimation_of_Model_Parameters_and_Outlier_Effects_in_Time_Series

https://numenta.com/

https://grokstream.com/

https://www.cortical.io/

Numenta is a technology provider and does not create go-to-market solutionsfor specific use cases. The company licenses its technology and applicationcode to developers, organizations, and companies that wish to build upon it.Open source, trial, and commercial licenses are available. Developers can useNumenta technology within NuPIC via the AGPL v3 open source license.

Anomaly Detection as a ServiceIn this section, we survey a sample of the anomaly detection services availableat the time of writing. Most of these services can access data from public clouddatabases, provide some kind of dashboard/report format to view and analyzedata, have an alert mechanism to indicate when an anomaly occurs, and enabledevelopers to view underlying causes. An anomaly detection solution should tryto reduce the time to detect and speed up the time to resolution by identifyingkey performance indicators (KPIs) and attributes that are causing the alert. Wecompare the offerings across a range of criteria, from capabilities to delivery.

AnodotCapabilitiesAnodot is a real-time analytics and automated anomaly detection systemthat detects outliers in time series data and turns them into businessinsights. It explores anomaly detection from the perspective of forecasting,where anomalies are identified based on deviations from expectedforecasts. The product has a dual focus: business monitoring (SaaSmonitoring, anomaly detection, and root cause analysis) and businessforecasting (trend prediction, “what ifs,” and optimization).

Data requirementsAnodot supports multiple input data sources, including direct uploads andintegrations with Amazon’s S3 or Google Cloud storage. It’s data-agnosticand can track a variety of metrics: revenue, number of sales, number ofpage visits, number of daily active users, and others.

Modeling approach/technique(s)Anodot analyzes business metrics in real time and at scale by running its MLalgorithms on the live data stream itself, without reading or writing into adatabase. Every data point that flows into Anodot from all data sources iscorrelated with the relevant metric’s existing normal model, and either is

https://www.anodot.com/

flagged as an anomaly or serves to update the normal model. Anodot’sphilosophy is that no single model can be used to cover all metrics. Toallocate the optimal model for each metric, they’ve created a library ofmodel types for different signal types (metrics that are stationary or non-stationary, multimodal or discrete, irregularly sampled, sparse, stepwise,etc.). Each new metric goes through a classification phase and is matchedwith the optimal model. The model then learns “normal behavior” for thatmetric, which is a prerequisite to identifying anomalous behavior. Toaccommodate this kind of learning in real time at scale, Anodot usessequential adaptive learning algorithms which initialize a model of what isnormal on the fly, and then compute the relation of each new data pointgoing forward.

Point anomalies or intervalsInstead of flagging individual data points as anomalies, Anodot highlightsintervals. Points within the entire duration of the interval are consideredanomalous, preventing redundant alerts.

ThresholdingWhile users can specify static thresholds which trigger alerts, Anodot alsoprovides automatic defaults where no thresholding input from the user isrequired.

Root cause investigationAnodot helps users investigate why an alert was triggered. It tries tounderstand how different active anomalies correlate in order to expediteroot cause investigation and shorten the time to resolution, groupingtogether different anomalies (or “incidents”) that tell the story of thephenomena. These might be multiple anomalous values in the samedimension (e.g., a drop in traffic from sources A and B, but not C, D, or E), orcorrelations between different anomalous KPIs, such as visits, conversions,orders, revenue, or error rate. Each incident has an Anomap, a graphicdistribution of the dimensions most impacted. This is essentially a heat mapthat makes it easier to understand the whole picture.

User interfaceThe Anodot interface enables users to visualize and explore alerts. With thereceipt of each alert, the user is prompted to assign it a binary score (goodcatch/bad catch). This input is fed back into the learning model to furthertune it by providing real-life indications about the validity of its

performance. By training the algorithms with direct feedback on anomalies,users can influence the system’s functionality and results.

DeliveryNotifications can be forwarded to each user through their choice ofchannel(s). Anodot notification integrations include an API, Slack, email,PagerDuty, Jira, Microsoft Teams OpsGenie, and more.

Amazon QuickSightCapabilitiesAmazon QuickSight is a cloud-native business intelligence (BI) service thatallows its users to create dashboards and visualizations to communicatebusiness insights. In early 2019, Amazon’s machine learning capability wasintegrated with QuickSight to provide anomaly detection, forecasting, andauto-narrative capabilities as part of the BI tool. Its licensed software andpricing is usage-based; you only pay for active usage, regardless of thenumber of users. That said, the pricing model could end up beingexpensive, since anomaly detection tasks are compute-intensive.

Data requirementsQuickSight requires you to connect or import structured data directly querya SQL-compatible source, or ingest the data into SPICE. There is arequirement on the number of historical data points that must be provided,which varies based on the task (analyzing anomalies or forecasting). Thereare also restrictions on the number of category dimensions that can beincluded (for example: product category, region, segment).

Modeling approach/technique(s)QuickSight provides a point-and-click solution for learning about anomalousbehavior and generating forecasts. It utilizes a built-in version of theRandom Cut Forest (RCF) online algorithm, which not only can be noisy butalso can lead to large amounts of false positive alerts. On the plus side, itprovides a customizable narrative feature that explains key takeaways fromthe insights generated. For instance, it can provide a summary of howrevenue compares to a previous period or a 30-day average, and/orhighlight the event in case of an anomaly.

Point anomalies or intervalsAnomalous events are presented discretely, on a point-by-point basis. If an

https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection-function.html

anomaly lasts more than a single time unit the system will flag severalevents, which could be noisy and redundant.

ThresholdingAnomaly detection with QuickSight employs a thresholding approach totrigger anomalous events. The user provides a threshold value (low,medium, high, very high) that determines how sensitive the detector is toanomalies: expect to see more anomalies when the setting is low, andfewer anomalies when it’s high. The sensitivity is determined based onstandard deviations of the anomaly score generated by the RCF algorithm.This approach can be tedious, especially when there are multiple timeseries being analyzed across various data hierarchy combinations, andintroduces the need for manual intervention.

Root cause investigationUsers can interactively explore anomalies on the QuickSight dashboard orreport to help understand the root causes. The tool performs a contributionanalysis which highlights the factors that significantly contributed to ananomaly. If there are dimensions in the data that are not being used in theanomaly detection, it’s possible to add up to four of them for thecontribution analysis task. In addition, QuickSight supports interactive“what-if” queries. In these, some of the forecasts can be altered andtreated as hypotheticals to provide conditional forecasts.

User interfaceQuickSight provides a basic reporting interface. From a UI perspective, it isfairly unexceptional. For instance, it lacks a way to understand the overallpicture with anomalous points (do the anomalies have some commoncontributing factors?). Furthermore, the forecasted values do not haveconfidence intervals associated with them, which would help the end uservisually understand the magnitude of an anomaly. As it stands, there is nobasis for comparison.

DeliveryQuickSight dashboards and reports can be embedded within applications,shared among users, and/or sent via email, as long as the recipients have aQuickSight subscription.

Outlier.ai

CapabilitiesOutlier.ai is licensed software that uses artificial intelligence to automatethe process of business analytics. It can connect to databases provided bycloud services and automatically provides insights to your inbox, withoutthe need to create reports or write queries. Outlier works with customersacross industry segments, applying ML to automatically serve up business-critical insights.

Data requirementsOutlier can connect to databases provided by cloud services.

Point anomalies or intervalsAnomalous events are presented discretely, on a point-by-point basis.

Root cause investigationOutlier allows customers to not only surface key insights about businesschanges automatically, but also identify the likely root causes of thosechanges; this feature guides teams in making quick, informed businessdecisions. Teams can easily share stories through PDFs, PowerPoint-optimized images, or auto-generated emails, annotated with theircomments.

User interfaceThe UI is similar to that provided by most standard BI tools, making it fairlyuser-friendly.

DeliveryThe generated dashboards can be embedded within applications, sharedamong users, and/or sent via email.

Vectra CognitoCapabilitiesIn simple terms, Vectra.ai’s flagship platform, Cognito, can be described asan intrusion detection system. It’s a cloud-based network detection andresponse system that performs a number of cybersecurity-related tasks,including network traffic monitoring and anomaly detection. For this lattertask, metadata collected from captured packets (rather than via deep packetinspection) is analyzed using a range of behavioral detection algorithms.This provides insights about outlier characteristics that can be applied in a

https://outlier.ai/

https://www.vectra.ai/

wide range of cybersecurity detection and response use cases. Cognitoworks with both encrypted and unencrypted traffic.

Data requirementsCognito can connect to databases provided by cloud services. It also usesmetadata drawn from Active Directory and DHCP logs.

Modeling approach/technique(s)According to a whitepaper published by Vectra, a mix of ML approaches areused to deliver the cybersecurity features the platform supports, in bothglobal and local (network) contexts. For example, supervised learningtechniques such as random forests can help to address cyberthreatsassociated with suspicious HTTP traffic patterns. Drawing on large-scaleanalysis of many types of malicious traffic and content as well as domainexpertise, the random forest technique can be used to identify patterns ofcommand-and-control behavior that don’t exist in benign HTTP traffic.Cognito also uses unsupervised ML techniques such as k-means clusteringto identify valid credentials that have potentially been stolen from acompromised host. This type of theft is the basis of cyberattacks such aspass-the-hash and golden ticket. Elsewhere, deep learning has proveneffective in detecting suspicious domain activity; specifically, the detectionof algorithmically generated domains that are set up by cyberattackers asthe frontend of their command-and-control infrastructure.

Point anomalies or intervalsAnomalous events are presented discretely, on a point-by-point basis.

ThresholdingThe scoring of compromised hosts by the Vectra Threat Certainty Indexallows security teams to define threshold levels based on a combination offactors.

Root cause investigationAll detection events are correlated to specific hosts that show signs ofthreat behaviors. In turn, all context is assimilated into an up-to-the-moment score of the overall risk to the organization.

User interfaceVectra’s Cognito platform delivers detection information via a simpledashboard which displays information such as a prioritized (in terms of risk)

https://content.vectra.ai/rs/748-MCE-447/images/WhitePaper_2019_The_data_science_behind_Cognito_AI_threat_detection_models_English.pdf?mkt_tok=eyJpIjoiWkRGaVpHVmtaVGxrTkdFeiIsInQiOiJ2RVhkK3M0cHU3dXNQRDZ2YnA3QW16K0ZKVFVEK1lDeFRwcTZPMGxXZlB0clhOYmhPaVBXenkzRmY1Ylwvakp5d2FcL1dSakVKbDZhcHZtNEdZU1A3aHFMYkpxVlZHWXllXC9xUGRPOXNtZ0NyTFRjTitxUlVkaXBzNFdiQlBaUUxwVSJ9

list of compromised hosts, changes in a host’s threat and certainty scores,and “key assets” that show signs of attack.

DeliveryThe platform supports information sharing by security teams on demand, oron a set schedule managed by its customizable reporting engine. Real-timenotifications about network hosts, with attack indicators that have beenidentified (with the highest degree of certainty) as posing the biggest risk,are also supported.

Yahoo’s Anomaly Detector: SherlockCapabilitiesSherlock is an open source anomaly detection service built on top of Druid(an open source, distributed data store). It leverages the Extensible GenericAnomaly Detection System (EGADS) Java library to detect anomalies in timeseries data. Sherlock is fast and scalable. It allows users to schedule jobson an hourly, daily, weekly, or monthly basis (although it also supports adhoc real-time anomaly detection requests). Anomaly reports can be viewedfrom Sherlock’s interface, or received via email.

Data requirementsSherlock accesses time series data via Druid JSON queries and uses a Redisbackend to store job metadata, the anomaly reports (and other information)it generates, as well as a persistent job queue. The anomaly reports can beaccessed via direct requests using the Sherlock client API or delivered viascheduled email alerts.

Modeling approach/technique(s)Sherlock takes a time series modeling-based approach to anomalydetection using three important modules from the EGADS library: TimeSeries Modeling, Anomaly Detection, and Alerting. The Time SeriesModeling module supports the use of historical data to learn trends andseasonality in the data using models such as ARIMA. The resulting valuesare applied to the models that comprise the Anomaly Detection module.These models support a number of detection scenarios that are relevant ina cybersecurity context (e.g., outlier detection and change point detection).The Alerting module uses the error metric produced by the anomalydetection models and outputs candidate anomalies based on dynamicallylearned thresholds, learning to filter out irrelevant anomalies over time.

https://github.com/yahoo/sherlock

http://druid.io/

https://github.com/yahoo/egads

Point anomalies or intervalsAnomalous events are presented both discretely, on a point-by-point basis,and as intervals.

ThresholdThresholds are learned dynamically. No thresholding input from the user isrequired/supported.

Root cause analysisOut of the box root cause analysis is not supported.

User interfaceSherlock’s user interface is built with Spark Java, a UI framework for buildingweb applications. The UI enables users to submit instant anomaly analyses,create and launch detection jobs, and view anomalies on both a heat mapand a graph.

DeliveryScheduled anomaly requests are delivered via email or directly via API-based queries.

http://sparkjava.com/

Machine learning models that learn to solve tasks independently from data aresusceptible to biases and other issues that may raise ethical concerns. Anomalydetection models are associated with specific risks and mitigation tactics. Inanomaly detection, when the definition of “normal” is independently learnedand applied without controls, it may inadvertently reflect societal biases thatcan lead to harm. This presents risks to human welfare in scenarios whereanomaly detection applications intersect with human behavior. It should gowithout saying that in relation to humans, one must resist the assumption thatdifferent is bad.

Diversity Matters

EthicsCHAPTER 6

Different is not necessarily bad.

Although anomalies are often scrutinized, their absence in the data may beeven more problematic. Consider a model that is trained on X-ray images ofnormal luggage contents, but where the training data includes only images ofluggage packed by citizens of North America. This could lead to unusually highstop and search rates for users from other parts of the world, where the itemspacked might differ greatly.

To limit the potential harm of a machine learning model’s tendency to assumethat “different is bad,” one can use a larger (or more varied) dataset and have ahuman review the model’s output (both measures reduce the likelihood oferrors). In the luggage example, this translates to using more varied X-ray imagedata to expand the machine’s view of what is normal, and using human review toensure that positive predictions aren’t false positives.

ExplainabilityIn many anomaly detection applications, the system presents anomalousinstances to the end user, who provides a verdict (label) that can then be fedback to the model to refine it further. Unfortunately, some applications may notprovide enough explanation about why an instance was considered anomalous,leaving the end user with no particular guidance on where to begininvestigation. Blindly relying on such applications may cause or, in certain cases,exacerbate bias.

Additional Use CasesEthical considerations are important in every use of machine learning,particularly when the use case affects humans. In this section, we consider afew use cases that highlight ethical concerns with regard to anomaly detection.

Data PrivacyProtecting individuals’ data has been a growing concern in the last few years,culminating in recent data privacy laws like the EU’s General Data ProtectionRegulation (GDPR) and the California Consumer Privacy Act (CCPA). These lawsdon’t just penalize data breaches, but also guide and limit how an individual’spersonal data can be used and processed. Thus, when anomaly detectionmethods are used on protected data, privacy is a top concern.

To give an example, in Chapter 3. Deep Learning for Anomaly Detection wediscussed the autoencoder, a type of neural network that has been widely usedfor anomaly detection. As we saw, autoencoders have two parts: an encodernetwork that reduces the dimensions of the input data, and a decoder networkthat aims to reconstruct the input. Their learning goal is to minimize thereconstruction error, which is consequently the loss function. Because thedimensionality reduction brings information loss, and the learning goalencourages preservation of the information that is common to most trainingsamples, anomalies that contain rare information can be identified by measuringmodel loss.

In certain use cases, these identified anomalies could correspond toindividuals. Improper disclosure of such data can have adverse consequencesfor a data subject’s privacy, or even lead to civil liability or bodily harm. One way

to minimize these effects is to use a technique called differential privacy[10] onthe data before it is fed into an anomaly detection system. This techniqueessentially adds a small amount of noise to the data, in order to mask individualidentities while maintaining the accuracy of aggregated statistics. When coupledwith an anomaly detection system, differential privacy has been shown to

reduce the rate of false positives,[11] thus protecting the privacy of moreindividuals who might otherwise have been singled out and scrutinized.

Health Care DiagnosticsAnomaly detection can be applied in health care scenarios to provide quick andearly warnings for medical conditions. For example, a model can surface chestX-rays that substantially differ from normal chest X-rays, or highlight images oftissue samples that contain abnormalities. These analyses can havetremendous consequences for the patient: in addition to the positiveoutcomes, a false negative might mean an undetected disease, and a falsepositive could lead to unnecessary (and potentially painful or even harmful)treatment.

For other machine learning tasks, such as churn prediction or resumé review,analysts strive to remove racial or socioeconomic factors from the equation. Butin health care diagnostics, it may be both appropriate and advantageous toconsider them.

To the extent that an anomaly to be detected is connected with a certain groupof people, models can be tailored to that group. For example, sickle-cell disease

is more prevalent in parts of Africa and Asia than in Europe and the Americas. Adiagnostic system for detecting this disease should include enough samples ofAsian and African patients and acknowledgment of their ethnicity to ensure thedisease is identified.

In any event, it is important to ensure that these systems remain curated (i.e.,that a medical professional verifies the results) and that the models arecorrectly evaluated (precision, recall) before being put into production.

SecuritySimilar to the luggage example mentioned at the beginning of the chapter, homeor business security systems trained to identify anomalies present a problem ofthe “different is bad” variety. These systems need to have enough data; interms of quantity and variability; to prevent bias that would make them morelikely to identify people of different races, body types, or socioeconomic statusas anomalies based on, for example, their skin color, size, or clothing.

Content ModerationBlind reliance on anomaly detection systems for content moderation can lead tofalse positives that limit or silence system users based on the language or typesof devices they use. Content moderation software should be monitored forpatterns of inappropriate blocking or reporting, and should have a userappeal/review process.

Financial ServicesDetermining what harmful anomalies are in financial services applications iscomplex. On the one hand, fraudsters are actively trying to steal and laundermoney, and move it around the world. On the other hand, the vast majority oftransactions are legitimate services for valuable customers. Fraudsters emulatenormal business, making anomalies especially difficult to identify. As a result,financial services organizations should consider first whether anomalydetection is desirable in their use cases, and then consider the potential ethicaland practical risks.

For example, banks use customer data to offer mortgages and student loans. Alack of diverse data could lead to unfavorable results for certain demographicgroups. Such biased algorithms can result in costly mistakes, reduce customer

satisfaction, and damage a brand’s reputation. To combat this, one should posequestions (like the following) that can help check for bias in the data or model:

Fact check: do the detected anomalies mostly belong to underrepresentedgroups?Are there enough features that explain minority groups?Has model performance been evaluated for each subgroup within the data?

InnovationAs we have seen, the term anomaly can carry negative connotations. Anomalydetection by its very nature involves identifying samples that are different fromthe bulk of the other samples - but, as discussed here, assuming that differentis bad may not be fair in many cases. Instead of associating faultyinterpretations with anomalies, it may be helpful to investigate them to revealnew truths.

After all, progress in science is often triggered by anomalous activities that leadto innovation!

Anomaly detection is a classic problem, common to many business domains. Inthis report, we have explored how a set of deep learning approaches can beapplied to this task in a semi-supervised setting. We think this focus is usefulfor the following reasons:

While deep learning has demonstrated superior performance for manybusiness tasks, there is fairly limited information on how anomaly detectioncan be cast as a deep learning problem and how deep models perform.A semi-supervised approach is desirable in handling previously unseen,unknown anomalies and does not incur vast data labeling costs forbusinesses.Traditional machine learning approaches are suboptimal for handling high-dimensional, complex data and for modeling interactions between eachvariable.

In Chapter 4. Prototype, we show how deep learning models can achievecompetitive performance on a multivariate tabular dataset (for the task ofnetwork intrusion detection). Our findings are in concurrence with results fromexisting research that show the superior performance of deep learning modelsfor high-dimensional data, such as images.

But while deep learning approaches are useful, there are a few challenges thatmay limit their deployment in production settings. Problems to considerinclude:

LatencyCompared to linear models (AR, ARMA, etc.) or shallow machine learningmodels such as OCSVMs, deep learning models can have significant latencyassociated with inference. This makes it expensive to apply them in streamingdata use cases at scale (high volume, high velocity). For example, ourexperiments show that inference with an OCSVM is 10x faster than with anautoencoder accelerated on a GPU.

ConclusionCHAPTER 7

Data requirementsDeep learning models typically require a large dataset (tens of thousands ofsamples) for effective training. The models are also prone to overfitting andneed to be carefully evaluated to address this risk. Anomaly detection usecases frequently have relatively few data points; for example, daily sales datafor two years will generate 712 samples, which may be insufficient to train amodel. In such scenarios, linear models designed to work with smaller datasetsare a better option.

Managing distribution shiftIn many scenarios (including, but certainly not limited to, anomaly detection),the underlying process generating data may legitimately change such that a datapoint that was previously anomalous becomes normal. The changes could begradual, cyclical, or even abrupt in nature. This phenomenon is known asconcept drift. With respect to anomaly detection, one way to handle it is tofrequently retrain the model as new data arrives, or to trigger retraining whenconcept drift is detected. It can be challenging to maintain this continuouslearning approach for deep models, as training time can be significant.

Going forward, we expect the general approaches discussed in this report tocontinue evolving and maturing. As examples, see recent extensions to theencoder-decoder model approach that are based on Gaussian mixture models,[12] LSTMs,[13], convolutional neural networks,[14] and GANs.[15] As witheverything else in machine learning, there is no “one size fits all”; no one modelworks best for every problem. The right approach always depends on the usecase and the data.

1. Lukas Ruff et al. (2019), “Deep Semi-Supervised Anomaly Detection” ,arXiv:1906.02694. ↩

2. Varun Chandola et al. (2009), “Anomaly Detection, A Survey” , ACMComputing Surveys 41(3),https://dl.acm.org/doi/10.1145/1541880.1541882. ↩

3. Goodfellow, Ian, et al. (2014) “Generative adversarial nets.” Advances inneural information processing systems. https://arxiv.org/abs/1406.2661 ↩

4. Jeff Donahue et al. (2016), “Adversarial Feature Learning”https://arxiv.org/abs/1605.09782 ↩


https://dl.acm.org/doi/10.1145/1541880.1541882



5. Samet AkCay et al. (2018), “GANomaly: Semi-Supervised AnomalyDetection via Adversarial Training” https://arxiv.org/abs/1805.06725. ↩

6. Di Mattia, F., Galeone, P., De Simoni, M. and Ghelfi, E., (2019). “A survey ongans for anomaly detection”. arXiv preprint arXiv:1906.11632. ↩

7. Mihaela Rosca (2018), Issues with VAEs and GANs, CVPR18 ↩

8. Raghavendra Chalapathy et al. (2019) “Deep Learning for AnomalyDetection: A Survey” https://arxiv.org/abs/1901.03407 ↩

9. Tim Salimans et al. (2016) “Improved Techniques for Training GANs”,Neurips 2016 https://arxiv.org/abs/1606.03498 ↩

10. Cynthia Dwork (2006), “Differential Privacy”, Proceedings of the 33rdInternational Conference on Automata, Languages and Programming Part II:1-12, https://doi.org/10.1007/11787006_1. ↩

11. Min Du et al. (2019), “Robust Anomaly Detection and Backdoor AttackDetection Via Differential Privacy”, arXiv:1911.07116. ↩

12. Bo Zong et al. (2018), “Deep Autoencoding Gaussian Mixture Model forUnsupervised Anomaly Detection” ICLR 2018,https://openreview.net/forum?id=BJJLHbb0-. ↩

13. Run-Qing Chen et al. (2019), “Sequential VAE-LSTM for Anomaly Detectionon Time Series” , arXiv:1910.03818. ↩

14. Hansheng Ren et al. (2019), “Time-Series Anomaly Detection Service atMicrosoft” , arXiv:1906.03821. ↩

15. Dan Li et al. (2019), “MAD-GAN: Multivariate Anomaly Detection for TimeSeries Data with Generative Adversarial Networks” arXiv:1901.04997. ↩


http://efrosgans.eecs.berkeley.edu/CVPR18_slides/VAE_GANS_by_Rosca.pdf



https://doi.org/10.1007/11787006_1


https://openreview.net/forum?id=BJJLHbb0-




Anomaly Detection Deep Learning for...Anomaly detection approaches can be categorized in terms of...

Documents

Transcript of Anomaly Detection Deep Learning for...Anomaly detection approaches can be categorized in terms of...