
Adversarial Examples: Attacks and Defenses for Deep Learning

Xiaoyong Yuan, Pan He, Qile Zhu, Xiaolin Li∗

National Science Foundation Center for Big Learning, University of Florida
{chbrian, pan.he, valder}@ufl.edu, [email protected]

Abstract—With rapid progress and significant successes in a wide spectrum of applications, deep learning is being applied in many safety-critical environments. However, deep neural networks have recently been found vulnerable to well-designed input samples, called adversarial examples. Adversarial perturbations are imperceptible to humans but can easily fool deep neural networks at the testing/deployment stage. This vulnerability to adversarial examples has become one of the major risks for applying deep neural networks in safety-critical environments. Therefore, attacks on and defenses against adversarial examples have drawn great attention. In this paper, we review recent findings on adversarial examples for deep neural networks, summarize the methods for generating adversarial examples, and propose a taxonomy of these methods. Under the taxonomy, applications of adversarial examples are investigated. We further elaborate on countermeasures for adversarial examples. In addition, three major challenges posed by adversarial examples and the potential solutions are discussed.

Index Terms—deep neural network, deep learning, security, adversarial examples

I. INTRODUCTION

Deep learning (DL) has made significant progress in a wide range of machine learning (ML) tasks: image classification, object recognition [1], [2], object detection [3], [4], speech recognition [5], language translation [6], and voice synthesis [7]. Many deep learning empowered applications are life-critical, raising great concerns about safety and security. Despite great successes in numerous applications, recent studies find that deep learning is vulnerable to well-designed input samples. These samples can easily fool a well-performing deep learning model with perturbations imperceptible to humans.

Szegedy et al. first generated small perturbations on images for the image classification problem and fooled state-of-the-art deep neural networks with high probability [8]. These misclassified samples were named adversarial examples.

Many deep learning based applications have been used or are planned to be deployed in the physical world, especially in safety-critical environments. Meanwhile, recent studies show that adversarial examples can be applied in the real world. For instance, an adversary can construct physical adversarial examples that confuse autonomous vehicles by manipulating a stop sign in a traffic sign recognition system [9], [10] or by removing pedestrians from the segmentation output of an object recognition system [11]. Attackers can also generate adversarial commands against automatic speech recognition (ASR) models and Voice Controllable Systems (VCS) [12], [13] such as Apple Siri [14], Amazon Alexa [15], and Microsoft Cortana [16].

This work is partially supported by the National Science Foundation (grants CNS-1842407, CNS-1747783, CNS-1624782, and OAC-1229576). ∗ Corresponding author.

Deep learning is widely regarded as a “black box” technique — we all know that it performs well, but we have limited knowledge of why [17], [18]. Many studies have been proposed to explain and interpret deep neural networks [19]–[22]. By inspecting adversarial examples, we may gain insights into the semantic inner levels of neural networks [23] and find problematic decision boundaries, which in turn helps to increase the robustness and performance of neural networks [24] and improve their interpretability [25].

In this paper, we investigate and summarize approaches for generating adversarial examples, applications of adversarial examples, and corresponding countermeasures. We explore the characteristics and possible causes of adversarial examples. Recent advances in deep learning revolve around supervised learning, especially in the field of computer vision. Therefore, most adversarial examples are generated against computer vision models. We mainly discuss adversarial examples for image classification/object recognition tasks in this paper. Adversarial examples for other tasks are investigated in Section V.

Inspired by [26], we define the Threat Model as follows:

• The adversaries can attack only at the testing/deployment stage. They can tamper only with the input data after the victim deep learning model is trained. Neither the trained model nor the training dataset can be modified. The adversaries may have knowledge of the trained model (architecture and parameters) but are not allowed to modify it, which is a common assumption for many online machine learning services. Attacking at the training stage (e.g., training data poisoning) is another interesting topic and has been studied in [27]–[32]. Due to space limitations, we do not include this topic in this paper.

• We focus on attacks against models built with deep neural networks, due to their outstanding performance. We discuss adversarial examples raised in conventional machine learning (e.g., SVM, Random Forest) in Section II. Adversarial examples generated against deep neural networks have also proved effective against conventional machine learning models [33] (see Section VII-A).

Digital Object Identifier: 10.1109/TNNLS.2018.2886017

2162-2388 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


TABLE I: Notation and symbols used in this paper

Notation/Symbol   Description
x                 original (clean, unmodified) input data
l                 label of class in the classification problem; l = 1, 2, ..., m, where m is the number of classes
x′                adversarial example (modified input data)
l′                label of the adversarial class in targeted adversarial examples
f(·)              deep learning model (for the image classification task, f ∈ F: R^n → l)
θ                 parameters of deep learning model f
J_f(·, ·)         loss function (e.g., cross-entropy) of model f
η                 difference between original and modified input data: η = x′ − x (the exact same size as the input data)
‖·‖_p             ℓp norm
∇                 gradient
H(·)              Hessian, the matrix of second-order derivatives
KL(·)             Kullback–Leibler (KL) divergence function

• Adversaries only aim at compromising integrity. Integrity is reflected in performance metrics (e.g., accuracy, F1 score, AUC), which are essential to a deep learning model. Although other security issues pertaining to confidentiality and privacy have drawn attention in deep learning [34]–[36], we focus on attacks that degrade the performance of deep learning models and cause an increase in false positives and false negatives.

• The rest of the threat model differs among adversarial attacks. We categorize them in Section III.

Notations and symbols used in this paper are listed in Table I.

This paper presents the following contributions:

• To systematically analyze approaches for generating adversarial examples, we taxonomize attack approaches along different axes to provide an accessible and intuitive overview of these approaches.

• We investigate recent approaches and their variants for generating adversarial examples and compare them using the proposed taxonomy. We show examples of selected applications from the fields of reinforcement learning, generative modeling, face recognition, object detection, semantic segmentation, natural language processing, and malware detection. Countermeasures for adversarial examples are also discussed.

• We outline the main challenges and potential future research directions for adversarial examples based on three main problems: the transferability of adversarial examples, the existence of adversarial examples, and the robustness evaluation of deep neural networks.

The remainder of this paper is organized as follows. Section II introduces the background of deep learning techniques, models, and datasets, and discusses adversarial examples raised in conventional machine learning. We propose a taxonomy of approaches for generating adversarial examples in Section III and elaborate on these approaches in Section IV. In Section V, we discuss applications of adversarial examples. Corresponding countermeasures are investigated in Section VI. We discuss current challenges and potential solutions in Section VII. Section VIII concludes the work.

II. BACKGROUND

In this section, we briefly introduce basic deep learning techniques and approaches related to adversarial examples. We then review adversarial examples in the era of conventional ML and compare the differences between adversarial examples in conventional ML and in DL.

A. Brief Introduction to Deep Learning

This subsection discusses the main concepts, existing techniques, popular architectures, and standard datasets in deep learning, which, due to their extensive use and breakthrough successes, have become acknowledged targets of attacks on which adversaries usually evaluate their attack methods.

1) Main concepts in deep learning: Deep learning is a type of machine learning that enables computers to learn from experience and knowledge without explicit programming and to extract useful patterns from raw data. For conventional machine learning algorithms, it is difficult to extract well-represented features due to limitations such as the curse of dimensionality [37], computational bottlenecks [38], and the requirement for domain and expert knowledge. Deep learning solves the problem of representation by building multiple simple features to represent a sophisticated concept. For example, a deep learning-based image classification system represents an object by describing edges, fabrics, and structures in its hidden layers. With the increasing amount of available training data, deep learning becomes more powerful. Deep learning models have solved many complicated problems with the help of hardware acceleration.

A neural network layer is composed of a set of perceptrons (artificial neurons). Each perceptron maps a set of inputs to output values with an activation function. The function of a neural network is formed as a chain:

f(x) = f^(k)(· · · f^(2)(f^(1)(x))),        (1)

where f^(i) is the function of the i-th layer of the network, i = 1, 2, · · · , k.
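As a concrete illustration of Eq. (1), the following minimal sketch composes a few layer functions in PyTorch; the layer types and sizes are illustrative placeholders, not taken from the paper.

import torch
import torch.nn as nn

# A network as a chain of layer functions, f(x) = f3(f2(f1(x))).
f1 = nn.Linear(784, 256)   # first layer: affine map on a flattened 28x28 input
f2 = nn.ReLU()             # second layer: element-wise activation
f3 = nn.Linear(256, 10)    # final layer: class scores (logits)

def f(x):
    return f3(f2(f1(x)))

x = torch.randn(1, 784)    # one random input sample
logits = f(x)              # shape: (1, 10)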

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the two most widely used neural networks in recent architectures. CNNs deploy convolution operations on hidden layers to share weights and reduce the number of parameters. CNNs can extract local information from grid-like input data and have shown incredible success in computer vision tasks such as image classification [1], [39], object detection [4], [40], text recognition [41], [42], and semantic segmentation [43], [44]. RNNs are neural networks for processing sequential input data of variable length. RNNs produce an output at each time step, and the hidden neurons at each time step are calculated from the current input data and the hidden neurons of the previous time step. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, with controllable gates, are designed to avoid the vanishing/exploding gradients of RNNs on long-term dependencies.


2) Architectures of deep neural networks: Several deep learning architectures are widely used in computer vision tasks: LeNet [45], VGG [2], AlexNet [1], GoogLeNet [46]–[48] (Inception V1–V4), and ResNet [39], ranging from the simplest (oldest) network to the deepest and most complex (newest) one. AlexNet first showed that deep learning models could largely surpass conventional machine learning algorithms in the ImageNet 2012 challenge and led to further study of deep learning. These architectures made tremendous breakthroughs in the ImageNet challenge and can be seen as milestones in the image classification problem. Attackers usually generate adversarial examples against these baseline architectures.

3) Standard deep learning datasets: MNIST, CIFAR-10, and ImageNet are three widely used datasets in computer vision tasks. The MNIST dataset is for handwritten digit recognition [49]. The CIFAR-10 and ImageNet datasets are for image recognition tasks [50]. CIFAR-10 consists of 60,000 tiny color images (32 × 32) in ten classes. The ImageNet dataset consists of 14,197,122 images in 1,000 classes [51]. Because of the large number of images in the ImageNet dataset, most adversarial approaches are evaluated on only part of it. The Street View House Numbers (SVHN) dataset, similar to the MNIST dataset, consists of ten digits obtained from real-world house numbers in Google Street View images. The YoutubeDataset, gathered from Youtube, consists of about ten million images [45] and was used in [8].

B. Adversarial Examples and Countermeasures in Machine Learning

Adversarial examples in conventional machine learning models have been discussed for decades. Machine learning-based systems with handcrafted features are the primary targets, such as spam filters, intrusion detection, biometric authentication, and fraud detection [52]. For example, spam emails are often modified by adding characters to avoid detection [53]–[55].

Dalvi et al. first discussed adversarial examples and formulated the problem as a game between the adversary and the classifier (Naïve Bayes), both of which are sensitive to cost [53]. The attack and defense on adversarial examples became an iterative game. Biggio et al. first tried a gradient-based approach to generate adversarial examples against a linear classifier, a support vector machine (SVM), and a neural network [56]. Compared with deep learning adversarial examples, their methods allow more freedom to modify the data. The MNIST dataset was first evaluated under their attack, although a human could easily distinguish the adversarial digit images. Biggio et al. also reviewed several proactive defenses and discussed reactive approaches to improve the security of machine learning models [28].

Barreno et al. presented an initial investigation of the security problems of machine learning [52], [57]. They categorized attacks against machine learning systems along three axes: 1) influence: whether attacks can poison the training data; 2) security violation: whether an adversarial example belongs to false positives or false negatives; 3) specificity: whether an attack is targeted at a particular instance or a wide class. We discuss these axes for the deep learning area in Section III. Barreno et al. compared attacks against the SpamBayes spam filter and corresponding defenses as a case study. However, they mainly focused on binary classification problems such as virus detection systems, intrusion detection systems (IDS), and intrusion prevention systems (IPS).

Adversarial examples in conventional machine learning require knowledge of feature extraction, while deep learning usually needs only raw data input. In conventional ML, both attacking and defending methods paid great attention to features, and even to the preceding step (data collection), while giving less attention to the impact of humans; the target then becomes a fully automatic machine learning system. Inspired by these studies on conventional ML, in this paper we review recent security issues in the deep learning area.

[26] provided a comprehensive overview of security issues in machine learning and recent findings in deep learning. [26] established a unifying threat model and introduced a “no free lunch” theorem: the tradeoff between accuracy and robustness.

Compared to their work, our paper focuses on adversarial examples in deep learning and provides a detailed discussion of recent studies and findings.

For example, adversarial examples in an image classification task can be described as follows: using a trained image classifier published by a third party, a user inputs one image to get the prediction of its class label. An adversarial image is an original clean image with a small perturbation, often barely recognizable by humans. However, such a perturbation misguides the image classifier, and the user receives an incorrect class label. Given a trained deep learning model f and an original input data sample x, generating an adversarial example x′ can generally be described as a box-constrained optimization problem:

min_{x′}  ‖x′ − x‖
s.t.  f(x′) = l′,
      f(x) = l,
      l ≠ l′,
      x′ ∈ [0, 1],        (2)

where l and l′ denote the output labels of x and x′, and ‖·‖ denotes the distance between two data samples. Let η = x′ − x be the perturbation added to x. This optimization problem minimizes the perturbation while misclassifying the prediction, subject to a constraint on the input data. In the rest of the paper, we will discuss variants of generating adversarial images as well as adversarial examples in other tasks.

III. TAXONOMY OF ADVERSARIAL EXAMPLES

To systematically analyze approaches for generating adversarial examples (see details in Section IV), we categorize them along three dimensions: threat model, perturbation, and benchmark.

A. Threat Model

We discussed the threat model in Section I. Based on different scenarios, assumptions, and quality requirements, adversaries decide the attributes they need in adversarial examples and then deploy specific attack approaches. We further decompose the threat model into four aspects: adversarial falsification, adversary's knowledge, adversarial specificity, and attack frequency. For example, if an adversarial example is required to be generated in real time, adversaries should choose a one-time attack instead of an iterative attack in order to complete the task (see Attack Frequency).

• Adversarial Falsification

– False positive attacks generate a negative sample that is misclassified as a positive one (Type I error). In a malware detection task, a benign piece of software classified as malware is a false positive. In an image classification task, a false positive can be an adversarial image unrecognizable to humans that deep neural networks nevertheless predict as some class with a high confidence score. Figure 2 illustrates a false positive example in image classification.

– False negative attacks generate a positive sample that is misclassified as a negative one (Type II error). In a malware detection task, a false negative is the case where a malware sample (usually considered positive) cannot be identified by the trained model. False negative attacks are also called machine learning evasion. This error appears in most adversarial images, where a human can recognize the image but the neural network cannot identify it correctly.

• Adversary’s Knowledge

– White-box attacks assume the adversary knows everything about the trained neural network model, including the training data, model architecture, hyper-parameters, number of layers, activation functions, and model weights. Many adversarial examples are generated by calculating model gradients. Since deep neural networks tend to require only raw input data without handcrafted features and to deploy end-to-end structures, feature selection is not necessary, in contrast to adversarial examples in conventional machine learning.

– Black-box attacks assume the adversary has no access to the trained neural network model. The adversary, acting as a standard user, only knows the output of the model (label or confidence score). This assumption is common for attacking online machine learning services (e.g., Machine Learning on AWS1, Google Cloud AI2, BigML3, Clarifai4, Microsoft Azure5, IBM Bluemix6, Face++7). Most adversarial example attacks are white-box attacks. However, they can be transferred to attack black-box services due to the transferability of adversarial examples proposed by Papernot et al. [33]. We elaborate on this in Section VII-A.

1 https://aws.amazon.com/machine-learning
2 https://cloud.google.com/products/machine-learning
3 https://bigml.com
4 https://www.clarifai.com
5 https://azure.microsoft.com/en-us/services/machine-learning
6 https://console.bluemix.net/catalog/services/machine-learning
7 https://www.faceplusplus.com

• Adversarial Specificity

– Targeted attacks misguide deep neural networks to a specific class. Targeted attacks usually occur in multi-class classification problems. For example, an adversary fools an image classifier into predicting all adversarial examples as one class. In a face recognition/biometric system, an adversary tries to disguise a face as an authorized user (impersonation) [58]. Targeted attacks usually maximize the probability of the targeted adversarial class.

– Non-targeted attacks do not assign a specific class to the neural network output. The adversarial output class can be arbitrary, as long as it is not the original one. For example, an adversary makes his/her face misidentified as an arbitrary face in a face recognition system to evade detection (dodging) [58]. Non-targeted attacks are easier to implement than targeted attacks since they have more options and more room to redirect the output. Non-targeted adversarial examples are usually generated in two ways: 1) running several targeted attacks and taking the result with the smallest perturbation; 2) minimizing the probability of the correct class. Some generation approaches (e.g., extended BIM, ZOO) can be applied to both targeted and non-targeted attacks. For binary classification, targeted attacks are equivalent to non-targeted attacks.

• Attack Frequency

– One-time attacks take only a single optimization step to generate the adversarial examples.

– Iterative attacks take multiple steps to update the adversarial examples. Compared with one-time attacks, iterative attacks usually produce better adversarial examples but require more interactions with the victim classifier (more queries) and more computational time. For some computationally intensive tasks (e.g., reinforcement learning), one-time attacks may be the only feasible choice.

B. Perturbation

Small perturbation is a fundamental premise of adversarial examples. Adversarial examples are designed to be close to the original samples and imperceptible to humans, which causes a performance degradation of deep learning models relative to human performance. We analyze three aspects of perturbation: perturbation scope, perturbation limitation, and perturbation measurement.

• Perturbation Scope

– Individual attacks generate different perturbations for each clean input.

– Universal attacks create a single universal perturbation for the whole dataset. This perturbation can be applied to all clean input data. Most current attacks generate adversarial examples individually. However, universal perturbations make it easier to deploy adversarial examples in the real world: adversaries do not need to change the perturbation when the input sample changes.

• Perturbation Limitation


– Optimized perturbation treats the perturbation as the objective of the optimization problem. These methods aim to minimize the perturbation so that humans cannot perceive it.

– Constraint perturbation treats the perturbation as a constraint of the optimization problem. These methods only require the perturbation to be small enough.

• Perturbation Measurement

– ℓp measures the magnitude of the perturbation by its p-norm distance:

‖x‖_p = ( Σ_{i=1}^{n} ‖x_i‖^p )^{1/p}.        (3)

ℓ0, ℓ2, and ℓ∞ are the three most commonly used ℓp metrics. ℓ0 counts the number of pixels changed in the adversarial example; ℓ2 measures the Euclidean distance between the adversarial example and the original sample; ℓ∞ denotes the maximum change over all pixels in the adversarial example (a short code sketch computing these norms follows this list).

– Psychometric perceptual adversarial similarity score (PASS) is a new metric introduced in [59], consistent with human perception.
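As a concrete illustration of the three ℓp measurements above, the following minimal NumPy sketch computes the ℓ0, ℓ2, and ℓ∞ norms of a perturbation η = x′ − x; the image shape and the single-pixel perturbation are illustrative, not from the paper.

import numpy as np

# Hypothetical clean image x and adversarial image x_adv in [0, 1], shape 32x32x3.
rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))
eta = np.zeros_like(x)
eta[5, 7, :] = 0.2                   # perturb a single pixel (all three channels)
x_adv = np.clip(x + eta, 0.0, 1.0)

diff = (x_adv - x).ravel()
l0 = np.count_nonzero(diff)          # number of changed components
l2 = np.linalg.norm(diff, ord=2)     # Euclidean distance
linf = np.max(np.abs(diff))          # largest single-component change
print(l0, l2, linf)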

C. Benchmark

Adversaries demonstrate the performance of their attacks on different datasets and against different victim models. This inconsistency makes it difficult to evaluate adversarial attacks and to measure the robustness of deep learning models. Large, high-quality datasets and complex, high-performance deep learning models usually make it harder for adversaries and defenders to attack and defend. The diversity of datasets and victim models also makes it hard for researchers to tell whether the existence of adversarial examples is due to the datasets or to the models. We discuss this problem in Section VII-C.

• Datasets: MNIST, CIFAR-10, and ImageNet are the three most widely used image classification datasets for evaluating adversarial attacks. Because MNIST and CIFAR-10 have proved easy to attack and defend due to their simplicity and small size, ImageNet is the best dataset for evaluating adversarial attacks so far. A well-designed dataset is needed to evaluate adversarial attacks.

• Victim Models: Adversaries usually attack several well-known deep learning models, such as LeNet, VGG, AlexNet, GoogLeNet, CaffeNet, and ResNet.

In the following sections, we will investigate recent studies on adversarial examples according to this taxonomy.

IV. METHODS FOR GENERATING ADVERSARIAL EXAMPLES

In this section, we illustrate several representative approaches for generating adversarial examples. Although many of these approaches have been defeated by countermeasures in later studies, we present them to show how adversarial attacks have improved and what state-of-the-art adversarial attacks can achieve. The existence of these methods also requires investigation, which may improve the robustness of deep neural networks.

Fig. 1: An adversarial image generated by the Fast Gradient Sign Method [60]. Left: a clean image of a panda; middle: the perturbation; right: an adversarial image, classified as a gibbon.

Table II summarizes the methods for generating adversarial examples discussed in this section, organized by the proposed taxonomy.

A. L-BFGS Attack

Szegedy et al. first introduced adversarial examples against deep neural networks in 2014 [8]. They generated adversarial examples using an L-BFGS method to solve the general targeted problem:

min_{x′}  c‖η‖ + J_θ(x′, l′)
s.t.  x′ ∈ [0, 1].        (4)

To find a suitable constant c, the L-BFGS Attack computes approximate adversarial examples by line-searching over c > 0. The authors showed that the generated adversarial examples can also be generalized to different models and different training datasets. They suggested that adversarial examples are never or rarely seen in the test datasets.

The L-BFGS Attack was also used in [71], which implemented a binary search to find the optimal c.

B. Fast Gradient Sign Method (FGSM)

The L-BFGS Attack used an expensive linear search to find the optimal value, which was time-consuming and impractical. Goodfellow et al. proposed a fast method, called the Fast Gradient Sign Method (FGSM), to generate adversarial examples [60]. They performed only a one-step gradient update along the direction of the sign of the gradient at each pixel. Their perturbation can be expressed as:

η = ε · sign(∇_x J_θ(x, l)),        (5)

where ε is the magnitude of the perturbation. The generated adversarial example x′ is calculated as x′ = x + η. This perturbation can be computed using back-propagation. Figure 1 shows an adversarial example on ImageNet.
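A minimal FGSM sketch in PyTorch, assuming a differentiable classifier model that returns logits and a cross-entropy loss; the model, labels, and ε are placeholders rather than the paper's exact setup.

import torch
import torch.nn.functional as F

def fgsm(model, x, label, eps):
    # One-step FGSM (Eq. 5): x' = x + eps * sign(grad_x J(x, l)).
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in the valid range

For the targeted variant described later (OTCM, Eq. 7), the step is taken in the negative gradient direction of the loss computed with the target label l′ instead of l.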

They argued that the linear part of high-dimensional deep neural networks cannot resist adversarial examples, although this linear behavior speeds up training. Regularization approaches such as dropout are used in deep neural networks; pre-training could not improve the robustness of networks.

[59] proposed a new method, called the Fast Gradient Value method, in which the sign of the gradient is replaced with the raw gradient: η = ∇_x J(θ, x, l). The Fast Gradient Value method has no constraint on each pixel and can generate images with a larger local difference.


TABLE II: Taxonomy of Adversarial Examples

Threat Model
  Adversarial Falsification
    False Negative: [8], [9], [59]–[73]
    False Positive: [74]
  Adversary's Knowledge
    White-Box: [8], [9], [59]–[63], [65], [67], [69]–[74]
    Black-Box: [64], [66], [68], [73]
  Adversarial Specificity
    Targeted: [8], [59], [61], [63], [64], [66], [67], [69]–[73]
    Non-Targeted: [9], [60], [62], [64]–[66], [68], [69], [73], [74]
  Attack Frequency
    One-time: [59], [60]
    Iterative: [8], [9], [61]–[74]

Perturbation
  Perturbation Scope
    Individual: [8], [9], [59]–[64], [66]–[71], [73], [74]
    Universal: [65]
  Perturbation Limitation
    Optimized: [8], [59], [61]–[65], [68], [70], [71]
    Constraint: [59], [66], [67], [69], [73]
    None: [9], [60], [72], [74]
  Perturbation Measurement
    Element-wise: [9], [60], [72]
    ℓp (p ≥ 0): ℓ0: [63], [66]; ℓ1: [70]; ℓ2: [8], [61]–[65], [67]–[69], [71], [73]; ℓ∞: [62], [63], [65], [70], [73]
    PASS: [59]
    None: [74]

Benchmark
  Datasets
    MNIST: [8], [59]–[63], [68], [70], [71], [74]
    CIFAR-10: [62]–[64], [66]
    ImageNet: [8], [9], [59], [60], [62]–[65], [67], [69], [71]–[74]
    Others: YoutubeDataset: [8]; LSUN, SNLI: [68]
  Victim Models
    LeNet: [59], [61], [62], [68], [74]
    VGG: [65]–[67], [69]
    AlexNet: [67], [74]
    QuocNet: [8]
    GoogLeNet: [9], [59], [60], [62]–[65], [67]–[69], [72], [73]
    CaffeNet: [62], [65], [67]
    ResNet: [59], [65], [69], [73]
    LSTM: [68]


According to [72], one-step attacks are easy to transfer but also easy to defend against (see Section VII-A). [73] applied momentum to FGSM to generate adversarial examples iteratively. The gradients are calculated by:

g_{t+1} = µ · g_t + ∇_x J_θ(x′_t, l) / ‖∇_x J_θ(x′_t, l)‖,        (6)

and the adversarial example is then derived by x′_{t+1} = x′_t + ε · sign(g_{t+1}). The authors increased the effectiveness of the attack by introducing momentum and improved its transferability by combining the one-step attack with an ensembling method.
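A sketch of the momentum iterative update (Eq. 6), reusing the hypothetical model from the FGSM sketch; the per-step size, number of iterations, and the ℓ1 gradient normalization are illustrative choices.

import torch
import torch.nn.functional as F

def momentum_fgsm(model, x, label, eps, steps=10, mu=1.0):
    # Iterative FGSM with a momentum term accumulated over normalized gradients.
    alpha = eps / steps
    g = torch.zeros_like(x)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)
        g = mu * g + grad / grad.abs().sum().clamp_min(1e-12)   # Eq. (6)
        x_adv = (x_adv.detach() + alpha * g.sign()).clamp(0.0, 1.0)
    return x_adv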

[72] extended FGSM to a targeted attack by maximizing the probability of the target class:

x′ = x − ε · sign(∇_x J(θ, x, l′)).        (7)

The authors refer to this attack as the One-step Target Class Method (OTCM).

[75] found that FGSM with adversarial training is more robust to white-box attacks than to black-box attacks due to gradient masking. They proposed a new attack, RAND-FGSM, which adds random noise before updating the adversarial examples in order to defeat adversarial training:

x_tmp = x + α · sign(N(0^d, I^d)),
x′ = x_tmp + (ε − α) · sign(∇_{x_tmp} J(x_tmp, l)),        (8)

where α and ε are parameters with α < ε.

C. Basic Iterative Method (BIM) and Iterative Least-Likely Class Method (ILLC)

Previous methods assume that adversarial data can be fed directly into deep neural networks. However, in many applications, data can only be passed through devices (e.g., cameras, sensors). Kurakin et al. applied adversarial examples to the physical world [9]. They extended the Fast Gradient Sign Method by running a finer optimization (smaller change) for multiple iterations. In each iteration, they clip pixel values to avoid a large change in any pixel:

Clip_{x,ξ}{x′} = min{255, x + ξ, max{0, x − ξ, x′}},        (9)

where Clip_{x,ξ}{x′} limits the change of the generated adversarial image in each iteration. The adversarial examples are generated over multiple iterations:

x_0 = x,
x_{n+1} = Clip_{x,ξ}{x_n + ε · sign(∇_x J(x_n, y))}.        (10)

The authors referred to this method as the Basic Iterative Method (BIM). To further attack a specific class, they chose the least-likely class of the prediction and tried to maximize the cross-entropy loss. This method is referred to as the Iterative Least-Likely Class (ILLC) method:

x_0 = x,
y_LL = argmin_y {p(y|x)},
x_{n+1} = Clip_{x,ε}{x_n − ε · sign(∇_x J(x_n, y_LL))}.        (11)

They successfully fooled the neural network with a crafted image taken by a cellphone camera. They also found that the Fast Gradient Sign Method is robust to photo transformation, while iterative methods cannot resist it.
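A minimal BIM sketch (Eqs. 9–10) in PyTorch, again assuming the hypothetical model from the earlier sketches and pixel values in [0, 1] rather than [0, 255]; the clipping keeps every iterate within a ξ-ball around the original image.

import torch
import torch.nn.functional as F

def bim(model, x, label, xi, eps, steps):
    # Basic Iterative Method: repeated small FGSM steps, clipped to [x - xi, x + xi].
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + eps * grad.sign()
        # Eq. (9): stay inside the xi-ball around x and inside the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - xi), x + xi).clamp(0.0, 1.0)
    return x_adv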

D. Jacobian-based Saliency Map Attack (JSMA)

Papernot et al. designed an efficient saliency adversarial map method, called the Jacobian-based Saliency Map Attack [61]. They first computed the Jacobian matrix of a given sample x:

J_F(x) = ∂F(x)/∂x = [ ∂F_j(x)/∂x_i ]_{i×j}.        (12)

According to [63], F denotes the second-to-last layer (logits) in [61]. Carlini and Wagner modified this approach by using the output of the softmax layer as F [63]. In this way, they found the input features of x that make the most significant changes to the output. A small perturbation is designed to successfully induce large output variations, so that a change in a small portion of features can fool the neural network.

The authors then defined two adversarial saliency maps to select the feature/pixel to be crafted in each iteration. They achieved a 97% adversarial success rate by modifying only 4.02% of the input features per sample. However, this method runs very slowly due to its significant computational cost.
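A sketch of the Jacobian computation in Eq. (12) using PyTorch autograd; model is again a placeholder classifier returning logits, and the construction of the two saliency maps is omitted.

import torch

def input_jacobian(model, x):
    # Jacobian of the logits F(x) with respect to a single (unbatched) input x.
    # torch.autograd.functional.jacobian differentiates every output w.r.t. the input.
    jac = torch.autograd.functional.jacobian(
        lambda inp: model(inp.unsqueeze(0)).squeeze(0), x)
    return jac   # shape: (num_classes, *x.shape)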

E. DeepFool

Moosavi-Dezfooli et al. proposed DeepFool to find the closest distance from the original input to the decision boundary of adversarial examples [62]. To overcome the non-linearity in high dimensions, they performed an iterative attack with a linear approximation. Starting from an affine classifier, they observed that the minimal perturbation of an affine classifier is the distance to the separating affine hyperplane F = {x : w^T x + b = 0}, and that the perturbation of an affine classifier f is η*(x) = −(f(x) / ‖w‖₂²) · w.

If f is a binary differentiable classifier, they used an iterative method to approximate the perturbation, considering f linearized around x_i at each iteration. The minimal perturbation is computed as:

argmin_{η_i}  ‖η_i‖₂
s.t.  f(x_i) + ∇f(x_i)^T η_i = 0.        (13)

This result can be extended to multi-class classifiers by finding the closest hyperplane, and to the more general ℓp norm, p ∈ [0, ∞). DeepFool produced smaller perturbations than FGSM and JSMA did. Compared to JSMA, DeepFool also reduced the intensity of the perturbation instead of the number of selected features.
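A sketch of the binary DeepFool iteration (Eq. 13), assuming f is a differentiable scalar-valued PyTorch classifier whose sign gives the predicted class; the overshoot factor and iteration cap are illustrative.

import torch

def deepfool_binary(f, x, max_iter=50, overshoot=0.02):
    # Binary DeepFool: repeatedly project onto the linearized boundary f(x) = 0.
    x_adv = x.clone().detach()
    orig_sign = torch.sign(f(x_adv)).item()
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        out = f(x_adv)
        if torch.sign(out).item() != orig_sign:
            break                                    # label flipped: done
        grad, = torch.autograd.grad(out, x_adv)
        # Closed-form minimal step to the linearized hyperplane (Eq. 13).
        eta = -(out.detach() / grad.norm() ** 2) * grad
        x_adv = (x_adv.detach() + (1 + overshoot) * eta).detach()
    return x_adv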

F. CPPN EA Fool

Nguyen et al. discovered a new type of attack, the compositional pattern-producing network-encoded EA (CPPN EA), in which adversarial examples are classified by deep neural networks with high confidence (99%) even though they are unrecognizable to humans [74]. We categorize this kind of attack as a false positive attack. Figure 2 illustrates false-positive adversarial examples.

Fig. 2: Examples that are unrecognizable to humans, but that deep neural networks classify into a class with high certainty (≥ 99.6%) [74]

They used evolutionary algorithms (EAs) to produce the adversarial examples. To solve the multi-class classification problem with EAs, they applied the multi-dimensional archive of phenotypic elites (MAP-Elites) [76]. The authors first encoded images with two different methods: direct encoding (grayscale or HSV values) and indirect encoding (compositional pattern-producing network). Then, in each iteration, MAP-Elites, like a general EA, chooses a random organism, mutates it randomly, and replaces the current one if the new one has higher fitness (higher certainty for a class of the neural network). In this way, MAP-Elites can find the best individual for each class. As the authors claimed, for many adversarial images, CPPN can locate the critical features that change the outputs of deep neural networks, just as JSMA does. Many images from the same evolutionary run are similar within closely related categories. More interestingly, CPPN EA fooling images were accepted by an art contest with a 35.5% acceptance rate.

G. C&W’s Attack

Carlini and Wagner launched a targeted attack to defeat defensive distillation (Section VI-A) [63]. According to their further studies [77], [78], the C&W Attack is effective against most existing adversarial-example detection defenses. The authors made several modifications to Equation 2.

They first defined a new objective function g, so that:

min_η  ‖η‖_p + c · g(x + η)
s.t.  x + η ∈ [0, 1]^n,        (14)

where g(x′) ≥ 0 if and only if f(x′) = l′. In this way, the distance and the penalty term can be better optimized. The authors listed seven candidate objective functions g. One of the effective functions evaluated in their experiments is:

g(x′) = max( max_{i≠l′} (Z(x′)_i) − Z(x′)_{l′}, −κ ),        (15)

where Z denotes the softmax function and κ is a constant that controls the confidence (κ is set to 0 in [63]).

Second, instead of using box-constrained L-BFGS to find the minimal perturbation as in the L-BFGS Attack, the authors introduced a new variable w to avoid the box constraint, where w satisfies η = ½(tanh(w) + 1) − x. General deep learning optimizers such as Adam and SGD were used to generate adversarial examples, and 20 iterations of such generation were performed to find an optimal c by binary search. However, they found that if the gradients of ‖η‖_p and g(x + η) are not on the same scale, it is hard to find a suitable constant c in all iterations of the gradient search and to obtain the optimal result. For this reason, two of their proposed objective functions did not find optimal solutions for adversarial examples.

Third, three distance measurements of the perturbation were discussed in the paper: ℓ0, ℓ2, and ℓ∞. The authors provided three kinds of attacks based on these distance metrics: the ℓ0 attack, the ℓ2 attack, and the ℓ∞ attack.

The ℓ2 attack can be described by:

min_w  ‖½(tanh(w) + 1) − x‖₂ + c · g(½(tanh(w) + 1)).        (16)

The authors showed that the distillation network could not help defend against the ℓ2 attack.

The ℓ0 attack is conducted iteratively, since the ℓ0 norm is not differentiable. In each iteration, a few pixels that are considered trivial for generating the adversarial example are removed. The importance of pixels is determined by the gradient of the ℓ2 distance. The iteration stops when the remaining pixels can no longer generate an adversarial example.

The ℓ∞ attack is also an iterative attack, which replaces the ℓ2 term with a new penalty in each iteration:

min  c · g(x + η) + Σ_i [(η_i − τ)⁺].        (17)

In each iteration, τ is reduced by a factor of 0.9 if all η_i < τ; the ℓ∞ attack considers τ an estimate of the ℓ∞ norm.
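A compact sketch of the C&W ℓ2 formulation (Eqs. 14–16) with the tanh change of variable, assuming a PyTorch model that returns logits for a single batched input; the constant c is fixed here rather than found by binary search, κ = 0, and the squared ℓ2 distance is used as in common implementations.

import torch

def cw_l2(model, x, target, c=1.0, steps=200, lr=0.01, kappa=0.0):
    # Targeted C&W-style l2 attack: optimize w, with x' = 0.5 * (tanh(w) + 1).
    w = torch.atanh(x.clamp(1e-6, 1 - 1e-6) * 2 - 1).clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)
        mask = torch.ones_like(logits[0], dtype=torch.bool)
        mask[target] = False
        # Hinge term of Eq. (15): largest non-target logit minus target logit.
        g = torch.clamp(logits[0][mask].max() - logits[0, target], min=-kappa)
        loss = ((x_adv - x) ** 2).sum() + c * g      # distortion + penalty (Eq. 16)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()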

H. Zeroth Order Optimization (ZOO)

Unlike gradient-based approaches for generating adversarial examples, Chen et al. proposed a Zeroth Order Optimization (ZOO)-based attack [64]. Since this attack does not require gradients, it can be deployed directly in a black-box setting without model transfer. Inspired by [63], the authors modified g(·) in [63] into a new hinge-like loss function:

g(x′) = max( max_{i≠l′} (log[f(x)]_i) − log[f(x)]_{l′}, −κ ),        (18)

and used the symmetric difference quotient to estimate the gradient and Hessian:

∂f(x)/∂x_i ≈ ( f(x + h·e_i) − f(x − h·e_i) ) / (2h),
∂²f(x)/∂x_i² ≈ ( f(x + h·e_i) − 2f(x) + f(x − h·e_i) ) / h²,        (19)

where e_i denotes the standard basis vector with the i-th component equal to 1, and h is a small constant.

By employing these estimates of the gradient and Hessian, ZOO does not need access to the victim deep learning model. However, it requires expensive computation to query and estimate the gradients. The authors proposed an ADAM-like algorithm, ZOO-ADAM, to randomly select a variable and update the adversarial example. Experiments showed that ZOO achieved performance comparable to the C&W Attack.
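A sketch of the coordinate-wise gradient estimation in Eq. (19), using only black-box queries to a hypothetical scalar loss function loss_fn(x) (for example, the hinge loss of Eq. 18 computed from the model's output probabilities).

import numpy as np

def estimate_gradient(loss_fn, x, coords, h=1e-4):
    # Symmetric-difference gradient estimates for the chosen coordinates (Eq. 19).
    grad = np.zeros(len(coords))
    for k, i in enumerate(coords):
        e = np.zeros_like(x)
        e.flat[i] = 1.0                  # standard basis vector e_i
        grad[k] = (loss_fn(x + h * e) - loss_fn(x - h * e)) / (2 * h)
    return grad

# Usage sketch: estimate the gradient for a few randomly chosen coordinates per
# iteration, then update only those coordinates of x with a small descent step.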

Fig. 3: A universal adversarial example fools the neural network on images. Left: original natural images with correct labels; center: the universal perturbation; right: perturbed images with wrong labels. [65]

I. Universal Perturbation

Leveraging their earlier DeepFool method, Moosavi-Dezfooli et al. developed a universal adversarial attack [65]. The problem they formulated is to find a universal perturbation vector satisfying

‖η‖_p ≤ ε,
P( f(x + η) ≠ f(x) ) ≥ 1 − δ,        (20)

where ε limits the size of the universal perturbation and δ controls the failure rate over the adversarial samples.

In each iteration, the DeepFool method is used to find a minimal sample perturbation against each input data point, and this perturbation is added to the total perturbation η. The loop does not stop until most data samples are fooled (P < 1 − δ). Experiments showed that the universal perturbation can be generated by using a small portion of the data samples instead of the entire dataset. Figure 3 illustrates a universal adversarial perturbation fooling a group of images. The universal perturbations were shown to generalize well across popular deep learning architectures (e.g., VGG, CaffeNet, GoogLeNet, ResNet).
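A high-level sketch of the universal perturbation loop, assuming model accepts a single image tensor and returns class scores, and assuming a hypothetical deepfool_step(model, x) helper that returns a minimal perturbation pushing x across the decision boundary (as in the previous subsection); the ℓ∞ projection radius and fooling-rate threshold are illustrative.

import torch

def universal_perturbation(model, dataset, deepfool_step, eps=0.1, delta=0.2, max_epochs=5):
    # Accumulate a single perturbation eta that fools the model on most samples (Eq. 20).
    eta = torch.zeros_like(dataset[0])
    for _ in range(max_epochs):
        for x in dataset:
            if model(x + eta).argmax() == model(x).argmax():   # not fooled yet
                eta = (eta + deepfool_step(model, x + eta)).clamp(-eps, eps)
        fooled = sum(int(model(x + eta).argmax() != model(x).argmax()) for x in dataset)
        if fooled / len(dataset) >= 1 - delta:                 # fooling rate is high enough
            break
    return eta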

J. One Pixel Attack

To avoid the problem of measuring perceptiveness, Su et al. generated adversarial examples by modifying only one pixel [66]. The optimization problem becomes:

min_{x′}  J(f(x′), l′)
s.t.  ‖η‖₀ ≤ ε₀,        (21)

where ε₀ = 1 for modifying only one pixel. This new constraint makes the problem hard to optimize.

Su et al. applied differential evolution (DE), one of the evolutionary algorithms, to find the optimal solution. DE does not require the gradients of the neural networks and can be used with non-differentiable objective functions. They evaluated the proposed method on the CIFAR-10 dataset using three neural networks: the All Convolution Network (AllConv) [79], Network in Network (NiN) [80], and VGG16. Their results showed that 70.97% of images successfully fooled deep neural networks with at least one target class, with 97.47% confidence on average.

K. Feature Adversary

Sabour et al. performed a targeted attack by minimizing the distance between the representations of internal neural network layers instead of the output layer [67]. We refer to this attack as Feature Adversary. The problem can be described by:

min_{x′}  ‖φ_k(x) − φ_k(x′)‖
s.t.  ‖x − x′‖_∞ < δ,        (22)

where φ_k denotes the mapping from the image input to the output of the k-th layer. Instead of finding a minimal perturbation, δ is used as a constraint on the perturbation. They claimed that a small fixed value of δ is good enough for human perception. Similarly to [8], they used L-BFGS-B to solve the optimization problem. The adversarial images are more natural and closer to the targeted images in the internal layers.
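A sketch of the feature-space objective in Eq. (22), assuming feature_extractor returns the k-th layer activations of a PyTorch model; projected gradient descent with Adam replaces the paper's L-BFGS-B solver here, so this approximates the method rather than reproducing it.

import torch

def feature_adversary(feature_extractor, x_source, x_guide, delta=0.05, steps=300, lr=0.01):
    # Drive the internal representation of x_source toward that of x_guide (Eq. 22).
    target_feat = feature_extractor(x_guide).detach()
    x_adv = x_source.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        loss = (feature_extractor(x_adv) - target_feat).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():   # project back into the l_inf ball of radius delta
            x_adv.copy_(torch.min(torch.max(x_adv, x_source - delta),
                                  x_source + delta).clamp(0.0, 1.0))
    return x_adv.detach()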

L. Hot/Cold

Rozsa et al. proposed a Hot/Cold method to find multiple adversarial examples for every single image input [59]. They argued that small translations and rotations should be allowed as long as they are imperceptible.

They defined a new metric, the Psychometric Perceptual Adversarial Similarity Score (PASS), to measure noticeable similarity to humans. Hot/Cold neglects unnoticeable pixel-level differences and replaces the widely used ℓp distance with PASS. PASS includes two stages: 1) aligning the modified image with the original image; 2) measuring the similarity between the aligned image and the original one.

Let φ(x′, x) be a homography transform from the adversarial example x′ to the original example x, with H the 3 × 3 homography matrix. H is solved by maximizing the enhanced correlation coefficient (ECC) [81] between x′ and x. The optimization function is:

argmin_H  ‖ x̄/‖x̄‖ − φ̄(x′, x)/‖φ̄(x′, x)‖ ‖,        (23)

where the bar denotes the normalization of an image.

The Structural SIMilarity (SSIM) index [82] was adopted to measure the just-noticeable difference between images. [59] leveraged SSIM and defined a new measurement, the regional SSIM index (RSSIM), as:

RSSIM(x_{i,j}, x′_{i,j}) = L(x_{i,j}, x′_{i,j})^α · C(x_{i,j}, x′_{i,j})^β · S(x_{i,j}, x′_{i,j})^γ,

where α, β, γ are weights of importance for luminance (L(·,·)), contrast (C(·,·)), and structure (S(·,·)). The SSIM can be calculated by averaging the RSSIM:

SSIM(x, x′) = (1/(n × m)) Σ_{i,j} RSSIM(x_{i,j}, x′_{i,j}).

PASS is defined as the combination of the alignment and the similarity measurement:

PASS(x′, x) = SSIM(φ*(x′, x), x).        (24)

The adversarial problem with the new distance is described as:

min  D(x, x′)
s.t.  f(x′) = y′,
      PASS(x, x′) ≥ γ,        (25)

where D(x, x′) denotes a measure of distance (e.g., 1 − PASS(x, x′) or ‖x − x′‖_p).

To generate a diverse set of adversarial examples, the authors defined the targeted label l′ as the hot class and the original label l as the cold class. In each iteration, they move toward the target (hot) class while moving away from the original (cold) class. Their results showed that the generated adversarial examples are comparable to those of FGSM, but with more diversity.

M. Natural GAN

Zhao et al. utilized Generative Adversarial Networks (GANs) as part of their approach to generate adversarial examples for images and texts [68], which makes the adversarial examples more natural to humans. We name this approach Natural GAN. The authors first trained a WGAN model on the dataset, where the generator G maps random noise to the input domain. They also trained an “inverter” I to map input data to dense inner representations. Hence, the adversarial noise is generated by minimizing the distance of the inner representations, as in Feature Adversary. The adversarial examples are generated using the generator, x′ = G(z′):

min_z  ‖z − I(x)‖
s.t.  f(G(z)) ≠ f(x).        (26)

Both the generator G and the inverter I were built to make the adversarial examples natural. Natural GAN is a general framework for many deep learning fields. [68] applied Natural GAN to image classification, textual entailment, and machine translation. Since Natural GAN does not require the gradients of the original neural networks, it can also be applied to black-box attacks.

N. Model-based Ensembling Attack

Liu et al. conducted a study of transferability (Section VII-A) over deep neural networks on ImageNet and proposed a Model-based Ensembling Attack for targeted adversarial examples [69]. The authors argued that, compared to non-targeted adversarial examples, targeted adversarial examples are much harder to transfer across deep models. Using the Model-based Ensembling Attack, they can generate transferable adversarial examples to attack a black-box model.

The authors generated adversarial examples on multiple deep neural networks with full knowledge and tested them on a black-box model. The Model-based Ensembling Attack is derived from the following optimization problem:

argmin_{x′}  −log( Σ_{i=1}^{k} α_i J_i(x′, l′) ) + λ‖x′ − x‖,        (27)

where k is the number of deep neural networks in the generation, f_i is the function of each network, and α_i is the ensemble weight (Σ_{i=1}^{k} α_i = 1). The results showed that the Model-based Ensembling Attack can generate transferable targeted adversarial images, which enhances the power of adversarial examples for black-box attacks. They also showed that this method performs better at generating non-targeted adversarial examples than previous methods. The authors successfully conducted a black-box attack against Clarifai.com using the Model-based Ensembling Attack.
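A sketch of the ensemble objective in Eq. (27), assuming a list of white-box PyTorch models whose softmax outputs stand in for J_i; the weights and the distortion coefficient λ are illustrative, and x_adv would be optimized against this loss with any gradient-based optimizer.

import torch
import torch.nn.functional as F

def ensemble_loss(models, weights, x_adv, x, target, lam=0.1):
    # Eq. (27): negative log of the weighted ensemble score of the target class,
    # plus a penalty on the distortion from the original image.
    score = sum(w * F.softmax(m(x_adv), dim=1)[0, target]
                for m, w in zip(models, weights))
    return -torch.log(score + 1e-12) + lam * (x_adv - x).norm(p=2)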

O. Ground-Truth Attack

Formal verification techniques aim to evaluate the robustness of a neural network even against zero-day attacks (Section VI-F). Carlini et al. constructed a Ground-Truth Attack, which provides adversarial examples with minimal perturbation (ℓ1, ℓ∞) [70]. Network verification checks whether an adversarial example violates a property of a deep neural network and whether there exists an example that changes the label within a certain distance. The Ground-Truth Attack conducts a binary search and finds such an adversarial example with the smallest perturbation by invoking Reluplex [83] iteratively. The initial adversarial example is found using the C&W Attack [63] to improve the performance.

V. APPLICATIONS FOR ADVERSARIAL EXAMPLES

We have investigated adversarial examples for the image classification task. In this section, we review adversarial examples against other tasks, focusing mainly on three questions: In which scenarios are adversarial examples applied in the new task? How are adversarial examples generated in the new task? Is a new method proposed, or is the problem translated into the image classification task and solved by the aforementioned methods? Table III summarizes the applications of adversarial examples discussed in this section.

A. Reinforcement Learning

Deep neural networks have been used in reinforcement learning to train policies on raw input (e.g., images). [84], [85] generated adversarial examples against deep reinforcement learning policies. Because of the inherently intensive computation of reinforcement learning, both performed fast one-time attacks.

Huang et al. applied FGSM to attack deep reinforcement learning networks and algorithms [84]: deep Q-networks (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor-critic (A3C) [97]. Similarly to [60], they added small perturbations to the input of the policy by calculating the gradient of the cross-entropy loss function, ∇_x J(θ, x, ℓ). Since DQN does not have a stochastic policy input, the softmax of the Q-values is used to calculate the loss function. They evaluated adversarial examples on four Atari 2600 games under three norm constraints, ℓ1, ℓ2, and ℓ∞, and found that Huang's Attack with the ℓ1 norm succeeded in both the white-box and the black-box setting (no access to the training algorithms, parameters, and hyper-parameters).

[85] used FGSM to attack the A3C algorithm on the Atari Pong task and found that injecting perturbations into only a fraction of the frames is sufficient.

Fig. 4: Adversarial attacks on autoencoders [87]. Perturbations are added to the input of the encoder. After encoding and decoding, the decoder outputs an adversarial image presenting an incorrect class.

B. Generative Modeling

Kos et al. [86] and Tabacof et al. [87] proposed adversarial examples for generative models. An adversary for an autoencoder can inject perturbations into the input of the encoder so that a targeted class is generated after decoding. Figure 4 depicts a targeted adversarial example for an autoencoder: adding perturbations to the input image of the encoder misguides the autoencoder by making the decoder generate a targeted adversarial output image.

Kos et al. described a scenario for applying adversarial examples against autoencoders. Autoencoders can be used to compress data with an encoder and decompress it with a decoder. For example, Toderici et al. used an RNN-based autoencoder to compress images [98], and Ledig et al. used a GAN to super-resolve images [99]. Adversaries can leverage an autoencoder to reconstruct an adversarial image by adding a perturbation to the input of the encoder.

Tabacof et al. used the Feature Adversary attack against AEs and VAEs. The adversarial examples were formulated as follows [87]:

min_η  D(z_{x′}, z_x) + c‖η‖
s.t.  x′ ∈ [L, U],
      z_{x′} = Encoder(x′),
      z_x = Encoder(x),        (28)

where D(·) is the distance between the latent encoding representations z_x and z_{x′}. Tabacof et al. chose the KL-divergence to measure D(·) in [87]. They tested their attacks on the MNIST and SVHN datasets and found that generating adversarial examples for autoencoders is much harder than for classifiers. The VAE is even slightly more robust than the deterministic autoencoder.

Kos et al. extended Tabacof et al.'s work by designing two additional kinds of distances. The adversarial examples can then be generated by optimizing:

min_η  c‖η‖ + J(x′, l′).        (29)

The loss function J can be the cross-entropy (referred to as the “Classifier Attack” in [86]), the VAE loss function (the “LVAE Attack”), or the distance between the original latent vector z and the modified encoded vector z_{x′} (the “Latent Attack”, similar to Tabacof et al.'s work [87]). They tested VAE and VAE-GAN [100] on the MNIST, SVHN, and CelebA datasets. In their experimental results, the “Latent Attack” achieved the best result.


TABLE III: Summary of Applications for Adversarial Examples

Application | Representative Study | Method | Adversarial Falsification | Adversary's Knowledge | Adversarial Specificity | Perturbation Scope | Perturbation Limitation | Attack Frequency | Perturbation Measurement | Dataset | Architecture
Reinforcement Learning | [84] | FGSM | N/A | White-box & Black-box | Non-Targeted | Individual | N/A | One-time | ℓ1, ℓ2, ℓ∞ | Atari | DQN, TRPO, A3C
Reinforcement Learning | [85] | FGSM | N/A | White-box | Non-Targeted | Individual | N/A | One-time | N/A | Atari Pong | A3C
Generative Modeling | [86] | Feature Adversary, C&W | N/A | White-box | Targeted | Individual | Optimized | Iterative | ℓ2 | MNIST, SVHN, CelebA | VAE, VAE-GAN
Generative Modeling | [87] | Feature Adversary | N/A | White-box | Targeted | Individual | Optimized | Iterative | ℓ2 | MNIST, SVHN | VAE, AE
Face Recognition | [58] | Impersonation & Dodging Attack | False negative | White-box & Black-box | Targeted & Non-Targeted | Universal | Optimized | Iterative | Total Variation | LFW | VGG Face
Object Detection | [11] | DAG | False negative & False positive | White-box & Black-box | Non-Targeted | Individual | N/A | Iterative | N/A | VOC2007, VOC2012 | Faster-RCNN
Semantic Segmentation | [11] | DAG | False negative & False positive | White-box & Black-box | Non-Targeted | Individual | N/A | Iterative | N/A | DeepLab | FCN
Semantic Segmentation | [88] | ILLC | False negative | White-box | Targeted | Individual | N/A | Iterative | ℓ∞ | Cityscapes | FCN
Semantic Segmentation | [89] | ILLC | False negative | White-box | Targeted | Universal | N/A | Iterative | N/A | Cityscapes | FCN
Reading Comprehension | [90] | AddSent, AddAny | N/A | Black-box | Non-Targeted | Individual | N/A | One-time & Iterative | N/A | SQuAD | BiDAF, Match-LSTM, and twelve other published models
Reading Comprehension | [91] | Reinforcement Learning | False negative | White-box | Non-Targeted | Individual | Optimized | Iterative | ℓ0 | TripAdvisor Dataset | Bi-LSTM, memory network
Malware Detection | [92] | JSMA | False negative | White-box | Targeted | Individual | Optimized | Iterative | ℓ2 | DREBIN | 2-layer FC
Malware Detection | [93] | Reinforcement Learning | False negative | Black-box | Targeted | Individual | N/A | Iterative | N/A | N/A | Gradient Boosted Decision Tree
Malware Detection | [94] | GAN | False negative | Black-box | Targeted | Individual | N/A | Iterative | N/A | malwr | Multi-layer Perceptron
Malware Detection | [95] | GAN | False negative | Black-box | Targeted | Individual | N/A | Iterative | N/A | Alexa Top 1M | Random Forest
Malware Detection | [96] | Genetic Programming | False negative | Black-box | Targeted | Individual | N/A | Iterative | N/A | Contagio | Random Forest, SVM


Fig. 5: An example of an adversarial eyeglass frame against a Face Recognition System [58].

They tested VAE and VAE-GAN [100] on the MNIST, SVHN, and CelebA datasets. In their experimental results, the "Latent Attack" achieved the best result.

C. Face Recognition

Deep neural network based Face Recognition Systems (FRS) and Face Detection Systems have been widely deployed in commercial products due to their high performance. [58] first provided a design of eyeglass frames to attack a deep neural network based FRS [101], which comprises 11 blocks with 38 layers and one triplet loss function for feature embedding. Based on the triplet loss function, [58] designed a softmax loss function:

    J(x) = −log( exp(⟨h_{c_x}, f(x)⟩) / Σ_{c=1}^{m} exp(⟨h_c, f(x)⟩) ),            (30)

where h_c is the one-hot vector of class c and ⟨·, ·⟩ denotes the inner product. They then used the L-BFGS Attack to generate adversarial examples.

In a further step, [58] implemented adversarial eyeglass frames to achieve attacks in the physical world: the perturbations can only be injected into the area of the eyeglass frames. They also enhanced the printability of the adversarial images on the frame by adding a penalty for the non-printability score (NPS) to the optimization objective. Similarly to Universal Perturbation, they optimized the perturbation to be applied to a set of face images. They successfully dodged the FRS (non-targeted attack, over 80% of the time) and misguided the FRS toward a specific face (targeted attack) with a high success rate (depending on the target). Figure 5 illustrates an example of adversarial eyeglass frames.
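A minimal sketch of the mask-restricted optimization is given below, assuming a hypothetical `frs_model` classifier, a binary `mask` marking the eyeglass-frame area, and a plain cross-entropy dodging loss in place of Eq. (30); the printability (NPS) penalty and the physical-realizability considerations of [58] are omitted.

```python
import torch
import torch.nn.functional as F

def masked_dodging_attack(frs_model, faces, true_labels, mask, steps=100, lr=1.0 / 255):
    """Optimize a perturbation restricted to an eyeglass-frame region (sketch).

    frs_model:   hypothetical face-recognition classifier.
    faces:       batch of face images of one subject, values in [0, 1].
    true_labels: the subject's class indices.
    mask:        binary tensor, 1 inside the frame region, 0 elsewhere.
    The same perturbation is shared across the batch, in the spirit of the
    universal-style formulation of [58]; the NPS term is omitted.
    """
    delta = torch.zeros_like(faces[:1], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (faces + delta * mask).clamp(0.0, 1.0)
        logits = frs_model(adv)
        # Dodging: maximize the loss of the true class (gradient ascent).
        loss = -F.cross_entropy(logits, true_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (delta * mask).detach()
```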

Leveraging the approach of printability, [10] proposed an attack algorithm, Robust Physical Perturbations (RP2), to modify a stop sign so that it is recognized as a speed limit sign.8 They changed physical road signs with two kinds of attacks: 1) overlaying an adversarial road sign over the physical sign; 2) sticking perturbations onto an existing sign. [10] included the non-printability score in the optimization objective to improve printability.

D. Object Detection

The object detection task is to find the proposals of objects (bounding boxes), which can be viewed as an image classification task for every possible proposal. [11] proposed a universal algorithm called Dense Adversary Generation (DAG) to generate adversarial examples for both object detection and semantic segmentation. The authors aimed at making the prediction (detection/segmentation) incorrect (non-targeted).

8This method was shown to be not effective against standard detectors (YOLO and Faster RCNN) in [102].

Fig. 6: An adversarial example for the object detection task [11]. Left: object detection on a clean image. Right: object detection on an adversarial image.

Figure 6 illustrates an adversarial example for the object detection task.

[11] defined T = {t1, t2, . . . , tN} as the recognition targets. For image classification, the classifier only needs one target, the entire image (N = 1); for semantic segmentation, the targets consist of all pixels (N = # of pixels); for object detection, the targets consist of all possible proposals (N = (# of pixels)²). The objective function then sums up the loss over all targets. Instead of optimizing the loss over all targets, the authors performed an iterative optimization and only updated the loss for the targets correctly predicted in the previous iteration. The final perturbation sums up the normalized perturbations from all iterations. To deal with the large number of targets in the object detection problem, the authors used a region proposal network (RPN) [4] to generate possible targets, which greatly decreases the computation for targets in object detection. DAG also showed the capability of generating images which are unrecognizable to humans but which deep learning models could still predict (false positives).
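A sketch of the DAG-style loop is shown below, assuming a hypothetical `model` interface that returns one row of logits per recognition target (proposal or pixel); it keeps only the still-correctly-predicted targets in each step and accumulates the normalized perturbations, as described above.

```python
import torch

def dag_attack(model, x, true_labels, adv_labels, gamma=0.5, max_iter=150):
    """Dense Adversary Generation (DAG)-style loop (sketch).

    model(x) is assumed to return logits of shape (num_targets, num_classes),
    one row per recognition target -- a hypothetical interface.
    true_labels / adv_labels: per-target correct and adversarial class ids.
    """
    x_adv = x.clone().detach()
    r_total = torch.zeros_like(x)
    for _ in range(max_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        logits = model(x_adv)
        active = logits.argmax(dim=1) == true_labels      # still-correct targets
        if not active.any():
            break
        sel = logits[active]
        # Raise the adversarial-class score and lower the true-class score.
        loss = (sel.gather(1, adv_labels[active].unsqueeze(1)).sum()
                - sel.gather(1, true_labels[active].unsqueeze(1)).sum())
        grad = torch.autograd.grad(loss, x_adv)[0]
        r_total = r_total + gamma * grad / grad.norm().clamp_min(1e-12)
        x_adv = x.detach() + r_total
    return x_adv.detach(), r_total
```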

E. Semantic Segmentation

The image segmentation task can be viewed as an image classification task for every pixel. Since each perturbation is responsible for at least one pixel's segmentation, the space of perturbations for segmentation is much smaller than that for image classification [103]. [11], [88], [103] generated adversarial examples against the semantic image segmentation task; however, their attacks were proposed under different scenarios. As just discussed, [11] performed non-targeted segmentation attacks. [88], [103] both performed targeted segmentation attacks and tried to remove a certain class by misguiding the deep learning model into classifying it as background classes.

[88] generated adversarial examples by assigning pixels the adversarial class to which their nearest neighbor belongs. The success rate was measured by the percentage of pixels of the chosen class that are changed, or of the remaining classes that are preserved.

[103] presented a method to generate universal adversarial perturbations against the semantic image segmentation task. They assigned the primary objective of the adversarial examples to hide objects (e.g., pedestrians) while keeping the rest of the segmentation unchanged. Metzen et al. defined background classes and targeted classes (not targeted adversarial classes), where targeted classes are the classes to be removed. Similar to [88], pixels which belong to the targeted classes are assigned to their nearest background classes:


Fig. 7: Adversarial examples hiding pedestrians in the semantic segmentation task [103]. Left image: original image; Middle image: the segmentation of the original image predicted by a DNN; Right image: the segmentation of the adversarial image predicted by a DNN.

    l^target_{ij} = l^pred_{i′j′}    ∀(i, j) ∈ I_targeted,
    l^target_{ij} = l^pred_{ij}      ∀(i, j) ∈ I_background,
    (i′, j′) = argmin_{(i′,j′) ∈ I_background} ‖i′ − i‖ + ‖j′ − j‖,            (31)

where I_targeted = {(i, j) | f(x_ij) = l*} denotes the area to be removed. Figure 7 illustrates an adversarial example that hides pedestrians. They used the ILLC attack to solve this problem and also extended the Universal Perturbation method to obtain the universal perturbation. Their results showed the existence of universal perturbations for the semantic segmentation task.
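The target-label assignment of Eq. (31) can be sketched as follows, using NumPy and SciPy on a hypothetical predicted label map; each pixel of a targeted class receives the predicted label of its nearest background pixel under the ℓ1 distance.

```python
import numpy as np
from scipy.spatial import cKDTree

def hide_targeted_classes(pred_labels, targeted_classes):
    """Build the adversarial target label map of Eq. (31) (sketch).

    pred_labels:      (H, W) array of labels predicted on the clean image.
    targeted_classes: iterable of class ids to be removed (e.g., pedestrians).
    """
    target = pred_labels.copy()
    targeted_mask = np.isin(pred_labels, list(targeted_classes))
    tgt_coords = np.argwhere(targeted_mask)
    bg_coords = np.argwhere(~targeted_mask)
    if len(tgt_coords) == 0 or len(bg_coords) == 0:
        return target
    # Nearest background pixel in Manhattan (p=1) distance |i'-i| + |j'-j|.
    tree = cKDTree(bg_coords)
    _, nearest = tree.query(tgt_coords, p=1)
    nearest_bg = bg_coords[nearest]
    target[tgt_coords[:, 0], tgt_coords[:, 1]] = pred_labels[nearest_bg[:, 0], nearest_bg[:, 1]]
    return target
```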

F. Natural Language Processing (NLP)

Many tasks in natural language processing can be attacked by adversarial examples. Adversarial examples are usually generated by adding/deleting words in the sentences.

The task of reading comprehension (a.k.a. question answering) is to read paragraphs and answer questions about them. To generate adversarial examples that are consistent with the correct answer and do not confuse humans, Jia and Liang added distracting (adversarial) sentences to the end of the paragraph [90]. They found that models for the reading comprehension task are overstable rather than oversensitive, which means deep learning models cannot tell the subtle but critical differences in the paragraphs.

They proposed two kinds of methods to generate adversarial examples: 1) adding grammatical sentences similar to the question but not contradictory to the correct answer (AddSent); 2) adding a sentence of arbitrary English words (AddAny). [90] successfully fooled all the models (sixteen models) they tested on the Stanford Question Answering Dataset (SQuAD) [104]. The adversarial examples also exhibit transferability, and they cannot be defended against by adversarial training. However, the adversarial sentences require human effort to fix errors in the sentences.

[91] aimed to fool a deep learning-based sentiment classifier by removing the minimum subset of words in the given text. Reinforcement learning was used to find an approximate subset, where the reward was defined as 1/‖D‖ when the sentiment label changes and 0 otherwise, with ‖D‖ denoting the size of the removed word set D. The reward function also included a regularizer to keep the sentence contiguous.
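A minimal sketch of this reward, with hypothetical argument names; the contiguity regularizer mentioned above is omitted.

```python
def removal_reward(original_label, new_label, removed_words):
    """Reward sketch for the word-removal attack of [91]:
    1/|D| if the sentiment label flips, 0 otherwise (regularizer omitted)."""
    if new_label != original_label and removed_words:
        return 1.0 / len(removed_words)
    return 0.0
```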

The changes in [90], [91] can easily be recognized by humans. More natural adversarial examples for textual data were proposed by Natural GAN [68] (Section IV-M).

TABLE IV: Summary of Countermeasures for Adversarial Examples

Type | Defensive Strategy | Representative Studies
Reactive | Adversarial Detecting | [23], [89], [109]–[116]
Reactive | Input Reconstruction | [114], [117], [118]
Reactive | Network Verification | [83], [119], [120]
Proactive | Network Distillation | [121]
Proactive | Adversarial (Re)Training | [24], [25], [60], [72], [75], [122]
Proactive | Classifier Robustifying | [123], [124]

G. Malware Detection

Deep learning has been used in static and behavioral-based malware detection due to its capability of detecting zero-day malware [105]–[108]. Recent studies generated adversarial malware samples to evade deep learning-based malware detection [92]–[94], [96].

[92] adapted the JSMA method to attack Android malware detection models. [96] evaded two PDF malware classifiers, PDFrate and Hidost, by modifying PDFs. [96] parsed the PDF file and changed its object structure using genetic programming; the adversarial PDF file was then packed with the new objects.

[95] used a GAN to generate adversarial domain names to evade detection of domain generation algorithms. [94] proposed a GAN-based algorithm, MalGAN, to generate malware examples and evade black-box detection. [94] used a substitute detector to simulate the real detector and leveraged the transferability of adversarial examples to attack the real detector. MalGAN was evaluated on 180K programs with API features. However, [94] required knowledge of the features used in the model. [93] used a large number of features (2,350) to cover the required feature space of portable executable (PE) files, including PE header metadata, section metadata, and import & export table metadata. [93] also defined several modifications to generate malware evading deep learning detection. The solution was trained by reinforcement learning, where the evasion rate is considered as the reward.

VI. COUNTERMEASURES FOR ADVERSARIAL EXAMPLES

Countermeasures for adversarial examples follow two types of defense strategies: 1) reactive: detect adversarial examples after deep neural networks are built; 2) proactive: make deep neural networks more robust before adversaries generate adversarial examples. In this section, we discuss three reactive countermeasures (Adversarial Detecting, Input Reconstruction, and Network Verification) and three proactive countermeasures (Network Distillation, Adversarial (Re)training, and Classifier Robustifying). We will also discuss an ensembling method to prevent adversarial examples. Table IV summarizes the countermeasures.

A. Network Distillation

Papernot et al. used network distillation to defend deep neural networks against adversarial examples [121]. Network distillation was originally designed to reduce the size of deep neural networks by transferring knowledge from a large network to a small one [125], [126] (Figure 8).


Fig. 8: Network distillation of deep neural networks [121]

The class probabilities produced by the first DNN are used as inputs to train the second DNN. These probabilities carry the knowledge learned by the first DNN. Softmax is usually used to normalize the last layer of a DNN and produce the class probabilities. The softmax output of the first DNN, which is also the input to the next DNN, can be described as:

    q_i = exp(z_i / T) / Σ_j exp(z_j / T),                                    (32)

where T is a temperature parameter that controls the level of knowledge distillation. In deep neural networks, the temperature T is set to 1. When T is large, the output of the softmax will be vague (when T → ∞, the probability of all classes → 1/m). When T is small, only one class is close to 1 while the rest go to 0. This schema of network distillation can be duplicated several times to connect several deep neural networks.
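A small sketch of the temperature-scaled softmax of Eq. (32); the printed example at the end is illustrative only.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax of Eq. (32) (sketch).

    Large T flattens the distribution toward 1/m over m classes;
    small T pushes it toward a one-hot vector.
    """
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Soft labels from the first (teacher) network at high temperature,
# used to train the second (distilled) network in [121]'s defense.
print(softmax_with_temperature([5.0, 2.0, 0.5], T=20.0))
```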

In [121], network distillation extracted knowledge from deep neural networks to improve robustness. The authors found that attacks primarily aim at the sensitivity of networks and then proved that using a high-temperature softmax reduces the model's sensitivity to small perturbations. The Network Distillation defense was tested on the MNIST and CIFAR-10 datasets and reduced the success rate of the JSMA attack to 0.5% and 5%, respectively. Network Distillation also improved the generalization of the neural networks.

B. Adversarial (Re)training

Training with adversarial examples is one of the countermeasures to make neural networks more robust. Goodfellow et al. [60] and Huang et al. [122] included adversarial examples in the training stage: they generated adversarial examples in every step of training and injected them into the training set. [60], [122] showed that adversarial training improves the robustness of deep neural networks. Adversarial training can also provide regularization for deep neural networks [60] and improve precision [24].

[60] and [122] were evaluated only on the MNIST dataset. A comprehensive analysis of adversarial training methods on the ImageNet dataset was presented in [72], which used half adversarial examples and half original examples in each step of training. From the results, adversarial training increases the robustness of neural networks against one-step attacks (e.g., FGSM) but does not help against iterative attacks (e.g., the BIM and ILLC methods). [72] suggested that adversarial training be used only as regularization to avoid overfitting (e.g., the case in [60] with the small MNIST dataset).
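The half-clean/half-adversarial recipe described for [72] can be sketched as a single training step, assuming a hypothetical `model` and `optimizer` and using FGSM as the one-step attack; this is an illustrative sketch, not the exact procedure of [72].

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=8 / 255):
    """One training step mixing clean and FGSM adversarial examples (sketch)."""
    # Generate FGSM adversarial examples for the current batch.
    model.eval()
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

    # Train on half clean and half adversarial examples.
    model.train()
    half = x.size(0) // 2
    mixed_x = torch.cat([x[:half], x_adv[half:]], dim=0)
    mixed_y = torch.cat([y[:half], y[half:]], dim=0)
    optimizer.zero_grad()
    train_loss = F.cross_entropy(model(mixed_x), mixed_y)
    train_loss.backward()
    optimizer.step()
    return train_loss.item()
```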

[75] found that adversarially trained models on the MNIST and ImageNet datasets are more robust to white-box adversarial examples than to transferred (black-box) examples.

[25] minimized both the cross-entropy loss and the internal representation distance during adversarial training, which can be seen as a defense version of Feature Adversary.

To deal with transferred black-box attacks, [75] proposed the Ensemble Adversarial Training method, which trains the model with adversarial examples generated from multiple sources: the models being trained as well as pre-trained external models.

C. Adversarial Detecting

Many research projects tried to detect adversarial examples in the testing stage [23], [89], [109]–[114], [116].

[23], [89], [111] trained deep neural network-based binary classifiers as detectors to classify the input data as a legitimate (clean) input or an adversarial example. Metzen et al. created a detector for adversarial examples as an auxiliary network of the original neural network [89]. The detector is a small and straightforward neural network that performs binary classification, i.e., it predicts the probability of the input being adversarial. SafetyNet [23] extracted the binary thresholds of each ReLU layer's output as the features of the adversarial detector and detected adversarial images with an RBF-SVM classifier. The authors claimed that their method is hard to defeat even when adversaries know the detector, because it is difficult for adversaries to find an optimal value for both the adversarial examples and the new features of the SafetyNet detector. [112] added an outlier class to the original deep learning model; the model detected adversarial examples by classifying them as outliers. They found that the measurements of maximum mean discrepancy (MMD) and energy distance (ED) could distinguish the distributions of adversarial and clean datasets.

[110] provided a Bayesian view of detecting adversarial examples. [110] claimed that the uncertainty of adversarial examples is higher than that of clean data. Hence, they deployed a Bayesian neural network to estimate the uncertainty of the input data and distinguished adversarial examples from clean input data based on the uncertainty estimation.

Similarly, [114] used a probability divergence (the Jensen-Shannon divergence) as one of its detectors. [113] showed that, after whitening by Principal Component Analysis (PCA), adversarial examples have different coefficients in the low-ranked components.

[118] trained a PixelCNN neural network [127] and found that the distribution of adversarial examples is different from that of clean data. They calculated a p-value based on the rank given by the PixelCNN and rejected adversarial examples using the p-values. The results showed that this approach could detect the FGSM, BIM, DeepFool, and C&W attacks.

[115] trained neural networks with a "reverse cross-entropy" to better distinguish adversarial examples from clean data in the latent layers, and then detected adversarial examples using a method called "Kernel density" in the testing stage.


The "reverse cross-entropy" makes the deep neural network predict with high confidence on the true class and a uniform distribution on the other classes. In this way, the deep neural network is trained to map clean inputs close to a low-dimensional manifold in the layer before the softmax, which brings great convenience for the further detection of adversarial examples.

[116] leveraged multiple previous images to predict future inputs and detect adversarial examples in the reinforcement learning task.

However, Carlini and Wagner summarized most of these adversarial detecting methods ([89], [109]–[113]) and showed that these methods could not defend against their previously proposed C&W's Attack with slight changes to the loss function [77], [78].

D. Input Reconstruction

Adversarial examples can be transformed into clean data via reconstruction; after the transformation, the adversarial examples will not affect the prediction of deep learning models. Gu and Rigazio proposed a variant of the autoencoder network with a penalty, called the deep contractive autoencoder, to increase the robustness of neural networks [117]. A denoising autoencoder network is trained to encode adversarial examples back to the original ones in order to remove the adversarial perturbations. [114] reconstructed adversarial examples by 1) adding Gaussian noise or 2) encoding them with an autoencoder, as the "Plan B" in MagNet [114] (Section VI-G).

PixelDefend reconstructed adversarial images back to the training distribution [118] using a PixelCNN. PixelDefend changed all pixels along each channel to maximize the probability under the training distribution:

    max_{x′}    P_t(x′)
    s.t.        ‖x′ − x‖∞ ≤ ε_defend,                                    (33)

where P_t denotes the training distribution and ε_defend controls the amount of change allowed on the adversarial examples. PixelDefend also leveraged adversarial detecting, so that if an adversarial example is not detected as malicious, no change is made to it (ε_defend = 0).
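A sketch of the greedy purification loop behind Eq. (33) is given below, assuming a hypothetical `pixelcnn_logits` function that returns, for every pixel and channel, the PixelCNN distribution over the 256 intensity values; the real implementation is far more efficient, but the logic is the constrained per-pixel maximization described above.

```python
import torch

def pixel_defend(pixelcnn_logits, x_adv, eps_defend=16):
    """PixelDefend-style purification (sketch).

    pixelcnn_logits: hypothetical function returning a tensor of shape
                     (C, H, W, 256) with the PixelCNN distribution over the
                     256 intensity values of each pixel.
    x_adv:           integer-valued image tensor of shape (C, H, W), 0..255.
    eps_defend:      maximum per-pixel change, in intensity levels.
    """
    x = x_adv.clone()
    C, H, W = x.shape
    for i in range(H):
        for j in range(W):
            logits = pixelcnn_logits(x)          # re-evaluate after each update
            for c in range(C):
                v = int(x[c, i, j])
                lo, hi = max(0, v - eps_defend), min(255, v + eps_defend)
                # Pick the most likely value within eps_defend of the input.
                window = logits[c, i, j, lo:hi + 1]
                x[c, i, j] = lo + int(window.argmax())
    return x
```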

E. Classifier Robustifying

[123], [124] designed robust architectures of deep neural networks to prevent adversarial examples.

Due to the uncertainty introduced by adversarial examples, Bradshaw et al. leveraged Bayesian classifiers to build more robust neural networks [123]. Gaussian processes (GPs) with RBF kernels were used to provide uncertainty estimation. The proposed neural networks were called Gaussian Process Hybrid Deep Neural Networks (GPDNNs). GPs express the latent variables as a Gaussian distribution parameterized by mean and covariance functions and encode them with RBF kernels. [123] showed that GPDNNs achieved performance comparable to general DNNs while being more robust to adversarial examples. The authors claimed that GPDNNs "know when they do not know."

[124] observed that adversarial examples usually fall into a small subset of incorrect classes. [124] separated the classes into sub-classes and ensembled the results from all sub-classes by voting to prevent adversarial examples from being misclassified.

Fig. 9: MagNet workflow: one or more detectors first detect whether input x is adversarial; if not, x is reconstructed to x* before being fed to the classifier (modified from [114]).

F. Network Verification

Verifying properties of deep neural networks is a promising solution for defending against adversarial examples, because it may detect new, unseen attacks. Network verification checks the properties of a neural network: whether an input violates or satisfies a property.

Katz et al. proposed a verification method for neural networks with the ReLU activation function, called Reluplex [83]. They used a Satisfiability Modulo Theories (SMT) solver to verify the neural networks. The authors showed that, within a small perturbation, no adversarial example exists that makes the neural network misclassify. They also proved that the problem of network verification is NP-complete. Carlini et al. extended the assumption on the ReLU function by presenting max(x, y) = ReLU(x − y) + y and ‖x‖ = ReLU(2x) − x [70]. However, Reluplex runs very slowly due to the large computation required to verify the networks and only works for DNNs with several hundred nodes [120]. [120] proposed two potential solutions: 1) prioritizing the order of checking nodes; 2) sharing information across verifications.

Instead of checking each point individually, Gopinath et al. proposed DeepSafe to provide safe regions of a deep neural network [119] using Reluplex. They also introduced targeted robustness, a safe region defined only with respect to a targeted class.

G. Ensembling Defenses

Due to the multifaceted nature of adversarial examples, multiple defense strategies can be performed together (in parallel or sequentially) to defend against adversarial examples.

The aforementioned PixelDefend [118] is composed of an adversarial detector and an "input reconstructor" to establish a defense strategy.

MagNet includes one or more detectors and a reconstructor (the "reformer" in the paper) as Plan A and Plan B [114]. The detectors are used to find adversarial examples that are far from the boundary of the data manifold. In [114], they first measured the distance between the input and the encoded input, as well as the probability divergence (Jensen-Shannon divergence) between the softmax outputs of the input and the encoded input; adversarial examples are expected to show a large distance and probability divergence. To deal with adversarial examples close to the boundary, MagNet uses a reconstructor built with neural network based autoencoders. The reconstructor maps adversarial examples to legitimate examples. Figure 9 illustrates the two-phase workflow of the defense.
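A minimal sketch of the MagNet-style two-phase defense, assuming a hypothetical `autoencoder` (reformer) and a hypothetical `classifier_probs` function, with simple thresholds standing in for the calibrated detectors of [114].

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two softmax outputs."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def magnet_defense(x, autoencoder, classifier_probs, recon_threshold, jsd_threshold):
    """MagNet-style two-phase defense (sketch).

    autoencoder:      hypothetical reformer/detector autoencoder, x -> x_rec.
    classifier_probs: hypothetical function returning softmax probabilities.
    Inputs far from the data manifold (large reconstruction error or large
    JS divergence between p(x) and p(x_rec)) are rejected; otherwise the
    reformed input x_rec is passed to the classifier.
    """
    x_rec = autoencoder(x)
    recon_error = np.mean(np.abs(x - x_rec))
    jsd = js_divergence(classifier_probs(x), classifier_probs(x_rec))
    if recon_error > recon_threshold or jsd > jsd_threshold:
        return None                      # Plan A: detected as adversarial
    return classifier_probs(x_rec)       # Plan B: classify the reformed input
```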


After investigating several defensive approaches, [128] showed that an ensemble of those defensive approaches does not make the neural networks strong.

H. Summary

Almost all defenses have been shown to be effective against only part of the attacks; they tend to fail against stronger or unseen attacks. Most defenses target adversarial examples in the computer vision domain. However, with the development of adversarial examples in other areas, new defenses for these areas, especially for safety-critical systems, are urgently required.

VII. CHALLENGES AND DISCUSSIONS

In this section, we discuss the current challenges and potential solutions for adversarial examples. Although many methods and theorems have been proposed and developed recently, many fundamental questions remain to be well explained and many challenges need to be addressed. The reason for the existence of adversarial examples is one of the most fundamental and interesting problems for both adversaries and researchers: understanding it exploits the vulnerability of neural networks and helps defenders resist adversarial examples. We discuss the following questions in this section: Why do adversarial examples transfer? How can transferability be stopped? Why are some defenses effective and others not? How can the strength of an attack, as well as of a defense, be measured? How can the robustness of a deep neural network against seen/unseen adversarial examples be evaluated?

A. Transferability

Transferability is a common property of adversarial examples. Szegedy et al. first found that adversarial examples generated against a neural network can fool the same architecture trained on different datasets. Papernot et al. found that adversarial examples generated against a neural network can fool other neural networks with different architectures, and even other classifiers trained by different machine learning algorithms [33]. Transferability is critical for Black-Box attacks, where the victim deep learning model and the training dataset are not accessible: attackers can train a substitute neural network model and then generate adversarial examples against the substitute, and the victim model will be vulnerable to these adversarial examples due to transferability. From a defender's view, if we hinder the transferability of adversarial examples, we can defend against all attackers who cannot access the model and therefore must rely on transferability; only white-box attackers, who need access to the model, would remain.

We define the transferability of adversarial examples at three levels, from easy to hard: 1) transfer among the same neural network architecture trained with different data; 2) transfer among different neural network architectures trained for the same task; 3) transfer among deep neural networks for different tasks. To the best of our knowledge, there is no existing solution on the third level yet (for instance, transferring an adversarial image from object detection to semantic segmentation).

Many studies examined transferability to show the power of adversarial examples [8], [60]. Papernot et al. studied the transferability between conventional machine learning techniques (i.e., logistic regression, SVM, decision tree, kNN) and deep neural networks. They found that adversarial examples can transfer across models trained with different parameters or different training datasets, and even across different machine learning techniques.

Liu et al. investigated the transferability of targeted and non-targeted adversarial examples on complex models and large datasets (e.g., the ImageNet dataset) [69]. They found that non-targeted adversarial examples are much more transferable than targeted ones. They observed that the decision boundaries of different models align well with each other. Thus they proposed the Model-Based Ensembling Attack to create transferable targeted adversarial examples.
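A sketch of the Model-Based Ensembling Attack's core optimization is shown below, assuming hypothetical white-box `models` with ensemble `weights`; the perturbation is optimized to be targeted against the whole ensemble in the hope that it transfers to an unseen victim model.

```python
import torch
import torch.nn.functional as F

def ensemble_targeted_attack(models, weights, x, target, epsilon=16 / 255,
                             steps=100, lr=1 / 255):
    """Targeted attack against a weighted ensemble of models (sketch).

    models / weights: hypothetical lists of white-box models and their weights.
    target:           target class indices for the transferable attack.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0.0, 1.0)
        # Weighted sum of targeted losses over the ensemble.
        loss = sum(w * F.cross_entropy(m(x_adv), target)
                   for m, w in zip(models, weights))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)   # keep the perturbation bounded
    return (x + delta).clamp(0.0, 1.0).detach()
```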

Tramèr et al. found that the distance to the model's decision boundary is on average larger than the distance between two models' boundaries in the same direction [129], which may explain the existence of transferability of adversarial examples. Tramèr et al. also claimed that transferability might not be an inherent property of deep neural networks by showing a counter-example.

B. The existence of Adversarial Examples

The reason for the existence of adversarial examples is still an open question. Are adversarial examples an inherent property of deep neural networks? Are adversarial examples the "Achilles' heel" of high-performance deep neural networks? Many hypotheses have been proposed to explain their existence.

Data incompleteness: One assumption is that adversarial examples are of low probability and low test coverage, corresponding to corner cases in the testing dataset [8], [130]. From training a PixelCNN, [118] found that the distribution of adversarial examples is different from that of clean data. Even for a simple Gaussian model, a robust model can be more complicated and requires much more training data than a "standard" model [131].

Model capability: Adversarial examples are a phenomenon not only for deep neural networks but for all classifiers [33], [132]. [60] suggested that adversarial examples are the result of models being too linear in high-dimensional manifolds. [133] showed that, in the linear case, adversarial examples exist when the decision boundary is close to the manifold of the training data.

Contrary to [60], [132] believed that adversarial examples are due to the "low flexibility" of the classifier for certain tasks. Linearity is not an "obvious explanation" [67]. [71] attributed adversarial examples to the sparse and discontinuous data manifold, which makes the classifier erratic.

No robust model: [25] suggested that the decision boundaries of deep neural networks are inherently incorrect, as they do not detect semantic objects. [134] showed that if a dataset is generated by a smooth generative model with a large latent space, there is no classifier robust to adversarial examples. Similarly, [135] proved that if a model trained on a sphere dataset misclassifies a small part of the dataset, then there exist adversarial examples with small perturbations.


In addition to adversarial examples for the image classification task, as discussed in Section V, adversarial examples have been generated for various applications, many of which deploy entirely different methods. Some applications can reuse the methods from the image classification task, while others require novel methods. Current studies on adversarial examples mainly focus on the image classification task; no existing paper explains the relationship among the different applications or the existence of a universal attacking/defending method applicable to all of them.

C. Robustness Evaluation

The competition between attacks and defenses for adversarial examples has become an "arms race": a defensive method proposed to prevent existing attacks is later shown to be vulnerable to some new attack, and vice versa [70], [112]. Some defenses showed that they could defend against a particular attack but later failed with a slight change of the attack [109], [110]. Hence, evaluating the robustness of a deep neural network is necessary. For example, [136] provided an upper bound on the robustness of linear and quadratic classifiers. The following problems for the robustness evaluation of deep neural networks require further exploration.

1) A methodology for evaluating the robustness of deep neural networks: Many deep neural networks are planned to be deployed in safety-critical settings. Defending against only existing attacks is not sufficient: zero-day (new) attacks would be more harmful to deep neural networks. A methodology for evaluating the robustness of deep neural networks is required, especially against zero-day attacks, to help people understand the confidence of model predictions and how much we can rely on them in the real world. [70], [83], [137], [138] conducted initial studies on such evaluation. Moreover, this problem lies not only in the performance of deep neural network models but also in their confidentiality and privacy.

2) A benchmark platform for attacks and defenses: Most attacks and defenses describe their methods without publicly available code, not to mention the parameters used in their methods. This brings difficulties for other researchers trying to reproduce their solutions and provide the corresponding attacks/defenses. For example, Carlini tried his best to "find the best possible defense parameters + random initialization"9. Some researchers even drew different conclusions because of different settings in their experiments. If there existed a benchmark where both adversaries and defenders conduct experiments in a uniform way (i.e., the same threat model, dataset, classifier, and attacking/defending approach), we could make a more precise comparison between different attacking and defending techniques.

Cleverhans [139] and Foolbox [140] are open-source libraries for benchmarking the vulnerability of deep neural networks against adversarial images. They build frameworks to evaluate attacks; however, defensive strategies are missing from both tools. Providing a dataset of adversarial examples generated by different methods would make it easier to find the blind spots of deep neural networks and to develop new defense strategies. This problem also occurs in other areas of deep learning.

9Code repository used in [77]: https://github.com/carlini/nn_breaking_detection

Fig. 10: Workflow of a benchmark platform for attackers and defenders: 1) attackers and defenders update/train their strategies on the training dataset; 2) attackers generate adversarial examples on the clean data; 3) the adversarial examples are verified by crowdsourcing as to whether they are recognizable to humans; 4) defenders generate a deep neural network as a defensive strategy; 5) the defensive strategy is evaluated.


Google Brain organized three competitions in the NIPS 2017 competition track: targeted adversarial attack, non-targeted adversarial attack, and defense against adversarial attack [141]. The dataset in the competition consisted of a set of images never used before and manually labeled, with 1,000 images for development and 5,000 images for final testing. The submitted attacks and defenses are used as benchmarks to evaluate each other: the adversarial attacks and defenses are scored by the number of runs in which they fool the defenses or correctly classify the images.

We present the workflow of a benchmark platform for attackers and defenders in Figure 10.

3) Various applications for robustness evaluation: Similar to the existence of adversarial examples in various applications, the wide range of applications makes it hard to evaluate the robustness of a deep neural network architecture. How can we compare methods that generate adversarial examples under different threat models? Do we have a universal methodology to evaluate robustness under all scenarios? Tackling these unsolved problems is a future direction.

VIII. CONCLUSION

In this paper, we reviewed recent findings on adversarial examples in deep neural networks. We investigated existing methods for generating adversarial examples10. A taxonomy of adversarial examples was proposed. We also explored the applications and countermeasures for adversarial examples.

This paper attempted to cover the state-of-the-art studies on adversarial examples in the deep learning domain. Compared with recent work on adversarial examples, we analyzed and discussed the current challenges and potential solutions for adversarial examples.

10Due to the rapid development of adversarial examples (attacks and defenses), we only considered the papers published before November 2017. We will update the survey with new methodologies and papers in our future work.


REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] J. Redmon and A. Farhadi, "Yolo9000: Better, faster, stronger," arXiv preprint arXiv:1612.08242, 2016.
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in neural information processing systems, 2015, pp. 91–99.
[5] G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, "The ibm 2015 english conversational telephone speech recognition system," arXiv preprint arXiv:1505.05899, 2015.
[6] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in neural information processing systems, 2014, pp. 3104–3112.
[7] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[8] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[9] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[10] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song, "Robust physical-world attacks on deep learning models," arXiv preprint arXiv:1707.08945, vol. 1, 2017.
[11] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, "Adversarial examples for semantic segmentation and object detection," in International Conference on Computer Vision. IEEE, 2017.
[12] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in USENIX Security Symposium, 2016, pp. 513–530.
[13] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, "Dolphinattack: Inaudible voice commands," arXiv preprint arXiv:1708.09537, 2017.
[14] "iOS - Siri - Apple," https://www.apple.com/ios/siri/.
[15] "Alexa," https://developer.amazon.com/alexa.
[16] "Cortana | Your Intelligent Virtual & Personal Assistant | Microsoft," https://www.microsoft.com/en-us/windows/cortana.
[17] W. Knight, "The dark secret at the heart of ai," MIT Technology Review, 2017.
[18] D. Castelvecchi, "Can we open the black box of AI?" Nature News, vol. 538, no. 7623, p. 20, 2016.
[19] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," Proceedings of the International Conference on Machine Learning (ICML), 2017.
[20] Z. C. Lipton, "The mythos of model interpretability," International Conference on Machine Learning (ICML) Workshop, 2016.
[21] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[22] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, "Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization," arXiv preprint arXiv:1610.02391, 2016.
[23] J. Lu, T. Issaranon, and D. Forsyth, "Safetynet: Detecting and rejecting adversarial examples robustly," ICCV, 2017.
[24] Y. Wu, D. Bamman, and S. Russell, "Adversarial training for relation extraction," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1779–1784.
[25] Y. Dong, H. Su, J. Zhu, and F. Bao, "Towards interpretable deep neural networks by leveraging adversarial examples," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, "Sok: Towards the science of security and privacy in machine learning," arXiv preprint arXiv:1611.03814, 2016.
[27] B. Biggio, B. Nelson, and P. Laskov, "Poisoning attacks against support vector machines," arXiv preprint arXiv:1206.6389, 2012.
[28] F. Roli, B. Biggio, and G. Fumera, "Pattern recognition systems under attack," in Iberoamerican Congress on Pattern Recognition. Springer, 2013, pp. 1–8.
[29] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli, "Is feature selection secure against training data poisoning?" in International Conference on Machine Learning, 2015, pp. 1689–1698.
[30] M. Mozaffari-Kermani, S. Sur-Kolay, A. Raghunathan, and N. K. Jha, "Systematic poisoning attacks on and defenses for machine learning in healthcare," IEEE journal of biomedical and health informatics, vol. 19, no. 6, pp. 1893–1905, 2015.
[31] A. Beatson, Z. Wang, and H. Liu, "Blind attacks on machine learners," in Advances in Neural Information Processing Systems, 2016, pp. 2397–2405.
[32] S. Alfeld, X. Zhu, and P. Barford, "Data poisoning attacks against autoregressive models," in AAAI, 2016, pp. 1452–1458.
[33] N. Papernot, P. D. McDaniel, and I. J. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," CoRR, vol. abs/1605.07277, 2016.
[34] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 1322–1333.
[35] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 308–318.
[36] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in Security and Privacy (SP), 2017 IEEE Symposium on. IEEE, 2017, pp. 3–18.
[37] Y. Bengio, Y. LeCun et al., "Scaling learning algorithms towards ai," Large-scale kernel machines, vol. 34, no. 5, pp. 1–41, 2007.
[38] D. Storcheus, A. Rostamizadeh, and S. Kumar, "A survey of modern questions and challenges in feature extraction," in Feature Extraction: Modern Questions and Challenges, 2015, pp. 1–18.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[41] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li, "Single shot text detector with regional attention," in The IEEE International Conference on Computer Vision (ICCV), vol. 6, no. 7, 2017.
[42] C. Yan, H. Xie, J. Chen, Z.-J. Zha, X. Hao, Y. Zhang, and Q. Dai, "An effective uyghur text detector for complex background images," IEEE Transactions on Multimedia, 2018.
[43] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[44] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[45] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8595–8598.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[48] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in AAAI, 2017, pp. 4278–4284.
[49] Y. LeCun, C. Cortes, and C. Burges, "The mnist data set," 1998.
[50] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical report, University of Toronto, 2009.
[51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[52] M. Barreno, B. Nelson, A. D. Joseph, and J. Tygar, "The security of machine learning," Machine Learning, vol. 81, no. 2, pp. 121–148, 2010.
[53] N. Dalvi, P. Domingos, S. Sanghai, and D. Verma, "Adversarial classification," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 99–108.


[54] D. Lowd and C. Meek, "Adversarial learning," in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005, pp. 641–647.
[55] B. Biggio, G. Fumera, and F. Roli, "Multiple classifier systems for robust classifier design in adversarial environments," International Journal of Machine Learning and Cybernetics, vol. 1, no. 1-4, pp. 27–41, 2010.
[56] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndic, P. Laskov, G. Giacinto, and F. Roli, "Evasion attacks against machine learning at test time," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 387–402.
[57] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, "Can machine learning be secure?" in Proceedings of the 2006 ACM Symposium on Information, computer and communications security. ACM, 2006, pp. 16–25.
[58] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1528–1540.
[59] A. Rozsa, E. M. Rudd, and T. E. Boult, "Adversarial diversity and hard positive generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016, pp. 25–32.
[60] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[61] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in Security and Privacy (EuroS&P), 2016 IEEE European Symposium on. IEEE, 2016, pp. 372–387.
[62] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "Deepfool: a simple and accurate method to fool deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2574–2582.
[63] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Security and Privacy (S&P), 2017 IEEE Symposium on. IEEE, 2017, pp. 39–57.
[64] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, "Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models," arXiv preprint arXiv:1708.03999, 2017.
[65] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, "Universal adversarial perturbations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[66] J. Su, D. V. Vargas, and S. Kouichi, "One pixel attack for fooling deep neural networks," arXiv preprint arXiv:1710.08864, 2017.
[67] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet, "Adversarial manipulation of deep representations," Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[68] Z. Zhao, D. Dua, and S. Singh, "Generating natural adversarial examples," arXiv preprint arXiv:1710.11342, 2017.
[69] Y. Liu, X. Chen, C. Liu, and D. Song, "Delving into transferable adversarial examples and black-box attacks," Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[70] N. Carlini, G. Katz, C. Barrett, and D. L. Dill, "Ground-truth adversarial examples," arXiv preprint arXiv:1709.10207, 2017.
[71] P. Tabacof and E. Valle, "Exploring the space of adversarial images," in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 426–433.
[72] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[73] Y. Dong, F. Liao, T. Pang, H. Su, X. Hu, J. Li, and J. Zhu, "Boosting adversarial attacks with momentum," arXiv preprint arXiv:1710.06081, 2017.
[74] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
[75] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," arXiv preprint arXiv:1705.07204, 2017.
[76] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret, "Robots that can adapt like animals," Nature, vol. 521, no. 7553, pp. 503–507, May 2015.
[77] N. Carlini and D. Wagner, "Adversarial examples are not easily detected: Bypassing ten detection methods," AISEC, 2017.
[78] ——, "Magnet and efficient defenses against adversarial attacks are not robust to adversarial examples," arXiv preprint arXiv:1711.08478, 2017.
[79] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[80] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[81] G. D. Evangelidis and E. Z. Psarakis, "Parametric image alignment using enhanced correlation coefficient maximization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1858–1865, 2008.
[82] J. R. Flynn, S. Ward, J. Abich, and D. Poole, "Image quality assessment using the ssim and the just noticeable difference paradigm," in International Conference on Engineering Psychology and Cognitive Ergonomics. Springer, 2013, pp. 23–30.
[83] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer, "Reluplex: An efficient smt solver for verifying deep neural networks," arXiv preprint arXiv:1702.01135, 2017.
[84] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, "Adversarial attacks on neural network policies," arXiv preprint arXiv:1702.02284, 2017.
[85] J. Kos and D. Song, "Delving into adversarial attacks on deep policies," ICLR Workshop, 2017.
[86] J. Kos, I. Fischer, and D. Song, "Adversarial examples for generative models," arXiv preprint arXiv:1702.06832, 2017.
[87] P. Tabacof, J. Tavares, and E. Valle, "Adversarial images for variational autoencoders," NIPS Workshop, 2016.
[88] V. Fischer, M. C. Kumar, J. H. Metzen, and T. Brox, "Adversarial examples for semantic image segmentation," ICLR 2017 workshop, 2017.
[89] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, "On detecting adversarial perturbations," Proceedings of 5th International Conference on Learning Representations (ICLR), 2017.
[90] R. Jia and P. Liang, "Adversarial examples for evaluating reading comprehension systems," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.
[91] J. Li, W. Monroe, and D. Jurafsky, "Understanding neural networks through representation erasure," arXiv preprint arXiv:1612.08220, 2016.
[92] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, "Adversarial examples for malware detection," in European Symposium on Research in Computer Security. Springer, 2017, pp. 62–79.
[93] H. S. Anderson, A. Kharkar, B. Filar, and P. Roth, "Evading machine learning malware detection," Black Hat, 2017.
[94] W. Hu and Y. Tan, "Generating adversarial malware examples for black-box attacks based on gan," arXiv preprint arXiv:1702.05983, 2017.
[95] H. S. Anderson, J. Woodbridge, and B. Filar, "Deepdga: Adversarially-tuned domain generation and detection," in Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security (AISec). ACM, 2016, pp. 13–21.
[96] W. Xu, Y. Qi, and D. Evans, "Automatically evading classifiers," in Network and Distributed System Security Symposium (NDSS), 2016.
[97] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[98] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," arXiv preprint arXiv:1608.05148, 2016.
[99] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," Conference on Computer Vision and Pattern Recognition, 2017.
[100] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv preprint arXiv:1512.09300, 2015.
[101] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., "Deep face recognition," in BMVC, vol. 1, no. 3, 2015, p. 6.
[102] J. Lu, H. Sibai, E. Fabry, and D. Forsyth, "Standard detectors aren't (currently) fooled by physical adversarial stop signs," arXiv preprint arXiv:1710.03337, 2017.
[103] J. Hendrik Metzen, M. Chaithanya Kumar, T. Brox, and V. Fischer, "Universal adversarial perturbations against semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2755–2764.


[104] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "Squad: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016.
[105] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, "Droid-sec: deep learning in android malware detection," in ACM SIGCOMM Computer Communication Review, vol. 44, no. 4. ACM, 2014, pp. 371–372.
[106] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, "Large-scale malware classification using random projections and neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 3422–3426.
[107] J. Saxe and K. Berlin, "Deep neural network based malware detection using two dimensional binary program features," in Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on. IEEE, 2015, pp. 11–20.
[108] R. Sun, X. Yuan, P. He, Q. Zhu, A. Chen, A. Gregio, D. Oliveira, and L. Xiaolin, "Learning fast and slow: Propedeutica for real-time malware detection," arXiv preprint arXiv:1712.01145, 2017.
[109] A. N. Bhagoji, D. Cullina, and P. Mittal, "Dimensionality reduction as a defense against evasion attacks on machine learning classifiers," arXiv preprint arXiv:1704.02654, 2017.
[110] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, "Detecting adversarial samples from artifacts," arXiv preprint arXiv:1703.00410, 2017.
[111] Z. Gong, W. Wang, and W.-S. Ku, "Adversarial and clean data are not twins," arXiv preprint arXiv:1704.04960, 2017.
[112] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel, "On the (statistical) detection of adversarial examples," arXiv preprint arXiv:1702.06280, 2017.
[113] D. Hendrycks and K. Gimpel, "Early methods for detecting adversarial images," ICLR Workshop, 2017.
[114] D. Meng and H. Chen, "Magnet: a two-pronged defense against adversarial examples," CCS, 2017.
[115] T. Pang, C. Du, Y. Dong, and J. Zhu, "Towards robust detection of adversarial examples," arXiv preprint arXiv:1706.00633, 2017.
[116] Y.-C. Lin, M.-Y. Liu, M. Sun, and J.-B. Huang, "Detecting adversarial attacks on neural network policies with visual foresight," arXiv preprint arXiv:1710.00814, 2017.
[117] S. Gu and L. Rigazio, "Towards deep neural network architectures robust to adversarial examples," Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[118] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman, "Pixeldefend: Leveraging generative models to understand and defend against adversarial examples," arXiv preprint arXiv:1710.10766, 2017.
[119] D. Gopinath, G. Katz, C. S. Pasareanu, and C. Barrett, "Deepsafe: A data-driven approach for checking adversarial robustness in neural networks," arXiv preprint arXiv:1710.00486, 2017.
[120] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, "Towards proving the adversarial robustness of deep neural networks," arXiv preprint arXiv:1709.02802, 2017.
[121] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in Security and Privacy (SP), 2016 IEEE Symposium on. IEEE, 2016, pp. 582–597.
[122] R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári, "Learning with a strong adversary," arXiv preprint arXiv:1511.03034, 2015.
[123] J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani, "Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks," arXiv preprint arXiv:1707.02476, 2017.
[124] M. Abbasi and C. Gagné, "Robustness to adversarial examples through an ensemble of specialists," arXiv preprint arXiv:1702.06856, 2017.
[125] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Advances in neural information processing systems, 2014, pp. 2654–2662.
[126] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[127] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications," ICLR Poster, 2017.
[128] W. He, J. Wei, X. Chen, N. Carlini, and D. Song, "Adversarial example defense: Ensembles of weak defenses are not strong," in 11th USENIX Workshop on Offensive Technologies (WOOT 17). Vancouver, BC: USENIX Association, 2017.
[129] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, "The space of transferable adversarial examples," arXiv preprint arXiv:1704.03453, 2017.
[130] K. Pei, Y. Cao, J. Yang, and S. Jana, "Deepxplore: Automated whitebox testing of deep learning systems," Proceedings of the ACM SIGOPS 26th symposium on Operating systems principles, 2017.
[131] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry, "Adversarially robust generalization requires more data," arXiv preprint arXiv:1804.11285, 2018.
[132] A. Fawzi, O. Fawzi, and P. Frossard, "Fundamental limits on adversarial robustness," in Proc. ICML, Workshop on Deep Learning, 2015.
[133] T. Tanay and L. Griffin, "A boundary tilting persepective on the phenomenon of adversarial examples," arXiv preprint arXiv:1608.07690, 2016.
[134] A. Fawzi, H. Fawzi, and O. Fawzi, "Adversarial vulnerability for any classifier," arXiv preprint arXiv:1802.08686, 2018.
[135] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow, "Adversarial spheres," arXiv preprint arXiv:1801.02774, 2018.
[136] A. Fawzi, O. Fawzi, and P. Frossard, "Analysis of classifiers' robustness to adversarial perturbations," arXiv preprint arXiv:1502.02590, 2015.
[137] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, and A. Criminisi, "Measuring neural net robustness with constraints," in Advances in Neural Information Processing Systems, 2016, pp. 2613–2621.
[138] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, "Safety verification of deep neural networks," Computer Aided Verification: 29th International Conference (CAV), pp. 3–29, 2017.
[139] I. Goodfellow, N. Papernot, P. McDaniel, R. Feinman, F. Faghri, A. Matyasko, K. Hambardzumyan, Y.-L. Juang, A. Kurakin, R. Sheatsley, A. Garg, and Y.-C. Lin, "cleverhans v2.0.0: an adversarial machine learning library," arXiv preprint arXiv:1610.00768, 2017.
[140] J. Rauber, W. Brendel, and M. Bethge, "Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models," arXiv preprint arXiv:1707.04131, 2017. [Online]. Available: http://arxiv.org/abs/1707.04131

[141] A. Kurakin, I. Goodfellow, S. Bengio, Y. Dong, F. Liao, M. Liang,T. Pang, J. Zhu, X. Hu, C. Xie et al., “Adversarial attacks and defencescompetition,” arXiv preprint arXiv:1804.00097, 2018.

Xiaoyong Yuan is currently a PhD student in the Department of Computer & Information Science & Engineering at the University of Florida. His research interests include deep learning and security. He received a BS degree in mathematics from Fudan University in 2012 and an MS degree in software engineering from Peking University in 2015.

Pan He is currently a PhD student in the Department of Computer & Information Science & Engineering at the University of Florida. His research interests include deep learning and computer vision. He received a B.E. degree from the Department of Software Engineering, Sichuan University, in June 2015. As an enthusiastic researcher, his goal is to combine state-of-the-art computer vision algorithms with real-life industrial problems.


Qile Zhu is currently a PhD student in the Department of Computer & Information Science & Engineering at the University of Florida. His research interests include deep learning and natural language processing. He received a B.E. degree from Zhejiang University in 2013 and a master's degree from the University of Florida in 2015.

Dr. Xiaolin (Andy) Li is a Professor and University Term Professor in the Department of Electrical and Computer Engineering and the Department of Computer & Information Science & Engineering (affiliated) at the University of Florida. He is the founding Director of the National Science Foundation Center for Big Learning, the first NSF center on deep learning, with UF (lead), CMU, UO, and UMKC. He is also Director of the Large-scale Intelligent Systems Laboratory (Li Lab). His research interests include Big Data, Machine Learning, Deep Learning, Cloud Computing, Intelligent Platforms, HPC, and Security & Privacy for Health, Precision Medicine, IoT, CV, NLP, Robotics, Genomics, and Science, Engineering, and Business. He received a PhD degree in Computer Engineering from Rutgers University. He has published over 120 peer-reviewed papers in journals and conference proceedings, 5 books, and 4 patents (three licensees). His team has created many software systems and tools. He is a recipient of the National Science Foundation CAREER Award, the Internet2 Innovative Application Award, the NSF I-Corps Top Team Award, Top Team (DeepBipolar) in the CAGI Challenge on detecting bipolar disorder, and best paper awards (IEEE ICMLA 2016, IEEE SECON 2016, ACM CAC 2013, and IEEE UbiSafe 2007). For more information, see http://www.andyli.ece.ufl.edu and http://nsfcbl.org.