
UNIVERSIDAD POLITÉCNICA DE MADRID

MASTER THESIS

Adversarial Attacks on Semantic

Segmentation in the Physical World

Author: Fabian EITEL

Supervisor: Prof. Luis BAUMELA

Dr. Jan Hendrik METZEN

A thesis submitted in fulfillment of the requirements for the degree of Master of Science at UPM

In cooperation with

Bosch Center for Artificial Intelligence

Robert Bosch GmbH

July 21, 2017


Declaration of Authorship

I, Fabian EITEL, declare that this thesis titled, “Adversarial Attacks on Semantic Segmentation in the Physical World”, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

Signed:

Date:


Universidad Politécnica de Madrid
Robert Bosch GmbH
Master of Science

Abstract

Adversarial Attacks on Semantic Segmentation in the Physical World

by Fabian EITEL

Machine learning systems, and especially deep neural networks, have been shown to be vulnerable to adversarial perturbations. Adversarial perturbations are a kind of noise that is obtained through optimization and added to images in order to fool a machine learning system. These perturbations are able to attack a variety of tasks including image classification, object detection and semantic segmentation. Furthermore, it has been shown that adversarial examples can be crafted so that they fool classifiers acting on them when printed, and that they can also fool face recognition systems when printed on a frame of glasses and worn by the adversary. In other words, adversarial examples can be transferred to the physical world. Because today there are self-driving vehicles and other safety-critical applications deployed using state-of-the-art semantic segmentation models, the goal of this thesis is to investigate whether adversarial perturbations for semantic segmentation can be transferred to the physical world as well.


Universidad Politécnica de Madrid
Robert Bosch GmbH
Master of Science

Abstract

Adversarial Attacks on Semantic Segmentation in the Physical World

by Fabian EITEL

Machine learning systems, and especially deep neural networks, have proven to be vulnerable to "adversarial perturbations". These are a kind of noise that is obtained through optimization and added to images in order to fool a machine learning system. These perturbations are capable of attacking a wide variety of tasks, including image classification, object detection and semantic segmentation. Furthermore, it has been shown that adversarial examples can be crafted which, when printed, fool the classifiers acting on their images, and which can also fool face recognition systems when printed on the frame of a pair of glasses worn by the adversary. In other words, adversarial examples can be transferred to the physical world. Because nowadays self-driving vehicles and other safety-critical applications are deployed using state-of-the-art semantic segmentation models, the goal of this thesis is to investigate whether adversarial perturbations for semantic segmentation can also be transferred to the physical world.


Acknowledgements

I would like to sincerely thank my supervisor Dr. Jan Hendrik Metzen as well as Dr. Volker Fischer for their timely support, helpful intuitions and interesting discussions, which have helped me during the work for this thesis.

I would like to express my sincere thanks to my supervisor Prof. Luis Baumela for the support of this thesis and the fruitful advice.

I appreciate all the help, advice and fun times I had with all other colleagues and fellow students at the Bosch Center for Artificial Intelligence.

Lastly, I would like to thank the developers of Theano, Keras and the Cityscapes dataset.


Contents

Declaration of Authorship

Abstract

Abstract

Acknowledgements

1 Introduction
  1.1 Problem Statement

2 Deep Learning
  2.1 Optimization and Machine learning
  2.2 Deep Learning
    2.2.1 Perceptrons
    2.2.2 Fully Connected Neural Networks
    2.2.3 Overview of Activation Functions
    2.2.4 Overview of Loss Functions
    2.2.5 Convolutional Neural Networks
    2.2.6 Optimization Strategies in Deep Learning
      Gradient Descent
      Stochastic Gradient Descent
      Momentum
      Per-Parameter Adaptive Learning Rate Algorithms
    2.2.7 Backpropagation
  2.3 Deep Learning in Computer Vision
    2.3.1 Image Classification
    2.3.2 Object Detection
    2.3.3 Semantic Segmentation
    2.3.4 Instance Segmentation
    2.3.5 Evaluation metrics

3 Datasets
  3.1 Datasets for Classification
  3.2 Datasets for Semantic Segmentation

4 Adversarial Attacks
  4.1 Overview of Adversarial Attacks
    4.1.1 Adversarial Goals and Capabilities
  4.2 Methods for Creating Adversarial Examples
    4.2.1 Fast Methods
    4.2.2 Iterative Methods
    4.2.3 Targeted Methods
    4.2.4 Jacobian-based Saliency Map Approach
    4.2.5 Other Methods
  4.3 Properties of Adversarial Examples
    4.3.1 Attacks in the Physical World
    4.3.2 Transferability and Black-Box Attacks
    4.3.3 Universal Perturbations
    4.3.4 Attacks on Semantic Segmentation
    4.3.5 Attacks on Face Recognition Systems in the Physical World

5 Creating Adversarial Examples for the Physical World
  5.1 Adversarial Goal
  5.2 Experimental Setup
  5.3 Introducing Parallel Computation in GPU
  5.4 Evaluation Method
  5.5 Baseline Using Digital Noise
  5.6 Addressing Noise Restrictiveness
  5.7 Addressing Rotational Invariance
    5.7.1 Methodology
    5.7.2 Evaluation
  5.8 Addressing Scaling Invariance
    5.8.1 Methodology
    5.8.2 Evaluation
  5.9 Addressing Object Size Invariance
    5.9.1 Methodology
    5.9.2 Evaluation
  5.10 Addressing Physical Realizability
    5.10.1 Methodology Smoothing
    5.10.2 Evaluation Smoothing
    5.10.3 Methodology Printability
    5.10.4 Evaluation Printability
  5.11 Intensified Perturbations
    5.11.1 Methodology
  5.12 Source/Target Misclassification
  5.13 Other Experiments

6 Conclusion

A Appendix A
  A.1 Implementation Details Smoothing Loss

Bibliography


List of Figures

2.1 Fully connected neural network.
2.2 VGG-16 Architecture.
2.3 Object detection prediction example.
2.4 Image semantic segmentation example.

4.1 Adversarial examples as illustrated in Szegedy et al. [52]. (Left) Original images, correctly classified, (center) adversarial perturbation, (right) adversarial example. All adversarial examples have been classified as "ostrich, Struthio camelus".
4.2 Adversarial attack taxonomy.
4.3 Successful examples of adversarial perturbations against a face recognition system printed on glass frames: (top) adversary wearing the attack glasses, (bottom) impersonation target.

5.1 Adversarial Goal: A poster attack.
5.2 The powerwall from Robert Bosch GmbH in Renningen.
5.3 Prediction average of 119 images without a universal perturbation in the background.
5.4 Comparing different epsilon values to the respective destruction rate of human pixels.
5.5 Universal perturbation created for the digital environment. It represents the baseline attack.
5.6 Learning the perturbation by applying it everywhere except on the target objects.
5.7 Intermediate adversarial perturbation during training, as applied above.
5.8 Comparison of the digital noise with the rotated version regarding rotation.
5.9 Universal perturbation using rotation.
5.10 Comparison of the rotational perturbation with the benchmark (22,588 pixels) on the powerwall.
5.11 Comparison of the digital noise with the scaled version regarding scaling.
5.12 Comparison of the scaling perturbation with the benchmark (22,588 pixels) on the powerwall.
5.13 Comparison on the powerwall of the noise where the objects were increased in size to the benchmark (22,588 pixels).
5.14 Comparison on the powerwall of the perturbation using the smoothing loss to the benchmark (22,588 pixels).
5.15 Comparison on the powerwall of the perturbation using the smoothing layer to the benchmark (22,588 pixels).
5.16 A 10th of the RGB color palette with equal spacing.
5.17 Captured photo of the RGB color palette from the powerwall.
5.20 30 triplets with the smallest variance in difference from the entire RGB space used.
5.21 30 triplets with the smallest variance in difference from the captured RGB space used.
5.18 3 random examples of the color mapping using the Euclidean distance. Explanation in the text.
5.19 3 random examples of the color mapping using the CIEDE2000 distance. Explanation in the text.
5.22 Digital2transformed transformation: (left) no transformation applied, (center) transformation applied everywhere, (right) masked D2T transformation applied on background only.
5.23 Comparison on the powerwall of the perturbation using the NPS to the benchmark (22,588 pixels).
5.24 Comparison of the perturbation using the NPS and the physical world layer to the benchmark (22,588 pixels).
5.25 Comparison of the perturbation using the smoothing loss factored by 2 to the benchmark (22,588 pixels).
5.26 Comparison of the perturbation using the smoothing loss factored by 4 to the benchmark (22,588 pixels).
5.27 Prediction results: (top row) non-printability score, (bottom row) scaling noise, (left) captured images, (right) network predictions.

A.1 Evaluation of perturbations using the NPS and the smoothing loss on the powerwall.
A.2 Sheared images from Cityscapes dataset, sheared by 1.0 and -1.0 degree.
A.3 Matrix subtraction for smoothing loss.


List of Tables

3.1 Cityscapes classes and their groups.

5.1 Source/Target misclassification attack results.

6.1 Misclassification attack results summary.


Acronyms

ANN artificial neural network

BMVA The British Machine Vision Association and Society for Pattern Recognition

CNN Convolutional Neural Network

D2T digital2transformed

FCIS Fully Convolutional Instance-aware Semantic Segmentation

FCN Fully Convolutional Neural Network

i.i.d. independent and identically distributed

ILSVRC ImageNet Large Scale Visual Recognition Challenge

IoU Intersection over Union

MLP multilayer perceptron

NPS Non-printability Score

ReLU Rectified Linear Unit

SGD Stochastic Gradient Descent

tanh Hyperbolic Tangent

TV total variation


Chapter 1

Introduction

The success of machine learning in recent years has led to a new hype around Artificial Intelligence (AI). Some machine learning models have even achieved better-than-human performance on specific, limited tasks. A multitude of fields can be approached with machine learning, and barely any can be safely excluded. Regression methods can be used to forecast stock prices, clustering allows automatic segmentation of customers into separate groups, and kernel methods can help to identify cancer. Deep learning methods are able to learn features directly from the input and to take spatial arrangements into consideration. This allows them to learn how to drive cars, translate between languages and analyze the sentiment of written or spoken paragraphs. Furthermore, as deep learning models can store representations, they can also be used generatively: they can create art, generate somewhat meaningful text and converse in dialogues.

Computational limitations and the unavailability of large datasets long prevented deep learning from having the success it is having today. Surprisingly, the core algorithms used in deep learning today are incremental improvements of those developed 20 years ago. Over time, the gaming industry has pushed the development of better graphical processing units (GPUs), which are able to operate highly in parallel. Deep learning methods can be broken down into simple matrix operations which benefit from this parallelization. Hence, the computational limitations are no longer preventive.¹

¹ Needless to say, state-of-the-art results are hard to achieve with consumer-grade hardware.


The second limitation is that supervised deep learning requires large amounts of data. Here the data collection by large companies, the increased participation of citizens on the Internet through social media, and the trend toward a connected world (i.e. the Internet of Things) have led to a skyrocketing availability of data. Although most of this data is proprietary, as it is a major competitive advantage, many datasets have been released to fuel further research. Examples are the ImageNet database, YouTube-8M and CIFAR-10.

1.1 Problem Statement

The goal of this thesis is to increase the safety of self-learning systems against potential attacks. Specifically, it practically evaluates the transferability of adversarial attacks on neural networks trained for semantic segmentation into the physical world. This evaluation should encourage further research into the safety of machine learning and AI.

Adversarial examples are samples which are manipulated so that a classifier will no longer classify them correctly, while a human would still perceive them correctly. It has been shown in Kurakin, Goodfellow, and Bengio [27] that images specifically created to fool a neural network can survive a transformation to the physical world, i.e. they can be printed and photographed and still fool a model. Some of the applications described in the previous section are already being addressed by deep learning models in deployed systems. Semantic segmentation, which is a pixel-wise classification, is especially relevant for situations that require a fine-grained understanding of the surroundings. A popular field, in which many industrial players are also active, is the application of semantic segmentation to self-driving vehicles. In this scenario it is crucial to be able to prevent any kind of attack against the visual understanding systems of the vehicle. As a first step it is therefore necessary to find a variety of such attacks and discover them before an actual adversary does.

In this work, I have shown that adversarial examples for semantic segmentation cannot directly transfer to the physical world. Furthermore, I have introduced novel methods that increase the likelihood of adversarial perturbations to


transfer to the physical world. I have used these methods to craft adversarial examples based on the Cityscapes dataset and shown that, when attacking human pixels specifically, more than a quarter of them can be removed on average, even though the attacks are tested on a non-i.i.d. dataset. Nevertheless, fully replacing human pixels with a specific target class remains challenging.


Chapter 2

Deep Learning

2.1 Optimization and Machine learning

In machine learning the goal is to reduce the risk of a model. A machine learning model is supposed to infer information about a population. Since it is typically too expensive or impossible to inspect the entire population X and use it to build a model F, one needs to inspect a subset X_train of the entire population and find a set of parameters θ that, when built into a model F, have the highest likelihood of inferring the correct information about the inspected subset X_train. Finding these optimal parameters is called maximum likelihood estimation.

Optimization is the task of finding a maximum or a minimum of an objective function F(X) by altering X. In machine learning the risk of a model is typically related to some performance measure P, like the accuracy of a classification or the mean average precision in object detection. The objective function in machine learning, often named loss function or error function, is typically denoted as J(θ). Hence we are optimizing P only indirectly, which differentiates machine learning from classical optimization.

The performance is measured on a separate subset X_test that needs to be independent and identically distributed (i.i.d.). It is therefore required to collect at least two separate subsets X_train and X_test from the same distribution. The split is necessary to prevent the model from learning only information about the collected samples and not being able to generalize to unseen data. We learn θ by using the training set X_train and then check whether our learning algorithm converges to a


generalizable optimum by testing it on the test set X_test. Note that in practice one often intends to predict on a dataset which is not identically distributed with the training set. This is also the case in this thesis, where the training set is part of the Cityscapes dataset, whereas my evaluation is done on data I created myself.

The ability of a model to learn is limited by its capacity. Models with a higher complexity tend to have larger capacities. For example, a neural network with 20 5-unit layers has a higher capacity than a neural network with 2 5-unit layers. This means it can approximate functions with higher non-linearity. Different types of models also have different properties that influence their capacity. As a rule of thumb, it can be said that models with a higher capacity are harder to train and require more training data.

2.2 Deep Learning

Deep learning is a type of representation learning that learns abstractions of the underlying data by itself. In contrast, in traditional machine learning one of the main challenges is to handcraft a variety of features, which is called feature engineering. This typically requires deep knowledge of the specific field (i.e. domain intelligence) as well as knowledge about the model class itself. Therefore it is nearly impossible to transfer a model from one task to another. Deep learning uses many layers of non-linearities to learn these features, creating a deep computation graph. This kind of learning requires more data than typical machine learning and is in turn expensive to process. Nonetheless it works especially well in practice, as shown in Krizhevsky, Sutskever, and Hinton [25].

2.2.1 Perceptrons

The perceptron is a binary classifier that was first introduced in Rosenblatt [43] and was inspired by biophysics and psychology. The goal of the research was to understand how information from the physical world is processed and used for cognitive tasks. It loosely resembles the biological neuron, which fires if the neuron's activation is larger than some threshold.


f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)

The perceptron as depicted in equation 2.1 is a binary activation function that returns 1 if the inner term is larger than 0 and returns 0 otherwise. The inner term is the inner product of a parameter vector w and an input tensor x. Additionally, a bias term or offset b is added as well. It becomes clear that a single perceptron is able to model the bitwise operators NOT, AND, and OR. However, modeling an XOR operator is impossible. The XOR operator requires a more complex model in a higher dimension that can make the problem linearly separable. The combination of multiple perceptrons, called a multilayer perceptron (MLP), can overcome this shortcoming.
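As an illustration of equation 2.1, the following NumPy sketch (with hypothetical, hand-picked weights; not part of the thesis implementation) shows a single perceptron modeling the AND and OR operators; no choice of w and b reproduces XOR.

import numpy as np

def perceptron(x, w, b):
    # Binary activation from equation 2.1: 1 if w . x + b > 0, else 0.
    return 1 if np.dot(w, x) + b > 0 else 0

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]

# AND: fires only when both inputs are 1.
print([perceptron(x, w=np.array([1.0, 1.0]), b=-1.5) for x in inputs])  # [0, 0, 0, 1]

# OR: fires when at least one input is 1.
print([perceptron(x, w=np.array([1.0, 1.0]), b=-0.5) for x in inputs])  # [0, 1, 1, 1]

# XOR would require outputs [0, 1, 1, 0], which no single linear decision
# boundary w . x + b can produce; a multilayer perceptron is needed.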

2.2.2 Fully Connected Neural Networks

Nowadays the literature refers to MLPs as artificial neural networks (ANN), or simply neural networks in their simplest, feedforward form, meaning a network with multiple hidden layers and no recurrence. Hidden layers are those that are neither the input nor the output of the model. Hornik, Stinchcombe, and White [19] have theoretically proven that feedforward MLPs are universal in their capability to approximate any measurable function to any degree of accuracy. This is known as the universal approximation theorem.

Fully connected neural networks as depicted in 2.1 are special in the sense that any neuron in a layer l_i is connected to every neuron in layer l_{i+1}, for every layer i except for the output layer. The hidden layers in 2.1 are denoted with H1 and H2. The vertices represent neurons and the edges represent the inner product between the previous neuron and the corresponding weight. The neurons in the hidden layers compute an activation. Therefore each hidden layer is similar to a perceptron and can be activated individually, with the difference being that hidden layers use more complex activation functions (sect. 2.2.3) rather than a simple perceptron.


FIGURE 2.1: Fully connected neural network.

2.2.3 Overview of Activation Functions

Activation functions indicate how relevant a neuron's output is for further processing. If the activation function returns a large value, the information of the neuron is important, while a small or even negative value suggests ignoring the information.

Traditionally, the most used activation function used to be the sigmoid function. It is defined as

\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.2)

and draws an 'S'-curve, typically having a range within [0, 1]. Although it is a bounded, differentiable real function with a non-negative derivative at each point, it has two major drawbacks, which prevent it from being used in practice anymore:

• Vanishing gradient problem: The vanishing gradient problem [48] is an often discussed problem which many neural network architectures faced. In the sigmoid function, all inputs that produce an output close to the tails at 0 or 1 will have a gradient close to zero. Following the backpropagation algorithm (sect. 2.2.7), the gradient is passed along the layers using the chain


rule. Hence the close-to-zero gradients are multiplied by one another, leading to an exponential decrease of the gradient. This makes training of early layers and training of sequence models very hard. Solutions have been proposed in Schmidhuber [48], Hochreiter and Schmidhuber [18], and He et al. [16], and recently for reinforcement learning in Salimans et al. [46], but it remains an open research challenge.

• Outputs are not zero-centered: In machine learning it is advisable to normalize data before training a model, which includes zero-centering it. If the output of a hidden layer is not zero-centered, every higher layer will train on non-normalized data.

To overcome the second issue, the Hyperbolic Tangent (tanh) non-linearity can be used. It is simply a scaled version of the sigmoid non-linearity, defined as \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, and squashes the values into [−1, 1]. Therefore the tanh function is zero-centered but nonetheless cannot prevent the vanishing gradient problem.

The Rectified Linear Unit (ReLU), introduced in Glorot, Bordes, and Bengio [13], has become the recommended activation function [28], working especially well in practice. The ReLU activation

ReLU(z) = max(0, z) (2.3)

is a piece-wise linear function of the input which sets everything less than or equal to zero to zero. In contrast to the previously described activations, the ReLU activation has two notable features: a) it is non-saturating (i.e. its range is unbounded above) and b) it is piece-wise linear. It was shown in Glorot, Bordes, and Bengio [13] and Krizhevsky, Sutskever, and Hinton [26] that ReLU units enable much faster training, and it was suggested that these two features contribute largely to the speed improvements, together with their computationally simple form and high sparsity. Furthermore, ReLU units are less prone to the vanishing gradient problem, because gradients are typically not close to zero, unlike with sigmoid or tanh activations. On the other hand, the ReLU activation is not differentiable at zero (only at values close to zero) and all negative inputs return a zero gradient.


The latter can cause neurons to 'die': they are never activated again and their weights stop being updated.
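The three activations discussed above can be summarized in a few lines of NumPy (illustrative only); evaluating their gradients makes the saturation of the sigmoid and the 'dead' region of the ReLU explicit.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # range (0, 1), not zero-centered

def tanh(z):
    return np.tanh(z)                    # range (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)            # range [0, +inf), non-saturating

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # close to 0 at the tails -> vanishing gradient

def relu_grad(z):
    return (z > 0).astype(z.dtype)       # 1 for positive inputs, 0 otherwise ("dead" units)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(z))   # tiny values at the tails
print(relu_grad(z))      # [0., 0., 0., 1., 1.]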

2.2.4 Overview of Loss Functions

The final layer or output layer of the ANN consists of an activation function on which a loss is computed. In binary classification a sigmoid function would suffice, whereas in multi-class classification the softmax function is used most often. The softmax function not only squashes the values between 0 and 1, i.e. into a firing rate, it also generalizes the sigmoid function over K classes.

The softmax function is defined as

\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K \qquad (2.4)

for a K-dimensional vector z. Hence it returns a value in the range [0, 1] for each class j. The negative log of the softmax function for a given class j is the cross-entropy loss of that class. In practice the cross-entropy loss for all classes is computed and the minimum taken. The goal of the ANN is then defined as the categorical cross-entropy loss:

\min\; -\log\!\left(\frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}\right) \quad \text{for } j = 1, \ldots, K. \qquad (2.5)
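A NumPy sketch of equations 2.4 and 2.5 with a hypothetical three-class logit vector; subtracting max(z) before exponentiating keeps the computation numerically stable without changing the result.

import numpy as np

def softmax(z):
    # Equation 2.4: exponentiate and normalize the K logits.
    e = np.exp(z - np.max(z))   # shifting by max(z) avoids overflow, result is unchanged
    return e / np.sum(e)

def cross_entropy(z, true_class):
    # Equation 2.5: negative log of the softmax probability of the given class.
    return -np.log(softmax(z)[true_class])

z = np.array([2.0, 1.0, 0.1])            # hypothetical logits for K = 3 classes
print(softmax(z))                        # sums to 1, largest entry for class 0
print(cross_entropy(z, true_class=0))    # small loss if class 0 is correct
print(cross_entropy(z, true_class=2))    # larger loss if class 2 is correct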

2.2.5 Convolutional Neural Networks

So far, fully connected networks are able to learn representations of the underlying data, but cannot take any local correlation between different input features into account. With experiments on the visual cortex of cats, Hubel and Wiesel [21] have shown that there is a local correlation between neighboring cells; hence there is a local receptive field that responds to different areas in visual perception. According to LeCun and Bengio [29] there is similarly high local correlation in 2D on images and sound waves as well as in 1D on time series. That means that input features, for example pixels, that are spatially or temporally close are highly correlated. Another shortcoming of fully connected neural networks is


their large amount of parameters, which does not scale well to large images. An image in RGB space with 227x227 pixels requires 227*227*3 = 154,587 weights alone. Many more parameters would be needed to process high-resolution images, to add a bias term and to expand the network to more than one layer. In order to build better and more scalable feature extractors for spatially correlated data, the Convolutional Neural Network (CNN) architecture has been introduced.

There are two main new components: convolutional layers and pooling layers. Convolutional layers are, as the name suggests, applying a convolution to their input. Each convolution filter has a filter size that is typically much smaller than the layer's input, and unlike in fully-connected layers, a dot product is only calculated on that small patch. Each layer can have multiple filters, each of the same size. Filters, which are sometimes also referred to as kernels, are moved along both spatial dimensions (in a 3D image example) in a 2D convolution by some stride. Let's take as an example a network layer with a filter size of 5x5; this is called the receptive field of the layer. In computer vision convolutions are often used as edge detectors. A filter in a single hidden layer neural network that can detect horizontal lines in images, for example, will be moved along both axes to detect all horizontal lines no matter where they are in the image. This intuition motivates the parameter sharing of convolutional layers: each filter shares the same weights and bias no matter where it is applied on the image. If horizontal lines are important at some point in an image, they are likely to be important anywhere else. Higher-level features can then, for example, be trained to recognize a more complex object in any part of the image. This ensures a high translation invariance of CNNs. All filters are learned through backpropagation. Padding can be applied at the borders to ensure that the outermost pixels are included in the same number of convolutions as any other pixel and to retain the same image size after convolving the image. The number of filters defines the number of channels of the next layer. For example, a 10x10 image in RGB space is first represented as (10x10x3). Using 5 convolution kernels of size 3x3 with stride 1 and zero-padding of 1 leads to an output shape of (10x10x5). Without padding it would be an output of shape (8x8x5), and using no padding, a stride of 1 but a filter size of 5x5 would lead to an output of shape (6x6x5).
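The output shapes worked out above can be checked with a small helper function (a sketch, not the thesis code); per spatial dimension the standard formula is out = (in − filter + 2·padding)/stride + 1.

def conv_output_shape(in_size, filter_size, stride=1, padding=0, n_filters=1):
    # Spatial output size of a 2D convolution on a square input,
    # with the number of filters as the channel dimension.
    out = (in_size - filter_size + 2 * padding) // stride + 1
    return (out, out, n_filters)

# 10x10 RGB input, five 3x3 filters, stride 1, zero-padding of 1 -> (10, 10, 5)
print(conv_output_shape(10, 3, stride=1, padding=1, n_filters=5))
# Same filters without padding -> (8, 8, 5)
print(conv_output_shape(10, 3, stride=1, padding=0, n_filters=5))
# 5x5 filters, stride 1, no padding -> (6, 6, 5)
print(conv_output_shape(10, 5, stride=1, padding=0, n_filters=5))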


Pooling layers are used to reduce the number of features in each layer by taking some aggregation. As we are constantly adding channels in the convolutional layers, we need to ensure that the network does not grow too large to store efficiently. Pooling layers downsample the input along the spatial dimensions. The two main alternatives are average pooling and max pooling. Based again on some pooling size, stride and padding, inputs are either averaged within the pooling-size patch, or the maximum within the patch is taken.

In practice, convolutional layers (with activations) and pooling layers compose the representation learning part of CNNs. They are used as a feature extractor. The second part is the task-specific part, which is often composed of fully-connected layers and can be used for different tasks including classification, segmentation and regression. Pooling layers are used after one or more convolutional layers. Figure 2.2 shows the popular VGG16 network architecture as defined in Simonyan and Zisserman [50]. It shows the input size to each layer on top, which illustrates how the pooling layers reduce the spatial dimensions by a factor of 2 each time. This is caused by 2x2 pooling filters with a stride of 2. Furthermore it shows the use of fully-connected layers in the second half of the architecture.

FIGURE 2.2: VGG-16 Architecture.¹

¹ Taken from https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/


2.2.6 Optimization Strategies in Deep Learning

Gradient Descent

Most optimization algorithms that are used in deep learning are based on Gradient Descent. In gradient descent one updates the parameters of the model in the opposite direction of the gradient. Formally, at each iteration we compute the gradient and update

\theta \leftarrow \theta - \alpha \nabla_{\theta} J(X, \theta, y) \qquad (2.6)

for a loss function J(X, θ, y) and a learning rate α. The learning rate is a hyperparameter that needs to be tuned according to the task. The intuition behind gradient descent is simple: because of the non-convexity of neural network objective functions we cannot compute a closed-form solution but rather use an iterative approach. In each iteration we find the gradient at the current position and move in the opposite direction. Hence we descend down the slope towards a minimum. It is not guaranteed that a global minimum is found, but tuning the learning rate can help to escape local minima.
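As a minimal illustration of equation 2.6, the following sketch runs gradient descent on a toy quadratic objective (both the objective and the learning rate are hypothetical choices for illustration).

import numpy as np

def loss(theta):
    return np.sum((theta - 3.0) ** 2)    # toy objective with minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)           # analytic gradient of the toy objective

theta = np.array([0.0])                  # initial parameters
alpha = 0.1                              # learning rate (hyperparameter)
for step in range(100):
    theta = theta - alpha * grad(theta)  # equation 2.6
print(theta, loss(theta))                # theta approaches 3, loss approaches 0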

Stochastic Gradient Descent

In the basic form of gradient descent the gradient needs to be computed for the entire training dataset before an update is applied. Computing the expectation of the model loss can be very expensive in practice, and therefore in each iteration we can use an inner loop that randomly samples individual examples and then averages over their expectations. This is called Stochastic Gradient Descent (SGD). Most of the time, instead of sampling single examples, multiple examples are randomly sampled as mini batches and the gradient is taken over the entire mini batch at once, leading to mini-batch gradient descent. [14] motivates that it is sufficient to only use subsets in order to stochastically approximate the gradient of the entire set (i.e. the deterministic gradient): while the computation time grows linearly with the number of samples n in a mini batch, the standard error of the gradient estimate only shrinks with √n, i.e. less than linearly. For a mini batch of m examples, mini-batch gradient descent is defined as:

\theta \leftarrow \theta - \frac{\alpha}{m} \nabla_{\theta} \sum_{i=1}^{m} J(X_i, \theta, y_i). \qquad (2.7)

A special form of SGD is online learning, in which examples are drawn from a continuous stream. This is the opposite of offline learning, where a training set is fixed beforehand. Online learning is often used to improve an algorithm while it is deployed or to personalize a model to a user's behavior.
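A sketch of mini-batch SGD (equation 2.7) for a linear model with squared error; the synthetic data, batch size and learning rate are made up for illustration.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                        # synthetic inputs
true_theta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_theta + 0.1 * rng.randn(1000)    # noisy targets

theta = np.zeros(5)
alpha, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))           # draw mini batches at random each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error over the mini batch.
        g = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)
        theta -= alpha * g                    # equation 2.7 applied to one mini batch
print(theta)                                  # close to true_theta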

Momentum

In order to speed up the learning of SGD, the momentum algorithm has been introduced. It uses a velocity parameter v, inspired by Newtonian dynamics, which stores all gradient updates in an exponentially decaying fashion. The intuition is that gradients from prior iterations can have an accelerating effect on the learning. A physical analogy is a ball rolling down a hill: the velocity of the ball does not just depend on the gradient at each position but also on the velocity the ball already has, which is in turn influenced by previous gradients. Thus the learning steps now depend not only on the norm of a single gradient but also on the alignment of a sequence of gradients. If multiple gradients point in the same direction, the steps get larger and the algorithm can move more quickly in that direction. Using a decaying factor ε, the momentum algorithm is defined as

v \leftarrow \varepsilon v - \frac{\alpha}{m} \nabla_{\theta} \sum_{i=1}^{m} J(X_i, \theta, y_i) \qquad (2.8)

\theta \leftarrow \theta + v \qquad (2.9)

Nesterov Momentum is an addition to momentum that uses lookahead steps. Instead of evaluating the loss at the current θ, one can already add the velocity of the current iteration. This can be seen as a correction factor and is defined as


v \leftarrow \varepsilon v - \frac{\alpha}{m} \nabla_{\theta} \sum_{i=1}^{m} J(X_i, \theta + \varepsilon v, y_i) \qquad (2.10)

\theta \leftarrow \theta + v \qquad (2.11)
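Both update rules can be written as small helper functions (a sketch; grad_fn stands for the mini-batch gradient of equations 2.8–2.11 and the hyperparameter values are illustrative; the Nesterov variant follows the standard formulation, evaluating the gradient at the lookahead point θ + εv).

import numpy as np

def momentum_step(theta, v, grad_fn, alpha=0.01, eps=0.9):
    # Equations 2.8 and 2.9: accumulate an exponentially decaying velocity.
    v = eps * v - alpha * grad_fn(theta)
    return theta + v, v

def nesterov_step(theta, v, grad_fn, alpha=0.01, eps=0.9):
    # Equations 2.10 and 2.11: evaluate the gradient at the lookahead point.
    v = eps * v - alpha * grad_fn(theta + eps * v)
    return theta + v, v

# Toy usage on the quadratic objective with minimum at 3.
grad_fn = lambda th: 2.0 * (th - 3.0)
theta, v = np.zeros(1), np.zeros(1)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn)
print(theta)   # approaches 3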

Per-Parameter Adaptive Learning Rate Algorithms

The above algorithms have all used the same learning rate to scale the gradient updates. Their success can be hampered by regions in which there is a large gradient in one direction, but a very small gradient in another direction. These kinds of saddle points would cause the previous algorithms to wiggle along the small-gradient dimension as well. This can be prevented by using a separate learning rate for each parameter, which can be adapted during learning.

The AdaGrad algorithm as proposed in Duchi, Hazan, and Singer [8] is formally defined as

r_0 = 0, \quad r \leftarrow r + g^2 \qquad (2.12)

\theta \leftarrow \theta - \frac{\alpha g}{\sqrt{r} + \delta} \qquad (2.13)

for a gradient g and a small correction constant δ that ensures numerical stability. Parameters with a large partial derivative have their learning rate reduced rapidly, while parameters with small partial derivatives have their learning rate reduced only slightly.

RMSProp was introduced by Tieleman and Hinton [54] and is an adaptation of AdaGrad using an exponentially weighted average. In contrast to AdaGrad, which remembers all gradients equally, RMSProp decays the average exponentially over time. This can help to converge especially well once the algorithm finds a convex bowl. AdaGrad would still be influenced by initial gradients, while RMSProp can focus on the recent steps alone. The trade-off with RMSProp is that it requires an additional hyperparameter ρ which controls the length of the moving average, i.e. the decay rate. RMSProp is defined as:


r_0 = 0, \quad r \leftarrow \rho r + (1 - \rho) g^2 \qquad (2.14)

\theta \leftarrow \theta - \frac{\alpha g}{\sqrt{r} + \delta}. \qquad (2.15)

RMSProp can also be extended with momentum.

Another popular adaptive learning rate algorithm is Adam, as introduced in Kingma and Ba [23]. Adam, which stands for "adaptive moments", is somewhat an addition of momentum to RMSProp. For a time step t, at each iteration Adam is defined as

for s_0 = 0, r_0 = 0,

s \leftarrow \frac{\rho_1 s + (1 - \rho_1) g}{1 - \rho_1^t}

r \leftarrow \frac{\rho_2 r + (1 - \rho_2) g^2}{1 - \rho_2^t}

\theta \leftarrow \theta - \frac{\alpha s}{\sqrt{r} + \delta} \qquad (2.16)

for the first-order moment s, a.k.a. the momentum, and the (uncentered) second-order moment r. The denominators in both updates are bias corrections. As both s and r are initialized at 0, this would lead to slow learning, hence the bias is corrected for. RMSProp also uses the second-order moment but does not correct for the bias.
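A sketch of one Adam step in NumPy. The default hyperparameters (ρ1 = 0.9, ρ2 = 0.999, δ = 1e-8) follow the values commonly recommended by Kingma and Ba and are not taken from this thesis; unlike the compact notation above, the bias correction is applied to copies s_hat and r_hat, as in the original formulation, rather than folded into the stored moments.

import numpy as np

def adam_step(theta, g, s, r, t, alpha=0.01, rho1=0.9, rho2=0.999, delta=1e-8):
    # One Adam update (equation 2.16) with bias-corrected first and second moments.
    s = rho1 * s + (1.0 - rho1) * g          # first-order moment (momentum)
    r = rho2 * r + (1.0 - rho2) * g ** 2     # second-order (uncentered) moment
    s_hat = s / (1.0 - rho1 ** t)            # bias correction for s
    r_hat = r / (1.0 - rho2 ** t)            # bias correction for r
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# Toy usage on the quadratic objective with minimum at 3.
theta = np.zeros(1)
s, r = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2.0 * (theta - 3.0)
    theta, s, r = adam_step(theta, g, s, r, t)
print(theta)   # close to 3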

All of the optimization methods described above are based on first-order derivatives. Second-order methods exist as well and include Newton's method, BFGS and L-BFGS. Although using the Hessian can lead to desirable learning behaviour, it is rarely used in practice because its computation is very costly: it requires computing the second-order derivatives of all parameters, and large neural networks can have millions of parameters. As these methods are rarely used in practice and were not used in the implementation of this project, they will not be discussed here further.


2.2.7 Backpropagation

Backpropagation is a key part of algorithms for learning in neural networks. It became popular for training neural networks after Rumelhart, Hinton, and Williams [44] showed its advantages over other training methods. The general idea of backpropagation is to compute the derivative of some error measurement at the end of the network and to propagate it backwards through all layers, updating the weights and biases along the way. It is responsible for the backward propagation of errors obtained from an optimization function. A neural network with an input x, a single hidden layer H and a loss layer L can be described as the composition

NN(x) = L(H(x)). (2.17)

The gradient with respect to any intermediary value in a composition is defined by the chain rule through multiplication

\frac{\partial NN}{\partial x} = \frac{\partial NN}{\partial H}\,\frac{\partial H}{\partial x}. \qquad (2.18)

Algorithm 1 shows the backpropagation algorithm in pseudo-code. It updates the gradients on weights and biases for all layers iteratively. As explained in 2.2.5, the filters of convolutional layers are also learned through backpropagation. That is, all weight parameters of a convolutional layer are stored in a 3D matrix (in the case of 2D convolutions) of the shape (output_size_x, output_size_y, n_filters). Although the gradient for each neuron is computed individually, the gradients are summed across each dimension (i.e. there is one sum for each filter), which yields the single parameter update for the filter. This explains how gradients are computed in a parameter-sharing situation.


Algorithm 1 Backprop: This computation yields the gradients on the activations a(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.²

After the forward computation, compute the gradient on the output layer:
1: g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
2: for k = l, l−1, ..., 1 do
3:   Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
4:   g ← ∇_{a(k)} J = g ⊙ f′(a(k))
5:   Compute the gradients on weights and biases (including the regularization term, where needed):
6:   ∇_{b(k)} J = g + λ ∇_{b(k)} Ω(θ)
7:   ∇_{W(k)} J = g h^{(k−1)⊤} + λ ∇_{W(k)} Ω(θ)
8:   Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
9:   g ← ∇_{h(k−1)} J = W^{(k)⊤} g

² Taken from Goodfellow, Bengio, and Courville [14]


2.3 Deep Learning in Computer Vision

In Artificial Intelligence an agent can be described as an autonomous system that uses perception in order to execute some action. The subfield of computer science that addresses the agent's perception using visual features is called computer vision.

The British Machine Vision Association and Society for Pattern Recognition (BMVA) has defined computer vision as follows:

"Computer vision is concerned with the automatic extraction, analy-sis and understanding of useful information from a single image or asequence of images. It involves the development of a theoretical andalgorithmic basis to achieve automatic visual understanding."

There is a wide variety of applications that can be addressed using computer vision. Examples include robotic movement, board game playing, self-driving/flying vehicles, wildfire detection, building security and document digitalization. The perception for many of these tasks can be addressed using one or more of the techniques covered in this chapter. These tasks will next be described from a deep learning standpoint. Although there are many more traditional machine learning works in this field, deep learning approaches have outperformed them in many benchmarks and now represent the state of the art.

2.3.1 Image Classification

Image classification describes the task of assigning one or more most likely classes to an image. Each image has only a single correct class, and the goal is to return a probability distribution over all classes which assigns the highest probability to the correct class.

The VGG network in 2.2 was one of the first deep learning architectures that beat the state of the art in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Since then, only deep learning methods have been able to win the competition. In He et al. [16] researchers have shown that they can beat human performance on the ImageNet dataset with a new architecture termed ResNet.


2.3.2 Object Detection

Whereas in image classification the location of the object is irrelevant to the task and each image usually contains one main object, in object detection the goal is to identify multiple different objects and to determine their locations. The challenge is that this can be represented as two different tasks:

1. Regressing the correct location of an object

2. Classification of the object

FIGURE 2.3: Object detection prediction example.³

³ Taken from https://pjreddie.com/darknet/yolo/

Current state-of-the-art network architectures include Faster R-CNN [42], R-FCN [6], YOLO [41] and SSD [32]. Although all of these architectures were designed with a specific feature extractor (e.g. VGG, ResNet, etc.), they can be used


interchangeably. Their main difference lies in the task-specific part of the network. Huang et al. [20] have presented an extensive comparison of the speed and accuracy trade-offs each task-specific architecture (named meta-architecture in the paper) offers, taking into account different feature extractors to allow better comparison.

2.3.3 Semantic Segmentation

Safety-critical applications like self-driving cars benefit from a finer prediction than bounding boxes around objects. In a parking situation, often a few centimeters matter. Algorithms that achieve this kind of accuracy by doing pixel-wise classification are called semantic segmentation algorithms. This scenario is challenging since every single pixel needs to be classified well, making it especially expensive for the high-resolution images that modern commodity cameras produce. An example of a semantic segmentation annotation can be seen in figure 2.4.

FIGURE 2.4: Image semantic segmentation example.⁴

⁴ Taken from the Cityscapes dataset, Cordts et al. [5]

Most state-of-the-art deep learning approaches to semantic segmentation rely on the Fully Convolutional Neural Network (FCN) architecture introduced in Long, Shelhamer, and Darrell [34]. There are two main ideas that have been proposed by the authors:


1. Turning the fully connected classification layers into convolutional layersoutputs a coarse classification of the input while reducing computation time.Long et al. argue that if the receptive fields of the final layer significantlyoverlap they share a lot of computation. Computing the forward and back-ward propagation layer-by-layer for the entire image thus becomes muchfaster than iterating over single-patches as minibatches. Their result re-mains the same. In practice a 4096 neuron fully connected layer can beturned into a convolution of size (1x1x4096). In this view the fully con-nected layer is simply a convolution with a kernel size of their entire input.

2. Even though convolutional layers are size invariant, the output of the new convolutional layers is smaller than the original image size due to the subsampling done by the pooling layers in the feature extractor. In order to obtain a full-sized dense classification, the coarse output needs to be upsampled to the original image dimensions. For this, Long et al. use backwards strided convolution (also referred to as deconvolution and fractional convolution). It reverses the backward and forward pass of convolution by using an input stride of 1/f and an output stride of f. The benefits of using this transposed form of convolution are that it can be integrated into the network, allowing end-to-end learning, and that the layer can learn any kind of upsampling instead of a fixed one based on heuristics.
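As an illustration of the first idea, a VGG-style fully connected classifier head can be rewritten as convolutions. The following Keras sketch uses assumed layer sizes (a 7x7x512 feature map and 4096 units, as in VGG) and is not the exact FCN-8 head trained later in this thesis:

```python
# Illustrative sketch: a convolutionalized classification head. The layer sizes
# are assumptions for the example, not the exact FCN-8 head used in this thesis.
from keras.models import Sequential
from keras.layers import Conv2D

def convolutionalized_head(num_classes):
    head = Sequential()
    # fc6: a 4096-unit dense layer over a 7x7x512 feature map becomes a 7x7 convolution
    head.add(Conv2D(4096, (7, 7), activation='relu', input_shape=(None, None, 512)))
    # fc7: a dense-to-dense layer becomes a 1x1 convolution
    head.add(Conv2D(4096, (1, 1), activation='relu'))
    # scoring layer: a 1x1 convolution outputs a coarse per-location class score map
    head.add(Conv2D(num_classes, (1, 1)))
    return head
```

Because only convolutions remain, the same weights now produce a score map whose spatial size grows with the input instead of a single class vector.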

FCNs in [34] are presented using 3 different upsampling settings. First, FCN-32s uses an upsampling with stride 32 on the final layer only. Second, FCN-16s combines the 32x upsampling on the final layer with an upsampling of stride 16 on the output of the pool4 layer. Last, FCN-8s uses both upsamplings from above and additionally an upsampling with stride 8 from layer pool3. These additional skip connections to previous layers allow the network to create finer predictions, similarly to ResNets. The authors report an increase in mean IoU (see section 2.3.5) from 59.4% to 62.7% using FCN-8s as compared to FCN-32s on the validation set.


Alternatives to FCNs include Encoder-Decoder networks as described in SegNet, Badrinarayanan, Kendall, and Cipolla [1], and Bayesian SegNet, Kendall, Badrinarayanan, and Cipolla [22]. They use an architecture that resembles Autoencoders, which try to extract features from an input in an encoding stage and then rebuild the input in a decoding stage. The difficulty lies in preventing the model from learning the identity function. There are many similarities between the FCN and the Encoder-Decoder approach. One difference is in their upsampling: FCNs learn deconvolution filters, whereas SegNets use the max-pooling indices from the encoding stage to upsample in the decoding stage.

Semantic segmentation requires both local as well as global knowledge about an image to be successful. Garcia-Garcia et al. [11] review many different techniques, including conditional random fields as used in DeepLab [3], dilated convolutions as proposed in Yu and Koltun [56], multi-scale prediction, as well as recurrent neural networks used for semantic segmentation.

2.3.4 Instance Segmentation

Instance segmentation describes the task of separating different instances of the same class in an image. An example could be to separate the different humans in a group standing close to each other. In semantic segmentation the entire group of humans would receive the same class, whereas in instance segmentation each human would receive a uniquely identifying label. Often instance segmentation networks are based on either object detection or semantic segmentation architectures and expand these to become instance-aware. Examples include Mask R-CNN [17], which extends Faster R-CNN with an additional branch that predicts instance masks, and Fully Convolutional Instance-aware Semantic Segmentation (FCIS) [31], which expands FCNs with dual position-sensitive score maps.


2.3.5 Evaluation Metrics

Evaluation in computer vision is task dependent. While in image classification one can simply compute a 0-1 performance, i.e. assign each correctly classified image a 1 and each wrongly classified image a 0 and average over them, in object detection, semantic segmentation and instance segmentation one is required to measure a more precise distance from the ground truth. Formally, in image classification accuracy is defined as

\[
\frac{1}{n}\sum_{i=1}^{n} d(y_i, \hat{y}_i) \quad \text{where} \quad d(y, \hat{y}) =
\begin{cases}
0 & \text{if } \hat{y} \neq y \\
1 & \text{otherwise}
\end{cases}
\tag{2.19}
\]

for n images, a true label y and a predicted label ŷ.

The distance measurement typically used in the other tasks is the Intersection over Union (IoU) as described in Everingham et al. [9]. Some form of it is used in many of the most popular computer vision competitions including Pascal VOC, MS COCO and ILSVRC. IoU measures how well the set of predicted pixels A matches the set of true pixels B:

\[
\text{IoU} = \frac{TP}{TP + FP + FN} \tag{2.20}
\]

where TP stands for the number of true positives, FP for the number of false positives and FN for the number of false negatives.
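As a concrete illustration, the IoU of equation 2.20 for a single class can be computed from boolean prediction and ground-truth masks; the following NumPy sketch is illustrative only and assumes binary masks of equal shape:

```python
# Minimal IoU sketch (equation 2.20) for one class, assuming boolean NumPy masks.
import numpy as np

def iou(pred_mask, true_mask):
    tp = np.logical_and(pred_mask, true_mask).sum()    # true positives
    fp = np.logical_and(pred_mask, ~true_mask).sum()   # false positives
    fn = np.logical_and(~pred_mask, true_mask).sum()   # false negatives
    return tp / float(tp + fp + fn)
```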

If the IoU exceeds some threshold t, for instance 0.5, then the object is correctly labeled. Some competitions use an average over multiple thresholds, for instance the set [0.5, 0.6, 0.7, 0.8]. The mean Average Precision (mAP) is the precision averaged over all classes c:

\[
\text{mAP} = \frac{1}{c}\sum_{i=1}^{c} \frac{TP_i}{TP_i + FP_i} \tag{2.21}
\]


Chapter 3

Datasets

Data is supposed to be the oil of the 21st century. Business models that target the generation and commercialization of data can be found throughout all kinds of industries, from small startups to big enterprises. By the law of large numbers the observed mean converges to the expected value as the number of trials n → ∞. This means that the more plausible data can be collected, the larger the value that can be achieved. Note that there exists the fallacy that any kind of data is valuable, hence the word "plausible". As deep models in deep learning require more data than, for example, Support Vector Machines, research in deep learning requires large datasets. Fortunately, many universities and large enterprises have published large, labeled datasets that allow experiments and benchmarking.

This chapter will give a brief overview of a non-exhaustive list of datasets that are commonly used in research, including the works referred to in this thesis. Section 3.2 will present the dataset used in this thesis and in some of the related work in more detail.

3.1 Datasets for Classification

Image classification refers to the task of assigning one class ŷ to an image which is labeled with exactly one true label y. Popular datasets include:

• MNIST: A popular benchmark is the MNIST dataset introduced in LeCun et al. [30], which has been nicknamed the "hello world" of perceptual machine learning. MNIST contains handwritten digits that have been preprocessed


using normalization and scaling. The goal of this task is to predict the correct digit on a test set, i.e. there are 10 classes (0-9) and the output of a model is a probability distribution over those 10 classes. There are 60,000 images in the training set and 10,000 more in the test set. Images are 28x28 pixels and grayscale only.

• CIFAR: The CIFAR-10 (Krizhevsky, Nair, and Hinton [24]) and CIFAR-100 datasets contain 60,000 images of size 32x32 each. Images are in color and thus contain 3 channels. CIFAR-10 has 10 classes with 6,000 images per class, whereas CIFAR-100 has 100 classes with 600 images per class.

• ImageNet: Another popular and more challenging benchmark is the ILSVRC [45]. It is based on the ImageNet dataset [7] containing 10,000,000 images classified into 10,000 classes. Because of the complexity of the challenge there can be 1, ..., n labels in the ground truth of each image. Hence there is the additional challenge to correctly detect a set of 5 classes (i.e. the Top-5) that are a subset of the ground truth.

3.2 Datasets for Semantic Segmentation

• KITTI Vision: The Karlsruhe Institute of Technology has published data from their autonomous driving platform Annieway in the KITTI Vision benchmark [12]. There is no official semantic segmentation labeling yet, but multiple individuals have labeled some subsets of the dataset on a pixel-wise level.

• Daimler Urban Segmentation: DUS [47] contains 5,000 images from video cameras mounted on a car, of which 500 are labeled. Images are labeled with 5 classes: ground, vehicle, pedestrian, building and sky.

• CityScapes: The dataset used in this thesis and in some of the related work is the Cityscapes dataset [5]. It was collected by Daimler, the Max Planck Institute for Informatics and the TU Darmstadt Visual Inference Group. It


Group             Classes
Ground (2)        road, sidewalk
Human (2)         person, rider
Vehicle (6)       car, truck, bus, on rails, motorcycle, bicycle
Construction (3)  building, wall, fence
Object (3)        traffic sign, traffic light, pole
Nature (2)        tree, terrain
Sky (1)           sky
Void (3)          other horizontal (ground) surfaces; dynamic (movable) objects, e.g. strollers, animals; static objects that don't match any of the above

TABLE 3.1: Cityscapes classes and their groups.

contains 5,000 images with fine annotations and 20,000 images with coarse annotations. The dataset was collected over several months (spring, summer, fall) in 50 different cities in Germany. All images are taken at daytime and typically in good to medium weather conditions, i.e. there are no scenes with rain, heavy clouds or snow. Cityscapes has a total of 30 classes, but in fact only 22 appear during evaluation and the remaining ones are therefore treated as void. The classes and their groups are shown in table 3.1.

An example of a Cityscapes annotation is shown in figure 2.4.


Chapter 4

Adversarial Attacks

This chapter will introduce adversarial attacks. First, a definition and an explanation of their effect will be given, as well as a hypothesis for their success. Then, different methods for creating adversarial examples will be evaluated. Lastly, some properties of adversarial examples investigated in the literature will be presented.

4.1 Overview of Adversarial Attacks

An adversarial example is a sample of data, fed to a classifier, which has been minimally modified such that it is misclassified. The main structure of the example is retained, so an object in an image is still clearly recognizable by humans after adding the modifications. An adversarial perturbation is the perturbation that is added to a normal data sample in order to convert it into an adversarial example. As this perturbation is kept to a minimum, the adversarial perturbation is often imperceptible to humans. In some literature the adversarial perturbation is also called adversarial noise or simply noise. In this work these terms will be used interchangeably.

Using adversarial examples to attack a machine learning model poses a huge safety risk. In many use-cases today machine learning powers other business processes. Robots are navigating factories, cars are driving partially autonomously and cameras detect faces for building access. In each of these scenarios an attack could be realized without access to the actual computer systems: a package


FIGURE 4.1: Adversarial examples as illustrated in Szegedy et al. [52]. (Left) Original images, correctly classified; (center) adversarial perturbation; (right) adversarial example. All adversarial examples have been classified as "ostrich, Struthio camelus".

with an adversarial perturbation printed on it could confuse the robot's perception; a perturbation printed on a public advertisement could hide humans on the road; a perturbation printed on the glasses of a person could allow access to intruders. Therefore it is important to investigate adversarial attacks further, in order to understand their nature and then build effective defense mechanisms. As of today there is no defense mechanism that prevents close to 100% of adversarial examples while ensuring close to state of the art accuracy. Formally speaking, for a loss function J() and a parameter matrix θ an adversarial example is defined as

\[
x_{adv} = x + \eta \quad \text{for} \quad \min\,\text{loss}, \qquad \text{loss} = J(\theta, x + \eta, y') \tag{4.1}
\]

where η is the adversarial perturbation, x is the clean example and y′ is the target label, where y′ ≠ ytrue. The goal is to find an η that fools the classifier. Geometrically speaking, the goal is to move the prediction from the correct class manifold just into the manifold of a different class. By setting η to a small value the attack becomes less perceptible to humans, whereas setting it to a large value makes the


attack against the classifier easier. Thus equation 4.1 is often adjusted to furthermore minimize over η.

One current hypothesis for the success of adversarial examples against neural networks is their linear nature. Goodfellow, Shlens, and Szegedy [15] describe that popular neural network architectures like LSTMs, ReLUs and maxout networks are specifically designed to behave linearly in order to be easier to optimize. Furthermore, more non-linear models like sigmoid networks are likewise tuned to spend most of their time in their linear regions. Now consider a simple linear model taking the dot product (i.e. the activation) of a weight vector w with an adversarial example xadv:

\[
w^{\top} x_{adv} = w^{\top} x + w^{\top} \eta. \tag{4.2}
\]

The precision of an image is often limited to 1/255 because of an 8-bit per pixel format. Any η with ||η||∞ smaller than the precision of the input features ε should therefore not affect the class prediction, i.e. each adversarial perturbation has an L∞ norm constraint. The paper then goes on to explain that the perturbation increases the activation in 4.2 by wᵀη. To maximize this increase one can assign η = ε · sign(w), respecting the L∞ norm bound on η. Now, with an increase in the dimensionality n of the weights w, the activation will change by εmn, where m is the average weight magnitude. The max-norm of the perturbation does not grow with the dimensionality, so increasing the dimensions does not affect its max-norm constraint. But since the activation increase grows linearly with the dimensionality, many small changes (even those smaller than the precision) can change the output of a high-dimensional model. This explains why even linear models are vulnerable to adversarial attacks, which leads to the conclusion that a linear combination of such models (which a neural network essentially is) is also vulnerable.
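A small numerical sketch (with assumed values, not taken from [15]) makes the argument concrete: a perturbation below the 8-bit pixel precision already shifts the activation of a 1000-dimensional linear model noticeably.

```python
# Toy illustration with assumed numbers: a perturbation aligned with sign(w) and
# bounded by eps shifts a linear activation by roughly eps * mean(|w|) * n.
import numpy as np

rng = np.random.RandomState(0)
n = 1000                            # input dimensionality
w = rng.normal(scale=0.1, size=n)   # weight vector
x = rng.uniform(size=n)             # clean input
eps = 1.0 / 255                     # below the 8-bit pixel precision

eta = eps * np.sign(w)              # worst case under the L-infinity bound
print(w.dot(x + eta) - w.dot(x))    # activation shift, approximately eps * mean(|w|) * n
```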


FIGURE 4.2: Adversarial attack taxonomy.

4.1.1 Adversarial Goals and Capabilities

Figure 4.2 (which originally appeared in [40]) shows a taxonomy of different attacks, their goals and capabilities. The x-axis shows the adversarial goal, ordered by increasing complexity:

1. Confidence reduction - reduce the inferred probability of the correct class.

2. Misclassification - change the predicted class to any class other than the true class.

3. Targeted misclassification - create a new example from random noise or an empty image that gets predicted as a specific class.

4. Source/target misclassification - change the predicted class of a given input into a specific different class.

On the other hand, the y-axis shows how much the attacker knows about the model he is attacking, i.e. his adversarial capabilities, in decreasing order:


1. Architecture and Training Tools - the attacker has perfect knowledge of the system to attack. He has access to the training data and the network architecture, including activation functions, layer properties and trained parameters. The adversary can therefore use this knowledge to train attacks in the same setting.

2. Network Architecture - the adversary has knowledge about the architecture and the trained parameters. He can therefore simulate the network, which makes certain attacks easier.

3. Training Data - the attacker has access to data from the same distribution as that used to train the network he intends to attack. He does not have access to the network architecture.

4. Oracle - the adversary can use a proxy of the network to retrieve classifications. He can investigate output changes when altering his inputs in order to iteratively improve his attack. This type of attack is often referred to as "black-box" because the adversary has no knowledge about the model itself. Examples for oracles are computer vision APIs that return labels and probabilities for images that are uploaded to the system. Sometimes these systems have an absolute or rate limitation.

5. Samples - the attacker can collect input samples along with their output labels but cannot modify the former to observe any changes in the prediction. These kinds of labeled samples would, intuitively, only be valuable in large quantities.

This taxonomy from [40] will be expanded in 5.1 to add the dimension of physical realizability.

4.2 Methods for Creating Adversarial Examples

The following notation holds for the entire section:


• X - an image stored in a 3D matrix with shape (width, height, channels). Pixel values are in the range [0, 255].

• ytrue - true label of the image

• ytarget - target label of a targeted attack

• J(X, y) - loss function of the neural network, typically cross-entropy. The learnable parameters are fixed, as we assume the model to have converged beforehand.

• ClipX,ε{X'} - pixel-wise clipping function that ensures that the generated values stay in the vicinity of the original values. The exact clipping depends on the Lp norm used: the L∞ norm clips the maximum values, whereas the L2 norm projects the values onto the ε-ball around X. Specifically, for the L∞ norm the clipping function is defined such that for each pixel p in the image X:

\[
\mathrm{Clip}_{X,\varepsilon}\{X'\}(p) = \min\bigl\{255,\; X(p) + \varepsilon,\; \max\{0,\; X(p) - \varepsilon,\; X'(p)\}\bigr\}.
\]
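For concreteness, the L∞ clipping above can be written in a few lines of NumPy; this is an illustrative sketch, not the implementation used in this thesis:

```python
# Illustrative sketch of the pixel-wise L-infinity clipping Clip_{X,eps}{X'}.
import numpy as np

def clip_linf(x_orig, x_adv, eps):
    """Keep x_adv within eps of x_orig per pixel and inside the valid [0, 255] range."""
    x_clipped = np.minimum(x_orig + eps, np.maximum(x_orig - eps, x_adv))
    return np.clip(x_clipped, 0, 255)
```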

4.2.1 Fast Methods

One of the initial methods to create adversarial examples was defined in Goodfellow, Shlens, and Szegedy [15], called the Fast Gradient Sign method. It adheres to the L∞ norm bound and needs to compute the gradients only once. It is defined as:

\[
X^{adv} = \mathrm{Clip}\Bigl(X + \varepsilon\,\mathrm{sign}\bigl(\nabla_X J(X, y_{true})\bigr)\Bigr). \tag{4.3}
\]
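To make equation 4.3 concrete, the following sketch computes a fast gradient sign example for a Keras classifier. The gradient plumbing via keras.backend and the clip_linf helper from the sketch above are assumptions for illustration, not the thesis implementation:

```python
# Illustrative FGSM sketch (equation 4.3); x is a batch of images in [0, 255]
# and y_true_onehot a matching one-hot label array.
import numpy as np
import keras.backend as K

def make_grad_fn(model, y_true_onehot):
    """Build a function returning dJ/dX for a fixed one-hot label."""
    loss = K.mean(K.categorical_crossentropy(K.constant(y_true_onehot), model.output))
    grad = K.gradients(loss, model.input)[0]
    return K.function([model.input], [grad])

def fgsm(model, x, y_true_onehot, eps):
    grad_fn = make_grad_fn(model, y_true_onehot)
    g = grad_fn([x])[0]
    return clip_linf(x, x + eps * np.sign(g), eps)
```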

An alternative form is the Fast Gradient method that can use different Lp norms. It is a generalization of the Fast Gradient Sign method in which the updates move along the gradient direction instead of the gradient sign direction. It is defined as:

\[
X^{adv} = \mathrm{Clip}\left(X + \varepsilon\,\frac{\nabla_X J(X, y_{true})}{\lVert \nabla_X J(X, y_{true}) \rVert}\right). \tag{4.4}
\]


4.2.2 Iterative Methods

An extension to the fast methods was presented in Kurakin, Goodfellow, and Bengio [27]. It repeatedly applies the fast method with a step size α. It resembles the gradient descent that is used to optimize the model, with the difference being that the pixel space of the input is adapted rather than the weight space and that we move in the direction of the gradient sign, i.e. using gradient ascent.

\[
X^{adv}_{0} = X, \qquad X^{adv}_{N+1} = \mathrm{Clip}_{X,\varepsilon}\Bigl\{ X^{adv}_{N} + \alpha\,\mathrm{sign}\bigl(\nabla_X J(X^{adv}_{N}, y_{true})\bigr) \Bigr\}. \tag{4.5}
\]

The authors report using a fixed α = 1 and selecting the number of iterations to be min(ε + 4, 1.25ε). Again this method can be generalized to other Lp norms:

\[
X^{adv}_{0} = X, \qquad X^{adv}_{N+1} = \mathrm{Clip}_{X,\varepsilon}\left\{ X^{adv}_{N} + \alpha\,\frac{\nabla_X J(X^{adv}_{N}, y_{true})}{\lVert \nabla_X J(X^{adv}_{N}, y_{true}) \rVert} \right\}. \tag{4.6}
\]

4.2.3 Targeted Methods

The methods described so far have the goal of changing the pixel space of the image in order to reduce the probability of the correct class, which leads the adversarial example to be classified as an arbitrary other class. Instead of reducing the probability of the correct class, one could also try to increase the probability of a targeted class where ytarget ≠ ytrue while decreasing the probability of every other class j ≠ ytarget. The motivation behind a targeted attack is that the previous methods often lead to only very slight changes in class: on a dataset like ImageNet that has many classes, they may merely change the prediction from the correct breed of dog to a different breed of dog. A targeted approach can instead be used to optimize for a least-likely class, which represents the most difficult adversarial perturbation. Also, a targeted attack would be much more likely in practice, e.g. an attacker could be interested in impersonating a specific other person to gain access, or in turning a human prediction into a road prediction rather than into another object. These


targeted methods then resemble gradient descent updating the image space. With the L∞ norm:

\[
X^{adv}_{0} = X, \qquad X^{adv}_{N+1} = \mathrm{Clip}_{X,\varepsilon}\Bigl\{ X^{adv}_{N} - \alpha\,\mathrm{sign}\bigl(\nabla_X J(X^{adv}_{N}, y_{target})\bigr) \Bigr\}. \tag{4.7}
\]

For any Lp norm:

\[
X^{adv}_{0} = X, \qquad X^{adv}_{N+1} = \mathrm{Clip}_{X,\varepsilon}\left\{ X^{adv}_{N} - \alpha\,\frac{\nabla_X J(X^{adv}_{N}, y_{target})}{\lVert \nabla_X J(X^{adv}_{N}, y_{target}) \rVert} \right\}. \tag{4.8}
\]

4.2.4 Jacobian-based Saliency Map Approach

Instead of slightly changing arbitrary pixels as done previously, Papernot et al. [40] have introduced a method to change only the specific pixels which have the maximum impact in each iteration. For this, the forward derivative of the learned model is computed, which is the Jacobian of what the network has learned. Hence the gradients are not propagated backwards but rather forwards, and the derivatives are taken with respect to the input features (i.e. the pixels) rather than the network parameters. This means that the gradients highlight which input features would need to be modified to achieve the largest change in output. Such saliency maps were introduced as visualization tools in Simonyan, Vedaldi, and Zisserman [51]. Papernot et al. [40] used them as adversarial saliency maps in a greedy policy to indicate which pixels the adversary should change in each iteration in order to have the maximum impact. In practice typically only one or two pixels are changed per iteration.

4.2.5 Other Methods

There are several other methods in the literature that have been used to create adversarial examples. These include box-constrained L-BFGS [52], other optimization-based methods as in Bastani et al. [2] and Liu et al. [33], and DeepFool, proposed in Moosavi-Dezfooli, Fawzi, and Frossard [37].


4.3 Properties of Adversarial Examples

Adversarial examples have many intriguing properties which make them a threat to machine learning systems. The following section will highlight some of them as shown in current related work. Specifically, it is shown that adversarial examples can:

• survive being printed and photographed in image classification.

• be transferred to other models and datasets, even in black-box scenarios.

• be made universal and work on multiple images.

• fool other tasks such as semantic segmentation.

• fool face recognition tasks after being printed and photographed.

4.3.1 Attacks in the Physical World

In the initial work by Szegedy et al. [52] that showed the existence of adversarial perturbations it was assumed that the attacker has direct access to the model and can input digital data. It was then shown in [27] that this is not necessary and that many adversarial examples are able to fool classifiers even after the typical signal processing that occurs in the physical world. This was the first time it was shown that an adversarial example can be printed and photographed using a smartphone camera1 and in many cases still fool a classifier.

They have used an Inception v3 network architecture trained on the ImageNet dataset. They have printed multiple adversarial examples on a single sheet and cropped and warped them automatically after taking the photos. The output was then fed back to the same model that was also used to create the adversarial examples. The authors have used the fast gradient sign method in equation 4.3, an iterative sign method as in equation 4.5 and a least-likely method using equation 4.7 in which ytarget = argminy p(y|X). The noise max-norm ε was set to

1The authors used a Ricoh MP C5503 office printer at 600dpi for printing and a Nexus 5x for photo capturing.


be lower than 16 and they reported specific values for each ε ∈ {2, 4, 8, 16}. They have compared the accuracy on the correct class before and after this photo transformation, and computed a measure of destruction rate that shows how many adversarial examples are no longer misclassified after the transformation. In other words, the destruction rate shows to what extent the effect of adversarial perturbations is destroyed as part of the transformation. Specifically, the authors found that the fast methods were more likely than the iterative methods to create examples that are still misclassified after printing. Furthermore, the photos of all 3 attack types were more likely to be misclassified for larger values of ε, as would be expected, because a larger ε value leads to a larger distance from the original image. Lastly, the authors have also tested the adversarial examples in a black-box attack using an unknown network in the TensorFlow Camera Demo app. What the authors have not shown, however, is how well their targeted (least-likely) attacks achieve a targeted misclassification rather than just any misclassification. Furthermore, the images that are attacked are known to the authors; they are i.i.d. from the training set, which might explain why no further adjustment to the adversarial perturbations was necessary.

4.3.2 Transferability and Black-Box Attacks

In [52] it was investigated whether adversarial examples would be able to generalize to models trained on a disjoint subset of the data. They studied the cross-training-set generalization and found that adversarial examples are indeed transferable to other data subsets of the same distribution. Furthermore, they showed that the transferability holds for different model architectures as well. This was shown for small datasets, in particular MNIST and CIFAR-10, and with the goal of misclassification. Later, Papernot et al. [39] demonstrated the transferability to different machine learning models including decision trees, kNN, etc.

In Liu et al. [33] it was then shown that these results are also applicable to the much larger ImageNet dataset. As previous results reported mostly on misclassification, [33] investigated the transferability of targeted attacks. First, they showed that most targeted attacks get destroyed when being transferred to other


network architectures trained on ImageNet. Then, they introduced an ensemble-based approach that generates adversarial examples using multiple models and disjoint training subsets in order to train a 1-vs-rest attack. They have used 5 different architectures including ResNet with 152, 101 and 50 layers, VGG-16 and GoogLeNet. They then trained adversarial examples by using an optimization-based approach solving the following formula:

\[
\arg\min_{X^{adv}} \; -\log\Bigl(\Bigl(\sum_{i=1}^{k} \alpha_i J_i(X^{adv})\Bigr) \cdot 1_{y_{target}}\Bigr) + \lambda\, d(X, X^{adv}) \tag{4.9}
\]

for k attack models with a softmax output J1, ..., Jk, the one-hot encoded target label 1ytarget and model weights α. The function d(X, Xadv) is a distance metric to optimize for a minimal divergence of the adversarial example from the original example, to ensure the imperceptibility of the attack. It can be weighted using λ.

At each iteration an Adam update is computed for each attack model, and their updates are summed and added to the image. In this way the 4 models contribute to creating an adversary together. The final adversarial example is then tested against each model, including the left-out one. As one would assume, the percentage of successful targeted attacks against each of the four attack models is quite high. On top of that, even against the left-out model there is a success rate between 11% and 46%. It can be noticed that in all their experiments the rate of matching the target in the attacked model is higher when attacking ResNet models compared to GoogLeNet and VGG-16. There is no reasoning offered in the paper, but one hypothesis could be the smaller capacity of the latter two models; the ResNet architectures contain many more layers than the other two. The authors have also studied targeted fast-gradient methods but could not achieve better results using the ensemble approach as compared to the single-model approach.

Black-box attacks refer to the adversarial capabilities 4 and 5 as described in 4.1.1. Here, the adversary has no knowledge about the model architecture or the data used to train the model which is targeted. In the 'oracle' scenario (i.e. capability 4) the adversary can query the black-box model and receive outputs


from this oracle. This attack method is especially hard because the adversary can make no prior assumptions. Investigations of the geometric properties of different models have shown that the gradient directions tend to be close to orthogonal to each other [33].

The ensemble approach from above has also been used to attack a black-box system in the same way. Liu et al. [33] have used adversarial examples created by both a single VGG-16 network and an ensemble of 4 out of their 5 models (leaving out ResNet-152). Using those they have attacked the image recognition API of Clarifai.com. Clarifai.com is a commercial platform that offers state of the art image recognition-as-a-service. It was founded by Matthew Zeiler, who won the ILSVRC 2013 challenge with ZF net [57]. What makes the Clarifai.com API distinct from other networks, and hence a true black-box, is that it has different labels than the ILSVRC challenge. Nonetheless, the authors report that 57% of the adversarial examples generated using VGG-16 and 76% of those generated by the ensemble are misclassified by Clarifai.com. Furthermore, 18% of the targeted ensemble attacks succeed in being classified close to the target, whereas only 2% of the single VGG-16 attacks achieve this goal. This shows the strength of the ensemble approach, especially considering that the labels differ from the ones used in training.

Another approach for black-box attacks, proposed in [39], trains a substitute model in order to craft adversarial examples which are then used to attack the black-box model. Model extraction attacks, as shown in Tramèr et al. [55], are highly effective at using machine learning-as-a-service systems in order to extract part of the model's knowledge. This information can then be used to train a substitute model whose outputs closely resemble the attacked model's outputs. In particular, the attacker intends to train a model F̂ that closely approximates the black-box model F. In order to do so the adversary collects a training dataset and uses the oracle (i.e. the black-box model) to label it. By using additional augmentation and synthetic data generation the adversary can iteratively increase its dataset size. After having collected a sufficiently large dataset the adversary employs it to train its substitute model F̂. The model F̂ is then used to generate adversarial examples. [39] has shown that they can use this method to force


the MetaMind API, trained on MNIST, to misclassify 84.24% of adversarial examples which are imperceptible to humans, and that they can force a second oracle, trained on the German Traffic Signs Recognition Benchmark, to misclassify 64.24% of adversarial examples which are imperceptible to humans. The authors have not tested their method on source/target misclassification.

4.3.3 Universal Perturbations

Adversarial attacks as described above always perturb a single image in order to fool a classifier using the specific combination of image and adversarial perturbation. In other words, these perturbations are image-dependent, i.e. one cannot apply a perturbation designed for image A to a different image B and expect the attack to work successfully. In Moosavi-Dezfooli et al. [38] an algorithm has been introduced that allows crafting universal (i.e. image-agnostic) perturbations. Universal perturbations may pose a larger threat because they allow an adversary to create attacks where the underlying sample that is being used for the attack does not matter. The goals of a universal perturbation are to be minimal in distance to the class boundary of a class y′ ≠ ytrue, to ensure the imperceptibility of the perturbation, and to fool as many images as possible from a separate subset Xtest. The authors have described these goals in terms of two constraints to their minimization problem such that:

1. $\lVert \eta \rVert_p \leq \varepsilon$

2. $P\bigl(F(x + \eta) \neq F(x)\bigr) \geq 1 - \delta$

Remember that η is the adversarial perturbation with a p-norm constraint ε and F() is the classifier. δ is a newly introduced parameter that quantifies the fooling rate.

The algorithm for creating universal perturbations, shown in algorithm 2, iterates multiple times over the dataset X until constraint 2 is met. For each image xi it is checked in line 6 whether the current universal perturbation v already fools

2Loosely taken from [38]


Algorithm 2 Computation of universal perturbations.2

1: input: Data points X, classifier F, lp norm constraint ε, desired fooling rate δ.
2: output: Universal perturbation v
3: Initialize v ← 0
4: while L(Xv) ≤ 1 − δ do
5:     for each datapoint xi ∈ X do
6:         if F(xi + v) = F(xi) then
7:             Compute the minimal perturbation that sends xi + v to the decision boundary:
8:             Δvi ← AdvAlgo(xi, v)
9:             Update the perturbation:
10:            v ← Clip(v + Δvi)

the classifier when added to xi. A minimal perturbation is computed in line 8 that fools the current image xi using an adversarial creation algorithm AdvAlgo(). In [38] the authors have used the DeepFool algorithm from [37], but other algorithms from section 4.2 can be used as well. Lastly, the algorithm enforces constraint 1 by using a clipping function in line 10 that projects the universal perturbation onto the lp ball of radius ε centered at 0.

The authors have tested their algorithm on multiple neural network architectures trained on ImageNet and were able to fool between 78.3% and 93.7% of all validation images. That is, in some cases images which were not used to create the noise were able to fool a neural network in more than 9 out of 10 cases. Both the L2 and the L∞ norms were tested, but neither was better on all models, hence one needs to find the best norm for a given model. Furthermore, the authors have shown that universal adversarial perturbations are also transferable to different model architectures without ensemble methods, often for more than 50% of the examples tested.
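Algorithm 2 can be summarized in a few lines of code. In the sketch below, predict, adv_algo and the parameters are assumed stand-ins (adv_algo being any per-image attack from section 4.2, e.g. DeepFool), the clipping is the L∞ projection, and the whole function is illustrative only:

```python
# Illustrative sketch of algorithm 2 (universal perturbations).
import numpy as np

def fooling_rate(images, predict, v):
    """Fraction of images whose prediction changes when v is added."""
    return np.mean([predict(x + v) != predict(x) for x in images])

def universal_perturbation(images, predict, adv_algo, eps, delta, max_passes=10):
    v = np.zeros_like(images[0], dtype=np.float64)
    for _ in range(max_passes):
        for x in images:
            if predict(x + v) == predict(x):       # line 6: v does not fool x yet
                dv = adv_algo(x + v)               # line 8: minimal per-image perturbation
                v = np.clip(v + dv, -eps, eps)     # line 10: project onto the eps-ball
        if fooling_rate(images, predict, v) > 1 - delta:
            break
    return v
```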

4.3.4 Attacks on Semantic Segmentation

Attacking semantic segmentation using adversarial examples was addressed in Fischer et al. [10]. Fooling a pixel-wise classifier is inherently different from fooling an image-wise classifier because the effective receptive field that labels each


pixel in the former is unknown. In image-wise classification the effective receptive field spans (close to) the entire image and the classification is not influenced by other surrounding objects. A semantic segmentation network, on the other hand, could be highly influenced by the distance between different object pixels. As an example, in an image from Cityscapes [5] which portrays a public street scene, pixels that are close to a region classified as 'bicycle' might be highly likely to be classified as 'rider' or as 'road'.

The authors of [10] have used a targeted attack method on a fully-convolutional neural network trained on the Cityscapes dataset. More specifically, they have trained an FCN-8 using a VGG-16 feature extractor on the 2,975 training images of Cityscapes, achieving a 55.5% IoU. All training and evaluation was done on a downscaled version of the dataset using 1024x512 pixels instead of 2048x1024 pixels for computational efficiency. The targeted attack method was used with an L∞ norm, the step size was set to α = 1 and n = min(ε + 4, 1.25ε) as suggested in [27]. The target was set to hide pedestrian class pixels by using the nearest neighbor class of each pixel as its target. In a second experiment the authors have applied the noise only to the pixels which contain humans and not to the background. It can be seen that even for large ε values the background pixels can be kept intact and, as expected, larger values lead to a larger destruction of human pixels. The results are interesting as they show in both cases that a hardly perceptible ε as low as 5 can fool about 80% of the targeted object pixels, while leaving the main image prediction unchanged.
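One way to build such a nearest-neighbor target segmentation is to give every pedestrian pixel the label of its closest non-pedestrian pixel. The sketch below, using SciPy's Euclidean distance transform, is an assumed construction for illustration and not necessarily the exact procedure of [10]:

```python
# Illustrative nearest-neighbor target: every pixel predicted as 'person' is
# assigned the label of the closest pixel that is not a person.
import numpy as np
from scipy.ndimage import distance_transform_edt

def nearest_neighbor_target(pred_labels, person_id):
    """pred_labels: 2D array of predicted class ids; person_id: id of the class to hide."""
    person_mask = pred_labels == person_id
    # indices of the nearest non-person pixel for every position in the image
    _, (rows, cols) = distance_transform_edt(person_mask, return_indices=True)
    return pred_labels[rows, cols]
```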

Furthermore, the existence of universal adversarial perturbations for semantic segmentation was shown in Metzen et al. [36]. The paper presents two different scenarios, both using the universal perturbation algorithm 2. First, a static target segmentation, which uses the same target image Ytarget for all input images Xi. This attack scenario could be used in situations where the adversary intends to attack a static camera, for instance in a security compound, using past video frames to attack a situation in the future. Second, a dynamic target segmentation, in which each input image Xi is used to train the noise with a specific target Ytarget,i. This scenario is necessary where an ego-motion of the camera changes the scene over time and a static target attack would cause suspicion. A self-driving car that is in


movement would be such a use-case.

The authors argue that when averaging the loss gradient over all images in the training set, as in

\[
\nabla v = \frac{1}{m} \sum_{k=1}^{m} \nabla_X L\bigl(F(X_k + v), y^{target}_{k}\bigr) \tag{4.10}
\]

there is a chance of overfitting to the training data because of the high dimensionality of v. In order to prevent this, they have introduced a regularization strategy that enforces v to be periodic in both spatial dimensions. In other words, they create patches of fixed height and width h, w, smaller than the full image, and repeat this patch in both spatial dimensions. The authors report the best trade-off between pedestrian pixels hidden and background pixels preserved when using two 512x512 patches.
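The periodicity constraint can be realized by optimizing a small patch and tiling it over the image; a minimal sketch with assumed shapes:

```python
# Illustrative sketch of the periodic-noise regularization: a small patch is
# repeated in both spatial dimensions to form the full-size perturbation.
import numpy as np

def tile_patch(patch, image_height, image_width):
    """patch: (h, w, c) array; returns an (image_height, image_width, c) perturbation."""
    h, w, _ = patch.shape
    reps_y = int(np.ceil(float(image_height) / h))
    reps_x = int(np.ceil(float(image_width) / w))
    tiled = np.tile(patch, (reps_y, reps_x, 1))
    return tiled[:image_height, :image_width, :]
```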

In both scenarios, static and dynamic target segmentation, the authors are able to create adversarial perturbations that can fool large portions of the 500 validation set images. The authors have shown that the success is highly dependent on the training set size: for all adversarial examples the success rate increases with more training data. The static target approach works surprisingly well even though the dataset does not include static image data and the authors have used a static target to fool images from completely different scenes. In the other scenarios the noise typically has a somewhat periodic content and does not show clear objects. In the static target attack, on the other hand, the noise clearly resembles the target.

4.3.5 Attacks on Face Recognition Systems in the Physical World

Sharif et al. [49] have shown the possibility of fooling a face recognition system in many cases using an adversarial perturbation printed on the frames of glasses in the physical world. A face recognition system is a multi-class classifier that outputs a probability distribution over the persons in the dataset. In other words, when presented with an image of a person X, the model gives a probability


p(F(X) = yk) for each person yk out of the k total persons, where all probabilities sum to one: Σk p(F(X) = yk) = 1.

FIGURE 4.3: Successful examples of adversarial perturbations against a face recognition system, printed on glass frames. (Top) Adversary wearing the attack glasses; (bottom) impersonation target.3

The authors have used both a targeted and a non-targeted approach. In the former the system is supposed to predict a specific other person, i.e. the adversary wants to impersonate another person, whereas in the latter the adversary wants to dodge the recognition of himself and allow the model to predict an arbitrary other person.

The goal of the authors was to build an attack that is both inconspicuous and physically realizable. The motivation for creating inconspicuous attacks is that they are less likely to be caught by cursory inspection. In comparison with a person who wears a full face-mask, which raises suspicion, a colorful glass frame might not raise any suspicion by observers. Hence, these attacks are more likely to pass by human security personnel, and special attention in crafting systems that are secure against this kind of attack is required. Furthermore, an inconspicuous adversary might be able to plausibly deny his intention of attacking a system if his methods are not as obvious. Physical realizability, on the other hand,

3Taken from Sharif et al. [49].


is required to test whether these attacks are able to fool facial biometric systems outside of a simulation. Because of this, the authors also decided to attack state of the art deep learning systems rather than simpler algorithms.

In order to achieve inconspicuousness the authors print their perturbations and stick them on big glass frames, also known as "nerd frames", which are similar to the popular Ray Ban Wayfarer glasses, as can be seen in figure 4.3. They argue that these glasses are quite popular and also available in colorful patterns, similar to the outputs achieved by their algorithm.

In order to achieve physical realizability the authors have introduced two novel methods that they integrated into their algorithm for crafting adversarial perturbations. The first method addresses the smoothness of the adversarial perturbation. Natural images tend to be smooth in the sense that neighboring pixels are similar in color and change only gradually over multiple pixels. A noise generated with one of the methods from section 4.2 typically has very high frequency changes between neighboring pixels, as it does not enforce smoothness. Furthermore, since the camera introduces a sampling noise based on its point-spread function, large differences between spatially close pixels are hard to capture anyway. Hence the authors have introduced an additional loss called the total variation (TV) loss:

\[
TV(\eta) = \sum_{i,j} \Bigl( (\eta_{i,j} - \eta_{i+1,j})^2 + (\eta_{i,j} - \eta_{i,j+1})^2 \Bigr)^{\frac{1}{2}}. \tag{4.11}
\]
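A NumPy version of equation 4.11 is short; the sketch below is illustrative (boundary pixels are simply dropped) and not the implementation of [49]:

```python
# Illustrative total variation loss (equation 4.11) for a (height, width, channels)
# perturbation eta.
import numpy as np

def total_variation(eta):
    dy = eta[1:, :-1, :] - eta[:-1, :-1, :]   # differences to the vertical neighbor
    dx = eta[:-1, 1:, :] - eta[:-1, :-1, :]   # differences to the horizontal neighbor
    return np.sum(np.sqrt(dy ** 2 + dx ** 2))
```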

The TV loss reduces the distance between neighboring pixels in a perturbation η.

The second method deals with the printability of the adversarial perturbation. Colors that are printed on a sheet of paper differ from those that are defined by the adversarial generation method. The reasons for this are that not all colors in the RGB color space [0, 1]³ are printable, and that the printer introduces a distortion between the colors that are requested to be printed and the colors that are actually printed, based on its technology. In other words, an image printed on two different printer models is likely to be slightly different in color. In order to overcome these problems the authors have printed a color palette containing a fifth of the RGB space and captured an image of it using the same camera


which was used in the experiments. This way they were able to capture a set of printable colors P ⊂ [0, 1]³. They have defined the Non-printability Score (NPS) of a pixel p̂ as:

\[
NPS(\hat{p}) = \prod_{p \in P} \lvert \hat{p} - p \rvert. \tag{4.12}
\]

It captures the distance between each pixel p̂ in the adversarial perturbation and each printable pixel p ∈ P. As this would be computationally very expensive because of the high dimensionality of the images and the many identified printable colors, the authors have quantized the set of printable colors to 30 RGB triplets which have a minimal variance in distance from the complete set. The number 30 turned out to be the best trade-off between accuracy and computation time in their experiments. Furthermore, the authors use the captured color palette to build a mapping m from each requested color to the color the printer actually produces. This mapping is then used to replace each pixel p requested to be printed with a pixel p̂ such that the printing error |p − m(p̂)| is minimized. In other words, they find a substitute color p̂ that is, when mapped to the printable colors, closer to the desired color.
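Summed over all pixels of a perturbation, the non-printability score of equation 4.12 can be sketched as follows; the printable color set is an assumed (n, 3) array of RGB triplets in [0, 1]³, and this is not the implementation of [49]:

```python
# Illustrative non-printability score: the per-pixel product of distances to the
# printable colors, summed over the whole perturbation.
import numpy as np

def non_printability_score(eta, printable_colors):
    """eta: (h, w, 3) perturbation; printable_colors: (n, 3) printable RGB triplets."""
    pixels = eta.reshape(-1, 3)
    # distance of every perturbation pixel to every printable color
    dists = np.linalg.norm(pixels[:, None, :] - printable_colors[None, :, :], axis=-1)
    # the product is small if a pixel lies close to at least one printable color
    return np.sum(np.prod(dists, axis=1))
```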

Both the TV loss and the NPS are then incorporated into the optimization function for finding perturbations

\[
\arg\min_{\eta} \Bigl( \Bigl(\sum_{k} \mathrm{softmax}(X + \eta, y_{target})\Bigr) + \kappa_1 TV(\eta) + \kappa_2 NPS(\eta) \Bigr) \tag{4.13}
\]

where k is the number of images the adversary is using to train its attack and κ1 and κ2 are constants for balancing the different objectives. The authors report using 0.15 and 0.25 respectively in practice. The successful results can be seen in figure 4.3.


Chapter 5

Creating Adversarial Examples for the Physical World

Following the related work presented in the previous chapters, the goal of this thesis is to investigate novel methods and difficulties when creating adversarial examples to attack a semantic segmentation model in the physical world. This chapter will give an overview of the experimental setup being used, the initial investigations and assumptions taken about the task, and then motivate multiple adaptations of the existing algorithms and show their results in the given experimental setup.

5.1 Adversarial Goal

The goal of the adversary in this scenario is to create some sort of poster that can be placed near a road in the physical world and which would hide an object (in the experiment a human) in front of it. To the best of my knowledge, this is the first attempt at such an attack. It differs from [27] and [49] by using semantic segmentation instead of image classification and face recognition. Furthermore, in contrast to [27] and in line with [49], it combines an adversarial perturbation that was digitally created and transferred into the physical world by means of printing or displaying with a physical object, i.e. in both cases a human being. While [49] applies the adversarial perturbation on glass frames that are in front of the object whose class is intended to be changed, this work attempts to insert


the adversarial perturbation behind the object. [10] has shown that it is sufficient to apply a (universal) adversarial perturbation in semantic segmentation only on the object itself and not on the background, but it has not been shown before whether it is likewise sufficient to apply a (universal) adversarial perturbation on the background only.

FIGURE 5.1: Adversarial Goal: A poster attack

The threat model taxonomy from figure 4.2 can thus be expanded along another dimension. So far it includes only the adversarial goals and capabilities in a simulated environment. By adding another dimension it is possible to also map the physical realizability of the attack. I suggest the following dimensions, in increasing difficulty:

1. Digital environment - An attack in the digital environment space is the most common attack in the literature as of today. It does not include any transformation into the physical world and hence is not required to achieve any physical realizability.

2. Controlled physical environment - The adversary has full control/knowledge of the physical environment settings; he can control/knows about light settings, distances and camera angle, for example. The limited attack does not need to generalize to other environments.


3. Controlled physical environment with restricted noise - The adversary has full control/knowledge of the physical environment but is limited in the regions on which he can apply the adversarial perturbation. It is usually necessary to apply the perturbation on a flat surface, and applying a perturbation on the object which he intends to hide is not always feasible. For instance, it would raise suspicion if a perturbation were attached to one's car.

4. Unknown environment - In this scenario the adversary does not have control over the physical environment and the attack has to generalize to multiple settings.

5. Unknown environment with restricted noise - The adversary does not have control over the physical environment and is restricted to applying the perturbation only to regions that are feasible and do not cause suspicion.

Following this taxonomy, the attack strategy for this goal addresses a controlled physical environment with restricted noise placement and a (source/target) misclassification, while having full knowledge of the training data and network architecture. Specifically, the goals in this thesis are 1) to maximize the pixels which are not classified as pedestrian and 2) to use the road class as the target for everything in front of the noise. This includes any objects that are between the camera and the noise. This is to simulate an attack that could be executed in order to hide a human in front of a self-driving vehicle. Note that although having full knowledge of the network architecture and its predictions on each image, I have not taken advantage of model-specific weaknesses. An example would be to explicitly focus on targets that the model cannot predict in the first place. Nonetheless, the performance evaluation of the attack is always done against the prediction of the classifier rather than the ground truth. Otherwise one would attribute weaknesses of the model to a strength of the adversarial attack, which would be unjust.


5.2 Experimental Setup

The setup of my experiments is similar to those in [10] and [36], as this is a follow-up project written within the Bosch Center for Artificial Intelligence. The dataset used is also a downsampled version of the Cityscapes dataset with 1024x512 pixel images in RGB colors. The network is the same Fully Convolutional Network as described in section 4.3.4, using a VGG-16 feature extractor and a combination of the 8x, 16x, and 32x upsamplings, known as FCN-8 (ref. 2.3.3). The VGG-16 network is a version pre-trained on ImageNet obtained from the Keras Applications library1, which contains several network architectures along with their weights. The FCN-8 is trained on Cityscapes and achieves a 55.5% IoU.

We have implemented adversarial generation methods for fast attacks, iterative attacks, targeted attacks and DeepFool. Based on the adversarial goal described above, a targeted attack is necessary in order to assign every pixel in front of the perturbation the road class, which only leaves DeepFool and the targeted approach from [27]. In practice the latter worked better for our environment and it was used for both prior papers and this thesis.

The initial idea was to use a printed poster which could be shown as an advertisement on advertising pillars/boards. I have tested this approach by printing a digitally successful semantic segmentation attack and a random, unaltered Cityscapes image on a sheet of A4 paper using a color office printer. Even with different printer settings and different programs to open the image for printing, it was not possible to print either the clean image or the perturbed image and get a valid segmentation or attack using the available printer. A full poster attack would require printing a large poster that is at least the size of the human which it is supposed to hide (more details will follow). Printing large posters is possible, and using a high-quality printer would presumably lead to a better segmentation result. Nevertheless, requesting such a printout would take multiple days and is more costly than printing on an office printer. Considering these observations I have decided against using a printed poster attack.

1https://keras.io/applications/#vgg16


The Robert Bosch office in Renningen, where I was located for my thesis, has a so-called powerwall. The powerwall is a 3x7 meter display using 6 projectors that screen the input onto a glass wall. It has a total of 5800x2500 pixels, leading to a pixel size of about 1.2x1.2 mm². It can be seen in figure 5.2. The powerwall has multiple handy features which made it a good choice for this project:

• It is larger than a human standing in front of it.

• It allows for cheap iterations of testing new ideas in the physical world.

• The light settings in the room can be somewhat controlled, e.g. natural light can be blocked and the lights can be dimmed.

• The room is large enough to test different camera placements.

Furthermore, an attack in which an adversary hacks a digital billboard in order to show a perturbation for a few seconds leaves far fewer traces than an adversarial perturbation poster glued to a physical billboard. A suspicious billboard would likely raise concerns, whereas finding out whether a digital billboard has been hacked is much harder, and one might instead assume a malfunction of the self-driving software. Therefore it is especially important to investigate the possibility of this kind of attack and to find defense mechanisms.

All attacks were tested in a digital environment first. Adversarial perturbations that were designed to fool a digital environment will be referred to as digital noise or digital adversarial perturbations. The camera used for all experiments was a Logitech C910 1080p USB webcam. All code was written in Python, neural network architectures were built and trained using the Keras library [4], and new layers and losses were written in Theano [53]. Training and inference were done on a server with 4 GeForce GTX Titan X GPUs with 12GB of memory each.

Furthermore, the noise is always trained to be universal, i.e. gradients are averaged over all training images. 1704 images that contain a sufficient number of human pixels in the ground truth were selected as training data. This saves computation time, since using images without humans in them would not benefit the attack.


FIGURE 5.2: The powerwall from Robert Bosch GmbH in Renningen

The algorithm used for creating universal adversarial perturbations resembles algorithm 2 but has some minor changes. The algorithm is shown in algorithm 3. The differences are:

• It is not checked whether an image is already sufficiently misclassified.

• The targeted method is always used.

• The gradients are accumulated after each iteration over the dataset.

• The algorithm is run for a fixed set of iterations.

These changes have been motivated by the idea that we do not search for a minimal attack but rather for a more successful one. By updating the perturbation using an image that already passes the desired fooling rate, we increase the perturbation in the direction of its gradient; thus the attack becomes more robust for that example. Accumulating the gradients after each iteration over the dataset furthermore gives one large matrix computation instead of many small ones. The usage of the targeted method is motivated above, and using a fixed number of iterations n gives us better control.

Algorithm 3 Generating universal perturbations.

1: input: Data points X, classifier F, l_p norm constraint ε, number of iterations n.
2: output: Universal perturbation v
3: Initialize v ← 0
4: for n iterations do
5:     for each datapoint x_i ∈ X do
6:         Compute a perturbation that predicts x_i + v as road:
7:         Δv_i ← Clip(Targeted(x_i, v))
8:     Update the perturbation:
9:     v ← Clip(v + mean(Δv))
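At the NumPy level, algorithm 3 can be sketched as follows; targeted_step (a single targeted gradient step that pushes the prediction of x_i + v towards the road class) is a hypothetical helper standing in for the Theano graph, and Clip is realized here as a projection onto the l_inf ball of radius ε.

import numpy as np

def clip_eps(v, eps):
    # Project the perturbation back onto the l_inf ball of radius eps.
    return np.clip(v, -eps, eps)

def universal_perturbation(X, targeted_step, eps, n_iterations):
    # Sketch of algorithm 3. X holds the training images; targeted_step(x, v)
    # is a hypothetical helper returning one targeted gradient step for x + v.
    v = np.zeros_like(X[0], dtype=np.float32)            # 3: initialize v <- 0
    for _ in range(n_iterations):                        # 4: for n iterations
        deltas = [clip_eps(targeted_step(x, v), eps)     # 5-7: per-image step
                  for x in X]
        v = clip_eps(v + np.mean(deltas, axis=0), eps)   # 8-9: accumulate after the pass
    return v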

5.3 Introducing Parallel Computation in GPU

The prior implementation of universal adversarial perturbation generation runs on a single GPU and takes around 25 hours for 60 iterations over the 1704 selected training images. One of the bottlenecks in the generation algorithm is the computation of the gradient of the loss with respect to the image pixels. This computation is highly optimized on the GPU through Theano, which uses the Nvidia CUDA library. If a single image x_i takes t seconds to be passed forward and backward through the network, and if the gradient computations of the other images x_{j≠i} within the same iteration do not depend on the gradient of x_i, then the gradients of all images within an iteration can be computed in parallel. This independence holds in algorithm 3 because the gradients are only added after each iteration. Thus, if we can compute the gradients of k images at the same time, the total gradient computation time can be reduced to up to t/k. Note that this does not reduce the entire generation time of the universal adversarial perturbation.

Algorithm 4 shows the parallelized form of the algorithm. Here a batch size k is specified, into which the dataset is divided.


Algorithm 4 Parallelized generation of universal perturbations.

1: input: Data points X, classifier F, l_p norm constraint ε, number of iterations n, batch size k
2: output: Universal perturbation v
3: Initialize v ← 0
4: for n iterations do
5:     for each batch b_i ∈ X do
6:         Compute a perturbation that predicts b_i + v as road:
7:         Δv_i ← Clip(Targeted(b_i, v))
8:     Update the perturbation:
9:     v ← Clip(v + mean(Δv))    (average over all iterations)

All examples within the same batch b_i are then passed forward and backward through the GPU in parallel. A single image of size 1024x512 uses about 1-2GB of GPU memory. Using the parallelized algorithm, the Titan X GPUs can handle 8 images at the same time within the 12GB of available memory. This approach saves about one fifth of the total generation time of a universal adversarial perturbation. Because of the parallel computation the gradient averaging changes minimally in precision. This does not have a significant effect and the noise retains almost the same capabilities.
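A sketch of the batched variant, assuming a hypothetical targeted_step_batch helper that returns one clipped targeted step per image of a batch in a single GPU call:

import numpy as np

def universal_perturbation_batched(X, targeted_step_batch, eps, n_iterations, k=8):
    # Sketch of algorithm 4: split the dataset into batches of size k so that
    # the gradients of a whole batch are computed in one forward/backward pass.
    v = np.zeros_like(X[0], dtype=np.float32)
    batches = np.array_split(X, max(1, len(X) // k))
    for _ in range(n_iterations):
        deltas = [np.mean(np.clip(targeted_step_batch(b, v), -eps, eps), axis=0)
                  for b in batches]
        # as in algorithm 3, the perturbation is only updated after the full pass
        v = np.clip(v + np.mean(deltas, axis=0), -eps, eps)
    return v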

I have also tested averaging and adding the noise of the current iteration Δv_i after a number of images p. If the version in algorithm 3 is considered similar to batch gradient descent (i.e. iterating over the entire dataset), then the version in algorithm 2 can be considered similar to stochastic gradient descent (i.e. one example at a time). Using this analogy, adding the noise after a number of images can be considered similar to mini-batch gradient descent (i.e. a mini-batch at a time). Nevertheless, this method improved neither the execution speed nor the destruction rate of the perturbation.

5.4 Evaluation Method

As an evaluation method for the adversarial perturbations in the physical world I have created a benchmark performance without any noise. When presenting the perturbation on the powerwall and capturing a photo with a human subject in front of it, there is no ground truth. Obviously one could create a ground truth by labeling the captured images and then compare the performance of the classifier. Nevertheless, this would take quite some time even to test a single perturbation: for each perturbation that is to be tested, multiple images would have to be hand-labeled, which is infeasible when testing as many degrees of freedom as given in the scenario of this thesis. Therefore I have taken 119 pictures without a noise background, in which a human subject is walking in front of the powerwall under fixed lighting conditions. The human subject was the author in all experiments. This set creates my benchmark for the classification performance of the FCN-8 model in the physical world scenario tested in this work. By computing the average number of pedestrian pixels in each of the 119 images, all adversarial perturbations can be tested to see how large the difference in detected pedestrian pixels is. If the average number of pedestrian pixels is significantly lower than this benchmark, then the attack is at least partly successful. Note that in the following the pedestrian class will also be referred to as the "human" class.

Note that the human subject is sometimes only partially in the photo (as might happen in a real case scenario), hence the few very small detection values. The average detection is 22,588 pixels per image with a standard deviation of 8,913 pixels (see figure 5.3). In order to check the variance of this method I have run a second experiment with similar settings on a separate day (i.e. slight camera movement, different clothes etc.) and obtained a similar result (23,926 human pixels per image). This is the benchmark that any adversarial attack has to beat.

FIGURE 5.3: Prediction average of 119 images without a universal perturbation in the background.

Chapter 6 includes a table giving an overview of all results evaluated in this chapter.

5.5 Baseline Using Digital Noise

The first step in generating a transferable noise was to build a baseline which is optimized for the digital use case. For this scenario a noise has been created which performs well in hiding human pixels in the validation set of Cityscapes. Using different values of ε, one can see in figure 5.4 that the noise destroys over 99% of all human pixels in all 500 validation examples if epsilon is set to ε ≥ 20. The human destruction rate (HDR) is simply defined as:

HDR(X) = 1 − F(X + η) / F(X)    (5.1)

where the function F(·) returns the number of pixels predicted to be human. The human destruction rate can become negative when the perturbation causes the classifier to believe there are more human pixels in the image than without the perturbation. In other words, it becomes negative when the perturbation causes the semantic segmentation classifier to misclassify many background pixels as human pixels as well, which is not intended. Original images in which the classifier does not detect any human pixels have not been included in computing the destruction rate. There are 53 images out of the 500 validation images on which our classifier did not detect any human pixels.
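A small sketch of how the HDR of equation 5.1 can be computed, assuming a hypothetical count_human_pixels helper that returns, per image, the number of pixels the FCN-8 assigns to the pedestrian class:

import numpy as np

def human_destruction_rate(count_human_pixels, images, noise):
    # Equation 5.1 (sketch): 1 - F(X + eta) / F(X), skipping images on which
    # the classifier detects no human pixels in the first place.
    clean = np.asarray(count_human_pixels(images), dtype=np.float64)
    adv = np.asarray(count_human_pixels(images + noise), dtype=np.float64)
    keep = clean > 0
    return 1.0 - adv[keep].sum() / clean[keep].sum()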

FIGURE 5.4: Comparing different epsilon values (ε ∈ {1, 5, 10, 20, 30, 100}) to the respective destruction rate of human pixels, evaluated digitally.

Figure 5.4 shows that the destruction rate is largely dependent on ε, as would be expected. The scenarios in most previous publications required the perturbation to be imperceptible and therefore chose an ε which is small but still effective. This trade-off is not required in the scenario presented in this work. The adversarial perturbation can be visible for the duration of the attack, and when using a digital screen it is possible to remove the perturbation right after the attack without leaving a trace. Therefore the focus here is on effectiveness, and for most experiments ε was set to 100 unless otherwise stated. The perturbation created by optimizing for the validation set, i.e. for the digital environment, is further called digital noise and will be used as a comparison baseline for some experiments. The digital noise is shown in figure 5.5.

FIGURE 5.5: Universal perturbation created for the digital environment. It represents the baseline attack.

Next, the digital noise was tested on the powerwall to see whether an adversary needs to take any additional measures in order to successfully attack a semantic segmentation model in the physical world. The performance of the digital noise on the powerwall is measured as described in 5.4. With 21,159 human pixels detected, it reduces the number of detected human pixels by only 6.33% as compared to the benchmark. This shows that an adversarial perturbation for semantic segmentation optimized for a digital environment does not transfer to a physical environment when the object to be hidden is in a new scene for which no digital image exists.

5.6 Addressing Noise Restrictiveness

One strict constraint in the given scenario is the positioning of the adversarial perturbation. So far the creation of the noise has taken the loss of all pixels into account and allowed an update on all pixels in the image when backpropagating the gradient. In order to improve the perturbation's effectiveness at hiding objects which it does not cover, I have introduced a masked variant of the targeted algorithm.


FIGURE 5.6: Learning the perturbation by applying it everywhere except on the target objects.

FIGURE 5.7: Intermediate adversarial perturbation during training, as applied above.

The perturbation update step of the masked algorithm is defined as:

η_masked^(t) ← η^(t) ∗ α    (5.2)

where η^(t) is the perturbation at time step t and α is the sparse mask. α has the same spatial shape as η but only a single channel (the same mask is used for all color channels) and contains a 0 for all object pixels and a 1 for every other pixel.


Figures 5.6 and 5.7 show how the masked algorithm applies the perturbation only to the background of the image and not to the human objects. Figure 5.7 furthermore shows the humans subtracted from the noise, as it is generated in a single iteration.
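A minimal sketch of the masked update of equation 5.2, assuming a binary object mask derived from the ground truth (1 on pedestrian pixels, 0 elsewhere):

import numpy as np

def mask_perturbation(eta, object_mask):
    # Equation 5.2 (sketch): zero the perturbation on all object pixels so the
    # noise is only applied to, and learned on, the background.
    alpha = 1.0 - object_mask.astype(eta.dtype)   # 0 on objects, 1 elsewhere
    return eta * alpha[..., np.newaxis]           # broadcast over the color channels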

5.7 Addressing Rotational Invariance

5.7.1 Methodology

Rotational invariance in computer vision refers to the invariance of a model towards one-dimensional rotations of the input. For example, an object detector that has learned to identify airplanes should identify an image of an airplane even though it has been rotated by a few degrees. When a machine learning model is intended to be rotationally invariant, the typical approach is to augment the training data by slightly rotating some of the training images at random. The augmented images can either replace the corresponding training images or be added to the training set. Although the latter method increases the dataset, it might lead to overfitting.

The idea was to use a similar kind of process for improving the perturbation's rotational invariance. When the perturbation is presented on a screen, the camera is likely not to be parallel to the screen. In other words, there will be a slight rotation between the camera's x-axis and the screen's x-axis. Using fine calibration this rotation could be avoided, but considering the practicality of this attack it is not likely. Note that the goal is not to improve the model's rotational invariance but rather the perturbation's rotational invariance with respect to the scene.

Following the data augmentation idea, it is not sufficient to rotate the input data at random in order to improve the perturbation's invariance. Therefore the perturbation of each image is rotated. The rotated noise is thus defined as:

η_rotated^(t) ← (1 / |D|) Σ_{β∈D} η^(t) ∗ β    (5.3)


where η^(t) is the perturbation at time step t and D is the set of all rotation angles, D = [−5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5].
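A sketch of the rotation averaging of equation 5.3; the use of scipy and of nearest-neighbor interpolation (order=0) are assumptions for illustration, not necessarily the exact implementation on the Theano side:

import numpy as np
from scipy.ndimage import rotate

def rotate_and_average(eta, angles=range(-5, 6)):
    # Equation 5.3 (sketch): average the perturbation over all rotation angles
    # in D. reshape=False keeps the spatial size of the perturbation unchanged.
    rotated = [rotate(eta, angle, axes=(0, 1), reshape=False, order=0)
               for angle in angles]
    return np.mean(rotated, axis=0)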

5.7.2 Evaluation

Figure 5.8 shows the digital noise being rotated uniformly from -5 to 5 degrees in steps of 1 degree and then applied to the 500 images of the Cityscapes validation set. This way, the perturbation is sometimes added as a rotated variant. It turns out that the rotation causes large artifacts in the noise. It thus leads to a strong misclassification in all regions, but to a very weak source/target misclassification. Instead of learning to adjust to the rotation, the algorithm seems to repeatedly apply the same increase or decrease to the perturbation. As a result, the adversarial perturbation looks like an average of the same perturbation rotated in multiple directions.

The image in figure 5.9 looks clearly different from previous perturbations, and most artifacts and patterns have been destroyed. Hence, it is to be expected that the noise might perform much worse in terms of accuracy. Figure 5.10 then shows that the perturbation performs on average worse than the benchmark. I have tried to place the rotation at other points in the noise creation algorithm, but with similar results. One improvement could be to randomly rotate the perturbation with a probability p instead of rotating it in all directions.


FIGURE 5.8: Comparison of the digital noise with the rotated version with respect to rotation (x-axis: rotation degree from -5 to 5; y-axis: human destruction rate; series: digital noise vs. experiment).

FIGURE 5.9: Universal perturbation using rotation.


FIGURE 5.10: Comparison of the rotational perturbation with the benchmark (22,588 pixels) on the powerwall (per-image human pixels detected, with prediction, average and benchmark average).


5.8 Addressing Scaling Invariance

5.8.1 Methodology

The adversarial perturbation needs to be invariant to scaling transformations: as a self-driving vehicle moves towards an adversarial poster, the pixel size of the adversarial perturbation as captured by the vehicle's cameras increases at every position. Furthermore, the adversarial perturbation is created using the 1024x512 pixel downscaled version of Cityscapes, and for presenting the perturbations on the powerwall they need to be resized to 4096x2048 pixels in order to be larger than the person in front of it.

Similarly to previous sections, I follow an augmentation-inspired method in which the noise is transformed before being accumulated. More specifically, the noise of the current mini-batch is scaled with a 20% probability by a factor f drawn uniformly with replacement from the set P_f = [0.8, 0.9, 1.1, 1.2]. Thus the scaled perturbation is defined as

η_scaled^(t) ← { F(η^(t), P_f) if p > 0.8, η^(t) otherwise }    (5.4)

where η^(t) is the perturbation at time step t and F(η^(t), P_f) is a function that scales the perturbation up or down by a random factor from P_f and then resamples it to the original size, both times using a nearest-neighbor method. p defines the probability such that the transformation is only applied in 20% of the cases.
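A sketch of equation 5.4 with nearest-neighbor resampling; the use of scipy.ndimage.zoom and the defensive crop against one-pixel rounding differences are assumptions for illustration:

import numpy as np
from scipy.ndimage import zoom

def maybe_scale(eta, factors=(0.8, 0.9, 1.1, 1.2), prob=0.2):
    # Equation 5.4 (sketch): with 20% probability, rescale the perturbation by
    # a random factor from P_f and bring it back to its original size, both
    # times with nearest-neighbor interpolation (order=0).
    if np.random.rand() >= prob:
        return eta
    f = float(np.random.choice(factors))
    h, w = eta.shape[:2]
    scaled = zoom(eta, (f, f, 1), order=0)
    back = zoom(scaled, (h / scaled.shape[0], w / scaled.shape[1], 1), order=0)
    return back[:h, :w]    # guard against off-by-one output shapes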

5.8.2 Evaluation

Figure 5.11 shows the result of the scaled noise on the validation set. Again the attack performs best with no or only slight modifications. Here the transformation also seems to have a negative effect on the attack in some settings, although less pronounced than for the rotation. While the digital perturbation does not get destroyed by upsizing, the scaled noise does.

Here the interpolation method is nearest-neighbor. In another experiment done on the powerwall I have noticed that interpolation using nearest-neighbor can actually increase the destruction rate, whereas bilinear interpolation reduces it. Based on this, one can conclude that performing an operation that includes smoothing after the optimization is done, i.e. bilinear interpolation of the final perturbation, destroys important features of the adversarial attack.

FIGURE 5.11: Comparison of the digital noise with the scaled version with respect to scaling (x-axis: resize factor from 0.25 to 4.0; y-axis: human destruction rate; series: digital noise vs. experiment).

Lastly, figure 5.12 shows how the scaled noise, which was trained for 45 iterations, performs on the powerwall as compared to the benchmark. One can see that the average prediction of human pixels drops to 17,879, which is a reduction of 20.85% as compared to the benchmark. It can therefore be concluded that the scaling of the noise increases the attack capability in the physical world.

FIGURE 5.12: Comparison of the scaling perturbation with the benchmark (22,588 pixels) on the powerwall (per-image human pixels detected, with prediction, average and benchmark average).

5.9 Addressing Object Size Invariance

5.9.1 Methodology

Even though the adversarial perturbations created in the digital environment are largely invariant to the size of the humans they aim to hide in the image, they do sometimes struggle with especially large ones. This is due to the fact that there are only very few large humans in the dataset. Additionally, a larger object requires a larger effective receptive field size. This means that more pixels affect the prediction for larger objects than for smaller objects, and therefore more pixels need to be changed to change the prediction of the network. When using an upscaled noise on the powerwall, the human subject standing in front of it is still larger, in comparison to the noise size, than most humans in the training set. Therefore it is hard for the perturbation to hide a human standing in front of the powerwall.

As an approach to address this object size invariance I have doubled the size of all humans in the Cityscapes dataset. More specifically, I have taken a box around all human pixels in an image (note that this can include multiple humans) and doubled its size. In order to apply the noise only to the background, the same resizing needs to be done to the ground truth.

5.9.2 Evaluation

The evaluation of the perturbation when shown on the powerwall while capturing 119 photos with a single human walking in front of it is shown in figure 5.13. The average number of human pixels detected over all images is 19,055 (refer to figure 5.13) and thus 15.64% lower than the benchmark.

FIGURE 5.13: Comparison on the powerwall of the noise where the objects were increased in size with the benchmark (22,588 pixels) (plot title: Object Scaling + NPS + Physical World Layer; per-image human pixels detected, with prediction, average and benchmark average).

5.10 Addressing Physical Realizability

One of the main issues of transferring an adversarial perturbation into the physical world is that there are two major transformations happening, besides those already elaborated: First, a transformation of the digital image onto a projected image on a screen. Every screen represents colors slightly differently, and color management is required to get comparable results. A digital display furthermore emits light that is not present in the digital image. Second, a transformation of the scene into a digital image using a camera. Again, each camera behaves slightly differently and the scene changes. Even though these slight changes do not affect how a human perceives the scene, they do affect how a trained classifier perceives it. This can be explained by the fact that the training algorithm of the classifier never sees these kinds of transformations.

The authors of [49] have addressed these transformations using a smoothing loss² and a non-printability score. Note that they have used a printer instead of a display, but the issue at hand remains the same. In order to address the physical realizability of adversarial perturbations for semantic segmentation I have implemented both the smoothing loss and the non-printability score as custom loss functions in Theano.

² The original paper calls it a total variation loss.


5.10.1 Methodology Smoothing

Equation 4.11 only takes a single image as input. As I have parallelized the algorithm on the GPU to speed up training, the loss function receives multiple images at the same time. It is therefore necessary to compute the smoothing loss for all images within a batch at once. In most runs the amount of GPU memory used is already at its maximum, and creating multiple copies of each image in parallel is infeasible. Hence, my implementation loops over all images in a batch, slowing down the total computation speed. In order to balance the loss I am furthermore taking the square root of the sum of all image losses in a batch. The smoothing loss function is then defined as:

TV(η) = ( Σ_k Σ_c Σ_{i,j} ( (η_{k,c,i,j} − η_{k,c,i+1,j})² + (η_{k,c,i,j} − η_{k,c,i,j+1})² )^(1/2) )^(1/2)    (5.5)

where k indexes the images of the batch, c indexes the color channels (in this case 3, i.e. RGB) and i and j are the pixel locations along the x- and y-axis. For implementation details see section A.1.

Furthermore, I have introduced an alternative method of smoothing: a smoothing layer which uses a moving average over neighboring pixels. More specifically, each pixel becomes the average of its neighbors, obtained by sliding a 3x3 grid across the image. This can simply be seen as a convolution layer with a 3x3 kernel size and one filter for each channel. Each filter weight is set to 1 where input channel and output channel are identical, and to 0 otherwise; otherwise the layer would average over the three color channels. The smoothing convolution is applied after the gradients have been computed and before the current noise is added to the global average. The motivation behind using a smoothing layer is that individual pixels are hard to identify after both the displaying and the capturing transformation. Thus, by smoothing over regions, the hope is to create a perturbation that looks more natural and is in that sense better consumed by the classification model which is to be fooled.
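The smoothing layer can be sketched as a per-channel 3x3 box filter; scipy is used here purely for illustration, whereas the thesis realizes it as a fixed convolution layer inside the network:

import numpy as np
from scipy.ndimage import uniform_filter

def smooth_perturbation(eta):
    # Sketch of the smoothing layer: replace every pixel by the mean of its
    # 3x3 neighborhood, independently per color channel, so that the averaging
    # never mixes the RGB channels.
    smoothed = np.empty_like(eta)
    for c in range(eta.shape[-1]):
        smoothed[..., c] = uniform_filter(eta[..., c], size=3, mode="nearest")
    return smoothed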


5.10.2 Evaluation Smoothing

The result of using the smoothing loss on the powerwall is shown in figure 5.14. The average number of detected human pixels over all 119 images is 20,376 and therefore 9.79% lower than the benchmark. Thus there is only a small improvement, which might not hold in general.

FIGURE 5.14: Comparison on the powerwall of the perturbation using the smoothing loss with the benchmark (22,588 pixels) (per-image human pixels detected, with prediction, average and benchmark average).

The result of using the smoothing layer on the powerwall is shown in figure 5.15. With an average of 19,859 human pixels detected per image, it is slightly better than the smoothing loss. The difference to the benchmark is 12.08%.


FIGURE 5.15: Comparison on the powerwall of the perturbation using the smoothing layer with the benchmark (22,588 pixels) (per-image human pixels detected, with prediction, average and benchmark average).

5.10.3 Methodology Printability

The second addition introduced by [49] is the non-printability score. This is a loss function which penalizes each pixel by the amount it differs from a set of printable RGB triplets. For simplicity, I will also refer to the ability to display colors on a screen as printability. The set of printable RGB triplets P has been created by showing a subspace of the RGB space on the powerwall and then capturing a picture of it using the same camera that is used for the experiments. There is an even spacing between the color triplets; the shown image and its capture are shown in figures 5.16 and 5.17.

FIGURE 5.16: A 10th of the RGB color palette with equal spacing.

FIGURE 5.17: Captured photo of the RGB color palette from the powerwall.

Figure 5.17 shows the photo of the RGB space as shown on the powerwall. The actual capture is cropped and warped so that it matches the spacing of the original RGB space image. One can easily notice the difference in contrast, brightness and color saturation. Also the variety of colors decreases largely. As the formula of the non-printability score (ref. equation 4.12) involves taking a product over the set P for each pixel in the image, this would be computationally expensive. Hence, the set P is quantized to the 30 colors with the smallest variation in distance from the rest of the set. The number 30 was suggested in [49], and using larger sizes did not lead to significantly better results in my experiments either.

The computation of the color triplets with the smallest difference from all the others has been done using the CIEDE2000 color difference. The CIEDE2000 algorithm was introduced by Luo, Cui, and Rigg [35] and uses the CIELAB color space, also called the L*A*B* color space. The CIELAB color space describes the colors visible to the human eye and is therefore supposed to be display agnostic. Color distances in CIELAB space are also called ΔE∗, and CIEDE2000 is the most recent version. By using the CIEDE2000 algorithm, the color distances are therefore closer to what a human perceives as color distance, as compared to an L2 distance. In [49] it was not specified which kind of distance measure was used.
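One plausible reading of the quantization step is sketched below: among the captured colors, keep the 30 triplets whose CIEDE2000 distances to the rest of the set vary the least. The use of scikit-image for the color conversion and the ΔE∗ computation is an assumption; the thesis does not prescribe a particular library.

import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def select_printable_colors(captured_rgb, n_colors=30):
    # captured_rgb: (N, 3) array of RGB triplets in [0, 1] taken from the
    # photographed color chart. Returns the n_colors triplets with the
    # smallest variance of CIEDE2000 distances to the remaining colors.
    lab = rgb2lab(captured_rgb.reshape(1, -1, 3)).reshape(-1, 3)
    scores = []
    for i in range(len(lab)):
        d = deltaE_ciede2000(np.tile(lab[i], (len(lab), 1)), lab)
        scores.append(np.var(np.delete(d, i)))
    keep = np.argsort(scores)[:n_colors]
    return captured_rgb[keep]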

FIGURE 5.20: 30 triplets with the smallest variance in difference from the entire RGB space used.

FIGURE 5.21: 30 triplets with the smallest variance in difference from the captured RGB space used.

The selected colors are shown in figures 5.20 and 5.21. For comparison, the 30 RGB triplets with the least variance in distance to the entire set are shown for both the original RGB colors and for the captured RGB space.

The non-printability score was then implemented in Theano, similarly to the smoothing loss. Each loss was added as a separate head in the FCN-8 network architecture, and the total loss is the combination of all three. Before being put into practice, the noise was also subjected to a mapping which tries to alleviate the effect of the color transformation from a digital image to a captured photo. The mapping simply maps each RGB triplet to a triplet that is actually printable while having the minimal ΔE∗ distance from the original RGB triplet. The mapping is done using a simple linear regression model which has learned to map the digital colors to the transformed colors with a mean squared error of 257 and a variance score of 90%. Some of these mappings are shown in figures 5.18 and 5.19. The left column shows the desired color (left) and the color as it would be printed, based on the RGB space photograph (right). The title of each row contains the error as measured using the CIEDE2000 distance. The right column contains the desired color (left), a suggested replacement color based on the mapping (center), and the predicted color based on the RGB space photograph from that replacement color (right). The title of each row contains the CIEDE2000 distance from that replacement color prediction. In other words, the rightmost color in the right column is supposed to be a better approximation of the leftmost color than the right color in the left column. As one can see, the distance to the printed color is on average smaller when using ΔE∗ as compared to the euclidean distance.

FIGURE 5.18: 3 random examples of the color mapping using the euclidean distance. Explanation in the text.

FIGURE 5.19: 3 random examples of the color mapping using the CIEDE2000 distance. Explanation in the text.

As an addition to the non-printability score I have used the learned linear regression model described above as an additional layer in my network. This transformation, called digital2transformed (D2T), can be seen as a convolutional layer with 3 filters (one for each color channel) of size 1x1. Initially, I have implemented it as an additional layer after the input layer of the network. The intuition is that each image is first transformed to the physical world and the perturbation is then learned on top of that. This way the final perturbation does not need to be transformed anymore in a last step, as was done previously. Rather, the goal is that the perturbation inherently learns to handle the D2T transformation.
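A sketch of the D2T mapping, here fitted with scikit-learn (an assumption; any linear least-squares fit would do) and applied pixel-wise, which is mathematically equivalent to the 1x1 convolution plus bias used inside the network:

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_d2t(digital_colors, captured_colors):
    # Fit the digital2transformed (D2T) mapping: a linear regression from the
    # displayed RGB triplets to the RGB triplets the camera actually captured.
    # Both arguments are assumed to be (N, 3) arrays from the color-chart photos.
    return LinearRegression().fit(digital_colors, captured_colors)

def apply_d2t(d2t, image):
    # Apply the mapping to every pixel of an (H, W, 3) image.
    h, w, c = image.shape
    mapped = d2t.predict(image.reshape(-1, c).astype(np.float64))
    return np.clip(mapped.reshape(h, w, c), 0.0, 255.0)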

In another experiment the D2T transformation is implemented at the end of the network so that it can be applied solely to the background and not to the object. This is called masked D2T. It is motivated by the fact that the human standing in front of the powerwall is not subject to the displaying transformation done by the screen. Qualitative analysis of the results has shown that this transformation has a much larger effect than that of the camera. Thus, by applying the D2T transformation only to the background, the training process is moved closer to the physical world scenario. In figure 5.22 one can see how applying the transformation to the humans as well would make the scene unrealistic. By applying it to the background only, it resembles the scenario using the powerwall much more closely.


FIGURE 5.22: Digital2transformed transformation: (left) no transformation applied, (center) transformation applied everywhere, (right) masked D2T transformation applied to the background only.

5.10.4 Evaluation Printability

Figure 5.23 shows the result of the perturbation with only the non-printability score as an addition. This turns out to be the single most effective addition. With an average of 16,786 human pixels detected per image, it beats the benchmark by 25.69%. This means that it causes about one quarter of all human pixels to be misclassified on average.

The combination of non-printability score and smoothing loss as described in [49] only led to an average of 18,321 human pixels detected per image. This is a reduction of 18.89% from the benchmark, which is comparably good but worse than the non-printability score alone. The evaluation graph is shown in Appendix A.

Using the physical world layer alone achieves an average of 21,859 human pixels detected per image. This is only a 3.23% reduction from the benchmark and thus worse than the digital noise baseline. Nevertheless, combining the physical world layer with the non-printability score achieves the best overall destruction: it leads to an average of 16,539 human pixels detected per image (see figure 5.24). This is a 26.78% reduction from the benchmark and thus even better than the non-printability score alone. Using the physical world layer together with the non-printability score and the smoothing loss only achieves a 10.93% reduction from the benchmark.


FIGURE 5.23: Comparison on the powerwall of the perturbation using the NPS with the benchmark (22,588 pixels) (per-image human pixels detected, with prediction, average and benchmark average).

5.11 Intensified Perturbations

5.11.1 Methodology

In another experiment the goal is to intensify the learned perturbation η by multiplying each pixel value with the same constant f. The intensified perturbation at time step t is thus defined as:

η_intensified^(t) ← η^(t) ∗ f    (5.6)

This approach works surprisingly well when tested in a digital environment. Even noise that is not optimal, like the one created with only the smoothing loss, which performs worse in a digital environment than other noise, turned out to hide almost all human pixels when tested digitally. Thus the intensified perturbation was tested with f set to 2 and 4.

FIGURE 5.24: Comparison of the perturbation using the NPS and the physical world layer with the benchmark (22,588 pixels) (per-image human pixels detected, with prediction, average and benchmark average).

Figure 5.25 shows the result for a perturbation intensified by a factor of f = 2. With 20,450 average human pixels detected per image, it performs 9.47% better than the benchmark. Nevertheless, it performs almost identically to the smoothing loss perturbation from which it originates. Thus there is no improvement compared to the smoothing loss.

FIGURE 5.25: Comparison of the perturbation using the smoothing loss factored by 2 with the benchmark (22,588 pixels) (per-image human pixels detected, with prediction, average and benchmark average).

Figure 5.26 shows the result for a perturbation intensified by a factor of f = 4. With 33,503 average human pixels detected per image, it performs 48.32% worse than the benchmark. Thus the attack actually becomes worse in the physical world when multiplying by a large factor.


FIGURE 5.26: Comparison of the perturbation using the smoothing loss factored by 4 with the benchmark (22,588 pixels) (per-image human pixels detected, with prediction, average and benchmark average).


5.12 Source/Target Misclassification

FIGURE 5.27: Prediction results: (top row) non-printability score, (bottom row) scaling noise; (left) captured images, (right) network predictions.

Figure 5.27 shows two example images with their corresponding predictions. The bright red color represents the class pedestrian, and the purple color represents the class road. A misclassification attack S succeeds by misclassifying the pedestrian, i.e. human, pixels as any other class of the Cityscapes dataset (refer to table 3.1). It can be defined as

S = Σ_{p∈O} { 0 if y_pred = y_attack, 1 otherwise }    (5.7)

where O is the set of all object pixels, y_attack is the class to be hidden (in this scenario pedestrian) and y_pred is the predicted class label. Thus it sums over all object pixels which are classified incorrectly. This means that as long as the pedestrian class is not detected as such, the attack is successful.

The source/target misclassification success S, on the other hand, is measured by the number of pixels which are classified as the target class:

S = Σ_{p∈X} { 1 if y_pred = y_target, 0 otherwise }    (5.8)


Here the sum is taken over the entire image X, and those pixels are counted for which the prediction equals the target class y_target. In the experiment here, the goal is to assign the road class everywhere in front of the screen. Thus the object, i.e. the human, as well as the background should be predicted as class road. If the human or the background is predicted to be any class other than road, the misclassification attack might still be successful (as long as it is not the pedestrian class), but the source/target misclassification attack fails.
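Both success measures can be sketched directly on a predicted label map; pred is an (H, W) array of class ids and object_mask marks the pedestrian pixels (hypothetical inputs for illustration):

import numpy as np

def misclassification_success(pred, object_mask, attacked_class):
    # Equation 5.7 (sketch): count object pixels NOT predicted as the attacked
    # class (here: pedestrian).
    return int(np.sum(pred[object_mask.astype(bool)] != attacked_class))

def source_target_success(pred, target_class):
    # Equation 5.8 (sketch): count all pixels of the image predicted as the
    # target class (here: road).
    return int(np.sum(pred == target_class))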

As one can see in figure 5.27, although the pixels are not classified as human pixels, the shape of the human subject is partly kept intact, i.e. multiple other classes are assigned to the same location or around it. In some attacks the background gets destroyed to a large extent, whereas in the top row of figure 5.27 the background mostly fulfills the target class (road).

Table 5.1 shows the result of all created perturbations and their average number of road pixels over all 119 powerwall images. As a comparison, the number of pixels which are not classified as pedestrian pixels in the benchmark is shown. These are the pixels that were assigned to some background class in the benchmark. An adversarial attack that succeeds in a source/target attack should be able to classify close to, or more than, that number of pixels as the targeted class. The target class here is road, which is colored purple in the predictions.

No attack is able to produce more road pixels than the benchmark would suggest. That means that even when the perturbation is able to hide human pixels, it is not able to fill all those pixels with the targeted class. The sole usage of the non-printability score achieves the highest average amount of road pixels, and it achieved the second best result in the misclassification attack. Thus it presents the most successful attack.

                                                 Non-Pedestrian Pixels
    Benchmark                                    95.69%

                                                 Road Pixels
    Digital Noise                                94.04%
    Smoothing Loss                               89.58%
    Smoothing Layer                              86.01%
    Noise Rotation                               60.43%
    Object Scaling + NPS + Physical World Layer   7.99%
    NPS                                          94.31%
    NPS + Physical World Layer                   86.71%
    NPS + Smoothing                              81.93%
    Scaled Noise                                 32.29%
    Smoothing Loss factored x2                   81.52%
    Smoothing Loss factored x4                   55.05%
    Physical World Layer                         93.38%
    Physical World Layer + Smoothing Loss + NPS  82.90%

TABLE 5.1: Source/target misclassification attack results.

Some other attacks which seem to work well in the misclassification attack perform very badly in the source/target misclassification attack. Examples include the object scaling + NPS + physical world layer attack (15.64% reduction from the benchmark) as well as the scaled noise attack (20.85% reduction from the benchmark). Lastly, attacks that perform poorly in the misclassification attack also perform poorly in the source/target misclassification attack. Examples are the factored x4 perturbation (48.32% increase from the benchmark) as well as the noise rotation perturbation (48.53% increase from the benchmark).

5.13 Other Experiments

Other ideas that I have investigated, but which would need further study, as well as ideas for future research, include:

• Using an ensemble to create adversarial examples.

• Using different attack methods than the iterative targeted method.

• When accumulating the noise without a background the performance was reduced - why?

• Using a different Lp norm.


• Implementing the CIEDE2000 algorithm in Theano directly.

• Using a network architecture with a different performance.

• Changing the α step size parameter.

• Finding the optimal combination of the methods above.


Chapter 6

Conclusion

All results from Chapter 5 have been aggregated in table 6.1 for the misclassification attack and in table 5.1 for the source/target misclassification attack. It was shown that universal adversarial perturbations can change the prediction of up to 27% of the pixels of an attacked class of a semantic segmentation classifier in the physical world. Nevertheless, no adversarial perturbation has been found that allows a targeted attack in which the entire object and its surroundings are classified as a specified target class. The experiments have been done on the Cityscapes dataset [5]¹; the pedestrian class was attacked in both scenarios and the road class was used as the target in the latter scenario. The adversarial perturbations have been specifically trained to survive the transformation to the physical world and are universal in the sense that they are trained in an image-agnostic fashion.

The benchmark has been created by taking 119 images in front of the powerwall (a 3x7 meter display). These images have been fed into the semantic segmentation classifier, which predicts an average of 22,588 pedestrian class pixels. This leaves 95.69% non-pedestrian pixels. A misclassification attack should significantly reduce the number of predicted pedestrian pixels, whereas a source/target misclassification attack should predict more pixels as the road class than the benchmark predicts as non-pedestrian.

First, it was shown that a universal adversarial perturbation optimized for a digital environment (here called a digital noise) is not able to achieve a strong misclassification in the physical world. Thus multiple novel methods have been introduced to allow the perturbation to transfer to the physical world.

¹ Downscaled to 1024x512 pixels.

The misclassification attack summary in table 6.1 shows that universal adversarial perturbations are able to fool a semantic segmentation classifier into misclassifying close to 27% of all pixels of an attacked object in the physical world. This result is subject to the constraint that the adversarial perturbation is only shown behind the object to be hidden, i.e. it may not cover the object.

                                                 Avg. Pedestrian   Change to
                                                 Pixels            Benchmark
    Benchmark                                    22,588            -
    Digital Noise                                21,159            6.33%
    Smoothing Loss                               20,376            9.79%
    Smoothing Layer                              19,859            12.08%
    Noise Rotation                               33,550            -48.53%
    Object Scaling + NPS + Physical World Layer  19,055            15.64%
    NPS                                          16,786            25.69%
    NPS + Physical World Layer                   16,539            26.78%
    NPS + Smoothing                              18,321            18.89%
    Scaled Noise                                 17,879            20.85%
    Smoothing Loss factored x2                   20,450            9.47%
    Smoothing Loss factored x4                   33,503            -48.32%
    Physical World Layer                         21,859            3.23%
    Physical World Layer + Smoothing Loss + NPS  20,119            10.93%

TABLE 6.1: Misclassification attack results summary.

The source/target misclassification summary in table 5.1 shows that no universal adversarial perturbation was found that is able to replace an attacked object class with a target class. This means that even though the prediction of many pixels can be altered in this scenario, finding a perturbation that allows the adversary to select the predicted class remains a challenge.


Appendix A

Appendix A

FIGURE A.1: Evaluation of the perturbation using the NPS and the smoothing loss on the powerwall (per-image human pixels detected, with prediction, average and benchmark average).

FIGURE A.2: Sheared images from the Cityscapes dataset, sheared by 1.0 and -1.0 degrees.

A.1 Implementation Details Smoothing Loss

For the implementation of the smoothing loss I have adjusted the formula in equation 4.11 so that it works for my high dimensional case. What the algorithm does is calculate the difference between each pixel and its right and lower neighbors along the x- and y-axis. A straightforward way of implementing this is using two matrices of the same size and subtracting one from the other, for both the vertical and the horizontal shift. As matrix reshaping becomes expensive in Theano, I have created four sub-matrices r, l, t, b, denoting right, left, top and bottom respectively, where shape(r) = shape(l) and shape(t) = shape(b). These sub-matrices are each 1 pixel smaller in the horizontal or vertical direction as compared to the original image, as shown in figure A.3. Since the matrices t and b have the same shape, and r and l as well, these pairs can be used for matrix subtraction.
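A NumPy sketch of this shifted sub-matrix formulation of equation 5.5; the boundary pixels without a right or lower neighbor are dropped, which is a simplification of the slicing shown in figure A.3:

import numpy as np

def smoothing_loss(eta_batch):
    # eta_batch: perturbations of shape (k, H, W, C).
    dh = (eta_batch[:, :, :-1, :] - eta_batch[:, :, 1:, :]) ** 2   # l minus r
    dv = (eta_batch[:, :-1, :, :] - eta_batch[:, 1:, :, :]) ** 2   # t minus b
    # crop both difference maps to the (H-1, W-1) region where they overlap
    per_pixel = np.sqrt(dh[:, :-1, :, :] + dv[:, :, :-1, :])
    # sum over pixels, channels and the batch, then take the outer square root
    return float(np.sqrt(per_pixel.sum()))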

FIGURE A.3: Matrix subtraction for smoothing loss.


Bibliography

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. “SegNet: A DeepConvolutional Encoder-Decoder Architecture for Image Segmentation”. In:CoRR abs/1511.00561 (2015). URL: http://arxiv.org/abs/1511.00561.

[2] Osbert Bastani et al. “Measuring Neural Net Robustness with Constraints”.In: CoRR abs/1605.07262 (2016). URL: http://arxiv.org/abs/1605.07262.

[3] Liang-Chieh Chen et al. “DeepLab: Semantic Image Segmentation with DeepConvolutional Nets, Atrous Convolution, and Fully Connected CRFs”. In:CoRR abs/1606.00915 (2016). URL: http://arxiv.org/abs/1606.00915.

[4] François Chollet et al. Keras. https://github.com/fchollet/keras.2015.

[5] Marius Cordts et al. “The Cityscapes Dataset for Semantic Urban Scene Un-derstanding”. In: Proc. of the IEEE Conference on Computer Vision and PatternRecognition (CVPR). 2016.

[6] Jifeng Dai et al. “R-FCN: Object Detection via Region-based Fully Convolu-tional Networks”. In: CoRR abs/1605.06409 (2016). URL: http://arxiv.org/abs/1605.06409.

[7] Jia Deng et al. “Imagenet: A large-scale hierarchical image database”. In:Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conferenceon. IEEE. 2009, pp. 248–255.

93

Page 116: Adversarial Attacks on Semantic Segmentation in the ...oa.upm.es/55875/1/TFM_FABIAN_EITEL.pdf · ANN artificial neural network. 7, 10 BMVA The British Machine Vision Association

BIBLIOGRAPHY

[8] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradient meth-ods for online learning and stochastic optimization”. In: Journal of MachineLearning Research 12.Jul (2011), pp. 2121–2159.

[9] Mark Everingham et al. “The pascal visual object classes (voc) challenge”.In: International journal of computer vision 88.2 (2010), pp. 303–338.

[10] Volker Fischer et al. “Adversarial Examples for Semantic Image Segmen-tation”. In: Proceedings of 5th International Conference on Learning Represen-tations (ICLR), Workshop paper. 2017. URL: https://arxiv.org/abs/1703.01101.

[11] Alberto Garcia-Garcia et al. “A Review on Deep Learning Techniques Ap-plied to Semantic Segmentation”. In: CoRR abs/1704.06857 (2017). URL:http://arxiv.org/abs/1704.06857.

[12] Andreas Geiger, Philip Lenz, and Raquel Urtasun. “Are we ready for Au-tonomous Driving? The KITTI Vision Benchmark Suite”. In: Conference onComputer Vision and Pattern Recognition (CVPR). 2012.

[13] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep Sparse Recti-fier Neural Networks”. In: Proceedings of the Fourteenth International Confer-ence on Artificial Intelligence and Statistics. Ed. by Geoffrey Gordon, DavidDunson, and Miroslav Dudk. Vol. 15. Proceedings of Machine Learning Re-search. Fort Lauderdale, FL, USA: PMLR, 2011, pp. 315–323. URL: http://proceedings.mlr.press/v15/glorot11a.html.

[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining andharnessing adversarial examples”. In: arXiv preprint arXiv:1412.6572 (2014).

[16] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In:CoRR abs/1512.03385 (2015). URL: http://arxiv.org/abs/1512.03385.

[17] Kaiming He et al. “Mask R-CNN”. In: CoRR abs/1703.06870 (2017). URL: http://arxiv.org/abs/1703.06870.


[18] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780.

[19] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators”. In: Neural networks 2.5 (1989), pp. 359–366.

[20] Jonathan Huang et al. “Speed/accuracy trade-offs for modern convolutional object detectors”. In: CoRR abs/1611.10012 (2016). URL: http://arxiv.org/abs/1611.10012.

[21] David H Hubel and Torsten N Wiesel. “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex”. In: The Journal of physiology 160.1 (1962), pp. 106–154.

[22] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. “Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding”. In: CoRR abs/1511.02680 (2015). URL: http://arxiv.org/abs/1511.02680.

[23] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980 (2014). URL: http://arxiv.org/abs/1412.6980.

[24] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. “CIFAR-10 (Canadian Institute for Advanced Research)”. URL: http://www.cs.toronto.edu/~kriz/cifar.html.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.


[27] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. “Adversarial examples in the physical world”. In: CoRR abs/1607.02533 (2016). URL: http://arxiv.org/abs/1607.02533.

[28] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (2015), pp. 436–444.

[29] Yann LeCun, Yoshua Bengio, et al. “Convolutional networks for images, speech, and time series”. In: The handbook of brain theory and neural networks 3361.10 (1995), p. 1995.

[30] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[31] Yi Li et al. “Fully Convolutional Instance-aware Semantic Segmentation”. In: CoRR abs/1611.07709 (2016). URL: http://arxiv.org/abs/1611.07709.

[32] Wei Liu et al. “SSD: Single shot multibox detector”. In: European Conference on Computer Vision. Springer. 2016, pp. 21–37.

[33] Yanpei Liu et al. “Delving into Transferable Adversarial Examples and Black-box Attacks”. In: CoRR abs/1611.02770 (2016). URL: http://arxiv.org/abs/1611.02770.

[34] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.

[35] M. R. Luo, G. Cui, and B. Rigg. “The development of the CIE 2000 colour-difference formula: CIEDE2000”. In: Color Research and Application 26.5 (2001), pp. 340–350. ISSN: 1520-6378. DOI: 10.1002/col.1049. URL: http://dx.doi.org/10.1002/col.1049.

[36] Jan Hendrik Metzen et al. “Universal Adversarial Perturbations Against Semantic Image Segmentation”. In: submitted. 2017. URL: https://arxiv.org/abs/1704.05712.


[37] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. “DeepFool: a simple and accurate method to fool deep neural networks”. In: CoRR abs/1511.04599 (2015). URL: http://arxiv.org/abs/1511.04599.

[38] Seyed-Mohsen Moosavi-Dezfooli et al. “Universal adversarial perturbations”. In: CoRR abs/1610.08401 (2016). URL: http://arxiv.org/abs/1610.08401.

[39] Nicolas Papernot et al. “Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples”. In: CoRR abs/1602.02697 (2016). URL: http://arxiv.org/abs/1602.02697.

[40] Nicolas Papernot et al. “The Limitations of Deep Learning in Adversarial Settings”. In: CoRR abs/1511.07528 (2015). URL: http://arxiv.org/abs/1511.07528.

[41] Joseph Redmon et al. “You only look once: Unified, real-time object detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.

[42] Shaoqing Ren et al. “Faster r-cnn: Towards real-time object detection with region proposal networks”. In: Advances in neural information processing systems. 2015, pp. 91–99.

[43] Frank Rosenblatt. “The perceptron: A probabilistic model for information storage and organization in the brain.” In: Psychological review 65.6 (1958), p. 386.

[44] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Cognitive modeling 5.3 (1988), p. 1.

[45] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. DOI: 10.1007/s11263-015-0816-y.


[46] T. Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”. In: ArXiv e-prints (Mar. 2017). arXiv: 1703.03864 [stat.ML].

[47] Timo Scharwaechter et al. “Efficient Multi-cue Scene Segmentation”. In: Pattern Recognition: 35th German Conference, GCPR 2013, Saarbrücken, Germany, September 3-6, 2013. Proceedings. Ed. by Joachim Weickert, Matthias Hein, and Bernt Schiele. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 435–445. ISBN: 978-3-642-40602-7. DOI: 10.1007/978-3-642-40602-7_46. URL: http://dx.doi.org/10.1007/978-3-642-40602-7_46.

[48] Jürgen Schmidhuber. “Learning complex, extended sequences using the principle of history compression”. In: Neural Computation 4.2 (1992), pp. 234–242.

[49] Mahmood Sharif et al. “Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM. 2016, pp. 1528–1540.

[50] K. Simonyan and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: CoRR abs/1409.1556 (2014).

[51] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”. In: CoRR abs/1312.6034 (2013). URL: http://arxiv.org/abs/1312.6034.

[52] Christian Szegedy et al. “Intriguing properties of neural networks”. In: CoRR abs/1312.6199 (2013). URL: http://arxiv.org/abs/1312.6199.

[53] Theano Development Team. “Theano: A Python framework for fast computation of mathematical expressions”. In: arXiv e-prints abs/1605.02688 (May 2016). URL: http://arxiv.org/abs/1605.02688.


[54] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 2012.

[55] Florian Tramèr et al. “Stealing Machine Learning Models via Prediction APIs”. In: CoRR abs/1609.02943 (2016). URL: http://arxiv.org/abs/1609.02943.

[56] Fisher Yu and Vladlen Koltun. “Multi-Scale Context Aggregation by Dilated Convolutions”. In: CoRR abs/1511.07122 (2015). URL: http://arxiv.org/abs/1511.07122.

[57] Matthew D. Zeiler and Rob Fergus. “Visualizing and Understanding Convolutional Networks”. In: CoRR abs/1311.2901 (2013). URL: http://arxiv.org/abs/1311.2901.
