
Analysing the Behaviour of Neural Networks

A Thesis by Stephan Breutel

Dipl.-Inf.

In Partial Fulfillment of the Requirements for the Degree

Doctor of Philosophy

Queensland University of Technology, Brisbane
Center for Information Technology Innovation

March 2004


Copyright © Stephan Breutel, MMIV. All rights reserved.

The author hereby grants permission to the Queensland University of Technology, Brisbane to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.


Keywords

Artificial Neural Network, Annotated Artificial Neural Network, Rule-Extraction, Validation of Neural Networks, Polyhedra, Forward-propagation, Backward-propagation, Refinement Process, Non-linear Optimization, Polyhedral Computation, Polyhedral Projection Techniques


Analysing the Behaviour of Neural Networks

by

Stephan Breutel

Abstract

A new method is developed to determine a set of informative and refined interface assertions satisfied by functions that are represented by feed-forward neural networks. Neural networks have often been criticized for their low degree of comprehensibility. It is difficult to have confidence in software components if they have no clear and valid interface description. Precise and understandable interface assertions for a neural network based software component are required for safety-critical applications and for the integration into larger software systems.

The interface assertions we are considering are of the form "if the input x of the neural network is in a region A of the input space then the output f(x) of the neural network will be in the region B of the output space", and vice versa. We are interested in computing refined interface assertions, which can be viewed as the computation of the strongest pre- and postconditions a feed-forward neural network fulfills. Unions of polyhedra (polyhedra are the generalization of convex polygons to higher dimensional spaces) are well suited for describing arbitrary regions of higher dimensional vector spaces. Additionally, polyhedra are closed under affine transformations.

Given a feed-forward neural network, our method produces an annotated neural network, where each layer is annotated with a set of valid linear inequality predicates. The main challenges for the computation of these assertions are to compute the solution of a non-linear optimization problem and the projection of a polyhedron onto a lower-dimensional subspace.


Contents

List of Figures
List of Tables
List of Listings

1 Introduction
1.1 Motivation and Significance
1.2 Notations and Definitions
1.3 Software Verification and Neural Network Validation
1.4 Annotated Artificial Neural Networks
1.5 Highlights and Organization of this Dissertation
1.6 Summary of this Chapter

2 Analysis of Neural Networks
2.1 Neural Networks
2.2 Validation of Neural Network Components
2.2.1 Propositional Rule Extraction
2.2.2 Fuzzy Rule Extraction
2.2.3 Region-based Analysis
2.3 Overview of Discussed Neural Network Validation Techniques and Validity Polyhedral Analysis

3 Polyhedra and Deformations of Polyhedral Facets under Sigmoidal Transformations


3.1 Polyhedra and their Representation
3.2 Operations on Polyhedra and Important Properties
3.3 Deformations of Polyhedral Facets under Sigmoidal Transformations
3.4 Summary of this Chapter

4 Nonlinear Transformation Phase
4.1 Mathematical Analysis of Non-Axis-parallel Splits of a Polyhedron
4.2 Mathematical Analysis of a Polyhedral Wrapping of a Region
4.2.1 Sequential Quadratic Programming
4.2.2 Maximum Slice Approach
4.2.3 Branch and Bound Approach
4.2.4 Binary Search Approach
4.3 Complexity Analysis of the Branch and Bound and the Binary Search Method
4.4 Summary of this Chapter

5 Affine Transformation Phase
5.1 Introduction to the Problem
5.2 Backward Propagation Phase
5.3 Forward Propagation Phase
5.4 Projection of a Polyhedron onto a Subspace
5.4.1 Fourier-Motzkin
5.4.1.1 A Variation of Fourier-Motzkin
5.4.2 Block Elimination
5.4.3 The S-Box Approximation
5.4.3.1 Projection of a Face
5.4.3.2 Determination of Facets of the Projected Polyhedron
5.4.3.3 Further Improvements of the S-Box Method
5.4.4 Experiments
5.5 Further Considerations about the Approximation of the Image
5.6 Summary of this Chapter


6 Implementation Issues and Numerical Problems
6.1 The Framework
6.2 Numerical Problems
6.3 Summary of this Chapter

7 Evaluation of Validity Polyhedral Analysis
7.1 Overview and General Procedure
7.2 Circle Neural Network
7.3 Benchmark Data Sets
7.3.1 Iris Neural Network
7.3.2 Pima Neural Network
7.4 SP500 Neural Network
7.5 Summary of this Chapter

8 Conclusion and Future Work
8.1 Contributions of this Thesis
8.2 Fine Tuning of VPA
8.3 Future Directions and Validation Methods for Kernel Based Machines

A Overview of Used Symbols
B Linear Algebra Background

Bibliography


List of Figures

1.1 Annotated version of a neural network.
2.1 Single neuron of a multilayer perceptron.
2.2 Sigmoid and threshold activation functions and the graph of the function computed by a single neuron with a two-dimensional input.
2.3 Two-layer feed-forward neural network.
2.4 Overview of different validation methods for neural networks.
2.5 Example for the KT-Method.
2.6 Example for the M-of-N method.
2.7 DIBA: recursive projection of hyperplanes onto each other.
2.8 DIBA: a decision region and the traversing of a line.
2.9 Annotation of a neural network with validity intervals.
2.10 Piece-wise linear approximation of the sigmoid function.
3.1 Combination of two vectors in a two-dimensional space, from left to right: linear, non-negative, affine and convex combination.
3.2 The back-propagation of a polyhedron through the transfer-function layer. Given the polyhedral description {y : A y ≤ b} in the output space of a transfer-function layer, the reciprocal image of this polyhedron under the non-linear transfer function is given by σ⁻¹({y : A y ≤ b}) = {x : A σ(x) ≤ b}.
3.3 Approximation of the non-linear region.
3.4 Subdivision into cells.
3.5 Example for subdivision into cells.
3.6 Point sampling approach.


3.7 Eigenvalue and eigenvector analysis of the facet manifold.
3.8 Example for convex curvature.
3.9 Example for concave curvature.
4.1 Non-axis-parallel split of a polyhedron.
4.2 Polyhedral wrapping of the non-linear region.
4.3 Application of branch and bound.
4.4 Two-dimensional example for the binary search method.
5.1 The forward- and backward-propagation of a polyhedron through the weight layer.
5.2 The projection of a polyhedron onto a two-dimensional subspace.
5.3 Relevant hinge.
5.4 Possible ...
5.5 The projected polyhedron is contained in the approximation.
6.1 Overview of the framework.
7.1 Visualisation of the behaviour of the circle neural network.
7.2 Projection onto a subspace of the input space.
7.3 Projection onto a subspace of the input space.
7.4 The computed output regions for the Pima Neural Network.
7.5 The computed input region for the SP500 Neural Network.


List of Tables

2.1 Overview of neural network validation techniques.
3.1 Concave and convex curvature in the neighborhood of a point.
4.1 Comparison of branch and bound and binary search.
5.1 Computation times for the projection of a polyhedron onto a lower-dimensional subspace.


List of Listings

5.1 projExample
6.1 net-struct
6.2 mainLoop
6.3 forwardStep
6.4 mainVIA
6.5 mainVPA
6.6 numExample


Statement of Original Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Stephan Breutel

March 2004


Acknowledgments

I would like to thank my principal supervisor Frederic Maire, without whose unbounded energy, idealism, motivation and immense enthusiasm this research would not have been possible. It was an honour to receive wisdom and guidance from all of my supervisory team. A special thanks also to my associate supervisor Ross Hayward, who provided excellent support whenever I needed it.

Apart from my supervisors, I would like to thank the other members of my oral defense panel, Joaquin Sitte and Arthur ter Hofstede, for their valuable comments about this thesis.

I would also like to thank all of my colleagues in my research center for their help and encouragement during my time here, and in particular all the people of the Smart Devices Laboratory. It would take too many pages to list them all.

I was fortunate to have many friends with whom I enjoyed fantastic climbing sessions at Kangaroo Point and had many great times at the Press Club whilst listening to superb jazz performances, all of which helped to keep me relatively sane during my time at QUT.

I would also like to thank the Coffee Coffee Coffee crew for keeping me awake by serving me literally over a thousand coffees (to be precise, 1084).

Finally, I really would like to thank all my friends and my family back home in Bavaria for supporting my endeavour. A special thanks to my parents for their moral and financial support.

This work was supported in part by an IPRS Scholarship and a QUT Faculty of Information Technology Scholarship.


Chapter 1

Introduction

The first section provides examples of neural networks in industrial applications that demonstrate the need for neural network validation methods.

In section 1.2 notation conventions are explained. Required terminology and concepts of software verification and neural network validation are introduced in section 1.3.

In section 1.4 annotated artificial neural networks are defined and the central ideas of the novel Validity Polyhedral Analysis (VPA) are explained in a nutshell. The organization of this thesis is outlined in section 1.5.

1.1 Motivation and Significance

A conclusion of the report "Industrial use of safety-related artificial neural networks" by Lisboa [Lis01] is that one of the keys to a successful transfer of neural networks to the marketplace is their integration with other systems (e.g. standard software systems, fuzzy systems, rule-based systems). This requires an analysis of the behaviour of neural network based components. Examples of products using neural networks in safety-critical areas are [Lis01]:

- Explosive detection, 1987. SNOOPE from SAIC is an explosive detector. It was motivated by the need to detect the plastic explosive Semtex. The detector irradiated suitcases with low-energy neutrons and collected an emission gamma-ray spectrum. A standard feed-forward neural network was used to classify between bulk explosive, sheet explosive and no explosive. However, there were several practical problems. For example, the 4% false positive rate of the MLP resulted in a large number of items to be checked. This is not practical, especially for highly frequented airports such as Heathrow and Los Angeles, where the system was tested.

- Financial risk management. PRISM by Nestor relied on a recursive adaptive model. HNC's Falcon is based on a regularised multi-layer perceptron (MLP). Both systems are still market leaders for credit card fraud detection [LVE00].

- Siemens applied neural networks to the control of steel rolling mills. A prototype neural network based model was used for strip temperature and rolling force at the hot strip mill of Hoesch, in Dortmund, in 1993. Later Siemens applied this technology at 40 rolling mills world-wide. Siemens' experience indicates that neural networks always complement, and never replace, physical models. Additionally, domain expertise is essential in the validation process. A third observation was that the data requirements are severe.

- NASA and Boeing are testing a neural network based damage recovery control system for military and commercial aircraft. This system aims to add a significant margin of safety to fly-by-wire control when the aircraft sustains major equipment or system failure, ranging from the inability to use flaps to encountering extreme icing.

- Vibration analysis monitoring in jet engines is a joint research project by Rolls-Royce and the Department of Engineering at Oxford University. The diagnostic system QUINCE combines the outputs from neural networks with template matching, statistical processing and signal processing methods. The software is designed for the pass-off test of jet engines. It includes a facility for tracking the most likely fault. Another project is a real-time in-flight monitoring system for the Trent 900 Rolls-Royce engine. The project combines different techniques, for example Kalman filters with signal processing methods and neural networks.

- In a European collaborative project involving leading car manufacturers, different control systems ranging from engine management models to physical speed control have been implemented. These control systems combined engineering expertise with non-linear interpolation by neural network architectures. The project included rule-based systems, fuzzy systems and neural networks.

- Siemens produces the FP-11 intelligent fire detector. This detector was trained from fire tests carried out over many years. According to [Lis01], this fire detector triggers one-thirtieth of the false alarms of conventional detectors. The detector is based on a digital implementation of fuzzy logic, with rules discovered by a neural network but validated by human experts.

These examples show that neural networks need to be integrated with other systems, that it is relevant to extract the rules learnt by neural networks (for example for the fire detector FP-11 or for credit card fraud detection), and that it is important to provide valid statements about the neural network behaviour in safety-critical applications. Therefore, it is necessary to describe the neural network behaviour, e.g. in the form of valid and refined rules. Additionally, it is interesting to obtain explanations for the neural network behaviour.

However, our main motivation is to compute valid statements about the neural network behaviour and as such help to prevent software faults. Software errors can cause serious problems and risks, especially in safety-critical environments. Several software errors and their consequences are collected in [Huc99]. The explosion of the Ariane 5 rocket and the loss of the Mars Climate Orbiter [Huc99] are recent examples of the consequences of software errors.

To summarize, it is important to describe the behaviour of neural network components in order to:

- overcome the low degree of comprehensibility of (trained) neural networks,
- validate neural networks,
- integrate neural network components in (large) software environments,
- apply neural network components even in safety-critical applications,
- detect interesting new knowledge in statistical data which was not previously obvious,
- prevent un-learning: given a polyhedral interface description of a neural network component, one can define a set of invariants, i.e. adapting the neural network to new environments must not violate these invariants,
- control the generalization of a trained neural network. Generalization expresses the ability of the neural network to produce correct output for previously unseen input data. A description of the neural network behaviour will provide better insight into the neural network's generalization capability,
- visualize corresponding regions in the input and the output space of a neural network.

1.2 Notations and Definitions

The following notation conventions are inspired by the book by Fritzke [Fri98], and most of the conventions follow the Matlab [Mat00c] notation.

- Sets are denoted by calligraphic upper case letters (e.g. S).
- Matrices are denoted by single upper case bold letters (e.g. A). To denote a column or row vector or any arbitrary element within the matrix, we use the Matlab notation [Mat00c], e.g. A(:, j) is the j-th column vector of matrix A, and A([i, k], :) extracts the i-th and the k-th row vectors.
- Vectors are denoted by single lower case bold letters (e.g. v). By default a vector is a column vector. v^T represents the transpose of v. To represent an element within the vector we use Matlab notation, e.g. v(j) denotes the j-th element of the vector.
- Let I and J denote index sets. Then we denote with A(I, J) the selection of the rows I and columns J of the matrix A.
- A scalar variable is denoted with a simple lower case letter (e.g. a).
- Let A and B be two matrices with the same number of columns. With [A; B] we denote the vertical concatenation of the two matrices. Similarly, [A B] denotes the horizontal concatenation of two matrices with the same number of rows.

Additionally, we use the convention that definitions and expressions which are explained in the glossary¹ will be written in emphasized style when used for the first time. Function names are also written in emphasized style.

An overview of all symbols and special operators is provided in appendix A. However, the following table contains symbols which are already relevant for this chapter and for the literature review of neural network analysis techniques in Chapter 2.

P ... polyhedron, i.e. the intersection of a finite number of half-spaces
R ... arbitrary region
B ... a box, i.e. an axis-parallel hyper-rectangle; we also use the expression hypercube
y ... activation vector of the output neurons
net ... net input vector (input vector for the transfer-function layer)
x ... activation vector of the input neurons
W ... weight matrix
θ ... single bias value
θ (bold) ... bias vector
σ ... sigmoid transfer function

The next table defines a special function symbol, one operator and the interval notation.

□ ... box function; □(R) denotes the smallest axis-parallel hypercube containing a region R
≡ ... boolean operator to compare whether two expressions are equivalent
[a, b] ... denotes the interval {x : a ≤ x ≤ b}

To refer to sigmoidal functions we often use the Matlab terms logsig and tansig.

¹ A glossary is not provided yet, but it will be included in the final version. We apologise for any inconvenience.
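To make the indexing and concatenation conventions above concrete, here is a small illustrative sketch. It uses NumPy purely as an analogy (the thesis itself follows Matlab notation and 1-based indexing), and the matrix values are invented.

```python
import numpy as np

# A 3x3 matrix A; indices are 1-based in the thesis' Matlab notation,
# 0-based in NumPy, so A(:,2) corresponds to A[:, 1] below.
A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])

col_2 = A[:, 1]          # A(:,2): the 2nd column vector
rows_1_3 = A[[0, 2], :]  # A([1,3],:): the 1st and 3rd row vectors

B = np.eye(3)
vertical = np.vstack([A, B])    # [A; B]: vertical concatenation (same number of columns)
horizontal = np.hstack([A, B])  # [A B]: horizontal concatenation (same number of rows)

v = np.array([1., 2., 3.])      # a column vector by convention; v.T is its transpose
print(col_2, rows_1_3.shape, vertical.shape, horizontal.shape, v[2])  # v(3) -> v[2]
```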


1.3 Software Verification and Neural Network Validation

We will divide software components, depending on the task and its implementation, into two classes, namely trainable software components and non-trainable software components.

Definition 1.1 Trainable Software Components
Software components for classification and non-linear regression, where the task is implicitly specified via a set of examples and a set of parameters is optimized according to these examples, are called trainable software components. □

Typically, we use trainable software components where it is not easy or not possible to define a clear algorithm. For example, in tasks like speech recognition, image recognition or robotic control, statistical learners such as neural networks or support vector machines are often applied.

Definition 1.2 Non-trainable Software Components
Software components where the task is precisely specified, an algorithm can be defined and the task is implemented with a programming language are called non-trainable software components. □

We also refer to non-trainable software components as standard software.

Software Verification and Validation of Neural Network Components

Standard software program verification methods take the source code as input and prove its correctness against the (formal) specification. Among others, important concepts of software verification are pre- and postconditions.

Definition 1.3 Precondition and Postcondition
Given a software component S and two boolean expressions P, Q about the input and output data, the statement

    {P} S {Q}

indicates that for every input state which fulfills {P} before the execution of S, the output state {Q} is true after the termination of S. The assertions {P} and {Q} are also named precondition and postcondition. □

We can view pre- and postconditions as specifications of the component properties. However, using pre- and postconditions does not assure that the software component fulfills these specifications. A formal proof is required.

In the context of artificial neural networks we talk about validation techniques. In the following we discuss the central ideas of neural network validation techniques in a nutshell. The validation approaches for neural networks are propositional rule extraction, fuzzy rule extraction and region based analysis. Propositional rule extraction methods take a (trained) neural network component as input, extract symbolic rules and test them against the network behaviour itself and against the test data. These methods are helpful to test neural networks.

Fuzzy rule extraction methods try to extract a set of fuzzy rules which mimics the behaviour of the neural network. The advantage of fuzzy rule extraction, compared to propositional rule extraction, is that generally fewer rules are needed to explain the neural network behaviour. In addition, with the use of linguistic expressions, easily understandable characterizations of the neural network behaviour are obtained.

Region based analysis methods take a (trained) neural network as input and compute related regions in the input and output space. Region based analysis techniques differ from the above methods because they have a geometric origin, are usable for a broader range of neural networks and have the ability to compute more accurate interface descriptions of a neural network component. These methods compute a region mapping of the form: if the input is in region R_I then the output is in region R_O, where R_I describes a set of points in the input space and R_O a set of points in the output space. For some methods, these region based rules agree exactly with the behaviour of the neural network. The more refined those regions are, the more information we obtain about the neural network. Validity Interval Analysis (VIA), developed by Thrun [Thr93], for example, is able to find provably correct axis-parallel rules, i.e. rules of the form: "if the input x is in the hypercube B_x then the output y is in the hypercube B_y". Hence, region based analysis approaches are suitable for validating the behaviour of neural network based software components.

The development of large software systems requires the interaction of different software components. As motivated with the examples, we need some kind of human understandable description of neural network based software components (e.g. by using fuzzy rules) as well as techniques to assert important properties of the neural network behaviour (e.g. in the form of valid relations between input and output regions). Our approach of computing corresponding regions, represented as unions of polyhedra, in the input and the output space of a neural network is able to validate properties of the neural network behaviour.

Definition 1.4 Polyhedral Precondition and Polyhedral Postcondition
A polyhedral precondition P is a precondition where the constraints on the input data are sets of linear inequalities. Similarly, a polyhedral postcondition Q is a postcondition where the constraints on the output data are expressed as a system of linear inequalities. □

We can view polyhedral pre- and postconditions as conjunctions of linear inequality predicates. We also use the terminology "polyhedral interface assertions" or "polyhedral interface description".
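As a small illustration of what such a polyhedral assertion states, the following sketch (not taken from the thesis; the matrix A and vector b are invented) checks whether a point satisfies the conjunction of linear inequality predicates A x ≤ b.

```python
import numpy as np

def satisfies_polyhedral_assertion(A, b, x, tol=1e-9):
    """Return True if x lies in the polyhedron {x : A x <= b},
    i.e. if x satisfies every linear inequality predicate."""
    return bool(np.all(A @ x <= b + tol))

# Example polyhedron: the unit square [0,1] x [0,1] written as A x <= b.
A = np.array([[ 1.,  0.],
              [-1.,  0.],
              [ 0.,  1.],
              [ 0., -1.]])
b = np.array([1., 0., 1., 0.])

print(satisfies_polyhedral_assertion(A, b, np.array([0.5, 0.2])))  # True
print(satisfies_polyhedral_assertion(A, b, np.array([1.5, 0.2])))  # False
```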

Among others, the following properties are desirable for methods analysing the behaviour of trained neural networks:

- generality (also known in the literature as portability, e.g. see Andrews et al. [TAGD98]): algorithms that make no assumptions about the neural network architecture and the learning algorithm,
- usability for classification problems as well as function approximation,
- high fidelity with the neural network, i.e. the computed rule set mimics the behaviour of the neural network well,
- a precise and concise description of the neural network behaviour, e.g. in the form of a small number of informative and refined rules,
- polynomial algorithmic time and space complexity; in other words, the algorithm is still applicable in higher-dimensional cases,
- usability for validating properties of the neural network behaviour.

1.4 Annotated Artificial Neural Networks

Our approach is to forward- and backward-propagate finite unions of polyhedra through all layers of the neural network. This strategy can be viewed as an extension of Validity Interval Analysis (VIA) and is consequently named Validity Polyhedral Analysis (VPA). The method is very general, as the only assumptions are that we work with feed-forward neural networks (a brief introduction to feed-forward neural networks follows in Chapter 2) and that the network has invertible and continuous transfer functions.

Both the VIA and the VPA technique rely on a refinement algorithm. Let Γ represent the function a feed-forward neural network computes. Γ is a mapping from an input space I to an output space O. Suppose that a pair (x, y) satisfies the following system of constraints (see also [Mai00a]):

    x ∈ A,    y ∈ B,    y = Γ(x)

The numerical values of x and y are unknown. It is only known that x and y fulfill the system of constraints. We refine our knowledge by computing proper subsets of A and B, such that the initial system of constraints implies that x ∈ A' and y ∈ B'. By computing the image of A under Γ and the reciprocal image of B, we obtain the proper subsets

    A' = A ∩ Γ⁻¹(B),    B' = B ∩ Γ(A).

Generally, the sets A' and B' are non-linear regions. VIA uses axis-parallel hypercubes for A and B and computes approximations of A' and B', respectively. VPA relies on polyhedra.

Our validation algorithm takes a, generally trained, neural network as input and produces an annotated version of this neural network.
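A minimal sketch of one half of such a refinement step, assuming a single layer y = σ(W x − θ) and axis-parallel boxes in the spirit of VIA: the interval-arithmetic code below only illustrates the idea B' = B ∩ Γ(A) and is not the VIA or VPA implementation described in this thesis; the layer weights are invented.

```python
import numpy as np

def logsig(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_box(W, theta, lower, upper):
    """Propagate an axis-parallel box [lower, upper] through one layer
    y = logsig(W x - theta) using interval arithmetic.  Because logsig is
    monotone, the resulting box is a sound enclosure of the true image."""
    lo = W.clip(max=0.0) @ upper + W.clip(min=0.0) @ lower - theta
    hi = W.clip(max=0.0) @ lower + W.clip(min=0.0) @ upper - theta
    return logsig(lo), logsig(hi)

def refine_output_box(W, theta, in_lo, in_hi, out_lo, out_hi):
    """One forward refinement step: intersect the known output box with the
    enclosure of the image of the input box (B' = B intersected with Gamma(A))."""
    img_lo, img_hi = forward_box(W, theta, in_lo, in_hi)
    return np.maximum(out_lo, img_lo), np.minimum(out_hi, img_hi)

# Invented toy layer with two inputs and two outputs.
W = np.array([[1.0, -2.0],
              [0.5,  1.5]])
theta = np.array([0.0, 0.5])
in_lo, in_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
out_lo, out_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(refine_output_box(W, theta, in_lo, in_hi, out_lo, out_hi))
```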


Definition 1.5 Annotated Artificial Neural Network (AANN)
An artificial neural network, where the input and output of each layer is annotated via a set of valid pre- and postconditions, is named an Annotated Artificial Neural Network (AANN).

For example, VIA produces pre- and postconditions in the form of axis-parallel rules. VPA annotates a neural network via a set of linear inequality predicates.

Figure 1.1: Annotated version of a neural network.

Generally, the behaviour of neural networks can be described more accurately with a finite union of polyhedra than with a finite union of axis-parallel hypercubes.

There is an interesting analogy between the notion of annotated neural networks and software verification for programs. One strategy to verify the correctness of a program against a given specification is to annotate the program with logical expressions and to prove the correctness of each step. In our case the "program" is a neural network and each layer is annotated with a set of valid linear inequality predicates.


A Bridge to Logic and Software Verification

The Hoare calculus provides a formal framework to verify the correctness of programs by annotating the program with assertions about the state of the program variables and the change of this state under the program execution. The Hoare calculus defines rules for the correct annotation of a program. The book by Broy [Bro97] provides a thorough introduction to the basics of the Hoare calculus. Within the scope of this thesis the rule of statement sequence, the rule of consequence and the concepts of weaker and stronger pre- and postconditions are relevant. Let P, P1, Q, Q1 denote predicate logical pre- and postconditions with program variables as free identifiers. Statements of the program are represented with S, S1 and S2. In the following description the rule condition and the rule consequence are separated by a horizontal line.

Rule of Statement Sequence

    {P} S1 {P1}    {P1} S2 {Q}
    --------------------------
    {P} S1; S2 {Q}

Example for the Rule of Statement Sequence [Bro97]:

    {x + 1 = a} x := x + 1 {x = a}    {x = a} x := x² {x = a²}
    ----------------------------------------------------------
    {x + 1 = a} x := x + 1; x := x² {x = a²}

Rule of Consequence

    P1 ⇒ P    {P} S {Q}    Q ⇒ Q1
    ------------------------------
    {P1} S {Q1}

Example for the Rule of Consequence [Bro97]:

    x = a ⇒ x² = a²    {x² = a²} x := x² {x = a²}    x = a² ⇒ x ≥ 0
    ----------------------------------------------------------------
    {x = a} x := x² {x ≥ 0}

Definition 1.6 Weaker and Stronger Precondition
P1 ⇒ P denotes that whenever P1 is true, P is also true. We say: P is the weaker precondition and P1 is the stronger precondition.


Definition 1.7 Weaker and Stronger Postcondition
Q ⇒ Q1 denotes that whenever Q is true, Q1 is also true. We say: Q is the stronger postcondition and Q1 is the weaker postcondition.

We can denote a feed-forward neural network as a finite sequence of an affine transformation A, followed by a (usually) non-linear transformation σ. Let R_x be an arbitrary region specifying a valid region for an input vector x, let R'_x be a region defining the possible output after an affine transformation of an arbitrary x ∈ R_x, and finally, let R_y denote the possible output region after a non-linear transformation of an arbitrary vector x' ∈ R'_x. In the logical framework we simply use the notation R_x to express x ∈ R_x. As possible instances of regions we consider axis-parallel boxes (B) and polyhedral regions (P).

In the sequel we assume that P_x ⊆ B_x and P_y ⊆ B_y. Let us consider two consecutive layers of an annotated neural network.

Rule of Statement Sequence for an Annotated Artificial Neural Network

    {R_x} A {R'_x}    {R'_x} σ {R_y}
    ---------------------------------
    {R_x} A; σ {R_y}

We will refer to assertions of type R'_x as the intermediate status. The repeated application of the rule of statement sequence to a multilayer feed-forward neural network allows us to write

    {R_x} ANN {R_y}

where ANN represents the sequence of computations a feed-forward neural network performs from the input to the output. The above statement reads: "if x ∈ R_x then y ∈ R_y".

Rule of Consequence for an Annotated Artificial Neural Network

    P_x ⇒ B_x    {B_x} A; σ {R_y}    R_y ⇒ B_y
    --------------------------------------------
    {P_x} A; σ {B_y}

In this logical framework we can view the polyhedron P_x as the stronger precondition and the polyhedron P_y as the stronger postcondition. The more refined the regions of an annotated neural network are, the stronger the corresponding pre- and postconditions. It turns out that our geometrical perspective is quite useful, as it allows us to define a precise measure for the strength of a precondition or postcondition, namely the volume of the corresponding region.
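For axis-parallel boxes this volume measure is easy to compute; the sketch below is a simple illustration (the boxes are invented, not taken from the thesis) of how a smaller volume identifies the stronger, more refined assertion.

```python
import numpy as np

def box_volume(lower, upper):
    """Volume of the axis-parallel box [lower, upper] (0 if the box is empty)."""
    edges = np.maximum(upper - lower, 0.0)
    return float(np.prod(edges))

# Two candidate preconditions for the same input; the refined box is contained
# in the original one, hence it is the stronger precondition.
original_lo, original_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
refined_lo,  refined_hi  = np.array([0.2, 0.1]), np.array([0.8, 0.9])

print(box_volume(original_lo, original_hi))  # 1.0  (weaker precondition)
print(box_volume(refined_lo,  refined_hi))   # 0.48 (stronger precondition)
```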

1.5 Highlights and Organization of this Dissertation

The highlights and the organization of this thesis are as follows:

Chapter 2: Analysis of Neural Networks
In this chapter we introduce basic concepts of feed-forward neural networks and provide a literature overview of validation methods for neural network components. We classify the validation methods into propositional rule extraction, fuzzy rule extraction and region-based analysis. Finally, the different methods are compared and our approach, named Validity Polyhedral Analysis (VPA), is motivated.

Chapter 3: Polyhedral Computations and Deformations of Polyhedral Facets under Sigmoidal Transformations
Polyhedra are the generalization of convex polygons to higher dimensional spaces. This chapter presents the most important properties and concepts of polyhedral analysis to make this thesis self-contained.

To obtain refined polyhedral interface assertions, we have to propagate unions of polyhedra through all layers of a neural network. This requires computing the image of a polyhedron under a non-linear transformation. The images of non-axis-parallel polyhedra under a sigmoidal transformation are non-linear regions. In our initial investigations we analyse how polyhedral facets get twisted under a sigmoidal transformation.

Chapter 4: Mathematical Analysis of the Non-linear Transformation Phase
In this chapter we explain how to approximate the image of a polyhedron under a non-linear transformation by a finite union of polyhedra. This approximation process can be reduced to a non-linear optimization problem. Several approaches to approximate the global maximum of the corresponding optimization problem are discussed.

Chapter 5: Mathematical Analysis of the Affine Transformation Phase
The computation of the reciprocal image of a polyhedron under an affine transformation is explained in this chapter. Furthermore, this chapter discusses how to calculate the image of a polyhedron under an affine transformation, and strategies for computing or approximating the projection of a polyhedron onto a lower-dimensional subspace. Within the scope of this thesis, projection techniques are used for the computation of the image of a polyhedron under an affine transformation characterized by a non-invertible matrix.

Chapter 6: Implementation Issues and Numerical Problems
This chapter discusses the design and implementation of a general framework for any region-based refinement algorithm. The framework is successfully used for the Validity Interval Analysis (VIA) and our new Validity Polyhedral Analysis (VPA) method. It is always necessary to study the numerical properties of a mathematical algorithm when implementing the algorithm on a digital machine with finite precision. Section 6.2 is devoted to these problems, which will be referred to as the numerical problems.

Chapter 7: Evaluation of Validity Polyhedral Analysis
Validity Polyhedral Analysis computes interface assertions of a neural network in polyhedral format. We evaluated VPA on toy neural networks, on neural networks trained with benchmark data sets from the UC Irvine database [Rep], and on a neural network trained to predict the SP500 stock-market index. Additionally, the method is compared to VIA (Validity Interval Analysis) and the refinement process is discussed.


Chapter 8: Conclusion and Future Work
This chapter summarizes the main contributions, explains how to "fine tune" the introduced VPA strategy, and finally motivates future investigations to obtain validation techniques for kernel-based machines, such as support vector machines.

Appendices
Appendix A summarizes all used symbols and notations. Appendix B recalls the relevant knowledge and notions of linear algebra to make this thesis self-contained.

1.6 Summary of this Chapter

As throughout this thesis, a summary of the chapter and a list of new contributions is provided.

To motivate neural network validation methods, examples of neural network components in safety-critical applications have been illustrated. Notation conventions have been introduced, neural network validation techniques discussed, and criteria for neural network validation methods have been formulated. Finally, the idea of Validity Polyhedral Analysis was introduced, which is used to obtain an annotated version of a feed-forward neural network.

Contributions - Chapter 1 -

- The notion of Annotated Artificial Neural Networks (AANN) and the application of the method of assertions to neural networks.
- Validity Polyhedral Analysis (VPA), as a tool to annotate a feed-forward neural network with valid pre- and postconditions in the form of linear inequality predicates.


Chapter 2

Analysis of Neural Networks

Section 2.1 recalls central ideas of artificial neural networks. For a very thorough introduction, the reader is referred to the excellent book by Haykin [Hay99].

As motivated in the introduction, trained neural network components need to undergo a validation or testing procedure before their (industrial) use. Section 2.2 is devoted to the topic of neural network validation.

Finally, a short summary of the discussed validation methods is provided and our approach, named Validity Polyhedral Analysis (VPA), is motivated (why we do it) and justified (why we do it this way).

2.1 Neural Networks

Artificial Neural Networks (ANNs) are partly inspired by observations about the biological brain. These observations led to the conclusion that information in biological neural systems is processed in parallel over a network of a large number of interconnected, distributed neurons (simple computational units). However, there are many differences between biological neural systems and ANNs. For example, the output of a neuron of an ANN is a single value, whereas a biological neuron produces an often complex time series of spikes.¹

¹ There is ongoing research on simulating spiking neural networks, but it is still not possible to simulate exactly the behaviour of a biological neuron. Furthermore, biological neurons are also influenced by hormones and other chemical transmitters.

Artificial Neural Networks are general function approximators, able to learn from a set of examples. Therefore ANNs are also often characterized as statistical learners. The main features of neural network based machines are [Fri98]:

- a large number of identical computational units, called neurons,
- weighted connections between these units,
- the parameters (weights) of the neural network are adjusted (stepwise) during the learning process, and
- usually non-linear transformations are used.

In this work we focus on sigmoidal feed-forward neural networks (also known as multi-layer perceptrons). The structure of a single neuron is presented in Figure 2.1. The net input to a neuron of a feed-forward neural network is calculated by computing the weighted sum of the activations of the preceding neurons and subtracting a bias value. In other words, the input to the i-th neuron is given by ⟨w, x⟩ − θ, with w = W(i, :) and θ = θ(i), where w is the weight vector and x the activation of the preceding neurons, also referred to as the input vector. The bias of the i-th neuron is denoted with θ(i).

In the case of a threshold activation function, a neuron is active (has a positive value) if the weighted input ⟨w, x⟩ is bigger than the bias θ. The hyperplane H = {x : ⟨w, x⟩ = 0} is a hyperplane through the origin and splits the input space into two half-spaces. The set of points x with ⟨w, x⟩ > 0 defines the positive half-space H⁺; all other points on the opposite side of the hyperplane are in the negative half-space H⁻. For all points on the hyperplane the dot product is zero. Adding the bias θ results in a shift of the hyperplane along w. Geometrically, in the case of a threshold activation, a neuron is active if the input x is in the positive half-space, i.e. x ∈ H⁺.

This geometrical interpretation also explains why, historically [Ros58], neural networks were introduced as classifiers. For example, for a linearly separable classification problem a neural network with a single weight layer would be sufficient to learn the task correctly. Labeled data is said to be linearly separable if the patterns of the two classes lie on opposite sides of a hyperplane.


Figure 2.1: Single neuron of a multilayer perceptron, where W(i, j) are the weighted connections between the input units and the i-th neuron of the next layer.

Often the threshold function

    f(net(i)) = 0  if net(i) < 0,
    f(net(i)) = 1  if net(i) ≥ 0,

and the sigmoid function

    f(net(i)) = 1 / (1 + e^(−net(i)))

are used as activation functions for feed-forward neural networks. The graphs of these functions are shown on the left of Figure 2.2. The figure also shows the output of a single neuron with a two-dimensional input space when applying a) the logsig and b) the threshold function to the neuron input.
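A single neuron as described above can be written down in a few lines. The sketch below is illustrative only (weights, bias and input are invented) and follows the convention net(i) = ⟨w, x⟩ − θ(i) with either a threshold or a logsig activation.

```python
import numpy as np

def threshold(net):
    return np.where(net >= 0.0, 1.0, 0.0)

def logsig(net):
    return 1.0 / (1.0 + np.exp(-net))

def neuron(w, theta, x, f=logsig):
    """Activation of a single neuron: weighted sum of the preceding
    activations minus the bias, passed through the transfer function f."""
    net = w @ x - theta
    return f(net)

w = np.array([2.0, -1.0])    # weight vector W(i,:)
theta = 0.5                  # bias of the i-th neuron
x = np.array([1.0, 0.3])     # activations of the preceding neurons

print(neuron(w, theta, x, f=threshold))  # 1.0, since net = 1.2 > 0
print(neuron(w, theta, x, f=logsig))     # ~0.77
```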


Figure 2.2: Sigmoid and threshold activation functions and the graph of the function computed by a single neuron with a two-dimensional input. In this case the weight matrix reduces to a vector w. Note that along any vector perpendicular to w the output is constant, and that along a line in the direction of w the output has the shape of the transfer function.

A feed-forward neural network architecture has an input layer, several hidden layers and an output layer. The input vector propagates through all layers of the neural network in a forward direction. The dimension (size) of the input layer, i.e. the number of input neurons, and the dimension of the output layer are defined by the application. It is difficult to determine a priori a suitable number of hidden layers and the number of hidden neurons for each of these layers. This has to be solved during the model selection process. In Figure 2.3 we show a typical multilayer perceptron architecture consisting of an n-dimensional input layer, an m-dimensional output layer and one hidden layer of dimension k.


Figure 2.3: Two-layer feed-forward neural network. The input weight matrix W1 and bias θ1 connect the input layer to the first transfer-function layer (net1 = W1 x1 − θ1, x2 = f1(net1)); the layer weight matrix W2 and bias θ2 connect the first transfer-function layer to the output layer (net2 = W2 x2 − θ2), and y = f2(net2) is the neural network output vector.
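Written as code, the network of Figure 2.3 computes y = f2(W2 · f1(W1 · x − θ1) − θ2). The following sketch is only an illustration with invented dimensions and random weights, not a reproduction of the framework described later in this thesis.

```python
import numpy as np

def logsig(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W1, theta1, W2, theta2, f1=logsig, f2=logsig):
    """Forward pass of a two-layer feed-forward network:
    net1 = W1 x - theta1, x2 = f1(net1), net2 = W2 x2 - theta2, y = f2(net2)."""
    x2 = f1(W1 @ x - theta1)   # hidden layer activations
    y = f2(W2 @ x2 - theta2)   # network output
    return y

rng = np.random.default_rng(0)
n, k, m = 3, 4, 2                       # input, hidden and output dimensions
W1, theta1 = rng.normal(size=(k, n)), rng.normal(size=k)
W2, theta2 = rng.normal(size=(m, k)), rng.normal(size=m)

print(feed_forward(np.array([0.1, -0.5, 0.7]), W1, theta1, W2, theta2))
```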

Learning Paradigms

The most widespread learning method is supervised learning. Supervised learning assumes that the target output data for the training set is available. The neural network learns by stepwise adjusting the weight parameters and biases according to the difference between its actual output and the desired output. An example of this type of learning is the backpropagation of error algorithm. This algorithm is typically used to train feed-forward neural networks.

Other often applied machine learning strategies are unsupervised learning and reinforcement learning. Unsupervised learning is used, for example, to cluster data points with unknown class labels into groups with similar inputs.


Reinforcement learning describes a process where an autonomous agent acting in an environment learns to choose optimal actions to achieve its goals. The reader is referred to the book by Fritzke [Fri98] for more information on unsupervised learning and to the book by Mitchell [Mit97] for reinforcement learning.

2.2 Validation of Neural Network Components

Artifical neural networks have numerous advantages, like for example, universal ap-

proximation capability and ability to learn. However, neural networks have often been

criticized for their low degree of comprehensibility. For a user of a neural network it

is impossible to infer how a specific output is obtained.

Validation of ANNs is important in safety-critical problem domains and for the in-

tegration of ANN components into large software environments interface assertions,

describing the behaviour of the neural network, are desirable.

This section gives a literature overview of methods useful to explain the function com-

puted by an ANN. The methods are categorized into three groups, namely proposi-

tional rule extraction, fuzzy rule extraction and region based analysis. Before describ-

ing some approaches of these classes in detail, we will define the problem of validating

neural network components, introduce some useful formalism and provide an exam-

ple. The excellent introductory book on computer science by Broy [Bro97] starts with

the definition of the terms information, representation and interpretation.

Definition 2.1 Information and Representation

We call information the abstract content (semantics) of a document, expression, mes-

sage, statement or program. We refer to its formal description as representation.

Definition 2.2 Interpretation

We call I : R → S an interpretation (function) which explains a given representation r ∈ R with the semantics s ∈ S.

We can view an interpretation function as a mapping from one representation into another representation. Such a system (R, S, I) is called an information system

[Bro97]. Two important remarks [Bro97]:


1. For a given representation various interpretation functions are possible.

2. The semantics itself has to be represented.

As a very simple example of an information system, consider the natural numbers {1, 2, 3, ...} represented with the Arabic numerals. As the representation of the semantics for the natural numbers we use a line representation, i.e. {|, ||, |||, ...}:

I(1) = |,  I(2) = ||,  I(3) = |||,  ...

In our context the neural network is the formal description and we are interested in computing the abstract content or semantics of the neural network. We refer to this

interpretation process as validation of neural network components.

Definition 2.3 Validation of Neural Network Components

Validation methods for neural networks try to find the semantics for the function com-

puted by a neural network.

As mentioned before, the semantics itself has to be represented. This semantical rep-

resentation, also referred to as alternative representation, should express an equivalent

or very similar behaviour as the function computed by a neural network. Furthermore,

we demand that the computed representation is concise and more comprehensible than

the neural network. However, we will, in general, not be able to find an exact repre-

sentation of the neural network, instead we will approximate the function computed by

an ANN. In the literature, the process of validating a neural network is often referred

to as “rule extraction” [ADT95] or “neural network analysis” [MP00]. We prefer the

term validation techniques, because this is our main focus.

Traditionally, people have been interested in obtaining human readable explanations of the neural network behaviour. In the following we provide an “ideal” example for the validation process of a neural network. Ideal in the sense that we assume: we have a priori a precise specification of what the neural network has to learn, the neural network learned the task correctly, and we have a validation method capable of finding an exact, alternative representation of the function computed by the neural network. For demonstration purposes the problem class M1 of the three MONK's benchmark problems as introduced by Thrun [Thr91] is used.


The three MONK benchmark problems are binary classification tasks, defined in an

artificial robot domain. Robots are described by the following attributes.

x(1): headShape ∈ {round, square, octagon}
x(2): bodyShape ∈ {round, square, octagon}
x(3): jacketColor ∈ {red, yellow, green, blue}
x(4): isSmiling ∈ {yes, no}
x(5): holding ∈ {sword, balloon, flag}
x(6): hasTie ∈ {yes, no}

Formally, the above association of an element of the input vector x with a natural language term is also an information system, for example: I_x(1)([0, 1]) = headShape.

The learning task is a binary classification problem. A robot belongs to the target class M1 if:

(headShape = bodyShape) ∨ (jacketColor = red)

We assume that a neural network was trained with a correct training set (i.e. no noise in the training data) to classify robots according to M1. Let us define the following mapping between numerical intervals and natural language expressions2:

I_x(1)([0, 0.33]) = round,  I_x(1)((0.33, 0.66]) = square,  I_x(1)((0.66, 1]) = octagon,  I_x(2) = I_x(1)
I_x(3)([0, 0.25]) = red,  I_x(3)((0.25, 0.5]) = yellow,  I_x(3)((0.5, 0.75]) = green,  I_x(3)((0.75, 1]) = blue
I_y([0, 0.5]) = robot is not in M1,  I_y((0.5, 1]) = robot is in M1
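As a small illustration of this interpretation step, the sketch below encodes the interval-to-term mapping and the target concept M1 in Python; the helper names and the example robot encoding are our own and purely illustrative.

def interpret_shape(v):
    """Map a numerical value in [0, 1] to a shape term (intervals as above)."""
    if v <= 0.33:
        return "round"
    if v <= 0.66:
        return "square"
    return "octagon"

def interpret_jacket(v):
    """Map a numerical value in [0, 1] to a jacket colour (intervals as above)."""
    if v <= 0.25:
        return "red"
    if v <= 0.5:
        return "yellow"
    if v <= 0.75:
        return "green"
    return "blue"

def in_M1(x):
    """Target concept M1: headShape = bodyShape or jacketColor = red.
    x holds the encoded values of x(1), x(2), x(3) (0-based list indices)."""
    head, body, jacket = interpret_shape(x[0]), interpret_shape(x[1]), interpret_jacket(x[2])
    return head == body or jacket == "red"

# A robot encoded as (headShape, bodyShape, jacketColor) = (0.2, 0.15, 0.9):
print(in_M1([0.2, 0.15, 0.9]))   # True, since both shapes are interpreted as "round"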

- We assume a validation method found a set of axis-parallel rules to describe the

2For demonstration purposes we prefer to use intervals instead of sparse coding.


behaviour of the neural network, for example:

If (0.1, 0.1, 0)ᵀ ≤ (x(1), x(2), x(3))ᵀ ≤ (0.3, 0.2, 1)ᵀ then 0.6 ≤ y ≤ 1

- Applying an interpretation on the input vector leads to a more human readable

rule like:

If (0.1, 0.1, 0)ᵀ ≤ (headShape, bodyShape, jacketColor)ᵀ ≤ (0.3, 0.2, 1)ᵀ then 0.6 ≤ y ≤ 1

- Finally, we obtain a representation comprehensible to humans when using the

interpretation on the intervals of the input and output nodes (note that the jack-

etColor can take an arbitrary value and therefore is not relevant for the rule):

If headShape = round ∧ bodyShape = round then M1

- The extracted rule set, in human readable form, is as follows:

If headShape = round ∧ bodyShape = round then M1
If headShape = square ∧ bodyShape = square then M1
If headShape = octagon ∧ bodyShape = octagon then M1
If jacketColor = red then M1

- After a rule refinement (see the following definition) process we would obtain

simply:

If (headShape = bodyShape) ∨ (jacketColor = red) then M1

This means we would be able to describe the semantics of this trained neural

network. This neural network is named ANN_M1. Overall it is possible to state:

I(ANN_M1) = If (headShape = bodyShape) ∨ (jacketColor = red) then M1

With the above statement we would have a very comprehensible description

about the neural network behaviour. As motivated in the introduction, this could


be helpful to prevent software faults. In addition we would have validated that the neural network correctly learned the task of classifying whether a robot belongs to M1. However, as mentioned previously, any neural network validation method is just an approximation of the true behaviour of the neural network. Therefore, for real world applications, we are not able to compute an alternative representation of the neural network which always agrees semantically with the neural network.

Definition 2.4 Rule refinement

In general, rule extraction of ANNs results in a large number of rules. Often it is

possible to reduce the number of rules, e.g. by applying boolean algebra. This process

is called rule refinement. For the rule refinement process we demand that the refined

rule set is semantically equivalent to the original rule set.

We decided to classify validation methods according to their representation format,

because this also reflects quite well the different strategies to analyse neural networks.

Propositional rule extraction represents the semantics of an ANN with a set of propo-

sitional rules, fuzzy rule extraction with a set of fuzzy rules and region based analysis

via a set of region mappings. Figure 2.4 provides an overview.

ANN interpretation using (representation example):
propositional rule extraction: if x(1) and x(2) then y
fuzzy rule extraction: if x(1) is small then y
region based analysis: if x is in Region X then y is in Region Y

Figure 2.4: Overview of different validation methods for neural networks.

To describe the different neural network validation techniques, we start with a short

introduction of the corresponding validation class and introduce selected algorithms

followed by a discussion according to several properties. Within the scope of this the-

sis we are especially interested in validating properties of neural networks. Hence, we

included the property “validation capability” in our discussion about rule extraction


techniques. This property was not found anywhere in the literature. All other prop-

erties have been proposed before by Towell and Shavlik [TS93a] or Andrews et al.

[TAGD98]. We will analyse a validation method according to the following properties:

- Portability: can the method be used for arbitrary feed-forward neural networks? A method is said to be generally applicable to feed-forward neural networks, if the only assumption is that the transfer-functions are invertible.

- Fidelity: how well does the computed representation mimic the neural network behaviour? This could be experimentally measured by comparing the output of the neural network with the output of the extracted rule set.

- Comprehensibility: is the representation easy to understand and usable to make informative statements about the neural network behaviour?

- Validation capability: can the method be used to verify properties of the neural network behaviour? We motivate this property, as it is an important aspect for safety-critical applications to prove that the system fulfills special properties. For example we would like to extract refined rules of this form: “if x is in a region X of the input space then it is guaranteed that the neural network output y is in the region Y of the output space”. It is trivial that the statement is always true if X and Y are the maximal input and output space of the neural network. Hence, we are interested in the most refined rules of the above type, because this provides more information. In other words we want to compute valid rule sets with high fidelity. In our example this means: for a given input region X we are interested in computing the smallest output region Y such that the neural network output for all points in X lies within Y.

- Algorithmic complexity: the time and space complexity of the neural network analysis algorithm.

- Usability for function approximation and classification tasks: is the method helpful for neural networks performing a function approximation task and how useful is the method to explain neural networks performing a classification task?


It is important to notice that the property comprehensibility is subjective. We discuss

this property, but we can not provide exact measurements. Furthermore, it is difficult to

provide proper experimental comparisons between different validation techniques with

respect to fidelity, because all reviewed methods discussed evaluations for (usually)

different neural networks trained with the same data set.

For a sound comparison with respect to fidelity all validation methods should use

exactly the same neural network, i.e. the same architecture and the same values for

the weights and biases. Unfortunately, to the knowledge of the author, only benchmark data sets are available, but no benchmark neural networks.

The aim of the following literature review is to explain the core idea of various neural

network validation algorithms. As a consequence only a sketch of the algorithm is

provided. For more detailed information about the corresponding algorithm the reader

is referred to the relevant publications.

2.2.1 Propositional Rule Extraction

An overview of (classical) propositional rule extraction methods is given in [ADT95],

and in the updated version [TAGD98]. The rule extraction approaches are categorized

into decompositional, hybrid (eclectic) and pedagogical. Decompositional (white box)

approaches try to extract the embedded knowledge by analysing each of the neural

network units with respect to their inputs. Pedagogical methods view the neural net-

work as a black box and analyse its input/output behaviour (e.g. ruleneg, see

[ADT95]). Between these two approaches are the so called eclectic or hybrid meth-

ods. Decompositional approaches within the class of propositional rule extraction can

be characterized by the following general properties:

- These algorithms mainly rely on clustering of the weight values (e.g. M-of-N), heavy pruning, and an exhaustive search on the inputs of all units (e.g. M-of-N, KT-method).

- The methods are only usable for specially constructed neural networks, i.e. these

techniques are not general. This is due to their assumptions of the neural net-

work topology, weights, transfer functions and biases, e.g. the KT-method is


only suitable for sparsely connected neural network topologies, furthermore KT

assumes that a neuron output is either 1 or 0. In some cases even special learn-

ing algorithms are required (e.g. M-of-N uses a training of the modified neural

network, while keeping the weights constant).

- Classical decompositional rule extraction methods include approximations from the start (e.g. M-of-N assumes a 0/1 activity for a neuron). The algorithm itself

approximates the rules on an already approximated neural network, instead of

using the original neural network. Therefore the extracted rules are not neces-

sarily valid for the original neural network.

Additionally, most propositional rule extraction methods are only suitable for classifi-

cation tasks, but not for function approximation [SLZ02].

We briefly explain two decompositional methods, namely KT and M-of-N. Both meth-

ods are representative examples for propositional rule extraction methods, which try

to analyse the neural network in a decompositional manner. In our description we will

use the term antecedent as follows:

Definition 2.5 Antecedent

A variable in the premise of a propositional rule is called antecedent. For example, x(1) is the antecedent in the rule “if x(1) then y”.


KT-Method

The method takes a decompositional approach by finding rules for each unit, which

are assumed to have binary output. The KT algorithm (in [TS93b] referred to as the

subset algorithm) searches for subsets of incoming positive weights which exceed the

bias of a unit. In a second phase a search for subsets of negative weights starts, such

that the sum of the negative weights is greater than the sum of the positive weights

minus the bias. The search space for the activity of a single unit rises exponentially

with the number of incoming links to this unit (power set). Hence, the number of

possible combinations to form a rule is exponential. To handle this problem, heuristics

are introduced to prune the search space. For example, in [TS93b] an upper bound

for the number of positive and negative subsets is defined. Other variations of the KT-

method (especially with different heuristics to prune the search space) are described in:

[Fu94], [SN88] and [TS93b]. In Figure 2.5 an example of the effect of the KT-method

is depicted, and the algorithm is sketched below.

The figure shows a single unit ("NeuralNet") with 0/1 inputs x(1), ..., x(4), weights 1, 4, −1, −3 and bias 2. The KT-algorithm calculates the positive subsets which exceed the bias, {x(2), (x(1), x(2))}, the negative subset(s) whose summed weights are greater than (x(2) − bias), {x(4), (x(3), x(4))}, and the negative subset(s) whose summed weights are greater than (x(1) + x(2) − bias), {(x(3), x(4))}. The extracted rules are:

if x(2) and not x(4) then y
if x(2) and not (x(3), x(4)) then y
if (x(1), x(2)) and not (x(3), x(4)) then y

Figure 2.5: Example for the KT-Method.


KT-Method

Input: Weights, biases of the neural network

Output: rule set

for each neuron j of the hidden and output layer
{
    // w(i) = W(j, i), where W is the weight matrix between the previous layer and
    // the layer of neuron j, and θ(j) is the bias of neuron j;
    // Σ_i w(i) x(i) − θ(j) is the input to the activation function of neuron j.
    // Often the search space is restricted by defining upper bounds k_p and k_n for the
    // number of positive and negative subsets, e.g. in [TS93b].

    find up to k_p subsets of indexes S_p ⊆ {1, ..., n} such that:
        w(i) > 0 for all i ∈ S_p   and   Σ_{i ∈ S_p} w(i) > θ(j)
    P = { S_p^1, ..., S_p^{k_p} }

    for each index set S_p ∈ P
    {
        find up to k_n subsets of indexes S_n ⊆ {1, ..., n} such that:
            w(l) < 0 for all l ∈ S_n   and   Σ_{i ∈ S_p} w(i) − θ(j) + Σ_{l ∈ S_n} w(l) < 0
        N = { S_n^1, ..., S_n^{k_n} }

        for each index set S_n ∈ N
        {
            // S_p and S_n are sets of indexes
            rule set = rule set ∪ { if x(S_p) ∧ ¬x(S_n) then output unit is active }
        }
    }
}
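To make the search concrete, the following is a minimal Python sketch of the subset search for a single unit with 0/1 inputs; the function name, the 0-based indices and the bounds max_pos and max_neg on the number of subsets are our own illustrative choices, and the example weights follow our reconstruction of Figure 2.5 (weights 1, 4, −1, −3 for x(1), ..., x(4) and bias 2).

from itertools import combinations

def kt_rules(weights, bias, max_pos=10, max_neg=10):
    """Sketch of the subset (KT) search for one unit with 0/1 inputs.
    Returns rules (P, N) read as: if all inputs in P are 1 and the inputs in N
    are not all 1, then the unit is active. Indices are 0-based."""
    pos = [i for i, w in enumerate(weights) if w > 0]
    neg = [i for i, w in enumerate(weights) if w < 0]
    rules = []
    # Phase 1: positive subsets whose summed weights exceed the bias.
    pos_subsets = [P for r in range(1, len(pos) + 1)
                   for P in combinations(pos, r)
                   if sum(weights[i] for i in P) > bias][:max_pos]
    # Phase 2: negative subsets that could cancel the surplus of a positive subset.
    for P in pos_subsets:
        surplus = sum(weights[i] for i in P) - bias
        neg_subsets = [N for r in range(1, len(neg) + 1)
                       for N in combinations(neg, r)
                       if -sum(weights[i] for i in N) > surplus][:max_neg]
        for N in neg_subsets:
            rules.append((P, N))
    return rules

# Single-unit example of Figure 2.5: weights of x(1)..x(4) and bias 2.
print(kt_rules([1, 4, -1, -3], bias=2))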


Discussion of the KT-method

- Portability: The KT-method requires a special neural network architecture, due to the assumption that inputs to, and outputs from, a neuron are either 1 or 0.

- Fidelity: KT achieves a high fidelity for small binary neural networks, where the whole search space can be explored. Usually, due to the pruning of the search space, the larger the neural network, the less informative the rule set becomes, because only a fraction of the possible neural network inputs are considered.

- Comprehensibility: The KT-method produces human readable propositional if/then rules. For problems with high dimensional inputs and outputs the comprehensibility clearly decreases as a large set of rules is needed to explain the behaviour of the neural network.

- Validation capability: Each rule is a valid statement about a property of the neural network. This type of rule can also be found by computing the neural network output for special points in the input space. Hence, the KT-method cannot be used to validate general properties of the neural network, e.g. to make statements such as: “if the input is in region X then the neural network always classifies the data as positive”. Overall we can state that the KT-method is a strategy to test the neural network behaviour rather than to validate it.

- Algorithmic complexity: The algorithm would scale exponentially in time com-

plexity as well as in space complexity (exponential number of rules) with an

increasing number of inputs to a neuron. However, depending on the heuristics

to prune the search space this complexity can be reduced.

- Usability for function approximation and classification tasks: The KT-method

is only applicable for (discrete) classification tasks.


M-of-N Method

In [TS93b] the M-of-N rule extraction method was introduced. According to [TS93b] rules extracted by the KT-method often contain M-of-N style rules. These are rules of this format:

If (M of the following N antecedents are true) then output unit y is active

The basic idea of the M-of-N approach is to cluster weights of similar values into groups, i.e. to build equivalence classes.

M-of-N Method

Input: weights, biases of the neural network

Output: rule set, modified neural network

1. For each hidden and output unit: build groups of similar weights.

2. Assign all weights of the same group the average value of this group.

3. Eliminate groups which are not significantly important for the activity of the

subsequent unit (using a heuristic and an algorithmic method, see [TS93b]).

4. Optimize. Because unimportant groups are eliminated this could change the

activity of units within the neural network. To address this problem, the

modified neural network has to be trained. The remaining weights are kept

constant during this learning process, such that the groups remain unchanged.

Therefore the retraining just optimizes the biases.

5. Extracting. Translating the bias and incoming weights to each unit into a rule

with weighted antecedents such that the rule is true if the sum of weighted

antecedents exceeds the bias.

6. Simplifying. Rewriting the rule in an M-of-N notation.
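The grouping in steps 1 and 2 can be sketched as follows in Python; the similarity threshold tol and the example weights (loosely following Figure 2.6) are assumptions made only for illustration.

def group_weights(weights, tol=0.5):
    """Steps 1 and 2 of the M-of-N method (sketch): cluster a unit's incoming
    weights into groups of similar values and replace each weight by the group
    average. The threshold tol is an assumed notion of 'similar'."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    groups, current = [], [order[0]]
    for i in order[1:]:
        mean = sum(weights[j] for j in current) / len(current)
        if abs(weights[i] - mean) <= tol:       # close enough: same group
            current.append(i)
        else:                                   # too far: start a new group
            groups.append(current)
            current = [i]
    groups.append(current)
    averaged = list(weights)
    for g in groups:
        mean = sum(weights[j] for j in g) / len(g)
        for j in g:
            averaged[j] = mean
    return groups, averaged

# Five incoming weights, loosely following Figure 2.6: three of them are close to 7.
print(group_weights([7.2, 1.1, 6.8, 0.9, 7.0]))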


The figure shows the initial part of an ANN: a unit with five inputs x(1), ..., x(5) and bias 11.2; the visible raw weights are 0.9, 1.1, 6.8 and 7.2. In steps 1 and 2 the weights of x(1), x(3) and x(5), which lie near 7, are grouped and replaced by the average value 7; step 3 eliminates the group containing x(2) and x(4); steps 4 and 5 extract the rules

if x(1) and x(3) then y
if x(1) and x(5) then y
if x(3) and x(5) then y
if x(1) and x(3) and x(5) then y

and step 6 simplifies them to: if 2 of {x(1), x(3), x(5)} then y.

Figure 2.6: Example for the M-of-N method. The rules express when the input weights

exceed the bias (we assume the bias remains constant after retraining the modified

neural network). The output neuron (denoted with y) is active, if the incoming weights

exceed the bias.

Discussion of the M-of-N method

- Portability: The method requires a special neural network architecture, because

it assumes that a neuron output is either 1 or 0.

- Fidelity: The M-of-N algorithm changes the architecture, the weight values and

biases of the original neural network. Hence, the fidelity depends on the differ-

ence between the original and the final neural network, which will be used to

extract rules. Intuitively we can assume that the more similar the final neural

network is to the original one, the better the fidelity.

- Comprehensibility: The M-of-N method can represent the neural network behaviour with comprehensible, concise and simple (due to the M-of-N notation)

rules. The comprehensibility usually decreases in higher dimensional cases, be-

cause the number of rules will increase. However, depending on the elimination

of groups, more comprehensible, general rules can be produced, but with, usu-

ally, less fidelity.

- Validation capability: As the M-of-N method modifies the original neural net-


work before extracting rules, it is not possible to verify properties of the function

computed by the original neural network.

- Algorithmic complexity: The method itself (the grouping process) scales within

polynomial time complexity. Additionally, the time complexity increases with

the re-learning process, but overall still scales within polynomial time complex-

ity. Overall the method would also scale for higher dimensional cases.

- Usability for function approximation and classification tasks: The M-of-N method is only applicable to classification tasks.

2.2.2 Fuzzy Rule Extraction

Although our approach is different, we decided to include a section about fuzzy rule

extraction, because fuzzy representations are an interesting choice to explain the be-

haviour of neural networks. For example, one important advantage is that, in general,

fewer fuzzy rules are necessary to explain the behaviour of neural networks when com-

pared to propositional (crisp) rules.

Theoretical results [BBDR96] showed that it is possible to build a fuzzy system which

calculates the same function as a neural network. However, to obtain a good approx-

imation of the neural network behaviour an increased number of linguistic terms is

required. According to [PNJ01] this is also the main drawback of fuzzy rule extrac-

tion as the number of rules increases exponentially with a better approximation.

In this section three different approaches of fuzzy rule extraction are discussed. Lin-

guistic rule extraction [INT99] handles inputs as fuzzy numbers and the correspond-

ing output is calculated as a fuzzy number by fuzzy arithmetic. Fuzzy Trepan [FJ99]

is a pedagogical approach to fuzzy rule extraction and uses fuzzy decision trees to ex-

tract rules of neural networks. Finally, REX [MKW03] uses an evolutionary algorithm

to compute a set of fuzzy rules which mimics the neural network behaviour.

For a better understanding of the discussed fuzzy rule extraction methods relevant ter-

minology is explained. The following definitions mainly rely on the book by Mendel

[Men01].


Definition 2.6 Linguistic Variable

A linguistic variable is a variable whose values are words or sentences.

According to Zadeh [Zad75] the justification for the use of linguistic variables is: “The

motivation for the use of words or sentences rather than numbers is that linguistic

characterizations are, in general, less specific than numerical ones.”

Definition 2.7 Linguistic Value

A linguistic value is a value, which is characterized by a word. A linguistic value is

specified by its membership function, i.e. a value is handled as a fuzzy number.

Definition 2.8 Membership Function

A membership function assigns to a given numerical value the degree of membership to a linguistic value. This is a continuous value between 0 and 1.

Typical membership functions are, for example: triangular, trapezoidal, piecewise-

linear, Gaussian and bell-shaped.

Definition 2.9 Linguistic Term

A linguistic variable could be decomposed into a set of linguistic terms, which cover

the complete value range (called universe of discourse in [Men01]).

Definition 2.10 α-cut

An α-cut of a fuzzy set F defined on the universal set X is the crisp set: αF = { x ∈ X | μF(x) ≥ α }.


Linguistic rule extraction

The method as described by Ishibushi, Nii and Tanaka [INT99] is an improvement of

their initial work [IN]. Linguistic rules are of the form:

Rule R_j: IF x(1) is A_j1 ∧ ... ∧ x(n) is A_jn THEN Class C_j with CF_j

where:
x ... n-dimensional input vector
j ... index for rule R_j
A_jk ... antecedent linguistic values
C_j ... consequent class
CF_j ... certainty grade

Typical examples for antecedent linguistic values are “small”, “medium” and “large”.

A linguistic value is specified by its membership function. A possible instance of a

linguistic rule could be the following:

IF x(1) is small ∧ x(4) is large THEN Class 3 with CF_j = 0.90

As indicated in the above rule, “don't care” attributes are omitted. The above rule reads like this: “if the input attribute x(1) is small and the input attribute x(4) is large then the output is class 3 with a certainty grade of 0.9”. As explained by Ishibushi et al. [INT99] the certainty grade of a consequence class is determined by analysing the classification of the α-cut of the fuzzy input vector for various values of α. For example, [INT99] used for their experiments 100 different values of α between 0.01 and 1.

In the initial algorithm all combinations of antecedent linguistic values are com-

puted by forward propagating them through the neural network. The forward propaga-

tion of a rule premise (a particular combination of antecedent linguistic values) requires calculating, starting from the input layer, the fuzzy output of a given fuzzy input for

each layer of the feed-forward neural network until the output layer. The calculation

is performed by using fuzzy arithmetic [KG85].

According to [INT99] the main drawbacks of this approach are:

- exponential increase of possible combinations of antecedent linguistic values,

e.g. for an input vector with 10 attributes and 2 linguistic values for each at-


tribute 2^10 fuzzy input combinations are possible. This would result in the calculation of 2^10 fuzzy outputs.

- in general we obtain a large number of extracted linguistic rules

The Original Fuzzy Linguistic Algorithm

Input: neural network, antecedent linguistic values

Output: Set of fuzzy rules

for all antecedent combinations:

- Compute the fuzzy output vector using fuzzy arithmetic.

- Perform the numerical calculation by interval arithmetic on α-cuts of fuzzy numbers.

- Determine the consequence class and the certainty grade with the fuzzy output vector.
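The numerical core of this forward propagation, interval arithmetic on an α-cut pushed through one weight layer and a monotonic transfer function, can be sketched as follows in Python; the layer values are invented for illustration and the bookkeeping over several α values is omitted.

import numpy as np

def propagate_interval(W, theta, lo, hi):
    """Push an input box [lo, hi] (an alpha-cut of the fuzzy input) through one
    layer: bound net = W x - theta with interval arithmetic, then apply the
    sigmoid, which is monotonic, to the interval endpoints."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    net_lo = W_pos @ lo + W_neg @ hi - theta
    net_hi = W_pos @ hi + W_neg @ lo - theta
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sigmoid(net_lo), sigmoid(net_hi)

# Alpha-cut [0.2, 0.4] x [0.6, 0.9] of a fuzzy input and one assumed weight layer.
W = np.array([[1.0, -2.0], [0.5, 0.3]])
theta = np.array([0.1, -0.2])
print(propagate_interval(W, theta, np.array([0.2, 0.6]), np.array([0.4, 0.9])))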

However, as suggested in [INT99] it is possible to bypass the first problem by extracting more general linguistic rules with a smaller number of antecedents, but this would, generally, result in less fidelity. To overcome the second difficulty [INT99] uses genetic algorithms to select only a small number of significant linguistic rules. Extracting more general rules requires reducing the number of antecedent conditions, which leads to fewer combinations of antecedent linguistic values. To overcome the increasing excess of fuzziness (large overlap between different linguistic values) of the outputs a more accurate interval arithmetic is necessary. This could be achieved with a hierarchical subdivision method of an n-dimensional interval vector [INT99].

As reported in [INT99] the wine data of the UC Irvine database was used to test the

improved method. The wine data has 13 continuous inputs and three output classes.

Out of 37765 fuzzy input vectors for generating linguistic rules with three antecedent

linguistic values (all other attributes were set to ”don’t care”) 7381 different rules have

been extracted [INT99]. To select the most significant rules a genetic algorithm [INT]

was applied.


Discussion of the linguistic rule extraction method

- Portability: Generally applicable to feed-forward neural networks.

- Fidelity: The fidelity depends on the excess of fuzziness of the extracted rules and usually decreases the more general rules the system computes.

- Comprehensibility: The algorithm often produces a large number of rules and therefore a low degree of comprehensibility, for example the previously mentioned 7381 rules extracted from a neural network trained on the wine data. How-

ever, a post-processing process can be used to extract the most significant rules

[INT99]. The extracted fuzzy rules, together with linguistic terms, are human

readable explanations about the function computed by the neural network.

- Validation capability: The method is, in general, not able to validate special

behaviour of the neural network, as usually a certainty grade of 1 is not com-

puted.

- Algorithmic complexity: The time and space complexity of the original algorithm can increase exponentially, depending on the number of input nodes. By extracting more general rules, with usually less accuracy, the complexity can be controlled.

- Usability for function approximation and classification tasks: The proposed

algorithm is only applicable to classification tasks.

FuzzyTrepan

Fuzzy Trepan by Faifer and Janikow [FJ99] is an extension of the TREPAN [CS96]

algorithm. TREPAN is a pedagogical approach which uses decision trees for both

knowledge extraction and representation. As such this approach views the task of

extracting comprehensible knowledge of a neural network as an inductive learning

problem.

Fuzzy Trepan follows the same idea as TREPAN but uses fuzzy decision trees. Fuzzy

decision trees were first proposed by Janikow ( [Jan93], [Jan96]) as an extension of


ID3 [Qui86].

The algorithm, named FuzzyTrepan, is summarized in the following steps:

FuzzyTrepan

Input: neural network, antecedent linguistic values, training data

Output: Fuzzy decision tree

(i) Replace the original class labels with the class labels

assigned by the ANN to the given input.

(ii) Induce the tree in a top-down manner, with the most informative

label on the top. To split a particular node an information gain

measure, modified for fuzzy representations, is applied.

(iii) A node is a leaf, if one of the following criteria is

fulfilled: minimal entropy, minimal information gain or

minimal example count.
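The pedagogical relabelling in step (i) can be illustrated with the following Python sketch, in which a crisp scikit-learn decision tree stands in for the fuzzy decision tree of steps (ii)-(iii); scikit-learn, the toy network and the depth limit are assumptions for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trepan_style_tree(ann_predict, X_train, max_depth=3):
    """Step (i): relabel the training inputs with the *network's* predictions,
    then induce a tree on these labels (a crisp stand-in for the fuzzy tree)."""
    y_ann = np.array([ann_predict(x) for x in X_train])
    tree = DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(X_train, y_ann)
    return tree

# Toy stand-in for a trained network: a thresholded linear score on two inputs.
ann_predict = lambda x: int(x[0] + 0.5 * x[1] > 0.7)
X = np.random.default_rng(1).uniform(size=(200, 2))
tree = trepan_style_tree(ann_predict, X)
print(tree.score(X, [ann_predict(x) for x in X]))   # fidelity on the sample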

As described in [FJ99], FuzzyTrepan outperformed TREPAN with respect to fidelity on the Iris and Pima datasets, but had lower fidelity on the Bupa dataset.

Discussion of FuzzyTrepan

- Portability: Generally applicable to any feed-forward neural network, because the neural network is considered as a black box.

- Fidelity: Trade-off between tree size and comprehensibility, because a greater number of fuzzy terms results in a larger tree and higher fidelity, but it decreases the comprehensibility.

- Comprehensibility: As mentioned before, the complexity of the tree defines the degree of comprehensibility.

- Validation capability: The method is a pedagogical approach and, as such, is relying on sampling data from the neural network; therefore the method is not suitable to validate properties of the function computed by a neural network.

- Algorithmic complexity: The complexity depends on the number of sampled neural network data and on the learning algorithm for the fuzzy decision tree.


The algorithm scales within polynomial time and space complexity, because the

method only depends on the size of the sample data set and not on the neural

network architecture.

- Usability for function approximation and classification tasks: The proposed algorithm is restricted to classification problems.

REX

REX [MKW03] is an evolutionary algorithm to extract fuzzy rules of a neural network

computing a classification task. The extracted knowledge is represented as a fuzzy rule

set. The vector x represents the network input. For an m-class classification problem the vector y describes the output in binary format. An entry y(i) is equal to 1 if the neural network predicts class C_i; all other entries of y are then equal to 0. The rule format is:

Rule R_j: IF x(1) is A_j1 ∧ ... ∧ x(n) is A_jn THEN Class C_j

REX starts with a randomly selected first population. Sets of input patterns are sequen-

tially passed to the set of rules in a given generation. For each individual an evaluation

function is calculated. The evaluation function is a measurement of the fidelity be-

tween the set of rules and the output of the neural network. During the crossover phase

rules and fuzzy set groups can be exchanged between individuals. This process is re-

peated until a stopping criterion is fulfilled.

To apply the algorithm in an experimental phase the parameters for REX like the size

of the population, the mutation probability, the crossover probability and the maxi-

mum number of rules for an individual have to be determined. In the following we

summarize the main idea of the REX algorithm as described in [MKW03].


REX

Input: neural network, antecedent linguistic values

Output: Set of fuzzy rules

(i) An individual and the initial population.

An individual consists of a set of fuzzy rules and a related fuzzy set.

The length of a chromosome is constant. Each rule, premise and fuzzy set can

be marked as active. Fuzzy sets are encoded as one real number, characterized

by the centroid of the corresponding membership function.

(ii) Evaluation of an individual o.
The quantities a, b, c, d and e are metrics, where:
a(o) ... number of correctly (i.e. in agreement with the neural network) classified patterns,
b(o) ... number of incorrectly classified patterns,
c(o) ... number of patterns that were not classified because no rule was active,
d(o) ... total number of active premises in the active rules,
e(o) ... total number of active fuzzy sets in the individual o.
For an individual o the value of the evaluation function is given by:

f(o) = α · a(o) · B(b(o)) + β · a(o) · B(c(o)) + γ · B(d(o)) + δ · B(e(o))

The function B is defined piecewise, with one case for a zero argument and one for a non-zero argument. Finally, α, β, γ and δ are coefficients.

(iii) Evolutionary operators:

The mutation of a centroid of a fuzzy set is based on adding a random floating

point number (similar for integer values a random integer number modulo an

allowed range is added). Mutating an activity bit is done by its negation.

The crossover operator allows rules and fuzzy set groups to be exchanged between

individuals.

(iv) Stopping Criteria:

The algorithm terminates if one of the following stopping criteria is fulfilled: the maximum number of steps has elapsed, there is no progress for a given number of generations, or the evaluation function for the best individual is higher than a certain value.
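The fidelity-related counts entering the evaluation of an individual can be sketched as follows in Python; rule firing is treated as crisp and the first active rule decides the class, which is a simplification of the fuzzy inference used by REX.

def fidelity_counts(rules, patterns, ann_class):
    """Counts entering the evaluation of an individual (sketch): a = patterns on
    which the rule set agrees with the network, b = disagreements, c = patterns
    covered by no active rule. Each rule maps an input to a class or None."""
    a = b = c = 0
    for x in patterns:
        fired = [cls for cls in (rule(x) for rule in rules) if cls is not None]
        if not fired:
            c += 1
        elif fired[0] == ann_class(x):   # crisp simplification: first active rule decides
            a += 1
        else:
            b += 1
    return a, b, c

# Toy example: one rule and a stand-in network on one-dimensional inputs.
rules = [lambda x: 1 if x > 0.5 else None]
ann_class = lambda x: int(x > 0.4)
print(fidelity_counts(rules, [0.2, 0.45, 0.6, 0.9], ann_class))   # (2, 0, 2)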


Discussion of REX

- Portability: Generally applicable to feed-forward neural networks, because the neural network is considered as a black box.

- Fidelity: As reported in [MKW03], experimental studies showed that REX produces similar results as the crisp rule extraction method FULL-RE [TG96] on the IRIS data set. However, as the method is based on sampling, a high fidelity is not guaranteed. In particular the fidelity is dependent on the chosen samples in the input space of the neural network.

- Validation capability: The method is a pedagogical approach and therefore not suitable to verify properties of the function computed by a neural network.

- Comprehensibility: REX produces a set of fuzzy rules, which provide a good comprehensibility.

- Algorithmic complexity: With the stopping criterion “maximal number of steps” it is ensured that the method stays within polynomial time complexity. The parameter for the maximal number of rules restricts the space complexity. With these parameter settings the algorithm is also usable for high dimensional cases.

- Usability for function approximation and classification tasks: The proposed algorithm is restricted to classification problems.


2.2.3 Region-based Analysis

We classify algorithms as region-based if they have a geometrical perspective and if

they do not require any simplifications for the weight-layer. In addition, all these

methods analyse the behaviour of the neural network in a decompositional manner.

Within region-based analysis techniques we distinguish between refinement-based and non-refinement-based methods. As motivated in Section 1.4, refinement-based methods rely on the forward- and/or backward-propagation of regions through all layers of a feed-forward neural network. We first introduce non-refinement-based methods.

REFANN

The algorithm called rule extraction from function approximating neural networks

(REFANN) approximates the continuous activation function of the hidden unit by

piece-wise linear functions [SLZ02]. This divides the input space into subregions.

For each non-empty subregion a rule is generated. As explained by Setiono et. al

[SLZ02] the approximation of the sigmoid function is dependent on the training data

as the training data defines the maximal input value for the N -th hidden unit and there-

fore also the point t ^ of the intersection of two line segments.

REFANN generates rules from trained neural networks with one hidden layer and one

linear output unit. In [SLZ02] a pruning algorithm, called N2PFA, is introduced. This

pruning algorithm removes redundant and irrelevant units. It is recommended to apply

the pruning algorithm before the rule extraction process, because the time complexity

of REFANN increases exponentially with the number of hidden units. The main steps

of the algorithm are summarized below.


REFANN

Input: Data set, neural network

Output: Set of rules

// The neural network has n inputs, H hidden units and 1 output unit.

(i) for each hidden unit i = 1, ..., H
{
    approximate the activation function f with q piece-wise linear functions;
    let q = 3 and define the piece-wise linear function L_i as follows
    (to simplify the notation we write x for the net input of unit i):

    L_i(x) = (x + x_{i,m}) f'(x_{i,m}) − f(x_{i,m})   if x < −x_{i,0}
             x                                        if −x_{i,0} ≤ x ≤ x_{i,0}
             (x − x_{i,m}) f'(x_{i,m}) + f(x_{i,m})   if x > x_{i,0}

    where:
    x_{i,m} ... maximal possible input; x_{i,m} is determined from the training data
    x_{i,0} ... intersection point of two line segments, with:
    x_{i,0} = (x_{i,m} f'(x_{i,m}) − f(x_{i,m})) / (f'(x_{i,m}) − 1)
}

(ii) Divide the input space into 3^H subregions according to the definition of L_i;
     a subregion is a polyhedron in the input space, for example:
     { x ∈ R^n | −x_{i,0} ≤ (W x)(i) ≤ x_{i,0} for all i ∈ {1, ..., H} }

(iii) For all non-empty subregions
{
    compute an approximation of the output for the k-th training example in the
    subregion as follows:

        y_k = Σ_{i=1}^{H} v(i) L_i(net_k(i))

    where:
    v is the weight vector between the hidden layer and the output unit,
    net_k(i) = (W x_k)(i), with x_k representing the input vector of the k-th training example.

    Form a rule of the following structure:

        if x ∈ P_k then y_k

    where P_k is a polyhedron of the form { x ∈ R^n | A x ≤ b } and A is a matrix with n columns.
}
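Step (i) can be illustrated with the following Python sketch of a three-piece linear approximation, assuming a hyperbolic tangent hidden activation; the value x_max stands for the maximal input x_{i,m} determined from the training data.

import numpy as np

def three_piece_approx(x_max):
    """Three-piece linear approximation of tanh on [-x_max, x_max]: the middle
    segment is y = x, the outer segments are the tangents at +/- x_max, and x0
    is the intersection point of neighbouring segments."""
    slope = 1.0 - np.tanh(x_max) ** 2                      # tanh'(x_max)
    x0 = (x_max * slope - np.tanh(x_max)) / (slope - 1.0)  # intersection point
    def L(x):
        x = np.asarray(x, dtype=float)
        upper = np.tanh(x_max) + (x - x_max) * slope
        lower = -np.tanh(x_max) + (x + x_max) * slope
        return np.where(x > x0, upper, np.where(x < -x0, lower, x))
    return L, x0

L, x0 = three_piece_approx(2.0)
print(x0, L([-3.0, 0.2, 3.0]), np.tanh([-3.0, 0.2, 3.0]))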


Discussion of REFANN

- Portability: REFANN, as introduced in [SLZ02], requires a neural network

with one hidden layer and one linear output unit.

- Fidelity: The fidelity depends on the number of piece-wise linear functions chosen to approximate a sigmoidal function. Furthermore it is possible to compute a precise error bound (for further details the reader is referred to the paper by Setiono et al. [SLZ02]).

- Comprehensibility: REFANN produces rules of the form:

  if x ∈ P_k then y_k

  where P_k is a polyhedron containing the k-th training example, and y_k is the approximation of the neural network output in this region. Together with visualization tools the extracted rules are helpful to explain the behaviour of the neural network.

- Validation capability: For rules of the above form it is not possible to assure certain behaviours of the neural network, because the output for an input vector x ∈ P_k could be above or below the value y_k. However, with continuous and monotonic transfer functions like the sigmoidal function, it is possible to modify the algorithm such that rules guarantee a certain behaviour. For example, by applying linear programming techniques (dependent on the sign of v(i), either maximize or minimize W(i,:) · x subject to x ∈ P_k) an upper bound, denoted y_max, for the neural network output of an input vector x ∈ P_k can be computed (a small sketch of this computation is given after this list). Therefore, it can be guaranteed that the following rule is always fulfilled:

  if x ∈ P_k then y ≤ y_max

  Similarly, a lower bound value y_min could be computed. Together they give a valid interval bound on the neural network output value y for any input vector in the polyhedron P_k. Hence, we could extend the method to compute valid rules of the form:

  if x ∈ P_k then y ∈ [y_min, y_max]


- Algorithmic complexity: Exponential increase of the number of subregions with the number of hidden units. Let H be the number of hidden units and assume a three-piece linear approximation of the activation function; then the number of subregions is 3^H.

- Usability for function approximation and classification tasks: The proposed algorithm is applicable for classification problems and for function approximation tasks.
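The linear-programming step mentioned under Validation capability can be sketched as follows; scipy is an assumed dependency, the polyhedron is a toy unit box, and mapping the resulting net-input bounds to output bounds via the monotonic transfer function and the output weights is left out for brevity.

import numpy as np
from scipy.optimize import linprog

def net_input_bounds(w_row, A, b):
    """Bound w_row . x over the polyhedron {x | A x <= b} via linear programming.
    linprog minimises, so the maximum is obtained by negating the objective."""
    free = [(None, None)] * len(w_row)
    lo = linprog(c=w_row, A_ub=A, b_ub=b, bounds=free).fun
    hi = -linprog(c=-w_row, A_ub=A, b_ub=b, bounds=free).fun
    return lo, hi

# Toy polyhedron: the unit box [0, 1]^2 written as A x <= b, and one weight row.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
print(net_input_bounds(np.array([2.0, -1.0]), A, b))   # (-1.0, 2.0)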

Rule Extraction by Clustering

In the paper “Rule Extraction from Feedforward Neural Network for Function Approx-

imation” by Gaweda et al. [GSZ00] a rule extraction technique is described which relies on clustering of the hidden unit activations.

Each component of a cluster center in the hidden unit output space represents a hyperplane in the input space. These hyperplanes are defined from the weight matrix W, containing the weights between the input layer and the hidden layer. Rules are defined

according to the inverse of the cluster centers and the corresponding network outputs.

The computed rule set is used for a rule-based approximation of the behaviour of the

neural network. For a given input the most active rule is selected. The activation of a rule is calculated as the sum of a similarity measure (reciprocal to the distance) between the point and each hyperplane of the rule condition.

The algorithm is an extension of previous cluster methods as introduced in [SL97]

and [WvdB98].


CLUSTER - REFA

Input: Data set, neural network

Output: Set of rules

// The neural network has n inputs, H hidden units and 1 output unit.

(i) Find clusters during the training phase in the output space of the H-dimensional hidden layer. To find the clusters, algorithms like, for example, neural gas [TBS93] can be used.

(ii) For each cluster center k = 1, ..., c   // c_k denotes the k-th cluster center
{
        y_k = Σ_{i=1}^{H} v(i) c_k(i)

     where v is the weight vector between the hidden layer and the output unit.
}

(iii) For each cluster center k = 1, ..., c
{
     For each hidden unit j = 1, ..., H
     {
        // the linear prototype is denoted as lp_k(W(j,:))
        lp_k(W(j,:)) := W(j,:) x − f^(-1)(c_k(j))
     }
}

(iv) Build c rules; the k-th rule has the form:

        if lp_k(W(1,:)) = 0 ∧ ... ∧ lp_k(W(H,:)) = 0 then y = y_k

Geometrically each linear prototype represents a hyperplane in the input space. For

a single input vector x the distance to each of these hyperplanes is computed. This allows selecting the best rule for an input vector x.


APPLICATION OF CLUSTER - REFA

Input: input vector x, weight matrix W, linear prototypes lp
Output: approximated output value y

(i) Let g(lp_i(W(j,:)), x) be a function reciprocal to the distance between the input vector x and the hyperplane corresponding to the linear prototype lp_i(W(j,:)). [GSZ00] use the following distance measure:

        d(lp_i(W(j,:)), x) = | W(j,:) x − f^(-1)(c_i(j)) | / ‖ W(j,:) ‖

    and the similarity measure g is:

        g(lp_i(W(j,:)), x) = 1 / (1 + d(lp_i(W(j,:)), x))

(ii) For each of the c rules compute the activation α_i by:

        α_i = Σ_{j=1}^{H} g(lp_i(W(j,:)), x)

(iii) Compute the maximum activation, i.e. α_s = max_{1 ≤ i ≤ c} α_i.

(iv) Apply the rule with the maximum activation, i.e. the s-th rule.
     The output y of the rule system for input x is then given by: y = y_s.
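The application procedure can be condensed into the following Python sketch; the network values, the use of arctanh as inverse activation and the similarity 1/(1 + d) follow our reconstruction above and are illustrative only.

import numpy as np

def cluster_refa_output(x, W, v, centers, inv_act):
    """Apply the CLUSTER-REFA rule base (sketch): choose the rule whose
    hyperplanes are, summed over the hidden units, most similar to x and return
    that rule's output value y_k = v . c_k."""
    row_norms = np.linalg.norm(W, axis=1)
    best_alpha, best_y = -np.inf, None
    for c in centers:                          # one rule per cluster centre
        targets = inv_act(c)                   # net values encoded by the centre
        dists = np.abs(W @ x - targets) / row_norms
        alpha = np.sum(1.0 / (1.0 + dists))    # summed similarity to the rule's hyperplanes
        if alpha > best_alpha:
            best_alpha, best_y = alpha, float(v @ c)
    return best_y

# Toy values: two inputs, two hidden units with tanh activation, one output unit.
W = np.array([[1.0, -1.0], [0.5, 2.0]])
v = np.array([1.5, -0.7])
centers = [np.array([0.2, 0.6]), np.array([-0.4, 0.1])]
print(cluster_refa_output(np.array([0.3, 0.8]), W, v, centers, np.arctanh))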

Discussion of CLUSTER-REFA

- Portability: The method, as introduced in the paper by Gaweda et al. [GSZ00], is restricted to neural networks with one hidden layer.

- Fidelity: The fidelity depends on the number of extracted rules. To achieve a

high fidelity a large number of rules is necessary, because a rule consequence

is a single value. However, as stated in [GSZ00] the approximation accuracy

could be improved by the use of functional rule consequences or by using fuzzy

clustering algorithms to determine the shape of clusters.

- Comprehensibility: The comprehensibility depends on the number of rules. The method produces rules of the form:

  if x is closest to the hyperplanes of rule s then y = y_s


  where y_s is the approximation of the output for inputs closest to the hyperplanes of rule condition s.

- Validation capability: CLUSTER-REFA produces correct rules for points contained in the subspace defined as the intersection of the linear prototypes of a rule condition. In other words, the c rules computed with CLUSTER-REFA are valid statements about the neural network. However, for all points which are not included in one of the c subspaces, approximations are used (see the algorithm Application of CLUSTER-REFA).

- Algorithmic complexity: The algorithmic complexity is dependent on the applied cluster algorithm. The rule extraction part itself scales with a time complexity of O(H · c), where H is the number of hidden nodes and c the number of cluster centers. Hence, the algorithm scales well with an increasing number of hidden nodes.

- Usability for function approximation and classification tasks: The proposed

algorithm is applicable for classification problems as well as for function ap-

proximation tasks.

Decision Intersection Boundary Algorithm

The Decision Intersection Boundary Algorithm (DIBA) is designed to extract exact

representations of threshold feed-forward neural networks [MP00]. A decision region

is defined as a region in the input space, such that all points in that region result in

the same network output. Decision regions are limited by decision boundaries. A

decision boundary is a location in the input space where an output unit switches its

activation state (0 to 1 or vice versa). The possible decision regions for a threshold neural network are polytopes (the input space is constrained, i.e. an interval of possible values is defined for each input dimension). The algorithm relies on a few

important observations, see also [MP00].

1. Independence of an output unit. Each output unit is computing its own value

irrespective of the other output units.


2. For an ANN with threshold activation functions the outputs of the hidden units are 0 or 1. Therefore the output units just compute a partial sum of their weights.

3. An output unit changes only if at least one hidden unit value changes. Therefore

a decision boundary corresponds to a region in the input space where at least

one hidden unit undergoes a change.

4. Hidden units split the input space through hyperplanes. If the input space is

bounded, the decision regions are defined through higher-dimensional polytopes.

The algorithm consists of two phases, namely a generative part which calculates the

possible vertices and connecting line segments of decision regions and an evaluation

part which tests if these basic elements (vertices, lines) form the boundary of a deci-

sion region.

The computation of candidate vertices and lines is performed by finding hidden layer

hyperplane intersections and by projecting these hyperplanes incrementally onto each

other until dimension one. The results of this part are line segments.

The second phase is to test whether a line segment builds a boundary, i.e. to test

whether the intersection of hyperplanes forms a corner boundary and if the interven-

ing line segment is a boundary line. Figure 2.7 illustrates the projection part of the

algorithm and Figure 2.8 shows a decision region and explains the traversing of a line.

A sketch of the algorithm for a two weight layer feed-forward neural network with a

single output neuron is provided on the next page.


Figure 2.7: DIBA recursively projects the hyperplanes of the first layer onto each

other until dimension one. We demonstrate the projection of two hyperplanes onto the

grey colored hyperplane from dimension three to dimension one.



Figure 2.8: A decision region and the traversing of a line. A single output neuron is con-

sidered (w is a vector). In the forward pass each line segment is denoted with the partial sum

of the weights of the output neuron, once passed an intersection hyperplane which switches

from 0 to 1 in the traverse direction. The backward pass is symmetrical. During the backward

phase it is determined if a line segment builds an edge of the decision region and if a corner is

a vertex of the decision region.

DIBA - Decision Intersection Boundary Algorithm -

Input: ANN weight matrices and biases, interval restrictions on each input dimension.

Output: Decision regions represented through vertices and line segments.

// The dimension of the hidden layer is $m$ and the dimension of the input space is $n$.

1. Recursive hyperplane projection to gain lines. The algorithm considers $m$ hyperplanes (the $i$-th hyperplane is defined by $W(i,:)\,\mathbf{x} = \mathbf{b}(i)$, where $W$ denotes the weight matrix between the input and hidden layer and $\mathbf{b}$ is the bias vector) in the $n$-dimensional input space. For each hidden unit hyperplane: project all other $m - 1$ hyperplanes onto it. Repeat this recursively until dimension one.

2. Boundary test along each projected hyperplane line. Along a line the points where an output unit can change are at the intersections with the remaining $m - 1$ lines. A boundary is a location in the input space where an output value transition occurs. An intersection of hyperplanes is a boundary if each hyperplane of the intersection has at least one face in the vicinity of the intersection that is a boundary, i.e. the corresponding hidden unit's value changes and this results in a change of the output value. Line segments are intersections of $n - 1$ hyperplanes and corners are intersections of $n$ hyperplanes. Hence, the tests whether a line segment is a boundary and whether a corner is a vertex are identical. To perform the boundary test the line is traversed. In a forward pass we traverse the line from the leftmost to the rightmost point and scan for hyperplane intersections in this direction. The partial sum of the output unit weights is added and assigned to the passed line segment. The backward phase is identical, but starts from the rightmost point and moves to the leftmost point, adding its sum to the value of the forward phase. During the backward phase a boundary test is performed.
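To make the generative phase more concrete, the following small Python sketch (our own illustration, not the implementation of [MP00]; function and variable names are hypothetical) enumerates candidate vertices for a threshold network with a two-dimensional input: every pair of first-layer hyperplanes is intersected and only intersection points inside the input box are kept. The boundary test of the second phase is omitted.

import itertools
import numpy as np

def candidate_vertices_2d(W, b, lo, hi, eps=1e-9):
    """Generative phase of a DIBA-like sketch for 2-D inputs.

    Each hidden unit i defines the hyperplane W[i] @ x = b[i].  Candidate
    vertices of the decision regions are intersections of two such lines
    that lie inside the input box [lo, hi]^2.
    """
    vertices = []
    for i, j in itertools.combinations(range(W.shape[0]), 2):
        A = np.vstack([W[i], W[j]])
        if abs(np.linalg.det(A)) < eps:          # parallel hyperplanes
            continue
        x = np.linalg.solve(A, np.array([b[i], b[j]]))
        if np.all(x >= lo - eps) and np.all(x <= hi + eps):
            vertices.append(x)
    return np.array(vertices)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))                  # first weight layer
    b = rng.normal(size=4)                       # hidden unit thresholds
    print(candidate_vertices_2d(W, b, lo=-1.0, hi=1.0))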


In [MP00] it is mentioned how to extend the algorithm for additional hidden layers

and multiple outputs. Additional layers do not change the possible locations of the

decision regions, so the first layer defines the possible decision regions. For multiple

outputs the algorithm has to be extended, such that the transition of a specific combina-

tion of output units is analysed instead of a single output unit change. This algorithm

is useful for extracting representations of threshold-based neural networks.

Discussion of DIBA

• Portability: Restriction to threshold activation functions.

• Fidelity: In the case of threshold activation functions an exact polyhedral representation of the neural network behaviour can be provided. However, for sigmoidal neural networks the transfer function is replaced with threshold activation functions. This is usually less accurate than piece-wise linear approximations of the sigmoidal function.

• Comprehensibility: Decision regions are polyhedra expressed through vertices and lines. Hence, the comprehensibility depends on the dimension of the input space.

• Validation Capability: For threshold activation functions, the method calculates exact decision regions in the input space. This is a guarantee that any point within a specific decision region produces the same output.

• Algorithmic complexity: Exponential time complexity. The method has exponential time complexity with increasing dimension of the input space. As explained by Melnik and Pollack [MP00] the complexity stems from the traversal of the hyperplane arrangement in the first layer of hidden units. This is of complexity $O(m^n)$, where $m$ is the number of neurons in the first hidden layer (number of hyperplanes) and $n$ is the number of input signals. Testing if a vertex is a corner again has exponential time complexity ($O(m^n)$). DIBA has an exponential space complexity with increasing dimensionality of the input space, as the method relies on corners to describe the decision regions.


• Usability for function approximation and classification tasks: The proposed algorithm is applicable for classification problems.

Propagation of regions

In the following, methods which rely on the propagation of regions through a feed-forward neural network are discussed. These methods are usable for feed-forward neural networks with invertible and continuous transfer functions. An initial hypothesis is defined

in the input and/or the output space of the neural network. These methods refine this

knowledge by propagating the initial constraints through all layers of the feed-forward

neural network. There are two well known works in this field: Validity Interval Anal-

ysis (VIA) [Thr93] and the backpropagation of polyhedra method [Mai98]. As the

name implies the latter method requires an initial hypothesis for the output space and

refines the corresponding knowledge in the input space with a single back-propagation

phase. In the following paragraphs we explain the validity interval analysis and the

backpropagation of polyhedra algorithm. To explain these algorithms we rely on the

following terminology:

1. Transfer function phase: Within the transfer function phase a polyhedron is

propagated through a vector of scalar transfer functions.

2. Affine transformation phase: This term is used for the propagation of a polyhedron through the linear weight layer of a neural network.

3. $y$-space: Denotes the output space of a weight or transfer function layer within the neural network or the overall neural network output space.

4. $x$-space: Defines the input space of a weight or transfer function layer within the neural network or the overall neural network input space.

Interval Propagation

In the literature several interval propagation methods have been published, for example by Palade ([PNP00] and [PNJ01]). The main idea of these algorithms is to extract and validate axis-parallel rules. The initial algorithm, named Validity Interval Analysis (VIA), was developed by Thrun [Thr93].


VIA

Validity Interval Analysis (VIA) [Thr93], [Thr90] is a generic tool for analysing the

behaviour of feed-forward neural networks. VIA describes the input-output mapping

using axis parallel hypercubes. The only assumption is that the non-linear transfer

functions of the neurons are continuous and monotonic. The algorithm is based on so

called validity intervals, that is, each neuron is initially constrained with a maximum

interval of valid output. The Cartesian product of the validity intervals defines a hy-

percube. VIA iteratively refines these intervals and is able to detect inconsistencies

by forward and backward propagation of intervals through all layers of a feed-forward

neural network. This refinement process is repeated until one of the following criteria

is fulfilled:

• Consistent convergence: the validity intervals converge to non-empty intervals, i.e. there is no significant change, or no change at all, when hypercubes are further propagated forward and backward through the neural network.

• Contradiction: an empty interval is found, i.e. the lower bound exceeds the upper bound of the interval. The initial intervals are inconsistent with the weights and biases of the neural network. Consequently, VIA can also be used to verify a given axis-parallel rule set.

The algorithm description and examples (including the MONK benchmark problem)

are available in [Thr93]. As explained in the introduction VIA is an interesting ap-

proach to region-based analysis methods, because annotating each layer with valid

axis-parallel expressions draws an interesting analogy to software verification.

In the following we use the box function, as introduced in [Mai00a], to describe the VIA algorithm in a compact form. The box function calculates the smallest axis-parallel hypercube containing a region $P$, which can be found by linear programming

methods, such as the Simplex algorithm [RR89] or interior point methods [Mat00b].
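A minimal sketch of such a box computation with a standard LP solver (SciPy's linprog; the helper name box is ours) is given below: every coordinate is minimised and maximised over the polyhedron {x | Ax <= b}, which is assumed to be non-empty and bounded.

import numpy as np
from scipy.optimize import linprog

def box(A, b):
    """Smallest axis-parallel hypercube containing {x | A x <= b}."""
    n = A.shape[1]
    lower, upper = np.empty(n), np.empty(n)
    for j in range(n):
        c = np.zeros(n); c[j] = 1.0
        lower[j] = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * n).fun
        upper[j] = -linprog(-c, A_ub=A, b_ub=b, bounds=[(None, None)] * n).fun
    return lower, upper

if __name__ == "__main__":
    # triangle x >= 0, y >= 0, x + y <= 1
    A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
    b = np.array([0.0, 0.0, 1.0])
    print(box(A, b))   # -> (array([0., 0.]), array([1., 1.]))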

As depicted in Figure 2.9, $B_{act(S)}$ denotes, with validity intervals, the hypercube defining the possible activity values of the subsequent layer. Similarly, $B_{net(S)}$ is the valid hypercube for the net input layer. Finally, $B_{act(P)}$ denotes the valid hypercube for the activity of the previous layer.



Figure 2.9: The neurons of the preceding layer P are connected through the weight layer to those of the subsequent layer S. Every neuron output is denoted with a validity interval. The Cartesian product of these validity intervals defines a hypercube. The intervals $[a'_i, b'_i]$ of the net input and the intervals $[a_i, b_i]$ of the activation of the neurons of layer S are connected through the bijective transfer function $\sigma$.

To explain the forward and backward phase of the algorithm, the notation is simplified, since we assume, without loss of generality, that the bias vector $\mathbf{b}$ is the null vector. A non-zero bias value can be simulated by defining an additional input neuron with a constant activation value of one. The value of the bias is assigned to the weighted connection of this corresponding neuron.

VIA

Input: neural network, initial restriction of input and/or output regions

Output: annotated neural network

Forward Phase. Refining the upper and lower bounds of the validity intervals of the subsequent layer:

$B_{net(S)}^{new} = box\big(\sigma^{-1}(B_{act(S)}) \cap W\,B_{act(P)}\big), \qquad B_{act(S)}^{new} = \sigma\big(B_{net(S)}^{new}\big)$

Backward Phase. Refinement of the output intervals of the preceding layer:

$B_{act(P)}^{new} = box\big(B_{act(P)} \cap W^{-1}(B_{net(S)})\big)$, where $W^{-1}(\mathbf{y}) = \{\mathbf{x} \mid W\mathbf{x} = \mathbf{y}\}$
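A compact sketch of one forward/backward refinement step for a single weight layer is given below (our own simplification, assuming a logistic transfer function, a zero bias vector and activation intervals strictly inside (0, 1)): the forward pass tightens the net-input intervals by interval arithmetic, the backward pass tightens the input intervals with one linear program per coordinate.

import numpy as np
from scipy.optimize import linprog

def sigmoid_inv(y):
    return np.log(y / (1.0 - y))

def via_step(W, in_lo, in_hi, act_lo, act_hi):
    """One VIA-style refinement step for y = sigmoid(W x) with zero bias."""
    # Forward phase: interval arithmetic bounds on (W x)_j over the input
    # box, intersected with sigma^{-1} of the current activation intervals.
    pos, neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    net_lo = np.maximum(pos @ in_lo + neg @ in_hi, sigmoid_inv(act_lo))
    net_hi = np.minimum(pos @ in_hi + neg @ in_lo, sigmoid_inv(act_hi))
    act_lo, act_hi = 1 / (1 + np.exp(-net_lo)), 1 / (1 + np.exp(-net_hi))

    # Backward phase: per input coordinate, minimise/maximise x_k subject
    # to W x lying in the refined net-input box and x in the input box.
    A_ub = np.vstack([W, -W])
    b_ub = np.concatenate([net_hi, -net_lo])
    bounds = list(zip(in_lo, in_hi))
    new_lo, new_hi = np.array(in_lo, float), np.array(in_hi, float)
    for k in range(len(in_lo)):
        c = np.zeros(len(in_lo)); c[k] = 1.0
        r_min = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        r_max = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        if r_min.success and r_max.success:
            new_lo[k], new_hi[k] = r_min.fun, -r_max.fun
    return new_lo, new_hi, act_lo, act_hi

An empty interval (lower bound above upper bound) would signal a contradiction in the sense described above.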


Discussion of VIA

• Portability: VIA can be applied to any feed-forward neural network with continuous and invertible transfer functions.

• Fidelity: Often VIA does not refine very well. In this case the fidelity is not satisfactory. One of the reasons is that the propagation of an axis-parallel hypercube through the weight layer results in a polyhedron which is not necessarily an axis-parallel hypercube. VIA computes the box of the intersection between this polyhedron and the hypercube defined through the intervals of $B_{net(S)}$. This often results in rough approximations. For example, in higher dimensions the wrapping box of a flat polyhedron is a very rough approximation and as a consequence the method does not refine very well. The reader is also referred to Chapter 7, where we provide some examples.

• Comprehensibility: The extracted rules are axis-parallel, thus they are easily understandable.

• Validation Capability: VIA is, to our knowledge, the first algorithm able to validate some properties of the function computed by the neural network. Given some initial hypothesis defined as the hypercube $B_y$ in the output space of the neural network, after backpropagating $B_y$ and computing the box $B_x$ the following valid statements are obtained:

if $\mathbf{y} \in B_y$ then $\mathbf{x} \in B_x$; if $\mathbf{x} \notin B_x$ then $\mathbf{y} \notin B_y$.

Furthermore, VIA could be used to refine and validate existing axis-parallel rules.

• Algorithmic complexity: The time complexity depends on the rule refinement process. Maire [Mai00a] proved that VIA always converges in one run (iteration) for single-weight-layer neural networks. However, for multilayer networks no formula for the termination of the rule refinement process has been found yet. As reported in [Mai00a], experimental results showed that, on average,


the rate of box refinement decreases exponentially. In other words, a rule gets

mainly refined within the first steps of the refinement process.

• Usability for function approximation and classification tasks: The proposed algorithm is applicable for classification problems and for function approximation tasks.

Backpropagation of Polyhedra

The possible regions which can be analysed by VIA are decisively limited due to the

fact that VIA is restricted to axis-parallel hypercubes. A good alternative to axis-

parallel hypercubes for the analysis of neural networks are polyhedra. Finite unions of

polyhedra approximate arbitrary regions quite well. Hence, it is possible to compactly

describe even complex input-output relations. Additionally, polyhedra are closed un-

der affine transformations. As a result the propagation of polyhedra through the weight

layer is exact. The notion of backpropagating finite unions of polyhedra through feed-

forward neural networks was introduced by Maire [Mai98]. Starting with a (user-

defined) set of unions of polyhedra at the output layer the inverse regions are calcu-

lated (again using a polyhedral approximation). In a feed-forward neural network we

can distinguish two different phases, namely the affine transformation phase and the

transfer function phase. Consequently the algorithm consists of these two phases. In

[Mai98] a formula to backpropagate a polyhedron through the linear weight layer is de-

scribed. The computationally expensive part of this approach is to remove redundant

inequalities. The following algorithm explains how to backpropagate a polyhedron


through an affine transformation. 3

Backpropagation of Polyhedra: affine transformation phase

Input: $P_y$, $W$, $\mathbf{b}$

Output: $P_x$

Backpropagation of the polyhedron $P_y = \{\mathbf{y} \mid A\mathbf{y} \leq \mathbf{c}\}$ through the weight layer $\mathbf{y} = f(\mathbf{x}) = W\mathbf{x} + \mathbf{b}$:

$\mathbf{x} \in f^{-1}(P_y)$ if and only if $A(W\mathbf{x} + \mathbf{b}) \leq \mathbf{c}$, therefore $P_x = \{\mathbf{x} \mid AW\mathbf{x} \leq \mathbf{c} - A\mathbf{b}\}$;

remove redundant inequalities in $P_x$.
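The affine transformation phase translates almost directly into code; a small sketch (helper name hypothetical):

import numpy as np

def backprop_affine(A, c, W, b):
    """Preimage of P_y = {y | A y <= c} under y = W x + b.

    Substituting the map gives A(W x + b) <= c, i.e. (A W) x <= c - A b.
    Redundant inequalities are not removed here.
    """
    return A @ W, c - A @ b

if __name__ == "__main__":
    A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    c = np.ones(4)                               # unit box in y-space
    W = np.array([[2.0, 0.0], [0.0, 0.5]])
    b = np.array([0.5, 0.0])
    print(backprop_affine(A, c, W, b))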

In order to backpropagate a polyhedron through the nonlinear transfer function layer

the sigmoid function is approximated by piece-wise linear functions. The piece-wise

linear approximation of a sigmoidal function results in axis-parallel splits of the orig-

inal polyhedron in $y$-space. As depicted in Figure 2.10 (for a two-dimensional case) the polyhedron is subdivided into cells. The $i$-th cell is a hypercube and contains the sub-polyhedron $Q_i$. The polyhedron $P_y$ can be expressed as the union of these sub-polyhedra, i.e. $P_y = \bigcup_i Q_i$. With each cell a different affine transformation is associated. The diagonal matrix $D_i$ and the vector $\boldsymbol{\delta}_i$ define the affine transformation for the $i$-th cell. In an $n$-dimensional space the diagonal matrix $D_i$ is an $n$ by $n$ matrix and the $j$-th entry on the diagonal corresponds to the slope of the piece-wise linear approximation for the $j$-th component. The vector $\boldsymbol{\delta}_i$ is an $n$-dimensional vector and the $j$-th entry represents the constant of the linear approximation for the $j$-th component.

The algorithm to backpropagate a polyhedron through a non-linear transformation is

sketched on the next page.

3A more detailed description of the algorithm and approaches to remove redundant inequalities is

given in Chapter 5.


Backpropagation of Polyhedra: transfer function phase

Input: $P_y$, $\sigma$

Output: $P_x$

Transfer function phase. Backpropagation of a polyhedron $P_y = \{\mathbf{y} \mid A\mathbf{y} \leq \mathbf{b}\}$ through a vector of sigmoidal transfer functions.

$B = box(P_y)$
for each $i = 1, \ldots, m$   // $m$ = dimension of layer S
    $\varphi_i$ = piece-wise linear approximation of $\sigma$ on the interval $[B(i,\min), B(i,\max)]$, i.e. build subdivisions of $[B(i,\min), B(i,\max)]$
Partition $P_y$ according to the subdivisions of the intervals $[B(i,\min), B(i,\max)]$
for each sub-polyhedron $Q_i$ of a cell
    $\varphi_i(\mathbf{x}) = D_i\mathbf{x} + \boldsymbol{\delta}_i$, where $D_i$ is a diagonal matrix
    $P_{x,i} = \varphi_i^{-1}(Q_i) = \{\mathbf{x} \mid A_i D_i\mathbf{x} \leq \mathbf{b}_i - A_i\boldsymbol{\delta}_i\}$
$P_x = P_{x,1} \cup \ldots \cup P_{x,k}$, where $k$ is the number of sub-polyhedra.
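A much simplified sketch of this phase is given below (our own illustration, assuming the box of P_y lies strictly inside (0, 1)^n): each dimension of the box is cut into a fixed number of pieces, and on every resulting cell the sigmoid is replaced by the secant line between the cell's end points.

import itertools
import numpy as np

def sigmoid_inv(y):
    return np.log(y / (1.0 - y))

def backprop_transfer(A, b, y_lo, y_hi, pieces=2):
    """Backpropagate P_y = {y | A y <= b} through componentwise sigmoids.

    Returns one inequality system (A_x, b_x) per cell of the subdivision;
    the union of these sub-polyhedra approximates the reciprocal image, and
    the quality depends on the number of pieces per dimension.
    """
    n = len(y_lo)
    grids = [np.linspace(y_lo[j], y_hi[j], pieces + 1) for j in range(n)]
    sub_polyhedra = []
    for cell in itertools.product(range(pieces), repeat=n):
        ylo = np.array([grids[j][cell[j]] for j in range(n)])
        yhi = np.array([grids[j][cell[j] + 1] for j in range(n)])
        xlo, xhi = sigmoid_inv(ylo), sigmoid_inv(yhi)
        d = (yhi - ylo) / (xhi - xlo)            # secant slopes (diag of D_i)
        delta = ylo - d * xlo                    # intercepts delta_i
        # A(D x + delta) <= b, plus the cell bounds expressed in x-space
        A_x = np.vstack([A * d, np.eye(n), -np.eye(n)])
        b_x = np.concatenate([b - A @ delta, xhi, -xlo])
        sub_polyhedra.append((A_x, b_x))
    return sub_polyhedra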

Discussion of Backpropagation by Polyhedra

• Portability: Generally usable for feed-forward neural networks.

• Fidelity: The method extracts correct rules for linear transfer functions. For sigmoidal transfer functions, the fidelity decreases with a decreasing number of piece-wise linear functions used in approximating the sigmoid function.

• Comprehensibility: Together with visualization tools, the extracted polyhedral rules are quite informative for small neural networks. The piece-wise linear approximation results in an exponential increase of the number of sub-polyhedra.



Figure 2.10: The approach of piece-wise linear approximation of the sigmoid function results in axis-parallel splits of the polyhedron. In $y$-space we see the split of the polyhedron into different cells. For each cell a different approximation of the sigmoid function is required, i.e. $\varphi_i(\mathbf{x}) = D_i\mathbf{x} + \boldsymbol{\delta}_i$. To backpropagate a sub-polyhedron of cell $i$ the linear approximation function $\varphi_i$ is applied. We plotted a possible reciprocal image for the $x$-space. We can see that it would be feasible to merge the polyhedra defined by $\varphi^{-1}(Q_1)$ and $\varphi^{-1}(Q_2)$, likewise the polyhedra $\varphi^{-1}(Q_3)$ and $\varphi^{-1}(Q_4)$.

As a consequence a lot of polyhedral rules are required for the higher dimen-

sional case, which in turn makes the rule set less understandable.

• Validation Capability: The proposed method is capable of proving some properties about the function learned by the neural network. With an overestimation of the sigmoid function the polyhedral approximation would contain the true reciprocal image. Therefore, after termination, the method can assure the following property about the function computed by the neural network:

if $\mathbf{y} \in P_y$ then $\mathbf{x} \in \bigcup_i P_{x,i}$,

where $P_{x,i}$ denotes the $i$-th sub-polyhedron. Similarly, by assuming an overestimation of the sigmoid function, the method can guarantee that the neural network behaves according to the following rule:

if $\mathbf{x} \notin \bigcup_i P_{x,i}$ then $\mathbf{y} \notin P_y$.

• Algorithmic complexity: Exponential time complexity. The computation time

increases exponentially with the number of neurons. The method of approxi-


mating a non-linear transfer function by a piece-wise linear function results in

axis-parallel splits of the polyhedron according to the partitioning of $\sigma$. As this partitioning occurs on every axis, the number of cells and therefore the number of splits increases exponentially. An additional problem which arises with this method is the merging of polyhedra once backpropagated through the piece-

wise linear function.

• Usability for function approximation and classification tasks: The proposed algorithm is applicable for classification problems and for function approximation tasks.

2.3 Overview of Discussed Neural Network Validation Tech-

niques and Validity Polyhedral Analysis

So far, several algorithms to analyse neural networks have been introduced. In this

section we provide a classification of the discussed methods and suggest some desir-

able properties for neural network validation methods. Finally, we motivate and justify

our approach to validate the behaviour of feed-forward neural networks.

The literature overview covered the following validation or rule extraction techniques:

• Propositional rule extraction: KT, M-of-N

• Fuzzy rule extraction: Linguistic rule extraction, FuzzyTrepan, REX

• Region-based methods: REFANN, Cluster-REFA, Interval Propagation (VIA), Backpropagation of Polyhedra, DIBA

We could derive properties for the different classes (propositional, fuzzy, region-

based) of discussed neural network validation techniques. Table 2.1 summarizes these

results. The first column defines the validation class. The columns white box and

black box indicate whether decompositional (white box) or pedagogical techniques

are available within the given class. The attribute general indicates if decompositional

approaches have no or minor assumptions on the feed-forward neural network archi-

tecture. With the columns classification and FA (function approximation), we indicate


if methods are usable for classification or function approximation tasks. Finally, the

column denoted with VC (validation capability) specifies if there are approaches in a

class, which are suitable to prove properties about the neural network behaviour.

RE - Class White-box Black-box General Classification FA VC

Prop. RE yes yes no yes no no

Fuzzy RE yes yes yes yes no no

Region based yes no yes yes yes yes

Table 2.1: Overview of neural network validation techniques.

The algorithm developed within this thesis is named Validity Polyhedral Analysis

(VPA). VPA is a decompositional approach (white box), usable for classification prob-

lems as well as for function approximation problems. It is a very general method,

because the only assumption is that the transfer-function is monotonic, continuous and

invertible.

As motivated in Section 1.4 VPA is a method which computes valid polyhedral region

mappings by forward- and backward propagating finite unions of polyhedra through

all layers of a feed-forward neural network.

The development of VPA is motivated by our desire to verify some properties about

the function performed by a neural network. Thus, we are interested in the most re-

fined, informative, region-based rules. VPA achieves this goal by annotating each layer

of a feed-forward neural network with valid linear inequality predicates, which are

geometrically finite unions of polyhedra. Polyhedra are closed under affine transfor-

mations. Hence, polyhedra are a good choice to represent the knowledge embedded

in a neural network. Most region-based analysis methods scale exponentially with in-

creasing neural network size (REFANN, DIBA, Backpropagation of Polyhedra). For

the development of our algorithm one important requirement was to assure that the

algorithm also scales with higher dimension. Additionally, it was important to assure

that any approximation contains the true reciprocal image, and hence, to obtain valid

statements about the neural network behaviour.


Summary and Outlook to the next Chapters

This chapter also contained some new aspects, which are summarized below:

Contributions - Chapter 2 -

• Classification of neural network analysis techniques into: propositional rule extraction, fuzzy rule extraction and region-based analysis.

• Introduction of the new attribute “validation capability” to characterize properties of neural network validation methods.

• Modification of REFANN [SLZ02] to obtain valid rules of the form: if $\mathbf{x} \in P_i$ then $y \in [y_{\min}, y_{\max}]$, where $P_i$ is a polyhedron in the input space of the neural network.

In the following chapters we will describe a way to compute refined polyhedral

pre- and postconditions of feed-forward neural networks and therefore how to obtain

an annotated version of the neural network.

The first difficulty is to approximate the image or reciprocal image of a polyhedron under a non-linear transformation, which is a non-linear region. We started by analysing the structure of the manifolds of such regions, followed the idea of approximating a non-linear region via a set of finite unions of polyhedra, and this finally led us to the interesting field of non-linear optimization.

The next difficulty was to compute the image and reciprocal image of a polyhedron under an affine transformation. This question led us to the fascinating topic of polyhedral projections. Following a proper software engineering approach, we con-

structed an abstract framework for iterative refinement algorithms and embedded VIA

and VPA as possible instances. This implementation was used to evaluate our ideas.


Chapter 3

Polyhedra and Deformations of

Polyhedral Facets under Sigmoidal

Transformations

Polyhedra and their geometrical properties are important for a wide range of problem

domains, such as (linear) optimization, geometry and parallelizing compilers. We use

polyhedra to describe the behaviour of feed-forward neural networks. This chapter

explains basic concepts of polyhedral computations. More sophisticated polyhedral

operations like the removal of redundant inequalities and the projection of a polyhe-

dron onto a lower dimensional subspace will be discussed in chapter 5.

In section 3.3 we consider the problem of the forward and backward-propagation

of a polyhedron through the transfer-function layer of a neural network. Addition-

ally, we describe approaches to analyse the manifold of the image of a hyperplane $H = \{\mathbf{y} \mid \mathbf{a}^\top\mathbf{y} = b\}$ under a non-linear transformation.

3.1 Polyhedra and their Representation

Our summary relied mainly on the work by Wilde [Wil97] and Fukuda [Fuk00]. The

books by Schrijver “Theory of Linear and Integer Programming” [Sch90] and Ziegler

“Lectures on polytopes” [Zie94] are advanced sources for polyhedral computation.

In the sequel, we consider polyhedra in the $n$-dimensional Euclidean space. Before we


define a polyhedron and a polyhedral cone, we recall the definitions for linear, non-

negative, affine and convex combinations of two vectors.

Definition 3.1 Linear, non-negative, affine and convex combination of vectors

Let $\mathbf{x}_i$ be vectors and $\lambda_i$ scalars. Then:

• $\sum_i \lambda_i \mathbf{x}_i$ is called a linear combination,

• $\sum_i \lambda_i \mathbf{x}_i$ with $\lambda_i \geq 0$ is called a non-negative combination,

• $\sum_i \lambda_i \mathbf{x}_i$ with $\sum_i \lambda_i = 1$ is called an affine combination,

• $\sum_i \lambda_i \mathbf{x}_i$ with $\sum_i \lambda_i = 1$ and $\lambda_i \geq 0$ is called a convex combination.

Figure 3.1 shows the geometric interpretation of different combinations of two vectors

in the two dimensional space.

Figure 3.1: Combination of two vectors in a two-dimensional space: from left to right:

linear, non-negative, affine and convex combination.

Definition 3.2 Vertex

A vertex of a polyhedron $P$ is a point in $P$ which cannot be expressed as a convex combination of other points in $P$.

Definition 3.3 Ray

A vector $\mathbf{r}$ is a ray of $P$ if for any $\mathbf{x} \in P$ also $\mathbf{x} + \mu\mathbf{r} \in P$ for all $\mu \geq 0$. In other words a ray defines a direction in which $P$ is open (infinite).


Definition 3.4 Extreme Ray

A ray is called an extreme ray of a polyhedron $P$ if and only if it cannot be expressed as a positive combination of two distinct rays of $P$.

Definition 3.5 Polyhedron

A polyhedron $P$ is a subset of $\mathbb{R}^n$ defined as the intersection of a finite number of half-spaces. Analytically this can be described as a set of linear inequalities.

Definition 3.6 Dual representation of a Polyhedron

A polyhedron can be represented in the implicit form as a set of linear inequalities $P = \{\mathbf{x} \mid A\mathbf{x} \leq \mathbf{b}\}$. For every polyhedron $P$ there exists an equivalent parametric representation (also known as Minkowski characterization) in terms of a linear combination of lines, a convex combination of vertices, and a positive combination of extreme rays: $P = \{\mathbf{x} \in \mathbb{R}^n \mid \mathbf{x} = L\boldsymbol{\lambda} + R\boldsymbol{\mu} + V\boldsymbol{\nu},\ \boldsymbol{\mu} \geq \mathbf{0},\ \boldsymbol{\nu} \geq \mathbf{0},\ \sum_i \nu_i = 1\}$, where $L$, $R$, $V$ are matrices whose columns represent the lines, extreme rays and vertices, respectively.

Definition 3.7 Line

A line of $P$ is a bidirectional ray, i.e. a vector $\mathbf{l}$ such that with $\mathbf{x} \in P$ also $\mathbf{x} + \mu\mathbf{l} \in P$ for all $\mu$.

Definition 3.8 Polyhedral Cone

A cone is the intersection of a finite number of closed halfspaces of the form $H_i^- = \{\mathbf{x} \mid A(i,:)\mathbf{x} \leq 0\}$. The implicit description of a cone is $C = \{\mathbf{x} \mid A\mathbf{x} \leq \mathbf{0}\}$. The parametric representation for a pointed cone is simply $C = \{\mathbf{x} \mid \mathbf{x} = R\boldsymbol{\mu},\ \boldsymbol{\mu} \geq \mathbf{0}\}$.

Definition 3.9 Convex Hull

The convex hull of a set of points is the set of all convex combinations of points in the set. It is the smallest convex set which contains all points of the set.

Definition 3.10 Polytope

A bounded subset $P \subset \mathbb{R}^n$ is a polytope iff it is the convex hull of a finite set of points, denoted with $V$. Alternatively, a polytope is a polyhedron without rays or lines.


Definition 3.11 Dimension of a Polytope

The dimension of a polytope $P$ is the dimension of the convex hull of the points included in $P$.

The dimension of a polytope is denoted with $\dim(P)$. A polyhedron in an $n$-dimensional space with $\dim(P) = n$ is said to be full-dimensional.

The expressions supporting hyperplane, proper face and boundary of a polyhedron

are borrowed from Rambau [Ram94].

Definition 3.12 Supporting Hyperplane

If the polyhedron $P$ is contained in the halfspace defined by $H^- = \{\mathbf{x} \mid A(i,:)\mathbf{x} \leq \mathbf{b}(i)\}$, with $P \cap H \neq \emptyset$, then the hyperplane $H$ is called a supporting hyperplane.

Definition 3.13 Face of a Polyhedron

The intersection of $P$ and a supporting hyperplane $H$ is a face of the polyhedron $P$.

Vertices are faces of dimension 0, edges are faces of dimension 1, and facets are faces of dimension $\dim(P) - 1$. A face with dimension $\dim(P)$ is the polyhedron $P$ itself.

Definition 3.14 Proper Faces

Faces between dimension 0 and dimension $\dim(P) - 1$ are called proper faces.

Definition 3.15 Boundary of a Polytope

The union of all proper faces of $P$ is the boundary of $P$, denoted with $\partial P$.

The definition of the analytic center of a polyhedron is important in machine learning. For example, in the paper by Malyscheff and Trafalis the computation of the analytic center is used to define a new learning algorithm [TM].

Definition 3.16 Analytic Center of a Polyhedron

The analytic center of a polyhedron is an approximation of the center of mass of the polyhedron, which is computed by maximizing the distance to all facets of the polyhedron.
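One common way to realise this numerically is to maximise the sum of the logarithmic slacks of the facets; the short sketch below (our own illustration, not part of the thesis implementation) does this with an off-the-shelf optimiser and requires a strictly interior starting point.

import numpy as np
from scipy.optimize import minimize

def analytic_center(A, b, x0):
    """Approximate analytic center of {x | A x <= b}."""
    def neg_log_barrier(x):
        slack = b - A @ x
        if np.any(slack <= 0):
            return np.inf                        # outside the polyhedron
        return -np.sum(np.log(slack))
    return minimize(neg_log_barrier, x0, method="Nelder-Mead").x

if __name__ == "__main__":
    # triangle x >= 0, y >= 0, x + y <= 1; analytic center is (1/3, 1/3)
    A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
    b = np.array([0.0, 0.0, 1.0])
    print(analytic_center(A, b, x0=np.array([0.2, 0.2])))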


We decided to represent a polyhedron as the intersection of a finite number of

half-spaces. This is a better representation with respect to the complexity than the

parametric representation, e.g. an $n$-cube can be described with a set of $2n$ inequalities whereas we need $2^n$ vertices to represent the same cube through its vertices. In general we write a polyhedron as the set of points $P = \{\mathbf{x} \mid A\mathbf{x} \leq \mathbf{b}\}$. Often, an index will

indicate whether the polyhedron belongs to the input space or output space.

3.2 Operations on Polyhedra and Important Properties

All polyhedra are convex. Let $P \subseteq \mathbb{R}^n$ and $Q \subseteq \mathbb{R}^n$ be two polyhedra.

Intersection of Polyhedra

The intersection of $P = \{\mathbf{x} \in \mathbb{R}^n \mid A\mathbf{x} \leq \mathbf{b}\}$ and $Q = \{\mathbf{x} \in \mathbb{R}^n \mid C\mathbf{x} \leq \mathbf{d}\}$ is computed by concatenating the inequalities of $Q$ to $P$, i.e. $R = \{\mathbf{x} \in \mathbb{R}^n \mid \begin{bmatrix} A \\ C \end{bmatrix}\mathbf{x} \leq \begin{bmatrix} \mathbf{b} \\ \mathbf{d} \end{bmatrix}\}$.
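In the half-space representation this concatenation is a one-liner; a small sketch:

import numpy as np

def intersect(A, b, C, d):
    """Intersection of {x | A x <= b} and {x | C x <= d}: stack the systems."""
    return np.vstack([A, C]), np.concatenate([b, d])

The result may contain redundant inequalities, as noted below.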

Union of Polyhedra

Let $P$ and $Q$ be two polyhedra. The union is defined as the set of points which are in $P$ or in $Q$. With the union of polyhedra it is also possible to describe concave sets.

Definition 3.17 Non-redundant Polyhedron

A polyhedron $P = \{\mathbf{x} \mid A\mathbf{x} \leq \mathbf{b}\}$ defined with the minimal number of inequalities is called non-redundant. An equivalent definition is that with the removal of any inequality $i$ of $P$ the polyhedron $P' = \{\mathbf{x} \mid A(j,:)\mathbf{x} \leq \mathbf{b}(j),\ j \neq i\}$ strictly contains $P$, i.e. $P \subsetneq P'$.

Often also the terminology facet-defined polyhedron is used when referring to a non-

representation. In Chapter 5 we will discuss how to detect redundant inequalities.

Definition 3.18 Affine Transformation

An affine transformation is a function of the form $f(\mathbf{x}) = \mathbf{y} = W\mathbf{x} + \mathbf{b}$, where $W$ is a matrix and $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{b}$ are vectors.


The following Lemmas hold for polyhedra.

Lemma 3.1 Closure under Intersection

The intersection of two polyhedra is a polyhedron.

Proof:

Let $P = \{\mathbf{x} \in \mathbb{R}^n \mid A\mathbf{x} \leq \mathbf{b}\}$ and $Q = \{\mathbf{x} \in \mathbb{R}^n \mid C\mathbf{x} \leq \mathbf{d}\}$ be two polyhedra. From the definition of the intersection of two polyhedra it follows directly that $R = \{\mathbf{x} \in \mathbb{R}^n \mid \begin{bmatrix} A \\ C \end{bmatrix}\mathbf{x} \leq \begin{bmatrix} \mathbf{b} \\ \mathbf{d} \end{bmatrix}\}$ is also a polyhedron, because it is a set of linear inequalities.

Lemma 3.2 Closure under Affine Transformations

The image of a polyhedron under an affine transformation is a polyhedron.

We assume for the following proof that the transformation matrix $W$ is invertible. As

we will explain in Chapter 5 the projection of a polyhedron onto a lower-dimensional

subspace is another polyhedron. With this property we can reduce the non-invertible

case to the invertible.

Proof:

Let $P = \{\mathbf{x} \in \mathbb{R}^n \mid A\mathbf{x} \leq \mathbf{b}\}$ be a polyhedron and $f(\mathbf{x}) = W\mathbf{x} + \mathbf{c}$ an affine transformation. The image of the polyhedron is $R = f(P)$. Hence $R = \{\mathbf{y} \mid \exists\,\mathbf{x} \in P:\ \mathbf{y} = W\mathbf{x} + \mathbf{c}\}$. Let the transformation matrix $W$ be invertible, then $R = \{\mathbf{y} \mid AW^{-1}\mathbf{y} \leq \mathbf{b} + AW^{-1}\mathbf{c}\}$, i.e. $R$ is described by a set of linear inequalities.
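Following the proof, for an invertible transformation matrix the image can be written down directly; a small sketch (helper name hypothetical):

import numpy as np

def affine_image(A, b, W, c):
    """Image of P = {x | A x <= b} under y = W x + c, for invertible W.

    With x = W^{-1}(y - c) the image is {y | A W^{-1} y <= b + A W^{-1} c}.
    """
    W_inv = np.linalg.inv(W)
    return A @ W_inv, b + A @ W_inv @ c

if __name__ == "__main__":
    A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    b = np.ones(4)                               # unit box
    W = np.array([[1.0, 1.0], [0.0, 1.0]])       # invertible shear
    c = np.array([0.0, 2.0])
    print(affine_image(A, b, W, c))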

3.3 Deformations of Polyhedral Facets under Sigmoidal Trans-

formations

Validity Polyhedral Analysis (VPA) requires the approximation of the image or reciprocal image of a polyhedron $P$ under a non-linear transfer function with a set of finite

unions of polyhedra. The problem of propagating a polyhedron through the non-linear


transfer function layer of a neural network is depicted in Figure 3.2. We restrict our

studies to continuous and invertible transfer functions. Typically, feed-forward neural

networks use sigmoidal functions for the non-linear transformation. As these functions

are bijections it is sufficient to analyse the back-propagation of a polyhedron through a

sigmoidal transfer function layer. The forward-propagation can be solved analogously.


Figure 3.2: The back-propagation of a polyhedron through the transfer function layer.

Given the polyhedral description $P_y = \{\mathbf{y} \mid A\mathbf{y} \leq \mathbf{b}\}$ in the output space of a transfer function layer, the reciprocal image of this polyhedron under the non-linear transfer function is given by $\tilde{P} = \sigma^{-1}(P_y) = \{\mathbf{x} \mid A\sigma(\mathbf{x}) \leq \mathbf{b}\}$.

In this and the next chapters the knowledge of the following definitions and terminology is necessary.

Definition 3.19 Non-linear region $\tilde{P}$

We use the expression non-linear region, and denote it with $\tilde{P}$, to express the image or reciprocal image of a polyhedron $P$ under a non-linear transformation.

Definition 3.20 Manifold

The surface of $\tilde{P}$ is a manifold.

Definition 3.21 Facet manifold

The image or reciprocal image of a facet of a polyhedron, under a nonlinear transformation, is referred to as facet manifold.

Definition 3.22 Polyhedral Approximation Algorithm

The expression polyhedral approximation algorithm (or just: approximation algorithm) is used to refer to algorithms which approximate the non-linear region $\tilde{P}$ with a set of finite unions of polyhedra.


Definition 3.23 Slice

The intersection of the region $\tilde{P}$ with an axis-parallel affine subspace defines a slice.

A $k$-dimensional slice of an $n$-dimensional region ($1 \leq k < n$) is obtained by keeping $n - k$ variables fixed.

Definition 3.24 Free Variables

The variables which are not kept fixed within a slice are called free variables.

Definition 3.25 Saddle Point

Points on the manifold where the surface switches from concave to convex (or vice versa) are referred to as saddle points.

Let $P = \{\mathbf{y} \mid A\mathbf{y} \leq \mathbf{b}\}$ be a facet-defined (non-redundant) polyhedron in $y$-space. The reciprocal image in $x$-space is $\tilde{P} = \sigma^{-1}(P) = \{\mathbf{x} \mid A\sigma(\mathbf{x}) \leq \mathbf{b}\}$. It is important to notice that $\sigma$ is a vector of scalar sigmoid functions (the vector $\sigma$ of scalar functions is not denoted by a bold letter, as this could cause confusion) and is applied to the vector $\mathbf{x} = (x(1), \ldots, x(n))$ component-wise, i.e. $\sigma(\mathbf{x})$ denotes the vector $(\sigma(x(1)), \ldots, \sigma(x(n)))$. A facet of the polyhedron $P$ under a component-wise transformation $\sigma$ can get stretched or squeezed, and convex or concave curvatures are possible.

An important observation is that the reciprocal images of axis-parallel facets in the $y$-space are axis-parallel facets in the $x$-space and vice versa. If we consider a single facet of a polyhedron (say the $i$-th facet), then we write $\mathbf{a}^\top\mathbf{y} = b$, where $\mathbf{a}^\top = A(i,:)$ and $b = \mathbf{b}(i)$. An axis-parallel hyperplane is characterized by $\mathbf{a} = (0, \ldots, 0, a, 0, \ldots, 0)$, i.e. just one component is non-zero. As we apply the transfer function component-wise, the image of an axis-parallel facet under a sigmoidal transformation is another axis-parallel facet. In other words an axis-parallel facet under the sigmoidal transformation can only get stretched or squeezed.

For the general case the analysis of the curvature behaviour in the neighbourhood of a point $\mathbf{x}$ on a facet manifold is used. In an $n$-dimensional space a manifold is of dimension $n - 1$. This allows curvatures in $n - 1$ directions. Knowledge about the type of a curvature, i.e. whether the curvature is convex or concave, and the strength of the


curvature is useful for any polyhedral approximation algorithm. Information on the

curvatures can be used to define splits for the polyhedron in $y$-space. For example, to obtain a precise approximation in areas of strong concave curvatures in the manifold of $\tilde{P}$, a polyhedral split is useful to refine the approximation. An example is depicted

in Figure 3.3.


Figure 3.3: The non-linear region $\tilde{P}$ is depicted on the left hand side. An approximation of $\tilde{P}$ with a single polyhedron is shown in the middle. In c) an improved approximation by using the union of two polyhedra is depicted.

The next three paragraphs explain three different methods to analyze the structure of the manifold of $\tilde{P}$.

Subdivision in cells

The first approach in analysing the manifold structure assumes a subdivision of the $x$-

sigmoid function is approximated through piece-wise linear functions. As a single cell

is arbitrarily small, we can view the sigmoidal function as linear within the cell. The

slope of the linear function is defined by the first derivative of the sigmoid function. For

the two dimensional case, the linear approximation associated with a cell is described

as follows:

â �����$�òÞu��� ß ß ß ,where Þ �õ÷ Ø % ¨¨ Ø ¥

øú and Ø | �"! � ���¯��NL�O�$�"! � ��! #&% ���$��NL�O�O�To analyze the curvature behaviour in a small neighbourhood the effect of a transition

from one cell to another on the reciprocal image has to be observed. Figure 3.4 illus-

trates this idea and Figure 3.5 depicts an example for a two dimensional space. The


curvature behaviour of the true reciprocal image of a hyperplane $H = \{\mathbf{y} \mid \mathbf{a}^\top\mathbf{y} = b\}$ is obviously dependent on the vector $\mathbf{a}$, the constant $b$ and the vector of transfer functions. Table 3.1 summarizes observations of switches between neighbouring cells. In higher dimensional cases we have to find $n - 1$ cells in the neighbourhood of a point. A criticism of this method is that it is just applicable within small neighbourhoods, does not scale well with higher dimensions and it does not consider the effect of the constant $b$.


Figure 3.4: The sigmoid function is approximated through a piece-wise linear function within every cell. A transition from cell $i$ to cell $i+1$ results, in the depicted situation, in a concave curvature in the $x$-space. The approximation for a cell is defined by $\varphi(\mathbf{x}) = D\mathbf{x} + \boldsymbol{\delta}$. A transition from cell $i$ to cell $i+1$ changes only the corresponding component in the diagonal matrix $D$ and in the vector $\boldsymbol{\delta}$, in our case $d_2$ and $\delta(2)$.


Figure 3.5: A hyperplane in $y$-space and its reciprocal image in $x$-space. With an arbitrarily small subdivision of the $x$-space, we can observe that cell transitions occur most often in the $x(2)$ direction. Hence the manifold is concave in the neighbourhood of the plotted point (see also Figure 3.4 and Table 3.1).


premises                          conclusion
$a_i > 0$ and $t_i > 0$           concave
$a_i < 0$ and $t_i > 0$           convex
$a_i > 0$ and $t_i < 0$           convex
$a_i < 0$ and $t_i < 0$           concave

Table 3.1: Concave and convex curvature in the neighborhood of a point.

Point sampling method

The point sampling method analyses the manifold structure by determining points on the manifold. This approach is shown in Figure 3.6. For any axis-parallel two dimensional slice of the region $\tilde{P}$ it is possible to find the intervals for the free variables. Additionally, we are able to find the inflection points in this slice². The connecting line seg-

ments provide information about stretching, squeezing and curvature behaviour. To

refine the information a divide and conquer strategy can be applied on the line seg-

ments.
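A small sketch of the idea for a two-dimensional slice is given below (our own illustration with a logistic sigmoid; the hyperplane coefficients in the example are arbitrary): points on the hyperplane a^T y = b with y in (0,1)^2 are sampled and mapped back to x-space, and the polyline through the x-points hints at stretching, squeezing and curvature.

import numpy as np

def sample_slice(a, b, num=7, eps=1e-3):
    """Sample the manifold a^T sigma(x) = b within a 2-D slice."""
    y1 = np.linspace(eps, 1.0 - eps, num)        # free variable y(1)
    y2 = (b - a[0] * y1) / a[1]                  # y(2) from the hyperplane
    mask = (y2 > eps) & (y2 < 1.0 - eps)         # keep points inside (0,1)^2
    y = np.column_stack([y1[mask], y2[mask]])
    x = np.log(y / (1.0 - y))                    # componentwise sigma^{-1}
    return y, x

if __name__ == "__main__":
    y_pts, x_pts = sample_slice(a=np.array([1.0, 1.0]), b=1.2)
    print(x_pts)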


Figure 3.6: A hyperplane in $y$-space and the true reciprocal image in $x$-space. The three points in $y$-space and the corresponding points in $x$-space indicate how a point sampling method can be used to find the curvature behaviour within a two dimensional slice.

²Section 4.2 explains the computation of the interval of the free variables of a two dimensional slice

and the computation of the inflection point.


Eigenvalue and Eigenvector analysis

The vector-based expansion of the Taylor formula is applied to compute an approximation for the manifold of $\tilde{P}$ in the neighborhood of a point $\mathbf{x}$. By computing the eigenvalues and eigenvectors in the tangent hyperplane at the point $\mathbf{x}$, the strength, type (convex or concave) and the direction of the deformation is deter-

mined. For the following computation a non-redundant description of the polyhedron

is assumed, i.e. any inequality corresponds to a facet.

I The N -th facet of a polyhedron � is the defined with the following equality:

– ê�»7�J��r , where: ê�»����J��NUMYKF� and r¡�"� ��NL�I The true reciprocal image of this polyhedral facet hyperplane is given by:

– ê » !���������rI We want to analyze the structure of the facet manifold, in the neighborhood of

the point � with ���"!������– The facet manifold ������� is: ���������ãê�»¼!������,QBr)�H¨– To get an approximation of the manifold in the neighbourhood of � , the

vector based Taylor formula can be applied:

�����������������������7����ô » �¼�u� �� �¼� » � ¥ ô����where �¼� denotes a direction in the tangent hyperplane starting at point � .

– � ¥ ô is the Hessian matrix. It is a diagonal matrix with the component-wise

second derivate on the diagonal.

� ¥ ô��õöööööö÷ê��Ã�>� ��� ºDÈ ê É� � ê ÈW% É ¨ Ó ¨¨ Ó ¨ ÓÓ ¨ Ó ¨¨ Ó ¨ ê¼�� �� � � º°È ê É� � ê È ©DÉ

øYùùùùùùú

– We can write �����·���������"¨��,¨«� %¥ ��� » � ¥ ô���� , since ���������"¨ and, be-

cause all directions in the tangent hyperplane are orthogonal to the gradient

vector �2ô , the dot product �2ô » ���,�"¨ . Hence, we get:

���������¼����� �� �¼� » ��¥/ô �¼� �Ã�>�


  – Generally, the gradient vector $\nabla F$ is not parallel to any of the basis vectors of the standard basis $B_0$. To obtain the basis of the tangent hyperplane we compute the $n - 1$ mutually orthogonal vectors to the gradient vector $\nabla F$. The standard technique to compute mutually orthogonal vectors for a given vector is the Gram-Schmidt decomposition (also known as QR-factorization [GVL89]). The $n - 1$ mutually orthogonal vectors, defining the tangent hyperplane, and the gradient vector $\nabla F$ build the basis $B_1$. We use the notation $T_{B_0 B_1}$, where the column vectors of $T_{B_0 B_1}$ are the basis vectors of $B_1$ expressed in basis $B_0$, to denote the change from basis $B_0$ to basis $B_1$.

• With $\Delta\mathbf{x}_{B_0} = T_{B_0 B_1}\,\Delta\mathbf{x}_{B_1}$ formula (1) becomes:

    $F(\mathbf{x}_{B_0} + \Delta\mathbf{x}_{B_0}) = F(T_{B_0 B_1}\mathbf{x}_{B_1} + T_{B_0 B_1}\Delta\mathbf{x}_{B_1}) \approx \frac{1}{2}\,\Delta\mathbf{x}_{B_1}^\top\, T_{B_0 B_1}^\top \nabla^2 F\; T_{B_0 B_1}\,\Delta\mathbf{x}_{B_1} \quad (2)$

  – the matrix $M = T_{B_0 B_1}^\top \nabla^2 F\; T_{B_0 B_1}$ is symmetric.

  – for a symmetric matrix it is always possible to find a matrix $T_{B_1 B_2}$, such that the matrix $M' = T_{B_1 B_2}^\top M\, T_{B_1 B_2}$ is a diagonal matrix [BS81]. The column vectors of the matrix $T_{B_1 B_2}$ are the eigenvectors of $M$ with respect to basis $B_1$. The diagonal values of the resulting matrix $M'$ are the eigenvalues of $M$ with respect to basis $B_1$. The steps to calculate $T_{B_1 B_2}$ are:

    1. calculate the eigenvalues $\lambda$ of $M$ with respect to basis $B_1$, i.e. calculate the roots of the characteristic polynomial given by the determinant $\det(M - \lambda I) = 0$.

    2. calculate the eigenvectors, i.e. solve $M\mathbf{v} = \lambda\mathbf{v}$.

  – with $\Delta\mathbf{x}_{B_1} = T_{B_1 B_2}\,\Delta\mathbf{x}_{B_2}$ formula (2) becomes:

    $F(T_{B_0 B_1}\mathbf{x}_{B_1} + T_{B_0 B_1}\Delta\mathbf{x}_{B_1}) = F(T_{B_0 B_1}T_{B_1 B_2}\mathbf{x}_{B_2} + T_{B_0 B_1}T_{B_1 B_2}\Delta\mathbf{x}_{B_2}) \approx \frac{1}{2}\,\Delta\mathbf{x}_{B_2}^\top\, T_{B_1 B_2}^\top M\, T_{B_1 B_2}\,\Delta\mathbf{x}_{B_2}$


• finally, for the approximation of $F$ in the neighbourhood of $\mathbf{x}_{B_0}$, we obtain the subsequent formula, with $M'$ representing the diagonal matrix containing the eigenvalues $\lambda_i$:

    $F(\mathbf{x}_{B_0} + \Delta\mathbf{x}_{B_0}) \approx \frac{1}{2}\,\Delta\mathbf{x}_{B_2}^\top M'\,\Delta\mathbf{x}_{B_2} = \frac{1}{2}\sum_{i=1}^{n-1}\lambda_i\,\Delta x_{i,B_2}^2$

To summarize the interpretation and the use of this result:

• the eigenvectors of $M$ with respect to basis $B_1$ tell us the directions of the curvature,

• the eigenvalues of $M$ with respect to basis $B_1$ provide the strength of the curvature in the direction of the corresponding eigenvector,

• $\lambda_i < 0$: concave curvature in the direction of the corresponding eigenvector,

• $\lambda_i > 0$: convex curvature in the direction of the corresponding eigenvector.
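The derivation above translates almost line by line into a small numerical sketch (a tanh/tansig transfer function and the sign convention of the summary are assumed; all names are ours): the tangent basis is obtained via a QR factorisation of the gradient, the projected Hessian is diagonalised, and the signs of the eigenvalues indicate concave or convex curvature directions.

import numpy as np

def facet_curvature(a, x):
    """Curvature analysis of the facet manifold F(x) = a^T tanh(x) - b = 0.

    Returns the eigenvalues and (in x-coordinates) the eigenvectors of the
    Hessian projected onto the tangent hyperplane at x; negative eigenvalues
    indicate concave, positive eigenvalues convex curvature directions.
    """
    t = np.tanh(x)
    grad = a * (1.0 - t**2)                        # nabla F, componentwise
    hess = np.diag(a * (-2.0 * t) * (1.0 - t**2))  # nabla^2 F, diagonal
    # QR factorisation of the gradient: columns 1..n-1 of Q span the
    # tangent hyperplane (they are orthogonal to grad).
    Q, _ = np.linalg.qr(grad.reshape(-1, 1), mode="complete")
    T = Q[:, 1:]
    M = T.T @ hess @ T                             # projected Hessian
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvals, T @ eigvecs

if __name__ == "__main__":
    vals, dirs = facet_curvature(a=np.array([1.0, -0.5, 2.0]),
                                 x=np.array([0.3, -0.1, 0.7]))
    print(vals)                                    # sign -> concave / convex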

Figures 3.7 to 3.9 are visualizations of the above result for a three-dimensional space. These figures show the deformation of a two dimensional polyhedral facet³ under a vector of three tansig functions. We also plotted the gradient and the eigenvectors at a point $\mathbf{x}$. The plots are from three different points of view.

³For the implementation in Matlab we used the QR-factorization to calculate a grid of points on the polyhedral facet and finally computed the corresponding points in $x$-space by using $\sigma^{-1}$.


Figure 3.7: The reciprocal image of a hyperplane in $y$-space. The dotted eigenvector

indicates the direction of the convex curvature, the dashed eigenvector points in the

direction of the concave curvature in the neighbourhood of the corresponding point

on the facet manifold. The length of the eigenvector depends on the strength of the

curvature. The gradient vector at the given point is plotted with a solid line.


Figure 3.8: The convex curvature of the manifold in the direction of the dotted eigen-

vector is clear. The length of the eigenvector represents the eigenvalue, i.e. expresses

the strength of the curvature in this direction. The eigenvector is underneath the mani-

fold.


Figure 3.9: The concave curvature in the direction of the dashed eigenvector is difficult

to see. We can see that the eigenvector points out of the manifold, i.e. above the

manifold.


3.4 Summary of this Chapter

This chapter first introduced basic concepts of polyhedra, which are relevant within

the scope of this thesis.

Next we introduced three different approaches to analyse the structure of the manifold of the non-linear region $\tilde{P}$. The cell subdivision approach is not generally applicable, because the constant $b$ is not included in the analysis, and the point sampling approach cannot guarantee to find the directions of the strongest concave or convex curvature for a given point on the manifold. By computing the eigenvectors and eigenvalues in the vicinity of a point $\mathbf{x}$ on the manifold, we obtain the directions of the strongest curvatures in $\tilde{P}$. This is the most informative method and helpful to obtain refined polyhedral approximations of the non-linear region $\tilde{P}$.

Contributions - Chapter 3 -

• The computation of the eigenvalues and the eigenvectors in the neighbourhood of a point $\mathbf{x}$ on the manifold of the region $\tilde{P}$ provides useful information about the deformations of polyhedral facets under sigmoidal transformations.


Chapter 4

Nonlinear Transformation Phase

The back-propagation of polyhedra technique [Mai98] can be improved by minimizing the number of subdivisions of a polyhedron which are necessary when the non-linear transfer function is approximated by piece-wise linear functions. Section 4.1 introduces a possible

method.

Another approach uses a direct approximation of the non-linear region $\tilde{P}$. This method leads to a non-linear optimization problem. Section 4.2 discusses different techniques to solve this non-linear optimization problem, namely: Sequential Quadratic Program-

ming (SQP), the repeated computation of the optimal solution within two-dimensional

slices, a branch and bound approach and finally a binary search method.

4.1 Mathematical Analysis of Non-Axis-parallel Splits of a

Polyhedron

The piece-wise linear approximation of the sigmoid function results in axis-parallel splits of the polyhedron [Mai98]. The number of splits increases exponentially with the number of neurons of a layer. Therefore methods based on piece-wise linear approximations of the sigmoid function have to minimize the number of splits in order to be applicable to higher dimensional cases. As discussed in Chapter 3, the analysis of the surface (manifold) of the true reciprocal image is helpful in reducing the number of splits. Given a polyhedron $P_y$ in $y$-space, we can compute for each of the facets the analytic center of mass and analyse the twist behaviour in the vicinity of these points.


This information can be applied to efficiently split a polyhedron into a small number of sub-polyhedra. A selection of sample points on $R$, together with a computation of the corresponding eigenvalues and eigenvectors, is used to develop heuristics to obtain "good" splits of the polyhedron. Figure 4.1 depicts a scenario in dimension two. In this example a splitting hyperplane is defined via two opposite points in which the region $R$ has a concave curvature.

This approach has several drawbacks. In general the method results in splits of the polyhedron that are not axis-parallel, and for each cell we have to determine a piece-wise linear approximation of the sigmoid function. It is important to notice that for neighbouring cells the approximation has to agree on the splitting hyperplane (see Figure 4.1). A difficulty is to obtain a reasonable approximation of the piece-wise linear functions $f_i$. Additionally, we did not investigate sophisticated techniques to obtain splitting hyperplanes in higher-dimensional cases. A sketch of an algorithm follows.

[Figure 4.1: three panels — the piece-wise linear approximation in $x$-space (cells $i$ and $i+1$ with affine maps $(D_i, \delta_i)$ and $(D_{i+1}, \delta_{i+1})$), the image in $y$-space, and the true reciprocal image $R$ in $x$-space.]

Figure 4.1: A scenario for a split of a polyhedron which is not parallel to an axis. The piece-wise linear approximation of the sigmoid function for each cell has to be determined, i.e. we have to find the matrix $D$ and the vector $\delta$ which define the affine approximation. Note that in general the matrix $D$ is not diagonal. For neighbouring cells the approximation has to agree on the separating hyperplane.


Non-axis-parallel split approach

1. Compute sample points on each facet of the polyhedron $P$, e.g. including the analytic center of mass points.

2. Compute the eigenvalues and the eigenvectors at the corresponding points on the surface of $R$.

3. Define a low number of splitting hyperplanes by using the information about the twist behaviour in the sample points. Split the polyhedron $P = \{x \mid Ax \le b\}$ by adding these hyperplanes.

4. Subdivide the space into cells according to the split of the polyhedron.

5. Determine the piece-wise linear approximation of the sigmoid function within every cell. For two neighbouring cells $i$ and $i+1$ we have to find the linear functions $f_i$ and $f_{i+1}$ with $f_i(x) = D_i x + \delta_i$ and $f_{i+1}(x) = D_{i+1} x + \delta_{i+1}$. These affine functions have to agree on the separating hyperplane $H = \{x \mid a^T x = r\}$. Therefore we have to solve the following non-linear optimization problem:

   $\min \|\sigma(x) - f_i(x)\|$ and $\min \|\sigma(x) - f_{i+1}(x)\|$  s. to  $\forall x \in H: f_i(x) = f_{i+1}(x)$

   Choosing a number of sample points and calculating $D_i$, $\delta_i$ according to a least mean square error defines $f_i$. We are now able to define $f_{i+1}$ by choosing sample points in the cell $C_{i+1}$ and determining $D_{i+1}$, $\delta_{i+1}$ under the constraint that the two functions agree on the separating hyperplane (see the sketch after this list).
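The least-mean-square fit of $D_i$ and $\delta_i$ in step 5 can be illustrated with a small numerical sketch. The following Python fragment is an illustration only, not the implementation referred to in this thesis; the cell bounds, the use of tanh as the sigmoid, and the omission of the agreement constraint on the separating hyperplane are simplifying assumptions.

import numpy as np

def fit_affine_sigmoid(samples, sigma=np.tanh):
    """Least-squares fit of sigma(x) ~ D x + delta over the given sample points.

    samples: (m, n) array of points x drawn from one cell.
    Returns the matrix D (n x n) and the offset vector delta (n,).
    """
    m, n = samples.shape
    targets = sigma(samples)                            # component-wise transfer function
    design = np.hstack([samples, np.ones((m, 1))])      # rows [x, 1]
    coeffs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    D, delta = coeffs[:n].T, coeffs[n]
    return D, delta

# Example: fit the approximation inside the (assumed) cell [-1, 0] x [0, 1].
rng = np.random.default_rng(0)
cell_samples = rng.uniform(low=[-1.0, 0.0], high=[0.0, 1.0], size=(200, 2))
D, delta = fit_affine_sigmoid(cell_samples)
print(D, delta)

Because the fit is an unconstrained least-squares problem over all components, the resulting matrix $D$ is in general not diagonal, which is exactly the side-effect discussed below.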

As indicated in the above algorithm, no heuristics for the definition of splitting hyperplanes were developed. Additionally, in higher dimensional cases an increasing number of neighbouring cells complicates the approximation of a piece-wise linear function $f_i$. Thus far we did not develop a satisfactory method for the approximation of $f_i$, also because a different approach which seemed more promising has been identified (see next subsection). A further disadvantage of the above algorithm is that, due to the non-axis-parallel split, the matrix $D$ of the affine function $f(x)$ is not necessarily diagonal. This leads to the undesired side-effect that axis-parallel facets of sub-polyhedra within a cell are not mapped to axis-parallel facets. For these reasons further analysis was not undertaken. The problem was reconsidered from another point of view: instead of approximating the sigmoid function with piece-wise linear functions, the polyhedral approximation starts from the non-linear region $R$ itself. This reduces to a non-linear optimization problem. The following approaches can be viewed as a polyhedral wrapping of the region $R$. The next section explains this view and in the following subsections our approaches are presented.

4.2 Mathematical Analysis of a Polyhedral Wrapping of a Region

We start this section with some definitions which we will use in our descriptions of methods to compute a polyhedron containing the region $R$.

Definition 4.1 Wrapping Polyhedron
A wrapping polyhedron is a polyhedron that is used to wrap the nonlinear region $R$ from outside. □

Definition 4.2 Nonlinear Optimization Problem
In the most general form a nonlinear programming problem is to maximize (or minimize) a function subject to some constraints:

   $\max f(x)$  s. to  $g(x) \le 0$, $h(x) = 0$,

where $f$, $g$ and $h$ are functions, with $f: \mathbb{R}^n \to \mathbb{R}$, $g: \mathbb{R}^n \to \mathbb{R}^m$ and $h: \mathbb{R}^n \to \mathbb{R}^p$. □

Special cases are linear and quadratic programs. In these cases the function $f$ is linear or quadratic and the constraint functions $g$ and $h$ are affine.


Definition 4.3 Cost-function
The function $f(x)$ which is to be minimized or maximized is also referred to as cost-function. □

Definition 4.4 Constraint Function
We also use the expression constraint functions, or simply constraints, to express that a function should be optimized under these conditions. □

The following approaches postpone the approximation process. Instead of a piece-wise linear approximation of the transfer function, we compute a polyhedron containing the true reciprocal image (see Figure 4.2 as an example). The direction vector $g$ of each hyperplane of this (wrapping) polyhedron has to be determined. For example, the gradient vector at a point on the manifold of the true reciprocal image has the nice property that we can find the exact solution in the linear case. Given a polyhedron $P_y = \{y \mid Ay \le b\}$ in the output space ($y$-space) of a transfer function layer, we want to approximate the reciprocal image $R = \{x \mid A\sigma(x) \le b\}$ with a finite union of polyhedra.

The reciprocal image of the analytic center of mass of a facet is an interesting point for the computation of a suitable direction vector. Figure 4.2 depicts the wrapping of the non-linear region using one polyhedron.

[Figure 4.2 panels (a) and (b): the true reciprocal image with gradient vectors $g_1$, $g_2$, $g_3$ and the corresponding wrapping hyperplanes $H_1$, $H_2$, $H_3$.]

Figure 4.2: In (a) we see the true reciprocal image of a polyhedron with respect to the transfer function. Given an interior point on each manifold of the true reciprocal image, we determine the gradient $g$ at this point (a). The gradient vector defines the optimization direction. This has the nice property that we always find the optimal solution for linear manifolds. In (b) we plotted the region and the polyhedral approximation, which can be viewed as the convex hull of the true reciprocal image.


The orientation of each hyperplane of the wrapping polyhedron is given by a vector $g$. It remains to find the optimal position for the hyperplane. Geometrically, the optimal solution for a hyperplane (i.e. the best position) is characterized through a point (or set of points) on the manifold of $R$ where the hyperplane is tangent to the manifold. Analytically this can be formulated as a non-linear optimization problem, and repeating this process for all hyperplanes defines a wrapping polyhedron $P_w$ containing the region $R$. A positive property of this method is that the calculation can easily be distributed among different CPUs (central processing units), according to the chosen number of hyperplane directions. In $x$-space the optimization problem is defined by:

   $\max g^T x$  subject to  $A\sigma(x) \le b$   (i)

This means we have to maximize a linear cost-function under non-linear constraints. The equivalent formulation of this problem in $y$-space is:

   $\max g^T \sigma^{-1}(y)$  subject to  $Ay \le b$   (ii)

In general, the above optimization problems are hard and cannot be solved exactly. The continuous optimization problem is hard in the sense that it is not possible to find a method which always guarantees to compute the global optimum in a reasonable time.

The optimization problems (i) and (ii) are hard problems.
Feed-forward neural networks are general function approximators [HSW89]. A non-linear constrained optimization problem is a hard problem. As feed-forward neural networks are general function approximators, the following problem must also be a hard problem ($g$ represents here the weight vector to a linear output node):

   $\max g^T \sigma(Wx)$  subject to  $x \in P$   (1)

Polyhedra are closed under affine transformations and with $z = Wx$, (1) becomes:

   $\max g^T \sigma(z)$  subject to  $z \in \tilde{P}$   (2)

In (2) we used $\tilde{P}$ to express the image of the polyhedron $P = \{x \mid Ax \le b\}$ under an affine transformation. By comparing the optimization problems (2) and (ii) it becomes clear that both problems are essentially the same (the use of the inverse function $\sigma^{-1}$ does not change the difficulty of the problem). Therefore: as (2) is a hard problem, also (i) and (ii) are hard problems. □


The next subsections describe different approaches to obtain an approximation for the global optimum. The first two approaches search on the manifold for the optimal solution. These techniques cannot guarantee to find the optimal solution and therefore we cannot assure that the wrapping polyhedron $P_w$ contains the non-linear region $R$. However, to verify the behaviour of neural networks we have to assert that the polyhedron $P_w$ contains $R$. Hence, these two methods are not suitable for our overall goal. Consequently, two other approaches, namely a branch and bound and a binary search method, which can guarantee that the wrapping polyhedron $P_w$ contains $R$, have been developed.

4.2.1 Sequential Quadratic Programming

Nonlinearly constrained optimization problems have often been solved with Sequential Quadratic Programming (SQP) techniques. SQP is a conceptual method from which various algorithms have evolved.

The main idea of SQP is as follows (for an overview of SQP methods, the reader is referred to the publication by Boggs and Tolle [BT96]): given a starting point $x_0$, the nonlinear optimization problem is modeled by a quadratic programming subproblem. A solution of this quadratic programming problem often results in a better approximation $x_1$. This process is iterated to generate a sequence of approximations until the method converges to a (often local) optimal solution. The Matlab optimization toolbox provides a function, named fmincon, which relies on Sequential Quadratic Programming (SQP) methods [Mat00b]. According to [Mat00b], state-of-the-art optimization algorithms are implemented. These routines for solving linear and quadratic optimization problems use an active-set method combined with projection techniques [Mat00b]. However, the application of fmincon to the optimization problem in $x$-space, i.e.

   $\max g^T x$  subject to  $A\sigma(x) \le b$

or the corresponding problem in $y$-space

   $\max g^T \sigma^{-1}(y)$  subject to  $Ay \le b$

often did not converge to the global optimum. As a result the wrapping polyhedron did not contain all points of the nonlinear region $R$. This was shown by randomly generating polyhedra in the $y$-space and testing whether the computed wrapping polyhedron contained the non-linear region $R$. Points were generated on the surface of the polyhedron, mapped using the sigmoidal function, and tested whether they were contained in the approximated polyhedron. Of course, using different starting points on the manifold while optimizing in the same direction $g$ could lead to better results. But this is computationally very expensive and still does not guarantee to find the optimal solution. The graph of the inverse sigmoid function indicates that points of strong changes are close to +1 and -1. This simple observation resulted in a first heuristic. Given the interior point $p$ on the corresponding facet of the polyhedron in $y$-space, we determine the strongest component of $p$ and optimize on the facet in this direction, under the constraint that the optimum is a point of the polyhedron $P_y$. Linear programming techniques solve this problem. Now we choose sample points on the connecting line segment. This led to better results but still could not guarantee to find the global optimum. To conclude: current implementations of SQP often converge to local optima. As a consequence the polyhedral approximation would not be the convex hull of $R$. The next approach, which relied on the search for the optimal solution on the manifold, is named the Maximum Slice Approach (MSA).
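To make the failure mode concrete, the following Python sketch sets up problem (ii) with SciPy's SLSQP solver, used here merely as a stand-in for Matlab's fmincon; the polyhedron, the tanh transfer function and all numerical values are illustrative assumptions, not data from this thesis. After the local solve, random points of $P_y$ are mapped to $x$-space and checked against the candidate half-space, which mirrors the containment test described above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative polyhedron P_y = {y | Ay <= b} inside (-0.8, 0.8)^2 (assumed data).
A = np.vstack([np.eye(2), -np.eye(2), rng.normal(size=(3, 2))])
b = np.concatenate([0.8 * np.ones(4), 0.5 * np.ones(3)])
g = np.array([1.0, 0.5])                                  # optimization direction

# Inverse of the tanh transfer function, clipped for numerical safety.
inv_sigma = lambda y: np.arctanh(np.clip(y, -0.999999, 0.999999))

# Problem (ii): maximize g^T sigma^{-1}(y) subject to Ay <= b.
objective = lambda y: -g @ inv_sigma(y)
constraints = {"type": "ineq", "fun": lambda y: b - A @ y}
res = minimize(objective, np.zeros(2), method="SLSQP", constraints=constraints)
c_star = -res.fun                 # candidate position of the hyperplane g^T x = c_star

# Containment test: sample P_y, map the points to x-space and look for violations.
samples = rng.uniform(-0.8, 0.8, size=(5000, 2))
inside = samples[(A @ samples.T <= b[:, None]).all(axis=0)]
violations = int((inv_sigma(inside) @ g > c_star + 1e-9).sum())
print("candidate optimum:", c_star, "violating sample points:", violations)

If the local optimum returned by the solver is not global, the violation count is non-zero and the corresponding hyperplane would cut into $R$.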

4.2.2 Maximum Slice Approach

In a two dimensional space a solution to the optimization problem

   $\max g^T x$  subject to  $A\sigma(x) \le b$

can always be found. For higher dimensions we can build a two dimensional slice by keeping all but two variables constant. In a system with two free variables one variable can be expressed as a function of the other. Thus, by determining the domain of the free variable, the maximum of the cost function can be calculated. This observation is the basic idea of the Maximum Slice Approach (MSA). Given an approximate solution after the $k$-th iteration, say $x_k$, we construct a better approximation $x_{k+1}$ by computing new solutions in all two dimensional slices through $x_k$. This process is iterated until we get no significant improvement between two consecutive points.


[x_max] = maximumSlice(g, P_y)
   // let n be the dimension of the polyhedron and
   // m be the number of facets of the polyhedron P_y
   for i = 1..m
      y_0 = analyticCenterOfMass(facet_i)
      x_0 = sigma^{-1}(y_0)
      k = 0                                   // k is an iteration index
      do
         k = k + 1
         x_k = x_{k-1}
         for all (n choose 2) slice combinations (p, q)
            slice = [p, q]                    // all coordinates except x(p), x(q) are kept constant
            x_k([p, q]) = optimizeInSlice(slice, x_k, g, i)
      while ( g^T x_k > g^T x_{k-1} )
      // x_max^i denotes the maximum for facet i;
      // note that x_k is not an improvement of x_{k-1}
      x_max^i = x_{k-1}
   x_max = argmax_i g^T x_max^i


It is possible that the optimal solution is not on the facet-manifold which defined the gradient, hence a maximum for all facets is computed. The overall maximum then defines the best point. A further improvement of this algorithm would reduce the number to the relevant facet-manifolds (e.g. facets opposite to the search direction are not important).

The following describes how to compute the optimal solution for the $i$-th facet manifold in the $(x(1), x(2))$ slice. As the optimum has to be on the manifold of the reciprocal image of the facet, the $i$-th row of the system of inequalities becomes an equality, i.e.:

   $a(i,:)\,\sigma(x) = b(i)$

Within the considered slice (as the other variables $x(3), \ldots, x(n)$ are kept constant) the optimization problem reduces to:

   $\max\; g(1)x(1) + g(2)x(2)$  s. to  $a(i,1)\sigma(x(1)) + a(i,2)\sigma(x(2)) = b(i) - \sum_{j=3}^{n} a(i,j)\sigma(x(j))$

Now it is possible to express $x(2)$ as a function of $x(1)$. To simplify the expression let $a = a(i,:)$, $r = b(i)$ and $\tilde{r} = r - [a(3)\sigma(x(3)) + \cdots + a(n)\sigma(x(n))]$, so that

   $a(1)\sigma(x(1)) + a(2)\sigma(x(2)) = \tilde{r}$.

As $x(3), \ldots, x(n)$ are constant, we can express $x(2)$ as a function of $x(1)$. The assumption $a(2) \ne 0$ is valid, otherwise we would not analyse a slice with $x(2)$ as free variable:

   $x(2) = \sigma^{-1}\left(\dfrac{\tilde{r} - a(1)\sigma(x(1))}{a(2)}\right)$

To summarize, maximizing $g^T x$ in the $(x(1), x(2))$ slice is equivalent to maximizing

   $\varphi(x(1)) = g(1)x(1) + g(2)\,\sigma^{-1}\left(\dfrac{\tilde{r} - a(1)\sigma(x(1))}{a(2)}\right)$

It is important to note that the term $g(2)\,\sigma^{-1}((\tilde{r} - a(1)\sigma(x(1)))/a(2))$ is monotonic. For the calculation of the maximum, the computation of the interval for the free variable $x(1)$ is required. Using the knowledge that axis-parallel slices in $x$-space are again axis-parallel slices in $y$-space, we are able to calculate the interval in the $y$-space and simply backpropagate it to the $x$-space. The calculation of the interval in $y$-space is solved with the two linear optimization problems

   $\max\; y(1)$ (resp. $\min\; y(1)$)  s. to  $A(:,[1\,2])\,y([1\,2]) \le \tilde{b}$,  $a(i,[1\,2])\,y([1\,2]) = \tilde{r}$,

where the right-hand sides $\tilde{b}$ and $\tilde{r}$ account for the coordinates that are kept constant. The interval of $x(1)$ is $[\sigma^{-1}(y(1)_{\min}),\; \sigma^{-1}(y(1)_{\max})]$. The calculation of the maximum of $\varphi(x(1))$ requires the first derivative. To simplify the notation we use here $t = x(1)$, $\beta_i = g(i)$, $\alpha_i = a(i)$:

   $\varphi'(t) = \dfrac{(\beta_2\alpha_1\alpha_2 - \beta_1\alpha_1^2)\,\sigma(t)^2 + 2\beta_1\alpha_1\tilde{r}\,\sigma(t) + \beta_1\alpha_2^2 - \beta_1\tilde{r}^2 - \beta_2\alpha_1\alpha_2}{-\alpha_1^2\sigma(t)^2 + 2\alpha_1\tilde{r}\,\sigma(t) + \alpha_2^2 - \tilde{r}^2}$

Note that $\varphi'(t) = 0$ if and only if the numerator is equal to zero. With $[p \leftarrow q]$ we notate a substitution process, i.e. $q$ substitutes $p$. We apply the following substitutions to the numerator:

   $[u \leftarrow \beta_2\alpha_1\alpha_2 - \beta_1\alpha_1^2]\;\;[v \leftarrow 2\beta_1\alpha_1\tilde{r}\,]\;\;[w \leftarrow \beta_1\alpha_2^2 - \beta_1\tilde{r}^2 - \beta_2\alpha_1\alpha_2]\;\;[s \leftarrow \sigma(t)]$

This results in the computation of the roots of a quadratic polynomial:

   $u s^2 + v s + w = 0$

With the solutions $s_1, s_2$ we can compute the points where $\varphi'(t) = 0$, namely $t_1 = \sigma^{-1}(s_1)$ and $t_2 = \sigma^{-1}(s_2)$. Let us assume $t \in [lo, up]$. The maximal point for the optimization problem is either a point where the derivative is zero, or it is $lo$ or $up$, i.e. a point at the border of the domain of $t$ (due to the structure of the function $\varphi$). The optimumInSlice function describes this algorithm in pseudo code. Note that $x[\cdot]$ and $y[\cdot]$ denote the interval of the free variable for the $x$-space and $y$-space respectively.


[x_opt, v] = optimumInSlice(slice, x, g, i)
   // x_opt is the computed optimal point within the slice
   // v is the computed value of the cost function
   // A and b are global variables; slice = [p, q]
   y[.] = [ min / max  y(p)   s. to  A(:,[p q]) y([p q]) <= b~  and  a(i,[p q]) y([p q]) = r~ ]
   // the interval of the free variable is given by:
   x[.] = sigma^{-1}( y[.] )
   lo = min x[.];  up = max x[.]
   t = x(p)
   beta_1 = g(p);  beta_2 = g(q)
   alpha_1 = a(i,p);  alpha_2 = a(i,q)
   r~ = b(i) - [ a(i,3) sigma(x(3)) + ... + a(i,n) sigma(x(n)) ]
   phi(t) = beta_1 t + beta_2 sigma^{-1}( (r~ - alpha_1 sigma(t)) / alpha_2 )
   // the maximum is either at the interval borders or at a point where phi'(t) = 0
   // w.l.o.g. let phi'(t_1) = 0 and phi'(t_2) = 0 with phi(t_1) >= phi(t_2) and t_1 in x[.]
   v = max( phi(t_1), phi(lo), phi(up) )
   // if t_1, t_2 are not in x[.] then v = max( phi(lo), phi(up) )
   t = t_1 if v = phi(t_1);  lo if v = phi(lo);  up otherwise
   x_opt = [ t ; sigma^{-1}( (r~ - alpha_1 sigma(t)) / alpha_2 ) ]

This method performed quite well for a large number of lower-dimensional experiments, but unfortunately for higher-dimensional examples the method did not converge to the global optimum. The reasons why this method did not always converge to the global optimum remain unclear and require a detailed analysis.
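The reduction to a one-dimensional problem within a slice can also be illustrated numerically. The following Python sketch replaces the closed-form root computation above by a bounded one-dimensional search; the facet data, the tanh transfer function and the admissible interval for $x(1)$ (obtained in the thesis from the two linear programs in $y$-space) are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize_scalar

sigma, inv_sigma = np.tanh, np.arctanh

def slice_optimum(a_i, b_i, g, x_fixed, x1_lo, x1_hi):
    """Maximize g(1)x(1) + g(2)x(2) on the manifold a_i . sigma(x) = b_i within the
    (x(1), x(2)) slice; all other coordinates are kept fixed at x_fixed.
    The interval [x1_lo, x1_hi] is assumed to keep the arctanh argument in (-1, 1)."""
    r_tilde = b_i - a_i[2:] @ sigma(np.asarray(x_fixed))

    def x2_of_x1(x1):
        return inv_sigma((r_tilde - a_i[0] * sigma(x1)) / a_i[1])

    # phi is the cost function restricted to the slice; we minimize its negation.
    phi = lambda x1: -(g[0] * x1 + g[1] * x2_of_x1(x1))
    res = minimize_scalar(phi, bounds=(x1_lo, x1_hi), method="bounded")
    x1 = res.x
    return np.concatenate([[x1, x2_of_x1(x1)], x_fixed]), -res.fun

# Illustrative data (assumed): one facet a_i . sigma(x) = b_i in dimension 3.
a_i = np.array([1.0, 0.8, 0.3]); b_i = 0.4
g = np.array([1.0, 0.5, 0.0])
point, value = slice_optimum(a_i, b_i, g, x_fixed=[0.2], x1_lo=0.0, x1_hi=1.0)
print(point, value)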


The next subsections introduce two methods which guarantee finding an optimum which is either on the manifold or within some precision outside the manifold, i.e. these methods can ensure that the wrapping polyhedron $P_w$ contains $R$, while approximating the global optimum on the manifold. Hence, for a given number of hyperplanes, the computed wrapping polyhedron is not minimal.

4.2.3 Branch and Bound Approach

Branch and bound is a well known strategy which has been successfully applied to a number of optimization problems, e.g. integer programming problems. Branch and bound splits the original problem into two (or more) disjunct subproblems (branch) and successively computes solutions on the smaller subproblems. Within an iteration a new solution on the best subproblem is calculated. The best current solution is used to prune (bound) irrelevant subproblems. This process is repeated until a stopping criterion is fulfilled, e.g. if further calculation would not significantly improve the result. We apply the branch and bound approach to the following optimization problem:

   $\max g^T \sigma^{-1}(y)$  subject to  $Ay \le b$

Figure 4.3: Application of branch and bound to approximate the optimal solution of the optimization problem $\max g^T \sigma^{-1}(y)$ subject to $Ay \le b$. The upper right corner defines the upper bound for the value of the cost-function, and the arrow on the surface of the region indicates the optimization direction. The initial split of the hypercube is depicted on the left side, while the right side plots the next split for the upper sub-box. A sub-box can be pruned if its upper bound for the cost-function is less than the best current solution on the surface of the manifold.


We refer to the maximal point of the cost-function, restricted to a box $B$, as $u(B)$, and for the value of the cost-function at this point we write $v_u$. In our tests we used a tansig transfer function and restricted the sides of the cube to the interval $[-0.98, 0.98]$, which corresponds to an input range between $[-2.29, 2.29]$.

The intersection between the restricted hypercube and the original polyhedron is a polytope. The hypercube $B_y^0$ containing this polytope in $y$-space is calculated by applying linear optimization techniques. An interior point of the polytope is calculated as the barycentre between the component-wise extreme points (named $e_i$) of the cost-function subject to the polytope. This defines a starting point for a sequential quadratic programming problem and leads to a local maximum on a facet of the polytope.

Starting with the initial box $B_y^0$ we split the box (along the longest side) into two sub-boxes $B_y^1$ and $B_y^2$. The sub-box with the larger upper bound is selected for further calculation. The process to compute a local optimum on a facet of the polytope is repeated, but restricted to the sub-polytope contained in the chosen sub-box. The best current solution point on a facet of the polytope is called $p_{bc}$, and $v_{bc}$ expresses the value of the cost-function at this point. We can prune a box if the intersection with the polyhedron is empty or when its upper bound (for the cost-function) is less than the best (current) solution. This process is repeated until a stopping criterion is fulfilled. All relevant sub-boxes are sorted in a list according to the supremum of the box. The first box in this list (i.e. the box with the largest supremum) is named the top box $B_t$. The stopping criteria are defined as follows; stop if

- the volume of the box $B_t$ is less than a pre-defined value, or:

- the distance between a surface point and the supremum of $B_t$ is less than a pre-defined value, which indicates that the current solution already is the optimal solution or close to it (the supremum defines an upper bound for the possible global optimum).


[p_bc] = BranchAndBound(P_y, g)
InitPhase
   // e_i denotes the i-th component-wise maximum of the cost-function in y-space
   // n is the dimension
   B_y^0 = box(P_y)
   for i = 1..n:  z(i) = g^T sigma^{-1}(e_i)
   Z = sum_{i=1..n} z(i)
   // X is the barycentre between all e_i
   X = sum_{i=1..n} ( z(i) / Z ) e_i
   // use seq. quadratic programming to find a local maximum
   p_bc = argmax g^T sigma^{-1}(y)   s. to  y in P_y,  starting at X
   BoxList = { B_y^0 }
MainLoop
   do
      // to branch we use the box with the maximal supremum;
      // this box is divided, BoxList gets updated and p_bc
      // is updated, if necessary
      [BoxList, p_bc] = split(BoxList(1), P_y, g, p_bc)
      // delete all sub-boxes of the box-list with: upper bound <= v_bc
      BoxList = prune(BoxList, p_bc)
   while ( not stoppingCriterion(p_bc, u(BoxList(1))) )
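A compact, runnable rendering of this scheme is sketched below in Python. It is an illustration only: the lower bound comes from a linear-programming surrogate instead of the SQP step used above, and the polyhedron, the direction and the enclosing box are assumed data.

import numpy as np
from scipy.optimize import linprog

inv_sigma = np.arctanh   # inverse of the tansig transfer function

def upper_bound(g, lo, hi):
    """Supremum of g^T sigma^{-1}(y) over the box [lo, hi] (monotonicity of arctanh)."""
    corner = np.where(g >= 0, hi, lo)
    return g @ inv_sigma(corner)

def feasible_point(A, b, lo, hi, g):
    """LP surrogate: maximize the linear cost g^T y over P_y intersected with the box."""
    res = linprog(-g, A_ub=A, b_ub=b, bounds=list(zip(lo, hi)), method="highs")
    return res.x if res.success else None

def branch_and_bound(A, b, g, lo, hi, tol=1e-3, max_iter=200):
    best_val, best_y = -np.inf, None
    boxes = [(upper_bound(g, lo, hi), lo, hi)]
    for _ in range(max_iter):
        if not boxes:
            break
        boxes.sort(key=lambda t: -t[0])
        ub, lo_k, hi_k = boxes.pop(0)
        if ub <= best_val + tol:            # nothing left that can improve the solution
            break
        y = feasible_point(A, b, lo_k, hi_k, g)
        if y is None:                       # prune: box does not intersect the polytope
            continue
        val = g @ inv_sigma(y)
        if val > best_val:
            best_val, best_y = val, y
        j = np.argmax(hi_k - lo_k)          # branch: split the box along its longest side
        mid = 0.5 * (lo_k[j] + hi_k[j])
        for new_lo, new_hi in ((lo_k, np.where(np.arange(len(g)) == j, mid, hi_k)),
                               (np.where(np.arange(len(g)) == j, mid, lo_k), hi_k)):
            boxes.append((upper_bound(g, new_lo, new_hi), new_lo, new_hi))
    return best_y, best_val

# Illustrative data (assumed): P_y = {y | Ay <= b} inside the cube [-0.98, 0.98]^2.
A = np.array([[1.0, 1.0], [-1.0, 0.5]]); b = np.array([0.5, 0.4])
g = np.array([1.0, 0.3])
lo, hi = -0.98 * np.ones(2), 0.98 * np.ones(2)
print(branch_and_bound(A, b, g, lo, hi))

The upper bound exploits the monotonicity of $\sigma^{-1}$: on a box, the cost $g^T\sigma^{-1}(y)$ is maximized at the corner selected component-wise by the sign of $g$.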


This method ensures that we compute a wrapping polyhedron which contains the non-linear region $R$. However, tests showed that the time complexity of the method does not scale well with increasing dimension (see also the experimental results in Section 4.3). Furthermore, the method has one disadvantage: for approximately flat, linear surfaces, large numbers of boxes with similar upper values for the cost-function are obtained. In these cases the method is very slow and not useful. The method finally used for the VPA implementation is the Binary Search Approach (BSA). This method guarantees that the wrapping polyhedron contains $R$, and experiments indicate that the method scales well with higher dimension.

4.2.4 Binary Search Approach

We use a binary search method to find for a hyperplane $H$ a position close to the region $R = \{x \mid A\sigma(x) \le b\}$, such that $R \subseteq H^-$. $H$ is directed by a given vector $g$. Initially $H$ is positioned at the midpoint between a point, named $q_x$, on the manifold of $R$ and the vertex $c_x = u(\Box(R))$, i.e. the corner of the wrapping box with the maximum value for the cost-function $g^T x$ subject to $x \in B_x^0 = \Box(R)$. The hyperplane is moved closer towards $R$ if $R \cap H^+$ is empty. To determine whether $R \cap H^+$ is empty, a box-refinement process is applied.

The refinement process is performed by a refinement step in $x$- and $y$-space and by forward- and backward-propagating those refined boxes. The refinement for the $k$-th iteration in $x$-space is calculated by $B_x^{k+1} = \Box(B_x^k \cap H^+)$ and in $y$-space by $B_y^{k+1} = \Box(B_y^k \cap P_y)$. The forward-propagation of a box from $x$-space to $y$-space is a component-wise application of the sigmoid function on the box; similarly, the backward-propagation of a box from $y$-space to $x$-space is a component-wise application of the inverse sigmoid function on the box. We will say we send a box from $x$-space to $y$-space and vice versa. Due to the decreasing volume in the sequence of boxes the convergence is guaranteed, and if, after the $k$-th iteration, a box is empty, then there is no intersection between $R$ and $H^+$. We demonstrate the algorithm with a simple example for the two dimensional case (i.e. two neurons), before describing further details, such as the rate of convergence, and proving some key observations.
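Since the transfer function acts component-wise and monotonically, sending a box between the two spaces only requires mapping its lower and upper corners. A minimal Python sketch, assuming the tansig/tanh transfer function:

import numpy as np

sigma, inv_sigma = np.tanh, np.arctanh

def send_box_to_y(lo_x, hi_x):
    """Forward-propagate an axis-parallel box: apply sigma component-wise to its corners."""
    return sigma(lo_x), sigma(hi_x)          # monotone, so corners map to corners

def send_box_to_x(lo_y, hi_y):
    """Backward-propagate a box with the inverse transfer function."""
    return inv_sigma(lo_y), inv_sigma(hi_y)

lo_x, hi_x = np.array([-1.0, 0.0]), np.array([0.5, 2.0])
lo_y, hi_y = send_box_to_y(lo_x, hi_x)
print(lo_y, hi_y)                            # box in y-space
print(send_box_to_x(lo_y, hi_y))             # recovers the original box in x-space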


[Plots for panel (a): the polyhedron in $y$-space (axes $y_1$, $y_2$) and the region in $x$-space (axes $x_1$, $x_2$).]

(a) Start. Polyhedron $P_y$ in $y$-space and the region $R$ in $x$-space. Arrows on the manifold of $R$ are gradient vectors at the corresponding point, plotted as a circle.

(b) Initial step. Calculating the wrapping hypercubes $B_x^0$ and $B_y^0$. The hyperplane with direction vector $g$ will be positioned between the two points $c_x$ (right circle) and $q_x$ (left circle).

(c) Hyperplane insertion and intersection detection. The binary search strategy first inserts a hyperplane which cuts the line segment between $c_x$ and $q_x$ in the middle. We have to detect if this hyperplane $H$ intersects with the region $R$. Therefore we start the box-refinement process by calculating $B_x^1 = \Box(B_x^0 \cap H^+)$ and send the box to the $y$-space, i.e. $B_y^1 = \sigma(B_x^1)$.


(d) Intersection detection through refinement of boxes. The intersection of the polyhedron $P_y$ and the box $B_y^1$ results in the refined box $B_y^2 = \Box(B_y^1 \cap P_y)$.

(e) Further refinement. Calculating the new box $B_x^{k+1} = \Box(B_x^k \cap H^+)$ and sending it again to the $y$-space results in smaller boxes.

(f) End of the refinement process. The volume of the refined box $B_x$ is less than a small value $\epsilon$. This means there is an intersection between the box $B_x$ and the region $R$ within a small $\epsilon$-neighbourhood, i.e. we can terminate and use this hyperplane as approximation.


(g) Insertion of the next hyperplane $H$. We computed the box $B_x^1 = \Box(B_x^0 \cap H^+)$ and computed $B_y^1 = \sigma(B_x^1)$. It is easy to see that $B_y^1 \cap P_y$ is empty. This means we can move the hyperplane closer to $R$.

(h) Moving the hyperplane closer to the region. At this state $\Delta\lambda$ and $\lambda$ have been updated (we use $\lambda$ to define the position of $H$ and $\Delta\lambda$ to express the step-size; this will become clear later in the description of the algorithm). As before (in panel (g)), the intersection of $B_y^1 = \sigma(B_x^1)$ and the polyhedron $P_y$ is empty.

(i) Moving closer to $R$ and intersection detection process. The intersection of $B_y^1 = \sigma(B_x^1)$ and the polyhedron $P_y$ is not empty.


(j) Refinement of the box and detection of no intersection. In $y$-space $B_y^2 = \Box(B_y^1 \cap P_y)$. After calculating the new box in $x$-space, i.e. $B_x^2 = \sigma^{-1}(B_y^2)$, there is no intersection of the inserted hyperplane $H$ and the region $R$ (because $B_x^2$ is completely in $H^-$). This example demonstrates again the relevance of the refinement process and shows that sometimes more iterations are needed before deciding if there is an intersection between $H^+$ and $R$.

(k) Moving the hyperplane closer to the manifold and refinement of boxes. This figure shows the refined box in $y$-space $B_y^2 = \Box(B_y^1 \cap P_y)$ and the corresponding box $B_x^2 = \sigma^{-1}(B_y^2)$ in $x$-space.

(l) Detection of no intersection. Calculating $B_x^3 = \Box(B_x^2 \cap H^+)$ results in the refined box in $x$-space. Sending this box to $y$-space, i.e. calculating $B_y^3 = \sigma(B_x^3)$, leads to the small box in $y$-space. The intersection between $B_y^3$ and the polyhedron $P_y$ is empty. This means there is still no intersection between $H$ and the region $R$ in $x$-space. The algorithm terminated as $\Delta\lambda < \epsilon$, i.e. the distance between two consecutive hyperplane positions is less than a small value $\epsilon$.


(m) The state of the wrapping polyhedron after approximating the first two facets. Compared to panel (b), a much better approximation is obtained.

Figure 4.4: Two dimensional example for the binary search method.

In this section we enumerate a few properties about functions which are useful to prove a key observation (Lemma 4.3) about the relation between a box $B_x$ in $x$-space which intersects the region $R$ and the corresponding box $B_y$ and the polyhedron $P_y$ in $y$-space. Lemma 4.4 guarantees a decreasing sequence of boxes (with respect to the volume). In the following, $f$ is a function from $X$ to $Y$.

Lemma 4.1 For any function $f$ and any regions $A, B \subseteq X$: $f(A \cap B) \subseteq f(A) \cap f(B)$.
Proof: $f(A \cap B) \subseteq f(A)$ and $f(A \cap B) \subseteq f(B)$. Hence: $f(A \cap B) \subseteq f(A) \cap f(B)$. □

Lemma 4.2 If $f$ is an injection then: $f(A \cap B) = f(A) \cap f(B)$.
Proof: Using Lemma 4.1 it is sufficient to show the reverse inclusion $\supseteq$, i.e. $f(A) \cap f(B) \subseteq f(A \cap B)$. Let $y \in f(A) \cap f(B)$. Therefore there exist $x_A \in A$ and $x_B \in B$ such that $y = f(x_A) = f(x_B)$. As $f$ is an injection, we have $x_A = x_B$ and $x_A \in A \cap B$. Therefore $y \in f(A \cap B)$. □

Lemma 4.3 $B_x \cap \sigma^{-1}(P_y) \ne \emptyset$ iff $\sigma(B_x) \cap P_y \ne \emptyset$.
Proof: Lemma 4.2 implies $\sigma(B_x \cap \sigma^{-1}(P_y)) = \sigma(B_x) \cap P_y$. Therefore: $B_x \cap \sigma^{-1}(P_y) \ne \emptyset$ iff $\sigma(B_x) \cap P_y \ne \emptyset$. □

Lemma 4.4 $\Box(B \cap R) \subseteq B$, where $B$ is an axis-parallel hypercube and $R$ is a region.
Proof: $B \cap R \subseteq B$. Therefore $\Box(B \cap R) \subseteq \Box(B) = B$. □


The following nomenclature is used in describing the algorithm:

$u(B)$ ... we write $u(B)$ for the upper corner of box $B$, i.e. the upper bound for the cost-function subject to $B$. For the opposite lower corner the subscript $l$ is used.
$c_x$ ... maximum of the cost-function in $x$-space (subject to $B_x^0$).
$q_y$ ... point in $y$-space which is the solution point of the following linear optimization problem: $\max g^T y$ subject to $Ay \le b$.
$q_x$ ... point corresponding to $q_y$ in $x$-space, i.e. $q_x = \sigma^{-1}(q_y)$. In the linear case this would already define the optimal position for $H$.
$m_x$ ... point between $c_x$ and $q_x$.
$\lambda$ ... value between (0,1). $\lambda$ defines the position of $m_x$ on the line between $c_x$ and $q_x$. $\Delta\lambda$ expresses the rate of change between two consecutive values for $\lambda$.
$vol(B)$ ... volume of a box. Two consecutive boxes are considered equal if the volume of their difference is negligible, that is $vol(B_x^k \setminus B_x^{k+1}) < \epsilon$.
$\epsilon$ ... $\epsilon$ expresses a small positive real number.

Given an optimization direction $g$, we face the following non-linear optimization problem in the $x$-space:

   $\max g^T x$  subject to  $A\sigma(x) \le b$

The solution of this optimization problem defines the optimal position for a hyperplane $H = \{x \mid g^T x = c\}$, such that $R \subseteq H^-$. In general we cannot solve the above optimization problem exactly; instead we use a binary search algorithm to find a close approximation. Geometrically, we want to move the hyperplane $H$ close to $R$ and assure that $R \subseteq H^-$. Analytically this results in an adjustment of $c$ such that $c$ is close to the maximum value of the optimization problem. We move the hyperplane, using a binary search strategy, along a line defined through the point $c_x$ and a point $q_x$, i.e. between the upper corner point of the wrapping hypercube and a point on the manifold. Within every iteration we have to test if there is an intersection between $R$ and $H^+$. The test for an intersection is realized through a refinement process. Boxes are refined iteratively by using the following observations:

1. As boxes are closed under component-wise transformations, it follows that: if $B_x$ is a box then $B_y = \sigma(B_x)$ is a box, and if $B_y$ is a box then $B_x = \sigma^{-1}(B_y)$ is a box.

2. Since $B_y^{k+1} = \Box(B_y^k \cap P_y) \subseteq B_y^k$ and $B_x^{k+1} = \Box(B_x^k \cap H^+) \subseteq B_x^k$ (see Lemma 4.4), we have a decreasing sequence of boxes.

The result of this refinement process, together with Lemma 4.3, is used to decide whether to move $H$ closer towards $R$. The algorithm is presented in a top-down manner. The initial phase calculates the wrapping boxes $B_x$ and $B_y$ by using linear programming techniques, and defines the points $c_x$ and $q_x$. Within a do-while loop the value of $\lambda$ is adjusted using a binary search strategy and the refine function is called. Once the refinement process has finished we have to decide whether to:

- Terminate the algorithm,
  - if there is an intersection within an $\epsilon$ neighbourhood, or:
  - if $\Delta\lambda < \epsilon$ (i.e. the distance between two consecutive hyperplanes is less than $\epsilon$).

- Move up (increase $\lambda$), i.e. we move the hyperplane closer to the manifold (using half of the distance of the previous two points).

- Move down (decrease $\lambda$), i.e. we move the hyperplane back towards the hypercube corner (again using half of the distance of the previous two points).


[H] = BinarySearch(P_y, g)
InitPhase
   // P_y = {y | Ay <= b}
   B_y^0 = box(P_y)
   B_x^0 = box( sigma^{-1}(B_y^0) )
   c_x = argmax g^T x   s. to  x in B_x^0
   q_y = argmax g^T y   s. to  Ay <= b
   q_x = sigma^{-1}(q_y)
   Delta_lambda = 0.5
   lambda = 0.5
MainLoop
   do
      m_x = c_x + lambda (q_x - c_x)
      // definition of the constant c and therefore the position of H
      H = { x | g^T x = g^T m_x }
      [B_x, B_y] = refine(B_x^0, B_y^0, P_y, H)
      Delta_lambda = 0.5 Delta_lambda
      if ( B_x = empty and B_y = empty )
         lambda = lambda + Delta_lambda     // move up: H moves closer to the manifold
      else
         lambda = lambda - Delta_lambda     // move down: H moves back towards the corner
   while ( ||u(B_x) - l(B_x)|| > eps  and  Delta_lambda > eps )


The sign of $g$ is used to calculate $\max g^T x$ subject to $x \in B_x^0$. Hence, this computation is very fast, because no linear programming method needs to be applied.

The crucial part of the algorithm is to detect whether there exists an intersection between the hyperplane $H$ and the region $R$. This is done by sending boxes between the $x$- and $y$-space. At each iteration new boxes are calculated for the $x$- and $y$-space. We start with the box $B_x^1 = \Box(B_x^0 \cap H^+)$, i.e. $B_x^1$ is the box of the polyhedron defined by the intersection of the wrapping hypercube $B_x^0$ and the half-space $H^+$. If this box intersects the region $R$, then the corresponding box $B_y^1 = \sigma(B_x^1)$ intersects $P_y$ in $y$-space (see Lemma 4.3). In $y$-space it is necessary to determine the box of the intersection between $B_y^1$ and $P_y$, i.e. $B_y^2 = \Box(B_y^1 \cap P_y)$. $B_y^2$ is either (see Lemma 4.4):

1. An empty box, i.e. $B_y^1 \cap P_y = \emptyset$.

2. An unchanged box, i.e. $\Box(B_y^1 \cap P_y) = B_y^1$.

3. A refined box, i.e. $\Box(B_y^1 \cap P_y) \subsetneq B_y^1$.

For case three (refined box $B_y^2$) we have to calculate $B_x^2$, i.e. $B_x^2 = \sigma^{-1}(B_y^2)$. In $x$-space the new box $B_x^3 = \Box(B_x^2 \cap H^+)$ is computed (again the three cases are possible). This process is repeated until we know that:

- There is no intersection (that is, a box is empty).

- A box does not change anymore, i.e. we know there has to be an intersection.

- The distance between $u(B_x)$ and $l(B_x)$ is less than $\epsilon$, i.e. we intersect within a small $\epsilon$ environment.


[B_x, B_y] = refine(B_x^0, B_y^0, P_y, H)
   k = 0
   B_x^{k+1} = box( B_x^k ∩ H^+ )
   do
      // forward step, i.e. x -> y
      k = k + 1
      B_y^k = sigma( B_x^k )
      B_y^{k+1} = box( B_y^k ∩ P_y )
      // backward step, i.e. y -> x
      k = k + 1
      B_x^k = sigma^{-1}( B_y^k )
      B_x^{k+1} = box( B_x^k ∩ H^+ )
   while ( B_x^{k+1} is not empty  and  vol(B_x^k \ B_x^{k+1}) >= eps  and  ||u(B_x^{k+1}) - l(B_x^{k+1})|| >= eps )
   B_x = B_x^{k+1}
   B_y = B_y^k
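The following Python sketch mirrors this refinement loop. It is an illustration under assumptions (tanh transfer function, SciPy's linprog for the bounding-box computations, illustrative polyhedron and hyperplane data) rather than the VPA implementation; the three termination cases are collapsed into an emptiness test and a simple convergence test.

import numpy as np
from scipy.optimize import linprog

sigma, inv_sigma = np.tanh, np.arctanh

def bounding_box(A, b, lo, hi):
    """Bounding box of {z | Az <= b, lo <= z <= hi}; returns None if this set is empty."""
    n = len(lo)
    new_lo, new_hi = np.empty(n), np.empty(n)
    for j in range(n):
        c = np.zeros(n); c[j] = 1.0
        lo_res = linprog(c,  A_ub=A, b_ub=b, bounds=list(zip(lo, hi)), method="highs")
        hi_res = linprog(-c, A_ub=A, b_ub=b, bounds=list(zip(lo, hi)), method="highs")
        if not (lo_res.success and hi_res.success):
            return None
        new_lo[j], new_hi[j] = lo_res.fun, -hi_res.fun
    return new_lo, new_hi

def refine(box_x, box_y, A, b, g, c, eps=1e-4, max_iter=50):
    """Alternately intersect with H+ = {x | g^T x >= c} in x-space and with
    P_y = {y | Ay <= b} in y-space, sending the boxes through sigma in between."""
    for _ in range(max_iter):
        box_x = bounding_box(-g.reshape(1, -1), np.array([-c]), *box_x)   # box(B_x ∩ H+)
        if box_x is None:
            return None, None                        # empty: H+ cannot intersect R here
        box_y_new = bounding_box(A, b, sigma(box_x[0]), sigma(box_x[1]))  # box(sigma(B_x) ∩ P_y)
        if box_y_new is None:
            return None, None
        if np.max(np.abs(np.concatenate(box_y_new) - np.concatenate(box_y))) < eps:
            return box_x, box_y_new                  # boxes no longer shrink: intersection
        box_y = box_y_new
        box_x = (inv_sigma(box_y[0]), inv_sigma(box_y[1]))                # send back to x-space
    return box_x, box_y

# Illustrative data (assumed): P_y in two dimensions, direction g, trial position c.
A = np.array([[1.0, 0.5], [-1.0, 1.0]]); b = np.array([0.4, 0.3])
g = np.array([1.0, 0.2]); c = 0.1
box_y0 = (-0.95 * np.ones(2), 0.95 * np.ones(2))
box_x0 = (inv_sigma(box_y0[0]), inv_sigma(box_y0[1]))
print(refine(box_x0, box_y0, A, b, g, c))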

Improvement of the Original Binary Search Approach

The binary search method as introduced can be improved. For example, after the first forward step it is sufficient to consider the smaller polyhedron $P_y^1 = B_y^1 \cap P_y$. This is valid because an intersection between $H^+$ and $R$ is only possible within the region $\sigma^{-1}(P_y^1)$. This modifies the algorithm by restarting the search for the correct position of the hyperplane $H$ on this subproblem. This would result in a faster algorithm compared to the basic method, because the considered polyhedron $P_y^k$ is smaller and, generally, contains fewer inequalities than the original polyhedron $P_y$. This simplifies the operation of intersecting the polyhedron with the corresponding box. However, we have not yet implemented the modified binary search method. In future versions an implementation and comparison to the basic algorithm will be provided.

4.3 Complexity Analysis of the Branch and Bound and the Binary Search Method

We tested the new algorithms for the approximation process of a single hyperplane, with a randomly chosen direction $g$. Thus far, tests have covered polyhedra of dimension 3 up to dimension 9, corresponding to a neural network with 3 up to 9 neurons in a hidden layer. For a given number of randomly chosen directions (between 1 and 10) we inserted hyperplanes within the restricting hypercube, such that the polytope was non-empty by construction. For each dimension, and each possible number of additional hyperplanes, 10 different polyhedra were used, overall testing 100 polyhedra for each dimension. The experiments of the branch and bound method were used to test the binary search approach. Table 4.1 provides the 95% confidence interval for the time-complexity of the binary search and branch and bound technique.

Dimension Branch and Bound (in seconds) Binary Search (in seconds)

3 [24.137,30.854] [6.839,12.070]

4 [58.927,81.921] [8.542,20.131]

5 [98.885,151.425] [12.440,26.774]

6 [150.867,243.428] [7.437,18.529]

7 [277.533,464.628] [8.946,15.465]

8 [569.874,1195.381] [13.527,26.355]

9 [945.266,2147.457] [14.343,21.431]

Table 4.1: Comparison of branch and bound and binary search.
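The thesis does not state how the intervals in Table 4.1 were constructed; for readers who want to reproduce such numbers, a standard Student-t confidence interval over the per-polyhedron running times can be computed as in the following sketch (the timing values are made up for illustration).

import numpy as np
from scipy import stats

def confidence_interval_95(samples):
    """Two-sided 95% confidence interval for the mean of i.i.d. timing samples."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(len(samples))
    half = stats.t.ppf(0.975, df=len(samples) - 1) * sem
    return mean - half, mean + half

timings = [6.9, 8.1, 13.5, 10.2, 7.4, 11.8, 9.9, 12.3, 8.8, 10.6]   # seconds, illustrative
print(confidence_interval_95(timings))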

The above results indicate that the binary search strategy outperformed the branch

and bound approach for our experiments. Important remarks:


- To avoid numerical difficulties, experiments were restricted to polyhedra contained in the interval $[-0.98, 0.98]$.

- An additional stopping criterion was used for the binary search method: once a solution was better than or similar to the corresponding branch and bound result, the method stopped.

- We also detected examples of very slow convergence with the binary search method. These cases still have to be investigated.

[Figure panels (a)-(d), plotted against the dimension (3 to 9):]

(a) Binary Search: number of outer loops.

(b) Binary Search: number of inner loops.

(c) Binary Search: average time in s.

(d) Comparison of time complexity. The solid line shows the branch and bound approach, the dotted line the binary search, on a logarithmic time axis.


4.4 Summary of this Chapter

Not having found a satisfactory solution with the technique of piece-wise linear approximation of the sigmoidal function, an alternative approach was developed which directly approximates the non-linear region $R$ from outside. The initial two methods, Sequential Quadratic Programming (SQP) and the Maximum Slice Approach (MSA), aimed to find the global optimum for the corresponding nonlinear optimization problem. Neither method could guarantee to find the global optimum. Thereafter, strategies were developed to compute an approximation of the global optimum from outside, such that the requirement to compute a wrapping polyhedron could be fulfilled. Experiments indicated that the binary search approach is a suitable method.

Contributions - Chapter 4 -

- In our investigations we could not detect any suitable methods relying on piece-wise linear approximations of the sigmoidal function for the approximation of the nonlinear region $R$. The reasons are: either the method does not scale well (for axis-parallel splits) or a simple and computationally cheap method for non-axis-parallel splits has not been found yet.

- This led to the idea of computing a wrapping polyhedron $P_w$ which contains the nonlinear region $R$. The problem of computing $P_w$ reduces to a nonlinear optimization problem.

- The SQP approach as well as the MSA technique cannot guarantee to always find the global optimum on the manifold of $R$.

- Development of a branch and bound and a binary search technique, which approximate a solution for the global optimum from outside. These methods fulfill the requirement of an outside approximation for the nonlinear region $R$.

- Experiments indicated that the binary search method is in the average case a better choice than the developed branch and bound approach.


Chapter 5

Affine Transformation Phase

This chapter first gives a short introduction to the problem of computing the image or reciprocal image of a polyhedron under an affine transformation. In Section 5.2 a solution for the backward phase is provided. The computation of the image of a polyhedron under an affine transformation is explained in Section 5.3.

Section 5.4 discusses the important concept of projecting a polyhedron onto a subspace and provides a solution which computes an approximation of the projected polyhedron in polynomial time complexity. The subsequent section summarizes further approaches to approximate the image of a polyhedron under an affine transformation with a non-invertible transformation matrix. The final section provides a short summary of this chapter.

5.1 Introduction to the Problem

The computation of the image and the reciprocal image of a polyhedron under an affine transformation is illustrated in Figure 5.1. As polyhedra are closed under affine transformations, the image or reciprocal image of a polyhedron under an affine transformation is another polyhedron. For the backward computation we have to compute for a given polyhedron $P_y$ the polyhedron $\Gamma^{-1}(P_y) = \{x \mid \Gamma(x) \in P_y\}$, and in the forward propagation we compute for a given polyhedron $P_x$ the image $\Gamma(P_x)$, where $\Gamma(x) = Wx + \theta$. Note that $\Gamma$ is not necessarily invertible.


[Figure 5.1 diagram: layer P, the weight layer $\Gamma: x \mapsto y = Wx + \theta$, and the transfer-function layer S with $\sigma$ and $\sigma^{-1}$; the polyhedra $P_x$ and $P_y = \{y \mid Ay \le b\}$ with their images $\Gamma(P_x)$ and $\Gamma^{-1}(P_y)$.]

Figure 5.1: The forward and backward-propagation of a polyhedron through the weight layer. We refer to the net input space of layer S as $y$-space; the reciprocal image, which is the net output of layer P, is called $x$-space.

5.2 Backward Propagation Phase

In the backward phase the reciprocal image of a polyhedron $P_y$ under an affine transformation $y = \Gamma(x) = Wx + \theta$ has to be computed. As already described in [Mai98], a vector $x$ belongs to $\Gamma^{-1}(P_y)$ if and only if:

   $A(Wx + \theta) \le b$

The reciprocal image is a polyhedron and is defined as follows:

   $P_x = \{x \mid AW x \le b - A\theta\}$   (1)
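Computing (1) is a single matrix operation. A minimal Python sketch; the weight matrix, bias and polyhedron below are illustrative assumptions:

import numpy as np

def backward_affine(A, b, W, theta):
    """Reciprocal image of P_y = {y | Ay <= b} under y = Wx + theta:
    P_x = {x | (AW) x <= b - A theta}."""
    return A @ W, b - A @ theta

# Illustrative data (assumed): a 2-d polyhedron and a 2x2 weight layer.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([1.0, 1.0, 0.5])
W = np.array([[0.7, -0.2], [0.1, 0.9]])
theta = np.array([0.05, -0.1])
A_x, b_x = backward_affine(A, b, W, theta)
print(A_x, b_x)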

We have to remove redundant inequalities in order to reduce the computational effort of solving a mathematical programming problem and to keep a compact description. Furthermore, it is very inefficient to backpropagate an increasing number of redundant inequalities, and for the polyhedral description of the input and output space a non-redundant description is necessary.

Linear programming techniques can be used to remove redundant inequalities in (1). Redundant inequalities of a polyhedron are inequalities which do not define a facet of the polyhedron. Hence, these inequalities are not relevant for the description of the set of points enclosed by the polyhedron.


Removing redundant inequalities

Given a polyhedron $P = \{x \mid Ax \le b\}$ with $m$ inequalities, the problem of removing redundant inequalities is to obtain an irredundant polyhedron $P' = \{x \mid A'x \le b'\}$ with a minimal number of inequalities such that $P' = P$. In other words, the inequalities of the polyhedron $P'$ define facets. The $i$-th inequality is redundant if and only if the polyhedron

   $P_i = \{x \mid A([1..i-1,\, i+1..m],:)\,x \le b([1..i-1,\, i+1..m])\}$

is equal to $P$. Equivalently we can write:

   for all $x \in P_i$: $a(i,:)\,x \le b(i)$.

This can be tested by solving the following linear optimization problem:

   $\max a(i,:)\,x$  s. to  $A([1..i-1,\, i+1..m],:)\,x \le b([1..i-1,\, i+1..m])$

The inequality is irredundant if there is an $x_0 \in P_i$ such that $a(i,:)\,x_0 > b(i)$. A test for all inequalities leads to an irredundant description. We refer to this strategy as the initial strategy. A more efficient method to remove redundant constraints is described in the paper by Caron, McDonald and Ponic [CMP89].
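A direct, runnable rendering of the initial strategy could look as follows; this is a sketch under assumptions (SciPy's linprog, an artificial bounding box that keeps every LP bounded, and an illustrative example polyhedron), not the thesis implementation.

import numpy as np
from scipy.optimize import linprog

def remove_redundant(A, b, box=10.0):
    """Initial strategy: the i-th inequality is redundant iff max a(i,:)x subject to the
    remaining inequalities does not exceed b(i). Redundant rows are removed one at a time,
    so pairs of duplicate constraints are handled correctly."""
    A, b = A.copy(), b.copy()
    i = 0
    while i < len(b):
        mask = np.arange(len(b)) != i
        res = linprog(-A[i], A_ub=A[mask], b_ub=b[mask],
                      bounds=[(-box, box)] * A.shape[1], method="highs")
        if res.success and -res.fun <= b[i] + 1e-9:     # cannot be violated: redundant
            A, b = A[mask], b[mask]
        else:
            i += 1                                      # facet-defining: keep it
    return A, b

# Illustrative example (assumed): the inequality x1 + x2 <= 3 is implied by x1 <= 1, x2 <= 1.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 3.0, 0.0, 0.0])
print(remove_redundant(A, b))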

The method of [CMP89] is more efficient because it incorporates several rules which can be applied to decide immediately whether an inequality is redundant or necessary. In the sequel, we describe important corollaries and theorems which are useful to remove redundant inequalities immediately and to determine if an inequality is necessary [CMP89].

Let $P$ have $m$ constraints and define the index set $I = \{1, \ldots, m\}$. Let $x^* \in P$. We refer to $\mathrm{act}(x^*)$ as the set of active indices at $x^*$. A constraint is called active at a point if it is an equality at this point. The number of active constraints at $x^*$ is $c$.

Corollary 5.1
Let $x^* \in P$. If the gradients of the constraints with indices $\mathrm{act}(x^*)$ are linearly independent, then all such constraints are necessary. Proof: see [CMP89]. □

Corollary 5.2
The $i$-th inequality is redundant if the system of equations

   $\sum_{j \in \mathrm{act}(x^*)\setminus\{i\}} \lambda_j\, a(j,:) = a(i,:)$

has a solution such that $\lambda \ge 0$. The geometrical interpretation of this Corollary is that the gradient $a(i,:)$ is in the cone generated by the other active constraints (see also Farkas' Lemma). □

Corollary 5.3

Let $\nu \ne 0$ be an arbitrary vector. For each $i \in I$ we define:

   $\rho_i = +\infty$, if $a(i,:)\,\nu \le 0$,
   $\rho_i = \dfrac{b(i) - a(i,:)\,x^*}{a(i,:)\,\nu}$, otherwise,

   $\rho = \min\{\rho_i \mid i \in I\}$.

If $\rho$ is defined by a unique index $k$, then the $k$-th constraint is necessary. For a proof see [CMP89] or [Bon83]. □

The degenerate extreme point strategy [CMP89] integrates the above corollaries, but basically relies on the following theorem and corollary.

Theorem 5.1
Let $i \in \mathrm{act}(x^*)$. The $i$-th inequality is redundant if and only if $x^*$ is an optimal solution to the following linear optimization problem:

   $\max a(i,:)\,x$  subject to  $x \in P_i(x^*) = \{x \mid A(\mathrm{act}(x^*)\setminus\{i\},:)\,x \le b(\mathrm{act}(x^*)\setminus\{i\})\}$

Proof: see [CMP89]. □

It is important to note that the above Theorem is essentially the initial strategy applied to a smaller linear optimization problem. The LP is smaller in the sense that the number of constraints is reduced from $m$ to the number of constraints active at $x^*$.

Corollary 5.4
Let $x^*$ be an extreme point of the polyhedron $P$ with $|\mathrm{act}(x^*)| = c > n$. Let $i \in \mathrm{act}(x^*)$. Consider

   $\sum_{j \in \mathrm{act}(x^*)\setminus\{i\}} \lambda_j\, a(j,:) = a(i,:)$   (2)

Then we can derive the following statements (for a proof see [CMP89]):


(a) The N -th inequality is redundant and all other inequalities are necessary, if

there is a solution to (2) such that P µ §�¨®M9á(� act ������ÿ�N .(b) If (1) has a solution such that for some ¼H� act �����úÿ¤NUM�P�½�¶ ¨ and P µ �¨®M for all á � act ����� ÿy��NUM9¼S� . Then constraint ¼ is redundant and constraintsá�� act ������ÿ­¼ are necessary.

(c) If neither (a) nor (b) are satisfied, then all constraints are necessary.

Our implementation to remove redundant inequalities relies on the initial strategy. In

a future version we plan to integrate the above mentioned corollaries and theorems to

obtain a faster algorithm.

5.3 Forward Propagation Phase

For the forward propagation of a polyhedron through the linear weight layer of a neural network, the image of a polyhedron $P = \{x \mid Ax \le b\}$ under an affine function $f(x) = Wx + \theta$ has to be calculated. A vector $y$ belongs to the image of $f$ if there exists a vector $x$ such that:
$$Ax \le b, \quad y = Wx + \theta$$
With $F = f(P)$ the image of the polyhedron $P$ under the affine transformation $f(x) = Wx + \theta$ is denoted. If $W$ is invertible the computation of the image $F = f(P)$ is trivial, because the problem is then reduced to computing the inverse $W^{-1}$ and $F = \{y \mid A W^{-1}(y - \theta) \le b\}$. If $W$ is not invertible, we compute the image $F = f(P)$ by projecting $P$ onto the subspace $S$, which is the subspace orthogonal to the kernel¹ of $W$, and applying the bijection $f_S$ between $S$ and range$(W)$ to the projected polyhedron $Q$. If $W$ is the matrix of an injection then $\dim(\ker(W)) = 0$; in this case, for a given point $y \in f(\mathbb{R}^n)$, the point $x \in \mathbb{R}^n$ such that $y = f(x)$ is unique.

¹For more details about the required linear algebra background the reader is referred to Appendix B of this thesis.


F = image(P, W, theta)
  % compute a basis for the kernel K and for the orthogonal subspace S
  K = ker(W);  S = ker(K')
  if dim(ker(W)) == 0
      Q = P
  else
      % project P onto S
      Q = proj(P, S)
  end
  % the restriction of W onto the subspace S
  W = W|_S
  let A = Q.A and b = Q.b
  if isSquareMatrix(W)
      F = { y | A*inv(W)*y <= b + A*inv(W)*theta }
  else
      % Injection; W^+ denotes the pseudo-inverse
      F = { y | A*pinv(W)*y <= b + A*pinv(W)*theta }  and  y in range(W)
  end

It remains to explain how to project a polyhedron onto a lower-dimensional subspace.

5.4 Projection of a Polyhedron onto a Subspace

With $Q$ the exact projection of the polyhedron $P$ onto a lower-dimensional subspace is denoted. We also use the expression true projected polyhedron. An approximation is referred to as $\tilde{Q}$.


The projection of a polyhedron onto a subspace is used in a number of different fields.

Among others, polyhedral projection techniques are used in the following areas:

- Parallelizing Compilers. Polyhedral projections are used in advanced compiler techniques, e.g. for loop parallelization or data dependency analysis of arrays. For example, the static parallelization of perfectly nested loops is supported by polytope models [Len93]. As explained in the PhD thesis "Parallelizing Compiler Techniques Based on Linear Inequalities" by Amarasinghe [Ama97], the iteration space of nested loops can be represented with a convex polyhedron. Furthermore he writes: "In our compiler algorithms, we use projection as one of the key transformations in manipulating systems of linear inequalities."

- Optimization. In several works on optimization, polyhedral projection techniques are used. For example, the work by Dyer and Megiddo [DM97] relies on projection techniques to solve a linear programming problem of fixed dimension.

- Neural Network Analysis. We need the computation of the projection of a polyhedron onto a subspace to forward-propagate a polyhedron through a neural network. Additionally, as we shall see in the next chapter, projection methods could be used to overcome numerical problems.

In the literature the Fourier-Motzkin method and the Block-elimination technique are

well known polyhedral projection algorithms.

However, according to the survey paper “Polyhedral Computation: a survey of pro-

jection methods” by Kaluzny [Kal02] currently no algorithm is available to compute

the projection of a polyhedron in polynomial time. In the following we describe the

Fourier-Motzkin approach and the Block-elimination technique. Both methods com-

pute the exact projection. Next, we introduce our method to compute an approximation

of the projected polyhedron. This algorithm has polynomial time complexity. For the

implementation of VPA we use the Fourier-Motzkin algorithm, if it is computationally

feasible; otherwise the newly developed approximation method is applied. The problem

of projecting a polyhedron onto a lower dimensional subspace is defined as follows:


Definition 5.1 Projection of a polyhedron $P$ onto the subspace $S$
Let
$$P = \{(z, x) \in \mathbb{R}^{k} \times \mathbb{R}^{d} \mid Cz + Dx \le b\},$$
where $C$ is an $(m, k)$ matrix, $D$ is an $(m, d)$ matrix and $b$ is a vector with $m$ rows. The numbers of columns $k$ and $d$ correspond to the dimension of the null-space and the image, respectively, i.e. $k + d = n$. The projection of $P$ onto the subspace $S = \mathbb{R}^{d}$ is then given by:
$$Q = \{x \in \mathbb{R}^{d} \mid \exists z \in \mathbb{R}^{k}: (z, x) \in P\} \qquad \Box$$

This definition implies that we can view the projection of a polyhedron onto a lower-

dimensional subspace as variable elimination for a system of linear inequalities. Within

this thesis projection always means an orthogonal projection.

5.4.1 Fourier-Motzkin

Fourier-Motzkin is an algorithm that projects a polyhedron incrementally onto a lower

dimensional subspace. We define a hinge of a polyhedron as follows:

Definition 5.2 Hinge of a Polyhedron

A hinge of a polyhedron is defined as the intersection of two facets, i.e.
$$H = \{x \in P \mid a_i^T x = b_i \text{ and } a_j^T x = b_j\},$$
where $a_i$ and $a_j$ are linearly independent. $\Box$

wDefinition 5.3 Positive and negative facet

Let the vector $u$ be orthogonal to the $(n-1)$-dimensional subspace $S$. A facet of the polyhedron is called positive iff the scalar product between its direction vector $a$ and $u$ is positive; otherwise the facet is called negative. $\Box$


Geometrically, a negative facet means that it is visible from direction $u$, whereas a positive facet is not.

Already in 1827, Fourier [Fou27] proposed a method to eliminate variables of a sys-

tem of linear inequalities. The method did not become widely known (probably as

the time complexity of the method is exponential) and was re-invented by several re-

searchers, e.g. by Motzkin in 1936. The Fourier-Motzkin elimination iteratively re-

moves $k$ variables. Geometrically, this corresponds to an incremental projection of a polyhedron $P$ in dimension $\mathbb{R}^n$ onto subspaces of dimension $\mathbb{R}^{n-1}, \mathbb{R}^{n-2}, \ldots, \mathbb{R}^{n-k}$. Figure 5.2 represents the projection of a two-dimensional polyhedron onto a one-dimensional subspace. It already shows the main idea of the Fourier-Motzkin algorithm, namely the projection of a hinge defined by the intersection of positive and negative facets. In this example the projection corresponds to the removal of the variable $x(2)$; in other words we consider the projection of the polyhedron $P$ onto the subspace $S = \{x \in \mathbb{R}^2 \mid x(2) = 0\}$.

In general the Fourier-Motzkin algorithm eliminates the $k$-th variable in a system of linear inequalities by keeping all inequalities with $a(i,k) = 0$ and combining inequalities with positive components in $k$ with inequalities with negative components in $k$, such that the resulting inequality is zero in the $k$-th component. A single projection step is summarized in the following algorithm. To project onto an $(n-k)$-dimensional subspace, we repeatedly apply this projection step.
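As a small worked example (not taken from the thesis), eliminating $x(2)$ from the system
$$x(1) + x(2) \le 2, \qquad -x(2) \le 0, \qquad x(1) \le 3$$
combines the positive row $x(1) + x(2) \le 2$ with the negative row $-x(2) \le 0$ to give $x(1) \le 2$, keeps the zero row $x(1) \le 3$ unchanged, and then drops $x(1) \le 3$ as redundant; the projection onto the $x(1)$-axis is therefore $\{x(1) \mid x(1) \le 2\}$.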

Figure 5.2: The projection of the polyhedron $P$ onto the subspace $S = \{x \in \mathbb{R}^2 \mid x(2) = 0\}$. Positive facets with respect to the projection direction are plotted solid, negative facets dotted.


Q = fourierMotzkin(P, k)
% Elimination of the k-th variable in the system of linear inequalities
% P = { x | A x <= b }
Ipos  = { i | a(i,k) > 0 }
Ineg  = { i | a(i,k) < 0 }
Izero = { i | a(i,k) = 0 }
% A' and b' are the matrix and the vector of the projected polyhedron Q
A' = [];  b' = [];
for each i in Ipos, j in Ineg
    A' = [A'; a(i,k)*a(j,:) - a(j,k)*a(i,:)]
    b' = [b'; a(i,k)*b(j)   - a(j,k)*b(i)]
end
% add the inequalities of Izero
A' = [A'; a(Izero,:)]
b' = [b'; b(Izero)]
% remove the k-th column of A'
A' = A'(:, [1:k-1, k+1:n])
Q = { x in R^{n-1} | A' x <= b' }
% remove redundant inequalities with a method as explained in Section 5.2
Q = mkNonRedundant(Q)


Complexity Analysis

To project a polyhedron onto an $(n-1)$-dimensional subspace Fourier-Motzkin projects all combinations of hinges of two facets (positive and negative) onto the $(n-1)$-dimensional subspace. In the average case the number of combinations for the projection from an $n$-dimensional space onto an $(n-1)$-dimensional space is $O(m^2)$, where $m$ is the number of facets. Let $k$ be the number of variables we want to eliminate; then the complexity is $O(m^{2^k})$. Additionally, Fourier-Motzkin produces at each projection step a

large number of redundant inequalities. With an immediate removal of the redundant

inequalities the algorithm still scales exponentially. Additionally, as described before,

the removal of redundant inequalities is not a cheap computation, because it requires solving several linear programming problems.

5.4.1.1 A Variation of Fourier-Motzkin

To simplify the notation we use $a_i = a(i,:)^T$ and $a_j = a(j,:)^T$. Additionally, it is assumed that all row vectors of the matrix $A$ are normalised. We define the expression relevant hinge as follows:

Definition 5.4 Relevant Hinge
Let $a_i$ and $a_j$ be the direction vectors of two supporting hyperplanes of $P$ defining a hinge $H$ of the polyhedron $P$. The hinge is relevant iff $\dim(H \cap P) = n - 2$. $\Box$

of the polyhedron � . A hinge is relevant, iff dim � ¹ �(�����H mQB� wOne of the drawbacks of the Fourier-Motzkin algorithm is the generation of a large

number of redundant inequalities. We developed a variation of the Fourier-Motzkin

algorithm, which first determines if a hinge could be relevant before projecting it onto

the lower-dimensional subspace. The computation of the dimension of dim � ¹ �,���is sufficient, because

¹ ��� ü ¹ implies aff � ¹ �n���\ü aff � ¹ �)� ¹ and therefore:

aff � ¹ �(����� ¹ wCalculation of relevant hinges

We have to test whether the hinge is $(n-2)$-dimensional, because two facets can intersect in a face of dimension $n-3, n-4, \ldots, 0$. Figure 5.3 depicts an example of relevant and irrelevant hinges in a two-dimensional scenario.


Figure 5.3: The hinge $H$ between the facets defined by $a_i$ and $a_j$ is outside of $P$. Therefore $H \cap P$ is empty and hence not relevant.

relevant = isRelevant(P, i, j)
relevant = TRUE
A = P.A;  b = P.b
a_i = A(i,:);  a_j = A(j,:)
H = { x in P | a_i x = b(i) and a_j x = b(j) }
if dim(H) ~= n - 2
    relevant = FALSE
end
return relevant

It remains to determine the dimension of the hinge. The problem reduces to the calcu-

lation of the dimension of a polyhedron, because a hinge itself is a face of a polyhedron

and a face is another polyhedron.

Calculation of the dimension of a polytope

Within this thesis we work with bounded polyhedra (polytopes). To calculate the di-

mension of a polytope $P$ the number of linearly independent row vectors of the matrix $A$ is computed. For each linearly independent vector we minimize and maximize in direction $a$ on the polyhedron $P$. If the minimum and the maximum value are equal, then the polyhedron is contained in a hyperplane directed by $a$; otherwise the polyhedron spans in direction $a$. This process is repeated for all linearly independent vectors.

d = dimPolytope(P)
% P = { x | A x <= b }
if P is empty
    d = 0
else
    V = getLinearIndependentRowVectors(A)
    % q is the number of linearly independent direction vectors
    d = 0
    for i = 1 : q
        a = V(i,:)'
        fmin = min a'*x  s.t. x in P
        fmax = max a'*x  s.t. x in P
        % let eps be a small value
        if fmax - fmin > eps
            d = d + 1
        end
    end
end
return d

Complexity Analysis of the Variation of Fourier-Motzkin

The time complexity of our variation is similar to the Fourier-Motzkin and scales ex-

ponentially. The difference between our variation and the original Fourier-Motzkin


approach is that, by computing (possibly) relevant hinges first, we reduce the number

of redundant inequalities. However, this computation is essentially as expensive as first projecting all possible combinations and then removing redundant inequalities. Nevertheless, our variation introduced interesting aspects, namely the definition of a relevant hinge for the projection and the computation of the dimension of a polytope. The notion of relevant hinges is interesting because it would yield a significant improvement once a cheap test to determine relevant hinges becomes available.

As explained before, the incremental projection of a polyhedron results in an exponen-

tial time complexity. In the next section we describe a method, known in the literature

as block elimination, which directly projects a polyhedron onto a lower dimensional

subspace.

5.4.2 Block Elimination

One of the disadvantages of the Fourier-Motzkin method is the cost involved with the incremental projection. A direct projection could be cheaper. The direct projection

can be viewed as elimination of more than one variable at a time and is known as block

elimination in the literature [Kal02]. One of the most recent algorithms is the Balas

block elimination [Bal98].

In the following we recall some fundamental linear programming theorems and lem-

mas, including Farkas Lemma and an alternative of Farkas Lemma, known as Gale’s

Theorem. Finally, a Projection Lemma can be deduced from Gale’s Theorem.

Lemma 5.1 (Farkas Lemma)

Either: (i) $\exists x \ge 0: \; Ax = b$
or: (ii) $\exists y: \; A^T y \ge 0,\; b^T y < 0$


Proof:

In the literature several different proofs of Farkas Lemma are published. The first

complete proof was published by Farkas in several papers, e.g. in [Far02]. Recently,

a proof was published by Dax in the paper “An elementary Proof of Farkas’ Lemma”

[Dax97].

We only show that the two statements are exclusive. Assume (i) and (ii) both hold. Then $A^T y \ge 0$ and $b^T y < 0$, $Ax = b$, $x \ge 0$ have to be fulfilled. But this is a contradiction, since $y^T b = y^T A x \ge 0$ if $A^T y \ge 0$ and $x \ge 0$. $\Box$

Geometrical Interpretation. The claim that $Ax = b$ has a positive solution $x$ implies that $b \in \operatorname{cone}(A)$, i.e. $b$ is in the cone generated by the column vectors of $A$. In other words: $b$ can be expressed as a positive combination of the column vectors of $A$. The alternative states that if $b \notin \operatorname{cone}(A)$, then there is a $y$ such that $A^T y \ge 0$ and $b^T y < 0$, which implies that there is a hyperplane $H = \{z \in \mathbb{R}^m \mid y^T z = 0\}$ that separates $\operatorname{cone}(A)$ and $b$.
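For illustration (our own sketch, not part of the thesis), the alternative that holds for given A and b can be checked numerically with linprog; the function name farkasCase and the feasibility formulation are our own choices:

% Decide the Farkas alternative for A (m x n) and b (m x 1):
% (i) exists x >= 0 with A*x = b, or (ii) exists y with A'*y >= 0, b'*y < 0.
function whichCase = farkasCase(A, b)
n = size(A, 2);
% feasibility test for (i): minimize 0 subject to A*x = b, x >= 0
[x, fval, exitflag] = linprog(zeros(n,1), [], [], A, b, zeros(n,1), []);
if exitflag == 1
    whichCase = 1;   % alternative (i) holds
else
    whichCase = 2;   % by Farkas' Lemma, alternative (ii) must hold
end
end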

Theorem 5.2 (Gale's Theorem)
Either: $\exists x: \; Ax \le b$
or: $\exists y \ge 0: \; A^T y = 0,\; b^T y < 0$

Proof: We want to find the alternative for $\exists x: Ax \le b$.
Write $x = x^+ - x^-$ with $x^+, x^- \ge 0$, and note that $Ax \le b \Leftrightarrow \exists s \ge 0: Ax + s = b$. Therefore we can write:
$$\exists\, x^+, x^-, s \ge 0: \; [A \;\; -A \;\; I_m] \begin{pmatrix} x^+ \\ x^- \\ s \end{pmatrix} = b,$$
where $I_m$ is the $(m, m)$ identity matrix. With the notation $\tilde{A} := [A \;\; -A \;\; I_m]$ and $\tilde{x} := (x^+, x^-, s)^T$ we can write: $\exists \tilde{x} \ge 0: \tilde{A}\tilde{x} = b$. The application of Farkas Lemma leads to:
$$\text{Either: } \exists \tilde{x} \ge 0: \tilde{A}\tilde{x} = b \quad \text{or: } \exists y: \tilde{A}^T y \ge 0,\; b^T y < 0.$$
The alternative can be written as:
$$\exists y: [A \;\; -A \;\; I_m]^T y \ge 0,\; b^T y < 0 \;\Longleftrightarrow\; \exists y \ge 0: A^T y = 0,\; b^T y < 0. \qquad \Box$$

The following Projection Lemma can be deduced from Gale's Theorem:

Theorem 5.3 (Projection Lemma)
The projection of $P$ onto the subspace $\mathbb{R}^d$ is given by:
$$Q = \{x \in \mathbb{R}^d \mid v^T D x \le v^T b \;\text{ for all } v \in \operatorname{extr}(Y)\},$$
where $\operatorname{extr}(Y)$ refers to the set of extreme rays of the projection cone
$$Y = \{v \in \mathbb{R}^m \mid v^T C = 0,\; v \ge 0\}.$$

Sketch of Proof
From Gale's Theorem it can be deduced:
$$\exists z: Cz \le c \;\Longleftrightarrow\; \text{for all } v \ge 0 \text{ with } v^T C = 0:\; v^T c \ge 0.$$
Therefore:
$$\exists z: Cz \le b - Dx \;\Longleftrightarrow\; \text{for all } v \ge 0 \text{ with } v^T C = 0:\; v^T (b - Dx) \ge 0.$$
This leads to:
$$\{x \mid \exists z: (z,x) \in P\} = \{x \mid \text{for all } v \ge 0 \text{ with } v^T C = 0:\; v^T D x \le v^T b\}. \qquad \Box$$

It remains to show that the inequalities of $Q$ are in 1-1 correspondence with the extreme rays of $Y$.

Sketch of Proof.
Let $Q(v) := \{x \in \mathbb{R}^d \mid v^T D x \le v^T b\}$ for $v \in Y$. We claim that:
$$\bigcap_{v \in Y} Q(v) \;=\; \bigcap_{v \in \operatorname{extr}(Y)} Q(v)$$
Let $\{v_1, \ldots, v_p\}$ be the extreme rays of $Y$. Then
$$\bigcap_{v \in \operatorname{extr}(Y)} Q(v) = \{x \in \mathbb{R}^d \mid V D x \le V b\}, \quad \text{where } V = \begin{pmatrix} v_1^T \\ \vdots \\ v_p^T \end{pmatrix}$$
Any arbitrary $v \in Y$ can be expressed as a positive combination of the extreme rays, i.e. $v = \sum_{i=1}^{p} \lambda_i v_i$ with $\lambda_i \ge 0$. Therefore our claim is:
$$\{x \in \mathbb{R}^d \mid \lambda^T V D x \le \lambda^T V b \text{ for all } \lambda \ge 0\} = \{x \in \mathbb{R}^d \mid V D x \le V b\}$$
This is true because $V D x \le V b$ implies $\lambda^T V D x \le \lambda^T V b$ for all $\lambda \ge 0$, and the converse follows by choosing unit vectors for $\lambda$. $\Box$

The projection lemma is often attributed to Černikov [Cer61]. The application of the projection lemma allows us to eliminate $z$ in one step; in other words, we eliminate a block of $k$ variables. It is well known that the above description of $Q$ contains redundant inequalities. Balas [Bal98] observed that once the matrix $D$ of the projection variables is nonsingular, block elimination gives an irredundant representation of $Q$. In [Bal98] an algorithm is described to obtain for any polyhedron an alternative description such that the matrix $D$ is always nonsingular.

However, as stated in [Bal98], the disadvantage of this method is that the structure of $Y$ gets lost with the transformation process. Therefore, in some cases, it may be much more expensive to compute the extreme rays of $Y$.

Given a facet description of a polyhedral cone the dual representation as a set of ex-

treme rays could result in a huge number of extreme rays (combinatorial explosion and

hence an exponential time complexity) [Dut02]. Methods to compute the extreme

rays are described in [FP96].

In the following we apply the block elimination to our problem of projecting the polyhedron $P$ onto the subspace $S = \ker(W)^{\perp}$:

- The matrices $K$ and $S$ contain the basis vectors of orthonormal bases defining the subspaces $\ker(W)$ and $S$, respectively. The combined basis is $B_1 = [K \; S]$; the standard basis is named $B_0 = I_n$.

- $x_{B_0} = B_0 B_1 x_{B_1} = B_1 x_{B_1}$

- $x_{B_1} = (x_K, x_S) = (x_K, 0, \ldots, 0)^T + (0, \ldots, 0, x_S)^T$

- $P = \{x_{B_0} \mid A x_{B_0} \le b\} = \{x_{B_1} \mid A B_1 x_{B_1} \le b\}$, i.e.
$$P = \{(x_K, x_S) \in \mathbb{R}^{k} \times \mathbb{R}^{d} \mid A_K x_K + A_S x_S \le b\},$$
where
$$A_K = \begin{pmatrix} a_1^T k_1 & \cdots & a_1^T k_k \\ \vdots & & \vdots \\ a_m^T k_1 & \cdots & a_m^T k_k \end{pmatrix} \quad \text{and} \quad A_S = \begin{pmatrix} a_1^T s_1 & \cdots & a_1^T s_d \\ \vdots & & \vdots \\ a_m^T s_1 & \cdots & a_m^T s_d \end{pmatrix}$$
Given that the rows of $A$ are normalised and $K$ and $S$ are orthonormal bases, the $j$-th entry of the $i$-th row of $A_K$ expresses the cosine of the angle between $a_i$ and $k_j$. Similarly, the $j$-th entry of the $i$-th row of $A_S$ expresses the cosine of the angle between $a_i$ and $s_j$.

- Using the block elimination gives the following description for the projected polyhedron:
$$Q = \{x_S \in \mathbb{R}^d \mid v^T A_S\, x_S \le v^T b,\; \forall v \in \operatorname{extr}(Y)\}, \qquad Y = \{v \in \mathbb{R}^m \mid v^T A_K = 0,\; v \ge 0\}$$

Therefore the computation of $Q$ reduces to the computation of the extreme rays of the polyhedral cone $Y = \{v \in \mathbb{R}^m \mid v^T A_K = 0,\; v \ge 0\}$. If $A_K$ has a structure which

allows a cheap computation of the extreme rays then the block elimination technique

can be used to compute $Q$. However, this is generally not the case and hence approx-

imation techniques are used instead. Generally, the Block-elimination method as well

as the Fourier-Motzkin approach scale with exponential time complexity [Kal02].

As explained by Kaluzny [Kal02], there is an interesting connection between the Fourier-Motzkin and the Block-elimination technique, which would be worthwhile to investigate further.

An important requirement for the VPA algorithm is to scale within polynomial time


complexity. We developed two techniques to approximate the projected polyhedron $Q$, which guarantee to scale with polynomial time complexity. These methods will be

explained in the subsequent sections. Experimental results and a comparison between

our approximation method and the Fourier-Motzkin algorithm conclude this section.

5.4.3 The S-Box Approximation

As motivated in section 2.3 any approximation used for VPA has to guarantee that the

true image is contained in the computed region.

To obtain a projection method scaling with polynomial time complexity, a polyhedron $\tilde{Q}$ which contains the projected polyhedron $Q$ is computed. This method is based on the observation that we can easily compute a box containing $Q$. Additionally, we can refine the approximation by projecting faces of dimension $n-k-1$ onto the subspace $S$.

The projected polyhedron $Q$ can be approximated with the wrapping box $B = \operatorname{box}(Q)$. Let $S = \operatorname{span}\{s_1, s_2, \ldots, s_{n-k}\}$, i.e. the basis vectors of $S$. We compute $B$ by optimizing in the positive and negative directions of all basis vectors of $S$, i.e. we solve $2(n-k)$ optimization problems of the form $\max\; \pm s_i^T x$ s.t. $x \in P$. The computed optimal points are denoted by $x_i$. We can refine the approximation by projecting the face defined by the intersection of $k+1$ active facets of $P$ at the points $x_i$. The projection of this face is a hyperplane in $S$. This hyperplane is denoted by $H_i$.

projecting the face defined with the intersection of Q���� active facets of � at the points� ë | . The projection of this face is a hyperplane in Í . This hyperplane is denoted with¹ ¥ ë | .This results in a better refinement, because the approximation will contain constraints

of the true projected polyhedron $Q$. The S-Box approximation method has the follow-

ing important properties:

- correct solution once projecting onto a one-dimensional subspace,

- $B$ wraps $Q$,

- the (projected) points $x_i$ are surface points of $Q$,

- there are faces of $P$ at the points $x_i$ which correspond to facets of the projected polyhedron $Q$.
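A minimal Matlab sketch of the wrapping-box step described above (our own illustration, not the thesis code; it assumes P is given by A and b, and the columns of S hold the basis vectors s_i):

% Wrapping box of the projection of P = {x | A*x <= b} onto span(S),
% obtained by maximizing and minimizing along every basis direction s_i.
function [lower, upper, optPoints] = wrappingBox(A, b, S)
nDir  = size(S, 2);
lower = zeros(nDir, 1);  upper = zeros(nDir, 1);
optPoints = zeros(size(A, 2), 2*nDir);        % the optimal points x_i
for i = 1:nDir
    s = S(:, i);
    [xmax, fval] = linprog(-s, A, b);         % maximize s'*x
    upper(i) = -fval;  optPoints(:, 2*i-1) = xmax;
    [xmin, fval] = linprog(s, A, b);          % minimize s'*x
    lower(i) = fval;   optPoints(:, 2*i)   = xmin;
end
end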


In the following we summarize this initial algorithm, referred to as sboxProj.

Qtilde = sboxProj(P, K, S)
% use linear programming to compute the wrapping box of Q wrt. basis S;
% the box is the initial approximation
Qtilde = box(Q)
% with L we denote the computed hyperplanes which build facets of Q
L = []
for i = 1 : 2*(n-k)
    % to ensure polynomial time complexity the following loop is only executed
    % if the number of active facet combinations is below a pre-defined upper bound
    for all combinations N of k+1 active facets at the point x_i
        % the matrix N contains the direction vectors of the active facets
        let N = [w_1, ..., w_{k+1}]
        H_i = projFace(N, S, x_i)
        if isFacetofQ(H_i)
            L = [L; H_i]
        end
    end
end
Qtilde = intersect(Qtilde, L)

It remains to explain how to project a face of dimension $n-k-1$ onto the subspace $S$ and how to determine whether a projected face defines a facet of the true projected polyhedron $Q$. The next two subsections are devoted to these problems.


5.4.3.1 Projection of a face

To project a full-dimensional polyhedron directly onto an $(n-k)$-dimensional subspace, we have to consider the $(n-k-1)$-dimensional faces of $P$, which are defined by the intersection of $k+1$ facets. Therefore the number of possible combinations is $\binom{m}{k+1}$.

We want to project an $(n-k-1)$-dimensional face of the polyhedron $P$ onto a $d$-dimensional subspace $S$, where $d = n-k$. The face is defined by the intersection of $k+1$ facets of $P$. With $x_i$ we denote a point on the face.

The basis of the $(n-k-1)$-dimensional face is orthogonal to the direction vectors of the $k+1$ facets. Thus, a basis $F$ for the face can be defined by $F = \ker(N)$, where $N = [w_1, \ldots, w_{k+1}]$ and $w_i$ is the direction vector of a facet of $P$. We project the basis vectors of $F$ onto the $d$-dimensional subspace $S$ and obtain the basis $F_S$ (represented as a matrix), which in turn defines the basis of a hyperplane $H$ in $\mathbb{R}^d$. A direction vector $v$ of $H$ is computed as the kernel of $F_S$, i.e. $v = \ker(F_S)$. Of the two possible direction vectors ($\pm v$) for $H$, we select the one which agrees with the direction vector of the projected face. To determine the correct direction we project a direction vector $w$ of one of the $k+1$ facets of $P$ onto $S$. The angle between $v$ and $w_S$ has to be less than $90^{\circ}$, i.e. $v$ is computed by $v = \operatorname{sign}(\langle v, w_S \rangle) \cdot v$. Finally, to determine the position of $H$, such that $Q \subseteq H^-$, we project the point $x_i$ onto $S$.


[v, r] = projFace(N, S, x)
% compute the basis F for a face defined by the intersection of k+1 facets of P
F = ker(N)
% project all basis vectors of the face onto S;
% F_S are the projected vectors of F with respect to the standard basis
F_S = proj(F, S)
% project the point x onto S
x_S = proj(x, S)
% compute one orthogonal vector for the projected basis vectors
v = ker(F_S)
% determine the correct sign for v
w   = N(:, 1)
w_S = proj(w, S)
v   = sign(<v, w_S>) * v
% it remains to define the correct position of the projected face
r = v' * x_S

The next section describes how to determine if a projected face builds a facet of the

projected polyhedron.

5.4.3.2 Determination of Facets of $Q$

As illustrated in Figure 5.4, we have to decide which projected faces are facets of the (true) projected polyhedron $Q$. The following cases are possible for a projected face:

1. The projected face is a facet of $Q$.

2. The projected face is a hyperplane which cuts $Q$.

3. The projected face is redundant in $Q$.


Figure 5.4: The hyperplane $H_2$ cuts the projected polyhedron $Q$, the hyperplane $H_1$ defines a face of $Q$, and the hyperplane $H_0$ is redundant for the polyhedron $Q$.

Fourier-Motzkin projects a polyhedron incrementally onto a lower-dimensional sub-

space. This allows the classification into non-negative and negative facets. With a

direct projection such a classification is not possible. Thus, for the direct projection of

an $(n-k-1)$-dimensional face onto a lower-dimensional subspace, all possible $\binom{m}{k+1}$ combinations of $k+1$ active facets at an extreme point $x \in P$ are considered. This leads to the additional problem that projected faces, which build a hyperplane in the subspace $S$, can cut the true projected polyhedron $Q$. An example is illustrated in Figure 5.4.

algorithm to decide whether a projected face belongs to � reduces to determine those

hyperplanes, which cut � . The polyhedron � itself is unknown. We only know � and

the subspace Î . The algorithm is based on the so called “Straddie”-Theorem 2

We want to project a polyhedron �ÔªTî © onto the subspace Îv�³���B�(î © � �¯� è ���"¨¦� .In the following Theorem,

¹ U refers to a hyperplane through an extreme point X of

the polyhedron � . The direction vector of¹ U is defined by the projecting of the face,

obtained by the intersection of Q�� � active facets of � at a point X , onto the subspace

2I named this Theorem after the place where I developed it: in a coffee-shop of the beautiful north

strand-broke island, which is located at Brisbane’s coast and called by the locals “Straddie”.

Page 154: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

136 Chapter 5. Affine Transformation Phase

Î . Hence, the direction vector ê of¹ U is characterized by ê¼� è ����¨ , i.e. ê��nÎ .

Theorem 5.4 (The Straddie Theorem)
The hyperplane $H_v$ with direction vector $v \in S$ cuts the projected polyhedron $Q$ iff it cuts the original polyhedron $P$.

Sketch of Proof
We first show that the statement is true for "$\Rightarrow$", i.e.: if $H_v$ cuts $Q$ then $H_v$ cuts $P$.

If $H_v$ cuts $Q$ then $Q \cap H_v^+ \ne \emptyset$ and $Q \cap H_v^- \ne \emptyset$. Hence there are points $p, q \in Q$ that lie strictly on opposite sides of $H_v$. For $p$ and $q$ there exist points $\tilde{p} \in P$ and $\tilde{q} \in P$ such that the orthogonal projections of these points onto $S$ are $p$ and $q$, respectively. Since the direction vector of $H_v$ satisfies $v \in S$, the dot products satisfy $v^T p = v^T \tilde{p}$ and $v^T q = v^T \tilde{q}$, so $\tilde{p}$ and $\tilde{q}$ also lie strictly on opposite sides of $H_v$. Therefore $H_v$ also cuts $P$.

The proof for "$\Leftarrow$" is analogous. $\Box$

This Theorem allows us to determine if

$H_v$ cuts the (unknown) polyhedron $Q$ by solving two linear optimization problems on the (known) polyhedron $P$. The algorithm isCutofQ is summarized below.

[iscut] = isCutofQ(H, P)
iscut = FALSE
fmax = max H.v' * x  s.t.  x in P
fmin = min H.v' * x  s.t.  x in P
if fmax > H.r  and  fmin < H.r
    iscut = TRUE
end


5.4.3.3 Further Improvements of the S-Box method

The following list is a summary of ideas to improve the initial S-Box approximation.

- Use each non-null row vector of $A_S$ as a starting vector to compute an orthonormal basis. The approximation is then given by the intersection of up to $m$ S-Boxes. The motivation for this strategy is that the basis is chosen with respect to the structure of the polyhedron $P$.

- A possible heuristic for the selection of another basis is to select the one most different from the previous bases. The first basis vector is computed as the normalised sum of all previous basis vectors, i.e. $s_1 = \frac{\sum_i s_i^{prev}}{\|\sum_i s_i^{prev}\|}$. We use the QR-algorithm to compute the remaining $(n-k-1)$ basis vectors for the new basis (see the sketch below). Theoretically, we can compute the polyhedron $Q$ with a finite number of different wrapping boxes $B_i$, where each index $i$ corresponds to a basis of $S$, i.e. $Q = \bigcap_i B_i$.
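The following Matlab fragment illustrates the basis-construction heuristic from the second item (our own sketch; the names nextBasis and prevBasis are hypothetical):

% Build a new orthonormal basis whose first vector is the normalised sum
% of all previous basis vectors; the remaining vectors come from a QR factorization.
function Snew = nextBasis(prevBasis)
s1 = sum(prevBasis, 2);
s1 = s1 / norm(s1);                        % normalised sum of previous basis vectors
d  = size(prevBasis, 1);
% complete s1 to an orthonormal basis of R^d via QR
[Qfull, R] = qr([s1, eye(d)]);             % first column of Qfull is +/- s1
Snew = Qfull(:, 1:size(prevBasis, 2));     % keep as many vectors as before
end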


5.4.4 Experiments

The next listing is an example for the projection of a five-dimensional polyhedron onto

a two-dimensional subspace. We denote with $Q$ the result computed with Fourier-Motzkin and with $\tilde{Q}$ the computed approximation. The last column represents the constant vector $b$ and the previous columns contain the values of the matrix $A$ of each polyhedron.

Listing 5.1 Projection Example

 1      Q: a(:,1)   a(:,2)        b     Q~: a(:,1)   a(:,2)        b
 2         0.5898   0.8075   6.6036        -0.9931   0.1170   4.1406
 3         0.5654   0.8248   6.7770        -0.9655  -0.2602   3.3250
 4        -0.7751   0.6319   6.9353         0.9810   0.1941   3.7192
 5        -0.6772   0.7358   7.6160         0.9228  -0.3853   3.1950
 6        -0.6992   0.7150   7.4840        -0.0157  -0.9999   3.1576
 7        -0.9931   0.1170   4.1406        -0.6772   0.7358   7.6160
 8        -0.9655  -0.2602   3.3250         0.0949   0.9955   8.7265
 9        -0.7662  -0.6426   3.4685
10        -0.8556   0.5176   6.2850
11         0.9303   0.3669   4.1926
12         0.9810   0.1941   3.7192
13        -0.0157  -0.9999   3.1576
14         0.9228  -0.3853   3.1950
15        -0.2334  -0.9724   3.1056
16         0.0949   0.9955   8.7265
17         0.1298   0.9915   8.6568

All computed inequalities of the approximation are also inequalities of the true projected polyhedron; e.g. compare line 2 and line 7 in the above listing. The log-file output shows that it took 600 seconds to compute $Q$, but only around 2.5 seconds to compute the approximation. Additionally, the approximation is good, as 92.76% of the volume of the approximation consists of points of the true projection $Q$. The output of the log-file is as follows:

=================================================

Reference File:P5TO2No1

Input-Dimension: 5

Calculation Time in s Fourier-Motzkin: 600.19

Calculation Time in s S-Box Approximation: 2.57

=================================================

Volume Polyhedron: 6.8774

Volume Approximation: 7.4141

Ratio: 0.9276

==================================================


Dimension    Fourier-Motzkin (in sec.)    S-Box Approximation (in sec.)    Volume-Ratio
5 -> 4              50.4564                        11.8630                    0.8779
5 -> 3            2574.2155                         4.4634                    0.8879
5 -> 2            3746.8898                         1.5983                    0.9131

Table 5.1: Computation times for the projection of a polyhedron onto a lower-dimensional subspace, comparing Fourier-Motzkin and the S-Box approximation.

The figure below is a visualization of the polyhedron $Q$ and the approximation $\tilde{Q}$.

Figure 5.5: Approximation and projected polyhedron. The projected polyhedron $Q$ is contained in the approximation.

Ten randomly generated polyhedra in dimension five³ have been projected onto a four-, three- and two-dimensional subspace. We compared the result of the approximation with the Fourier-Motzkin approach. The columns of Table 5.1 contain the average

computation times (in seconds) for the corresponding method. The column Volume-

Ratio is computed by dividing the volume of $Q$ by the volume of $\tilde{Q}$. This means the

closer this value is to one, the better the approximation.

³We have chosen dimension five as Fourier-Motzkin is very slow. Additionally, the exact volume

computation scales exponentially.


5.5 Further Considerations about the Approximation of the

Image

For completeness we describe in this section further theoretical ideas, which were de-

veloped during the time of this PhD project. Due to time constraints, we did not explore these theoretical observations in more detail. Additionally, the S-Box approximation already provides a satisfactory solution for the approximation of the projected polyhedron $Q$.

With the following important observation it is possible to approximate $F$:
$$\max_{y \in F}\; c^T y = \max_{x \in P}\; c^T (Wx + \theta),$$
i.e. this observation allows us to approximate $F$ from outside. For an arbitrary directed hyperplane $H$, we can determine the correct position such that $F \subseteq H^-$.

An approximation of $F$ is obtained by solving $p$ optimization problems of the form $\max_{x \in P}\; (a(i,:)\, W^{+})(Wx + \theta)$, where $i \in \{1, \ldots, p\}$ and $p$ is the number of inequalities defining $P$. This approximation contains the projected polyhedron.

To obtain a better approximation, these observations can be integrated into the S-Box approximation process, as both strategies approximate $F$ from outside.

5.6 Summary of this Chapter

We started this chapter with a solution to compute the reciprocal image of a polyhe-

dron � under an affine transformation. As described, this problem can be solved by

basic matrix operations, but requires removal of redundant inequalities to keep a non-

redundant and compact description of the polyhedron.

This was followed by a discussion of the computation of the image of a polyhedron under an affine transformation. Projection techniques are used to compute the image if the dimension of $\ker(W)$ is bigger than zero. The projection of a polyhedron onto a

lower-dimensional subspace itself is an interesting research question. We developed

an approximation of the projected polyhedron, which scales in polynomial time com-


plexity. Experimental results indicated that the computed approximation is relatively

good. The contributions of this chapter are summarized in the following box:

- Contributions Chapter 5 -

- Computation of the image of a polyhedron under an affine transformation by applying projection techniques, if necessary.

- Development of the S-Box approximation technique, which scales in polynomial time. For the development of this technique the following algorithms and ideas are required:

  – Projection of an $(n-k-1)$-dimensional face of a polyhedron onto a lower-dimensional subspace.

  – The Straddie Theorem to determine if a projected face is a facet of the true projected polyhedron.

- Approximation of the image of a polyhedron under a linear transformation when $\dim(\ker(W)) > 0$. We developed two strategies to approximate $F$: firstly by applying $f_S$ to the approximation $\tilde{Q}$, and secondly by approximating $F$ directly by solving optimization problems of the form $\max_{x \in P}\; (a(i,:)\, W^{+})(Wx + \theta)$, where $i \in \{1, \ldots, p\}$ and $p$ is the number of rows of the matrix $A$.


Chapter 6

Implementation Issues and

Numerical Problems

In the first part of this chapter the structure of a general software framework for region-

based refinement methods is described and it is explained how to use this framework

and how to “plug-in” different refinement methods, such as VIA and VPA.

The second part of this chapter is devoted to numerical problems which always occur

when implementing mathematical algorithms on digital machines with finite precision.

6.1 The Framework

The core of refinement-based neural network validation algorithms is to forward- and

backward propagate regions successively through all layers of a neural network.

This can be expressed in a loop which alternates between a forward and a backward

propagation. The loop stops, if a pre-defined number of iterations is reached, or if no

significant refinement was computed. A prototype of the framework is implemented

with Matlab [Mat00c]. For clarification the original source code is simplified. An im-

portant abstraction concept in Matlab are function handles. A function handle can be

viewed as a reference to the corresponding function. Hence, by using function handles

in Matlab, general code can be written.

To represent a neural network the Matlab neural network structure is used. We mod-

ified this structure, by adding the sub-structure annotatedLayer, which contains the


computed pre- and postconditions. The relevant attributes of the net-structure and the

sub-structure are as follows:

Listing 6.1 The important part of the net structure.

1 % input
2 inputs:{1x1 cell} of inputs, with
3 net.inputs{1}.range defining the operating input range
4 outputs:{1x2 cell} containing 1 output, with
5 net.outputs{1}.range defining the operating output range
6 % weight and bias values:
7 IW: {2x1 cell} containing 1 input weight matrix
8 LW: {2x2 cell} containing 1 layer weight matrix
9 b: {2x1 cell} containing 2 bias vectors
10 % our modification
11 annotatedLayer:{#layers} with the attributes:
12 preCondition
13 postCondition

The following call-sequence diagram provides an overview of the structure of the

framework.

Figure 6.1: Overview of the framework. The main function passes the function handles and the neural network structure to the refine function. The refine function calls the forwardStep and the backwardStep functions, which use the function handles. An annotated neural network is returned.


Listing 6.2 The refinement loop.

1 function [aNet]=refine(aNet,fhForLin,fhForNonLin,...
2                        fhBackLin,fhBackNonLin,fhVolume)
3 % initial stuff is here ...
4 while ( (noIterations < gv_MaxIteration) & ...
5         ( abs(volPx0-volPx1) > eps | abs(volPy0-volPy1) > eps))
6   if alternationFlag=='forward'
7     % Forward Step
8     yRegion_prev = yRegion_next;
9     aNet = forwardStep(aNet,fhForLin,fhForNonLin);
10    yRegion_next = aNet.annotatedLayer{1}.postCondition;
11    alternationFlag = 'backward';
12    volPy0 = volPy1;
13    volPy1 = feval(fhVolume,yRegion_next);
14  else
15    % Backward Step
16    xRegion_prev = xRegion_next;
17    aNet = backwardStep(aNet,fhBackLin,fhBackNonLin);
18    xRegion_next = aNet.annotatedLayer{1}.postCondition;
19    volPx0 = volPx1;
20    volPx1 = feval(fhVolume,xRegion_next);
21  end
22  noIterations = noIterations + 1;
23 end

An important part is the computation of the volume (see line 13 and line 20 of List-

ing 6.2) of a region (also note that we use a function handle to call a volume function

dependent on the region). If regions are hypercubes the volume computation is trivial.

For the computation of the volume of a polyhedron the reader is referred to the work

by Lasserre [Las83]. A computationally cheaper alternative to the volume computation is a comparison with the computed set of inequalities. This should be applied

for higher-dimensional cases, because the exact volume computation of a polyhedron

scales exponentially with the dimension of the polyhedron.
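As an illustration of the trivial hypercube case (our own sketch, not the thesis code), a volume function for box regions stored as [lower, upper] bounds could be passed to refine via a function handle:

% Volume of an axis-aligned box given as a 2-column matrix [lower upper].
function v = boxVolume(box)
v = prod(box(:,2) - box(:,1));
end

% usage: fhVolume = @boxVolume;  v = feval(fhVolume, [0 1; -2 2]);   % v = 4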

In an initial phase the neural network is annotated according to the operating input and

output space by using the refinement process. Later the user can specify initial regions

in the input and/or output space, and a correctly annotated version of the neural network,

according to the user specification, is computed.

The refinement loop performs forward and backward steps until no significant refine-

ments are observed. In the sequel, we describe the forward step. This is sufficient as

the backward-step implementation is analogous. A forward step is used to forward

propagate a region, starting from the input layer, through all layers of a feed-forward

neural network. This is implemented with a for-loop, which alternately computes the


image of a polyhedron through the linear weight layer, followed by a computation of

the image of a polyhedron $P$ through the non-linear layer.

During the execution of the loop the pre- and postconditions of each layer get updated.

We implemented the update of the pre- and postconditions with a function named

annotateNN. The input parameters to this function are the annotated neural network

structure, the refined region, the layer and a parameter indicating if the pre- or post-

conditions are updated. The output is the updated annotated neural network structure.

Listing 6.3 forwardStep

1 function [aNet]=forwardStep(aNet,fhForLin,fhForNonLin)
2 for layer=1:aNet.numLayers
3   if layer==1
4     % Input Layer
5     W = aNet.IW{1,1};
6     theta = aNet.b{1};
7   else
8     W = aNet.LW{layer,layer-1};
9     theta = aNet.b{layer};
10  end
11  [Ry] = feval(fhForLin,aNet,layer,NP,W,theta);
12  % annotate the network with the refined region
13  % obtained from the linear transformation
14  aNet = annotateNN(aNet,Ry,layer,'preCondition');
15  % non-linear phase
16  [Ry] = feval(fhForNonLin,aNet,layer,NP);
17  % update the postcondition of the layer
18  aNet = annotateNN(aNet,Ry,layer,'postCondition');
19 end

For the computation of the forward (backward) propagation through a linear or

non-linear layer different algorithms can be used. It is easy to integrate different algo-

rithms due to the generality of the framework. For example, to use the VIA implemen-

tation the refine function is called with the following parameter settings.


Listing 6.4 mainVIA

1 % load the global parameters for via
2 gv_VIA;
3 fhForwardLin = @via_forwardLin;
4 fhForwardNonLin = @via_forwardNonLin;
5 fhBackwardLin = @via_backwardLin;
6 fhBackwardNonLin = @via_backwardNonLin;
7 Bx = net.inputs{1}.range;
8 % Example for an initial region of the output space for
9 % a single output node
10 By = [0 0.5-eps];
11 region.input = Bx;
12 region.output = By;
13 [aNet] = refine(net,region,fhForwardLin,fhForwardNonLin,...
14                 fhBackwardLin,fhBackwardNonLin);

To call the implementation of Validity Polyhedral Analysis (VPA), the following

settings are defined.

Listing 6.5 mainVPA

1 % load the global parameters for vpa
2 gv_VPA;
3 fhForwardLin = @vpa_forwardLin;
4 fhForwardNonLin = @vpa_forwardNonLin;
5 fhBackwardLin = @vpa_backwardLin;
6 fhBackwardNonLin = @vpa_backwardNonLin;
7 % Example: Px and Py are initialised with results
8 % obtained with VIA
9 inLayer = 1;
10 outLayer = 3;
11 Px = aNet.annotatedLayer{inLayer}.preCondition;
12 Py = aNet.annotatedLayer{outLayer}.postCondition;
13 region.input = Px;
14 region.output = Py;
15 [aNet] = refine(aNet,region,fhForwardLin,fhForwardNonLin,...
16                 fhBackwardLin,fhBackwardNonLin);

In the next section we discuss numerical aspects of the implementation.

6.2 Numerical Problems

Mathematical algorithms, involving the computation with real numbers, require an

analysis of numerical properties when implementing them on a computer. Digital

computers are finite machines, and hence representations of real numbers are just


approximations. Additionally, to represent a function on a computer only two ways

are possible [Van83].

- implementation by a table including the function values,

- approximation of the function with the basic operations a computer can perform: addition, subtraction, multiplication and division.

These aspects lead to the field of “numerical computations”, which deals with numeri-

cal issues when implementing a mathematical algorithm on a machine with finite pre-

cision. For an introduction the reader is referred to the book by Vandergraft [Van83].

Our implementation uses sigmoidal functions and relies on linear programming

techniques. This could cause numerical problems. The following is an example:

Listing 6.6 numerical example

1 >> invlogsig(0.99999999999999)
2 ans = 32.2370
3 >> invlogsig(0.9999999999999999)
4 ans = 36.0437
5 >> invlogsig(0.99999999999999999)
6 ans = Inf
7 >> x0=0.99999999999999
8 >> x1=0.9999999999999999
9 >> x2=0.99999999999999999
10 >> x1-x0
11 ans = 9.8810e-015
12 >> x2-x1
13 ans = 1.1102e-016

The above example illustrates that even very small round-off errors cause incor-

rect results. The implementation of VPA relies on the linear programming function

linprog of the Matlab optimization toolbox [Mat00b]. Rounding errors of the above magnitude often occur when using linprog. For example, when backpropagating a polyhedron, computing a value for one component that deviates from the true value only in the last decimal places introduces a numerical error. As we propagate polyhedra through all

layers of the neural network even small numerical errors get quickly magnified. Con-

sequently, the neural network would be annotated with incorrect polyhedral regions.

In the literature, mathematical problems which are sensitive to small changes in the data are called unstable or ill-conditioned problems [Van83].


Numerical Work Around

The following definitions are used in this paragraph.

Definition 6.1 Stable and Unstable Numerical Range

We use the expression stable numerical range to state that the computation within this range is considered numerically stable. Otherwise we use the expression unstable numerical range. $\Box$

Definition 6.2 Stable and Unstable Components
Components of a vector or matrix whose values are related to unstable numerical ranges are called unstable components. Otherwise the components are referred to as stable. $\Box$

As explained in Chapter 4, during the non-linear phase a polyhedron is backpropagated

through a vector of sigmoidal transfer functions. Let $c$ denote the optimization vector. If a component of the wrapping box is classified as numerically unstable, then the corresponding entry of the vector $c$ is set to zero and the polyhedron is projected onto the

subspace defined by the numerically stable components.

[c, P] = numWA(c, P)
B = box(P)
% u ... vector of unstable components, s ... stable components
[u, s] = getUnstableComponents(B)
c(u) = 0
P = proj(P, s)

This numerical “work-around” still guarantees that the computed wrapping polyhedron

will contain the non-linear region $R$, because components with $c(u) = 0$ are irrelevant for the solution of the optimization problem. Furthermore, thanks to the convexity of polyhedra, the projection of a polyhedron onto a subspace corresponds to the volume-wise biggest slice of the polyhedron (with respect to the projection). Finally, the projection approximation also ensures that the computed polyhedron contains the true projection.


This numerical work-around is appealing because it reuses theories and algorithms that were developed within this thesis for originally different purposes.

6.3 Summary of this Chapter

In this chapter the design and the implementation in Matlab of a general framework

for region-based refinement algorithms was introduced. In particular the refinement

function and the functions for the forward and backward-propagations have been gen-

eralised. This framework is easy to use and allows to plug-in different implementations

for the forward and backward computation of regions through a feed-forward neural

network.

Furthermore, numerical problems have been discussed and solutions to overcome nu-

merical difficulties with VIA or VPA have been introduced.

Contributions - Chapter 6 -

- Design and implementation of a general framework for region-based refinement algorithms.

- Implementation of Validity Interval Analysis (VIA) and Validity Polyhedral Analysis (VPA) within this framework.

- Discussion of numerical difficulties for the VIA and VPA algorithms.

- Work-around to stabilize an implementation by introducing the concept of stable and unstable components, and by applying polyhedral projection techniques.

The results of the following evaluation chapter rely on the implementation of VIA

and VPA. The current implementation is only a prototype and is restricted to the propagation of a single polyhedron. Furthermore, the approximation of the non-linear region is obtained with a single polyhedron. However, in later versions the implementation will be extended to use finite unions of polyhedra. Additionally, in the current implementation not all numerical issues are solved satisfactorily.


Chapter 7

Evaluation of Validity Polyhedral

Analysis

In this chapter, the Validity Polyhedral Analysis (VPA) approach developed in this thesis is contrasted with Validity Interval Analysis (VIA) [Thr93]. VPA was evaluated on a neural network referred to as the circle neural network. This network was

trained on an artificially created data set. Further evaluations were performed on sev-

eral benchmark data sets of the UC Irvine database, and on a neural network trained to

predict the SP500.

Section 7.1 explains the general evaluation procedure. Section 7.2 presents the results

for VIA and VPA on the circle neural network. Section 7.3 discusses the evaluation on

the Iris and Pima data sets. Section 7.4 explains the application of VIA and VPA on a

neural network trained to predict the SP500 stock-market index.

7.1 Overview and General Procedure

The literature review highlighted the fact that benchmarks for rule extraction methods reported and compared results for data sets rather than for neural networks; that is, rule extraction methods were contrasted using different neural networks trained on the same task. Unfortunately, as far as we know, no trained benchmark neural networks are available.

The central idea of validity polyhedral analysis is to annotate a neural network with valid regions. As such, we are able to provide statements about the neural network behaviour which are guaranteed to be correct. This is a different focus compared to propositional or fuzzy rule extraction. Neural network validation methods should be compared at the neural network level, because a validation process should be independent of the training and data pre-processing phase.

However, this would have seriously disadvantaged rule extraction techniques requiring special neural network architectures. Functions of the Matlab Neural Network Toolbox [Mat00a] were used to train the neural networks. To improve the generalization of the neural networks, early stopping and regularization were applied by defining an appropriate validation set and by using the Matlab functions trainbr and trainbfg. Regularisation modifies the performance function, which is usually the mean squared error function. According to the Matlab documentation [Mat00a], a term consisting of the weights and biases of the neural network is added to the performance function. Using the modified performance function ensures that the neural network will have smaller weight and bias values, which helps to reduce the risk of overfitting the training data.
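The modified performance function can be illustrated with a small Matlab sketch; the mixing parameter gamma and the exact form of the added term are assumptions and may differ from the definition used by the toolbox.

    function perf = regPerformance(err, w, b, gamma)
    % Sketch of a regularised performance: mean squared error blended with
    % the mean squared weights and biases (gamma in [0,1] is a mixing term).
    perf = gamma * mean(err(:).^2) + (1 - gamma) * mean([w(:); b(:)].^2);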

All available data was used for the training process of the Iris and Pima benchmarks.

For the circle task we randomly generated points in the input-space and used them

as training examples. The SP500 neural network used data collected from the stock-

market. The data was split into training, validation and test data.

The evaluation process followed the same structure for all experiments. In an initial

phase the neural network is annotated by propagating the operating input and output

space through the layers of the neural network. VIA was used to refine regions accord-

ing to initial, user-defined restrictions in the input space or the output space. VPA was

then applied to obtain further refinements.
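To make the forward step concrete, the following Matlab sketch propagates an axis-parallel box through one weight layer with a sigmoidal transfer-function using interval arithmetic. It is written in the spirit of an interval-analysis forward step and is only an illustration, not the thesis implementation.

    function [ylo, yup] = forwardBox(W, b, lo, up)
    % Propagate the box [lo, up] through y = logsig(W*x + b) with interval
    % arithmetic: split W into its positive and negative parts, bound the
    % net input, and use the monotonicity of the sigmoid.
    Wp = max(W, 0);  Wn = min(W, 0);
    netlo = Wp * lo + Wn * up + b;
    netup = Wp * up + Wn * lo + b;
    ylo = 1 ./ (1 + exp(-netlo));
    yup = 1 ./ (1 + exp(-netup));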

To describe each evaluation, the task is explained, the neural network architecture is described, and finally the VIA and VPA results are discussed. A visualization of regions is presented to help understand the neural network behaviour. For the visualization of higher-dimensional spaces projection techniques are useful. The circle task is explained in detail and links this chapter to the concepts discussed in Chapter 1 and Chapter 2, in particular to the idea of an annotated neural network and polyhedral pre- and postconditions, as well as to the discussion about interpreting numerical rules.

7.2 Circle Neural Network

This artificial example demonstrates in a simple manner the core idea of validating the behaviour of neural networks. It is called an artificial example because we know a priori exactly which rule the neural network has to learn.

The Task

A neural network had to distinguish between points inside and outside a circle, i.e. to learn whether ||x − c|| ≤ r for a given center c and a radius r. To make it more interesting, assume that the center of the two-dimensional input space represents the current position of an airplane. A system should warn the pilot about approaching airplanes within a certain distance. From a neural network training perspective this problem reduces to classifying points inside and outside a circle. The feed-forward neural network which learnt this task is referred to as the circle neural network.
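A minimal Matlab sketch of how training data for this task can be generated is shown below; the centre, the radius and the number of points are assumptions, since the exact values used for the circle network are not restated here.

    % Random training points in [-1,1]^2, labelled 1 inside and 0 outside the circle.
    n = 1000;
    X = 2 * rand(2, n) - 1;
    c = [0; 0];  r = 0.5;                    % assumed centre and radius
    d2 = sum((X - repmat(c, 1, n)).^2, 1);   % squared distance to the centre
    T = double(d2 <= r^2);                   % target values for the network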

The Neural Network Architecture and the Learning Process

The architecture is a two-weight layer neural network with the following properties.

- Two-dimensional input space, with input between [-1,1] for each dimension.

- A five-dimensional hidden layer with sigmoidal transfer-function.

- One output node with sigmoidal transfer-function. The output is in the interval [0,1].

The neural network was trained to predict a value greater than 0.5 for points within the circle and a value less than 0.5 for points outside the circle.

Application of VIA and VPA

Assume a user of the neural network is interested in the following question:

When does the neural network warn the pilot about approaching airplanes?

The developer of the neural network has to interpret this natural language specification. In this case it is trivial: the user is interested in the output interval y ∈ [0.5, 1].

Page 172: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

154 Chapter 7. Evaluation of Validity Polyhedral Analysis

The tester has to validate which regions in the input space of the neural network predict

a value in the above interval. In the first phase, the relevant output region is expressed

as a polyhedron (in this case simply an interval):

P_y = { y : y ≤ 1, −y ≤ −0.5 }

The next steps are to apply VIA and VPA with the defined initial restriction in the

output space (and no restriction in the input space). VIA computes the following box

in the input space:

B_x, an axis-parallel box in the two-dimensional input space.

VPA is applied after the termination of VIA and computes a refined polyhedral region

in the input space:

P_x, a refined polyhedral (in fact triangular) region contained in B_x.

The reciprocal image of the pre-defined output condition was computed by back-propagating the initial region through all layers of the neural network. This was followed by a refinement process of forward and backward propagation of the regions until no further refinements in the input or output space were observed. Finally, these polyhedral rules are obtained (see Figure 7.1 for a visualisation of the input space):

if y ∈ P_y then x ∈ B_x   (1)
if y ∈ P_y then x ∈ P_x   (2)
if x ∉ B_x then y ∉ P_y   (3)
if x ∉ P_x then y ∉ P_y   (4)

As discussed in Section 1.4, the above rules are valid statements about the neural network behaviour. In our case we started with a backward step, because the initial region was defined in the output space. Hence, the output region is viewed as the precondition and the computed region in the input space as the postcondition (rules (1) and (2)). Section 1.4 described that a measure for the strength of a pre- or postcondition is the volume of the computed region. In the above case the volume of the box is 0.3453, compared to the stronger postcondition, the triangle (simplex), with a volume of 0.1599.
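Since the volume is used as a measure of strength, a simple way to obtain such a number is a Monte-Carlo estimate over a bounding box; the following Matlab sketch is only an illustration and not the volume computation used in the thesis.

    function v = mcVolume(A, b, lo, up, n)
    % Monte-Carlo estimate of the volume of {x : A*x <= b} inside the
    % axis-parallel box [lo, up], based on n uniformly drawn sample points.
    d = length(lo);
    X = repmat(lo, 1, n) + repmat(up - lo, 1, n) .* rand(d, n);
    inside = all(A * X <= repmat(b, 1, n), 1);
    v = prod(up - lo) * mean(inside);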


In Chapter 2 an introductory example explained that, usually, an interpretation process

on numerical rules is necessary to obtain more human readable representations. An

interpretation of the above rules allows the following valid statements about the neural

network behaviour.

1. If the neural network warns the pilot about approaching airplanes, then the other airplane is within the box defined by B_x.

2. If the neural network warns the pilot about approaching airplanes, then the other airplane is within the triangle defined by P_x.

3. If the input is outside the box B_x, then the neural network will not send a warning to the pilot.

4. If the input is outside the simplex P_x, then the neural network will not send a warning to the pilot.

It is important to notice that the correctness of statements 3 and 4 is guaranteed, be-

cause VIA and VPA are approximating the true region from outside. In other words:

the algorithms ensure that the true region is contained in the approximated polyhedral

region. Figure 7.1 is a visual representation of the neural network behaviour. Within

the two-dimensional input space the computed box and the computed triangular region

are shown. Points represent the neural network output value. Dots indicate that this

input corresponds to a neural network output less than 0.5 and the points plotted with

the “+” sign represent that the corresponding neural network output is greater than

0.5. It is important to notice that the neural network did not learn the task correctly, as

there are points within the circle which are classified with a value less than 0.5, i.e. the

neural network would not warn the pilot of approaching airplanes. With the VIA result

this misclassification remains undetected, whereas the computed VPA result proves the

misclassification. Additionally, this example highlights the difference between testing

a neural network and proving properties about the neural network. A test by sampling points might not have detected the misclassification.


Figure 7.1: Visualisation for the behaviour of the circle neural network.

7.3 Benchmark Data Sets

7.3.1 Iris Neural Network

The Task

The UC Irvine database contains 150 labeled patterns for the Iris classification prob-

lem. The examples have four numeric attributes: sepal length, sepal width, petal length

and petal width. The neural network has to learn to distinguish between three classes

of Iris plant, namely, Setosa, Versicolor and Virginica.

The Neural Network Architecture

The architecture is a two-weight layer neural network with the following properties.

- Four-dimensional input space, with input between [-1,1] for each dimension.

- A four-dimensional hidden layer with sigmoidal transfer-function.

- Three output nodes with sigmoidal transfer-function. The output is in the interval [0,1]^3.

The neural network output is sparsely coded.

Application of VIA and VPA


We are interested in the following question:

When does the neural network predict the Iris plant Setosa?

The neural network learnt to predict Setosa if the output is in the following polyhedron.

P_y, a polyhedral region in the three-dimensional output space.

VIA computed, after one backward and one forward step, the following region in the

input space.

B_x, an axis-parallel box in the four-dimensional input space.

After applying VPA the input region is refined to the following polyhedron.

P_x, a polyhedral region contained in the box B_x.

The volume of the box computed by VIA is 9.6395, the volume of the polyhedron

computed by VPA is 1.4359.

Similarly to before, we can write a set of four valid statements about the neural network. Out of the 150 patterns, 50 are classified as Setosa. The neural network learnt to classify these 50 patterns correctly. The box computed with VIA contains 116 of the 150 patterns, among them all 50 Setosa patterns. The polyhedron computed with VPA contained exactly 50 points, all of which are classified as Setosa. Figure 7.2 and Figure 7.3 depict axis-parallel projections of the polyhedron and the box, respectively, onto a two-dimensional space. Points classified as Setosa are plotted using "+"; otherwise the class is Versicolor or Virginica.
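The pattern counts reported above can be reproduced with a small Matlab sketch that tests which patterns satisfy the computed linear inequalities; the variable names A, b (the polyhedron) and X (patterns stored column-wise) are assumptions.

    function inside = inPolyhedron(A, b, X)
    % Flags which columns of X satisfy A*x <= b (with a small tolerance).
    tol = 1e-9;
    inside = all(A * X <= repmat(b, 1, size(X, 2)) + tol, 1);
    % sum(inside) gives the number of patterns contained in the polyhedron.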

Figure 7.2: Projection onto the subspace x(Sepal.Width) = 0 and x(Petal.Width) = 0. The horizontal axis shows Sepal.Length and the vertical axis Petal.Length.

Figure 7.3: Projection onto the subspace x(Sepal.Length) = 0 and x(Petal.Length) = 0. The horizontal axis shows Sepal.Width and the vertical axis Petal.Width.

7.3.2 Pima Neural Network

The Task

We trained a neural network with data of the Pima Indians Diabetes Database. The

768 instances are drawn from a larger database. All patients are females at least 21

years old and of Pima Indian heritage. The network has to classify, according to selected attributes, whether a person is tested positive for diabetes. The following attributes are all numeric.


1. Number of times pregnant

2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3. Diastolic blood pressure (mm Hg)

4. Triceps skin fold thickness (mm)

5. 2-Hour serum insulin (mu U/ml)

6. Body mass index (weight in kg / (height in m)^2)

7. Diabetes pedigree function

8. Age (years)

The Neural Network Architecture

The architecture is a two-weight layer neural network with the following properties.

- Eight-dimensional input space with input between [-1,1] for each dimension.

- Eight-dimensional hidden layer with sigmoidal transfer-function.

- Two output nodes with sigmoidal transfer-function. The output is in the interval [0,1]^2.

The network was trained to classify patients tested positive for diabetes with an output in one fixed target region and the remaining patients with an output in a complementary target region.

Application of VIA and VPA

We want to analyse the following property of the neural network.

What are the possible outputs of the neural network for a specific group of patients?

For example, a specific group of patients is characterised by each of the eight attributes being above average. This includes patients who are old (see attribute 8), have a high body mass index (attribute 6), and so on.

We defined in the input space the following hypercube (to simplify the notation the

hypercube is written as the Cartesian product of intervals):

B_x = [0.5, 1]^8

Figure 7.4: The computed output regions for the Pima Neural Network (output space with axes y1 and y2). The outermost box is the initial restriction, the inner box is the computed region after applying VIA, and the polyhedral region inside is the region computed by VPA.

Additionally, the output space was restricted to an axis-parallel box (the outermost box in Figure 7.4). VIA computed the following possible output region:

B_y, a box in the two-dimensional output space.

VPA terminated with a more refined polyhedral region:

P_y, a polyhedral region contained in B_y.

Overall, these polyhedral rules are valid descriptions of the neural network behaviour:

if x ∈ B_x then y ∈ B_y   (1)
if x ∈ B_x then y ∈ P_y   (2)
if y ∉ B_y then x ∉ B_x   (3)
if y ∉ P_y then x ∉ B_x   (4)

The polyhedron is the stronger postcondition compared to the box. The volume of the polyhedron is 0.5097 and the volume of the box is 0.8068. Figure 7.4 depicts a visualisation of the output space. To check the result, sample points were randomly generated in the input space and the neural network output was computed. The computed outputs are represented by small circles.
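Such a check can be sketched in Matlab as follows; net denotes a trained Neural Network Toolbox network and lo, up are the (column-vector) bounds of the sampled input box, all of which are assumed to be available rather than taken from the thesis code.

    function Y = sampleOutputs(net, lo, up, n)
    % Sample n inputs uniformly from the box [lo, up], compute the network
    % outputs with sim, and plot them in the two-dimensional output space.
    X = repmat(lo, 1, n) + repmat(up - lo, 1, n) .* rand(length(lo), n);
    Y = sim(net, X);
    plot(Y(1, :), Y(2, :), 'o');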

It is obvious that the polyhedral approximation is not very good. The reason is that a single polyhedral approximation during the non-linear phase is not sufficient. In fact, the polyhedral approximation of the non-linear region gets worse the higher the dimension. This is not a surprise, because in higher dimensions more curvature and more saddle points are expected. Consequently, as a log of the VPA implementation showed, the hyperplane approximating the non-linear region was mostly moved towards the corner of the wrapping hypercube (see the binary search method described in Chapter 4 for more details). Additionally, another approximation was necessary during the forward propagation of the region, namely for the projection of the computed eight-dimensional polyhedron of the hidden layer onto a two-dimensional subspace.

However, the above result contains an interesting insight into the neural network behaviour. As shown in Figure 7.4, refinements essentially occurred around the upper left corner ("tested negative for diabetes"). In fact, of 100 points sampled randomly in the box B_x, only 8 were classified as "tested negative for diabetes" by the neural network.

7.4 SP500 Neural Network

The Task

Historical time series data are used for predicting future price developments. Technical analysis (also known as charting, see [Mur86] for detailed information) assumes that all influences of supply and demand are included in the price or index itself. Several neural networks were trained to predict the SP500 index for the next trading day, depending on historical data [Bre01]. Apart from recurrent neural networks, feed-forward neural networks have also been trained. The data set contained 500 values (from 24/09/1998 to 15/09/2000) of the SP500 index. The data included the opening and closing index and the highest and lowest index during a trading day. These variables are referred to as open, close, high and low. The variable maxchange is the maximum daily change observed in the considered time period. In [Zir97] linear and


Fibonacci normalisation were introduced for financial time series data. For this task linear normalisation was applied. The following two input variables are used: v1 and v2.

The variables can be interpreted as follows: v1 is the closing value of the SP500 index normalised to the interval [0.4, 1] (according to a previous analysis [Bre01]) and v2 is the activity of the day relative to the maximal daily change, which was defined by analysing the history of the data. The use of variable v2 was motivated by [Azo94]. v2 will be positive for a rising SP500 value and negative for a falling index value.

The Neural Network Architecture

The following neural network was analysed.

- Four-dimensional input space. The input vector x corresponds to the values v1 and v2 of the previous two days. The input is in the range [0.4, 1] for the v1 values and in the interval [-1, 1] for the v2 values.

- 11 hidden nodes with sigmoidal transfer-function.

- Two-dimensional output with sigmoidal transfer-function. The output is in the interval [0,1]^2.

The network was trained to compute values greater than 0.5 for an increasing stock-market index and values less than 0.5 for a decreasing SP500 index.

Application of VIA and VPA

We are interested in computing regions in the input space which correspond to an increasing SP500 index (according to the neural network prediction).

Find regions in the input space which predict an increasing SP500 index.

For this example VIA and VPA produced the same result, namely the complete input space. This result has two reasons: firstly, sampling indicated that the input points predicting an increasing SP500 index are widely distributed in the input space. Secondly, a single polyhedral approximation is not sufficient. Figure 7.5 shows the projection of the four-dimensional cube onto a two-dimensional input space.

Figure 7.5: The computed input region for the SP500 Neural Network, shown as the projection onto the subspace x(v2 two days before) = 0 and x(v2 one day before) = 0. The horizontal axis shows v1 two days before and the vertical axis v1 one day before. Neither VIA nor VPA computed a refined region.

To obtain better statements about the neural network behaviour, a pre-processing phase is recommended: for example, by using methods like k-means, regions in the input space can be selected, and further analysis can then be performed using VIA and VPA.
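A possible pre-processing step of this kind is sketched below; it uses the kmeans function of the Matlab Statistics Toolbox and returns an axis-parallel bounding box per cluster as a candidate region for VIA/VPA. The function name and the choice of boxes as cluster summaries are assumptions, not the thesis implementation.

    function boxes = clusterBoxes(X, k)
    % Partition the patterns (one per row of X) into k clusters and return
    % an axis-parallel bounding box [lower; upper] for each cluster.
    idx = kmeans(X, k);                 % Statistics Toolbox
    boxes = cell(k, 1);
    for i = 1:k
        Xi = X(idx == i, :);
        boxes{i} = [min(Xi, [], 1); max(Xi, [], 1)];
    end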

7.5 Summary of this Chapter

VPA was successfully used for the circle and the Iris networks. It also showed that a better refinement of polyhedral rules is obtained compared to VIA. However, the results for the Pima and the SP500 neural networks are not completely satisfactory. Apart from the lack of time, this has the following reasons:

- An analysis showed that the input patterns for different classes often lie in a close neighbourhood. Because of this behaviour of the neural network, no significant region refinements are to be expected when applying VIA and VPA. In these cases a pre-processing step is required in order to select small regions (e.g. by using clustering algorithms). Afterwards, VIA and VPA can be used to analyse these regions.

- The approximation for the non-linear phase is crucial. Experiments indicated that the quality of the approximation process (the positioning of a hyperplane) decreases the higher the dimension. The reason for this is probably an increasing number of saddle points. To obtain better refinements, a split of the polyhedron into a finite union of sub-polyhedra is necessary.

- Contributions Chapter 7 -

- Evaluation of the VPA methods on four different neural networks and comparison to VIA.

- Visualization of the behaviour of neural networks, also for higher-dimensional input and output spaces, by applying polyhedral projection techniques.


Chapter 8

Conclusion and Future Work

Section 8.1 enumerates the original contributions of this thesis. Section 8.2 discusses

several approaches to fine tune the existing implementation of VPA. Finally, Section

8.3 indicates further research directions.

8.1 Contributions of this Thesis

We enumerate the main contributions of this thesis according to their occurrence in the text. The importance of each contribution is marked with one star (*), indicating what the author views as a minor contribution, up to three stars (***), indicating what the author considers a major contribution.

1. Introduction of the concept of an Annotated Artificial Neural Network (AANN), with a connection to logic and software verification. (***)

2. Validity Polyhedral Analysis (VPA), as a tool to annotate a feed-forward neural network with valid pre- and postconditions in the form of linear inequality predicates. (***)

3. Classification of neural network analysis techniques into: propositional rule ex-

traction, fuzzy rule extraction and region based analysis. (*)

4. Introduction of the property “validation capability” to indicate if a neural net-

work analysis technique is able to compute properties about the neural network,

which are provably correct. (*)


5. Suggestion to modify REFANN [SLZ02] to obtain valid rules. (*)

6. The computation of the eigenvalues and the eigenvectors in the neighbourhood of a point on the manifold of the region, to obtain information about the deformations of polyhedral facets under sigmoidal transformations. (**)

7. Analysis of piece-wise linear approximations of the sigmoidal function by using

non axis-parallel splits. (*)

8. The idea of computing a wrapping polyhedron which contains the non-linear region. (**)

9. Application of the SQP approach and development of the MSA (Maximum Slice Approach) technique to solve the required non-linear optimization problem for a polyhedral wrapping of the non-linear region. (*)

10. Development of a branch and bound and a binary search technique, which approximate a solution for the global optimum from outside. These methods fulfill the requirement of an outside approximation for the non-linear region. (***)

11. Experiments comparing the binary search method and the branch and bound

approach.(*)

12. Computation of the image of a polyhedron under an affine transformation by

applying projection techniques, if necessary. (**)

13. Development of the S-Box approximation technique to compute an approxima-

tion for the projection of a polyhedron onto a lower dimensional subspace in

polynomial time complexity.(***)

14. Approximation of the image of a polyhedron under a linear transformation with a non-trivial kernel by approximating the image polyhedron directly through the solution of suitable optimization problems. (*)

15. Design and implementation of a general framework for region-based refinement

algorithms. (**)


16. Implementation of Validity Interval Analysis (VIA) and Validity Polyhedral Anal-

ysis (VPA) within this framework. (*)

17. Discussion of numerical difficulties for the VIA and VPA algorithms. (**)

18. Work-around to stabilize an implementation by introducing the concept of stable

and unstable components, and by applying polyhedral projection techniques.

(**)

19. Evaluation of the VPA methods on four different neural networks and compari-

son to VIA. (**)

20. Visualization of the behaviour of neural networks also for higher-dimensional

input and output spaces by applying polyhedral projection techniques. (*)

Thus far only the binary search method [BMH03] has been published. However, further publications are planned for the future with the following titles:

- Approximation Methods for the Projection of a Polyhedron onto a Lower-dimensional Subspace.

- The Concept of Annotated Artificial Neural Networks as a Method to Validate Properties of a Neural Network.

- Survey of Region-based Neural Network Analysis Techniques.

- Validity Polyhedral Analysis and its Application.

8.2 Fine Tuning of VPA

The current implementation of VPA is a prototype. It demonstrated that the theoretical results developed in this thesis are valid and that they could have practical relevance.

However, due to time limitations, this implementation did not explore the full potential

of techniques and ideas as introduced in this thesis. To fine tune the implementation

of Validity Polyhedral Analysis (VPA), we suggest the following steps:


- Implementation of a component to explore the structure of the manifold of the non-linear region. This could be based on the eigenvalue and eigenvector analysis introduced in Chapter 3.

- Extension of the implementation to finite unions of polyhedra and therefore a method to obtain more refined region mappings.

The next section concludes this thesis by pointing to future theoretical investigations related to this research and by providing an outlook on kernel-based machines.

8.3 Future Directions and Validation Methods for Kernel Based

Machines

This thesis contained a number of interesting theoretical problems, in particular the

non-linear optimization problem and the projection of a polyhedron onto a lower-

dimensional subspace. Future investigations should tackle the following aspects:

- Analysis of the refinement process for the binary search method and for the refinement process of VPA.

- Development of a heuristic to compute the most interesting optimization directions for the wrapping of the non-linear region. This leads to questions such as: given a non-linear region and a finite number of possible directions for its polyhedral approximation, how should the directions be chosen to obtain the most refined approximation?

- Application of the concept of an annotated neural network in an industrial context. The possibility of guaranteeing certain properties about the neural network behaviour might motivate more people to use neural networks. For example, for neural network control tasks people are often interested in knowing which inputs are related to stable output states.

- Further improvement of the developed approximation techniques for the projection of a polyhedron onto a lower-dimensional subspace.


Kernel-based Machines

Kernel-based machines are interesting developments in machine learning (see for example [SS02]). These machines rely on the observation that input patterns are more likely to be linearly separable in higher dimensions. Kernel machines first perform a mapping x ↦ φ(x) from the input space to a higher-dimensional feature space. For the learning process in the higher-dimensional space only the dot product is needed. The kernel trick is that for some feature spaces and mappings the dot product in feature space can be computed via a kernel function k defined on the input space, such that k(x, x') = <φ(x), φ(x')>.
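As a small illustration of the kernel trick (not part of the thesis implementation), the following Matlab sketch evaluates the Gaussian (RBF) kernel, for which the feature-space dot product is computed directly on the input vectors; sigma is a free parameter.

    function kval = gaussKernel(x, y, sigma)
    % Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2*sigma^2)),
    % i.e. the dot product <phi(x), phi(y)> in the corresponding feature space.
    kval = exp(-sum((x - y).^2) / (2 * sigma^2));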

Support Vector Machines (SVMs) are probably the best known kernel machines. The architecture of an SVM is a feed-forward neural network.

Similarly to the introduced method, an analysis of SVMs could also be performed by annotating the layers of an SVM with regions. Polyhedra are a suitable choice for the linear layer. However, to our knowledge there are no known approaches for handling kernel functions such as the Gaussian kernel. Hence, in order to generalise VPA to SVMs this problem has to be addressed.


Appendix A

Overview of used symbols

- P ... previous layer
- S ... subsequent layer
- hyperplane, defined by a direction vector and a constant
- positive half-space of a hyperplane
- negative half-space of a hyperplane
- polyhedron, defining a set of points
- projection of a polyhedron onto a subspace
- approximation of a projected polyhedron
- polyhedron given as the image of a polyhedron under the affine transformation
- (non-linear) region
- output vector of a function
- input vector of a function
- activation vector
- net input vector
- weight matrix
- bias vector
- sigmoid transfer-function
- subspace orthogonal to the kernel of the transformation matrix
- the affine transformation
- polyhedron in y-space which is back-propagated through the transfer-function layer
- polyhedron in x-space, i.e. the polyhedral approximation of the region
- true reciprocal image of the polyhedron in x-space, called the region
- wrapping box of a region, i.e. the smallest axis-parallel hypercube containing it
- wrapping box of the non-linear region
- box in x-space, used for the intersection detection after the k-th iteration
- box in y-space, ditto
- maximum of the cost-function in x-space (constrained to the wrapping box)
- point in y-space obtained when linearly optimizing the cost-function subject to the polyhedron
- corresponding point in x-space, i.e. the preimage under the sigmoid transfer-function; in the linear case this would already be the optimal point for the hyperplane
- shift line, the line segment joining the maximum point in x-space and the corresponding point, parametrised over [0,1]
- point used for the positioning of the hyperplane in the binary search method
- rate of change between two consecutive values of the shift parameter
- volume of a box
- a small positive real number
- box operator, the smallest hypercube containing a region or polyhedron
- boolean operator to compare whether two expressions are equivalent
- interval notation [a, b] for the set of all x with a ≤ x ≤ b


Appendix B

Linear Algebra Background

We recall a few important properties about linear transformations, which are relevant

for the algorithm for the computation of the image of a polyhedron under an affine

transformation.

Let W be an [m, n] matrix, i.e. a matrix with m rows and n columns.

1. We can view the matrix W as the representation of a linear transformation from R^n to R^m.

2. The kernel (sometimes called null-space) of W is the set of all vectors that get mapped to the zero vector under W. We write ker(W) = { x ∈ R^n : W x = 0 }. The kernel is a subspace of the domain, i.e. of R^n. In the following we also use K to refer to the subspace ker(W).

3. The image (sometimes called range) of W is the entire set of vectors in R^m reachable by an input vector x when multiplied with W. We write im(W) = { y ∈ R^m : there exists x with y = W x }. The image is a subspace of R^m.

4. The restriction of W to a subspace U orthogonal to K is a linear bijection between U and im(W).

Proof:

(i) Injection: let x1, x2 ∈ U and W x1 = W x2. Then W (x1 − x2) = 0, hence x1 − x2 ∈ K. But U ∩ K = {0}, hence x1 = x2.

(ii) Surjection: let y ∈ im(W). Then there is an x ∈ R^n = K ⊕ U with y = W x. Write x = x_K + x_U, where x_K ∈ K and x_U ∈ U. Then y = W x = W(x_K + x_U) = W x_U. In other words, any vector x_U ∈ U gets mapped to a distinct vector y ∈ im(W). We denote this restriction by W|_U.
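The bijection result can be explored numerically; the following Matlab sketch (an illustration only, with an arbitrary example matrix) computes orthonormal bases of the kernel, of its orthogonal complement and of the image.

    W = [1 2 3; 4 5 6];      % example matrix with a one-dimensional kernel
    K = null(W);             % orthonormal basis of ker(W)
    U = orth(W');            % basis of the subspace orthogonal to ker(W)
    Im = orth(W);            % basis of im(W); W restricted to U is a bijection onto im(W)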


Bibliography

[ADT95] R. Andrews, J. Diederich, and A. Tickle. A survey and critique of

techniques for extracting rules from trained artificial neural networks.

Knowledge-Based Systems 8 (1995) 6, pages 373–389, 1995.

[Ama97] Saman Prabhath Amarasinghe. Parallelizing Compiler Techniques

Based on Linear Inequalities. PhD thesis, Stanford University, Stan-

ford, CA 94305, January 1997.

[Azo94] Michal E. Azoff. Neural Network Time Series Forecasting of Financial

Markets. John Wiley & Sons Ltd., 1994.

[Bal98] Egon Balas. Projection with a minimal system of inequalities. Compu-

tational Optimization and Applications, 10:189–193, April 1998.

[BBDR96] J.M. Benitey, A. Blanco, M. Delgado, and I. Requena. Neural methods

for obtaining fuzzy rules. 3:371–382, 1996.

[Bis94] C.M. Bishop. Neural networks and their applications. Rev.Sci. In-

strum., 65(6):1803–1831, 1994.

[Bis95] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford Uni-

versity Press, 1995.

[BMH03] S. Breutel, F. Maire, and R Hayward. Extracting interface assertions

from neural networks in polyhedral format. In Michel Verleysen, edi-

tor, ESANN 2003, pages 463–468. Kluwer, 2003.

177

Page 196: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

178 BIBLIOGRAPHY

[Bon83] A. Boneh. PREDUCE: A Probabilistic Algorithm for Identifying Re-

dundancy by a Random Feasible Point Generator. Springer-Verlag,

Berlin, Germany, 1983.

[Bre01] S. Breutel. Neural network time series prediction for the sp500 index.

Internal Report, 2001.

[Bro97] M. Broy. Informatik - Eine grundlegende Einführung. Springer, 1997.

[BS81] Bronstein and Semendjajew. Taschenbuch der Mathematik. Harri

Deutsch und Thun, Frankfurt, 1981.

[BT96] Paul T. Boggs and John W. Tolle. Sequential quadratic programming.

pages 1–000, 1996.

[Cer61] R.N. Cernikov. The solution of linear programming problems by elim-

ination of unknowns. Doklady Akademii Nauk 139, pages 1314–1317,

1961.

[CMP] R.J. Caron, J.F. McDonald, and C.M. Ponic. Classification of linear

constraints as redundant or necessary. Technical Report WMR-85-09,

University of Windsor, Windsor Mathematics Report.

[CMP89] R.J. Caron, J.F. McDonald, and C.M. Ponic. A degenerate extreme

point strategy for the classification of linear constraints as redun-

dant or necessary. Journal of Optimization Theory and Applications,

62(2):225–237, August 1989.

[Cov65] T.M. Cover. Geometrical and statistical properties of systems of linear

inequalities with applications in pattern recognition. IEEE Transac-

tions on Electronic Computers, EC-14:326–334, 1965.

[Cra96] M. Craven. Extracting comprehensible models from trained neural net-

works. Ph.D. dissertation,Univ. Wisconsin, Madison, WI, 1996, 1996.

[CS96] M.W Craven and J.W. Shavlik. Extracting tree-structured represen-

tations of trained neural networks. Advances in Neural Information

Processing Systems, 8, 1996.

Page 197: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

BIBLIOGRAPHY 179

[Dax97] A. Dax. An elementary proof of farkas’ lemma. SIAM Review,

39(3):503–507, 1997.

[DM97] M. Dyer and N. Meggiddo. Chapter 38 in The Handbook of Discrete

and Computational Geometry, pages 699–710, July 1997.

[Dut02] Mathieu Dutour. Computational methods for cones and polytopes with

symmetry. January 2002.

[Far02] J. Farkas. Ueber die theorie der einfachen ungleichungen. Journal fuer

die reine und angewandte Mathematik, 124:1–24, 1902.

[FJ99] Maciej Faifer and Cezary Janikow. Extracting fuzzy symbolic representation from artificial neural networks. 18th International Con-

ference of the North American Fuzzy Information Processing Society,

June 1999.

[Fou27] J.B.J. Fourier. (reported in:) Analyse des travaux de l'Académie royale des sciences pendant l'année 1824. Partie mathématique, 1827.

[FP96] Komei Fukuda and Alain Prodon. Double description method revisited.

Combinatorics and Computer Science, pages 91–111, 1996.

[Fri98] Bernd Fritzke. Vektorbasierte Neuronale Netze. Shaker Verlag, 1998.

[Fu94] LiMin Fu. Rule-generation from neural networks. IEEE SMC,

24(8):1114–1124, 1994.

[Fuk00] Komei Fukuda. Frequently asked questions in polyhedral computation.

2000.

[GSZ00] Adam E. Gaweda, Rudy Setiono, and Jacek M. Zurada. Rule extrac-

tion from feedforward neural network for function approximation. In

Neural Networks and Soft Computing, Zakapone,Poland, June 2000.

[GVL89] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

Page 198: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

180 BIBLIOGRAPHY

[Hay99] Simon Haykin. Neural networks - A comprehensive foundation. Pren-

tice Hall, 1999.

[HEFROG03] Carlos Hernandez-Espinosa, Mercedes Fernandez-Redondo, and Ma-

men Ortiz-Gomez. A new rule extraction algorithm based on interval

arithmetic. In Michel Verleysen, editor, ESANN 2003, pages 155–160.

Kluwer, 2003.

[HHSD97] R. Hayward, C. Ho-Stuart, and J. Diederich. Neural Networks as Ora-

cles for Rule Extraction. Connectionist Systems for Knowledge Repre-

sentation and Deduction, 1997.

[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer Feedforward

Networks are Universal Approximators. Neural Networks, 2:359–366,

1989.

[HU79] John E Hopcroft and Jeffrey D Ullman. Introduction to Automata The-

ory, Languages, and Computation, chapter 3, pages 65–71. Addison-

Wesley, 1979.

[Huc99] T. Huckle. Kleine bugs grosse gaus, 1999. Seminar 2.12.1999,

wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugs.html.

[IN] H. Ishibushi and M. Nii. Generating fuzzy if-then rules from trained

neural networks: Linguistic analysis of neural network. pages 1133–

1138.

[INT] H. Ishibushi, M. Nii, and K. Tanaka. Linguistic rule extraction from

neural networks and genetic-algorithm-based rule selection. pages

2390–2395.

[INT99] H. Ishibushi, M. Nii, and K. Tanaka. Linguistic rule extraction from

neural networks for higher-dimensional classification problems. Com-

plexity International, 6, 1999.

[Jan93] C.Z. Janikow. Fuzzy processing in decision trees. Proceedings of the

Sixth International Symposium on AI, pages 360–367, 1993.

Page 199: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

BIBLIOGRAPHY 181

[Jan96] C.Z. Janikow. Fuzzy decision trees: Issues and methods. Technical

report, Department of Mathematics and Computer Science, University

of Missouri-St Louis, 1996.

[Kal02] Bohdan L. Kaluzny. Polyhedral computation:a survey of projection

methods. Technical Report 308-760B, McGill University, April 2002.

[Ker00] Eric C. Kerrigan. Robust Constraint Satisfaction: Invariant Sets and

Predictive Control. PhD thesis, Control Group,Department of Engi-

neering, University of Cambridge, 2000.

[KG85] A. Kaufmann and M.M Gupta. Introduction to fuzzy arithmetic. 1985.

[Koh87] T. Kohonen. Self-Organization and Associative Memory. Springer-

Verlag, 2 edition, 1987.

[Las83] J. Lassere. An analytical expression and an algorithm for the volume

of convex polyhedron in. Journal of Optimization Theory and Appli-

cations, 39(4), 1983.

[Len93] C. Lengauer. Loop parallelization in the polytope model. CONCUR

’93,Lecture Notes in Computer Science 715, pages 398–416, 1993.

[Lis01] Paulo J G Lisboa. Industrial use of safety-related artificial neural net-

works. Technical report, Health and Safety Executive, 2001.

[LMP89] J. Li, A.N. Michel, and W. Porod. Analysis and synthesis of a class of

neural networks: linear systems operating on a closed hypercube. IEEE

Transactions on Circuits and Systems, 36(11):1405–1422, November

1989.

[LVE00] P.J.G Lisboa, A. Vellido, and B.(eds.) Edisbury. Neural network appli-

cations in business. World Scientific, 2000.

[Mai98] F. Maire. Rule-extraction by backpropagation of polyhedra. Neural

networks, 12:717–725, 1998.

Page 200: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

182 BIBLIOGRAPHY

[Mai00a] F. Maire. On the convergence of validity interval analysis. IEEE Trans-

actions on Neural Networks, 11(3), 2000.

[Mai00b] F. Maire. Polyhedral analysis of neural networks. SDL, 2000.

[Mat00a] MathWorks, 24 Prime Park Way, Natrick,MA. The Neural Network

Toolbox, 2000.

[Mat00b] MathWorks, 24 Prime Park Way, Natrick,MA. The Optimization Tool-

box 2.0, 2000.

[Mat00c] MathWorks, 24 Prime Park Way, Natrick,MA. Using Matlab, 2000.

[Men01] Jerry M. Mendel. Rule-Based Fuzzy Logic Systems. Prentice Hall,

Upper Saddle River,NJ 07458, 2001.

[Mit97] Tom M. Mitchell. Machine Learning. McGRAW-HILL, 1997.

[MKW03] Urszula Markowska-Kaczmar and Wojciech Trelak. Extraction of fuzzy rules from trained neural network using evolutionary algorithm. In Michel Verleysen, editor, ESANN 2003, pages 149–154. Kluwer, 2003.

[MP00] Ofer Melnik and Jordan Pollack. Exact representations from

feed-forward networks. Technical report, Brandeis University,

Waltham,MA,USA, April 2000.

[Mur86] John Murphey. Technical Analysis of the Future Markets. The New

York Institute of Finance, Prentice Hall, New York, 1986.

[NP00] C.D. Neagu and V. Palade. An interactive fuzzy operator used in rule

extraction from neural networks. Neural Network World 4, pages 675–

684, 2000.

[PNJ01] Vasile Palade, Daniel-Ciprian Neagu, and Ron J.Patton. Interpretation

of trained neural networks by rule extraction. pages 152–161, 2001.

[PNP00] Vasile Palade, Daniel-Ciprian Neagu, and Gheorghe Puscasu. Rule

extraction from neural networks by interval propagation. 2000.

Page 201: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

BIBLIOGRAPHY 183

[Qui86] J.R. Quinlan. Induction of decision trees. Machine Learning 1, pages

81–106, 1986.

[Ram94] Joerg Rambau. Polyhedral Subdivisions and Projections of Polytopes.

Dissertation, TU Berlin, 1994.

[Rep] UCI Machine Learning Repository. http://www.ics.uci.edu/ mlearn/ml-

summary.html.

[Rep99] DTI Final Report. Evaluation of parallel processing and neural com-

puting application programs. Technical report, DTI, 1999.

[RHW86a] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal

Representations by Error Propogation. Parallel Distributed Processing,

Vol I + II, MIT Press, 1986.

[RHW86b] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, Cambridge, MA, 1986.

MIT Press.

[RHW86c] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature (London), 323:533–536,

1986.

[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information

storage and organization in the brain. Psychological Review, 65:386–

408, 1958.

[RR89] Bruno Riedmueller and Klaus Ritter. Lineare und quadratische opti-

mierung. Institut fuer angewandte Mathematik und Statistik, Technis-

che Universitaet Muenchen, 1989.

[Sch90] A. Schrijver. Theory of Linear and Integer Programming. Wiley-

Interscience Publication, 1990.

Page 202: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

184 BIBLIOGRAPHY

[SL97] R. Setiono and H. Liu. Neurolinear: From neural networks to oblique

decision rules. Neurocomputing, 17(1):1–24, 1997.

[SLZ02] Rudy Setiono, Wee Kheng Leow, and Jacek M. Zurada. Extraction of

rules from artificial neural networks for nonlinear regression. IEEE

Transactions on Neural Networks, 13(3), May 2002.

[SN88] K. Saito and R. Nakano. Medical diagnostic expert system based on

pdp model. Proceedings of IEEE, International Conference on Neural

Networks, 1:255–262, 1988.

[SS95] Hava T Siegelmann and Eduardo D Sontag. On the computational

power of neural nets. Journal of Computer and System Sciences,

50(1):132–150, 1995.

[SS02] Bernhard Schoelkopf and Alexander J. Smola. Learning with Kernels.

MIT Press, Cambridge, Massachusetts, 2002.

[TAGD98] A. Tickle, R. Andrews, Mostefa Golea, and J. Diederich. The truth will

come to light: Directions and challenges in extracting the knowledge

embedded within trained artificial neural networks. IEEE Transactions

on Neural Networks, 9(6):1057–1068, 1998.

[TBS93] T.M. Martinetz, S. Berkovich, and K. Schulten. Neural gas network for

vector quantization and its applications to time series prediction. IEEE

Transactions on Neural Networks, 4:558–569, 1993.

[TG96] I. Taha and J. Ghosh. Symbolic interpretation of artificial neural net-

works. (TR-97-01-106), 1996.

[Thr90] S. B. Thrun. Inversion in Time. In Proceedings of the EURASIP Work-

shop on Neural Networks. Springer Verlag, 1990.

[Thr91] S. B. Thrun. The monk’s problems - a performance comparison

of different learning algorithms. Technical Report CMU-CS-91-197,

Carnegie Mellon University, Pittsburgh,PA, December 1991.

Page 203: Analysing the Behaviour of Neural Networkseprints.qut.edu.au/15943/1/Stephan_Breutel_Thesis.pdfto marketplace is the integration with other systems (e.g. standard software systems,

BIBLIOGRAPHY 185

[Thr93] S. B. Thrun. Extracting Provably Correct Rules from Artificial Neural

Networks. Technical Report IAI-TR-93-5, Department of Computer

Science III, University of Bonn, 1993.

[Thr95] S. B. Thrun. Extracting Rules from Artificial Neural Networks with

Distributed Representations. Advances in Neural Information Process-

ing Systems (NIPS) 7, 1995.

[TM] Theodore B. Trafalis and Alexander M. Malyscheff. An analytic center

machine. Machine Learning.

[TS93a] G. Towell and J. Shavlik. The extraction of refined rules from

knowledge-based neural networks. Machine Learning, 131:71–101,

1993.

[TS93b] G. Towell and J.W. Shavlik. Extracting refined rules from knowledge-

based neural networks. Machine Learning, 13:71–101, 1993.

[Van83] James S. Vandergraft. Introduction to numerical computation. Aca-

demic Press, New York, 1983.

[Wil97] Doran K. Wilde. A library for doing polyhedral operations. Technical

report, Brigham Young University, Department of Electrical and Com-

puter Engineering, 459 CB,Box 24099,Provo,Utah, 1997.

[WvdB98] T. Weiters and A. van den Bosch. Interpretable neural networks with

bp-som. Tasks and Methods in Applied Artificial Intelligence. Lectures

Notes in Artificial Intelligence 1416, pages 564–573, 1998.

[Zad75] L. A. Zadeh. The concept of a linguistic variable and its application to

approximate reasoning-1. Information Sciences, 8:199–249, 1975.

[Zie94] G.M. Ziegler. Lectures on polytopes. Springer-Verlag, 1994.

[Zir97] Joseph S. Zirilli. Financial Prediction using Neural Networks. Inter-

national Thomson Computer Press, 1997.