Analysing the Behaviour of Neural Networks
A Thesis by Stephan Breutel
Dipl.-Inf.
In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
Queensland University of Technology, Brisbane
Center for Information Technology Innovation
March 2004
Copyright © Stephan Breutel, MMIV. All rights reserved.
The author hereby grants permission to the Queensland University of Technology, Brisbane to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.
Keywords
Artificial Neural Network, Annotated Artificial Neural Network, Rule-Extraction, Validation of Neural Network, Polyhedra, Forward-propagation, Backward-propagation, Refinement Process, Non-linear Optimization, Polyhedral Computation, Polyhedral Projection Techniques
Analysing the Behaviour of Neural Networks
by
Stephan Breutel
Abstract
A new method is developed to determine a set of informative and refined interface assertions satisfied by functions that are represented by feed-forward neural networks. Neural networks have often been criticized for their low degree of comprehensibility. It is difficult to have confidence in software components if they have no clear and valid interface description. Precise and understandable interface assertions for a neural network based software component are required for safety-critical applications and for the integration into larger software systems.

The interface assertions we are considering are of the form "if the input $\mathbf{x}$ of the neural network is in a region $\mathcal{R}_x$ of the input space then the output $\mathbf{y} = f(\mathbf{x})$ of the neural network will be in the region $\mathcal{R}_y$ of the output space", and vice versa. We are interested in computing refined interface assertions, which can be viewed as the computation of the strongest pre- and postconditions a feed-forward neural network fulfills. Unions of polyhedra (polyhedra are the generalization of convex polygons to higher dimensional spaces) are well suited for describing arbitrary regions of higher dimensional vector spaces. Additionally, polyhedra are closed under affine transformations.

Given a feed-forward neural network, our method produces an annotated neural network, where each layer is annotated with a set of valid linear inequality predicates. The main challenges for the computation of these assertions are to compute the solution of a non-linear optimization problem and the projection of a polyhedron onto a lower-dimensional subspace.
Contents
List of Figures vi
List of Tables vii
List of Listings ix
1 Introduction 1
1.1 Motivation and Significance . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notations and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Software Verification and Neural Network Validation . . . . . . . . . 6
1.4 Annotated Artificial Neural Networks . . . . . . . . . . . . . . . . 9
1.5 Highlights and Organization of this Dissertation . . . . . . . . . . . . 13
1.6 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Analysis of Neural Networks 17
2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Validation of Neural Network Components . . . . . . . . . . . . . . 22
2.2.1 Propositional Rule Extraction . . . . . . . . . . . . . . . . . 28
2.2.2 Fuzzy Rule Extraction . . . . . . . . . . . . . . . . . . . . . 35
2.2.3 Region-based Analysis . . . . . . . . . . . . . . . . . . . . . 44
2.3 Overview of Discussed Neural Network Validation Techniques and
Validity Polyhedral Analysis . . . . . . . . . . . . . . . . . . . . . . 62
3 Polyhedra and Deformations of Polyhedral Facets under Sigmoidal Trans-
formations 65
3.1 Polyhedra and their Representation . . . . . . . . . . . . . . . . . . . 65
3.2 Operations on Polyhedra and Important Properties . . . . . . . . . . . 69
3.3 Deformations of Polyhedral Facets under Sigmoidal Transformations . 70
3.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 82
4 Nonlinear Transformation Phase 83
4.1 Mathematical Analysis of Non-Axis-parallel Splits of a Polyhedron . 83
4.2 Mathematical Analysis of a Polyhedral Wrapping of a Region . . . . 86
4.2.1 Sequential Quadratic Programming . . . . . . . . . . . . . . 89
4.2.2 Maximum Slice Approach . . . . . . . . . . . . . . . . . . . 90
4.2.3 Branch and Bound Approach . . . . . . . . . . . . . . . . . . 95
4.2.4 Binary Search Approach . . . . . . . . . . . . . . . . . . . . 98
4.3 Complexity Analysis of the Branch and Bound and the Binary Search
Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 111
5 Affine Transformation Phase 113
5.1 Introduction to the Problem . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Backward Propagation Phase . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Forward Propagation Phase . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 Projection of a Polyhedron onto a Subspace . . . . . . . . . . . . . . 118
5.4.1 Fourier-Motzkin . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4.1.1 A Variation of Fourier-Motzkin . . . . . . . . . . . 123
5.4.2 Block Elimination . . . . . . . . . . . . . . . . . . . . . . . 126
5.4.3 The S-Box Approximation . . . . . . . . . . . . . . . . . . . 131
5.4.3.1 Projection of a face . . . . . . . . . . . . . . . . . 133
5.4.3.2 Determination of Facets of the Projected Polyhedron . . . 134
5.4.3.3 Further Improvements of the S-Box method . . . . 137
5.4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.5 Further Considerations about the Approximation of the Image . . . . 140
5.6 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 140
6 Implementation Issues and Numerical Problems 143
6.1 The Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.2 Numerical Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 150
7 Evaluation of Validity Polyhedral Analysis 151
7.1 Overview and General Procedure . . . . . . . . . . . . . . . . . . . . 151
7.2 Circle Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.3 Benchmark Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.1 Iris Neural Network . . . . . . . . . . . . . . . . . . . . . . 156
7.3.2 Pima Neural Network . . . . . . . . . . . . . . . . . . . . . 158
7.4 SP500 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.5 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 163
8 Conclusion and Future Work 165
8.1 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . 165
8.2 Fine Tuning of VPA . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3 Future Directions and Validation Methods for Kernel Based Machines 168
A Overview of used symbols 171
B Linear Algebra Background 175
Bibliography 185
List of Figures
1.1 Annotated version of a neural network. . . . . . . . . . . . . . . . . . 10
2.1 Single neuron of a multilayer perceptron . . . . . . . . . . . . . . . . 19
2.2 Sigmoid and threshold activation functions and the graph of the func-
tion computed by a single neuron with a two-dimensional input. . . . 20
2.3 Two-layer feed-forward neural network . . . . . . . . . . . . . . . . 21
2.4 Overview of different validation methods for neural networks. . . . . 26
2.5 Example for the KT-Method. . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Example for the M-of-N method. . . . . . . . . . . . . . . . . . . 34
2.7 DIBA: recursive projection of hyperplanes onto each other . . . . . 51
2.8 DIBA: a decision region and the traversing of a line. . . . . . . . . . . 52
2.9 Annotation of a neural network with validity intervals. . . . . . . . . 56
2.10 Piece-wise linear approximation of the sigmoid function. . . . . . . . 61
3.1 Combination of two vectors in a two-dimensional space: from left to
right: linear, non-negative, affine and convex combination. . . . . . . 66
3.2 The back-propagation of a polyhedron through the transfer-function
layer. Given the polyhedral description $\{\mathbf{y} : \mathbf{A}\mathbf{y} \le \mathbf{b}\}$ in the output
space of a transfer-function layer, the reciprocal image of this polyhe-
dron under the non-linear transfer function $\sigma$ is given by $\sigma^{-1}(\{\mathbf{y} : \mathbf{A}\mathbf{y} \le \mathbf{b}\}) = \{\mathbf{x} : \mathbf{A}\sigma(\mathbf{x}) \le \mathbf{b}\}$. . . . . . . . . . . . . . . . . . . . . 71
3.3 Approximation of the non-linear region. . . . . . . . . . . . . . . . . 73
3.4 Subdivision into cells. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5 Example for subdivision into cells. . . . . . . . . . . . . . . . . . . . 74
3.6 Point sampling approach. . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7 Eigenvalue and eigenvector analysis of the facet manifold. . . . . . . 79
3.8 Example for convex curvature. . . . . . . . . . . . . . . . . . . . . . 80
3.9 Example for concave curvature. . . . . . . . . . . . . . . . . . . . . 81
4.1 Non axis-parallel split of a polyhedron. . . . . . . . . . . . . . . . . 84
4.2 Polyhedral wrapping of the non-linear region $\mathcal{R}$. . . . . . . . . . . . 87
4.3 Application of branch and bound. . . . . . . . . . . . . . . . . . . . 95
4.4 Two dimensional example for the binary search method. . . . . . . . 103
5.1 The forward and backward-propagation of a polyhedron through the
weight layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2 The projection of a polyhedron onto a two-dimensional subspace. . . 121
5.3 Relevant hinge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4 Possible . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.5 The projected polyhedron $\mathcal{P}$ is contained in the approximation. . . . . 139
6.1 Overview of the framework. . . . . . . . . . . . . . . . . . . . . . . 144
7.1 Visualisation for the behaviour of the circle neural network. . . . . . 156
7.2 Projection onto a two-dimensional subspace of the input space. . . . . 158
7.3 Projection onto a second two-dimensional subspace of the input space. . 158
7.4 The computed output regions for the Pima Neural Network. . . . . . . 160
7.5 The computed input region for the SP500 Neural Network. . . . . . . 163
List of Tables
2.1 Overview of neural network validation techniques . . . . . . . . . . . 63
3.1 Concave and convex curvature in the neighborhood of a point. . . . . 75
4.1 Comparison of branch and bound and binary search. . . . . . . . . . . 109
5.1 Computation times for the projection of a polyhedron onto a lower-
dimensional subspace. . . . . . . . . . . . . . . . . . . . . . . . . 139
List of Listings
5.1 projExample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.1 net-struct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.2 mainLoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3 forwardStep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.4 mainVIA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5 mainVPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.6 numExample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Statement of Original Authorship
The work contained in this thesis has not been previously submitted for a degree or
diploma at any other higher education institution. To the best of my knowledge and
belief, the thesis contains no material previously published or written by another per-
son except where due reference is made.
Stephan Breutel
March 2004
Acknowledgments
I would like to thank my principal supervisor Frederic Maire, without whose un-
bounded energy, idealism, motivation and immense enthusiasm this research would
not have been possible. It was an honor to receive wisdom and guidance from all of my su-
pervisory team. A special thanks also to my associate supervisor Ross Hayward who
provided excellent support whenever I needed it.
Apart from my supervisors, I would like to thank the other panel members of my oral
defense, Joaquin Sitte and Arthur ter Hofstede for their valuable comments about this
thesis.
I would also like to thank all of my colleagues in my research center for their help
and encouragement during my time here, and in particular all the people of the Smart
Devices Laboratory. It would take too many pages to list them all.
I was fortunate to have many friends with whom I enjoyed fantastic climbing sessions
at Kangaroo Point and had many great times at the Press Club whilst listening to superb
Jazz performances, all of which helped to keep me relatively sane during my time at
QUT.
I would also like to thank the Coffee Coffee Coffee crew for keeping me awake by
serving me literally over a thousand coffees (to be precise 1084).
Finally, I really would like to thank all my friends and my family back home in Bavaria
for supporting my endeavour. A special thanks to my parents for their moral and finan-
cial support.
This work was supported in part by an IPRS Scholarship and a QUT Faculty of Infor-
mation Technology Scholarship.
Chapter 1
Introduction
The first section provides examples of neural networks in industrial applications that
demonstrate the need for neural network validation methods.
In section 1.2 notation conventions are explained. Required terminology, concepts of
software verification and neural network validation are introduced in section 1.3.
In section 1.4 annotated artificial neural networks are defined and the central idea of
the novel Validity Polyhedral Analysis (VPA) is explained in a nutshell. The organization
of this thesis is outlined in section 1.5.
1.1 Motivation and Significance
A conclusion of the report "Industrial use of safety-related artificial neural networks"
by Lisboa [Lis01] is that one of the keys to a successful transfer of neural networks
to the marketplace is the integration with other systems (e.g. standard software systems,
fuzzy systems, rule-based systems). This requires an analysis of the behaviour of
neural network based components. Examples for products using neural networks in
safety-critical areas are [Lis01]:
• Explosive detection, 1987. SNOOPE from SAIC is an explosive detector. It was
motivated by the need to detect the plastic explosive Semtex. The detector irra-
diated suitcases with low energy neutrons and collected an emission gamma-ray
spectrum. A standard feed-forward neural network was used to classify be-
tween: bulk explosive, sheet explosive and no explosive. However, there were
several practical problems. For example, the 4% false positive rate of the MLP re-
sulted in a large number of items to be checked. This is not practical, especially
for busy airports such as Heathrow and Los Angeles airport, where
the system was tested.
• Financial risk management. PRISM by Nestor relied on a recursive adaptive
model. HNC’s Falcon is based on a regularised Multi layer perceptron (MLP).
Both systems are still market leaders for credit card fraud detection [LVE00].
• Siemens applied neural networks for the control of steel rolling mills. A pro-
totype neural network based model was used for strip temperature and rolling
force at the hot strip mill of Hoesch, in Dortmund, in 1993. Later Siemens
applied this technology at 40 rolling mills world-wide. Siemens experience in-
dicates that neural networks always complement, and never replace, physical
models. Additionally, the domain expertise is essential in the validation process.
A third observation was that data requirements are severe.
• NASA and Boeing are testing a neural network-based damage recovery control
system for military and commercial aircraft. This system aims to add a
significant margin of safety to fly-by-wire control when the aircraft sustains major
equipment or system failure, ranging from the inability to use flaps to encoun-
tering extreme icing.
• Vibration analysis monitoring in jet engines is a joint research project by Rolls-
Royce and the Department of Engineering at Oxford University. The diagnos-
tic system QUINCE combines the outputs from neural networks with template
matching, statistical processing and signal processing methods. The software is
designed for the pass-off test of jet engines. It includes a tracking facility of the
most likely fault. Another project is a real-time in-flight monitoring system of
the Trent 900 Rolls-Royce engine. The project combines different techniques,
like for example Kalman filters with signal processing methods and neural net-
works.
• In a European collaborative project involving leading car manufacturers, differ-
ent control systems ranging from engine management models to physical speed
control have been implemented. These control systems combined engineering
expertise with non-linear interpolation by neural network architectures. It in-
cluded rule-based systems, fuzzy systems and neural networks.
• Siemens produces the FP-11 intelligent fire detector. This detector was trained
from fire tests carried out over many years. According to [Lis01] this fire detec-
tor triggered one-thirtieth as many false alarms as conventional detectors. The detector
is based on a digital implementation of fuzzy logic, with rules discovered by a
neural network but validated by human experts.
These examples show that neural networks need to be integrated with other systems,
that it is relevant to extract rules learnt by neural networks (for example for the fire
detector FP-11 or for the credit card fraud detection) and that it is important to provide
valid statements of the neural network behaviour in safety-critical applications.
Therefore, it is necessary to describe the neural network behaviour, e.g. in the form of
valid and refined rules. Additionally, it is interesting to obtain explanations for the
neural network behaviour.
However, our main motivation is to compute valid statements about the neural network
behaviour and as such help to prevent software faults. Software errors can cause a lot of
problems and risks, especially in safety-critical environments. Several software errors
and their consequences are collected in [Huc99]. The explosion of the Ariane 5 rocket
and the loss of the Mars Climate Orbiter [Huc99] are recent examples of consequences of
software errors.
To summarize: it is important to describe the behaviour of neural network components
in order to:
• overcome the low degree of comprehensibility of (trained) neural networks,

• validate neural networks,

• integrate neural network components in (large) software environments,

• apply neural network components even in safety-critical applications,

• detect interesting/new knowledge in statistical data which was not previously obvious,

• prevent un-learning: given a polyhedral interface description of a neural network component, one can define a set of invariants, i.e. adapting the neural network to new environments must not violate these invariants,

• control the generalization of a trained neural network: generalization expresses the ability of the neural network to produce correct output for previously unseen input data, and a description of the neural network behaviour will provide a better insight into the neural network generalization capability,

• visualize corresponding regions in the input and the output space of a neural network.
1.2 Notations and Definitions
The following notation conventions are inspired by the book by Fritzke [Fri98], and most of the conventions follow the Matlab [Mat00c] notation.

• Sets are denoted by calligraphic upper case letters (e.g. $\mathcal{S}$).

• Matrices are denoted by single upper case bold letters (e.g. $\mathbf{A}$). To denote a column or row vector or any arbitrary element within the matrix, we use the Matlab notation [Mat00c], e.g. $\mathbf{A}(:,j)$ is the $j$-th column vector of matrix $\mathbf{A}$, and $\mathbf{A}([i,l],:)$ extracts the $i$-th and the $l$-th row vectors.

• Vectors are denoted by single lower case bold letters (e.g. $\mathbf{v}$). By default a vector is a column vector. $\mathbf{v}^T$ represents the transpose of $\mathbf{v}$. To represent an element within the vector we use Matlab notation, e.g. $\mathbf{v}(j)$ denotes the $j$-th element of the vector.

• Let $I$ and $J$ denote index sets. Then we denote with $\mathbf{A}(I,J)$ the selection of the rows $I$ and columns $J$ of the matrix $\mathbf{A}$.

• A scalar variable is denoted with a simple lower case letter (e.g. $a$).

• Let $\mathbf{A}$ and $\mathbf{B}$ be two matrices with the same number of columns. With $\begin{bmatrix}\mathbf{A} \\ \mathbf{B}\end{bmatrix}$ we denote the vertical concatenation of the two matrices. Similarly, $[\mathbf{A}\ \mathbf{B}]$ denotes the horizontal concatenation of two matrices with the same number of rows.

Additionally, we use the convention that definitions and expressions which are explained in the glossary¹ are written in emphasized style when used for the first time. Function names are also written in emphasized style.

An overview of all symbols and special operators is provided in Appendix A. However, the following table contains symbols which are already relevant for this chapter and for the literature review of neural network analysis techniques in Chapter 2.

$\mathcal{P}$ ... polyhedron, i.e. the intersection of a finite number of half-spaces
$\mathcal{R}$ ... arbitrary region
$\mathcal{B}$ ... a box, i.e. an axis-parallel hyper-rectangle; we also use the expression hypercube
$\mathbf{y}$ ... activation vector of the output neurons
$\mathbf{net}$ ... net input vector (input vector for the transfer-function layer)
$\mathbf{x}$ ... activation vector of the input neurons
$\mathbf{W}$ ... weight matrix
$\theta$ ... single bias value
$\boldsymbol{\theta}$ ... bias vector
$\sigma$ ... sigmoid transfer function

The next table defines a special function symbol, one operator and the interval notation.

$\Box$ ... box function; $\Box(\mathcal{R})$ denotes the smallest axis-parallel hypercube containing a region $\mathcal{R}$
$\equiv$ ... boolean operator to compare if two expressions are equivalent
$[a,b]$ ... denotes the interval $\{x : a \le x \le b\}$

To refer to sigmoidal functions we often use the Matlab terms logsig and tansig.
¹A glossary is not provided yet, but it will be included in the final version. We apologise for any
inconvenience.
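To make these indexing and concatenation conventions concrete, the following short sketch restates them in Python/NumPy (an illustration only; the thesis implementation uses Matlab, and the matrix values below are arbitrary):

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    B = np.array([[7.0, 8.0, 9.0]])
    v = np.array([1.0, 2.0, 3.0])      # vectors are column vectors by default

    col = A[:, 1]                      # Matlab A(:,2): the second column vector of A
    rows = A[[0, 1], :]                # Matlab A([1 2],:): selecting rows by an index set
    elem = v[2]                        # Matlab v(3): the third element of the vector
    vert = np.vstack([A, B])           # vertical concatenation [A; B]
    horz = np.hstack([A, A])           # horizontal concatenation [A A]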
6 Chapter 1. Introduction
1.3 Software Verification and Neural Network Validation
We will divide software components, depending on the task and its implementation,
into two classes, namely into trainable software components and into non-trainable
software components.
Definition 1.1 Trainable Software Components
Software components for classification and non-linear regression, where the task is im-
plicitly specified via a set of examples, and a set of parameters is optimized according
to these examples, are called trainable software components. □

Typically, we use trainable software components where it is not easy or not possi-
ble to define a clear algorithm. For example, in tasks like speech recognition, image
recognition or robotic control, often statistical learners, like neural networks or support
vector machines are applied.
Definition 1.2 Non-trainable Software Components
Software components, where the task is precisely specified, an algorithm can be defined, and the task is implemented with a programming language, are called non-trainable software components. □

We also refer to non-trainable software components as standard software.
Software Verification and Validation of Neural Network Components
Standard software program verification methods take the source code as input and
prove the correctness against the (formal) specification. Among others, important con-
cepts of software verification are pre- and postconditions.
Definition 1.3 Precondition and Postcondition
Given a software component $S$ and two boolean expressions $P, Q$ about the input and
output data, the statement

$\{P\}\ S\ \{Q\}$

indicates that for every input state which fulfills $\{P\}$ before the execution of $S$, the
output state $\{Q\}$ is true after the termination of $S$. The assertions $\{P\}$ and $\{Q\}$ are
also named precondition and postcondition. □

We can view pre- and postconditions as specifications of the component properties.
However, using pre- and postconditions does not assure that the software component fulfills these specifications. A formal proof is required.
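As an informal illustration of a pre- and postcondition pair (not taken from the thesis), the contract of a small made-up routine can be expressed as runtime-checkable assertions around its body:

    def integer_sqrt(n: int) -> int:
        """Contract: {P: n >= 0}  S  {Q: r*r <= n < (r+1)*(r+1)}."""
        assert n >= 0                            # precondition {P}
        r = int(n ** 0.5)
        while (r + 1) * (r + 1) <= n:            # correct possible rounding of the float sqrt
            r += 1
        while r * r > n:
            r -= 1
        assert r * r <= n < (r + 1) * (r + 1)    # postcondition {Q}
        return r

Such runtime checks only test individual executions; as stated above, a formal proof is needed to establish that the component fulfills the specification for all inputs.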
In the context of artificial neural networks we talk about validation techniques. In the
following we discuss the central ideas of neural network validation techniques in a nut-
shell. The validation approaches for neural networks are propositional rule extraction,
fuzzy rule extraction and region based analysis. Propositional rule extraction methods
take a (trained) neural network component as input, extract symbolic rules and test
them against the network behaviour itself and against the test data. These methods are
helpful to test neural networks.
Fuzzy rule extraction methods try to extract a set of fuzzy rules, which mimics the
behaviour of neural networks. The advantage of fuzzy rule extraction, compared to
propositional rule extraction, is that, generally, fewer rules are needed to explain the
neural network behaviour. In addition, the use of linguistic expressions yields easily
understandable characterizations of the neural network behaviour.
Region based analysis methods take a (trained) neural network as input and compute
related regions in the input and output space. Region based analysis techniques differ
from the above methods because they have a geometric origin, are usable for a broader
range of neural networks and have the ability to compute more accurate interface de-
scriptions of a neural network component. These methods compute a region mapping
of the form: if the input is in region $\mathcal{R}_x$ then the output is in region $\mathcal{R}_y$, where $\mathcal{R}_x$ describes
a set of points in the input space and $\mathcal{R}_y$ a set of points in the output space. For some
methods, these region based rules agree exactly with the behaviour of the neural net-
work. The more refined those regions are, the more information we obtain about the
neural network. Validity Interval Analysis (VIA), developed by Thrun [Thr93], for
example, is able to find provably correct axis-parallel rules, i.e. rules of the form: "if
the input $\mathbf{x}$ is in the hypercube $\mathcal{B}_x$ then the output $\mathbf{y}$ is in the hypercube $\mathcal{B}_y$". Hence,
region based analysis approaches are suitable to validate the behaviour of neural net-
work based software components.
The development of large software systems requires the interaction of different soft-
ware components. As motivated with the examples, we need some kind of human
understandable descriptions of neural network based software components (e.g. by
using fuzzy rules) as well as techniques to assert important properties of the neural
network behaviour (e.g. in form of valid relations between input and output regions).
Our approach to compute corresponding regions, represented as unions of polyhedra,
in the input and the output space of a neural network, is able to validate properties
about the neural network behaviour.
Definition 1.4 Polyhedral Precondition and Polyhedral Postcondition
A polyhedral precondition $\mathcal{P}_x$ is a precondition where the constraints on the input data
are a set of linear inequalities. Similarly, a polyhedral postcondition $\mathcal{P}_y$ is a postcon-
dition where the constraints on the output data are expressed as a system of linear
inequalities. □

We can view polyhedral pre- and postconditions as conjunctions of linear inequality
predicates. We also use the terminology “polyhedral interface assertions” or “polyhe-
dral interface description”.
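For illustration, a polyhedral pre- or postcondition can be stored directly as a pair $(\mathbf{A}, \mathbf{b})$ of a constraint matrix and a bound vector; the following sketch (with made-up numbers) checks whether a given point fulfills the corresponding conjunction of linear inequality predicates:

    import numpy as np

    # polyhedral precondition P_x = {x : A x <= b}, here a made-up triangle in the plane
    A = np.array([[-1.0,  0.0],
                  [ 0.0, -1.0],
                  [ 1.0,  1.0]])
    b = np.array([0.0, 0.0, 1.0])

    def satisfies(A, b, x, tol=1e-12):
        """True iff x fulfills every linear inequality predicate of the polyhedron."""
        return bool(np.all(A @ x <= b + tol))

    print(satisfies(A, b, np.array([0.2, 0.3])))   # inside  -> True
    print(satisfies(A, b, np.array([0.9, 0.9])))   # outside -> False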
Among others, the following properties are desirable for methods analysing the behaviour of trained neural networks:

• generality (also known in the literature as portability, e.g. see Andrews et al. [TAGD98]): the algorithm makes no assumptions about the neural network architecture and the learning algorithm,

• usable for classification problems as well as function approximation,

• high fidelity with the neural network, i.e. the computed rule set mimics well the behaviour of the neural network,

• precise and concise description of the neural network behaviour, e.g. in the form of a small number of informative and refined rules,

• polynomial algorithmic time and space complexity; in other words, the algorithm is still applicable in higher-dimensional cases,

• usable to validate properties of the neural network behaviour.
1.4 Annotated Artificial Neural Networks
Our approach is to forward- and backward-propagate finite unions of polyhedra through
all layers of the neural network. This strategy can be viewed as an extension of Valid-
ity Interval Analysis (VIA) and is consequently named Validity Polyhedral Analysis
(VPA). The method is very general, as the only assumptions are, that we work with
feed-forward neural networks (a brief introduction to feed-forward neural networks
follows in Chapter 2), and that the network has invertible and continuous transfer-
functions.
Both the VIA and the VPA technique rely on a refinement algorithm. Let $f$ represent the
function a feed-forward neural network computes; $f$ is a mapping from an input space
$X$ to an output space $Y$. Suppose that a pair $(\mathbf{x}, \mathbf{y})$ satisfies the following system of
constraints (see also [Mai00a]):

$\mathbf{y} = f(\mathbf{x}), \quad \mathbf{x} \in \mathcal{R}_x, \quad \mathbf{y} \in \mathcal{R}_y$

The numerical values of $\mathbf{x}$ and $\mathbf{y}$ are unknown. It is only known that $\mathbf{x}$ and $\mathbf{y}$ fulfill the
system of constraints. We refine our knowledge by computing proper subsets $\mathcal{R}'_x$ and $\mathcal{R}'_y$
of $\mathcal{R}_x$ and $\mathcal{R}_y$, such that the initial system of constraints implies that $\mathbf{x} \in \mathcal{R}'_x$ and $\mathbf{y} \in \mathcal{R}'_y$. By
computing the image of $\mathcal{R}_x$ under $f$ and the reciprocal image of $\mathcal{R}_y$, we obtain such
subsets:

$\mathcal{R}'_y = \mathcal{R}_y \cap f(\mathcal{R}_x), \qquad \mathcal{R}'_x = \mathcal{R}_x \cap f^{-1}(\mathcal{R}_y)$

Generally, the sets $\mathcal{R}'_x$ and $\mathcal{R}'_y$ are non-linear regions. VIA uses axis-parallel hyper-
cubes for $\mathcal{R}_x$ and $\mathcal{R}_y$, and also uses axis-parallel hypercubes to approximate $\mathcal{R}'_x$ and $\mathcal{R}'_y$. VPA
relies on polyhedra.
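A minimal sketch of this refinement step, using axis-parallel boxes as in VIA rather than the polyhedra of VPA: assuming a single layer $\mathbf{y} = \sigma(\mathbf{W}\mathbf{x} - \boldsymbol{\theta})$ with the monotone logsig transfer function, interval arithmetic gives a guaranteed over-approximation of the image of a box $\mathcal{R}_x$, which is then intersected with $\mathcal{R}_y$. The weights and regions below are made up; this is an illustration, not the thesis implementation.

    import numpy as np

    def logsig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_box(W, theta, lo, hi):
        """Over-approximate the image of the box [lo, hi] under x -> logsig(W x - theta)."""
        term_lo = np.minimum(W * lo, W * hi)   # element-wise: min/max of W[i,j]*lo[j], W[i,j]*hi[j]
        term_hi = np.maximum(W * lo, W * hi)
        net_lo = term_lo.sum(axis=1) - theta   # lower bound of the net input, per neuron
        net_hi = term_hi.sum(axis=1) - theta   # upper bound of the net input, per neuron
        # logsig is monotone increasing, so it maps interval bounds to interval bounds
        return logsig(net_lo), logsig(net_hi)

    def refine_output_box(W, theta, x_lo, x_hi, y_lo, y_hi):
        """One refinement step: R_y' = R_y intersected with a box around the image of R_x."""
        img_lo, img_hi = forward_box(W, theta, x_lo, x_hi)
        return np.maximum(y_lo, img_lo), np.minimum(y_hi, img_hi)

    # made-up one-layer network and initial regions R_x and R_y
    W = np.array([[2.0, -1.0],
                  [0.5,  1.5]])
    theta = np.array([0.0, 0.5])
    x_lo, x_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # R_x
    y_lo, y_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # R_y: the whole output space
    print(refine_output_box(W, theta, x_lo, x_hi, y_lo, y_hi))

The analogous backward step, and the replacement of boxes by general polyhedra, are the subject of Chapters 4 and 5.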
Our validation algorithm takes a (usually trained) neural network as input and produces an annotated version of this neural network.
Definition 1.5 Annotated Artificial Neural Network (AANN)
An artificial neural network, where the input and output of each layer is annotated via
a set of valid pre- and postconditions, is named an Annotated Artificial Neural Network
(AANN). □

For example, VIA produces pre- and postconditions in the form of axis-parallel rules. VPA
annotates a neural network via a set of linear inequality predicates.
Figure 1.1: Annotated version of a neural network.
Generally, the behaviour of neural networks can be described more accurately with a
finite union of polyhedra than with a finite union of axis-parallel hypercubes.
There is an interesting analogy between the notion of annotated neural networks and
software-verification for programs. One strategy to verify the correctness of a program
against a given specification is to annotate the program with logical expressions and to
prove the correctness of each step. In our case the “program” is a neural network and
each layer is annotated with a set of valid linear inequality predicates.
A Bridge to Logic and Software Verification
The Hoare calculus provides a formal framework to verify the correctness of programs
by annotating the program with assertions about the status of the program variables
and the change of this status under the program execution. The Hoare calculus defines
rules for the correct annotation of a program. The book by Broy [Bro97] provides a
thorough introduction to the basics of the Hoare calculus. Within the scope of this the-
sis the rule of statement sequence, the rule of consequence and the concepts of weaker
and stronger pre- and postconditions are relevant. Let $P, P_1, Q, Q_1$ denote predicate
logical pre- and postconditions with program variables as free identifiers. Statements
of the program are represented with S, S1 and S2. In the following description the rule
condition and the rule consequence are separated by a horizontal line.
Rule of Statement Sequence
$\dfrac{\{P\}\ \mathrm{S1}\ \{P_1\} \qquad \{P_1\}\ \mathrm{S2}\ \{Q\}}{\{P\}\ \mathrm{S1;\,S2}\ \{Q\}}$

Example for the Rule of Statement Sequence [Bro97]:
$\dfrac{\{x+1=a\}\ x:=x+1\ \{x=a\} \qquad \{x=a\}\ x:=x \cdot x\ \{x=a \cdot a\}}{\{x+1=a\}\ x:=x+1;\ x:=x \cdot x\ \{x=a \cdot a\}}$
Rule of Consequence
$\dfrac{P_1 \Rightarrow P \qquad \{P\}\ \mathrm{S}\ \{Q\} \qquad Q \Rightarrow Q_1}{\{P_1\}\ \mathrm{S}\ \{Q_1\}}$

Example for the Rule of Consequence [Bro97]:
tu�"p £ tq¥`��p�¥ ��tS¥`�"p¦¥5��t,KF�Htq¥)��tn�"p¦¥'� tn�"p¦¥ £ t,§�¨��t��"p���t,KF�Ht ¥ ��t�§�¨¦�Definition 1.6 Weaker and Stronger Precondition� %�£ � denotes that with � % also � is true. We say: � is the weaker precondition
and � % is the stronger precondition.
Definition 1.7 Weaker and Stronger Postcondition
$Q \Rightarrow Q_1$ denotes that whenever $Q$ holds, $Q_1$ also holds. We say: $Q$ is the stronger postcondition
and $Q_1$ is the weaker postcondition.
We can denote a feed-forward neural network as a finite sequence of an affine
transformation $\Gamma$, followed by a (usually) non-linear transformation $\sigma$. Let $\mathcal{R}_x$ be an
arbitrary region specifying a valid region for an input vector $\mathbf{x}$, with $\mathcal{R}'_x$ a region
defining the possible output after an affine transformation of an arbitrary $\mathbf{x} \in \mathcal{R}_x$, and,
finally, let $\mathcal{R}_y$ denote the possible output region after a non-linear transformation of an
arbitrary vector $\mathbf{x}' \in \mathcal{R}'_x$. In the logical framework we simply use the notation $\mathcal{R}_x$ to
express: $\mathbf{x} \in \mathcal{R}_x$. As possible instances of regions we consider axis-parallel boxes ($\mathcal{B}$)
and polyhedral regions ($\mathcal{P}$).

In the sequel we assume that $\mathcal{P}_x \subseteq \mathcal{B}_x$ and $\mathcal{P}_y \subseteq \mathcal{B}_y$. Let us consider two consecutive
layers of an annotated neural network.

Rule of Statement Sequence for an Annotated Artificial Neural Network

$\dfrac{\{\mathcal{R}_x\}\ \Gamma\ \{\mathcal{R}'_x\} \qquad \{\mathcal{R}'_x\}\ \sigma\ \{\mathcal{R}_y\}}{\{\mathcal{R}_x\}\ \Gamma;\,\sigma\ \{\mathcal{R}_y\}}$

We will refer to assertions of type $\mathcal{R}'_x$ as the intermediate status. The repeated application
of the rule of statement sequence to a multilayer feed-forward neural network allows
us to write:

$\{\mathcal{R}_x\}\ \mathrm{ANN}\ \{\mathcal{R}_y\}$

where ANN represents the sequence of computations a feed-forward neural network
performs from the input to the output. The above statement reads: "if $\mathbf{x} \in \mathcal{R}_x$ then $\mathbf{y} \in \mathcal{R}_y$".

Rule of Consequence for an Annotated Artificial Neural Network

$\dfrac{\mathcal{P}_x \Rightarrow \mathcal{B}_x \qquad \{\mathcal{B}_x\}\ \Gamma;\,\sigma\ \{\mathcal{P}_y\} \qquad \mathcal{P}_y \Rightarrow \mathcal{B}_y}{\{\mathcal{P}_x\}\ \Gamma;\,\sigma\ \{\mathcal{B}_y\}}$

In this logical framework we can view the polyhedron $\mathcal{P}_x$ as the stronger precondition
and the polyhedron $\mathcal{P}_y$ as the stronger postcondition. The more refined the regions
of an annotated neural network are, the stronger the corresponding pre- and postcon-
ditions. It turns out that our geometrical perspective is quite useful, as it allows us
to define a precise measurement for the strength of a precondition or postcondition,
namely the volume of the corresponding region.
1.5 Highlights and Organization of this Dissertation
The highlights and the organization of this thesis are as follows:
Chapter 2: Analysis of Neural Networks
In this chapter we introduce basic concepts of feed-forward neural networks and pro-
vide a literature overview of validation methods for neural network components. We
classified the validation methods into propositional rule extraction, fuzzy rule extrac-
tion and region-based analysis. Finally, the different methods are compared and our
approach, named Validity Polyhedral Analysis (VPA), is motivated.
Chapter 3: Polyhedral Computations and Deformations of Polyhedral Facets un-
der Sigmoidal Transformations
Polyhedra are the generalization of convex polygons to higher dimensional spaces.
This chapter presents the most important properties and concepts of polyhedral analy-
sis to make this thesis self contained.
To obtain refined polyhedral interface assertions, we have to propagate unions of poly-
hedra through all layers of a neural network. This requires computing the image of a
polyhedron under a non-linear transformation. The images of non-axis-parallel polyhe-
dra under a sigmoidal transformation are non-linear regions. In our initial investiga-
tions we analyse how polyhedral facets get twisted under a sigmoidal transformation.
Chapter 4: Mathematical Analysis of the Non-linear Transformation Phase
In this chapter we explain how to approximate the image of a polyhedron under a non-
linear transformation by a finite union of polyhedra. This approximation process can
be reduced to a non-linear optimization problem. Several approaches to approximate
the global maximum of the corresponding optimization problem are discussed.
Chapter 5: Mathematical Analysis of the Affine Transformation Phase
The computation of the reciprocal image of a polyhedron under an affine transforma-
tion is explained in this chapter. Furthermore, this chapter discusses how to calculate
the image of a polyhedron under an affine transformation and strategies for computing
or approximating the projection of a polyhedron onto a lower dimensional subspace.
Within the scope of this thesis projection techniques are used for the computation of an
image of a polyhedron under an affine transformation characterized by a non-invertible
matrix.
Chapter 6: Implementation Issues and Numerical Problems
This chapter discusses the design and implementation of a general framework for any
region-based refinement algorithm. The framework is successfully used for the Valid-
ity Interval Analysis (VIA) and our new Validity Polyhedral Analysis (VPA) method.
It is always necessary to study numerical properties of a mathematical algorithm when
implementing the algorithm on a digital machine with finite precision. Section 6.2 is
devoted to these problems, which will be referred to as the numerical problems.
Chapter 7: Evaluation of Validity Polyhedral Analysis
Validity Polyhedral Analysis computes interface assertions of a neural network in poly-
hedral format. We evaluated VPA on toy neural networks, on neural networks trained
with benchmark data sets from the UC Irvine database [Rep] and on a neural network
trained to predict the SP500 stock-market index. Additionally, the method is com-
pared to VIA (Validity Interval Analysis) and the refinement process is discussed.
Chapter 8: Conclusion and Future Work
This chapter summarizes the main contributions, explains how to “fine tune” the in-
troduced VPA strategy, and finally motivates future investigations to obtain validation
techniques for kernel-based machines, like for example support vector machines.
Appendices
Appendix A summarizes all used symbols and notations.
Appendix B recalls the relevant knowledge and notions of linear algebra to make this
thesis self contained.
1.6 Summary of this Chapter
As throughout this thesis, a summary of the chapter and a list of new contributions is
provided.
To motivate neural network validation methods, examples of neural network compo-
nents in safety-critical applications have been illustrated. Notation conventions have
been introduced, neural network validation techniques discussed and criteria for neu-
ral network validation methods have been formulated. Finally, the idea of validity
polyhedral analysis was introduced, which is used to obtain an annotated version of a
feed-forward neural network.
Contributions - Chapter 1 -
• The notion of Annotated Artificial Neural Networks (AANN) and the application
of the method of assertions to neural networks.

• Validity Polyhedral Analysis (VPA), as a tool to annotate a feed-forward neural
network with valid pre- and postconditions in the form of linear inequality
predicates.
Chapter 2
Analysis of Neural Networks
Section 2.1 recalls central ideas of artificial neural networks. For a very thorough in-
troduction, the reader is referred to the excellent book by Haykin [Hay99].
As motivated in the introduction, trained neural network components need to undergo
a validation or testing procedure before their (industrial) use. Section 2.2 is devoted to
the topic of neural network validation.
Finally, a short summary of the discussed validation methods is provided and our ap-
proach, named Validity Polyhedral Analysis (VPA) is motivated (why we do it) and
justified (why we do it this way).
2.1 Neural Networks
Artificial Neural Networks (ANNs) are partly inspired by observations about the bio-
logical brain. These observations led to the conclusion that information in biological
neural systems is processed in parallel over a network of a large number of intercon-
nected, distributed neurons (simple computational units). However, there are a lot of
differences between biological neural systems and ANNs.
For example, the output of a neuron of an ANN is a single value, whereas a biological
neuron produces an often complex time series of spikes.¹ Artificial Neural Networks
are general function approximators, able to learn from a set of examples. Therefore
ANNs are also often characterized as statistical learners.

¹There is ongoing research about simulating spiking neural networks, but it is still not possible to
simulate exactly the behaviour of a biological neuron. Furthermore, biological neurons are also influenced
by hormones and other chemical transmitters.

The main features of neural network based machines are [Fri98]:
• a large number of identical computational units, called neurons,

• weighted connections between these units,

• the parameters (weights) of the neural network are adjusted (stepwise) during the learning process, and

• usually non-linear transformations are used.
In this work we focus on sigmoidal feed-forward neural networks (also known as
multi-layer perceptrons). The structure of a single neuron is presented in Figure
2.1. The input to a neuron of a feed-forward neural network is calculated by com-
puting the weighted sum of the activations of the preceding neurons and adding a bias
value. In other words, the input to the $i$-th neuron is given by $\langle \mathbf{w}, \mathbf{x} \rangle - \theta$, with
$\mathbf{w} = \mathbf{W}(i,:)$ and $\theta = \boldsymbol{\theta}(i)$, where $\mathbf{w}$ is the weight vector and $\mathbf{x}$ the activation of the
preceding neurons, also referred to as the input vector. The bias, added to the $i$-th
neuron, is denoted with $\boldsymbol{\theta}(i)$.

In case of a threshold activation function a neuron is active (has a positive value)
if the weighted input $\langle \mathbf{w}, \mathbf{x} \rangle$ is bigger than the bias $\theta$. The hyperplane defined by
$H = \{\mathbf{x} : \mathbf{w}^T\mathbf{x} = 0\}$ is a hyperplane through the origin and splits the input space into
two half-spaces. The set of points $\mathbf{x}$ with $\mathbf{w}^T\mathbf{x} > 0$ defines the positive halfspace $H^+$;
all other points, on the opposite side of the hyperplane, are in the negative halfs-
pace ($H^-$). For all points on the hyperplane the dot product is zero. Adding the bias $\theta$
results in a shift of the hyperplane along $\mathbf{w}$. Geometrically, in case of a threshold
activation, a neuron is active if the input $\mathbf{x}$ is in the positive halfspace, i.e. $\mathbf{x} \in H^+$.

This geometrical interpretation also explains why, historically [Ros58], neural net-
works were introduced as classifiers. For example, for a linearly separable classifica-
tion problem a neural network with a single weight layer would be sufficient to learn the task
correctly. Labeled data is said to be linearly separable if the patterns of the two classes lie on opposite
sides of a hyperplane.
[Figure 2.1 depicts a single neuron computing $\mathrm{net}(i) = \sum_j \mathbf{W}(i,j)\,x(j) - \boldsymbol{\theta}(i)$ and $y(i) = f(\mathrm{net}(i))$.]
Figure 2.1: Single neuron of a multilayer perceptron, where $\mathbf{W}(i,:)$ are the weighted
connections between the input units and the $i$-th neuron of the next layer.
Often the threshold function

$f(\mathrm{net}(i)) = \begin{cases} 0 & \text{if } \mathrm{net}(i) < 0 \\ 1 & \text{if } \mathrm{net}(i) \ge 0 \end{cases}$

and the sigmoid function

$f(\mathrm{net}(i)) = \dfrac{1}{1 + e^{-\mathrm{net}(i)}}$

are used as activation functions for feed-forward neural networks. The graphs of these
functions are on the left of Figure 2.2. The figure also shows the function output of a
single output neuron with a two-dimensional input space, when applying a) the logsig and
b) the threshold function to the neuron input.
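The two activation functions can be written down directly; the short sketch below (illustrative only, in Python/NumPy rather than the Matlab used in the thesis) also evaluates a single neuron with net input $\langle\mathbf{w},\mathbf{x}\rangle - \theta$ under both choices:

    import numpy as np

    def threshold(net):
        """Threshold activation: 0 if net < 0, 1 if net >= 0."""
        return np.where(net < 0.0, 0.0, 1.0)

    def logsig(net):
        """Sigmoid (logistic) activation: 1 / (1 + exp(-net))."""
        return 1.0 / (1.0 + np.exp(-net))

    # a single neuron with made-up weights, bias and a two-dimensional input
    w, theta = np.array([1.0, -2.0]), 0.5
    x = np.array([0.8, 0.1])
    net = w @ x - theta                  # net input <w, x> - theta = 0.1
    print(threshold(net), logsig(net))   # 1.0 and approximately 0.525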
Figure 2.2: Sigmoid and threshold activation functions and the graph of the function
computed by a single neuron with a two-dimensional input. In this case the weight
matrix reduces to a vector $\mathbf{w}$. Note that along any vector perpendicular to $\mathbf{w}$ the
output is constant, and that along a line in direction $\mathbf{w}$ the output has the shape of the
transfer function.
A feed-forward neural network architecture has an input layer, several hidden layers
and an output layer. The input vector propagates through all layers of the neural net-
work in a forward direction. The dimension (size) of the input layer, i.e. the number
of input neurons and the dimension of the output layer are defined by the application.
It is difficult to determine a priori a suitable number of hidden layers and the number
of hidden neurons for each of these layers. This has to be solved during the model
selection process. In Figure 2.3 we show a typical multilayer perceptron architecture
consisting of an $n$-dimensional input layer, an $m$-dimensional output layer and one
hidden layer of dimension $k$.
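A complete forward pass through such a two-layer architecture can be sketched as follows (dimensions and weights are made up; the bias convention $\mathbf{net} = \mathbf{W}\mathbf{x} - \boldsymbol{\theta}$ follows the notation above):

    import numpy as np

    def logsig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, theta1, W2, theta2):
        """Two-layer feed-forward pass: y = f2(W2 * f1(W1 x - theta1) - theta2)."""
        net1 = W1 @ x - theta1    # net input to the first transfer-function layer
        x2 = logsig(net1)         # hidden activations = input to the second weight layer
        net2 = W2 @ x2 - theta2   # net input to the output transfer-function layer
        return logsig(net2)       # network output y

    # n = 3 inputs, k = 4 hidden neurons, m = 2 outputs (made-up weights and biases)
    rng = np.random.default_rng(0)
    W1, theta1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, theta2 = rng.normal(size=(2, 4)), np.zeros(2)
    print(forward(np.array([0.1, 0.5, -0.2]), W1, theta1, W2, theta2))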
[Figure 2.3 shows the architecture together with an explanation legend: $\mathbf{x}_1$ is the input of the neural network; $\mathbf{W}_1, \boldsymbol{\theta}_1$ are the input weight matrix connecting the input layer to the next layer and its bias; $\mathbf{net}_1 = \mathbf{W}_1\mathbf{x}_1 - \boldsymbol{\theta}_1$ is the input vector to the first transfer-function layer; $\mathbf{x}_2$ is the input to the second layer and the output of the first transfer-function layer; $\mathbf{W}_2, \boldsymbol{\theta}_2$ are the layer weight matrix connecting the previous and successor layer and its bias; $\mathbf{net}_2 = \mathbf{W}_2\mathbf{x}_2 - \boldsymbol{\theta}_2$; and the network output is $\mathbf{y} = f_2(\mathbf{W}_2 f_1(\mathbf{W}_1\mathbf{x}_1 - \boldsymbol{\theta}_1) - \boldsymbol{\theta}_2)$.]
Figure 2.3: Two-layer feed-forward neural network
Learning Paradigms
The most widespread learning method is supervised learning. Supervised learning as-
sumes that the output data for the training set is available. The neural network learns
by adjusting stepwise the weight parameters and biases, according to the difference
between its actual output and the desired output. An example of this type of learning is
the backpropagation of error algorithm. This algorithm is typically used to train feed-
forward neural networks.
Other often applied machine learning strategies are unsupervised learning and rein-
forcement learning. Unsupervised learning is used, for example, to cluster data points
with unknown class labels into groups with similar inputs.
Reinforcement learning describes a process, where an autonomous agent that acts in
an environment learns to choose optimal actions to achieve its goals. The reader is
referred to the book by Fritzke [Fri98] for more information on unsupervised learning
and to the book by Mitchell [Mit97] for reinforcement learning.
2.2 Validation of Neural Network Components
Artificial neural networks have numerous advantages, such as universal ap-
proximation capability and the ability to learn. However, neural networks have often been
criticized for their low degree of comprehensibility. For a user of a neural network it
is impossible to infer how a specific output is obtained.
Validation of ANNs is important in safety-critical problem domains, and for the in-
tegration of ANN components into large software environments, interface assertions
describing the behaviour of the neural network are desirable.
This section gives a literature overview of methods useful to explain the function com-
puted by an ANN. The methods are categorized into three groups, namely proposi-
tional rule extraction, fuzzy rule extraction and region based analysis. Before describ-
ing some approaches of these classes in detail, we will define the problem of validating
neural network components, introduce some useful formalism and provide an exam-
ple. The excellent introductory book on computer science by Broy [Bro97] starts with
the definition of the terms information, representation and interpretation.
Definition 2.1 Information and Representation
We call information the abstract content (semantics) of a document, expression, mes-
sage, statement or program. We refer to its formal description as representation. □

Definition 2.2 Interpretation
We call $I : R \to S$ an interpretation (function) which explains a given representation
$r \in R$ with the semantics $s \in S$. □

We can view an interpretation function as a mapping from one representation
into another representation. Such a system $(R, S, I)$ is called an information system
[Bro97]. Two important remarks [Bro97]:
1. For a given representation various interpretation functions are possible.
2. The semantics itself has to be represented.
As a very simple example of an information system, consider the natural numbers
$(1, 2, 3, \ldots)$ represented with the Arabic numerals. As the representation of the seman-
tics for the natural numbers we use a line (tally) representation, i.e. $(|, ||, |||, \ldots)$:

$I(1) = |, \quad I(2) = ||, \quad I(3) = |||, \quad \ldots$
In our context the neural network is the formal description and we are interested to
compute the abstract content or semantics of the neural network. We refer to this
interpretation process as validation of neural network components.
Definition 2.3 Validation of Neural Network Components
Validation methods for neural networks try to find the semantics for the function com-
puted by a neural network. □

As mentioned before, the semantics itself has to be rep-
resentation, also referred to as alternative representation, should express an equivalent
or very similar behaviour as the function computed by a neural network. Furthermore,
we demand that the computed representation is concise and more comprehensible than
the neural network. However, we will, in general, not be able to find an exact repre-
sentation of the neural network, instead we will approximate the function computed by
an ANN. In the literature, the process of validating a neural network is often referred
to as “rule extraction” [ADT95] or “neural network analysis” [MP00]. We prefer the
term validation techniques, because this is our main focus.
Traditionally, people have been interested in obtaining human readable explanations of
the neural network behaviour. In the following we provide an “ideal” example for the
validation process of a neural network. Ideal in the sense that we assume: we have a
priori a precise specification of what the neural network has to learn, the neural network
learned the task correctly and we have a validation method capable of finding an exact,
alternative representation of the function computed by the neural network. For demon-
stration purposes the problem class M1 of the three MONK's benchmark problems, as
introduced by Thrun [Thr91], is used.
The three MONK benchmark problems are binary classification tasks, defined in an
artificial robot domain. Robots are described by the following attributes.
$x(1)$: headShape $\in$ {round, square, octagon}
$x(2)$: bodyShape $\in$ {round, square, octagon}
$x(3)$: jacketColor $\in$ {red, yellow, green, blue}
$x(4)$: isSmiling $\in$ {yes, no}
$x(5)$: holding $\in$ {sword, balloon, flag}
$x(6)$: hasTie $\in$ {yes, no}

Formally, the above association of an element of the input vector $\mathbf{x}$ with a natural
language term is also an information system, for example: $I_{x(1)}([0,1]) = \text{headShape}$.
The learning task is a binary classification problem. A robot belongs to the target class
M1 if:
$(\text{headShape} \equiv \text{bodyShape}) \vee (\text{jacketColor} \equiv \text{red})$

We assume that a neural network was trained with a correct training set (i.e. no noise
in the training data) to classify robots according to M1. Let us define the following
mapping between numerical intervals and natural language expressions2 :
$I_{x(1)}([0, 0.33[) = \text{round}$   $I_{x(1)}([0.33, 0.66[) = \text{square}$   $I_{x(1)}(]0.66, 1]) = \text{octagon}$   $I_{x(2)} = I_{x(1)}$
$I_{x(3)}([0, 0.25]) = \text{red}$   $I_{x(3)}(]0.25, 0.5]) = \text{yellow}$   $I_{x(3)}(]0.5, 0.75]) = \text{green}$   $I_{x(3)}(]0.75, 1]) = \text{blue}$
$I_{y}([0, 0.5[) = \text{robot is not in M1}$   $I_{y}([0.5, 1]) = \text{robot is in M1}$
• We assume a validation method found a set of axis-parallel rules to describe the
2For demonstration purposes we prefer to use intervals instead of sparse coding.
behaviour of the neural network, for example:
IF $\begin{bmatrix} 0.1 \\ 0.1 \\ 0 \end{bmatrix} \le \begin{bmatrix} x(1) \\ x(2) \\ x(3) \end{bmatrix} \le \begin{bmatrix} 0.3 \\ 0.2 \\ 1 \end{bmatrix}$ THEN $0.6 \le y \le 1$
• Applying an interpretation on the input vector leads to a more human readable
rule like:
IF $\begin{bmatrix} 0.1 \\ 0.1 \\ 0 \end{bmatrix} \le \begin{bmatrix} \text{headShape} \\ \text{bodyShape} \\ \text{jacketColor} \end{bmatrix} \le \begin{bmatrix} 0.3 \\ 0.2 \\ 1 \end{bmatrix}$ THEN $0.6 \le y \le 1$
• Finally, we obtain a representation comprehensible to humans when using the
interpretation on the intervals of the input and output nodes (note that the jack-
etColor can take an arbitrary value and therefore is not relevant for the rule):
IF headShape ≡ round ∧ bodyShape ≡ round THEN M1
• The extracted rule set, in human readable form, is as follows:
IF headShape ≡ round ∧ bodyShape ≡ round THEN M1
IF headShape ≡ square ∧ bodyShape ≡ square THEN M1
IF headShape ≡ octagon ∧ bodyShape ≡ octagon THEN M1
IF jacketColor ≡ red THEN M1
• After a rule refinement (see the following definition) process we would obtain
simply:
IF (headShape ≡ bodyShape) ∨ (jacketColor ≡ red) THEN M1
This means we would be able to describe the semantics of this trained neural
network. This neural network is named $\mathrm{ANN}_{M1}$. Overall it is possible to state:

$I(\mathrm{ANN}_{M1})$ = IF (headShape ≡ bodyShape) ∨ (jacketColor ≡ red) THEN M1
With the above statement we would have a very comprehensible description
about the neural network behaviour. As motivated in the introduction, this could
be helpful to prevent software faults. In addition, we would have validated that
the neural network correctly learned the task of classifying whether a robot belongs to M1.
However, as mentioned previously, any neural network validation method is just
an approximation of the true behaviour of the neural network. Therefore, for real
world applications, we are not able to compute an alternative representation of
the neural network that always agrees semantically with the neural network. A small sketch of how such agreement (fidelity) can be checked for this ideal example is given below.
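To make the notion of agreement (fidelity) concrete for this ideal example, the following sketch (not part of the thesis) enumerates all 432 robots of the MONK's domain and compares the extracted rule set with the target concept M1; here the two coincide on every robot:

    from itertools import product

    head_shape = ["round", "square", "octagon"]
    body_shape = ["round", "square", "octagon"]
    jacket_color = ["red", "yellow", "green", "blue"]
    is_smiling = ["yes", "no"]
    holding = ["sword", "balloon", "flag"]
    has_tie = ["yes", "no"]

    def target_m1(r):
        """Target concept M1: (headShape = bodyShape) or (jacketColor = red)."""
        return r["head"] == r["body"] or r["jacket"] == "red"

    def extracted_rules(r):
        """The four extracted rules of the ideal example, before refinement."""
        return (r["head"] == r["body"] == "round"
                or r["head"] == r["body"] == "square"
                or r["head"] == r["body"] == "octagon"
                or r["jacket"] == "red")

    robots = [dict(zip(("head", "body", "jacket", "smiling", "holding", "tie"), c))
              for c in product(head_shape, body_shape, jacket_color,
                               is_smiling, holding, has_tie)]
    agree = sum(target_m1(r) == extracted_rules(r) for r in robots)
    print(f"fidelity: {agree}/{len(robots)} robots")   # 432/432 for this ideal example

For a real trained network, the comparison would of course be made against the network's outputs rather than the known target concept, and exhaustive enumeration is only possible for such small discrete domains.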
Definition 2.4 Rule refinement
In general, rule extraction of ANNs results in a large number of rules. Often it is
possible to reduce the number of rules, e.g. by applying boolean algebra. This process
is called rule refinement. For the rule refinement process we demand that the refined
rule set is semantically equivalent to the original rule set. wWe decided to classify validation methods according to their representation format,
because this also reflects quite well the different strategies to analyse neural networks.
Propositional rule extraction represents the semantics of an ANN with a set of propo-
sitional rules, fuzzy rule extraction with a set of fuzzy rules and region based analysis
via a set of region mappings. Figure 2.4 provides an overview.
[Figure 2.4 lists, for each validation class, an example of its representation: propositional rule extraction — "if x(1) and x(2) then y"; fuzzy rule extraction — "if x(1) is small then y"; region based analysis — "if x is in region X then y is in region Y".]
Figure 2.4: Overview of different validation methods for neural networks.
To describe the different neural network validation techniques, we start with a short
introduction of the corresponding validation class and introduce selected algorithms
followed by a discussion according to several properties. Within the scope of this the-
sis we are especially interested in validating properties of neural networks. Hence, we
included the property “validation capability” in our discussion about rule extraction
techniques. This property was not found anywhere in the literature. All other prop-
erties have been proposed before by Towell and Shavlik [TS93a] or Andrews et al.
[TAGD98]. We will analyse a validation method according to the following properties:
I Portability: can the method be used for arbitrary feed-forward neural networks?
A method is said to be generally applicable to feed-forward neural networks, if
the only assumption is that the transfer-functions are invertible.
I Fidelity: how well does the computed representation mimic the neural network
behaviour? This could be experimentally measured by comparing the output of
the neural network with the output of the extracted rule set.
I Comprehensibility: is the representation easy to understand and usable to make
informative statements about the neural network behaviour?
I Validation capability: can the method be used to verify properties of the neural
network behaviour? We motivate this property, as it is an important aspect for
safety-critical applications to prove that the system fulfills special properties.
For example, we would like to extract refined rules of this form: "if x is in
a region R_X of the input space then it is guaranteed that the neural network
output y is in the region R_Y of the output space". Trivially, the statement
is always true if R_X and R_Y are the maximal input and output spaces of the neural
network. Hence, we are interested in the most refined rules of the above type,
because these provide more information. In other words, we want to compute
valid rule sets with high fidelity. In our example this means: for a given input
region R_X we are interested in computing the smallest output region R_Y such that
the neural network output for all points in R_X lies within R_Y.

I Algorithmic complexity: the time and space complexity of the neural network
analysis algorithm.
I Usability for function approximation and classification tasks: is the method
helpful for neural networks performing a function approximation task and how
useful is the method to explain neural networks performing a classification task?
It is important to note that the property comprehensibility is subjective. We discuss
this property, but we cannot provide exact measurements. Furthermore, it is difficult to
provide proper experimental comparisons between different validation techniques with
respect to fidelity, because the reviewed methods report evaluations on (usually)
different neural networks, even when trained on the same data set.
For a sound comparison with respect to fidelity all validation methods should use
exactly the same neural network, i.e. the same architecture and the same values for
the weights and biases. Unfortunately, to the best of the author's knowledge, only
benchmark data sets are available, but no benchmark neural networks.
The aim of the following literature review is to explain the core idea of various neural
network validation algorithms. As a consequence only a sketch of the algorithm is
provided. For more detailed information about the corresponding algorithm the reader
is referred to the relevant publications.
2.2.1 Propositional Rule Extraction
An overview of (classical) propositional rule extraction methods is given in [ADT95],
and in the updated version [TAGD98]. The rule extraction approaches are categorized
into decompositional, hybrid (eclectic) and pedagogical. Decompositional (white box)
approaches try to extract the embedded knowledge by analysing each of the neural
network units with respect to their inputs. Pedagogical methods view the neural net-
work as a black box and analyse it on their input/output behaviour (e.g. ruleneg, see
[ADT95]). Between these two approaches are the so called eclectic or hybrid meth-
ods. Decompositional approaches within the class of propositional rule extraction can
be characterized by the following general properties:
I These algorithms mainly rely on clustering of the weight values (e.g. m-of-n), on pruning, and on an exhaustive search over the inputs of all units (e.g. m-of-n, the KT-method).
I The methods are only usable for special constructed neural networks, i.e. these
techniques are not general. This is due to their assumptions of the neural net-
work topology, weights, transfer functions and biases, e.g. the KT-method is
only suitable for sparsely connected neural network topologies, furthermore KT
assumes that a neuron output is either 1 or 0. In some cases even special learn-
ing algorithms are required (e.g. m-of-n uses a training of the modified neural
network, while keeping the weights constant).
I Classical decompositional rule extraction methods include approximations from
the start (e.g. m-of-n assumes a 0/1 activity for a neuron). The algorithm itself
approximates the rules on an already approximated neural network, instead of
using the original neural network. Therefore the extracted rules are not neces-
sarily valid for the original neural network.
Additionally, most propositional rule extraction methods are only suitable for classifi-
cation tasks, but not for function approximation [SLZ02].
We briefly explain two decompositional methods, namely KT and m-of-n. Both meth-
ods are representative examples for propositional rule extraction methods, which try
to analyse the neural network in a decompositional manner. In our description we will
use the term antecedent as follows:
Definition 2.5 Antecedent
A variable in the premise of a propositional rule is called an antecedent. For example, x(1) is the antecedent in the rule "if x(1) then y". □
KT-Method
The method takes a decompositional approach by finding rules for each unit, which
are assumed to have binary output. The KT algorithm (in [TS93b] referred to as the
subset algorithm) searches for subsets of incoming positive weights which exceed the
bias of a unit. In a second phase, a search for subsets of negative weights starts, such
that the sum of the negative weights is greater than the sum of the positive weights
minus the bias. The search space for the activity of a single unit rises exponentially
with the number of incoming links to this unit (power set). Hence, the number of
possible combinations to form a rule is exponential. To handle this problem, heuristics
are introduced to prune the search space. For example, in [TS93b] an upper bound
for the number of positive and negative subsets is defined. Other variations of the KT-
method (especially with different heuristics to prune the search space) are described in:
[Fu94], [SN88] and [TS93b]. In Figure 2.5 an example of the effect of the KT-method
is depicted and on the next page the algorithm is explained.
(Figure content: a single unit with inputs x(1), ..., x(4), weights 1, 4, -3, -1 and bias 2.
The KT algorithm calculates the positive subsets {x(2), (x(1), x(2))} which exceed the bias
and, for each of them, negative subsets whose summed weights are greater than the surplus
of the positive subset over the bias; the extracted rules are "if x(2) and not x(4) then y",
"if x(2) and not (x(3), x(4)) then y" and "if (x(1), x(2)) and not (x(3), x(4)) then y".)
Figure 2.5: Example for the KT-Method.
KT-Method
Input: weights, biases of the neural network
Output: rule set

for each neuron i   // of the hidden and output layers
{
    // W is the weight matrix between layers, θ(i) the bias of neuron i, and
    // net(i) = W(i,:) · x is the i-th input to an activation function.
    // Often the search space is restricted by defining upper bounds for the
    // number of positive and negative subsets, e.g. in [TS93b].

    find up to k_p subsets of indexes I_p ⊆ {1, ..., n} such that:
        w(j) > 0 for all j ∈ I_p  and  Σ_{j ∈ I_p} w(j) > θ(i)
    S_p = { I_p^1, ..., I_p^{k_p} }

    for each index set I_p ∈ S_p
    {
        find up to k_n subsets of indexes I_n ⊆ {1, ..., n} such that:
            w(j) < 0 for all j ∈ I_n  and  Σ_{j ∈ I_p} w(j) − θ(i) + Σ_{j ∈ I_n} w(j) < 0
        S_n = { I_n^1, ..., I_n^{k_n} }

        for each index set I_n ∈ S_n
        {
            // I_p and I_n are vectors of indexes
            rule set = rule set ∪ { if x(I_p) and not x(I_n) then output unit is active }
        }
    }
}
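The subset search can be made concrete with a small sketch. The following Python fragment is an illustrative, simplified rendering of the search described above for a single unit with 0/1 inputs; the function name, enumeration order and rule format are choices of this sketch and not taken from [TS93b], and the heuristic bound on the number of rules is only hinted at by the max_rules parameter.

```python
from itertools import chain, combinations

def kt_rules_for_unit(weights, bias, max_rules=None):
    """Candidate rules for one unit with 0/1 inputs (0-based indices):
    a positive subset whose summed weights exceed the bias, combined with a
    negative subset whose absolute summed weights exceed the surplus of the
    positive subset over the bias; the negative subset is negated in the rule."""
    pos = [i for i, w in enumerate(weights) if w > 0]
    neg = [i for i, w in enumerate(weights) if w < 0]

    def subsets(indices):
        return chain.from_iterable(
            combinations(indices, r) for r in range(1, len(indices) + 1))

    rules = []
    for p in subsets(pos):
        surplus = sum(weights[i] for i in p) - bias
        if surplus <= 0:
            continue
        for n in subsets(neg):
            # This negative subset could defeat the positive subset, so the
            # candidate rule requires that none of its inputs is active.
            if -sum(weights[i] for i in n) > surplus:
                rules.append((list(p), list(n)))
            if max_rules is not None and len(rules) >= max_rules:
                return rules
    return rules

# Unit of Figure 2.5: weights 1, 4, -3, -1 and bias 2.
for p, n in kt_rules_for_unit([1.0, 4.0, -3.0, -1.0], 2.0):
    print(f"if x{p} and not x{n} then y")
```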
Discussion of the KT-method
I Portability: The KT-method requires a special neural network architecture, due
to the assumption that inputs to, and outputs from, a neuron are either 1 or 0.
I Fidelity: KT achieves a high fidelity for small binary neural networks, where the
whole search space can be explored. Usually, due to the pruning of the search
space, the larger the neural network, the less informative the rule set becomes,
because only a fraction of the possible neural network inputs are considered.
I Comprehensibility: The KT method produces human readable propositional
if/then rules. For problems with high dimensional inputs and outputs the com-
prehensibility clearly decreases as a large set of rules is needed to explain the
behaviour of the neural network.
I Validation capability: Each rule is a valid statement about a property of the
neural network. This type of rule can also be found by computing the neural
network output for special points in the input space. Hence, the KT-method can
not be used to validate general properties of the neural network, e.g. to make
statements such as: "if the input is in region R then the neural network always
classifies the data as positive”. Overall we can state that the KT-method is a
strategy to test the neural network behaviour rather than to validate it.
I Algorithmic complexity: The algorithm would scale exponentially in time com-
plexity as well as in space complexity (exponential number of rules) with an
increasing number of inputs to a neuron. However, depending on the heuristics
to prune the search space this complexity can be reduced.
I Usability for function approximation and classification tasks: The KT-method
is only applicable for (discrete) classification tasks.
m-of-n Method
In [TS93b] the m-of-n rule extraction method was introduced. According to [TS93b],
rules extracted by the KT-method often contain m-of-n style rules. These are rules of
this format:
if (m of the following n antecedents are true) then the output unit is active
The basic idea of the m-of-n approach is to cluster weights of similar values into
groups, i.e. to build equivalence classes.

m-of-n Method
Input: weights, biases of the neural network
Output: rule set, modified neural network
1. For each hidden and output unit: build groups of similar weights.
2. Assign all weights of the same group the average value of this group.
3. Eliminate groups which are not significantly important for the activity of the
subsequent unit (using a heuristic and an algorithmic method, see [TS93b]).
4. Optimize. Because unimportant groups are eliminated this could change the
activity of units within the neural network. To address this problem, the
modified neural network has to be trained. The remaining weights are kept
constant during this learning process, such that the groups remain unchanged.
Therefore the retraining just optimizes the biases.
5. Extracting. Translating the bias and incoming weights to each unit into a rule
with weighted antecedents such that the rule is true if the sum of weighted
antecedents exceeds the bias.
6. Simplifying. Rewriting the rule in m-of-n notation.
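Steps 1 and 2 (grouping similar weights and replacing them by group averages) can be sketched as follows. This is only an illustration of the grouping idea; the tolerance parameter and the simple single-linkage grouping over the sorted weights are assumptions of this sketch, not the clustering procedure of [TS93b].

```python
def group_weights(weights, tol=0.5):
    """Sketch of steps 1-2 of the m-of-n method: cluster similar incoming
    weights of one unit into groups and replace each weight by the average
    value of its group.  The tolerance tol is an illustrative parameter."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    groups, current = [], [order[0]]
    for i in order[1:]:
        if abs(weights[i] - weights[current[-1]]) <= tol:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)

    averaged = list(weights)
    for g in groups:
        avg = sum(weights[i] for i in g) / len(g)
        for i in g:
            averaged[i] = avg
    return groups, averaged

# Weights in the spirit of Figure 2.6: two clusters of similar values.
groups, averaged = group_weights([7.2, 1.1, 6.8, 0.9, 7.0])
print(groups)    # [[3, 1], [2, 4, 0]]
print(averaged)  # [7.0, 1.0, 7.0, 1.0, 7.0]
```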
(Figure content: a unit with five inputs x(1), ..., x(5). Steps 1 and 2 group the incoming
weights into two clusters and replace them by the group averages (7 and 1); step 3 eliminates
the group that is not significant for the unit's activity; steps 4 and 5 translate the remaining
weights and the bias into rules such as "if x(1) and x(3) then y"; step 6 rewrites these as
"if 2 of {x(1), x(3), x(5)} then y".)
Figure 2.6: Example for the m-of-n method. The rules express when the summed input weights
exceed the bias (we assume the bias remains constant after retraining the modified
neural network). The output neuron (denoted with y) is active if the incoming weights
exceed the bias.
Discussion of the m-of-n method
I Portability: The method requires a special neural network architecture, because
it assumes that a neuron output is either 1 or 0.
I Fidelity: The m-of-n algorithm changes the architecture, the weight values and
biases of the original neural network. Hence, the fidelity depends on the differ-
ence between the original and the final neural network, which will be used to
extract rules. Intuitively we can assume that the more similar the final neural
network is to the original one, the better the fidelity.
I Comprehensibility: The m-of-n method can represent the neural network be-
haviour with comprehensible, concise and simple (due to the m-of-n notation)
rules. The comprehensibility usually decreases in higher dimensional cases, be-
cause the number of rules will increase. However, depending on the elimination
of groups, more comprehensible, general rules can be produced, but with, usu-
ally, less fidelity.
I Validation capability: As the m-of-n method modifies the original neural net-
work before extracting rules, it is not possible to verify properties of the function
computed by the original neural network.
I Algorithmic complexity: The method itself (the grouping process) has polynomial
time complexity. The re-learning process adds to the running time, but the overall
complexity remains polynomial, so the method also scales to higher dimensional cases.
I Usability for function approximation and classification tasks: The m-of-n method is only applicable to classification tasks.
2.2.2 Fuzzy Rule Extraction
Although our approach is different, we decided to include a section about fuzzy rule
extraction, because fuzzy representations are an interesting choice to explain the be-
haviour of neural networks. For example, one important advantage is that, in general,
fewer fuzzy rules are necessary to explain the behaviour of neural networks when com-
pared to propositional (crisp) rules.
Theoretical results [BBDR96] showed that it is possible to build a fuzzy system which
calculates the same function as a neural network. However, to obtain a good approx-
imation of the neural network behaviour an increased number of linguistic terms is
required. According to [PNJ01] this is also the main drawback of fuzzy rule extrac-
tion as the number of rules increases exponentially with a better approximation.
In this section three different approaches to fuzzy rule extraction are discussed. Lin-
guistic rule extraction [INT99] handles inputs as fuzzy numbers and the correspond-
ing output is calculated as a fuzzy number by fuzzy arithmetic. Fuzzy Trepan [FJ99]
is a pedagogical approach to fuzzy rule extraction and uses fuzzy decision trees to ex-
tract rules of neural networks. Finally, REX [MKW03] uses an evolutionary algorithm
to compute a set of fuzzy rules which mimics the neural network behaviour.
For a better understanding of the discussed fuzzy rule extraction methods, the relevant ter-
minology is explained first. The following definitions mainly rely on the book by Mendel
[Men01].
Definition 2.6 Linguistic Variable
A linguistic variable is a variable whose values are words or sentences. □

According to Zadeh [Zad75] the justification for the use of linguistic variables is: “The
motivation for the use of words or sentences rather than numbers is that linguistic
characterizations are, in general, less specific than numerical ones.”
Definition 2.7 Linguistic Value
A linguistic value is a value which is characterized by a word. A linguistic value is
specified by its membership function, i.e. the value is handled as a fuzzy number. □

Definition 2.8 Membership Function
A membership function assigns to a given numerical value its degree of membership
in a linguistic value. This degree is a continuous value between 0 and 1. □

Typical membership functions are, for example: triangular, trapezoidal, piecewise-
linear, Gaussian and bell-shaped.
Definition 2.9 Linguistic Term
A linguistic variable can be decomposed into a set of linguistic terms, which cover
the complete value range (called the universe of discourse in [Men01]). □

Definition 2.10 α-cut

An α-cut of a fuzzy set A defined on the universal set X is the crisp set
A_α = { x ∈ X | μ_A(x) ≥ α }. □
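To make Definitions 2.8 and 2.10 concrete, the following small Python sketch builds a triangular membership function and approximates an α-cut on a grid; the linguistic value "small" and all numeric values are purely illustrative.

```python
def triangular(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

def alpha_cut(mu, alpha, lo, hi, steps=10000):
    """Approximate the crisp set {x : mu(x) >= alpha} on a grid over [lo, hi];
    for a unimodal membership function the cut is an interval."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    members = [x for x in xs if mu(x) >= alpha]
    return (min(members), max(members)) if members else None

small = triangular(0.0, 0.25, 0.5)       # linguistic value "small" (illustrative)
print(alpha_cut(small, 0.5, 0.0, 1.0))   # approximately (0.125, 0.375)
```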
Linguistic rule extraction
The method as described by Ishibuchi, Nii and Tanaka [INT99] is an improvement of
their initial work [IN]. Linguistic rules are of the form:
Rule R_j:  IF x(1) is A_{j1} and ... and x(n) is A_{jn} THEN Class C_j with CF_j

where:
    x      ... n-dimensional input vector
    j      ... index of rule R_j
    A_{jk} ... antecedent linguistic values
    C_j    ... consequent class
    CF_j   ... certainty grade
Typical examples for antecedent linguistic values are “small”, “medium” and “large”.
A linguistic value is specified by its membership function. A possible instance of a
linguistic rule could be the following:

    IF x(1) is small and x(2) is large THEN Class 3 with CF = 0.90

As indicated in the above rule, "don't care" attributes are omitted. The above rule
reads as follows: "if the input attribute x(1) is small and the input attribute x(2) is
et.al [INT99] the certainty grade of a consequence class is determined by analysing
the classification of the P -cut of the fuzzy input vector for various values of P . For
example [INT99] used for experiments 100 different values of P between 0.01 and 1.
In the initial algorithm all combinations of antecedent linguistic values are com-
puted by forward propagating them through the neural network. The forward propaga-
tion of a rule premise (a particular combination of antecedent linguistic values) requires
calculating, starting from the input layer, the fuzzy output of a given fuzzy input for
each layer of the feed-forward neural network up to the output layer. The calculation
is performed by using fuzzy arithmetic [KG85].
According to [INT99] the main drawbacks of this approach are:
I exponential increase of possible combinations of antecedent linguistic values,
e.g. for an input vector with 10 attributes and 2 linguistic values for each attribute,
2^10 fuzzy input combinations are possible. This would result in the calculation
of 2^10 fuzzy outputs.
I in general we obtain a large number of extracted linguistic rules
The Original Fuzzy Linguistic Algorithm
Input: neural network, antecedent linguistic values
Output: Set of fuzzy rules
for all antecedent combinations
{
    I Compute the fuzzy output vector using fuzzy arithmetic.
    I Perform the numerical calculation by interval arithmetic on α-cuts of fuzzy
      numbers.
    I Determine the consequent class and the certainty grade from the fuzzy
      output vector.
}
However, as suggested in [INT99], it is possible to bypass the first problem by extract-
ing more general linguistic rules with a smaller number of antecedents, but this would,
generally, result in less fidelity. To overcome the second difficulty [INT99] uses ge-
netic algorithms to select only a small number of significant linguistic rules. Extracting more
general rules requires reducing the number of antecedent conditions. This leads to
fewer combinations of antecedent linguistic values. To overcome the increasing excess
of fuzziness (large overlap between different linguistic values) of the outputs, a more
accurate interval arithmetic is necessary. This can be achieved with a hierarchical
subdivision method for an n-dimensional interval vector [INT99].
As reported in [INT99] the wine data of the UC Irvine database was used to test the
improved method. The wine data has 13 continuous inputs and three output classes.
Out of 37765 fuzzy input vectors for generating linguistic rules with three antecedent
linguistic values (all other attributes were set to ”don’t care”) 7381 different rules have
been extracted [INT99]. To select the most significant rules a genetic algorithm [INT]
was applied.
Discussion of the linguistic rule extraction method
I Portability: Generally applicable to feed-forward neural networks.
I Fidelity: The fidelity depends on the excess of fuzziness of the extracted rules
and usually decreases as the system computes more general rules.
I Comprehensibility: The algorithm often produces a large number of rules and
therefore has a low degree of comprehensibility, for example the aforementioned
7381 rules extracted from a neural network trained on the wine data. How-
ever, a post-processing step can be used to extract the most significant rules
[INT99]. The extracted fuzzy rules, together with linguistic terms, are human
readable explanations about the function computed by the neural network.
I Validation capability: The method is, in general, not able to validate specific
behaviour of the neural network, because a certainty grade of 1 is usually not
obtained.
I Algorithmic complexity: The time and space complexity of the original algo-
rithm can increase exponentially, depending on the number of input nodes. By
extracting more general rules, with usually less accuracy, the complexity can be
controlled.
I Usability for function approximation and classification tasks: The proposed
algorithm is only applicable to classification tasks.
FuzzyTrepan
Fuzzy Trepan by Faifer and Janikow [FJ99] is an extension of the TREPAN [CS96]
algorithm. TREPAN is a pedagogical approach which uses decision trees for both
knowledge extraction and representation. As such this approach views the task of
extracting comprehensible knowledge of a neural network as an inductive learning
problem.
Fuzzy Trepan follows the same idea as TREPAN but uses fuzzy decision trees. Fuzzy
decision trees were first proposed by Janikow ( [Jan93], [Jan96]) as an extension of
ID3 [Qui86].
The algorithm, named FuzzyTrepan, is summarized in the following steps:
FuzzyTrepan
Input: neural network, antecedent linguistic values, training data
Output: Fuzzy decision tree
(i) Replace the original class labels with the class labels
assigned by the ANN to the given input.
(ii) Induce the tree in a top-down manner, with the most informative
label on the top. To split a particular node an information gain
measure, modified for fuzzy representations, is applied.
(iii) A node is a leaf, if one of the following criteria is
fulfilled: minimal entropy, minimal information gain or
minimal example count.
As described in [FJ99], FuzzyTrepan outperformed TREPAN in terms of fidelity on
the Iris and Pima datasets, but had lower fidelity on the Bupa dataset.
Discussion of FuzzyTrepan
I Portability: Generally applicable to any feed-forward neural network, because
the neural network is considered as a black box.
I Fidelity: There is a trade-off between fidelity and comprehensibility, because a greater
number of fuzzy terms results in a larger tree and higher fidelity, but it decreases
the comprehensibility.
I Comprehensibility: As mentioned before: the complexity of the tree defines
the degree of comprehensibility.
I Validation capability: The method is a pedagogical approach and, as such,
relies on sampling data from the neural network; therefore the method is not
suitable for validating properties of the function computed by a neural network.
I Algorithmic complexity: The complexity depends on the number of sampled
neural network data and on the learning algorithm for the fuzzy decision tree.
The algorithm scales within polynomial time and space complexity, because the
method only depends on the size of the sample data set and not on the neural
network architecture.
I Usability for function approximation and classification tasks: The proposed
algorithm is restricted to classification problems.
REX
REX [MKW03] is an evolutionary algorithm to extract fuzzy rules from a neural network
performing a classification task. The extracted knowledge is represented as a fuzzy rule
set. The vector x represents the network input. For an m-class classification problem
the vector y describes the output in binary format. An entry y(i) is equal to 1 if the
neural network predicts class C_i; all other entries of y are then equal to 0. The rule
format is:
Rule R_j:  IF x(1) is A_{j1} and ... and x(n) is A_{jn} THEN Class C_j

REX starts with a randomly selected first population. Sets of input patterns are sequen-
tially passed to the set of rules in a given generation. For each individual an evaluation
function is calculated. The evaluation function is a measurement of the fidelity be-
tween the set of rules and the output of the neural network. During the crossover phase
rules and fuzzy set groups can be exchanged between individuals. This process is re-
peated until a stopping criterion is fulfilled.
To apply the algorithm in an experimental phase the parameters for REX like the size
of the population, the mutation probability, the crossover probability and the maxi-
mum number of rules for an individual have to be determined. In the following we
summarize the main idea of the REX algorithm as described in [MKW03].
REX
Input: neural network, antecedent linguistic values
Output: Set of fuzzy rules
(i) An individual and the initial population.
An individual consists of a set of fuzzy rules and a related fuzzy set.
The length of a chromosome is constant. Each rule, premise and fuzzy set can
be marked as active. Fuzzy sets are encoded as one real number, characterized
by the centroid of the corresponding membership function.
(ii) Evaluation of an individual o.
     The values a, b, c, d and e are metrics, where:
         a(o) ... number of patterns classified in agreement with the neural network
         b(o) ... number of incorrectly classified patterns
         c(o) ... number of patterns that were not classified because no rule was active
         d(o) ... total number of active premises in the active rules
         e(o) ... total number of active fuzzy sets in the individual o
     For an individual o the value of the evaluation function is given by:
         F_o = α · a(o) · θ(b(o)) + β · a(o) · θ(c(o)) + γ · θ(d(o)) + δ · θ(e(o))
     The function θ is defined as follows:
         θ(z) = 1 if z = 0, and θ(z) = w otherwise.
     Finally, α, β, γ and δ are coefficients.
(iii) Evolutionary operators:
The mutation of a centroid of a fuzzy set is based on adding a random floating
point number (similar for integer values a random integer number modulo an
allowed range is added). Mutating an activity bit is done by its negation.
The crossover operator allows to exchange rules and fuzzy set groups between
individuals.
(iv) Stopping Criteria:
The algorithm terminates if one of the following stopping criteria is fulfilled:
the maximum number of steps has elapsed, there has been no progress for a given
number of generations, or the evaluation function for the best individual is higher
than a certain value.
Discussion of REX
I Portability: Generally applicable to feed-forward neural networks, because the
neural network is considered as a black box.
I Fidelity: As reported in [MKW03], experimental studies showed that REX pro-
duces results similar to the crisp rule extraction method FULL-RE [TG96] on
the IRIS data set. However, as the method is based on sampling, a high fidelity
is not guaranteed. In particular the fidelity is dependent on the chosen samples
in the input space of the neural network.
I Validation capability: The method is a pedagogical approach and therefore not
suitable to verify properties of the function computed by a neural network.
I Comprehensibility: REX produces a set of fuzzy rules, which provide a good
comprehensibility.
I Algorithmic complexity: With the stopping criterion "maximal number of steps"
it is ensured that the method stays within polynomial time complexity. The pa-
rameter for the maximal number of rules restricts the space complexity. With
these parameter settings the algorithm is also usable for high dimensional cases.
I Usability for function approximation and classification tasks: The proposed
algorithm is restricted to classification problems.
2.2.3 Region-based Analysis
We classify algorithms as region-based if they have a geometrical perspective and if
they do not require any simplifications for the weight-layer. In addition, all these
methods analyse the behaviour of the neural network in a decompositional manner.
Within region-based analysis techniques we distinguish between refinement-based
and non-refinement-based methods. As motivated in Section 1.4, refinement-based methods
rely on the forward and/or backward propagation of regions through all layers of a
feed-forward neural network. We first introduce the non-refinement-based methods.
REFANN
The algorithm called rule extraction from function approximating neural networks
(REFANN) approximates the continuous activation function of the hidden unit by
piece-wise linear functions [SLZ02]. This divides the input space into subregions.
For each non-empty subregion a rule is generated. As explained by Setiono et al.
[SLZ02], the approximation of the sigmoid function depends on the training data, since
the training data defines the maximal input value of the l-th hidden unit and there-
fore also the point q_{l0} at which two line segments intersect.
REFANN generates rules from trained neural networks with one hidden layer and one
linear output unit. In [SLZ02] a pruning algorithm, called N2PFA, is introduced. This
pruning algorithm removes redundant and irrelevant units. It is recommended to apply
the pruning algorithm before the rule extraction process, because the time complexity
of REFANN increases exponentially with the number of hidden units. The main steps
of the algorithm are summarized on the next page.
REFANN
Input: data set, neural network
Output: set of rules

// The neural network has n inputs, H hidden units and 1 output unit.

(i) for each hidden unit l = 1, ..., H
    {
        approximate the sigmoid function with L piece-wise linear functions;
        let L = 3 and define the piece-wise linear function f_l as follows
        // to simplify the notation we write q for the net input W(l,:) · x

            f_l(q) =  a linear segment determined by σ(q_{l1})   if q < −q_{l0}
                      q                                          if −q_{l0} ≤ q ≤ q_{l0}
                      a linear segment determined by σ(q_{l1})   if q > q_{l0}

        where:
            q_{l1} ... maximal possible input; q_{l1} is determined from the training data
            q_{l0} ... intersection point of two line segments
        (the exact slopes and intercepts of the outer segments are given in [SLZ02])
    }

(ii) Divide the input space into 3^H subregions according to the definition of f_l.
     A subregion is a polyhedron in the input space, for example:
         { x | ∀ l ∈ {1, ..., H} : −q_{l0} ≤ W(l,:) · x ≤ q_{l0} }

(iii) For all non-empty subregions R
      {
          compute an approximation for the output of the j-th training example
          in the subregion R as follows:
              ŷ_j = Σ_l v(l) · f_l( net_l(j) )
          where:
              v is the weight vector between the hidden layer and the output unit
              net_l(j) = W(l,:) · x_j , with x_j representing the input vector
              of the j-th training example.

          Form a rule of the following structure:
              if x ∈ P_kj then ŷ_j
          where P_kj is a polyhedron of the form { x ∈ R^n | A x ≤ b },
          A is a matrix with n columns and x is a vector of length n.
      }
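The two central ingredients of REFANN, a three-piece linear approximation of the activation function and the assignment of an input vector to one of the 3^H subregions, can be sketched as follows. The secant-based construction used here is an illustrative stand-in for the exact approximation of [SLZ02], and the weights are invented for the example.

```python
import numpy as np

def three_piece_approx(q_max, q_break):
    """Illustrative 3-piece linear approximation of tanh: the identity on
    [-q_break, q_break] and secant segments towards (+/-q_max, tanh(+/-q_max))
    outside.  A stand-in for the exact construction of [SLZ02]."""
    slope = (np.tanh(q_max) - q_break) / (q_max - q_break)

    def f(q):
        if q < -q_break:
            return -q_break + slope * (q + q_break)
        if q > q_break:
            return q_break + slope * (q - q_break)
        return q
    return f

def subregion_index(x, W, q_break):
    """Index of the subregion x falls into: one entry from {-1, 0, +1} per
    hidden unit, depending on which linear piece its net input lies in."""
    net = W @ x
    return tuple(int(s) for s in np.sign(net) * (np.abs(net) > q_break))

W = np.array([[1.0, -0.5], [0.3, 0.8]])        # illustrative hidden-layer weights
f = three_piece_approx(q_max=2.0, q_break=1.0)
print(f(1.5), np.tanh(1.5))                    # piecewise-linear vs. exact value
print(subregion_index(np.array([2.0, 0.1]), W, q_break=1.0))   # (1, 0)
```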
Discussion of REFANN

I Portability: REFANN, as introduced in [SLZ02], requires a neural network
with one hidden layer and one linear output unit.
I Fidelity: The fidelity depends on the number of piece-wise linear functions cho-
sen to approximate a sigmoidal function. Furthermore it is possible to compute
a precise error bound (for further details the reader is referred to the paper by
Setiono et al. [SLZ02]).
I Comprehensibility: REFANN produces rules of the form:

      if x ∈ P_kj then ŷ_j

  where P_kj is a polyhedron containing the j-th training example, and ŷ_j is
  the approximation of the neural network output in this region. Together with
  visualization tools the extracted rules are helpful to explain the behaviour of the
  neural network.
I Validation capability: For rules of the above form it is not possible to assure
  certain behaviours of the neural network, because the output for an input vector
  x ∈ P_kj could be above or below the value ŷ_j. However, with continuous and
  monotonic transfer functions like the sigmoidal function, it is possible to modify
  the algorithm such that the rules guarantee a certain behaviour. For example, by
  applying linear programming techniques (dependent on the sign of v(l), either
  maximize or minimize net_l = W(l,:) · x subject to x ∈ P_kj) an upper bound,
  denoted y_sup, for the neural network output of any input vector x ∈ P_kj
  can be computed. Therefore, it can be guaranteed that the following rule is
  always fulfilled:

      if x ∈ P_kj then y ≤ y_sup

  Similarly, a lower bound value y_inf can be computed. Together they define a valid
  interval bound on the neural network output value y for any input vector in the
  polyhedron P_kj. Hence, we could extend the method to compute valid rules of
  the form:

      if x ∈ P_kj then y ∈ [y_inf, y_sup]

  A sketch of this bound computation is given after this list.
I Algorithmic complexity: Exponential increase of the number of subregions
with the number of hidden units. Let H be the number of hidden units and
assume a three-piece linear approximation of the activation function; then the
number of subregions is 3^H.
I Usability for function approximation and classification tasks: The proposed
algorithm is applicable for classification problems and for function approxima-
tion tasks.
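The linear-programming idea mentioned in the validation capability discussion above can be sketched as follows. The fragment bounds each hidden net input separately over a polyhedron {x : Ax ≤ b} and combines the extremes according to the signs of the output weights; this yields valid, though possibly conservative, output bounds and is a simplified variant of the per-region optimization suggested above (biases are omitted, and all weights are illustrative).

```python
import numpy as np
from scipy.optimize import linprog

def output_bounds(A, b, W, v, act=np.tanh):
    """Valid (possibly loose) bounds on a one-hidden-layer network over the
    polyhedron {x : A x <= b}: each hidden net input is bounded by an LP and
    the extremes are combined according to the sign of the output weight.
    Biases are omitted for brevity; act must be monotonically increasing."""
    n = A.shape[1]
    y_lo, y_hi = 0.0, 0.0
    for w_i, v_i in zip(W, v):
        # linprog minimises, so the maximum of w_i . x is -min of (-w_i) . x
        lo = linprog(w_i, A_ub=A, b_ub=b, bounds=[(None, None)] * n).fun
        hi = -linprog(-w_i, A_ub=A, b_ub=b, bounds=[(None, None)] * n).fun
        a_lo, a_hi = act(lo), act(hi)
        if v_i >= 0:
            y_lo, y_hi = y_lo + v_i * a_lo, y_hi + v_i * a_hi
        else:
            y_lo, y_hi = y_lo + v_i * a_hi, y_hi + v_i * a_lo
    return y_lo, y_hi

# Unit square as input region, illustrative weights.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
W = np.array([[2.0, -1.0], [0.5, 0.5]])   # hidden-layer weights
v = np.array([1.0, -0.7])                 # output weights
print(output_bounds(A, b, W, v))
```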
Rule Extraction by Clustering
In the paper “Rule Extraction from Feedforward Neural Network for Function Approx-
imation" by Gaweda et al. [GSZ00] a rule extraction technique is described which
relies on clustering of the hidden unit activations.
A cluster center in the hidden unit output space is a point, each component of which
represents a hyperplane in the input space. These hyperplanes are defined from the weight
matrix W, containing the weights between the input layer and the hidden layer. Rules are defined
taining the weights between the input layer and the hidden layer. Rules are defined
according to the inverse of the cluster centers and the corresponding network outputs.
The computed rule set is used for a rule-based approximation of the behaviour of the
neural network. For a given input the most active rule is selected. The activation of a
rule is calculated as the sum of a similarity measure, reciprocal to the distance, between
the point and each hyperplane of the rule condition.
The algorithm is an extension of previous cluster methods as introduced in [SL97]
and [WvdB98].
CLUSTER-REFA
Input: data set, neural network
Output: set of rules

// The neural network has n inputs, H hidden units and 1 output unit.

(i) Find clusters, during the training phase, in the output space of the
    H-dimensional hidden layer. To find the clusters, algorithms such as,
    for example, neural gas [TBS93] can be used.

(ii) For each cluster center k = 1, ..., K
     {
         // c_k denotes the k-th cluster center
         y_k = Σ_{j=1}^{H} v(j) · c_k(j)
         where:
         v is the weight vector between the hidden layer and the output unit
     }

(iii) For each cluster center k = 1, ..., K
      {
          For each hidden unit j = 1, ..., H
          {
              // the linear prototype is denoted as lp_k(W(j,:) · x)
              lp_k(W(j,:) · x) = W(j,:) · x − σ^{-1}(c_k(j))
          }
      }

(iv) Build K rules; the k-th rule has the form:
         if lp_k(W(1,:) · x) = 0 and ... and lp_k(W(H,:) · x) = 0 then y = y_k

Geometrically, each linear prototype represents a hyperplane in the input space. For
a single input vector x the distance to each of these hyperplanes is computed. This
allows the best rule for an input vector x to be selected.
APPLICATION OF CLUSTER-REFA
Input: input vector x, weight matrix W, linear prototypes lp
Output: approximated output value y

(i) Let s(lp_k(W(j,:) · x), x) be a function reciprocal to the distance
    between the input vector x and the hyperplane corresponding to the linear
    prototype lp_k(W(j,:) · x). [GSZ00] use the following distance measure:

        d(lp_k(W(j,:) · x), x) = | W(j,:) · x − σ^{-1}(c_k(j)) | / ||W(j,:)||

    and the similarity measure s is:

        s(lp_k(W(j,:) · x), x) = 1 / (1 + d(lp_k(W(j,:) · x), x))

(ii) For each of the K rules compute the activation α_k by:

        α_k(x) = Σ_{j=1}^{H} s(lp_k(W(j,:) · x), x)

(iii) Compute the maximum activation, i.e. α_m = max_{1 ≤ k ≤ K} α_k.

(iv) Apply the rule with the maximum activation, i.e. the m-th rule.
     The output y of the rule system for input x is then given by: y = y_m
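The rule application phase can be summarized in a few lines of Python. The sketch assumes the cluster centres and the rule consequents y_k have already been computed as in steps (ii) and (iii) of CLUSTER-REFA, and uses the distance and similarity measures as reconstructed above; all numeric values are illustrative.

```python
import numpy as np

def logit(p):
    """Inverse of the logistic transfer function."""
    return np.log(p / (1.0 - p))

def apply_cluster_refa(x, W, centers, outputs):
    """For each cluster centre c_k, sum the similarities of x to the hyperplanes
    W[j] . x = logit(c_k[j]) and return the consequent of the most active rule."""
    activations = []
    for c in centers:
        sim = 0.0
        for w_j, c_j in zip(W, c):
            dist = abs(w_j @ x - logit(c_j)) / np.linalg.norm(w_j)
            sim += 1.0 / (1.0 + dist)
        activations.append(sim)
    return outputs[int(np.argmax(activations))]

W = np.array([[1.0, 1.0], [1.0, -1.0]])        # input-to-hidden weights
centers = np.array([[0.8, 0.6], [0.2, 0.4]])   # cluster centres c_k
outputs = np.array([1.0, -1.0])                # rule consequents y_k
print(apply_cluster_refa(np.array([0.5, 0.3]), W, centers, outputs))
```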
Discussion of CLUSTER-REFA
I Portability: The method, as introduced in the paper by Gaweda et al. [GSZ00],
is restricted to neural networks with one hidden layer.
I Fidelity: The fidelity depends on the number of extracted rules. To achieve a
high fidelity a large number of rules is necessary, because a rule consequence
is a single value. However, as stated in [GSZ00] the approximation accuracy
could be improved by the use of functional rule consequences or by using fuzzy
clustering algorithms to determine the shape of clusters.
I Comprehensibility: The comprehensibility depends on the number of rules.
The method produces rules of the form:

      if x is closest to the hyperplanes of rule m then y = y_m

  where y_m is the approximation of the output for inputs closest to the hyperplanes
  of rule condition m.

I Validation capability: CLUSTER-REFA produces correct rules for points con-
tained in the subspace defined as the intersection of multiple linear prototypes
(rule condition). In other words, the K rules computed with CLUSTER-REFA
are valid statements about the neural network. However, for all points which
are not included in one of the K subspaces, approximations are used (see the
algorithm Application of CLUSTER-REFA above).
I Algorithmic complexity: The algorithmic complexity is dependent on the ap-
plied clustering algorithm. The rule extraction part itself scales with a time com-
plexity of O(H · K), where H is the number of hidden nodes and K the number of
cluster centers. Hence, the algorithm scales well with an increasing number of hid-
den nodes.
I Usability for function approximation and classification tasks: The proposed
algorithm is applicable for classification problems as well as for function ap-
proximation tasks.
Decision Intersection Boundary Algorithm
The Decision Intersection Boundary Algorithm (DIBA) is designed to extract exact
representations of threshold feed-forward neural networks [MP00]. A decision region
is defined as a region in the input space, such that all points in that region result in
the same network output. Decision regions are limited by decision boundaries. A
decision boundary is a location in the input space where an output unit switches its
activation state (0 to 1 or vice-versa). The possible decision regions for a threshold
neural network are polytopes (the input space is constrained, i.e. an interval of
possible values is defined for each input dimension). The algorithm relies on a few
important observations, see also [MP00].
1. Independence of an output unit. Each output unit is computing its own value
irrespective of the other output units.
2. For an ANN with threshold activation functions the outputs of the hidden units are 0
or 1. Therefore the output units just compute a partial sum of their weights.
3. An output unit changes only if at least one hidden unit value changes. Therefore
a decision boundary corresponds to a region in the input space where at least
one hidden unit undergoes a change.
4. Hidden units split the input space through hyperplanes. If the input space is
bounded the decision regions are defined through higher dimensional polytops.
The algorithm consists of two phases, namely a generative part which calculates the
possible vertices and connecting line segments of decision regions and an evaluation
part which tests if these basic elements (vertices, lines) form the boundary of a deci-
sion region.
The computation of candidate vertices and lines is performed by finding hidden layer
hyperplane intersections and by projecting these hyperplanes incrementally onto each
other until dimension one. The result of this part is a set of line segments.
The second phase is to test whether a line segment builds a boundary, i.e. to test
whether the intersection of hyperplanes forms a corner boundary and if the interven-
ing line segment is a boundary line. Figure 2.7 illustrates the projection part of the
algorithm and Figure 2.8 shows a decision region and explains the traversing of a line.
A sketch of the algorithm for a two weight layer feed-forward neural network with a
single output neuron is provided on the next page.
Figure 2.7: DIBA recursively projects the hyperplanes of the first layer onto each
other until dimension one. We demonstrate the projection of two hyperplanes onto the
grey colored hyperplane from dimension three to dimension one.
Figure 2.8: A decision region and the traversing of a line. A single output neuron is con-
sidered (its weight vector is denoted w). In the forward pass each line segment is annotated with
the partial sum of the weights of the output neuron, accumulated each time an intersecting hyperplane
is passed whose hidden unit switches from 0 to 1 in the traverse direction. The backward pass is
symmetrical. During the backward phase it is determined whether a line segment builds an edge of
the decision region and whether a corner is a vertex of the decision region.
DIBA - Decision Intersection Boundary Algorithm -
Input: ANN weight matrices and biases, interval restrictions on each input dimension.
Output: Decision regions represented through vertices and line segments.

// The dimension of the hidden layer is H and the dimension of the input space is n.

1. Recursive hyperplane projection to obtain lines. The algorithm considers H
hyperplanes (e.g. the i-th hyperplane is defined by W(i,:) · x = θ(i), where W denotes
the weight matrix between the input and hidden layer and θ is the bias vector) in an n-
dimensional input space. For each hidden unit hyperplane of dimension n: project all
other H − 1 hyperplanes onto it. Repeat this recursively until dimension one.

2. Boundary test along each projected hyperplane line. Along a line, the points where
an output unit can change are at the intersections with the remaining H − 1 lines. A
boundary is a location in the input space where an output value transition occurs. An
intersection of hyperplanes is a boundary if each hyperplane of the intersection has at
least one face in the vicinity of the intersection that is a boundary, i.e. the
corresponding hidden unit value changes and this results in a change of the output
value. Line segments are intersections of n − 1 hyperplanes and corners are
intersections of n hyperplanes. Hence, the tests whether a line segment is a boundary
and whether a vertex is a corner boundary are identical. To perform the boundary test the
line is traversed. In a forward pass we traverse the line from the leftmost to the rightmost
point and scan for hyperplane intersections in this direction. The partial sum of the output
unit weights is accumulated and assigned to the passed line segment. The backward phase is
identical, but starts from the rightmost point and moves to the leftmost point, adding its sum
to the value of the forward phase. During the backward phase the boundary test is performed.
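The following sketch does not implement the projection and traversal machinery of DIBA, but it illustrates the observations the algorithm rests on: for a threshold network the hidden activation pattern, and hence the output, is constant within each cell of the first-layer hyperplane arrangement, and the output unit merely compares a partial sum of its weights with its bias. All weights are illustrative.

```python
import numpy as np

def threshold_net_output(x, W1, b1, w2, b2):
    """Threshold network: hidden unit j is active iff W1[j] . x >= b1[j];
    the output unit fires iff the partial sum of its weights over the active
    hidden units exceeds its bias (observation 2 above)."""
    hidden = (W1 @ x >= b1).astype(float)
    return float(w2 @ hidden >= b2), tuple(int(h) for h in hidden)

# Two hidden hyperplanes in a 2-d input space (illustrative values).
W1 = np.array([[1.0, -1.0], [0.5, 1.0]])
b1 = np.array([0.0, 1.0])
w2 = np.array([1.0, 1.0])
b2 = 1.5

# Points in the same cell of the hyperplane arrangement share an activation
# pattern and hence the same output: decision regions are unions of such cells.
for x in [np.array([2.0, 0.0]), np.array([3.0, 0.5]), np.array([-1.0, 0.0])]:
    print(x, threshold_net_output(x, W1, b1, w2, b2))
```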
In [MP00] it is mentioned how to extend the algorithm for additional hidden layers
and multiple outputs. Additional layers do not change the possible locations of the
decision regions, so the first layer defines the possible decision regions. For multiple
outputs the algorithm has to be extended, such that the transition of a specific combina-
tion of output units is analysed instead of a single output unit change. This algorithm
is useful for extracting representations of threshold-based neural networks.
Discussion of DIBA
I Portability: Restriction to threshold activation functions.
I Fidelity: In the case of threshold activation functions an exact polyhedral rep-
resentation of the neural network behaviour could be provided. However, for
sigmoidal neural networks the transfer-function is replaced with threshold acti-
vation functions. This is, usually, less accurate than piece-wise linear approxi-
mations of the sigmoidal function.
I Comprehensibility: Decision regions are polyhedra and expressed through ver-
tices and lines. Hence, the comprehensibility depends on the dimension of the
input space.
I Validation Capability: For threshold activation functions, the method calcu-
lates exact decision regions in the input space. This guarantees that any
point within a specific decision region produces the same output.
I Algorithmic complexity: Exponential time complexity. The method has expo-
nential time complexity with increasing dimension of the input space. As ex-
plained by Melnik and Pollack [MP00], the complexity stems from the traver-
sal of the hyperplane arrangement in the first layer of hidden units. This is of
complexity O(H^n), where H is the number of neurons in the first hidden layer
(the number of hyperplanes) and n is the number of input signals. Testing if a vertex
is a corner again has exponential time complexity (O(H^n)). DIBA has an expo-
nential space complexity with increasing dimensionality of the input space, as
the method relies on corners to describe the decision regions.
I Usability for function approximation and classification tasks: The proposed
algorithm is applicable for classification problems.
Propagation of regions
In the following, methods which rely on the propagation of regions through a feed-forward
neural network are discussed. These methods are usable for feed-forward neural net-
works with invertible and continuous transfer functions. An initial hypothesis is defined
in the input and/or the output space of the neural network. These methods refine this
knowledge by propagating the initial constraints through all layers of the feed-forward
neural network. There are two well known works in this field: Validity Interval Anal-
ysis (VIA) [Thr93] and the backpropagation of polyhedra method [Mai98]. As the
name implies the latter method requires an initial hypothesis for the output space and
refines the corresponding knowledge in the input space with a single back-propagation
phase. In the following paragraphs we explain the validity interval analysis and the
backpropagation of polyhedra algorithm. To explain these algorithms we rely on the
following terminology:
1. Transfer function phase: Within the transfer function phase a polyhedron is
propagated through a vector of scalar transfer functions.
2. Affine transformation phase: This term is used for the propagation of a poly-
hedron through the linear weight layer of a neural network.

3. y-space: Denotes the output space of a weight or transfer function layer within
the neural network, or the overall neural network output space.

4. x-space: Denotes the input space of a weight or transfer function layer within
the neural network, or the overall neural network input space.
Interval Propagation
In the literature several interval propagation methods are published, for example by
Palade ( [PNP00] and [PNJ01]). The main idea of these algorithms is to extract and
validate axis-parallel rules. The initial algorithm, named Validity Interval Analysis
(VIA), was developed by Thrun [Thr93].
VIA
Validity Interval Analysis (VIA) [Thr93], [Thr90] is a generic tool for analysing the
behaviour of feed-forward neural networks. VIA describes the input-output mapping
using axis parallel hypercubes. The only assumption is that the non-linear transfer
functions of the neurons are continuous and monotonic. The algorithm is based on so
called validity intervals, that is, each neuron is initially constrained with a maximum
interval of valid output. The Cartesian product of the validity intervals defines a hy-
percube. VIA iteratively refines these intervals and is able to detect inconsistencies
by forward and backward propagation of intervals through all layers of a feed-forward
neural network. This refinement process is repeated until one of the following criteria
is fulfilled:
I Consistent convergence: the validity intervals converge to non-empty intervals,
i.e. there is no significant change (or no change at all) when hypercubes are propagated
forward and backward through the neural network again.
I Contradiction: an empty interval is found, i.e. the lower bound exceeds the up-
per bound of the interval. The initial intervals are inconsistent with the weights
and biases of the neural network. Consequently we can use VIA also to verify a
given axis-parallel rule set.
The algorithm description and examples (including the MONK benchmark problem)
are available in [Thr93]. As explained in the introduction, VIA is an interesting ap-
proach among region-based analysis methods, because annotating each layer with valid
axis-parallel expressions draws a close analogy to software verification.
In the following we use the
box function, as introduced in [Mai00a], to describe
the VIA algorithm in a compact form. The box function calculates the smallest axis
parallel hypercube containing a region R, which can be found by linear programming
methods, such as the Simplex algorithm [RR89] or interior point methods [Mat00b].
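A minimal sketch of the box function, assuming the polyhedron is given in the form {x : Ax ≤ b}: each coordinate is minimized and maximized by a linear program. We use scipy's linprog here as a stand-in for the simplex or interior point solvers mentioned above.

```python
import numpy as np
from scipy.optimize import linprog

def box(A, b):
    """Smallest axis-parallel hypercube (bounding box) of the polyhedron
    {x : A x <= b}, obtained by 2n linear programs: minimise and maximise
    each coordinate over the polyhedron."""
    n = A.shape[1]
    lower, upper = [], []
    for j in range(n):
        c = np.zeros(n)
        c[j] = 1.0
        lower.append(linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * n).fun)
        upper.append(-linprog(-c, A_ub=A, b_ub=b, bounds=[(None, None)] * n).fun)
    return np.array(lower), np.array(upper)

# Triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1: its box is [0, 1] x [0, 1].
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
print(box(A, b))
```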
As depicted in Figure 2.9, B_act(S) denotes, via validity intervals, the hypercube defin-
ing the possible activity values of the subsequent layer S. Similarly, B_net(S) is the valid
hypercube for the net input of layer S. Finally, B_act(P) denotes the valid hypercube for the
activity of the previous layer P.
Figure 2.9: The neurons of the preceding layer P are connected through the weight
layer to those of the subsequent layer S. Every neuron output is denoted with a validity
interval. The Cartesian product of these validity intervals defines a hypercube. The
intervals [a'_i, b'_i] of the net input and the intervals [a_i, b_i] of the activation of the neurons
of layer S are connected through the bijective transfer function σ.

To explain the forward and backward phases of the algorithm, the notation is simplified,
since we assume, without loss of generality, that the bias vector θ
is the null vector.
A non-zero bias value can be simulated by defining an additional input neuron with a
constant activation value of one. The value of the bias is assigned to the weighted con-
nection of this corresponding neuron.
VIA
Input: neural network, initial restriction of input and/or output regions
Output: annotated neural network

Forward Phase. Refining the upper and lower bounds of the validity intervals of
the subsequent layer:

    B_net(S) ← box( σ^{-1}( B_act(S) ) ∩ W · B_act(P) )
    B_act(S) ← σ( B_net(S) )

Backward Phase. Refinement of the output intervals of the preceding layer:

    B_act(P) ← box( B_act(P) ∩ W^{-1}( B_net(S) ) ),  where W^{-1}(R) = { x | W x ∈ R }
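A simplified forward refinement step in the spirit of VIA can be written with interval arithmetic. The sketch intersects the interval-arithmetic image of the previous layer's box with the current net-input box, which slightly over-approximates the box of the intersection used in the exact formulation; the bias vector is kept explicit here, and all weights are illustrative.

```python
import numpy as np

def affine_image_box(lo, hi, W, bias):
    """Tightest axis-parallel box containing W x + bias for x in the box [lo, hi]
    (standard interval arithmetic)."""
    lo_out = W.clip(min=0) @ lo + W.clip(max=0) @ hi + bias
    hi_out = W.clip(min=0) @ hi + W.clip(max=0) @ lo + bias
    return lo_out, hi_out

def via_forward_step(act_P, net_S, W, bias, sigma=lambda t: 1 / (1 + np.exp(-t))):
    """One forward refinement step: propagate the validity box of layer P through
    the weight layer, intersect with the current net-input box of layer S, and map
    the result through the (monotone) transfer function."""
    lo_img, hi_img = affine_image_box(act_P[0], act_P[1], W, bias)
    net_lo = np.maximum(net_S[0], lo_img)
    net_hi = np.minimum(net_S[1], hi_img)
    if np.any(net_lo > net_hi):
        raise ValueError("contradiction: empty validity interval")
    return (net_lo, net_hi), (sigma(net_lo), sigma(net_hi))

# Illustrative 2-by-2 weight layer, inputs constrained to [0, 1]^2.
W = np.array([[1.0, -2.0], [0.5, 0.5]])
bias = np.array([0.0, -0.25])
act_P = (np.zeros(2), np.ones(2))
net_S = (np.full(2, -np.inf), np.full(2, np.inf))   # initially unconstrained
print(via_forward_step(act_P, net_S, W, bias))
```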
Discussion of VIA
I Portability: VIA can be applied to any feed-forward neural network with con-
tinuous and invertible transfer function.
I Fidelity: Often VIA does not refine very well. In this case the fidelity is not
satisfactory. One of the reasons is that the propagation of an axis-parallel hyper-
cube through the weight layer results in a polyhedron which is not necessarily
an axis-parallel hypercube. VIA computes the box of the intersection between
this polyhedron and the hypercube defined through the intervals of B_net(S). This
often results in rough approximations. For example, in higher dimensions the
wrapping box of a flat polyhedron is a very rough approximation, and as a con-
sequence the method does not refine very well. The reader is also referred to
Chapter 7, where we provide some examples.
I Comprehensibility: The extracted rules are axis-parallel, thus they are easily
understandable.
I Validation Capability: VIA is, to our knowledge, the first algorithm
able to validate some properties of the function computed by the
neural network. Given some initial hypothesis defined as a hypercube B_Y in
the output space of the neural network, after backpropagating B_Y and computing
the box B_X, the following valid statements are obtained:

    if y ∈ B_Y then x ∈ B_X
    if x ∉ B_X then y ∉ B_Y

Furthermore, VIA can be used to refine and validate existing axis-parallel
rules.
I Algorithmic complexity: The time complexity is dependent on the rule refine-
ment process. Maire [Mai00a] proved that VIA always converges in one run
(iteration) for single-weight-layer neural networks. However, for multilayer net-
works no formula for the termination of the rule refinement process has been found yet.
As reported in [Mai00a], experimental results showed that, on average,
the rate of box refinement decreases exponentially. In other words, a rule gets
mainly refined within the first steps of the refinement process.
I Usability for function approximation and classification tasks: The proposed
algorithm is applicable for classification problems and for function approxima-
tion tasks.
Backpropagation of Polyhedra
The possible regions which can be analysed by VIA are severely limited due to the
fact that VIA is restricted to axis-parallel hypercubes. A good alternative to axis-
parallel hypercubes for the analysis of neural networks are polyhedra. Finite unions of
polyhedra approximate arbitrary regions quite well. Hence, it is possible to compactly
describe even complex input-output relations. Additionally, polyhedra are closed un-
der affine transformations. As a result the propagation of polyhedra through the weight
layer is exact. The notion of backpropagating finite unions of polyhedra through feed-
forward neural networks was introduced by Maire [Mai98]. Starting with a (user-
defined) set of unions of polyhedra at the output layer the inverse regions are calcu-
lated (again using a polyhedral approximation). In a feed-forward neural network we
can distinguish two different phases, namely the affine transformation phase and the
transfer function phase. Consequently the algorithm consists of these two phases. In
[Mai98] a formula to backpropagate a polyhedron through the linear weight layer is de-
scribed. The computationally expensive part of this approach is to remove redundant
inequalities. The following algorithm explains how to backpropagate a polyhedron
through an affine transformation. 3
Backpropagation of Polyhedra: affine transformation phase
Input: P_Y, W, θ
Output: P_X

Backpropagation of the polyhedron P_Y = { y | A y ≤ b } through the weight layer:

    x ↦ φ(x) = W x + θ
    x ∈ φ^{-1}(P_Y) if and only if A (W x + θ) ≤ b, therefore:
    P_X = { x | (A W) x ≤ b − A θ }

    remove redundant inequalities in P_X
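The affine transformation phase is a one-line computation once the polyhedron is given in matrix form. A minimal sketch, assuming P_Y = {y : Ay ≤ b} and the layer map y = Wx + θ (redundancy removal is omitted, and the numbers are illustrative):

```python
import numpy as np

def backprop_affine(A, b, W, theta):
    """Pre-image of the polyhedron {y : A y <= b} under the affine map
    y = W x + theta: substituting gives {x : (A W) x <= b - A theta}.
    Removal of redundant inequalities (the expensive part) is omitted."""
    return A @ W, b - A @ theta

# {y : 0 <= y1 <= 1, 0 <= y2 <= 1} pulled back through y = W x + theta.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
W = np.array([[2.0, 0.0], [1.0, 1.0]])
theta = np.array([0.5, -0.5])
A_x, b_x = backprop_affine(A, b, W, theta)
print(A_x)   # constraint matrix of the pre-image polyhedron in x-space
print(b_x)   # right-hand side of the pre-image polyhedron
```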
In order to backpropagate a polyhedron through the nonlinear transfer function layer
the sigmoid function is approximated by piece-wise linear functions. The piece-wise
linear approximation of a sigmoidal function results in axis-parallel splits of the orig-
inal polyhedron in y-space. As depicted in Figure 2.10 (for a two-dimensional case),
the polyhedron is subdivided into cells. The i-th cell is a hypercube and contains
the sub-polyhedron Q_i. The polyhedron P_Y can be expressed as the union of these
sub-polyhedra, i.e. P_Y = ∪_i Q_i. With each cell a different affine transformation is asso-
ciated. The diagonal matrix D_i and the vector t_i define the affine transformation for
the i-th cell. In an m-dimensional space the diagonal matrix D_i is an m by m matrix and
the j-th entry on the diagonal corresponds to the slope of the piece-wise linear
approximation for the j-th component. The vector t_i is an m-dimensional vector and
its j-th entry represents the constant of the linear approximation for the j-th component.
The algorithm to backpropagate a polyhedron through a non-linear transformation is
sketched on the next page.
3A more detailed description of the algorithm and approaches to remove redundant inequalities is
given in Chapter 5.
Backpropagation of Polyhedra: transfer function phase
Input: P, σ    Output: P'
Transfer function phase: backpropagation of a polyhedron P = {y | Ay ≤ b} through a vector of sigmoidal transfer functions.

for i = 1..d                              // d = dimension of the layer
    ψ_i = piece-wise linear approximation of σ on the interval [l(i,1), l(i,2)],
          i.e. build subdivisions of [l(i,1), l(i,2)]
partition P according to the subdivisions of the intervals [l(i,1), l(i,2)]
for each sub-polyhedron P_j of a cell
    ψ_j(x) = D_j x + δ_j, where D_j is a diagonal matrix
    P'_j = ψ_j⁻¹(P_j) = {x | A_j D_j x ≤ b_j − A_j δ_j}
P' = P'_1 ∪ ... ∪ P'_m, where m is the number of sub-polyhedra
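As an illustration of the per-cell step, the sketch below (Python/NumPy, illustrative only) backpropagates one sub-polyhedron through a given cell approximation ψ(x) = Dx + δ; the slope and offset values are hypothetical and would in practice come from the piece-wise linear fit of the sigmoid.

import numpy as np

def backprop_cell(A, b, d_diag, delta):
    # Backpropagate the sub-polyhedron {y | Ay <= b} of one cell through the
    # per-cell linear approximation psi(x) = D x + delta with D = diag(d_diag):
    # the result is {x | (A D) x <= b - A delta}.
    AD = A * d_diag              # scales the columns of A by the diagonal entries
    return AD, b - A @ delta

A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.5, 0.0, 0.0])
d_diag = np.array([0.25, 0.20])  # assumed slopes of the linear pieces
delta = np.array([0.5, 0.5])     # assumed constants of the linear pieces
A_x, b_x = backprop_cell(A, b, d_diag, delta)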
Discussion of Backpropagation of Polyhedra
- Portability: Generally usable for feed-forward neural networks.
- Fidelity: The method extracts correct rules for linear transfer functions. For sigmoidal transfer functions, the fidelity decreases with a decreasing number of piece-wise linear functions used in approximating the sigmoid function.
- Comprehensibility: Together with visualization tools, the extracted polyhedral rules are quite informative for small neural networks. The piece-wise linear approximation, however, results in an exponential increase in the number of sub-polyhedra. As a consequence, many polyhedral rules are required in the higher-dimensional case, which in turn makes the rule set less understandable.
Figure 2.10: The approach of piece-wise linear approximation of the sigmoid function results in axis-parallel splits of the polyhedron. In y-space we see the split of the polyhedron into different cells. For each cell a different approximation of the sigmoid function is required, i.e. ψ_i(x) = D_i x + δ_i. To backpropagate a sub-polyhedron of cell i the linear approximation function ψ_i is applied. We plotted a possible reciprocal image in x-space. We can see that it would be feasible to merge the polyhedra defined by ψ⁻¹(Q_1) and ψ⁻¹(Q_2), and likewise the polyhedra ψ⁻¹(Q_3) and ψ⁻¹(Q_4).
- Validation Capability: The proposed method is capable of proving some properties of the function learned by the neural network. With an overestimation of the sigmoid function the polyhedral approximation contains the true reciprocal image. Therefore, after termination, the method can assure the following property of the function computed by the neural network:
if y ∈ P then x ∈ ∪_i P'_i
where P'_i denotes the i-th sub-polyhedron. Similarly, by assuming an overestimation of the sigmoid function, the method can guarantee that the neural network behaves according to the following rule:
if x ∉ ∪_i P'_i then y ∉ P
- Algorithmic complexity: Exponential time complexity. The computation time
increases exponentially with the number of neurons. The method of approximating a non-linear transfer function by a piece-wise linear function results in axis-parallel splits of the polyhedron according to the partitioning of σ. As this partitioning occurs on every axis, the number of cells, and therefore the number of splits, increases exponentially. An additional problem which arises with this method is the merging of the polyhedra once they have been backpropagated through the piece-wise linear function.
- Usability for function approximation and classification tasks: The proposed algorithm is applicable for classification problems and for function approximation tasks.
2.3 Overview of Discussed Neural Network Validation Tech-
niques and Validity Polyhedral Analysis
So far, several algorithms to analyse neural networks have been introduced. In this
section we provide a classification of the discussed methods and suggest some desir-
able properties for neural network validation methods. Finally, we motivate and justify
our approach to validate the behaviour of feed-forward neural networks.
The literature overview covered the following validation or rule extraction techniques:
- Propositional rule extraction: KT, M-of-N
- Fuzzy rule extraction: Linguistic rule extraction, FuzzyTrepan, REX
- Region-based methods: REFANN, Cluster-REFA, Interval Propagation (VIA), Backpropagation of Polyhedra, DIBA
We could derive properties for the different classes (propositional, fuzzy, region-
based) of discussed neural network validation techniques. Table 2.1 summarizes these
results. The first column defines the validation class. The columns white box and black box indicate whether decompositional (white box) or pedagogical (black box) techniques are available within the given class. The attribute general indicates whether the decompositional approaches make no or only minor assumptions about the feed-forward neural network architecture. With the columns classification and FA (function approximation) we indicate whether methods are usable for classification or function approximation tasks. Finally, the column denoted VC (validation capability) specifies whether a class contains approaches that are suitable for proving properties about the neural network behaviour.
RE - Class White-box Black-box General Classification FA VC
Prop. RE yes yes no yes no no
Fuzzy RE yes yes yes yes no no
Region based yes no yes yes yes yes
Table 2.1: Overview of neural network validation techniques.
The algorithm developed within this thesis is named Validity Polyhedral Analysis
(VPA). VPA is a decompositional approach (white box), usable for classification prob-
lems as well as for function approximation problems. It is a very general method,
because the only assumption is that the transfer-function is monotonic, continuous and
invertible.
As motivated in Section 1.4 VPA is a method which computes valid polyhedral region
mappings by forward- and backward propagating finite unions of polyhedra through
all layers of a feed-forward neural network.
The development of VPA is motivated by our desire to verify properties of the function performed by a neural network. Thus, we are interested in the most refined, informative, region-based rules. VPA achieves this goal by annotating each layer of a feed-forward neural network with valid linear inequality predicates, which are geometrically finite unions of polyhedra. Polyhedra are closed under affine transformations. Hence, polyhedra are a good choice to represent the knowledge embedded in a neural network. Most region-based analysis methods scale exponentially with increasing neural network size (REFANN, DIBA, Backpropagation of Polyhedra). For the development of our algorithm one important requirement was therefore to ensure that the algorithm also scales to higher dimensions. Additionally, it was important to ensure that any approximation contains the true reciprocal image, and hence to obtain valid statements about the neural network behaviour.
Summary and Outlook to the next Chapters
This chapter also contained some new aspects, which are summarized below:
Contributions - Chapter 2 -
- Classification of neural network analysis techniques into: propositional rule extraction, fuzzy rule extraction and region-based analysis.
- Introduction of the new attribute "validation capability" to characterize properties of neural network validation methods.
- Modification of REFANN [SLZ02] to obtain valid rules of the form:
  if x ∈ P_i then y ∈ [y_min, y_max]
  where P_i is a polyhedron in the input space of the neural network.
In the following chapters we will describe a way to compute refined polyhedral
pre- and postconditions of feed-forward neural networks and therefore how to obtain
an annotated version of the neural network.
The first difficulty is to approximate the image or reciprocal image of a polyhedron un-
der a non-linear transformation, which is a non-linear region. We started to analyse the
structure of manifolds of this regions, followed the idea of approximating a non-linear
region via a set of finite unions of polyhedra, which lead us finally to the interesting
field of non-linear optimization.
The next difficulty was to compute the image and reciprocal image of a polyhedron
under an affine transformation. This question lead us to the fascinating topic of poly-
hedral projections. According to a proper software engineering approach, we con-
structed an abstract framework for iterative refinement algorithms and embedded VIA
and VPA as possible instances. This implementation was used to evaluate our ideas.
Chapter 3
Polyhedra and Deformations of
Polyhedral Facets under Sigmoidal
Transformations
Polyhedra and their geometrical properties are important for a wide range of problem domains, such as (linear) optimization, geometry and parallelizing compilers. We use polyhedra to describe the behaviour of feed-forward neural networks. This chapter explains basic concepts of polyhedral computations. More sophisticated polyhedral operations, like the removal of redundant inequalities and the projection of a polyhedron onto a lower-dimensional subspace, will be discussed in Chapter 5.
In Section 3.3 we consider the problem of the forward- and backward-propagation of a polyhedron through the transfer-function layer of a neural network. Additionally, we describe approaches to analyse the manifold of the image of a hyperplane H = {y | aᵀy = β} under a non-linear transformation.
3.1 Polyhedra and their Representation
Our summary relies mainly on the work by Wilde [Wil97] and Fukuda [Fuk00]. The books by Schrijver, "Theory of Linear and Integer Programming" [Sch90], and Ziegler, "Lectures on Polytopes" [Zie94], are advanced sources for polyhedral computation.
In the sequel, we consider polyhedra in the n-dimensional Euclidean space. Before we
define a polyhedron and a polyhedral cone, we recall the definitions for linear, non-
negative, affine and convex combinations of two vectors.
Definition 3.1 Linear, non-negative, affine and convex combination of vectors
Let x_i be vectors and λ_i scalars. Then:
- Σ_i λ_i x_i is called a linear combination,
- Σ_i λ_i x_i with λ_i ≥ 0 is called a non-negative combination,
- Σ_i λ_i x_i with Σ_i λ_i = 1 is called an affine combination,
- Σ_i λ_i x_i with Σ_i λ_i = 1 and λ_i ≥ 0 is called a convex combination. □
Figure 3.1 shows the geometric interpretation of different combinations of two vectors
in the two dimensional space.
Figure 3.1: Combination of two vectors in a two-dimensional space: from left to right:
linear, non-negative, affine and convex combination.
Definition 3.2 Vertex
A vertex of a polyhedron P is a point in P which cannot be expressed as a convex combination of other points in P. □

Definition 3.3 Ray
A vector r is a ray of P if, for any x ∈ P, also x + μr ∈ P for all μ ≥ 0. In other words, a ray defines a direction in which P is open (infinite). □
Definition 3.4 Extreme Ray
A ray is called an extreme ray of a polyhedron P if and only if it cannot be expressed as a positive combination of two distinct rays of P. □

Definition 3.5 Polyhedron
A polyhedron P is a subset of R^n defined as the intersection of a finite number of half-spaces. Analytically this can be described as a set of linear inequalities. □

Definition 3.6 Dual representation of a Polyhedron
A polyhedron can be represented in the implicit form as a set of linear inequalities P = {x | Ax ≤ b}. For every polyhedron P there exists an equivalent parametric representation (also known as Minkowski characterization) in terms of a linear combination of lines, a convex combination of vertices, and a positive combination of extreme rays: P = {x ∈ R^n | x = Lλ + Rμ + Vν, μ ≥ 0, ν ≥ 0, Σ_i ν_i = 1}, where L, R, V are matrices whose columns represent the lines, extreme rays and vertices, respectively. □

Definition 3.7 Line
A line of P is a bidirectional ray, i.e. a vector l such that with x ∈ P also x + μl ∈ P for all μ. □

Definition 3.8 Polyhedral Cone
A cone is the intersection of a finite number of closed halfspaces of the form H_i^- = {x | A(i,:) x ≤ 0}. The implicit description of a cone is C = {x | Ax ≤ 0}. The parametric representation of a pointed cone is simply C = {x | x = Rμ, μ ≥ 0}. □

Definition 3.9 Convex Hull
The convex hull of a set of points is the set of all convex combinations of points in the set. It is the smallest convex set which contains all points of the set. □

Definition 3.10 Polytope
A bounded subset P ⊆ R^n is a polytope iff it is the convex hull of a finite set of points, denoted with V. Alternatively, a polytope is a polyhedron without rays or lines. □
Definition 3.11 Dimension of a Polytope
The dimension of a polytope P is the dimension of the convex hull of the points contained in P. □
The dimension of a polytope is denoted dim(P). A polyhedron in an n-dimensional space with dim(P) = n is said to be full-dimensional.
The expressions supporting hyperplane, proper face and boundary of a polyhedron
are borrowed from Rambau [Ram94].
Definition 3.12 Supporting Hyperplane
If the polyhedron P is contained in the halfspace defined by H^- = {x | A(i,:) x ≤ b(i)}, with P ∩ H ≠ ∅, then the hyperplane H is called a supporting hyperplane. □

Definition 3.13 Face of a Polyhedron
The intersection of P and a supporting hyperplane H is a face of the polyhedron P. □
Vertices are faces of dimension 0, edges are faces of dimension 1, and facets are faces of dimension dim(P) − 1. A face of dimension dim(P) is the polyhedron P itself.

Definition 3.14 Proper Faces
Faces between dimension 0 and dimension dim(P) − 1 are called proper faces. □

Definition 3.15 Boundary of a Polytope
The union of all proper faces of P is the boundary of P, denoted ∂P. □

The definition of the analytic center of a polyhedron is important in machine learning. For example, in the paper by Malyscheff and Trafalis the computation of the analytic center is used to define a new learning algorithm [TM].
Definition 3.16 Analytic Center of a Polyhedron
The analytic center of a polyhedron is an approximation of the center of mass of the polyhedron. It is computed by maximizing the distance to all facets of the polyhedron. □
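A common way to approximate the analytic center of a bounded polyhedron {x | Ax ≤ b} is to maximize the sum of the logarithms of the slacks b − Ax. The sketch below (Python/SciPy) is purely illustrative and is not the formulation used in [TM]; the starting point x0 is assumed to lie strictly inside the polyhedron.

import numpy as np
from scipy.optimize import minimize

def analytic_center(A, b, x0):
    # Maximize sum(log(b - Ax)) over the interior of {x | Ax <= b}.
    def neg_log_barrier(x):
        s = b - A @ x
        if np.any(s <= 0):
            return np.inf        # outside the polyhedron
        return -np.sum(np.log(s))
    return minimize(neg_log_barrier, x0, method="Nelder-Mead").x

# Unit square [0,1]^2: the analytic center is (0.5, 0.5).
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
print(analytic_center(A, b, np.array([0.2, 0.3])))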
We decided to represent a polyhedron as the intersection of a finite number of half-spaces. With respect to complexity this is a better representation than the parametric one: an n-cube can be described with a set of 2n inequalities, whereas 2^n vertices are needed to represent the same cube through its vertices. In general we write a polyhedron as the set of points P = {x | Ax ≤ b}. Often, an index will indicate whether the polyhedron belongs to the input space or the output space.
3.2 Operations on Polyhedra and Important Properties
All polyhedra are convex. Let P ⊆ R^n and Q ⊆ R^n be two polyhedra.
Intersection of Polyhedra
The intersection of P = {x ∈ R^n | Ax ≤ b} and Q = {x ∈ R^n | Cx ≤ d} is computed by concatenating the inequalities of Q to those of P, i.e. P ∩ Q = {x ∈ R^n | Ax ≤ b and Cx ≤ d}.
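In the half-space representation the intersection is therefore just a concatenation of the two inequality systems, as in this minimal Python/NumPy sketch; redundancy removal is deferred to Chapter 5.

import numpy as np

def intersect(A, b, C, d):
    # Intersection of P = {x | Ax <= b} and Q = {x | Cx <= d}:
    # stack the two inequality systems.
    return np.vstack([A, C]), np.concatenate([b, d])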
Union of Polyhedra
Let P and Q be two polyhedra. The union is defined as the set of points which are in P or in Q. With the union of polyhedra it is also possible to describe concave sets.
Definition 3.17 Non-redundant Polyhedron
A polyhedron P = {x | Ax ≤ b} defined with the minimal number of inequalities is called non-redundant. An equivalent definition is that the removal of any single inequality A(i,:) x ≤ b(i) from the description of P yields a polyhedron P' that strictly contains P, i.e. P ⊊ P'. □
Often the terminology facet-defined polyhedron is also used when referring to a non-redundant polyhedron. The intersection of two polyhedra can result in a redundant representation. In Chapter 5 we will discuss how to detect redundant inequalities.
Definition 3.18 Affine Transformation
An affine transformation is a function of the form y = φ(x) = Mx + d, where M is a matrix and x, y and d are vectors. □
The following Lemmas hold for polyhedra.
Lemma 3.1 Closure under Intersection
The intersection of two polyhedra is a polyhedron.
Proof:
Let P = {x ∈ R^n | Ax ≤ b} and Q = {x ∈ R^n | Cx ≤ d} be two polyhedra. From the definition of the intersection of two polyhedra it follows directly that P ∩ Q = {x ∈ R^n | Ax ≤ b, Cx ≤ d} is also a polyhedron, because it is described by a set of linear inequalities. □

Lemma 3.2 Closure under Affine Transformations
The image of a polyhedron under an affine transformation is a polyhedron.
We assume for the following proof that the transformation matrix M is invertible. As we will explain in Chapter 5, the projection of a polyhedron onto a lower-dimensional subspace is again a polyhedron. With this property we can reduce the non-invertible case to the invertible one.
Proof:
Let P = {x ∈ R^n | Ax ≤ b} be a polyhedron and φ an affine transformation. The image of the polyhedron is Q = φ(P), hence Q = {y | ∃x ∈ P: y = Mx + d}. Let the transformation matrix M be invertible; then Q = {y | A M⁻¹ y ≤ b + A M⁻¹ d}, i.e. Q is described by a set of linear inequalities. □
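For an invertible transformation matrix M the proof is constructive, as the following minimal Python/NumPy sketch (assuming M square and non-singular) shows.

import numpy as np

def affine_image(A, b, M, d):
    # Image of P = {x | Ax <= b} under y = Mx + d with M invertible:
    # Q = {y | A M^{-1} y <= b + A M^{-1} d}.
    AMinv = A @ np.linalg.inv(M)
    return AMinv, b + AMinv @ d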
3.3 Deformations of Polyhedral Facets under Sigmoidal Trans-
formations
Validity Polyhedral Analysis (VPA) requires the approximation of the image or reciprocal image of a polyhedron P under a non-linear transfer function with a set of finite unions of polyhedra. The problem of propagating a polyhedron through the non-linear
transfer function layer of a neural network is depicted in Figure 3.2. We restrict our
studies to continuous and invertible transfer functions. Typically, feed-forward neural
networks use sigmoidal functions for the non-linear transformation. As these functions
are bijections it is sufficient to analyse the back-propagation of a polyhedron through a
sigmoidal transfer function layer. The forward-propagation can be solved analogously.
Figure 3.2: The back-propagation of a polyhedron through the transfer function layer.
Given the polyhedral description P = {y | Ay ≤ b} in the output space of a transfer function layer, the reciprocal image of this polyhedron under the non-linear transfer function is given by σ⁻¹(P) = {x | Aσ(x) ≤ b}.
In this and the next chapters the following definitions and terminology are needed.
Definition 3.19 Non-linear region R
We use the expression non-linear region, denoted R, for the image or reciprocal image of a polyhedron P under a non-linear transformation. □

Definition 3.20 Manifold
The surface of R is a manifold. □

Definition 3.21 Facet manifold
The image or reciprocal image of a facet of a polyhedron under a nonlinear transformation is referred to as a facet manifold. □

Definition 3.22 Polyhedral Approximation Algorithm
The expression polyhedral approximation algorithm (or just: approximation algorithm) is used to refer to algorithms which approximate the non-linear region R with a set of finite unions of polyhedra. □
Definition 3.23 Slice
The intersection of the region R with an axis-parallel affine subspace defines a slice. □
A k-dimensional slice of an n-dimensional region (1 ≤ k < n) is obtained by keeping n − k variables fixed.

Definition 3.24 Free Variables
The variables that are not kept fixed within a slice are called free variables. □

Definition 3.25 Saddle Point
Points on the manifold where the surface switches from concave to convex (or vice versa) are referred to as saddle points. □

Let P = {y | Ay ≤ b} be a facet-defined (non-redundant) polyhedron in y-space. The reciprocal image in x-space is R = σ⁻¹(P) = {x | Aσ(x) ≤ b}. It is important to notice that σ is a vector¹ of scalar sigmoid functions and is applied to the vector x = (x(1), ..., x(n)) component-wise, i.e. σ(x) denotes the vector (σ(x(1)), ..., σ(x(n))). A facet of the polyhedron P under a component-wise transformation σ can get stretched or squeezed, and convex or concave curvatures are possible.
An important observation is that the reciprocal images of axis-parallel facets in y-space are axis-parallel facets in x-space, and vice versa. If we consider a single facet of a polyhedron (let's say the i-th facet), then we write aᵀy = β, where a = A(i,:) and β = b(i). An axis-parallel hyperplane is characterized by a = (0, ..., 0, 1, 0, ..., 0), i.e. just one component is non-zero. As we apply the transfer function component-wise, the image of an axis-parallel facet under a sigmoidal transformation is another axis-parallel facet. In other words, an axis-parallel facet can only get stretched or squeezed under the sigmoidal transformation.
For the general case the analysis of the curvature behaviour in the neighbourhood of a point x on a facet manifold is used. In an n-dimensional space a facet manifold is of dimension n − 1. This allows curvatures in n − 1 directions. Knowledge about the type of a curvature, i.e. whether the curvature is convex or concave, and about the strength of the curvature is useful for any polyhedral approximation algorithm. Information on the curvatures can be used to define splits of the polyhedron. For example, to obtain a precise approximation in areas of strong concave curvature in the manifold of R, a polyhedral split is useful to refine the approximation. An example is depicted in Figure 3.3.
¹Note that the vector σ of scalar functions is not denoted by a bold letter, as this could cause confusion.
Figure 3.3: The non-linear region R is depicted on the left-hand side (a). An approximation of R with a single polyhedron is shown in the middle (b). In (c) an improved approximation using the union of two polyhedra is depicted.
The next three paragraphs explain three different methods to analyse the structure of the manifold of R.
Subdivision in cells
The first approach to analysing the manifold structure assumes a subdivision of the y-space into a large number of arbitrarily small axis-parallel cells. Within every cell the sigmoid function is approximated through piece-wise linear functions. As a single cell is arbitrarily small, we can view the sigmoidal function as linear within the cell. The slope of the linear function is defined by the first derivative of the sigmoid function. For the two-dimensional case, the linear approximation associated with a cell is described as follows:
ψ(x) = Dx + δ, where D = diag(d₁, d₂) and d_i = σ'(x(i)) = σ'(σ⁻¹(y(i)))
To analyse the curvature behaviour in a small neighbourhood, the effect of a transition from one cell to another on the reciprocal image has to be observed. Figure 3.4 illustrates this idea and Figure 3.5 depicts an example for a two-dimensional space. The curvature behaviour of the true reciprocal image of a hyperplane H = {y | aᵀy = β} obviously depends on the vector a, the constant β and the vector of transfer functions. Table 3.1 summarizes the observations for switches between neighbouring cells. In higher-dimensional cases we have to find n − 1 cells in the neighbourhood of a point.
A criticism of this method is that it is only applicable within small neighbourhoods, does not scale well to higher dimensions, and does not consider the effect of the constant β.
Figure 3.4: The sigmoid function is approximated through a piece-wise linear function within every cell. A transition from cell i to cell i+1, assuming positive x(2) values, results in a concave curvature in the x-space. The approximation for a cell is defined by ψ(x) = Dx + δ. A transition from cell i to cell i+1 changes only the corresponding component in the diagonal matrix and in the vector δ, in our case d₂ and δ(2).
Figure 3.5: The hyperplane in y-space is defined by a fixed normal vector a and constant β. With an arbitrarily small subdivision of the y-space, we can observe that cell transitions occur most often in the y(2) direction. Hence the manifold is concave in the neighbourhood of the plotted point (see also Figure 3.4 and Table 3.1).
premises                  equivalent premises              conclusion
a_i > 0 and t_i > 0       a_i · (σ⁻¹)''(y(i)) < 0          concave
a_i < 0 and t_i > 0       a_i · (σ⁻¹)''(y(i)) > 0          convex
a_i > 0 and t_i < 0       a_i · (σ⁻¹)''(y(i)) > 0          convex
a_i < 0 and t_i < 0       a_i · (σ⁻¹)''(y(i)) < 0          concave
Table 3.1: Concave and convex curvature in the neighbourhood of a point.
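For the logistic sigmoid the slope d_i of the linear piece associated with a cell can be read off directly from the y-value of the cell, since σ'(x) = σ(x)(1 − σ(x)) = y(1 − y). The sketch below builds the cell approximation ψ(x) = Dx + δ around a cell centre; the choice of the logistic function and of the cell centre as expansion point are assumptions made for illustration only.

import numpy as np

def cell_approximation(y_centre):
    # Linear approximation psi(x) = D x + delta of the logistic sigmoid around
    # the cell centre y_centre (component-wise).  Slope of component i is
    # sigma'(sigma^{-1}(y_i)) = y_i * (1 - y_i); delta makes psi exact at the centre.
    y = np.asarray(y_centre, dtype=float)
    x = np.log(y / (1.0 - y))           # sigma^{-1}(y) for the logistic function
    slopes = y * (1.0 - y)
    delta = y - slopes * x
    return np.diag(slopes), delta

D, delta = cell_approximation([0.7, 0.3])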
Point sampling method
The point sampling method analyses the manifold structure by determining points on the manifold. This approach is shown in Figure 3.6. For any axis-parallel two-dimensional slice of the region R it is possible to find the intervals for the free variables. Additionally, we are able to find the inflection points in this slice². The connecting line segments provide information about stretching, squeezing and curvature behaviour. To refine the information, a divide and conquer strategy can be applied to the line segments.
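A minimal Python/NumPy sketch of the point sampling idea for a single facet within a two-dimensional slice, assuming a tansig (tanh) transfer function and an already computed interval for the free variable; all numerical values are hypothetical.

import numpy as np

def sample_facet_slice(a1, a2, beta_tilde, x1_lo, x1_hi, n_samples=5):
    # Sample points on the facet manifold a1*tanh(x1) + a2*tanh(x2) = beta_tilde
    # within a two-dimensional slice; x1 is the free variable.
    x1 = np.linspace(x1_lo, x1_hi, n_samples)
    x2 = np.arctanh((beta_tilde - a1 * np.tanh(x1)) / a2)
    return np.column_stack([x1, x2])

# The line segments connecting consecutive sample points indicate stretching,
# squeezing and curvature behaviour within the slice.
pts = sample_facet_slice(a1=1.0, a2=1.0, beta_tilde=0.2, x1_lo=-0.5, x1_hi=0.5)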
Figure 3.6: A hyperplane (given by a fixed normal vector a and constant β) in y-space and its true reciprocal image in x-space. The three points in y-space and the corresponding points in x-space indicate how a point sampling method can be used to find the curvature behaviour within a two-dimensional slice.
²Section 4.2 explains the computation of the interval of the free variables of a two-dimensional slice and the computation of the inflection point.
Eigenvalue and Eigenvector analysis
The vector-based expansion of the Taylor formula is applied to compute an approximation of the manifold of R in the neighbourhood of a point x. By computing the eigenvalues and eigenvectors in the tangent hyperplane at the point x, the strength, the type (convex or concave) and the direction of the deformation are determined. For the following computation a non-redundant description of the polyhedron is assumed, i.e. every inequality corresponds to a facet.
- The i-th facet of a polyhedron P is defined by the following equality:
  aᵀy = β, where aᵀ = A(i,:) and β = b(i)
- The true reciprocal image of this polyhedral facet hyperplane is given by:
  aᵀσ(x) = β
- We want to analyse the structure of the facet manifold in the neighbourhood of the point x with y = σ(x).
  The facet manifold f(x) is: f(x) = aᵀσ(x) − β = 0
  To get an approximation of the manifold in the neighbourhood of x, the vector-based Taylor formula can be applied:
  f(x + Δx) ≈ f(x) + ∇fᵀ Δx + ½ Δxᵀ ∇²f Δx
  where Δx denotes a direction in the tangent hyperplane starting at the point x.
  ∇²f is the Hessian matrix. It is a diagonal matrix with the component-wise second derivatives on the diagonal:
  ∇²f = diag( a(1) σ''(x(1)), ..., a(n) σ''(x(n)) )
  We can write f(x + Δx) ≈ 0 + 0 + ½ Δxᵀ ∇²f Δx, since f(x) = 0 and, because all directions in the tangent hyperplane are orthogonal to the gradient vector ∇f, the dot product ∇fᵀ Δx = 0. Hence, we get:
  f(x + Δx) ≈ ½ Δxᵀ ∇²f Δx    (1)
- Generally, the gradient vector ∇f is not parallel to any of the basis vectors of the standard basis B₀. To obtain the basis of the tangent hyperplane we compute the n − 1 mutually orthogonal vectors to the gradient vector ∇f. The standard technique to compute, for a given vector, mutually orthogonal vectors is the Gram-Schmidt decomposition (also known as QR-factorization [GVL89]). The n − 1 mutually orthogonal vectors defining the tangent hyperplane, together with the gradient vector ∇f, build the basis B₁. We use the notation T_{B₀B₁}, where the column vectors of T_{B₀B₁} are the basis vectors of B₁ expressed in basis B₀, to denote the change from basis B₀ to basis B₁.
- With Δx_{B₀} = T_{B₀B₁} Δx_{B₁} formula (1) becomes:
  f(x_{B₀} + Δx_{B₀}) ≈ ½ Δx_{B₀}ᵀ ∇²f Δx_{B₀}
  f(T_{B₀B₁} x_{B₁} + T_{B₀B₁} Δx_{B₁}) ≈ ½ Δx_{B₁}ᵀ T_{B₀B₁}ᵀ ∇²f T_{B₀B₁} Δx_{B₁}    (2)
- The matrix S = T_{B₀B₁}ᵀ ∇²f T_{B₀B₁} is symmetric.
- For a symmetric matrix it is always possible to find a matrix T_{B₁B₂} such that S' = T_{B₁B₂}ᵀ S T_{B₁B₂} is a diagonal matrix [BS81]. The column vectors of T_{B₁B₂} are the eigenvectors of S with respect to basis B₁. The diagonal values of the resulting matrix S' are the eigenvalues of S with respect to basis B₁. The steps to calculate T_{B₁B₂} are:
  1. calculate the eigenvalues λ of S with respect to basis B₁, i.e. calculate the roots of the characteristic polynomial given by det(S − λI) = 0;
  2. calculate the eigenvectors v, i.e. solve Sv = λv.
- With Δx_{B₁} = T_{B₁B₂} Δx_{B₂} formula (2) becomes:
  f(T_{B₀B₁} x_{B₁} + T_{B₀B₁} Δx_{B₁}) ≈ ½ Δx_{B₁}ᵀ T_{B₀B₁}ᵀ ∇²f T_{B₀B₁} Δx_{B₁}
  f(T_{B₀B₁}T_{B₁B₂} x_{B₂} + T_{B₀B₁}T_{B₁B₂} Δx_{B₂}) ≈ ½ Δx_{B₂}ᵀ T_{B₁B₂}ᵀ S T_{B₁B₂} Δx_{B₂}
- Finally, for the approximation of f in the neighbourhood of x_{B₀}, we obtain the following formula, with S' representing the diagonal matrix containing the eigenvalues λ_i:
  f(x_{B₀} + Δx_{B₀}) = f(T_{B₀B₁}T_{B₁B₂} x_{B₂} + T_{B₀B₁}T_{B₁B₂} Δx_{B₂}) ≈ ½ Δx_{B₂}ᵀ S' Δx_{B₂} = ½ Σ_{i=1}^{n−1} λ_i Δx_{B₂}(i)²
To summarize the interpretation and the use of this result:
- an eigenvector of S with respect to basis B₁ gives the direction of a curvature,
- the corresponding eigenvalue of S with respect to basis B₁ gives the strength of the curvature in the direction of that eigenvector,
- λ_i < 0 ⇒ concave curvature in the direction of the corresponding eigenvector,
- λ_i > 0 ⇒ convex curvature in the direction of the corresponding eigenvector.
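The computation summarized above can be sketched in a few lines. The example below assumes tanh as the transfer function and obtains the tangent-hyperplane basis via SciPy's null_space routine instead of the Gram-Schmidt/QR route described in the text; it is illustrative only and not the thesis' Matlab implementation.

import numpy as np
from scipy.linalg import null_space

def curvature_analysis(a, x):
    # Curvature of the facet manifold f(x) = a^T tanh(x) - beta = 0 near a point
    # x on the manifold.  Negative eigenvalues of the projected Hessian indicate
    # concave, positive eigenvalues convex curvature along the eigenvector.
    s = np.tanh(x)
    grad = a * (1.0 - s ** 2)                        # a_i * sigma'(x_i)
    hess = np.diag(a * (-2.0 * s) * (1.0 - s ** 2))  # a_i * sigma''(x_i)
    T = null_space(grad.reshape(1, -1))              # orthonormal tangent basis
    S = T.T @ hess @ T                               # Hessian in the tangent basis
    eigvals, eigvecs = np.linalg.eigh(S)             # S is symmetric
    return eigvals, T @ eigvecs                      # eigenvectors in x-coordinates

a = np.array([1.0, -0.5, 0.8])
x = np.array([0.3, -0.2, 0.1])    # a point on the manifold (beta := a^T tanh(x))
lam, vecs = curvature_analysis(a, x)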
Figures 3.7 to 3.9 are visualizations of the above result for a three-dimensional space. These figures show the deformation of a two-dimensional polyhedral facet³ under a vector of three tansig functions. We also plotted the gradient and the eigenvectors at a point x. The plots show three different points of view.
³For the implementation in Matlab we used the QR-factorization to calculate a grid of points on the polyhedral facet and finally computed the corresponding points in x-space by applying σ⁻¹.
Figure 3.7: The reciprocal image of a hyperplane in y-space. The dotted eigenvector
indicates the direction of the convex curvature, the dashed eigenvector points in the
direction of the concave curvature in the neighbourhood of the corresponding point
on the facet manifold. The length of the eigenvector depends on the strength of the
curvature. The gradient vector at the given point is plotted with a solid line.
Figure 3.8: The convex curvature of the manifold in the direction of the dotted eigen-
vector is clear. The length of the eigenvector represents the eigenvalue, i.e. expresses
the strength of the curvature in this direction. The eigenvector is underneath the mani-
fold.
Figure 3.9: The concave curvature in the direction of the dashed eigenvector is difficult
to see. We can see that the eigenvector points out of the manifold, i.e. above the
manifold.
3.4 Summary of this Chapter
This chapter first introduced basic concepts of polyhedra, which are relevant within
the scope of this thesis.
Next we introduced three different approaches to analyse the structure of the manifold of the non-linear region R. The cell subdivision approach is not generally applicable, because the constant β is not included in the analysis, and the point sampling approach cannot guarantee to find the directions of the strongest concave or convex curvature for a given point on the manifold. By computing the eigenvectors and eigenvalues in the vicinity of a point x on the manifold, we obtain the directions of the strongest curvatures at x. This is the most informative method and helpful for obtaining refined polyhedral approximations of the non-linear region R.
Contributions - Chapter 3 -
- The computation of the eigenvalues and the eigenvectors in the neighbourhood of a point x on the manifold of the region R provides useful information about the deformations of polyhedral facets under sigmoidal transformations.
Chapter 4
Nonlinear Transformation Phase
The back-propagation of polyhedra technique [Mai98] can be improved by minimizing the number of subdivisions of a polyhedron, which are necessary when a linear approximation of the non-linear transfer function is used. Section 4.1 introduces a possible method.
Another approach uses a direct approximation of the non-linear region R. This method leads to a non-linear optimization problem. Section 4.2 discusses different techniques to solve this non-linear optimization problem, namely Sequential Quadratic Programming (SQP), the repeated computation of the optimal solution within two-dimensional slices, a branch and bound approach and finally a binary search method.
4.1 Mathematical Analysis of Non-Axis-parallel Splits of a
Polyhedron
The piece-wise linear approximation of the sigmoid function results in axis parallel
splits of the polyhedron [Mai98]. The number of splits increases exponentially with
the number of neurons of a layer. Therefore methods based on piece-wise linear ap-
proximations of the sigmoid function have to minimize the number of splits in order to
be applicable to higher dimensional cases. As discussed in Chapter 3, the analysis of
the surface (manifold) of the true reciprocal image is helpful in reducing the number
of splits.
Given a polyhedron P in y-space, we can compute for each of its facets the analytic center of mass and analyse the twist behaviour in the vicinity of these points. This information can be applied to efficiently split a polyhedron into a small number of sub-polyhedra. A selection of sample points on P, together with a computation of the corresponding eigenvalues and eigenvectors, is used to develop heuristics to obtain "good" splits of the polyhedron. Figure 4.1 depicts a scenario in dimension two. In this example a splitting hyperplane is defined via two opposite points in which the region R has a concave curvature.
This approach has several drawbacks. In general the method results in splits of the polyhedron that are not axis-parallel, and for each cell we have to determine a piece-wise linear approximation of the sigmoid function. It is important to notice that for neighbouring cells the approximations have to agree on the splitting hyperplane (see Figure 4.1). A difficulty is to obtain a reasonable approximation by the piece-wise linear functions ψ_i. Additionally, we did not investigate sophisticated techniques to obtain splitting hyperplanes in higher-dimensional cases. A sketch of the algorithm follows.
Figure 4.1: A scenario for a split of a polyhedron which is not parallel to an axis. The piece-wise linear approximation of the sigmoid function for each cell has to be determined, i.e. we have to find the matrix D and the vector δ which define the affine approximation. Note that in general the matrix D is not diagonal. For neighbouring cells the approximations have to agree on the separating hyperplane.
Non axis-parallel split approach
1. Compute sample points on each facet of the polyhedron P, e.g. including the analytic center of mass points.
2. Compute the eigenvalues and the eigenvectors at the corresponding points on the surface of R.
3. Define a low number of splitting hyperplanes by using the information about the twist behaviour at the sample points. Split the polyhedron P = {y | Ay ≤ b} by adding these hyperplanes.
4. Subdivide the space into cells according to the split of the polyhedron.
5. Determine the piece-wise linear approximation of the sigmoid function within every cell. For two neighbouring cells i, i+1 we have to find the linear functions ψ_i, ψ_{i+1} with ψ_i(x) = D_i x + δ_i and ψ_{i+1}(x) = D_{i+1} x + δ_{i+1}. These affine functions have to agree on the separating hyperplane H = {x | aᵀx = β}. Therefore we have to solve the following non-linear optimization problem:
   min ||σ(x) − ψ_i(x)||  and  min ||σ(x) − ψ_{i+1}(x)||
   s.t. ψ_i(x) = ψ_{i+1}(x) for all x ∈ H
   Choosing a number of sample points and calculating D_i, δ_i according to a least mean square error defines ψ_i. We are then able to define ψ_{i+1} by choosing sample points in cell i+1 and determining D_{i+1}, δ_{i+1} under the constraint that the two functions agree on the separating hyperplane.
As indicated in the above algorithm, no heuristics for the definition of splitting hyperplanes were developed. Additionally, in higher-dimensional cases an increasing number of neighbouring cells complicates the approximation of a piece-wise linear function ψ_i. Thus far we did not develop a satisfactory method for the approximation of ψ_i, also because a different approach which seemed more promising had been identified (see the next section). A further disadvantage of the above algorithm is that, due to the non-
axis-parallel split, the matrix D of the affine function ψ(x) is not necessarily diagonal. This leads to the undesired side-effect that axis-parallel facets of sub-polyhedra within a cell are not mapped to axis-parallel facets. For these reasons further analysis was not undertaken. The problem was reconsidered from another point of view: instead of approximating the sigmoid function with piece-wise linear functions, the polyhedral approximation starts from the non-linear region R itself. This reduces to a non-linear optimization problem. The following approaches can be viewed as a polyhedral wrapping of the region R. The next section explains this view and in the following subsections our approaches are presented.
4.2 Mathematical Analysis of a Polyhedral Wrapping of a
Region
We start this section with some definitions which we will use in our description of methods to compute a polyhedron containing the region R.

Definition 4.1 Wrapping Polyhedron
A wrapping polyhedron is a polyhedron that is used to wrap the nonlinear region R from outside. □

Definition 4.2 Nonlinear Optimization Problem
In its most general form a nonlinear programming problem is to maximize (or minimize) a function subject to some constraints:
max f(x)   s.t.   g(x) = 0,   h(x) ≤ 0
where f, g and h are functions with f: R^n → R, g: R^n → R^p and h: R^n → R^q. □
Special cases are linear and quadratic programs. In these cases the function f is linear or quadratic and the constraint functions g and h are affine.
Definition 4.3 Cost-function
The function f(x) which is to be minimized or maximized is also referred to as the cost-function. □

Definition 4.4 Constraint Function
We also use the expression constraint functions, or simply constraints, for the conditions under which a function is to be optimized. □

The following approaches postpone the approximation process. Instead of a piece-wise linear approximation of the transfer function, we compute a polyhedron containing the true reciprocal image (see Figure 4.2 as an example). The direction vector g of each hyperplane of this (wrapping) polyhedron has to be determined. For example, the gradient vector at a point on the manifold of the true reciprocal image has the nice property that we can find the exact solution in the linear case. Given a polyhedron P = {y | Ay ≤ b} in the output space (y-space) of a transfer function layer, we want to approximate the reciprocal image R = {x | Aσ(x) ≤ b} with a finite union of polyhedra.
The reciprocal image of the analytic center of mass of a facet is an interesting point for the computation of a suitable direction vector. Figure 4.2 depicts the wrapping of the non-linear region using one polyhedron.
Figure 4.2: In (a) we see the true reciprocal image of a polyhedron with respect to the transfer function. Given an interior point on each manifold of the true reciprocal image we determine the gradient g at this point (a). The gradient vector defines the optimization direction. This has the nice property that we always find the optimal solution for linear manifolds. In (b) we plotted the region and the polyhedral approximation, which can be viewed as the convex hull of the true reciprocal image.
The orientation of each hyperplane of the wrapping polyhedron is given by a vector g. It remains to find the optimal position for the hyperplane. Geometrically, the optimal solution for a hyperplane (i.e. the best position) is characterized by a point (or set of points) on the manifold of R where the hyperplane is tangent to the manifold. Analytically this can be formulated as a non-linear optimization problem, and repeating this process for all hyperplanes defines a wrapping polyhedron P' containing the region R. A positive property of this method is that the calculation can easily be distributed among different CPUs (central processing units), according to the chosen number of hyperplane directions. In x-space the optimization problem is defined by:
max gᵀx  subject to  Aσ(x) ≤ b    (i)
This means we have to maximize a linear cost-function under non-linear constraints. The equivalent formulation of this problem in y-space is:
max gᵀσ⁻¹(y)  subject to  Ay ≤ b    (ii)
In general, the above optimization problems are hard and cannot be solved exactly. The continuous optimization problem is hard in the sense that it is not possible to find a method which always guarantees to compute the global optimum in a reasonable time.
The optimization problems (i) and (ii) are hard problems
Feed-forward neural networks are general function approximators [HSW89]. A non-linear constrained optimization problem is a hard problem. As feed-forward neural networks are general function approximators, the following problem must also be a hard problem (g represents here the weight vector to a linear output node):
max gᵀσ(Wx)  subject to  Ax ≤ b    (1)
Polyhedra are closed under affine transformations, and with u = Wx, (1) becomes:
max gᵀσ(u)  subject to  Ãu ≤ b̃    (2)
In (2) we used P̃ = {u | Ãu ≤ b̃} to express the image of the polyhedron P = {x | Ax ≤ b} under the affine transformation. By comparing the optimization problems (2) and (ii) it becomes clear that both problems are essentially the same (the use of the inverse function σ⁻¹ does not change the difficulty of the problem). Therefore, as (2) is a hard problem, (i) and (ii) are also hard problems. □
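For a single direction g, problem (i) can be handed to a general-purpose SQP-style solver. The sketch below uses SciPy's SLSQP (instead of Matlab's fmincon) with tanh as the transfer function; as discussed in the next subsection, such a solver may return only a local optimum, so its result is not a guaranteed outer bound.

import numpy as np
from scipy.optimize import minimize

def support_value(g, A, b, x0):
    # Approximate max g^T x subject to A*tanh(x) <= b (problem (i)), i.e. the
    # position of the wrapping hyperplane with direction g.  Only a local
    # optimum is guaranteed.
    cons = {"type": "ineq", "fun": lambda x: b - A @ np.tanh(x)}
    res = minimize(lambda x: -(g @ x), x0, method="SLSQP", constraints=[cons])
    return res.x, g @ res.x

A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([0.8, 0.8, 0.2, 0.2])      # polyhedron -0.2 <= tanh(x) <= 0.8
x_opt, val = support_value(np.array([1.0, 0.5]), A, b, np.zeros(2))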
The next subsections describe different approaches to obtain an approximation of the global optimum. The first two approaches search on the manifold for the optimal solution. These techniques cannot guarantee to find the optimal solution and therefore we cannot assure that the wrapping polyhedron P' contains the non-linear region R. However, to verify the behaviour of neural networks we have to assert that the polyhedron P' contains R. Hence, these two methods are not suitable for our overall goal. Consequently, two other approaches, namely a branch and bound and a binary search method, which can guarantee that the wrapping polyhedron P' contains R, have been developed.
4.2.1 Sequential Quadratic Programming
Nonlinearly constrained optimization problems have often been solved with Sequential Quadratic Programming (SQP) techniques. SQP is a conceptual method from which various algorithms have evolved.
The main idea of SQP is as follows (for an overview of SQP methods, the reader is referred to the publication by Boggs and Tolle [BT96]): given a starting point x₀, the nonlinear optimization problem is modeled by a quadratic programming subproblem. A solution of this quadratic programming problem often results in a better approximation x₁. This process is iterated to generate a sequence of approximations until the method converges to an (often local) optimal solution. The Matlab optimization toolbox provides a function, named fmincon, which relies on Sequential Quadratic Programming (SQP) methods [Mat00b]. According to [Mat00b], state-of-the-art optimization algorithms are implemented. These routines for solving linear and quadratic optimization problems use an active-set method combined with projection techniques [Mat00b]. However, the application of fmincon to the optimization problem in x-space, i.e.
max gᵀx  subject to  Aσ(x) ≤ b,
or the corresponding problem in y-space,
max gᵀσ⁻¹(y)  subject to  Ay ≤ b,
often did not converge to the global optimum. As a result the wrapping polyhedron did not contain all points of the nonlinear region R. This was shown by randomly
generating polyhedra in x-space and testing whether the computed wrapping polyhedron contained the non-linear region R. Points were generated on the surface of the polyhedron, mapped using the sigmoidal function, and tested for containment in the approximating polyhedron. Of course, using different starting points on the manifold while optimizing in the same direction g could lead to better results. But this is computationally expensive and still does not guarantee that the optimal solution is found. The graph of the inverse sigmoid function indicates that points of strong change are close to +1 and -1. This simple observation resulted in a first heuristic. Given the interior point y on the corresponding facet of the polyhedron in y-space, we determine the strongest component of g and optimize on the facet in this direction, under the constraint that the optimum is a point of the polyhedron P. Linear programming techniques solve this problem. We then choose sample points on the connecting line segment. This led to better results but still could not guarantee that the global optimum is found. To conclude: current implementations of SQP often converge to local optima. As a consequence the polyhedral approximation would not be the convex hull of R. The next approach, which relies on the search for the optimal solution on the manifold, is named the Maximum Slice Approach (MSA).
4.2.2 Maximum Slice Approach
In a two-dimensional space a solution to the optimization problem
max gᵀx  subject to  Aσ(x) ≤ b
can always be found. For higher dimensions we can build a two-dimensional slice by keeping all but two variables constant. In a system with two free variables one variable can be expressed as a function of the other. Thus, by determining the domain of the free variable, the maximum of the cost function can be calculated. This observation is the basic idea of the Maximum Slice Approach (MSA). Given an approximate solution after the k-th iteration, say p_k, we construct a better approximation p_{k+1} by computing new solutions in all two-dimensional slices through p_k. This process is iterated until there is no significant improvement between two consecutive points.
[p_max] = maximumSlice(g, P)
// let n be the dimension of the polyhedron and
// m be the number of facets of the polyhedron P
for i = 1..m
    y₀ = analyticCenterOfMass(facet_i)
    p₀ = σ⁻¹(y₀)
    k = 0                                  // k is an iteration index
    do
        k = k + 1
        p_k = p_{k-1}
        for all n(n−1)/2 slice combinations
            slice = [s, t]                 // where s ≠ t;
                                           // p_k([1..n] \ {s, t}) is kept constant
            p_k([s, t]) = optimumInSlice(slice, p_k, g, i)
    while (gᵀ p_k > gᵀ p_{k-1})
    // p_max,i denotes the maximum for facet i;
    // note that p_k is not an improvement over p_{k-1}
    p_max,i = p_{k-1}
p_max = max_i p_max,i
It is possible that the optimal solution is not on the facet-manifold which defined the gradient; hence a maximum for all facets is computed. The overall maximum then defines the best point. A further improvement of this algorithm would restrict the computation to the relevant facet-manifolds (e.g. facets opposite to the search direction are not important).
The following describes how to compute the optimal solution for the i-th facet manifold in the (x(1), x(2)) slice. As the optimum has to be on the manifold of the reciprocal image of the facet, the i-th row of the system of inequalities becomes an equality, i.e.:
A(i,:) σ(x) = b(i)
Within the considered slice (as the other variables x(3), ..., x(n) are kept constant) the optimization problem reduces to:
max  g(1) x(1) + g(2) x(2)
s.t.  A(i,[1,2]) σ(x([1,2])) = b(i) − A(i,[3:n]) σ(x([3:n]))
Now it is possible to express x(2) as a function of x(1). To simplify the expression let a = A(i,:), β = b(i) and β̃ = β − [a(3) σ(x(3)) + ... + a(n) σ(x(n))]. Then
a(1) σ(x(1)) + a(2) σ(x(2)) = β − [a(3) σ(x(3)) + ... + a(n) σ(x(n))]
and, as x(3), ..., x(n) are kept constant, this simplifies to
a(1) σ(x(1)) + a(2) σ(x(2)) = β̃
so we can express x(2) as a function of x(1). The assumption a(2) ≠ 0 is valid, otherwise we would not analyse a slice with x(2) as a free variable:
x(2) = σ⁻¹( (β̃ − a(1) σ(x(1))) / a(2) )
To summarize, maximizing gᵀx in the (x(1), x(2)) slice is equivalent to maximizing
f(x(1)) = g(1) x(1) + g(2) σ⁻¹( (β̃ − a(1) σ(x(1))) / a(2) )
It is important to note that the term g(2) σ⁻¹((β̃ − a(1) σ(x(1))) / a(2)) is monotonic. For the calculation of the maximum, the computation of the interval of the free variable x(1) is required. Using the knowledge that axis-parallel slices in x-space are again axis-parallel slices in y-space, we are able to calculate the interval in y-space and simply backpropagate it to x-space. The calculation of the interval in y-space is
solved with two linear optimization problems:
max (and min)  y(1)   s.t.   A(:,[1,2]) y([1,2]) ≤ b̃,   A(i,[1,2]) y([1,2]) = β̃
where b̃ denotes the right-hand side adjusted for the fixed components. The interval of x(1) is then [σ⁻¹(y(1)_min), σ⁻¹(y(1)_max)]. The calculation of the maximum of f(x(1)) requires the first derivative. To simplify the notation we write t = x(1), c_j = g(j) and a_j = a(j). Setting the numerator of f'(t) to zero and substituting z = σ(t) leads to a quadratic polynomial in z whose coefficients depend on c₁, c₂, a₁, a₂ and β̃. With the solutions z₁, z₂ of this quadratic polynomial we can compute the points where f'(t) = 0, namely t₁ = σ⁻¹(z₁) and t₂ = σ⁻¹(z₂). Let us assume t ∈ [lo, up]. The maximal point for the optimization problem is either a point where the derivative is zero, or it is lo or up, i.e. a point at the border of the domain of t (due to the structure of the function f). The optimumInSlice function describes this algorithm in pseudo-code. Note that y[.] and x[.] denote the interval of the free variable in y-space and x-space respectively.
[p, v] = optimumInSlice(slice, p_k, g, i)
// p is the computed optimal point within the slice
// v is the computed value of the cost-function
// A and b are global variables; the current point p_k fixes the other components
y[.] = interval of y(1) from:  max / min y(1)  s.t.  A(:,[s,t]) y([s,t]) ≤ b̃ ∧ A(i,[s,t]) y([s,t]) = β̃
// the interval of the free variable is given by:
x[.] = σ⁻¹(y[.]);   lo = min x[.];   up = max x[.]
c₁ = g(s);  c₂ = g(t);  a₁ = A(i,s);  a₂ = A(i,t)
β̃ = b(i) − Σ_{j∉{s,t}} A(i,j) σ(p_k(j))
f(u) = c₁ u + c₂ σ⁻¹( (β̃ − a₁ σ(u)) / a₂ )
// the maximum is either at the interval borders or at a point where f'(u) = 0
// w.l.o.g. let f'(u₁) = 0 ∧ f'(u₂) = 0 ∧ f(u₁) > f(u₂) ∧ u₁ ∈ [lo, up]
v = max( f(u₁), f(lo), f(up) )
// if u₁, u₂ ∉ [lo, up] then v = max( f(lo), f(up) )
u* = u₁ if v = f(u₁);  lo if v = f(lo);  up otherwise
p = [ u*;  σ⁻¹( (β̃ − a₁ σ(u*)) / a₂ ) ]
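A numerical sketch of the slice optimization, assuming tanh as the transfer function: instead of the closed-form roots of f'(t), a bounded scalar optimizer is used and the interval borders are checked explicitly; all numerical values are hypothetical.

import numpy as np
from scipy.optimize import minimize_scalar

def optimum_in_slice(c1, c2, a1, a2, beta_tilde, t_lo, t_hi):
    # Maximize f(t) = c1*t + c2*arctanh((beta_tilde - a1*tanh(t)) / a2) over the
    # interval [t_lo, t_hi] of the free variable.
    def f(t):
        return c1 * t + c2 * np.arctanh((beta_tilde - a1 * np.tanh(t)) / a2)
    res = minimize_scalar(lambda t: -f(t), bounds=(t_lo, t_hi), method="bounded")
    t_star = max([res.x, t_lo, t_hi], key=f)         # also check the interval borders
    x2 = np.arctanh((beta_tilde - a1 * np.tanh(t_star)) / a2)
    return np.array([t_star, x2]), f(t_star)

point, value = optimum_in_slice(c1=1.0, c2=0.5, a1=1.0, a2=1.0,
                                beta_tilde=0.2, t_lo=-0.5, t_hi=0.5)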
This method performed quite well in a large number of lower-dimensional experiments, but unfortunately, for higher-dimensional examples, the method did not converge to the global optimum. The reasons why this method did not always converge to the global optimum remain unclear and require a detailed analysis.
The next subsections introduce two methods which guarantee finding an optimum that is either on the manifold, or within some precision outside the manifold; i.e. these methods can ensure that the wrapping polyhedron P' contains R, while approximating the global optimum on the manifold. Hence, for a given number of hyperplanes, the computed wrapping polyhedron is not minimal.
4.2.3 Branch and Bound Approach
Branch and bound is a well-known strategy which has been successfully applied to a number of optimization problems, e.g. integer programming problems. Branch and bound splits the original problem into two (or more) disjoint subproblems (branch) and successively computes solutions on the smaller subproblems. Within an iteration a new solution on the best subproblem is calculated. The best current solution is used to prune (bound) irrelevant subproblems. This process is repeated until a stopping criterion is fulfilled, e.g. when further calculation would not significantly improve the result. We apply the branch and bound approach to the following optimization problem:
max gᵀσ⁻¹(y)  subject to  Ay ≤ b
Figure 4.3: Application of branch and bound to approximate the optimal solution of the optimization problem max gᵀσ⁻¹(y) subject to Ay ≤ b. The upper right corner defines the upper bound for the value of the cost-function, and the arrow on the surface of the region indicates the optimization direction. The initial split of the hypercube is depicted on the left side, while the right side plots the next split for the upper sub-box. A sub-box can be pruned if its upper bound for the cost-function is less than the best current solution on the surface of the manifold.
We refer to the maximal point of the cost-function, restricted to a box B, as up(B), and for the value of the cost-function at this point we write f_up. In our tests we used a tansig transfer function and restricted the sides of the cube to the interval [−0.98, 0.98], which corresponds to an input range between [−2.29, 2.29].

The intersection between the restricted hypercube and the original polyhedron is a polytope. The hypercube B⁰, containing this polytope in y-space, is calculated by applying linear optimization techniques. An interior point of the polytope is calculated as the barycentre between the component-wise extreme points (named q_i) of the cost-function subject to the polytope. This defines a starting point for a sequential quadratic programming problem and leads to a local maximum on a facet of the polytope.

Starting with the initial box B⁰ we split the box (along the longest side) into two sub-boxes B¹ and B². The sub-box with the larger upper bound is selected for further calculation. The process to compute a local optimum on a facet of the polytope is repeated, but restricted to the sub-polytope contained in the chosen sub-box. The best current solution point on a facet of the polytope is called s*, and f_{s*} expresses the value of the cost-function at this point. We can prune a box if its intersection with the polyhedron is empty or when its upper bound (for the cost-function) is less than the best (current) solution. This process is repeated until a stopping criterion is fulfilled.
All relevant sub-boxes are sorted in a list according to the supremum of the box. The first box in this list (i.e. the box with the largest supremum) is named the top box B_t. The stopping criteria are defined as follows; stop if

- the volume of the box B_t is less than a pre-defined value, or:
- the distance between a surface point and the supremum of B_t is less than a pre-defined value, which indicates that the current solution already is the optimal solution or close to it (the supremum defines an upper bound for the possible global optimum).
[s*] = BranchAndBound(P_y, c)
  // InitPhase
  // q_i denotes the i-th component-wise maximum of the cost-function in y-space
  // n is the dimension
  B⁰ = □(P_y)
  // v is the barycentre between all q_i
  v = (1/n) Σ_{i=1}^{n} q_i
  // use seq. quadratic programming to find a local maximum
  s* = argmax c^T σ⁻¹(y) s.to y ∈ P_y, start at: v
  BoxList = [B⁰]
  // MainLoop
  do
    // to branch we use the box with the maximal supremum;
    // this box is divided, BoxList gets updated and s* is
    // updated, if necessary
    [BoxList, s*] = split(BoxList(1), P_y, c, s*)
    // delete all sub-boxes of the box-list with upper bound ≤ f_{s*}
    BoxList = prune(BoxList, s*)
  while ¬stoppingCriteriaFulfilled(up(BoxList(1)))
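A minimal Python sketch may make the bookkeeping of this loop concrete; it is an illustration, not the thesis implementation. It assumes a tansig transfer function (so σ⁻¹ = arctanh), boxes stored as (n, 2) interval arrays inside the restricted hypercube, and two hypothetical helpers: local_max (the SQP step on a facet) and intersects (the emptiness test against the polytope).

import heapq
import itertools
import numpy as np

def corner_bound(box, c):
    """Upper bound of c^T arctanh(y) over a box: arctanh is monotone,
    so the bound is attained at the corner selected by the signs of c.
    Boxes are assumed to lie inside [-0.98, 0.98] per coordinate."""
    corner = np.where(c >= 0, box[:, 1], box[:, 0])
    return float(c @ np.arctanh(corner))

def branch_and_bound(box0, c, local_max, intersects, tol=1e-3):
    """Sketch of max c^T arctanh(y) s.to y in P, approximated from outside.
    local_max(box) -> best cost value found on a facet of P inside box;
    intersects(box) -> True if box meets P. Both helpers are assumptions."""
    best = local_max(box0)                        # best current surface solution
    tie = itertools.count()                       # tie-breaker for the heap
    heap = [(-corner_bound(box0, c), next(tie), box0)]
    while heap:
        neg_ub, _, box = heapq.heappop(heap)
        if -neg_ub - best < tol:                  # supremum close to best: stop
            break
        k = int(np.argmax(box[:, 1] - box[:, 0]))  # split along the longest side
        mid = 0.5 * (box[k, 0] + box[k, 1])
        for lo, hi in ((box[k, 0], mid), (mid, box[k, 1])):
            sub = box.copy()
            sub[k] = (lo, hi)
            if not intersects(sub):               # prune: empty intersection
                continue
            ub = corner_bound(sub, c)
            if ub <= best:                        # prune: bound below best solution
                continue
            best = max(best, local_max(sub))
            heapq.heappush(heap, (-ub, next(tie), sub))
    return best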
This method ensures that we compute a wrapping polyhedron which contains the nonlinear region Q. However, tests showed that the time complexity of the method does not scale well with increasing dimension (see also the experimental results in Section 4.3). Furthermore, the method has one disadvantage: for approximately flat, nearly linear surfaces, large numbers of boxes with similar upper values for the cost-function are obtained. In these cases the method is very slow and not useful. The method finally used for the VPA implementation is the Binary Search Approach (BSA). This method guarantees that the wrapping polyhedron contains Q, and experiments indicate that the method scales well to higher dimensions.
4.2.4 Binary Search Approach
We use a binary search method to find for a hyperplane H a position close to the region Q = {x | σ(x) ∈ P_y}, such that Q ⊆ H⁻. H is directed by a given vector c. Initially H is positioned at the midpoint between a point, named q_x, on the manifold of Q and the vertex v_x = up(□(Q)), i.e. the corner of the wrapping box with the maximum value for the cost-function c^T x subject to x ∈ B⁰_x = □(Q). The hyperplane is moved closer towards Q if Q ∩ H⁺ is empty. To determine whether Q ∩ H⁺ is empty, a box-refinement process is applied.

The refinement process is performed by a refinement step in x- and y-space and by forward- and backward-propagating those refined boxes. The refinement for the (i+1)-th iteration in x-space is calculated by B^{i+1}_x = □(B^i_x ∩ H⁺) and in y-space by B^{i+1}_y = □(B^i_y ∩ P_y). The forward-propagation of a box from x-space to y-space is a component-wise application of the sigmoid function to the box; similarly, the backward-propagation of a box from y-space to x-space is a component-wise application of the inverse sigmoid function to the box. We will say we send a box from x-space to y-space and vice versa. Due to the decreasing volume in the sequence of boxes the convergence is guaranteed, and if, after the i-th iteration, a box is empty, then there is no intersection between Q and H⁺. We demonstrate the algorithm with a simple example for the two dimensional case (i.e. two neurons), before describing further details, such as the rate of convergence, and proving some key observations.
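Because σ acts component-wise and monotonically, sending a box between the two spaces only requires transforming its interval end-points. A small sketch, assuming a tansig transfer function (σ = tanh, σ⁻¹ = arctanh) and boxes stored as (n, 2) interval arrays:

import numpy as np

def send_forward(box_x):
    """x-space -> y-space: component-wise sigmoid of the end-points."""
    return np.tanh(box_x)

def send_backward(box_y):
    """y-space -> x-space: component-wise inverse sigmoid of the end-points."""
    return np.arctanh(box_y)

# a box is an (n, 2) array of per-coordinate [low, high] intervals
box_x = np.array([[-1.0, 0.5], [-3.0, 1.0]])
box_y = send_forward(box_x)                 # still a box: tanh is monotone
assert np.allclose(send_backward(box_y), box_x)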
[Figure 4.4, panels plotted in y-space (axes y1, y2) and x-space (axes x1, x2).]

(a) Start. Polyhedron P_y in y-space and the region Q in x-space. Arrows on the manifold of Q are gradient vectors at the corresponding points, plotted as circles.

(b) Initial step. Calculating the wrapping hypercubes B⁰_x and B⁰_y. The hyperplane with direction vector c will be positioned between the two points q_x (right circle) and v_x (left circle).

(c) Hyperplane insertion and intersection detection. The binary search strategy first inserts a hyperplane which cuts the line segment between q_x and v_x in the middle. We have to detect if this hyperplane H intersects the region Q. Therefore we start the box-refinement process by calculating B¹_x = □(B⁰_x ∩ H⁺) and send the box to the y-space, i.e. B¹_y = σ(B¹_x).

(d) Intersection detection through refinement of boxes. The intersection of the polyhedron P_y and the box B¹_y results in the refined box B²_y = □(B¹_y ∩ P_y).

(e) Further refinement. Calculating the new box B^{i+1}_x = □(B^i_x ∩ H⁺) and sending it again to the y-space results in smaller boxes.

(f) End of the refinement process. The volume of the refined box B_x is less than a small value ε. This means there is an intersection between the box B_x and the region Q within a small ε-neighbourhood, i.e. we can terminate and use this hyperplane as approximation.

(g) Insertion of the next hyperplane H. We computed the box B¹_x = □(B⁰_x ∩ H⁺) and computed B¹_y = σ(B¹_x). It is easy to see that B¹_y ∩ P_y is empty. This means we can move the hyperplane closer to Q.

(h) Moving the hyperplane closer to the region. At this state Δλ = 0.25 and λ = 0.75 (we use λ to define the position of H and Δλ to express the step-size; this will become clear later in the description of the algorithm). As before (in Figure 4.4 (g)) the intersection of B¹_y = σ(B¹_x) and the polyhedron P_y is empty.

(i) Moving closer to Q and intersection detection process. The intersection of B¹_y = σ(B¹_x) and the polyhedron P_y is not empty.

(j) Refinement of the box and detection of no intersection. In y-space B²_y = □(B¹_y ∩ P_y). After calculating the new box in x-space, i.e. B²_x = σ⁻¹(B²_y), there is no intersection of the inserted hyperplane H and the region Q (because B²_x is completely in H⁻). This example demonstrates again the relevance of the refinement process and shows that sometimes more iterations are needed before deciding if there is an intersection between H⁺ and Q.

(k) Moving the hyperplane closer to the manifold and refinement of boxes. This figure shows the refined box in y-space B²_y = □(B¹_y ∩ P_y) and the corresponding box B²_x = σ⁻¹(B²_y) in x-space.

(l) Detection of no intersection. Calculating B³_x = B²_x ∩ H⁺ results in the refined box in x-space. Sending this box to y-space, i.e. calculating B³_y = σ(B³_x), leads to the small box in y-space. The intersection between B³_y and the polyhedron P_y is empty. This means there is still no intersection between H and the region Q in x-space. The algorithm terminates as Δλ < ε, i.e. the distance between two consecutive hyperplane positions is less than a small value ε.

(m) The state of the wrapping polyhedron after approximating the first two facets. Compared to Figure 4.4 (b) a much better approximation is obtained.

Figure 4.4: Two dimensional example for the binary search method.
In this section we enumerate a few properties about functions which are useful to prove a key observation (Lemma 4.3) about the relation between a box B_x in x-space which intersects the region Q, the corresponding box B_y, and the polyhedron P_y in y-space. Lemma 4.4 provides a proof which guarantees a decreasing sequence of boxes (with respect to the volume). In the following σ is a function from X to Y.

Lemma 4.1 For any function σ and any regions A, B ⊆ X: σ(A ∩ B) ⊆ σ(A) ∩ σ(B).
Proof: σ(A ∩ B) ⊆ σ(A) and σ(A ∩ B) ⊆ σ(B). Hence: σ(A ∩ B) ⊆ σ(A) ∩ σ(B). □

Lemma 4.2 If σ is an injection then: σ(A ∩ B) = σ(A) ∩ σ(B).
Proof: Using Lemma 4.1 it is sufficient to show the reverse inclusion ⊇, i.e. σ(A) ∩ σ(B) ⊆ σ(A ∩ B). Let y ∈ σ(A) ∩ σ(B). Therefore there exist x_A ∈ A and x_B ∈ B such that y = σ(x_A) = σ(x_B). As σ is an injection, we have x_A = x_B and x_A ∈ A ∩ B. Therefore y ∈ σ(A ∩ B). □

Lemma 4.3 B_x ∩ σ⁻¹(P_y) ≠ ∅ iff σ(B_x) ∩ P_y ≠ ∅.
Proof: Lemma 4.2 implies σ(B_x ∩ σ⁻¹(P_y)) = σ(B_x) ∩ P_y. Therefore: B_x ∩ σ⁻¹(P_y) ≠ ∅ iff σ(B_x) ∩ P_y ≠ ∅. □

Lemma 4.4 □(B ∩ Q) ⊆ B, where B is an axis-parallel hypercube and Q is a region.
Proof: B ∩ Q ⊆ B. Therefore: □(B ∩ Q) ⊆ □(B) = B. □
The following nomenclature is used in describing the algorithm:

up(B)  ... the upper corner of box B, i.e. the upper bound for the cost-function subject to B. For the opposite lower corner the notation low(B) is used.
v_x    ... maximum of the cost-function in x-space (subject to B⁰_x).
q_y    ... point in y-space which is the solution point of the following linear optimization problem: argmax c^T y subject to y ∈ P_y.
q_x    ... corresponding point to q_y in x-space, i.e. q_x = σ⁻¹(q_y). In the linear case this would already define the optimal position for H.
h      ... point between v_x and q_x.
λ      ... value between (0,1). λ defines the position of h on the line between v_x and q_x. Δλ expresses the rate of change between two consecutive values for λ.
vol(B) ... volume of a box. Two consecutive boxes are considered equal if the volume of their difference is negligible, that is: vol(B^i_x \ B^{i+1}_x) < ε.
ε      ... a small positive real number.
Given an optimization direction c, we face the following non-linear optimization problem in the x-space:

max c^T x subject to σ(x) ∈ P_y

The solution of this optimization problem defines the optimal position for a hyperplane H = {x | c^T x = δ}, such that Q ⊆ H⁻. In general we cannot solve the above optimization problem exactly; instead we use a binary search algorithm to find a close approximation. Geometrically we want to move the hyperplane H close to Q and ensure that Q ⊆ H⁻. Analytically this results in an adjustment of δ such that δ is close to the maximum value of the optimization problem. We move the hyperplane, using a binary search strategy, along the line defined through the point v_x and the point q_x, i.e. between the upper corner point of the wrapping hypercube and a point on the manifold. Within every iteration we have to test if there is an intersection between Q and H⁺. The test for an intersection is realized through a refinement process. Boxes are refined iteratively by using the following observations:

1. As boxes are closed under component-wise transformations, it follows that: if B_x is a box then B_y = σ(B_x) is a box, and if B_y is a box then B_x = σ⁻¹(B_y) is a box.

2. Since B^{i+1}_y = □(B^i_y ∩ P_y) ⊆ B^i_y and B^{i+1}_x = □(B^i_x ∩ H⁺) ⊆ B^i_x (see Lemma 4.4), we have a decreasing sequence of boxes.

The result of this refinement process, together with Lemma 4.3, is used to decide whether to move H closer towards Q. The algorithm is presented in a top-down manner. The initial phase calculates the wrapping boxes B_x and B_y, by using linear programming techniques, and defines the points v_x and q_x. Within a do-while loop the value of δ is adjusted using a binary search strategy and the refine function is called. Once the refinement process has finished we have to decide whether to:

- Terminate the algorithm,
  - if there is an intersection within an ε-neighbourhood or:
  - if Δλ < ε (i.e. the distance between two consecutive hyperplanes is less than ε).
- Move up (increase λ), i.e. we move the hyperplane closer to the manifold (using half of the distance of the previous two points).
- Move down (decrease λ), i.e. we move the hyperplane back towards the hypercube corner (again using half of the distance of the previous two points).
[δ] = BinarySearch(P_y, c)
  // InitPhase
  B⁰_y = □(P_y); B⁰_x = □(σ⁻¹(B⁰_y))
  v_x = argmax c^T x s.to x ∈ B⁰_x
  q_y = argmax c^T y s.to y ∈ P_y
  q_x = σ⁻¹(q_y)
  λ = 0.5; Δλ = 0.5
  // MainLoop
  do
    h = v_x + λ(q_x − v_x)
    // definition of the constant δ and therefore the position of H
    H = {x | c^T x = c^T h}; δ = c^T h
    [B_x, B_y] = refine(B⁰_x, B⁰_y, P_y, H)
    Δλ = 0.5 · Δλ
    if B_x = ∅ ∨ B_y = ∅
      λ = λ + Δλ    // no intersection: move closer to the manifold
    else
      λ = λ − Δλ    // intersection: move back towards the corner
  while ((|up(B_x) − low(B_x)| > ε) ∧ (Δλ > ε))
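The following Python fragment is a sketch of this main loop under the conventions above; the helper empty_intersection, which stands in for the refinement test described next, is an assumption:

import numpy as np

def binary_search_position(v_x, q_x, c, empty_intersection, eps=1e-4):
    """Bisect the hyperplane position between the box corner v_x and the
    manifold point q_x. empty_intersection(delta) reports whether
    Q ∩ H⁺ is empty for H = {x | c^T x = delta}."""
    lam, dlam = 0.5, 0.5
    while dlam > eps:
        h = v_x + lam * (q_x - v_x)       # current anchor point of H
        dlam *= 0.5
        if empty_intersection(float(c @ h)):
            lam += dlam                    # move up: closer to the manifold
        else:
            lam -= dlam                    # move down: back towards v_x
    return float(c @ (v_x + lam * (q_x - v_x)))   # final position delta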
Only the signs of the components of c are needed to maximize c^T x subject to B⁰_x: the maximizing corner takes the upper bound in each coordinate where c is positive and the lower bound otherwise. Hence, this computation is very fast, because no linear programming method needs to be applied.

The crucial part of the algorithm is to detect whether there exists an intersection between the hyperplane H and the region Q. This is done by sending boxes between the x- and y-space. At each iteration new boxes are calculated for the x- and y-space. We start with the box B¹_x = □(B⁰_x ∩ H⁺), i.e. B¹_x is the box of the polyhedron defined by the intersection of the wrapping hypercube B⁰_x and the half-space H⁺. If this box intersects the region Q, then the corresponding box B¹_y = σ(B¹_x) intersects P_y in y-space (see Lemma 4.3). In y-space it is necessary to determine the box of the intersection between B¹_y and P_y, i.e. B²_y = □(B¹_y ∩ P_y). B²_y is either (see Lemma 4.4):

1. An empty box, i.e. B¹_y ∩ P_y = ∅.
2. An unchanged box, i.e. □(B¹_y ∩ P_y) = B¹_y.
3. A refined box, i.e. □(B¹_y ∩ P_y) ⊊ B¹_y.

For case three (refined box B²_y) we have to calculate B²_x, i.e. B²_x = σ⁻¹(B²_y). In x-space the new box B³_x = □(B²_x ∩ H⁺) is computed (again the three cases are possible). This process is repeated until we know that:

- There is no intersection (that is, a box is empty).
- A box does not change anymore, i.e. we know there has to be an intersection.
- The distance between up(B_x) and low(B_x) is less than ε, i.e. we intersect within a small ε-environment.
[B_x, B_y] = refine(B⁰_x, B⁰_y, P_y, H)
  i = 0
  B^{i+1}_x = □(B^i_x ∩ H⁺)
  do
    // forward part, i.e. x → y
    i = i + 1
    B^i_y = σ(B^i_x)
    B^{i+1}_y = □(B^i_y ∩ P_y)
    // backward part, i.e. y → x
    i = i + 1
    B^i_x = σ⁻¹(B^i_y)
    B^{i+1}_x = □(B^i_x ∩ H⁺)
  while (vol(B^{i+1}_x) ≠ 0 ∧ vol(B^i_x \ B^{i+1}_x) > ε ∧ (|up(B^{i+1}_x) − low(B^{i+1}_x)| > ε))
  B_x = B^{i+1}_x
  B_y = B^i_y
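A compact Python rendering of this refinement loop, again assuming a tansig transfer function; the helpers clip_H (computing □(B ∩ H⁺)) and clip_P (computing □(B ∩ P_y)) are hypothetical names for the box-of-intersection routines, each of which amounts to a few linear programs:

import numpy as np

EPS = 1e-6

def refine(box_x, clip_H, clip_P):
    """Alternate x- and y-space refinements until a box empties, stalls,
    or shrinks below EPS. clip_H/clip_P return None for an empty
    intersection. Boxes stay inside (-1, 1), so arctanh is finite."""
    box_x = clip_H(box_x)
    while box_x is not None:
        box_y = clip_P(np.tanh(box_x))        # forward part, x -> y
        if box_y is None:
            return None, None                 # a box emptied: no intersection
        new_x = clip_H(np.arctanh(box_y))     # backward part, y -> x
        if new_x is None:
            return None, None
        new_w = new_x[:, 1] - new_x[:, 0]
        old_w = box_x[:, 1] - box_x[:, 0]
        if new_w.max() < EPS or (old_w - new_w).max() < EPS:
            return new_x, box_y               # eps-small or stationary box:
        box_x = new_x                         # there is an intersection
    return None, None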
Improvement of the Original Binary Search Approach

The binary search method as introduced can be improved. For example, after the first forward step it is sufficient to consider the smaller polyhedron P¹_y = B¹_y ∩ P_y. This is valid because an intersection between H⁺ and Q is only possible within the region σ⁻¹(B¹_y). This modifies the algorithm by restarting the search for the correct position of the hyperplane H on this subproblem. This would result in a faster algorithm compared to the basic method, because the considered polyhedron P^i_y is smaller and generally contains fewer inequalities than the original polyhedron P_y. This simplifies the operation of intersecting the polyhedron with the corresponding box. However, we have not yet implemented the modified binary search method. In future versions an implementation and a comparison to the basic algorithm will be provided.
4.3 Complexity Analysis of the Branch and Bound and the
Binary Search Method
We tested the new algorithms for the approximation process of a single hyperplane with a randomly chosen direction c. Thus far tests have covered polyhedra of dimension 3 up to dimension 9, corresponding to a neural network with 3 up to 9 neurons in a hidden layer. For a given number of randomly chosen directions (between 1 and 10) we inserted hyperplanes within the restricting hypercube, such that the polytope was non-empty by construction. For each dimension, and each possible number of additional hyperplanes, 10 different polyhedra were used, overall testing 100 polyhedra for each dimension. The same experimental set-up as for the branch and bound method was used to test the binary search approach. Table 4.1 provides the 95% confidence interval for the time-complexity of the binary search and branch and bound technique.
Dimension Branch and Bound (in seconds) Binary Search (in seconds)
3 [24.137,30.854] [6.839,12.070]
4 [58.927,81.921] [8.542,20.131]
5 [98.885,151.425] [12.440,26.774]
6 [150.867,243.428] [7.437,18.529]
7 [277.533,464.628] [8.946,15.465]
8 [569.874,1195.381] [13.527,26.355]
9 [945.266,2147.457] [14.343,21.431]
Table 4.1: Comparison of branch and bound and binary search.
The above results indicate that the binary search strategy outperformed the branch and bound approach in our experiments. Important remarks:

- To avoid numerical difficulties, experiments were restricted to polyhedra contained in the interval [−0.98, 0.98].
- An additional stopping criterion was used for the binary search method: once a solution was better than or similar to the corresponding branch and bound result, the method stopped.
- We also detected examples of very slow convergence with the binary search method. These cases still have to be investigated.
[Four plots over dimensions 3 to 9: (a) Binary Search: number of outer loops. (b) Binary Search: number of inner loops. (c) Binary Search: average time in s. (d) Comparison of time complexity; the solid line shows the branch and bound approach, the dotted line binary search, on a logarithmic time axis.]
4.4 Summary of this Chapter
Not finding a satisfactory solution with the technique of piece-wise linear approximation of the sigmoidal function, we developed an alternative approach which directly approximates the non-linear region Q from outside. The initial two methods, Sequential Quadratic Programming (SQP) and the Maximum Slice Approach (MSA), aimed to find the global optimum of the corresponding nonlinear optimization problem. Neither method could guarantee finding the global optimum. Thereafter, strategies were developed to compute an approximation of the global optimum from outside, such that the requirement to compute a wrapping polyhedron could be fulfilled. Experiments indicated that the binary search approach is a suitable method.
Contributions - Chapter 4 -
- In our investigations we could not detect any suitable methods relying on piece-wise linear approximations of the sigmoidal function for the approximation of the nonlinear region Q. The reasons are: either the method does not scale well (for axis-parallel splits), or a simple and computationally cheap method for non axis-parallel splits has not been found yet.

- This led to the idea of computing a wrapping polyhedron P_x, which contains the nonlinear region Q. The problem of computing P_x reduces to a nonlinear optimization problem.

- Neither the SQP approach nor the MSA technique can guarantee to always find the global optimum on the manifold of Q.

- Development of a branch and bound and a binary search technique, which approximate a solution for the global optimum from outside. These methods fulfill the requirement of an outside approximation for the nonlinear region Q.

- Experiments indicated that the binary search method is, in the average case, a better choice than the developed branch and bound approach.
Chapter 5
Affine Transformation Phase
This chapter first gives a short introduction to the problem of computing the image or reciprocal image of a polyhedron under an affine transformation. In Section 5.2 a solution for the backward phase is provided. The computation of the image of a polyhedron under an affine transformation is explained in Section 5.3.

Section 5.4 discusses the important concept of projecting a polyhedron onto a subspace and provides a solution which computes an approximation of the projected polyhedron in polynomial time complexity. The subsequent section summarizes further approaches to approximate the image of a polyhedron under an affine transformation with a non-invertible transformation matrix. The final section provides a short summary of this chapter.
5.1 Introduction to the Problem
The computation of the image and the reciprocal image of a polyhedron under an affine transformation is illustrated in Figure 5.1. As polyhedra are closed under affine transformations, the image or reciprocal image of a polyhedron under an affine transformation is another polyhedron. For the backward computation we have to compute for a given polyhedron P_y the polyhedron Γ⁻¹(P_y) = {x | Γ(x) ∈ P_y}, and in the forward propagation we compute for a given polyhedron P_x the image Γ(P_x), where Γ(x) = Wx + θ. Note that Γ is not necessarily invertible.
Figure 5.1: The forward and backward propagation of a polyhedron through the weight layer Γ: x ↦ y = Wx + θ. The polyhedron P_y = {y | Ay ≤ b} lies in the net input space of layer S, which we refer to as y-space; its reciprocal image Γ⁻¹(P_y) = {x | AWx ≤ b − Aθ}, the net output of layer P, lies in what we call x-space. Conversely, the image of a polyhedron P_x is Γ(P_x).
5.2 Backward Propagation Phase

In the backward phase the reciprocal image of a polyhedron P_y under an affine transformation y = Γ(x) = Wx + θ has to be computed. As already described in [Mai98], a vector x belongs to Γ⁻¹(P_y) if and only if:

A(Wx + θ) ≤ b    (1)

The reciprocal image is a polyhedron and defined as follows:

Γ⁻¹(P_y) = {x | AWx ≤ b − Aθ}
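In matrix terms the backward step is a single multiplication. A minimal numpy sketch; the example data are hypothetical:

import numpy as np

def backward_polyhedron(A, b, W, theta):
    """Gamma^{-1}({y | A y <= b}) under y = W x + theta is
    {x | (A W) x <= b - A theta}; returned as the new (A', b')."""
    return A @ W, b - A @ theta

# example: pull a unit box |y_i| <= 1 back through a 2x2 weight layer
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
W = np.array([[1.0, 0.5], [0.0, 2.0]])
theta = np.array([0.1, -0.2])
A_x, b_x = backward_polyhedron(A, b, W, theta)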
We have to remove redundant inequalities in order to reduce the computational effort of solving a mathematical programming problem and to keep a compact description. Furthermore, it is very inefficient to backpropagate an increasing number of redundant inequalities, and for the polyhedral description of the input and output space a non-redundant description is necessary.

Linear programming techniques can be used to remove redundant inequalities in (1). Redundant inequalities of a polyhedron are inequalities which do not define a facet of the polyhedron. Hence, these inequalities are not relevant for the description of the set of points enclosed by the polyhedron.
Removing redundant inequalities

Given a polyhedron P = {x | Ax ≤ b} with m inequalities, the problem of removing redundant inequalities is to obtain an irredundant polyhedron P′ = {x | A′x ≤ b′} with a minimal number of inequalities such that P′ = P. In other words, the inequalities of the polyhedron P′ define facets. The i-th inequality is redundant if and only if the polyhedron

P_i = {x | A(j,:)x ≤ b(j) for all j ≠ i}

is equal to P. Equivalently we can write:

for all x ∈ P_i: A(i,:)x ≤ b(i)

This can be tested by solving the following linear optimization problem:

max A(i,:)x s. to x ∈ P_i

The inequality is irredundant if there is an x* ∈ P_i such that A(i,:)x* > b(i). A test for all inequalities leads to an irredundant description. We refer to this strategy as the initial strategy. A more efficient method to remove redundant constraints is described in the paper by Caron, McDonald and Ponic [CMP89].
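Before turning to that method, note that the initial strategy is straightforward to sketch with an off-the-shelf LP solver. The following is an illustration, not the thesis implementation, and assumes a bounded polyhedron so that every LP has a finite optimum:

import numpy as np
from scipy.optimize import linprog

def remove_redundant(A, b, tol=1e-9):
    """Initial strategy: drop inequality i if max A[i]x over the remaining
    system does not exceed b[i]."""
    keep = list(range(len(b)))
    i = 0
    while i < len(keep):
        rows = keep[:i] + keep[i+1:]            # system without inequality keep[i]
        res = linprog(-A[keep[i]],              # maximize A[i] x
                      A_ub=A[rows], b_ub=b[rows],
                      bounds=[(None, None)] * A.shape[1])
        if res.status == 0 and -res.fun <= b[keep[i]] + tol:
            keep.pop(i)                         # redundant: facet never exposed
        else:
            i += 1                              # irredundant (or unbounded): keep
    return A[keep], b[keep]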
This method is more efficient because it incorporates several rules which can be applied to decide immediately if an inequality is redundant or necessary. In the sequel we describe important corollaries and theorems useful for immediately removing redundant inequalities and for determining if an inequality is necessary [CMP89].

Let P have m constraints and define the index set I = {1, ..., m}. Let x* ∈ P. We refer to act(x*) as the set of active indices at x*. A constraint is called active at a point if it is an equality at this point. The number of active constraints at x* is k.

Corollary 5.1
Let x* ∈ P. If the gradients of the constraints with indices act(x*) are linearly independent, then all such constraints are necessary. Proof: see [CMP89]. □

Corollary 5.2
The i-th inequality is redundant if the system of equations Σ_{j ∈ act(x*), j ≠ i} u_j A(j,:)^T = A(i,:)^T has a solution such that u ≥ 0. The geometrical interpretation of this corollary is that the gradient A(i,:)^T is in the cone generated by the other active constraints (see also Farkas Lemma). □

Corollary 5.3
Let x̃ ∈ P and let d ≠ 0 be an arbitrary vector. For each i ∈ I we define:

τ_i = +∞, if A(i,:)d ≤ 0
τ_i = (b(i) − A(i,:)x̃) / (A(i,:)d), otherwise

τ = min{τ_i | i ∈ I}

If τ is attained by a unique index j, then the j-th constraint is necessary. For a proof see [CMP89] or [Bon83]. □

The degenerate extreme point strategy [CMP89] integrates the above corollaries, but basically relies on the following theorem and corollary.
Theorem 5.1
Let i ∈ act(x*). The i-th inequality is redundant if and only if x* is an optimal solution to the following linear optimization problem:

max A(i,:)x subject to: x ∈ P_i(x*) = {x | A(j,:)x ≤ b(j), j ∈ act(x*), j ≠ i}

Proof: see [CMP89]. □

It is important to note that the above theorem is essentially the initial strategy applied to a smaller linear optimization problem. The LP is smaller in the sense that the number of constraints is reduced from m to |act(x*)|.
Corollary 5.4
Let x* be an extreme point of the polyhedron P with |act(x*)| = k > n. Let i ∈ act(x*). Consider

Σ_{j ∈ act(x*), j ≠ i} u_j A(j,:) = A(i,:)    (2)

Then we can derive the following statements (for a proof see [CMP89]):

(a) The i-th inequality is redundant and all other inequalities are necessary if there is a solution to (2) such that u_j ≥ 0, j ∈ act(x*) \ {i}.

(b) If (2) has a solution such that for some l ∈ act(x*) \ {i}, u_l > 0 and u_j ≤ 0 for all j ∈ act(x*) \ {i, l}, then constraint l is redundant and the constraints j ∈ act(x*) \ {l} are necessary.

(c) If neither (a) nor (b) is satisfied, then all constraints are necessary. □

Our implementation to remove redundant inequalities relies on the initial strategy. In a future version we plan to integrate the above mentioned corollaries and theorems to obtain a faster algorithm.
5.3 Forward Propagation Phase

For the forward propagation of a polyhedron through the linear weight layer of a neural network, the image of a polyhedron P = {x | Ax ≤ b} under an affine function y = Γ(x) = Wx + θ has to be calculated. A vector y belongs to the image of Γ if there exists a vector x such that:

y = Wx + θ, Ax ≤ b

With R = Γ(P) the image of the polyhedron P under the affine transformation Γ(x) = Wx + θ is denoted. If W is invertible, the computation of the image R = Γ(P) is trivial, because the problem is then reduced to computing the inverse W⁻¹ and R = {y | AW⁻¹(y − θ) ≤ b}. If W is not invertible, we compute the image R = Γ(P) by projecting P onto the subspace S, which is the subspace orthogonal to the kernel¹ of W, and applying the bijection Γ_S between S and im(W) to the projected polyhedron Q.

If W is the matrix of an injection then dim(ker(W)) = 0. In this case, for a given point y ∈ Γ(P_x), the point x ∈ P_x such that y = Γ(x) is unique.

¹For more details about the required linear algebra background the reader is referred to Appendix B of this thesis.
R = image(P, W, θ)
  // compute a basis for the kernel K and the subspace S
  K = ker(W); S = ker(K^T)
  if dim(ker(W)) = 0
    Q = P
  else
    // project P onto S
    Q = proj(P, S)
  // the restriction of W onto the subspace S
  W = W|_S
  let A = Q.A and b = Q.b
  if isSquareMatrix(W)
    R = {y | AW⁻¹y ≤ b + AW⁻¹θ}
  else
    // injection; W⁺ denotes the pseudo inverse
    R = {y | AW⁺y ≤ b + AW⁺θ and y ∈ im(W)}

It remains to explain how to project a polyhedron onto a lower-dimensional subspace.
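For the invertible case the computation is a direct rewrite of the inequality system. A small numpy sketch; the singular case, which needs the projection of Section 5.4, is deliberately omitted:

import numpy as np

def forward_polyhedron(A, b, W, theta):
    """Image of P = {x | A x <= b} under y = W x + theta for invertible W:
    Gamma(P) = {y | A W^{-1} y <= b + A W^{-1} theta}."""
    n = W.shape[0]
    if W.shape[1] != n or np.linalg.matrix_rank(W) < n:
        raise ValueError("W not invertible: project onto ker(W)^T first")
    W_inv = np.linalg.inv(W)
    return A @ W_inv, b + A @ W_inv @ theta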
5.4 Projection of a Polyhedron onto a Subspace

With Q the exact projection of the polyhedron P onto a lower dimensional subspace is denoted. We also use the expression true projected polyhedron. An approximation is referred to as Q̃.
The projection of a polyhedron onto a subspace is used in a number of different fields. Among others, polyhedral projection techniques are used in the following areas:

- Parallelizing compilers. Polyhedral projections are used in advanced compiler techniques, e.g. for loop parallelization or data dependency analysis of arrays. For example, the static parallelization of perfectly nested loops is supported by polytope models [Len93]. As explained in the PhD thesis "Parallelizing Compiler Techniques Based on Linear Inequalities" by Amarasinghe [Ama97], the iteration space of nested loops can be represented with a convex polyhedron. Furthermore he writes: "In our compiler algorithms, we use projection as one of the key transformations in manipulating systems of linear inequalities."

- Optimization. Several works on optimization use polyhedral projection techniques. For example, the work by Dyer and Megiddo [DM97] relies on projection techniques to solve a linear programming problem of fixed dimension.

- Neural network analysis. We need the computation of the projection of a polyhedron onto a subspace to forward-propagate a polyhedron through a neural network. Additionally, as we shall see in the next chapter, projection methods could be used to overcome numerical problems.
In the literature the Fourier-Motzkin method and the block elimination technique are well known polyhedral projection algorithms. However, according to the survey paper "Polyhedral Computation: a survey of projection methods" by Kaluzny [Kal02], currently no algorithm is available to compute the projection of a polyhedron in polynomial time. In the following we describe the Fourier-Motzkin approach and the block elimination technique. Both methods compute the exact projection. Next, we introduce our method to compute an approximation of the projected polyhedron. This algorithm has polynomial time complexity. For the implementation of VPA we use the Fourier-Motzkin algorithm if it is computationally feasible; otherwise the newly developed approximation method is applied. The problem of projecting a polyhedron onto a lower dimensional subspace is defined as follows:
Definition 5.1 Projection of a polyhedron P onto the subspace S
Let

P = {(z, x) ∈ R^k × R^m | Cz + Dx ≤ b},

where C is a [p, k] matrix, D is a [p, m] matrix and b is a vector with p rows. The numbers of columns k and m correspond to the dimension of the null-space and the image, respectively, i.e. k + m = n. The projection of P onto the subspace S = R^m is then given by:

Q = {x ∈ R^m | ∃z ∈ R^k: (z, x) ∈ P} □

This definition implies that we can view the projection of a polyhedron onto a lower-dimensional subspace as variable elimination for a system of linear inequalities. Within this thesis projection always means an orthogonal projection.
5.4.1 Fourier-Motzkin
Fourier-Motzkin is an algorithm that projects a polyhedron incrementally onto a lower
dimensional subspace. We define a hinge of a polyhedron as follows:
Definition 5.2 Hinge of a Polyhedron
A hinge of a polyhedron is defined as the intersection of two facets, i.e.

H = {x | A(i,:)x = b(i) and A(j,:)x = b(j)},

where A(i,:) and A(j,:) are linearly independent. □

Definition 5.3 Positive and negative facet
Let the vector u be orthogonal to the (n−1)-dimensional subspace S. A facet of the polyhedron is called positive iff the scalar product between its direction vector a and u is positive; otherwise the facet is called negative. □
Geometrically, a negative facet means that it is visible from direction u, whereas a positive facet is not.

Already in 1827, Fourier [Fou27] proposed a method to eliminate variables of a system of linear inequalities. The method did not become widely known (probably as the time complexity of the method is exponential) and was re-invented by several researchers, e.g. by Motzkin in 1936. The Fourier-Motzkin elimination iteratively removes k variables. Geometrically, this corresponds to an incremental projection of a polyhedron P in dimension R^n onto subspaces of dimension R^{n−1}, R^{n−2}, ..., R^{n−k}. Figure 5.2 represents the projection of a two-dimensional polyhedron onto a one-dimensional subspace. It already shows the main idea of the Fourier-Motzkin algorithm, namely the projection of a hinge defined by the intersection of positive and negative facets. In this example the projection corresponds to the removal of the variable x(2); in other words we consider the projection of the polyhedron P onto the subspace S = {x ∈ R² | x(2) = 0}.

In general the Fourier-Motzkin algorithm eliminates the k-th variable in a system of linear inequalities by keeping all inequalities with A(i,k) = 0 and combining inequalities with positive component in k with inequalities with negative component in k, such that the resulting inequality is zero in the k-th component. A single projection step is summarized in the following algorithm. To project onto an (n − k)-dimensional subspace, we repeatedly apply this projection.
Figure 5.2: The projection of the polyhedron P onto the subspace S = {x ∈ R² | x(2) = 0}. Positive facets with respect to the projection direction are plotted solid, negative facets dotted.
Q = fourierMotzkin(P, k)
  // elimination of the k-th variable in a system of linear inequalities
  // P = {(z, x) ∈ R^k × R^m | Cz + Dx ≤ b}
  I⁺ = {i | C(i,k) > 0}
  I⁻ = {i | C(i,k) < 0}
  I⁰ = {i | C(i,k) = 0}
  // C', D' and b' are the matrices and the vector for the projected polyhedron Q
  C' = []; D' = []; b' = []
  for each i ∈ I⁺, j ∈ I⁻
    C' = [C'; C(i,k)C(j,:) − C(j,k)C(i,:)]
    D' = [D'; C(i,k)D(j,:) − C(j,k)D(i,:)]
    b' = [b'; C(i,k)b(j) − C(j,k)b(i)]
  // add the inequalities of I⁰
  C' = [C'; C(I⁰,:)]
  D' = [D'; D(I⁰,:)]
  b' = [b'; b(I⁰)]
  // remove the k-th column of C'
  C' = C'(:, [1..k−1, k+1..k])
  Q = {(z, x) ∈ R^{k−1} × R^m | C'z + D'x ≤ b'}
  // remove redundant inequalities with a method as explained in Section 5.2
  Q = mkNonRedundant(Q)
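A single elimination step is easy to express in Python. The sketch below works on a plain system Ax ≤ b (eliminating column k) and leaves redundancy removal to a separate pass:

import numpy as np

def fourier_motzkin_step(A, b, k):
    """Eliminate variable k from {x | A x <= b}; returns the system of the
    projected polyhedron. Redundant rows are NOT removed here."""
    pos = [i for i in range(len(b)) if A[i, k] > 0]
    neg = [i for i in range(len(b)) if A[i, k] < 0]
    zero = [i for i in range(len(b)) if A[i, k] == 0]
    rows, rhs = [A[i] for i in zero], [b[i] for i in zero]
    for i in pos:
        for j in neg:
            # positive combination that cancels the k-th coefficient
            rows.append(A[i, k] * A[j] - A[j, k] * A[i])
            rhs.append(A[i, k] * b[j] - A[j, k] * b[i])
    A2, b2 = np.array(rows), np.array(rhs)
    return np.delete(A2, k, axis=1), b2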
Complexity Analysis

To project a polyhedron onto an (n − 1)-dimensional subspace, Fourier-Motzkin projects all combinations of hinges of two facets (one positive and one negative) onto the (n − 1)-dimensional subspace. In the average case the number of combinations for the projection from an n-dimensional space onto an (n − 1)-dimensional space is O(m²), where m is the number of facets. Let k be the number of variables we want to eliminate; then the complexity is O(m^{2^k}). Additionally, Fourier-Motzkin produces at each projection step a large number of redundant inequalities. Even with an immediate removal of the redundant inequalities the algorithm still scales exponentially. Moreover, as described before, the removal of redundant inequalities is not a cheap computation, because it requires solving several linear programming problems.
5.4.1.1 A Variation of Fourier-Motzkin
To simplify the notation we use a_i = A(i,:)^T and a_j = A(j,:)^T. Additionally, it is assumed that all row-vectors of the matrix A are normalised. We define the expression relevant hinge as follows:

Definition 5.4 Relevant Hinge
Let a_i and a_j be the direction vectors of two supporting hyperplanes of P defining a hinge H of the polyhedron P. The hinge is relevant iff dim(H ∩ P) = n − 2. □

One of the drawbacks of the Fourier-Motzkin algorithm is the generation of a large number of redundant inequalities. We developed a variation of the Fourier-Motzkin algorithm which first determines whether a hinge could be relevant before projecting it onto the lower-dimensional subspace. The computation of the dimension of H ∩ P is sufficient, because H ∩ P ⊆ H implies aff(H ∩ P) ⊆ aff(H) = H, and the hinge is relevant exactly when aff(H ∩ P) = H.

Calculation of relevant hinges

We have to test if the dimension of the hinge is n − 2, because two facets can intersect in a face of dimension n − 2, n − 3, ..., 0. Figure 5.3 depicts an example of relevant and irrelevant hinges in a two-dimensional scenario.
Figure 5.3: The hinge H between the facets defined by a_i and a_j is outside of P. Therefore H ∩ P is empty and hence the hinge is not relevant.
relevant = isRelevant(P, i, j)
  relevant = TRUE
  A = P.A; b = P.b
  a_i = A(i,:)^T; a_j = A(j,:)^T
  H = {x | Ax ≤ b ∧ a_i^T x = b(i) ∧ a_j^T x = b(j)}
  if dim(H) ≠ n − 2
    relevant = FALSE
  return relevant
It remains to determine the dimension of the hinge. The problem reduces to the calculation of the dimension of a polyhedron, because a hinge itself is a face of a polyhedron and a face is another polyhedron.

Calculation of the dimension of a polytope

Within this thesis we work with bounded polyhedra (polytopes). To calculate the dimension of a polytope P, the number of linearly independent row vectors of the matrix A is computed. For each linearly independent vector we minimize and maximize in direction a on the polyhedron P. If the minimum and maximum value are equal, then the polyhedron is contained in a hyperplane directed by a; otherwise the polyhedron spans in direction a. This process is repeated for all linearly independent vectors.

d = dimPolytope(P)
  // P = {x | Ax ≤ b}
  if P = ∅
    d = 0
  else
    d = 0
    V = getLinearIndependentRowVectors(A)
    // r is the number of linearly independent direction vectors
    for i = 1 to r
      a = V(i,:)^T
      f_min = min a^T x s.to x ∈ P
      f_max = max a^T x s.to x ∈ P
      // let ε be a small value
      if f_max − f_min > ε
        d = d + 1
  return d
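The same procedure can be sketched in Python with an LP solver; the greedy rank-based selection below is one possible realization of getLinearIndependentRowVectors:

import numpy as np
from scipy.optimize import linprog

def dim_polytope(A, b, eps=1e-8):
    """Dimension of the polytope {x | A x <= b}, following dimPolytope:
    for each linearly independent row direction a, solve min/max a^T x;
    the polytope spans that direction iff the two optima differ."""
    indep = []
    for a in A:                              # pick independent rows greedily
        if np.linalg.matrix_rank(np.array(indep + [a])) > len(indep):
            indep.append(a)
    d = 0
    bounds = [(None, None)] * A.shape[1]
    for a in indep:
        lo = linprog(a, A_ub=A, b_ub=b, bounds=bounds)
        hi = linprog(-a, A_ub=A, b_ub=b, bounds=bounds)
        if lo.status != 0 or hi.status != 0:
            return 0                         # empty (or unbounded) system
        if (-hi.fun) - lo.fun > eps:
            d += 1                           # polytope spans in direction a
    return d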
Complexity Analysis of the Variation of Fourier-Motzkin

The time complexity of our variation is similar to Fourier-Motzkin and scales exponentially. The difference between our variation and the original Fourier-Motzkin approach is that, by computing (possibly) relevant hinges first, we reduce the number of redundant inequalities. But this computation is essentially as expensive as first projecting all possible combinations and then removing redundant inequalities. Nevertheless, our variation includes interesting aspects, namely the definition of a relevant hinge for the projection and the computation of the dimension of a polytope. The definition of relevant facets is interesting because it would yield a significant improvement once a cheap formula to determine relevant facets becomes available.

As explained before, the incremental projection of a polyhedron results in an exponential time complexity. In the next section we describe a method, known in the literature as block elimination, which directly projects a polyhedron onto a lower dimensional subspace.
5.4.2 Block Elimination
One of the disadvantages of the Fourier-Motzkin method is the cost involved in the incremental projection. A direct projection could be cheaper. The direct projection can be viewed as the elimination of more than one variable at a time and is known as block elimination in the literature [Kal02]. One of the most recent algorithms is the Balas block elimination [Bal98].

In the following we recall some fundamental linear programming theorems and lemmas, including Farkas Lemma and an alternative of Farkas Lemma known as Gale's Theorem. Finally, a Projection Lemma can be deduced from Gale's Theorem.

Lemma 5.1 (Farkas Lemma)
Either: (i) ∃x ≥ 0: Ax = b
or: (ii) ∃y: A^T y ≥ 0, b^T y < 0
Proof:
In the literature several different proofs of Farkas Lemma are published. The first complete proof was published by Farkas in several papers, e.g. in [Far02]. Recently, a proof was published by Dax in the paper "An elementary Proof of Farkas' Lemma" [Dax97].

We only show that the two statements are exclusive. Assume (i) and (ii) hold. Then A^T y ≥ 0, b^T y < 0, Ax = b and x ≥ 0 have to be fulfilled. But this is a contradiction, since y^T Ax ≥ 0 if A^T y ≥ 0 and x ≥ 0, while y^T Ax = y^T b < 0. □

Geometrical Interpretation. The claim that Ax = b has a positive solution x implies that b ∈ cone(A), i.e. b is in the cone generated by the column vectors of A. In other words: b can be expressed as a positive combination of the column vectors of A. The alternative states that if b ∉ cone(A), then there is a y such that A^T y ≥ 0 and b^T y < 0, which implies that there is a hyperplane H = {v ∈ R^m | y^T v = 0} that separates cone(A) and b.
Theorem 5.2 (Gale's Theorem)
Either: ∃x: Ax ≤ b
or: ∃y ≥ 0: A^T y = 0, b^T y < 0

Proof: We want to find the alternative for ∃x: Ax ≤ b.
Write x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0 and introduce slacks s ≥ 0 so that Ax + s = b. Therefore we can write:

∃x⁺, x⁻, s ≥ 0: [A  −A  I_m](x⁺; x⁻; s) = b, where I_m is the [m, m] identity matrix.

With the notation Ã := [A  −A  I_m] and x̃ := (x⁺; x⁻; s) we can write:

∃x̃ ≥ 0: Ãx̃ = b. The application of Farkas Lemma leads to:

Either ∃x̃ ≥ 0: Ãx̃ = b, or ∃y: Ã^T y ≥ 0, b^T y < 0.

The alternative can be written as:

∃y: y^T[A  −A  I_m] ≥ 0, b^T y < 0 ⟺ ∃y ≥ 0: A^T y = 0, b^T y < 0. □
The following Projection Lemma can be deduced from Gale's Theorem:

Theorem 5.3 (Projection Lemma)
The projection of P onto the subspace R^m is given by:

Q = {x ∈ R^m | v^T Dx ≤ v^T b for all v ∈ extr(U)},

where extr(U) refers to the set of extreme rays of the projection cone

U = {v ∈ R^p | v^T C = 0, v ≥ 0}

Sketch of Proof
From Gale's Theorem it can be deduced:

∃z: Cz ≤ b′ ⟺ for all v ≥ 0 with v^T C = 0: v^T b′ ≥ 0.

Therefore:

∃z: Cz ≤ b − Dx ⟺ for all v ≥ 0 with v^T C = 0: v^T(b − Dx) ≥ 0.

This leads to:

∃z: Cz + Dx ≤ b ⟺ for all v with v^T C = 0, v ≥ 0: v^T Dx ≤ v^T b. □

It remains to show that the inequalities of Q are in 1-1 correspondence with the extreme rays of U.

Sketch of Proof
Let halfspace(v) := {x ∈ R^m | v^T Dx ≤ v^T b}, v ∈ U. We claim that:

⋂_{v ∈ U} halfspace(v) = ⋂_{v ∈ extr(U)} halfspace(v)

Let {v₁, ..., v_r} be the extreme rays of U. Then

⋂_{v ∈ extr(U)} halfspace(v) = {x ∈ R^m | VDx ≤ Vb}, where V = (v₁, ..., v_r)^T.

Any arbitrary v ∈ U can be expressed as a positive combination of the extreme rays, i.e. v = Σ_{i=1}^{r} λ_i v_i with λ_i ≥ 0. Therefore our claim is:

{x ∈ R^m | λ^T VDx ≤ λ^T Vb for all λ ≥ 0} = {x ∈ R^m | VDx ≤ Vb}

This is true since λ^T VDx ≤ λ^T Vb for all λ ≥ 0 ⟺ VDx ≤ Vb. □
The projection lemma is often attributed to Černikov [Cer61]. The application of the projection lemma allows us to eliminate z in one step; in other words, we eliminate a block of k variables. It is well known that the above description of Q contains redundant inequalities. Balas [Bal98] observed that if the matrix D of the projection variables is nonsingular, then block elimination gives an irredundant representation of Q. In [Bal98] an algorithm is described for obtaining, for any polyhedron, an alternative description such that the matrix D is always nonsingular.

However, as stated in [Bal98], the disadvantage of this method is that the structure of P gets lost in the transformation process. Therefore, in some cases, it may be much more expensive to compute the extreme rays of U.

Given a facet description of a polyhedral cone, the dual representation as a set of extreme rays can result in a huge number of extreme rays (a combinatorial explosion, and hence an exponential time complexity) [Dut02]. Methods to compute the extreme rays are described in [FP96].
In the following we apply the block elimination to our problem of projecting the polyhedron P onto the subspace S = ker(W)⊥.

- The matrices K and E contain the basis vectors of orthonormal bases defining the subspaces ker(W) and S respectively. The basis B₁ = [K E]; the standard basis is named B₀ = I_n.

- x_{B₀} = B₀B₁x_{B₁} = B₁x_{B₁}

- x_{B₁} = x_K + x_S = (x₁, ..., x_k, 0, ..., 0)^T + (0, ..., 0, x_{k+1}, ..., x_{k+m})^T

- P = {x_{B₀} | Ax_{B₀} ≤ b} = {x_{B₁} | AB₁x_{B₁} ≤ b}

- P = {(x_K, x_S) ∈ R^k × R^m | A_K x_K + A_S x_S ≤ b}, where A_K is the [p, k] matrix with entries (A_K)_{ij} = a_i^T κ_j and A_S is the [p, m] matrix with entries (A_S)_{ij} = a_i^T e_j.

Given that the rows of A are normalised and K and E are orthonormal bases, the i-th row of A_K is a vector whose j-th entry expresses the cosine of the angle between a_i and κ_j. Similarly, the i-th row of A_S is a vector whose j-th entry expresses the cosine of the angle between a_i and e_j.

- Using the block elimination gives the following description for the projected polyhedron:

Q = {x_S ∈ R^m | (v^T A_S)x_S ≤ v^T b, for all v ∈ extr(U)},
U = {v ∈ R^p | v^T A_K = 0, v ≥ 0}

Therefore the computation of Q reduces to the computation of the extreme rays of the polyhedral cone U = {v ∈ R^p | v^T A_K = 0, v ≥ 0}. If A_K has a structure which allows a cheap computation of the extreme rays, then the block elimination technique can be used to compute Q. However, this is generally not the case and hence approximation techniques are used instead. Generally, the block elimination method as well as the Fourier-Motzkin approach scale with exponential time complexity [Kal02].
As explained by Kaluzny [Kal02], there is an interesting connection between the Fourier-Motzkin and the block elimination technique, which would be worthwhile to investigate further.
An important requirement for the VPA algorithm is to scale within polynomial time complexity. We developed two techniques to approximate the projected polyhedron Q which are guaranteed to scale with polynomial time complexity. These methods will be explained in the subsequent sections. Experimental results and a comparison between our approximation method and the Fourier-Motzkin algorithm conclude this section.
5.4.3 The S-Box Approximation

As motivated in Section 2.3, any approximation used for VPA has to guarantee that the true image is contained in the computed region.

To obtain a projection method scaling with polynomial time complexity, a polyhedron Q̃ which contains the projected polyhedron Q is computed. This method is based on the observation that we can easily compute a box containing Q. Additionally, we can refine the approximation by projecting faces of dimension n − (k + 1) onto the subspace S.

The projected polyhedron Q can be approximated with the wrapping box B_S = □(Q). Let S = span(e₁, e₂, ..., e_{n−k}), i.e. the e_i are the basis vectors of S. We compute B_S by optimizing in the positive and negative directions of all basis vectors of S, i.e. we solve 2(n − k) optimization problems of the form max ±e_i^T x s.to x ∈ P. The computed optimal points are denoted with x_i. We can refine the approximation by projecting the face defined by the intersection of k + 1 active facets of P at the points x_i. The projection of such a face is a hyperplane in S, denoted with H_i.

This results in a better refinement, because the approximation will then contain constraints of the true projected polyhedron Q. The S-Box approximation method has the following important properties:

- correct solution when projecting onto a one-dimensional subspace,
- B_S wraps Q,
- the (projected) points x_i are surface points of Q,
- there are faces of P at the points x_i which correspond to facets of the projected polyhedron Q.
In the following we summarize this initial algorithm, referred to as sboxProj.

Q̃ = sboxProj(P, K, E)
  // use linear programming to compute the wrapping box of Q wrt. the basis of S
  // the box is the initial approximation
  Q̃ = □(Q)
  // with L we denote the computed hyperplanes which build facets of Q
  L = []
  for i = 1 to 2(n − k)
    // to ensure polynomial time complexity the following loop is only executed if
    // the number of active facet combinations is less than a pre-defined upper bound
    for all active facet combinations F at point x_i
      // the matrix F contains the direction vectors of the active facets
      let F = [f₁, ..., f_{k+1}]
      H_i = projFace(F, E, x_i)
      if isFacetofQ(H_i)
        L = [L; H_i]
  Q̃ = Q̃ ∩ L
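The wrapping-box part of sboxProj reduces to 2(n − k) linear programs, one per signed basis direction of S. A Python sketch, under the assumption that E holds an orthonormal basis of S as columns:

import numpy as np
from scipy.optimize import linprog

def s_box(A, b, E):
    """Wrapping box of the projection of {x | A x <= b} onto the subspace
    spanned by the orthonormal columns of E: solve one min and one max LP
    per basis direction and keep the optimal values as box bounds."""
    bounds = [(None, None)] * A.shape[1]
    lows, highs = [], []
    for e in E.T:                              # each basis vector of S
        lo = linprog(e, A_ub=A, b_ub=b, bounds=bounds)
        hi = linprog(-e, A_ub=A, b_ub=b, bounds=bounds)
        lows.append(lo.fun)
        highs.append(-hi.fun)
    # box in the coordinates of S; the optimal points of these LPs are the
    # points x_i whose active facets seed the refinement step
    return np.column_stack([lows, highs])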
It remains to explain how to project a face of dimension n − (k + 1) onto a subspace S and how to determine if a projected face defines a facet of the true projected polyhedron Q. The next two subsections are devoted to these problems.
5.4.3.1 Projection of a face
To project a full-dimensional polyhedron directly onto an (n − k)-dimensional subspace, we have to consider the (n − (k + 1))-dimensional faces of P, which are defined by the intersection of k + 1 facets. Therefore the number of possible combinations is (p choose k+1). We want to project an (n − (k + 1))-dimensional face of the polyhedron P onto an m-dimensional subspace S, where m = n − k. The face is defined by the intersection of k + 1 facets of P. With x_i we denote a point on the face.

The basis of the (n − (k + 1))-dimensional face is orthogonal to the direction vectors of the k + 1 facets. Thus, a basis G for the face can be defined by G = ker(F), where F = [f₁, ..., f_{k+1}] and f_i is the direction vector of a facet of P. We project the basis vectors of G onto the m-dimensional subspace S and obtain the basis G_S (represented as a matrix), which in turn defines the basis of a hyperplane H in R^m. A direction vector a of H is computed as the kernel of G_S, i.e. a = ker(G_S). Of the two possible direction vectors (±a) for H, we select the one which agrees with the direction of the projected face. To determine the correct direction we project a direction vector f of one of the k + 1 facets of P onto S. The angle between a and f_S has to be less than 90°, i.e. a is computed by a = sign(⟨a, f_S⟩) · a. Finally, to determine the position of H, such that Q ⊆ H⁻, we project the point x_i onto S.
[a, δ] = projFace(F, S, x)
  // compute the basis G for a face defined
  // by the intersection of k + 1 facets of P
  G = ker(F)
  // project all basis vectors of the face onto S;
  // G_S are the projected vectors of G with respect to the standard basis
  G_S = proj(G, S)
  // project the point x onto S
  x_S = proj(x, S)
  // compute one vector orthogonal to the projected basis vectors
  a = ker(G_S)
  // determine the correct sign for a
  f = F(:,1)
  f_S = proj(f, S)
  a = sign(⟨a, f_S⟩) · a
  // it remains to define the correct position of the projected face
  δ = a^T x_S

The next section describes how to determine if a projected face builds a facet of the projected polyhedron.
5.4.3.2 Determination of Facets of Q

As explained in Figure 5.4, we have to decide which projected faces are facets of the (true) projected polyhedron Q. The following cases are possible for a projected face:

1. The projected face is a facet of Q.
2. The projected face is a hyperplane which cuts Q.
3. The projected face is redundant in Q.
Figure 5.4: The hyperplane H₂ cuts the projected polyhedron Q, the hyperplane H₁ defines a face of Q, and the hyperplane H₀ is redundant for the polyhedron Q.
Fourier-Motzkin projects a polyhedron incrementally onto a lower-dimensional subspace. This allows the classification into positive and negative facets. With a direct projection such a classification is not possible. Thus, for the direct projection of an (n − (k + 1))-dimensional face onto a lower-dimensional subspace, all possible (p choose k+1) combinations of k + 1 active facets at an extreme point v ∈ P are considered. This leads to the additional problem that projected faces, which build a hyperplane in the subspace S, can cut the true projected polyhedron Q. An example is illustrated in Figure 5.4.

Section 5.2 described how to detect and remove redundant inequalities. Hence, the algorithm to decide whether a projected face belongs to Q reduces to determining those hyperplanes which cut Q. The polyhedron Q itself is unknown; we only know P and the subspace S. The algorithm is based on the so-called "Straddie" Theorem².

We want to project a polyhedron P ⊆ R^n onto the subspace S = {x ∈ R^n | x_K = 0}. In the following theorem, H_v refers to a hyperplane through an extreme point v of the polyhedron P. The direction vector of H_v is defined by projecting the face, obtained by the intersection of k + 1 active facets of P at the point v, onto the subspace S. Hence, the direction vector a of H_v is characterized by a_K = 0, i.e. a ∈ S.

²I named this theorem after the place where I developed it: in a coffee-shop on beautiful North Stradbroke Island, which lies off Brisbane's coast and is called "Straddie" by the locals.
Theorem 5.4 (The Straddie Theorem)
The hyperplane H_v with direction vector a ∈ S cuts the projected polyhedron Q iff it cuts the original polyhedron P.

Sketch of Proof
We first show the direction: if H_v cuts Q then H_v cuts P.

If H_v cuts Q then Q ∩ H_v⁺ ≠ ∅ and Q ∩ H_v⁻ ≠ ∅. Hence there are points q ∈ Q and r ∈ Q such that a^T q > δ and a^T r < δ, where δ is the position of H_v. For q and r there exist points q̃ ∈ P and r̃ ∈ P such that the orthogonal projections of these points onto S are q and r respectively. Since the direction vector of H_v is characterized by a_K = 0, the dot products satisfy a^T q = a^T q̃ and a^T r = a^T r̃. Therefore H_v also cuts P.

The proof of the converse direction is analogous. □

This theorem allows us to determine if H_v cuts the (unknown) polyhedron Q by solving two linear optimization problems on the (known) polyhedron P. The algorithm isCutofQ is summarized below.
[iscut] = isCutofQ(c, P, b)
    iscut = FALSE
    ymax = max c·x  subject to  x ∈ P
    ymin = min c·x  subject to  x ∈ P
    if ymax > b and ymin < b
        iscut = TRUE
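A minimal Matlab sketch of this test, assuming the known polyhedron P is stored in inequality form A x ≤ d and using linprog from the Optimization Toolbox; the names follow the pseudocode above.

    function iscut = isCutofQ(c, A, d, b)
    % Tests whether the hyperplane {x | c'*x = b} cuts the polyhedron
    % P = {x | A*x <= d} by solving two linear programs on P.
    iscut = false;
    % maximise c'*x over P (linprog minimises, hence the sign change)
    [xopt1, fmin] = linprog(-c, A, d);
    ymax = -fmin;
    % minimise c'*x over P
    [xopt2, ymin] = linprog(c, A, d);
    if (ymax > b) && (ymin < b)
        iscut = true;
    end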
5.4.3.3 Further Improvements of the S-Box Method

The following list is a summary of ideas to improve the initial S-Box approximation.

- Use each non-null row vector of the constraint matrix of P as a starting vector to compute an orthonormal basis B. The approximation is then given by the intersection of up to one S-Box per row. The motivation for this strategy is that the basis B is chosen with respect to the structure of the polyhedron P.

- A possible heuristic for the selection of another basis is to select the one most different from the previous bases. The first basis vector is computed as the normalised sum of all previous first basis vectors. We use the QR algorithm to compute the remaining basis vectors for the new basis (see the sketch below). Theoretically, the polyhedron Q can be computed with a finite number of different wrapping boxes, i.e. Q equals the intersection of the wrapping boxes over these bases.
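A minimal Matlab sketch of how an orthonormal basis with a prescribed first vector could be obtained via the built-in qr factorisation; the function name is illustrative.

    function B = basisFromVector(a)
    % Returns an orthonormal basis B of R^n whose first column is a/||a||.
    a = a(:) / norm(a);        % normalise the starting vector
    [Q, R] = qr(a);            % full QR factorisation of the n-by-1 matrix [a]
    if R(1,1) < 0              % qr may flip the sign of the first column
        Q(:,1) = -Q(:,1);
    end
    B = Q;                     % columns form an orthonormal basis, B(:,1) = a/||a||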
5.4.4 Experiments

The next listing is an example for the projection of a five-dimensional polyhedron onto a two-dimensional subspace. We denote with Q the result computed with Fourier-Motzkin and with Q~ the computed approximation. The last column represents the constant vector b and the previous columns contain the values of the matrix A of each polyhedron.

Listing 5.1 Projection Example

 1   [ A_Q            b_Q   ]   [ A_Q~           b_Q~  ]
 2    0.5898  0.8075  6.6036    -0.9931  0.1170  4.1406
 3    0.5654  0.8248  6.7770    -0.9655 -0.2602  3.3250
 4   -0.7751  0.6319  6.9353     0.9810  0.1941  3.7192
 5   -0.6772  0.7358  7.6160     0.9228 -0.3853  3.1950
 6   -0.6992  0.7150  7.4840    -0.0157 -0.9999  3.1576
 7   -0.9931  0.1170  4.1406    -0.6772  0.7358  7.6160
 8   -0.9655 -0.2602  3.3250     0.0949  0.9955  8.7265
 9   -0.7662 -0.6426  3.4685
10   -0.8556  0.5176  6.2850
11    0.9303  0.3669  4.1926
12    0.9810  0.1941  3.7192
13   -0.0157 -0.9999  3.1576
14    0.9228 -0.3853  3.1950
15   -0.2334 -0.9724  3.1056
16    0.0949  0.9955  8.7265
17    0.1298  0.9915  8.6568
All computed inequalities of the approximation are also inequalities of the true projected polyhedron; e.g. compare line 2 and line 7 in the above listing. An output of the log-file shows that it took 600 seconds to compute Q, but only around 2.5 seconds to compute the approximation. Additionally, the approximation is good, as 92.76% of its volume consists of points of the true projection Q. The output of the log-file is as follows:

=================================================
Reference File: P5TO2No1
Input-Dimension: 5
Calculation Time in s Fourier-Motzkin: 600.19
Calculation Time in s S-Box Approximation: 2.57
=================================================
Volume Polyhedron: 6.8774
Volume Approximation: 7.4141
Ratio: 0.9276
==================================================
Dimension   Fourier-Motzkin (in sec.)   S-Box Approximation (in sec.)   Volume-Ratio
5 -> 4      50.4564                     11.8630                         0.8779
5 -> 3      2574.2155                   4.4634                          0.8879
5 -> 2      3746.8898                   1.5983                          0.9131

Table 5.1: Computation times for the projection of a polyhedron onto a lower-dimensional subspace, comparing Fourier-Motzkin and the S-Box approximation.
The figure below is a visualization of the polyhedron Q and the approximation Q~.

[Figure: Approximation and Projected Polyhedron]

Figure 5.5: The projected polyhedron Q is contained in the approximation.
Ten polyhedra in dimension five³ were randomly generated and projected onto a four-, three- and two-dimensional subspace. We compared the result of the approximation with the Fourier-Motzkin approach. The columns of Table 5.1 contain the average computation times (in seconds) for the corresponding method. The column Volume-Ratio is computed by dividing the volume of Q by the volume of Q~. This means that the closer this value is to one, the better the approximation.

³We have chosen dimension five as Fourier-Motzkin is very slow. Additionally, the exact volume computation also scales exponentially.
5.5 Further Considerations about the Approximation of the Image

For completeness we describe in this section further theoretical ideas, which were developed during the time of this PhD project. Due to time constraints, we did not explore these theoretical observations in more detail. Additionally, the S-Box approximation already provides a satisfactory solution for the approximation of the projected polyhedron Q.

The following important observation makes it possible to approximate Y: for every direction c,

max_{y ∈ Y} c·y = max_{x ∈ P} c·(W x + θ),

i.e. this observation allows us to approximate Y from outside. For an arbitrary directed hyperplane H, we can determine the correct position such that Y ⊆ H⁻.

An approximation of Y is obtained by solving m optimization problems of the form max_{x ∈ P} c_i·(W x + θ), where i ∈ {1,...,m} and m is the number of chosen directions c_i (one per inequality of the approximating description). This approximation would contain the projected polyhedron.

To obtain a better approximation these observations can be integrated into the S-Box approximation process, as both strategies approximate Y from outside.
5.6 Summary of this Chapter

We started this chapter with a solution to compute the reciprocal image of a polyhedron P under an affine transformation. As described, this problem can be solved by basic matrix operations, but requires the removal of redundant inequalities to keep a non-redundant and compact description of the polyhedron.

This was followed by a discussion of the computation of the image of a polyhedron under an affine transformation. Projection techniques are used to compute the image if the dimension of ker(W) is bigger than zero. The projection of a polyhedron onto a lower-dimensional subspace is itself an interesting research question. We developed an approximation of the projected polyhedron, which scales with polynomial time complexity. Experimental results indicated that the computed approximation is relatively good. The contributions of this chapter are summarized in the following box:
- Contributions Chapter 5 -

- Computation of the image of a polyhedron under an affine transformation by applying projection techniques, if necessary.

- Development of the S-Box approximation technique, which scales in polynomial time. For the development of this technique the following algorithms and ideas are required:

  – Projection of a face of a polyhedron onto a lower-dimensional subspace.

  – The Straddie Theorem to determine if a projected face is a facet of the true projected polyhedron.

- Approximation of the image of a polyhedron under a linear transformation when dim(ker(W)) > 0. We developed two strategies to approximate Y: firstly by applying L_S to the approximation Q~, and secondly by approximating Y directly by solving optimization problems of the form max_{x ∈ P} c_i·(W x + θ), where i ∈ {1,...,m} and m is the number of rows of the matrix A.
Chapter 6
Implementation Issues and
Numerical Problems
In the first part of this chapter the structure of a general software framework for region-
based refinement methods is described and it is explained how to use this framework
and how to “plug-in” different refinement methods, such as VIA and VPA.
The second part of this chapter is devoted to numerical problems which always occur
when implementing mathematical algorithms on digital machines with finite precision.
6.1 The Framework
The core of refinement-based neural network validation algorithms is to forward- and
backward propagate regions successively through all layers of a neural network.
This can be expressed in a loop which alternates between a forward and a backward propagation. The loop stops if a pre-defined number of iterations is reached, or if no significant refinement was computed. A prototype of the framework is implemented in Matlab [Mat00c]. For clarification the original source code is simplified. An important abstraction concept in Matlab is the function handle. A function handle can be viewed as a reference to the corresponding function. Hence, by using function handles in Matlab, general code can be written.
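A minimal illustration of this mechanism, using a built-in function only for demonstration:

    % A function handle is a reference to a function; the generic framework
    % code only sees the handle and does not need to know which concrete
    % forward/backward implementation it is calling.
    fh = @sin;                 % handle to the built-in sine function
    y1 = feval(fh, pi/2);      % returns 1, same as sin(pi/2)
    y2 = fh(pi/2);             % direct invocation through the handle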
To represent a neural network the Matlab neural network structure is used. We modified this structure by adding the sub-structure annotatedLayer, which contains the computed pre- and postconditions. The relevant attributes of the net structure and the sub-structure are as follows:
Listing 6.1 The important part of the net structure.
% input
inputs: {1x1 cell} of inputs, with
    net.inputs{1}.range defining the operating input range
outputs: {1x2 cell} containing 1 output, with
    net.outputs{1}.range defining the operating output range
% weight and bias values:
IW: {2x1 cell} containing 1 input weight matrix
LW: {2x2 cell} containing 1 layer weight matrix
b:  {2x1 cell} containing 2 bias vectors
% our modification
annotatedLayer: {#layers} with the attributes:
    preCondition
    postCondition
The following call-sequence diagram provides an overview of the structure of the
framework.
[Figure: call-sequence diagram showing mainFunction, refine, forwardStep, backwardStep, forwardLin, forwardNonLin, backwardLin and backwardNonLin exchanging the annotated network structure aNet]

Figure 6.1: Overview of the framework. The main function passes the function handles and the neural network structure to the refine function. The refine function calls the forwardStep and the backwardStep functions, which use the function handles. An annotated neural network is returned.
Listing 6.2 The refinement loop.
1  function [aNet]=refine(aNet,fhForLin,fhForNonLin,...
2                         fhBackLin,fhBackNonLin,fhVolume)
3  % initial stuff is here ...
4  while ( (noIterations < gv_MaxIteration) && ...
5          (abs(volPx0-volPx1) > eps | abs(volPy0-volPy1) > eps) )
6      if strcmp(alternationFlag,'forward')
7          % Forward Step
8          yRegion_prev = yRegion_next;
9          aNet = forwardStep(aNet,fhForLin,fhForNonLin);
10         yRegion_next = aNet.annotatedLayer{1}.postCondition;
11         alternationFlag = 'backward';
12         volPy0 = volPy1;
13         volPy1 = feval(fhVolume,yRegion_next);
14     else
15         % Backward Step
16         xRegion_prev = xRegion_next;
17         aNet = backwardStep(aNet,fhBackLin,fhBackNonLin);
18         xRegion_next = aNet.annotatedLayer{1}.postCondition;
19         volPx0 = volPx1;
20         volPx1 = feval(fhVolume,xRegion_next);
21     end
22     noIterations = noIterations + 1;
23 end
An important part is the computation of the volume (see line 13 and line 20 of Listing 6.2) of a region (also note that we use a function handle to call a volume function dependent on the region). If regions are hypercubes the volume computation is trivial. For the computation of the volume of a polyhedron the reader is referred to the work by Lasserre [Las83]. A computationally cheaper alternative to the volume computation is a comparison of the previously and newly computed sets of inequalities. This should be applied in higher-dimensional cases, because the exact volume computation of a polyhedron scales exponentially with the dimension of the polyhedron.
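A minimal sketch of a volume function for box regions that could be passed through the fhVolume handle; the representation of a box as rows [lower upper] is an assumption made only for this illustration.

    function v = boxVolume(B)
    % Volume of an axis-parallel box; row i of B holds [lower_i upper_i].
    % Such a function can be supplied to refine(...) via the fhVolume handle.
    edges = B(:,2) - B(:,1);
    v = prod(max(edges, 0));   % degenerate boxes get volume 0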
In an initial phase the neural network is annotated according to the operating input and output space by using the refinement process. Later the user can specify initial regions in the input and/or output space, and a correctly annotated version of the neural network, according to the user specification, is computed.

The refinement loop performs forward and backward steps until no significant refinements are observed. In the sequel, we describe the forward step; this is sufficient, as the backward-step implementation is analogous. A forward step is used to forward propagate a region, starting from the input layer, through all layers of a feed-forward neural network. This is implemented with a for-loop, which alternately computes the image of a polyhedron under the linear weight layer, followed by a computation of the image of a polyhedron under the non-linear layer.
During the execution of the loop the pre- and postconditions of each layer get updated. We implemented the update of the pre- and postconditions in a function named annotateNN. The input parameters to this function are the annotated neural network structure, the refined region, the layer, and a parameter indicating whether the pre- or postcondition is updated. The output is the updated annotated neural network structure; a sketch of such a function is given below.
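A minimal sketch of such an update function, consistent with the calls in Listing 6.3 but otherwise an assumption about the actual implementation:

    function aNet = annotateNN(aNet, region, layer, conditionType)
    % Stores a refined region as the pre- or postcondition of a layer.
    % conditionType is either 'preCondition' or 'postCondition'.
    if strcmp(conditionType, 'preCondition')
        aNet.annotatedLayer{layer}.preCondition = region;
    else
        aNet.annotatedLayer{layer}.postCondition = region;
    end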
Listing 6.3 forwardStep
function [aNet]=forwardStep(aNet,fhForLin,fhForNonLin)
for layer=1:aNet.numLayers
    if layer==1
        % Input Layer
        W = aNet.IW{1,1};
        theta = aNet.b{1};
    else
        W = aNet.LW{layer,layer-1};
        theta = aNet.b{layer};
    end
    [Ry] = feval(fhForLin,aNet,layer,NP,W,theta);
    % annotate the network with the refined region
    % obtained from the linear transformation
    aNet = annotateNN(aNet,Ry,layer,'preCondition');
    % non-linear phase
    [Ry] = feval(fhForNonLin,aNet,layer,NP);
    % update the postcondition of the layer
    aNet = annotateNN(aNet,Ry,layer,'postCondition');
end
For the computation of the forward (backward) propagation through a linear or
non-linear layer different algorithms can be used. It is easy to integrate different algo-
rithms due to the generality of the framework. For example, to use the VIA implemen-
tation the refine function is called with the following parameter settings.
Listing 6.4 mainVIA
% load the global parameters for via
gv_VIA;
fhForwardLin = @via_forwardLin;
fhForwardNonLin = @via_forwardNonLin;
fhBackwardLin = @via_backwardLin;
fhBackwardNonLin = @via_backwardNonLin;
Bx = net.inputs{1}.range;
% Example for an initial region of the output space for
% a single output node
By = [0 0.5-eps];
region.input = Bx;
region.output = By;
[aNet] = refine(net,region,fhForwardLin,fhForwardNonLin,...
    fhBackwardLin,fhBackwardNonLin);
To call the implementation of Validity Polyhedral Analysis (VPA), the following
settings are defined.
Listing 6.5 mainVPA
% load the global parameters for vpa
gv_VPA;
fhForwardLin = @vpa_forwardLin;
fhForwardNonLin = @vpa_forwardNonLin;
fhBackwardLin = @vpa_backwardLin;
fhBackwardNonLin = @vpa_backwardNonLin;
% Example: Px and Py are initialised with results
% obtained with VIA
inLayer = 1;
outLayer = 3;
Px = aNet.annotatedLayer{inLayer}.preCondition;
Py = aNet.annotatedLayer{outLayer}.postCondition;
region.input = Px;
region.output = Py;
[aNet] = refine(aNet,region,fhForwardLin,fhForwardNonLin,...
    fhBackwardLin,fhBackwardNonLin);
In the next section we discuss numerical aspects of the implementation.
6.2 Numerical Problems
Mathematical algorithms involving computations with real numbers require an analysis of numerical properties when implemented on a computer. Digital computers are finite machines, and hence the representation of real numbers is only an approximation. Additionally, there are only two ways to represent a function on a computer [Van83]:
- implementation by a table including the function values,
- approximation of the function with the basic operations a computer can perform: addition, subtraction, multiplication and division.
These aspects lead to the field of “numerical computations”, which deals with numeri-
cal issues when implementing a mathematical algorithm on a machine with finite pre-
cision. For an introduction the reader is referred to the book by Vandergraft [Van83].
Our implementation uses sigmoidal transfer functions and relies on linear programming techniques. This can cause numerical problems. The following is an example:
Listing 6.6 numerical example
>> invlogsig(0.99999999999999)
ans = 32.2370
>> invlogsig(0.9999999999999999)
ans = 36.0437
>> invlogsig(0.99999999999999999)
ans = Inf
>> x0 = 0.99999999999999
>> x1 = 0.9999999999999999
>> x2 = 0.99999999999999999
>> x1-x0
ans = 9.8810e-015
>> x2-x1
ans = 1.1102e-016
The above example illustrates that even very small round-off errors cause incorrect results. The implementation of VPA relies on the linear programming function linprog of the Matlab Optimization Toolbox [Mat00b]. Rounding errors of the above magnitude (of the order of 10^-15) often occur when using linprog. For example, when backpropagating a polyhedron, a value for one component that is off by such a rounding error already causes a numerical error. As we propagate polyhedra through all layers of the neural network, even small numerical errors quickly get magnified. Consequently, the neural network would be annotated with incorrect polyhedral regions. In the literature, mathematical problems which are sensitive to small changes in the data are called unstable or ill-conditioned problems [Van83].
Numerical Work-Around

The following definitions are used in this paragraph.

Definition 6.1 Stable and Unstable Numerical Range
We use the expression stable numerical range to state that the computation within this range is considered as numerically stable. Otherwise we use the expression unstable numerical range. □

Definition 6.2 Stable and Unstable Components
Components of a vector or matrix whose values are related to unstable numerical ranges are called unstable components. Otherwise the components are referred to as stable. □

As explained in Chapter 4, during the non-linear phase a polyhedron is backpropagated through a vector of sigmoidal transfer functions. Let c denote the optimization vector. If a component of the wrapping box is classified as numerically unstable, then the corresponding entry of the vector c is set to zero and the polyhedron is projected onto the subspace defined by the numerically stable components.

[c, P] = numWA(c, P)
    u = getUnstableComponents(B)   // u ... vector of unstable components of the wrapping box B
    c(u) = 0
    P = proj(P, u)
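A minimal Matlab sketch of this work-around, assuming the wrapping box is stored as rows [lower upper]; projectPolyhedron is a hypothetical helper used only for illustration.

    function [c, P] = numWA(c, P, B, stableRange)
    % Numerical work-around (sketch): components of the wrapping box B
    % (rows [lower upper]) that leave the stable numerical range are treated
    % as unstable; the corresponding entries of the optimization vector c are
    % set to zero and the polyhedron P is projected onto the subspace of the
    % stable components.
    unstable = find(B(:,1) < stableRange(1) | B(:,2) > stableRange(2));
    c(unstable) = 0;
    P = projectPolyhedron(P, unstable);   % hypothetical projection routine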
This numerical "work-around" still guarantees that the computed wrapping polyhedron will contain the non-linear region X~, because components with c(u) = 0 are irrelevant for the solution of the optimization problem. Furthermore, thanks to the convexity property of polyhedra, the projection of a polyhedron onto a subspace corresponds to the volume-wise biggest slice of the polyhedron (with respect to the projection). Finally, the projection approximation also ensures that the computed polyhedron contains the true projection.
This numerical work-around is appealing because it reuses theories and algorithms developed within this thesis for originally quite different purposes.
6.3 Summary of this Chapter

In this chapter the design and the Matlab implementation of a general framework for region-based refinement algorithms were introduced. In particular, the refinement function and the functions for the forward and backward propagation have been generalised. This framework is easy to use and allows different implementations for the forward and backward computation of regions through a feed-forward neural network to be plugged in.

Furthermore, numerical problems have been discussed and solutions to overcome numerical difficulties with VIA or VPA have been introduced.
- Contributions Chapter 6 -

- Design and implementation of a general framework for region-based refinement algorithms.

- Implementation of Validity Interval Analysis (VIA) and Validity Polyhedral Analysis (VPA) within this framework.

- Discussion of numerical difficulties for the VIA and VPA algorithms.

- Work-around to stabilize an implementation by introducing the concept of stable and unstable components, and by applying polyhedral projection techniques.
The results of the following evaluation chapter rely on the implementation of VIA and VPA. The current implementation is only a prototype and is restricted to the propagation of a single polyhedron. Furthermore, the approximation of the non-linear region is obtained with a single polyhedron. However, in later versions the implementation will be extended to use finite unions of polyhedra. Additionally, in the current implementation not all numerical issues are solved satisfactorily.
Chapter 7
Evaluation of Validity Polyhedral
Analysis
In this chapter, the Validity Polyhedral Analysis (VPA) approach developed in this
thesis is contrasted against Validity Interval Analysis (VIA) [Thr93]. VPA was eval-
uated on a neural network referred to as the circle neural network. This network was
trained on an artificially created data set. Further evaluations were performed on sev-
eral benchmark data sets of the UC Irvine database, and on a neural network trained to
predict the SP500.
Section 7.1 explains the general evaluation procedure. Section 7.2 presents the results
for VIA and VPA on the circle neural network. Section 7.3 discusses the evaluation on
the Iris and Pima data sets. Section 7.4 explains the application of VIA and VPA on a
neural network trained to predict the SP500 stock-market index.
7.1 Overview and General Procedure
The literature review highlighted the fact that benchmarks for rule extraction methods reported and compared results for data sets rather than neural networks, i.e. rule extraction methods were contrasted using different neural networks trained on the same task. Unfortunately, as far as we know, no trained benchmark neural networks are available.
The central idea of validity polyhedral analysis is to annotate a neural network with valid regions. As such, we are able to provide statements about the neural network behaviour which are guaranteed to be correct. This is a different focus compared to propositional or fuzzy rule extraction. Neural network validation methods should be compared on the neural network level, because a validation process should be independent of the training and data pre-processing phase.
However, this would have seriously disadvantaged rule extraction techniques requiring special neural network architectures. Functions of the Matlab Neural Network Toolbox [Mat00a] were used to train the neural networks. To improve the neural network generalization, early stopping and regularization were applied by defining an appropriate validation set and by using the Matlab functions trainbr and trainbfg. Regularisation modifies the performance function, which is usually the mean square error function. According to the Matlab documentation [Mat00a], a term is added to the performance function which consists of the weights and biases of the neural network. Using the modified performance function ensures that the neural network will have smaller weight and bias values, which helps to reduce the risk of overfitting the training data.
All available data was used for the training process of the Iris and Pima benchmarks.
For the circle task we randomly generated points in the input-space and used them
as training examples. The SP500 neural network used data collected from the stock-
market. The data was split into training, validation and test data.
The evaluation process followed the same structure for all experiments. In an initial
phase the neural network is annotated by propagating the operating input and output
space through the layers of the neural network. VIA was used to refine regions accord-
ing to initial, user-defined restrictions in the input space or the output space. VPA was
then applied to obtain further refinements.
To describe the evaluation, the task is explained, the neural network architecture is described, and finally the VIA and VPA results are discussed. A visualization of regions is presented to help in understanding the neural network behaviour. For the visualization of higher-dimensional spaces, projection techniques are useful. The circle task is explained in detail and provides a link from this chapter to the concepts discussed in Chapter 1 and Chapter 2, in particular to the idea of an annotated neural network and polyhedral pre- and postconditions, as well as to the discussion about interpreting numerical rules.
7.2 Circle Neural Network

This artificial example demonstrates in a simple manner the core idea of validating the behaviour of neural networks. It is called an artificial example because we know a priori exactly what rule the neural network has to learn.

The Task

A neural network had to distinguish between points inside and outside a circle, i.e. to learn whether ||x − m|| ≤ r for a given center m and a radius r. To make it more interesting, assume that the center of a two dimensional input space represents the current position of an airplane. A system should alert the pilot about approaching airplanes within a certain distance. From a neural network training perspective this problem reduces to classifying points inside and outside a circle. The feed-forward neural network which learnt this task is referred to as the circle neural network.
The Neural Network Architecture and the Learning Process

The architecture is a two-weight-layer neural network with the following properties:

- Two dimensional input space, with input between [-1,1] for each dimension.
- A five dimensional hidden layer with sigmoidal transfer function.
- One output node with sigmoidal transfer function. The output is in the interval [0,1].

The neural network was trained to predict a value greater than 0.5 for points within the circle; for points outside the circle the target value was less than 0.5. A possible set-up of this training is sketched below.
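A minimal sketch of how such a network could be generated and trained with the Matlab Neural Network Toolbox of that era; the centre, radius, sample size and the use of 0/1 targets are illustrative assumptions.

    % Generate labelled training data for the circle task (illustrative values).
    n = 1000;
    X = 2*rand(2, n) - 1;                 % random points in [-1,1]^2
    m = [0; 0]; r = 0.5;                  % assumed centre and radius
    T = double(sum((X - repmat(m,1,n)).^2) <= r^2);   % 1 inside, 0 outside

    % Two-weight-layer network: 5 sigmoidal hidden nodes, 1 sigmoidal output.
    net = newff(minmax(X), [5 1], {'logsig','logsig'});
    net = train(net, X, T);               % train on the generated data
    y = sim(net, [0.1; 0.1]);             % output > 0.5 expected inside the circle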
Application of VIA and VPA

Assume a user of the neural network is interested in the following question:

When does the neural network warn the pilot about approaching airplanes?

The developer of the neural network has to interpret this natural language specification. In this case it is trivial: the user is interested in the following output interval:

y ∈ [0.5, 1]
The tester has to validate which regions in the input space of the neural network predict a value in the above interval. In the first phase, the relevant output region is expressed as a polyhedron (in this case simply an interval):

P_y = { y | y ≤ 1 and −y ≤ −0.5 }

The next steps are to apply VIA and VPA with the defined initial restriction in the output space (and no restriction in the input space). VIA computes an axis-parallel box B_x in the input space. VPA is applied after the termination of VIA and computes a refined polyhedral region P_x, a triangle, in the input space.

The reciprocal image of the pre-defined output condition was computed by backpropagating the initial region through all layers of the neural network. This was followed by a refinement process of forward and backward propagating the regions until no further refinements in the input or output space were observed. Finally, these polyhedral rules are obtained (see Figure 7.1 for a visualisation of the input space):

if y ∈ P_y then x ∈ B_x   (1)
if y ∈ P_y then x ∈ P_x   (2)
if x ∉ B_x then y ∉ P_y   (3)
if x ∉ P_x then y ∉ P_y   (4)
As discussed in Section 1.4, the above rules are valid statements about the neural network behaviour. In our case we started with a backward step, because the initial region was defined in the output space. Hence, the output region is viewed as the precondition and the computed region in the input space as the postcondition (rules (1) and (2)). Section 1.4 described that a measure for the strength of a pre- or postcondition is the volume of the computed region. In the above case the volume of the box is 0.3453, compared to the stronger postcondition, the triangle (simplex), with a volume of 0.1599.
In Chapter 2 an introductory example explained that, usually, an interpretation process on numerical rules is necessary to obtain more human-readable representations. An interpretation of the above rules allows the following valid statements about the neural network behaviour.

1. If the neural network warns the pilot about approaching airplanes then the other airplane is within the box defined by B_x.
2. If the neural network warns the pilot about approaching airplanes then the other airplane is within the triangle defined by P_x.
3. If the input is outside the box B_x then the neural network will not send a warning to the pilot.
4. If the input is outside the simplex P_x then the neural network will not send a warning to the pilot.
It is important to notice that the correctness of statements 3 and 4 is guaranteed, be-
cause VIA and VPA are approximating the true region from outside. In other words:
the algorithms ensure that the true region is contained in the approximated polyhedral
region. Figure 7.1 is a visual representation of the neural network behaviour. Within
the two-dimensional input space the computed box and the computed triangular region
are shown. Points represent the neural network output value. Dots indicate that this
input corresponds to a neural network output less than 0.5 and the points plotted with
the “+” sign represent that the corresponding neural network output is greater than
0.5. It is important to notice that the neural network did not learn the task correctly, as
there are points within the circle which are classified with a value less than 0.5, i.e. the
neural network would not warn the pilot of approaching airplanes. With the VIA result
this misclassification remains undetected, whereas the computed VPA result proves the
misclassification. Additionally, this example highlights the difference between testing
a neural network and proving properties about the neural network. A test, by sampling
points, would possibly not have detected the misclassification.
[Figure: two-dimensional input space with the computed box and triangular region and the sampled network outputs]

Figure 7.1: Visualisation of the behaviour of the circle neural network.
7.3 Benchmark Data Sets
7.3.1 Iris Neural Network
The Task
The UC Irvine database contains 150 labeled patterns for the Iris classification prob-
lem. The examples have four numeric attributes: sepal length, sepal width, petal length
and petal width. The neural network has to learn to distinguish between three classes
of Iris plant, namely, Setosa, Versicolor and Virginica.
The Neural Network Architecture

The architecture is a two-weight-layer neural network with the following properties:

- Four dimensional input space, with input between [-1,1] for each dimension.
- A four dimensional hidden layer with sigmoidal transfer function.
- Three output nodes with sigmoidal transfer function. The output is in the interval [0,1]^3.

The neural network output is sparse coded.
Application of VIA and VPA

We are interested in the following question:

When does the neural network predict the Iris plant Setosa?

The neural network learnt to predict Setosa if the output is in a polyhedron P_y that requires a high value at the Setosa output node and low values at the other two output nodes. VIA computed, after one backward and one forward step, a box B_x in the input space. After applying VPA the input region is refined to a polyhedron P_x.
The volume of the box computed by VIA is 9.6395; the volume of the polyhedron computed by VPA is 1.4359.

Similarly as before, we can write a set of four valid statements about the neural network. Out of the 150 patterns, 50 patterns are classified as Setosa. The neural network learnt to classify these 50 patterns correctly. The box computed with VIA includes 116 out of the 150 patterns, among them all 50 Setosa patterns. The polyhedron computed with VPA contained exactly 50 points, and all of them are classified as Setosa. In Figure 7.2 and Figure 7.3 axis-parallel projections of the polyhedron and the box, respectively, onto a two dimensional space are depicted. Points which are classified as Setosa are plotted using "+"; otherwise the class is Versicolor or Virginica.
[Figure: projection of the VPA polyhedron onto the Sepal.Length / Petal.Length plane]

Figure 7.2: Projection onto the subspace x(Sepal.Width) = 0 and x(Petal.Width) = 0.
[Figure: projection of the VIA box onto the Sepal.Width / Petal.Width plane]

Figure 7.3: Projection onto the subspace x(Sepal.Length) = 0 and x(Petal.Length) = 0.

7.3.2 Pima Neural Network
The Task
We trained a neural network with data of the Pima Indians Diabetes Database. The
768 instances are drawn from a larger database. All patients are females at least 21
years old of Pima Indian heritage. The network has to classify according to selected
attributes if a person is tested positive for diabetes. The following attributes are all
numeric.
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg / (height in m)^2)
7. Diabetes pedigree function
8. Age (years)
The Neural Network Architecture

The architecture is a two-weight-layer neural network with the following properties:

- Eight dimensional input space with input between [-1,1] for each dimension.
- Eight dimensional hidden layer with sigmoidal transfer function.
- Two output nodes with sigmoidal transfer function. The output is in the interval [0,1]^2.

The network was trained to classify patients tested positive for diabetes with the output y = [0.9, 0.1], and otherwise with the output y = [0.1, 0.9].

Application of VIA and VPA
We want to analyse the following property of the neural network:

What are the possible outputs of the neural network for a specific group of patients?

For example, a specific group of patients is characterised by each of the eight attributes being above the average. This includes patients which are old (see attribute 8), have a high body mass index (attribute 6), and so on. We defined in the input space the following hypercube (to simplify the notation the hypercube is written as the Cartesian product of intervals); its inequality form is sketched below:

B_x = { x | x ∈ [0.5, 1]^8 }
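A minimal Matlab sketch of how this hypercube can be written in the inequality form A x ≤ b used by the polyhedral routines (illustrative, not part of the original implementation):

    % The hypercube [0.5,1]^8 as a system of linear inequalities A*x <= b:
    %   x_i <= 1   and   -x_i <= -0.5   for i = 1..8.
    n = 8;
    A = [eye(n); -eye(n)];
    b = [ones(n,1); -0.5*ones(n,1)];
    % A point is inside the hypercube iff all(A*x <= b).
    x = 0.75*ones(n,1);
    inside = all(A*x <= b);   % true for this x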
[Figure: output space (y1, y2) of the Pima neural network with the initial restriction, the VIA box and the VPA polyhedron]

Figure 7.4: The computed output regions for the Pima neural network. The outermost box is the initial restriction. The inner box is the computed region after applying VIA, and the polyhedral region inside is the region computed by VPA.
Additionally, the output space was restricted to [0.5, 0.95]^2. VIA computed a box B_y of possible outputs, and VPA terminated with a more refined polyhedral region P_y. Overall, these polyhedral rules are valid descriptions of the neural network behaviour:

if x ∈ B_x then y ∈ B_y   (1)
if x ∈ B_x then y ∈ P_y   (2)
if y ∉ B_y then x ∉ B_x   (3)
if y ∉ P_y then x ∉ B_x   (4)
The polyhedron is the stronger postcondition compared to the hypercube. The volume of the polyhedron is 0.5097 and the volume of the box is 0.8068. Figure 7.4 depicts a visualisation of the output space. To check the result, sample points were randomly generated in the input space and the neural network output was computed. The computed outputs are represented by small circles.
It is obvious that the polyhedral approximation is not very good. The reason is that a single polyhedral approximation during the non-linear phase is not sufficient. In fact, the polyhedral approximation of the non-linear region gets worse the higher the dimension. This is not a surprise, because in higher dimensions more curvatures and saddle points are expected. Consequently, as a protocol of the VPA implementation showed, the hyperplane approximating the non-linear region was mostly moved towards the corner of the wrapping hypercube (see the binary search method described in Chapter 4 for more details). Additionally, another approximation was necessary during the forward propagation of the region, namely for the projection of the computed eight dimensional polyhedron of the hidden layer onto a two-dimensional subspace.

However, the above result contains an interesting insight into the neural network behaviour. As shown in Figure 7.4, refinements essentially occurred around the upper left corner ("tested negative for diabetes"). In fact, of 100 points sampled randomly in the box B_x, only 8 were classified as "tested negative for diabetes" by the neural network.
7.4 SP500 Neural Network
The Task
Historical time series data are used for predicting future price developments. Technical analysis (also known as charting; see [Mur86] for detailed information) assumes that all influences of supply and demand are included in the price or index itself. Several neural networks were trained to predict the SP500 index for the next trading day, depending on historical data [Bre01]. Apart from recurrent neural networks, feed-forward neural networks have also been trained. The data set contained 500 values (from 24/09/1998 to 15/09/2000) of the SP500 index. The data included the opening and closing index and the highest and lowest index during a trading day; these variables are referred to as open, close, high and low. One further quantity is the maximum daily change observed in the considered time period. In [Zir97] linear and Fibonacci normalisation were introduced for financial time series data; for this task linear normalisation was applied. The following two input variables are used: v(1) is the closing value of the SP500 index normalised to the interval [0.4, 1] (according to a previous analysis [Bre01]), and v(2) is the activity of the day relative to the maximal daily change, which was defined by analysing the history of the data. The use of variable v(2) was motivated by [Azo94]; v(2) will be positive for a rising SP500 value and negative for a falling index value.
The Neural Network Architecture

The following neural network was analysed:

- Four dimensional input space. The input vector x corresponds to the values v(1) and v(2) of the previous two days. The input is in the range [0.4, 1] for the v(1) values and in the interval [-1, 1] for the v(2) values.
- 11 hidden nodes with sigmoidal transfer function.
- Two dimensional output with sigmoidal transfer function. The output is in the interval [0,1]^2.

The network was trained to compute values bigger than 0.5 for an increasing stock-market index and values less than 0.5 for a decreasing SP500 index.
Application of VIA and VPA
We are interested to compute regions in the input space which correspond to an in-
creasing SP500 index (according to the neural network prediction).
Find regions in the input space which predict an increasing SP500 index ?
For this example VIA and VPA produced the same result, namely the complete input
space. This result has two reasons: firstly, sampling indicated that the input points for
predicting an increasing SP500 stock-market are widely distributed in the input space.
Secondly, a single polyhedral approximation is not sufficient. Figure 7.5 is an example
of the projection of the four-dimensional cube onto a two-dimensional input space.
[Figure: Projection onto the subspace x(v2 two days before) = 0 and x(v2 one day before) = 0; axes are 'v1 two days before' and 'v1 one day before']

Figure 7.5: The computed input region for the SP500 neural network. Neither VIA nor VPA computed a refined region.
To obtain better statements about the neural network behaviour, a pre-processing phase is recommended: for example, by using methods like k-means, regions in the input space can be selected and further analysis can be performed using VIA and VPA (see the sketch below).
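A minimal sketch of such a pre-processing step, assuming the kmeans function of the Matlab Statistics Toolbox; the number of clusters and the box construction around each cluster are illustrative choices.

    % Cluster the training inputs (rows of X) and build an axis-parallel box
    % around each cluster; each box can then be refined with VIA and VPA.
    k = 5;
    [idx, centres] = kmeans(X, k);        % Statistics Toolbox
    for j = 1:k
        pts = X(idx == j, :);             % points of cluster j
        Bx = [min(pts)' max(pts)'];       % rows [lower upper] per dimension
        % region.input = Bx;  ... pass to refine(...) as in Listing 6.4
    end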
7.5 Summary of this Chapter

VPA was successfully used for the circle and the Iris network. It also showed that a better refinement of polyhedral rules is obtained compared to VIA. However, the results for the Pima and the SP500 neural networks are not completely satisfactory. Apart from the lack of time, this has the following reasons:

- An analysis showed that the input patterns for different classes often lie in a close neighbourhood. Because of this behaviour of the neural network, no significant region refinements are to be expected when applying VIA and VPA. In these cases a pre-processing step is required in order to select small regions (e.g. by using clustering algorithms). Afterwards, VIA and VPA can be used to analyse these regions.

- The approximation for the non-linear phase is crucial. Experiments indicated that the quality of the approximation process (the positioning of a hyperplane) decreases with increasing dimension. A reason for this is probably an increasing number of saddle points. To obtain better refinements, a split of the polyhedron into a finite union of sub-polyhedra is necessary.
- Contributions Chapter 7 -

- Evaluation of the VPA method on four different neural networks and comparison to VIA.

- Visualization of the behaviour of neural networks, also for higher-dimensional input and output spaces, by applying polyhedral projection techniques.
Chapter 8
Conclusion and Future Work
Section 8.1 enumerates the original contributions of this thesis. Section 8.2 discusses
several approaches to fine tune the existing implementation of VPA. Finally, Section
8.3 indicates further research directions.
8.1 Contributions of this Thesis
We enumerate the main contributions of this thesis according to their occurrence in the text. The importance of each contribution is marked with one star (*), expressing that the author views it as a minor contribution, up to three stars (***) for contributions the author considers major.
1. Introduction of the concept of an Annotated Artificial Neural Network (AANN), with a connection to logic and software verification. (***)
2. Validity Polyhedral Analysis (VPA), as a tool to annotate a feed-forward neural
network with valid pre- and postconditions in form of linear inequality predi-
cates. (***)
3. Classification of neural network analysis techniques into: propositional rule ex-
traction, fuzzy rule extraction and region based analysis. (*)
4. Introduction of the property “validation capability” to indicate if a neural net-
work analysis technique is able to compute properties about the neural network,
which are provably correct. (*)
5. Suggestion to modify REFANN [SLZ02] to obtain valid rules. (*)
6. The computation of the eigenvalues and the eigenvectors in the neighbourhood of a point x on the manifold of the region X~ to obtain information about the deformations of polyhedral facets under sigmoidal transformations. (**)
7. Analysis of piece-wise linear approximations of the sigmoidal function by using
non axis-parallel splits. (*)
8. The idea of computing a wrapping polyhedron P_x, which contains the non-linear region X~. (**)
9. Application of the SQP approach and development of the MSA (Maximum Slice Approach) technique to solve the required non-linear optimization problem for a polyhedral wrapping of the non-linear region X~. (*)
10. Development of a branch and bound and a binary search technique, which approximate a solution for the global optimum from outside. These methods fulfill the requirement of an outside approximation for the non-linear region X~. (***)
11. Experiments comparing the binary search method and the branch and bound
approach.(*)
12. Computation of the image of a polyhedron under an affine transformation by
applying projection techniques, if necessary. (**)
13. Development of the S-Box approximation technique to compute an approxima-
tion for the projection of a polyhedron onto a lower dimensional subspace in
polynomial time complexity.(***)
14. Approximation of the image of a polyhedron under a linear transformation when dim(ker(W)) > 0 by approximating Y directly, solving optimization problems of the form max_{x ∈ P} c_i·(W x + θ). (*)
15. Design and implementation of a general framework for region-based refinement
algorithms. (**)
16. Implementation of Validity Interval Analysis (VIA) and Validity Polyhedral Anal-
ysis (VPA) within this framework. (*)
17. Discussion of numerical difficulties for the VIA and VPA algorithms. (**)
18. Work-around to stabilize an implementation by introducing the concept of stable
and unstable components, and by applying polyhedral projection techniques.
(**)
19. Evaluation of the VPA methods on four different neural networks and compari-
son to VIA. (**)
20. Visualization of the behaviour of neural networks also for higher-dimensional
input and output spaces by applying polyhedral projection techniques. (*)
Thus far only the binary search method [BMH03] has been published. The following publications are planned for the future:

- Approximation Methods for the Projection of a Polyhedron onto a Lower-Dimensional Subspace.
- The Concept of Annotated Artificial Neural Networks as a Method to Validate Properties of a Neural Network.
- Survey of Region-based Neural Network Analysis Techniques.
- Validity Polyhedral Analysis and its Application.
8.2 Fine Tuning of VPA
The current implementation of VPA is a prototype implementation. This implementa-
tion proved that the theoretical results, developed in this thesis, are valid and that they
could have practical relevance.
However, due to time limitations, this implementation did not explore the full potential
of techniques and ideas as introduced in this thesis. To fine tune the implementation
of Validity Polyhedral Analysis (VPA), we suggest the following steps:
- Implementation of a component to explore the structure of the manifold of the non-linear region X~. This could be based on the eigenvalue and eigenvector analysis introduced in Chapter 3.

- Extension of the implementation to finite unions of polyhedra, and therefore a method to obtain more refined region mappings.
The next paragraph concludes this thesis by pointing to future theoretical investigations related to this research and by providing an outlook on kernel-based machines.
8.3 Future Directions and Validation Methods for Kernel Based
Machines
This thesis contained a number of interesting theoretical problems, in particular the non-linear optimization problem and the projection of a polyhedron onto a lower-dimensional subspace. Future investigations should tackle the following aspects:

- Analysis of the refinement process for the binary search method and for the refinement process of VPA.

- Development of a heuristic to compute the most interesting optimization directions for the wrapping of the non-linear region X~. This would lead to questions such as: given a non-linear region X~ and a finite number of possible directions for a polyhedral approximation of X~, how should the directions be chosen to obtain the most refined approximation?

- Application of the concept of an annotated neural network in an industrial context. The ability to guarantee certain properties about the neural network behaviour might motivate more people to use neural networks. For example, for neural network control tasks, people are often interested to know which inputs are related to stable output states.

- Further improvement of the developed approximation techniques for the projection of a polyhedron onto a lower-dimensional subspace.
Kernel-based Machines

Kernel-based machines are interesting developments in machine learning (see for example [SS02]). These machines rely on the observation that input patterns are more likely to be linearly separable in higher dimensions. Kernel machines first perform a mapping x ↦ Φ(x) from an input space to a higher dimensional feature space. For the learning process in the higher dimensional space only the dot product is needed. The kernel trick is that for some feature spaces and mappings the dot product in feature space is computable via a kernel function k defined on the input space, such that k(x, x') = ⟨Φ(x), Φ(x')⟩.
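As a small illustration, a Matlab sketch of the Gaussian kernel mentioned below; the width parameter sigma is an illustrative choice.

    function K = gaussianKernel(X1, X2, sigma)
    % Gaussian (RBF) kernel matrix: K(i,j) = exp(-||x1_i - x2_j||^2 / (2*sigma^2)).
    % X1 and X2 hold one input pattern per column.
    n1 = size(X1, 2); n2 = size(X2, 2);
    K = zeros(n1, n2);
    for i = 1:n1
        for j = 1:n2
            d = X1(:,i) - X2(:,j);
            K(i,j) = exp(-(d'*d) / (2*sigma^2));
        end
    end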
Support Vector Machines (SVMs) are probably the best known kernel machines. The architecture of an SVM is a feed-forward neural network. Similarly to the introduced method, an analysis of SVMs could also be performed by annotating the layers of an SVM with regions. Polyhedra are a suitable choice for the linear layer. However, to our knowledge there are no known approaches for handling kernel functions such as the Gaussian kernel. Hence, in order to generalise VPA to SVMs, this problem has to be addressed.
Appendix A
Overview of used symbols
P ... previous layer
S ... subsequent layer
H ... hyperplane, defined by a direction vector c and a constant b
H⁺ ... positive half-space, where c·x > b
H⁻ ... negative half-space, where c·x < b
P ... polyhedron defining a set of points
Q ... projection of a polyhedron P onto a subspace
Q~ ... approximation of Q
Y ... polyhedron given as the image of a polyhedron under the affine transformation
X~ ... (non-linear) region
y ... output vector of a function
x ... input vector of a function
a ... activation vector
net ... net input vector
W ... weight matrix
θ ... bias vector
σ ... sigmoid transfer function
S ... subspace orthogonal to the kernel of the transformation matrix W
L_S ... the affine transformation restricted to the subspace S
P_y ... polyhedron in y-space, which we want to backpropagate through the transfer-function layer
P_x ... polyhedron in x-space, i.e. the polyhedral approximation of X~
X~ ... true reciprocal image of the polyhedron P_y in x-space, called region X~
B_y ... wrapping box of P_y, i.e. the smallest axis-parallel hypercube containing P_y
B_x ... wrapping box of X~
 ... box in x-space, used for the intersection detection after the k-th iteration
 ... box in y-space, ditto
 ... maximum of the cost function in x-space (constrained to the wrapping box)
p_y ... point in y-space obtained when linearly optimizing over P_y
p_x ... point in x-space corresponding to p_y, i.e. p_x = σ⁻¹(p_y); in the linear case this would already be the optimal point for H
shift line ... the line segment, parameterised by λ ∈ [0,1], between the maximizer of the cost function in x-space and the point p_x
 ... point used for the positioning of the hyperplane in the binary search method
Δλ ... rate of change between two consecutive values of λ
vol(B) ... volume of a box
ε ... a small positive real number
□ ... box operator, the smallest hypercube containing a region X~ or polyhedron P
≡ ... boolean operator to compare whether two expressions are equivalent
[l, u] ... the interval l ≤ x ≤ u
Appendix B
Linear Algebra Background
We recall a few important properties about linear transformations which are relevant for the algorithm for the computation of the image of a polyhedron under an affine transformation.

Let W be an [m, n] matrix, i.e. a matrix with m rows and n columns.

1. We can view the matrix W as the representation of a linear transformation from R^n to R^m.

2. The kernel (sometimes called null-space) of W is the set of all vectors that get mapped to the zero vector under W. We write ker(W) = {x ∈ R^n | W x = 0}. The kernel is a subspace of the domain, i.e. of R^n. In the following we also use K to refer to the subspace ker(W).

3. The image (sometimes called range) of W is the entire set of vectors in R^m reachable by an input vector x when multiplied with W. We write Im(W) = {y ∈ R^m | ∃ x, y = W x}. The image is a subspace of R^m.

4. The restriction of W to a subspace S orthogonal to K is a linear bijection between S and Im(W).

Proof:
4. The restriction ofk
to a subspace Î orthogonal to À is a linear bijection be-
tween Î and ¾�¿�� k � .Proof:
(i) Injection: let � % MO� ¥ ��Í andk � % � k � ¥ Ó
then:k ��� % Q�� ¥ ���"¨ £ � % Q� ¥ ��À ,but: Í�� À ���>¨¦� Hence: � % �H� ¥ w
175
176 Chapter B. Linear Algebra Background
(ii) Surjection: let ���5¾�¿,� k � £ � ���mî © �ÅÀ�� ͯMO�J� k ��J�H�)ç���� Õ ,where: �%ç��5À�MO� Õ � Í��� k ��� k ��� Õ ���)ç���� k � Õ wIn other words any vector ����Î gets mapped to a distinct vector �v��¾�¿,� k � .We will write © Õ .
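These properties can be checked numerically in Matlab with the built-in null function; the matrix below is an arbitrary illustrative example.

    % Illustrative check of properties 2-4 for a wide matrix W.
    W = [1 2 0; 0 1 1];          % a [2,3] matrix, so dim(ker(W)) = 1
    K = null(W);                 % orthonormal basis of ker(W)
    S = null(K');                % orthonormal basis of the subspace S orthogonal to K
    % W restricted to S is a bijection onto Im(W): the 2x2 matrix W*S is invertible.
    WS = W * S;
    disp(rank(WS))               % prints 2
    % Any x splits into x = xK + xS with W*x = W*xS:
    x  = [3; -1; 2];
    xS = S * (S' * x);           % orthogonal projection of x onto S
    disp(norm(W*x - W*xS))       % prints (numerically) 0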
Bibliography
[ADT95] R. Andrews, J. Diederich, and A. Tickle. A survey and critique of
techniques for extracting rules from trained artificial neural networks.
Knowledge-Based Systems 8 (1995) 6, pages 373–389, 1995.
[Ama97] Saman Prabhath Amarasinghe. Parallelizing Compiler Techniques
Based on Linear Inequalities. PhD thesis, Stanford University, Stan-
ford, CA 94305, January 1997.
[Azo94] Michal E. Azoff. Neural Network Time Series Forecasting of Financial
Markets. John Wiley & Sons Ltd., 1994.
[Bal98] Egon Balas. Projection with a minimal system of inequalities. Compu-
tational Optimization and Applications, 10:189–193, April 1998.
[BBDR96] J.M. Benitey, A. Blanco, M. Delgado, and I. Requena. Neural methods
for obtaining fuzzy rules. 3:371–382, 1996.
[Bis94] C.M. Bishop. Neural networks and their applications. Rev.Sci. In-
strum., 65(6):1803–1831, 1994.
[Bis95] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford Uni-
versity Press, 1995.
[BMH03] S. Breutel, F. Maire, and R Hayward. Extracting interface assertions
from neural networks in polyhedral format. In Michel Verleysen, edi-
tor, ESANN 2003, pages 463–468. Kluwer, 2003.
[Bon83] A. Boneh. PREDUCE: A Probabilistic Algorithm for Identifying Re-
dundancy by a Random Feasible Point Generator. Springer-Verlag,
Berlin, Germany, 1983.
[Bre01] S. Breutel. Neural network time series prediction for the sp500 index.
Internal Report, 2001.
[Bro97] M. Broy. Informatik - Eine grundlegende Einführung. Springer, 1997.
[BS81] Bronstein and Semendjajew. Taschenbuch der Mathematik. Harri
Deutsch und Thun, Frankfurt, 1981.
[BT96] Paul T. Boggs and John W. Tolle. Sequential quadratic programming.
pages 1–000, 1996.
[Cer61] R.N. Cernikov. The solution of linear programming problems by elim-
ination of unknowns. Doklady Akademii Nauk 139, pages 1314–1317,
1961.
[CMP] R.J. Caron, J.F. McDonald, and C.M. Ponic. Classification of linear
constraints as redundant or necessary. Technical Report WMR-85-09,
University of Windsor, Windsor Mathematics Report.
[CMP89] R.J. Caron, J.F. McDonald, and C.M. Ponic. A degenerate extreme
point strategy for the classification of linear constraints as redun-
dant or necessary. Journal of Optimization Theory and Applications,
62(2):225–237, August 1989.
[Cov65] T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14:326–334, 1965.
[Cra96] M. Craven. Extracting comprehensible models from trained neural networks. PhD dissertation, University of Wisconsin, Madison, WI, 1996.
[CS96] M.W Craven and J.W. Shavlik. Extracting tree-structured represen-
tations of trained neural networks. Advances in Neural Information
Processing Systems, 8, 1996.
[Dax97] A. Dax. An elementary proof of Farkas' lemma. SIAM Review,
39(3):503–507, 1997.
[DM97] M. Dyer and N. Megiddo. Chapter 38 in The Handbook of Discrete
and Computational Geometry, pages 699–710, July 1997.
[Dut02] Mathieu Dutour. Computational methods for cones and polytopes with
symmetry. January 2002.
[Far02] J. Farkas. Über die Theorie der einfachen Ungleichungen. Journal für
die reine und angewandte Mathematik, 124:1–24, 1902.
[FJ99] Maciej Faifer and Cezary Janikow. Extracting fuzzy symbolic rep-
resentation from artificial neural networks. 18th International Con-
ference of the North American Fuzzy Information Processing Society,
June 1999.
[Fou27] J.B.J. Fourier. (Reported in:) Analyse des travaux de l'Académie
Royale des Sciences pendant l'année 1824. Partie mathématique, 1827.
[FP96] Komei Fukuda and Alain Prodon. Double description method revisited.
Combinatorics and Computer Science, pages 91–111, 1996.
[Fri98] Bernd Fritzke. Vektorbasierte Neuronale Netze. Shaker Verlag, 1998.
[Fu94] LiMin Fu. Rule generation from neural networks. IEEE Transactions on
Systems, Man, and Cybernetics, 24(8):1114–1124, 1994.
[Fuk00] Komei Fukuda. Frequently asked questions in polyhedral computation.
2000.
[GSZ00] Adam E. Gaweda, Rudy Setiono, and Jacek M. Zurada. Rule extrac-
tion from feedforward neural network for function approximation. In
Neural Networks and Soft Computing, Zakopane, Poland, June 2000.
[GVL89] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins
University Press, 1989.
[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Pren-
tice Hall, 1999.
[HEFROG03] Carlos Hernandez-Espinosa, Mercedes Fernandez-Redondo, and Ma-
men Ortiz-Gomez. A new rule extraction algorithm based on interval
arithmetic. In Michel Verleysen, editor, ESANN 2003, pages 155–160.
Kluwer, 2003.
[HHSD97] R. Hayward, C. Ho-Stuart, and J. Diederich. Neural Networks as Ora-
cles for Rule Extraction. Connectionist Systems for Knowledge Repre-
sentation and Deduction, 1997.
[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer Feedforward
Networks are Universal Approximators. Neural Networks, 2:359–366,
1989.
[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata The-
ory, Languages, and Computation, chapter 3, pages 65–71. Addison-
Wesley, 1979.
[Huc99] T. Huckle. Kleine Bugs, große GAUs, 1999. Seminar 2.12.1999,
wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugs.html.
[IN] H. Ishibuchi and M. Nii. Generating fuzzy if-then rules from trained
neural networks: Linguistic analysis of neural network. pages 1133–
1138.
[INT] H. Ishibuchi, M. Nii, and K. Tanaka. Linguistic rule extraction from
neural networks and genetic-algorithm-based rule selection. pages
2390–2395.
[INT99] H. Ishibuchi, M. Nii, and K. Tanaka. Linguistic rule extraction from
neural networks for higher-dimensional classification problems. Com-
plexity International, 6, 1999.
[Jan93] C.Z. Janikow. Fuzzy processing in decision trees. Proceedings of the
Sixth International Symposium on AI, pages 360–367, 1993.
[Jan96] C.Z. Janikow. Fuzzy decision trees: Issues and methods. Technical
report, Department of Mathematics and Computer Science, University
of Missouri-St Louis, 1996.
[Kal02] Bohdan L. Kaluzny. Polyhedral computation: a survey of projection
methods. Technical Report 308-760B, McGill University, April 2002.
[Ker00] Eric C. Kerrigan. Robust Constraint Satisfaction: Invariant Sets and
Predictive Control. PhD thesis, Control Group, Department of Engi-
neering, University of Cambridge, 2000.
[KG85] A. Kaufmann and M.M. Gupta. Introduction to fuzzy arithmetic. 1985.
[Koh87] T. Kohonen. Self-Organization and Associative Memory. Springer-
Verlag, 2 edition, 1987.
[Las83] J.B. Lasserre. An analytical expression and an algorithm for the volume
of a convex polyhedron in $\mathbb{R}^n$. Journal of Optimization Theory and
Applications, 39(4), 1983.
[Len93] C. Lengauer. Loop parallelization in the polytope model. CONCUR
’93,Lecture Notes in Computer Science 715, pages 398–416, 1993.
[Lis01] Paulo J G Lisboa. Industrial use of safety-related artificial neural net-
works. Technical report, Health and Safety Executive, 2001.
[LMP89] J. Li, A.N. Michel, and W. Porod. Analysis and synthesis of a class of
neural networks: linear systems operating on a closed hypercube. IEEE
Transactions on Circuits and Systems, 36(11):1405–1422, November
1989.
[LVE00] P.J.G. Lisboa, A. Vellido, and B. Edisbury, editors. Neural network
applications in business. World Scientific, 2000.
[Mai98] F. Maire. Rule-extraction by backpropagation of polyhedra. Neural
Networks, 12:717–725, 1998.
[Mai00a] F. Maire. On the convergence of validity interval analysis. IEEE Trans-
actions on Neural Networks, 11(3), 2000.
[Mai00b] F. Maire. Polyhedral analysis of neural networks. SDL, 2000.
[Mat00a] MathWorks, 24 Prime Park Way, Natick, MA. The Neural Network
Toolbox, 2000.
[Mat00b] MathWorks, 24 Prime Park Way, Natick, MA. The Optimization Tool-
box 2.0, 2000.
[Mat00c] MathWorks, 24 Prime Park Way, Natick, MA. Using Matlab, 2000.
[Men01] Jerry M. Mendel. Rule-Based Fuzzy Logic Systems. Prentice Hall,
Upper Saddle River, NJ 07458, 2001.
[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[MKW03] Urszula Markowska-Kaczmar and Wojciech Trelak. Extraction of fuzzy
rules from trained neural network using evolutionary algorithm. In
Michel Verleysen, editor, ESANN 2003, pages 149–154. Kluwer, 2003.
[MP00] Ofer Melnik and Jordan Pollack. Exact representations from
feed-forward networks. Technical report, Brandeis University,
Waltham, MA, USA, April 2000.
[Mur86] John J. Murphy. Technical Analysis of the Futures Markets. The New
York Institute of Finance, Prentice Hall, New York, 1986.
[NP00] C.D. Neagu and V. Palade. An interactive fuzzy operator used in rule
extraction from neural networks. Neural Network World 4, pages 675–
684, 2000.
[PNJ01] Vasile Palade, Daniel-Ciprian Neagu, and Ron J. Patton. Interpretation
of trained neural networks by rule extraction. pages 152–161, 2001.
[PNP00] Vasile Palade, Daniel-Ciprian Neagu, and Gheorghe Puscasu. Rule
extraction from neural networks by interval propagation. 2000.
[Qui86] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106,
1986.
[Ram94] Joerg Rambau. Polyhedral Subdivisions and Projections of Polytopes.
Dissertation, TU Berlin, 1994.
[Rep] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/ml-
summary.html.
[Rep99] DTI Final Report. Evaluation of parallel processing and neural com-
puting application programs. Technical report, DTI, 1999.
[RHW86a] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning Internal
Representations by Error Propagation. Parallel Distributed Processing,
Vol I + II, MIT Press, 1986.
[RHW86b] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning inter-
nal representations by error propagation. In D.E. Rumelhart and
J.L. McClelland, editors, Parallel Distributed Processing: Explorations
in the Microstructure of Cognition, volume 1, Cambridge, MA, 1986.
MIT Press.
[RHW86c] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning represen-
tations by back-propagating errors. Nature (London), 323:533–536,
1986.
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information
storage and organization in the brain. Psychological Review, 65:386–
408, 1958.
[RR89] Bruno Riedmüller and Klaus Ritter. Lineare und quadratische Opti-
mierung. Institut für Angewandte Mathematik und Statistik, Technis-
che Universität München, 1989.
[Sch90] A. Schrijver. Theory of Linear and Integer Programming. Wiley-
Interscience Publication, 1990.
[SL97] R. Setiono and H. Liu. NeuroLinear: From neural networks to oblique
decision rules. Neurocomputing, 17(1):1–24, 1997.
[SLZ02] Rudy Setiono, Wee Kheng Leow, and Jacek M. Zurada. Extraction of
rules from artificial neural networks for nonlinear regression. IEEE
Transactions on Neural Networks, 13(3), May 2002.
[SN88] K. Saito and R. Nakano. Medical diagnostic expert system based on
PDP model. Proceedings of the IEEE International Conference on Neural
Networks, 1:255–262, 1988.
[SS95] Hava T Siegelmann and Eduardo D Sontag. On the computational
power of neural nets. Journal of Computer and System Sciences,
50(1):132–150, 1995.
[SS02] Bernhard Schoelkopf and Alexander J. Smola. Learning with Kernels.
MIT Press, Cambridge, Massachusetts, 2002.
[TAGD98] A. Tickle, R. Andrews, Mostefa Golea, and J. Diederich. The truth will
come to light: Directions and challenges in extracting the knowledge
embedded within trained artificial neural networks. IEEE Transactions
on Neural Networks, 9(6):1057–1068, 1998.
[TBS93] T.M. Martinetz, S. Berkovich, and K. Schulten. Neural gas network for
vector quantization and its applications to time series prediction. IEEE
Transactions on Neural Networks, 4:558–569, 1993.
[TG96] I. Taha and J. Ghosh. Symbolic interpretation of artificial neural net-
works. (TR-97-01-106), 1996.
[Thr90] S. B. Thrun. Inversion in Time. In Proceedings of the EURASIP Work-
shop on Neural Networks. Springer Verlag, 1990.
[Thr91] S. B. Thrun. The MONK's problems - a performance comparison
of different learning algorithms. Technical Report CMU-CS-91-197,
Carnegie Mellon University, Pittsburgh, PA, December 1991.
[Thr93] S. B. Thrun. Extracting Provably Correct Rules from Artificial Neural
Networks. Technical Report IAI-TR-93-5, Department of Computer
Science III, University of Bonn, 1993.
[Thr95] S. B. Thrun. Extracting Rules from Artificial Neural Networks with
Distributed Representations. Advances in Neural Information Process-
ing Systems (NIPS) 7, 1995.
[TM] Theodore B. Trafalis and Alexander M. Malyscheff. An analytic center
machine. Machine Learning.
[TS93a] G. Towell and J. Shavlik. The extraction of refined rules from
knowledge-based neural networks. Machine Learning, 13:71–101,
1993.
[TS93b] G. Towell and J.W. Shavlik. Extracting refined rules from knowledge-
based neural networks. Machine Learning, 13:71–101, 1993.
[Van83] James S. Vandergraft. Introduction to numerical computation. Aca-
demic Press, New York, 1983.
[Wil97] Doran K. Wilde. A library for doing polyhedral operations. Technical
report, Brigham Young University, Department of Electrical and Com-
puter Engineering, 459 CB, Box 24099, Provo, Utah, 1997.
[WvdB98] T. Weijters and A. van den Bosch. Interpretable neural networks with
BP-SOM. Tasks and Methods in Applied Artificial Intelligence. Lecture
Notes in Artificial Intelligence 1416, pages 564–573, 1998.
[Zad75] L. A. Zadeh. The concept of a linguistic variable and its application to
approximate reasoning - I. Information Sciences, 8:199–249, 1975.
[Zie94] G.M. Ziegler. Lectures on Polytopes. Springer-Verlag, 1994.
[Zir97] Joseph S. Zirilli. Financial Prediction using Neural Networks. Inter-
national Thomson Computer Press, 1997.