Knowledge Extraction from Neural Networks
using the All-Permutations Fuzzy Rule Base∗
Eyal Kolman and Michael Margaliot†
August 3, 2005
∗This work was partially supported by the Tel Aviv University Internal Research Fund under grant number 05110066. An abridged version of this paper was presented at the 8th International Work-Conference on Artificial Neural Networks (IWANN'2005).
†Corresponding author: Dr. Michael Margaliot, Dept. of Electrical Engineering-Systems, Tel Aviv University, Tel Aviv, Israel 69978. Tel: +972-3-640 7768; Fax: +972-3-640 5027; Homepage: www.eng.tau.ac.il/∼michaelm; Email: [email protected]
Abstract
A major drawback of artificial neural networks is their black-box character. Even when the trained network performs adequately, it is very difficult to understand its operation. In this paper, we use the mathematical equivalence between artificial neural networks and a specific fuzzy rule base to extract the knowledge embedded in the network. We demonstrate this using a benchmark problem: the recognition of digits produced by an LED device. The method provides a symbolic and comprehensible description of the knowledge learned by the network during its training.
Keywords: Feedforward neural networks, knowledge extraction, rule extrac-
tion, rule generation, hybrid intelligent systems, neuro-fuzzy systems.
1 Introduction
The ability of artificial neural networks (ANNs) to learn and generalize from
examples makes them very suitable for use in numerous real-world applications where exact algorithmic approaches are unknown or too difficult to
implement. The knowledge learned during the training process is distributed
in the weights of the different neurons and, whether the ANN operates properly or not, it is very difficult to comprehend exactly what it is computing. In this respect, ANNs process information at a “black-box,” subsymbolic level. The problem of extracting the knowledge learned by the network, and representing it in a comprehensible form, has received a great deal of attention
in the literature (see, e.g., [10, 3, 27]).
Rule-based systems process information in a manner that is much easier to
comprehend because the system’s knowledge is stated using symbolic If-Then
rules. In particular, fuzzy rule bases (FRBs) enable the use and manipulation
of expert knowledge stated using natural language [29, 30, 11, 31, 19]. Thus,
the knowledge is easy to understand, verify, and, if necessary, refine.
Recently, a great deal of research has been devoted to the design of hy-
brid intelligent systems that fuse subsymbolic and symbolic techniques for
information processing [20] and, in particular, to creating a synergy between
ANNs and FRBs [21]. Such a synergy may lead to systems with the ro-
bustness and learning capabilities of ANNs and the “white-box” character of
FRBs.
Understanding the operation of a trained ANN is difficult because the
knowledge is embedded in a complex, distributed, and sometimes self-contradictory
form [14]. A widespread heuristic approach for knowledge extraction is based
on finding the “most effective” input-output paths, and then transforming
them into symbolic rules (see, e.g., [25, 7, 22]). Ishikawa [14] incorporates
regularization terms that punish large weights into the error function used
in the training phase. This forces the ANN to develop a more condensed
and, therefore, easier to understand, form of knowledge representation. To
extract the knowledge, Ishikawa represents each hidden and output unit as
a Boolean function of outputs from previous layers. Forcing the ANN to
produce a skeletal structure seems to be a very useful technique in general [26, 6]; however, extracting knowledge in the form of Boolean functions is not appropriate for general ANNs.
Other approaches are based on an attempt to develop an equivalence
between ANNs and rule-based systems. Fu and Fu [12] mapped rule-based
systems into a neural-like architecture: final hypotheses are represented using
output neurons; data attributes become input neurons; and the strength of
the rule is mapped into the weight of the corresponding connection. This al-
lowed back-propagation-like learning for modifying rule strengths. However,
this approach cannot be used to extract knowledge from a standard ANN.
Towell and Shavlik [28] introduced the Knowledge-Based Artificial Neu-
ral Network (KBANN) algorithm that transforms a knowledge base into an
ANN. This can be used to insert initial domain knowledge into an ANN and
thus reduce the training time and improve the chances of convergence to a global minimum. After training, a heuristic method is used to extract symbolic
rules. However, the extracted rules are Boolean and the inherent “fuzziness”
of the ANN is lost.
A well-known neuro-fuzzy model is the Adaptive Network-Based Fuzzy
Inference System (ANFIS) developed by Jang et al. [16], which is a feed-
forward network representation of the fuzzy reasoning process. However,
this representation is not a standard ANN because nodes in different layers
perform different tasks corresponding to the different stages in the fuzzy rea-
soning process. For example, nodes in the first layer compute membership
function values, whereas nodes in the second layer perform T-norm opera-
tions.
Jang and Sun [15] noted that the local activation functions of radial
basis function networks (RBFNs) are the Gaussian membership functions
frequently used in FRBs. They used this to extract a set of fuzzy rules that
are mathematically equivalent to the RBFN. However, this equivalence holds
only for RBFNs, and it also requires that each membership function be used by no more than one rule [2].
Benitez et al. [5] showed that ANNs with Logistic activation functions
are equivalent to the result of inferencing a set of Mamdani-type fuzzy rules
(see also [9, 32]). However, this is not a standard FRB as the operators used
in the inferencing method are not those commonly used in FRBs.
Recently, the authors introduced a new Mamdani-type FRB referred to
as the All-Permutations Fuzzy Rule Base (APFRB) [17]. Inferencing the
APFRB, using standard tools from fuzzy logic theory, yields an input-output
relationship that is mathematically equivalent to that of a feed-forward ANN.
More precisely, there exists an invertible transformation T such that
$$T(\mathrm{ANN}) = \mathrm{APFRB} \quad \text{and} \quad T^{-1}(\mathrm{APFRB}) = \mathrm{ANN}. \tag{1}$$
This equivalence enables bidirectional flow of information between the
ANN and the corresponding APFRB. It also enables the application of tools
from the theory of ANNs to APFRBs and vice versa. For example, given a
procedure for simplifying an ANN, we can immediately apply it to simplify
an APFRB as follows. Given an initial APFRB, calculate the equivalent network as $\mathrm{ANN} = T^{-1}(\mathrm{APFRB})$. Apply the simplification procedure to this ANN and denote the result by ANN′. Then, $\mathrm{APFRB}' := T(\mathrm{ANN}')$ is a simplified version of the original APFRB.
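The round-trip can be stated as a three-line composition. The following sketch is ours: T, T_inv, and simplify_ann are placeholders for the transformation of Corollary 1 below and for an arbitrary network-simplification routine.

def simplify_apfrb(apfrb, T, T_inv, simplify_ann):
    # Simplify a rule base by a detour through the ANN domain.
    ann = T_inv(apfrb)                  # APFRB -> equivalent network
    simplified_ann = simplify_ann(ann)  # any ANN simplification procedure
    return T(simplified_ann)            # back to the symbolic representation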
In this paper, we use the equivalence T (ANN) = APFRB to extract the
knowledge learned by the ANN in the form of symbolic rules. We demon-
strate this approach on a benchmark problem involving the recognition of the
digits displayed by a LED device. An ANN is trained to recognize the ten
possible digits. Calculating APFRB = T(ANN) yields a symbolic descrip-
tion of the knowledge learned by the ANN. To increase the comprehensibility
of this FRB we simplify the rules.
The final result is an FRB with ten rules that correctly classifies all the
training examples. Furthermore, this FRB is tractable and provides a comprehensible representation of the ANN's functioning. For example, it is pos-
sible to deduce that the ANN learned to focus its efforts on the digits that
are harder to recognize. The notion that digits that are more difficult to
recognize deserve more attention is quite intuitive. However, understanding
that the ANN implements this notion by observing its weights and biases is
all but impossible. It is only through the knowledge extraction process that
this notion emerges.
The rest of this paper is organized as follows. In Section 2, we briefly
review the APFRB and its equivalence to an ANN. In Section 3, we present
the benchmark problem and the ANN trained to solve it. In Section 4, we
apply the equivalence (1) to extract information from the trained ANN. The
resulting APFRB is simplified in Section 5. In Section 6, we show that the
simplified FRB is tractable and allows us to represent the ANN’s functioning
in a comprehensible form. The final section concludes.
2 All-Permutations Fuzzy Rule-Base
For the sake of completeness, we briefly review the APFRB and its equiva-
lence to an ANN. More details and the proofs of all the results can be found
in [17]. For the sake of simplicity, we consider the case of an FRB with
output f ∈ R2; the generalization to the case f ∈ Rn is straightforward.
Definition 1 (APFRB) A fuzzy rule-base with inputs x1, . . . , xm and out-
put f ∈ R2 is called an APFRB if the following conditions hold.
1. Every input variable $x_i$ is characterized by two linguistic terms: term$^i_-$ and term$^i_+$. The membership functions $\mu^i_-(\cdot)$ and $\mu^i_+(\cdot)$ that model these terms satisfy the following constraint: there exists a $v_i \in \mathbb{R}$ such that
$$\frac{\mu^i_+(y) - \mu^i_-(y)}{\mu^i_+(y) + \mu^i_-(y)} = \tanh(y - v_i), \quad \forall\, y \in \mathbb{R}. \tag{2}$$
2. The form of every rule is
If $x_1$ is term$^1_{+/-}$ and $x_2$ is term$^2_{+/-}$ . . . and $x_m$ is term$^m_{+/-}$
Then $f = \begin{pmatrix} a_0 \pm a_1 \pm a_2 \cdots \pm a_m \\ b_0 \pm b_1 \pm b_2 \cdots \pm b_m \end{pmatrix}$, (3)
where term$^i_{+/-}$ stands for either term$^i_-$ or term$^i_+$, ± stands for either the plus or the minus sign, and $a_i, b_i \in \mathbb{R}$. The actual signs in the Then-part are determined in the following manner: if the term characterizing $x_i$ in the If-part is term$^i_+$, then in the Then-part, $a_i$ and $b_i$ appear with a plus sign; otherwise $a_i$ and $b_i$ appear with a minus sign.
3. The rule-base contains exactly $2^m$ rules spanning, in their If-part, all the possible assignment combinations of $x_1, \ldots, x_m$.
Several commonly used fuzzy membership functions satisfy the constraint (2). For example, the pair of Gaussian membership functions
$$\mu_{=k}(y) := \exp\!\left(-\frac{(y-k)^2}{2k}\right) \quad \text{and} \quad \mu_{=-k}(y) := \exp\!\left(-\frac{(y+k)^2}{2k}\right), \tag{4}$$
satisfy (2) with v = 0. The sigmoid functions
$$\mu_{>k}(y) := \bigl(1+\exp(-2(y-k))\bigr)^{-1} \quad \text{and} \quad \mu_{<k}(y) := \bigl(1+\exp(2(y-k))\bigr)^{-1} \tag{5}$$
satisfy (2) with v = k, since $\mu_{>k}(y) - \mu_{<k}(y) = \tanh(y-k)$ and $\mu_{>k}(y) + \mu_{<k}(y) = 1$.
Theorem 1 [17] Inferencing the APFRB, using the singleton fuzzifier, product inference, and center-of-gravity defuzzification, yields
$$f = \begin{pmatrix} a_0 + \sum_{i=1}^{m} a_i \tanh(x_i - v_i) \\ b_0 + \sum_{i=1}^{m} b_i \tanh(x_i - v_i) \end{pmatrix}. \tag{6}$$
Since tanh is a standard activation function, Theorem 1 implies that an APFRB's outputs f can be represented as the output of a standard ANN.
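As a quick numerical sanity check (our own sketch, in Python), one can verify the constraint (2) for the pairs (4) and (5):

import numpy as np

# Check that the Gaussian pair (4) satisfies (2) with v = 0 and that the
# sigmoid pair (5) satisfies (2) with v = k.
y, k = np.linspace(-3.0, 3.0, 13), 1.5
gp, gm = np.exp(-(y - k) ** 2 / (2 * k)), np.exp(-(y + k) ** 2 / (2 * k))
sp, sm = 1 / (1 + np.exp(-2 * (y - k))), 1 / (1 + np.exp(2 * (y - k)))
print(np.allclose((gp - gm) / (gp + gm), np.tanh(y)))      # True: v = 0
print(np.allclose((sp - sm) / (sp + sm), np.tanh(y - k)))  # True: v = k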
Conversely, consider an ANN with input $z \in \mathbb{R}^n$, a single hidden layer with m units, and two output units. Its output $f \in \mathbb{R}^2$ is given by
$$f = \begin{pmatrix} \sum_{j=1}^{m} c_j\, h(y_j + \theta_j) \\ \sum_{j=1}^{m} d_j\, h(y_j + \theta_j) \end{pmatrix}, \tag{8}$$
where $y_j := \sum_{i=1}^{n} w_{ji} z_i$ is the input to the jth neuron in the hidden layer, $\theta_j$ is the bias of this neuron, and $c_j$ ($d_j$) is the weight from this neuron to the first (second) output neuron. Comparing (8) with (6) yields the following.
Corollary 1 If the activation function in the ANN is $h(z) = \tanh(z)$, then (8) is the output of an APFRB with: $a_0 = 0$, $b_0 = 0$, $a_i = c_i$, $b_i = d_i$, $v_i = -\theta_i$, $x_i = y_i$, $i = 1, \ldots, m$.
If $h(z) = 1/(1+\exp(-z))$, then (8) is the output of an APFRB with: $a_0 = \sum_{i=1}^{m} a_i$, $b_0 = \sum_{i=1}^{m} b_i$, $a_i = c_i/2$, $b_i = d_i/2$, $v_i = -\theta_i/2$, and $x_i = y_i/2$, $i = 1, \ldots, m$.
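For concreteness, here is a minimal Python sketch (ours, not the paper's MATLAB implementation) of the tanh case of Corollary 1, together with a brute-force center-of-gravity inference over all $2^m$ rules that confirms the equivalence numerically; the Gaussian terms are those of (4) with k = 1, and all names are our own.

import numpy as np

def ann_to_apfrb(W, theta, C):
    # T(ANN) = APFRB for h = tanh (Corollary 1): a_0 = 0, a_i = c_i, v_i = -theta_i.
    a0 = np.zeros((C.shape[0], 1))
    return np.hstack([a0, C]), -theta

def apfrb_infer(W, a, v, z):
    # Center-of-gravity inference over all 2^m rules; the terms "equals +1"
    # and "equals -1" are the Gaussians of (4) with k = 1, shifted by v_j.
    x = W @ z                                  # rule inputs x_j
    m = len(v)
    num, den = 0.0, 0.0
    for bits in range(2 ** m):                 # one iteration per rule
        s = np.array([1.0 if (bits >> j) & 1 else -1.0 for j in range(m)])
        t = np.exp(-np.sum((x - v - s) ** 2) / 2)   # degree of firing
        num = num + t * (a[:, 0] + a[:, 1:] @ s)    # Then-part of this rule
        den = den + t
    return num / den

# Numerical check of the equivalence on random parameters:
rng = np.random.default_rng(0)
W, theta, C = rng.normal(size=(4, 7)), rng.normal(size=4), rng.normal(size=(10, 4))
z = rng.normal(size=7)
a, v = ann_to_apfrb(W, theta, C)
print(np.allclose(C @ np.tanh(W @ z + theta), apfrb_infer(W, a, v, z)))  # True

The check works because the constraint (2) collapses the $2^m$-term weighted average into the sums appearing in (6).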
Summarizing, Theorem 1 establishes an equivalence between a single-hidden-layer ANN¹ and an APFRB, and explicitly defines the transforma-
tion T in (1). Furthermore, Corollary 1 implies that we can immediately
extract the knowledge embedded in an ANN in the form of fuzzy If-Then
rules. The rest of this paper is devoted to demonstrating the usefulness of
this approach using a benchmark problem.
¹The equivalence is easily generalized to ANNs with multiple hidden layers; see [17].
3 The LED Display Recognition Problem
The LED display recognition problem [8, Chapter 2] concerns learning to recognize digits displayed using a seven-segment light-emitting diode (LED) display. Several pattern recognition algorithms were applied to this problem
including classification trees [8], instance-based learning algorithms [1], and
ANNs [7].
The input to the learning algorithm is a set of supervised examples in the
form (z1, z2, . . . , z24, v). The first seven inputs, z1 . . . z7, are the states of the
seven diodes (1 for on and 0 for off) of the LED display (see Fig. 1). For exam-
ple, the vector {1, 1, 0, 1, 1, 1, 1} represents the digit 6, and {1, 1, 1, 1, 0, 1, 1}
the digit 9. The value v ∈ {0, 1, . . . , 9} is the displayed digit. The in-
puts z8 . . . z24 are independent random variables with prob(zi = 0) = prob(zi =
1) = 1/2. These noise inputs make the recognition task more challenging as
the classification algorithm must also learn to discriminate between mean-
ingful and useless inputs.
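To make the setup concrete, the following Python sketch generates such examples. The seven-segment encodings follow the segment ordering of Fig. 1 and are the standard ones; they agree with the examples given above for the digits 6 and 9, but for the remaining digits they are our assumption.

import numpy as np

# Seven-segment encodings in the ordering of Fig. 1 (z1 top; z2, z3 upper
# left/right; z4 middle; z5, z6 lower left/right; z7 bottom).
DIGITS = {
    0: (1, 1, 1, 0, 1, 1, 1), 1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1), 3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0), 5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1), 7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1), 9: (1, 1, 1, 1, 0, 1, 1),
}

def make_examples(rng, per_digit=205):
    # Each example: 7 segment bits followed by 17 fair-coin noise bits.
    Z, V = [], []
    for v, seg in DIGITS.items():
        for _ in range(per_digit):
            noise = rng.integers(0, 2, size=17)   # prob(0) = prob(1) = 1/2
            Z.append(np.concatenate([seg, noise]))
            V.append(v)
    return np.array(Z), np.array(V)

Z, V = make_examples(np.random.default_rng(1))
print(Z.shape, V.shape)   # (2050, 24) (2050,)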
We trained a 24-6-10 ANN using the backpropagation algorithm and a set of 2050 supervised examples.² Each of the ten outputs $f_0, \ldots, f_9$ corre-
sponds to a different digit and the final classification is based on the winner-
takes-all approach. That is, the ANN's classification is digit i, where $i := \arg\max_{0 \le k \le 9}\{f_k\}$.
After training, the ANN correctly classified all the input examples. The
²More details on the training process can be found in the appendix.
Figure 1: The LED device. (Segment layout: z1 top; z2, z3 upper left/right; z4 middle; z5, z6 lower left/right; z7 bottom.)
values of the parameters of the trained ANN (204 weights and 16 biases) can
be found in the appendix. These values, however, do not provide any insight
into the ANN’s functioning.
The network contains 220 free parameters and is definitely not a very large network. Nevertheless, the problem addressed in this paper is not the design or training of ANNs, but rather the interpretation of the performance of a
trained ANN. In this respect, it is interesting to compare the size of this ANN
to other examples used to demonstrate knowledge extraction algorithms.
In [18], fuzzy rules were extracted from a self-organizing fuzzy neural network
with up to 10 neurons and 60 parameters. In [24], interpretable fuzzy models
were extracted from two neuro-fuzzy networks, the first with 28 and the
second with 92 parameters. In [13], fuzzy rule extraction from a trained neural network was demonstrated using a network with 36 parameters. In [4],
an extraction method was demonstrated using a network with 36 parameters.
The above examples clearly indicate that extracting rules from a network
with 220 parameters is an interesting challenge.
4 Knowledge Extraction
Let $x_j := \sum_{i=1}^{24} w_{ji} z_i$, $j = 1, \ldots, 6$, denote the input of the jth hidden neuron of the ANN. We applied Corollary 1 to represent the ANN as an APFRB with $2^6 = 64$ fuzzy rules and an output $f \in \mathbb{R}^{10}$. For example, one of the rules of this APFRB is:³
If x1 equals 1 and x2 equals 1 and x3 equals −1 and x4 equals 1 and x5
equals −1 and x6 equals 1
Then f = (−1.4,−2.5,−0.6,−0.7,−0.2,−0.5,−11,−1.4, 0, 0.4)T .
The membership functions defining the terms equals 1 and equals −1 are
$$\mu_{=1}(y) = \exp\!\left(-(y-1)^2/2\right) \quad \text{and} \quad \mu_{=-1}(y) = \exp\!\left(-(y+1)^2/2\right).$$
The inferencing amounts to computing a weighted sum, f, of the sixty-four vectors in the Then-part of the rules, and the final digit classification is $i := \arg\max_{0 \le k \le 9}\{f_k\}$, where $f_k$ is the kth entry of f.
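In code, the extracted classifier is simply the inference sketch of Section 2 followed by a winner-takes-all step (again a sketch; apfrb_infer is the helper assumed there):

def classify(W, a, v, z):
    # Winner-takes-all over the APFRB output f_0, ..., f_9.
    f = apfrb_infer(W, a, v, z)
    return int(np.argmax(f))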
This rule set provides a complete symbolic representation of the ANN’s
functioning. In other words, we have a fuzzy classifier that solves the LED
recognition problem. However, the comprehensibility of this classifier is hin-
dered by the large number and the complexity of the rules. To gain more
insight, we must simplify the APFRB.
³The numerical values were rounded to one decimal digit, without affecting the classification accuracy.
5 APFRB Simplification
Simplification can be executed at the network level, by reducing or clustering
nodes and connections. However, the simplification process becomes easier
when the knowledge is represented in a symbolic and comprehensible manner.
Thus, by shifting from the ANN domain to the symbolic fuzzy domain, we
are able to simplify the knowledge more easily.
In this section, we apply two simplification stages to the APFRB.
5.1 Term Simplification
Recall that the jth input of the APFRB is $x_j = \sum_{i=1}^{n} w_{ji} z_i$. If $|w_{ji} z_i|$ is sufficiently small, then it will have a negligible effect on the APFRB's output. In our case, all the $z_i$'s are in the same range, so we can delete the term $w_{jk} z_k$ if $|w_{jk}|$ is small. Examining the $w_{ji}$ values, we find that for all j:
$$\frac{\min_{1 \le i \le 7} |w_{ji}|}{\max_{8 \le i \le 24} |w_{ji}|} > 240.$$
Thus, we set $w_{ji} = 0$ for all $i \ge 8$, $j = 1, \ldots, 6$.
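A one-function Python sketch of this step; the cutoff ratio is our own illustrative choice, and with the 240-fold gap above any threshold in between yields the same result.

def prune_terms(W, keep_ratio=100.0):
    # Zero out, in each row, weights keep_ratio times smaller (in absolute
    # value) than that row's largest weight.
    W = W.copy()
    cutoff = np.abs(W).max(axis=1, keepdims=True) / keep_ratio
    W[np.abs(W) < cutoff] = 0.0
    return W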
At this point, each xj in the APFRB is a linear combination of only
seven inputs zi, i = 1, . . . , 7. The noise inputs z8, . . . , z24 were identified as
meaningless. Of course, this step is identical to removing weak connections
from the input layer to the hidden layer in the ANN. However, the symbolic
structure of the APFRB also allows us to perform simplification steps that
cannot be carried out in terms of the weights of the ANN.
5.2 Rule Reduction
Consider an APFRB with input $x \in \mathbb{R}^m$, $q := 2^m$ rules, and output $f \in \mathbb{R}^k$. Let $f^i$ denote the value in the Then-part of rule i, and let $t_i(x)$ denote the degree of firing (or truth value) of rule i, so that the fuzzy inferencing process yields $f(x) = u(x)/d(x)$, where
$$u(x) := \sum_{i=1}^{q} t_i(x) f^i \quad \text{and} \quad d(x) := \sum_{i=1}^{q} t_i(x). \tag{9}$$
If we modify the degree of firing of rule k to, say, $\hat{t}_k(x)$, then the modified output is
$$\hat{f}(x) = \Bigl(\hat{t}_k(x) f^k + \sum_{\substack{1 \le i \le q \\ i \ne k}} t_i(x) f^i\Bigr) \Big/ \Bigl(\hat{t}_k(x) + \sum_{\substack{1 \le i \le q \\ i \ne k}} t_i(x)\Bigr).$$
The final classification decision, obtained as $\arg\max\{f_0, \ldots, f_9\}$, will not change as long as $\arg\max\{\hat{f}(x)\} = \arg\max\{f(x)\}$. It is easy to verify that this is equivalent to
$$\arg\max\{u(x) - (t_k(x) - \hat{t}_k(x)) f^k\} = \arg\max\{u(x)\}.$$
Note that deleting rule k from the APFRB altogether amounts to setting $\hat{t}_k(x) \equiv 0$.
Let R(l, j) = 1 (R(l, j) = 0) denote that rule l is significant (insignificant) when classifying digit j; R(l, j) is initialized as one for all $l = 1, \ldots, q$ and $j = 0, \ldots, 9$. Denote the training set by D, and let $(f^k)_j$ be the jth element of the output vector of rule k. Then, it follows from the analysis above that if $p_{kj} := \max_{x \in D} t_k(x) (f^k)_j$ is small, then the kth rule has a small effect on classifying digit j. This motivates the following procedure:
While (there is no index l such that R(l, j) = 0 for all j)
  For j = 0 to 9
    Q ← {k | R(k, j) = 1}   /* rules in Q are significant for digit j */
    k∗ ← arg min_{k∈Q} p_{kj}
    R(k∗, j) ← 0   /* mark rule k∗ as insignificant for digit j */
  EndFor
EndWhile
Output(l)
This procedure outputs an index l such that rule l has a small effect on
classifying all the ten digits. If removing rule l from the rule-base does
not change the classification for all the training examples, then this rule is
deleted.
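The procedure is easy to state as runnable code. In the following sketch (ours), dof[k, x] holds the degree of firing $t_k(x)$ of rule k on training example x, and F[k, j] holds $(f^k)_j$; both layouts are our assumption.

import numpy as np

def candidate_rule(dof, F):
    # Return an index l such that rule l has a small effect on classifying
    # all ten digits, following the marking procedure above.
    q = F.shape[0]
    P = dof.max(axis=1)[:, None] * F      # p_kj = max_{x in D} t_k(x) (f^k)_j
    R = np.ones((q, 10), dtype=bool)      # R[k, j]: rule k significant for digit j
    while not (~R).all(axis=1).any():     # until some rule is insignificant for all j
        for j in range(10):
            Q = np.flatnonzero(R[:, j])   # rules still significant for digit j
            R[Q[np.argmin(P[Q, j])], j] = False
    return int(np.flatnonzero((~R).all(axis=1))[0])

The returned rule is actually deleted only if the classification of every training example is unchanged, exactly as in the text.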
Applying this procedure repeatedly leads to the deletion of 54 rules. We
are left with a set of ten rules that correctly classify the training set. These
rules are:
• Rule 0: If x1 equals −1 and x2 equals −1 and x3 equals −1 and
x4 equals −1 and x5 equals 1 and x6 equals 1
Then f = (1.3, 0.1,−1.6,−2,−1.4,−0.1,−0.8,−1.2,−1.2,−0.9)T
• Rule 1: If x1 equals −1 and x2 equals −1 and x3 equals 1 and
x4 equals −1 and x5 equals 1 and x6 equals −1
Then f = (−0.1, 0.9,−1.6,−1.1,−1.5,−0.9, 0.2, 0,−1.5,−2.3)T
• Rule 2: If x1 equals 1 and x2 equals −1 and x3 equals 1 and
x4 equals −1 and x5 equals −1 and x6 equals 1
Then f = (−1.1, 0.2, 0.6,−1.5,−0.1,−1.3,−1.7,−0.5,−1.5,−1.1)T
• Rule 3: If x1 equals 1 and x2 equals 1 and x3 equals 1 and x4 equals −1
and x5 equals 1 and x6 equals −1
Then f = (−1.4,−1.1,−0.5, 1.4,−1.4,−1.6,−0.8,−1.4,−0.2,−1.1)T
• Rule 4: If x1 equals 1 and x2 equals 1 and x3 equals 1 and x4 equals 1
and x5 equals −1 and x6 equals 1
Then f = (−2.3,−2,−0.1,−0.4, 0,−0.1,−0.7,−1.5,−0.8,−0.1)T
• Rule 5: If x1 equals −1 and x2 equals 1 and x3 equals 1 and x4 equals 1
and x5 equals 1 and x6 equals 1
Then f = (−0.7,−1.7,−1.8,−0.7,−1, 1.3, 0.6,−2.2,−1.3,−0.5)T
• Rule 6: If x1 equals −1 and x2 equals 1 and x3 equals 1 and
x4 equals −1 and x5 equals −1 and x6 equals 1
Then f = (−0.6,−0.5,−0.8,−2.8,−0.1,−0.4, 1,−1.5,−0.5,−1.7)T
• Rule 7: If x1 equals 1 and x2 equals −1 and x3 equals −1 and
x4 equals 1 and x5 equals −1 and x6 equals −1
Then f = (−1.5,−0.8,−1.1,−0.8,−0.7,−1.5,−1.4, 1.1,−0.6,−0.7)T
• Rule 8: If x1 equals 1 and x2 equals 1 and x3 equals −1 and
x4 equals −1 and x5 equals −1 and x6 equals −1
Then f = (−1.2,−1.3,−0.6,−0.6,−0.7,−2.6,−0.7,−0.3, 1,−1.2)T
• Rule 9: If x1 equals 1 and x2 equals −1 and x3 equals −1 and
x4 equals 1 and x5 equals 1 and x6 equals 1
Then f = (−0.3,−1.4,−0.9, 0.4,−1.3, 0,−2.4,−1.2,−1.5, 0.7)T
Evidently, this FRB is much simpler than the original one, as the number
of rules is reduced from 64 to 10. Furthermore, this FRB is simple enough
to allow us to interpret its functioning.
6 Interpreting the FRB
The symbolic structure of FRBs makes them much easier to understand than
ANNs. In particular, we can analyze the operation of an FRB by understanding the If-part and the Then-part of each rule.
6.1 The If-part
To understand the If-part of the ten rules, we consider their degree of firing (DOF) for each possible input (namely, the $2^7 = 128$ possible binary vectors $(z_1, \ldots, z_7)$). The ratio between the highest DOF and the second highest DOF for the ten rules is:
930%, 1,270%, 340%, 150%, 540%, 540%, 450%, 230%, 1,940%, and 240%,
respectively. Thus, with the exception of Rule 3 (recall that the rules are
numbered from Rule 0 to Rule 9), every rule is tuned to a single specific
input pattern and yields a much smaller DOF for any other pattern.
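This analysis is a small computation. Assuming dofs holds one rule's DOF over the 128 binary patterns (computed, e.g., with the degree-of-firing expression from the Section 2 sketch), the reported ratio is:

def dof_ratio(dofs):
    # Ratio (in percent) between the highest and second-highest DOF of a rule.
    second, best = np.sort(dofs)[-2:]
    return 100.0 * best / second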
Fig. 2 depicts the pattern yielding the highest DOF for each rule. It may
be seen that rules 1, 5, 6 and 8 are tuned to recognize the digits 1, 5, 6 and 8,
respectively. Rules 0 and 7 are tuned to patterns that are at a Hamming distance of one from the actual digits 0 and 7.
If we compare the DOF only for the ten patterns representing the dig-
its 0, . . . , 9, then we find that rules 2 and 3 have the highest DOF when the
input is the digit one, and rule 4 has the highest DOF when the input is the
digit five. For every other rule i, the highest DOF is attained when the input is digit i.
6.2 The Then-part
Considering the output vectors $f^i$, $i = 0, \ldots, 9$, we see that $\arg\max_k (f^i)_k = i$ for all i. In other words, if only rule i fired, then the inferencing would yield
digit i. In most rules, there is a considerable difference between entry i and the second-largest entry of $f^i$. In five of the ten rules, the largest entry is
Figure 2: The pattern yielding maximal DOF for each rule (ten panels, Rule 0 through Rule 9).
positive and the other nine entries are negative. Thus, when such a rule fires
it not only contributes to the classification towards a specific digit, but also
contributes negatively to all other possible classifications.
6.3 Explaining the FRB
Summarizing, we see that the FRB includes seven rules that are tuned to
a specific digit. These are rules 0, 1, 5, 6, 7, 8, and 9. Each of these rules
responds with a high DOF when the input is the appropriate digit.
On the other hand, rules 2, 3 and 4 are not tuned to the corresponding
digit. For example, rule 2 displays the highest DOF when the input is the
digit 1. The fact that the rule-base correctly classifies all ten digits, including
the digits 2, 3 and 4, is due to the weighted combination of all the rules’
outputs, and not to the specific action of a single rule.
This behavior motivated us to try and understand the distinction between
the two sets of digits
S1 := {0, 1, 5, 6, 7, 8, 9} and S2 := {2, 3, 4}. (10)
Let $H(d_1, d_2)$ denote the Hamming distance between the LED representations of the digits $d_1$ and $d_2$ (for example, H(1, 7) = 1). Let $M_i$ denote the set of digits d that satisfy $\min_{d' \ne d} H(d, d') = i$ (that is, the digit closest to d is at a distance i from d). Then,
$$M_1 = \{0, 1, 3, 5, 6, 7, 8, 9\} \quad \text{and} \quad M_2 = \{2, 4\}. \tag{11}$$
It is clear from the definition of $M_i$ that digits in the set $M_1$ are more difficult to recognize correctly than those in the set $M_2$.
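The sets in (11) can be verified directly against the segment encodings, reusing the DIGITS table from the data-generation sketch in Section 3 (recall that those encodings are our assumption beyond the digits 6 and 9):

def hamming(d1, d2):
    # Hamming distance between the LED representations of two digits.
    return sum(a != b for a, b in zip(DIGITS[d1], DIGITS[d2]))

M = {d: min(hamming(d, e) for e in DIGITS if e != d) for d in DIGITS}
print(sorted(d for d in M if M[d] == 1))   # [0, 1, 3, 5, 6, 7, 8, 9]
print(sorted(d for d in M if M[d] == 2))   # [2, 4]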
Comparing (10) with (11), we see that there is a high correspondence between $M_i$ and $S_i$. Thus, the FRB (or the original ANN) dedicates specially tuned rules to the more “tricky” digits.
7 Conclusions
The output of a feed-forward ANN can be represented as the result of infer-
encing a fuzzy rule base with a special structure: the APFRB. This equiva-
lence allows the bi-directional flow of information between the subsymbolic
knowledge representation in the ANN and the symbolic rules of the APFRB.
In this paper, we studied one application of this equivalence. The trans-
formation APFRB = T (ANN) extracts the knowledge learned by the trained
ANN in the form of symbolic fuzzy rules. We demonstrated this approach
using a medium-size ANN trained to solve a benchmark problem. The 24-
6-10 network was transformed into a set of 64 fuzzy rules. Simplification of
this rule-set led to a comprehensible representation of the ANN’s function-
ing. For example, it is possible to conclude that the ANN dedicates special
rules to digits that are more difficult to recognize.
Appendix
We generated 205 examples for each digit in the form (z1, . . . , z24, v) where zi ∈
{0, 1}, i = 1, . . . , 7, is the LED’s state, v is the correct classification, and z8, . . . , z24
are independent random variables with prob(0) = prob(1) = 1/2. Thus, the
complete training set contained 2050 supervised examples.
We used a 24-6-10 ANN. The number of hidden units was determined
using a trial-and-error approach. ANNs with fewer than six hidden neurons were not able to correctly classify all the training examples. A similar network
was used in [7]. The hidden neurons employ the hyperbolic tangent activation
function and the ten output neurons are linear. The classification is based on
the winner-takes-all paradigm. The only preprocessing done was converting
the zis from {0, 1} to {−1, 1}.
We implemented the ANN using MATLAB’s Neural Networks Toolbox.
The network’s parameters were initialized using the “init” command with
the “net.layers{i}.initFcn” set to “initnw” (the Nguyen-Widrow initialization
algorithm [23]).
The training was performed using the “trainlm” command (Levenberg-
Marquardt backpropagation). A regularization factor (the squared sum of
the weights) was added to the error function by setting the option “net.performFcn”
to “msereg”.
After training, the network correctly classified all the 2050 examples. The
final values of the weights and biases are given below,⁴ where $W \in \mathbb{R}^{6 \times 24}$, $\Theta \in \mathbb{R}^6$, $C \in \mathbb{R}^{10 \times 6}$, and $\Phi \in \mathbb{R}^{10}$ are the input-to-hidden weight matrix, the hidden neurons' biases, the hidden-to-output weight matrix, and the output neurons' biases, respectively.
W =
0.23 −0.04 0.45 0.30 −0.17 −0.52 −0.14 . . .
−1.31 −0.25 −0.06 0.77 0.70 0.73 1.07 . . .
−1.09 −2.05 −1.86 1.58 0.60 −0.15 −0.63 . . .
2.99 0.59 −0.17 0.40 −0.79 1.08 −2.50 . . .
−0.57 −2.02 −0.25 −0.65 −0.09 2.08 2.90 . . .
−0.49 0.89 0.02 −0.44 −0.62 −1.65 0.55 . . .
(the maximal value of $|w_{ij}|$, $\forall j \in \{8, \ldots, 24\}$, is 8.3E−5 and, therefore, this part of the matrix is omitted).
⁴The numerical values were rounded to two decimal digits, without affecting the classification accuracy.
Θ = (0.33,−0.59, 1.63,−2.20,−1.90, 1.59)T ,
C =
−0.43 −0.22 −0.43 −0.38 0.34 0.28
−0.32 −0.69 0.24 −0.43 −0.14 −0.18
0.62 −0.07 0.25 −0.31 −0.22 0.27
0.95 0.32 0.13 0.22 0.85 −0.28
0.02 0.05 0.12 0.01 −0.49 0.20
−0.38 0.05 0.21 0.57 0.28 0.47
−0.89 0.43 0.21 0.07 −0.24 −0.26
−0.10 −0.59 −0.04 0.12 −0.49 −0.66
0.07 0.59 −0.42 −0.23 −0.17 −0.27
0.46 0.13 −0.26 0.35 0.28 0.44
and Φ = (−0.74,−0.78,−1.13,−0.92,−0.89,−0.71,−0.45,−0.68,−0.74,−0.96)T .
The transformation APFRB = T (ANN) is defined for the case of unbiased
output neurons. Thus, the transformation was performed using the values
W , Θ and C only, and then Φ was added to every rule’s output.
References
[1] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning al-
gorithms,” Machine Learning, vol. 6, pp. 37–66, 1991.
[2] H. C. Anderson, A. Lotfi, and L. C. Westphal, “Comments on “Func-
tional equivalence between radial basis function networks and fuzzy in-
ference systems”,” IEEE Trans. Neural Networks, vol. 9, pp. 1529–1531,
1998.
[3] R. Andrews, J. Diederich, and A. Tickle, “Survey and critique of tech-
niques for extracting rules from trained artificial neural networks,”
Knowledge-Based Systems, vol. 8, pp. 373–389, 1995.
[4] M. Ayoubi and R. Isermann, “Neuro-fuzzy systems for diagnosis,” Fuzzy
Sets and Systems, vol. 89, pp. 289–307, 1997.
[5] J. M. Benitez, J. L. Castro, and I. Requena, “Are artificial neural net-
works black boxes?” IEEE Trans. Neural Networks, vol. 8, pp. 1156–
1164, 1997.
[6] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.:
Oxford University Press, 1995.
[7] Z. Boger and H. Guterman, “Knowledge extraction from artificial neural
networks models,” in Proc. IEEE Int. Conf. Systems, Man and Cybernetics (SMC97), Orlando, Florida, 1997, pp. 3030–3035.
[8] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and
Regression Trees. Wadsworth International Group, 1984, ch. 2.
[9] J. L. Castro, C. J. Mantas, and J. M. Benitez, “Interpretation of artificial
neural networks by means of fuzzy rules,” IEEE Trans. Neural Networks,
vol. 13, pp. 101–116, 2002.
[10] I. Cloete and J. M. Zurada, Eds., Knowledge-Based Neurocomputing.
MIT Press, 2000.
[11] D. Dubois, H. T. Nguyen, H. Prade, and M. Sugeno, “Introduction:
The real contribution of fuzzy systems,” in Fuzzy Systems: Modeling
and Control, H. T. Nguyen and M. Sugeno, Eds. Kluwer, 1998, pp.
1–17.
[12] L.-M. Fu and L.-C. Fu, “Mapping rule-based systems into neural archi-
tectures,” Knowledge Based Systems, vol. 3, pp. 48–56, 1990.
[13] S. Huang and H. Xing, “Extract intelligible and concise fuzzy rules from
neural networks,” Fuzzy Sets and Systems, vol. 132, pp. 233–243, 2002.
[14] M. Ishikawa, “Structural learning and rule discovery,” in Knowledge-
Based Neurocomputing, I. Cloete and J. M. Zurada, Eds. Kluwer, 2000,
pp. 153–206.
[15] J.-S. R. Jang and C.-T. Sun, “Functional equivalence between radial ba-
sis function networks and fuzzy inference systems,” IEEE Trans. Neural
Networks, vol. 4, pp. 156–159, 1993.
[16] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Com-
puting: A Computational Approach to Learning and Machine Intelli-
gence. Prentice-Hall, 1997.
[17] E. Kolman and M. Margaliot, “Are artificial neural networks white
boxes?” IEEE Trans. Neural Networks, vol. 16, pp. 844–852, 2005.
[18] G. Leng, T. McGinnity, and G. Prasad, “An approach for on-line extrac-
tion of fuzzy rules using a self-organizing fuzzy neural network,” Fuzzy
Sets and Systems, vol. 150, pp. 211–243, 2005.
[19] M. Margaliot and G. Langholz, New Approaches to Fuzzy Modeling and
Control - Design and Analysis. World Scientific, 2000.
[20] K. McGarry, S. Wermter, and J. MacIntyre, “Hybrid neural systems: from simple coupling to fully integrated neural networks,” Neural Computing Surveys, vol. 2, pp. 62–93, 1999.
[21] S. Mitra and Y. Hayashi, “Neuro-fuzzy rule generation: survey in soft
computing framework,” IEEE Trans. Neural Networks, vol. 11, pp. 748–
768, 2000.
[22] S. Mitra and S. Pal, “Fuzzy multi-layer perceptron, inferencing and rule
generation,” IEEE Trans. Neural Networks, vol. 6, pp. 51–63, 1995.
[23] D. Nguyen and B. Widrow, “Improving the learning speed of 2-layer
neural networks by choosing initial values of the adaptive weights,” in
Proc. Int. Joint Conf. Neural Networks, vol. 3, San Diego, California,
1990, pp. 21–26.
[24] R. Paiva and A. Dourado, “Interpretability and learning in neuro-fuzzy
systems,” Fuzzy Sets and Systems, vol. 147, pp. 17–38, 2004.
[25] S. Sestito and T. Dillon, “Knowledge acquisition of conjunctive rules
using multilayered neural networks,” Int. J. Intelligent Systems, vol. 8,
pp. 779–805, 1993.
[26] R. Setiono, “Extracting rules from neural networks by pruning and
hidden-unit splitting,” Neural Computation, vol. 9, pp. 205–225, 1997.
[27] A. Tickle, R. Andrews, M. Golea, and J. Diederich, “The truth will come
to light: directions and challenges in extracting the knowledge embedded
within trained artificial neural networks,” IEEE Trans. Neural Networks,
vol. 9, pp. 1057–1068, 1998.
[28] G. G. Towell and J. W. Shavlik, “Extracting refined rules from
knowledge-based neural networks,” Machine Learning, vol. 13, pp. 71–
101, 1993.
[29] E. Tron and M. Margaliot, “Mathematical modeling of observed natural
behavior: a fuzzy logic approach,” Fuzzy Sets and Systems, vol. 146, pp.
437–450, 2004.
[30] ——, “How does the Dendrocoleum lacteum orient to light? a fuzzy
modeling approach,” Fuzzy Sets and Systems, 2005, to appear.
[31] L. A. Zadeh, “Fuzzy logic = computing with words,” IEEE Trans. Fuzzy
Systems, vol. 4, pp. 103–111, 1996.
[32] D. Zhang, X.-L. Bai, and K.-Y. Cai, “Extended neuro-fuzzy models of
multilayer perceptrons,” Fuzzy Sets and Systems, vol. 142, pp. 221–242,
2004.