
    Knowledge Extraction from Neural Networks

    using the All-Permutations Fuzzy Rule Base∗

    Eyal Kolman and Michael Margaliot†

    August 3, 2005

    Abstract

    A major drawback of artificial neural networks is their black-box character. Even when the trained network performs adequately, it is very difficult to understand its operation. In this paper, we use the mathematical equivalence between artificial neural networks and a specific fuzzy rule base to extract the knowledge embedded in the network. We demonstrate this using a benchmark problem: the recognition of digits produced by a LED device. The method provides a symbolic and comprehensible description of the knowledge learned by the network during its training.

    Keywords: Feedforward neural networks, knowledge extraction, rule extrac-

    tion, rule generation, hybrid intelligent systems, neuro-fuzzy systems.

    1 Introduction

    The ability of artificial neural networks (ANNs) to learn and generalize from

    examples makes them very suitable for use in numerous real-world appli-

    ∗This work was partially supported by the Tel Aviv University Internal Research Fund under grant number 05110066. An abridged version of this paper was presented at the 8th International Work-Conference on Artificial Neural Networks (IWANN’2005).

    †Corresponding author: Dr. Michael Margaliot, Dept. of Electrical Engineering-Systems, Tel Aviv University, Tel Aviv, Israel 69978. Tel: +972-3-640 7768; Fax: +972-3-640 5027; Homepage: www.eng.tau.ac.il/∼michaelm; Email: [email protected]

    cations where exact algorithmic approaches are unknown or too difficult to

    implement. The knowledge learned during the training process is distributed

    in the weights of the different neurons and, whether the ANN operates prop-

    erly or not, it is very difficult to comprehend exactly what it is computing.

    In this respect, ANNs process information in a “black-box” and subsymbolic

    level. The problem of extracting the knowledge learned by the network, and

    representing it in a comprehensible form, received a great deal of attention

    in the literature (see, e.g., [10, 3, 27]).

    Rule-based systems process information in a manner that is much easier to

    comprehend because the system’s knowledge is stated using symbolic If-Then

    rules. In particular, fuzzy rule bases (FRBs) enable the use and manipulation

    of expert knowledge stated using natural language [29, 30, 11, 31, 19]. Thus,

    the knowledge is easy to understand, verify, and, if necessary, refine.

    Recently, a great deal of research has been devoted to the design of hy-

    brid intelligent systems that fuse subsymbolic and symbolic techniques for

    information processing [20] and, in particular, to creating a synergy between

    ANNs and FRBs [21]. Such a synergy may lead to systems with the ro-

    bustness and learning capabilities of ANNs and the “white-box” character of

    FRBs.

    Understanding the operation of a trained ANN is difficult because the

    knowledge is embedded in a complex, distributed, and sometimes self-contradictory

    form [14]. A widespread heuristic approach for knowledge extraction is based

    on finding the “most effective” input-output paths, and then transforming

    them into symbolic rules (see, e.g., [25, 7, 22]). Ishikawa [14] incorporates

    regularization terms that punish large weights into the error function used

    in the training phase. This forces the ANN to develop a more condensed

    and, therefore, easier to understand, form of knowledge representation. To

    extract the knowledge, Ishikawa represents each hidden and output unit as

    a Boolean function of outputs from previous layers. Forcing the ANN to

    produce a skeletal structure seems to be a very useful technique in gen-

    eral [26, 6], however, extracting knowledge in the form of Boolean functions

    is not appropriate for general ANNs.

    Other approaches are based on an attempt to develop an equivalence

    between ANNs and rule-based systems. Fu and Fu [12] mapped rule-based

    systems into a neural-like architecture: final hypotheses are represented using

    output neurons; data attributes become input neurons; and the strength of

    the rule is mapped into the weight of the corresponding connection. This al-

    lowed back-propagation-like learning for modifying rule strengths. However,

    this approach cannot be used to extract knowledge from a standard ANN.

    Towell and Shavlik [28] introduced the Knowledge-Based Artificial Neu-

    ral Network (KBANN) algorithm that transforms a knowledge base into an

    ANN. This can be used to insert initial domain knowledge into an ANN and

    thus reduce the training time and improve the chances for a global minima

    convergence. After training, a heuristic method is used to extract symbolic

    rules. However, the extracted rules are Boolean and the inherent “fuzziness”

    of the ANN is lost.

    A well-known neuro-fuzzy model is the Adaptive Network-Based Fuzzy

    Inference System (ANFIS) developed by Jang et al. [16], which is a feed-

    forward network representation of the fuzzy reasoning process. However,

    this representation is not a standard ANN because nodes in different layers

    perform different tasks corresponding to the different stages in the fuzzy rea-

    soning process. For example, nodes in the first layer compute membership

    function values, whereas nodes in the second layer perform T-norm opera-

    tions.

    Jang and Sun [15] noted that the local activation functions of radial

    basis function networks (RBFNs) are the Gaussian membership functions

    frequently used in FRBs. They used this to extract a set of fuzzy rules that

    are mathematically equivalent to the RBFN. However, this equivalence holds

    only for RBFNs and it also requires that each membership function will be

    used by no more than one rule [2].

    Benitez et al. [5] showed that ANNs with Logistic activation functions

    are equivalent to the result of inferencing a set of Mamdani-type fuzzy rules

    (see also [9, 32]). However, this is not a standard FRB as the operators used

    in the inferencing method are not those commonly used in FRBs.

    Recently, the authors introduced a new Mamdani-type FRB referred to

    as the All-Permutations Fuzzy Rule Base (APFRB) [17]. Inferencing the

    APFRB, using standard tools from fuzzy logic theory, yields an input-output

    relationship that is mathematically equivalent to that of a feed-forward ANN.

    More precisely, there exists an invertible transformation T such that

    T (ANN) = APFRB and T−1(APFRB) = ANN. (1)

    This equivalence enables bidirectional flow of information between the

    ANN and the corresponding APFRB. It also enables the application of tools

    from the theory of ANNs to APFRBs and vice versa. For example, given a

    procedure for simplifying an ANN, we can immediately apply it to simplify

    an APFRB as follows. Given an initial APFRB, calculate the equivalent

    network as ANN = T−1(APFRB). Apply the simplification procedure to

    this ANN and denote the result by ANN’. Then, APFRB’ := T (ANN’) is a

    simplified version of the original APFRB.

    In this paper, we use the equivalence T (ANN) = APFRB to extract the

    knowledge learned by the ANN in the form of symbolic rules. We demon-

    strate this approach on a benchmark problem involving the recognition of the

    digits displayed by a LED device. An ANN is trained to recognize the ten

    possible digits. Calculating APFRB = T (ANN) yields a symbolic descrip-

    tion of the knowledge learned by the ANN. To increase the comprehensibility

    of this FRB we simplify the rules.

    The final result is an FRB with ten rules that correctly classifies all the

    training examples. Furthermore, this FRB is tractable, and provides a com-

    prehensible representation of the ANN's functioning. For example, it is pos-

    sible to deduce that the ANN learned to focus its efforts on the digits that

    are harder to recognize. The notion that digits that are more difficult to

    recognize deserve more attention is quite intuitive. However, understanding

    that the ANN implements this notion by observing its weights and biases is

    all but impossible. It is only through the knowledge extraction process that

    this notion emerges.

    The rest of this paper is organized as follows. In section 2, we briefly

    review the APFRB and its equivalence to an ANN. In Section 3, we present

    the benchmark problem and the ANN trained to solve it. In Section 4, we

    apply the equivalence (1) to extract information from the trained ANN. The

    resulting APFRB is simplified in Section 5. In Section 6, we show that the

    simplified FRB is tractable and allows us to represent the ANN’s functioning

    in a comprehensible form. The final section concludes.

    2 All-Permutations Fuzzy Rule-Base

    For the sake of completeness, we briefly review the APFRB and its equiva-

    lence to an ANN. More details and the proofs of all the results can be found

    in [17]. For the sake of simplicity, we consider the case of an FRB with

    output f ∈ R2; the generalization to the case f ∈ Rn is straightforward.

    Definition 1 (APFRB) A fuzzy rule-base with inputs x1, . . . , xm and out-

    put f ∈ R2 is called an APFRB if the following conditions hold.

    1. Every input variable x_i is characterized by two linguistic terms: term^i_- and term^i_+. The membership functions µ^i_-(·) and µ^i_+(·) that model these terms satisfy the following constraint: there exists a v_i ∈ R such that

    $$\frac{\mu^i_+(y) - \mu^i_-(y)}{\mu^i_+(y) + \mu^i_-(y)} = \tanh(y - v_i), \qquad \forall\, y \in \mathbb{R}. \tag{2}$$

    2. The form of every rule is

    $$\text{If } x_1 \text{ is } term^1_{+/-} \text{ and } x_2 \text{ is } term^2_{+/-} \ \dots \ \text{and } x_m \text{ is } term^m_{+/-}$$

    $$\text{Then } f = \begin{pmatrix} a_0 \pm a_1 \pm a_2 \cdots \pm a_m \\ b_0 \pm b_1 \pm b_2 \cdots \pm b_m \end{pmatrix}, \tag{3}$$

    where term^i_{+/-} stands for either term^i_- or term^i_+, ± stands for either the plus or the minus sign, and a_i, b_i ∈ R. The actual signs in the Then-part are determined in the following manner: if the term characterizing x_i in the If-part is term^i_+, then in the Then-part, a_i and b_i appear with a plus sign; otherwise, a_i and b_i appear with a minus sign.

    3. The rule-base contains exactly 2^m rules spanning, in their If-part, all the possible assignment combinations of x_1, . . . , x_m.
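    As a small illustration of Definition 1, an APFRB with m = 2 inputs consists of exactly 2^2 = 4 rules:

    $$\begin{aligned}
    &\text{If } x_1 \text{ is } term^1_- \text{ and } x_2 \text{ is } term^2_- \text{ Then } f = (a_0 - a_1 - a_2,\ b_0 - b_1 - b_2)^T,\\
    &\text{If } x_1 \text{ is } term^1_- \text{ and } x_2 \text{ is } term^2_+ \text{ Then } f = (a_0 - a_1 + a_2,\ b_0 - b_1 + b_2)^T,\\
    &\text{If } x_1 \text{ is } term^1_+ \text{ and } x_2 \text{ is } term^2_- \text{ Then } f = (a_0 + a_1 - a_2,\ b_0 + b_1 - b_2)^T,\\
    &\text{If } x_1 \text{ is } term^1_+ \text{ and } x_2 \text{ is } term^2_+ \text{ Then } f = (a_0 + a_1 + a_2,\ b_0 + b_1 + b_2)^T.
    \end{aligned}$$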

    Several commonly used fuzzy membership functions satisfy the constraint (2).

    For example, the pair of Gaussian membership functions

    $$\mu_{=k}(y) := \exp\!\left(-\frac{(y-k)^2}{2k}\right) \quad \text{and} \quad \mu_{=-k}(y) := \exp\!\left(-\frac{(y+k)^2}{2k}\right) \tag{4}$$

    satisfy (2) with v = 0. The sigmoid functions

    $$\mu_{>k}(y) := \bigl(1 + \exp(-2(y - k))\bigr)^{-1} \quad \text{and} \quad \mu_{<k}(y) := \bigl(1 + \exp(2(y - k))\bigr)^{-1}$$

    satisfy (2) with v = k.
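    To see why the Gaussian pair (4) satisfies (2) with v = 0, multiply the numerator and denominator of the left-hand side of (2) by exp((y^2 + k^2)/(2k)):

    $$\frac{\mu_{=k}(y) - \mu_{=-k}(y)}{\mu_{=k}(y) + \mu_{=-k}(y)} = \frac{e^{y} - e^{-y}}{e^{y} + e^{-y}} = \tanh(y).$$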

    outputs f can be represented as the output of a standard ANN.

    Conversely, consider an ANN with input z ∈ R^n, a single hidden layer with m units, and two output units. Its output f ∈ R^2 is given by

    $$f = \begin{pmatrix} \sum_{j=1}^{m} c_j\, h(y_j + \theta_j) \\ \sum_{j=1}^{m} d_j\, h(y_j + \theta_j) \end{pmatrix}, \tag{8}$$

    where y_j := Σ_{i=1}^{n} w_{ji} z_i is the input to the jth neuron in the hidden layer, θ_j is the bias of this neuron, and c_j (d_j) is the weight from this neuron to the first (second) output neuron. Comparing (8) with (6) yields the following.

    Corollary 1 If the activation function in the ANN is h(z) = tanh(z), then (8) is the output of an APFRB with: a_0 = 0, b_0 = 0, a_i = c_i, b_i = d_i, v_i = -θ_i, x_i = y_i, i = 1, . . . , m.

    If h(z) = 1/(1 + exp(-z)), then (8) is the output of an APFRB with: a_0 = Σ_{i=1}^{m} a_i, b_0 = Σ_{i=1}^{m} b_i, a_i = c_i/2, b_i = d_i/2, v_i = -θ_i/2, and x_i = y_i/2, i = 1, . . . , m.
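    As a concrete illustration of the tanh case of Corollary 1, the following sketch enumerates the 2^m rules of the equivalent APFRB from the hidden-to-output weights c_j, d_j and the hidden biases θ_j (the function name and the toy weights are illustrative, not taken from the paper):

```python
import itertools
import numpy as np

def ann_to_apfrb(c, d, theta):
    """Sketch of T(ANN) = APFRB for h = tanh (Corollary 1):
    a_i = c_i, b_i = d_i, a_0 = b_0 = 0, and v_i = -theta_i.
    Returns the membership centers v and the 2^m rules; each rule is a pair
    (signs, then_part), where signs[i] = +1 selects "x_i is term_i+" and
    signs[i] = -1 selects "x_i is term_i-"."""
    c, d, theta = map(np.asarray, (c, d, theta))
    m = len(c)
    v = -theta
    rules = []
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        s = np.array(signs)
        then_part = (np.dot(s, c), np.dot(s, d))   # (a_0 ± a_1 ± ..., b_0 ± b_1 ± ...)
        rules.append((s, then_part))
    return v, rules

# toy usage (hypothetical weights, not the trained LED network):
v, rules = ann_to_apfrb(c=[0.5, -1.0], d=[1.2, 0.3], theta=[0.1, -0.4])
for s, f in rules:
    print(s, f)
```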

    Summarizing, Theorem 1 establishes an equivalence between a single1

    hidden layer ANN and an APFRB, and explicitly defines the transforma-

    tion T in (1). Furthermore, Corollary 1 implies that we can immediately

    extract the knowledge embedded in an ANN in the form of fuzzy If-Then

    rules. The rest of this paper is devoted to demonstrating the usefulness of

    this approach using a benchmark problem.

    1The equivalence is easily generalized to ANNs with multiple hidden layers; see [17].

    3 The LED Display Recognition Problem

    The LED display recognition problem [8, Chapter 2] concerns learning to

    recognize digits displayed using a seven-segment light emitting diodes (LED)

    display. Several pattern recognition algorithms were applied to this problem

    including classification trees [8], instance-based learning algorithms [1], and

    ANNs [7].

    The input to the learning algorithm is a set of supervised examples in the

    form (z1, z2, . . . , z24, v). The first seven inputs, z1 . . . z7, are the states of the

    seven diodes (1 for on and 0 for off) of the LED display (see Fig. 1). For exam-

    ple, the vector {1, 1, 0, 1, 1, 1, 1} represents the digit 6, and {1, 1, 1, 1, 0, 1, 1}

    the digit 9. The value v ∈ {0, 1, . . . , 9} is the displayed digit. The in-

    puts z8 . . . z24 are independent random variables with prob(zi = 0) = prob(zi =

    1) = 1/2. These noise inputs make the recognition task more challenging as

    the classification algorithm must also learn to discriminate between mean-

    ingful and useless inputs.

    We trained a 24-6-10 ANN using the backpropagation algorithm and a

    set of 2050 supervised examples.2 Each of the ten outputs f0, . . . , f9 corre-

    sponds to a different digit and the final classification is based on the winner-

    takes-all approach. That is, the ANN’s classification is digit i, where i :=

    arg max_{0 ≤ k ≤ 9} {f_k}.
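    A minimal sketch of the example format and the winner-takes-all decision is given below (the function names are illustrative; the segment vector used in the usage line is the encoding of the digit 6 quoted above):

```python
import random

def make_example(segments, digit):
    """One supervised example (z1, ..., z24, v): the seven segment bits of the
    displayed digit followed by seventeen noise bits with prob(0) = prob(1) = 1/2."""
    noise = tuple(random.randint(0, 1) for _ in range(17))
    return segments + noise, digit

def winner_takes_all(f):
    """The classification is digit i, where i maximizes the ten outputs f_0..f_9."""
    return max(range(10), key=lambda k: f[k])

# usage: an example displaying the digit 6
example, label = make_example((1, 1, 0, 1, 1, 1, 1), 6)
```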

    After training, the ANN correctly classified all the input examples. The

    2More details on the training process can be found in the appendix.

    [Figure 1: The LED device. The seven segments are arranged as z1 (top), z2 and z3 (upper left and right), z4 (middle), z5 and z6 (lower left and right), and z7 (bottom).]

    values of the parameters of the trained ANN (204 weights and 16 biases) can

    be found in the appendix. These values, however, do not provide any insight

    into the ANN’s functioning.

    The network contains 220 free parameters and is definitely not a very

    large network. Nevertheless, the problem addressed in this paper is not the

    designing or training of ANNs, but rather interpreting the performance of a

    trained ANN. In this respect, it is interesting to compare the size of this ANN

    to other examples used to demonstrate knowledge extraction algorithms.

    In [18], fuzzy rules were extracted from a self-organizing fuzzy neural network

    with up to 10 neurons and 60 parameters. In [24], interpretable fuzzy models

    were extracted from two neuro-fuzzy networks, the first with 28 and the

    second with 92 parameters. In [13], fuzzy rules extraction from a trained

    neural network was demonstrated using a network with 36 parameters. In [4],

    an extraction method was demonstrated using a network with 36 parameters.

    The above examples clearly indicate that extracting rules from a network

    with 220 parameters is an interesting challenge.

    4 Knowledge Extraction

    Let x_j := Σ_{i=1}^{24} w_{ji} z_i, j = 1, . . . , 6, denote the input of the jth hidden neuron of the ANN. We applied Corollary 1 to represent the ANN as an APFRB with 2^6 = 64 fuzzy rules and an output f ∈ R^10. For example, one of the rules of this APFRB is:3

    If x1 equals 1 and x2 equals 1 and x3 equals −1 and x4 equals 1 and x5 equals −1 and x6 equals 1
    Then f = (−1.4, −2.5, −0.6, −0.7, −0.2, −0.5, −11, −1.4, 0, 0.4)^T.

    The membership functions defining the terms equals 1 and equals −1 are:

    $$\mu_{=1}(y) = \exp\!\left(-(y - 1)^2/2\right) \quad \text{and} \quad \mu_{=-1}(y) = \exp\!\left(-(y + 1)^2/2\right).$$

    The inferencing amounts to computing a weighted sum, f , of the sixty-

    four vectors in the Then-part of the rules, and the final digit classification

    is i := arg max_{0 ≤ k ≤ 9} {f_k}, where f_k is the kth entry in f.
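    A sketch of this inferencing step, assuming the 64 rules are stored as (sign pattern, Then-part) pairs with the sign pattern selecting between the terms equals −1 and equals 1 for each x_j (the membership functions are the Gaussians given above; the helper name is illustrative):

```python
import numpy as np

def apfrb_infer(x, rules):
    """Sketch of the APFRB inferencing for the 64-rule LED classifier.
    x     : the six hidden-neuron inputs x_1, ..., x_6.
    rules : (signs, f_vec) pairs; signs[j] = -1 or +1 selects the term
            "equals -1" or "equals 1" for x_j, and f_vec is the
            ten-dimensional Then-part of the rule."""
    x = np.asarray(x, dtype=float)
    num, den = np.zeros(10), 0.0
    for signs, f_vec in rules:
        # degree of firing: product of the Gaussian membership values for each x_j
        dof = np.prod(np.exp(-(x - np.asarray(signs)) ** 2 / 2.0))
        num += dof * np.asarray(f_vec)
        den += dof
    f = num / den                    # weighted sum of the Then-part vectors
    return int(np.argmax(f))         # winner-takes-all classification
```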

    This rule set provides a complete symbolic representation of the ANN’s

    functioning. In other words, we have a fuzzy classifier that solves the LED

    recognition problem. However, the comprehensibility of this classifier is hin-

    dered by the large number and the complexity of the rules. To gain more

    insight, we must simplify the APFRB.

    3The numerical values were rounded to one decimal digit, without affecting the classification accuracy.

    5 APFRB Simplification

    Simplification can be executed in the network level, by reducing or clustering

    nodes and connections. However, the simplification process becomes easier

    when the knowledge is represented in a symbolic and comprehensible manner.

    Thus, by shifting from the ANN domain to the symbolic fuzzy domain, we

    are able to simplify the knowledge more easily.

    In this section, we apply two simplification stages to the APFRB.

    5.1 Term Simplification

    Recall that the jth input of the APFRB is x_j = Σ_{i=1}^{n} w_{ji} z_i. If |w_{ji} z_i| is sufficiently small then it will have a negligible effect on the APFRB's output. In our case, all the z_i's are in the same range, so we can delete the term w_{jk} z_k if |w_{jk}| is small. Examining the w_{ji} values, we find that for all j:

    $$\frac{\min_{1 \le i \le 7} |w_{ji}|}{\max_{8 \le i \le 24} |w_{ji}|} > 240.$$

    Thus, we set wji = 0 for all i ≥ 8, j = 1, . . . , 6.
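    In code, this step is a simple thresholding of the input-to-hidden weight matrix (a sketch; the function and parameter names are illustrative):

```python
import numpy as np

def prune_weak_inputs(W, n_meaningful=7, ratio_threshold=240.0):
    """Sketch of the term-simplification step: if the meaningful input weights
    dominate the noise-input weights by a large ratio in every row, zero out
    the noise columns of the 6x24 input-to-hidden weight matrix W."""
    W = np.array(W, dtype=float)
    kept = np.abs(W[:, :n_meaningful]).min(axis=1)      # min |w_ji|, i = 1..7, per row j
    dropped = np.abs(W[:, n_meaningful:]).max(axis=1)   # max |w_ji|, i = 8..24, per row j
    if np.all(kept / dropped > ratio_threshold):
        W[:, n_meaningful:] = 0.0                        # delete the terms w_jk z_k, k >= 8
    return W
```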

    At this point, each xj in the APFRB is a linear combination of only

    seven inputs zi, i = 1, . . . , 7. The noise inputs z8, . . . , z24 were identified as

    meaningless. Of course, this step is identical to removing weak connections

    from the input layer to the hidden layer in the ANN. However, the symbolic

    structure of the APFRB also allows us to perform simplification steps that

    cannot be carried out in terms of the weights of the ANN.

    5.2 Rule Reduction

    Consider an APFRB with input x ∈ R^m, q := 2^m rules, and output f ∈ R^k. Let f^i denote the value in the Then-part of rule i, and let t_i(x) denote the degree of firing (or truth value) of rule i, so that the fuzzy inferencing process yields f(x) = u(x)/d(x), where

    $$u(x) := \sum_{i=1}^{q} t_i(x) f^i \quad \text{and} \quad d(x) := \sum_{i=1}^{q} t_i(x). \tag{9}$$

    If we modify the degree of firing of rule k to, say, \hat{t}_k(x), then the modified output is

    $$\hat{f}(x) = \Bigl(\hat{t}_k(x) f^k + \sum_{\substack{1 \le i \le q \\ i \ne k}} t_i(x) f^i\Bigr) \Big/ \Bigl(\hat{t}_k(x) + \sum_{\substack{1 \le i \le q \\ i \ne k}} t_i(x)\Bigr).$$

    The final classification decision, obtained as arg max{f_0, . . . , f_9}, will not change as long as arg max{\hat{f}(x)} = arg max{f(x)}. It is easy to verify that this is equivalent to

    $$\arg\max\{u(x) - (t_k(x) - \hat{t}_k(x)) f^k\} = \arg\max\{u(x)\}.$$

    Note that deleting rule k from the APFRB altogether amounts to setting \hat{t}_k(x) ≡ 0.

    Let R(l, j) = 1 (R(l, j) = 0) denote that rule l is significant (insignificant) when classifying digit j (R(l, j) is initialized as one for all l = 1, . . . , q and j = 0, . . . , 9). Denote the training set by D, and let (f^k)_j be the jth element of the output vector of rule k. Then, it follows from the analysis above that if p_{kj} := max_{x ∈ D} t_k(x) (f^k)_j is small, then the kth rule has a small effect on classifying digit j. This motivates the following procedure:

    While (there is no index l such that R(l, j) = 0 for all j)
        For j = 0 to 9
            Q ← {k | R(k, j) = 1}      /* rules in Q are significant for digit j */
            q ← arg min_{k ∈ Q} p_{kj}
            R(q, j) ← 0                /* mark rule q as insignificant for digit j */
        EndFor
    EndWhile
    Output(l)

    This procedure outputs an index l such that rule l has a small effect on

    classifying all the ten digits. If removing rule l from the rule-base does

    not change the classification for all the training examples, then this rule is

    deleted.
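    A rough Python rendering of this pruning loop is sketched below, under the assumption that the matrix of p_kj values and a classify(rules, x) routine implementing the inferencing of Section 4 are available; none of these names come from the paper, and the sketch simply stops when a candidate deletion would change some classification:

```python
import numpy as np

def reduce_rules(rules, P, training_set, classify):
    """Sketch of the repeated rule-deletion procedure (helper names are
    illustrative). P is the q-by-10 matrix of the values p_kj; training_set
    holds (x, label) pairs, where x is the vector of APFRB inputs x_1..x_6;
    classify(rules, x) returns the digit predicted by the APFRB inferencing."""
    rules = list(rules)
    P = np.array(P, dtype=float)
    while True:
        q = len(rules)
        R = np.ones((q, 10), dtype=bool)   # R[l, j]: rule l is significant for digit j
        # mark rules insignificant, one per digit per pass, until some rule is
        # marked insignificant for all ten digits
        while R.any(axis=1).all():
            for j in range(10):
                significant = np.flatnonzero(R[:, j])
                k = significant[np.argmin(P[significant, j])]
                R[k, j] = False
        l = int(np.flatnonzero(~R.any(axis=1))[0])
        # delete rule l only if no training example changes its classification
        candidate = rules[:l] + rules[l + 1:]
        if all(classify(candidate, x) == classify(rules, x) for x, _ in training_set):
            rules = candidate
            P = np.delete(P, l, axis=0)
        else:
            break
    return rules
```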

    Applying this procedure repeatedly leads to the deletion of 54 rules. We

    are left with a set of ten rules that correctly classify the training set. These

    rules are:

    • Rule 0: If x1 equals −1 and x2 equals −1 and x3 equals −1 and

    x4 equals −1 and x5 equals 1 and x6 equals 1

    Then f = (1.3, 0.1,−1.6,−2,−1.4,−0.1,−0.8,−1.2,−1.2,−0.9)T

    • Rule 1: If x1 equals −1 and x2 equals −1 and x3 equals 1 and

    x4 equals −1 and x5 equals 1 and x6 equals −1

    Then f = (−0.1, 0.9,−1.6,−1.1,−1.5,−0.9, 0.2, 0,−1.5,−2.3)T

    • Rule 2: If x1 equals 1 and x2 equals −1 and x3 equals 1 and

    x4 equals −1 and x5 equals −1 and x6 equals 1

    Then f = (−1.1, 0.2, 0.6,−1.5,−0.1,−1.3,−1.7,−0.5,−1.5,−1.1)T

    • Rule 3: If x1 equals 1 and x2 equals 1 and x3 equals 1 and x4 equals −1

    and x5 equals 1 and x6 equals −1

    Then f = (−1.4,−1.1,−0.5, 1.4,−1.4,−1.6,−0.8,−1.4,−0.2,−1.1)T

    • Rule 4: If x1 equals 1 and x2 equals 1 and x3 equals 1 and x4 equals 1

    and x5 equals −1 and x6 equals 1

    Then f = (−2.3,−2,−0.1,−0.4, 0,−0.1,−0.7,−1.5,−0.8,−0.1)T

    • Rule 5: If x1 equals −1 and x2 equals 1 and x3 equals 1 and x4 equals 1

    and x5 equals 1 and x6 equals 1

    Then f = (−0.7,−1.7,−1.8,−0.7,−1, 1.3, 0.6,−2.2,−1.3,−0.5)T

    • Rule 6: If x1 equals −1 and x2 equals 1 and x3 equals 1 and

    x4 equals −1 and x5 equals −1 and x6 equals 1

    Then f = (−0.6,−0.5,−0.8,−2.8,−0.1,−0.4, 1,−1.5,−0.5,−1.7)T

    • Rule 7: If x1 equals 1 and x2 equals −1 and x3 equals −1 and

    x4 equals 1 and x5 equals −1 and x6 equals −1

    Then f = (−1.5,−0.8,−1.1,−0.8,−0.7,−1.5,−1.4, 1.1,−0.6,−0.7)T

    • Rule 8: If x1 equals 1 and x2 equals 1 and x3 equals −1 and

    x4 equals −1 and x5 equals −1 and x6 equals −1

    Then f = (−1.2,−1.3,−0.6,−0.6,−0.7,−2.6,−0.7,−0.3, 1,−1.2)T

    • Rule 9: If x1 equals 1 and x2 equals −1 and x3 equals −1 and

    x4 equals 1 and x5 equals 1 and x6 equals 1

    Then f = (−0.3,−1.4,−0.9, 0.4,−1.3, 0,−2.4,−1.2,−1.5, 0.7)T

    Evidently, this FRB is much simpler than the original one, as the number

    of rules is reduced from 64 to 10. Furthermore, this FRB is simple enough

    to allow us to interpret its functioning.

    6 Interpreting the FRB

    The symbolic structure of FRBs makes them much easier to understand than

    ANNs. In particular, we can analyze the operation of a FRB by understand-

    ing the If-part and the Then-part of each rule.

    6.1 The If-part

    To understand the If-part of the ten rules, we consider their degree of fir-

    ing (DOF) for each possible input (namely, the 2^7 = 128 possible binary vectors (z_1, . . . , z_7)). The ratio between the highest DOF and the second highest DOF for the ten rules is:

    930%, 1,270%, 340%, 150%, 540%, 540%, 450%, 230%, 1,940%, and 240%,

    respectively. Thus, with the exception of Rule 3 (recall that the rules are

    numbered from Rule 0 to Rule 9), every rule is tuned to a single specific

    input pattern and yields a much smaller DOF for any other pattern.
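    The DOF computation behind these ratios can be sketched as follows (illustrative only; it reuses the Gaussian membership functions of Section 4, the pruned 6-by-7 weight matrix of Section 5.1, and the ±1 encoding of the segment inputs described in the appendix):

```python
import itertools
import numpy as np

def dof_ratios(rules, W):
    """Sketch of the If-part analysis: for every rule, evaluate its degree of
    firing on all 2^7 = 128 binary segment patterns (encoded as +-1) and
    return the ratio between the highest and second-highest values.
    W is the pruned 6x7 input-to-hidden weight matrix; rules are the
    (sign pattern, Then-part) pairs used above (names are illustrative)."""
    W = np.asarray(W, dtype=float)
    patterns = [np.array(p) for p in itertools.product((-1.0, 1.0), repeat=7)]
    ratios = []
    for signs, _ in rules:
        signs = np.asarray(signs, dtype=float)
        dofs = sorted(
            (np.prod(np.exp(-(W @ z - signs) ** 2 / 2.0)) for z in patterns),
            reverse=True,
        )
        ratios.append(dofs[0] / dofs[1])
    return ratios
```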

    Fig. 2 depicts the pattern yielding the highest DOF for each rule. It may

    be seen that rules 1, 5, 6 and 8 are tuned to recognize the digits 1, 5, 6 and 8,

    respectively. Rules 0 and 7 are tuned to patterns that are one Hamming

    distance away from the real digits 0 and 7.

    If we compare the DOF only for the ten patterns representing the dig-

    its 0, . . . , 9, then we find that rules 2 and 3 have the highest DOF when the

    input is the digit one, and rule 4 has the highest DOF when the input is the

    digit five. For all other rules, we have that rule i shows the highest DOF

    when the input is digit i.

    6.2 The Then-part

    Considering the output vectors f^i, i = 0, . . . , 9, we see that arg max_k (f^i)_k = i

    for all i. In other words, if only rule i fired, then the inferencing would yield

    digit i. In most rules, there is a considerable difference between entry i and

    the second largest entry in f^i. In five of the ten rules, the largest entry is

    [Figure 2: The pattern yielding maximal DOF for each rule; one panel per rule, Rule 0 through Rule 9.]

    positive and the other nine entries are negative. Thus, when such a rule fires

    it not only contributes to the classification towards a specific digit, but also

    contributes negatively to all other possible classifications.

    6.3 Explaining the FRB

    Summarizing, we see that the FRB includes seven rules that are tuned to

    a specific digit. These are rules 0, 1, 5, 6, 7, 8, and 9. Each of these rules

    responds with a high DOF when the input is the appropriate digit.

    On the other hand, rules 2, 3 and 4 are not tuned to the corresponding

    digit. For example, rule 2 displays the highest DOF when the input is the

    digit 1. The fact that the rule-base correctly classifies all ten digits, including

    the digits 2, 3 and 4, is due to the weighted combination of all the rules'

    outputs, and not to the specific action of a single rule.

    This behavior motivated us to try and understand the distinction between

    the two sets of digits

    S1 := {0, 1, 5, 6, 7, 8, 9} and S2 := {2, 3, 4}. (10)

    Let H(d1, d2) denote the Hamming distance between the LED represen-

    tations of the digits d1 and d2 (for example, H(1, 7) = 1). Let Mi denote the

    set of digits d that satisfy min{H(d, d′) : d′ ≠ d} = i (that is, the

    digit closest to d is at a distance i from d). Then,

    M1 = {0, 1, 3, 5, 6, 7, 8, 9} and M2 = {2, 4}. (11)

    It is clear from the definition of Mi that digits in the set M1 are more difficult

    to recognize correctly than those in the set M2.
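    The sets in (11) are easy to verify directly; the sketch below uses the standard seven-segment encodings of the ten digits (an assumption on our part, consistent with the encodings of 6 and 9 quoted in Section 3 and with H(1, 7) = 1):

```python
# Standard seven-segment encodings, ordered z1..z7 as in Figure 1 (top,
# upper-left, upper-right, middle, lower-left, lower-right, bottom).
# Assumed here for illustration; they agree with the encodings of the
# digits 6 and 9 given in Section 3.
LED = {
    0: (1, 1, 1, 0, 1, 1, 1), 1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1), 3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0), 5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1), 7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1), 9: (1, 1, 1, 1, 0, 1, 1),
}

def hamming(d1, d2):
    """Hamming distance H(d1, d2) between the LED representations of two digits."""
    return sum(a != b for a, b in zip(LED[d1], LED[d2]))

# distance from each digit to the closest *other* digit
min_dist = {d: min(hamming(d, e) for e in LED if e != d) for d in LED}
M1 = sorted(d for d, m in min_dist.items() if m == 1)
M2 = sorted(d for d, m in min_dist.items() if m == 2)
print(M1, M2)   # [0, 1, 3, 5, 6, 7, 8, 9] [2, 4], as in (11)
```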

    Comparing (10) with (11), we see that there is a high correspondence

    between Mi and Si. Thus, the FRB (or the original ANN) dedicates specially

    tuned rules for the more "tricky" digits.

    7 Conclusions

    The output of a feed-forward ANN can be represented as the result of infer-

    encing a fuzzy rule base with a special structure–the APFRB. This equiva-

    lence allows the bi-directional flow of information between the subsymbolic

    knowledge representation in the ANN and the symbolic rules of the APFRB.

    In this paper, we studied one application of this equivalence. The trans-

    formation APFRB = T (ANN) extracts the knowledge learned by the trained

    ANN in the form of symbolic fuzzy rules. We demonstrated this approach

    using a medium-size ANN trained to solve a benchmark problem. The 24-

    6-10 network was transformed into a set of 64 fuzzy rules. Simplification of

    this rule-set led to a comprehensible representation of the ANN’s function-

    ing. For example, it is possible to conclude that the ANN dedicates special

    rules to digits that are more difficult to recognize.

    Appendix

    We generated 205 examples for each digit in the form (z1, . . . , z24, v) where zi ∈

    {0, 1}, i = 1, . . . , 7, is the LED’s state, v is the correct classification, and z8, . . . , z24

    are independent random variables with prob(0) = prob(1) = 1/2. Thus, the

    complete training set contained 2050 supervised examples.

    We used a 24-6-10 ANN. The number of hidden units was determined

    using a trial and error approach. ANNs with less than six hidden neurons

    were not able to correctly classify all the training examples. A similar network

    was used in [7]. The hidden neurons employ the hyperbolic tangent activation

    function and the ten output neurons are linear. The classification is based on

    the winner-takes-all paradigm. The only preprocessing done was converting

    the z_i's from {0, 1} to {−1, 1}.

    We implemented the ANN using MATLAB’s Neural Networks Toolbox.

    The network’s parameters were initialized using the “init” command with

    the “net.layers{i}.initFcn” set to “initnw” (the Nguyen-Widrow initialization

    algorithm [23]).

    The training was performed using the “trainlm” command (Levenberg-

    Marquardt backpropagation). A regularization factor (the squared sum of

    the weights) was added to the error function by setting the option “net.performFcn”

    to “msereg”.

    After training, the network correctly classified all the 2050 examples. The

    final values of the weights and biases are given below4 where W ∈ R6×24,

    Θ ∈ R6, C ∈ R10×6, and Φ ∈ R10 are the input-to-hidden weight matrix, the

    hidden neurons’ biases, the hidden-to-output weight matrix, and the output

    neurons’ biases, respectively.

    W =

    0.23 −0.04 0.45 0.30 −0.17 −0.52 −0.14 . . .

    −1.31 −0.25 −0.06 0.77 0.70 0.73 1.07 . . .

    −1.09 −2.05 −1.86 1.58 0.60 −0.15 −0.63 . . .

    2.99 0.59 −0.17 0.40 −0.79 1.08 −2.50 . . .

    −0.57 −2.02 −0.25 −0.65 −0.09 2.08 2.90 . . .

    −0.49 0.89 0.02 −0.44 −0.62 −1.65 0.55 . . .

    (the maximal value of {w_ij}, ∀ j ∈ {8, . . . , 24}, is 8.3E−5 and, therefore, this part of the matrix is omitted).

    4The numerical values were rounded to two decimal digits, without affecting the classification accuracy.

    Θ = (0.33,−0.59, 1.63,−2.20,−1.90, 1.59)T ,

    C =

    −0.43 −0.22 −0.43 −0.38 0.34 0.28

    −0.32 −0.69 0.24 −0.43 −0.14 −0.18

    0.62 −0.07 0.25 −0.31 −0.22 0.27

    0.95 0.32 0.13 0.22 0.85 −0.28

    0.02 0.05 0.12 0.01 −0.49 0.20

    −0.38 0.05 0.21 0.57 0.28 0.47

    −0.89 0.43 0.21 0.07 −0.24 −0.26

    −0.10 −0.59 −0.04 0.12 −0.49 −0.66

    0.07 0.59 −0.42 −0.23 −0.17 −0.27

    0.46 0.13 −0.26 0.35 0.28 0.44

    and Φ = (−0.74,−0.78,−1.13,−0.92,−0.89,−0.71,−0.45,−0.68,−0.74,−0.96)T .

    The transformation APFRB = T (ANN) is defined for the case of unbiased

    output neurons. Thus, the transformation was performed using the values

    W , Θ and C only, and then Φ was added to every rule’s output.

    References

    [1] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning al-

    gorithms,” Machine Learning, vol. 6, pp. 37–66, 1991.

    [2] H. C. Anderson, A. Lotfi, and L. C. Westphal, “Comments on “Func-

    tional equivalence between radial basis function networks and fuzzy in-

    ference systems”,” IEEE Trans. Neural Networks, vol. 9, pp. 1529–1531,

    1998.

    [3] R. Andrews, J. Diederich, and A. Tickle, “Survey and critique of tech-

    niques for extracting rules from trained artificial neural networks,”

    Knowledge-Based Systems, vol. 8, pp. 373–389, 1995.

    [4] M. Ayoubi and R. Isermann, “Neuro-fuzzy systems for diagnosis,” Fuzzy

    Sets and Systems, vol. 89, pp. 289–307, 1997.

    [5] J. M. Benitez, J. L. Castro, and I. Requena, “Are artificial neural net-

    works black boxes?” IEEE Trans. Neural Networks, vol. 8, pp. 1156–

    1164, 1997.

    [6] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.:

    Oxford University Press, 1995.

    [7] Z. Boger and H. Guterman, “Knowledge extraction from artificial neural

    networks models,” in Proc. IEEE Int. Conf. Systems, Man and Cyber-

    netics (SMC97), Orlando, Florida, 1997, pp. 3030–3035.

    [8] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and

    Regression Trees. Wadsworth International Group, 1984, ch. 2.

    [9] J. L. Castro, C. J. Mantas, and J. M. Benitez, “Interpretation of artificial

    neural networks by means of fuzzy rules,” IEEE Trans. Neural Networks,

    vol. 13, pp. 101–116, 2002.

    [10] I. Cloete and J. M. Zurada, Eds., Knowledge-Based Neurocomputing.

    MIT Press, 2000.

    [11] D. Dubois, H. T. Nguyen, H. Prade, and M. Sugeno, “Introduction:

    The real contribution of fuzzy systems,” in Fuzzy Systems: Modeling

    and Control, H. T. Nguyen and M. Sugeno, Eds. Kluwer, 1998, pp.

    1–17.

    [12] L.-M. Fu and L.-C. Fu, “Mapping rule-based systems into neural archi-

    tectures,” Knowledge Based Systems, vol. 3, pp. 48–56, 1990.

    [13] S. Huang and H. Xing, “Extract intelligible and concise fuzzy rules from

    neural networks,” Fuzzy Sets and Systems, vol. 132, pp. 233–243, 2002.

    [14] M. Ishikawa, “Structural learning and rule discovery,” in Knowledge-

    Based Neurocomputing, I. Cloete and J. M. Zurada, Eds. Kluwer, 2000,

    pp. 153–206.

    [15] J.-S. R. Jang and C.-T. Sun, “Functional equivalence between radial ba-

    sis function networks and fuzzy inference systems,” IEEE Trans. Neural

    Networks, vol. 4, pp. 156–159, 1993.

    [16] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Com-

    puting: A Computational Approach to Learning and Machine Intelli-

    gence. Prentice-Hall, 1997.

    [17] E. Kolman and M. Margaliot, “Are artificial neural networks white

    boxes?” IEEE Trans. Neural Networks, vol. 16, pp. 844–852, 2005.

    [18] G. Leng, T. McGinnity, and G. Prasad, “An approach for on-line extrac-

    tion of fuzzy rules using a self-organizing fuzzy neural network,” Fuzzy

    Sets and Systems, vol. 150, pp. 211–243, 2005.

    [19] M. Margaliot and G. Langholz, New Approaches to Fuzzy Modeling and

    Control - Design and Analysis. World Scientific, 2000.

    [20] K. McGarry, S. Wermter, and J. MacIntyre, “Hybrid neural systems:

    from simple coupling to fully integrated neural networks,” Neural Com-

    puting Surveys, vol. 2, pp. 62–93, 1999.

    [21] S. Mitra and Y. Hayashi, “Neuro-fuzzy rule generation: survey in soft

    computing framework,” IEEE Trans. Neural Networks, vol. 11, pp. 748–

    768, 2000.

    [22] S. Mitra and S. Pal, “Fuzzy multi-layer perceptron, inferencing and rule

    generation,” IEEE Trans. Neural Networks, vol. 6, pp. 51–63, 1995.

    [23] D. Nguyen and B. Widrow, “Improving the learning speed of 2-layer

    neural networks by choosing initial values of the adaptive weights,” in

    Proc. Int. Joint Conf. Neural Networks, vol. 3, San Diego, California,

    1990, pp. 21–26.

    [24] R. Paiva and A. Dourado, “Interpretability and learning in neuro-fuzzy

    systems,” Fuzzy Sets and Systems, vol. 147, pp. 17–38, 2004.

    [25] S. Sestito and T. Dillon, “Knowledge acquisition of conjunctive rules

    using multilayered neural networks,” Int. J. Intelligent Systems, vol. 8,

    pp. 779–805, 1993.

    [26] R. Setiono, “Extracting rules from neural networks by pruning and

    hidden-unit splitting,” Neural Computation, vol. 9, pp. 205–225, 1997.

    [27] A. Tickle, R. Andrews, M. Golea, and J. Diederich, “The truth will come

    to light: directions and challenges in extracting the knowledge embedded

    within trained artificial neural networks,” IEEE Trans. Neural Networks,

    vol. 9, pp. 1057–1068, 1998.

    [28] G. G. Towell and J. W. Shavlik, “Extracting refined rules from

    knowledge-based neural networks,” Machine Learning, vol. 13, pp. 71–

    101, 1993.

    [29] E. Tron and M. Margaliot, “Mathematical modeling of observed natural

    behavior: a fuzzy logic approach,” Fuzzy Sets and Systems, vol. 146, pp.

    437–450, 2004.

    [30] ——, “How does the Dendrocoleum lacteum orient to light? a fuzzy

    modeling approach,” Fuzzy Sets and Systems, 2005, to appear.

    [31] L. A. Zadeh, “Fuzzy logic = computing with words,” IEEE Trans. Fuzzy

    Systems, vol. 4, pp. 103–111, 1996.

    [32] D. Zhang, X.-L. Bai, and K.-Y. Cai, “Extended neuro-fuzzy models of

    multilayer perceptrons,” Fuzzy Sets and Systems, vol. 142, pp. 221–242,

    2004.
