
    Knowledge Extraction from Neural Networks

    using the All-Permutations Fuzzy Rule Base∗

    Eyal Kolman and Michael Margaliot†

    August 3, 2005

    Abstract

    A major drawback of artificial neural networks is their black-box character. Even when the trained network performs adequately, it is very difficult to understand its operation. In this paper, we use the mathematical equivalence between artificial neural networks and a specific fuzzy rule base to extract the knowledge embedded in the network. We demonstrate this using a benchmark problem: the recognition of digits produced by a LED device. The method provides a symbolic and comprehensible description of the knowledge learned by the network during its training.

    Keywords: Feedforward neural networks, knowledge extraction, rule extrac-

    tion, rule generation, hybrid intelligent systems, neuro-fuzzy systems.

    1 Introduction

    The ability of artificial neural networks (ANNs) to learn and generalize from

    examples makes them very suitable for use in numerous real-world appli-

    ∗This work was partially supported by the Tel Aviv University Internal Research Fund under grant number 05110066. An abridged version of this paper was presented at the 8th International Work-Conference on Artificial Neural Networks (IWANN’2005).

    †Corresponding author: Dr. Michael Margaliot, Dept. of Electrical Engineering-Systems, Tel Aviv University, Tel Aviv, Israel 69978. Tel: +972-3-640 7768; Fax: +972-3-640 5027; Homepage: www.eng.tau.ac.il/∼michaelm; Email: [email protected]

    cations where exact algorithmic approaches are unknown or too difficult to

    implement. The knowledge learned during the training process is distributed

    in the weights of the different neurons and, whether the ANN operates prop-

    erly or not, it is very difficult to comprehend exactly what it is computing.

    In this respect, ANNs process information in a “black-box” and subsymbolic

    level. The problem of extracting the knowledge learned by the network, and

    representing it in a comprehensible form, received a great deal of attention

    in the literature (see, e.g., [10, 3, 27]).

    Rule-based systems process information in a manner that is much easier to

    comprehend because the system’s knowledge is stated using symbolic If-Then

    rules. In particular, fuzzy rule bases (FRBs) enable the use and manipulation

    of expert knowledge stated using natural language [29, 30, 11, 31, 19]. Thus,

    the knowledge is easy to understand, verify, and, if necessary, refine.

    Recently, a great deal of research has been devoted to the design of hy-

    brid intelligent systems that fuse subsymbolic and symbolic techniques for

    information processing [20] and, in particular, to creating a synergy between

    ANNs and FRBs [21]. Such a synergy may lead to systems with the ro-

    bustness and learning capabilities of ANNs and the “white-box” character of

    FRBs.

    Understanding the operation of a trained ANN is difficult because the

    knowledge is embedded in a complex, distributed, and sometimes self-contradictory

    form [14]. A widespread heuristic approach for knowledge extraction is based

    on finding the “most effective” input-output paths, and then transforming

    them into symbolic rules (see, e.g., [25, 7, 22]). Ishikawa [14] incorporates

    regularization terms that punish large weights into the error function used

    in the training phase. This forces the ANN to develop a more condensed

    and, therefore, easier to understand, form of knowledge representation. To

    extract the knowledge, Ishikawa represents each hidden and output unit as

    a Boolean function of outputs from previous layers. Forcing the ANN to

    produce a skeletal structure seems to be a very useful technique in gen-

    eral [26, 6], however, extracting knowledge in the form of Boolean functions

    is not appropriate for general ANNs.

    Other approaches are based on an attempt to develop an equivalence

    between ANNs and rule-based systems. Fu and Fu [12] mapped rule-based

    systems into a neural-like architecture: final hypotheses are represented using

    output neurons; data attributes become input neurons; and the strength of

    the rule is mapped into the weight of the corresponding connection. This al-

    lowed back-propagation-like learning for modifying rule strengths. However,

    this approach cannot be used to extract knowledge from a standard ANN.

    Towell and Shavlik [28] introduced the Knowledge-Based Artificial Neu-

    ral Network (KBANN) algorithm that transforms a knowledge base into an

    ANN. This can be used to insert initial domain knowledge into an ANN and

    thus reduce the training time and improve the chances for a global minima

    convergence. After training, a heuristic method is used to extract symbolic

    rules. However, the extracted rules are Boolean and the inherent “fuzziness”

    of the ANN is lost.

    A well-known neuro-fuzzy model is the Adaptive Network-Based Fuzzy

    Inference System (ANFIS) developed by Jang et al. [16], which is a feed-

    forward network representation of the fuzzy reasoning process. However,

    this representation is not a standard ANN because nodes in different layers

    perform different tasks corresponding to the different stages in the fuzzy rea-

    soning process. For example, nodes in the first layer compute membership

    function values, whereas nodes in the second layer perform T-norm opera-

    tions.

    Jang and Sun [15] noted that the local activation functions of radial

    basis function networks (RBFNs) are the Gaussian membership functions

    frequently used in FRBs. They used this to extract a set of fuzzy rules that

    are mathematically equivalent to the RBFN. However, this equivalence holds

    only for RBFNs and it also requires that each membership function will be

    used by no more than one rule [2].

    Benitez et al. [5] showed that ANNs with Logistic activation functions

    are equivalent to the result of inferencing a set of Mamdani-type fuzzy rules

    (see also [9, 32]). However, this is not a standard FRB as the operators used

    in the inferencing method are not those commonly used in FRBs.

    Recently, the authors introduced a new Mamdani-type FRB referred to

    as the All-Permutations Fuzzy Rule Base (APFRB) [17]. Inferencing the

    APFRB, using standard tools from fuzzy logic theory, yields an input-output

    relationship that is mathematically equivalent to that of a feed-forward ANN.

    More precisely, there exists an invertible transformation T such that

    T (ANN) = APFRB and T−1(APFRB) = ANN. (1)

    This equivalence enables bidirectional flow of information between the

    ANN and the corresponding APFRB. It also enables the application of tools

    from the theory of ANNs to APFRBs and vice versa. For example, given a

    procedure for simplifying an ANN, we can immediately apply it to simplify

    an APFRB as follows. Given an initial APFRB, calculate the equivalent

    network as ANN = T−1(APFRB). Apply the simplification procedure to

    this ANN and denote the result by ANN’. Then, APFRB’ := T (ANN’) is a

    simplified version of the original APFRB.

    In this paper, we use the equivalence T (ANN) = APFRB to extract the

    knowledge learned by the ANN in the form of symbolic rules. We demon-

    strate this approach on a benchmark problem involving the recognition of the

    digits displayed by a LED device. An ANN is trained to recognize the ten

    possible digits. Calculating APFRB = T (ANN) yields a symbolic descrip-

    tion of the knowledge learned by the ANN. To increase the comprehensibility

    of this FRB we simplify the rules.

    The final result is an FRB with ten rules that correctly classifies all the

    training examples. Furthermore, this FRB is tractable, and provides a com-

    prehensible representation of the ANN's functioning. For example, it is pos-

    sible to deduce that the ANN learned to focus its efforts on the digits that

    are harder to recognize. The notion that digits that are more difficult to

    recognize deserve more attention is quite intuitive. However, understanding

    that the ANN implements this notion by observing its weights and biases is

    all but impossible. It is only through the knowledge extraction process that

    this notion emerges.

    The rest of this paper is organized as follows. In section 2, we briefly

    review the APFRB and its equivalence to an ANN. In Section 3, we present

    the benchmark problem and the ANN trained to solve it. In Section 4, we

    apply the equivalence (1) to extract information from the trained ANN. The

    resulting APFRB is simplified in Section 5. In Section 6, we show that the

    simplified FRB is tractable and allows us to represent the ANN’s functioning

    in a comprehensible form. The final section concludes.

    2 All-Permutations Fuzzy Rule-Base

    For the sake of completeness, we briefly review the APFRB and its equiva-

    lence to an ANN. More details and the proofs of all the results can be found

    in [17]. For the sake of simplicity, we consider the case of an FRB with

    output f ∈ R2; the generalization to the case f ∈ Rn is straightforward.

    Definition 1 (APFRB) A fuzzy rule-base with inputs x1, . . . , xm and out-

    put f ∈ R2 is called an APFRB if the following conditions hold.

    1. Every input variable x_i is characterized by two linguistic terms: term^i_- and term^i_+. The membership functions µ^i_-(·) and µ^i_+(·) that model these terms satisfy the following constraint: there exists a v_i ∈ R such that

    $$\frac{\mu^i_+(y) - \mu^i_-(y)}{\mu^i_+(y) + \mu^i_-(y)} = \tanh(y - v_i), \qquad \forall\, y \in \mathbb{R}. \tag{2}$$

    2. The form of every rule is

    $$\text{If } x_1 \text{ is } term^1_{+/-} \text{ and } x_2 \text{ is } term^2_{+/-} \ \dots \ \text{and } x_m \text{ is } term^m_{+/-}$$

    $$\text{Then } f = \begin{pmatrix} a_0 \pm a_1 \pm a_2 \cdots \pm a_m \\ b_0 \pm b_1 \pm b_2 \cdots \pm b_m \end{pmatrix}, \tag{3}$$

    where term^i_{+/-} stands for either term^i_- or term^i_+, ± stands for either the plus or the minus sign, and a_i, b_i ∈ R. The actual signs in the Then-part are determined in the following manner: if the term characterizing x_i in the If-part is term^i_+, then in the Then-part, a_i and b_i appear with a plus sign; otherwise, a_i and b_i appear with a minus sign.

    3. The rule-base contains exactly 2^m rules spanning, in their If-part, all the possible assignment combinations of x_1, . . . , x_m.
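    As a small illustration of Definition 1, an APFRB with m = 2 inputs consists of exactly 2^2 = 4 rules:

    $$\begin{aligned}
    &\text{If } x_1 \text{ is } term^1_- \text{ and } x_2 \text{ is } term^2_- \text{ Then } f = (a_0 - a_1 - a_2,\ b_0 - b_1 - b_2)^T,\\
    &\text{If } x_1 \text{ is } term^1_- \text{ and } x_2 \text{ is } term^2_+ \text{ Then } f = (a_0 - a_1 + a_2,\ b_0 - b_1 + b_2)^T,\\
    &\text{If } x_1 \text{ is } term^1_+ \text{ and } x_2 \text{ is } term^2_- \text{ Then } f = (a_0 + a_1 - a_2,\ b_0 + b_1 - b_2)^T,\\
    &\text{If } x_1 \text{ is } term^1_+ \text{ and } x_2 \text{ is } term^2_+ \text{ Then } f = (a_0 + a_1 + a_2,\ b_0 + b_1 + b_2)^T.
    \end{aligned}$$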

    Several commonly used fuzzy membership functions satisfy the constraint (2).

    For example, the pair of Gaussian membership functions

    $$\mu_{=k}(y) := \exp\!\left(-\frac{(y-k)^2}{2k}\right) \quad \text{and} \quad \mu_{=-k}(y) := \exp\!\left(-\frac{(y+k)^2}{2k}\right) \tag{4}$$

    satisfy (2) with v = 0. The sigmoid functions

    $$\mu_{>k}(y) := \bigl(1 + \exp(-2(y - k))\bigr)^{-1} \quad \text{and} \quad \mu_{<k}(y) := \bigl(1 + \exp(2(y - k))\bigr)^{-1}$$

    satisfy (2) with v = k.
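    To see why the Gaussian pair (4) satisfies (2) with v = 0, multiply the numerator and denominator of the left-hand side of (2) by exp((y^2 + k^2)/(2k)):

    $$\frac{\mu_{=k}(y) - \mu_{=-k}(y)}{\mu_{=k}(y) + \mu_{=-k}(y)} = \frac{e^{y} - e^{-y}}{e^{y} + e^{-y}} = \tanh(y).$$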

    outputs f can be represented as the output of a standard ANN.

    Conversely, consider an ANN with input z ∈ R^n, a single hidden layer with m units, and two output units. Its output f ∈ R^2 is given by

    $$f = \begin{pmatrix} \sum_{j=1}^{m} c_j\, h(y_j + \theta_j) \\ \sum_{j=1}^{m} d_j\, h(y_j + \theta_j) \end{pmatrix}, \tag{8}$$

    where y_j := Σ_{i=1}^{n} w_{ji} z_i is the input to the jth neuron in the hidden layer, θ_j is the bias of this neuron, and c_j (d_j) is the weight from this neuron to the first (second) output neuron. Comparing (8) with (6) yields the following.

    Corollary 1 If the activation function in the ANN is h(z) = tanh(z), then (8) is the output of an APFRB with: a_0 = 0, b_0 = 0, a_i = c_i, b_i = d_i, v_i = -θ_i, x_i = y_i, i = 1, . . . , m.

    If h(z) = 1/(1 + exp(-z)), then (8) is the output of an APFRB with: a_0 = Σ_{i=1}^{m} a_i, b_0 = Σ_{i=1}^{m} b_i, a_i = c_i/2, b_i = d_i/2, v_i = -θ_i/2, and x_i = y_i/2, i = 1, . . . , m.
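    As a concrete illustration of the tanh case of Corollary 1, the following sketch enumerates the 2^m rules of the equivalent APFRB from the hidden-to-output weights c_j, d_j and the hidden biases θ_j (the function name and the toy weights are illustrative, not taken from the paper):

```python
import itertools
import numpy as np

def ann_to_apfrb(c, d, theta):
    """Sketch of T(ANN) = APFRB for h = tanh (Corollary 1):
    a_i = c_i, b_i = d_i, a_0 = b_0 = 0, and v_i = -theta_i.
    Returns the membership centers v and the 2^m rules; each rule is a pair
    (signs, then_part), where signs[i] = +1 selects "x_i is term_i+" and
    signs[i] = -1 selects "x_i is term_i-"."""
    c, d, theta = map(np.asarray, (c, d, theta))
    m = len(c)
    v = -theta
    rules = []
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        s = np.array(signs)
        then_part = (np.dot(s, c), np.dot(s, d))   # (a_0 ± a_1 ± ..., b_0 ± b_1 ± ...)
        rules.append((s, then_part))
    return v, rules

# toy usage (hypothetical weights, not the trained LED network):
v, rules = ann_to_apfrb(c=[0.5, -1.0], d=[1.2, 0.3], theta=[0.1, -0.4])
for s, f in rules:
    print(s, f)
```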

    Summarizing, Theorem 1 establishes an equivalence between a single1

    hidden layer ANN and an APFRB, and explicitly defines the transforma-

    tion T in (1). Furthermore, Corollary 1 implies that we can immediately

    extract the knowledge embedded in an ANN in the form of fuzzy If-Then

    rules. The rest of this paper is devoted to demonstrating the usefulness of

    this approach using a benchmark problem.

    1The equivalence is easily generalized to ANNs with multiple hidden layers; see [17].

    3 The LED Display Recognition Problem

    The LED display recognition problem [8, Chapter 2] concerns learning to

    recognize digits displayed using a seven-segment light emitting diodes (LED)

    display. Several pattern recognition algorithms were applied to this problem

    including classification trees [8], instance-based learning algorithms [1], and

    ANNs [7].

    The input to the learning algorithm is a set of supervised examples in the

    form (z1, z2, . . . , z24, v). The first seven inputs, z1 . . . z7, are the states of the

    seven diodes (1 for on and 0 for off) of the LED display (see Fig. 1). For exam-

    ple, the vector {1, 1, 0, 1, 1, 1, 1} represents the digit 6, and {1, 1, 1, 1, 0, 1, 1}

    the digit 9. The value v ∈ {0, 1, . . . , 9} is the displayed digit. The in-

    puts z8 . . . z24 are independent random variables with prob(zi = 0) = prob(zi =

    1) = 1/2. These noise inputs make the recognition task more challenging as

    the classification algorithm must also learn to discriminate between mean-

    ingful and useless inputs.

    We trained a 24-6-10 ANN using the backpropagation algorithm and a

    set of 2050 supervised examples.2 Each of the ten outputs f0, . . . , f9 corre-

    sponds to a different digit and the final classification is based on the winner-

    takes-all approach. That is, the ANN’s classification is digit i, where i :=

    arg max_{0 ≤ k ≤ 9} {f_k}.
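    A minimal sketch of the example format and the winner-takes-all decision is given below (the function names are illustrative; the segment vector used in the usage line is the encoding of the digit 6 quoted above):

```python
import random

def make_example(segments, digit):
    """One supervised example (z1, ..., z24, v): the seven segment bits of the
    displayed digit followed by seventeen noise bits with prob(0) = prob(1) = 1/2."""
    noise = tuple(random.randint(0, 1) for _ in range(17))
    return segments + noise, digit

def winner_takes_all(f):
    """The classification is digit i, where i maximizes the ten outputs f_0..f_9."""
    return max(range(10), key=lambda k: f[k])

# usage: an example displaying the digit 6
example, label = make_example((1, 1, 0, 1, 1, 1, 1), 6)
```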

    After training, the ANN correctly classified all the input examples. The

    2More details on the training process can be found in the appendix.

    [Figure 1: The LED device. The seven segments are arranged as z1 (top), z2 and z3 (upper left and right), z4 (middle), z5 and z6 (lower left and right), and z7 (bottom).]

    values of the parameters of the trained ANN (204 weights and 16 biases) can

    be found in the appendix. These values, however, do not provide any insight

    into the ANN’s functioning.

    The network contains 220 free parameters and is definitely not a very

    large network. Nevertheless, the problem addressed in this paper is not the

    designing or training of ANNs, but rather interpreting the performance of a

    trained ANN. In this respect, it is interesting to compare the size of this ANN

    to other examples used to demonstrate knowledge extraction algorithms.

    In [18], fuzzy rules were extracted from a self-organizing fuzzy neural network

    with up to 10 neurons and 60 parameters. In [24], interpretable fuzzy models

    were extracted from two neuro-fuzzy networks, the first with 28 and the

    second with 92 parameters. In [13], fuzzy rules extraction from a trained

    neural network was demonstrated using a network with 36 parameters. In [4],

    an extraction method was demonstrated using a network with 36 parameters.

    The above examples clearly indicate that extracting rules from a network

    with 220 parameters is an interesting challenge.

    4 Knowledge Extraction

    Let x_j := Σ_{i=1}^{24} w_{ji} z_i, j = 1, . . . , 6, denote the input of the jth hidden neuron of the ANN. We applied Corollary 1 to represent the ANN as an APFRB with 2^6 = 64 fuzzy rules and an output f ∈ R^10. For example, one of the rules of this APFRB is:3

    If x1 equals 1 and x2 equals 1 and x3 equals −1 and x4 equals 1 and x5 equals −1 and x6 equals 1
    Then f = (−1.4, −2.5, −0.6, −0.7, −0.2, −0.5, −11, −1.4, 0, 0.4)^T.

    The membership functions defining the terms equals 1 and equals −1 are:

    $$\mu_{=1}(y) = \exp\!\left(-(y - 1)^2/2\right) \quad \text{and} \quad \mu_{=-1}(y) = \exp\!\left(-(y + 1)^2/2\right).$$

    The inferencing amounts to computing a weighted sum, f , of the sixty-

    four vectors in the Then-part of the rules, and the final digit classification

    is i := arg max_{0 ≤ k ≤ 9} {f_k}, where f_k is the kth entry in f.
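    A sketch of this inferencing step, assuming the 64 rules are stored as (sign pattern, Then-part) pairs with the sign pattern selecting between the terms equals −1 and equals 1 for each x_j (the membership functions are the Gaussians given above; the helper name is illustrative):

```python
import numpy as np

def apfrb_infer(x, rules):
    """Sketch of the APFRB inferencing for the 64-rule LED classifier.
    x     : the six hidden-neuron inputs x_1, ..., x_6.
    rules : (signs, f_vec) pairs; signs[j] = -1 or +1 selects the term
            "equals -1" or "equals 1" for x_j, and f_vec is the
            ten-dimensional Then-part of the rule."""
    x = np.asarray(x, dtype=float)
    num, den = np.zeros(10), 0.0
    for signs, f_vec in rules:
        # degree of firing: product of the Gaussian membership values for each x_j
        dof = np.prod(np.exp(-(x - np.asarray(signs)) ** 2 / 2.0))
        num += dof * np.asarray(f_vec)
        den += dof
    f = num / den                    # weighted sum of the Then-part vectors
    return int(np.argmax(f))         # winner-takes-all classification
```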

    This rule set provides a complete symbolic representation of the ANN’s

    functioning. In other words, we have a fuzzy classifier that solves the LED

    recognition problem. However, the comprehensibility of this classifier is hin-

    dered by the large number and the complexity of the rules. To gain more

    insight, we must simplify the APFRB.

    3The numerical values were rounded to one decimal digit, without affecting the classification accuracy.

    5 APFRB Simplification

    Simplification can be executed in the network level, by reducing or clustering

    nodes and connections. However, the simplification process becomes easier

    when the knowledge is represented in a symbolic and comprehensible manner.

    Thus, by shifting from the ANN domain to the symbolic fuzzy domain, we

    are able to simplify the knowledge more easily.

    In this section, we apply two simplification stages to the APFRB.

    5.1 Term Simplification

    Recall that the jth input of the APFRB is x_j = Σ_{i=1}^{n} w_{ji} z_i. If |w_{ji} z_i| is sufficiently small then it will have a negligible effect on the APFRB's output. In our case, all the z_i's are in the same range, so we can delete the term w_{jk} z_k if |w_{jk}| is small. Examining the w_{ji} values, we find that for all j:

    $$\frac{\min_{1 \le i \le 7} |w_{ji}|}{\max_{8 \le i \le 24} |w_{ji}|} > 240.$$

    Thus, we set wji = 0 for all i ≥ 8, j = 1, . . . , 6.
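    In code, this step is a simple thresholding of the input-to-hidden weight matrix (a sketch; the function and parameter names are illustrative):

```python
import numpy as np

def prune_weak_inputs(W, n_meaningful=7, ratio_threshold=240.0):
    """Sketch of the term-simplification step: if the meaningful input weights
    dominate the noise-input weights by a large ratio in every row, zero out
    the noise columns of the 6x24 input-to-hidden weight matrix W."""
    W = np.array(W, dtype=float)
    kept = np.abs(W[:, :n_meaningful]).min(axis=1)      # min |w_ji|, i = 1..7, per row j
    dropped = np.abs(W[:, n_meaningful:]).max(axis=1)   # max |w_ji|, i = 8..24, per row j
    if np.all(kept / dropped > ratio_threshold):
        W[:, n_meaningful:] = 0.0                        # delete the terms w_jk z_k, k >= 8
    return W
```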

    At this point, each xj in the APFRB is a linear combination of only

    seven inputs zi, i = 1, . . . , 7. The noise inputs z8, . . . , z24 were identified as

    meaningless. Of course, this step is identical to removing weak connections

    from the input layer to the hidden layer in the ANN. However, the symbolic

    structure of the APFRB also allows us to perform simplification steps that

    cannot be carried out in terms of the weights of the ANN.

    5.2 Rule Reduction

    Consider an APFRB with input x ∈ R^m, q := 2^m rules, and output f ∈ R^k. Let f^i denote the value in the Then-part of rule i, and let t_i(x) denote the degree of firing (or truth value) of rule i, so that the fuzzy inferencing process yields f(x) = u(x)/d(x), where

    $$u(x) := \sum_{i=1}^{q} t_i(x) f^i \quad \text{and} \quad d(x) := \sum_{i=1}^{q} t_i(x). \tag{9}$$

    If we modify the degree of firing of rule k to, say, \hat{t}_k(x), then the modified output is

    $$\hat{f}(x) = \Bigl(\hat{t}_k(x) f^k + \sum_{\substack{1 \le i \le q \\ i \ne k}} t_i(x) f^i\Bigr) \Big/ \Bigl(\hat{t}_k(x) + \sum_{\substack{1 \le i \le q \\ i \ne k}} t_i(x)\Bigr).$$

    The final classification decision, obtained as arg max{f_0, . . . , f_9}, will not change as long as arg max{\hat{f}(x)} = arg max{f(x)}. It is easy to verify that this is equivalent to

    $$\arg\max\{u(x) - (t_k(x) - \hat{t}_k(x)) f^k\} = \arg\max\{u(x)\}.$$

    Note that deleting rule k from the APFRB altogether amounts to setting \hat{t}_k(x) ≡ 0.

    Let R(l, j) = 1 (R(l, j) = 0) denote that rule l is significant (insignificant) when classifying digit j (R(l, j) is initialized as one for all l = 1, . . . , q and j = 0, . . . , 9). Denote the training set by D, and let (f^k)_j be the jth element of the output vector of rule k. Then, it follows from the analysis above that if p_{kj} := max_{x ∈ D} t_k(x) (f^k)_j is small, then the kth rule has a small effect on classifying digit j. This motivates the following procedure:

    While (there is no index l such that R(l, j) = 0 for all j)
        For j = 0 to 9
            Q ← {k | R(k, j) = 1}      /* rules in Q are significant for digit j */
            q ← arg min_{k ∈ Q} p_{kj}
            R(q, j) ← 0                /* mark rule q as insignificant for digit j */
        EndFor
    EndWhile
    Output(l)

    This procedure outputs an index l such that rule l has a small effect on

    classifying all the ten digits. If removing rule l from the rule-base does

    not change the classification for all the training examples, then this rule is

    deleted.
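    A rough Python rendering of this pruning loop is sketched below, under the assumption that the matrix of p_kj values and a classify(rules, x) routine implementing the inferencing of Section 4 are available; none of these names come from the paper, and the sketch simply stops when a candidate deletion would change some classification:

```python
import numpy as np

def reduce_rules(rules, P, training_set, classify):
    """Sketch of the repeated rule-deletion procedure (helper names are
    illustrative). P is the q-by-10 matrix of the values p_kj; training_set
    holds (x, label) pairs, where x is the vector of APFRB inputs x_1..x_6;
    classify(rules, x) returns the digit predicted by the APFRB inferencing."""
    rules = list(rules)
    P = np.array(P, dtype=float)
    while True:
        q = len(rules)
        R = np.ones((q, 10), dtype=bool)   # R[l, j]: rule l is significant for digit j
        # mark rules insignificant, one per digit per pass, until some rule is
        # marked insignificant for all ten digits
        while R.any(axis=1).all():
            for j in range(10):
                significant = np.flatnonzero(R[:, j])
                k = significant[np.argmin(P[significant, j])]
                R[k, j] = False
        l = int(np.flatnonzero(~R.any(axis=1))[0])
        # delete rule l only if no training example changes its classification
        candidate = rules[:l] + rules[l + 1:]
        if all(classify(candidate, x) == classify(rules, x) for x, _ in training_set):
            rules = candidate
            P = np.delete(P, l, axis=0)
        else:
            break
    return rules
```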

    Applying this procedure repeatedly leads to the deletion of 54 rules. We

    are left with a set of ten rules that correctly classify the training set. These

    rules are:

    • Rule 0: If x1 equals −1 and x2 equals −1 and x3 equals −1 and

    x4 equals −1 and x5 equals 1 and x6 equals 1

    Then f = (1.3, 0.1,−1.6,−2,−1.4,−0.1,−0.8,−1.2,−1.2,−0.9)T

    • Rule 1: If x1 equals −1 and x2 equals −1 and x3 equals 1 and

    x4 equals −1 and x5 equals 1 and x6 equals −1

    Then f = (−0.1, 0.9,−1.6,−1.1,−1.5,−0.9, 0.2, 0,−1.5,−2.3)T

    • Rule 2: If x1 equals 1 and x2 equals −1 and x3 equals 1 and

    x4 equals −1 and x5 equals −1 and x6 equals 1

    Then f = (−1.1, 0.2, 0.6,−1.5,−0.1,−1.3,−1.7,−0.5,−1.5,−1.1)T

    • Rule 3: If x1 equals 1 and x2 equals 1 and x3 equals 1 and x4 equals −1

    and x5 equals 1 and x6 equals −1

    Then f = (−1.4,−1.1,−0.5, 1.4,−1.4,−1.6,−0.8,−1.4,−0.2,−1.1)T

    • Rule 4: If x1 equals 1 and x2 equals 1 and x3 equals 1 and x4 equals 1

    and x5 equals −1 and x6 equals 1

    Then f = (−2.3,−2,−0.1,−0.4, 0,−0.1,−0.7,−1.5,−0.8,−0.1)T

    • Rule 5: If x1 equals −1 and x2 equals 1 and x3 equals 1 and x4 equals 1

    and x5 equals 1 and x6 equals 1

    Then f = (−0.7,−1.7,−1.8,−0.7,−1, 1.3, 0.6,−2.2,−1.3,−0.5)T

    • Rule 6: If x1 equals −1 and x2 equals 1 and x3 equals 1 and

    x4 equals −1 and x5 equals −1 and x6 equals 1

    Then f = (−0.6,−0.5,−0.8,−2.8,−0.1,−0.4, 1,−1.5,−0.5,−1.7)T

    • Rule 7: If x1 equals 1 and x2 equals −1 and x3 equals −1 and

    x4 equals 1 and x5 equals −1 and x6 equals −1

    Then f = (−1.5,−0.8,−1.1,−0.8,−0.7,−1.5,−1.4, 1.1,−0.6,−0.7)T

    • Rule 8: If x1 equals 1 and x2 equals 1 and x3 equals −1 and

    x4 equals −1 and x5 equals −1 and x6 equals −1

    Then f = (−1.2,−1.3,−0.6,−0.6,−0.7,−2.6,−0.7,−0.3, 1,−1.2)T

    • Rule 9: If x1 equals 1 and x2 equals −1 and x3 equals −1 and

    x4 equals 1 and x5 equals 1 and x6 equals 1

    Then f = (−0.3,−1.4,−0.9, 0.4,−1.3, 0,−2.4,−1.2,−1.5, 0.7)T

    Evidently, this FRB is much simpler than the original one, as the number

    of rules is reduced from 64 to 10. Furthermore, this FRB is simple enough

    to allow us to interpret its functioning.

    6 Interpreting the FRB

    The symbolic structure of FRBs makes them much easier to understand than

    ANNs. In particular, we can analyze the operation of a FRB by understand-

    ing the If-part and the Then-part of each rule.

    6.1 The If-part

    To understand the If-part of the ten rules, we consider their degree of fir-

    ing (DOF) for each possible input (namely, the 2^7 = 128 possible binary vectors (z_1, . . . , z_7)). The ratio between the highest DOF and the second highest DOF for the ten rules is:

    930%, 1,270%, 340%, 150%, 540%, 540%, 450%, 230%, 1,940%, and 240%,

    respectively. Thus, with the exception of Rule 3 (recall that the rules are

    numbered from Rule 0 to Rule 9), every rule is tuned to a single specific

    input pattern and yields a much smaller DOF for any other pattern.
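    The DOF computation behind these ratios can be sketched as follows (illustrative only; it reuses the Gaussian membership functions of Section 4, the pruned 6-by-7 weight matrix of Section 5.1, and the ±1 encoding of the segment inputs described in the appendix):

```python
import itertools
import numpy as np

def dof_ratios(rules, W):
    """Sketch of the If-part analysis: for every rule, evaluate its degree of
    firing on all 2^7 = 128 binary segment patterns (encoded as +-1) and
    return the ratio between the highest and second-highest values.
    W is the pruned 6x7 input-to-hidden weight matrix; rules are the
    (sign pattern, Then-part) pairs used above (names are illustrative)."""
    W = np.asarray(W, dtype=float)
    patterns = [np.array(p) for p in itertools.product((-1.0, 1.0), repeat=7)]
    ratios = []
    for signs, _ in rules:
        signs = np.asarray(signs, dtype=float)
        dofs = sorted(
            (np.prod(np.exp(-(W @ z - signs) ** 2 / 2.0)) for z in patterns),
            reverse=True,
        )
        ratios.append(dofs[0] / dofs[1])
    return ratios
```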

    Fig. 2 depicts the pattern yielding the highest DOF for each rule. It may

    be seen that rules 1, 5, 6 and 8 are tuned to recognize the digits 1, 5, 6 and 8,

    respectively. Rules 0 and 7 are tuned to patterns that are one Hamming

    distance away from the real digits 0 and 7.

    If we compare the DOF only for the ten patterns representing the dig-

    its 0, . . . , 9, then we find that rules 2 and 3 have the highest DOF when the

    input is the digit one, and rule 4 has the highest DOF when the input is the

    digit five. For all other rules, we have that rule i shows the highest DOF

    when the input is digit i.

    6.2 The Then-part

    Considering the output vectors f^i, i = 0, . . . , 9, we see that arg max_k (f^i)_k = i

    for all i. In other words, if only rule i fired, then the inferencing would yield

    digit i. In most rules, there is a considerable difference between entry i and

    the second largest entry in f^i. In five of the ten rules, the largest entry is

    [Figure 2: The pattern yielding maximal DOF for each rule; one panel per rule, Rule 0 through Rule 9.]

    positive and the other nine entries are negative. Thus, when such a rule fires

    it not only contributes to the classification towards a specific digit, but also

    contributes negatively to all other possible classifications.

    6.3 Explaining the FRB

    Summarizing, we see that the FRB includes seven rules that are tuned to

    a specific digit. These are rules 0, 1, 5, 6, 7, 8, and 9. Each of these rules

    responds with a high DOF when the input is the appropriate digit.

    On the other hand, rules 2, 3 and 4 are not tuned to the corresponding

    digit. For example, rule 2 displays the highest DOF when the input is the

    digit 1. The fact that the rule-base correctly classifies all ten digits, including

    the digits 2, 3 and 4, is due to the weighted combination of all the rules'

    outputs, and not to the specific action of a single rule.

    This behavior motivated us to try and understand the distinction between

    the two sets of digits

    S1 := {0, 1, 5, 6, 7, 8, 9} and S2 := {2, 3, 4}. (10)

    Let H(d1, d2) denote the Hamming distance between the LED represen-

    tations of the digits d1 and d2 (for example, H(1, 7) = 1). Let Mi denote the

    set of digits d that satisfy min{H(d, d′) : d′ ≠ d} = i (that is, the

    digit closest to d is at a distance i from d). Then,

    M1 = {0, 1, 3, 5, 6, 7, 8, 9} and M2 = {2, 4}. (11)

    It is clear from the definition of Mi that digits in the set M1 are more difficult

    to recognize correctly than those in the set M2.
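    The sets in (11) are easy to verify directly; the sketch below uses the standard seven-segment encodings of the ten digits (an assumption on our part, consistent with the encodings of 6 and 9 quoted in Section 3 and with H(1, 7) = 1):

```python
# Standard seven-segment encodings, ordered z1..z7 as in Figure 1 (top,
# upper-left, upper-right, middle, lower-left, lower-right, bottom).
# Assumed here for illustration; they agree with the encodings of the
# digits 6 and 9 given in Section 3.
LED = {
    0: (1, 1, 1, 0, 1, 1, 1), 1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1), 3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0), 5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1), 7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1), 9: (1, 1, 1, 1, 0, 1, 1),
}

def hamming(d1, d2):
    """Hamming distance H(d1, d2) between the LED representations of two digits."""
    return sum(a != b for a, b in zip(LED[d1], LED[d2]))

# distance from each digit to the closest *other* digit
min_dist = {d: min(hamming(d, e) for e in LED if e != d) for d in LED}
M1 = sorted(d for d, m in min_dist.items() if m == 1)
M2 = sorted(d for d, m in min_dist.items() if m == 2)
print(M1, M2)   # [0, 1, 3, 5, 6, 7, 8, 9] [2, 4], as in (11)
```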

    Comparing (10) with (11), we see that there is a high correspondence

    between Mi and Si. Thus, the FRB (or the original ANN) dedicates specially

    tuned rules for the more "tricky" digits.

    7 Conclusions

    The output of a feed-forward ANN can be represented as the result of infer-

    encing a fuzzy rule base with a special structure–the APFRB. This equiva-

    lence allows the bi-directional flow of information between the subsymbolic

    knowledge representation in the ANN and the symbolic rules of the APFRB.

    In this paper, we studied one application of this equivalence. The trans-

    formation APFRB = T (ANN) extracts the knowledge learned by the trained

    ANN in the form of symbolic fuzzy rules. We demonstrated this approach

    using a medium-size ANN trained to solve a benchmark problem. The 24-

    6-10 network was transformed into a set of 64 fuzzy rules. Simplification of

    this rule-set led to a comprehensible representation of the ANN’s function-

    ing. For example, it is possible to conclude that the ANN dedicates special

    rules to digits that are more difficult to recognize.

    Appendix

    We generated 205 examples for each digit in the form (z1, . . . , z24, v) where zi ∈

    {0, 1}, i = 1, . . . , 7, is the LED’s state, v is the correct classification, and z8, . . . , z24

    are independent random variables with prob(0) = prob(1) = 1/2. Thus, the

    complete training set contained 2050 supervised examples.

    We used a 24-6-10 ANN. The number of hidden units was determined

    using a trial and error approach. ANNs with less than six hidden neurons

    were not able to correctly classify all the training examples. A similar network

    was used in [7]. The hidden neurons employ the hyperbolic tangent activation

    function and the ten output neurons are linear. The classification is based on

    the winner-takes-all paradigm. The only preprocessing done was converting

    the z_i's from {0, 1} to {−1, 1}.

    We implemented the ANN using MATLAB’s Neural Networks Toolbox.

    The network’s parameters were initialized using the “init” command with

    the “net.layers{i}.initFcn” set to “initnw” (the Nguyen-Widrow initialization

    algorithm [23]).

    The training was performed using the “trainlm” command (Levenberg-

    Marquardt backpropagation). A regularization factor (the squared sum of

    the weights) was added to the error function by setting the option “net.performFcn”

    to “msereg”.

    After training, the network correctly classified all the 2050 examples. The

    final values of the weights and biases are given below4 where W ∈ R6×24,

    Θ ∈ R6, C ∈ R10×6, and Φ ∈ R10 are the input-to-hidden weight matrix, the

    hidden neurons’ biases, the hidden-to-output weight matrix, and the output

    neurons’ biases, respectively.

    W =

    0.23 −0.04 0.45 0.30 −0.17 −0.52 −0.14 . . .

    −1.31 −0.25 −0.06 0.77 0.70 0.73 1.07 . . .

    −1.09 −2.05 −1.86 1.58 0.60 −0.15 −0.63 . . .

    2.99 0.59 −0.17 0.40 −0.79 1.08 −2.50 . . .

    −0.57 −2.02 −0.25 −0.65 −0.09 2.08 2.90 . . .

    −0.49 0.89 0.02 −0.44 −0.62 −1.65 0.55 . . .

    (the maximal value of {w_ij}, ∀ j ∈ {8, . . . , 24}, is 8.3E−5 and, therefore, this part of the matrix is omitted).

    4The numerical values were rounded to two decimal digits, without affecting the classification accuracy.

    Θ = (0.33,−0.59, 1.63,−2.20,−1.90, 1.59)T ,

    C =

    −0.43 −0.22 −0.43 −0.38 0.34 0.28

    −0.32 −0.69 0.24 −0.43 −0.14 −0.18

    0.62 −0.07 0.25 −0.31 −0.22 0.27

    0.95 0.32 0.13 0.22 0.85 −0.28

    0.02 0.05 0.12 0.01 −0.49 0.20

    −0.38 0.05 0.21 0.57 0.28 0.47

    −0.89 0.43 0.21 0.07 −0.24 −0.26

    −0.10 −0.59 −0.04 0.12 −0.49 −0.66

    0.07 0.59 −0.42 −0.23 −0.17 −0.27

    0.46 0.13 −0.26 0.35 0.28 0.44

    and Φ = (−0.74,−0.78,−1.13,−0.92,−0.89,−0.71,−0.45,−0.68,−0.74,−0.96)T .

    The transformation APFRB = T (ANN) is defined for the case of unbiased

    output neurons. Thus, the transformation was performed using the values

    W , Θ and C only, and then Φ was added to every rule’s output.

    References

    [1] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning al-

    gorithms,” Machine Learning, vol. 6, pp. 37–66, 1991.

    [2] H. C. Anderson, A. Lotfi, and L. C. Westphal, “Comments on “Func-

    tional equivalence between radial basis function networks and fuzzy in-

    ference systems”,” IEEE Trans. Neural Networks, vol. 9, pp. 1529–1531,

    1998.

    [3] R. Andrews, J. Diederich, and A. Tickle, “Survey and critique of tech-

    niques for extracting rules from trained artificial neural networks,”

    Knowledge-Based Systems, vol. 8, pp. 373–389, 1995.

    [4] M. Ayoubi and R. Isermann, “Neuro-fuzzy systems for diagnosis,” Fuzzy

    Sets and Systems, vol. 89, pp. 289–307, 1997.

    [5] J. M. Benitez, J. L. Castro, and I. Requena, “Are artificial neural net-

    works black boxes?” IEEE Trans. Neural Networks, vol. 8, pp. 1156–

    1164, 1997.

    [6] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.:

    Oxford University Press, 1995.

    [7] Z. Boger and H. Guterman, “Knowledge extraction from artificial neural

    networks models,” in Proc. IEEE Int. Conf. Systems, Man and Cyber-

    netics (SMC97), Orlando, Florida, 1997, pp. 3030–3035.

    [8] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and

    Regression Trees. Wadsworth International Group, 1984, ch. 2.

    [9] J. L. Castro, C. J. Mantas, and J. M. Benitez, “Interpretation of artificial

    neural networks by means of fuzzy rules,” IEEE Trans. Neural Networks,

    vol. 13, pp. 101–116, 2002.

    [10] I. Cloete and J. M. Zurada, Eds., Knowledge-Based Neurocomputing.

    MIT Press, 2000.

    [11] D. Dubois, H. T. Nguyen, H. Prade, and M. Sugeno, “Introduction:

    The real contribution of fuzzy systems,” in Fuzzy Systems: Modeling

    and Control, H. T. Nguyen and M. Sugeno, Eds. Kluwer, 1998, pp.

    1–17.

    [12] L.-M. Fu and L.-C. Fu, “Mapping rule-based systems into neural archi-

    tectures,” Knowledge Based Systems, vol. 3, pp. 48–56, 1990.

    [13] S. Huang and H. Xing, “Extract intelligible and concise fuzzy rules from

    neural networks,” Fuzzy Sets and Systems, vol. 132, pp. 233–243, 2002.

    [14] M. Ishikawa, “Structural learning and rule discovery,” in Knowledge-

    Based Neurocomputing, I. Cloete and J. M. Zurada, Eds. Kluwer, 2000,

    pp. 153–206.

    [15] J.-S. R. Jang and C.-T. Sun, “Functional equivalence between radial ba-

    sis function networks and fuzzy inference systems,” IEEE Trans. Neural

    Networks, vol. 4, pp. 156–159, 1993.

    [16] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Com-

    puting: A Computational Approach to Learning and Machine Intelli-

    gence. Prentice-Hall, 1997.

    [17] E. Kolman and M. Margaliot, “Are artificial neural networks white

    boxes?” IEEE Trans. Neural Networks, vol. 16, pp. 844–852, 2005.

    [18] G. Leng, T. McGinnity, and G. Prasad, “An approach for on-line extrac-

    tion of fuzzy rules using a self-organizing fuzzy neural network,” Fuzzy

    Sets and Systems, vol. 150, pp. 211–243, 2005.

    [19] M. Margaliot and G. Langholz, New Approaches to Fuzzy Modeling and

    Control - Design and Analysis. World Scientific, 2000.

    [20] K. McGarry, S. Wermter, and J. MacIntyre, “Hybrid neural systems:

    from simple coupling to fully integrated neural networks,” Neural Com-

    puting Surveys, vol. 2, pp. 62–93, 1999.

    [21] S. Mitra and Y. Hayashi, “Neuro-fuzzy rule generation: survey in soft

    computing framework,” IEEE Trans. Neural Networks, vol. 11, pp. 748–

    768, 2000.

    [22] S. Mitra and S. Pal, “Fuzzy multi-layer perceptron, inferencing and rule

    generation,” IEEE Trans. Neural Networks, vol. 6, pp. 51–63, 1995.

    [23] D. Nguyen and B. Widrow, “Improving the learning speed of 2-layer

    neural networks by choosing initial values of the adaptive weights,” in

    Proc. Int. Joint Conf. Neural Networks, vol. 3, San Diego, California,

    1990, pp. 21–26.

    [24] R. Paiva and A. Dourado, “Interpretability and learning in neuro-fuzzy

    systems,” Fuzzy Sets and Systems, vol. 147, pp. 17–38, 2004.

    [25] S. Sestito and T. Dillon, “Knowledge acquisition of conjunctive rules

    using multilayered neural networks,” Int. J. Intelligent Systems, vol. 8,

    pp. 779–805, 1993.

    [26] R. Setiono, “Extracting rules from neural networks by pruning and

    hidden-unit splitting,” Neural Computation, vol. 9, pp. 205–225, 1997.

    [27] A. Tickle, R. Andrews, M. Golea, and J. Diederich, “The truth will come

    to light: directions and challenges in extracting the knowledge embedded

    within trained artificial neural networks,” IEEE Trans. Neural Networks,

    vol. 9, pp. 1057–1068, 1998.

    [28] G. G. Towell and J. W. Shavlik, “Extracting refined rules from

    knowledge-based neural networks,” Machine Learning, vol. 13, pp. 71–

    101, 1993.

    [29] E. Tron and M. Margaliot, “Mathematical modeling of observed natural

    behavior: a fuzzy logic approach,” Fuzzy Sets and Systems, vol. 146, pp.

    437–450, 2004.

    [30] ——, “How does the Dendrocoleum lacteum orient to light? a fuzzy

    modeling approach,” Fuzzy Sets and Systems, 2005, to appear.

    [31] L. A. Zadeh, “Fuzzy logic = computing with words,” IEEE Trans. Fuzzy

    Systems, vol. 4, pp. 103–111, 1996.

    [32] D. Zhang, X.-L. Bai, and K.-Y. Cai, “Extended neuro-fuzzy models of

    multilayer perceptrons,” Fuzzy Sets and Systems, vol. 142, pp. 221–242,

    2004.
