{"title": "Interpretation of Artificial Neural Networks: Mapping Knowledge-Based Neural Networks into Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": null, "full_text": "Interpretation of Artificial Neural Networks: \n\nMapping Knowledge-Based Neural Networks into Rules \n\nGeoffrey Towell \n\nJude W. Shavlik \n\nComputer Sciences Department \n\nUniversity of Wisconsin \n\nMadison, WI 53706 \n\nAbstract \n\nWe propose and empirically evaluate a method for the extraction of expert-comprehensible \nrules from trained neural networks. Our method operates in \nthe context of a three-step process for learning that uses rule-based domain \nknowledge in combination with neural networks. Empirical tests using real-world \nproblems from molecular biology show that the rules our method extracts \nfrom trained neural networks: closely reproduce the accuracy of the network \nfrom which they came, are superior to the rules derived by a learning system that \ndirectly refines symbolic rules, and are expert-comprehensible. \n\n1 Introduction \n\nArtificial neural networks (ANNs) have proven to be a powerful and general technique \nfor machine learning [1, 11]. However, ANNs have several well-known shortcomings. \nPerhaps the most significant of these shortcomings is that determining why a trained ANN \nmakes a particular decision is all but impossible. Without the ability to explain a network's \ndecisions, it is hard to be confident in the reliability of a network that addresses a real-world \nproblem. Moreover, this shortcoming makes it difficult to transfer the information learned \nby a network to the solution of related problems. Therefore, methods for the extraction of \ncomprehensible, symbolic rules from trained networks are desirable. \n\nOur approach to understanding trained networks uses the three-link chain illustrated by \nFigure 1. 
The first link inserts domain knowledge, which need be neither complete nor \ncorrect, into a neural network using KBANN [13] (see Section 2). (Networks created \nusing KBANN are called KNNs.) The second link trains the KNN using a set of classified \ntraining examples and standard neural learning methods [9]. The final link extracts rules \nfrom trained KNNs. Rule extraction is an extremely difficult task for arbitrarily-configured \nnetworks, but is somewhat less daunting for KNNs due to their initial comprehensibility. \nOur method (described in Section 3) takes advantage of this property to efficiently extract \nrules from trained KNNs. \n\nFigure 1: Rule refinement using neural networks. \n\nSignificantly, when evaluated in terms of the ability to correctly classify examples not seen \nduring training, our method produces rules that are equal or superior to the networks from \nwhich they came (see Section 4). Moreover, the extracted rules are superior to the rules \nresulting from methods that act directly on the rules (rather than their re-representation as a \nneural network). Also, our method is superior to the most widely-published algorithm for \nthe extraction of rules from general neural networks. \n\n2 The KBANN Algorithm \n\nThe KBANN algorithm translates symbolic domain knowledge into neural networks, defining \nthe topology and connection weights of the networks it creates. It uses a knowledge base of \ndomain-specific inference rules to define what is initially known about a topic. A detailed \nexplanation of this rule-translation appears in [13]. \n\nAs an example of the KBANN method, consider the sample domain knowledge in Figure 2a \nthat defines membership in category A. Figure 2b represents the hierarchical structure \nof these rules: solid and dotted lines represent necessary and prohibitory dependencies, \nrespectively. 
Figure 2c represents the KNN that results from the translation into a neural \nnetwork of this domain knowledge. Units X and Y in Figure 2c are introduced into the \nKNN to handle the disjunction in the rule set. Otherwise, each unit in the KNN corresponds \nto a consequent or an antecedent in the domain knowledge. The thick lines in Figure 2c \nrepresent heavily-weighted links in the KNN that correspond to dependencies in the domain \nknowledge. The thin lines represent the links added to the network to allow refinement of \nthe domain knowledge. Weights and biases in the network are set so that, prior to learning, \nthe network's response to inputs is exactly the same as that of the domain knowledge. \n\nThis example illustrates the two principal benefits of using KBANN to initialize KNNs. \nFirst, the algorithm indicates the features that are believed to be important to an example's \nclassification. Second, it specifies important derived features, thereby guiding the choice \nof the number and connectivity of hidden units. \n\nA :- B, C. \nB :- not H. \nB :- not F, G. \nC :- I, J. \n\n(a) \n\nFigure 2: Translation of domain knowledge into a KNN: (a) the rules, (b) their dependency hierarchy, (c) the resulting KNN. \n\n3 Rule Extraction \n\nAlmost every method of rule extraction makes two assumptions about networks. The first is that \ntraining does not significantly shift the meaning of units. By making this assumption, the \nmethods are able to attach labels to rules that correspond to terms in the domain knowledge \nupon which the network is based. These labels enhance the comprehensibility of the rules. \nThe second assumption is that the units in a trained KNN are always either active (\u2248 1) \nor inactive (\u2248 0). Under this assumption each non-input unit in a trained KNN can be \ntreated as a Boolean rule. Therefore, the problem for rule extraction is to determine the \nsituations in which the \"rule\" is true. 
Examination of trained KNNs validates both of these \nassumptions. \n\nGiven these assumptions, the simplest method for extracting rules we call the SUBSET \nmethod. This method operates by exhaustively searching for subsets of the links into a unit \nsuch that the sum of the weights of the links in the subset guarantees that the total input \nto the unit exceeds its bias. In the limit, SUBSET extracts a set of rules that reproduces the \nbehavior of the network. However, the combinatorics of this method render it impossible \nto implement. Heuristics can be added to reduce the complexity of the search at some cost \nin the accuracy of the resulting rules. Using heuristic search, SUBSET tends to produce \nrepetitive rules whose preconditions are difficult to interpret. (See [10] or [2] for more \ndetailed explanations of SUBSET.) \n\nOur algorithm, called NOFM, addresses both the combinatorial and presentation problems \ninherent to the SUBSET algorithm. It differs from SUBSET in that it explicitly searches for \nrules of the form: \"If (N of these M antecedents are true) ... \" \nThis method arose because we noticed that rule sets discovered by the SUBSET method \noften contain N-of-M style concepts. Further support for this method comes from \nexperiments that indicate neural networks are good at learning N-of-M concepts [1] as well \nas experiments that show a bias towards N-of-M style concepts is useful [5]. Finally, note \nthat purely conjunctive rules result if N = M, while a set of disjunctive rules results when \nN = 1; hence, using N-of-M rules does not restrict generality. \nThe idea underlying NOFM (summarized in Table 1) is that individual antecedents (links) \ndo not have unique importance. Rather, groups of antecedents form equivalence classes \nin which each antecedent has the same importance as, and is interchangeable with, other \nmembers of the class. 
This equivalence-class idea allows NOFM to consider groups of \nlinks without worrying about particular links within the group. Unfortunately, training \nusing backpropagation does not naturally bunch links into equivalence classes. Hence, the \nfirst step of NOFM groups links into equivalence classes. \n\nThis grouping can be done using standard clustering methods [3] in which clustering is \nstopped when no clusters are closer than a user-set distance (we use 0.25). After clustering, \nthe links to the unit in the upper-right corner of Figure 3 form two groups, one of four \nlinks with weight near one and one of three links with weight near six. (The effect of this \ngrouping is very similar to the training method suggested by Nowlan and Hinton [7].) \n\nTable 1: The NOFM algorithm for rule extraction. \n\n(1) With each hidden and output unit, form groups of similarly-weighted links. \n(2) Set link weights of all group members to the average of the group. \n(3) Eliminate any groups that do not affect whether the unit will be active or inactive. \n(4) Holding all link weights constant, optimize biases of hidden and output units. \n(5) Form a single rule for each hidden and output unit. The rule consists of a threshold given by \nthe bias and weighted antecedents specified by remaining links. \n(6) Where possible, simplify rules to eliminate superfluous weights and thresholds. \n\n[Figure 3 panels: unit Z with links from A, C and F weighted near 6 and links from B, D, E and G weighted near 1; each group averaged to 6.1 and 1.1; the 1.1 group removed.] \n\nInitial Unit \n\nAfter Steps 1 and 2 \n\nAfter Step 3 \n\nif 6.1 * NumberTrue(A, C, F) > 10.9 then Z. \n(NumberTrue returns the number of true antecedents.) \n\nAfter Steps 4 and 5 \n\nif 2 of {A, C, F} then Z. 
\n\nAfter Step 6 \n\nFigure 3: Rule extraction using NOFM. \n\nOnce the groups are formed, the procedure next attempts to identify and eliminate groups \nthat do not contribute to the calculation of the consequent. In the extreme case, this analysis \nis trivial; clusters can be eliminated solely on the basis of their weight. In Figure 3 no \ncombination of the cluster of links with weight 1.1 can cause the summed weights to exceed \nthe bias on unit Z. Hence, links with weight 1.1 are eliminated from Figure 3 after step 3. \n\nMore often, the assessment of a cluster's utility uses heuristics. The heuristic we use is to \nscan each training example and determine which groups can be eliminated while leaving \nthe example correctly categorized. Groups not required by any example are eliminated. \n\nWith unimportant groups eliminated, the next step of the procedure is to optimize the bias \non each unit. Optimization is required to adjust the network so that it accurately reflects \nthe assumption that units are Boolean. This can be done by freezing link weights (so that \nthe groups stay intact) and retraining the bias terms in the network. \n\nAfter optimization, rules are formed that simply re-express the network. Note that these \nrules are considerably simpler than the trained network; they have fewer antecedents and \nthose antecedents tend to be in a few weight classes. \n\nFinally, rules are simplified whenever possible to eliminate the weights and thresholds. \nSimplification is accomplished by a scan of each restated rule to determine combinations of \nclusters that exceed the threshold. In Figure 3 the result of this scan is a single N-of-M style \nrule. When a rule has more than one cluster, this scan may return multiple combinations \neach of which has several N-of-M predicates. In such cases, rules are left in their original \nform of weights and a threshold. 
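As a rough illustration of steps 1, 2, 3 and 6 of Table 1 on a single unit, the following Python sketch may help. It is not the authors' implementation: the function names, the gap-based one-dimensional clustering (a stand-in for the standard clustering methods of [3]), and the individual link weights are all assumptions chosen so that the group averages match Figure 3. Step 4 (retraining the bias) is skipped; the bias of 10.9 is simply taken as given.

```python
import math
from itertools import product


def cluster_links(links, max_gap=0.25):
    # Steps 1-2 (assumed variant): sort the weights, start a new group whenever
    # the gap to the previous weight exceeds max_gap, then average each group.
    ordered = sorted(links.items(), key=lambda kv: kv[1])
    groups = [[ordered[0]]]
    for name, weight in ordered[1:]:
        if weight - groups[-1][-1][1] > max_gap:
            groups.append([])
        groups[-1].append((name, weight))
    return [(sum(w for _, w in g) / len(g), sorted(n for n, _ in g))
            for g in groups]


def is_irrelevant(groups, idx, bias):
    # Step 3: a group is irrelevant if, for every number of true antecedents
    # in the other groups, the unit's state is the same whether this group
    # contributes its minimum or its maximum total weight.
    weight, members = groups[idx]
    low, high = sorted((0.0, weight * len(members)))
    others = [g for i, g in enumerate(groups) if i != idx]
    for counts in product(*[range(len(m) + 1) for _, m in others]):
        base = sum(c * w for c, (w, _) in zip(counts, others))
        if (base + low > bias) != (base + high > bias):
            return False
    return True


def extract_rule(links, bias, max_gap=0.25):
    groups = cluster_links(links, max_gap)
    kept = [g for i, g in enumerate(groups)
            if not is_irrelevant(groups, i, bias)]
    # Step 6: with one positive-weight group left, the weighted rule collapses
    # to the smallest N whose summed weight exceeds the bias.
    if len(kept) == 1 and kept[0][0] > 0 and bias >= 0:
        weight, members = kept[0]
        return ('n_of_m', math.floor(bias / weight) + 1, members)
    # Otherwise the rule stays in its form of weights and a threshold.
    return ('weighted', bias, kept)


# Hypothetical weights whose group averages are 6.1 and 1.1, as in Figure 3.
links = {'A': 6.1, 'B': 1.2, 'C': 6.0, 'D': 1.0, 'E': 1.0, 'F': 6.2, 'G': 1.2}
print(extract_rule(links, bias=10.9))
```

With these assumed weights the sketch yields the rule of Figure 3, 2 of {A, C, F}: the group with average weight 1.1 sums to at most 4.4, which never changes whether the total exceeds 10.9, so it is dropped.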
\n\n4 Experiments in Rule Extraction \n\nThis section presents a set of experiments designed to determine the relative strengths \nand weaknesses of the two rule-extraction methods described above. Rule-extraction \ntechniques are compared using two measures: quality, which is measured by the \naccuracy of the rules, and comprehensibility, which is approximated by analysis of extracted \nrule sets. \n\n4.1 Testing Methodology \n\nFollowing Weiss and Kulikowski [14], we use repeated 10-fold cross-validation (footnote 1) for \ntesting learning on two tasks from molecular biology: promoter recognition [13] and \nsplice-junction determination [6]. Networks are trained using the cross-entropy error function. Following \nHinton's [4] suggestion for improved network interpretability, all weights \"decay\" gently \nduring training. \n\n4.2 Accuracy of Extracted Rules \n\nFigure 4 addresses the issue of the accuracy of extracted rules. It plots percentage of errors \non the testing and training sets, averaged over eleven repetitions of 10-fold cross-validation, \nfor both the promoter and splice-junction tasks. For comparison, Figure 4 includes the \naccuracy of the trained KNNs prior to rule extraction (the bars labeled \"Network\"). Also \nincluded in Figure 4 is the accuracy of the EITHER system, an \"all symbolic\" method for \nthe empirical adaptation of rules [8]. (EITHER has not been applied to the splice-junction \nproblem.) \n\nThe initial rule sets for promoter recognition and splice-junction determination correctly \ncategorized 50% and 61%, respectively, of the examples. Hence, each of the systems \nplotted in Figure 4 improved upon the initial rules. Comparing only the systems that result \nin refined rules, the NOFM method is the clear winner. On training examples, the error \nrate for rules extracted by NOFM is slightly worse than that of EITHER but superior to that of the rules \nextracted using SUBSET. On the testing examples the NOFM rules are more accurate than \nboth EITHER and SUBSET. 
(One-tailed, paired-sample t-tests indicate that for both domains \nthe NOFM rules are superior to the SUBSET rules with 99.5% confidence.) \n\nPerhaps the most significant result in this paper is that, on the testing set, the error rate \nof the NOFM rules is equal or superior to that of the networks from which the rules were \nextracted. Conversely, the error rate of the SUBSET rules on testing examples is statistically \nworse than that of the networks in both problem domains. The discussion at the end of this paper \nanalyses the reasons why NOFM's rules can be superior to the networks from which they \ncame. \n\nFootnote 1: In N-fold cross-validation, the set of examples is partitioned into N sets of equal size. Networks \nare trained using N - 1 of the sets and tested using the remaining set. This procedure is repeated \nN times so that each set is used as the testing set once. We actually used only N - 2 of the sets \nfor training. One set was used for testing and the other to stop training to prevent overfitting of the \ntraining set. \n\nFigure 4: Error rates of extracted rules. [Bar charts of training-set and testing-set error for the promoter and splice-junction domains; bars labeled Network, MofN and Subset.] \n\n4.3 Comprehensibility \n\nTo be useful, the extracted rules must not only be accurate, they also must be understandable. \nTo assess rule comprehensibility, we looked at rule sets extracted by the NOFM method. \nTable 3 presents the rules extracted by NOFM for promoter recognition. The rules extracted \nby NOFM for splice-junction determination are not shown because they have much the \nsame character as those of the promoter domain. \n\nWhile Table 3 is somewhat murky, it is vastly more comprehensible than the network of \n3000 links from which it was extracted. Moreover, the rules in this table can be rewritten in \na form very similar to one used in the biological community [12], namely weight matrices. 
\n\nOne major pattern in the extracted rules is that the network learns to disregard a major \nportion of the initial rules. These same rules are dropped by other rule-refinement systems \n(e.g., EITHER). This suggests that the deletion of these rules is not merely an artifact of \nNOFM, but instead reflects an underlying property of the data. Hence, we demonstrate that \nmachine learning methods can provide valuable evidence about biological theories. \n\nLooking beyond the dropped rules, the rules NOFM extracts confirm the importance of the \nbases identified in the initial rules (Table 2). However, whereas the initial rules required \nmatching every base, the extracted rules allow a less than perfect match. In addition, \nthe extracted rules point to places in which changes to the sequence are important. For \ninstance, in the first minus10 rule, a 'T' in position 11 is a strong indicator that the rule \nis true. However, replacing the 'T' with either a 'G' or an 'A' prevents the rule from \nbeing satisfied. \n\n5 Discussion and Conclusions \n\nOur results indicate that the NOFM method not only can extract meaningful, symbolic rules \nfrom trained KNNs, but that the extracted rules can be superior to the networks from which \nthey came at classifying examples not seen during training. Additionally, the NOFM method \nproduces rules whose accuracy is substantially better than that of EITHER, an approach that directly \nmodifies the initial set of rules [8]. While the rule set produced by the NOFM algorithm is \nslightly larger than that produced by EITHER, the sets of rules produced by both of these \nalgorithms are small enough to be easily understood. Hence, although weighing the tradeoff \nbetween accuracy and understandability is problem- and user-specific, the NOFM approach \ncombined with KBANN offers an appealing mixture. \n\nTable 2: Partial set of original rules for promoter-recognition. \n\npromoter :- contact, conformation. \ncontact :- minus-35, minus-10. \nminus-35 :- @-37 'CTTGAC'. --- three additional rules \nminus-10 :- @-14 'TATAAT'. --- three additional rules \nconformation :- @-45 'AA--A'. --- three additional rules \n\nExamples are 57 base-pair long strands of DNA. Rules refer to bases by stating a sequence location \nfollowed by a subsequence. So, @-37 'CT' indicates a 'C' in position -37 and a 'T' in position -36. \n\nTable 3: Promoter rules NOFM extracts. \n\nPromoter :- Minus-35, Minus-10. \n\nMinus-35 :- 10 < 4.0 * nt(@-37 '--TTGAT-') + \n 1.5 * nt(@-37 '----TCC-') + \n 0.5 * nt(@-37 '---MC---') - \n 1.5 * nt(@-37 '--GGAGG-'). \n\nMinus-35 :- 10 < 5.0 * nt(@-37 '--T-G--A') + \n 3.1 * nt(@-37 '---GT---') + \n 1.9 * nt(@-37 '----C-CT') + \n 1.5 * nt(@-37 '---C--A-') - \n 1.5 * nt(@-37 '------GC') - \n 1.9 * nt(@-37 '--CAW---') - \n 3.1 * nt(@-37 '--A----C'). \n\nMinus-35 :- @-37 '-C-TGAC-'. \nMinus-35 :- @-37 '--TTD-CA'. \n\nMinus-10 :- 2 of @-14 '---CA---T' and \n not 1 of @-14 '---RB---S'. \n\nMinus-10 :- 10 < 3.0 * nt(@-14 '--TAT--T-') + \n 1.8 * nt(@-14 '-----GA--') + \n 0.7 * nt(@-14 '----GAT--') - \n 0.7 * nt(@-14 '--GKCCCS-'). \n\nMinus-10 :- 10 < 3.8 * nt(@-14 '--TA-A-T-') + \n 3.0 * nt(@-14 '--G--C---') + \n 1.0 * nt(@-14 '---T---A-') - \n 1.0 * nt(@-14 '--CS-G-S-') - \n 3.0 * nt(@-14 '--A--T---'). \n\nMinus-10 :- @-14 '-TAWA-T--'. \n\n\"nt()\" returns the number of antecedents enclosed in the parentheses that match the given sequence. So, \nnt(@-14 '---C--G--') would return 1 when matched against the sequence @-14 'AAACAAAAA'. \n\nTable 4: Standard nucleotide ambiguity codes. \n\nCode Meaning \nM A or C \nR A or G \nW A or T \nS C or G \nK G or T \nD A or G or T \nB C or G or T 
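To make the nt() notation of Table 3 and the ambiguity codes of Table 4 concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the names CODES, nt and rule_fires are ours, the sequence location (such as @-37) is assumed to have been applied already so that each function receives just the window of bases, and the example windows are hypothetical. The coefficients are those of the first weighted Minus-35 rule as we read it from Table 3.

```python
# Plain bases plus the ambiguity codes of Table 4.
CODES = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'M': 'AC', 'R': 'AG', 'W': 'AT', 'S': 'CG',
    'K': 'GT', 'D': 'AGT', 'B': 'CGT',
}


def nt(pattern, window):
    # Count the non-dash pattern positions whose code admits the base
    # found at the same position of the window.
    if len(pattern) != len(window):
        raise ValueError('pattern and window must have equal length')
    return sum(1 for code, base in zip(pattern, window)
               if code != '-' and base in CODES[code])


def rule_fires(threshold, terms, window):
    # terms is a list of (signed weight, pattern) pairs; the rule is true
    # when the weighted sum of match counts exceeds the threshold.
    return sum(w * nt(p, window) for w, p in terms) > threshold


# First weighted Minus-35 rule, as read from Table 3.
MINUS_35 = [(4.0, '--TTGAT-'), (1.5, '----TCC-'),
            (0.5, '---MC---'), (-1.5, '--GGAGG-')]

print(nt('---C--G--', 'AAACAAAAA'))
print(rule_fires(10, MINUS_35, 'AATTGATA'))
```

The first call reproduces the example in the note under Table 3 and gives 1. For the hypothetical window AATTGATA the first pattern matches at five positions and the others at none, so the sum is 20 and the rule is satisfied. Read this way, each rule is a small weight matrix over positions, which is the form mentioned in Section 4.3.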
\n\nThe superiority of the NOFM rules over the networks from which they are extracted may \noccur because the rule-extraction process reduces overfitting of the training examples. The \nprincipal evidence in support of this hypothesis is that the difference in ability to correctly \ncategorize testing and training examples is smaller for NOFM rules than for trained KNNs. \nThus, the rules extracted by NOFM sacrifice some training set accuracy to achieve higher \ntesting set accuracy. \n\nAdditionally, in earlier tests this effect was more pronounced; the NOFM rules were superior \nto the networks from which they came on both datasets (with 99according to a one-tailed \nt-test). Modifications to training to reduce overfitting improved generalization by networks \nwithout significantly affecting NOFM's rules. The result of the change in training method is \nthat the differences between the network and NOFM are not statistically significant in either \ndataset. However, the result is significant in that it supports the overfitting hypothesis. \n\nIn summary, the NOFM method extracts accurate, comprehensible rules from trained \nKNNs. The method is currently limited to KNNs; randomly-configured networks violate \nits assumptions. New training methods [7] may broaden the applicability of the method. \nEven without different methods for training, our results show that NOFM provides a \nmechanism through which networks can make expert-comprehensible explanations of their \nbehavior. In addition, the extracted rules allow for the transfer of learning to the solution \nof related problems. \n\nAcknowledgments \nThis work is partially supported by Office of Naval Research Grant N00014-90-J-1941, \nNational Science Foundation Grant IRI-9002413, and Department of Energy Grant \nDE-FG02-91ER61129. \n\nReferences \n[1] D. H. Fisher and K. B. McKusick. An empirical comparison of ID3 and back-propagation. 
\nIn Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages \n788-793, Detroit, MI, August 1989. \n\n[2] L. M. Fu. Rule learning by searching on adapted nets. In Proceedings of the Ninth National \nConference on Artificial Intelligence, pages 590-595, Anaheim, CA, 1991. \n\n[3] J. A. Hartigan. Clustering Algorithms. Wiley, New York, 1975. \n\n[4] G. E. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185-234, 1989. \n\n[5] P. M. Murphy and M. J. Pazzani. ID2-of-3: Constructive induction of N-of-M concepts for \ndiscriminators in decision trees. In Proceedings of the Eighth International Machine Learning \nWorkshop, pages 183-187, Evanston, IL, 1991. \n\n[6] M. O. Noordewier, G. G. Towell, and J. W. Shavlik. Training knowledge-based neural \nnetworks to recognize genes in DNA sequences. In Advances in Neural Information Processing \nSystems, 3, Denver, CO, 1991. Morgan Kaufmann. \n\n[7] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. In \nAdvances in Neural Information Processing Systems, 4, Denver, CO, 1991. Morgan Kaufmann. \n\n[8] D. Ourston and R. J. Mooney. Changing the rules: A comprehensive approach to theory \nrefinement. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages \n815-820, Boston, MA, August 1990. \n\n[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error \npropagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: \nExplorations in the Microstructure of Cognition. Volume 1: Foundations, pages 318-363. MIT \nPress, Cambridge, MA, 1986. \n\n[10] K. Saito and R. Nakano. Medical diagnostic expert system based on PDP model. In Proceedings \nof IEEE International Conference on Neural Networks, volume 1, pages 255-262, 1988. \n\n[11] J. W. Shavlik, R. J. Mooney, and G. G. Towell. Symbolic and neural net learning algorithms: \nAn empirical comparison. 
Machine Learning, 6:111-143, 1991. \n\n[12] G. D. Stormo. Consensus patterns in DNA. In Methods in Enzymology, volume 183, pages \n211-221. Academic Press, Orlando, FL, 1990. \n\n[13] G. G. Towell, J. W. Shavlik, and M. O. Noordewier. Refinement of approximately correct \ndomain theories by knowledge-based neural networks. In Proceedings of the Eighth National \nConference on Artificial Intelligence, pages 861-866, Boston, MA, 1990. \n\n[14] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn. Morgan Kaufmann, San \nMateo, CA, 1990. \n", "award": [], "sourceid": 546, "authors": [{"given_name": "Geoffrey", "family_name": "Towell", "institution": null}, {"given_name": "Jude", "family_name": "Shavlik", "institution": null}]}