{"title": "Extracting Tree-Structured Representations of Trained Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 24, "page_last": 30, "abstract": null, "full_text": "Extracting Thee-Structured \n\nRepresentations of Thained  Networks \n\nMark W.  Craven and Jude W.  Shavlik \n\nComputer Sciences Department \nUniversity of Wisconsin-Madison \n\n1210 West  Dayton St. \nMadison,  WI 53706 \n\ncraven@cs.wisc.edu,  shavlik@cs.wisc.edu \n\nAbstract \n\nA  significant  limitation  of  neural  networks  is  that  the  represen(cid:173)\ntations  they  learn  are  usually  incomprehensible  to  humans.  We \npresent a  novel algorithm , TREPAN,  for extracting comprehensible, \nsymbolic representations from  trained neural  networks.  Our algo(cid:173)\nrithm uses  queries to induce a  decision tree that approximates the \nconcept  represented by  a  given  network.  Our experiments demon(cid:173)\nstrate that TREPAN is  able to produce decision trees that maintain \na high level of fidelity to their respective networks while being com(cid:173)\nprehensible  and  accurate.  Unlike  previous  work  in  this  area,  our \nalgorithm is  general in its applicability and scales well  to large net(cid:173)\nworks  and problems with  high-dimensional input spaces. \n\n1 \n\nIntroduction \n\nFor  many  learning  tasks ,  it  is  important  to  produce  classifiers  that  are  not  only \nhighly  accurate,  but  also  easily  understood  by  humans.  Neural  networks  are  lim(cid:173)\nited  in  this  respect,  since  they  are  usually  difficult  to  interpret  after training.  In \ncontrast to  neural  networks,  the  solutions formed  by  \"symbolic\"  learning  systems \n(e.g.,  Quinlan,  1993)  are  usually  much  more  amenable  to  human  comprehension. \nWe  present  a  novel  algorithm ,  TREPAN,  for  extracting  comprehensible,  symbolic \nrepresentations  from  trained  neural  networks.  
TREPAN queries a given network to induce a decision tree that describes the concept represented by the network. We evaluate our algorithm using several real-world problem domains, and present results that demonstrate that TREPAN is able to produce decision trees that are accurate and comprehensible, and maintain a high level of fidelity to the networks from which they were extracted. Unlike previous work in this area, our algorithm is very general in its applicability, and scales well to large networks and problems with high-dimensional input spaces.

The task that we address is defined as follows: given a trained network and the data on which it was trained, produce a concept description that is comprehensible, yet classifies instances in the same way as the network. The concept description produced by our algorithm is a decision tree, like those generated using popular decision-tree induction algorithms (Breiman et al., 1984; Quinlan, 1993).

There are several reasons why the comprehensibility of induced concept descriptions is often an important consideration. If the designers and end-users of a learning system are to be confident in the performance of the system, they must understand how it arrives at its decisions. Learning systems may also play an important role in the process of scientific discovery. A system may discover salient features and relationships in the input data whose importance was not previously recognized. If the representations formed by the learner are comprehensible, then these discoveries can be made accessible to human review.
However, for many problems in which comprehensibility is important, neural networks provide better generalization than common symbolic learning algorithms. It is in these domains that it is important to be able to extract comprehensible concept descriptions from trained networks.

2 Extracting Decision Trees

Our approach views the task of extracting a comprehensible concept description from a trained network as an inductive learning problem. In this learning task, the target concept is the function represented by the network, and the concept description produced by our learning algorithm is a decision tree that approximates the network. However, unlike most inductive learning problems, we have available an oracle that is able to answer queries during the learning process. Since the target function is simply the concept represented by the network, the oracle uses the network to answer queries. The advantage of learning with queries, as opposed to ordinary training examples, is that they can be used to garner information precisely where it is needed during the learning process.

Our algorithm, as shown in Table 1, is similar to conventional decision-tree algorithms, such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), which learn directly from a training set. However, TREPAN is substantially different from these conventional algorithms in a number of respects, which we detail below.

The Oracle. The role of the oracle is to determine the class (as predicted by the network) of each instance that is presented as a query. Queries to the oracle, however, do not have to be complete instances, but instead can specify constraints on the values that the features can take.
In the latter case, the oracle generates a complete instance by randomly selecting values for each feature, while ensuring that the constraints are satisfied. In order to generate these random values, TREPAN uses the training data to model each feature's marginal distribution. TREPAN uses frequency counts to model the distributions of discrete-valued features, and a kernel density estimation method (Silverman, 1986) to model continuous features. As shown in Table 1, the oracle is used for three different purposes: (i) to determine the class labels for the network's training examples; (ii) to select splits for each of the tree's internal nodes; and (iii) to determine if a node covers instances of only one class. These aspects of the algorithm are discussed in more detail below.

Table 1: The TREPAN algorithm.

TREPAN(training_examples, features)
    Queue := {}                                    /* sorted queue of nodes to expand */
    for each example E ∈ training_examples
        class label for E := ORACLE(E)             /* use net to label examples */
    initialize the root of the tree, T, as a leaf node
    put (T, training_examples, {}) into Queue
    while Queue is not empty and size(T) < tree_size_limit
        remove node N from head of Queue           /* expand a node */
        examples_N := example set stored with N
        constraints_N := constraint set stored with N
        use features to build set of candidate splits
        use examples_N and calls to ORACLE(constraints_N) to evaluate splits
        S := best binary split
        search for best m-of-n split, S', using S as a seed
        make N an internal node with split S'
        for each outcome, s, of S'                 /* make children nodes */
            make C, a new child node of N
            constraints_C := constraints_N ∪ {S' = s}
            use calls to ORACLE(constraints_C) to determine if C should remain a leaf
            otherwise
                examples_C := members of examples_N with outcome s on split S'
                put (C, examples_C, constraints_C) into Queue
    return T

Tree Expansion. Unlike most decision-tree algorithms, which grow trees in a depth-first manner, TREPAN grows trees using a best-first expansion. The notion of the best node, in this case, is the one at which there is the greatest potential to increase the fidelity of the extracted tree to the network. The function used to evaluate node n is f(n) = reach(n) × (1 − fidelity(n)), where reach(n) is the estimated fraction of instances that reach n when passed through the tree, and fidelity(n) is the estimated fidelity of the tree to the network for those instances.

Split Types. The role of internal nodes in a decision tree is to partition the input space in order to increase the separation of instances of different classes. In C4.5, each of these splits is based on a single feature.
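The best-first expansion described above can be sketched with a priority queue ordered on f(n) = reach(n) × (1 − fidelity(n)). This is a minimal illustration of the scoring rule only; the node names and numbers below are invented, not taken from the paper:

```python
import heapq

# Score from the text: prioritize nodes where expanding has the greatest
# potential to increase the tree's fidelity to the network.
def score(reach, fidelity):
    return reach * (1.0 - fidelity)

# heapq is a min-heap, so we push negated scores to pop the best node first.
queue = []
for name, reach, fidelity in [("n1", 0.5, 0.9), ("n2", 0.2, 0.5), ("n3", 0.3, 0.6)]:
    heapq.heappush(queue, (-score(reach, fidelity), name))

# Expansion order: n3 (score 0.12), then n2 (0.10), then n1 (0.05).
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

Note that a node with high reach but near-perfect fidelity (n1) is expanded last, even though many instances pass through it.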
Our algorithm, like Murphy and Pazzani's (1991) ID2-of-3 algorithm, forms trees that use m-of-n expressions for its splits. An m-of-n expression is a Boolean expression that is specified by an integer threshold, m, and a set of n Boolean conditions. An m-of-n expression is satisfied when at least m of its n conditions are satisfied. For example, suppose we have three Boolean features, a, b, and c; the m-of-n expression 2-of-{a, ¬b, c} is logically equivalent to (a ∧ ¬b) ∨ (a ∧ c) ∨ (¬b ∧ c).

Split Selection. Split selection involves deciding how to partition the input space at a given internal node in the tree. A limitation of conventional tree-induction algorithms is that the amount of training data used to select splits decreases with the depth of the tree. Thus splits near the bottom of a tree are often poorly chosen because these decisions are based on few training examples. In contrast, because TREPAN has an oracle available, it is able to use as many instances as desired to select each split. TREPAN chooses a split after considering at least S_min instances, where S_min is a parameter of the algorithm.

When selecting a split at a given node, the oracle is given the list of all of the previously selected splits that lie on the path from the root of the tree to that node. These splits serve as constraints on the feature values that any instance generated by the oracle can take, since any example must satisfy these constraints in order to reach the given node.

Like the ID2-of-3 algorithm, TREPAN uses a hill-climbing search process to construct its m-of-n splits.
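To make the m-of-n semantics concrete, here is a minimal sketch (ours, not the paper's implementation) that evaluates an m-of-n expression and checks the logical equivalence claimed for the 2-of-{a, ¬b, c} example above:

```python
from itertools import product

# An m-of-n expression is true iff at least m of its n Boolean
# conditions hold; conditions are callables over an instance.
def m_of_n(m, conditions, instance):
    return sum(1 for c in conditions if c(instance)) >= m

# The example from the text: 2-of-{a, not-b, c}.
conds = [lambda x: x["a"], lambda x: not x["b"], lambda x: x["c"]]

# Verify equivalence to (a and not b) or (a and c) or (not b and c)
# over all eight truth assignments.
for a, b, c in product([False, True], repeat=3):
    x = {"a": a, "b": b, "c": c}
    dnf = (a and not b) or (a and c) or (not b and c)
    assert m_of_n(2, conds, x) == dnf
```

The counting form is what makes m-of-n splits compact: the disjunctive-normal-form expansion grows combinatorially in n, while the split itself lists only n conditions.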
The search process begins by first selecting the best binary split at the current node; as in C4.5, TREPAN uses the gain ratio criterion (Quinlan, 1993) to evaluate candidate splits. For two-valued features, a binary split separates examples according to their values for the feature. For discrete features with more than two values, we consider binary splits based on each allowable value of the feature (e.g., color=red?, color=blue?, ...). For continuous features, we consider binary splits on thresholds, in the same manner as C4.5. The selected binary split serves as a seed for the m-of-n search process. This greedy search uses the gain ratio measure as its heuristic evaluation function, and uses the following two operators (Murphy & Pazzani, 1991):

• m-of-n+1: Add a new value to the set, and hold the threshold constant. For example, 2-of-{a, b} ⇒ 2-of-{a, b, c}.

• m+1-of-n+1: Add a new value to the set, and increment the threshold. For example, 2-of-{a, b, c} ⇒ 3-of-{a, b, c, d}.

Unlike ID2-of-3, TREPAN constrains m-of-n splits so that the same feature is not used in two or more disjunctive splits which lie on the same path between the root and a leaf of the tree. Without this restriction, the oracle might have to solve difficult satisfiability problems in order to create instances for nodes on such a path.

Stopping Criteria. TREPAN uses two separate criteria to decide when to stop growing an extracted decision tree. First, a given node becomes a leaf in the tree if, with high probability, the node covers only instances of a single class. To make this decision, TREPAN determines the proportion of examples, p_c, that fall into the most common class at a given node, and then calculates a confidence interval around this proportion (Hogg & Tanis, 1983).
The oracle is queried for additional examples until prob(p_c < 1 − ε) < δ, where ε and δ are parameters of the algorithm.

TREPAN also accepts a parameter that specifies a limit on the number of internal nodes in an extracted tree. This parameter can be used to control the comprehensibility of extracted trees, since in some domains, it may require very large trees to describe networks to a high level of fidelity.

3 Empirical Evaluation

In our experiments, we are interested in evaluating the trees extracted by our algorithm according to three criteria: (i) their predictive accuracy; (ii) their comprehensibility; and (iii) their fidelity to the networks from which they were extracted. We evaluate TREPAN using four real-world domains: the Congressional voting data set (15 features, 435 examples) and the Cleveland heart-disease data set (13 features, 303 examples) from the UC-Irvine database; a promoter data set (57 features, 468 examples) which is a more complex superset of the UC-Irvine one; and a data set in which the task is to recognize protein-coding regions in DNA (64 features, 20,000 examples) (Craven & Shavlik, 1993b). We remove the physician-fee-freeze feature from the voting data set to make the problem more difficult. We conduct our experiments using a 10-fold cross-validation methodology, except in the protein-coding domain. Because of certain domain-specific characteristics of this data set, we use 4-fold cross-validation for our experiments with it.

We measure accuracy and fidelity on the examples in the test sets. Whereas accuracy is defined as the percentage of test-set examples that are correctly classified, fidelity is defined as the percentage of test-set examples on which the classification made by a tree agrees with its neural-network counterpart. Since the comprehensibility of a decision tree is problematic to measure, we measure the syntactic complexity of trees and take this as being representative of their comprehensibility. Specifically, we measure the complexity of each tree in two ways: (i) the number of internal (i.e., non-leaf) nodes in the tree, and (ii) the number of symbols used in the splits of the tree. We count an ordinary, single-feature split as one symbol. We count an m-of-n split as n symbols, since such a split lists n feature values.

Table 2: Test-set accuracy and fidelity.

                              accuracy                        fidelity
domain            networks    C4.5    ID2-of-3    TREPAN      TREPAN
heart               84.5%    71.0%      74.6%      81.8%       94.1%
promoters           90.6     84.4       83.5       87.6        85.7
protein coding      94.1     90.3       90.9       91.4        92.4
voting              92.2     89.2       87.8       90.8        95.9

The neural networks we use in our experiments have a single layer of hidden units. The number of hidden units used for each network (0, 5, 10, 20, or 40) is chosen using cross validation on the network's training set, and we use a validation set to decide when to stop training networks. TREPAN is applied to each saved network. The parameters of TREPAN are set as follows for all runs: at least 1000 instances (training examples plus queries) are considered before selecting each split; we set the ε and δ parameters, which are used for the stopping-criterion procedure, to 0.05; and the maximum tree size is set to 15 internal nodes, which is the size of a complete binary tree of depth four.

As baselines for comparison, we also run Quinlan's (1993) C4.5 algorithm, and Murphy and Pazzani's (1991) ID2-of-3 algorithm on the same testbeds.
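The accuracy and fidelity measures described above are simple agreement rates, differing only in what the tree's predictions are compared against. A minimal sketch (the function names and toy prediction vectors are ours, for illustration):

```python
# Accuracy: fraction of test examples where the tree matches the true label.
def accuracy(tree_preds, true_labels):
    return sum(t == y for t, y in zip(tree_preds, true_labels)) / len(true_labels)

# Fidelity: fraction of test examples where the tree matches the network's
# prediction, regardless of whether either of them is actually correct.
def fidelity(tree_preds, net_preds):
    return sum(t == n for t, n in zip(tree_preds, net_preds)) / len(net_preds)

tree = [1, 0, 1, 1]
net  = [1, 0, 0, 1]
true = [1, 1, 0, 1]
# Here the tree agrees with the network on 3 of 4 examples (fidelity 0.75)
# but with the true labels on only 2 of 4 (accuracy 0.5).
```

A tree can thus have high fidelity to an inaccurate network, which is why the two measures are reported separately in Table 2.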
Recall that ID2-of-3 is similar to C4.5, except that it learns trees that use m-of-n splits. We use C4.5's pruning method for both algorithms and use cross validation to select pruning levels for each training set. The cross-validation runs evaluate unpruned trees and trees pruned with confidence levels ranging from 10% to 90%.

Table 2 shows the test-set accuracy results for our experiments. It can be seen that, for every data set, neural networks generalize better than the decision trees learned by C4.5 and ID2-of-3. The decision trees extracted from the networks by TREPAN are also more accurate than the C4.5 and ID2-of-3 trees in all domains. The differences in accuracy between the neural networks and the two conventional decision-tree algorithms (C4.5 and ID2-of-3) are statistically significant for all four domains at the 0.05 level using a paired, two-tailed t-test. We also test the significance of the accuracy differences between TREPAN and the other decision-tree algorithms. Except for the promoter domain, these differences are also statistically significant. The results in this table indicate that, for a range of interesting tasks, our algorithm is able to extract decision trees which are more accurate than decision trees induced strictly from the training data.

Table 2 also shows the test-set fidelity measurements for the TREPAN trees. These results indicate that the trees extracted by TREPAN provide close approximations to their respective neural networks.

Table 3 shows tree-complexity measurements for C4.5, ID2-of-3, and TREPAN. For all four data sets, the trees learned by TREPAN have fewer internal nodes than the trees produced by C4.5 and ID2-of-3.

Table 3: Tree complexity.

                       # internal nodes               # symbols
domain             C4.5    ID2-of-3   TREPAN     C4.5    ID2-of-3   TREPAN
heart              17.5      15.7      11.8      17.5      48.8      20.8
promoters          11.2      12.6       9.2      11.2      47.5      23.8
protein coding    155.0      66.0      10.0     155.0     455.3      36.0
voting             20.1      19.2      11.2      20.1      77.3      20.8

In most cases, the trees produced by TREPAN and ID2-of-3 use more symbols than C4.5, since their splits are more complex. However, for most of the data sets, the TREPAN trees and the C4.5 trees are comparable in terms of their symbol complexity. For all data sets, the ID2-of-3 trees are more complex than the TREPAN trees. Based on these results, we argue that the trees extracted by TREPAN are as comprehensible as the trees learned by conventional decision-tree algorithms.

4 Discussion and Conclusions

In the previous section, we evaluated our algorithm along the dimensions of fidelity, syntactic complexity, and accuracy. Another advantage of our approach is its generality. Unlike numerous other extraction methods (Hayashi, 1991; McMillan et al., 1992; Craven & Shavlik, 1993a; Sethi et al., 1993; Tan, 1994; Tchoumatchenko & Ganascia, 1994; Alexander & Mozer, 1995; Setiono & Liu, 1995), the TREPAN algorithm does not place any requirements on either the architecture of the network or its training method. TREPAN simply uses the network as a black box to answer queries during the extraction process. In fact, TREPAN could be used to extract decision trees from other types of opaque learning systems, such as nearest-neighbor classifiers.
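Since the extraction process needs only predicted labels, the oracle can wrap any model that exposes a prediction routine. The sketch below (the class and method names are our invention, not an interface from the paper) pairs such an oracle with a toy nearest-neighbor classifier of the kind mentioned above:

```python
# An oracle needs nothing from the underlying learner except a way to
# label instances, so any model with a predict() method can stand in
# for the trained network.
class Oracle:
    def __init__(self, model):
        self.model = model

    def query(self, instance):
        return self.model.predict(instance)

# A stand-in "opaque" learning system: a 1-nearest-neighbor classifier.
class OneNN:
    def __init__(self, examples):          # examples: [(features, label), ...]
        self.examples = examples

    def predict(self, x):
        def dist(e):                       # squared Euclidean distance
            return sum((a - b) ** 2 for a, b in zip(e[0], x))
        return min(self.examples, key=dist)[1]

oracle = Oracle(OneNN([((0.0, 0.0), "neg"), ((1.0, 1.0), "pos")]))
label = oracle.query((0.9, 0.8))  # nearest stored example is (1, 1) -> "pos"
```

Swapping in a neural network, or any other classifier, changes only the object handed to the oracle; the extraction loop itself is untouched.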
There are several existing algorithms which do not require special network architectures or training procedures (Saito & Nakano, 1988; Fu, 1991; Gallant, 1993). These algorithms, however, assume that each hidden unit in a network can be accurately approximated by a threshold unit. Additionally, these algorithms do not extract m-of-n rules, but instead extract only conjunctive rules. In previous work (Craven & Shavlik, 1994; Towell & Shavlik, 1993), we have shown that this type of algorithm produces rule sets which typically are far too complex to be comprehensible. Thrun (1995) has developed a general method for rule extraction, and has described how his algorithm can be used to verify that an m-of-n rule is consistent with a network, but he has not developed a rule-searching method that is able to find concise rule sets. A strength of our algorithm, in contrast, is its scalability. We have demonstrated that our algorithm is able to produce succinct decision-tree descriptions of large networks in domains with large input spaces.

In summary, a significant limitation of neural networks is that their concept representations are usually not amenable to human understanding. We have presented an algorithm that is able to produce comprehensible descriptions of trained networks by extracting decision trees that accurately describe the networks' concept representations. We believe that our algorithm, which takes advantage of the fact that a trained network can be queried, represents a promising advance towards the goal of general methods for understanding the solutions encoded by trained networks.

Acknowledgements

This research was partially supported by ONR grant N00014-93-1-0998.

References

Alexander, J. A. & Mozer, M. C. (1995). Template-based algorithms for connectionist rule extraction. In Tesauro, G., Touretzky, D., & Leen, T., editors, Advances in Neural Information Processing Systems (volume 7). MIT Press.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA.

Craven, M. & Shavlik, J. (1993a). Learning symbolic rules using artificial neural networks. In Proc. of the 10th International Conference on Machine Learning, (pp. 73-80), Amherst, MA. Morgan Kaufmann.

Craven, M. W. & Shavlik, J. W. (1993b). Learning to predict reading frames in E. coli DNA sequences. In Proc. of the 26th Hawaii International Conference on System Sciences, (pp. 773-782), Wailea, HI. IEEE Press.

Craven, M. W. & Shavlik, J. W. (1994). Using sampling and queries to extract rules from trained neural networks. In Proc. of the 11th International Conference on Machine Learning, (pp. 37-45), New Brunswick, NJ. Morgan Kaufmann.

Fu, L. (1991). Rule learning by searching on adapted nets. In Proc. of the 9th National Conference on Artificial Intelligence, (pp. 590-595), Anaheim, CA. AAAI/MIT Press.

Gallant, S. I. (1993). Neural Network Learning and Expert Systems. MIT Press.

Hayashi, Y. (1991). A neural expert system with automated extraction of fuzzy if-then rules. In Lippmann, R., Moody, J., & Touretzky, D., editors, Advances in Neural Information Processing Systems (volume 3). Morgan Kaufmann, San Mateo, CA.

Hogg, R. V. & Tanis, E. A. (1983). Probability and Statistical Inference. Macmillan.

McMillan, C., Mozer, M. C., & Smolensky, P. (1992). Rule induction through integrated symbolic and subsymbolic processing.
In Moody, J., Hanson, S., & Lippmann, R., editors, Advances in Neural Information Processing Systems (volume 4). Morgan Kaufmann.

Murphy, P. M. & Pazzani, M. J. (1991). ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. In Proc. of the 8th International Machine Learning Workshop, (pp. 183-187), Evanston, IL. Morgan Kaufmann.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Saito, K. & Nakano, R. (1988). Medical diagnostic expert system based on PDP model. In Proc. of the IEEE International Conference on Neural Networks, (pp. 255-262), San Diego, CA. IEEE Press.

Sethi, I. K., Yoo, J. H., & Brickman, C. M. (1993). Extraction of diagnostic rules using neural networks. In Proc. of the 6th IEEE Symposium on Computer-Based Medical Systems, (pp. 217-222), Ann Arbor, MI. IEEE Press.

Setiono, R. & Liu, H. (1995). Understanding neural networks via rule extraction. In Proc. of the 14th International Joint Conference on Artificial Intelligence, (pp. 480-485), Montreal, Canada.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.

Tan, A.-H. (1994). Rule learning and extraction with self-organizing neural networks. In Proc. of the 1993 Connectionist Models Summer School. Erlbaum.

Tchoumatchenko, I. & Ganascia, J.-G. (1994). A Bayesian framework to integrate symbolic and neural learning. In Proc. of the 11th International Conference on Machine Learning, (pp. 302-308), New Brunswick, NJ. Morgan Kaufmann.

Thrun, S. (1995). Extracting rules from artificial neural networks with distributed representations. In Tesauro, G., Touretzky, D., & Leen, T., editors, Advances in Neural Information Processing Systems (volume 7). MIT Press.

Towell, G. & Shavlik, J. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1):71-101.
", "award": [], "sourceid": 1152, "authors": [{"given_name": "Mark", "family_name": "Craven", "institution": null}, {"given_name": "Jude", "family_name": "Shavlik", "institution": null}]}