{"title": "Generalization in Decision Trees and DNF: Does Size Matter?", "book": "Advances in Neural Information Processing Systems", "page_first": 259, "page_last": 265, "abstract": "", "full_text": "Generalization in decision trees and DNF: \n\nDoes size matter? \n\nMostefa Golea\\ Peter L.  Bartletth ,  Wee Sun Lee2  and Llew Mason1 \n\n1 Department of Systems Engineering \n\nResearch School of Information \nSciences and Engineering \nAustralian National University \nCanberra, ACT,  0200,  Australia \n\n2 School of Electrical Engineering \n\nUniversity  College UNSW \nAustralian Defence Force Academy \nCanberra, ACT,  2600,  Australia \n\nAbstract \n\nRecent  theoretical  results  for  pattern  classification  with  thresh(cid:173)\nolded real-valued  functions  (such  as  support  vector  machines,  sig(cid:173)\nmoid  networks,  and  boosting)  give  bounds  on  misclassification \nprobability  that  do  not  depend  on  the size  of the  classifier,  and \nhence can be considerably smaller than the bounds that follow from \nthe VC  theory.  In  this  paper,  we  show  that  these  techniques  can \nbe  more  widely  applied,  by  representing  other  boolean  functions \nas  two-layer neural  networks  (thresholded convex  combinations of \nboolean functions).  For example, we show that with high probabil(cid:173)\nity any decision tree of depth no more than d that is consistent with \nm training examples has misclassification probability no more than \no ( (~ (Neff VCdim(U)  log2 m log d)) 1/2),  where U is  the  class  of \nnode  decision  functions,  and  Neff  ::;  N  can  be  thought  of as  the \neffective  number of leaves  (it  becomes small  as  the distribution on \nthe  leaves  induced  by  the  training  data gets  far  from  uniform). \nThis  bound  is  qualitatively different  from  the VC  bound  and  can \nbe considerably smaller. \nWe use the same technique to give similar results for DNF formulae. \n\n\u2022  Author to whom correspondence should be addressed \n\n\f260 \n\nM.  Golea,  P Bartlett,  W.  S. Lee and L  Mason \n\n1 \n\nINTRODUCTION \n\nDecision trees are  widely  used  for  pattern classification  [2,  7].  For these  problems, \nresults from  the  VC  theory suggest  that the  amount  of training data should grow \nat  least  linearly  with  the size  of the tree[4,  3].  However,  empirical  results  suggest \nthat this is  not necessary  (see  [6,  10]).  For example,  it  has  been observed that the \nerror rate is  not always a monotonically increasing function of the tree size[6]. \nTo see why the size of a tree is not always a good measure of its complexity, consider \ntwo trees,  A  with N A  leaves and B  with N B  leaves,  where N B  \u00ab  N A .  Although A \nis  larger than B, if most of the classification in A  is  carried out by  very few  leaves \nand  the classification in  B  is  equally distributed over the leaves,  intuition suggests \nthat A  is  actually much simpler than B, since tree A  can be approximated well  by \na small tree with few  leaves.  In this paper, we  formalize  this intuition. \nWe  give  misclassification  probability  bounds  for  decision  trees  in  terms  of a  new \ncomplexity measure that depends on the distribution on the leaves that is  induced \nby  the  training  data,  and  can  be  considerably  smaller  than  the  size  of the  tree. 
These results build on recent theoretical results that give misclassification probability bounds for thresholded real-valued functions, including support vector machines, sigmoid networks, and boosting (see [1, 8, 9]), that do not depend on the size of the classifier. We extend these results to decision trees by considering a decision tree as a thresholded convex combination of the leaf functions (the boolean functions that specify, for a given leaf, which patterns reach that leaf). We can then apply the misclassification probability bounds for such classifiers. In fact, we derive and use a refinement of the previous bounds for convex combinations of base hypotheses, in which the base hypotheses can come from several classes of different complexity, and the VC-dimension of the base hypothesis class is replaced by the average (under the convex coefficients) of the VC-dimensions of these classes. For decision trees, the bounds we obtain depend on the effective number of leaves, a data-dependent quantity that reflects how uniformly the training data covers the tree's leaves. This bound is qualitatively different from the VC bound, which depends on the total number of leaves in the tree.

In the next section, we give some definitions and describe the techniques used. We present bounds on the misclassification probability of a thresholded convex combination of boolean functions from base hypothesis classes, in terms of a misclassification margin and the average VC-dimension of the base hypotheses. In Sections 3 and 4, we use this result to give error bounds for decision trees and disjunctive normal form (DNF) formulae.

2 GENERALIZATION ERROR IN TERMS OF MARGIN AND AVERAGE COMPLEXITY

We begin with some definitions. For a class H of {-1,1}-valued functions defined on the input space X, the convex hull co(H) of H is the set of [-1,1]-valued functions of the form Σ_i a_i h_i, where a_i ≥ 0, Σ_i a_i = 1, and h_i ∈ H. A function in co(H) is used for classification by composing it with the threshold function sgn : R → {-1,1}, which satisfies sgn(a) = 1 iff a ≥ 0. So f ∈ co(H) makes a mistake on the pair (x,y) ∈ X × {-1,1} iff sgn(f(x)) ≠ y. We assume that labelled examples (x,y) are generated according to some probability distribution D on X × {-1,1}, and we let P_D[E] denote the probability under D of an event E. If S is a finite subset of X × {-1,1}, we let P_S[E] denote the empirical probability of E (that is, the proportion of points in S that lie in E). We use E_D[.] and E_S[.] to denote expectation in a similar way. For a function class H of {-1,1}-valued functions defined on the input space X, the growth function and VC dimension of H will be denoted by Π_H(m) and VCdim(H) respectively.
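For concreteness, the following small sketch (an illustration only; the base hypothesis outputs, coefficients, and labels are toy placeholders, not part of the analysis) computes the empirical margin error P_S[yf(x) ≤ θ] of a convex combination of base classifiers.

import numpy as np

def margin_error(base_outputs, coeffs, y, theta):
    """Empirical margin error P_S[y f(x) <= theta] for f = sum_i a_i h_i.

    base_outputs: (n_hyp, m) array of {-1,+1} base hypothesis outputs h_i(x_j)
    coeffs:       (n_hyp,) convex coefficients a_i (nonnegative, summing to 1)
    y:            (m,) array of {-1,+1} labels
    theta:        margin threshold
    """
    coeffs = np.asarray(coeffs, dtype=float)
    assert np.all(coeffs >= 0) and np.isclose(coeffs.sum(), 1.0)
    f = coeffs @ base_outputs          # f(x_j), a value in [-1, 1]
    margins = y * f                    # y_j f(x_j)
    return np.mean(margins <= theta)   # proportion of the sample with small margin

# toy usage: three base hypotheses on five points
h = np.array([[1, 1, -1, 1, -1],
              [1, -1, -1, 1, 1],
              [1, 1, 1, -1, -1]])
y = np.array([1, 1, -1, 1, -1])
print(margin_error(h, [0.5, 0.3, 0.2], y, theta=0.2))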
In [8], Schapire et al give the following bound on the misclassification probability of a thresholded convex combination of functions, in terms of the proportion of training data that is labelled to the correct side of the threshold by some margin. (Notice that P_D[sgn(f(x)) ≠ y] ≤ P_D[yf(x) ≤ 0].)

Theorem 1 ([8]) Let D be a distribution on X × {-1,1}, H a hypothesis class with VCdim(H) = d < ∞, and δ > 0. With probability at least 1 - δ over a training set S of m examples chosen according to D, every function f ∈ co(H) and every θ > 0 satisfy

    P_D[yf(x) \le 0] \le P_S[yf(x) \le \theta] + O\left( \frac{1}{\sqrt{m}} \left( \frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right).

In Theorem 1, all of the base hypotheses in the convex combination f are elements of a single class H with bounded VC-dimension. The following theorem generalizes this result to the case in which these base hypotheses may be chosen from any of k classes, H_1, ..., H_k, which can have different VC-dimensions. It also gives a related result that shows the error decreases to twice the error estimate at a faster rate.

Theorem 2 Let D be a distribution on X × {-1,1}, H_1, ..., H_k hypothesis classes with VCdim(H_i) = d_i, and δ > 0. With probability at least 1 - δ over a training set S of m examples chosen according to D, every function f ∈ co(∪_{i=1}^k H_i) and every θ > 0 satisfy both

    P_D[yf(x) \le 0] \le P_S[yf(x) \le \theta] + O\left( \frac{1}{\sqrt{m}} \left( \frac{1}{\theta^2} (\bar{d} \log m + \log k) \log(m\theta^2/\bar{d}) + \log(1/\delta) \right)^{1/2} \right),

    P_D[yf(x) \le 0] \le 2 P_S[yf(x) \le \theta] + O\left( \frac{1}{m} \left( \frac{1}{\theta^2} (\bar{d} \log m + \log k) \log(m\theta^2/\bar{d}) + \log(1/\delta) \right) \right),

where \bar{d} = Σ_i a_i d_{j_i}, and the a_i and j_i are defined by f = Σ_i a_i h_i and h_i ∈ H_{j_i} for j_i ∈ {1, ..., k}.

Proof sketch: We shall sketch only the proof of the first inequality of the theorem. The proof closely follows the proof of Theorem 1 (see [8]). We consider a number of approximating sets of the form C_{N,l} = { (1/N) Σ_{i=1}^N h̄_i : h̄_i ∈ H_{l_i} }, where l = (l_1, ..., l_N) ∈ {1, ..., k}^N and N ∈ ℕ. Define C_N = ∪_l C_{N,l}. For a given f = Σ_i a_i h_i from co(∪_{i=1}^k H_i), we shall choose an approximation g ∈ C_N by choosing h̄_1, ..., h̄_N independently from {h_1, h_2, ...}, according to the distribution defined by the coefficients a_i. Let Q denote this distribution on C_N. As in [8], we can take the expectation under this random choice of g ∈ C_N to show that, for any θ > 0, P_D[yf(x) ≤ 0] ≤ E_{g~Q}[P_D[yg(x) ≤ θ/2]] + exp(-Nθ²/8). Now, for a given l ∈ {1, ..., k}^N, the probability that there is a g in C_{N,l} and a θ > 0 for which P_D[yg(x) ≤ θ/2] > P_S[yg(x) ≤ θ/2] + ε_{N,l} is at most 8(N+1) Π_{i=1}^N (2em/d_{l_i})^{d_{l_i}} exp(-m ε_{N,l}²/32). Applying the union bound (over the values of l), taking expectation over g ~ Q, and setting

    \epsilon_{N,l} = \left( \frac{32}{m} \ln\left( 8(N+1) \prod_{i=1}^N (2em/d_{l_i})^{d_{l_i}} \, k^N / \delta_N \right) \right)^{1/2}

shows that, with probability at least 1 - δ_N, every f and θ > 0 satisfy P_D[yf(x) ≤ 0] ≤ E_g[P_S[yg(x) ≤ θ/2]] + E_g[ε_{N,l}]. As above, we can bound the probability inside the first expectation in terms of P_S[yf(x) ≤ θ]. Also, Jensen's inequality implies that E_g[ε_{N,l}] ≤ ( (32/m) ( ln(8(N+1)/δ_N) + N ln k + N Σ_i a_i d_{j_i} ln(2em) ) )^{1/2}. Setting δ_N = δ/(N(N+1)) and N = ⌈ (4/θ²) ln(mθ²/\bar{d}) ⌉ gives the result.
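To see how the average complexity \bar{d} enters the first bound of Theorem 2, the following sketch evaluates \bar{d} = Σ_i a_i d_{j_i} and the order of the square-root term for given sample size, margin, and base classes. The constant hidden in the O(·) notation is not specified by the theorem, so this only illustrates the scaling; the numerical inputs are hypothetical.

import math

def theorem2_complexity_term(coeffs, class_dims, class_index, m, k, theta, delta):
    """Order of the square-root term in the first bound of Theorem 2.

    coeffs:      convex coefficients a_i of the combination f = sum_i a_i h_i
    class_dims:  VC dimensions d_1, ..., d_k of the base classes
    class_index: j_i, the class from which each base hypothesis h_i is drawn
    Returns (d_bar, term), with the term evaluated up to the unspecified constant.
    """
    d_bar = sum(a * class_dims[j] for a, j in zip(coeffs, class_index))
    inner = ((d_bar * math.log(m) + math.log(k)) * math.log(m * theta**2 / d_bar)
             / theta**2 + math.log(1.0 / delta))
    return d_bar, math.sqrt(inner / m)

# toy usage: two base classes of VC dimension 3 and 10
d_bar, term = theorem2_complexity_term(
    coeffs=[0.7, 0.2, 0.1], class_dims=[3, 10], class_index=[0, 0, 1],
    m=10000, k=2, theta=0.1, delta=0.05)
print(d_bar, term)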
Theorem 2 gives misclassification probability bounds only for thresholded convex combinations of boolean functions. The key technique we use in the remainder of the paper is to find representations in this form (that is, as two-layer neural networks) of more arbitrary boolean functions. We have some freedom in choosing the convex coefficients, and this choice affects both the error estimate P_S[yf(x) ≤ θ] and the average VC-dimension \bar{d}. We attempt to choose the coefficients and the margin θ so as to optimize the resulting bound on misclassification probability. In the next two sections, we use this approach to find misclassification probability bounds for decision trees and DNF formulae.

3 DECISION TREES

A two-class decision tree T is a tree whose internal decision nodes are labelled with boolean functions from some class U and whose leaves are labelled with class labels from {-1,+1}. For a tree with N leaves, define the leaf functions h_i : X → {-1,1} by h_i(x) = 1 iff x reaches leaf i, for i = 1, ..., N. Note that h_i is the conjunction of all tests on the path from the root to leaf i.

For a sample S and a tree T, let P_i = P_S[h_i(x) = 1]. Clearly, P = (P_1, ..., P_N) is a probability vector. Let σ_i ∈ {-1,+1} denote the class assigned to leaf i. Define the class of leaf functions for leaves up to depth j as

    H_j = \{ h : h = u_1 \wedge u_2 \wedge \cdots \wedge u_r ,\; r \le j,\; u_i \in U \}.

It is easy to show that VCdim(H_j) ≤ 2j VCdim(U) ln(2ej). Let d_i denote the depth of leaf i, so h_i ∈ H_{d_i}, and let d = max_i d_i.

The boolean function implemented by a decision tree T can be written as a thresholded convex combination of the form T(x) = sgn(f(x)), where

    f(x) = \sum_{i=1}^N w_i \sigma_i \frac{h_i(x)+1}{2} = \sum_{i=1}^N \frac{w_i \sigma_i h_i(x)}{2} + \sum_{i=1}^N \frac{w_i \sigma_i}{2},

with w_i > 0 and Σ_{i=1}^N w_i = 1. (To be precise, we need to enlarge the classes H_j slightly to be closed under negation. This does not affect the results by more than a constant.) We first assume that the tree is consistent with the training sample. We will show later how the results extend to the inconsistent case.

The second inequality of Theorem 2 shows that, for fixed δ > 0, there is a constant c such that, for any distribution D, with probability at least 1 - δ over the sample S we have P_D[T(x) ≠ y] ≤ 2 P_S[yf(x) ≤ θ] + (c/θ²) Σ_{i=1}^N w_i d_i B, where B = (1/m) VCdim(U) log² m log d. Different choices of the w_i and of θ will yield different estimates of the error rate of T. We can assume (without loss of generality) that P_1 ≥ ... ≥ P_N. A natural choice is w_i = P_i and P_{j+1} ≤ θ < P_j for some j ∈ {1, ..., N}, which gives

    P_D[T(x) \ne y] \le 2 \sum_{i=j+1}^N P_i + \frac{\bar{d} B}{\theta^2},    (1)

where \bar{d} = Σ_{i=1}^N P_i d_i. We can optimize this expression over the choices of j ∈ {1, ..., N} and θ to give a bound on the misclassification probability of the tree.

Let ρ(P, U) = Σ_{i=1}^N (P_i - 1/N)² be the quadratic distance between the probability vector P = (P_1, ..., P_N) and the uniform probability vector U = (1/N, 1/N, ..., 1/N). Define N_eff = N(1 - ρ(P, U)). The parameter N_eff is a measure of the effective number of leaves in the tree.
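The following sketch makes these quantities concrete. It is an illustration only: it assumes we already know which leaf each training point reaches (a hypothetical boolean matrix), estimates the leaf probabilities P_i from the sample, forms the convex combination f with w_i = P_i, and computes N_eff = N(1 - ρ(P, U)).

import numpy as np

def effective_num_leaves(leaf_counts):
    """N_eff = N (1 - rho(P, U)), with rho the quadratic distance to uniform."""
    P = np.asarray(leaf_counts, dtype=float)
    P = P / P.sum()                      # empirical leaf distribution P_i
    N = len(P)
    rho = np.sum((P - 1.0 / N) ** 2)     # quadratic distance to the uniform vector
    return N * (1.0 - rho)

def tree_margin_function(leaf_reached, leaf_labels, P):
    """f(x) = sum_i w_i sigma_i (h_i(x)+1)/2 with the choice w_i = P_i.

    leaf_reached: (m, N) boolean array; entry (j, i) is true iff x_j reaches leaf i
    leaf_labels:  (N,) array of leaf classes sigma_i in {-1, +1}
    P:            (N,) empirical leaf probabilities, used as the weights w_i
    """
    h = 2.0 * leaf_reached - 1.0         # leaf functions h_i(x) in {-1, +1}
    return ((h + 1.0) / 2.0) @ (P * leaf_labels)

# toy example: 4 leaves, 6 training points
reached = np.eye(4, dtype=bool)[[0, 0, 0, 1, 2, 3]]   # which leaf each point falls in
labels = np.array([1, -1, 1, -1])
P = reached.mean(axis=0)
print(effective_num_leaves(reached.sum(axis=0)))       # at most 4; smaller here
f_vals = tree_margin_function(reached, labels, P)
predictions = np.sign(f_vals)                          # T(x) = sgn(f(x))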
Theorem 3 For a fixed δ > 0, there is a constant c that satisfies the following. Let D be a distribution on X × {-1,1}. Consider the class of decision trees of depth up to d, with decision functions in U. With probability at least 1 - δ over the training set S (of size m), every decision tree T that is consistent with S has

    P_D[T(x) \ne y] \le c \left( \frac{N_{\mathrm{eff}} \, \mathrm{VCdim}(U) \log^2 m \log d}{m} \right)^{1/2},

where N_eff is the effective number of leaves of T.

Proof: Supposing that θ ≥ (\bar{d}/N)^{1/2}, we optimize (1) by choice of θ. If the chosen θ is actually smaller than (\bar{d}/N)^{1/2}, then we show that the optimized bound still holds by a standard VC result. If θ ≥ (\bar{d}/N)^{1/2} then Σ_{i=j+1}^N P_i ≤ θ² N_eff / \bar{d}. So (1) implies that P_D[T(x) ≠ y] ≤ 2θ² N_eff / \bar{d} + \bar{d} B / θ². The optimal choice of θ is then (\bar{d}² B / N_eff)^{1/4}. So if (\bar{d}² B / N_eff)^{1/4} ≥ (\bar{d}/N)^{1/2}, we have the result. Otherwise, the upper bound we need to prove satisfies 2(2 N_eff B)^{1/2} > 2NB, and this result is implied by standard VC results using a simple upper bound for the growth function of the class of decision trees with N leaves.

Thus the parameters that quantify the complexity of a tree are: a) the complexity of the test function class U, and b) the effective number of leaves N_eff. The effective number of leaves can potentially be much smaller than the total number of leaves in the tree [5]. Since this parameter is data-dependent, the same tree can be simple for one set of P_i and complex for another set of P_i.

For trees that are not consistent with the training data, the procedure to estimate the error rate is similar. By defining Q_i = P_S[yσ_i = -1 | h_i(x) = 1] and P'_i = P_i(1 - Q_i)/(1 - P_S[T(x) ≠ y]), we obtain the following result.

Theorem 4 For a fixed δ > 0, there is a constant c that satisfies the following. Let D be a distribution on X × {-1,1}. Consider the class of decision trees of depth up to d, with decision functions in U. With probability at least 1 - δ over the training set S (of size m), every decision tree T has

    P_D[T(x) \ne y] \le P_S[T(x) \ne y] + c \left( \frac{N_{\mathrm{eff}} \, \mathrm{VCdim}(U) \log^2 m \log d}{m} \right)^{1/3},

where c is a universal constant, and N_eff = N(1 - ρ(P', U)) is the effective number of leaves of T.

Notice that this definition of N_eff generalizes the definition given before Theorem 3.

4 DNF AS THRESHOLDED CONVEX COMBINATIONS

A DNF formula defined on {-1,1}^n is a disjunction of terms, where each term is a conjunction of literals and a literal is either a variable or its negation. For a given DNF formula g, we use N to denote the number of terms in g, t_i to represent the ith term in g, L_i to represent the set of literals in t_i, and N_i the size of L_i. Each term t_i can be thought of as a member of the class H_{N_i}, the set of monomials with N_i literals. Clearly, |H_i| ≤ \binom{2n}{i}. The DNF g can be written as a thresholded convex combination of the form

    g(x) = -\mathrm{sgn}(-f(x)) = -\mathrm{sgn}\left( - \sum_{i=1}^N w_i \, \frac{t_i(x)+1}{2} \right).

(Recall that sgn(a) = 1 iff a ≥ 0.) Further, each term t_i can be written as a thresholded convex combination of the form

    t_i(x) = \mathrm{sgn}(f_i(x)) = \mathrm{sgn}\left( \sum_{l_k \in L_i} v_{ik} \, \frac{l_k(x)-1}{2} \right).

Assume for simplicity that the DNF is consistent (the results extend easily to the inconsistent case).
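As a concrete check of this encoding, the following sketch evaluates a term as a thresholded convex combination of its literals and the formula as a thresholded convex combination of its terms. It is an illustration only: the weights shown are hypothetical; any positive weights summing to one realize the same boolean function.

import numpy as np

def sgn(a):
    """Threshold function: sgn(a) = 1 iff a >= 0, else -1."""
    return np.where(np.asarray(a) >= 0, 1, -1)

def eval_term(x, literals, v):
    """t_i(x) = sgn( sum_k v_k (l_k(x) - 1)/2 ) for literals given as (index, sign)."""
    lits = np.array([sign * x[idx] for idx, sign in literals])  # l_k(x) in {-1, +1}
    return sgn(np.dot(v, (lits - 1) / 2.0))

def eval_dnf(x, terms, w):
    """g(x) = -sgn( - sum_i w_i (t_i(x) + 1)/2 )."""
    t = np.array([eval_term(x, lits, v) for lits, v in terms])
    return -sgn(-np.dot(w, (t + 1) / 2.0))

# toy DNF on {-1,1}^3: (x1 AND NOT x2) OR x3
term1 = ([(0, +1), (1, -1)], np.array([0.5, 0.5]))  # literals x1, NOT x2
term2 = ([(2, +1)], np.array([1.0]))                # literal x3
x = np.array([1, -1, -1])
print(eval_dnf(x, [term1, term2], w=np.array([0.5, 0.5])))  # prints 1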
Let γ⁺ (γ⁻) denote the fraction of positive (negative) examples under distribution D. Let P_{D⁺}[.] (P_{D⁻}[.]) denote probability with respect to the distribution over the positive (negative) examples, and let P_{S⁺}[.] (P_{S⁻}[.]) be defined similarly, with respect to the sample S. Notice that P_D[g(x) ≠ y] = γ⁺ P_{D⁺}[g(x) = -1] + γ⁻ P_{D⁻}[(∃i) t_i(x) = 1], so the second inequality of Theorem 2 shows that, with probability at least 1 - δ, for any θ and any θ_i,

    P_D[g(x) \ne y] \le \gamma^+ \left( 2 P_{S^+}[f(x) \le \theta] + \frac{\bar{d} B}{\theta^2} \right) + \gamma^- \sum_{i=1}^N \left( 2 P_{S^-}[-f_i(x) \le \theta_i] + \frac{B}{\theta_i^2} \right),

where \bar{d} = Σ_{i=1}^N w_i N_i and B = c(log n log² m + log(N/δ))/m. As in the case of decision trees, different choices of θ, the θ_i, and the weights yield different estimates of the error. For an arbitrary order of the terms, let P_i be the fraction of positive examples covered by term t_i but not by terms t_{i-1}, ..., t_1. We order the terms such that for each i, with t_{i-1}, ..., t_1 fixed, P_i is maximized, so that P_1 ≥ ... ≥ P_N, and we choose w_i = P_i. Likewise, for a given term t_i with literals l_1, ..., l_{N_i} in an arbitrary order, let p_k^{(i)} be the fraction of negative examples uncovered by literal l_k but not uncovered by l_{k-1}, ..., l_1. We order the literals of term t_i in the same greedy way as above, so that p_1^{(i)} ≥ ... ≥ p_{N_i}^{(i)}, and we choose v_{ik} = p_k^{(i)}. For P_{j+1} ≤ θ < P_j and p_{j_i+1}^{(i)} ≤ θ_i < p_{j_i}^{(i)}, where 1 ≤ j ≤ N and 1 ≤ j_i ≤ N_i, we get

    P_D[g(x) \ne y] \le \gamma^+ \left( 2 \sum_{i=j+1}^N P_i + \frac{\bar{d} B}{\theta^2} \right) + \gamma^- \sum_{i=1}^N \left( 2 \sum_{k=j_i+1}^{N_i} p_k^{(i)} + \frac{B}{\theta_i^2} \right).

Now, let P = (P_1, ..., P_N) and for each term i let p^{(i)} = (p_1^{(i)}, ..., p_{N_i}^{(i)}). Define N_eff = N(1 - ρ(P, U)) and N_eff^{(i)} = N_i(1 - ρ(p^{(i)}, U)), where U is the relevant uniform distribution in each case. The parameter N_eff is a measure of the effective number of terms in the DNF formula. It can be much smaller than N; this would be the case if few terms cover a large fraction of the positive examples. The parameter N_eff^{(i)} is a measure of the effective number of literals in term t_i. Again, it can be much smaller than the actual number of literals in t_i; this would be the case if few literals of the term uncover a large fraction of the negative examples.

Optimizing over θ and the θ_i as in the proof of Theorem 3 gives the following result.

Theorem 5 For a fixed δ > 0, there is a constant c that satisfies the following. Let D be a distribution on X × {-1,1}. Consider the class of DNF formulae with up to N terms. With probability at least 1 - δ over the training set S (of size m), every DNF formula g that is consistent with S has

    P_D[g(x) \ne y] \le \gamma^+ \left( N_{\mathrm{eff}} \, \bar{d} \, B \right)^{1/2} + \gamma^- \sum_{i=1}^N \left( N_{\mathrm{eff}}^{(i)} B \right)^{1/2},

where \bar{d} = max_{i=1,...,N} N_i, γ^± = P_D[y = ±1], and B = c(log n log² m + log(N/δ))/m.
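The greedy ordering and the effective counts above are straightforward to compute from a sample. The sketch below is an illustration under the assumption that term coverage of the positive examples is available as a boolean matrix (a hypothetical data structure, not something defined in the paper); the analogous computation for the literals of a term uses the negative examples in the same way.

import numpy as np

def greedy_term_order(covers):
    """Greedy ordering of DNF terms by marginal coverage of positive examples.

    covers: (N_terms, m_pos) boolean array; entry (i, j) is true iff term t_i
            covers positive example j.
    Returns the ordering and the marginal coverage fractions P_1 >= ... >= P_N.
    """
    covers = np.asarray(covers, dtype=bool)
    n_terms, m_pos = covers.shape
    remaining = np.ones(m_pos, dtype=bool)
    order, fractions = [], []
    for _ in range(n_terms):
        gains = (covers & remaining).sum(axis=1)
        gains[order] = -1                      # do not pick a term twice
        best = int(np.argmax(gains))
        order.append(best)
        fractions.append(gains[best] / m_pos)  # P_i: newly covered fraction
        remaining &= ~covers[best]
    return order, np.array(fractions)

def effective_count(p):
    """N_eff = N (1 - rho(p, U)) for a (normalized) coverage vector p."""
    p = np.asarray(p, dtype=float)
    if p.sum() == 0:
        return 0.0
    p = p / p.sum()
    N = len(p)
    return N * (1.0 - np.sum((p - 1.0 / N) ** 2))

# toy usage: 3 terms, 8 positive examples
covers = np.array([[1, 1, 1, 1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1, 0, 0],
                   [0, 0, 0, 0, 0, 0, 1, 1]], dtype=bool)
order, P = greedy_term_order(covers)
print(order, P, effective_count(P))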
5 CONCLUSIONS

The results in this paper show that structural complexity measures (such as size) of decision trees and DNF formulae are not always the most appropriate in determining their generalization behaviour, and that measures of complexity that depend on the training data may give a more accurate description. Our analysis can be extended to multi-class classification problems. A similar analysis implies similar bounds on misclassification probability for decision lists, and it seems likely that these techniques will also be applicable to other pattern classification methods.

The complexity parameter N_eff described here does not always give the best possible error bounds. For example, the effective number of leaves N_eff in a decision tree can be thought of as a single number that summarizes the probability distribution over the leaves induced by the training data. It seems unlikely that such a number will give optimal bounds for all distributions. In those cases, better bounds could be obtained by using numerical techniques to optimize over the choice of θ and the w_i. It would be interesting to see how the bounds we obtain and those given by numerical techniques reflect the generalization performance of classifiers used in practice.

Acknowledgements

Thanks to Yoav Freund and Rob Schapire for helpful comments.

References

[1] P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, pages 134-140. Morgan Kaufmann, San Mateo, CA, 1997.

[2] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.

[3] A. Ehrenfeucht and D. Haussler. Learning decision trees from random examples. Information and Computation, 82:231-246, 1989.

[4] U.M. Fayyad and K.B. Irani. What should be minimized in a decision tree? In AAAI-90, pages 749-754, 1990.

[5] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91, 1993.

[6] P.M. Murphy and M.J. Pazzani. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275, 1994.

[7] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.

[8] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322-330, 1997.

[9] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimisation. In Proc. 9th COLT, pages 68-76. ACM Press, New York, NY, 1996.

[10] G.I. Webb. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417, 1996.
", "award": [], "sourceid": 1340, "authors": [{"given_name": "Mostefa", "family_name": "Golea", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Wee Sun", "family_name": "Lee", "institution": null}, {"given_name": "Llew", "family_name": "Mason", "institution": null}]}