{"title": "Boosting with Multi-Way Branching in Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 300, "page_last": 306, "abstract": null, "full_text": "Boosting with Multi-Way Branching in Decision Trees\n\nYishay Mansour\n\nDavid McAllester\n\nAT&T Labs-Research\n180 Park Ave\nFlorham Park NJ 07932\n{mansour, dmac}@research.att.com\n\nAbstract\n\nIt is known that decision tree learning can be viewed as a form of boosting. However, existing boosting theorems for decision tree learning allow only binary-branching trees, and the generalization to multi-branching trees is not immediate. Practical decision tree algorithms, such as CART and C4.5, implement a trade-off between the number of branches and the improvement in tree quality as measured by an index function. Here we give a boosting justification for a particular quantitative trade-off curve. Our main theorem states, in essence, that if we require an improvement proportional to the log of the number of branches then top-down greedy construction of decision trees remains an effective boosting algorithm.\n\n1 Introduction\n\nDecision trees have proved to be a very popular tool in experimental machine learning. Their popularity stems from two basic features: they can be constructed quickly and they seem to achieve low error rates in practice. In some cases the time required for tree growth scales linearly with the sample size. Efficient tree construction allows for very large data sets. On the other hand, although there are known theoretical handicaps of the decision tree representation, it seems that in practice they achieve accuracy comparable to that of other learning paradigms such as neural networks. 
\n\nWhile decision tree learning algorithms are popular in practice, it seems hard to quantify their success in a theoretical model. It is fairly easy to see that even if the target function can be described using a small decision tree, tree learning algorithms may fail to find a good approximation. Kearns and Mansour [6] used the weak learning hypothesis to show that standard tree learning algorithms perform boosting. This provides a theoretical justification for decision tree learning similar to justifications that have been given for various other boosting algorithms, such as AdaBoost [4].\n\nMost decision tree learning algorithms use a top-down growth process. Given a current tree, the algorithm selects some leaf node and extends it to an internal node by assigning to it some \"branching function\" and adding a leaf for each possible output value of this branching function. The set of branching functions may differ from one algorithm to another, but most algorithms used in practice try to keep the set of branching functions fairly simple. For example, in C4.5 [7], each branching function depends on a single attribute. For categorical attributes, the branching is according to the attribute's value, while for continuous attributes it performs a comparison of the attribute with some constant.\n\nOf course such top-down tree growth can over-fit the data: it is easy to construct a (large) tree whose error rate on the training data is zero. 
However, if the class of splitting functions has finite VC dimension then it is possible to prove that, with high confidence over the choice of the training data, for all trees $T$ the true error rate of $T$ is bounded by $\\hat{\\epsilon}(T) + O(\\sqrt{|T|/m})$ where $\\hat{\\epsilon}(T)$ is the error rate of $T$ on the training sample, $|T|$ is the number of leaves of $T$, and $m$ is the size of the training sample. Over-fitting can be avoided by requiring that top-down tree growth produce a small tree. In practice this is usually done by constructing a large tree and then pruning away some of its nodes. Here we take a slightly different approach. We assume a given target tree size $s$ and consider the problem of constructing a tree $T$ with $|T| = s$ and $\\hat{\\epsilon}(T)$ as small as possible. We can avoid over-fitting by selecting a small target value for the tree size.\n\nA fundamental question in top-down tree growth is how to select the branching function when growing a given leaf. We can think of the target size as a \"budget\". A four-way branch spends more of the tree size budget than does a two-way branch: a four-way branch increases the tree size by roughly the same amount as two two-way branches. A sufficiently large branch would spend the entire tree size budget in a single step. Branches that spend more of the tree size budget should be required to achieve more progress than branches spending less of the budget. Naively, one would expect that the improvement should be required to be roughly linear in the number of new leaves introduced: one should get a return proportional to the expense. However, a weak learning assumption and a target tree size define a nontrivial game between the learner and an adversary. The learner makes moves by selecting branching functions and the adversary makes moves by presenting options consistent with the weak learning hypothesis. 
We prove here that the learner achieves a better value in this game by selecting branches that get a return considerably smaller than the naive linear return. Our main theorem states, in essence, that the return need only be proportional to the log of the number of branches.\n\n2 Preliminaries\n\nWe assume a set $X$ of instances and an unknown target function $f$ mapping $X$ to $\\{0,1\\}$. We assume a given \"training set\" $S$ which is a set of pairs of the form $(x, f(x))$. We let $\\mathcal{H}$ be a set of potential branching functions where each $h \\in \\mathcal{H}$ is a function from $X$ to a finite set $R_h$; we allow different functions in $\\mathcal{H}$ to have different ranges. We require that for any $h \\in \\mathcal{H}$ we have $|R_h| \\ge 2$. An $\\mathcal{H}$-tree is a tree where each internal node is labeled with a branching function $h \\in \\mathcal{H}$ and has children corresponding to the elements of the set $R_h$. We define $|T|$ to be the number of leaf nodes of $T$. We let $L(T)$ be the set of leaf nodes of $T$. For a given tree $T$, leaf node $\\ell$ of $T$ and sample $S$ we write $S_\\ell$ to denote the subset of the sample $S$ reaching leaf $\\ell$. For a leaf $\\ell \\in T$ we define $p_\\ell$ to be the fraction of the sample reaching leaf $\\ell$, i.e., $|S_\\ell|/|S|$. We define $q_\\ell$ to be the fraction of the pairs $(x, f(x))$ in $S_\\ell$ for which $f(x) = 1$. The training error of $T$, denoted $\\hat{\\epsilon}(T)$, is $\\sum_{\\ell \\in L(T)} p_\\ell \\min(q_\\ell, 1 - q_\\ell)$.\n\n3 The Weak Learning Hypothesis and Boosting\n\nHere, as in [6], we view top-down decision tree learning as a form of boosting [8, 3]. Boosting describes a general class of iterative algorithms based on a weak learning hypothesis. The classical weak learning hypothesis applies to classes of Boolean functions. Let $\\mathcal{H}_2$ be the subset of branching functions $h \\in \\mathcal{H}$ with $|R_h| = 2$. 
For $\\delta > 0$ the classical $\\delta$-weak learning hypothesis for $\\mathcal{H}_2$ states that for any distribution $D$ on $X$ there exists an $h \\in \\mathcal{H}_2$ with $\\Pr_D(h(x) \\neq f(x)) \\le 1/2 - \\delta$. Algorithms designed to exploit this particular hypothesis for classes of Boolean functions have proved to be quite useful in practice [5].\n\nKearns and Mansour show [6] that the key to using the weak learning hypothesis for decision tree learning is the use of an index function $I : [0,1] \\to [0,1]$ where $I(q) \\le 1$, $I(q) \\ge \\min(q, 1 - q)$ and where $I(T)$ is defined to be $\\sum_{\\ell \\in L(T)} p_\\ell I(q_\\ell)$. Note that these conditions imply that $\\hat{\\epsilon}(T) \\le I(T)$. For any sample $W$ let $q_W$ be the fraction of pairs $(x, f(x)) \\in W$ such that $f(x) = 1$. For any $h \\in \\mathcal{H}$ let $T_h$ be the decision tree consisting of a single internal node with branching function $h$ plus a leaf for each member of $R_h$. Let $I_W(T_h)$ denote the value of $I(T_h)$ as measured with respect to the sample $W$. Let $\\Delta(W, h)$ denote $I(q_W) - I_W(T_h)$. The quantity $\\Delta(W, h)$ is the reduction in the index for sample $W$ achieved by introducing a single branch. Also note that $p_\\ell \\Delta(S_\\ell, h)$ is the reduction in $I(T)$ when the leaf $\\ell$ is replaced by the branch $h$. Kearns and Mansour [6] prove the following lemma.\n\nLemma 3.1 (Kearns & Mansour) Assuming the $\\delta$-weak learning hypothesis for $\\mathcal{H}_2$, and taking $I(q)$ to be $2\\sqrt{q(1 - q)}$, we have that for any sample $W$ there exists an $h \\in \\mathcal{H}_2$ such that $\\Delta(W, h) \\ge \\frac{\\delta^2}{16} I(q_W)$.\n\nThis lemma motivates the following definition.\n\nDefinition 1 We say that $\\mathcal{H}_2$ and $I$ satisfy the $\\gamma$-weak tree-growth hypothesis if for any sample $W$ from $X$ there exists an $h \\in \\mathcal{H}_2$ such that $\\Delta(W, h) \\ge \\gamma I(q_W)$.\n\nLemma 3.1 states, in essence, that the classical weak learning hypothesis implies the weak tree growth hypothesis for the index function $I(q) = 2\\sqrt{q(1 - q)}$. 
Empirically, however, the weak tree growth hypothesis seems to hold for a variety of index functions that were already used for tree growth prior to the work of Kearns and Mansour. The Gini index $I(q) = 4q(1 - q)$ is used in CART [1] and the entropy $I(q) = -q \\log q - (1 - q) \\log(1 - q)$ is used in C4.5 [7]. It has long been empirically observed that it is possible to make steady progress in reducing $I(T)$ for these choices of $I$ while it is difficult to make steady progress in reducing $\\hat{\\epsilon}(T)$.\n\nWe now define a simple binary branching procedure. For a given training set $S$ and target tree size $s$ this algorithm grows a tree with $|T| = s$. In the algorithm $\\emptyset$ denotes the trivial tree whose root is a leaf node and $T_{\\ell,h}$ denotes the result of replacing the leaf $\\ell$ with the branching function $h$ and a new leaf for each element of $R_h$.\n\n$T = \\emptyset$\nWHILE ($|T| < s$) DO\n    $\\ell \\leftarrow \\mathrm{argmax}_{\\ell} \\; p_\\ell I(q_\\ell)$\n    $h \\leftarrow \\mathrm{argmax}_{h \\in \\mathcal{H}_2} \\; \\Delta(S_\\ell, h)$\n    $T \\leftarrow T_{\\ell,h}$\nEND-WHILE\n\nWe now define $e(n)$ to be the quantity $\\prod_{i=1}^{n-1} (1 - \\frac{\\gamma}{i})$. Note that $e(n) \\le \\prod_{i=1}^{n-1} e^{-\\gamma/i} = e^{-\\gamma \\sum_{i=1}^{n-1} 1/i} \\le e^{-\\gamma \\ln n} = n^{-\\gamma}$.\n\nTheorem 3.2 (Kearns & Mansour) If $\\mathcal{H}_2$ and $I$ satisfy the $\\gamma$-weak tree growth hypothesis then the binary branching procedure produces a tree $T$ with $\\hat{\\epsilon}(T) \\le I(T) \\le e(|T|) \\le |T|^{-\\gamma}$.\n\nProof: The proof is by induction on the number of iterations of the procedure. We have that $I(\\emptyset) \\le 1 = e(1)$ so the initial tree immediately satisfies the condition. We now assume that the condition is satisfied by $T$ at the beginning of an iteration and prove that it remains satisfied by $T_{\\ell,h}$ at the end of the iteration. 
Since $I(T) = \\sum_{\\ell \\in L(T)} p_\\ell I(q_\\ell)$ we have that the leaf $\\ell$ selected by the procedure is such that $p_\\ell I(q_\\ell) \\ge \\frac{I(T)}{|T|}$. By the $\\gamma$-weak tree growth assumption the function $h$ selected by the procedure has the property that $\\Delta(S_\\ell, h) \\ge \\gamma I(q_\\ell)$. We now have that $I(T) - I(T_{\\ell,h}) = p_\\ell \\Delta(S_\\ell, h) \\ge p_\\ell \\gamma I(q_\\ell) \\ge \\gamma \\frac{I(T)}{|T|}$. This implies that $I(T_{\\ell,h}) \\le I(T) - \\frac{\\gamma}{|T|} I(T) = (1 - \\frac{\\gamma}{|T|}) I(T) \\le (1 - \\frac{\\gamma}{|T|}) e(|T|) = e(|T| + 1) = e(|T_{\\ell,h}|)$. $\\Box$\n\n4 Statement of the Main Theorem\n\nWe now construct a tree-growth algorithm that selects multi-way branching functions. As with many weak learning hypotheses, the $\\gamma$-weak tree-growth hypothesis can be viewed as defining a game between the learner and an adversary. Given a tree $T$ the adversary selects a set of branching functions allowed at each leaf of the tree subject to the constraint that at each leaf $\\ell$ the adversary must provide a binary branching function $h$ with $\\Delta(S_\\ell, h) \\ge \\gamma I(q_\\ell)$. The learner then selects a leaf $\\ell$ and a branching function $h$ and replaces $T$ by $T_{\\ell,h}$. The adversary then again selects a new set of options for each leaf subject to the $\\gamma$-weak tree growth hypothesis. The proof of Theorem 3.2 implies that even when the adversary can reassign all options at every move there exists a learner strategy, the binary branching procedure, guaranteed to achieve a final error rate of at most $|T|^{-\\gamma}$.\n\nOf course the optimal play for the adversary in this game is to only provide a single binary option at each leaf. However, in practice the \"adversary\" will make mistakes and provide options to the learner which can be exploited to achieve even lower error rates. Our objective now is to construct a strategy for the learner which can exploit multi-way branches provided by the adversary. 
\n\nWe first say that a branching function $h$ is acceptable for tree $T$ and target size $s$ if either $|R_h| = 2$ or $|T| < e(|R_h|) s \\gamma / (2|R_h|)$. We also define $g(k)$ to be the quantity $(1 - e(k))/\\gamma$. It should be noted that $g(2) = 1$. It should also be noted that $e(k) \\approx e^{-\\gamma \\ln k}$ and hence for $\\gamma \\ln k$ small we have $e(k) \\approx 1 - \\gamma \\ln k$ and hence $g(k) \\approx \\ln k$. We now define the following multi-branch tree growth procedure.\n\n$T = \\emptyset$\nWHILE ($|T| < s$) DO\n    $\\ell \\leftarrow \\mathrm{argmax}_{\\ell} \\; p_\\ell I(q_\\ell)$\n    $h \\leftarrow \\mathrm{argmax}_{h \\in \\mathcal{H},\\; h \\text{ acceptable for } T \\text{ and } s} \\; \\Delta(S_\\ell, h)/g(|R_h|)$\n    $T \\leftarrow T_{\\ell,h}$\nEND-WHILE\n\nA run of the multi-branch tree growth procedure will be called $\\gamma$-boosting if at each iteration the branching function $h$ selected has the property that $\\Delta(S_\\ell, h)/g(|R_h|) \\ge \\gamma I(q_\\ell)$. The $\\gamma$-weak tree growth hypothesis implies that $\\Delta(S_\\ell, h)/g(|R_h|) \\ge \\gamma I(q_\\ell)/g(2) = \\gamma I(q_\\ell)$. Therefore, the $\\gamma$-weak tree growth hypothesis implies that every run of the multi-branch growth procedure is $\\gamma$-boosting. But a run can be $\\gamma$-boosting by exploiting multi-way branches even when the $\\gamma$-weak tree growth hypothesis fails. The following is the main theorem of this paper.\n\nTheorem 4.1 If $T$ is produced by a $\\gamma$-boosting run of the multi-branch tree-growth procedure then $I(T) \\le e(|T|) \\le |T|^{-\\gamma}$.\n\n5 Proof of Theorem 4.1\n\nTo prove the main theorem we need the concept of a visited weighted tree, or VW-tree for short. A VW-tree is a tree in which each node $m$ is assigned both a rational weight $w_m \\in [0,1]$ and an integer visitation count $v_m \\ge 1$. We now define the following VW tree growth procedure. In the procedure $T_w$ is the tree consisting of a single root node with weight $w$ and visitation count 1. The tree $T_{\\ell,w_1,\\ldots,w_k}$ 
is the result of inserting $k$ new leaves below the leaf $\\ell$ where the $i$th new leaf has weight $w_i$ and new leaves have visitation count 1.\n\n$w \\leftarrow$ any rational number in $[0,1]$\n$T \\leftarrow T_w$\nFOR ANY NUMBER OF STEPS REPEAT THE FOLLOWING\n    $\\ell \\leftarrow \\mathrm{argmax}_{\\ell} \\; e(v_\\ell) w_\\ell / v_\\ell$\n    $v_\\ell \\leftarrow v_\\ell + 1$\n    OPTIONALLY $T \\leftarrow T_{\\ell,w_1,\\ldots,w_{v_\\ell}}$ WITH $w_1 + \\cdots + w_{v_\\ell} \\le e(v_\\ell) w_\\ell$\n\nWe first prove an analog of Theorem 3.2 for the above procedure. For a VW-tree $T$ we define $|T|$ to be $\\sum_{\\ell \\in L(T)} v_\\ell$ and we define $I(T)$ to be $\\sum_{\\ell \\in L(T)} e(v_\\ell) w_\\ell$.\n\nLemma 5.1 The VW procedure maintains the invariant that $I(T) \\le e(|T|)$.\n\nProof: The proof is by induction on the number of iterations of the algorithm. The result is immediate for the initial tree since $e(1) = 1$. We now assume that $I(T) \\le e(|T|)$ at the start of an iteration and show that this remains true at the end of the iteration.\n\nWe can associate each leaf $\\ell$ with $v_\\ell$ \"subleaves\" each of weight $e(v_\\ell) w_\\ell / v_\\ell$. We have that $|T|$ is the total number of these subleaves and $I(T)$ is the total weight of these subleaves. Therefore there must exist a subleaf whose weight is at least $I(T)/|T|$. Hence there must exist a leaf $\\ell$ satisfying $e(v_\\ell) w_\\ell / v_\\ell \\ge I(T)/|T|$. Therefore this relation must hold of the leaf $\\ell$ selected by the procedure.\n\nLet $T'$ be the tree resulting from incrementing $v_\\ell$. We now have $I(T) - I(T') = e(v_\\ell) w_\\ell - e(v_\\ell + 1) w_\\ell = e(v_\\ell) w_\\ell - (1 - \\frac{\\gamma}{v_\\ell}) e(v_\\ell) w_\\ell = \\frac{\\gamma}{v_\\ell} e(v_\\ell) w_\\ell \\ge \\gamma \\frac{I(T)}{|T|}$. So we have $I(T') \\le (1 - \\frac{\\gamma}{|T|}) I(T) \\le (1 - \\frac{\\gamma}{|T|}) e(|T|) = e(|T'|)$.\n\nFinally, if the procedure grows new leaves we have that $I(T)$ does not increase and that $|T|$ remains the same and hence the invariant is maintained. $\\Box$\n\nFor any internal node $m$ in a tree $T$ let $C(m)$ denote the set of nodes which are children of $m$. 
A VW-tree will be called locally well-formed if for every internal node $m$ we have that $v_m = |C(m)|$ and that $\\sum_{n \\in C(m)} w_n \\le e(|C(m)|) w_m$. A VW-tree will be called globally safe if $\\max_{\\ell \\in L(T)} e(v_\\ell) w_\\ell / v_\\ell \\le \\min_{m \\in N(T)} e(v_m - 1) w_m / (v_m - 1)$ where $N(T)$ denotes the set of internal nodes of $T$.\n\nLemma 5.2 If $T$ is a locally well-formed and globally safe VW-tree, then $T$ is a possible output of the VW growth procedure and therefore $I(T) \\le e(|T|)$.\n\nProof: Since $T$ is locally well-formed we can use $T$ as a \"template\" for making nondeterministic choices in the VW growth procedure. This process is guaranteed to produce $T$ provided that the growth procedure is never forced to visit a node corresponding to a leaf of $T$. But the global safety condition guarantees that any unfinished internal node of $T$ has a weight at least as large as any leaf node of $T$. $\\Box$\n\nWe now give a way of mapping $\\mathcal{H}$-trees into VW-trees. More specifically, for any $\\mathcal{H}$-tree $T$ we define $VW(T)$ to be the result of assigning each node $m$ in $T$ the weight $p_m I(q_m)$, each internal node a visitation count equal to its number of children, and each leaf node a visitation count equal to 1. We now have the following lemmas.\n\nLemma 5.3 If $T$ is grown by a $\\gamma$-boosting run of the multi-branch procedure then $VW(T)$ is locally well-formed.\n\nProof: Note that the children of an internal node $m$ are derived by selecting a branching function $h$ for the node $m$. Since the run is $\\gamma$-boosting we have $\\Delta(S_\\ell, h)/g(|R_h|) \\ge \\gamma I(q_\\ell)$. Therefore $\\Delta(S_\\ell, h) = I(q_\\ell) - I_{S_\\ell}(T_h) \\ge I(q_\\ell)(1 - e(|R_h|))$. This implies that $I_{S_\\ell}(T_h) \\le e(|R_h|) I(q_\\ell)$. Multiplying by $p_\\ell$ and transforming the result into weights in the tree $VW(T)$ gives the desired result. $\\Box$\n\nThe following lemma now suffices for Theorem 4.1. 
\n\nLemma 5.4 If $T$ is grown by a $\\gamma$-boosting run of the multi-branch procedure then $VW(T)$ is globally safe.\n\nProof: First note that the following is an invariant of a $\\gamma$-boosting run of the multi-branch procedure.\n\n$\\max_{\\ell \\in L(VW(T))} w_\\ell \\le \\min_{m \\in N(VW(T))} w_m$\n\nThe proof is a simple induction on $\\gamma$-boosting tree growth using the fact that the procedure always expands a leaf node of maximal weight.\n\nWe must now show that for every internal node $m$ and every leaf $\\ell$ we have that $w_\\ell \\le e(k - 1) w_m / (k - 1)$ where $k$ is the number of children of $m$. Note that if $k = 2$ then this reduces to $w_\\ell \\le w_m$ which follows from the above invariant. So we can assume without loss of generality that $k > 2$. Also, since $e(k)/k < e(k - 1)/(k - 1)$, it suffices to show that $w_\\ell \\le e(k) w_m / k$.\n\nLet $m$ be an internal node with $k > 2$ children and let $T'$ be the tree at the time $m$ was selected for expansion. Let $w_\\ell$ be the maximum weight of a leaf in the final tree $T$. By the definition of the acceptability condition, in the last $s/2$ iterations we are performing only binary branching. Each binary expansion reduces the index by at least $\\gamma$ times the weight of the selected node. Since the sequence of nodes selected in the multi-branch procedure has non-increasing weights, we have that in any iteration the weight of the selected node is at least $w_\\ell$. Since there are at least $s/2$ binary expansions after the expansion of $m$, each of which reduces $I$ by at least $\\gamma w_\\ell$, we have that $s \\gamma w_\\ell / 2 \\le I(T')$ so $w_\\ell \\le 2 I(T')/(\\gamma s)$. The acceptability condition can be written as $2/(\\gamma s) \\le e(k)/(k|T'|)$ which now yields $w_\\ell \\le I(T') e(k)/(k|T'|)$. But we have that $I(T')/|T'| \\le w_m$ which now yields $w_\\ell \\le e(k) w_m / k$ as desired. $\\Box$\n\nReferences\n\n[1] Leo Breiman, Jerome H. 
Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.\n\n[2] Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the Weak Learning Framework to understand and improve C4.5. In Proc. of Machine Learning, 96-104, 1996.\n\n[3] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.\n\n[4] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23-37. Springer-Verlag, 1995.\n\n[5] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.\n\n[6] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning. In Proceedings of the Twenty-Eighth ACM Symposium on the Theory of Computing, pages 459-468, 1996.\n\n[7] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.\n\n[8] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.\n\n", "award": [], "sourceid": 1659, "authors": [{"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}]}