{"title": "Combinations of Weak Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 494, "page_last": 500, "abstract": null, "full_text": "Combinations of Weak Classifiers\n\nChuanyi Ji and Sheng Ma\n\nDepartment of Electrical, Computer and System Engineering\n\nRensselaer Polytechnic Institute, Troy, NY 12180\n\nchuanyi@ecse.rpi.edu, shengm@ecse.rpi.edu\n\nAbstract\n\nTo obtain classification systems with both good generalization performance and efficiency in space and time, we propose a learning method based on combinations of weak classifiers, where weak classifiers are linear classifiers (perceptrons) which can do a little better than making random guesses. A randomized algorithm is proposed to find the weak classifiers. They are then combined through a majority vote. As demonstrated through systematic experiments, the method developed is able to obtain combinations of weak classifiers with good generalization performance and a fast training time on a variety of test problems and real applications.\n\n1  Introduction\n\nThe problem we will investigate in this work is how to develop a classifier with both good generalization performance and efficiency in space and time in a supervised learning environment. The generalization performance is measured by the probability of classification error of a classifier. A classifier is said to be efficient if its size and the (average) time needed to develop such a classifier scale nicely (polynomially) with the dimension of the feature vectors, and other parameters in the training algorithm.
\n\nThe method we propose to tackle this problem is based on combinations of weak classifiers [8][6], where the weak classifiers are classifiers which can do a little better than random guessing. It has been shown by Schapire and Freund [8][4] that the computational power of weak classifiers is equivalent to that of a well-trained classifier, and an algorithm has been given to boost the performance of weak classifiers. What has not been investigated is the type of weak classifiers that can be used and how to find them. In practice, the ideas have been applied with success in hand-written character recognition to boost the performance of an already well-trained classifier. But the original idea of combining a large number of weak classifiers has not been used in solving real problems. An independent work by Kleinberg [6] suggests that in addition to a good generalization performance, combinations of weak classifiers also provide advantages in computation time, since weak classifiers are computationally easier to obtain than well-trained classifiers. However, since the proposed method is based on an assumption which is difficult to realize, discrepancies have been found between the theory and the experimental results [7]. The recent work by Breiman [1][2] also suggests that combinations of classifiers can be computationally efficient, especially when used to learn large data sets.\n\nThe focus of this work is to investigate the following problems: (1) how to find weak classifiers, (2) what are the performance and efficiency of combinations of weak classifiers, and (3) what are the advantages of using combined weak classifiers compared with other pattern classification methods? 
We will develop a randomized algorithm to obtain weak classifiers. We will then provide simulation results on both synthetic and real problems to show the capabilities and efficiency of combined weak classifiers. The extended version of this work with some of the theoretical analysis can be found in [5].\n\n2  Weak Classifiers\n\nIn the present work, we choose linear classifiers (perceptrons) as weak classifiers. Let $\\frac{1}{2} - \\frac{1}{\\nu}$ be the required generalization error of a classifier, where $\\nu \\geq 2$ is called the weakness factor, which is used to characterize the strength of a classifier. The larger the $\\nu$, the weaker the weak classifier. A set of weak classifiers is combined through a simple majority vote.\n\n3  Algorithm\n\nOur algorithm for combinations of weak classifiers consists of two steps: (1) generating individual weak classifiers through a simple randomized algorithm; and (2) combining a collection of weak classifiers through a simple majority vote.\n\nThree parameters need to be chosen a priori for the algorithm: a weakness factor $\\nu$, a number $\\theta$ ($\\frac{1}{2} \\leq \\theta < 1$) which will be used as a threshold to partition the training set, and the number of weak classifiers $2L + 1$ to be generated, where $L$ is a positive integer.\n\n3.1  Partitioning the Training Set\n\nThe method we use to partition a training set is motivated by that given in [4]. Suppose a combined classifier already consists of $K$ ($K \\geq 1$) weak classifiers. In order to generate a (new) weak classifier, the entire training set of $N$ training samples is partitioned into two subsets: a set of $M_1$ samples which contains all the misclassified samples and a small fraction of the samples correctly classified by the existing combined classifier; and the remaining $N - M_1$ training samples. 
The $M_1$ samples are called \"cares\", since they will be used to select a new weak classifier, while the rest of the samples are the \"don't-cares\".\n\nThe threshold $\\theta$ is used to determine which samples should be assigned as cares. For instance, for the $n$-th training sample ($1 \\leq n \\leq N$), the performance index $a(n)$ is recorded, where $a(n)$ is the fraction of the weak classifiers in the existing combined classifier which classify the $n$-th sample correctly. If $a(n) < \\theta$, this sample is assigned to the cares. Otherwise, it is a don't-care. This is done for all $N$ samples.\n\nThrough partitioning a training set in this way, a newly-generated weak classifier is forced to learn the samples which have not been learned by the existing weak classifiers. In the meantime, a properly-chosen $\\theta$ can ensure that enough samples are used to obtain each weak classifier.\n\n3.2  Random Sampling\n\nTo achieve a fast training time, we obtain a weak classifier by randomly sampling the classifier-space of all possible linear classifiers.\n\nAssume that a feature vector $x \\in R^d$ is distributed over a compact region $D$. The direction of a hyperplane characterized by a linear classifier with a weight vector is first generated by randomly selecting the elements of the weight vector based on a uniform distribution over $(-1, 1)^d$. Then the threshold of the hyperplane is determined by randomly picking an $x \\in D$, and letting the hyperplane pass through $x$. This will generate random hyperplanes which pass through the region $D$, and whose directions are randomly distributed in all directions. Such a randomly selected classifier will then be tested on all the cares. 
If it misclassifies a fraction of cares no more than $\\frac{1}{2} - \\frac{1}{\\nu} - \\epsilon$ ($\\epsilon > 0$ and small), the classifier is kept and will be used in the combination. Otherwise, it is discarded. This process is repeated until a weak classifier is finally obtained.\n\nA newly-generated weak classifier is then combined with the existing ones through a simple majority vote. The entire training set will then be tested on the combined classifier to result in a new set of cares and don't-cares. The whole process will be repeated until a total of $2L + 1$ weak classifiers have been generated. The algorithm can be easily extended to multiple classes. Details can be found in [5].\n\n4  Experimental Results\n\nExtensive simulations have been carried out on both synthetic and real problems using our algorithm. One synthetic problem is chosen to test the efficiency of our method. Real applications from standard databases are selected to compare the generalization performance of combinations of weak classifiers (CW) with that of other methods such as k-Nearest-Neighbor classifiers (k-NN)^1, artificial neural networks (ANN), combinations of neural networks (CNN), and stochastic discriminations (SD).\n\n4.1  A Synthetic Problem: Two Overlapping Gaussians\n\nTo test the scaling properties of combinations of weak classifiers, a non-linearly separable problem is chosen from a standard database called ELENA^2. The problem is a two-class classification problem, where the distributions of samples in both classes are multi-variate Gaussians with the same mean but different variances for each independent variable. There is a considerable amount of overlap between the samples in the two classes, and the problem is non-linearly separable. 
The average generalization error and the standard deviations are given in Figure 1 for our algorithm based on 20 runs, and for other classifiers. The Bayes error is also given to show the theoretical limit. The results show that the performance of k-NN degrades very quickly. The performance of ANN is better than that of k-NN but still deviates more and more from the Bayes error as $d$ gets large. The combination of weak classifiers continues to follow the trend of the Bayes error.\n\n1 The best result of different k is reported.\n2 /pub/neural-nets/ELENA/databases/Benchmarks.ps.Z on ftp.dice.ucl.ac.be\n\nFigure 1: Performance versus the dimension of the feature vectors\n\nAlgorithms | Card1 (%) Error/$\\sigma$ | Diabetes1 (%) Error/$\\sigma$ | Gene1 (%) Error/$\\sigma$\nCombined Weak Classifiers | 11.3/0.85 | 22.70/0.70 | 11.80/0.52\nk Nearest Neighbor | 15.67 | 25.8 | 22.87\nNeural Networks | 13.64/0.85 | 23.52/0.72 | 13.47/0.44\nCombined Neural Networks | 13.02/0.33 | 22.79/0.57 | 12.08/0.23\n\nTable 1: Performance on Card1, Diabetes1 and Gene1. $\\sigma$: standard deviation\n\n4.2  Proben1 Data Sets\n\nThree data sets, Card1, Diabetes1 and Gene1, were selected to test our algorithm from the Proben1 databases, which contain data sets from real applications^3.\n\nThe Card1 data set is for a problem of determining whether a credit-card application from a customer can be approved based on information given in 51-dimensional feature vectors. 345 out of 690 examples are used for training and the rest for testing. The Diabetes1 data set is for determining whether diabetes is present based on 8-dimensional input patterns. 
384 examples are used for training and the same number of samples for testing. The Gene1 data set is for deciding whether a DNA sequence is from a donor, an acceptor or neither, based on 120-dimensional binary feature vectors. 1588 samples out of a total of 3175 were used for training, and the rest for testing.\n\n3 Available by anonymous ftp from ftp.ira.uka.de, as /pub/papers/techreports/1994/1994-21.ps.z.\n\nThe average generalization error as well as the standard deviations are reported in Table 1. The results from combinations of weak classifiers are based on 25 runs. The results of neural networks and combinations of well-trained neural networks are from the database. As demonstrated by the results, combinations of weak classifiers have been able to achieve generalization performance comparable to or better than that of combinations of well-trained neural networks.\n\nAlgorithms | (%) Error/$\\sigma$\nCombined Weak Classifiers | 4.23/0.1\nk Nearest Neighbor | 4.84\nNeural Networks | 5.33\nStochastic Discriminations | 3.92\n\nTable 2: Performance on handwritten digit recognition.\n\nParameters: 1/2 + 1/$\\nu$, $\\theta$, $2L+1$\nGaussians  Card1  Diabetes1  Gene1  Digits\n0.54 0.53 20000\n0.55 0.54 4000\n0.51 0.51 1000\n0.51 0.54 1000\n0.51 0.51 2000\nAverage Tries: 2 3 7 4 2\n\nTable 3: Parameters used in our experiments.\n\n4.3  Hand-written Digit Recognition\n\nHand-written digit recognition is chosen to test our algorithm, since one of the previously developed methods on combinations of weak classifiers (stochastic discrimination [6]) was applied to this problem. For the purpose of comparison, the same set of data as used in [6] (from the NIST database) is utilized to train and to test our algorithm. 
The data set contains 10000 digits written by different people. Each digit is represented by 16 by 16 black and white pixels. The first 4997 digits are used to form a training set, and the rest are for testing. The performance of our algorithm, k-NN, neural networks, and stochastic discriminations is given in Table 2. The results for our method are based on 5 runs, while the results for the other methods are from [6]. The results show that the performance of our algorithm is slightly worse (by 0.3%) than that of stochastic discriminations, which uses a different method for multi-class classification [6].\n\n4.4  Effects of the Weakness Factor\n\nExperiments are done to test the effects of $\\nu$ on the problem of two 8-dimensional overlapping Gaussians. The performance and the average training time (CPU time on a Sun Sparc-10) of combined weak classifiers based on 10 runs are given for different $\\nu$'s in Figures 2 and 3, respectively. The results indicate that as $\\nu$ increases, an individual weak classifier is obtained more quickly, but more weak classifiers are needed to achieve good performance. When a proper $\\nu$ is chosen, a nice scaling property can be observed in training time.\n\nA record of the parameters used in all the experiments on real applications is provided in Table 3. The average tries, which are the average number of times needed to sample the classifier space to obtain an acceptable weak classifier, are also given in the table to characterize the training time for these problems.
\n\n4.5  Training Time\n\nTo compare learning time with off-line BackPropagation (BP), a feedforward two-layer neural network with 10 sigmoidal hidden units is trained by gradient descent to learn the problem of the two 8-dimensional overlapping Gaussians. 2500 training samples are used. The performance versus CPU time^4 is plotted for both our algorithm and BP in Figure 4. For our algorithm, 2000 weak classifiers are combined. For BP, 1000 epochs are used. The figure shows that our algorithm is much faster than the BP algorithm. Moreover, when several well-trained neural networks are combined to achieve a better performance, the cost in training time will be even higher. Therefore, compared to combinations of well-trained neural networks, combining weak classifiers is computationally much cheaper.\n\n4 Both algorithms are run on a Sun Sparc-10 workstation.\n\nFigure 2: Performance versus the number of weak classifiers for different $\\nu$. nu: $\\nu$.\n\nFigure 3: Training time versus the number of weak classifiers for different $\\nu$.\n\n5  Discussions\n\nFrom the experimental results, we observe that the performance of the combined weak classifiers is comparable to or even better than that of combinations of well-trained classifiers, and out-performs individual neural network classifiers and k-Nearest Neighbor classifiers. 
Meanwhile, whereas the k-nearest neighbor classifiers suffer from the curse of dimensionality, a nice scaling property in terms of the dimension of feature vectors has been observed for combined weak classifiers. Another important observation obtained from the experiments is that the weakness factor directly impacts the size of a combined classifier and the training time. Therefore, the choice of the weakness factor is important to obtain efficient combined weak classifiers. It has been shown in our theoretical analysis on learning an underlying perceptron [5] that $\\nu$ should be at least as large as $O(d \\ln d)$ to obtain a polynomial training time, and the price paid to accomplish this is a space-complexity which is polynomial in $d$ as well. This cost can be observed in our experimental results as the need for a large number of weak classifiers.\n\nFigure 4: Performance versus CPU time\n\nAcknowledgement\n\nSpecial thanks are due to Tin Kam Ho for providing NIST data, related references and helpful discussions. Support from the National Science Foundation (ECS-9312594 and (CAREER) IRI-9502518) is gratefully acknowledged.\n\nReferences\n\n[1] L. Breiman, \"Bias, Variance and Arcing Classifiers,\" Technical Report, TR-460, Department of Statistics, University of California, Berkeley, April, 1996.\n[2] L. Breiman, \"Pasting Bites Together for Prediction in Large Data Sets and On-Line,\" ftp.stat.berkeley.edu/users/breiman, 1996.\n[3] H. Drucker, R. Schapire and P. 
Simard, \"Improving Performance in Neural Networks Using a Boosting Algorithm,\" Neural Information Processing Symposium, 42-49, 1993.\n[4] Y. Freund and R. Schapire, \"A Decision-Theoretic Generalization of On-Line Learning and An Application to Boosting,\" http://www.research.att.com/orgs/ssr/people/yoav or schapire.\n[5] C. Ji and S. Ma, \"Combinations of Weak Classifiers,\" IEEE Trans. Neural Networks, Special Issue on Neural Networks and Pattern Recognition, vol. 8, 32-42, Jan., 1997.\n[6] E.M. Kleinberg, \"Stochastic Discrimination,\" Annals of Mathematics and Artificial Intelligence, vol. 1, 207-239, 1990.\n[7] E.M. Kleinberg and T. Ho, \"Pattern Recognition by Stochastic Modeling,\" Proceedings of the Third International Workshop on Frontiers in Handwriting Recognition, 175-183, Buffalo, May 1993.\n[8] R.E. Schapire, \"The Strength of Weak Learnability,\" Machine Learning, vol. 5, 197-227, 1990.\n", "award": [], "sourceid": 1325, "authors": [{"given_name": "Chuanyi", "family_name": "Ji", "institution": null}, {"given_name": "Sheng", "family_name": "Ma", "institution": null}]}