{"title": "A Parallel Mixture of SVMs for Very Large Scale Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 640, "abstract": null, "full_text": "A  Parallel  Mixture of SVMs for  Very  Large  Scale \n\nProblems \n\nRonan  Collobert* \n\nUniversite de  Montreal, DIRG \nCP 6128, Succ.  Centre-Ville \nMontreal,  Quebec,  Canada \n\ncollober\u00a9iro.umontreal.ca \n\nSamy Bengio \n\nIDIAP \n\nCP 592, rue  du Simp Ion  4 \n1920 Martigny, Switzerland \n\nbengio\u00a9idiap.ch \n\nYoshua Bengio \n\nUniversite de  Montreal, DIRG \nCP 6128,  Succ.  Centre-Ville \nMontreal,  Quebec,  Canada \nbengioy\u00a9iro.umontreal.ca \n\nAbstract \n\nSupport Vector Machines  (SVMs)  are currently the state-of-the-art models for \nmany classification problems but they suffer from the complexity of their train(cid:173)\ning algorithm which is at least quadratic with respect to the number of examples. \nHence,  it is  hopeless  to try to solve  real-life  problems  having more than a  few \nhundreds  of  thousands  examples  with  SVMs.  The  present  paper  proposes  a \nnew  mixture  of SVMs  that  can  be  easily  implemented  in  parallel  and  where \neach SVM is  trained on a  small  subset of the whole  dataset.  Experiments on a \nlarge benchmark dataset  (Forest)  as well  as  a  difficult  speech database, yielded \nsignificant  time  improvement  (time  complexity  appears  empirically  to  locally \ngrow linearly with the number of examples) .  In addition, and that is a surprise, \na  significant improvement in  generalization was  observed on Forest. \n\n1 \n\nIntroduction \n\nRecently  a  lot  of  work  has  been  done  around  Support  Vector  Machines  [9],  mainly  due  to \ntheir impressive generalization performances on classification problems when compared to other \nalgorithms such as  artificial neural networks [3, 6].  However,  SVMs require to solve  a quadratic \noptimization problem which needs resources that are at least quadratic in the number of training \nexamples,  and  it  is  thus  hopeless  to  try  solving  problems  having  millions  of examples  using \nclassical SVMs. \n\nIn order to overcome this drawback, we  propose in this paper to use a mixture of several SVMs, \neach  of them trained  only  on  a  part of the  dataset.  The idea of an  SVM  mixture  is  not  new, \nalthough previous attempts such as  Kwok's paper on Support Vector Mixtures  [5]  did not train \nthe  SVMs  on  part of the  dataset  but  on the whole  dataset  and  hence  could  not  overcome the \n'Part of this work  has been done while  Ronan Collobert was  at IDIAP,  CP  592,  rue du Simplon 4, \n\n1920  Martigny,  Switzerland. \n\n\fLUHe  CUIHIJ1eJULY  vrUUleUI  lUI  1i:L1!!,e  UaLaOeLO. \nl:i't'fltpte  'fIte~ltuu  LU  LlalH  oUCH \na  mixture,  and  we  will  show  that  in  practice  this  method  is  much  faster  than  training  only \none  SVM,  and  leads  to  results  that  are  at  least  as  good  as  one  SVM.  We  conjecture  that  the \ntraining time  complexity of the  proposed  approach  with  respect  to the number  of examples  is \nsub-quadratic for  large data sets.  Moreover this mixture can be easily parallelized,  which could \nimprove again  significantly  the training time. \n\nvve  vruvuoe  Here  a \n\nThe organization of the paper goes as follows:  in the next section, we  briefly introduce the SVM \nmodel  for  classification.  
In section 3 we present our mixture of SVMs, followed in section 4 by some comparisons to related models. In section 5 we show some experimental results, first on a toy dataset, then on two large real-life datasets. A short conclusion then follows.

2 Introduction to Support Vector Machines

Support Vector Machines (SVMs) [9] have been applied to many classification problems, generally yielding good performance compared to other algorithms. The decision function is of the form

    f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b \right)    (1)

where x \in R^d is the d-dimensional input vector of a test example, y \in \{-1, 1\} is a class label, x_i is the input vector for the i-th training example, y_i is its associated class label, N is the number of training examples, K(x, x_i) is a positive definite kernel function, and \alpha = \{\alpha_1, ..., \alpha_N\} and b are the parameters of the model. Training an SVM consists in finding \alpha that minimizes the objective function

    Q(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (2)

subject to the constraints

    \sum_{i=1}^{N} \alpha_i y_i = 0    (3)

and

    0 \le \alpha_i \le C \quad \forall i.    (4)

The kernel K(x, x_i) can have different forms, such as the Radial Basis Function (RBF):

    K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right)    (5)

with parameter \sigma.

Therefore, to train an SVM, we need to solve a quadratic optimization problem, where the number of parameters is N. This makes the use of SVMs for large datasets difficult: computing K(x_i, x_j) for every training pair would require O(N^2) computation, and solving may take up to O(N^3). Note however that current state-of-the-art algorithms appear to have training time complexity scaling much closer to O(N^2) than O(N^3) [2].

3 A New Conditional Mixture of SVMs

In this section we introduce a new type of mixture of SVMs. The output of the mixture for an input vector x is computed as follows:

    f(x) = h\left( \sum_{m=1}^{M} w_m(x) s_m(x) \right)    (6)

where M is the number of experts in the mixture, s_m(x) is the output of the m-th expert given input x, w_m(x) is the weight for the m-th expert given by a "gater" module that also takes x as input, and h is a transfer function, which could for example be the hyperbolic tangent for classification tasks. Here each expert is an SVM, and we took a neural network for the gater in our experiments. In the proposed model, the gater is trained to minimize the cost function

    C = \sum_{i=1}^{N} [f(x_i) - y_i]^2.    (7)

To train this model, we propose a very simple algorithm (a Python sketch of the loop is given below):

1. Divide the training set into M random subsets of size near N/M.
2. Train each expert separately over one of these subsets.
3. Keeping the experts fixed, train the gater to minimize (7) on the whole training set.
4. Reconstruct M subsets: for each example (x_i, y_i),
   - sort the experts in descending order according to the values w_m(x_i),
   - assign the example to the first expert in the list which has less than (N/M + c) examples*, in order to ensure a balance between the experts.
5. If a termination criterion is not fulfilled (such as a given number of iterations or a validation error going up), go to step 2.

* Here c is a small positive constant. In the experiments, c = 1.
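The following is a minimal sketch of this loop in Python, included only for illustration. It assumes scikit-learn SVC experts and a small PyTorch network as the gater, trained by gradient descent on (7) with a softmax on the gater outputs; the function name, default hyper-parameters and these library choices are our own assumptions, not details from the paper (the experiments reported below used SVMTorch [2] for the experts and an MLP gater).

import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

def train_gated_mixture(X, y, M=50, c=1, n_hidden=150, sigma=1.7,
                        n_outer=5, gater_epochs=50, lr=1e-3):
    """X: (N, d) float array, y: (N,) array with labels in {-1, +1}."""
    N, d = X.shape
    capacity = N // M + c
    # Step 1: random partition into M subsets of size near N/M.
    perm = np.random.permutation(N)
    assign = np.empty(N, dtype=int)
    assign[perm] = np.arange(N) % M

    Xt = torch.tensor(X, dtype=torch.float32)
    yt = torch.tensor(y, dtype=torch.float32)
    gater = nn.Sequential(nn.Linear(d, n_hidden), nn.Tanh(), nn.Linear(n_hidden, M))

    for it in range(n_outer):
        # Step 2: train each expert on its current subset (embarrassingly parallel).
        experts = []
        for m in range(M):
            idx = np.where(assign == m)[0]
            # gamma = 1/sigma^2 reproduces the RBF kernel of equation (5).
            svm = SVC(kernel="rbf", gamma=1.0 / sigma ** 2, C=1.0)
            svm.fit(X[idx], y[idx])
            experts.append(svm)
        # Expert outputs s_m(x) on the whole training set, kept fixed below.
        S = torch.tensor(np.stack([e.decision_function(X) for e in experts], axis=1),
                         dtype=torch.float32)                     # shape (N, M)
        # Step 3: train the gater to minimize sum_i (f(x_i) - y_i)^2, equation (7).
        opt = torch.optim.Adam(gater.parameters(), lr=lr)
        for _ in range(gater_epochs):
            w = torch.softmax(gater(Xt), dim=1)                   # w_m(x)
            f = torch.tanh((w * S).sum(dim=1))                    # equation (6), h = tanh
            loss = ((f - yt) ** 2).sum()
            opt.zero_grad(); loss.backward(); opt.step()
        # Step 4: reassign each example to the best-ranked expert with spare capacity.
        with torch.no_grad():
            w = torch.softmax(gater(Xt), dim=1).numpy()
        order = np.argsort(-w, axis=1)                            # experts, best first
        counts = np.zeros(M, dtype=int)
        for i in np.random.permutation(N):
            for m in order[i]:
                if counts[m] < capacity:
                    assign[i], counts[m] = m, counts[m] + 1
                    break
    return experts, gater

In practice step 2 is the part that would be distributed across machines, which is where the large speed-ups reported in section 5 come from.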
Note that step 2 of this algorithm can easily be implemented in parallel, as each expert can be trained separately on a different computer. Note also that step 3 can be an approximate minimization (as usually done when training neural networks).

4 Other Mixtures of SVMs

The idea of mixture models is quite old and has given rise to very popular algorithms, such as the well-known Mixture of Experts [4], where the cost function is similar to equation (7) but where the gater and the experts are trained simultaneously, using gradient descent or EM, on the whole dataset (and not on subsets). Hence such an algorithm is quite demanding in terms of resources when the dataset is large, if training time scales like O(N^p) with p > 1.

In the more recent Support Vector Mixture model [5], the author shows how to replace the experts (typically neural networks) by SVMs and gives a learning algorithm for this model. Once again the resulting mixture is trained jointly on the whole dataset, and hence does not overcome the quadratic barrier when the dataset is large.

In another divide-and-conquer approach [7], the authors propose to first divide the training set using an unsupervised algorithm to cluster the data (typically a mixture of Gaussians), then train an expert (such as an SVM) on each subset of the data corresponding to a cluster, and finally recombine the outputs of the experts. Here, the algorithm does indeed train the experts separately on small datasets, like the present algorithm, but there is no notion of a loop reassigning the examples to experts according to the gater's prediction of how well each expert performs on each example. Our experiments suggest that this element is essential to the success of the algorithm.

Finally, the Bayesian Committee Machine [8] is a technique to partition the data into several subsets, train SVMs on the individual subsets and then use a specific combination scheme, based on the covariance of the test data, to combine the predictions. This method scales linearly in the number of training data, but is in fact a transductive method: it cannot operate on a single test example. Like in the previous case, this algorithm assigns the examples randomly to the experts (however, the Bayesian framework would in principle allow better assignments to be found).

Regarding our proposed mixture of SVMs, if the number of experts grows with the number of examples and the number of outer loop iterations is a constant, then the total training time of the experts scales linearly with the number of examples. Indeed, given N the total number of examples, choose the number of experts M such that the ratio N/M is a constant r. Then, if k is the number of outer loop iterations, and if the training time for an SVM with r examples is O(r^\beta) (empirically \beta is slightly above 2), the total training time of the experts is O(k r^\beta M) = O(k r^{\beta - 1} N), where k, r and \beta are constants, which gives a total training time of O(N). In particular, for \beta = 2 this gives O(krN). The actual total training time should however also include k times the training time of the gater, which may potentially grow more rapidly than O(N). However, this did not appear to be the case in our experiments, thus yielding apparent linear training time. Future work will focus on methods to reduce the gater training time and guarantee linear training time per outer loop iteration.
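Purely as an illustration of this argument (not a measurement from the paper), one can plug in the Forest settings of section 5: N = 400,000 examples, M = 50 experts (so r = N/M = 8,000) and about k = 5 outer iterations (Figure 4). Taking \beta = 2 and ignoring constant factors and the gater, a single SVM costs on the order of N^2 = 1.6 x 10^{11} operations, while the experts cost on the order of k M r^2 = k r N = 5 x 8,000 x 400,000 = 1.6 x 10^{10}, i.e. about N/(kr) = 10 times less, even before distributing the M experts over separate machines.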
5 Experiments

In this section, we present three sets of experiments comparing the new mixture of SVMs to other machine learning algorithms. Note that all the SVMs in these experiments have been trained using SVMTorch [2].

5.1 A Toy Problem

In the first series of experiments, we tested the mixture on an artificial toy problem for which we generated 10,000 training examples and 10,000 test examples. The problem had two non-linearly separable classes and two input dimensions. Figure 1 shows the decision surfaces obtained first by a linear SVM, then by a Gaussian SVM, and finally by the proposed mixture of SVMs. Moreover, in the latter, the gater was a simple linear function and there were two linear SVMs in the mixture†. This artificial problem thus shows that the algorithm works, and is able to combine, even linearly, very simple models in order to produce a non-linear decision surface.

Figure 1: Comparison of the decision surfaces obtained by (a) a linear SVM, (b) a Gaussian SVM, and (c) a linear mixture of two linear SVMs, on a two-dimensional classification toy problem.

† Note that the transfer function h(.) was still a tanh(.).

5.2 A Large-Scale Realistic Problem: Forest

For a more realistic problem, we did a series of experiments on part of the UCI Forest dataset‡. We modified the 7-class classification problem into a binary classification problem where the goal was to separate class 2 from the other 6 classes. Each example was described by 54 input features, each normalized by dividing by the maximum found on the training set (a sketch of this preprocessing is given below). The dataset had more than 500,000 examples and this allowed us to prepare a series of experiments as follows:

- We kept a separate test set of 50,000 examples to compare the best mixture of SVMs to other learning algorithms.
- We used a validation set of 10,000 examples to select the best mixture of SVMs, varying the number of experts and the number of hidden units in the gater.
- We trained our models on different training sets, using from 100,000 to 400,000 examples.
- The mixtures had from 10 to 50 expert SVMs with Gaussian kernel, and the gater was an MLP with between 25 and 500 hidden units.

‡ The Forest dataset is available on the UCI website at the following address: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info.

Note that since the number of examples was quite large, we selected the internal training parameters, such as the \sigma of the Gaussian kernel of the SVMs or the learning rate of the gater, using a held-out portion of the training set.
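For concreteness, here is a small numpy sketch of the preprocessing described above (binary relabeling of class 2 versus the rest, and per-feature scaling by the maximum found on the training set). The function name is ours, the paper gives no code, and we assume the features are non-negative so that dividing by the training maximum is well defined.

import numpy as np

def preprocess_forest(X_train, y_train, X_other):
    """Binary relabeling (class 2 vs. rest) and per-feature max scaling.
    Shapes: X_* are (n, 54) arrays, y_train has labels in {1, ..., 7}."""
    y_bin = np.where(y_train == 2, 1, -1)      # class 2 -> +1, the other 6 classes -> -1
    feat_max = X_train.max(axis=0)             # maxima found on the training set only
    feat_max[feat_max == 0] = 1.0              # guard against constant-zero features
    return X_train / feat_max, y_bin, X_other / feat_max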
We compared our models to:

- a single MLP, where the number of hidden units was selected by cross-validation between 25 and 250 units,
- a single SVM, where the parameter of the kernel was also selected by cross-validation,
- a mixture of SVMs where the gater was replaced by a constant vector, assigning the same weight value to every expert.

Table 1 gives the results of a first series of experiments with a fixed training set of 100,000 examples. To select among the variants of the gated SVM mixture we considered performance over the validation set as well as training time. All the SVMs used \sigma = 1.7. The selected model had 50 experts and a gater with 150 hidden units. A model with 500 hidden units would have given a performance of 8.1% over the test set but would have taken 621 minutes on one machine (and 388 minutes on 50 machines).

                      Train error (%)  Test error (%)  Time (min, 1 cpu)  Time (min, 50 cpu)
one MLP               17.56            18.15           12                 -
one SVM               16.03            16.76           3231               -
uniform SVM mixture   19.69            20.31           85                 2
gated SVM mixture     5.91             9.28            237                73

Table 1: Comparison of performance between an MLP (100 hidden units), a single SVM, a uniform SVM mixture where the gater always outputs the same value for each expert, and finally a mixture of SVMs as proposed in this paper.

As can be seen, the gated SVM mixture outperformed all models in terms of training and test error. Note that the training error of the single SVM is high because its hyper-parameters were selected to minimize error on the validation set (other values could yield much lower training error but larger test error). The mixture was also much faster, even on one machine, than the single SVM and, since the mixture can easily be parallelized (each expert can be trained separately), we also report the time it took to train on 50 machines. In a first attempt to understand these results, one can at least say that the power of the model does not lie only in the MLP gater, since a single MLP was pretty bad; nor is it only because we used SVMs, since a single SVM was not as good as the gated mixture; and it is not only because we divided the problem into many sub-problems, since the uniform mixture also performed badly. It seems to be a combination of all these elements.

We also did a series of experiments in order to see the influence of the number of hidden units of the gater as well as the number of experts in the mixture. Figure 2 shows the validation error of different mixtures of SVMs, where the number of hidden units varied from 25 to 500 and the number of experts varied from 10 to 50. There is a clear performance improvement when the number of hidden units is increased, while the improvement with additional experts exists but is not as strong. Note however that the training time also increases rapidly with the number of hidden units, while it slightly decreases with the number of experts if one uses one computer per expert (a sketch of such parallel expert training is given below).
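As a complement to the sketch of section 3, here is a minimal illustration, again our own and not the authors' setup, of how step 2 can be run in parallel on a single multi-core machine with the Python standard library; in the experiments above the experts were instead distributed over 50 machines.

from concurrent.futures import ProcessPoolExecutor
from sklearn.svm import SVC

def _fit_one(args):
    # Train one expert on its subset; must be a top-level function so it can be pickled.
    X_sub, y_sub, sigma = args
    svm = SVC(kernel="rbf", gamma=1.0 / sigma ** 2, C=1.0)
    svm.fit(X_sub, y_sub)
    return svm

def train_experts_parallel(subsets, sigma=1.7, n_workers=8):
    """subsets: list of (X_sub, y_sub) pairs, one per expert."""
    jobs = [(X_sub, y_sub, sigma) for X_sub, y_sub in subsets]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(_fit_one, jobs))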
\n\nValidation error as a function of the number of hidden units \n\nof the gater and the number of experts \n\n2!'50 \n\n100 \n\n150200 \n\n250 \n\n50 \n\nNumber of hidden \nunits of the gater \n\n500 \n\n10 \n\nFigure 2:  Comparison of the validation error of different mixtures of SVMs with various number \nof hidden units and experts. \n\nIn  order  to  find  how  the  algorithm  scaled  with  respect  to  the  number  of examples,  we  then \ncompared the same  mixture of experts  (50  experts,  150  hidden  units in  the  gater)  on different \ntraining set sizes.  Table 3 shows the validation error of the mixture of SVMs trained on training \nsets of sizes from  100,000 to 400,000.  It seems that, at least in this range and for  this particular \ndataset,  the mixture of SVMs  scales  linearly  with  respect  to the number of examples,  and not \nquadratically as  a  classical SVM.  It is  interesting to see  for  instance that the mixture of SVMs \nwas able to solve a problem of 400,000 examples in less  than 7 hours  (on 50  computers)  while it \nwould  have taken more than one month to solve the same problem with a  single SVM. \n\nFinally,  figure  4  shows  the  evolution  of the  training  and  validation  errors  of a  mixture  of 50 \nSVMs  gated  by  an  MLP  with  150  hidden  units,  during  5  iterations  of  the  algorithm.  This \nshould convince that the loop of the algorithm is  essential in order to obtain good performance. \nIt is  also  clear that the empirical convergence of the outer loop is  extremely rapid. \n\n5.3  Verification on Another Large-Scale  Problem \n\nIn order to verify that the results obtained on  Forest were replicable on  other large-scale prob(cid:173)\nlems,  we  tested  the  SVM  mixture  on  a  speech  task.  We  used  the  Numbers95  dataset  [1]  and \n\n\f450 ,----~--~-~--~-~-_ \n\nError as a function  of the number of training iterations \n\n400 \n\n350 \n\n_300 \nc: \nE \n-;250 \nE \ni= 200 \n\n150 \n\n100 \n\n1~ \n\n2 \n\nNumber of train  examples \n\n2~ \n\n3 \n\n3~ \n\n4 \nx 105 \n\nFigure 3:  Comparison of the training time \nof the  same mixture  of SVMs  (50  experts, \n150  hidden  units  in  the  gater)  trained  on \ndifferent training set sizes, from  100,000 to \n400,000. \n\n- Train error \n\nValidation Error \n\n-\n\n-\n\n-\n\n1\n\n14 \n\n13 \n\n12 \n\n11 \n~10 \ng 9 \n\nw \n\n8 \n\n7 \n\n6 \n\n~L---~2~--~3---~4---~5 \n\nNumber of training  iterations \n\nFigure  4:  Comparison  of the  training  and \nvalidation errors of the mixture of SVMs as \na  function  of the number of training itera(cid:173)\ntions. \n\nturned it into a  binary classification problem where the task was to separate silence frames from \nnon-silence  frames .  The  total  number of frames  was  around  540,000  frames.  The  training  set \ncontained 100,000  randomly chosen frames  out of the first  400,000 frames.  The disjoint  valida(cid:173)\ntion set  contained 10,000  randomly chosen frames  out of the first  400,000 frames  also.  Finally, \nthe test set contained 50,000 randomly chosen frames out of the last 140,000 frames.  Note that \nthe validation set was  used  here to select  the number of experts in  the mixture, the number of \nhidden  units in the gater,  and a.  
Each frame was parameterized using standard methods used in speech recognition (J-RASTA coefficients, with first and second temporal derivatives) and was thus described by 45 coefficients, but we in fact used an input window of three frames, yielding 135 input features per example.

Table 2 shows a comparison between a single SVM and a mixture of SVMs on this dataset. The number of experts in the mixture was set to 50, the number of hidden units of the gater was set to 50, and the \sigma of the SVMs was set to 3.0. As can be seen, the mixture of SVMs was again many times faster than the single SVM (even on 1 cpu only) but yielded similar generalization performance.

                      Train error (%)  Test error (%)  Time (min, 1 cpu)  Time (min, 50 cpu)
one SVM               0.98             7.57            6787               -
gated SVM mixture     4.41             7.32            851                65

Table 2: Comparison of performance between a single SVM and a mixture of SVMs on the speech dataset.

6 Conclusion

In this paper we have presented a new algorithm to train a mixture of SVMs that gave very good results compared to classical SVMs, either in terms of training time or generalization performance, on two large-scale difficult databases. Moreover, the algorithm appears to scale linearly with the number of examples, at least between 100,000 and 400,000 examples.

These results are extremely encouraging and suggest that the proposed method could allow training SVM-like models for very large multi-million example datasets in a reasonable time. If training of the neural network gater with stochastic gradient takes time that grows much less than quadratically, as we conjecture to be the case for very large datasets (to reach a "good enough" solution), then the whole method is clearly sub-quadratic in training time with respect to the number of training examples. Future work will address several questions: how can linear training time be guaranteed for the gater as well as for the experts? Can better results be obtained by tuning the hyper-parameters of each expert separately? Does the approach work well for other types of experts?

Acknowledgments

RC would like to thank the Swiss NSF for financial support (project FN2100-061234.00). YB would like to thank the NSERC funding agency and the NCM2 network for support.

References

[1] R. A. Cole, M. Noel, T. Lander, and T. Durham. New telephone speech corpora at CSLU. Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 1:821-824, 1995.

[2] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143-160, 2001.

[3] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[4] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

[5] J. T. Kwok. Support vector mixture for classification and regression problems. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 255-258, Brisbane, Queensland, Australia, 1998.

[6] E. Osuna, R. Freund, and F. Girosi.
Training support vector machines: an application to face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 130-136, San Juan, Puerto Rico, 1997.

[7] A. Rida, A. Labbi, and C. Pellegrini. Local experts combination through density decomposition. In International Workshop on AI and Statistics (Uncertainty'99). Morgan Kaufmann, 1999.

[8] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719-2741, 2000.

[9] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 1995.
", "award": [], "sourceid": 1949, "authors": [{"given_name": "Ronan", "family_name": "Collobert", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}