{"title": "Model Selection for Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 230, "page_last": 236, "abstract": null, "full_text": "Model Selection for Support Vector Machines \n\nOlivier Chapelle*,t, Vladimir Vapnik* \n* AT&T Research Labs, Red Bank, NJ \n\nt LIP6, Paris, France \n\n{ chapelle, vlad} @research.au.com \n\nAbstract \n\nNew functionals for parameter (model) selection of Support Vector Ma(cid:173)\nchines are introduced based on the concepts of the span of support vec(cid:173)\ntors and rescaling of the feature space.  It is shown that using these func(cid:173)\ntionals, one can both predict the best choice of parameters of the model \nand the relative quality of performance for any value of parameter. \n\n1  Introduction \n\nSupport Vector Machines (SVMs) implement the following  idea:  they map input vectors \ninto a high dimensional feature space, where a maximal margin hyperplane is constructed \n[6].  It was  shown  that when  training data are  separable, the error rate  for  SVMs can  be \ncharacterized by \n\n(1) \n\nwhere R is the radius ofthe smallest sphere containing the training data and M  is the mar(cid:173)\ngin (the distance between the hyperplane and the closest training vector in  feature space). \nThis functional  estimates  the VC dimension of hyperplanes separating data with  a given \nmargin M. \nTo  perform  the  mapping  and  to  calculate  Rand M  in  the  SVM  technique.  one  uses  a \npositive definite  kernel  K(x, x')  which  specifies  an  inner  product  in  feature  space.  An \nexample of such a kernel is the Radial Basis Function (RBF). \n\nK(x, x') =  e-llx-x'II2/20'2. \n\nThis kernel has a free parameter (7  and more generally, most kernels require some param(cid:173)\neters to  be  set.  When  treating  noisy  data with  SVMs.  another  parameter.  penalizing  the \ntraining errors. also needs to be set.  The problem of choosing the values of these parame(cid:173)\nters which minimize the expectation of test error is called the model selection problem. \n\nIt was shown that the parameter of the kernel that minimizes functional (1) provides a good \nchoice for the model:  the minimum for this functional coincides with the minimum of the \ntest error [1].  However. the shapes of these curves can be different. \n\nIn  this  article  we  introduce  refined  functionals  that  not  only  specify  the  best  choice  of \nparameters (both the parameter of the kernel  and  the parameter penalizing training error). \nbut also produce curves which better reflect the actual error rate. \n\n\fModel Selection for Support  Vector Machines \n\n231 \n\nThe  paper  is  organized  as  follows.  Section  2  describes  the  basics  of SVMs,  section  3 \nintroduces a new functional based on the concept of the span of support vectors, section 4 \nconsiders the idea of rescaling data in feature space and section 5 discusses experiments of \nmodel selection with these functionals. \n\n2  Support Vector Learning \n\nWe  introduce some standard notation for  SVMs;  for  a complete description,  see [6].  Let \n(Xi, Yih <i<l  be a set of training examples,  Xi  E  jRn  which  belong to  a class  labeled  by \nYi  E {-f, f}.  The decision function given by a SVM is : \n\nwhere the coefficients a?  are obtained by maximizing the following functional: \n\nl I t  \n\nW(a)  =  Lai - 2'  L  aiajYiYjK(Xi,Xj) \n\ni=l \n\ni,j=l \n\n(2) \n\n(3) \n\nunder constraints \n\nt \nL  aiYi  = 0 and 0 ~ ai ~ C  i = 1, ... , f. 
$C$ is a constant which controls the trade-off between the complexity of the decision function and the number of training examples misclassified. SVMs are linear maximal margin classifiers in a high-dimensional feature space where the data are mapped through a non-linear function $\varphi(x)$ such that $\varphi(x_i) \cdot \varphi(x_j) = K(x_i, x_j)$.

The points $x_i$ with $\alpha_i > 0$ are called support vectors. We distinguish between those with $0 < \alpha_i < C$ and those with $\alpha_i = C$. We call them respectively support vectors of the first and second category.

3 Prediction using the span of support vectors

The results introduced in this section are based on the leave-one-out cross-validation estimate. This procedure is usually used to estimate the probability of test error of a learning algorithm.

3.1 The leave-one-out procedure

The leave-one-out procedure consists of removing one element from the training data, constructing the decision rule on the basis of the remaining training data and then testing the removed element. In this fashion one tests all $\ell$ elements of the training data (using $\ell$ different decision rules). Let us denote the number of errors made by the leave-one-out procedure by $\mathcal{L}(x_1, y_1, \ldots, x_\ell, y_\ell)$. It is known [6] that the leave-one-out procedure gives an almost unbiased estimate of the probability of test error: the expectation of test error for the machine trained on $\ell - 1$ examples is equal to the expectation of $\frac{1}{\ell} \mathcal{L}(x_1, y_1, \ldots, x_\ell, y_\ell)$.

We now provide an analysis of the number of errors made by the leave-one-out procedure. For this purpose, we introduce a new concept, called the span of support vectors [7].

3.2 Span of support vectors

Since the results presented in this section do not depend on the feature space, we will consider, without any loss of generality, linear SVMs, i.e. $K(x_i, x_j) = x_i \cdot x_j$.

Suppose that $\alpha^0 = (\alpha_1^0, \ldots, \alpha_\ell^0)$ is the solution of the optimization problem (3). For any fixed support vector $x_p$ we define the set $\Lambda_p$ as the constrained linear combinations of the support vectors of the first category $(x_i)_{i \neq p}$:

    \Lambda_p = \left\{ \sum_{i=1,\, i \neq p}^{\ell} \lambda_i x_i \;:\; \sum_{i \neq p} \lambda_i = 1,\; 0 \le \alpha_i^0 + y_i y_p \alpha_p^0 \lambda_i \le C \right\}.    (4)

Note that $\lambda_i$ can be less than 0.

We also define the quantity $S_p$, which we call the span of the support vector $x_p$, as the minimum distance between $x_p$ and this set (see figure 1):

    S_p = d(x_p, \Lambda_p) = \min_{x \in \Lambda_p} \|x_p - x\|.    (5)

[Figure 1: Three support vectors with $\alpha_1 = \alpha_2 = \alpha_3/2$. The set $\Lambda_1$ is the semi-opened dashed line.]

It was shown in [7] that the set $\Lambda_p$ is not empty and that $S_p = d(x_p, \Lambda_p) \le D_{SV}$, where $D_{SV}$ is the diameter of the smallest sphere containing the support vectors.
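To make the definition concrete, here is a hedged numerical sketch (ours, not the authors' code) that evaluates the span of equations (4)-(5) by direct constrained minimization. The kernel matrix K, labels y, multipliers alpha and constant C are assumed to come from a trained machine such as the one sketched above; distances are computed through kernel values so that the same code also covers the non-linear case.

# Hedged sketch (ours): the span S_p of eq. (5) as the distance from
# phi(x_p) to the constrained set Lambda_p of eq. (4).
import numpy as np
from scipy.optimize import minimize

def span(p, K, y, alpha, C, eps=1e-8):
    # First-category support vectors other than p
    sv = [i for i in range(len(alpha)) if eps < alpha[i] < C - eps and i != p]
    Ksv = K[np.ix_(sv, sv)]
    kp = K[p, sv]

    def dist2(lam):          # ||phi(x_p) - sum_i lam_i phi(x_i)||^2 via kernels
        return K[p, p] - 2.0 * lam @ kp + lam @ Ksv @ lam

    # Per-coordinate bounds from 0 <= alpha_i + y_i y_p alpha_p lam_i <= C
    bounds = []
    for i in sv:
        a = y[i] * y[p] * alpha[p]
        bounds.append(tuple(sorted(((0.0 - alpha[i]) / a, (C - alpha[i]) / a))))

    cons = {"type": "eq", "fun": lambda lam: lam.sum() - 1.0}
    lam0 = np.full(len(sv), 1.0 / len(sv))
    res = minimize(dist2, lam0, bounds=bounds, constraints=cons)
    return np.sqrt(max(res.fun, 0.0))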
Intuitively, the smaller $S_p = d(x_p, \Lambda_p)$ is, the less likely the leave-one-out procedure is to make an error on the vector $x_p$. Formally, the following theorem holds:

Theorem 1 [7] If in the leave-one-out procedure a support vector $x_p$ corresponding to $0 < \alpha_p < C$ is recognized incorrectly, then the following inequality holds:

    \alpha_p^0 \ge \frac{1}{S_p \max(D, 1/\sqrt{C})},

where $D$ is the diameter of the smallest sphere containing the training points.

This theorem implies that in the separable case ($C = \infty$), the number of errors made by the leave-one-out procedure is bounded as follows: $\mathcal{L}(x_1, y_1, \ldots, x_\ell, y_\ell) \le D \sum_p \alpha_p^0 S_p \le \max_p S_p \, D / M^2$, because $\sum_p \alpha_p^0 = 1/M^2$ [6]. This is already an improvement over functional (1), since $S_p \le D_{SV}$. But depending on the geometry of the support vectors, the value of the span $S_p$ can be much less than the diameter $D_{SV}$ of the support vectors and can even be equal to zero.

We can go further under the assumption that the set of support vectors does not change during the leave-one-out procedure, which leads us to the following theorem:

Theorem 2 If the sets of support vectors of the first and second categories remain the same during the leave-one-out procedure, then for any support vector $x_p$ the following equality holds:

    y_p [f^0(x_p) - f^p(x_p)] = \alpha_p^0 S_p^2,

where $f^0$ and $f^p$ are the decision functions (2) given by the SVM trained respectively on the whole training set and after the point $x_p$ has been removed.

The proof of the theorem follows the proof of Theorem 1 in [7].

The assumption that the set of support vectors does not change during the leave-one-out procedure is obviously not satisfied in most cases. Nevertheless, the proportion of points which violate this assumption is usually small compared to the number of support vectors. In this case, Theorem 2 provides a good approximation of the result of the leave-one-out procedure, as pointed out by the experiments (see Section 5.1, figure 2).

As already noticed in [1], the larger $\alpha_p$ is, the more "important" the support vector $x_p$ is in the decision function. Thus, it is not surprising that removing a point $x_p$ causes a change in the decision function proportional to its Lagrange multiplier $\alpha_p$. The same kind of result as Theorem 2 has also been derived in [2], where for SVMs without threshold the following inequality was obtained: $y_p(f^0(x_p) - f^p(x_p)) \le \alpha_p^0 K(x_p, x_p)$. The span $S_p$ takes into account the geometry of the support vectors in order to get a precise notion of how "important" a given point is.

The previous theorem enables us to compute the number of errors made by the leave-one-out procedure:

Corollary 1 Under the assumption of Theorem 2, the test error prediction given by the leave-one-out procedure is

    \frac{t}{\ell} = \frac{1}{\ell} \, \mathrm{card}\left\{ p : \alpha_p^0 S_p^2 \ge y_p f^0(x_p) \right\}.    (6)

Note that points which are not support vectors are correctly classified by the leave-one-out procedure. Therefore $t/\ell$ gives the number of errors of the leave-one-out procedure on the entire training set.

Under the assumption of Theorem 2, the box constraints in the definition of $\Lambda_p$ (4) can be removed. Moreover, if we consider only hyperplanes passing through the origin, the constraint $\sum \lambda_i = 1$ can also be removed. Therefore, under those assumptions, the computation of the span $S_p$ is an unconstrained minimization of a quadratic form and can be done analytically. For support vectors of the first category, this leads to the closed form $S_p^2 = 1/(K_{SV}^{-1})_{pp}$, where $K_{SV}$ is the matrix of dot products between support vectors of the first category. A similar result has also been obtained in [3].
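As an illustration of Corollary 1 together with this closed form, the short sketch below (ours, valid only under the stated assumptions of Theorem 2) estimates the leave-one-out error from a single trained SVM; for simplicity it restricts the count to first-category support vectors.

# Hedged sketch (ours) of the span-rule (6), using the closed form
# S_p^2 = 1 / (K_SV^{-1})_pp for first-category support vectors under the
# assumptions of Theorem 2. K: kernel matrix, y: labels, alpha: multipliers,
# f0: decision values f^0(x_i) on the training set.
import numpy as np

def span_rule(K, y, alpha, f0, C, eps=1e-8):
    sv = np.where((alpha > eps) & (alpha < C - eps))[0]    # first category only
    S2 = 1.0 / np.diag(np.linalg.inv(K[np.ix_(sv, sv)]))   # closed-form spans
    errors = np.sum(alpha[sv] * S2 >= y[sv] * f0[sv])      # test of Corollary 1
    return errors / float(len(y))                          # predicted test error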
In Section 5, we use the span-rule (6) for model selection in both the separable and the non-separable case.

4 Rescaling

As we already mentioned, functional (1) bounds the VC dimension of a linear margin classifier. This bound is tight when the data almost "fill" the surface of the sphere enclosing the training data, but when the data lie on a flat ellipsoid this bound is poor, since the radius of the sphere takes into account only the components with the largest deviations. The idea we present here is to rescale our data in feature space such that the radius of the sphere stays constant while the margin increases, and then apply this bound to the rescaled data and hyperplane.

Let us first consider linear SVMs, i.e. without any mapping into a high dimensional space. The rescaling can be achieved by computing the covariance matrix of our data and rescaling according to its eigenvalues. Suppose our data are centered and let $(\varphi_1, \ldots, \varphi_n)$ be the normalized eigenvectors of the covariance matrix of our data. We can then compute the smallest enclosing box containing our data, centered at the origin and whose edges are parallel to $(\varphi_1, \ldots, \varphi_n)$. This box is an approximation of the smallest enclosing ellipsoid. The length of the edge in the direction $\varphi_k$ is $\mu_k = \max_i |x_i \cdot \varphi_k|$. The rescaling consists of the diagonal transformation

    D : x \mapsto Dx = \sum_k \mu_k (x \cdot \varphi_k) \, \varphi_k.

Let us consider $\tilde{x}_i = D^{-1} x_i$ and $\tilde{w} = Dw$. The decision function is not changed under this transformation, since $\tilde{w} \cdot \tilde{x}_i = w \cdot x_i$, and the data $\tilde{x}_i$ fill a box of side length 1. Thus, in functional (1), we replace $R^2$ by 1 and $1/M^2$ by $\tilde{w}^2$. Since we rescaled our data into a box, we actually estimated the radius of the enclosing ball using the $\ell_\infty$-norm instead of the classical $\ell_2$-norm. Further theoretical work needs to be done to justify this change of norm.

In the non-linear case, note that even if we map our data into a high dimensional feature space, they lie in the linear subspace spanned by these data. Thus, if the number of training data $\ell$ is not too large, we can work in this subspace of dimension at most $\ell$. For this purpose, one can use the tools of kernel PCA [5]: if $A$ is the matrix of normalized eigenvectors of the Gram matrix $K_{ij} = K(x_i, x_j)$ and $(\lambda_k)$ the eigenvalues, the dot product $x_i \cdot \varphi_k$ is replaced by $\sqrt{\lambda_k} A_{ik}$ and $w \cdot \varphi_k$ becomes $\sqrt{\lambda_k} \sum_i A_{ik} y_i \alpha_i$. Thus, we can still apply the diagonal transformation $D$, and functional (1) finally becomes

    \sum_k \lambda_k^2 \max_i A_{ik}^2 \left( \sum_i A_{ik} y_i \alpha_i \right)^2.
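This rescaled functional can be evaluated directly from the Gram matrix. The following sketch (our reading of the formula above, assuming the data are centered in feature space) computes it with a single eigendecomposition, as in kernel PCA [5].

# Hedged sketch (ours): the rescaled bound of Section 4 computed from the
# Gram matrix K, labels y and multipliers alpha. Assumes feature-space
# centered data, as in the derivation above.
import numpy as np

def rescaled_bound(K, y, alpha):
    lam, A = np.linalg.eigh(K)          # eigenvalues lam_k, eigenvectors A[:, k]
    keep = lam > 1e-10                  # discard numerically null directions
    lam, A = lam[keep], A[:, keep]
    w_k = A.T @ (y * alpha)             # sum_i A_ik y_i alpha_i, for each k
    mu2 = (A**2).max(axis=0)            # max_i A_ik^2
    return float(np.sum(lam**2 * mu2 * w_k**2))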
5 Experiments

To check these new methods, we performed two series of experiments. One concerns the choice of $\sigma$, the width of the RBF kernel, on a linearly separable database, the postal database. This dataset consists of 7291 handwritten digits of size 16x16, with a test set of 2007 examples. Following [4], we split the training set into 23 subsets of 317 training examples. Our task consists of separating digits 0 to 4 from 5 to 9. Error bars in figures 2a and 3 are standard deviations over the 23 trials. In a second experiment, we try to choose the optimal value of $C$ on a noisy database, the breast-cancer database (available from http://horn.first.gmd.de/~raetsch/data/breast-cancer). The dataset has been split randomly 100 times into a training set containing 200 examples and a test set containing 77 examples.

Section 5.1 describes experiments of model selection using the span-rule (6), both in the separable case and in the non-separable one, while Section 5.2 shows VC bounds for model selection in the separable case, both with and without rescaling.

5.1 Model selection using the span-rule

In this section, we use the prediction of test error derived from the span-rule (6) for model selection. Figure 2a shows the test error and the prediction given by the span for different values of the width $\sigma$ of the RBF kernel on the postal database. Figure 2b plots the same functions for different values of $C$ on the breast-cancer database. We can see that the method predicts the correct value of the minimum. Moreover, the prediction is very accurate and the curves are almost identical.

[Figure 2: Test error and its prediction using the span-rule (6). (a) choice of $\sigma$ in the postal database; (b) choice of $C$ in the breast-cancer database.]

The computation of the span-rule (6) involves computing the span $S_p$ (5) for every support vector. Note, however, that we are interested in the inequality $S_p^2 \le y_p f^0(x_p)/\alpha_p^0$ rather than in the exact value of the span $S_p$. Thus, while minimizing $S_p = d(x_p, \Lambda_p)$, if we find a point $x^* \in \Lambda_p$ such that $d(x_p, x^*)^2 \le y_p f^0(x_p)/\alpha_p^0$, we can stop the minimization, because this point will be correctly classified by the leave-one-out procedure.

It turned out in the experiments that the time required to compute the span was not prohibitive, since it was about the same as the training time.

There is a noteworthy extension in the application of the span concept. If we denote by $\theta$ one hyperparameter of the kernel and if the derivative $\partial K(x_i, x_j)/\partial \theta$ is computable, then it is possible to compute analytically $\frac{\partial}{\partial \theta} \sum_i \frac{\alpha_i S_i^2}{y_i f^0(x_i)}$, which is the derivative of an upper bound on the number of errors made by the leave-one-out procedure (see Theorem 2). This provides us with a more powerful technique for model selection. Indeed, our initial approach was to choose the value of the width $\sigma$ of the RBF kernel according to the minimum of the span-rule. In our case there was only one hyperparameter, so it was possible to try different values of $\sigma$. But if we have several hyperparameters, for example one $\sigma_k$ per component,

    K(x, x') = e^{-\sum_k (x_k - x'_k)^2 / 2\sigma_k^2},

it is not possible to do an exhaustive search over all possible values of the hyperparameters. Nevertheless, the previous remark enables us to find their optimal value by a classical gradient descent approach (see the sketch after this paragraph).
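The following rough sketch (ours, not the authors' procedure) illustrates this idea for a single width: it descends the upper bound $\sum_i \alpha_i S_i^2 / (y_i f^0(x_i))$ as a function of $\log \sigma$. The paper differentiates this bound analytically; we substitute a finite-difference approximation purely for illustration, and span_bound is a hypothetical callable that retrains the SVM for a given log-width and returns the bound.

# Rough illustration (ours): gradient descent on log(sigma) of the upper
# bound sum_i alpha_i S_i^2 / (y_i f0(x_i)). Finite differences stand in
# for the analytic derivative used in the paper. `span_bound` is a
# hypothetical callable: log-width -> value of the bound.
def select_log_sigma(span_bound, s0=0.0, lr=0.5, steps=30, eps=1e-2):
    s = s0
    for _ in range(steps):
        grad = (span_bound(s + eps) - span_bound(s - eps)) / (2 * eps)
        s -= lr * grad
    return s            # exp(s) is the selected kernel width sigma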
Preliminary results seem to show that using this approach with the previously mentioned kernel improves the test error significantly.

5.2 VC dimension with rescaling

In this section, we perform model selection on the postal database using functional (1) and its rescaled version. Figure 3a shows the values of the classical bound $R^2/M^2$ for different values of $\sigma$. This bound predicts the correct value for the minimum, but does not reflect the actual test error. This is easily understandable, since for large values of $\sigma$ the data in input space tend to be mapped into a very flat ellipsoid in feature space, a fact which is not taken into account by this bound [4]. Figure 3b shows that by performing a rescaling of our data we obtain a much tighter bound, and this curve reflects the actual test error, given in figure 2a.

[Figure 3: Bound on the VC dimension for different values of $\sigma$ on the postal database: (a) without rescaling; (b) with rescaling. The shape of the curve with rescaling is very similar to that of the test error in figure 2.]

6 Conclusion

In this paper, we introduced two new techniques of model selection for SVMs. One is based on the span, the other is based on rescaling of the data in feature space. We demonstrated that using these techniques one can both predict optimal values for the parameters of the model and evaluate relative performances for different values of the parameters. These functionals can also lead to new learning techniques, as they establish that generalization ability is not due to margin alone.

Acknowledgments

The authors would like to thank Jason Weston and Patrick Haffner for helpful discussions and comments.

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[2] T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics, 1999.

[3] M. Opper and O. Winther. Gaussian process classification and SVM: Mean field results and leave-one-out estimator. In Advances in Large Margin Classifiers. MIT Press, 1999. To appear.

[4] B. Schölkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Kernel-dependent support vector error bounds. In Ninth International Conference on Artificial Neural Networks, pages 304-309, 1999.

[5] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Artificial Neural Networks - ICANN'97, pages 583-588, Berlin, 1997. Springer Lecture Notes in Computer Science, Vol. 1327.

[6] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[7] V. Vapnik and O. Chapelle. Bounds on error expectation for SVM. Neural Computation, 1999. Submitted.
\n\n\f", "award": [], "sourceid": 1663, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}