{"title": "Understanding Stepwise Generalization of Support Vector Machines: a Toy Model", "book": "Advances in Neural Information Processing Systems", "page_first": 321, "page_last": 327, "abstract": null, "full_text": "Understanding stepwise generalization of Support Vector Machines: a toy model\n\nSebastian Risau-Gusman and Mirta B. Gordon\nDRFMC/SPSMS CEA Grenoble, 17 av. des Martyrs\n38054 Grenoble cedex 09, France\n\nAbstract\n\nIn this article we study the effects of introducing structure in the input distribution of the data to be learnt by a simple perceptron. We determine the learning curves within the framework of Statistical Mechanics. Stepwise generalization occurs as a function of the number of examples when the distribution of patterns is highly anisotropic. Although extremely simple, the model seems to capture the relevant features of a class of Support Vector Machines which was recently shown to present this behavior.\n\n1 Introduction\n\nA new approach to learning has recently been proposed as an alternative to feedforward neural networks: the Support Vector Machines (SVM) [1]. Instead of trying to learn a non-linear mapping between the input patterns and internal representations, as in multilayered perceptrons, SVMs choose a priori a non-linear kernel that transforms the input space into a high-dimensional feature space. In binary classification tasks like those considered in the present paper, SVMs look for a linear separation with optimal margin in feature space. The main advantage of SVMs is that learning becomes a convex optimization problem. The difficulty of having many local minima, which hinders the training of multilayered neural networks, is thus avoided. 
One of the questions raised by this approach is why SVMs do not overfit the data in spite of the extremely large dimensions of the feature spaces considered.\n\nTwo recent theoretical papers [2, 3] studied a family of SVMs with the tools of Statistical Mechanics, predicting typical properties in the limit of large dimensional spaces. Both papers considered mappings generated by polynomial kernels, and more specifically quadratic ones. In these, the input vectors $\\mathbf{x} \\in \\mathbb{R}^N$ are transformed to $N(N+1)/2$-dimensional feature vectors $\\Phi(\\mathbf{x})$. More precisely, the mapping $\\Phi_1(\\mathbf{x}) = (\\mathbf{x}, x_1\\mathbf{x}, x_2\\mathbf{x}, \\ldots, x_k\\mathbf{x})$ has been studied in [3] as a function of $k$, the number of quadratic features, and $\\Phi_2(\\mathbf{x}) = (\\mathbf{x}, x_1\\mathbf{x}/N, x_2\\mathbf{x}/N, \\ldots, x_N\\mathbf{x}/N)$ has been considered in [2], leading to different results. These mappings are particular cases of quadratic kernels. In particular, in the case of learning quadratically separable tasks with mapping $\\Phi_2$, the generalization error decreases up to a lower bound for a number of examples proportional to $N$, followed by a further decrease if the number of examples increases proportionally to the dimension of the feature space, i.e. to $N^2$. In fact, this behavior is not specific to SVMs. It also arises in the typical case of Gibbs learning (defined below) in quadratic feature spaces [4]: on increasing the training set size, the quadratic components of the discriminating surface are learnt after the linear ones. In the case of learning linearly separable tasks in quadratic feature spaces, the effect of overfitting is harmless, as it only slows down the decrease of the generalization error with the training set size. 
In the case of mapping $\\Phi_1$, overfitting is dramatic, as the generalization error at any given training set size increases with the number $k$ of features.\n\nThe aim of the present paper is to understand the influence of the mapping scaling factor on the generalization performance of SVMs. To this end, it is worth remarking that the features $\\Phi_2$ may be obtained by compressing the quadratic subspace of $\\Phi_1$ by a fixed factor. In order to mimic this contraction, we consider a linearly separable task in which the input patterns have a highly anisotropic distribution, so that the variance in one subspace is much smaller than in the orthogonal directions. We show that in this simple toy model, the generalization error as a function of the training set size exhibits a crossover between two different behaviors: a rapid decrease corresponding to learning the components in the uncompressed space, followed by a slow improvement in which mainly the components in the compressed space are learnt. The latter would correspond, in this highly stylized model, to learning the scaled quadratic features in the SVM with mapping $\\Phi_2$.\n\nThe paper is organized as follows: after a short presentation of the model, we describe the main steps of the Statistical Mechanics calculation. The order parameters characterizing the properties of the learning process are defined, and their evolution as a function of the training set size is analyzed. The two regimes of the generalization error are described, and we determine the training set size per input dimension at the crossover, as a function of the pertinent parameters. Finally we discuss our results and their relevance to the understanding of the generalization properties of SVMs.\n\n2 The model\n\nWe consider the problem of learning a binary classification task from examples. 
The training data set $\\mathcal{D}_\\alpha$ contains $P = \\alpha N$ $N$-dimensional patterns $(\\boldsymbol{\\xi}^\\mu, \\tau^\\mu)$ ($\\mu = 1, \\ldots, P$), where $\\tau^\\mu = \\mathrm{sign}(\\boldsymbol{\\xi}^\\mu \\cdot \\mathbf{w}^*)$ is given by a teacher of weights $\\mathbf{w}^* = (w_1^*, w_2^*, \\ldots, w_N^*)$. Without any loss of generality we consider normalized teachers: $\\mathbf{w}^* \\cdot \\mathbf{w}^* = N$. We assume that the components $\\xi_i$ ($i = 1, \\ldots, N$) of the input patterns $\\boldsymbol{\\xi}$ are independent, identically distributed random variables drawn from a zero-mean gaussian distribution, with variance $\\sigma^2$ along $N_c$ directions and unit variance in the $N_u$ remaining ones ($N_c + N_u = N$):\n\n$$p(\\boldsymbol{\\xi}) = \\prod_{i \\in N_c} \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{\\xi_i^2}{2\\sigma^2}\\right) \\prod_{i \\in N_u} \\frac{1}{\\sqrt{2\\pi}} \\exp\\left(-\\frac{\\xi_i^2}{2}\\right). \\tag{1}$$\n\nWe take $\\sigma < 1$ without any loss of generality, as the case $\\sigma > 1$ may be deduced from the former through a straightforward rescaling of $N_c$ and $N_u$. Hereafter, the subspace of dimension $N_c$ and variance $\\sigma^2$ will be called the compressed subspace. The corresponding orthogonal subspace, of dimension $N_u = N - N_c$, will be called the uncompressed subspace.\n\nWe study the typical generalization error of a student perceptron learning the classification task, using the tools of Statistical Mechanics. The pertinent cost function is the number of misclassified patterns:\n\n$$E(\\mathbf{w}; \\mathcal{D}_\\alpha) = \\sum_{\\mu=1}^{P} \\Theta(-\\tau^\\mu \\boldsymbol{\\xi}^\\mu \\cdot \\mathbf{w}). \\tag{2}$$\n\nThe weight vectors in version space correspond to a vanishing cost (2). Choosing a $\\mathbf{w}$ at random from the a posteriori distribution\n\n$$P(\\mathbf{w}|\\mathcal{D}_\\alpha) = Z^{-1} P_0(\\mathbf{w}) \\exp(-\\beta E(\\mathbf{w}; \\mathcal{D}_\\alpha)), \\tag{3}$$\n\nin the limit $\\beta \\to \\infty$ is called Gibbs' learning. In eq. (3), $\\beta$ is equivalent to an inverse temperature in the Statistical Mechanics formulation, the cost (2) being the energy function. 
We assume that $P_0$, the a priori distribution of the weights, is uniform on the hypersphere of radius $\\sqrt{N}$:\n\n$$P_0(\\mathbf{w}) = (2\\pi e)^{-N/2}\\, \\delta(\\mathbf{w} \\cdot \\mathbf{w} - N). \\tag{4}$$\n\nThe normalization constant $(2\\pi e)^{N/2}$ is the leading order term of the hypersphere's surface in $N$-dimensional space. $Z$ is the partition function ensuring the correct normalization of $P(\\mathbf{w}|\\mathcal{D}_\\alpha)$:\n\n$$Z(\\beta; \\mathcal{D}_\\alpha) = \\int d\\mathbf{w}\\, P_0(\\mathbf{w}) \\exp(-\\beta E(\\mathbf{w}; \\mathcal{D}_\\alpha)). \\tag{5}$$\n\nIn general, the properties of the student are related to those of the free energy $F(\\beta; \\mathcal{D}_\\alpha) = -\\ln Z(\\beta; \\mathcal{D}_\\alpha)/\\beta$. In the limit $N \\to \\infty$ with the training set size per input dimension $\\alpha \\equiv P/N$ constant, the properties of the student weights become independent of the particular training set $\\mathcal{D}_\\alpha$. They are deduced from the averaged free energy per degree of freedom, calculated using the replica trick:\n\n[equation (6) is missing from the extracted text]\n\nwhere the overline represents the average over $\\mathcal{D}_\\alpha$, composed of patterns selected according to (1). In the case of Gibbs learning, the typical behavior of any intensive quantity is obtained in the zero temperature limit $\\beta \\to \\infty$. In this limit, only error-free solutions, with vanishing cost, have non-vanishing posterior probability (3). Thus, Gibbs learning corresponds to picking at random a student in version space, i.e. a vector $\\mathbf{w}$ that classifies correctly the training set $\\mathcal{D}_\\alpha$, with a probability proportional to $P_0(\\mathbf{w})$.\n\nIn the case of an isotropic pattern distribution, which corresponds to $\\sigma = 1$ in (1), the properties of cost function (2) have been extensively studied [5]. The cases of patterns drawn from two gaussian clusters whose symmetry axis is the same as [6] or different from [7] the teacher's axis have recently been addressed. 
Here we consider the problem where, instead of having a single direction along which the patterns' distribution is contracted (or expanded), there is a finite fraction of compressed dimensions. In this case, all the properties of the student perceptron may be expressed in terms of the following order parameters, which have to satisfy the corresponding extremum conditions of the free energy:\n\n$$\\tilde{q}_c^{ab} = \\left\\langle \\frac{1}{N} \\sum_{i \\in N_c} w_{ia} w_{ib} \\right\\rangle \\tag{7}$$\n\n$$\\tilde{q}_u^{ab} = \\left\\langle \\frac{1}{N} \\sum_{i \\in N_u} w_{ia} w_{ib} \\right\\rangle \\tag{8}$$\n\n$$\\tilde{R}_c^{a} = \\left\\langle \\frac{1}{N} \\sum_{i \\in N_c} w_{ia} w_i^* \\right\\rangle \\tag{9}$$\n\n$$\\tilde{R}_u^{a} = \\left\\langle \\frac{1}{N} \\sum_{i \\in N_u} w_{ia} w_i^* \\right\\rangle \\tag{10}$$\n\n$$Q^{a} = \\left\\langle \\frac{1}{N} \\sum_{i \\in N_c} (w_{ia})^2 \\right\\rangle \\tag{11}$$\n\nwhere $\\langle \\cdots \\rangle$ indicates the average over the posterior (3); $a$, $b$ are replica indices, and the subscripts $c$ and $u$ stand for compressed and uncompressed respectively. Notice that we do not impose that $Q^a$, the typical squared norm of the student's components in the compressed subspace, be equal to the corresponding teacher's norm $Q^* = \\sum_{i \\in N_c} (w_i^*)^2 / N$.\n\n3 Order parameters and learning curves\n\nAssuming that the order parameters are invariant under permutation of replicas, we can drop the replica indices in equations (7) to (11). We expect this hypothesis of replica symmetry to be consistent, as it is in other cases of perceptrons learning realizable tasks. The problem is thus reduced to the determination of five order parameters. Their meaning becomes clearer if we consider the following combinations:\n\n$$q_c = \\frac{\\tilde{q}_c}{Q}, \\tag{12}$$\n\n$$q_u = \\frac{\\tilde{q}_u}{1 - Q}, \\tag{13}$$\n\n$$R_c = \\frac{\\tilde{R}_c}{\\sqrt{Q}\\sqrt{Q^*}}, \\tag{14}$$\n\n$$R_u = \\frac{\\tilde{R}_u}{\\sqrt{1 - Q}\\sqrt{1 - Q^*}}, \\tag{15}$$\n\n$$Q = \\left\\langle \\frac{1}{N} \\sum_{i \\in N_c} (w_i)^2 \\right\\rangle. \\tag{16}$$\n\n$q_c$ and $q_u$ are the typical overlaps between the components of two student vectors in the compressed and the uncompressed subspaces respectively. 
Similarly, $R_c$ and $R_u$ are the corresponding overlaps between a typical student and the teacher. In terms of this set of parameters, the typical generalization error is $\\epsilon_g = (1/\\pi) \\arccos R$ with\n\n$$R = \\frac{\\sigma^2 R_c \\sqrt{Q Q^*} + R_u \\sqrt{(1-Q)(1-Q^*)}}{\\sqrt{\\sigma^2 Q + (1-Q)}\\, \\sqrt{\\sigma^2 Q^* + (1-Q^*)}}. \\tag{17}$$\n\nGiven $\\alpha$, the general solution to the extremum conditions depends on the three parameters of the problem, namely $\\sigma$, $Q^*$ and $n_c \\equiv N_c/N$. An interesting case is the one where the teacher's anisotropy is consistent with that of the patterns' distribution, i.e. $Q^* = n_c$. In this case, it is easy to show that $Q = Q^*$, $q_c = R_c$ and $q_u = R_u$. Thus,\n\n$$R = \\frac{n_u R_u + \\sigma^2 n_c R_c}{n_u + \\sigma^2 n_c}, \\tag{18}$$\n\nwhere $n_u \\equiv N_u/N$, and $R_u$ and $R_c$ are given by the following equations:\n\n$$\\frac{R_c}{1-R_c} = \\frac{\\sigma^2}{\\sigma^2 n_c + n_u}\\, \\frac{\\alpha}{\\pi}\\, \\frac{1}{\\sqrt{1-R}} \\int Dt\\, \\frac{\\exp(-R t^2/2)}{H(t\\sqrt{R})}, \\tag{19}$$\n\nFigure 1: Order parameters and generalization error for the case $Q^* = n_c = 0.9$, $\\sigma^2 = 10^{-2}$. The curves for the case of spherically distributed patterns are shown for comparison. 
The inset shows the first step of learning and its plateau (see text).\n\n$$\\frac{R_c}{1-R_c} = \\sigma^2\\, \\frac{R_u}{1-R_u}, \\tag{20}$$\n\nwhere $Dt = dt\\, e^{-t^2/2}/\\sqrt{2\\pi}$ and $H(x) = \\int_x^{\\infty} Dt$. If $\\sigma^2 = 1$, we recover the equations corresponding to Gibbs learning of isotropic pattern distributions [5].\n\nThe order parameters are represented as a function of $\\alpha$ in Figure 1, for a particular choice of $n_c$ and $\\sigma$. $R_u$ grows much faster than $R_c$, meaning that it is easier to learn the components of the uncompressed space. As a result, $R$ (and therefore the generalization error $\\epsilon_g$) presents a crossover between two behaviors. At small $\\alpha$, both $R_u \\ll 1$ and $R_c \\ll 1$, so that $R(\\alpha, \\sigma^2) = R_G(\\alpha (n_u + \\sigma^4 n_c)/(n_u + \\sigma^2 n_c)^2)$, where $R_G$ is the overlap for Gibbs learning with an isotropic ($\\sigma^2 = 1$) distribution [5]. Learning the anisotropic distribution is faster (in $\\alpha$) than learning the isotropic one. If $\\sigma \\ll 1$ the anisotropy is very large and $R$ increases like $R_G$ but with an effective training set size per input dimension $\\sim \\alpha/n_u > \\alpha$. On increasing $\\alpha$, there is an intermediate regime in which $R_u$ increases but $R_c \\ll 1$, so that $R \\simeq R_u n_u/(n_u + \\sigma^2 n_c)$. The corresponding generalization error seems to reach a plateau corresponding to $R_u = 1$ and $R_c = 0$. At $\\alpha \\gg 1$, $R(\\alpha, \\sigma^2) \\simeq R_G(\\alpha)$: the asymptotic behavior is independent of the details of the distribution, as in [7]. The crossover between these two regimes, when $\\sigma^2 \\ll 1$, occurs at $\\alpha_0 \\approx \\sqrt{2 (n_u + \\sigma^2 n_c)/(\\sigma^2 n_c)}$.\n\nThe cases $Q^* = 1$ and $Q^* = 0$ are also of interest. $Q^* = 1$ corresponds to a teacher having all the weight components in the compressed subspace, whereas $Q^* = 0$ 
corresponds to a teacher orthogonal to the compressed subspace, i.e. with all the components in the uncompressed subspace. They correspond respectively to tasks where either the uncompressed or the compressed components are irrelevant for the patterns' classification. In Figure 2 we show all the generalization error curves, including the generalization error $\\epsilon_g^G$ for a uniform distribution [5] for comparison. The behavior of $\\epsilon_g(\\alpha)$ is very sensitive to the value of $Q^*$. If $Q^* = 1$, the teacher is in the compressed subspace where learning is difficult. Consequently, $\\epsilon_g(\\alpha) > \\epsilon_g^G(\\alpha)$ as expected. On the contrary, for $Q^* = 0$, only the components in the uncompressed space are relevant for the classification task. In this subspace learning is easy and $\\epsilon_g(\\alpha) < \\epsilon_g^G(\\alpha)$. At $Q^* \\neq 0, 1$ there is a crossover between these regimes, as already discussed. All the curves merge in the asymptotic regime $\\alpha \\to \\infty$, as may be seen in the inset of Figure 2.\n\nFigure 2: Generalization errors as a function of $\\alpha$ for different teachers ($Q^* = 1$, $Q^* = 0.9$ and $Q^* = 0$), for the case $n_c = 0.9$ and $\\sigma^2 = 10^{-2}$. The curve for spherically distributed patterns [5] is included for comparison. The inset shows the large-$\\alpha$ behaviors.\n\n4 Discussion\n\nWe analyzed the typical learning behavior of a toy perceptron model that helps to clarify some aspects of generalization in high dimensional feature spaces. In particular, it captures an element essential to obtain stepwise learning, which is shown to stem from the compression of high order features. 
The components in the compressed space are more difficult to learn than those not compressed. Thus, if the training set is not large enough, mainly the latter are learnt.\n\nOur results help to understand the importance of the scaling of high order features in SVM kernels. In fact, with SVMs one has to choose a priori the kernel that maps the input space to the feature space. If high order features are conveniently compressed, hierarchical learning occurs. That is, low order features are learnt first; higher order features are only learnt if the training set is large enough. In the cases where the higher order features are irrelevant, it is likely that they will not hinder the learning process. This interesting behavior makes it possible to avoid overfitting. Computer simulations currently in progress, of SVMs generated by quadratic kernels with and without the $1/N$ scaling, show a behavior consistent with the theoretical predictions [2, 3]. These may be understood with the present toy model.\n\nReferences\n\n[1] V. Vapnik (1995) The nature of statistical learning theory. Springer Verlag, New York.\n\n[2] R. Dietrich, M. Opper, and H. Sompolinsky (1999) Statistical Mechanics of Support Vector Networks. Phys. Rev. Lett. 82, 2975-2978.\n\n[3] A. Buhot and M. B. Gordon (1999) Statistical mechanics of support vector machines. ESANN'99-European Symposium on Artificial Neural Networks Proceedings, Michel Verleysen ed., 201-206; A. Buhot and M. B. Gordon (1998) Learning properties of support vector machines. Cond-Mat/9802179.\n\n[4] H. Yoon and J.-H. Oh (1998) Learning of higher order perceptrons with tunable complexities. J. Phys. A: Math. Gen. 31, 7771-7784.\n\n[5] G. Gyorgyi and N. 
Tishby (1990) Statistical Theory of Learning a Rule. In Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, World Scientific), 3-36.\n\n[6] R. Meir (1995) Empirical risk minimization. A case study. Neural Comp. 7, 144-157.\n\n[7] C. Marangi, M. Biehl, S. A. Solla (1995) Supervised Learning from Clustered Examples. Europhys. Lett. 30 (2), 117-122.\n", "award": [], "sourceid": 1769, "authors": [{"given_name": "Sebastian", "family_name": "Risau-Gusman", "institution": null}, {"given_name": "Mirta", "family_name": "Gordon", "institution": null}]}