{"title": "From Regularization Operators to Support Vector Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 349, "abstract": "", "full_text": "From Regularization Operators \n\nto Support Vector Kernels \n\nAlexander J. Smola \n\nGMDFIRST \n\nRudower Chaussee 5 \n12489 Berlin, Germany \n\nsmola@first.gmd.de \n\nBernhard Scholkopf \n\nMax-Planck-Institut fur biologische Kybernetik \n\nSpemannstra.Be 38 \n\n72076 Ttibingen, Germany \n\nbs-@mpik-tueb.mpg.de \n\nAbstract \n\nWe derive the correspondence between regularization operators used in \nRegularization Networks and Hilbert Schmidt Kernels appearing in Sup(cid:173)\nport Vector Machines. More specifica1ly, we prove that the Green's Func(cid:173)\ntions associated with regularization operators are suitable Support Vector \nKernels  with equivalent regularization properties.  As  a by-product we \nshow that a large number of Radial Basis Functions namely  condition(cid:173)\nally positive definite functions may be used as Support Vector kernels. \n\n1  INTRODUCTION \n\nSupport Vector (SV) Machines for pattern recognition, regression estimation and operator \ninversion  exploit the  idea of transforming  into  a  high  dimensional  feature  space  where \nthey perform a linear algorithm. Instead of evaluating this map explicitly, one uses Hilbert \nSchmidt Kernels  k(x, y)  which  correspond to  dot products  of the  mapped  data in  high \ndimensional space, i.e. \n\nk(x, y)  =  (<I>(x)  \u00b7 <I>(y)) \n\n(I) \nwith  <I>  : .!Rn  --*  :F  denoting the map into feature  space.  Mostly,  this  map  and many  of \nits properties are unknown. Even worse, so far no general rule was available. which kernel \nshould be used,  or why  mapping into a very high dimensional space often provides good \nresults,  seemingly defying  the curse of dimensionality.  We  will  show  that each  kernel \nk(x, y)  corresponds  to  a  regularization operator P,  the  link being that k  is  the  Green's \nfunction of P* P (with F* denoting the adjoint operator of F).  For the sake of simplicity \nwe shall only discuss the case of regression -\nour considerations, however, also hold true \nfoi the other cases mentioned above. \n\nWe start by briefly reviewing the concept of SV Machines (section 2) and of Regularization \nNetworks (section  3).  Section 4 contains the main result stating the  equivalence of both \n\n~ \n\n\f344 \n\nA.  J.  Smola and B. Schollwpf \n\nmethods.  In section 5, we show some applications of this finding to  known SV machines. \nSection 6 introduces a new class of possible SV kernels, and,  finally,  section 7 concludes \nthe paper with a discussion. \n\n2  SUPPORT VECTOR MACHINES \n\nThe SV algorithm for regression estimation, as  described in  [Vapnik,  1995]  and  [Vapnik \net al.,  1997], exploit~ the idea of computing a linear function  in  high  dimensional feature \nspace F  (furnished with a dot product) and thereby computing a nonlinear function in the \nspace of the  input data !Rn.  The functions  take  the form  f(x)  =  (w \u00b7 <ll(x))  + b with \nell  : !Rn  -+ :F and w E F. \nIn order to infer f  from a training set {(xi, Yi)  I i  =  1, ... , f.,  Xi  E !Rn, Yi  E IR},  one tries \nto  minimize the empirical risk functional  Remp[f]  together with a complexity term  l!wll 2 , \nthereby enforcingflatness in feature space, i.e.  to minimize \n\nRreg[/] =  Remp[!] + Allwll 2  =f. 
:L;c(f(xi),yi) + Allwll 2 \n\n1 \n\nl \n\n(2) \n\nwith  c(f(xi),yi)  being  the cost function  determining  how  deviations  of f(xi)  from  the \ntarget values Yi  should be penalized, and  A being a regularization constant.  As  shown in \n[Vapnik,  1995] for the case of \u20ac-insensitive  cost functions, \n\ni=l \n\nc(f(x) \n\n) =  {  lf(x)- yl- \u20ac \n\n' Y \n\n0 \n\nfor  lf(_x)- Yl  ;:::  e \notherwtse \n\n' \n\n(3) \n\n(2)  can  be minimized  by  solving  a quadratic programming problem formulated in  terms \nof dot products in  :F.  It turns  out that the solution can  be expressed in  terms  of Support \nVectors, w =  :Ef=I Cti<ll(xi), and therefore \n\nl \n\nf(x) = L ai(<ll(xi) \u00b7 <ll(x)) + b = L aik(xi, x) + b, \n\nl \n\n(4) \n\ni=l \n\ni=l \n\nwhere k(xi, x)  is  a  kernel  function  computing a dot product in  feature  space  (a concept \nintroduced  by  Aizerrnan  et  al.  [ 1964]).  The  coefficients  ai  can  be  found  by  solving  a \nquadratic programming problem (with Kii  := k(xi, Xj) and  ai =  f3i  - {3i): \n\nminimize  !  L  (f3i- f3i)(f3j- f3J)Kij- \"\u00a3, (!3i- f3i)Yi- (f3i  + f3i)e \n\nl \n\nl \n\n-\n\n(5) \n\nsubject to \n\ni,j=l \nf. \n\"\u00a3,  f3i  - !3i  =  0,  f3i, !3i  E [0,  ft] \u00b7 \ni=l \n\ni=l \n\nNote that (3) is  not the only possible choice of cost functions resulting in a quadratic pro(cid:173)\ngramming problem (in fact quadratic parts and infinities are admissible, too).  For a detailed \ndiscussion  see  [Smola and Scholkopf,  1998].  Also  note  that any  continuous  symmetric \nfunction  k(x, y)  E  L2  \u00ae  L2  may  be  used as  an  admissible Hilbert-Schmidt kernel  if it \nsatisfies Mercer's condition \n\n/ /  k(x,y)g(x)g(y)dxdy;::: 0  for all  g E L2(IRn)~ \n\n(6) \n\n3  REGULARIZATION NETWORKS \n\nHere again  we  start with  minimizing the empirical  risk  functional  Remp[!]  plus a  regu(cid:173)\nlarization  term  liP /11 2  defined by  a regularization operator Pin the sense of Arsenin and \n\n\fFrom Regularization Operators to Support Vector Kernels \n\nTikhonov [1977].  Similar to (2), we minimize \n\nRreg[f]  =  Remp + .\\IIPJII  =  f  \u00a3_-c(f(xi),yi) + -XIIPJII  \u00b7 \n2 \n\n~ \n\n2  1\"' \n\nl \n\n~ \n\ni=1 \n\n345 \n\n(7) \n\n(8) \n\nUsing an expansion off in terms of some symmetric function k(xi, Xj)  (note here, that k \nneed not fulfil Mercer's condition), \n\nf(x)  = 2: aik(xi, x) + b, \n\nand the cost function defined in (3 ), this leads to a quadratic programming problem similar \nto  the one for SVs:  by computing Wolfe's dual (for details of the calculations see [Smola \nand SchOlkopf, 1998]), and using \n\n(9) \n((f  \u00b7  g)  denotes  the  dot  product  of  the  functions  f  and  g  in  Hilbert  Space,  t.e. \nI !(x)g(x)dx), we get a= n- 1 K(i]- /3*), with f3i, f3i  being the solution of \n\nDij := ((Fk)(xi, .)  \u00b7 (Fk)(xj, .)) \n\nl \n\n!  2:  (f3i- f3i)(f3j- {3j)(KD- 1 K)ij- L (f3i- f3i)Yi- (f3i + f3i)E \n\nl \n\ni=l \n\nmm1m1Ze \nsubject to  L f3i  - f3i  =  0,  f3i, f3i  E [0, A] \n\ni,j=1 \nf_ \n\ni=1 \n\n(I 0) \n\nUnfortunately this setting of the problem does not preserve sparsity in  terms of the coef(cid:173)\nficients,  as a potentially sparse decomposition in terms of f3i  and f3i  is  spoiled by n-1 K, \nwhich in general is not diagonal (the expansion (4) on the other hand does typically have \nmany vanishing coefficients). 
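To make the dual (5) concrete, here is a minimal numerical sketch (an editorial addition, not part of the original paper): it builds a Gaussian RBF kernel matrix for toy 1-D data, solves (5) with a generic SLSQP solver, and evaluates the expansion (4). The toy data, the solver choice, the variable names (lam, eps), and the box bound written as in the reconstruction of (5) above are all assumptions of this sketch.

import numpy as np
from scipy.optimize import minimize

# Toy 1-D regression data (our own example).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30)[:, None]
y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(30)

def rbf(a, b, sigma=1.0):
    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

ell, lam, eps = len(y), 1e-2, 0.05
K = rbf(X, X)
C = 1.0 / (2 * lam * ell)          # box constraint of (5) as reconstructed above

def dual(z):
    # Objective of (5); z stacks beta and beta*.
    b, bs = z[:ell], z[ell:]
    d = b - bs
    return 0.5 * d @ K @ d - d @ y + eps * (b + bs).sum()

cons = [{'type': 'eq', 'fun': lambda z: (z[:ell] - z[ell:]).sum()}]
res = minimize(dual, np.zeros(2 * ell), bounds=[(0, C)] * (2 * ell),
               constraints=cons, method='SLSQP')
alpha = res.x[:ell] - res.x[ell:]  # alpha_i = beta_i - beta_i*

# Offset b from a point with 0 < beta_i < C (KKT conditions); falls back to
# index 0 if no such margin support vector exists.  Then evaluate (4).
i = int(np.argmax((res.x[:ell] > 1e-6) & (res.x[:ell] < C - 1e-6)))
b_off = y[i] - eps - K[i] @ alpha
f = lambda Xnew: rbf(Xnew, X) @ alpha + b_off
print('support vectors:', int((np.abs(alpha) > 1e-6).sum()), 'of', ell)

With a general-purpose solver this is only practical for small \ell; dedicated SV optimizers exploit the structure of (5), but the quantities involved (K, \beta, \beta^*, \alpha, b) are the same.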
4  THE EQUIVALENCE OF BOTH METHODS

Comparing (5) with (10) leads to the question whether, and under which conditions, the two methods might be equivalent, and therefore also under which conditions regularization networks might lead to sparse decompositions (i.e. only a few of the expansion coefficients in f would differ from zero). A sufficient condition is D = K (thus K D^{-1} K = K), i.e.

    k(x_i, x_j) = ((Pk)(x_i, .) \cdot (Pk)(x_j, .)).    (11)

Our goal now is twofold:

- Given a regularization operator P, find a kernel k such that a SV machine using k will not only enforce flatness in feature space, but also correspond to minimizing a regularized risk functional with P as regularization operator.
- Given a Hilbert Schmidt kernel k, find a regularization operator P such that a SV machine using this kernel can be viewed as a Regularization Network using P.

These two problems can be solved by employing the concept of Green's functions as described in [Girosi et al., 1993]. These functions were introduced in the context of solving differential equations. For our purpose, it is sufficient to know that the Green's functions G_{x_i}(x) of P*P satisfy

    (P*P G_{x_i})(x) = \delta_{x_i}(x).    (12)

Here, \delta_{x_i}(x) is the \delta-distribution (not to be confused with the Kronecker symbol \delta_{ij}), which has the property that (f \cdot \delta_{x_i}) = f(x_i). Moreover, we require for all x_i that the projection of G_{x_i}(x) onto the null space of P*P be zero. The relationship between kernels and regularization operators is formalized in the following proposition.

Proposition 1
Let P be a regularization operator, and let G be the Green's function of P*P. Then G is a Hilbert Schmidt kernel such that D = K. SV machines using G minimize the risk functional (7) with P as regularization operator.

Proof: Substituting (12) into G_{x_j}(x_i) = (G_{x_j}(.) \cdot \delta_{x_i}(.)) yields

    G_{x_j}(x_i) = ((PG_{x_i})(.) \cdot (PG_{x_j})(.)) = G_{x_i}(x_j),    (13)

hence G(x_i, x_j) := G_{x_i}(x_j) is symmetric and satisfies (11). Thus the SV optimization problem (5) is equivalent to the regularization network counterpart (10). Furthermore, G is an admissible positive kernel, as it can be written as a dot product in Hilbert space, namely

    G(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))  with  \Phi: x_i \mapsto (PG_{x_i})(.).    (14)

In the following we will exploit this relationship in both ways: to compute Green's functions for a given regularization operator P, and to infer the regularization operator from a given kernel k.

5  TRANSLATION INVARIANT KERNELS

Let us now more specifically consider regularization operators P that may be written as multiplications in Fourier space [Girosi et al., 1993],

    (Pf \cdot Pg) = \frac{1}{(2\pi)^{n/2}} \int_{\Omega} \frac{\tilde{f}(\omega) \overline{\tilde{g}(\omega)}}{P(\omega)} \, d\omega    (15)

with \tilde{f}(\omega) denoting the Fourier transform of f(x), and P(\omega) = P(-\omega) real valued, nonnegative and converging uniformly to 0 for |\omega| \to \infty, and \Omega = supp[P(\omega)]. Small values of P(\omega) correspond to a strong attenuation of the corresponding frequencies.

For regularization operators defined in Fourier space by (15) it can be shown, by exploiting P(\omega) = P(-\omega) = \overline{P(\omega)}, that

    G(x_i, x) = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{i\omega(x_i - x)} P(\omega) \, d\omega    (16)

is a corresponding Green's function satisfying translational invariance, i.e. G(x_i, x_j) = G(x_i - x_j), and \tilde{G}(\omega) = P(\omega). For the proof, one only has to show that G satisfies (11). This provides us with an efficient tool for analyzing SV kernels and the types of capacity control they exhibit.
Example 1 (B_q-splines)
Vapnik et al. [1997] propose to use B_q-splines as building blocks for kernels, i.e.

    k(x) = \prod_{i=1}^{n} B_q(x_i)    (17)

with x \in \mathbb{R}^n. For the sake of simplicity, we consider the case n = 1. Recalling the definition

    B_q = \otimes^{q+1} 1_{[-0.5, 0.5]}    (18)

(\otimes denotes the convolution and 1_X the indicator function on X), we can utilize the above result and the Fourier-Plancherel identity to construct the Fourier representation of the corresponding regularization operator. Up to a multiplicative constant, it equals

    P(\omega) = \tilde{k}(\omega) = sinc^{q+1}(\omega/2).    (19)

This shows that only B-splines of odd order are admissible, as the even ones have negative parts in the Fourier spectrum (which would result in an amplification of the corresponding frequency components). The zeros in \tilde{k} stem from the fact that B_q has compact support [-(q+1)/2, (q+1)/2]. By using this kernel we trade reduced computational complexity in calculating f (we only have to take points with \|x_i - x\| \le c from some limited neighborhood determined by c into account) for a possibly worse performance of the regularization operator, as it completely removes frequencies \omega_p with \tilde{k}(\omega_p) = 0.

Example 2 (Dirichlet kernels)
In [Vapnik et al., 1997], a class of kernels generating Fourier expansions was introduced,

    k(x) = \frac{\sin((2N+1)x/2)}{\sin(x/2)}.    (20)

(As in Example 1 we consider x \in \mathbb{R}^1 to avoid tedious notation.) By construction, this kernel corresponds, up to a multiplicative constant, to P(\omega) = \sum_{i=-N}^{N} \delta(\omega - i). A regularization operator with these properties, however, may not be desirable, as it only damps a finite number of frequencies and leaves all other frequencies unchanged, which can lead to overfitting (Fig. 1).

Figure 1: Left: Interpolation with a Dirichlet kernel of order N = 10. One can clearly observe the overfitting (dashed line: interpolation, solid line: original data points, connected by lines). Right: Interpolation of the same data with a Gaussian kernel of width \sigma^2 = 1.

Example 3 (Gaussian kernels)
Following the exposition of Yuille and Grzywacz [1988] as described in [Girosi et al., 1993], one can see that for

    \|Pf\|^2 = \int dx \sum_m \frac{\sigma^{2m}}{m! \, 2^m} (\hat{O}^m f(x))^2    (21)

with \hat{O}^{2m} = \Delta^m and \hat{O}^{2m+1} = \nabla \Delta^m, \Delta being the Laplacian and \nabla the gradient operator, we get Gaussian kernels

    k(x) = \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right).    (22)

Moreover, we can provide an equivalent representation of P in terms of its Fourier properties, i.e. P(\omega) = \exp(-\sigma^2 \|\omega\|^2 / 2) up to a multiplicative constant. Training a SV machine with Gaussian RBF kernels [Schölkopf et al., 1997] corresponds to minimizing the specific cost function with a regularization operator of type (21). This also explains the good performance of SV machines in this case, as it is by no means obvious that choosing a flat function in high dimensional space will correspond to a simple function in low dimensional space, as shown in Example 2. Gaussian kernels tend to yield good performance under general smoothness assumptions and should be considered especially if no additional knowledge of the data is available.
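The correspondence \tilde{G}(\omega) = P(\omega) can also be checked numerically (our own sketch, with arbitrarily chosen grids): sampling a kernel on a 1-D grid and forming a discrete cosine-type sum gives its spectrum, which stays nonnegative for the Gaussian of Example 3 but takes negative values for an even-order B-spline of Example 1, matching the admissibility argument above.

import numpy as np

tau = np.linspace(-8, 8, 4001)               # evaluation grid for k(tau), 1-D
d_tau = tau[1] - tau[0]
omega = np.linspace(0, 20, 400)

def spectrum(k_vals):
    # Real spectrum of an even kernel: k_hat(omega) = integral k(tau) cos(omega tau) d tau.
    return np.array([(k_vals * np.cos(w * tau)).sum() * d_tau for w in omega])

# Gaussian kernel of Example 3 (sigma = 1).
k_gauss = np.exp(-tau ** 2 / 2)

# B_2 spline of Example 1: indicator on [-0.5, 0.5] convolved with itself twice (q = 2, even).
box = ((tau >= -0.5) & (tau <= 0.5)).astype(float)
k_b2 = np.convolve(np.convolve(box, box, mode='same'), box, mode='same') * d_tau ** 2

print('min spectrum, Gaussian  :', spectrum(k_gauss).min())  # ~ 0 (nonnegative up to rounding)
print('min spectrum, B_2 spline:', spectrum(k_b2).min())     # clearly negative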
6  A NEW CLASS OF SUPPORT VECTOR KERNELS

We will follow the lines of Madych and Nelson [1990] as pointed out by Girosi et al. [1993]. Our main statement is that conditionally positive definite (c.p.d.) functions generate admissible SV kernels. This is very useful, as the property of being c.p.d. is often easier to verify than Mercer's condition, especially when combined with the results of Schoenberg and Micchelli on the connection between c.p.d. and completely monotonic functions [Schoenberg, 1938, Micchelli, 1986]. Moreover, c.p.d. functions lead to a class of SV kernels that do not necessarily satisfy Mercer's condition.

Definition 1 (Conditionally positive definite functions)
A continuous function h, defined on [0, \infty), is said to be conditionally positive definite (c.p.d.) of order m on \mathbb{R}^n if for any distinct points x_1, \ldots, x_\ell \in \mathbb{R}^n and scalars c_1, \ldots, c_\ell the quadratic form \sum_{i,j=1}^{\ell} c_i c_j h(\|x_i - x_j\|) is nonnegative, provided that \sum_{i=1}^{\ell} c_i p(x_i) = 0 for all polynomials p on \mathbb{R}^n of degree lower than m.

Proposition 2 (c.p.d. functions and admissible kernels)
Denote by \Pi_m the space of polynomials of degree lower than m on \mathbb{R}^n. Every c.p.d. function h of order m generates an admissible kernel for SV expansions on the space of functions f orthogonal to \Pi_m by setting k(x_i, x_j) := h(\|x_i - x_j\|^2).

Proof: In [Dyn, 1991] and [Madych and Nelson, 1990] it was shown that c.p.d. functions h generate semi-norms \|.\|_h by

    \|f\|_h^2 = \int \frac{|\tilde{f}(\omega)|^2}{\tilde{h}(\omega)} \, d\omega,    (23)

provided that the projection of f onto the space of polynomials of degree lower than m is zero. For these functions, however, this also defines a dot product in some feature space. Hence they can be used as SV kernels.

Only c.p.d. functions of order m up to 2 are of practical interest for SV methods (for details see [Smola and Schölkopf, 1998]). Consequently, we may use kernels like the ones proposed in [Girosi et al., 1993] as SV kernels:

    k(x, y) = e^{-\beta \|x - y\|^2}              Gaussian (m = 0)            (24)
    k(x, y) = -\sqrt{\|x - y\|^2 + c^2}           multiquadric (m = 1)        (25)
    k(x, y) = 1 / \sqrt{\|x - y\|^2 + c^2}        inverse multiquadric (m = 0)(26)
    k(x, y) = \|x - y\|^2 \ln \|x - y\|           thin plate splines (m = 2)  (27)
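A small numerical illustration of Definition 1 and Proposition 2 (an editorial sketch with arbitrarily chosen points and coefficients): for the multiquadric entry in the list above, which is c.p.d. of order m = 1, the quadratic form is nonnegative whenever the coefficients sum to zero, i.e. are orthogonal to the constant polynomials.

import numpy as np

rng = np.random.default_rng(1)
n, ell, c0 = 3, 40, 0.5                      # dimension, number of points, multiquadric offset

def h_multiquadric(r):
    # Multiquadric radial function h(r) = -sqrt(r^2 + c0^2), c.p.d. of order m = 1.
    return -np.sqrt(r ** 2 + c0 ** 2)

X = rng.standard_normal((ell, n))
R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances ||x_i - x_j||
H = h_multiquadric(R)

worst = np.inf
for _ in range(1000):
    coef = rng.standard_normal(ell)
    coef -= coef.mean()                      # enforce sum_i coef_i = 0 (order m = 1 condition)
    worst = min(worst, coef @ H @ coef)
print('smallest quadratic form over 1000 trials:', worst)    # stays >= 0 up to rounding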
7  DISCUSSION

We have pointed out a connection between SV kernels and regularization operators. As one of the possible implications of this result, we hope that it will deepen our understanding of SV machines and of why they have been found to exhibit high generalization ability. In Sec. 5, we have given examples where only the translation into the regularization framework provided insight into why certain kernels are preferable to others. Capacity control is one of the strengths of SV machines; however, this does not mean that the structure of the learning machine, i.e. the choice of a suitable kernel for a given task, should be disregarded. On the contrary, the rather general class of admissible SV kernels should be seen as another strength, provided that we have a means of choosing the right kernel. The newly established link to regularization theory can thus be seen as a tool for constructing the structure consisting of sets of functions in which the SV machine (approximately) performs structural risk minimization (e.g. [Vapnik, 1995]). For a treatment of SV kernels in a Reproducing Kernel Hilbert Space context see [Girosi, 1997].

Finally, one should leverage the theoretical results achieved for regularization operators for a better understanding of SVs (and vice versa). By doing so, this theory might serve as a bridge connecting two (so far) separate threads of machine learning. A trivial example of such a connection would be a Bayesian interpretation of SV machines: in this case the choice of a particular kernel can be regarded as a prior on the hypothesis space with P[f] \propto \exp(-\lambda \|Pf\|^2). A more subtle reasoning will probably be necessary for understanding the capacity bounds [Vapnik, 1995] from a Regularization Network point of view. Future work will include an analysis of the family of polynomial kernels, which perform very well in pattern classification [Schölkopf et al., 1995].

Acknowledgements

AS is supported by a grant of the DFG (# Ja 379/51). BS is supported by the Studienstiftung des deutschen Volkes. The authors thank Chris Burges, Federico Girosi, Leo van Hemmen, Klaus-Robert Müller and Vladimir Vapnik for helpful discussions and comments.

References

M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

N. Dyn. Interpolation and approximation by radial and related functions. In C. K. Chui, L. L. Schumaker, and D. J. Ward, editors, Approximation Theory VI, pages 211-234. Academic Press, New York, 1991.

F. Girosi. An equivalence between sparse approximation and support vector machines. A.I. Memo No. 1606, MIT, 1997.

F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, MIT, 1993.

W. R. Madych and S. A. Nelson. Multivariate interpolation and conditionally positive definite functions. II. Mathematics of Computation, 54(189):211-230, 1990.

C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.

I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, 39:811-841, 1938.

B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings KDD 1, Menlo Park, 1995. AAAI Press.

B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45:2758-2765, 1997.

A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 1998. See also GMD Technical Report 1997-1064, URL: http://svm.first.gmd.de/papers.html.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In NIPS 9, San Mateo, CA, 1997.

A. Yuille and N. Grzywacz. The motion coherence theory. In Proceedings of the International Conference on Computer Vision, pages 344-354, Washington, D.C., 1988. IEEE Computer Society Press.
", "award": [], "sourceid": 1372, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}