{"title": "Shrinking the Tube: A New Support Vector Regression Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 330, "page_last": 336, "abstract": null, "full_text": "Shrinking the Thbe: \n\nA New Support Vector Regression Algorithm \n\nBernhard SchOikopr\u00a7,*, Peter Bartlett*, Alex Smola\u00a7,r, Robert Williamson* \n\n\u00a7 GMD FIRST, Rudower Chaussee 5,  12489 Berlin, Germany \n\n* FEITIRSISE, Australian National University, Canberra 0200, Australia \n\nbs, smola@first.gmd.de, Peter.Bartlett, Bob.Williamson@anu.edu.au \n\nAbstract \n\nA new algorithm for Support Vector regression is  described.  For a priori \nchosen 1/,  it automatically adjusts a flexible tube of minimal radius to the \ndata such that  at most a fraction  1/  of the data points lie outside.  More(cid:173)\nover,  it  is  shown  how  to  use  parametric  tube  shapes  with  non-constant \nradius. The algorithm is analysed theoretically and experimentally. \n\nINTRODUCTION \n\n1 \nSupport Vector (SV) machines comprise a new class of learning algorithms, motivated by \nresults of statistical learning theory (Vapnik, 1995). Originally developed for pattern recog(cid:173)\nnition, they represent the decision boundary in  terms of a typically small subset (SchOikopf \net aI.,  1995) of all  training examples, called the Support Vectors.  In order for this property \nto carryover to the case of SV Regression, Vapnik devised the so-called E-insensitive loss \nf(x)1  - E},  which  does  not penalize errors below \nfunction  Iy  -\nsome E  > 0, chosen a priori.  His algorithm, which we will henceforth call E-SVR, seeks to \nestimate functions \n\nf(x)lc  =  max{O,  Iy  -\n\nf (x) = (w . x) + b,  w, x  E  ~N , b E  ~, \n\nbased on data \n\n(1) \n\n(2) \n\n(xl,yd, ... ,(xe,Ye)  E  ~N x~, \n\nby minimizing the regularized risk functional \n\nIIwll2/2 + C . R~mp, \n\n(3) \nwhere C  is  a  constant determining the trade-off between  minimizing training  errors  and \nminimizing the model complexity term IIwll2, and R~mp := t 2::;=1  IYi  -\nThe parameter E  can be useful if the desired accuracy of the approximation can be specified \nbeforehand. In some cases, however, we just want the estimate to be as accurate as possible, \nwithout having to commit ourselves to a certain level of accuracy. \n\nf(Xi)lc' \n\nWe  present a modification of the E-SVR algorithm which automatically minimizes E,  thus \nadjusting the accuracy level to the data at hand. \n\n\fShrinking the Tube:  A New Support  Vector Regression Algorithm \n\n331 \n\n2  ZJ-SV  REGRESSION AND c-SV REGRESSION \nTo estimate functions (1) from  empirical data (2)  we proceed as  follows  (SchOlkopf et aI., \n1998a).  At  each  point  Xi,  we  allow  an  error  of E.  Everything  above  E  is  captured  in \nslack  variables  d*)  \u00ab(*)  being  a  shorthand  implying both  the  variables  with  and  without \nasterisks),  which  are penalized  in  the objective function  via a  regularization constant C, \nchosen a priori (Vapnik,  1995).  The tube size E  is traded off against model complexity and \nslack variables via a constant v  > 0: \n\nminimize \n\nsubject to \n\n-\n\n-r(w, e(*) ,E)  =  Ilw112/2 + C\u00b7 (VE  + \u00a3 :L(Ei + En) \n\n1  e \n\n((w,xi)+b)-Yi  <  E+Ei \nYi-((W ' Xi)+b)  <  E+Ei \n\ni-I \n-\n\n(4) \n\n(5) \n(6) \n\nd*)  ~  0,  E  >  0. \n\n(7) \nHere and below,  it  is  understood that i  = 1, ... , i, and that bold face greek letters denote \ni-dimensional vectors of the corresponding variables.  
Introducing a Lagrangian with multipliers $\alpha_i^{(*)}, \eta_i^{(*)}, \beta \ge 0$, we obtain the Wolfe dual problem. Moreover, as in Boser et al. (1992), we substitute a kernel k for the dot product, corresponding to a dot product in some feature space related to input space via a nonlinear map Φ,

$k(x, y) = (\Phi(x) \cdot \Phi(y)).$    (8)

This leads to the ν-SVR Optimization Problem: for ν ≥ 0, C > 0,

maximize    $W(\alpha^{(*)}) = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j)$    (9)

subject to    $\sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0$    (10)
              $\alpha_i^{(*)} \in [0, C/\ell]$    (11)
              $\sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) \le C \cdot \nu.$    (12)

The regression estimate can be shown to take the form

$f(x) = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) k(x_i, x) + b,$    (13)

where b (and ε) can be computed by taking into account that (5) and (6) (substitution of $\sum_j (\alpha_j^* - \alpha_j) k(x_j, x)$ for $(w \cdot x)$ is understood) become equalities with $\xi_i^{(*)} = 0$ for points with $0 < \alpha_i^{(*)} < C/\ell$, respectively, due to the Karush-Kuhn-Tucker conditions (cf. Vapnik, 1995). The latter moreover imply that in the kernel expansion (13), only those $\alpha_i^{(*)}$ will be nonzero that correspond to a constraint (5)/(6) which is precisely met. The respective patterns $x_i$ are referred to as Support Vectors.

Before we give theoretical results explaining the significance of the parameter ν, the following observation concerning ε is helpful. If ν > 1, then ε = 0, since it does not pay to increase ε (cf. (4)). If ν ≤ 1, it can still happen that ε = 0, e.g. if the data are noise-free and can perfectly be interpolated with a low capacity model. The case ε = 0, however, is not what we are interested in; it corresponds to plain L1 loss regression. Below, we will use the term errors to refer to training points lying outside of the tube, and the term fraction of errors/SVs to denote the relative numbers of errors/SVs, i.e. divided by ℓ.

Proposition 1  Assume ε > 0. The following statements hold:

(i) ν is an upper bound on the fraction of errors.

(ii) ν is a lower bound on the fraction of SVs.

(iii) Suppose the data (2) were generated iid from a distribution P(x, y) = P(x)P(y|x) with P(y|x) continuous. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors.

The first two statements of this proposition can be proven from the structure of the dual optimization problem, with (12) playing a crucial role. Presently, we instead give a graphical proof based on the primal problem (Fig. 1).

To understand the third statement, note that all errors are also SVs, but there can be SVs which are not errors: namely, if they lie exactly at the edge of the tube. Asymptotically, however, these SVs form a negligible fraction of the whole SV set, and the set of errors and the one of SVs essentially coincide. This is due to the fact that for a class of functions with well-behaved capacity (such as SV regression functions), and for a distribution satisfying the above continuity condition, the number of points that the tube edges f ± ε can pass through cannot asymptotically increase linearly with the sample size. Interestingly, the proof (Schölkopf et al., 1998a) uses a uniform convergence argument similar in spirit to those used in statistical learning theory.
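Purely as a didactic illustration (not the implementation used in this paper, which relies on the LOQO optimizer described in Sec. 4), the dual (9)-(12) can be solved for a small toy problem with a general-purpose SciPy solver, after which b and ε follow from the Karush-Kuhn-Tucker conditions as described after (13), and the bounds of Proposition 1 can be inspected numerically. The sketch assumes that marginal SVs with $0 < \alpha_i^{(*)} < C/\ell$ exist; all numerical settings are arbitrary.

    import numpy as np
    from scipy.optimize import minimize

    def rbf(A, B, gamma=1.0):
        # k(x, x') = exp(-gamma * ||x - x'||^2)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def nu_svr_dual(X, y, C=100.0, nu=0.2, gamma=1.0):
        ell = len(y)
        K = rbf(X, X, gamma)

        def neg_W(z):                      # negative of (9); z = [alpha, alpha_star]
            d = z[ell:] - z[:ell]
            return 0.5 * d @ K @ d - d @ y

        constraints = [
            {'type': 'eq',   'fun': lambda z: np.sum(z[:ell] - z[ell:])},   # (10)
            {'type': 'ineq', 'fun': lambda z: C * nu - np.sum(z)},          # (12)
        ]
        bounds = [(0.0, C / ell)] * (2 * ell)                               # (11)
        res = minimize(neg_W, np.zeros(2 * ell), method='SLSQP',
                       bounds=bounds, constraints=constraints)
        alpha, alpha_star = res.x[:ell], res.x[ell:]

        # b and eps from one marginal SV on each tube edge (0 < alpha < C/ell),
        # cf. the discussion following (13); assumes such points exist
        g = K @ (alpha_star - alpha)
        tol = 1e-6 * C / ell
        i = np.flatnonzero((alpha > tol) & (alpha < C / ell - tol))[0]           # upper edge
        j = np.flatnonzero((alpha_star > tol) & (alpha_star < C / ell - tol))[0] # lower edge
        eps = 0.5 * ((y[j] - g[j]) - (y[i] - g[i]))
        b = 0.5 * ((y[j] - g[j]) + (y[i] - g[i]))
        return alpha, alpha_star, b, eps, g

    # tiny toy check of Proposition 1
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(30, 1))
    y = np.sinc(X.ravel()) + 0.2 * rng.standard_normal(30)
    nu, C = 0.2, 100.0
    alpha, alpha_star, b, eps, g = nu_svr_dual(X, y, C=C, nu=nu)
    resid = np.abs(g + b - y)
    print('eps =', eps)
    print('fraction of errors:', np.mean(resid > eps + 1e-6), ' (Prop. 1: <= nu =', nu, ')')
    print('fraction of SVs   :', np.mean(alpha + alpha_star > 1e-6), ' (Prop. 1: >= nu =', nu, ')')

For anything beyond toy sizes a dedicated quadratic programming solver would be used instead of SLSQP.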
Due to this proposition, 0 ≤ ν ≤ 1 can be used to control the number of errors (note that for ν ≥ 1, (11) implies (12), since $\alpha_i \cdot \alpha_i^* = 0$ for all i (Vapnik, 1995)). Moreover, since the constraint (10) implies that (12) is equivalent to $\sum_i \alpha_i^{(*)} \le C\nu/2$, we conclude that Proposition 1 actually holds for the upper and the lower edge of the tube separately, with ν/2 each. As an aside, note that by the same argument, the numbers of SVs at the two edges of the standard ε-SVR tube asymptotically agree.

Moreover, note that this bears on the robustness of ν-SVR. At first glance, SVR seems all but robust: using the ε-insensitive loss function, only the patterns outside of the ε-tube contribute to the empirical risk term, whereas the patterns closest to the estimated regression have zero loss. This, however, does not mean that it is only the outliers that determine the regression. In fact, the contrary is the case: one can show that local movements of target values $y_i$ of points $x_i$ outside the tube do not influence the regression (Schölkopf et al., 1998c). Hence, ν-SVR is a generalization of an estimator for the mean of a random variable which throws away the largest and smallest examples (a fraction of at most ν/2 of either category), and estimates the mean by taking the average of the two extremal ones of the remaining examples. This is close in spirit to robust estimators like the trimmed mean.

Figure 1: Graphical depiction of the ν-trick. Imagine increasing ε, starting from 0. The first term in $\nu\varepsilon + \frac{1}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ (cf. (4)) will increase proportionally to ν, while the second term will decrease proportionally to the fraction of points outside of the tube. Hence, ε will grow as long as the latter fraction is larger than ν. At the optimum, it therefore must be ≤ ν (Proposition 1, (i)). Next, imagine decreasing ε, starting from some large value. Again, the change in the first term is proportional to ν, but this time, the change in the second term is proportional to the fraction of SVs (even points on the edge of the tube will contribute). Hence, ε will shrink as long as the fraction of SVs is smaller than ν, eventually leading to Proposition 1, (ii).

Let us briefly discuss how the new algorithm relates to ε-SVR (Vapnik, 1995). By rewriting (3) as a constrained optimization problem, and deriving a dual much like we did for ν-SVR, one arrives at the following quadratic program: maximize

$W(\alpha, \alpha^*) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j)$    (14)

subject to (10) and (11). Compared to (9), we have an additional term $-\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i)$, which makes it plausible that the constraint (12) is not needed.

In the following sense, ν-SVR includes ε-SVR. Note that in the general case, using kernels, w is a vector in feature space.

Proposition 2  If ν-SVR leads to the solution $\bar\varepsilon, \bar w, \bar b$, then ε-SVR with ε set a priori to $\bar\varepsilon$, and the same value of C, has the solution $\bar w, \bar b$.

Proof  If we minimize (4), then fix ε and minimize only over the remaining variables, the solution does not change. ∎
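A rough empirical illustration of Proposition 2 (a sketch under assumptions, not an experiment from the paper) can be obtained with scikit-learn, whose NuSVR and SVR classes implement ν-SVR and ε-SVR via libsvm: fit NuSVR, estimate the automatically determined tube width from the residuals of the non-support-vectors (which lie strictly inside the tube, so their largest residual is a lower-bound proxy for ε), and refit SVR with that ε and the same C; the two predictors should then approximately coincide, exactly so if the proxy equals the automatically found width.

    import numpy as np
    from sklearn.svm import NuSVR, SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sinc(X.ravel()) + 0.2 * rng.standard_normal(50)

    C = 10.0
    nu_model = NuSVR(nu=0.2, C=C, kernel='rbf', gamma=1.0).fit(X, y)

    # non-SVs lie strictly inside the tube: their largest residual approximates eps
    resid = np.abs(nu_model.predict(X) - y)
    non_sv = np.setdiff1d(np.arange(len(y)), nu_model.support_)
    eps_hat = resid[non_sv].max()

    eps_model = SVR(epsilon=eps_hat, C=C, kernel='rbf', gamma=1.0).fit(X, y)
    gap = np.max(np.abs(nu_model.predict(X) - eps_model.predict(X)))
    print('estimated eps:', eps_hat, ' max prediction gap:', gap)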
3 PARAMETRIC INSENSITIVITY MODELS

We generalized ε-SVR by treating the tube not as given a priori but instead estimating it as a model parameter. What we have so far retained is the assumption that the ε-insensitive zone has a tube (or slab) shape. We now go one step further and use parametric models of arbitrary shape. Let $\{\zeta_q^{(*)}\}$ (here and below, q = 1, ..., p is understood) be a set of 2p positive functions on $\mathbb{R}^N$. Consider the following quadratic program: for given $\nu_1^{(*)}, \ldots, \nu_p^{(*)} \ge 0$,

minimize    $\tau(w, \xi^{(*)}, \varepsilon^{(*)}) = \|w\|^2/2 + C \cdot \Big(\sum_{q=1}^{p} (\nu_q \varepsilon_q + \nu_q^* \varepsilon_q^*) + \frac{1}{\ell} \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)\Big)$    (15)

subject to    $((w \cdot x_i) + b) - y_i \le \sum_q \varepsilon_q \zeta_q(x_i) + \xi_i$    (16)
              $y_i - ((w \cdot x_i) + b) \le \sum_q \varepsilon_q^* \zeta_q^*(x_i) + \xi_i^*$    (17)
              $\xi_i^{(*)} \ge 0, \quad \varepsilon_q^{(*)} \ge 0.$    (18)

A calculation analogous to Sec. 2 shows that the Wolfe dual consists of maximizing (9) subject to (10), (11), and, instead of (12), the modified constraints $\sum_{i=1}^{\ell} \alpha_i^{(*)} \zeta_q^{(*)}(x_i) \le C \cdot \nu_q^{(*)}$. In the experiments in Sec. 4, we use a simplified version of this optimization problem, where we drop the term $\nu_q^* \varepsilon_q^*$ from the objective function (15), and use $\varepsilon_q$ and $\zeta_q$ in (17). By this, we render the problem symmetric with respect to the two edges of the tube. In addition, we use p = 1. This leads to the same Wolfe dual, except for the last constraint, which becomes (cf. (12))

$\sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i)\, \zeta(x_i) \le C \cdot \nu.$    (19)

The advantage of this setting is that since the same ν is used for both sides of the tube, the computation of ε, b is straightforward: for instance, by solving a linear system, using two conditions as those described following (13). Otherwise, general statements are harder to make: the linear system can have a zero determinant, depending on whether the functions $\zeta_q^{(*)}$, evaluated on the $x_i$ with $0 < \alpha_i^{(*)} < C/\ell$, are linearly dependent. The latter occurs, for instance, if we use constant functions $\zeta^{(*)} \equiv 1$. In this case, it is pointless to use two different values ν, ν*; for, the constraint (10) then implies that both sums $\sum_{i=1}^{\ell} \alpha_i^{(*)}$ will be bounded by $C \cdot \min\{\nu, \nu^*\}$. We conclude this section by giving, without proof, a generalization of Proposition 1, (iii), to the optimization problem with constraint (19):

Proposition 3  Assume ε > 0. Suppose the data (2) were generated iid from a distribution P(x, y) = P(x)P(y|x) with P(y|x) continuous. With probability 1, asymptotically, the fractions of SVs and errors equal $\nu \cdot \big(\int \zeta(x)\, d\tilde P(x)\big)^{-1}$, where $\tilde P$ is the asymptotic distribution of SVs over x.

4 EXPERIMENTS AND DISCUSSION

In the experiments, we used the optimizer LOQO (http://www.princeton.edu/~rvdb/). This has the serendipitous advantage that the primal variables b and ε can be recovered as the dual variables of the Wolfe dual (9) (i.e. the double dual variables) fed into the optimizer.

In Fig. 2, the task was to estimate a regression of a noisy sinc function, given ℓ examples $(x_i, y_i)$, with $x_i$ drawn uniformly from [-3, 3], and $y_i = \sin(\pi x_i)/(\pi x_i) + v_i$, with $v_i$ drawn from a Gaussian with zero mean and variance $\sigma^2$. We used the default parameters ℓ = 50, C = 100, σ = 0.2, and the RBF kernel $k(x, x') = \exp(-|x - x'|^2)$.
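This toy setup can be approximated with off-the-shelf software; the sketch below (an approximation, not the LOQO-based code used for the figures) uses scikit-learn's NuSVR, whose gamma=1 RBF kernel matches $k(x, x') = \exp(-|x - x'|^2)$, although libsvm's C parameter is scaled differently from the C used in (4), so the value below is only illustrative. The fraction of support vectors can then be compared with ν, cf. Proposition 1.

    import numpy as np
    from sklearn.svm import NuSVR

    rng = np.random.default_rng(0)
    ell, sigma = 50, 0.2
    x = rng.uniform(-3, 3, size=ell)
    y = np.sinc(x) + sigma * rng.standard_normal(ell)   # sin(pi x)/(pi x) + Gaussian noise

    for nu in (0.2, 0.8):
        model = NuSVR(nu=nu, C=100.0, kernel='rbf', gamma=1.0).fit(x.reshape(-1, 1), y)
        frac_svs = len(model.support_) / ell
        print(f'nu = {nu}: fraction of SVs = {frac_svs:.2f} (Proposition 1: >= nu)')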
Figure 3 gives an illustration of how one can make use of parametric insensitivity models as proposed in Sec. 3. Using the proper model, the estimate gets much better. In the parametric case, we used ν = 0.1 and $\zeta(x) = \sin^2((2\pi/3)x)$, which, due to $\int \zeta(x)\, dP(x) = 1/2$, corresponds to our standard choice ν = 0.2 in ν-SVR (cf. Proposition 3). The experimental findings are consistent with the asymptotics predicted theoretically even if we assume a uniform distribution of SVs: for ℓ = 200, we got 0.24 and 0.19 for the fraction of SVs and errors, respectively.

Figure 2: Left: ν-SV regression with ν = 0.2 (top) and ν = 0.8 (bottom). The larger ν allows more points to lie outside the tube (see Sec. 2). The algorithm automatically adjusts ε to 0.22 (top) and 0.04 (bottom). Shown are the sinc function (dotted), the regression f and the tube f ± ε. Middle: ν-SV regression on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ν = 0.2. The tube width automatically adjusts to the noise (top: ε = 0, bottom: ε = 1.19). Right: ε-SV regression (Vapnik, 1995) on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ε = 0.2. This choice, which has to be specified a priori, is ideal for neither case: in the top figure, the regression estimate is biased; in the bottom figure, ε does not match the external noise (cf. Smola et al., 1998).

Figure 3: Toy example, using prior knowledge about an x-dependence of the noise. Additive noise (σ = 1) was multiplied by $\sin^2((2\pi/3)x)$. Left: the same function was used as ζ as a parametric insensitivity tube (Sec. 3). Right: ν-SVR with standard tube.
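For completeness, the following sketch generates data in the manner described in the Figure 3 caption, with additive Gaussian noise multiplied by $\zeta(x) = \sin^2((2\pi/3)x)$, and fits a standard-tube ν-SVR as in the right-hand panel; the underlying target is taken to be the same sinc function as before, which is an assumption not stated in the caption. The parametric-tube variant of the left-hand panel requires the modified dual constraint (19) and is not available in standard SVM packages, so it is not reproduced here.

    import numpy as np
    from sklearn.svm import NuSVR

    def zeta(x):
        # parametric insensitivity shape from Fig. 3
        return np.sin((2 * np.pi / 3) * x) ** 2

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = np.sinc(x) + 1.0 * rng.standard_normal(200) * zeta(x)   # sigma = 1, x-dependent noise

    # standard (constant-radius) tube, as in the right panel of Fig. 3
    model = NuSVR(nu=0.2, C=100.0, kernel='rbf', gamma=1.0).fit(x.reshape(-1, 1), y)
    print('fraction of SVs:', len(model.support_) / len(x))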
This method allows the incorporation of prior knowledge into the loss function. Although this approach at first glance seems fundamentally different from incorporating prior knowledge directly into the kernel (Schölkopf et al., 1998b), from the point of view of statistical learning theory the two approaches are closely related: in both cases, the structure of the loss-function-induced class of functions (which is the object of interest for generalization error bounds) is customized; in the first case, by changing the loss function, in the second case, by changing the class of functions that the estimate is taken from.

Empirical studies using ε-SVR have reported excellent performance on the widely used Boston housing regression benchmark set (Stitson et al., 1999). Due to Proposition 2, the only difference between ν-SVR and standard ε-SVR lies in the fact that different parameters, ε vs. ν, have to be specified a priori. Consequently, we are in this experiment only interested in these parameters and simply adjusted C and the width $2\sigma^2$ in $k(x, y) = \exp(-\|x - y\|^2/(2\sigma^2))$ as in Schölkopf et al. (1997): we used $2\sigma^2 = 0.3 \cdot N$, where N = 13 is the input dimensionality, and $C/\ell = 10 \cdot 50$ (i.e. the original value of 10 was corrected since in the present case, the maximal y-value is 50). We performed 100 runs, where each time the overall set of 506 examples was randomly split into a training set of ℓ = 481 examples and a test set of 25 examples. Table 1 shows that in a wide range of ν (note that only 0 ≤ ν ≤ 1 makes sense), we obtained performances which are close to the best performances that can be achieved by selecting ε a priori by looking at the test set. Finally, note that although we did not use validation techniques to select the optimal values for C and $2\sigma^2$, we obtained performance which is state of the art (Stitson et al. (1999) report an MSE of 7.6 for ε-SVR using ANOVA kernels, and 11.7 for Bagging trees). Table 1 moreover shows that ν can be used to control the fraction of SVs/errors.

Table 1: Results for the Boston housing benchmark; top: ν-SVR, bottom: ε-SVR. MSE: mean squared errors, STD: standard deviations thereof (100 trials), Errors: fraction of training points outside the tube, SVs: fraction of training points which are SVs. (The column ordering below is reconstructed from the monotone behaviour of the automatically computed ε and of the error/SV fractions.)

ν           | 0.1 | 0.2 | 0.3 | 0.4 | 0.5  | 0.6  | 0.7  | 0.8  | 0.9  | 1.0
automatic ε | 2.6 | 1.7 | 1.2 | 0.8 | 0.6  | 0.3  | 0.0  | 0.0  | 0.0  | 0.0
MSE         | 9.4 | 8.7 | 9.3 | 9.5 | 10.0 | 10.6 | 11.3 | 11.3 | 11.3 | 11.3
STD         | 6.4 | 6.8 | 7.6 | 7.9 | 8.4  | 9.0  | 9.6  | 9.5  | 9.5  | 9.5
Errors      | 0.0 | 0.1 | 0.2 | 0.2 | 0.3  | 0.4  | 0.5  | 0.5  | 0.5  | 0.5
SVs         | 0.3 | 0.4 | 0.6 | 0.7 | 0.8  | 0.9  | 1.0  | 1.0  | 1.0  | 1.0

ε           | 0    | 1   | 2   | 3   | 4    | 5    | 6    | 7    | 8    | 9    | 10
MSE         | 11.3 | 9.5 | 8.8 | 9.7 | 11.2 | 13.1 | 15.6 | 18.2 | 22.1 | 27.0 | 34.3
STD         | 9.5  | 7.7 | 6.8 | 6.2 | 6.3  | 6.0  | 6.1  | 6.2  | 6.6  | 7.3  | 8.4
Errors      | 0.5  | 0.2 | 0.1 | 0.0 | 0.0  | 0.0  | 0.0  | 0.0  | 0.0  | 0.0  | 0.0
SVs         | 1.0  | 0.6 | 0.4 | 0.3 | 0.2  | 0.1  | 0.1  | 0.1  | 0.1  | 0.1  | 0.1
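The split-and-average protocol of this benchmark can be sketched as follows (an illustration only: load_boston_housing is a placeholder for whatever source provides the 506 examples, not a real library function, and libsvm's C/gamma parameterization in scikit-learn differs from the C, $2\sigma^2$ convention above; gamma = 1/(0.3·N) corresponds to $2\sigma^2 = 0.3 \cdot N$).

    import numpy as np
    from sklearn.svm import NuSVR

    def evaluate_nu(X, y, nu, C, gamma, n_trials=100, n_test=25, seed=0):
        # 100 random splits into 481 training / 25 test points; mean and std of test MSE
        rng = np.random.default_rng(seed)
        mses = []
        for _ in range(n_trials):
            perm = rng.permutation(len(y))
            test, train = perm[:n_test], perm[n_test:]
            model = NuSVR(nu=nu, C=C, kernel='rbf', gamma=gamma)
            model.fit(X[train], y[train])
            mses.append(np.mean((model.predict(X[test]) - y[test]) ** 2))
        return np.mean(mses), np.std(mses)

    # X, y = load_boston_housing()        # placeholder: 506 examples, N = 13 inputs
    # gamma = 1.0 / (0.3 * X.shape[1])    # corresponds to 2*sigma^2 = 0.3 * N
    # for nu in np.arange(0.1, 1.01, 0.1):
    #     print(nu, evaluate_nu(X, y, nu=nu, C=10 * 50, gamma=gamma))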
Discussion. The theoretical and experimental analysis suggests that ν provides a way to control an upper bound on the number of training errors which is tighter than the one used in the soft margin hyperplane (Vapnik, 1995). In many cases, this makes it a parameter which is more convenient than the one in ε-SVR. Asymptotically, it directly controls the number of Support Vectors, and the latter can be used to give a leave-one-out generalization bound (Vapnik, 1995). In addition, ν characterizes the compression ratio: it suffices to train the algorithm only on the SVs, leading to the same solution (Schölkopf et al., 1995). In ε-SVR, the tube width ε must be specified a priori; in ν-SVR, which generalizes the idea of the trimmed mean, it is computed automatically. Desirable properties of ε-SVR, including the formulation as a definite quadratic program, and the sparse SV representation of the solution, are retained. We are optimistic that in many applications, ν-SVR will be more robust than ε-SVR. Among these should be the reduced set algorithm of Osuna and Girosi (1999), which approximates the SV pattern recognition decision surface by ε-SVR. Here, ν should give a direct handle on the desired speed-up.

One of the immediate questions that a ν-approach to SV regression raises is whether a similar algorithm is possible for the case of pattern recognition. This question has recently been answered in the affirmative (Schölkopf et al., 1998c). Since the pattern recognition algorithm (Vapnik, 1995) does not use ε, the only parameter that we can dispose of by using ν is the regularization constant C. This leads to a dual optimization problem with a homogeneous quadratic form, and ν lower bounding the sum of the Lagrange multipliers. Whether we could have abolished C in the regression case, too, is an open problem.

Acknowledgement  This work was supported by the ARC and the DFG (# Ja 379/71).

References

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

E. Osuna and F. Girosi. Reducing run-time complexity in support vector machines. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 271-283. MIT Press, Cambridge, MA, 1999.

B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995.

B. Schölkopf, P. Bartlett, A. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 111-116, Berlin, 1998a. Springer Verlag.

B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, pages 640-646, Cambridge, MA, 1998b. MIT Press.

B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. 1998c. NeuroCOLT2-TR 1998-031; cf. http://www.neurocolt.com

B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Trans. Sign. Processing, 45:2758-2765, 1997.
A. Smola, N. Murata, B. Schölkopf, and K.-R. Müller. Asymptotically optimal choice of ε-loss for support vector machines. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 105-110, Berlin, 1998. Springer Verlag.

M. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector regression with ANOVA decomposition kernels. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 285-291. MIT Press, Cambridge, MA, 1999.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
", "award": [], "sourceid": 1563, "authors": [{"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}