{"title": "Computing the Solution Path for the Regularized Support Vector Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 483, "page_last": 490, "abstract": "", "full_text": "Computing the Solution Path for the\n\nRegularized Support Vector Regression\n\nLacey Gunter\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI 48109\nlgunter@umich.edu\n\nJi Zhu\u2217\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI 48109\n\njizhu@umich.edu\n\nAbstract\n\nIn this paper we derive an algorithm that computes the entire solu-\ntion path of the support vector regression, with essentially the same\ncomputational cost as \ufb01tting one SVR model. We also propose an\nunbiased estimate for the degrees of freedom of the SVR model,\nwhich allows convenient selection of the regularization parameter.\n\n1 Introduction\n\nThe support vector regression (SVR) is a popular tool for function estimation prob-\nlems, and it has been widely used on many real applications in the past decade, for\nexample, time series prediction [1], signal processing [2] and neural decoding [3].\nIn this paper, we focus on the regularization parameter of the SVR, and propose\nan e\ufb03cient algorithm that computes the entire regularized solution path; we also\npropose an unbiased estimate for the degrees of freedom of the SVR, which allows\nconvenient selection of the regularization parameter.\nSuppose we have a set of training data (x1, y1), . . . , (xn, yn), where the input xi \u2208\nRp and the output yi \u2208 R. 
Many researchers have noted that the formulation for the linear ε-SVR can be written in a loss + penalty form [4]:

    \min_{\beta_0, \beta} \sum_{i=1}^n \left| y_i - \beta_0 - \beta^T x_i \right|_\epsilon + \frac{\lambda}{2} \beta^T \beta    (1)

where |ξ|_ε is the so-called ε-insensitive loss function:

    |\xi|_\epsilon = \begin{cases} 0 & \text{if } |\xi| \le \epsilon \\ |\xi| - \epsilon & \text{otherwise} \end{cases}

The idea is to disregard errors as long as they are less than ε. Figure 1 plots the loss function. Notice that it has two non-differentiable points at ±ε. The regularization parameter λ controls the trade-off between the ε-insensitive loss and the complexity of the fitted model.

* To whom correspondence should be addressed.

[Figure 1: The ε-insensitive loss function. The five regions Left, Elbow L, Center, Elbow R and Right are labeled in the plot.]

In practice, one often maps x into a high (often infinite) dimensional reproducing kernel Hilbert space (RKHS), and fits a nonlinear kernel SVR model [4]:

    \min_{\beta_0, \theta} \sum_{i=1}^n \left| y_i - f(x_i) \right|_\epsilon + \frac{1}{2\lambda} \sum_{i=1}^n \sum_{i'=1}^n \theta_i \theta_{i'} K(x_i, x_{i'})    (2)

where f(x) = \beta_0 + \frac{1}{\lambda} \sum_{i=1}^n \theta_i K(x, x_i), and K(·,·) is a positive-definite reproducing kernel that generates a RKHS. Notice that we write f(x) in a way that involves λ explicitly, and we will see later that θ_i ∈ [−1, 1].

Both (1) and (2) can be transformed into a quadratic programming problem, hence most commercially available packages can be used to solve the SVR.
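For concreteness, the ε-insensitive loss in (1) can be sketched in a few lines of numpy (a hedged illustration; the function name `eps_insensitive_loss` is ours, not from the paper):

```python
import numpy as np

def eps_insensitive_loss(residual, eps):
    """Epsilon-insensitive loss: 0 inside the tube |xi| <= eps,
    and |xi| - eps outside of it."""
    return np.maximum(np.abs(residual) - eps, 0.0)
```

The loss is flat on [−ε, ε] and grows linearly with slope 1 outside it, which is exactly what produces the two non-differentiable "elbows" at ±ε.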
In the past years, many specific algorithms for the SVR have also been developed, for example, interior point algorithms [4-5], subset selection algorithms [6-7], and sequential minimal optimization [4, 8-9]. All these algorithms solve the SVR for a pre-fixed regularization parameter λ, and it is well known that an appropriate value of λ is crucial for achieving a small prediction error of the SVR.

In this paper, we show that the solution θ(λ) is piecewise linear as a function of λ, which allows us to derive an efficient algorithm that computes the exact entire solution path {θ(λ), 0 ≤ λ ≤ ∞}. We acknowledge that this work was inspired by one of the authors' earlier work on the SVM setting [10].

Before delving into the technical details, we illustrate the concept of piecewise linearity of the solution path with a simple example. We generate 10 training observations using the famous sinc(·) function:

    y = \frac{\sin(\pi x)}{\pi x} + e, \quad \text{where } x \sim U(-2\pi, 2\pi) \text{ and } e \sim N(0, 0.19^2)

We use the SVR with a 1-dimensional spline kernel

    K(x, x') = 1 + k_1(x) k_1(x') + k_2(x) k_2(x') - k_4(|x - x'|)    (3)

where k_1(t) = t − 1/2, k_2(t) = (k_1(t)^2 − 1/12)/2, and k_4(t) = (k_1(t)^4 − k_1(t)^2/2 + 7/240)/24. Figure 2 shows a subset of the piecewise linear solution path θ(λ) as a function of λ.

In Section 2, we describe the algorithm that computes the entire solution path of the SVR. In Section 3, we propose an unbiased estimate for the degrees of freedom of the SVR, which can be used to select the regularization parameter λ. In Section 4, we present numerical results on simulation data.
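The spline kernel (3) is a direct transcription into code (a sketch under our reading of the garbled subscripts, with k_1, k_2, k_4 the usual scaled polynomials; the helper names are ours, and inputs are assumed scaled to [0, 1]):

```python
import numpy as np

def k1(t):
    return t - 0.5

def k2(t):
    return (k1(t) ** 2 - 1.0 / 12) / 2.0

def k4(t):
    return (k1(t) ** 4 - k1(t) ** 2 / 2.0 + 7.0 / 240) / 24.0

def spline_kernel_1d(x, xp):
    """1-dimensional spline kernel (3): K(x,x') = 1 + k1 k1' + k2 k2' - k4(|x-x'|)."""
    return 1.0 + k1(x) * k1(xp) + k2(x) * k2(xp) - k4(np.abs(x - xp))
```

As a sanity check, the kernel is symmetric in its arguments, as any reproducing kernel must be.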
We conclude the paper with a discussion section.

[Figure 2: A subset of the solution path θ(λ) as a function of λ.]

2 Algorithm

For simplicity in notation, we describe the problem setup using the linear SVR, and the algorithm using the kernel SVR.

2.1 Problem Setup

The linear ε-SVR (1) can be re-written in an equivalent way:

    \min_{\beta_0, \beta} \sum_{i=1}^n (\xi_i + \delta_i) + \frac{\lambda}{2} \beta^T \beta
    \text{subject to } -(\delta_i + \epsilon) \le y_i - f(x_i) \le (\xi_i + \epsilon), \quad \xi_i, \delta_i \ge 0;
    f(x_i) = \beta_0 + \beta^T x_i, \quad i = 1, \ldots, n

This gives us the Lagrangian primal function

    L_P: \sum_{i=1}^n (\xi_i + \delta_i) + \frac{\lambda}{2} \beta^T \beta + \sum_{i=1}^n \alpha_i (y_i - f(x_i) - \xi_i - \epsilon) - \sum_{i=1}^n \gamma_i (y_i - f(x_i) + \delta_i + \epsilon) - \sum_{i=1}^n \rho_i \xi_i - \sum_{i=1}^n \tau_i \delta_i.

Setting the derivatives to zero we arrive at:

    \partial / \partial \beta:   \beta = \frac{1}{\lambda} \sum_{i=1}^n (\alpha_i - \gamma_i) x_i    (4)
    \partial / \partial \beta_0: \sum_{i=1}^n \alpha_i = \sum_{i=1}^n \gamma_i    (5)
    \partial / \partial \xi_i:   \alpha_i = 1 - \rho_i    (6)
    \partial / \partial \delta_i: \gamma_i = 1 - \tau_i    (7)

where the Karush-Kuhn-Tucker conditions are

    \alpha_i (y_i - f(x_i) - \xi_i - \epsilon) = 0    (8)
    \gamma_i (y_i - f(x_i) + \delta_i + \epsilon) = 0    (9)
    \rho_i \xi_i = 0    (10)
    \tau_i \delta_i = 0    (11)

Along with the constraint that our Lagrange multipliers must be non-negative, we can conclude from (6) and (7) that both 0 ≤ α_i ≤ 1 and 0 ≤ γ_i ≤ 1. We also see from (8) and (9) that if α_i is positive, then γ_i must be zero, and vice versa.
These lead to the following relationships:

    y_i - f(x_i) > \epsilon              ⇒ α_i = 1,      ξ_i > 0, γ_i = 0,      δ_i = 0;
    y_i - f(x_i) < -\epsilon             ⇒ α_i = 0,      ξ_i = 0, γ_i = 1,      δ_i > 0;
    y_i - f(x_i) ∈ (-\epsilon, \epsilon) ⇒ α_i = 0,      ξ_i = 0, γ_i = 0,      δ_i = 0;
    y_i - f(x_i) = \epsilon              ⇒ α_i ∈ [0, 1], ξ_i = 0, γ_i = 0,      δ_i = 0;
    y_i - f(x_i) = -\epsilon             ⇒ α_i = 0,      ξ_i = 0, γ_i ∈ [0, 1], δ_i = 0.

Using these relationships, we define the following sets that will be used later on when we are calculating the regularization path of the SVR:

- R = {i : y_i − f(x_i) > ε, α_i = 1, γ_i = 0} (Right of the elbows)
- E_R = {i : y_i − f(x_i) = ε, 0 ≤ α_i ≤ 1, γ_i = 0} (Right elbow)
- C = {i : −ε < y_i − f(x_i) < ε, α_i = 0, γ_i = 0} (Center)
- E_L = {i : y_i − f(x_i) = −ε, α_i = 0, 0 ≤ γ_i ≤ 1} (Left elbow)
- L = {i : y_i − f(x_i) < −ε, α_i = 0, γ_i = 1} (Left of the elbows)

Notice from (4) that for every λ, β is fully determined by the values of α_i and γ_i. For points in R, L and C, the values of α_i and γ_i are known; therefore, the algorithm will focus on points resting at the two elbows E_R and E_L.

2.2 Initialization

Initially, when λ = ∞, we can see from (4) that β = 0. We can determine the value of β_0 via a simple 1-dimensional optimization. For lack of space, we focus on the case that all the values of y_i are distinct, and furthermore, the initial sets E_R and E_L have at most one point combined (which is the usual situation).
In this case β_0 will not be unique and each of the α_i and γ_i will be either 0 or 1. Since β_0 is not unique, we can focus on one particular solution path, for example, by always setting β_0 equal to one of its boundary values (thus keeping one point at an elbow). As λ decreases, the range of β_0 shrinks toward zero and reaches zero when we have two points at the elbows, and the algorithm proceeds from there.

2.3 The Path

The formalized setup above can be easily modified to accommodate non-linear kernels; in fact, θ_i in (2) is equal to α_i − γ_i. For the remaining portion of the algorithm we will use the kernel notation.

The algorithm focuses on the sets of points E_R and E_L. These points have either f(x_i) = y_i − ε with α_i ∈ [0, 1], or f(x_i) = y_i + ε with γ_i ∈ [0, 1]. As we follow the path we will examine these sets until one or both of them change, at which point we will say an event has occurred. Thus events can be categorized as:

1. The initial event, for which two points must enter the elbow(s)
2. A point from R has just entered E_R, with α_i initially 1
3. A point from L has just entered E_L, with γ_i initially 1
4. A point from C has just entered E_R, with α_i initially 0
5. A point from C has just entered E_L, with γ_i initially 0
6. One or more points in E_R and/or E_L have just left the elbow(s) to join either R, L, or C, with α_i and γ_i initially 0 or 1

Until another event has occurred, all sets will remain the same. As a point passes through E_R or E_L, its respective α_i or γ_i must change from 0 → 1 or 1 → 0.
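The five sets R, E_R, C, E_L and L can be tracked with a simple classifier over residuals (a hedged sketch; the tolerance `tol` for detecting points sitting exactly at an elbow is an implementation detail we introduce, not part of the paper):

```python
import numpy as np

def classify_points(y, f, eps, tol=1e-8):
    """Partition indices by the residual y_i - f(x_i) into the five sets."""
    sets = {"R": [], "ER": [], "C": [], "EL": [], "L": []}
    for i, r in enumerate(np.asarray(y, float) - np.asarray(f, float)):
        if abs(r - eps) <= tol:
            sets["ER"].append(i)   # right elbow: y_i - f(x_i) = eps
        elif abs(r + eps) <= tol:
            sets["EL"].append(i)   # left elbow:  y_i - f(x_i) = -eps
        elif r > eps:
            sets["R"].append(i)    # right of the elbows
        elif r < -eps:
            sets["L"].append(i)    # left of the elbows
        else:
            sets["C"].append(i)    # inside the tube
    return sets
```

With such a bookkeeping routine, detecting an event amounts to noticing that this partition has changed between two values of λ.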
Relying on the fact that f(x_i) = y_i − ε or f(x_i) = y_i + ε for all points in E_R or E_L respectively, we can calculate α_i and γ_i for these points.

We use the superscript ℓ to index the sets immediately after the ℓth event has occurred, and let α_i^ℓ, γ_i^ℓ, β_0^ℓ and λ^ℓ be the parameter values immediately after the ℓth event. Also let f^ℓ be the function at this point. We define for convenience β_{0,λ} = λ · β_0 and hence β^ℓ_{0,λ} = λ^ℓ · β_0^ℓ. Then since

    f(x) = \frac{1}{\lambda} \left[ \sum_{i=1}^n (\alpha_i - \gamma_i) K(x, x_i) + \beta_{0,\lambda} \right]

for λ^{ℓ+1} < λ < λ^ℓ we can write

    f(x) = \left[ f(x) - \frac{\lambda^\ell}{\lambda} f^\ell(x) \right] + \frac{\lambda^\ell}{\lambda} f^\ell(x)
         = \frac{1}{\lambda} \left[ \sum_{i \in E_R^\ell} \nu_i K(x, x_i) - \sum_{j \in E_L^\ell} \omega_j K(x, x_j) + \nu_0 + \lambda^\ell f^\ell(x) \right],

where ν_i = α_i − α_i^ℓ, ω_j = γ_j − γ_j^ℓ and ν_0 = β_{0,λ} − β^ℓ_{0,λ}, and we can do the reduction in the second line since the α_i and γ_i are fixed for all points in R^ℓ, L^ℓ, and C^ℓ and all points remain in their respective sets.
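Between two events, the decomposition above lets us evaluate f at any x from the elbow increments ν_i, ω_j, ν_0 and the previous fit f^ℓ(x) alone (a hedged sketch; the function and argument names are ours):

```python
import numpy as np

def f_between_events(lam, lam_l, f_l_x, K_ER, nu, K_EL, omega, nu0):
    """f(x) = (1/lam) [ sum_i nu_i K(x, x_i) - sum_j omega_j K(x, x_j)
                        + nu_0 + lam_l * f^l(x) ],  for lam_{l+1} < lam < lam_l.
    K_ER / K_EL hold the kernel values K(x, x_i) for the right / left elbow sets."""
    return (np.dot(K_ER, nu) - np.dot(K_EL, omega) + nu0 + lam_l * f_l_x) / lam
```

Only the elbow points contribute increments; every other point's coefficient is frozen until the next event.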
Suppose |E_R^ℓ| = n_R^ℓ and |E_L^ℓ| = n_L^ℓ, so for the n_R^ℓ + n_L^ℓ points staying at the elbows we have (after some algebra) that

    \frac{1}{y_k - \epsilon} \left[ \sum_{i \in E_R^\ell} \nu_i K(x_k, x_i) - \sum_{j \in E_L^\ell} \omega_j K(x_k, x_j) + \nu_0 \right] = \lambda - \lambda^\ell, \quad \forall k \in E_R^\ell

    \frac{1}{y_m + \epsilon} \left[ \sum_{i \in E_R^\ell} \nu_i K(x_m, x_i) - \sum_{j \in E_L^\ell} \omega_j K(x_m, x_j) + \nu_0 \right] = \lambda - \lambda^\ell, \quad \forall m \in E_L^\ell

Also, by condition (5) we have that

    \sum_{i \in E_R^\ell} \nu_i - \sum_{j \in E_L^\ell} \omega_j = 0

This gives us n_R^ℓ + n_L^ℓ + 1 linear equations we can use to solve for each of the n_R^ℓ + n_L^ℓ + 1 unknown variables ν_i, ω_j and ν_0. Notice this system is linear in λ − λ^ℓ, which implies that α_i, γ_j and β_{0,λ} change linearly in λ − λ^ℓ. So we can write:

    \alpha_i = \alpha_i^\ell + (\lambda - \lambda^\ell) b_i, \quad \forall i \in E_R^\ell    (12)
    \gamma_j = \gamma_j^\ell + (\lambda - \lambda^\ell) b_j, \quad \forall j \in E_L^\ell    (13)
    \beta_{0,\lambda} = \beta^\ell_{0,\lambda} + (\lambda - \lambda^\ell) b_0    (14)
    f(x) = \frac{\lambda^\ell}{\lambda} \left[ f^\ell(x) - h^\ell(x) \right] + h^\ell(x)    (15)

where (b_i, b_j, b_0) is the solution when λ − λ^ℓ is equal to 1, and

    h^\ell(x) = \sum_{i \in E_R^\ell} b_i K(x, x_i) - \sum_{j \in E_L^\ell} b_j K(x, x_j) + b_0.

Given λ^ℓ, equations (12), (13) and (15) allow us to compute the λ at which the next event will occur, λ^{ℓ+1}.
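The (n_R^ℓ + n_L^ℓ + 1)-dimensional linear system for the step directions (b_i, b_j, b_0), i.e., the solution at λ − λ^ℓ = 1, can be assembled directly from the kernel matrix. A hedged sketch (the function name is ours, and for clarity we solve the small system densely with `np.linalg.solve` rather than with incremental inverse updates):

```python
import numpy as np

def elbow_directions(K, y, ER, EL, eps):
    """Solve for (b_i for i in ER, b_j for j in EL, b_0) at lam - lam_l = 1.

    Rows k in ER:  sum_i b_i K(x_k, x_i) - sum_j b_j K(x_k, x_j) + b_0 = y_k - eps
    Rows m in EL:  same left-hand side                                 = y_m + eps
    Last row (condition (5)): sum_i b_i - sum_j b_j = 0
    """
    nR, nL = len(ER), len(EL)
    A = np.zeros((nR + nL + 1, nR + nL + 1))
    rhs = np.zeros(nR + nL + 1)
    for r, k in enumerate(ER + EL):
        A[r, :nR] = K[k, ER]
        A[r, nR:nR + nL] = -K[k, EL]
        A[r, -1] = 1.0                      # coefficient of b_0
        rhs[r] = y[k] - eps if r < nR else y[k] + eps
    A[-1, :nR] = 1.0
    A[-1, nR:nR + nL] = -1.0
    return np.linalg.solve(A, rhs)
```

Because the elbow sets are typically tiny compared to n, this solve is cheap at every event.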
This will be the largest λ less than λ^ℓ such that either α_i for some i ∈ E_R^ℓ reaches 0 or 1, or γ_j for some j ∈ E_L^ℓ reaches 0 or 1, or one of the points in R, L or C reaches an elbow.

We terminate the algorithm either when the sets R and L become empty, or when λ has become sufficiently close to zero. In the latter case we must have f^ℓ − h^ℓ sufficiently small as well.

2.4 Computational cost

The major computational cost for updating the solutions at any event ℓ involves two things: solving the system of (n_R^ℓ + n_L^ℓ) linear equations, and computing h^ℓ(x). The former takes O((n_R^ℓ + n_L^ℓ)^2) calculations by using inverse updating and downdating, since the elbow sets usually differ by only one point between consecutive events, and the latter requires O(n(n_R^ℓ + n_L^ℓ)) computations.

According to our experience, the total number of steps taken by the algorithm is on average some small multiple of n. Letting m be the average size of E_R^ℓ ∪ E_L^ℓ, the approximate computational cost of the algorithm is O(cn^2 m + nm^2), which is comparable to a single SVR fitting algorithm that uses quadratic programming.

3 The Degrees of Freedom

The degrees of freedom is an informative measure of the complexity of a fitted model. In this section, we propose an unbiased estimate for the degrees of freedom of the SVR, which allows convenient selection of the regularization parameter λ.

Since the usual goal of regression analysis is to minimize the predicted squared-error loss, we study the degrees of freedom using Stein's unbiased risk estimation (SURE) theory [11]. Given x, assume y is generated according to a homoskedastic model:

    y \sim (\mu(x), \sigma^2)

where μ is the true mean and σ^2 is the common variance. Then the degrees of freedom of a fitted model f(x) can be defined as

    df(f) = \sum_{i=1}^n \mathrm{cov}(f(x_i), y_i) / \sigma^2

Stein showed that under mild conditions, \sum_{i=1}^n \partial f_i / \partial y_i is an unbiased estimate of df(f). It turns out that for the SVR model, for every fixed λ, this sum has an extremely simple formula:

    \widehat{df} \equiv \sum_{i=1}^n \frac{\partial f_i}{\partial y_i} = |E_R| + |E_L|    (16)

Therefore, |E_R| + |E_L| is a convenient unbiased estimate for the degrees of freedom of f(x). Due to the space restriction, we omit the proof here, but make a note that the proof relies on our SVR algorithm.

In applying (16) to select the regularization parameter λ, we plug it into the GCV criterion [12] for model selection:

    \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^n (y_i - f(x_i))^2}{(n - \widehat{df})^2}

The advantages of this criterion are that it does not assume a known σ^2, and it avoids cross-validation, which is computationally intensive. In practice, we can first use our efficient algorithm to compute the entire solution path, then identify the value of λ that minimizes the GCV criterion.

4 Numerical Results

To demonstrate our algorithm and the selection of λ using the GCV criterion, we show numerical results on simulated data. We consider both additive and multiplicative kernels built from the 1-dimensional spline kernel (3), which are respectively

    K(x, x') = \sum_{j=1}^p K(x_j, x'_j) \quad \text{and} \quad K(x, x') = \prod_{j=1}^p K(x_j, x'_j)

Simulations were based on the following four functions [13]:

    1. f(x) = \frac{\sin(\pi x)}{\pi x} + e_1, \quad x \in (-2\pi, 2\pi)
    2. f(x) = 0.1 e^{4x_1} + \frac{1}{1 + e^{-20(x_2 - 0.5)}} + 3x_3 + 2x_4 + x_5 + e_2, \quad x \in (0, 1)^5
    3. f(R, \omega, L, C) = \left( R^2 + \left( \omega L + \frac{1}{\omega C} \right)^2 \right)^{1/2} + e_3
    4. f(R, \omega, L, C) = \tan^{-1} \left( \frac{\omega L + \frac{1}{\omega C}}{R} \right) + e_4

where (R, ω, L, C) ∈ (0, 100) × (2π(20, 280)) × (0, 1) × (1, 11), and the e_i are distributed as N(0, σ_i^2), with σ_1 = 0.19, σ_2 = 1, σ_3 = 218.5 and σ_4 = 0.18.

We generated 300 training observations from each function, along with 10,000 validation observations and 10,000 test observations. For the first two simulations we used the additive 1-dimensional spline kernel, and for the second two simulations the multiplicative 1-dimensional spline kernel. We then found the λ that minimized the GCV criterion. The validation set was used to select the gold standard λ, which minimized the prediction MSE. Using these λ's we calculated the prediction MSE with the test data for each criterion. After repeating this 20 times, the average MSE and the standard deviation of the MSE are shown in Table 1, which indicates that the GCV criterion performs close to optimal.

Table 1: Simulation results of λ selection for SVR

    f(x)    MSE-Gold Standard    MSE-GCV
    1       0.0385 (0.0011)      0.0389 (0.0011)
    2       1.0999 (0.0367)      1.1120 (0.0382)
    3       50095 (1358)         50982 (2205)
    4       0.0459 (0.0023)      0.0471 (0.0028)

5 Discussion

In this paper, we have proposed an efficient algorithm that computes the entire regularization path of the SVR. We have also proposed the GCV criterion for selecting the best λ given the entire path. The GCV criterion seems to work sufficiently well on the simulation data.
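Given the fitted values and df̂ = |E_R| + |E_L| at each breakpoint of the path, selecting λ by the GCV criterion of Section 3 reduces to a minimization over the path (a hedged sketch; the function names are ours):

```python
import numpy as np

def gcv(y, f_hat, df_hat):
    """GCV(lambda) = sum_i (y_i - f(x_i))^2 / (n - df_hat)^2."""
    y, f_hat = np.asarray(y, float), np.asarray(f_hat, float)
    return np.sum((y - f_hat) ** 2) / (len(y) - df_hat) ** 2

def select_lambda(lambdas, fits, dfs, y):
    """Return the lambda on the path with the smallest GCV score;
    fits[k] are the fitted values and dfs[k] the df estimate at lambdas[k]."""
    scores = [gcv(y, f, d) for f, d in zip(fits, dfs)]
    return lambdas[int(np.argmin(scores))]
```

No estimate of σ² and no cross-validation folds are needed, which is precisely the appeal of the criterion.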
However, we acknowledge that according to our experience on real data sets (not shown here due to lack of space), the GCV criterion sometimes tends to over-fit the model. We plan to explore this issue further.

Due to the difficulty of also selecting the best ε for the SVR, an alternative algorithm exists that automatically adjusts the value of ε, called the ν-SVR [4]. In this scenario, ε is treated as another free parameter. Using arguments similar to those for β_0 in our above algorithm, one can show that ε is piecewise linear in 1/λ and its path can be calculated similarly.

Acknowledgments

We would like to thank Saharon Rosset for helpful comments. Gunter and Zhu are partially supported by grant DMS-0505432 from the National Science Foundation.

References

[1] Müller K, Smola A, Rätsch G, Schölkopf B, Kohlmorgen J & Vapnik V (1997) Predicting time series with support vector machines. Artificial Neural Networks, 999-1004.

[2] Vapnik V, Golowich S & Smola A (1997) Support vector method for function approximation, regression estimation, and signal processing. NIPS 9.

[3] Shpigelman L, Crammer K, Paz R, Vaadia E & Singer Y (2004) A temporal kernel-based model for tracking hand movements from neural activities. NIPS 17, 1273-1280.

[4] Smola A & Schölkopf B (2004) A tutorial on support vector regression. Statistics and Computing 14: 199-222.

[5] Vanderbei R (1994) LOQO: An interior point code for quadratic programming. Technical Report SOR-94-15, Princeton University.

[6] Osuna E, Freund R & Girosi F (1997) An improved training algorithm for support vector machines. Neural Networks for Signal Processing, 276-284.

[7] Joachims T (1999) Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning, 169-184.

[8] Platt J (1999) Fast training of support vector machines using sequential minimal optimization.
Advances in Kernel Methods - Support Vector Learning, 185-208.

[9] Keerthi S, Shevade S, Bhattacharyya C & Murthy K (1999) Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, NUS.

[10] Hastie T, Rosset S, Tibshirani R & Zhu J (2004) The entire regularization path for the support vector machine. JMLR 5: 1391-1415.

[11] Stein C (1981) Estimation of the mean of a multivariate normal distribution. Annals of Statistics 9: 1135-1151.

[12] Craven P & Wahba G (1979) Smoothing noisy data with spline functions. Numerische Mathematik 31: 377-403.

[13] Friedman J (1991) Multivariate adaptive regression splines. Annals of Statistics 19: 1-67.", "award": [], "sourceid": 2856, "authors": [{"given_name": "Lacey", "family_name": "Gunter", "institution": null}, {"given_name": "Ji", "family_name": "Zhu", "institution": null}]}