{"title": "Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1065, "page_last": 1072, "abstract": null, "full_text": "Multiplicative Updates for Nonnegative Quadratic\n\nProgramming in Support Vector Machines\n\nFei Sha1, Lawrence K. Saul1, and Daniel D. Lee2\n1Department of Computer and Information Science\n2Department of Electrical and System Engineering\n\nUniversity of Pennsylvania\n\n200 South 33rd Street, Philadelphia, PA 19104\n\nffeisha,lsaulg@cis.upenn.edu, ddlee@ee.upenn.edu\n\nAbstract\n\nWe derive multiplicative updates for solving the nonnegative quadratic\nprogramming problem in support vector machines (SVMs). The updates\nhave a simple closed form, and we prove that they converge monotoni-\ncally to the solution of the maximum margin hyperplane. The updates\noptimize the traditionally proposed objective function for SVMs. They\ndo not involve any heuristics such as choosing a learning rate or deciding\nwhich variables to update at each iteration. They can be used to adjust all\nthe quadratic programming variables in parallel with a guarantee of im-\nprovement at each iteration. We analyze the asymptotic convergence of\nthe updates and show that the coef\ufb01cients of non-support vectors decay\ngeometrically to zero at a rate that depends on their margins. In practice,\nthe updates converge very rapidly to good classi\ufb01ers.\n\n1 Introduction\n\nSupport vector machines (SVMs) currently provide state-of-the-art solutions to many prob-\nlems in machine learning and statistical pattern recognition[18]. Their superior perfor-\nmance is owed to the particular way they manage the tradeoff between bias (under\ufb01tting)\nand variance (over\ufb01tting). 
In SVMs, kernel methods are used to map inputs into a higher, potentially infinite, dimensional feature space; the decision boundary between classes is then identified as the maximum margin hyperplane in the feature space. While SVMs provide the flexibility to implement highly nonlinear classifiers, the maximum margin criterion helps to control the capacity for overfitting. In practice, SVMs generalize very well, even better than their theory suggests.\n\nComputing the maximum margin hyperplane in SVMs gives rise to a problem in nonnegative quadratic programming. The resulting optimization is convex, but due to the nonnegativity constraints, it cannot be solved in closed form, and iterative solutions are required. There is a large literature on iterative algorithms for nonnegative quadratic programming in general and for SVMs as a special case [3, 17]. Gradient-based methods are the simplest possible approach, but their convergence depends on careful selection of the learning rate, as well as constant attention to the nonnegativity constraints, which may not be naturally enforced. Multiplicative updates based on exponentiated gradients (EG) [5, 10] have been investigated as an alternative to traditional gradient-based methods. Multiplicative updates are naturally suited to sparse nonnegative optimizations, but EG updates, like their additive counterparts, suffer the drawback of having to choose a learning rate.\n\nSubset selection methods constitute another approach to the problem of nonnegative quadratic programming in SVMs. Generally speaking, these methods split the variables at each iteration into two sets: a fixed set in which the variables are held constant, and a working set in which the variables are optimized by an internal subroutine. At the end of each iteration, a heuristic is used to transfer variables between the two sets and improve the objective function. 
An extreme version of this approach is the method of Sequential Minimal Optimization (SMO) [15], which updates only two variables per iteration. In this case, there exists an analytical solution for the updates, so that one avoids the expense of a potentially iterative optimization within each iteration of the main loop.\n\nIn general, despite the many proposed approaches for training SVMs, solving the quadratic programming problem remains a bottleneck in their implementation. (Some researchers have even advocated changing the objective function in SVMs to simplify the required optimization [8, 13].) In this paper, we propose a new iterative algorithm, called Multiplicative Margin Maximization (M3), for training SVMs. The M3 updates have a simple closed form and converge monotonically to the solution of the maximum margin hyperplane. They do not involve heuristics such as the setting of a learning rate or the switching between fixed and working subsets; all the variables are updated in parallel. They provide an extremely straightforward way to implement traditional SVMs. Experimental and theoretical results confirm the promise of our approach.\n\n2 Nonnegative quadratic programming\n\nWe begin by studying the general problem of nonnegative quadratic programming. Consider the minimization of the quadratic objective function\n\n    F(v) = (1/2) v^T A v + b^T v,    (1)\n\nsubject to the constraints that v_i ≥ 0 for all i. We assume that the matrix A is symmetric and positive semidefinite, so that the objective function F(v) is bounded below and its optimization is convex. Due to the nonnegativity constraints, however, there does not exist an analytical solution for the global minimum (or minima), and an iterative solution is needed.\n\n2.1 Multiplicative updates\n\nOur iterative solution is expressed in terms of the positive and negative components of the matrix A in eq. (1). 
In particular, let A+ and A- denote the nonnegative matrices with elements:\n\n    A+_ij = A_ij if A_ij > 0, and 0 otherwise;\n    A-_ij = |A_ij| if A_ij < 0, and 0 otherwise.    (2)\n\nIt follows trivially that A = A+ - A-. In terms of these nonnegative matrices, our proposed updates (to be applied in parallel to all the elements of v) take the form:\n\n    v_i <- v_i [ -b_i + sqrt(b_i^2 + 4 (A+ v)_i (A- v)_i) ] / [ 2 (A+ v)_i ].    (3)\n\nThe iterative updates in eq. (3) are remarkably simple to implement. Their somewhat mysterious form will be clarified as we proceed. Let us begin with two simple observations. First, eq. (3) prescribes a multiplicative update for the ith element of v in terms of the ith elements of the vectors b, A+ v, and A- v. Second, since the elements of v, A+, and A- are nonnegative, the overall factor multiplying v_i on the right hand side of eq. (3) is always nonnegative. Hence, these updates never violate the constraints of nonnegativity.\n\n2.2 Fixed points\n\nWe can show further that these updates have fixed points wherever the objective function F(v) achieves its minimum value. Let v* denote a global minimum of F(v). At such a point, one of two conditions must hold for each element v*_i: either (i) v*_i > 0 and (∂F/∂v_i)|_{v*} = 0, or (ii) v*_i = 0 and (∂F/∂v_i)|_{v*} ≥ 0. The first condition applies to the positive elements of v*, whose corresponding terms in the gradient must vanish. These derivatives are given by:\n\n    (∂F/∂v_i)|_{v*} = (A+ v*)_i - (A- v*)_i + b_i.    (4)\n\nThe second condition applies to the zero elements of v*. Here, the corresponding terms of the gradient must be nonnegative, thus pinning v*_i to the boundary of the feasibility region.\n\nThe multiplicative updates in eq. 
(3) have fixed points wherever the conditions for global minima are satisfied. To see this, let\n\n    γ_i := [ -b_i + sqrt(b_i^2 + 4 (A+ v*)_i (A- v*)_i) ] / [ 2 (A+ v*)_i ]    (5)\n\ndenote the factor multiplying the ith element of v in eq. (3), evaluated at v*. Fixed points of the multiplicative updates occur when one of two conditions holds for each element v_i: either (i) v*_i > 0 and γ_i = 1, or (ii) v*_i = 0. It is straightforward to show from eqs. (4-5) that (∂F/∂v_i)|_{v*} = 0 implies γ_i = 1. Thus the conditions for global minima establish the conditions for fixed points of the multiplicative updates.\n\n2.3 Monotonic convergence\n\nThe updates not only have the correct fixed points; they also lead to monotonic improvement in the objective function F(v). This is established by the following theorem:\n\nTheorem 1 The function F(v) in eq. (1) decreases monotonically to the value of its global minimum under the multiplicative updates in eq. (3).\n\nThe proof of this theorem (sketched in Appendix A) relies on the construction of an auxiliary function which provides an upper bound on F(v). Similar methods have been used to prove the convergence of many algorithms in machine learning [1, 4, 6, 7, 12, 16].\n\n3 Support vector machines\n\nWe now consider the problem of computing the maximum margin hyperplane in SVMs [3, 17, 18]. Let {(x_i, y_i)}, i = 1, ..., N, denote labeled examples with binary class labels y_i = ±1, and let K(x_i, x_j) denote the kernel dot product between inputs. 
In this paper, we focus on the simple case where, in the high dimensional feature space, the classes are linearly separable and the hyperplane is required to pass through the origin.^1 In this case, the maximum margin hyperplane is obtained by minimizing the loss function:\n\n    L(α) = -Σ_i α_i + (1/2) Σ_ij α_i α_j y_i y_j K(x_i, x_j),    (6)\n\nsubject to the nonnegativity constraints α_i ≥ 0. Let α* denote the location of the minimum of this loss function. The maximal margin hyperplane has normal vector w = Σ_i α*_i y_i x_i and satisfies the margin constraints y_i K(w, x_i) ≥ 1 for all examples in the training set.\n\n^1 The extensions to non-realizable data sets and to hyperplanes that do not pass through the origin are straightforward. They will be treated in a longer paper.\n\n                   Polynomial         Radial\nData               k = 4    k = 6     σ = 0.3    σ = 1.0    σ = 3.0\nSonar              9.6%     9.6%      7.6%       6.7%       10.6%\nBreast cancer      5.1%     3.6%      4.4%       4.4%       4.4%\n\nTable 1: Misclassification error rates on the sonar and breast cancer data sets after 512 iterations of the multiplicative updates.\n\n3.1 Multiplicative updates\n\nThe loss function in eq. (6) is a special case of eq. (1) with A_ij = y_i y_j K(x_i, x_j) and b_i = -1. Thus, the multiplicative updates for computing the maximal margin hyperplane in hard margin SVMs are given by:\n\n    α_i <- α_i [ 1 + sqrt(1 + 4 (A+ α)_i (A- α)_i) ] / [ 2 (A+ α)_i ],    (7)\n\nwhere A± are defined as in eq. (2). We will refer to the learning algorithm for hard margin SVMs based on these updates as Multiplicative Margin Maximization (M3).\n\nIt is worth comparing the properties of these updates to those of other approaches. 
Like multiplicative updates based on exponentiated gradients (EG) [5, 10], the M3 updates are well suited to sparse nonnegative optimizations;^2 unlike EG updates, however, they do not involve a learning rate, and they come with a guarantee of monotonic improvement. Like the updates for Sequential Minimal Optimization (SMO) [15], the M3 updates have a simple closed form; unlike SMO updates, however, they can be used to adjust all the quadratic programming variables in parallel (or any subset thereof), not just two at a time. Finally, we emphasize that the M3 updates optimize the traditional objective function for SVMs; they do not compromise the goal of computing the maximal margin hyperplane.\n\n3.2 Experimental results\n\nWe tested the effectiveness of the multiplicative updates in eq. (7) on two real world problems: binary classification of aspect-angle dependent sonar signals [9] and breast cancer data [14]. Both data sets, available from the UCI Machine Learning Repository [2], have been widely used to benchmark many learning algorithms, including SVMs [5]. The sonar and breast cancer data sets consist of 208 and 683 labeled examples, respectively. Training and test sets for the breast cancer experiments were created by 80%/20% splits of the available data.\n\nWe experimented with both polynomial and radial basis function kernels. The polynomial kernels had degrees k = 4 and k = 6, while the radial basis function kernels had variances of σ = 0.3, 1.0, and 3.0. The coefficients α_i were uniformly initialized to a value of one in all experiments.\n\nMisclassification rates on the test data sets after 512 iterations of the multiplicative updates are shown in Table 1. As expected, the results match previously published error rates on these data sets [5], showing that the M3 updates do in practice converge to the maximum margin hyperplane. 
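To make the training procedure concrete, the iteration of eq. (7) can be sketched in a few lines of numpy. This is our own illustrative sketch on a made-up, linearly separable toy problem (linear kernel, hyperplane through the origin); it is not the authors' code or data:

```python
import numpy as np

def m3_update(alpha, A):
    """One parallel M3 update, eq. (7): alpha_i <- alpha_i (1 + sqrt(1 + 4ac)) / (2a)."""
    Ap = np.maximum(A, 0.0)            # A+ from eq. (2)
    Am = np.maximum(-A, 0.0)           # A-
    a, c = Ap @ alpha, Am @ alpha      # (A+ alpha)_i and (A- alpha)_i
    return alpha * (1.0 + np.sqrt(1.0 + 4.0 * a * c)) / (2.0 * a)

# Toy data: two separable classes, hyperplane constrained through the origin.
X = np.array([[1.0, 0.5], [3.0, 3.0], [-1.0, -0.5], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = np.outer(y, y) * (X @ X.T)         # A_ij = y_i y_j K(x_i, x_j), linear kernel
alpha = np.ones(len(y))                # uniform initialization, as in the experiments
for _ in range(512):
    alpha = m3_update(alpha, A)

w = (alpha * y) @ X                    # normal vector w = sum_i alpha_i y_i x_i
print(y * (X @ w))                     # margins y_i K(w, x_i): all >= 1
```

On this toy set, the two points closest to the origin end up as the support vectors (each with α_i converging to 0.4), while the coefficients of the remaining points decay rapidly to zero, mirroring the behavior in Figure 1.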
Figure 1 shows the rapid convergence of the updates to good classifiers in just one or two iterations.\n\n^2 In fact, the multiplicative updates by nature cannot directly set a variable to zero. However, a variable can be clamped to zero whenever its value falls below some threshold (e.g., machine precision) and when a zero value would satisfy the Karush-Kuhn-Tucker conditions.\n\n[Figure 1: eight panels plotting the coefficients of the 546 training examples (support vectors on the left, non-support vectors on the right) after 0, 1, 2, 4, 8, 16, 32, and 64 iterations, with training/test error rates (ε_t, ε_g) of 2.9%/3.6%, 2.4%/2.2%, 1.1%/4.4%, 0.5%/4.4%, and 0.0%/4.4% for the remaining panels.]\n\nFigure 1: Rapid convergence of the multiplicative updates in eq. (7). The plots show results after different numbers of iterations on the breast cancer data set with the radial basis function kernel (σ = 3). The horizontal axes index the coefficients α_i of the 546 training examples; the vertical axes show their values. For ease of visualization, the training examples were ordered so that support vectors appear to the left and non-support vectors to the right. The coefficients α_i were uniformly initialized to a value of one. Note the rapid attenuation of non-support vector coefficients after one or two iterations. Intermediate error rates on the training set (ε_t) and test set (ε_g) are also shown.\n\n3.3 Asymptotic convergence\n\nThe rapid decay of non-support vector coefficients in Fig. 1 motivated us to analyze their rates of asymptotic convergence. Suppose we perturb just one of the non-support vector coefficients in eq. 
(6), say α_i, away from the fixed point to some small nonzero value δα_i. If we hold all the variables but α_i fixed and apply its multiplicative update, then the new displacement δα'_i after the update is given asymptotically by δα'_i ≈ (δα_i) γ_i, where\n\n    γ_i = [ 1 + sqrt(1 + 4 (A+ α*)_i (A- α*)_i) ] / [ 2 (A+ α*)_i ],    (8)\n\nand A_ij = y_i y_j K(x_i, x_j). (Eq. (8) is merely the specialization of eq. (5) to SVMs.) We can thus bound the asymptotic rate of convergence, in this idealized but instructive setting, by computing an upper bound on γ_i, which determines how fast the perturbed coefficient decays to zero. (Smaller γ_i implies faster decay.) In general, the asymptotic rate of convergence is determined by the overall positioning of the data points and classification hyperplane in the feature space. The following theorem, however, provides a simple bound in terms of easily understood geometric quantities.\n\nTheorem 2 Let d_i = |K(x_i, w)| / sqrt(K(w, w)) denote the perpendicular distance in the feature space from x_i to the maximum margin hyperplane, and let d = min_j d_j = 1/sqrt(K(w, w)) denote the one-sided margin of the classifier. Also, let ℓ_i = sqrt(K(x_i, x_i)) denote the distance of x_i to the origin in the feature space, and let ℓ = max_j ℓ_j denote the largest such distance. Then a bound on the asymptotic rate of convergence γ_i is given by:\n\n    γ_i ≤ [ 1 + (1/2) (d_i - d) d / (ℓ_i ℓ) ]^(-1).    (9)\n\n[Figure 2: positive and negative examples on either side of the classification hyperplane, with the quantities ℓ_i, d_i, and d labeled.]\n\nFigure 2: Quantities used to bound the asymptotic rate of convergence in eq. (9); see text. Solid circles denote support vectors; empty circles denote non-support vectors.\n\nThe proof of this theorem is sketched in Appendix B. 
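The bound can also be checked numerically. The sketch below (again with made-up toy data and a linear kernel, purely for illustration) runs the updates of eq. (7) to convergence, then compares the rate γ_i of eq. (8) against the bound of eq. (9) for each non-support vector:

```python
import numpy as np

# Separable toy problem; hyperplane through the origin, linear kernel.
X = np.array([[1.0, 0.0], [2.0, 3.0], [-1.0, 0.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = np.outer(y, y) * (X @ X.T)                 # A_ij = y_i y_j K(x_i, x_j)
Ap, Am = np.maximum(A, 0.0), np.maximum(-A, 0.0)

alpha = np.ones(4)
for _ in range(512):                           # M3 updates, eq. (7)
    a, c = Ap @ alpha, Am @ alpha
    alpha = alpha * (1.0 + np.sqrt(1.0 + 4.0 * a * c)) / (2.0 * a)

w = (alpha * y) @ X                            # normal vector of the hyperplane
Kww = w @ w                                    # K(w, w) for a linear kernel
d_i = np.abs(X @ w) / np.sqrt(Kww)             # distances to the hyperplane
d = 1.0 / np.sqrt(Kww)                         # one-sided margin
l_i = np.sqrt((X * X).sum(axis=1))             # distances to the origin
l = l_i.max()

a, c = Ap @ alpha, Am @ alpha
gamma = (1.0 + np.sqrt(1.0 + 4.0 * a * c)) / (2.0 * a)    # eq. (8)
bound = 1.0 / (1.0 + 0.5 * (d_i - d) * d / (l_i * l))     # eq. (9)
nonsv = alpha < 1e-8                           # non-support vectors
print(gamma[nonsv] <= bound[nonsv])            # bound holds for both
```

As the theorem predicts, the decay rates of the non-support vector coefficients sit below the geometric bound, and the bound loosens for points far from the origin.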
Figure 2 gives a schematic representation of the quantities that appear in the bound. The bound has a simple geometric intuition: the more distant a non-support vector is from the classification hyperplane, the faster its coefficient decays to zero. This is a highly desirable property for large numerical calculations, suggesting that the multiplicative updates could be used to quickly prune away outliers and reduce the size of the quadratic programming problem. Note that while the bound is insensitive to the scale of the inputs, its tightness does depend on their relative locations in the feature space.\n\n4 Conclusion\n\nSVMs represent one of the most widely used architectures in machine learning. In this paper, we have derived simple, closed form multiplicative updates for solving the nonnegative quadratic programming problem in SVMs. The M3 updates are straightforward to implement and have a rigorous guarantee of monotonic convergence. It is intriguing that multiplicative updates derived from auxiliary functions appear in so many other areas of machine learning, especially those involving sparse, nonnegative optimizations. Examples include the Baum-Welch algorithm [1] for discrete hidden Markov models, generalized iterative scaling [6] and adaBoost [4] for logistic regression, and nonnegative matrix factorization [11, 12] for dimensionality reduction and feature extraction. In these areas, simple multiplicative updates with guarantees of monotonic convergence have emerged over time as preferred methods of optimization. Thus it seems worthwhile to explore their full potential for SVMs.\n\nReferences\n\n[1] L. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes. Inequalities, 3:1-8, 1972.\n\n[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.\n\n[3] C. J. C. Burges. 
A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining, 2(2):121-167, 1998.\n\n[4] M. Collins, R. Schapire, and Y. Singer. Logistic regression, adaBoost, and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.\n\n[5] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Multiplicative updatings for support vector machines. In Proceedings of ESANN'99, pages 189-194, 1999.\n\n[6] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470-1480, 1972.\n\n[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-37, 1977.\n\n[8] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213-242, 2001.\n\n[9] R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1):75-89, 1988.\n\n[10] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.\n\n[11] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788-791, 1999.\n\n[12] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, Cambridge, MA, 2001. MIT Press.\n\n[13] O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161-177, 2001.\n\n[14] O. L. Mangasarian and W. H. Wolberg. Cancer diagnosis via linear programming. SIAM News, 23(5):1-18, 1990.\n\n[15] J. Platt. 
Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.\n\n[16] L. K. Saul and D. D. Lee. Multiplicative updates for classification by mixture models. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. MIT Press.\n\n[17] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\n\n[18] V. Vapnik. Statistical Learning Theory. Wiley, N.Y., 1998.\n\nA Proof of Theorem 1\n\nThe proof of monotonic convergence in the objective function F(v), eq. (1), is based on the derivation of an auxiliary function. Similar techniques have been used for many models in statistical learning [1, 4, 6, 7, 12, 16]. An auxiliary function G(ṽ, v) has the two crucial properties that F(ṽ) ≤ G(ṽ, v) and F(v) = G(v, v) for all nonnegative ṽ, v. From such an auxiliary function, we can derive the update rule v' = argmin_ṽ G(ṽ, v), which never increases (and generally decreases) the objective function F(v):\n\n    F(v') ≤ G(v', v) ≤ G(v, v) = F(v).    (10)\n\nBy iterating this procedure, we obtain a series of estimates that improve the objective function. For nonnegative quadratic programming, we derive an auxiliary function G(ṽ, v) by decomposing F(v) in eq. (1) into three terms and then bounding each term separately:\n\n    F(v) = (1/2) Σ_ij A+_ij v_i v_j - (1/2) Σ_ij A-_ij v_i v_j + Σ_i b_i v_i,    (11)\n\n    G(ṽ, v) = (1/2) Σ_i [(A+ v)_i / v_i] ṽ_i^2 - (1/2) Σ_ij A-_ij v_i v_j (1 + log[(ṽ_i ṽ_j)/(v_i v_j)]) + Σ_i b_i ṽ_i.    (12)\n\nIt can be shown that F(ṽ) ≤ G(ṽ, v). The minimization of G(ṽ, v) is performed by setting its derivative to zero, leading to the multiplicative updates in eq. (3). 
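The monotonicity guaranteed by this construction is easy to observe numerically. The following sketch (our own minimal example, with a randomly generated positive semidefinite A containing entries of both signs) applies eq. (3) and tracks F(v):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T                        # symmetric positive semidefinite, mixed-sign entries
b = -np.ones(6)                    # b < 0 keeps the minimizer away from v = 0

F = lambda v: 0.5 * v @ A @ v + b @ v          # objective, eq. (1)
Ap, Am = np.maximum(A, 0.0), np.maximum(-A, 0.0)

v = np.ones(6)
history = [F(v)]
for _ in range(200):
    a, c = Ap @ v, Am @ v
    v = v * (-b + np.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)   # eq. (3)
    history.append(F(v))

print(np.all(np.diff(history) <= 1e-9))        # F never increases, per Theorem 1
```

Setting A_ij = y_i y_j K(x_i, x_j) and b_i = -1 recovers the SVM special case of eq. (7).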
The updates move each element v_i in the same direction as -∂F/∂v_i, with fixed points occurring only if v*_i = 0 or ∂F/∂v_i = 0. Since the overall optimization is convex, all minima of F(v) are global minima. The updates converge to the unique global minimum if it exists.\n\nB Proof of Theorem 2\n\nThe proof of the bound on the asymptotic rate of convergence relies on the repeated use of equalities and inequalities that hold at the fixed point α*. For example, if α*_i = 0 is a non-support vector coefficient, then (∂L/∂α_i)|_{α*} ≥ 0 implies (A+ α*)_i - (A- α*)_i ≥ 1. As shorthand, let z+_i = (A+ α*)_i and z-_i = (A- α*)_i. Then we have the following result:\n\n    1/γ_i = 2 z+_i / [ 1 + sqrt(1 + 4 z+_i z-_i) ]    (13)\n          ≥ 2 z+_i / [ 1 + sqrt((z+_i - z-_i)^2 + 4 z+_i z-_i) ]    (14)\n          = 2 z+_i / [ 1 + z+_i + z-_i ]    (15)\n          = 1 + (z+_i - z-_i - 1)/(z+_i + z-_i + 1) ≥ 1 + (z+_i - z-_i - 1)/(2 z+_i).    (16)\n\nTo prove the theorem, we need to express this result in terms of kernel dot products. We can rewrite the variables in the numerator of eq. (16) as:\n\n    z+_i - z-_i = Σ_j A_ij α*_j = Σ_j y_i y_j K(x_i, x_j) α*_j = y_i K(x_i, w) = |K(x_i, w)|,    (17)\n\nwhere w = Σ_j α*_j y_j x_j is the normal vector to the maximum margin hyperplane. 
Likewise, we can obtain a bound on the denominator of eq. (16):\n\n    z+_i = Σ_j A+_ij α*_j    (18)\n         ≤ max_k A+_ik Σ_j α*_j    (19)\n         ≤ max_k |K(x_i, x_k)| Σ_j α*_j    (20)\n         ≤ sqrt(K(x_i, x_i)) max_k sqrt(K(x_k, x_k)) Σ_j α*_j    (21)\n         = sqrt(K(x_i, x_i)) max_k sqrt(K(x_k, x_k)) K(w, w).    (22)\n\nEq. (21) is an application of the Cauchy-Schwartz inequality for kernels, while eq. (22) exploits the observation that:\n\n    K(w, w) = Σ_jk A_jk α*_j α*_k = Σ_j α*_j Σ_k A_jk α*_k = Σ_j α*_j.    (23)\n\nThe last step in eq. (23) is obtained by recognizing that α*_j is nonzero only for the coefficients of support vectors, and that in this case the optimality condition (∂L/∂α_j)|_{α*} = 0 implies Σ_k A_jk α*_k = 1. Finally, substituting eqs. (17) and (22) into eq. (16) gives:\n\n    1/γ_i ≥ 1 + (|K(x_i, w)| - 1) / [ 2 sqrt(K(x_i, x_i)) max_k sqrt(K(x_k, x_k)) K(w, w) ].    (24)\n\nThis reduces in a straightforward way to the claim of the theorem.\n", "award": [], "sourceid": 2280, "authors": [{"given_name": "Fei", "family_name": "Sha", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}