{"title": "Multiple Incremental Decremental Learning of Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 907, "page_last": 915, "abstract": "We propose a multiple incremental decremental algorithm  of Support Vector Machine (SVM). Conventional single  cremental decremental SVM can update the trained model  efficiently when single data point is added to or removed  from the training set. When we add and/or remove multiple  data points, this algorithm is time-consuming because we  need to repeatedly apply it to each data point. The roposed  algorithm is computationally more efficient when multiple  data points are added and/or removed simultaneously. The  single incremental decremental algorithm is built on an  optimization technique called parametric programming.  We extend the idea and introduce multi-parametric  programming for developing the proposed algorithm.  Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly  reduce the computational cost of multiple incremental  decremental operation. Our approach is especially useful  for online SVM learning in which we need to remove old  data points and add new data points in a short amount of  time.", "full_text": "Multiple Incremental Decremental Learning of\n\nSupport Vector Machines\n\nMasayuki Karasuyama and Ichiro Takeuchi\n\nDepartment of Engineering, Nagoya Institute of Technology\nGokiso-cho, Syouwa-ku, Nagoya, Aichi, 466-8555, JAPAN\n\nkrsym@ics.nitech.ac.jp, takeuchi.ichiro@nitech.ac.jp\n\nAbstract\n\nWe propose a multiple incremental decremental algorithm of Support Vector Ma-\nchine (SVM). Conventional single incremental decremental SVM can update the\ntrained model ef\ufb01ciently when single data point is added to or removed from the\ntraining set. When we add and/or remove multiple data points, this algorithm is\ntime-consuming because we need to repeatedly apply it to each data point. 
The proposed algorithm is computationally more efficient when multiple data points are added and/or removed simultaneously. The single incremental decremental algorithm is built on an optimization technique called parametric programming. We extend the idea and introduce multi-parametric programming for developing the proposed algorithm. Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly reduce the computational cost of multiple incremental decremental operations. Our approach is especially useful for online SVM learning, in which we need to remove old data points and add new data points in a short amount of time.\n\n1 Introduction\n\nAn incremental decremental algorithm for online learning of Support Vector Machines (SVMs) was previously proposed in [1], and the approach was adapted to other variants of kernel machines [2-4]. When a single data point is added and/or removed, these algorithms can efficiently update the trained model without re-training it from scratch. These algorithms are built on an optimization technique called parametric programming [5-7], in which one solves a series of optimization problems parametrized by a single parameter. In particular, one follows a solution path with respect to the coefficient parameter corresponding to the data point to be added or removed. When we add and/or remove multiple data points using these algorithms, the updating operation must be repeated for each single data point, which often requires too much computational cost for real-time online learning. In what follows, we refer to this conventional algorithm as the single incremental decremental algorithm or single update algorithm.\nIn this paper, we develop a multiple incremental decremental algorithm for the SVM. The proposed algorithm can update the trained model more efficiently when multiple data points are added and/or removed simultaneously. 
We develop the algorithm by introducing multi-parametric programming [8] from the optimization literature. We consider a path-following problem in the multi-dimensional space spanned by the coefficient parameters corresponding to the set of data points to be added or removed. In what follows, we refer to our proposed algorithm as the multiple incremental decremental algorithm or multiple update algorithm.\nThe main computational cost of parametric programming is in solving a linear system at each breakpoint (see Section 3 for details). Thus, the total computational cost of parametric programming is roughly proportional to the number of breakpoints on the solution path. In the repeated use of the single update algorithm for each data point, one follows the coordinate-wise solution path in the multi-dimensional coefficient parameter space. In the multiple update algorithm, by contrast, we establish a direction in the multi-dimensional coefficient parameter space so that the total length of the path becomes much shorter than the coordinate-wise one. Because the number of breakpoints on the shorter path followed by our algorithm is less than that on the longer coordinate-wise path, we gain relative computational efficiency. Figure 2 in Section 3.4 schematically illustrates our main idea.\nThis paper is organized as follows. Section 2 formulates the SVM and the KKT conditions. In Section 3, after briefly reviewing the single update algorithm, we describe our multiple update algorithm. In Section 4, we compare the computational cost of our multiple update algorithm with the single update algorithm and with LIBSVM (the state-of-the-art batch SVM solver based on the SMO algorithm) in numerical experiments on synthetic and real data sets. 
We close in Section 5 with concluding remarks.\n\n2 Support Vector Machine and KKT Conditions\n\nSuppose we have a set of training data {(x_i, y_i)}_{i=1}^n, where x_i ∈ X ⊆ R^d is the input and y_i ∈ {−1, +1} is the output class label. Support Vector Machines (SVMs) learn the following discriminant function:\n\nf(x) = w^⊤Φ(x) + b,\n\nwhere Φ(x) denotes a fixed feature-space transformation. The model parameters w and b can be obtained by solving the optimization problem:\n\nmin_{w,b,ξ} (1/2)||w||^2 + C ∑_{i=1}^n ξ_i s.t. y_i f(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ···, n,\n\nwhere C ∈ R_+ is the regularization parameter. Introducing Lagrange multipliers α_i ≥ 0, the optimal discriminant function f : X → R can be formulated as f(x) = ∑_{i=1}^n α_i y_i K(x, x_i) + b, where K(x_i, x_j) = Φ(x_i)^⊤Φ(x_j) is a kernel function. From the Karush-Kuhn-Tucker (KKT) optimality conditions, we obtain the following relationships:\n\ny_i f(x_i) > 1 ⇒ α_i = 0, (1a)\ny_i f(x_i) = 1 ⇒ α_i ∈ [0, C], (1b)\ny_i f(x_i) < 1 ⇒ α_i = C, (1c)\n∑_{i=1}^n y_i α_i = 0. (1d)\n\nUsing (1a)-(1c), let us define the following index sets:\n\nO = {i | y_i f(x_i) > 1, α_i = 0}, (2a)\nM = {i | y_i f(x_i) = 1, 0 ≤ α_i ≤ C}, (2b)\nI = {i | y_i f(x_i) < 1, α_i = C}. (2c)\n\nIn what follows, subscripting by an index set, such as v_I for a vector v ∈ R^n, indicates the subvector of v whose elements are indexed by I. 
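As a concrete illustration, the partition (2a)-(2c) can be computed directly from the dual coefficients and the decision values. The following is a minimal sketch; the function and variable names are ours, not the paper's:

```python
import numpy as np

def partition_index_sets(alpha, y, f_x, C, tol=1e-8):
    """Split points into the sets O, M, I of (2a)-(2c), given the dual
    coefficients alpha and the decision values f(x_i)."""
    margins = y * f_x  # the quantities y_i f(x_i)
    O = np.where((margins > 1 + tol) & (np.abs(alpha) <= tol))[0]
    M = np.where(np.abs(margins - 1) <= tol)[0]
    I = np.where((margins < 1 - tol) & (np.abs(alpha - C) <= tol))[0]
    return O, M, I
```

The tolerance handles floating-point comparisons; in exact arithmetic the three sets cover all training points.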
Similarly, subscripting by two index sets, such as M_{M,O} for a matrix M ∈ R^{n×n}, denotes the submatrix whose rows are indexed by M and whose columns are indexed by O. If the submatrix is a principal submatrix such as Q_{M,M}, we abbreviate it as Q_M.\n\n3 Incremental Decremental Learning for SVM\n\n3.1 Single Incremental Decremental SVM\n\nIn this section, we briefly review the conventional single incremental decremental SVM [1]. Using the SV sets (2b) and (2c), we can expand y_i f(x_i) as\n\ny_i f(x_i) = ∑_{j∈M} Q_ij α_j + ∑_{j∈I} Q_ij α_j + y_i b,\n\nwhere Q_ij = y_i y_j K(x_i, x_j). When a new data point (x_c, y_c) is added, we increase the corresponding new parameter α_c from 0 while keeping the optimality conditions of the other parameters satisfied. Let us denote the amount of change of each variable with the operator ∆. To satisfy the equality conditions (1b) and (1d), we need\n\nQ_ic ∆α_c + ∑_{j∈M} Q_ij ∆α_j + y_i ∆b = 0, i ∈ M,\ny_c ∆α_c + ∑_{j∈M} y_j ∆α_j = 0.\n\nSolving this linear system with respect to ∆α_i, i ∈ M, and ∆b, we obtain the update direction of the parameters. We maximize ∆α_c under the constraint that no element moves across M, I and O. After updating the index sets M, I and O, we repeat the process until the new data point satisfies the optimality condition. The decremental algorithm can be derived similarly, in which the target parameter moves toward 0.\n\n3.2 Multiple Incremental Decremental SVM\n\nSuppose we add m new data points and remove ℓ data points simultaneously. Let us denote the index sets of the added data points and the removed data points as\n\nA = {n + 1, n + 2, ···, n + m} and R ⊂ {1, ···, n},\n\nrespectively, where |R| = ℓ. We remove the elements of R from the sets M, I and O (i.e. M ← M \ R, I ← I \ R and O ← O \ R). 
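The single-update linear system of Section 3.1 is small enough to sketch directly: one bordered solve gives the per-unit direction of (∆b, ∆α_M) as α_c increases. This is a sketch under the assumption that the bordered matrix is nonsingular; the names are ours:

```python
import numpy as np

def single_update_direction(Q_MM, y_M, Q_Mc, y_c):
    """Solve the equality conditions (1b), (1d) for the change of (b, alpha_M)
    per unit increase of the new coefficient alpha_c."""
    m = len(y_M)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y_M          # bordered matrix [[0, y_M^T], [y_M, Q_MM]]
    A[1:, 0] = y_M
    A[1:, 1:] = Q_MM
    rhs = -np.concatenate(([y_c], Q_Mc))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # (db, dalpha_M) per unit dalpha_c
```

In the actual algorithm the solve is repeated only at breakpoints, where the index sets change.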
Let us define y = [y_1, ···, y_{n+m}]^⊤, α = [α_1, ···, α_{n+m}]^⊤, and Q ∈ R^{(n+m)×(n+m)}, where the (i, j)-th entry of Q is Q_ij. When m = 1, ℓ = 0 or m = 0, ℓ = 1, our method corresponds to the conventional single incremental decremental algorithm. We initially set α_i = 0, ∀i ∈ A. If we have y_i f(x_i) > 1, i ∈ A, we can append these indices to O and remove them from A because these points already satisfy the optimality condition (1a). Similarly, we can append the indices {i | y_i f(x_i) = 1, i ∈ A} to M and remove them from A. In addition, we can remove the points {i | α_i = 0, i ∈ R} because they already have no influence on the model. Unlike the single incremental decremental algorithm, we need to determine the directions of ∆α_A and ∆α_R. These directions have a critical influence on the computational cost. For ∆α_R, we simply trace the shortest path to 0, i.e.,\n\n∆α_R = −η α_R, (3)\n\nwhere η is a step length. For ∆α_A, we do not know the optimal value of α_A beforehand. To determine this direction, we might use some optimization technique (e.g. Newton's method). However, such methods usually incur an additional computational burden. In this paper, we simply take\n\n∆α_A = η(C1 − α_A). (4)\n\nThis would become the shortest path if α_i = C, ∀i ∈ A, at optimality.\nWhen we move the parameters α_i, ∀i ∈ A ∪ R, the optimality conditions of the other parameters must be kept satisfied. 
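The initialization just described can be sketched in two small steps: pre-screen the new points at α_i = 0 by their margins, then set the straight-line targets of (3) and (4). A sketch with our own names:

```python
import numpy as np

def screen_new_points(y_A, f_A):
    """New points with y_i f(x_i) > 1 at alpha_i = 0 already satisfy (1a)
    and go to O; those on the margin go to M; the rest stay in A."""
    margins = y_A * f_A
    to_O = np.where(margins > 1)[0]
    to_M = np.where(np.isclose(margins, 1.0))[0]
    stay_A = np.where((margins < 1) & ~np.isclose(margins, 1.0))[0]
    return to_O, to_M, stay_A

def target_directions(alpha_A, alpha_R, C):
    """Per-unit-eta directions (3) and (4): removed coefficients head
    straight to 0, added ones head straight to C."""
    return C - alpha_A, -alpha_R
```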
From y_i f(x_i) = 1, i ∈ M, and the equality constraint (1d), we need\n\n∑_{j∈M} Q_ij ∆α_j + ∑_{j∈A} Q_ij ∆α_j + ∑_{j∈R} Q_ij ∆α_j + y_i ∆b = 0, i ∈ M, (5)\n∑_{j∈M} y_j ∆α_j + ∑_{j∈A} y_j ∆α_j + ∑_{j∈R} y_j ∆α_j = 0. (6)\n\nUsing matrix notation, (5) and (6) can be written as\n\nM [∆b; ∆α_M] + [y_A^⊤, y_R^⊤; Q_{M,A}, Q_{M,R}] [∆α_A; ∆α_R] = 0, (7)\n\nwhere\n\nM = [0, y_M^⊤; y_M, Q_M].\n\nFrom the definitions of the index sets in (2a)-(2c), the following inequality constraints must also be satisfied:\n\n0 ≤ α_i + ∆α_i ≤ C, i ∈ M, (8a)\ny_i{f(x_i) + ∆f(x_i)} > 1, i ∈ O, (8b)\ny_i{f(x_i) + ∆f(x_i)} < 1, i ∈ I. (8c)\n\nSince we removed the indices {i | y_i f(x_i) ≥ 1} from A, we obtain\n\ny_i{f(x_i) + ∆f(x_i)} < 1, i ∈ A. (9)\n\nDuring the process of moving α_i, i ∈ A, from 0 to C, if the inequality (9) becomes an equality for some i, we can append the point to M and remove it from A. On the other hand, if (9) holds until α_i becomes C, the point moves to I. In the path-following literature [8], the region that satisfies (8) and (9) is called the critical region (CR).\nWe decide the update direction by the linear system (7) while monitoring the inequalities (8) and (9). Substituting (3) and (4) into (7), we obtain the update direction\n\n[∆b; ∆α_M] = ηφ, where φ = −M^{−1} [y_A^⊤, y_R^⊤; Q_{M,A}, Q_{M,R}] [C1 − α_A; −α_R]. (10)\n\nTo determine the step length η, we need to check the inequalities (8) and (9). 
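A direct sketch of the direction computation in (10), assuming the matrix M is nonsingular (the names are ours; in the full algorithm this solve is accelerated with Cholesky factor updates, as noted in Section 3.4):

```python
import numpy as np

def multiple_update_direction(Q_MM, y_M, Q_MA, Q_MR, y_A, y_R,
                              alpha_A, alpha_R, C):
    """phi = -M^{-1} [y_A^T y_R^T; Q_MA Q_MR] [C1 - alpha_A; -alpha_R],
    giving (db, dalpha_M) per unit of the step length eta."""
    m = len(y_M)
    M = np.zeros((m + 1, m + 1))
    M[0, 1:] = y_M                   # bordered matrix [[0, y_M^T], [y_M, Q_M]]
    M[1:, 0] = y_M
    M[1:, 1:] = Q_MM
    top = np.concatenate((y_A, y_R))  # row [y_A^T, y_R^T]
    bottom = np.hstack((Q_MA, Q_MR))  # block [Q_MA, Q_MR]
    d = np.concatenate((C - alpha_A, -alpha_R))
    phi = -np.linalg.solve(M, np.vstack((top, bottom)) @ d)
    return phi  # phi[0]: db per eta, phi[1:]: dalpha_M per eta
```

With m = 1, ℓ = 0 this reduces to the single-update direction of Section 3.1.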
Using vector notation and the Hadamard product ⊙ (element-wise product [9]), we can write\n\ny ⊙ ∆f = ηψ, where ψ = [y, Q_{:,M}] φ + Q_{:,A}(C1 − α_A) − Q_{:,R} α_R, (11)\n\nwhere the subscript ”:” of Q denotes the index set of all the elements {1, ···, n + m}. Since (10) and (11) are linear functions of η, we can calculate, for each i, the largest step length η at which one of the inequalities (8) and (9) becomes an equality for i. The number of such η's is |M| × 2 + |O| + |I| + |A|, and we define this set as H. We determine the step length as follows:\n\nη = min({η̃ | η̃ ∈ H, η̃ ≥ 0} ∪ {1}).\n\nIf η becomes 1, we can terminate the algorithm because all the new data points in A and the existing points in M, O and I satisfy the optimality conditions and α_R is 0. Once we decide η, we can update α_M and b using (10), and α_A and α_R using (3) and (4). In the path-following literature, the points at which the size of the linear system (7) changes are called breakpoints. If the i-th data point reaches the bound of any one of the constraints (8) and (9), we need to update M, O and I. After updating, we re-calculate φ and ψ to determine the next step length.\n\n3.3 Empty Margin\n\nWe need to establish a way of dealing with an empty margin M. In such a case, we cannot obtain the bias from y_i f(x_i) = 1, i ∈ M. 
Then we can only obtain an interval for the bias from\n\ny_i f(x_i) > 1, i ∈ O,\ny_i f(x_i) < 1, i ∈ I ∪ A.\n\nTo keep these inequality constraints, the bias term must be in\n\nmax_{i∈L} y_i g_i ≤ b ≤ min_{i∈U} y_i g_i, (12)\n\nwhere\n\ng_i = 1 − ∑_{j∈I} α_j Q_ij − ∑_{j∈A} α_j Q_ij − ∑_{j∈R} α_j Q_ij,\n\nand\n\nL = {i | i ∈ O, y_i = +1} ∪ {i | i ∈ I ∪ A, y_i = −1},\nU = {i | i ∈ O, y_i = −1} ∪ {i | i ∈ I ∪ A, y_i = +1}.\n\nIf this empty margin happens during the path-following, we look for a new data point which re-enters the margin. When the set M is empty, the equality constraint (6) becomes\n\n∑_{i∈A} y_i ∆α_i + ∑_{i∈R} y_i ∆α_i = ηδ(α) = 0, (13)\n\nwhere\n\nδ(α) = ∑_{i∈A} y_i (C − α_i) − ∑_{i∈R} y_i α_i.\n\nFigure 1: An illustration of the bias in the empty margin case. Dotted lines represent y_i(g_i + ∆g_i(η)) for each i. Solid lines are the upper bound and the lower bound of the bias. The bias term is uniquely determined when u(η) and l(η) intersect.\n\nWe take two different strategies depending on δ(α).\nFirst, if δ(α) ≠ 0, we cannot simply increase η from 0 while keeping (13) satisfied. Then we need a new margin data point m_1 which enables the equality constraint to be satisfied. The index m_1 is either\n\ni_low = argmax_{i∈L} y_i g_i or i_up = argmin_{i∈U} y_i g_i.\n\nIf i_low, i_up ∈ O ∪ I, we can update b and M as follows:\n\nδ(α) > 0 ⇒ b = y_{i_up} g_{i_up}, M = {i_up},\nδ(α) < 0 ⇒ b = y_{i_low} g_{i_low}, M = {i_low}.\n\nBy setting the bias term as above, the equality condition\n\nηδ(α) + y_{m_1} ∆α_{m_1} = 0\n\nis satisfied. 
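The δ(α) ≠ 0 case just described can be sketched as follows (a sketch consistent with the bound (12): i_low attains the lower bound over L, i_up the upper bound over U; the names are ours):

```python
import numpy as np

def new_margin_point(g, y, L, U, delta):
    """Pick the data point m1 that re-enters the empty margin: the sign of
    delta(alpha) decides whether the bound-attaining point of U or of L
    becomes the new margin set M = {m1}."""
    L, U = np.asarray(L), np.asarray(U)
    i_low = L[np.argmax(y[L] * g[L])]  # attains max_{i in L} y_i g_i
    i_up = U[np.argmin(y[U] * g[U])]   # attains min_{i in U} y_i g_i
    m1 = i_up if delta > 0 else i_low
    return m1, y[m1] * g[m1]           # (new margin index, bias b)
```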
If i_low ∈ A or i_up ∈ A, we can put either of these points into the margin.\nOn the other hand, if δ(α) = 0, we can increase η while keeping (13) satisfied. Then, we consider increasing η until the upper bound and the lower bound of the bias (12) take the same value (so that the bias term is uniquely determined). If we increase η, g_i changes linearly:\n\n∆g_i(η) = −∑_{j∈A} ∆α_j Q_ij − ∑_{j∈R} ∆α_j Q_ij = η{−∑_{j∈A} (C − α_j) Q_ij + ∑_{j∈R} α_j Q_ij}.\n\nSince the lines y_i(g_i + ∆g_i(η)) may intersect, we need to consider the following piece-wise linear boundaries:\n\nu(η) = min_{i∈U} y_i(g_i + ∆g_i(η)),\nl(η) = max_{j∈L} y_j(g_j + ∆g_j(η)).\n\nFigure 1 shows an illustration of these functions. We can trace the upper bound and the lower bound until the two bounds take the same value.\n\n3.4 The number of breakpoints\n\nThe main computational cost of the incremental decremental algorithm is in solving the linear system (10) at each breakpoint (the cost is O(|M|^2) because we use a Cholesky factor update except at the first step). Thus, the number of breakpoints is an important factor in the computational cost. To simplify the discussion, let us introduce the following assumptions:\n\n• The number of breakpoints is proportional to the total length of the path.\n• The path obtained by our algorithm is the shortest one.\n\nFigure 2: A schematic illustration of the difference in path length and number of breakpoints ((a) adding 2 data points; (b) adding and removing 1 data point). Each polygonal region enclosed by dashed lines represents a region in which M, I, O and A are constant (CR: critical region). The intersections of the path and the borders are the breakpoints. The updates of matrices and vectors at the breakpoints are the main computational cost of path-following. 
In the case of Figure 2(a), we add 2 data points. If the optimal α_1 = α_2 = C, our proposed algorithm can trace the shortest path from the origin to the optimal point (left plot). On the other hand, the single incremental algorithm moves one coordinate at a time (right plot). Figure 2(b) shows the case in which we add and remove 1 data point, respectively. In this case, if α_2 = C, our algorithm can trace the shortest path to α_1 = 0, α_2 = C (left plot), while the single incremental algorithm again moves one coordinate at a time (right plot).\n\nThe first assumption means that the breakpoints are uniformly distributed on the path. The second assumption holds for the removed parameters α_R because we know that we should move α_R to 0. On the other hand, for some of α_A, the second assumption does not necessarily hold because we do not know the optimal α_A beforehand. In particular, if a point i ∈ A which was located inside the margin before the update moves to M during the update (i.e. the equality (9) holds), the path with respect to this parameter is not really the shortest one.\nTo simplify the discussion further, let us consider only the case of |A| = m > 0 and |R| = 0 (the same discussion holds for the other cases too). In this simplified scenario, the ratio of the number of breakpoints of the multiple update algorithm to that of the repeated use of the single update algorithm is\n\n∥α_A∥_2 : ∥α_A∥_1,\n\nwhere ∥ • ∥_2 is the ℓ_2 norm and ∥ • ∥_1 is the ℓ_1 norm. Figure 2 illustrates the concept in the case of m = 2. If we consider only the case of α_i = C, ∀i ∈ A, the ratio is simply\n\n√m : m.\n\n4 Experiments\n\nWe compared the computational cost of the proposed multiple incremental decremental algorithm (MID-SVM) with the (repeated use of the) single incremental decremental algorithm [1] (SID-SVM) and with LIBSVM [10], the state-of-the-art batch SVM solver based on the sequential minimal optimization (SMO) algorithm.\nIn LIBSVM, we examined several tolerances for the termination criterion: ε = 10^{−3}, 10^{−6}, 10^{−9}. When we use LIBSVM for online learning, alpha seeding [11, 12] sometimes works well. The basic idea of alpha seeding is to use the parameters before the update as the initial parameters. In alpha seeding, we need to take care of the fact that the summation constraint α^⊤y = 0 may not be satisfied after removing the α's in R. In that case, we simply re-distribute\n\nδ = ∑_{i∈R} α_i y_i\n\nto the in-bound α_i, i ∈ {i | 0 < α_i < C}, uniformly. If δ cannot be distributed to the in-bound α's, it is also distributed to the other α's. If we still cannot distribute δ in this way, we do not use alpha seeding.\nFor the kernel function, we used the RBF kernel K(x_i, x_j) = exp(−γ||x_i − x_j||^2). In this paper, we assume that the kernel matrix K is positive definite. If the kernel matrix happens to be singular, which typically arises when there are two or more identical data points in M, our algorithm may not work. As far as we know, this degeneracy problem is not fully solved in the path-following literature. Many heuristics have been proposed to circumvent the problem. In the experiments described below, we\n\nFigure 3: Artificial data set. For graphical simplicity, we plot only a part of the data points. 
The cross points are generated from a mixture of two Gaussians, while the circle points come from a single Gaussian. The two classes have equal prior probabilities.\n\nuse one of them: adding a small positive constant to the diagonal elements of the kernel matrix. We set this constant to 10^{−6}. In LIBSVM we can specify the cache size for the kernel matrix. We set this cache size large enough to store the entire matrix.\n\n4.1 Artificial Data\n\nFirst, we used a simple artificial data set to see the computational cost for various numbers of added and/or removed points. We generated data points (x, y) ∈ R^2 × {+1, −1} using normal distributions. Figure 3 shows the generated data points. The size of the initial data set is n = 500. As discussed, adding or removing data points with α_i = 0 at the optimum can be performed with almost no cost. Thus, to make a clear comparison, we restrict the added and/or removed points to those with α_i = C at the optimum. Figure 4 shows a log plot of the CPU time. We examined several scenarios: (a) adding m ∈ {1, ···, 50} data points, (b) removing ℓ ∈ {1, ···, 50} data points, (c) adding m ∈ {1, ···, 25} data points and removing ℓ ∈ {1, ···, 25} data points simultaneously. The horizontal axis is the number of added and/or removed data points. We see that MID-SVM is significantly faster than SID-SVM. When m = 1 or ℓ = 1, SID-SVM and MID-SVM are identical. The relative difference between SID-SVM and MID-SVM grows as m and/or ℓ increase because MID-SVM can add or remove multiple data points simultaneously, while SID-SVM merely iterates the algorithm m + ℓ times. In this experimental setting, the CPU time of SMO does not change much because m and ℓ are relatively small compared with n. 
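As an implementation aside, the degeneracy heuristic used in these experiments (adding a small positive constant, here 10⁻⁶, to the kernel diagonal) can be sketched as follows; the vectorized squared-distance computation is our own choice:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma, jitter=1e-6):
    """RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) with a small
    constant added to the diagonal, so the matrix stays positive definite
    even when duplicate data points make the plain kernel matrix singular."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off
    return K + jitter * np.eye(len(X))
```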
Figure 5 shows the number of breakpoints of SID-SVM and MID-SVM along with the theoretical number of breakpoints of MID-SVM from Section 3.4 (e.g., for scenario (a), the number of breakpoints of SID-SVM multiplied by √m/m). The results are very close to the theoretical ones.\n\n4.2 Application to Online Time Series Learning\n\nWe applied the proposed algorithm to an online time series learning problem, in which we update the model when some new observations arrive (adding the new ones and removing the obsolete ones). We used the Fisher river data set from StatLib [13]. In this data set, the task is to predict whether the mean daily flow of the river increases or decreases using the previous 7 days' temperature, precipitation and flow (x_i ∈ R^21). This data set contains the observations from Jan 1 1988 to Dec 31 1991. The size of the initial data set is n = 1423 and we set m = ℓ = 30 (about a month). Each dimension of x is normalized to [0, 1]. We add the m new data points and remove the oldest ℓ data points. We investigate various settings of the regularization parameter C ∈ {10^{−1}, 10^0, ···, 10^5} and the kernel parameter γ ∈ {10^{−3}, 10^{−2}, 10^{−1}, 10^0}. Unlike the previous experiments, we did not choose the added or removed data points by their parameter values. Figure 6 shows the elapsed CPU times and Figure 7 shows the 10-fold cross-validation error for each setting. Each figure has 4 plots corresponding to different settings of the kernel parameter γ. The horizontal axis denotes the regularization parameter C. Figure 6 shows that our algorithm is faster than the others, especially for large C. It is well known that the computational cost of the SMO algorithm becomes large when C gets large [14]. 
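The theoretical count used in Figure 5 above is the norm ratio from Section 3.4; as a quick check (the function name is ours):

```python
import numpy as np

def breakpoint_ratio(alpha_A):
    """Predicted ratio of multiple-update to single-update breakpoint
    counts: ||alpha_A||_2 / ||alpha_A||_1, which equals 1/sqrt(m) when
    all m coefficients end at the same value C."""
    a = np.asarray(alpha_A, dtype=float)
    return np.linalg.norm(a, 2) / np.linalg.norm(a, 1)
```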
Cross-validation error in Figure 7 indicates that the relative computational cost of our proposed algorithm is especially low for the hyperparameters with good generalization performance in this application problem.\n\nFigure 4: Log plot of the CPU time (artificial data set): (a) adding m data points; (b) removing ℓ data points; (c) adding m data points and removing ℓ data points simultaneously (m = ℓ).\n\nFigure 5: The number of breakpoints (artificial data set): (a) adding m data points; (b) removing ℓ data points; (c) adding m data points and removing ℓ data points simultaneously (m = ℓ).\n\nFigure 6: Log plot of the CPU time (Fisher river data set): (a) γ = 10^0; (b) γ = 10^{−1}; (c) γ = 10^{−2}; (d) γ = 10^{−3}.\n\nFigure 7: Cross-validation error (Fisher river data set): (a) γ = 10^0; (b) γ = 10^{−1}; (c) γ = 10^{−2}; (d) γ = 10^{−3}.\n\n5 Conclusion\n\nWe proposed a multiple incremental decremental algorithm for the SVM. Unlike the single incremental decremental algorithm, our algorithm can work efficiently with simultaneous addition and/or removal of multiple data points. Our algorithm is built on multi-parametric programming from the optimization literature [8]. We previously proposed an approach to accelerate Support Vector Regression (SVR) cross-validation using a similar technique [15]. 
These multi-parametric programming frameworks can be easily extended to other kernel machines.\n\nReferences\n\n[1] G. Cauwenberghs and T. Poggio, “Incremental and decremental support vector machine learning,” in Advances in Neural Information Processing Systems (T. K. Leen, T. G. Dietterich, and V. Tresp, eds.), vol. 
13, (Cambridge, Massachusetts), pp. 409–415, The MIT Press, 2001.\n\n[2] M. Martin, “On-line support vector machines for function approximation,” tech. rep., Software Department, Universitat Politecnica de Catalunya, 2002.\n\n[3] J. Ma and J. Theiler, “Accurate online support vector regression,” Neural Computation, vol. 15, no. 11, pp. 2683–2703, 2003.\n\n[4] P. Laskov, C. Gehl, S. Krüger, and K.-R. Müller, “Incremental support vector learning: Analysis, implementation and applications,” Journal of Machine Learning Research, vol. 7, pp. 1909–1936, 2006.\n\n[5] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The entire regularization path for the support vector machine,” Journal of Machine Learning Research, vol. 5, pp. 1391–1415, 2004.\n\n[6] L. Gunter and J. Zhu, “Efficient computation and model selection for the support vector regression,” Neural Computation, vol. 19, no. 6, pp. 1633–1655, 2007.\n\n[7] G. Wang, D.-Y. Yeung, and F. H. Lochovsky, “A new solution path algorithm in support vector regression,” IEEE Transactions on Neural Networks, vol. 19, no. 10, pp. 1753–1767, 2008.\n\n[8] E. N. Pistikopoulos, M. C. Georgiadis, and V. Dua, Process Systems Engineering: Volume 1: Multi-Parametric Programming. WILEY-VCH, 2007.\n\n[9] J. R. Schott, Matrix Analysis for Statistics. Wiley-Interscience, 2005.\n\n[10] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n\n[11] D. DeCoste and K. Wagstaff, “Alpha seeding for support vector machines,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 345–359, 2000.\n\n[12] M. M. Lee, S. S. Keerthi, C. J. Ong, and D. 
DeCoste, “An efficient method for computing leave-one-out error in support vector machines,” IEEE Transactions on Neural Networks, vol. 15, no. 3, pp. 750–757, 2004.\n\n[13] M. Meyer, “StatLib.” http://lib.stat.cmu.edu/index.php.\n\n[14] L. Bottou and C.-J. Lin, “Support vector machine solvers,” in Large Scale Kernel Machines (L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, eds.), pp. 301–320, Cambridge, MA: MIT Press, 2007.\n\n[15] M. Karasuyama, I. Takeuchi, and R. Nakano, “Efficient leave-m-out cross-validation of support vector regression by generalizing decremental algorithm,” New Generation Computing, vol. 27, no. 4, Special Issue on Data-Mining and Statistical Science, pp. 307–318, 2009.\n", "award": [], "sourceid": 572, "authors": [{"given_name": "Masayuki", "family_name": "Karasuyama", "institution": null}, {"given_name": "Ichiro", "family_name": "Takeuchi", "institution": null}]}