{"title": "Convergence of Large Margin Separable Linear Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 357, "page_last": 363, "abstract": null, "full_text": "Convergence of Large Margin Separable Linear \n\nClassification \n\nTong Zhang \n\nMathematical Sciences Department \nIBM TJ. Watson Research Center \n\nYorktown Heights, NY 10598 \n\ntzhang@watson.ibm.com \n\nAbstract \n\nLarge margin linear classification methods have been successfully ap(cid:173)\nplied to many applications. For a linearly separable problem, it is known \nthat under appropriate assumptions, the expected misclassification error \nof the computed \"optimal hyperplane\" approaches zero at a rate propor(cid:173)\ntional to the inverse training sample size. This rate is usually charac(cid:173)\nterized by the margin and the maximum norm of the input data. In this \npaper, we argue that another quantity, namely the robustness of the in(cid:173)\nput data distribution, also plays an important role in characterizing the \nconvergence behavior of expected misclassification error. Based on this \nconcept of robustness, we show that for a large margin separable linear \nclassification problem, the expected misclassification error may converge \nexponentially in the number of training sample size. \n\n1 Introduction \n\nWe consider the binary classification problem: to determine a label y E {-1, 1} associ(cid:173)\nated with an input vector x. A useful method for solving this problem is by using linear \ndiscriminant functions . Specifically, we seek a weight vector wand a threshold () such that \nwT x < () if its label y = -1 and wT x ~ () if its label y = 1. \nIn this paper, we are mainly interested in problems that are linearly separable by a positive \nmargin (although, as we shall see later, our analysis is suitable for non-separable problems). 
\nThat is, there exists a hyperplane that perfectly separates the in-class data from the out-of(cid:173)\nclass data. We shall also assume () = 0 throughout the rest of the paper for simplicity. \nThis restriction usually does not cause problems in practice since one can always append a \nconstant feature to the input data x, which offset the effect of (). \n\nlinearly \n\nseparable problems, \n\nFor \nlabeled data \n(X1,yl), .. . ,(xn,yn), Vapnik recently proposed a method that optimizes a hard \nmargin bound which he calls the \"optimal hyperplane\" method (see [11]). The optimal \nhyperplane Wn is the solution to the following quadratic programming problem: \n\ngiven a \n\ntraining set of n \n\n. 1 \n2 \nmln-w \nw 2 \n\n(1) \n\n\fFor linearly non-separable problems, a generalization of the optimal hyperplane method \nhas appeared in [2], where a slack variable f.i is introduced for each data point (xi, yi) for \ni = 1, ... ,n. We compute a hyperplane Wn that solves \ns.t. wTxiyi 2: I-f.i, \n\nfori = 1, ... ,no \n\nf.i 2: 0 \n\nmin~wTw+CLf.i \nw,~ 2 \n\n. , \n\n(2) \n\nWhere C > 0 is a given parameter (also see [11]). \nIn this paper, we are interested in the quality of the computed weight Wn for the purpose of \npredicting the label y of an unseen data point x. We study this predictive power of Wn in the \nstandard batch learning framework. That is, we assume that the training data (xi, yi) for \ni = 1, ... n are independently drawn from the same underlying data distribution D which \nis unknown. The predictive power of the computed parameter Wn then corresponds to the \nclassification performance of Wn with respect to the true distribution D. \n\nWe organize the paper as follows. In Section 2, we briefly review a number of existing \ntechniques for analyzing separable linear classification problems. We then derive an ex(cid:173)\nponential convergence rate of misclassification error in Section 3 for certain large margin \nlinear classification. 
Section 4 compares the newly derived bound with known results from the traditional margin analysis. We explain that the exponential bound relies on a new quantity (the robustness of the distribution) which is not captured by a traditional margin bound. Note that for certain batch learning problems, exponential learning curves have already been observed [10]. It is thus not surprising that an exponential rate of convergence can be achieved by large margin linear classification. \n\n2 Some known results on generalization analysis \n\nThere are a number of ways to obtain bounds on the generalization error of a linear classifier. A general framework is to use techniques from empirical processes (aka VC analysis). Many such results related to large margin classification are described in chapter 4 of [3]. \n\nThe main advantage of this framework is its generality. The analysis does not require the estimated parameter to converge to the true parameter, which is ideal for combinatorial problems. However, for problems that are numerical in nature, the potential parameter space can be significantly reduced by using the first order condition of the optimal solution. In this case, the VC analysis may become suboptimal, since it assumes a larger search space than what a typical numerical procedure uses. Generally speaking, for a problem that is linearly separable with a large margin, the expected classification error of the computed hyperplane resulting from this analysis is of the order O(log n / n).¹ Similar generalization bounds can also be obtained for non-separable problems. \n\nIn chapter 10 of [11], Vapnik described a leave-one-out cross-validation analysis for linearly separable problems. This analysis takes into account the first order KKT condition of the optimal hyperplane wₙ. 
The expected generalization performance from this analysis is O(1/n), which is better than the corresponding bounds from the VC analysis. Unfortunately, this technique is only suitable for deriving an expected generalization bound (for example, it is not useful for obtaining a PAC style probability bound). \n\nAnother well-known technique for analyzing linearly separable problems is the mistake bound framework in online learning. It is possible to obtain an algorithm with a small generalization error in the batch learning setting from an algorithm with a small online mistake bound. The readers are referred to [6] and references therein for this type of analysis. This technique may also lead to a bound with an expected generalization performance of O(1/n). \n\n¹Bounds described in [3] would imply an expected classification error of O(log² n / n), which can be slightly improved (by a log n factor) if we adopt a slightly better covering number estimate, such as the bounds in [12, 14]. \n\nBesides the above mentioned approaches, generalization ability can also be studied in the statistical mechanical learning framework. It was shown that for linearly separable problems, an exponential decrease of the misclassification error is possible under this framework [1, 5, 7, 8]. Unfortunately, it is unclear how to relate the statistical mechanical learning framework to the batch learning framework considered in this paper. Their analysis, employing approximation techniques, does not seem to imply the small sample bounds in which we are interested. \n\nThe statistical mechanical learning results suggest that it may be possible to obtain a similar exponential decay of the misclassification error in the batch learning setting, which we prove in the next section. Furthermore, we show that the exponential rate depends on a quantity that is different from the traditional margin concept. 
Our analysis relies on a PAC style probability estimate of the convergence rate of the estimated parameter from (2) to the true parameter. Consequently, it is suitable for non-separable problems. A direct analysis of the convergence rate of the estimated parameter to the true parameter is important for problems that are numerical in nature, such as (2). However, a disadvantage of our analysis is that we are unable to directly deal with the linearly separable formulation (1). \n\n3 Exponential convergence \n\nWe can rewrite the SVM formulation (2) by eliminating ξ as: \n\nwₙ(λ) = argmin_w (1/n) Σᵢ f(wᵀxⁱyⁱ - 1) + (λ/2) wᵀw,  (3) \n\nwhere λ = 1/(nC) and \n\nf(z) = -z for z ≤ 0;  f(z) = 0 for z > 0. \n\nDenote by D the true underlying data distribution of (x, y), and let w*(λ) be the optimal solution with respect to the true distribution: \n\nw*(λ) = arg inf_w E_D f(wᵀxy - 1) + (λ/2) wᵀw.  (4) \n\nLet w* be the solution to \n\nw* = arg inf_w (1/2) wᵀw   s.t. E_D f(wᵀxy - 1) = 0,  (5) \n\nwhich is the infinite-sample version of the optimal hyperplane method. \n\nThroughout this section, we assume ‖w*‖₂ < ∞ and E_D ‖x‖₂ < ∞. The latter condition ensures that E_D f(wᵀxy - 1) ≤ ‖w‖₂ E_D ‖x‖₂ + 1 exists for all w. \n\n3.1 Continuity of the solution under regularization \n\nIn this section, we show that ‖w*(λ) - w*‖₂ → 0 as λ → 0. This continuity result allows us to approximate (5) by using (4) and (3) with a small positive regularization parameter λ. We only need to show that within any sequence of λ that converges to zero, there exists a subsequence λᵢ → 0 such that w*(λᵢ) converges to w* strongly. \n\nWe first consider the following inequality, which follows from the definition of w*(λ): \n\nE_D f(w*(λ)ᵀxy - 1) + (λ/2) w*(λ)² ≤ (λ/2) w*².  (6) \n\nTherefore ‖w*(λ)‖₂ ≤ ‖w*‖₂. It is well-known that every bounded sequence in a Hilbert space contains a weakly convergent subsequence (cf. Proposition 66.4 in [4]). Therefore within any sequence of λ that converges to zero, there exists a subsequence λᵢ → 0 such that w*(λᵢ) converges weakly. We denote the limit by ŵ. \n\nSince f(w*(λᵢ)ᵀxy - 1) is dominated by ‖w*‖₂‖x‖₂ + 1, which has a finite integral with respect to D, we obtain from (6) and the Lebesgue dominated convergence theorem \n\n0 = limᵢ E_D f(w*(λᵢ)ᵀxy - 1) = E_D limᵢ f(w*(λᵢ)ᵀxy - 1) = E_D f(ŵᵀxy - 1).  (7) \n\nAlso note that ‖ŵ‖₂ ≤ lim infᵢ ‖w*(λᵢ)‖₂ ≤ ‖w*‖₂; therefore by the definition of w*, we must have ŵ = w*. \n\nSince w* is the weak limit of w*(λᵢ), we obtain ‖w*‖₂ ≤ lim infᵢ ‖w*(λᵢ)‖₂. Also, since ‖w*(λᵢ)‖₂ ≤ ‖w*‖₂, it follows that limᵢ ‖w*(λᵢ)‖₂ = ‖w*‖₂. This equality implies that w*(λᵢ) converges to w* strongly, since \n\nlimᵢ (w*(λᵢ) - w*)² = limᵢ w*(λᵢ)² + w*² - 2 limᵢ w*(λᵢ)ᵀw* = 0. \n\n3.2 Accuracy of the estimated hyperplane with a non-zero regularization parameter \n\nOur goal is to show that for the estimation method (3) with a nonzero regularization parameter λ > 0, the estimated parameter wₙ(λ) converges to the true parameter w*(λ) in probability as the sample size n → ∞. Furthermore, we give a large deviation bound on the rate of convergence. \n\nFrom (4), we obtain the following first order condition: \n\nE_D β(λ, x, y) xy + λ w*(λ) = 0,  (8) \n\nwhere β(λ, x, y) = f'(w*(λ)ᵀxy - 1) and f'(z) ∈ [-1, 0] denotes a member of the subgradient of f at z [9].² In the finite sample case, we can also interpret β(λ, x, y) in (8) as a scaled dual variable α: β = -α/C, where α appears in the dual (or kernel) formulation of an SVM (for example, see chapter 10 of [11]). \n\n²For readers not familiar with the subgradient concept in convex analysis, our analysis requires little modification if we replace f with a smoother convex function such as f², which avoids the discontinuity in the first order derivative. \n\nThe convexity of f implies that f(z₁) + (z₂ - z₁) f'(z₁) ≤ f(z₂) for any subgradient f' of f. This implies the following inequality: \n\n(1/n) Σᵢ f(w*(λ)ᵀxⁱyⁱ - 1) + (wₙ(λ) - w*(λ))ᵀ (1/n) Σᵢ β(λ, xⁱ, yⁱ) xⁱyⁱ ≤ (1/n) Σᵢ f(wₙ(λ)ᵀxⁱyⁱ - 1), \n\nwhich is equivalent to: \n\n(1/n) Σᵢ f(w*(λ)ᵀxⁱyⁱ - 1) + (λ/2) w*(λ)² + (wₙ(λ) - w*(λ))ᵀ[(1/n) Σᵢ β(λ, xⁱ, yⁱ) xⁱyⁱ + λ w*(λ)] + (λ/2)(w*(λ) - wₙ(λ))² ≤ (1/n) Σᵢ f(wₙ(λ)ᵀxⁱyⁱ - 1) + (λ/2) wₙ(λ)². \n\nAlso note that by the definition of wₙ(λ), we have: \n\n(1/n) Σᵢ f(wₙ(λ)ᵀxⁱyⁱ - 1) + (λ/2) wₙ(λ)² ≤ (1/n) Σᵢ f(w*(λ)ᵀxⁱyⁱ - 1) + (λ/2) w*(λ)². \n\nTherefore, by comparing the above two inequalities, we obtain: \n\n(λ/2)(w*(λ) - wₙ(λ))² ≤ (w*(λ) - wₙ(λ))ᵀ[(1/n) Σᵢ β(λ, xⁱ, yⁱ) xⁱyⁱ + λ w*(λ)] ≤ ‖w*(λ) - wₙ(λ)‖₂ ‖(1/n) Σᵢ β(λ, xⁱ, yⁱ) xⁱyⁱ + λ w*(λ)‖₂. \n\nTherefore we have \n\n‖w*(λ) - wₙ(λ)‖₂ ≤ (2/λ) ‖(1/n) Σᵢ β(λ, xⁱ, yⁱ) xⁱyⁱ + λ w*(λ)‖₂ = (2/λ) ‖(1/n) Σᵢ β(λ, xⁱ, yⁱ) xⁱyⁱ - E_D β(λ, x, y) xy‖₂.  (9) \n\nNote that in (9), we have bounded the convergence of wₙ(λ) to w*(λ) in terms of the convergence of the empirical expectation of the random vector β(λ, x, y) xy to its mean. In order to obtain a large deviation bound on the convergence rate, we need the following result, which can be found in [13], page 95: \n\nTheorem 3.1 Let ξᵢ be zero-mean independent random vectors in a Hilbert space. If there exists M > 0 such that for all natural numbers l ≥ 2: Σᵢ₌₁ⁿ E ‖ξᵢ‖₂ˡ ≤ (n b / 2) l! Mˡ, then for all δ > 0: P(‖(1/n) Σᵢ ξᵢ‖₂ ≥ δ) ≤ 2 exp(-(n/2) δ² / (b M² + δ M)). \n\nUsing the fact that β(λ, x, y) ∈ [-1, 0], it is easy to verify the following corollary from Theorem 3.1 and (9), where we also bound the l-th moment of the right hand side of (9) using the following form of Jensen's inequality: |a + b|ˡ ≤ 2^(l-1)(|a|ˡ + |b|ˡ) for l ≥ 2. \n\nCorollary 3.1 If there exists M > 0 such that for all natural numbers l ≥ 2: E_D ‖x‖₂ˡ ≤ (b/2) l! Mˡ, then for all δ > 0: \n\nP(‖w*(λ) - wₙ(λ)‖₂ ≥ δ) ≤ 2 exp(-(n/8) λ² δ² / (4 b M² + λ δ M)). \n\nLet P_D(·) denote probability with respect to the distribution D; then the following bound on the expected misclassification error of the computed hyperplane wₙ(λ) is a straightforward consequence of Corollary 3.1: \n\nCorollary 3.2 Under the assumptions of Corollary 3.1, for any non-random values λ, γ, K > 0, we have: \n\nE_X P_D(wₙ(λ)ᵀxy ≤ 0) ≤ P_D(w*(λ)ᵀxy ≤ γ) + P_D(‖x‖₂ ≥ K) + 2 exp(-(n/8) λ² γ² / (4 b K² M² + λ γ K M)), \n\nwhere the expectation E_X is taken over n random samples from D, with wₙ(λ) estimated from the n samples. \n\nWe now consider linearly separable classification problems where the solution w* of (5) is finite. Throughout the rest of this section, we impose the additional assumption that the distribution D is finitely supported: ‖x‖₂ ≤ M almost everywhere with respect to the measure D. \n\nFrom Section 3.1, we know that for any sufficiently small positive number λ, ‖w* - w*(λ)‖₂ < 1/M, which means that w*(λ) also separates the in-class data from the out-of-class data with a margin of at least 2(1 - M ‖w* - w*(λ)‖₂). Therefore for sufficiently small λ, we can define: \n\nγ(λ) = sup{b : P_D(w*(λ)ᵀxy ≤ b) = 0} ≥ 1 - M ‖w* - w*(λ)‖₂ > 0. \n\nBy Corollary 3.2, we obtain the following upper bound on the misclassification error if we compute a linear separator from (3) with a small non-zero regularization parameter λ: \n\nE_X P_D(wₙ(λ)ᵀxy ≤ 0) ≤ 2 exp(-(n/8) λ² γ(λ)² / (4 M⁴ + λ γ(λ) M²)). \n\nThis indicates that the expected misclassification error of an appropriately computed hyperplane for a linearly separable problem decays exponentially in n. However, the rate of convergence depends on λ γ(λ) / M². This quantity is different from the margin concept, which has been widely used in the literature to characterize the generalization behavior of a linear classification problem. 
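As a concrete illustration of the estimator analyzed above, the empirical objective in (3) can be minimized by plain subgradient descent. The following sketch is ours, not part of the paper: the function names, step sizes, and the toy data set are all illustrative. Note how the subgradient β(λ, x, y) = f'(wᵀxy - 1) is -1 exactly on margin violations, matching (8).

```python
# Illustrative sketch (not from the paper): subgradient descent on the
# empirical objective of (3),
#     (1/n) sum_i f(w^T x^i y^i - 1) + (lambda/2) w^T w,
# where f(z) = -z for z <= 0 and f(z) = 0 for z > 0 (the hinge loss).
# A subgradient is (1/n) sum_{i: y^i w^T x^i < 1} (-y^i x^i) + lambda * w.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_w_n(xs, ys, lam, steps=2000, eta=0.1):
    """Approximate w_n(lambda) of (3) by fixed-step subgradient descent."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wi for wi in w]        # gradient of (lambda/2) w^T w
        for x, y in zip(xs, ys):
            if y * dot(w, x) < 1.0:          # margin violation: f' = -1
                for j in range(d):
                    grad[j] -= y * x[j] / n
        w = [wi - eta * g for wi, g in zip(w, grad)]
    return w

# A small linearly separable sample (label = sign of the first coordinate).
xs = [[1.0, 0.2], [2.0, -0.5], [-1.5, 0.3], [-1.0, -0.4]]
ys = [1, 1, -1, -1]
w = fit_w_n(xs, ys, lam=0.01)
print(all(y * dot(w, x) > 0 for x, y in zip(xs, ys)))
```

With λ = 1/(nC), minimizing this objective is equivalent to the soft-margin formulation (2).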
The new quantity measures the convergence rate of w*(λ) to w* as λ → 0. The faster the convergence, the more \"robust\" the linear classification problem is, and hence the faster the exponential decay of the misclassification error. As we shall see in the next section, this \"robustness\" is related to the degree of outliers in the problem. \n\n4 Example \n\nWe give an example to illustrate the \"robustness\" concept that characterizes the exponential decay of the misclassification error. It is known from Vapnik's cross-validation bound in [11] (Theorem 10.7) that by using the large margin idea alone, one can derive an expected misclassification error bound of the order O(1/n), where the constant is margin dependent. We show that this bound is tight by using the following example. \n\nExample 4.1 Consider a two-dimensional problem. Assume that with probability 1 - γ, we observe a data point x with label y such that xy = [1, 0]; and with probability γ, we observe a data point x with label y such that xy = [-1, 1]. This problem is obviously linearly separable with a large margin that is independent of γ. \n\nNow, for n random training data, with probability at most γⁿ + (1 - γ)ⁿ, we observe either xⁱyⁱ = [1, 0] for all i = 1, ..., n, or xⁱyⁱ = [-1, 1] for all i = 1, ..., n. In all other cases, the computed optimal hyperplane is wₙ = w*. This means that the expected misclassification error is γ(1 - γ)(γⁿ⁻¹ + (1 - γ)ⁿ⁻¹). This error converges to zero exponentially as n → ∞. However, the convergence rate depends on the fraction of outliers in the distribution, characterized by γ. In particular, for any n, if we let γ = 1/n, then we have an expected misclassification error that is at least (1/n)(1 - 1/n)ⁿ ≈ 1/(en). 
The above tightness construction for the linear decay rate of the expected generalization error (using the margin concept alone) requires a scenario in which a small fraction of the data (on the order of the inverse sample size) is very different from the rest. This small portion of the data can be regarded as outliers, which are measured by the \"robustness\" of the distribution. In general, w*(λ) converges to w* slowly when there exists such a small portion of data (outliers) that cannot be correctly classified from the observation of the remaining data. It can be seen that the optimal hyperplane in (1) is quite sensitive to even a single outlier. Intuitively, this instability is quite undesirable. However, previous large margin learning bounds seem to have dismissed this concern. This paper indicates that such a concern is still valid. In the worst case, even if the problem is separable by a large margin, outliers can still slow down the exponential convergence rate. \n\n5 Conclusion \n\nIn this paper, we derived new generalization bounds for large margin linearly separable classification. Even though we have only discussed the consequences of this analysis for separable problems, the technique can be easily applied to non-separable problems (see Corollary 3.2). For large margin separable problems, we show that an exponential decay of the generalization error may be achieved with an appropriately chosen regularization parameter. However, the bound depends on a quantity which characterizes the robustness of the distribution. An important difference between the robustness concept and the margin concept is that outliers may not be observable with large probability from the data, while the margin generally will be. This implies that without any prior knowledge, it could be difficult to directly apply our bound using only the observed data. \n\nReferences \n\n[1] J.K. Anlauf and M. Biehl. 
The AdaTron: an adaptive perceptron algorithm. Europhys. Lett., 10(7):687-692, 1989. \n\n[2] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995. \n\n[3] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. \n\n[4] Harro G. Heuser. Functional Analysis. John Wiley & Sons Ltd., Chichester, 1982. Translated from the German by John Horvath, a Wiley-Interscience publication. \n\n[5] W. Kinzel. Statistical mechanics of the perceptron with maximal stability. In Lecture Notes in Physics, volume 368, pages 175-188. Springer-Verlag, 1990. \n\n[6] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132:1-64, 1997. \n\n[7] M. Opper. Learning times of neural networks: Exact solution for a perceptron algorithm. Phys. Rev. A, 38(7):3824-3826, 1988. \n\n[8] M. Opper. Learning in neural networks: Solvable dynamics. Europhysics Letters, 8(4):389-392, 1989. \n\n[9] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970. \n\n[10] Dale Schuurmans. Characterizing rational versus exponential learning curves. J. Comput. Syst. Sci., 55:140-160, 1997. \n\n[11] V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998. \n\n[12] Robert C. Williamson, Alexander J. Smola, and Bernhard Schölkopf. Entropy numbers of linear function classes. In COLT'00, pages 309-319, 2000. \n\n[13] Vadim Yurinsky. Sums and Gaussian Vectors. Springer-Verlag, Berlin, 1995. \n\n[14] Tong Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999. Abstract in NIPS'99, pp. 370-376. \n", "award": [], "sourceid": 1891, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}