{"title": "Exact Convex Confidence-Weighted Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 345, "page_last": 352, "abstract": "Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.", "full_text": "Exact Convex Con\ufb01dence-Weighted Learning\n\nKoby Crammer Mark Dredze Fernando Pereira\u2217\n\nDepartment of Computer and Information Science , University of Pennsylvania\n\nPhiladelphia, PA 19104\n\n{crammer,mdredze,pereira}@cis.upenn.edu\n\nAbstract\n\nCon\ufb01dence-weighted (CW) learning [6], an online learning method for linear clas-\nsi\ufb01ers, maintains a Gaussian distributions over weight vectors, with a covariance\nmatrix that represents uncertainty about weights and correlations. Con\ufb01dence\nconstraints ensure that a weight vector drawn from the hypothesis distribution\ncorrectly classi\ufb01es examples with a speci\ufb01ed probability. Within this framework,\nwe derive a new convex form of the constraint and analyze it in the mistake bound\nmodel. 
Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.

1 Introduction

Online learning methods for linear classifiers, such as the perceptron and passive-aggressive (PA) algorithms [4], have been thoroughly analyzed and are widely used. However, these methods do not model the strength of evidence for different weights arising from differences in the use of features in the data, which can be a serious issue in text classification, where weights of rare features should be trusted less than weights of frequent features.

Confidence-weighted (CW) learning [6], motivated by PA learning, explicitly models classifier weight uncertainty with a full multivariate Gaussian distribution over weight vectors. The PA geometrical margin constraint is replaced by the probabilistic constraint that a classifier drawn from the distribution should, with high probability, correctly classify the next example. While Dredze et al. [6] explained CW learning in terms of the standard deviation of the margin induced by the hypothesis Gaussian, in practice they used the margin variance to make the problem convex. In this work, we use their original constraint but maintain convexity, yielding experimental improvements. Our primary contributions are a mistake-bound analysis [11] and a comparison with related methods.

We emphasize that this work focuses on the question of uncertainty about feature weights, not on confidence in predictions. In large-margin classification, the margin's magnitude for an instance is sometimes taken as a proxy for prediction confidence for that instance, but that quantity is neither calibrated nor connected precisely to a measure of weight uncertainty.
Bayesian approaches to linear classification, such as Bayesian logistic regression [9], use a simple mathematical relationship between weight uncertainty and prediction uncertainty, which unfortunately cannot be computed exactly. CW learning preserves the convenient computational properties of PA algorithms while providing a precise connection between weight uncertainty and prediction confidence, which has led to weight updates that are more effective in practice [6, 5].

We begin with a review of the CW approach, then show that the constraint can be expressed in a convex form, and solve it to obtain a new CW algorithm. We also examine a dual representation that supports kernelization. Our analysis provides a mistake bound and indicates that the algorithm is invariant to initialization. Simulations show that our algorithm improves over first-order methods (perceptron and PA) as well as other second-order methods (second-order perceptron). We conclude with a review of related work.

∗Current affiliation: Google, Mountain View, CA 94043, USA.

2 Confidence-Weighted Linear Classification

The CW binary-classifier learner works in rounds. On round i, the algorithm applies its current linear classification rule h_w(x) = sign(w · x) to an instance x_i ∈ R^d to produce a prediction ŷ_i ∈ {−1, +1}, receives a true label y_i ∈ {−1, +1}, and suffers a loss ℓ(y_i, ŷ_i). The rule h_w can be identified with w up to a scaling, and we will do so in what follows since our algorithm will turn out to be scale-invariant.
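Under this model, the probability that a classifier drawn from N(µ, Σ) correctly labels an example is governed by the Gaussian margin: Pr[y(w · x) ≥ 0] = Φ(y(µ · x)/√(xᵀΣx)), where Φ is the standard normal CDF. A minimal sketch of this computation (the function name and toy values below are our own illustration, not from the paper):

```python
import numpy as np
from statistics import NormalDist

def correct_prob(mu, Sigma, x, y):
    """Pr[y (w . x) >= 0] for w ~ N(mu, Sigma): the margin y (w . x)
    is Gaussian with mean y (mu . x) and variance x^T Sigma x."""
    mean = y * float(mu @ x)
    std = float(np.sqrt(x @ Sigma @ x))
    return 1.0 - NormalDist(mean, std).cdf(0.0)

# Toy check: a confident weight estimate (small variance) vs. an uncertain one.
mu = np.array([1.0, 0.0])
x = np.array([1.0, 0.0])
p_confident = correct_prob(mu, 0.1 * np.eye(2), x, y=1)
p_uncertain = correct_prob(mu, 4.0 * np.eye(2), x, y=1)
```

The CW constraint requires this probability to be at least η; for η > 0.5 that is exactly the condition y(µ · x) ≥ φ √(xᵀΣx) with φ = Φ⁻¹(η), which is the form used throughout the derivations below.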
As usual, we define the margin of an example on round i as m_i = y_i(w_i · x_i), where a positive sign corresponds to a correct prediction.

CW classification captures the notion of confidence in the weights of a linear classifier with a probability density on classifier weight vectors, specifically a Gaussian distribution with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d}. The values µ_p and Σ_{p,p} represent knowledge of and confidence in the weight for feature p. The smaller Σ_{p,p}, the more confidence we have in the mean weight value µ_p. Each covariance term Σ_{p,q} captures our knowledge of the interaction between features p and q.

In the CW model, the traditional signed margin is the mean of the induced univariate Gaussian random variable

    M ∼ N( y(µ · x), xᵀΣx ) .   (1)

This probabilistic model can be used for prediction in different ways. Here, we use the average weight vector E[w] = µ, analogous to Bayes point machines [8]. The information captured by the covariance Σ is then used just to adjust training updates.

3 Update Rule

The CW update rule of Dredze et al. [6] makes the smallest adjustment to the distribution that ensures the probability of a correct prediction on instance i is no smaller than the confidence hyperparameter η ∈ [0, 1]: Pr[y_i(w · x_i) ≥ 0] ≥ η. The magnitude of the update is measured by its KL divergence to the previous distribution, yielding the following constrained optimization:

    (µ_{i+1}, Σ_{i+1}) = argmin_{µ,Σ} D_KL( N(µ, Σ) ‖ N(µ_i, Σ_i) )   s.t.   Pr[y_i(w · x_i) ≥ 0] ≥ η .   (2)

They rewrite the above optimization in terms of the standard deviation as:

    min_{µ,Σ} (1/2)( log( det Σ_i / det Σ ) + Tr( Σ_i⁻¹ Σ ) + (µ_i − µ)ᵀ Σ_i⁻¹ (µ_i − µ) )   s.t.   y_i(µ · x_i) ≥ φ √( x_iᵀ Σ x_i ) .   (3)

Unfortunately, while the constraint of this problem is linear in µ, it is not convex in Σ. Dredze et al. [6, eq. (7)] circumvented that lack of convexity by removing the square root from the right-hand side of the constraint, which yields the variance. However, we found that the original optimization can be preserved while maintaining convexity with a change of variable. Since Σ is positive semidefinite (PSD), it can be written as Σ = Υ² with Υ = Q diag(λ_1^{1/2}, ..., λ_d^{1/2}) Qᵀ, where Q is orthonormal and λ_1, ..., λ_d are the eigenvalues of Σ; Υ is thus also PSD. This change yields the following convex optimization with a constraint that is convex in µ and Υ simultaneously:

    (µ_{i+1}, Υ_{i+1}) = argmin_{µ,Υ} (1/2)( log( det Υ_i² / det Υ² ) + Tr( Υ_i⁻² Υ² ) + (µ_i − µ)ᵀ Υ_i⁻² (µ_i − µ) )   s.t.   y_i(µ · x_i) ≥ φ ‖Υ x_i‖ , Υ is PSD .   (4)

We call our algorithm CW-Stdev and the original algorithm of Dredze et al. CW-Var.

3.1 Closed-Form Update

While standard optimization techniques can solve the convex program (4), we favor a closed-form solution. Omitting the PSD constraint for now, we obtain the Lagrangian for (4),

    L = (1/2)( log( det Υ_i² / det Υ² ) + Tr( Υ_i⁻² Υ² ) + (µ_i − µ)ᵀ Υ_i⁻² (µ_i − µ) ) + α( −y_i(µ · x_i) + φ ‖Υ x_i‖ ) .   (5)

Input parameters: a > 0 ; η ∈ [0.5, 1]
Initialize: µ_1 = 0 , Σ_1 = aI , φ = Φ⁻¹(η) , ψ = 1 + φ²/2 , ξ = 1 + φ²
For i = 1, ..., n
  • Receive a training example x_i ∈ R^d
  • Compute the Gaussian margin distribution M_i ∼ N( (µ_i · x_i), x_iᵀ Σ_i x_i )
  • Receive the true label y_i and compute
      v_i = x_iᵀ Σ_i x_i ,  m_i = y_i(µ_i · x_i)   (11)
      α_i = max( 0, (1/(v_i ξ)) ( −m_i ψ + √( m_i² φ⁴/4 + v_i φ² ξ ) ) )   (14)
      u_i = (1/4)( −α_i v_i φ + √( α_i² v_i² φ² + 4 v_i ) )²   (12)
      β_i = α_i φ / ( √u_i + v_i α_i φ )   (22)
  • Update
      µ_{i+1} = µ_i + α_i y_i Σ_i x_i   (10)
      Σ_{i+1} = Σ_i − β_i Σ_i x_i x_iᵀ Σ_i   (full) (10)
      Σ_{i+1} = ( Σ_i⁻¹ + α_i φ u_i^{−1/2} diag²(x_i) )⁻¹   (diag) (15)
Output: the Gaussian distribution N( µ_{n+1}, Σ_{n+1} ).

Figure 1: The CW-Stdev algorithm. The numbers in parentheses refer to equations in the text.

At the optimum, it must be that

    ∂L/∂µ = Υ_i⁻² (µ − µ_i) − α y_i x_i = 0   ⇒   µ_{i+1} = µ_i + α y_i Υ_i² x_i ,   (6)

where we assumed that Υ_i is non-singular (PSD).
At the optimum, we must also have,

    ∂L/∂Υ = −Υ⁻¹ + (1/2) Υ_i⁻² Υ + (1/2) Υ Υ_i⁻² + αφ ( x_i x_iᵀ Υ + Υ x_i x_iᵀ ) / ( 2 √( x_iᵀ Υ² x_i ) ) = 0 ,   (7)

from which we obtain the implicit-form update

    Υ_{i+1}⁻² = Υ_i⁻² + αφ x_i x_iᵀ / √( x_iᵀ Υ_{i+1}² x_i ) .   (8)

Conveniently, these updates can be expressed in terms of the covariance matrix¹:

    µ_{i+1} = µ_i + α y_i Σ_i x_i ,   Σ_{i+1}⁻¹ = Σ_i⁻¹ + αφ x_i x_iᵀ / √( x_iᵀ Σ_{i+1} x_i ) .   (9)

We observe that (9) computes Σ_{i+1}⁻¹ as the sum of a rank-one PSD matrix and Σ_i⁻¹. Thus, if Σ_i⁻¹ has strictly positive eigenvalues, so do Σ_{i+1}⁻¹ and Σ_{i+1}. Thus, Σ_i and Υ_i are indeed PSD non-singular, as assumed above.

3.2 Solving for the Lagrange Multiplier α

We now determine the value of the Lagrange multiplier α and make the covariance update explicit. We start by computing the inverse of (9) using the Woodbury identity [14, Eq. 135] to get

    Σ_{i+1} = ( Σ_i⁻¹ + αφ x_i x_iᵀ / √( x_iᵀ Σ_{i+1} x_i ) )⁻¹ = Σ_i − Σ_i x_i ( αφ / ( √( x_iᵀ Σ_{i+1} x_i ) + x_iᵀ Σ_i x_i αφ ) ) x_iᵀ Σ_i .   (10)

Let

    u_i = x_iᵀ Σ_{i+1} x_i ,  v_i = x_iᵀ Σ_i x_i ,  m_i = y_i(µ_i · x_i) .   (11)

¹Furthermore, writing the Lagrangian of (3) and solving it would yield the same solution as Eqns. (9). Thus the optimal solutions of both (3) and (4) are the same.

Multiplying (10) by x_iᵀ (left) and x_i (right) we get u_i = v_i − v_i ( αφ / ( √u_i + v_i αφ ) ) v_i , which can be solved for √u_i to obtain

    √u_i = ( −α v_i φ + √( α² v_i² φ² + 4 v_i ) ) / 2 .   (12)

The KKT conditions for the optimization imply that either α = 0 and no update is needed, or the constraint of (4) is an equality after the update. Using the equality version of (4) and Eqs. (9,10,11,12) we obtain m_i + α v_i = φ ( −α v_i φ + √( α² v_i² φ² + 4 v_i ) ) / 2 , which can be rearranged into a quadratic equation in α:

    α² v_i² (1 + φ²) + 2 α m_i v_i (1 + φ²/2) + ( m_i² − v_i φ² ) = 0 .

The smaller root of this equation is always negative and thus not a valid Lagrange multiplier. We use the following abbreviations for writing the larger root γ_i: ψ = 1 + φ²/2 ; ξ = 1 + φ². The larger root is then

    γ_i = ( −m_i v_i ψ + √( m_i² v_i² ψ² − v_i² ξ ( m_i² − v_i φ² ) ) ) / ( v_i² ξ ) .   (13)

The constraint (4) is satisfied before the update if m_i − φ√v_i ≥ 0. If m_i ≤ 0, then m_i ≤ φ√v_i and from (13) we have that γ_i > 0. If, instead, m_i ≥ 0, then, again by (13), we have

    γ_i > 0  ⇔  m_i v_i ψ < √( m_i² v_i² ψ² − v_i² ξ ( m_i² − v_i φ² ) )  ⇔  m_i < φ √v_i .

From the KKT conditions, either α_i = 0 or (3) is satisfied as an equality and α_i = γ_i > 0.
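Collecting Eqs. (11), (14) and (12) with the updates (10)/(15), the diagonal-covariance variant of Fig. 1 admits a compact implementation. The sketch below is our own reading of the algorithm, not the authors' code; the class name and the default η are arbitrary choices:

```python
import numpy as np
from statistics import NormalDist

class CWStdevDiag:
    """Sketch of CW-Stdev with a diagonal covariance matrix (Fig. 1)."""

    def __init__(self, d, a=1.0, eta=0.9):
        self.mu = np.zeros(d)
        self.sigma = a * np.ones(d)            # diagonal of Sigma_i
        self.phi = NormalDist().inv_cdf(eta)   # phi = Phi^{-1}(eta)
        self.psi = 1.0 + self.phi ** 2 / 2.0
        self.xi = 1.0 + self.phi ** 2

    def predict(self, x):
        return 1.0 if self.mu @ x >= 0 else -1.0

    def update(self, x, y):
        v = self.sigma @ (x * x)               # v_i = x^T Sigma x        (11)
        m = y * (self.mu @ x)                  # m_i = y_i (mu_i . x_i)   (11)
        alpha = max(0.0, (-m * self.psi +
                          np.sqrt(m * m * self.phi ** 4 / 4 + v * self.phi ** 2 * self.xi))
                    / (v * self.xi))           # larger root, clipped at 0 (14)
        if alpha == 0.0:
            return                             # confidence constraint already holds
        sqrt_u = 0.5 * (-alpha * v * self.phi +
                        np.sqrt(alpha ** 2 * v ** 2 * self.phi ** 2 + 4.0 * v))       # (12)
        self.mu = self.mu + alpha * y * self.sigma * x                                # (10)
        self.sigma = 1.0 / (1.0 / self.sigma + alpha * self.phi / sqrt_u * x * x)     # (15)
```

After an update, the constraint should hold with equality, m_i + α_i v_i = φ √u_i, which makes a handy sanity check when experimenting with this sketch.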
We summarize the discussion in the following lemma:

Lemma 1 The solution of (13) satisfies the KKT conditions, that is, either α_i ≥ 0 or the constraint of (3) is satisfied before the update with the parameters µ_i and Σ_i.

We obtain the final form of α_i by simplifying (13) together with Lemma 1,

    α_i = max{ 0 , (1/(v_i ξ)) ( −m_i ψ + √( m_i² φ⁴/4 + v_i φ² ξ ) ) } .   (14)

To summarize, after receiving the correct label y_i the algorithm checks whether the probability of a correct prediction under the current parameters is greater than a confidence threshold η = Φ(φ). If so, it does nothing. Otherwise it performs an update as described above. We initialize µ_1 = 0 and Σ_1 = aI for some a > 0. The algorithm is summarized in Fig. 1.

Two comments are in order. First, if η = 0.5, then from Eq. (9) we see that only µ will be updated, not Σ, because φ = 0 ⇔ η = 0.5. In this case the covariance parameter Σ does not influence the decision, only the mean µ. Furthermore, for length-one input vectors, at the first round we have Σ_1 = aI, so the first-round constraint is y_i(w_i · x_i) ≥ a‖x_i‖² = a, which is equivalent to the original PA update.

Second, the update described above yields full covariance matrices. However, sometimes we may prefer diagonal covariance matrices, which can be achieved by projecting the matrix Σ_{i+1} that results from the update onto the set of diagonal matrices. In practice this requires setting all the off-diagonal elements to zero, leaving only the diagonal elements. In fact, if Σ_i is diagonal then we only need to project x_i x_iᵀ to a diagonal matrix.
We thus replace (9) with the following update,

    Σ_{i+1}⁻¹ = Σ_i⁻¹ + φ ( α_i / √u_i ) diag²(x_i) ,   (15)

where diag²(x_i) is a diagonal matrix made from the squares of the elements of x_i on the diagonal. Note that for diagonal matrices there is no need to use the Woodbury equation to compute the inverse, as it can be computed directly element-wise. We use CW-Stdev (or CW-Stdev-full) to refer to the full-covariance algorithm, and CW-Stdev-diag to refer to the diagonal-covariance algorithm.

Finally, the following property of our algorithm shows that it can be used with Mercer kernels:

Theorem 2 (Representer Theorem) The mean µ_i and covariance Σ_i parameters computed by the algorithm in Fig. 1 can be written as linear combinations of the input vectors with coefficients that depend only on inner products of input vectors:

    Σ_i = ∑_{p,q=1}^{i−1} π^{(i)}_{p,q} x_p x_qᵀ + aI ,   µ_i = ∑_{p=1}^{i−1} ν^{(i)}_p x_p .   (16)

The proof, given in the appendix, is a simple induction.

4 Analysis

We analyze CW-Stdev in two steps. First, we show that performance does not depend on initialization, and then we compute a bound on the number of mistakes that the algorithm makes.

4.1 Invariance to Initialization

The algorithm in Fig. 1 uses a predefined parameter a to initialize the covariance matrix. Since the decision to update depends on the covariance matrix, which implicitly depends on a through α_i and v_i, one may assume that a affects performance. In fact the number of mistakes is independent of a, i.e. the constraint of (3) is invariant to scaling. Specifically, if it holds for mean and covariance parameters µ and Σ, it holds also for the scaled parameters cµ and c²Σ for any c > 0. The following lemma states that the scaling is controlled by a. Thus, we can always initialize the algorithm with a value of a = 1.
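This scaling can be checked numerically on the per-round quantities: initializing with (0, aI) instead of (0, I) multiplies m_i by √a and v_i by a, and Eqs. (14) and (12) then return α_i/√a and a·u_i. A small self-contained check (the particular numbers below are arbitrary illustration values):

```python
import numpy as np

def step(m, v, phi):
    """One round's alpha (eq. 14) and u (eq. 12) from margin m and variance v."""
    psi, xi = 1 + phi ** 2 / 2, 1 + phi ** 2
    alpha = max(0.0, (-m * psi + np.sqrt(m ** 2 * phi ** 4 / 4 + v * phi ** 2 * xi))
                / (v * xi))
    sqrt_u = 0.5 * (-alpha * v * phi + np.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4 * v))
    return alpha, sqrt_u ** 2

phi, m, v, a = 1.2816, 0.4, 1.5, 9.0             # arbitrary values, phi = Phi^{-1}(0.9)
alpha, u = step(m, v, phi)
alpha_s, u_s = step(np.sqrt(a) * m, a * v, phi)  # quantities under the (0, aI) start

assert np.isclose(alpha_s, alpha / np.sqrt(a))   # alpha scales as 1/sqrt(a)
assert np.isclose(u_s, a * u)                    # u scales as a
```

In particular m/√v, and hence the decision to update, is unchanged, which is the invariance claimed above.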
If, in addition to predictions, we also need the distribution over weight vectors, the scale parameter a should be calibrated.

Lemma 3 Fix a sequence of examples (x_1, y_1) ... (x_n, y_n). Let Σ_i, µ_i, m_i, v_i, α_i, u_i be the quantities obtained throughout the execution of the algorithm described in Fig. 1 initialized with (0, I) (a = 1). Let also Σ̃_i, µ̃_i, m̃_i, ṽ_i, α̃_i, ũ_i be the corresponding quantities obtained throughout the execution of the algorithm with an alternative initialization of (0, aI) (for some a > 0). The following relations between the two sets of quantities hold:

    m̃_i = √a m_i ,  ṽ_i = a v_i ,  α̃_i = (1/√a) α_i ,  µ̃_i = √a µ_i ,  ũ_i = a u_i ,  Σ̃_i = a Σ_i .   (17)

Proof sketch: The proof proceeds by induction. The initial values of these quantities clearly satisfy the required equalities. For the induction step we assume that (17) holds for some i and show that these identities also hold for i + 1 using Eqs. (9,14,11,12).

From the lemma we see that the quantity m̃_i/√ṽ_i = m_i/√v_i is invariant to a. Therefore, the behavior of the algorithm in general, and its updates and mistakes in particular, are independent of the choice of a. Therefore, we assume a = 1 in what follows.

4.2 Analysis in the Mistake Bound Model

The main theorem of the paper bounds the number of mistakes made by CW-Stdev.

Theorem 4 Let (x_1, y_1) ... (x_n, y_n) be an input sequence for the algorithm of Fig. 1, initialized with (0, I), with x_i ∈ R^d and y_i ∈ {−1, +1}. Assume there exist µ* and Σ* such that for all i for which the algorithm made an update (α_i > 0),

    µ*ᵀ x_i y_i ≥ µ_{i+1}ᵀ x_i y_i   and   x_iᵀ Σ* x_i ≤ x_iᵀ Σ_{i+1} x_i .   (18)

Then the following holds:

    no. mistakes ≤ ∑_i α_i² v_i ≤ ((1 + φ²)/φ²) ( −log det Σ* + Tr(Σ*) + µ*ᵀ Σ_{n+1}⁻¹ µ* − d ) .   (19)

Figure 2: (a) The average and standard deviation of the cumulative number of mistakes for seven algorithms. (b) The average and standard deviation of test error (%) over unseen data for the seven algorithms. (c) Comparison between CW-Stdev-diag and CW-Var-diag on text classification.

The proof is given in the appendix.

The above bound depends on an output of the algorithm, Σ_{n+1}, similar to the bound for the second-order perceptron [3]. The two conditions (18) imply linear separability of the input sequence by µ*:

    µ*ᵀ x_i y_i ≥ µ_{i+1}ᵀ x_i y_i ≥ φ √( x_iᵀ Σ_{i+1} x_i ) ≥ φ √( x_iᵀ Σ* x_i ) ≥ φ min_i √( x_iᵀ Σ* x_i ) > 0 ,

where the first and third inequalities use (18) and the second uses (4). From (10), we observe that Σ_{i+1} ⪯ Σ_i for all i, so Σ_{n+1} ⪯ Σ_{i+1} ⪯ Σ_1 = I for all i. Therefore, the conditions on Σ* in (18) are satisfied by Σ* = Σ_{n+1}.
Furthermore, if µ* satisfies the stronger conditions y_i(µ* · x_i) ≥ ‖x_i‖, from Σ_{i+1} ⪯ I above it follows that

    (φµ*)ᵀ x_i y_i ≥ φ‖x_i‖ = φ √( x_iᵀ I x_i ) ≥ φ √( x_iᵀ Σ_{i+1} x_i ) = µ_{i+1}ᵀ x_i y_i ,

where the last equality holds since we assumed that an update was made for the ith example. In this situation, the bound becomes

    ((φ² + 1)/φ²) ( −log det Σ_{n+1} + Tr(Σ_{n+1}) − d ) + (φ² + 1) µ*ᵀ Σ_{n+1}⁻¹ µ* .

The quantity µ*ᵀ Σ_{n+1}⁻¹ µ* in this bound is analogous to the quantity R² ‖µ*‖² in the perceptron bound [13], except that the norm of the examples does not come in explicitly as the radius R of the enclosing ball, but implicitly through the fact that Σ_{n+1}⁻¹ is a sum of example outer products (9). In addition, in this version of the bound we impose a margin of 1 under the condition that examples have unit norm, whereas in the perceptron bound, the margin of 1 is for examples with arbitrary norm. This follows from the fact that (4) is invariant to the norm of x_i.

5 Empirical Evaluation

We illustrate the benefits of CW-Stdev with synthetic data experiments. We generated 1,000 points in R^20 where the first two coordinates were drawn from a 45° rotated Gaussian distribution with standard deviation 1. The remaining 18 coordinates were drawn from independent Gaussian distributions N(0, 2). Each point's label depended on the first two coordinates using a separator parallel to the long axis of the ellipsoid, yielding a linearly separable set (Fig. 3(top)). We evaluated seven online learning algorithms: the perceptron [16], the passive-aggressive (PA) algorithm [4], the second-order perceptron (SOP) [3], CW-Var-diag, CW-Var-full [6], CW-Stdev-diag and CW-Stdev-full. All algorithm parameters were tuned over 1,000 runs.
Fig. 
2(a) shows the average cumulative mistakes for each algorithm; error bars indicate one unit of standard deviation. Clearly, the second-order algorithms, which all made fewer than 80 mistakes, outperform the first-order ones, which made at least 129 mistakes. Additionally, CW-Var makes more mistakes than CW-Stdev: 8% more in the diagonal case and 17% more in the full case. The diagonal methods performed better than the first-order methods, indicating that while they do not use any second-order information, they capture additional information for single features. For each repetition, we evaluated the resulting classifiers on 10,000 unseen test examples (Fig. 2(b)). Averaging improved the first-order methods. The second-order methods outperform the first-order methods, and CW-Stdev outperforms all the other methods. Also, the full case is less sensitive across runs.

The Gaussian distribution over weight vectors after 50 rounds is represented in Fig. 3(bottom). The 20 dimensions of the version space are grouped into 10 pairs, the first containing the two meaningful features. The dotted segment represents the first two coordinates of possible representations of the true hyperplane in the positive quadrant. Clearly, the corresponding vectors are orthogonal to the hyperplane shown in Fig. 3(top). The solid black ellipsoid represents the first two significant feature weights; it does not yet lie on the dotted segment because the algorithm has not converged. Nevertheless, the long axis is already parallel to the true set of possible weight vectors. The axis perpendicular to the weight-vector set is very small, showing that there is little freedom in that direction. The remaining nine ellipsoids represent the covariance of pairs of noise features.
Those ellipsoids are close to circular and have centers close to the origin, indicating that the corresponding feature weights should be near zero but without much confidence.

NLP Evaluation: We compared CW-Stdev-diag with CW-Var-diag, which beat many state-of-the-art algorithms on 12 NLP datasets [6]. We followed the same evaluation setting using 10-fold cross validation and the same splits for both algorithms. Fig. 2(c) compares the accuracy on test data of each algorithm; points above the line represent improvements of CW-Stdev over CW-Var. Stdev improved on eight of the twelve datasets and, while the improvements are not significant, they show the effectiveness of our algorithm on real-world data.

6 Related Work

Online additive algorithms have a long history, from the perceptron [16] to more recent methods [10, 4]. Our update has a more general form, in which the input vector x_i is linearly transformed using the covariance matrix, both rotating the input and assigning weight-specific learning rates. Weight-specific learning rates appear in neural-network learning [18], although they do not model confidence based on feature variance.

The second-order perceptron (SOP) [3] demonstrated that second-order information can improve on first-order methods. Both SOP and CW maintain second-order information. SOP is mistake driven while CW is passive-aggressive. SOP uses the current instance in the correlation matrix for prediction while CW updates after prediction.
A variant of CW-Stdev similar to SOP follows from our derivation if we fix the Lagrange multiplier in (5) to a predefined value α_i = α, omit the square root, and use a gradient-descent optimization step. Fundamentally, CW algorithms have a probabilistic motivation, while the SOP is geometric: replace the ball around an example with a refined ellipsoid. Shivaswamy and Jebara [17] used a similar motivation in batch learning.

Figure 3: Top: Plot of the two informative features of the synthetic data. Bottom: Feature weight distributions of CW-Stdev-full after 50 examples.

Ensemble learning shares the idea of combining multiple classifiers. Gaussian process classification (GPC) maintains a Gaussian distribution over weight vectors (primal) or over regressor values (dual). Our algorithm uses a different update criterion than the standard GPC Bayesian updates [15, Ch. 3], avoiding the challenge of approximating posteriors. Bayes point machines [8] maintain a collection of weight vectors consistent with the training data, and use the single linear classifier which best represents the collection. Conceptually, the collection is a non-parametric distribution over the weight vectors. Its online version [7] maintains a finite number of weight vectors which are updated simultaneously. The relevance vector machine [19] incorporates probabilistic models into the dual formulation of SVMs. As in our work, the dual parameters are random variables distributed according to a diagonal Gaussian with example-specific variance. The weighted-majority [12] algorithm and later improvements [2] combine the output of multiple arbitrary classifiers, maintaining a multinomial distribution over the experts.
We assume linear classifiers as experts and maintain a Gaussian distribution over their weight vectors.

7 Conclusion

We presented a new confidence-weighted learning method for linear classifiers based on the standard deviation. We have shown that the algorithm is invariant to scaling and we provided a mistake-bound analysis. Based on both synthetic and NLP experiments, we have shown that our method improves upon recent first- and second-order methods. Our method also improves on previous CW algorithms. We are now investigating special cases of CW-Stdev for problems with very large numbers of features, multi-class classification, and batch training.

References

[1] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York, NY, USA, 1997.
[2] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the Association for Computing Machinery, 44(3):427–485, May 1997.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.
[4] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
[5] M. Dredze and K. Crammer. Active learning with confidence. In ACL, 2008.
[6] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning, 2008.
[7] E. Harrington, R. Herbrich, J. Kivinen, J. Platt, and R. C. Williamson. Online Bayes point machines. In 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2003.
[8] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. JMLR, 1:245–279, 2001.
[9] T. Jaakkola and M. Jordan.
A variational approach to Bayesian logistic regression models and their extensions. In Workshop on Artificial Intelligence and Statistics, 1997.
[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.
[11] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
[12] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[13] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[14] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, 2007.
[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[16] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[17] P. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In AISTATS, 2007.
[18] R. S. Sutton. Adapting bias by gradient descent: an incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992.
[19] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
[20] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation.
In AAAI-2006, 2006.
", "award": [], "sourceid": 465, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Mark", "family_name": "Dredze", "institution": null}, {"given_name": "Fernando", "family_name": "Pereira", "institution": null}]}