{"title": "New Adaptive Algorithms for Online Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1840, "page_last": 1848, "abstract": "We propose a general framework to online learning for   classification problems with time-varying potential functions in the   adversarial setting. This framework allows to design and prove   relative mistake bounds for any generic loss function. The mistake   bounds can be specialized for the hinge loss, allowing to recover   and improve the bounds of known online classification   algorithms. By optimizing the general bound we derive a new online   classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order   information. We analyze the properties of the algorithm and   illustrate its performance using synthetic dataset.", "full_text": "New Adaptive Algorithms for Online Classification\n\nFrancesco Orabona\nDSI\nUniversit\u00e0 degli Studi di Milano\nMilano, 20135 Italy\norabona@dsi.unimi.it\n\nKoby Crammer\nDepartment of Electrical Engineering\nThe Technion\nHaifa, 32000 Israel\nkoby@ee.technion.ac.il\n\nAbstract\n\nWe propose a general framework for online learning for classification problems with time-varying potential functions in the adversarial setting. This framework makes it possible to design algorithms and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, which allows us to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that uses a hybrid of adaptive and fixed second order information. We analyze the properties of the algorithm and illustrate its performance using a synthetic dataset.\n\n1 Introduction\n\nLinear discriminative online algorithms have been shown to perform very well on binary and multiclass labeling problems [10, 6, 14, 3]. 
These algorithms work in rounds, where at each round a new instance is given and the algorithm makes a prediction. After the true class of the instance is revealed, the learning algorithm updates its internal hypothesis. Often, such an update takes place only on rounds where the online algorithm makes a prediction mistake or when the confidence in the prediction is not sufficient. The aim of the classifier is to minimize the cumulative loss it suffers due to its predictions, such as the total number of mistakes.\nUntil a few years ago, most of these algorithms used only first-order information of the input features. Recently [1, 8, 4, 12, 5, 9], researchers proposed to improve online learning algorithms by incorporating second order information. Specifically, the Second-Order-Perceptron (SOP) proposed by Cesa-Bianchi et al. [1] builds on the famous Perceptron algorithm with an additional data-dependent time-varying \u201cwhitening\u201d step. Confidence weighted learning (CW) [8, 4] and the adaptive regularization of weights algorithm (AROW) [5] are motivated from an alternative view: maintaining confidence in the weights of the linear models maintained by the algorithm. Both CW and AROW use the input data to modify the weights as well as the confidence in them. CW and AROW are motivated by the specific properties of natural-language-processing (NLP) data and indeed were shown to perform very well in practice, and on NLP problems in particular. However, the theoretical foundations of this empirical success were not known, especially when using only the diagonal elements of the second order information matrix. Filling this gap is one contribution of this paper.\nIn this paper we extend and generalize the framework for deriving algorithms and analyzing them through a potential function [2]. Our framework contains as a special case the second order Perceptron and a (variant of) AROW. 
It can also be used to derive new algorithms based on other loss functions.\nFor carefully designed algorithms, it is possible to bound the cumulative loss on any sequence of samples, even adversarially chosen [2]. In particular, many of the recent analyses are based on the online convex optimization framework, which focuses on minimizing the sum of convex functions. Two common viewpoints for online convex optimization are regularization [15] and primal-dual progress [16, 17, 13]. Recently new bounds have been proposed for time-varying regularizations in [18, 9], focusing on the general case of regression problems. The proof technique derived from our framework extends the work of Kakade et al. [13] to support time-varying potential functions. We also show how the use of widely used classification losses, such as the hinge loss, allows us to derive new powerful mistake bounds superior to existing bounds. Moreover the framework introduced supports the design of aggressive algorithms, i.e. algorithms that update their hypothesis not only when they make a prediction mistake.\nFinally, current second order algorithms suffer from a common problem. All these algorithms maintain the cumulative second-moment of the input features, and its inverse, qualitatively speaking, is used as a learning rate. Thus, if there is a single feature with large second-moment in the prefix of the input sequence, its effective learning rate would drop to a relatively low value, and the learning algorithm will take more time to update its value. When the instances are ordered such that the value of this feature seems to be correlated with the target label, such algorithms will set the value of the weight corresponding to this feature to a wrong value and will decrease its associated learning rate to a low value. This combination makes it hard to recover from the wrong value set to the weight associated with this feature. 
Our final contribution is a new algorithm that adapts the way the second order information is used. We call this algorithm Narrow Adaptive Regularization Of Weights (NAROW). Intuitively, it interpolates its update rule from adaptive-second-order-information to fixed-second-order-information, to have a narrower decrease of the learning rate for commonly appearing features. We derive a bound for this algorithm and illustrate its properties using synthetic data simulations.\n\n2 Online Learning for Classification\n\nWe work in the online binary classification scenario where learning algorithms work in rounds. At each round $t$, an instance $x_t \\in \\mathbb{R}^d$ is presented to the algorithm, which then predicts a label $\\hat{y}_t \\in \\{-1, +1\\}$. Then, the correct label $y_t$ is revealed, and the algorithm may modify its hypothesis. The aim of the online learning algorithm is to make as few mistakes as possible (on any sequence of samples/labels $\\{(x_t, y_t)\\}_{t=1}^T$). In this paper we focus on linear prediction functions of the form $\\hat{y}_t = \\mathrm{sign}(w_t^\\top x_t)$.\nWe strive to design online learning algorithms for which it is possible to prove a relative mistake bound or a loss bound. Typically, such an analysis bounds the cumulative loss the algorithm suffers, $\\sum_{t=1}^T \\ell(w_t, x_t, y_t)$, with the cumulative loss of any classifier $u$ plus an additional penalty called the regret, $R(u) + \\sum_{t=1}^T \\ell(u, x_t, y_t)$. Given that we focus on classification, we are more interested in relative mistake bounds, where we bound the number of mistakes of the learner with $R(u) + \\sum_{t=1}^T \\ell(u, x_t, y_t)$. Since the classifier $u$ is arbitrary, we can choose, in particular, the best classifier that can be found in hindsight given all the samples. Often $R(\\cdot)$ depends on a function measuring the complexity of $u$ and the number of samples $T$, and $\\ell$ is a non-negative loss function. 
Usually $\\ell$ is chosen to be a convex upper bound of the 0/1 loss. We will also denote $\\ell_t(u) = \\ell(u, x_t, y_t)$. In the following we denote by $\\mathcal{M}$ the set of round indexes on which the algorithm made a mistake. We assume that the algorithm always updates its rule in such events. Similarly, we denote by $\\mathcal{U}$ the set of the margin error rounds, that is, rounds in which the algorithm updates its hypothesis and the prediction is correct, but the loss $\\ell_t(w_t)$ is different from zero. Their cardinality will be indicated with $M$ and $U$ respectively. Formally, $\\mathcal{M} = \\{t : \\mathrm{sign}(w_t^\\top x_t) \\neq y_t \\text{ and } w_t \\neq w_{t+1}\\}$, and $\\mathcal{U} = \\{t : \\mathrm{sign}(w_t^\\top x_t) = y_t \\text{ and } w_t \\neq w_{t+1}\\}$. An algorithm that updates its hypothesis only on mistake rounds is called conservative (e.g. [3]). Following the previous naming convention [3], we call aggressive an algorithm that updates its rule on rounds for which the loss $\\ell_t(w_t)$ is different from zero, even if its prediction was correct.\nWe now define a few basic concepts from convex analysis that will be used in the paper. Given a convex function $f : X \\to \\mathbb{R}$, its sub-gradient $\\partial f(v)$ at $v$ satisfies: $\\forall u \\in X,\\ f(u) - f(v) \\geq (u - v) \\cdot \\partial f(v)$. The Fenchel conjugate of $f$, $f^* : S \\to \\mathbb{R}$, is defined by $f^*(u) = \\sup_{v \\in S} \\big( v \\cdot u - f(v) \\big)$. A differentiable function $f : X \\to \\mathbb{R}$ is $\\beta$-strongly convex w.r.t. a norm $\\|\\cdot\\|$ if for any $u, v \\in S$ and $\\alpha \\in (0, 1)$, $f(\\alpha u + (1 - \\alpha) v) \\leq \\alpha f(u) + (1 - \\alpha) f(v) - \\frac{\\beta}{2} \\alpha (1 - \\alpha) \\|u - v\\|^2$. Strong convexity turns out to be a key property to design online learning algorithms.\n\n3 General Algorithm and Analysis\n\nWe now introduce a general framework to design online learning algorithms and a general lemma which serves as a general tool to prove their relative regret bounds. Our algorithm builds on previous algorithms for online convex programming with one significant difference. Instead of using a fixed link function as first order algorithms do, we allow a sequence of link functions $f_t(\\cdot)$, one for each time $t$. In a nutshell, the algorithm maintains a weight vector $\\theta_t$. Given a new example it uses the current link function $f_t$ to compute a prediction weight vector $w_t$. After the target label is received it sets the new weight $\\theta_{t+1}$ to be $\\theta_t$ minus $\\eta_t$ times the sub-gradient of the loss at $w_t$. The algorithm is summarized in Fig. 1.\nThe following lemma is a generalization of Corollary 7 in [13] and Corollary 3 in [9], for online learning. All the proofs can be found in the Appendix.\nLemma 1. Let $f_t$, $t = 1, \\ldots, T$ be $\\beta_t$-strongly convex functions with respect to the norms $\\|\\cdot\\|_{f_1}, \\ldots, \\|\\cdot\\|_{f_T}$ over a set $S$ and let $\\|\\cdot\\|_{f_i^*}$ be the respective dual norms. Let $f_0(0) = 0$, and $x_1, \\ldots, x_T$ be an arbitrary sequence of vectors in $\\mathbb{R}^d$. Assume that the algorithm in Fig. 1 is run on this sequence with the functions $f_i$. 
Then, for any $u \\in S$, and any $\\lambda > 0$ we have\n\n$$\\sum_{t=1}^T \\eta_t z_t^\\top \\left( \\frac{1}{\\lambda} w_t - u \\right) \\leq \\frac{f_T(\\lambda u)}{\\lambda} + \\sum_{t=1}^T \\left( \\frac{\\eta_t^2 \\|z_t\\|_{f_t^*}^2}{2 \\lambda \\beta_t} + \\frac{1}{\\lambda} \\big( f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) \\big) \\right).$$\n\nThis Lemma can appear difficult to interpret, but we now show that it is straightforward to use the lemma to recover known bounds of different online learning algorithms. In particular we can state the following Corollary that holds for any convex loss $\\ell$ that upper bounds the 0/1 loss.\n\nInput: A series of strongly convex functions $f_1, \\ldots, f_T$.\nInitialize: $\\theta_1 = 0$\nfor $t = 1, 2, \\ldots, T$ do\n    Receive $x_t$\n    Set $w_t = \\nabla f_t^*(\\theta_t)$\n    Predict $\\hat{y}_t = \\mathrm{sign}(w_t^\\top x_t)$\n    Receive $y_t$\n    if $\\ell_t(w_t) > 0$ then\n        $z_t = \\partial \\ell_t(w_t)$\n        $\\theta_{t+1} = \\theta_t - \\eta_t z_t$\n    else\n        $\\theta_{t+1} = \\theta_t$\n    end if\nend for\n\nCorollary 1. Define $B = \\sum_{t=1}^T \\big( f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) \\big)$. Under the hypothesis of Lemma 1, if $\\ell$ is convex and it upper bounds the 0/1 loss, and $\\eta_t = \\eta$, then for any $u \\in S$ the algorithm in Fig. 1 has the following bound on the maximum number of mistakes $M$,\n\n$$M \\leq \\sum_{t=1}^T \\ell_t(u) + \\frac{f_T(u)}{\\eta} + \\frac{B}{\\eta} + \\eta \\sum_{t=1}^T \\frac{\\|z_t\\|_{f_t^*}^2}{2 \\beta_t}. \\qquad (1)$$\n\nMoreover if $f_t(x) \\leq f_{t+1}(x)$, $\\forall x \\in S$, $t = 0, \\ldots, T - 1$ then $B \\leq 0$.\nA similar bound has been recently presented in [9] as a regret bound. Yet, there are two differences. 
First, our analysis bounds the number of mistakes, a more natural quantity in the classification setting, rather than a general loss function. Second, we retain the additional term $B$ which may be negative, and thus possibly provides a better bound. Moreover, to choose the optimal tuning of $\\eta$ we should know quantities that are unknown to the learner. We could use adaptive regularization methods, such as the ones proposed in [16, 18], but in this way we would lose the possibility to prove mistake bounds for second order algorithms, like the ones in [1, 5]. In the next Section we show how to obtain bounds with an automatic tuning, using an additional assumption on the loss function.\n\nFigure 1: Prediction algorithm\n\n3.1 Better bounds for linear losses\n\nThe hinge loss, $\\ell(u, x_t, y_t) = \\max(1 - y_t u^\\top x_t, 0)$, is a very popular evaluation metric in classification. It has been used, for example, in Support Vector Machines [7] as well as in many online learning algorithms [3]. It has also been extended to the multiclass case [3]. Often mistake bounds are expressed in terms of the hinge loss. One reason is that it is a tighter upper bound of the 0/1 loss compared to other losses, such as the squared hinge loss. However, this loss is particularly interesting for us, because it allows an automatic tuning of the bound in (1). In particular it is easy to verify that it satisfies the following condition\n\n$$\\ell(u, x_t, y_t) \\geq 1 + u^\\top \\partial \\ell_t(w_t), \\quad \\forall u \\in S,\\ w_t : \\ell_t(w_t) > 0. \\qquad (2)$$\n\nThanks to this condition we can state the following Corollary for any loss satisfying (2).\nCorollary 2. 
Under the hypothesis of Lemma 1, if $f_T(\\lambda u) \\leq \\lambda^2 f_T(u)$, and $\\ell$ satisfies (2), then for any $u \\in S$, and any $\\lambda > 0$ we have\n\n$$\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\eta_t \\leq L + \\lambda f_T(u) + \\frac{1}{\\lambda} \\left( B + \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( \\frac{\\eta_t^2}{2 \\beta_t} \\|z_t\\|_{f_t^*}^2 - \\eta_t w_t^\\top z_t \\right) \\right),$$\n\nwhere $L = \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\eta_t \\ell_t(u)$, and $B = \\sum_{t=1}^T \\big( f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) \\big)$. In particular, choosing the optimal $\\lambda$, we obtain\n\n$$\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\eta_t \\leq L + \\sqrt{2 f_T(u)} \\sqrt{2B + \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( \\frac{\\eta_t^2}{\\beta_t} \\|z_t\\|_{f_t^*}^2 - 2 \\eta_t w_t^\\top z_t \\right)}. \\qquad (3)$$\n\nThe intuition and motivation behind this Corollary is that a classification algorithm should be independent of the particular scaling of the hyperplane. In other words, $w_t$ and $\\alpha w_t$ (with $\\alpha > 0$) make exactly the same predictions, because only the sign of the prediction matters. Exactly this independence from the scale factor allows us to improve the mistake bound (1) to the bound of (3). Hence, when (2) holds, the update of the algorithm becomes somewhat independent of the scale factor, and we have the better bound. Finally, note that when the hinge loss is used, the vector $\\theta_t$ is updated as in an aggressive version of the Perceptron algorithm, with a possibly variable learning rate.\n\n4 New Bounds for Existing Algorithms\n\nWe now show the versatility of our framework, proving better bounds for some known first order and second order algorithms.\n\n4.1 An Aggressive p-norm Algorithm\n\nWe can use the algorithm in Fig. 1 to obtain an aggressive version of the p-norm algorithm [11]. Set $f_t(u) = \\frac{1}{2(q-1)} \\|u\\|_q^2$, that is 1-strongly convex w.r.t. the norm $\\|\\cdot\\|_q$. The dual norm of $\\|\\cdot\\|_q$ is $\\|\\cdot\\|_p$, where $1/p + 1/q = 1$. Moreover set $\\eta_t = 1$ in mistake rounds, so using the second bound of Corollary 2, and defining $R$ such that $\\|x_t\\|_p^2 \\leq R^2$, we have\n\n$$M \\leq L + \\sqrt{\\frac{\\|u\\|_q^2}{q-1}} \\sqrt{\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( \\eta_t^2 \\|x_t\\|_p^2 + 2 \\eta_t y_t w_t^\\top x_t \\right)} - \\sum_{t \\in \\mathcal{U}} \\eta_t$$\n$$\\leq L + \\sqrt{\\frac{\\|u\\|_q^2}{q-1}} \\sqrt{M R^2 + \\sum_{t \\in \\mathcal{U}} \\left( \\eta_t^2 \\|x_t\\|_p^2 + 2 \\eta_t y_t w_t^\\top x_t \\right)} - \\sum_{t \\in \\mathcal{U}} \\eta_t.$$\n\nSolving for $M$ we have\n\n$$M \\leq L + \\frac{\\|u\\|_q^2 R^2}{2(q-1)} + \\frac{R \\|u\\|_q}{\\sqrt{q-1}} \\sqrt{\\frac{\\|u\\|_q^2 R^2}{4(q-1)} + L + D} - \\sum_{t \\in \\mathcal{U}} \\eta_t, \\qquad (4)$$\n\nwhere $L = \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\eta_t \\ell_t(u)$, and $D = \\sum_{t \\in \\mathcal{U}} \\left( \\frac{\\eta_t^2 \\|x_t\\|_p^2 + 2 \\eta_t y_t w_t^\\top x_t}{R^2} - \\eta_t \\right)$. We still have the freedom to set $\\eta_t$ in margin error rounds. If we set $\\eta_t = 0$, the algorithm of Fig. 1 becomes the p-norm algorithm and we recover its best bound [11]. However if $0 \\leq \\eta_t \\leq \\min\\left( \\frac{R^2 - 2 y_t w_t^\\top x_t}{\\|x_t\\|_p^2}, 1 \\right)$ we have that $D$ is negative, and $L \\leq \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\ell_t(u)$. Hence the aggressive updates give us a better bound, thanks to the last term that is subtracted from the bound.\nIn the particular case of $p = q = 2$ we recover the Perceptron algorithm. In particular the minimum of $D$, under the constraint $\\eta_t \\leq 1$, can be found setting $\\eta_t = \\min\\left( \\frac{R^2/2 - y_t w_t^\\top x_t}{\\|x_t\\|^2}, 1 \\right)$. If $R$ is equal to $\\sqrt{2}$, we recover the PA-I update rule, when $C = 1$. However note that the mistake bound in (4) is better than the one proved for PA-I in [3] and the ones in [16]. Hence the bound (4) provides the first theoretical justification for the good performance of PA-I, and it can be seen as general evidence supporting aggressive updates over conservative ones.\n\n4.2 Second Order Algorithms\n\nWe now show how to derive in a simple way the bound of the SOP [1] and the one of AROW [5]. Set $f_t(x) = \\frac{1}{2} x^\\top A_t x$, where $A_t = A_{t-1} + \\frac{x_t x_t^\\top}{r}$, $r > 0$ and $A_0 = I$. The functions $f_t$ are 1-strongly convex w.r.t. the norms $\\|x\\|_{f_t}^2 = x^\\top A_t x$. The dual functions of $f_t(x)$, $f_t^*(x)$, are equal to $\\frac{1}{2} x^\\top A_t^{-1} x$, while $\\|x\\|_{f_t^*}^2$ is $x^\\top A_t^{-1} x$. Denote $\\chi_t = x_t^\\top A_{t-1}^{-1} x_t$ and $m_t = y_t x_t^\\top A_{t-1}^{-1} \\theta_t$. With these definitions it is easy to see that the conservative version of the algorithm corresponds directly to SOP. The aggressive version corresponds to AROW, with a minor difference. In fact, the prediction of the algorithm in Fig. 1 specialized to this case is $y_t w_t^\\top x_t = m_t \\frac{r}{r + \\chi_t}$, whereas AROW predicts with $m_t$. The sign of the predictions is the same, but here the aggressive version updates when $m_t \\frac{r}{r + \\chi_t} \\leq 1$, while AROW updates if $m_t \\leq 1$.\n\nFigure 2: NLP Data: the number of words vs. the word-rank on two sentiment data sets.\n\nTo derive the bound, observe that using the Woodbury matrix identity we have $f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) = -\\frac{(x_t^\\top A_{t-1}^{-1} \\theta_t)^2}{2 (r + x_t^\\top A_{t-1}^{-1} x_t)} = -\\frac{m_t^2}{2 (r + \\chi_t)}$. Using the second bound in Corollary 2, and setting $\\eta_t = 1$ we have\n\n$$M + U \\leq L + \\sqrt{u^\\top A_T u} \\sqrt{\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( x_t^\\top A_t^{-1} x_t + 2 y_t w_t^\\top x_t - \\frac{m_t^2}{r + \\chi_t} \\right)}$$\n$$\\leq L + \\sqrt{\\|u\\|^2 + \\frac{1}{r} \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} (u^\\top x_t)^2} \\sqrt{r \\log(\\det(A_T)) + \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( 2 y_t w_t^\\top x_t - \\frac{m_t^2}{r + \\chi_t} \\right)}$$\n$$\\leq L + \\sqrt{r \\|u\\|^2 + \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} (u^\\top x_t)^2} \\sqrt{\\log(\\det(A_T)) + \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\frac{m_t (2r - m_t)}{r (r + \\chi_t)}}.$$\n\nThis bound recovers the SOP\u2019s one in the conservative case, and slightly improves the one of AROW for the aggressive case. It would be possible to improve the AROW bound even more, setting $\\eta_t$ to a value different from 1 in margin error rounds. We leave the details for a longer version of this paper.\n\n4.3 Diagonal updates for AROW\n\nBoth CW and AROW have an efficient version that uses diagonal matrices instead of full ones. In this case the complexity of the algorithm becomes linear in the dimension. Here we prove a mistake bound for the diagonal version of AROW, using Corollary 2. We denote $D_t = \\mathrm{diag}\\{A_t\\}$, where $A_t$ is defined as in SOP and AROW, and $f_t(x) = \\frac{1}{2} x^\\top D_t x$. Setting $\\eta_t = 1$, and using the second bound in Corollary 2 and Lemma 12 in [9], we have$^1$\n\n$$M + U \\leq \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\ell_t(u) + \\sqrt{u^\\top D_T u \\left( r \\sum_{i=1}^d \\log\\left( \\frac{\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} x_{t,i}^2}{r} + 1 \\right) + 2U \\right)}$$\n$$= \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\ell_t(u) + \\sqrt{\\|u\\|^2 + \\frac{1}{r} \\sum_{i=1}^d u_i^2 \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} x_{t,i}^2} \\sqrt{r \\sum_{i=1}^d \\log\\left( \\frac{\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} x_{t,i}^2}{r} + 1 \\right) + 2U}.$$\n\n$^1$ We did not optimize the constant multiplying $U$ in the bound.\n\nThe presence of a mistake bound allows us to theoretically analyze the cases where this algorithm could be advantageous with respect to a simple Perceptron. In particular, for NLP data the features are binary and it is often the case that most of the features are zero most of the time. On the other hand, these \u201crare\u201d features are usually the most informative ones (e.g. [8]). Fig. 2 shows the number of times each feature (word) appears in two sentiment datasets vs the word rank. Clearly there are few very frequent words and many rare words. These exact properties were used to originally derive the CW algorithm. Our analysis justifies this derivation. Concretely, the above considerations lead us to think that the optimal hyperplane $u$ will be such that\n\n$$\\sum_{i=1}^d u_i^2 \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} x_{t,i}^2 \\approx \\sum_{i \\in I} u_i^2 \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} x_{t,i}^2 \\leq \\sum_{i \\in I} u_i^2 s \\approx s \\|u\\|^2,$$\n\nwhere $I$ is the set of the informative and rare features and $s$ is the maximum number of times these features appear in the sequence. In general each time that $\\sum_{i=1}^d u_i^2 \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} x_{t,i}^2 \\leq s \\|u\\|^2$ with $s$ small enough, it is possible to show that, with an optimal tuning of $r$, this bound is better than the Perceptron\u2019s one. In particular, using a proof similar to the one in [1], in the conservative version of this algorithm, it is enough to have $s < \\frac{M R^2}{2d}$, and to set $r = \\frac{s M R^2}{M R^2 - 2 s d}$.\n\n5 A New Adaptive Second Order Algorithm\n\nWe now introduce a new algorithm with an update rule that interpolates from adaptive-second-order-information to fixed-second-order-information. We start from the first bound in Corollary 2. We set $f_t(x) = \\frac{1}{2} x^\\top A_t x$, where $A_t = A_{t-1} + \\frac{x_t x_t^\\top}{r_t}$, and $A_0 = I$. This is similar to the regularization used in AROW and SOP, but here we have $r_t > 0$ changing over time. Again, denote $\\chi_t = x_t^\\top A_{t-1}^{-1} x_t$, and set $\\eta_t = 1$. With these choices, we obtain the bound\n\n$$M + U \\leq \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\ell_t(u) + \\frac{\\lambda \\|u\\|^2}{2} + \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( \\frac{\\lambda (u^\\top x_t)^2}{2 r_t} + \\frac{\\chi_t r_t}{2 \\lambda (r_t + \\chi_t)} - \\frac{m_t (2 r_t - m_t)}{2 \\lambda (r_t + \\chi_t)} \\right),$$\n\nthat holds for any $\\lambda > 0$ and any choice of $r_t > 0$. We would like to choose $r_t$ at each step to minimize the bound, in particular to have a small value of the sum $\\frac{\\lambda (u^\\top x_t)^2}{r_t} + \\frac{\\chi_t r_t}{\\lambda (r_t + \\chi_t)}$. Although we do not know the values of $(u^\\top x_t)^2$ and $\\lambda$, still we can have a good trade-off setting $r_t = \\frac{\\chi_t}{b \\chi_t - 1}$ when $\\chi_t \\geq \\frac{1}{b}$ and $r_t = +\\infty$ otherwise. Here $b$ is a parameter. With this choice we have that $\\frac{\\chi_t r_t}{r_t + \\chi_t} = \\frac{1}{b}$, and $\\frac{(u^\\top x_t)^2}{r_t} = \\frac{\\chi_t (u^\\top x_t)^2 b}{r_t + \\chi_t}$, when $\\chi_t \\geq \\frac{1}{b}$. Hence we have\n\n$$M + U - \\frac{\\lambda \\|u\\|^2}{2} - \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\ell_t(u)$$\n$$\\leq \\sum_{t : b \\chi_t > 1} \\left( \\frac{\\lambda b \\chi_t (u^\\top x_t)^2}{2 (r_t + \\chi_t)} + \\frac{1}{2 \\lambda b} \\right) + \\frac{1}{2 \\lambda} \\sum_{t : b \\chi_t \\leq 1} \\chi_t - \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\frac{m_t (2 r_t - m_t)}{2 \\lambda (r_t + \\chi_t)}$$\n$$\\leq \\lambda b \\sum_{t : b \\chi_t > 1} \\frac{\\chi_t \\|u\\|^2 R^2}{2 (r_t + \\chi_t)} + \\frac{1}{2 \\lambda} \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\min\\left( \\frac{1}{b}, \\chi_t \\right) - \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\frac{m_t (2 r_t - m_t)}{2 \\lambda (r_t + \\chi_t)}$$\n$$\\leq \\frac{1}{2} \\lambda b R^2 \\|u\\|^2 \\log \\det(A_T) + \\frac{1}{2 \\lambda} \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\min\\left( \\frac{1}{b}, \\chi_t \\right) - \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\frac{m_t (2 r_t - m_t)}{2 \\lambda (r_t + \\chi_t)},$$\n\nwhere in the last inequality we used an extension of Lemma 4 in [5] to varying values of $r_t$. Tuning $\\lambda$ we have\n\n$$M + U \\leq \\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\ell_t(u) + \\|u\\| R \\sqrt{\\frac{1}{b R^2} + \\log \\det(A_T)} \\sqrt{\\sum_{t \\in \\mathcal{M} \\cup \\mathcal{U}} \\left( \\min(1, b \\chi_t) - \\frac{b m_t (2 r_t - m_t)}{r_t + \\chi_t} \\right)}.$$\n\nThis algorithm interpolates between a second order algorithm with adaptive second order information, like AROW, and one with fixed second order information. Even the bound is in between these two worlds. In particular the matrix $A_t$ is updated only if $\\chi_t \\geq \\frac{1}{b}$, preventing its eigenvalues from growing too much, as in AROW/SOP. We thus call this algorithm NAROW, since it is a new adaptive algorithm, which narrows the range of possible eigenvalues of the matrix $A_t$. We illustrate its properties empirically in the next section.\n\nFigure 3: Top: Four sequences used for training; the colors represent the ordering in the sequence from blue to yellow to red. 
Middle: cumulative number of mistakes of four algorithms on data with no label noise. Bottom: results when training using data with 10% label-noise.\n\n6 Experiments\n\nWe illustrate the characteristics of our algorithm NAROW using synthetic data generated in a manner similar to previous work [4]. We repeat its properties for completeness. We generated 5,000 points in $\\mathbb{R}^{20}$ where the first two coordinates were drawn from a 45\u00b0 rotated Gaussian distribution with standard deviations 1 and 10. The remaining 18 coordinates were drawn from independent Gaussian distributions $N(0, 8.5)$. Each point\u2019s label depended on the first two coordinates using a separator parallel to the long axis of the ellipsoid, yielding a linearly separable set. Finally, we ordered the training set in four different ways: from easy examples to hard examples (measured by the signed distance to the separating hyperplane), from hard examples to easy examples, ordered by their signed value of the first feature, and by the signed value of the third (noisy) feature - that is by $x_i \\times y$ for $i = 1$ and $i = 3$ - respectively. An illustration of these orderings appears in the top row of Fig. 3; the colors code the ordering of points from blue via yellow to red (last points). We evaluated four algorithms: version I of the passive-aggressive (PA-I) algorithm [3], AROW [5], AdaGrad [9] and NAROW. All algorithms, except AdaGrad, have one parameter to be tuned, while AdaGrad has two. These parameters were chosen on a single random set, and the plots summarize the results averaged over 100 repetitions.\nThe second row of Fig. 3 summarizes the cumulative number of mistakes averaged over 100 repetitions and the third row shows the cumulative number of mistakes where 10% artificial label noise was used. 
(Mistakes are counted using the unnoisy labels.)\nFocusing on the left plot, we observe that all the second order algorithms outperform the single first order algorithm - PA-I. All algorithms make few mistakes when receiving the first half of the data - the easy examples. Then all algorithms start to make more mistakes - PA-I the most, then AdaGrad and closely following NAROW, and AROW the least. In other words, AROW was able to converge faster to the target separating hyperplane just using \u201ceasy\u201d examples which are far from the separating hyperplane, then NAROW and AdaGrad, with PA-I being the worst in this aspect.\nThe second plot from the left shows the results for ordering the examples from hard to easy. All algorithms follow a general trend of making mistakes at a linear rate and then stop making mistakes when the data is easy and there are many possible classifiers that can predict correctly. Clearly, AROW and NAROW stop making mistakes first, then AdaGrad and PA-I last. A similar trend can be found in the noisy dataset, with each algorithm making relatively more mistakes.\n\n[Figure 3 panels: axes Examples vs. Cumulative Number of Mistakes; legend PA, AROW, NAROW, AdaGrad.]\n\nThe third and fourth columns tell a similar story, although the plots in the third column summarize results when the instances are ordered using the first feature (which is informative together with the second) and the plots in the fourth column summarize results when the instances are ordered using the third, uninformative feature. In both cases, all algorithms do not make many mistakes in the beginning, then at some point, close to the middle of the input sequence, they start making many mistakes for a while, and then they converge. In terms of total performance: PA-I makes the most mistakes, then AdaGrad, AROW and NAROW. However, NAROW starts to make many mistakes before the other algorithms and takes more \u201cexamples\u201d to converge until it stops making mistakes. This phenomenon is further shown in the bottom plots where label noise is injected.\nWe hypothesize that this relation is due to the fact that NAROW does not let the eigenvalues of the matrix $A$ grow unbounded. Since its inverse is proportional to the effective learning rate, it means that it does not allow the learning rate to drop too low, as opposed to AROW and even to some extent AdaGrad.\n\n7 Conclusion\n\nWe presented a framework for online convex classification, specializing it for particular losses, such as the hinge loss. This general tool makes it possible to design theoretically motivated online classification algorithms and to prove their relative mistake bounds. In particular it supports the analysis of aggressive updates. Our framework also provided a missing bound for AROW for diagonal matrices. 
We have shown its utility by proving better bounds for known online algorithms and by proposing a new algorithm, called NAROW. This is a hybrid between adaptive second-order algorithms, like AROW and SOP, and a static second-order one. We have validated it using synthetic datasets, showing its robustness to malicious orderings of the sample and comparing it with other state-of-the-art algorithms. Future work will focus on exploring the new possibilities offered by our framework and on testing NAROW on real-world data.\nAcknowledgments We thank Nicol\u00f2 Cesa-Bianchi for his helpful comments. Francesco Orabona was sponsored by the PASCAL2 NoE under EC grant no. 216886. Koby Crammer is a Horev Fellow, supported by the Taub Foundations. This work was also supported by the German-Israeli Foundation grant GIF-2209-1912.\n\nA Appendix\nProof of Lemma 1. Denote by $f_t^*$ the Fenchel dual of $f_t$, and let $\\Delta_t = f_t^*(\\theta_{t+1}) - f_{t-1}^*(\\theta_t)$. We have $\\sum_{t=1}^T \\Delta_t = f_T^*(\\theta_{T+1}) - f_0^*(\\theta_1) = f_T^*(\\theta_{T+1})$. Moreover we have that\n$$\\Delta_t = f_t^*(\\theta_{t+1}) - f_t^*(\\theta_t) + f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) \\leq f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) - \\eta_t z_t^\\top \\nabla f_t^*(\\theta_t) + \\frac{\\eta_t^2}{2\\beta_t} \\|z_t\\|_{f_t^*}^2,$$\nwhere we used Theorem 6 in [13]. Moreover, using the Fenchel-Young inequality, we have that\n$$\\frac{1}{\\lambda} \\sum_{t=1}^T \\Delta_t = \\frac{1}{\\lambda} f_T^*(\\theta_{T+1}) \\geq u^\\top \\theta_{T+1} - \\frac{1}{\\lambda} f_T(\\lambda u) = -\\sum_{t=1}^T \\eta_t u^\\top z_t - \\frac{1}{\\lambda} f_T(\\lambda u).$$\nHence, putting all together, we have\n$$-\\sum_{t=1}^T \\eta_t u^\\top z_t - \\frac{1}{\\lambda} f_T(\\lambda u) \\leq \\frac{1}{\\lambda} \\sum_{t=1}^T \\Delta_t \\leq \\frac{1}{\\lambda} \\sum_{t=1}^T \\left( f_t^*(\\theta_t) - f_{t-1}^*(\\theta_t) - \\eta_t w_t^\\top z_t + \\frac{\\eta_t^2}{2\\beta_t} \\|z_t\\|_{f_t^*}^2 \\right),$$\nwhere we used the definition of $w_t$ in Algorithm 1.\nProof of Corollary 1. By convexity, $\\ell(w_t, x_t, y_t) - \\ell(u, x_t, y_t) \\leq z_t^\\top (w_t - u)$, so setting $\\lambda = 1$ in Lemma 1 we have the stated bound. The additional statement on $B$ is proved using Lemma 12 in [16]: $f_t(x) \\leq f_{t+1}(x)$ implies that $f_t^*(x) \\geq f_{t+1}^*(x)$, so $B \\leq 0$.\nProof of Corollary 2. Lemma 1, the condition on the loss (2), and the hypothesis on $f_T$ give us\n$$\\sum_{t=1}^T \\eta_t (1 - \\ell_t(u)) \\leq -\\sum_{t=1}^T \\eta_t u^\\top z_t \\leq \\lambda f_T(u) + \\frac{1}{\\lambda} \\left( \\sum_{t=1}^T \\frac{\\eta_t^2 \\|z_t\\|_{f_t^*}^2}{2\\beta_t} + B - \\sum_{t=1}^T \\eta_t z_t^\\top w_t \\right).$$\nNote that $\\lambda$ is free, so choosing its optimal value we get the second bound.\n\nReferences\n[1] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. SIAM Journal on Computing, 34(3):640\u2013668, 2005.\n\n[2] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n\n[3] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551\u2013585, 2006.\n\n[4] K. Crammer, M. Dredze, and F. Pereira. Exact Convex Confidence-Weighted learning. Advances in Neural Information Processing Systems, 22, 2008.\n\n[5] K. Crammer, A. 
Kulesza, and M. Dredze. Adaptive regularization of weight vectors. Advances in Neural Information Processing Systems, 23, 2009.\n\n[6] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951\u2013991, 2003.\n\n[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.\n\n[8] M. Dredze, K. Crammer, and F. Pereira. Online Confidence-Weighted learning. Proceedings of the 25th International Conference on Machine Learning, 2008.\n\n[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report 2010-24, UC Berkeley Electrical Engineering and Computer Science, 2010. Available at http://cs.berkeley.edu/~jduchi/projects/DuchiHaSi10.pdf.\n\n[10] Y. Freund and R. E. Schapire. Large margin classification using the Perceptron algorithm. Machine Learning, pages 277\u2013296, 1999.\n\n[11] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265\u2013299, 2003.\n\n[12] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proc. of the 21st Conference on Learning Theory, 2008.\n\n[13] S. Kakade, S. Shalev-Shwartz, and A. Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Technical report, TTI, 2009. http://www.cs.huji.ac.il/~shais/papers/KakadeShalevTewari09.pdf.\n\n[14] J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. IEEE Trans. on Signal Processing, 52(8):2165\u20132176, 2004.\n\n[15] A. Rakhlin and A. Tewari. Lecture notes on online learning. Technical report, 2008. Available at http://www-stat.wharton.upenn.edu/~rakhlin/papers/online_learning.pdf.\n\n[16] S. Shalev-Shwartz. Online learning: Theory, algorithms, and applications. 
Technical report, The Hebrew University, 2007. PhD thesis.\n\n[17] S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning Journal, 2007.\n\n[18] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 22, pages 2116\u20132124, 2009.\n", "award": [], "sourceid": 995, "authors": [{"given_name": "Francesco", "family_name": "Orabona", "institution": null}, {"given_name": "Koby", "family_name": "Crammer", "institution": null}]}