{"title": "Following the Leader and Fast Rates in Linear Prediction: Curved Constraint Sets and Other Regularities", "book": "Advances in Neural Information Processing Systems", "page_first": 4970, "page_last": 4978, "abstract": "The follow the leader (FTL) algorithm, perhaps the simplest of all online learning algorithms, is known to perform well when the loss functions it is used on are positively curved. In this paper we ask whether there are other \"lucky\" settings when FTL achieves sublinear, \"small\" regret. In particular, we study the fundamental problem of linear prediction over a non-empty convex, compact domain. Amongst other results, we prove that the curvature of  the boundary of the domain can act as if the losses were curved: In this case, we prove that as long as the mean of the loss vectors have positive lengths bounded away from zero, FTL enjoys a logarithmic growth rate of regret, while, e.g., for polyhedral domains and stochastic data it enjoys finite expected regret. Building on a previously known meta-algorithm, we also get an algorithm that simultaneously enjoys the worst-case guarantees and the bound available for FTL.", "full_text": "Following the Leader and Fast Rates in Linear\nPrediction: Curved Constraint Sets and Other\n\nRegularities\n\nRuitong Huang\n\nDepartment of Computing Science\nUniversity of Alberta, AB, Canada\n\nruitong@ualberta.ca\n\nTor Lattimore\n\nSchool of Informatics and Computing\n\nIndiana University, IN, USA\ntor.lattimore@gmail.com\n\nAndr\u00e1s Gy\u00f6rgy\n\nDept. of Electrical & Electronic Engineering\n\nImperial College London, UK\na.gyorgy@imperial.ac.uk\n\nCsaba Szepesv\u00e1ri\n\nDepartment of Computing Science\nUniversity of Alberta, AB, Canada\n\nszepesva@ualberta.ca\n\nAbstract\n\nThe follow the leader (FTL) algorithm, perhaps the simplest of all online learning\nalgorithms, is known to perform well when the loss functions it is used on are posi-\ntively curved. 
In this paper we ask whether there are other "lucky" settings in which FTL achieves sublinear, "small" regret. In particular, we study the fundamental problem of linear prediction over a non-empty convex, compact domain. Amongst other results, we prove that the curvature of the boundary of the domain can act as if the losses were curved: as long as the mean of the loss vectors has length bounded away from zero, FTL enjoys a logarithmic growth rate of regret, while, e.g., for polyhedral domains and stochastic data it enjoys finite expected regret. Building on a previously known meta-algorithm, we also get an algorithm that simultaneously enjoys the worst-case guarantees and the bound available for FTL.

1 Introduction

Learning theory has traditionally been studied in a statistical framework, discussed at length, for example, by Shalev-Shwartz and Ben-David [2014]. The issue with this approach is that the analysis of the performance of learning methods seems to critically depend on whether the data generating mechanism satisfies certain probabilistic assumptions. Realizing that these assumptions are not necessarily critical, much work has recently been devoted to studying learning algorithms in the so-called online learning framework [Cesa-Bianchi and Lugosi, 2006]. The online learning framework makes minimal assumptions about the data generating mechanism, while allowing one to replicate results of the statistical framework through online-to-batch conversions [Cesa-Bianchi et al., 2004]. By following a minimax approach, however, results proven in the online learning setting, at least initially, led to rather conservative results and algorithm designs, failing to capture how more regular, "easier" data may give rise to faster learning.
This is problematic as it may suggest overly conservative learning strategies, missing opportunities to extract more information when the data is nicer. Also, it is hard to argue that data resulting from passive data collection, such as weather data, would ever be adversarially generated (though it is equally hard to defend that such data satisfies precise stochastic assumptions). Realizing this issue, much work has been devoted in recent years to understanding which regularities can lead to faster learning, and how. For example, a large body of work shows that faster learning (smaller "regret") can be achieved in the online convex optimization setting when the loss functions are "curved", such as when the loss functions are strongly convex or exp-concave, or when the losses show small variations, or the best prediction in hindsight has a small total loss, and that these properties can be exploited in an adaptive manner (e.g., Merhav and Feder 1992, Freund and Schapire 1997, Gaivoronski and Stella 2000, Cesa-Bianchi and Lugosi 2006, Hazan et al. 2007, Bartlett et al. 2007, Kakade and Shalev-Shwartz 2009, Orabona et al. 2012, Rakhlin and Sridharan 2013, van Erven et al. 2015, Foster et al. 2015).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper we contribute to this growing literature by studying online linear prediction and the follow the leader (FTL) algorithm. Online linear prediction is arguably the simplest of all learning settings, and lies at the heart of online convex optimization, while it also serves as an abstraction of core learning problems such as prediction with expert advice. FTL, the online analogue of empirical risk minimization in statistical learning, is the simplest learning strategy one can think of.
Although the linear setting of course removes the possibility of exploiting the curvature of losses, as we will see, there are multiple ways online learning problems can present data that allows for small regret, even for FTL. As is well known, in the worst case FTL suffers linear regret (e.g., Example 2.2 of Shalev-Shwartz [2012]). However, for "curved" losses (e.g., exp-concave losses), FTL was shown to achieve small (logarithmic) regret (see, e.g., Merhav and Feder [1992], Cesa-Bianchi and Lugosi [2006], Gaivoronski and Stella [2000], Hazan et al. [2007]).

In this paper we take a thorough look at FTL in the case when the losses are linear, but the problem perhaps exhibits other regularities. The motivation comes from the simple observation that, for prediction over the simplex, when the loss vectors are selected independently of each other from a distribution with bounded support and a nonzero mean, FTL quickly locks onto selecting the loss-minimizing vertex of the simplex, achieving finite expected regret. In this case, FTL is arguably an excellent algorithm. In fact, Kotłowski [2016] showed that FTL is the minimax optimizer for binary losses in the stochastic expert setting. Thus, we ask whether there are other regularities that allow FTL to achieve nontrivial performance guarantees. Our main result shows that when the decision set (or constraint set) has a sufficiently "curved" boundary and the linear loss is bounded away from 0, FTL is able to achieve logarithmic regret even in the adversarial setting, thus opening up a new way to prove fast rates, based not on the curvature of the losses, but on that of the boundary of the constraint set and the non-singularity of the linear loss. In a matching lower bound we show that this regret bound is essentially unimprovable.
We also show an alternate bound for polyhedral constraint sets, which allows us to prove that (under certain technical conditions) for stochastic problems the expected regret of FTL is finite. To finish, we use the (A, B)-prod algorithm of Sani et al. [2014] to design a method that adaptively interpolates between the worst-case O(√n) regret and the smaller regret bounds that we prove here for "easy data." Simulation results on artificial data complement the theoretical findings, though due to lack of space these are presented only in the long version of the paper [Huang et al., 2016].

While we believe that we are the first to point out that the curvature of the constraint set W can help in speeding up learning, this effect has been known in convex optimization since at least the work of Levitin and Polyak [1966], who showed that exponential rates are attainable for strongly convex constraint sets if the norm of the gradient of the objective function admits a uniform lower bound. More recently, Garber and Hazan [2015] proved an O(1/n²) optimization error bound (with problem-dependent constants) for the Frank-Wolfe algorithm for strongly convex and smooth objectives and strongly convex constraint sets. The effect of the shape of the constraint set was also discussed by Abbasi-Yadkori [2010], who demonstrated O(√n) regret in the linear bandit setting. While these results are at a high level similar to ours, our proof technique is rather different from the ones used there.

2 Preliminaries, online learning and the follow the leader algorithm

We consider the standard framework of online convex optimization, where a learner and an environment interact in a sequential manner over n rounds: In every round t = 1, . . . , n, first the learner predicts wt ∈ W.
Then the environment picks a loss function ℓt ∈ L, and the learner suffers loss ℓt(wt) and observes ℓt. Here, W is a non-empty, compact convex subset of R^d and L is a set of convex functions mapping W to the reals. The elements of L are called loss functions. The performance of the learner is measured in terms of its regret,

Rn = Σ_{t=1}^n ℓt(wt) − min_{w∈W} Σ_{t=1}^n ℓt(w) .

The simplest possible case, which will be the focus of this paper, is when the losses are linear, i.e., when ℓt(w) = ⟨ft, w⟩ for some ft ∈ F ⊂ R^d. In fact, the linear case is not only simple, but is also fundamental, since the case of nonlinear loss functions can be reduced to it: Indeed, even if the losses are nonlinear, defining ft ∈ ∂ℓt(wt) to be a subgradient^1 of ℓt at wt and letting ℓ̃t(u) = ⟨ft, u⟩, by the definition of subgradients,

ℓt(wt) − ℓt(u) ≤ ℓt(wt) − (ℓt(wt) + ⟨ft, u − wt⟩) = ℓ̃t(wt) − ℓ̃t(u) ,

hence for any u ∈ W,

Σ_t ℓt(wt) − Σ_t ℓt(u) ≤ Σ_t ℓ̃t(wt) − Σ_t ℓ̃t(u) .

In particular, if an algorithm keeps the regret small no matter how the linear losses are selected (even when allowing the environment to pick losses based on the choices of the learner), the algorithm can also be used to keep the regret small in the nonlinear case.
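The linearization argument above is easy to check numerically. The sketch below is our own illustration (not code from the paper): it uses the convex losses ℓt(w) = (w − xt)² over W = [−1, 1], linearizes them through the gradient ft = 2(wt − xt), and verifies that the regret on the original losses never exceeds the regret on the linearized losses, whichever fixed comparator u ∈ W is chosen.

```python
import random

# Our illustration, not from the paper: convex losses l_t(w) = (w - x_t)^2
# over W = [-1, 1], linearized via the gradient f_t = 2 * (w_t - x_t).
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]

w, plays, grads = 0.0, [], []
for x in xs:
    plays.append(w)
    grads.append(2.0 * (w - x))  # f_t, the gradient of l_t at w_t
    # Any learner for linear losses could be plugged in here; for concreteness
    # we run FTL on the linearized losses, whose minimizer over [-1, 1] is an
    # endpoint of the interval (or arbitrary when the cumulative loss is 0).
    s = sum(grads)
    w = -1.0 if s > 0 else (1.0 if s < 0 else 0.0)

u = max(-1.0, min(1.0, sum(xs) / len(xs)))  # some fixed comparator u in W
regret = sum((wt - x) ** 2 - (u - x) ** 2 for wt, x in zip(plays, xs))
lin_regret = sum(f * (wt - u) for f, wt in zip(grads, plays))
# Convexity gives l_t(w_t) - l_t(u) <= <f_t, w_t - u> term by term:
assert regret <= lin_regret + 1e-9
```

Here FTL on the linearized losses serves only as a stand-in for "an algorithm that keeps the linear regret small"; the final assertion is the per-round subgradient inequality summed over t.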
Hence, in what follows we will study the linear case ℓt(w) = ⟨ft, w⟩ and, in particular, we will study the regret of the so-called "Follow The Leader" (FTL) learner, which, in round t ≥ 2, picks

wt = argmin_{w∈W} Σ_{i=1}^{t−1} ℓi(w) .

For the first round, w1 ∈ W is picked in an arbitrary manner. When W is compact, the minimum of min_{w∈W} Σ_{i=1}^{t−1} ⟨w, fi⟩ is attained, which we will assume henceforth. If multiple minimizers exist, we simply fix one of them as wt. We will also assume that F is non-empty, compact and convex.

2.1 Support functions

Let Θt = −(1/t) Σ_{i=1}^t fi be the negative average of the first t vectors in (ft)_{t=1}^n, ft ∈ F. For convenience, we define Θ0 := 0. Thus, for t ≥ 2,

wt = argmin_{w∈W} Σ_{i=1}^{t−1} ⟨w, fi⟩ = argmin_{w∈W} ⟨w, −Θ_{t−1}⟩ = argmax_{w∈W} ⟨w, Θ_{t−1}⟩ .

Denote by Φ(Θ) = max_{w∈W} ⟨w, Θ⟩ the so-called support function of W. The support function, being the maximum of linear and hence convex functions, is itself convex. Further, Φ is positive homogeneous: for a ≥ 0 and θ ∈ R^d, Φ(aθ) = aΦ(θ). It follows that the epigraph epi(Φ) = {(θ, z) | z ≥ Φ(θ), z ∈ R, θ ∈ R^d} of Φ is a cone, since for any (θ, z) ∈ epi(Φ) and a ≥ 0, az ≥ aΦ(θ) = Φ(aθ), so (aθ, az) ∈ epi(Φ) also holds.

The differentiability of the support function is closely tied to whether the choice of wt in the FTL algorithm is uniquely determined:

Proposition 2.1. Let W ≠ ∅ be convex and closed.
Fix Θ and let Z := {w ∈ W | ⟨w, Θ⟩ = Φ(Θ)}. Then ∂Φ(Θ) = Z and, in particular, Φ is differentiable at Θ if and only if max_{w∈W} ⟨w, Θ⟩ has a unique optimizer. In this case, ∇Φ(Θ) = argmax_{w∈W} ⟨w, Θ⟩.

The proposition follows from Danskin's theorem when W is compact (e.g., Proposition B.25 of Bertsekas 1999), but a simple direct argument can be used to show that it remains true even when W is unbounded.^2 By Proposition 2.1, when Φ is differentiable at Θ_{t−1}, wt = ∇Φ(Θ_{t−1}).

3 Non-stochastic analysis of FTL

We start by rewriting the regret of FTL in an equivalent form, which shows that we can expect FTL to enjoy a small regret when successive weight vectors move little. A noteworthy feature of the next proposition is that rather than bounding the regret from above, it gives an equivalent expression for it.

Proposition 3.1. The regret Rn of FTL satisfies

Rn = Σ_{t=1}^n t ⟨w_{t+1} − wt, Θt⟩ .

^1 We let ∂g(x) denote the subdifferential of a convex function g : dom(g) → R at x, i.e., ∂g(x) = {θ ∈ R^d | g(x′) ≥ g(x) + ⟨θ, x′ − x⟩ ∀x′ ∈ dom(g)}, where dom(g) ⊂ R^d is the domain of g.
^2 The proofs not given in the main text can be found in the long version of the paper [Huang et al., 2016].

The result is a direct corollary of Lemma 9 of McMahan [2010], which holds for any sequence of losses, even in the absence of convexity. It is also a tightening of the well-known inequality Rn ≤ Σ_{t=1}^n ℓt(wt) − ℓt(w_{t+1}), which again holds for arbitrary loss sequences (e.g., Lemma 2.1 of Shalev-Shwartz [2012]).
To keep the paper self-contained, we give an elegant, short direct proof based on the summation by parts formula.

Proof. The summation by parts formula states that for any reals u1, v1, . . . , u_{n+1}, v_{n+1},

Σ_{t=1}^n ut (v_{t+1} − vt) = (u_{n+1} v_{n+1} − u1 v1) − Σ_{t=1}^n (u_{t+1} − ut) v_{t+1} .

Noting that ft = −(tΘt − (t−1)Θ_{t−1}) and Σ_{i=1}^n fi = −nΘn, the definition of the regret gives Rn = −Σ_{t=1}^n ⟨wt, tΘt − (t−1)Θ_{t−1}⟩ + ⟨w_{n+1}, nΘn⟩. Applying the summation by parts formula with ut := wt and v_{t+1} := tΘt (so that v1 = 0), we get

Rn = −{⟨w_{n+1}, nΘn⟩ − 0 − Σ_{t=1}^n ⟨w_{t+1} − wt, tΘt⟩} + ⟨w_{n+1}, nΘn⟩ = Σ_{t=1}^n t ⟨w_{t+1} − wt, Θt⟩ ,

where the two ⟨w_{n+1}, nΘn⟩ terms cancel.

Our next proposition gives another formula equal to the regret. As opposed to the previous result, this formula is appealing as it is independent of wt; instead, it directly connects the sequence (Θt)t to the geometric properties of W through the support function Φ. For this proposition we momentarily assume that Φ is differentiable at (Θt)_{t≥1}; a more general statement will follow later.

Proposition 3.2. If Φ is differentiable at Θ1, . . . , Θn, then

Rn = Σ_{t=1}^n t DΦ(Θt, Θ_{t−1}) ,   (1)

where DΦ(θ′, θ) = Φ(θ′) − Φ(θ) − ⟨∇Φ(θ), θ′ − θ⟩ is the Bregman divergence of Φ and we use the convention that ∇Φ(0) = w1.

Proof. Let v = argmax_{w∈W} ⟨w, θ⟩ and v′ = argmax_{w∈W} ⟨w, θ′⟩. When Φ is differentiable at θ,
When \u03a6 is differentiable at \u03b8,\nD\u03a6(\u03b8(cid:48), \u03b8) = \u03a6(\u03b8(cid:48)) \u2212 \u03a6(\u03b8) \u2212 (cid:104)\u2207\u03a6(\u03b8), \u03b8(cid:48)\u2212 \u03b8(cid:105) = (cid:104)v(cid:48), \u03b8(cid:48)(cid:105)\u2212 (cid:104)v, \u03b8(cid:105) \u2212 (cid:104)v, \u03b8(cid:48)\u2212 \u03b8(cid:105) = (cid:104)v(cid:48)\u2212 v, \u03b8(cid:48)(cid:105) . (2)\n\nTherefore, by Proposition 3.1, Rn =(cid:80)n\n\nt=1 t(cid:104)wt+1 \u2212 wt, \u0398t(cid:105) =(cid:80)n\n\nt=1 t D\u03a6(\u0398t, \u0398t\u22121).\n\nWhen \u03a6 is non-differentiable at some of the points \u03981, . . . , \u0398n, the equality in the above propo-\nsition can be replaced with inequalities. De\ufb01ning the upper Bregman divergence D\u03a6(\u03b8(cid:48), \u03b8) =\nsupw\u2208\u2202\u03a6(\u03b8) \u03a6(\u03b8(cid:48)) \u2212 \u03a6(\u03b8) \u2212 (cid:104)w, \u03b8(cid:48) \u2212 \u03b8(cid:105) and the lower Bregman divergence D\u03a6(\u03b8(cid:48), \u03b8) similarly with\ninf instead of sup, similarly to Proposition 3.2, we obtain\n\nn(cid:88)\n\nt D\u03a6(\u0398t, \u0398t\u22121) \u2264 Rn \u2264 n(cid:88)\n\nt D\u03a6(\u0398t, \u0398t\u22121) .\n\n(3)\n\nt=1\n\nt=1\n\n3.1 Constraint sets with positive curvature\nThe previous results shows in an implicit fashion that the curvature of W controls the regret. We now\npresent our \ufb01rst main result that makes this connection explicit. Denote the boundary of W by bd(W).\nFor this result, we shall assume that W is C 2, that is, bd(W) is a twice continuously differentiable\nsubmanifold of Rd. 
Recall that in this case the principal curvatures of W at w \u2208 bd(W) are the\neigenvalues of \u2207uW (w), where uW : bd(W) \u2192 Sd\u22121, the so-called Gauss map, maps a boundary\npoint w \u2208 bd(W) to the unique outer normal vector to W at w.3 As it is well known, \u2207uW (w) is a\nself-adjoint operator, with nonnegative eigenvalues, thus the principal curvatures are nonnegative.\nPerhaps a more intuitive, yet equivalent de\ufb01nition, is that the principal eigenvalues are the eigenvalues\nof the Hessian of f = fw in the parameterization t (cid:55)\u2192 w + t\u2212 fw(t)uW (w) of bd(W) which is valid\nin a small open neighborhood of w, where fw : TwW \u2192 [0,\u221e) is a suitable convex, nonnegative\nvalued function that also satis\ufb01es fw(0) = 0 and where TwW, a hyperplane of Rd, denotes the\ntangent space of W at w, obtained by taking the support plane H of W at w and shifting it by \u2212w.\nThus, the principal curvatures at some point w \u2208 bd(W) describe the local shape of bd(W) up to\nthe second order.\nA related concept that has been used in convex optimization to show fast rates is that of a strongly\nconvex constraint set [Levitin and Polyak, 1966, Garber and Hazan, 2015]: W is \u03bb-strongly convex\n\n3Sd\u22121 =(cid:8)x \u2208 Rd |(cid:107)x(cid:107)2 = 1(cid:9) denotes the unit sphere in d-dimensions. All differential geometry concept\n\nand results that we need can be found in Section 2.5 of [Schneider, 2014].\n\n4\n\n\fwith respect to the norm (cid:107)\u00b7(cid:107) if, for any x, y \u2208 W and \u03b3 \u2208 [0, 1], the (cid:107)\u00b7(cid:107)-ball with origin \u03b3x+(1\u2212\u03b3)y\nand radius \u03b3(1 \u2212 \u03b3)\u03bb(cid:107)x \u2212 y(cid:107)2 /2 is included in W. 
One can show that a closed convex set W is λ-strongly convex with respect to ∥·∥2 if and only if the principal curvatures of the surface bd(W) are all at least λ.

Our next result connects the principal curvatures of bd(W) to the regret of FTL and shows that FTL enjoys logarithmic regret for highly curved surfaces, as long as ∥Θt∥2 is bounded away from zero.

Theorem 3.3. Let W ⊂ R^d be a C² convex body with d ≥ 2.^4 Let M = max_{f∈F} ∥f∥2 and assume that Φ is differentiable at (Θt)t. Assume that the principal curvatures of the surface bd(W) are all at least λ0 for some constant λ0 > 0 and that Ln := min_{1≤t≤n} ∥Θt∥2 > 0. Choose w1 ∈ bd(W). Then

Rn ≤ (2M²/(λ0 Ln)) (1 + log(n)) .

Figure 1: Illustration of the construction used in the proof of (4).

As we will show in an essentially matching lower bound, this bound is tight, showing that the forte of FTL is when Ln is bounded away from zero and λ0 is large. Note that the bound is vacuous as soon as Ln = O(log(n)/n), and is worse than the minimax bound of O(√n) when Ln = o(log(n)/√n). One possibility to reduce the bound's sensitivity to Ln is to use the trivial bound ⟨w_{t+1} − wt, Θt⟩ ≤ LW = L sup_{w,w′∈W} ∥w − w′∥2 for indices t with ∥Θt∥ ≤ L. Then, by optimizing the bound over L, one gets a data-dependent bound of the form

inf_{L>0} ( (2M²/(λ0 L)) (1 + log(n)) + LW Σ_{t=1}^n t I(∥Θt∥ ≤ L) ) ,

which is more complex, but is free of Ln and thus reflects the nature of FTL better. Note that in the case of stochastic problems, where f1, . . . , fn are independent and identically distributed (i.i.d.) with µ := −E[Θt] ≠ 0, the probability that ∥Θt∥2 < ∥µ∥2/2 is exponentially small in t.
Thus, selecting L = (cid:107)\u00b5(cid:107)2 /2 in the previous\nbound, the contribution of the expectation of the second term is O((cid:107)\u00b5(cid:107)2 W ), giving an overall bound\nlog(n) + (cid:107)\u00b5(cid:107)2 W ). After the proof we will provide some simple examples\nof the form O( M 2\n\u03bb0(cid:107)\u00b5(cid:107)2\nthat should make it more intuitive how the curvature of W helps keeping the regret of FTL small.\nProof. Fix \u03b81, \u03b82 \u2208 Rd and let w(1) = argmaxw\u2208W(cid:104)w, \u03b81(cid:105), w(2) = argmaxw\u2208W(cid:104)w, \u03b82(cid:105). Note that\nif \u03b81, \u03b82 (cid:54)= 0 then w(1), w(2) \u2208 bd(W). Below we will show that\n\nFigure 1: Illustration of the con-\nstruction used in the proof of (4).\n\n(cid:104)w(1) \u2212 w(2), \u03b81(cid:105) \u2264 1\n2\u03bb0\n\n(4)\nProposition 3.1 suggests that it suf\ufb01ces to bound (cid:104)wt+1 \u2212 wt, \u0398t(cid:105). By (4), we see that it suf\ufb01ces to\nbound how much \u0398t moves. A straightforward calculation shows that \u0398t cannot move much:\nLemma 3.4. 
For any norm (cid:107)\u00b7(cid:107) on F, we have (cid:107)\u0398t \u2212 \u0398t\u22121(cid:107) \u2264 2\nt M , where M = maxf\u2208F (cid:107)f(cid:107) is a\nconstant that depends on F and the norm (cid:107)\u00b7(cid:107).\n\n(cid:107)\u03b82(cid:107)2\n\n2\n\n.\n\n(cid:107)\u03b82 \u2212 \u03b81(cid:107)2\n\nCombining inequality (4) with Proposition 3.1 and Lemma 3.4, we get\n(cid:107)\u0398t \u2212 \u0398t\u22121(cid:107)2\n\n2\n\nRn =\n\nn(cid:88)\n\nt=1\n\nt(cid:104)wt+1 \u2212 wt, \u0398t(cid:105) \u2264 n(cid:88)\nn(cid:88)\n\nt=1\n\n1\n\n\u2264 2M 2\n\u03bb0Ln\n\n\u2264 2M 2\n\u03bb0\n\nt(cid:107)\u0398t\u22121(cid:107)2\n\nt=1\n\nt\n2\u03bb0\n\nn(cid:88)\n\nt=1\n\n(cid:107)\u0398t\u22121(cid:107)2\n\u2264 2M 2\n\u03bb0Ln\n\n1\nt\n\n(1 + log(n)) .\n\nTo \ufb01nish the proof, it thus remains to show (4).\nThe following elementary lemma relates the cosine of the angle between two vectors \u03b81 and \u03b82 to the\nsquared normalized distance between the two vectors, thereby reducing our problem to bounding the\ncosine of this angle. For brevity, we denote by cos(\u03b81, \u03b82) the cosine of the angle between \u03b81 and \u03b82.\n\n4Following Schneider [2014], a convex body of Rd is any non-empty, compact, convex subset of Rd.\n\n5\n\nw(1)(cid:102)\u03b81w(2)(cid:102)\u03b82(cid:99)\u03b82P\u03b3(s)\fLemma 3.5. For any non-zero vectors \u03b81, \u03b82 \u2208 Rd,\n1 \u2212 cos(\u03b81, \u03b82) \u2264 1\n2\n\n(cid:107)\u03b81 \u2212 \u03b82(cid:107)2\n(cid:107)\u03b81(cid:107)2(cid:107)\u03b82(cid:107)2\n\n2\n\n.\n\n(5)\n\n(cid:105).\nWith this result, we see that it suf\ufb01ces to upper bound cos(\u03b81, \u03b82) by 1 \u2212 \u03bb0(cid:104)w(1) \u2212 w(2),\n\u03b81(cid:107)\u03b81(cid:107)2\nTo develop this bound, let \u02dc\u03b8i = \u03b8i(cid:107)\u03b8i(cid:107)2\nfor i = 1, 2. The angle between \u03b81 and \u03b82 is the same as the\nangle between the normalized vectors \u02dc\u03b81 and \u02dc\u03b82. 
To calculate the cosine of the angle between \u02dc\u03b81\nand \u02dc\u03b82, let P be a plane spanned by \u02dc\u03b81 and w(1) \u2212 w(2) and passing through w(1) (P is uniquely\ndetermined if \u02dc\u03b81 is not parallel to w(1) \u2212 w(2); if there are multiple planes, just pick any of them).\nFurther, let \u02c6\u03b82 \u2208 Sd\u22121 be the unit vector along the projection of \u02dc\u03b82 onto the plane P , as indicated in\nFig. 1. Clearly, cos(\u02dc\u03b81, \u02dc\u03b82) \u2264 cos(\u02dc\u03b81, \u02c6\u03b82).\nConsider a curve \u03b3(s) on bd(W) connecting w(1) and w(2) that is de\ufb01ned by the intersection of\nbd(W) and P and is parametrized by its curve length s so that \u03b3(0) = w(1) and \u03b3(l) = w(2), where\nl is the length of the curve \u03b3 between w(1) and w(2). Let uW (w) denote the outer normal vector to W\nat w as before, and let u\u03b3 : [0, l] \u2192 Sd\u22121 be such that u\u03b3(s) = \u02c6\u03b8 where \u02c6\u03b8 is the unit vector parallel\nto the projection of uW (\u03b3(s)) on the plane P . By de\ufb01nition, u\u03b3(0) = \u02dc\u03b81 and u\u03b3(l) = \u02c6\u03b82. Note that\nin fact \u03b3 exists in two versions since W is a compact convex body, hence the intersection of P and\nbd(W) is a closed curve. Of these two versions we choose the one that satis\ufb01es that (cid:104)\u03b3(cid:48)(s), \u02dc\u03b81(cid:105) \u2264 0\nfor s \u2208 [0, l].5 Given the above, we have\ncos(\u02dc\u03b81, \u02c6\u03b82) = (cid:104)\u02c6\u03b82, \u02dc\u03b81(cid:105) = 1+ (cid:104)\u02c6\u03b82 \u2212 \u02dc\u03b81, \u02dc\u03b81(cid:105) = 1+\n(cid:104)u(cid:48)\n\u03b3(s), \u02dc\u03b81(cid:105) ds. (6)\nNote that \u03b3 is a planar curve on bd(W), thus its curvature \u03bb(s) satis\ufb01es \u03bb(s) \u2265 \u03bb0 for s \u2208 [0, l].\nAlso, for any w on the curve \u03b3, \u03b3(cid:48)(s) is a unit vector parallel to P . 
Moreover, u(cid:48)\n\u03b3(s) is parallel to\n\u03b3(cid:48)(s) and \u03bb(s) = (cid:107)u(cid:48)\n\nu(cid:48)\n\u03b3(s) ds, \u02dc\u03b81\n\n(cid:68)(cid:90) l\n\n(cid:90) l\n\n= 1+\n\n(cid:69)\n\n0\n\n0\n\n\u03b3(s)(cid:107)2(cid:104)\u03b3(cid:48)(s), \u02dc\u03b81(cid:105) \u2264 \u03bb0(cid:104)\u03b3(cid:48)(s), \u02dc\u03b81(cid:105),\n\nwhere the last inequality holds because (cid:104)\u03b3(cid:48)(s), \u02dc\u03b81(cid:105) \u2264 0. Plugging this into (6), we get the desired\ncos(\u02dc\u03b81, \u02c6\u03b82) \u2264 1 + \u03bb0\n= 1 \u2212 \u03bb0(cid:104)w(1) \u2212 w(2), \u02dc\u03b81(cid:105) .\n\n(cid:104)\u03b3(cid:48)(s), \u02dc\u03b81(cid:105) ds = 1 + \u03bb0\n\n\u03b3(cid:48)(s) ds, \u02dc\u03b81\n\n(cid:69)\n\n\u03b3(s)(cid:107)2. Therefore,\n\u03b3(s), \u02dc\u03b81(cid:105) = (cid:107)u(cid:48)\n(cid:104)u(cid:48)\n(cid:90) l\n\n(cid:68)(cid:90) l\n(cid:17) \u2264 1\n\n0\n\n\u03bb0\n\nReordering and combining with (5) we obtain\n1 \u2212 cos(\u02dc\u03b81, \u02c6\u03b82)\n\n(cid:104)w(1) \u2212 w(2), \u02dc\u03b81(cid:105) \u2264 1\n\u03bb0\n\n0\n\n(cid:16)\n\n(1 \u2212 cos(\u03b81, \u03b82)) \u2264 1\n2\u03bb0\n\n(cid:107)\u03b81 \u2212 \u03b82(cid:107)2\n(cid:107)\u03b81(cid:107)2(cid:107)\u03b82(cid:107)2\n\n2\n\n.\n\nMultiplying both sides by (cid:107)\u03b81(cid:107)2 gives (4), thus, \ufb01nishing the proof.\nExample 3.6. The smallest principal curvature of some common convex bodies are as follows:\n\n\u2022 The smallest principal curvature \u03bb0 of the Euclidean ball W = {w |(cid:107)w(cid:107)2 \u2264 r} of radius r\n\n\u2022 Let Q be a positive de\ufb01nite matrix. If W =(cid:8)w | w(cid:62)Qw \u2264 1(cid:9) then \u03bb0 = \u03bbmin/\n\nsatis\ufb01es \u03bb0 = 1\nr .\n\n\u03bbmax,\n\n\u221a\n\nwhere \u03bbmin and \u03bbmax are the minimal, respectively, maximal eigenvalues of Q.\n\n\u2022 In general, let \u03c6 : Rd \u2192 R be a C 2 convex function. 
Then, for W = {w | φ(w) ≤ 1},

λ0 = min_{w∈bd(W)} min_{v : ∥v∥2=1, v⊥φ′(w)} v⊤∇²φ(w)v / ∥φ′(w)∥2 .

In the stochastic i.i.d. case, when E[Θt] = −µ, we have ∥Θt + µ∥2 = O(1/√t) with high probability. Thus, say, for W being the unit ball of R^d, one has wt = Θt/∥Θt∥2; therefore, a crude bound suggests that ∥wt − w*∥2 = O(1/√t), overall predicting that E[Rn] = O(√n), while the previous result predicts that Rn is much smaller. In the next example we look at the unit ball to explain geometrically what "causes" the smaller regret.

^5 γ′ and u′γ denote the derivatives of γ and uγ, respectively, which exist since W is C².

Example 3.7. Let W = {w | ∥w∥2 ≤ 1} and consider a stochastic setting where the fi are i.i.d. samples from some underlying distribution with expectation E[fi] = µ = (−1, 0, . . . , 0) and ∥fi∥∞ ≤ M. It is straightforward to see that w* = (1, 0, . . . , 0), and thus ⟨w*, µ⟩ = −1. Let E = {−θ | ∥θ − µ∥2 ≤ ε}. As suggested beforehand, we expect −µt ∈ E with high probability. As shown in Fig. 2, the excess loss of an estimate OA is ⟨OÃ, OD⟩ − 1 = |B̃D|. Similarly, the excess loss of an estimate OA′ in the figure is |CD|. Therefore, for an estimate −µt ∈ E, the point A is where the largest excess loss is incurred. The triangle OAD is similar to the triangle ADB. Thus |BD|/|AD| = |AD|/|OD|.
Therefore, |BD| = ε², and since |B̃D| ≤ |BD|, if ∥µt − µ∥2 ≤ ε, the excess error is at most ε² = O(1/t), making the regret Rn = O(log n).

Our last result in this section is an asymptotic lower bound for the linear game, showing that FTL achieves the optimal rate under the condition that min_t ∥Θt∥2 ≥ L > 0.

Theorem 3.8. Let h, L ∈ (0, 1). Assume that {(1, −L), (−1, −L)} ⊂ F and let W = {(x, y) : x² + y²/h² ≤ 1} be an ellipsoid with smallest principal curvature h. Then, for any learning strategy, there exists a sequence of losses in F such that Rn = Ω(log(n)/(Lh)) and ∥Θt∥2 ≥ L for all t.

3.2 Other regularities

So far we have looked at the case when FTL achieves a low regret due to the curvature of bd(W). The next result characterizes the regret of FTL when W is a polyhedron, which has a flat, non-smooth boundary, so that Theorem 3.3 is not applicable. For this statement, recall that given some norm ∥·∥, its dual norm is defined by ∥w∥* = sup_{∥v∥≤1} ⟨v, w⟩.

Theorem 3.9. Assume that W is a polyhedron and that Φ is differentiable at Θi, i = 1, . . . , n. Let wt = argmax_{w∈W} ⟨w, Θ_{t−1}⟩, W = sup_{w1,w2∈W} ∥w1 − w2∥*, and F = sup_{f1,f2∈F} ∥f1 − f2∥. Then the regret of FTL satisfies

Rn ≤ W Σ_{t=1}^n t I(w_{t+1} ≠ wt) ∥Θt − Θ_{t−1}∥ ≤ F W Σ_{t=1}^n I(w_{t+1} ≠ wt) .

Figure 2: Illustration of how curvature helps to keep the regret small.

Note that when W is a polyhedron, wt is expected to "snap" to some vertex of W.
Hence, we expect the regret bound to be non-vacuous if, e.g., Θ_t "stabilizes" around some value. Some examples after the proof will illustrate this.

Proof. Let v = argmax_{w ∈ W} ⟨w, θ⟩ and v′ = argmax_{w ∈ W} ⟨w, θ′⟩. Similarly to the proof of Theorem 3.3,

    ⟨v′ − v, θ′⟩ = ⟨v′, θ′⟩ − ⟨v′, θ⟩ + ⟨v′, θ⟩ − ⟨v, θ⟩ + ⟨v, θ⟩ − ⟨v, θ′⟩
                ≤ ⟨v′, θ′⟩ − ⟨v′, θ⟩ + ⟨v, θ⟩ − ⟨v, θ′⟩ = ⟨v′ − v, θ′ − θ⟩ ≤ W I(v′ ≠ v) ‖θ′ − θ‖ ,

where the first inequality holds because ⟨v′, θ⟩ ≤ ⟨v, θ⟩. Therefore, by Lemma 3.4,

    R_n = ∑_{t=1}^n t ⟨w_{t+1} − w_t, Θ_t⟩ ≤ W ∑_{t=1}^n t I(w_{t+1} ≠ w_t) ‖Θ_t − Θ_{t−1}‖ ≤ F W ∑_{t=1}^n I(w_{t+1} ≠ w_t) .

As noted before, since W is a polyhedron, w_t is (generally) attained at the vertices. In this case, the epigraph of Φ is a polyhedral cone, and the event that w_{t+1} ≠ w_t, i.e., that the "leader" switches, corresponds to Θ_t and Θ_{t−1} belonging to different linear regions, that is, to different linear pieces of the graph of Φ.

We now spell out a corollary for the stochastic setting. In particular, in this case FTL will often enjoy a constant regret:

[Figure 2 labels: D = w*, A = −μ_t, Ã = ŵ_t, Ã′ = −μ.]

Corollary 3.10 (Stochastic setting).
Assume that (f_t)_{1 ≤ t ≤ n} is an i.i.d. sequence of random variables such that E[f_i] = μ and ‖f_i‖∞ ≤ M. Let W = sup_{w₁,w₂ ∈ W} ‖w₁ − w₂‖₁. Further assume that there exists a constant r > 0 such that Φ is differentiable for any ν with ‖ν − μ‖∞ ≤ r. Then,

    E[R_n] ≤ 2MW (1 + 4dM²/r²) .

Proof. Let V = {ν : ‖ν − μ‖∞ ≤ r}. Note that the epigraph of the function Φ is a polyhedral cone. Since Φ is differentiable in V, {(θ, Φ(θ)) | θ ∈ V} is a subset of a linear subspace. Therefore, for −Θ_t, −Θ_{t−1} ∈ V, w_{t+1} = w_t. Hence, by Theorem 3.9,

    E[R_n] ≤ 2MW ∑_{t=1}^n Pr(−Θ_t ∉ V or −Θ_{t−1} ∉ V) ≤ 4MW ( 1 + ∑_{t=1}^n Pr(−Θ_t ∉ V) ) .

On the other hand, note that ‖f_i‖∞ ≤ M. Then

    Pr(−Θ_t ∉ V) = Pr( ‖(1/t) ∑_{i=1}^t f_i − μ‖∞ ≥ r ) ≤ ∑_{j=1}^d Pr( |(1/t) ∑_{i=1}^t f_{i,j} − μ_j| ≥ r ) ≤ 2d e^{−tr²/(2M²)} ,

where the last inequality is due to Hoeffding's inequality. Now, using that for α > 0, ∑_{t=1}^n exp(−αt) ≤ ∫_0^n exp(−αt) dt ≤ 1/α, we get E[R_n] ≤ 2MW (1 + 4dM²/r²).

The condition that Φ is differentiable for any ν with ‖ν − μ‖∞ ≤ r is equivalent to Φ being differentiable at μ.
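The constant expected regret promised by Corollary 3.10 is easy to probe in simulation. The snippet below is a sketch under assumed data, not code from the paper: the probability-simplex domain, the mean losses in `mu`, and the name `ftl_regret` are all our own choices. It estimates FTL's regret against the best fixed vertex for increasing horizons; with a mean loss vector that has a unique minimizer, the regret plateaus instead of growing with n.

```python
import random

def ftl_regret(n, seed=0):
    # Regret of FTL over the probability simplex against the best fixed
    # vertex, under i.i.d. losses with a unique optimal coordinate.
    rng = random.Random(seed)
    mu = [0.2, 0.5, 0.9]            # hypothetical mean losses
    cum = [0.0, 0.0, 0.0]           # cumulative losses per coordinate
    loss_alg = 0.0
    for _ in range(n):
        w = min(range(3), key=lambda j: cum[j])   # FTL plays a vertex
        f = [m + rng.uniform(-1.0, 1.0) for m in mu]
        loss_alg += f[w]
        for j in range(3):
            cum[j] += f[j]
    return loss_alg - min(cum)      # regret vs. best fixed action

for n in (100, 1000, 10000):
    print(n, round(ftl_regret(n), 2))
```

With the fixed seed above the printed values should plateau: once the leader stabilizes, FTL's per-round loss matches the comparator's, in line with the constant bound of the corollary.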
By Proposition 2.1, the differentiability condition requires that max_{w ∈ W} ⟨w, μ⟩ has a unique optimizer. Note that the volume of the set of vectors θ with multiple optimizers is zero.

4 An adaptive algorithm for the linear game

While, as shown in Theorem 3.3, FTL can exploit the curvature of the surface of the constraint set to achieve O(log n) regret, it requires both the curvature condition and the averaged loss vectors to be bounded away from zero (min_t ‖Θ_t‖₂ ≥ L > 0); otherwise it may suffer even linear regret. On the other hand, many algorithms, such as the "follow the regularized leader" (FTRL) algorithm, are known to achieve a regret guarantee of O(√n) even for worst-case data in the linear setting. This raises the question of whether one can design an algorithm that achieves constant or O(log n) regret in the respective settings of Corollary 3.10 and Theorem 3.3, while still maintaining a worst-case regret guarantee of order √n (up to logarithmic factors). One way to design such an adaptive algorithm is to use the (A, B)-prod algorithm of Sani et al. [2014], leading to the following result:

Proposition 4.1. Consider (A, B)-prod of Sani et al. [2014], where algorithm A is chosen to be FTRL with an appropriate regularization term, while B is chosen to be FTL. Then the regret of the resulting hybrid algorithm H enjoys the following guarantees:

• If FTL achieves constant regret as in the setting of Corollary 3.10, then the regret of H is also constant.

• If FTL achieves a regret of O(log n) as in the setting of Theorem 3.3, then the regret of H is also O(log n).

• Otherwise, the regret of H is at most O(√(n log n)).

5 Conclusion

FTL is a simple method that is known to perform well in many settings, while existing worst-case results fail to explain its good performance.
While taking a thorough look at why and when FTL can be expected to achieve small regret, we discovered that the curvature of the boundary of the constraint set, together with average loss vectors bounded away from zero, helps keep the regret of FTL small. These conditions are significantly different from previous conditions on the curvature of the loss functions, which have been considered extensively in the literature. It would be interesting to further investigate this phenomenon for other algorithms or in other learning settings.

Acknowledgements

This work was supported in part by the Alberta Innovates Technology Futures through the Alberta Ingenuity Centre for Machine Learning and by NSERC. During part of this work, T. Lattimore was with the Department of Computing Science, University of Alberta.

References

Y. Abbasi-Yadkori. Forced-exploration based algorithms for playing in bandits with large action sets. Library and Archives Canada, 2010.

J. Abernethy, P.L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In 21st Annual Conference on Learning Theory (COLT), 2008.

P.L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 65–72, 2007.

D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. Information Theory, 50(9):2050–2057, 2004.

D.J. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems (NIPS), pages 3357–3365, 2015.

Y. Freund and R. Schapire.
A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

A.A. Gaivoronski and F. Stella. Stochastic nonstationary optimization for finding universal portfolios. Annals of Operations Research, 100(1–4):165–188, 2000.

D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 951, pages 541–549, 2015.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3):169–192, 2007.

R. Huang, T. Lattimore, A. György, and Cs. Szepesvári. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. arXiv, 2016.

S.M. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems (NIPS), pages 1457–1464, 2009.

W. Kotłowski. Minimax strategy for prediction with expert advice under stochastic assumptions. Algorithmic Learning Theory (ALT), 2016.

E.S. Levitin and B.T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

H.B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and implicit updates. arXiv, 2010. URL http://arxiv.org/abs/1009.3240.

N. Merhav and M. Feder. Universal sequential learning and decision from individual data sequences. In 5th Annual ACM Workshop on Computational Learning Theory (COLT), pages 413–427. ACM Press, 1992.

F. Orabona, N. Cesa-Bianchi, and C. Gentile. Beyond logarithmic bounds in online learning.
In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 823–831, 2012.

A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In 26th Annual Conference on Learning Theory (COLT), pages 993–1019, 2013.

A. Sani, G. Neu, and A. Lazaric. Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems (NIPS), pages 810–818, 2014.

R. Schneider. Convex Bodies: The Brunn–Minkowski Theory. Encyclopedia of Mathematics and its Applications. Cambridge Univ. Press, 2nd edition, 2014.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.

T. van Erven, P. Grünwald, N. Mehta, M. Reid, and R. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research (JMLR), 16:1793–1861, 2015. Special issue in Memory of Alexey Chervonenkis.