{"title": "Combining Adversarial Guarantees and Stochastic Fast Rates in Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4457, "page_last": 4465, "abstract": "We consider online learning algorithms that guarantee worst-case regret rates in adversarial environments (so they can be deployed safely and will perform robustly), yet adapt optimally to favorable stochastic environments (so they will perform well in a variety of settings of practical importance). We quantify the friendliness of stochastic environments by means of the well-known Bernstein (a.k.a. generalized Tsybakov margin) condition. For two recent algorithms (Squint for the Hedge setting and MetaGrad for online convex optimization) we show that the particular form of their data-dependent individual-sequence regret guarantees implies that they adapt automatically to the Bernstein parameters of the stochastic environment. We prove that these algorithms attain fast rates in their respective settings both in expectation and with high probability.", "full_text": "Combining Adversarial Guarantees and Stochastic Fast Rates in Online Learning

Wouter M. Koolen
Centrum Wiskunde & Informatica
Science Park 123, 1098 XG Amsterdam, the Netherlands
wmkoolen@cwi.nl

Peter Grünwald
CWI and Leiden University
pdg@cwi.nl

Tim van Erven
Leiden University
Niels Bohrweg 1, 2333 CA Leiden, the Netherlands
tim@timvanerven.nl

Abstract

We consider online learning algorithms that guarantee worst-case regret rates in adversarial environments (so they can be deployed safely and will perform robustly), yet adapt optimally to favorable stochastic environments (so they will perform well in a variety of settings of practical importance). We quantify the friendliness of stochastic environments by means of the well-known Bernstein (a.k.a. generalized Tsybakov margin) condition.
For two recent algorithms (Squint for the Hedge setting and MetaGrad for online convex optimization) we show that the particular form of their data-dependent individual-sequence regret guarantees implies that they adapt automatically to the Bernstein parameters of the stochastic environment. We prove that these algorithms attain fast rates in their respective settings both in expectation and with high probability.

1 Introduction

We consider online sequential decision problems. We focus on full information settings, encompassing such interaction protocols as online prediction, classification and regression, prediction with expert advice or the Hedge setting, and online convex optimization (see Cesa-Bianchi and Lugosi 2006). The goal of the learner is to choose a sequence of actions with small regret, i.e. such that his cumulative loss is not much larger than the loss of the best fixed action in hindsight. This has to hold even in the worst case, where the environment is controlled by an adversary. After three decades of research there exist many algorithms and analysis techniques for a variety of such settings. For many settings, adversarial regret lower bounds of order √T are known, along with matching individual sequence algorithms [Shalev-Shwartz, 2011].

A more recent line of development is to design adaptive algorithms with regret guarantees that scale with some more refined measure of the complexity of the problem. For the Hedge setting, results of this type have been obtained, amongst others, by Cesa-Bianchi et al. [2007], De Rooij et al. [2014], Gaillard et al. [2014], Sani et al. [2014], Even-Dar et al. [2008], Koolen et al. [2014], Koolen and Van Erven [2015], Luo and Schapire [2015], Wintenberger [2015]. Interestingly, the price for such adaptivity (i.e. the worsening of the worst-case regret bound) is typically extremely small (i.e. a constant factor in the regret bound).
For online convex optimization (OCO), many different types of adaptivity have been explored, including by Crammer et al. [2009], Duchi et al. [2011], McMahan and Streeter [2010], Hazan and Kale [2010], Chiang et al. [2012], Steinhardt and Liang [2014], Orabona et al. [2015], Van Erven and Koolen [2016].

Here we are interested in the question of whether such adaptive results are strong enough to lead to improved rates in the stochastic case when the data follow a "friendly" distribution. In specific cases it has been shown that fancy guarantees do imply significantly reduced regret. For example, Gaillard et al. [2014] present a generic argument showing that a certain kind of second-order regret guarantee implies constant expected regret (the fastest possible rate) for i.i.d. losses drawn from a distribution with a gap (between the expected loss of the best and all other actions). In this paper we significantly extend this result. We show that a variety of individual-sequence second-order regret guarantees imply fast regret rates for distributions under much milder stochastic assumptions. In particular, we will look at the Bernstein condition (see Bartlett and Mendelson 2006), which is the key to fast rates in the batch setting. This condition provides a parametrised interpolation (expressed in terms of the Bernstein exponent κ ∈ [0, 1]) between the friendly gap case (κ = 1) and the stochastic worst case (κ = 0). We show that appropriate second-order guarantees automatically lead to adaptation to these parameters, for both the Hedge setting and for OCO. In the Hedge setting, we build on the guarantees available for the Squint algorithm [Koolen and Van Erven, 2015], and for OCO we rely on guarantees achieved by MetaGrad [Van Erven and Koolen, 2016].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
In both cases we obtain regret rates of order T^((1−κ)/(2−κ)) (Theorem 2). These rates include the slow worst-case √T regime for κ = 0 and the fastest (doubly) logarithmic regime for κ = 1. We show all this, not just in expectation (which is relatively easy), but also with high probability (which is much harder). Our proofs make use of a convenient novel notation (ESI, for exponential stochastic inequality) which allows us to prove such results simultaneously, and which is of independent interest (Definition 5). Our proofs use that, for bounded losses, the Bernstein condition is equivalent to the ESI-Bernstein condition, which we introduce.

The next section introduces the two settings we consider and the individual sequence guarantees we will use in each. It also reviews the stochastic criteria for fast rates and presents our main result. In Section 3 we consider a variety of examples illustrating the breadth of cases that we cover. In Section 4 we introduce ESI and give a high-level overview of our proof.

2 Setup and Main Result

2.1 Hedge Setting

We start with arguably the simplest setting of online prediction, the Hedge setting popularized by Freund and Schapire [1997]. To be able to illustrate the full reach of our stochastic assumption we will use a minor extension to countably infinitely many actions k ∈ N = {1, 2, ...}, customarily called experts. The protocol is as follows. Each round t the learner plays a probability mass function w_t = (w^1_t, w^2_t, ...) on experts. Then the environment reveals the losses ℓ_t = (ℓ^1_t, ℓ^2_t, ...) of the experts, where each ℓ^k_t ∈ [0, 1]. The learner incurs loss ⟨w_t, ℓ_t⟩ = ∑_k w^k_t ℓ^k_t.
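The interaction protocol just described can be sketched in code. The following minimal simulation uses a standard exponential-weights (Hedge) learner as a stand-in; it is not the Squint algorithm analysed in this paper, and the learning rate eta is the textbook tuning, assumed here purely for illustration.

```python
import math
import random

def play_hedge(losses, eta):
    """Run the Hedge protocol: each round play a probability mass function
    over K experts, observe the loss vector, suffer the dot product.
    Returns the regret against the best fixed expert in hindsight."""
    K = len(losses[0])
    cum = [0.0] * K           # cumulative loss of each expert
    learner = 0.0             # cumulative loss of the learner
    for loss_t in losses:
        z = sum(math.exp(-eta * c) for c in cum)
        w = [math.exp(-eta * c) / z for c in cum]   # weights on experts
        learner += sum(wk * lk for wk, lk in zip(w, loss_t))
        cum = [c + l for c, l in zip(cum, loss_t)]
    return learner - min(cum)

random.seed(0)
T, K = 10_000, 8
losses = [[random.random() for _ in range(K)] for _ in range(T)]
regret = play_hedge(losses, eta=math.sqrt(8 * math.log(K) / T))
```

With this classical tuning the regret is guaranteed to be at most √(T ln(K)/2) on every loss sequence, which matches the worst-case Θ(√(T ln K)) regime discussed below.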
The regret after T rounds compared to expert k is given by

R^k_T := ∑_{t=1}^T (⟨w_t, ℓ_t⟩ − ℓ^k_t).

The goal of the learner is to keep the regret small compared to any expert k. We will make use of Squint by Koolen and Van Erven [2015], a self-tuning algorithm for playing w_t. Koolen and Van Erven [2015, Theorem 4] show that Squint with prior probability mass function π = (π^1, π^2, ...) guarantees

R^k_T ≤ √(V^k_T K^k_T) + K^k_T, where K^k_T = O(−ln π^k + ln ln T), for any expert k. (1)

Here V^k_T := ∑_{t=1}^T (⟨w_t, ℓ_t⟩ − ℓ^k_t)^2 is a second-order term that depends on the algorithm's own predictions w_t. It is well-known that with K experts the worst-case lower bound is Θ(√(T ln K)) [Cesa-Bianchi and Lugosi, 2006, Theorem 3.7]. Taking a fat-tailed prior π, for example π^k = 1/(k(k+1)), and using V^k_T ≤ T, the above bound implies R^k_T ≤ O(√(T (ln k + ln ln T))), matching the lower bound in some sense for all k simultaneously.

The question we study in this paper is what becomes of the regret when the sequence of losses ℓ_1, ℓ_2, ... is drawn from some distribution P, not necessarily i.i.d. But before we expand on such stochastic cases, let us first introduce another setting.

2.2 Online Convex Optimization (OCO)

We now turn to our second setting called online convex optimization [Shalev-Shwartz, 2011]. Here the set of actions is a compact convex set U ⊆ R^d. Each round t the learner plays a point w_t ∈ U. Then the environment reveals a convex loss function ℓ_t : U → R.
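This protocol can likewise be instantiated in a few lines. The sketch below uses projected online gradient descent as a simple baseline learner (this is not the MetaGrad algorithm relied on below), on the interval U = [0, 1] with squared loss; the step size D/(G√t) is the standard tuning, assumed here for illustration.

```python
import math
import random

def ogd(xs, eta0):
    """Projected online gradient descent on U = [0, 1] for the convex
    losses l_t(w) = (w - x_t)**2, revealed one per round."""
    w, total = 0.5, 0.0
    for t, x in enumerate(xs, start=1):
        total += (w - x) ** 2            # suffer loss l_t(w_t) first
        grad = 2.0 * (w - x)             # then the gradient is revealed
        w = min(1.0, max(0.0, w - (eta0 / math.sqrt(t)) * grad))
    return total

random.seed(1)
T = 4_000
xs = [random.random() for _ in range(T)]
learner = ogd(xs, eta0=0.5)              # eta0 = D/G with D = 1, G = 2
# for this loss the best fixed point in hindsight is the empirical mean
u_star = sum(xs) / T
best = sum((u_star - x) ** 2 for x in xs)
regret = learner - best
```

The standard analysis of this tuning gives regret at most (3/2) DG √T on any sequence; here D = 1 and G = 2.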
The loss of the learner is ℓ_t(w_t). The regret after T rounds compared to u ∈ U is given by

R^u_T := ∑_{t=1}^T (ℓ_t(w_t) − ℓ_t(u)).

The goal is small regret compared to any point u ∈ U. A common tool in the analysis of algorithms is the linear upper bound on the regret obtained from convexity of ℓ_t (at non-differentiable points we may take any sub-gradient):

R^u_T ≤ R̃^u_T := ∑_{t=1}^T ⟨w_t − u, ∇ℓ_t(w_t)⟩.

We will make use of (the full matrix version of) MetaGrad by Van Erven and Koolen [2016]. In their Theorem 8, they show that, simultaneously, R̃^u_T ≤ O(DG√T) and

R̃^u_T ≤ √(V^u_T K_T) + DG K_T, where K_T = O(d ln T), for any u ∈ U, (2)

where D bounds the two-norm diameter of U, G bounds the two-norm ‖∇ℓ_t(w_t)‖_2 of the gradients, and V^u_T := ∑_{t=1}^T ⟨w_t − u, ∇ℓ_t(w_t)⟩^2. The first bound matches the worst-case lower bound. The second bound (2) may be a factor √K_T worse, as V^u_T ≤ G^2 D^2 T by Cauchy-Schwarz. Yet in this paper we will show fast rates in certain stochastic settings arising from (2). To simplify notation we will assume from now on that DG = 1 (this can always be achieved by scaling the loss).

To talk about stochastic settings we will assume that the sequence ℓ_t of loss functions (and hence the gradients ∇ℓ_t(w_t)) are drawn from a distribution P, not necessarily i.i.d. This includes the common case of linear regression and classification where ℓ_t(u) = loss(⟨u, x_t⟩, y_t) with (x_t, y_t) sampled i.i.d. and loss a fixed one-dimensional convex loss function (e.g. square loss, absolute loss, log loss, hinge loss, ...).

2.3 Parametrised Family of Stochastic Assumptions

We now recall the Bernstein [Bartlett and Mendelson, 2006] stochastic condition.
The idea behind this assumption is to control the variance of the excess loss of the actions in the neighborhood of the best action.

We do not require that the losses are i.i.d., nor that the Bayes act is in the model. For the Hedge setting it suffices if there is a fixed expert k* that is always best, i.e. E[ℓ^{k*}_t | G_{t−1}] = inf_k E[ℓ^k_t | G_{t−1}] almost surely for all t. (Here we denote by G_{t−1} the sigma algebra generated by ℓ_1, ..., ℓ_{t−1}, and the almost-surely quantification refers to the distribution of ℓ_1, ..., ℓ_{t−1}.) Similarly, for OCO we assume there is a fixed point u* ∈ U attaining min_{u ∈ U} E[ℓ_t(u) | G_{t−1}] at every round t. In either case there may be multiple candidate k* or u*. In what follows we assume that one is selected. Note that for i.i.d. losses the existence of a minimiser is not such a strong assumption (if the loss functions ℓ_t are continuous, it is even automatic in the OCO case due to compactness of U), while it is very strong beyond i.i.d. Yet it is not impossible (and actually interesting), as we will show by example in Section 3.

Based on the loss minimiser, we define the excess losses, a family of random variables indexed by time t ∈ N and expert/point k ∈ N / u ∈ U, as follows:

x^k_t := ℓ^k_t − ℓ^{k*}_t (Hedge) and x^u_t := ⟨u − u*, ∇ℓ_t(u)⟩ (OCO). (3)

Note that for the Hedge setting we work with the loss directly. For OCO instead we talk about the linear upper bound on the excess loss, for this is the quantity that needs to be controlled to make use of the MetaGrad bound (2). With these variables in place, from this point on the story is the same for Hedge and for OCO. So let us write F for either the set N of experts or the set U of points, and f* for k* resp. u*, and let us consider the family {x^f_t : f ∈ F, t ∈ N}.
We call f ∈ F predictors. With this notation the Bernstein condition is the following.

Condition 1. Fix B ≥ 0 and κ ∈ [0, 1]. The family (3) satisfies the (B, κ)-Bernstein condition if

E[(x^f_t)^2 | G_{t−1}] ≤ B E[x^f_t | G_{t−1}]^κ

almost surely for all f ∈ F and rounds t ∈ N.

The point of this stochastic condition is that it implies that the variance of the excess loss gets smaller the closer a predictor gets to the optimum in terms of expected excess loss.

Some authors refer to the κ = 1 case as the Massart condition. Van Erven et al. [2015] have shown that the Bernstein condition is equivalent to the central condition, a fast-rate type of condition that has been frequently used (without an explicit name) in density estimation under misspecification. Two more equivalent conditions appear in our proof sketch in Section 4. We compare all four formulations in Appendix B.

2.4 Main Result

In the stochastic case we evaluate the performance of algorithms by R^{f*}_T, i.e. the regret compared to the predictor f* with minimal expected loss. The expectation E[R^{f*}_T] is sometimes called the pseudo-regret. The following result shows that second-order methods automatically adapt to the Bernstein condition. (Proof sketch in Section 4.)

Theorem 2. In any stochastic setting satisfying the (B, κ)-Bernstein Condition 1, the guarantees (1) for Squint and (2) for MetaGrad imply fast rates for the respective algorithms, both in expectation and with high probability.
That is,

E[R^{f*}_T] = O(K_T^{1/(2−κ)} T^{(1−κ)/(2−κ)}),

and for any δ > 0, with probability at least 1 − δ,

R^{f*}_T = O((K_T − ln δ)^{1/(2−κ)} T^{(1−κ)/(2−κ)}),

where for Squint K_T := K^{f*}_T from (1) and for MetaGrad K_T is as in (2).

We see that Squint and MetaGrad adapt automatically to the Bernstein parameters of the distribution, without any tuning. Theorem 2 only uses the form of the second-order bounds and does not depend on the details of the algorithms, so it also applies to any other method with a second-order regret bound. In particular it holds for Adapt-ML-Prod by Gaillard et al. [2014], which guarantees (1) with K_T = O(ln|F| + ln ln T) for finite sets of experts. Here we focus on Squint as it also applies to infinite sets. Appendix D provides an extension of Theorem 2 that allows using Squint with uncountable F.

Crucially, the bound provided by Theorem 2 is natural, and, in general, the best one can expect. This can be seen from considering the statistical learning setting, which is a special case of our setup. Here (x_t, y_t) are i.i.d. ∼ P and F is a set of functions from X to a set of predictions A, with ℓ^f_t := ℓ(y_t, f(x_t)) for some loss function ℓ : Y × A → [0, 1] such as squared, 0/1, or absolute loss. In this setting one usually considers the excess risk, which is the expected loss difference between the learned f̂ and the optimal f*. The minimax expected (over training sample (x_t, y_t)) risk relative to f* is of order T^{−1/2} (see e.g. Massart and Nédélec [2006], Audibert [2009]). To get better risk rates, one has to impose further assumptions on P. A standard assumption made in such cases is a Bernstein condition with exponent κ > 0; see e.g.
Koltchinskii [2006], Bartlett and Mendelson [2006], Audibert [2004] or Audibert [2009]; see Van Erven et al. [2015] for how it generalizes the Tsybakov margin and other conditions.

If F is sufficiently 'simple', e.g. a class with logarithmic entropy numbers (see Appendix D) or, in classification, a VC class, then, if a κ-Bernstein condition holds, ERM (empirical risk minimization) achieves, in expectation, a better excess risk bound of order O((log T) · T^{−1/(2−κ)}). The bound interpolates between T^{−1/2} for κ = 0 and T^{−1} for κ = 1 (Massart condition). Results of Tsybakov [2004], Massart and Nédélec [2006], Audibert [2009] suggest that this rate can, in general, not be improved upon, and exactly this rate is achieved by ERM and various other algorithms in various settings by e.g. Tsybakov [2004], Audibert [2004, 2009], Bartlett et al. [2006]. By summing from t = 1 to T and using ERM at each t to classify the next data point (so that ERM becomes FTL, follow-the-leader), this suggests that we can achieve a cumulative expected regret E[R^{f*}_T] of order O((log T) · T^{(1−κ)/(2−κ)}). Theorem 2 shows that this is, indeed, also the rate that Squint attains in such cases if F is countable and the optimal f* has positive prior mass π^{f*} > 0 (more on this condition below). We thus see that Squint obtains exactly the rates one would expect from a statistical learning/classification perspective, and the minimax excess risk results in that setting suggest that these cumulative regret rates cannot be improved in general.

It was shown earlier by Audibert [2004] that, when equipped with an oracle to tune the learning rate η as a function of t, the rates O((log T) · T^{(1−κ)/(2−κ)}) can also be achieved by Hedge, but the exact tuning depends on the unknown κ. Grünwald [2012] provides a means to tune η automatically in terms of the data, but his method, like ERM and all algorithms in the references above, may achieve linear regret in worst-case settings, whereas Squint keeps the O(√T) guarantee for such cases.

Theorem 2 only gives the desired rate for Squint with infinite F if F is countable and π^{f*} > 0. The combination of these two assumptions is strong or at least unnatural, and OCO cannot be readily used in all such cases either, so in Appendix D we show how to extend Theorem 2 to the case of uncountably infinite F, which can have π^{f*} = 0, as long as F admits sufficiently small entropy numbers. Incidentally, this also allows us to show that Squint achieves regret rate O((log T) · T^{(1−κ)/(2−κ)}) when F = ∪_{i=1,2,...} F_i is a countably infinite union of F_i with appropriate entropy numbers; in such cases there can be, at every sample size, a classifier f̂ ∈ F with 0 empirical error, so that ERM/FTL will always over-fit and cannot be used even if the Bernstein condition holds; Squint allows for aggregation of such models. In the remainder of the main text, we concentrate on applications for which Theorem 2 can be used directly, without extensions.
3 Examples

We give examples motivating and illustrating the Bernstein condition for the Hedge and OCO settings. Our examples in the Hedge setting will illustrate Bernstein with κ < 1 and non-i.i.d. distributions. Our OCO examples were chosen to be natural and to illustrate fast rates without curvature.

3.1 Hedge Setting: Gap implies Bernstein with κ = 1

In the Hedge setting, we say that a distribution P (not necessarily i.i.d.) of expert losses {ℓ^k_t : t, k ∈ N} has gap α > 0 if there is an expert k* such that

E[ℓ^{k*}_t | G_{t−1}] + α ≤ inf_{k ≠ k*} E[ℓ^k_t | G_{t−1}]

almost surely for each round t ∈ N. It is clear that the condition can only hold for k* the minimiser of the expected loss.

Lemma 3. A distribution with gap α is (1/α, 1)-Bernstein.

Proof. For all k ≠ k* and t, we have E[(x^k_t)^2 | G_{t−1}] ≤ 1 = (1/α) · α ≤ (1/α) · E[x^k_t | G_{t−1}].

By Theorem 2 we get the R^{k*}_T = O(K_T) = O(ln ln T) rate. Gaillard et al. [2014] show constant regret for finitely many experts and i.i.d. losses with a gap. Our alternative proof above shows that neither finiteness nor i.i.d. are essential for fast rates in this case.

3.2 Hedge Setting: Any (1, κ)-Bernstein

The next example illustrates that we can sometimes get the fast rates without a gap. And it also shows that we can get any intermediate rate: we construct an example satisfying the Bernstein condition for any κ ∈ [0, 1] of our choosing (such examples occur naturally in classification settings such as those considered in the example in Appendix D).

Fix κ ∈ [0, 1]. Each expert k = 1, 2, ... is parametrised by a real number δ_k ∈ [0, 1/2]. The only assumption we make is that δ_k = 0 for some k, and inf_k {δ_k : δ_k > 0} = 0. For a concrete example let us choose δ_1 = 0 and δ_k = 1/k for k = 2, 3, ... Expert δ_k has loss 1/2 ± δ_k with probability (1 ± δ_k^{2/κ−1})/2, independently between experts and rounds. Expert δ_k thus has mean loss 1/2 + δ_k^{2/κ}, and so δ_1 = 0 is best, with loss deterministically equal to 1/2. The squared excess loss of δ_k is δ_k^2. So we have the Bernstein condition with exponent κ (but no κ′ > κ) and constant 1, and the associated regret rate by Theorem 2.
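The claimed moments of this construction can be verified exactly (no simulation needed), since the excess loss is a two-point random variable. The sketch below assumes the reading of the garbled original in which the ± sign has probability (1 ± δ_k^{2/κ−1})/2; the case κ = 0 is excluded since the exponent degenerates.

```python
def bernstein_check(delta, kappa, B=1.0):
    """Excess loss x of expert delta takes values +delta and -delta with
    probabilities (1 +/- delta**(2/kappa - 1)) / 2, so that
    E[x] = delta**(2/kappa) while x**2 = delta**2 deterministically.
    Returns whether E[x^2] <= B * E[x]**kappa (the Bernstein inequality)."""
    mean = delta * delta ** (2 / kappa - 1)   # E[x], avoiding cancellation
    second = delta ** 2                       # E[x^2]
    return second <= B * mean ** kappa + 1e-12

# the (1, kappa)-Bernstein condition holds (with equality) for all experts
assert all(bernstein_check(d, k)
           for k in (0.25, 0.5, 1.0) for d in (0.5, 0.1, 0.01))
```

The inequality holds with equality here, which is why no exponent κ′ > κ can work.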
Note that for κ = 0 (the hard case) all experts have mean loss equal to 1/2. So no matter which k* we designate as the best expert, our pseudo-regret E[R^{k*}_T] is zero. Yet the experts do not agree, as their losses deviate from 1/2 independently at random. Hence, by the central limit theorem, with high probability our regret R^{k*}_T is of order √T. On the other side of the spectrum, for κ = 1 (the best case), we do not find a gap. We still have experts arbitrarily close to the best expert in mean, but their expected excess loss squared equals their expected excess loss.

ERM/FTL (and hence all approaches based on it, such as [Bartlett and Mendelson, 2006]) may fail completely on this type of examples. The clearest case is when {k : δ_k > ε} is infinite for some ε > 0. Then at any t there will be experts that, by chance, incurred their lower loss every round. Picking any of them will result in expected instantaneous regret at least ε^{2/κ}, leading to linear regret overall. The requirement δ_k = 0 for some k is essential. If instead δ_k > 0 for all k then there is no best expert in the class. Theorem 19 in Appendix D shows how to deal with this case.

3.3 Hedge Setting: Markov Chains

Suppose we model a binary sequence z_1, z_2, ..., z_T with m-th order Markov chains. As experts we consider all possible functions f : {0, 1}^m → {0, 1} that map a history of length m to a prediction for the next outcome, and the loss of expert f is the 0/1-loss: ℓ^f_t = |f(z_{t−m}, ..., z_{t−1}) − z_t|. (We
, zt\u22121)\u2212 zt. (We\ninitialize z1\u2212m= . . .= z0= 0.) A uniform prior on this \ufb01nite set of 22m experts results in worst-case\ntransition probabilities P(zt= 1(zt\u2212m, . . . , zt\u22121)= a)= pa, we have f\u2217(a)= 1{pa\u2265 1\n} and\n\nE\u0002xf\nt\u0002(zt\u2212m, . . . , zt\u22121)= a\u0002= 2pa\u2212 1\nfor any f such that f(a)\u2260 f\u2217(a). So the Bernstein condition holds with \u03ba= 1 and B=\n2 minapa\u2212 1\n.\nLet(x1, y1),(x2, y2), . . . be classi\ufb01cation data, with yt\u2208{\u22121,+1} and xt\u2208 Rd, and consider the\nhinge loss (cid:96)t(u)= max{0, 1\u2212 yt\u001bxt, u\u001b}. Now suppose, for simplicity, that both xt and u come\nfrom the d-dimensional unit Euclidean ball, such that\u001bxt, u\u001b\u2208[\u22121,+1] and hence the hinge is never\nactive, i.e. (cid:96)t(u)= 1\u2212 yt\u001bxt, u\u001b. Then, if the data turn out to be i.i.d. observations from a \ufb01xed\ndistribution P, the Bernstein condition holds with \u03ba= 1 (the proof can be found in Appendix C):\n\u001bxt, u\u001b\u2264 1. If the data are i.i.d., then the(B, \u03ba)-Bernstein condition is satis\ufb01ed with \u03ba= 1 and\nB= 2\u03bbmax\n\u0001\u00b5\u0001 , where \u03bbmax is the maximum eigenvalue of E[xx\u0016] and \u00b5= E[yx], provided that\u0001\u00b5\u0001> 0.\nIn particular, if xt is uniformly distributed on the sphere and yt = sign(\u001b\u00afu, xt\u001b) is the noiseless\nclassi\ufb01cation of xt according to the hyper-plane with normal vector \u00afu, then B\u2264 c\u221a\nconstant c> 0.\nThe excluded case\u0001\u00b5\u0001= 0 only happens in the degenerate case that there is nothing to learn, because\n\u00b5= 0 implies that the expected hinge loss is 1, its maximal value, for all u.\nLetU=[0, 1] be the unit interval. Consider the absolute loss (cid:96)t(u)=u\u2212 xt where xt\u2208[0, 1] are\ndrawn i.i.d. from P. Let u\u2217\u2208 arg minu Eu\u2212 x minimize the expected loss. 
In this case we may\nsimplify\u001bw\u2212 u\u2217,\u2207(cid:96)(w)\u001b=(w\u2212 u\u2217) sign(w\u2212 x). To satisfy the Bernstein condition, we therefore\nwant B such that, for all w\u2208[0, 1],\n\u2217) sign(w\u2212 x)\u00012\u0002 \u2264 B E[(w\u2212 u\n\u22172\u2212\u03ba \u2264 B2\u03ba P(x\u2264 w)\u2212 1\nw\u2212 u\n\n\u2217) sign(w\u2212 x)]\u03ba .\n\u03ba.\n\nLemma 4 (Unregularized Hinge Loss Example). Consider the hinge loss setting above, where\n\nE\u0002\u0001(w\u2212 u\n\n3.5 OCO: Absolute Loss\n\nfor some absolute\n\nd\n\nThat is,\n\n\fFor instance, if the distribution of x has a strictly positive density p(x)\u2265 m> 0, then u\u2217 is the\n= P(x\u2264 w)\u2212 P(x\u2264 u\u2217)\u2265 mw\u2212 u\u2217, so the condition holds with \u03ba= 1\nmedian and P(x\u2264 w)\u2212 1\nand B= 1\n1\u2212 p, the condition holds with \u03ba= 1 and B=\nw\u2212 u\u2217\u2264 1 and P(x\u2264 w)\u2212 1\n\u2265p\u2212 1\n.\n\n2m. Alternatively, for a discrete distribution on two points a and b with probabilities p and\n2, as can be seen by bounding\n\n12p\u22121, provided that p\u2260 1\n\n2\n\n2\n\n2\n\n4 Proof Ideas\n\nThis section builds up to prove our main result Theorem 2. We \ufb01rst introduce the handy ESI-\nabbreviation that allows us to reason simultaneously in expectation and with high probability. We\nthen provide two alternative characterizations of the Bernstein condition that are equivalent for\nbounded losses. Finally, we show how one of these, ESI-Bernstein, combines with individual-\nsequence second-order regret bounds to give rise to Theorem 2.\n\n4.1 Notation: Exponential Stochastic Inequality (ESI, pronounce easy)\n\nLemma 6. Exponential stochastic negativity/inequality has the following useful properties:\n\nDe\ufb01nition 5. A random variable X is exponentially stochastically negative, denoted X \u0016 0, if\nE[eX]\u2264 1. For any \u03b7\u2265 0, we write X\u0016\u03b7 0 if \u03b7X\u0016 0. 
For any pair of random variables X and Y, the exponential stochastic inequality (ESI) X ⊴_η Y is defined as expressing X − Y ⊴_η 0; X ⊴ Y is defined as X ⊴_1 Y.

Lemma 6. Exponential stochastic negativity/inequality has the following useful properties:

1. (Negativity). Let X ⊴ 0. As the notation suggests, X is negative in expectation and with high probability. That is, E[X] ≤ 0 and P{X ≥ −ln δ} ≤ δ for all δ > 0.

2. (Convex combination). Let {X^f}_{f ∈ F} be a family of random variables and let w be a probability distribution on F. If X^f ⊴ 0 for all f, then E_{f∼w}[X^f] ⊴ 0.

3. (Chain rule). Let X_1, X_2, ... be adapted to a filtration G_1 ⊆ G_2 ⊆ ... (i.e. X_t is G_t-measurable for each t). If X_t | G_{t−1} ⊴ 0 almost surely for all t, then ∑_{t=1}^T X_t ⊴ 0 for all T ≥ 0.

Proof. Negativity: By Jensen's inequality E[X] ≤ ln E[e^X] ≤ 0, whereas by Markov's inequality P{X ≥ −ln δ} = P{e^X ≥ 1/δ} ≤ δ E[e^X] ≤ δ. Convex combination: By Jensen's inequality E[e^{E_{f∼w}[X^f]}] ≤ E_{f∼w} E[e^{X^f}] ≤ 1. Chain rule: By induction. The base case T = 0 holds trivially. For T > 0 we have E[e^{∑_{t=1}^T X_t}] = E[e^{∑_{t=1}^{T−1} X_t} E[e^{X_T} | G_{T−1}]] ≤ E[e^{∑_{t=1}^{T−1} X_t}] ≤ 1.
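The two consequences in the Negativity property are easy to check by simulation. As an assumed illustration (not an example from the paper), take X normal with mean −σ²/2 and variance σ², for which E[e^X] = 1 exactly, so X is exponentially stochastically negative:

```python
import math
import random

random.seed(2)
sigma, n, delta = 1.0, 200_000, 0.05
# X ~ N(-sigma^2/2, sigma^2) has E[exp(X)] = 1, hence X is
# exponentially stochastically negative.
xs = [random.gauss(-sigma ** 2 / 2, sigma) for _ in range(n)]

mgf = sum(math.exp(x) for x in xs) / n             # should be close to 1
mean = sum(xs) / n                                 # Negativity: E[X] <= 0
tail = sum(x >= -math.log(delta) for x in xs) / n  # Negativity: <= delta
```

Here the empirical mean is about −σ²/2 and the tail P{X ≥ −ln δ} is far below δ, matching the Markov-inequality argument in the proof.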
4.2 The Bernstein Condition and Second-order Bounds

Our main result, Theorem 2, bounds the regret R^{f*}_T compared to the stochastically optimal predictor f* when the sequence of losses ℓ_1, ℓ_2, ... comes from a Bernstein distribution P. For simplicity we only consider the OCO setting in this sketch. Full details are in Theorem 11. Our starting point will be the individual-sequence second-order bound (2), which implies R^{f*}_T ≤ R̃^{f*}_T = O(√(V^{f*}_T K_T)). The crucial technical contribution of this paper is to establish that for Bernstein distributions V^{f*}_T is bounded in terms of R̃^{f*}_T with high probability. Combination with the individual-sequence bound then gives that R̃^{f*}_T is bounded in terms of a function of itself. And solving the inequality for R̃^{f*}_T then establishes the fast rates for R^{f*}_T.

To get a first intuition as to why V^{f*}_T would be bounded in terms of R̃^{f*}_T, we look at their relation in expectation. Recall that V^{f*}_T = ∑_{t=1}^T (x^{f_t}_t)^2 and R̃^{f*}_T = ∑_{t=1}^T x^{f_t}_t, where f_t is the prediction of the algorithm in round t. We will bound (x^{f_t}_t)^2 in terms of x^{f_t}_t separately for each round t. The Bernstein Condition 1 for κ = 1 directly yields

E[V^{f*}_T] = ∑_{t=1}^T E[(x^{f_t}_t)^2] ≤ B ∑_{t=1}^T E[x^{f_t}_t] = B E[R̃^{f*}_T]. (4)

For κ < 1 the final step of interchanging expectation and sums does not work directly, but we may use z^κ = κ^κ (1−κ)^{1−κ} inf_{ε>0} {ε^{κ−1} z + ε^κ} for z ≥ 0 to rewrite the Bernstein condition as the following set of linear inequalities:

Condition 7. The excess loss family (3) satisfies the linearized κ-Bernstein condition if there are constants c_1, c_2 > 0 such that

c_1 · ε^{1−κ} · E[(x^f_t)^2 | G_{t−1}] − E[x^f_t | G_{t−1}] ≤ c_2 · ε a.s. for all ε > 0, f ∈ F and t ∈ N.

This gives the following generalization of (4):

c_1 · ε^{1−κ} · E[V^{f*}_T] − E[R̃^{f*}_T] ≤ c_2 · T · ε. (5)

Together with the individual-sequence regret bound and optimization of ε this can be used to derive the in-expectation part of Theorem 2.

Getting the in-probability part is more difficult, however, and requires relating V^{f*}_T and R̃^{f*}_T in probability instead of in expectation. Our main technical contribution does exactly this, by showing that the Bernstein condition is in fact equivalent to the following exponential strengthening of Condition 7:

Condition 8. The family (3) satisfies the κ-ESI-Bernstein condition if there are c_1, c_2 > 0 such that

(c_1 · ε^{1−κ} · (x^f_t)^2 − x^f_t) | G_{t−1} ⊴_{ε^{1−κ}} c_2 · ε a.s. for all ε > 0, f ∈ F and t ∈ N.

Condition 8 implies Condition 7 by Jensen's inequality (see Lemma 6 part 1). The surprising converse is proved in Lemma 9 in the appendix. By telescoping over rounds using the chain rule from Lemma 6, we see that ESI-Bernstein implies the following substantial strengthening of (5):

c_1 · ε^{1−κ} · V^{f*}_T − R̃^{f*}_T ⊴_{ε^{1−κ}} c_2 · T · ε a.s. for all ε > 0, T ∈ N. (6)

Now the second-order regret bound (2) can be rewritten, using 2√(ab) = inf_γ {γa + b/γ}, as:

for every γ > 0: 2 R̃^{f*}_T ≤ 2√(V^{f*}_T K_T) + 2 K_T ≤ γ · V^{f*}_T + K_T/γ + 2 K_T.

Plugging in γ = c_1 ε^{1−κ}, we can chain this inequality with (6) to give, for all ε > 0,

2 R̃^{f*}_T ⊴_{ε^{1−κ}} R̃^{f*}_T + c_2 · T · ε + K_T/(c_1 ε^{1−κ}) + 2 K_T, (7)

and both parts of Theorem 2 now follow by rearranging, plugging in the minimiser ε ∝ (K_T/T)^{1/(2−κ)}, and using Lemma 6 part 1.

Acknowledgments

Koolen acknowledges support by the Netherlands Organization for Scientific Research (NWO, Veni grant 639.021.439).

References

J-Y. Audibert. PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI, 2004.
J-Y. Audibert. Fast learning rates in statistical inference through aggregation. Ann. Stat., 37(4), 2009.
P. Bartlett and S. Mendelson. Empirical minimization. Probab. Theory Rel., 135(3):311-334, 2006.
P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. J. Am. Stat. Assoc., 101(473):138-156, 2006.
N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3):321-352, 2007.
C. Chiang, T. Yang, C. Le, M. Mahdavi, C. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proc. 25th Conf. on Learning Theory (COLT), 2012.
K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS 22, 2009.
J. Duchi, E. Hazan, and Y. Singer.
Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
T. van Erven and W. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, 2016.
T. van Erven, P. Grünwald, N. Mehta, M. Reid, and R. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.
E. Even-Dar, M. Kearns, Y. Mansour, and J. Wortman. Regret to the best vs. regret to the average. Machine Learning, 72(1–2), 2008.
Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
P. Gaillard and S. Gerchinovitz. A chaining algorithm for online nonparametric regression. In Proc. 28th Conf. on Learning Theory (COLT), 2015.
P. Gaillard, G. Stoltz, and T. van Erven. A second-order bound with excess losses. In Proc. 27th Conf. on Learning Theory (COLT), 2014.
P. Grünwald. The safe Bayesian: learning the learning rate via the mixability gap. In ALT '12. Springer, 2012.
E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2–3):165–188, 2010.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat., 34(6):2593–2656, 2006.
W. Koolen. The relative entropy bound for Squint. Blog entry on blog.wouterkoolen.info, August 2015.
W. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Proc. 28th Conf. on Learning Theory (COLT), pages 1155–1175, 2015.
W. Koolen, T. van Erven, and P. Grünwald. Learning the learning rate for prediction with expert advice. In Advances in Neural Information Processing Systems 27, pages 2294–2302, 2014.
H. Luo and R. Schapire. Achieving all with no parameters: Adaptive normalhedge. In Proc. 28th Conf. on Learning Theory (COLT), 2015.
P. Massart and É. Nédélec. Risk bounds for statistical learning. Ann. Stat., 34(5):2326–2366, 2006.
B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proc. 23rd Conf. on Learning Theory (COLT), pages 244–256, 2010.
N. Mehta and R. Williamson. From stochastic mixability to fast rates. In NIPS 27, 2014.
F. Orabona, K. Crammer, and N. Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411–435, 2015.
A. Rakhlin and K. Sridharan. Online nonparametric regression. In Proc. 27th Conf. on Learning Theory (COLT), 2014.
S. de Rooij, T. van Erven, P. Grünwald, and W. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.
A. Sani, G. Neu, and A. Lazaric. Exploiting easy data in online optimization. In NIPS 27, 2014.
S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proc. 31st Int. Conf. on Machine Learning (ICML), pages 1593–1601, 2014.
A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Stat., 32:135–166, 2004.
O. Wintenberger. Optimal learning with Bernstein Online Aggregation. ArXiv:1404.1356, 2015.
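The negativity part of the ESI calculus in Section 4 can also be checked empirically. The following Monte Carlo sketch is our own illustration, not part of the paper: it takes $X = Z - \tfrac{1}{2}$ with $Z$ standard normal, for which $\mathbb{E}[e^X] = 1$ and hence $X \trianglelefteq 0$ with $\eta = 1$, and verifies both conclusions of the negativity property on simulated data.

```python
import math
import random

random.seed(0)

# X = Z - 1/2 with Z standard normal satisfies E[exp(X)] = exp(1/2) * exp(-1/2) = 1,
# so X is ESI-negative with eta = 1 (illustrative choice, not from the paper).
n = 100_000
samples = [random.gauss(0.0, 1.0) - 0.5 for _ in range(n)]

# ESI witness: the empirical moment generating function at 1 is close to 1.
mgf = sum(math.exp(x) for x in samples) / n

# Negativity in expectation: E[X] = -1/2 <= 0.
mean = sum(samples) / n

# Negativity with high probability: P{X >= -ln(delta)} <= delta.
delta = 0.05
tail = sum(1 for x in samples if x >= -math.log(delta)) / n

assert abs(mgf - 1.0) < 0.1
assert mean < 0.0
assert tail <= delta
```

The observed tail frequency is far below $\delta$, as expected: for this $X$ the Markov bound $\delta\,\mathbb{E}[e^X]$ is loose, but the lemma only promises the one-sided guarantee.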
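The linearization step before Condition 7 rests on the variational identity $z^\kappa = \kappa^\kappa (1-\kappa)^{1-\kappa} \inf_{\epsilon > 0}(\epsilon^{\kappa-1} z + \epsilon^\kappa)$, whose infimum is attained at $\epsilon^* = (1-\kappa)z/\kappa$. As a numerical sanity check (our own sketch; the function names are ours, not the paper's), the following compares a grid infimum and the closed-form minimiser against $z^\kappa$.

```python
import math

def linearized(z, kappa, eps):
    """kappa^kappa * (1-kappa)^(1-kappa) * (eps^(kappa-1) * z + eps^kappa)."""
    const = kappa ** kappa * (1 - kappa) ** (1 - kappa)
    return const * (eps ** (kappa - 1) * z + eps ** kappa)

def check_identity(z, kappa, grid_points=20001):
    """Infimum over eps (log-spaced grid on [1e-6, 1e6] and closed form) vs. z**kappa."""
    grid_inf = min(
        linearized(z, kappa, 10 ** (-6 + 12 * i / (grid_points - 1)))
        for i in range(grid_points)
    )
    eps_star = (1 - kappa) * z / kappa  # stationary point of the linearized bound
    return grid_inf, linearized(z, kappa, eps_star), z ** kappa

for kappa in (0.25, 0.5, 0.75):
    for z in (0.1, 1.0, 10.0):
        grid_inf, exact_inf, target = check_identity(z, kappa)
        assert abs(exact_inf - target) <= 1e-9 * target  # identity holds at the minimiser
        assert abs(grid_inf - target) <= 1e-3 * target   # grid infimum agrees numerically
```

Each fixed $\epsilon$ gives one linear upper bound on $z \mapsto z^\kappa$, which is exactly why the Bernstein condition can be traded for the family of linear inequalities in Condition 7.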
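The closing step, tuning $\epsilon$ in (7), can be made concrete: after rearranging, the bound has the form $c_2 T \epsilon + K_T \epsilon^{\kappa-1}/c_1 + 2K_T$, minimised at $\epsilon^* = \big((1-\kappa)K_T/(c_1 c_2 T)\big)^{1/(2-\kappa)}$, giving a rate of order $T^{(1-\kappa)/(2-\kappa)}$. The Python sketch below, with illustrative constants $c_1 = c_2 = K_T = 1$ (our choice, not from the paper), confirms both the stationary point and the scaling in $T$.

```python
import math

def tuned_bound(T, K, kappa, c1=1.0, c2=1.0):
    """Minimum over eps > 0 of c2*T*eps + (K/c1)*eps**(kappa-1), evaluated at the
    closed-form stationary point eps_star = ((1-kappa)*K/(c1*c2*T))**(1/(2-kappa))."""
    eps_star = ((1 - kappa) * K / (c1 * c2 * T)) ** (1 / (2 - kappa))
    return c2 * T * eps_star + (K / c1) * eps_star ** (kappa - 1)

kappa, K = 0.5, 1.0

# Sanity check: nudging eps away from eps_star only increases the (convex) objective.
T = 10 ** 4
eps_star = ((1 - kappa) * K / T) ** (1 / (2 - kappa))

def objective(eps):
    return T * eps + K * eps ** (kappa - 1)

assert objective(eps_star) <= objective(0.9 * eps_star)
assert objective(eps_star) <= objective(1.1 * eps_star)

# The tuned bound grows as T**((1-kappa)/(2-kappa)); for kappa = 1/2 that is T**(1/3).
slope = math.log(tuned_bound(10 ** 6, K, kappa) / tuned_bound(10 ** 3, K, kappa)) / math.log(10 ** 3)
assert abs(slope - (1 - kappa) / (2 - kappa)) < 1e-9
```

For $\kappa = 1$ the exponent degenerates to $T^0$, recovering the constant-order (logarithmic in the full bound) regime, while $\kappa \to 0$ recovers the worst-case $\sqrt{T}$-type rate.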