{"title": "Relax and Randomize : From Value to Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2141, "page_last": 2149, "abstract": "We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such ''unorthodox'' methods as Follow the Perturbed Leader and the R^2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a ''random play out''. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone's dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts.", "full_text": "Relax and Randomize: From Value to Algorithms\n\nAlexander Rakhlin\n\nUniversity of Pennsylvania\n\nOhad Shamir\n\nMicrosoft Research\n\nKarthik Sridharan\n\nUniversity of Pennsylvania\n\nAbstract\n\nWe show a principled way of deriving online learning algorithms from a minimax\nanalysis. Various upper bounds on the minimax value, previously thought to be\nnon-constructive, are shown to yield algorithms. This allows us to seamlessly\nrecover known methods and to derive new ones, also capturing such \u201cunorthodox\u201d\nmethods as Follow the Perturbed Leader and the R^2 forecaster. Understanding\nthe inherent complexity of the learning problem thus leads to the development of\nalgorithms. To illustrate our approach, we present several new algorithms,\nincluding a family of randomized methods that use the idea of a \u201crandom playout\u201d.\n
New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as\nmethods based on Littlestone\u2019s dimension, efficient methods for matrix\ncompletion with trace norm, and algorithms for the problems of transductive learning\nand prediction with static experts.\n\n1 Introduction\n\nThis paper studies the online learning framework, where the goal of the player is to incur small\nregret while observing a sequence of data on which we place no distributional assumptions. Within\nthis framework, many algorithms have been developed over the past two decades [6]. More recently,\na non-algorithmic minimax approach has been developed to study the inherent complexities of\nsequential problems [2, 1, 15, 20]. It was shown that a theory in parallel to Statistical Learning can be\ndeveloped, with random averages, combinatorial parameters, covering numbers, and other measures\nof complexity. Just as the classical learning theory is concerned with the study of the supremum of\nan empirical or Rademacher process, online learning is concerned with the study of the supremum of\nmartingale processes. While the tools introduced in [15, 17, 16] provide ways of studying the\nminimax value, no algorithms have been exhibited to achieve these non-constructive bounds in general.\nIn this paper, we show that algorithms can, in fact, be extracted from the minimax analysis. This\nobservation leads to a unifying view of many of the methods known in the literature, and also gives\na general recipe for developing new algorithms. We show that the potential method, which has\nbeen studied in various forms, naturally arises from the study of the minimax value as a certain\nrelaxation. We further show that the sequential complexity tools introduced in [15] are, in fact,\nrelaxations and can be used for constructing algorithms that enjoy the corresponding bounds.\n
By choosing appropriate relaxations, we recover many known methods, improved variants of some\nknown methods, and new algorithms. One can view our framework as one for converting a\nnon-constructive proof of an upper bound on the value of the game into an algorithm. Surprisingly,\nthis allows us to also study such “unorthodox” methods as Follow the Perturbed Leader [10], and\nthe recent method of [7] under the same umbrella with others. We show that the idea of a random\nplayout has a solid theoretical basis, and that the Follow the Perturbed Leader algorithm is an example\nof such a method. Based on these developments, we exhibit an efficient method for the trace norm\nmatrix completion problem, novel Follow the Perturbed Leader algorithms, and efficient methods\nfor the problems of online transductive learning. The framework of this paper gives a recipe for\ndeveloping algorithms. Throughout the paper, we stress that the notion of a relaxation, introduced\nbelow, does not appear out of thin air but rather arises as an upper bound on the sequential Rademacher\ncomplexity. The understanding of inherent complexity thus leads to the development of algorithms.\n\n2 Value, The Minimax Algorithm, and Relaxations\n\nLet us introduce some notation. The sequence x_1, …, x_t is often denoted by x_{1:t}, and the set of all\ndistributions on some set A by Δ(A). Unless specified otherwise, ε denotes a vector (ε_1, …, ε_T)\nof i.i.d. Rademacher random variables. An X-valued tree x of depth d is defined as a sequence\n(x_1, …, x_d) of mappings x_t : {±1}^{t−1} → X (see [15]). We often write x_t(ε) instead of x_t(ε_{1:t−1}).\nLet F be the set of learner’s moves and X the set of moves of Nature. The online protocol dictates\nthat on every round t = 1, …, T the learner and Nature simultaneously choose f_t ∈ F, x_t ∈ X,\nand observe each other’s actions.\n
The learner aims to minimize regret Reg_T ≜ Σ_{t=1}^T ℓ(f_t, x_t) − inf_{f∈F} Σ_{t=1}^T ℓ(f, x_t), where ℓ : F × X → R is a known loss function. Our aim is to study this online learning problem at an abstract level without assuming convexity or other such properties of ℓ, F and X. We do assume, however, that ℓ, F, and X are such that the minimax theorem in the space of distributions over F and X holds. By studying the abstract setting, we are able to develop general algorithmic and non-algorithmic ideas that are common across various application areas.\n\nThe starting point of our development is the minimax value of the associated online learning game:\n\nV_T(F) = inf_{q_1∈Δ(F)} sup_{x_1∈X} E_{f_1∼q_1} … inf_{q_T∈Δ(F)} sup_{x_T∈X} E_{f_T∼q_T} [ Σ_{t=1}^T ℓ(f_t, x_t) − inf_{f∈F} Σ_{t=1}^T ℓ(f, x_t) ]    (1)\n\nwhere Δ(F) is the set of distributions on F. The minimax formulation immediately gives rise to the optimal algorithm that solves the minimax expression at every round t and returns\n\nargmin_{q∈Δ(F)} { sup_{x_t} E_{f_t∼q} [ ℓ(f_t, x_t) + inf_{q_{t+1}} sup_{x_{t+1}} E_{f_{t+1}} … inf_{q_T} sup_{x_T} E_{f_T} [ Σ_{i=t+1}^T ℓ(f_i, x_i) − inf_{f∈F} Σ_{i=1}^T ℓ(f, x_i) ] ] }\n\nHenceforth, if the quantification in inf and sup is omitted, it will be understood that x_t, f_t, p_t, q_t range over X, F, Δ(X), Δ(F), respectively. Moreover, E_{x_t} is with respect to p_t while E_{f_t} is with respect to q_t. We now notice a recursive form for the value of the game. Define for any t ∈ [T−1] and any given prefix x_1, …, x_t ∈ X the conditional value\n\nV_T(F | x_1, …, x_t) ≜ inf_{q∈Δ(F)} sup_{x∈X} { E_{f∼q}[ℓ(f, x)] + V_T(F | x_1, …, x_t, x) }\n\nwith V_T(F | x_1, …, x_T) ≜ −inf_{f∈F} Σ_{t=1}^T ℓ(f, x_t) and V_T(F) = V_T(F | {}). The minimax optimal algorithm specifying the mixed strategy of the player can be written succinctly as\n\nq_t = argmin_{q∈Δ(F)} sup_{x∈X} { E_{f∼q}[ℓ(f, x)] + V_T(F | x_1, …, x_{t−1}, x) } .    (2)\n\nSimilar recursive formulations have appeared in the literature [8, 13, 19, 3], but now we have tools to study the conditional value of the game. We will show that various upper bounds on V_T(F | x_1, …, x_{t−1}, x) yield an array of algorithms. In this way, the non-constructive approaches of [15, 16, 17] to upper bound the value of the game directly translate into algorithms. We note that the minimax algorithm in (2) can be interpreted as choosing the best decision that takes into account the present loss and the worst-case future. The first step in our analysis is to appeal to the minimax theorem and perform the same manipulation as in [1, 15], but only on the conditional values:\n\nV_T(F | x_1, …, x_t) = sup_{p_{t+1}} E_{x_{t+1}} … sup_{p_T} E_{x_T} [ Σ_{i=t+1}^T inf_{f_i∈F} E_{x_i∼p_i} ℓ(f_i, x_i) − inf_{f∈F} Σ_{i=1}^T ℓ(f, x_i) ] .    (3)\n\nThe idea now is to come up with more manageable, yet tight, upper bounds on the conditional value. A relaxation Rel_T is a sequence of real-valued functions Rel_T(F | x_1, …, x_t) for each t ∈ [T]. We call a relaxation admissible if for any x_1, …, x_T ∈ X,\n\nRel_T(F | x_1, …, x_t) ≥ inf_{q∈Δ(F)} sup_{x∈X} { E_{f∼q}[ℓ(f, x)] + Rel_T(F | x_1, …, x_t, x) }    (4)\n\nfor all t ∈ [T−1], and Rel_T(F | x_1, …, x_T) ≥ −inf_{f∈F} Σ_{t=1}^T ℓ(f, x_t). We use the notation Rel_T(F) for Rel_T(F | {}). A strategy q that minimizes the expression in (4) defines an optimal Meta-Algorithm for an admissible relaxation Rel_T: on round t, compute\n\nq_t = argmin_{q∈Δ(F)} sup_{x∈X} { E_{f∼q}[ℓ(f, x)] + Rel_T(F | x_1, …, x_{t−1}, x) } ,    (5)\n\nplay f_t ∼ q_t and receive x_t from the opponent. Importantly, minimization need not be exact: any q_t that satisfies the admissibility condition (4) is a valid method, and we will say that such an algorithm is admissible with respect to the relaxation Rel_T.\n\nProposition 1. Let Rel_T be an admissible relaxation. For any admissible algorithm with respect to Rel_T (including the Meta-Algorithm), irrespective of the strategy of the adversary,\n\nΣ_{t=1}^T E_{f_t∼q_t} ℓ(f_t, x_t) − inf_{f∈F} Σ_{t=1}^T ℓ(f, x_t) ≤ Rel_T(F) ,    (6)\n\nand therefore, E[Reg_T] ≤ Rel_T(F). If ℓ(·,·) is bounded, the Hoeffding-Azuma inequality yields a high-probability bound on Reg_T. We also have that V_T(F) ≤ Rel_T(F). Further, if for all t ∈ [T], the admissible strategies q_t are deterministic, Reg_T ≤ Rel_T(F).\n\nThe reader might recognize Rel_T as a potential function. It is known that one can derive regret bounds by coming up with a potential such that the current loss of the player is related to the difference in the potentials at successive steps, and that the regret can be extracted from the final potential. The origin/recipe for “good” potential functions has always been a mystery (at least to the authors). One of the key contributions of this paper is to show that they naturally arise as relaxations on the conditional value, and the conditional value is itself the tightest possible relaxation. In particular, for many problems a tight relaxation is achieved through symmetrization applied to the expression in (3). Define the conditional Sequential Rademacher complexity\n\nR_T(F | x_1, …, x_t) = sup_x E_{ε_{t+1:T}} sup_{f∈F} [ 2 Σ_{s=t+1}^T ε_s ℓ(f, x_{s−t}(ε_{t+1:s−1})) − Σ_{s=1}^t ℓ(f, x_s) ] .    (7)\n\nHere the supremum is over all X-valued binary trees of depth T − t. One may view this complexity as a partially symmetrized version of the sequential Rademacher complexity\n\nR_T(F) ≜ R_T(F | {}) = sup_x E_{ε_{1:T}} sup_{f∈F} [ 2 Σ_{s=1}^T ε_s ℓ(f, x_s(ε_{1:s−1})) ]    (8)\n\ndefined in [15]. We shall refer to the term involving the tree x as the “future” and the term being subtracted off in (7) as the “past”. This indeed corresponds to the fact that the quantity is conditioned on the already observed x_1, …, x_t, while for the future we have the worst possible binary tree.¹\n\n¹ It is cumbersome to write out the indices on x_{s−t}(ε_{t+1:s−1}) in (7), so we will instead use x_s(ε) whenever this doesn’t cause confusion.\n\nProposition 2. The conditional Sequential Rademacher complexity is admissible.\n\nWe now show that several well-known methods arise as further relaxations on R_T.\n\nExponential Weights [12, 21]. Suppose F is a finite class and |ℓ(f, x)| ≤ 1. In this case, a (tight) upper bound on sequential Rademacher complexity leads to the following relaxation:\n\nRel_T(F | x_1, …, x_t) = inf_{λ > 0} { (1/λ) log( Σ_{f∈F} exp( −λ Σ_{i=1}^t ℓ(f, x_i) ) ) + 2λ(T − t) }    (9)\n\nProposition 3. The relaxation (9) is admissible and R_T(F | x_1, …, x_t) ≤ Rel_T(F | x_1, …, x_t). Furthermore, it leads to a parameter-free version of the Exponential Weights algorithm, defined on round t + 1 by the mixed strategy q_{t+1}(f) ∝ exp( −λ*_t Σ_{s=1}^t ℓ(f, x_s) ) with λ*_t the optimal value in (9). The algorithm’s regret is bounded by Rel_T(F) ≤ 2√(2T log|F|).\n\nWe point out that the exponential-weights algorithm arising from the relaxation (9) is a parameter-free algorithm. The learning rate λ* can be optimized (via 1D line search) at each iteration.\n\nMirror Descent [4, 14]. In the setting of online linear optimization [22], the loss is ℓ(f, x) = ⟨f, x⟩. Suppose F is a unit ball in some Banach space and X is the dual. Let ‖·‖ be some (2, C)-smooth norm on X (in the Euclidean case, C = 2). Using the notation x̃_{t−1} = Σ_{s=1}^{t−1} x_s, a straightforward upper bound on sequential Rademacher complexity is the following relaxation:\n\nRel_T(F | x_1, …, x_t) = √( ‖x̃_{t−1}‖² + ⟨∇‖x̃_{t−1}‖², x_t⟩ + C(T − t + 1) )    (10)\n\nProposition 4. The relaxation (10) is admissible and R_T(F | x_1, …, x_t) ≤ Rel_T(F | x_1, …, x_t). It yields the update f_t = −∇‖x̃_{t−1}‖² / ( 2√( ‖x̃_{t−1}‖² + C(T − t + 1) ) ) with regret bound Rel_T(F) ≤ √(2CT).\n\nWe would like to remark that the traditional mirror descent update can be shown to arise out of an appropriate relaxation. The algorithms proposed are parameter free as the step size is tuned automatically. We chose the popular methods of Exponential Weights and Mirror Descent for illustration. In the remainder of the paper, we develop new algorithms to show universality of our approach.\n\n3 Classification\n\nWe start by considering the problem of supervised learning, where X is the space of instances and Y the space of responses (labels). There are two closely related protocols for the online interaction between the learner and Nature, so let us outline them. The “proper” version of supervised learning follows the protocol presented in Section 2: at time t, the learner selects f_t ∈ F, Nature simultaneously selects (x_t, y_t) ∈ X × Y, and the learner suffers the loss ℓ(f_t(x_t), y_t). The “improper” version is as follows: at time t, Nature chooses x_t ∈ X and presents it to the learner as “side information”, the learner then picks ŷ_t ∈ Y and Nature simultaneously chooses y_t ∈ Y. In the improper version, the loss of the learner is ℓ(ŷ_t, y_t), and it is easy to see that we may equivalently state this protocol as the learner choosing any function f_t ∈ Y^X (not necessarily in F), and Nature simultaneously choosing (x_t, y_t). We mostly focus on the “improper” version of supervised learning in this section. For the improper version, we may write the value in (1) as\n\nV_T(F) = sup_{x_1∈X} inf_{q_1∈Δ(Y)} sup_{y_1∈Y} E_{ŷ_1∼q_1} … sup_{x_T∈X} inf_{q_T∈Δ(Y)} sup_{y_T∈Y} E_{ŷ_T∼q_T} [ Σ_{t=1}^T ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^T ℓ(f(x_t), y_t) ]\n\nand a relaxation Rel_T is admissible if for any (x_1, y_1), …, (x_T, y_T) ∈ X × Y,\n\nsup_{x∈X} inf_{q∈Δ(Y)} sup_{y∈Y} { E_{ŷ∼q} ℓ(ŷ, y) + Rel_T(F | {(x_i, y_i)}_{i=1}^t, (x, y)) } ≤ Rel_T(F | {(x_i, y_i)}_{i=1}^t) .    (11)\n\nLet us now focus on binary prediction, i.e. Y = {±1}. In this case, the supremum over y in (11) becomes a maximum over two values. Let us now take the absolute loss ℓ(ŷ, y) = |ŷ − y| = 1 − ŷy. We can see² that the optimal randomized strategy, given the side information x, is given by (11) as\n\nq_t = argmin_{q∈Δ(Y)} max{ 1 − q + Rel_T(F | {(x_i, y_i)}_{i=1}^t, (x, 1)) , 1 + q + Rel_T(F | {(x_i, y_i)}_{i=1}^t, (x, −1)) }    (12)\n\nor equivalently as: q_t = (1/2) ( Rel_T(F | {(x_i, y_i)}_{i=1}^t, (x, 1)) − Rel_T(F | {(x_i, y_i)}_{i=1}^t, (x, −1)) ).\n\n² The extension to k-class prediction is immediate.\n\nWe now assume that F has a finite Littlestone’s dimension Ldim(F) [11, 5]. Suppose the loss function is ℓ(ŷ, y) = |ŷ − y|, and consider the “mixed” conditional Rademacher complexity\n\nsup_x E_ε sup_{f∈F} { 2 Σ_{i=1}^{T−t} ε_i f(x_i(ε)) − Σ_{i=1}^t |f(x_i) − y_i| }    (13)\n\nas a possible relaxation. The admissibility condition (11) with the conditional sequential Rademacher (13) as a relaxation would require us to upper bound\n\nsup_{x_t} inf_{q_t∈[−1,1]} max_{y_t∈{±1}} { E_{ŷ_t∼q_t} |ŷ_t − y_t| + sup_x E_ε sup_{f∈F} { 2 Σ_{i=1}^{T−t} ε_i f(x_i(ε)) − Σ_{i=1}^t |f(x_i) − y_i| } }    (14)\n\nHowever, the supremum over x is preventing us from obtaining a concise algorithm. We need to further “relax” this supremum, and the idea is to pass to a finite cover of F on the given tree x and then proceed as in the Exponential Weights example for a finite collection of experts. This leads to an upper bound on (13) and gives rise to algorithms similar in spirit to those developed in [5], but with more attractive computational properties and defined more concisely.\n\nDefine the function g(d, t) = Σ_{i=0}^d (t choose i), which is shown in [15] to be the maximum size of an exact (zero) cover for a function class with the Littlestone’s dimension Ldim = d. Given {(x_1, y_1), …, (x_t, y_t)} and σ = (σ_1, …, σ_t) ∈ {±1}^t, let F_t(σ) = {f ∈ F : f(x_i) = σ_i ∀ i ≤ t}, the subset of functions that agree with the signs given by σ on the “past” data, and let F|_{x_1,…,x_t} ≜ F|_{x^t} ≜ {(f(x_1), …, f(x_t)) : f ∈ F} be the projection of F onto x_1, …, x_t. Denote L_t(f) = Σ_{i=1}^t |f(x_i) − y_i| and L_t(σ) = Σ_{i=1}^t |σ_i − y_i| for σ ∈ {±1}^t. The following proposition gives a relaxation and an algorithm which achieves the O(√(Ldim(F) T log T)) regret bound. Unlike the algorithm of [5], we do not need to run an exponential number of experts in parallel and only require access to an oracle that computes the Littlestone’s dimension.\n\nProposition 5. The relaxation\n\nRel_T(F | (x^t, y^t)) = (1/λ) log( Σ_{σ∈F|_{x^t}} g(Ldim(F_t(σ)), T − t) exp{−λ L_t(σ)} ) + 2λ(T − t)\n\nis admissible and leads to an admissible algorithm which uses weights q_t(−1) = 1 − q_t(+1) and\n\nq_t(+1) = [ Σ_{(σ,+1)∈F|_{x^t}} g(Ldim(F_t(σ, +1)), T − t) exp{−λ L_{t−1}(σ)} ] / [ Σ_{(σ,σ_t)∈F|_{x^t}} g(Ldim(F_t(σ, σ_t)), T − t) exp{−λ L_{t−1}(σ)} ] .    (15)\n\nThere is a very close correspondence between the proof of Proposition 5 and the proof of the combinatorial lemma of [15], the analogue of the Vapnik-Chervonenkis-Sauer-Shelah result.\n\n4 Randomized Algorithms and Follow the Perturbed Leader\n\nWe now develop a class of admissible randomized methods that arise through sampling. Consider the objective (5) given by a relaxation Rel_T. If Rel_T is the sequential (or classical) Rademacher complexity, it involves an expectation over sequences of coin flips, and this computation (coupled with optimization for each sequence) can be prohibitively expensive. More generally, Rel_T might involve an expectation over possible ways in which the future might be realized.\n
In such cases, we may consider a rather simple “random playout” strategy: draw the random sequence and solve only one optimization problem for that random sequence. The ideas of random playout have been discussed in previous literature for estimating the utility of a move in a game (see also [3]). We show that the random playout strategy has a solid basis: for the examples we consider, it satisfies admissibility.\n\nIn many learning problems the sequential and the classical Rademacher complexities are within a constant factor of each other. This holds true, for instance, for linear functions in finite-dimensional spaces. In such cases, the relaxation Rel_T does not involve the supremum over a tree, and the randomized method only needs to draw a sequence of coin flips and compute a solution to an optimization problem slightly more complicated than ERM. We show that Follow the Perturbed Leader (FPL) algorithms [10] arise in this way. We note that FPL has been previously considered as a rather unorthodox algorithm providing some kind of regularization via randomization. Our analysis shows that it arises through a natural relaxation based on the sequential (and thus the classical) Rademacher complexity, coupled with the random playout idea. As a new algorithmic contribution, we provide a version of the FPL algorithm for the case of the decision sets being ℓ2 balls, with a regret bound that is independent of the dimension. We also provide an FPL-style method for the combination of ℓ1 and ℓ∞ balls. To the best of our knowledge, these results are novel.\n\nThe assumption below implies that the sequential and classical Rademacher complexities are within constant factor C of each other. We later verify that it holds in the examples we consider.\n\nAssumption 1. There exists a distribution D ∈ Δ(X) and constant C ≥ 2 such that for any t ∈ [T] and given any x_1, …, x_{t−1}, x_{t+1}, …, x_T ∈ X and any ε_{t+1}, …, ε_T ∈ {±1},\n\nsup_{p∈Δ(X)} E_{x_t∼p} sup_{f∈F} [ C A_{t+1}(f) − L_{t−1}(f) + E_{x∼p}[ℓ(f, x)] − ℓ(f, x_t) ] ≤ E_{ε_t, x_t∼D} sup_{f∈F} [ C A_t(f) − L_{t−1}(f) ]\n\nwhere ε_t’s are i.i.d. Rademacher, L_{t−1}(f) = Σ_{i=1}^{t−1} ℓ(f, x_i), and A_t(f) = Σ_{i=t}^T ε_i ℓ(f, x_i).\n\nUnder the above assumption one can use the following relaxation\n\nRel_T(F | x_1, …, x_t) = E_{x_{t+1},…,x_T∼D} E_ε sup_{f∈F} [ C Σ_{i=t+1}^T ε_i ℓ(f, x_i) − Σ_{i=1}^t ℓ(f, x_i) ]    (16)\n\nwhich is a partially symmetrized version of the classical Rademacher averages.\n\nThe proof of admissibility for the randomized methods is quite curious: the forecaster can be seen as mimicking the sequential Rademacher complexity by sampling from the “equivalently bad” classical Rademacher complexity under the specific distribution D specified by the above assumption.\n\nLemma 6. Under Assumption 1, the relaxation in Eq. (16) is admissible and a randomized strategy that ensures admissibility is given by: at time t, draw x_{t+1}, …, x_T ∼ D and ε_{t+1}, …, ε_T and then:\n(a) In the case the loss ℓ is convex in its first argument and set F is convex and compact, define\n\nf_t = argmin_{g∈F} sup_{x∈X} { ℓ(g, x) + sup_{f∈F} { C Σ_{i=t+1}^T ε_i ℓ(f, x_i) − Σ_{i=1}^{t−1} ℓ(f, x_i) − ℓ(f, x) } }    (17)\n\n(b) In the case of non-convex loss, sample f_t from the distribution\n\nq̂_t = argmin_{q̂∈Δ(F)} sup_{x∈X} { E_{f∼q̂}[ℓ(f, x)] + sup_{f∈F} { C Σ_{i=t+1}^T ε_i ℓ(f, x_i) − Σ_{i=1}^{t−1} ℓ(f, x_i) − ℓ(f, x) } }    (18)\n\nThe expected regret for the method is bounded by the classical Rademacher complexity:\n\nE[Reg_T] ≤ C E_{x_{1:T}∼D} E_ε [ sup_{f∈F} Σ_{t=1}^T ε_t ℓ(f, x_t) ] .\n\nOf particular interest are the settings of static experts and transductive learning, which we consider in Section 5. In the transductive case, the x_t’s are pre-specified before the game, and in the static expert case they are effectively absent. In these cases, as we show below, there is no explicit distribution D and we only need to sample the random signs ε’s. We easily see that in these cases, the expected regret bound is simply two times the transductive Rademacher complexity.\n\nThe idea of sampling from a fixed distribution is particularly appealing in the case of linear loss, ℓ(f, x) = ⟨f, x⟩. Suppose X is a unit ball in some norm ‖·‖ in a vector space B, and F is a unit ball in the dual norm ‖·‖_*. A sufficient condition implying Assumption 1 is then\n\nAssumption 2. There exists a distribution D ∈ Δ(X) and constant C ≥ 2 such that for any w ∈ B,\n\nsup_{x∈X} E_{ε_t} ‖w + 2ε_t x‖ ≤ E_{x_t∼D} E_{ε_t} ‖w + Cε_t x_t‖    (19)\n\nAt round t, the generic algorithm specified by Lemma 6 draws fresh Rademacher random variables ε and x_{t+1}, …, x_T ∼ D and picks\n\nf_t = argmin_{f∈F} sup_{x∈X} { ⟨f, x⟩ + ‖ C Σ_{i=t+1}^T ε_i x_i − Σ_{i=1}^{t−1} x_i − x ‖ }    (20)\n\nWe now look at ℓ2/ℓ2 and ℓ1/ℓ∞ cases and provide corresponding randomized algorithms.\n\nExample: ℓ1/ℓ∞ Follow the Perturbed Leader. Here, we consider the setting similar to that in [10]. Let F ⊂ R^N be the ℓ1 unit ball and X the (dual) ℓ∞ unit ball in R^N. In [10], F is the probability simplex and X = [0, 1]^N but these are subsumed by the ℓ1/ℓ∞ case. Next we show that any symmetric distribution satisfies Assumption 2.\n\nLemma 7. If D is any symmetric distribution over R, then Assumption 2 is satisfied by using the product distribution D^N and any C ≥ 6 / E_{x∼D}|x|. In particular, Assumption 2 is satisfied with a distribution D that is uniform on the vertices of the cube {±1}^N and C = 6.\n\nThe above lemma is especially attractive with Gaussian perturbations as the sum of normal random variables is again normal. Hence, instead of drawing x_{t+1}, …, x_T ∼ N(0, 1) on round t, one can simply draw one vector X_t ∼ N(0, T − t) as the perturbation. In this case, C ≤ 8.\n\nThe form of update in Equation (20), however, is not in a convenient form, and the following lemma shows a simple Follow the Perturbed Leader type algorithm with the associated regret bound.\n\nLemma 8. Suppose F is the ℓ_1^N unit ball and X is the dual ℓ_∞^N unit ball, and let D be any symmetric distribution. Consider the randomized algorithm that at each round t, freshly draws Rademacher random variables ε_{t+1}, …, ε_T and x_{t+1}, …, x_T ∼ D^N and picks f_t = argmin_{f∈F} ⟨f, Σ_{i=1}^{t−1} x_i − C Σ_{i=t+1}^T ε_i x_i⟩ where C = 6 / E_{x∼D}|x|. The expected regret is bounded as:\n\nE[Reg_T] ≤ C E_{x_{1:T}∼D^N} E_ε ‖ Σ_{t=1}^T ε_t x_t ‖_∞ + 4 Σ_{t=1}^T P_{y_{t+1:T}∼D} ( C | Σ_{i=t+1}^T y_i | ≤ 4 )\n\nFor instance, for the case of coin flips (with C = 6) or the Gaussian distribution (with C = 3√(2π)) the bound above is 4C√(T log N), as the second term is bounded by a constant.\n\nExample: ℓ2/ℓ2 Follow the Perturbed Leader. We now consider the case when F and X are both the unit ℓ2 ball. We can use as perturbation the uniform distribution on the surface of the unit sphere, as the following lemma shows. This result was hinted at in [2], as in the high dimensional case, the random draw from the unit sphere is likely to produce orthogonal directions. However, we do not require dimensionality to be high for our result.\n\nLemma 9. Let X and F be unit balls in Euclidean norm. Then Assumption 2 is satisfied with a uniform distribution D on the surface of the unit sphere with constant C = 4√2.\n\nAs in the previous example the update in (20) is not in a convenient form and this is addressed below.\n\nLemma 10. Let X and F be unit balls in Euclidean norm, and D be the uniform distribution on the surface of the unit sphere. Consider the randomized algorithm that at each round (say round t) freshly draws x_{t+1}, …, x_T ∼ D and picks f_t = ( −Σ_{i=1}^{t−1} x_i + C Σ_{i=t+1}^T x_i ) / L where C = 4√2 and scaling factor L = ( ‖ −Σ_{i=1}^{t−1} x_i + C Σ_{i=t+1}^T x_i ‖_2² + 1 )^{1/2}. The randomized algorithm enjoys a bound on the expected regret given by E[Reg_T] ≤ C E_{x_1,…,x_T∼D} ‖ Σ_{t=1}^T x_t ‖_2 ≤ 4√(2T).\n\nImportantly, the bound does not depend on the dimensionality of the space. To the best of our knowledge, this is the first such result for Follow the Perturbed Leader style algorithms.\n
Further, unlike [10, 6], we directly deal with the adaptive adversary.\n\n5 Static Experts with Convex Losses and Transductive Online Learning\n\nWe show how to recover a variant of the R^2 forecaster of [7], for static experts and transductive online learning. At each round, the learner makes a prediction q_t ∈ [−1, 1], observes the outcome y_t ∈ [−1, 1], and suffers convex L-Lipschitz loss ℓ(q_t, y_t). Regret is defined as the difference between learner’s cumulative loss and inf_{f∈F} Σ_{t=1}^T ℓ(f[t], y_t), where F ⊂ [−1, 1]^T can be seen as a set of static experts. The transductive setting is equivalent to this: the sequence of x_t’s is known before the game starts, and hence the effective function class is once again a subset of [−1, 1]^T. It turns out that in these cases, sequential Rademacher complexity becomes the classical Rademacher complexity (see [17]), which can thus be taken as a relaxation. This is also the reason that an efficient implementation by sampling is possible. For general convex loss, one possible admissible relaxation is just a conditional version of the classical Rademacher averages:\n\nRel_T(F | y_1, …, y_t) = E_{ε_{t+1:T}} sup_{f∈F} [ 2L Σ_{s=t+1}^T ε_s f[s] − L_t(f) ]    (21)\n\nwhere L_t(f) = Σ_{s=1}^t ℓ(f[s], y_s). If (21) is used as a relaxation, the calculation of prediction ŷ_t involves a supremum over f ∈ F with (potentially nonlinear) loss functions of instances seen so far. In some cases this optimization might be hard and it might be preferable if the supremum only involves terms linear in f. To this end we start by noting that by convexity\n\nΣ_{t=1}^T ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^T ℓ(f[t], y_t) ≤ Σ_{t=1}^T ∂ℓ(ŷ_t, y_t)·ŷ_t − inf_{f∈F} Σ_{t=1}^T ∂ℓ(ŷ_t, y_t)·f[t] .    (22)\n\nOne can now consider an alternative online learning problem which, if we solve, also solves the original problem. More precisely, the new loss is ℓ′(ŷ, r) = r·ŷ; we first pick prediction ŷ_t (deterministically) and the adversary picks r_t (corresponding to r_t = ∂ℓ(ŷ_t, y_t) for the choice of y_t picked by the adversary). Now note that ℓ′ is indeed convex in its first argument and is L-Lipschitz because |∂ℓ(ŷ_t, y_t)| ≤ L. This is a one dimensional convex learning game where we pick ŷ_t and regret is given by the right hand side of (22). Hence, we can consider the relaxation\n\nRel_T(F | ∂ℓ(ŷ_1, y_1), …, ∂ℓ(ŷ_t, y_t)) = E_{ε_{t+1:T}} sup_{f∈F} [ 2L Σ_{i=t+1}^T ε_i f[i] − Σ_{i=1}^t ∂ℓ(ŷ_i, y_i)·f[i] ]    (23)\n\nas a linearized form of (21). At round t, the prediction of the algorithm is then\n\nŷ_t = E_ε [ sup_{f∈F} { Σ_{i=t+1}^T ε_i f[i] − (1/(2L)) Σ_{i=1}^{t−1} ∂ℓ(ŷ_i, y_i) f[i] + (1/2) f[t] } − sup_{f∈F} { Σ_{i=t+1}^T ε_i f[i] − (1/(2L)) Σ_{i=1}^{t−1} ∂ℓ(ŷ_i, y_i) f[i] − (1/2) f[t] } ]    (24)\n\nLemma 11. The relaxation in Eq. (23) is admissible w.r.t. the prediction strategy specified in Equation (24). Further the regret of the strategy is bounded as Reg_T ≤ 2L E_ε [ sup_{f∈F} Σ_{t=1}^T ε_t f[t] ].\n\nThis algorithm is similar to R^2, with the main difference that R^2 computes the infima over a sum of absolute losses, while here we have a more manageable linearized objective. While we need to evaluate the expectation over ε’s on each round, we can estimate ŷ_t by sampling ε’s and, using McDiarmid’s inequality, argue that the estimate is close to ŷ_t with high probability. The randomized prediction is now given simply as: on round t, draw ε_{t+1}, …, ε_T and predict\n\nŷ_t(ε) = inf_{f∈F} { −Σ_{i=t+1}^T ε_i f[i] + (1/(2L)) Σ_{i=1}^{t−1} ℓ(f[i], y_i) + (1/2) f[t] } − inf_{f∈F} { −Σ_{i=t+1}^T ε_i f[i] + (1/(2L)) Σ_{i=1}^{t−1} ℓ(f[i], y_i) − (1/2) f[t] }    (25)\n\nWe now show that this predictor enjoys a regret bound of the transductive Rademacher complexity:\n\nLemma 12. The relaxation specified in Equation (21) is admissible w.r.t. the randomized prediction strategy specified in Equation (25), and enjoys bound E[Reg_T] ≤ 2L E_ε [ sup_{f∈F} Σ_{t=1}^T ε_t f[t] ].\n\n6 Matrix Completion\n\nConsider the problem of predicting unknown entries in a matrix, in an online fashion. At each round t the adversary picks an entry in an m × n matrix and a value y_t for that entry. The learner then chooses a predicted value ŷ_t, and suffers loss ℓ(y_t, ŷ_t), assumed to be ρ-Lipschitz. We define our regret with respect to the class F of all matrices whose trace-norm is at most B (namely, we can use any such matrix to predict just by returning its relevant entry at each round). Usually, one has B = Θ(√(mn)). Consider a transductive version, where we know in advance the location of all entries we need to predict. We show how to develop an algorithm whose regret is bounded by the (transductive) Rademacher complexity of F, which by Theorem 6 of [18], is O(B√n) independent of T. Moreover, in [7], it was shown how one can convert algorithms with such guarantees to obtain the same regret even in a “fully” online case, where the set of entry locations is unknown in advance. In this section we use the two alternatives provided for the transductive learning problem in the previous section, and provide two alternatives for the matrix completion problem. Both variants proposed here improve on the one provided by the R^2 forecaster in [7], since that algorithm competes against the smaller class F′ of matrices with bounded trace-norm and bounded individual entries, and our variants are also computationally more efficient. Our first variant also improves on the recently proposed method in [9] in terms of memory requirements, and each iteration is simpler: whereas that method requires storing and optimizing full m × n matrices every iteration, our algorithm only requires computing spectral norms of sparse matrices (assuming T ≪ mn, which is usually the case). This can be done very efficiently, e.g. with power iterations or the Lanczos method.\n\nOur first algorithm follows from Eq. (24), which for our setting gives the following prediction rule:\n\nŷ_t = B E_ε [ ‖ Σ_{i=t+1}^T ε_i x_i − (1/(2ρ)) Σ_{i=1}^{t−1} ∂ℓ(ŷ_i, y_i) x_i + (1/2) x_t ‖ − ‖ Σ_{i=t+1}^T ε_i x_i − (1/(2ρ)) Σ_{i=1}^{t−1} ∂ℓ(ŷ_i, y_i) x_i − (1/2) x_t ‖ ]    (26)\n\nIn the above ‖·‖ stands for the spectral norm and each x_i is a matrix with a 1 at some specific position and 0 elsewhere. Notice that the algorithm only involves calculation of spectral norms on each round, which can be done efficiently. As mentioned in the previous section, one can approximately evaluate the expectation by sampling several ε’s on each round and averaging.\n
The second algorithm\nfollows (25), and is given by \ufb01rst drawing \u270f at random and then predicting\n\n@`(\u02c6yi, yi)xi\u2212 1\n\n@`(\u02c6yi, yi)xi+ 1\n\n2 xt\uffff\u2212\uffff T\uffffi=t+1\n\n\u270f\uffff\uffff\uffff T\uffffi=t+1\n\n\u270fixi\u2212 1\n\n\u270fixi\u2212 1\n\nt\u22121\uffffi=1\n\nt\u22121\uffffi=1\n\n2\u21e2\n\n2\u21e2\n\nt\u22121\uffffi=1\n\nt\u22121\uffffi=1\n\n2\u21e2\n\n2\u21e2\n\n\u270fif[xi]\u2212 1\n\n\u270fif[xi]\u2212 1\n\n\ufffff\uffff\u2303\u2264B\uffff T\uffffi=t+1\n\n2 f[xt]\uffff\u2212 sup\n\n`(f[xi], yi)+ 1\n\n\u02c6yt(\u270f)= sup\n`(f[xi], yi)\u2212 1\n2 f[xt]\uffff\n\ufffff\uffff\u2303\u2264B\uffff T\uffffi=t+1\nwhere\ufffff\uffff\u2303 is the trace norm of the m\u00d7n f, and f[xi] is the entry of the matrix f at the position xi.\nmentioned earlier, we get that the expected regret of either variant is O\uffffB\u21e2 (\u221am+\u221an)\uffff.\n\nNotice that the above involves solving two trace norm constrained convex optimization problems per\nround. As a simple corollary of Lemma 12, together with the bound on the Rademacher complexity\n\n7 Conclusion\nIn [2, 1, 15, 20] the minimax value of the online learning game has been analyzed and non-\nconstructive bounds on the value have been provided.\nIn this paper, we provide a general con-\nstructive recipe for deriving new (and old) online learning algorithms, using techniques from the\napparently non-constructive minimax analysis. The recipe is rather simple: we start with the notion\nof conditional sequential Rademacher complexity, and \ufb01nd an \u201cadmissible\u201d relaxation which upper\nbounds it. This relaxation immediately leads to an online learning algorithm, as well as to an as-\nsociated regret guarantee. 
In addition to the development of a unified algorithmic framework, our\ncontributions include (1) a new algorithm for online binary classification whenever the Littlestone\ndimension of the class is finite; (2) a family of randomized online learning algorithms based on the\nidea of a random playout, with new Follow the Perturbed Leader style algorithms arising as special\ncases; and (3) efficient algorithms for the trace-norm-based online matrix completion problem which\nimprove over currently known methods.\n\nAcknowledgements: We gratefully acknowledge the support of NSF under grants CAREER DMS-0954737\nand CCF-1116928.\n\nReferences\n[1] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.\n[2] J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2008.\n[3] J. Abernethy, M. K. Warmuth, and J. Yellin. Optimal strategies from random walks. In COLT, pages 437\u2013445, 2008.\n[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167\u2013175, 2003.\n[5] S. Ben-David, D. P\u00e1l, and S. Shalev-Shwartz. Agnostic online learning. In COLT, 2009.\n[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n[7] N. Cesa-Bianchi and O. Shamir. Efficient online learning via randomized rounding. In NIPS, 2011.\n[8] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97\u2013139, 1957.\n[9] E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In COLT, 2012.\n[10] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291\u2013307, 2005.\n[11] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285\u2013318, 1988.\n[12] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212\u2013261, 1994.\n[13] J. F. Mertens, S. Sorin, and S. Zamir. Repeated games. Univ. Catholique de Louvain, Center for Operations Research & Econometrics, 1994.\n[14] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.\n[15] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010. Available at http://arxiv.org/abs/1006.1138.\n[16] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In COLT, 2011. Available at http://arxiv.org/abs/1011.3168.\n[17] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, 2011. Available at http://arxiv.org/abs/1104.5070.\n[18] O. Shamir and S. Shalev-Shwartz. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In COLT, 2011.\n[19] S. Sorin. The operator approach to zero-sum stochastic games. Stochastic Games and Applications, NATO Science Series C, Mathematical and Physical Sciences, 570:417\u2013426, 2003.\n[20] K. Sridharan and A. Tewari. Convex games in Banach spaces. In COLT, 2010.\n[21] V. G. Vovk. Aggregating strategies. In Proc. Third Workshop on Computational Learning Theory, pages 371\u2013383. Morgan Kaufmann, 1990.\n[22] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2003.", "award": [], "sourceid": 1068, "authors": [{"given_name": "Sasha", "family_name": "Rakhlin", "institution": null}, {"given_name": "Ohad", "family_name": "Shamir", "institution": null}, {"given_name": "Karthik", "family_name": "Sridharan", "institution": null}]}