{"title": "Random Walk Approach to Regret Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1777, "page_last": 1785, "abstract": "We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies.", "full_text": "Random Walk Approach to Regret Minimization\n\nHariharan Narayanan\n\nMIT\n\nCambridge, MA 02139\n\nhar@mit.edu\n\nAlexander Rakhlin\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nrakhlin@wharton.upenn.edu\n\nAbstract\n\nWe propose a computationally ef\ufb01cient random walk on a convex body which\nrapidly mixes to a time-varying Gibbs distribution. In the setting of online convex\noptimization and repeated games, the algorithm yields low regret and presents a\nnovel ef\ufb01cient method for implementing mixture forecasting strategies.\n\n1\n\nIntroduction\n\nThis paper brings together two topics: online convex optimization and sampling from logconcave\ndistributions over convex bodies.\nOnline convex optimization has been a recent focus of research [30, 25], for it presents an abstrac-\ntion that uni\ufb01es and generalizes a number of existing results in online learning. Techniques from\nthe theory of optimization (in particular, Fenchel and minimax duality) have proven to be key for\nunderstanding the rates of growth of regret [25, 1]. Deterministic regularization methods [3, 25]\nhave emerged as natural black-box algorithms for regret minimization, and the choice of the regu-\nlarization function turned out to play a pivotal role in limited-feedback problems [3]. In particular,\nthe authors of [3] demonstrated the role of self-concordant regularization functions and the Dikin\nellipsoid for minimizing regret. 
The latter gives a handle on the local geometry of the convex set, crucial for linear optimization with limited feedback.

Random walks in a convex body gained much attention following the breakthrough paper of Dyer, Frieze and Kannan [9], who exhibited a polynomial-time randomized algorithm for estimating the volume of a convex body. It is known that the problem of computing this volume by a deterministic algorithm is #P-hard. Over the two decades following [9], the polynomial dependence of volume computation on the dimension n has been drastically decreased, from O∗(n²³) to O∗(n⁴) [17]. This development was accomplished through the study of several geometric random walks: the Ball Walk and Hit-and-Run (see [26] for a survey). The driving force behind such results is the family of isoperimetric inequalities, which can be extended from uniform to general logconcave distributions. In particular, computing the volume of a convex body can be seen as a special case of integrating a logconcave function, and there have been a number of major results on mixing times for sampling from logconcave distributions [17, 18]. Connections to optimization have been established in [12, 18], among others. More recently, a novel random walk, called the Dikin Walk, has been proposed in [19, 13]. By exploiting the local geometry of the set, this random walk is shown to mix rapidly, and it offers a number of advantages over the other random walks.

While the aim of online convex optimization is different from that of sampling from logconcave distributions, it is remarkable that the two communities independently recognized the importance of the Dikin ellipsoid. In this paper we build a bridge between the two topics. We show that the problem of online convex optimization can be solved by sampling from logconcave distributions, and that the Dikin Walk can be adapted to mix rapidly to a certain time-varying distribution.
In fact, it mixes fast enough that for linear cost functions only one step of the guided Dikin Walk is necessary per round of the repeated game. This is surprisingly similar to the sufficiency of one Damped Newton step in Algorithm 2 of [3], due to the locally quadratic convergence ensured by the self-concordant regularizer.

The time-varying Gibbs distributions from which we sample are closely related to Mixture Forecasters and Bayesian Model Averaging methods (see [7, Section 11.10] as well as [29, 28, 4, 10]). To the best of our knowledge, the method presented in this paper is the first provably computationally efficient approach to solving a class of problems which involves integrating over continuous sets of decisions. From the Bayesian point of view, our algorithm is an efficient procedure for sampling from posterior distributions, and can be used in settings outside of regret minimization.

Prior work: The closest to our work is the result of [11] for Universal Portfolios. Unlike our one-step Markov chain, the algorithm of [11] works with a discretization of the probability simplex and requires a number of steps which has adverse dependence on the time horizon and accuracy. This seems unavoidable with the Grid Walk. In [2], it was shown that the Weighted Average Forecaster [15, 27] on a prohibitively large class of experts is optimal in terms of regret for a certain multitask problem, yet computationally inefficient. A Markov chain was proposed with the required stationary distribution, but no mixing-time bounds were derived. In [8], the authors faced a similar problem, whereby near-optimal regret can be achieved by the Weighted Average Forecaster on a prohibitively large discretization of the set of decisions.
Sampling from time-varying Markov chains has been investigated in the context of network dynamics [24], and has been examined from the point of view of linear stochastic approximation in reinforcement learning [14]. Beyond [11], we are not aware of any results to date where a provably rapidly mixing walk is used to solve regret minimization problems.

It is worth emphasizing that without the Dikin Walk [19], the one-step mixing results of this paper seem out of reach. In particular, when sampling from exponential distributions, the known bounds for the conductance of the Ball Walk and Hit-and-Run are not scale-independent. In order to obtain O(√T) regret, one has to be able to sample the target distribution with an error that is O(1/√T). As a consequence of the deterioration of the bounds on the conductance as the scale tends to zero, the number of steps necessary per round would tend to infinity as T tends to infinity.

2 Main Results

Let K ⊂ R^n be a convex compact set and let F be a set of convex functions from K to R. Online convex optimization is defined as a repeated T-round game between the player (the algorithm) and Nature (the adversary) [30, 25]. From the outset we assume that Nature is oblivious (see [7]), i.e. the individual sequence of decisions ℓ_1, ..., ℓ_T ∈ F can be fixed before the game. We are interested in randomized algorithms, and hence we consider the following online learning model: on round t, the player chooses a distribution (or, a mixed strategy) µ_{t−1} supported on K and “plays” a random X_t ∼ µ_{t−1}. Nature then reveals the cost function ℓ_t ∈ F. The goal of the player is to control the expected regret (see Lemma 1) with respect to a randomized strategy defined by a fixed distribution p_U ∈ P, for some collection of distributions P.
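In code, one round of this protocol can be sketched as follows (a minimal illustration of the game itself; the toy decision set K = [0, 1], the linear losses, the uniform player, and all names are ours, not part of the paper's construction):

```python
import random

def play_game(T, sample_strategy, nature_losses, comparator):
    """Run the T-round protocol described above (illustrative names).

    sample_strategy(t, history) draws X_t from the player's mixed
    strategy mu_{t-1}; nature_losses is the oblivious sequence
    l_1, ..., l_T, fixed before the game begins.
    """
    history, total_player, total_comparator = [], 0.0, 0.0
    for t in range(1, T + 1):
        x_t = sample_strategy(t, history)   # player draws X_t ~ mu_{t-1}
        loss_t = nature_losses[t - 1]       # Nature reveals l_t
        total_player += loss_t(x_t)
        total_comparator += loss_t(comparator)
        history.append(loss_t)              # l_1..l_t available next round
    return total_player - total_comparator  # realized regret vs. fixed comparator

# Toy instance on K = [0, 1] with linear losses l_t(x) = c_t * x:
random.seed(0)
costs = [random.uniform(0, 1) for _ in range(100)]
losses = [(lambda x, c=c: c * x) for c in costs]
uniform_player = lambda t, hist: random.uniform(0, 1)
regret = play_game(100, uniform_player, losses, comparator=0.0)
```

Since every c_t is non-negative here, the comparator x∗ = 0 is optimal in hindsight and the uniform player's regret is non-negative, which is exactly what a sublinear-regret strategy would drive down relative to T.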
If P contains Dirac delta distributions, the comparator term is indeed the best fixed decision x∗ ∈ K chosen in hindsight. A procedure which guarantees sublinear growth of regret for any distribution p_U ∈ P will be called Hannan consistent with respect to P. We now state a natural procedure for updating the distributions µ_t which guarantees Hannan consistency for a wide range of problems. This procedure is similar to the Mixture Forecaster used in the prediction context [29, 28, 4, 10]. Denote the cumulative cost functions by L_t(x) = Σ_{s=1}^t ℓ_s(x), with L_0(x) ≡ 0, and let η > 0 be a learning rate. Let q_0(x) be some prior probability distribution supported on K. Define the following sequence of functions
q_t(x) = q_0(x) exp{−ηL_t(x)},  ∀t ∈ {1, ..., T},   (1)
for every x ∈ K. Define the probability distribution µ_t over K at time t to have density
dµ_t(x)/dx = q_0(x)e^{−ηL_t(x)} / Z_t,  where Z_t = ∫_{x∈K} q_t(x) dx.   (2)
Let D(p‖q) stand for the Kullback-Leibler divergence between distributions p and q. The following lemma¹ gives an equality for expected regret with respect to a fixed randomized strategy. It bears striking similarity to upper bounds on regret in terms of Bregman divergences for the Follow the Regularized Leader and Mirror Descent methods [23, 5], [7, Theorem 11.1].

¹ Due to its simplicity, the lemma has likely appeared in the literature, yet we could not locate a reference for this form with equality and in the context of online convex optimization. The closest results appear in [28, 10], [7, p. 326] in the context of prediction, and in [4] in the context of density estimation with exponential families.

Lemma 1. Let X_t be a random variable distributed according to µ_{t−1}, for all t ∈ {1, ..., T}, as defined in (2). Let U be a random variable with distribution p_U.
The expected regret is
E[ Σ_{t=1}^T ℓ_t(X_t) − Σ_{t=1}^T ℓ_t(U) ] = η⁻¹ ( D(p_U‖µ_0) − D(p_U‖µ_T) ) + η⁻¹ Σ_{t=1}^T D(µ_{t−1}‖µ_t).
Specializing to the case ℓ(x) ∈ [0, 1] over K,
E[ Σ_{t=1}^T ℓ_t(X_t) − Σ_{t=1}^T ℓ_t(U) ] ≤ η⁻¹ D(p_U‖µ_0) + Tη/8.

Before proceeding, let us make a few remarks. First, if the divergence between the comparator distribution p_U and the prior µ_0 is bounded, the result yields O(√T) rates of regret growth for bounded losses, by choosing η appropriately. To bound the divergence between a continuous initial µ_0 and a point comparator at some x∗, the analysis can be carried out in two stages: comparison to a “small-covariance” Gaussian centered at x∗, followed by the observation that the loss of the “small-covariance” Gaussian strategy is not very different from the loss of the deterministic strategy x∗. This analysis can be found in [7, p. 326] and gives a near-optimal O(√T log T) regret bound. Second, we note that for linear cost functions, the notion of expected regret coincides with regret for deterministic strategies. Third, we note that if the prior is of the form q_0(x) ∝ exp{−R(x)} for some convex function R, then q_t(x) ∝ exp{−(ηL_t(x) + R(x))}, bearing similarity to the objective function of the Follow the Regularized Leader algorithm [23, 3]. In general, we can encode prior knowledge in q_0. For instance, if the cost functions are linear and the set K is a convex hull of N vertices (e.g. the probability simplex), then the minimum loss is attained at one of the vertices, and a uniform prior on the vertices yields the Weighted Average Forecaster with the usual log N dependence [7]. Finally, we note that in online convex optimization, one of the difficulties is the issue of projections back to the set K.
This issue does not arise when dealing with distributions, but instead translates into the difficulty of sampling. We find these parallels between sampling and optimization intriguing.

We defer the easy proof of Lemma 1 to p. 8. Having a bound on regret, a natural question is whether there exists a computationally efficient algorithm for playing X_t according to the mixed strategy given in (2). The main result of this paper is that for linear Lipschitz cost functions the guided random walk (Algorithm 1 below) produces a sequence of points X_1, ..., X_T ∈ K with respective distributions σ_0, ..., σ_{T−1} such that σ_i is close to µ_i for all 0 ≤ i ≤ T−1. Moreover, X_i is obtained from X_{i−1} with only one random step. The step requires sampling from a Gaussian distribution with covariance given by the Hessian of the self-concordant barrier and can be implemented efficiently whenever the Hessian can be computed. The computation time exactly matches [3, Algorithm 2]: it is the same as the time spent inverting a Hessian matrix, which is O(n³) or less.

Let us now discuss our assumptions. First, the analysis of the random walk is carried out only for linear cost functions with a bounded Lipschitz constant. An analysis for general convex functions might be possible, but for the sake of brevity we restrict ourselves to the linear case. Note that convex cost functions can be linearized, and a standard argument shows that regret for the linearized functions can only be larger than that for the convex functions [30]. The second assumption is that q_0 does not depend on T and has a bounded L2 norm with respect to the uniform distribution on K. This means that q_0 can be not only uniform but, for instance, of the form q_0(x) ∝ exp{−R(x)}.
Theorem 2.
Suppose ℓ_t : K → [0, 1] are linear functions with Lipschitz constant 1 and the prior q_0 has bounded L2 norm with respect to the uniform distribution on K. Then the one-step random walk (Algorithm 1) produces a sequence X_1, ..., X_T with distributions σ_0, ..., σ_{T−1} such that for all i,
∫_{x∈K} |dσ_i(x) − dµ_i(x)| ≤ Cηn³ν²,
where the µ_i are defined in (2), ν is the parameter of self-concordance, and C is an absolute constant. Therefore, the regret of Algorithm 1 is within O(√T) of that of the ideal procedure of Lemma 1. In particular, by choosing η appropriately, for an absolute constant C′,
E[ Σ_{t=1}^T ℓ_t(X_t) − Σ_{t=1}^T ℓ_t(U) ] ≤ C′ n^{3/2} ν √( T · D(p_U‖µ_0) ).   (3)

Proof. The statement follows directly from Lemma 1, Theorem 9, and the observation that for bounded losses
| E_{µ_{t−1}} ℓ_t(X_t) − E_{σ_{t−1}} ℓ_t(X_t) | ≤ ∫_{x∈K} |ℓ_t(x)| · |dµ_{t−1}(x) − dσ_{t−1}(x)| ≤ Cηn³ν².

3 Sampling from a time-varying Gibbs distribution

Sketch of the Analysis. The sufficiency of only one step of the random walk is made possible by the fact that the distributions µ_{t−1} and µ_t are close, and thus µ_{t−1} is a (very) warm start for µ_t. The reduction in distance between the distributions after a single step is due to a general fact (Lovász-Simonovits [16]) which we state in Theorem 6. The majority of the work goes into lower bounding the conductance of the random walk by a quantity independent of T (Lemma 5).
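The warm-start effect can be seen in a toy one-dimensional simulation (a hedged sketch only: uniform proposals on K = [0, 1] rather than Algorithm 1's Dikin-ellipsoid Gaussians, and all constants are illustrative). One Metropolis step per round suffices to track the time-varying Gibbs target q_t(x) ∝ exp(−ηL_t(x)) when L_t changes slowly:

```python
import math, random

def metropolis_step(x, S, eta, step=0.2):
    """One Metropolis step toward the density ~ exp(-eta*S*x) on [0, 1]."""
    z = x + random.uniform(-step, step)
    if not 0.0 <= z <= 1.0:
        return x                                   # proposal left K: reject
    accept = min(1.0, math.exp(-eta * S * (z - x)))  # target ratio pi(z)/pi(x)
    return z if random.random() < accept else x

random.seed(1)
eta, x, S = 0.1, random.random(), 0.0
avg_late = []
for t in range(1, 2001):
    S += random.uniform(0.5, 1.0)   # Nature's cost c_t; L_t(x) = S * x
    x = metropolis_step(x, S, eta)  # ONE step per round, warm start from x
    if t > 1000:
        avg_late.append(x)

# The target concentrates near 0 as S grows; the one-step-per-round
# chain tracks it there despite never being restarted or run to mixing.
assert sum(avg_late) / len(avg_late) < 0.2
```

The point of the paper's analysis is that, for the Dikin Walk with the geometry-aware proposal, this one-step tracking is not just empirical but provable, with error bounded independently of T.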
Since the random walk of Algorithm 1 takes advantage of the local geometry of the set, the conductance is lower bounded by (a) proving an isoperimetric inequality (Theorem 3) for the Riemannian metric, which states that the measure of the gap between two well-separated sets is large, and (b) proving that for close-by (in the Riemannian metric) points, their transition functions are not too different (Lemma 4). Section 3 is organized as follows. In Section 3.1, the main building blocks for proving the mixing time are stated, and their proofs appear later in Section 4. In Section 3.2, we use the mixing result of Section 3.1 to show that Algorithm 1 indeed closely tracks the distributions µ_t (Theorem 9).

3.1 Bounding Mixing Time

In the remainder of this paper, C will denote a universal constant that may change from line to line. For any function F on the interior int(K) having continuous derivatives of order k, for vectors h_1, ..., h_k ∈ R^n and x ∈ int(K), for k ≥ 1, we recursively define
D^k F(x)[h_1, ..., h_k] := lim_{ε→0} ( D^{k−1}F(x + εh_k)[h_1, ..., h_{k−1}] − D^{k−1}F(x)[h_1, ..., h_{k−1}] ) / ε,
where D^0 F(x) := F(x). Let F be a self-concordant barrier of K with parameter ν (see [20]). For x, y ∈ K, ρ(x, y) is the distance in the Riemannian metric whose metric tensor is the Hessian of F. Thus, the metric tensor on the tangent space at x assigns to a vector v the length ‖v‖_x := ( D²F(x)[v, v] )^{1/2}, and to a pair of vectors v, w the inner product ⟨v, w⟩_x := D²F(x)[v, w]. We have ρ(x, y) = inf_Γ ∫_{z∈Γ} ‖dΓ‖_z, where the infimum is taken over all rectifiable paths Γ from x to y. Let M be the metric space whose point set is K and whose metric is ρ. We assume the ℓ_i are linear and 1-Lipschitz with respect to ρ.
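As a concrete illustration of this local norm, consider the standard log-barrier of a polytope {x : Ax ≤ b}, F(x) = −Σ_i log(b_i − a_i·x), which is self-concordant and has Hessian AᵀD⁻²A for the slack matrix D = diag(b − Ax). The cube instance and all names below are our illustrative choices, not taken from the paper:

```python
import numpy as np

def barrier_hessian(A, b, x):
    """Hessian of F(x) = -sum_i log(b_i - a_i . x) at an interior point x."""
    s = b - A @ x                           # slacks, positive in int(K)
    assert np.all(s > 0), "x must lie in the interior of the polytope"
    return A.T @ np.diag(1.0 / s**2) @ A    # D^2 F(x) = A^T diag(1/s_i^2) A

def local_norm(A, b, x, v):
    """||v||_x = sqrt(v^T D^2F(x) v); its unit ball is the Dikin ellipsoid."""
    H = barrier_hessian(A, b, x)
    return float(np.sqrt(v @ H @ v))

# Unit square [0,1]^2 written as Ax <= b:
A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
b = np.array([1., 0., 1., 0.])
center = np.array([0.5, 0.5])
near_face = np.array([0.9, 0.5])
v = np.array([1.0, 0.0])
# The same Euclidean step is "longer" in the local metric near the boundary,
# i.e. the Dikin ellipsoid shrinks toward the faces:
assert local_norm(A, b, near_face, v) > local_norm(A, b, center, v)
```

This shrinking of the ellipsoid near the boundary is what lets the walk propose large steps deep inside K while staying cautious near the faces.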
For x ∈ int(K), let G_x denote the unique Gaussian probability density function on R^n such that
G_x(y) ∝ exp( − n‖x − y‖²_x / (2r²) + V(x) ),  where V(x) = (1/2) ln det D²F(x) and r = 1/(Cn).
Further, define the scaled cumulative cost as s_t(y) := r²ηL_t(y). Note that the shape of G_x is precisely given by the Dikin ellipsoid, which is defined as the unit ball in ‖·‖_x around the point x [20, 3].

The Markov chain M_t considered in this paper is such that for x, y ∈ K, one step x → y is given by Algorithm 1. A simple calculation shows that the detailed balance conditions are satisfied with respect to the stationary distribution µ_t (defined in Eq. (2)). Therefore the Markov chain is reversible and has this stationary measure. The next results imply that this Markov chain is rapidly mixing. The first main ingredient is an isoperimetric inequality necessary for lower bounding the conductance.

Theorem 3. Let S_1 and S_2 be measurable subsets of K and µ a probability measure supported on K that possesses a density whose logarithm is concave. Then
µ((K \ S_1) \ S_2) ≥ ( 1 / (2(1 + 3ν)) ) ρ(S_1, S_2) µ(S_1) µ(S_2).

Algorithm 1 One Step Random Walk (X_t, s_t)
Input: current point X_t ∈ K and scaled cumulative cost s_t.
Output: next point X_{t+1} ∈ K.
Toss a fair coin. If Heads, set X_{t+1} := X_t.
Else, sample Z from G_{X_t}. If Z ∉ K, let X_{t+1} := X_t.
If Z ∈ K, let
X_{t+1} := Z with prob. min{ 1, G_Z(X_t) exp(s_t(X_t)) / ( G_{X_t}(Z) exp(s_t(Z)) ) },
X_{t+1} := X_t otherwise.

Figure 1: The new point is sampled from a Gaussian distribution whose shape is defined by the local metric. Dotted lines are the unit Dikin ellipsoids.

The next lemma relates the Riemannian metric ρ to the Markov chain. Intuitively, it says that for close-by points, their transition distributions cannot be far apart.
Lemma 4.
If x, y ∈ K and ρ(x, y) ≤ r/(C√n), then d_TV(P_x, P_y) ≤ 1 − 1/C.

Theorem 3 and Lemma 4 together give a lower bound on the conductance of the Markov chain.

Lemma 5 (Bound on Conductance). Let µ be any exponential distribution on K. The conductance
Φ := inf_{S_1 : µ(S_1) ≤ 1/2} ( ∫_{S_1} P_x(K \ S_1) dµ(x) ) / µ(S_1)
of the Markov chain in Algorithm 1 is bounded below by 1/(Cνn√n).

The lower bound on conductance of Lemma 5 can now be used with the following general result on the reduction of distance between distributions.

Theorem 6 (Lovász-Simonovits [16]). Let γ_0 be the initial distribution for a lazy reversible ergodic Markov chain whose conductance is Φ and stationary measure is γ, and let γ_k be the distribution of the kth step. Let M := sup_S γ_0(S)/γ(S), where the supremum is over all measurable subsets S of K. For every bounded f, let ‖f‖_{2,γ} denote ( ∫_K f(x)² dγ(x) )^{1/2}. For any fixed f, let Ef be the map that takes x to ∫_K f(y) dP_x(y). Then if ∫_K f(x) dγ(x) = 0,
‖E^k f‖_{2,γ} ≤ ( 1 − Φ²/2 )^k ‖f‖_{2,γ}.

In summary, Lemma 5 provides a lower bound on conductance, while Theorem 6 ensures reduction of the norm whenever the conductance is large enough. In the next section, these two are put together. We will show that the reduction in the norm guarantees that the distribution after one step of the random walk (k = 1 in Theorem 6) is close to the desired distribution µ_t.

3.2 Tracking the distributions

Let {σ_i}_{i=1}^∞ be the probability measures with bounded density, supported on K, corresponding to the distribution of the point during different steps of the evolution of the algorithm. For i ∈ N, let ‖·‖_{µ_i} denote the L2 norm with respect to the measure µ_i. We shall write ‖·‖_i for brevity.
Hence, for a measurable function f : K → R, ‖f‖_i = ( ∫_K f² dµ_i )^{1/2}. Furthermore,
sup_{x∈K} dµ_i(x)/dµ_{i+1}(x) = sup_{x∈K} (Z_{i+1}/Z_i) · ( q_0(x)e^{−ηL_i(x)} ) / ( q_0(x)e^{−ηL_{i+1}(x)} ) ≤ e^{2η} ≤ 1 + η̄,   (4)
where we used the fact that ℓ_{i+1}(x) ≤ 1, and η̄ is an appropriate multiple of η; e.g. η̄ = (e² − 1)η does the job. Analogously, dµ_{i+1}/dµ_i ≤ 1 + η̄ over K. It then follows that the norms at times i and i + 1 are comparable:
(1 + η̄)⁻¹ ‖f‖_i ≤ ‖f‖_{i+1} ≤ (1 + η̄) ‖f‖_i.   (5)

The mixing results of Lemma 5 together with Theorem 6 imply
Corollary 7. For any i,
‖dσ_{i+1}/dµ_i − 1‖_i ≤ ‖dσ_i/dµ_i − 1‖_i ( 1 − 1/(Cn³ν²) ).

Corollary 7 says that σ_{i+1} is “closer” than σ_i to µ_i by a multiplicative constant. We now show that the distance of σ_{i+1} to µ_{i+1} is (additively) not much worse than its distance to µ_i. The multiplicative reduction in distance is shown to dominate the additive increase, concluding the proof that σ_i is close to µ_i for all i (Theorem 9).

Lemma 8. For any i, it holds that
‖dσ_{i+1}/dµ_{i+1} − 1‖_{i+1} ≤ (1 + η̄)² ‖dσ_{i+1}/dµ_i − 1‖_i + η̄(1 + η̄).

Proof. Decompose the difference of the two sides as
‖dσ_{i+1}/dµ_{i+1} − 1‖_{i+1} − ‖dσ_{i+1}/dµ_i − 1‖_i
= ( ‖dσ_{i+1}/dµ_{i+1} − 1‖_{i+1} − ‖dσ_{i+1}/dµ_i − 1‖_{i+1} )   (6)
+ ( ‖dσ_{i+1}/dµ_i − 1‖_{i+1} − ‖dσ_{i+1}/dµ_i − 1‖_i ).   (7)
We first establish a bound of Cη on (6). For any function f : K → R, let f⁺(x) = max(0, f(x)) and f⁻(x) = min(0, f(x)).
By the triangle inequality,
(6) ≤ ‖dσ_{i+1}/dµ_{i+1} − dσ_{i+1}/dµ_i‖_{i+1}.
Now, using (4) and (5),
‖dσ_{i+1}/dµ_{i+1} − dσ_{i+1}/dµ_i‖²_{i+1} = ‖ (dσ_{i+1}/dµ_i)( dµ_i/dµ_{i+1} − 1 ) ‖²_{i+1} ≤ η̄²(1 + η̄)² ‖dσ_{i+1}/dµ_i‖²_i.
Since ‖dσ_{i+1}/dµ_i‖_i ≤ 1 + ‖dσ_{i+1}/dµ_i − 1‖_i, (6) is bounded as
(6) ≤ η̄(1 + η̄)( 1 + ‖dσ_{i+1}/dµ_i − 1‖_i ).
Next, a bound on (7) follows simply by the norm comparison inequality (5):
(7) ≤ η̄ ‖dσ_{i+1}/dµ_i − 1‖_i.
The statement follows by rearranging the terms.

Theorem 9. If ‖dσ_0/dµ_0 − 1‖_0 < η̄(1 + η̄), where η̄ = (e² − 1)η, then for all i,
‖dσ_i/dµ_i − 1‖_i ≤ Cηn³ν².
Consequently, for all i,
∫_{x∈K} |dσ_i(x) − dµ_i(x)| ≤ Cηn³ν².

Proof. By Corollary 7 and Lemma 8, we see that
‖dσ_{i+1}/dµ_{i+1} − 1‖_{i+1} ≤ (1 + η̄)² ( 1 − 1/(Cn³ν²) ) ‖dσ_i/dµ_i − 1‖_i + η̄(1 + η̄).
Since η̄ = o(1/(n³ν²)),
‖dσ_{i+1}/dµ_{i+1} − 1‖_{i+1} ≤ ( 1 − 1/(Cn³ν²) ) ‖dσ_i/dµ_i − 1‖_i + Cη.   (8)
Let 0 ≤ a < 1 and b > 0, and let x_0, x_1, ... be any sequence of non-negative numbers such that x_0 ≤ b and, for each i, x_{i+1} ≤ a·x_i + b. We see, by unfolding the recurrence, that x_{i+1} ≤ b/(1 − a). From this and (8), the first statement of the theorem follows. The second statement follows from
∫ |dσ_i − dµ_i| = ∫ | dσ_i/dµ_i − 1 | dµ_i ≤ ( ∫ ( dσ_i/dµ_i − 1 )² dµ_i )^{1/2} = ‖dσ_i/dµ_i − 1‖_i.

4 Proof Sketch

In this section, we prove the main building blocks stated in Section 3.1. Consider a time step t. Let d_TV represent the total variation distance. Without loss of generality, assume x is the origin and assume s_t(x) = 0. For x ∈ K and a vector v, |v|_x is defined to be ( sup{ α : x ± αv ∈ K } )⁻¹. The following relation holds:
Theorem 10 (Theorem 2.3.2 (iii) [21]). Let F be a self-concordant barrier whose self-concordance parameter is ν. Then |h|_x ≤ ‖h‖_x ≤ 2(1 + 3ν)|h|_x for all h ∈ R^n and x ∈ int(K).
We term (S_1, (M \ S_1) \ S_2, S_2) a δ-partition of M if δ ≤ d_M(S_1, S_2) := inf_{x∈S_1, y∈S_2} d_M(x, y), where S_1, S_2 are measurable subsets of M. Let P_δ be the set of all δ-partitions of M.
If µ is a measure on M, the isoperimetric constant is defined as
C(δ, M, µ) := inf_{P_δ} µ((M \ S_1) \ S_2) / ( µ(S_1)µ(S_2) ),  and  C_t := C( r/√n, M, µ_t ).
Given interior points x, y in int(K), suppose p, q are the ends of the chord in K containing x, y, where p, x, y, q lie in that order. Denote by σ(x, y) the cross ratio ( |x − y||p − q| ) / ( |p − x||q − y| ). Let d_H denote the Hilbert (projective) metric defined by d_H(x, y) := ln(1 + σ(x, y)). For two sets S_1 and S_2, let σ(S_1, S_2) := inf_{x∈S_1, y∈S_2} σ(x, y).

Proof of Theorem 3. For any z on the segment xy, an easy computation shows that d_H(x, z) + d_H(z, y) = d_H(x, y). Therefore it suffices to prove the result infinitesimally. By a result due to Nesterov and Todd [22, Lemma 3.1],
‖x − y‖_x − ‖x − y‖²_x ≤ ρ(x, y) ≤ −ln( 1 − ‖x − y‖_x )   (9)
whenever ‖x − y‖_x < 1. From (9), lim_{y→x} ρ(x, y)/‖x − y‖_x = 1, and a direct computation shows that
lim_{y→x} d_H(x, y)/|x − y|_x = lim_{y→x} σ(x, y)/|x − y|_x ≥ 1.
Hence, using Theorem 10, the Hilbert metric and the Riemannian metric satisfy
ρ(x, y) ≤ 2(1 + 3ν) d_H(x, y).
The statement of the theorem is now an immediate consequence of the following result due to Lovász and Vempala [18]: if S_1 and S_2 are measurable subsets of K and µ is a probability measure supported on K that possesses a density whose logarithm is concave, then
µ((K \ S_1) \ S_2) ≥ σ(S_1, S_2) µ(S_1) µ(S_2).

Proof of Lemma 5. Let S_1 be a measurable subset of K such that µ(S_1) ≤ 1/2, and let S_2 := K \ S_1 be its complement. Let S′_1 = S_1 ∩ { x : P_x(S_2) ≤ 1/C } and S′_2 = S_2 ∩ { y : P_y(S_1) ≤ 1/C }. By the reversibility of the chain, which is easily checked,
∫_{S_1} P_x(S_2) dµ(x) = ∫_{S_2} P_y(S_1) dµ(y).
If x ∈ S′_1 and y ∈ S′_2, then
d_TV(P_x, P_y) := 1 − ∫_K min( (dP_x/dµ)(w), (dP_y/dµ)(w) ) dµ(w) ≥ 1 − 1/C.
Lemma 4 implies that if ρ(x, y) ≤ r/(C√n), then d_TV(P_x, P_y) ≤ 1 − 1/C. Therefore
ρ(S′_1, S′_2) := inf_{x∈S′_1, y∈S′_2} ρ(x, y) ≥ r/(C√n).   (10)
Therefore Theorem 3 implies that
µ((K \ S′_1) \ S′_2) ≥ ( ρ(S′_1, S′_2) / (2(1 + 3ν)) ) min( µ(S′_1), µ(S′_2) ) ≥ ( r/(Cν√n) ) min( µ(S′_1), µ(S′_2) ).
First suppose µ(S′_1) ≥ (1 − 1/C)µ(S_1) and µ(S′_2) ≥ (1 − 1/C)µ(S_2). Every point outside S′_1 ∪ S′_2 has P_x(S_2) > 1/C or P_y(S_1) > 1/C, so by reversibility
∫_{S_1} P_x(S_2) dµ(x) ≥ (1/C) µ((K \ S′_1) \ S′_2) ≥ ( r/(Cν√n) ) µ(S_1),
which, since r = 1/(Cn), gives the claimed bound of 1/(Cνn√n), and we are done. Otherwise, without loss of generality, suppose µ(S′_1) ≤ (1 − 1/C)µ(S_1). Then every x ∈ S_1 \ S′_1 satisfies P_x(S_2) > 1/C and µ(S_1 \ S′_1) ≥ µ(S_1)/C, so
∫_{S_1} P_x(S_2) dµ(x) ≥ µ(S_1)/C,
and we are done.

Proof of Lemma 1. We have that
D(µ_{t−1}‖µ_t) = ∫_K dµ_{t−1} log( q_{t−1}Z_t / (Z_{t−1}q_t) ) = log( Z_t/Z_{t−1} ) + ∫_K ηℓ_t(x) dµ_{t−1}(x) = log( Z_t/Z_{t−1} ) + ηEℓ_t(X_t).   (11)
Rearranging, canceling the telescoping terms, and using the fact that Z_0 = 1,
η E Σ_{t=1}^T ℓ_t(X_t) = Σ_{t=1}^T D(µ_{t−1}‖µ_t) − log Z_T.
Let U be a random variable with probability distribution p_U. Then
−Σ_{t=1}^T Eℓ_t(U) = η⁻¹ ∫_K −ηL_T(u) dp_U(u) = η⁻¹ ∫_K dp_U(u) log( q_T(u)/q_0(u) ).
Combining,
E[ Σ_{t=1}^T ℓ_t(X_t) − Σ_{t=1}^T ℓ_t(U) ] = η⁻¹ ∫_K dp_U(u) log( (q_T(u)/Z_T)/q_0(u) ) + η⁻¹ Σ_{t=1}^T D(µ_{t−1}‖µ_t)
= η⁻¹ ( D(p_U‖µ_0) − D(p_U‖µ_T) ) + η⁻¹ Σ_{t=1}^T D(µ_{t−1}‖µ_t).
Now, from Eq. (11), the KL divergence can also be written as
D(µ_{t−1}‖µ_t) = log( ∫_K e^{−ηℓ_t(x)} q_{t−1}(x) dx / ∫_K q_{t−1}(x) dx ) + ηEℓ_t(X_t) = log E e^{−η( ℓ_t(X_t) − Eℓ_t(X_t) )}.
By representing the divergence in this form, one can obtain upper bounds via known methods, such as Log-Sobolev inequalities (e.g. [6]). In the simplest case of bounded loss, it is easy to show that D(µ_{t−1}‖µ_t) ≤ O(η²), and the particular constant 1/8 can be obtained, for instance, by applying Lemma A.1 in [7]. This proves the second part of the lemma.

References

[1] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT '09, 2009.
[2] J. Abernethy, P. L. Bartlett, and A. Rakhlin. Multitask learning with expert advice. In Proceedings of The Twentieth Annual Conference on Learning Theory, pages 484-498, 2007.
[3] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of The Twenty First Annual Conference on Learning Theory, 2008.
[4] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246, June 2001.
[5] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167-175, 2003.
[6] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy method. The Annals of Probability, 31:1583-1614, 2003.
[7] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[8] V. Dani, T. P. Hayes, and S. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems 20. Cambridge, MA, 2008.
[9] M. Dyer, A. Frieze, and R. Kannan.
A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1-17, 1991.
[10] S. Kakade and A. Ng. Online bounds for Bayesian algorithms. In Proceedings of Neural Information Processing Systems (NIPS 17), 2005.
[11] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. The Journal of Machine Learning Research, 3:440, 2003.
[12] A. T. Kalai and S. Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253-266, 2006.
[13] R. Kannan and H. Narayanan. Random walks on polytopes and an affine interior point method for linear programming. In STOC, 2009. Accepted (pending revisions), Mathematics of Operations Research.
[14] V. R. Konda and J. N. Tsitsiklis. Linear stochastic approximation driven by slowly varying Markov chains. Systems and Control Letters, 50:95-102, 2003.
[15] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[16] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms, 4(4):359-412, 1993.
[17] L. Lovász and S. Vempala. Simulated annealing in convex bodies and an O∗(n⁴) volume algorithm. J. Comput. Syst. Sci., 72(2):392-417, 2006.
[18] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307-358, 2007.
[19] H. Narayanan. Randomized interior point methods for sampling and optimization. CoRR, abs/0911.3950, 2009.
[20] A. S. Nemirovskii. Interior point polynomial time methods in convex programming, 2004.
[21] Y. E. Nesterov and A. S. Nemirovskii. Interior Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.
[22] Y. E. Nesterov and M. J. Todd.
On the Riemannian geometry defined by self-concordant barriers and interior-point methods. Foundations of Computational Mathematics, 2(4):333-361, 2008.
[23] A. Rakhlin. Lecture notes on online learning, 2008. http://stat.wharton.upenn.edu/~rakhlin/papers/online learning.pdf.
[24] D. Shah and J. Shin. Dynamics in congestion games. In Proceedings of ACM Sigmetrics, 2010.
[25] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In NIPS, 2007.
[26] S. Vempala. Geometric random walks: A survey. In Combinatorial and Computational Geometry, Math. Sci. Res. Inst. Publ., 52:577-616, 2005.
[27] V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 372-383. Morgan Kaufmann, 1990.
[28] V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213-248, 2001.
[29] K. Yamanishi. Minimax relative loss analysis for sequential prediction algorithms using parametric hypotheses. In COLT' 98, pages 32-43, New York, NY, USA, 1998. ACM.
[30] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
", "award": [], "sourceid": 871, "authors": [{"given_name": "Hariharan", "family_name": "Narayanan", "institution": null}, {"given_name": "Alexander", "family_name": "Rakhlin", "institution": null}]}