{"title": "Stochastic Gradient Methods for Distributionally Robust Optimization with f-divergences", "book": "Advances in Neural Information Processing Systems", "page_first": 2208, "page_last": 2216, "abstract": "We develop efficient solution methods for a robust empirical risk minimization problem designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance. Our methods apply to distributionally robust optimization problems proposed by Ben-Tal et al., which put more weight on observations inducing high loss via a worst-case approach over a non-parametric uncertainty set on the underlying data distribution. Our algorithm solves the resulting minimax problems with nearly the same computational cost as stochastic gradient descent through the use of several carefully designed data structures. For a sample of size n, the per-iteration cost of our method scales as O(log n), which allows us to give the optimality certificates that distributionally robust optimization provides at little extra cost compared to empirical risk minimization and stochastic gradient methods.", "full_text": "Stochastic Gradient Methods for Distributionally Robust Optimization with f-divergences

Hongseok Namkoong
Stanford University
hnamk@stanford.edu

John C. Duchi
Stanford University
jduchi@stanford.edu

Abstract

We develop efficient solution methods for a robust empirical risk minimization problem designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance. Our methods apply to distributionally robust optimization problems proposed by Ben-Tal et al., which put more weight on observations inducing high loss via a worst-case approach over a non-parametric uncertainty set on the underlying data distribution.
Our algorithm solves the resulting minimax problems with nearly the same computational cost as stochastic gradient descent through the use of several carefully designed data structures. For a sample of size n, the per-iteration cost of our method scales as O(log n), which allows us to give the optimality certificates that distributionally robust optimization provides at little extra cost compared to empirical risk minimization and stochastic gradient methods.

1 Introduction

In statistical learning or other data-based decision-making problems, it is desirable to give solutions that come with guarantees on performance, at least to some specified confidence level. For tasks such as driving or medical diagnosis where safety and reliability are crucial, confidence levels have additional importance. Classical techniques in machine learning and statistics, including regularization, stability, concentration inequalities, and generalization guarantees [6, 25], provide such guarantees, though often a more fine-tuned certificate, one with calibrated confidence, is desirable. In this paper, we leverage techniques from the robust optimization literature [e.g. 2], building an uncertainty set around the empirical distribution of the data and studying worst-case performance over this uncertainty set. Recent work [15, 13] shows how this approach can (i) give calibrated statistical optimality certificates for stochastic optimization problems, (ii) perform a natural type of regularization based on the variance of the objective, and (iii) achieve fast rates of convergence under more general conditions than empirical risk minimization by trading off bias (approximation error) and variance (estimation error) optimally. In this paper, we propose efficient algorithms for such distributionally robust optimization problems.

We now provide our formal setting.
Let X ⊂ R^d be a compact convex set, and for a convex function f : R_+ → R with f(1) = 0, define the f-divergence between distributions P and Q by D_f(P||Q) = ∫ f(dP/dQ) dQ. Letting P_{ρ,n} := {p ∈ R^n : p^⊤ 1 = 1, p ≥ 0, D_f(p || 1/n) ≤ ρ/n} be an uncertainty set around the uniform distribution 1/n, we develop methods for solving the robust empirical risk minimization problem

  minimize_{x ∈ X}  sup_{p ∈ P_{ρ,n}}  Σ_{i=1}^n p_i ℓ_i(x).    (1)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In problem (1), the functions ℓ_i : X → R_+ are convex and subdifferentiable, and we consider the situation in which ℓ_i(x) = ℓ(x; ξ_i) for ξ_i iid∼ P_0. We let ℓ(x) = [ℓ_1(x) ··· ℓ_n(x)]^⊤ ∈ R^n denote the vector of convex losses, so the robust objective (1) is sup_{p ∈ P_{ρ,n}} p^⊤ ℓ(x).

A number of authors show how the robust formulation (1) provides guarantees. Duchi et al. [15] show that the objective (1) is a convex approximation to regularizing the empirical risk by variance,

  sup_{p ∈ P_{ρ,n}} Σ_{i=1}^n p_i ℓ_i(x) = (1/n) Σ_{i=1}^n ℓ_i(x) + sqrt((ρ/n) Var_{P_0}(ℓ(x; ξ))) + o_{P_0}(n^{-1/2})    (2)

uniformly in x ∈ X. Since the right-hand side naturally trades off good loss performance (approximation error) against minimizing variance (estimation error), which is usually non-convex, the robust formulation (1) provides a convex regularization of the standard empirical risk minimization (ERM) problem. This trading between bias and variance leads to certificates on the optimal value inf_{x ∈ X} E_{P_0}[ℓ(x; ξ)], so that under suitable conditions we have

  lim_{n→∞} P( inf_{x ∈ X} E_{P_0}[ℓ(x; ξ)] ≤ u_n ) = P(W ≤ sqrt(ρ))  for W ∼ N(0, 1),    (3)

where u_n = inf_{x ∈ X} sup_{p ∈ P_{ρ,n}} p^⊤ ℓ(x) is the optimal robust objective.
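To make the inner supremum in problem (1) concrete: for the χ² divergence f(t) = (1/2)(t − 1)², the constraint D_f(p || 1/n) ≤ ρ/n reads (1/2)||np − 1||₂² ≤ ρ, and the worst-case weights have a closed form whenever ρ is small enough that the nonnegativity constraint p ≥ 0 is inactive, tilting the uniform weights in the direction of the centered losses. A minimal numerical sketch (our own illustration; the function name is hypothetical, not from the paper's code):

```python
import numpy as np

def chi2_robust_objective(losses, rho):
    """Worst case of p^T losses over {p : 1^T p = 1, p >= 0,
    0.5 * ||n p - 1||_2^2 <= rho} (the chi^2 ball of problem (1)),
    assuming rho is small enough that p >= 0 never binds."""
    l = np.asarray(losses, dtype=float)
    n = l.size
    dev = l - l.mean()                  # centered losses
    norm = np.linalg.norm(dev)
    if norm == 0.0 or rho == 0.0:
        return float(l.mean()), np.full(n, 1.0 / n)
    lam = norm / np.sqrt(2.0 * rho)     # makes the chi^2 constraint tight
    p = 1.0 / n + dev / (n * lam)       # KKT stationarity when p > 0
    assert p.min() >= 0, "rho too large: p >= 0 binds, closed form invalid"
    return float(p @ l), p

val, p = chi2_robust_objective([0.0, 1.0, 2.0, 3.0], rho=0.05)
# val = mean + sqrt(2 * rho) * ||dev||_2 / n, slightly above the sample mean 1.5
```

Consistent with expansion (2), the returned value is the empirical mean plus a variance penalty of order sqrt(ρ · Var̂ / n).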
Duchi and Namkoong [13] provide finite-sample guarantees for the special case f(t) = (1/2)(t − 1)², making the expansion (2) more explicit and providing a number of consequences for estimation and optimization based on this expansion (including fast rates for risk minimization). A special case of their results [13, §3.1] is as follows. Let x̂_rob ∈ argmin_{x ∈ X} sup_{p ∈ P_{ρ,n}} p^⊤ ℓ(x), let VC(F) denote the VC-(subgraph)-dimension of the class of functions F := {ℓ(x; ·) | x ∈ X}, assume that M ≥ ℓ(x; ξ) for all x ∈ X, ξ ∈ Ξ, and for some fixed δ > 0 define ρ = log(1/δ) + 10 VC(F) log VC(F). Then, with probability at least 1 − δ,

  E_{P_0}[ℓ(x̂_rob; ξ)] ≤ u_n + O(1) Mρ/n ≤ inf_{x ∈ X} { E_{P_0}[ℓ(x; ξ)] + 2 sqrt(2ρ Var_{P̂_n}(ℓ(x; ξ)) / n) } + O(1) Mρ/n.    (4)

For large n, evaluating the objective (1) may be expensive; with fixed p = 1/n, this has motivated an extensive literature in stochastic and online optimization [27, 23, 19, 16, 18]. The problem (1) does not admit quite such a straightforward approach. A first idea, common in the robust optimization literature [3], is to obtain a problem that may be written as a sum of individual terms by taking the dual of the inner supremum, yielding the convex problem

  inf_{x ∈ X} sup_{p ∈ P_{ρ,n}} p^⊤ ℓ(x) = inf_{x ∈ X, λ ≥ 0, η ∈ R} { (λ/n) Σ_{i=1}^n f*((ℓ_i(x) − η)/λ) + λρ/n + η }.    (5)

Here f*(s) = sup_{t ≥ 0} {st − f(t)} is the Fenchel conjugate of the convex function f. While the above dual reformulation is jointly convex in (x, λ, η), canonical stochastic gradient descent (SGD) procedures [23] generally fail because the variance of the objective (and its subgradients) explodes as λ → 0.
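The conjugate in the dual (5) is explicit in the χ² case: for f(t) = (1/2)(t − 1)², a short computation gives f*(s) = (1/2)((s + 1)_+)² − 1/2, with maximizer t* = (s + 1)_+, so for s ≤ −1 the supremum sits at t = 0 with value −1/2. A quick numerical check of this closed form (our own sketch, not the paper's code):

```python
import numpy as np

def f(t):
    # chi^2 generator f(t) = 0.5 (t - 1)^2, defined for t >= 0
    return 0.5 * (t - 1.0) ** 2

def f_conj(s):
    # claimed closed form of f*(s) = sup_{t >= 0} { s t - f(t) }
    return 0.5 * np.maximum(s + 1.0, 0.0) ** 2 - 0.5

t = np.linspace(0.0, 50.0, 200001)          # fine grid over t >= 0
for s in (-3.0, -1.0, 0.0, 0.7, 2.5):
    numeric = float(np.max(s * t - f(t)))   # brute-force sup on the grid
    assert abs(numeric - f_conj(s)) < 1e-4
```

The boundedness of f* below is what keeps the dual (5) finite even when the shifted losses (ℓ_i(x) − η)/λ are very negative.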
(This is not just a theoretical issue: in extensive simulations, which we omit because they are a bit boring, SGD and other heuristic approaches that impose shrinking bounds of the form λ_t ≥ c_t > 0 at each iteration t all fail to optimize the objective (5).)

Instead, we view the robust ERM problem (1) as a game between the x (minimizing) player and the p (maximizing) player. Each player performs a variant of mirror descent (ascent), and we show how such an approach yields strong convergence guarantees, as well as good empirical performance. In particular, we show (for many suitable divergences f) that if each ℓ_i is L-Lipschitz and X has radius bounded by R, then our procedure requires at most O((R²L² + ρ)/ε²) iterations to achieve an ε-accurate solution to problem (1), which is comparable to the number of iterations required by SGD [23]. Our solution strategy builds on similar algorithms due to Nemirovski et al. [23, Sec. 3] and Ben-Tal et al. [4], and more directly on procedures developed by Clarkson et al. [10] for solving two-player convex games. Most directly relevant to our approach is that of Shalev-Shwartz and Wexler [26], which solves problem (1) under the assumption that P_{ρ,n} = {p ∈ R^n_+ : p^⊤ 1 = 1} and that there is some x with perfect loss performance, that is, Σ_{i=1}^n ℓ_i(x) = 0. We generalize these approaches to more challenging f-divergence-constrained problems and, for the χ² divergence with f(t) = (1/2)(t − 1)², develop efficient data structures that give a total run-time for solving problem (1) to ε-accuracy scaling as O((Cost(grad) + log n)(R²L² + ρ)/ε²). Here Cost(grad) is the cost to compute the gradient of a single term ∇ℓ_i(x) and perform a mirror descent step with x.
Using SGD to solve the empirical risk minimization problem to ε-accuracy has run-time O(Cost(grad) R²L²/ε²), so we see that we can achieve the guarantees (3)–(4) offered by the robust formulation (1) at little additional computational cost.

The remainder of the paper is organized as follows. We present our abstract algorithm in Section 2 and give guarantees on its performance in Section 3. In Section 4, we give efficient computational schemes for the case that f(t) = (1/2)(t − 1)², presenting experiments in Section 5.

2 A bandit mirror descent algorithm for the minimax problem

Under the conditions that ℓ is convex and X is compact, standard results [7] show that there exists a saddle point (x*, p*) ∈ X × P_{ρ,n} for the robust problem (1) satisfying

  sup{ p^⊤ ℓ(x*) | p ∈ P_{ρ,n} } ≤ p*^⊤ ℓ(x*) ≤ inf{ p*^⊤ ℓ(x) | x ∈ X }.

We now describe a procedure for finding this saddle point by alternating a linear bandit-convex optimization procedure [8] for p and a stochastic mirror descent procedure for x. Our approach builds on Nemirovski et al.'s [23] development of mirror descent for two-player stochastic games.

To describe our algorithm, we require a few standard tools. Let ||·||_x denote a norm on the space X with dual norm ||y||_{x,*} = sup{⟨x, y⟩ : ||x|| ≤ 1}, and let ψ_x be a differentiable strongly convex function on X, meaning that ψ_x(x + Δ) ≥ ψ_x(x) + ∇ψ_x(x)^⊤ Δ + (1/2)||Δ||²_x for all Δ. Let ψ_p be a differentiable strictly convex function on P_{ρ,n}. For a differentiable convex function h, we define the Bregman divergence B_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩ ≥ 0. The Fenchel conjugate ψ*_p of ψ_p is

  ψ*_p(s) := sup_p { ⟨s, p⟩ − ψ_p(p) }  and  ∇ψ*_p(s) = argmax_p { ⟨s, p⟩ − ψ_p(p) }.

(ψ*_p is differentiable because ψ_p is strongly convex [20, Chapter X].)
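For the negative-entropy choice ψ_p(p) = Σ_i p_i log p_i used later in Section 3.2, both maps are explicit: ∇ψ_p(p) = log p + 1 and ∇ψ*_p(s) = e^{s−1}, so the pre-projection ascent step w = ∇ψ*_p(∇ψ_p(p) + α ℓ̂) is a multiplicative-weights update. A minimal sketch (our own, with made-up numbers; the Bregman projection back onto P_{ρ,n} is omitted):

```python
import numpy as np

def entropy_grad(p):
    # gradient of psi_p(p) = sum_i p_i log p_i (for p > 0)
    return np.log(p) + 1.0

def entropy_conj_grad(s):
    # gradient of psi_p^*(s), i.e. argmax_{p >= 0} { <s, p> - psi_p(p) }
    return np.exp(s - 1.0)

p = np.array([0.2, 0.3, 0.5])
lhat = np.array([1.0, 0.0, 0.0])   # 1-sparse importance-weighted loss estimate
alpha = 0.1
w = entropy_conj_grad(entropy_grad(p) + alpha * lhat)
# the step multiplies each weight by exp(alpha * lhat_i):
assert np.allclose(w, p * np.exp(alpha * lhat))
```

Only the sampled coordinate moves, which is what makes each p-update cheap despite p living in R^n.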
We let g_i(x) ∈ ∂ℓ_i(x) be a particular subgradient selection.

With this notation in place, we now give our algorithm, which alternates between gradient ascent steps on p and subgradient descent steps on x. Roughly, we would like to alternate gradient ascent steps p_{t+1} ← p_t + α_p ℓ(x_t) for p and descent steps x_{t+1} ← x_t − α_x g_i(x_t) for x, where i is a random index drawn according to p_t. This procedure is inefficient, requiring time of order n · Cost(grad) in each iteration, so we use stochastic estimates of the loss vector ℓ(x_t) developed in the linear bandit literature [8] and variants of mirror descent to implement our algorithm.

Algorithm 1 Two-player Bandit Mirror Descent
1: Input: stepsizes α_x, α_p > 0; initialize x₁ ∈ X, p₁ = 1/n
2: for t = 1, 2, . . . , T do
3:   Sample I_t ∼ p_t, that is, set I_t = i with probability p_{t,i}
4:   Compute the estimated loss for i ∈ [n]: ℓ̂_{t,i}(x) = (ℓ_i(x)/p_{t,i}) 1{I_t = i}
5:   Update p: w_{t+1} ← ∇ψ*_p(∇ψ_p(p_t) + α_p ℓ̂_t(x_t)), p_{t+1} ← argmin_{p ∈ P_{ρ,n}} B_{ψ_p}(p, w_{t+1})
6:   Update x: y_{t+1} ← ∇ψ*_x(∇ψ_x(x_t) − α_x g_{I_t}(x_t)), x_{t+1} ← argmin_{x ∈ X} B_{ψ_x}(x, y_{t+1})
7: end for

We specialize this general algorithm to specific choices of the divergence f and the functions ψ_x and ψ_p presently, first briefly discussing the algorithm. Note that in Step 5, the updates for p depend only on a single index I_t ∈ {1, . . . , n} (the vector ℓ̂(x_t) is 1-sparse), which, as long as the updates for p are efficiently computable, can yield substantial performance benefits.

3 Regret bounds

With our algorithm described, we now describe its convergence properties, specializing later to specific families of f-divergences. We begin with the following result on pseudo-regret, which (with minor modifications) is known [23, 10, 26].
We provide a proof for completeness in Appendix A.1.

Lemma 1. Let the sequences x_t and p_t be generated by Algorithm 1, and define x̂_T := (1/T) Σ_{t=1}^T x_t and p̂_T := (1/T) Σ_{t=1}^T p_t. Then for the saddle point (x*, p*) we have

  T E[ p*^⊤ ℓ(x̂_T) − p̂_T^⊤ ℓ(x*) ] ≤ (1/α_x) B_{ψ_x}(x*, x₁) + (α_x/2) Σ_{t=1}^T E[ ||g_{I_t}(x_t)||²_{x,*} ] + Σ_{t=1}^T E[ ℓ̂_t(x_t)^⊤ (p* − p_t) ],

where the first two terms form T1 (the ERM regret), the last term is T2 (the robust regret), and the expectation is taken over the random draws I_t ∼ p_t. Moreover, E[ℓ̂_t(x_t)^⊤(p − p_t)] = E[ℓ(x_t)^⊤(p − p_t)] for any vector p.

In the lemma, T1 is the standard regret from applying mirror descent to the ERM problem. In particular, if B_{ψ_x}(x*, x₁) ≤ R² and ℓ_i(x) is L-Lipschitz, then choosing α_x = (R/L)√(2/T) yields T1 ≤ RL√(2T). Because it is (relatively) easy to bound the term T1, the remainder of our arguments focuses on bounding the second term T2, which is the regret incurred by the random sampling of the loss vector ℓ̂_t. This regret depends strongly on the distance-generating function ψ_p. To bound T2, we use the following bound on the pseudo-regret of p, which is standard [9, Chapter 11], [8, Thm 5.3]. For completeness we outline the proof in Appendix A.2.

Lemma 2. For any p ∈ P_{ρ,n}, Algorithm 1 satisfies

  Σ_{t=1}^T ℓ̂_t(x_t)^⊤ (p − p_t) ≤ (1/α_p) B_{ψ_p}(p, p₁) + (1/α_p) Σ_{t=1}^T B_{ψ*_p}( ∇ψ_p(p_t) + α_p ℓ̂_t(x_t), ∇ψ_p(p_t) ).    (6)

Lemma 2 shows that controlling the Bregman divergences B_{ψ_p} and B_{ψ*_p} suffices to bound T2 in the basic regret bound of Lemma 1.

Now, we narrow our focus slightly to a specialized, but broad, family of divergences for which we can give more explicit results. For k ∈ R, the Cressie-Read divergence [12] of order k is

  f_k(t) = (t^k − kt + k − 1) / (k(k − 1)),    (7)

where f_k(t) = +∞ for t < 0, and for k ∈ {0, 1} we define f_k by its limits as k → 
0 or 1 (we have\nf1(t) = t log t  t + 1 and f0(t) =  log t + t  1). Inspecting expression (6), we might hope that\ncareful choices of p could yield regret bounds that grow slowly with T and have small dependence\non the sample size n. Indeed, this is the case, as we show in the sequel: for each divergence fk, we\nmay carefully choose p to achieve small regret. To prove our bounds, however, it is crucial that\n\nk(k  1)\n\nthe importance sampling estimatorb`t has small variance, which in turn necessitates that pt,i is not\ntoo small. Generally, this means that in the update (Alg. 1, Line 5) to construct pt+1, we choose\n (p) to grow quickly as pi ! 0 (e.g. | @\n p(p)|! 1 ), but there is a tradeoff in that this may cause\nlarge Bregman divergence terms (6). In the coming sections, we explore this tradeoff for various k,\nproviding regret bounds for each of the Cressie-Read divergences (7).\nTo control the B \u21e4p terms in the bound (6), we use the curvature of p (dually, smoothness of \u21e4p)\n\nto show that B \u21e4p (u, v) \u21e1P(ui  vi)2. For this approximation to hold, we shift our loss functions\nbased on the f-divergence. When k  2, we assume that `(x) 2 [0, 1]n. If k < 2, we instead apply\nAlgorithm 1 with shifted losses `0(x) = `(x)  , so that `0(x) 2 [1, 0]n. We call the method with\n`0 Algorithm 1\u2019, noting thatb`t,i(xt) = `i(xt)1\n3.1 Power divergences when k 62 {0, 1}\nFor our \ufb01rst results, we prove a generic regret bound for Algorithm 1 when k 62 {0, 1} by taking the\nk(k1)Pn\ndistance-generating function p(p) = 1\ni , which is differentiable and strictly convex on\n+. 
Before proceeding further, we \ufb01rst note that for p 2P \u21e2,n and p1 = 1\nRn\n\n1{It = i} in this case.\n\nn , we have\n\ni=1 pk\n\npt,i\n\nB p(p, p1) = p(p)  p(p1)  r p(p1)>(p  p1)\n\n=\n\nnk\n\nk(k  1)\n\nnXi=1(npi)k  knpi + k  1 = nkDf (p|| /n) \uf8ff nk\u21e2\n\n(8)\n\n4\n\n\f+\n\n\u21b5p\n2\n\np1k\n\n(9)\n\nTXt=1\n\nTXt=1\n\nnk\u21e2\n\u21b5p\n\nt,i 35 .\n\nE[`(xt)>(p  pt)] =\n\nE24 Xi:pt,i>0\n\nE[b`t(xt)>(p  pt)] \uf8ff\n\nbounding the \ufb01rst term in expression (6). From Lemma 2, it remains to bound the Bregman divergence\nterms B \u21e4p . Using smoothness of \u21e4p in the positive orthant, we obtain the following bound.\nTheorem 1. Assume that `(x) 2 [0, 1]n. For any real-valued k  2 and any p 2P \u21e2,n, Algorithm 1\nsatis\ufb01es\nTXt=1\nFor k \uf8ff 2 with k 62 {0, 1}, an identical bound holds for Algorithm 1\u2019 with `0(x) = `(x)  .\nSee Appendix A.3 for the proof. We now use Theorem 1 to obtain concrete convergence guarantees for\nCressie-Read divergences with parameter k < 1, giving sublinear (in T ) regret bounds independent\nof n. In the corollary, whose proof we provide in Appendix A.4, we let Ck,\u21e2 = (1k)(1k\u21e2)\n, which\nis positive for k < 0.\nk,\u21e2 nkp2\u21e2/T Algorithm 1\u2019 with `0(x) = `(x)  2\nCorollary 1. For k 2 (1, 0) and \u21b5p = C\n[1, 0]n acheives the regret bound\nE[b`t(xt)>(p  pt)] \uf8ffq2C1k\nTXt=1\nTXt=1\nE[b`t(xt)>(p  pt)] \uf8ffp2\u21e2T .\n\nFor k 2 (0, 1) and \u21b5p = nkp2\u21e2/T , Algorithm 1\u2019 with `0(x) = `(x)  2 [1, 0]n acheives the\n\nIt is worth noting that despite the robusti\ufb01cation, the above regret is independent of n. In the special\ncase that k 2 (0, 1), Theorem 1 is the regret bound for the implicitly normalized forecaster of\nAudibert and Bubeck [1] (cf. 
[8, Ch 5.4]).\n\nE[`(xt)>(p  pt)] =\n\nE[`(xt)>(p  pt)] =\n\nregret bound\n\nTXt=1\n\nTXt=1\n\nk,\u21e2 \u21e2T .\n\nk1\n\n2\n\nk\n\n3.2 Regret bounds using the KL divergences (k = 1 and k = 0)\nThe choice f1(t) = t log t  t + 1 yields Df (P||Q) = Dkl (P||Q), and in this case, we take\n p(p) =Pn\ni=1 pi log pi, which means that Algorithm 1 performs entropic gradient ascent. To control\nthe divergence B \u21e4p , we use the rescaled losses `0(x) = `(x)  (as we have k < 2). Then we have\nthe following bound, whose proof we provide in Appendix A.5.\nTheorem 2. Algorithm 1\u2019 with loss `0(x) = `(x)  yields\n\nnT.\n\n(10)\n\nTXt=1\n\nE[`(xt)>(p  pt)] =\n\nTXt=1\nnq 2\u21e2\nT , we havePT\n\n\u21e2\nn\u21b5p\n\nE[b`t(xt)>(p  pt)] \uf8ff\nt=1 E[`(xt)>(p  pt)] \uf8ff p2\u21e2T .\n\n+\n\n\u21b5p\n2\n\nIn particular, when \u21b5p = 1\n\nUsing k = 0, so that f0(t) =  log t + t  1, we obtain Df (P||Q) = Dkl (Q||P ), which results\nin a robusti\ufb01cation technique identical to Owen\u2019s original empirical likelihood [24]. We again use\nthe rescaled losses `0(x) = `(x)  , but in this scenario we use the proximal function p(p) =\nPn\nTheorem 3. Algorithm 1\u2019 with loss `0(x) = `(x)  yields\n\ni=1 log pi in Algorithm 1\u2019. 
Then we have the following regret bound (see Appendix A.6).\n\nTXt=1\n\nTXt=1\nE[b`t(xt)>(p  pt)] \uf8ff\nE[`(xt)>(p  pt)] =\nt=1 E[`(xt)>(p  pt)] \uf8ff p2\u21e2T .\nT , we havePT\n\nIn particular, when \u21b5p =q 2\u21e2\nIn both of these cases, the expected pseudo-regret of our robust gradient procedure is independent of\nn and grows as pT , which is essentially identical to that achieved by pure online gradient methods.\n\n\u21b5p\n2\n\nT.\n\n\u21e2\n\u21b5p\n\n+\n\n5\n\n\f3.3 Power divergences (k > 1)\nCorollary 1 provides convergence guarantees for power divergences fk with k < 1, but says nothing\nabout the case that k > 1; the choice p(p) = 1\ni allows the individual probabilities\n\ni=1 pk\n\nproblem (1) by re-de\ufb01ning our robust empirical distributions set, taking\n\npt,i to be too small, which can cause excess variance ofb`. To remedy this, we regularize the robust\n\nP\u21e2,n, :=np 2 Rn\n\n+ | p \n\nf (npi) \uf8ff \u21e2o,\n\nk(k1)Pn\nnXi=1\n\n\nn\n\n,\n\nwhere we no longer constrain the weights p to satisfy >p = 1. Nonetheless, it is still possible to\nshow that the guarantees (2) and (3) hold with P\u21e2,n, replacing P\u21e2,n. Indeed, we may give bounds for\nthe pseudo-regret of the regularized problem with P\u21e2,n,, where we apply Algorithm 1 with a slightly\nmodi\ufb01ed sampling strategy, drawing indices i according to the normalized distribution pt/Pn\ni=1 pt,i\nand appropriately normalizing the loss estimate via\n\npt,i\n\n1{It = i} .\n\nb`t,i(xt) = nXi=1\n\npt,i! `i(xt)\nThis vector is still unbiased for `(xt). De\ufb01ne the constant Ck := max{t : fk(t) \uf8ff t}_ \u21e2\nC2 = 2 + p3). With our choice p(p) = 1\nk(k1)Pn\nresult, whose proof we provide in Appendix A.7.\nTheorem 4. For k 2 [2,1), any p 2P \u21e2,n,, Algorithm 1 with \u21b5p = nkp\u21e2k1/ (4C3\nTXt=1\nE[b`t(xt)>(p  pt)] \uf8ff 2Ckp\u21e2Ck1kT\n\nFor k 2 (1, 2), assume that `(x) 2 [1, 0]n. 
Then, Algorithm 1 gives identical bounds.\n4 Ef\ufb01cient updates when k = 2\n\nE[`(xt)>(p  pt)] =\n\nTXt=1\n\ni=1 pk\n\nn < 1 (so\ni and for > 0, we obtain the following\n\nkT ) yields\n\ndespite the sparsity ofb`(xt) (see Appendix B for concrete updates for each of our cases). In this\n\nThe previous section shows that Algorithm 1 with careful choice of p yields sublinear regret bounds.\nThe projection step pt+1 = argminp2P\u21e2,n, B p(p, wt+1), however, can still take time linear in n\nsection, we show how to compute the bandit mirror descent update in Alg. 1, line 5, in time O(log n)\ntime for f2(t) = 1\ni . Building off of Duchi et al. [14], we use\ncarefully designed balanced binary search trees (BSTs) to this end.\nThe Lagrangian for the update pt+1 = argminp2P\u21e2,n, B p(p, wt+1) (suppressing t) is\n\ni=1 p2\n\n2Pn\n2 (t  1)2 and p(p) = 1\nn2 \u21e2 \n\nL(p, , \u2713) = B p(p, w) \n\n\n\n\n\nn \u25c6\n\n+. The KKT conditions imply (1+)p = w+ \n\nnXi=1\n\nf2(npi)!  \u2713>\u2713p \nn \u25c6+\np() =\u2713 1\np2P\u21e2,n, L(p, , \u2713) = B p(p(), w)   \u21e2 \n\n1\nn \n\n\nn\n\ninf\n\n\n\n\n\nw +\n\n(11)\n1 + \n+ L(p, , \u2713). Substituting this into the Lagrangian, we obtain\n\n1 + \n\n+\n\n,\n\nwhere p() = argminp2P\u21e2,n, inf \u27132Rn\nthe concave dual objective\n\nn +\u2713, and strict complementarity\n\nwhere   0,\u2713 2 Rn\nyields\n\ng() := sup\n\n\u2713\n\nfk(npi())! 
.\n\nnXi=1\n\nWe can run a bisection search on the nondecreasing function g0() to \ufb01nd  such that g0() = 0.\nAfter algebraic manipulations, we have that\n\n@\n@\n\ng() = g1() Xi2I()\n\nw2\n\ni + g2() Xi2I()\n\nwi + g3()|I()| +\n\n(1  )2\n\n2n \n\n\u21e2\nn2 ,\n\n6\n\n\fwhere I() := {1 \uf8ff i \uf8ff n : wi  \n\nn + ( \n\ng1() =\n\n1\n\n(1 + )2 , g2() =\n\nn  1)} and (see expression (18) in Appendix B.4)\nn(1 + )2 , g3() =\n\nn2(1 + )2 \n\n(1  )2\n\n2\n\n1\n\n.\n\n2n\n\n\u270f ) time, it suf\ufb01ces to\nTo see that we can solve for \u21e4 that acheives |g0(\u21e4)|\uf8ff \u270f in O(log n + log 1\nevaluatePi2I() wq\ni for q = 0, 1, 2 in time O(log n). To this end, we store the w\u2019s in a balanced\nsearch tree (e.g., red-black tree) keyed on the weights up to a multiplicative and an additive constant.\nA key ingredient in our implementation is that the BST stores in each node the sum of the appropriate\npowers of values in the left and right subtree [14]. See Appendix C for detailed pseudocode for all\noperations required in Algorithm 1: each subroutine (sampling It \u21e0 pt, updating w, computing \u21e4,\nand updating p(\u21e4)) require time O(log n) using standard BST operations.\n\n5 Experiments\n\nIn this section, we present experimental results demonstrating the ef\ufb01ciency of our algorithm. We \ufb01rst\ncompare our method with existing algorithms for solving the robust problem (1) on a synthetic dataset,\nthen investigating the robust formulation on real datasets to show how the calibrated con\ufb01dence\nguarantees behave in practice, especially in comparison to the ERM. We experiment on natural high\ndimensional datasets as well as those with many training examples.\nOur implementation uses the ef\ufb01cient updates outlined in Section 4. Throughout our experiments,\nwe use the best tuned step sizes for all methods. 
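The heart of the Section 4 data structure is answering queries Σ_{i ∈ I(λ)} w_i^q for q = 0, 1, 2, where I(λ) keeps only indices whose weight exceeds a threshold, in O(log n) time. The paper does this with a balanced BST storing subtree sums of powers of the keys, which also supports O(log n) insertions; as a simplified static sketch (our own, queries only, hypothetical names), a sorted array with suffix power sums achieves the same query cost:

```python
import bisect

class PowerSums:
    """Static sketch: for fixed weights w, answer
    (|I|, sum_{i in I} w_i, sum_{i in I} w_i^2) with I = {i : w_i >= thresh}
    in O(log n) per query via binary search plus precomputed suffix sums."""

    def __init__(self, weights):
        self.w = sorted(weights)
        n = len(self.w)
        self.s1 = [0.0] * (n + 1)   # suffix sums of w_i
        self.s2 = [0.0] * (n + 1)   # suffix sums of w_i^2
        for i in range(n - 1, -1, -1):
            self.s1[i] = self.s1[i + 1] + self.w[i]
            self.s2[i] = self.s2[i + 1] + self.w[i] ** 2

    def query(self, thresh):
        i = bisect.bisect_left(self.w, thresh)   # first index with w_i >= thresh
        return len(self.w) - i, self.s1[i], self.s2[i]

ps = PowerSums([3.0, 1.0, 2.0, 5.0])
count, s1, s2 = ps.query(2.0)   # weights >= 2 are {3, 2, 5}
```

The dynamic version simply replaces the sorted array with a balanced tree whose nodes cache the same power sums over their subtrees [14].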
For the first two experiments, we set ρ = χ²_{1,0.9}, the 0.9 quantile of the χ²₁ distribution, so that the resulting robust objective (1) gives a calibrated 95% upper confidence bound on the optimal population risk. For our last experiment, the asymptotic regime (3) fails to hold due to the high-dimensional nature of the problem, so we choose ρ = 50 (somewhat arbitrarily, but other ρ give similar behavior). We take X = {x ∈ R^d : ||x||₂ ≤ R} for our experiments.

For the experiment with synthetic data, we compare our algorithm against two benchmark methods for solving the robust problem (1). The first is an interior point method for the dual reformulation (5) using the Gurobi solver [17]. The second is gradient descent, viewing the robust formulation (1) as a minimization problem with the objective x ↦ sup_{p ∈ P_{ρ,n,δ}} p^⊤ ℓ(x). To efficiently compute the gradient, we bisect over the dual form (5) with respect to λ ≥ 0 and η. We use the best step sizes for both our proposed bandit-based algorithm and gradient descent.

To generate the data, we choose a true classifier x* ∈ R^d and sample the feature vectors a_i iid∼ N(0, I) for i ∈ [n]. We set the labels to be b_i = sign(a_i^⊤ x*) and flip each with probability 10%. We use the hinge loss ℓ_i(x) = [1 − b_i a_i^⊤ x]_+ with n = 2000, d = 500, and R = 10 in our experiment. In Figure 1a, we plot the log optimality ratio (the log of the current objective value over the optimal value) against runtime for the three algorithms. While the interior point method (IPM) obtains accurate solutions, it scales relatively poorly in n and d (the initial flat region in the plot is due to pre-computation for factorization within the solver). Gradient descent performs quite well in this moderately sized example, although each of its iterations takes time Ω(n).

We also perform experiments on two datasets with larger n: the Adult dataset [22] and the Reuters RCV1 Corpus [21].
The Adult dataset has n = 32,561 training and 16,281 test examples with 123-dimensional features. We use the binary logistic loss ℓ_i(x) = log(1 + exp(−b_i a_i^⊤ x)) to classify whether the income level is greater than $50K. For the Reuters RCV1 Corpus, our task is to classify whether a document belongs to the Corporate category. With d = 47,236 features, we randomly split the 804,410 examples into 723,969 training (90% of the data) and 80,441 (10% of the data) test examples. We use the hinge loss and solve the binary classification problem for the document type. To test the efficiency of our method in large-scale settings, we plot the log ratio log(R_n(x)/R_n(x*)), where R_n(x) = sup_{p ∈ P_{ρ,n,δ}} p^⊤ ℓ(x), versus CPU time for our algorithm and gradient descent in Figure 1b. As is somewhat typical of stochastic gradient-based methods, our bandit-based optimization algorithm quickly obtains a solution with a small optimality gap (about 2% relative error), while the gradient descent method eventually achieves better loss.

In Figures 2a–2d, we plot the loss value and the classification error compared with applying pure stochastic gradient descent to the standard empirical loss, plotting the confidence bound for the robust method as well. As the theory suggests [15, 13], the robust objective provides upper confidence bounds on the true risk (approximated by the average loss on the test sample).

(a) Synthetic Data (n = 2000, d = 500)  (b) Reuters Corpus (n = 7.2 · 10^5, d ≈ 5 · 10^4)
Figure 1: Comparison of Solvers

(a) Adult: Logistic Loss  (b) Adult: Classification Error  (c) Reuters: Hinge Loss  (d) Reuters: Classification Error
Figure 2: Comparison with ERM

Acknowledgments

JCD and HN were partially supported by the SAIL-Toyota Center for AI Research and the National Science Foundation award NSF-CAREER-1553086. HN was also partially supported by a Samsung Fellowship.

References

[1] J.-Y.
Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, pages 2635–2686, 2010.
[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[3] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[4] A. Ben-Tal, E. Hazan, T. Koren, and S. Mannor. Oracle-based robust optimization via online learning. Operations Research, 63(3):628–638, 2015.
[5] J. Borwein, A. J. Guirao, P. Hájek, and J. Vanderwerff. Uniformly convex functions on Banach spaces. Proceedings of the American Mathematical Society, 137(3):1081–1091, 2009.
[6] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[10] K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine learning. Journal of the Association for Computing Machinery, 59(5), 2012.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
[12] N. Cressie and T. R. Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B (Methodological), pages 440–464, 1984.
[13] J. C. Duchi and H. Namkoong. Variance-based regularization with convex objectives. arXiv:1610.02581 [stat.ML], 2016. URL https://arxiv.org/abs/1610.02581.
[14] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ₁-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[15] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.03425 [stat.ML], 2016. URL https://arxiv.org/abs/1610.03425.
[16] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[17] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015. URL http://www.gurobi.com.
[18] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.
[19] E. Hazan and S. Kale. An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the Twenty-Fourth Annual Conference on Computational Learning Theory, 2011.
[20] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
[21] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[22] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[24] A. B. Owen. Empirical Likelihood. CRC Press, 2001.
[25] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[26] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why? In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[27] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
", "award": [], "sourceid": 1144, "authors": [{"given_name": "Hongseok", "family_name": "Namkoong", "institution": "Stanford University"}, {"given_name": "John", "family_name": "Duchi", "institution": "Stanford"}]}