{"title": "Online Convex Optimization with Stochastic Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 1428, "page_last": 1438, "abstract": "This paper considers online convex optimization  (OCO) with stochastic constraints, which generalizes Zinkevich's OCO over a  known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. It also includes many important problems  as special case, such as OCO with long term constraints, stochastic constrained convex optimization, and deterministic constrained convex optimization.  To solve this problem, this paper proposes a new algorithm that achieves $O(\\sqrt{T})$ expected regret and constraint violations and $O(\\sqrt{T}\\log(T))$ high probability regret and constraint violations. Experiments on a real-world data center scheduling problem further verify the performance of the new algorithm.", "full_text": "Online Convex Optimization with Stochastic Constraints\n\nHao Yu, Michael J. Neely, Xiaohan Wei\n\nDepartment of Electrical Engineering, University of Southern California*\n\n{yuhao,mjneely,xiaohanw}@usc.edu\n\nAbstract\n\nThis paper considers online convex optimization (OCO) with stochastic constraints, which generalizes Zinkevich\u2019s OCO over a known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. 
It also includes many important problems as special cases, such as OCO with long term constraints, stochastic constrained convex optimization, and deterministic constrained convex optimization. To solve this problem, this paper proposes a new algorithm that achieves $O(\\sqrt{T})$ expected regret and constraint violations and $O(\\sqrt{T}\\log(T))$ high probability regret and constraint violations. Experiments on a real-world data center scheduling problem further verify the performance of the new algorithm.\n\n1 Introduction\n\nOnline convex optimization (OCO) is a multi-round learning process with arbitrarily-varying convex loss functions where the decision maker has to choose decision $x(t) \\in \\mathcal{X}$ before observing the corresponding loss function $f^t(\\cdot)$. For a fixed time horizon $T$, define the regret of a learning algorithm with respect to the best fixed decision in hindsight (with full knowledge of all loss functions) as\n\n$$\\text{regret}(T) = \\sum_{t=1}^{T} f^t(x(t)) - \\min_{x \\in \\mathcal{X}} \\sum_{t=1}^{T} f^t(x).$$\n\nThe goal of OCO is to develop dynamic learning algorithms such that regret grows sub-linearly with respect to $T$. The setting of OCO is introduced in a series of work [3, 14, 9, 29] and is formalized in [29]. OCO has gained a considerable amount of research interest recently with various applications such as online regression, prediction with expert advice, online ranking, online shortest paths, and portfolio selection. See [23, 11] for more applications and background.\n\nIn [29], Zinkevich shows $O(\\sqrt{T})$ regret can be achieved by using an online gradient descent (OGD) update given by\n\n$$x(t+1) = \\mathcal{P}_{\\mathcal{X}}\\big[x(t) - \\gamma \\nabla f^t(x(t))\\big] \\qquad (1)$$\n\nwhere $\\nabla f^t(\\cdot)$ is a subgradient of $f^t(\\cdot)$ and $\\mathcal{P}_{\\mathcal{X}}[\\cdot]$ is the projection onto set $\\mathcal{X}$. Hazan et al. 
in [12] show that better regret is possible under the assumption that each loss function is strongly convex but $O(\\sqrt{T})$ is the best possible if no additional assumption is imposed.\n\nIt is obvious that Zinkevich\u2019s OGD in (1) requires the full knowledge of set $\\mathcal{X}$ and low complexity of the projection $\\mathcal{P}_{\\mathcal{X}}[\\cdot]$. However, in practice, the constraint set $\\mathcal{X}$, which is often described by many functional inequality constraints, can be time varying and may not be fully disclosed to the decision maker. In [18], Mannor et al. extend OCO by considering time-varying constraint functions $g^t(x)$ which can arbitrarily vary and are only disclosed to us after each $x(t)$ is chosen. In this setting, Mannor et al. in [18] explore the possibility of designing learning algorithms such that regret grows sub-linearly and $\\limsup_{T \\to \\infty} \\frac{1}{T} \\sum_{t=1}^{T} g^t(x(t)) \\le 0$, i.e., the (cumulative) constraint violation $\\sum_{t=1}^{T} g^t(x(t))$ also grows sub-linearly. Unfortunately, Mannor et al. in [18] prove that this is impossible even when both $f^t(\\cdot)$ and $g^t(\\cdot)$ are simple linear functions.\n\n*This work is supported in part by grant NSF CCF-1718477.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nGiven the impossibility results shown by Mannor et al. in [18], this paper considers OCO where constraint functions $g^t(x)$ are not arbitrarily varying but independently and identically distributed (i.i.d.) generated from an unknown probability model (and functions $f^t(x)$ are still arbitrarily varying and possibly non-i.i.d.). Specifically, this paper considers online convex optimization (OCO) with stochastic constraint $\\mathcal{X} = \\{x \\in \\mathcal{X}_0 : \\mathbb{E}_\\omega[g_k(x; \\omega)] \\le 0, k \\in \\{1, 2, \\ldots, m\\}\\}$ where $\\mathcal{X}_0$ is a known fixed set; the expressions of stochastic constraints $\\mathbb{E}_\\omega[g_k(x; \\omega)]$ (involving expectations with respect to $\\omega$ from an unknown distribution) are unknown; and subscripts k \u2208 {1, 2, . . . 
, m} indicate the possibility of multiple functional constraints. In OCO with stochastic constraints, the decision maker receives loss function $f^t(x)$ and i.i.d. constraint function realizations $g_k^t(x) = g_k(x; \\omega(t))$ at each round $t$. However, the expressions of $g_k^t(\\cdot)$ and $f^t(\\cdot)$ are disclosed to the decision maker only after decision $x(t) \\in \\mathcal{X}_0$ is chosen. This setting arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. For example, if we consider online routing (with link capacity constraints) in wireless networks [18], each link capacity is not a fixed constant (as in wireline networks) but an i.i.d. random variable since wireless channels are stochastically time-varying by nature [25]. OCO with stochastic constraints also covers important special cases such as OCO with long term constraints [16, 5, 13], stochastic constrained convex optimization [17] and deterministic constrained convex optimization [21].\n\nLet $x^* = \\operatorname{argmin}_{\\{x \\in \\mathcal{X}_0 : \\mathbb{E}[g_k(x; \\omega)] \\le 0, \\forall k \\in \\{1, 2, \\ldots, m\\}\\}} \\sum_{t=1}^{T} f^t(x)$ be the best fixed decision in hindsight (knowing all loss functions $f^t(x)$ and the distribution of stochastic constraint functions $g_k(x; \\omega)$). Thus, $x^*$ minimizes the $T$-round cumulative loss and satisfies all stochastic constraints in expectation, which also implies $\\limsup_{T \\to \\infty} \\frac{1}{T} \\sum_{t=1}^{T} g_k^t(x^*) \\le 0$ almost surely by the strong law of large numbers. Our goal is to develop dynamic learning algorithms that guarantee both regret $\\sum_{t=1}^{T} f^t(x(t)) - \\sum_{t=1}^{T} f^t(x^*)$ and constraint violations $\\sum_{t=1}^{T} g_k^t(x(t))$ grow sub-linearly.\n\nNote that Zinkevich\u2019s algorithm in (1) is not applicable to OCO with stochastic constraints since $\\mathcal{X}$ is unknown and it can happen that $\\mathcal{X}(t) = \\{x \\in \\mathcal{X}_0 : g_k(x; \\omega(t)) \\le 0, \\forall k \\in \\{1, 2, \\ldots, m\\}\\} = \\emptyset$ for certain realizations $\\omega(t)$, such that projections $\\mathcal{P}_{\\mathcal{X}}[\\cdot]$ or $\\mathcal{P}_{\\mathcal{X}(t)}[\\cdot]$ required in (1) are not even well-defined.\n\nOur Contributions: This paper solves online convex optimization with stochastic constraints. In particular, we propose a new learning algorithm that is proven to achieve $O(\\sqrt{T})$ expected regret and constraint violations and $O(\\sqrt{T} \\log(T))$ high probability regret and constraint violations. The proposed new algorithm also improves upon state-of-the-art results in the following special cases:\n\n\u2022 OCO with long term constraints: This is a special case where each $g_k^t(x) \\equiv g_k(x)$ is known and does not depend on time. Note that $\\mathcal{X} = \\{x \\in \\mathcal{X}_0 : g_k(x) \\le 0, \\forall k \\in \\{1, 2, \\ldots, m\\}\\}$ can be complicated while $\\mathcal{X}_0$ might be a simple hypercube. To avoid high complexity involved in the projection onto $\\mathcal{X}$ as in Zinkevich\u2019s algorithm, work in [16, 5, 13] develops low complexity algorithms that use projections onto a simpler set $\\mathcal{X}_0$ by allowing $g_k(x(t)) > 0$ for certain rounds but ensuring $\\limsup_{T \\to \\infty} \\frac{1}{T} \\sum_{t=1}^{T} g_k(x(t)) \\le 0$. The best existing performance is $O(T^{\\max\\{\\beta, 1-\\beta\\}})$ regret and $O(T^{1-\\beta/2})$ constraint violations where $\\beta \\in (0, 1)$ is an algorithm parameter [13]. This gives $O(\\sqrt{T})$ regret with worse $O(T^{3/4})$ constraint violations or $O(\\sqrt{T})$ constraint violations with worse $O(T)$ regret. In contrast, our algorithm, which only uses projections onto $\\mathcal{X}_0$ as shown in Lemma 1, can achieve $O(\\sqrt{T})$ regret and $O(\\sqrt{T})$ constraint violations simultaneously. Note that by adapting the methodology presented in this paper, our other work [27] developed a different algorithm that can only solve the special case problem \u201cOCO with long term constraints\u201d but can achieve $O(\\sqrt{T})$ regret and $O(1)$ constraint violations.\n\n\u2022 Stochastic constrained convex optimization: This is a special case where each $f^t(x)$ is i.i.d. generated from an unknown distribution. 
This problem has many applications in operations research and machine learning such as Neyman-Pearson classification and risk-mean portfolio. The work [17] develops a (batch) offline algorithm that produces a solution with high probability performance guarantees only after sampling the problems sufficiently many times. That is, during the process of sampling, there is no performance guarantee. The work [15] proposes a stochastic approximation based (batch) offline algorithm for stochastic convex optimization with one single stochastic functional inequality constraint. In contrast, our algorithm is an online algorithm with online performance guarantees and can deal with an arbitrary number of stochastic constraints.\n\n\u2022 Deterministic constrained convex optimization: This is a special case where each $f^t(x) \\equiv f(x)$ and $g_k^t(x) \\equiv g_k(x)$ are known and do not depend on time. In this case, the goal is to develop a fast algorithm that converges to a good solution (with a small error) within a small number of iterations; and our algorithm with $O(\\sqrt{T})$ regret and constraint violations is equivalent to an iterative numerical algorithm with $O(1/\\sqrt{T})$ convergence rate. Our algorithm is subgradient based and does not require the smoothness or differentiability of the convex program. The primal-dual subgradient method considered in [19] has the same $O(1/\\sqrt{T})$ convergence rate but requires an upper bound of optimal Lagrange multipliers, which is usually unknown in practice.\n\n2 Formulation and New Algorithm\n\nLet $\\mathcal{X}_0$ be a known fixed compact convex set. Let $f^t(x)$ be a sequence of arbitrarily-varying convex functions. Let $g_k(x; \\omega(t)), k \\in \\{1, 2, \\ldots, m\\}$ be sequences of functions that are i.i.d. realizations of stochastic constraint functions $\\tilde{g}_k(x) = \\mathbb{E}_\\omega[g_k(x; \\omega)]$ with random variable $\\omega \\in \\Omega$ from an unknown distribution. That is, $\\omega(t)$ are i.i.d. samples of $\\omega$. 
Assume that each $f^t(\\cdot)$ is independent of all $\\omega(\\tau)$ with $\\tau \\ge t + 1$ so that we are unable to predict future constraint functions based on the knowledge of the current loss function. For each $\\omega \\in \\Omega$, we assume $g_k(x; \\omega)$ are convex with respect to $x \\in \\mathcal{X}_0$. At the beginning of each round $t$, neither the loss function $f^t(x)$ nor the constraint function realizations $g_k(x; \\omega(t))$ are known to the decision maker. However, the decision maker still needs to make a decision $x(t) \\in \\mathcal{X}_0$ for round $t$; and after that $f^t(x)$ and $g_k(x; \\omega(t))$ are disclosed to the decision maker at the end of round $t$.\n\nFor convenience, we often suppress the dependence of each $g_k(x; \\omega(t))$ on $\\omega(t)$ and write $g_k^t(x) = g_k(x; \\omega(t))$. Recall $\\tilde{g}_k(x) = \\mathbb{E}_\\omega[g_k(x; \\omega)]$ where the expectation is with respect to $\\omega$. Define $\\mathcal{X} = \\{x \\in \\mathcal{X}_0 : \\tilde{g}_k(x) = \\mathbb{E}[g_k(x; \\omega)] \\le 0, \\forall k \\in \\{1, 2, \\ldots, m\\}\\}$. We further define the stacked vector of multiple functions $g_1^t(x), \\ldots, g_m^t(x)$ as $g^t(x) = [g_1^t(x), \\ldots, g_m^t(x)]^{\\mathsf{T}}$ and define $\\tilde{g}(x) = [\\mathbb{E}_\\omega[g_1(x; \\omega)], \\ldots, \\mathbb{E}_\\omega[g_m(x; \\omega)]]^{\\mathsf{T}}$. We use $\\|\\cdot\\|$ to denote the Euclidean norm for a vector. Throughout this paper, we have the following assumptions:\n\nAssumption 1 (Basic Assumptions).\n\n\u2022 Loss functions $f^t(x)$ and constraint functions $g_k(x; \\omega)$ have bounded subgradients on $\\mathcal{X}_0$. That is, there exist $D_1 > 0$ and $D_2 > 0$ such that $\\|\\nabla f^t(x)\\| \\le D_1$ for all $x \\in \\mathcal{X}_0$ and all $t \\in \\{0, 1, \\ldots\\}$ and $\\|\\nabla g_k(x; \\omega)\\| \\le D_2$ for all $x \\in \\mathcal{X}_0$, all $\\omega \\in \\Omega$ and all $k \\in \\{1, 2, \\ldots, m\\}$.$^2$\n\u2022 There exists constant $G > 0$ such that $\\|g(x; \\omega)\\| \\le G$ for all $x \\in \\mathcal{X}_0$ and all $\\omega \\in \\Omega$.\n\u2022 There exists constant $R > 0$ such that $\\|x - y\\| \\le R$ for all $x, y \\in \\mathcal{X}_0$.\n\nAssumption 2 (The Slater Condition). There exist $\\epsilon > 0$ and $\\hat{x} \\in \\mathcal{X}_0$ such that $\\tilde{g}_k(\\hat{x}) = \\mathbb{E}_\\omega[g_k(\\hat{x}; \\omega)] \\le -\\epsilon$ for all $k \\in \\{1, 2, \\ldots, m\\}$.\n\n2.1 New Algorithm\n\nNow consider the following algorithm described in Algorithm 1. 
This algorithm chooses $x(t+1)$ as the decision for round $t+1$ based on $f^t(\\cdot)$ and $g^t(\\cdot)$ without requiring $f^{t+1}(\\cdot)$ or $g^{t+1}(\\cdot)$.\n\nAlgorithm 1\n\nLet $V > 0$ and $\\alpha > 0$ be constant algorithm parameters. Choose $x(1) \\in \\mathcal{X}_0$ arbitrarily and let $Q_k(1) = 0, \\forall k \\in \\{1, 2, \\ldots, m\\}$. At the end of each round $t \\in \\{1, 2, \\ldots\\}$, observe $f^t(\\cdot)$ and $g^t(\\cdot)$ and do the following:\n\n\u2022 Choose $x(t+1)$ that solves\n\n$$\\min_{x \\in \\mathcal{X}_0} \\Big\\{ V [\\nabla f^t(x(t))]^{\\mathsf{T}} [x - x(t)] + \\sum_{k=1}^{m} Q_k(t) [\\nabla g_k^t(x(t))]^{\\mathsf{T}} [x - x(t)] + \\alpha \\|x - x(t)\\|^2 \\Big\\} \\qquad (2)$$\n\nas the decision for the next round $t+1$, where $\\nabla f^t(x(t))$ is a subgradient of $f^t(x)$ at point $x = x(t)$ and $\\nabla g_k^t(x(t))$ is a subgradient of $g_k^t(x)$ at point $x = x(t)$.\n\n\u2022 Update each virtual queue $Q_k(t+1), \\forall k \\in \\{1, 2, \\ldots, m\\}$ via\n\n$$Q_k(t+1) = \\max\\big\\{ Q_k(t) + g_k^t(x(t)) + [\\nabla g_k^t(x(t))]^{\\mathsf{T}} [x(t+1) - x(t)], 0 \\big\\}, \\qquad (3)$$\n\nwhere $\\max\\{\\cdot, \\cdot\\}$ takes the larger one between two elements.\n\nFor each stochastic constraint function $g_k(x; \\omega)$, we introduce $Q_k(t)$ and call it a virtual queue since its dynamic is similar to a queue dynamic. The next lemma summarizes that $x(t+1)$ update in (2) can be implemented via a simple projection onto $\\mathcal{X}_0$.\n\nLemma 1. The $x(t+1)$ update in (2) is given by $x(t+1) = \\mathcal{P}_{\\mathcal{X}_0}\\big[x(t) - \\frac{1}{2\\alpha} d(t)\\big]$, where $d(t) = V \\nabla f^t(x(t)) + \\sum_{k=1}^{m} Q_k(t) \\nabla g_k^t(x(t))$ and $\\mathcal{P}_{\\mathcal{X}_0}[\\cdot]$ is the projection onto convex set $\\mathcal{X}_0$.\n\nProof. The projection by definition is $\\min_{x \\in \\mathcal{X}_0} \\|x - [x(t) - \\frac{1}{2\\alpha} d(t)]\\|^2$ and is equivalent to (2).\n\n$^2$ The notation $\\nabla h(x)$ is used to denote a subgradient of a convex function $h$ at the point $x$; it is the same as the gradient whenever the gradient exists.\n\n2.2 Intuitions of Algorithm 1\n\nNote that if there are no stochastic constraints $g_k^t(x)$, i.e., $\\mathcal{X} = \\mathcal{X}_0$, then Algorithm 1 has $Q_k(t) \\equiv 0, \\forall t$ and becomes Zinkevich\u2019s algorithm with $\\gamma = \\frac{V}{2\\alpha}$ in (1) since\n\n$$x(t+1) \\overset{(a)}{=} \\operatorname{argmin}_{x \\in \\mathcal{X}_0} \\underbrace{V [\\nabla f^t(x(t))]^{\\mathsf{T}} [x - x(t)] + \\alpha \\|x - x(t)\\|^2}_{\\text{penalty}} \\overset{(b)}{=} \\mathcal{P}_{\\mathcal{X}_0}\\big[x(t) - \\frac{V}{2\\alpha} \\nabla f^t(x(t))\\big] \\qquad (4)$$\n\nwhere (a) follows from (2); and (b) follows from Lemma 1 by noting that $d(t) = V \\nabla f^t(x(t))$. Call the term marked by an underbrace in (4) the penalty. Thus, Zinkevich\u2019s algorithm is to minimize the penalty term and is a special case of Algorithm 1 used to solve OCO over $\\mathcal{X}_0$.\n\nLet $Q(t) = [Q_1(t), \\ldots, Q_m(t)]^{\\mathsf{T}}$ be the vector of virtual queue backlogs. Let $L(t) = \\frac{1}{2} \\|Q(t)\\|^2$ be a Lyapunov function and define Lyapunov drift\n\n$$\\Delta(t) = L(t+1) - L(t) = \\frac{1}{2} [\\|Q(t+1)\\|^2 - \\|Q(t)\\|^2]. \\qquad (5)$$\n\nThe intuition behind Algorithm 1 is to choose $x(t+1)$ to minimize an upper bound of the expression\n\n$$\\underbrace{\\Delta(t)}_{\\text{drift}} + \\underbrace{V [\\nabla f^t(x(t))]^{\\mathsf{T}} [x - x(t)] + \\alpha \\|x - x(t)\\|^2}_{\\text{penalty}} \\qquad (6)$$\n\nThe intention to minimize penalty is natural since Zinkevich\u2019s algorithm (for OCO without stochastic constraints) minimizes penalty, while the intention to minimize drift is motivated by observing that $g_k^t(x(t))$ is accumulated into queue $Q_k(t+1)$ introduced in (3) such that we intend to have small queue backlogs. The drift $\\Delta(t)$ can be complicated and is in general non-convex. The next lemma (proven in Supplement 7.1) provides a simple upper bound on $\\Delta(t)$ and follows directly from (3).\n\nLemma 2. At each round t \u2208 {1, 2, . . 
.}, Algorithm 1 guarantees\n\n$$\\Delta(t) \\le \\sum_{k=1}^{m} Q_k(t) \\big[ g_k^t(x(t)) + [\\nabla g_k^t(x(t))]^{\\mathsf{T}} [x(t+1) - x(t)] \\big] + \\frac{1}{2} [G + \\sqrt{m} D_2 R]^2, \\qquad (7)$$\n\nwhere $m$ is the number of constraint functions; and $D_2$, $G$ and $R$ are defined in Assumption 1.\n\nAt the end of round $t$, $\\sum_{k=1}^{m} Q_k(t) g_k^t(x(t)) + \\frac{1}{2} [G + \\sqrt{m} D_2 R]^2$ is a given constant that is not affected by decision $x(t+1)$. The algorithm decision in (2) is now transparent: $x(t+1)$ is chosen to minimize the drift-plus-penalty expression (6), where $\\Delta(t)$ is approximated by the bound in (7).\n\n2.3 Preliminary Analysis and More Intuitions of Algorithm 1\n\nThe next lemma (proven in Supplement 7.2) relates constraint violations and virtual queue values and follows directly from (3).\n\nLemma 3. For any $T \\ge 1$, Algorithm 1 guarantees $\\sum_{t=1}^{T} g_k^t(x(t)) \\le \\|Q(T+1)\\| + D_2 \\sum_{t=1}^{T} \\|x(t+1) - x(t)\\|, \\forall k \\in \\{1, 2, \\ldots, m\\}$, where $D_2$ is defined in Assumption 1.\n\nRecall that function $h : \\mathcal{X}_0 \\to \\mathbb{R}$ is said to be $c$-strongly convex if $h(x) - \\frac{c}{2} \\|x\\|^2$ is convex over $x \\in \\mathcal{X}_0$. It is easy to see that if $q : \\mathcal{X}_0 \\to \\mathbb{R}$ is a convex function, then for any constant $c > 0$ and any vector $b$, the function $q(x) + \\frac{c}{2} \\|x - b\\|^2$ is $c$-strongly convex. Further, it is known that if $h : \\mathcal{X}_0 \\to \\mathbb{R}$ is a $c$-strongly convex function that is minimized at a point $x^{\\min} \\in \\mathcal{X}_0$, then (see, for example, Corollary 1 in [28]):\n\n$$h(x^{\\min}) \\le h(x) - \\frac{c}{2} \\|x - x^{\\min}\\|^2, \\quad \\forall x \\in \\mathcal{X}_0 \\qquad (8)$$\n\nNote that the expression involved in minimization (2) in Algorithm 1 is strongly convex with modulus $2\\alpha$ and $x(t+1)$ is chosen to minimize it. Thus, the next lemma follows.\n\nLemma 4. Let $z \\in \\mathcal{X}_0$ be arbitrary. 
For all $t \\ge 1$, Algorithm 1 guarantees\n\n$$V [\\nabla f^t(x(t))]^{\\mathsf{T}} [x(t+1) - x(t)] + \\sum_{k=1}^{m} Q_k(t) [\\nabla g_k^t(x(t))]^{\\mathsf{T}} [x(t+1) - x(t)] + \\alpha \\|x(t+1) - x(t)\\|^2$$\n\n$$\\le V [\\nabla f^t(x(t))]^{\\mathsf{T}} [z - x(t)] + \\sum_{k=1}^{m} Q_k(t) [\\nabla g_k^t(x(t))]^{\\mathsf{T}} [z - x(t)] + \\alpha \\|z - x(t)\\|^2 - \\alpha \\|z - x(t+1)\\|^2.$$\n\nThe next corollary follows by taking $z = x(t)$ in Lemma 4 and is proven in Supplement 7.3.\n\nCorollary 1. For all $t \\ge 1$, Algorithm 1 guarantees $\\|x(t+1) - x(t)\\| \\le \\frac{V D_1}{2\\alpha} + \\frac{\\sqrt{m} D_2}{2\\alpha} \\|Q(t)\\|$.\n\nThe next corollary follows directly from Lemma 3 and Corollary 1 and shows that constraint violations are ultimately bounded by sequence $\\|Q(t)\\|, t \\in \\{1, 2, \\ldots, T+1\\}$.\n\nCorollary 2. For any $T \\ge 1$, Algorithm 1 guarantees $\\sum_{t=1}^{T} g_k^t(x(t)) \\le \\|Q(T+1)\\| + \\frac{V T D_1 D_2}{2\\alpha} + \\frac{\\sqrt{m} D_2^2}{2\\alpha} \\sum_{t=1}^{T} \\|Q(t)\\|, \\forall k \\in \\{1, 2, \\ldots, m\\}$ where $D_1$ and $D_2$ are defined in Assumption 1.\n\nThis corollary further justifies why Algorithm 1 intends to minimize drift $\\Delta(t)$. As illustrated in the next section, controlled drift can often lead to boundedness of a stochastic process. Thus, the intuition of minimizing drift $\\Delta(t)$ is to yield small $\\|Q(t)\\|$ bounds.\n\n3 Expected Performance Analysis of Algorithm 1\n\nThis section shows that if we choose $V = \\sqrt{T}$ and $\\alpha = T$ in Algorithm 1, then both expected regret and expected constraint violations are $O(\\sqrt{T})$.\n\n3.1 A Drift Lemma for Stochastic Processes\n\nLet $\\{Z(t), t \\ge 0\\}$ be a discrete time stochastic process adapted$^3$ to a filtration $\\{\\mathcal{F}(t), t \\ge 0\\}$. For example, $Z(t)$ can be a random walk, a Markov chain or a martingale. The drift analysis is the method of deducing properties, e.g., recurrence, ergodicity, or boundedness, about $Z(t)$ from its drift $\\mathbb{E}[Z(t+1) - Z(t) | \\mathcal{F}(t)]$. See [6, 10] for more discussions or applications on drift analysis. This paper proposes a new drift analysis lemma for stochastic processes as follows:\n\nLemma 5. 
Let $\\{Z(t), t \\ge 0\\}$ be a discrete time stochastic process adapted to a filtration $\\{\\mathcal{F}(t), t \\ge 0\\}$ with $Z(0) = 0$ and $\\mathcal{F}(0) = \\{\\emptyset, \\Omega\\}$. Suppose there exist an integer $t_0 > 0$, real constants $\\theta > 0$, $\\delta_{\\max} > 0$ and $0 < \\zeta \\le \\delta_{\\max}$ such that\n\n$$|Z(t+1) - Z(t)| \\le \\delta_{\\max}, \\qquad (9)$$\n\n$$\\mathbb{E}[Z(t+t_0) - Z(t) | \\mathcal{F}(t)] \\le \\begin{cases} t_0 \\delta_{\\max}, & \\text{if } Z(t) < \\theta \\\\ -t_0 \\zeta, & \\text{if } Z(t) \\ge \\theta \\end{cases} \\qquad (10)$$\n\nhold for all $t \\in \\{1, 2, \\ldots\\}$. Then, the following holds:\n\n1. $\\mathbb{E}[Z(t)] \\le \\theta + t_0 \\delta_{\\max} + t_0 \\frac{4 \\delta_{\\max}^2}{\\zeta} \\log\\big[\\frac{8 \\delta_{\\max}^2}{\\zeta^2}\\big], \\forall t \\in \\{1, 2, \\ldots\\}$.\n2. For any constant $0 < \\mu < 1$, we have $\\Pr(Z(t) \\ge z) \\le \\mu, \\forall t \\in \\{1, 2, \\ldots\\}$ where $z = \\theta + t_0 \\delta_{\\max} + t_0 \\frac{4 \\delta_{\\max}^2}{\\zeta} \\log\\big[\\frac{8 \\delta_{\\max}^2}{\\zeta^2}\\big] + t_0 \\frac{4 \\delta_{\\max}^2}{\\zeta} \\log(\\frac{1}{\\mu})$.\n\n$^3$ Random variable $Y$ is said to be adapted to $\\sigma$-algebra $\\mathcal{F}$ if $Y$ is $\\mathcal{F}$-measurable. In this case, we often write $Y \\in \\mathcal{F}$. Similarly, random process $\\{Z(t)\\}$ is adapted to filtration $\\{\\mathcal{F}(t)\\}$ if $Z(t) \\in \\mathcal{F}(t), \\forall t$. See e.g. [7].\n\nThe above lemma is proven in Supplement 7.4 and provides both expected and high probability bounds for stochastic processes based on a drift condition. It will be used to establish upper bounds of virtual queues $\\|Q(t)\\|$, which further leads to expected and high probability constraint performance bounds of our algorithm. For a given stochastic process $Z(t)$, it is possible to show the drift condition (10) holds for multiple $t_0$ with different $\\zeta$ and $\\theta$. In fact, we will show in Lemma 7 that $\\|Q(t)\\|$ yielded by Algorithm 1 satisfies (10) for any integer $t_0 > 0$ by selecting $\\zeta$ and $\\theta$ according to $t_0$. One-step drift conditions, corresponding to the special case $t_0 = 1$ of Lemma 5, have been previously considered in [10, 20]. 
However, Lemma 5 (with general $t_0 > 0$) allows us to choose the best $t_0$ in performance analysis such that sublinear regret and constraint violation bounds are possible.\n\n3.2 Expected Constraint Violation Analysis\n\nDefine filtration $\\{\\mathcal{W}(t), t \\ge 0\\}$ with $\\mathcal{W}(0) = \\{\\emptyset, \\Omega\\}$ and $\\mathcal{W}(t) = \\sigma(\\omega(1), \\ldots, \\omega(t))$ being the $\\sigma$-algebra generated by random samples $\\{\\omega(1), \\ldots, \\omega(t)\\}$ up to round $t$. From the update rule in Algorithm 1, we observe that $x(t+1)$ is a deterministic function of $f^t(\\cdot)$, $g(\\cdot; \\omega(t))$ and $Q(t)$ where $Q(t)$ is further a deterministic function of $Q(t-1)$, $g(\\cdot; \\omega(t-1))$, $x(t)$ and $x(t-1)$. By induction, it is easy to show that $\\sigma(x(t)) \\subseteq \\mathcal{W}(t-1)$ and $\\sigma(Q(t)) \\subseteq \\mathcal{W}(t-1)$ for all $t \\ge 1$ where $\\sigma(Y)$ denotes the $\\sigma$-algebra generated by random variable $Y$. For fixed $t \\ge 1$, since $Q(t)$ is fully determined by $\\omega(\\tau), \\tau \\in \\{1, 2, \\ldots, t-1\\}$ and $\\omega(t)$ are i.i.d., we know $g^t(x)$ is independent of $Q(t)$. This is formally summarized in the next lemma.\n\nLemma 6. If $x^* \\in \\mathcal{X}_0$ satisfies $\\tilde{g}(x^*) = \\mathbb{E}_\\omega[g(x^*; \\omega)] \\le 0$, then Algorithm 1 guarantees:\n\n$$\\mathbb{E}[Q_k(t) g_k^t(x^*)] \\le 0, \\quad \\forall k \\in \\{1, 2, \\ldots, m\\}, \\forall t \\ge 1. \\qquad (11)$$\n\nProof. Fix $k \\in \\{1, 2, \\ldots, m\\}$ and $t \\ge 1$. Since $g_k^t(x^*) = g_k(x^*; \\omega(t))$ is independent of $Q_k(t)$, which is determined by $\\{\\omega(1), \\ldots, \\omega(t-1)\\}$, it follows that $\\mathbb{E}[Q_k(t) g_k^t(x^*)] = \\mathbb{E}[Q_k(t)] \\mathbb{E}[g_k^t(x^*)] \\overset{(a)}{\\le} 0$, where (a) follows from the fact that $\\mathbb{E}[g_k^t(x^*)] \\le 0$ and $Q_k(t) \\ge 0$.\n\nTo establish a bound on constraint violations, by Corollary 2, it suffices to derive upper bounds for $\\|Q(t)\\|$. In this subsection, we derive upper bounds for $\\|Q(t)\\|$ by applying the new drift lemma (Lemma 5) developed at the beginning of this section. The next lemma shows that random process $Z(t) = \\|Q(t)\\|$ satisfies the conditions in Lemma 5.\n\nLemma 7. Let $t_0 > 0$ be an arbitrary integer. At each round t \u2208 {1, 2, . . . 
,} in Algorithm 1, the following holds\n\n$$\\big| \\|Q(t+1)\\| - \\|Q(t)\\| \\big| \\le G + \\sqrt{m} D_2 R,$$\n\nand\n\n$$\\mathbb{E}\\big[ \\|Q(t+t_0)\\| - \\|Q(t)\\| \\,\\big|\\, \\mathcal{W}(t-1) \\big] \\le \\begin{cases} t_0 (G + \\sqrt{m} D_2 R), & \\text{if } \\|Q(t)\\| < \\theta \\\\ -t_0 \\frac{\\epsilon}{2}, & \\text{if } \\|Q(t)\\| \\ge \\theta \\end{cases}$$\n\nwhere $\\theta = \\frac{\\epsilon}{2} t_0 + (G + \\sqrt{m} D_2 R) t_0 + \\frac{2 \\alpha R^2}{t_0 \\epsilon} + \\frac{2 V D_1 R + [G + \\sqrt{m} D_2 R]^2}{\\epsilon}$, $m$ is the number of constraint functions; $D_1$, $D_2$, $G$ and $R$ are defined in Assumption 1; and $\\epsilon$ is defined in Assumption 2. (Note that $\\epsilon < G$ by the definition of $G$.)\n\nLemma 7 (proven in Supplement 7.5) allows us to apply Lemma 5 to random process $Z(t) = \\|Q(t)\\|$ and obtain $\\mathbb{E}[\\|Q(t)\\|] = O(\\sqrt{T}), \\forall t$ by taking $t_0 = \\lceil \\sqrt{T} \\rceil$, $V = \\sqrt{T}$ and $\\alpha = T$, where $\\lceil \\sqrt{T} \\rceil$ represents the smallest integer no less than $\\sqrt{T}$. By Corollary 2, this further implies the expected constraint violation bound $\\mathbb{E}[\\sum_{t=1}^{T} g_k^t(x(t))] \\le O(\\sqrt{T})$ as summarized in the next theorem.\n\nTheorem 1 (Expected Constraint Violation Bound). If $V = \\sqrt{T}$ and $\\alpha = T$ in Algorithm 1, then for all $T \\ge 1$, we have\n\n$$\\mathbb{E}\\Big[ \\sum_{t=1}^{T} g_k^t(x(t)) \\Big] \\le O(\\sqrt{T}), \\quad \\forall k \\in \\{1, 2, \\ldots, m\\}, \\qquad (12)$$\n\nwhere the expectation is taken with respect to all $\\omega(t)$.\n\nProof. Define random process $Z(t)$ with $Z(0) = 0$ and $Z(t) = \\|Q(t)\\|, t \\ge 1$ and filtration $\\mathcal{F}(t)$ with $\\mathcal{F}(0) = \\{\\emptyset, \\Omega\\}$ and $\\mathcal{F}(t) = \\mathcal{W}(t-1), t \\ge 1$. Note that $Z(t)$ is adapted to $\\mathcal{F}(t)$. By Lemma 7, $Z(t)$ satisfies the conditions in Lemma 5 with $\\delta_{\\max} = G + \\sqrt{m} D_2 R$, $\\zeta = \\frac{\\epsilon}{2}$ and $\\theta = \\frac{\\epsilon}{2} t_0 + (G + \\sqrt{m} D_2 R) t_0 + \\frac{2 \\alpha R^2}{t_0 \\epsilon} + \\frac{2 V D_1 R + [G + \\sqrt{m} D_2 R]^2}{\\epsilon}$. Thus, by part (1) of Lemma 5, for all $t \\in \\{1, 2, \\ldots\\}$, we have $\\mathbb{E}[\\|Q(t)\\|] \\le \\frac{\\epsilon}{2} t_0 + 2 (G + \\sqrt{m} D_2 R) t_0 + \\frac{2 \\alpha R^2}{t_0 \\epsilon} + \\frac{2 V D_1 R + [G + \\sqrt{m} D_2 R]^2}{\\epsilon} + t_0 \\frac{8 [G + \\sqrt{m} D_2 R]^2}{\\epsilon} \\log\\big[ \\frac{32 [G + \\sqrt{m} D_2 R]^2}{\\epsilon^2} \\big]$. Taking $t_0 = \\lceil \\sqrt{T} \\rceil$, $V = \\sqrt{T}$ and $\\alpha = T$, we have $\\mathbb{E}[\\|Q(t)\\|] \\le O(\\sqrt{T})$ for all $t \\in \\{1, 2, \\ldots\\}$.\n\nFix $T \\ge 1$. By Corollary 2 (with $V = \\sqrt{T}$ and $\\alpha = T$), we have $\\sum_{t=1}^{T} g_k^t(x(t)) \\le \\|Q(T+1)\\| + \\frac{\\sqrt{T} D_1 D_2}{2} + \\frac{\\sqrt{m} D_2^2}{2T} \\sum_{t=1}^{T} \\|Q(t)\\|, \\forall k \\in \\{1, 2, \\ldots, m\\}$. Taking expectations on both sides and substituting $\\mathbb{E}[\\|Q(t)\\|] = O(\\sqrt{T}), \\forall t$ into it yields $\\mathbb{E}[\\sum_{t=1}^{T} g_k^t(x(t))] \\le O(\\sqrt{T})$.\n\n3.3 Expected Regret Analysis\n\nThe next lemma (proven in Supplement 7.6) refines Lemma 4 and is useful to analyze the regret.\n\nLemma 8. Let $z \\in \\mathcal{X}_0$ be arbitrary. For all $T \\ge 1$, Algorithm 1 guarantees\n\n$$\\sum_{t=1}^{T} f^t(x(t)) \\le \\sum_{t=1}^{T} f^t(z) + \\underbrace{\\frac{\\alpha}{V} R^2 + \\frac{V D_1^2}{4\\alpha} T + \\frac{1}{2} [G + \\sqrt{m} D_2 R]^2 \\frac{T}{V}}_{\\text{(I)}} + \\underbrace{\\frac{1}{V} \\sum_{t=1}^{T} \\Big[ \\sum_{k=1}^{m} Q_k(t) g_k^t(z) \\Big]}_{\\text{(II)}} \\qquad (13)$$\n\nwhere $m$ is the number of constraint functions; and $D_1$, $D_2$, $G$ and $R$ are defined in Assumption 1.\n\nNote that if we take $V = \\sqrt{T}$ and $\\alpha = T$, then term (I) in (13) is $O(\\sqrt{T})$. Recall that the expectation of term (II) in (13) with $z = x^*$ is non-positive by Lemma 6. The expected regret bound of Algorithm 1 follows by taking expectations on both sides of (13) and is summarized in the next theorem.\n\nTheorem 2 (Expected Regret Bound). Let $x^* \\in \\mathcal{X}_0$ be any fixed solution that satisfies $\\tilde{g}(x^*) \\le 0$, e.g., $x^* = \\operatorname{argmin}_{x \\in \\mathcal{X}} \\sum_{t=1}^{T} f^t(x)$. If $V = \\sqrt{T}$ and $\\alpha = T$ in Algorithm 1, then for all $T \\ge 1$,\n\n$$\\mathbb{E}\\Big[ \\sum_{t=1}^{T} f^t(x(t)) \\Big] \\le \\mathbb{E}\\Big[ \\sum_{t=1}^{T} f^t(x^*) \\Big] + O(\\sqrt{T}),$$\n\nwhere the expectation is taken with respect to all $\\omega(t)$.\n\nProof. Fix $T \\ge 1$. Taking $z = x^*$ in Lemma 8 yields $\\sum_{t=1}^{T} f^t(x(t)) \\le \\sum_{t=1}^{T} f^t(x^*) + \\frac{\\alpha}{V} R^2 + \\frac{V D_1^2}{4\\alpha} T + \\frac{1}{2} [G + \\sqrt{m} D_2 R]^2 \\frac{T}{V} + \\frac{1}{V} \\sum_{t=1}^{T} \\big[ \\sum_{k=1}^{m} Q_k(t) g_k^t(x^*) \\big]$. Taking expectations on both sides and using (11) yields $\\sum_{t=1}^{T} \\mathbb{E}[f^t(x(t))] \\le \\sum_{t=1}^{T} \\mathbb{E}[f^t(x^*)] + R^2 \\frac{\\alpha}{V} + \\frac{D_1^2}{4} \\frac{V}{\\alpha} T + \\frac{1}{2} [G + \\sqrt{m} D_2 R]^2 \\frac{T}{V}$. Taking $V = \\sqrt{T}$ and $\\alpha = T$ yields $\\sum_{t=1}^{T} \\mathbb{E}[f^t(x(t))] \\le \\sum_{t=1}^{T} \\mathbb{E}[f^t(x^*)] + O(\\sqrt{T})$.\n\n3.4 Special Case Performance Guarantees\n\nTheorems 1 and 2 provide expected performance guarantees of Algorithm 1 for OCO with stochastic constraints. The results further imply the performance guarantees in the following special cases:\n\n\u2022 OCO with long term constraints: In this case, $g_k(x; \\omega(t)) \\equiv g_k(x)$ and there is no randomness. Thus, the expectations in Theorems 1 and 2 disappear. For this problem, Algorithm 1 can achieve $O(\\sqrt{T})$ (deterministic) regret and $O(\\sqrt{T})$ (deterministic) constraint violations.\n\n\u2022 Stochastic constrained convex optimization: Note that i.i.d. time-varying $f(x; \\omega(t))$ is a special case of arbitrarily-varying $f^t(x)$ as considered in our OCO setting. Thus, Theorems 1 and 2 still hold when Algorithm 1 is applied to solve stochastic constrained convex optimization $\\min_x \\{\\mathbb{E}[f(x; \\omega)] : \\mathbb{E}[g_k(x; \\omega)] \\le 0, \\forall k \\in \\{1, 2, \\ldots, m\\}, x \\in \\mathcal{X}_0\\}$ in an online fashion with i.i.d. realizations $\\omega(t) \\sim \\omega$. Since Algorithm 1 chooses each $x(t)$ without knowing $\\omega(t)$, it follows that $x(t)$ is independent of $\\omega(t')$ for any $t' \\ge t$ by the i.i.d. property of each $\\omega(t)$. Fix $T > 0$. If we run Algorithm 1 for $T$ slots and use $\\bar{x}(T) = \\frac{1}{T} \\sum_{t=1}^{T} x(t)$ as a fixed solution for any future slot $t' \\ge T + 1$, then\n\n$$\\mathbb{E}[f(\\bar{x}(T); \\omega(t'))] \\overset{(a)}{\\le} \\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}[f(x(t); \\omega(t'))] \\overset{(b)}{=} \\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}[f(x(t); \\omega(t))] \\overset{(c)}{\\le} \\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}[f(x^*; \\omega(t))] + O(\\tfrac{1}{\\sqrt{T}}) \\overset{(d)}{=} \\mathbb{E}[f(x^*; \\omega(t'))] + O(\\tfrac{1}{\\sqrt{T}})$$\n\nand\n\n$$\\mathbb{E}[g_k(\\bar{x}(T); \\omega(t'))] \\overset{(a)}{\\le} \\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}[g_k(x(t); \\omega(t'))] \\overset{(b)}{=} \\frac{1}{T} \\sum_{t=1}^{T} \\mathbb{E}[g_k(x(t); \\omega(t))] \\overset{(c)}{\\le} O(\\tfrac{1}{\\sqrt{T}}), \\quad \\forall k \\in \\{1, 2, \\ldots, m\\}$$\n\nwhere (a) follows from Jensen\u2019s inequality and the fact that $\\bar{x}(T)$ is independent of $\\omega(t')$; (b) follows because each $x(t)$ is independent of both $\\omega(t)$ and $\\omega(t')$ and $\\omega(t), \\omega(t')$ are i.i.d. 
realizations of $\\omega$; (c) follows from Theorems 1 and 2 by dividing both sides by $T$ and (d) follows because $\\mathbb{E}[f(x^*; \\omega(t))] = \\mathbb{E}[f(x^*; \\omega(t'))]$ for all $t \\in \\{1, \\ldots, T\\}$ by the i.i.d. property of each $\\omega(t)$. Thus, if we use Algorithm 1 as a (batch) offline algorithm to solve stochastic constrained convex optimization, it has $O(1/\\sqrt{T})$ convergence and ties with the algorithm developed in [15], which is by design a (batch) offline algorithm and can only solve stochastic optimization with a single constraint function.\n\n\u2022 Deterministic constrained convex optimization: Similarly to OCO with long term constraints, the expectations in Theorems 1 and 2 disappear in this case since $f^t(x) \\equiv f(x)$ and $g_k(x; \\omega(t)) \\equiv g_k(x)$. If we use $\\bar{x}(T) = \\frac{1}{T} \\sum_{t=1}^{T} x(t)$ as the solution, then $f(\\bar{x}(T)) \\le f(x^*) + O(\\frac{1}{\\sqrt{T}})$ and $g_k(\\bar{x}(T)) \\le O(\\frac{1}{\\sqrt{T}})$, which follows by dividing inequalities in Theorems 1 and 2 by $T$ on both sides and applying Jensen\u2019s inequality. Thus, Algorithm 1 solves deterministic constrained convex optimization with $O(\\frac{1}{\\sqrt{T}})$ convergence.\n\n4 High Probability Performance Analysis\n\nThis section shows that if we choose $V = \\sqrt{T}$ and $\\alpha = T$ in Algorithm 1, then for any $0 < \\lambda < 1$, with probability at least $1 - \\lambda$, regret is $O(\\sqrt{T} \\log(T) \\log^{1.5}(\\frac{1}{\\lambda}))$ and constraint violations are $O(\\sqrt{T} \\log(T) \\log(\\frac{1}{\\lambda}))$.\n\n4.1 High Probability Constraint Violation Analysis\n\nSimilarly to the expected constraint violation analysis, we can use part (2) of the new drift lemma (Lemma 5) to obtain a high probability bound of $\\|Q(t)\\|$, which together with Corollary 2 leads to a high probability constraint violation bound summarized in Theorem 3 (proven in Supplement 7.7).\n\nTheorem 3 (High Probability Constraint Violation Bound). Let $0 < \\lambda < 1$ be arbitrary. If $V = \\sqrt{T}$ and $\\alpha = T$ in Algorithm 1, then for all $T \\ge 1$ and all k \u2208 {1, 2, . . . 
, m}, we have\n\n$$\\Pr\\Big( \\sum_{t=1}^{T} g_k^t(x(t)) \\le O\\big(\\sqrt{T} \\log(T) \\log(\\tfrac{1}{\\lambda})\\big) \\Big) \\ge 1 - \\lambda.$$\n\n4.2 High Probability Regret Analysis\n\nTo obtain a high probability regret bound from Lemma 8, it remains to derive a high probability bound of term (II) in (13) with $z = x^*$. The main challenge is that term (II) is a supermartingale with unbounded differences (due to the possibly unbounded virtual queues $Q_k(t)$). Most concentration inequalities, e.g., the Hoeffding-Azuma inequality, used in high probability performance analysis of online algorithms are restricted to martingales/supermartingales with bounded differences. See for example [4, 2, 16]. The following lemma considers supermartingales with unbounded differences. Its proof (provided in Supplement 7.8) uses the truncation method to construct an auxiliary well-behaved supermartingale. Similar proof techniques have previously been used in [26, 24] to prove different concentration inequalities for supermartingales/martingales with unbounded differences.\n\nLemma 9. Let $\\{Z(t), t \\ge 0\\}$ be a supermartingale adapted to a filtration $\\{\\mathcal{F}(t), t \\ge 0\\}$ with $Z(0) = 0$ and $\\mathcal{F}(0) = \\{\\emptyset, \\Omega\\}$, i.e., $\\mathbb{E}[Z(t+1) | \\mathcal{F}(t)] \\le Z(t), \\forall t \\ge 0$. Suppose there exists a constant $c > 0$ such that $\\{|Z(t+1) - Z(t)| > c\\} \\subseteq \\{Y(t) > 0\\}, \\forall t \\ge 0$, where $Y(t)$ is a process adapted to $\\mathcal{F}(t)$ for all $t \\ge 0$. Then, for all $z > 0$, we have\n\n$$\\Pr(Z(t) \\ge z) \\le e^{-z^2/(2 t c^2)} + \\sum_{\\tau=0}^{t-1} \\Pr(Y(\\tau) > 0), \\quad \\forall t \\ge 1.$$\n\nNote that if $\\Pr(Y(t) > 0) = 0, \\forall t \\ge 0$, then $\\Pr(\\{|Z(t+1) - Z(t)| > c\\}) = 0, \\forall t \\ge 0$ and $Z(t)$ is a supermartingale with differences bounded by $c$. In this case, Lemma 9 reduces to the conventional Hoeffding-Azuma inequality.\n\nThe next theorem (proven in Supplement 7.9) summarizes the high probability regret performance of Algorithm 1 and follows from Lemmas 5-9.\n\nTheorem 4 (High Probability Regret Bound). Let $x^* \\in \\mathcal{X}_0$ be any fixed solution that satisfies $\\tilde{g}(x^*) \\le 0$, e.g., $x^* = \\operatorname{argmin}_{x \\in \\mathcal{X}} \\sum_{t=1}^{T} f^t(x)$. Let $0 < \\lambda < 1$ be arbitrary. If $V = \\sqrt{T}$ and $\\alpha = T$ in Algorithm 1, then for all $T \\ge 1$, we have\n\n$$\\Pr\\Big( \\sum_{t=1}^{T} f^t(x(t)) \\le \\sum_{t=1}^{T} f^t(x^*) + O\\big(\\sqrt{T} \\log(T) \\log^{1.5}(\\tfrac{1}{\\lambda})\\big) \\Big) \\ge 1 - \\lambda.$$\n\n5 Experiment: Online Job Scheduling in Distributed Data Centers\n\nConsider a geo-distributed data center infrastructure consisting of one front-end job router and 100 geographically distributed servers, which are located at 10 different zones to form 10 clusters (10 servers in each cluster). See Fig. 1(a) for an illustration. The front-end job router receives job tasks and schedules them to different servers to fulfill the service. To serve the assigned jobs, each server purchases power (within its capacity) from its zone market. Electricity market prices can vary significantly across time and zones. For example, see Fig. 1(b) for a 5-minute average electricity price trace (between 05/01/2017 and 05/10/2017) at New York zone CENTRL [1]. This problem is to schedule jobs and control power levels at each server in real time such that all incoming jobs are served and electricity cost is minimized. In our experiment, each server power is adjusted every 5 minutes, which is called a slot. (In practice, server power cannot be adjusted too frequently due to hardware restrictions and configuration delay.) Let $x(t) = [x_1(t), \\ldots, x_{100}(t)]$ be the power vector at slot $t$, where each $x_i(t)$ must be chosen from an interval $[x_i^{\\min}, x_i^{\\max}]$ restricted by the hardware, and the service rate at each server $i$ satisfies $\\mu_i(t) = h_i(x_i(t))$, where $h_i(\\cdot)$ is an increasing concave function. At each slot $t$, the job router schedules $\\mu_i(t)$ amount of jobs to server $i$. The electricity cost at slot $t$ is $f^t(x(t)) = \\sum_{i=1}^{100} c_i(t) x_i(t)$ where $c_i(t)$ is the electricity price at server $i$\u2019s zone. 
We use ci(t) from real-world 5-minute average electricity price data at 10 different zones\nin New York city between 05/01/2017 and 05/10/2017 obtained from NYISO [1]. At each slot\nt, the incoming job is given by !(t) and satis\ufb01es a Poisson distribution. Note that the amount of\nincoming jobs and electricity price ci(t) are unknown to us at the beginning of each slot t but can\nbe observed at the end of each slot. This is an example of OCO with stochastic constraints, where\nwe aim to minimize the electricity cost subject to the constraint that incoming jobs must be served\nin time. In particular, at each round t, we receive loss function f t(x(t)) and constraint function\n\n, xmax\n\ni\n\ni\n\ngt(x(t)) = !(t) P100\n\ni=1 hi(xi(t)).\n\nWe compare our proposed algorithm with 3 baselines: (1) best \ufb01xed decision in hindsight; (2) react\n[8] and (3) low-power [22]. Both \u201creact\" and \u201clow-power\" are popular power control strategies\nused in distributed data centers. See Supplement 7.10 for more details of these 2 baselines and our\nexperiment. Fig. 1(c)(d) plot the performance of 4 algorithms, where the running average is the\ntime average up to the current slot. Fig. 1(c) compares electricity cost while Fig. 1(d) compares\nunserved jobs. (Unserved jobs accumulate if the service rate provided by an algorithm is less than\nthe job arrival rate, i.e., the stochastic constraint is violated.) Fig. 1(c)(d) show that our proposed\nalgorithm performs closely to the best \ufb01xed decision in hindsight over time, both in electricity cost\nand constraint violations. 
\u201cReact\u201d performs well in serving job arrivals but yields larger electricity cost, while \u201clow-power\u201d has low electricity cost but fails to serve job arrivals.\n\n[Figure 1 panels. (b) Electricity market price: price (dollar/MWh) versus number of slots (each 5 min). (c) Running average electricity cost: cost (dollar) versus number of slots (each 5 min). (d) Running average unserved jobs: unserved jobs (per slot) versus number of slots (each 5 min). Legend in (c) and (d): Our algorithm; Best fixed decision in hindsight; React (Gandhi et al. 2012); Low-power (Qureshi et al. 2009).]\n\nFigure 1: (a) Geo-distributed data center infrastructure; (b) Electricity market prices at zone CENTRAL New York; (c) Running average electricity cost; (d) Running average unserved jobs.\n\n6 Conclusion\nThis paper studies OCO with stochastic constraints, where the objective function varies arbitrarily but the constraint functions are i.i.d. over time. A novel learning algorithm is developed that guarantees $O(\\sqrt{T})$ expected regret and constraint violations and $O(\\sqrt{T} \\log(T))$ high probability regret and constraint violations.\n\nReferences\n[1] New York ISO open access pricing data. http://www.nyiso.com/.\n[2] Peter L Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. 
In Proceedings of Conference on Learning Theory (COLT), 2008.\n[3] Nicol\u00f2 Cesa-Bianchi, Philip M Long, and Manfred K Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604\u2013619, 1996.\n[4] Nicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n[5] Andrew Cotter, Maya Gupta, and Jan Pfeifer. A light touch for heavily constrained SGD. In Proceedings of Conference on Learning Theory (COLT), 2015.\n[6] Joseph L Doob. Stochastic processes. Wiley New York, 1953.\n[7] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.\n[8] Anshul Gandhi, Mor Harchol-Balter, and Michael A Kozuch. Are sleep states effective in data centers? In International Green Computing Conference (IGCC), 2012.\n[9] Geoffrey J Gordon. Regret bounds for prediction problems. In Proceedings of Conference on Learning Theory (COLT), 1999.\n[10] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502\u2013525, 1982.\n[11] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3\u20134):157\u2013325, 2016.\n[12] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169\u2013192, 2007.\n[13] Rodolphe Jenatton, Jim Huang, and C\u00e9dric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In Proceedings of International Conference on Machine Learning (ICML), 2016.\n[14] Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1\u201363, 1997.\n[15] Guanghui Lan and Zhiqiang Zhou. Algorithms for stochastic optimization with expectation constraints. 
arXiv:1604.03887, 2016.\n[16] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(1):2503\u20132528, 2012.\n[17] Mehrdad Mahdavi, Tianbao Yang, and Rong Jin. Stochastic convex optimization with multiple objectives. In Advances in Neural Information Processing Systems (NIPS), 2013.\n[18] Shie Mannor, John N Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569\u2013590, March 2009.\n[19] Angelia Nedi\u0107 and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205\u2013228, 2009.\n[20] Michael J. Neely. Energy-aware wireless scheduling with near optimal backlog and convergence time tradeoffs. IEEE/ACM Transactions on Networking, 24(4):2223\u20132236, 2016.\n[21] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004.\n[22] Asfandyar Qureshi, Rick Weber, Hari Balakrishnan, John Guttag, and Bruce Maggs. Cutting the electric bill for internet-scale systems. In ACM SIGCOMM, 2009.\n[23] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2011.\n[24] Terence Tao and Van Vu. Random matrices: universality of local spectral statistics of non-hermitian matrices. The Annals of Probability, 43(2):782\u2013874, 2015.\n[25] David Tse and Pramod Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, 2005.\n[26] Van Vu. Concentration of non-lipschitz functions and applications. Random Structures & Algorithms, 20(3):262\u2013316, 2002.\n[27] Hao Yu and Michael J. Neely. 
A low complexity algorithm with $O(\\sqrt{T})$ regret and finite constraint violations for online convex optimization with long term constraints. arXiv:1604.02218, 2016.\n[28] Hao Yu and Michael J. Neely. A simple parallel algorithm with an $O(1/t)$ convergence rate for general convex programs. SIAM Journal on Optimization, 27(2):759\u2013783, 2017.\n[29] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of International Conference on Machine Learning (ICML), 2003.", "award": [], "sourceid": 917, "authors": [{"given_name": "Hao", "family_name": "Yu", "institution": "University of Southern California"}, {"given_name": "Michael", "family_name": "Neely", "institution": "Univ. Southern California"}, {"given_name": "Xiaohan", "family_name": "Wei", "institution": "University of Southern California"}]}