{"title": "Online Learning of Optimal Bidding Strategy in Repeated Multi-Commodity Auctions", "book": "Advances in Neural Information Processing Systems", "page_first": 4507, "page_last": 4517, "abstract": "We study the online learning problem of a bidder who participates in repeated auctions. With the goal of maximizing his T-period payoff, the bidder determines the optimal allocation of his budget among his bids for $K$ goods at each period. As a bidding strategy, we propose a polynomial-time algorithm, inspired by the dynamic programming approach to the knapsack problem. The proposed algorithm, referred to as dynamic programming on discrete set (DPDS), achieves a regret order of $O(\\sqrt{T\\log{T}})$. By showing that the regret is lower bounded by $\\Omega(\\sqrt{T})$ for any strategy, we conclude that DPDS is order optimal up to a $\\sqrt{\\log{T}}$ term. We evaluate the performance of DPDS empirically in the context of virtual trading in wholesale electricity markets by using historical data from the New York market. Empirical results show that DPDS consistently outperforms benchmark heuristic methods that are derived from machine learning and online learning approaches.", "full_text": "Online Learning of Optimal Bidding Strategy\n\nin Repeated Multi-Commodity Auctions\n\nSevi Baltaoglu\nCornell University\nIthaca, NY 14850\n\nmsb372@cornell.edu\n\nLang Tong\n\nCornell University\nIthaca, NY 14850\n\nQing Zhao\n\nCornell University\nIthaca, NY 14850\n\nlt35@cornell.edu\n\nqz16@cornell.edu\n\nAbstract\n\nWe study the online learning problem of a bidder who participates in repeated\nauctions. With the goal of maximizing his T-period payoff, the bidder determines\nthe optimal allocation of his budget among his bids for K goods at each period.\nAs a bidding strategy, we propose a polynomial-time algorithm, inspired by the\ndynamic programming approach to the knapsack problem. 
The proposed algorithm, referred to as dynamic programming on discrete set (DPDS), achieves a regret order of O(√(T log T)). By showing that the regret is lower bounded by Ω(√T) for any strategy, we conclude that DPDS is order optimal up to a √(log T) term. We evaluate the performance of DPDS empirically in the context of virtual trading in wholesale electricity markets by using historical data from the New York market. Empirical results show that DPDS consistently outperforms benchmark heuristic methods that are derived from machine learning and online learning approaches.

1 Introduction

We consider the problem of optimal bidding in a multi-commodity uniform-price auction (UPA) [1], which promotes the law of one price for identical goods. UPA is widely used in practice. Examples include spectrum auctions, the auction of treasury notes, the auction of emission permits (UK), and virtual trading in the wholesale electricity market, which we discuss in detail in Sec. 1.1.

A mathematical abstraction of multi-commodity UPA is as follows. A bidder has K goods to bid on at an auction. With the objective of maximizing his T-period expected profit, at each period, the bidder determines how much to bid for each good subject to a budget constraint.

In the bidding period t, if a bid x_{t,k} for good k is greater than or equal to its auction clearing price λ_{t,k}, then the bid is cleared, and the bidder pays λ_{t,k}. His revenue resulting from the cleared bid will be the good's spot price (utility) π_{t,k}. In particular, the payoff obtained from good k at period t is (π_{t,k} − λ_{t,k})1{x_{t,k} ≥ λ_{t,k}}, where 1{x_{t,k} ≥ λ_{t,k}} indicates whether the bid is cleared. Let λ_t = [λ_{t,1}, ..., λ_{t,K}]^⊤ and π_t = [π_{t,1}, ..., π_{t,K}]^⊤ be the vectors of auction clearing and spot market prices at period t, respectively. Similarly, let x_t = [x_{t,1}, ..., x_{t,K}]^⊤ be the vector of bids for period t. We assume that (π_t, λ_t) are drawn from an unknown joint distribution and, in our analysis, independent and identically distributed (i.i.d.) over time.¹

At the end of each period, the bidder observes the auction clearing and spot prices of all goods. Therefore, before choosing the bid of period t, all the information the bidder has is a vector I_{t−1} containing his observation and decision history {x_i, λ_i, π_i}_{i=1}^{t−1}. Consequently, a bidding policy µ of a bidder is defined as a sequence of decision rules, i.e., µ = (µ_0, µ_1, ..., µ_{T−1}), such that, at time t − 1, µ_{t−1} maps the information history I_{t−1} to the bid x_t of period t. The performance of any bidding policy µ is measured by its regret, which is defined as the difference between the total expected payoff of policy µ and that of the optimal bidding strategy under the known distribution of (π_t, λ_t).

¹This implies that the auction clearing price is independent of the bid x_t, which is a reasonable assumption for any market where an individual's bid has negligible impact on the market price.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1.1 Motivating applications

The mathematical abstraction introduced above applies to virtual trading in the U.S. wholesale electricity markets that are operated under a two-settlement framework. In the day-ahead (DA) market, the independent system operator (ISO) receives offers to sell and bids to buy from generators and retailers for each hour of the next day. To determine the optimal DA dispatch of the next day and the DA electricity prices at each location, the ISO solves an economic dispatch problem with the objective of maximizing social surplus while taking transmission and operational constraints into account.
Due to system congestion and losses, wholesale electricity prices vary from location to location.² In the real-time (RT) market, the ISO adjusts the DA dispatch according to the RT operating conditions, and the RT wholesale price compensates for deviations of the actual consumption from the DA schedule.

Differences between DA and RT prices occur frequently, both as a result of generators and retailers exercising locational market power [2] and as a result of price spikes in the RT market due to unplanned outages and unpredictable weather conditions [3]. To promote price convergence between the DA and RT markets, virtual trading was introduced in the early 2000s [4]. Virtual trading is a financial mechanism that allows market participants and external financial entities to arbitrage the differences between DA and RT prices. Empirical and analytical studies have shown that increased competition in the market due to virtual trading results in price convergence and increased market efficiency [2, 3, 5].

Virtual transactions make up a significant portion of the wholesale electricity markets. For example, the total volume of cleared virtual transactions in five big ISO markets was 13% of the total load in 2013 [4]. In the same year, the total payoff resulting from all virtual transactions was around 250 million dollars in the PJM market [2] and 45 million dollars in the NYISO market [6].

A bid in virtual trading is a bid to buy (sell) energy in the DA market at a specific location with an obligation to sell (buy) back exactly the same amount in the RT market at the same location if the bid is cleared (accepted). Specifically, a bid to buy in the DA market is cleared if the offered bid price is higher than the DA market price. Similarly, a bid to sell in the DA market is cleared if it is below the DA market price. In this context, different locations and/or different hours of the day are the set of goods to bid on.
The DA prices are the auction clearing prices, and the RT prices are the spot prices.

The problem studied here may also find applications in other types of repeated auctions where the auction may be of the double, uniform-price, or second-price type. For example, in the case of online advertising auctions [7], different goods can correspond to different types of advertising space an advertiser may consider bidding on.

1.2 Main results and related work

We propose an online learning approach to algorithmic bidding under budget constraints in repeated multi-commodity auctions. The proposed approach falls in the category of empirical risk minimization (ERM), also referred to as the follow-the-leader approach. The main challenge here is that optimizing the payoff (risk) amounts to solving a multiple choice knapsack problem (MCKP), which is known to be NP-hard [8]. The proposed approach, referred to as dynamic programming on discrete set (DPDS), is inspired by a pseudo-polynomial dynamic programming approach to 0-1 knapsack problems. DPDS allocates the limited budget of the bidder among K goods in polynomial time, both in terms of the number of goods K and in terms of the time horizon T. We show that the expected payoff of DPDS converges to that of the optimal strategy under the known distribution at a rate no slower than √(log t / t), which results in a regret upper bound of O(√(T log T)). By showing that, for any bidding strategy, the regret is lower bounded by Ω(√T), we prove that DPDS is order optimal up to a √(log T) term. We also evaluate the performance of DPDS empirically in the context of virtual trading by using historical data from the New York energy market. Our empirical results show that DPDS consistently outperforms benchmark heuristic methods that are derived from standard machine learning methods.

²For example, transmission congestion may prevent scheduling the least expensive resources at some locations.

The problem formulated here can be viewed from multiple machine learning perspectives. We highlight below several relevant existing approaches. Since the bidder can calculate the reward that could have been obtained by selecting any given bid value regardless of his own decision, our problem falls into the category of the full-feedback version of the multi-armed bandit (MAB) problem, referred to as the experts problem, where the rewards of all arms (actions) are observable at the end of each period regardless of the chosen arm. For the case of a finite number of arms, Kleinberg et al. [9] showed that, in the stochastic setting, constant regret is achievable by choosing the arm with the highest average reward at each period. A special case of the adversarial setting was studied by Cesa-Bianchi et al. [10], who provided matching upper and lower bounds of order Θ(√T). Later, Freund and Schapire [11] and Auer et al. [12] showed that the Hedge algorithm, a variation of the weighted majority algorithm [13], achieves the matching bound in the general setting. These results, however, do not apply to experts problems with continuous action spaces.

The stochastic experts problem where the set of arms is an uncountable compact metric space (X, d) rather than a finite set was studied by Kleinberg and Slivkins [14] (see [15] for an extended version). Since there is an uncountable number of arms, it is assumed that, in each period, a payoff function drawn from an i.i.d. distribution is observed rather than the individual payoff of each arm.
Under the assumption of a Lipschitz expected payoff function, they showed that the instance-specific regret of any algorithm is lower bounded by Ω(√T). They also showed that their algorithm, NaiveExperts, achieves a regret upper bound of O(T^γ) for any γ > (b + 1)/(b + 2), where b is the isometry invariant of the metric space. However, NaiveExperts is computationally intractable in practice because the computational complexity of its direct implementation grows exponentially with the dimension (the number of goods in our case). Furthermore, the lower bound in [14] does not imply a lower bound for our problem with a specific payoff. Krichene et al. [16] studied the adversarial setting and proposed an extension of the Hedge algorithm, which achieves O(√(T log T)) regret under the assumption of Lipschitz payoff functions. For our problem, it is reasonable to assume that the expected payoff function is Lipschitz; yet it is clear that, at each period, the payoff realization is a step function, which is not Lipschitz. Hence, the Lipschitz assumption of [16] does not hold in our setting.

Stochastic gradient descent methods, which have low computational complexity, have been studied extensively in the continuum-armed bandit literature [17, 18, 19]. However, either concavity or unimodality of the expected payoff function is required for the regret guarantees of these methods to hold. This may not be the case in our problem, depending on the underlying distribution of prices.

A relevant work that takes an online learning perspective on the problem of a bidder engaging in repeated auctions is Weed et al. [7]. They are motivated by online advertising auctions and studied the partial-information setting of the same problem as ours but without a budget constraint.
Under the margin condition, i.e., that the probability of the auction price occurring in close proximity to the mean utility is bounded, they showed that their algorithm, inspired by the UCB1 algorithm [20], achieves regret that ranges from O(log T) to O(√(T log T)) depending on how tight the margin condition is. They also provided matching lower bounds up to a logarithmic factor. However, their lower bound does not imply a bound for the full-information setting we study here. Also, the learning algorithm in [7] does not apply here because the goods are coupled through the budget constraint in our case. Furthermore, we do not have a margin condition, and we allow the utility of the good to depend on the auction price.

Some other examples of the literature on online learning in repeated auctions studied the problem of an advertiser who wants to maximize the number of clicks under a budget constraint [21, 22], or that of a seller who tries to learn the valuation of his buyer in a posted price auction [23, 24]. The settings considered in those problems are considerably different from that studied here in the implementation of the budget constraints [21, 22] and in the strategic behavior of the bidder [23, 24].

2 Problem formulation

The total expected payoff at period t given bid x_t can be expressed as

    r(x_t) = E((π_t − λ_t)^⊤ 1{x_t ≥ λ_t} | x_t),

where the expectation is taken using the joint distribution of (π_t, λ_t), and 1{x_t ≥ λ_t} is the vector of indicator functions with the k-th entry corresponding to 1{x_{t,k} ≥ λ_{t,k}}. We assume that the payoff (π_t − λ_t)^⊤ 1{x_t ≥ λ_t} obtained at each period is a bounded random variable with support in [l, u],³ and that the auction prices are drawn from a distribution with positive support.
Hence, a zero bid for any good is equivalent to not bidding because it will not get cleared.

The objective is to determine a bidding policy µ that maximizes the expected T-period payoff subject to a budget constraint for each individual period:

    maximize_µ  E(Σ_{t=1}^T r(x^µ_t))
    subject to  ‖x^µ_t‖_1 ≤ B,  for all t = 1, ..., T,
                x^µ_t ≥ 0,      for all t = 1, ..., T,     (1)

where B is the auction budget of the bidder, x^µ_t denotes the bid determined by policy µ, and x^µ_t ≥ 0 is equivalent to x^µ_{t,k} ≥ 0 for all k ∈ {1, 2, ..., K}.

2.1 Optimal solution under known distribution

If the joint distribution f(·, ·) of π_t and λ_t is known, the optimization problem (1) decouples into solving for each time instant separately. Since (π_t, λ_t) is i.i.d. over t, an optimal solution under the known model does not depend on t and is given by

    x* = arg max_{x_t ∈ F} r(x_t),     (2)

where F = {x ∈ ℝ^K : x ≥ 0, ‖x‖_1 ≤ B} is the feasible set of bids. The optimal solution x* may not be unique, and it may not have a closed form. The following example illustrates a case where there is no closed-form solution and shows that, even in the case of a known distribution, the problem is a combinatorial stochastic optimization, and it is not easy to calculate an optimal solution.

Example. Let λ_t and π_t be independent, let λ_{t,k} be exponentially distributed with mean λ̄_k > 0, and let the mean of π_{t,k} be π̄_k > 0 for all k ∈ {1, ..., K}. Since not bidding for good k is optimal if π̄_k ≤ 0, we exclude the case π̄_k ≤ 0 without loss of generality.
For this example, we can use the concavity of r(x) in the interval [0, π̄], where π̄ = [π̄_1, ..., π̄_K]^⊤, to obtain the unique optimal solution x*, which is characterized by

    x*_k = π̄_k,                                                      if Σ_{k=1}^K π̄_k ≤ B;
    x*_k = 0,                                                         if Σ_{k=1}^K π̄_k > B and π̄_k/λ̄_k < γ*;
    x*_k = the x_k satisfying (π̄_k − x_k) e^{−x_k/λ̄_k} / λ̄_k = γ*,  if Σ_{k=1}^K π̄_k > B and π̄_k/λ̄_k ≥ γ*;

where the Lagrange multiplier γ* > 0 is chosen such that ‖x*‖_1 = B is satisfied. This solution takes the form of a "water-filling" strategy. More specifically, if the budget constraint is not binding, then the optimal solution is to bid π̄_k for every good k. However, in the case of a binding budget constraint, the optimal solution is determined by the bid value at which the marginal expected payoff associated with each good k is equal to min(γ*, π̄_k/λ̄_k), and this bid value cannot be expressed in closed form.

We measure the performance of a bidding policy µ by its regret⁴, the difference between the expected T-period payoff of µ and that of x*, i.e.,

    R^µ_T(f) = Σ_{t=1}^T E(r(x*) − r(x^µ_t)),     (3)

where the expectation is taken with respect to the randomness induced by µ. The regret of any policy is monotonically increasing. Hence, we are interested in policies with sub-linear regret growth.

³This is reasonable in the case of virtual trading because DA and RT prices are bounded due to offer/bid caps.
⁴The regret definition used here is the same as in [14].
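For concreteness, γ* and the corresponding bids in this example can be computed numerically by nested bisection: for a trial water level γ, an inner bisection finds the bid at which the marginal expected payoff of good k equals γ, and an outer bisection adjusts γ until the budget is exhausted. The following Python sketch is our own illustration of this computation (the function name and tolerances are assumptions, not from the paper):

```python
import math

def waterfill_bids(pi_bar, lam_bar, B, tol=1e-10):
    """Water-filling bids for the exponential example: lambda_k ~ Exp(mean lam_bar[k]).

    The marginal expected payoff of good k at bid x is
    (pi_bar[k] - x) * exp(-x / lam_bar[k]) / lam_bar[k],
    which is strictly decreasing on [0, pi_bar[k]].
    """
    if sum(pi_bar) <= B:                 # budget not binding: bid the mean utility
        return list(pi_bar)

    def bid_at(gamma):
        # For each good, find x_k in [0, pi_bar[k]] whose marginal payoff equals gamma
        # (bid 0 if even the marginal payoff at x_k = 0, pi_bar[k]/lam_bar[k], is below gamma).
        bids = []
        for pk, lk in zip(pi_bar, lam_bar):
            if pk / lk < gamma:
                bids.append(0.0)
                continue
            lo, hi = 0.0, pk
            while hi - lo > tol:
                mid = (lo + hi) / 2
                if (pk - mid) * math.exp(-mid / lk) / lk > gamma:
                    lo = mid             # marginal payoff still above gamma: bid more
                else:
                    hi = mid
            bids.append(lo)
        return bids

    # Outer bisection on gamma so that the bids exactly exhaust the budget.
    g_lo, g_hi = 0.0, max(pk / lk for pk, lk in zip(pi_bar, lam_bar))
    while g_hi - g_lo > tol:
        g = (g_lo + g_hi) / 2
        if sum(bid_at(g)) > B:
            g_lo = g                     # bids too large: raise the water level
        else:
            g_hi = g
    return bid_at(g_hi)
```

In the binding case, raising γ shrinks every bid monotonically, which is what makes the outer bisection valid.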
This definition is also known as pseudo-regret in the literature [25].

3 Online learning approach to optimal bidding

The idea behind our approach is to maximize the sample mean of the expected payoff function, which is an ERM approach [26]. However, we show that a direct implementation of ERM is NP-hard. Hence, we propose a polynomial-time algorithm that is based on dynamic programming on a discretized feasible set. We show that our approach achieves the order-optimal regret.

3.1 Approximate expected payoff function and its optimization

Regardless of the bidding policy, one can observe the auction and spot prices of past periods. Therefore, the average payoff that could have been obtained by bidding x up to the current period can be calculated for any fixed value of x ∈ F. Specifically, the average payoff r̂_{t,k}(x_k) for a good k as a function of the bid value x_k can be calculated at period t + 1 by using observations up to t, i.e.,

    r̂_{t,k}(x_k) = (1/t) Σ_{i=1}^t (π_{i,k} − λ_{i,k}) 1{x_k ≥ λ_{i,k}}.

For example, at the end of the first period, r̂_{1,k}(x_k) = (π_{1,k} − λ_{1,k}) 1{x_k ≥ λ_{1,k}}, as illustrated in Fig. 1a. For t ≥ 2, this can be expressed recursively:

    r̂_{t,k}(x_k) = ((t−1)/t) r̂_{t−1,k}(x_k),                            if x_k < λ_{t,k};
    r̂_{t,k}(x_k) = ((t−1)/t) r̂_{t−1,k}(x_k) + (1/t)(π_{t,k} − λ_{t,k}),  if x_k ≥ λ_{t,k}.     (4)

Since each observation introduces a new breakpoint, and the value of the average payoff function is constant between two consecutive breakpoints, we observe that r̂_{t,k}(x_k) is a piece-wise constant function with at most t breakpoints. Let the vector of the order statistics of the observed auction clearing prices {λ_{i,k}}_{i=1}^t and zero be λ^{(k)} = [0, λ_{(1),k}, ..., λ_{(t),k}]^⊤, and let the vector of associated average payoffs be r^{(k)}, i.e., r^{(k)}_i = r̂_{t,k}(λ^{(k)}_i). Then, r̂_{t,k}(x_k) can be expressed by the pair (λ^{(k)}, r^{(k)}); e.g., see Fig. 1b.

[Figure 1: Piece-wise constant average payoff function of good k. Panel (a): t = 1; panel (b): t = 4.]

For a vector y, let y_{m:n} = (y_m, y_{m+1}, ..., y_n) denote the sequence of entries from m to n. Initialize (λ^{(k)}, r^{(k)}) = (0, 0) at the beginning of the first period. Then, at each period t ≥ 1, the pair (λ^{(k)}, r^{(k)}) can be updated recursively as follows:

    (λ^{(k)}, r^{(k)}) = ([λ^{(k)}_{1:i_k}, λ_{t,k}, λ^{(k)}_{i_k+1:t}]^⊤, [((t−1)/t) r^{(k)}_{1:i_k}, ((t−1)/t) r^{(k)}_{i_k:t} + (1/t)(π_{t,k} − λ_{t,k})]^⊤),     (5)

where i_k = max_{i : λ^{(k)}_i < λ_{t,k}} i at period t.

Consequently, the overall average payoff function r̂_t(x) can be expressed as a sum of the average payoff functions of the individual goods. Instead of the unknown expected payoff r(x), let us consider the maximization of the average payoff function, which corresponds to the ERM approach, i.e.,

    max_{x ∈ F} r̂_t(x) = max_{x ∈ F} Σ_{k=1}^K r̂_{t,k}(x_k).     (6)

Due to the piece-wise constant structure, choosing x_k = λ^{(k)}_i for some i ∈ {1, ..., t + 1} contributes the same amount to the overall payoff as choosing any x_k ∈ [λ^{(k)}_i, λ^{(k)}_{i+1}) if i < t + 1, and any x_k ≥ λ^{(k)}_i if i = t + 1. However, choosing x_k = λ^{(k)}_i utilizes a smaller portion of the budget. Hence, an optimal solution to (6) can be obtained by solving the following integer linear program:

    maximize_{{z_k}_{k=1}^K}  Σ_{k=1}^K (r^{(k)})^⊤ z_k
    subject to                Σ_{k=1}^K (λ^{(k)})^⊤ z_k ≤ B,
                              1^⊤ z_k ≤ 1,       ∀k = 1, ..., K,
                              z_{k,i} ∈ {0, 1},  ∀i = 1, ..., t + 1; ∀k = 1, ..., K,     (7)

where the bid value is x_k = (λ^{(k)})^⊤ z_k for good k.

Observe that (7) is a multiple choice knapsack problem (MCKP) [8], a generalization of the 0-1 knapsack problem. Unfortunately, (7) is NP-hard [8]. If we had a polynomial-time algorithm that finds an optimal solution x ∈ F to (6), then we could obtain the solution of (7) in polynomial time too, by setting z_{k,i} = 1 with i = max_{i : λ^{(k)}_i ≤ x_k} i for each k. Therefore, (6) is also NP-hard, and, to the best of our knowledge, there is no method in the ERM literature [27], which mostly focuses on classification problems, suitable for the specific problem at hand.

3.2 Dynamic programming on discrete set (DPDS) policy

Next, we present an approach that discretizes the feasible set using intervals of equal length and optimizes the average payoff on this new discrete set via a dynamic program. Although this approach does not solve (6), the solution can be arbitrarily close to the optimum, depending on the choice of the interval length, under the assumption of a Lipschitz continuous expected payoff function. To exploit the smoothness of Lipschitz continuity, discretization of the continuous feasible set has been used previously in the continuous MAB literature [17, 14]. However, differently from the MAB literature, in this paper the discretization approach is also utilized to reduce the computational complexity of an NP-hard problem.

Let α_t be an integer sequence increasing with t, and let D_t = {0, B/α_t, 2B/α_t, ..., B}, as illustrated in Fig. 2.
Then, the new discrete set is given by F_t = {x ∈ F : x_k ∈ D_t, ∀k ∈ {1, ..., K}}. Our goal is to optimize r̂_t(·) on the new set F_t rather than F, i.e.,

    max_{x_{t+1} ∈ F_t} r̂_t(x_{t+1}).     (8)

[Figure 2: Example of the discretization of the decision space for good k when t = 4.]

Now, we use the dynamic programming approach that has been used to solve 0-1 knapsack problems, including the MCKP given in (7) [28]. However, a direct implementation of this approach results in pseudo-polynomial computational complexity in the case of 0-1 knapsack problems. The discretization of the feasible set with equal interval length reduces the computational complexity to polynomial time. We define the maximum payoff one can collect with budget b among goods {1, ..., n}, when the bid value x_k is restricted to the set D_t for each good k, as

    V_n(b) = max_{{x_k}_{k=1}^n : Σ_{k=1}^n x_k ≤ b, x_k ∈ D_t ∀k} Σ_{k=1}^n r̂_{t,k}(x_k).

Then, the following recursion can be used to solve for V_K(B), which gives the optimal solution to (8):

    V_n(jB/α_t) = 0,                                                          if n = 0, j ∈ {0, 1, ..., α_t};
    V_n(jB/α_t) = max_{0 ≤ i ≤ j} (r̂_{t,n}(iB/α_t) + V_{n−1}((j − i)B/α_t)),  if 1 ≤ n ≤ K, j ∈ {0, 1, ..., α_t}.     (9)

This is the Bellman equation, where V_n(b) is the maximum total payoff one can collect using the remaining budget b and the remaining n goods. Its optimality can be shown via a simple induction argument. Recall that r̂_{t,n}(0) = 0 for all (t, n) pairs due to the assumption of positive day-ahead prices. Recursion (9) can be solved starting from n = 1 and proceeding to n = K, where, for each n, V_n(b) is calculated for all b ∈ D_t.
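As a concrete illustration, recursion (9) and the backtracking of the optimal bids can be sketched in a few lines of Python. This is our own sketch, not the authors' implementation; `avg_payoff(n, x)` is an assumed helper that returns the average payoff r̂ of bidding x on good n, with avg_payoff(n, 0) = 0:

```python
def dpds_bids(avg_payoff, K, B, alpha):
    """Maximize sum_n avg_payoff(n, x_n) over bids x_n in {0, B/alpha, ..., B}
    subject to sum_n x_n <= B, via the knapsack-style recursion (9):
        V_n(j) = max_{0 <= i <= j} [ avg_payoff(n, i*B/alpha) + V_{n-1}(j - i) ].
    Returns the optimal bid vector and its total average payoff."""
    step = B / alpha
    V = [0.0] * (alpha + 1)                      # V_0(j) = 0 for every budget level j
    choice = [[0] * (alpha + 1) for _ in range(K)]
    for n in range(K):
        V_new = [0.0] * (alpha + 1)
        for j in range(alpha + 1):               # j * step units of budget available
            best_val, best_i = float("-inf"), 0
            for i in range(j + 1):               # spend i * step on good n
                val = avg_payoff(n, i * step) + V[j - i]
                if val > best_val:
                    best_val, best_i = val, i
            V_new[j], choice[n][j] = best_val, best_i
        V = V_new
    bids, j = [0.0] * K, alpha                   # backtrack the optimal bids
    for n in range(K - 1, -1, -1):
        bids[n] = choice[n][j] * step
        j -= choice[n][j]
    return bids, V[alpha]
```

The triple loop performs at most α_t + 1 comparisons per (n, j) pair, matching the O(Kα_t²) count of the recursion once the average payoff values are available.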
Since the computation of V_n(b) requires at most α_t + 1 comparisons for any fixed value of n ∈ {1, ..., K} and b ∈ D_t, the recursion has a computational complexity on the order of Kα_t² once the average payoff values r̂_{t,n}(x_n) for all x_n ∈ D_t and n ∈ {1, ..., K} are given. For each n ∈ {1, ..., K}, the computation of r̂_{t,n}(x_n) for all x_n ∈ D_t introduces an additional computational complexity of at most the order of t, which can be observed from the update step of (λ^{(k)}, r^{(k)}) given in (5). Hence, the total computational complexity of DPDS is O(K max(t, α_t²)) at each period t.

3.3 Convergence and regret of DPDS policy

Under the assumption of Lipschitz continuity, Theorem 1 shows that the value of DPDS converges to the value of the optimal policy under the known model at a rate faster than or equal to √(log t / t) if the DPDS algorithm parameter is α_t = ⌈t^γ⌉ with γ ≥ 1/2. Consequently, the regret growth rate of DPDS is upper bounded by O(√(T log T)). If γ = 1/2, then the computational complexity of the algorithm is bounded by O(Kt) at each period t, and the total complexity over the entire horizon is O(KT²).

Theorem 1. Let x^DPDS_{t+1} denote the bid of the DPDS policy for period t + 1. If r(·) is Lipschitz continuous on F with p-norm and Lipschitz constant L, then, for any γ > 0 and for DPDS parameter choice α_t ≥ 2,

    E(r(x*) − r(x^DPDS_{t+1})) ≤ LK^{1/p}B/α_t + √(2(γ+1)K + 1) (u − l) √(log t / t) + 4 min(u − l, LK^{1/p}B) α_t^K / t^{(γ+1)K + 1/2},     (10)

and for α_t = max(⌈t^γ⌉, 2) with γ ≥ 1/2,

    R^DPDS_T(f) ≤ 2(LK^{1/p}B + 4 min(u − l, LK^{1/p}B)) √T + 2√(2(γ+1)K + 1) (u − l) √(T log T).     (11)

In fact, we can relax the uniform Lipschitz continuity condition. Under the weaker condition that |r(x*) − r(x)| ≤ L‖x* − x‖_p^q for all x ∈ F and for some constant L > 0, the incremental regret bound given in (10) becomes

    E(r(x*) − r(x^DPDS_{t+1})) ≤ LK^{q/p}(B/α_t)^q + (u − l)(√(2(γ+1)K + 1) √(log t / t) + 4 α_t^K t^{−(γ+1)K − 1/2}).

The proof of Theorem 1 is derived by showing that the value of x*_{t+1} = arg max_{x ∈ F_t} r(x) converges to the value of x* due to Lipschitz continuity, and that the value of x^DPDS_{t+1} converges to the value of x*_{t+1} via the use of a concentration inequality inspired by [20, 17].

Even though the regret upper bound in Theorem 1 depends linearly on the budget B, this dependence can be avoided at the expense of an increase in computational complexity. For example, in the literature, the reward is generally assumed to be in the unit interval, i.e., l = 0 and u = 1, and the expected reward is assumed to be Lipschitz continuous with the Euclidean norm and constant L = 1. In this case, by following the proof of Theorem 1, we observe that assigning γ = 1/2 and α_t = max(⌈αt^γ⌉, 2) for some α > 0 gives a regret upper bound of 2B√(KT)/α + 12√T + √(KT log T) + α√T for T > α + 1. Consequently, if B = O(K), then O(K^{3/4}√(KT log T)) regret is achievable by setting α = K^{3/4}.

3.4 Lower bound of regret for any bidding policy

We now show that DPDS in fact achieves the slowest possible regret growth.
Specifically, Theorem 2 states that, for any bidding policy µ and horizon T, there exists a distribution f for which the regret grows no slower than the square root of the horizon T.

Theorem 2. Consider the case where K = 1, B = 1, and λ_t and π_t are independent random variables with distributions

    f_λ(λ_t) = ε^{−1} 1{(1 − ε)/2 ≤ λ_t ≤ (1 + ε)/2}  and  f_π(π_t) = Bernoulli(π̄),

respectively. Let f(λ_t, π_t) = f_λ(λ_t) f_π(π_t) and ε = T^{−1/2}/(2√5). Then, for any bidding policy µ,

    R^µ_T(f) ≥ (1/(16√5)) √T,

either for π̄ = 1/2 + ε or for π̄ = 1/2 − ε.

As seen in Theorem 2, we choose a specific distribution for the auction clearing and spot prices. Observe that, for this distribution, the payoff function is Lipschitz continuous with Lipschitz constant L = 3/2, because the magnitude of the derivative of the payoff function satisfies |r′(x)| ≤ |π̄ − x|/ε ≤ 3/2 for (1 − ε)/2 ≤ x ≤ (1 + ε)/2, and r′(x) = 0 otherwise. So it satisfies the condition given in Theorem 1.

The proof of Theorem 2 is obtained by showing that, every time the bid is cleared, an incremental regret greater than ε/2 is incurred under the distribution with π̄ = 1/2 − ε; otherwise, an incremental regret greater than ε/2 is incurred under the distribution with π̄ = 1/2 + ε. However, to distinguish between these two distributions, one needs Ω(T) samples, which results in a regret lower bound of Ω(√T).
The bound is obtained by adapting a similar argument used by [29] in the context of the non-stochastic MAB problem.

4 Empirical study

New York ISO (NYISO), which consists of 11 zones, allows virtual transactions at zonal nodes only. So, we use historical DA and RT prices of these zones from 2011 to 2016 [30]. Since the price for each hour is different at each zone, there are 11 × 24 different locations, i.e., zone-hour pairs, to bid on every day. The prices are per unit (MWh) prices. We also consider buy and sell bids simultaneously for all locations. As explained in Sec. 1.1, a sell bid is a bid to sell in the DA market with an obligation to buy back in the RT market. Hence, the profit of a sell bid at period t is (λ_t − π_t)^⊤ 1{x_t ≤ λ_t}. Generally, an upper bound p̄ for the DA prices is known, e.g., p̄ = $1000 for NYISO. We convert a sell bid to a buy bid by using x^sell_t = p̄ − x_t, λ^sell_t = p̄ − λ_t, and π^sell_t = p̄ − π_t instead of x_t, λ_t, and π_t. The NYISO DA market for day t closes at 5:00 am on day t − 1. Hence, the RT prices of all hours of day t − 1 cannot be observed before the bid submission for day t. Therefore, the most recent information used before the submission for day t consists of the observations from day t − 2.

Figure 3: Cumulative profit trajectory of year y for B = $100,000; panels (a)–(e) show y = 2012 through y = 2016.

We compare DPDS with three algorithms. One of them is UCBID-GR, inspired by UCBID [7]. At each day, UCBID-GR sorts all locations according to their profitabilities, i.e., their price spread (the difference between DA and RT price) sample means.
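This sort-then-allocate rule, combined with the greedy budget rule described next, can be sketched as follows; the function and variable names are illustrative, and the stopping details are our own assumptions:

```python
import numpy as np

def ucbid_gr_bids(spread_means, rt_price_means, budget):
    """Greedy allocation sketch: visit locations in decreasing order of their
    sample-mean price spread (DA minus RT) and bid the RT-price sample mean
    at each, until the remaining budget is insufficient."""
    spread_means = np.asarray(spread_means, dtype=float)
    bids = np.zeros(len(spread_means))
    remaining = float(budget)
    for k in np.argsort(-spread_means):   # most profitable location first
        cost = rt_price_means[k]          # bid level = RT price sample mean
        if spread_means[k] <= 0 or cost > remaining:
            break                         # unprofitable or unaffordable: stop
        bids[k] = cost
        remaining -= cost
    return bids
```

For example, with spreads [5, −1, 3], RT-price means [10, 20, 30], and budget 35, only the first location is funded: the next-best location would cost 30, but only 25 remains.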
Then, starting from the most profitable location, UCBID-GR sets the bid of a location equal to its RT price sample mean until no sufficient budget is left.

The second algorithm, referred to as SA, is a variant of the Kiefer-Wolfowitz stochastic approximation method. SA approximates the gradient of the payoff function by using the current observation and updates the bid for each k as follows:

x_{t,k} = x_{t−1,k} + a_t ( (π_{t−2,k} − λ_{t−2,k}) (1{x_{t−1,k} + c_t ≥ λ_{t−2,k}} − 1{x_{t−1,k} ≥ λ_{t−2,k}}) ) / c_t.

Then, x_t is projected onto the feasible set F.

The last algorithm is SVM-GR, which is inspired by the use of support vector machines (SVM) by Tang et al. [31] to determine whether a buy or a sell bid is profitable at a location, i.e., whether the price spread is positive or negative. Due to the possible correlation of the price spread at a location on day t with the price spreads observed recently at that and other locations, the input of the SVM for each location is set to the price spreads of all locations from day t − 7 to day t − 2. To test the SVM-GR algorithm on a particular year, for each location, the data from the previous year is used to train the SVM and to determine the average profit, i.e., the average price spread, and the bid level that will be accepted with 95% confidence in the event that a buy or a sell bid is profitable. For the test year, at each period, SVM-GR first determines whether a buy or a sell bid is profitable for each location.
Then, SVM-GR sorts all locations according to their average profits and, starting from the most profitable location, sets the bid of a location equal to the bid level with 95% confidence of acceptance until no sufficient budget is left.

To evaluate the performance for a given year, the DPDS, UCBID-GR, and SA algorithms were also trained starting from the beginning of the previous year. The algorithm parameter of DPDS was set as α_t = t, and the step sizes a_t and c_t of SA were set as 20000/t and 2000/t^{1/4}, respectively.

For B = $100,000, the cumulative profit trajectories of five consecutive years are given in Fig. 3. We observe that DPDS obtains a significant profit in all cases and outperforms the other algorithms consistently, except in 2015, when SVM-GR makes approximately 25% more profit. However, in three out of five years, SVM-GR suffers a considerable loss. In general, UCBID-GR performs quite well except in 2016, and the SA algorithm incurs a loss almost every year.

5 Conclusion

By applying general techniques such as ERM, a discretization approach, and dynamic programming, we derive a practical and efficient algorithm for the algorithmic bidding problem under a budget constraint in repeated multi-commodity auctions. We show that the expected payoff of the proposed algorithm, DPDS, converges to that of the optimal strategy at a rate no slower than √(log t/t), which results in O(√(T log T)) regret. By showing that the regret is lower bounded by Ω(√T) for any bidding strategy, we prove that DPDS is order optimal up to a √(log T) term.

For the motivating application of virtual bidding in electricity markets (see Sec. 1.1), the stochastic setting studied in this paper is natural because electricity markets are competitive, which makes the existence of an adversary very unlikely.
However, it is also of interest to study the adversarial setting to extend the results to other applications. For example, the adversarial setting of our problem is a special case of the no-regret learning problem for Simultaneous Second Price Auctions (SiSPA), studied by Daskalakis and Syrgkanis [32] and Dudik et al. [33].

In particular, to deal with the adversarial setting, it is possible to use our dynamic programming approach as the offline oracle for the Oracle-Based Generalized FTPL algorithm proposed by Dudik et al. [33] if we fix the discretized action set over the whole time horizon. More specifically, let the interval length of the discretization be B/m, i.e., α_t = m. Then, it is possible to show that a 1-admissible translation matrix with K⌈log m⌉ columns is implementable with complexity m. Consequently, the no-regret result of Dudik et al. [33] holds with a regret bound of O(K√(T log m)) if we measure the performance of the algorithm against the best action in hindsight in the discretized finite action set rather than in the original continuous action set considered here. Unfortunately, as shown by Weed et al. [7], it is not possible to achieve sublinear regret with a fixed discretization for the specific problem considered in this paper. Hence, further work is required to see whether this method can be extended to obtain no-regret learning in the adversarial setting over the original continuous action set.

Acknowledgments

We would like to thank Professor Robert Kleinberg for the insightful discussion.

This work was supported in part by the National Science Foundation under Award 1549989 and by the Army Research Laboratory Network Science CTA under Cooperative Agreement W911NF-09-2-0053.

References

[1] Paul Milgrom. Putting Auction Theory to Work. Cambridge University Press, 2004.

[2] PJM. Virtual transactions in the PJM energy markets. Technical report, Oct 2015.
http://www.pjm.com/~/media/committees-groups/committees/mc/20151019-webinar/20151019-item-02-virtual-transactions-in-the-pjm-energy-markets-whitepaper.ashx.

[3] Ruoyang Li, Alva J. Svoboda, and Shmuel S. Oren. Efficiency impact of convergence bidding in the California electricity market. Journal of Regulatory Economics, 48(3):245–284, 2015.

[4] John E. Parsons, Cathleen Colbert, Jeremy Larrieu, Taylor Martin, and Erin Mastrangelo. Financial arbitrage and efficient dispatch in wholesale electricity markets, February 2015. https://ssrn.com/abstract=2574397.

[5] Wenyuan Tang, Ram Rajagopal, Kameshwar Poolla, and Pravin Varaiya. Model and data analysis of two-settlement electricity market with virtual bidding. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6645–6650, 2016.

[6] David B. Patton, Pallas LeeVanSchaick, and Jie Chen. 2014 state of the market report for the New York ISO markets. Technical report, May 2015. http://www.nyiso.com/public/webdocs/markets_operations/documents/Studies_and_Reports/Reports/Market_Monitoring_Unit_Reports/2014/NYISO2014SOMReport__5-13-2015_Final.pdf.

[7] Jonathan Weed, Vianney Perchet, and Philippe Rigollet. Online learning in repeated auctions. In 29th Annual Conference on Learning Theory, pages 1562–1583, 2016.

[8] Hans Kellerer, Ulrich Pferschy, and David Pisinger. The Multiple-Choice Knapsack Problem, pages 317–347. Springer Berlin Heidelberg, 2004.

[9] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. In 21st Conference on Learning Theory, pages 425–436, 2008.

[10] Nicolò Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. In Proceedings of the Twenty-fifth Annual ACM Symposium on Theory of Computing, pages 382–391. ACM, 1993.

[11] Yoav Freund and Robert E.
Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37. Springer-Verlag, 1995.

[12] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, 1995.

[13] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[14] Robert Kleinberg and Aleksandrs Slivkins. Sharp dichotomies for regret minimization in metric spaces. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pages 827–846. Society for Industrial and Applied Mathematics, 2010.

[15] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. arXiv preprint arXiv:1312.1277v2, 2015.

[16] Walid Krichene, Maximilian Balandat, Claire Tomlin, and Alexandre Bayen. The hedge algorithm on a continuum. In Proceedings of the 32nd International Conference on Machine Learning, pages 824–832. JMLR.org, 2015.

[17] Robert D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 697–704. MIT Press, 2005.

[18] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

[19] Eric W. Cope. Regret and convergence bounds for a class of continuum-armed bandit problems.
IEEE Transactions on Automatic Control, 54(6):1243–1253, 2009.

[20] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[21] Kareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. Budget optimization for sponsored search: Censored learning in MDPs. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 54–63. AUAI Press, 2012.

[22] Long Tran-Thanh, Lampros Stavrogiannis, Victor Naroditskiy, Valentin Robu, Nicholas R. Jennings, and Peter Key. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 809–818. AUAI Press, 2014.

[23] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Learning prices for repeated auctions with strategic buyers. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1169–1177. Curran Associates, Inc., 2013.

[24] Mehryar Mohri and Andres Munoz. Optimal regret minimization in posted-price auctions with strategic buyers. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1871–1879. Curran Associates, Inc., 2014.

[25] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[26] Vladimir Vapnik. Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 831–838. Morgan-Kaufmann, 1992.

[27] Shai Shalev-Shwartz and Shai Ben-David.
Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[28] Krzysztof Dudziński and Stanisław Walukiewicz. Exact methods for the knapsack problem and its generalizations. European Journal of Operational Research, 28(1):3–21, 1987.

[29] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[30] NYISO Website, 2017. http://www.nyiso.com/public/markets_operations/market_data/pricing_data/index.jsp.

[31] Wenyuan Tang, Ram Rajagopal, Kameshwar Poolla, and Pravin Varaiya. Private communications, 2017.

[32] Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 219–228, 2016.

[33] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 528–539, 2017.