{"title": "Real-Time Bidding with Side Information", "book": "Advances in Neural Information Processing Systems", "page_first": 5162, "page_last": 5172, "abstract": "We consider the problem of repeated bidding in online advertising auctions when some side information (e.g. browser cookies) is available ahead of submitting a bid in the form of a $d$-dimensional vector. The goal for the advertiser is to maximize the total utility (e.g. the total number of clicks) derived from displaying ads given that a limited budget $B$ is allocated for a given time horizon $T$. Optimizing the bids is modeled as a contextual Multi-Armed Bandit (MAB) problem with a knapsack constraint and a continuum of arms. We develop UCB-type algorithms that combine two streams of literature: the confidence-set approach to linear contextual MABs and the probabilistic bisection search method for stochastic root-finding. Under mild assumptions on the underlying unknown distribution, we establish distribution-independent regret bounds of order $\\tilde{O}(d \\cdot \\sqrt{T})$ when either $B = \\infty$ or when $B$ scales linearly with $T$.", "full_text": "Real-Time Bidding with Side Information\n\nArthur Flajolet\n\nMIT, ORC\n\nflajolet@mit.edu\n\nPatrick Jaillet\n\nMIT, EECS, LIDS, ORC\n\njaillet@mit.edu\n\nAbstract\n\nWe consider the problem of repeated bidding in online advertising auctions when\nsome side information (e.g. browser cookies) is available ahead of submitting a bid\nin the form of a d-dimensional vector. The goal for the advertiser is to maximize\nthe total utility (e.g. the total number of clicks) derived from displaying ads given\nthat a limited budget B is allocated for a given time horizon T . Optimizing the bids\nis modeled as a contextual Multi-Armed Bandit (MAB) problem with a knapsack\nconstraint and a continuum of arms. 
We develop UCB-type algorithms that combine two streams of literature: the confidence-set approach to linear contextual MABs and the probabilistic bisection search method for stochastic root-finding. Under mild assumptions on the underlying unknown distribution, we establish distribution-independent regret bounds of order Õ(d · √T) when either B = ∞ or when B scales linearly with T.

1 Introduction

On the internet, advertisers and publishers now interact through real-time marketplaces called ad exchanges. Through them, any publisher can sell the opportunity to display an ad when somebody is visiting a webpage he or she owns. Conversely, any advertiser interested in such an opportunity can pay to have his or her ad displayed. In order to match publishers with advertisers and to determine prices, ad exchanges commonly use a variant of second-price auctions which typically runs as follows. Each participant is initially provided with some information about the person that will be targeted by the ad (e.g. browser cookies, IP address, and operating system) along with some information about the webpage (e.g. theme) and the ad slot (e.g. width and visibility). Based on this limited knowledge, advertisers must submit a bid in a timely fashion if they deem the opportunity worthwhile. Subsequently, the highest bidder gets his or her ad displayed and is charged the second-highest bid. Moreover, the winner can usually track the customer's interaction with the ad (e.g. clicks). Because the auction is sealed, very limited feedback is provided to the advertiser if the auction is lost. In particular, the advertiser does not receive any customer feedback in this scenario. In addition, the demand for ad slots, the supply of ad slots, and the websurfers' profiles cannot be predicted ahead of time and are thus commonly modeled as random variables, see [19].
These two features contribute to making the problem of bid optimization in ad auctions particularly challenging for advertisers.

1.1 Problem statement and contributions

We consider an advertiser interested in purchasing ad impressions through an ad exchange. As is standard practice in the online advertising industry, we suppose that the advertiser has allocated a limited budget B for a limited period of time, which corresponds to the next T ad auctions. Rounds, indexed by t ∈ N, correspond to ad auctions in which the advertiser participates. At the beginning of round t ∈ N, some contextual information about the ad slot and the person that will be targeted is revealed to the advertiser in the form of a multidimensional vector xt ∈ X, where X is a subset of R^d. Without loss of generality, the coordinates of xt are assumed to be normalized in such a way that ‖x‖∞ ≤ 1 for all x ∈ X. Given xt, the advertiser must submit a bid bt in a timely fashion. If bt is larger than the highest bid submitted by the competitors, denoted by pt and also referred to as the market price, the advertiser wins the auction, is charged pt, and gets his or her ad displayed, from which he or she derives a utility vt. Monetary amounts and utility values are assumed to be normalized in such a way that bt, pt, vt ∈ [0, 1]. In this model, one of the competitors is the publisher himself, who submits a reserve price, so that pt > 0. No one wins the auction if no bid is larger than the reserve price. For the purpose of modeling, we suppose that ties are broken in favor of the advertiser, but this choice is arbitrary and by no means a limitation of the approach. Hence, the advertiser collects a reward rt = vt · 1{bt ≥ pt} and is charged ct = pt · 1{bt ≥ pt} at the end of round t.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Since the monetary value of getting an ad displayed is typically difficult to assess, vt and ct may be expressed in different units and thus cannot be compared directly in general, which makes the problem two-dimensional. This is the case, for example, when the goal of the advertiser is to maximize the number of clicks, in which case vt = 1 if the ad was clicked on and vt = 0 otherwise. We consider a stochastic setting where the environment and the competitors are not fully adversarial. Specifically, we assume that, at any round t ∈ N, the vector (xt, vt, pt) is jointly drawn from a fixed probability distribution ν independently from the past. While this assumption may seem unnatural at first, as the other bidders also act as learning agents, it is motivated by the following observation. In our setting, we consider that there are many bidders, each participating in a small subset of a large number of auctions, who value ad opportunities very differently depending on the intended audience, the nature and topic of the ads, and other technical constraints. Since bidders have no idea who they will be competing against for a particular ad (because the auctions are sealed), they are naturally led to be oblivious to the competition and to bid with the only objective of maximizing their own objective functions. Given the variety of objective functions and the large number of bidders and ad auctions, we argue that, by the law of large numbers, the process (xt, pt, vt)t=1,...,T that we experience as a bidder is i.i.d., at least for a short period of time. Moreover, while the assumption that the distribution of (xt, vt, pt) is stationary may only be valid for a short period of time, advertisers tend to participate in a large number of ad auctions per second, so that T and B are typically large values, which motivates an asymptotic study.
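As an illustration of the round mechanics just described, the following sketch simulates one auction round, including the censored feedback structure. The sampling distribution ν used here (and the helper name `play_round`) is purely hypothetical.

```python
import random

# Illustrative simulation of one auction round as described above: the
# advertiser observes x_t, submits b_t, wins iff b_t >= p_t (ties broken in
# the advertiser's favor), pays the market price p_t, and sees (v_t, p_t)
# only upon winning. The distribution nu used below is purely hypothetical.

def play_round(bid, rng):
    x = [rng.uniform(0.0, 1.0) for _ in range(3)]   # context, ||x||_inf <= 1
    p = rng.uniform(0.05, 1.0)                      # market price, p_t > 0
    v = float(rng.random() < 0.5 * x[0])            # utility, e.g. a click
    won = bid >= p
    reward = v if won else 0.0                      # r_t = v_t * 1{b_t >= p_t}
    cost = p if won else 0.0                        # c_t = p_t * 1{b_t >= p_t}
    feedback = (v, p) if won else None              # censored when lost
    return reward, cost, feedback

rng = random.Random(1)
results = [play_round(0.6, rng) for _ in range(100)]
```

Note that a lost auction yields neither reward nor cost, and the advertiser learns nothing about vt or pt, which is the source of the exploration incentive discussed later.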
We generically denote by (X, V, P) a vector of random variables distributed according to ν. We make a structural assumption about ν, which we use throughout the paper.

Assumption 1. The random variables V and P are conditionally independent given X. Moreover, there exists θ∗ ∈ R^d such that E[V | X] = X^T θ∗ and ‖θ∗‖∞ ≤ 1.

Note, in particular, that Assumption 1 is satisfied if V and P are deterministic functions of X. The first part of Assumption 1 is very natural since: (i) X captures all and only the information about the ad shared with all bidders before submitting a bid and (ii) websurfers are oblivious to the ad auctions that take place behind the scenes to determine which ad they will be presented with. The second part of Assumption 1 is standard in the literature on linear contextual MABs, see [1] and [16], and is arguably the simplest model capturing a dependence between xt and vt. When the advertiser's objective is to maximize the number of clicks, this assumption translates into a linear Click-Through Rate (CTR) model.

We denote by (Ft)t∈N (resp. (F̃t)t∈N) the natural filtration generated by ((xt, vt, pt))t∈N (resp. ((xt+1, vt, pt))t∈N). Since the advertiser can keep bidding only so long as he or she does not run out of money or time, he or she can no longer participate in ad auctions at round τ∗, mathematically defined by:

τ∗ = min(T + 1, min{t ∈ N | Σ_{τ=1}^{t} cτ > B}).

Note that τ∗ is a stopping time with respect to (Ft)t∈N. The difficulty for the advertiser when it comes to determining how much to bid at each round lies in the fact that the underlying distribution ν is initially unknown.
This task is further complicated by the fact that the feedback provided to the advertiser upon bidding bt is partially censored: pt and vt are only revealed if the advertiser wins the auction, i.e. if bt ≥ pt. In particular, when bt < pt, the advertiser can never evaluate how much reward would have been obtained and what price would have been charged if he or she had submitted a higher bid. The goal for the advertiser is to design a non-anticipating algorithm that, at any round t, selects bt based on the information acquired in the past so as to keep the pseudo-regret, defined as:

R_{B,T} = EROPT(B, T) − E[Σ_{t=1}^{τ∗−1} rt],

as small as possible, where EROPT(B, T) is the maximum expected sum of rewards that can be obtained by a non-anticipating oracle algorithm that has knowledge of the underlying distribution. Here, an algorithm is said to be non-anticipating if the bid selection process does not depend on future observations. We develop algorithms with bounds on the pseudo-regret that do not depend on the underlying distribution ν, which are referred to as distribution-independent regret bounds. This entails studying the asymptotic behavior of R_{B,T} when B and T go to infinity. For mathematical convenience, we consider that the advertiser keeps bidding even if he or she has run out of time or money, so that all quantities are well defined for any t ∈ N.
Of course, the rewards obtained for t ≥ τ∗ are not taken into account in the advertiser's total reward when establishing regret bounds.

Contributions We develop UCB-type algorithms that combine the ellipsoidal confidence set approach to linear contextual MAB problems with a special-purpose stochastic binary search procedure. When the budget is unlimited or when it scales linearly with time, we show that, under additional technical assumptions on the underlying distribution ν, our algorithms incur a regret R_{B,T} = Õ(d · √T), where the Õ notation hides logarithmic factors in d and T. A key insight is that overbidding is not only essential to incentivize exploration in order to estimate θ∗, but also crucial to find the optimal bidding strategy given θ∗, because bidding higher always provides more feedback in real-time bidding.

1.2 Literature review

To handle the exploration-exploitation trade-off inherent to MAB problems, an approach that has proved to be particularly successful is the optimism in the face of uncertainty paradigm. The idea is to consider all plausible scenarios consistent with the information collected so far and to select the decision that yields the largest reward among all identified scenarios. Auer et al. [7] use this idea to solve the standard MAB problem, where decisions are represented by K ∈ N arms and pulling arm k ∈ {1, ..., K} at round t ∈ {1, ..., T} yields a random reward drawn from an unknown distribution specific to this arm, independently from the past. Specifically, Auer et al. [7] develop the Upper Confidence Bound algorithm (UCB1), which consists in selecting the arm with the currently largest upper confidence bound on its mean reward, and establish near-optimal regret bounds. This approach has since been successfully extended to a number of more general settings.
Of most notable interest to us are: (i) linear contextual MAB problems, where, for each arm k and at each round t, some context x_t^k is provided to the decision maker ahead of pulling any arm and the expected reward of arm k is θ∗^T x_t^k for some unknown θ∗ ∈ R^d, and (ii) the Bandits with Knapsacks (BwK) framework, an extension to the standard MAB problem allowing to model resource consumption.

UCB-type algorithms for linear contextual MAB problems were first developed in [6] and later extended and improved upon in [1] and [16]. In this line of work, the key idea is to build, at any round t, an ellipsoidal confidence set Ct on the unknown parameter θ∗ and to pull the arm k that maximizes max_{θ∈Ct} θ^T x_t^k. Using this idea, Chu et al. [16] derive Õ(√(d · T)) upper bounds on regret that hold with high probability, where the Õ notation hides logarithmic factors in d and T. While this result is not directly applicable in our setting, partly because of the knapsack constraint, we rely on this technique to estimate θ∗.

The real-time bidding problem considered in this work can be formulated as a BwK problem with contextual information and a continuum of arms. This framework, first introduced in its full generality in [10] and later extended to incorporate contextual information in [11], [3], and [2], captures resource consumption by assuming that pulling any arm incurs the consumption of possibly many different limited resource types by random amounts. BwK problems are notoriously harder to solve than standard MAB problems. For example, sublinear regret cannot be achieved in general for BwK problems when an opponent is adversarially picking the rewards and the amounts of resource consumption at each round, see [10], while this is possible for standard MAB problems, see [8].
The problem becomes even more complex when some contextual information is available at the beginning of each round, as approaches developed for standard contextual MAB problems and for BwK problems fail when applied to contextual BwK problems, see the discussion in [11], which calls for the development of new techniques. Agrawal and Devanur [2] consider a particular case where the expected rewards and the expected amounts of resource consumption are linear in the context and derive, in particular, Õ(√(d · T)) bounds on regret when the initial endowments of resources scale linearly with the time horizon T. These results do not carry over to our setting because the expected costs, and in fact also the expected rewards, are not linear in the context. To the best of our knowledge, the only prior works that deal simultaneously with knapsack constraints and a non-linear dependence of the rewards and the amounts of resource consumption on the contextual information are Agrawal et al. [3] and Badanidiyuru et al. [11]. When there is a finite number of arms K, they derive regret bounds that scale as Õ(√(K · T · ln(Π))), where Π is the size of the set of benchmark policies. To some extent, at least when θ∗ is known, it is possible to apply these results, but this requires discretizing the set of valid bids [0, 1], and the regret bounds thus derived scale as ~T^(2/3), see the analysis in [10], which is suboptimal.

On the modeling side, the most closely related prior works studying repeated ad auctions under the lens of online learning are [25], [23], [17], [12], and [5]. Weed et al. [25] develop algorithms to solve the problem considered in this work when no contextual information is available and when there is no budget constraint, in which case the rewards are defined as rt = (vt − pt) · 1{bt ≥ pt}, but in a more general adversarial setting where few assumptions are made concerning the sequence ((vt, pt))t∈N. They obtain Õ(√T) regret bounds with an improved rate O(ln(T)) in some favorable settings of interest. Inspired by [4], Tran-Thanh et al. [23] study a particular case of the problem considered in this work when no contextual information is available and when the goal is to maximize the number of impressions. They use a dynamic programming approach and claim to derive Õ(√T) regret bounds. Balseiro and Gur [12] identify near-optimal bidding strategies in a game-theoretic setting, assuming that each bidder has a black-box function that maps the contextual information available before bidding to the expected utility derived from displaying an ad (which amounts to assuming that θ∗ is known a priori in our setting). They show that bidding an amount equal to the expected utility derived from displaying an ad normalized by a bid multiplier, to be estimated, is a near-optimal strategy. We extend this observation to the contextual setting. Compared to their work, the difficulty in our setting lies in estimating the bid multiplier and θ∗ simultaneously. Finally, the authors of [5] and [17] take the point of view of the publisher, whose goal is to price ad impressions, as opposed to purchasing them, in order to maximize revenues with no knapsack constraint. Cohen et al. [17] derive O(d² · ln(T/d)) bounds on regret with high probability with a multidimensional binary search.
This class of algorithms was originally developed for solving stochastic root-finding problems, see [22] for an overview, but has also recently appeared in the MAB literature, see [20]. Our approach is largely inspired by the work of Lei et al. [20], who develop a stochastic binary search algorithm to solve a dynamic pricing problem with limited supply but no contextual information, which can be modeled as a BwK problem with a continuum of arms. Dynamic pricing problems with limited supply are often modeled as BwK problems in the literature, see [24], [9], and [20], but, to the best of our knowledge, the availability of contextual information about potential customers is never captured. Inspired by the technical developments introduced in these works, our approach is to characterize a near-optimal strategy in closed form and to refine our estimates of the (usually few) initially unknown parameters involved in the characterization as we make decisions online, implementing this strategy using the latest estimates for the parameters. However, the technical challenge in these works differs from ours in one key aspect: the feedback provided to the decision maker is completely censored in dynamic pricing problems, since the customers' valuations are never revealed, while it is only partially censored in real-time bidding, since the market price is revealed if the auction is won. Making the most of this additional feature enables us to develop a stochastic binary search procedure that can be compounded with the ellipsoidal confidence set approach to linear contextual bandits in order to incorporate contextual information.

Organization The remainder of the paper is organized as follows. In order to increase the level of difficulty progressively, we start by studying the situation of an advertiser with unlimited budget, i.e. B = ∞, in Section 2.
Given that second-price auctions induce truthful bidding when the bidder has no budget constraint, this setting is easier since the optimal bidding strategy is to bid bt = xt^T θ∗ at any round t ∈ N. This drives us to focus on the problem of estimating θ∗, which we do by means of ellipsoidal confidence sets. Next, in Section 3, we study the setting where B is finite and scales linearly with the time horizon T. We show that a near-optimal strategy is to bid bt = xt^T θ∗ / λ∗ at any round t ∈ N, where λ∗ ≥ 0 is a scalar factor whose purpose is to spread the budget as evenly as possible, i.e. E[P · 1{X^T θ∗ ≥ λ∗ · P}] = B/T. Given this characterization, we first assume that θ∗ is known a priori to focus instead on the problem of computing an approximate solution λ ≥ 0 to E[P · 1{X^T θ∗ ≥ λ · P}] = B/T in Section 3.1. We develop a stochastic binary search algorithm for this purpose, which is shown to incur Õ(√T) regret under mild assumptions on the underlying distribution ν. In Section 3.2, we bring the stochastic binary search algorithm together with the estimation method based on ellipsoidal confidence sets to tackle the general problem and derive Õ(d · √T) regret bounds. All the proofs are deferred to the Appendix.

Notations For a vector x ∈ R^d, ‖x‖∞ refers to the L∞-norm of x. For a positive definite matrix M ∈ R^{d×d} and a vector x ∈ R^d, we define the norm ‖x‖_M as ‖x‖_M = √(x^T M x). For x, y ∈ R^d, it is well known that the following Cauchy-Schwarz inequality holds: |x^T y| ≤ ‖x‖_M · ‖y‖_{M⁻¹}. We denote by I_d the identity matrix in dimension d.
We use the standard asymptotic notation O(·) when T, B, and d go to infinity. We also use the notation Õ(·) that hides logarithmic factors in d, T, and B. For x ∈ R, (x)+ refers to the positive part of x. For a finite set S (resp. a compact interval I ⊂ R), |S| (resp. |I|) denotes the cardinality of S (resp. the length of I). For a set S, P(S) denotes the set of all subsets of S. Finally, for a real-valued function f(·), supp f(·) denotes the support of f(·).

2 Unlimited budget

In this section, we suppose that the budget is unlimited, i.e. B = ∞, which implies that the rewards have to be redefined in order to directly incorporate the costs. For this purpose, we assume in this section that vt is expressed in monetary value and we redefine the rewards as rt = (vt − pt) · 1{bt ≥ pt}. Since the budget constraint is irrelevant when B = ∞, we use the notations R_T and EROPT(T) in place of R_{B,T} and EROPT(B, T). As is standard in the literature on MAB problems, we start by analyzing the optimal oracle strategy that has knowledge of the underlying distribution. This will not only guide the design of algorithms when ν is unknown but will also facilitate the regret analysis. The algorithm developed in this section as well as the regret analysis are extensions of the work of Weed et al. [25] to the contextual setting.

Benchmark analysis It is well known that second-price auctions induce truthful bidding in the sense that any participant whose only objective is to maximize the immediate payoff should always bid what he or she thinks the good being auctioned is worth. The following result should thus come as no surprise in the context of real-time bidding given Assumption 1 and the fact that each participant is provided with the contextual information xt before the t-th auction takes place.

Lemma 1. The optimal non-anticipating strategy is to bid bt = xt^T θ∗ at any time period t ∈ N and we have EROPT(T) = Σ_{t=1}^{T} E[(xt^T θ∗ − pt)+].

Lemma 1 shows that the problem faced by the advertiser essentially boils down to estimating θ∗. Since the bidder only gets to observe vt if the auction is won, this gives advertisers a clear incentive to overbid early on so that they can progressively refine their estimates downward as they collect more data points.

Specification of the algorithm Following the approach developed in [6] for linear contextual MAB problems, we define, at any round t, the regularized least squares estimate of θ∗ given all the feedback acquired in the past, θ̂t = Mt⁻¹ Σ_{τ=1}^{t−1} 1{bτ ≥ pτ} · vτ · xτ, where Mt = I_d + Σ_{τ=1}^{t−1} 1{bτ ≥ pτ} · xτ xτ^T, as well as the corresponding ellipsoidal confidence set:

Ct = {θ ∈ R^d | ‖θ − θ̂t‖_{Mt} ≤ δT},

with δT = 2√(d · ln((1 + d · T) · T)). For the reasons mentioned above, we take the optimism in the face of uncertainty approach and bid:

bt = max(0, min(1, max_{θ∈Ct} θ^T xt)) = max(0, min(1, θ̂t^T xt + δT · √(xt^T Mt⁻¹ xt)))     (1)

at any round t. Since Ct was designed with the objective of guaranteeing that θ∗ ∈ Ct with high probability at any round t, irrespective of the number of auctions won in the past, bt is larger than the optimal bid xt^T θ∗ in general, i.e. we tend to overbid.

Regret analysis Concentration inequalities are intrinsic to any kind of learning and are thus key to deriving regret bounds in online learning.
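Before turning to the analysis, the optimistic bid (1) can be sketched in a few lines: the maximization over the ellipsoid has the closed form θ̂t^T xt + δT · ‖xt‖_{Mt⁻¹}, clipped to [0, 1]. The variable names below are illustrative.

```python
import numpy as np

# Sketch of the optimistic bid (1). Maximizing theta^T x over the ellipsoid
# C_t = {theta : ||theta - theta_hat_t||_{M_t} <= delta_T} has the closed form
# theta_hat_t^T x + delta_T * sqrt(x^T M_t^{-1} x), which is then clipped to
# [0, 1]. All variable names are illustrative.

d, T = 3, 1000
delta_T = 2.0 * np.sqrt(d * np.log((1.0 + d * T) * T))

M = np.eye(d)       # M_t = I_d + sum of x_tau x_tau^T over won auctions
y = np.zeros(d)     # sum of v_tau * x_tau over won auctions

def ucb_bid(x):
    theta_hat = np.linalg.solve(M, y)                     # regularized LS estimate
    width = delta_T * np.sqrt(x @ np.linalg.solve(M, x))  # delta_T * ||x||_{M^-1}
    return float(np.clip(theta_hat @ x + width, 0.0, 1.0))

def update(x, v, won):
    # Censored feedback: v_t is observed (and the statistics updated) only
    # when the auction is won.
    global M, y
    if won:
        M += np.outer(x, x)
        y += v * x
```

With no data, θ̂t = 0 and the confidence width saturates the clipping, so early bids equal 1, which is exactly the overbidding incentive discussed above.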
We start with the following lemma, which is a consequence of the results derived in [1] for linear contextual MABs and shows that θ∗ lies in all the ellipsoidal confidence sets with high probability. Assumption 1 is key to establishing this result.

Lemma 2. We have P[θ∗ ∉ ∩_{t=1}^{T} Ct] ≤ 1/T.

Equipped with Lemma 2 along with some standard results for linear contextual bandits, we are now ready to extend the analysis of Weed et al. [25] to the contextual setting.

Theorem 1. Bidding according to (1) incurs a regret R_T = Õ(d · √T).

Alternative algorithm with lazy updates As first pointed out by Abbasi-Yadkori et al. [1] in the context of linear bandits, updating the confidence set Ct at every round is not only inefficient but also unnecessary from a performance standpoint. Instead, we can perform batch updates, only updating Ct using all the feedback collected in the past at rounds t for which det(Mt) has increased by a factor at least (1 + A) compared to the last time there was an update, for some constant A > 0 of our choosing. This leads to an interesting trade-off between computational efficiency and deterioration of the regret bound, captured in our next result. For mathematical convenience, we keep the same notations as when we were updating the confidence sets at every round. The only difference lies in the fact that the bid submitted at time t is now defined as:

bt = max(0, min(1, max_{θ∈Cτt} θ^T xt)),     (2)

where τt is the last round before round t at which the last batch update happened.

Theorem 2. Bidding according to (2) at any round t incurs a regret R_T = Õ(d · √(A · T)).

The fact that we can afford lazy updates will turn out to be important to tackle the general case in Section 3.2, since we will only be able to update the confidence sets at most O(ln(T)) times overall.

3 Limited budget

In this section, we consider the setting where B is finite and scales linearly with the time horizon T. We will need the following assumptions for the remainder of the paper.

Assumption 2. (a) B/T = β is a constant independent of any other relevant quantities.
(b) There exists r > 0, known to the advertiser, such that pt ≥ r for all t ∈ N.
(c) We have E[1/X^T θ∗] < ∞.
(d) The random variable P has a continuous conditional probability density function given the occurrence of the value x of X, denoted by fx(·), that is upper bounded by L̄ < ∞.

Conditions (a) and (b) are very natural in real-time bidding, where the budget scales linearly with time and where r corresponds to the minimum reserve price across ad auctions. Observe that Condition (c) is satisfied, for example, when the probability of a click given any context is no smaller than a (possibly unknown) positive threshold. Condition (d) is motivated by technical considerations that will become clear in the analysis. Note that L̄ is not assumed to be known to the advertiser.

In order to increase the level of difficulty progressively and to prepare for the integration of the ellipsoidal confidence sets, we first look at an artificial setting in Section 3.1 where we assume that there exists a known set C ⊂ R^d such that E[V | X] = min(1, max_{θ∈C} X^T θ) (as opposed to E[V | X] = X^T θ∗) and such that θ∗ ∈ C.
This is to sidestep the estimation problem in a first step, in order to focus on determining an optimal bidding strategy given θ∗. Next, in Section 3.2, we bring together the methods developed in Section 2 and Section 3.1 to tackle the general setting.

3.1 Preliminary work

In this section, we make the following modeling assumption in lieu of E[V | X] = X^T θ∗.

Assumption 3. There exists C ⊂ R^d such that E[V | X] = min(1, max_{θ∈C} X^T θ) and θ∗ ∈ C.

Furthermore, we assume that C is known to the advertiser initially. Of course, we recover the original setting introduced in Section 1 when C = {θ∗} (since V ∈ [0, 1] implies E[V | X] ∈ [0, 1]) and θ∗ is known, but the level of generality considered here will prove useful to tackle the general case in Section 3.2, when we define C as an ellipsoidal confidence set on θ∗. As in Section 2, we start by identifying a near-optimal oracle bidding strategy that has knowledge of the underlying distribution. This will not only guide the design of algorithms when ν is unknown but will also facilitate the regret analysis. We use the shorthand g(X) = min(1, max_{θ∈C} X^T θ) throughout this section.

Benchmark analysis To bound the performance of any non-anticipating strategy, we will be interested in the mappings φ : (λ, C) → E[P · 1{g(X) ≥ λ · P}] and R : (λ, C) → E[g(X) · 1{g(X) ≥ λ · P}] for (λ, C) ∈ [0, 2/r] × P(R^d).
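As an illustration, φ(λ, C) can be approximated by simple Monte Carlo. The sketch below takes the singleton case C = {θ∗}, so that g(X) = min(1, X^T θ∗); the distribution of (X, P) and the helper name `phi_hat` are purely illustrative, with r = 0.1.

```python
import random

# Monte Carlo sketch of phi(lambda, C) = E[P * 1{g(X) >= lambda * P}] in the
# singleton case C = {theta_star}, i.e. g(X) = min(1, X^T theta_star).
# The distribution of (X, P) below is purely illustrative.
d = 3
theta_star = [0.3, 0.3, 0.2]

def phi_hat(lam, n=20000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = [rng.uniform(0.0, 1.0) for _ in range(d)]
        g = min(1.0, sum(xi * ti for xi, ti in zip(x, theta_star)))
        p = rng.uniform(0.1, 1.0)       # market price, bounded below by r = 0.1
        if g >= lam * p:
            total += p                  # price paid when the auction is won
    return total / n
```

At λ = 0 the estimate is close to E[P] (every auction is won), it is non-increasing in λ, and it vanishes for λ ≥ 2/r since then λ · P ≥ 2 > g(X).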
Note that φ(·, C) is non-increasing and that, without loss of generality, we can restrict λ to be no larger than 2/r, because φ(λ, C) = φ(2/r, C) = 0 for λ ≥ 2/r since P ≥ r. Exploiting the structure of the MAB problem at hand, we can bound the sum of rewards obtained by any non-anticipating strategy by the value of a knapsack problem where the weights and the values of the items are drawn in an i.i.d. fashion from a fixed distribution. Since characterizing the expected optimal value of a knapsack problem is a well-studied problem, see [21], we can derive a simple upper bound on EROPT(B, T) through this reduction, as we next show.

Lemma 3. We have EROPT(B, T) ≤ T · R(λ∗, C) + √T/r + 1, where λ∗ ≥ 0 satisfies φ(λ∗, C) = β, or λ∗ = 0 if no such solution exists (i.e. if E[P] < β), in which case φ(λ∗, C) ≤ β.

Lemma 3 suggests that, given C, a good strategy is to bid bt = min(1, min(1, max_{θ∈C} xt^T θ)/λ∗) at any round t. The following result shows that we can actually afford to settle for an approximate solution λ ≥ 0 to φ(λ, C) = β.

Lemma 4. For any λ1, λ2 ≥ 0, we have: |R(λ1, C) − R(λ2, C)| ≤ 1/r · |φ(λ1, C) − φ(λ2, C)|.

Lemma 3 combined with Lemma 4 suggests that the problem of computing a near-optimal bidding strategy essentially reduces to a stochastic root-finding problem for the function |φ(·, C) − β|. As it turns out, the fact that the feedback is only partially censored makes a stochastic bisection search possible with minimal assumptions on φ(·, C). Specifically, we only need φ(·, C) to be Lipschitz, while the technique developed in [20] for a dynamic pricing problem requires φ(·, C) to be bi-Lipschitz.
This is a significant improvement because this last condition is not necessarily satisfied uniformly for all confidence sets C, which will be important when we use a varying ellipsoidal confidence set instead of C = {θ∗} in Section 3.2. Note, however, that Assumption 2 guarantees that φ(·, C) is always Lipschitz, as we next show.

Lemma 5. φ(·, C) is L̄ · E[1/X^T θ∗]-Lipschitz.

We stress that Conditions (c) and (d) of Assumption 2 are crucial to establish Lemma 5 but are not relied upon anywhere else in this paper.

Specification of the algorithm At any round t ∈ N, we bid:

b_t = min(1, min(1, max_{θ∈C} x_t^T θ)/λ_t),   (3)

where λ_t ≥ 0 is the current proxy for λ∗. We perform a binary search on λ∗ by repeatedly using the same value of λ_t for consecutive rounds forming phases, indexed by k ∈ N, and by keeping track of an interval, denoted by I_k = [λ_k, λ̄_k]. We start with phase k = 0 and we initially set λ_0 = 0 and λ̄_0 = 2/r. The length of the interval is shrunk by half at the end of every phase so that |I_k| = (2/r)/2^k for any k. Phase k lasts for N_k = 3 · 4^k · ln²(T) rounds during which we set the value of λ_t to λ_k. Since λ_k will be no larger than λ∗ with high probability, this means that we tend to overbid. Note that there are at most k̄_T = inf{n ∈ N | Σ_{k=0}^n N_k ≥ T} phases overall. The key observation enabling a bisection search approach is that, since the feedback is only partially censored, we can build, at the end of any phase k, an empirical estimate of φ(λ, C), which we denote by φ̂_k(λ, C), for any λ ≥ λ_k, using all of the N_k samples obtained during phase k.
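The phase schedule above can be made concrete in a few lines; the values of r and T in the example are illustrative.

```python
import math

def phase_schedule(r, T):
    # Phase k lasts N_k = 3 * 4^k * ln^2(T) rounds and uses an interval of
    # length |I_k| = (2/r) / 2^k. The search runs for at most k_bar_T phases,
    # the first k at which the cumulative phase lengths cover the horizon T.
    k, total, phases = 0, 0.0, []
    while total < T:
        N_k = 3 * (4 ** k) * math.log(T) ** 2
        I_k = (2.0 / r) / 2 ** k
        phases.append((k, N_k, I_k))
        total += N_k
        k += 1
    return phases

for k, N_k, I_k in phase_schedule(r=0.1, T=10**5):
    print(k, round(N_k), round(I_k, 4))
```

Since N_k quadruples while |I_k| halves, the estimation error Δ_k (of order 1/√N_k) shrinks at the same geometric rate as the interval, which is what makes the bisection argument go through.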
The decision rule used to update I_k at the end of phase k is specified next.

Algorithm 1: Interval updating procedure at the end of phase k

Data: λ̄_k, λ_k, Δ_k = 3·√(2 ln(2T)/N_k), and φ̂_k(λ, C) for any λ ≥ λ_k
Result: λ̄_{k+1} and λ_{k+1}
γ̄_k = λ̄_k, γ_k = λ_k;
while φ̂_k(γ̄_k, C) > β + Δ_k do
    γ̄_k = γ̄_k + |I_k|, γ_k = γ_k + |I_k|;
end
if φ̂_k((γ̄_k + γ_k)/2, C) ≤ β + Δ_k then
    λ̄_{k+1} = (γ̄_k + γ_k)/2, λ_{k+1} = γ_k;
else
    λ̄_{k+1} = γ̄_k, λ_{k+1} = (γ̄_k + γ_k)/2;
end

The splitting decision is trivial when |φ̂_k((γ̄_k + γ_k)/2, C) − β| > Δ_k because we get a clear signal that dominates the stochastic noise to either increase or decrease the current proxy for λ∗. The tricky situation is when |φ̂_k((γ̄_k + γ_k)/2, C) − β| ≤ Δ_k, in which case the level of noise is too high to draw any conclusion. In this situation, we always favor a smaller value for λ_k, even if that means shifting the interval upwards later on if we realize that we have made a mistake (which is the purpose of the while loop). This is because we can always recover from underestimating λ∗ since the feedback is only partially censored. Finally, note that the while loop of Algorithm 1 always ends after a finite number of iterations since φ̂_k(2/r, C) = 0 ≤ β + Δ_k.

Regret analysis Just like in Section 2, using concentration inequalities is essential to establish regret bounds, but this time we need uniform concentration inequalities.
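As an aside, the update rule of Algorithm 1 transcribes almost line for line into code. Here `phi_hat` plays the role of the empirical estimate φ̂_k(·, C), passed in as a function; the synthetic φ̂ in the example is illustrative, not the paper's estimator.

```python
import math

def update_interval(lam_lo, lam_hi, phi_hat, beta, N_k, T):
    # Decision rule of Algorithm 1. Delta_k = 3 * sqrt(2 ln(2T) / N_k) is the
    # high-probability deviation bound on phi_hat.
    delta_k = 3 * math.sqrt(2 * math.log(2 * T) / N_k)
    width = lam_hi - lam_lo  # |I_k|, constant throughout the while loop
    # Shift the interval upwards while even its upper end looks too small
    # (a clear signal that lambda* was underestimated earlier).
    while phi_hat(lam_hi) > beta + delta_k:
        lam_hi += width
        lam_lo += width
    mid = 0.5 * (lam_hi + lam_lo)
    # Split: keep the lower half unless the midpoint clearly overshoots beta.
    if phi_hat(mid) <= beta + delta_k:
        return lam_lo, mid
    return mid, lam_hi

# Example with a synthetic non-increasing phi_hat whose root phi = beta is 0.5:
print(update_interval(0.0, 2.0, lambda lam: max(0.0, 1.0 - lam), 0.5,
                      N_k=10**6, T=10**5))  # → (0.0, 1.0)
```

Note how the tie-breaking favors the lower half, consistent with the discussion above: underestimating λ∗ is recoverable, overestimating it is not.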
We use the Rademacher complexity approach to concentration inequalities (see, for example, [13] and [15]) to control the deviations of φ̂_k(·, C) uniformly.

Lemma 6. We have P[sup_{λ∈[λ_k, 2/r]} |φ̂_k(λ, C) − φ(λ, C)| ≤ Δ_k] ≥ 1 − 1/T, for any k.

Next, we bound the number of phases as a function of the time horizon.

Lemma 7. For T ≥ 3, we have k̄_T ≤ ln(T + 1) and 4^{k̄_T} ≤ T/ln²(T) + 1.

Using Lemma 6, we next show that the stochastic bisection search procedure correctly identifies λ ≥ 0 such that |φ(λ, C) − φ(λ∗, C)| is small with high probability, which is all we really need to lower bound the rewards accumulated in all rounds given Lemma 4.

Lemma 8. For C = L̄ · E[1/X^T θ∗] and provided that T ≥ exp(8r²/C²), we have:

P[∩_{k=0}^{k̄_T} {|φ̂_k(λ_k, C) − φ(λ∗, C)| ≤ 4C · |I_k|, |φ(λ_k, C) − φ(λ∗, C)| ≤ 3C · |I_k|}] ≥ 1 − 2 ln²(T)/T.

In a last step, we show, using the above result and at the cost of an additive logarithmic term in the regret bound, that we may assume that the advertiser participates in exactly T auctions. This enables us to combine Lemma 4, Lemma 7, and Lemma 8 to establish a distribution-free regret bound.

Theorem 3. Bidding according to (3) incurs a regret R_{B,T} = Õ((L̄ · E[1/X^T θ∗]/r²) · √T · ln(T)).

Observe that Theorem 3 applies in particular when θ∗ is known to the advertiser initially and that the regret bound derived does not depend on d.

3.2 General case

In this section, we combine the methods developed in Sections 2 and 3.1 to tackle the general case.

Specification of the algorithm At any round t ∈ N, we bid:

b_t = min(1, min(1, max_{θ∈C_{τ_t}} x_t^T θ)/λ_t),   (4)

where τ_t is defined in the last paragraph of Section 2 and λ_t ≥ 0 is specified below. We use the bisection search method developed in Section 3.1 as a subroutine in a master algorithm that also runs in phases. Master phases are indexed by q = 0, · · · , Q and a new master phase starts whenever det(M_t) has increased by a factor at least (1 + A) compared to the last time there was an update, for some A > 0 of our choosing. By construction, the ellipsoidal confidence set used during the q-th master phase is fixed, so that we can denote it by C_q. During the q-th master phase, we run the bisection search method described in Section 3.1 from scratch for the choice C = C_q in order to identify a solution λ_{q,∗} ≥ 0 to φ(λ_{q,∗}, C_q) = β (or λ_{q,∗} = 0 if no solution exists). Thus, λ_t is a proxy for λ_{q,∗} during the q-th master phase. This bisection search lasts for k̄_q phases and stops as soon as we move on to a new master phase. Hence, there are at most k̄_q ≤ k̄_T = inf{n ∈ N | Σ_{k=0}^n N_k ≥ T} phases during the q-th master phase. We denote by λ_{q,k} the lower end of the interval used at the k-th phase of the bisection search run during the q-th master phase.

Regret analysis First, we show that there can be at most O(d · ln(T · d)) master phases overall.

Lemma 9.
We have Q ≤ Q̄ = d · ln(T · d)/ln(1 + A) almost surely.

Lemma 9 is important because it implies that the bisection searches run long enough to be able to identify sufficiently good approximate values for λ_{q,∗}. Note that our approach is “doubly” optimistic since both λ_{q,k} ≤ λ_{q,∗} and θ∗ ∈ C_q hold with high probability at any point in time. At a high level, the regret analysis goes as follows. First, just like in Section 3.1, we show, using Lemma 8 and at the cost of an additive logarithmic term in the final regret bound, that we may assume that the advertiser participates in exactly T auctions. Second, we show, using the analysis of Theorem 2, that we may assume that the expected per-round reward obtained during phase q is E[min(1, max_{θ∈C_q} x_t^T θ)] (as opposed to x_t^T θ∗) at any round t, up to an additive term of order Õ(d · √T) in the final regret bound. Third, we note that Theorem 3 essentially shows that the expected per-round reward obtained during phase q is R(λ_{q,∗}, C_q), up to an additive term of order Õ(√T) in the final regret bound. Finally, what remains to be done is to compare R(λ_{q,∗}, C_q) with R(λ∗, {θ∗}), which is done using Lemmas 2 and 3.

Theorem 4. Bidding according to (4) incurs a regret R_{B,T} = Õ(d · (L̄ · E[1/X^T θ∗]/r²) · f(A) · √T), where f(A) = 1/ln(1 + A) + √(1 + A).

4 Concluding remark

An interesting direction for future research is to characterize achievable regret bounds, in particular through the derivation of lower bounds on regret. When there is no budget limit and no contextual information, Weed et al.
[25] provide a thorough characterization with rates ranging from Θ(ln(T)) to Θ(√T), depending on whether a margin condition on the underlying distribution is satisfied. These lower bounds carry over to our more general setting and, as a result, the dependence of our regret bounds with respect to T cannot be improved in general. It is, however, unclear whether the dependence with respect to d is optimal. Based on the lower bounds established by Dani et al. [18] for linear stochastic bandits, a model which is arguably closer to our setting than that of Chu et al. [16] because of the need to estimate the bid multiplier λ∗, we conjecture that a linear dependence on d is optimal, but this calls for more work. Given that the contextual information available in practice is often high-dimensional, developing algorithms that exploit the sparsity of the data in a similar fashion as done in [14] for linear contextual MAB problems is also a promising research direction. In this paper, observing that general BwK problems with contextual information are notoriously hard to solve, we exploit the structure of real-time bidding problems to develop a special-purpose algorithm (a stochastic binary search combined with an ellipsoidal confidence set) to obtain optimal regret bounds. We believe that the ideas behind this special-purpose algorithm could be adapted for other important applications such as contextual dynamic pricing with limited supply.

Acknowledgments

Research funded in part by the Office of Naval Research (ONR) grant N00014-15-1-2083.

References

[1] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Adv. Neural Inform. Processing Systems, pages 2312–2320.

[2] Agrawal, S. and Devanur, N. (2016). Linear contextual bandits with knapsacks. In Adv. Neural Inform. Processing Systems, pages 3450–3458.

[3] Agrawal, S., Devanur, N.
R., and Li, L. (2016). An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Proc. 29th Annual Conf. Learning Theory, pages 4–18.

[4] Amin, K., Kearns, M., Key, P., and Schwaighofer, A. (2012). Budget optimization for sponsored search: Censored learning in MDPs. In Proc. 28th Conf. Uncertainty in Artificial Intelligence, pages 54–63.

[5] Amin, K., Rostamizadeh, A., and Syed, U. (2014). Repeated contextual auctions with strategic buyers. In Adv. Neural Inform. Processing Systems, pages 622–630.

[6] Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learning Res., 3(Nov):397–422.

[7] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

[8] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77.

[9] Babaioff, M., Dughmi, S., Kleinberg, R., and Slivkins, A. (2012). Dynamic pricing with limited supply. In Proc. 13th ACM Conf. Electronic Commerce, pages 74–91.

[10] Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2013). Bandits with knapsacks. In Proc. 54th IEEE Annual Symp. Foundations of Comput. Sci., pages 207–216.

[11] Badanidiyuru, A., Langford, J., and Slivkins, A. (2014). Resourceful contextual bandits. In Proc. 27th Annual Conf. Learning Theory, volume 35, pages 1109–1134.

[12] Balseiro, S. and Gur, Y. (2017). Learning in repeated auctions with budgets: Regret minimization and equilibrium. In Proc. 18th ACM Conf. Economics and Comput., pages 609–609.

[13] Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Machine Learning Res., 3(Nov):463–482.

[14] Bastani, H. and Bayati, M. (2015).
Online decision-making with high-dimensional covariates. Working Paper.

[15] Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statist., 9:323–375.

[16] Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In J. Machine Learning Res. - Proc., volume 15, pages 208–214.

[17] Cohen, M., Lobel, I., and Leme, R. P. (2016). Feature-based dynamic pricing. In Proc. 17th ACM Conf. Economics and Comput., pages 817–817.

[18] Dani, V., Hayes, T., and Kakade, S. (2008). Stochastic linear optimization under bandit feedback. In Proc. 21st Annual Conf. Learning Theory, pages 355–366.

[19] Ghosh, A., Rubinstein, B. I. P., Vassilvitskii, S., and Zinkevich, M. (2009). Adaptive bidding for display advertising. In Proc. 18th Int. Conf. World Wide Web, pages 251–260.

[20] Lei, Y., Jasin, S., and Sinha, A. (2015). Near-optimal bisection search for nonparametric dynamic pricing with inventory constraint. Working Paper.

[21] Lueker, G. (1998). Average-case analysis of off-line and on-line knapsack problems. Journal of Algorithms, 29(2):277–305.

[22] Pasupathy, R. and Kim, S. (2011). The stochastic root-finding problem: Overview, solutions, and open questions. ACM Trans. Modeling and Comput. Simulation, 21(3):19.

[23] Tran-Thanh, L., Stavrogiannis, C., Naroditskiy, V., Robu, V., Jennings, N. R., and Key, P. (2014). Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. In Proc. 30th Conf. Uncertainty in Artificial Intelligence, pages 809–818.

[24] Wang, Z., Deng, S., and Ye, Y. (2014). Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research, 62(2):318–331.

[25] Weed, J., Perchet, V., and Rigollet, P. (2016).
Online learning in repeated auctions. In Proc. 29th Annual Conf. Learning Theory, volume 49, pages 1562–1583.