{"title": "Revenue Optimization against Strategic Buyers", "book": "Advances in Neural Information Processing Systems", "page_first": 2530, "page_last": 2538, "abstract": "We present a revenue optimization algorithm for posted-price auctions when facing a buyer with random valuations who seeks to optimize his $\\gamma$-discounted surplus. To analyze this problem, we introduce the notion of an $\\epsilon$-strategic buyer, a more natural notion of strategic behavior than what has been used in the past. We improve upon the previous state-of-the-art and achieve an optimal regret bound in $O\\Big( \\log T + \\frac{1}{\\log(1/\\gamma)} \\Big)$ when the seller can offer prices from a finite set $\\mathcal{P}$ and provide a regret bound in $\\widetilde O \\Big(\\sqrt{T} + \\frac{T^{1/4}}{\\log(1/\\gamma)} \\Big)$ when the buyer is offered prices from the interval $[0, 1]$.", "full_text": "Revenue Optimization against Strategic Buyers\n\nMehryar Mohri\nCourant Institute of Mathematical Sciences\n251 Mercer Street\nNew York, NY, 10012\n\nAndr\u00e9s Mu\u00f1oz Medina*\nGoogle Research\n111 8th Avenue\nNew York, NY, 10011\n\nAbstract\n\nWe present a revenue optimization algorithm for posted-price auctions when facing a buyer with random valuations who seeks to optimize his $\\gamma$-discounted surplus. In order to analyze this problem we introduce the notion of $\\epsilon$-strategic buyer, a more natural notion of strategic behavior than what has been considered in the past. We improve upon the previous state-of-the-art and achieve an optimal regret bound in $O(\\log T + 1/\\log(1/\\gamma))$ when the seller selects prices from a finite set and provide a regret bound in $\\widetilde{O}(\\sqrt{T} + T^{1/4}/\\log(1/\\gamma))$ when the prices offered are selected out of the interval [0, 1].\n\n1 Introduction\n\nOnline advertisement is currently the fastest growing form of advertising. 
This growth has been motivated, among other reasons, by the existence of well-defined metrics of effectiveness such as click-through rate and conversion rates. Moreover, online advertisement enables the design of better targeted campaigns by allowing advertisers to decide which type of consumers should see their advertisement. These advantages have promoted the fast-paced development of a large number of advertising platforms. Among them, AdExchanges have increased in popularity in recent years. In contrast to traditional advertising, AdExchanges do not involve contracts between publishers and advertisers. Instead, advertisers are allowed to bid in real-time for the right to display their ad.\nAn AdExchange works as follows: when a user visits a publisher\u2019s website, the publisher sends this information to the AdExchange which runs a second-price auction with reserve (Vickrey, 1961; Milgrom, 2004) among all interested advertisers. Finally, the winner of the auction gets the right to display his ad on the publisher\u2019s website and pays the maximum of the second highest bid and the reserve price. In practice, this process is performed in milliseconds, resulting in millions of transactions recorded daily by the AdExchange. Thus, one might expect that the AdExchange could benefit from this information by learning how much an advertiser values the right to display his ad and setting an optimal reserve price. This idea has recently motivated research in the learning community on revenue optimization in second-price auctions with reserve (Mohri and Medina, 2014a; Cui et al., 2011; Cesa-Bianchi et al., 2015).\nThe algorithms proposed by these authors heavily rely on the assumption that the advertisers\u2019 bids are drawn i.i.d. from some underlying distribution. 
However, if an advertiser is aware of the fact that the AdExchange or publisher is using a revenue optimization algorithm, then, most likely, he would adjust his behavior to trick the publisher into offering a more beneficial price in the future. Under this scenario, the assumptions of Mohri and Medina (2014a) and Cesa-Bianchi et al. (2015) would be violated. In fact, empirical evidence of strategic behavior by advertisers has been documented by Edelman and Ostrovsky (2007). It is therefore critical to analyze the interactions between publishers and strategic advertisers.\n\n*This work was partially done at the Courant Institute of Mathematical Sciences.\n\nIn this paper, we consider the simpler scenario of revenue optimization in posted-price auctions with strategic buyers, first analyzed by Amin et al. (2013). As pointed out by Amin et al. (2013), the study of this simplified problem is relevant since a large number of auctions run by AdExchanges consist of only one buyer (or one buyer with a large bid and several buyers with negligible bids). In this scenario, a second-price auction in fact reduces to a posted-price auction where the seller sets a reserve price and the buyer decides to accept it (bid above it) or reject it (bid below).\nTo analyze the sequential nature of this problem, we can cast it as a repeated game between a buyer and a seller where a strategic buyer seeks to optimize his surplus while the seller seeks to collect the largest possible revenue from the buyer. This can be viewed as an instance of a repeated non-zero-sum game with incomplete information, which is a problem that has been well studied in the Economics and Game Theory community (Nachbar, 1997, 2001). However, such previous work has mostly concentrated on the characterization of different types of achievable equilibria as opposed to the design of an algorithm for the seller. 
Furthermore, the problem we consider admits a particular structure that can be exploited to derive learning algorithms with more favorable guarantees for the specific task of revenue optimization.\nThe problem can also be viewed as an instance of a multi-armed bandit problem (Auer et al., 2002; Lai and Robbins, 1985), more specifically, a particular type of continuous bandit problem previously studied by Kleinberg and Leighton (2003). Indeed, at every time t the seller can only observe the revenue of the price he offered and his goal is to find, as fast as possible, the price that would yield the largest expected revenue. Unlike a bandit problem, however, here, the performance of an algorithm cannot be measured in terms of the external regret. Indeed, as observed by Bubeck and Cesa-Bianchi (2012) and Arora et al. (2012), the notion of external regret becomes meaningless when facing an adversary that reacts to the learner\u2019s actions. In short, instead of comparing to the best achievable revenue by a fixed price over the sequence of rewards seen, one should compare against the simulated sequence of rewards that would have been seen had the seller played a fixed price. This notion of regret is known as strategic regret and regret minimization algorithms have been proposed before under different scenarios (Amin et al., 2013, 2014; Mohri and Medina, 2014a). In this paper we provide a regret minimization algorithm for the stochastic scenario, where, at each round, the buyer receives an i.i.d. valuation from an underlying distribution. While this random valuation might seem surprising, it is in fact a standard assumption in the study of auctions (Milgrom and Weber, 1982; Milgrom, 2004; Cole and Roughgarden, 2014). Moreover, in practice, advertisers rarely interact directly with an AdExchange. Instead, several advertisers are part of an ad network and it is that ad network that bids on their behalf. 
Therefore, the valuation of the ad network is not likely to remain fixed. Our model is also motivated by the fact that the valuation of an advertiser depends on the user visiting the publisher\u2019s website. Since these visits can be considered random, it follows that the buyer\u2019s valuation is in fact a random variable.\nA crucial component of our analysis is the definition of a strategic buyer. We consider a buyer who seeks to optimize his cumulative discounted surplus. However, we show that a buyer who exactly maximizes his surplus must have unlimited computational power, which is not a realistic assumption in practice. Instead, we define the notion of an $\\epsilon$-strategic buyer who seeks only to approximately optimize his surplus. Our main contribution is to show that, when facing an $\\epsilon$-strategic buyer, a seller can achieve $O(\\log T)$ regret when the set of possible prices to offer is finite, and an $O(\\sqrt{T})$ regret bound when the set of prices is [0, 1]. Remarkably, these bounds on the regret match those given by Kleinberg and Leighton (2003) in a truthful scenario where the buyer does not behave strategically.\nThe rest of this paper is organized as follows. In Section 2, we discuss in more detail related previous work. Next, we define more formally the problem setup (Section 3). In particular, we give a precise definition of the notion of $\\epsilon$-strategic buyer (Section 3.2). Our main algorithm for a finite set of prices is described in Section 4, where we also provide a regret analysis. In Section 5, we extend our algorithm to the continuous case where we show that a regret in $O(\\sqrt{T})$ can be achieved.\n\n2 Previous work\n\nThe problem of revenue optimization in auctions goes back to the seminal work of Myerson (1981), who showed that under some regularity assumptions over the distribution $D$, the revenue optimal, incentive-compatible mechanism is a second-price auction with reserve. 
This result applies to single-shot auctions where buyers and the seller interact only once and the underlying value distribution is known to the seller. In practice, however, it is not realistic to assume that the seller has access to this distribution. Instead, in cases such as online advertisement, the seller interacts with the buyer a large number of times and can therefore infer his behavior from historical data. This fact has motivated the design of several learning algorithms such as that of Cesa-Bianchi et al. (2015), who proposed a bandit algorithm for revenue optimization in second-price auctions; and the work of Mohri and Medina (2014a), who provided learning guarantees and an algorithm for revenue optimization where each auction is associated with a feature vector.\nThe aforementioned algorithms are formulated under the assumption of buyers bidding in an i.i.d. fashion and do not take into account the fact that buyers can in fact react to the use of revenue optimization algorithms by the seller. This has motivated a series of publications focusing on this particular problem. Bikhchandani and McCardle (2012) analyzed the same problem proposed here when the buyer and seller interact for only two rounds. Kanoria and Nazerzadeh (2014) considered a repeated game of second-price auctions where the seller knows that the value distribution can be either high, meaning it is concentrated around high values, or low; and his goal is to find out from which distribution the valuations are drawn under the assumption that buyers can behave strategically.\nFinally, the scenario considered here was first introduced by Amin et al. (2013) where the authors solve the problem of optimizing revenue against a strategic buyer with a fixed valuation and showed that a seller can achieve regret in $O\\big(\\frac{\\sqrt{T}}{1-\\gamma}\\big)$. Mohri and Medina (2014b) later showed that one can in fact achieve a regret in $O\\big(\\frac{\\log T}{1-\\gamma}\\big)$, closing the gap with the lower bound to a factor of $\\log T$. The scenario of random valuations we consider here was also analyzed by Amin et al. (2013) where an algorithm achieving regret in $O\\big(|\\mathcal{P}| T^{\\alpha} + \\frac{1}{(1-\\gamma)^{1/\\alpha}} + \\frac{1}{\\Delta^{1/\\alpha}}\\big)$ was proposed when prices are offered from a finite set $\\mathcal{P}$, with $\\Delta = \\min_{p \\in \\mathcal{P}} p^* D(v > p^*) - p D(v > p)$ and $\\alpha$ a free parameter. Finally, an extension of this algorithm to the contextual setting was presented by the same authors in (Amin et al., 2014) where they provide an algorithm achieving $O\\big(\\frac{T^{2/3}}{1-\\gamma}\\big)$ regret.\nThe algorithms proposed by Amin et al. (2013, 2014) consist of alternating exploration and exploitation. That is, there exist rounds where the seller only tries to estimate the value of the buyer and other rounds where he uses this information to try to extract the largest possible revenue. It is well known in the bandit literature (Dani and Hayes, 2006; Abernethy et al., 2008) that algorithms that ignore information obtained on exploitation rounds tend to be sub-optimal. Indeed, even in a truthful scenario where the UCB algorithm (Auer et al., 2002) achieves regret in $O\\big(\\frac{\\log T}{\\Delta}\\big)$, the algorithm proposed by Amin et al. (2013) achieves sub-optimal regret in $O\\big(e^{\\sqrt{\\log T \\log(1/\\Delta)}}\\big)$ for the optimal choice of $\\alpha$ which, incidentally, requires also access to the unknown value $\\Delta$.\nWe propose instead an algorithm inspired by the UCB strategy using exploration and exploitation simultaneously. We show that our algorithm admits a regret that is in $O\\big(\\frac{\\log T}{\\Delta} + \\frac{|\\mathcal{P}|}{\\log(1/\\gamma)}\\big)$, which matches the UCB bound in the truthful scenario and which depends on $\\gamma$ only through the additive term $\\frac{1}{\\log(1/\\gamma)} \\approx \\frac{1}{1-\\gamma}$ known to be unavoidable (Amin et al., 2013). Our results cannot be directly compared with those of Amin et al. (2013) since they consider a fully strategic adversary whereas we consider an $\\epsilon$-strategic adversary. As we will see in the next section, however, the notion of $\\epsilon$-strategic adversary is in fact more natural than that of a buyer who exactly optimizes his discounted surplus. Moreover, it is not hard to show that, when applied to our scenario, perhaps modulo a constant, the algorithm of Amin et al. (2013) cannot achieve a better regret than against a fully strategic adversary.\n\n3 Setup\n\nWe consider the following scenario, similar to the one introduced by Amin et al. (2013).\n\n3.1 Scenario\n\nA buyer and a seller interact for T rounds. At each round $t \\in \\{1, \\ldots, T\\}$, the seller attempts to sell some good to the buyer, such as the right to display an ad. The buyer receives a valuation $v_t \\in [0, 1]$ which is unknown to the seller and is sampled from a distribution $D$. The seller offers a price $p_t$, in response to which the buyer selects an action $a_t \\in \\{0, 1\\}$, with $a_t = 1$ indicating that he accepts the price and $a_t = 0$ otherwise. We will say the buyer lies if he accepts the price at time t ($a_t = 1$) while the price offered is above his valuation ($v_t \\leq p_t$), or when he rejects the price ($a_t = 0$) while his valuation is above the price offered ($v_t > p_t$).\nThe seller seeks to optimize his expected revenue over the T rounds of interaction, that is,\n\n$$\\mathrm{Rev} = \\mathbb{E}\\Big[\\sum_{t=1}^T a_t p_t\\Big].$$\n\nNotice that, when facing a truthful buyer, for any price $p$, the expected revenue of the seller is given by $p D(v > p)$. Therefore, with knowledge of $D$, the seller could set all prices $p_t$ to $p^*$, where $p^* \\in \\mathrm{argmax}_{p \\in [0,1]} p D(v > p)$. Since the actions of the buyer do not affect the choice of future prices by the seller, the buyer has no incentive to lie and the seller will obtain an expected revenue of $T p^* D(v > p^*)$. 
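
As a small illustration of the truthful benchmark discussed above, the following Python sketch evaluates the expected revenue p D(v > p) on a grid of candidate prices and picks the best one. It is only a sketch under stated assumptions: valuations are taken to be uniform on [0, 1], the price grid and horizon are hypothetical, and the function names are illustrative rather than taken from the paper.

```python
def expected_revenue(price, survival):
    # One-round expected revenue p * D(v > p) against a truthful buyer.
    return price * survival(price)


def best_price(prices, survival):
    # Candidate price with the largest expected truthful revenue.
    return max(prices, key=lambda p: expected_revenue(p, survival))


def fixed_price_regret(price, horizon, prices, survival):
    # Regret of offering the same price in every round to a truthful buyer,
    # measured against the best candidate price in the grid.
    optimum = expected_revenue(best_price(prices, survival), survival)
    return horizon * (optimum - expected_revenue(price, survival))


def uniform_survival(price):
    # D(v > p) for v uniform on [0, 1]; an illustrative assumption only.
    return 1.0 - price


prices = [i / 10 for i in range(1, 11)]   # hypothetical price grid
T = 1000                                  # hypothetical horizon
p_star = best_price(prices, uniform_survival)
benchmark = T * expected_revenue(p_star, uniform_survival)
```

Under these assumptions the revenue curve is p(1 - p), so the sketch selects the grid price 0.5 and the benchmark T p* D(v > p*) equals 250.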
It is therefore natural to measure the performance of any revenue optimization algorithm in terms of the following notion of strategic regret:\n\n$$\\mathrm{Reg}_T = T p^* D(v > p^*) - \\mathrm{Rev} = \\max_{p \\in [0,1]} T p D(v > p) - \\mathbb{E}\\Big[\\sum_{t=1}^T a_t p_t\\Big].$$\n\nThe objective of the seller coincides with the one assumed by Kleinberg and Leighton (2003) in the study of repeated interactions with buyers with a random valuation. However, here, we will allow the buyer to behave strategically, which results in a harder problem. Nevertheless, the buyer is not assumed to be fully adversarial as in (Kleinberg and Leighton, 2003). Instead, we will assume, as discussed in detail in the next section, that the buyer seeks to approximately optimize his surplus, which can be viewed as a more natural assumption.\n\n3.2 $\\epsilon$-strategic Buyers\n\nHere, we define the family of buyers considered throughout this paper. We denote by $x_{1:t} \\in \\mathbb{R}^t$ the vector $(x_1, \\ldots, x_t)$ and define the history of the game up to time t by $H_t := (p_{1:t}, v_{1:t}, a_{1:t})$. Before the first round, the seller decides on an algorithm $A$ for setting prices and this algorithm is announced to the buyer. The buyer then selects a strategy $B : (H_{t-1}, v_t, p_t) \\mapsto a_t$. For any value $\\gamma \\in (0, 1)$ and strategy $B$, we define the buyer\u2019s discounted expected surplus by\n\n$$\\mathrm{Sur}_\\gamma(B) = \\mathbb{E}\\Big[\\sum_{t=1}^T \\gamma^{t-1} a_t (v_t - p_t)\\Big].$$\n\nA buyer maximizing this discounted surplus wishes to acquire the item as inexpensively as possible, but does not wish to wait too long to obtain a favorable price.\nIn order to optimize his surplus, a buyer must then solve a non-homogeneous Markov decision process (MDP). Indeed, consider the scenario where at time t the seller offers prices from a distribution $D_t \\in \\mathcal{D}$, where $\\mathcal{D}$ is a family of probability distributions over the interval [0, 1]. 
The seller updates his beliefs as follows: the current distribution $D_t$ is selected as a function of the distribution at the previous round as well as the history $H_{t-1}$ (which is all the information available to the seller). More formally, we let $f_t : (D_t, H_t) \\mapsto D_{t+1}$ be a transition function for the seller. Let $s_t = (D_t, H_{t-1}, v_t, p_t)$ denote the state of the environment at time t, that is, all the information available at time t to the buyer. Finally, let $S_t(s_t)$ denote the maximum attainable expected surplus of a buyer that is in state $s_t$ at time t. It is clear that $S_t$ will satisfy the following Bellman equations:\n\n$$S_t(s_t) = \\max_{a_t \\in \\{0,1\\}} \\gamma^{t-1} a_t (v_t - p_t) + \\mathbb{E}_{(v_{t+1}, p_{t+1}) \\sim D \\times f_t(D_t, H_t)}\\big[S_{t+1}(f_t(D_t, H_t), H_t, v_{t+1}, p_{t+1})\\big], \\tag{1}$$\n\nwith the boundary condition $S_T(s_T) = \\gamma^{T-1} (v_T - p_T) 1_{p_T \\leq v_T}$.\nDefinition 1. A buyer is said to be strategic if his action at time t is a solution of the Bellman equation (1).\n\nNotice that, depending on the choice of the family $\\mathcal{D}$, the number of states of the MDP solved by a strategic buyer may be infinite. Even for a deterministic algorithm that offers prices from a finite set $\\mathcal{P}$, the number of states of this MDP would be in $\\Omega(T^{|\\mathcal{P}|})$, which quickly becomes intractable. Thus, in view of the prohibitive cost of computing his actions, the model of a fully strategic buyer does not seem to be realistic. We introduce instead the concept of $\\epsilon$-strategic buyers.\n\nDefinition 2. 
A buyer is said to be $\\epsilon$-strategic if he behaves strategically, except when no sequence of actions can improve upon the future surplus of the truthful sequence by more than $\\gamma^{t_0} \\epsilon$, or except for the first $0 < t < t_0$ rounds, for some $t_0 \\geq 0$ depending only on the seller\u2019s algorithm, in which cases he acts truthfully.\n\nWe show in Section 4 that this definition implies the existence of $t_1 > t_0$ such that an $\\epsilon$-strategic buyer only solves an MDP over the interval $[t_0, t_1]$ which becomes a tractable problem for $t_1 \\ll T$. The parameter $t_0$ used in the definition is introduced to consider the unlikely scenario where a seller\u2019s algorithm deliberately ignores all information observed during the rounds $0 < t < t_0$, in which case it is optimal for the buyer to behave truthfully.\nOur definition is motivated by the fact that, for a buyer with bounded computational power, there is no incentive in acting non-truthfully if the gain in surplus over a truthful behavior is negligible.\n\n4 Regret Analysis\n\nWe now turn our attention to the problem faced by the seller. The seller\u2019s goal is to maximize his revenue. When the buyer is truthful, Kleinberg and Leighton (2003) have shown that this problem can be cast as a continuous bandit problem. In that scenario, the strategic regret in fact coincides with the pseudo-regret, which is the quantity commonly minimized in a stochastic bandit setting (Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012). Thus, if the set of possible prices $\\mathcal{P}$ is finite, the seller can use the UCB algorithm (Auer et al., 2002) to minimize his pseudo-regret.\nIn the presence of an $\\epsilon$-strategic buyer, the rewards are no longer stochastic. Therefore, we need to analyze the regret of a seller in the presence of lies. Let $\\mathcal{P}$ denote a finite set of prices offered by the seller. Define $\\mu_p = p D(v > p)$ and $\\Delta_p = \\mu_{p^*} - \\mu_p$. 
For every price $p \\in \\mathcal{P}$, define also $T_p(t)$ to be the number of times price $p$ has been offered up to time t. We will denote by $T^*$ and $\\mu^*$ the corresponding quantities associated with the optimal price $p^*$.\nLemma 1. Let $\\mathcal{L}$ denote the number of times a buyer lies. For any $\\delta > 0$, the strategic regret of a seller can be bounded as follows:\n\n$$\\mathrm{Reg}_T \\leq \\mathbb{E}[\\mathcal{L}] + \\sum_{p : \\Delta_p > \\delta} \\mathbb{E}[T_p(T)] \\Delta_p + T\\delta.$$\n\nProof. Let $\\mathcal{L}_t$ denote the event that the buyer lies at round t, then the expected revenue of a seller is given by\n\n$$\\mathbb{E}\\Big[\\sum_{t=1}^T \\sum_{p \\in \\mathcal{P}} a_t p_t 1_{p_t = p} (1_{\\mathcal{L}_t} + 1_{\\mathcal{L}_t^c})\\Big] \\geq \\mathbb{E}\\Big[\\sum_{t=1}^T \\sum_{p \\in \\mathcal{P}} a_t p_t 1_{p_t = p} 1_{\\mathcal{L}_t^c}\\Big] = \\mathbb{E}\\Big[\\sum_{p \\in \\mathcal{P}} \\sum_{t=1}^T 1_{v_t > p} \\, p \\, 1_{p_t = p} 1_{\\mathcal{L}_t^c}\\Big],$$\n\nwhere the last equality follows from the fact that when the buyer is truthful $a_t = 1_{v_t > p}$. Moreover, using the fact that $\\sum_{t=1}^T 1_{\\mathcal{L}_t} = \\mathcal{L}$, we have\n\n$$\\mathbb{E}\\Big[\\sum_{p \\in \\mathcal{P}} \\sum_{t=1}^T 1_{v_t > p} \\, p \\, 1_{p_t = p} 1_{\\mathcal{L}_t^c}\\Big] = \\mathbb{E}\\Big[\\sum_{p \\in \\mathcal{P}} \\sum_{t=1}^T 1_{v_t > p} \\, p \\, 1_{p_t = p}\\Big] - \\mathbb{E}\\Big[\\sum_{p \\in \\mathcal{P}} \\sum_{t=1}^T 1_{v_t > p} \\, p \\, 1_{p_t = p} 1_{\\mathcal{L}_t}\\Big]$$\n$$= \\sum_{p \\in \\mathcal{P}} \\mu_p \\mathbb{E}[T_p(T)] - \\mathbb{E}\\Big[\\sum_{t=1}^T 1_{v_t > p_t} p_t 1_{\\mathcal{L}_t}\\Big] \\geq \\sum_{p \\in \\mathcal{P}} \\mu_p \\mathbb{E}[T_p(T)] - \\mathbb{E}[\\mathcal{L}].$$\n\nSince the regret of offering prices for which $\\Delta_p \\leq \\delta$ is bounded by $T\\delta$, it follows that the regret of the seller is bounded by $\\mathbb{E}[\\mathcal{L}] + \\sum_{p : \\Delta_p > \\delta} \\Delta_p \\mathbb{E}[T_p(T)] + T\\delta$.\nWe now define a robust UCB (R-UCB$_L$) algorithm for which we can bound the expectations $\\mathbb{E}[T_p(T)]$. For every price $p \\in \\mathcal{P}$, define\n\n$$\\widehat{\\mu}_p(t) = \\frac{1}{T_p(t)} \\sum_{i=1}^t p_i 1_{p_i = p} 1_{v_i > p_i}$$\n\nto be the true empirical mean of the reward that a seller would obtain when facing a truthful buyer. Let $L_t(p) = \\sum_{i=1}^t (a_i - 1_{v_i > p}) 1_{p_i = p} \\, p$ denote the revenue obtained by the seller in rounds where the buyer lied. Notice that $L_t(p)$ can be positive or negative. Finally, let\n\n$$\\bar{\\mu}_p(t) = \\widehat{\\mu}_p(t) + \\frac{L_t(p)}{T_p(t)}$$\n\nbe the empirical mean obtained when offering price $p$ that is observed by the seller. For the definition of our algorithm, we will make use of the following upper confidence bound:\n\n$$B_p(t, L) = \\frac{L p}{T_p(t)} + \\sqrt{\\frac{2 \\log t}{T_p(t)}}.$$\n\nWe will use $B^*$ as a shorthand for $B_{p^*}$. Our R-UCB$_L$ algorithm selects the price $p_t$ that maximizes the quantity\n\n$$\\max_{p \\in \\mathcal{P}} \\bar{\\mu}_p(t) + B_p(t, L).$$\n\nWe proceed to bound the expected number of times a sub-optimal price $p$ is offered.\n\nProposition 1. Let $P_t(p, L) := \\Pr\\Big(\\frac{L_t(p)}{T_p(t)} + \\Big|\\frac{L_t(p^*)}{T^*(t)}\\Big| \\geq L\\Big(\\frac{p}{T_p(t)} + \\frac{p^*}{T^*(t)}\\Big)\\Big)$. Then, the following inequality holds:\n\n$$\\mathbb{E}[T_p(T)] \\leq \\frac{4 L p}{\\Delta_p} + \\frac{32 \\log T}{\\Delta_p^2} + 2 + \\sum_{t=1}^T P_t(p, L).$$\n\nProof. For any $p$ and $t$ define $\\eta_p(t) = \\sqrt{\\frac{2 \\log t}{T_p(t)}}$ and let $\\eta^* = \\eta_{p^*}$. If at time t price $p \\neq p^*$ is offered then\n\n$$\\bar{\\mu}_p(t) + B_p(t, L) - \\bar{\\mu}^*(t) - B^*(t, L) \\geq 0$$\n$$\\Leftrightarrow \\widehat{\\mu}_p(t) + B_p(t, L) + \\frac{L_t(p)}{T_p(t)} - \\widehat{\\mu}^*(t) - B^*(t, L) - \\frac{L_t(p^*)}{T^*(t)} \\geq 0$$\n$$\\Leftrightarrow \\big[\\widehat{\\mu}_p(t) - \\mu_p - \\eta_p(t)\\big] + \\big[2 B_p(t, L) - \\Delta_p\\big] + \\Big[\\frac{L_t(p)}{T_p(t)} - \\frac{L p}{T_p(t)} - \\frac{L_t(p^*)}{T^*(t)} - \\frac{L p^*}{T^*(t)}\\Big] + \\big[\\mu^* - \\widehat{\\mu}^*(t) - \\eta^*(t)\\big] \\geq 0. \\tag{2}$$\n\nTherefore, if price $p$ is selected, then at least one of the four terms in inequality (2) must be positive. Let $u = \\frac{4 L p}{\\Delta_p} + \\frac{32 \\log T}{\\Delta_p^2}$. Notice that if $T_p(t) > u$ then $2 B_p(t, L) - \\Delta_p < 0$. Thus, we can write\n\n$$\\mathbb{E}[T_p(T)] = \\mathbb{E}\\Big[\\sum_{t=1}^T 1_{p_t = p} (1_{T_p(t) \\leq u} + 1_{T_p(t) > u})\\Big] \\leq u + \\sum_{t=u}^T \\Pr(p_t = p, T_p(t) > u).$$\n\nThis combined with the positivity of at least one of the four terms in (2) yields:\n\n$$\\mathbb{E}[T_p(T)] \\leq u + \\sum_{t=u}^T \\Big[\\Pr\\big(\\widehat{\\mu}_p(t) - \\mu_p \\geq \\eta_p(t)\\big) + \\Pr\\Big(\\frac{L_t(p)}{T_p(t)} - \\frac{L_t(p^*)}{T^*(t)} \\geq \\frac{L p}{T_p(t)} + \\frac{L p^*}{T^*(t)}\\Big) + \\Pr\\big(\\mu^* - \\widehat{\\mu}^*(t) > \\eta^*(t)\\big)\\Big]$$\n$$\\leq u + \\sum_{t=u}^T \\Big[\\Pr\\big(\\widehat{\\mu}_p(t) - \\mu_p \\geq \\eta_p(t)\\big) + \\Pr\\big(\\mu^* - \\widehat{\\mu}^*(t) > \\eta^*(t)\\big) + P_t(p, L)\\Big]. \\tag{3}$$\n\nWe can now bound the probabilities appearing in (3) as follows:\n\n$$\\Pr\\Big(\\widehat{\\mu}_p(t) - \\mu_p \\geq \\sqrt{\\frac{2 \\log t}{T_p(t)}}\\Big) \\leq \\Pr\\Big(\\exists s \\in [1, t] : \\frac{1}{s} \\sum_{i=1}^s p 1_{v_i > p} - \\mu_p \\geq \\sqrt{\\frac{2 \\log t}{s}}\\Big) \\leq \\sum_{s=1}^t t^{-4} = t^{-3},$$\n\nwhere the last inequality follows from an application of Hoeffding\u2019s inequality as well as the union bound. A similar argument can be made to bound the other term in (3). Using the definition of $u$ we then have\n\n$$\\mathbb{E}[T_p(T)] \\leq \\frac{4 L p}{\\Delta_p} + \\frac{32 \\log T}{\\Delta_p^2} + \\sum_{t=u}^T 2 t^{-3} + \\sum_{t=1}^T P_t(p, L) \\leq \\frac{4 L p}{\\Delta_p} + \\frac{32 \\log T}{\\Delta_p^2} + 2 + \\sum_{t=1}^T P_t(p, L),$$\n\nwhich completes the proof.\nCorollary 1. Let $\\mathcal{L}$ denote the number of times a buyer lies. Then, the strategic regret of R-UCB$_L$ can be bounded as follows:\n\n$$\\mathrm{Reg}_T \\leq L\\Big(4 \\sum_{p \\in \\mathcal{P}} p\\Big) + \\mathbb{E}[\\mathcal{L}] + \\sum_{p : \\Delta_p > \\delta} \\Big(\\frac{32 \\log T}{\\Delta_p} + 2 \\Delta_p + \\sum_{t=1}^T P_t(p, L)\\Big) + T\\delta.$$\n\nNotice that the choice of parameter $L$ of R-UCB$_L$ is subject to a trade-off: on the one hand, $L$ should be small to minimize the first term of this regret bound; on the other hand, the function $P_t(p, L)$ is decreasing in $L$, therefore the term $\\sum_{t=1}^T P_t(p, L)$ is beneficial for larger values of $L$.\nWe now show that an $\\epsilon$-strategic buyer can only lie a finite number of times, which will imply the existence of an appropriate choice of $L$ for which we can ensure that $P_t(p, L) = 0$, thereby recovering the standard logarithmic regret of UCB.\nProposition 2. If the discounting factor $\\gamma$ satisfies $\\gamma \\leq \\gamma_0 < 1$, an $\\epsilon$-strategic buyer stops lying after $S = \\Big\\lceil \\frac{\\log(1/(\\epsilon(1-\\gamma_0)))}{\\log(1/\\gamma_0)} \\Big\\rceil$ rounds.\n\nProof. 
After $S$ rounds, for any sequence of actions $a_t$ the surplus that can be achieved by the buyer in the remaining rounds is bounded by\n\n$$\\sum_{t=t_0+S+1}^T \\gamma^{t-1} \\mathbb{E}[a_t (v_t - p_t)] \\leq \\frac{\\gamma^{S+t_0} - \\gamma^T}{1 - \\gamma} \\leq \\frac{\\gamma^{S+t_0}}{1 - \\gamma} \\leq \\gamma^{t_0} \\epsilon,$$\n\nwhere the last inequality uses $\\gamma^S / (1 - \\gamma) \\leq \\gamma_0^S / (1 - \\gamma_0) \\leq \\epsilon$. Thus, by definition, an $\\epsilon$-strategic buyer does not lie after $S$ rounds.\nCorollary 2. If the discounting factor $\\gamma$ satisfies $\\gamma \\leq \\gamma_0 < 1$ and the seller uses the R-UCB$_L$ algorithm with $L = \\Big\\lceil \\frac{\\log(1/(\\epsilon(1-\\gamma_0)))}{\\log(1/\\gamma_0)} \\Big\\rceil$, then the strategic regret of the seller is bounded by\n\n$$\\Big\\lceil \\frac{\\log \\frac{1}{\\epsilon(1-\\gamma_0)}}{\\log \\frac{1}{\\gamma_0}} \\Big\\rceil \\Big(4 \\sum_{p \\in \\mathcal{P}} p + 1\\Big) + \\sum_{p : \\Delta_p > \\delta} \\Big(\\frac{32 \\log T}{\\Delta_p} + 2 \\Delta_p\\Big) + T\\delta. \\tag{4}$$\n\nProof. Follows trivially from Corollary 1 and the previous proposition, which implies that $P_t(p, L) \\equiv 0$.\nLet us compare our results with those of Amin et al. (2013). The regret bound given in (Amin et al., 2013) is in $O\\Big(|\\mathcal{P}| T^{\\alpha} + \\frac{|\\mathcal{P}|^2}{\\Delta^{2/\\alpha}} + \\frac{|\\mathcal{P}|^2}{(1-\\gamma_0)^{1/\\alpha}}\\Big)$, where $\\alpha$ is a parameter controlling the fraction of rounds used for exploration and $\\Delta = \\min_{p \\in \\mathcal{P}} \\Delta_p$. In particular, notice that the dependency of this bound on the cardinality of $\\mathcal{P}$ is quadratic instead of linear as in our case. Moreover, the dependency on $\\gamma_0$ is in $O\\big(\\frac{1}{(1-\\gamma_0)^{1/\\alpha}}\\big)$. Therefore, even in a truthful scenario where $\\gamma \\ll 1$, the dependency on $T$ remains polynomial whereas we recover the standard logarithmic regret. Only when the seller has access to $\\Delta$, which is a strong requirement, can he set the optimal value of $\\alpha$ to achieve regret in $O\\big(e^{\\sqrt{\\log T \\log(1/\\Delta)}}\\big)$.\nOf course, the algorithm proposed by Amin et al. (2013) assumes that the buyer is fully strategic whereas we only require the buyer to be $\\epsilon$-strategic. However, the authors assume that the distribution satisfies a Lipschitz condition which technically allows them to bound the number of lies in the same way as in Proposition 2. 
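
As a numerical aside on Proposition 2 and Corollary 2, the following Python sketch computes the lie budget S and uses it as the parameter L of an R-UCB style index. It is a minimal sketch under stated assumptions: the values of epsilon and gamma0, the statistics table, and all function names are hypothetical, and trying never-offered prices first is a common UCB convention assumed here rather than something the text specifies.

```python
import math


def lie_budget(eps, gamma0):
    # S = ceil( log(1 / (eps * (1 - gamma0))) / log(1 / gamma0) ).
    return math.ceil(math.log(1.0 / (eps * (1.0 - gamma0))) / math.log(1.0 / gamma0))


def rucb_index(reward_sum, pulls, t, price, L):
    # Observed empirical mean plus the bonus L * p / T_p(t) + sqrt(2 log t / T_p(t)).
    return reward_sum / pulls + L * price / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def select_price(stats, t, L):
    # stats maps each price to (observed revenue so far, times offered).
    # Prices never offered are tried first (assumed convention).
    for price, (_, pulls) in stats.items():
        if pulls == 0:
            return price
    return max(stats, key=lambda p: rucb_index(stats[p][0], stats[p][1], t, p, L))


L = lie_budget(0.01, 0.9)                     # hypothetical eps and gamma0
stats = {0.25: (1.0, 10), 0.5: (2.0, 10)}     # hypothetical observations
choice = select_price(stats, 20, L)
```

The extra term L * p / T_p(t) is what distinguishes this index from plain UCB: it inflates the confidence bound by the largest revenue distortion that at most L lies at price p could have caused.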
Therefore, the regret bound achieved by their algorithm remains the same in our scenario.\n\n5 Continuous pricing strategy\n\nThus far, we have assumed that the prices offered by the seller are selected out of a discrete set $\\mathcal{P}$. In practice, however, the optimal price may not be within $\\mathcal{P}$ and therefore the algorithm described in the previous section might accumulate a large regret when compared against the best price in [0, 1]. In order to solve this problem, we propose to discretize the interval [0, 1] and run our R-UCB$_L$ algorithm on the resulting discretization. This induces a trade-off since a better discretization implies a larger regret term in (4). To find the optimal size of the discretization we follow the ideas of Kleinberg and Leighton (2003) and consider distributions $D$ that satisfy the condition that the function $f : p \\mapsto p D(v > p)$ admits a unique maximizer $p^*$ such that $f''(p^*) < 0$.\nThroughout this section, we let $K \\in \\mathbb{N}$ and we consider the following finite set of prices $\\mathcal{P}_K = \\big\\{ \\frac{i}{K} \\,|\\, 1 \\leq i \\leq K \\big\\} \\subset [0, 1]$. We also let $p_K$ be an optimal price in $\\mathcal{P}_K$, that is $p_K \\in \\mathrm{argmax}_{p \\in \\mathcal{P}_K} f(p)$ and we let $p^* = \\mathrm{argmax}_{p \\in [0,1]} f(p)$. Finally, we denote by $\\Delta_p = f(p_K) - f(p)$ the sub-optimality gap with respect to price $p_K$ and by $\\bar{\\Delta}_p = f(p^*) - f(p)$ the corresponding gap with respect to $p^*$. The following theorem can be proven following similar ideas to those of Kleinberg and Leighton (2003). We defer its proof to the appendix.\nTheorem 1. 
Let $K = \\big(\\frac{T}{\\log T}\\big)^{1/4}$. If the discounting factor $\\gamma$ satisfies $\\gamma \\leq \\gamma_0 < 1$ and the seller uses the R-UCB$_L$ algorithm with the set of prices $\\mathcal{P}_K$ and $L = \\Big\\lceil \\frac{\\log(1/(\\epsilon(1-\\gamma_0)))}{\\log(1/\\gamma_0)} \\Big\\rceil$, then the strategic regret of the seller can be bounded as follows:\n\n$$\\max_{p \\in [0,1]} T f(p) - \\mathbb{E}\\Big[\\sum_{t=1}^T a_t p_t\\Big] \\leq C \\sqrt{T \\log T} + \\Big\\lceil \\frac{\\log \\frac{1}{\\epsilon(1-\\gamma_0)}}{\\log \\frac{1}{\\gamma_0}} \\Big\\rceil \\Big[\\Big(\\frac{T}{\\log T}\\Big)^{1/4} + 1\\Big].$$\n\n6 Conclusion\n\nWe introduced a revenue optimization algorithm for posted-price auctions that is robust against $\\epsilon$-strategic buyers. Moreover, we showed that our notion of strategic behavior is more natural than what has been previously studied. Our algorithm benefits from the optimal $O\\big(\\log T + \\frac{1}{1-\\gamma}\\big)$ regret bound for a finite set of prices and admits regret in $O\\big(T^{1/2} + \\frac{T^{1/4}}{1-\\gamma}\\big)$ when the buyer is offered prices in [0, 1], a scenario that had not been considered previously in the literature of revenue optimization against strategic buyers. It is known that a regret in $o(T^{1/2})$ is unattainable even in a truthful setting, but it remains an open problem to verify that the dependency on $\\gamma$ cannot be improved. Our algorithm admits a simple analysis and we believe that the idea of making truthful algorithms robust is general and can be extended to more complex auction mechanisms such as second-price auctions with reserve.\n\n7 Acknowledgments\n\nWe thank Afshin Rostamizadeh and Umar Syed for useful discussions about the topic of this paper and the NIPS reviewers for their insightful comments. This work was partly funded by NSF IIS-1117591 and NSF CCF-1535987.\n\nReferences\n\nAbernethy, J., E. Hazan, and A. Rakhlin (2008). Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of COLT 2008, pp. 263\u2013274.\n\nAmin, K., A. Rostamizadeh, and U. Syed (2013). Learning prices for repeated auctions with strategic buyers. In Proceedings of NIPS, pp. 1169\u20131177.\n\nAmin, K., A. Rostamizadeh, and U. Syed (2014). Repeated contextual auctions with strategic buyers. 
In Proceedings of NIPS 2014, pp. 622\u2013630.\n\nArora, R., O. Dekel, and A. Tewari (2012). Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of ICML.\n\nAuer, P., N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235\u2013256.\n\nBikhchandani, S. and K. McCardle (2012). Behaviour-based price discrimination by a patient seller. The B.E. Journal of Theoretical Economics 12(1), 1935\u20131704.\n\nBubeck, S. and N. Cesa-Bianchi (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1), 1\u2013122.\n\nCesa-Bianchi, N., C. Gentile, and Y. Mansour (2015). Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory 61(1), 549\u2013564.\n\nCole, R. and T. Roughgarden (2014). The sample complexity of revenue maximization. In Proceedings of STOC 2014, pp. 243\u2013252.\n\nCui, Y., R. Zhang, W. Li, and J. Mao (2011). Bid landscape forecasting in online ad exchange marketplace. In Proceedings of SIGKDD 2011, pp. 265\u2013273.\n\nDani, V. and T. P. Hayes (2006). Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In Proceedings of SODA 2006, pp. 937\u2013943.\n\nEdelman, B. and M. Ostrovsky (2007). Strategic bidder behavior in sponsored search auctions. Decision Support Systems 43(1), 192\u2013198.\n\nKanoria, Y. and H. Nazerzadeh (2014). Dynamic reserve prices for repeated auctions: Learning from bids. In Proceedings of WINE 2014, pp. 232.\n\nKleinberg, R. D. and F. T. Leighton (2003). The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of FOCS 2003, pp. 594\u2013605.\n\nLai, T. and H. Robbins (1985). Asymptotically efficient adaptive allocation rules. 
Advances in Applied Mathematics 6(1), 4\u201322.\n\nMilgrom, P. and R. Weber (1982). A theory of auctions and competitive bidding. Econometrica: Journal of the Econometric Society 50(5), 1089\u20131122.\n\nMilgrom, P. R. (2004). Putting auction theory to work. Cambridge University Press.\n\nMohri, M. and A. M. Medina (2014a). Learning theory and algorithms for revenue optimization in second price auctions with reserve. In Proceedings of ICML 2014, pp. 262\u2013270.\n\nMohri, M. and A. M. Medina (2014b). Optimal regret minimization in posted-price auctions with strategic buyers. In Proceedings of NIPS 2014, pp. 1871\u20131879.\n\nMyerson, R. B. (1981). Optimal auction design. Mathematics of Operations Research 6(1), 58\u201373.\n\nNachbar, J. (2001). Bayesian learning in repeated games of incomplete information. Social Choice and Welfare 18(2), 303\u2013326.\n\nNachbar, J. H. (1997). Prediction, optimization, and learning in repeated games. Econometrica: Journal of the Econometric Society 65(2), 275\u2013309.\n\nVickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance 16(1), 8\u201337.", "award": [], "sourceid": 1497, "authors": [{"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Institute and Google"}, {"given_name": "Andres", "family_name": "Munoz", "institution": "Google"}]}