{"title": "Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 433, "page_last": 441, "abstract": "We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The time and budget constraints significantly complicate the exploration and exploitation tradeoff because they introduce complex coupling among contexts over time. To gain insight, we first study unit-cost systems with known context distribution. When the expected rewards are known, we develop an approximation of the oracle, referred to as Adaptive-Linear-Programming (ALP), which achieves near-optimality and only requires the ordering of expected rewards. With these highly desirable features, we then combine ALP with the upper-confidence-bound (UCB) method in the general case where the expected rewards are unknown a priori. We show that the proposed UCB-ALP algorithm achieves logarithmic regret except in certain boundary cases. Further, we design algorithms and obtain similar regret analysis results for more general systems with unknown context distribution or heterogeneous costs. To the best of our knowledge, this is the first work that shows how to achieve logarithmic regret in constrained contextual bandits. Moreover, this work also sheds light on the study of computationally efficient algorithms for general constrained contextual bandits.", "full_text": "Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits\n\nHuasen Wu\n\nUniversity of California at Davis\n\nhswu@ucdavis.edu\n\nXin Liu\n\nUniversity of California at Davis\n\nliu@cs.ucdavis.edu\n\nR. 
Srikant\n\nUniversity of Illinois at Urbana-Champaign\n\nrsrikant@illinois.edu\n\nChong Jiang\n\nUniversity of Illinois at Urbana-Champaign\n\njiang17@illinois.edu\n\nAbstract\n\nWe study contextual bandits with budget and time constraints, referred to as constrained\ncontextual bandits. The time and budget constraints significantly complicate\nthe exploration and exploitation tradeoff because they introduce complex\ncoupling among contexts over time. To gain insight, we first study unit-cost systems\nwith known context distribution. When the expected rewards are known, we\ndevelop an approximation of the oracle, referred to as Adaptive-Linear-Programming\n(ALP), which achieves near-optimality and only requires the ordering of expected\nrewards. With these highly desirable features, we then combine ALP with the\nupper-confidence-bound (UCB) method in the general case where the expected\nrewards are unknown a priori. We show that the proposed UCB-ALP algorithm\nachieves logarithmic regret except for certain boundary cases. Further, we design\nalgorithms and obtain similar regret bounds for more general systems with\nunknown context distribution and heterogeneous costs. To the best of our knowledge,\nthis is the first work that shows how to achieve logarithmic regret in constrained\ncontextual bandits. Moreover, this work also sheds light on the study of\ncomputationally efficient algorithms for general constrained contextual bandits.\n\n1 Introduction\n\nThe contextual bandit problem [1, 2, 3] is an important extension of the classic multi-armed bandit\n(MAB) problem [4], where the agent can observe a set of features, referred to as context, before\nmaking a decision. After the random arrival of a context, the agent chooses an action and receives\na random reward with expectation depending on both the context and action. 
To maximize the\ntotal reward, the agent needs to make a careful tradeoff between taking the best action based on the\nhistorical performance (exploitation) and discovering the potentially better alternative actions under\na given context (exploration). This model has attracted much attention as it fits the personalized\nservice requirement in many applications such as clinical trials, online recommendation, and online\nhiring in crowdsourcing. Existing works try to reduce the regret of contextual bandits by leveraging\nthe structure of the context-reward models such as linearity [5] or similarity [6], and more recent\nwork [7] focuses on computationally efficient algorithms with minimum regret. For Markovian\ncontext arrivals, algorithms such as UCRL [8] for more general reinforcement learning problems can\nbe used to achieve logarithmic regret.\nHowever, traditional contextual bandit models do not capture an important characteristic of real\nsystems: in addition to time, there is usually a cost associated with the resource consumed by each\naction and the total cost is limited by a budget in many applications. Taking crowdsourcing [9] as\nan example, the budget constraint for a given set of tasks will limit the number of workers that an\nemployer can hire. Another example is clinical trials [10], where each treatment is usually costly\nand the budget of a trial is limited. Although budget constraints have been studied in non-contextual\nbandits where logarithmic or sublinear regret is achieved [11, 12, 13, 14, 15, 16], as we will see\nlater, these results are inapplicable in the case with observable contexts.\nIn this paper, we study contextual bandit problems with budget and time constraints, referred to\nas constrained contextual bandits, where the agent is given a budget B and a time-horizon T. In\naddition to a reward, a cost is incurred whenever an action is taken under a context. 
The bandit\nprocess ends when the agent runs out of either budget or time. The objective of the agent is to\nmaximize the expected total reward subject to the budget and time constraints. We are interested in\nthe regime where B and T grow towards infinity proportionally.\nThe above constrained contextual bandit problem can be viewed as a special case of Resourceful\nContextual Bandits (RCB) [17]. In [17], RCB is studied under more general settings with possibly\ninfinite contexts, random costs, and multiple budget constraints. A Mixture Elimination algorithm is\nproposed and shown to achieve O(\u221aT) regret. However, the benchmark for the definition of regret\nin [17] is restricted to within a finite policy set. Moreover, the Mixture Elimination algorithm suffers\nfrom high complexity and the design of computationally efficient algorithms for such general settings is\nstill an open problem.\nTo tackle this problem, motivated by certain applications, we restrict the set of parameters in our\nmodel as follows: we assume finite discrete contexts, fixed costs, and a single budget constraint. This\nsimplified model is justified in many scenarios such as clinical trials [10] and rate selection in wireless\nnetworks [18]. More importantly, these simplifications allow us to design easily-implementable\nalgorithms that achieve O(log T) regret (except for a set of parameters of zero Lebesgue measure,\nwhich we refer to as boundary cases), where the regret is defined more naturally as the performance\ngap between the proposed algorithm and the oracle, i.e., the optimal algorithm with known statistics.\nEven with simplified assumptions considered in this paper, the exploration-exploitation tradeoff is\nstill challenging due to the budget and time constraints. The key challenge comes from the complexity\nof the oracle algorithm. 
With budget and time constraints, the oracle algorithm cannot simply\ntake the action that maximizes the instantaneous reward. Instead, it needs to balance between\nthe instantaneous and long-term rewards based on the current context and the remaining budget. In\nprinciple, dynamic programming (DP) can be used to obtain this balance. However, using DP in\nour scenario incurs difficulties in both algorithm design and analysis: first, the implementation of\nDP is computationally complex due to the curse of dimensionality; second, it is difficult to obtain\na benchmark for regret analysis, since the DP algorithm is implemented in a recursive manner and\nits expected total reward is hard to express in closed form; third, it is difficult to extend the\nDP algorithm to the case with unknown statistics, due to the difficulty of evaluating the impact of\nestimation errors on the performance of DP-type algorithms.\nTo address these difficulties, we first study approximations of the oracle algorithm when the system\nstatistics are known. Our key idea is to approximate the oracle algorithm with linear programming\n(LP) that relaxes the hard budget constraint to an average budget constraint. When fixing the average\nbudget constraint at B/T, this LP approximation provides an upper bound on the expected total\nreward, which serves as a good benchmark in regret analysis. Further, we propose an Adaptive\nLinear Programming (ALP) algorithm that adjusts the budget constraint to the average remaining\nbudget b_\u03c4/\u03c4, where \u03c4 is the remaining time and b_\u03c4 is the remaining budget. Note that although the\nidea of approximating a DP problem with an LP problem has been widely studied in literature (e.g.,\n[17, 19]), the design and analysis of ALP here is quite different. 
In particular, we show that ALP\nachieves O(1) regret, i.e., its expected total reward is within a constant independent of T from the\noptimum, except for certain boundary cases. This ALP approximation and its regret analysis make an\nimportant step towards achieving logarithmic regret for constrained contextual bandits.\nUsing the insights from the case with known statistics, we study algorithms for constrained contextual\nbandits with unknown expected rewards. Complicated interactions between information acquisition\nand decision making arise in this case. Fortunately, the ALP algorithm has a highly desirable\nproperty that it only requires the ordering of the expected rewards and can tolerate certain estimation\nerrors of system parameters. This property allows us to combine ALP with estimation methods that\ncan efficiently provide a correct rank of the expected rewards. In this paper, we propose a UCB-ALP\nalgorithm by combining ALP with the upper-confidence-bound (UCB) method [4]. We show that\nUCB-ALP achieves O(log T) regret except for certain boundary cases, where its regret is O(\u221aT).\nWe note that UCB-type algorithms are proposed in [20] for non-contextual bandits with concave\nrewards and convex constraints, and further extended to linear contextual bandits. However, [20]\nfocuses on static contexts (see Footnote 1) and achieves O(\u221aT) regret in our setting since it uses a fixed budget\nconstraint in each round. In comparison, we consider random context arrivals and use an adaptive\nbudget constraint to achieve logarithmic regret.\n\nFootnote 1: After the online publication of our preliminary version, two recent papers [21, 22] extend their previous\nwork [20] to the dynamic context case, where they focus on possibly infinite contexts and achieve O(\u221aT)\nregret, and [21] restricts to a finite policy set as in [17].\n\n
To the best of our knowledge, this is the first work\nthat shows how to achieve logarithmic regret in constrained contextual bandits. Moreover, the proposed\nUCB-ALP algorithm is quite computationally efficient and we believe these results shed light\non addressing the open problem of general constrained contextual bandits.\nAlthough the intuition behind ALP and UCB-ALP is natural, the rigorous analysis of their regret is\nnon-trivial since we need to consider many interacting factors such as action/context ranking errors,\nremaining budget fluctuation, and randomness of context arrival. We evaluate the impact of these\nfactors using a series of novel techniques, e.g., the method of showing concentration properties under\nadaptive algorithms and the method of bounding estimation errors under random contexts. For\nease of exposition, we study the ALP and UCB-ALP algorithms in unit-cost systems with known\ncontext distribution in Sections 3 and 4, respectively. Then we discuss the generalization to systems\nwith unknown context distribution in Section 5 and with heterogeneous costs in Section 6, which\nare much more challenging; the details can be found in the supplementary material.\n\n2 System Model\nWe consider a contextual bandit problem with a context set X = {1, 2, . . . , J} and an action set\nA = {1, 2, . . . , K}. At each round t, a context X_t arrives independently with identical distribution\nP{X_t = j} = \u03c0_j, j \u2208 X, and each action k \u2208 A generates a non-negative reward Y_{k,t}. Under a\ngiven context X_t = j, the rewards Y_{k,t} are independent random variables in [0, 1]. The conditional\nexpectation E[Y_{k,t} | X_t = j] = u_{j,k} is unknown to the agent. Moreover, a cost is incurred if action k\nis taken under context j. To gain insight into constrained contextual bandits, we consider fixed and\nknown costs in this paper, where the cost is c_{j,k} > 0 when action k is taken under context j. 
Similar\nto traditional contextual bandits, the context X_t is observable at the beginning of round t, while only\nthe reward of the action taken by the agent is revealed at the end of round t.\nAt the beginning of round t, the agent observes the context X_t and takes an action A_t from {0} \u222a A,\nwhere \u201c0\u201d represents a dummy action with which the agent skips the current context. Let Y_t and Z_t be the\nreward and cost for the agent in round t, respectively. If the agent takes an action A_t = k > 0,\nthen the reward is Y_t = Y_{k,t} and the cost is Z_t = c_{X_t,k}. Otherwise, if the agent takes the dummy\naction A_t = 0, neither reward nor cost is incurred, i.e., Y_t = 0 and Z_t = 0. In this paper, we focus\non contextual bandits with a known time-horizon T and limited budget B. The bandit process ends\nwhen the agent runs out of the budget or at the end of time T.\nA contextual bandit algorithm \u0393 is a function that maps the historical observations H_{t\u22121} =\n(X_1, A_1, Y_1; X_2, A_2, Y_2; . . . ; X_{t\u22121}, A_{t\u22121}, Y_{t\u22121}) and the current context X_t to an action A_t \u2208\n{0} \u222a A. The objective of the algorithm is to maximize the expected total reward U_\u0393(T, B) for\na given time-horizon T and a budget B, i.e.,\n\nmaximize_\u0393 U_\u0393(T, B) = E_\u0393[ \u2211_{t=1}^{T} Y_t ]\nsubject to \u2211_{t=1}^{T} Z_t \u2264 B,\n\nwhere the expectation is taken over the distributions of contexts and rewards. Note that we consider\na \u201chard\u201d budget constraint, i.e., the total costs should not be greater than B under any realization.\nWe measure the performance of the algorithm \u0393 by comparing it with the oracle, which is the optimal\nalgorithm with known statistics, including the knowledge of \u03c0_j\u2019s, u_{j,k}\u2019s, and c_{j,k}\u2019s. Let U^*(T, B)\nbe the expected total reward obtained by the oracle algorithm. 
Then, the regret of the algorithm \u0393 is\ndefined as\n\nR_\u0393(T, B) = U^*(T, B) \u2212 U_\u0393(T, B).\n\nThe objective of the algorithm is then to minimize the regret. We are interested in the asymptotic\nregime where the time-horizon T and the budget B grow to infinity proportionally, i.e., with a fixed\nratio \u03c1 = B/T.\n\n3 Approximations of the Oracle\n\nIn this section, we study approximations of the oracle, where the statistics of bandits are known\nto the agent. This will provide a benchmark for the regret analysis and insights into the design of\nconstrained contextual bandit algorithms.\nAs a starting point, we focus on unit-cost systems, i.e., c_{j,k} = 1 for each j and k, from Section 3 to\nSection 5, which will be relaxed in Section 6. In unit-cost systems, the quality of action k under context\nj is fully captured by its expected reward u_{j,k}. Let u_j^* be the highest expected reward under context\nj, and k_j^* be the best action for context j, i.e., u_j^* = max_{k\u2208A} u_{j,k} and k_j^* = arg max_{k\u2208A} u_{j,k}.\nFor ease of exposition, we assume that the best action under each context is unique, i.e., u_{j,k} < u_j^*\nfor all j and k \u2260 k_j^*. Similarly, we also assume u_1^* > u_2^* > . . . > u_J^* for simplicity.\nWith the knowledge of u_{j,k}\u2019s, the agent knows the best action k_j^* and its expected reward u_j^* under\nany context j. In each round t, the task of the oracle is deciding whether to take action k_{X_t}^* or not\ndepending on the remaining time \u03c4 = T \u2212 t + 1 and the remaining budget b_\u03c4.\nThe special case of two-context systems (J = 2) is trivial, where the agent just needs to procrastinate\nfor the better context (see Appendix D of the supplementary material). When considering more\ngeneral cases with J > 2, however, it is computationally intractable to exactly characterize the\noracle solution. Therefore, we resort to approximations based on linear programming (LP).\n\n3.1 Upper Bound: Static Linear Programming\nWe propose an upper bound for the expected total reward U^*(T, B) of the oracle by relaxing the\nhard constraint to an average constraint and solving the corresponding constrained LP problem.\nSpecifically, let p_j \u2208 [0, 1] be the probability that the agent takes action k_j^* for context j, and 1 \u2212 p_j\nbe the probability that the agent skips context j (i.e., taking action A_t = 0). Denote the probability\nvector as p = (p_1, p_2, . . . , p_J). For a time-horizon T and budget B, consider the following LP\nproblem:\n\n(LP_{T,B}) maximize_p \u2211_{j=1}^{J} p_j \u03c0_j u_j^*, (1)\nsubject to \u2211_{j=1}^{J} p_j \u03c0_j \u2264 B/T, (2)\np \u2208 [0, 1]^J.\n\nDefine the following threshold as a function of the average budget \u03c1 = B/T:\n\nj\u0303(\u03c1) = max{ j : \u2211_{j\u2032=1}^{j} \u03c0_{j\u2032} \u2264 \u03c1 } (3)\n\nwith the convention that j\u0303(\u03c1) = 0 if \u03c0_1 > \u03c1. We can verify that the following solution is optimal for\nLP_{T,B}:\n\np_j(\u03c1) = 1, if 1 \u2264 j \u2264 j\u0303(\u03c1),\np_j(\u03c1) = (\u03c1 \u2212 \u2211_{j\u2032=1}^{j\u0303(\u03c1)} \u03c0_{j\u2032}) / \u03c0_{j\u0303(\u03c1)+1}, if j = j\u0303(\u03c1) + 1,\np_j(\u03c1) = 0, if j > j\u0303(\u03c1) + 1. (4)\n\nCorrespondingly, the optimal value of LP_{T,B} is\n\nv(\u03c1) = \u2211_{j=1}^{j\u0303(\u03c1)} \u03c0_j u_j^* + p_{j\u0303(\u03c1)+1}(\u03c1) \u03c0_{j\u0303(\u03c1)+1} u_{j\u0303(\u03c1)+1}^*. (5)\n\nThis optimal value v(\u03c1) can be viewed as the maximum expected reward in a single round with\naverage budget \u03c1. Summing over the entire horizon, the total expected reward becomes \u00db(T, B) =\nT v(\u03c1), which is an upper bound of U^*(T, B).\nLemma 1. For a unit-cost system with known statistics, if the time-horizon is T and the budget is\nB, then \u00db(T, B) \u2265 U^*(T, B).\nThe proof of Lemma 1 is available in Appendix A of the supplementary material. With Lemma 1, we\ncan bound the regret of any algorithm by comparing its performance with the upper bound \u00db(T, B)\ninstead of U^*(T, B). Since \u00db(T, B) has a simple expression, as we will see later, it significantly\nreduces the complexity of regret analysis.\n\n3.2 Adaptive Linear Programming\n\nAlthough the solution (4) provides an upper bound on the expected reward, using such a fixed\nalgorithm will not achieve good performance as the ratio b_\u03c4/\u03c4, referred to as average remaining\nbudget, fluctuates over time. We propose an Adaptive Linear Programming (ALP) algorithm that\nadjusts the threshold and randomization probability according to the instantaneous value of b_\u03c4/\u03c4.\nSpecifically, when the remaining time is \u03c4 and the remaining budget is b_\u03c4 = b, we consider an LP\nproblem LP_{\u03c4,b} which is the same as LP_{T,B} except that B/T in Eq. (2) is replaced with b/\u03c4. Then,\nthe optimal solution for LP_{\u03c4,b} can be obtained by replacing \u03c1 in Eqs. (3), (4), and (5) with b/\u03c4. The\nALP algorithm then makes decisions based on this optimal solution.\nALP Algorithm: At each round t with remaining budget b_\u03c4 = b, obtain p_j(b/\u03c4)\u2019s by solving LP_{\u03c4,b};\ntake action A_t = k_{X_t}^* with probability p_{X_t}(b/\u03c4), and A_t = 0 with probability 1 \u2212 p_{X_t}(b/\u03c4).\nThe above ALP algorithm only requires the ordering of the expected rewards instead of their accurate\nvalues. This highly desirable feature allows us to combine ALP with classic MAB algorithms such as\nUCB [4] for the case without knowledge of expected rewards. Moreover, this simple ALP algorithm\nachieves very good performance within a constant distance from the optimum, i.e., O(1) regret,\nexcept for certain boundary cases. Specifically, for 1 \u2264 j \u2264 J, let q_j be the cumulative probability\ndefined as q_j = \u2211_{j\u2032=1}^{j} \u03c0_{j\u2032} with the convention that q_0 = 0. The following theorem states the near\noptimality of ALP.\nTheorem 1. Given any fixed \u03c1 \u2208 (0, 1), the regret of ALP satisfies:\n1) (Non-boundary cases) if \u03c1 \u2260 q_j for any j \u2208 {1, 2, . . . , J \u2212 1}, then R_ALP(T, B) \u2264 (u_1^* \u2212 u_J^*) / (1 \u2212 e^{\u22122\u03b4^2}),\nwhere \u03b4 = min{\u03c1 \u2212 q_{j\u0303(\u03c1)}, q_{j\u0303(\u03c1)+1} \u2212 \u03c1}.\n2) (Boundary cases) if \u03c1 = q_j for some j \u2208 {1, 2, . . . , J \u2212 1}, then R_ALP(T, B) \u2264 \u0398^{(o)} \u221aT +\n(u_1^* \u2212 u_J^*) / (1 \u2212 e^{\u22122(\u03b4\u2032)^2}), where \u0398^{(o)} = 2(u_1^* \u2212 u_J^*) \u221a(\u03c1(1 \u2212 \u03c1)) and \u03b4\u2032 = min{\u03c1 \u2212 q_{j\u0303(\u03c1)\u22121}, q_{j\u0303(\u03c1)+1} \u2212 \u03c1}.\nTheorem 1 shows that ALP achieves O(1) regret except for certain boundary cases, where it still\nachieves O(\u221aT) regret. This implies that the regret due to the linear relaxation is negligible in most\ncases. Thus, when the expected rewards are unknown, we can achieve low regret, e.g., logarithmic\nregret, by combining ALP with appropriate information-acquisition mechanisms.\nSketch of Proof: Although the ALP algorithm seems fairly intuitive, its regret analysis is non-trivial.\nThe key to the proof is to analyze the evolution of the remaining budget b_\u03c4 by mapping\nALP to \u201csampling without replacement\u201d. Specifically, from Eq. (4), we can verify that when the\nremaining time is \u03c4 and the remaining budget is b_\u03c4 = b, the system consumes one unit of budget with\nprobability b/\u03c4, and consumes nothing with probability 1 \u2212 b/\u03c4. When considering the remaining\nbudget, the ALP algorithm can be viewed as \u201csampling without replacement\u201d. Thus, we can show\nthat b_\u03c4 follows the hypergeometric distribution [23] and has the following properties:\nLemma 2. Under the ALP algorithm, the remaining budget b_\u03c4 satisfies:\n1) The expectation and variance of b_\u03c4 are E[b_\u03c4] = \u03c1\u03c4 and Var(b_\u03c4) = ((T \u2212 \u03c4)/(T \u2212 1)) \u03c4 \u03c1(1 \u2212 \u03c1), respectively.\n2) For any positive number \u03b4 satisfying 0 < \u03b4 < min{\u03c1, 1 \u2212 \u03c1}, the tail distribution of b_\u03c4 satisfies\n\nP{b_\u03c4 < (\u03c1 \u2212 \u03b4)\u03c4} \u2264 e^{\u22122\u03b4^2 \u03c4} and P{b_\u03c4 > (\u03c1 + \u03b4)\u03c4} \u2264 e^{\u22122\u03b4^2 \u03c4}.\n\nThen, we prove Theorem 1 based on Lemma 2. Note that the expected total reward under ALP\nis U_ALP(T, B) = E[ \u2211_{\u03c4=1}^{T} v(b_\u03c4/\u03c4) ], where v(\u00b7) is defined in (5) and the expectation is taken\nover the distribution of b_\u03c4. For the non-boundary cases, the single-round expected reward satisfies\nE[v(b_\u03c4/\u03c4)] = v(\u03c1) if the threshold j\u0303(b_\u03c4/\u03c4) = j\u0303(\u03c1) for all possible b_\u03c4\u2019s. The regret then is bounded\nby a constant because the probability of the event j\u0303(b_\u03c4/\u03c4) \u2260 j\u0303(\u03c1) decays exponentially due to the\nconcentration property of b_\u03c4. For the boundary cases, we show the conclusion by relating the regret\nwith the variance of b_\u03c4. 
Please refer to Appendix B of the supplementary material for details.\n\n4 UCB-ALP Algorithm for Constrained Contextual Bandits\n\nNow we get back to the constrained contextual bandits, where the expected rewards are unknown\nto the agent. We assume the agent knows the context distribution as in [17], which will be relaxed in\nSection 5. Thanks to the desirable properties of ALP, the maxim of \u201coptimism under uncertainty\u201d\n[8] is still applicable and ALP can be extended to the bandit settings when combined with estimation\npolicies that can quickly provide correct ranking with high probability. Here, combining ALP with\nthe UCB method [4], we propose a UCB-ALP algorithm for constrained contextual bandits.\n\n4.1 UCB: Notations and Property\nLet C_{j,k}(t) be the number of times that action k \u2208 A has been taken under context j up to round t.\nIf C_{j,k}(t \u2212 1) > 0, let \u016b_{j,k}(t) be the empirical reward of action k under context j, i.e., \u016b_{j,k}(t) =\n(1 / C_{j,k}(t \u2212 1)) \u2211_{t\u2032=1}^{t\u22121} Y_{t\u2032} 1(X_{t\u2032} = j, A_{t\u2032} = k), where 1(\u00b7) is the indicator function. We define the UCB\nof u_{j,k} at t as \u00fb_{j,k}(t) = \u016b_{j,k}(t) + \u221a(log t / (2 C_{j,k}(t \u2212 1))) for C_{j,k}(t \u2212 1) > 0, and \u00fb_{j,k}(t) = 1 for C_{j,k}(t \u2212\n1) = 0. Furthermore, we define the UCB of the maximum expected reward under context j as\n\u00fb_j^*(t) = max_{k\u2208A} \u00fb_{j,k}(t). As suggested in [24], we use a smaller coefficient in the exploration term\n\u221a(log t / (2 C_{j,k}(t \u2212 1))) than the traditional UCB algorithm [4] to achieve better performance.\n\nWe present the following property of UCB that is important in regret analysis.\nLemma 3. 
For two context-action pairs, (j, k) and (j\u2032, k\u2032), if u_{j,k} < u_{j\u2032,k\u2032}, then for any t \u2264 T,\n\nP{\u00fb_{j,k}(t) \u2265 \u00fb_{j\u2032,k\u2032}(t) | C_{j,k}(t \u2212 1) \u2265 \u2113_{j,k}} \u2264 2t^{\u22121}, (6)\n\nwhere \u2113_{j,k} = 2 log T / (u_{j\u2032,k\u2032} \u2212 u_{j,k})^2.\n\nLemma 3 states that for two context-action pairs, the ordering of their expected rewards can be identified\ncorrectly with high probability, as long as the suboptimal pair has been executed a sufficient number of\ntimes (on the order of O(log T)). This property has been widely applied in the analysis of UCB-based\nalgorithms [4, 13], and its proof can be found in [13, 25] with a minor modification on the\ncoefficients.\n\n4.2 UCB-ALP Algorithm\n\nWe propose a UCB-based adaptive linear programming (UCB-ALP) algorithm, as shown in Algorithm 1.\nAs indicated by the name, the UCB-ALP algorithm maintains UCB estimates of expected\nrewards for all context-action pairs and then implements the ALP algorithm based on these estimates.\nNote that the UCB estimates \u00fb_j^*(t)\u2019s may be non-decreasing in j. Thus, the solution of\nLP_{\u03c4,b} based on \u00fb_j^*(t) depends on the actual ordering of \u00fb_j^*(t)\u2019s and may be different from Eq. (4).\nWe use p\u0302_j(\u00b7) rather than p_j(\u00b7) to indicate this difference.\n\nAlgorithm 1 UCB-ALP\nInput: Time-horizon T, budget B, and context distribution \u03c0_j\u2019s;\nInit: \u03c4 = T, b = B;\n  C_{j,k}(0) = 0, \u016b_{j,k}(0) = 0, \u00fb_{j,k}(0) = 1, \u2200j \u2208 X and \u2200k \u2208 A; \u00fb_j^*(0) = 1, \u2200j \u2208 X;\nfor t = 1 to T do\n  k_j^*(t) \u2190 arg max_k \u00fb_{j,k}(t), \u2200j;\n  \u00fb_j^*(t) \u2190 \u00fb_{j,k_j^*(t)}(t);\n  if b > 0 then\n    Obtain the probabilities p\u0302_j(b/\u03c4)\u2019s by solving LP_{\u03c4,b} with u_j^* replaced by \u00fb_j^*(t);\n    Take action k_{X_t}^*(t) with probability p\u0302_{X_t}(b/\u03c4);\n  end if\n  Update \u03c4, b, C_{j,k}(t), \u016b_{j,k}(t), and \u00fb_{j,k}(t).\nend for\n\n4.3 Regret of UCB-ALP\n\nWe study the regret of UCB-ALP in this section. Due to space limitations, we only present a sketch\nof the analysis. Specific representations of the regret bounds and proof details can be found in the\nsupplementary material.\n\nRecall that q_j = \u2211_{j\u2032=1}^{j} \u03c0_{j\u2032} (1 \u2264 j \u2264 J) are the boundaries defined in Section 3. We show that\nas the budget B and the time-horizon T grow to infinity in proportion, the proposed UCB-ALP\nalgorithm achieves logarithmic regret except for the boundary cases.\n\nTheorem 2. Given \u03c0_j\u2019s, u_{j,k}\u2019s and a fixed \u03c1 \u2208 (0, 1), the regret of UCB-ALP satisfies:\n1) (Non-boundary cases) if \u03c1 \u2260 q_j for any j \u2208 {1, 2, . . . , J \u2212 1}, then the regret of UCB-ALP is\nR_{UCB-ALP}(T, B) = O(JK log T).\n2) (Boundary cases) if \u03c1 = q_j for some j \u2208 {1, 2, . . . , J \u2212 1}, then the regret of UCB-ALP is\nR_{UCB-ALP}(T, B) = O(\u221aT + JK log T).\n\nTheorem 2 differs from Theorem 1 by an additional term O(JK log T). This term results from using\nUCB to learn the ordering of expected rewards. Under UCB, each of the JK context-action pairs\nshould be executed roughly O(log T) times to obtain the correct ordering. For the non-boundary\ncases, UCB-ALP is order-optimal because obtaining the correct action ranking under each context\nwill result in O(log T) regret [26]. Note that our results do not contradict the lower bound in [17]\nbecause we consider discrete contexts and actions, and focus on instance-dependent regret. For\nthe boundary cases, we keep both the \u221aT and log T terms because the constant in the log T term\nis typically much larger than that in the \u221aT term. Therefore, the log T term may dominate the\nregret particularly when the number of context-action pairs is large for medium T. It is still an open\nproblem if one can achieve regret lower than O(\u221aT) in these cases.\nSketch of Proof: We bound the regret of UCB-ALP by comparing its performance with the benchmark\n\u00db(T, B). The analysis of this bound is challenging due to the close interactions among different\nsources of regret and the randomness of context arrivals. We first partition the regret according\nto the sources and then bound each part of regret, respectively.\nStep 1: Partition the regret. By analyzing the implementation of UCB-ALP, we show that its\nregret is bounded as\n\nR_{UCB-ALP}(T, B) \u2264 R^{(a)}_{UCB-ALP}(T, B) + R^{(c)}_{UCB-ALP}(T, B),\n\nwhere the first part R^{(a)}_{UCB-ALP}(T, B) = \u2211_{j=1}^{J} \u2211_{k\u2260k_j^*} (u_j^* \u2212 u_{j,k}) E[C_{j,k}(T)] is the regret from\naction ranking errors within a context, and the second part R^{(c)}_{UCB-ALP}(T, B) = \u2211_{\u03c4=1}^{T} E[ v(\u03c1) \u2212\n\u2211_{j=1}^{J} p\u0302_j(b_\u03c4/\u03c4) \u03c0_j u_j^* ] is the regret from the fluctuations of b_\u03c4 and context ranking errors.\nStep 2: Bound each part of regret. For the first part, we can show that R^{(a)}_{UCB-ALP}(T, B) =\nO(log T) using similar techniques for traditional UCB methods [25]. The major challenge of regret\nanalysis for UCB-ALP then lies in the evaluation of the second part R^{(c)}_{UCB-ALP}(T, B).\nWe first verify that the evolution of b_\u03c4 under UCB-ALP is similar to that under ALP and Lemma 2\nstill holds under UCB-ALP. With respect to context ranking errors, we note that unlike classic UCB\nmethods, not all context ranking errors contribute to the regret due to the threshold structure of\nALP. Therefore, we carefully categorize the context ranking results based on their contributions. We\nbriefly discuss the analysis for the non-boundary cases here. Recall that j\u0303(\u03c1) is the threshold for the\nstatic LP problem LP_{T,B}. We define the following events that capture all possible ranking results\nbased on UCBs:\n\nE_{rank,0}(t) = {\u2200j \u2264 j\u0303(\u03c1), \u00fb_j^*(t) > \u00fb_{j\u0303(\u03c1)+1}^*(t); \u2200j > j\u0303(\u03c1) + 1, \u00fb_j^*(t) < \u00fb_{j\u0303(\u03c1)+1}^*(t)},\nE_{rank,1}(t) = {\u2203j \u2264 j\u0303(\u03c1), \u00fb_j^*(t) \u2264 \u00fb_{j\u0303(\u03c1)+1}^*(t); \u2200j > j\u0303(\u03c1) + 1, \u00fb_j^*(t) < \u00fb_{j\u0303(\u03c1)+1}^*(t)},\nE_{rank,2}(t) = {\u2203j > j\u0303(\u03c1) + 1, \u00fb_j^*(t) \u2265 \u00fb_{j\u0303(\u03c1)+1}^*(t)}.\n\nThe first event E_{rank,0}(t) indicates a roughly correct context ranking, because under E_{rank,0}(t) UCB-ALP\nobtains a correct solution for LP_{\u03c4,b_\u03c4} if b_\u03c4/\u03c4 \u2208 [q_{j\u0303(\u03c1)}, q_{j\u0303(\u03c1)+1}]. The last two events E_{rank,s}(t),\ns = 1, 2, represent two types of context ranking errors: E_{rank,1}(t) corresponds to \u201ccertain contexts\nwith above-threshold reward having lower UCB\u201d, while E_{rank,2}(t) corresponds to \u201ccertain contexts\nwith below-threshold reward having higher UCB\u201d. Let T^{(s)} = \u2211_{t=1}^{T} 1(E_{rank,s}(t)) for 0 \u2264 s \u2264 2.\nWe can show that the expected number of context ranking errors satisfies E[T^{(s)}] = O(JK log T),\ns = 1, 2, implying that R^{(c)}_{UCB-ALP}(T, B) = O(JK log T). Summarizing the two parts, we have\nR_{UCB-ALP}(T, B) = O(JK log T) for the non-boundary cases. The regret for the boundary cases\ncan be bounded using similar arguments.\nKey Insights from UCB-ALP: Constrained contextual bandits involve complicated interactions\nbetween information acquisition and decision making. UCB-ALP alleviates these interactions by\napproximating the oracle with ALP for decision making. 
This approximation achieves near-optimal\nperformance while tolerating certain estimation errors of system statistics, and thus enables the\ncombination with estimation methods such as UCB in unknown statistics cases. Moreover, the\nadaptation property of UCB-ALP guarantees the concentration property of the system status, e.g.,\nb\u03c4 /\u03c4. This allows us to separately study the impact of action or context ranking errors and conduct\nrigorous analysis of regret. These insights can be applied in algorithm design and analysis for\nconstrained contextual bandits under more general settings.\n\n5 Bandits with Unknown Context Distribution\n\n(cid:80)t\n\nt(cid:48)=1\n\nWhen the context distribution is unknown, a reasonable heuristic is to replace the probability \u03c0j in\nALP with its empirical estimate, i.e., \u02c6\u03c0j(t) = 1\n1(Xt(cid:48) = j). We refer to this modi\ufb01ed ALP\nt\nalgorithm as Empirical ALP (EALP), and its combination with UCB as UCB-EALP.\nThe empirical distribution provides a maximum likelihood estimate of the context distribution and\nthe EALP and UCB-EALP algorithms achieve similar performance as ALP and UCB-ALP, respec-\ntively, as observed in numerical simulations. However, a rigorous analysis for EALP and UCB-\nEALP is much more challenging due to the dependency introduced by the empirical distribution. To\ntackle this issue, our rigorous analysis focuses on a truncated version of EALP where we stop updat-\ning the empirical distribution after a given round. Using the method of bounded averaged differences\nbased on coupling argument, we obtain the concentration property of the average remaining budget\nb\u03c4 /\u03c4, and show that this truncated EALP algorithm achieves O(1) regret except for the boundary\ncases. 
The regret of the corresponding UCB-based version can be bounded similarly to that of UCB-ALP.\n\n6 Bandits with Heterogeneous Costs\n\nThe insights obtained from unit-cost systems can also be used to design algorithms for heterogeneous\ncost systems where the cost c_{j,k} depends on j and k. We generalize the ALP algorithm to\napproximate the oracle, and adjust it to the case with unknown expected rewards. For simplicity, we\nassume the context distribution is known here, while the empirical estimate can be used to replace\nthe actual context distribution if it is unknown, as discussed in the previous section.\nWith heterogeneous costs, the quality of an action k under a context j is roughly captured by its\nnormalized expected reward, defined as \u03b7_{j,k} = u_{j,k}/c_{j,k}. However, the agent cannot focus only\non the \u201cbest\u201d action, i.e., k*_j = arg max_{k\u2208A} \u03b7_{j,k}, for context j. This is because there may exist\nanother action k' such that \u03b7_{j,k'} < \u03b7_{j,k*_j}, but u_{j,k'} > u_{j,k*_j} (and of course, c_{j,k'} > c_{j,k*_j}). If\nthe budget allocated to context j is sufficient, then the agent may take action k' to maximize the\nexpected reward. Therefore, to approximate the oracle, the ALP algorithm in this case needs to\nsolve an LP problem accounting for all context-action pairs with an additional constraint that only\none action can be taken under each context. By investigating the structure of ALP in this case and\nthe concentration of the remaining budget, we show that ALP achieves O(1) regret in non-boundary\ncases, and O(\u221aT) regret in boundary cases. Then, an \u03b5-First ALP algorithm is proposed for the\nunknown statistics case where an exploration stage is implemented first and then an exploitation\nstage is implemented according to ALP.\n\n7 Conclusion\n\nIn this paper, we study computationally-efficient algorithms that achieve logarithmic or sublinear\nregret for constrained contextual bandits. Under simplified yet practical assumptions, we show\nthat the close interactions between the information acquisition and decision making in constrained\ncontextual bandits can be decoupled by adaptive linear relaxation. When the system statistics are\nknown, the ALP approximation achieves near-optimal performance, while tolerating certain estimation\nerrors of system parameters. When the expected rewards are unknown, the proposed UCB-ALP\nalgorithm leverages the advantages of ALP and UCB, and achieves O(log T) regret except for certain\nboundary cases, where it achieves O(\u221aT) regret. Our study provides an efficient approach to\ndealing with the challenges introduced by budget constraints and could potentially be extended to\nmore general constrained contextual bandits.\nAcknowledgements: This research was supported in part by NSF Grants CCF-1423542, CNS-1457060,\nCNS-1547461, and AFOSR MURI Grant FA 9550-10-1-0573.\n\nReferences\n[1] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS), pages 817\u2013824, 2007.\n[2] T. Lu, D. P\u00e1l, and M. P\u00e1l. Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 485\u2013492, 2010.\n[3] L. Zhou. A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326, 2015.\n[4] P. Auer, N. Cesa-Bianchi, and P. Fischer.\n
Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235\u2013256, 2002.\n[5] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In ACM International Conference on World Wide Web (WWW), pages 661\u2013670, 2010.\n[6] A. Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533\u20132568, 2014.\n[7] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML), 2014.\n[8] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 49\u201356, 2007.\n[9] A. Badanidiyuru, R. Kleinberg, and Y. Singer. Learning on a budget: posted price mechanisms for online procurement. In ACM Conference on Electronic Commerce, pages 128\u2013145, 2012.\n[10] T. L. Lai and O. Y.-W. Liao. Efficient adaptive randomization and stopping rules in multi-arm clinical trials for testing a new treatment. Sequential Analysis, 31(4):441\u2013457, 2012.\n[11] L. Tran-Thanh, A. C. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In AAAI Conference on Artificial Intelligence, 2012.\n[12] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 207\u2013216, 2013.\n[13] C. Jiang and R. Srikant. Bandits with budgets. In IEEE 52nd Annual Conference on Decision and Control (CDC), pages 5345\u20135350, 2013.\n[14] A. Slivkins. Dynamic ad allocation: Bandits with budgets. arXiv preprint arXiv:1306.0155, 2013.\n[15] Y. Xia, H. Li, T. Qin, N. Yu, and T.-Y. Liu. Thompson sampling for budgeted multi-armed bandits. In International Joint Conference on Artificial Intelligence, 2015.\n[16] R. Combes, C. Jiang, and R. Srikant. Bandits with budgets: Regret lower bounds and optimal algorithms. In ACM Sigmetrics, 2015.\n[17] A. Badanidiyuru, J. Langford, and A. Slivkins. Resourceful contextual bandits. In Conference on Learning Theory (COLT), 2014.\n[18] R. Combes, A. Proutiere, D. Yun, J. Ok, and Y. Yi. Optimal rate sampling in 802.11 systems. In IEEE INFOCOM, pages 2760\u20132767, 2014.\n[19] M. H. Veatch. Approximate linear programming for average cost MDPs. Mathematics of Operations Research, 38(3):535\u2013544, 2013.\n[20] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In ACM Conference on Economics and Computation, pages 989\u20131006. ACM, 2014.\n[21] S. Agrawal, N. R. Devanur, and L. Li. Contextual bandits with global constraints and objective. arXiv preprint arXiv:1506.03374, 2015.\n[22] S. Agrawal and N. R. Devanur. Linear contextual bandits with global constraints and objective. arXiv preprint arXiv:1507.06738, 2015.\n[23] D. P. Dubhashi and A. Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009.\n[24] A. Garivier and O. Capp\u00e9. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory (COLT), pages 359\u2013376, 2011.\n[25] D. Golovin and A. Krause. Dealing with partial feedback, 2009.\n[26] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n", "award": [], "sourceid": 333, "authors": [{"given_name": "Huasen", "family_name": "Wu", "institution": "University of California at Davis"}, {"given_name": "R.", "family_name": "Srikant", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Xin", "family_name": "Liu", "institution": "University of California, Davis"}, {"given_name": "Chong", "family_name": "Jiang", "institution": "University of Illinois at Urbana-Champaign"}]}