{"title": "Committing Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1557, "page_last": 1565, "abstract": "We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment $\\Theta(\\ln T)$ steps and then commit, where $T$ is the time horizon.", "full_text": "Committing Bandits

Loc Bui (MS&E Department, Stanford University)
Ramesh Johari (MS&E Department, Stanford University)
Shie Mannor (EE Department, Technion)

Abstract

We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T) steps and then commit, where T is the time horizon.

1 Introduction
In a range of applications, a dynamic decision making problem exhibits two distinctly different kinds of phases: experimentation and commitment.
In the first phase, the decision maker explores multiple options, to determine which might be most suitable for the task at hand. However, eventually the decision maker must commit to a choice, and use that decision for the duration of the problem horizon. A notable feature of these phases in the models we study is that costs are incurred during both phases; that is, experimentation is not carried out "offline," but rather is run "live" in the actual system.
For example, consider the design of a recommendation engine for an online retailer (such as Amazon). Experimentation amounts to testing different recommendation strategies on arriving customers. However, such testing is not carried out without consequences; the retailer might lose potential rewards if experimentation leads to suboptimal recommendations. Eventually, the recommendation engine must be stabilized (both from a software development standpoint and a customer expectation standpoint), and when this happens the retailer has effectively committed to one strategy moving forward. As another example, consider product design and delivery (e.g., tapeouts in semiconductor manufacturing, or major releases in software engineering). The process of experimentation during design entails costs to the producer, but eventually the experimentation must stop and the design must be committed. Another example is that of dating followed by marriage to, hopefully, the best possible mate.
In this paper we consider a class of multi-armed bandit problems (which we call committing bandit problems) that mix these two features: the decision maker is allowed to try different arms in each period until commitment, at which point a final choice is made ("committed") and the chosen arm is used until the end of the horizon. Of course, models that investigate each phase in isolation are extensively studied.
If the problem consists of only experimentation, then we have the classical multi-armed bandit problem, where the decision maker is interested in minimizing the expected total regret against the best arm [9, 2]. At the other extreme, several papers have studied the pure exploration or budgeted learning problem, where the goal is to output the best arm at the end of an experimentation phase [13, 6, 4]; no costs are incurred for experimentation, but after finite time a single decision must be chosen (see [12] for a review).

*Email: locbui@stanford.edu
†Email: ramesh.johari@stanford.edu
‡Email: shie@ee.technion.ac.il

Formally, in a committing bandit problem, the decision maker can experiment without constraints for the first N of T periods, but must commit to a single decision for the last T − N periods, where T is the problem horizon. We first consider the soft deadline setting where the experimentation deadline N can be chosen by the decision maker, but there is a cost incurred per experimentation period. We divide this setting into two regimes depending on how N is chosen: the non-adaptive regime (Section 3) in which the decision maker has to choose N before the algorithm begins running, and the adaptive regime (Section 4) in which N can be chosen adaptively as the algorithm runs.
We obtain two main results for the soft deadline setting. First, in both regimes, we find that the best tradeoff between experimentation and commitment (in terms of expected regret performance) is essentially obtained by experimenting for N = Θ(ln T) periods, and then committing to the empirical best action for the remaining T − Θ(ln T) periods; this yields an expected average regret of Θ(ln T / T).
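This experiment-then-commit strategy is easy to simulate. The following is a minimal sketch (ours, not the paper's code) for Bernoulli arms: the function name, the arm parameters, the constant 25 in front of ln T, and the averaging over 50 runs are illustrative choices only.

```python
import math
import random

def run_committing_bandit(thetas, T, N, gamma, rng):
    """Experiment uniformly (round-robin) for N rounds, then commit to
    the empirically best arm for the remaining T - N rounds.
    Returns the realized average regret over the horizon."""
    K = len(thetas)
    theta_star = max(thetas)
    pulls = [0] * K
    totals = [0.0] * K
    reward = 0.0
    for t in range(N):                       # experimentation phase
        i = t % K                            # Unif allocation
        x = 1.0 if rng.random() < thetas[i] else 0.0
        pulls[i] += 1
        totals[i] += x
        reward += gamma * x                  # experimentation reward is discounted
    best = max(range(K), key=lambda i: totals[i] / max(pulls[i], 1))  # EBA
    for _ in range(T - N):                   # commitment phase
        reward += 1.0 if rng.random() < thetas[best] else 0.0
    return (theta_star * T - reward) / T

rng = random.Random(0)
thetas, gamma, T = [0.5, 0.6, 0.8], 0.75, 20000
N = int(25 * math.log(T))                    # Theta(ln T) experimentation rounds
avg_regret = sum(run_committing_bandit(thetas, T, N, gamma, rng)
                 for _ in range(50)) / 50
```

With these parameters the average regret is dominated by the discounted experimentation phase and is on the order of ln T / T, consistent with the discussion above.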
Second, and somewhat surprisingly, we find that if the algorithm has access to distributional information about the arms, then adaptivity provides no additional benefit (at least in terms of expected regret performance); however, as we observe via simulations, on a sample path basis adaptive algorithms can outperform non-adaptive algorithms due to the additional flexibility. Finally, we demonstrate that if the algorithm has no initial distributional information, adaptivity is beneficial: we demonstrate an adaptive algorithm that achieves Θ(ln T / T) regret in this case.
We then study the hard deadline regime where the value of N is given to the decision maker in advance (Section 5). This is a sensible assumption for problems where the decision maker cannot control how long the experimentation period is; for example, in the product design example above, the release date is often fixed well in advance, and the engineers are not generally free to alter it. We propose the UCB-poly(δ) algorithm for this setting, where the parameter δ ∈ (0, 1) reflects the tradeoff between experimentation and commitment. We show how to tune the algorithm to optimally choose δ, based on the relative values of N and T.
We mention in passing that the celebrated exploration-exploitation dilemma is also a major issue in our setup. During the first N periods the tradeoff between exploration and exploitation exists, bearing in mind that the last T − N periods will be used solely for exploitation. This changes the standard setup so that exploration in the first N periods becomes more important, as we shall see in our results.
2 The committing bandit problem
We first describe the setup of the classical stochastic multi-armed bandit problem, as it will serve as background for the committing bandit problem.
In a stochastic multi-armed bandit problem, there are K independent arms; each arm i, when pulled, returns a reward which is independently and identically drawn from a fixed Bernoulli distribution¹ with unknown parameter θi ∈ [0, 1]. Let It denote the index of the arm pulled at time t (It ∈ {1, 2, . . . , K}), and let Xt denote the associated reward. Note that E[Xt] = θIt. Also, we define the following notation:

θ* := max_{1≤i≤K} θi,   i* := arg max_{1≤i≤K} θi,   Δi := θ* − θi,   Δ := min_{i:Δi>0} Δi.

An allocation policy is an algorithm that chooses the next arm to pull based on the sequence of past pulled arms and obtained rewards. The cumulative regret of an allocation policy A after time n is:

Rn = Σ_{t=1}^{n} (X*_t − Xt),

where X*_t is the reward that the algorithm would have received at time t if it had pulled the optimal arm i*. In other words, Rn is the cumulative loss due to the fact that the allocation policy does not always pull the optimal arm. Let Ti(n) be the number of times that arm i is pulled up to time n. Then:

E[Rn] = θ*n − Σ_{i=1}^{K} θi E[Ti(n)] = Σ_{i≠i*} Δi E[Ti(n)].

¹We assume Bernoulli distributions throughout the paper. Our results hold with minor modification for any distribution with bounded support.

The reader is referred to the supplementary material for some well-known allocation policies, e.g., Unif (Uniform allocation) and UCB (Upper Confidence Bound) [2].
A recommendation policy is an algorithm that tries to recommend the "best" arm based on the sequence of past pulled arms and obtained rewards. Suppose that after time n, a recommendation policy R recommends the arm Jn as the "best" arm.
Then the regret of recommendation policy R after time n, called the simple regret in [4], is defined as

rn = θ* − θJn = ΔJn.

The reader is also referred to the supplementary material for some natural recommendation policies, e.g., EBA (Empirical Best Arm) and MPA (Most Played Arm).
The committing bandit problem considered in this paper is a version of the stochastic multi-armed bandit problem in which the algorithm is forced to commit to only one arm after some period of time. More precisely, the problem setting is as follows. Let T be the time horizon of the problem. From time 1 to some time N (N < T), the algorithm can pull any arm in {1, 2, . . . , K}. Then, from time N + 1 to the end of the horizon (time T), it must commit to pulling only one arm. The first phase (time 1 to N) is called the experimentation phase, and the second phase (time N + 1 to T) is called the commitment phase. We refer to time N as the experimentation deadline.
An algorithm for the committing bandit problem is a combination of an allocation and a recommendation policy. That is, the algorithm has to decide which arm to pull during the first N slots, and then choose an arm to commit to during the remaining T − N slots. Because we consider settings where the algorithm designer can choose the experimentation deadline, we also assume a cost is imposed during the experimentation phase; otherwise, it is never optimal to be forced to commit. In particular, we assume that the reward earned during the experimentation phase is reduced by a constant factor γ ∈ [0, 1).
Thus the expected regret E[Reg] of such an algorithm is the average regret across both phases, i.e.:

E[Reg] = (1/T) [ Σ_{t=1}^{T} θ* − γ Σ_{t=1}^{N} E[θIt] − Σ_{t=N+1}^{T} E[θJN] ] = γ E[RN]/T + ((T − N)/T) E[rN] + (1 − γ) Nθ*/T.

2.1 Committing bandit regimes
We focus on three distinct regimes, which differ in the level of control given to the algorithm designer in choosing the experimentation deadline.
Regime 1: Soft experimentation deadline, non-adaptive. In this regime, the value of T is given to the algorithm. For a given value of T, the value of N can be chosen freely between 1 and T − 1, but the choice must be made before the process begins.
Regime 2: Soft experimentation deadline, adaptive. The setting in this regime is the same as the previous one, except for the fact that the algorithm can choose the value of N adaptively as outcomes of past pulls are observed.
Regime 3: Hard experimentation deadline. In this regime, both N and T are fixed and given to the algorithm. That is, the algorithm cannot control the experimentation deadline N. We are mainly interested in the asymptotic behavior of the algorithm when both N and T go to infinity.

2.2 Known lower bounds
As mentioned in the Introduction, the experimentation and commitment phases have each been extensively studied in isolation. In this subsection, we briefly summarize the known lower bounds on cumulative regret and simple regret that will be used in the paper.
Result 1 (Distribution-dependent lower bound on cumulative regret [9]). For any allocation policy, and for any set of reward distributions such that their parameters θi are not all equal, there exists an ordering of (θ1, . . .
, θK) such that

E[Rn] ≥ ( Σ_{i≠i*} Δi / D(pi‖p*) + o(1) ) ln n,

where D(pi‖p*) = pi log(pi/p*) + (1 − pi) log((1 − pi)/(1 − p*)) is the Kullback-Leibler divergence between the two Bernoulli reward distributions pi (of arm i) and p* (of the optimal arm), and o(1) → 0 as n → ∞.
Result 2 (Distribution-free lower bound on cumulative regret [13]). There exist positive constants c and N0 such that for any allocation policy, there exists a set of Bernoulli reward distributions such that

E[Rn] ≥ cK(ln n − ln K),   for all n ≥ N0.

The difference between Result 1 and Result 2 is that the lower bound in the former depends on the parameters of the reward distributions (hence it is called distribution-dependent), while the lower bound in the latter does not (hence, distribution-free). That is, in the latter case, the reward distributions can be chosen adversarially. Therefore, it should be clear that the distribution-free lower bound is always higher than the distribution-dependent lower bound.
Result 3 (Distribution-dependent bound on simple regret [4]). For any pair of allocation and recommendation policies, if the allocation policy achieves, for all (Bernoulli) reward distributions θ1, . . . , θK, an upper bound

E[Rn] ≤ Cf(n)

for some constant C ≥ 0, then for all sets of K ≥ 3 Bernoulli reward distributions with parameters θi that are all distinct and all different from 1, there exists an ordering (θ1, . . . , θK) such that

E[rn] ≥ (Δ/2) e^{−Df(n)},

where D is a constant which can be calculated in closed form from C and θ1, . . . , θK.
In particular, since E[Rn] ≤ θ*n for any allocation policy, there exists a constant ξ depending only on θ1, . . .
, θK such that E[rn] ≥ (Δ/2) e^{−ξn}.
Result 4 (Distribution-free lower bound on simple regret [4]). For any pair of allocation and recommendation policies, there exists a set of Bernoulli reward distributions such that E[rn] ≥ (1/20) √(K/n).
In the subsequent sections we analyze each of the committing bandit regimes in detail; in particular, we provide constructive upper bounds and matching lower bounds on the regret in each regime. The detailed proofs of all the results in this paper are presented in the supplementary material.
3 Regime 1: Soft experimentation deadline, non-adaptive
In this regime, for a given value of T, the value of N can be chosen freely between 1 and T − 1, but only before the algorithm begins pulling arms. Our main insight is that there exist matching upper and lower bounds of order Θ(ln T / T); further, we propose an algorithm that can achieve this performance.
Theorem 1. (1) Distribution-dependent lower bound: In Regime 1, for any algorithm, and any set of K ≥ 3 Bernoulli reward distributions such that the θi are all distinct and all different from 1, there exists an ordering (θ1, . . . , θK) such that

E[Reg] ≥ ( max{ (1 − γ)θ*/ξ, Σ_{i≠i*} Δi / D(pi‖p*) } + o(1) ) (ln T)/T,

where o(1) → 0 as T → ∞, and ξ is the constant discussed in Result 3.
(2) Distribution-free lower bound: Also, for any algorithm in Regime 1, there exists a set of Bernoulli reward distributions such that

E[Reg] ≥ cK (1 − ln K / ln T) (ln T)/T,

where c is the constant in Result 2.
We now show that the Non-adaptive Unif-EBA algorithm (Algorithm 1) achieves the matching upper bound, as stated in the following theorem.
Algorithm 1 Non-adaptive Unif-EBA
Input: a set of arms {1, 2, . . .
, K}, T, Δ
repeat
  Sample each arm in {1, 2, . . . , K} in round-robin fashion.
until each arm has been chosen ⌈ln T / Δ²⌉ times.
Commit to the arm with maximum empirical average reward for the remaining periods.

Theorem 2. For the Non-adaptive Unif-EBA algorithm (Algorithm 1),

E[Reg] ≤ (K/Δ²) ( (1 − γ)θ* + (γ/K) Σ_{i≠i*} Δi + 2Δ²/K ) (ln T)/T.

This matches the lower bounds in Theorem 1 to the correct order in T. Observe that in this regime, both the distribution-dependent and distribution-free lower bounds have the same asymptotic order of ln T / T. However, the preceding algorithm requires knowing the value of Δ. If Δ is unknown, a low regret algorithm that matches the lower bound does not seem to be possible in this regime, because of the relative nature of the regret. An algorithm may be unable to choose an N that explores sufficiently long when arms are difficult to distinguish, and yet commits quickly when arms are easy to distinguish.
4 Regime 2: Soft experimentation deadline, adaptive
The setting in this regime is the same as the previous one, except that the algorithm is not required to choose N before it runs, i.e., N can be chosen adaptively. Thus, in particular, it is possible for the algorithm to reject bad arms or to estimate Δ as it runs.
We first present the lower bounds on regret for any algorithm in this regime.
Theorem 3. (1) Distribution-dependent lower bound: In Regime 2, for any algorithm, and any set of K ≥ 3 Bernoulli reward distributions such that the θi are all distinct and all different from 1, there exists an ordering (θ1, . . .
, θK) such that

E[Reg] ≥ ( Σ_{i≠i*} Δi / D(pi‖p*) + o(1) ) (ln T)/T,

where o(1) → 0 as T → ∞.
(2) Distribution-free lower bound: Also, for any algorithm in Regime 2, there exists a set of Bernoulli reward distributions such that

E[Reg] ≥ cK (1 − ln K / ln T) (ln T)/T,

where c is the constant in Result 2.
Next, we derive several sequential algorithms with matching upper bounds on regret. The first algorithm is called Sequential Elimination & Commitment 1 (SEC1) (Algorithm 2); this algorithm requires the values of Δ and θ*.

Algorithm 2 Sequential Elimination & Commitment 1 (SEC1)
Input: a set of arms {1, 2, . . . , K}, T, Δ, θ*
Initialization: Set m = 0, B0 = {1, 2, . . . , K}, α = 1/Δ², ε1 = 1/Δ, ε2 = Δ/2.
repeat
  Sample each arm in Bm once. Let S^i_m be the total reward obtained from arm i so far.
  Set B_{m+1} = Bm, m = m + 1.
  for i ∈ Bm do
    if m ≤ ⌈α ln T⌉ and |mθ* − S^i_m| > ε1 ln T then
      Delete arm i from Bm.
    end if
    if m > ⌈α ln T⌉ and |mθ* − S^i_m| > ε2 m then
      Delete arm i from Bm.
    end if
  end for
until there is only one arm in Bm (then commit to that arm) or the horizon T is reached.

Theorem 4. For the SEC1 algorithm (Algorithm 2),

E[Reg] ≤ (K/Δ²) ( (1 − γ)θ* + (γ/K) Σ_{i≠i*} Δi + b ) (ln T)/T,

where b = ( 2 + Δ²(K + 2)/(1 − e^{−Δ²/2})² ) (1/ln T) → 0 as T → ∞.

Observe that this algorithm matches the lower bounds in Theorem 3 to the correct order in T. We note that when N can be chosen adaptively, both the distribution-dependent and distribution-free lower bounds have the same asymptotic order of ln T / T as the ones in the non-adaptive regime.
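As a concrete illustration, here is a minimal Python sketch of the SEC1 elimination rule for Bernoulli arms. This is our own illustrative code, not the paper's: the function name, the return values, the specific arm parameters, and the seed are all ours.

```python
import math
import random

def sec1(thetas, T, delta, theta_star, rng):
    """Sketch of SEC1: repeatedly sample every surviving arm once and
    eliminate arm i when its total reward S_i drifts too far from
    m * theta_star; commit once a single arm survives (or the horizon
    is reached, in which case an arbitrary survivor is returned)."""
    K = len(thetas)
    alive = set(range(K))
    totals = [0.0] * K              # S^i_m: total reward of arm i after m rounds
    m = 0
    pulls = 0
    switch = math.ceil(math.log(T) / delta ** 2)   # ceil(alpha * ln T)
    eps1, eps2 = 1.0 / delta, delta / 2.0
    while len(alive) > 1 and pulls + len(alive) <= T:
        for i in alive:
            totals[i] += 1.0 if rng.random() < thetas[i] else 0.0
            pulls += 1
        m += 1
        for i in list(alive):
            if len(alive) == 1:     # never eliminate the last arm
                break
            gap = abs(m * theta_star - totals[i])
            if (m <= switch and gap > eps1 * math.log(T)) or \
               (m > switch and gap > eps2 * m):
                alive.discard(i)
    return next(iter(alive)), pulls  # committed arm, experimentation length

rng = random.Random(0)
arm, n_experiment = sec1([0.3, 0.4, 0.9], T=10**8, delta=0.5,
                         theta_star=0.9, rng=rng)
```

Note how the experimentation length adapts: with well-separated arms the suboptimal arms are eliminated after only a few hundred pulls, far earlier than the ⌈ln T / Δ²⌉ threshold a non-adaptive scheme would wait for.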
In the distribution-dependent case, therefore, we obtain the surprising conclusion that adaptivity does not reduce the optimal expected regret. Indeed, the regret bound of SEC1 in Theorem 4 is exactly the same as for Non-adaptive Unif-EBA in Theorem 2. We conjecture that the constant 1/Δ² is actually the best achievable constant on expected regret.
What is the benefit of adaptivity then? As the simulation results in Section 6 suggest, SEC1 performs much better than Non-adaptive Unif-EBA in practice. The reason is rather intuitive: due to its adaptive nature, SEC1 is able to eliminate poor arms much earlier than the ⌈ln T / Δ²⌉ threshold, while Non-adaptive Unif-EBA has to wait until that point to make decisions.
Remark 1. Although SEC1 requires the value of θ*, that requirement can be relaxed, as θ* can be estimated by the maximum empirical average reward across arms. In fact, as we will see in the simulations (Section 6), another version of SEC1 (called SEC2), in which mθ* is replaced by max_{j∈Bm} S^j_m, achieves a nearly identical performance.
Now, if the value of Δ is unknown, we have the following Sequential Committing UCB (SC-UCB) algorithm, which is based on the improved UCB algorithm in [3]. The idea is to maintain an estimate of Δ and reduce it over time.

Algorithm 3 Sequential Committing UCB (SC-UCB)
Input: a set of arms {1, 2, . . . , K}, T
Initialization: Set m = 0, Δ̃0 = 1, B0 = {1, 2, . . . , K}.
for m = 0, 1, 2, . . .
, ⌊log2(T/e)/2⌋ do
  if |Bm| > 1 then
    Sample each arm in Bm until each arm has been chosen n_m = ⌈2 ln(T Δ̃²_m) / Δ̃²_m⌉ times.
    Let S^i_m be the total reward obtained from arm i so far.
    Delete all arms i from Bm for which

      max_{j∈Bm} S^j_m − S^i_m > 2 √( n_m ln(T Δ̃²_m) / 2 )

    to obtain B_{m+1}.
    Set Δ̃_{m+1} = Δ̃_m / 2.
  else
    Commit to the single arm in Bm.
  end if
end for
Commit to any arm in Bm.

Theorem 5. For the SC-UCB algorithm (Algorithm 3),

E[Reg] ≤ Σ_{i≠i*} ( γΔi + (1 − γ)θ* ) ( 32 + 96 / ln(T Δi²) ) ln(T Δi²) / (Δi² T).

This matches the lower bounds in Theorem 3 to the correct order in T.
5 Regime 3: Hard experimentation deadline
We now investigate the third regime where, in contrast to the previous two, the experimentation deadline N is fixed exogenously together with T. We consider the asymptotic behavior of regret as T and N approach infinity together. Note that since in this case the experimentation deadline is outside the algorithm designer's control, we set the cost of experimentation γ = 1 for this section.
Because both T and N are given, the main challenge in this context is choosing an algorithm that optimally balances the cumulative and simple regrets. We design and tune an algorithm that achieves this balance.
We know from Result 3 that for any pair of allocation and recommendation policies, if E[RN] ≤ C1 f(N), then E[rN] ≥ (Δ/2) e^{−D f(N)}.
In other words, given an allocation policy A that has a cumulative regret bound C1 f(N) (for some constant C1), the best (distribution-dependent) upper bound that any recommendation policy can achieve is C2 e^{−C3 f(N)} (for some constants C2 and C3). Assuming that there exists a recommendation policy RA that achieves such an upper bound, we have the following upper bound on regret when applying [A, RA] to the committing bandit problem:

E[Reg] ≤ C1 f(N)/T + ((T − N)/T) C2 e^{−C3 f(N)}.   (1)

One can clearly see the trade-off between experimentation and commitment in (1): the smaller the first term, the larger the second term, and vice versa. Note that ln(N) ≤ f(N) ≤ N, and we have algorithms that give us only either one of the extremes (e.g., Unif has f(N) = N, while UCB [2] has f(N) = ln N). On the other hand, it would be useful to have an algorithm that can balance between these two extremes. In particular, we focus on finding a pair of allocation and recommendation policies which can simultaneously achieve the allocation bound C1 N^δ and the recommendation bound C2 e^{−C3 N^δ} where 0 < δ < 1.
Let us consider a modification of the UCB allocation policy called UCB-poly(δ) (for 0 < δ < 1), where for t > K, with θ̂_{i,Ti(t−1)} the empirical average of rewards from arm i so far,

It = arg max_{1≤i≤K} [ θ̂_{i,Ti(t−1)} + √( 2(t − 1)^δ / Ti(t − 1) ) ].

Then we have the following result on the upper bound of its cumulative regret.
Theorem 6. The cumulative regret of UCB-poly(δ) is upper-bounded by

E[Rn] ≤ ( Σ_{i:Δi>0} 8/Δi + o(1) ) n^δ,

where o(1) → 0 as n → ∞.
Moreover, the simple regret for the pair [UCB-poly(δ), EBA] is upper-bounded by

E[rn] ≤ ( 2 Σ_{i≠i*} Δi ) e^{−χ n^δ},

where χ = (σ/2) min_{i:Δi>0} Δi².
In the supplementary material (see Theorem 7 there) we show that in the limit, as T and N increase to infinity, the optimal value of δ can be chosen as lim_{N→∞} ln(ln(T(N) − N)) / ln N if that limit exists. In particular, if T(N) is super-exponential in N we get an optimal δ of 1, representing pure exploration in the experimentation phase. If T(N) is sub-exponential we get an optimal δ of 0, representing a standard UCB during the experimentation phase. If T(N) is exponential we obtain a δ in between.

Figure 1: Numerical performances where K = 20, γ = 0.75, and Δ = 0.02.

6 Simulations
In this section, we present numerical results on the performance of the Non-adaptive Unif-EBA, SEC1, SEC2, and SC-UCB algorithms. (Recall that the SEC2 algorithm is a version of SEC1 in which mθ* is replaced by max_{j∈Bm} S^j_m, as discussed in Remark 1.) The simulation setting includes K arms with Bernoulli reward distributions, the time horizon T, and the values of γ and Δ. The arm configurations are generated as follows. For each experiment, θ* is generated independently and uniformly in the [0.5, 1] interval, and the second best arm reward is set as θ*_2 = θ* − Δ. These two values are then assigned to two randomly chosen arms, and the rest of the arm rewards are generated independently and uniformly in [0, θ*_2].
Figure 1 shows the regrets of the above algorithms for various values of T (in logarithmic scale) with parameters K = 20, γ = 0.75, and Δ = 0.02 (we omitted error bars because the variation was small).
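The arm-configuration procedure just described can be sketched in a few lines (our own illustration; the function name and the seed are ours):

```python
import random

def make_arm_config(K, delta, rng):
    """One arm configuration as in Section 6: theta* ~ U[0.5, 1], the
    second best arm is theta* - delta, and the remaining K - 2 arms are
    i.i.d. uniform on [0, theta* - delta]; all are then placed at
    random arm indices."""
    theta_star = rng.uniform(0.5, 1.0)
    theta_second = theta_star - delta
    thetas = [theta_star, theta_second]
    thetas += [rng.uniform(0.0, theta_second) for _ in range(K - 2)]
    rng.shuffle(thetas)              # assign values to random positions
    return thetas

thetas = make_arm_config(K=20, delta=0.02, rng=random.Random(1))
```

By construction the minimum gap of the resulting instance is exactly Δ, realized between the two best arms.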
Observe that the performances of SEC1 and SEC2 are nearly identical, which suggests that the requirement of knowing θ* in SEC1 can be relaxed (see Remark 1). Moreover, SEC1 (or equivalently, SEC2) performs much better than Non-adaptive Unif-EBA due to its adaptive nature (see the discussion before Remark 1). In particular, the performance of Non-adaptive Unif-EBA is quite poor when the experimentation deadline is roughly equal to T, since the algorithm does not commit before the experimentation deadline. Finally, SC-UCB does not perform as well as the others when T is large, but this algorithm does not need to know Δ, and thus suffers a performance loss due to the additional effort required to estimate Δ.
Additional simulation results can be found in the supplementary material.
7 Extensions and future directions
Our work is a first step in the study of the committing bandit setup. There are several extensions that call for future research, which we outline below.
First, an extension of the basic committing bandits setup to the case of contextual bandits [10, 11] is natural. In this setup, before choosing an arm, an additional "context" is provided to the decision maker. The problem is to choose a decision rule from a given class that prescribes what arm to choose for every context. This setup is more realistic when the decision maker has to commit to such a rule after some exploration time. Second, models with many arms (structured as in [8, 5]) or even infinitely many arms (as in [1, 7, 14]) are of interest, as they may lead to different regimes and results. Third, our models assumed that the commitment time is either predetermined or according to the decision maker's will.
There are other models of interest, such as the case where some stochastic process determines the commitment time.
Finally, a situation where the exploration and commitment phases alternate (randomly, according to a given schedule, or at a cost) is of practical interest. This can represent the situation where there are a few releases of a product, where exploration can be done until the time of the release, when the product is "frozen" until a new exploration period followed by a new release.

References
[1] R. Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926-1951, 1995.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning Journal, 47(2-3):235-256, 2002.
[3] P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.
[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832-1852, 2011.
[5] P. A. Coquelin and R. Munos. Bandit algorithms for tree search. CoRR, abs/cs/0703062, 2007.
[6] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, 2006.
[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681-690, 2008.
[8] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, pages 282-293, 2006.
[9] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
[10] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits.
In Advances in Neural Information Processing Systems (NIPS), 2008.
[11] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.
[12] S. Mannor. k-armed bandit. In Encyclopedia of Machine Learning, pages 561-563. 2010.
[13] S. Mannor and J. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623-648, 2004.
[14] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
", "award": [], "sourceid": 893, "authors": [{"given_name": "Loc", "family_name": "Bui", "institution": null}, {"given_name": "Ramesh", "family_name": "Johari", "institution": null}, {"given_name": "Shie", "family_name": "Mannor", "institution": null}]}