{"title": "Action Centered Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 5977, "page_last": 5985, "abstract": "Contextual bandits have become popular as they offer a middle ground between very simple approaches based on multi-armed bandits and very complex approaches using the full power of reinforcement learning. They have demonstrated success in web applications and have a rich body of associated theoretical guarantees. Linear models are well understood theoretically and preferred by practitioners because they are not only easily interpretable but also simple to implement and debug. Furthermore, if the linear model is true, we get very strong performance guarantees. Unfortunately, in emerging applications in mobile health, the time-invariant linear model assumption is untenable. We provide an extension of the linear model for contextual bandits that has two parts: baseline reward and treatment effect. We allow the former to be complex but keep the latter simple. We argue that this model is plausible for mobile health applications. At the same time, it leads to algorithms with strong performance guarantees as in the linear model setting, while still allowing for complex nonlinear baseline modeling. 
Our theory is supported by experiments on data gathered in a recently concluded mobile health study.", "full_text": "Action Centered Contextual Bandits\n\nKristjan Greenewald\nDepartment of Statistics\n\nHarvard University\n\nkgreenewald@fas.harvard.edu\n\nAmbuj Tewari\n\nDepartment of Statistics\nUniversity of Michigan\ntewaria@umich.edu\n\nPredrag Klasnja\n\nSchool of Information\nUniversity of Michigan\nklasnja@umich.edu\n\nSusan Murphy\n\nDepartments of Statistics and Computer Science\n\nHarvard University\n\nsamurphy@fas.harvard.edu\n\nAbstract\n\nContextual bandits have become popular as they offer a middle ground between\nvery simple approaches based on multi-armed bandits and very complex approaches\nusing the full power of reinforcement learning. They have demonstrated success in\nweb applications and have a rich body of associated theoretical guarantees. Linear\nmodels are well understood theoretically and preferred by practitioners because\nthey are not only easily interpretable but also simple to implement and debug.\nFurthermore, if the linear model is true, we get very strong performance guarantees.\nUnfortunately, in emerging applications in mobile health, the time-invariant linear\nmodel assumption is untenable. We provide an extension of the linear model for\ncontextual bandits that has two parts: baseline reward and treatment effect. We\nallow the former to be complex but keep the latter simple. We argue that this\nmodel is plausible for mobile health applications. At the same time, it leads to\nalgorithms with strong performance guarantees as in the linear model setting, while\nstill allowing for complex nonlinear baseline modeling. 
Our theory is supported by\nexperiments on data gathered in a recently concluded mobile health study.\n\n1\n\nIntroduction\n\nIn the theory of sequential decision-making, contextual bandit problems (Tewari & Murphy, 2017)\noccupy a middle ground between multi-armed bandit problems (Bubeck & Cesa-Bianchi, 2012) and\nfull-blown reinforcement learning (usually modeled using Markov decision processes along with\ndiscounted or average reward optimality criteria (Sutton & Barto, 1998; Puterman, 2005)). Unlike\nbandit algorithms, which cannot use any side-information or context, contextual bandit algorithms\ncan learn to map the context into appropriate actions. However, contextual bandits do not consider\nthe impact of actions on the evolution of future contexts. Nevertheless, in many practical domains\nwhere the impact of the learner\u2019s action on future contexts is limited, contextual bandit algorithms\nhave shown great promise. Examples include web advertising (Abe & Nakamura, 1999) and news\narticle selection on web portals (Li et al., 2010).\nAn in\ufb02uential thread within the contextual bandit literature models the expected reward for any\naction in a given context using a linear mapping from a d-dimensional context vector to a real-valued\nreward. Algorithms using this assumption include LinUCB and Thompson Sampling, for both of\nwhich regret bounds have been derived. These analyses often allow the context sequence to be chosen\nadversarially, but require the linear model, which links rewards to contexts, to be time-invariant.\nThere has been little effort to extend these algorithms and analyses when the data follow an unknown\nnonlinear or time-varying model.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn this paper, we consider a particular type of non-stationarity and non-linearity that is motivated\nby problems arising in mobile health (mHealth). 
Mobile health is a fast developing \ufb01eld that uses\nmobile and wearable devices for health care delivery. These devices provide us with a real-time\nstream of dynamically evolving contextual information about the user (location, calendar, weather,\nphysical activity, internet activity, etc.). Contextual bandit algorithms can learn to map this contextual\ninformation to a set of available intervention options (e.g., whether or not to send a medication\nreminder). However, human behavior is hard to model using stationary, linear models. We make a\nfundamental assumption in this paper that is quite plausible in the mHealth setting. In these settings,\nthere is almost always a \u201cdo nothing\u201d action usually called action 0. The expected reward for this\naction is the baseline reward and it can change in a very non-stationary, non-linear fashion. However,\nthe treatment effect of a non-zero action, i.e., the incremental change over the baseline reward due to\nthe action, can often be plausibly modeled using standard stationary, linear models.\nWe show, both theoretically and empirically, that the performance of an appropriately designed\naction-centered contextual bandit algorithm is agnostic to the high model complexity of the baseline\nreward. Instead, we get the same level of performance as expected in a stationary, linear model setting.\nNote that it might be tempting to make the entire model non-linear and non-stationary. However, the\nsample complexity of learning very general non-stationary, non-linear models is likely to be so high\nthat they will not be useful in mHealth where data is often noisy, missing, or collected only over a\nfew hundred decision points.\nWe connect our algorithm design and theoretical analysis to the real world of mHealth by using data\nfrom a pilot study of HeartSteps, an Android-based walking intervention. HeartSteps encourages\nwalking by sending individuals contextually-tailored suggestions to be active. 
Such suggestions can\nbe sent up to \ufb01ve times a day\u2013in the morning, at lunchtime, mid-afternoon, at the end of the workday,\nand in the evening\u2013and each suggestion is tailored to the user\u2019s current context: location, time of day,\nday of the week, and weather. HeartSteps contains two types of suggestions: suggestions to go for a\nwalk, and suggestions to simply move around in order to disrupt prolonged sitting. While the initial\npilot study of HeartSteps micro-randomized the delivery of activity suggestions (Klasnja et al., 2015;\nLiao et al., 2015), delivery of activity suggestions is an excellent candidate for the use of contextual\nbandits, as the effect of delivering (vs. not) a suggestion at any given time is likely to be strongly\nin\ufb02uenced by the user\u2019s current context, including location, time of day, and weather.\nThis paper\u2019s main contributions can be summarized as follows. We introduce a variant of the standard\nlinear contextual bandit model that allows the baseline reward model to be quite complex while\nkeeping the treatment effect model simple. We then introduce the idea of using action centering in\ncontextual bandits as a way to decouple the estimation of the above two parts of the model. We show\nthat action centering is effective in dealing with time-varying and non-linear behavior in our model,\nleading to regret bounds that scale as nicely as previous bounds for linear contextual bandits. Finally,\nwe use data gathered in the recently conducted HeartSteps study to validate our model and theory.\n\n1.1 Related Work\n\nContextual bandits have been the focus of considerable interest in recent years. Chu et al. (2011) and\nAgrawal & Goyal (2013) have examined UCB and Thompson sampling methods respectively for\nlinear contextual bandits. Works such as Seldin et al. (2011), Dudik et al. (2011) considered contextual\nbandits with \ufb01xed policy classes. 
Methods for reducing the regret under complex reward functions include the nonparametric approach of May et al. (2012), the “contextual zooming” approach of Slivkins (2014), the kernel-based method of Valko et al. (2013), and the sparse method of Bastani & Bayati (2015). Each of these approaches has regret that scales with the complexity of the overall reward model including the baseline, and requires the reward function to remain constant over time.

2 Model and Problem Setting

Consider a contextual bandit with a baseline (zero) action and N non-baseline arms (actions or treatments). At each time t = 1, 2, . . ., a context vector s̄_t ∈ R^{d′} is observed, an action a_t ∈ {0, . . . , N} is chosen, and a reward r_t(a_t) is observed. The bandit learns a mapping from a state vector s_{t,a_t}, depending on s̄_t and a_t, to the expected reward r_t(s_{t,a_t}). The state vector s_{t,a_t} ∈ R^d is a function of a_t and s̄_t. This form is used to achieve maximum generality, as it allows for infinite possible actions so long as the reward can be modeled using a d-dimensional s_{t,a}. In the most unstructured case with N actions, we can simply encode the reward with a d = Nd′ dimensional state vector

s_{t,a_t}^T = [I(a_t = 1)s̄_t^T, . . . , I(a_t = N)s̄_t^T],

where I(·) is the indicator function.

For maximum generality, we assume the context vectors are chosen by an adversary on the basis of the history H_{t−1} of arms a_τ played, states s̄_τ, and rewards r_τ(s̄_τ, a_τ) received up to time t − 1, i.e.,

H_{t−1} = {a_τ, s_{τ,i}, r_τ(s̄_τ, a_τ), i = 1, . . . , N, τ = 1, . . .
, t − 1}.

Consider the model E[r_t(s̄_t, a_t) | s̄_t, a_t] = f̄_t(s̄_t, a_t), where f̄_t can be decomposed into a fixed component dependent on action and a time-varying component that does not depend on action:

E[r_t(s̄_t, a_t) | s̄_t, a_t] = f̄_t(s̄_t, a_t) = f(s_{t,a_t})I(a_t > 0) + g_t(s̄_t),

where f̄_t(s̄_t, 0) = g_t(s̄_t) due to the indicator function I(a_t > 0). Note that the optimal action depends in no way on g_t, which merely confounds the observation of regret. We hypothesize that the regret bounds for such a contextual bandit asymptotically depend only on the complexity of f, not of g_t. We emphasize that we do not require any assumptions about or bounds on the complexity or smoothness of g_t, allowing g_t to be arbitrarily nonlinear and to change abruptly in time. These conditions create a partially agnostic setting where we have a simple model for the interaction but the baseline cannot be modeled with a simple linear function. In what follows, for simplicity of notation we drop s̄_t from the argument for r_t, writing r_t(a_t) with the dependence on s̄_t understood.

In this paper, we consider the linear model for the reward difference at time t:

r_t(a_t) − r_t(0) = f(s_{t,a_t})I(a_t > 0) + n_t = s_{t,a_t}^T θ I(a_t > 0) + n_t,   (1)

where n_t is zero-mean sub-Gaussian noise with variance σ² and θ ∈ R^d is a vector of coefficients. The goal of the contextual bandit is to estimate θ at every time t and use the estimate to decide which actions to take under a series of observed contexts.
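The decomposition above — a simple linear treatment effect on top of an arbitrary baseline — can be sketched numerically. This is an illustrative toy model, not the study's generative model; the specific `g_t`, dimensions, and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta = rng.normal(size=d)   # true treatment-effect coefficients (illustrative)

def g_t(s_bar, t):
    # Arbitrarily complex, time-varying baseline reward; the model places
    # no smoothness or stationarity assumptions on this term.
    return np.sin(t / 10.0) + 2.0 * (abs(s_bar[0]) < 0.8)

def expected_reward(s_bar, s_ta, a, t):
    # E[r_t | s_bar_t, a_t] = f(s_{t,a}) I(a > 0) + g_t(s_bar_t), with f linear.
    return float(s_ta @ theta) * (a > 0) + g_t(s_bar, t)

s_bar = rng.normal(size=d)
r0 = expected_reward(s_bar, np.zeros(d), 0, t=1)  # the "do nothing" action
r1 = expected_reward(s_bar, s_bar, 1, t=1)        # a nonzero action with s_{t,1} = s_bar
# The differential reward r1 - r0 depends only on theta, never on g_t.
```

Whatever `g_t` does, it cancels in the difference, which is the property the action-centering construction later exploits.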
As is common in the literature, we assume that both the baseline and interaction rewards are bounded by a constant for all t.

The task of the action-centered contextual bandit is to choose the probabilities π(a, t) of playing each arm a_t at time t so as to maximize the expected differential reward

E[r_t(a_t) − r_t(0) | H_{t−1}, s_{t,a}] = Σ_{a=0}^N π(a, t) E[r_t(a) − r_t(0) | H_{t−1}, s_{t,a}]
                                        = Σ_{a=0}^N π(a, t) s_{t,a}^T θ I(a > 0).   (2)

This task is closely related to obtaining a good estimate of the reward function coefficients θ.

2.1 Probability-constrained optimal policy

In the mHealth setting, a contextual bandit must choose at each time point whether to deliver to the user a behavior-change intervention, and if so, what type of intervention to deliver. Whether or not an intervention, such as an activity suggestion or a medication reminder, is sent is a critical aspect of the user experience. If a bandit sends too few interventions to a user, it risks the user's disengaging with the system, and if it sends too many, it risks the user's becoming overwhelmed or desensitized to the system's prompts. Furthermore, standard contextual bandits will eventually converge to a policy that maps most states to a near-100% chance of sending or not sending an intervention. Such regularity could not only worsen the user's experience, but also ignores the fact that users have changing routines and cannot be perfectly modeled. We are thus motivated to introduce a constraint on the size of the probabilities of delivering an intervention. We constrain 0 < π_min ≤ 1 − P(a_t = 0 | s̄_t) ≤ π_max < 1, where 1 − P(a_t = 0 | s̄_t) is the conditional bandit-chosen probability of delivering an intervention at time t.
The constants π_min and π_max are not learned by the algorithm, but chosen using domain science, and might vary for different components of the same mHealth system. We constrain P(a_t = 0 | s̄_t), not each P(a_t = i | s̄_t), as which intervention is delivered is less critical to the user experience than being prompted with an intervention in the first place. User habituation can be mitigated by implementing the nonzero actions (a = 1, . . . , N) to correspond to several types or categories of messages, with the exact message sent being randomized from a set of differently worded messages.

Conceptually, we can view the bandit as pulling two arms at each time t: the probability of sending a message (constrained to lie in [π_min, π_max]) and which message to send if one is sent. While these probability constraints are motivated by domain science, they also enable our proposed action-centering algorithm to effectively orthogonalize the baseline and interaction term rewards, achieving sublinear regret in complex scenarios that often occur in mobile health and other applications and for which existing approaches have large regret.

Under this probability constraint, we can now derive the optimal policy with which to compare the bandit. The policy that maximizes the expected reward (2) will play the optimal action

a*_t = arg max_{i ∈ {0,...,N}} s_{t,i}^T θ I(i > 0)

with the highest allowed probability. The remainder of the probability is assigned as follows. If the optimal action is nonzero, the optimal policy will then play the zero action with the remaining probability (which is the minimum allowed probability of playing the zero action). If the optimal action is zero, the optimal policy will play the nonzero action with the highest expected reward

ā*_t = arg max_{i ∈ {1,...,N}} s_{t,i}^T θ

with the remaining probability, i.e. π_min.
To summarize, under the constraint 1 − π*(0, t) ∈ [π_min, π_max], the expected-reward-maximizing policy plays arm a with probability π*(a, t), where

If a*_t ≠ 0:  π*(a*_t, t) = π_max,  π*(0, t) = 1 − π_max,  π*(a, t) = 0 for all a ∉ {0, a*_t};
If a*_t = 0:  π*(0, t) = 1 − π_min,  π*(ā*_t, t) = π_min,  π*(a, t) = 0 for all a ∉ {0, ā*_t}.   (3)

3 Action-centered contextual bandit

Since the observed reward always contains the sum of the baseline reward and the differential reward we are estimating, and the baseline reward is arbitrarily complex, the main challenge is to isolate the differential reward at each time step. We do this via an action-centering trick, which randomizes the action at each time step, allowing us to construct an estimator whose expectation is proportional to the differential reward r_t(ā_t) − r_t(0), where ā_t is the nonzero action chosen by the bandit at time t to be randomized against the zero action. For simplicity of notation, we set the probability of the bandit taking a nonzero action, P(a_t > 0), to be equal to 1 − π(0, t) = π_t.

3.1 Centering the actions - an unbiased r_t(ā_t) − r_t(0) estimate

To determine a policy, the bandit must learn the coefficients θ of the model for the differential reward r_t(ā_t) − r_t(0) = s_{t,ā_t}^T θ as a function of ā_t. If the bandit had access at each time t to the differential reward r_t(ā_t) − r_t(0), we could estimate θ using a penalized least-squares approach by minimizing

Σ_{t=1}^T (r_t(ā_t) − r_t(0) − θ^T s_{t,ā_t})² + λ‖θ‖²₂

over θ, where r_t(a) is the reward under action a at time t (Agrawal & Goyal, 2013).
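If the differential rewards were observable, this objective would be ordinary ridge regression with a closed form. A minimal sketch under that oracle assumption (the function name and array layout are illustrative):

```python
import numpy as np

def ridge_theta(S, diff_rewards, lam=1.0):
    # Closed-form minimizer of sum_t (diff_t - theta^T s_t)^2 + lam * ||theta||^2,
    # where row t of S is s_{t, a_bar_t} and diff_t = r_t(a_bar_t) - r_t(0).
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + lam * np.eye(d), S.T @ diff_rewards)
```

The rest of this section replaces the unobservable `diff_rewards` with an unbiased surrogate built from the single observed reward.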
This corresponds to the Bayesian estimator when the reward is Gaussian. Although we have access only to r_t(a_t), not r_t(ā_t) − r_t(0), observe that given ā_t, the bandit randomizes to a_t = ā_t with probability π_t and a_t = 0 otherwise. Thus

E[(I(a_t > 0) − π_t) r_t(a_t) | H_{t−1}, ā_t, s̄_t] = π_t(1 − π_t) r_t(ā_t) − (1 − π_t) π_t r_t(0)
                                                   = π_t(1 − π_t)(r_t(ā_t) − r_t(0)).   (4)

Thus (I(a_t > 0) − π_t) r_t(a_t), which only uses the observed r_t(a_t), is proportional to an unbiased estimator of r_t(ā_t) − r_t(0). Recalling that ā_t, a_t are both known since they are chosen by the bandit at time t, we create the estimate of the differential reward between ā_t and action 0 at time t as

r̂_t(ā_t) = (I(a_t > 0) − π_t) r_t(a_t).

The corresponding penalized weighted least-squares estimator for θ using r̂_t(ā_t) is the minimizer of

Σ_{t=1}^T π_t(1 − π_t)(r̂_t(ā_t)/(π_t(1 − π_t)) − θ^T s_{t,ā_t})² + ‖θ‖²₂
  = Σ_{t=1}^T [ (r̂_t(ā_t))²/(π_t(1 − π_t)) − 2 r̂_t(ā_t) θ^T s_{t,ā_t} + π_t(1 − π_t)(θ^T s_{t,ā_t})² ] + ‖θ‖²₂
  = c − 2 θ^T b̂ + θ^T B θ + ‖θ‖²₂,   (5)

where for simplicity of presentation we have used unit penalization ‖θ‖²₂, and

b̂ = Σ_{t=1}^T (I(a_t > 0) − π_t) s_{t,ā_t} r_t(a_t),   B = I + Σ_{t=1}^T π_t(1 − π_t) s_{t,ā_t} s_{t,ā_t}^T.

The weighted least-squares weights are π_t(1 − π_t), since var[r̂_t(ā_t)/(π_t(1 − π_t)) | H_{t−1}, ā_t, s̄_t] = var[r̂_t(ā_t) | H_{t−1}, ā_t, s̄_t]/(π_t(1 − π_t))², and the standard deviation of r̂_t(ā_t) = (I(a_t > 0) − π_t) r_t(a_t) given H_{t−1}, ā_t, s̄_t is of 
order g_t(s̄_t) = O(1). The minimizer of (5) is θ̂ = B⁻¹b̂.

3.2 Action-Centered Thompson Sampling

As the Thompson sampling approach generates probabilities of taking an action, rather than selecting an action, Thompson sampling is particularly suited to our regression approach. We follow the basic framework of the contextual Thompson sampling approach presented by Agrawal & Goyal (2013), extending and modifying it to incorporate our action-centered estimator and probability constraints. The critical step in Thompson sampling is randomizing the model coefficients according to the prior N(θ̂, v²B⁻¹) for θ at time t. A θ′ ∼ N(θ̂, v²B⁻¹) is generated, and the action a_t chosen to maximize s_{t,a}^T θ′. The probability that this procedure selects any action a is determined by the distribution of θ′; however, it may select action 0 with a probability not in the required range [1 − π_max, 1 − π_min]. We thus introduce a two-step hierarchical procedure. After generating the random θ′, we instead choose the nonzero ā_t maximizing the expected reward

ā_t = arg max_{a ∈ {1,...,N}} s_{t,a}^T θ′.

Then we randomly determine whether to take the nonzero action, choosing a_t = ā_t with probability

π_t = P(a_t > 0) = max(π_min, min(π_max, P(s_{t,ā_t}^T θ̃ > 0))),   (6)

and a_t = 0 otherwise, where θ̃ ∼ N(θ̂, v²B⁻¹). Here P(s_{t,ā_t}^T θ̃ > 0) is the probability that the expected relative reward s_{t,ā_t}^T θ̃ of action ā_t is higher than zero for θ̃ ∼ N(θ̂, v²B⁻¹). This probability is easily computed using the normal CDF. Finally the bandit updates b̂, B and computes an updated θ̂ = B⁻¹b̂. Our action-centered Thompson sampling algorithm is summarized in Algorithm 1.

Algorithm 1 Action-Centered Thompson Sampling
1: Set B = I, θ̂ = 0, b̂ = 0, choose [π_min, π_max].
2: for t = 1, 2, . . . do
3:   Observe current context s̄_t and form s_{t,a} for each a ∈ {1, . . . , N}.
4:   Randomly generate θ′ ∼ N(θ̂, v²B⁻¹).
5:   Let ā_t = arg max_{a ∈ {1,...,N}} s_{t,a}^T θ′.
6:   Compute probability π_t of taking a nonzero action according to (6).
7:   Play action a_t = ā_t with probability π_t, else play a_t = 0.
8:   Observe reward r_t(a_t) and update
       B = B + π_t(1 − π_t) s_{t,ā_t} s_{t,ā_t}^T,   b̂ = b̂ + s_{t,ā_t}(I(a_t > 0) − π_t) r_t(a_t),   θ̂ = B⁻¹b̂.
9: end for

4 Regret analysis

Classically, the regret of a bandit is defined as the difference between the reward achieved by taking the optimal action a*_t and the expected reward received by playing the arm a_t chosen by the bandit:

regret_classical(t) = s_{t,a*_t}^T θ − s_{t,a_t}^T θ,   (7)

where the expectation is taken conditionally on a_t, s_{t,a_t}, H_{t−1}. For simplicity, let π*_t = 1 − π*(0, t) be the probability that the optimal policy takes a nonzero action, and recall that π_t = 1 − π(0, t) is the probability the bandit takes a nonzero action. The probability constraint implies that the optimal policy (3) plays the optimal arm with a probability bounded away from 0 and 1, hence definition (7) is no longer meaningful. We can instead create a regret that is the difference in expected rewards conditioned on ā_t, π_t, s_{t,a_t}, H_{t−1}, but not on the randomized action a_t:

regret(t) = π*_t s_{t,ā*_t}^T θ − π_t s_{t,ā_t}^T θ,   (8)

where we have recalled that given ā_t, the bandit plays action a_t = ā_t with probability π_t and plays a_t = 0, with differential reward 0, otherwise. The action-centered contextual bandit attempts to minimize the cumulative regret R(T) = Σ_{t=1}^T regret(t) over horizon T.

4.1 Regret bound for Action-Centered Thompson Sampling

In the following theorem we show that with high probability, the probability-constrained Thompson sampler has low regret relative to the optimal probability-constrained policy.

Theorem 1. Consider the action-centered contextual bandit problem, where f̄_t is potentially time varying, and s̄_t at time t given H_{t−1} is chosen by an adversary. Under this regime, the total regret at time T for the action-centered Thompson sampling contextual bandit (Algorithm 1) satisfies

R(T) ≤ C (d²/ε) √(T^{1+ε}) (log(Td) log(1/δ))

with probability at least 1 − 3δ/2, for any 0 < ε < 1, 0 < δ < 1. The constant C is given in the proof.

Observe that this regret bound does not depend on the number of actions N, is sublinear in T, and scales only with the complexity d of the interaction term, not the complexity of the baseline reward g. Furthermore, ε = 1/log(T) can be chosen, giving a regret of order O(d²√T).

This bound is of the same order as the baseline Thompson sampling contextual bandit in the adversarial setting when the baseline is identically zero (Agrawal & Goyal, 2013).
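Putting the action-centered estimator and the constrained randomization together, Algorithm 1 can be sketched in a few dozen lines. This is an illustrative implementation, not the authors' code: the `contexts`/`reward_fn` interface, the prior scale `v`, and all names are assumptions.

```python
import math
import numpy as np

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def action_centered_ts(contexts, reward_fn, v=0.1, pi_min=0.2, pi_max=0.8, seed=0):
    """Sketch of Action-Centered Thompson Sampling (Algorithm 1).

    contexts: length-T list of (N, d) arrays; row a-1 holds s_{t,a}.
    reward_fn(t, a): observed reward for playing action a (0 = do nothing) at t.
    """
    rng = np.random.default_rng(seed)
    d = contexts[0].shape[1]
    B, b, theta_hat = np.eye(d), np.zeros(d), np.zeros(d)
    actions = []
    for t, S in enumerate(contexts):
        cov = v ** 2 * np.linalg.inv(B)
        theta_prime = rng.multivariate_normal(theta_hat, cov)
        a_bar = int(np.argmax(S @ theta_prime))       # candidate nonzero action
        s = S[a_bar]
        # pi_t = clip of P(s^T theta_tilde > 0), theta_tilde ~ N(theta_hat, cov),
        # computed with the normal CDF.
        mu = float(s @ theta_hat)
        sd = math.sqrt(max(float(s @ cov @ s), 1e-12))
        pi_t = min(pi_max, max(pi_min, normal_cdf(mu / sd)))
        a_t = a_bar + 1 if rng.random() < pi_t else 0  # randomize vs action 0
        r = reward_fn(t, a_t)
        # Action-centered updates: weights pi_t(1 - pi_t), centered reward.
        B += pi_t * (1 - pi_t) * np.outer(s, s)
        b += s * ((a_t > 0) - pi_t) * r
        theta_hat = np.linalg.solve(B, b)
        actions.append(a_t)
    return theta_hat, actions
```

Note that the baseline enters every observed reward, yet only the centered quantity (I(a_t > 0) − π_t) r_t(a_t) is accumulated in `b`, so the baseline does not bias the estimate of θ.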
When the baseline can be modeled with d′ features where d′ > d, our method achieves O(d²√T) regret whereas the standard Thompson sampling approach has O((d + d′)²√T) regret. Furthermore, when the baseline reward is time-varying, the worst-case regret of the standard Thompson sampling approach is O(T), while the regret of our method remains O(d²√T).

4.2 Proof of Theorem 1 - Decomposition of the regret

We will first bound the regret (8) at time t:

regret(t) = π*_t s_{t,ā*_t}^T θ − π_t s_{t,ā_t}^T θ
          = (π*_t − π_t)(s_{t,ā_t}^T θ) + π*_t (s_{t,ā*_t}^T θ − s_{t,ā_t}^T θ)
          ≤ (π*_t − π_t)(s_{t,ā_t}^T θ) + (s_{t,ā*_t}^T θ − s_{t,ā_t}^T θ),   (9)

where the inequality holds since (s_{t,ā*_t}^T θ − s_{t,ā_t}^T θ) ≥ 0 and 0 < π*_t < 1 by definition. Then

R(T) = Σ_{t=1}^T regret(t) ≤ Σ_{t=1}^T (π*_t − π_t)(s_{t,ā_t}^T θ)  [term I]  +  Σ_{t=1}^T (s_{t,ā*_t}^T θ − s_{t,ā_t}^T θ)  [term II].   (10)

Observe that we have decomposed the regret into a term I that depends on the choice of the randomization π_t between the zero and nonzero action, and a term II that depends only on the choice of the potential nonzero action ā_t prior to the randomization. We bound I using concentration inequalities, and bound II using arguments paralleling those for standard Thompson sampling.

Lemma 1. Suppose that the conditions of Theorem 1 apply.
Then with probability at least 1 − δ/2,

I ≤ C √(d³ T log(Td) log(1/δ))

for some constant C given in the proof.

Lemma 2. Suppose that the conditions of Theorem 1 apply. Then term II can be bounded as

II = Σ_{t=1}^T (s_{t,ā*_t}^T θ − s_{t,ā_t}^T θ) ≤ C′ (d²/ε) √(T^{1+ε}) log(1/δ) log(Td),

where the inequality holds with probability at least 1 − δ.

The proofs are contained in Sections 4 and 5 in the supplement respectively. In the derivation, the “pseudo-actions” ā_t that Algorithm 1 chooses prior to the π_t baseline-nonzero randomization correspond to the actions in the standard contextual bandit setting. Note that I involves only ā_t, not ā*_t, hence it is not surprising that the bound is smaller than that for II. Combining Lemmas 1 and 2 via the union bound gives Theorem 1.

5 Results

5.1 Simulated data

We first conduct experiments with simulated data, using N = 2 possible nonzero actions. In each experiment, we choose a true reward generative model r_t(s, a) inspired by data from the HeartSteps study (for details see Section 1.1 in the supplement), and generate two length-T sequences of state vectors s_{t,a} ∈ R^{NK} and s̄_t ∈ R^L, where the s̄_t are iid Gaussian and s_{t,a} is formed by stacking columns I(a = i)[1; s̄_t] for i = 1, . . . , N. We consider both nonlinear and nonstationary baselines, while keeping the treatment effect models the same. The bandit under evaluation iterates through the T time points, at each choosing an action and receiving a reward generated according to the chosen model. We set π_min = 0.2, π_max = 0.8.

At each time step, the reward under the optimal policy is calculated and compared to the reward received by the bandit to form the regret regret(t).
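In simulation, this per-step regret against the probability-constrained optimal policy can be computed directly from the true coefficients. A sketch; the function name, two-case optimal policy, and array layout are illustrative:

```python
import numpy as np

def expected_regret(S, theta, pi_t, a_bar, pi_min=0.2, pi_max=0.8):
    """Per-step regret of a bandit that plays nonzero action a_bar (1-indexed)
    with probability pi_t, against the probability-constrained optimal policy.
    S: (N, d) features of the nonzero actions; theta: true coefficients."""
    effects = S @ theta
    best = int(np.argmax(effects))
    # Optimal policy plays the best nonzero arm with pi_max if its effect is
    # positive, and with the minimum allowed probability pi_min otherwise.
    pi_star = pi_max if effects[best] > 0 else pi_min
    return pi_star * effects[best] - pi_t * effects[a_bar - 1]
```

Summing this quantity over t gives the cumulative regret curves reported below.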
We can then plot the cumulative regret

cumulative regret(t) = Σ_{τ=1}^t regret(τ).

In the first experiment, the baseline reward is nonlinear. Specifically, we generate rewards using

r_t(s_{t,a_t}, s̄_t, a_t) = θ^T s_{t,a_t} + 2 I(|[s̄_t]_1| < 0.8) + n_t,

where n_t ∼ N(0, 1) and θ ∈ R^8 is a fixed vector listed in supplement section 1.1. This simulates the quite likely scenario that, for a given individual, the baseline reward is higher for small absolute deviations from the mean of the first context feature, i.e. rewards are higher when the feature at the decision point is “near average”, with reward decreasing for abnormally high or low values. We run the benchmark Thompson sampling algorithm (Agrawal & Goyal, 2013) and our proposed action-centered Thompson sampling algorithm, computing the cumulative regrets and taking the median over 500 random trials. The results are shown in Figure 1, demonstrating linear growth of the benchmark Thompson sampling algorithm and significantly lower, sublinear regret for our proposed method.

[Figure 1: Nonlinear baseline reward g, in scenario with 2 nonzero actions and reward function based on collected HeartSteps data. Cumulative regret shown for proposed Action-Centered approach, compared to baseline contextual bandit; median computed over 100 random trials. (a) Median cumulative regret. (b) Median with 1st and 3rd quartiles (dashed).]

We then consider a scenario with the baseline reward g_t(·) changing in time. We generate rewards as

r_t(s_{t,a_t}, s̄_t, a_t) = θ^T s_{t,a_t} + η_t^T s̄_t + n_t,

where n_t ∼ N(0, 1), θ is a fixed vector as above, and η_t ∈ R^7, s̄_t are generated as smoothly varying Gaussian processes (supplement Section 1.1).
The cumulative regret is shown in Figure 2, again demonstrating linear regret for the baseline approach and significantly lower sublinear regret for our proposed action-centering algorithm, as expected.

[Figure 2: Nonstationary baseline reward g, in scenario with 2 nonzero actions and reward function based on collected HeartSteps data. Cumulative regret shown for proposed Action-Centered approach, compared to baseline contextual bandit; median computed over 100 random trials. (a) Median cumulative regret. (b) Median with 1st and 3rd quartiles (dashed).]

5.2 HeartSteps study data

The HeartSteps study collected the sensor and weather-based features shown in Figure 1 at 5 decision points per day for each study participant. If the participant was available at a decision point, a message was sent with constant probability 0.6. The sent message could be one of several activity or anti-sedentary messages chosen by the system. The reward for that message was defined to be log(0.5 + x), where x is the step count of the participant in the 30 minutes following the suggestion.

As noted in the introduction, the baseline reward, i.e. the step count of a subject when no message is sent, not only depends on the state in a complex way but is likely also dependent on a large number of unobserved variables. Because of these unobserved variables, the mapping from the observed state to the reward is believed to be strongly time-varying.
Both of these characteristics (a complex, time-varying baseline reward function) suggest the use of the action-centering approach.

We run our contextual bandit on the HeartSteps data, considering the binary action of whether or not to send a message at a given decision point based on the features listed in Figure 1 in the supplement. Each user is considered independently, for maximum personalization and independence of results. As above, we set \pi_min = 0.2, \pi_max = 0.8.

We perform offline evaluation of the bandit using the method of Li et al. (2011), which uses the sequence of states, actions, and rewards in the data to form a near-unbiased estimate of the average expected reward achieved by each algorithm, averaging over all users. We used a total of 33,797 time points to create the reward estimates. The resulting estimates for the improvement in average reward over the baseline randomization, averaged over 100 random seeds of the bandit algorithm, are shown in Figure 2 of the supplement, with the proposed action-centering approach achieving the highest reward.
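As a rough illustration of this replay-style offline evaluation, the sketch below (our own simplification, run on synthetic data rather than the actual HeartSteps log) keeps only the events where the candidate policy agrees with the logged action and importance-weights them by the logging probability. The name `replay_evaluate` and the toy data generator are hypothetical; the 0.6 send probability and the log(0.5 + steps) reward mirror the study's setup.

```python
import numpy as np

def replay_evaluate(policy, logged):
    """Estimate a policy's average reward from randomized logged data by replay.

    logged: list of (context, action, probability, reward) tuples, where
    `probability` is the logging policy's chance of the recorded action.
    Only events where the evaluated policy matches the logged action
    contribute, reweighted by 1/probability (a weighted variant of the
    uniform-logging replay estimator of Li et al., 2011).
    """
    total, weight = 0.0, 0.0
    for context, action, prob, reward in logged:
        if policy(context) == action:
            total += reward / prob
            weight += 1.0 / prob
    return total / weight if weight > 0 else 0.0

# Toy log: messages sent with constant probability 0.6 as in the study,
# reward defined as log(0.5 + steps); step counts are synthetic.
rng = np.random.default_rng(1)
logged = []
for _ in range(1000):
    x = rng.normal()                                      # one-dimensional context
    a = int(rng.random() < 0.6)                           # logged (randomized) action
    steps = max(0.0, 50 + 30 * a * (x > 0) + 10 * rng.normal())
    logged.append((x, a, 0.6 if a == 1 else 0.4, np.log(0.5 + steps)))

always_send = lambda c: 1
never_send = lambda c: 0
print(replay_evaluate(always_send, logged), replay_evaluate(never_send, logged))
```

Li et al. (2011) analyze the uniform-logging case; the 1/probability weights here are the standard importance-weighting adjustment for a non-uniform logging policy such as the study's 0.6 send probability.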
Since the reward is logarithmic in the number of steps, the results imply that the benchmark Thompson sampling approach achieves an average 1.6% increase in step counts over the non-adaptive baseline, while our proposed method achieves an increase of 3.9%.

6 Conclusion

Motivated by emerging challenges in adaptive decision making in mobile health, in this paper we proposed the action-centered Thompson sampling contextual bandit, exploiting the randomness of the Thompson sampler and an action-centering approach to orthogonalize out the baseline reward. We proved that our approach enjoys low regret bounds that scale only with the complexity of the interaction term, allowing the baseline reward to be arbitrarily complex and time-varying.

Acknowledgments

This work was supported in part by grants R01 AA023187, P50 DA039838, U54EB020404, R01 HL125440 NHLBI/NIA, NSF CAREER IIS-1452099, and a Sloan Research Fellowship.

References

Abe, Naoki and Nakamura, Atsuyoshi. Learning to optimally schedule internet banner advertisements. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 12–21. Morgan Kaufmann Publishers Inc., 1999.

Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 127–135, 2013.

Bastani, Hamsa and Bayati, Mohsen. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.

Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

Dudik, Miroslav, Hsu, Daniel, Kale, Satyen, Karampatziakis, Nikos, Langford, John, Reyzin, Lev, and Zhang, Tong. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence, pp. 169–178. AUAI Press, 2011.

Klasnja, Predrag, Hekler, Eric B., Shiffman, Saul, Boruvka, Audrey, Almirall, Daniel, Tewari, Ambuj, and Murphy, Susan A. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology, 34(Suppl):1220–1228, Dec 2015.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.

Li, Lihong, Chu, Wei, Langford, John, and Wang, Xuanhui. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 297–306. ACM, 2011.

Liao, Peng, Klasnja, Predrag, Tewari, Ambuj, and Murphy, Susan A. Sample size calculations for micro-randomized trials in mHealth. Statistics in Medicine, 2015.

May, Benedict C., Korda, Nathan, Lee, Anthony, and Leslie, David S. Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069–2106, 2012.

Puterman, Martin L. Markov decision processes: Discrete stochastic dynamic programming.
John Wiley & Sons, 2005.

Seldin, Yevgeny, Auer, Peter, Shawe-Taylor, John S., Ortner, Ronald, and Laviolette, François. PAC-Bayesian analysis of contextual bandits. In Advances in Neural Information Processing Systems, pp. 1683–1691, 2011.

Slivkins, Aleksandrs. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.

Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.

Tewari, Ambuj and Murphy, Susan A. From ads to interventions: Contextual bandits in mobile health. In Rehg, Jim, Murphy, Susan A., and Kumar, Santosh (eds.), Mobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017.

Valko, Michal, Korda, Nathan, Munos, Rémi, Flaounas, Ilias, and Cristianini, Nello. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, pp. 654, 2013.