{"title": "Weighted Linear Bandits for Non-Stationary Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 12040, "page_last": 12049, "abstract": "We consider a  stochastic linear bandit model in which the available actions\n  correspond to arbitrary context vectors whose associated rewards\n  follow a non-stationary linear regression model.\n  In this setting, the unknown regression parameter is allowed to vary in time.  To address this problem, we propose\n  D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget\n  the past.\n  This involves  studying the deviations of the sequential weighted least-squares estimator under generic assumptions.\n  As a by-product, we obtain novel deviation results that can be used  beyond non-stationary environments.\n   We provide theoretical guarantees on the behavior of\n  D-LinUCB in both slowly-varying and abruptly-changing\n  environments. We obtain an upper bound on the\n  dynamic regret that is of order d B_T^{1/3}T^{2/3}, where B_T\n  is a measure of non-stationarity (d and T being, respectively, dimension and horizon). This rate is known to be optimal. We\n  also illustrate the empirical performance of  D-LinUCB\n  and compare it with recently proposed alternatives in\n  simulated environments.", "full_text": "Weighted Linear Bandits for Non-Stationary\n\nEnvironments\n\nYoan Russac\n\nCNRS, Inria, ENS, Universit\u00e9 PSL\n\nyoan.russac@ens.fr\n\nClaire Vernade\n\nDeepmind\n\nvernade@google.com\n\nOlivier Capp\u00e9\n\nCNRS, Inria, ENS, Universit\u00e9 PSL\n\nolivier.cappe@cnrs.fr\n\nAbstract\n\nWe consider a stochastic linear bandit model in which the available actions corre-\nspond to arbitrary context vectors whose associated rewards follow a non-stationary\nlinear regression model. In this setting, the unknown regression parameter is al-\nlowed to vary in time. To address this problem, we propose D-LinUCB, a novel\noptimistic algorithm based on discounted linear regression, where exponential\nweights are used to smoothly forget the past. This involves studying the devia-\ntions of the sequential weighted least-squares estimator under generic assumptions.\nAs a by-product, we obtain novel deviation results that can be used beyond non-\nstationary environments. We provide theoretical guarantees on the behavior of\nD-LinUCB in both slowly-varying and abruptly-changing environments. We ob-\ntain an upper bound on the dynamic regret that is of order d2/3B1/3\nT T 2/3, where\nBT is a measure of non-stationarity (d and T being, respectively, dimension and\nhorizon). This rate is known to be optimal. We also illustrate the empirical perfor-\nmance of D-LinUCB and compare it with recently proposed alternatives in simulated\nenvironments.\n\n1\n\nIntroduction\n\nMulti-armed bandits offer a class of models to address sequential learning tasks that involve\nexploration-exploitation trade-offs.\nIn this work we are interested in structured bandit models,\nknown as stochastic linear bandits, in which linear regression is used to predict rewards [1, 2, 22].\nA typical application of bandit algorithms based on the linear model is online recommendation where\nactions are items to be, for instance, ef\ufb01ciently arranged on personalized web pages to maximize some\nconversion rate. However, it is unlikely that customers\u2019 preferences remain stable and the collected\ndata becomes progressively obsolete as the interest for the items evolve. 
Hence, it is essential to design adaptive bandit agents rather than restarting the learning from scratch on a regular basis. In this work, we consider the use of weighted least-squares as an efficient method to progressively forget past interactions. Thus, we address sequential learning problems in which the parameter of the linear bandit evolves with time.

Our first contribution consists in extending existing deviation inequalities to sequential weighted least-squares. Our result applies to a large variety of bandit problems and is of independent interest. In particular, it extends the recent analysis of heteroscedastic environments by [18]. It can also be useful to deal with class-imbalance situations, or, as we focus on here, in non-stationary environments.

As a second major contribution, we apply our results to propose D-LinUCB, an adaptive linear bandit algorithm based on carefully designed exponential weights. D-LinUCB can be implemented fully recursively, without requiring the storage of past actions, with a numerical complexity that is comparable to that of LinUCB. To characterize the performance of the algorithm, we provide a unified regret analysis for abruptly-changing and slowly-varying environments.

The setting and notations are presented below and we state our main deviation result in Section 2. Section 3 is dedicated to non-stationary linear bandits: we describe our algorithm and provide regret upper bounds in abruptly-changing and slowly-varying environments. We complete this theoretical study with a set of experiments in Section 4.

1.1 Model and Notations

The setting we consider in this paper is a non-stationary variant of the stochastic linear bandit problem considered in [1, 22], where, at each round $t \ge 1$, the learner

- receives a finite set of feasible actions $\mathcal{A}_t \subset \mathbb{R}^d$;
- chooses an action $A_t \in \mathcal{A}_t$ and receives a reward $X_t$ such that
$$X_t = \langle A_t, \theta^\star_t \rangle + \eta_t, \qquad (1)$$
where $\theta^\star_t \in \mathbb{R}^d$ is an unknown parameter and $\eta_t$ is, conditionally on the past, a $\sigma$-subgaussian random noise.

The action set $\mathcal{A}_t$ may be arbitrary but its components are assumed to be bounded, in the sense that $\|a\|_2 \le L$ for all $a \in \mathcal{A}_t$. The time-varying parameter is also assumed to be bounded: $\|\theta^\star_t\|_2 \le S$ for all $t$. We further assume that $|\langle a, \theta^\star_t \rangle| \le 1$ for all $t$ and all $a \in \mathcal{A}_t$ (obviously, this could be guaranteed by assuming that $L = S = 1$, but we indicate the dependence on $L$ and $S$ in order to facilitate the interpretation of some results). For a positive definite matrix $M$ and a vector $x$, we denote by $\|x\|_M$ the norm $\sqrt{x^\top M x}$.

The goal of the learner is to minimize the expected dynamic regret defined as
$$R(T) = \mathbb{E}\left[\sum_{t=1}^T \max_{a \in \mathcal{A}_t} \langle a, \theta^\star_t \rangle - X_t\right] = \mathbb{E}\left[\sum_{t=1}^T \max_{a \in \mathcal{A}_t} \langle a - A_t, \theta^\star_t \rangle\right]. \qquad (2)$$

Even in the stationary case, i.e., when $\theta^\star_t = \theta^\star$, there is, in general, no single fixed best action in this model.
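To make the protocol concrete, the following minimal Python sketch simulates the interaction model (1) and accumulates the dynamic regret (2). The randomly drawn action sets, the Gaussian noise, and the agent interface (select/update) are illustrative choices of ours, not part of the formal setting.

import numpy as np

def run(theta_path, agent, K=10, L=1.0, sigma=0.1, seed=0):
    # theta_path: (T, d) array holding the unknown parameter sequence theta*_t.
    # agent: object exposing select(action_set) -> index and update(a, x).
    rng = np.random.default_rng(seed)
    T, d = theta_path.shape
    regret = 0.0
    for t in range(T):
        # Arbitrary finite action set, rescaled so that ||a||_2 <= L.
        A = rng.normal(size=(K, d))
        A *= L / np.maximum(1.0, np.linalg.norm(A, axis=1))[:, None]
        means = A @ theta_path[t]                 # <a, theta*_t> for each action
        k = agent.select(A)
        x = means[k] + sigma * rng.normal()       # reward, following model (1)
        agent.update(A[k], x)
        regret += means.max() - means[k]          # instantaneous term of (2)
    return regret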
When making stronger structural assumptions on $\mathcal{A}_t$, one recovers specific instances that have also been studied in the literature. In particular, the canonical basis of $\mathbb{R}^d$, $\mathcal{A}_t = \{e_1, \dots, e_d\}$, yields the familiar (non-contextual) multi-armed bandit model [20]. Another variant, studied by [15] and others, is obtained when $\mathcal{A}_t = \{e_1 \otimes a_t, \dots, e_k \otimes a_t\}$, where $\otimes$ denotes the Kronecker product and $a_t$ is a time-varying context vector shared by the $k$ actions.

1.2 Related Work

There is an important literature on online learning in changing environments. For the sake of conciseness, we restrict the discussion to works that consider specifically the stochastic linear bandit model in (1), including its restriction to the simpler (non-stationary) multi-armed bandit model. Note that there is also a rich line of works that consider possibly non-linear contextual models in the case where one can make probabilistic assumptions on the contexts [10, 23].

Controlling the regret with respect to the non-stationary optimal action defined in (2) depends on the assumptions that are made on the time-variations of $\theta^\star_t$. A generic way of quantifying them is through a variation bound $B_T = \sum_{s=1}^{T-1} \|\theta^\star_s - \theta^\star_{s+1}\|_2$ [4, 6, 11], similar to the penalty used in the group fused Lasso [8]. The main advantage of using the variation budget is that it includes both slowly-varying and abruptly-changing environments. For $K$-armed bandits with known $B_T$, [4-6] achieve the tight dynamic regret bound of $O(K^{1/3} B_T^{1/3} T^{2/3})$. For linear bandits, [11, 12] propose an algorithm based on the use of a sliding window and provide a $O(d^{2/3} B_T^{1/3} T^{2/3})$ dynamic regret bound; since this contribution is close to ours, we discuss it further in Section 3.2.

A more specific non-stationary setting arises when the number of changes in the parameter is bounded by $\Gamma_T$, as in traditional change-point models. The problem is usually referred to as switching bandits or abruptly-changing environments. It is, for instance, the setting considered in the work by Garivier and Moulines [14], who analyzed the dynamic regret of UCB strategies based on either a sliding window or exponential discounting. For both policies, they prove upper bounds on the regret in $O(\sqrt{\Gamma_T T})$ when $\Gamma_T$ is known. They also provide a lower bound in a specific non-stationary setting, showing that $R(T) = \Omega(\sqrt{T})$. The algorithm ideas can be traced back to [19]. [28] shows that a horizon-independent version of the sliding-window algorithm can also be analyzed in a slowly-varying setting. [17] analyze windowing and discounting approaches to address dynamic pricing guided by a (time-varying) linear regression model. Discount factors have also been used with Thompson sampling in dynamic environments, as in [16, 26].

In abruptly-changing environments, the alternative approach relies on change-point detection [3, 7, 9, 29, 30]. A bound on the regret in $O((\frac{1}{\Delta^2} + \frac{1}{\Delta})\log(T))$ is proven by [30], where $\Delta$ is the smallest gap that can be detected by the algorithm, which has to be given as prior knowledge. [9] proves a minimax bound in $O(\sqrt{\Gamma_T K T})$ if $\Gamma_T$ is known. [7] achieves a rate of $O(\sqrt{\Gamma_T K T})$ without any prior knowledge of the gaps or $\Gamma_T$. In the contextual case, [29] builds on the same idea: they use a pool of LinUCB learners called slave models as experts and they add a new model when no existing slave is able to give a good prediction, that is, when a change is detected.
A limitation of such an approach, however, is that it cannot adapt to some slowly-varying environments, as will be illustrated in Section 4. From a practical viewpoint, the methods based either on sliding windows or on change-point detection require the storage of past actions, whereas those based on discount factors can be implemented fully recursively.

Finally, non-stationarity may also arise in more specific scenarios connected, for instance, to the decaying attention of the users, as investigated in [21, 24, 27]. In the following, we consider the general case where the parameters satisfy the variation bound, i.e., $\sum_{t=1}^{T-1} \|\theta^\star_t - \theta^\star_{t+1}\|_2 \le B_T$, and we propose an algorithm based on discounted linear regression.

2 Confidence Bounds for Weighted Linear Bandits

In this section, we consider the concentration of the weighted regularized least-squares estimator, when used with general weights and regularization parameters. To the best of our knowledge, there is no such result in the literature for sequential learning, i.e., when the current regressor may depend on the random outcomes observed in the past. The particular case considered in Lemma 5 of [18] (heteroscedastic noise with optimal weights) stays very close to the unweighted case and we show below how to extend this result. We believe that this new bound is of interest beyond the specific model considered in this paper. For the sake of clarity, we first focus on the case of regression models with fixed parameter, where $\theta^\star_t = \theta^\star$ for all $t$.

First consider a deterministic sequence of regularization parameters $(\lambda_t)_{t \ge 1}$. The reason why these should be non-constant for weighted least-squares will appear clearly in Section 3. Next, define by $\mathcal{F}_t = \sigma(X_1, \dots, X_t)$ the filtration associated with the random observations. We assume that both the actions $A_t$ and the positive weights $w_t$ are predictable, that is, $\mathcal{F}_{t-1}$-measurable.

Defining by
$$\hat{\theta}_t = \arg\min_{\theta \in \mathbb{R}^d} \left(\sum_{s=1}^t w_s (X_s - \langle A_s, \theta \rangle)^2 + \lambda_t \|\theta\|_2^2\right)$$
the regularized weighted least-squares estimator of $\theta^\star$ at time $t$, one has
$$\hat{\theta}_t = V_t^{-1} \sum_{s=1}^t w_s A_s X_s \quad \text{where} \quad V_t = \sum_{s=1}^t w_s A_s A_s^\top + \lambda_t I_d, \qquad (3)$$
and $I_d$ denotes the $d$-dimensional identity matrix. We further consider an arbitrary sequence of positive parameters $(\mu_t)_{t \ge 1}$ and define the matrix
$$\widetilde{V}_t = \sum_{s=1}^t w_s^2 A_s A_s^\top + \mu_t I_d. \qquad (4)$$

The matrix $\widetilde{V}_t$ is strongly connected to the variance of the estimator $\hat{\theta}_t$, which involves the squares of the weights $(w_s^2)_{s \ge 1}$. For the time being, $\mu_t$ is arbitrary and will be set as a function of $\lambda_t$ in order to optimize the deviation inequality.
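As a concrete illustration, here is a minimal NumPy sketch of the estimator (3) and the matrix (4); the function name is ours and, for concreteness, it takes $\mu_t = \lambda_t^2$, one choice compatible with the scale-invariance discussed below.

import numpy as np

def weighted_ls(actions, rewards, weights, lam):
    # Regularized weighted least-squares, cf. (3) and (4).
    # actions: (t, d) array of A_s; rewards: (t,) array of X_s;
    # weights: (t,) array of positive w_s; lam: regularization lambda_t.
    # Returns (theta_hat, V_t, V_tilde_t) with mu_t = lam**2 (illustrative).
    t, d = actions.shape
    V = lam * np.eye(d)
    V_tilde = (lam ** 2) * np.eye(d)
    b = np.zeros(d)
    for s in range(t):
        a = actions[s]
        V += weights[s] * np.outer(a, a)
        V_tilde += (weights[s] ** 2) * np.outer(a, a)
        b += weights[s] * rewards[s] * a
    theta_hat = np.linalg.solve(V, b)
    return theta_hat, V, V_tilde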
We then have the following maximal deviation inequality.

Theorem 1. For any $\mathcal{F}_t$-predictable sequences of actions $(A_t)_{t \ge 1}$ and positive weights $(w_t)_{t \ge 1}$, and for all $\delta > 0$,
$$\mathbb{P}\left(\forall t,\ \|\hat{\theta}_t - \theta^\star\|_{V_t \widetilde{V}_t^{-1} V_t} \le \frac{\lambda_t}{\sqrt{\mu_t}} S + \sigma \sqrt{2 \log(1/\delta) + d \log\left(1 + \frac{L^2 \sum_{s=1}^t w_s^2}{d \mu_t}\right)}\right) \ge 1 - \delta.$$

The proof of this theorem is deferred to the appendix and combines an argument using the method of mixtures and the use of a proper stopping time. The standard result used for least-squares [20, Chapter 20] is recovered by taking $\mu_t = \lambda_t$ and $w_t = 1$ (note that $\widetilde{V}_t$ is then equal to $V_t$). When the weights are not equal to 1, the appearance of the matrix $\widetilde{V}_t$ is a consequence of the fact that the variance terms are proportional to the squared weights $w_t^2$, while the least-squares estimator itself is defined with the weights $w_t$. In the weighted case, the matrix $V_t \widetilde{V}_t^{-1} V_t$ must be used to define the confidence ellipsoid.

An important property of the least-squares estimator is to be scale-invariant, in the sense that multiplying all weights $(w_s)_{1 \le s \le t}$ and the regularization parameter $\lambda_t$ by a constant leaves the estimator $\hat{\theta}_t$ unchanged. In Theorem 1, the only choice of sequence $(\mu_t)_{t \ge 1}$ that is compatible with this scale-invariance property is to take $\mu_t$ proportional to $\lambda_t^2$: then the matrix $V_t \widetilde{V}_t^{-1} V_t$ becomes scale-invariant (i.e., unchanged by the transformation $w_s \mapsto \alpha w_s$) and so does the upper bound of $\|\hat{\theta}_t - \theta^\star\|_{V_t \widetilde{V}_t^{-1} V_t}$ in Theorem 1. In the following, we will stick to this choice, while particularizing the choice of the weights $w_t$ to allow for non-stationary models.

It is possible to extend this result to heteroscedastic noise, when $\eta_t$ is $\sigma_t$-subgaussian and $\sigma_t$ is $\mathcal{F}_{t-1}$-measurable, by defining $\widetilde{V}_t$ as $\sum_{s=1}^t w_s^2 \sigma_s^2 A_s A_s^\top + \mu_t I_d$. In the next section, we will also use an extension of Theorem 1 to the non-stationary model presented in (1). In this case, Theorem 1 holds with $\theta^\star$ replaced by $V_t^{-1}(\sum_{s=1}^t w_s A_s A_s^\top \theta^\star_s + \lambda_t \theta^\star_r)$, where $r$ is an arbitrary time index (Proposition 3 in the appendix). The fact that $r$ can be chosen freely is a consequence of the assumption that the sequence of $\ell_2$-norms of the parameters $(\theta^\star_t)_{t \ge 1}$ is bounded by $S$.
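The bound of Theorem 1 only involves quantities known to the learner and can thus be evaluated directly; a sketch, in the same notation as above, follows.

import numpy as np

def confidence_radius(weights, lam, mu, S, sigma, L, d, delta):
    # Radius of Theorem 1: (lam / sqrt(mu)) * S + sigma * sqrt(...).
    w2 = np.sum(np.asarray(weights) ** 2)
    return (lam / np.sqrt(mu)) * S + sigma * np.sqrt(
        2 * np.log(1 / delta) + d * np.log(1 + L ** 2 * w2 / (d * mu)))

With $w_t = 1$ and $\mu_t = \lambda_t = \lambda$, this reduces to the usual radius for regularized least-squares used in [1, 20].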
3 Application to Non-stationary Linear Bandits

In this section, we consider the non-stationary model defined in (1) and propose a bandit algorithm in Section 3.1, called Discounted Linear Upper Confidence Bound (D-LinUCB), that relies on weighted least-squares to adapt to changes in the parameters $\theta^\star_t$. Analyzing the performance of D-LinUCB in Section 3.2, we show that it achieves reliable performance both for abruptly-changing and for slowly-drifting parameters.

3.1 The D-LinUCB Algorithm

Being adaptive to parameter changes implies reducing the influence of observations that are far back in the past, which suggests using weights $w_t$ that increase with time. In doing so, there are two important caveats to consider. First, this can only be effective if the sequence of weights grows sufficiently fast (see the analysis in the next section). We thus consider exponentially increasing weights of the form $w_t = \gamma^{-t}$, where $0 < \gamma < 1$ is the discount factor.

Next, due to the absence of assumptions on the action sets $\mathcal{A}_t$, the regularization is instrumental in obtaining guarantees of the form given in Theorem 1. In fact, if $w_t = \gamma^{-t}$ while $\lambda_t$ does not increase sufficiently fast, then the term $\log(1 + (L^2 \sum_{s=1}^t w_s^2)/(d\mu_t))$ will eventually dominate the radius of the confidence region, since we choose $\mu_t$ proportional to $\lambda_t^2$. This occurs because there is no guarantee that the algorithm will persistently select actions $A_t$ that span the entire space. With this in mind, we consider an increasing regularization factor of the form $\lambda_t = \gamma^{-t}\lambda$, where $\lambda > 0$ is a hyperparameter.

Note that, due to the scale-invariance property of the weighted least-squares estimator, we can equivalently consider that at time $t$ we are given time-dependent weights $w_{t,s} = \gamma^{t-s}$, for $1 \le s \le t$, and that $\hat{\theta}_t$ is defined as
$$\arg\min_{\theta \in \mathbb{R}^d} \left(\sum_{s=1}^t \gamma^{t-s}(X_s - \langle A_s, \theta \rangle)^2 + \lambda \|\theta\|_2^2\right).$$
For numerical stability reasons, this form is preferable and is used in the statement of Algorithm 1. In the analysis of Section 3.2, however, we revert to the standard form of the weights, which is required to apply the concentration result of Section 2. We are now ready to describe D-LinUCB in Algorithm 1.

Algorithm 1: D-LinUCB
Input: probability $\delta$, subgaussianity constant $\sigma$, dimension $d$, regularization $\lambda$, upper bound for actions $L$, upper bound for parameters $S$, discount factor $\gamma$.
Initialization: $b = 0_{\mathbb{R}^d}$, $V = \lambda I_d$, $\widetilde{V} = \lambda I_d$, $\hat{\theta} = 0_{\mathbb{R}^d}$.
For $t \ge 1$:
  Receive $\mathcal{A}_t$ and compute
  $$\beta_{t-1} = \sqrt{\lambda} S + \sigma \sqrt{2 \log(1/\delta) + d \log\left(1 + \frac{L^2 (1 - \gamma^{2(t-1)})}{\lambda d (1 - \gamma^2)}\right)}.$$
  For each $a \in \mathcal{A}_t$, compute $\mathrm{UCB}(a) = a^\top \hat{\theta} + \beta_{t-1} \sqrt{a^\top V^{-1} \widetilde{V} V^{-1} a}$.
  Play $A_t = \arg\max_a \mathrm{UCB}(a)$ and receive the reward $X_t$.
  Updating phase: $V = \gamma V + A_t A_t^\top + (1-\gamma)\lambda I_d$, $\widetilde{V} = \gamma^2 \widetilde{V} + A_t A_t^\top + (1-\gamma^2)\lambda I_d$, $b = \gamma b + X_t A_t$, $\hat{\theta} = V^{-1} b$.
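A self-contained Python sketch of Algorithm 1 is given below; it plugs into the simulation loop shown in Section 1.1. The class interface is our own choice, and V is re-solved at every round rather than maintained through incremental inverse updates, which keeps the sketch short at the cost of an O(d^3) per-round term.

import numpy as np

class DLinUCB:
    # Sketch of Algorithm 1 (D-LinUCB) with fully recursive updates.

    def __init__(self, d, delta=0.05, sigma=0.1, lam=1.0, L=1.0, S=1.0,
                 gamma=0.999):
        self.d, self.delta, self.sigma = d, delta, sigma
        self.lam, self.L, self.S, self.gamma = lam, L, S, gamma
        self.t = 1                        # current round
        self.b = np.zeros(d)              # discounted sum of X_s * A_s
        self.V = lam * np.eye(d)          # discounted design matrix V
        self.Vt = lam * np.eye(d)         # squared-weight analogue, V tilde
        self.theta = np.zeros(d)          # current estimate theta hat

    def select(self, actions):
        g2 = self.gamma ** 2
        beta = self.S * np.sqrt(self.lam) + self.sigma * np.sqrt(
            2 * np.log(1 / self.delta)
            + self.d * np.log(1 + self.L ** 2 * (1 - g2 ** (self.t - 1))
                              / (self.lam * self.d * (1 - g2))))
        V_inv = np.linalg.inv(self.V)
        M = V_inv @ self.Vt @ V_inv       # matrix of the confidence ellipsoid
        q = np.einsum('kd,de,ke->k', actions, M, actions)
        ucb = actions @ self.theta + beta * np.sqrt(np.maximum(q, 0.0))
        return int(np.argmax(ucb))

    def update(self, a, x):
        g, g2 = self.gamma, self.gamma ** 2
        self.V = g * self.V + np.outer(a, a) + (1 - g) * self.lam * np.eye(self.d)
        self.Vt = g2 * self.Vt + np.outer(a, a) + (1 - g2) * self.lam * np.eye(self.d)
        self.b = g * self.b + x * a
        self.theta = np.linalg.solve(self.V, self.b)
        self.t += 1

As claimed in the introduction, the updates require no storage of past actions: only $V$, $\widetilde{V}$, $b$ and the round counter are kept in memory.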
3.2 Analysis

As discussed previously, we consider weights of the form $w_t = \gamma^{-t}$ (where $0 < \gamma < 1$) in the D-LinUCB algorithm. In accordance with the discussion at the end of Section 2, Algorithm 1 uses $\mu_t = \gamma^{-2t}\lambda$ as the parameter defining the confidence ellipsoid around $\hat{\theta}_{t-1}$. The confidence ellipsoid $\mathcal{C}_t$ is defined as $\{\theta : \|\theta - \hat{\theta}_{t-1}\|_{V_{t-1} \widetilde{V}_{t-1}^{-1} V_{t-1}} \le \beta_{t-1}\}$, where
$$\beta_t = \sqrt{\lambda} S + \sigma \sqrt{2 \log(1/\delta) + d \log\left(1 + \frac{L^2(1 - \gamma^{2t})}{\lambda d (1 - \gamma^2)}\right)}. \qquad (5)$$

Using standard algebraic calculations together with the remark above about scale-invariance, it is easily checked that at time $t$ Algorithm 1 selects the action $A_t$ that maximizes $\langle a, \theta \rangle$ for $a \in \mathcal{A}_t$ and $\theta \in \mathcal{C}_t$. The following theorem bounds the regret resulting from Algorithm 1.

Theorem 2. Assuming that $\sum_{s=1}^{T-1} \|\theta^\star_s - \theta^\star_{s+1}\|_2 \le B_T$, the regret of the D-LinUCB algorithm is bounded for all $\gamma \in (0, 1)$ and integer $D \ge 1$, with probability at least $1 - \delta$, by
$$R_T \le 2LDB_T + \frac{4L^3 S}{\lambda} \frac{\gamma^D}{1-\gamma} T + 2\sqrt{2}\,\beta_T \sqrt{dT} \sqrt{T \log(1/\gamma) + \log\left(1 + \frac{L^2}{d\lambda(1-\gamma)}\right)}. \qquad (6)$$

The first two terms of the r.h.s. of (6) result from the bias due to the non-stationary environment. The last term is the consequence of the high-probability bound established in the previous section and of an adaptation of the technique used in [1].

We give the complete proof of this result in the appendix. The high-level idea of the proof is to isolate bias and variance terms. However, in contrast with the stationary case, the confidence ellipsoid $\mathcal{C}_t$ does not necessarily contain (with high probability) the actual parameter value $\theta^\star_t$, due to the (unknown) bias arising from the time variations of the parameter. We thus define
$$\bar{\theta}_t = V_{t-1}^{-1} \left(\sum_{s=1}^{t-1} \gamma^{-s} A_s A_s^\top \theta^\star_s + \lambda \gamma^{-(t-1)} \theta^\star_t\right),$$
which is an action-dependent analogue of the parameter value $\theta^\star$ in the stationary setting (although it is a random quantity). As mentioned in Section 2, $\bar{\theta}_t$ does belong to $\mathcal{C}_t$ with probability at least $1 - \delta$ (see Proposition 3 in the appendix). The regret may then be split as
$$R_T \le 2L \sum_{t=1}^T \|\theta^\star_t - \bar{\theta}_t\|_2 + \sum_{t=1}^T \langle A_t, \theta_t - \bar{\theta}_t \rangle \quad \text{(with probability at least $1 - \delta$)},$$
where $(A_t, \theta_t) = \arg\max_{(a \in \mathcal{A}_t, \theta \in \mathcal{C}_t)} \langle a, \theta \rangle$. The rightmost term can be handled by proceeding as in the case of stationary linear bandits, thanks to the deviation inequality obtained in Section 2. The first term in the r.h.s. can be bounded deterministically, from the assumption made on $\sum_{s=1}^{T-1} \|\theta^\star_s - \theta^\star_{s+1}\|_2$. In doing so, we introduce the analysis parameter $D$ that, roughly speaking, corresponds to the window length equivalent to a particular choice of discount factor $\gamma$: the bias resulting from observations that are less than $D$ time steps apart may be bounded in terms of $D$, while the remaining ones are bounded globally by the second term of the r.h.s. of (6). This sketch of proof is substantially different from the arguments used by [11] to analyze their sliding-window algorithm (called SW-LinUCB). We refer to the appendix for a more detailed analysis of these differences. Interestingly, the regret bound of Theorem 2 holds despite the fact that the true parameter $\theta^\star_t$ may not be contained in the confidence ellipsoid $\mathcal{C}_{t-1}$, in contrast to the proof of [14].
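To make the trade-off in (6) tangible, the helper below (ours, for illustration) evaluates the three terms of the bound for given $\gamma$ and $D$; it simply restates (6) and (5) numerically.

import numpy as np

def regret_bound(T, d, B_T, gamma, D, lam=1.0, L=1.0, S=1.0,
                 sigma=0.1, delta=0.05):
    # Evaluate the r.h.s. of (6) term by term.
    g2 = gamma ** 2
    beta_T = np.sqrt(lam) * S + sigma * np.sqrt(
        2 * np.log(1 / delta)
        + d * np.log(1 + L ** 2 * (1 - g2 ** T) / (lam * d * (1 - g2))))
    bias_recent = 2 * L * D * B_T
    bias_old = 4 * L ** 3 * S / lam * gamma ** D / (1 - gamma) * T
    variance = 2 * np.sqrt(2) * beta_T * np.sqrt(d * T) * np.sqrt(
        T * np.log(1 / gamma) + np.log(1 + L ** 2 / (d * lam * (1 - gamma))))
    return bias_recent + bias_old + variance

The first term grows linearly in $D$ while the second decays as $\gamma^D$, which is why the choice of $D$ discussed next balances them.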
It can be checked that, as $T$ tends to infinity, the optimal choice of the analysis parameter $D$ is to take $D = \log(T)/(1-\gamma)$. Further assuming that one may tune $\gamma$ as a function of the horizon $T$ and the variation upper bound $B_T$ yields the following result.

Corollary 1. By choosing $\gamma = 1 - (B_T/(dT))^{2/3}$, the regret of the D-LinUCB algorithm is asymptotically upper bounded with high probability by a term $O(d^{2/3} B_T^{1/3} T^{2/3})$ when $T \to \infty$.

This result is favorable as it corresponds to the same order as the lower bound established by [4]. More precisely, the case investigated by [4] corresponds to a non-contextual model with a number of changes that grows with the horizon. On the other hand, the guarantee of Corollary 1 requires horizon-dependent tuning of the discount factor $\gamma$, which opens interesting research issues (see also [11]).
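For intuition, here is an informal check of the rate (our computation, with constants and lower-order terms dropped; the appendix contains the rigorous argument). With $D = \log(T)/(1-\gamma)$, the middle term of (6) satisfies $\gamma^D T \approx e^{-\log T}\, T = 1$, so it is $O((1-\gamma)^{-1})$ and negligible. For the two remaining terms, using $\log(1/\gamma) \sim 1-\gamma$ and $1-\gamma = (B_T/(dT))^{2/3}$,
$$2LDB_T = O\left(\frac{B_T \log T}{1-\gamma}\right) = O\left(d^{2/3} B_T^{1/3} T^{2/3} \log T\right),$$
$$2\sqrt{2}\,\beta_T \sqrt{dT}\sqrt{T \log(1/\gamma)} = O\left(\beta_T \sqrt{d}\, T \sqrt{1-\gamma}\right) = O\left(d^{2/3} B_T^{1/3} T^{2/3}\right),$$
where the last equality uses $\sqrt{1-\gamma} = (B_T/(dT))^{1/3}$ and $\beta_T = O(\sqrt{d})$ up to logarithmic factors. Both dominant terms are thus of order $d^{2/3} B_T^{1/3} T^{2/3}$, as stated.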
4 Experiments

This section is devoted to the evaluation of the empirical performance of D-LinUCB. We first consider two simulated low-dimensional environments that illustrate the behavior of the algorithms when confronted with either abrupt changes or slow variations of the parameters. The analysis of the previous section suggests that D-LinUCB should behave properly in both situations. We then consider a more realistic scenario in Section 4.2, where the contexts are high-dimensional and extracted from a dataset of actual user interactions with a web service.

For benchmarking purposes, we compare D-LinUCB to the Dynamic Linear Upper Confidence Bound (dLinUCB) algorithm proposed by [29] and to the Sliding Window Linear UCB (SW-LinUCB) of [11]. The principle of the dLinUCB algorithm is that a master bandit algorithm is in charge of choosing the best LinUCB slave bandit for making the recommendation. Each slave model is built to run in one of the different environments. The choice of the slave model is based on a lower confidence bound for the so-called badness of the different models. The badness is defined as the number of times the expected reward was found to be far enough from the actual observed reward over the last $\tau$ steps, where $\tau$ is a parameter of the algorithm. When a slave is chosen, the action proposed to a user is the result of the LinUCB algorithm associated with this slave. When the action is made, all the slave models that were good enough are updated and the models whose badness was too high are deleted from the pool of slave models. If none of the slaves was found to be sufficiently good, a new slave is added to the pool.

The other algorithm that we use for comparison is SW-LinUCB, as presented in [11]. Rather than using exponentially increasing weights, a hard threshold is adopted: only the actions and rewards included in the sliding window of length $l$ are used to estimate the linear regression coefficients. We expect D-LinUCB and SW-LinUCB to behave similarly, as both may be shown to have the same sort of regret guarantees (see the appendix).

In the case of abrupt changes, we also compare these algorithms to the Oracle Restart LinUCB (LinUCB-OR) strategy, which knows the change-points and simply restarts, after each change, a new instance of the LinUCB algorithm. The regret of this strategy may be seen as an empirical lower bound on the optimal behavior of an online learning algorithm in abruptly-changing environments.

In the following figures, the vertical red dashed lines correspond to the change-points (in abrupt-change scenarios). They are represented to ease the understanding but, except for LinUCB-OR, they are of course unknown to the learning algorithms. When applicable, the blue dashed lines correspond to the average detection time of the breakpoints with the dLinUCB algorithm. For D-LinUCB, the discount parameter is chosen as $\gamma = 1 - (B_T/(dT))^{2/3}$. For SW-LinUCB, the window length is set to $l = (dT/B_T)^{2/3}$, where $d = 2$ in the experiment. These are the values that minimize the asymptotic regret bounds. For the Dynamic Linear UCB algorithm, the badness is estimated from $\tau = 200$ steps, as in the experimental section of [29].

4.1 Synthetic data in abruptly-changing or slowly-varying scenarios

Figure 1: Performance of the algorithms in the abruptly-changing environment (left) and in the slowly-varying environment (right). The upper plots show the estimated parameter and the lower ones the accumulated regret, both averaged over N = 100 independent experiments.

In this first experiment, we observe the empirical performance of all algorithms in an abruptly-changing environment of dimension 2 with 3 breakpoints. The number of rounds is set to $T = 6000$. The light blue triangles correspond to the successive positions of the true unknown parameter $\theta^\star_t$: before $t = 1000$, $\theta^\star_t = (1, 0)$; for $t \in [1000, 2000]$, $\theta^\star_t = (-1, 0)$; for $t \in [2000, 3000]$, $\theta^\star_t = (0, 1)$; and, finally, for $t > 3000$, $\theta^\star_t = (0, -1)$. This corresponds to a hard problem, as the sequence of parameters is widely spread in the unit ball. Indeed, it forces the algorithms to adapt to big changes, which typically requires a longer adaptation phase. On the other hand, it makes the detection of changes easier, which is an advantage for dLinUCB. In the second half of the experiment (when $t \ge 3000$) there is no change; LinUCB struggles to catch up and suffers linear regret for long periods after the last change-point. The results of our simulations are shown in the left column of Figure 1. The top row shows a 2-dimensional scatter plot of the estimate of the unknown parameter $\hat{\theta}_t$ every 1000 steps, averaged over 100 independent experiments. The bottom row corresponds to the regret averaged over 100 independent experiments with the upper and lower 5% quantiles. In this environment, with 1-subgaussian random noise, dLinUCB struggles to detect the change-points. Over the 100 experiments, the first change-point was detected in 95% of the runs, the second was never detected, and the third only in 6% of the runs, thus limiting the effectiveness of the dLinUCB approach. When decreasing the variance of the noise, the performance of dLinUCB improves and gets closer to the performance of the oracle restart strategy LinUCB-OR. It is worth noting that for both SW-LinUCB and D-LinUCB, the estimator $\hat{\theta}_t$ adapts itself to the non-stationarity and is able to follow $\theta^\star_t$ (with some delay), as shown on the scatter plot. Predictably, LinUCB-OR achieves the best performance by restarting exactly whenever a change-point happens.
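For reproducibility, here is a sketch (ours) of this piecewise-constant parameter path and of the resulting tunings; the numerical value of $B_T$ follows from the three jumps listed above.

import numpy as np

# Piecewise-constant parameter path of the first experiment (T = 6000, d = 2).
T, d = 6000, 2
segments = [(0, (1, 0)), (1000, (-1, 0)), (2000, (0, 1)), (3000, (0, -1))]
theta_path = np.zeros((T, d))
for start, value in segments:
    theta_path[start:] = value

# Variation budget and the theoretically motivated tunings.
B_T = np.sum(np.linalg.norm(np.diff(theta_path, axis=0), axis=1))
gamma = 1 - (B_T / (d * T)) ** (2 / 3)      # discount factor for D-LinUCB
window = int((d * T / B_T) ** (2 / 3))      # window length for SW-LinUCB
print(B_T, gamma, window)                   # B_T = 2 + sqrt(2) + 2, about 5.41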
The second experiment corresponds to a slowly-changing environment. It is easier for LinUCB to keep up with the adaptive policies in this scenario. Here, the parameter $\theta^\star_t$ starts at $(1, 0)$ and moves continuously counter-clockwise on the unit circle up to the position $(0, 1)$ in 3000 steps. We then have a steady period of 3000 steps. For this sequence of parameters, $B_T = \sum_{t=1}^{T-1} \|\theta^\star_t - \theta^\star_{t+1}\|_2 = 1.57$. The results are reported in the right column of Figure 1. Unsurprisingly, dLinUCB does not detect any change and thus displays the same performance as LinUCB. SW-LinUCB and D-LinUCB behave similarly and are both robust to such an evolution of the regression parameters. The performance of LinUCB-OR is not reported here, as restarting becomes ineffective when the changes are too frequent (here, during the first 3000 time steps, there is a change at every single step). The scatter plot also gives interesting information: $\hat{\theta}_t$ tracks $\theta^\star_t$ quite effectively for both SW-LinUCB and D-LinUCB, but the two other algorithms lag behind. LinUCB would eventually catch up if the length of the stationary period became larger.
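A sketch (ours) of this slowly-varying path, checking that its variation budget matches the value reported above:

import numpy as np

# Quarter-circle drift over 3000 steps, then a 3000-step steady period.
T = 6000
angles = np.concatenate([np.linspace(0, np.pi / 2, 3000),
                         np.full(3000, np.pi / 2)])
theta_path = np.stack([np.cos(angles), np.sin(angles)], axis=1)

B_T = np.sum(np.linalg.norm(np.diff(theta_path, axis=0), axis=1))
print(round(B_T, 2))  # ~1.57, the arc length pi/2 of the quarter circle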
4.2 Simulation based on a real dataset

Figure 2: Behavior of the different algorithms on large-dimensional data.

D-LinUCB also performs well in a high-dimensional space ($d = 50$). For this experiment, a dataset providing a sample of 30 days of Criteo live traffic data [13] was used. It contains banners that were displayed to different users and contextual variables, including the information of whether the banner was clicked or not. We kept the categorical variables cat1 to cat9, together with the variable campaign, which is a unique identifier of each campaign. Beforehand, these contexts have been one-hot encoded and 50 of the resulting features have been selected using a singular value decomposition. $\theta^\star$ is obtained by linear regression. The rewards are then simulated using the regression model with an additional Gaussian noise of variance $\sigma^2 = 0.15$. At each time step, the different algorithms choose between two 50-dimensional contexts drawn at random from two separate pools of 10000 contexts corresponding, respectively, to clicked and non-clicked banners. The non-stationarity is created by switching 60% of the coordinates of $\theta^\star$ to $-\theta^\star$ at time 4000, corresponding to a partial class inversion. The cumulative dynamic regret is then averaged over 100 independent replications. The results are shown in Figure 2. In the first stationary period, LinUCB and dLinUCB perform better than the adaptive policies by using all available data, whereas the adaptive policies only use the most recent events. After the breakpoint, LinUCB suffers a large regret, as the algorithm fails to adapt to the new environment. In this experiment, dLinUCB does not detect the change-point systematically and performs similarly to LinUCB on average; it can still outperform the adaptive policies from time to time when the breakpoint is detected, as can be seen from the 5% quantile. D-LinUCB and SW-LinUCB adapt more quickly to the change-point and perform significantly better than the non-adaptive policies after the breakpoint. Of course, the oracle policy LinUCB-OR is the best-performing policy. The take-away message is that there is no free lunch: in a stationary period, by using only the most recent events, SW-LinUCB and D-LinUCB do not perform as well as a policy that uses all the available information. Nevertheless, after a breakpoint, the recovery is much faster with the adaptive policies.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[3] P. Auer, P. Gajane, and R. Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning 14, 2018.

[4] O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.

[5] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.

[6] O. Besbes, Y. Gur, and A. Zeevi. Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Available at SSRN 2436629, 2018.

[7] L. Besson and E. Kaufmann. The generalized likelihood ratio test meets klUCB: an improved algorithm for piece-wise non-stationary bandits. arXiv preprint arXiv:1902.01575, 2019.

[8] K. Bleakley and J.-P. Vert. The group fused Lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199, 2011.

[9] Y. Cao, W. Zheng, B. Kveton, and Y. Xie. Nearly optimal adaptive procedure for piecewise-stationary bandit: a change-point detection approach. arXiv preprint arXiv:1802.03692, 2018.

[10] Y. Chen, C.-W. Lee, H. Luo, and C.-Y. Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. arXiv preprint arXiv:1902.00980, 2019.

[11] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Learning to optimize under non-stationarity. arXiv preprint arXiv:1810.03024, 2018.

[12] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Hedging the drift: Learning to optimize under non-stationarity. arXiv preprint arXiv:1903.01461, 2019.

[13] E. Diemert, J. Meynet, P. Galland, and D. Lefortier. Attribution modeling increases efficiency of bidding in display advertising. In Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August 14, 2017. ACM, 2017.

[14] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, 2011.

[15] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.

[16] N. Gupta, O.-C. Granmo, and A. Agrawala. Thompson sampling for dynamic multi-armed bandits. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1. IEEE, 2011.

[17] N. B. Keskin and A. Zeevi. Chasing demand: Learning and earning in a changing environment. Mathematics of Operations Research, 42(2):277–307, 2017.

[18] J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. arXiv preprint arXiv:1801.09667, 2018.

[19] L. Kocsis and C. Szepesvári. Discounted UCB. In 2nd PASCAL Challenges Workshop, 2006.

[20] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.

[21] N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3074–3083, 2017.

[22] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
[23] H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. arXiv preprint arXiv:1708.01799, 2017.

[24] Y. Mintz, A. Aswani, P. Kaminsky, E. Flowers, and Y. Fukuoka. Non-stationary bandits with habituation and recovery dynamics. arXiv preprint arXiv:1707.08423, 2017.

[25] V. H. Peña, T. L. Lai, and Q.-M. Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.

[26] V. Raj and S. Kalyani. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727, 2017.

[27] J. Seznec, A. Locatelli, A. Carpentier, A. Lazaric, and M. Valko. Rotting bandits are no harder than stochastic ones. arXiv preprint arXiv:1811.11043, 2018.

[28] L. Wei and V. Srivastava. On abruptly-changing and slowly-varying multiarmed bandit problems. In 2018 Annual American Control Conference (ACC), pages 6291–6296. IEEE, 2018.

[29] Q. Wu, N. Iyer, and H. Wang. Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, pages 495–504, New York, NY, USA, 2018. ACM.

[30] J. Y. Yu and S. Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1177–1184. ACM, 2009.
", "award": [], "sourceid": 6485, "authors": [{"given_name": "Yoan", "family_name": "Russac", "institution": "Ecole Normale Supérieure"}, {"given_name": "Claire", "family_name": "Vernade", "institution": "Google DeepMind"}, {"given_name": "Olivier", "family_name": "Cappé", "institution": "CNRS"}]}