{"title": "Parametric Bandits: The Generalized Linear Case", "book": "Advances in Neural Information Processing Systems", "page_first": 586, "page_last": 594, "abstract": "We consider structured multi-armed bandit tasks in which the agent is guided by prior structural knowledge that can be exploited to efficiently select the optimal arm(s) in situations where the number of arms is large, or even infinite. We propose a new optimistic, UCB-like, algorithm for non-linearly parameterized bandit problems using the Generalized Linear Model (GLM) framework. We analyze the regret of the proposed algorithm, termed GLM-UCB, obtaining results similar to those recently proved in the literature for the linear regression case. The analysis also highlights a key difficulty of the non-linear case which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual efficiency of current parameterized bandit algorithms is often disappointing in practice, we provide an asymptotic argument leading to significantly faster convergence. Simulation studies on real data sets illustrate the performance and the robustness of the proposed GLM-UCB approach.", "full_text": "Parametric Bandits: The Generalized Linear Case\n\nSarah Filippi\nLTCI, Telecom ParisTech et CNRS\nParis, France\nfilippi@telecom-paristech.fr\n\nOlivier Capp\u00e9\nLTCI, Telecom ParisTech et CNRS\nParis, France\ncappe@telecom-paristech.fr\n\nAur\u00e9lien Garivier\nLTCI, Telecom ParisTech et CNRS\nParis, France\ngarivier@telecom-paristech.fr\n\nCsaba Szepesv\u00e1ri\nRLAI Laboratory, University of Alberta\nEdmonton, Canada\nszepesva@ualberta.ca\n\nAbstract\n\nWe consider structured multi-armed bandit problems based on the Generalized\nLinear Model (GLM) framework of statistics. For these bandits, we propose a new\nalgorithm, called GLM-UCB. 
We derive finite-time, high-probability bounds on\nthe regret of the algorithm, extending previous analyses developed for the linear\nbandits to the non-linear case. The analysis highlights a key difficulty in generalizing\nlinear bandit algorithms to the non-linear case, which is solved in GLM-UCB\nby focusing on the reward space rather than on the parameter space. Moreover, as\nthe actual effectiveness of current parameterized bandit algorithms is often poor in\npractice, we provide a tuning method based on asymptotic arguments, which leads\nto significantly better practical performance. We present two numerical experiments\non real-world data that illustrate the potential of the GLM-UCB approach.\nKeywords: multi-armed bandit, parametric bandits, generalized linear models,\nUCB, regret minimization.\n\n1 Introduction\n\nIn the classical K-armed bandit problem, an agent selects at each time step one of the K arms and\nreceives a reward that depends on the chosen action. The aim of the agent is to choose the sequence\nof arms to be played so as to maximize the cumulated reward. There is a fundamental trade-off\nbetween gathering experimental data about the reward distribution (exploration) and exploiting the\narm which seems to be the most promising.\n\nIn the basic multi-armed bandit problem, also called the independent bandits problem, the\nrewards are assumed to be random and distributed independently according to a probability\ndistribution that is specific to each arm (see [1, 2, 3, 4] and references therein). Recently, structured\nbandit problems in which the distributions of the rewards pertaining to each arm are connected\nby a common unknown parameter have received much attention [5, 6, 7, 8, 9]. This model is\nmotivated by the many practical applications where the number of arms is large, but the payoffs are\ninterrelated. Up to now, two different models have been studied in the literature along these lines. 
In\none model, at each time step, side information, or context, is first given to the agent. The payoffs\nof the arms depend both on this side information and the index of the arm. Thus the optimal arm\nchanges with the context [5, 6, 9]. In the second, simpler model, which we are also interested in here,\nthere is no side information, but the agent is given a model that describes the possible relations\nbetween the arms\u2019 payoffs. In particular, in \u201clinear bandits\u201d [10, 8, 11, 12], each arm a \u2208 A is\nassociated with some d-dimensional vector ma \u2208 Rd known to the agent. The expected payoffs\nof the arms are given by the inner product of their associated vector and some fixed, but initially\nunknown parameter vector \u03b8\u2217. Thus, the expected payoff of arm a is $m_a'\\theta^*$, which is linear in \u03b8\u2217.\u00b9\nIn this article, we study a richer generalized linear model (GLM) in which the expectation of\nthe reward conditional on the action a is given by $\\mu(m_a'\\theta^*)$, where \u00b5 is a real-valued, non-linear\nfunction called the (inverse) link function. This generalization makes it possible to consider a wider class\nof problems, and in particular cases where the rewards are counts or binary variables using,\nrespectively, Poisson or logistic regression. Obviously, this situation is very common in the fields of\nmarketing, social networking, web-mining (see the example of Section 5.2 below) or clinical studies.\n\nOur first contribution is an \u201coptimistic\u201d algorithm, termed GLM-UCB, inspired by the Upper Confidence\nBound (UCB) approach [2]. GLM-UCB generalizes the algorithms studied by [10, 8, 12].\nOur next contribution is a set of finite-time bounds on the statistical performance of this algorithm. In\nparticular, we show that the performance depends on the dimension of the parameter but not on the\nnumber of arms, a result that was previously known in the linear case. 
Interestingly, the GLM-UCB\napproach takes advantage of the particular structure of the parameter estimate of generalized linear\nmodels and operates only in the reward space. In contrast, the parameter-space confidence region\napproach adopted by [8, 12] appears to be harder to generalize to non-linear regression models.\nOur second contribution is a tuning method based on asymptotic arguments. This contribution\naddresses the poor empirical performance of the current algorithms that we have observed for small\nor moderate sample-sizes when these algorithms are tuned based on finite-sample bounds.\n\nThe paper is organized as follows. The generalized linear bandit model is presented in Section 2,\ntogether with a brief survey of needed statistical results. Section 3 is devoted to the description\nof the GLM-UCB algorithm, which is compared to related approaches. Section 4 presents our\nregret bounds, as well as a discussion, based on asymptotic arguments, on the optimal tuning of the\nmethod. Section 5 reports the results of two experiments on real data sets.\n\n2 Generalized Linear Bandits, Generalized Linear Models\n\nWe consider a structured bandit model with a finite, but possibly very large, number of arms. At\neach time t, the agent chooses an arm At from the set A (we shall denote the cardinality of A by K).\nThe prior knowledge available to the agent consists of a collection of vectors {ma}a\u2208A of features\nwhich are specific to each arm and a so-called (inverse) link function \u00b5 : R \u2192 R.\n\nThe generalized linear bandit model investigated in this work is based on the assumption that\nthe payoff Rt received at time t is conditionally independent of the past payoffs and choices and it\nsatisfies\n\n$$\\mathbb{E}[R_t \\mid A_t] = \\mu(m_{A_t}'\\theta^*), \\qquad (1)$$\n\nfor some unknown parameter vector \u03b8\u2217 \u2208 Rd. This framework generalizes the linear bandit model\nconsidered by [10, 8, 12]. 
Just as the linear bandit model builds on linear regression, our model\ncapitalizes on the well-known statistical framework of Generalized Linear Models (GLMs). The\nadvantage of this framework is that it makes it possible to address various specific reward structures widely\nfound in applications. For example, when rewards are binary-valued, a suitable choice of \u00b5 is\n\u00b5(x) = exp(x)/(1 + exp(x)), leading to the logistic regression model. For integer-valued rewards,\nthe choice \u00b5(x) = exp(x) leads to the Poisson regression model. This can be easily extended to the\ncase of multinomial (or polytomic) logistic regression, which is appropriate to model situations in\nwhich the rewards are associated with categorical variables.\n\nTo keep this article self-contained, we briefly review the main properties of GLMs [13]. A\nunivariate probability distribution is said to belong to a canonical exponential family if its density\nwith respect to a reference measure is given by\n\n$$p_\\beta(r) = \\exp\\left( r\\beta - b(\\beta) + c(r) \\right), \\qquad (2)$$\n\nwhere \u03b2 is a real parameter, c(\u00b7) is a real function and the function b(\u00b7) is assumed to be twice\ncontinuously differentiable. This family contains the Gaussian and Gamma distributions when the\nreference measure is the Lebesgue measure and the Poisson and Bernoulli distributions when the\nreference measure is the counting measure on the integers. For a random variable R with density\ndefined in (2), E(R) = \u02d9b(\u03b2) and Var(R) = \u00a8b(\u03b2), where \u02d9b and \u00a8b denote, respectively, the first and\nsecond derivatives of b. In addition, \u00a8b(\u03b2) can also be shown to be equal to the Fisher information\nfor the parameter \u03b2. The function b is thus strictly convex.\n\n\u00b9 Throughout the paper we use the prime to denote transposition.\n\nNow, assume that, in addition to the response variable R, we have at hand a vector of covariates\nX \u2208 Rd. 
The canonical GLM associated with (2) postulates that $p_\\theta(r \\mid x) = p_{x'\\theta}(r)$, where \u03b8 \u2208 Rd\nis a vector of parameters. Denote by \u00b5 = \u02d9b the so-called inverse link function. From the properties\nof b, we know that \u00b5 is continuously differentiable, strictly increasing, and thus one-to-one. The\nmaximum likelihood estimator \u02c6\u03b8t, based on observations (R1, X1), . . . , (Rt\u22121, Xt\u22121), is defined as\nthe maximizer of the function\n\n$$\\sum_{k=1}^{t-1} \\log p_\\theta(R_k \\mid X_k) = \\sum_{k=1}^{t-1} \\left[ R_k X_k'\\theta - b(X_k'\\theta) + c(R_k) \\right],$$\n\na strictly concave function in \u03b8.\u00b2 Upon differentiating, we obtain that \u02c6\u03b8t is the unique solution of\nthe following estimating equation\n\n$$\\sum_{k=1}^{t-1} \\left( R_k - \\mu(X_k'\\theta) \\right) X_k = 0, \\qquad (3)$$\n\nwhere we have used the fact that \u00b5 = \u02d9b. In practice, the solution of (3) may be found efficiently\nusing, for instance, Newton\u2019s algorithm.\n\nA semi-parametric version of the above model is obtained by assuming only that E\u03b8[R|X] =\n\u00b5(X \u2032\u03b8) without (much) further assumptions on the conditional distribution of R given X. In this\ncase, the estimator obtained by solving (3) is referred to as the maximum quasi-likelihood estimator.\nIt is a remarkable fact that this estimator is consistent under very general assumptions as long as the\ndesign matrix $\\sum_{k=1}^{t-1} X_k X_k'$ tends to infinity [14]. As we will see, this matrix also plays a crucial\nrole in the algorithm that we propose for bandit optimization in the generalized linear bandit model.\n\n3 The GLM-UCB Algorithm\n\nAccording to (1), the agent receives, upon playing arm a, a random reward whose expected value is\n$\\mu(m_a'\\theta^*)$, where \u03b8\u2217 \u2208 \u0398 is the unknown parameter. The parameter set \u0398 is an arbitrary closed subset\nof Rd. Any arm with largest expected reward is called optimal. 
The aim of the agent is to quickly find\nan optimal arm in order to maximize the received rewards. The greedy action $\\arg\\max_{a \\in A} \\mu(m_a'\\hat{\\theta}_t)$\nmay lead to an unreliable algorithm which does not sufficiently explore to guarantee the selection of\nan optimal arm. This issue can be addressed by resorting to an \u201coptimistic approach\u201d. As described\nby [8, 12] in the linear case, an optimistic algorithm consists in selecting, at time t, the arm\n\n$$A_t = \\arg\\max_{a} \\; \\max_{\\theta} \\; \\mathbb{E}_\\theta[R_t \\mid A_t = a] \\quad \\text{s.t.} \\quad \\|\\theta - \\hat{\\theta}_t\\|_{M_t} \\le \\rho(t), \\qquad (4)$$\n\nwhere \u03c1 is an appropriate, \u201cslowly increasing\u201d function,\n\n$$M_t = \\sum_{k=1}^{t-1} m_{A_k} m_{A_k}' \\qquad (5)$$\n\nis the design matrix corresponding to the first t \u2212 1 timesteps and $\\|v\\|_M = \\sqrt{v'Mv}$ denotes the\nmatrix norm induced by the positive semidefinite matrix M. The region $\\|\\theta - \\hat{\\theta}_t\\|_{M_t} \\le \\rho(t)$ is\na confidence ellipsoid around the estimated parameter \u02c6\u03b8t. Generalizing this approach beyond the\ncase of linear link functions looks challenging. In particular, in GLMs, the relevant confidence\nregions may have a more complicated geometry in the parameter space than simple ellipsoids. As\na consequence, the benefit of this form of optimistic algorithm appears dubious.\u00b3\n\n\u00b2 Here, and in what follows, log denotes the natural logarithm.\n\u00b3 Note that maximizing $\\mu(m_a'\\theta)$ over a convex confidence region is equivalent to maximizing $m_a'\\theta$ over the\nsame region since \u00b5 is strictly increasing. 
Thus, computationally, this approach is not more difficult than it is\nfor the linear case.\n\nAn alternative approach consists in directly determining an upper confidence bound for the\nexpected reward of each arm, thus choosing the action a that maximizes\n\n$$\\mathbb{E}_{\\hat{\\theta}_t}[R_t \\mid A_t = a] + \\rho(t)\\,\\|m_a\\|_{M_t^{-1}}.$$\n\nIn the linear case the two approaches lead to the same solution [12]. Interestingly, for non-linear\nbandits, the second approach looks more appropriate.\n\nIn the rest of this section, we apply this second approach to the GLM bandit model defined in (1).\nAccording to (3), the maximum quasi-likelihood estimator of the parameter in the GLM is the\nunique solution of the estimating equation\n\n$$\\sum_{k=1}^{t-1} \\left( R_k - \\mu(m_{A_k}'\\hat{\\theta}_t) \\right) m_{A_k} = 0, \\qquad (6)$$\n\nwhere A1, . . . , At\u22121 denote the arms played so far and R1, . . . , Rt\u22121 are the corresponding rewards.\nLet $g_t(\\theta) = \\sum_{k=1}^{t-1} \\mu(m_{A_k}'\\theta)\\, m_{A_k}$ be the invertible function such that the estimated parameter \u02c6\u03b8t\nsatisfies $g_t(\\hat{\\theta}_t) = \\sum_{k=1}^{t-1} R_k m_{A_k}$. Since \u02c6\u03b8t might be outside of the set of admissible parameters \u0398,\nwe \u201cproject it\u201d to \u0398, to obtain \u02dc\u03b8t:\n\n$$\\tilde{\\theta}_t = \\arg\\min_{\\theta \\in \\Theta} \\left\\| g_t(\\theta) - g_t(\\hat{\\theta}_t) \\right\\|_{M_t^{-1}} = \\arg\\min_{\\theta \\in \\Theta} \\left\\| g_t(\\theta) - \\sum_{k=1}^{t-1} R_k m_{A_k} \\right\\|_{M_t^{-1}}. \\qquad (7)$$\n\nNote that if \u02c6\u03b8t \u2208 \u0398 (which is easy to check and which happened to hold always in the examples we\ndealt with) then we can let \u02dc\u03b8t = \u02c6\u03b8t. 
This is important since computing \u02dc\u03b8t is non-trivial and we can\nsave this computation by this simple check. The proposed algorithm, GLM-UCB, is as follows:\n\nAlgorithm 1 GLM-UCB\n1: Input: {ma}a\u2208A\n2: Play actions a1, . . . , ad, receive R1, . . . , Rd.\n3: for t > d do\n4:   Estimate \u02c6\u03b8t according to (6)\n5:   if \u02c6\u03b8t \u2208 \u0398 let \u02dc\u03b8t = \u02c6\u03b8t else compute \u02dc\u03b8t according to (7)\n6:   Play the action $A_t = \\arg\\max_a \\{ \\mu(m_a'\\tilde{\\theta}_t) + \\rho(t)\\,\\|m_a\\|_{M_t^{-1}} \\}$, receive Rt\n7: end for\n\nAt time t, for each arm a, an upper bound $\\mu(m_a'\\tilde{\\theta}_t) + \\beta_t^a$ is computed, where the \u201cexploration\nbonus\u201d $\\beta_t^a = \\rho(t)\\,\\|m_a\\|_{M_t^{-1}}$ is the product of two terms. The quantity \u03c1(t) is a slowly increasing\nfunction; we prove in Section 4 that \u03c1(t) can be set to guarantee high-probability bounds on the\nexpected regret (for the actual form used, see (8)). Note that the leading term of $\\beta_t^a$ is $\\|m_a\\|_{M_t^{-1}}$,\nwhich decreases to zero as t increases.\n\nAs we are mostly interested in the case when the number of arms K is much larger than the\ndimension d, the algorithm is simply initialized by playing actions a1, . . . , ad such that the vectors\nma1 , . . . , mad form a basis of M = span(ma, a \u2208 A). Without loss of generality, here and in what\nfollows we assume that the dimension of M is equal to d. Then, by playing a1, . . . , ad in the first\nd steps the agent ensures that Mt is invertible for all t. 
An alternative strategy would be to initialize\nM0 = \u03bb0I, where I is the d \u00d7 d identity matrix.\n\n3.1 Discussion\n\nThe purpose of this section is to discuss some properties of Algorithm 1, and in particular the\ninterpretation of the role played by $\\|m_a\\|_{M_t^{-1}}$.\n\nGeneralizing UCB The standard UCB algorithm for K arms [2] can be seen as a special case of\nGLM-UCB where the vectors of covariates associated with the arms form an orthogonal system and\n\u00b5(x) = x. Indeed, take d = K, A = {1, . . . , K}, define the vectors {ma}a\u2208A as the canonical basis\n{ea}a\u2208A of Rd, and take \u03b8 \u2208 Rd the vector whose component \u03b8a is the expected reward for arm a.\nThen, Mt is a diagonal matrix whose a-th diagonal element is the number Nt(a) of times the\na-th arm has been played up to time t. Therefore, the exploration bonus in GLM-UCB is given by\n$\\beta_t^a = \\rho(t)/\\sqrt{N_t(a)}$. Moreover, the maximum quasi-likelihood estimator \u02c6\u03b8t satisfies $\\bar{R}_t^a = \\hat{\\theta}_t(a)$\nfor all a \u2208 A, where $\\bar{R}_t^a = \\frac{1}{N_t(a)} \\sum_{k=1}^{t-1} \\mathbb{1}\\{A_k = a\\} R_k$ is the empirical mean of the rewards received\nwhile playing arm a. Algorithm 1 then reduces to the familiar UCB algorithm. In this case, it\nis known that the expected cumulated regret can be controlled upon setting the slowly varying\nfunction \u03c1 to $\\rho(t) = \\sqrt{2 \\log(t)}$, assuming that the range of the rewards is bounded by one [2].\n\nGeneralizing linear bandits Obviously, setting \u00b5(x) = x, we obtain a linear bandit model. 
In\nthis case, assuming that \u0398 = Rd, the algorithm will reduce to those described in the papers [8, 12].\nIn particular, the maximum quasi-likelihood estimator becomes the least-squares estimator and as\nnoted earlier, the algorithm behaves identically to one which chooses the parameter optimistically\nwithin the confidence ellipsoid $\\{\\theta : \\|\\theta - \\hat{\\theta}_t\\|_{M_t} \\le \\rho(t)\\}$.\n\nDependence on the Number of Arms In contrast to an algorithm such as UCB, Algorithm 1\ndoes not require every arm to be played even once.\u2074 To understand this phenomenon, observe that,\nas $M_{t+1} = M_t + m_{A_t} m_{A_t}'$, $\\|m_a\\|_{M_{t+1}^{-1}}^2 = \\|m_a\\|_{M_t^{-1}}^2 - \\left( m_a' M_t^{-1} m_{A_t} \\right)^2 / \\left( 1 + \\|m_{A_t}\\|_{M_t^{-1}}^2 \\right)$ for\nany arm a. Thus the exploration bonus $\\beta_{t+1}^a$ decreases for all arms, except those which are exactly\northogonal to mAt (in the $M_t^{-1}$ metric). The decrease is most significant for arms that are collinear\nto mAt. This explains why the regret bounds obtained in Theorems 1 and 2 below depend on d but\nnot on K.\n\n4 Theoretical analysis\n\nIn this section we first give our finite sample regret bounds and then show how the algorithm can be\ntuned based on asymptotic arguments.\n\n4.1 Regret Bounds\n\nTo quantify the performance of the GLM-UCB algorithm, we consider the cumulated (pseudo)\nregret defined as the expected difference between the optimal reward obtained by always playing\nan optimal arm and the reward received following the algorithm:\n\n$$\\mathrm{Regret}_T = \\sum_{t=1}^{T} \\left[ \\mu(m_{a^*}'\\theta^*) - \\mu(m_{A_t}'\\theta^*) \\right].$$\n\nFor the sake of the analysis, in this section we shall assume that the following assumptions hold:\n\nAssumption 1. 
The link function \u00b5 : R \u2192 R is continuously differentiable, Lipschitz with constant\nk\u00b5 and such that $c_\\mu = \\inf_{\\theta \\in \\Theta, a \\in A} \\dot{\\mu}(m_a'\\theta) > 0$.\n\nFor the logistic function k\u00b5 = 1/4, while the value of c\u00b5 depends on $\\sup_{\\theta \\in \\Theta, a \\in A} |m_a'\\theta|$.\n\nAssumption 2. The norm of covariates in {ma : a \u2208 A} is bounded: there exists cm < \u221e such\nthat for all a \u2208 A, $\\|m_a\\|_2 \\le c_m$.\n\nFinally, we make the following assumption on the rewards:\n\nAssumption 3. There exists Rmax > 0 such that for any t \u2265 1, 0 \u2264 Rt \u2264 Rmax holds a.s. Let\n$\\epsilon_t = R_t - \\mu(m_{A_t}'\\theta^*)$. For all t \u2265 1, it holds that $\\mathbb{E}[\\epsilon_t \\mid m_{A_t}, \\epsilon_{t-1}, \\ldots, m_{A_2}, \\epsilon_1, m_{A_1}] = 0$ a.s.\n\nAs for the standard UCB algorithm, the regret can be analyzed in terms of the difference between\nthe expected reward received playing an optimal arm and that of the best sub-optimal arm:\n\n$$\\Delta(\\theta^*) = \\min_{a \\,:\\, \\mu(m_a'\\theta^*) < \\mu(m_{a^*}'\\theta^*)} \\left[ \\mu(m_{a^*}'\\theta^*) - \\mu(m_a'\\theta^*) \\right].$$\n\nTheorem 1 establishes a high-probability bound on the regret incurred when using GLM-UCB with\n\n$$\\rho(t) = \\frac{2 k_\\mu \\kappa R_{\\max}}{c_\\mu} \\sqrt{2 d \\log(t) \\log(2 d T/\\delta)}, \\qquad (8)$$\n\nwhere T is the fixed time horizon, $\\kappa = \\sqrt{3 + 2 \\log(1 + 2 c_m^2/\\lambda_0)}$ and \u03bb0 denotes the smallest\neigenvalue of $\\sum_{i=1}^{d} m_{a_i} m_{a_i}'$, which by our previous assumption is positive.\n\n\u2074 Of course, the linear bandit algorithms also share this property with our algorithm.\n\nTheorem 1 (Problem Dependent Upper Bound). Let $s = \\max(1, c_m^2/\\lambda_0)$. 
Then, under Assumptions\n1\u20133, for all T \u2265 1, the regret satisfies:\n\n$$\\mathbb{P}\\left( \\mathrm{Regret}_T \\le (d+1) R_{\\max} + \\frac{C d^2}{\\Delta(\\theta^*)} \\log^2[sT] \\log\\left[\\frac{2dT}{\\delta}\\right] \\right) \\ge 1 - \\delta \\quad \\text{with} \\quad C = \\frac{32 \\kappa^2 R_{\\max}^2 k_\\mu^2}{c_\\mu^2}.$$\n\nNote that the above regret bound depends on the true value of \u03b8\u2217 through \u2206(\u03b8\u2217). The following\ntheorem provides an upper bound on the regret that does not depend on \u03b8\u2217.\n\nTheorem 2 (Problem Independent Upper Bound). Let $s = \\max(1, c_m^2/\\lambda_0)$. Then, under\nAssumptions 1\u20133, for all T \u2265 1, the regret satisfies\n\n$$\\mathbb{P}\\left( \\mathrm{Regret}_T \\le (d+1) R_{\\max} + C d \\log[sT] \\sqrt{T \\log\\left[\\frac{2dT}{\\delta}\\right]} \\right) \\ge 1 - \\delta \\quad \\text{with} \\quad C = \\frac{8 R_{\\max} k_\\mu \\kappa}{c_\\mu}.$$\n\nThe proofs of Theorems 1\u20132 can be found in the supplementary material. The main idea is to use\nthe explicit form of the estimator given by (6) to show that\n\n$$\\left| \\mu(m_{A_t}'\\theta^*) - \\mu(m_{A_t}'\\hat{\\theta}_t) \\right| \\le \\frac{k_\\mu}{c_\\mu} \\|m_{A_t}\\|_{M_t^{-1}} \\left\\| \\sum_{k=1}^{t-1} m_{A_k} \\epsilon_k \\right\\|_{M_t^{-1}}.$$\n\nBounding the last term on the right-hand side is then carried out following the lines of [12].\n\n4.2 Asymptotic Upper Confidence Bound\n\nPreliminary experiments carried out using the value of \u03c1(t) defined in equation (8), including the\ncase where \u00b5 is the identity function (i.e., using the algorithm described by [8, 12]), revealed poor\nperformance for moderate sample sizes. A look into the proof of the regret bound easily explains\nthis observation: the arguments are involved enough that some approximations\nseem unavoidable, in particular several applications of the Cauchy-Schwarz inequality, leading\nto pessimistic confidence bounds. 
We provide here some asymptotic arguments that suggest\nchoosing significantly smaller exploration bonuses, which will in turn be validated by the numerical\nexperiments presented in Section 5.\n\nConsider the canonical GLM associated with an inverse link function \u00b5 and assume that the\nvectors of covariates X are drawn independently under a fixed distribution. This random design\nmodel would for instance describe the situation when the arms are drawn randomly from a fixed\ndistribution. Standard statistical arguments show that the Fisher information matrix pertaining to\nthis model is given by J = E[ \u02d9\u00b5(X \u2032\u03b8\u2217)XX \u2032] and that the maximum likelihood estimate \u02c6\u03b8t is such\nthat $t^{1/2}(\\hat{\\theta}_t - \\theta^*) \\xrightarrow{D} \\mathcal{N}(0, J^{-1})$, where $\\xrightarrow{D}$ stands for convergence in distribution. Moreover,\n$t^{-1} M_t \\xrightarrow{a.s.} \\Sigma$ where \u03a3 = E[XX \u2032]. Hence, using the delta-method and Slutsky\u2019s lemma\n\n$$\\|m_a\\|_{M_t^{-1}}^{-1} \\left( \\mu(m_a'\\hat{\\theta}_t) - \\mu(m_a'\\theta^*) \\right) \\xrightarrow{D} \\mathcal{N}\\left( 0,\\; \\dot{\\mu}(m_a'\\theta^*)\\, \\|m_a\\|_{\\Sigma^{-1}}^{-2}\\, \\|m_a\\|_{J^{-1}}^{2} \\right).$$\n\nThe right-hand variance is smaller than k\u00b5/c\u00b5 as $J \\succeq c_\\mu \\Sigma$. Hence, for any sampling distribution\nsuch that J and \u03a3 are positive definite and sufficiently large t and small \u03b4,\n\n$$\\mathbb{P}\\left( \\|m_a\\|_{M_t^{-1}}^{-1} \\left( \\mu(m_a'\\hat{\\theta}_t) - \\mu(m_a'\\theta^*) \\right) > \\sqrt{2 (k_\\mu/c_\\mu) \\log(1/\\delta)} \\right)$$\n\nis asymptotically bounded by \u03b4. Based on the above asymptotic argument, we postulate that using\n$\\rho(t) = \\sqrt{2 (k_\\mu/c_\\mu) \\log(t)}$, i.e., inflating the exploration bonus by a factor of $\\sqrt{k_\\mu/c_\\mu}$ compared to\nthe usual UCB setting, is sufficient. 
This is the setting used in the simulations below.\n\n5 Experiments\n\nTo the best of our knowledge, there is currently no public benchmark available to test bandit\nmethods on real-world data. On simulated data, the proposed method unsurprisingly outperforms\nits competitors when the data is indeed simulated from a well-specified generalized linear model.\nIn order to evaluate the potential of the method in more challenging scenarios, we thus carried out\ntwo experiments using real-world datasets.\n\n5.1 Forest Cover Type Data\n\nIn this first experiment, we test the performance of the proposed method on a toy problem using the\n\u201cForest Cover Type dataset\u201d from the UCI repository. The dataset (centered and normalized with\nconstant covariate added, resulting in 11-dimensional vectors, ignoring all categorical variables)\nhas been partitioned into K = 32 clusters using unsupervised k-means. The values of the response\nvariable for the data points assigned to each cluster are viewed as the outcomes of an arm while the\ncentroid of the cluster is taken as the 11-dimensional vector of covariates characteristic of the arm.\nTo cast the problem into the logistic regression framework, each response variable is binarized by\nassociating the first class (\u201cSpruce/Fir\u201d) to a response R = 1 and all other six classes to R = 0.\nThe proportions of responses equal to 1 in each cluster (or, in other words, the expected rewards\nassociated with each arm) range from 0.354 to 0.992, while the proportion on the complete set\nof 581,012 data points is equal to 0.367. In effect, we try to locate as fast as possible the cluster\nthat contains the maximal proportion of trees from a given species. We are faced with a 32-arm\nproblem in an 11-dimensional space with binary rewards. 
Obviously, the logistic regression model\nis not satisfied, although we do expect some regularity with respect to the position of the cluster\u2019s\ncentroid as the logistic regression trained on all data reaches a 0.293 misclassification rate.\n\n[Figure 1: top panel plots Regret against t for UCB, GLM-UCB and \u03b5-greedy; bottom panel plots draw counts against arm a for GLM-UCB and UCB.]\n\nFigure 1: Top: Regret of the UCB, GLM-UCB and the \u03b5-greedy algorithms. Bottom: Frequencies\nof draws of the 20 best arms using the UCB and GLM-UCB.\n\nWe compare the performance of three algorithms. First, the GLM-UCB algorithm, with\nparameters tuned as indicated in Section 4.2. Second, the standard UCB algorithm that ignores\nthe covariates. Third, an \u03b5-greedy algorithm that performs logistic regression and plays the best\nestimated action, $A_t = \\arg\\max_a \\mu(m_a'\\hat{\\theta}_t)$, with probability 1 \u2212 \u03b5 (with \u03b5 = 0.1). We observe in\nthe top graph of Figure 1 that the GLM-UCB algorithm achieves the smallest average regret by a\nlarge margin. When the parameter is well estimated, the greedy algorithm may find the best arm\nin little time and then incurs small regret. However, the exploration/exploitation tradeoff is not\ncorrectly handled by the \u03b5-greedy approach, causing a large variability in the regret. The lower plot\nof Figure 1 shows the number of times each of the 20 best arms has been played by the UCB\nand GLM-UCB algorithms. The arms are sorted in decreasing order of expected reward. 
It can be\nobserved that GLM-UCB only plays a small subset of all possible arms, concentrating on the best ones.\nThis behavior is made possible by the predictive power of the covariates: by sharing information\nbetween arms, it is possible to obtain sufficiently accurate predictions of the expected rewards of all\nactions, even for those that have never (or rarely) been played.\n\n5.2 Internet Advertisement Data\n\nIn this experiment, we used a large record of the activity of internet users provided by a major ISP.\nThe original dataset logs the visits to a set of 1222 pages over a six-day period corresponding to\nabout 5 \u00d7 10^8 page visits. The dataset also contains a record of the users\u2019 clicks on the ads that were\npresented on these pages. We worked with a subset of 208 ads and 3 \u00d7 10^5 users. The pages (ads)\nwere partitioned in 10 (respectively, 8) categories using Latent Dirichlet Allocation [15] applied to\ntheir respective textual content (in the case of ads, the textual content was that of the page pointed\nto by the ad\u2019s link). This second experiment is much more challenging, as the predictive power of\nthe sole textual information turns out to be quite limited (for instance, Poisson regression trained on\nthe entire data does not even correctly identify the best arm).\n\nThe action space is composed of the 80 pairs of page and ad categories: when a pair is chosen,\nit is presented to a group of 50 users, randomly selected from the database, and the reward is the\nnumber of recorded clicks. As the average reward is typically equal to 0.15, we use a logarithmic\nlink function corresponding to Poisson regression. 
The vector of covariates for each pair is of\ndimension 19: it is composed of an intercept followed by the concatenation of two vectors of\ndimension 10 and 8 representing, respectively, the categories of the pages and the ads. In this\nproblem, the covariate vectors do not span the entire space; to address this issue, it is sufficient to\nconsider the pseudo-inverse of Mt instead of the inverse.\n\nOn this data, we compared the GLM-UCB algorithm with the two alternatives described in\nSection 5.1. Figure 2 shows that GLM-UCB once again outperforms its competitors, even though\nthe margin over UCB is now less remarkable. Given the rather limited predictive power of the\ncovariates in this example, this is an encouraging illustration of the potential of techniques which\nuse vectors of covariates in real-life applications.\n\n[Figure 2: Regret plotted against t for UCB, GLM-UCB and \u03b5-greedy.]\n\nFigure 2: Comparison of the regret of the UCB, GLM-UCB and the \u03b5-greedy (\u03b5 = 0.1) algorithm\non the advertisement dataset.\n\n6 Conclusions\n\nWe have introduced an approach that generalizes the linear regression model studied by [10, 8, 12].\nAs in the original UCB algorithm, the proposed GLM-UCB method operates directly in the reward\nspace. We discussed how to tune the parameters of the algorithm to avoid exaggerated optimism,\nwhich would slow down learning. In the numerical simulations, the proposed algorithm was\nshown to be competitive and sufficiently robust to tackle real-world problems. 
An interesting\nopen problem (already challenging in the linear case) consists in tightening the theoretical results\nobtained so far in order to bridge the gap between the existing (pessimistic) confidence bounds and\nthose suggested by the asymptotic arguments presented in Section 4.2, which have been shown to\nperform satisfactorily in practice.\n\nAcknowledgments\n\nThis work was supported in part by AICML, AITF, NSERC, PASCAL2 under no. 216886, the\nDARPA GALE project under no. HR0011-08-C-0110 and Orange Labs under contract no. 289365.\n\nReferences\n\n[1] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\n[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235\u2013256, 2002.\n\n[3] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n\n[4] J. Audibert, R. Munos, and Cs. Szepesv\u00e1ri. Tuning bandit algorithms in stochastic environments. Lecture Notes in Computer Science, 4754:150, 2007.\n\n[5] C.C. Wang, S.R. Kulkarni, and H.V. Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338\u2013355, 2005.\n\n[6] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, pages 817\u2013824, 2008.\n\n[7] S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-armed bandit problems with dependent arms. International Conference on Machine Learning, pages 721\u2013728, 2007.\n\n[8] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. Conference on Learning Theory, 2008.\n\n[9] S.M. Kakade, S. Shalev-Shwartz, and A. Tewari. 
Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440\u2013447. ACM, 2008.\n\n[10] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397\u2013422, 2002.\n\n[11] Y. Abbasi-Yadkori, A. Antos, and Cs. Szepesv\u00e1ri. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.\n\n[12] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395\u2013411, 2010.\n\n[13] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.\n\n[14] K. Chen, I. Hu, and Z. Ying. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Annals of Statistics, 27(4):1155\u20131163, 1999.\n\n[15] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Advances in Neural Information Processing Systems, 14:601\u2013608, 2002.\n\n[16] V.H. De La Pena, M.J. Klass, and T.L. Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3):1902\u20131933, 2004.\n\n[17] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. arXiv preprint arXiv:0812.3465v2, 2008.", "award": [], "sourceid": 828, "authors": [{"given_name": "Sarah", "family_name": "Filippi", "institution": null}, {"given_name": "Olivier", "family_name": "Cappe", "institution": null}, {"given_name": "Aur\u00e9lien", "family_name": "Garivier", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}