{"title": "Thompson Sampling for 1-Dimensional Exponential Family Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1448, "page_last": 1456, "abstract": "Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for $1$-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (and thus Jeffreys prior) available in an exponential family. This allows us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.", "full_text": "Thompson Sampling for 1-Dimensional Exponential Family Bandits

Nathaniel Korda
INRIA Lille - Nord Europe, Team SequeL
nathaniel.korda@inria.fr

Emilie Kaufmann
Institut Mines-Telecom; Telecom ParisTech
kaufmann@telecom-paristech.fr

Remi Munos
INRIA Lille - Nord Europe, Team SequeL
remi.munos@inria.fr

Abstract

Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family.
This allows us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.

1 Introduction

K-armed bandit problems provide an elementary model for exploration-exploitation tradeoffs found at the heart of many online learning problems. In such problems, an agent is presented with K distributions (also called arms, or actions) {p_a}_{a=1}^K, from which she draws samples interpreted as rewards she wants to maximize. This objective induces a trade-off between choosing to sample a distribution that has already yielded high rewards, and choosing to sample a relatively unexplored distribution at the risk of losing rewards in the short term. Here we make the assumption that the distributions p_a belong to a parametric family of distributions P = {p(· | θ), θ ∈ Θ}, where Θ ⊂ R. The bandit model is described by a parameter θ0 = (θ_1, . . . , θ_K) such that p_a = p(· | θ_a). We introduce the mean function µ(θ) = E_{X∼p(·|θ)}[X], and the optimal arm θ∗ = θ_{a∗} where a∗ = argmax_a µ(θ_a).

An algorithm A for a K-armed bandit problem is a (possibly randomised) method for choosing which arm a_t to sample from at time t, given a history of previous arm choices and obtained rewards, H_{t−1} := ((a_s, x_s))_{s=1}^{t−1}: each reward x_s is drawn from the distribution p_{a_s}.
The agent's goal is to design an algorithm with low regret:

R(A, t) = R(A, t)(θ) := t µ(θ∗) − E_A [ Σ_{s=1}^t x_s ].

This quantity measures the expected performance of algorithm A compared to the expected performance of an optimal algorithm given knowledge of the reward distributions, i.e. sampling always from the distribution with the highest expectation.

Since the early 2000s the "optimism in the face of uncertainty" heuristic has been a popular approach to this problem, providing both simplicity of implementation and finite-time upper bounds on the regret (e.g. [4, 7]). However, in the last two years there has been renewed interest in the Thompson Sampling heuristic (TS). While this heuristic was first put forward to solve bandit problems eighty years ago in [15], it was not until recently that theoretical analyses of its performance were achieved [1, 2, 11, 13]. In this paper we take a major step towards generalising these analyses to the same level of generality already achieved for "optimistic" algorithms.

Thompson Sampling. Unlike optimistic algorithms, which are often based on confidence intervals, the Thompson Sampling algorithm, denoted by A_{π0}, uses Bayesian tools and puts a prior distribution π_{a,0} = π_0 on each parameter θ_a. A posterior distribution π_{a,t} is then maintained according to the rewards observed in H_{t−1}. At each time a sample θ_{a,t} is drawn from each posterior π_{a,t} and then the algorithm chooses to sample a_t = argmax_{a∈{1,...,K}} µ(θ_{a,t}). Note that actions are sampled according to their posterior probabilities of being optimal.

Our contributions. TS has proved to have impressive empirical performance, very close to that of state-of-the-art algorithms such as DMED and KL-UCB [11, 9, 7].
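The sampling rule just described is easy to state concretely. The sketch below is a minimal, illustrative implementation for Bernoulli arms using the Beta(1/2, 1/2) Jeffreys prior that appears later in Figure 1; it is not the paper's code, and the horizon, means and seed are arbitrary choices:

```python
import random

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Thompson Sampling for Bernoulli arms with the Beta(1/2, 1/2) Jeffreys prior.

    Returns the number of times each arm was pulled; `true_means` are the
    Bernoulli parameters, unknown to the algorithm.
    """
    rng = random.Random(seed)
    K = len(true_means)
    successes = [0] * K  # observed ones on each arm
    failures = [0] * K   # observed zeros on each arm
    pulls = [0] * K
    for _ in range(horizon):
        # Draw theta_{a,t} from each posterior Beta(1/2 + s_a, 1/2 + (n_a - s_a)) ...
        samples = [rng.betavariate(0.5 + successes[a], 0.5 + failures[a])
                   for a in range(K)]
        # ... and play the arm whose sampled parameter has the highest mean
        # (for Bernoulli arms mu is the identity on the sampled parameter).
        a = max(range(K), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[a] else 0
        successes[a] += reward
        failures[a] += 1 - reward
        pulls[a] += 1
    return pulls

pulls = thompson_sampling_bernoulli([0.3, 0.5, 0.8], horizon=2000)
```

On such a run the optimal arm (mean 0.8) receives the overwhelming majority of the 2000 pulls, which is exactly the behaviour the regret bounds below quantify.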
Furthermore, recent works [11, 2] have shown that in the special case where each p_a is a Bernoulli distribution B(θ_a), TS using a uniform prior over the arms is asymptotically optimal in the sense that it achieves the asymptotic lower bound on the regret provided by Lai and Robbins in [12] (which holds for univariate parametric bandits). As explained in [1, 2], Thompson Sampling with uniform prior for Bernoulli rewards can be slightly adapted to deal with bounded rewards. However, there is no notion of asymptotic optimality for this non-parametric family of rewards. In this paper, we extend the optimality property that holds for Bernoulli distributions to more general families of parametric rewards, namely 1-dimensional exponential families, if the algorithm uses the Jeffreys prior:

Theorem 1. Suppose that the reward distributions belong to a 1-dimensional canonical exponential family and let π_J denote the associated Jeffreys prior. Then,

lim_{T→∞} R(A_{π_J}, T) / ln T = Σ_{a=1}^K (µ(θ_{a∗}) − µ(θ_a)) / K(θ_a, θ_{a∗}),    (1)

where K(θ, θ′) := KL(p_θ, p_{θ′}) is the Kullback-Leibler divergence between p_θ and p_{θ′}.

This theorem follows directly from Theorem 2. In the proof of this result we provide in Theorem 4 a finite-time, exponential concentration bound for posterior distributions of exponential family random variables, something that to the best of our knowledge is new to the literature and of interest in its own right. Our proof also exploits the connection between the Jeffreys prior, Fisher information and the Kullback-Leibler divergence in exponential families.

Related Work. Another line of recent work has focused on distribution-independent bounds for Thompson Sampling. [2] establishes that R(A_{π_U}, T) = O(√(KT ln T)) for Thompson Sampling for bounded rewards (with the classic uniform prior π_U on the underlying Bernoulli parameter). [14] go beyond the Bernoulli model, and give an upper bound on the Bayes risk (i.e. the regret averaged over the prior) independent of the prior distribution. For the parametric multi-armed bandit with K arms described above, their result states that the regret of Thompson Sampling using a prior π_0 is not too big when averaged over this same prior:

E_{θ∼π_0^{⊗K}} [R(A_{π_0}, T)(θ)] ≤ 4 + K + 4√(KT log T).

Building on the same ideas, [6] have improved this upper bound to 14√(KT). In our paper, we rather see the prior used by Thompson Sampling as a tool, and we therefore want to derive regret bounds for any given problem parametrized by θ that depend on this parameter.

[14] also use Thompson Sampling in more general models, like the linear bandit model. Their result is a bound on the Bayes risk that does not depend on the prior, whereas [3] gives a first bound on the regret in this model. Linear bandits consider a possibly infinite number of arms whose mean rewards are linearly related by a single, unknown coefficient vector. Once again, the analysis in [3] encounters the problem of describing the concentration of posterior distributions. However, by using a conjugate normal prior, they can employ explicit concentration bounds available for Normal distributions to complete their argument.

Paper Structure. In Section 2 we describe important features of the one-dimensional canonical exponential families we consider, including closed-form expressions for KL-divergences and the Jeffreys prior. Section 3 gives statements of the main results, and provides the proof of the regret bound.
Section 4 proves the posterior concentration result used in the proof of the regret bound.

2 Exponential Families and the Jeffreys Prior

A distribution is said to belong to a one-dimensional canonical exponential family if it has a density with respect to some reference measure ν of the form:

p(x | θ) = A(x) exp(T(x)θ − F(θ)),    (2)

where θ ∈ Θ ⊂ R. T and A are some fixed functions that characterize the exponential family, and F(θ) = log( ∫ A(x) exp[T(x)θ] dν(x) ). Θ is called the parameter space, T(x) the sufficient statistic, and F(θ) the normalisation function. We make the classic assumption that F is twice differentiable with a continuous second derivative. It is well known [17] that:

E_{X|θ}[T(X)] = F′(θ)  and  Var_{X|θ}[T(X)] = F″(θ),

showing in particular that F is strictly convex. The mean function µ is differentiable and strictly increasing, since we can show that

µ′(θ) = Cov_{X|θ}(X, T(X)) > 0.

In particular, this shows that µ is one-to-one in θ.

KL-divergence in Exponential Families. In an exponential family, a direct computation shows that the Kullback-Leibler divergence can be expressed as a Bregman divergence of the normalisation function F:

K(θ, θ′) = D_F^B(θ′, θ) := F(θ′) − [F(θ) + F′(θ)(θ′ − θ)].    (3)

Jeffreys prior in Exponential Families. In the Bayesian literature, a special "non-informative" prior, introduced by Jeffreys in [10], is sometimes considered. This prior, called the Jeffreys prior, is invariant under re-parametrisation of the parameter space, and it can be shown to be proportional to the square-root of the Fisher information I(θ).
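The closed form (3) can be sanity-checked numerically on a concrete family. The sketch below, an illustration we add here (not from the paper), uses the Poisson family, for which T(x) = x, θ = log λ and F(θ) = e^θ:

```python
import math

def kl_poisson(lam1, lam2):
    # Closed-form KL divergence KL(Poisson(lam1), Poisson(lam2)).
    return lam1 * math.log(lam1 / lam2) + lam2 - lam1

def bregman_divergence_F(theta2, theta):
    # D_F^B(theta2, theta) = F(theta2) - [F(theta) + F'(theta)(theta2 - theta)]
    # for the Poisson normalisation function F(theta) = e^theta (so F' = F).
    return math.exp(theta2) - (math.exp(theta) + math.exp(theta) * (theta2 - theta))

lam1, lam2 = 2.0, 5.0
theta1, theta2 = math.log(lam1), math.log(lam2)  # natural parameters
# Equation (3): K(theta1, theta2) = D_F^B(theta2, theta1).
assert abs(kl_poisson(lam1, lam2) - bregman_divergence_F(theta2, theta1)) < 1e-9
```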
In the special case of the canonical exponential family, the Fisher information takes the form I(θ) = F″(θ), hence the Jeffreys prior for the model (2) is

π_J(θ) ∝ √|F″(θ)|.

Under the Jeffreys prior, the posterior on θ after n observations y_1, . . . , y_n is given by

p(θ | y_1, . . . , y_n) ∝ √F″(θ) exp( θ Σ_{i=1}^n T(y_i) − n F(θ) ).    (4)

When ∫_Θ √F″(θ) dθ < +∞, the prior is called proper. However, statisticians often use priors which are not proper: the prior is called improper if ∫_Θ √F″(θ) dθ = +∞ and any observation makes the corresponding posterior (4) integrable.

Some Intuition for choosing the Jeffreys Prior. In the proof of our concentration result for posterior distributions (Theorem 4) it will be crucial to lower bound the prior probability of an ε-sized KL-divergence ball around each of the parameters θ_a. Since K(θ, θ′)/|θ − θ′|² → F″(θ)/2 as θ′ → θ, choosing a prior proportional to √F″(θ) ensures that the prior measure of such balls is Ω(√ε).

Examples and Pseudocode. Algorithm 1 presents pseudocode for Thompson Sampling with the Jeffreys prior for distributions parametrized by their natural parameter θ.
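When the posterior (4) is not one of the conjugate forms listed in Figure 1, one can still draw the sample Thompson Sampling needs with a random-walk Metropolis-Hastings step, since (4) is known up to its normalising constant. The sketch below is a hypothetical illustration for the Poisson family (T(x) = x, F(θ) = e^θ, so √F″(θ) = e^{θ/2}); the step size, chain length and data are arbitrary choices of ours:

```python
import math
import random

def log_jeffreys_posterior(theta, s, n):
    # log of sqrt(F''(theta)) * exp(theta * s - n * F(theta)), up to an additive
    # constant, for the Poisson family where F(theta) = e^theta.
    return 0.5 * theta + theta * s - n * math.exp(theta)

def sample_posterior_mh(s, n, n_iter=5000, step=0.3, seed=1):
    # Random-walk Metropolis-Hastings on the natural parameter theta,
    # targeting the Jeffreys posterior after n observations with sum s.
    rng = random.Random(seed)
    theta = math.log(max(s, 1) / n)  # start near the maximum-likelihood estimate
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = (log_jeffreys_posterior(proposal, s, n)
                     - log_jeffreys_posterior(theta, s, n))
        if math.log(rng.random()) < log_ratio:
            theta = proposal
    return theta

# s = 60 events over n = 20 Poisson draws: the posterior on lambda = e^theta is
# Gamma(1/2 + 60, 20) (cf. Figure 1), concentrated near lambda = 3.
draws = [sample_posterior_mh(60, 20, seed=i) for i in range(50)]
mean_lambda = sum(math.exp(t) for t in draws) / len(draws)
```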
But as the Jeffreys prior is invariant under reparametrization, if a distribution is parametrised by some parameter λ ≢ θ, the algorithm can use the Jeffreys prior ∝ √I(λ) on λ, drawing samples from the posterior on λ. Note that the posterior sampling step (in bold) is always tractable using, for example, a Hastings-Metropolis algorithm.

Algorithm 1 Thompson Sampling for Exponential Families with the Jeffreys prior
Require: F normalization function, T sufficient statistic, µ mean function
  for t = 1 . . . K do
    Sample arm t and get reward x_t
    N_t = 1, S_t = T(x_t)
  end for
  for t = K + 1 . . . n do
    for a = 1 . . . K do
      Sample θ_{a,t} from π_{a,t} ∝ √F″(θ) exp(θ S_a − N_a F(θ))
    end for
    Sample arm A_t = argmax_a µ(θ_{a,t}) and get reward x_t
    S_{A_t} = S_{A_t} + T(x_t), N_{A_t} = N_{A_t} + 1
  end for

Name | Distribution | θ | Prior on λ | Posterior on λ
B(λ) | λ^x (1−λ)^{1−x} δ_{0,1}(x) | log(λ/(1−λ)) | Beta(1/2, 1/2) | Beta(1/2 + s, 1/2 + n − s)
N(λ, σ²) | (1/√(2πσ²)) e^{−(x−λ)²/(2σ²)} | λ/σ² | ∝ 1 | N(s/n, σ²/n)
Γ(k, λ) | (λ^k/Γ(k)) x^{k−1} e^{−λx} 1_{[0,+∞)}(x) | −λ | ∝ 1/λ | Γ(kn, s)
P(λ) | (λ^x e^{−λ}/x!) δ_N(x) | log(λ) | ∝ 1/√λ | Γ(1/2 + s, n)
Pareto(x_m, λ) | (λ x_m^λ / x^{λ+1}) 1_{[x_m,+∞)}(x) | −λ − 1 | ∝ 1/λ | Γ(n + 1, s − n log x_m)
Weibull(k, λ) | kλ (xλ)^{k−1} e^{−(λx)^k} | −λ^k | ∝ 1/λ^k | ∝ λ^{(n−1)k} exp(−λ^k s)

Figure 1: The posterior distribution after observations y_1, . . . , y_n depends on n and s = Σ_{i=1}^n T(y_i).

Some examples of common exponential family models are given in Figure 1, together with the posterior distributions on the parameter λ that is used by TS with the Jeffreys prior. In addition to examples already studied in [7], for which T(x) = x, we also give two examples of more general canonical exponential families, namely the Pareto distribution with known minimum value and unknown tail index λ, Pareto(x_m, λ), for which T(x) = log(x), and the Weibull distribution with known shape and unknown rate parameter, Weibull(k, λ), for which T(x) = x^k. These last two distributions are not covered even by the work in [8], and belong to the family of heavy-tailed distributions.

For the Bernoulli model, we note further that the use of the Jeffreys prior is not covered by the previous analyses. These analyses make extensive use of the uniform prior, through the fact that the coefficients of the Beta posteriors they consider have to be integers.

3 Results and Proof of Regret Bound

An exponential family K-armed bandit is a K-armed bandit for which the reward distributions p_a are known to be elements of an exponential family of distributions P(Θ). We denote by p_{θ_a} the distribution of arm a and its mean by µ_a = µ(θ_a).

Theorem 2 (Regret Bound).
Assume that µ_1 > µ_a for all a ≠ 1, and that π_{a,0} is taken to be the Jeffreys prior over Θ. Then for every ε > 0 there exists a constant C(ε, P) depending on ε and on the problem P such that the regret of Thompson Sampling using the Jeffreys prior satisfies

R(A_{π_J}, T) ≤ ((1 + ε)/(1 − ε)) ( Σ_{a=2}^K (µ_1 − µ_a)/K(θ_a, θ_1) ) ln(T) + C(ε, P).

Proof: We give here the main argument of the proof of the regret bound, which proceeds by bounding the expected number of draws of any suboptimal arm. Along the way we shall state concentration results whose proofs are postponed to later sections.

Step 0: Notation. We denote by y_{a,s} the s-th observation of arm a and by N_{a,t} the number of times arm a is chosen up to time t. (y_{a,s})_{s≥1} is i.i.d. with distribution p_{θ_a}. Let Y_a^u := (y_{a,s})_{1≤s≤u} be the vector of the first u observations from arm a. Y_{a,t} := Y_a^{N_{a,t}} is therefore the vector of observations from arm a available at the beginning of round t. Recall that π_{a,t}, respectively π_{a,0}, is the posterior, respectively the prior, on θ_a at round t of the algorithm.

We define L(θ) to be such that P_{Y∼p(·|θ)}(p(Y|θ) ≥ L(θ)) ≥ 1/2. Observations from arm a such that p(y_{a,s}|θ) ≥ L(θ_a) can therefore be seen as likely observations.
For any δ_a > 0, we introduce the event Ẽ_{a,t} = Ẽ_{a,t}(δ_a):

Ẽ_{a,t} = { ∃ 1 ≤ s′ ≤ N_{a,t} : p(y_{a,s′}|θ_a) ≥ L(θ_a),  | (Σ_{s=1, s≠s′}^{N_{a,t}} T(y_{a,s}))/(N_{a,t} − 1) − F′(θ_a) | ≤ δ_a }.    (5)

For all a ≠ 1 and Δ_a such that µ_a < µ_a + Δ_a < µ_1, we introduce

E^θ_{a,t} = E^θ_{a,t}(Δ_a) := { µ(θ_{a,t}) ≤ µ_a + Δ_a }.

On Ẽ_{a,t}, the empirical sufficient statistic of arm a at round t is well concentrated around its mean and a 'likely' realization of arm a has been observed. On E^θ_{a,t}, the mean of the distribution with parameter θ_{a,t} does not exceed by much the true mean µ_a. δ_a and Δ_a will be carefully chosen at the end of the proof.

Step 1: Decomposition. The idea of the proof is to decompose the probability of playing a suboptimal arm using the events given in Step 0, and the fact that E[N_{a,T}] = Σ_{t=1}^T P(a_t = a):

E[N_{a,T}] = Σ_{t=1}^T P(a_t = a, Ẽ_{a,t}, E^θ_{a,t})  [term (A)]
           + Σ_{t=1}^T P(a_t = a, Ẽ_{a,t}, (E^θ_{a,t})^c)  [term (B)]
           + Σ_{t=1}^T P(a_t = a, Ẽ^c_{a,t})  [term (C)],

where E^c denotes the complement of event E. Term (C) is controlled by the concentration of the empirical sufficient statistic, and (B) is controlled by the tail probabilities of the posterior distribution. We give the needed concentration results in Step 2.
When conditioned on the event that the optimal arm is played at least polynomially often, term (A) can be decomposed further, and then controlled by the results from Step 2. Step 3 proves that the optimal arm is played this many times.

Step 2: Concentration Results. We state here the two concentration results that are necessary to evaluate the probability of the above events.

Lemma 3. Let (y_s) be an i.i.d. sequence of distribution p(· | θ) and δ > 0. Then

P( | (1/u) Σ_{s=1}^u [T(y_s) − F′(θ)] | ≥ δ ) ≤ 2 e^{−u K̃(θ,δ)},

where K̃(θ, δ) = min(K(θ + g(δ), θ), K(θ − h(δ), θ)), with g(δ) > 0 defined by F′(θ + g(δ)) = F′(θ) + δ and h(δ) > 0 defined by F′(θ − h(δ)) = F′(θ) − δ.

The two following inequalities, which will be useful in the sequel, can easily be deduced from Lemma 3. Their proof is gathered in Appendix A with that of Lemma 3. For any arm a, for any b ∈ ]0, 1[,

Σ_{t=1}^T P(a_t = a, (Ẽ_{a,t}(δ_a))^c) ≤ Σ_{t=1}^∞ (1/2)^t + Σ_{t=1}^∞ 2t e^{−(t−1) K̃(θ_a,δ_a)},    (6)

Σ_{t=1}^T P((Ẽ_{a,t}(δ_a))^c ∩ N_{a,t} > t^b) ≤ Σ_{t=1}^∞ t (1/2)^{t^b} + Σ_{t=1}^∞ 2t² e^{−(t^b−1) K̃(θ_a,δ_a)}.    (7)

The second result tells us that concentration of the empirical sufficient statistic around its mean implies concentration of the posterior distribution around the true parameter:

Theorem 4 (Posterior Concentration). Let π_{a,0} be the Jeffreys prior.
There exist constants C_{1,a} = C_1(F, θ_a) > 0, C_{2,a} = C_2(F, θ_a, Δ_a) > 0, and N(θ_a, F) such that, for all N_{a,t} ≥ N(θ_a, F),

P( µ(θ_{a,t}) > µ(θ_a) + Δ_a | Y_{a,t} ) 1_{Ẽ_{a,t}} ≤ C_{1,a} e^{−(N_{a,t}−1)(1−δ_a C_{2,a}) K(θ_a, µ^{−1}(µ_a+Δ_a)) + ln(N_{a,t})},

whenever δ_a < 1 and Δ_a are such that 1 − δ_a C_{2,a}(Δ_a) > 0.

Step 3: Lower Bound the Number of Optimal Arm Plays with High Probability. The main difficulty addressed in previous regret analyses for Thompson Sampling is the control of the number of draws of the optimal arm. We provide this control in the form of Proposition 5, which is adapted from Proposition 1 in [11]. The proof of this result, an outline of which is given in Appendix D, explores in depth the randomised nature of Thompson Sampling. In particular, we show that the proof in [11] can be significantly simplified, but at the expense of no longer being able to describe the constant C_b explicitly:

Proposition 5. ∀ b ∈ (0, 1), ∃ C_b(π, µ_1, µ_2, K) < ∞ such that Σ_{t=1}^∞ P(N_{1,t} ≤ t^b) ≤ C_b.

Step 4: Bounding the Terms of the Decomposition. Now we bound the terms of the decomposition as discussed in Step 1: an upper bound on term (C) is given in (6), whereas a bound on term (B) follows from Lemma 6 below. Although the proof of this lemma is standard, and bears a strong similarity to Lemma 3 of [3], we provide it in Appendix C for the sake of completeness.

Lemma 6.
For all actions a and for all ε > 0, there exists N_ε = N_ε(δ_a, Δ_a, θ_a) > 0 such that

(B) ≤ [(1 − ε)(1 − δ_a C_{2,a}) K(θ_a, µ^{−1}(µ_a + Δ_a))]^{−1} ln(T) + max{N_ε, N(θ_a, F)} + 1,

where N_ε = N_ε(δ_a, Δ_a, θ_a) is the smallest integer such that for all n ≥ N_ε,

(n − 1)^{−1} ln(C_{1,a} n) < ε (1 − δ_a C_{2,a}) K(θ_a, µ^{−1}(µ_a + Δ_a)),

and N(θ_a, F) is the constant from Theorem 4.

When we have seen enough observations on the optimal arm, term (A) also becomes a result about the concentration of the posterior and the empirical sufficient statistic, but this time for the optimal arm:

(A) ≤ Σ_{t=1}^T P(a_t = a, Ẽ_{a,t}, E^θ_{a,t}, N_{1,t} > t^b) + C_b
    ≤ Σ_{t=1}^T P(µ(θ_{1,t}) ≤ µ_1 − Δ′_a, N_{1,t} > t^b) + C_b
    ≤ Σ_{t=1}^T P(µ(θ_{1,t}) ≤ µ_1 − Δ′_a, Ẽ_{1,t}(δ_1), N_{1,t} > t^b)  [term (B′)]
      + Σ_{t=1}^T P(Ẽ^c_{1,t}(δ_1) ∩ N_{1,t} > t^b)  [term (C′)]  + C_b,    (8)

where Δ′_a = µ_1 − µ_a − Δ_a and δ_1 > 0 remains to be chosen. The first inequality comes from Proposition 5, and the second inequality comes from the following fact: if arm 1 is not chosen and arm a is such that µ(θ_{a,t}) ≤ µ_a + Δ_a, then µ(θ_{1,t}) ≤ µ_a + Δ_a. A bound on term (C′) is given in (7) for a = 1 and δ_1. In Theorem 4, we bound the conditional probability that µ(θ_{a,t}) exceeds the true mean. Following the same lines, we can also show that

P( µ(θ_{1,t}) ≤ µ_1 − Δ′_a | Y_{1,t} ) 1_{Ẽ_{1,t}(δ_1)} ≤ C_{1,1} e^{−(N_{1,t}−1)(1−δ_1 C_{2,1}) K(θ_1, µ^{−1}(µ_1−Δ′_a)) + ln(N_{1,t})}.

For any Δ′_a > 0, one can choose δ_1 such that 1 − δ_1 C_{2,1} > 0. Then, with N = N(P) such that the function u ↦ e^{−(u−1)(1−δ_1 C_{2,1}) K(θ_1, µ^{−1}(µ_1−Δ′_a)) + ln u} is decreasing for u ≥ N, (B′) is bounded by

N^{1/b} + Σ_{t=N^{1/b}+1}^∞ C_{1,1} e^{−(t^b−1)(1−δ_1 C_{2,1}) K(θ_1, µ^{−1}(µ_1−Δ′_a)) + ln(t^b)} < ∞.

Step 5: Choosing the Values δ_a and Δ_a. So far, we have shown that for any ε > 0 and for any choice of δ_a > 0 and 0 < Δ_a < µ_1 − µ_a such that 1 − δ_a C_{2,a} > 0, there exists a constant C(δ_a, Δ_a, ε, P) such that

E[N_{a,T}] ≤ ln(T) / [(1 − δ_a C_{2,a}) K(θ_a, µ^{−1}(µ_a + Δ_a))(1 − ε)] + C(δ_a, Δ_a, ε, P).

The constant is of course increasing (dramatically) when δ_a goes to zero, Δ_a to µ_1 − µ_a, or ε to zero.
But one can choose Δ_a close enough to µ_1 − µ_a and δ_a small enough that

(1 − C_{2,a}(Δ_a) δ_a) K(θ_a, µ^{−1}(µ_a + Δ_a)) ≥ K(θ_a, θ_1)/(1 + ε),

and this choice leads to

E[N_{a,T}] ≤ ((1 + ε)/(1 − ε)) ln(T)/K(θ_a, θ_1) + C(δ_a, Δ_a, ε, P).

Using that R(A, T) = Σ_{a=2}^K (µ_1 − µ_a) E_A[N_{a,T}] for any algorithm A concludes the proof.

4 Posterior Concentration: Proof of Theorem 4

For ease of notation, we drop the subscript a and let (y_s) be an i.i.d. sequence of distribution p_θ, with mean µ = µ(θ). Furthermore, by conditioning on the value of N_s, it is enough to bound 1_{Ẽ_u} P(µ(θ_u) ≥ µ + Δ | Y^u), where Y^u = (y_s)_{1≤s≤u} and

Ẽ_u = { ∃ 1 ≤ s′ ≤ u : p(y_{s′}|θ) ≥ L(θ),  | (Σ_{s=1, s≠s′}^u T(y_s))/(u − 1) − F′(θ) | ≤ δ }.

Step 1: Extracting a Kullback-Leibler Rate. The argument rests on the following lemma, whose proof can be found in Appendix B.

Lemma 7.
Let Ẽ_u be the event defined by (5), and introduce Θ_{θ,Δ} := {θ′ ∈ Θ : µ(θ′) ≥ µ(θ) + Δ}. The following inequality holds:

1_{Ẽ_u} P(µ(θ_u) ≥ µ + Δ | Y^u) ≤ [ ∫_{θ′∈Θ_{θ,Δ}} e^{−(u−1)(K(θ,θ′) − δ|θ−θ′|)} π(θ′|y_{s′}) dθ′ ] / [ ∫_{θ′∈Θ} e^{−(u−1)(K(θ,θ′) + δ|θ−θ′|)} π(θ′|y_{s′}) dθ′ ],    (9)

with s′ = inf{s ∈ N : p(y_s|θ) ≥ L(θ)}.

Step 2: Upper bounding the numerator of (9). We first note that on Θ_{θ,Δ} the leading term in the exponential is K(θ, θ′). Indeed, from (3) we know that

K(θ, θ′)/|θ − θ′| = |F′(θ) − (F(θ) − F(θ′))/(θ − θ′)|,

which, by strict convexity of F, is strictly increasing in |θ − θ′| for any fixed θ.
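This monotonicity is easy to check numerically for a concrete F. The sketch below, an added illustration (not from the paper), again uses the Poisson normalisation function F(θ) = e^θ:

```python
import math

def bregman_rate(theta, theta2):
    # K(theta, theta2) / |theta - theta2| for F(theta) = e^theta, using the
    # Bregman form (3): K = F(theta2) - F(theta) - F'(theta)(theta2 - theta).
    K = math.exp(theta2) - math.exp(theta) - math.exp(theta) * (theta2 - theta)
    return K / abs(theta2 - theta)

theta = 0.0
# The rate grows as theta2 moves away from theta, on either side.
right = [bregman_rate(theta, theta + d) for d in (0.5, 1.0, 1.5, 2.0)]
left = [bregman_rate(theta, theta - d) for d in (0.5, 1.0, 1.5, 2.0)]
assert all(x < y for x, y in zip(right, right[1:]))
assert all(x < y for x, y in zip(left, left[1:]))
```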
Now since $\mu$ is one-to-one and continuous, $\Theta^c_{\theta,\Delta}$ is an interval whose interior contains $\theta$, and hence, on $\Theta_{\theta,\Delta}$,

$$\frac{K(\theta,\theta')}{|\theta-\theta'|} \;\ge\; \frac{F(\mu^{-1}(\mu+\Delta))-F(\theta)}{\mu^{-1}(\mu+\Delta)-\theta} - F'(\theta) \;=:\; \big(C_2(F,\theta,\Delta)\big)^{-1} \;>\; 0.$$

So for $\delta$ such that $1-\delta C_2>0$ we can bound the numerator of (9) by:

$$\int_{\theta'\in\Theta_{\theta,\Delta}} e^{-(u-1)(K(\theta,\theta')-\delta|\theta-\theta'|)}\,\pi(\theta'|y_{s'})\,d\theta' \;\le\; \int_{\theta'\in\Theta_{\theta,\Delta}} e^{-(u-1)K(\theta,\theta')(1-\delta C_2)}\,\pi(\theta'|y_{s'})\,d\theta'$$
$$\le\; e^{-(u-1)(1-\delta C_2)K(\theta,\mu^{-1}(\mu+\Delta))} \int_{\Theta_{\theta,\Delta}} \pi(\theta'|y_{s'})\,d\theta' \;\le\; e^{-(u-1)(1-\delta C_2)K(\theta,\mu^{-1}(\mu+\Delta))}, \qquad (10)$$

where we have used that $\pi(\cdot|y_{s'})$ is a probability distribution, and that, since $\mu$ is increasing, $K(\theta,\mu^{-1}(\mu+\Delta)) = \inf_{\theta'\in\Theta_{\theta,\Delta}} K(\theta,\theta')$.

Step 3: Lower bounding the denominator of (9). To lower bound the denominator, we reduce the integral on the whole space $\Theta$ to a KL ball, and use the structure of the prior to lower bound the measure of that KL ball under the posterior obtained with the well-chosen observation $y_{s'}$. We introduce the following notation for KL balls: for any $x\in\Theta$, $\epsilon>0$, we define

$$B_\epsilon(x) := \{\theta'\in\Theta : K(x,\theta') \le \epsilon\}.$$

We have $2K(\theta,\theta')/(\theta-\theta')^2 \to F''(\theta) \neq 0$ as $\theta'\to\theta$ (since $F$ is strictly convex). Therefore, there exists $N_1(\theta,F)$ such that for $u\ge N_1(\theta,F)$, on $B_{1/u^2}(\theta)$,

$$|\theta-\theta'| \le \sqrt{2K(\theta,\theta')/F''(\theta)}.$$

Using this inequality we can then bound the denominator of (9) whenever $u\ge N_1(\theta,F)$ and $\delta<1$:

$$\int_{\theta'\in\Theta} e^{-(u-1)(K(\theta,\theta')+\delta|\theta-\theta'|)}\,\pi(\theta'|y_{s'})\,d\theta' \;\ge\; \int_{\theta'\in B_{1/u^2}(\theta)} e^{-(u-1)\big(K(\theta,\theta')+\delta\sqrt{2K(\theta,\theta')/F''(\theta)}\big)}\,\pi(\theta'|y_{s'})\,d\theta'$$
$$\ge\; \pi\big(B_{1/u^2}(\theta)\,\big|\,y_{s'}\big)\, e^{-\big(1+\sqrt{2/F''(\theta)}\big)}. \qquad (11)$$

Finally we turn our attention to the quantity

$$\pi\big(B_{1/u^2}(\theta)\,\big|\,y_{s'}\big) \;=\; \frac{\int_{B_{1/u^2}(\theta)} p(y_{s'}|\theta')\,\pi_0(\theta')\,d\theta'}{\int_{\Theta} p(y_{s'}|\theta')\,\pi_0(\theta')\,d\theta'} \;=\; \frac{\int_{B_{1/u^2}(\theta)} p(y_{s'}|\theta')\sqrt{F''(\theta')}\,d\theta'}{\int_{\Theta} p(y_{s'}|\theta')\sqrt{F''(\theta')}\,d\theta'}. \qquad (12)$$

Now since the KL divergence is convex in its second argument, we can write $B_{1/u^2}(\theta) = (a,b)$. So, from the convexity of $F$, we deduce that

$$\frac{1}{u^2} = K(\theta,b) = F(b) - \big[F(\theta) + (b-\theta)F'(\theta)\big] = (b-\theta)\left[\frac{F(b)-F(\theta)}{b-\theta} - F'(\theta)\right]$$
$$\le\; (b-\theta)\big[F'(b)-F'(\theta)\big] \;\le\; (b-a)\big[F'(b)-F'(\theta)\big] \;\le\; (b-a)\big[F'(b)-F'(a)\big].$$

As $p(y\,|\,\theta)\to 0$ when $y\to\pm\infty$, the set $C(\theta) = \{y : p(y\,|\,\theta) \ge L(\theta)\}$ is compact. The map $y \mapsto \int_\Theta p(y|\theta')\sqrt{F''(\theta')}\,d\theta' < \infty$ is continuous on the compact $C(\theta)$. Thus, it follows that

$$L'(\theta) = L'(\theta,F) := \sup_{y\,:\,p(y|\theta)>L(\theta)} \left\{\int_\Theta p(y|\theta')\sqrt{F''(\theta')}\,d\theta'\right\} \;<\; \infty$$

is an upper bound on the denominator of (12).
Now by the continuity of $F''$, and the continuity of $(y,\theta)\mapsto p(y|\theta)$ in both coordinates, there exists an $N_2(\theta,F)$ such that for all $u\ge N_2(\theta,F)$,

$$p(y|\theta')\sqrt{F''(\theta')} \;\ge\; \frac{L(\theta)}{2}\sqrt{F''(\theta)}, \qquad \forall\,\theta'\in B_{1/u^2}(\theta),\; y\in C(\theta).$$

Finally, for $u\ge N_2(\theta,F)$, we have a lower bound on the numerator of (12):

$$\int_{B_{1/u^2}(\theta)} p(y_{s'}|\theta')\sqrt{F''(\theta')}\,d\theta' \;\ge\; \frac{L(\theta)}{2}\sqrt{F''(\theta)}\int_a^b d\theta' \;=\; \frac{L(\theta)}{2}\sqrt{\frac{F''(\theta)(b-a)}{F'(b)-F'(a)}}\,\sqrt{\big(F'(b)-F'(a)\big)(b-a)} \;\ge\; \frac{L(\theta)}{2u}\sqrt{\frac{F''(\theta)(b-a)}{F'(b)-F'(a)}},$$

where the last inequality uses $(b-a)(F'(b)-F'(a)) \ge 1/u^2$. Moreover, enlarging $N_2$ if necessary, the continuity of $F''$ guarantees $F'(b)-F'(a) \le 2F''(\theta)(b-a)$ for $u\ge N_2$, so the numerator of (12) is at least $L(\theta)/(2\sqrt{2}\,u)$.
Putting everything together, we get that there exist constants $C_2 = C_2(F,\theta,\Delta)$ and $N(\theta,F) = \max\{N_1,N_2\}$ such that for every $\delta<1$ satisfying $1-\delta C_2>0$, and for every $u\ge N$, one has

$$\mathbb{P}\big(\mu(\theta_u) \ge \mu(\theta)+\Delta \,\big|\, Y_u\big)\,\mathbb{1}_{\tilde{E}_u} \;\le\; 2\sqrt{2}\,e^{1+\sqrt{2/F''(\theta)}}\;\frac{L'(\theta)\,u}{L(\theta)}\;e^{-(u-1)(1-\delta C_2)K(\theta,\mu^{-1}(\mu+\Delta))}.$$

Remark 8. Note that when the prior is proper we do not need to introduce the observation $y_{s'}$, which significantly simplifies the argument. Indeed, in this case, in (10) we can use $\pi_0$ in place of $\pi(\cdot|y_{s'})$, which is already a probability distribution. In particular, the quantity (12) is replaced by $\pi_0\big(B_{1/u^2}(\theta)\big)$, and so the constants $L$ and $L'$ are not needed.

5 Conclusion

We have shown that choosing to use the Jeffreys prior in Thompson Sampling leads to an asymptotically optimal algorithm for bandit models whose rewards belong to a 1-dimensional canonical exponential family. The cornerstone of our proof is a finite-time concentration bound for posterior distributions in exponential families, which, to the best of our knowledge, is new to the literature. With this result we built on previous analyses and avoided Bernoulli-specific arguments. Thompson Sampling with the Jeffreys prior is now a provably competitive alternative to KL-UCB for exponential family bandits. Moreover, our proof holds for slightly more general problems than those for which KL-UCB is provably optimal, including some heavy-tailed exponential family bandits.
Our arguments are potentially generalisable. Notably, generalising to $n$-dimensional exponential family bandits requires only generalising Lemma 3 and Step 3 in the proof of Theorem 4. Our result is asymptotic, but the only stage where the constants are not explicitly derivable from knowledge of $F$, $T$, and $\theta^0$ is in Lemma 9. Future work will investigate these open problems. Another possible future direction lies in the optimal choice of prior distribution.
Our theoretical guarantees only hold for the Jeffreys prior, but a careful examination of our proof shows that the important property is to have, for every $\theta_a$,

$$-\ln\left(\int_{\{\theta' \,:\, K(\theta_a,\theta')\le n^{-2}\}} \pi_0(\theta')\,d\theta'\right) = o(n),$$

which could hold for prior distributions other than the Jeffreys prior.
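For the reader who wants to experiment, the closed forms that drive the analysis above (the divergence $K(\theta,\theta') = F(\theta') - F(\theta) - (\theta'-\theta)F'(\theta)$, the Fisher information $F''(\theta)$, and the Jeffreys prior $\pi_0(\theta)\propto\sqrt{F''(\theta)}$) can be checked numerically. The sketch below instantiates them for the Poisson family, whose log-partition function is $F(\theta)=e^\theta$; all function names are our own:

```python
import math


def F(theta):
    """Log-partition function of the Poisson family in natural parameters."""
    return math.exp(theta)


def dF(theta):
    """F'(theta), i.e. the mean mu(theta); equals exp(theta) for Poisson."""
    return math.exp(theta)


def d2F(theta):
    """F''(theta), i.e. the Fisher information; equals exp(theta) for Poisson."""
    return math.exp(theta)


def kl(theta, theta2):
    """K(theta, theta2) = F(theta2) - F(theta) - (theta2 - theta) F'(theta)."""
    return F(theta2) - F(theta) - (theta2 - theta) * dF(theta)


def jeffreys_density(theta):
    """Unnormalised Jeffreys prior, proportional to sqrt(F''(theta))."""
    return math.sqrt(d2F(theta))


# Sanity checks mirroring the proof: K(theta, .) vanishes at theta, is
# nonnegative by convexity of F, and 2K(theta, theta')/(theta - theta')^2
# approaches F''(theta) as theta' -> theta.
theta = 0.3
assert kl(theta, theta) == 0.0
assert kl(theta, theta + 0.5) > 0 and kl(theta, theta - 0.5) > 0
h = 1e-4
assert abs(2 * kl(theta, theta + h) / h**2 - d2F(theta)) < 1e-3
```

The same template covers any 1-dimensional canonical exponential family by swapping in its own $F$; for instance the Bernoulli family uses $F(\theta)=\ln(1+e^\theta)$.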