{"title": "Optimistic posterior sampling for reinforcement learning: worst-case regret bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 1184, "page_last": 1194, "abstract": "We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\\tilde{O}(D\\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T\\ge S^5A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\\tilde{O}(DS\\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\\Omega(\\sqrt{DSAT})$ for this problem. 
Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.", "full_text": "Optimistic posterior sampling for reinforcement learning: worst-case regret bounds\n\nShipra Agrawal, Columbia University, sa3305@columbia.edu\nRandy Jia, Columbia University, rqj2000@columbia.edu\n\nAbstract\n\nWe present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions and diameter D, when T ≥ S^5 A. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon T. This result improves over the best previously known upper bound of Õ(DS√(AT)) achieved by any algorithm in this setting, and matches the dependence on S in the established lower bound of Ω(√(DSAT)) for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.\n\n1 Introduction\n\nReinforcement Learning (RL) refers to the problem of learning and planning in sequential decision making systems when the underlying system dynamics are unknown, and may need to be learned by trying out different options and observing their outcomes. A typical model for the sequential decision making problem is a Markov Decision Process (MDP), which proceeds in discrete time steps. At each time step, the system is in some state s, and the decision maker may take any available action a to obtain a (possibly stochastic) reward.
The system then transitions to the next state according to a \ufb01xed\nstate transition distribution. The reward and the next state depend on the current state s and the action\na, but are independent of all the previous states and actions. In the reinforcement learning problem,\nthe underlying state transition distributions and/or reward distributions are unknown, and need to be\nlearned using the observed rewards and state transitions, while aiming to maximize the cumulative\nreward. This requires the algorithm to manage the tradeoff between exploration vs. exploitation, i.e.,\nexploring different actions in different states in order to learn the model more accurately vs. taking\nactions that currently seem to be reward maximizing.\nExploration-exploitation tradeoff has been studied extensively in the context of stochastic multi-\narmed bandit (MAB) problems, which are essentially MDPs with a single state. The performance of\nMAB algorithms is typically measured through regret, which compares the total reward obtained by\nthe algorithm to the total expected reward of an optimal action. Optimal regret bounds have been\nestablished for many variations of MAB (see Bubeck et al. [2012] for a survey), with a large majority\nof results obtained using the Upper Con\ufb01dence Bound (UCB) algorithm, or more generally, the\noptimism in the face of uncertainty principle. Under this principle, the learning algorithm maintains\ntight over-estimates (or optimistic estimates) of the expected rewards for individual actions, and\nat any given step, picks the action with the highest optimistic estimate. 
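As a minimal illustration of the optimism principle in the bandit special case, the following is a UCB1 sketch; the arm means and horizon are hypothetical and chosen only for demonstration:

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given arm means.

    Each arm keeps an empirical mean plus a confidence radius; at every
    step the arm with the largest optimistic estimate is played, as in
    the optimism-in-the-face-of-uncertainty principle described above.
    Returns the pull counts of each arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    pulls = [0] * n_arms
    total_reward = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialize its estimate
        else:
            # empirical mean + confidence radius = optimistic estimate
            arm = max(
                range(n_arms),
                key=lambda a: total_reward[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        pulls[arm] += 1
        total_reward[arm] += 1.0 if rng.random() < means[arm] else 0.0
    return pulls

pulls = ucb1([0.3, 0.5, 0.8], horizon=5000)
```

Over a long horizon the confidence radii of suboptimal arms shrink and the best arm is pulled the vast majority of the time.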
More recently, posterior sampling, aka Thompson Sampling [Thompson, 1933], has emerged as another popular algorithm design principle in MAB, owing its popularity to a simple and extendible algorithmic structure, an attractive empirical performance [Chapelle and Li, 2011, Kaufmann et al., 2012], as well as provably optimal performance bounds that have been recently obtained for many variations of MAB [Agrawal and Goyal, 2012, 2013b,a, Russo and Van Roy, 2015, 2014, Bubeck and Liu, 2013]. In this approach, the algorithm maintains a Bayesian posterior distribution for the expected reward of every action; then at any given step, it generates an independent sample from each of these posteriors and takes the action with the highest sample value.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWe consider the reinforcement learning problem with finite states S and finite actions A in a similar regret based framework, where the total reward of the reinforcement learning algorithm is compared to the total expected reward achieved by a single benchmark policy over a time horizon T. In our setting, the benchmark policy is the infinite-horizon undiscounted average reward optimal policy for the underlying MDP, under the assumption that the MDP is communicating with (unknown) finite diameter D. The diameter D is an upper bound on the time it takes to move from any state s to any other state s′ using an appropriate policy, for each pair s, s′. A finite diameter is understood to be necessary for interesting bounds on the regret of any algorithm in this setting [Jaksch et al., 2010]. The UCRL2 algorithm of Jaksch et al. [2010], which is based on the optimism principle, achieved the best previously known upper bound of Õ(DS√(AT)) for this problem. A similar bound was achieved by Bartlett and Tewari [2009], though assuming knowledge of the diameter D. Jaksch et al. [2010] also established a worst-case lower bound of Ω(√(DSAT)) on the regret of any algorithm for this problem.\n\nOur main contribution is a posterior sampling based algorithm with a high probability worst-case regret upper bound of Õ(D√(SAT) + D S^{7/4} A^{3/4} T^{1/4}), which is Õ(D√(SAT)) when T ≥ S^5 A. This improves the previously best known upper bound for this problem by a factor of √S, and matches the dependence on S in the lower bound, for large enough T.\n\nOur algorithm uses an 'optimistic version' of the posterior sampling heuristic, while utilizing several ideas from the algorithm design structure in Jaksch et al. [2010], such as an epoch based execution and the extended MDP construction. The algorithm proceeds in epochs, where in the beginning of every epoch, it generates ψ = Õ(S) sample transition probability vectors from a posterior distribution for every state and action, and solves an extended MDP with ψA actions and S states formed using these samples. The optimal policy computed for this extended MDP is used throughout the epoch.\n\nThe Posterior Sampling for Reinforcement Learning (PSRL) approach has been used previously in Osband et al. [2013], Abbasi-Yadkori and Szepesvari [2014], Osband and Van Roy [2016], but in a Bayesian regret framework. Bayesian regret is defined as the expected regret over a known prior on the transition probability matrix. Osband and Van Roy [2016] demonstrate an Õ(H√(SAT)) bound on the expected Bayesian regret for PSRL in finite-horizon episodic Markov decision processes, when the episode length is H. In this paper, we consider the stronger notion of worst-case regret, aka minimax regret, which requires bounding the maximum regret for any instance of the problem.
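For intuition, the posterior sampling principle in its simplest Beta-Bernoulli bandit form can be sketched as follows; this is a toy illustration with hypothetical arm means, whereas the algorithm in this paper samples transition probability vectors rather than reward means:

```python
import random

def thompson_bernoulli(means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a toy bandit.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
    mean reward; at every step one sample is drawn from each posterior and
    the arm with the largest sampled value is played.
    Returns the pull counts of each arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    succ = [0] * n_arms
    fail = [0] * n_arms
    for _ in range(horizon):
        samples = [rng.betavariate(succ[a] + 1, fail[a] + 1) for a in range(n_arms)]
        arm = samples.index(max(samples))
        if rng.random() < means[arm]:
            succ[arm] += 1
        else:
            fail[arm] += 1
    return [succ[a] + fail[a] for a in range(n_arms)]

pulls = thompson_bernoulli([0.2, 0.6, 0.9], horizon=3000)
```

As the posteriors of suboptimal arms concentrate below the best arm's posterior, their samples win less and less often, so exploration decays automatically.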
Further, we consider a non-episodic communicating MDP setting, and produce a comparable bound of Õ(D√(SAT)) for large T, where D is the unknown diameter of the communicating MDP. In comparison to a single sample from the posterior in PSRL, our algorithm is slightly inefficient as it uses multiple (Õ(S)) samples. It is not entirely clear if the extra samples are only an artifact of the analysis. In an empirical study of a multiple-sample version of posterior sampling for RL, Fonteneau et al. [2013] show that multiple samples can potentially improve the performance of posterior sampling in terms of the probability of taking the optimal decision. Our analysis utilizes some ideas from the Bayesian regret analysis, most importantly the technique of stochastic optimism from Osband et al. [2014] for deriving tighter deviation bounds. However, bounding the worst-case regret requires several new technical ideas, in particular for proving 'optimism' of the gain of the sampled MDP. Further discussion is provided in Section 4.\n\nWe should also compare our result with the very recent result of Azar et al. [2017], which provides an optimistic version of the value-iteration algorithm with a minimax (i.e., worst-case) regret bound of Õ(√(HSAT)) when T ≥ H^3 S^3 A. However, the setting considered in Azar et al. [2017] is that of an episodic MDP, where the learning agent interacts with the system in episodes of fixed and known length H. The initial state of each episode can be arbitrary, but importantly, the sequence of these initial states is shared by the algorithm and any benchmark policy. In contrast, in the non-episodic setting considered in this paper, the state trajectory of the benchmark policy over T time steps can be completely different from the algorithm's trajectory. To the best of our understanding, the shared sequence of initial states and the fixed known length H of episodes seem to form crucial components of the analysis in Azar et al. [2017], making it difficult to extend their analysis to the non-episodic communicating MDP setting considered in this paper.\n\nAmong other related work, Burnetas and Katehakis [1997] and Tewari and Bartlett [2008] present optimistic linear programming approaches that achieve logarithmic regret bounds with problem-dependent constants. Strong PAC bounds have been provided in Kearns and Singh [1999], Brafman and Tennenholtz [2002], Kakade et al. [2003], Asmuth et al. [2009], Dann and Brunskill [2015]. There, the aim is to bound the performance of the policy learned at the end of the learning horizon, and not the performance during learning as quantified here by regret. Notably, the BOSS algorithm proposed in Asmuth et al. [2009] is similar to the algorithm proposed here in the sense that the former also takes multiple samples from the posterior to form an extended (referred to as merged) MDP.\n\n1 Worst-case regret is a strictly stronger notion of regret in case the reward distribution function is known and only the transition probability distribution is unknown, as we will assume here for the most part. In case of an unknown reward distribution, extending our worst-case regret bounds would require an assumption of bounded rewards, whereas the Bayesian regret bounds in the above-mentioned literature allow more general (known) priors on the reward distributions with possibly unbounded support. Bayesian regret bounds in those more general settings are incomparable to the worst-case regret bounds presented here.
Strehl and Littman [2005, 2008] provide an optimistic algorithm for bounding regret in a discounted reward setting, but the definition of regret is slightly different in that it measures the difference between the rewards of an optimal policy and the rewards of the learning algorithm along the trajectory taken by the learning algorithm.\n\n2 Preliminaries and Problem Definition\n\n2.1 Markov Decision Process (MDP)\n\nWe consider a Markov Decision Process M defined by the tuple {S, A, P, r, s_1}, where S is a finite state-space of size S, A is a finite action-space of size A, P : S × A → Δ^S is the transition model, r : S × A → [0, 1] is the reward function, and s_1 is the starting state. When an action a ∈ A is taken in a state s ∈ S, a reward r_{s,a} is generated and the system transitions to the next state s′ ∈ S with probability P_{s,a}(s′), where Σ_{s′∈S} P_{s,a}(s′) = 1.\n\nWe consider 'communicating' MDPs with finite 'diameter' (see Bartlett and Tewari [2009] for an in-depth discussion). Below we define communicating MDPs, and recall some useful known results for such MDPs.\n\nDefinition 1 (Policy). A deterministic policy π : S → A is a mapping from the state space to the action space.\n\nDefinition 2 (Diameter D(M)). The diameter D(M) of an MDP M is defined as the minimum time required to go from one state to another in the MDP using some deterministic policy:\n\nD(M) = max_{s ≠ s′, s, s′ ∈ S} min_{π : S → A} T^π_{s→s′},\n\nwhere T^π_{s→s′} is the expected number of steps it takes to reach state s′ when starting from state s and using policy π.\n\nDefinition 3 (Communicating MDP). An MDP M is communicating if and only if it has a finite diameter. That is, for any two states s ≠ s′, there exists a policy π such that the expected number of steps to reach s′ from s, T^π_{s→s′}, is at most D, for some finite D ≥ 0.\n\nDefinition 4 (Gain of a policy). The gain of a policy π, from starting state s_1 = s, is defined as the infinite-horizon undiscounted average reward, given by\n\nλ^π(s) = E[ lim_{T→∞} (1/T) Σ_{t=1}^T r_{s_t, π(s_t)} | s_1 = s ],\n\nwhere s_t is the state reached at time t.\n\nLemma 2.1 (Optimal gain for communicating MDPs). For a communicating MDP M with diameter D:\n\n(a) (Puterman [2014], Theorem 8.1.2, Theorem 8.3.2) The optimal (maximum) gain λ* is state-independent and is achieved by a deterministic stationary policy π*, i.e., there exists a deterministic policy π* such that\n\nλ* := max_{s′∈S} max_π λ^π(s′) = λ^{π*}(s), ∀s ∈ S.\n\nHere, π* is referred to as an optimal policy for MDP M.\n\n(b) (Tewari and Bartlett [2008], Theorem 4) The optimal gain λ* satisfies the following equations:\n\nλ* = min_{h∈R^S} max_{s,a} { r_{s,a} + P_{s,a}^T h − h_s } = max_a { r_{s,a} + P_{s,a}^T h* − h*_s }, ∀s,   (1)\n\nwhere h*, referred to as the bias vector of MDP M, satisfies max_s h*_s − min_s h*_s ≤ D.\n\nGiven the above definitions and results, we can now define the reinforcement learning problem studied in this paper.\n\n2.2 The reinforcement learning problem\n\nThe reinforcement learning problem proceeds in rounds t = 1, . . . , T. The learning agent starts from a state s_1 at round t = 1.
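Definitions 2 and 4 can be made concrete with a small worked example; the two-state, two-action MDP below is hypothetical and not from the paper:

```python
# A hypothetical 2-state, 2-action MDP, used only to illustrate the definitions.
# P[s][a] is the next-state distribution [P(0|s,a), P(1|s,a)]; r[s][a] is the reward.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.3, 0.7], 1: [0.6, 0.4]}}
r = {0: {0: 1.0, 1: 0.0},
     1: {0: 0.0, 1: 0.5}}

def gain(policy):
    """Gain (Definition 4) of a deterministic policy, computed from the
    stationary distribution of the induced 2-state chain [[1-p, p], [q, 1-q]]."""
    p = P[0][policy[0]][1]  # probability of moving 0 -> 1 under the policy
    q = P[1][policy[1]][0]  # probability of moving 1 -> 0 under the policy
    mu0, mu1 = q / (p + q), p / (p + q)
    return mu0 * r[0][policy[0]] + mu1 * r[1][policy[1]]

def diameter():
    """Diameter (Definition 2): with two states, the hitting time s -> s' under
    a stationary action is geometric, so the best policy achieves 1 / max_a P(s'|s,a)."""
    t01 = 1.0 / max(P[0][a][1] for a in (0, 1))
    t10 = 1.0 / max(P[1][a][0] for a in (0, 1))
    return max(t01, t10)
```

Here the policy (action 0 in state 0, action 1 in state 1) achieves gain 13/14, better than always playing action 0 (gain 3/4), and the diameter works out to 5/3, consistent with Lemma 2.1's claim that the optimal gain is attained by a single deterministic stationary policy.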
In the beginning of every round t, the agent takes an action a_t ∈ A and observes the reward r_{s_t,a_t} as well as the next state s_{t+1} ∼ P_{s_t,a_t}, where r and P are the reward function and the transition model, respectively, for a communicating MDP M with diameter D.\n\nThe learning agent knows the state-space S, the action space A, as well as the rewards r_{s,a} for all s ∈ S, a ∈ A, for the underlying MDP, but not the transition model P or the diameter D. (The assumption of known and deterministic rewards has been made here only for simplicity of exposition, since the unknown transition model is the main source of difficulty in this problem. Our algorithm and results can be extended to bounded stochastic rewards with unknown distributions using standard Thompson Sampling for MAB, e.g., using the techniques in Agrawal and Goyal [2013b].)\n\nThe agent can use the past observations to learn the underlying MDP model and decide future actions. The goal is to maximize the total reward Σ_{t=1}^T r_{s_t,a_t}, or equivalently, minimize the total regret over a time horizon T, defined as\n\nR(T, M) := T λ* − Σ_{t=1}^T r_{s_t,a_t},   (2)\n\nwhere λ* is the optimal gain of MDP M.\n\nWe present an algorithm for the learning agent with a near-optimal upper bound on the regret R(T, M) for any communicating MDP M with diameter D, thus bounding the worst-case regret over this class of MDPs.\n\n3 Algorithm Description\n\nOur algorithm combines the ideas of posterior sampling (aka Thompson Sampling) with the extended MDP construction used in Jaksch et al. [2010]. Below we describe the main components of our algorithm.\n\nSome notation: N^t_{s,a} denotes the total number of times the algorithm visited state s and played action a before time t, and N^t_{s,a}(i) denotes the number of time steps among these N^t_{s,a} steps where the next state was i, i.e., a transition from state s to i was observed. We index the states from 1 to S, so that Σ_{i=1}^S N^t_{s,a}(i) = N^t_{s,a} for any t. We use the symbol 1 to denote the vector of all 1s, and 1_i to denote the vector with 1 at the ith coordinate and 0 elsewhere.\n\nDoubling epochs: Our algorithm uses the epoch based execution framework of Jaksch et al. [2010]. An epoch is a group of consecutive rounds. The rounds t = 1, . . . , T are broken into consecutive epochs as follows: the kth epoch begins at the round τ_k immediately after the end of the (k − 1)th epoch, and ends at the first round τ such that N^τ_{s,a} ≥ 2N^{τ_k}_{s,a} for some state-action pair s, a. The algorithm computes a new policy π̃_k at the beginning of every epoch k, and uses that policy through all the rounds in that epoch. It is easy to observe that irrespective of how the policy π̃_k is computed, the number of epochs in T rounds is bounded by SA log(T).\n\nPosterior Sampling: We use posterior sampling to compute the policy π̃_k in the beginning of every epoch. The Dirichlet distribution is a convenient choice for maintaining posteriors for the transition probability vectors P_{s,a} for every s ∈ S, a ∈ A, as it satisfies the following useful property: given a prior Dirichlet(α_1, . . . , α_S) on P_{s,a}, after observing a transition from state s to i (with underlying probability P_{s,a}(i)), the posterior distribution is given by Dirichlet(α_1, . . . , α_i + 1, . . . , α_S). By this property, for any s ∈ S, a ∈ A, on starting from the prior Dirichlet(1) for P_{s,a}, the posterior at time t is Dirichlet({N^t_{s,a}(i) + 1}_{i=1,...,S}).\n\nOur algorithm uses a modified, optimistic version of this approach. At the beginning of every epoch k, for every s ∈ S, a ∈ A such that N_{s,a} ≥ η, it generates multiple samples for P_{s,a} from a 'boosted' posterior.
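The basic (un-boosted) conjugate update just described, from the uniform prior Dirichlet(1) to the posterior Dirichlet({N^t_{s,a}(i) + 1}), can be sketched as follows; the transition counts are hypothetical, and the Dirichlet draw uses the standard normalized-Gamma construction:

```python
import random

def dirichlet_sample(alpha, rng):
    """One draw from Dirichlet(alpha), via normalized independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def posterior_params(counts):
    """Conjugate update from the text: starting from the uniform prior
    Dirichlet(1, ..., 1) on P_{s,a}, after observing counts[i] transitions
    from s to state i, the posterior is Dirichlet(counts[i] + 1)."""
    return [c + 1 for c in counts]

rng = random.Random(0)
counts = [8, 1, 1]                     # hypothetical transition counts for one (s, a)
alpha = posterior_params(counts)       # Dirichlet(9, 2, 2)
sample = dirichlet_sample(alpha, rng)  # one posterior draw of the vector P_{s,a}
```

Each draw is a valid probability vector whose mass concentrates around the empirical transition frequencies as the counts grow; the boosted version used by the algorithm rescales these counts before sampling.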
Specifically, it generates ψ = O(S log(SA/ρ)) independent sample probability vectors Q^{1,k}_{s,a}, . . . , Q^{ψ,k}_{s,a} as\n\nQ^{j,k}_{s,a} ∼ Dirichlet(M^{τ_k}_{s,a}),\n\nwhere M^t_{s,a} denotes the vector [M^t_{s,a}(i)]_{i=1,...,S}, with\n\nM^t_{s,a}(i) := (1/κ)(N^t_{s,a}(i) + ω), for i = 1, . . . , S.   (3)\n\nHere, κ = O(log(T/ρ)), ω = O(log(T/ρ)), η = √(TS/A) + 12ωS^2, and ρ ∈ (0, 1) is a parameter of the algorithm. In the regret analysis, we derive sufficiently large constants that can be used in the definition of ψ, κ, ω to guarantee the bounds. However, no attempt has been made to optimize those constants, and it is likely that much smaller constants suffice.\n\nFor every remaining s, a, i.e., those with small N_{s,a} (N_{s,a} < η), the algorithm uses a simple optimistic sampling described in Algorithm 1. This special sampling for s, a with small N_{s,a} has been introduced to handle a technical difficulty in analyzing the anti-concentration of Dirichlet posteriors when the parameters are very small. We suspect that with an improved analysis, this may not be required.\n\nExtended MDP: The policy π̃_k to be used in epoch k is computed as the optimal policy of an extended MDP M̃_k defined by the sampled transition probability vectors, using the construction of Jaksch et al. [2010]. Given sampled vectors Q^{j,k}_{s,a}, j = 1, . . .
, ψ, for every state-action pair s, a, we define the extended MDP M̃_k by extending the original action space as follows: for every s, a, create ψ actions corresponding to every action a ∈ A, denoting by a_j the action corresponding to action a and sample j; then, in MDP M̃_k, on taking action a_j in state s, the reward is r_{s,a}, but the transition to the next state follows the transition probability vector Q^{j,k}_{s,a}.\n\nNote that the algorithm uses the optimal policy π̃_k of the extended MDP M̃_k to take actions in the action space A, which is technically different from the action space of MDP M̃_k, where the policy π̃_k is defined. We slightly abuse notation and say that the algorithm takes action a_t = π̃_k(s_t) to mean that the algorithm takes action a_t = a ∈ A when π̃_k(s_t) = a_j for some j.\n\nOur algorithm is summarized as Algorithm 1.\n\n4 Regret Bounds\n\nWe prove the following bound on the regret of Algorithm 1 for the reinforcement learning problem.\n\nTheorem 1. For any communicating MDP M with S states, A actions, and diameter D, with probability 1 − ρ, the regret of Algorithm 1 in time T ≥ C D A log^2(T/ρ) is bounded as\n\nR(T, M) ≤ Õ( D√(SAT) + D S^{7/4} A^{3/4} T^{1/4} + D S^{5/2} A ),\n\nwhere C is an absolute constant. For T ≥ S^5 A, this implies a regret bound of\n\nR(T, M) ≤ Õ( D√(SAT) ).\n\nHere Õ hides logarithmic factors in S, A, T, ρ and absolute constants.\n\nThe rest of this section is devoted to proving the above theorem.
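Before the proof, the 'simple optimistic sampling' step of Algorithm 1 for rarely visited state-action pairs can be sketched in code; this is a minimal illustration whose constants follow the displayed formulas, not the tuned constants of the analysis:

```python
import math
import random

def simple_optimistic_sample(counts, rng):
    """Sketch of the 'simple optimistic sampling' of Algorithm 1 for a pair
    (s, a) with few visits; the constants mirror the formulas shown in the
    text, but the exact constants used in the paper's analysis may differ.

    The empirical distribution P_hat is shrunk coordinate-wise by a slack
    Delta_i, and the freed probability mass is placed on a single uniformly
    random coordinate (the vector z), yielding an optimistic vector Q.
    """
    n = sum(counts)
    n_states = len(counts)
    log_term = math.log(4 * n_states)
    p_hat = [c / n for c in counts]
    delta = [min(math.sqrt(3.0 * p * log_term / n) + 3.0 * log_term / n, p)
             for p in p_hat]
    p_minus = [p - d for p, d in zip(p_hat, delta)]
    q = list(p_minus)
    q[rng.randrange(n_states)] += 1.0 - sum(p_minus)  # z uniform over {1_1, ..., 1_S}
    return q

rng = random.Random(0)
q = simple_optimistic_sample([5, 3, 2], rng)
```

Because each Delta_i is capped at the empirical probability itself, the shrunk vector stays nonnegative and the output is always a valid probability vector; with very few observations most of the mass ends up on the randomly chosen optimistic coordinate.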
Here, we provide a sketch of the\nproof and discuss some of the key lemmas, all missing details are provided in the supplementary\nmaterial.\n\n5\n\n\fAlgorithm 1 A posterior sampling based algorithm for the reinforcement learning problem\n\nInputs: State space S, Action space A, starting state s1, reward function r, time horizon\nT , parameters \u03c1 \u2208 (0, 1], \u03c8 = O(S log(SA/\u03c1)), \u03c9 = O(log(T /\u03c1)), \u03ba = O(log(T /\u03c1)), \u03b7 =\n\n(cid:113) T S\n\nA + 12\u03c9S2.\n\nInitialize: \u03c4 1 := 1, M\u03c41\ns,a = \u03c91.\nfor all epochs k = 1, 2, . . . , do\n\nSample transition probability vectors: For each s, a, generate \u03c8 independent sample probability\nvectors Qj,k\ns,a \u2265 \u03b7, use samples from the Dirichlet\n\ns,a, j = 1, . . . , \u03c8, as follows:\n(Posterior sampling): For s, a such that N \u03c4k\ndistribution:\n\n\u2022\n\ns,a \u223c Dirichlet(M\u03c4k\nQj,k\n\ns,a),\n\n\u2022\n\n(Simple optimistic sampling): For remaining s, a, with N \u03c4k\nsimple optimistic sampling: let\n\ns,a < \u03b7, use the following\n\n\u03c4k\ns,a(i)\nwhere \u02c6Ps,a(i) = N\n, \u02c6Ps,a(i)\nN\nlet z be a random vector picked uniformly at random from {11, . . . , 1S}; set\n\n, and \u2206i = min\n\n+ 3 log(4S)\n\n3 \u02c6Ps,a(i) log(4S)\n\n\u03c4k\ns,a\n\nN\n\n\u03c4k\ns,a\n\nN\n\n\u03c4k\ns,a\n\n, and\n\ns,a = \u02c6Ps,a \u2212 \u2206,\nP \u2212\n\n(cid:26)(cid:114)\ns,a + (1 \u2212(cid:80)S\n\ns,a = P \u2212\nQj,k\n\ni=1 P \u2212\n\ns,a(i))z.\n\n(cid:27)\n\ns,a, j = 1, . . . , \u03c8, s \u2208 S, a \u2208 A}.\n\nCompute policy \u02dc\u03c0k: as the optimal gain policy for extended MDP \u02dcMk constructed using sample\nset {Qj,k\nExecute policy \u02dc\u03c0k:\nfor all time steps t = \u03c4k, \u03c4k + 1, . . . 
, until break epoch do\n\nPlay action at = \u02dc\u03c0k(st).\nObserve the transition to the next state st+1.\nSet N t+1\ns,a (i), M t+1\n\u2265 2N \u03c4k\nIf N t+1\nst,at\n\ns,a (i) for all a \u2208 A, s, i \u2208 S as de\ufb01ned (refer to Equation (3)).\nst,at, then set \u03c4k+1 = t + 1 and break epoch.\n\nend for\n\nend for\n\n4.1 Proof of Theorem 1\n\nAs de\ufb01ned in Section 2, regret R(T,M) is given by R(T,M) = T \u03bb\u2217 \u2212(cid:80)T\n\nt=1 rst,at, where \u03bb\u2217 is\nthe optimal gain of MDP M, at is the action taken and st is the state reached by the algorithm at\ntime t. Algorithm 1 proceeds in epochs k = 1, 2, . . . , K, where K \u2264 SA log(T ). To bound its regret\nin time T , we \ufb01rst analyze the regret in each epoch k, namely,\n\nRk := (\u03c4k+1 \u2212 \u03c4k)\u03bb\u2217 \u2212(cid:80)\u03c4k+1\u22121\n\nt=\u03c4k\n\nrst,at,\n\nand bound Rk by roughly\n\nD\n\n(cid:88)\n\ns,a \u2212 N \u03c4k\nN \u03c4k+1\n\ns,a\n\n(cid:112)N \u03c4k\n\ns,a\n\ns,a\nwhere, by de\ufb01nition, for every s, a, (N \u03c4k+1\nvisited in epoch k. The proof of this bound has two main components:\n\ns,a \u2212 N \u03c4k\n\ns,a) is the number of times this state-action pair is\n\n(a) Optimism: The policy \u02dc\u03c0k used by the algorithm in epoch k is computed as an optimal gain policy\nof the extended MDP \u02dcMk. The \ufb01rst part of the proof is to show that with high probability, the\nextended MDP \u02dcMk is (i) a communicating MDP with diameter at most 2D, and (ii) optimistic,\ni.e., has optimal gain at least (close to) \u03bb\u2217. Part (i) is stated as Lemma 4.1, with a proof provided\nin the supplementary material. Now, let \u02dc\u03bbk be the optimal gain of the extended MDP \u02dcMk. 
In Lemma 4.2, which forms one of the main novel technical components of our proof, we show that with probability 1 − ρ,\n\nλ̃_k ≥ λ* − Õ(D√(SA/T)).\n\nWe first show that the above holds if, for every s, a, there exists a sample transition probability vector whose projection on a fixed unknown vector (h*) is optimistic. Then, in Lemma 4.3 we prove this optimism by deriving a fundamental new result on the anti-concentration of any fixed projection of a Dirichlet random vector (Proposition A.1 in the supplementary material).\n\nSubstituting this upper bound on λ*, we have the following bound on R_k with probability 1 − ρ:\n\nR_k ≤ Σ_{t=τ_k}^{τ_{k+1}−1} ( λ̃_k − r_{s_t,a_t} + Õ(D√(SA/T)) ).   (4)\n\n(b) Deviation bounds: Optimism guarantees that with high probability, the optimal gain λ̃_k for MDP M̃_k is at least λ*. And, by definition of π̃_k, λ̃_k is the gain of the chosen policy π̃_k for MDP M̃_k. However, the algorithm executes this policy on the true MDP M. The only difference between the two is the transition model: on taking an action a_j := π̃_k(s) in state s in MDP M̃_k, the next state follows the sampled distribution\n\nP̃_{s,a} := Q^{j,k}_{s,a},   (5)\n\nwhereas on taking the corresponding action a in MDP M, the next state follows the distribution P_{s,a}. The next step is to bound the difference between λ̃_k and the average reward obtained by the algorithm by bounding the deviation (P̃_{s,a} − P_{s,a}). This line of argument bears similarities to the analysis of UCRL2 in Jaksch et al. [2010], but with tighter deviation bounds that we are able to guarantee due to the use of posterior sampling instead of the deterministic optimistic bias used in UCRL2.
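The bias-weighted deviation terms of the form max over h in [0, 2D]^S of (Q − P)^T h that this argument controls have a simple exact value when Q and P are probability vectors; the following sketch (with hypothetical vectors, not the paper's bound itself) checks the identity numerically:

```python
def max_bias_weighted_deviation(q, p, d):
    """Evaluate max over h in [0, 2D]^S of (q - p)^T h for two probability
    vectors q, p and bias range D, in two equivalent ways.

    The maximizing h puts 2D on every coordinate where q exceeds p and 0
    elsewhere; since both vectors sum to one, the positive part of (q - p)
    equals half its L1 norm, so the maximum is exactly D * ||q - p||_1.
    """
    direct = sum(2.0 * d * max(qi - pi, 0.0) for qi, pi in zip(q, p))
    via_l1 = d * sum(abs(qi - pi) for qi, pi in zip(q, p))
    return direct, via_l1

direct, via_l1 = max_bias_weighted_deviation(
    [0.5, 0.3, 0.2], [0.4, 0.4, 0.2], d=2.0)
```

This is why bounding the L1 distance between the sampled and true transition vectors, scaled by the bias range, suffices to control the per-step regret contribution.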
Now, since a_t = π̃_k(s_t), using the relation between the gain λ̃_k, the bias vector h̃, and the reward vector of the optimal policy π̃_k for the communicating MDP M̃_k (refer to Lemma 2.1),\n\nΣ_{t=τ_k}^{τ_{k+1}−1} ( λ̃_k − r_{s_t,a_t} ) = Σ_{t=τ_k}^{τ_{k+1}−1} ( P̃_{s_t,a_t} − 1_{s_t} )^T h̃ = Σ_{t=τ_k}^{τ_{k+1}−1} ( P̃_{s_t,a_t} − P_{s_t,a_t} + P_{s_t,a_t} − 1_{s_t} )^T h̃,   (6)\n\nwhere, with high probability, h̃ ∈ R^S, the bias vector of MDP M̃_k, satisfies max_s h̃_s − min_s h̃_s ≤ D(M̃_k) ≤ 2D (refer to Lemma 4.1).\n\nNext, we bound the deviation (P̃_{s,a} − P_{s,a})^T h̃ for all s, a, to bound the first term above. Note that h̃ is random and can be arbitrarily correlated with P̃; therefore, we need to bound max_{h∈[0,2D]^S} (P̃_{s,a} − P_{s,a})^T h. (For the above term, w.l.o.g. we can assume h̃ ∈ [0, 2D]^S.)\n\nFor s, a such that N^{τ_k}_{s,a} > η, P̃_{s,a} = Q^{j,k}_{s,a} is a sample from the Dirichlet posterior. In Lemma 4.4, we show that with high probability,\n\nmax_{h∈[0,2D]^S} (P̃_{s,a} − P_{s,a})^T h ≤ Õ( D/√(N^{τ_k}_{s,a}) + DS/N^{τ_k}_{s,a} ).   (7)\n\nThis bound is an improvement by a √S factor over the corresponding deviation bound obtainable for the optimistic estimates of P_{s,a} in UCRL2. The derivation of this bound utilizes and extends the stochastic optimism technique from Osband et al. [2014].
For s, a with N \u03c4k\ns,a\nis a sample from the simple optimistic sampling, where we can only show the following weaker\nbound, but since this is used only while N \u03c4k\ns,a is small, the total contribution of this deviation will\nbe small:\n\n(cid:32)\n\n(cid:115)\n\nmax\n\nS\nN \u03c4k\ns,a\nFinally, to bound the second term in (6), we observe that E[1T\nAzuma-Hoeffding inequality to obtain with probability (1 \u2212 \u03c1\n\ns,a \u2212 Ps,a)T h \u2264 \u02dcO\n\nh\u2208[0,2D]S\n\n( \u02dcP k\n\nD\n\n(cid:80)\u03c4k+1\u22121\n\nt=\u03c4k\n\n(Pst,at \u2212 1st)T \u02dch \u2264 O((cid:112)(\u03c4k+1 \u2212 \u03c4k) log(SA/\u03c1)).\n\nst+1\nSA ):\n\n+\n\n.\n\nDS\nN \u03c4k\ns,a\n\u02dch|\u02dc\u03c0k, \u02dch, st] = P T\n\nst,at\n\n(8)\n\n\u02dch and use\n\n(9)\n\n(cid:33)\n\n7\n\n\fCombining the above observations (equations (4), (6), (7), (8), (9)), we obtain the following bound\non Rk within logarithmic factors:\ns,a \u2212 N \u03c4k\nN \u03c4k+1\n\n(cid:114)\n\n(cid:16)\n\n\u221a\n\n(cid:88)\n\n+D(cid:112)\u03c4k+1 \u2212 \u03c4k.\n\n(cid:17)\ns,a \u2264 \u03b7)\n\nD(\u03c4k+1\u2212\u03c4k)\n\n+D\n\n1(N \u03c4k+1\n\ns,a > \u03b7) +\n\nS1(N \u03c4k+1\n\ns,a\n\n(cid:112)N \u03c4k\n\ns,a\n\nSA\nT\n\ns,a\n\n(10)\nWe can \ufb01nish the proof by observing that (by de\ufb01nition of an epoch) the number of visits of any\nstate-action pair can at most double in an epoch,\n\ns,a \u2212 N \u03c4k\nN \u03c4k+1\n\ns,a \u2264 N \u03c4k\ns,a,\n\nand therefore, substituting this observation in (10), we can bound (within logarithmic factors) the\n\ntotal regret R(T ) =(cid:80)K\nK(cid:80)\n\nk=1 Rk as:\n\nD(\u03c4k+1 \u2212 \u03c4k)\n\n(cid:32)\n(cid:113) SA\nT + D (cid:80)\n(cid:112)N \u03c4K\nSAT + D log(K)((cid:80)\ns,a and(cid:80)\ns,a \u2264 T , by simple worst scenario analysis,(cid:80)\n\ns,a \u2264 2N \u03c4k\n\n\u03c4k\ns,a>\u03b7\n\ns,a:N\n\ns,a\n\nwhere we used N \u03c4k+1\n\u221a\nand SA\n\n(cid:80)\n\ns,a N \u03c4K\n\nk=1\n\n\u221a\n\n\u2264 D\n\nR(T,M) 
\u2264 \u02dcO(D\n\n4.2 Main lemmas\n\n\u221a\n\n(cid:33)\n\n\u221a\n\n\u03c4k+1 \u2212 \u03c4k\n\ns,a + D\n\n(cid:112)N \u03c4k\ns,a + D (cid:80)\n\n\u221a\ns,a ) + D log(K)(SA\n\ns,a:N\n\n\u221a\n\n\u03c4k\ns,a<\u03b7\n\n(cid:112)SN \u03c4k\n(cid:113) T S\n(cid:112)N \u03c4K\ns,a \u2264 \u221a\n\nS\u03b7) + D\n\nKT\n\nk(\u03c4k+1 \u2212 \u03c4k) = T . Now, we use that K \u2264 SA log(T ),\nA + 12\u03c9S2). Also, since\n\ns,a\n\nSAT , and we obtain,\n\nS\u03b7 = O(S7/4A3/4T 1/4 + S5/2A log(T /\u03c1)) (using \u03b7 =\n\nSAT + DS7/4A3/4T 1/4 + DS5/2A).\n\nFollowing lemma form the main technical components of our proof. All the missing proofs are\nprovided in the supplementary material.\nLemma 4.1. Assume T \u2265 CDA log2(T /\u03c1) for a large enough constant C. Then, with probability\n1 \u2212 \u03c1, for every epoch k, the diameter of MDP \u02dcMk is bounded by 2D.\nLemma 4.2. With probability 1 \u2212 \u03c1, for every epoch k, the optimal gain \u02dc\u03bbk of the extended MDP\n\u02dcMk satis\ufb01es:\n\n(cid:16)\n\n(cid:113) SA\n\n(cid:17)\n\n,\n\nT\n\nD log2(T /\u03c1)\nwhere \u03bb\u2217 the optimal gain of MDP M and D is the diameter.\n\n\u02dc\u03bbk \u2265 \u03bb\u2217 \u2212 O\n\nProof. Let h\u2217 be the bias vector for an optimal policy \u03c0\u2217 of MDP M (refer to Lemma 2.1 in the\npreliminaries section). Since h\u2217 is a \ufb01xed (though unknown) vector with |hi \u2212 hj| \u2264 D, we can\napply Lemma 4.3 to obtain that with probability 1 \u2212 \u03c1, for all s, a, there exists a sample vector Qj,k\nfor some j \u2208 {1, . . . , \u03c8} such that\n\ns,a\n\n(cid:16)\n\n(cid:113) SA\n\n(cid:17)\n\ns,a)T h\u2217 \u2265 P T\n\ns,ah\u2217 \u2212 \u03b4\n\nT\n\nD log2(T /\u03c1)\n\n(Qj,k\n. Now, consider the policy \u03c0 for MDP \u02dcMk which for any s, takes\nwhere \u03b4 = O\naction aj, with a = \u03c0\u2217(s) and j being a sample satisfying above inequality. 
Let $Q_\pi$ be the transition matrix for this policy, whose rows are formed by the vectors $Q^{j,k}_{s,\pi^*(s)}$, and let $P_{\pi^*}$ be the transition matrix whose rows are formed by the vectors $P_{s,\pi^*(s)}$. The above implies $Q_\pi h^* \ge P_{\pi^*} h^* - \delta \mathbf{1}$. We use this inequality, along with the known relations between the gain and the bias of an optimal policy in communicating MDPs, to obtain that the gain $\tilde{\lambda}(\pi)$ of policy $\pi$ for MDP $\tilde{M}_k$ satisfies $\tilde{\lambda}(\pi) \ge \lambda^* - \delta$ (details provided in the supplementary material), which proves the lemma statement since, by optimality, $\tilde{\lambda}_k \ge \tilde{\lambda}(\pi)$.

Lemma 4.3. (Optimistic Sampling) Fix any vector $h \in \mathbb{R}^S$ such that $|h_i - h_{i'}| \le D$ for any $i, i'$, and any epoch $k$. Then, for every $s,a$, with probability $1 - \frac{\rho}{SA}$ there exists at least one $j$ such that
$$(Q^{j,k}_{s,a})^T h \ge P^T_{s,a} h - O\left( D \log^2(T/\rho) \sqrt{\frac{SA}{T}} \right).$$

Lemma 4.4. (Deviation bound) With probability $1 - \rho$, for all epochs $k$, all samples $j$, and all $s,a$:
$$\max_{h \in [0,2D]^S} (Q^{j,k}_{s,a} - P_{s,a})^T h \le \begin{cases} O\left( D\sqrt{\dfrac{\log(SAT/\rho)}{N^{\tau_k}_{s,a}}} + \dfrac{DS \log(SAT/\rho)}{N^{\tau_k}_{s,a}} \right), & N^{\tau_k}_{s,a} > \eta \\[2ex] O\left( D\sqrt{\dfrac{S \log(SAT/\rho)}{N^{\tau_k}_{s,a}}} + \dfrac{DS \log(S)}{N^{\tau_k}_{s,a}} \right), & N^{\tau_k}_{s,a} \le \eta \end{cases}$$

5 Conclusions

We presented an algorithm inspired by posterior sampling that achieves near-optimal worst-case regret bounds for the reinforcement learning problem with communicating MDPs in a non-episodic, undiscounted average reward setting.
Our algorithm may be viewed as a more efficient randomized version of the UCRL2 algorithm of Jaksch et al. [2010], with randomization via posterior sampling forming the key to the $\sqrt{S}$ factor improvement in the regret bound provided by our algorithm. Our analysis demonstrates that posterior sampling provides the right amount of uncertainty in the samples, so that an optimistic policy can be obtained without excessive over-estimation.

While our work surmounts some important technical difficulties in obtaining worst-case regret bounds for posterior sampling based algorithms for communicating MDPs, the provided bound is tight in its dependence on $S$ and $A$ only for large $T$ (specifically, for $T \ge S^5 A$). Other related results on tight worst-case regret bounds have a similar requirement of large $T$ (Azar et al. [2017] prove an $\tilde{O}(\sqrt{HSAT})$ bound when $T \ge H^3 S^3 A$). Obtaining a cleaner worst-case regret bound that does not require such a condition remains an open question. Other important directions of future work include reducing the number of posterior samples required in every epoch from $\tilde{O}(S)$ to constant or logarithmic in $S$, and extensions to contextual and continuous state MDPs.

References

Yasin Abbasi-Yadkori and Csaba Szepesvari. Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm. arXiv preprint arXiv:1406.3926, 2014.

Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964.

Shipra Agrawal and Navin Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In
In\n\nProceedings of the 30th International Conference on Machine Learning (ICML), 2013a.\n\nShipra Agrawal and Navin Goyal. Further Optimal Regret Bounds for Thompson Sampling. In\n\nAISTATS, pages 99\u2013107, 2013b.\n\nJohn Asmuth, Lihong Li, Michael L Littman, Ali Nouri, and David Wingate. A Bayesian sampling\napproach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference\non Uncertainty in Arti\ufb01cial Intelligence, pages 19\u201326. AUAI Press, 2009.\n\nMohammad Gheshlaghi Azar, Ian Osband, and R\u00e9mi Munos. Minimax regret bounds for reinforce-\n\nment learning. arXiv preprint arXiv:1703.05449, 2017.\n\nPeter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement\nlearning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on\nUncertainty in Arti\ufb01cial Intelligence, pages 35\u201342. AUAI Press, 2009.\n\nRonen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-\noptimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213\u2013231, 2002.\n\nS\u00e9bastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson\n\nsampling. In Advances in Neural Information Processing Systems, pages 638\u2013646, 2013.\n\nS\u00e9bastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic\nmulti-armed bandit problems. Foundations and Trends R(cid:13) in Machine Learning, 5(1):1\u2013122, 2012.\nApostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision\n\nprocesses. Mathematics of Operations Research, 22(1):222\u2013255, 1997.\n\nOlivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in\n\nneural information processing systems, pages 2249\u20132257, 2011.\n\nChristoph Dann and Emma Brunskill. Sample complexity of episodic \ufb01xed-horizon reinforcement\n\nlearning. 
In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

Raphaël Fonteneau, Nathan Korda, and Rémi Munos. An optimistic posterior sampling strategy for Bayesian reinforcement learning. In NIPS 2013 Workshop on Bayesian Optimization (BayesOpt2013), 2013.

Charles Miller Grinstead and James Laurie Snell. Introduction to Probability. American Mathematical Society, 2012.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson Sampling: An Optimal Finite Time Analysis. In International Conference on Algorithmic Learning Theory (ALT), 2012.

Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002, 1999.

Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681–690. ACM, 2008.

Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning. arXiv preprint arXiv:1607.00215, 2016.

Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Daniel Russo and Benjamin Van Roy.
Learning to Optimize Via Posterior Sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Daniel Russo and Benjamin Van Roy. An Information-Theoretic Analysis of Thompson Sampling. Journal of Machine Learning Research (to appear), 2015.

Yevgeny Seldin, François Laviolette, Nicolo Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.

I. G. Shevtsova. An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Mathematics, 82(3):862–864, 2010.

Alexander L Strehl and Michael L Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 856–863. ACM, 2005.

Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Ambuj Tewari and Peter L Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505–1512, 2008.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.