{"title": "Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function", "book": "Advances in Neural Information Processing Systems", "page_first": 2827, "page_last": 2836, "abstract": "We present an algorithm based on the \\emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. \nBy evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\\tilde{O}(\\sqrt{SATH})$\\footnote{The symbol $\\tilde{O}$ means $O$ with log factors ignored. } for MDP with S states and A actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$ is known. \nThis result outperforms the best previous regret bounds $\\tilde{O}(HS\\sqrt{AT})$\\cite{bartlett2009regal} by a factor of $\\sqrt{SH}$. \nFurthermore, this regret bound matches the lower bound of $\\Omega(\\sqrt{SATH})$\\cite{jaksch2010near} up to a logarithmic factor. As a consequence,  we show that there is a near optimal regret bound of $\\tilde{O}(\\sqrt{DSAT})$ for MDPs with finite diameter $D$ compared to the lower bound of $\\Omega(\\sqrt{DSAT})$\\cite{jaksch2010near}.", "full_text": "Regret Minimization for Reinforcement Learning by\n\nEvaluating the Optimal Bias Function\n\nZihan Zhang\n\nTsinghua University\n\nzihan-zh17@mails.tsinghua.edu.cn\n\nXiangyang Ji\n\nTsinghua University\n\nxyji@tsinghua.edu.cn\n\nAbstract\n\nWe present an algorithm based on the Optimism in the Face of Uncertainty (OFU)\nprinciple which is able to learn Reinforcement Learning (RL) modeled by Markov\ndecision process (MDP) with \ufb01nite state-action space ef\ufb01ciently. 
By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\\tilde{O}(\\sqrt{SAHT})$ $^{1}$ for MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$, is known. This result outperforms the best previous regret bound $\\tilde{O}(S\\sqrt{AHT})$ [Fruit et al., 2019] by a factor of $\\sqrt{S}$. Furthermore, this regret bound matches the lower bound of $\\Omega(\\sqrt{SAHT})$ [Jaksch et al., 2010] up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\\tilde{O}(\\sqrt{SADT})$ for MDPs with a finite diameter $D$ compared to the lower bound of $\\Omega(\\sqrt{SADT})$ [Jaksch et al., 2010].\n\n1 Introduction\n\nIn this work we consider the Reinforcement Learning (RL) problem [Burnetas and Katehakis, 1997, Sutton and Barto, 2018] of an agent interacting with an environment. The problem is generally modelled as a discrete Markov Decision Process (MDP) [Puterman, 1994]. The RL agent needs to learn the underlying dynamics of the environment in order to make sequential decisions. At step $t$, the agent observes the current state $s_t$ and chooses an action $a_t$ based on the policy learned from the past. Then the agent receives a reward $r_t$ from the environment, and the environment transits to state $s_{t+1}$ according to the state transition model. In particular, both $r_t$ and $s_{t+1}$ are independent of the previous trajectory, and are only conditioned on $s_t$ and $a_t$. In the online framework of reinforcement learning, we aim to maximize cumulative reward. Therefore, there is a trade-off between exploration and exploitation, i.e., taking actions we have not learned accurately enough and taking actions which seem to be optimal currently.\nThe solutions to the exploration-exploitation dilemma can mainly be divided into two groups. In the first group, the approaches utilize the Optimism in the Face of Uncertainty (OFU) principle [Auer et al., 2002]. 
Under the OFU principle, the agent maintains a confidence set of MDPs, and the underlying MDP is contained in this set with high probability. The agent executes the optimal policy of the best MDP in the confidence set [Bartlett and Tewari, 2009, Jaksch et al., 2010, Maillard et al., 2011, Fruit et al., 2018a]. In the second group, the approaches utilize posterior sampling [Thompson, 1933]. The agent maintains a posterior distribution over reward functions and transition models. It samples an MDP and executes the corresponding optimal policy in each epoch. Because of its simplicity and scalability, as well as its provably optimal regret bound, posterior sampling has become popular in the related research field [Osband et al., 2013, Osband and Van Roy, 2016, Agrawal and Jia, 2017, Abbasi-Yadkori, 2015].\n\n$^{1}$The symbol $\\tilde{O}$ means $O$ with log factors ignored.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Related Work\n\nIn the research field of regret minimization for reinforcement learning, Jaksch et al. [2010] showed a regret bound of $\\tilde{O}(DS\\sqrt{AT})$ for MDPs with a finite diameter $D$, and proved that it is impossible to reach a regret bound smaller than $\\Omega(\\sqrt{SADT})$. Agrawal and Jia [2017] established a better regret bound of $\\tilde{O}(D\\sqrt{SAT})$ by posterior sampling method. Bartlett and Tewari [2009] achieved a regret bound of $\\tilde{O}(HS\\sqrt{AT})$ where $H$ is an input as an upper bound of $sp(h^{*})$. Fruit et al. [2018b] designed a practical algorithm for the constrained optimization problem in REGAL.C [Bartlett and Tewari, 2009], and obtained a regret bound of $\\tilde{O}(H\\sqrt{\\Gamma SAT})$ where $\\Gamma \\le S$ is the number of possible next states. On the other hand, Ouyang et al. [2017] and Theocharous et al. 
[2017] designed posterior sampling algorithms with Bayesian regret bound of $\\tilde{O}(HS\\sqrt{AT})$, with the assumption that elements of support of the prior distribution have a consistent upper bound $H$ for their optimal bias spans. Talebi and Maillard [2018] showed a problem-dependent regret bound of $\\tilde{O}(\\sqrt{\\sum_{s,a} V(P_{s,a}, h^{*}) ST})$. Recently, Fruit et al. [2019] presented improved analysis of UCRL2B algorithm and obtained a regret bound of $\\tilde{O}(S\\sqrt{DAT})$.\nThere is also considerable work devoted to studying finite-horizon MDPs. Osband and Van Roy [2016] presented PSRL to establish a Bayesian regret bound of $\\tilde{O}(H\\sqrt{SAT})$ using posterior sampling method. Later, Azar et al. [2017] reached a better regret bound of $\\tilde{O}(\\sqrt{SAHT})$. Recently, Kakade et al. [2018] and Zanette and Brunskill [2019] achieved the same regret bound of $\\tilde{O}(\\sqrt{SAHT})$ by learning a precise value function to predict the best future reward of current state.\nWe notice a mistake about concentration of average of independent multinoulli trials in the proof of [Agrawal and Jia, 2017] (see Appendix A for further details). This mistake suggests that they may not reduce a factor of $\\sqrt{S}$ in their regret bounds.\n\n1.2 Main Contribution\n\nIn this paper, we design an OFU based algorithm, and achieve a regret bound of $\\tilde{O}(\\sqrt{SAHT})$ given an upper bound $H$ on $sp(h^{*})$. As a corollary, we establish a regret bound of $\\tilde{O}(\\sqrt{SADT})$ for the MDPs with finite diameter $D$. Meanwhile the corresponding lower bounds for the above two upper bounds are $\\Omega(\\sqrt{SAHT})$ and $\\Omega(\\sqrt{SADT})$ respectively. In a nutshell, our algorithm improves the regret bound by a factor of $\\sqrt{S}$ compared to the best previous known results.\nOur Approach: we consider regret minimization for RL by evaluating state-pair difference of the optimal bias function. Firstly, we observe that we can achieve a near-optimal regret bound with guide of the optimal bias function. 
Considering the fact that it is hard to estimate the optimal bias function directly [Ortner, 2008], we design a confidence set $\\mathcal{H}_k$ of the optimal bias function. Based on $\\mathcal{H}_k$ we obtain a tighter confidence set of MDPs and a better regret bound. It is notable that the order of samples in the trajectory is crucial when computing $\\mathcal{H}_k$ in our algorithm, while it is ignored in previous methods. In this way, we utilize more information about the trajectory when computing the confidence set, which enables us to achieve a better regret bound.\n\n2 Preliminaries\n\nWe consider the MDP learning problem where the MDP $M = \\langle \\mathcal{S}, \\mathcal{A}, r, P, s_1 \\rangle$. $\\mathcal{S} = \\{1, 2, ..., S\\}$ is the state space, $\\mathcal{A} = \\{1, 2, ..., A\\}$ is the action space, $P : \\mathcal{S} \\times \\mathcal{A} \\to \\Delta^{\\mathcal{S}}$ $^{2}$ is the transition model, $r : \\mathcal{S} \\times \\mathcal{A} \\to \\Delta^{[0,1]}$ is the reward function, and $s_1$ is the initial state. The agent executes action $a$ at state $s$ and receives a reward $r(s, a)$, and then the system transits to the next state $s'$ according to $P(\\cdot|s, a) = P_{s,a}$. In this paper, we assume that $E[r(s, a)]$ is known for each $(s, a)$ pair, and denote $E[r(s, a)]$ as $r_{s,a}$. It is not difficult to extend the proof to the original case.\nIn the following sections, we mainly focus on weak-communicating (see definition [Bartlett and Tewari, 2009]) MDPs.\nAssumption 1. The underlying MDP is weak-communicating.\n\n$^{2}$In this paper, we use $\\Delta^{X}$ to denote all distributions on $X$. Particularly, we use $\\Delta^{m}$ to denote the $m$-simplex.\n\nWe first summarize several useful known results for MDPs and RL.\nDefinition 1 (Policy). A policy $\\pi : \\mathcal{S} \\to \\Delta^{\\mathcal{A}}$ is a mapping from the state space to all distributions on the action space. In the case the support of $\\pi(s)$ is a single action, we also denote this action as $\\pi(s)$.\n\nGiven a policy $\\pi$, transition model $P$ and reward function $r$, we use $P_{\\pi}$ to denote the transition probability matrix and $r_{\\pi}$ to denote the reward vector under $\\pi$. 
Specifically, when $\\pi$ is a deterministic policy, $P_{\\pi} = [P_{1,\\pi(1)}, ..., P_{S,\\pi(S)}]$ and $r_{\\pi} = [r_{1,\\pi(1)}, ..., r_{S,\\pi(S)}]^{T}$.\nDefinition 2 (Average reward). Given a policy $\\pi$, when starting from $s_1 = s$, the average reward is defined as:\n\n$$\\rho^{\\pi}(s) = \\lim_{T \\to \\infty} \\frac{1}{T} E_{a_t \\sim \\pi(s_t), 1 \\le t \\le T}\\Big[\\sum_{t=1}^{T} r_{s_t,a_t} \\Big| s_1 = s\\Big].$$\n\nThe optimal average reward and the optimal policy are defined as $\\rho^{*}(s) = \\max_{\\pi} \\rho^{\\pi}(s)$ and $\\Pi^{*}(s) = \\arg\\max_{\\pi} \\rho^{\\pi}(s)$ respectively. It is well known that, under Assumption 1, $\\rho^{*}(s)$ is state independent, so that we write it as $\\rho^{*}$ in the rest of the paper for simplicity.\nDefinition 3 (Diameter). Diameter of an MDP $M$ is defined as:\n\n$$D(M) = \\max_{s,s' \\in \\mathcal{S}, s \\ne s'} \\min_{\\pi : \\mathcal{S} \\to \\mathcal{A}} T^{\\pi}_{s \\to s'},$$\n\nwhere $T^{\\pi}_{s \\to s'}$ denotes the expected number of steps to reach $s'$ from $s$ under policy $\\pi$.\n\nUnder Assumption 1, it is known the optimal bias function $h^{*}$ satisfies that\n\n$$h^{*} + \\rho^{*}\\mathbf{1} = \\max_{a \\in \\mathcal{A}} (r_{s,a} + P_{s,a}^{T} h^{*}) \\quad (1)$$\n\nwhere $\\mathbf{1} = [1, 1, ..., 1]^{T}$. It is obvious that if $h$ satisfies (1), then so does $h + \\lambda\\mathbf{1}$ for any $\\lambda \\in \\mathbb{R}$. Assuming $h$ is a solution to (1), we set $^{3}$ $\\lambda = -\\min_s h_s$ and $h^{*} = h + \\lambda\\mathbf{1}$, then the optimal bias function $h^{*}$ is uniquely defined. Besides, the span operator $sp : \\mathbb{R}^{S} \\to \\mathbb{R}$ is defined as $sp(v) = \\max_{s,s' \\in [S]} |v_s - v_{s'}|$.\nThe reinforcement learning problem. In reinforcement learning, the agent starts at $s_1 = s_{start}$, and proceeds to make decisions in rounds $t = 1, 2, ..., T$. The $\\mathcal{S}$, $\\mathcal{A}$ and $\\{r_{s,a}\\}_{s \\in \\mathcal{S}, a \\in \\mathcal{A}}$ are known to the agent, while the transition model $P$ is unknown to the agent. Therefore, the final performance is measured by the cumulative regret defined as\n\n$$R(T, s_{start}) := T\\rho^{*} - \\sum_{t=1}^{T} r_{s_t,a_t}.$$\n\nThe upper bound we provide for $R(T, s_{start})$ does not depend on the initial state $s_{start}$. In the following sections, we use $R(T)$ to denote $R(T, s_{start})$ for simplicity.\n\n3 Algorithm Description\n\n3.1 Framework of UCRL2\n\nWe first revisit the classical framework of UCRL2 [Jaksch et al., 2010] briefly. As described in Algorithm 1 (EBF), there are mainly three components in the UCRL2 framework: doubling episodes, building the confidence set and solving the optimization problem.\nDoubling episodes: The algorithm proceeds through episodes $k = 1, 2, ...$. In the $k$-th episode, the agent makes decisions according to $\\pi_k$. The episode ends whenever $\\exists (s, a)$ such that the visit count of $(s, a)$ in the $k$-th episode is larger than or equal to the visit count of $(s, a)$ before the $k$-th episode. Let $K$ be the number of episodes. Therefore, we can get that $K \\le SA(\\log_2(\\frac{T}{SA}) + 1) \\le 3SA\\log(T)$ when $SA \\ge 2$ [Jaksch et al., 2010].\nBuilding the confidence set: At the beginning of an episode, the algorithm computes a collection of plausible MDPs, i.e., the confidence set $\\mathcal{M}_k$ based on the previous trajectory. $\\mathcal{M}_k$ should be designed properly such that the underlying MDP $M$ is contained by $\\mathcal{M}_k$ with high probability, and the elements in $\\mathcal{M}_k$ are close to $M$. In our algorithm, the confidence set is not a collection of MDPs. Instead, we design a 4-tuple $(\\pi, P'(\\pi), h'(\\pi), \\rho(\\pi))$ to describe a plausible MDP and its optimal policy.\n\n$^{3}$In this paper, we use $[v_1, v_2, ..., v_S]^{T}$ to indicate a vector $v \\in \\mathbb{R}^{S}$.\n\nSolving the optimization problem: Given a confidence set $\\mathcal{M}$, the algorithm selects an element from $\\mathcal{M}$ according to some criteria. Generally, to keep the optimality of the chosen MDP, the algorithm needs to maximize the average reward with respect to certain constraints. 
Then the corresponding optimal policy will be executed in the current episode.\n\nAlgorithm 1 EBF: Estimate the Bias Function\nInput: $H$, $\\delta$, $T$.\nInitialize: $t \\gets 1$, $t_k \\gets 0$.\n1: for episodes $k = 1, 2, ...$ do\n2:   $t_k \\gets$ current time;\n3:   $\\mathcal{L}_{t_k-1} \\gets \\{(s_i, a_i, s_{i+1}, r_i)\\}_{1 \\le i \\le t_k-1}$;\n4:   $\\mathcal{M}_k \\gets$ BuildCS($H$, $\\log(\\frac{2}{\\delta})$, $\\mathcal{L}_{t_k-1}$);\n5:   Choose $(\\pi, P'(\\pi), h'(\\pi), \\rho(\\pi)) \\in \\mathcal{M}_k$ to maximize $\\rho(\\pi)$ over $\\mathcal{M}_k$;\n6:   $\\pi_k \\gets \\pi$;\n7:   Follow $\\pi_k$ until the visit count of some $(s, a)$ pair doubles.\n8: end for\n\n3.2 Tighter Confidence Set by Evaluating the Optimal Bias Function\n\nREGAL.C [Bartlett and Tewari, 2009] utilizes $H$ to compute $\\mathcal{M}_k$, thus avoiding the issues brought by the diameter $D$. Similar to REGAL.C, we assume that $H$, an upper bound of $sp(h^{*})$, is known. We design a novel method to compute the confidence set, which is able to utilize the knowledge of the history trajectory more efficiently. We first compute a well-designed confidence set $\\mathcal{H}_k$ of the optimal bias function, and obtain a tighter confidence set $\\mathcal{M}_k$ based on $\\mathcal{H}_k$.\nOn the basis of the above discussion, we summarize the high-level intuitions as below:\nExploration guided by the optimal bias function: Once the true optimal bias function $h^{*}$ is given, we could get a better regret bound. In this case we regard the regret minimization problem as $S$ independent multi-armed bandit problems. UCB algorithm with Bernstein bound [Lattimore and Hutter, 2012] provides a near optimal regret bound. However, we can not get $h^{*}$ exactly. Instead, a tight confidence set of $h^{*}$ also helps to guide exploration.\nConfidence set of the optimal bias function: We first study what could be learned about $h^{*}$ if we always choose optimal actions. For two different states $s, s'$, suppose we start from $s$ at $t_1$, and reach $s'$ the first time at $t_2$ ($t_2$ is a stopping time), then we have $E[\\sum_{t=t_1}^{t_2-1}(r_t - \\rho^{*})] = \\delta^{*}_{s,s'} := h^{*}_{s} - h^{*}_{s'}$ $^{4}$ by the definition of optimal bias function. As a result, $\\sum_{t=t_1}^{t_2-1}(r_t - \\rho^{*})$ could be regarded as an unbiased estimator for $\\delta^{*}_{s,s'}$. Based on concentration inequalities for martingales, we have the following formal definitions and lemma.\nDefinition 4. Given a trajectory $\\mathcal{L} = \\{(s_t, a_t, s_{t+1}, r_t)\\}_{1 \\le t \\le N}$, for $s, s' \\in \\mathcal{S}$ and $s \\ne s'$, let $ts_1(\\mathcal{L}) := \\min\\{\\min\\{t | s_t = s\\}, N + 2\\}$. We define $\\{ts_k(\\mathcal{L})\\}_{k \\ge 2}$ and $\\{te_k(\\mathcal{L})\\}_{k \\ge 1}$ recursively by the following rules,\n\n$$te_k(\\mathcal{L}) := \\min\\{\\min\\{t | s_t = s', t > ts_k(\\mathcal{L})\\}, N + 2\\},$$\n$$ts_k(\\mathcal{L}) := \\min\\{\\min\\{t | s_t = s, t > te_{k-1}(\\mathcal{L})\\}, N + 2\\}.$$\n\nThe count of arrivals $c(s, s', \\mathcal{L})$ from $s$ to $s'$ is defined as\n\n$$c(s, s', \\mathcal{L}) := \\max\\{k | te_k(\\mathcal{L}) \\le N + 1\\}.$$\n\nHere we define $\\min \\emptyset = +\\infty$ and $\\max \\emptyset = 0$ respectively.\nLemma 1 (Main Lemma). We say an MDP is flat if all its actions are optimal. Suppose $M$ is a flat MDP (without the constraint $r_{s,a} \\in [0, 1]$). We run $N$ steps following an algorithm $G$ under $M$. Let $\\mathcal{L} = \\{(s_t, a_t, s_{t+1}, r_t)\\}_{1 \\le t \\le N}$ be the final trajectory. For any two states $s, s' \\in \\mathcal{S}$ and $s \\ne s'$, let $c(s, s', \\mathcal{L})$, $\\{te_k(\\mathcal{L})\\}_{k \\ge 1}$ and $\\{ts_k(\\mathcal{L})\\}_{k \\ge 1}$ be defined as in Definition 4. Then we have, for any algorithm $G$, with probability at least $1 - N\\delta$, for any $1 \\le c \\le c(s, s', \\mathcal{L})$ it holds that\n\n$$\\Big|\\sum_{k=1}^{c}\\Big(h^{*}_{s'} - h^{*}_{s} + \\sum_{ts_k(\\mathcal{L}) \\le t \\le te_k(\\mathcal{L})-1}(r_t - \\rho^{*})\\Big)\\Big| \\le (\\sqrt{2\\gamma N} + 1)sp(h^{*}),$$\n\nwhere $\\gamma = \\log(\\frac{2}{\\delta})$ $^{5}$.\n\n$^{4}$To explain the high-level idea, we assume this expectation is well-defined.\n$^{5}$In this paper $\\gamma$ always denotes $\\log(\\frac{2}{\\delta})$.\n\nTo use Lemma 1 to compute $\\mathcal{H}_k$, we have to overcome two problems: (i) $M$ may not be flat; (ii) we do not have the value of $\\rho^{*}$. Under the assumption the total regret is $\\tilde{O}(HS\\sqrt{AT})$, we can solve the problems subtly.\nLet $reg_{s,a} = h^{*}_{s} + \\rho^{*} - P_{s,a}^{T}h^{*} - r_{s,a}$, which is also called optimal gap [Burnetas and Katehakis, 1997] and could be regarded as the single step regret of $(s, a)$. Let\n\n$$r'_{s,a} = h^{*}_{s} + \\rho^{*} - P_{s,a}^{T}h^{*} = r_{s,a} + reg_{s,a} \\quad (2)$$\n\nand $M' = \\langle \\mathcal{S}, \\mathcal{A}, r', P, s_1 \\rangle$. It is easy to prove that $M'$ is flat and has the same optimal bias function and optimal average reward as $M$. We attain by Lemma 1 that with high probability, it holds that\n\n$$\\Big|\\sum_{k=1}^{c(s,s',\\mathcal{L})}\\Big(h^{*}_{s'} - h^{*}_{s} + \\sum_{ts_k(\\mathcal{L}) \\le t \\le te_k(\\mathcal{L})-1}(r_{s_t,a_t} - \\rho^{*})\\Big)\\Big| \\le \\sum_{t=1}^{N} reg_{s_t,a_t} + (\\sqrt{2\\gamma N} + 1)sp(h^{*}). \\quad (3)$$\n\nLet $h' \\in [0, H]^{S}$ be a vector such that (3) still holds with $h^{*}$ replaced by $h'$, then we can derive that\n\n$$N_{s,a,s'}|(h^{*}_{s'} - h^{*}_{s}) - (h'_{s'} - h'_{s})| \\le 2\\sum_{t=1}^{N} reg_{s_t,a_t} + 2(\\sqrt{2\\gamma N} + 1)H$$\n\nwhere $N_{s,a,s'} := \\sum_{t=1}^{N} I[s_t = s, a_t = a, s_{t+1} = s'] \\le c(s, s', \\mathcal{L})$. Because it is not hard to bound $\\sum_{t=1}^{N} reg_{s_t,a_t} \\approx R(N)$ up to $\\tilde{O}(HS\\sqrt{AN})$ by REGAL.C [Bartlett and Tewari, 2009], we obtain that with high probability it holds\n\n$$N_{s,a,s'}|(h^{*}_{s'} - h^{*}_{s}) - (h'_{s'} - h'_{s})| = \\tilde{O}(HS\\sqrt{AN}). \\quad (4)$$\n\nAs for the problem we have no knowledge about $\\rho^{*}$, we can replace $\\rho^{*}$ with the empirical average reward $\\hat{\\rho}$. Our claim about (4) still holds as long as $N(\\rho^{*} - \\hat{\\rho}) = \\tilde{O}(HS\\sqrt{AN})$, which is equivalent to $R(N) = \\tilde{O}(HS\\sqrt{AN})$.\nAlthough it seems that (4) is not tight enough, it helps to bound the error term due to the difference between $h_k$ and $h^{*}$ up to $o(\\sqrt{T})$ by setting $N = T$ 
(refer to Appendix C.5).\nBased on the discussion above, we define $\\mathcal{H}_k$ as:\n\n$$\\mathcal{H}_k := \\{h \\in [0, H]^{S} : |L_1(h, s, s', \\mathcal{L}_{t_k-1})| \\le 48S\\sqrt{AT\\gamma} sp(h) + (\\sqrt{2\\gamma T} + 1)sp(h), \\forall s, s', s \\ne s'\\}$$\n\nwhere\n\n$$L_1(h, s, s', \\mathcal{L}) = \\sum_{k=1}^{c(s,s',\\mathcal{L})}\\Big((h_{s'} - h_{s}) + \\sum_{ts_k(\\mathcal{L}) \\le i \\le te_k(\\mathcal{L})-1}(r_i - \\hat{\\rho})\\Big).$$\n\nTogether with constraints on the transition model (5)-(7) and constraint on optimality (8), we propose Algorithm 2 to build the confidence set, where\n\n$$V(x, h) = \\sum_{s} x_s h_s^{2} - (x^{T}h)^{2}.$$\n\nAlgorithm 2 BuildCS($H$, $\\gamma$, $\\mathcal{L}$)\nInput: $H$, $\\gamma$, $\\mathcal{L} = \\{(s_i, a_i, s_{i+1}, r_i)\\}_{1 \\le i \\le N}$\n1: $\\mathcal{H} \\gets \\{h \\in [0, H]^{S} : |L_1(h, s, s', \\mathcal{L})| \\le 48S\\sqrt{AT\\gamma} sp(h) + (\\sqrt{2\\gamma T} + 1)sp(h), \\forall s, s', s \\ne s'\\}$;\n2: $N_{s,a} \\gets \\max\\{\\sum_{t=1}^{N} I[s_t = s, a_t = a], 1\\}$, $\\forall (s, a)$;\n3: $\\hat{P}_{s,a,s'} \\gets \\frac{\\sum_{t=1}^{N} I[s_t = s, a_t = a, s_{t+1} = s']}{N_{s,a}}$, $\\forall (s, a, s')$;\n4: $\\mathcal{O} \\gets \\{\\pi | \\pi$ is a deterministic policy, and $\\exists P'(\\pi) \\in \\mathbb{R}^{S \\times A \\times S}$, $h'(\\pi) \\in \\mathcal{H}$ and $\\rho(\\pi) \\in \\mathbb{R}$, such that\n\n$$|P'_{s,a,s'}(\\pi) - \\hat{P}_{s,a,s'}| \\le 2\\sqrt{\\hat{P}_{s,a,s'}\\gamma/N_{s,a}} + 3\\gamma/N_{s,a} + 4\\gamma^{3/4}/N_{s,a}^{3/4}, \\quad (5)$$\n$$|P'_{s,a}(\\pi) - \\hat{P}_{s,a}|_1 \\le \\sqrt{14S\\gamma/N_{s,a}}, \\quad (6)$$\n$$|(P'_{s,a}(\\pi) - \\hat{P}_{s,a})^{T}h'(\\pi)| \\le 2\\sqrt{V(\\hat{P}_{s,a}, h'(\\pi))\\gamma/N_{s,a}} + 12H\\gamma/N_{s,a} + 10H\\gamma^{3/4}/N_{s,a}^{3/4}, \\quad (7)$$\n$$P'_{s,\\pi(s)}(\\pi)^{T}h'(\\pi) + r_{s,\\pi(s)} = \\max_{a \\in \\mathcal{A}} P'_{s,a}(\\pi)^{T}h'(\\pi) + r_{s,a} = h'(\\pi) + \\rho(\\pi)\\mathbf{1} \\quad (8)$$\n\nholds for any $s, a, s'\\}$;\n5: Return: $\\{(\\pi, P'(\\pi), h'(\\pi), \\rho(\\pi)) | \\pi \\in \\mathcal{O}\\}$.\n\n4 Main Results\n\nIn this section, we summarize the results obtained by using Algorithm 1 on weak-communicating MDPs. In the case there is an available upper bound $H$ for $sp(h^{*})$, we have the following theorem.\nTheorem 1 (Regret bound ($H$ known)). With probability $1 - \\delta$, for any weak-communicating MDP $M$ and any initial state $s_{start} \\in \\mathcal{S}$, when $T \\ge p_1(S, A, H, \\log(\\frac{1}{\\delta}))$ and $S, A, H \\ge 20$ where $p_1$ is a polynomial function, the regret of EBF algorithm is bounded by\n\n$$R(T) \\le 490\\sqrt{SAHT\\log\\Big(\\frac{40S^{2}A^{2}T\\log(T)}{\\delta}\\Big)},$$\n\nwhenever an upper bound of the span of optimal bias function $H$ is known. By setting $\\delta = \\frac{1}{T}$, we get that $E[R(T)] = \\tilde{O}(\\sqrt{SAHT})$.\nTheorem 1 generalizes the $\\tilde{O}(\\sqrt{SAHT})$ regret bound from the finite-horizon setting [Azar et al., 2017] to general weak-communicating MDPs, and improves the best previous known regret bound $\\tilde{O}(H\\sqrt{SAT})$ [Fruit et al., 2019] by an $\\sqrt{S}$ factor. More importantly, this upper bound matches the $\\Omega(\\sqrt{SAHT})$ lower bound up to a logarithmic factor.\nBased on Theorem 1, in the case the diameter $D$ is finite but unknown, we can reach a regret bound of $\\tilde{O}(\\sqrt{SADT})$.\nCorollary 1. For weak-communicating MDP $M$ with a finite unknown diameter $D$ and any initial state $s_{start} \\in \\mathcal{S}$, with probability $1 - \\delta$, when $T \\ge p_2(S, A, D, \\log(\\frac{1}{\\delta}))$ and $S, A, D \\ge 20$ where $p_2$ is a polynomial function, the regret can be bounded by\n\n$$R(T) \\le 491\\sqrt{SADT\\log\\Big(\\frac{S^{3}A^{2}T\\log(T)}{\\delta}\\Big)}.$$\n\nBy setting $\\delta = \\frac{1}{T}$, we get that $E[R(T)] = \\tilde{O}(\\sqrt{SADT})$.\n\nWe postpone the proof of Corollary 1 to Appendix D.\nAlthough EBF is proved to be near optimal, it is hard to implement the algorithm efficiently. The optimization problem in line 5 of Algorithm 1 is well-posed because of the optimality equation (8). However, the constraint (7) is non-convex in $h'(\\pi)$, which makes the optimization problem hard to solve. Recently, Fruit et al. [2018b] proposed a practical algorithm SCAL, which solves the optimization problem in REGAL.C efficiently. We tried to extend the span truncation operator $T_c$ to our framework, but failed to make substantial progress. We have to leave this to future work.\n\n5 Analysis of EBF (Proof Sketch of Theorem 1)\n\nOur proof mainly contains two parts. In the first part, we bound the probabilities of the bad events. 
In the second part, we manage to bound the regret when the good event occurs.\n\n5.1 Probability of Bad Events\n\nWe first present the explicit definition of the bad events. Let $N^{(t)}_{s,a} = \\sum_{i=1}^{t} I[s_i = s, a_i = a]$. We denote $N_{k,s,a} = N^{(t_k-1)}_{s,a}$ as the visit count of $(s, a)$ before the $k$-th episode, and $v_{k,s,a}$ as the visit count of $(s, a)$ in the $k$-th episode respectively. We also denote $\\hat{P}^{(k)}$ as the empirical transition model before the $k$-th episode.\nDefinition 5 (Bad event). For the $k$-th episode, define\n\n$$B_{1,k} := \\Big\\{\\exists (s, a), \\text{ s.t. } |(P_{s,a} - \\hat{P}^{(k)}_{s,a})^{T}h^{*}| > 2\\sqrt{\\frac{V(P_{s,a}, h^{*})\\gamma}{\\max\\{N_{k,s,a}, 1\\}}} + 2\\frac{sp(h^{*})\\gamma}{\\max\\{N_{k,s,a}, 1\\}}\\Big\\},$$\n$$B_{2,k} := \\Big\\{\\exists (s, a, s'), \\text{ s.t. } |\\hat{P}^{(k)}_{s,a,s'} - P_{s,a,s'}| > 2\\sqrt{\\frac{\\hat{P}^{(k)}_{s,a,s'}\\gamma}{\\max\\{N_{k,s,a}, 1\\}}} + \\frac{3\\gamma}{\\max\\{N_{k,s,a}, 1\\}} + \\frac{4\\gamma^{3/4}}{\\max\\{N_{k,s,a}, 1\\}^{3/4}}\\Big\\},$$\n$$B_{3,k} := \\Big\\{\\Big|\\sum_{1 \\le t < t_k}(\\rho^{*} - r_{s_t,a_t})\\Big| > 26HS\\sqrt{AT\\gamma}, \\sum_{k' < k}\\sum_{s,a} v_{k',s,a} reg_{s,a} > 22HS\\sqrt{AT\\gamma}\\Big\\},$$\n$$B_{4,k} := \\Big\\{\\{(\\pi^{*}, P^{*}, h^{*}, \\rho^{*}) | \\pi^{*} \\text{ is a deterministic optimal policy}\\} \\cap \\mathcal{M}_k = \\emptyset\\Big\\}.$$\n\nThe bad event in the $k$-th episode therefore is defined as $B_k = B_{1,k} \\cup B_{2,k} \\cup B_{3,k} \\cup B_{4,k}$, and the total bad event $B$ is defined as $B := \\cup_{1 \\le k \\le K+1} B_k$. At the same time, we have the definition of the good event as $G = B^{C}$.\nLemma 2 (Bound of $P(B)$). Suppose we run Algorithm 1 for $T$ steps, then $P(B) \\le (6AT + 12S^{2}A)SA\\log(T)\\delta$ when $T \\ge A\\log(T)$ and $SA \\ge 4$.\n\n5.2 Regret when the Good Event Occurs\n\nIn this section we assume that the good event $G$ occurs. We use $R_k$ to denote the regret in the $k$-th episode. We use $P'_k$, $P_k$, $\\hat{P}_k$, $r_k$, $\\rho_k$ and $h_k$ to denote $P'_{\\pi_k}(\\pi_k)$, $P_{\\pi_k}$, $\\hat{P}^{(k)}_{\\pi_k}$, $r_{\\pi_k}$, $\\rho(\\pi_k)$ and $h'(\\pi_k)$ respectively. We define $v_k$ as the vector such that $v_{k,s} = v_{k,s,\\pi_k(s)}, \\forall s$, and introduce $\\delta_{k,s,s'} = h_{k,s} - h_{k,s'}, \\forall s, s'$.\nNoting that for $\\alpha > 0$, $\\sum_k \\sum_{s,a} \\frac{v_{k,s,a}}{\\max\\{N_{k,s,a}, 1\\}^{\\frac{1}{2}+\\alpha}}$ could be roughly bounded by $O(T^{\\frac{1}{2}-\\alpha})$, which could be ignored when $T$ is sufficiently large. Therefore, we can omit such terms without changing the regret bound.\nAccording to $B^{C}_{4,k}$ and the optimality of $\\rho_k$ we have\n\n$$R_k = v_k^{T}(\\rho^{*}\\mathbf{1} - r_k) \\le v_k^{T}(\\rho_k\\mathbf{1} - r_k) = v_k^{T}(P'_k - I)^{T}h_k$$\n$$= \\underbrace{v_k^{T}(P_k - I)^{T}h_k}_{\u2460_k} + \\underbrace{v_k^{T}(\\hat{P}_k - P_k)^{T}h^{*}}_{\u2461_k} + \\underbrace{v_k^{T}(P'_k - \\hat{P}_k)^{T}h_k}_{\u2462_k} + \\underbrace{v_k^{T}(\\hat{P}_k - P_k)^{T}(h_k - h^{*})}_{\u2463_k}. \\quad (9)$$\n\nWe bound the four terms in the right side of (9) separately.\nTerm $\u2460_k$: The expectation of $\u2460_k$ always lies within $[-H, H]$. However, we can not directly utilize this to bound $\u2460_k$. By observing that $\u2460_k$ has a martingale difference structure, we have the following lemma based on concentration inequality for martingales.\nLemma 3. When $T \\ge S^{2}AH^{2}$, with probability $1 - 3\\delta$, it holds that\n\n$$\\sum_k \u2460_k \\le KH + (4H + 2\\sqrt{12TH})\\gamma.$$\n\nTerm $\u2461_k$: Recalling the definition of $V(x, h)$ in Section 3, $B^{C}_{1,k}$ implies that\n\n$$\u2461_k \\le \\sum_{s,a} v_{k,s,a}\\Big(2\\sqrt{\\frac{V(P_{s,a}, h^{*})\\gamma}{\\max\\{N_{k,s,a}, 1\\}}} + 2\\frac{H\\gamma}{\\max\\{N_{k,s,a}, 1\\}}\\Big) \\approx O\\Big(\\sum_{s,a} v_{k,s,a}\\sqrt{\\frac{V(P_{s,a}, h^{*})\\gamma}{\\max\\{N_{k,s,a}, 1\\}}}\\Big), \\quad (10)$$\n\nwhere $\\approx$ means we omit the insignificant terms. We bound RHS of (10) by bounding $\\sum_{s,a} N^{(T)}_{s,a} V(P_{s,a}, h^{*})$ by $O(TH)$. Formally, we have the following lemma.\nLemma 4. When $T \\ge S^{2}AH^{2}$, with probability $1 - \\delta$,\n\n$$\\sum_{k,s,a} v_{k,s,a}\\sqrt{\\frac{V(P_{s,a}, h^{*})\\gamma}{\\max\\{N_{k,s,a}, 1\\}}} \\le 21\\sqrt{SAHT\\gamma}.$$\n\nTerm $\u2462_k$: According to (7) we have\n\n$$\u2462_k \\le \\sum_{s,a} v_{k,s,a} L_2(\\max\\{N_{k,s,a}, 1\\}, \\hat{P}^{(k)}_{s,a}, h_k) \\approx O\\Big(\\sum_{s,a} v_{k,s,a}\\sqrt{\\frac{V(\\hat{P}^{(k)}_{s,a}, h_k)\\gamma}{\\max\\{N_{k,s,a}, 1\\}}}\\Big) \\quad (11)$$\n\nwhere $L_2(N, p, h) = 2\\sqrt{V(p, h)\\gamma/N} + 12H\\gamma/N + 10H\\gamma^{3/4}/N^{3/4}$. When dealing with the RHS of (11), because $h_k$ varies in different episodes, we have to bound the static part and the dynamic part separately. Noting that\n\n$$\\sqrt{V(\\hat{P}^{(k)}_{s,a}, h_k)} - \\sqrt{V(P_{s,a}, h^{*})} \\le \\Big(\\sqrt{V(\\hat{P}^{(k)}_{s,a}, h_k)} - \\sqrt{V(\\hat{P}^{(k)}_{s,a}, h^{*})}\\Big) + \\Big(\\sqrt{V(\\hat{P}^{(k)}_{s,a}, h^{*})} - \\sqrt{V(P_{s,a}, h^{*})}\\Big)$$\n$$\\le \\sqrt{|V(\\hat{P}^{(k)}_{s,a}, h_k) - V(\\hat{P}^{(k)}_{s,a}, h^{*})|} + \\sqrt{|V(\\hat{P}^{(k)}_{s,a}, h^{*}) - V(P_{s,a}, h^{*})|}$$\n$$\\le \\sqrt{4H\\sum_{s'} \\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|} + \\sqrt{4H^{2}|\\hat{P}^{(k)}_{s,a} - P_{s,a}|_1}$$\n$$\\le \\sum_{s'}\\sqrt{4H\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|} + \\sqrt{4H^{2}\\sqrt{\\frac{14S\\gamma}{\\max\\{N_{k,s,a}, 1\\}}}}$$\n$$\\approx O\\Big(\\sum_{s'}\\sqrt{4H\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|}\\Big), \\quad (12)$$\n\nAccording to the bound of the second term, it suffices to bound\n\n$$\\sqrt{H\\gamma}\\sum_{k,s,a} v_{k,s,a}\\sum_{s'}\\sqrt{\\frac{\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|}{\\max\\{N_{k,s,a}, 1\\}}}. \\quad (13)$$\n\nSurprisingly, we find that this term is an upper bound for the fourth term.\nTerm $\u2463_k$: Recalling that $\\delta^{*}_{s,s'} = h^{*}_{s} - h^{*}_{s'}$, according to $B^{C}_{2,k}$ the fourth term can be bounded by:\n\n$$\u2463_k = \\sum_{s,a} v_{k,s,a}(\\hat{P}^{(k)}_{s,a} - P_{s,a})^{T}(h_k - h_{k,s}\\mathbf{1} - h^{*} + h^{*}_{s}\\mathbf{1}) = \\sum_{s,a} v_{k,s,a}\\sum_{s'}(\\hat{P}^{(k)}_{s,a,s'} - P_{s,a,s'})(\\delta^{*}_{s,s'} - \\delta_{k,s,s'})$$\n$$\\approx O\\Big(\\sum_{s,a} v_{k,s,a}\\sum_{s'}\\sqrt{\\frac{\\hat{P}^{(k)}_{s,a,s'}\\gamma}{\\max\\{N_{k,s,a}, 1\\}}}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|\\Big)$$\n$$= O\\Big(\\sqrt{H\\gamma}\\sum_{s,a} v_{k,s,a}\\sum_{s'}\\sqrt{\\frac{\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|}{\\max\\{N_{k,s,a}, 1\\}}}\\Big). \\quad (14)$$\n\nTo bound (13), according to (4) and the fact $v_{k,s,a} \\le \\max\\{N_{k,s,a}, 1\\}$ we have $v_{k,s,a}\\sqrt{\\frac{\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|}{\\max\\{N_{k,s,a}, 1\\}}} \\le \\sqrt{\\max\\{N_{k,s,a}, 1\\}\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|} = \\tilde{O}(T^{\\frac{1}{4}})$. To be rigorous, we have the following lemma.\nLemma 5. With probability $1 - S^{2}T\\delta$, it holds that\n\n$$\\sum_k \\sum_{s,a} v_{k,s,a}\\sum_{s'}\\sqrt{\\frac{\\hat{P}^{(k)}_{s,a,s'}|\\delta_{k,s,s'} - \\delta^{*}_{s,s'}|}{\\max\\{N_{k,s,a}, 1\\}}} \\le 11KS^{\\frac{5}{2}}A^{\\frac{1}{4}}H^{\\frac{1}{2}}T^{\\frac{1}{4}}\\gamma^{\\frac{1}{4}}. \\quad (15)$$\n\nDue to the lack of space, the proofs are deferred to the appendix.\nPutting (9)-(12), (14), Lemma 3, Lemma 4 and Lemma 5 together, we conclude that $R(T) = \\tilde{O}(\\sqrt{SAHT})$.\n\n6 Conclusion\n\nIn this paper we partly answer the open problems proposed by Jiang and Agarwal [2018] by designing an OFU based algorithm EBF and proving a regret bound of $\\tilde{O}(\\sqrt{HSAT})$ whenever $H$, an upper bound on $sp(h^{*})$, is known. We evaluate the state-pair difference of the optimal bias function during the learning process. Based on this evaluation, we design a delicate confidence set to guide the agent to explore in the right direction. We also prove a regret bound of $\\tilde{O}(\\sqrt{DSAT})$ without prior knowledge about $sp(h^{*})$. Both regret bounds match the corresponding lower bound up to a logarithmic factor and outperform the best previous known bound by an $\\sqrt{S}$ factor.\n\nAcknowledgments\n\nThe authors would like to thank the anonymous reviewers for valuable comments and advice.\n\nReferences\n\nYasin Abbasi-Yadkori. Bayesian optimal control of smoothly parameterized systems. In Conference on Uncertainty in Artificial Intelligence, 2015.\n\nShipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning, worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184\u20131194, 2017.\n\nPeter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235\u2013256, 2002.\n\nMohammad Gheshlaghi Azar, Ian Osband, and R\u00e9mi Munos. Minimax regret bounds for reinforcement learning. 
arXiv preprint arXiv:1703.05449, 2017.\n\nPeter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35\u201342. AUAI Press, 2009.\n\nA. N. Burnetas and M. N. Katehakis. Optimal Adaptive Policies for Markov Decision Processes. 1997.\n\nRonan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating markov decision processes. In Advances in Neural Information Processing Systems, pages 2998\u20133008, 2018a.\n\nRonan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018b.\n\nRonan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of ucrl2b. 2019.\n\nThomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563\u20131600, 2010.\n\nNan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395\u20133398, 2018.\n\nSham Kakade, Mengdi Wang, and Lin F Yang. Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184, 2018.\n\nTor Lattimore and Marcus Hutter. Pac bounds for discounted mdps. In International Conference on Algorithmic Learning Theory, pages 320\u2013334. Springer, 2012.\n\nOdalric-Ambrym Maillard, R\u00e9mi Munos, and Gilles Stoltz. A finite-time analysis of multi-armed bandits problems with kullback-leibler divergences. In Proceedings of the 24th annual Conference On Learning Theory, pages 497\u2013514, 2011.\n\nRonald Ortner. Online regret bounds for markov decision processes with deterministic transitions. 
In International Conference on Algorithmic Learning Theory, pages 123\u2013137. Springer, 2008.\n\nIan Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? arXiv preprint arXiv:1607.00215, 2016.\n\nIan Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, pages 3003\u20133011, 2013.\n\nYi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333\u20131342, 2017.\n\nM L Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.\n\nRichard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.\n\nMohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. arXiv preprint arXiv:1803.01626, 2018.\n\nGeorgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. arXiv preprint arXiv:1711.07979, 2017.\n\nWilliam R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\nAndrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.", "award": [], "sourceid": 1600, "authors": [{"given_name": "Zihan", "family_name": "Zhang", "institution": "Tsinghua University"}, {"given_name": "Xiangyang", "family_name": "Ji", "institution": "Tsinghua University"}]}