{"title": "Is Q-Learning Provably Efficient?", "book": "Advances in Neural Information Processing Systems", "page_first": 4863, "page_last": 4873, "abstract": "Model-free reinforcement learning (RL) algorithms directly parameterize and update value functions or policies, bypassing the modeling of the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that they require large numbers of samples to learn.  The theoretical question of whether not model-free algorithms are in fact \\emph{sample efficient} is one of the most fundamental questions in RL. The problem is unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\\tlO(\\sqrt{H^3 SAT})$ where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. Our regret matches the optimal regret up to a single $\\sqrt{H}$ factor.  Thus we establish the sample efficiency of a classical model-free approach. Moreover, to the best of our knowledge, this is the first model-free analysis to establish $\\sqrt{T}$ regret \\emph{without} requiring access to a ``simulator.''", "full_text": "Is Q-learning Provably Ef\ufb01cient?\n\nChi Jin\u2217\n\nUniversity of California, Berkeley\n\nchijin@cs.berkeley.edu\n\nZeyuan Allen-Zhu\u2217\n\nMicrosoft Research, Redmond\n\nzeyuan@csail.mit.edu\n\nSebastien Bubeck\n\nMicrosoft Research, Redmond\nsebubeck@microsoft.com\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\n\njordan@cs.berkeley.edu\n\nAbstract\n\nModel-free reinforcement learning (RL) algorithms, such as Q-learning, directly\nparameterize and update value functions or policies without explicitly modeling\nthe environment. 
They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [7, 22]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. This sample efficiency matches the optimal regret that can be achieved by any model-based approach, up to a single $\sqrt{H}$ factor. To the best of our knowledge, this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."

1 Introduction

Reinforcement Learning (RL) is a control-theoretic problem in which an agent tries to maximize its cumulative rewards via interacting with an unknown environment through time [26]. There are two main approaches to RL: model-based and model-free. Model-based algorithms make use of a model for the environment, forming a control policy based on this learned model. Model-free approaches dispense with the model and directly update the value function—the expected reward starting from each state—or the policy—the mapping from states to their subsequent actions. There has been a long debate on the relative pros and cons of the two approaches [7].

From the classical Q-learning algorithm [27] to modern DQN [17], A3C [18], TRPO [22], and others, most state-of-the-art RL has been in the model-free paradigm.
Its pros—model-free algorithms are online, require less space, and, most importantly, are more expressive, since specifying the value functions or policies is often more flexible than specifying the model for the environment—arguably outweigh its cons relative to model-based approaches. These relative advantages underlie the significant successes of model-free algorithms in deep RL applications [17, 24].

On the other hand, it is believed that model-free algorithms suffer from a higher sample complexity compared to model-based approaches. This has been evidenced empirically in [7, 22], and recent work has tried to improve the sample efficiency of model-free algorithms by combining them with model-based approaches [19, 21]. There is, however, little theory to support such blending, which requires a more quantitative understanding of relative sample complexities. Indeed, the following basic theoretical questions remain open:

Can we design model-free algorithms that are sample efficient? In particular, is Q-learning provably efficient?

The answers remain elusive even in the basic tabular setting where the numbers of states and actions are finite. In this paper, we attack this problem head-on in the setting of the episodic Markov Decision Process (MDP) formalism (see Section 2 for a formal definition). In this setting, an episode consists of a run of MDP dynamics for H steps, where the agent aims to maximize total reward over multiple episodes.

*The first two authors contributed equally. Full paper (and future edition) available at https://arxiv.org/abs/1807.03765.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
We do not assume access to a "simulator" (which would allow us to query arbitrary state-action pairs of the MDP), and the agent is not allowed to "reset" within each episode. This makes our setting sufficiently challenging and realistic. In this setting, the standard Q-learning heuristic of incorporating ε-greedy exploration appears to take exponentially many episodes to learn [14].

As seen in the literature on bandits, the key to achieving good sample efficiency generally lies in managing the tradeoff between exploration and exploitation. One needs an efficient strategy to explore the uncertain environment while maximizing reward. In the model-based setting, a recent line of research has imported ideas from the bandit literature—including the use of upper confidence bounds (UCB) and improved design of exploration bonuses—and has obtained asymptotically optimal sample efficiency [1, 5, 10, 12]. In contrast, the understanding of model-free algorithms is still very limited. To the best of our knowledge, the only existing theoretical result on model-free RL that applies to the episodic setting is for delayed Q-learning; however, this algorithm is quite sample-inefficient compared to model-based approaches [25].

In this paper, we answer the two aforementioned questions affirmatively. We show that Q-learning, when equipped with a UCB exploration policy that incorporates estimates of the confidence of Q values and assigns exploration bonuses, achieves total regret $\tilde{O}(\sqrt{H^3 SAT})$. Here, S and A are the numbers of states and actions, H is the number of steps per episode, and T is the total number of steps. Up to a $\sqrt{H}$ factor, our regret matches the information-theoretic optimum, which can be achieved by model-based algorithms [5, 12].
Since our algorithm is just Q-learning, it is online and does not store additional data besides the table of Q values (and a few integers per entry of this table). Thus, it also enjoys a significant advantage over model-based algorithms in terms of time and space complexities. To our best knowledge, this is the first sharp analysis for model-free algorithms—featuring $\sqrt{T}$ regret, or equivalently $O(1/\varepsilon^2)$ samples for an ε-optimal policy—without requiring access to a "simulator."

For practitioners, there are two key takeaways from our theoretical analysis:

1. The use of UCB exploration instead of ε-greedy exploration in the model-free setting allows for better treatment of uncertainties for different states and actions.

2. It is essential to use a learning rate which is $\alpha_t = O(H/t)$, instead of $1/t$, when a state-action pair is being updated for the t-th time. The former learning rate assigns more weight to updates that are more recent, as opposed to assigning uniform weights to all previous updates. This delicate choice of reweighting leads to the crucial difference between our sample-efficient guarantee and earlier, highly inefficient results that require exponentially many samples in H.

1.1 Related Work

In this section, we focus our attention on theoretical results for the tabular MDP setting, where the numbers of states and actions are finite. We acknowledge that there has been much recent work in RL for continuous state spaces [see, e.g., 9, 11], but this setting is beyond our scope.

With simulator. Some results assume access to a simulator [15] (a.k.a. a generative model [3]), which is a strong oracle that allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. The majority of these results focus on an infinite-horizon MDP with discounted reward [e.g., 2, 3, 8, 16, 23].
When a simulator is available, model-free algorithms [2] (variants of Q-learning) are known to be almost as sample efficient as the best model-based algorithms [3]. However, the simulator setting is considered to be much easier than standard RL, as it "does not require exploration" [2]. Indeed, a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding the optimal policy [3].

Model-based:
  RLSVI [? ]                   regret $\tilde{O}(\sqrt{H^3SAT})$;             time $\tilde{O}(TS^2A^2)$;  space $O(S^2A^2H)$
  UCRL2 [10]^1                 regret at least $\tilde{O}(\sqrt{H^4S^2AT})$;  time $\Omega(TS^2A)$;       space $O(S^2AH)$
  Agrawal and Jia [1]^1        regret at least $\tilde{O}(\sqrt{H^3S^2AT})$;  time $\Omega(TS^2A)$;       space $O(S^2AH)$
  UCBVI [5]^2                  regret $\tilde{O}(\sqrt{H^2SAT})$;             time $\tilde{O}(TS^2A)$;    space $O(S^2AH)$
  vUCQ [12]^2                  regret $\tilde{O}(\sqrt{H^2SAT})$;             time $\tilde{O}(TS^2A)$;    space $O(S^2AH)$
Model-free:
  Q-learning (ε-greedy) [14]   regret $\Omega(\min\{T, A^{H/2}\})$ (if 0-initialized);  time $O(T)$;  space $O(SAH)$
  Delayed Q-learning [25]^3    regret $\tilde{O}_{S,A,H}(T^{4/5})$;           time $O(T)$;  space $O(SAH)$
  Q-learning (UCB-H)           regret $\tilde{O}(\sqrt{H^4SAT})$;             time $O(T)$;  space $O(SAH)$
  Q-learning (UCB-B)           regret $\tilde{O}(\sqrt{H^3SAT})$;             time $O(T)$;  space $O(SAH)$
Lower bound:                   regret $\Omega(\sqrt{H^2SAT})$;                time -;       space -

Table 1: Regret comparisons for RL algorithms on episodic MDPs. $T = KH$ is the total number of steps, H is the number of steps per episode, S is the number of states, and A is the number of actions. For clarity, this table is presented for $T \geq \mathrm{poly}(S, A, H)$, omitting lower-order terms.

Without simulator. Reinforcement learning becomes much more challenging without the presence of a simulator, and the choice of exploration policy can now determine the behavior of the learning algorithm.
For instance, Q-learning with ε-greedy exploration may take exponentially many episodes to learn the optimal policy [14] (for the sake of completeness, we present this result in our episodic language in Appendix A).

Mathematically, this paper defines "model-free" algorithms as in the existing literature [25, 26]:

Definition 1. A reinforcement learning algorithm is model-free if its space complexity is always sublinear (for any T) relative to the space required to store an MDP. In the episodic setting of this paper, a model-free algorithm has space complexity $o(S^2AH)$ (independent of T).

In the model-based setting, UCRL2 [10] and Agrawal and Jia [1] form estimates of the transition probabilities of the MDP using past samples, and add upper-confidence bounds (UCB) to the estimated transition matrix. When applying their results to the episodic MDP scenario, their total regret is at least $\tilde{O}(\sqrt{H^4S^2AT})$ and $\tilde{O}(\sqrt{H^3S^2AT})$, respectively.1 In contrast, the information-theoretic lower bound is $\Omega(\sqrt{H^2SAT})$. The additional $\sqrt{S}$ and $\sqrt{H}$ factors were later removed by the UCBVI algorithm [5], which adds a UCB bonus directly to the Q values instead of the estimated transition matrix.2 The vUCQ algorithm [12] is similar to UCBVI but improves lower-order regret terms using variance reduction. Finally, RLSVI [? ], an algorithm designed for the setting of linear approximation, provides an $\tilde{O}(\sqrt{H^3SAT})$ regret bound when adapted to the tabular MDP setting. However, it is a batch algorithm in nature, and requires $O(d^2H)$ space, where in the tabular setting d = SA.

We note that despite the sharp regret guarantees, most of the results in this line of research crucially rely on estimating and storing the entire transition matrix and thus suffer from unfavorable time and space complexities compared to model-free algorithms.

1Jaksch et al.
[10] and Agrawal and Jia [1] apply to the more general setting of weakly communicating MDPs with S' states and diameter D; our episodic MDP is a special case obtained by augmenting the state space so that S' = SH and D ≥ H.

2Azar et al. [5] and Kakade et al. [12] assume equal transition matrices $P_1 = \cdots = P_H$; in the setting of this paper $P_1, \cdots, P_H$ can be entirely different. This adds a factor of $\sqrt{H}$ to their total regret.

3Strehl et al. [25] applies to MDPs with S' states and discount factor γ; our episodic MDP can be converted to that case by setting S' = SH and 1 − γ = 1/H. Their result only applies to the stochastic setting where initial states $x^k_1$ come from a fixed distribution, and only gives a PAC guarantee. See our full version for a comparison between PAC and regret guarantees.

In the model-free setting, Strehl et al. [25] introduced delayed Q-learning, where, to find an ε-optimal policy, the Q value for each state-action pair is updated only once every $m = \tilde{O}(1/\varepsilon^2)$ times this pair is visited. In contrast to the incremental update of Q-learning, delayed Q-learning always replaces old Q values with the average of the most recent m experiences.
When translated to the setting of this paper, this gives $\tilde{O}(T^{4/5})$ total regret, ignoring factors in S, A and H.3 This is quite suboptimal compared to the $\tilde{O}(\sqrt{T})$ regret achieved by model-based algorithms.

2 Preliminary

We consider the setting of a tabular episodic Markov decision process, MDP(S, A, H, P, r), where S is the set of states with |S| = S, A is the set of actions with |A| = A, H is the number of steps in each episode, P is the transition matrix so that $P_h(\cdot|x,a)$ gives the distribution over states if action a is taken in state x at step h ∈ [H], and $r_h : S \times A \to [0,1]$ is the deterministic reward function at step h.4

In each episode of this MDP, an initial state $x_1$ is picked arbitrarily by an adversary. Then, at each step h ∈ [H], the agent observes state $x_h \in S$, picks an action $a_h \in A$, receives reward $r_h(x_h, a_h)$, and then transitions to a next state $x_{h+1}$ that is drawn from the distribution $P_h(\cdot|x_h, a_h)$. The episode ends when $x_{H+1}$ is reached.

A policy π of an agent is a collection of H functions $\{\pi_h : S \to A\}_{h \in [H]}$. We use $V^\pi_h : S \to \mathbb{R}$ to denote the value function at step h under policy π, so that $V^\pi_h(x)$ gives the expected sum of remaining rewards received under policy π, starting from $x_h = x$, until the end of the episode. In symbols:

$V^\pi_h(x) := \mathbb{E}\big[\textstyle\sum_{h'=h}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \,\big|\, x_h = x\big]$.

Accordingly, we also define $Q^\pi_h : S \times A \to \mathbb{R}$ to denote the Q-value function at step h, so that $Q^\pi_h(x,a)$ gives the expected sum of remaining rewards received under policy π, starting from $x_h = x$, $a_h = a$, until the end of the episode. In symbols:

$Q^\pi_h(x,a) := r_h(x,a) + \mathbb{E}\big[\textstyle\sum_{h'=h+1}^H r_{h'}(x_{h'}, \pi_{h'}(x_{h'})) \,\big|\, x_h = x, a_h = a\big]$.

Since the state and action spaces, and the horizon, are all finite, there always exists (see, e.g., [5]) an optimal policy $\pi^\star$ which gives the optimal value $V^\star_h(x) = \sup_\pi V^\pi_h(x)$ for all $x \in S$ and $h \in [H]$. For simplicity, we denote $[P_h V_{h+1}](x,a) := \mathbb{E}_{x' \sim P_h(\cdot|x,a)} V_{h+1}(x')$. Recall the Bellman equation and the Bellman optimality equation:

$V^\pi_h(x) = Q^\pi_h(x, \pi_h(x))$, $\quad Q^\pi_h(x,a) := (r_h + P_h V^\pi_{h+1})(x,a)$, $\quad V^\pi_{H+1}(x) = 0 \ \forall x \in S$;

and

$V^\star_h(x) = \max_{a \in A} Q^\star_h(x,a)$, $\quad Q^\star_h(x,a) := (r_h + P_h V^\star_{h+1})(x,a)$, $\quad V^\star_{H+1}(x) = 0 \ \forall x \in S$. (2.1)

The agent plays the game for K episodes k = 1, 2, ..., K. We let the adversary pick a starting state $x^k_1$ for each episode k, and let the agent choose a policy $\pi_k$ before starting the k-th episode. The total (expected) regret is then

$\mathrm{Regret}(K) = \sum_{k=1}^K \big[V^\star_1(x^k_1) - V^{\pi_k}_1(x^k_1)\big]$.

3 Main Results

In this section, we present our main theoretical result—a sample complexity result for a variant of Q-learning that incorporates UCB exploration. We also present a theorem that establishes an information-theoretic lower bound for episodic MDPs. As seen in the bandit setting, the choice of exploration policy plays an essential role in the efficiency of a learning algorithm.
Algorithm 1 Q-learning with UCB-Hoeffding
1: initialize $Q_h(x,a) \leftarrow H$ and $N_h(x,a) \leftarrow 0$ for all $(x,a,h) \in S \times A \times [H]$.
2: for episode k = 1, ..., K do
3:   receive $x_1$.
4:   for step h = 1, ..., H do
5:     Take action $a_h \leftarrow \mathrm{argmax}_{a'} Q_h(x_h, a')$, and observe $x_{h+1}$.
6:     $t = N_h(x_h, a_h) \leftarrow N_h(x_h, a_h) + 1$; $b_t \leftarrow c\sqrt{H^3\iota/t}$.
7:     $Q_h(x_h, a_h) \leftarrow (1-\alpha_t)Q_h(x_h, a_h) + \alpha_t[r_h(x_h, a_h) + V_{h+1}(x_{h+1}) + b_t]$.
8:     $V_h(x_h) \leftarrow \min\{H, \max_{a' \in A} Q_h(x_h, a')\}$.

4While we study deterministic reward functions for notational simplicity, our results generalize to randomized reward functions. Also, we assume the reward is in [0, 1] without loss of generality.

In episodic MDPs, Q-learning with the commonly used ε-greedy exploration strategy can be very inefficient: it can take exponentially many episodes to learn [14] (see also Appendix A). In contrast, our algorithm (Algorithm 1), which is Q-learning with an upper-confidence bound (UCB) exploration strategy, will be seen to be efficient. This algorithm maintains Q values $Q_h(x,a)$ for all $(x,a,h) \in S \times A \times [H]$, and the corresponding V values $V_h(x) \leftarrow \min\{H, \max_{a' \in A} Q_h(x,a')\}$. If, at time step h ∈ [H], the state is x ∈ S, the algorithm takes the action a ∈ A that maximizes the current estimate $Q_h(x,a)$, and is apprised of the next state $x' \in S$.
The algorithm then updates the Q values:

$Q_h(x,a) \leftarrow (1-\alpha_t)Q_h(x,a) + \alpha_t[r_h(x,a) + V_{h+1}(x') + b_t]$,

where t is the counter for how many times the algorithm has visited the state-action pair (x, a) at step h, $b_t$ is the confidence bonus indicating how certain the algorithm is about the current state-action pair, and $\alpha_t$ is a learning rate defined as follows:

$\alpha_t := \frac{H+1}{H+t}$. (3.1)

As mentioned in the introduction, our choice of learning rate $\alpha_t$ scales as O(H/t) instead of O(1/t)—this is crucial to obtain regret that is not exponential in H. We present analyses for two different specifications of the upper confidence bonus $b_t$ in this paper:

Q-learning with Hoeffding-style bonus. The first (and simpler) choice is $b_t = O(\sqrt{H^3\iota/t})$. (Here, and throughout this paper, we use $\iota := \log(SAT/p)$ to denote a log factor.) This choice of bonus makes sense intuitively because: (1) Q-values are upper-bounded by H, and, accordingly, (2) Hoeffding-type martingale concentration inequalities imply that if we have visited (x, a) for t times, then a confidence bound for the Q value scales as $1/\sqrt{t}$. For this reason, we call this choice UCB-Hoeffding (UCB-H). See Algorithm 1.

Theorem 2 (Hoeffding). There exists an absolute constant c > 0 such that, for any p ∈ (0, 1), if we choose $b_t = c\sqrt{H^3\iota/t}$, then with probability 1 − p, the total regret of Q-learning with UCB-Hoeffding (see Algorithm 1) is at most $O(\sqrt{H^4SAT \cdot \iota})$, where $\iota := \log(SAT/p)$.

Theorem 2 shows that, under a rather simple choice of exploration bonus, Q-learning can be made very efficient, enjoying $\tilde{O}(\sqrt{T})$ regret, which is optimal in terms of the dependence on T.
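To make the update loop concrete, the following is a minimal sketch of Algorithm 1 in Python on a toy tabular episodic MDP. The environment representation (explicit `transitions`/`rewards` tables), the fixed initial state, and the constants `c` and `p` are our own illustrative assumptions, not part of the paper's specification.

```python
import math
import random

def q_learning_ucb_h(S, A, H, K, transitions, rewards, c=1.0, p=0.1):
    """Sketch of Q-learning with UCB-Hoeffding bonuses (Algorithm 1).

    transitions[h][x][a] -> list of next-state probabilities
    rewards[h][x][a]     -> deterministic reward in [0, 1]
    """
    T = K * H
    iota = math.log(S * A * T / p)  # log factor from the paper
    # Optimistic initialization: Q_h(x, a) = H, counts N_h(x, a) = 0.
    Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
    N = [[[0] * A for _ in range(S)] for _ in range(H)]
    V = [[float(H)] * S for _ in range(H)] + [[0.0] * S]  # V_{H+1} = 0
    total_reward = 0.0
    for _ in range(K):
        x = 0  # fixed initial state (an assumption; the paper allows an adversary)
        for h in range(H):
            a = max(range(A), key=lambda a_: Q[h][x][a_])  # greedy w.r.t. optimistic Q
            N[h][x][a] += 1
            t = N[h][x][a]
            b = c * math.sqrt(H ** 3 * iota / t)           # Hoeffding-style bonus
            alpha = (H + 1) / (H + t)                      # the crucial O(H/t) rate
            x_next = random.choices(range(S), weights=transitions[h][x][a])[0]
            r = rewards[h][x][a]
            total_reward += r
            # Incremental, optimistic update of Q_h(x, a), then clip V_h at H.
            Q[h][x][a] = (1 - alpha) * Q[h][x][a] + alpha * (r + V[h + 1][x_next] + b)
            V[h][x] = min(H, max(Q[h][x]))
            x = x_next
    return Q, total_reward
```

Note how the bonus $b_t$ shrinks as $1/\sqrt{t}$ while the learning rate $(H+1)/(H+t)$ keeps recent updates heavily weighted.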
To the best of our knowledge, this is the first analysis of a model-free procedure that features $\sqrt{T}$ regret without requiring access to a "simulator."

Compared to the previous model-based results, Theorem 2 shows that the regret (or equivalently the sample complexity; see the discussion in the full version) of this version of Q-learning is as good as that of the best model-based algorithms in terms of the dependency on the number of states S, the number of actions A, and the total number of steps T. Although our regret slightly increases the dependency on H, the algorithm is online and does not store additional data besides the table of Q values (and a few integers per entry of this table). Thus, it enjoys an advantage over model-based algorithms in time and space complexities, especially when the number of states S is large.

Q-learning with Bernstein-style bonus. Our second specification of $b_t$ makes use of a Bernstein-style upper confidence bound. The key observation is that, although in the worst case the value function is at most H for any state-action pair, if we sum up the "total variance of the value function" for an entire episode, we obtain a factor of only $O(H^2)$, as opposed to the naive $O(H^3)$ bound (see Lemma C.5). This implies that the use of a Bernstein-type martingale concentration result could be sharper than the Hoeffding-type bound by an additional factor of H.5 (The idea of using Bernstein instead of Hoeffding for reinforcement learning applications has appeared in previous work; see, e.g., [3, 4, 16].)

Using Bernstein concentration requires us to design the bonus term $b_t$ more carefully, as it now depends on the empirical variance of $V_{h+1}(x')$, where $x'$ ranges over the next states observed during the previous t visits to the current state-action pair (x, a). This empirical variance can be computed in an online fashion without increasing the space complexity of Q-learning.
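To illustrate that last point, the running variance of the observed next-step values can be maintained with O(1) extra memory per $(x, a, h)$ entry using Welford's streaming recurrence; this is a generic sketch of the online-variance idea, not the paper's exact bonus formula from Appendix C.

```python
class OnlineVariance:
    """Streaming mean/variance (Welford's algorithm): O(1) memory per entry."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, v):
        self.n += 1
        delta = v - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (v - self.mean)

    def variance(self):
        # Population variance of the samples seen so far.
        return self.m2 / self.n if self.n > 0 else 0.0
```

In a Bernstein-style variant, one would feed each observed $V_{h+1}(x_{h+1})$ into such an accumulator and combine `variance()` with a $1/t$ term to form the bonus.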
We defer the full specification of $b_t$ to Algorithm 2 in Appendix C. We now state the regret theorem for this approach.

Theorem 3 (Bernstein). For any p ∈ (0, 1), one can specify $b_t$ so that, with probability 1 − p, the total regret of Q-learning with UCB-Bernstein (see Algorithm 2) is at most $O(\sqrt{H^3SAT \cdot \iota} + \sqrt{H^9S^3A^3} \cdot \iota^2)$.

Theorem 3 shows that for Q-learning with UCB-B exploration, the leading term in the regret (which scales as $\sqrt{T}$) improves by a factor of $\sqrt{H}$ over UCB-H exploration, at the price of using a more complicated exploration bonus design. The asymptotic regret of UCB-B is now only one $\sqrt{H}$ factor worse than the best regret achieved by model-based algorithms.

We also note that Theorem 3 has an additive term $O(\sqrt{H^9S^3A^3} \cdot \iota^2)$ in its regret, which dominates the total regret when T is not very large compared with S, A and H. It is not clear whether this lower-order term is essential, or is due to technical aspects of the current analysis.

Information-theoretical limit. To demonstrate the sharpness of our results, we also note an information-theoretic lower bound for the episodic MDP setting studied in this paper:

Theorem 4. For the episodic MDP problem studied in this paper, the expected regret for any algorithm must be at least $\Omega(\sqrt{H^2SAT})$.

Theorem 4 (see Appendix D for details) shows that both variants of our algorithm are nearly optimal, in the sense that they differ from the optimal regret by factors of H and $\sqrt{H}$, respectively.

4 Proof for Q-learning with UCB-Hoeffding

In this section, we provide the full proof of Theorem 2.
Intuitively, the episodic MDP with H steps per episode can be viewed as a contextual bandit of H "layers." The key challenge here is to control the way error and confidence propagate through different "layers" in an online fashion, where our specific choice of exploration bonus and learning rate makes the regret as sharp as possible.

Notation. We denote by $\mathbb{I}[A]$ the indicator function for event A. We denote by $(x^k_h, a^k_h)$ the actual state-action pair observed and chosen at step h of episode k. We also denote by $Q^k_h, V^k_h, N^k_h$ respectively the $Q_h, V_h, N_h$ functions at the beginning of episode k. Using this notation, the update equation at episode k can be rewritten as follows, for every $h \in [H]$:

$Q^{k+1}_h(x,a) = \begin{cases} (1-\alpha_t)Q^k_h(x,a) + \alpha_t[r_h(x,a) + V^k_{h+1}(x^k_{h+1}) + b_t] & \text{if } (x,a) = (x^k_h, a^k_h) \\ Q^k_h(x,a) & \text{otherwise.} \end{cases}$ (4.1)

Accordingly,

$V^k_h(x) \leftarrow \min\{H, \max_{a' \in A} Q^k_h(x,a')\}$, for all $x \in S$.

Recall that we have $[P_h V_{h+1}](x,a) := \mathbb{E}_{x' \sim P_h(\cdot|x,a)} V_{h+1}(x')$. We also denote its empirical counterpart at episode k as $[\hat{P}^k_h V_{h+1}](x,a) := V_{h+1}(x^k_{h+1})$, which is defined only for $(x,a) = (x^k_h, a^k_h)$.

Recall that we have chosen the learning rate $\alpha_t := \frac{H+1}{H+t}$. For notational convenience, we also introduce the following related quantities:

$\alpha^0_t = \prod_{j=1}^t (1-\alpha_j)$, $\qquad \alpha^i_t = \alpha_i \prod_{j=i+1}^t (1-\alpha_j)$. (4.2)

5Recall that for independent zero-mean random variables $X_1, \ldots, X_T$ satisfying $|X_i| \leq M$, their summation does not exceed $\tilde{O}(M\sqrt{T})$ with high probability using Hoeffding concentration. If we have in hand a better variance bound, this can be improved to $\tilde{O}\big(M + \sqrt{\sum_i \mathbb{E}[X_i]^2}\big)$ using Bernstein concentration.

It is easy to verify that (1) $\sum_{i=1}^t \alpha^i_t = 1$ and $\alpha^0_t = 0$ for $t \geq 1$; (2) $\sum_{i=1}^t \alpha^i_t = 0$ and $\alpha^0_t = 1$ for $t = 0$.

Favoring Later Updates. At any $(x,a,h,k) \in S \times A \times [H] \times [K]$, let $t = N^k_h(x,a)$ and suppose (x, a) was previously taken at step h of episodes $k_1, \ldots, k_t < k$. By the update equation (4.1) and the definition of $\alpha^i_t$ in (4.2), we have:

$Q^k_h(x,a) = \alpha^0_t H + \sum_{i=1}^t \alpha^i_t \big[r_h(x,a) + V^{k_i}_{h+1}(x^{k_i}_{h+1}) + b_i\big]$. (4.3)

According to (4.3), the Q value at episode k equals a weighted average of the V values of the "next states," with weights $\alpha^1_t, \ldots, \alpha^t_t$. Our choice of the learning rate $\alpha_t = \frac{H+1}{H+t}$ ensures that, approximately speaking, the last 1/H fraction of the indices i is given non-negligible weight, whereas the first 1 − 1/H fraction is forgotten. This ensures that the information accumulates smoothly across the H layers of the MDP. If one were to use $\alpha_t = \frac{1}{t}$ instead, the weights $\alpha^1_t, \ldots, \alpha^t_t$ would all equal 1/t, and using those V values from earlier episodes would hurt the accuracy of the Q function. In contrast, if one were to use $\alpha_t = 1/\sqrt{t}$ instead, the weights $\alpha^1_t, \ldots, \alpha^t_t$ would concentrate too much on the most recent episodes, which would incur high variance.

4.1 Proof Details

We first present an auxiliary lemma which exhibits some important properties that result from our choice of learning rate.
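The effect of this reweighting is easy to check numerically. The sketch below (our own illustration; the values H = 10 and t = 1000 are arbitrary) computes the weights $\alpha^i_t$ from (4.2) for the paper's learning rate and for the classical $1/t$ rate, and confirms that the former concentrates its mass on recent indices while the latter is exactly uniform.

```python
def weights(t, alpha):
    """Return [alpha^1_t, ..., alpha^t_t] with alpha^i_t = alpha(i) * prod_{j=i+1}^t (1 - alpha(j)),
    as in Eq. (4.2)."""
    w = []
    for i in range(1, t + 1):
        wi = alpha(i)
        for j in range(i + 1, t + 1):
            wi *= 1.0 - alpha(j)
        w.append(wi)
    return w

H, t = 10, 1000
w_paper = weights(t, lambda s: (H + 1) / (H + s))  # the paper's rate alpha_t = (H+1)/(H+t)
w_unif = weights(t, lambda s: 1.0 / s)             # classical 1/t rate

assert abs(sum(w_paper) - 1.0) < 1e-9       # total mass is 1 for t >= 1
assert max(w_unif) - min(w_unif) < 1e-12    # the 1/t rate yields exactly uniform weights 1/t
tail = sum(w_paper[-(t // H):])             # mass on the most recent 1/H fraction of indices
assert tail > 1.0 / H                       # strictly more than the uniform share
```

The weights $\alpha^i_t$ under the paper's rate are increasing in i, so the last 1/H fraction of indices carries more than its uniform share of the mass, matching the discussion above.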
The proof is based on simple manipulations of the definition of $\alpha_t$, and is provided in Appendix B.

Lemma 4.1. The following properties hold for $\alpha^i_t$:

(a) $\frac{1}{\sqrt{t}} \leq \sum_{i=1}^t \frac{\alpha^i_t}{\sqrt{i}} \leq \frac{2}{\sqrt{t}}$ for every $t \geq 1$.

(b) $\max_{i \in [t]} \alpha^i_t \leq \frac{2H}{t}$ and $\sum_{i=1}^t (\alpha^i_t)^2 \leq \frac{2H}{t}$ for every $t \geq 1$.

(c) $\sum_{t=i}^{\infty} \alpha^i_t = 1 + \frac{1}{H}$ for every $i \geq 1$.

We note that property (c) is especially important—as we will show later, each step in one episode can blow up the regret by a multiplicative factor of $\sum_{t=i}^{\infty} \alpha^i_t$. With our choice of learning rate, we ensure that this blow-up is at most $(1 + 1/H)^H$, which is a constant factor.

We now proceed to the formal proof. We start with a lemma that gives a recursive formula for $Q - Q^\star$, as a weighted average of previous updates.

Lemma 4.2 (recursion on Q). For any $(x,a,h) \in S \times A \times [H]$ and episode $k \in [K]$, let $t = N^k_h(x,a)$ and suppose (x, a) was previously taken at step h of episodes $k_1, \ldots, k_t < k$. Then:

$(Q^k_h - Q^\star_h)(x,a) = \alpha^0_t \big(H - Q^\star_h(x,a)\big) + \sum_{i=1}^t \alpha^i_t \Big[ (V^{k_i}_{h+1} - V^\star_{h+1})(x^{k_i}_{h+1}) + [(\hat{P}^{k_i}_h - P_h)V^\star_{h+1}](x,a) + b_i \Big]$.

Proof of Lemma 4.2.
From the Bellman optimality equation $Q^\star_h(x,a) = (r_h + P_h V^\star_{h+1})(x,a)$, our notation $[\hat{P}^{k_i}_h V_{h+1}](x,a) := V_{h+1}(x^{k_i}_{h+1})$, and the fact that $\sum_{i=0}^t \alpha^i_t = 1$, we have

$Q^\star_h(x,a) = \alpha^0_t Q^\star_h(x,a) + \sum_{i=1}^t \alpha^i_t \Big[ r_h(x,a) + \big(P_h - \hat{P}^{k_i}_h\big)V^\star_{h+1}(x,a) + V^\star_{h+1}(x^{k_i}_{h+1}) \Big]$.

Subtracting this equation from the formula (4.3), we obtain Lemma 4.2. □

Next, using Lemma 4.2 and the Azuma-Hoeffding concentration bound, our next lemma shows that $Q^k$ is always an upper bound on $Q^\star$ at any episode k, and that the difference between $Q^k$ and $Q^\star$ can be bounded by quantities from the next step.

Lemma 4.3 (bound on $Q^k - Q^\star$). There exists an absolute constant c > 0 such that, for any p ∈ (0, 1), letting $b_t = c\sqrt{H^3\iota/t}$, we have $\beta_t = 2\sum_{i=1}^t \alpha^i_t b_i \leq 4c\sqrt{H^3\iota/t}$ and, with probability at least 1 − p, the following holds simultaneously for all $(x,a,h,k) \in S \times A \times [H] \times [K]$:

$0 \leq (Q^k_h - Q^\star_h)(x,a) \leq \alpha^0_t H + \sum_{i=1}^t \alpha^i_t (V^{k_i}_{h+1} - V^\star_{h+1})(x^{k_i}_{h+1}) + \beta_t$,

where $t = N^k_h(x,a)$ and $k_1, \ldots, k_t < k$ are the episodes in which (x, a) was taken at step h.

Proof of Lemma 4.3. For each fixed $(x,a,h) \in S \times A \times [H]$, let us denote $k_0 = 0$, and denote

$k_i = \min\big(\{k \in [K] \mid k > k_{i-1} \wedge (x^k_h, a^k_h) = (x,a)\} \cup \{K+1\}\big)$.

That is, $k_i$ is the episode in which (x, a) was taken at step h for the i-th time (or $k_i = K+1$ if it was taken fewer than i times). The random variable $k_i$ is clearly a stopping time. Let $\mathcal{F}_i$ be the σ-field generated by all the random variables until episode $k_i$, step h. Then $\big(\mathbb{I}[k_i \leq K] \cdot [(\hat{P}^{k_i}_h - P_h)V^\star_{h+1}](x,a)\big)_{i=1}^{\tau}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_i\}_{i \geq 0}$. By
By\n[(\u02c6Pki\nAzuma-Hoeffding and a union bound, we have that with probability at least 1 \u2212 p/(SAH):\n\nh \u2212 Ph)V (cid:63)\n\nki = min(cid:0)(cid:8)k \u2208 [K] | k > ki\u22121 \u2227 (xk\nh+1](x, a)(cid:1)\u03c4\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u03c4(cid:88)\n\n\u03c4 \u00b7 I[ki \u2264 K] \u00b7 [(\u02c6Pki\n\u03b1i\n\nh \u2212 Ph)V (cid:63)\n\ni=1\n\n\u2200\u03c4 \u2208 [K] :\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 cH\n\n2\n\n(cid:118)(cid:117)(cid:117)(cid:116) \u03c4(cid:88)\n\ni=1\n\n(cid:114)\n\u03c4 )2 \u00b7 \u03b9 \u2264 c\n\n(\u03b1i\n\nH 3\u03b9\n\n\u03c4\n\n,\n\nh+1](x, a)\n\n(4.4)\nfor some absolute constant c. Because inequality (4.4) holds for all \ufb01xed \u03c4 \u2208 [K] uniformly, it also\nh (x, a) \u2264 K, which is a random variable, where k \u2208 [K]. Also note I[ki \u2264\nholds for \u03c4 = t = N k\nK] = 1 for all i \u2264 N k\nh (x, a). Putting everything together, and using a union bound, we see that with\nleast 1\u2212 p probability, the following holds simultaneously for all (x, a, h, k) \u2208 S \u00d7A\u00d7 [H]\u00d7 [K]:\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t(cid:88)\nOn the other hand, if we choose bt = c(cid:112)H 3\u03b9/t for the same constant c in Eq. (4.4), we have\ntbi \u2208 [c(cid:112)H 3\u03b9/t, 2c(cid:112)H 3\u03b9/t(cid:3) according to Lemma 4.1.a. Then the right-hand side\n\u03b2t/2 =(cid:80)t\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 c\n(cid:114)\n\nof Lemma 4.3 follows immediately from Lemma 4.2 and inequality (4.5). The left-hand side also\n(cid:3)\nfollows from Lemma 4.2 and Eq. (4.5) and induction on h = H, H \u2212 1, . . . , 1.\n\nh \u2212 Ph)V (cid:63)\n\nt[(\u02c6Pki\n\u03b1i\n\nh+1](x, a)\n\nh (x, a) .\n\nt = N k\n\ni=1 \u03b1i\n\nwhere\n\n(4.5)\n\nH 3\u03b9\n\ni=1\n\nt\n\nWe are now ready to prove Theorem 2. 
The proof decomposes the regret in a recursive form, and carefully controls the error propagation via repeated use of Lemma 4.3.

Proof of Theorem 2. Denote
$$\delta^k_h := (V^k_h - V^{\pi_k}_h)(x^k_h) \quad \text{and} \quad \phi^k_h := (V^k_h - V^\star_h)(x^k_h) \,.$$
By Lemma 4.3, we have that with probability $1-p$, $Q^k_h \ge Q^\star_h$ and thus $V^k_h \ge V^\star_h$. Thus, the total regret can be upper bounded:
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} (V^\star_1 - V^{\pi_k}_1)(x^k_1) \le \sum_{k=1}^{K} (V^k_1 - V^{\pi_k}_1)(x^k_1) = \sum_{k=1}^{K} \delta^k_1 \,.$$
The main idea of the rest of the proof is to upper bound $\sum_{k=1}^{K} \delta^k_h$ by the next-step quantity $\sum_{k=1}^{K} \delta^k_{h+1}$, thus giving a recursive formula to calculate the total regret. We obtain such a recursive formula by relating $\sum_{k=1}^{K} \delta^k_h$ to $\sum_{k=1}^{K} \phi^k_{h+1}$.

For any fixed $(k,h) \in [K] \times [H]$, let $t = N^k_h(x^k_h, a^k_h)$, and suppose $(x^k_h, a^k_h)$ was previously taken at step $h$ of episodes $k_1, \dots, k_t < k$. Then we have:
$$\begin{aligned}
\delta^k_h = (V^k_h - V^{\pi_k}_h)(x^k_h)
&\overset{(i)}{\le} (Q^k_h - Q^{\pi_k}_h)(x^k_h, a^k_h) \\
&= (Q^k_h - Q^\star_h)(x^k_h, a^k_h) + (Q^\star_h - Q^{\pi_k}_h)(x^k_h, a^k_h) \\
&\overset{(ii)}{\le} \alpha^0_t H + \sum_{i=1}^{t} \alpha^i_t \phi^{k_i}_{h+1} + \beta_t + [\mathbb{P}_h (V^\star_{h+1} - V^{\pi_k}_{h+1})](x^k_h, a^k_h) \\
&\overset{(iii)}{=} \alpha^0_t H + \sum_{i=1}^{t} \alpha^i_t \phi^{k_i}_{h+1} + \beta_t - \phi^k_{h+1} + \delta^k_{h+1} + \xi^k_{h+1} \,,
\end{aligned} \qquad (4.6)$$
where $\beta_t = 2\sum_{i=1}^{t} \alpha^i_t b_i \le O(1)\sqrt{H^3\iota/t}$ and $\xi^k_{h+1} := [(\mathbb{P}_h - \hat{\mathbb{P}}^k_h)(V^\star_{h+1} - V^{\pi_k}_{h+1})](x^k_h, a^k_h)$ is a martingale difference sequence. Inequality (i) holds because $V^k_h(x^k_h) \le \max_{a' \in \mathcal{A}} Q^k_h(x^k_h, a') = Q^k_h(x^k_h, a^k_h)$; inequality (ii) holds by Lemma 4.3 and the Bellman equation (2.1); and equality (iii) holds by the definitions above, since $\delta^k_{h+1} - \phi^k_{h+1} = (V^\star_{h+1} - V^{\pi_k}_{h+1})(x^k_{h+1})$.

We turn to computing the summation $\sum_{k=1}^{K} \delta^k_h$. Denoting $n^k_h := N^k_h(x^k_h, a^k_h)$, the first term of (4.6) contributes
$$\sum_{k=1}^{K} \alpha^0_{n^k_h} H = \sum_{k=1}^{K} H \cdot \mathbb{I}[n^k_h = 0] \le SAH \,.$$
The key step is to upper bound the second term in (4.6), which is
$$\sum_{k=1}^{K} \sum_{i=1}^{n^k_h} \alpha^i_{n^k_h} \phi^{k_i(x^k_h, a^k_h)}_{h+1} \,,$$
where $k_i(x^k_h, a^k_h)$ is the episode in which $(x^k_h, a^k_h)$ was taken at step $h$ for the $i$-th time. We regroup the summands in a different way. For every $k' \in [K]$, the term $\phi^{k'}_{h+1}$ appears in the summand with $k > k'$ if and only if $(x^k_h, a^k_h) = (x^{k'}_h, a^{k'}_h)$. The first time it appears we have $n^k_h = n^{k'}_h + 1$, the second time it appears we have $n^k_h = n^{k'}_h + 2$, and so on. Therefore
$$\sum_{k=1}^{K} \sum_{i=1}^{n^k_h} \alpha^i_{n^k_h} \phi^{k_i(x^k_h, a^k_h)}_{h+1} \le \sum_{k'=1}^{K} \phi^{k'}_{h+1} \sum_{t = n^{k'}_h + 1}^{\infty} \alpha^{n^{k'}_h + 1}_t \le \Big(1 + \frac{1}{H}\Big) \sum_{k=1}^{K} \phi^k_{h+1} \,,$$
where the final inequality uses $\sum_{t=i}^{\infty} \alpha^i_t = 1 + \frac{1}{H}$ from Lemma 4.1.c. Plugging these back into (4.6), we have:
$$\begin{aligned}
\sum_{k=1}^{K} \delta^k_h
&\le SAH + \Big(1 + \frac{1}{H}\Big) \sum_{k=1}^{K} \phi^k_{h+1} - \sum_{k=1}^{K} \phi^k_{h+1} + \sum_{k=1}^{K} \delta^k_{h+1} + \sum_{k=1}^{K} (\beta_{n^k_h} + \xi^k_{h+1}) \\
&\le SAH + \Big(1 + \frac{1}{H}\Big) \sum_{k=1}^{K} \delta^k_{h+1} + \sum_{k=1}^{K} (\beta_{n^k_h} + \xi^k_{h+1}) \,,
\end{aligned} \qquad (4.7)$$
where the final inequality uses $\phi^k_{h+1} \le \delta^k_{h+1}$ (owing to the fact that $V^\star \ge V^{\pi_k}$).
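The two weight properties just invoked, Lemma 4.1.a (which keeps $\beta_t = 2\sum_i \alpha^i_t b_i$ of order $\sqrt{H^3\iota/t}$) and Lemma 4.1.c (the source of the $(1+1/H)$ factor in the regrouping), are easy to sanity-check numerically for the learning rate $\alpha_t = (H+1)/(H+t)$. The script below is an illustrative check, not part of the proof; the truncation of the infinite sum at $t = 400$ is an assumption justified by the fast ($t^{-(H+1)}$) decay of $\alpha^i_t$.

```python
import numpy as np

def alpha_weights(t, H):
    """Return [alpha_t^0, ..., alpha_t^t] for step sizes alpha_j = (H+1)/(H+j)."""
    a = np.array([(H + 1) / (H + j) for j in range(1, t + 1)])
    w = np.zeros(t + 1)
    w[0] = np.prod(1.0 - a)                      # alpha_t^0 = prod_j (1 - alpha_j)
    for i in range(1, t + 1):
        w[i] = a[i - 1] * np.prod(1.0 - a[i:])   # alpha_t^i = alpha_i prod_{j>i} (1 - alpha_j)
    return w

H = 10

# Lemma 4.1.a: 1/sqrt(t) <= sum_{i=1}^t alpha_t^i / sqrt(i) <= 2/sqrt(t),
# the property behind beta_t = O(sqrt(H^3 iota / t)).
for t in (1, 5, 50):
    w = alpha_weights(t, H)
    s = sum(w[i] / np.sqrt(i) for i in range(1, t + 1))
    assert 1 / np.sqrt(t) <= s <= 2 / np.sqrt(t)

# Lemma 4.1.c: sum_{t >= i} alpha_t^i = 1 + 1/H; this is exactly the total
# weight with which each phi_{h+1}^{k'} is recounted in the regrouping step.
i = 3
total = sum(alpha_weights(t, H)[i] for t in range(i, 400))
assert abs(total - (1 + 1 / H)) < 1e-6
```

The same helper also confirms that the weights form a probability distribution ($\sum_{i=0}^t \alpha^i_t = 1$), the fact used at the start of the proof of Lemma 4.2.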
Recursing the result for $h = 1, 2, \dots, H$, and using the fact that $\delta^k_{H+1} \equiv 0$ together with $(1 + 1/H)^H \le e$, we have:
$$\sum_{k=1}^{K} \delta^k_1 \le O\bigg( H^2 SA + \sum_{h=1}^{H} \sum_{k=1}^{K} (\beta_{n^k_h} + \xi^k_{h+1}) \bigg) \,.$$
Finally, by the pigeonhole principle, for any $h \in [H]$:
$$\sum_{k=1}^{K} \beta_{n^k_h} \le O(1) \cdot \sum_{k=1}^{K} \sqrt{\frac{H^3\iota}{n^k_h}} = O(1) \cdot \sum_{x,a} \sum_{n=1}^{N^K_h(x,a)} \sqrt{\frac{H^3\iota}{n}} \overset{(iv)}{\le} O\big(\sqrt{H^3 SAK\iota}\big) = O\big(\sqrt{H^2 SAT\iota}\big) \,, \qquad (4.8)$$
where inequality (iv) is true because $\sum_{x,a} N^K_h(x,a) = K$ and the left-hand side of (iv) is maximized when $N^K_h(x,a) = K/SA$ for all $(x,a)$. Also, by the Azuma-Hoeffding inequality, with probability $1-p$ we have:
$$\bigg| \sum_{h=1}^{H} \sum_{k=1}^{K} \xi^k_{h+1} \bigg| = \bigg| \sum_{h=1}^{H} \sum_{k=1}^{K} [(\mathbb{P}_h - \hat{\mathbb{P}}^k_h)(V^\star_{h+1} - V^{\pi_k}_{h+1})](x^k_h, a^k_h) \bigg| \le cH\sqrt{T\iota} \,.$$
In sum, we have $\sum_{k=1}^{K} \delta^k_1 \le O\big(H^2 SA + \sqrt{H^4 SAT\iota}\big)$, with probability at least $1 - 2p$. We note that when $T \ge \sqrt{H^4 SAT\iota}$ we have $\sqrt{H^4 SAT\iota} \ge H^2 SA$, and when $T \le \sqrt{H^4 SAT\iota}$ we have $\sum_{k=1}^{K} \delta^k_1 \le HK = T \le \sqrt{H^4 SAT\iota}$. Therefore, we can remove the $H^2 SA$ term in the regret upper bound, obtaining $\sum_{k=1}^{K} \delta^k_1 \le O\big(\sqrt{H^4 SAT\iota}\big)$ with probability at least $1 - 2p$. Rescaling $p$ to $p/2$ finishes the proof. $\square$

Acknowledgements

We thank Nan Jiang, Sham M. Kakade, Greg Yang and Chicheng Zhang for valuable discussions. This work was supported in part by the DARPA program on Lifelong Learning Machines and the Microsoft Research Gratis Traveler program.

References

[1] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Conference on Neural Information Processing Systems, pages 1184-1194. Curran Associates Inc., 2017.

[2] Mohammad Azar, Remi Munos, Mohammad Ghavamzadeh, and Hilbert J. Kappen. Speedy Q-learning. In Conference on Neural Information Processing Systems, pages 2411-2419. Curran Associates Inc., 2011.

[3] Mohammad Azar, Rémi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[4] Mohammad Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325-349, 2013.

[5] Mohammad Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 263-272, 2017.

[6] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[7] Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 465-472, 2011.

[8] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25, 2003.

[9] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for linearized control problems. arXiv preprint arXiv:1801.05039, 2018.

[10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.

[11] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. arXiv preprint arXiv:1610.09512, 2016.

[12] Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184, 2018.

[14] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

[15] Sven Koenig and Reid G. Simmons. Complexity analysis of real-time reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 99-105, 1993.

[16] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320-334, 2012.

[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[18] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), pages 1928-1937, 2016.

[19] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.

[20] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.

[] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.

[21] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.

[22] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889-1897, 2015.

[23] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770-787. SIAM, 2018.

[24] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[25] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881-888. ACM, 2006.

[26] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[27] Christopher Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.