{"title": "Speedy Q-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2411, "page_last": 2419, "abstract": "We introduce a new convergent variant of Q-learning, called speedy Q-learning, to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor \\gamma only T=O\\big(\\log(n)/(\\epsilon^{2}(1-\\gamma)^{4})\\big) steps are required for the SQL algorithm to converge to an \\epsilon-optimal action-value function with high probability. This bound has a better dependency on 1/\\epsilon and 1/(1-\\gamma), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning.", "full_text": "Speedy Q-Learning\n\nMohammad Gheshlaghi Azar\nRadboud University Nijmegen\nGeert Grooteplein 21N, 6525 EZ\n\nNijmegen, Netherlands\n\nm.azar@science.ru.nl\n\nMohammad Ghavamzadeh\nINRIA Lille, SequeL Project\n\n40 avenue Halley\n\n59650 Villeneuve d\u2019Ascq, France\nm.ghavamzadeh@inria.fr\n\nRemi Munos\n\nINRIA Lille, SequeL Project\n\n40 avenue Halley\n\n59650 Villeneuve d\u2019Ascq, France\n\nr.munos@inria.fr\n\nHilbert J. Kappen\n\nRadboud University Nijmegen\nGeert Grooteplein 21N, 6525 EZ\n\nNijmegen, Netherlands\n\nb.kappen@science.ru.nl\n\nAbstract\n\nWe introduce a new convergent variant of Q-learning, called speedy Q-learning\n(SQL), to address the problem of slow convergence in the standard form of the\nQ-learning algorithm. 
We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor \u03b3, only T = O(log(n)/(\u03b5^2 (1 \u2212 \u03b3)^4)) steps are required for the SQL algorithm to converge to an \u03b5-optimal action-value function with high probability. This bound has a better dependency on 1/\u03b5 and 1/(1 \u2212 \u03b3), and thus is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration, which are considered to be more efficient than incremental methods like Q-learning.\n\n1 Introduction\n\nQ-learning [20] is a well-known model-free reinforcement learning (RL) algorithm that finds an estimate of the optimal action-value function. Q-learning is a combination of dynamic programming, more specifically the value iteration algorithm, and stochastic approximation. In finite state-action problems, it has been shown that Q-learning converges to the optimal action-value function [5, 10]. However, it suffers from slow convergence, especially when the discount factor \u03b3 is close to one [8, 17]. The main reason for the slow convergence of Q-learning is the combination of the sample-based stochastic approximation (which makes use of a decaying learning rate) and the fact that the Bellman operator propagates information throughout the whole space (especially when \u03b3 is close to 1).\nIn this paper, we focus on RL problems that are formulated as finite state-action discounted infinite-horizon Markov decision processes (MDPs), and propose an algorithm, called speedy Q-learning (SQL), that addresses the problem of slow convergence of Q-learning. At each time step, SQL uses two successive estimates of the action-value function, which makes its space complexity twice that of standard Q-learning. 
However, this allows SQL to use a more aggressive learning rate for one of the terms in its update rule and eventually achieve a faster convergence rate than standard Q-learning (see Section 3.1 for a more detailed discussion). We prove a PAC bound on the performance of SQL, which shows that only T = O(log(n)/((1 \u2212 \u03b3)^4 \u03b5^2)) samples are required for SQL to guarantee an \u03b5-optimal action-value function with high probability. This is superior to the best result for standard Q-learning by [8], both in terms of 1/\u03b5 and 1/(1 \u2212 \u03b3). The rate for SQL is even better than that for the Phased Q-learning algorithm, a model-free batch Q-value iteration algorithm proposed and analyzed by [12]. In addition, SQL\u2019s rate is slightly better than the rate of the model-based batch Q-value iteration algorithm in [12], and SQL has better computational and memory requirements (computational and space complexity); see Section 3.3.2 for more detailed comparisons. Similar to Q-learning, SQL may be implemented in synchronous and asynchronous fashions. For the sake of simplicity in the analysis, we only report and analyze its synchronous version in this paper. However, it can easily be implemented in an asynchronous fashion, and our theoretical results can also be extended to this setting by following the same path as [8].\n\nThe idea of using previous estimates of the action-values has already been used to improve the performance of Q-learning. A popular algorithm of this kind is Q(\u03bb) [14, 20], which incorporates the concept of eligibility traces in Q-learning, and has been empirically shown to have a better performance than Q-learning, i.e., Q(0), for suitable values of \u03bb. Another recent work in this direction is Double Q-learning [19], which uses two estimators for the action-value function to alleviate the over-estimation of action-values in Q-learning. 
This over-estimation is caused by a positive bias introduced by using the maximum action value as an approximation for the expected action value [19].\n\nThe rest of the paper is organized as follows. After introducing the notation used in the paper in Section 2, we present our Speedy Q-learning algorithm in Section 3. We first describe the algorithm in Section 3.1, then state our main theoretical result, i.e., a high-probability bound on the performance of SQL, in Section 3.2, and finally compare our bound with the previous results on Q-learning in Section 3.3. Section 4 contains the detailed proof of the performance bound of the SQL algorithm. Finally, we conclude the paper and discuss some future directions in Section 5.\n\n2 Preliminaries\n\nIn this section, we introduce some concepts and definitions from the theory of Markov decision processes (MDPs) that are used throughout the paper. We start with the definition of the supremum norm. For a real-valued function g : Y \u21a6 \u211d, where Y is a finite set, the supremum norm of g is defined as \u2016g\u2016 \u225c max_{y \u2208 Y} |g(y)|.\nWe consider the standard reinforcement learning (RL) framework [5, 16] in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time discounted MDP. A discounted MDP is a quintuple (X, A, P, R, \u03b3), where X and A are the sets of states and actions, P is the state transition distribution, R is the reward function, and \u03b3 \u2208 (0, 1) is a discount factor. We denote by P(\u00b7|x, a) and r(x, a) the probability distribution over the next state and the immediate reward of taking action a at state x, respectively. To keep the representation succinct, we use Z for the joint state-action space X \u00d7 A.\nAssumption 1 (MDP Regularity). We assume Z and, subsequently, X and A are finite sets with cardinalities n, |X| and |A|, respectively. 
We also assume that the immediate rewards r(x, a) are uniformly bounded by R_max, and define the horizon of the MDP \u03b2 \u225c 1/(1 \u2212 \u03b3) and V_max \u225c \u03b2 R_max.\n\nA stationary Markov policy \u03c0(\u00b7|x) is the distribution over the control actions given the current state x. It is deterministic if this distribution concentrates on a single action. The value and the action-value functions of a policy \u03c0, denoted respectively by V^\u03c0 : X \u21a6 \u211d and Q^\u03c0 : Z \u21a6 \u211d, are defined as the expected sum of discounted rewards that are encountered when the policy \u03c0 is executed. Given an MDP, the goal is to find a policy that attains the best possible values, V^*(x) \u225c sup_\u03c0 V^\u03c0(x), \u2200x \u2208 X. The function V^* is called the optimal value function. Similarly, the optimal action-value function is defined as Q^*(x, a) = sup_\u03c0 Q^\u03c0(x, a), \u2200(x, a) \u2208 Z. The optimal action-value function Q^* is the unique fixed point of the Bellman optimality operator T, defined as\n\n(TQ)(x, a) \\triangleq r(x, a) + \\gamma \\sum_{y \\in X} P(y|x, a) \\max_{b \\in A} Q(y, b), \\quad \\forall (x, a) \\in Z.\n\nIt is important to note that T is a contraction with factor \u03b3, i.e., for any pair of action-value functions Q and Q\u2032, we have \u2016TQ \u2212 TQ\u2032\u2016 \u2264 \u03b3 \u2016Q \u2212 Q\u2032\u2016 [4, Chap. 1]. 
Finally, for the sake of readability, we define the max operator M over action-value functions as (MQ)(x) = max_{a \u2208 A} Q(x, a), \u2200x \u2208 X.\n\n3 Speedy Q-Learning\n\nIn this section, we introduce our RL algorithm, called speedy Q-learning (SQL), derive a performance bound for this algorithm, and compare this bound with similar results on standard Q-learning. The derived performance bound shows that SQL has a rate of convergence of order O(\u221a(1/T)), which is better than all the existing results for Q-learning.\n\n3.1 Speedy Q-Learning Algorithm\n\nThe pseudo-code of the SQL algorithm is shown in Algorithm 1. As can be seen, this is the synchronous version of the algorithm, which will be analyzed in the paper. Similar to standard Q-learning, SQL may be implemented either synchronously or asynchronously. In the asynchronous version, at each time step, the action-value of the observed state-action pair is updated, while the rest of the state-action pairs remain unchanged. For the convergence of this instance of the algorithm, it is required that all the states and actions are visited infinitely many times, which makes the analysis slightly more complicated. On the other hand, given a generative model, the algorithm may also be formulated in a synchronous fashion, in which we first generate a next state y \u223c P(\u00b7|x, a) for each state-action pair (x, a), and then update the action-values of all the state-action pairs using these samples. We chose to include only the synchronous version of SQL in the paper just for the sake of simplicity in the analysis. 
However, the algorithm can be implemented in an asynchronous fashion (similar to the more familiar instance of Q-learning), and our theoretical results can also be extended to the asynchronous case under some mild assumptions (see footnote 1).\n\nAlgorithm 1: Synchronous Speedy Q-Learning (SQL)\nInput: Initial action-value function Q_0, discount factor \u03b3, and number of iterations T\nQ_{\u22121} := Q_0;  // Initialization\nfor k := 0, 1, 2, 3, . . . , T \u2212 1 do  // Main loop\n    \u03b1_k := 1/(k + 1);\n    for each (x, a) \u2208 Z do\n        Generate the next state sample y_k \u223c P(\u00b7|x, a);\n        T_k Q_{k\u22121}(x, a) := r(x, a) + \u03b3 M Q_{k\u22121}(y_k);\n        T_k Q_k(x, a) := r(x, a) + \u03b3 M Q_k(y_k);  // Empirical Bellman operator\n        Q_{k+1}(x, a) := Q_k(x, a) + \u03b1_k (T_k Q_{k\u22121}(x, a) \u2212 Q_k(x, a)) + (1 \u2212 \u03b1_k)(T_k Q_k(x, a) \u2212 T_k Q_{k\u22121}(x, a));  // SQL update rule\n    end\nend\nreturn Q_T\n\nAs can be seen from Algorithm 1, at each time step k, SQL keeps track of the action-value functions of the two time steps k and k \u2212 1, and its main update rule is of the following form:\n\nQ_{k+1}(x, a) = Q_k(x, a) + \\alpha_k \\big( T_k Q_{k-1}(x, a) - Q_k(x, a) \\big) + (1 - \\alpha_k) \\big( T_k Q_k(x, a) - T_k Q_{k-1}(x, a) \\big), \\qquad (1)\n\nwhere T_k Q(x, a) = r(x, a) + \u03b3 M Q(y_k) is the empirical Bellman optimality operator for the sampled next state y_k \u223c P(\u00b7|x, a). At each time step k and for each state-action pair (x, a), SQL works as follows: (i) it generates a next state y_k by drawing a sample from P(\u00b7|x, a), (ii) it calculates two sample estimates T_k Q_{k\u22121}(x, a) and T_k Q_k(x, a) of the Bellman optimality operator (for state-action pair (x, a) using the next state y_k) applied to the estimates Q_{k\u22121} and Q_k of the action-value function at the previous and current time steps, and finally (iii) it updates the action-value function of (x, a), generating Q_{k+1}(x, a), using the update rule of Eq. 1. 
Moreover, we let \u03b1_k decay linearly with time, i.e., \u03b1_k = 1/(k + 1), in the SQL algorithm (see footnote 2). The update rule of Eq. 1 may be rewritten in the following more compact form:\n\nQ_{k+1}(x, a) = (1 - \\alpha_k) Q_k(x, a) + \\alpha_k D_k[Q_k, Q_{k-1}](x, a), \\qquad (2)\n\nwhere D_k[Q_k, Q_{k\u22121}](x, a) \u225c k T_k Q_k(x, a) \u2212 (k \u2212 1) T_k Q_{k\u22121}(x, a). This compact form will come in particularly handy in the analysis of the algorithm in Section 4.\n\nFootnote 1: See [2] for the convergence analysis of the asynchronous variant of SQL.\nFootnote 2: Note that other (polynomial) learning steps can also be used with speedy Q-learning. However, one can show that the rate of convergence of SQL is optimized for \u03b1_k = 1/(k + 1). This is in contrast to the standard Q-learning algorithm, for which the rate of convergence is optimized for a polynomial learning step [8].\n\nLet us consider the update rule of Q-learning\n\nQ_{k+1}(x, a) = Q_k(x, a) + \\alpha_k \\big( T_k Q_k(x, a) - Q_k(x, a) \\big),\n\nwhich may be rewritten as\n\nQ_{k+1}(x, a) = Q_k(x, a) + \\alpha_k \\big( T_k Q_{k-1}(x, a) - Q_k(x, a) \\big) + \\alpha_k \\big( T_k Q_k(x, a) - T_k Q_{k-1}(x, a) \\big). \\qquad (3)\n\nComparing the Q-learning update rule of Eq. 3 with the one for SQL in Eq. 1, we first notice that the same terms, T_k Q_{k\u22121} \u2212 Q_k and T_k Q_k \u2212 T_k Q_{k\u22121}, appear on the RHS of the update rules of both algorithms. However, while Q-learning uses the same conservative learning rate \u03b1_k for both these terms, SQL uses \u03b1_k for the first term and a bigger learning step 1 \u2212 \u03b1_k = k/(k + 1) for the second one. Since the term T_k Q_k \u2212 T_k Q_{k\u22121} goes to zero as Q_k approaches its optimal value Q^*, it is not necessary that its learning rate approaches zero. As a result, using the learning rate \u03b1_k, which goes to zero with k, is too conservative for this term. 
This might be a reason why SQL, which uses a more aggressive learning rate 1 \u2212 \u03b1_k for this term, has a faster convergence rate than Q-learning.\n\n3.2 Main Theoretical Result\n\nThe main theoretical result of the paper is expressed as a high-probability bound on the performance of the SQL algorithm.\nTheorem 1. Let Assumption 1 hold and T be a positive integer. Then, at iteration T of SQL, with probability at least 1 \u2212 \u03b4, we have\n\n\\|Q^* - Q_T\\| \\le 2 \\beta^2 R_{\\max} \\left[ \\frac{\\gamma}{T} + \\sqrt{\\frac{2 \\log \\frac{2n}{\\delta}}{T}} \\right].\n\nWe report the proof of Theorem 1 in Section 4. This result, combined with the Borel-Cantelli lemma [9], guarantees that Q_T converges almost surely to Q^* with the rate \u221a(1/T). Further, the following result, which quantifies the number of steps T required to reach the error \u03b5 > 0 in estimating the optimal action-value function, w.p. 1 \u2212 \u03b4, is an immediate consequence of Theorem 1.\nCorollary 1 (Finite-time PAC (\u201cprobably approximately correct\u201d) performance bound for SQL). Under Assumption 1, for any \u03b5 > 0, after\n\nT = \\frac{11.66 \\beta^4 R_{\\max}^2 \\log \\frac{2n}{\\delta}}{\\epsilon^2}\n\nsteps of SQL, the uniform approximation error \u2016Q^* \u2212 Q_T\u2016 \u2264 \u03b5, with probability at least 1 \u2212 \u03b4.\n\n3.3 Relation to Existing Results\n\nIn this section, we first compare our results for SQL with the existing results on the convergence of standard Q-learning. This comparison indicates that SQL accelerates the convergence of Q-learning, especially for \u03b3 close to 1 and small \u03b5. We then compare SQL with batch Q-value iteration (QI) in terms of sample and computational complexities, i.e., the number of samples and the computational cost required to achieve an \u03b5-optimal solution w.p. 
1 \u2212 \u03b4, as well as space complexity, i.e., the memory required at each step of the algorithm.\n\n3.3.1 A Comparison with the Convergence Rate of Standard Q-Learning\n\nThere are not many studies in the literature concerning the convergence rate of incremental model-free RL algorithms such as Q-learning. [17] has provided the asymptotic convergence rate for Q-learning under the assumption that all the states have the same next state distribution. This result shows that the asymptotic convergence rate of Q-learning has exponential dependency on 1 \u2212 \u03b3, i.e., the rate of convergence is of \u00d5(1/t^(1 \u2212 \u03b3)) for \u03b3 \u2265 1/2.\nThe finite-time behavior of Q-learning has been thoroughly investigated in [8] for different time scales. Their main result indicates that by using the polynomial learning step \u03b1_k = 1/(k + 1)^\u03c9, 0.5 < \u03c9 < 1, Q-learning achieves \u03b5-optimal performance w.p. at least 1 \u2212 \u03b4 after\n\nT = O\\left( \\left[ \\frac{\\beta^4 R_{\\max}^2 \\log \\frac{n \\beta R_{\\max}}{\\delta \\epsilon}}{\\epsilon^2} \\right]^{\\frac{1}{\\omega}} + \\left[ \\beta \\log \\frac{\\beta R_{\\max}}{\\epsilon} \\right]^{\\frac{1}{1 - \\omega}} \\right) \\qquad (4)\n\nsteps. When \u03b3 \u2248 1, one can argue that \u03b2 = 1/(1 \u2212 \u03b3) becomes the dominant term in the bound of Eq. 4, and thus the optimized bound w.r.t. \u03c9 is obtained for \u03c9 = 4/5 and is of \u00d5(\u03b2^5/\u03b5^2.5). On the other hand, SQL is guaranteed to achieve the same precision in only O(\u03b2^4/\u03b5^2) steps. The difference between these two bounds is significant for large values of \u03b2, i.e., \u03b3\u2019s close to 1.\n\n3.3.2 SQL vs. Q-Value Iteration\n\nFinite sample bounds for both model-based and model-free (Phased Q-learning) QI have been derived in [12] and [7]. 
These algorithms can be considered as the batch version of Q-learning. They show that to compute \u03b5-optimal action-value functions with high probability, we need O((n\u03b2^5/\u03b5^2) log(1/\u03b5) (log(n\u03b2) + log log(1/\u03b5))) and O((n\u03b2^4/\u03b5^2) (log(n\u03b2) + log log(1/\u03b5))) samples in model-free and model-based QI, respectively. A comparison between their results and the main result of this paper suggests that the sample complexity of SQL, which is of order O((n\u03b2^4/\u03b5^2) log n) (see footnote 3), is better than that of model-free QI in terms of \u03b2 and log(1/\u03b5). Although the sample complexity of SQL is only slightly tighter than that of model-based QI, SQL has a significantly better computational and space complexity than model-based QI: SQL needs only 2n memory space, while the space complexity of model-based QI is given by either \u00d5(n\u03b2^4/\u03b5^2) or n(|X| + 1), depending on whether the learned state transition matrix is sparse or not [12]. 
Also, SQL improves the computational complexity by a factor of \u00d5(\u03b2) compared to both model-free and model-based QI (see footnote 4). Table 1 summarizes the comparisons between SQL and the other RL methods discussed in this section.\n\nTable 1: Comparison between SQL, Q-learning, model-based and model-free Q-value iteration in terms of sample complexity (SC), computational complexity (CC), and space complexity (SPC).\n\nMethod | SQL | Q-learning (optimized) | Model-based QI | Model-free QI\nSC | \u00d5(n\u03b2^4/\u03b5^2) | \u00d5(n\u03b2^5/\u03b5^2.5) | \u00d5(n\u03b2^4/\u03b5^2) | \u00d5(n\u03b2^5/\u03b5^2)\nCC | \u00d5(n\u03b2^4/\u03b5^2) | \u00d5(n\u03b2^5/\u03b5^2.5) | \u00d5(n\u03b2^5/\u03b5^2) | \u00d5(n\u03b2^5/\u03b5^2)\nSPC | \u0398(n) | \u0398(n) | \u00d5(n\u03b2^4/\u03b5^2) | \u0398(n)\n\n4 Analysis\n\nIn this section, we give some intuition about the convergence of SQL and provide the full proof of the finite-time analysis reported in Theorem 1. We start by introducing some notation.\nLet F_k be the filtration generated by the sequence of all random samples {y_1, y_2, . . . , y_k} drawn from the distribution P(\u00b7|x, a), for all state-action pairs (x, a) up to round k. We define the operator D[Q_k, Q_{k\u22121}] as the expected value of the empirical operator D_k conditioned on F_{k\u22121}:\n\nD[Q_k, Q_{k-1}](x, a) \\triangleq E\\big( D_k[Q_k, Q_{k-1}](x, a) \\mid F_{k-1} \\big) = k T Q_k(x, a) - (k - 1) T Q_{k-1}(x, a).\n\nThus the update rule of SQL can be written as\n\nQ_{k+1}(x, a) = (1 - \\alpha_k) Q_k(x, a) + \\alpha_k \\big( D[Q_k, Q_{k-1}](x, a) - \\epsilon_k(x, a) \\big), \\qquad (5)\n\nFootnote 3: Note that at each round of SQL, n new samples are generated. 
This, combined with the result of Corollary 1, yields a sample complexity of order O((n\u03b2^4/\u03b5^2) log(n/\u03b4)).\n\nFootnote 4: SQL has sample and computational complexity of the same order since it performs only one Q-value update per sample, whereas, in the case of model-based QI, the algorithm needs to iterate the action-value function of all state-action pairs at least \u00d5(\u03b2) times using the Bellman operator, which leads to a computational complexity bound of order \u00d5(n\u03b2^5/\u03b5^2) given that only \u00d5(n\u03b2^4/\u03b5^2) entries of the estimated transition matrix are non-zero [12].\n\nwhere the estimation error \u03b5_k is defined as the difference between the operator D[Q_k, Q_{k\u22121}] and its sample estimate D_k[Q_k, Q_{k\u22121}] for all (x, a) \u2208 Z:\n\n\\epsilon_k(x, a) \\triangleq D[Q_k, Q_{k-1}](x, a) - D_k[Q_k, Q_{k-1}](x, a).\n\nWe have the property that E[\u03b5_k(x, a)|F_{k\u22121}] = 0, which means that for all (x, a) \u2208 Z the sequence of estimation errors {\u03b5_1(x, a), \u03b5_2(x, a), . . . , \u03b5_k(x, a)} is a martingale difference sequence w.r.t. the filtration F_k. Let us define the martingale E_k(x, a) to be the sum of the estimation errors:\n\nE_k(x, a) \\triangleq \\sum_{j=0}^{k} \\epsilon_j(x, a), \\quad \\forall (x, a) \\in Z. \\qquad (6)\n\nThe proof of Theorem 1 follows these steps: (i) Lemma 1 shows the stability of the algorithm (i.e., the sequence of Q_k stays bounded). (ii) Lemma 2 states the key property that the SQL iterate Q_{k+1} is very close to the Bellman operator T applied to the previous iterate Q_k plus an estimation error term of order E_k/k. (iii) By induction, Lemma 3 provides a performance bound \u2016Q^* \u2212 Q_k\u2016 in terms of a discounted sum of the cumulative estimation errors {E_j}_{j=0:k\u22121}. Finally, (iv) we use a maximal Azuma\u2019s inequality (see Lemma 4) to bound E_k and deduce the finite-time performance for SQL.\n\nFor simplicity of notation, we remove the dependence on (x, a) (e.g., writing Q for Q(x, a), E_k for E_k(x, a)) when there is no possible confusion.\nLemma 1 (Stability of SQL). Let Assumption 1 hold and assume that the initial action-value function Q_0 = Q_{\u22121} is uniformly bounded by V_max. Then we have, for all k \u2265 0,\n\n\\|Q_k\\| \\le V_{\\max}, \\quad \\|\\epsilon_k\\| \\le 2 V_{\\max}, \\quad \\text{and} \\quad \\|D_k[Q_k, Q_{k-1}]\\| \\le V_{\\max}.\n\nProof. We first prove that \u2016D_k[Q_k, Q_{k\u22121}]\u2016 \u2264 V_max by induction. For k = 0 we have:\n\n\\|D_0[Q_0, Q_{-1}]\\| \\le \\|r\\| + \\gamma \\|M Q_{-1}\\| \\le R_{\\max} + \\gamma V_{\\max} = V_{\\max}.\n\nNow for any k \u2265 0, let us assume that the bound \u2016D_k[Q_k, Q_{k\u22121}]\u2016 \u2264 V_max holds. Thus\n\n\\|D_{k+1}[Q_{k+1}, Q_k]\\| \\le \\|r\\| + \\gamma \\|(k + 1) M Q_{k+1} - k M Q_k\\|\n= \\|r\\| + \\gamma \\left\\| (k + 1) M \\left( \\frac{k}{k + 1} Q_k + \\frac{1}{k + 1} D_k[Q_k, Q_{k-1}] \\right) - k M Q_k \\right\\|\n\\le \\|r\\| + \\gamma \\|M (k Q_k + D_k[Q_k, Q_{k-1}] - k Q_k)\\|\n\\le \\|r\\| + \\gamma \\|D_k[Q_k, Q_{k-1}]\\| \\le R_{\\max} + \\gamma V_{\\max} = V_{\\max},\n\nand by induction, we deduce that for all k \u2265 0, \u2016D_k[Q_k, Q_{k\u22121}]\u2016 \u2264 V_max.\nNow the bound on \u03b5_k follows from \u2016\u03b5_k\u2016 = \u2016E(D_k[Q_k, Q_{k\u22121}]|F_{k\u22121}) \u2212 D_k[Q_k, Q_{k\u22121}]\u2016 \u2264 2V_max, and the bound \u2016Q_k\u2016 \u2264 V_max is deduced by noticing that Q_k = (1/k) \u2211_{j=0}^{k\u22121} D_j[Q_j, Q_{j\u22121}].\n\nThe next lemma shows that Q_k is close to TQ_{k\u22121}, up to a O(1/k) term plus the average cumulative estimation error (1/k) E_{k\u22121}.\nLemma 2. Under Assumption 1, for any k \u2265 1:\n\nQ_k = \\frac{1}{k} \\big( T Q_0 + (k - 1) T Q_{k-1} - E_{k-1} \\big). \\qquad (7)\n\nProof. We prove this result by induction. The result holds for k = 1, where (7) reduces to (5). We now show that if the property (7) holds for k then it also holds for k + 1. 
Assume that (7) holds for k. Then, from (5) we have:\n\nQ_{k+1} = \\frac{k}{k + 1} Q_k + \\frac{1}{k + 1} \\big( k T Q_k - (k - 1) T Q_{k-1} - \\epsilon_k \\big)\n= \\frac{k}{k + 1} \\left( \\frac{1}{k} \\big( T Q_0 + (k - 1) T Q_{k-1} - E_{k-1} \\big) \\right) + \\frac{1}{k + 1} \\big( k T Q_k - (k - 1) T Q_{k-1} - \\epsilon_k \\big)\n= \\frac{1}{k + 1} \\big( T Q_0 + k T Q_k - E_{k-1} - \\epsilon_k \\big) = \\frac{1}{k + 1} \\big( T Q_0 + k T Q_k - E_k \\big).\n\nThus (7) holds for k + 1, and is thus true for all k \u2265 1.\n\nNow we bound the difference between Q^* and Q_k in terms of the discounted sum of cumulative estimation errors {E_0, E_1, . . . , E_{k\u22121}}.\nLemma 3 (Error Propagation of SQL). Let Assumption 1 hold and assume that the initial action-value function Q_0 = Q_{\u22121} is uniformly bounded by V_max, then for all k \u2265 1, we have\n\n\\|Q^* - Q_k\\| \\le \\frac{2 \\gamma \\beta V_{\\max}}{k} + \\frac{1}{k} \\sum_{j=1}^{k} \\gamma^{k-j} \\|E_{j-1}\\|. \\qquad (8)\n\nProof. Again we prove this lemma by induction. The result holds for k = 1 as:\n\n\\|Q^* - Q_1\\| = \\|T Q^* - T_0 Q_0\\| = \\|T Q^* - T Q_0 + \\epsilon_0\\|\n\\le \\|T Q^* - T Q_0\\| + \\|\\epsilon_0\\| \\le 2 \\gamma V_{\\max} + \\|\\epsilon_0\\| \\le 2 \\gamma \\beta V_{\\max} + \\|E_0\\|.\n\nWe now show that if the bound holds for k, then it also holds for k + 1. Thus, assume that (8) holds for k. 
By using Lemma 2:\n\n\\|Q^* - Q_{k+1}\\| = \\left\\| Q^* - \\frac{1}{k + 1} \\big( T Q_0 + k T Q_k - E_k \\big) \\right\\|\n= \\left\\| \\frac{1}{k + 1} (T Q^* - T Q_0) + \\frac{k}{k + 1} (T Q^* - T Q_k) + \\frac{1}{k + 1} E_k \\right\\|\n\\le \\frac{\\gamma}{k + 1} \\|Q^* - Q_0\\| + \\frac{k \\gamma}{k + 1} \\|Q^* - Q_k\\| + \\frac{1}{k + 1} \\|E_k\\|\n\\le \\frac{2 \\gamma}{k + 1} V_{\\max} + \\frac{k \\gamma}{k + 1} \\left[ \\frac{2 \\gamma \\beta V_{\\max}}{k} + \\frac{1}{k} \\sum_{j=1}^{k} \\gamma^{k-j} \\|E_{j-1}\\| \\right] + \\frac{1}{k + 1} \\|E_k\\|\n= \\frac{2 \\gamma \\beta V_{\\max}}{k + 1} + \\frac{1}{k + 1} \\sum_{j=1}^{k+1} \\gamma^{k+1-j} \\|E_{j-1}\\|.\n\nThus (8) holds for k + 1, and hence for all k \u2265 1 by induction.\n\nNow, based on Lemmas 3 and 1, we prove the main theorem of this paper.\n\nProof of Theorem 1. We begin our analysis by recalling the result of Lemma 3 at round T:\n\n\\|Q^* - Q_T\\| \\le \\frac{2 \\gamma \\beta V_{\\max}}{T} + \\frac{1}{T} \\sum_{k=1}^{T} \\gamma^{T-k} \\|E_{k-1}\\|.\n\nNote that the difference between this bound and the result of Theorem 1 is just in the second term. So, we only need to show that the following inequality holds, with probability at least 1 \u2212 \u03b4:\n\n\\frac{1}{T} \\sum_{k=1}^{T} \\gamma^{T-k} \\|E_{k-1}\\| \\le 2 \\beta V_{\\max} \\sqrt{\\frac{2 \\log \\frac{2n}{\\delta}}{T}}. \\qquad (9)\n\nWe first notice that:\n\n\\frac{1}{T} \\sum_{k=1}^{T} \\gamma^{T-k} \\|E_{k-1}\\| \\le \\frac{1}{T} \\sum_{k=1}^{T} \\gamma^{T-k} \\max_{1 \\le k \\le T} \\|E_{k-1}\\| \\le \\frac{\\beta \\max_{1 \\le k \\le T} \\|E_{k-1}\\|}{T}. \\qquad (10)\n\nTherefore, in order to prove (9) it is sufficient to bound max_{1\u2264k\u2264T} \u2016E_{k\u22121}\u2016 = max_{(x,a)\u2208Z} max_{1\u2264k\u2264T} |E_{k\u22121}(x, a)| in high probability. We start by providing a high probability bound for max_{1\u2264k\u2264T} |E_{k\u22121}(x, a)| for a given (x, a). First notice that\n\nP\\left( \\max_{1 \\le k \\le T} |E_{k-1}(x, a)| > \\epsilon \\right) = P\\left( \\max\\left[ \\max_{1 \\le k \\le T} (E_{k-1}(x, a)), \\max_{1 \\le k \\le T} (-E_{k-1}(x, a)) \\right] > \\epsilon \\right)\n= P\\left( \\left\\{ \\max_{1 \\le k \\le T} (E_{k-1}(x, a)) > \\epsilon \\right\\} \\cup \\left\\{ \\max_{1 \\le k \\le T} (-E_{k-1}(x, a)) > \\epsilon \\right\\} \\right)\n\\le P\\left( \\max_{1 \\le k \\le T} (E_{k-1}(x, a)) > \\epsilon \\right) + P\\left( \\max_{1 \\le k \\le T} (-E_{k-1}(x, a)) > \\epsilon \\right), \\qquad (11)\n\nand each term is now bounded by using a maximal Azuma inequality, recalled here (see, e.g., [6]).\n\nLemma 4 (Maximal Hoeffding-Azuma Inequality). Let V = {V_1, V_2, . . . , V_T} be a martingale difference sequence w.r.t. a sequence of random variables {X_1, X_2, . . . , X_T} (i.e., E(V_{k+1}|X_1, . . . , X_k) = 0 for all 0 < k \u2264 T) such that V is uniformly bounded by L > 0. If we define S_k = \u2211_{i=1}^{k} V_i, then for any \u03b5 > 0, we have\n\nP\\left( \\max_{1 \\le k \\le T} S_k > \\epsilon \\right) \\le \\exp\\left( \\frac{-\\epsilon^2}{2 T L^2} \\right).\n\nAs mentioned earlier, the sequence of random variables {\u03b5_0(x, a), \u03b5_1(x, a), . . . , \u03b5_k(x, a)} is a martingale difference sequence w.r.t. the filtration F_k (generated by the random samples {y_0, y_1, . . . , y_k}(x, a) for all (x, a)), i.e., E[\u03b5_k(x, a)|F_{k\u22121}] = 0. 
It follows from Lemma 4 that for any \u03b5 > 0 we have:\n\nP\\left( \\max_{1 \\le k \\le T} (E_{k-1}(x, a)) > \\epsilon \\right) \\le \\exp\\left( \\frac{-\\epsilon^2}{8 T V_{\\max}^2} \\right),\nP\\left( \\max_{1 \\le k \\le T} (-E_{k-1}(x, a)) > \\epsilon \\right) \\le \\exp\\left( \\frac{-\\epsilon^2}{8 T V_{\\max}^2} \\right). \\qquad (12)\n\nBy combining (12) with (11) we deduce that P(max_{1\u2264k\u2264T} |E_{k\u22121}(x, a)| > \u03b5) \u2264 2 exp(\u2212\u03b5^2/(8T V_max^2)), and by a union bound over the state-action space, we deduce that\n\nP\\left( \\max_{1 \\le k \\le T} \\|E_{k-1}\\| > \\epsilon \\right) \\le 2n \\exp\\left( \\frac{-\\epsilon^2}{8 T V_{\\max}^2} \\right). \\qquad (13)\n\nThis bound can be rewritten as: for any \u03b4 > 0,\n\nP\\left( \\max_{1 \\le k \\le T} \\|E_{k-1}\\| \\le V_{\\max} \\sqrt{8 T \\log \\frac{2n}{\\delta}} \\right) \\ge 1 - \\delta, \\qquad (14)\n\nwhich by using (10) proves (9) and Theorem 1.\n\n5 Conclusions and Future Work\n\nIn this paper, we introduced a new Q-learning algorithm, called speedy Q-learning (SQL). We analyzed the finite-time behavior of this algorithm as well as its asymptotic convergence to the optimal action-value function. Our result is in the form of a high-probability bound on the performance loss of SQL, which suggests that the algorithm converges to the optimal action-value function at a faster rate than standard Q-learning. Overall, SQL is a simple, efficient and theoretically well-founded reinforcement learning algorithm, which improves on existing RL algorithms such as Q-learning and model-based value iteration.\n\nIn this work, we are only interested in the estimation of the optimal action-value function and not the problem of exploration. Therefore, we did not compare our result to the PAC-MDP methods [15, 18] and the upper-confidence bound based algorithms [3, 11], in which the choice of the exploration policy impacts the behavior of the learning algorithms. 
However, we believe that it would be possible to gain w.r.t. the state of the art in PAC-MDPs, by combining the asynchronous version of SQL with a smart exploration strategy. This is mainly due to the fact that the bound for SQL has been proved to be tighter than those of the RL algorithms that have been used for estimating the value function in PAC-MDP methods, especially in the model-free case. We consider this as a subject for future research.\n\nAnother possible direction for future work is to scale up SQL to large (possibly continuous) state and action spaces where function approximation is needed. We believe that it would be possible to extend our current SQL analysis to the continuous case along the same path as in the fitted value iteration analysis by [13] and [1]. This would require extending the error propagation result of Lemma 3 to an \u2113_2-norm analysis and combining it with the standard regression bounds.\n\nAcknowledgments\n\nThe authors appreciate support from the PASCAL2 Network of Excellence Internal-Visit Programme and the European Community\u2019s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 231495. We also thank Peter Auer for helpful discussion and the anonymous reviewers for their valuable comments.\n\nReferences\n\n[1] A. Antos, R. Munos, and Cs. Szepesv\u00e1ri. Fitted Q-iteration in continuous action-space MDPs. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, 2007.\n\n[2] M. Gheshlaghi Azar, R. Munos, M. Ghavamzadeh, and H.J. Kappen. Reinforcement learning with a near optimal rate of convergence. Technical Report inria-00636615, INRIA, 2011.\n\n[3] P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.\n\n[4] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. 
Athena Scientific, Belmont, Massachusetts, third edition, 2007.\n\n[5] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.\n\n[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.\n\n[7] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In 15th Annual Conference on Computational Learning Theory, pages 255\u2013270, 2002.\n\n[8] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1\u201325, 2003.\n\n[9] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, 1968.\n\n[10] T. Jaakkola, M. I. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming. Neural Computation, 6(6):1185\u20131201, 1994.\n\n[11] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563\u20131600, 2010.\n\n[12] M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 12, pages 996\u20131002. MIT Press, 1999.\n\n[13] R. Munos and Cs. Szepesv\u00e1ri. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815\u2013857, 2008.\n\n[14] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22(1-3):283\u2013290, 1996.\n\n[15] A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413\u20132444, 2009.\n\n[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.\n\n[17] Cs. Szepesv\u00e1ri. The asymptotic convergence-rate of Q-learning. 
In Advances in Neural Information Processing Systems 10, Denver, Colorado, USA, 1997.\n\n[18] I. Szita and Cs. Szepesv\u00e1ri. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031\u20131038. Omnipress, 2010.\n\n[19] H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems 23, pages 2613\u20132621, 2010.\n\n[20] C. Watkins. Learning from Delayed Rewards. PhD thesis, King\u2019s College, Cambridge, England, 1989.\n", "award": [], "sourceid": 1278, "authors": [{"given_name": "Mohammad", "family_name": "Ghavamzadeh", "institution": null}, {"given_name": "Hilbert", "family_name": "Kappen", "institution": null}, {"given_name": "Mohammad", "family_name": "Azar", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}