{"title": "Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 996, "page_last": 1002, "abstract": null, "full_text": "Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms

Michael Kearns and Satinder Singh
AT&T Labs
180 Park Avenue
Florham Park, NJ 07932
{mkearns,baveja}@research.att.com

Abstract

In this paper, we address two issues of long-standing interest in the reinforcement learning literature. First, what kinds of performance guarantees can be made for Q-learning after only a finite number of actions? Second, what quantitative comparisons can be made between Q-learning and model-based (indirect) approaches, which use experience to estimate next-state distributions for off-line value iteration?

We first show that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number of state transitions observed. In particular, on the order of only (N log(1/ε)/ε²)(log(N) + log log(1/ε)) transitions are sufficient for both algorithms to come within ε of the optimal policy, in an idealized model that assumes the observed transitions are \"well-mixed\" throughout an N-state MDP. Thus, the two approaches have roughly the same sample complexity. Perhaps surprisingly, this sample complexity is far less than what is required for the model-based approach to actually construct a good approximation to the next-state distribution. The result also shows that the amount of memory required by the model-based approach is closer to N than to N². For either approach, to remove the assumption that the observed transitions are well-mixed, we consider a model in which the transitions are determined by a fixed, arbitrary exploration policy.
Bounds on the number of transitions required in order to achieve a desired level of performance are then related to the stationary distribution and mixing time of this policy.

1 Introduction

There are at least two different approaches to learning in Markov decision processes: indirect approaches, which use control experience (observed transitions and payoffs) to estimate a model, and then apply dynamic programming to compute policies from the estimated model; and direct approaches such as Q-learning [2], which use control experience to directly learn policies (through value functions) without ever explicitly estimating a model. Both are known to converge asymptotically to the optimal policy [1, 3]. However, little is known about the performance of these two approaches after only a finite amount of experience.

A common argument offered by proponents of direct methods is that it may require much more experience to learn an accurate model than to simply learn a good policy. This argument is predicated on the seemingly reasonable assumption that an indirect method must first learn an accurate model in order to compute a good policy. On the other hand, proponents of indirect methods argue that such methods can do unlimited off-line computation on the estimated model, which may give an advantage over direct methods, at least if the model is accurate. Learning a good model may also be useful across tasks, permitting the computation of good policies for multiple reward functions [4]. To date, these arguments have lacked a formal framework for analysis and verification.

In this paper, we provide such a framework, and use it to derive the first finite-time convergence rates (sample size bounds) for both Q-learning and the standard indirect algorithm.
An important aspect of our analysis is that we separate the quality of the policy generating experience from the quality of the two learning algorithms. In addition to demonstrating that both methods enjoy rather rapid convergence to the optimal policy as a function of the amount of control experience, the convergence rates have a number of specific and perhaps surprising implications for the hypothetical differences between the two approaches outlined above. Some of these implications, as well as the rates of convergence we derive, were briefly mentioned in the abstract; in the interests of brevity, we will not repeat them here, but instead proceed directly into the technical material.

2 MDP Basics

Let M be an unknown N-state MDP with A actions. We use P_M^a(ij) to denote the probability of going to state j, given that we are in state i and execute action a; and R_M(i) to denote the reward received for executing a from i (which we assume is fixed and bounded between 0 and 1 without loss of generality). A policy π assigns an action to each state. The value of state i under policy π, V_M^π(i), is the expected discounted sum of rewards received upon starting in state i and executing π forever: V_M^π(i) = E_π[r_1 + γ r_2 + γ² r_3 + ...], where r_t is the reward received at time step t under a random walk governed by π from start state i, and 0 ≤ γ < 1 is the discount factor. It is also convenient to define values for state-action pairs (i, a): Q_M^π(i, a) = R_M(i) + γ Σ_j P_M^a(ij) V_M^π(j). The goal of learning is to approximate the optimal policy π* that maximizes the value at every state; the optimal value function is denoted Q_M^*. Given Q_M^*, we can compute the optimal policy as π*(i) = argmax_a{Q_M^*(i, a)}. If M is given, value iteration can be used to compute a good approximation to the optimal value function.
Setting our initial guess as Q_0(i, a) = 0 for all (i, a), we iterate as follows:

    Q_{ℓ+1}(i, a) = R_M(i) + γ Σ_j P_M^a(ij) V_ℓ(j)    (1)

where we define V_ℓ(j) = max_b{Q_ℓ(j, b)}. It can be shown that after ℓ iterations, max_{(i,a)}{|Q_ℓ(i, a) - Q_M^*(i, a)|} ≤ γ^ℓ. Given any approximation Q to Q_M^* we can compute the greedy approximation π̂ to the optimal policy π* as π̂(i) = argmax_a{Q(i, a)}.

3 The Parallel Sampling Model

In reinforcement learning, the transition probabilities P_M^a(ij) are not given, and a good policy must be learned on the basis of observed experience (transitions) in M. Classical convergence results for algorithms such as Q-learning [1] implicitly assume that the observed experience is generated by an arbitrary \"exploration policy\" π, and then proceed to prove convergence to the optimal policy if π meets certain minimal conditions: namely, π must try every state-action pair infinitely often, with probability 1. This approach conflates two distinct issues: the quality of the exploration policy π, and the quality of reinforcement learning algorithms using experience generated by π. In contrast, we choose to separate these issues. If the exploration policy never or only very rarely visits some state-action pair, we would like to have this reflected as a factor in our bounds that depends only on π; a separate factor depending only on the learning algorithm will in turn reflect how efficiently a particular learning algorithm uses the experience generated by π. Thus, for a fixed π, all learning algorithms are placed on equal footing, and can be directly compared.

There are probably various ways in which this separation can be accomplished; we now introduce one that is particularly clean and simple.
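As a concrete illustration, the update in Equation (1) can be sketched in a few lines of NumPy; this is our own minimal sketch, not the authors' code, and the two-state, two-action MDP (P, R) at the bottom is a made-up example.

```python
import numpy as np

def value_iteration(P, R, gamma, num_iters):
    # P[a][i][j] = probability of moving i -> j under action a; R[i] = reward at state i.
    A, N, _ = P.shape
    Q = np.zeros((N, A))                    # initial guess Q_0(i, a) = 0
    for _ in range(num_iters):
        V = Q.max(axis=1)                   # V_l(j) = max_b Q_l(j, b)
        # Equation (1): Q_{l+1}(i, a) = R(i) + gamma * sum_j P^a(ij) V_l(j)
        Q = R[:, None] + gamma * np.einsum('aij,j->ia', P, V)
    return Q

# Hypothetical 2-state, 2-action MDP for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.5, 0.5], [0.5, 0.5]]])    # action 1
R = np.array([0.0, 1.0])
Q = value_iteration(P, R, gamma=0.9, num_iters=50)
policy = Q.argmax(axis=1)                   # greedy policy from the approximation
```

After ℓ iterations the max-norm error is at most γ^ℓ, so 50 iterations at γ = 0.9 already leave a negligible gap on this toy example.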
We would like a model of the ideal exploration policy: one that produces experiences that are \"well-mixed\", in the sense that every state-action pair is tried with equal frequency. Thus, let us define a parallel sampling subroutine PS(M) that behaves as follows: a single call to PS(M) returns, for every state-action pair (i, a), a random next state j distributed according to P_M^a(ij). Thus, every state-action pair is executed simultaneously, and the resulting N x A next states are reported. A single call to PS(M) is therefore really simulating N x A transitions in M, and we must be careful to multiply the number of calls to PS(M) by this factor if we wish to count the total number of transitions witnessed.

What is PS(M) modeling? It is modeling the idealized exploration policy that manages to visit every state-action pair in succession, without duplication, and without fail. It should be intuitively obvious that such an exploration policy would be optimal, from the viewpoint of gathering experience everywhere as rapidly as possible.

We shall first provide an analysis, in Section 5, of both direct and indirect reinforcement learning algorithms, in a setting in which the observed experience is generated by calls to PS(M). Of course, in any given MDP M, there may not be any exploration policy that meets the ideal captured by PS(M): for instance, there may simply be some states that are very difficult for any policy to reach, and thus the experience generated by any policy will certainly not be equally mixed around the entire MDP. (Indeed, a call to PS(M) will typically return a set of transitions that does not even correspond to a trajectory in M.)
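When M itself is available as a simulator, PS(M) is straightforward to realize; the sketch below is our own illustration (the transition array P is a made-up two-state example), returning one sampled next state for every (i, a) per call.

```python
import numpy as np

rng = np.random.default_rng(0)

def parallel_sample(P):
    # One call to PS(M): for every state-action pair (i, a), draw a single
    # next state j ~ P_M^a(ij). Returns an N x A array of next states,
    # i.e. N * A simulated transitions per call.
    A, N, _ = P.shape
    next_states = np.empty((N, A), dtype=int)
    for a in range(A):
        for i in range(N):
            next_states[i, a] = rng.choice(N, p=P[a, i])
    return next_states

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
sample = parallel_sample(P)   # one next state per (i, a)
```

Note that the N x A transitions returned by one call generally do not form a trajectory in M, exactly as the text observes.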
Furthermore, even if PS(M) could be simulated by some exploration policy, we would like to provide more general results that express the amount of experience required for reinforcement learning algorithms under any exploration policy (where the amount of experience will, of course, depend on properties of the exploration policy).

Thus, in Section 6, we sketch how one can bound the amount of experience required under any π in order to simulate calls to PS(M). (More detail will be provided in a longer version of this paper.) The bound depends on natural properties of π, such as its stationary distribution and mixing time. Combined with the results of Section 5, we get the desired two-factor bounds discussed above: for both the direct and indirect approaches, a bound on the total number of transitions required, consisting of one factor that depends only on the algorithm, and another factor that depends only on the exploration policy.

4 The Learning Algorithms

We now explicitly state the two reinforcement learning algorithms we shall analyze and compare. In keeping with the separation between algorithms and exploration policies already discussed, we will phrase these algorithms in the parallel sampling framework, and Section 6 indicates how they generalize to the case of arbitrary exploration policies. We begin with the direct approach.

Rather than directly studying standard Q-learning, we will here instead examine a variant that is slightly easier to analyze, called phased Q-learning. However, we emphasize that all of our results can be generalized to apply to standard Q-learning (with learning rate α(i, a) = 1/t(i, a), where t(i, a) is the number of trials of (i, a) so far).
Basically, rather than updating the value function with every observed transition from (i, a), phased Q-learning estimates the expected value of the next state from (i, a) on the basis of many transitions, and only then makes an update. The memory requirements for phased Q-learning are essentially the same as those for standard Q-learning.

Direct Algorithm - Phased Q-Learning: As suggested by the name, the algorithm operates in phases. In each phase, the algorithm will make m_D calls to PS(M) (where m_D will be determined by the analysis), thus gathering m_D trials of every state-action pair (i, a). At the ℓth phase, the algorithm updates the estimated value function as follows: for every (i, a),

    Q̂_{ℓ+1}(i, a) = R_M(i) + γ (1/m_D) Σ_{k=1}^{m_D} V̂_ℓ(j_k)    (2)

where j_1, ..., j_{m_D} are the m_D next states observed from (i, a) on the m_D calls to PS(M) during the ℓth phase. The policy computed by the algorithm is then the greedy policy determined by the final value function. Note that phased Q-learning is quite like standard Q-learning, except that we gather statistics (the summation in Equation (2)) before making an update.

We now proceed to describe the standard indirect approach.

Indirect Algorithm: The algorithm first makes m_I calls to PS(M) to obtain m_I next-state samples for each (i, a). It then builds an empirical model of the transition probabilities as follows: P̂_M^a(ij) = #(i ->a j)/m_I, where #(i ->a j) is the number of times state j was reached on the m_I trials of (i, a). The algorithm then does value iteration (as described in Section 2) on the fixed model P̂_M^a(ij) for ℓ_I iterations. Again, the policy computed by the algorithm is the greedy policy dictated by the final value function.

Thus, in phased Q-learning, the algorithm runs for some number ℓ_D of phases, and each phase requires m_D calls to PS(M), for a total number of transitions ℓ_D x m_D x N x A.
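Both algorithms are short to state on top of a PS(M) oracle. The sketch below is our own illustration, not the authors' implementation; the argument `ps` stands for any function behaving like PS(M) (here built from a made-up two-state MDP), and the parameter values are chosen only to make the toy example run quickly.

```python
import numpy as np

def phased_q_learning(ps, R, gamma, num_phases, m_D, N, A):
    # ps() must return an N x A array of sampled next states, one per (i, a),
    # i.e. behave like a single call to PS(M).
    Q = np.zeros((N, A))
    for _ in range(num_phases):
        V = Q.max(axis=1)
        sum_next_V = np.zeros((N, A))
        for _ in range(m_D):                        # m_D calls to PS(M) per phase
            sum_next_V += V[ps()]
        Q = R[:, None] + gamma * sum_next_V / m_D   # the update in Equation (2)
    return Q

def indirect_model(ps, N, A, m_I):
    # Empirical model: P_hat[a, i, j] = #(i ->a j) / m_I.
    counts = np.zeros((A, N, N))
    for _ in range(m_I):
        nxt = ps()                                  # one next state per (i, a)
        for i in range(N):
            for a in range(A):
                counts[a, i, nxt[i, a]] += 1
    return counts / m_I

# Hypothetical 2-state, 2-action MDP and a PS(M) oracle built from it.
rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([0.0, 1.0])

def ps():
    nxt = np.empty((2, 2), dtype=int)
    for a in range(2):
        for i in range(2):
            nxt[i, a] = rng.choice(2, p=P[a, i])
    return nxt

Q = phased_q_learning(ps, R, gamma=0.9, num_phases=20, m_D=50, N=2, A=2)
policy = Q.argmax(axis=1)        # greedy policy from the final value function
P_hat = indirect_model(ps, N=2, A=2, m_I=100)
```

The indirect algorithm would then run value iteration on P_hat exactly as in Section 2, with no further sampling.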
The indirect algorithm first makes m_I calls to PS(M), and then runs ℓ_I iterations of value iteration (which requires no additional data), for a total number of transitions m_I x N x A. The question we now address is: how large must m_D, m_I, ℓ_D, ℓ_I be so that, with probability at least 1 - δ, the resulting policies have expected return within ε of the optimal policy in M? The answers we give yield perhaps surprisingly similar bounds on the total number of transitions required for the two approaches in the parallel sampling model.

5 Bounds on the Number of Transitions

We now state our main result.

Theorem 1 For any MDP M:

• For an appropriate choice of the parameters m_I and ℓ_I, the total number of calls to PS(M) required by the indirect algorithm in order to ensure that, with probability at least 1 - δ, the expected return of the resulting policy will be within ε of the optimal policy, is

    O((1/ε²)(log(N/δ) + log log(1/ε))).    (3)

• For an appropriate choice of the parameters m_D and ℓ_D, the total number of calls to PS(M) required by phased Q-learning in order to ensure that, with probability at least 1 - δ, the expected return of the resulting policy will be within ε of the optimal policy, is

    O((log(1/ε)/ε²)(log(N/δ) + log log(1/ε))).    (4)

The bound for phased Q-learning is thus only O(log(1/ε)) larger than that for the indirect algorithm. Bounds on the total number of transitions witnessed in either case are obtained by multiplying the given bounds by N x A.

Before sketching some of the ideas behind the proof of this result, we first discuss some of its implications for the debate on direct versus indirect approaches. First of all, for both approaches, convergence is rather fast: with a total number of transitions only on the order of N log(N) (fixing ε and δ for simplicity), near-optimal policies are obtained.
This represents a considerable advance over the classical asymptotic results: instead of saying that an infinite number of visits to every state-action pair are required to converge to the optimal policy, we are claiming that a rather small number of visits are required to get close to the optimal policy. Second, by our analysis, the two approaches have similar complexities, with the number of transitions required differing by only a log(1/ε) factor in favor of the indirect algorithm. Third, and perhaps surprisingly, note that since only O(log(N)) calls are being made to PS(M) (again fixing ε and δ), and since the number of trials per state-action pair is exactly the number of calls to PS(M), the number of non-zero entries in the model P̂_M^a(ij) built by the indirect approach is in fact only O(log(N)) per state-action pair. In other words, P̂_M^a(ij) will be extremely sparse, and thus a terrible approximation to the true transition probabilities, yet still good enough to derive a near-optimal policy! Clever representation of P̂_M^a(ij) will thus result in total memory requirements that are only O(N log(N)) rather than O(N²). Fourth, although we do not have space to provide any details, if instead of a single reward function, we are provided with L reward functions (where the L reward functions are given in advance of observing any experience), then for both algorithms, the number of transitions required to compute near-optimal policies for all L reward functions simultaneously is only a factor of O(log(L)) greater than the bounds given above.

Our own view of the result and its implications is:

• Both algorithms enjoy rapid convergence to the optimal policy as a function of the amount of experience.

• In general, neither approach enjoys a significant advantage in convergence rate, memory requirements, or handling multiple reward functions. Both are quite efficient on all counts.
We do not have space to provide a detailed proof of Theorem 1, but instead provide some highlights of the main ideas. The proofs for both the indirect algorithm and phased Q-learning are actually quite similar, and have at their heart two slightly different uniform convergence lemmas. For phased Q-learning, it is possible to show that, for any bound ℓ_D on the number of phases to be executed, and for any τ > 0, we can choose m_D so that

    |(1/m_D) Σ_{k=1}^{m_D} V̂_ℓ(j_k) - Σ_j P_M^a(ij) V̂_ℓ(j)| < τ    (5)

will hold simultaneously for every (i, a) and for every phase ℓ = 1, ..., ℓ_D. In other words, at the end of every phase, the empirical estimate of the expected next-state value for every (i, a) will be close to the true expectation, where here the expectation is with respect to the current estimated value function V̂_ℓ.

For the indirect algorithm, a slightly more subtle uniform convergence argument is required. Here we show that it is possible to choose, for any bound ℓ_I on the number of iterations of value iteration to be executed on the P̂_M^a(ij), and for any τ > 0, a value m_I such that

    |Σ_j P̂_M^a(ij) V_ℓ(j) - Σ_j P_M^a(ij) V_ℓ(j)| < τ    (6)

for every (i, a) and every iteration ℓ = 1, ..., ℓ_I, where the V_ℓ(j) are the value functions resulting from performing true value iteration (that is, on the P_M^a(ij)). Equation (6) essentially says that expectations of the true value functions are quite similar under either the true or estimated model, even though the indirect algorithm never has access to the true value functions.

In either case, the uniform convergence results allow us to argue that the corresponding algorithms still achieve successive contractions, as in the classical proof of value iteration.
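Though the paper defers the details, a bound of the flavor of (5) would typically follow from a Hoeffding bound together with a union bound; the sketch below is our own reconstruction under the assumption that all value estimates lie in [0, V_max] with V_max = 1/(1 - γ), and is not taken from the paper.

```latex
% Each \hat V_\ell(j_k) lies in [0, V_{\max}] with V_{\max} = 1/(1-\gamma), so Hoeffding gives
\Pr\Big[\Big|\tfrac{1}{m_D}\textstyle\sum_{k=1}^{m_D}\hat V_\ell(j_k)
      - \sum_j P^a_M(ij)\,\hat V_\ell(j)\Big| \ge \tau\Big]
  \le 2\exp\!\big(-2\, m_D\, \tau^2 (1-\gamma)^2\big).
% A union bound over the N A pairs and the \ell_D phases then suggests a choice of the order
m_D = O\!\left(\frac{1}{\tau^2 (1-\gamma)^2}\,\log\frac{N A\,\ell_D}{\delta}\right).
```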
For instance, in the case of phased Q-learning, if we define Δ_ℓ = max_{(i,a)}{|Q̂_ℓ(i, a) - Q_ℓ(i, a)|}, we can derive a recurrence relation for Δ_{ℓ+1} as follows:

    Δ_{ℓ+1} = max_{(i,a)} { |γ (1/m_D) Σ_{k=1}^{m_D} V̂_ℓ(j_k) - γ Σ_j P_M^a(ij) V_ℓ(j)| }    (7)
            ≤ max_{(i,a)} { |γ (Σ_j P_M^a(ij) V̂_ℓ(j) + τ) - γ Σ_j P_M^a(ij) V_ℓ(j)| }    (8)
            ≤ γτ + γΔ_ℓ.    (9)

Here we have made use of Equation (5). Since Δ_0 = 0 (Q̂_0 = Q_0), this recurrence gives Δ_ℓ ≤ τ(γ/(1 - γ)) for any ℓ. From this it is not hard to show that for any (i, a)

    |Q̂_ℓ(i, a) - Q*(i, a)| ≤ τ(γ/(1 - γ)) + γ^ℓ.    (10)

From this it can be shown that the regret in expected return suffered by the policy computed by phased Q-learning after ℓ phases is at most (τγ/(1 - γ) + γ^ℓ)(2/(1 - γ)). The proof proceeds by setting this regret smaller than the desired ε, solving for ℓ and τ, and obtaining the resulting bound on m_D. The derivation of bounds for the indirect algorithm is similar.

6 Handling General Exploration Policies

As promised, we conclude our technical results by briefly sketching how we can translate the bounds obtained in Section 5 under the idealized parallel sampling model into bounds applicable when any fixed policy π is guiding the exploration. Such bounds must, of course, depend on properties of π. Due to space limitations, we can only outline the main ideas; the formal statements and proofs are deferred to a longer version of the paper.

Let us assume for simplicity that π (which may be a stochastic policy) defines an ergodic Markov process in the MDP M. Thus, π induces a unique stationary distribution P_{M,π}(i, a) over state-action pairs; intuitively, P_{M,π}(i, a) is the frequency of executing action a from state i during an infinite random walk in M according to π.
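Unrolling the recurrence Δ_{ℓ+1} ≤ γτ + γΔ_ℓ of (9), starting from Δ_0 = 0, makes the claimed bound Δ_ℓ ≤ τ(γ/(1 - γ)) explicit:

```latex
\Delta_\ell \;\le\; \gamma\tau + \gamma\Delta_{\ell-1}
  \;\le\; \gamma\tau\,(1 + \gamma + \cdots + \gamma^{\ell-1})
  \;=\; \gamma\tau\,\frac{1-\gamma^{\ell}}{1-\gamma}
  \;\le\; \tau\,\frac{\gamma}{1-\gamma}.
```

Combining this with the classical value-iteration guarantee |Q_ℓ - Q*| ≤ γ^ℓ via the triangle inequality then yields (10).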
Furthermore, we can introduce the standard notion of the mixing time of π to its stationary distribution; informally, this is the number T_π of steps required such that the distribution induced on state-action pairs by T_π-step walks according to π will be \"very close\" to P_{M,π} (see footnote 1). Finally, let us define p_π = min_{(i,a)}{P_{M,π}(i, a)}.

Armed with these notions, it is not difficult to show that the number of steps we must take under π in order to simulate, with high probability, a call to the oracle PS(M) is polynomial in the quantity T_π/p_π. The intuition is straightforward: at most every T_π steps, we obtain an \"almost independent\" draw from P_{M,π}(i, a); and with each independent draw, we have at least probability p_π of drawing any particular (i, a) pair. Once we have sampled every (i, a) pair, we have simulated a call to PS(M). The formalization of these intuitions leads to a version of Theorem 1 applicable to any π, in which the bound is multiplied by a factor polynomial in T_π/p_π, as desired.

However, a better result is possible. In cases where p_π may be small or even 0 (which would occur when π simply does not ever execute some action from some state), the factor T_π/p_π is large or infinite and our bounds become weak or vacuous. In such cases, it is better to define the sub-MDP M_π(α), which is obtained from M by simply deleting any (i, a) for which P_{M,π}(i, a) < α, where α > 0 is a parameter of our choosing. In M_π(α), p_π > α by construction, and we may now obtain convergence rates to the optimal policy in M_π(α) for both Q-learning and the indirect approach like those given in Theorem 1, multiplied by a factor polynomial in T_π/α. (Technically, we must slightly alter the algorithms to have an initial phase that detects and eliminates small-probability state-action pairs, but this is a minor detail.)
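The coupon-collector flavor of this argument can be sketched directly: walk under π, keep only every T_π-th state-action pair (treating it as an almost independent draw from P_{M,π}), and stop once every pair has yielded a next state. Everything below is our own illustrative construction (the made-up MDP P, the uniformly random policy pi, and the mixing-time argument T_mix), not the authors' procedure.

```python
import numpy as np

def simulate_ps_call(P, pi, T_mix, rng, start=0):
    # Walk under the stochastic policy pi (pi[i] = distribution over actions).
    # Every T_mix steps, treat the current (i, a) as an almost independent
    # draw from the stationary distribution and record the next state it
    # produced; stop once every state-action pair has been seen.
    A, N, _ = P.shape
    sample = {}
    i = start
    steps = 0
    while len(sample) < N * A:
        a = rng.choice(A, p=pi[i])
        j = rng.choice(N, p=P[a, i])
        steps += 1
        if steps % T_mix == 0:
            sample.setdefault((i, a), j)   # next state witnessed for (i, a)
        i = j
    return sample, steps   # steps scales like T_mix / p_pi (coupon collecting)

# Hypothetical 2-state, 2-action MDP with a uniformly random exploration policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
pi = np.full((2, 2), 0.5)
sample, steps = simulate_ps_call(P, pi, T_mix=5, rng=np.random.default_rng(0))
```

If some (i, a) has stationary probability 0, this loop never terminates, which mirrors the vacuousness of the T_π/p_π bound and motivates the sub-MDP M_π(α) construction above.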
By allowing α to become smaller as the amount of experience we receive from π grows, we can obtain an \"anytime\" result, since the sub-MDP M_π(α) approaches the full MDP M as α -> 0.

References

[1] T. Jaakkola, M. I. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.

[2] C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.

[3] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[4] S. Mahadevan. Enhancing Transfer in Reinforcement Learning by Building Stochastic Models of Robot Actions. In Machine Learning: Proceedings of the Ninth International Conference, 1992.

1 Formally, the degree of closeness is measured by the distance between the transient and stationary distributions. For brevity here we will simply assume this parameter is set to a very small, constant value.
", "award": [], "sourceid": 1531, "authors": [{"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}