{"title": "Finite-Sample Analysis for SARSA with Linear Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 8668, "page_last": 8678, "abstract": "SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement learning. We investigate the SARSA algorithm with linear function approximation under the non-i.i.d.\\ setting, where a single sample trajectory is available. With a Lipschitz continuous policy improvement operator that is smooth enough, SARSA has been shown to converge asymptotically. However, its non-asymptotic analysis is challenging and remains unsolved due to the non-i.i.d. samples, and the fact that the behavior policy changes dynamically with time. In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a type of stochastic approximation procedures with time-varying Markov transition kernels. Our approach enables non-asymptotic convergence analyses of this type of stochastic approximation algorithms, which may be of independent interest. Using our bias characterization technique and a gradient descent type of analysis, we further provide the finite-sample analysis on the mean square error of the SARSA algorithm. In the end, we present a fitted SARSA algorithm, which includes the original SARSA algorithm and its variant as special cases. This fitted SARSA algorithm provides a framework for \\textit{iterative} on-policy fitted policy iteration, which is more memory and computationally efficient. 
For this fitted SARSA algorithm, we also present its finite-sample analysis.", "full_text": "Finite-Sample Analysis for SARSA with Linear\n\nFunction Approximation\n\nDepartment of Electrical Engineering\n\nUniversity at Buffalo, The State University of New York\n\nShaofeng Zou\n\nBuffalo, NY 14228\nszou3@buffalo.edu\n\nTengyu Xu\n\nDepartment of ECE\n\nThe Ohio State University\n\nColumbus, OH 43210\nxu.3260@osu.edu\n\nYingbin Liang\n\nDepartment of ECE\n\nThe Ohio State University\n\nColumbus, OH 43210\nliang.889@osu.edu\n\nAbstract\n\nSARSA is an on-policy algorithm to learn a Markov decision process policy in\nreinforcement learning. We investigate the SARSA algorithm with linear func-\ntion approximation under the non-i.i.d. data, where a single sample trajectory is\navailable. With a Lipschitz continuous policy improvement operator that is smooth\nenough, SARSA has been shown to converge asymptotically [28, 23]. However, its\nnon-asymptotic analysis is challenging and remains unsolved due to the non-i.i.d.\nsamples and the fact that the behavior policy changes dynamically with time. In\nthis paper, we develop a novel technique to explicitly characterize the stochastic\nbias of a type of stochastic approximation procedures with time-varying Markov\ntransition kernels. Our approach enables non-asymptotic convergence analyses of\nthis type of stochastic approximation algorithms, which may be of independent\ninterest. Using our bias characterization technique and a gradient descent type\nof analysis, we provide the \ufb01nite-sample analysis on the mean square error of\nthe SARSA algorithm. We then further study a \ufb01tted SARSA algorithm, which\nincludes the original SARSA algorithm and its variant in [28] as special cases. This\n\ufb01tted SARSA algorithm provides a more general framework for iterative on-policy\n\ufb01tted policy iteration, which is more memory and computationally ef\ufb01cient. 
For this fitted SARSA algorithm, we also provide its finite-sample analysis.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

SARSA, originally proposed in [31], is an on-policy reinforcement learning algorithm, which continuously updates the behavior policy toward attaining as large an accumulated reward as possible over time. Specifically, SARSA is initialized with a state and a policy. At each time instance, it takes an action based on the current policy, observes the next state, and receives a reward. Using the newly observed information, it first updates the estimate of the action-value function, and then improves the behavior policy by applying a policy improvement operator, e.g., $\epsilon$-greedy, to the estimated action-value function. This process is repeated until convergence (see Algorithm 1 for a precise description of the SARSA algorithm).

With the tabular approach that stores the action-value function, the convergence of SARSA has been established in [33]. However, the tabular approach may not be applicable when the state space is large or continuous. For this purpose, SARSA that incorporates parametrized function approximation is commonly used, and is more efficient and scalable. With the function approximation approach, SARSA is not guaranteed to converge in general when the $\epsilon$-greedy or softmax policy improvement operators are used [13, 10]. However, under certain conditions, its convergence can be established. For example, a variant of SARSA with linear function approximation was constructed in [28], where between two policy improvements, a temporal difference (TD) learning algorithm is applied to learn the action-value function until its convergence.
The convergence of this algorithm was established in [28] using a contraction argument under the condition that the policy improvement operator is Lipschitz continuous and the Lipschitz constant is not too large. The convergence of the original SARSA algorithm under the same Lipschitz condition was later established using an O.D.E. approach in [23].

Previous studies of SARSA in [28, 23] mainly focused on the asymptotic convergence analysis, which does not suggest how fast SARSA converges or how the accuracy of the solution depends on the number of samples, i.e., the sample complexity. The goal of this paper is to provide such a non-asymptotic finite-sample analysis of SARSA and to further understand how the parameters of the underlying Markov process and the algorithm affect the convergence rate. Technically, such an analysis does not follow directly from the existing finite-sample analyses for temporal difference (TD) learning [4, 34] and Q-learning [32], where samples are taken from a Markov process with a fixed transition kernel. The analysis of SARSA necessarily needs to deal with samples taken from a Markov decision process with a time-varying transition kernel, and in this paper, we develop novel techniques to explicitly characterize the stochastic bias for a Markov decision process with a time-varying transition kernel, which may be of independent interest.

1.1 Contributions

In this paper, we design a novel approach to analyze SARSA and a more general fitted SARSA algorithm, and develop the corresponding finite-sample error bounds. In particular, we consider the online setting where a single sample trajectory with Markovian noise is available, i.e., samples are not independent and identically distributed (i.i.d.).

Bias characterization for time-varying Markov processes. One major challenge in our analysis is due to the fact that the estimate of the "gradient" is biased under non-i.i.d. Markovian noise.
Existing studies mostly focus on the case where the samples are generated according to a Markov process with a fixed transition kernel, e.g., TD learning [4, 34] and Q-learning with nearest neighbors [32], so that the uniform ergodicity of the Markov process can be exploited to decouple the dependency on the Markovian noise, and then to explicitly bound the stochastic bias. For Markov processes with a time-varying transition kernel, such uniform ergodicity does not hold in general. In this paper, we develop a novel approach to explicitly characterize the stochastic bias induced by non-i.i.d. samples generated from Markov processes with time-varying transition kernels. The central idea of our approach is to construct auxiliary Markov chains, which are uniformly ergodic, to approximate the dynamically changing Markov process and thereby facilitate the analysis. Our approach can also be applied more generally to analyze stochastic approximation (SA) algorithms with time-varying Markov transition kernels, which may be of independent interest.

Finite-sample analysis for on-policy SARSA. For the on-policy SARSA algorithm, as the estimate of the action-value function changes with time, the behavior policy also changes. By a gradient descent type of analysis [4] and our bias characterization technique for analyzing time-varying Markov processes, we develop the finite-sample analysis for the on-policy SARSA algorithm with a continuous state space and linear function approximation. Our analysis is for the online case with a single sample trajectory and non-i.i.d. data. To the best of our knowledge, this is the first finite-sample analysis for this type of on-policy algorithm with a time-varying behavior policy.

Fitted SARSA algorithm.
We propose a more general online fitted SARSA algorithm, where between two policy improvements, a "fitted" step is taken to obtain a more accurate estimate of the action-value function of the corresponding behavior policy via multiple iterations, rather than the single iteration taken in the original SARSA. In particular, it includes the variant of SARSA in [28] as a special case, in which each fitted step is required to converge before the policy improvement. We provide a non-asymptotic analysis of the convergence of the proposed algorithm. Interestingly, our analysis indicates that the fitted step can stop at any time (not necessarily at convergence) without affecting the overall convergence of the fitted SARSA algorithm.

1.2 Related Work

Finite-sample analysis for TD learning. The asymptotic convergence of the TD algorithm was established in [36]. The finite-sample analysis of the TD algorithm was provided in [9, 19] under the i.i.d. setting and recently in [4, 34] under the non-i.i.d. setting, where a single sample trajectory is available. The finite-sample analysis of two-time-scale methods for TD learning was also studied very recently under the i.i.d. setting in [8], under the non-i.i.d. setting with constant step sizes in [15], and under the non-i.i.d. setting with diminishing step sizes in [38]. Differently from TD, whose goal is to estimate the value function of a fixed policy, SARSA aims to continuously update its estimate of the action-value function to obtain an optimal policy. While samples of the TD algorithm are generated by following a time-invariant behavior policy, the behavior policy that generates samples in SARSA follows from an instantaneous estimate of the action-value function, which changes over time.

Q-learning with function approximation.
The asymptotic convergence of Q-learning with linear function approximation was established in [23] under certain conditions. An approach based on a combination of Q-learning and kernel-based nearest neighbor regression was proposed in [32], which first discretizes the entire state space, and then uses the nearest neighbor regression method to estimate the action-value function. Such an approach was shown to converge, and a finite-sample analysis of the convergence rate was further provided. The Q-learning algorithms in [23, 32] are off-policy algorithms, where a fixed behavior policy is used to collect samples, whereas SARSA is an on-policy algorithm with a time-varying behavior policy. Moreover, differently from the nearest neighbor approach, we consider SARSA with linear function approximation. These differences require different techniques to characterize the non-asymptotic convergence rate.

On-policy SARSA algorithm. SARSA was originally proposed in [31], and its convergence with the tabular approach was established in [33]. With function approximation, SARSA is not guaranteed to converge if $\epsilon$-greedy or softmax is used. With a smooth enough Lipschitz continuous policy improvement operator, the asymptotic convergence of SARSA was shown in [23, 28]. In this paper, we further develop the non-asymptotic finite-sample analysis of SARSA under the Lipschitz continuity condition.

Fitted value/policy iteration algorithms. Least-squares temporal difference learning (LSTD) algorithms have been extensively studied in [6, 5, 25, 20, 12, 29, 30, 35, 37] and the references therein, where in each iteration a least-squares regression problem based on batch data is solved.
Approximate (fitted) policy iteration (API) algorithms further extend fitted value iteration with policy improvement. Several variants have been studied, which adopt different objective functions, including least-squares policy iteration (LSPI) algorithms in [18, 21, 39], fitted policy iteration based on Bellman residual minimization (BRM) in [1, 11], and a classification-based policy iteration algorithm in [22]. The fitted SARSA algorithm in this paper uses an iterative method (the TD(0) algorithm) to estimate the action-value function between two policy improvements, which is more memory and computationally efficient than batch methods. Differently from [28], we do not require a convergent TD(0) run for each fitted step. For this algorithm, we provide its non-asymptotic convergence analysis.

2 Preliminaries

2.1 Markov Decision Process

Consider a general reinforcement learning setting, where an agent interacts with a stochastic environment, which is modeled as a Markov decision process (MDP). Specifically, we consider an MDP that consists of $(\mathcal{X}, \mathcal{A}, \mathsf{P}, r, \gamma)$, where $\mathcal{X} \subset \mathbb{R}^d$ is a continuous state space, and $\mathcal{A}$ is a finite action set. We further let $X_t \in \mathcal{X}$ denote the state at time $t$, and $A_t \in \mathcal{A}$ denote the action at time $t$. Then, the measure $\mathsf{P}$ defines the action-dependent transition kernel for the underlying Markov chain $\{X_t\}_{t\ge 0}$: $\mathsf{P}(X_{t+1} \in U \mid X_t = x, A_t = a) = \int_U \mathsf{P}(dy|x,a)$, for any measurable set $U \subseteq \mathcal{X}$. The one-stage reward at time $t$ is given by $r(X_t, A_t)$, where $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function, which is assumed to be uniformly bounded, i.e., $r(x,a) \in [0, r_{\max}]$ for any $(x,a) \in \mathcal{X} \times \mathcal{A}$. Finally, $\gamma$ denotes the discount factor.

A stationary policy maps a state $x \in \mathcal{X}$ to a probability distribution $\pi(\cdot|x)$ over $\mathcal{A}$, which does not depend on time. For a policy $\pi$, the corresponding value function $V^\pi : \mathcal{X} \to$
$\mathbb{R}$ is defined as the expected total discounted reward obtained by executing actions according to $\pi$: $V^\pi(x_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(X_t, A_t) \mid X_0 = x_0\right]$. The action-value function $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is defined as $Q^\pi(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} \mathsf{P}(dy|x,a) V^\pi(y)$. The goal is to find an optimal policy that maximizes the value function from any initial state. The optimal value function is defined as $V^*(x) = \sup_\pi V^\pi(x)$, $\forall x \in \mathcal{X}$. The optimal action-value function is defined as $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$, $\forall (x,a) \in \mathcal{X} \times \mathcal{A}$. The optimal policy $\pi^*$ is then greedy with respect to $Q^*$. It can be verified that $Q^* = Q^{\pi^*}$. The Bellman operator $H$ is defined as $(HQ)(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} \max_{b \in \mathcal{A}} Q(y,b)\, \mathsf{P}(dy|x,a)$. It is clear that $H$ is a contraction in the sup norm defined as $\|Q\|_{\sup} = \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} |Q(x,a)|$, and the optimal action-value function $Q^*$ is the fixed point of $H$ [3].

2.2 Linear Function Approximation

Let $\mathcal{Q} = \{Q_\theta : \theta \in \mathbb{R}^N\}$ be a family of real-valued functions defined on $\mathcal{X} \times \mathcal{A}$. We consider the problem where any function in $\mathcal{Q}$ is a linear combination of a set of $N$ fixed base functions $\phi_i : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ for $i = 1, \dots, N$. Specifically, for $\theta \in \mathbb{R}^N$, $Q_\theta(x,a) = \sum_{i=1}^N \theta_i \phi_i(x,a) = \phi^T(x,a)\theta$. We assume that $\|\phi(x,a)\|_2 \le 1$, $\forall (x,a) \in \mathcal{X} \times \mathcal{A}$, which can be ensured by normalizing $\{\phi_i\}_{i=1}^N$. The goal is to find a $Q_\theta$ with a compact representation in $\theta$ to approximate the optimal action-value function $Q^*$ over a continuous state space.

3 Finite-Sample Analysis for SARSA

3.1 SARSA with Linear Function Approximation

We consider a $\theta$-dependent behavior policy, which changes with time. Specifically, the behavior policy $\pi_{\theta_t}$ is given by $\Gamma(\phi^T(x,a)\theta_t)$, where $\Gamma$ is a policy improvement operator, e.g., greedy, $\epsilon$-greedy, softmax, and mellowmax [2].
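As a concrete illustration, the linear parametrization $Q_\theta(x,a) = \phi^T(x,a)\theta$ with normalized features can be sketched as follows. The Gaussian-bump state features and the block-per-action layout are our own illustrative choices; the paper only assumes linearly independent base functions with $\|\phi(x,a)\|_2 \le 1$.

```python
import numpy as np

def make_features(centers, n_actions):
    """Linear feature map phi(x, a) in R^N with N = len(centers) * n_actions.

    State features are Gaussian bumps around the given centers (an illustrative
    choice); the action selects which block of the vector is active. The output
    is rescaled so that ||phi(x, a)||_2 <= 1, as assumed in Section 2.2.
    """
    centers = np.asarray(centers, dtype=float)
    n_state_feats = len(centers)

    def phi(x, a):
        vec = np.zeros(n_state_feats * n_actions)
        vec[a * n_state_feats:(a + 1) * n_state_feats] = np.exp(-0.5 * (x - centers) ** 2)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1.0 else vec

    return phi

def q_value(phi, theta, x, a):
    """Q_theta(x, a) = phi(x, a)^T theta."""
    return phi(x, a) @ theta
```

Because $Q_\theta$ is linear in $\theta$, the "gradient" used by SARSA below is simply $\phi(x_t,a_t)$ scaled by the temporal difference.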
Suppose that $\{x_t, a_t, r_t\}_{t \ge 0}$ is a sample trajectory of states, actions, and rewards obtained from the MDP by following the time-dependent behavior policy $\pi_{\theta_t}$ (see Algorithm 1). The projected SARSA with linear function approximation updates as follows:

$$\theta_{t+1} = \mathrm{proj}_{2,R}\big(\theta_t + \alpha_t g_t(\theta_t)\big), \qquad (1)$$

where $g_t(\theta_t) = \nabla_\theta Q_\theta(x_t, a_t)\,\delta_t = \phi(x_t, a_t)\,\delta_t$, $\delta_t$ denotes the temporal difference at time $t$: $\delta_t = r(x_t, a_t) + \gamma \phi^T(x_{t+1}, a_{t+1})\theta_t - \phi^T(x_t, a_t)\theta_t$, and $\mathrm{proj}_{2,R}(\theta) := \arg\min_{\theta' : \|\theta'\|_2 \le R} \|\theta - \theta'\|_2$. In this paper, we refer to $g_t$ as a "gradient", although it is not the gradient of any function.

Algorithm 1 SARSA
Initialization: $\theta_0$, $x_0$, $R$, $\phi_i$, for $i = 1, 2, \dots, N$
Method:
  $\pi_{\theta_0} \leftarrow \Gamma(\phi^T \theta_0)$
  Choose $a_0$ according to $\pi_{\theta_0}$
  for $t = 1, 2, \dots$ do
    Observe $x_t$ and $r(x_{t-1}, a_{t-1})$
    Choose $a_t$ according to $\pi_{\theta_{t-1}}$
    $\theta_t \leftarrow \mathrm{proj}_{2,R}\big(\theta_{t-1} + \alpha_{t-1} g_{t-1}(\theta_{t-1})\big)$
    Policy improvement: $\pi_{\theta_t} \leftarrow \Gamma(\phi^T \theta_t)$
  end for

Here, the projection step is used to control the norm of the gradient $g_t(\theta_t)$, which is a commonly used technique to control the gradient bias [4, 16, 17, 7, 26]. With a small step size $\alpha_t$ and a bounded gradient, $\theta_t$ does not change too fast. We note that [14] showed that SARSA converges to a bounded region, and thus $\theta_t$ is bounded for all $t \ge 0$. This implies that our analysis still holds without the projection step. We further note that even without exploiting the fact that $\theta_t$ is bounded, the finite-sample analysis for SARSA can still be obtained by combining our approach of analyzing the stochastic bias with an extension of the approach in [34]. However, to convey the central idea of characterizing the stochastic bias of an MDP with a dynamically changing transition kernel, we focus on the projected SARSA in this paper.

We consider the following Lipschitz continuous policy improvement operator, as in [28, 23].
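The update (1) and Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `env_step(x, a) -> (reward, next_state)` is a hypothetical environment interface, and a temperature-scaled softmax serves as a stand-in for the Lipschitz policy improvement operator $\Gamma$.

```python
import numpy as np

def proj_2R(theta, R):
    """Euclidean projection onto the ball {theta' : ||theta'||_2 <= R}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= R else theta * (R / norm)

def sarsa_linear(phi, env_step, n_actions, N, R, alpha, gamma, T, temp=1.0, seed=0):
    """Sketch of projected SARSA with linear function approximation (Algorithm 1).

    `phi(x, a)` is a feature map with ||phi||_2 <= 1; `alpha(t)` is the step
    size schedule; a softmax of the Q-estimates stands in for Gamma.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(N)
    x = 0  # initial state x0

    def policy(th, state):
        prefs = np.array([phi(state, b) @ th for b in range(n_actions)]) / temp
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    a = rng.choice(n_actions, p=policy(theta, x))
    for t in range(T):
        r, x_next = env_step(x, a)
        # on-policy: a_{t+1} is drawn from the current behavior policy pi_{theta_t}
        a_next = rng.choice(n_actions, p=policy(theta, x_next))
        # temporal difference delta_t and "gradient" g_t(theta) = phi(x, a) * delta_t
        delta = r + gamma * (phi(x_next, a_next) @ theta) - phi(x, a) @ theta
        theta = proj_2R(theta + alpha(t) * delta * phi(x, a), R)
        x, a = x_next, a_next  # the next action will be drawn using the updated theta
    return theta
```

The projection keeps $\|g_t(\theta_t)\|_2 \le G = r_{\max} + 2R$, which is what the bias analysis below exploits.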
For any $\theta \in \mathbb{R}^N$, the behavior policy $\pi_\theta = \Gamma(\phi^T\theta)$ is Lipschitz with respect to $\theta$: $\forall (x,a) \in \mathcal{X} \times \mathcal{A}$,

$$|\pi_{\theta_1}(a|x) - \pi_{\theta_2}(a|x)| \le C\|\theta_1 - \theta_2\|_2, \qquad (2)$$

where $C > 0$ is the Lipschitz constant. Further discussion of this assumption and its impact on the convergence is provided in Section 5. We further assume that for any fixed $\theta \in \mathbb{R}^N$, the Markov chain $\{X_t\}_{t \ge 0}$ induced by the behavior policy $\pi_\theta$ and the transition kernel $\mathsf{P}$ is uniformly ergodic with invariant measure denoted by $P_\theta$, and satisfies the following assumption.

Assumption 1. There are constants $m > 0$ and $\rho \in (0,1)$ such that

$$\sup_{x \in \mathcal{X}} d_{TV}\big(\mathsf{P}(X_t \in \cdot \mid X_0 = x),\, P_\theta\big) \le m\rho^t, \quad \forall t \ge 0,$$

where $d_{TV}(P, Q)$ denotes the total-variation distance between the probability measures $P$ and $Q$.

We denote by $\mu_\theta$ the probability measure induced by the invariant measure $P_\theta$ and the behavior policy $\pi_\theta$. We assume that the $N$ base functions $\phi_i$ are linearly independent in the Hilbert space $L^2(\mathcal{X} \times \mathcal{A}, \mu_{\theta^*})$, where $\theta^*$ is the limit point of Algorithm 1, which will be defined in the next section. For the space $L^2(\mathcal{X} \times \mathcal{A}, \mu_{\theta^*})$, two measurable functions on $\mathcal{X} \times \mathcal{A}$ are equivalent if they are identical except on a set of $\mu_{\theta^*}$-measure zero.

3.2 Finite-Sample Analysis

We first define $A_\theta = \mathbb{E}_\theta\big[\phi(X,A)\big(\gamma\phi^T(Y,B) - \phi^T(X,A)\big)\big]$ and $b_\theta = \mathbb{E}_\theta[\phi(X,A)\, r(X,A)]$, where $\mathbb{E}_\theta$ denotes the expectation in which $X$ follows the invariant probability measure $P_\theta$, $A$ is generated by the behavior policy $\pi_\theta(A = \cdot\,|X)$, $Y$ is the subsequent state of $X$ following action $A$, i.e., $Y$ follows the transition kernel $\mathsf{P}(Y \in \cdot\,|X, A)$, and $B$ is generated by the behavior policy $\pi_\theta(B = \cdot\,|Y)$. It was shown in [23] that the algorithm in (1) converges to a unique point $\theta^*$, which satisfies
the following relation: $A_{\theta^*}\theta^* + b_{\theta^*} = 0$, if the Lipschitz constant $C$ is not so large that $(A_{\theta^*} + \lambda C I)$ is negative definite¹.

Let $G = r_{\max} + 2R$ and $\lambda = G|\mathcal{A}|\big(2 + \lceil \log_\rho m^{-1} \rceil + \frac{1}{1-\rho}\big)$. Recall from (2) that the policy $\pi_\theta$ is Lipschitz with respect to $\theta$ with Lipschitz constant $C$. We then make the following assumption [28, 23].

Assumption 2. The Lipschitz constant $C$ is not so large that $(A_{\theta^*} + \lambda C I)$ is negative definite, and we denote the largest eigenvalue of $\frac{1}{2}\big[(A_{\theta^*} + \lambda C I) + (A_{\theta^*} + \lambda C I)^T\big]$ by $w_s < 0$.

The following theorems present the finite-sample bounds on the convergence of SARSA with diminishing and constant step sizes.

Theorem 1. Consider SARSA with linear function approximation in Algorithm 1 with $\|\theta^*\|_2 \le R$. Consider a decaying step size $\alpha_t = \frac{1}{2w(t+1)}$ for $t \ge 0$, where $0 < w \le |w_s|$. Under Assumptions 1 and 2, we have

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \le \frac{G^2\big(4\lambda C|\mathcal{A}|G\tau_0^2 + (12 + 2\lambda C)\tau_0 + 1\big)(\log T + 1)}{4w^2 T} + \frac{2G^2(\tau_0 w + w + \rho^{-1})}{w^2 T}, \qquad (3)$$

where $\tau_0 = \min\{t \ge 0 : m\rho^t \le \alpha_T\}$. For large $T$, $\tau_0$ scales as $\log T$, and hence $\mathbb{E}\|\theta_T - \theta^*\|_2^2 \le \mathcal{O}\big(\frac{\log^3 T}{T}\big)$. Thus, to guarantee the accuracy $\mathbb{E}[\|\theta_T - \theta^*\|_2^2] \le \epsilon$ for a small $\epsilon$, the overall sample complexity is given by $\mathcal{O}\big(\frac{1}{\epsilon}\log^3\frac{1}{\epsilon}\big)$.

Theorem 1 indicates that SARSA has a faster convergence rate than the existing finite-sample bound for Q-learning with nearest neighbors [32].

Theorem 2. Consider SARSA with linear function approximation in Algorithm 1 with $\|\theta^*\|_2 \le R$. Under Assumptions 1 and 2 and with a constant step size $\alpha_t = \alpha_0 < \frac{1}{2|w_s|}$ for $t \ge 0$, we have

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \le e^{2\alpha_0 w_s T}\, \mathbb{E}\|\theta_0 - \theta^*\|_2^2 + \frac{\alpha_0 G^2\big((12 + 2\lambda C)\tau_0 + 4G\lambda C|\mathcal{A}|\tau_0^2 + 8/\rho + 1\big)}{2|w_s|}, \qquad (4)$$

where $\tau_0 = \min\{t \ge 0 : m\rho^t \le \alpha_0\}$.

¹It can be shown that if the $\phi_i$ are linearly independent in $L^2(\mathcal{X} \times \mathcal{A}, \mu_{\theta^*})$, then $A_{\theta^*}$ is negative definite [28, 36].

If $\alpha_0$ is small enough and $T$ is large enough, then the algorithm converges to a small neighborhood of $\theta^*$. For example, if $\alpha_t = 1/\sqrt{T}$, the upper bound converges to zero as $T \to \infty$. The proof of this theorem is a straightforward extension of that of Theorem 1.

For Theorems 1 and 2 to hold, the projection radius $R$ must be chosen such that $\|\theta^*\|_2 \le R$. However, $\theta^*$ is unknown in advance. We next provide an upper bound on $\|\theta^*\|_2$, which can be estimated in practice [4].

Lemma 1. For the projected SARSA algorithm in (1), the limit point $\theta^*$ satisfies $\|\theta^*\|_2 \le \frac{r_{\max}}{|w_l|}$, where $w_l < 0$ is the largest eigenvalue of $\frac{1}{2}(A_{\theta^*} + A_{\theta^*}^T)$.

3.3 Outline of the Technical Proof of Theorem 1

The major challenge in the finite-sample analysis of SARSA lies in analyzing the stochastic bias in the gradient, which is two-fold: (1) non-i.i.d.
samples; and (2) a dynamically changing behavior policy.

First, as per the updating rule in (1), there is a strong coupling between the sample path and $\{\theta_t\}_{t \ge 0}$: the samples are used to compute the gradient $g_t$ and then $\theta_{t+1}$, which introduces a strong dependency between $\{\theta_t\}_{t \ge 0}$ and $\{X_t, A_t\}_{t \ge 0}$, and hence a bias in $g_t$. Moreover, differently from TD learning and Q-learning, $\theta_t$ is further used (through the policy $\pi_{\theta_t}$) to generate the subsequent actions, which makes the dependency even stronger. Although the convergence can still be established using the O.D.E. approach [23], a finite-sample analysis requires the stochastic bias in the gradient to be explicitly characterized, which makes the problem challenging.

Second, as $\theta_t$ updates, the transition kernel of the state-action pair $(X_t, A_t)$ changes with time. Previous analyses, e.g., [4], rely on the facts that the behavior policy is fixed and that the underlying Markov process is uniformly ergodic, so that the Markov process reaches its stationary distribution quickly. In [28], a variant of SARSA was studied, where between two policy improvements, the behavior policy is fixed, and a TD method is used to estimate its action-value function until convergence. The behavior policy is then improved using a Lipschitz continuous policy improvement operator. In this way, for each given behavior policy, the induced Markov process can reach its stationary distribution quickly, so that the analysis can be conducted. The SARSA algorithm studied in this paper does not possess these nice properties.
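The geometric mixing that these fixed-policy analyses rely on (Assumption 1) can be observed numerically. The 3-state kernel below is an arbitrary illustrative choice standing in for the chain induced by one fixed behavior policy; any irreducible aperiodic finite chain is uniformly ergodic, so its worst-case total-variation distance to the invariant measure decays as $m\rho^t$.

```python
import numpy as np

def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

# Toy transition kernel (rows = current state), standing in for the chain
# induced by a fixed behavior policy pi_theta.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Invariant measure: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# sup_x d_TV(P(X_t in . | X_0 = x), pi) as a function of t; this is the
# quantity Assumption 1 bounds by m * rho^t.
dists = []
Pt = np.eye(3)
for t in range(30):
    dists.append(max(tv_distance(Pt[x], pi) for x in range(3)))
    Pt = Pt @ P
```

For SARSA the kernel itself moves with $\theta_t$, so no single such decay curve applies, which is exactly the difficulty addressed next.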
The behavior policy of the SARSA algorithm changes at each time step, and the underlying Markov process does not necessarily reach a stationary distribution due to the lack of uniform ergodicity.

To provide a finite-sample analysis, our major technical novelty lies in the design of auxiliary Markov chains, which are uniformly ergodic, to approximate the original Markov chain induced by the SARSA algorithm, together with a careful decomposition of the stochastic bias. Using such an approach, the gradient bias can be explicitly characterized. Then, together with a gradient descent type of analysis, we derive the finite-sample analysis of the SARSA algorithm.

To illustrate the main idea of the proof, we provide a sketch. We note that Step 3 contains our major technical contributions of bias characterization for time-varying Markov processes.

Proof sketch. We first introduce some notation. For any fixed $\theta \in \mathbb{R}^N$, define $\bar{g}(\theta) = \mathbb{E}_\theta[g_t(\theta)]$, where $X_t$ follows the stationary distribution $P_\theta$, and $(A_t, X_{t+1}, A_{t+1})$ are the subsequent actions and states generated according to the policy $\pi_\theta$ and the transition kernel $\mathsf{P}$. Here, $\bar{g}(\theta)$ can be interpreted as the noiseless gradient at $\theta$. We then define

$$\Lambda_t(\theta) = \langle \theta - \theta^*,\, g_t(\theta) - \bar{g}(\theta) \rangle. \qquad (5)$$

Thus, $\Lambda_t(\theta_t)$ measures the bias caused by using non-i.i.d. samples to estimate the gradient.

Step 1. Error decomposition. The error at each time step can be decomposed recursively as follows:

$$\mathbb{E}[\|\theta_{t+1} - \theta^*\|_2^2] \le \mathbb{E}[\|\theta_t - \theta^*\|_2^2] + 2\alpha_t \mathbb{E}[\langle \theta_t - \theta^*,\, \bar{g}(\theta_t) - \bar{g}(\theta^*) \rangle] + \alpha_t^2\, \mathbb{E}[\|g_t(\theta_t)\|_2^2] + 2\alpha_t \mathbb{E}[\Lambda_t(\theta_t)]. \qquad (6)$$

Step 2. Gradient descent type of analysis. The first three terms in (6) mimic the analysis of the gradient descent algorithm without noise, because the noiseless gradient $\bar{g}$ at $\theta_t$ is used. Due to the projection step in (1), $\|g_t(\theta_t)\|_2$ is upper bounded by $G$. It can also be shown that

$$\mathbb{E}[\langle \theta_t - \theta^*,\, \bar{g}(\theta_t) - \bar{g}(\theta^*) \rangle] \le \mathbb{E}\big[(\theta_t - \theta^*)^T (A_{\theta^*} + \lambda C I)(\theta_t - \theta^*)\big]. \qquad (7)$$

For a not too large $C$, i.e., when $\pi_\theta$ is smooth enough with respect to $\theta$, $(A_{\theta^*} + \lambda C I)$ is negative definite. Then, we have

$$\mathbb{E}[\langle \theta_t - \theta^*,\, \bar{g}(\theta_t) - \bar{g}(\theta^*) \rangle] \le w_s\, \mathbb{E}[\|\theta_t - \theta^*\|_2^2]. \qquad (8)$$

Step 3. Stochastic bias analysis. This step contains our major technical developments. The last term in (6) is the bias caused by using a single sample path with non-i.i.d. data and a time-varying behavior policy. For convenience, we rewrite $\Lambda_t(\theta_t)$ as $\Lambda_t(\theta_t, O_t)$, where $O_t = (X_t, A_t, X_{t+1}, A_{t+1})$. Bounding this term is challenging due to the strong dependency between $\theta_t$ and $O_t$.

We first show that $\Lambda_t(\theta, O_t)$ is Lipschitz in $\theta$.
Due to the projection step, $\theta_t$ changes slowly with $t$. Combining the two facts, we can show that for any $\tau > 0$,

$$\Lambda_t(\theta_t, O_t) \le \Lambda_t(\theta_{t-\tau}, O_t) + (6 + \lambda C)G^2 \sum_{i=t-\tau}^{t-1} \alpha_i. \qquad (9)$$

This step is intended to decouple the dependency between $O_t$ and $\theta_t$ by considering $O_t$ and $\theta_{t-\tau}$. If the Markov chain $\{(X_t, A_t, \theta_t)\}_{t \ge 0}$ induced by SARSA were uniformly ergodic and satisfied Assumption 1, then for any $\theta_{t-\tau}$, $O_t$ would reach its stationary distribution quickly for large $\tau$. However, such an argument does not necessarily hold, since $\theta_t$ changes with time, and thus the transition kernel of the Markov chain changes with time.

Our idea is to construct an auxiliary Markov chain to assist the proof. Consider the following new Markov chain. Before time $t - \tau + 1$, the states and actions are generated according to the SARSA algorithm, but after time $t - \tau + 1$, the behavior policy is kept fixed at $\pi_{\theta_{t-\tau}}$ to generate all subsequent actions. We then denote by $\tilde{O}_t = (\tilde{X}_t, \tilde{A}_t, \tilde{X}_{t+1}, \tilde{A}_{t+1})$ the observations of the new Markov chain at times $t$ and $t+1$. For this new Markov chain, for large $\tau$, $\tilde{O}_t$ reaches the stationary distribution induced by $\pi_{\theta_{t-\tau}}$ and $\mathsf{P}$. It can then be shown that

$$\mathbb{E}[\Lambda_t(\theta_{t-\tau}, \tilde{O}_t)] \le 4G^2 m\rho^{\tau-1}. \qquad (10)$$

The next step is to bound the difference between the Markov chain generated by the SARSA algorithm and the auxiliary Markov chain that we construct. Since the behavior policy changes slowly, due to its Lipschitz property and the small step size $\alpha_t$, the two Markov chains should not deviate from each other too much.
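The auxiliary-chain idea can be illustrated with a toy coupled simulation. The scalar drifting parameter and the sigmoid kernel below are illustrative stand-ins for $\theta_t$ and the policy-induced transitions, not the paper's construction; the point is only that when the parameter drifts by at most one step size per iteration, the frozen-parameter chain stays close to the time-varying one.

```python
import numpy as np

def simulate_with_auxiliary(T, tau, alpha, seed=0):
    """Couple a time-varying two-state chain with its frozen auxiliary chain.

    A scalar parameter `th` drifts by at most `alpha` per step (mimicking the
    slowly changing theta_t under a small step size); the chain moves to state 1
    with probability sigmoid(th). From time T - tau on, the auxiliary chain
    keeps using the frozen parameter th_{T - tau}, while the original keeps
    drifting. Both chains share the same uniform draws, so they disagree only
    when the drift moves the acceptance threshold across a shared draw.
    Returns the number of steps on which the two chains disagree.
    """
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    th = 0.0
    th_frozen = None
    disagreements = 0
    for t in range(T):
        if t == T - tau:
            th_frozen = th  # freeze the behavior parameter for the auxiliary chain
        u = rng.uniform()   # shared randomness couples the two chains
        x = int(u < sigmoid(th))
        x_aux = int(u < sigmoid(th_frozen if th_frozen is not None else th))
        disagreements += int(x != x_aux)
        th += alpha * rng.choice([-1.0, 1.0])  # bounded drift: |th_{t+1} - th_t| <= alpha
    return disagreements
```

With `alpha = 0` the two chains coincide exactly; with a small positive `alpha` they can disagree only within the last `tau` steps, mirroring the role of the step size in bound (11) below.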
It can be shown that, for the case with a diminishing step size (a similar argument holds for the case with a constant step size),

$$\mathbb{E}[\Lambda_t(\theta_{t-\tau}, O_t)] - \mathbb{E}[\Lambda_t(\theta_{t-\tau}, \tilde{O}_t)] \le \frac{\lambda C|\mathcal{A}|G^3 \tau}{w} \log\frac{t}{t-\tau}. \qquad (11)$$

Combining (9), (10), and (11) yields an upper bound on $\mathbb{E}[\Lambda_t(\theta_t)]$.

Step 4. Putting the first three steps together and recursively applying Step 1 completes the proof.

4 Finite-Sample Analysis for the Fitted SARSA Algorithm

In this section, we introduce a more general on-policy fitted SARSA algorithm (see Algorithm 2), which provides a general framework for on-policy fitted policy iteration. Specifically, after each policy improvement, we perform a "fitted" step that consists of $B$ TD(0) iterations to estimate the action-value function of the current policy. This more general fitted SARSA algorithm contains the original SARSA algorithm [31] as a special case with $B = 1$, and the algorithm in [28] as another special case with $B = \infty$ (i.e., TD(0) is run until convergence). Moreover, the entire algorithm uses only one single Markov trajectory, instead of restarting from state $x_0$ after each policy improvement [28]. Differently from most existing fitted policy iteration algorithms, where a regression problem for model fitting is solved between two policy improvements, our fitted SARSA algorithm does not require a convergent TD iteration process between policy improvements. As will be shown, the on-policy fitted SARSA algorithm is guaranteed to converge for an arbitrary $B$. The overall sample complexity of this fitted algorithm will also be provided.

In fact, there is no need for the number $B$ of TD iterations to be the same in every fitted step.
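The fitted step just described can be sketched as follows. As with the earlier sketch, this is an illustration rather than the paper's implementation: `env_step(x, a) -> (reward, next_state)` is a hypothetical environment interface, and a softmax of the Q-estimates stands in for the Lipschitz policy improvement operator $\Gamma$.

```python
import numpy as np

def fitted_sarsa(phi, env_step, n_actions, N, R, alpha, gamma, T, B, temp=1.0, seed=0):
    """Sketch of the general fitted SARSA algorithm (Algorithm 2).

    Between two policy improvements, B TD(0) iterations are run under the
    *frozen* behavior policy pi_{theta_{tB}}; B = 1 recovers the original
    SARSA update order, while large B approximates the variant of [28].
    """
    rng = np.random.default_rng(seed)

    def proj(th):
        n = np.linalg.norm(th)
        return th if n <= R else th * (R / n)

    def policy(th, state):
        prefs = np.array([phi(state, b) @ th for b in range(n_actions)]) / temp
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    theta = np.zeros(N)
    x = 0
    a = rng.choice(n_actions, p=policy(theta, x))
    step = 0
    for t in range(T):                     # outer loop: one policy improvement per pass
        theta_behavior = theta.copy()      # behavior policy frozen during the fitted step
        for _ in range(B):                 # inner loop: B TD(0) iterations
            r, x_next = env_step(x, a)
            a_next = rng.choice(n_actions, p=policy(theta_behavior, x_next))
            delta = r + gamma * (phi(x_next, a_next) @ theta) - phi(x, a) @ theta
            theta = proj(theta + alpha(step) * delta * phi(x, a))
            x, a, step = x_next, a_next, step + 1
        # policy improvement happens implicitly: the next outer pass freezes
        # the behavior policy at the newly updated theta
    return theta
```

Note that the single trajectory continues across the outer loop; the state is never reset after a policy improvement, matching the description above.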
More generally, by setting the number of TD iterations differently across fitted steps, we can control the estimation accuracy of the action-value function between policy improvements using the finite-sample bound for TD [4]. Our analysis extends to this general scenario in a straightforward manner, but the mathematical expressions become more involved. We thus focus on the case with the same $B$ to convey the central idea.

Algorithm 2 General Fitted SARSA
Initialization: $\theta_0$, $x_0$, $R$, $\phi_i$, for $i = 1, 2, \dots, N$
Method:
  $\pi_{\theta_0} \leftarrow \Gamma(\phi^T \theta_0)$
  Choose $a_0$ according to $\pi_{\theta_0}$
  for $t = 0, 1, 2, \dots$ do
    TD learning of policy $\pi_{\theta_{tB}}$:
    for $j = 1, \dots, B$ do
      Observe $x_{tB+j}$ and $r(x_{tB+j-1}, a_{tB+j-1})$
      Choose $a_{tB+j}$ according to $\pi_{\theta_{tB}}$
      $\theta_{tB+j} \leftarrow \mathrm{proj}_{2,R}\big(\theta_{tB+j-1} + \alpha_{tB+j-1} g_{tB+j-1}(\theta_{tB+j-1})\big)$
    end for
    Policy improvement: $\pi_{\theta_{(t+1)B}} \leftarrow \Gamma(\phi^T \theta_{(t+1)B})$
  end for

The following theorem provides the finite-sample bound on the convergence of the fitted SARSA algorithm.

Theorem 3. Consider the fitted SARSA algorithm with linear function approximation in Algorithm 2. Suppose that Assumptions 1 and 2 hold.
(1) With a decaying step size $\alpha_t = \frac{1}{2tw}$ for $t \ge 1$, where $0 < w \le |w_s|$, we have

$$\mathbb{E}[\|\theta_{TB} - \theta^*\|_2^2] \le \frac{4G^2(\tau_0 + B)w + (\log T + 1)\big((6 + \lambda C)G^2\tau_0 + (6.5 + \lambda C)G^2 B + \lambda C|\mathcal{A}|G^3\tau_0^2\big) + 4G^2/\rho + 0.5BG^2}{w^2 B T}, \qquad (12)$$

where $\tau_0 = \inf\{nB : m\rho^{nB} \le \alpha_{TB}\}$.
For sufficiently large T, τ₀ ∼ log T, and hence E[‖θ_{TB} − θ*‖²₂] ≤ O(log³T / T^w). For any given B, to guarantee the accuracy E[‖θ_{TB} − θ*‖²₂] ≤ ε for a small ε > 0, the overall sample complexity is given by O(ε^{−1/w} log³(1/ε)).
(2) With a constant step size α_t = α₀ < 1/(2w_s B) for t ≥ 0, we have that

E[‖θ_{TB} − θ*‖²₂] ≤ e^{−2w_s B α₀ T} ‖θ_0 − θ*‖²₂ + α₀(BG² + 2(6 + C)G²(τ₀ + B) + 8G²/ρ + 2|A|G³τ₀²) / (2w_s),    (13)

where τ₀ = inf{nB : mρ^{nB} ≤ α₀}.
Item (2) of Theorem 3 indicates that with a small enough constant step size and a large enough T, the fitted SARSA algorithm converges to a small neighborhood of θ*.
Theorem 3 further implies that the fitted step can take any number of TD iterations (not necessarily until convergence) without affecting the overall convergence and sample complexity of the fitted SARSA algorithm. In particular, the comparison between the original SARSA and the fitted SARSA algorithms indicates that they have the same overall sample complexity. On the other hand, the fitted SARSA algorithm is more computationally efficient due to the following two facts: (a) with the same number of samples n₀, the general fitted SARSA algorithm applies the policy improvement operator fewer times, namely n₀/B times; and (b) to apply the policy improvement operator, an inner product between φ and θ_{tB} needs to be computed, the complexity of which scales linearly with the size of the action space |A|.

5 Discussion of Lipschitz Continuity Assumption

In this section, we discuss the Lipschitz continuity assumption on the policy improvement operator Γ, which plays an important role in the convergence of SARSA.
Using a tabular approach that stores the action-values, the convergence of the SARSA algorithm was established in [33].
However, an example given in [13] shows that SARSA with function approximation and an ε-greedy policy improvement operator can chatter and fail to converge. Later, [14] showed that SARSA converges to a bounded region, although this region may be large, and thus does not diverge in the way that Q-learning with linear function approximation can. One possible explanation of this non-convergent behavior of the SARSA algorithm with ε-greedy and softmax policy improvement operators is the discontinuity in the action selection strategies [27, 10]. More specifically, a slight change in the estimate of the action-value function may result in a big change in the behavior policy, which in turn yields a completely different estimate of the action-value function.
Toward further understanding the convergence of SARSA, [10] showed that approximate value iteration with softmax policy improvement is guaranteed to have fixed points, which however may not be unique, and [27] later showed that for any continuous policy improvement operator, fixed points of SARSA are guaranteed to exist. Then [28] developed a convergent form of SARSA by using a Lipschitz continuous policy improvement operator, and demonstrated its convergence to the unique limit point when the Lipschitz constant is not too large. As discussed in [27], the non-convergence example in [13] does not contradict the convergence result in [28], because the example does not satisfy the Lipschitz continuity condition on the policy improvement operator, which is essential to guarantee the convergence of SARSA. In this paper, we follow this line of reasoning, and consider Lipschitz continuous policy improvement operators.
As discussed in [28], the Lipschitz constant C should not be chosen too large, so as to ensure the convergence of the SARSA algorithm. However, to ensure exploitation, one generally prefers a large Lipschitz constant C so that the agent can choose actions with higher estimated action-values.
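The discontinuity issue above can be made concrete with a small numeric illustration (all values hypothetical): under a greedy operator, an arbitrarily small perturbation of two nearly tied action-values flips the entire action distribution, whereas a softmax operator with a fixed temperature moves the distribution only in proportion to the perturbation (its Lipschitz constant grows as the temperature shrinks, which mirrors the tension around C discussed here).

```python
import numpy as np

def greedy(q):
    """Greedy policy: all mass on the argmax action (discontinuous in q)."""
    p = np.zeros_like(q)
    p[np.argmax(q)] = 1.0
    return p

def softmax(q, temp):
    """Softmax policy: continuous in q, with Lipschitz constant ~ 1/temp."""
    z = np.exp((q - q.max()) / temp)
    return z / z.sum()

q = np.array([1.000, 1.001])      # two nearly tied action-values
q_eps = np.array([1.001, 1.000])  # a perturbation of size 1e-3

# Greedy: the tiny perturbation flips the whole distribution.
jump_greedy = np.abs(greedy(q) - greedy(q_eps)).sum()

# Softmax with moderate temperature: the distribution barely moves.
jump_soft = np.abs(softmax(q, temp=1.0) - softmax(q_eps, temp=1.0)).sum()

print(jump_greedy)  # 2.0 -- maximal total-variation jump
print(jump_soft)    # ~1e-3 -- proportional to the perturbation
```

Lowering `temp` makes the softmax operator approach the greedy one, trading a larger effective Lipschitz constant (more exploitation) for a weaker continuity guarantee.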
In [28], an adaptive approach to choosing a policy improvement operator with a proper C was proposed. It was also noted in [28] that convergence may be obtainable with a much larger C than the one suggested by Theorems 1, 2 and 3.
However, an important open problem for SARSA algorithms with a Lipschitz continuous operator (and also for other algorithms with continuous action selection [10]) is that there is no theoretical performance characterization of the solutions that this type of algorithm produces. It is thus of future interest to further investigate the performance of the policy generated by the SARSA algorithm with a Lipschitz continuous operator.

6 Conclusion

In this paper, we presented the first finite-sample analysis for the SARSA algorithm with continuous state space and linear function approximation. Our analysis is applicable to the online case with a single sample path and non-i.i.d. data. In particular, we developed a novel technique to handle the stochastic bias for dynamically changing behavior policies, which enables non-asymptotic analysis of this type of stochastic approximation algorithms. We also presented a fitted SARSA algorithm, which provides a general framework for iterative on-policy fitted policy iteration, and provided the finite-sample analysis for this fitted SARSA algorithm.

Acknowledgement

We would like to thank the anonymous reviewer and the Area Chair for their valuable comments. The work of T. Xu and Y. Liang was supported in part by the U.S. National Science Foundation under Grants CCF-1761506, ECCS-1818904, and CCF-1801855.

References
[1] A. Antos, C. Szepesvari, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
[2] K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning.
In Proc. International Conference on Machine Learning (ICML), 2016.
[3] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition, 2012.
[4] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
[5] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246, 2002.
[6] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
[7] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[8] G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proc. Conference on Learning Theory (COLT), 2018.
[9] G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence (AAAI), 2018.
[10] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608, 2000.
[11] A.-M. Farahmand, C. Szepesvari, and R. Munos. Error propagation for approximate policy and value iteration. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
[12] M. Ghavamzadeh, A. Lazaric, O. Maillard, and R. Munos. LSTD with random projections. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
[13] G. J. Gordon. Chattering in SARSA(λ). CMU Learning Lab internal report, 1996.
[14] G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Proc.
Advances in Neural Information Processing Systems (NeurIPS), pages 1040–1046, 2001.
[15] H. Gupta, R. Srikant, and L. Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.
[16] H. Kushner. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):87–96, 2010.
[17] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
[18] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[19] C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
[20] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In Proc. International Conference on Machine Learning (ICML), 2010.
[21] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.
[22] A. Lazaric, M. Ghavamzadeh, and R. Munos. Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research, 17:583–612, 2016.
[23] F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proc. International Conference on Machine Learning (ICML), pages 664–671. ACM, 2008.
[24] A. Y. Mitrophanov. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014, 2005.
[25] R. Munos and C. Szepesvari.
Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, May 2008.
[26] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[27] T. J. Perkins and M. D. Pendrith. On the existence of fixed points for Q-learning and Sarsa in partially observable domains. In Proc. International Conference on Machine Learning (ICML), pages 490–497, 2002.
[28] T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1627–1634, 2003.
[29] B. A. Pires and C. Szepesvari. Statistical linear estimation with penalized estimators: An application to reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 2012.
[30] L. Prashanth, N. Korda, and R. Munos. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
[31] G. A. Rummery and M. Niranjan. Online Q-learning using connectionist systems. Technical Report, Cambridge University Engineering Department, Sept. 1994.
[32] D. Shah and Q. Xie. Q-learning with nearest neighbors. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[33] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.
[34] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Proc. Conference on Learning Theory (COLT), 2019.
[35] M. Tagorti and B. Scherrer. On the rate of convergence and error bounds for LSTD(λ). In Proc.
International Conference on Machine Learning (ICML), 2015.
[36] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, May 1997.
[37] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. In Proc. International Conference on Machine Learning (ICML), 2018.
[38] T. Xu, S. Zou, and Y. Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.
[39] Z. Yang, Y. Xie, and Z. Wang. A theoretical analysis of deep Q-learning. arXiv:1901.00137, Jan. 2019.