{"title": "Model-Free Least-Squares Policy Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 1547, "page_last": 1554, "abstract": "", "full_text": "Model-Free Least Squares Policy Iteration\n\nMichail G. Lagoudakis\nDepartment of Computer Science\nDuke University\nDurham, NC 27708\nmgl@cs.duke.edu\n\nRonald Parr\nDepartment of Computer Science\nDuke University\nDurham, NC 27708\nparr@cs.duke.edu\n\nAbstract\n\nWe propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems; however, it has not previously had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI), addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.\n\n1 Introduction\n\nLinear least squares function approximators offer many advantages in the context of reinforcement learning. While their ability to generalize is less powerful than black box methods such as neural networks, they have their virtues: they are easy to implement and use, and their behavior is fairly transparent, both from an analysis standpoint and from a debugging and feature engineering standpoint. 
When linear methods fail, it is usually relatively easy to get some insight into why the failure has occurred.\n\nOur enthusiasm for this approach is inspired by the least squares temporal difference learning algorithm (LSTD) [4]. LSTD makes efficient use of data and converges faster than other conventional temporal difference learning methods. Although it is initially appealing to attempt to use LSTD in the evaluation step of a policy iteration algorithm, this combination can be problematic. Koller and Parr [5] present an example where the combination of LSTD style function approximation and policy iteration oscillates between two bad policies in an MDP with just 4 states. This behavior is explained by the fact that linear approximation methods such as LSTD compute an approximation that is weighted by the state visitation frequencies of the policy under evaluation. Further, even if this problem is overcome, a more serious difficulty is that the state value function that LSTD learns is of no use for policy improvement when a model of the process is not available.\n\nThis paper introduces the Least Squares Policy Iteration (LSPI) algorithm, which extends the benefits of LSTD to control problems. First, we introduce LSQ, an algorithm that learns least squares approximations of the state-action ($Q$) value function, thus permitting action selection and policy improvement without a model. Next we introduce LSPI which uses the results of LSQ to form an approximate policy iteration algorithm. This algorithm combines the policy search efficiency of policy iteration with the data efficiency of LSTD. It is completely off policy and can, in principle, use data collected from any reasonable sampling distribution. 
We have evaluated this method on several problems, including a simulated bicycle control problem in which LSPI learns to guide the bicycle to the goal by observing a relatively small number of completely random trials.\n\n2 Markov Decision Processes\n\nWe assume that the underlying control problem is a Markov Decision Process (MDP). An MDP is defined as a 4-tuple $(S, A, P, R)$ where: $S$ is a finite set of states; $A$ is a finite set of actions; $P$ is a Markovian transition model where $P(s, a, s')$ represents the probability of going from state $s$ to state $s'$ with action $a$; and $R$ is a reward function $R : S \\times A \\times S \\to \\mathbb{R}$, such that $R(s, a, s')$ represents the reward obtained when taking action $a$ in state $s$ and ending up in state $s'$.\n\nWe will be assuming that the MDP has an infinite horizon and that future rewards are discounted exponentially with a discount factor $\\gamma \\in [0, 1)$. (If we assume that all policies are proper, our results generalize to the undiscounted case.) A stationary policy $\\pi$ for an MDP is a mapping $\\pi : S \\to A$, where $\\pi(s)$ is the action the agent takes at state $s$. The state-action value function $Q^{\\pi}(s, a)$ is defined over all possible combinations of states and actions and indicates the expected, discounted total reward when taking action $a$ in state $s$ and following policy $\\pi$ thereafter. The exact $Q$-values for all state-action pairs can be found by solving the linear system of the Bellman equations:\n\n$$Q^{\\pi}(s, a) = \\mathcal{R}(s, a) + \\gamma \\sum_{s'} P(s, a, s')\\, Q^{\\pi}(s', \\pi(s')),$$\n\nwhere $\\mathcal{R}(s, a) = \\sum_{s'} P(s, a, s')\\, R(s, a, s')$. In matrix format, the system becomes $Q^{\\pi} = \\mathcal{R} + \\gamma P^{\\pi} Q^{\\pi}$, where $Q^{\\pi}$ and $\\mathcal{R}$ are vectors of size $|S||A|$ and $P^{\\pi}$ is a stochastic matrix of size $(|S||A| \\times |S||A|)$. $P^{\\pi}$ describes the transitions from pairs $(s, a)$ to pairs $(s', \\pi(s'))$.\n\nFor every MDP, there exists an optimal policy, $\\pi^{*}$, which maximizes the expected, discounted return of every state. Policy iteration is a method of discovering this policy by iterating through a sequence of monotonically improving policies. Each iteration consists of two phases. Value determination computes the state-action values for a policy $\\pi^{(t)}$ by solving the above system. Policy improvement defines the next policy as $\\pi^{(t+1)}(s) = \\arg\\max_{a} Q^{\\pi^{(t)}}(s, a)$. These steps are repeated until convergence to an optimal policy, often in a surprisingly small number of steps.\n\n3 Least Squares Approximation of Q Functions\n\nPolicy iteration relies upon the solution of a system of linear equations to find the Q values for the current policy. This is impractical for large state and action spaces. In such cases we may wish to approximate $Q^{\\pi}$ with a parametric function approximator and do some form of approximate policy iteration. We now address the problem of finding a set of parameters that maximizes the accuracy of our approximator. A common class of approximators is the so called linear architectures, where the value function is approximated as a linear weighted combination of basis functions (features):\n\n$$\\hat{Q}^{\\pi}(s, a, w) = \\sum_{i=1}^{k} \\phi_i(s, a)\\, w_i = \\phi(s, a)^{T} w,$$\n\nwhere $w$ is a set of weights (parameters). In general, $k \\ll |S||A|$ and so, the linear system above now becomes an overconstrained system over the $k$ parameters $w$:\n\n$$\\Phi w \\approx \\mathcal{R} + \\gamma P^{\\pi} \\Phi w,$$\n\nwhere $\\Phi$ is a $(|S||A| \\times k)$ matrix. We are interested in a set of weights $w^{\\pi}$ that yields a fixed point in value function space, that is a value function $\\hat{Q}^{\\pi} = \\Phi w^{\\pi}$ that is invariant under one step of value determination followed by orthogonal projection to the space spanned by the basis functions. Assuming that the columns of $\\Phi$ are linearly independent this is\n\n$$\\Phi \\left(\\Phi^{T} \\Phi\\right)^{-1} \\Phi^{T} \\left(\\mathcal{R} + \\gamma P^{\\pi} \\Phi w^{\\pi}\\right) = \\Phi w^{\\pi} \\quad\\Longrightarrow\\quad w^{\\pi} = \\left(\\Phi^{T} \\left(\\Phi - \\gamma P^{\\pi} \\Phi\\right)\\right)^{-1} \\Phi^{T} \\mathcal{R}.$$\n\nWe note that this is the standard fixed point approximation method for linear value functions with the exception that the problem is formulated in terms of Q values instead of state values. For any $P^{\\pi}$, the solution is guaranteed to exist for all but finitely many $\\gamma$ [5].\n\n4 LSQ: Learning the State-Action Value Function\n\nIn the previous section we assumed that a model $(R, P)$ of the underlying MDP is available. In many practical applications, such a model is not available and the value function or, more precisely, its parameters have to be learned from sampled data. These sampled data are tuples of the form $(s, a, r, s')$, meaning that in state $s$, action $a$ was taken, a reward $r$ was received, and the resulting state was $s'$. These data can be collected from actual (sequential) episodes or from random queries to a generative model of the MDP. In the extreme case, they can be experiences of other agents on the same MDP. We know that the desired set of weights can be found as the solution of the system $A w^{\\pi} = b$, where\n\n$$A = \\Phi^{T} \\left(\\Phi - \\gamma P^{\\pi} \\Phi\\right) \\quad\\text{and}\\quad b = \\Phi^{T} \\mathcal{R}.$$\n\nThe matrix $P^{\\pi}$ and the vector $\\mathcal{R}$ are unknown and so, $A$ and $b$ cannot be determined a priori. However, $A$ and $b$ can be approximated using samples. Recall that $\\Phi$, $P^{\\pi}\\Phi$, and $\\mathcal{R}$ are of the form:\n\n$$\\Phi = \\begin{pmatrix} \\phi(s_1, a_1)^{T} \\\\ \\vdots \\\\ \\phi(s, a)^{T} \\\\ \\vdots \\\\ \\phi(s_{|S|}, a_{|A|})^{T} \\end{pmatrix}, \\quad P^{\\pi}\\Phi = \\begin{pmatrix} \\sum_{s'} P(s_1, a_1, s')\\, \\phi(s', \\pi(s'))^{T} \\\\ \\vdots \\\\ \\sum_{s'} P(s, a, s')\\, \\phi(s', \\pi(s'))^{T} \\\\ \\vdots \\\\ \\sum_{s'} P(s_{|S|}, a_{|A|}, s')\\, \\phi(s', \\pi(s'))^{T} \\end{pmatrix}, \\quad \\mathcal{R} = \\begin{pmatrix} \\sum_{s'} P(s_1, a_1, s')\\, R(s_1, a_1, s') \\\\ \\vdots \\\\ \\sum_{s'} P(s, a, s')\\, R(s, a, s') \\\\ \\vdots \\\\ \\sum_{s'} P(s_{|S|}, a_{|A|}, s')\\, R(s_{|S|}, a_{|A|}, s') \\end{pmatrix}.$$\n\nGiven a set of samples, $D = \\{(s_{d_i}, a_{d_i}, s'_{d_i}, r_{d_i}) \\mid i = 1, 2, \\ldots, L\\}$, where the $(s_{d_i}, a_{d_i})$ are sampled from $S \\times A$ according to distribution $\\mu$ and the $s'_{d_i}$ are sampled according to $P(s_{d_i}, a_{d_i}, s'_{d_i})$, we can construct approximate versions of $\\Phi$, $P^{\\pi}\\Phi$, and $\\mathcal{R}$ as follows:\n\n$$\\hat{\\Phi} = \\begin{pmatrix} \\phi(s_{d_1}, a_{d_1})^{T} \\\\ \\vdots \\\\ \\phi(s_{d_i}, a_{d_i})^{T} \\\\ \\vdots \\\\ \\phi(s_{d_L}, a_{d_L})^{T} \\end{pmatrix}, \\quad \\widehat{P^{\\pi}\\Phi} = \\begin{pmatrix} \\phi(s'_{d_1}, \\pi(s'_{d_1}))^{T} \\\\ \\vdots \\\\ \\phi(s'_{d_i}, \\pi(s'_{d_i}))^{T} \\\\ \\vdots \\\\ \\phi(s'_{d_L}, \\pi(s'_{d_L}))^{T} \\end{pmatrix}, \\quad \\hat{\\mathcal{R}} = \\begin{pmatrix} r_{d_1} \\\\ \\vdots \\\\ r_{d_i} \\\\ \\vdots \\\\ r_{d_L} \\end{pmatrix}.$$\n\nThese approximations can be thought of as first sampling rows from $\\Phi$ according to $\\mu$ and then, conditioned on these samples, as sampling terms from the summations in the corresponding rows of $P^{\\pi}\\Phi$ and $\\mathcal{R}$. The sampling distribution from the summations is governed by the underlying dynamics ($P(s, a, s')$) of the process as the samples in $D$ are taken directly from the MDP.\n\nGiven $\\hat{\\Phi}$, $\\widehat{P^{\\pi}\\Phi}$, and $\\hat{\\mathcal{R}}$, $A$ and $b$ can be approximated as\n\n$$\\hat{A} = \\hat{\\Phi}^{T} \\left(\\hat{\\Phi} - \\gamma \\widehat{P^{\\pi}\\Phi}\\right) \\quad\\text{and}\\quad \\hat{b} = \\hat{\\Phi}^{T} \\hat{\\mathcal{R}}.$$\n\nWith $L$ uniformly distributed samples over pairs of states and actions $(s, a)$, the approximations $\\hat{A}$ and $\\hat{b}$ are consistent approximations of the true $A$ and $b$:\n\n$$E(\\hat{A}) = \\frac{L}{|S||A|} A \\quad\\text{and}\\quad E(\\hat{b}) = \\frac{L}{|S||A|} b.$$\n\nThe Markov property ensures that the solution $\\hat{w}^{\\pi}$ will converge to the true solution $w^{\\pi}$ for sufficiently large $L$ whenever $w^{\\pi}$ exists:\n\n$$\\lim_{L \\to \\infty} \\hat{w}^{\\pi} = \\lim_{L \\to \\infty} \\hat{A}^{-1} \\hat{b} = \\left(\\frac{L}{|S||A|} A\\right)^{-1} \\frac{L}{|S||A|}\\, b = A^{-1} b = w^{\\pi}.$$\n\nIn the more general case, where $\\mu$ is not uniform, we will compute a weighted projection, which minimizes the $\\mu$ weighted distance in the projection step. Thus, state $s$ is implicitly assigned weight $\\mu(s)$ and the projection minimizes the weighted sum of squared errors with respect to $\\mu$. In LSTD, for example, $\\mu$ is the stationary distribution of $P^{\\pi}$, giving high weight to frequently visited states, and low weight to infrequently visited states.\n\nAs with LSTD, it is easy to see that approximations ($\\hat{A}_1$, $\\hat{b}_1$, $\\hat{A}_2$, $\\hat{b}_2$) derived from different sets of samples ($D_1$, $D_2$) can be combined additively to yield a better approximation that corresponds to the combined set of samples:\n\n$$\\hat{A} = \\hat{A}_1 + \\hat{A}_2 \\quad\\text{and}\\quad \\hat{b} = \\hat{b}_1 + \\hat{b}_2.$$\n\nThis observation leads to an incremental update rule for $\\hat{A}$ and $\\hat{b}$. Assume that initially $\\hat{A} = 0$ and $\\hat{b} = 0$. For a fixed policy, a new sample $(s, a, r, s')$ contributes to the approximation according to the following update equation:\n\n$$\\hat{A} \\leftarrow \\hat{A} + \\phi(s, a) \\left(\\phi(s, a) - \\gamma\\, \\phi(s', \\pi(s'))\\right)^{T} \\quad\\text{and}\\quad \\hat{b} \\leftarrow \\hat{b} + \\phi(s, a)\\, r.$$\n\nWe call this new algorithm LSQ due to its similarity to LSTD. However, unlike LSTD, it computes Q functions and does not expect the data to come from any particular Markov chain. It is a feature of this algorithm that it can use the same set of samples to compute Q values for any policy representation that offers an action choice for each $s'$ in the set. The policy merely determines which $\\phi(s', \\pi(s'))$ is added to $\\hat{A}$ for each sample. Thus, LSQ can use every single sample available to it no matter what policy is under evaluation. We note that if a particular set of projection weights are desired, it is straightforward to reweight the samples as they are added to $\\hat{A}$.\n\nNotice that apart from storing the samples, LSQ requires only $O(k^2)$ space independently of the size of the state and the action space. For each sample in $D$, LSQ incurs a cost of $O(k^2)$ to update the matrices $\\hat{A}$ and $\\hat{b}$ and a one time cost of $O(k^3)$ to solve the system and find the weights. Singular value decomposition (SVD) can be used for robust inversion of $\\hat{A}$ as it is not always a full rank matrix.\n\nLSQ includes LSTD as a special case where there is only one action available. 
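
The incremental LSQ update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names lsq and solve_linear, the representation of samples as (s, a, r, s_next) tuples, and the use of plain Gauss-Jordan elimination in place of the SVD-based inversion mentioned in the text are all choices made here to keep the example self-contained.

```python
# Hypothetical helper (not from the paper): solve A x = b by Gauss-Jordan
# elimination with partial pivoting. The text suggests SVD for robustness.
def solve_linear(A, b):
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError('A_hat is (numerically) singular')
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def lsq(samples, k, phi, gamma, policy):
    # Accumulate A_hat and b_hat one sample at a time, then solve A_hat w = b_hat.
    # phi(s, a) returns a list of k feature values; policy(s) returns an action.
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        for i in range(k):
            for j in range(k):
                # A_hat += phi(s, a) (phi(s, a) - gamma * phi(s', pi(s')))^T
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
            b[i] += f[i] * r  # b_hat += phi(s, a) * r
    return solve_linear(A, b)


# Toy data (an assumption for illustration): a two-state chain with a single
# action 0, where state 0 leads to state 1 (reward 0) and state 1 loops (reward 1).
chain_samples = [(0, 0, 0.0, 1), (1, 0, 1.0, 1)]
one_hot = lambda s, a: [1.0 if s == i else 0.0 for i in range(2)]
w_chain = lsq(chain_samples, 2, one_hot, 0.5, lambda s: 0)
```

With one-hot features the approximation can represent the value function exactly, so on this toy chain the weights should coincide with the exact values Q(0) = 1 and Q(1) = 2 for a discount of 0.5.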
It is also possible to extend LSQ to LSQ($\\lambda$) in a way that closely resembles LSTD($\\lambda$) [3], but in that case the sample set must consist of complete episodes generated using the policy under evaluation, which again raises the question of bias due to sampling distribution, and prevents the reusability of samples. LSQ is also applicable in the case of infinite and continuous state and/or action spaces with no modification. States and actions are reflected only through the basis functions of the linear approximation and the resulting value function can cover the entire state-action space with the appropriate set of continuous basis functions.\n\n5 LSPI: Least Squares Policy Iteration\n\nThe LSQ algorithm provides a means of learning an approximate state-action value function, $\\hat{Q}^{\\pi}(s, a)$, for any fixed policy, $\\pi$. We now integrate LSQ into an approximate policy iteration algorithm. Clearly, LSQ is a candidate for the value determination step. The key insight is that we can achieve the policy improvement step without ever explicitly representing our policy and without any sort of model. 
Recall that in policy improvement, $\\pi^{(t+1)}$ will pick the action $a$ that maximizes $Q^{\\pi^{(t)}}(s, a)$. Since LSQ computes Q functions directly, we do not need a model to determine our improved policy; all the information we need is contained implicitly in the weights parameterizing our Q functions\u00b9:\n\n$$\\pi^{(t+1)}(s, w) = \\arg\\max_{a} \\hat{Q}(s, a) = \\arg\\max_{a} \\phi(s, a)^{T} w.$$\n\nWe close the loop simply by requiring that LSQ performs this maximization for each $s'$ when constructing the $\\hat{A}$ matrix for a policy. For very large or continuous action spaces, explicit maximization over $a$ may be impractical. In such cases, some sort of global nonlinear optimization may be required to determine the optimal action.\n\nSince LSPI uses LSQ to compute approximate Q functions, it can use any data source for samples. A single set of samples may be used for the entire optimization, or additional samples may be acquired, either through trajectories or some other scheme, for each iteration of policy iteration. We summarize the LSPI algorithm in Figure 1. As with any approximate policy iteration algorithm, the convergence of LSPI is not guaranteed. Approximate policy iteration variants are typically analyzed in terms of a value function approximation error and an action selection error [2]. 
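
The loop summarized in Figure 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names lspi, lsq, greedy, q_value and solve_linear are hypothetical, the policy is represented only by its weight vector, the stopping test uses the max-norm of the weight change (the text does not fix a norm here), ties in the maximization go to the first action, and a small Gauss-Jordan solver stands in for the SVD-based inversion suggested for LSQ.

```python
def solve_linear(A, b):
    # Hypothetical helper: Gauss-Jordan elimination with partial pivoting.
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError('A_hat is (numerically) singular')
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def q_value(s, a, w, phi):
    # Linear architecture: Q_hat(s, a) = phi(s, a)^T w
    return sum(f * x for f, x in zip(phi(s, a), w))


def greedy(s, w, phi, actions):
    # Policy improvement without a model: argmax over a of phi(s, a)^T w.
    return max(actions, key=lambda a: q_value(s, a, w, phi))


def lsq(samples, k, phi, gamma, w, actions):
    # Evaluate the policy that is greedy with respect to the weights w.
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, greedy(s_next, w, phi, actions))
        for i in range(k):
            for j in range(k):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
            b[i] += f[i] * r
    return solve_linear(A, b)


def lspi(samples, k, phi, gamma, actions, eps=1e-6, max_iter=20):
    w_new = [0.0] * k  # default initial policy: w_0 = 0
    for _ in range(max_iter):
        w = w_new
        w_new = lsq(samples, k, phi, gamma, w, actions)
        if max(abs(x - y) for x, y in zip(w, w_new)) < eps:
            break
    return w_new


# Toy MDP (an assumption for illustration): two states, action 0 stays and
# action 1 moves to the other state; landing in state 1 yields reward 1.
actions = [0, 1]
samples = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
phi = lambda s, a: [1.0 if 2 * s + a == i else 0.0 for i in range(4)]
w_star = lspi(samples, 4, phi, 0.5, actions)
policy = [greedy(s, w_star, phi, actions) for s in (0, 1)]
```

The same four samples are reused in every iteration, which mirrors the data reuse emphasized in the text; only the successor features selected by the current greedy policy change between iterations.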
LSPI does not require an approximate policy representation, e.g., a policy function or \u201cactor\u201d architecture, removing one source of error. Moreover, the direct computation of linear Q functions from any data source, including stored data, allows the use of all available data to evaluate every policy, making the problem of minimizing value function approximation error more manageable.\n\n\u00b9 This is the same principle that allows action selection without a model in Q-learning. To our knowledge, this is the first application of this principle in an approximate policy iteration algorithm.\n\nLSPI ($k$, $\\phi$, $\\gamma$, $\\epsilon$, $\\pi_0$, $D_0$)\n    // $k$: Number of basis functions\n    // $\\phi$: Basis functions\n    // $\\gamma$: Discount factor\n    // $\\epsilon$: Stopping criterion\n    // $\\pi_0$: Initial policy, given as $w_0$, $\\pi_0 = \\pi(s, w_0)$ (default: $w_0 = 0$)\n    // $D_0$: Initial set of samples, possibly empty\n\n    $D = D_0$\n    $\\pi' = \\pi_0$    // In essence, $w' = w_0$\n    repeat\n        Update $D$ (optional)    // Add/remove samples, or leave unchanged\n        $\\pi = \\pi'$    // $w = w'$\n        $\\pi' =$ LSQ ($D$, $k$, $\\phi$, $\\gamma$, $\\pi$)    // $w' =$ LSQ ($D$, $k$, $\\phi$, $\\gamma$, $w$)\n    until ($\\pi \\approx \\pi'$)    // that is, ($\\|w - w'\\| < \\epsilon$)\n    return $\\pi$    // return $w$\n\nFigure 1: The LSPI algorithm.\n\n6 Results\n\nWe initially tested LSPI on variants of the problematic MDP from Koller and Parr [5], essentially simple chains of varying length. LSPI easily found the optimal policy within a few iterations using actual trajectories. We also tested LSPI on the inverted pendulum problem, where the task is to balance a pendulum in the upright position by moving the cart to which it is attached. Using a simple set of basis functions and samples collected from random episodes (starting in the upright position and following a purely random policy), LSPI was able to find excellent policies using a few hundred such episodes [7].\n\nFinally, we tried a bicycle balancing problem [12] in which the goal is to learn to balance and ride a bicycle to a target position located 1 km away from the starting location. Initially, the bicycle\u2019s orientation is at an angle of $90^{\\circ}$ to the goal. The state description is a six-dimensional vector $(\\theta, \\dot{\\theta}, \\omega, \\dot{\\omega}, \\ddot{\\omega}, \\psi)$, where $\\theta$ is the angle of the handlebar, $\\omega$ is the vertical angle of the bicycle, and $\\psi$ is the angle of the bicycle to the goal. The actions are the torque $\\tau$ applied to the handlebar (discretized to $\\{-2, 0, +2\\}$) and the displacement of the rider $\\upsilon$ (discretized to $\\{-0.02, 0, +0.02\\}$). In our experiments, actions are restricted to be either $\\tau$ or $\\upsilon$ (or nothing) giving a total of 5 actions\u00b2. The noise in the system is a uniformly distributed term in $[-0.02, +0.02]$ added to the displacement component of the action. The dynamics of the bicycle are based on the model described by Randl\u00f8v and Alstr\u00f8m [12] and the time step of the simulation is set to $0.01$ seconds.\n\n\u00b2 Results are similar for the full 9-action case, but required more training data.\n\nThe state-action value function $Q(s, a)$ for a fixed action $a$ is approximated by a linear combination of 20 basis functions:\n\n$$(1, \\omega, \\dot{\\omega}, \\omega^2, \\dot{\\omega}^2, \\omega\\dot{\\omega}, \\theta, \\dot{\\theta}, \\theta^2, \\dot{\\theta}^2, \\theta\\dot{\\theta}, \\omega\\theta, \\omega\\theta^2, \\omega^2\\theta, \\psi, \\psi^2, \\psi\\theta, \\bar{\\psi}, \\bar{\\psi}^2, \\bar{\\psi}\\theta)^{T},$$\n\nwhere $\\bar{\\psi} = \\pi - \\psi$ for $\\psi > 0$ and $\\bar{\\psi} = -\\pi - \\psi$ for $\\psi < 0$. Note that the state variable $\\ddot{\\omega}$ is completely ignored. This block of basis functions is repeated for each of the 5 actions, giving a total of 100 basis functions and weights. Training data were collected by initializing the bicycle to a random state around the equilibrium position and running small episodes of 20 steps each using a purely random policy. LSPI was applied on training sets of different sizes and the average performance is shown in Figure 2(a). We used the same data set for each run of policy iteration and usually obtained convergence in 6 or 7 iterations. Successful policies usually reached the goal in approximately 1 km total, near optimal performance. We also show an annotated set of trajectories to demonstrate the performance improvement over multiple steps of policy iteration in Figure 2(b).\n\n[Figure 2, panel (a): percentage of trials reaching the goal versus number of training episodes (0 to 3000). Panel (b): bicycle trajectories from the starting position to the goal, annotated by policy iteration step: 1st iteration; 2nd iteration (crash); 3rd iteration; 4th and 8th iteration; 5th and 7th iteration; 6th iteration (crash).]\n\nFigure 2: The bicycle problem: (a) Percentage of final policies that reach the goal, averaged over 200 runs of LSPI for each training set size; (b) A sample run of LSPI based on 2500 training trials. This run converged in 8 iterations. Note that iterations 5 and 7 had different Q-values but very similar policies. This was true of iterations 4 and 8 as well. The weights of the ninth iteration differed from those of the eighth by less than a small tolerance, indicating convergence. The curves at the end of the trajectories indicate where the bicycle has looped back for a second pass through the goal.\n\nThe following design decisions influenced the performance of LSPI on this problem: As is typical with this problem, we used a shaping reward [10] for the distance to the goal. In this case, we used $0.01$ of the net change (in meters) in the distance to the goal. We found that when using full random trajectories, most of our sample points were not very useful; they occurred after the bicycle had already entered into a \u201cdeath spiral\u201d from which recovery was impossible. This complicated our learning efforts by biasing the samples towards hopeless parts of the space, so we decided to cut off trajectories after 20 steps. This created an additional problem because there was no terminating reward signal to indicate failure. We approximated this with an additional shaping reward, which was proportional to the net change in the square of the vertical angle. This roughly approximated the likelihood of falling at the end of a truncated trajectory. Finally, we used a discount of $0.8$, which seemed to yield more robust performance.\n\nWe admit to some slight unease about the amount of shaping and adjusting of parameters that was required to obtain good results on this problem. To verify that we had not eliminated the learning problem entirely through shaping, we reran some experiments using a discount of $0$. In this case LSQ simply projects the immediate reward function into the column space of the basis functions. If the problem were tweaked too much, acting to maximize the projected immediate reward would be sufficient to obtain good performance. On the contrary, these runs always produced immediate crashes in trials.\n\n7 Discussion and Conclusions\n\nWe have presented a new, model-free approximate policy iteration algorithm called LSPI, which is inspired by the LSTD algorithm. This algorithm is able to use either a stored repository of samples or samples generated dynamically from trajectories. It performs action selection and approximate policy iteration entirely in value function space, without any need for a model. In contrast to other approaches to approximate policy iteration, it does not require any sort of approximate policy function.\n\nIn comparison to the memory based approach of Ormoneit and Sen [11], our method makes stronger use of function approximation. 
Rather than using our samples to implicitly construct an approximate model using kernels, we operate entirely in value function space and use our samples directly in the value function projection step. As noted by Boyan [3] the $A$ matrix used by LSTD and LSPI can be viewed as an approximate, compressed model. This is most compelling if the columns of $\\Phi$ are orthonormal. While this provides some intuitions, a proper transition function cannot be reconstructed directly from $A$, making a possible interpretation of LSPI as a model based method an area for future research.\n\nIn comparison to direct policy search methods [9, 8, 1, 13, 6], we offer the strength of policy iteration. Policy search methods typically make a large number of relatively small steps of gradient-based policy updates to a parameterized policy function. Our use of policy iteration generally results in a small number of very large steps directly in policy space.\n\nOur experimental results demonstrate the potential of our method. We achieved good performance on the bicycle task using a very small number of randomly generated samples that were reused across multiple steps of policy iteration. Achieving this level of performance with just a linear value function architecture did require some tweaking, but the transparency of the linear architecture made the relevant issues much more salient than would be the case with any \u201cblack box\u201d approach. We believe that the direct approach to function approximation and data reuse taken by LSPI will make the algorithm an intuitive and easy to use first choice for many reinforcement learning tasks. In future work, we plan to investigate the application of our method to multi-agent systems and the use of density estimation to control the projection weights in our function approximator.\n\nAcknowledgments\n\nWe would like to thank J. Randl\u00f8v and P. Alstr\u00f8m for making their bicycle simulator available. We also thank C. Guestrin, D. Koller, U. Lerner and M. Littman for helpful discussions. The first author would like to thank the Lilian-Boudouri Foundation in Greece for partial financial support.\n\nReferences\n\n[1] J. Baxter and P. Bartlett. Reinforcement learning in POMDP\u2019s via direct gradient ascent. In Proc. 17th International Conf. on Machine Learning, pages 41\u201348. Morgan Kaufmann, San Francisco, CA, 2000.\n\n[2] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.\n\n[3] Justin A. Boyan. Least-squares temporal difference learning. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 49\u201356. Morgan Kaufmann, San Francisco, CA, 1999.\n\n[4] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1/2/3):33\u201357, 1996.\n\n[5] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00). Morgan Kaufmann, 2000.\n\n[6] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference. MIT Press, 2000.\n\n[7] M. G. Lagoudakis and R. Parr. Model-Free Least-Squares policy iteration. Technical Report CS-2001-05, Department of Computer Science, Duke University, December 2001.\n\n[8] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00). Morgan Kaufmann, 2000.\n\n[9] A. Ng, R. Parr, and D. Koller. Policy search via density estimation. In Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference. MIT Press, 2000.\n\n[10] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: theory and application to reward shaping. In Proc. 16th International Conf. on Machine Learning, pages 278\u2013287. Morgan Kaufmann, San Francisco, CA, 1999.\n\n[11] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. To appear, Machine Learning, 2001.\n\n[12] J. Randl\u00f8v and P. Alstr\u00f8m. Learning to drive a bicycle using reinforcement learning and shaping. In The Fifteenth International Conference on Machine Learning, 1998. Morgan Kaufmann.\n\n[13] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference, 2000. MIT Press.", "award": [], "sourceid": 2134, "authors": [{"given_name": "Michail", "family_name": "Lagoudakis", "institution": null}, {"given_name": "Ronald", "family_name": "Parr", "institution": null}]}