{"title": "Fitted Q-iteration by Advantage Weighted Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1177, "page_last": 1184, "abstract": "Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces.", "full_text": "Fitted Q-iteration by Advantage Weighted Regression\n\nGerhard Neumann\n\nJan Peters\n\nInstitute for Theoretical Computer Science\n\nMax Planck Institute for Biological Cybernetics\n\nGraz University of Technology\n\nA-8010 Graz, Austria\n\nD-72076 T\u00fcbingen, Germany\nmail@jan-peters.net\n\ngerhard@igi.tu-graz.ac.at\n\nAbstract\n\nRecently, \ufb01tted Q-iteration (FQI) based methods have become more popular due\nto their increased sample ef\ufb01ciency, a more stable learning process and the higher\nquality of the resulting policy. However, these methods remain hard to use for con-\ntinuous action spaces which frequently occur in real-world tasks, e.g., in robotics\nand other technical applications. The greedy action selection commonly used for\nthe policy improvement step is particularly problematic as it is expensive for con-\ntinuous actions, can cause an unstable learning process, introduces an optimization\nbias and results in highly non-smooth policies unsuitable for real-world systems.\nIn this paper, we show that by using a soft-greedy action selection the policy\nimprovement step used in FQI can be simpli\ufb01ed to an inexpensive advantage-\nweighted regression. With this result, we are able to derive a new, computationally\nef\ufb01cient FQI algorithm which can even deal with high dimensional action spaces.\n\n1 Introduction\n\nReinforcement Learning [1] addresses the problem of how autonomous agents can improve their\nbehavior using their experience. At each time step t the agent can observe its current state st \u2208 X\nand chooses an appropriate action at \u2208 A. Subsequently, the agent gets feedback on the quality\nof the action, i.e., the reward rt = r(st, at), and observes the next state st+1. The goal of the\nagent is to maximize the accumulated reward expected in the future. In this paper, we focus on\nlearning policies for continuous, multi-dimensional control problems. Thus the state space X and\naction space A are continuous and multi-dimensional, meaning that discretizations start to become\nprohibitively expensive.\n\nWhile discrete-state/action reinforcement learning is a widely studied problem with rigorous con-\nvergence proofs, the same does not hold true for continuous states and actions. 
For continuous state spaces, few convergence guarantees exist and pathological cases of bad performance can be generated easily [2]. Moreover, many methods cannot be transferred straightforwardly to continuous actions.

Current approaches often circumvent continuous action spaces by focusing on problems where the actor can rely on a discrete set of actions, e.g., when learning a policy for driving to a goal in minimum time, an actor only needs three actions: maximum acceleration when starting, zero acceleration at maximum velocity and maximum throttle down when the goal is sufficiently close for a point landing. While this approach (called bang-bang in traditional control) works for the large class of minimum-time control problems, it is also a limited approach, as cost functions relevant to the real world incorporate much more complex constraints, e.g., cost functions in biological systems often punish the jerkiness of the movement [3], the amount of metabolic energy used [4] or the variance at the end-point [5]. For physical technical systems, the incorporation of further optimization criteria is of essential importance; just as a minimum-time policy is prone to damage the car in the long run, a similar policy would be highly dangerous for a robot and its environment, and the resulting energy consumption would reduce its autonomy. More complex, action-dependent immediate reward functions require that much larger sets of actions are employed.

We consider the use of continuous actions for fitted Q-iteration (FQI) based algorithms. FQI is a batch mode reinforcement learning (BMRL) algorithm. The algorithm maintains an estimate of the state-action value function Q(s, a) and uses the greedy operator max_a Q(s, a) on the action space for improving the policy. While this works well for discrete action spaces, the greedy operation is hard to perform for high-dimensional continuous actions. For this reason, the application of fitted Q-iteration based methods is often restricted to low-dimensional action spaces which can be efficiently discretized. In this paper, we show that the use of a stochastic soft-max policy instead of a greedy policy allows us to reduce the policy improvement step used in FQI to a simple advantage-weighted regression. The greedy operation max_a Q(s, a) over the actions is replaced by a less harmful greedy operation over the parameter space of the value function. This result allows us to derive a new, computationally efficient algorithm which is based on Locally-Advantage-WEighted Regression (LAWER).

We test our algorithm on three different benchmark tasks, i.e., the pendulum swing-up [6], the acrobot swing-up [1] and a dynamic version of the puddle-world [7] with 2 and 3 dimensions. We show that in spite of the soft-greedy action selection, our algorithm is able to produce high-quality policies.

2 Fitted Q-Iteration

In fitted Q-iteration [8, 6, 9] (FQI), we assume that all the experience of the agent up to the current time is given in the form H = { ⟨s_i, a_i, r_i, s'_i⟩ }_{1≤i≤N}. The task of the learning algorithm is to estimate an optimal control policy from this historical data. FQI approximates the state-action value function Q(s, a) iteratively using supervised regression techniques. New target values for the regression are generated by

    \tilde{Q}_{k+1}(i) = r_i + \gamma V_k(s'_i) = r_i + \gamma \max_{a'} Q_k(s'_i, a').    (1)

The regression problem for finding the function Q_{k+1} is defined by the list of data-point pairs D_k and the regression procedure Regress:

    D_k(Q_k) = \{ [(s_i, a_i), \tilde{Q}_{k+1}(i)] \}_{1 \le i \le N}, \qquad Q_{k+1} = \mathrm{Regress}(D_k(Q_k)).    (2)

FQI can be viewed as approximate value iteration with state-action value functions [9]. Previous experiments show that function approximators such as neural networks [6], radial basis function networks [8], CMACs [10] and regression trees [8] can be employed in this context. In [9], performance bounds for the value function approximation are given for a wide range of function approximators. The performance bounds also hold for continuous action spaces, but only in the case of an actor-critic variant of FQI. Unfortunately, to our knowledge, no experiments with this variant exist in the literature. Additionally, it is not clear how to apply this actor-critic variant efficiently with nonparametric function approximators.
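To make the procedure concrete, the generic loop of Equations (1) and (2) can be sketched in a few lines. The following is only an illustrative sketch, not the implementation used in the experiments: it assumes a finite set of candidate actions for the greedy step and uses an extra-trees regressor in the spirit of the tree-based FQI of [8]; the variable names and the regressor settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(s, a, r, s_next, candidate_actions, gamma=0.99, iterations=50):
    """Generic FQI (Eqs. 1-2): s, a, r, s_next are arrays of observed transitions;
    candidate_actions is a finite set used to approximate max_a' Q(s', a')."""
    X = np.hstack([s, a])                          # regression inputs (s_i, a_i)
    q_model = None
    for _ in range(iterations):
        if q_model is None:
            targets = r.copy()                     # first iteration: targets are the rewards
        else:
            # evaluate Q_k(s'_i, a') for every candidate action and take the maximum
            q_next = np.column_stack([
                q_model.predict(np.hstack([s_next, np.tile(ac, (len(s_next), 1))]))
                for ac in candidate_actions])
            targets = r + gamma * q_next.max(axis=1)          # Eq. (1)
        q_model = ExtraTreesRegressor(n_estimators=20).fit(X, targets)  # Eq. (2)
    return q_model
```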
FQI has proven to outperform classical online RL methods in many applications [8]. Nevertheless, FQI relies on the greedy action selection in Equation (1). Thus, the algorithm frequently requires a discrete set of actions, and generalization to continuous actions is not straightforward. Using the greedy operator for continuous action spaces is a hard problem in itself, as expensive optimization methods are needed for high-dimensional actions. Moreover, the values returned by the greedy operator often contain an optimization bias, causing an unstable learning process, including oscillations and divergence [11]. For a comparison with our algorithm, we use the Cross-Entropy (CE) optimization method [12] to find the maximum Q-values. In our implementation, we maintain a Gaussian distribution for the belief of the optimal action. We sample n_CE actions from this distribution. Then, the best e_CE < n_CE actions (with the highest Q-values) are used to update the parameters of this distribution. The whole process is repeated for k_CE iterations, starting with a uniformly distributed set of sample actions.
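A minimal sketch of this CE search over the action space is given below; the callable q_fn, the box constraints and the default parameter values are placeholders standing in for whatever Q-model and action bounds a given task provides.

```python
import numpy as np

def ce_maximize(q_fn, state, a_low, a_high, n_ce=25, e_ce_frac=0.3, k_ce=3, rng=None):
    """Cross-Entropy search for max_a Q(state, a) over a box-constrained action space.
    q_fn(state, actions) evaluates the current Q-model for a batch of candidate actions."""
    rng = np.random.default_rng() if rng is None else rng
    # start with a uniformly distributed set of sample actions
    samples = rng.uniform(a_low, a_high, size=(n_ce, len(a_low)))
    n_elite = max(1, int(e_ce_frac * n_ce))
    for _ in range(k_ce):
        q_values = q_fn(state, samples)
        elite = samples[np.argsort(q_values)[-n_elite:]]          # best e_CE actions
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6   # refit Gaussian belief
        samples = np.clip(rng.normal(mean, std, size=(n_ce, len(a_low))), a_low, a_high)
    q_values = q_fn(state, samples)
    return samples[np.argmax(q_values)]
```

Clipping the samples to the action bounds is our own addition for the sketch; it simply keeps the candidates inside the feasible torque/force range.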
FQI is inherently an offline method: given historical data, the algorithm estimates the optimal policy. However, FQI can also be used for online learning. After the FQI algorithm has finished, new episodes can be collected with the currently best inferred policy and the FQI algorithm is restarted.

3 Fitted Q-Iteration by Advantage Weighted Regression

A different method for policy updates in continuous action spaces is reinforcement learning by reward-weighted regression [13]. As shown by the authors, the action selection problem in the immediate-reward RL setting with continuous actions can be formulated as an expectation-maximization (EM) based algorithm and, subsequently, reduced to a reward-weighted regression. The weighted regression can be applied with ease to high-dimensional action spaces; no greedy operation in the action space is needed. While we do not directly follow the work in [13], we follow the general idea.

3.1 Weighted regression for value estimation

In this section we consider the task of estimating the value function V of a stochastic policy π(·|s) when the state-action value function Q is already given. The value function can be calculated by V(s) = ∫_a π(a|s) Q(s, a) da. Yet, the integral over the action space is hard to perform for continuous actions. However, we will show how we can approximate the value function without evaluating this integral. Consider the quadratic error function

    \mathrm{Error}(\hat{V}) = \int_s \mu(s) \Big( \int_a \pi(a|s) Q(s,a)\, da - \hat{V}(s) \Big)^2 ds    (3)
                            = \int_s \mu(s) \Big( \int_a \pi(a|s) \big( Q(s,a) - \hat{V}(s) \big)\, da \Big)^2 ds,    (4)

which is used to find an approximation V̂ of the value function. μ(s) denotes the state distribution when following policy π(·|s). Since the square function is convex, we can use Jensen's inequality for probability density functions to derive an upper bound of Equation (4):

    \mathrm{Error}(\hat{V}) \le \int_s \mu(s) \int_a \pi(a|s) \big( Q(s,a) - \hat{V}(s) \big)^2\, da\, ds = \mathrm{Error}_B(\hat{V}).    (5)

The solution V̂* minimizing the upper bound Error_B(V̂) is the same as for the original error function Error(V̂).

Proof. To see this, we expand the square and replace the term ∫_a π(a|s) Q(s, a) da by the value function V(s). This is done for the error function Error(V̂) and for the upper bound Error_B(V̂):

    \mathrm{Error}(\hat{V}) = \int_s \mu(s) \big( V(s) - \hat{V}(s) \big)^2 ds = \int_s \mu(s) \big( V(s)^2 - 2 V(s)\hat{V}(s) + \hat{V}(s)^2 \big)\, ds    (6)
    \mathrm{Error}_B(\hat{V}) = \int_s \mu(s) \int_a \pi(a|s) \big( Q(s,a)^2 - 2 Q(s,a)\hat{V}(s) + \hat{V}(s)^2 \big)\, da\, ds    (7)
                              = \int_s \mu(s) \Big( \int_a \pi(a|s) Q(s,a)^2\, da - 2 V(s)\hat{V}(s) + \hat{V}(s)^2 \Big)\, ds    (8)

Both error functions are the same except for an additive constant which does not depend on V̂.

In contrast to the original error function, the upper bound Error_B can be approximated straightforwardly from samples {(s_i, a_i), Q(s_i, a_i)}_{1≤i≤N} gained by following some behavior policy π_b(·|s):

    \mathrm{Error}_B(\hat{V}) \approx \sum_{i=1}^{N} \frac{\mu(s_i)\,\pi(a_i|s_i)}{\mu_b(s_i)\,\pi_b(a_i|s_i)} \big( Q(s_i, a_i) - \hat{V}(s_i) \big)^2,    (9)

where μ_b(s) denotes the state distribution when following the behavior policy π_b. The term 1/(μ_b(s_i) π_b(a_i|s_i)) ensures that we do not give more weight to states and actions preferred by π_b; this is a well-known technique from importance sampling. In order to keep our algorithm tractable, the factors π_b(a_i|s_i), μ_b(s_i) and μ(s_i) are all set to 1/N. The minimization of Equation (9) defines a weighted regression problem which is given by the dataset D_V, the weighting U and the weighted regression procedure WeightedRegress:

    D_V = \{ [(s_i, a_i), Q(s_i, a_i)] \}_{1 \le i \le N}, \quad U = \{ [\pi(a_i|s_i)]_{1 \le i \le N} \}, \quad \hat{V} = \mathrm{WeightedRegress}(D_V, U).    (10)
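For a concrete picture of the WeightedRegress step in Equation (10), the sketch below solves the weighted least-squares problem of Equation (9) in closed form for a value function that is linear in some feature map; the feature map, the helper name and the small ridge term are our own illustrative choices, not part of the paper.

```python
import numpy as np

def weighted_regress(states, q_values, weights,
                     features=lambda s: np.hstack([np.ones((len(s), 1)), s])):
    """Eq. (10): fit V_hat by minimizing sum_i u_i (Q(s_i, a_i) - V_hat(s_i))^2."""
    phi = features(states)                         # design matrix, one row per state
    w = np.asarray(weights)
    # closed-form solution of the weighted least-squares problem (tiny ridge for stability)
    gram = phi.T @ (w[:, None] * phi) + 1e-8 * np.eye(phi.shape[1])
    beta = np.linalg.solve(gram, phi.T @ (w * q_values))
    return lambda s: features(s) @ beta            # V_hat as a callable on batches of states
```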
Algorithm 1 FQI with Advantage Weighted Regression

    Input: H = { ⟨s_i, a_i, r_i, s'_i⟩ }_{1≤i≤N}, τ and L (number of iterations)
    Initialize V̂_0(s) = 0.
    for k = 0 to L − 1 do
        D_k(V̂_k) = { [(s_i, a_i), r_i + γ V̂_k(s'_i)] }_{1≤i≤N}
        Q_{k+1} = Regress(D_k(V̂_k))
        A(i) = Q_{k+1}(s_i, a_i) − V̂_k(s_i)
        Estimate m_A(s_i) and σ_A(s_i) for 1 ≤ i ≤ N
        U = { [exp(τ (A(i) − m_A(s_i)) / σ_A(s_i))]_{1≤i≤N} }
        V̂_{k+1} = WeightedRegress(D_k(V̂_k), U)
    end for

The result shows that in order to approximate the value function V(s), we do not need to carry out the expensive integration over the action space for each state s_i. It is sufficient to know the Q-values at a finite set of state-action pairs.

3.2 Soft-greedy policy improvement

We use a soft-max policy [1] in the policy improvement step of the FQI algorithm. Our soft-max policy π_1(a|s) is based on the advantage function A(s, a) = Q(s, a) − V(s). We additionally assume knowledge of the mean m_A(s) and the standard deviation σ_A(s) of the advantage function at state s. These quantities can be estimated locally or approximated by additional regressions. The policy π_1(a|s) is defined as

    \pi_1(a|s) = \frac{\exp(\tau \bar{A}(s,a))}{\int_a \exp(\tau \bar{A}(s,a))\, da}, \qquad \bar{A}(s,a) = \frac{A(s,a) - m_A(s)}{\sigma_A(s)}.    (11)

τ controls the greediness of the policy. If we assume that the advantages A(s, a) are distributed as N(A(s, a) | m_A(s), σ²_A(s)), all normalized advantage values Ā(s, a) have the same distribution. Thus, the denominator of π_1 is constant for all states and we can use the term exp(τ Ā(s, a)) ∝ π_1(a|s) directly as the weighting for the regression defined in Equation (10). The resulting approximated value function V̂(s) is used to replace the greedy operator V(s'_i) = max_{a'} Q(s'_i, a') in the FQI algorithm. The FQI by Advantage Weighted Regression (AWR) algorithm is given in Algorithm 1. As we can see, the Q-function Q_k is only queried once for each step in the history H. Furthermore, only already-seen state-action pairs (s_i, a_i) are used for this query.

After the FQI algorithm has finished, we still need to determine a policy for subsequent data collection. The policy can be obtained in the same way as for reward-weighted regression [13], only the advantage is used instead of the reward for the weighting; thus, we are optimizing the long-term cost instead of the immediate one.
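Algorithm 1 translates almost line by line into code. The following schematic sketch assumes generic regress and weighted_regress procedures (such as the ones sketched above) and a local_stats routine returning the local mean and standard deviation of the advantages; these names are placeholders rather than the authors' implementation.

```python
import numpy as np

def fqi_awr(s, a, r, s_next, regress, weighted_regress, local_stats,
            tau=4.0, gamma=0.99, iterations=50):
    """Algorithm 1 (FQI with Advantage Weighted Regression), schematic form."""
    v_hat = lambda states: np.zeros(len(states))           # V_hat_0 = 0
    X = np.hstack([s, a])
    for _ in range(iterations):
        targets = r + gamma * v_hat(s_next)                # targets r_i + gamma * V_hat_k(s'_i)
        q_hat = regress(X, targets)                        # Q_{k+1} = Regress(D_k)
        advantages = q_hat(X) - v_hat(s)                   # A(i) = Q_{k+1}(s_i, a_i) - V_hat_k(s_i)
        m_a, sigma_a = local_stats(s, advantages)          # local advantage statistics
        weights = np.exp(tau * (advantages - m_a) / sigma_a)   # soft-greedy weights
        v_hat = weighted_regress(s, targets, weights)      # V_hat_{k+1} = WeightedRegress(D_k, U)
    return q_hat, v_hat
```

Note that, as stated above, the Q-model is only evaluated at the state-action pairs contained in the history; no search over the action space appears anywhere in the loop.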
4 Locally-Advantage-WEighted Regression (LAWER)

Based on the FQI by AWR algorithm, we propose a new, computationally efficient fitted Q-iteration algorithm which uses Locally Weighted Regression (LWR, [14]) as function approximator. Similar to kernel-based methods, our algorithm needs to be able to calculate the similarity w_i(s) between a state s_i in the dataset H and a state s. To simplify the notation, we will denote w_i(s_j) as w_ij for all s_j ∈ H. w_i(s) is calculated by a Gaussian kernel w_i(s) = exp(−(s_i − s)^T D (s_i − s)), where the diagonal matrix D determines the bandwidth of the kernel. Additionally, our algorithm needs a similarity measure w^a_ij between two actions a_i and a_j. Again, w^a_ij can be calculated by a Gaussian kernel w^a_ij = exp(−(a_i − a_j)^T D_a (a_i − a_j)).

Using the state similarity w_ij, we can estimate the mean and the standard deviation of the advantage function for each state s_i:

    m_A(s_i) = \frac{\sum_j w_{ij} A(j)}{\sum_j w_{ij}}, \qquad \sigma^2_A(s_i) = \frac{\sum_j w_{ij} \big( A(j) - m_A(s_j) \big)^2}{\sum_j w_{ij}}.    (12)
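A short sketch of these kernel similarities and of the local advantage statistics of Equation (12) is given below; the bandwidth matrices are passed as vectors of diagonal entries, mirroring the diagonal D and D_a above, and the function names are ours.

```python
import numpy as np

def gaussian_similarity(x_i, x_all, bandwidth_diag):
    """w_i(x) = exp(-(x_i - x)^T D (x_i - x)) for a diagonal bandwidth matrix D."""
    diff = x_all - x_i
    return np.exp(-np.sum(diff * bandwidth_diag * diff, axis=1))

def local_advantage_stats(states, advantages, bandwidth_diag):
    """Eq. (12): kernel-weighted mean and standard deviation of the advantage at each s_i."""
    n = len(states)
    m_a = np.empty(n)
    sigma_a = np.empty(n)
    for i in range(n):                          # first pass: local means m_A(s_i)
        w = gaussian_similarity(states[i], states, bandwidth_diag)
        m_a[i] = np.sum(w * advantages) / np.sum(w)
    for i in range(n):                          # second pass: local standard deviations
        w = gaussian_similarity(states[i], states, bandwidth_diag)
        sigma_a[i] = np.sqrt(np.sum(w * (advantages - m_a) ** 2) / np.sum(w))
    return m_a, sigma_a
```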
4.1 Approximating the value functions

For the approximation of the Q-function, we use Locally Weighted Regression [14]. The Q-function is therefore given by

    Q_{k+1}(s, a) = \tilde{s}_A^T \big( S_A^T W S_A \big)^{-1} S_A^T W \mathbf{Q}_{k+1},    (13)

where s̃_A = [1, s^T, a^T]^T, S_A = [s̃_A(1), s̃_A(2), ..., s̃_A(N)]^T is the state-action matrix, W = diag(w_i(s) w^a_i(a)) is the local weighting matrix consisting of state and action similarities, and Q_{k+1} = [Q̃_{k+1}(1), Q̃_{k+1}(2), ..., Q̃_{k+1}(N)]^T is the vector of the Q-values (see Equation (1)). For approximating the V-function, we can multiplicatively combine the advantage-based weighting u_i = exp(τ Ā(s_i, a_i)) and the state similarity weights w_i(s). The value V_{k+1}(s) is given by

    V_{k+1}(s) = \tilde{s}^T \big( S^T U S \big)^{-1} S^T U \mathbf{Q}_{k+1},    (14)

where s̃ = [1, s^T]^T, S = [s̃_1, s̃_2, ..., s̃_N]^T is the state matrix and U = diag(w_i(s) u_i) is the weight matrix. (In practice, ridge regression, V_{k+1}(s) = s̃^T(S^T W S + σI)^{-1} S^T W Q_{k+1}, is used to avoid numerical instabilities in the regression.) We bound the estimate of V̂_{k+1}(s) by max_{i | w_i(s) > 0.001} Q_{k+1}(i) in order to prevent the local regression from adding a positive bias which might cause divergence of the value iteration.

A problem with nonparametric value function approximators is their strongly increasing computational complexity with an increasing number of data points. A simple solution to avoid this problem is to introduce a local forgetting mechanism: whenever parts of the state space are oversampled, old examples in this area are removed from the dataset.

4.2 Approximating the policy

Similar to reward-weighted regression [13], we use a stochastic policy π(a|s) = N(a | μ(s), diag(σ²(s))) with Gaussian exploration as approximation of the optimal policy. The mean μ(s) and the variance σ²(s) are given by

    \mu(s) = \tilde{s}^T \big( S^T U S \big)^{-1} S^T U \mathbf{A}, \qquad \sigma^2(s) = \frac{\sigma^2_{\mathrm{init}} \alpha_0 + \sum_i w_i(s)\, u_i \big( a_i - \mu(s_i) \big)^2}{\alpha_0 + \sum_i w_i(s)\, u_i},    (15)

where A = [a_1, a_2, ..., a_N]^T denotes the action matrix. The variance σ² automatically adapts the exploration of the policy to the uncertainty of the optimal action. With σ²_init and α_0 we can set the initial exploration of the policy. σ_init is always set to the bandwidth of the action space; α_0 sets the weight of the initial variance in comparison to the variance coming from the data and is set to 3 for all experiments.
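The locally weighted estimates of Equations (13) and (14) reduce to solving a small, ridge-regularized (as noted above) linear system per query point. The sketch below illustrates this for the Q- and V-estimates; the variable names, the ridge constant and the handling of the upper bound are illustrative assumptions of ours.

```python
import numpy as np

def lwr_predict(query_features, design, weights, targets, ridge=1e-6):
    """Locally weighted ridge regression: x^T (X^T W X + ridge*I)^-1 X^T W y."""
    wx = weights[:, None] * design
    gram = design.T @ wx + ridge * np.eye(design.shape[1])
    beta = np.linalg.solve(gram, wx.T @ targets)
    return query_features @ beta

def lawer_q(s_query, a_query, states, actions, q_targets, d_state, d_action):
    """Eq. (13): LWR estimate of Q(s, a) using state and action kernel weights."""
    w = np.exp(-np.sum((states - s_query) * d_state * (states - s_query), axis=1))
    w_a = np.exp(-np.sum((actions - a_query) * d_action * (actions - a_query), axis=1))
    design = np.hstack([np.ones((len(states), 1)), states, actions])
    query = np.concatenate([[1.0], s_query, a_query])
    return lwr_predict(query, design, w * w_a, q_targets)

def lawer_v(s_query, states, q_targets, u_weights, d_state):
    """Eq. (14): LWR estimate of V(s), weighted by state similarity times exp(tau * A_bar)."""
    w = np.exp(-np.sum((states - s_query) * d_state * (states - s_query), axis=1))
    design = np.hstack([np.ones((len(states), 1)), states])
    query = np.concatenate([[1.0], s_query])
    v = lwr_predict(query, design, w * u_weights, q_targets)
    near = w > 0.001                     # bound by nearby targets to avoid an optimistic bias
    return min(v, q_targets[near].max()) if near.any() else v
```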
5 Evaluations

We evaluated the LAWER algorithm on three benchmark tasks: the pendulum swing-up task, the acrobot swing-up task and a dynamic version of the puddle-world (i.e., augmenting the puddle-world by velocities, inertia, etc.) with 2 and 3 dimensions. We compare our algorithm to tree-based FQI [8] (CE-Tree), neural FQI [6] (CE-Net) and LWR-based FQI (CE-LWR), which all use Cross-Entropy (CE) optimization to find the maximum Q-values. For the CE optimization we used n_CE = 10 samples for one-dimensional, n_CE = 25 samples for 2-dimensional and n_CE = 64 samples for 3-dimensional control variables. e_CE was always set to 0.3 n_CE and we used k_CE = 3 iterations. To enforce exploration when collecting new data, Gaussian noise ε = N(0, 1.0) was added to the CE-based policy. For the tree-based algorithm, an ensemble of M = 20 trees was used, K was set to the number of state and action variables and n_min was set to 2 (see [8]). For the CE-Net algorithm we used a neural network with 2 hidden layers and 10 neurons per layer and trained the network with the algorithm proposed in [6] for 600 epochs. For all experiments, a discount factor of γ = 0.99 was used. The immediate reward function was quadratic in the distance to the goal position x_G and in the applied torque/force, r = −c_1 (x − x_G)² − c_2 a². For evaluating the learning process, the exploration-free (i.e., σ(s) = 0, ε = 0) performance of the policy was evaluated after each data-collection/FQI cycle. This was done by determining the accumulated reward during an episode starting from the specified initial position. All error bars represent a 95% confidence interval.

[Figure 1 plots omitted: average reward vs. number of data collections for LAWER, CE-Tree, CE-LWR and CE-Net (panels a, b); torque u [N] vs. time [s] for LAWER, CE-Tree and CE-LWR (panels c, d).]

Figure 1: (a) Evaluation of LAWER and CE-based FQI algorithms on the pendulum swing-up task for c_2 = 0.005. The plots are averaged over 10 trials. (b) The same evaluation for c_2 = 0.025. (c) Learned torque trajectories for c_2 = 0.005. (d) Learned torque trajectories for c_2 = 0.025.

5.1 Pendulum swing-up task

In this task, a pendulum needs to be swung up from the position at the bottom to the top position [6]. The state space consists of the angular deviation θ from the top position and the angular velocity θ̇ of the pendulum. The system dynamics are given by 0.5 m l² θ̈ = m g sin(θ) + u; the torque of the motor u was limited to [−5 N, 5 N]. The mass was set to m = 1 kg and the length of the link to 1 m. The time step was set to 0.05 s. Two experiments with different torque punishments, c_2 = 0.005 and c_2 = 0.025, were performed. We used L = 150 iterations. The matrices D and D_A were set to D = diag(30, 3) and D_A = diag(2). In the data collection phase, 5 episodes with 150 steps were collected starting from the bottom position and 5 episodes starting from a random position.

A comparison of the LAWER algorithm to the CE-based algorithms for c_2 = 0.005 is shown in Figure 1(a) and for c_2 = 0.025 in Figure 1(b). Our algorithm shows a performance comparable to the tree-based FQI algorithm while being computationally much more efficient. All other CE-based FQI algorithms show a slightly decreased performance. In Figure 1(c) and (d) we can see typical examples of learned torque trajectories when starting from the bottom position for the LAWER, the CE-Tree and the CE-LWR algorithm. In Figure 1(c) the trajectories are shown for c_2 = 0.005 and in Figure 1(d) for c_2 = 0.025. All algorithms were able to discover a fast solution with one swing-up for the first setting and a more energy-efficient solution with two swing-ups for the second setting. Still, there are qualitative differences in the trajectories: due to the advantage-weighted regression, LAWER produced very smooth trajectories, while the trajectories found by the CE-based methods look more jerky. In Figure 2(a) we can see the influence of the parameter τ on the performance of the LAWER algorithm; the algorithm works for a large range of τ values.
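For reference, the pendulum dynamics and reward described above can be simulated with a few lines of explicit Euler integration. This is only a sketch under our own assumptions: the paper specifies the dynamics, the time step, the torque limit and c_2, while the integration scheme, the gravity constant and the value of c_1 below are placeholders.

```python
import numpy as np

# Pendulum parameters from Section 5.1 (C1 is not given in the paper; 1.0 is an assumed placeholder)
M, LENGTH, G, DT = 1.0, 1.0, 9.81, 0.05
U_MAX = 5.0
C1, C2 = 1.0, 0.005

def pendulum_step(theta, theta_dot, u):
    """One Euler step of 0.5*m*l^2 * theta_dd = m*g*sin(theta) + u, with clipped torque."""
    u = np.clip(u, -U_MAX, U_MAX)
    theta_dd = (M * G * np.sin(theta) + u) / (0.5 * M * LENGTH**2)
    theta_dot = theta_dot + DT * theta_dd
    theta = theta + DT * theta_dot
    reward = -C1 * theta**2 - C2 * u**2      # quadratic cost on deviation from the goal and on torque
    return theta, theta_dot, reward
```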
5.2 Acrobot swing-up task

In order to assess the performance of LAWER on a complex, highly non-linear control task, we used the acrobot (for a description of the system, see [1]). The torque was limited to [−5 N, 5 N]. Both masses were set to 1 kg and both link lengths to 0.5 m. A time step of 0.1 s was used. L = 100 iterations were used for the FQI algorithms. In the data-collection phase the agent could observe 25 episodes starting from the bottom position and 25 starting from a random position. Each episode had 100 steps. The matrices D and D_A were set to D = diag(20, 23.6, 10, 10.5) and D_A = diag(2). The comparison of the LAWER and the CE-Tree algorithm is shown in Figure 2(b). Due to the adaptive state discretization, the tree-based algorithm is able to learn faster, but in the end the LAWER algorithm is able to produce policies of higher quality than the tree-based algorithm.

5.3 Dynamic puddle-world

In the puddle-world task [7], the agent has to find a way to a predefined goal area in a continuous-valued maze world (see Figure 2(c)). The agent gets negative reward when going through puddles. In contrast to the standard puddle-world setting, where the agent has a 2-dimensional state space (the x and y position), we use a more demanding setting: we have created a dynamic version of the puddle-world where the agent can set a force accelerating a k-dimensional point mass (m = 1 kg).

[Figure 2 plots omitted: average reward over a learning trial vs. τ (panel a); average reward vs. number of data collections for LAWER and CE-Tree (panel b); start, goal and puddle layout of the 2-dimensional world (panel c).]

Figure 2: (a) Evaluation of the average reward gained over a whole learning trial on the pendulum swing-up task for different settings of τ. (b) Comparison of the LAWER and the CE-Tree algorithm on the acrobot swing-up task. (c) Setting of the 2-dimensional dynamic puddle-world.

[Figure 3 plots omitted: average reward vs. number of data collections for LAWER and CE-Tree (panels a, b); force trajectories u_1, u_2, u_3 [N] vs. time [s] (panels c, d).]

Figure 3: (a) Comparison of the CE-Tree and the LAWER algorithm for the 2-dimensional dynamic puddle-world. (b) Comparison of the CE-Tree and the LAWER algorithm for the 3-dimensional dynamic puddle-world. (c) Torque trajectories for the 3-dimensional puddle-world learned with the LAWER algorithm. (d) Torque trajectories learned with the CE-Tree algorithm.

This was done for k = 2 and k = 3 dimensions. The puddle-world illustrates the scalability of the algorithms to multi-dimensional continuous action spaces (2 respectively 3 dimensional). The positions were limited to [0, 1] and the velocities to [−1, 1]. The maximum force that could be applied in one direction was restricted to 2 N and the time step was set to 0.1 s. The setting of the 2-dimensional puddle-world can be seen in Figure 2(c). Whenever the agent was about to leave the predefined area, the velocities were set to zero and an additional reward of −5 was given. We compared the LAWER with the CE-Tree algorithm. L = 50 iterations were used. The matrices D and D_A were set to D = diag(10, 10, 2.5, 2.5) and D_A = diag(2.5, 2.5) for the 2-dimensional and to D = diag(8, 8, 8, 2, 2, 2) and D_A = diag(1, 1, 1) for the 3-dimensional puddle-world. In the data collection phase the agent could observe 20 episodes with 50 steps starting from the predefined initial position and 20 episodes starting from a random position.

In Figure 3(a), we can see the comparison of the CE-Tree and the LAWER algorithm for the 2-dimensional puddle-world and in Figure 3(b) for the 3-dimensional puddle-world. The results show that the tree-based algorithm has an advantage in the beginning of the learning process. However, the CE-Tree algorithm has problems finding a good policy in the 3-dimensional action space, while the LAWER algorithm still performs well in this setting.
This can be seen clearly in the comparison\nof the learned force trajectories which are shown in Figure 3(c) for the LAWER algorithm and in\nFigure 3(d) for the CE-Tree algorithm. The trajectories for the CE-Tree algorithm are very jerky\nand almost random for the \ufb01rst and third dimension of the control variable, whereas the trajectories\nfound by the LAWER algorithm look very smooth and goal directed.\n\n6 Conclusion and future work\n\nIn this paper, we focused on solving RL problems with continuous action spaces with \ufb01tted Q-\niteration based algorithms. The computational complexity of the max operator maxa Q(s, a) often\nmakes FQI algorithms intractable for high dimensional continuous action spaces. We proposed a\n\n\fnew method which circumvents the max operator by the use of a stochastic soft-max policy that\nallows us to reduce the policy improvement step V (s) = maxa Q(s, a) to a weighted regression\nproblem. Based on this result, we can derive the LAWER algorithm, a new, computationally ef\ufb01cient\nFQI algorithm based on LWR.\n\nExperiments have shown that the LAWER algorithm is able to produce high quality smooth policies,\neven for high dimensional action spaces where the use of expensive optimization methods for calcu-\nlating maxa Q(s, a) becomes problematic and only quite suboptimal policies are found. Moreover,\nthe computational costs of using continuous actions for standard FQI are daunting. The LAWER\nalgorithm needed on average 2780s for the pendulum, 17600s for the acrobot, 13700s for the 2D-\npuddle-world and 24200s for the 3D-puddle world benchmark task. The CE-Tree algorithm needed\non average 59900s, 201900s, 134400s and 212000s, which is an order of magnitude slower than the\nLAWER algorithm. The CE-Net and CE-LWR algorithm showed comparable running times as the\nCE-Tree algorithm. A lot of work has been spent to optimize the implementations of the algorithms.\nThe simulations were run on a P4 Xeon with 3.2 gigahertz.\n\nStill, in comparison to the tree-based FQI approach, our algorithm has handicaps when dealing with\nhigh dimensional state spaces. The distance kernel matrices have to be chosen appropriately by\nthe user. Additionally, the uniform distance measure throughout the state space is not adequate for\nmany complex control tasks and might degrade the performance. Future research will concentrate\non combining the AWR approach with the regression trees presented in [8].\n\n7 Acknowledgement\n\nThis paper was partially funded by the Austrian Science Fund FWF project # P17229. The \ufb01rst\nauthor also wants to thank Bernhard Sch\u00f6lkopf and the MPI for Biological Cybernetics in T\u00fcbingen\nfor the academic internship which made this work possible.\n\nReferences\n\n[1] R. Sutton and A. Barto, Reinforcement Learning. Boston, MA: MIT Press, 1998.\n[2] J. A. Boyan and A. W. Moore, \u201cGeneralization in reinforcement learning: Safely approximating the value\n\nfunction,\u201d in Advances in Neural Information Processing Systems 7, pp. 369\u2013376, MIT Press, 1995.\n\n[3] P. Viviani and T. Flash, \u201cMinimum-jerk, two-thirds power law, and isochrony: Converging approaches to\nmovement planning,\u201d Journal of Experimental Psychology: Human Perception and Performance, vol. 21,\nno. 1, pp. 32\u201353, 1995.\n\n[4] R. M. Alexander, \u201cA minimum energy cost hypothesis for human arm trajectories,\u201d Biological Cybernet-\n\nics, vol. 76, pp. 97\u2013105, 1997.\n\n[5] C. M. Harris and D. M. 
Wolpert, \u201cSignal-dependent noise determines motor planning.,\u201d Nature, vol. 394,\n\npp. 780\u2013784, August 1998.\n\n[6] M. Riedmiller, \u201cNeural \ufb01tted Q-iteration - \ufb01rst experiences with a data ef\ufb01cient neural reinforcement\n\nlearning method,\u201d in Proceedings of the European Conference on Machine Learning (ECML), 2005.\n\n[7] R. Sutton, \u201cGeneralization in reinforcement learning: Successful examples using sparse coarse coding,\u201d\n\nin Advances in Neural Information Processing Systems 8, pp. 1038\u20131044, MIT Press, 1996.\n\n[8] D. Ernst, P. Geurts, and L. Wehenkel, \u201cTree-based batch mode reinforcement learning,\u201d J. Mach. Learn.\n\nRes., vol. 6, pp. 503\u2013556, 2005.\n\n[9] A. Antos, R. Munos, and C. Szepesvari, \u201cFitted Q-iteration in continuous action-space MDPs,\u201d in Ad-\n\nvances in Neural Information Processing Systems 20, pp. 9\u201316, Cambridge, MA: MIT Press, 2008.\n\n[10] S. Timmer and M. Riedmiller, \u201cFitted Q-iteration with CMACs,\u201d pp. 1\u20138, 2007.\n[11] J. Peters and S. Schaal, \u201cPolicy learning for motor skills,\u201d in Proceedings of 14th International Conference\n\non Neural Information Processing (ICONIP), 2007.\n\n[12] P.-T. de Boer, D. Kroese, S. Mannor, and R. Rubinstein, \u201cA tutorial on the cross-entropy method,\u201d Annals\n\nof Operations Research, vol. 134, pp. 19\u201367, January 2005.\n\n[13] J. Peters and S. Schaal, \u201cReinforcement learning by reward-weighted regression for operational space\n\ncontrol,\u201d in Proceedings of the International Conference on Machine Learning (ICML), 2007.\n\n[14] C. G. Atkeson, A. W. Moore, and S. Schaal, \u201cLocally weighted learning,\u201d Arti\ufb01cial Intelligence Review,\n\nvol. 11, no. 1-5, pp. 11\u201373, 1997.\n\n\f", "award": [], "sourceid": 688, "authors": [{"given_name": "Gerhard", "family_name": "Neumann", "institution": null}, {"given_name": "Jan", "family_name": "Peters", "institution": null}]}