{"title": "Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1643, "page_last": 1650, "abstract": null, "full_text": "Nonparametric Representation of Policies and\nValue Functions: A Trajectory-Based Approach\n\nChristopher G. Atkeson \nRobotics Institute and HCII\nCarnegie Mellon University\nPittsburgh, PA 15213, USA\n\ncga@cmu.edu\n\nATR Human Information Science Laboratories, Dept. 3\n\nJun Morimoto\n\nKeihanna Science City\nKyoto 619-0288, Japan\nxmorimo@atr.co.jp\n\nAbstract\n\nA longstanding goal of reinforcement learning is to develop non-\nparametric representations of policies and value functions that support\nrapid learning without suffering from interference or the curse of di-\nmensionality. We have developed a trajectory-based approach, in which\npolicies and value functions are represented nonparametrically along tra-\njectories. These trajectories, policies, and value functions are updated as\nthe value function becomes more accurate or as a model of the task is up-\ndated. We have applied this approach to periodic tasks such as hopping\nand walking, which required handling discount factors and discontinu-\nities in the task dynamics, and using function approximation to represent\nvalue functions at discontinuities. We also describe extensions of the ap-\nproach to make the policies more robust to modeling error and sensor\nnoise.\n\n1 Introduction\n\nThe widespread application of reinforcement learning is hindered by excessive cost in terms\nof one or more of representational resources, computation time, or amount of training data.\nThe goal of our research program is to minimize these costs. We reduce the amount of train-\ning data needed by learning models, and using a DYNA-like approach to do mental practice\nin addition to actually attempting a task [1, 2]. 
This paper addresses concerns about computation time and representational resources. We reduce the computation time required by using more powerful updates that update first and second derivatives of value functions and first derivatives of policies, in addition to updating value function and policy values at particular points [3, 4, 5]. We reduce the representational resources needed by representing value functions and policies along carefully chosen trajectories. This non-parametric representation is well suited to the task of representing and updating value functions, providing additional representational power as needed and avoiding interference.

This paper explores how the approach can be extended to periodic tasks such as hopping and walking. Previous work has explored how to apply an early version of this approach to tasks with an explicit goal state [3, 6] and how to simultaneously learn a model and use this approach to compute a policy and value function [6]. Handling periodic tasks required accommodating discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities.

(† also affiliated with the ATR Human Information Science Laboratories, Dept. 3)

2 What is the approach?

Represent value functions and policies along trajectories. Our first key idea for creating a more global policy is to coordinate many trajectories, similar to using the method of characteristics to solve a partial differential equation. A more global value function is created by combining value functions for the trajectories. As long as the value functions are consistent between trajectories, and cover the appropriate space, the global value function created will be correct.
This representation supports accurate updating, since updates occur along densely represented optimized trajectories, and provides an adaptive resolution representation that allocates resources to where optimal trajectories tend to go.

Segment trajectories at discontinuities. A second key idea is to segment the trajectories at discontinuities of the system dynamics, to reduce the amount of discontinuity in the value function within each segment, so our extrapolation operations are correct more often. We assume smooth dynamics and criteria, so that first and second derivatives exist. Unfortunately, in periodic tasks such as hopping or walking the dynamics changes discontinuously as feet touch and leave the ground. The locations in state space at which this happens can be localized to lower dimensional surfaces that separate regions of smooth dynamics. For periodic tasks we apply our approach along trajectory segments which end whenever a dynamics (or criterion) discontinuity is reached. We also search for value function discontinuities not collocated with dynamics or criterion discontinuities. We can use all the trajectory segments that start at the discontinuity and continue through the next region to provide estimates of the value function at the other side of the discontinuity.

Use function approximation to represent the value function at discontinuities. We use locally weighted regression (LWR) to construct value functions at discontinuities [7].

Update first and second derivatives of the value function as well as first derivatives of the policy (control gains for a linear controller) along the trajectory. We can think of this as updating the first few terms of local Taylor series models of the global value and policy functions.
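To make the LWR step concrete, here is a minimal sketch that estimates the value function at a query state from sampled values of trajectory segments near a discontinuity. The Gaussian kernel, the bandwidth h, and the one-dimensional example data are assumptions of this sketch, not the settings used in the experiments (see [7] for the actual method).

```python
import numpy as np

def lwr_value(query, X, V, h=0.5):
    """Locally weighted linear regression estimate of the value function
    at `query`, from sampled states X (n x d) with values V (n,).
    The Gaussian kernel and bandwidth h are illustrative choices."""
    d = X - query                                      # offsets from the query
    w = np.exp(-np.sum(d**2, axis=1) / (2.0 * h**2))   # Gaussian weights
    A = np.hstack([np.ones((len(X), 1)), d])           # local linear basis
    WA = A * w[:, None]
    # Solve the weighted normal equations (A^T W A) beta = A^T W V
    beta, *_ = np.linalg.lstsq(WA.T @ A, WA.T @ V, rcond=None)
    return beta[0]                                     # intercept = value at query

# Example: samples of a hypothetical V(x) = x^2 along a 1-D boundary surface
X = np.linspace(-2, 2, 41)[:, None]
V = X[:, 0]**2
estimate = lwr_value(np.array([1.0]), X, V, h=0.3)     # close to 1.0
```

A locally weighted linear fit returns both a value (the intercept) and a gradient (the slope coefficients beta[1:]), which matches the need for value function derivatives at segment boundaries.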
This non-parametric representation is well suited to the task of representing and updating value functions, providing additional representational power as needed and avoiding interference.

We will derive the update rules. Because we are interested in periodic tasks, we must introduce a discount factor into Bellman's equation, so value functions remain finite. Consider a system with dynamics x_{k+1} = f(x_k, u_k) and a one step cost function L(x_k, u_k), where x is the state of the system and u is a vector of actions or controls. The subscript k serves as a time index, but will be dropped in the equations that follow in cases where all time indices are the same or are equal to k.

A goal of reinforcement learning and optimal control is to find a policy that minimizes the total cost, which is the sum of the costs for each time step. One approach to doing this is to construct an optimal value function, V(x). The value of this value function at a state x is the sum of all future costs, given that the system started in state x and followed the optimal policy (chose optimal actions at each time step as a function of the state). A local planner or controller can choose globally optimal actions if it knew the future cost of each action. This cost is simply the sum of the cost of taking the action right now and the discounted future cost of the state that the action leads to, which is given by the value function. Thus, the optimal action is given by:

u = argmin_u ( L(x, u) + γ V(f(x, u)) )

where γ is the discount factor.
Figure 1: Example trajectories where the value function and policy are explicitly represented for a regulator task at goal state G (left), a task with a point goal state G (middle), and a periodic task (right).

Suppose at a point (x_0, u_0) we have 1) a local second order Taylor series approximation of the optimal value function: V(x) ≈ V_0 + V_x Δx + (1/2) Δx^T V_xx Δx, where Δx = x - x_0; 2) a local second order Taylor series approximation of the dynamics, which can be learned using local models of the plant (f_x and f_u correspond to the usual A and B of the linear plant model used in linear quadratic regulator (LQR) design): x_{k+1} = f(x, u) ≈ f_0 + f_x Δx + f_u Δu, where Δu = u - u_0; and 3) a local second order Taylor series approximation of the one step cost, which is often known analytically for human specified criteria (L_xx and L_uu correspond to the usual Q and R of LQR design): L(x, u) ≈ L_0 + L_x Δx + L_u Δu + (1/2) Δx^T L_xx Δx + Δx^T L_xu Δu + (1/2) Δu^T L_uu Δu.

Given a trajectory, one can integrate the value function and its first and second spatial derivatives backwards in time to compute an improved value function and policy.
The backward sweep takes the following form (in discrete time):

Q_x = L_x + γ V'_x f_x    (1)
Q_u = L_u + γ V'_x f_u    (2)
Q_xx = L_xx + γ f_x^T V'_xx f_x,   Q_ux = L_xu^T + γ f_u^T V'_xx f_x,   Q_uu = L_uu + γ f_u^T V'_xx f_u    (3)
V_x = Q_x - Q_u Q_uu^-1 Q_ux    (4)
V_xx = Q_xx - Q_ux^T Q_uu^-1 Q_ux    (5)

where V' is the value function at the next point along the trajectory, and the Q terms are the local derivatives of the cost of taking action u in state x and acting optimally thereafter. After the backward sweep, forward integration can be used to update the trajectory itself:

u_k^new = u_k - Q_uu^-1 Q_u - Q_uu^-1 Q_ux (x_k^new - x_k),   x_{k+1}^new = f(x_k^new, u_k^new)

In order to use this approach we have to assume smooth dynamics and criteria, so that first and second derivatives exist. Unfortunately, in periodic tasks such as hopping or walking the dynamics changes discontinuously as feet touch and leave the ground. The locations in state space at which this happens can be localized to lower dimensional surfaces that separate regions of smooth dynamics. For periodic tasks we apply our approach along trajectory segments which end whenever a dynamics (or criterion) discontinuity is reached. We can use all the trajectory segments that start at the discontinuity and continue through the next region to provide estimates of the value function at the other side of the discontinuity.

Figure 1 shows our approach applied to several types of problems.
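Before turning to Figure 1, the backward sweep (eqs. 1-5) and the forward integration can be made concrete with a minimal Python sketch on a discounted linear-quadratic problem. The double-integrator dynamics, the cost matrices, the horizon, and the omission of second-order dynamics terms are all illustrative assumptions of this sketch, not the models used in the paper.

```python
import numpy as np

# Illustrative double-integrator dynamics and quadratic one step cost
# (these matrices are assumptions for the sketch, not the hopper's model).
dt, gamma = 0.1, 0.95
A = np.array([[1.0, dt], [0.0, 1.0]])   # f_x of the local linear model
B = np.array([[0.0], [dt]])             # f_u
Q = np.eye(2)                           # so L_x = 2 Q x, L_xx = 2 Q
R = 0.1 * np.eye(1)                     # so L_u = 2 R u, L_uu = 2 R

def rollout(x0, us):
    """Integrate the dynamics forward from x0 under the controls us."""
    xs = [x0]
    for u in us:
        xs.append(A @ xs[-1] + B @ u)
    return xs

def sweep(xs, us):
    """One backward sweep (eqs. 1-5) and forward pass along a trajectory."""
    N = len(us)
    Vx, Vxx = np.zeros(2), np.zeros((2, 2))      # no terminal cost
    ks, Ks = [None] * N, [None] * N
    for k in reversed(range(N)):
        Qx = 2 * Q @ xs[k] + gamma * A.T @ Vx    # eq. (1)
        Qu = 2 * R @ us[k] + gamma * B.T @ Vx    # eq. (2)
        Qxx = 2 * Q + gamma * A.T @ Vxx @ A      # eq. (3)
        Qux = gamma * B.T @ Vxx @ A
        Quu = 2 * R + gamma * B.T @ Vxx @ B
        Qi = np.linalg.inv(Quu)
        ks[k], Ks[k] = Qi @ Qu, Qi @ Qux         # open-loop and gain terms
        Vx = Qx - Qux.T @ Qi @ Qu                # eq. (4)
        Vxx = Qxx - Qux.T @ Qi @ Qux             # eq. (5)
    x = xs[0]
    new_xs, new_us = [x], []
    for k in range(N):                           # forward integration
        u = us[k] - ks[k] - Ks[k] @ (x - xs[k])
        x = A @ x + B @ u
        new_us.append(u)
        new_xs.append(x)
    return new_xs, new_us

def cost(xs, us):
    """Discounted sum of the quadratic one step costs along the trajectory."""
    return sum(gamma**k * (x @ Q @ x + u @ R @ u)
               for k, (x, u) in enumerate(zip(xs, us)))

us = [np.zeros(1)] * 20
xs = rollout(np.array([1.0, 0.0]), us)
for _ in range(3):
    xs, us = sweep(xs, us)
```

For a linear-quadratic problem like this one a single sweep essentially reaches the optimum; on nonlinear problems the local models are re-fit around the new trajectory and the sweep is repeated.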
On the left we see that a task that requires steady state control about a goal point (a regulator task) can be solved with a single trivial trajectory that starts and ends at the goal and provides a value function and constant linear policy u = -K Δx in the vicinity of the goal.

Figure 2: The optimal hopper controller with a range of penalties on u usage.
The middle figure of Figure 1 shows the trajectories used to compute the value function for a swing up problem [3]. In this problem the goal requires regulation about the state where the pendulum is inverted and in an unstable equilibrium. However, the nonlinearities of the problem limit the region of applicability of a linear policy, and non-trivial trajectories have to be created to cover a larger region. In this case the region where the value function is less than a target value is filled with trajectories. The neighboring trajectories have consistent value functions, and thus the globally optimal value function and policy are found in the explored region [3].

The right figure of Figure 1 shows the trajectories used to compute the value function for a periodic problem, control of vertical hopping in a hopping robot. In this problem, there is no goal state, but a desired hopping height is specified. This problem has been extensively studied in the robotics literature [8] from the point of view of how to manually design a nonlinear controller with a large stability region. We note that optimal control provides a methodology to design nonlinear controllers with large stability regions and also good performance in terms of explicitly specified criteria. We describe later how to also make these controller designs more robust.

In this figure the vertical axis corresponds to the height of the hopper, and the horizontal axis is vertical velocity.
The robot moves around the origin in a counterclockwise direction. In the top two quadrants the robot is in the air, and in the bottom two quadrants the robot is on the ground. Thus, the horizontal axis is a discontinuity of the robot dynamics, and trajectory segments end and often begin at the discontinuity. We see that while the robot is in the air it cannot change how much energy it has (how high it goes or how fast it is going when it hits the ground), as the trajectories end with the same pattern they began with. When the robot is on the ground it thrusts with its leg to "focus" the trajectories so the set of touchdown positions is mapped to a smaller set of takeoff positions. This funneling effect is characteristic of controllers for periodic tasks, and how fast the funnel becomes narrow is controlled by the size of the penalty on u usage (Figure 2).

2.1 How are trajectory start points chosen?

In our approach trajectories are refined towards optimality given their fixed starting points. However, an initial trajectory must first be created. For regulator tasks, the trajectory is trivial and simply starts and ends at the known goal point. For tasks with a point goal, trajectories can be extended backwards away from the goal [3]. For periodic tasks, crude trajectories must be created using some other approach before this approach can refine them.

We have used several methods to provide initial trajectories. Manually designed controllers sometimes work. In learning from demonstration a teacher provides initial trajectories [6]. In policy optimization (aka "policy search") a parameterized policy is optimized [9].

Once a set of initial task trajectories is available, the following four methods are used to generate trajectories in new parts of state space.
We use all of these methods simultaneously, and locally optimize each of the trajectories produced. The best trajectory of the set is then stored and the other trajectories are discarded. 1) Use the global policy generated by policy optimization, if available. 2) Use the local policy from the nearest point with the same type of dynamics. 3) Use the local value function estimate (and derivatives) from the nearest point with the same type of dynamics. 4) Use the policy from the nearest trajectory, where the nearest trajectory is selected at the beginning of the forward sweep and kept the same throughout the sweep. Note that methods 2 and 3 can change which stored trajectories they take points from on each time step, while method 4 uses a policy from a single neighboring trajectory.

3 Control of a walking robot

As another example we will describe the search for a policy for walking of a simple planar biped robot that walks along a bar. The simulated robot has two legs and a torque motor between the legs. Instead of revolute or telescoping knees, the robot can grab the bar with its foot as its leg swings past it. This is a model of a robot that walks along the trusses of a large structure such as a bridge, much as a monkey brachiates with its arms. This simple model has also been used in studies of robot passive dynamic walking [10]. This arrangement means the robot has a five dimensional state space: left leg angle (θ_L), right leg angle (θ_R), left leg angular velocity (θ̇_L), right leg angular velocity (θ̇_R), and stance foot location. A simple policy is used to determine when to grab the bar (at the end of a step when the swing foot passes the bar going downwards). The variable to be controlled is the torque τ at the hip.

The criterion we used is quite complex.
We are a long way from specifying an abstract or vague criterion such as "cover a fixed distance with minimum fuel or battery usage" or "maximize the amount of your genes in future gene pools" and successfully finding an optimal or reasonable policy. At this stage we need to include several "shaping" terms in the criterion, that reward keeping the hips at the right altitude with minimal vertical velocity, keeping the leg amplitude within reason, maintaining a symmetric gait, and maintaining the desired hip forward velocity:

L(x, u) = w_tau τ^2 + w_h (h_hip - h_d)^2 + w_v (dh_hip/dt)^2 + w_lim p_lim + w_sym p_sym + w_vel (dx_hip/dt - v_d)^2    (6)

where the w_i are weighting factors, h_d is the desired hip altitude, and v_d is the desired hip forward velocity. The leg length is 1 meter. p_lim provides a measure of how far the left or right leg has gone past its limits in the forward or backward direction. p_sym is the product of the leg angles if the legs are both forward or both rearward, and zero otherwise. x_hip is the hip location. The integration and control time steps are 1 millisecond each.
The dynamics of this walker are simulated using a commercial package, SDFAST.

Initial trajectories were generated by optimizing the coefficients of a linear policy. When the left leg was in stance:

τ = c_0 + c_1 θ_L + c_2 θ̇_L + c_3 θ_LR + c_4 θ̇_LR    (7)

where θ_LR is the angle between the legs. When the right leg was in stance the same policy was used with the appropriate signs negated.

3.1 Results

The trajectory-based approach was able to find a cheaper and more robust policy than the parametric policy-optimization approach. This is not surprising given the flexible and expandable representational capacity of an adaptive non-parametric representation, but it does provide some indication that our update algorithms can usefully harness the additional representation power.

Cost: For example, after training the parametric policy, we measured the undiscounted cost over 1 second (roughly one step of each leg) starting in a state along the lowest cost cyclic trajectory. The cost for the optimized parametric policy was 4316.
The corresponding cost for the trajectory-based approach starting from the same state was 3502.

Robustness: We did a simple assessment of robustness by adding offsets to the same starting state until the optimized linear policy failed. The offsets were in terms of the stance leg and the angle between the legs, and the corresponding angular velocities. We did a similar test for the trajectory approach. In each direction the maximum offset the trajectory-based approach was able to handle was equal to or greater than that of the parametric policy-based approach. This is not surprising, since the trajectory-based controller uses the parametric policy as one of the ways to initially generate candidate trajectories for optimization. In cases where the trajectory-based approach is not able to generate an appropriate trajectory, the system will generate a series of trajectories with start points moving from regions it knows how to handle towards the desired start point. Thus, we have not yet discovered situations that are physically possible to recover from that the trajectory-based approach cannot handle if it is allowed as much computation time as it needs.

Interference: To demonstrate interference in the parametric policy approach, we optimized its performance from a distribution of starting states. These states were the original state, and states with positive offsets. The new cost for the original starting position was 14,747, compared to 4316 before retraining.
The trajectory approach has the same cost as before, 3502.

4 Robustness to modeling error and imperfect sensing

So far we have addressed robustness in terms of the range of initial states that can be handled. Another form of robustness is robustness to modeling error (changes in masses, friction, and other model parameters) and imperfect sensing, so that the controller does not know exactly what state the robot is in. Since simulations are used to optimize policies, it is relatively easy to include simulations with different model parameters and sensor noise in the training and optimize for a robust parametric controller in policy shaping. How does the trajectory-based approach achieve comparable robustness?

We have developed two approaches: a probabilistic approach which maintains distributional information about unknown states and parameters, and a game-based or minimax approach. The probabilistic approach supports actions by the controller to actively minimize uncertainty as well as achieve goals, which is known as dual control. The game-based approach does not reduce uncertainty with experience, and is somewhat paranoid, assuming the world is populated by evil spirits which choose the worst possible disturbance at each time step for the controller.
This results in robust, but often overly conservative policies.

In the probabilistic case, the state is augmented with any unknown parameters such as masses of parts or friction coefficients, and the covariance of all the original elements of the state as well as the added parameters. An extended Kalman filter is constructed as the new dynamics equation, predicting the new estimates of the means and covariances given the control signals to the system. The one step cost function is restated in terms of the augmented state. The value function is now a function of the augmented state, including covariances of the original state vector elements. These covariances interact with the curvature of the value function, causing additional cost in areas of the value function that have high curvature or second derivatives. Thus the system is rewarded when it moves to areas of the value function that are planar, where uncertainty has no effect on the expected cost. The system is also rewarded when it learns, which reduces the covariances of the estimates, so the system may choose actions that move away from a goal but reduce uncertainty. This probabilistic approach does dramatically increase the dimensionality of the state vector and thus the value function, but in the context of only a quadratic cost on dimensionality this is not as fatal as it would seem.

A less expensive approach is to use a game-based uncertainty model with minimax optimization. In this case, we assume an opponent can pick a disturbance to maximally increase our cost. This is closely related to robust nonlinear controller design techniques based on the idea of H-infinity control [11, 12] and risk sensitive control [13, 14].
We augment the dynamics equation with a disturbance term: x_{k+1} = f(x_k, u_k) + w_k, where w is a vector of disturbance inputs. To limit the size of the disturbances, we include the disturbance magnitude in a modified one step cost function with a negative sign. The opponent who controls the disturbance wants to increase our cost, so this new term gives an incentive to the opponent to choose the worst direction for the disturbance, and a disturbance magnitude that gives the highest ratio of increased cost to disturbance size:

L(x, u, w) = L(x, u) - w^T S w

Initially, S is set to globally approximate the uncertainty of the model. Ultimately, S should vary with the local confidence in the model. Highly practiced movements or portions of movements should have high S, and new movements should have lower S. The optimal action is now given by Isaacs' equation:

u = argmin_u max_w ( L(x, u, w) + γ V(f(x, u) + w) )

How we solve Isaacs' equation and an application of this method are described in the companion paper [15].

5 How to cover a volume of state space

In tasks with a goal or point attractor, [3] showed that certain key trajectories can be grown backwards from the goal in order to approximate the value function.
In the case of a sparse use of trajectories to cover a space, the cost of the approach is dominated by the O(d^2) cost of updating second derivative matrices, where d is the dimensionality of the state, and thus the cost of the trajectory-based approach increases quadratically as the dimensionality increases.

However, for periodic tasks the approach of growing trajectories backwards from the goal cannot be used, as there is no goal point or set. In this case the trajectories that form the optimal cycle can be used as key trajectories, with each point along them supplying a local linear policy and local quadratic value function. These key trajectories can be computed using any optimization method, and then the corresponding policy and value function estimates along the trajectory computed using the update rules given here.

It is important to point out that optimal trajectories need only be placed densely enough to separate regions which have different local optima. The trajectories used in the representation usually follow local valleys of the value function. Also, we have found that natural behavior often lies entirely on a low-dimensional manifold embedded in a high dimensional space. Using these trajectories and creating new trajectories as task demands require it, we expect to be able to handle a range of natural tasks.

6 Contributions

In order to accommodate periodic tasks, this paper has discussed how to incorporate discount factors into the trajectory-based approach, how to handle discontinuities in the dynamics (and equivalently, criteria and constraints), and how to find key trajectories for a sparse trajectory-based approach.
The trajectory-based approach requires less design skill from humans, since it doesn't need a "good" policy parameterization, and produces cheaper and more robust policies which do not suffer from interference.

References

[1] Richard S. Sutton. Integrated architectures for learning, planning and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning, 1990.

[2] C. Atkeson and J. Santamaria. A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation, 1997.

[3] Christopher G. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 663–670. Morgan Kaufmann Publishers, Inc., 1994.

[4] P. Dyer and S. R. McReynolds. The Computation and Theory of Optimal Control. Academic Press, New York, NY, 1970.

[5] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.

[6] Christopher G. Atkeson and Stefan Schaal. Robot learning from demonstration. In Proc. 14th International Conference on Machine Learning, pages 12–20. Morgan Kaufmann, 1997.

[7] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11–73, 1997.

[8] W. Schwind and D. Koditschek. Control of forward velocity for a simplified planar hopping robot. In International Conference on Robotics and Automation, volume 1, pages 691–696, 1995.

[9] J. Andrew Bagnell and Jeff Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In International Conference on Robotics and Automation, 2001.

[10] M. Garcia, A. Chatterjee, and A. Ruina. Efficiency, speed, and scaling of two-dimensional passive-dynamic walking.
Dynamics and Stability of Systems, 15(2):75–99, 2000.

[11] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, New Jersey, 1996.

[12] J. Morimoto and K. Doya. Robust reinforcement learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 1061–1067. MIT Press, Cambridge, MA, 2001.

[13] R. Neuneier and O. Mihatsch. Risk sensitive reinforcement learning. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 1031–1037. MIT Press, Cambridge, MA, USA, 1998.

[14] S. P. Coraluppi and S. I. Marcus. Risk-sensitive and minimax control of discrete-time finite-state Markov decision processes. Automatica, 35:301–309, 1999.

[15] J. Morimoto and C. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA, 2002.
", "award": [], "sourceid": 2213, "authors": [{"given_name": "Christopher", "family_name": "Atkeson", "institution": null}, {"given_name": "Jun", "family_name": "Morimoto", "institution": null}]}