{"title": "Compositionality of optimal control laws", "book": "Advances in Neural Information Processing Systems", "page_first": 1856, "page_last": 1864, "abstract": "We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas. They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties - one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of linear-quadratic regulators) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to the Green's function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they are hard to reconcile. Our work makes this possible.", "full_text": "Compositionality of optimal control laws

Emanuel Todorov
Applied Mathematics and Computer Science & Engineering
University of Washington
todorov@cs.washington.edu

Abstract

We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas.
They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties – one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of LQG controllers) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to the Green's function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they have been difficult to reconcile. Our work makes this possible.

1 Introduction

Stochastic optimal control is of interest in many fields of science and engineering, however it remains hard to solve. Dynamic programming [1] and reinforcement learning [2] work well in discrete state spaces of reasonable size, but cannot handle the continuous high-dimensional state spaces characteristic of complex dynamical systems. A variety of function approximation methods are available [3, 4], yet the shortage of convincing results on challenging problems suggests that existing approximation methods do not scale as well as one would like. Thus there is a need for more efficient methods. The idea we pursue in this paper is compositionality.
With few exceptions [5, 6], this otherwise appealing idea is rarely used in optimal control, because it is unclear what can be composed, and how, so as to guarantee optimality of the resulting control law.

Our second motivation is understanding how the brain controls movement. Since the brain remains essentially the only system capable of solving truly complex control problems, sensorimotor neuroscience is a natural (albeit under-exploited) source of inspiration. To be sure, a satisfactory understanding of the neural control of movement is nowhere in sight. Yet there exist theoretical ideas backed by experimental data which shed light on the underlying computational principles. One such idea is that biological movements are near-optimal [7, 8]. This is not surprising given that motor behavior is shaped by the processes of evolution, development, learning and adaptation, all of which resemble iterative optimization. Precisely what algorithms enable the brain to approach optimal performance is not known, however a clue is provided by another prominent idea: compositionality. For about a century, researchers have been talking about motor synergies or primitives which somehow simplify control [9–11]. The implied reduction in dimensionality is now well documented [12–14]. However the structure and origin of the hypothetical primitives, the rules for combining them, and the ways in which they actually simplify the control problem remain unclear.

2 Stochastic optimal control problems with linear Bellman equations

We will be able to derive compositionality rules for first-exit and finite-horizon stochastic optimal control problems which belong to a certain class. This class includes both discrete-time [15–17] and continuous-time [17–19] formulations, and is rather general, yet affords substantial simplification.
Most notably, the optimal control law is found analytically given the optimal cost-to-go, which in turn is the solution to a linear equation obtained from the Bellman equation by exponentiation. Linearity implies compositionality, as will be shown here. It also makes a number of other things possible: finding the most likely trajectories of optimally-controlled stochastic systems via deterministic methods; solving inverse optimal control problems via convex optimization; applying off-policy learning in the state space as opposed to the state-action space; establishing duality between stochastic optimal control and Bayesian estimation. An overview can be found in [17]. Here we only provide the background needed for the present paper.

The discrete-time problem is defined by a state cost q(x) ≥ 0 describing how (un)desirable different states are, and passive dynamics x' ~ p(·|x) characterizing the behavior of the system in the absence of controls. The controller can impose any dynamics x' ~ u(·|x) it wishes, however it pays a price (control cost) which is the KL divergence between u and p. We further require that u(x'|x) = 0 whenever p(x'|x) = 0 so that the KL divergence is well-defined. Thus the discrete-time problem is

dynamics:   x' ~ u(·|x)
cost rate:  l(x, u(·|x)) = q(x) + KL(u(·|x) || p(·|x))

Let I denote the set of interior states and B the set of boundary states, and let f(x) ≥ 0, x in B, be a final cost.
Let v(x) denote the optimal cost-to-go, and define the desirability function

z(x) = exp(−v(x))

Let G denote the linear operator which computes an expectation under the passive dynamics:

G[z](x) = E_{x' ~ p(·|x)} z(x')

For x in I it can be shown that the optimal control law u*(·|x) and the desirability z(x) satisfy

optimal control law:      u*(x'|x) = p(x'|x) z(x') / G[z](x)                 (1)
linear Bellman equation:  exp(q(x)) z(x) = G[z](x)

On the boundary x in B we have z(x) = exp(−f(x)). The linear Bellman equation can be written more explicitly in vector-matrix notation as

z_I = M z_I + N z_B                                                          (2)

where M = diag(exp(−q_I)) P_II and N = diag(exp(−q_I)) P_IB. The matrix M is guaranteed to have spectral radius less than 1, thus the simple iterative solver z_I <- M z_I + N z_B converges.

The continuous-time problem is a control-affine Ito diffusion with control-quadratic cost:

dynamics:   dx = a(x) dt + B(x)(u dt + sigma dw)
cost rate:  l(x, u) = q(x) + (1 / 2 sigma^2) ||u||^2

The control u is now a (more traditional) vector and w is a Brownian motion process. Note that the control cost scaling by sigma^(-2), which is needed to make the math work, can be compensated by rescaling q. The optimal control law u*(x) and desirability z(x) satisfy

optimal control law:  u*(x) = sigma^2 B(x)' z_x(x) / z(x)                    (3)
linear HJB equation:  q(x) z(x) = L[z](x)

where the 2nd-order linear differential operator L is defined as

L[z](x) = a(x)' z_x(x) + (sigma^2 / 2) tr(B(x) B(x)' z_xx(x))

The relationship between the two formulations above is not obvious, but nevertheless it can be shown that the continuous-time formulation is a special case of the discrete-time formulation.
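As a concrete aside (our sketch, not from the paper), the vector-matrix form (2) and its iterative solver can be written in a few lines of Python; the random-walk dynamics and all cost values below are made up for illustration:

```python
import numpy as np

# Toy first-exit problem: a random walk on states 0..6.
# States 0 and 6 are boundary (B); states 1..5 are interior (I).
N = 7
I = np.arange(1, N - 1)
B = np.array([0, N - 1])

# Passive dynamics p(.|x): from an interior state, step left or right with prob 1/2.
P = np.zeros((N, N))
for i in I:
    P[i, i - 1] = P[i, i + 1] = 0.5

q = 0.1 * np.ones(len(I))   # state cost at interior states (illustrative)
f = np.array([0.0, 2.0])    # final costs at the two boundary states (illustrative)

# M = diag(exp(-q_I)) P_II and N = diag(exp(-q_I)) P_IB, as in equation (2)
M = np.diag(np.exp(-q)) @ P[np.ix_(I, I)]
Nmat = np.diag(np.exp(-q)) @ P[np.ix_(I, B)]
zB = np.exp(-f)             # boundary desirability z = exp(-f)

# Iterative solver z_I <- M z_I + N z_B; converges since M has spectral radius < 1
zI = np.zeros(len(I))
for _ in range(300):
    zI = M @ zI + Nmat @ zB

# Agrees with the direct solution z_I = (I - M)^{-1} N z_B
zI_direct = np.linalg.solve(np.eye(len(I)) - M, Nmat @ zB)
assert np.allclose(zI, zI_direct)

v = -np.log(zI)             # optimal cost-to-go at interior states
```

The optimal control law then follows from (1) with no further optimization: u*(x'|x) = p(x'|x) z(x') / G[z](x).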
This is done by defining the passive dynamics p^(h)(·|x) as the h-step transition probability density of the uncontrolled diffusion (or an Euler approximation to it), and the state cost as q^(h)(x) = h q(x). Then, in the limit h -> 0, the integral equation exp(q^(h)) z = G^(h)[z] reduces to the differential equation q z = L[z]. Note that for small h the density p^(h)(·|x) is close to Gaussian. From the formula for the KL divergence between Gaussians, the KL control cost in the discrete-time formulation reduces to the quadratic control cost in the continuous-time formulation.

The reason for working with both formulations and emphasizing the relationship between them is that most problems of practical interest are continuous in time and space, yet the discrete-time formulation is easier to work with. Furthermore it leads to better numerical stability, because integral equations are better behaved than differential equations. Note also that the discrete-time formulation can be used in both discrete and continuous state spaces, although the latter require function approximation in order to solve the linear Bellman equation [20].

3 Compositionality theory

The compositionality developed in this section follows from the linearity of equations (1, 3). We focus on first-exit problems, which are more general. An example involving a finite-horizon problem will be given later. Consider a collection of K optimal control problems in our class which all have the same dynamics – p(·|x) in discrete time or a(x), B(x), sigma in continuous time – the same state cost rate q(x) and the same sets I and B of interior and boundary states. These problems differ only in their final costs f_k(x). Let z_k(x) denote the desirability function for problem k, and u*_k(·|x) or u*_k(x) the corresponding optimal control law.
The latter will serve as primitives for constructing optimal control laws for new problems in our class. We will call the K problems we started with component problems, and the new problem the composite problem.

Suppose the final cost for the composite problem is f(x), and there exist weights w_k such that

f(x) = −log( sum_{k=1..K} w_k exp(−f_k(x)) )                                 (4)

Thus the functions f_k(x) define a K-dimensional manifold of composite problems. The above condition ensures that for all boundary/terminal states x in B we have

z(x) = sum_{k=1..K} w_k z_k(x)                                               (5)

Since z is the solution to a linear equation, if (5) holds on the boundary then it must hold everywhere. Thus the desirability function for the composite problem is a linear combination of the desirability functions for the component problems. The weights in this linear combination can be interpreted as compatibilities between the control objectives in the component problems and the control objective in the composite problem. The optimal control law for the composite problem is given by (1, 3).

The above construction implies that both z and z_k are everywhere positive. Since z is defined as an exponent, it must be positive. However this is not necessary for the components. Indeed if

f(x) = −log( sum_{k=1..K} w_k z_k(x) )                                       (6)

holds for all x in B, then (5) and z(x) > 0 hold everywhere even if z_k(x) ≤ 0 for some k and x. In this case the z_k's are no longer desirability functions for well-defined optimal control problems. Nevertheless we can think of them as generalized desirability functions with similar meaning: the larger z_k(x) is, the more compatible state x is with the agenda of component k.

3.1 Compositionality of discrete-time control laws

When z_k(x) > 0 the composite control law u* can be expressed as a state-dependent convex combination of the component control laws u*_k.
Combining (5, 1) and using the linearity of G,

u*(x'|x) = sum_k [ w_k G[z_k](x) / sum_s w_s G[z_s](x) ] [ p(x'|x) z_k(x') / G[z_k](x) ]

The second term above is u*_k(x'|x). The first term is a state-dependent mixture weight which we denote m_k(x). The composition rule for optimal control laws is then

u*(·|x) = sum_k m_k(x) u*_k(·|x)                                             (7)

Using the fact that z_k(x) satisfies the linear Bellman equation (1) and q(x) does not depend on k, the mixture weights can be simplified as

m_k(x) = w_k G[z_k](x) / sum_s w_s G[z_s](x) = w_k z_k(x) / sum_s w_s z_s(x)  (8)

Note that sum_k m_k(x) = 1 and m_k(x) > 0.

3.2 Compositionality of continuous-time control laws

Substituting (5) in (3) and assuming z_k(x) > 0, the control law given by (3) can be written as

u*(x) = sum_k [ w_k z_k(x) / sum_s w_s z_s(x) ] [ sigma^2 B(x)' (d/dx) z_k(x) / z_k(x) ]

The term in brackets is u*_k(x). We denote the first term with m_k(x) as before. Then the composite optimal control law is

u*(x) = sum_k m_k(x) u*_k(x)                                                 (9)

Note the similarity between the discrete-time result (7) and the continuous-time result (9), as well as the fact that the mixing weights are computed in the same way.
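As a numerical sanity check (our sketch, not from the paper), the following Python snippet builds two component problems on a made-up random-walk chain, forms a composite final cost via (4), and verifies the linearity property (5) and the mixture representation (7, 8); all numbers are illustrative:

```python
import numpy as np

# Toy random walk on states 0..6, as in Section 2; all problem data are illustrative.
N = 7
I, B = np.arange(1, N - 1), np.array([0, N - 1])
P = np.zeros((N, N))
for i in I:
    P[i, i - 1] = P[i, i + 1] = 0.5
q = 0.1 * np.ones(len(I))
M = np.diag(np.exp(-q)) @ P[np.ix_(I, I)]
Nmat = np.diag(np.exp(-q)) @ P[np.ix_(I, B)]
solve = lambda zB: np.linalg.solve(np.eye(len(I)) - M, Nmat @ zB)

# Two component problems that differ only in their final costs f_k
fk = np.array([[0.0, 3.0],    # component 1 prefers exiting on the left
               [3.0, 0.0]])   # component 2 prefers exiting on the right
w = np.array([0.3, 0.7])      # compatibility weights w_k
zk = np.stack([solve(np.exp(-f)) for f in fk])   # K x |I| component desirabilities

# Composite final cost f = -log(sum_k w_k exp(-f_k)), equation (4)
zB = w @ np.exp(-fk)          # composite boundary desirability
z = solve(zB)

# Linearity: composite desirability is the weighted sum of components, equation (5)
assert np.allclose(z, w @ zk)

# Mixture weights m_k(x) = w_k z_k(x) / sum_s w_s z_s(x), equation (8)
m = (w[:, None] * zk) / (w @ zk)
assert np.allclose(m.sum(axis=0), 1.0)

# Composite optimal control law (1) equals the mixture of component laws, equation (7)
zfull, zkfull = np.zeros(N), np.zeros((2, N))
zfull[I], zfull[B] = z, zB
zkfull[:, I], zkfull[:, B] = zk, np.exp(-fk)
for i in I:
    u_direct = P[i] * zfull / (P[i] @ zfull)
    u_mix = sum(m[k, i - 1] * P[i] * zkfull[k] / (P[i] @ zkfull[k]) for k in range(2))
    assert np.allclose(u_direct, u_mix)
```

Both identities hold to machine precision, since the solve is linear and the mixture weights cancel exactly as in the derivation above.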
This is surprising given that in one case the control law directly specifies the probability distribution over next states, while in the other case the control law shifts the mean of the distribution given by the passive dynamics.

4 Analytical solutions to linear-Gaussian problems with non-quadratic costs

Here we specialize the above results to the case when the components are continuous-time linear quadratic Gaussian (LQG) problems of the form

dynamics:   dx = A x dt + B (u dt + sigma dw)
cost rate:  l(x, u) = (1/2) x' Q x + (1 / 2 sigma^2) ||u||^2

The component final costs are quadratic:

f_k(x) = (1/2) x' F_k x

The optimal cost-to-go function for LQG problems is known to be quadratic [21], in the form

v_k(x, t) = (1/2) x' V_k(t) x + alpha_k(t)

At the predefined final time T we have V_k(T) = F_k and alpha_k(T) = 0. The optimal control law is

u*_k(x, t) = −sigma^2 B' V_k(t) x

The quantities V_k(t) and alpha_k(t) can be computed by integrating backward in time the ODEs

−dV_k/dt = Q + A' V_k + V_k A − V_k Sigma V_k                               (10)
−dalpha_k/dt = (1/2) tr(Sigma V_k)

where Sigma = sigma^2 B B'. Now consider a composite problem with final cost

f(x) = −log( sum_k w_k exp( −(1/2) x' F_k x ) )

Figure 1: Illustration of compositionality in the LQG framework. (A) An LQG problem with quadratic cost-to-go and linear feedback control law. T = 10 is the final time. (B, C) Non-LQG problems solved analytically by mixing the solutions to multiple LQG problems.

This composite problem is no longer LQG because it has a non-quadratic final cost (i.e. the log of a mixture of Gaussians), and yet we will be able to find a closed-form solution by combining multiple LQG controllers.
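As a sketch (ours, not the paper's Matlab implementation), the backward Riccati integration (10) and the mixing of LQG primitives can be put together as follows; the dynamics, costs, and target locations below are illustrative:

```python
import numpy as np

# Scalar integrator dx = u dt + sigma dw, with the state augmented by a constant
# so that quadratic final costs can be centered at c_k. All values are illustrative.
sigma, T, h = 0.2, 10.0, 0.001
A = np.zeros((2, 2))
Bm = np.array([[1.0], [0.0]])
Q = np.zeros((2, 2))
targets = [(-1.0, 5.0), (1.0, 5.0)]   # (center c_k, curvature d_k)
w = [1.0, 1.0]                        # mixing weights w_k

def backward_lqg(F):
    # Euler integration of the ODEs (10) backward from V(T) = F, alpha(T) = 0:
    # -dV/dt = Q + A'V + VA - sigma^2 V B B' V,  -dalpha/dt = (sigma^2/2) tr(B B' V)
    V, alpha = F.copy(), 0.0
    for _ in range(int(round(T / h))):
        negVdot = Q + A.T @ V + V @ A - sigma**2 * V @ Bm @ Bm.T @ V
        negadot = 0.5 * sigma**2 * np.trace(Bm @ Bm.T @ V)
        V, alpha = V + h * negVdot, alpha + h * negadot
    return V, alpha

comps = [backward_lqg(d * np.array([[1.0, -c], [-c, c**2]])) for c, d in targets]

def u_star(x):
    # Composite control at time 0: mixture of linear component laws, equation (9)
    xa = np.array([x, 1.0])
    zs = np.array([wk * np.exp(-0.5 * xa @ V @ xa - a) for wk, (V, a) in zip(w, comps)])
    us = np.array([(-sigma**2 * Bm.T @ V @ xa).item() for V, a in comps])
    return ((zs / zs.sum()) @ us).item()
```

By symmetry the composite control vanishes at x = 0, and far outside the two targets it points back toward them, mirroring the two-target behavior described below.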
Note that, since mixtures of Gaussians are universal function approximators, we can represent any desired final cost to within arbitrary accuracy given enough LQG components. Applying the results from the previous section, the desirability for the composite problem is

z(x, t) = sum_k w_k exp( −(1/2) x' V_k(t) x − alpha_k(t) )

The optimal control law can now be obtained directly from (3), or via composition from (9). Note that the constants alpha_k(t) do not affect the component control laws (and indeed are rarely computed in the LQG framework), however they affect the composite control law through the mixing weights.

We illustrate the above construction on a scalar example with integrator dynamics dx = u dt + 0.2 dw. The state cost rate is q(x) = 0. We set w_k = 1 for all k. The final time is T = 10. The component final costs are of the form

f_k(x) = (d_k / 2) (x − c_k)^2

In order to center these quadratics at c_k rather than 0 we augment the state: x = [x; 1]. The matrices defining the problem are then

A = [0 0; 0 0],   B = [1; 0],   F_k = d_k [1 −c_k; −c_k c_k^2]

The ODEs (10) are integrated using ode45 in Matlab. Fig 1 shows the optimal cost-to-go functions v(x, t) = −log(z(x, t)) and the optimal control laws u*(x, t) for the following problems: {c = 0; d = 5}, {c = −1, 0, 1; d = 5, 0.1, 15}, and {c = −1.5 : 0.5 : 1.5; d = 5}. The first problem (Fig 1A) is just an LQG. As expected the cost-to-go is quadratic and the control law is linear with a time-varying gain. The second problem (Fig 1B) has a multimodal cost-to-go. The control law is no longer linear but instead has an elaborate shape. The third problem (Fig 1C) resembles robust control in the sense that there is a flat region where all states are equally good. The corresponding control law uses feedback to push the state into this flat region.
Inside the region the controller does nothing, so as to save energy. As these examples illustrate, the methodology developed here significantly extends the LQG framework while preserving its tractability.

5 Constructing minimal sets of primitives via SVD of Green's function

We showed how composite problems can be solved once the solutions to the component problems are available. The choice of component boundary conditions defines the manifold (6) of problems that can be solved exactly. One can use any available set of solutions as components, but is there a set which is in some sense minimal? Here we offer an answer based on singular value decomposition (SVD). We focus on discrete state spaces; continuous spaces can be discretized following [22].

Recall that the vector of desirability values z(x) at interior states x in I, which we denoted z_I, satisfies the linear equation (2). We can write the solution to that equation explicitly as

z_I = G z_B

where G = (diag(exp(q_I)) − P_II)^(-1) P_IB. The matrix G maps values on the boundary to values on the interior, and thus resembles Green's function for linear PDEs. A minimal set of primitives corresponds to the best low-rank approximation to G. If we define "best" in terms of least squares, a minimal set of R primitives is obtained by approximating G using the top R singular values:

G ≈ U S V'

S is an R-by-R diagonal matrix, U and V are |I|-by-R and |B|-by-R orthonormal matrices. If we now set z_B = V_{:,r}, which is the r-th column of V, then

z_I = G z_B ≈ U S V' V_{:,r} = S_{rr} U_{:,r}

Thus the right singular vectors (columns of V) are the component boundary conditions, while the left singular vectors (columns of U) are the component solutions.

The above construction does not use knowledge of the family of composite problems we aim to solve/approximate. A slight modification makes it possible to incorporate such knowledge.
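In code, the construction above amounts to a truncated SVD of the matrix G. The following Python sketch (ours, using a made-up grid-world problem in place of the paper's disk domain) illustrates it:

```python
import numpy as np

# Random walk on an n x n grid; edge cells are boundary states, inner cells interior.
n = 10
idx = lambda i, j: i * n + j
interior = [idx(i, j) for i in range(1, n - 1) for j in range(1, n - 1)]
boundary = [s for s in range(n * n) if s not in interior]

P = np.zeros((n * n, n * n))
for i in range(1, n - 1):
    for j in range(1, n - 1):
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            P[idx(i, j), idx(i + di, j + dj)] = 0.25

q = 0.05   # constant state cost (illustrative): a penalty on time to exit

# G = (diag(exp(q_I)) - P_II)^{-1} P_IB maps boundary values to interior values
A = np.exp(q) * np.eye(len(interior)) - P[np.ix_(interior, interior)]
G = np.linalg.solve(A, P[np.ix_(interior, boundary)])

# A minimal set of R primitives: keep only the top R singular values of G
U, S, Vt = np.linalg.svd(G, full_matrices=False)
R = 10
G_R = (U[:, :R] * S[:R]) @ Vt[:R]

# Columns of V are component boundary conditions; columns of U are component solutions.
# Any boundary desirability z_B is then handled using only R primitives:
zB = np.exp(-np.linspace(0.0, 3.0, len(boundary)))   # arbitrary final costs
zI_exact = G @ zB
zI_lowrank = G_R @ zB
rel_err = np.linalg.norm(zI_exact - zI_lowrank) / np.linalg.norm(zI_exact)
```

Here the full SVD is computed and then truncated; for the large, sparse G of a realistic discretization one would instead compute only the top R singular pairs.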
Let the family in question have parametric final costs f(x, theta). Choose a discrete set {theta_k}, k = 1..K, of values of the parameter theta, and form the |B|-by-K matrix Phi with elements Phi_ik = exp(−f(x_i, theta_k)), x_i in B. As in (4), this choice restricts the boundary conditions that can be represented to z_B = Phi w, where w is a K-dimensional vector. Now apply SVD to obtain a rank-R approximation to the matrix G Phi instead of G. We can set R << K to achieve a significant reduction in the number of components. Note that G Phi is smaller than G, so the SVD here is faster to compute.

We illustrate the above approach using a discretization of the following 2D problem:

a(x) = [−0.2 x_2; 0.2 |x_1|],   B = I,   sigma = 1,   q(x) = 0.1

The vector field in Fig 2A illustrates the function a(x). To make the problem more interesting we introduce an L-shaped obstacle which can be hit without penalty but cannot be penetrated. The domain is a disk centered at (0, 0) with radius sqrt(21). The constant q implements a penalty for the time spent inside the disk. The discretization involves |I| = 24520 interior states and |B| = 4163 boundary states. The parametric family of final costs is

f(x, theta) = 13 − 13 exp(5 cos(atan2(x_2, x_1) − theta) − 5)

This is an inverted von Mises function specifying the desired location where the state should exit the disk. f(x, 0) is plotted in red in Fig 2A. The set {theta_k} includes 200 uniformly spaced values of theta. The SVD components are constructed using the second method above (although the first method gives very similar results). Fig 2B compares the solution obtained with a direct solver (i.e. using the exact G) for theta = 0, and the solutions obtained using R = 70 and R = 40 components. The desirability function z is well approximated in both cases. In fact the approximation to z looks perfect with far fewer components (not shown). However v = −log(z) is more difficult to approximate. The difficulty comes from the fact that the components are not always positive, and as a result the composite solution is not always positive. The regions where that happens are shown in white in Fig 2B. In those regions the approximation is undefined. Note that this occurs only near the boundary. Fig 2C shows the first 10 components. They resemble harmonic functions. It is notable that the higher-order components (corresponding to smaller singular values) are only modulated near the boundary – which explains why the approximation errors in Fig 2B are near the boundary. In summary, a small number of components are sufficient to construct composite control laws which are near-optimal in most of the state space. Accuracy at the boundary requires additional components. Alternatively one could use positive SVD and obtain not just positive but also more localized components (as we have done in preliminary work).

Figure 2: Illustration of primitives obtained via SVD. (A) Passive dynamics and cost. (B) Solutions obtained with a direct solver and with different numbers of primitives. (C) Top ten primitives z_k(x).

Figure 3: Preliminary model of arm movements. (A) Hand paths of different lengths. Red dots denote start points, black circles denote end points. (B) Speed profiles (speed in cm/sec vs. time in sec) for the movements shown in (A). Note that the same controller generates movements of different duration. (C) Hand paths generated by a composite controller obtained by mixing the optimal controllers for two targets.
This controller "decides" online which target to go to.

6 Application to arm movements

We are currently working on an optimal control model of arm movements based on compositionality. The dynamics correspond to a 2-link arm moving in the horizontal plane, and have the form

tau = M(theta) theta'' + n(theta, theta')

theta contains the shoulder and elbow joint angles, tau is the applied torque, M is the configuration-dependent inertia, and n is the vector of Coriolis, centripetal and viscous forces. Model parameters are taken from the biomechanics literature. The final cost f is a quadratic (in Cartesian space) centered at the target. The running state cost is q = const, encoding a penalty for duration. The above model has a 4-dimensional state space (theta, theta'). In order to encode reaching movements, we introduce an additional state variable s which keeps track of how long the hand speed (in Cartesian space) has remained below a threshold. When s becomes sufficiently large the movement ends. This augmentation is needed in order to express reaching movements as a first-exit problem. Without it the movement would stop whenever the instantaneous speed becomes zero – which can happen at reversal points as well as the starting point. Note that most models of reaching movements have assumed a predefined final time. However this is unrealistic, because we know that movement duration scales with distance, and furthermore such scaling takes place online (i.e. movement duration increases if the target is perturbed during the movement).

The above second-order system is expressed in general first-order form, and then the passive dynamics corresponding to tau = 0 are discretized in space and time. The time step is h = 0.02 sec. The space discretization uses a grid with 51^4 x 3 points. The factor of 3 is needed to discretize the variable s.
Thus we have around 20 million discrete states, and the matrix P characterizing the passive dynamics is 20-million-by-20-million. Fortunately it is very sparse, because the noise (in torque space) cannot have a large effect within a single time step: there are about 50 non-zero entries in each row. Our simple iterative solver converges in about 30 iterations and takes less than 2 min of CPU time, using custom multi-threaded C++ code.

Fig 3A shows hand paths from different starting points to the same target. The speed profiles for these movements are shown in Fig 3B. The scaling with amplitude looks quite realistic. In particular, it is known that human reaching movements of different amplitude have similar speed profiles around movement onset, and diverge later. Fig 3C shows results for a composite controller obtained by mixing the optimal control laws for two different targets. In this example the targets are sufficiently far away and the final costs are sufficiently steep, thus the mixing yields a switching controller instead of an interpolating controller. Depending on the starting point, this controller takes the hand to one or the other target, and can also switch online if the hand is perturbed. An interpolating controller can be created by placing the targets closer or making the component final costs less steep. While these results are preliminary, we find them encouraging. In future work we will explore this model in more detail and also build a more realistic model using 3rd-order dynamics (incorporating muscle time constants). We do not expect to be able to discretize the latter system, but we are in the process of making a transition from discretization to function approximation [20].

7 Summary and relation to prior work

We developed a theory of compositionality applicable to a general class of stochastic optimal control problems.
Although in this paper we used simple examples, the potential of such compositionality to tackle complex control problems seems clear.

Our work is somewhat related to proto-value functions (PVFs), which are eigenfunctions of the Laplacian [5], i.e. of the matrix I − P_II. While the motivation is similar, PVFs are based on intuitions (mostly from grid worlds divided into rooms) rather than mathematical results regarding optimality of the composite solution. In fact our work suggests that PVFs should perhaps be used to approximate the exponent of the value function instead of the value function itself. Another difference is that PVFs do not take into account the cost rate q and the boundary B. This sounds like a good thing, but it may be too good, in the sense that such generality may be the reason why guarantees regarding PVF optimality are lacking. Nevertheless the ambitious agenda behind PVFs is certainly worth pursuing, and it will be interesting to compare the two approaches in more detail.

Finally, another group [6] has developed similar ideas independently and in parallel. Although their paper is restricted to combinations of LQG controllers for finite-horizon problems, it contains very interesting examples from complex tasks such as walking, jumping and diving. A particularly important point made by [6] is that the primitives can be only approximately optimal (in this case obtained via local LQG approximations), and yet their combination still produces good results.

References

[1] D. Bertsekas, Dynamic Programming and Optimal Control (2nd Ed). Belmont, MA: Athena Scientific, 2001.

[2] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

[3] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1997.

[4] J. Si, A. Barto, W. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming.
Wiley-IEEE Press, 2004.

[5] S. Mahadevan and M. Maggioni, "Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes," Journal of Machine Learning Research, vol. 8, pp. 2169–2231, 2007.

[6] M. da Silva, F. Durand, and J. Popovic, "Linear Bellman combination for control of character animation," to appear in SIGGRAPH, 2009.

[7] E. Todorov, "Optimality principles in sensorimotor control," Nature Neuroscience, vol. 7, no. 9, pp. 907–915, 2004.

[8] C. Harris and D. Wolpert, "Signal-dependent noise determines motor planning," Nature, vol. 394, pp. 780–784, 1998.

[9] C. Sherrington, The Integrative Action of the Nervous System. New Haven: Yale University Press, 1906.

[10] N. Bernstein, On the Construction of Movements. Moscow: Medgiz, 1947.

[11] M. Latash, "On the evolution of the notion of synergy," in Motor Control, Today and Tomorrow, G. Gantchev, S. Mori, and J. Massion, Eds. Sofia: Academic Publishing House "Prof. M. Drinov", 1999, pp. 181–196.

[12] M. Tresch, P. Saltiel, and E. Bizzi, "The construction of movement by the spinal cord," Nature Neuroscience, vol. 2, no. 2, pp. 162–167, 1999.

[13] A. d'Avella, P. Saltiel, and E. Bizzi, "Combinations of muscle synergies in the construction of a natural motor behavior," Nature Neuroscience, vol. 6, no. 3, pp. 300–308, 2003.

[14] M. Santello, M. Flanders, and J. Soechting, "Postural hand synergies for tool use," Journal of Neuroscience, vol. 18, no. 23, pp. 10105–10115, 1998.

[15] E.
Todorov, "Linearly-solvable Markov decision problems," Advances in Neural Information Processing Systems, 2006.

[16] ——, "General duality between optimal control and estimation," IEEE Conference on Decision and Control, 2008.

[17] ——, "Efficient computation of optimal actions," PNAS, in press, 2009.

[18] S. Mitter and N. Newton, "A variational approach to nonlinear estimation," SIAM Journal on Control and Optimization, vol. 42, pp. 1813–1833, 2003.

[19] H. Kappen, "Linear theory for control of nonlinear stochastic systems," Physical Review Letters, vol. 95, 2005.

[20] E. Todorov, "Eigenfunction approximation methods for linearly-solvable optimal control problems," IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.

[21] R. Stengel, Optimal Control and Estimation. New York: Dover, 1994.

[22] H. Kushner and P. Dupuis, Numerical Methods for Stochastic Optimal Control Problems in Continuous Time. New York: Springer, 2001.
", "award": [], "sourceid": 1170, "authors": [{"given_name": "Emanuel", "family_name": "Todorov", "institution": null}]}