{"title": "Learning Attractor Landscapes for Learning Motor Primitives", "book": "Advances in Neural Information Processing Systems", "page_first": 1547, "page_last": 1554, "abstract": null, "full_text": "Learning Attractor Landscapes for\n\nLearning Motor Primitives\n\nAuke Jan Ijspeert1;3\u2044, Jun Nakanishi2, and Stefan Schaal1;2\n\n1University of Southern California, Los Angeles, CA 90089-2520, USA\n2ATR Human Information Science Laboratories, Kyoto 619-0288, Japan\n\n3EPFL, Swiss Federal Institute of Technology, Lausanne, Switzerland\n\nijspeert@usc.edu, jun@his.atr.co.jp, sschaal@usc.edu\n\nAbstract\n\nMany control problems take place in continuous state-action spaces,\ne.g., as in manipulator robotics, where the control objective is of-\nten de\ufb02ned as \ufb02nding a desired trajectory that reaches a particular\ngoal state. While reinforcement learning o\ufb01ers a theoretical frame-\nwork to learn such control policies from scratch, its applicability to\nhigher dimensional continuous state-action spaces remains rather\nlimited to date. Instead of learning from scratch, in this paper we\nsuggest to learn a desired complex control policy by transforming\nan existing simple canonical control policy. For this purpose, we\nrepresent canonical policies in terms of di\ufb01erential equations with\nwell-de\ufb02ned attractor properties. By nonlinearly transforming the\ncanonical attractor dynamics using techniques from nonparametric\nregression, almost arbitrary new nonlinear policies can be gener-\nated without losing the stability properties of the canonical sys-\ntem. We demonstrate our techniques in the context of learning a\nset of movement skills for a humanoid robot from demonstrations\nof a human teacher. 
Policies are acquired rapidly, and, due to the properties of well-formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognizing and classifying previously learned movement skills. Evaluations in simulations and on an actual 30 degree-of-freedom humanoid robot exemplify the feasibility and robustness of our approach.

1 Introduction

Learning control is formulated in one of the most general forms as learning a control policy u = π(x, t, w) that maps a state x, possibly in a time t dependent way, to an action u; the vector w denotes the adjustable parameters that can be used to optimize the policy. Since learning control policies (CPs) based on atomic state-action representations is rather time consuming and faces problems in higher dimensional and/or continuous state-action spaces, a current topic in learning control is to use higher level representations to achieve faster and more robust learning [1, 2]. In this paper we suggest a novel encoding for such higher level representations based on the analogy between CPs and differential equations: both formulations suggest a change of state given the current state of the system, and both usually encode a desired goal in the form of an attractor state. Thus, instead of shaping the attractor landscape of a policy tediously from scratch by traditional methods of reinforcement learning, we suggest to start out with a differential equation that already encodes a rough form of an attractor landscape and to only adapt this landscape to become more suitable to the current movement goal.

* http://lslwww.epfl.ch/~ijspeert/
If such a representation can keep the policy linear in the parameters w, rapid learning can be accomplished, and, moreover, the parameter vector may serve to classify a particular policy.

In the following sections, we will first develop our learning approach of shaping attractor landscapes by means of statistical learning, building on preliminary previous work [3, 4].^1 Second, we will present a particular form of canonical CPs suitable for manipulator robotics, and finally, we will demonstrate how our methods can be used to classify movement and equip an actual humanoid robot with a variety of movement skills through imitation learning.

2 Learning Attractor Landscapes

We consider a learning scenario where the goal of control is to attain a particular attractor state, either formulated as a point attractor (for discrete movements) or as a limit cycle (for rhythmic movements). For point attractors, we require that the CP reach the goal state with a particular trajectory shape, irrespective of the initial conditions; a tennis swing toward a ball would be a typical example of such a movement. For limit cycles, the goal is given as the trajectory shape of the limit cycle and needs to be realized from any start state, as, for example, in a complex drumming beat hitting multiple drums during one period. We will assume that, as the seed of learning, we obtain one or multiple example trajectories, defined by positions and velocities over time. Using these samples, an asymptotically stable CP is to be generated, prescribing a desired velocity given a particular state.^2

Various methods have been suggested to solve such control problems in the literature. As the simplest approach, one could just use one of the demonstrated trajectories and track it as a desired trajectory.
While this would mimic this one particular trajectory, and scaling laws could account for different start positions [5], the resultant control policy would require time as an explicit variable and thus become highly sensitive to unforeseen perturbations in the environment that disrupt the normal flow of time. Spline-based approaches [6] have a similar problem. Recurrent neural networks have been suggested as a possible alternative that can avoid explicit time indexing; the complexity of training these networks to obtain stable attractor landscapes, however, has prevented widespread application so far. Finally, it is also possible to prime a reinforcement learning system with sample trajectories and pursue one of the established continuous state-action learning algorithms; investigations of such an approach, however, demonstrated rather limited efficiency [7]. In the next sections, we present an alternative and surprisingly simple solution to the learning problem above.

^1 Portions of the work presented in this paper have been published in [3, 4]. We here extend these preliminary studies with an improvement and simplification of the rhythmic system, an integrated view of the interpretation of both the discrete and rhythmic CPs, the fitting of a complete alphabet of Graffiti characters, and an implementation of automatic allocation of centers of kernel functions for locally weighted learning.

^2 Note that we restrict our approach to purely kinematic CPs, assuming that the movement system is equipped with an appropriate feedback and feedforward controller that can accurately track the kinematic plans generated by our policies.

Table 1: Discrete and Rhythmic control policies. αz, βz, αv, βv, μ, hi, and ci are positive constants. x0 is the start state of the discrete system in order to allow non-zero initial conditions.
The design parameters of the discrete system are τ, the temporal scaling factor, and g, the goal position. The design parameters of the rhythmic system are ym, the baseline of the oscillation, τ, the period divided by 2π, and r0, the amplitude of the oscillations. The parameters wi are fitted to a demonstrated trajectory using Locally Weighted Learning.

Discrete:
  τ ẏ = z + (Σ_{i=1}^N Ψi wi^T ṽ) / (Σ_{i=1}^N Ψi)
  τ ż = αz(βz(g − y) − z)
  ṽ = [v]
  τ v̇ = αv(βv(g − x) − v)
  τ ẋ = v
  Ψi = exp(−hi((x − x0)/(g − x0) − ci)²),  ci ∈ [0, 1]

Rhythmic:
  τ ẏ = z + (Σ_{i=1}^N Ψi wi^T ṽ) / (Σ_{i=1}^N Ψi)
  τ ż = αz(βz(ym − y) − z)
  ṽ = [r cos φ, r sin φ]^T
  τ φ̇ = 1
  τ ṙ = −μ(r − r0)
  Ψi = exp(−hi(mod(φ, 2π) − ci)²),  ci ∈ [0, 2π]

2.1 Dynamical Systems for Discrete Movements

Assume we have a basic control policy (CP), for instance, instantiated by the second order attractor dynamics

  τ ż = αz(βz(g − y) − z),    τ ẏ = z + f        (1)

where g is a known goal state, αz, βz are time constants, τ is a temporal scaling factor (see below), and y, ẏ correspond to the desired position and velocity generated by the policy as a movement plan. For appropriate parameter settings and f = 0, these equations form a globally stable linear dynamical system with g as a unique point attractor. Could we insert a nonlinear function f in Eq. 1 to change the rather trivial exponential convergence of y to allow more complex trajectories on the way to the goal? As such a change of Eq. 1 enters the domain of nonlinear dynamics, an arbitrary complexity of the resulting equations can be expected.
To the best of our knowledge, this has prevented research from employing generic learning in nonlinear dynamical systems so far. However, the introduction of an additional canonical dynamical system (x, v)

  τ v̇ = αv(βv(g − x) − v),    τ ẋ = v        (2)

and the nonlinear function f

  f(x, v, g) = (Σ_{i=1}^N Ψi wi v) / (Σ_{i=1}^N Ψi),    Ψi = exp(−hi(x/g − ci)²)        (3)

can alleviate this problem. Eq. 2 is a second order dynamical system similar to Eq. 1; however, it is linear and not modulated by a nonlinear function, and, thus, its monotonic global convergence to g can be guaranteed with a proper choice of αv and βv. Assuming that all initial conditions of the state variables x, v, y, z are initially zero, the quotient x/g ∈ [0, 1] can serve as a phase variable to anchor the Gaussian basis functions Ψi (characterized by a center ci and bandwidth hi), and v can act as a \"gating term\" in the nonlinear function (3) such that the influence of this function vanishes at the end of the movement.
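To make Eqs. 1-3 concrete, the combined system can be integrated with a simple Euler scheme. The sketch below is illustrative only: the gains (αz = 25, βz = αz/4, and likewise for αv, βv), the step size, and the kernel placement are our assumptions, not values prescribed by the paper.

```python
import numpy as np

def discrete_cp(g, w, c, h, tau=1.0, dt=0.001, T=2.0,
                alpha_z=25.0, beta_z=25.0 / 4, alpha_v=25.0, beta_v=25.0 / 4):
    """Euler integration of the discrete CP (Eqs. 1-3).
    w, c, h: weights, centers, and bandwidths of the N Gaussian kernels."""
    y = z = x = v = 0.0                           # all state variables start at zero
    ys = []
    for _ in range(int(T / dt)):
        psi = np.exp(-h * (x / g - c) ** 2)       # basis functions anchored in phase x/g
        f = np.dot(psi, w * v) / np.sum(psi)      # Eq. 3: v gates f to zero at movement end
        y += dt * (z + f) / tau                   # Eq. 1
        z += dt * alpha_z * (beta_z * (g - y) - z) / tau
        x += dt * v / tau                         # Eq. 2 (canonical system)
        v += dt * alpha_v * (beta_v * (g - x) - v) / tau
        ys.append(y)
    return np.array(ys)
```

With all wi = 0, y converges monotonically to the goal g (the choice βz = αz/4 makes the linear system critically damped); nonzero weights reshape the transient without changing the attractor.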
Assuming boundedness of the weights wi in Eq. 3, it can be shown that the combined dynamical system (Eqs. 1-3) asymptotically converges to the unique point attractor g.

Given that f is a normalized basis function representation with linear parameterization, it is obvious that this choice of nonlinearity allows applying a variety of learning algorithms to find the wi.

Figure 1: Examples of the time evolution of the discrete CPs (left) and rhythmic CPs (right). The parameters wi have been adjusted to fit ẏdemo(t) = 10 sin(2πt) exp(−t²) for the discrete CPs and ẏdemo(t) = 2π cos(2πt) − 6π sin(6πt) for the rhythmic CPs.

For learning from a given sample trajectory, characterized by a trajectory ydemo(t), ẏdemo(t) and duration T, a supervised learning problem can be formulated with the target trajectory ftarget = τ ẏdemo − zdemo for Eq. 1 (right), where zdemo is obtained by integrating Eq. 1 (left) with ydemo instead of y. The corresponding goal state is g = ydemo(T) − ydemo(t = 0), i.e., the sample trajectory is translated to start at y = 0.
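The construction of this regression target can be sketched as follows; the Euler integration and the gain values are our assumptions for illustration, not values prescribed by the paper.

```python
import numpy as np

def fitting_target(y_demo, yd_demo, g, tau=1.0, dt=0.001,
                   alpha_z=25.0, beta_z=25.0 / 4):
    """Regression target f_target = tau * yd_demo - z_demo for Eq. 1 (right),
    where z_demo integrates Eq. 1 (left) with y_demo in place of y."""
    z = 0.0
    f_target = np.empty(len(y_demo))
    for k in range(len(y_demo)):
        f_target[k] = tau * yd_demo[k] - z          # f = tau*ydot - z from Eq. 1 (right)
        # integrate Eq. 1 (left) driven by the demonstrated position
        z += dt * alpha_z * (beta_z * (g - y_demo[k]) - z) / tau
    return f_target
```

If the demonstration is itself a solution of the nominal (f = 0) system, the target is identically zero; any deviation of the demonstrated shape from the nominal exponential convergence shows up in f_target, and this is what the function approximator fits.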
In order to make the nominal (i.e., assuming f = 0) dynamics of Eqs. 1 and 2 span the duration T of the sample trajectory, the temporal scaling factor τ is adjusted such that the nominal dynamics achieves 95% convergence at t = T. For solving the function approximation problem, we chose a nonparametric regression technique from locally weighted learning (LWL) [8], as it allows us to determine the necessary number of basis functions, their centers ci, and bandwidths hi automatically: in essence, for every basis function Ψi, LWL performs a locally weighted regression of the training data to obtain an approximation of the tangent of the function to be approximated within the scope of the kernel, and a prediction for a query point is achieved by a Ψi-weighted average of the predictions of all local models. Moreover, as will be explained later, the parameters wi learned by LWL are also independent of the number of basis functions, such that they can be used robustly for categorization of different learned CPs.

In summary, by anchoring a linear learning system with nonlinear basis functions in the phase space of a canonical dynamical system with guaranteed attractor properties, we are able to learn complex attractor landscapes of nonlinear differential equations without losing the asymptotic convergence to the goal state.

2.2 Extension to Limit Cycle Dynamics

The system above can be extended to limit cycle dynamics by replacing the canonical system (x, v) with, for instance, the following rhythmic system, which has a stable limit cycle in terms of polar coordinates (φ, r):

  τ φ̇ = 1,    τ ṙ = −μ(r − r0)        (4)

Similar to the discrete system, the rhythmic canonical system serves to provide both an amplitude signal ṽ = [r cos φ, r sin φ]^T and a phase variable mod(φ, 2π) to the basis functions Ψi of the control policy (z, y):

  τ ż = αz(βz(ym − y) − z),    τ ẏ = z + (Σ_{i=1}^N Ψi wi^T ṽ) / (Σ_{i=1}^N Ψi)        (5)

where ym is an anchor point for the oscillatory trajectory. Table 1 summarizes the proposed discrete and rhythmic CPs, and Figure 1 shows exemplary time evolutions of the complete systems.

2.3 Special Properties of Control Policies based on Dynamical Systems

Spatial and Temporal Invariance. An interesting property of both discrete and rhythmic CPs is that they are spatially and temporally invariant. Scaling of the goal g for the discrete CP and of the amplitude r0 for the rhythmic CP does not affect the topology of the attractor landscape. Similarly, the period (for the rhythmic system) and duration (for the discrete system) of the trajectory y is directly determined by the parameter τ. This means that the amplitudes and durations/periods of learned patterns can be independently modified without affecting the qualitative shape of the trajectory y. In section 3, we will exploit these properties to reuse a learned movement (such as a tennis swing, for instance) in novel conditions (e.g., toward new ball positions).

Robustness against Perturbations. When considering applications of our approach to physical systems, e.g., robots and humanoids, interactions with the environment may require an on-line modification of the policy. An obstacle can, for instance, block the trajectory of the robot, in which case large discrepancies between the desired positions generated by the control policy and the actual positions of the robot will occur. As outlined in [3], the dynamical system formulation allows feeding back an error term between actual and desired positions into the CPs, such that the time evolution of the policy is smoothly paused during a perturbation, i.e., the desired position y is modified to remain close to the actual position ỹ.
As soon as the\nperturbation stops, the CP rapidly resumes performing the (time-delayed) planned\ntrajectory. Note that other (task-speci\ufb02c) ways to cope with perturbations can be\ndesigned. Such on-line adaptations are one of the most interesting properties of\nusing autonomous di\ufb01erential equations for CPs.\n\nMovement Recognition Given the temporal and spatial invariance of our policy\nrepresentation, trajectories that are topologically similar tend to be \ufb02t by similar pa-\nrameters wi, i.e., similar trajectories at di\ufb01erent speeds and/or di\ufb01erent amplitudes\nwill result in similar wi. In section 3.3, we will use this property to demonstrate\nthe potential of using the CPs for movement recognition.\n\n3 Experimental Evaluations\n\n3.1 Learning of Rhythmic Control Policies by Imitation\n\nWe tested the proposed CPs in a learning by demonstration task with a humanoid\nrobot. The robot is a 1.9-meter tall 30 DOFs hydraulic anthropomorphic robot\nwith legs, arms, a jointed torso, and a head [9]. We recorded trajectories performed\nby a human subject using a joint-angle recording system, the Sarcos Sensuit (see\nFigure 2, top). The joint-angle trajectories are \ufb02tted by the CPs, with one CP\nper degree of freedom (DOF). The CPs are then used to replay the movement\nin the humanoid robot, using an inverse dynamics controller to track the desired\ntrajectories generated by the CPs. The actual positions ~y of each DOF are fed back\ninto the CPs in order to take perturbations into account.\n\nUsing the joint-angle recording system, we recorded a set of rhythmic movements\nsuch as tracing a \ufb02gure 8 in the air, or a drumming sequence on a bongo (i.e.\nwithout drumming sticks). Six DOFs for both arms were recorded (three at the\nshoulder, one at the elbow, and two at the wrist). An exemplary movement and its\nreplication by the robot is demonstrated in Figure 2 (top). 
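The rhythmic policy of Table 1 that underlies these fits can be simulated in the same way as the discrete one. The sketch below is illustrative only (Euler integration; all gains, kernel parameters, and the choice of identical weight vectors are our assumptions, not values from the paper); with identical weights wi it reduces to a sinusoidally forced linear system that settles onto a limit cycle.

```python
import numpy as np

def rhythmic_cp(w, c, h, tau=0.5 / np.pi, y_m=0.0, r0=1.0, mu=1.0,
                alpha_z=25.0, beta_z=25.0 / 4, dt=0.001, T=6.0):
    """Euler integration of the rhythmic CP of Table 1.
    w: (N, 2) kernel weights; c, h: kernel centers/bandwidths on [0, 2*pi)."""
    y = z = phi = 0.0
    r = r0
    ys = []
    for _ in range(int(T / dt)):
        v = np.array([r * np.cos(phi), r * np.sin(phi)])    # amplitude signal ~v
        psi = np.exp(-h * (np.mod(phi, 2 * np.pi) - c) ** 2)  # kernels anchored in phase
        f = psi @ (w @ v) / psi.sum()                       # normalized sum of w_i^T ~v
        y += dt * (z + f) / tau                             # Eq. 5
        z += dt * alpha_z * (beta_z * (y_m - y) - z) / tau
        phi += dt / tau                                     # Eq. 4: tau * dphi/dt = 1
        r += dt * (-mu) * (r - r0) / tau
        ys.append(y)
    return np.array(ys)
```

The period of y is 2πτ (1 s for the assumed τ) and its amplitude scales with r0, so both can be modulated on-line without refitting the weights.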
Figure 2 (left) shows the joint trajectories over one period of an exemplary drumming beat. Demonstrated and learned trajectories are superposed. For the learning, the base frequency was extracted by hand so as to provide the parameter τ to the rhythmic CP.

Figure 2: Top: Humanoid robot learning a figure-8 movement from a human demonstration. Left: Recorded drumming movement performed with both arms (6 DOFs per arm). The dotted lines and continuous lines correspond to one period of the demonstrated and learned trajectories, respectively. Right: Modification of the learned rhythmic pattern (flexion/extension of the right elbow, R_EB). A: trajectory learned by the rhythmic CP; B: temporary modification with r̃0 = 2r0; C: τ̃ = τ/2; D: ỹm = ym + 1 (dotted line), where r̃0, τ̃, and ỹm correspond to modified parameters between t = 3 s and t = 7 s. Movies of the human subject and the humanoid robot can be found at http://lslwww.epfl.ch/~ijspeert/humanoid.html.

Once a rhythmic movement has been learned by the CP, it can be modulated in several ways. Manipulating r0 and τ for all DOFs amounts to simultaneously modulating the amplitude and period of all DOFs, while keeping the same phase relation between DOFs. This might be particularly useful for a drumming task in order to replay the same beat pattern at different speeds and/or amplitudes. Alternatively, the r0 and τ parameters can be modulated independently for the DOFs of each arm, in order to change the beat pattern (doubling the frequency of one arm, for instance). Figure 2 (right) illustrates different modulations which can be generated by the rhythmic CPs. For reasons of clarity, only one DOF is shown. The rhythmic CP can smoothly modulate the amplitude, frequency, and baseline of the oscillations.

3.2 Learning of Discrete Control Policies by Imitation

In this experiment, the task for the robot was to learn tennis forehand and backhand swings demonstrated by a human wearing the joint-angle recording system. Once a particular swing has been learned, the robot is able to repeat the swing motion toward different Cartesian targets, by providing new goal positions g to the CPs for the different DOFs.
Using a system of two cameras, the position of the ball is given to an inverse kinematics algorithm which computes these new goals in joint space. When the new ball positions are not too distant from the original Cartesian target, the modified trajectories reach the ball with swing motions very similar to those used for the demonstration.

3.3 Movement Recognition using the Discrete Control Policies

Our learning algorithm, Locally Weighted Learning [8], automatically sets the number of kernel functions and their centers ci and widths hi depending on the complexity of the function to be approximated, with more kernel functions for highly nonlinear details of the movement.

Figure 3: Humanoid robot learning a forehand swing from a human demonstration.

An interesting aspect of locally weighted regression is that the regression parameters wi of each kernel i do not depend on the other kernels, since regression is based on a separate cost function for each kernel. This means that kernel functions can be added or removed without affecting the parameters wi of the other kernels.

We here use this feature to perform movement recognition within a large variety of trajectories, based on a small subset of kernels at fixed locations ci in phase space. These fixed kernels are common to the fitting of all trajectories, in addition to the kernels automatically added by the LWL algorithm. The stability of their parameters wi with respect to the other kernels generated by LWL makes them well-suited for comparing qualitative trajectory shapes.

To illustrate the possibility of using the CPs for movement recognition (i.e., recognition of spatiotemporal patterns, not just spatial patterns as in traditional character recognition), we carried out a simple task of fitting trajectories performed by a human user when drawing two-dimensional single-stroke patterns.
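The weight-space comparison this enables amounts to a cosine similarity between fitted weight vectors, combined with a nearest-neighbor rule. A minimal sketch (the function and variable names are ours, for illustration):

```python
import numpy as np

def correlation(wa, wb):
    """Cosine similarity w_a^T w_b / (|w_a| |w_b|) between the weight
    vectors of the fixed recognition kernels of two fitted movements."""
    return float(wa @ wb / (np.linalg.norm(wa) * np.linalg.norm(wb)))

def classify(w_query, library):
    """One-nearest-neighbor in weight space.
    library: list of (label, weight_vector) pairs of previously learned policies."""
    return max(library, key=lambda entry: correlation(w_query, entry[1]))[0]
```

Because the similarity is normalized, movements that differ only in amplitude or speed (and hence have proportional weight vectors) are mapped to the same class.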
The 26 letters of the Graffiti alphabet used in hand-held computers were chosen. These characters are drawn in a single stroke and are fed as a two-dimensional trajectory (x(t), y(t)) to be fitted by our system. Five examples of each character were presented (see Figure 4 for four examples).

Fixed sets of five kernels per DOF were set aside for movement recognition. The correlation wa^T wb / (|wa| |wb|) between the parameter vectors wa and wb of characters a and b can be used to classify movements with similar velocity profiles (Figure 4, right). For instance, for the 5 instances of the N, I, P, and S characters, the correlation is systematically higher with the four other examples of the same character. These similarities in weight space can therefore serve as a basis for recognizing demonstrated movements by fitting them and comparing the fitted parameters wi with those of previously learned policies in memory. In this example, a simple one-nearest-neighbor classifier in weight space would serve the purpose. Using such a classifier within the whole alphabet (5 instances of each letter), we obtained an 84% recognition rate (i.e., 110 out of the 130 instances had their highest correlation with an instance of the same letter). Further studies are required to evaluate the quality of recognition in larger training and test sets; what we wanted to demonstrate is the ability to perform recognition without any specific system tuning or sophisticated classification algorithm.

4 Conclusion

Based on the analogy between autonomous differential equations and control policies, we presented a novel approach to learn control policies of basic movement skills by shaping the attractor landscape of nonlinear differential equations with statistical learning techniques.
To the best of our knowledge, the presented approach is the first realization of a generic learning system for nonlinear dynamical systems that can guarantee basic stability and convergence properties of the learned nonlinear systems.

Figure 4: Left: Examples of two-dimensional trajectories fitted by the CPs. The demonstrated and fitted trajectories are shown with dotted and continuous lines, respectively. Right: Correlation between the weight vectors of the 20 characters (5 of each letter) fitted by the system. The gray scale is proportional to the correlation, with black corresponding to a correlation of +1 (max. correlation) and white to a correlation of 0 or smaller.
We demonstrated the applicability of the suggested techniques by learning various movement skills for a complex humanoid robot by imitation learning, and illustrated the usefulness of the learned parameterization for the recognition and classification of movement skills. Future work will consider (1) learning of multidimensional control policies without assuming independence between the individual dimensions, and (2) the suitability of the linear parameterization of the control policies for reinforcement learning.

Acknowledgments
This work was made possible by support from the US National Science Foundation (Awards 9710312 and 0082995), the ERATO Kawato Dynamic Brain Project funded by the Japan Science and Technology Corporation, the ATR Human Information Science Laboratories, and the Communications Research Laboratory (CRL).

References

[1] R. Sutton and A.G. Barto. Reinforcement learning: an introduction. MIT Press, 1998.

[2] F.A. Mussa-Ivaldi. Nonlinear force fields: a distributed system of control primitives for representing and learning movements. In IEEE International Symposium on Computational Intelligence in Robotics and Automation, pages 84-90. IEEE Computer Society, Los Alamitos, 1997.

[3] A.J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In IEEE International Conference on Robotics and Automation (ICRA2002), pages 1398-1403, 2002.

[4] A.J. Ijspeert, J. Nakanishi, and S. Schaal. Learning rhythmic movements by demonstration using nonlinear oscillators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2002), pages 958-963, 2002.

[5] S. Kawamura and N. Fukao. Interpolation for input torque patterns obtained through learning control. In Proceedings of the Third International Conference on Automation, Robotics and Computer Vision (ICARCV'94), 1994.

[6] H. Miyamoto, S. Schaal, F. Gandolfo, Y. Koike, R. Osu, E. Nakano, Y. Wada, and M. Kawato. A kendama learning robot based on bi-directional theory. Neural Networks, 9:1281-1302, 1996.

[7] S. Schaal. Learning from demonstration. In M.C. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 1040-1046. MIT Press, Cambridge, MA, 1997.

[8] S. Schaal and C.G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047-2084, 1998.

[9] C.G. Atkeson, J. Hale, M. Kawato, S. Kotosaka, F. Pollick, M. Riley, S. Schaal, S. Shibata, G. Tevatia, A. Ude, and S. Vijayakumar. Using humanoid robots to study human behavior. IEEE Intelligent Systems, 15:46-56, 2000.
", "award": [], "sourceid": 2140, "authors": [{"given_name": "Auke", "family_name": "Ijspeert", "institution": null}, {"given_name": "Jun", "family_name": "Nakanishi", "institution": null}, {"given_name": "Stefan", "family_name": "Schaal", "institution": null}]}