{"title": "The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay", "book": "Advances in Neural Information Processing Systems", "page_first": 43, "page_last": 50, "abstract": null, "full_text": "The Emergence of Multiple Movement Units in\n\nthe Presence of Noise and Feedback Delay\n\nMichael Kositsky\n\nAndrew G. Barto\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\nAmherst, MA 01003-4610\n\n kositsky,barto\n\n@cs.umass.edu\n\nAbstract\n\nTangential hand velocity pro\ufb01les of rapid human arm movements of-\nten appear as sequences of several bell-shaped acceleration-deceleration\nphases called submovements or movement units. This suggests how the\nnervous system might ef\ufb01ciently control a motor plant in the presence of\nnoise and feedback delay. Another critical observation is that stochastic-\nity in a motor control problem makes the optimal control policy essen-\ntially different from the optimal control policy for the deterministic case.\nWe use a simpli\ufb01ed dynamic model of an arm and address rapid aimed\narm movements. We use reinforcement learning as a tool to approximate\nthe optimal policy in the presence of noise and feedback delay. Using\na simpli\ufb01ed model we show that multiple submovements emerge as an\noptimal policy in the presence of noise and feedback delay. The optimal\npolicy in this situation is to drive the arm\u2019s end point close to the target by\none fast submovement and then apply a few slow submovements to accu-\nrately drive the arm\u2019s end point into the target region. 
In our simulations, the controller sometimes generates corrective submovements before the initial fast submovement is completed, much like the predictive corrections observed in a number of psychophysical experiments.

1 Introduction

It has been consistently observed that rapid human arm movements in both infants and adults often consist of several submovements, sometimes called "movement units" [21]. The tangential hand velocity profiles of such movements show sequences of several bell-shaped acceleration-deceleration phases, sometimes overlapping in the time domain and sometimes completely separate. Multiple movement units are observed mostly in infant reaching [5, 21] and in reaching movements by adult subjects facing difficult time-accuracy requirements [15]. These data provide clues about how the nervous system efficiently produces fast and accurate movements in the presence of noise and significant feedback delay. Most modeling efforts concerned with movement units have addressed only the kinematic aspects of movement, e.g., [5, 12].

We show that multiple movement units can emerge as the result of a control policy that is optimal in the face of uncertainty and feedback delay. We use a simplified dynamic model of an arm and address rapid aimed arm movements. We use reinforcement learning as a tool to approximate the optimal policy in the presence of noise and feedback delay.

An important motivation for this research is that the stochasticity inherent in a motor control problem has a significant influence on the optimal control policy [9]. We follow the preliminary work of Zelevinsky [23], who showed that multiple movement units emerge from the stochasticity of the environment combined with a feedback delay.
Whereas he restricted attention to a finite-state system to which he applied dynamic programming, our model has a continuous state space and we use reinforcement learning in a simulated real-time learning framework.

2 The model description

The model we simulated is sketched in Figure 1. Its two main parts are the "RL controller" (Reinforcement Learning controller) and the "plant." The controller represents some functionality of the central nervous system dealing with the control of reaching movements. The plant represents a simplified arm together with spinal circuitry.

The controller generates the control signal, u, which influences how the state, s, of the plant changes over time. To simulate delayed feedback, the state of the plant is made available to the controller only after a delay period Δ, so at time t the controller can only observe s(t − Δ). To introduce stochasticity, we disturbed u by adding noise to it, producing a corrupted control ũ. The controller learns to move the plant state as quickly as possible into a small region about a target state s_T. The reward structure block in Figure 1 provides a negative unit reward when the plant's state is outside the target area of the state space, and zero reward when the plant state is within the target area. The reinforcement learning controller tries to maximize the total cumulative reward for each movement. With this reward structure, the faster the plant is driven into the target region, the less negative reward is accumulated during the movement.
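This reward structure can be sketched concretely. The function name and signature below are ours; the target position (5 cm), target position radius (0.5 cm), and threshold velocity radius (0.1 cm/s) come from Table 1:

```python
def reward(position, velocity, target=5.0, pos_radius=0.5, vel_radius=0.1):
    """-1 per step outside the target region, 0 inside (units: cm, cm/s)."""
    in_target = abs(position - target) <= pos_radius and abs(velocity) <= vel_radius
    return 0.0 if in_target else -1.0
```

Because every time step spent outside the target region costs one unit of reward, maximizing cumulative reward is equivalent to minimizing the time to reach the target.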
Thus this reward structure specifies the minimum time-to-goal criterion.

Figure 1: Sketch of the model used in our simulations. "RL controller" stands for a Reinforcement Learning controller. The controller outputs the activation u; noise corrupts it into ũ, which drives the plant; the plant state s is fed back to the controller through a delay, together with an efferent copy of u, and the reward structure compares s with the target state.

2.1 The plant

To model arm dynamics together with the spinal reflex mechanisms we used a fractional-power damping dynamic model [22]. The simplest model that captures the most critical dynamical features is a spring-mass system with a nonlinear damping:

    m ẍ + b ẋ^{1/5} + k (x − u) = 0,

where ẋ^{1/5} denotes the sign-preserving fifth root sgn(ẋ)|ẋ|^{1/5}. Here, x is the position of the mass attached to the spring, ẋ and ẍ are respectively the velocity and the acceleration of the object, m is the mass of the object (the mass of the spring is assumed equal to zero), k is the stiffness coefficient, b is the damping coefficient, and u is the control signal, which determines the resting, or equilibrium, position. Later in this paper we call u the activation, referring to the activation level of a muscle pair.
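As a concrete illustration, the dynamics above can be integrated numerically. This is a minimal sketch, not the authors' code; the parameter values m, b, k below are placeholders (the published values come from Barto et al. [3]):

```python
import math

def simulate_plant(u, x0=0.0, v0=0.0, m=1.0, b=3.0, k=30.0, dt=0.001, T=2.0):
    """Euler-integrate  m*x'' = -b*sgn(x')*|x'|**(1/5) - k*(x - u)."""
    x, v = x0, v0
    for _ in range(int(T / dt)):
        damping = b * math.copysign(abs(v) ** 0.2, v)  # sign-preserving fifth root
        a = (-damping - k * (x - u)) / m
        v += a * dt
        x += v * dt
    return x, v
```

Because the fractional-power damping force stays close to b in magnitude even at very small speeds, the mass tends to creep and then stall near the equilibrium, which is the stiction behavior discussed below.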
Table 1: Parameter values used in the simulations.

    basic simulation time step        1 ms
    feedback delay, Δ                 200 ms
    initial position                  0 cm
    initial velocity                  0 cm/s
    target position                   5 cm
    target velocity                   0 cm/s
    target position radius            0.5 cm
    threshold velocity radius         0.1 cm/s
    standard deviation of the noise   1 cm
    value function learning rate      0.5
    preferences learning rate         1
    discount factor, γ                0.9
    bootstrapping factor, λ           0.9

Values for the mass m, the damping coefficient b, and the stiffness coefficient k were taken from Barto et al. [3]. These values produce movement trajectories qualitatively similar to those observed in human wrist movements [22].

The fractional-power damping in this model is motivated by both biological evidence [8, 14] and computational considerations. Controlling a system with such a concave damping function is an easier control problem than controlling a system with apparently simpler linear damping. Fractional-power damping creates a qualitatively novel dynamical feature called a stiction region: a region of the position space around the equilibrium position consisting of pseudo-stable states, in which the velocity of the plant remains very close to zero. Such states are stable for all practical purposes. For the parameter magnitudes used in our simulations, the stiction region has a radius of 2.5 cm about the true equilibrium position.

Another essential feature of neural signal transmission can be accounted for by using a cascade of low-pass temporal filters on the activation level u [16].
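One simple realization of such a cascade is two chained first-order stages; the discrete-update form and step size below are our assumptions:

```python
def filter_activation(signal, dt=0.001, tau=0.025):
    """Second-order low-pass filter as two chained first-order stages,
    each with time constant tau (25 ms, the value used in the simulations)."""
    y1 = y2 = signal[0]
    smoothed = []
    for u in signal:
        y1 += (dt / tau) * (u - y1)   # first stage tracks the raw activation
        y2 += (dt / tau) * (y1 - y2)  # second stage tracks the first
        smoothed.append(y2)
    return smoothed
```

Applied to a step change in the activation, this produces the smooth, delayed rise shown by the dashed line in Figure 3(c).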
We used a second-order low-pass filter with a time constant of 25 ms.

2.2 The reinforcement learning controller

We used the version of the actor-critic algorithm described by Sutton and Barto [20]. A possible model of how an actor-critic architecture might be implemented in the nervous system was suggested by Barto [2] and Houk et al. [10]. We implemented the actor-critic algorithm for a continuous state space and a finite set of actions, i.e., activation level magnitudes u evenly spaced every 1 cm between 0 cm and 10 cm. To represent functions defined over the continuous state space we used a CMAC representation [1] with 10 tilings; each tiling spans all three dimensions of the state space and has 10 tiles per dimension. The tilings have random offsets drawn from a uniform distribution. Learning is done in episodes. At the beginning of each episode the plant is in a fixed initial state, and an episode is complete when the plant reaches the target region of the state space. Table 1 shows the parameter values used in the simulations. Refer to [20] for algorithm details.

2.3 Clocking the control signal

For the controller to have sufficient information about the current state of the plant, the controller's internal representation of the state should be augmented by a vector of all the actions selected during the last delay period. To keep the dimension of the state space at a feasible level, we restrict the set of available policies and make the controller select a new activation level, u, in a clocked manner at time intervals equal to the delay period. One step of the reinforcement learning controller is thus performed once per delay period, which corresponds to many steps of the underlying plant simulation. To simulate a stochastic plant we added Gaussian noise to u.
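The CMAC representation of Section 2.2 can be sketched as follows. This is an illustrative implementation under our own conventions (state components scaled to [0, 1), dictionary-backed weights, equal credit per tiling); only the 10 tilings, 10 tiles per dimension, three state dimensions, and random offsets come from the text:

```python
import random

class CMAC:
    """Tile coding over a 3-D state space: 10 tilings, 10 tiles per
    dimension, each tiling shifted by a random offset."""
    def __init__(self, n_tilings=10, n_tiles=10, dims=3, seed=0):
        rng = random.Random(seed)
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.offsets = [[rng.random() / n_tiles for _ in range(dims)]
                        for _ in range(n_tilings)]
        self.w = {}  # sparse weights, one per active tile

    def features(self, state):
        """Active (tiling, tile-index) pairs for a state scaled to [0, 1)."""
        feats = []
        for t, off in enumerate(self.offsets):
            idx = tuple(min(self.n_tiles - 1, int((s + o) * self.n_tiles))
                        for s, o in zip(state, off))
            feats.append((t, idx))
        return feats

    def value(self, state):
        return sum(self.w.get(f, 0.0) for f in self.features(state))

    def update(self, state, target, alpha=0.1):
        feats = self.features(state)
        err = target - self.value(state)
        for f in feats:
            self.w[f] = self.w.get(f, 0.0) + alpha * err / len(feats)
```

Because nearby states activate overlapping tiles, an update at one state generalizes to its neighbors, while distant states are unaffected.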
A new Gaussian disturbance was drawn every time a new activation level was selected.

Apart from this computational motivation, there is evidence of intermittent motor control in human subjects [13]. In our simulations we use an oversimplified special kind of intermittent control, with a piecewise constant control signal whose magnitude changes at equal time intervals; this is done for the sake of faster simulations and overall clarity. Intermittent control does not necessarily assume this particular kind of control signal; the most important feature is that control segments are selected at particular points in time, and each control segment determines the control signal for an extended time interval. The time interval until selection of the next control segment can itself be one of the parameters [11].

3 Results

The model learned to move the mass quickly and accurately to the target in approximately 1,000 episodes. Figure 2 shows the corresponding learning curve. Figure 3 shows a typical movement accomplished by the controller after learning. The movement shown in Figure 3 has two acceleration-deceleration phases, called movement units or submovements.

Figure 2: The learning curve averaged over 100 trials. Performance is measured as time per episode.

Corrective submovements may occur before the plant reaches zero velocity. The controller generates such a corrective submovement "on the fly," i.e., before the initial fast submovement is completed. Figure 4 shows a sample movement accomplished by the controller after learning in which such overlapping submovements occur.
This can be seen clearly in panel (b) of Figure 4, where the velocity profile of the movement is shown. Each of the submovements appears as a bell-shaped unit in the tangential velocity plot.

Sometimes the controller accomplishes a movement with a single smooth submovement. A sample of such a movement is shown in Figure 5.

Figure 3: A sample movement accomplished by the controller after learning. Panels (a) and (b) show the position and velocity time courses respectively. Panel (c) shows the activation time courses: the thin solid line shows the activation u selected by the controller; the thick solid line shows the disturbed activation ũ, which is sent as the control signal to the plant; the dashed line shows the activation after the temporal filtering is applied. Panel (d) shows the phase trajectory of the movement. The thick bar at the lower-right corner is the target region.

4 Discussion

The model learns to produce movements that are fast and accurate in the presence of noise and delayed sensory feedback. Most of the movements consist of several submovements. The first submovement is always fast and covers most of the distance from the initial position to the target.
All of the subsequent submovements are much slower and cover much shorter segments in the position space.

This feature is in good agreement with the dual control model [12, 17], in which the initial part of a movement is conducted in a ballistic manner and the final part is conducted under closed-loop control. Some evidence for this kind of dual control strategy comes from experiments in which subjects were given visual feedback only during the initial stage of movement. Subjects did not show significant improvement under these conditions compared to trials in which they were deprived of visual feedback during the entire movement [4, 6]. In another set of experiments, proprioceptive feedback was altered by stimulation of muscle tendons. Movement accuracy decreased only when the stimulation was applied at the final stages of movement [18]. Note, however, that the dual control strategy is not explicitly designed into our model; it emerges naturally from the existing constraints and conditions.

The reinforcement learning controller is encouraged by the reward structure to accomplish each movement as quickly as possible. On the other hand, it faces high uncertainty in the plant behavior.
In states with low velocities, the information available to the controller determines the actual state of the plant quite accurately, as opposed to states with high velocities. If the controller were to adopt a policy of attempting to hit the target directly with one fast submovement, it would very often miss the target and spend a long additional time accomplishing the task. The optimal policy in this situation is to move the arm close to the target with one fast submovement and then apply a few slow submovements to accurately move the arm into the target region.

Figure 4: A sample movement accomplished by the controller after learning, with a clearly expressed predictive correction.

The model learns to produce control sequences consisting of pairs of high-activation steps followed by low-activation steps. This feature is in good agreement with pulse-step models of motor control [7, 19]. Each pulse-step combination produces a submovement characterized by a bell-shaped unit in the velocity profile.

In biological motor control, corrective submovements are observed very consistently, including both overlapping and separate submovements. In the case of overlapping submovements, the corrective movement is called a predictive correction. Multiple submovements are observed mostly in infant reaching [5].
Adults ordinarily perform routine everyday reaching movements with a single smooth submovement, but under tight time constraints or accuracy requirements they revert to multiple submovements [15]. The suggested model sometimes accomplishes movements with a single smooth submovement (see Figure 5), but in most cases it produces multiple submovements, much like an infant or an adult subject trying to move quickly and accurately.

Figure 5: A sample movement accomplished by the controller after learning, with a single smooth submovement.

The suggested model is also consistent with theories of basal ganglia information processing for motor control [10]. Some of these theories suggest that dopamine neurons in the basal ganglia carry information similar to the secondary reinforcement (or temporal difference) signal in the actor-critic controller, i.e., information about how the expected performance (time-to-target) changes over time during a movement. A possible use of this kind of information is for initiating corrective submovements before the current movement is completed. This kind of behavior is exhibited by our model (Figure 4).

Acknowledgments

This work was supported by NIH Grant MH 48185-09. We thank Andrew H. Fagg and Michael T. Rosenstein for helpful comments.

References

[1] J. S. Albus. A new approach to manipulator control: the cerebellar model articulation controller (CMAC).
Journal of Dynamic Systems, Measurement, and Control, 97:220–227, 1975.

[2] A. G. Barto. Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, and D. G. Beiser, editors, Models of Information Processing in the Basal Ganglia, pages 215–232. MIT Press, Cambridge, MA, 1995.

[3] A. G. Barto, A. H. Fagg, N. Sitkoff, and J. C. Houk. A cerebellar model of timing and prediction in the control of reaching. Neural Computation, 11:565–594, 1999.

[4] D. Beaubaton and L. Hay. Contribution of visual information to feedforward and feedback processes in rapid pointing movements. Human Movement Science, 5:19–34, 1986.

[5] N. E. Berthier. Learning to reach: a mathematical model. Developmental Psychology, 32:811–832, 1996.

[6] L. G. Carlton. Processing of visual feedback information for movement control. Journal of Experimental Psychology: Human Perception and Performance, 7:1019–1030, 1981.

[7] C. Ghez. Contributions of central programs to rapid limb movement in the cat. In H. Asanuma and V. J. Wilson, editors, Integration in the Nervous System, pages 305–320. Igaku-Shoin, Tokyo, 1979.

[8] C. C. A. M. Gielen and J. C. Houk. A model of the motor servo: incorporating nonlinear spindle receptor and muscle mechanical properties. Biological Cybernetics, 57:217–231, 1987.

[9] C. M. Harris and D. M. Wolpert. Signal-dependent noise determines motor planning. Nature, 394:780–784, 1998.

[10] J. C. Houk, J. L. Adams, and A. G. Barto. A model of how the basal ganglia generates and uses neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, and D. G. Beiser, editors, Models of Information Processing in the Basal Ganglia, pages 249–270. MIT Press, Cambridge, MA, 1995.

[11] M. Kositsky. Motor Learning and Skill Acquisition by Sequences of Elementary Actions. PhD thesis, The Weizmann Institute of Science, Israel, October 1998.

[12] D. E. Meyer, S. 
Kornblum, R. A. Abrams, C. E. Wright, and J. E. K. Smith. Optimality in human motor performance: ideal control of rapid aimed movements. Psychological Review, 95(3):340–370, 1988.

[13] R. C. Miall, D. J. Weir, and J. F. Stein. Intermittency in human manual tracking tasks. Journal of Motor Behavior, 25:53–63, 1993.

[14] L. E. Miller. Reflex stiffness of the human wrist. Master's thesis, Department of Physiology, Northwestern University, Evanston, IL, 1984.

[15] K. E. Novak, L. E. Miller, and J. C. Houk. Kinematic properties of rapid hand movements in a knob turning task. Experimental Brain Research, 132:419–433, 2000.

[16] L. D. Partridge. Integration in the central nervous system. In J. H. U. Brown and S. S. Gann, editors, Engineering Principles in Physiology, pages 47–98. Academic Press, New York, 1973.

[17] R. Plamondon and A. M. Alimi. Speed/accuracy trade-offs in target-directed movements. Behavioral and Brain Sciences, 20:279–349, 1997.

[18] C. Redon, L. Hay, and J.-L. Velay. Proprioceptive control of goal-directed movements in man studied by means of vibratory muscle tendon stimulation. Journal of Motor Behavior, 23:101–108, 1991.

[19] D. A. Robinson. Oculomotor control signals. In G. Lennerstrand and P. Bach-y-Rita, editors, Basic Mechanisms of Ocular Motility and Their Clinical Implications, pages 337–374. Pergamon Press, Oxford, 1975.

[20] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[21] C. von Hofsten. Structuring of early reaching movements: a longitudinal study. Journal of Motor Behavior, 23:280–292, 1991.

[22] C. H. Wu, J. C. Houk, K. Y. Young, and L. E. Miller. Nonlinear damping of limb motion. In J. M. Winters and S. L.-Y. Woo, editors, Multiple Muscle Systems: Biomechanics and Movement Organization, pages 214–235. Springer-Verlag, New York, 1990.

[23] L. Zelevinsky. 
Does time-optimal control of a stochastic system with sensory delay produce movement units? Master's thesis, University of Massachusetts, Amherst, 1998.", "award": [], "sourceid": 2080, "authors": [{"given_name": "Michael", "family_name": "Kositsky", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}