{"title": "Inverse Dynamics of Speech Motor Control", "book": "Advances in Neural Information Processing Systems", "page_first": 1043, "page_last": 1050, "abstract": null, "full_text": "Inverse Dynamics of Speech Motor Control\n\nMakoto Hirayama, Eric Vatikiotis-Bateson, Mitsuo Kawato*\n\nATR Human Information Processing Research Laboratories\n2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan\n\nAbstract\n\nProgress has been made in the computational implementation of speech production based on physiological data. An inverse dynamics model of the speech articulator's musculo-skeletal system, which is the mapping from articulator trajectories to electromyographic (EMG) signals, was modeled using the acquired forward dynamics model and temporal (smoothness of EMG activation) and range constraints. This inverse dynamics model allows the use of a faster speech motor control scheme, which can be applied to phoneme-to-speech synthesis via musculo-skeletal system dynamics, or to future use in speech recognition. The forward acoustic model, which is the mapping from articulator trajectories to the acoustic parameters, was improved by adding velocity and voicing information inputs to distinguish acoustic parameter differences caused by changes in source characteristics.\n\n1 INTRODUCTION\n\nModeling speech articulator dynamics is important not only for speech science, but also for speech processing. This is because many issues in speech phenomena, such as coarticulation or generation of aperiodic sources, are caused by temporal properties of speech articulator behavior due to musculo-skeletal system dynamics and constraints on neuro-motor command activation.\n\n* 
Also, Laboratory of Parallel Distributed Processing, Research Institute for Electronic Science, Hokkaido University, Sapporo, Hokkaido 060, Japan\n\nWe have proposed using neural networks for a computational implementation of speech production based on physiological activities of speech articulator muscles. In previous work (Hirayama, Vatikiotis-Bateson, Kawato and Jordan 1992; Hirayama, Vatikiotis-Bateson, Honda, Koike and Kawato 1993), a neural network learned the forward dynamics, relating motor commands to muscles and the ensuing articulator behavior. From movement trajectories, the forward acoustic network generated the acoustic PARCOR parameters (Itakura and Saito, 1969) that were then used to synthesize the speech acoustics. A cascade neural network containing the forward dynamics model along with a suitable smoothness criterion was used to produce a continuous motor command from a sequence of discrete articulatory targets corresponding to the phoneme input string.\n\nAlong the same line, we have extended our model of speech motor control. In this paper, we focus on modeling the inverse dynamics of the musculo-skeletal system. Having an inverse dynamics model allows us to use a faster control scheme, which permits phoneme-to-speech synthesis via musculo-skeletal system dynamics, and ultimately may be useful in speech recognition. The final section of this paper reports improvements in the forward acoustic model, which were made by incorporating articulator velocity and voicing information to distinguish the acoustic parameter differences caused by changes in source characteristics. 
\n\n2 INVERSE DYNAMICS MODELING OF MUSCULO-SKELETAL SYSTEM\n\nFrom the viewpoint of control theory, an inverse dynamics model of a controlled object plays an essential role in feedforward control. That is, an accurate inverse dynamics model outputs an appropriate control sequence that realizes a given desired trajectory by using only feedforward control without any feedback information, so long as there is no perturbation from the environment. For speech articulators, the main control scheme cannot rely upon feedback control because of sensory feedback delays. Thus, we believe that the inverse dynamics model is essential for biological motor control of speech and for any efficient speech synthesis algorithm based on physiological data.\n\nHowever, the speech articulator system is an excess-degrees-of-freedom system, so the mapping from articulator trajectory (position, velocity, acceleration) to electromyographic (EMG) activity is one-to-many. That is, different EMG combinations exist for the same articulator trajectory (for example, co-contraction of agonist and antagonist muscle pairs). Consequently, we applied the forward modeling approach to learning an inverse model (Jordan and Rumelhart, 1992), i.e., constrained supervised learning, as shown in Figure 1. The inputs of the inverse dynamics model are articulator positions, velocities, and accelerations; the outputs are rectified, integrated, and filtered EMG for relevant muscles. The forward dynamics model previously reported (Hirayama et al., 1993) was used for determining the error signals of the inverse dynamics model.\n\n[Figure 1 diagram: Desired Trajectory -> Inverse Model -> Control -> Forward Model -> Trajectory; Error.]\n\nFigure 1: Inverse dynamics modeling using a forward dynamics model (Jordan and Rumelhart, 1992).\n\nTo choose a realistic EMG pattern from among diverse possible solutions, we use both temporal and range constraints. The temporal constraint is related to the smoothness of EMG activation, i.e., minimizing EMG activation change (Uno, Suzuki, and Kawato, 1989). The minimum and maximum values of the range constraint were chosen using values obtained from the experimental data. Direct inverse modeling (Albus, 1975) was used to determine weights, which were then supplied as initial weights to the constrained supervised learning algorithm of Jordan and Rumelhart's (1992) inverse dynamics modeling method.\n\nFigure 2 shows an example of the inverse dynamics model output after learning, when a real articulator trajectory, not included in the training set, was given as the input. Note that the network output cannot be exactly the same as the actual EMG, as the network chooses a unique \"optimal\" EMG from many possible EMG patterns that appear in the actual EMG for the trajectory.\n\n[Figure 2 plot: Actual EMG and \"optimal\" EMG by IDM; amplitude 0.0 to 1.0 versus Time (s), 0 to 4.]\n\nFigure 2: After learning, the inverse model output \"optimal\" EMG (anterior belly of the digastric) for jaw lowering is compared with actual EMG for the test trajectory. 
\n\n[Figure 3 plot: Experimental data, Direct inverse modeling, and Inverse modeling using FDM; position, -0.7 to -0.3, versus Time (s), 0 to 4.]\n\nFigure 3: Trajectories generated by the forward dynamics network for the two methods of inverse dynamics modeling compared with the desired trajectory (experimental data).\n\nSince the inverse dynamics model was obtained by learning, when the desired trajectory is given to the inverse dynamics model, an articulator trajectory can be generated with the forward dynamics network previously reported (Hirayama et al., 1993). Figure 3 compares trajectories generated by the forward dynamics network using EMG derived from the direct inverse dynamics method or the constrained supervised learning algorithm (which uses the forward dynamics model to determine the inverse dynamics model's \"optimal\" EMG). The latter method yielded a 30.0% average reduction in acceleration prediction error over the direct method, thereby bringing the model output trajectory closer to the experimental data.\n\n3 TRAJECTORY FORMATION USING FORWARD AND INVERSE RELAXATION MODEL\n\nPreviously, to generate a trajectory from discrete phoneme-specific via-points, we used a cascade neural network (cf. Hirayama et al., 1992). The inverse dynamics model allows us to use an alternative network proposed by Wada and Kawato (1993) (Figure 4). The network uses both the forward and inverse models of the controlled object, and updates a given initial rough trajectory passing through the via-points according to the dynamics of the controlled object and a smoothness constraint on the control input.  
The computation time of the network is much shorter than that of the cascade neural network (Wada and Kawato, 1993).\n\nFigure 5 shows a forward dynamics model output trajectory driven by the model-generated motor control signals. Unlike Wada and Kawato's original model (1993), in which generated trajectories always pass through via-points, our trajectories were generated from smoothed motor control signals (i.e., after applying the smoothness constraint) and, consequently, do not pass through the exact via-points. In this paper, a typical value for each phoneme from experimental data was chosen as the target via-point and was given in Cartesian coordinates relative to the maxillary incisor. Although further investigation is needed to refine the phoneme-specific target specifications (e.g., lip aperture targets), reasonable coarticulated trajectories were obtained from series of discrete via-point targets (Figure 5). For engineering applications such as text-to-speech synthesizers using articulatory synthesis, this kind of technique is necessary because realistic coarticulated trajectories must serve as input to the articulatory synthesizer.\n\n[Figure 4 diagram: Articulatory Targets for the phonemes /a/, /u/, /i/, /s/, /t/.]\n\nFigure 4: Speech trajectory formation scheme modified from the forward and inverse relaxation neural network model (Wada and Kawato, 1993).\n\n[Figure 5 plot: Network output, Experimental data, and Phoneme specific targets; position, -0.7 to -0.3, versus Time (s), 0.0 to 1.2.]\n\nFigure 5: Jaw trajectory generated by the forward and inverse relaxation model. The output of the forward dynamics model is used for this plot.\n\nA further advantage of this network is that it can be used to predict phoneme-specific via-points from the realized trajectory (Wada, Koike, Vatikiotis-Bateson and Kawato, 1993). This capability will allow us to use our forward and inverse dynamics models for speech recognition in future, through acoustic to articulatory mapping (Shirai and Kobayashi, 1991; Papcun, Hochberg, Thomas, Laroche, Zacks and Levy, 1992) and the articulatory to phoneme-specific via-points mapping discussed above. Because trajectories may be recovered from a small set of phoneme-specific via-points, this approach should be readily applicable to problems of speech data compression.\n\n4 DYNAMIC MODELING OF FORWARD ACOUSTICS\n\nThe second area of progress is the improvement in the forward acoustic network. Previously (Hirayama et al., 1993), we demonstrated that acoustic signals can be obtained using a neural network that learns the mapping between articulator positions and acoustic PARCOR coefficients (Itakura and Saito, 1969; see also Markel and Gray, 1976).\n\nHowever, this modeling was effective only for vowels and a limited number of consonants because the architecture of the model was basically the same as that of static articulatory synthesizers (e.g., Mermelstein, 1973). For natural speech, aperiodic sources for plosive and sibilant consonants result 
in multiple sets of acoustic parameters for the same articulator configuration (i.e., the mapping is one-to-many); hence, learning did not fully converge. One approach to solving this problem is to make source modeling completely separate from the vocal tract area modeling. However, for synthesis of natural sentences, the vocal tract transfer function model requires another model for the non-glottal sources associated with consonant production. Since these sources are located at various points along the vocal tract, their interaction is extremely complex.\n\nOur approach to solving this one-to-many mapping is to have the neural network learn the acoustic parameters along with the sound source characteristic specific to each phoneme. Thus, we put articulator positions with their velocities and voiced/voiceless information (e.g., Markel and Gray, 1976) into the input (Figure 6) because the sound source characteristics are made not only by the articulator position but also by the dynamic movement of articulators.\n\n[Figure 6 diagram: Articulator Positions, Velocities & Voiced/Voiceless; Glottal source; Acoustic Wave.]\n\nFigure 6: Improved forward acoustic network. Inputs to the network are articulator positions and velocities and voiced/voiceless information.\n\nFor simulations, horizontal and vertical motions of jaw, upper and lower lips, and tongue tip and blade were used for the inputs and 12-dimensional PARCOR parameters were used for the outputs of the network. 
Figure 7(a) shows position-velocity-voiced/voiceless network output compared with position-only network output and experimentally obtained PARCOR parameters for a natural test sentence. Only the first two coefficients are shown. The first part of the test sentence, \"Sam sat on top of the potato cooker and waited for Tommy to cut up a bag of tiny tomatoes and pop the beat tips into the pot,\" is shown in this plot. Figures 7(b) and 7(c) show a part of the synthesized speech driven by fundamental frequency pulses for voiced sounds and random noise for voiceless sounds.\n\nBy using velocity and voiced/voiceless inputs, the performance was improved for natural utterances which include many vowels and consonants. The average values of the LPC-cepstrum distance measure between original and synthesized speech were 5.17 dB for the position-only network and 4.18 dB for the position-velocity-voiced/voiceless network. When listening to the output, the sentence can be understood, and almost all vowels and many of the consonants can be classified. The overall clarity is comparable to that of a noisy international telephone call, and some consonants are about as difficult to classify.\n\nAlthough there are other potential means to achieve further improvement (e.g., adding more tongue channels, using more balanced training patterns, incorporating nasality information, implementing better glottal and non-glottal sources), the network synthesizes quite smooth and reasonable acoustic signals by incorporating aspects of the articulator dynamics.\n\n5 CONCLUSION\n\nWe are modeling the information transfer from phoneme-specific articulatory targets to acoustic wave via the musculo-skeletal system, using a series of neural networks. 
Electromyographic (EMG) signals are used as a reflection of motor control commands. In this paper, we have focused on the inverse dynamics modeling of the musculo-skeletal system, its control for the transform from discrete linguistic information to continuous motor control signals, and articulatory speech synthesis using the articulator dynamics. We believe that modeling the dynamics of articulatory motions is a key issue both for elucidating mechanisms of speech motor control and for synthesis of natural utterances.\n\n[Figure 7 plots: (a) PARCOR for test, Position-only Network, and Position+Velocity+Voiced/Voiceless Network versus Time (s); (b) Original, Source (Noise + Pulse), and Synthesized signals versus time (seconds); (c) spectrograms.]\n\nFigure 7: (a) Model output PARCOR parameters. Only k1 and k2 are shown. (b) Original, source model, and synthesized acoustic signals. (c) Wideband spectrogram for the original and synthesized speech. Utterance shown is \"Sam sat on top\" from a test sentence.\n\nAcknowledgements\n\nWe thank Yoh'ichi Toh'kura for continuous encouragement. Further support was provided by HFSP grants to M.  
Kawato.\n\nReferences\n\nAlbus, J. S. (1975) A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Transactions of the ASME Journal of Dynamic Systems, Measurement, and Control, 220-227.\n\nHirayama, M., E. Vatikiotis-Bateson, M. Kawato, and M. I. Jordan (1992) Forward dynamics modeling of speech motor control using physiological data. In Moody, J. E., Hanson, S. J., and Lippmann, R. P. (eds.) Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers, 191-198.\n\nHirayama, M., E. Vatikiotis-Bateson, K. Honda, Y. Koike, and M. Kawato (1993) Physiologically based speech synthesis. In Giles, C. L., Hanson, S. J., and Cowan, J. D. (eds.) Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann Publishers, 658-665.\n\nItakura, F. and S. Saito (1969) Speech analysis and synthesis by partial correlation parameters. Proceeding of Japan Acoustic Society, 2-2-6 (In Japanese).\n\nJordan, M. I. and D. E. Rumelhart (1992) Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307-354.\n\nMermelstein, P. (1973) Articulatory model for the study of speech production. Journal of Acoustical Society of America, 53, 1070-1082.\n\nPapcun, J., J. Hochberg, T. R. Thomas, T. Laroche, J. Zacks, and S. Levy (1992) Inferring articulation and recognizing gestures from acoustics with a neural network trained on x-ray microbeam data. Journal of Acoustical Society of America, 92 (2) Pt. 1.\n\nShirai, K. and T. Kobayashi (1991) Estimation of articulatory motion using neural networks. Journal of Phonetics, 19, 379-385.\n\nUno, Y., R. Suzuki, and M.  
Kawato (1989) The minimum muscle tension change model which reproduces arm movement trajectories. Proceeding of the 4th Symposium on Biological and Physiological Engineering, 299-302 (In Japanese).\n\nWada, Y. and M. Kawato (1993) A neural network model for arm trajectory formation using forward and inverse dynamics models. Neural Networks, 6, 919-932.\n\nWada, Y., Y. Koike, E. Vatikiotis-Bateson, and M. Kawato (1993) Movement Pattern Recognition Based on the Minimization Principle. Technical Report of IEICE, NC93-23, 85-92 (In Japanese).\n", "award": [], "sourceid": 751, "authors": [{"given_name": "Makoto", "family_name": "Hirayama", "institution": null}, {"given_name": "Eric", "family_name": "Vatikiotis-Bateson", "institution": null}, {"given_name": "Mitsuo", "family_name": "Kawato", "institution": null}]}