{"title": "Hoo Optimality Criteria for LMS and Backpropagation", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 358, "abstract": "", "full_text": "Hoo  Optimality Criteria for  LMS  and \n\nBackpropagation \n\nBabak Hassibi \n\nInformation Systems Laboratory \n\nStanford University \nStanford,  CA  94305 \n\nAli  H.  Sayed \n\nDept.  of Elec.  and Compo  Engr. \n\nUniversity of California Santa Barbara \n\nSanta Barbara,  CA  93106 \n\nThomas Kailath \n\nInformation Systems Laboratory \n\nStanford University \nStanford, CA 94305 \n\nAbstract \n\nWe  have  recently  shown that the  widely  known  LMS  algorithm is \nan H OO  optimal estimator.  The H OO  criterion  has been introduced, \ninitially in  the  control  theory  literature,  as  a  means to ensure  ro(cid:173)\nbust  performance  in  the  face  of model  uncertainties  and  lack  of \nstatistical  information on  the  exogenous  signals.  We  extend  here \nour  analysis  to  the  nonlinear  setting  often  encountered  in  neural \nnetworks,  and show  that the  backpropagation  algorithm is  locally \nH OO  optimal.  This fact  provides  a  theoretical justification  of the \nwidely  observed  excellent  robustness  properties  of the  LMS  and \nbackpropagation algorithms.  We further discuss  some implications \nof these  results. \n\n1 \n\nIntroduction \n\nThe LMS algorithm was originally conceived as an approximate recursive  procedure \nthat solves the following problem (Widrow and Hoff,  1960):  given a sequence of n x 1 \ninput column vectors {hd, and a corresponding sequence of desired scalar responses \n{ di },  find  an  estimate of an  n  x  1 column  vector  of weights  w  such  that  the  sum \nof squared  errors,  L:~o Idi  - hi w1 2 ,  is  minimized.  The  LMS  solution  recursively \n\n351 \n\n\f352 \n\nHassibi. Sayed. and Kailath \n\nupdates estimates of the weight  vector along the direction of the instantaneous gra(cid:173)\ndient  of the  squared  error.  It has  long  been  known  that  LMS  is  an  approximate \nminimizing solution to the above least-squares (or H2)  minimization problem.  Like(cid:173)\nwise,  the celebrated backpropagation algorithm (Rumelhart and McClelland,  1986) \nis an extension of the gradient-type approach to nonlinear cost functions of the form \n2:~o I di  - hi ( W ) 12 ,  where  hi ( .)  are  known  nonlinear functions  (e. g.,  sigmoids).  It \nalso  updates  the  weight  vector  estimates  along  the  direction  of the  instantaneous \ngradients. \n\nWe  have  recently  shown  (Hassibi,  Sayed  and  Kailath,  1993a)  that  the  LMS  algo(cid:173)\nrithm is  an  H<Xl-optimal filter,  where  the  H<Xl  norm has  recently  been  introduced \nas a  robust criterion for  problems in  estimation and control  (Zames,  1981).  In gen(cid:173)\neral  terms,  this  means  that  the  LMS  algorithm,  which  has  long  been  regarded  as \nan approximate least-mean squares solution, is  in fact  a minimizer of the H<Xl  error \nnorm  and  not  of the  JI2  norm.  This statement  will  be  made more  precise  in  the \nnext  few  sections.  In  this  paper,  we  extend  our  results  to a  nonlinear  setting that \noften  arises  in  the  study  of neural  networks,  and  show  that  the  backpropagation \nalgorithm is  a  locally  H<Xl-optimal filter.  
We have recently shown (Hassibi, Sayed and Kailath, 1993a) that the LMS algorithm is an H∞-optimal filter, where the H∞ norm has recently been introduced as a robust criterion for problems in estimation and control (Zames, 1981). In general terms, this means that the LMS algorithm, which has long been regarded as an approximate least-mean-squares solution, is in fact a minimizer of the H∞ error norm and not of the $H^2$ norm. This statement will be made more precise in the next few sections. In this paper, we extend our results to a nonlinear setting that often arises in the study of neural networks, and show that the backpropagation algorithm is a locally H∞-optimal filter. These facts readily provide a theoretical justification for the widely observed excellent robustness and tracking properties of the LMS and backpropagation algorithms, as compared to, for example, exact least squares methods such as RLS (Haykin, 1991).

In this paper we attempt to introduce the main concepts, motivate the results, and discuss the various implications. We shall, however, omit the proofs for reasons of space. The reader is referred to (Hassibi et al. 1993a), and to the expanded version of this paper, for the necessary details.

2 Linear H∞ Adaptive Filtering

We shall begin with the definition of the H∞ norm of a transfer operator. As will presently become apparent, the motivation for introducing the H∞ norm is to capture the worst case behaviour of a system.

Let $h_2$ denote the vector space of square-summable complex-valued causal sequences $\{f_k, 0 \le k < \infty\}$, viz.,

$$h_2 = \left\{ \{f_k\} \ \text{such that} \ \sum_{k=0}^{\infty} f_k^* f_k < \infty \right\}$$

with inner product $\langle \{f_k\}, \{g_k\} \rangle = \sum_{k=0}^{\infty} f_k^* g_k$, where $*$ denotes complex conjugation. Let $T$ be a transfer operator that maps an input sequence $\{u_i\}$ to an output sequence $\{y_i\}$. Then the H∞ norm of $T$ is equal to

$$\|T\|_\infty = \sup_{u \neq 0, \, u \in h_2} \frac{\|y\|_2}{\|u\|_2}$$

where the notation $\|u\|_2$ denotes the $h_2$-norm of the causal sequence $\{u_k\}$, viz., $\|u\|_2^2 = \sum_{k=0}^{\infty} u_k^* u_k$. The H∞ norm may thus be regarded as the maximum energy gain from the input $u$ to the output $y$.

Suppose we observe an output sequence $\{d_i\}$ that obeys the following model:

$$d_i = h_i^T w + v_i \qquad (1)$$

where $h_i^T = [\, h_{i1} \ h_{i2} \ \cdots \ h_{in} \,]$ is a known input vector, $w$ is an unknown weight vector, and $\{v_i\}$ is an unknown disturbance, which may also include modeling errors. We shall not make any assumptions on the noise sequence $\{v_i\}$ (such as whiteness, normal distribution, etc.).

Let $w_i = \mathcal{F}(d_0, d_1, \ldots, d_i)$ denote the estimate of the weight vector $w$ given the observations $\{d_j\}$ from time 0 up to and including time $i$. The objective is to determine the functional $\mathcal{F}$, and consequently the estimate $w_i$, so as to minimize a certain norm defined in terms of the prediction error

$$e_i = h_i^T w - h_i^T w_{i-1}$$

which is the difference between the true (uncorrupted) output $h_i^T w$ and the predicted output $h_i^T w_{i-1}$. Let $T$ denote the transfer operator that maps the unknowns $\{w - w_{-1}, \{v_i\}\}$ (where $w_{-1}$ denotes an initial guess of $w$) to the prediction errors $\{e_i\}$. The H∞ estimation problem can now be formulated as follows.

Problem 1 (Optimal H∞ Adaptive Problem) Find an H∞-optimal estimation strategy $w_i = \mathcal{F}(d_0, d_1, \ldots, d_i)$ that minimizes $\|T\|_\infty$, and obtain the resulting

$$\gamma_{opt}^2 = \inf_{\mathcal{F}} \|T\|_\infty^2 = \inf_{\mathcal{F}} \sup_{w, \, v \in h_2} \frac{\|e\|_2^2}{\mu^{-1} |w - w_{-1}|^2 + \|v\|_2^2} \qquad (2)$$

where $|w - w_{-1}|^2 = (w - w_{-1})^T (w - w_{-1})$, and $\mu$ is a positive constant that reflects a priori knowledge as to how close $w$ is to the initial guess $w_{-1}$.

Note that the infimum in (2) is taken over all causal estimators $\mathcal{F}$. The above problem formulation shows that H∞ optimal estimators guarantee the smallest prediction error energy over all possible disturbances of fixed energy. H∞ estimators are thus overconservative, which is reflected in a more robust behaviour to disturbance variation.
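As a way of reading (2), the following small helper (our own illustration; the function name and arguments are not from the paper) evaluates, for one fixed choice of the unknowns, the energy-gain ratio whose supremum over all disturbances the H∞ criterion bounds:

import numpy as np

def hinf_ratio(pred_errors, w, w_init, v, mu):
    # Ratio appearing inside the supremum in (2):
    #   ||e||_2^2 / (mu^{-1} |w - w_{-1}|^2 + ||v||_2^2)
    # An H-infinity optimal estimator keeps this at or below gamma_opt^2
    # for every weight vector w and every square-summable disturbance v.
    num = np.sum(np.asarray(pred_errors) ** 2)
    den = np.sum((np.asarray(w) - np.asarray(w_init)) ** 2) / mu + np.sum(np.asarray(v) ** 2)
    return num / den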
Before stating our first result we shall call the input vectors $\{h_i\}$ exciting if, and only if,

$$\lim_{N \to \infty} \sum_{i=0}^{N} h_i^T h_i = \infty$$

Theorem 1 (LMS Algorithm) Consider the model (1), and suppose we wish to minimize the H∞ norm of the transfer operator from the unknowns $w - w_{-1}$ and $v_i$ to the prediction errors $e_i$. If the input vectors $h_i$ are exciting and

$$0 < \mu < \inf_i \frac{1}{h_i^T h_i} \qquad (3)$$

then the minimum H∞ norm is $\gamma_{opt} = 1$. In this case an optimal H∞ estimator is given by the LMS algorithm with learning rate $\mu$, viz.

$$w_i = w_{i-1} + \mu h_i (d_i - h_i^T w_{i-1}) \qquad (4)$$

In other words, the result states that the LMS algorithm is an H∞-optimal filter. Moreover, Theorem 1 also gives an upper bound on the learning rate $\mu$ that ensures the H∞ optimality of LMS. This is in accordance with the well-known fact that LMS behaves poorly if the learning rate is too large.

Intuitively it is not hard to convince oneself that $\gamma_{opt}$ cannot be less than one. To this end suppose that the estimator has chosen some initial guess $w_{-1}$. Then one may conceive of a disturbance that yields an observation that coincides with the output expected from $w_{-1}$, i.e.,

$$h_i^T w_{-1} = h_i^T w + v_i = d_i$$

In this case one expects that the estimator will not change its estimate of $w$, so that $w_i = w_{-1}$ for all $i$. Thus the prediction error is

$$e_i = h_i^T w - h_i^T w_{i-1} = h_i^T w - h_i^T w_{-1} = -v_i$$

and the ratio in (2) can be made arbitrarily close to one.

The surprising fact, though, is that $\gamma_{opt}$ is one and that the LMS algorithm achieves it. What this means is that LMS guarantees that the energy of the prediction error will never exceed the energy of the disturbances. This is not true for other estimators. For example, in the case of the recursive least-squares (RLS) algorithm, one can come up with a disturbance of arbitrarily small energy that will yield a prediction error of large energy.

To demonstrate this, we consider a special case of model (1) where $h_i$ is now a scalar that randomly takes on the values $+1$ or $-1$. For this model $\mu$ must be less than 1, and we chose the value $\mu = 0.9$. We compute the H∞ norm of the transfer operator from the disturbances to the prediction errors for both RLS and LMS. We also compute the worst case RLS disturbance, and show the resulting prediction errors. The results are illustrated in Fig. 1. As can be seen, the H∞ norm in the RLS case increases with the number of observations, whereas in the LMS case it remains constant at one. Using the worst case RLS disturbance, the prediction error due to the LMS algorithm goes to zero, whereas the prediction error due to the RLS algorithm does not. The form of the worst case RLS disturbance is also interesting; it competes with the true output early on, and then goes to zero.
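A numerical sketch of this comparison can be written in a few lines. The code below is our own reconstruction of the experiment (the RLS initialization $P_{-1} = \mu$, the horizon of 50 samples, and the random seed are assumptions, not specifications from the paper): because both estimators are linear in the unknowns $(w - w_{-1}, \{v_i\})$ for a fixed input sequence, the finite-horizon transfer operator can be built column by column and its H∞ norm taken as its largest singular value.

import numpy as np

def lms_errors(h, w, v, mu=0.9, w0=0.0):
    # Prediction errors e_i = h_i (w - w_{i-1}) of scalar LMS run on d_i = h_i w + v_i.
    west, errs = w0, []
    for hi, vi in zip(h, v):
        errs.append(hi * (w - west))
        d = hi * w + vi
        west = west + mu * hi * (d - hi * west)
    return np.array(errs)

def rls_errors(h, w, v, p0=0.9, w0=0.0):
    # Prediction errors of scalar RLS; initial 'covariance' P_{-1} = p0 (assumed equal to mu).
    west, p, errs = w0, p0, []
    for hi, vi in zip(h, v):
        errs.append(hi * (w - west))
        d = hi * w + vi
        k = p * hi / (1.0 + hi * p * hi)
        west = west + k * (d - hi * west)
        p = p - k * hi * p
    return np.array(errs)

def hinf_norm(err_fn, h, mu):
    # Largest singular value of the linear map from the normalized unknowns
    # [ (w - w_{-1}) / sqrt(mu), v_0, ..., v_{N-1} ] to the prediction errors.
    N = len(h)
    cols = [err_fn(h, np.sqrt(mu), np.zeros(N))]  # response to the initial weight uncertainty
    for i in range(N):                            # responses to unit disturbances v_i
        v = np.zeros(N)
        v[i] = 1.0
        cols.append(err_fn(h, 0.0, v))
    T = np.stack(cols, axis=1)
    return np.linalg.svd(T, compute_uv=False)[0]

h = np.random.default_rng(0).choice([-1.0, 1.0], size=50)
print('LMS H-infinity norm:', hinf_norm(lms_errors, h, mu=0.9))  # stays at (or just below) one
print('RLS H-infinity norm:', hinf_norm(rls_errors, h, mu=0.9))  # typically grows with the horizon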
We should mention that the LMS algorithm is only one of a family of H∞ optimal estimators. However, LMS corresponds to what is called the central solution, and it has the additional properties of being the maximum entropy solution and the risk-sensitive optimal solution (Whittle 1990, Glover and Mustafa 1989, Hassibi et al. 1993b).

If there is no disturbance in (1) we have the following.

Corollary 1 If in addition to the assumptions of Theorem 1 there is no disturbance in (1), then LMS guarantees $\|e\|_2^2 \le \mu^{-1} |w - w_{-1}|^2$, meaning that the prediction error converges to zero.

Note that the above Corollary suggests that the larger $\mu$ is (provided (3) is satisfied), the faster the convergence will be.

Before closing this section we should mention that if instead of the prediction error one were to consider the filtered error $e_{f,i} = h_i^T w - h_i^T w_i$, then the H∞ optimal estimator is the so-called normalized LMS algorithm (Hassibi et al. 1993a).

Figure 1: H∞ norm of the transfer operator as a function of the number of observations for (a) RLS, and (b) LMS. The true output and the worst case disturbance signal (dotted curve) for RLS are given in (c). The predicted errors for the RLS (dashed) and LMS (dotted) algorithms corresponding to this disturbance are given in (d). The LMS predicted error goes to zero while the RLS predicted error does not.

3 Nonlinear H∞ Adaptive Filtering

In this section we suppose that the observed sequence $\{d_i\}$ obeys the following nonlinear model

$$d_i = h_i(w) + v_i \qquad (5)$$

where $h_i(\cdot)$ is a known nonlinear function (with bounded first and second order derivatives), $w$ is an unknown weight vector, and $\{v_i\}$ is an unknown disturbance sequence that includes noise and/or modelling errors. In a neural network context the index $i$ in $h_i(\cdot)$ corresponds to the nonlinear function that maps the weight vector to the output when the $i$th input pattern is presented, i.e., $h_i(w) = h(x(i), w)$ where $x(i)$ is the $i$th input pattern. As before we shall denote by $w_i = \mathcal{F}(d_0, \ldots, d_i)$ the estimate of the weight vector using measurements up to and including time $i$, and the prediction error by

$$e_i' = h_i(w) - h_i(w_{i-1})$$

Let $T$ denote the transfer operator that maps the unknowns/disturbances $\{w - w_{-1}, \{v_i\}\}$ to the prediction errors $\{e_i'\}$.

Problem 2 (Optimal Nonlinear H∞ Adaptive Problem) Find an H∞-optimal estimation strategy $w_i = \mathcal{F}(d_0, d_1, \ldots, d_i)$ that minimizes $\|T\|_\infty$, and obtain the resulting

$$\bar{\gamma}_{opt}^2 = \inf_{\mathcal{F}} \|T\|_\infty^2 = \inf_{\mathcal{F}} \sup_{w, \, v \in h_2} \frac{\|e'\|_2^2}{\mu^{-1} |w - w_{-1}|^2 + \|v\|_2^2} \qquad (6)$$

Currently there is no general solution to the above problem, and the class of nonlinear functions $h_i(\cdot)$ for which the above problem has a solution is not known (Ball and Helton, 1992).
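For concreteness, one common instance of such a map is a single sigmoidal neuron. The sketch below is our own example (not from the paper) of a nonlinear $h_i(w) = h(x(i), w)$ with the bounded first and second derivatives the model requires, together with the gradient used by the algorithms that follow:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(x, w):
    # One possible h_i(w) = h(x(i), w): a single sigmoidal neuron applied to pattern x.
    return sigmoid(x @ w)

def grad_h(x, w):
    # Gradient of h with respect to w: sigma'(x^T w) * x, where sigma' = s (1 - s).
    s = sigmoid(x @ w)
    return s * (1.0 - s) * x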
To make some headway, though, note that by using the mean value theorem, (5) may be rewritten as

$$d_i = h_i(w_{i-1}) + \frac{\partial h_i}{\partial w}^T(\tilde{w}_{i-1}) \cdot (w - w_{i-1}) + v_i \qquad (7)$$

where $\tilde{w}_{i-1}$ is a point on the line connecting $w$ and $w_{i-1}$. Theorem 1 applied to (7) shows that the recursion

$$w_i = w_{i-1} + \mu \, \frac{\partial h_i}{\partial w}(\tilde{w}_{i-1}) \, (d_i - h_i(w_{i-1})) \qquad (8)$$

will yield $\gamma = 1$. The problem with the above algorithm is that the $\tilde{w}_i$'s are not known. But it suggests that the $\bar{\gamma}_{opt}$ in Problem 2 (if it exists) cannot be less than one. Moreover, it can be seen that the backpropagation algorithm is an approximation to (8) where $\tilde{w}_i$ is replaced by $w_i$. To pursue this point further we use the mean value theorem again to write (5) in the alternative form

$$d_i = h_i(w_{i-1}) + \frac{\partial h_i}{\partial w}^T(w_{i-1}) \cdot (w - w_{i-1}) + \frac{1}{2} (w - w_{i-1})^T \frac{\partial^2 h_i}{\partial w^2}(\bar{w}_{i-1}) \cdot (w - w_{i-1}) + v_i \qquad (9)$$

where once more $\bar{w}_{i-1}$ lies on the line connecting $w_{i-1}$ and $w$. Using (9) and Theorem 1 we have the following result.

Theorem 2 (Backpropagation Algorithm) Consider the model (5) and the backpropagation algorithm

$$w_i = w_{i-1} + \mu \, \frac{\partial h_i}{\partial w}(w_{i-1}) \, (d_i - h_i(w_{i-1})) \qquad (10)$$

If the $\frac{\partial h_i}{\partial w}(w_{i-1})$ are exciting, and

$$0 < \mu < \inf_i \frac{1}{\frac{\partial h_i}{\partial w}^T(w_{i-1}) \, \frac{\partial h_i}{\partial w}(w_{i-1})} \qquad (11)$$

then for all nonzero $w, v \in h_2$:

$$\frac{\left\| \frac{\partial h_i}{\partial w}^T(w_{i-1}) (w - w_{i-1}) \right\|_2^2}{\mu^{-1} |w - w_{-1}|^2 + \left\| v_i + \frac{1}{2} (w - w_{i-1})^T \frac{\partial^2 h_i}{\partial w^2}(\bar{w}_{i-1}) \cdot (w - w_{i-1}) \right\|_2^2} \le 1$$

The above result means that if one considers a new disturbance $v_i' = v_i + \frac{1}{2} (w - w_{i-1})^T \frac{\partial^2 h_i}{\partial w^2}(\bar{w}_{i-1}) \cdot (w - w_{i-1})$, whose second term indicates how far $h_i(w)$ is from a first order approximation at the point $w_{i-1}$, then backpropagation guarantees that the energy of the linearized prediction error $\frac{\partial h_i}{\partial w}^T(w_{i-1}) (w - w_{i-1})$ does not exceed the energy of the new disturbances $w - w_{-1}$ and $v_i'$.

It seems plausible that if $w_{-1}$ is close enough to $w$ then the second term in $v_i'$ should be small, the true and linearized prediction errors should be close, and we should be able to bound the ratio in (6). Thus the following result is expected, where we call the vectors $\{h_i\}$ persistently exciting if, and only if, the excitation condition above holds along every direction $a \in \mathcal{R}^n$.

Theorem 3 (Local H∞ Optimality) Consider the model (5) and the backpropagation algorithm (10). Suppose that the $\frac{\partial h_i}{\partial w}(w_{i-1})$ are persistently exciting, and that (11) is satisfied. Then for each $\epsilon > 0$, there exist $\delta_1, \delta_2 > 0$ such that for all $|w - w_{-1}| < \delta_1$ and all $v \in h_2$ with $|v_i| < \delta_2$, we have

$$\frac{\|e'\|_2^2}{\mu^{-1} |w - w_{-1}|^2 + \|v\|_2^2} \le 1 + \epsilon$$

The above Theorem indicates that the backpropagation algorithm is locally H∞ optimal. In other words, for $w_{-1}$ sufficiently close to $w$, and for sufficiently small disturbances, the ratio in (6) can be made arbitrarily close to one. Note that the conditions on $w$ and $v_i$ are reasonable, since if, for example, $w$ is too far from $w_{-1}$, or if some $v_i$ is too large, then it is well known that backpropagation may get stuck in a local minimum, in which case the ratio in (6) may get arbitrarily large.

As before, (11) gives an upper bound on the learning rate $\mu$, and indicates why backpropagation behaves poorly if the learning rate is too large.
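The update (10) and the learning-rate condition (11) translate directly into code. The following sketch is our own illustration (the function names are hypothetical, and it reuses the sigmoid h and grad_h from the earlier example); it is not the paper's implementation:

def backprop_step(w, x, d, mu, h, grad_h):
    # One instantaneous-gradient (backpropagation) step, as in (10):
    #   w_i = w_{i-1} + mu * (dh_i/dw)(w_{i-1}) * (d_i - h_i(w_{i-1}))
    g = grad_h(x, w)
    return w + mu * g * (d - h(x, w))

def mu_upper_bound(grads):
    # Learning-rate bound suggested by (11): mu should stay below
    # inf_i 1 / || (dh_i/dw)(w_{i-1}) ||^2 along the trajectory of gradients.
    return min(1.0 / (g @ g) for g in grads)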
\nIf there  is  no  disturbance in  (5)  we  have  the following \n\nCorollary 2  If in  addition  to the assumptions  in  Theorem  3  there is  no disturbance \nin  (5),  then  for  every (  > 0  there  exists  a 8  >  0  such  that  for  all  Iw - w-il  < 8, \nthe  backpropagation  algorithm  will  yield  II  e'  II~:::;  1l-18(1  + (),  meaning  that  the \nprediction  error converges  to  zero.  Moreover Wi  will converge  to  w. \n\nAgain provided  (11)  is  satisfied,  the larger Il  is the faster  the convergence  will  be. \n\n4  Discussion and  Conclusion \n\nThe  results  presented  in  this  paper  give  some  new  insights  into  the  behaviour  of \ninstantaneous gradient-based adaptive algorithms.  We showed that ifthe underlying \nobservation  model is  linear  then  LMS  is  an  HOC  optimal estimator,  whereas  if the \nunderlying  observation  model  is  nonlinear  then  the  backpropagation  algorithm is \nlocally HOC  optimal.  The HOC  optimality of these algorithms explains their inherent \nrobustness  to  unknown  disturbances  and  modelling  errors,  as  opposed  to  other \nestimation algorithms for  which such  bounds are not  guaranteed. \n\n\f358 \n\nHassibi, Sayed, and Kailath \n\nNote  that  if one  considers  the  transfer  operator from  the  disturbances  to  the  pre(cid:173)\ndiction errors,  then LMS  (backpropagation) is  H OO  optimal (locally), over all causal \nestimators.  This  indicates  that  our  result  is  most  applicable  in  situations  where \none is confronted with real-time data and there is no possiblity of storing the train(cid:173)\ning  patterns.  Such  cases  arise  when  one  uses  adaptive filters  or  adaptive  neural \nnetworks  for  adaptive  noise  cancellation,  channel  equalization,  real-time  control, \nand undoubtedly many other situations.  This is  as opposed  to pattern recognition, \nwhere  one  has  a  set  of training patterns  and repeatedly  retrains  the  network  until \na  desired  performance is  reached. \nMoreover,  we also showed that the H oo  optimality result leads to convergence proofs \nfor  the  LMS  and  backpropagation  algorithms  in  the  absence  of disturbances.  We \ncan pursue this line of thought further  and argue why choosing large learning rates \nincreases the resistance of backpropagation to local minima, but we  shall not do so \ndue to lack of space. \n\nIn  conclusion these  results  give a  new  interpretation of the  LMS  and backpropaga(cid:173)\ntion algorithms, which  we  believe  should be  worthy of further  scrutiny. \n\nAcknowledgements \n\nThis work  was supported in part by the Air  Force  Office  of Scientific  Research,  Air \nForce  Systems  Command under  Contract  AFOSR91-0060  and  in  part  by  a  grant \nfrom  Rockwell International Inc. \n\nReferences \n\nJ. A.  Ball and J. W. Helton.  (1992)  Nonlinear H oo  control theory for  stable plants. \nMath.  Control  Signals  Systems,  5:233-261. \nK.  Glover  and  D.  Mustafa.  (1989)  Derivation of the  maximum entropy  H oo  con(cid:173)\ntroller  and a  state space formula for  its entropy.  Int.  1.  Control,  50:899-916. \n\nB. Hassibi, A.  H.  Sayed,  and T. Kailath.  (1993a) LMS  is  HOO  Optimal.  IEEE Conf. \non  Decision  and  Control,  74-80, San Antonio,  Texas. \n\nB.  Hassibi,  A.  H.  Sayed,  and  T.  Kailath.  (1993b)  Recursive  linear  estimation in \nKrein  spaces  - part  II:  Applications.  IEEE  Conf.  on  Decision  and  Control,  3495-\n3501,  San Antonio, Texas. \n\nS.  Haykin.  
S. Haykin. (1991) Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ.

D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. (1986) Parallel distributed processing: explorations in the microstructure of cognition. Cambridge, Mass.: MIT Press.

P. Whittle. (1990) Risk Sensitive Optimal Control. John Wiley and Sons, New York.

B. Widrow and M. E. Hoff, Jr. (1960) Adaptive switching circuits. IRE WESCON Conv. Rec., Pt. 4:96-104.

G. Zames. (1981) Feedback and optimal sensitivity: model reference transformations, multiplicative seminorms and approximate inverses. IEEE Trans. on Automatic Control, AC-26:301-320.
", "award": [], "sourceid": 815, "authors": [{"given_name": "Babak", "family_name": "Hassibi", "institution": null}, {"given_name": "Ali", "family_name": "Sayed", "institution": null}, {"given_name": "Thomas", "family_name": "Kailath", "institution": null}]}