{"title": "Constrained Differential Optimization", "book": "Neural Information Processing Systems", "page_first": 612, "page_last": 621, "abstract": null, "full_text": "612 \n\nConstrained  Differential  Optimization \n\nJohn  C.  Platt \nAlan H.  Barr \n\nCalifornia Institute of Technology, Pasadena, CA 91125 \n\nAbstract \n\nMany optimization models of neural  networks need constraints to restrict the space of outputs to \na subspace which satisfies external criteria.  Optimizations using energy methods yield \"forces\" which \nact upon  the  state of the  neural  network.  The penalty method, in which quadratic  energy  constraints \nare  added  to  an  existing  optimization  energy,  has  become  popular  recently,  but  is  not  guaranteed \nto satisfy  the  constraint conditions  when  there  are  other forces  on  the  neural  model  or when  there \nare  multiple constraints.  In this paper, we present the basic differential multiplier method (BDMM), \nwhich  satisfies constraints exactly;  we  create forces  which gradually apply  the constraints over time, \nusing \"neurons\" that estimate Lagrange multipliers. \n\nThe  basic  differential  multiplier  method  is  a  differential  version  of the  method  of multipliers \nfrom  Numerical Analysis.  We  prove  that the differential  equations locally converge  to  a constrained \nminimum. \n\nExamples of applications of the differential method of multipliers include enforcing permutation \ncodewords in the analog decoding problem and enforcing valid tours in the traveling salesman problem. \n\n1.  Introduction \n\nOptimization  is  ubiquitous  in  the  field  of neural  networks.  Many  learning  algorithms,  such  as \nback-propagation,18  optimize by minimizing the difference between expected solutions and observed \nsolutions.  
Other neural algorithms use differential equations which minimize an energy to solve a specified computational problem, such as associative memory,9 differential solution of the traveling salesman problem,5,10 analog decoding,15 and linear programming.19 Furthermore, Lyapunov methods show that various models of neural behavior find minima of particular functions.4,9 \n\nSolutions to a constrained optimization problem are restricted to a subset of the solutions of the corresponding unconstrained optimization problem. For example, a mutual inhibition circuit6 requires one neuron to be \"on\" and the rest to be \"off\". Another example is the traveling salesman problem,10 where a salesman tries to minimize his travel distance, subject to the constraint that he must visit every city exactly once. A third example is the curve fitting problem, where elastic splines are as smooth as possible, while still going through data points.3 Finally, when digital decisions are being made on analog data, the answer is constrained to be bits, either 0 or 1.14 \n\nA constrained optimization problem can be stated as \n\nminimize f(x), \nsubject to g(x) = 0, \n\n(1) \n\nwhere x is the state of the neural network, a position vector in a high-dimensional space; f(x) is a scalar energy, which can be imagined as the height of a landscape as a function of position x; g(x) = 0 is a scalar equation describing a subspace of the state space. During constrained optimization, the state should be attracted to the subspace g(x) = 0, then slide along the subspace until it reaches the locally smallest value of f(x) on g(x) = 0. \n\nIn section 2 of the paper, we describe classical methods of constrained optimization, such as the penalty method and Lagrange multipliers. 
\n\nSection 3 introduces the basic differential multiplier method (BDMM) for constrained optimization, which calculates a good local minimum. If the constrained optimization problem is convex, then the local minimum is the global minimum; in general, finding the global minimum of non-convex problems is fairly difficult. \n\nIn section 4, we show a Lyapunov function for the BDMM by drawing on an analogy from physics. \n\n\u00a9 American Institute of Physics 1988 \n\nIn section 5, augmented Lagrangians, an idea from optimization theory, enhance the convergence properties of the BDMM. \n\nIn section 6, we apply the differential algorithm to two neural problems, and discuss the insensitivity of BDMM to choice of parameters. Parameter sensitivity is a persistent problem in neural networks. \n\n2. Classical Methods of Constrained Optimization \n\nThis section discusses two methods of constrained optimization, the penalty method and Lagrange multipliers. The penalty method has been previously used in differential optimization. The basic differential multiplier method developed in this paper applies Lagrange multipliers to differential optimization. \n\n2.1. The Penalty Method \n\nThe penalty method is analogous to adding a rubber band which attracts the neural state to the subspace g(x) = 0. The penalty method adds a quadratic energy term which penalizes violations of constraints.8 Thus, the constrained minimization problem (1) is converted to the following unconstrained minimization problem: \n\nminimize \u03b5_penalty(x) = f(x) + c (g(x))\u00b2. \n\n(2) \n\nFigure 1. The penalty method makes a trough in state space \n\nThe penalty method can be extended to fulfill multiple constraints by using more than one rubber band. Namely, the constrained optimization problem \n\nminimize f(x), \nsubject to g_\u03b1(x) = 0; \u03b1 = 1, 2, ..., n; \n\n(3) \n\nis converted into the unconstrained optimization problem \n\nminimize \u03b5_penalty(x) = f(x) + \u03a3_\u03b1 c_\u03b1 (g_\u03b1(x))\u00b2. \n\n(4) \n\nThe penalty method has several convenient features. First, it is easy to use. Second, it is globally convergent to the correct answer as c_\u03b1 \u2192 \u221e.8 Third, it allows compromises between constraints. For example, in the case of a spline curve fitting input data, there can be a compromise between fitting the data and making a smooth spline. \n\nHowever, the penalty method has a number of disadvantages. First, for finite constraint strengths c_\u03b1, it doesn't fulfill the constraints exactly. Using multiple rubber band constraints is like building a machine out of rubber bands: the machine would not hold together perfectly. Second, as more constraints are added, the constraint strengths get harder to set, especially when the size of the network (the dimensionality of x) gets large. \n\nIn addition, there is a dilemma in the setting of the constraint strengths. If the strengths are small, then the system finds a deep local minimum, but does not fulfill all the constraints. If the strengths are large, then the system quickly fulfills the constraints, but gets stuck in a poor local minimum. \n\n2.2. Lagrange Multipliers \n\nLagrange multiplier methods also convert constrained optimization problems into unconstrained extremization problems. Namely, a solution to equation (1) is also a critical point of the energy \n\n\u03b5_Lagrange(x) = f(x) + \u03bb g(x). \n\n(5) \n\n\u03bb is called the Lagrange multiplier for the constraint g(x) = 0.8 \n\nA direct consequence of equation (5) is that the gradient of f is collinear to the gradient of g at the constrained extrema (see Figure 2). The constant of proportionality between \u2207f and \u2207g is -\u03bb: \n\n\u2207\u03b5_Lagrange = 0 = \u2207f + \u03bb \u2207g. \n\n(6) \n\nWe use the collinearity of \u2207f and \u2207g in the design of the BDMM. 
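The inexactness of the penalty method for finite constraint strengths can be seen on a small example. The sketch below (illustrative Python, not from the paper; the two-variable problem and all names are ours) runs gradient descent on the penalty energy (4) for f = x\u00b2 + y\u00b2 subject to x + y = 1, and prints the constraint residual, which shrinks only as c grows:

```python
# Penalty-method sketch (illustrative, not the paper's code): minimize
# f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0 by gradient
# descent on the penalty energy f + c*g^2 of equation (4).

def penalty_descent(c, steps=2000):
    # Larger c makes the energy stiffer, so the step size must shrink
    # with c -- the stiffness problem discussed later in the paper.
    rate = 0.4 / (1.0 + 2.0 * c)
    x = y = 0.0
    for _ in range(steps):
        g = x + y - 1.0
        x -= rate * (2.0 * x + 2.0 * c * g)
        y -= rate * (2.0 * y + 2.0 * c * g)
    return x, y

for c in (1.0, 10.0, 100.0):
    x, y = penalty_descent(c)
    # The analytic minimum is x = y = c/(1 + 2c), so the residual
    # g = -1/(1 + 2c) vanishes only as c -> infinity.
    print(c, x + y - 1.0)
```

The printed residuals shrink roughly tenfold with each tenfold increase in c, but never reach zero for any finite strength.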
\n\nFigure 2. At the constrained minimum, \u2207f = -\u03bb \u2207g \n\nA simple example shows that Lagrange multipliers provide the extra degrees of freedom necessary to solve constrained optimization problems. Consider the problem of finding a point (x, y) on the line x + y = 1 that is closest to the origin. Using Lagrange multipliers, \n\n\u03b5_Lagrange = x\u00b2 + y\u00b2 + \u03bb(x + y - 1). \n\n(7) \n\nNow, take the derivative with respect to all variables, x, y, and \u03bb: \n\n\u2202\u03b5_Lagrange/\u2202x = 2x + \u03bb = 0, \n\u2202\u03b5_Lagrange/\u2202y = 2y + \u03bb = 0, \n\u2202\u03b5_Lagrange/\u2202\u03bb = x + y - 1 = 0. \n\n(8) \n\nWith the extra variable \u03bb, there are now three equations in three unknowns. In addition, the last equation is precisely the constraint equation. \n\n3. The Basic Differential Multiplier Method for Constrained Optimization \n\nThis section presents a new \"neural\" algorithm for constrained optimization, consisting of differential equations which estimate Lagrange multipliers. The neural algorithm is a variation of the method of multipliers, first presented by Hestenes7 and Powell.16 \n\n3.1. Gradient Descent does not work with Lagrange Multipliers \n\nThe simplest differential optimization algorithm is gradient descent, where the state variables of the network slide downhill, opposite the gradient. Applying gradient descent to the energy in equation (5) yields \n\ndx_i/dt = -\u2202\u03b5_Lagrange/\u2202x_i = -\u2202f/\u2202x_i - \u03bb \u2202g/\u2202x_i, \nd\u03bb/dt = -\u2202\u03b5_Lagrange/\u2202\u03bb = -g(x). \n\n(9) \n\nNote that there is an auxiliary differential equation for \u03bb, which is an additional \"neuron\" necessary to apply the constraint g(x) = 0. Also, recall that when the system is at a constrained extremum, \u2207f = -\u03bb \u2207g, hence dx_i/dt = 0. 
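The three conditions in (8) are linear and can be checked directly. A minimal Python sketch (variable names are ours) confirms the closed-form solution and the collinearity property of equation (6):

```python
# Solving the Lagrange conditions (8) for the closest point on
# x + y = 1 to the origin:
#   2x + lam = 0,  2y + lam = 0,  x + y - 1 = 0.
# The first two equations give x = y = -lam/2; substituting into the
# constraint gives -lam = 1.

lam = -1.0
x = -lam / 2.0
y = -lam / 2.0

# All three equations of (8) are satisfied...
assert 2.0 * x + lam == 0.0
assert 2.0 * y + lam == 0.0
assert x + y - 1.0 == 0.0

# ...and grad f = (2x, 2y) = (1, 1) equals -lam * grad g with
# grad g = (1, 1), illustrating equation (6).
assert (2.0 * x, 2.0 * y) == (-lam * 1.0, -lam * 1.0)
print(x, y, lam)  # 0.5 0.5 -1.0
```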
\n\nEnergies involving Lagrange multipliers, however, have critical points which tend to be saddle points. Consider the energy in equation (5). If x is frozen, the energy can be decreased by sending \u03bb to +\u221e or -\u221e. \n\nGradient descent does not work with Lagrange multipliers, because a critical point of the energy in equation (5) need not be an attractor for (9). A stationary point must be a local minimum in order for gradient descent to converge. \n\n3.2. The New Algorithm: the Basic Differential Multiplier Method \n\nWe present an alternative to differential gradient descent that estimates the Lagrange multipliers, so that the constrained minima are attractors of the differential equations, instead of \"repulsors.\" The differential equations that solve (1) are \n\ndx_i/dt = -\u2202f/\u2202x_i - \u03bb \u2202g/\u2202x_i, \nd\u03bb/dt = +g(x). \n\n(10) \n\nEquation (10) is similar to equation (9). As in equation (9), constrained extrema of the energy (5) are stationary points of equation (10). Notice, however, the sign inversion in the equation for d\u03bb/dt, as compared to equation (9). Equation (10) is performing gradient ascent on \u03bb. The sign flip makes the BDMM stable, as shown in section 4. \n\nEquation (10) corresponds to a neural network with anti-symmetric connections between the \u03bb neuron and all of the x neurons. \n\n3.3. Extensions to the Algorithm \n\nOne extension to equation (10) is an algorithm for constrained minimization with multiple constraints. Adding an extra neuron for every equality constraint and summing all of the constraint forces creates the energy \n\n\u03b5_multiple = f(x) + \u03a3_\u03b1 \u03bb_\u03b1 g_\u03b1(x), \n\n(11) \n\nwhich yields the differential equations \n\ndx_i/dt = -\u2202f/\u2202x_i - \u03a3_\u03b1 \u03bb_\u03b1 \u2202g_\u03b1/\u2202x_i, \nd\u03bb_\u03b1/dt = g_\u03b1(x). \n\n(12) \n\nAnother extension is constrained minimization with inequality constraints. 
As in traditional optimization theory,8 one uses extra slack variables to convert inequality constraints into equality constraints. Namely, a constraint of the form h(x) \u2265 0 can be expressed as \n\ng(x, z) = h(x) - z\u00b2 = 0. \n\n(13) \n\nSince z\u00b2 must always be positive, then h(x) is constrained to be positive. The slack variable z is treated like a component of x in equation (10). An inequality constraint requires two extra neurons, one for the slack variable z and one for the Lagrange multiplier \u03bb. \n\nAlternatively, the inequality constraint can be represented as an equality constraint. For example, if h(x) \u2265 0, then the optimization can be constrained with g(x) = h(x) when h(x) \u2264 0, and g(x) = 0 otherwise. \n\n4. Why the algorithm works \n\nThe system of differential equations (10) (the BDMM) gradually fulfills the constraints. Notice that the function g(x) can be replaced by k g(x), without changing the location of the constrained minimum. As k is increased, the state begins to undergo damped oscillation about the constraint subspace g(x) = 0. As k is increased further, the frequency of the oscillations increases, and the time to convergence increases. \n\nFigure 3. The state is attracted to the constraint subspace \n\nThe damped oscillations of equation (10) can be explained by combining both of the differential equations into one second-order differential equation: \n\nd\u00b2x_i/dt\u00b2 + \u03a3_j A_ij dx_j/dt + g \u2202g/\u2202x_i = 0. \n\n(14) \n\nEquation (14) is the equation for a damped mass system, with an inertia term d\u00b2x_i/dt\u00b2, a damping matrix \n\nA_ij = \u2202\u00b2f/\u2202x_i\u2202x_j + \u03bb \u2202\u00b2g/\u2202x_i\u2202x_j, \n\n(15) \n\nand an internal force, g \u2202g/\u2202x_i, which is the derivative of the internal energy \n\nU = (1/2)(g(x))\u00b2. \n\n(16) \n\nIf the system is damped and the state remains bounded, the state falls into a constrained minimum. 
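The damped approach to the constraint subspace can be observed directly by integrating equation (10) on the small example of Section 2.2. The sketch below is ours (forward-Euler integration and all parameter values are choices for illustration, not the paper's):

```python
# BDMM sketch: integrate equations (10) with forward Euler on
# f = x^2 + y^2, g = x + y - 1.  The lambda neuron performs gradient
# *ascent* (dlam/dt = +g); the state spirals into the constrained
# minimum (1/2, 1/2) with damped oscillations, as in Figure 3.

def bdmm(dt=0.01, steps=5000):
    x = y = lam = 0.0
    for _ in range(steps):
        g = x + y - 1.0
        dx = -2.0 * x - lam          # -df/dx - lam * dg/dx
        dy = -2.0 * y - lam          # -df/dy - lam * dg/dy
        x, y, lam = x + dt * dx, y + dt * dy, lam + dt * g
    return x, y, lam

x, y, lam = bdmm()
print(x, y, lam)  # converges toward 0.5, 0.5, -1.0
```

On this problem the linearized dynamics have eigenvalues -1 \u00b1 i, so the trajectory is a damped spiral: the oscillation about g = 0 and its decay are both visible if intermediate states are printed.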
\n\nAs in physics, we can construct a total energy of the system, which is the sum of the kinetic and potential energies: \n\nE = T + U = \u03a3_i (1/2)(dx_i/dt)\u00b2 + (1/2)(g(x))\u00b2. \n\n(17) \n\nIf the total energy is decreasing with time and the state remains bounded, then the system will dissipate any extra energy, and will settle down into the state where \n\ndx_i/dt = 0, g(x) = 0, \n\n(18) \n\nwhich is a constrained extremum of the original problem in equation (1). \n\nThe time derivative of the total energy in equation (17) is \n\ndE/dt = -\u03a3_{i,j} (dx_i/dt) A_ij (dx_j/dt). \n\n(19) \n\nIf the damping matrix A_ij is positive definite, the system converges to fulfill the constraints. \n\nBDMM always converges for a special case of constrained optimization: quadratic programming. A quadratic programming problem has a quadratic function f(x) and a piecewise linear continuous function g(x), so that \n\n\u2202\u00b2f/\u2202x_i\u2202x_j is constant and positive definite, and \u2202\u00b2g/\u2202x_i\u2202x_j = 0. \n\n(20) \n\nUnder these circumstances, the damping matrix A_ij is positive definite for all x and \u03bb, so that the system converges to the constraints. \n\n4.1. Multiple constraints \n\nFor the case of multiple constraints, the total energy for equation (12) is \n\nE = T + U = \u03a3_i (1/2)(dx_i/dt)\u00b2 + \u03a3_\u03b1 (1/2)(g_\u03b1(x))\u00b2, \n\n(21) \n\nand the time derivative is \n\ndE/dt = -\u03a3_{i,j} (dx_i/dt) A_ij (dx_j/dt). \n\n(22) \n\nAgain, BDMM solves a quadratic programming problem, if a solution exists. However, it is possible to pose a problem that has contradictory constraints. For example, \n\ng\u2081(x) = x = 0, g\u2082(x) = x - 1 = 0. \n\n(23) \n\nIn the case of conflicting constraints, the BDMM compromises, trying to make each constraint as small as possible. However, the Lagrange multipliers \u03bb_\u03b1 go to \u00b1\u221e as the constraints oppose each other. It is possible, however, to arbitrarily limit the \u03bb_\u03b1 at some large absolute value. \n\nLaSalle's invariance theorem12 is used to prove that the BDMM eventually fulfills the constraints. Let G be an open subset of R\u207f. 
Let F be a subset of G*, the closure of G, where the system of differential equations (12) is at an equilibrium. If the damping matrix \n\nA_ij = \u2202\u00b2f/\u2202x_i\u2202x_j + \u03a3_\u03b1 \u03bb_\u03b1 \u2202\u00b2g_\u03b1/\u2202x_i\u2202x_j \n\n(24) \n\nis positive definite in G, if x_i(t) and \u03bb_\u03b1(t) are bounded and remain in G for all time, and if F is non-empty, then F is the largest invariant set in G*; hence, by LaSalle's invariance theorem, the system x_i(t), \u03bb_\u03b1(t) approaches F as t \u2192 \u221e. \n\n5. The Modified Differential Method of Multipliers \n\nThis section presents the modified differential multiplier method (MDMM), which is a modification of the BDMM with more robust convergence properties. For a given constrained optimization problem, it is frequently necessary to alter the BDMM to have a region of positive damping surrounding the constrained minima. The non-differential method of multipliers from Numerical Analysis also has this difficulty.2 Numerical Analysis combines the multiplier method with the penalty method to yield a modified multiplier method that is locally convergent around constrained minima.2 \n\nThe BDMM is completely compatible with the penalty method. If one adds a penalty force to equation (10) corresponding to a quadratic energy \n\n\u03b5_penalty = (c/2)(g(x))\u00b2, \n\n(26) \n\nthen the set of differential equations for the MDMM is \n\ndx_i/dt = -\u2202f/\u2202x_i - \u03bb \u2202g/\u2202x_i - c g \u2202g/\u2202x_i, \nd\u03bb/dt = g(x). \n\n(27) \n\nThe extra force from the penalty does not change the position of the stationary points of the differential equations, because the penalty force is 0 when g(x) = 0. The damping matrix is modified by the penalty force to be \n\nA_ij = \u2202\u00b2f/\u2202x_i\u2202x_j + \u03bb \u2202\u00b2g/\u2202x_i\u2202x_j + c (\u2202g/\u2202x_i)(\u2202g/\u2202x_j) + c g \u2202\u00b2g/\u2202x_i\u2202x_j. \n\n(28) \n\nThere is a theorem1 that states that there exists a c* > 0 such that if c > c*, the damping matrix in equation (28) is positive definite at constrained minima. 
Using continuity, the damping matrix is positive definite in a region R surrounding each constrained minimum. If the system starts in the region R and remains bounded and in R, then the convergence theorem at the end of section 4 is applicable, and the MDMM will converge to a constrained minimum. \n\nThe minimum necessary penalty strength c for the MDMM is usually much less than the strength needed by the penalty method alone.2 \n\n6. Examples \n\nThis section contains two examples which illustrate the use of the BDMM and the MDMM. First, the BDMM is used to find a good solution to the planar traveling salesman problem. Second, the MDMM is used to enforce mutual inhibition and digital results in the task of analog decoding. \n\n6.1. Planar Traveling Salesman \n\nThe traveling salesman problem (TSP) is: given a set of cities lying in the plane, find the shortest closed path that goes through every city exactly once. Finding the shortest path is NP-complete. Finding a nearly optimal path, however, is much easier than finding a globally optimal path. There exist many heuristic algorithms for approximately solving the traveling salesman problem.5,10,11,13 The solution presented in this section is moderately effective and illustrates the independence of BDMM to changes in parameters. \n\nFollowing Durbin and Willshaw,5 we use an elastic snake to solve the TSP. A snake is a discretized curve which lies on the plane. The elements of the snake are points on the plane, (x_i, y_i). A snake is a locally connected neural network, whose neural outputs are positions on the plane. 
\n\nThe snake minimizes its length \n\n\u03a3_i (x_{i+1} - x_i)\u00b2 + (y_{i+1} - y_i)\u00b2, \n\n(29) \n\nsubject to the constraint that the snake must lie on the cities: \n\nk(x* - x_c) = 0, k(y* - y_c) = 0, \n\n(30) \n\nwhere (x*, y*) are city coordinates, (x_c, y_c) is the closest snake point to the city, and k is the constraint strength. \n\nThe minimization in equation (29) is quadratic and the constraints in equation (30) are piecewise linear, corresponding to a C\u2070 continuous potential energy in equation (21). Thus, the damping is positive definite, and the system converges to a state where the constraints are fulfilled. \n\nIn practice, the snake starts out as a circle. Groups of cities grab onto the snake, deforming it. As the snake gets close to groups of cities, it grabs onto a specific ordering of cities that locally minimizes its length (see Figure 4). \n\nThe system of differential equations that solves equations (29) and (30) is piecewise linear. The differential equations for x_i and y_i are solved with the implicit Euler method, using tridiagonal LU decomposition to solve the linear system.17 The points of the snake are sorted into bins that divide the plane, so that the computation of finding the nearest point is simplified. \n\nFigure 4. The snake eventually attaches to the cities \n\nThe constrained minimization in equations (29) and (30) is a reasonable method for approximately solving the TSP. For 120 cities distributed in the unit square, 600 snake points, a numerical step size of 100 time units, and a constraint strength of 5 \u00d7 10\u207b\u00b3, the tour lengths are 6% \u00b1 2% longer than those yielded by simulated annealing.11 
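The tridiagonal structure exploited by the implicit Euler step comes directly from equation (29): each snake point interacts only with its two neighbours, so the gradient (and hence the linear system) is banded. A small sketch of that structure, assuming a closed snake with indices taken modulo n (one coordinate shown; y is handled identically; this is our illustration, not the paper's code):

```python
# The length energy (29) couples each snake point only to its two
# neighbours, so its Hessian is banded (tridiagonal up to the
# wrap-around terms of the closed curve).

def length_energy(xs):
    n = len(xs)
    return sum((xs[(i + 1) % n] - xs[i]) ** 2 for i in range(n))

def length_gradient(xs):
    n = len(xs)
    # d/dx_i of (x_{i+1} - x_i)^2 + (x_i - x_{i-1})^2
    return [2.0 * (2.0 * xs[i] - xs[i - 1] - xs[(i + 1) % n])
            for i in range(n)]

# Check the analytic gradient against a central finite difference.
xs = [0.0, 1.0, 0.5, -0.3, 0.8]
h = 1e-6
for i in range(len(xs)):
    xp = list(xs); xp[i] += h
    xm = list(xs); xm[i] -= h
    fd = (length_energy(xp) - length_energy(xm)) / (2.0 * h)
    assert abs(fd - length_gradient(xs)[i]) < 1e-5
```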
Empirically, for 30 to 240 cities, the time needed to compute the final city ordering scales as N^1.6, as compared to the Kernighan-Lin method,13 which scales roughly as N^2.2. \n\nThe same constraint strength is usable for both a 30 city problem and a 240 city problem. Although changing the constraint strength affects the performance, the snake attaches to the cities for any non-zero constraint strength. Parameter adjustment does not seem to be an issue as the number of cities increases, unlike the penalty method. \n\n6.2. Analog Decoding \n\nAnalog decoding uses analog signals from a noisy channel to reconstruct codewords. Analog decoding has been performed neurally,15 with a code space of permutation matrices, out of the possible space of binary matrices. \n\nTo perform the decoding of permutation matrices, the nearest permutation matrix to the signal matrix must be found. In other words, find the nearest matrix to the signal matrix, subject to the constraint that the matrix has on/off binary elements, and has exactly one \"on\" per row and one \"on\" per column. If the signal matrix is I_ij and the result is V_ij, then minimize \n\n-\u03a3_{i,j} V_ij I_ij, \n\n(31) \n\nsubject to constraints \n\nV_ij (1 - V_ij) = 0; \u03a3_i V_ij - 1 = 0; \u03a3_j V_ij - 1 = 0. \n\n(32) \n\nIn this example, the first constraint in equation (32) forces crisp digital decisions. The second and third constraints are mutual inhibition along the rows and columns of the matrix. \n\nThe optimization in equation (31) is not quadratic, it is linear. In addition, the first constraint in equation (32) is non-linear. Using the BDMM results in undamped oscillations. In order to converge onto a constrained minimum, the MDMM must be used. For both a 5 \u00d7 5 and a 20 \u00d7 20 system, a c = 0.2 is adequate for damping the oscillations. 
The choice of c seems to be reasonably insensitive to the size of the system, and a wide range of c, from 0.02 to 2.0, damps the oscillations. \n\nFigure 5. The decoder finds the nearest permutation matrix \n\nIn a test of the MDMM, a signal matrix which is a permutation matrix plus some noise, with a signal-to-noise ratio of 4, is supplied to the network. In Figure 5, the system has turned on the correct neurons but also many incorrect neurons. 
The constraints start to be applied, and eventually the system reaches a permutation matrix. The differential equations do not need to be reset. If a new signal matrix is applied to the network, the neural state will move towards the new solution. \n\n7. Conclusions \n\nIn the field of neural networks, there are differential optimization algorithms which find local solutions to non-convex problems. The basic differential multiplier method is a modification of a standard constrained optimization algorithm, which improves the capability of neural networks to perform constrained optimization. \n\nThe BDMM and the MDMM offer many advantages over the penalty method. First, the differential equations (10) are much less stiff than those of the penalty method. Very large quadratic terms are not needed by the MDMM in order to strongly enforce the constraints. The energy terrain for the penalty method looks like steep canyons with gentle floors; finding minima of these types of energy surfaces is numerically difficult. In addition, the steepness of the penalty terms is usually sensitive to the dimensionality of the space. The differential multiplier methods are promising techniques for alleviating stiffness. \n\nThe differential multiplier methods separate the speed of fulfilling the constraints from the accuracy of fulfilling the constraints. In the penalty method, as the strength of a constraint goes to \u221e, the constraint is fulfilled, but the energy has many undesirable local minima. The differential multiplier methods allow one to choose how quickly to fulfill the constraints. \n\nThe BDMM fulfills constraints exactly and is compatible with the penalty method. 
Addition of penalty terms in the MDMM does not change the stationary points of the algorithm, and sometimes helps to damp oscillations and improve convergence. \n\nSince the BDMM and the MDMM are in the form of first-order differential equations, they can be directly implemented in hardware. Performing constrained optimization at the raw speed of analog VLSI seems like a promising technique for solving difficult perception problems.14 \n\nThere exist Lyapunov functions for the BDMM and the MDMM. The BDMM converges globally for quadratic programming. The MDMM is provably convergent in a local region around the constrained minima. Other optimization algorithms, such as Newton's method,17 have similar local convergence properties. The global convergence properties of the BDMM and the MDMM are currently under investigation. \n\nIn summary, the differential method of multipliers is a useful way of enforcing constraints on neural networks for enforcing syntax of solutions, encouraging desirable properties of solutions, and making crisp decisions. \n\nAcknowledgments \n\nThis paper was supported by an AT&T Bell Laboratories fellowship (JCP). \n\nReferences \n\n1. K. J. Arrow, L. Hurwicz, H. Uzawa, Studies in Linear and Nonlinear Programming, (Stanford University Press, Stanford, CA, 1958). \n2. D. P. Bertsekas, Automatica, 12, 133-145, (1976). \n3. C. de Boor, A Practical Guide to Splines, (Springer-Verlag, NY, 1978). \n4. M. A. Cohen, S. Grossberg, IEEE Trans. Systems, Man, and Cybernetics, 13, 815-826, (1983). \n5. R. Durbin, D. Willshaw, Nature, 326, 689-691, (1987). \n6. J. C. Eccles, The Physiology of Nerve Cells, (Johns Hopkins Press, Baltimore, 1957). \n7. M. R. Hestenes, J. Opt. Theory Appl., 4, 303-320, (1969). \n8. M. R. Hestenes, Optimization Theory, (Wiley & Sons, NY, 1975). \n9. J. J. 
Hopfield, PNAS, 81, 3088-3092, (1984). \n10. J. J. Hopfield, D. W. Tank, Biological Cybernetics, 52, 141-152, (1985). \n11. S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, Science, 220, 671-680, (1983). \n12. J. LaSalle, The Stability of Dynamical Systems, (SIAM, Philadelphia, 1976). \n13. S. Lin, B. W. Kernighan, Oper. Res., 21, 498-516, (1973). \n14. C. A. Mead, Analog VLSI and Neural Systems, (Addison-Wesley, Reading, MA, TBA). \n15. J. C. Platt, J. J. Hopfield, in AIP Conf. Proc. 151: Neural Networks for Computing (J. Denker, ed.), 364-369, (American Institute of Physics, NY, 1986). \n16. M. J. D. Powell, in Optimization, (R. Fletcher, ed.), 283-298, (Academic Press, NY, 1969). \n17. W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numerical Recipes, (Cambridge University Press, Cambridge, 1986). \n18. D. Rumelhart, G. Hinton, R. Williams, in Parallel Distributed Processing, (D. Rumelhart, ed.), 1, 318-362, (MIT Press, Cambridge, MA, 1986). \n19. D. W. Tank, J. J. Hopfield, IEEE Trans. Cir. & Sys., CAS-33, no. 5, 533-541, (1986). \n", "award": [], "sourceid": 4, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Alan", "family_name": "Barr", "institution": null}]}