{"title": "488 Solutions to the XOR Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 410, "page_last": 416, "abstract": null, "full_text": "488  Solutions to the XOR Problem \n\nFrans  M.  Coetzee * \neoetzee@eee.emu.edu \n\nVirginia L.  Stonick \n\nginny@eee.emu.edu \n\nDepartment of Electrical  Engineering \n\nDepartment of Electrical  Engineering \n\nCarnegie Mellon University \n\nPittsburgh,  PA  15213 \n\nCarnegie  Mellon  University \n\nPittsburgh,  PA 15213 \n\nAbstract \n\nA  globally convergent homotopy method is  defined  that is capable \nof sequentially producing large numbers of stationary points of the \nmulti-layer perceptron  mean-squared error  surface.  Using  this  al(cid:173)\ngorithm large subsets of the stationary points of two test problems \nare  found.  It is  shown  empirically that  the  MLP  neural  network \nappears  to  have  an  extreme  ratio  of saddle  points  compared  to \nlocal  minima,  and  that  even  small neural  network  problems  have \nextremely large numbers of solutions. \n\n1 \n\nIntroduction \n\nThe number and type of stationary points of the error  surface  provide insight into \nthe difficulties of finding the optimal parameters ofthe network, since the stationary \npoints  determine  the  degree  of the  system[l].  Unfortunately,  even  for  the  small \ncanonical test problems commonly used in neural network studies, it is still unknown \nhow  many stationary  points there  are,  where  they  are,  and  how  these  are  divided \ninto minima, maxima and saddle  points. \n\nSince solving the neural equations explicitly is currently intractable, it is of interest \nto be  able to numerically characterize  the error surfaces of standard test  problems. \nTo  perform  such  a  characterization  is  non-trivial,  requiring  methods  that  reliably \nconverge  and  are  capable  of finding  large  subsets  of distinct  solutions.  It can  be \nshown[2]  that methods which produce only one solution set on a  given trial become \ninefficient  (at  a  factorial  rate)  at  finding  large  sets  of multiple distinct  solutions, \nsince  the  same solutions  are  found  repeatedly.  This  paper  presents  the  first  prov(cid:173)\nably globally convergent  homotopy methods capable of finding  large subsets of the \n\nCurrently  with  Siemens  Corporate  Research,  Princeton  NJ 08540 \n\n\f488 Solutions to the XOR Problem \n\n411 \n\nstationary  points  of the  neural  network  error  surface.  These  methods are  used  to \nempirically quantify  not  only  the  number  but  also  the  type  of solutions for  some \nsimple neural networks. \n\n1.1  Sequential Neural Homotopy Approach Summary \n\nWe briefly acquaint the reader with the principles of homotopy methods, since these \napproaches differ  significantly from standard descent  procedures. \n\nHomotopy  methods  solve  systems  of nonlinear  equations  by  mapping  the  known \nsolutions from  an  initial system  to the  desired  solution of the  unsolved  system  of \nequations.  The  basic  method  is  as  follows:  Given  a  final  set  of equations  /(z) = \n0, xED ~ ?Rn  whose  solution  is  sought,  a  homotopy function  h  :  D  x  T  -+ ?Rn  is \ndefined  in terms of a  parameter T  ETC ?R,  such  that \n\nh(z, T)  = \n\n{  g(z) \n/(z) \n\nwhen  T  = 0 \nwhen  T  =  1 \n\nwhere  the  initial  system of equations  g(z)  = 0  has  a  known  solution.  
The power and the problems of homotopy methods lie in constructing a suitable function $h$. Unfortunately, for a given $f$ most choices of $h$ will fail, and, with the exception of polynomial systems, no guaranteed procedures for selecting $h$ exist. Paths generally do not connect the initial and final solutions, either due to non-existence of solutions, or due to paths diverging to infinity. However, if a theoretical proof of existence of a suitable trajectory can be constructed, well-established numerical procedures exist that reliably track the trajectory.

The following theorem, proved in [2], establishes that a suitable homotopy exists for standard feed-forward backpropagation neural networks:

Theorem 1.1 Let $\epsilon^2$ be the unregularized mean square error (MSE) problem for the multi-layer perceptron network, with weights $\beta \in \mathbb{R}^n$. Let $\beta_0 \in U \subset \mathbb{R}^n$ and $a \in V \subset \mathbb{R}^n$, where $U$ and $V$ are open bounded sets. Then, except for a set of $(\beta_0, a) \in U \times V$ of measure zero, the solutions $(\beta, \tau)$ of the set of equations

    $h(\beta, \tau) = (1 - \tau)(\beta - \beta_0) + \tau \nabla_\beta \left( \epsilon^2 + \mu \psi(\|\beta - a\|^2) \right) = 0$    (1)

where $\mu > 0$ and $\psi : \mathbb{R} \to \mathbb{R}$ satisfies $2\psi''(a^2)a^2 + \psi'(a^2) > 0$ as $a \to \infty$, form non-crossing one-dimensional trajectories for all $\tau \in \mathbb{R}$, which are bounded for all $\tau \in [0, 1]$. Furthermore, the path through $(\beta_0, 0)$ connects to at least one solution $(\beta^*, 1)$ of the regularized MSE error problem

    $\nabla_\beta \left( \epsilon^2 + \mu \psi(\|\beta - a\|^2) \right) = 0.$    (2)

On $\tau \in [0, 1]$ the approach corresponds to a pseudo-quadratic error surface being deformed continuously into the final neural network error surface.^1 Multiple solutions can be obtained by choosing different initial values $\beta_0$. Every desired solution $\beta^*$ is accessible via an appropriate choice of $a$, since $\beta_0 = \beta^*$ suffices.

^1 The common engineering heuristic whereby some arbitrary error surface is relaxed into another error surface generally does not yield well defined trajectories.

Figure 1 qualitatively illustrates typical paths obtained for this homotopy.^2 The paths typically contain only a few solutions, are disconnected, and diverge to infinity. A novel two-stage homotopy [2, 3] is used to overcome these problems by constructing and solving two homotopy equations. The first homotopy system is as described above. A synthetic second homotopy solves an auxiliary set of equations on a non-Euclidean compact manifold ($S^n(0; R) \times A$, where $A$ is a compact subset of $\mathbb{R}$) and is used to move between the disconnected trajectories of the first homotopy. The method makes use of the topological properties of the compact manifold to ensure that the secondary homotopy paths do not diverge.

^2 Note that the homotopy equation and its trajectories exist outside the interval $\tau \in [0, 1]$.
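In code, the first-stage homotopy map of equation (1) reduces to a few lines once a routine returning $\nabla_\beta \epsilon^2(\beta)$ is available. The sketch below is ours (grad_mse is a hypothetical name) and assumes $\psi(x) = x$, the choice used in the experiments of Section 2, for which the regularizer $\mu\psi(\|\beta - a\|^2)$ has gradient $2\mu(\beta - a)$.

import numpy as np

def make_neural_homotopy(grad_mse, beta0, a, mu=0.05):
    # h(beta, tau) of Eq. (1) with psi(x) = x: at tau = 0 the unique root is
    # beta0; at tau = 1 the roots are the stationary points of Eq. (2).
    beta0 = np.asarray(beta0, dtype=float)
    a = np.asarray(a, dtype=float)
    def h(beta, tau):
        return (1.0 - tau) * (beta - beta0) + tau * (grad_mse(beta) + 2.0 * mu * (beta - a))
    return h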
Figure 1: (a) Typical homotopy trajectories ($\beta$ plotted against $\tau$), illustrating divergence of paths and multiple solutions occurring on one path. (b) Plot of the two-dimensional vectors used as training data for the second test problem (Yin-Yang problem).

2 Test Problems

The test problems described in this paper are small, (i) to allow for a large number of repeated runs, and (ii) to make it possible to numerically distinguish between solutions. Classification problems were used since these present the only interesting small problems, even though the MSE criterion is not necessarily best for classification. Unlike in most classification tasks, all algorithms were forced to approximate the stationary points accurately, by requiring the $l_1$ norm of the gradient to be less than $10^{-10}$ and by ensuring that solutions differed in $l_2$ by more than 0.01.

The historical XOR problem is considered first. The data points $(-1,-1)$, $(1,1)$, $(-1,1)$ and $(1,-1)$ were trained to the target values $-0.8$, $-0.8$, $0.8$ and $0.8$. A network with three inputs (one constant), two hidden layer nodes and one output node was used, with hyperbolic tangent transfer functions on the hidden and final nodes. The regularization used $\mu = 0.05$, $\psi(x) = x$ and $a = 0$ (no bifurcations were found for this value during simulations). This problem was chosen since it is small enough to serve as a benchmark for comparing the convergence and performance of the different algorithms. The second problem, referred to as the Yin-Yang problem, is shown in Figure 1. The problem has 23 and 22 data points in classes one and two respectively, and target values $\pm 0.7$. Empirical evidence indicates that the smallest single hidden layer network capable of solving the problem has five hidden nodes. We used a net with three inputs, five hidden nodes and one output. This problem is interesting since relatively high classification accuracy is obtained using only a single neuron, but 100% classification performance requires at least five hidden nodes and one of only a few global weight solutions.

The stationary points form equivalence classes under renumbering of the weights or appropriate interchange of weight signs. For the XOR problem each solution class contains up to $2^2 \cdot 2! = 8$ distinct solutions; for the Yin-Yang network, there are $2^5 \cdot 5! = 3840$ symmetries. The equivalence classes are reported in the following sections.
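To see where these symmetry counts come from, note that tanh is odd: jointly negating the input-side weights and the output weight of a hidden node leaves the network map unchanged, as does relabeling the hidden nodes. The sketch below (our code, with hypothetical names, the constant input absorbing the biases) enumerates the resulting $2^h \cdot h!$ copies of a weight setting for a one-hidden-layer tanh network with $h$ hidden nodes.

import itertools
import numpy as np

def equivalent_copies(W, v):
    # One-hidden-layer tanh network y = tanh(v . tanh(W x)), constant input
    # folded into x. Negating row j of W together with v[j] preserves the
    # map (tanh is odd), as does permuting the hidden nodes, so each weight
    # setting has 2**h * h! equivalent copies.
    h = W.shape[0]
    for signs in itertools.product((1.0, -1.0), repeat=h):
        s = np.asarray(signs)
        for perm in itertools.permutations(range(h)):
            p = list(perm)
            yield (s[:, None] * W)[p], (s * v)[p]

# The XOR network of this section: h = 2 hidden nodes, so 2**2 * 2! = 8 copies.
W, v = np.ones((2, 3)), np.ones(2)
print(sum(1 for _ in equivalent_copies(W, v)))   # prints 8

Two numerically found stationary points can then be assigned to the same class whenever some symmetric copy of one matches the other within the 0.01 tolerance above.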
3 Test Results

A Polak-Ribière conjugate gradient (CG) method was used as a control, since this method can find only minima, in contrast to the other algorithms, all of which are attracted to all stationary points. In the second algorithm, the homotopy equation (1) was solved by following the main path until divergence. A damped Newton (DN) method and the two-stage homotopy method completed the set of four algorithms considered. The different algorithms were initialized with the same $n$ random weights $\beta_0 \in S^{n-1}(0; \sqrt{2n})$.

3.1 Control: The XOR Problem

The total number and classification of the solutions obtained over 250 trials of each algorithm are shown in Table 1.

Table 1: Number of equivalence class solutions obtained. XOR problem.

    Algorithm        # Solutions   # Minima   # Maxima   # Saddle Points
    CG                        17         17          0                 0
    DN                        44          6          0                38
    One Stage                 28         16          0                12
    Two Stage                 61         17          0                44
    Total Distinct            61         17          0                44

The probability of finding a given solution on a trial is shown in Figure 2. The two-stage homotopy method finds almost every solution from every initial point. In contrast to the homotopy approaches, the Newton method exhibits poor convergence, even when heavily damped. The sets of saddle points found by the DN algorithm and by the homotopy algorithms are to a large extent disjoint, even though the same initial weights were used. For the Newton method, solutions close to the initial point are typically obtained, while the initial point for the homotopy algorithms might differ significantly from the final solution. This difference illustrates that homotopy arrives at solutions in a fundamentally different way than descent approaches.

Figure 2: Probability $p_i$ of finding equivalence class $i$ on a trial, shown for the conjugate gradient, Newton, single stage homotopy and two stage homotopy algorithms. Solutions have been sorted based on the percentage of the training set correctly classified. Dark bars indicate local minima, light bars saddle points. XOR problem.

Table 2: Number of solutions correctly classifying x% of target data.

    Classification   25 %   50 %   75 %   100 %
    Minimum            17     17      4       4
    Saddle             44     44     20       0
    Total Distinct     61     61     24       4

Based on these results we conclude that the two-stage homotopy meets its objective of significantly increasing the number of solutions produced on a single trial. The homotopy algorithms converge more reliably than Newton methods, in theory and in practice. These properties make homotopy attractive for characterizing error surfaces. Finally, due to the large number of trials and the significant overlap between the solution sets of very different algorithms, we believe that Tables 1-2 represent accurate estimates of the number and types of solutions to the regularized XOR problem.

3.2 Results on the Yin-Yang Problem

The first three algorithms were evaluated on the Yin-Yang problem for 100 trials. The conjugate gradient method showed excellent stability, while the Newton method exhibited serious convergence problems, even with heavy damping. The two-stage algorithm was still producing solutions when the runs were terminated after multiple weeks of computer time, allowing evaluation of only ten different initial points.
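Tables 1-4 classify each stationary point as a minimum, maximum or saddle point. The paper does not state how this classification was computed; a standard recipe, sketched below with a hypothetical gradient routine grad, is to examine the eigenvalue signs of a finite-difference Hessian at each converged point.

import numpy as np

def classify_stationary_point(grad, beta, eps=1e-5, tol=1e-8):
    # Estimate the Hessian of the (regularized) error by central differences
    # of the gradient, then classify by eigenvalue signs: all positive gives
    # a minimum, all negative a maximum, mixed signs a saddle point.
    beta = np.asarray(beta, dtype=float)
    n = beta.size
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[i] = (grad(beta + e) - grad(beta - e)) / (2.0 * eps)
    H = 0.5 * (H + H.T)              # symmetrize away numerical noise
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "minimum"
    if np.all(eigs < -tol):
        return "maximum"
    return "saddle"

Such a sign test is only unambiguous when the Hessian has no near-zero eigenvalues; as noted after Tables 3-4, the regularization ensured exactly this at the saddle points found.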
Table 3: Number of equivalence class solutions obtained. Yin-Yang problem.

    Algorithm            # Solutions   # Minima   # Maxima   # Saddle Points
    Conjugate Gradient            14         14          0                 0
    Damped Newton                 10          0          0                10
    One Stage Homotopy            78         15          0                63
    Two Stage Homotopy          1633         12          0              1621
    Total Distinct              1722         28          0              1694

Table 4: Number of solutions correctly classifying x% of target data.

    Classification    75 %   80 %   90 %   95 %   96 %   97 %   98 %   99 %   100 %
    Minimum             28     28     28     26     26      5      5      2       2
    Saddle            1694   1694   1682    400    400     13     13      3       3
    Total Distinct    1722   1722   1710    426    426     18     18      5       5

The results in Tables 3-4 for the number of minima are believed to be accurate, due to the verification provided by the conjugate gradient method. The number of saddle points should be seen as a lower bound. The regularization ensured that the saddle points were well conditioned, i.e. the Hessian was not rank deficient, and these solutions are therefore distinct point solutions.

4 Conclusions

The homotopy methods introduced in this paper overcome the difficulties of poor convergence and the problem of repeatedly finding the same solutions. The use of these methods therefore produces significant new empirical insight into some extraordinary, unsuspected properties of the neural network error surface.

The error surface appears to consist of relatively few minima, separated by an extraordinarily large number of saddle points. While one recent paper by Goffe et al. [4] had given some numerical estimates based on which it was concluded that a large number of minima exist in neural nets (they did not find a significant number of these), this extreme ratio of saddle points to minima appears to be unexpected. No maxima were discovered in the above runs; in fact none appear to exist within the sphere where solutions were sought (this seems likely given the regularization).

The numerical results reveal astounding complexity in the neural network problem. If the equivalence classes are complete, then 488 solutions to the XOR problem are implied, of which 136 are minima. For the Yin-Yang problem, 6,600,000+ solutions and 107,520+ minima were characterized. For the simple architectures considered, these numbers appear extremely high. We are unaware of any other system of equations having these remarkable properties.

Finally, it should be noted that the large number of saddle points and the small ratio of minima to saddle points in neural problems can create tremendous computational difficulties for approaches which produce stationary points rather than simple minima. The efficiency of any such algorithm at producing solutions will be negated by the fact that, from an optimization perspective, most of these solutions will be useless.
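For reference, the solution totals quoted above follow directly from the equivalence class counts in Tables 1 and 3 and the $2^h \cdot h!$ symmetry factors of Section 2:

# Each equivalence class expands into 2**h * h! symmetric weight solutions.
xor_classes, xor_minima, xor_copies = 61, 17, 2**2 * 2     # h = 2: 8 copies
print(xor_classes * xor_copies, xor_minima * xor_copies)   # 488, 136

yy_classes, yy_minima, yy_copies = 1722, 28, 2**5 * 120    # h = 5: 3840 copies
print(yy_classes * yy_copies, yy_minima * yy_copies)       # 6612480, 107520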
Acknowledgements. The partial support of the National Science Foundation under grant MIP-9157221 is gratefully acknowledged.

References

[1] E. H. Rothe, Introduction to Various Aspects of Degree Theory in Banach Spaces. Mathematical Surveys and Monographs (23), Providence, Rhode Island: American Mathematical Society, 1986. ISBN 0-8218-1522-9.

[2] F. M. Coetzee, Homotopy Approaches for the Analysis and Solution of Neural Network and Other Nonlinear Systems of Equations. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May 1995.

[3] F. M. Coetzee and V. L. Stonick, "Sequential homotopy-based computation of multiple solutions to nonlinear equations," in Proc. IEEE ICASSP, (Detroit), IEEE, May 1995.

[4] W. L. Goffe, G. D. Ferrier, and J. Rogers, "Global optimization of statistical functions with simulated annealing," Jour. Econometrics, vol. 60, no. 1-2, pp. 65-99, 1994.
", "award": [], "sourceid": 1298, "authors": [{"given_name": "Frans", "family_name": "Coetzee", "institution": null}, {"given_name": "Virginia", "family_name": "Stonick", "institution": null}]}