{"title": "Neural Network Visualization", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 472, "abstract": null, "full_text": "Neural Network Visualization \n\n465 \n\nNEURAL  NETWORK  VISUALIZATION \n\nJakub Wejchert \nGerald Tesauro \n\nIB M  Research \n\nT.J.  Watson Research \n\nCenter \n\nYorktown Heights \n\nNY  10598 \n\nABSTRACT \n\nWe  have developed graphics  to visualize  static and dynamic infor(cid:173)\nmation in layered neural network learning systems.  Emphasis  was \nplaced  on  creating new  visuals  that  make  use  of spatial arrange(cid:173)\nments,  size  information,  animation  and  color.  We  applied  these \ntools  to  the  study  of back-propagation learning of simple  Boolean \npredicates,  and  have  obtained  new  insights  into  the  dynamics  of \nthe learning process. \n\nINTRODUCTION \n\n1 \nAlthough  neural network learning  systems are  being  widely  investigated by  many \nresearchers  via computer simulations,  the  graphical display of information in  these \nIn  other  fields  such  as  fluid \nsimulations  has  received  relatively  little  attention. \ndynamics and chaos theory,  the development of \"scientific visualization\"  techniques \n(1,3)  have  proven  to  be  a  tremendously  useful  aid  to  research,  development,  and \neducation.  Similar  benefits  should  result  from  the  application  of these  techniques \nto neural networks research. \nIn this  article,  several visualization methods  are introduced  to investigate learning \nin  neural  networks  which  use  the  back-propagation  algorithm.  A  multi-window \n\n\f466  Wejchert and Tesauro \n\nenvironment  is used  that allows  different  aspects  of the simulation  to be displayed \nsimultaneously in each  window. \n\nAs  an  application,  the  toolkit  is  used  to  study  small  networks  learning  Boolean \nfunctions.  The animations are used to observe the emerging structure of connection \nstrengths, to study the temporal behaviour, and to understand the relationships and \neffects  of parameters.  The simulations and  graphics can run at real-time speeds. \n\n2  VISUAL  REPRESENTATIONS \nFirst, we introduce our techniques for representing both the instantaneous dynamics \nof the learning process,  and  the full  temporal trajectory of the network during  the \ncourse  of one or more learning runs. \n\n2.1  The Bond Diagram \n\nIn the  first  of these  diagrams,  the  geometrical structure  of a  connected network is \nused  as  a  basis  for  the  representation.  As  it  is  of interest  to  try  to  see  how  the \ninternal configuration of weights relates to the problem the network is  learning, it is \nclearly worthwile  to have a  graphical representation that explicitly includes  weight \ninformation integrated with network topology.  This differs from  \"Hinton diagrams\" \n(2),  in  which  data may  only  be indirectly  related  to  the  network structure.  In our \nrepresentation nodes  are represented by circles,  the  area of which  are proportional \nto the  threshold values.  Triangles or lines  are used to represent  the weights  or their \nrate of change.  The  triangles  or line  segments  emanate from  the  nodes  and  point \ntoward  the  connecting nodes.  Their  lengths  indicate  the  magnitude  of the  weight \nor weight  derivative.  We  call this  the  \"bond diagram\". \nIn  this  diagram,  one  can  look  at  any  node  and  clearly  see  the  magnitude  of the \nweights feeding into and out of it.  
A sense of direction is also built into the picture, since each bond points toward the node to which it connects. Further, the collection of weights forms distinct patterns that can be easily perceived, so that one can also infer global information from the overall patterns formed.

2.2 The Trajectory Diagram

A further limitation of Hinton diagrams is that they provide a relatively poor representation of dynamic information. Therefore, to understand more about the dynamics of learning, we introduce another visual tool that gives a two-dimensional projection of the weight space of the network. This represents the learning process as a trajectory in a reduced-dimensional space. By representing the value of the error function as the color of the point in weight space, one obtains a sense of the contours of the error hypersurface, and of the dynamics of the gradient-descent evolution on this hypersurface. We call this the "trajectory diagram".

The scheme is based on the premise that the human user has a good visual notion of vector addition. To represent an n-dimensional point, its axial components are defined as vectors and plotted radially in the plane; the vector sum of these is then calculated to yield the point representing the n-dimensional position. Clearly, for n > 2 the resultant point is not unique; however, the method does allow one to infer information about families of similar trajectories, make comparisons between trajectories, and notice important deviations in behaviour. (A short code sketch of this projection is given in Section 3.4 below.)

2.3 Implementation

The graphics software was written in C using X-Windows v. 11. The C code was interfaced to a FORTRAN neural network simulator. The whole package ran under UNIX on an RT workstation. Using the portability of X-Windows, the graphics could be run remotely on different machines over a local area network. Execution time was too slow for real-time interaction except for very small networks (typically up to 30 weights). For larger networks, the Stellar graphics workstation was used, whereby the simulator code could be vectorized and parallelized.

3 APPLICATION EXAMPLES

With the graphics we investigated networks learning Boolean functions: binary input vectors were presented to the network through the input nodes, and the teacher signal was set to either 1 or 0. Here, we show networks learning the majority and symmetry functions. The output of the majority function is 1 only if more than half of the input nodes are on; simple symmetry distinguishes between input vectors that are symmetric or anti-symmetric about a central axis; general symmetry identifies perfectly symmetric patterns out of all other permutations. Using the graphics, one can watch how solutions to a particular problem are obtained, see how different parameters affect these solutions, and observe the stages at which learning decisions are made.

At the start of the simulations the weights are set to small random values. During learning, many example patterns are presented to the input of the network and the weights are adjusted accordingly.
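As a concrete illustration of this procedure, here is a minimal sketch in Python (the original simulator was written in FORTRAN; the network size, learning rate, and number of presentations are illustrative assumptions) of a single output unit learning the majority predicate by the delta rule, the single-layer case of back-propagation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5                                   # number of input units (assumed)
    w = rng.uniform(-0.1, 0.1, n)           # small random initial weights
    b = rng.uniform(-0.1, 0.1)              # output unit's bias (threshold)
    lr = 0.5                                # learning rate (assumed)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(20000):
        x = rng.integers(0, 2, n)               # random binary input pattern
        t = 1.0 if x.sum() > n / 2 else 0.0     # teacher: the majority predicate
        y = sigmoid(w @ x + b)                  # state of the output unit
        delta = (t - y) * y * (1.0 - y)         # error term for the output unit
        w += lr * delta * x                     # gradient-descent weight update
        b += lr * delta

Run to convergence, the weights become roughly uniform and positive, with the bias offsetting their sum; this is the near-final configuration described in Section 3.1 and visible in Figure 1.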
Initially the rate of change of the weights is small; later, as the simulation gets under way, the weights change rapidly, until only small changes are made as the system moves toward the final solution. Distinct patterns of triangles show the configuration of weights in their final form.

3.1 The Majority Function

Figure 1 shows a bond diagram for a network that has learnt the majority function. During the run, many input patterns were presented to the network, during which time the weights were changed. The weights evolve from small random values through to an almost uniform set corresponding to the solution of the problem. Towards the end, a large output node is displayed and the magnitudes of all the weights are roughly uniform, indicating that a large bias (or threshold) is required to offset the sum of the weights. Majority is quite a simple problem for the network to learn; more complicated functions require hidden units.

Figure 1: A near-final configuration of weights for the majority function. All the weights are positive. The disc corresponds to the threshold of the output unit.

3.2 The Simple Symmetry Function

In this case only symmetric or perfectly anti-symmetric patterns are presented, and the network is taught to distinguish between these. In solving this problem, the network chose (correctly) that it needs only two units to make the decision whether the input is totally symmetric or totally anti-symmetric. (In fact, any symmetrically separated input pair will work.) It was found that the simple pattern created by the bond representation carries over into the more general symmetry function, where the network must identify perfectly symmetric inputs from all the other permutations.

3.3 The General Symmetry Function

Here, the network is required to detect symmetry out of all the possible input patterns. As can be seen from the bond diagram (Figure 2), the network has chosen a hierarchical structure of weights to solve the problem, using the basic pattern of weights of simple symmetry. The major decision is made on the outer pair, and additional decisions are made on the remaining pairs with decreasing strength. As before, the choice of pairs in the hierarchy depends on the initial random weights. By watching the animations, we could make some observations about the stages of learning. We found that the early behaviour was the most critical, as it was at this stage that the signs of the weights feeding the hidden units were determined. At the later stages the relative magnitudes of the weights were adapted.

3.4 The Visualization Environment

Figure 3 shows the visualization environment with most of the windows active. The upper window shows the total error, and the lower window the state of the output unit. Typically, the error initially stays high, then decreases rapidly, and then levels off to zero as final adjustments are made to the weights. Spikes in this curve are due to the method of presenting patterns at random. The state of the output unit initially oscillates and then bifurcates into the two required output states.
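The two trajectory windows on the right of Figure 3 are built from the radial projection described in Section 2.2. A minimal Python sketch of that projection follows, assuming evenly spaced axis directions (the paper does not specify the angular layout of the "spokes"):

    import numpy as np

    def radial_projection(w):
        """Map an n-dimensional weight vector to the plane: each component is
        drawn as a vector along its own radial axis and the vectors are summed.
        Placing the axes at angles 2*pi*k/n is an assumption."""
        w = np.asarray(w, float)
        angles = 2.0 * np.pi * np.arange(len(w)) / len(w)
        return float(np.sum(w * np.cos(angles))), float(np.sum(w * np.sin(angles)))

    # Recording radial_projection(w_hidden) after every weight update traces the
    # path of a hidden unit (here, a point in six-dimensional weight space) as a
    # trail of dots like those in the right-hand windows of Figure 3.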
The two extra windows on the right show the trajectory diagrams for the two hidden units. These diagrams are generalizations of phase diagrams: the components of a point in a high-dimensional space are plotted radially in the plane and treated as vectors whose sum yields a point in the two-dimensional representation. We have found these diagrams useful in observing the trajectories of the two hidden units, in which case they are representations of paths in a six-dimensional weight space. In cases where the network does converge to a correct solution, the paths of the two hidden units either try to match each other (in which case the configurations of the units were identical) or move in opposite directions (in which case the units were opposites).

By contrast, for learning runs which do not converge to global optima, we found that usually one of the hidden units followed a normal trajectory whereas the other unit was not able to achieve the appropriate match or anti-match. This is because the signs of the weights to the second hidden unit were not correct and the learning algorithm could not make the necessary adjustments. At a certain point early in learning the unit would travel off on a completely different trajectory. These observations suggest a heuristic that could improve learning by setting initial trajectories in the "correct" directions.

Figure 2: The bond diagram for a network that has learnt the symmetry function. There are six input units, two hidden units, and one output unit. Weights are shown by bonds emanating from nodes. In the graphics, positive and negative weights are colored red and blue respectively. In this grey-scale photo the negative weights are marked with diagonal lines to distinguish them from positive weights.

Figure 3: An example of the graphics with most of the windows active; the command line appears at the bottom. The central window shows the bond diagram of the general symmetry function. The upper left window shows the total error, and the lower left window the state of the output unit. The two windows on the right show the trajectory diagrams for the two hidden units. The "spokes" in this diagram correspond to the magnitudes of the weights. The traces of dots are the paths of the two units in weight space.

In general, the trajectory diagram has similar uses to a conventional phase plot: it can distinguish between different regions of configuration space; it can be used to detect critical stages of the dynamics of a system; and it gives a "trace" of the time evolution.

4 CONCLUSION

A set of computer graphics visualization programs has been designed and interfaced to a back-propagation simulator. Some new visualization tools were introduced, such as the bond and trajectory diagrams. These and other visual tools were integrated into an interactive multi-window environment.
During the course of the work it was found that the graphics were useful in a number of ways: in giving a clearer picture of the internal representation of weights and the effects of parameters, in detecting errors in the code, and in pointing out aspects of the simulation that had not been expected beforehand. Also, insight was gained into principles of designing graphics for scientific processes.

It would be of interest to extend our visualization techniques to include large networks with thousands of nodes and tens of thousands of weights. We are currently examining a number of alternative techniques which are more appropriate for large data-set regimes.

Acknowledgements

We wish to thank Scott Kirkpatrick for help and encouragement during the project. We also thank members of the visualization lab and the animation lab for the use of their resources.

References

(1) McCormick B H, DeFanti T A, Brown M D (Eds), "Visualization in Scientific Computing", Computer Graphics 21, 6, November (1987). See also "Visualization in Scientific Computing-A Synopsis", IEEE Computer Graphics and Applications, July (1987).

(2) Rumelhart D E, McClelland J L, "Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1", MIT Press, Cambridge, MA (1986).

(3) Tufte E R, "The Visual Display of Quantitative Information", Graphics Press, Cheshire, CT (1983).
", "award": [], "sourceid": 286, "authors": [{"given_name": "Jakub", "family_name": "Wejchert", "institution": null}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": null}]}