{"title": "Differentiating Functions of the Jacobian with Respect to the Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 435, "page_last": 441, "abstract": null, "full_text": "Differentiating Functions of the Jacobian \n\nwith Respect to the Weights \n\nGary William Flake \nNEC Research Institute \n4 Independence Way \nPrinceton, NJ  08540 \n\njiake@research.nj.nec.com \n\nBarak A. Pearlmutter \n\nDept of Computer Science, FEC 313 \n\nUniversity of New Mexico \nAlbuquerque, NM  87131 \n\nbap@cs.unm.edu \n\nAbstract \n\nFor many problems, the correct behavior of a model depends not only on \nits input-output mapping but also on properties of its Jacobian matrix, the \nmatrix of partial derivatives of the model's outputs with respect to its in(cid:173)\nputs.  We introduce the J-prop algorithm, an efficient general method for \ncomputing the exact partial derivatives of a variety of simple functions of \nthe Jacobian of a model with respect to its free parameters. The algorithm \napplies to any parametrized feedforward model,  including nonlinear re(cid:173)\ngression, multilayer perceptrons, and radial basis function networks. \n\n1  Introduction \n\nLet f (x, w) be an n input, m output, twice differentiable feedforward model parameterized \nby an input vector, x, and a weight vector w. Its Jacobian matrix is defined as \n\nJ= \n\n[ ~ \n\n()xl \n: \naim \naXI \n\n~l a~\" \naim \nax\" \n\n=  df(x, w) . \n\ndx \n\nThe algorithm we introduce can be used to optimize functions of the form \n\nor \n\nEv(w) = 211Jv - bll \n\n2 \n\n1 \n\n(1) \n\n(2) \n\nwhere  u, v,  a,  and  b are user-defined constants.  Our algorithm,  which  we  call  J-prop, \ncan be used to calculate the exact value of both a Eu / aw  or a Ev / aw  in 0 (1)  times  the \ntime required to calculate the normal gradient. Thus, I-prop is suitable for training models \nto have specific first derivatives, or for implementing several other well-known algorithms \nsuch as Double Backpropagation [1]  and Tangent Prop [2]. \n\nClearly, being able to optimize Equations  1 and 2 is useful;  however,  we suspect that the \nformalism  which  we  use  to  derive  our algorithm  is  actually  more interesting  because  it \nallows  us  to  modify  J-prop to  easily be applicable to  a  wide-variety  of model  types and \n\n\f436 \n\nG.  W.  Flake and B.  A.  Pear/mutter \n\nobjective functions.  As  such,  we  spend a fair portion  of this paper describing  the mathe(cid:173)\nmatical framework from which we later build J-prop. \n\nThis paper is divided into four more sections.  Section 2 contains background information \nand motivation for why optimizing the properties of the Jacobian is an important problem. \nSection  3  introduces our formalism  and contains  the derivation  of the  J-prop  algorithm. \nSection 4 contains a brief numerical example of J-prop.  And,  finally,  Section 5 describes \nfurther work and gives our conclusions. \n\n2  Background and motivation \n\nPrevious work concerning the modeling of an unknown function and its derivatives can be \ndivided into works that are descriptive or prescriptive.  Perhaps the best known descriptive \nresult is due to White et al.  [3,4], who show that given noise-free data, a multilayer percep(cid:173)\ntron (MLP) can approximate the higher derivatives of an  unknown function in the limit as \nthe number of training points goes to infinity. 
2  Background and motivation

Previous work concerning the modeling of an unknown function and its derivatives can be divided into works that are descriptive or prescriptive. Perhaps the best known descriptive result is due to White et al. [3, 4], who show that given noise-free data, a multilayer perceptron (MLP) can approximate the higher derivatives of an unknown function in the limit as the number of training points goes to infinity. The difficulty with applying this result is the strong requirements on the amount and integrity of the training data, requirements which are rarely met in practice. This problem was specifically demonstrated by Principe, Rathie and Kuo [5] and Deco and Schürmann [6], who showed that using noisy training data from chaotic systems can lead to models that are accurate in the input-output sense, but inaccurate in their estimates of quantities related to the Jacobian of the unknown system, such as the largest Lyapunov exponent and the correlation dimension.

MLPs are particularly problematic because large weights can lead to saturation at a particular sigmoidal neuron which, in turn, results in extremely large first derivatives at that neuron when it is evaluated near the center of the sigmoid transition. Several methods to combat this type of over-fitting have been proposed. One of the earliest methods, weight decay [7], uses a penalty term on the magnitude of the weights. Weight decay is arguably optimal for models in which the output is linear in the weights, because minimizing the magnitude of the weights is then equivalent to minimizing the magnitude of the model's first derivatives. However, in the nonlinear case, weight decay can have suboptimal performance [1] because large (or small) weights do not always correspond to large (or small) first derivatives.

The Double Backpropagation algorithm [1] adds an additional penalty term to the error function equal to ||∂E/∂x||^2. Training on this function results in a form of regularization that is in many ways an elegant combination of weight decay and training with noise: it is strictly analytic (unlike training with noise), but it explicitly penalizes large first derivatives of the model (unlike weight decay). Double Backpropagation can be seen as a special case of J-prop, the algorithm derived in this paper.

As to the general problem of coercing the first derivatives of a model to specific values, Simard et al. [2] introduced the Tangent Prop algorithm, which was used to train MLPs for optical character recognition to be insensitive to small affine transformations in the character space. Tangent Prop can also be considered a special case of J-prop.
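To make the connection between these penalties and Equations 1 and 2 explicit (this is our illustration; the particular choices of u, v, a, and b below are ours): if E = L(f(x, w), t) is an ordinary training loss, the chain rule gives ∂E/∂x = J^T ∂L/∂y, so the Double Backpropagation penalty is, up to a factor of one half, Equation 1 with u = ∂L/∂y and a = 0; similarly, penalizing the directional derivative of f along a tangent direction v, as Tangent Prop does, is Equation 2 with b = 0 (or b set to a desired target derivative).

```latex
% Sketch (ours): Double Backpropagation and Tangent Prop as instances of
% Equations 1 and 2.  L is the usual training loss and v a tangent vector.
\frac{\partial E}{\partial x} = J^{\top}\,\frac{\partial L}{\partial y}
\quad\Longrightarrow\quad
E_u\Big|_{u = \partial L/\partial y,\; a = 0}
   = \tfrac{1}{2}\,\Bigl\|\frac{\partial E}{\partial x}\Bigr\|^{2},
\qquad
E_v\Big|_{b = 0} = \tfrac{1}{2}\,\|J v\|^{2}.
```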
3  Derivation

We now define a formalism under which J-prop can be easily derived. The method is very similar to a technique introduced by Pearlmutter [8] for calculating the product of the Hessian of an MLP and an arbitrary vector. However, where Pearlmutter used differential operators applied to a model's weight space, we use differential operators defined with respect to a model's input space.

Our entire derivation is presented in five steps. First, we will define an auxiliary error function that has a few useful mathematical properties that simplify the derivation. Next, we will define a special differential operator that can be applied to both the auxiliary error function and its gradient with respect to the weights. We will then see that the result of applying the differential operator to the gradient of the auxiliary error function is equivalent to analytically calculating the derivatives required to optimize Equations 1 and 2. We then show an example of the technique applied to an MLP. Finally, in the last step, the complete algorithm is presented.

To avoid confusion, when referring to generic data-driven models, the model will always be expressed as a vector function y = f(x, w), where x refers to the model input and w refers to a vector of all of the tunable parameters of the model. In this way, we can talk about models while ignoring the mechanics of how the models work internally. Complementary to the generic vector notation, the notation for an MLP uses only scalar symbols; however, these symbols must refer to internal variables of the model (e.g., neuron thresholds, net inputs, weights, etc.), which can lead to some ambiguity. To be clear, when using vector notation, the input and output of an MLP will always be denoted by x and y, respectively, and the collection of all of the weights (including biases) maps to the vector w. However, when using scalar arithmetic, the scalar notation for MLPs will apply.

3.1  Auxiliary error function

Our auxiliary error function, Ē, is defined as

    Ē(x, w) = u^T f(x, w).                                                (3)

Note that we never actually optimize with respect to Ē; we define it only because it has the property that ∂Ē/∂x = u^T J, which will be useful to the derivation shortly. Note that ∂Ē/∂x appears in the Taylor expansion of Ē about a point in input space:

    Ē(x + Δx, w) = Ē(x, w) + (∂Ē/∂x) Δx + O(||Δx||^2).                    (4)

Thus, while holding the weights, w, fixed and letting Δx be a perturbation of the input, x, Equation 4 characterizes how small changes in the input of the model change the value of the auxiliary error function.

By setting Δx = rv, with v being an arbitrary vector and r being a small value, we can rearrange Equation 4 into the form:

    (∂Ē/∂x) v = (1/r) [Ē(x + rv, w) - Ē(x, w)] + O(r)
              = lim_{r -> 0} (1/r) [Ē(x + rv, w) - Ē(x, w)]
              = ∂/∂r Ē(x + rv, w) |_{r=0}.                                (5)

This final expression will allow us to define the differential operator in the next subsection.

3.2  Differential operator

Let h(x, w) be an arbitrary twice differentiable function. We define the differential operator

    R_v{h(x, w)} ≡ ∂/∂r h(x + rv, w) |_{r=0},                             (6)

which has the property that R_v{Ē(x, w)} = u^T J v. Being a differential operator, R_v{·} obeys all of the standard rules for differentiation:

    R_v{c}                  = 0
    R_v{c · h(x, w)}        = c · R_v{h(x, w)}
    R_v{h(x, w) + g(x, w)}  = R_v{h(x, w)} + R_v{g(x, w)}
    R_v{h(x, w) · g(x, w)}  = R_v{h(x, w)} · g(x, w) + h(x, w) · R_v{g(x, w)}
    R_v{h(g(x, w), w)}      = h'(g(x, w)) · R_v{g(x, w)}
    R_v{dh(x, w)/dt}        = d/dt R_v{h(x, w)}

The operator also yields the identity R_v{x} = v.
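The following sketch (ours; a central-difference approximation on a toy model, not part of the original text) illustrates the operator of Equation 6 and numerically checks its key property R_v{Ē(x, w)} = u^T J v.

```python
import numpy as np

def f(x, w):
    # Toy twice-differentiable feedforward model (illustrative only).
    W = w[:6].reshape(2, 3)
    return np.tanh(W @ x)

def E_bar(x, w, u):
    # Auxiliary error function, Equation 3: E_bar = u^T f(x, w).
    return u @ f(x, w)

def R_v(h, x, w, v, eps=1e-5):
    # R_v{h} = d/dr h(x + r v, w) at r = 0, via central differences.
    return (h(x + eps * v, w) - h(x - eps * v, w)) / (2 * eps)

rng = np.random.default_rng(0)
x, w = rng.normal(size=3), rng.normal(size=6)
u, v = rng.normal(size=2), rng.normal(size=3)

# Left-hand side: apply R_v directly to the auxiliary error function.
lhs = R_v(lambda x_, w_: E_bar(x_, w_, u), x, w, v)

# Right-hand side: u^T J v with a finite-difference Jacobian J = df/dx.
J = np.zeros((2, 3))
for j in range(3):
    dx = np.zeros(3); dx[j] = 1e-6
    J[:, j] = (f(x + dx, w) - f(x - dx, w)) / (2e-6)
rhs = u @ J @ v

print(lhs, rhs)   # the two values should agree to several decimal places
```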
3.3  Equivalence

We will now see that the result of calculating R_v{∂Ē/∂w} can be used to calculate both ∂E_u/∂w and ∂E_v/∂w. Note that Equations 3-5 all assume that both u and v are independent of x and w. To calculate ∂E_u/∂w and ∂E_v/∂w, we will actually set u or v to a value that depends on both x and w; however, the derivation still works because our choices are explicitly made in such a way that the chain rule of differentiation is not supposed to be applied to these terms. Hence, the correct analytical solution is obtained despite the dependence.

To optimize with respect to Equation 1, we use:

    ∂/∂w (1/2) ||J^T u - a||^2 = (J^T u - a)^T (∂(J^T u)/∂w) = R_v{∂Ē/∂w},          (7)

with v = (J^T u - a). To optimize with respect to Equation 2, we use:

    ∂/∂w (1/2) ||J v - b||^2 = (J v - b)^T (∂(J v)/∂w) = R_v{∂Ē/∂w},                (8)

with u = (J v - b).

3.4  Method applied to MLPs

We are now ready to see how this technique can be applied to a specific type of model. Consider an MLP with L + 1 layers of nodes defined by the equations:

    y_i^l = g(x_i^l)                                                                (9)
    x_i^l = Σ_j^{N_l} y_j^{l-1} w_{ij}^l + θ_i^l.                                   (10)

In these equations, superscripts denote the layer number (starting at 0), subscripts index over terms in a particular layer, and N_l is the number of input nodes in layer l. Thus, y_i^l is the output of neuron i at node layer l, and x_i^l is the net input coming into the same neuron. Moreover, y_i^L is an output of the entire MLP while y_i^0 is an input going into the MLP.

The feedback equations calculated with respect to Ē are:

    ∂Ē/∂y_i^L    = u_i                                                              (11)
    ∂Ē/∂y_i^l    = Σ_j w_{ji}^{l+1} ∂Ē/∂x_j^{l+1}        (for l < L)                (12)
    ∂Ē/∂x_i^l    = g'(x_i^l) ∂Ē/∂y_i^l                                              (13)
    ∂Ē/∂w_{ij}^l = y_j^{l-1} ∂Ē/∂x_i^l                                              (14)
    ∂Ē/∂θ_i^l    = ∂Ē/∂x_i^l,                                                       (15)

where the u_i term is a component in the vector u from Equation 1. Applying the R_v{·} operator to the feedforward equations yields:

    R_v{y_i^0} = v_i                                                                (16)
    R_v{y_i^l} = g'(x_i^l) R_v{x_i^l}                    (for l > 0)                (17)
    R_v{x_i^l} = Σ_j^{N_l} R_v{y_j^{l-1}} w_{ij}^l,                                 (18)

where the v_i term is a component in the vector v from Equation 2. As the final step, we apply the R_v{·} operator to the feedback equations, which yields:

    R_v{∂Ē/∂y_i^L}    = 0                                                           (19)
    R_v{∂Ē/∂y_i^l}    = Σ_j w_{ji}^{l+1} R_v{∂Ē/∂x_j^{l+1}}                         (20)
    R_v{∂Ē/∂x_i^l}    = g'(x_i^l) R_v{∂Ē/∂y_i^l} + g''(x_i^l) R_v{x_i^l} ∂Ē/∂y_i^l  (21)
    R_v{∂Ē/∂w_{ij}^l} = y_j^{l-1} R_v{∂Ē/∂x_i^l} + R_v{y_j^{l-1}} ∂Ē/∂x_i^l         (22)
    R_v{∂Ē/∂θ_i^l}    = R_v{∂Ē/∂x_i^l}.                                             (23)

3.5  Complete algorithm

Implementing this algorithm is nearly as simple as implementing normal gradient descent. For each type of variable that is used in an MLP (net input, neuron output, weights, thresholds, partial derivatives, etc.), we require that an extra variable be allocated to hold the result of applying the R_v{·} operator to the original variable. With this change in place, the complete algorithm to compute ∂E_u/∂w is as follows (a NumPy sketch of these passes for a small MLP is given at the end of this subsection):

• Set u and a to the user-specified vectors from Equation 1.

• Set the MLP inputs to the value of x at which J is to be evaluated.

• Perform a normal feedforward pass using Equations 9 and 10.

• Set ∂Ē/∂y_i^L to u_i.

• Perform the feedback pass with Equations 11-15. Note that the values in the ∂Ē/∂y_i^0 terms now form the vector J^T u.

• Set v to (J^T u - a).

• Perform an R_v{·} forward pass with Equations 16-18.

• Set the R_v{∂Ē/∂y_i^L} terms to 0.

• Perform an R_v{·} backward pass with Equations 19-23.

After the last step, the values in the R_v{∂Ē/∂w_{ij}^l} and R_v{∂Ē/∂θ_i^l} terms contain the required result. It is important to note that the time complexity of the "J-forward" and "J-backward" calculations is nearly identical to that of the typical output and gradient evaluations (i.e., the "forward" and "backward" passes) of the models used.

A similar technique can be used for calculating ∂E_v/∂w. The main difference is that the R_v{·} forward pass is performed between the normal forward and backward passes, because u can only be determined after R_v{f(x, w)} has been calculated.
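To make the steps above concrete, here is a minimal NumPy sketch (ours, not code from the paper) of the ∂E_u/∂w computation for a one-hidden-layer tanh MLP, written in vector form rather than the scalar notation of Equations 9-23; the network sizes, variable names, and finite-difference spot check are illustrative assumptions, and the ∂E_v/∂w variant with its different pass ordering is not shown.

```python
import numpy as np

g   = np.tanh
gp  = lambda s: 1.0 - np.tanh(s) ** 2          # g'
gpp = lambda s: -2.0 * np.tanh(s) * gp(s)      # g''

def jprop_grad_Eu(params, x, u, a):
    # J-prop for E_u(w) = 0.5 * ||J^T u - a||^2 on a 1-hidden-layer tanh MLP.
    W1, b1, W2, b2 = params
    # Normal feedforward pass (Equations 9-10).
    y0 = x
    x1 = W1 @ y0 + b1; y1 = g(x1)
    x2 = W2 @ y1 + b2; y2 = g(x2)
    # Feedback pass for E_bar = u^T y (Equations 11-15).
    dy2 = u
    dx2 = gp(x2) * dy2
    dy1 = W2.T @ dx2
    dx1 = gp(x1) * dy1
    dy0 = W1.T @ dx1                 # equals J^T u
    v = dy0 - a                      # choice of v from Equation 7
    # R_v forward pass (Equations 16-18).
    Ry0 = v
    Rx1 = W1 @ Ry0; Ry1 = gp(x1) * Rx1
    Rx2 = W2 @ Ry1
    # R_v backward pass (Equations 19-23); R_v{dE_bar/dy^L} = 0.
    Rdx2 = gpp(x2) * Rx2 * dy2
    Rdy1 = W2.T @ Rdx2
    Rdx1 = gp(x1) * Rdy1 + gpp(x1) * Rx1 * dy1
    # Weight and threshold derivatives: dE_u/dw = R_v{dE_bar/dw}.
    return (np.outer(Rdx1, y0) + np.outer(dx1, Ry0), Rdx1,
            np.outer(Rdx2, y1) + np.outer(dx2, Ry1), Rdx2)

def E_u(params, x, u, a, eps=1e-6):
    # Direct evaluation of Equation 1 with a finite-difference Jacobian,
    # used only to spot-check the J-prop gradient below.
    W1, b1, W2, b2 = params
    def f(x_):
        return g(W2 @ g(W1 @ x_ + b1) + b2)
    J = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(x.size)])
    return 0.5 * np.linalg.norm(J.T @ u - a) ** 2

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 2
params = [rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid),
          rng.normal(size=(n_out, n_hid)), rng.normal(size=n_out)]
x, u, a = rng.normal(size=n_in), rng.normal(size=n_out), rng.normal(size=n_in)

dW1, db1, dW2, db2 = jprop_grad_Eu(params, x, u, a)

# Finite-difference gradient with respect to one weight, as a spot check.
h = 1e-5
p_plus  = [p.copy() for p in params]; p_plus[0][0, 0]  += h
p_minus = [p.copy() for p in params]; p_minus[0][0, 0] -= h
fd = (E_u(p_plus, x, u, a) - E_u(p_minus, x, u, a)) / (2 * h)
print(dW1[0, 0], fd)   # should agree to several decimal places
```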
4  Experimental results

To demonstrate the effectiveness and generality of the J-prop algorithm, we have implemented it on top of an existing neural network library [9] in such a way that the algorithm can be used on a large number of architectures, including MLPs, radial basis function networks, and higher-order networks.

We trained an MLP with ten hidden tanh nodes on 100 points with conjugate gradient. The training exemplars consisted of inputs in [-1, 1] and a target derivative from 3 cos(3x) + 5 cos(10x). Our unknown function (from which the MLP never sees data) is sin(3x) + (1/2) sin(10x). The model quickly converges to a solution in approximately 100 iterations.

[Figure 1: Learning only the derivative: (a) poor approximation of the function; (b) excellent approximation of the derivative.]

Figure 1 shows the performance of the MLP. Having never seen data from the unknown function, the MLP yields a poor approximation of the function, but a very accurate approximation of the function's derivative. We could have trained on both outputs and derivatives, but our goal was to illustrate that J-prop can target derivatives alone.
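The sketch below (ours, not the authors' code) shows how this experimental setup maps onto Equation 2: with a scalar input, the Jacobian of each example is 1x1, so the derivative target is imposed with v = [1] and b equal to the target derivative. The 1-10-1 architecture with a linear output unit is an assumption, and the conjugate-gradient training loop from the paper is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.uniform(-1.0, 1.0, size=100)                      # 100 training inputs
target_deriv = 3 * np.cos(3 * xs) + 5 * np.cos(10 * xs)    # derivative targets
unknown = lambda x: np.sin(3 * x) + 0.5 * np.sin(10 * x)   # never seen by the model

def mlp(x, w1, b1, w2, b2):
    # Scalar-input, scalar-output MLP with ten hidden tanh nodes.
    return w2 @ np.tanh(np.outer(w1, x) + b1[:, None]) + b2

def mlp_deriv(x, w1, b1, w2, b2):
    # Analytic dy/dx for the 1-D input case (a 1x1 Jacobian per example).
    return w2 @ ((1 - np.tanh(np.outer(w1, x) + b1[:, None]) ** 2) * w1[:, None])

def E_v_total(params):
    # Equation 2 summed over the training set: 0.5 * sum (J v - b)^2 with v = 1.
    return 0.5 * np.sum((mlp_deriv(xs, *params) - target_deriv) ** 2)

params = [rng.normal(size=10), rng.normal(size=10), rng.normal(size=10), rng.normal()]
print(E_v_total(params))   # objective value before any training
```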
5  Conclusions

We have introduced a general method for calculating the weight gradient of functions of the Jacobian matrix of feedforward nonlinear systems. The method can be easily applied to most nonlinear models in common use today. The resulting algorithm, J-prop, can be easily modified to minimize functionals from several application domains [10]. Some possible uses include: targeting known first derivatives, implementing Tangent Prop and Double Backpropagation, enforcing identical I/O sensitivities in auto-encoders, deflating the largest eigenvalue and minimizing all eigenvalue bounds, optimizing the determinant for blind source separation, and building nonlinear controllers.

While some special cases of the J-prop algorithm have already been studied, a great deal is unknown about how optimization of the Jacobian changes the overall optimization problem. Some anecdotal evidence seems to imply that optimization of the Jacobian can lead to better generalization and faster training. It remains to be seen if J-prop used on a nonlinear extension of linear methods will lead to superior solutions.

Acknowledgements

We thank Frans Coetzee, Yannis Kevrekidis, Joe O'Ruanaidh, Lucas Parra, Scott Rickard, Justinian Rosca, and Patrice Simard for helpful discussions. GWF would also like to thank Eric Baum and the NEC Research Institute for funding the time to write up these results.

References

[1] H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6), November 1992.

[2] P. Simard, B. Victorri, Y. Le Cun, and J. Denker. Tangent Prop - a formalism for specifying selected invariances in an adaptive network. In John E. Moody, Steve J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 895-903. Morgan Kaufmann Publishers, Inc., 1992.

[3] H. White and A. R. Gallant. On learning the derivatives of an unknown mapping with multilayer feedforward networks. In Halbert White, editor, Artificial Neural Networks, chapter 12, pages 206-223. Blackwell, Cambridge, Mass., 1992.

[4] H. White, K. Hornik, and M. Stinchcombe. Universal approximation of an unknown mapping and its derivative. In Halbert White, editor, Artificial Neural Networks, chapter 6, pages 55-77. Blackwell, Cambridge, Mass., 1992.

[5] J. Principe, A. Rathie, and J. Kuo. Prediction of chaotic time series with neural networks and the issues of dynamic modeling. Bifurcations and Chaos, 2(4), 1992.

[6] G. Deco and B. Schürmann. Dynamic modeling of chaotic time series. In Russell Greiner, Thomas Petsche, and Stephen Jose Hanson, editors, Computational Learning Theory and Natural Learning Systems, volume IV of Making Learning Systems Practical, chapter 9, pages 137-153. The MIT Press, Cambridge, Mass., 1997.

[7] G. E. Hinton. Learning distributed representations of concepts. In Proc. Eighth Annual Conf. Cognitive Science Society, pages 1-12, Hillsdale, NJ, 1986. Erlbaum.

[8] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.

[9] G. W. Flake. Industrial strength modeling tools. Submitted to NIPS 99, 1999.

[10] G. W. Flake and B. A. Pearlmutter. Optimizing properties of the Jacobian of nonlinear feedforward systems. In preparation, 1999.
", "award": [], "sourceid": 1702, "authors": [{"given_name": "Gary", "family_name": "Flake", "institution": null}, {"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}]}