{"title": "From Data Distributions to Regularization in Invariant Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 223, "page_last": 230, "abstract": null, "full_text": "From Data Distributions to \n\nRegularization in Invariant  Learning \n\nTodd  K.  Leen \n\nDepartment of Computer Science and Engineering \n\nOregon Graduate Institute of Science and Technology \n\n20000 N.W.  Walker Rd \nBeaverton, Oregon 97006 \n\ntieen@cse.ogi.edu \n\nAbstract \n\nIdeally pattern recognition machines provide constant output when \nthe inputs are transformed under a group 9 of desired invariances. \nThese invariances can be achieved  by  enhancing the training data \nto include examples of inputs transformed by elements of g,  while \nleaving  the  corresponding  targets  unchanged.  Alternatively  the \ncost  function  for  training  can  include  a  regularization  term  that \npenalizes changes in the output when the input is  transformed un(cid:173)\nder the group. \n\nThis paper relates the two approaches, showing precisely the sense \nin  which  the regularized  cost function  approximates the result  of \nadding  transformed (or distorted)  examples  to  the  training data. \nThe cost function for the enhanced training set is equivalent to the \nsum of the original  cost function plus  a  regularizer.  For  unbiased \nmodels,  the regularizer reduces  to the intuitively obvious choice -\na  term that penalizes changes  in  the output when  the inputs are \ntransformed under  the  group.  For  infinitesimal  transformations, \nthe coefficient of the regularization term reduces to the variance of \nthe distortions introduced into the training data.  This correspon(cid:173)\ndence provides a simple bridge between the two  approaches. \n\n\f22 4 \n\nTodd Leen \n\n1  A pproaches to Invariant Learning \n\nIn machine learning one sometimes wants  to incorporate invariances into the func(cid:173)\ntion  learned.  Our  knowledge  of  the  problem  dictates  that  the  machine  outputs \nought to remain constant when its inputs are transformed under a set of operations \ngl.  In  character  recognition,  for  example,  we  want  the  outputs  to  be  invariant \nunder shifts and small rotations of the input image. \n\nIn neural networks,  there are several ways  to achieve  this invariance \n\n1.  The invariance can be hard-wired by  weight  sharing in the case of summa(cid:173)\n\ntion nodes (LeCun et al.  1990)  or by constraints similar to weight sharing \nin higher-order nodes  (Giles  et al.  1988). \n\n2.  One can enhance the training ensemble by adding examples of inputs trans(cid:173)\n\nformed  under  the  desired  inval\"iance  group,  while  maintaining  the  same \ntargets as for  the raw data. \n\n3.  One can add to the cost function a regularizer that penalizes changes in the \noutput when the input is  transformed by elements of the group (Simard et \nal.  1992). \n\nIntuitively  one  expects  the  approaches  in  3  and  4  to  be  intimately  linked.  This \npaper develops  that correspondence in detail. \n\n2  The Distortion-Enhanced Input Ensemble \n\nLet the input data x  be distributed according to the density function p( x).  The con(cid:173)\nditional distribution for  the corresponding targets is  denoted p(tlx).  For simplicity \nof notation we  take t  E  R.  The extension  to vector  targets is  trivial.  Let  f(x; w) \ndenote the network function,  parameterized by  weights  w.  
2  The Distortion-Enhanced Input Ensemble

Let the input data x be distributed according to the density function p(x). The conditional distribution for the corresponding targets is denoted p(t|x). For simplicity of notation we take t in R; the extension to vector targets is trivial. Let f(x; w) denote the network function, parameterized by weights w. The training procedure is assumed to minimize the expected squared error

    \mathcal{E}(w) = \int\int dt\, dx\; p(t|x)\, p(x)\, [t - f(x;w)]^2        (1)

We wish to consider the effects of adding new inputs that are related to the old by transformations that correspond to the desired invariances. These transformations, or distortions, of the inputs are carried out by group elements g in G. For Lie groups[2], the transformations are analytic functions of parameters \alpha in R^k,

    x \to x' = g(x; \alpha)        (2)

with the identity transformation corresponding to parameter value zero,

    g(x; 0) = x        (3)

In image processing, for example, we may want our machine to exhibit invariance with respect to rotation, scaling, shearing and translations of the plane. These transformations form a six-parameter Lie group.[3]

[2] See for example (Sattinger and Weaver, 1986).
[3] The parameters for rotations, scaling and shearing completely specify elements of GL(2), the four-parameter group of 2 x 2 invertible matrices. The translations carry an additional two degrees of freedom.

By adding distorted input examples we alter the original density p(x). To describe the new density, we introduce a probability density for the transformation parameters p(\alpha). Using this density, the distribution for the distortion-enhanced input ensemble is

    \tilde{p}(x') = \int\int d\alpha\, dx\; p(x'|x,\alpha)\, p(\alpha)\, p(x) = \int\int d\alpha\, dx\; \delta(x' - g(x;\alpha))\, p(\alpha)\, p(x)

where \delta(\cdot) is the Dirac delta function.[4]

[4] In general the density on \alpha might vary through the input space, suggesting the conditional density p(\alpha|x). This introduces rather minor changes in the discussion that will not be considered here.

Finally we impose that the targets remain unchanged when the inputs are transformed according to (2), i.e. p(t|x') = p(t|x). Substituting \tilde{p}(x') into (1) and using the invariance of the targets yields the cost function

    \tilde{\mathcal{E}} = \int\int\int dt\, dx\, d\alpha\; p(t|x)\, p(x)\, p(\alpha)\, [t - f(g(x;\alpha); w)]^2        (4)

Equation (4) gives the cost function for the distortion-enhanced input ensemble.

3  Regularization and Hints

The remainder of the paper makes precise the connection between adding transformed inputs, as embodied in (4), and various regularization procedures. It is straightforward to show that the cost function for the distortion-enhanced ensemble is equivalent to the cost function for the original data ensemble (1) plus a regularization term. Adding and subtracting f(x; w) inside the square brackets in (4) and expanding the quadratic leaves

    \tilde{\mathcal{E}} = \mathcal{E} + \mathcal{E}_R        (5)

where the regularizer is \mathcal{E}_R = \mathcal{E}_H + \mathcal{E}_C, with

    \mathcal{E}_H = \int d\alpha\, p(\alpha) \int dx\, p(x)\, [f(x;w) - f(g(x;\alpha);w)]^2
    \mathcal{E}_C = -2 \int\int\int dt\, dx\, d\alpha\; p(t|x)\, p(x)\, p(\alpha)\, [t - f(x;w)]\, [f(g(x;\alpha);w) - f(x;w)]        (6)

Training with the original data ensemble using the cost function (5) is equivalent to adding transformed inputs to the data ensemble.

The first term of the regularizer, E_H, penalizes the average squared difference between f(x;w) and f(g(x;\alpha);w). This is exactly the form one would intuitively apply in order to ensure that the network output does not change under the transformation x \to g(x;\alpha).
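The hint term E_H can be estimated directly by Monte Carlo: draw transformation parameters from p(\alpha), apply the group action, and average the squared change in the network output over the data. The sketch below is illustrative only, and again assumes the toy rotation group and a Gaussian p(\alpha).

```python
import numpy as np

def rotate(X, alpha):                                  # the same toy group g(x; alpha) as above
    c, s = np.cos(alpha), np.sin(alpha)
    return X @ np.array([[c, s], [-s, c]])

def hint_penalty(f, X, n_samples=200, sigma=0.1, seed=0):
    """Monte-Carlo estimate of E_H = E_alpha E_x [ f(x;w) - f(g(x;alpha);w) ]^2."""
    rng = np.random.default_rng(seed)
    fx = f(X)
    draws = rng.normal(0.0, sigma, size=n_samples)     # alpha ~ p(alpha)
    return float(np.mean([np.mean((fx - f(rotate(X, a))) ** 2) for a in draws]))

X = np.random.default_rng(1).normal(size=(200, 2))
f_invariant = lambda Z: np.sum(Z**2, axis=1)           # rotation invariant: penalty is ~ 0
f_generic = lambda Z: Z[:, 0]                          # not invariant: penalty is > 0
print(hint_penalty(f_invariant, X), hint_penalty(f_generic, X))
```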
Indeed E_H is similar in form to the invariance "hint" proposed by Abu-Mostafa (1993). The difference here is that there is no arbitrary parameter multiplying the term; instead the strength of the regularizer is governed by the average over the density p(\alpha). The term E_H measures the error in satisfying the invariance hint.

The second term, E_C, measures the correlation between the error in fitting the data and the error in satisfying the hint. Only when these correlations vanish is the cost function for the enhanced ensemble equal to the original cost function plus the invariance hint penalty.

The correlation term vanishes trivially when either

1. The invariance f(g(x;\alpha); w) = f(x; w) is satisfied, or

2. The network function equals the least squares regression on t,

    f(x; w) = \int dt\, p(t|x)\, t = E[t|x]        (7)

   The lowest possible E occurs when f satisfies (7), at which point E becomes the variance in the targets averaged over p(x). By substituting this into E_C and carrying out the integration over dt p(t|x), the correlation term is seen to vanish.

If the minimum of Ẽ occurs at a weight for which the invariance is satisfied (condition 1 above), then minimizing Ẽ(w) is equivalent to minimizing E(w). If the minimum of Ẽ occurs at a weight for which the network function is the regression (condition 2), then minimizing Ẽ is equivalent to minimizing the cost function with the intuitive regularizer E_H.[5]

[5] If the data is to be fit optimally, with enough freedom left over to satisfy the invariance hint, then there must be several weight values (perhaps a continuum of such values) for which the network function satisfies (7). That is, the problem must be under-specified. If this is the case, then the interesting part of weight space is just the subset on which (7) is satisfied. On this subset the correlation term in (6) vanishes and the regularizer assumes the intuitive form.

3.1  Infinitesimal Transformations

Above we enumerated the conditions under which the correlation term in E_R vanishes exactly for unrestricted transformations. If the transformations are analytic in the parameters \alpha, then by restricting ourselves to small transformations (those close to the identity) we can show how the correlation term approximately vanishes for unbiased models. To implement this, we assume that p(\alpha) is sharply peaked about the origin, so that large transformations are unlikely.

We obtain an approximation to the cost function Ẽ by expanding the integrands in (6) in power series about \alpha = 0 and retaining terms to second order. This leaves

    \tilde{\mathcal{E}} = \mathcal{E}
      + \int\int dx\, d\alpha\; p(x)\, p(\alpha) \left( \alpha_i \frac{\partial g^\mu}{\partial \alpha_i}\Big|_{\alpha=0} \frac{\partial f}{\partial x^\mu} \right)^2
      - 2 \int\int\int dt\, dx\, d\alpha\; p(t|x)\, p(x)\, p(\alpha)\, [t - f(x;w)]
        \left[ \left( \alpha_i \frac{\partial g^\mu}{\partial \alpha_i}\Big|_{\alpha=0} + \frac{1}{2}\,\alpha_i \alpha_j \frac{\partial^2 g^\mu}{\partial \alpha_i \partial \alpha_j}\Big|_{\alpha=0} \right) \frac{\partial f}{\partial x^\mu}
        + \frac{1}{2}\,\alpha_i \alpha_j \frac{\partial g^\mu}{\partial \alpha_i}\Big|_{\alpha=0} \frac{\partial g^\nu}{\partial \alpha_j}\Big|_{\alpha=0} \frac{\partial^2 f}{\partial x^\mu \partial x^\nu} \right]        (8)

where x^\mu and g^\mu denote the \mu-th components of x and g, \alpha_i denotes the i-th component of the transformation parameter vector, repeated Greek and Roman indices are summed over, and all derivatives are evaluated at \alpha = 0. Note that we have used the fact that Lie group transformations are analytic in the parameter vector \alpha to derive the expansion.
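To make the objects appearing in this expansion concrete, the following sketch (illustrative, not part of the original development) computes the tangent vector dg/d\alpha at the identity and the directional derivative it induces on f, for the one-parameter rotation group used in the earlier sketches.

```python
import numpy as np

# For the planar rotation group g(x; alpha) = R(alpha) x, the derivative at the
# identity is dg/dalpha|_{alpha=0} = (-x2, x1).
def tangent_vector(X):
    return np.stack([-X[:, 1], X[:, 0]], axis=1)

def directional_derivative(f, X, eps=1e-5):
    """(dg/dalpha|_0 . grad f)(x), via a central finite difference along the tangent."""
    v = tangent_vector(X)
    return (f(X + eps * v) - f(X - eps * v)) / (2 * eps)

X = np.random.default_rng(0).normal(size=(5, 2))
f = lambda Z: np.sum(Z**2, axis=1)                     # rotation invariant, so the derivative is ~ 0
print(directional_derivative(f, X))
```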
Finally we introduce two assumptions on the distribution p(\alpha). First, \alpha is assumed to be zero mean; in the linear approximation this corresponds to a distribution of distortions whose mean is the identity transformation. Second, we assume that the components of \alpha are uncorrelated, so that the covariance matrix is diagonal with elements \sigma_i^2, i = 1, ..., k.[6] With these assumptions, the cost function for the distortion-enhanced ensemble simplifies to

    \tilde{\mathcal{E}} = \mathcal{E}
      + \sum_{i=1}^{k} \sigma_i^2 \int dx\, p(x) \left( \frac{\partial g^\mu}{\partial \alpha_i}\Big|_{\alpha=0} \frac{\partial f}{\partial x^\mu} \right)^2
      + \sum_{i=1}^{k} \sigma_i^2 \int\int dx\, dt\; p(t|x)\, p(x)\, \big( f(x;w) - t \big)
        \left[ \frac{\partial^2 g^\mu}{\partial \alpha_i^2}\Big|_{\alpha=0} \frac{\partial f}{\partial x^\mu}
        + \frac{\partial g^\mu}{\partial \alpha_i}\Big|_{\alpha=0} \frac{\partial g^\nu}{\partial \alpha_i}\Big|_{\alpha=0} \frac{\partial^2 f}{\partial x^\mu \partial x^\nu} \right]        (9)

This last expression provides a simple bridge between the method of adding transformed examples to the data and the alternative of adding a regularizer to the cost function: the coefficient of the regularization term in the latter approach is equal to the variance of the transformation parameters in the former approach.

[6] Note that the transformed patterns may be correlated in parts of the pattern space. For example the results of applying the shearing and rotation operations to an infinite vertical line are indistinguishable. In general, there may be regions of the pattern space for which the actions of several different group elements are indistinguishable; that is, x' = g(x;\alpha) = g(x;\beta). However this does not imply that \alpha and \beta are statistically correlated.

3.1.1  Unbiased Models

For unbiased models the regularizer assumes a particularly simple form. Suppose the network function is rich enough to form an unbiased estimate of the least squares regression on t for the undistorted data ensemble. That is, there exists a weight value w_0 such that

    f(x; w_0) = \int dt\, t\, p(t|x) \equiv E[t|x]        (10)

This is the global minimum for the original error E(w).

The arguments of section 3 apply here as well. However we can go further: even if there is only a single, isolated weight value for which (10) is satisfied, the correlation term in the regularizer vanishes to O(\sigma^2). To see this, note that by the implicit function theorem the modified cost function (9) has its global minimum at the new weight[7]

    \tilde{w}_0 = w_0 + O(\sigma^2)        (11)

At this weight the network function is no longer the regression on t, but rather

    f(x; \tilde{w}_0) = E[t|x] + O(\sigma^2)        (12)

Substituting (12) into (9), we find that the minimum of (9) is, to O(\sigma^2), at the same weight as the minimum of

    \tilde{\mathcal{E}} = \mathcal{E} + \sum_{i=1}^{k} \sigma_i^2 \int dx\, p(x) \left[ \frac{\partial g^\mu}{\partial \alpha_i}\Big|_{\alpha=0} \frac{\partial f(x;w)}{\partial x^\mu} \right]^2        (13)

To O(\sigma^2), minimizing (13) is equivalent to minimizing (9), so we regard (13) as the effective cost function.

[7] We assume that the Hessian of E is nonsingular at w_0.

The regularization term in (13) is proportional to the average square of the gradient of the network function along the direction in the input space generated by the linear part of g. The quantity inside the square brackets is just the linear part of [f(g(x;\alpha);w) - f(x;w)] from (6). The coefficient of the regularization term is just the variance of the distribution of distortion parameters.

This is precisely the form of the regularizer given by Simard et al. in their tangent prop algorithm (Simard et al., 1992).
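In practice, (13) can be estimated on a batch by adding to the data-fit term the squared directional derivative of the network output along each tangent direction, weighted by \sigma_i^2. The sketch below is an illustration rather than a reproduction of the original tangent prop implementation; it uses finite differences purely for simplicity, and the rotation group and all names are assumptions carried over from the earlier sketches.

```python
import numpy as np

def tangent_prop_penalty(f, X, tangents, sigmas, eps=1e-5):
    """Estimate sum_i sigma_i^2 E_x [ (dg/dalpha_i|_0 . grad f)(x) ]^2, as in (13).

    tangents: one function per group parameter, returning dg/dalpha_i|_{alpha=0} at each x.
    sigmas:   standard deviations of the corresponding distortion parameters.
    """
    penalty = 0.0
    for tv, sigma in zip(tangents, sigmas):
        v = tv(X)
        deriv = (f(X + eps * v) - f(X - eps * v)) / (2 * eps)   # directional derivative
        penalty += sigma**2 * np.mean(deriv**2)
    return penalty

# Example with the one-parameter rotation group in the plane.
rotation_tangent = lambda Z: np.stack([-Z[:, 1], Z[:, 0]], axis=1)
X = np.random.default_rng(0).normal(size=(200, 2))
t = np.sum(X**2, axis=1)
f = lambda Z: Z[:, 0] + Z[:, 1] ** 2                            # a non-invariant fit
data_fit = np.mean((t - f(X)) ** 2)                             # the original cost E
total = data_fit + tangent_prop_penalty(f, X, [rotation_tangent], sigmas=[0.1])
print(data_fit, total)
```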
This derivation shows the equivalence (to O(\sigma^2)) between the tangent prop regularizer and the alternative of modifying the input distribution. Furthermore, we see that with this equivalence, the constant fixing the strength of the regularization term is simply the variance of the distortions introduced into the original training set.

We should stress that the equivalence between the regularizer and the distortion-enhanced ensemble in (13) holds only to O(\sigma^2). If one allows the variance of the distortion parameters \sigma^2 to become arbitrarily large in an effort to mock up an arbitrarily large regularization term, then the equivalence expressed in (13) breaks down, since terms of order O(\sigma^4) can no longer be neglected. In addition, if the transformations are to be kept small so that the linearization holds (e.g. by restricting the density on \alpha to have support on a small neighborhood of zero), then the variance will be bounded above.

3.1.2  Smoothing Regularizers

In the previous sections we showed the equivalence between modifying the input distribution and adding a regularizer to the cost function. We derived this equivalence to illuminate mechanisms for obtaining invariant pattern recognition. The technique for dealing with infinitesimal transformations in section 3.1 was used by Bishop (1994) to show the equivalence between added input noise and smoothing regularizers. Bishop's results, though they preceded our own, are a special case of the results presented here. Suppose the group G is restricted to translations by random vectors, g(x;\alpha) = x + \alpha, where \alpha is spherically distributed with variance \sigma_\alpha^2. Since \partial g^\mu / \partial \alpha_i |_{\alpha=0} = \delta^\mu_i, the regularizer in (13) becomes

    \mathcal{E}_R = \sigma_\alpha^2 \int dx\, p(x) \sum_\mu \left( \frac{\partial f}{\partial x^\mu} \right)^2        (14)

This regularizer penalizes large magnitude gradients in the network function and is, as pointed out by Bishop, one of the class of generalized Tikhonov regularizers.
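For this special case the penalty is just the mean squared gradient of the network output, scaled by the noise variance, and can be estimated directly. The brief sketch below is illustrative and uses finite-difference gradients.

```python
import numpy as np

def smoothing_penalty(f, X, sigma_alpha, eps=1e-5):
    """Estimate sigma_alpha^2 E_x ||grad f(x)||^2, the Tikhonov-style regularizer in (14)."""
    n, d = X.shape
    sq_grad = np.zeros(n)
    for mu in range(d):
        e = np.zeros(d)
        e[mu] = eps
        sq_grad += ((f(X + e) - f(X - e)) / (2 * eps)) ** 2     # (df/dx_mu)^2
    return sigma_alpha**2 * np.mean(sq_grad)

X = np.random.default_rng(0).normal(size=(200, 2))
f = lambda Z: np.sin(Z[:, 0]) + Z[:, 1] ** 2
print(smoothing_penalty(f, X, sigma_alpha=0.1))
```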
4  Summary

We have shown that enhancing the input ensemble by adding examples transformed under a group x \to g(x;\alpha), while maintaining the target values, is equivalent to adding a regularizer to the original cost function. For unbiased models the regularizer reduces to the intuitive form that penalizes the mean squared difference between the network output for transformed and untransformed inputs, i.e. the error in satisfying the desired invariance. In general the regularizer includes a term that measures correlations between the error in fitting the data and the error in satisfying the desired invariance. For infinitesimal transformations, the regularizer is equivalent (up to terms linear in the variance of the transformation parameters) to the tangent prop form given by Simard et al. (1992), with regularization coefficient equal to the variance of the transformation parameters. In the special case that the group transformations are limited to random translations of the input, the regularizer reduces to a standard smoothing regularizer.

We gave conditions under which enhancing the input ensemble and adding the intuitive regularizer E_H are equivalent. However this equivalence is only with regard to the optimal weight; we have not compared the training dynamics of the two approaches. In particular, it is quite possible that the full regularizer E_H + E_C exhibits different training dynamics from the intuitive form E_H. For the approach in which data are added to the input ensemble, one can easily construct datasets and distributions p(\alpha) that either increase or decrease the condition number of the Hessian. Finally, it may be that the intuitive regularizer can have either detrimental or positive effects on the Hessian as well.

Acknowledgments

I thank Lodewyk Wessels, Misha Pavel, Eric Wan, Steve Rehfuss, Genevieve Orr and Patrice Simard for stimulating and helpful discussions, and the reviewers for helpful comments. I am grateful to my father for what he gave to me in life, and for the presence of his spirit after his recent passing.

This work was supported by EPRI under grant RP8015-2, AFOSR under grant FF4962-93-1-0253, and ONR under grant N00014-91-J-1482.

References

Yaser S. Abu-Mostafa. A method for learning from hints. In S. Hanson, J. Cowan, and C. Giles, editors, Advances in Neural Information Processing Systems, vol. 5, pages 73-80. Morgan Kaufmann, 1993.

Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. To appear in Neural Computation, 1994.

C. L. Giles, R. D. Griffin, and T. Maxwell. Encoding geometric invariances in higher-order neural networks. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 301-309. American Institute of Physics, 1988.

Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, vol. 2, pages 396-404. Morgan Kaufmann Publishers, 1990.

Patrice Simard, Bernard Victorri, Yann Le Cun, and John Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In John E. Moody, Steven J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895-903. Morgan Kaufmann, 1992.

D. H. Sattinger and O. L. Weaver. Lie Groups and Algebras with Applications to Physics, Geometry and Mechanics. Springer-Verlag, 1986.
", "award": [], "sourceid": 925, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}]}