{"title": "Estimating Equivalent Kernels for Neural Networks: A Data Perturbation Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 382, "page_last": 388, "abstract": null, "full_text": "Estimating Equivalent Kernels For Neural \nNetworks: A Data Perturbation Approach \n\nA. Neil Burgess \n\nDepartment of Decision Science \n\nLondon Business School \nLondon, NW1  4SA, UK \n\n(N.Burgess@lbs.lon.ac.uk) \n\nABSTRACT \n\nWe  describe  the  notion  of  \"equivalent  kernels\"  and  suggest  that  this \nprovides a framework  for comparing different classes of regression models, \nincluding  neural  networks  and  both  parametric  and  non-parametric \nstatistical techniques.  Unfortunately,  standard techniques break down  when \nfaced with models, such as neural networks,  in which there is more than one \n\"layer\" of adjustable parameters.  We propose an algorithm which overcomes \nthis limitation,  estimating the equivalent kernels for  neural network models \nusing  a  data  perturbation approach.  Experimental  results  indicate  that  the \nnetworks do  not  use  the  maximum possible  number of degrees of freedom, \nthat  these  can  be  controlled  using  regularisation  techniques  and  that  the \nequivalent kernels learnt by the network vary both  in \"size\"  and in \"shape\" \nin different regions of the input space. 
\n\n1 INTRODUCTION\n\nThe dominant approaches within the statistical community, such as multiple linear regression but even extending to advanced techniques such as generalised additive models (Hastie and Tibshirani, 1990), projection pursuit regression (Friedman and Stuetzle, 1981), and classification and regression trees (Breiman et al., 1984), tend to err, when they do, on the high-bias side, owing to restrictive assumptions regarding the functional form of the response to individual variables and/or the limited nature of the interaction effects which can be accommodated. Other classes of models, such as high-order multivariate adaptive regression spline models (Friedman, 1991), interaction splines (Wahba, 1990) and especially non-parametric regression techniques (H\u00e4rdle, 1990), are capable of relaxing some or all of these restrictive assumptions, but run the converse risk of suffering high variance, or \"over-fitting\".\n\nA large literature of experimental results suggests that, under the right conditions, the flexibility of neural networks allows them to out-perform other techniques. Where the current understanding is limited, however, is in analysing trained neural networks to understand how the degrees of freedom have been allocated, in a way which allows meaningful comparisons with other classes of models. We propose that the notion of \"equivalent kernels\" [e.g. (Hastie and Tibshirani, 1990)] can provide a unifying framework for neural networks and other classes of regression model, as well as providing important information about the neural network itself. 
We describe an algorithm for estimating equivalent kernels for neural networks which overcomes the limitations of existing analytical methods.\n\nIn the following section we describe the concept of equivalent kernels. In Section 3 we describe an algorithm which estimates how the response function learned by the neural network would change if the training data were modified slightly, from which we derive the equivalent kernels for the network. Section 4 provides simulation results for two controlled experiments. Section 5 contains a brief discussion of some of the implications of this work, and highlights a number of interesting directions for further research. A summary of the main points of the paper is presented in Section 6.\n\n2 EQUIVALENT KERNELS\n\nNon-parametric regression techniques, such as kernel smoothing, local regression and nearest neighbour regression, can all be expressed in the form:\n\n$$y(z) = \\int_{x=-\\infty}^{\\infty} \\varphi(z, x)\\, f(x)\\, t(x)\\, dx \\tag{1}$$\n\nwhere $y(z)$ is the response at the query point $z$, $\\varphi(z, x)$ is the weighting, or kernel, which is \"centred\" at $z$, $f(x)$ is the input density and $t(x)$ is the target function.\n\nIn finite samples, this is approximated by:\n\n$$y(x_i) = \\sum_{j=1}^{n} \\varphi(x_i, x_j)\\, t_j \\tag{2}$$\n\nand the response at point $x_i$ is a weighted average of the sampled target values across the entire dataset. Furthermore, the response can be viewed as a least squares estimate for $y(x_i)$ because we can write it as a solution to the minimization problem:\n\n$$\\min_{y(x_i)} \\left( \\sum_{j=1}^{n} \\varphi(x_i, x_j)\\, t_j - y(x_i) \\right)^{2} \\tag{3}$$\n\nWe can combine the kernel functions to define the smoother matrix $S$, given by:\n\n$$S = \\begin{pmatrix} \\varphi(x_1, x_1) & \\varphi(x_1, x_2) & \\cdots & \\varphi(x_1, x_n) \\\\ \\varphi(x_2, x_1) & \\varphi(x_2, x_2) & \\cdots & \\varphi(x_2, x_n) \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\varphi(x_n, x_1) & \\varphi(x_n, x_2) & \\cdots & \\varphi(x_n, x_n) \\end{pmatrix} \\tag{4}$$\n\nfrom which we obtain:\n\n$$y = S\\, t \\tag{5}$$\n\nwhere $y = (y(x_1), y(x_2), \\ldots, y(x_n))^T$ and $t = (t_1, t_2, \\ldots, t_n)^T$ is the vector of target values. 
\n\nFrom the smoother matrix $S$, we can derive many kinds of important information. The model is represented in terms of the influence of each observation on the response at each sample point, allowing us to quantify the effect of outliers, for instance. It is also possible to calculate the model bias and variance at each sample point [see (H\u00e4rdle, 1990) for details]. One important measure which we will return to below is the number of degrees of freedom which are absorbed by the model; a number of definitions can be motivated, but in the case of least squares estimators they turn out to be equivalent [see pp 52-55 of (Hastie and Tibshirani, 1990)]. Perhaps the most intuitive is:\n\n$$\\mathrm{dofs} = \\mathrm{trace}(S) \\tag{6}$$\n\nThus a model which is a look-up table, i.e. $y(x_i) = t_i$, absorbs all $n$ degrees of freedom, whereas the sample mean, $y(x_i) = \\frac{1}{n} \\sum_{j} t_j$, absorbs only one degree of freedom. The degrees of freedom can be taken as a natural measure of model complexity, which is formulated with respect to the data itself, rather than to the number of parameters.\n\nThe discussion above relates only to models which can be expressed in the form given by equation (2), i.e. where the \"kernel functions\" can be computed. Fortunately, many types of parametric models can be \"inverted\" in this manner, providing what are known as \"equivalent kernels\". Consider a model of the form:\n\n$$y(x) = \\sum_{k} w_k\\, \\phi_k(x)$$\n\ni.e. a weighted function of some arbitrary transformations of the input variables. In the case of fitting using a least squares approach, the optimal weights $w = (w_1, w_2, \\ldots, w_n)^T$ are given by:\n\n$$w = \\Phi^{+} t \\tag{7}$$\n\nwhere $\\Phi^{+}$ is the pseudo-inverse of the transformed data matrix $\\Phi$. 
The network output can then be expressed as:\n\n$$y = \\Phi w = \\Phi \\Phi^{+} t \\tag{8}$$\n\n$$y(x_i) = \\sum_{k} \\varphi(x_i, x_k)\\, t_k \\tag{9}$$\n\nand the $\\varphi(x_i, x_k)$ are then the \"equivalent kernels\" of the original model, which is now in the same form as equation (2). Examples of equivalent kernels for different classes of parametric and non-parametric models are given by (Hastie and Tibshirani, 1990), whilst a treatment for Radial Basis Function (RBF) networks is presented in (Lowe, 1995).\n\n3 EQUIVALENT KERNELS FOR NEURAL NETWORKS\n\nThe analytic approach described above relies on the ability to calculate the optimal weights using the pseudo-inverse of the data matrix. This is only possible if the transformations, $\\phi(x)$, are fixed functions, as is typically the case in parametric models or single-layer neural networks. However, for a neural network with more than one layer of adjustable weights, the basis functions are parametrised rather than fixed and are thus themselves a function of the training data. Consequently the equivalent kernels are also dependent on the data, and the problem of finding the equivalent kernels becomes non-linear.\n\nWe adopt a solution to this problem which is based on the following observation. In the case where the equivalent kernels are independent of the observed values $t_j$, we notice from equation (2):\n\n$$\\frac{\\partial y_i}{\\partial t_j} = \\varphi(x_i, x_j) \\tag{10}$$\n\ni.e. the basis function $\\varphi(x_i, x_j)$ is equal to the sensitivity of the response $y(x_i)$ to a small change in the observed value $t_j$. 
This suggests that we approximate the equivalent kernels by turning the above expression around:\n\n$$\\varphi(x_i, x_j) \\approx \\frac{\\psi(x_i) - y(x_i)}{\\varepsilon} \\tag{11}$$\n\nwhere $\\varepsilon$ is a small perturbation of the training data and $\\psi(x_i)$ is the response of the re-optimised network:\n\n$$\\psi(x_i) = \\varphi^{*}(x_i, x_j)\\, (t_j + \\varepsilon) + \\sum_{k \\neq j} \\varphi^{*}(x_i, x_k)\\, t_k \\tag{12}$$\n\nThe notation $\\varphi^{*}$ indicates that the new kernel functions derive from the network fitted to the perturbed data. Note that this takes into account all of the adjustable parameters in the network, whereas treating the basis functions as fixed would give simply the number of additive terms in the final layer of the network.\n\nCalculating the equivalent kernels in this fashion is a computationally intensive procedure, with the network needing to be retrained after perturbing each point in turn. Note that regularisation techniques such as weight decay should be incorporated within this procedure as with initial training and are thus correctly accounted for by the algorithm. The retraining step is facilitated by using the optimised weights from the unperturbed data, causing the network to re-train from weights which are initially almost optimal (especially if the perturbation is small).\n\n4 SIMULATION RESULTS\n\nIn order to investigate the practical viability of estimating equivalent kernels using the perturbation approach, we performed a controlled experiment on simulated data. The target function used was the first two periods of a sine-wave, sampled at 41 points evenly spaced between 0 and $4\\pi$. This function was estimated using a neural network with a single layer of four sigmoid units, a shortcut connection from input to output, and a linear output unit, trained using standard backpropagation. 
\n\nFrom the trained network we then estimated the equivalent kernels using the perturbation method described in the previous section. The resulting kernels for points 0, $\\pi$, and $2\\pi$ are shown in figure 2, below.\n\nFigure 2: Equivalent Kernels for sine-wave problem\n\nAs discussed in the previous section, we can combine the estimated kernels to construct a linear smoother. The correlation coefficient between the function reconstructed from the approximated smoother matrix and the original neural network is found to be 0.995.\n\nFrom equation (6) we find that the network contains approximately 8.2 degrees of freedom; this compares to the 10 potential degrees of freedom, and also to the 6 degrees of freedom which we would expect for an equivalent model with fixed transfer functions. Clearly, to some degree, perturbations in the training data are accommodated by adjustments to the sigmoid functions.\n\nUsing this approach we can also investigate the effects of weight decay on (a) the ability of the network to reproduce the target function, (b) the number of degrees of freedom absorbed by the network, and (c) the kernel functions themselves. We use a standard quadratic weight decay, leading to a cost function of the form:\n\n$$C = (y - f(x))^{2} + \\gamma \\sum w^{2} \\tag{13}$$\n\nThe effect of gradually increasing the weight decay factor, $\\gamma$, on both network performance and capacity is shown in figure 3(b), below: 
\n\nFigure 3: (a) Comparison of network and reconstructed functions with target, and (b) effect of weight decay\n\nLooking at figure 3(b) we note that the two curves follow each other very closely. As the weight decay factor is increased, the effective capacity of the network is reduced and the performance drops off accordingly.\n\nIn one dimension, the main flexibility for the equivalent kernels is one of scale: narrow, concentrated kernels which rely heavily on nearby observations versus broad, diffuse kernels in which the response is conditioned on a larger number of observations. In higher dimensions, however, the real power of neural networks as function estimators lies in the fact that the sensitivity of the estimated network function is itself a flexible function of the input vector. Viewed from the perspective of equivalent kernels, this property might be expected to manifest itself in a change in the shape of the kernels in different regions of the input space. 
In order to investigate this effect we applied the perturbation approach in estimating equivalent kernels for a network trained to reproduce a two-dimensional function; the function chosen was a \"ring\" defined by:\n\n$$z = \\frac{1}{1 + 30\\, (x^{2} + y^{2} - 0.5)^{2}} \\tag{14}$$\n\nFor ease of visualisation the input points were chosen on a regular 15 by 15 grid running between plus and minus one. This function was approximated using a 2(+1)-8-1 network with sigmoidal hidden units and a linear output unit. Selected kernel functions, estimated from this network, are shown in figure 4, below:\n\nFigure 4: Equivalent Kernels: approximated using the perturbation method\n\nThis result clearly shows the changing shape of the kernel functions in different parts of the input space. The function reconstructed from the estimated smoother matrix has a correlation coefficient of 0.987 with the original network function.\n\n5. Discussion\n\nThe ability to transform neural network regression models into an equivalent kernel representation raises the possibility of harnessing the whole battery of statistical methods which have been developed for non-parametric techniques: model selection procedures, prediction interval estimation, calculation of degrees of freedom, and statistical significance testing amongst others. The algorithm described in this paper raises the possibility of applying these techniques to more-powerful networks with two or more layers of adaptable weights, be they based on sigmoids, radial functions, splines or whatever, albeit at the price of significant computational effort.\n\nAnother opportunity is in the area of model combination where the added value from combining models in an ensemble is related to the degree of correlation between the different models (Krogh and Vedelsby, 1995). 
Typically the pointwise correlation between two models will be related to the similarity between their equivalent kernels and so the equivalent kernel approach opens new possibilities for conditionally modifying the ensemble weights without a need for an additional level of learning.\n\nThe influence-based method for estimating the number of degrees of freedom absorbed by a neural network model focuses attention on uncertainty in the data itself, rather than taking the indirect route based on uncertainty in the model parameters; in future work we propose to investigate the similarities and differences between our approach and those based on the \"effective number of parameters\" (Moody, 1992) and Bayesian methods (MacKay, 1992).\n\n6. Summary\n\nWe suggest that equivalent kernels provide an important tool for understanding what neural networks do and how they go about doing it; in particular a large battery of existing statistical tools use information derived from the smoother matrix.\n\nThe perturbation method which we have presented overcomes the limitations of standard approaches, which are only appropriate for models with a single layer of adjustable weights, albeit at considerable computational expense. It has the added bonus of automatically taking into account the effect of regularisation techniques such as weight decay.\n\nThe experimental results illustrate the application of the technique to two simple problems. As expected, the number of degrees of freedom in the models is found to be related to the amount of weight decay used during training. The equivalent kernels are found to vary significantly in different regions of input space and the functions reconstructed from the estimated smoother matrices closely match the original networks.\n\n7. 
References\n\nBreiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., 1984, Classification and Regression Trees, Wadsworth and Brooks/Cole, Monterey.\n\nFriedman, J. H. and Stuetzle, W., 1981, Projection pursuit regression, Journal of the American Statistical Association, Vol. 76, pp. 817-823.\n\nFriedman, J. H., 1991, Multivariate Adaptive Regression Splines (with discussion), Annals of Statistics, Vol. 19, num. 1, pp. 1-141.\n\nH\u00e4rdle, W., 1990, Applied nonparametric regression, Cambridge University Press.\n\nHastie, T. J. and Tibshirani, R. J., 1990, Generalised Additive Models, Chapman and Hall, London.\n\nKrogh, A. and Vedelsby, J., 1995, Neural network ensembles, cross-validation and active learning, NIPS 7, pp. 231-238.\n\nLowe, D., 1995, On the use of nonlocal and non positive definite basis functions in radial basis function networks, Proceedings of the Fourth IEE Conference on Artificial Neural Networks, pp. 206-211.\n\nMacKay, D. J. C., 1992, A practical Bayesian framework for backprop networks, Neural Computation, 4, 448-472.\n\nMoody, J. E., 1992, The effective number of parameters: an analysis of generalisation and regularization in nonlinear learning systems, NIPS 4, 847-54, Morgan Kaufmann, San Mateo.\n\nWahba, G., 1990, Spline Models for Observational Data, Society for Industrial and Applied Mathematics, Philadelphia.\n\n", "award": [], "sourceid": 1326, "authors": [{"given_name": "A.", "family_name": "Burgess", "institution": null}]}