{"title": "Second-order Learning Algorithm with Squared Penalty Term", "book": "Advances in Neural Information Processing Systems", "page_first": 627, "page_last": 633, "abstract": null, "full_text": "Second-order Learning Algorithm with Squared Penalty Term \n\nKazumi Saito \nRyohei Nakano \nNTT Communication Science Laboratories \n2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan \n{saito,nakano}@cslab.kecl.ntt.jp \n\nAbstract \n\nThis paper compares three penalty terms with respect to the efficiency of supervised learning, using first- and second-order learning algorithms. Our experiments showed that, for a reasonably adequate penalty factor, the combination of the squared penalty term and the second-order learning algorithm improves the convergence performance by a factor of more than 20 over the other combinations, while also yielding better generalization performance. \n\n1 INTRODUCTION \n\nIt has been found empirically that adding a penalty term to the objective function in the learning of neural networks can lead to significant improvements in network generalization. Such terms have been proposed from several viewpoints, such as weight decay (Hinton, 1987), regularization (Poggio & Girosi, 1990), function smoothing (Bishop, 1995), weight pruning (Hanson & Pratt, 1989; Ishikawa, 1990), and Bayesian priors (MacKay, 1992; Williams, 1995). Some are calculated using simple arithmetic operations, while others use higher-order derivatives. The most important evaluation criterion for these terms is how much the generalization performance improves, but learning efficiency is also an important criterion in large-scale practical problems; i.e., computationally demanding terms are hardly applicable to such problems. 
Here, it is natural to expect that the effects of penalty terms depend on the learning algorithm; thus, comparative evaluations are needed. \n\nThis paper evaluates the efficiency of first- and second-order learning algorithms with three penalty terms. Section 2 explains the learning framework and presents a second-order algorithm with the penalty terms. Section 3 shows experimental results for a regression problem, a graphical evaluation, and the determination of a penalty factor using cross-validation. \n\n2 LEARNING WITH PENALTY TERM \n\n2.1 Framework \n\nLet $\\{(x_1, y_1), \\ldots, (x_m, y_m)\\}$ be a set of examples, where $x_t$ denotes an n-dimensional input vector and $y_t$ a target value corresponding to $x_t$. In a three-layer neural network, let $h$ be the number of hidden units, $w_j$ ($j = 1, \\ldots, h$) be the weight vector between all the input units and the hidden unit $j$, and $w_0 = (w_{00}, \\ldots, w_{0h})^T$ be the weight vector between all the hidden units and the output unit; $w_{j0}$ is a bias term and $x_{t0}$ is set to 1. Note that $a^T$ denotes the transpose of a vector $a$. Hereafter, the vector consisting of all parameters, $(w_0^T, \\ldots, w_h^T)^T$, is simply written as $\\Phi = (\\phi_1, \\ldots, \\phi_N)^T$, where $N$ ($= nh + 2h + 1$) denotes the dimension of $\\Phi$. Then, the training error in the three-layer neural network can be defined as follows: \n\n(1) \n\nwhere $\\sigma(u)$ represents a sigmoidal function, $\\sigma(u) = 1/(1 + e^{-u})$. \n\nIn this paper, we consider the following three penalty terms: \n\n$$\\Omega_2(\\Phi) = \\sum_{k=1}^{N} |\\phi_k|,$$ \n\nHereafter, $\\Omega_1$, $\\Omega_2$, and $\\Omega_3$ are referred to as the squared (Hinton, 1987; MacKay, 1992), absolute (Ishikawa, 1990; Williams, 1995), and normalized (Hanson & Pratt, 1989) penalty terms, respectively. 
Then, learning with one of these terms can be defined as the problem of minimizing the following objective function: \n\n$$F_i(\\Phi) = f(\\Phi) + \\mu \\Omega_i(\\Phi),$$ (3) \n\nwhere $\\mu$ is a penalty factor. \n\n2.2 Second-order Algorithm with Penalty Term \n\nIn order to minimize the objective function, we employ a newly proposed second-order learning algorithm based on a quasi-Newton method, called BPQ (Saito & Nakano, 1997), in which the descent direction, $\\Delta\\Phi$, is calculated on the basis of a partial BFGS update, and a reasonably accurate step-length, $\\lambda$, is efficiently calculated as the minimal point of a second-order approximation. Here, the partial BFGS update can be applied directly, while the step-length $\\lambda$ is evaluated as follows: \n\n$$\\lambda = \\frac{-\\nabla F_i(\\Phi)^T \\Delta\\Phi}{\\Delta\\Phi^T \\nabla^2 F_i(\\Phi) \\Delta\\Phi}.$$ (4) \n\nThe quadratic form for the training error term, $\\Delta\\Phi^T \\nabla^2 f(\\Phi) \\Delta\\Phi$, can be calculated efficiently with a computational complexity of $Nm + O(hm)$ by using the procedure of BPQ, while those for the penalty terms are calculated as follows: \n\n$$\\Delta\\Phi^T \\nabla^2 \\Omega_1(\\Phi) \\Delta\\Phi = \\sum_{k=1}^{N} \\Delta\\phi_k^2, \\quad \\Delta\\Phi^T \\nabla^2 \\Omega_2(\\Phi) \\Delta\\Phi = 0, \\quad \\Delta\\Phi^T \\nabla^2 \\Omega_3(\\Phi) \\Delta\\Phi = \\sum_{k=1}^{N} \\frac{(1 - 3\\phi_k^2) \\Delta\\phi_k^2}{(1 + \\phi_k^2)^3}.$$ (5) \n\nNote that, in the step-length calculation, $\\Delta\\Phi^T \\nabla^2 F_i(\\Phi) \\Delta\\Phi$ is basically assumed to be positive. The three terms affect it differently: the squared penalty term always adds a non-negative value; the absolute penalty term has no effect; the normalized penalty term may add a negative value if many weight values are larger in magnitude than $\\sqrt{1/3}$. This indicates that the squared penalty term has a desirable feature. 
Incidentally, we can employ other second-order learning algorithms such as SCG (M\u00f8ller, 1993) or OSS (Battiti, 1992), but BPQ worked the most efficiently among them in our own experience (Saito & Nakano, 1997). \n\n3 EVALUATION BY EXPERIMENTS \n\n3.1 Regression Problem \n\nUsing a regression problem for the function $y = (1 - x + 2x^2) e^{-0.5 x^2}$, we evaluated the learning performance obtained by adding a penalty term. In the experiment, each value of $x$ was randomly generated in the range $[-4, 4]$, and the corresponding value of $y$ was calculated from $x$; each value of $y$ was then corrupted by adding Gaussian noise with a mean of 0 and a standard deviation of 0.2. The total number of training examples was set to 30. The number of hidden units was set to 5. The initial values for the weights between the input and hidden units were independently generated according to a normal distribution with a mean of 0 and a standard deviation of 1; the initial values for the weights between the hidden and output units were set to 0, but the bias value at the output unit was initially set to the average output value of all training examples. The iteration was terminated when the gradient vector was sufficiently small (i.e., $\\|\\nabla F_i(\\Phi)\\|^2 / N < 10^{-12}$) or the total processing time exceeded 100 seconds. The penalty factor $\\mu$ was changed from $2^0$ to $2^{-19}$ by repeatedly multiplying by $2^{-1}$; trials were performed 20 times for each penalty factor. \n\nFigure 1 shows the training examples, the true function, and a function obtained after learning without a penalty term. We can see that learning without a penalty term over-fitted the training examples to some degree. \n\n3.2 Evaluation using Second-order Algorithm \n\nUsing BPQ, an evaluation was made after adding each penalty term. 
Figure 2(a) compares the generalization performance, which was evaluated by using the average RMSE (root mean squared error) for a set of 5,000 test examples. The best possible RMSE level is 0.2 because each test example includes the same amount of Gaussian noise as each training example. For each penalty term, the generalization performance was improved when $\\mu$ was set adequately, but the normalized penalty term was the most unstable among the three, because it frequently got stuck in undesirable local minima. Figure 2(b) compares the processing time until convergence (footnote 1). In comparison with learning without a penalty term, the squared penalty term drastically decreased the processing time, especially when $\\mu$ was large, while the absolute penalty term did not converge when $\\mu$ was large; the normalized penalty term generally required a longer processing time. Thus, only the squared penalty term improved the convergence performance, by a factor of about 2 to 100, while keeping a better generalization performance for an adequate penalty factor. \n\nFigure 1: Learning problem (training examples, true function, and learning result) \n\nFigure 2: Comparison using second-order algorithm BPQ: (a) generalization performance (average RMSE); (b) CPU time until convergence (sec.) 
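The statement that the best possible RMSE level is 0.2 can be checked numerically. The sketch below is an illustration only, not the authors' code: it draws test examples from the target function of Section 3.1 with the stated noise level and measures the RMSE of the noise-free function itself. The function names and the random seeds are hypothetical choices made for this example.

```python
import math
import random


def true_function(x):
    # Target function from Section 3.1: y = (1 - x + 2x^2) exp(-0.5 x^2)
    return (1.0 - x + 2.0 * x * x) * math.exp(-0.5 * x * x)


def make_examples(count, noise_sd=0.2, seed=0):
    # x is uniform on [-4, 4]; y is the true value plus Gaussian noise
    rng = random.Random(seed)
    examples = []
    for _ in range(count):
        x = rng.uniform(-4.0, 4.0)
        y = true_function(x) + rng.gauss(0.0, noise_sd)
        examples.append((x, y))
    return examples


def rmse(model, examples):
    # Root mean squared error of a model over (x, y) pairs
    total = sum((y - model(x)) ** 2 for x, y in examples)
    return math.sqrt(total / len(examples))


train_examples = make_examples(30, seed=0)    # training set size used in the paper
test_examples = make_examples(5000, seed=1)   # test set size used in the paper

# Even a perfect model cannot beat the noise level on noisy targets
noise_floor = rmse(true_function, test_examples)
print(round(noise_floor, 3))
```

With 5,000 test examples the printed value should lie close to the noise standard deviation of 0.2, which is why a learned network cannot be expected to score below that level.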
\n\nFootnote 1: Our experiments were done on SUN S-4/20 computers. \n\n3.3 Evaluation using First-order Algorithm \n\nUsing BP, a similar evaluation was made after adding each penalty term. Here, we adopted Silva and Almeida's learning rate adaptation rule (Silva & Almeida, 1990), i.e., the learning rate $\\eta_k$ for each weight $\\phi_k$ is adjusted according to the signs of two successive gradient values (footnote 2). Figure 3(a) compares the generalization performance and Figure 3(b) compares the processing time until convergence, where the average processing time for the trials without a penalty term is not displayed because none of those trials converged within 100 seconds. \n\nFootnote 2: The increasing and decreasing parameters were set to 1.1 and 1/1.1, respectively, as recommended by Silva & Almeida (1990); if the value of the objective function increases, all learning rates are halved until the value decreases. \n\nFigure 3: Comparison using first-order algorithm BP: (a) generalization performance (average RMSE); (b) CPU time until convergence (sec.) \n\nFor each penalty term, the generalization performance was improved when $\\mu$ was set adequately. Note that BP with the squared penalty term $\\Omega_1$ required more processing time than BPQ with $\\Omega_1$. 
As for the normalized penalty term $\\Omega_3$, BP with $\\Omega_3$ worked more stably than BPQ with $\\Omega_3$. Incidentally, the generalization performance of BP without a penalty term was better than that of BPQ without it; we conjecture that this is because BP benefited from the effect of early stopping (Bishop, 1995). Indeed, for the training examples, the average RMSE of BP without a penalty term was 0.138, while that of BPQ without it was 0.133. \n\n3.4 Graphical Evaluation \n\nIn order to examine graphically why the effect of adding each penalty term differed, we designed a simple problem: learning a function $y = \\sigma(w_1 x) + \\sigma(w_2 x)$, where only two weights, $w_1$ and $w_2$, are adjustable. In the three-layer network, the input and output layers each consist of only one unit, while the hidden layer consists of two units. Note that the weights between the hidden units and the output unit are fixed at 1, there is no bias, and the activation function of the hidden units is assumed to be $\\sigma(x) = 1/(1 + \\exp(-x))$. Each target value $y_t$ was calculated from the corresponding input value $x_t \\in \\{-0.2, -0.1, 0, 0.1, 0.2\\}$ by setting $(w_1, w_2) = (1, 3)$. \n\nFigure 4 shows the learning trajectories on error contour maps with respect to $w_1$ and $w_2$ during 100 iterations starting at $(w_1, w_2) = (-1, -3)$, where the penalty factor $\\mu$ was set to 0.1 or 0.01. Here, BPQ was used as the learning algorithm. The contours for the squared penalty term form ovals, making it easy for BPQ to learn. When $\\mu = 0.1$, the contours for the absolute penalty term form an almost square shape, and the learning trajectories oscillate near the origin ($w_1 = w_2 = 0$), due to the discontinuity of the gradient function. The contours for the normalized penalty term form a valley, making learning with BPQ more difficult. 
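The way each penalty term shapes the objective in this two-weight problem can be sketched numerically. The following is a minimal illustration, not the authors' code. It assumes a sum-of-squares training error with a factor of 1/2, a squared penalty equal to half the sum of squared weights, and a normalized penalty equal to the sum of w^2 / (1 + w^2); these forms, and all function names, are assumptions made for the example.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


# Inputs and targets of the two-weight problem, generated with (w1, w2) = (1, 3)
INPUTS = [-0.2, -0.1, 0.0, 0.1, 0.2]
TARGETS = [sigmoid(1.0 * x) + sigmoid(3.0 * x) for x in INPUTS]


def training_error(w1, w2):
    # Assumed form: half the sum of squared residuals
    return 0.5 * sum((y - sigmoid(w1 * x) - sigmoid(w2 * x)) ** 2
                     for x, y in zip(INPUTS, TARGETS))


def squared_penalty(weights):
    # Assumed form: half the sum of squared weights
    return 0.5 * sum(w * w for w in weights)


def absolute_penalty(weights):
    return sum(abs(w) for w in weights)


def normalized_penalty(weights):
    # Assumed form: sum of w^2 / (1 + w^2)
    return sum(w * w / (1.0 + w * w) for w in weights)


def objective(w1, w2, penalty, mu):
    # Training error plus the penalty term weighted by the penalty factor mu
    return training_error(w1, w2) + mu * penalty((w1, w2))


def curvature(penalty, w, h=1e-3):
    # Central finite-difference estimate of the second derivative
    # of a penalty term with respect to a single weight
    return (penalty((w + h,)) - 2.0 * penalty((w,)) + penalty((w - h,))) / (h * h)


for name, penalty in [('squared', squared_penalty),
                      ('absolute', absolute_penalty),
                      ('normalized', normalized_penalty)]:
    print(name, objective(-1.0, -3.0, penalty, 0.1), curvature(penalty, 1.0))
```

Under these assumptions the squared penalty contributes constant positive curvature, the absolute penalty contributes none away from the origin, and the normalized penalty contributes negative curvature at w = 1, which is consistent with the contour shapes described above.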
\n\nFigure 4: Graphical evaluation \n\n3.5 Determining Penalty Factor \n\nIn general, for a given problem, we cannot know an adequate penalty factor in advance. Given a limited number of examples, we must find a reasonably adequate penalty factor. The procedure of cross-validation (Stone, 1978) is adopted for this purpose. Since we knew that the combination of the squared penalty term $\\Omega_1$ and the second-order algorithm BPQ works very efficiently, we performed experiments using the above regression problem with exactly the same experimental conditions. \n\nFigure 5 shows the experimental results, where the procedure of cross-validation was implemented as a leave-one-out method, and the initial weight values for evaluating the cross-validation error were set to the weights learned from the entire set of training examples. Figure 5(a) compares the average generalization error and the average cross-validation error. Although the cross-validation error was a pessimistic estimator of the generalization error, it showed the same tendency and was minimized at almost the same penalty factor. Figure 5(b) shows the average processing time and its standard deviation; although the processing time includes the cross-validation evaluation, we can see that the learning was performed quite efficiently. \n\n4 CONCLUSION \n\nThis paper investigated the efficiency of supervised learning with each of three penalty terms, using first- and second-order learning algorithms, BP and BPQ. 
Our experiments showed that, for a reasonably adequate penalty factor, the combination of the squared penalty term and the second-order algorithm drastically improves the convergence performance, by a factor of about 20 over the other combinations, together with an improvement in the generalization performance. For other second-order learning algorithms such as SCG or OSS, similar results are possible because the main difference between BPQ and those algorithms involves only the learning efficiency. In the future, we plan to do further evaluations using larger-scale problems. \n\nFigure 5: Learning result: (a) generalization performance (average RMSE; cross-validation error and generalization error); (b) CPU time until convergence (sec.; average and s.d.) \n\nReferences \n\nBattiti, R. (1992) First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation 4(2):141-166. \n\nBishop, C.M. (1995) Neural networks for pattern recognition. Clarendon Press. \n\nHanson, S.J. & Pratt, L.Y. (1989) Comparing biases for minimal network construction with back-propagation. In D.S. Touretzky (ed.), Advances in Neural Information Processing Systems, Volume 1, pp. 177-185. San Mateo, CA: Morgan Kaufmann. \n\nHinton, G.E. (1987) Learning translation invariant recognition in massively parallel networks. In J.W. de Bakker, A.J. Nijman and P.C. 
Treleaven (eds.), Proceedings PARLE Conference on Parallel Architectures and Languages Europe, pp. 1-13. Berlin: Springer-Verlag. \n\nIshikawa, M. (1990) A structural learning algorithm with forgetting of link weight. Tech. Rep. TR-90-7, Electrotechnical Lab. Tsukuba-City, Japan. \n\nMacKay, D.J.C. (1992) Bayesian interpolation. Neural Computation 4(3):415-447. \n\nM\u00f8ller, M.F. (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4):525-533. \n\nPoggio, T. & Girosi, F. (1990) Regularization algorithms for learning that are equivalent to multilayer networks. Science 247:978-982. \n\nSaito, K. & Nakano, R. (1997) Partial BFGS update and efficient step-length calculation for three-layer neural networks. Neural Computation 9(1):239-257 (in press). \n\nSilva, F.M. & Almeida, L.B. (1990) Speeding up backpropagation. In R. Eckmiller (ed.), Advanced Neural Computers, pp. 151-160. Amsterdam: North-Holland. \n\nStone, M. (1978) Cross-validation: A review. Operationsforsch. Statist. Ser. Statistics B 9(1):111-147. \n\nWilliams, P.M. (1995) Bayesian regularization and pruning using a Laplace prior. Neural Computation 7(1):117-143. \n\n", "award": [], "sourceid": 1189, "authors": [{"given_name": "Kazumi", "family_name": "Saito", "institution": null}, {"given_name": "Ryohei", "family_name": "Nakano", "institution": null}]}