{"title": "Time Trials on Second-Order and Variable-Learning-Rate Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 983, "abstract": null, "full_text": "Time Trials on Second-Order and \nVariable-Learning-Rate Algorithms \n\nRichard Rohwer \nCentre for Speech Technology Research \nEdinburgh University \n80, South Bridge \nEdinburgh EH1 1HN, SCOTLAND \n\nAbstract \n\nThe performance of seven minimization algorithms is compared on five \nneural network problems. These include a variable-step-size algorithm, \nconjugate gradient, and several methods with explicit analytic or numerical \napproximations to the Hessian. \n\n1 Introduction \n\nThere are several minimization algorithms in use which in the nth iteration vary \nthe ith coordinate $x_i$ in the direction \n\n$$s_i^{n+1} = r_i^n s_i^n + h_i^n \\nabla_i^n, \\tag{1}$$ \n\nwhere $\\nabla_i^n = \\partial E / \\partial x_i$ is the ith component of the gradient of the error measure E \nat $x^n$, $s^0 = \\nabla^0$, and $r^n$ and $h^n$ are chosen differently in different algorithms. \nAlgorithms also use various methods for choosing the step size $\\eta^n$ to be taken along \ndirection $s^n$. In this study, 7 algorithms were compared on a suite of 5 neural \nnetwork problems. These algorithms are defined in table 1. 
\n\n[Table 1: Definitions of the algorithms compared.] \n\n1.1 The algorithms \n\nThe algorithms investigated are Silva and Almeida's variable-step-size algorithm \n(Silva, 1990), which closely resembles Toolenaere's \"SuperSAB\" algorithm \n(Toolenaere, 1990), conjugate gradient (Press, 1988), and 5 variants of an algorithm \nadvocated by LeCun (LeCun, 1989), which employs an analytic calculation of the \ndiagonal terms of the matrix of second derivatives. (Algorithms involving an \napproximation of the inverse of the full Hessian, the matrix of second derivatives, were \nstudied by Watrous (Watrous, 1987).) In 4 of these methods the gradient is divided \ncomponent-wise by a decaying average of either the second derivatives or their \nabsolute values. Dividing by the absolute values assures that $s \\cdot \\nabla < 0$, and reflects \nthe philosophy that directions with high curvature, be it positive or negative, are \nnot good ones to follow because the quadratic approximation is likely to break down \nat short distances. In the remaining method, sketched in (Rohwer, 1990a,b), the \ngradient is divided component-wise by the maximum of the absolute values of an \nanalytic and a numerical calculation of the second derivatives. Again the philosophy is \nthat curvature is to be avoided. The numerical calculation may detect evidence of \nnearby high curvature at a point where the analytic calculation finds low curvature. 
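

The component-wise division described above can be illustrated with a minimal sketch. It assumes a simple exponential form for the decaying average and a negative sign on the direction so that it points downhill; the names scaled_direction, max_abs_direction, decay and floor are hypothetical and do not come from the paper, and the numerical values are made up for illustration.

```python
def scaled_direction(grad, curvature, avg, decay=0.9, floor=1e-8):
    '''Divide the gradient component-wise by a decaying average of the
    absolute second derivatives. decay and floor are assumed values.'''
    new_avg = [decay * a + (1.0 - decay) * abs(c)
               for a, c in zip(avg, curvature)]
    # Every divisor is positive, so each component of the direction
    # has the opposite sign to the matching gradient component.
    direction = [-g / max(a, floor) for g, a in zip(grad, new_avg)]
    return direction, new_avg


def max_abs_direction(grad, analytic, numerical, floor=1e-8):
    '''Divide by the larger of the absolute analytic and numerical
    second-derivative estimates for each component.'''
    return [-g / max(abs(a), abs(m), floor)
            for g, a, m in zip(grad, analytic, numerical)]


def dot(u, v):
    '''Plain dot product of two equal-length lists.'''
    return sum(a * b for a, b in zip(u, v))


grad = [0.5, -2.0, 1.0]
curv = [4.0, -10.0, 0.5]   # one negative curvature on purpose
direction, avg = scaled_direction(grad, curv, avg=[1.0, 1.0, 1.0])
print(dot(direction, grad) < 0)
```

Because the divisors are absolute values, the product of the direction with the gradient stays negative even where the curvature is negative, which is the property the text attributes to the unsigned variants. Dividing by signed values instead would flip the sign of a component wherever the averaged curvature is negative.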
\n\nSome algorithms conventionally use a multi-step 1-dimensional \"linesearch\" to \ndetermine how far to proceed in direction s, whereas others take a single step \naccording to some formula. A linesearch guarantees descent (more precisely, non-ascent), \nwhich is beneficial if local minima pose no threat. Table ?? shows the step-size \nmethods used in this study; the decisions are rather arbitrary. The theoretical basis \nof the conjugate gradient method is lost if exact linesearches are not used, but it is \nlost anyway on any non-quadratic function. Silva's and Toolenaere's algorithms use a single-step \nmethod which guarantees descent by retracting any step which does not produce \ndescent. The method is not a linesearch, however, because the step following a \nretracted step will be in a different direction. Space limitations prohibit a detailed \nspecification of the linesearch algorithm and the convergence criteria used. \nThese details may be very important. A longer paper is planned in which they are \nto be specified, and their influence on performance studied. \n\n1.2 The test problems \n\nTwo types of problems are used in these tests. One is a strictly-layered 3-layer back \npropagation network in which the minimization variables are the weights. The test \nproblems are 4-bit parity using 4 hidden nodes, auto-association of 10-bit random \npatterns using 7 hidden nodes, and the Peterson and Barney vowel classification \nproblem (Peterson, 1952), which uses 2 inputs, 10 hidden nodes, and 10 target \nnodes. The other type is a fully connected recurrent network trained by the Moving \nTargets method (Rohwer, 1990a,b). In this case the minimization variables are the \nweights and the moving targets, which can be regarded as variable training data \nfor the hidden nodes. The limit cycle switching problem and the 100-step context \nsensitivity problem from these references are the test problems used. 
In the limit cycle switching problem, a single target node is required to regularly generate pulses \nof width proportional to a 2-bit binary number indicated by 2 input nodes. In the \n100-step context problem, the training data always has an input pulse at time step \n100, and sometimes has an input pulse at time 0. The target node is required to \nturn on at time 100 if and only if there was an input pulse at time 0. \n\nEach method is tested on each problem with 10 different random initial conditions, \nexcept for the parity problem, which was done with 100 different initial conditions. \n\n1.3 Unconventional nonlinearity \n\nAn unconventional form of nonlinearity was used in these tests. The usual \n$f(x) = 1/(1 + e^{-x})$ presents difficulties when $x \\to \\pm\\infty$ because its derivative \nbecomes very small. This makes the system learn slowly if activations become large. \nAlso, numerical noise becomes serious if expressions such as $f(x)(1 - f(x))$ are used \nin the derivative calculations. Various cutoff schemes are sometimes used to \nprevent these problems, but these introduce discontinuities and/or incorrect derivative \ncalculations which present further problems for second-derivative methods. \n\nFigure 1: The nonlinearity used. \n\nIn early \nwork it was found that algorithm performance was highly sensitive to the cutoff value \n(more systematic work on this subject is wanting), so an entirely different \nnonlinearity was introduced which is bounded but has reasonably large derivatives for \nmost arguments. This combination of properties can only be had with an oscillatory \nfunction. It was also desired to retain the property of $1/(1 + e^{-x})$ that it has large \n\"saturated regions\" in which it is approximately constant. 
The function used is \n\n$$f(x) = \\frac{1}{2} + \\frac{1}{2(1 + \\beta)} \\left(1 + \\beta \\sin\\left(\\frac{\\pi x}{\\alpha}\\right)^2\\right) \\sin\\left(\\frac{\\pi}{2} \\sin\\left(\\frac{\\pi}{2} \\sin\\left(\\frac{\\pi x}{\\alpha}\\right)\\right)\\right) \\tag{2}$$ \n\nwith $\\alpha = 10$ and $\\beta = 0.02$. This function is graphed in figure 1. \n\n2 Results \n\nAn algorithm is useful if it produces good solutions quickly. The data for each \nalgorithm-problem pair is divided into separate sets for successful and unsuccessful \nruns. Success is defined rather arbitrarily as less than 1% error on any target \nnode for all training data in the backpropagation problems. In the Moving Targets \nproblems, it is defined in terms of the maximum error on any target node in the \nfreely-running network, the threshold being 5% for the 4-limit-cycle problem and \n10% for the 100-step-context problem. 
\n\nThe speed data, measured in number of gradient evaluations, is presented in figure \n2, which contains 4 tables, one for each problem except random autoassociation. A \nmaximum of 10000 evaluations was allowed. Each table is divided into 7 columns, \none for each algorithm. From left to right, the algorithms are Rohwer's algorithm \n(max_abs), conjugate gradient (cg), division by unsigned (an_abs) or signed (an_sgn) \nanalytically computed second derivatives using a linesearch, these two with the \nlinesearch replaced by a single variably-sized step (an_abs_ss and an_sgn_ss), and \nSilva's algorithm (silva_ss). The data in each of these 7 columns is divided into \n3 subcolumns: the first (a) shows all data points, the second (s) shows data for \nsuccessful runs only, and the third (f) shows data for the failures. Each error bar \nshows the mean and standard deviation of the data in its column. The all-important \nlittle boxes at the base of each column show the proportions of runs in that column's \ncategory. \n\n[Figure 2: Gradient evaluations by method. Panels: Moving targets, 4 limit cycles; Moving targets, 100-step context; 2-10-10 Peterson and Barney; 4-4-1 parity.] \n