{"title": "Asymptotic Theory for Regularization: One-Dimensional Linear Case", "book": "Advances in Neural Information Processing Systems", "page_first": 294, "page_last": 300, "abstract": "", "full_text": "Asymptotic Theory for Regularization: \n\nOne-Dimensional Linear Case \n\nRolf Nevanlinna Institute, P.O. Box 4, FIN-00014 University of Helsinki, \n\nFinland. Email: PetrLKoistinen@rnLhelsinkLfi \n\nPetri Koistinen \n\nAbstract \n\nThe generalization ability of a neural network can sometimes be \nimproved dramatically by regularization. To analyze the improve(cid:173)\nment one needs more refined results than the asymptotic distri(cid:173)\nbution of the weight vector. Here we study the simple case of \none-dimensional linear regression under quadratic regularization, \ni.e., ridge regression. We study the random design, misspecified \ncase, where we derive expansions for the optimal regularization pa(cid:173)\nrameter and the ensuing improvement. It is possible to construct \nexamples where it is best to use no regularization. \n\n1 \n\nINTRODUCTION \n\nSuppose that we have available training data (Xl, Yd, .. 0' (Xn' Yn) consisting of \npairs of vectors, and we try to predict Yi on the basis of Xi with a neural network \nwith weight vector w. One popular way of selecting w is by the criterion \n\n(1) \n\n1 n - L \u00a3(Xi' Yi, w) + >..Q(w) = min!, \n\nn \n\nI \n\nwhere the loss \u00a3(x,y,w) is, e.g., the squared error Ily - g(x,w)11 2 , the function \ng(., w) is the input/output function of the neural network, the penalty Q(w) is \na real function which takes on small values when the mapping g(o, w) is smooth \nand high values when it changes rapidly, and the regularization parameter >.. is a \nnonnegative scalar (which might depend on the training sample). We refer to the \nsetup (1) as (training with) regularization, and to the same setup with the choice \n>.. = 0 as training without regularization. 
Regularization has been found to be very effective for improving the generalization ability of a neural network, especially when the sample size n is of the same order of magnitude as the dimensionality of the parameter vector w; see, e.g., the textbooks (Bishop, 1995; Ripley, 1996).

In this paper we deal with asymptotics in the case where the architecture of the network is fixed but the sample size grows. To fix ideas, let us assume that the training data is part of an i.i.d. (independent, identically distributed) sequence (X, Y), (X_1, Y_1), (X_2, Y_2), ... of pairs of random vectors, i.e., for each i the pair (X_i, Y_i) has the same distribution as the pair (X, Y) and the collection of pairs is independent (X and Y can be dependent). Then we can define the (prediction) risk of a network with weights w as the expected value

(2)    r(w) := E ℓ(X, Y, w).

Let us denote the minimizer of (1) by w_n(λ), and a minimizer of the risk r by w*. The quantity r(w_n(λ)) is the average prediction error for data independent of the training sample. This quantity r(w_n(λ)) is a random variable which describes the generalization performance of the network: it is bounded below by r(w*), and the more concentrated it is about r(w*), the better the performance. We will quantify this concentration by a single number, the expected value E r(w_n(λ)). We are interested in quantifying the gain (if any) in generalization for training with versus training without regularization, defined by

(3)    E r(w_n(0)) - E r(w_n(λ)).

When regularization helps, this is positive.

However, relatively little can be said about the quantity (3) without specifying in detail how the regularization parameter is determined. We show in the next section that provided λ converges to zero sufficiently quickly (at the rate o_p(n^{-1/2})), then E r(w_n(0)) and E r(w_n(λ)) are equal to leading order.
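As a rough numerical illustration of this (not a proof): in the one-dimensional linear case treated later, a deterministic λ_n = 1/n = o(n^{-1/2}) changes the estimate only at order n^{-1}, so even after √n-scaling the discrepancy between the regularized and unregularized estimates is negligible. The model, sample size, and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
w_star = 2.0
X = rng.normal(size=n)
Y = w_star * X + rng.normal(size=n)

m = np.mean(X * X)        # (1/n) sum of X_i^2
b = np.mean(X * Y)        # (1/n) sum of X_i Y_i

lam = 1.0 / n             # deterministic, hence o_p(n^{-1/2})
w_unreg = b / m           # w_n(0) in the one-dimensional linear case
w_reg = b / (m + lam)     # w_n(lam) in the one-dimensional linear case

# sqrt(n)-scaled discrepancy between the two estimates; it is of
# order n^{-1/2}, so the two share the same limiting distribution.
gap = np.sqrt(n) * abs(w_reg - w_unreg)
```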
It turns out that the optimal regularization parameter resides in this asymptotic regime. For this reason, delicate analysis is required in order to get an asymptotic approximation for (3). In this article we derive the needed asymptotic expansions only for the simplest possible case: one-dimensional linear regression where the regularization parameter is chosen independently of the training sample.

2 REGULARIZATION IN LINEAR REGRESSION

We now specialize the setup (1) to the case of linear regression and a quadratic smoothness penalty, i.e., we take ℓ(x, y, w) = [y - x^T w]^2 and Q(w) = w^T R w, where now y is scalar, x and w are vectors, and R is a symmetric, positive definite matrix. It is well known (and easy to show) that then the minimizer of (1) is

(4)    w_n(λ) = [(1/n) Σ_{i=1}^n X_i X_i^T + λ R]^{-1} (1/n) Σ_{i=1}^n X_i Y_i.

This is called the generalized ridge regression estimator, see, e.g., (Titterington, 1985); ridge regression corresponds to the choice R = I, see (Hoerl and Kennard, 1988) for a survey. Notice that (generalized) ridge regression is usually studied in the fixed design case, where the X_i are nonrandom. Further, it is usually assumed that the model is correctly specified, i.e., that there exists a parameter such that Y_i = X_i^T w* + ε_i, and such that the distribution of the noise term ε_i does not depend on X_i. In contrast, we study the random design, misspecified case.

Assuming that E ||X||^2 < ∞ and that E [X X^T] is invertible, the minimizer of the risk (2) and the risk itself can be written as

(5)    w* = A^{-1} E [XY], with A := E [X X^T],
(6)    r(w) = r(w*) + (w - w*)^T A (w - w*).

If Z_n is a sequence of random variables, then the notation Z_n = o_p(n^{-a}) means that n^a Z_n converges to zero in probability as n → ∞. For this notation and the mathematical tools needed for the following proposition see, e.g., (Serfling, 1980, Ch.
1) or (Brockwell and Davis, 1987, Ch. 6).

Proposition 1 Suppose that E Y^4 < ∞, E ||X||^4 < ∞ and that A = E [X X^T] is invertible. If λ = o_p(n^{-1/2}), then both √n (w_n(0) - w*) and √n (w_n(λ) - w*) converge in distribution to N(0, C), a normal distribution with mean zero and covariance matrix C.

The previous proposition also generalizes to the nonlinear case (under more complicated conditions). Given this proposition, it follows (under certain additional conditions) by Taylor expansion that both E r(w_n(λ)) - r(w*) and E r(w_n(0)) - r(w*) admit the expansion β_1 n^{-1} + o(n^{-1}) with the same constant β_1. Hence, in the regime λ = o_p(n^{-1/2}) we need to consider higher order expansions in order to compare the performance of w_n(λ) and w_n(0).

3 ONE-DIMENSIONAL LINEAR REGRESSION

We now specialize the setting of the previous section to the case where x is scalar. Also, from now on, we only consider the case where the regularization parameter for given sample size n is deterministic; in particular, λ is not allowed to depend on the training sample. This is necessary, since coefficients in the following type of asymptotic expansions depend on the details of how the regularization parameter is determined. The deterministic case is the easiest one to analyze.

We develop asymptotic expansions for the criterion

(7)    J_n(k) := E r(w_n(k)) - r(w*),

where now the regularization parameter k is deterministic and nonnegative. The expansions we get turn out to be valid uniformly for k ≥ 0. We then develop asymptotic formulas for the minimizer of J_n, and also for J_n(0) - inf J_n. The last quantity can be interpreted as the average improvement in generalization performance gained by an optimal level of regularization, when the regularization constant is allowed to depend on n but not on the training sample.
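The estimator (4) whose excess risk J_n tracks is itself a two-liner in numpy. A minimal sketch, in which the helper name generalized_ridge and the toy data are illustrative assumptions:

```python
import numpy as np

def generalized_ridge(X, Y, lam, R):
    """Generalized ridge estimator (4):
    w_n(lam) = [(1/n) sum x_i x_i^T + lam R]^{-1} (1/n) sum x_i y_i."""
    n = X.shape[0]
    A = X.T @ X / n + lam * R
    b = X.T @ Y / n
    return np.linalg.solve(A, b)

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + rng.normal(size=n)

R = np.eye(d)                       # R = I gives ordinary ridge regression
w0 = generalized_ridge(X, Y, 0.0, R)
w1 = generalized_ridge(X, Y, 0.5, R)

# With R = I the penalty shrinks every component toward zero.
shrunk = bool(np.linalg.norm(w1) < np.linalg.norm(w0))
max_err = float(np.max(np.abs(w0 - w_true)))
```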
From now on we take Q(w) = w^2 and assume that A = E X^2 = 1 (which could be arranged by a linear change of variables). Referring back to formulas in the previous section, we see that

(8)    r(w_n(k)) - r(w*) = (V̄_n - k w*)^2 / (Ū_n + 1 + k)^2 =: h(Ū_n, V̄_n, k),

whence J_n(k) = E h(Ū_n, V̄_n, k), where we have introduced the function h (used heavily in what follows) as well as the arithmetic means Ū_n and V̄_n

(9)    Ū_n := (1/n) Σ_{i=1}^n U_i, with U_i := X_i^2 - 1,
(10)   V̄_n := (1/n) Σ_{i=1}^n V_i, with V_i := X_i Y_i - w* X_i^2.

For convenience, also define U := X^2 - 1 and V := XY - w* X^2. Notice that U, U_1, U_2, ... are zero mean i.i.d. random variables, and that V, V_1, V_2, ... satisfy the same conditions. Hence Ū_n and V̄_n converge to zero, and this leads to the idea of using the Taylor expansion of h(u, v, k) about the point (u, v) = (0, 0) in order to get an expansion for J_n(k).

To outline the ideas, let T_j(u, v, k) be the degree j Taylor polynomial of (u, v) ↦ h(u, v, k) about (0, 0), i.e., T_j(u, v, k) is a polynomial in u and v whose coefficients are functions of k and whose degree with respect to u and v is j. Then E T_j(Ū_n, V̄_n, k) depends on n and moments of U and V. By deriving an upper bound for the quantity E |h(Ū_n, V̄_n, k) - T_j(Ū_n, V̄_n, k)| we get an upper bound for the error committed in approximating J_n(k) by E T_j(Ū_n, V̄_n, k). It turns out that for odd degrees j the error is of the same order of magnitude in n as for degree j - 1. Therefore we only consider even degrees j. It also turns out that the error bounds are uniform in k ≥ 0 whenever j ≥ 2. To proceed, we need to introduce assumptions.

Assumption 1 E |X|^r < ∞ and E |Y|^s < ∞ for high enough r and s.

Assumption 2 Either (a) for some constant β > 0 almost surely |X| ≥ β or (b) X has a density which is bounded in some neighborhood of zero.
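The identity (8) is exact algebra, not an approximation: writing (1/n) Σ X_i Y_i = V̄_n + w*(Ū_n + 1), one gets w_n(k) - w* = (V̄_n - k w*)/(Ū_n + 1 + k), whose square is h(Ū_n, V̄_n, k). A quick numerical check of this (sample size, seed, and k are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, w_star, k = 50, 1.0, 0.2
X = rng.normal(size=n)                      # so that E X^2 = 1, as assumed
Y = w_star * X + rng.normal(size=n)

U_bar = np.mean(X * X - 1.0)                # (9): mean of U_i = X_i^2 - 1
V_bar = np.mean(X * Y - w_star * X * X)     # (10): mean of V_i = X_i Y_i - w* X_i^2

w_nk = np.mean(X * Y) / (np.mean(X * X) + k)   # w_n(k) in one dimension

h = (V_bar - k * w_star) ** 2 / (U_bar + 1.0 + k) ** 2
lhs = (w_nk - w_star) ** 2    # equals r(w_n(k)) - r(w*) when A = 1
```

The identity holds for any value plugged in for w*, since V̄_n is defined in terms of the same w*.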
\n\nAssumption 1 guarantees the existence of high enough moments; the values r = 20 \nand s = 8 are sufficient for the following proofs. E.g., if the pair (X, Y) has a \nnormal distribution or a distribution with compact support, then moments of all \norders exist and hence in this case assumption 1 would be satisfied. Without some \ncondition such as assumption 2, In(O) might fail to be meaningful or finite. The \nfollowing technical result is stated without proof. \n\nProposition 2 Let p > 0 and let 0 < IE X 2 < 00. If assumption 2 holds, then \n\nwhere the expectation on the left is finite (a) for n ~ 1 (b) for n > 2p provided that \nassumption 2 (a), respectively 2 (b) holds. \n\nProposition 3 Let assumptions 1 and 2 hold. Then there exist constants no and \nM such that \n\nIn(k) = JET2(Un, Vn, k) + R(n, k) where \n\n_ _ \n\n(w*)2k2 \n\n-1 [IEV2 \n\n(w*)2k2JEU2 W*kIEUV] \n\nIET2(Un, Vn, k) = (1+k)2 +n \n\n(1+k)2 +3 \n\n(1+k)4 +4 (1+k)3 \n\nIR(n, k) I :s; Mn- 3/2(k + 1)-1, \n\n\"In;::: no, k ;::: o. \n\nPROOF SKETCH The formula for IE T2(Un , Vn. k) follows easily by integrating the \ndegree two Taylor polynomial term by term. To get the upper bound for R(n, k), \nconsider the residual \n\nwhere we have omitted four similar terms. Using the bound \n\n\f298 \n\nP. Koistinen \n\nthe Ll triangle inequality, and the Cauchy-Schwartz inequality, we get \n\nIR(n, k)1 = IJE [h(Un, Vn, k) - T2(Un, Vn, k)]1 \n\n., (k+ W' {Ii: [(~ ~Xl)-'] r \n\n{2(k + 1)3[JE (lUnI2IVnI 4 )]l/2 + 4(w*)2k2(k + 1)[18 IUnI6]l/2 ... } \n\nBy proposition 2, here 18 [(~ 2:~ X[)-4] = 0(1). Next we use the following fact, cf. \n(Serfiing, 1980, Lemma B, p. 68). \nFact 1 Let {Zd be i.i.d. with 18 [Zd = 0 and with 18 IZI/v < 00 for some v ~ 2. \nThen \n\nv \n\nApplying the Cauchy-Schwartz inequality and this fact, we get, e.g., that \n[18 (IUnI2 IVnI 4 )]l/2 ~ [(18 IUnI4 )1/2(E IVnI8)1/2p/2 = 0(n- 3/ 2). \n\nGoing through all the terms carefully, we see that the bound holds. 
\n\no \n\nProposition 4 Let assumptions 1 and 2 hold, assume that w* :j; 0, and set \n\nal := (18 V2 - 2w*E [UVD/(w*)2. \n\nIf al > 0, then there exists a constant ni such that for all n ~ nl the function \nk ~ ET2(Un, Vn,k) has a unique minimum on [0,(0) at the point k~ admitting the \nexpanszon \n\nIn(O) - inf{Jn(k) : k ~ O} = In(O) - In(aln- 1 ) = ar(w*)2n- 2 + 0(n- 5 / 2). \n\nk~ = aIn- 1 + 0(n-2); \n\nfurther, \n\nIf a ~ 0, then \n\nPROOF SKETCH The proof is based on perturbation expansio!1 c~nsidering lin a \nsmall parameter. By the previous proposition, Sn(k) := ET2 (Un , Vn , k) is the sum \nof (w*)2k2/(1 + k)2 and a term whose supremum over k ~ ko > -1 goes to zero \nas n ~ 00. Here the first term has a unique minimum on (-1,00) at k = O. \nDifferentiating Sn we get \n\nS~(k) = [2(w*)2k(k + 1)2 + n- 1p2(k)]/(k + 1)5, \n\nwhere P2(k) is a second degree polynomial in k. The numerator polynomial has \nthree roots, one of which converges to zero as n ~ 00. A regular perturbation \nexpansion for this root, k~ = aln- I + a2n-2 + ... , yields the stated formula for \nal. This point is a minimum for all sufficiently large n; further, it is greater than \nzero for all sufficiently large n if and only if al > O. \nThe estimate for J n (0) - inf { J n (k) : k ~ O} in the case al > 0 follows by noticing \nthat \n\nIn(O) - In(k) = 18 [h(Un, Vn, 0) - h(Un, Vn, k)), \n\nwhere we now use a third degree Taylor expansion about (u, v, k) = (0,0,0) \n\nh(u,v,O) - h(u,v,k) = \n\n2w* kv - (w*)2k2 - 4w*kuv + 2(w*?k2u + 2kv2 - 4w*k2v + 2(W*)2k3 + r(u, v, k). \n\n\fAsymptotic Theory for Regularization: One-Dimensional Linear Case \n\n299 \n\n0.2 \n0.18 \n0.16 \n0.14 \n0.12 \n0.1 ~~ __ ~ __ ~ __ ~ __ ~ __ ~ __ ~ __ L-~ __ ~ \n0.5 \n\n0.35 \n\n0.05 \n\n0.15 \n\n0.2 \n\n0.4 \n\n0.45 \n\no \n\n0.1 \n\n0.25 \n\n0.3 \n\nFigure 1: Illustration of the asymptotic approximations in the situation of equation \n(11) . 
Horizontal axis k; vertical axis J_n(k) and its asymptotic approximations. Legend: markers J_n(k); solid line E T_2(Ū_n, V̄_n, k); dashed line E T_4(Ū_n, V̄_n, k).

Using the techniques of the previous proposition, it can be shown that E |r(Ū_n, V̄_n, k_n^*)| = O(n^{-5/2}). Integrating the Taylor polynomial and using this estimate gives

J_n(0) - J_n(α_1/n) = α_1^2 (w*)^2 n^{-2} + O(n^{-5/2}).

Finally, by the mean value theorem,

J_n(0) - inf { J_n(k) : k ≥ 0 } = J_n(0) - J_n(α_1/n) + (d/dk) [J_n(0) - J_n(k)] |_{k=θ} (k_n^* - α_1/n)
= J_n(0) - J_n(α_1/n) + O(n^{-1}) O(n^{-2}),

where θ lies between k_n^* and α_1/n, and where we have used the fact that the indicated derivative evaluated at θ is of order O(n^{-1}), as can be shown with moderate effort. □

Remark In the preceding we assumed that A = E X^2 equals 1. If this is not the case, then the formula for α_1 has to be divided by A; again, if α_1 > 0, then k_n^* = α_1 n^{-1} + O(n^{-2}).

If the model is correctly specified in the sense that Y = w* X + ε, where ε is independent of X and E ε = 0, then V = X ε and E [UV] = 0. Hence we have α_1 = E [ε^2] / (w*)^2, and this is strictly positive except in the degenerate case where ε = 0 with probability one. This means that here regularization helps provided the regularization parameter is chosen around the value α_1/n and n is large enough. See Figure 1 for an illustration in the case

(11)    X ~ N(0, 1), Y = w* X + ε, ε ~ N(0, 1), w* = 1,

where ε and X are independent. J_n(k) is estimated on the basis of 1000 repetitions of the task for n = 8. In addition to E T_2(Ū_n, V̄_n, k) the function E T_4(Ū_n, V̄_n, k) is also plotted. The latter can be shown to give J_n(k) correctly up to order O(n^{-5/2} (k+1)^{-3}). Notice that although E T_2(Ū_n, V̄_n, k) does not give that good an approximation for J_n(k), its minimizer is near the minimizer of J_n(k), and both of these minimizers lie near the point α_1/n = 0.125 as predicted by the theory.
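The Monte Carlo experiment behind Figure 1 is easy to repeat. The sketch below simulates (11) with n = 8, estimates J_n(k) = E h(Ū_n, V̄_n, k) by averaging over repetitions (common random numbers across the k grid), and evaluates the closed-form E T_2 of Proposition 3 (for (11): E U^2 = 2, E V^2 = 1, E [UV] = 0). The repetition count, grid, and seed are arbitrary illustrative choices; both minimizers should land in the vicinity of α_1/n = 0.125.

```python
import numpy as np

rng = np.random.default_rng(4)
n, w_star, reps = 8, 1.0, 4000
ks = np.linspace(0.0, 0.5, 101)

# Fresh samples from (11), one row per repetition.
X = rng.normal(size=(reps, n))
eps = rng.normal(size=(reps, n))
Y = w_star * X + eps
U_bar = np.mean(X * X - 1.0, axis=1)              # (9)
V_bar = np.mean(X * Y - w_star * X * X, axis=1)   # (10)

# Monte Carlo estimate of J_n(k) = E h(U_bar_n, V_bar_n, k).
J_hat = np.array([np.mean((V_bar - k * w_star) ** 2 / (U_bar + 1.0 + k) ** 2)
                  for k in ks])

# Closed-form E T2 from Proposition 3 with EU2 = 2, EV2 = 1, EUV = 0.
ET2 = (w_star**2 * ks**2 / (1 + ks)**2
       + (1.0 / (1 + ks)**2 + 3 * w_star**2 * ks**2 * 2.0 / (1 + ks)**4) / n)

k_hat = float(ks[np.argmin(J_hat)])   # estimated minimizer of J_n
k_t2 = float(ks[np.argmin(ET2)])      # minimizer of the T2 approximation
improvement = float(J_hat[0] - J_hat.min())   # J_n(0) - inf_k J_n(k), estimated
```

At this small n the T_2 minimizer sits somewhat below the minimizer of J_n, consistent with the remark that E T_2 is only a rough approximation while its minimizer is still in the right place.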
In the situation (11) it can actually be shown by lengthy calculations that the minimizer of J_n(k) is exactly α_1/n for each sample size n ≥ 1.

It is possible to construct cases where α_1 < 0. For instance, take

X ~ Uniform(a, b), with a = -(3√5 - 1)/4, b = -1/2,
Y = c/X + d + Z, with c = -5, d = 8,

and Z ~ N(0, σ^2) with Z and X independent and 0 ≤ σ < 1.1. In such a case regularization using a positive regularization parameter only makes matters worse; using a properly chosen negative regularization parameter would, however, help in this particular case. This would, however, amount to rewarding rapidly changing functions. In the case (11) regularization using a negative value for the regularization parameter would be catastrophic.

4 DISCUSSION

We have obtained asymptotic approximations for the optimal regularization parameter in (1) and the amount of improvement (3) in the simple case of one-dimensional linear regression when the regularization parameter is chosen independently of the training sample. It turned out that the optimal regularization parameter is, to leading order, given by α_1 n^{-1} and the resulting improvement is of order O(n^{-2}). We have also seen that if α_1 < 0 then regularization only makes matters worse.

Also (Larsen and Hansen, 1994) have obtained asymptotic results for the optimal regularization parameter in (1). They consider the case of a nonlinear network; however, they assume that the neural network model is correctly specified.

The generalization of the present results to the nonlinear, misspecified case might be possible using, e.g., techniques from (Bhattacharya and Ghosh, 1978). Generalization to the case where the regularization parameter is chosen on the basis of the sample (say, by cross validation) would be desirable.
Acknowledgements

This paper was prepared while the author was visiting the Department for Statistics and Probability Theory at the Vienna University of Technology with financial support from the Academy of Finland. I thank F. Leisch for useful discussions.

References

Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal Edgeworth expansion. The Annals of Statistics, 6(2):434-451.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Brockwell, P. J. and Davis, R. A. (1987). Time Series: Theory and Methods. Springer Series in Statistics. Springer-Verlag.

Hoerl, A. E. and Kennard, R. W. (1988). Ridge regression. In Kotz, S., Johnson, N. L., and Read, C. B., editors, Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.

Larsen, J. and Hansen, L. K. (1994). Generalization performance of regularized neural network models. In Vlontzos, J., Hwang, J.-N., and Wilson, E., editors, Proc. of the 4th IEEE Workshop on Neural Networks for Signal Processing, pages 42-51. IEEE Press.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, Inc.

Titterington, D. M. (1985). Common structure of smoothing techniques in statistics. International Statistical Review, 53:141-170.
", "award": [], "sourceid": 1453, "authors": [{"given_name": "Petri", "family_name": "Koistinen", "institution": null}]}