{"title": "Outcomes of the Equivalence of Adaptive Ridge with Least Absolute Shrinkage", "book": "Advances in Neural Information Processing Systems", "page_first": 445, "page_last": 451, "abstract": null, "full_text": "Outcomes of the Equivalence of Adaptive Ridge \n\nwith Least Absolute Shrinkage \n\nYves Grandvalet \n\nStephane Canu \n\nHeudiasyc, UMR CNRS 6599, Universite de Technologie de Compiegne, \n\nBP 20.529, 60205 Compiegne cedex, France \n\nYves.Grandvalet@hds.utc.fr \n\nAbstract \n\nAdaptive Ridge is a special form of Ridge regression, balancing the \nquadratic penalization on each parameter of the model. It was shown to \nbe equivalent to Lasso (least absolute shrinkage and selection operator), \nin the sense that both procedures produce the same estimate. Lasso can \nthus be viewed as a particular quadratic penalizer. \nFrom this observation, we derive a fixed point algorithm to compute the \nLasso solution. The analogy provides also a new hyper-parameter for tun(cid:173)\ning effectively the model complexity. We finally present a series ofpossi(cid:173)\nble extensions oflasso performing sparse regression in kernel smoothing, \nadditive modeling and neural net training. \n\n1 INTRODUCTION \n\nIn supervised learning, we have a set of explicative variables x from which we wish to pre(cid:173)\ndict a response variable y. To solve this problem, a learning algorithm is used to produce a \npredictor J( x) from a learning set Sf. = {(Xi, yd H=l of examples. The goal of prediction \nmay be: 1) to provide an accurate prediction of future responses, accuracy being measured \nby a user-defined loss function; 2) to quantify the effect of each explicative variable in the \nresponse; 3) to better understand the underlying phenomenon. \n\nPenalization is extensively used in learning algorithms. It decreases the predictor variability \nto improve the prediction accuracy. 
It is also expected to produce models with few non-zero coefficients if interpretation is planned. \n\nRidge regression and Subset Selection are the two main penalization procedures. The former is stable, but does not shrink parameters to zero; the latter gives simple models, but is unstable [1]. These observations motivated the search for new penalization techniques such as Garrotte, Non-Negative Garrotte [1], and Lasso (least absolute shrinkage and selection operator) [10]. \n\n\f446 \n\nY. Grandvalet and S. Canu \n\nAdaptive Ridge was proposed as a means to automatically balance penalization on different coefficients. It was shown to be equivalent to Lasso [4]. Section 2 presents Adaptive Ridge and recalls the equivalence statement. The following sections give some of the main outcomes of this connection. They concern algorithmic issues in section 3, complexity control in section 4, and some possible generalizations of lasso to non-linear regression in section 5. \n\n2 ADAPTIVE RIDGE REGRESSION \n\nFor clarity of exposition, the formulae are given here for linear regression with quadratic loss. The predictor is defined as f̂(x) = βᵀx, with β = (β_1, …, β_d)ᵀ. Adaptive Ridge is a modification of the Ridge estimate, which is defined by the quadratic constraint Σ_{j=1}^d β_j² ≤ C applied to the parameters. It is usually computed by minimizing the Lagrangian \n\nβ̂ = Argmin_β Σ_{i=1}^ℓ (Σ_{j=1}^d β_j x_ij − y_i)² + λ Σ_{j=1}^d β_j² ,  (1) \n\nwhere λ is the Lagrange multiplier varying with the bound C on the norm of the parameters. \n\nWhen the ordinary least squares (OLS) estimate maximizes likelihood¹, the Ridge estimate may be seen as a maximum a posteriori estimate. The Bayes prior distribution is a centered normal distribution, with variance proportional to 1/λ. This prior distribution treats all covariates similarly.
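For reference, the Lagrangian (1) is minimized in closed form by β̂ = (XᵀX + λI)⁻¹Xᵀy. A minimal numpy sketch (the synthetic data and all variable names are ours, not from the paper) checking that this stationary point solves (1) and shrinks the OLS estimate:

```python
import numpy as np

# Ridge estimate of eq. (1): minimize ||X beta - y||^2 + lam * ||beta||^2.
# Synthetic data; names and values are illustrative only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
lam = 2.0

# Closed-form minimizer of the Lagrangian (1)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The gradient of (1) vanishes at beta_ridge, and the estimate is
# shrunk relative to OLS -- but no coefficient is set exactly to zero.
grad = 2 * (X.T @ (X @ beta_ridge - y) + lam * beta_ridge)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

The absence of exact zeros is the behavior Adaptive Ridge modifies by rebalancing the penalty across coefficients.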
It is not appropriate when we know that all covariates are not equally relevant. \n\nThe garrotte estimate [1] is based on the OLS estimate β̂_OLS. The standard quadratic constraint is replaced by Σ_{j=1}^d β_j²/β̂²_OLS,j ≤ C. The coefficients with smaller OLS estimate are thus more heavily penalized. Other modifications are better explained from the prior distribution viewpoint. Mixtures of Gaussians may be used to cluster different sets of covariates. Several models have been proposed, with data dependent clusters [9], or classes defined a priori [7]. The Automatic Relevance Determination model [8] ranks in the latter type. In [4], we propose to use such a mixture, in the form \n\nβ̂ = Argmin_β Σ_{i=1}^ℓ (Σ_{j=1}^d β_j x_ij − y_i)² + Σ_{j=1}^d λ_j β_j² .  (2) \n\nHere, each coefficient has its own prior distribution. The priors are centered normal distributions with variances proportional to 1/λ_j. To avoid the simultaneous estimation of these d hyper-parameters by trial, the constraint \n\n(1/d) Σ_{j=1}^d 1/λ_j = 1/λ , λ_j > 0  (3) \n\nis applied on (λ_1, …, λ_d)ᵀ, where λ is a predefined value. This constraint is a link between the d prior distributions. Their mean variance is proportional to 1/λ. The values of λ_j are automatically² induced from the sample, hence the qualifier adaptive. Adaptivity refers here to the penalization balance on {β_j}, not to the tuning of the hyper-parameter λ. \n\n¹ If {x_i} are independently and identically drawn from some distribution, and some β* exists such that y_i = β*ᵀx_i + ε, where ε is a centered normal random variable, then the empirical cost based on the quadratic loss is proportional to the log-likelihood of the sample. The OLS estimate β̂_OLS is thus the maximum likelihood estimate of β*.
\n\n² Adaptive Ridge, as Ridge or Lasso, is not scale invariant, so the covariates should be normalized to produce sensible estimates. \n\n\fEquivalence of Adaptive Ridge with Least Absolute Shrinkage \n\n447 \n\nIt was shown [4] that Adaptive Ridge and least absolute value shrinkage are equivalent, in the sense that they yield the same estimate. We recall that the Lasso estimate is defined by \n\nβ̂ = Argmin_β Σ_{i=1}^ℓ (Σ_{j=1}^d β_j x_ij − y_i)²  subject to  Σ_{j=1}^d |β_j| ≤ K .  (4) \n\nThe only difference in the definitions of the Adaptive Ridge and Lasso estimates is that the Lagrangian form of Adaptive Ridge uses the constraint (Σ_{j=1}^d |β_j|)²/d ≤ K². \n\n3 OPTIMIZATION ALGORITHM \n\nTibshirani [10] proposed to use quadratic programming to find the Lasso solution, with 2d variables (positive and negative parts of β_j) and 2d + 1 constraints (signs of positive and negative parts of β_j, plus constraint (4)). Equations (2) and (3) suggest to use a fixed point (FP) algorithm. At each step s, the FP algorithm estimates the optimal parameters λ_j^(s) of the Bayes prior based on the estimate β^(s−1), and then maximizes the posterior to compute the current estimate β^(s). \n\nAs the parameterization (β, λ) may lead to divergent solutions, we define new variables \n\nc_j = √(λ/λ_j)  for j = 1, …, d ,  (5) \n\nand \n\nγ_j = β_j / c_j .  (6) \n\nThe FP algorithm updates alternately c and γ as follows: \n\nc_j^(s)² = d γ_j^(s−1)² / Σ_{k=1}^d γ_k^(s−1)² \nγ^(s) = (diag(c^(s)) XᵀX diag(c^(s)) + λI)⁻¹ diag(c^(s)) Xᵀy \n\nwhere X is the matrix with entries x_ij, I is the identity matrix, and diag(c) is the square matrix with the vector c on its diagonal. \n\nThe algorithm can be initialized by the Ridge or the OLS estimate. In the latter case, β^(1) is the garrotte estimate. \nPractically, if γ_j^(s−1) is small compared to numerical accuracy, then c_j^(s) is set to zero.
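The updates above can be sketched in a few lines of numpy (variable names, the iteration count, and the zero threshold `eps` are our choices; the initialization is the Ridge estimate, as suggested above):

```python
import numpy as np

def adaptive_ridge_fp(X, y, lam, n_iter=100, eps=1e-10):
    """Fixed-point (FP) algorithm for the Lasso via Adaptive Ridge.

    Alternates the two updates of section 3:
      c_j^2  <- d * gamma_j^2 / sum_k gamma_k^2          (prior re-balancing)
      gamma  <- (C X'X C + lam I)^{-1} C X' y            (posterior mode)
    and returns beta = C gamma, with C = diag(c)."""
    n, d = X.shape
    # Ridge-like initialization (c_j = 1 for all j)
    gamma = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    c = np.ones(d)
    for _ in range(n_iter):
        s = np.sum(gamma ** 2)
        if s < eps:
            break                       # everything shrunk to zero
        c = np.sqrt(d * gamma ** 2 / s)
        c[np.abs(gamma) < eps] = 0.0    # frozen coordinates stay at zero
        C = np.diag(c)
        gamma = np.linalg.solve(C @ X.T @ X @ C + lam * np.eye(d),
                                C @ X.T @ y)
    return c * gamma
```

Consistent with section 4, as λ → 0 the returned estimate approaches the OLS solution, while larger λ shrinks the coefficients and drives some c_j, hence some β̂_j, to zero.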
In turn, γ_j^(s) is zero, and the system to be solved in the second step to determine γ can be reduced to the other variables. If c_j is set to zero at any time during the optimization process, the final estimate β̂_j will be zero. The computations are simplified, but it is not clear whether global convergence can be obtained with this algorithm. It is easy to show the convergence towards a local minimum, but we did not find general conditions ensuring global convergence. If these conditions exist, they rely on initial conditions. \n\nFinally, we stress that the optimality conditions for c (or in a less rigorous sense for λ) do not depend on the first part of the cost minimized in (2). In consequence, the equivalence between Adaptive Ridge and lasso holds for any model or loss function. The FP algorithm can be applied to these other problems, without modifying the first step. \n\n4 COMPLEXITY TUNING \n\nThe Adaptive Ridge estimate depends on the learning set S_ℓ and on the hyper-parameter λ. When the estimate is defined by (2) and (3), the analogy with Ridge suggests λ as the \"natural\" hyper-parameter for tuning the complexity of the regressor. As λ goes to zero, β̂ approaches the OLS estimate β̂_OLS, and the number of effective parameters is d. As λ goes to infinity, β̂ goes to zero and the number of effective parameters is zero. \n\nWhen the estimate is defined by (4), there is no obvious choice for the hyper-parameter controlling complexity. Tibshirani [10] proposed to use ν = Σ_{j=1}^d |β̂_OLS,j| / Σ_{j=1}^d |β̂_j|. As ν goes to one, β̂ approaches β̂_OLS; as ν goes to infinity, β̂ goes to zero. \n\nThe weakness of ν is that it is explicitly defined from the OLS estimate. As a result, it is variable when the design matrix is badly conditioned. The estimation of ν is thus harder, and the overall procedure loses stability.
This is illustrated on an experiment following Breiman's benchmark [1] with 30 highly correlated predictors, E(X_j X_k) = ρ^{|j−k|}, with ρ = 1 − 10⁻³. \n\nWe generate 1000 i.i.d. samples of size ℓ = 60. For each sample s_k, the modeling error (ME) is computed for several values of ν and λ. We select ν^k and λ^k achieving the lowest ME. For one sample, there is a one-to-one mapping from ν to λ; thus ME is the same for ν^k and λ^k. Then, we compute ν* and λ* achieving the best average ME on the 1000 samples. As ν^k and λ^k achieve the lowest ME for s_k, the ME for s_k is higher or equal for ν* and λ*. Due to the wide spread of {ν^k}, the average loss encountered is twice as large for ν* as for λ*: (1/1000) Σ_{k=1}^{1000} (ME(s_k, ν*) − ME(s_k, ν^k)) = 4.6 · 10⁻², and (1/1000) Σ_{k=1}^{1000} (ME(s_k, λ*) − ME(s_k, λ^k)) = 2.3 · 10⁻². The average modeling errors are ME(ν*) = 1.9 · 10⁻¹ and ME(λ*) = 1.7 · 10⁻¹. \n\nThe estimates of prediction error, such as leave-one-out cross-validation, tend to be variable. Hence, complexity tuning is often based on the minimization of some estimate of the mean prediction error (e.g. bootstrap, K-fold cross-validation). Our experiment supports that, regarding mean prediction error, the optimal λ performs better than the optimal ν. Thus, λ is the better candidate for complexity tuning. \n\nAlthough λ and ν are respectively the control parameters of the FP and QP algorithms, the preceding statement does not imply that we should use the FP algorithm. Once the solution β̂ is known, ν and λ are easily computed. The choice of one hyper-parameter is not linked to the choice of the optimization algorithm. \n\n5 APPLICATIONS \n\nAdaptive Ridge may be applied to a variety of regression techniques, including kernel smoothing, additive modeling and neural net modeling. \n\n5.1 KERNEL SMOOTHING \n\nSoft-thresholding was proved to be efficient in wavelet functional estimation [2].
Kernel smoothers [5] can also benefit from the sparse representation given by soft-thresholding methods. For these regressors, f̂(x) = Σ_{i=1}^ℓ β_i K(x, x_i) + β_0, there are as many covariates as pairs in the sample. The quadratic procedure of Lasso with 2ℓ + 1 constraints becomes computationally expensive, but the FP algorithm of Adaptive Ridge is reasonably fast to converge. \n\nAn example of least squares fitting is shown in fig. 1 for the motorcycle dataset [5]. On this example, the hyper-parameter λ has been estimated by .632 bootstrap (with 50 bootstrap replicates) for Ridge and Adaptive Ridge regressions. For tuning λ, it is not necessary to determine the coefficients β with high accuracy. Hence, compared to Ridge regression, the overall amount of computation required to get the Adaptive Ridge estimate was about six times larger. For evaluation, Adaptive Ridge is ten times faster than Ridge regression, as the final fit uses only a few kernels (11 out of 133). \n\nFigure 1: Adaptive Ridge (AR) and Ridge (R) in kernel smoothing on the motorcycle data. The + are data points, and • are the prototypes corresponding to the kernels with non-zero coefficients in AR. The Gaussian kernel used is represented dotted in the lower right-hand corner. \n\nGirosi [3] showed an equivalence between a version of least absolute shrinkage applied to kernel smoothing, and Support Vector Machine (SVM). However, Adaptive Ridge, as applied here, is not equivalent to SVM, as the cost minimized is different. The fit and prototypes are thus different from the fit and support vectors that would be obtained from SVM. \n\n5.2 ADDITIVE MODELS \n\nAdditive models [6] are sums of univariate functions, f(x) = Σ_{j=1}^d f_j(x_j).
In the nonparametric setting, {f_j} are smooth but unspecified functions. Additive models are easily represented and thus interpretable, but they require the choice of the relevant covariates to be included in the model, and of the smoothness of each f_j. \n\nIn the form presented in the two previous sections, Adaptive Ridge regression penalizes each individual coefficient differently, but it is easily extended to the pooled penalization of coefficients. Adaptive Ridge may thus be used as an alternative to BRUTO [6] to balance the penalization parameters on each f_j. \n\nA classical choice for f_j is cubic spline smoothing. Let B_j denote the ℓ × (ℓ + 2) matrix of the unconstrained B-spline basis, evaluated at the x_ij. Let Ω_j be the (ℓ + 2) × (ℓ + 2) matrix corresponding to the penalization of the second derivative of f_j. The coefficients of f_j in the unconstrained B-spline basis are noted β_j. The \"natural\" extension of Adaptive Ridge is to minimize \n\n‖Σ_{j=1}^d B_j β_j − y‖² + Σ_{j=1}^d λ_j β_jᵀ Ω_j β_j ,  (7) \n\nsubject to constraint (3). This problem is easily shown to have the same solution as the minimization of \n\n‖Σ_{j=1}^d B_j β_j − y‖² + (λ/d) (Σ_{j=1}^d √(β_jᵀ Ω_j β_j))² .  (8) \n\nNote that if the cost (8) is optimized with respect to a single covariate, the solution is a usual smoothing spline regression (with quadratic penalization). In the multidimensional case, α_j² = β_jᵀ Ω_j β_j = ∫ {f_j''(t)}² dt may be used to summarize the non-linearity of f_j; thus |α_j| can be interpreted as a relevance index operating besides linear dependence on feature j. The penalizer in (8) is a least absolute shrinkage operator applied to the α_j.
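To make the balance between (7) and (8) concrete: for fixed coefficients, minimizing (7) over the λ_j under constraint (3) gives λ_j = λ Σ_k α_k / (d α_j), a Lagrange-multiplier computation that is implied by, but not spelled out in, the text. A small sketch (our notation; the α values are made up) checking that this choice satisfies (3) and reproduces the pooled penalizer of (8):

```python
import numpy as np

def balance_penalties(alphas, lam):
    """Optimal lambda_j for the penalty sum_j lambda_j * alpha_j^2
    under constraint (3): (1/d) sum_j 1/lambda_j = 1/lam.

    alphas[j] = sqrt(beta_j' Omega_j beta_j) measures the roughness
    of f_j; a rougher f_j gets a lighter penalty (lambda_j ~ 1/alpha_j).
    Assumes all alphas > 0 (alpha_j = 0 corresponds to lambda_j -> inf)."""
    alphas = np.asarray(alphas, dtype=float)
    d = alphas.size
    return lam * np.sum(alphas) / (d * alphas)

# Illustrative roughness values for d = 4 additive components
alphas = np.array([0.5, 2.0, 0.1, 1.4])
lam = 3.0
lam_j = balance_penalties(alphas, lam)
# Check: (1/d) sum 1/lam_j == 1/lam, and
#        sum lam_j alpha_j^2 == (lam/d) * (sum alpha_j)^2
```

Substituting this λ_j back into (7) is exactly how the pooled ℓ1-type penalizer of (8) arises from the quadratic form.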
Hence, formula (8) may be interpreted as \"quadratic penalization within, and soft-thresholding between, covariates\". The FP algorithm of section 3 is easily modified to minimize (8), and backfitting may be used to solve the second step of this procedure. \n\nA simulated example in dimension five is shown in fig. 2. The fitted univariate functions are plotted for five values of λ. There is no dependency between the explained variable and the last covariate. The other covariates affect the response, but the dependency on the first features is smoother, hence easier to capture and more relevant for the spline smoother. For a small value of λ, the univariate functions are unsmooth, and the additive model is interpolating the data. For λ = 10⁻⁴, the dependencies are well estimated on all covariates. As λ increases, the covariates with higher coordinate number are more heavily penalized, and the corresponding f̂_j tend to be linear. \n\nFigure 2: Adaptive Ridge in additive modeling on simulated data. The true model is y = x_1 + cos(πx_2) + cos(2πx_3) + cos(3πx_4) + ε. The covariates are independently drawn from a uniform distribution on [−1, 1] and ε is a Gaussian noise of standard deviation σ = 0.3. The solid curves are the estimated univariate functions for different values of λ, and + are partial residuals. \n\nLinear trends are not penalized in cubic spline smoothing. Thus, when β_jᵀ Ω_j β_j = 0 after convergence, the jth covariate is not eliminated. This can be corrected by applying Adaptive Ridge a second time.
To test whether a significant linear trend can be detected, a linear (penalized) model may be used for f_j, the remaining f_k, k ≠ j, being cubic splines. \n\n5.3 MLP FITTING \n\nThe generalization to the pooled penalization of coefficients can also be applied to Multi-Layered Perceptrons to control the complexity of the fit. If weights are penalized individually, Adaptive Ridge is equivalent to the Lasso. If weights are pooled by layer, Adaptive Ridge automatically tunes the amount of penalization on each layer, thus avoiding the multiple hyper-parameter tuning necessary in weight decay [7]. \n\nFigure 3: Groups of weights for two examples of Adaptive Ridge in MLP fitting. Left: hidden node soft-thresholding. Right: input penalization and selection, and individual smoothing coefficient for each output unit. \n\nTwo other interesting configurations are shown in fig. 3. If weights are pooled by incoming and outgoing weights of a unit, node penalization/pruning is performed. The weight groups may also gather the outgoing weights from each input unit, or the incoming weights from each output unit (one set per input plus one per output). The goal here is to penalize/select the input variables according to their relevance, and each output variable according to the smoothness of the corresponding mapping. This configuration proves especially useful in time series prediction, where the number of inputs to be fed into the network is not known in advance. There are also more complex choices of pooling, such as the one proposed to encourage additive modeling in Automatic Relevance Determination [8]. \n\nReferences \n\n[1] L. Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996. \n\n[2] D.L. Donoho and I.M. Johnstone.
Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921, 1998. \n\n[3] F. Girosi. An equivalence between sparse approximation and support vector machines. Technical Report 1606, M.I.T. AI Laboratory, Cambridge, MA, 1997. \n\n[4] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, and T. Ziemske, editors, ICANN'98, volume 1 of Perspectives in Neural Computing, pages 201–206. Springer, 1998. \n\n[5] W. Härdle. Applied Nonparametric Regression, volume 19 of Econometric Society Monographs. Cambridge University Press, New York, 1990. \n\n[6] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman & Hall, New York, 1990. \n\n[7] D.J.C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992. \n\n[8] R.M. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer, New York, 1996. \n\n[9] S.J. Nowlan and G.E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992. \n\n[10] R.J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996. \n\n\f", "award": [], "sourceid": 1500, "authors": [{"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "St\u00e9phane", "family_name": "Canu", "institution": null}]}