{"title": "Following Curved Regularized Optimization Solution Paths", "book": "Advances in Neural Information Processing Systems", "page_first": 1153, "page_last": 1160, "abstract": null, "full_text": "Following Curved Regularized Optimization Solution Paths\n\nSaharon Rosset\nIBM T.J. Watson Research Center\nYorktown Heights, NY 10598\nsrosset@us.ibm.com\n\nAbstract\n\nRegularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains \"very close\" to the path of optimal solutions and illustrate it with examples.\n\n1 Introduction\n\nGiven a data sample $(x_i, y_i)_{i=1}^n$ (with $x_i \\in \\mathbb{R}^p$ and $y_i \\in \\mathbb{R}$ for regression, $y_i \\in \\{\\pm 1\\}$ for classification), the generic regularized optimization problem calls for fitting models to the data while controlling complexity by solving a penalized fitting problem:\n\n(1) $\\hat{\\beta}(\\lambda) = \\arg\\min_{\\beta} \\sum_i C(y_i, \\beta^T x_i) + \\lambda J(\\beta)$\n\nwhere C is a convex loss function and J is a convex model complexity penalty (typically taken to be the lq norm of $\\beta$, with $q \\ge 1$; see footnote 1).\n\nMany commonly used supervised learning methods can be cast in this form, including regularized 1-norm and 2-norm support vector machines [13, 4], regularized linear and logistic regression (i.e. ridge regression, lasso and their logistic equivalents) and more. 
In [8] we show that boosting can also be described as approximate regularized optimization, with an l1-norm penalty.\n\nDetailed discussion of the considerations in selecting penalty and loss functions for regularized fitting is outside the scope of this paper. In general, there are two main areas we need to consider in this selection:\n\n1. Statistical considerations: robustness (which affects selection of loss), sparsity (l1-norm penalty encourages sparse solutions) and identifiability are among the questions we should keep in mind when selecting our formulation.\n2. Computational considerations: we should be able to solve the problems we pose with the computational resources at our disposal. Kernel methods and boosting are examples of computational tricks that allow us to solve very high dimensional problems exactly or approximately with a relatively small cost. In this paper we suggest a new computational approach.\n\nFootnote 1: We assume a linear model in (1), but this is much less limiting than it seems, as the model can be linear in basis expansions of the original predictors, and so our approach covers kernel methods, wavelets, boosting and more.\n\nOnce we have settled on a loss and penalty, we are still faced with the problem of selecting a \"good\" regularization parameter $\\lambda$, in terms of prediction performance. A common approach is to solve (1) for several values of $\\lambda$, then use holdout data (or theoretical approaches, like AIC or SRM) to select a good value. However, if we view the regularized optimization problem as a family of problems, parameterized by the regularization parameter $\\lambda$, it allows us to define the \"path\" of optimal solutions $\\{\\hat{\\beta}(\\lambda) : 0 \\le \\lambda \\le \\infty\\}$, which is a 1-dimensional curve through $\\mathbb{R}^p$. 
Path following methods attempt to utilize the mathematical properties of this curve to devise efficient procedures for \"following\" it and generating the full set of regularized solutions with a (relatively) small computational cost.\n\nAs it turns out, there is a family of well known and interesting regularized problems for which efficient exact path following algorithms can be devised. These include the lasso [3], 1- and 2-norm support vector machines [13, 4] and many others [9]. The main property of these problems which makes them amenable to such methods is the piecewise linearity of the regularized solution path in $\\mathbb{R}^p$. See [9] for detailed exposition of these properties and the resulting algorithms.\n\nHowever, the path following idea can stretch beyond these exact piecewise linear algorithms. The \"first order\" approach is to use gradient-based methods. In [8] we have described boosting as an approximate gradient-based algorithm for following l1-norm regularized solution paths. [6] suggest a gradient descent algorithm for finding an optimal solution for a fixed value of $\\lambda$ and are seemingly unaware that the path they are going through is of independent interest, as it consists of approximate (albeit very approximate) solutions to l1-regularized problems. Gradient-based methods, however, can only follow regularized paths under strict and non-testable conditions, and theoretical \"closeness\" results to the optimal path are extremely difficult to prove for them (see [8] for details).\n\nIn this paper, we suggest a general second-order algorithm for following \"curved\" regularized solution paths (i.e. ones that cannot be followed exactly by piecewise-linear algorithms). It consists of iteratively changing the regularization parameter, while making a single Newton step at every iteration towards the optimal penalized solution, for the current value of $\\lambda$. 
We prove that if both the loss and penalty are \"nice\" (in terms of bounds on their relevant derivatives in the relevant region), then the algorithm is guaranteed to stay \"very close\" to the true optimal path, where \"very close\" is defined as:\n\n    If the change in the regularization parameter at every iteration is $\\epsilon$, then the solution path we generate is guaranteed to be within $O(\\epsilon^2)$ from the true path of penalized optimal solutions.\n\nIn section 2 we present the algorithm, and we then illustrate it on l1- and l2-regularized logistic regression in section 3. Section 4 is devoted to a formal statement and proof outline of our main result. We discuss possible extensions and future work in section 5.\n\n2 Path following algorithm\n\nWe assume throughout that the loss function C is twice differentiable. Assume for now also that the penalty J is twice differentiable (this assumption does not apply to the l1-norm penalty which is of great interest and we address this point later). Our method rests on the normal equations for (1):\n\n(2) $\\nabla C(\\hat{\\beta}(\\lambda)) + \\lambda \\nabla J(\\hat{\\beta}(\\lambda)) = 0$\n\nOur algorithm iteratively constructs an approximate solution $\\beta^{(\\epsilon)}_t$ by taking \"small\" Newton-Raphson steps trying to maintain (2) as the regularization changes. Our main result in this paper is to show, both empirically and theoretically, that for small $\\epsilon$, the difference $\\|\\beta^{(\\epsilon)}_t - \\hat{\\beta}(\\lambda_0 + \\epsilon t)\\|$ is small, and thus that our method successfully tracks the path of optimal solutions to (1).\n\nAlgorithm 1 gives a formal description of our quadratic tracking method. We start from a solution to (1) for some fixed $\\lambda_0$ (e.g. $\\hat{\\beta}(0)$, the non-regularized solution). At each iteration we increase $\\lambda$ by $\\epsilon$ and take a single Newton-Raphson step towards the solution to (2) with the new $\\lambda$ value in step 2(b).\n\nAlgorithm 1 Approximate incremental quadratic algorithm for regularized optimization\n\n 1. Set $\\beta^{(\\epsilon)}_0 = \\hat{\\beta}(\\lambda_0)$, set $t = 0$.\n\n 2. 
While ($\\lambda_t < \\lambda_{\\max}$)\n\n (a) $\\lambda_{t+1} = \\lambda_t + \\epsilon$\n\n (b) $\\beta^{(\\epsilon)}_{t+1} = \\beta^{(\\epsilon)}_t - [\\nabla^2 C(\\beta^{(\\epsilon)}_t) + \\lambda_{t+1} \\nabla^2 J(\\beta^{(\\epsilon)}_t)]^{-1} [\\nabla C(\\beta^{(\\epsilon)}_t) + \\lambda_{t+1} \\nabla J(\\beta^{(\\epsilon)}_t)]$\n\n (c) $t = t + 1$\n\n2.1 The l1-norm penalty\n\nThe l1-norm penalty, $J(\\beta) = \\|\\beta\\|_1$, is of special interest because of its favorable statistical properties (e.g. [2]) and its widespread use in popular methods, such as the lasso [10] and 1-norm SVM [13]. However it is not differentiable and so our algorithm does not apply to l1-penalized problems directly.\n\nTo understand how we can generalize Algorithm 1 to this situation, we need to consider the Karush-Kuhn-Tucker (KKT) conditions for optimality of the optimization problem implied by (1). It is easy to verify that the normal equations (2) can be replaced by the following KKT-based condition for l1-norm penalty:\n\n(3) $|\\nabla C(\\hat{\\beta}(\\lambda))_j| < \\lambda \\Rightarrow \\hat{\\beta}(\\lambda)_j = 0$\n(4) $\\hat{\\beta}(\\lambda)_j \\neq 0 \\Rightarrow |\\nabla C(\\hat{\\beta}(\\lambda))_j| = \\lambda$\n\nThese conditions hold for any differentiable loss and tell us that at each point on the path we have a set A of non-0 coefficients which corresponds to the variables whose current \"generalized correlation\" $|\\nabla C(\\hat{\\beta}(\\lambda))_j|$ is maximal and equal to $\\lambda$. All variables with smaller generalized correlation have 0 coefficient at the optimal penalized solution for this $\\lambda$. Note that the l1-norm penalty is twice differentiable everywhere except at 0. So if we carefully manage the set of non-0 coefficients according to these KKT conditions, we can still apply our algorithm in the lower-dimensional subspace spanned by non-0 coefficients only.\n\nThus we get Algorithm 2, which employs the Newton approach of Algorithm 1 for twice differentiable penalty, limited to the sub-space of \"active\" coefficients denoted by A. It adds to Algorithm 1 updates for the \"add variable to active set\" and \"remove variable from active set\" events, when a variable becomes \"highly correlated\" as defined in (4) and when a coefficient hits 0, respectively. 
(See footnote 2.)\n\nAlgorithm 2 Approximate incremental quadratic algorithm for regularized optimization with lasso penalty\n\n 1. Set $\\beta^{(\\epsilon)}_0 = \\hat{\\beta}(\\lambda_0)$, set $t = 0$, set $A = \\{j : \\hat{\\beta}(\\lambda_0)_j \\neq 0\\}$.\n\n 2. While ($\\lambda_t < \\lambda_{\\max}$)\n\n (a) $\\lambda_{t+1} = \\lambda_t + \\epsilon$\n\n (b) $\\beta^{(\\epsilon)}_{t+1} = \\beta^{(\\epsilon)}_t - [\\nabla^2 C(\\beta^{(\\epsilon)}_t)_A]^{-1} [\\nabla C(\\beta^{(\\epsilon)}_t)_A + \\lambda_{t+1} \\mathrm{sgn}(\\beta^{(\\epsilon)}_t)_A]$\n\n (c) $A = A \\cup \\{j \\notin A : |\\nabla C(\\beta^{(\\epsilon)}_{t+1})_j| > \\lambda_{t+1}\\}$\n\n (d) $A = A - \\{j \\in A : |\\beta^{(\\epsilon)}_{t+1,j}| < \\delta\\}$\n\n (e) $t = t + 1$\n\n2.2 Computational considerations\n\nFor a fixed $\\lambda_0$ and $\\lambda_{\\max}$, Algorithms 1 and 2 take $O(1/\\epsilon)$ steps. At each iteration they need to calculate the Hessians of both the loss and the penalty at a typical computational cost of $O(n p^2)$; invert the resulting $p \\times p$ matrix at a cost of $O(p^3)$; and perform the gradient calculation and multiplication, which are $o(n p^2)$ and so do not affect the complexity calculation. Since we implicitly assume throughout that $n \\ge p$, we get overall complexity of $O(n p^2 / \\epsilon)$. The choice of $\\epsilon$ represents a tradeoff between computational complexity and accuracy (in section 4 we present theoretical results on the relationship between $\\epsilon$ and the accuracy of the path approximation we get). In practice, our algorithm is feasible for problems with up to several hundred predictors and several thousand observations. See the example in section 3.\n\nIt is interesting to compare this calculation to the obvious alternative, which is to solve $O(1/\\epsilon)$ regularized problems (1) separately, using a Newton-Raphson approach, resulting in the same complexity (assuming the number of Newton-Raphson iterations for finding each solution is bounded). There are several reasons why our approach is preferable:\n\n - The number of iterations until convergence of Newton-Raphson may be large even if it does converge. 
Our algorithm guarantees we stay very close to the optimal solution path with a single Newton step at each new value of $\\lambda$.\n\n - Empirically we observe that in some cases our algorithm is able to follow the path while direct solution for some values of $\\lambda$ fails to converge. We assume this is related to various numeric properties of the specific problems being solved.\n\n - For the interesting case of l1-norm penalty and a \"curved\" loss function (like logistic log-likelihood), there is no direct Newton-Raphson algorithm. Re-formulating the problem into differentiable form requires doubling the dimensionality. Using our Algorithm 2, we can still utilize the same Newton method, with significant computational savings when many coefficients are 0 and we work in a lower-dimensional subspace.\n\nOn the flip side, our results in section 4 below indicate that to guarantee successful tracking we require $\\epsilon$ to be small, meaning the number of steps we do in the algorithm may be significantly larger than the number of distinct problems we would typically solve to select $\\lambda$ using a non-path approach.\n\nFootnote 2: When a coefficient hits 0 it not only hits a non-differentiability point in the penalty, it also ceases to be maximally correlated as defined in (4). A detailed proof of this fact and the rest of the \"accounting\" approach can be found in [9].\n\n2.3 Connection to path following methods from numerical analysis\n\nThere is extensive literature on path-following methods for solution paths of general parametric problems. A good survey is given in [1]. In this context, our method can be described as a \"predictor-corrector\" method with a redundant first order predictor step. That is, the corrector step starts from the previous approximate solution. These methods are recognized as attractive options when the functions defining the path (in our case, the combination of loss and penalty) are \"smooth\" and \"far from linear\". 
These conditions for efficacy of our approach are reflected in the regularity conditions for the closeness result in Section 4.\n\n3 Example: l2- and l1-penalized logistic regression\n\nRegularized logistic regression has been successfully used as a classification and probability estimation approach [11, 12]. We first illustrate applying our quadratic method to this regularized problem using a small subset of the \"spam\" data-set, available from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), which allows us to present some detailed diagnostics. Next, we apply it to the full \"spam\" data-set, to demonstrate its time complexity on bigger problems.\n\nWe first choose five variables and 300 observations and track the solution paths to two regularized logistic regression problems with the l2-norm and the l1-norm penalties:\n\n(5) $\\hat{\\beta}(\\lambda) = \\arg\\min_{\\beta} \\sum_i \\log(1 + \\exp\\{-y_i \\beta^T x_i\\}) + \\lambda \\|\\beta\\|_2^2$\n\n(6) $\\hat{\\beta}(\\lambda) = \\arg\\min_{\\beta} \\sum_i \\log(1 + \\exp\\{-y_i \\beta^T x_i\\}) + \\lambda \\|\\beta\\|_1$\n\nFigure 1 shows the solution paths $\\beta^{(\\epsilon)}(t)$ generated by running Algorithms 1 and 2 on this data using $\\epsilon = 0.02$ and starting at $\\lambda = 0$, i.e. from the non-regularized logistic regression solution. The interesting graphs for our purpose are the ones on the right. They represent the \"optimality gap\":\n\n $e_t = \\frac{\\nabla C(\\beta^{(\\epsilon)}_t)}{\\nabla J(\\beta^{(\\epsilon)}_t)} + \\lambda_t$\n\nwhere the division is done componentwise (and so the five curves in each plot correspond to the five variables we are using). Note that the optimal solution $\\hat{\\beta}(\\lambda_t)$ is uniquely defined by the fact that (2) holds and therefore the \"optimality gap\" is equal to zero componentwise at $\\hat{\\beta}(\\lambda_t)$. By convexity and regularity of the loss and the penalty, there is a correspondence between small values of e and small distance $\\|\\beta^{(\\epsilon)}(t) - \\hat{\\beta}(\\lambda_t)\\|$. In our example we observe that the components of e seem to be bounded in a small region around 0 for both paths (note the small scale of the y axis in both plots: the maximal error is under $10^{-3}$). 
We conclude that on this simple example our method tracks the optimal solution paths well, both for the l1- and l2-regularized problems. The plots on the left show the actual coefficient paths: the curve in $\\mathbb{R}^5$ is shown as five coefficient traces in $\\mathbb{R}$, each corresponding to one variable, with the non-regularized solution (identical for both problems) on the extreme left.\n\nFigure 1: Solution paths (left) and optimality criterion $\\nabla C / \\nabla J + \\lambda$ (right), both plotted against $\\lambda$, for l1 penalized logistic regression (top) and l2 penalized logistic regression (bottom). These result from running Algorithms 2 and 1, respectively, using $\\epsilon = 0.02$ and starting from the non-regularized logistic regression solution (i.e. $\\lambda = 0$).\n\nNext, we run our algorithm on the full \"spam\" data-set, containing p = 57 predictors and n = 4601 observations. For both the l1- and l2-penalized paths we used $\\lambda_0 = 0$, $\\lambda_{\\max} = 50$, $\\epsilon = 0.02$, and the whole path was generated in under 5 minutes using a Matlab implementation on an IBM T-30 Laptop. 
As in the small-scale example, the \"optimality criterion\" was uniformly small throughout the two paths, with none of its 57 components exceeding $10^{-3}$ at any point.\n\n4 Theoretical closeness result\n\nIn this section we prove that our algorithm can track the path of true solutions to (1). We show that under regularity conditions on the loss and penalty (which hold for all the candidates we have examined), if we run Algorithm 1 with a specific step size $\\epsilon$, then we remain within $O(\\epsilon^2)$ of the true path of optimal regularized solutions.\n\nTheorem 1 Assume $\\lambda_0 > 0$, then for $\\epsilon$ small enough and under regularity conditions on the derivatives of C and J,\n\n $\\forall\\ 0 < c < \\lambda_{\\max} - \\lambda_0$, $\\|\\beta^{(\\epsilon)}(c/\\epsilon) - \\hat{\\beta}(\\lambda_0 + c)\\| = O(\\epsilon^2)$\n\nSo there is a uniform bound $O(\\epsilon^2)$ on the error which does not depend on c.\n\nProof We give the details of the proof in Appendix A of [7]. Here we give a brief review of the main steps.\n\nSimilar to section 3 we define the \"optimality gap\":\n\n(7) $\\left(\\frac{\\nabla C(\\beta^{(\\epsilon)}_t)}{\\nabla J(\\beta^{(\\epsilon)}_t)}\\right)_j + \\lambda_t = e_{tj}$\n\nAlso define a \"regularity constant\" M, which depends on $\\lambda_0$ and the first, second and third derivatives of the loss and penalty.\n\nThe proof is presented as a succession of lemmas:\n\nLemma 2 Let $u_1 = M p \\epsilon^2$, $u_t = M (u_{t-1} + \\sqrt{p} \\epsilon)^2$, then: $\\|e_t\\|_2 \\le u_t$\n\nThis lemma gives a recursive expression bounding the error in the optimality gap (7) as the algorithm proceeds. The proof is based on separate Taylor expansions of the numerator and denominator of the ratio $\\nabla C / \\nabla J$ in the optimality gap and some tedious algebra.\n\nLemma 3 If $\\sqrt{p} \\epsilon M \\le 1/4$ then $u_t \\le \\frac{1}{2M} - \\sqrt{p} \\epsilon - \\frac{\\sqrt{1 - 4 \\sqrt{p} \\epsilon M}}{2M} = O(\\epsilon^2)$, $\\forall t$\n\nThis lemma shows that the recursive bound translates to a uniform $O(\\epsilon^2)$ bound, if $\\epsilon$ is small enough. 
The proof consists of analytically finding the fixed point of the increasing sequence $u_t$.\n\nLemma 4 Under regularity conditions on the penalty and loss functions in the neighborhood of the solutions to (1), the $O(\\epsilon^2)$ uniform bound of lemma 3 translates to an $O(\\epsilon^2)$ uniform bound on $\\|\\beta^{(\\epsilon)}(c/\\epsilon) - \\hat{\\beta}(\\lambda_0 + c)\\|$\n\nFinally, this lemma translates the optimality gap bound to an actual closeness result. This is proven via a Lipschitz argument.\n\n4.1 Required regularity conditions\n\nRegularity in the loss and the penalty is required in the definition of the regularity constant M and in the translation of the $O(\\epsilon^2)$ bound on the \"optimality gap\" into one on the distance from the path in lemma 4. The exact derivation of the regularity conditions is highly technical and lengthy. They require us to bound the norm of third derivative \"hyper-matrices\" for the loss and the penalty as well as the norms of various functions of the gradients and Hessians of both (the boundedness is required only in the neighborhood of the optimal path where our approximate path can venture, obviously). We also need to have $\\lambda_0 > 0$ and $\\lambda_{\\max} < \\infty$. Refer to Appendix A of [7] for details. Assuming that $\\lambda_0 > 0$ and $\\lambda_{\\max} < \\infty$, these conditions hold for every interesting example we have encountered, including:\n\n - Ridge regression and the lasso (that is, l2- and l1-regularized squared error loss).\n - l1- and l2-penalized logistic regression. Also Poisson regression and other exponential family models.\n - l1- and l2-penalized exponential loss.\n\nNote that in our practical examples above we have started from $\\lambda_0 = 0$ and our method still worked well. 
We observe in figure 1 that the tracking algorithm indeed suffers the biggest inaccuracy for the small values of $\\lambda$, but manages to \"self correct\" as $\\lambda$ increases.\n\n5 Extensions\n\nWe have described our method in the context of linear models for supervised learning. There are several natural extensions and enhancements to consider.\n\nBasis expansions and Kernel methods\n\nOur approach obviously applies, as is, to models that are linear in basis expansions of the original variables (like wavelets or kernel methods) as long as p < n is preserved. However, the method can easily be applied to high (including infinite) dimensional kernel versions of regularized models where RKHS theory applies. We know that the solution path is fully within the span of the representer functions, that is the columns of the Kernel matrix. With a kernel matrix K with columns $k_1, ..., k_n$ and the standard l2-norm penalty, the regularized problem becomes:\n\n $\\hat{\\alpha}(\\lambda) = \\arg\\min_{\\alpha} \\sum_i C(y_i, \\alpha^T k_i) + \\lambda \\alpha^T K \\alpha$\n\nso the penalty now also contains the Kernel matrix, but this poses no complications in using Algorithm 1. The only consideration we need to keep in mind is the computational one, as our complexity is $O(n^3 / \\epsilon)$. So our method is fully applicable and practical for kernel methods, as long as the number of observations, and the resulting kernel matrix, are not too large (up to several hundred).\n\nUnsupervised learning\n\nThere is no reason to limit the applicability of this approach to supervised learning. Thus, for example, adaptive density estimation using negative log-likelihood as a loss can be regularized and the solution path be tracked using our algorithm.\n\nComputational tricks\n\nThe computational complexity of our algorithm limits its applicability to large problems. To improve its scalability we primarily need to reduce the effort in the Hessian calculation and inversion. 
The obvious suggestion here would be to keep the Hessian part of step 2(b) in Algorithm 1 fixed for many iterations and change the gradient part only, then update the Hessian occasionally. The clear disadvantage would be that the \"closeness\" guarantees would no longer hold. We have not tried this in practice but believe it is worth pursuing.\n\nAcknowledgements. The author thanks Noureddine El Karoui for help with the proof and Jerome Friedman, Giles Hooker, Trevor Hastie and Ji Zhu for helpful discussions.\n\nReferences\n\n[1] Allgower, E. L. & Georg, K. (1993). Continuation and path following. Acta Numer., 2:1-64.\n\n[2] Donoho, D., Johnstone, I., Kerkyacharian, G. & Picard, D. (1995). Wavelet shrinkage: Asymptopia? Annals of Statistics.\n\n[3] Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics.\n\n[4] Hastie, T., Rosset, S., Tibshirani, R. & Zhu, J. (2004). The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research, 5(Oct):1391-1415.\n\n[5] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag.\n\n[6] Kim, Y. & Kim, J. (2004). Gradient LASSO for feature selection. ICML-04, to appear.\n\n[7] Rosset, S. (2003). Topics in Regularization and Boosting. PhD thesis, dept. of Statistics, Stanford University. http://www-stat.stanford.edu/~saharon/papers/PhDThesis.pdf\n\n[8] Rosset, S., Zhu, J. & Hastie, T. (2003). Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941-973.\n\n[9] Rosset, S. & Zhu, J. (2003). Piecewise linear regularized solution paths. Submitted.\n\n[10] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. JRSSB.\n\n[11] Wahba, G., Gu, C., Wang, Y. & Chappell, R. (1995). Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance. In D.H. 
Wolpert,\n editor, The Mathematics of Generalization.\n\n[12] Zhu, J. & Hastie, T. (2003). Classification of Gene Microarrays by Penalized Logistic Regres-\n sion. Biostatistics, to appear.\n\n[13] Zhu, J., Hastie, T., Rosset, S. & Tibshirani, R. (2004). 1-norm support vector machines. Neural\n Information Processing Systems, 16.\n\n\f\n", "award": [], "sourceid": 2600, "authors": [{"given_name": "Saharon", "family_name": "Rosset", "institution": null}]}