{"title": "Generalized Linear Model Regression under Distance-to-set Penalties", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1395, "abstract": "Estimation in generalized linear models (GLM) is complicated by the presence of constraints. One can handle constraints by maximizing a penalized log-likelihood. Penalties such as the lasso are effective in high dimensions but often lead to severe shrinkage. This paper explores instead penalizing the squared distance to constraint sets. Distance penalties are more flexible than algebraic and regularization penalties, and avoid the drawback of shrinkage. To optimize distance penalized objectives, we make use of the majorization-minimization principle. Resulting algorithms constructed within this framework are amenable to acceleration and come with global convergence guarantees. Applications to shape constraints, sparse regression, and rank-restricted matrix regression on synthetic and real data showcase the strong empirical performance of distance penalization, even under non-convex constraints.", "full_text": "Generalized Linear Model Regression under\n\nDistance-to-set Penalties\n\nUniversity of California, Los Angeles\n\nJason Xu\n\njqxu@ucla.edu\n\nEric C. Chi\n\nNorth Carolina State University\n\neric_chi@ncsu.edu\n\nUniversity of California, Los Angeles\n\nKenneth Lange\n\nklange@ucla.edu\n\nAbstract\n\nEstimation in generalized linear models (GLM) is complicated by the presence of\nconstraints. One can handle constraints by maximizing a penalized log-likelihood.\nPenalties such as the lasso are effective in high dimensions, but often lead to\nunwanted shrinkage. This paper explores instead penalizing the squared distance\nto constraint sets. Distance penalties are more \ufb02exible than algebraic and regu-\nlarization penalties, and avoid the drawback of shrinkage. To optimize distance\npenalized objectives, we make use of the majorization-minimization principle. 
Resulting algorithms constructed within this framework are amenable to acceleration and come with global convergence guarantees. Applications to shape constraints, sparse regression, and rank-restricted matrix regression on synthetic and real data showcase strong empirical performance, even under non-convex constraints.

1 Introduction and Background

In classical linear regression, the response variable y follows a Gaussian distribution whose mean xᵗβ depends linearly on a parameter vector β through a vector of predictors x. Generalized linear models (GLMs) extend classical linear regression by allowing y to follow any exponential family distribution, with the conditional mean of y a nonlinear function h(xᵗβ) of xᵗβ [24]. This encompasses a broad class of important models in statistics and machine learning. For instance, count data and binary classification come within the purview of generalized linear regression.

In many settings, it is desirable to impose constraints on the regression coefficients. Sparse regression is a prominent example. In high-dimensional problems where the number of predictors n exceeds the number of cases m, inference is possible provided the regression function lies in a low-dimensional manifold [11]. In this case, the coefficient vector β is sparse, and just a few predictors explain the response y. The goals of sparse regression are to correctly identify the relevant predictors and to estimate their effect sizes. One approach, best subset regression, is known to be NP-hard. Penalizing the likelihood by including an ℓ0 penalty ‖β‖0 (the number of nonzero coefficients) is a possibility, but the resulting objective function is nonconvex and discontinuous. The convex relaxation of ℓ0 regression replaces ‖β‖0 by the ℓ1 norm ‖β‖1.
This LASSO proxy for ‖β‖0 restores convexity and continuity [31]. While LASSO regression has been a great success, it has the downside of simultaneously inducing both sparsity and parameter shrinkage. Unfortunately, shrinkage often has the undesirable side effect of including spurious predictors (false positives) along with the true predictors.

Motivated by sparse regression, we now consider the alternative of penalizing the log-likelihood by the squared distance from the parameter vector β to the constraint set. If there are several constraints, then we add a distance penalty for each constraint set. Our approach is closely related to the proximal distance algorithm [19, 20] and proximity function approaches to convex feasibility problems [5]. Neither of these prior algorithm classes explicitly considers generalized linear models. Beyond sparse regression, distance penalization applies to a wide class of statistically relevant constraint sets, including isotonic constraints and matrix rank constraints. To maximize distance penalized log-likelihoods, we advocate the majorization-minimization (MM) principle [2, 18, 19]. MM algorithms are increasingly popular in solving the large-scale optimization problems arising in statistics and machine learning [22]. Although distance penalization preserves convexity when it already exists, neither the objective function nor the constraint sets need be convex to carry out estimation. The capacity to project onto each constraint set is necessary. Fortunately, many projection operators are known. Even in the absence of convexity, we are able to prove that our algorithm converges to a stationary point. In the presence of convexity, the stationary points are global minima.

In subsequent sections, we begin by briefly reviewing GLM regression and shrinkage penalties.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We then present our distance penalty method and a sample of statistically relevant problems that it can address. Next we lay out in detail our distance penalized GLM algorithm, discuss how it can be accelerated, summarize our convergence results, and compare its performance to that of competing methods on real and simulated data. We close with a summary and a discussion of future directions.

GLMs and Exponential Families: In linear regression, the vector of responses y is normally distributed with mean vector E(y) = Xβ and covariance matrix V(y) = σ²I. A GLM preserves the independence of the responses yᵢ but assumes that they are generated from a shared exponential family distribution. The response yᵢ is postulated to have mean µᵢ(β) = E[yᵢ | β] = h(xᵢᵗβ), where xᵢ is the ith row of a design matrix X, and the inverse link function h(s) is smooth and strictly increasing [24]. The functional inverse h⁻¹(s) of h(s) is called the link function. The likelihood of any exponential family can be written in the canonical form

\[ p(y_i \mid \theta_i, \tau) = c_1(y_i, \tau) \exp\left\{ \frac{y_i\theta_i - \psi(\theta_i)}{c_2(\tau)} \right\}. \tag{1} \]

Here τ is a fixed scale parameter, and the positive functions c₁ and c₂ are constant with respect to the natural parameter θᵢ. The function ψ is smooth and convex; a brief calculation shows that µᵢ = ψ′(θᵢ). The canonical link function h⁻¹(s) is defined by the condition h⁻¹(µᵢ) = xᵢᵗβ = θᵢ. In this case, h(θᵢ) = ψ′(θᵢ), and the log-likelihood ln p(y | β, X, τ) is concave in β. Because c₁ and c₂ are not functions of θ, we may drop these terms and work with the log-likelihood up to proportionality. We denote this by L(β | y, X) ∝ ln p(y | β, X, τ).
The gradient and second differential of L(β | y, X) amount to

\[ \nabla L = \sum_{i=1}^{m} [y_i - \psi'(x_i^t\beta)]\, x_i \quad\text{and}\quad d^2 L = -\sum_{i=1}^{m} \psi''(x_i^t\beta)\, x_i x_i^t. \tag{2} \]

As an example, when ψ(θ) = θ²/2 and c₂(τ) = τ², the density (1) is the Gaussian likelihood, and GLM regression under the identity link coincides with standard linear regression. Choosing ψ(θ) = ln[1 + exp(θ)] and c₂(τ) = 1 corresponds to logistic regression under the canonical link h⁻¹(s) = ln(s/(1−s)) with inverse link h(s) = eˢ/(1+eˢ). GLMs unify a range of regression settings, including Poisson, logistic, gamma, and multinomial regression.

Shrinkage penalties: The least absolute shrinkage and selection operator (LASSO) [12, 31] solves

\[ \hat{\beta} = \operatorname{argmin}_\beta \Big[ \lambda\|\beta\|_1 - \frac{1}{m}\sum_{j=1}^{m} L(\beta \mid y_j, x_j) \Big], \tag{3} \]

where λ > 0 is a tuning constant that controls the strength of the ℓ1 penalty. The ℓ1 relaxation is a popular approach to promote a sparse solution, but there is no obvious map between λ and the sparsity level k. In practice, a suitable value of λ is found by cross-validation. Relying on global shrinkage towards zero, LASSO notoriously leads to biased estimates. This bias can be ameliorated by re-estimating under the model containing only the selected variables, known as the relaxed LASSO [25], but the success of this two-stage procedure relies on correct support recovery in the first step. In many cases, LASSO shrinkage is known to introduce false positives [30], resulting in spurious covariates that cannot be corrected. To combat these shortcomings, one may replace the LASSO penalty by a non-convex penalty with milder effects on large coefficients.
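Before turning to non-convex alternatives, note that the score and information in equation (2) are straightforward to evaluate numerically. The sketch below (function names and the logistic example are our own illustration, not part of the paper) computes both quantities for a canonical-link GLM:

```python
import numpy as np

def glm_score_info(beta, y, X, psi_p, psi_pp):
    """Equation (2): score  sum_i [y_i - psi'(x_i^t beta)] x_i
    and information  sum_i psi''(x_i^t beta) x_i x_i^t."""
    eta = X @ beta
    score = X.T @ (y - psi_p(eta))
    info = (X * psi_pp(eta)[:, None]).T @ X  # equals -d^2 L
    return score, info

# Logistic regression: psi(t) = ln(1 + e^t), so psi' is the sigmoid.
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.binomial(1, 0.5, size=50).astype(float)
score, info = glm_score_info(np.zeros(3), y, X,
                             sigmoid, lambda t: sigmoid(t) * (1.0 - sigmoid(t)))
```

Because ψ is convex, ψ″ ≥ 0 and the information matrix is positive semidefinite, matching the concavity of the log-likelihood noted above.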
The smoothly clipped absolute deviation (SCAD) penalty [10] and minimax concave penalty (MCP) [34] are even functions defined through their derivatives

\[ q_\gamma'(\beta_i, \lambda) = \lambda\Big[ 1_{\{|\beta_i|\le\lambda\}} + \frac{(\gamma\lambda - |\beta_i|)_+}{(\gamma-1)\lambda}\, 1_{\{|\beta_i|>\lambda\}} \Big] \quad\text{and}\quad q_\gamma'(\beta_i, \lambda) = \lambda\Big(1 - \frac{|\beta_i|}{\lambda\gamma}\Big)_+ \]

for βᵢ > 0. Both penalties reduce bias, interpolate between hard thresholding and LASSO shrinkage, and significantly outperform the LASSO in some settings, especially in problems with extreme sparsity. SCAD, MCP, as well as the relaxed LASSO come with the disadvantage of requiring an extra tuning parameter γ > 0 to be selected.

2 Regression with distance-to-constraint set penalties

As an alternative to shrinkage, we consider penalizing the distance between the parameter vector β and constraints defined by sets Cᵢ. Penalized estimation seeks the solution

\[ \hat{\beta} = \operatorname{argmin}_\beta \Big[ \frac{1}{2}\sum_i v_i \operatorname{dist}(\beta, C_i)^2 - \frac{1}{m}\sum_{j=1}^{m} L(\beta \mid y_j, x_j) \Big] := \operatorname{argmin}_\beta f(\beta), \tag{4} \]

where the vᵢ are weights on the distance penalties to the constraint sets Cᵢ. The Euclidean distance can also be written as

\[ \operatorname{dist}(\beta, C_i) = \|\beta - P_{C_i}(\beta)\|_2, \]

where P_{Cᵢ}(β) denotes the projection of β onto Cᵢ. The projection operator is uniquely defined when Cᵢ is closed and convex. If Cᵢ is merely closed, then P_{Cᵢ}(β) may be multi-valued for a few unusual external points β. Notice the distance penalty dist(β, Cᵢ)² is 0 precisely when β ∈ Cᵢ. The solution (4) represents a tradeoff between maximizing the log-likelihood and satisfying the constraints. When each Cᵢ is convex, the objective function is convex as a whole.
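A minimal sketch of the distance-penalized objective in equation (4), assuming each constraint set is supplied through its projection operator (the nonnegativity example and all names here are illustrative, not the authors' code):

```python
import numpy as np

def sq_dist_penalty(beta, projections, weights):
    """(1/2) sum_i v_i dist(beta, C_i)^2, with dist(beta, C_i) = ||beta - P_{C_i}(beta)||_2."""
    return 0.5 * sum(v * np.sum((beta - P(beta)) ** 2)
                     for v, P in zip(weights, projections))

def objective(beta, y, X, avg_neg_loglik, projections, weights):
    """f(beta) in equation (4): distance penalties minus the average log-likelihood."""
    return sq_dist_penalty(beta, projections, weights) + avg_neg_loglik(beta, y, X)

# Toy constraint: the nonnegative orthant, whose projection clips negative entries to zero.
proj_nonneg = lambda b: np.maximum(b, 0.0)
```

As the text notes, the penalty vanishes exactly on the constraint set and grows quadratically away from it.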
Sending all of the penalty constants vᵢ to ∞ produces in the limit the constrained maximum likelihood estimate. This is the philosophy behind the proximal distance algorithm [19, 20]. In practice, it often suffices to find the solution (4) under fixed large vᵢ. The reader may wonder why we employ squared distances rather than distances. The advantage is that squaring renders the penalties differentiable. Indeed, ∇½dist(x, Cᵢ)² = x − P_{Cᵢ}(x) whenever P_{Cᵢ}(x) is single-valued, which is almost always the case. In contrast, dist(x, Cᵢ) is typically nondifferentiable at boundary points of Cᵢ even when Cᵢ is convex. The following examples motivate distance penalization by considering constraint sets and their projections for several important models.

Sparse regression: Sparsity can be imposed directly through the constraint set C_k = {z ∈ Rⁿ : ‖z‖0 ≤ k}. Projecting a point β onto C_k is trivially accomplished by setting all but the k largest entries in magnitude of β equal to 0, the same operation behind iterative hard thresholding algorithms. Instead of solving the ℓ1-relaxation (3), our algorithm approximately solves the original ℓ0-constrained problem by repeatedly projecting onto the sparsity set C_k. Unlike LASSO regression, this strategy enables one to directly incorporate prior knowledge of the sparsity level k in an interpretable manner. When no such information is available, k can be selected by cross-validation just as the LASSO tuning constant λ is selected. Distance penalization escapes the NP-hard dilemma of best subset regression at the cost of possible convergence to a local minimum.

Shape and order constraints: As an example of shape and order restrictions, consider isotonic regression [1]. For data y ∈ Rⁿ, isotonic regression seeks to minimize ½‖y − β‖²₂ subject to the condition that the βᵢ are non-decreasing.
In this case, the relevant constraint set is the isotone convex cone C = {β : β₁ ≤ β₂ ≤ … ≤ βₙ}. Projection onto C is straightforward and efficiently accomplished using the pooled adjacent violators algorithm [1, 8]. More complicated order constraints can be imposed analogously: for instance, βᵢ ≤ βⱼ might be required of all edges i → j in a directed graph model. Notably, isotonic linear regression applies to changepoint problems [32]; our approach allows isotonic constraints in GLM estimation. One noteworthy application is Poisson regression where the intensity parameter is assumed to be nondecreasing with time.

Rank restriction: Consider GLM regression where the predictors Xᵢ and regression coefficients B are matrix-valued. To impose structure in high-dimensional settings, rank restriction serves as an appropriate matrix counterpart to sparsity for vector parameters. Prior work suggests that imposing matrix sparsity is much less effective than restricting the rank of B in achieving model parsimony [37]. The matrix analog of the LASSO penalty is the nuclear norm penalty. The nuclear norm of a matrix B is defined as the sum of its singular values, ‖B‖* = Σⱼ σⱼ(B) = trace(√(B*B)). Notice ‖B‖* is a convex relaxation of rank(B). Including a nuclear norm penalty entails shrinkage and induces low-rankness by proxy. Distance penalization of rank involves projecting onto the set C_r = {Z ∈ Rⁿˣⁿ : rank(Z) ≤ r} for a given rank r. Despite sacrificing convexity, distance penalization of rank is, in our view, both more natural and more effective than nuclear norm penalization. Avoiding shrinkage works to the advantage of distance penalization, as we will see empirically in Section 4.
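The projections appearing in these examples are all simple to implement. Below is a sketch (our own illustration, not the authors' code) of the three workhorse projections: hard thresholding onto the sparsity set C_k, pooled adjacent violators for the isotone cone, and singular value truncation onto the rank set C_r:

```python
import numpy as np

def project_sparse(beta, k):
    """Project onto C_k = {z : ||z||_0 <= k}: zero all but the k largest-magnitude entries."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]
    out[keep] = beta[keep]
    return out

def project_isotone(y):
    """Pooled adjacent violators: project y onto {b : b_1 <= b_2 <= ... <= b_n}."""
    blocks = []  # each block is [mean, size]
    for v in y:
        blocks.append([float(v), 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, m1 = blocks.pop(), blocks.pop()
            size = m1[1] + m2[1]
            blocks.append([(m1[0] * m1[1] + m2[0] * m2[1]) / size, size])
    return np.concatenate([np.full(size, mean) for mean, size in blocks])

def project_rank(B, r):
    """Project onto C_r = {Z : rank(Z) <= r} by truncating the SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```

Each map returns the nearest point of the constraint set in Euclidean (or Frobenius) norm, which is exactly what the distance penalty requires.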
According to the Eckart-Young theorem, the projection of a matrix B onto C_r is achieved by extracting the singular value decomposition of B and truncating all but the top r singular values. Truncating the singular value decomposition is a standard numerical task best computed by Krylov subspace methods [14].

Simple box constraints, hyperplanes, and balls: Many relevant set constraints reduce to closed convex sets with trivial projections. For instance, enforcing non-negative parameter values is accomplished by projecting onto the non-negative orthant. This is an example of a box constraint. Specifying linear equality and inequality constraints entails projecting onto a hyperplane or half-space, respectively. A Tikhonov or ridge penalty constraint ‖β‖₂ ≤ r requires spherical projection.

Finally, we stress that it is straightforward to consider combinations of the aforementioned constraints. Multiple norm penalties are already in common use. To encourage selection of correlated variables [38], the elastic net includes both ℓ1 and ℓ2 regularization terms. Further examples include matrix fitting subject to both sparse and low-rank matrix constraints [29] and LASSO regression subject to linear equality and inequality constraints [13]. In our setting the relative importance of different constraints can be controlled via the weights vᵢ.

3 Majorization-minimization

Figure 1: Illustrative example of two MM iterates with surrogates g(x | x_k) majorizing f(x) = cos(x).

To solve the minimization problem (4), we exploit the principle of majorization-minimization. An MM algorithm successively minimizes a sequence of surrogate functions g(β | β_k) majorizing the objective function f(β) around the current iterate β_k. See Figure 1. Forcing g(β | β_k) downhill automatically drives f(β) downhill as well [19, 22].
Every expectation-maximization (EM) algorithm [9] for maximum likelihood estimation is an MM algorithm. Majorization requires two conditions: tangency at the current iterate, g(β_k | β_k) = f(β_k), and domination, g(β | β_k) ≥ f(β) for all β ∈ Rⁿ. The iterates of the MM algorithm are defined by

\[ \beta_{k+1} := \operatorname{arg\,min}_\beta\; g(\beta \mid \beta_k), \]

although all that is absolutely necessary is that g(β_{k+1} | β_k) < g(β_k | β_k). Whenever this holds, the descent property

\[ f(\beta_{k+1}) \le g(\beta_{k+1} \mid \beta_k) \le g(\beta_k \mid \beta_k) = f(\beta_k) \]

follows. This simple principle is widely applicable and converts many hard optimization problems (non-convex or non-smooth) into a sequence of simpler problems.

To majorize the objective (4), it suffices to majorize each distance penalty dist(β, Cᵢ)². The majorization dist(β, Cᵢ)² ≤ ‖β − P_{Cᵢ}(β_k)‖²₂ is an immediate consequence of the definitions of the set distance dist(β, Cᵢ)² and the projection operator P_{Cᵢ}(β) [8]. The surrogate function

\[ g(\beta \mid \beta_k) = \frac{1}{2}\sum_i v_i \|\beta - P_{C_i}(\beta_k)\|_2^2 - \frac{1}{m}\sum_{j=1}^m L(\beta \mid y_j, x_j) \]

has gradient

\[ \nabla g(\beta \mid \beta_k) = \sum_i v_i[\beta - P_{C_i}(\beta_k)] - \frac{1}{m}\sum_{j=1}^m \nabla L(\beta \mid y_j, x_j) \]

and second differential

\[ d^2 g(\beta \mid \beta_k) = \Big(\sum_i v_i\Big) I_n - \frac{1}{m}\sum_{j=1}^m d^2 L(\beta \mid y_j, x_j) := H_k. \tag{5} \]

The score ∇L(β | yⱼ, xⱼ) and information −d²L(β | yⱼ, xⱼ) appear in equation (2). Note that for GLMs under canonical link, the observed and expected information matrices coincide, and their common value is thus positive semidefinite.
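A sketch of the surrogate's gradient and Hessian from equation (5), assuming each constraint is given by its projection map (all names, the logistic likelihood, and the nonnegativity constraint here are our own illustration):

```python
import numpy as np

def surrogate_grad_hess(beta, beta_k, y, X, psi_p, psi_pp, projections, weights):
    """Gradient and Hessian H_k of the MM surrogate g(. | beta_k) in equation (5)."""
    m, n = X.shape
    eta = X @ beta
    grad = sum(v * (beta - P(beta_k)) for v, P in zip(weights, projections))
    grad = grad - X.T @ (y - psi_p(eta)) / m
    hess = sum(weights) * np.eye(n) + (X * psi_pp(eta)[:, None]).T @ X / m
    return grad, hess

# Illustrative call: logistic likelihood with a nonnegativity constraint.
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.binomial(1, 0.5, size=30).astype(float)
beta0 = rng.normal(size=4)
grad_k, H_k = surrogate_grad_hess(beta0, beta0, y, X, sigmoid,
                                  lambda t: sigmoid(t) * (1.0 - sigmoid(t)),
                                  [lambda b: np.maximum(b, 0.0)], [2.0])
```

Because ψ″ ≥ 0, the eigenvalues of H_k are bounded below by Σᵢ vᵢ, which is the identity-shift safeguard the text goes on to discuss.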
Adding a multiple of the identity Iₙ to the information matrix is analogous to the Levenberg-Marquardt maneuver against ill-conditioning in ordinary regression [26]. Our algorithm therefore naturally benefits from this safeguard. Since solving the stationarity equation ∇g(β | β_k) = 0 is not analytically feasible in general, we employ one step of Newton's method in the form

\[ \beta_{k+1} = \beta_k - \eta_k\, d^2 g(\beta_k \mid \beta_k)^{-1} \nabla f(\beta_k), \]

where η_k ∈ (0, 1] is a stepsize multiplier chosen via backtracking. Note here our application of the gradient identity ∇f(β_k) = ∇g(β_k | β_k), valid for all smooth surrogate functions. Because the Newton increment is a descent direction, some value of η_k is bound to produce a decrease in the surrogate and therefore in the objective. The following theorem, proved in the Supplement, establishes global convergence of our algorithm under simple Armijo backtracking for choosing η_k:

Theorem 3.1 Consider the algorithm map

\[ M(\beta) = \beta - \eta_\beta H(\beta)^{-1}\nabla f(\beta), \]

where the step size η_β has been selected by Armijo backtracking. Assume that f(β) is coercive in the sense lim_{‖β‖→∞} f(β) = +∞. Then the limit points of the sequence β_{k+1} = M(β_k) are stationary points of f(β). Moreover, the set of limit points is compact and connected.

We remark that stationary points are necessarily global minimizers when f(β) is convex. Furthermore, coercivity of f(β) is a very mild assumption, and is satisfied whenever either the distance penalty or the negative log-likelihood is coercive. For instance, the negative log-likelihoods of the Poisson and Gaussian distributions are coercive functions.
While this is not the case for the Bernoulli distribution, adding a small ℓ2 penalty ω‖β‖²₂ restores coerciveness. Including such a penalty in logistic regression is a common remedy for the well-known numerical instability in parameter estimates caused by a poorly conditioned design matrix X [27]. Since L(β) is concave in β, the compactness of one or more of the constraint sets Cᵢ is another sufficient condition for coerciveness.

Generalization to Bregman divergences: Although we have focused on penalizing GLM likelihoods with Euclidean distance penalties, this approach holds more generally for objectives containing non-Euclidean measures of distance. As reviewed in the Supplement, the Bregman divergence D_φ(v, u) = φ(v) − φ(u) − dφ(u)(v − u) generated by a convex function φ(v) provides a general notion of directed distance [4]. The Bregman divergence associated with the choice φ(v) = ½‖v‖²₂, for instance, is the squared Euclidean distance.
One can rewrite the GLM penalized likelihood as a sum of multiple Bregman divergences

\[ f(\beta) = \sum_i v_i D_\phi\big[P^\phi_{C_i}(\beta), \beta\big] + \sum_{j=1}^m w_j D_\zeta\big[y_j, \tilde{h}_j(\beta)\big]. \tag{6} \]

The first sum in equation (6) represents the distance penalty to the constraint sets Cᵢ. The projection P^φ_{Cᵢ}(β) denotes the closest point to β in Cᵢ measured under D_φ. The second sum generalizes the GLM log-likelihood term, where h̃ⱼ(β) = h⁻¹(xⱼᵗβ). Every exponential family likelihood uniquely corresponds to a Bregman divergence D_ζ generated by the conjugate of its cumulant function, ζ = ψ* [28]. Hence, −L(β | y, X) is proportional to Σ_{j=1}^m D_ζ[yⱼ, h⁻¹(xⱼᵗβ)]. The functional form (6) immediately broadens the class of objectives to include quasi-likelihoods and distances to constraint sets measured under a broad range of divergences. Objective functions of this form are closely related to proximity function minimization in the convex feasibility literature [5, 6, 7, 33].

Algorithm 1 MM algorithm to solve distance-penalized objective (4)
1: Initialize k = 0, starting point β₀, initial step size α ∈ (0, 1), and halving parameter σ ∈ (0, 1)
2: repeat
3:   ∇f_k ← Σᵢ vᵢ[β_k − P_{Cᵢ}(β_k)] − (1/m) Σⱼ ∇L(β_k | yⱼ, xⱼ)
4:   H_k ← (Σᵢ vᵢ) Iₙ − (1/m) Σⱼ d²L(β_k | yⱼ, xⱼ)
5:   v ← −H_k⁻¹ ∇f_k
6:   η ← 1
7:   while f(β_k + ηv) > f(β_k) + αη ∇f_kᵗ v do
8:     η ← ση
9:   end while
10:  β_{k+1} ← β_k + ηv
11:  k ← k + 1
12: until convergence
The MM principle makes possible the extension of the projection algorithms of [7] to minimize this general objective. Our MM algorithm for distance penalized GLM regression is summarized in Algorithm 1. Although for the sake of clarity the algorithm is written for vector-valued arguments, it holds more generally for matrix-variate regression. In this setting the regression coefficients B and predictors Xᵢ are matrix-valued, and the response yᵢ has mean h[trace(XᵢᵗB)] = h[vec(Xᵢ)ᵗ vec(B)]. Here the vec operator stacks the columns of its matrix argument. Thus, the algorithm immediately applies if we replace B by vec(B) and X₁, …, X_m by X = [vec(X₁), …, vec(X_m)]ᵗ. Projections requiring the matrix structure are performed by reshaping vec(B) into matrix form. In contrast to shrinkage approaches, these maneuvers obviate the need for new algorithms in matrix regression [37].

Acceleration: Here we mention two modifications of the MM algorithm that translate to large practical differences in computational cost. Inverting the n-by-n matrix d²g(β_k | β_k) naively requires O(n³) flops. When the number of cases m ≪ n, invoking the Woodbury formula requires solving a substantially smaller m × m linear system at each iteration. This computational savings is crucial in the analysis of the EEG data of Section 4. The Woodbury formula says

\[ (vI_n + UV)^{-1} = v^{-1}I_n - v^{-2}U\big(I_m + v^{-1}VU\big)^{-1}V \]

when U and V are n × m and m × n matrices, respectively. Inspection of equations (2) and (5) shows that d²g(β_k | β_k) takes the required form. Under Woodbury's formula the dominant computation is the matrix-matrix product V U, which requires only O(nm²) flops. The second modification to the MM algorithm is quasi-Newton acceleration.
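The Woodbury identity quoted above is easy to check numerically; a brief sketch (dimensions and values are our own illustration):

```python
import numpy as np

# Check (v I_n + U V)^{-1} = v^{-1} I_n - v^{-2} U (I_m + v^{-1} V U)^{-1} V,
# which trades an n-by-n inversion for an m-by-m one when m << n.
rng = np.random.default_rng(2)
n, m, v = 40, 5, 3.0
U = rng.normal(size=(n, m))
V = rng.normal(size=(m, n))
direct = np.linalg.inv(v * np.eye(n) + U @ V)                              # O(n^3)
woodbury = np.eye(n) / v - (U @ np.linalg.inv(np.eye(m) + (V @ U) / v) @ V) / v**2
```

In the algorithm itself one would of course solve linear systems with the right-hand side of interest rather than form explicit inverses.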
This technique exploits secant approximations derived from iterates of the algorithm map to approximate the differential of the map. As few as two secant approximations can lead to orders-of-magnitude reductions in the number of iterations until convergence. We refer the reader to [36] for a detailed description of quasi-Newton acceleration and a summary of its performance on various high-dimensional problems.

4 Results and performance

We first compare the performance of our distance penalization method to leading shrinkage methods in sparse regression. Our simulations involve a sparse length n = 2000 coefficient vector β with 10 nonzero entries. Nonzero coefficients have uniformly random effect sizes. The entries of the design matrix X are N(0, 0.1) Gaussian random deviates. We then recover β from undersampled responses yⱼ following Poisson and Bernoulli distributions with canonical links.

Figure 2: The left figure displays relative errors among nonzero predictors in underdetermined Poisson and logistic regression with m = 1000 cases. It is clear that LASSO suffers the most shrinkage and bias, while MM appears to outperform MCP and SCAD. The right figure displays MSE as a function of m, favoring MM most notably for logistic regression.

Figure 2 compares solutions obtained using our distance penalties (MM) to those obtained under MCP, SCAD, and LASSO penalties. Relative errors (left) with m = 1000 cases clearly show that LASSO suffers the most shrinkage and bias; MM seems to outperform MCP and SCAD. For a more detailed comparison, the right side of the figure plots mean squared error (MSE) as a function of the number of cases averaged over 50 trials. All methods significantly outperform LASSO, which is omitted for scale, with MM achieving lower MSE than competitors, most noticeably in logistic regression.
As suggested by an anonymous reviewer, similar results from additional experiments for Gaussian (linear) regression with comparison to the relaxed LASSO are included in the Supplement.

(a) Sparsity constraint (b) Regularize ‖B‖* (c) Restrict rank(B) = 2 (d) Vary rank(B) = 1, …, 8

Figure 3: True B₀ in the top left of each set of 9 images has rank 2. The other 8 images in (a)-(c) display solutions as the noise level ε varies over the set {0, 0.1, …, 0.7}. Figure (a) applies our MM algorithm with sparsity rather than rank constraints to illustrate how failing to account for matrix structure misses the true signal; Zhou and Li [37] report similar findings comparing spectral regularization to ℓ1 regularization. Figure (b) performs spectral shrinkage [37] and displays solutions under optimal λ values via BIC, while (c) uses our MM algorithm restricting rank(B) = 2. Figure (d) fixes ε = 0.1 and uses MM with rank(B) ∈ {1, …, 8} to illustrate robustness to rank over-specification.

For underdetermined matrix regression, we compare to the spectral regularization method developed by Zhou and Li [37]. We generate their cross-shaped 32 × 32 true signal B₀ and in all trials sample m = 300 responses yᵢ ∼ N[tr(XᵢᵗB), ε]. Here the design tensor X is generated with standard normal entries. Figure 3 demonstrates that imposing sparsity alone fails to recover B₀ and that rank-set projections visibly outperform spectral norm shrinkage as ε varies. The rightmost panel also shows that our method is robust to over-specification of the rank of the true signal to an extent.

We consider two real datasets. We apply our method to count data of global temperature anomalies relative to the 1961-1990 average, collected by the Climate Research Unit [17].
We assume a non-decreasing solution, illustrating an instance of isotonic regression. The fitted solution displayed in Figure 4 has mean squared error 0.009, clearly obeys the isotonic constraint, and is consistent with that obtained on a previous version of the data [32]. We next focus on rank constrained matrix regression for electroencephalography (EEG) data, collected by [35] to study the association between alcoholism and voltage patterns over times and channels. The study consists of 77 individuals with alcoholism and 45 controls, providing 122 binary responses yᵢ indicating whether subject i has alcoholism. The EEG measurements are contained in 256 × 64 predictor matrices Xᵢ; the dimension is thus greater than 16,000. Further details about the data appear in the Supplement.

Figure 4: The leftmost plot shows our isotonic fit to temperature anomaly data [17]. The right figures display the estimated coefficient matrix B on EEG alcoholism data using distance penalization, nuclear norm shrinkage [37], and LASSO shrinkage, respectively.

Previous studies apply dimension reduction [21] and propose algorithms to seek the optimal rank 1 solution [16]. These methods could not handle the size of the original data directly, and the spectral shrinkage approach proposed in [37] is the first to consider the full EEG data. Figure 4 shows that our regression solution is qualitatively similar to that obtained under nuclear norm penalization [37], revealing similar time-varying patterns among channels 20-30 and 50-60. In contrast, ignoring matrix structure and penalizing the ℓ1 norm of B yields no useful information, consistent with findings in [37].
However, our distance penalization approach achieves a lower misclassification error of 0.1475. The lowest misclassification rate reported in previous analyses is 0.139 by [16]. As their approach is strictly more restrictive than ours in seeking a rank 1 solution, we agree with [37] in concluding that their lower misclassification error can be largely attributed to benefits from data preprocessing and dimension reduction. While not visually distinguishable, we also note that shrinking the eigenvalues via nuclear norm penalization [37] fails to produce a low-rank solution on this dataset.

We omit detailed timing comparisons throughout, since the various methods were run across platforms and depend heavily on implementation. We note that MCP regression relies on the MM principle, and the LQA and LLA algorithms used to fit models with SCAD penalties are also instances of MM algorithms [11]. Almost all MM algorithms share an overall linear rate of convergence. While these require several seconds of compute time on a standard laptop machine, coordinate-descent implementations of LASSO outstrip our algorithm in terms of computational speed. Our MM algorithm required 31 seconds to converge on the EEG data, the largest example we considered.

5 Discussion

GLM regression is one of the most widely employed tools in statistics and machine learning. Imposing constraints upon the solution is integral to parameter estimation in many settings. This paper considers GLM regression under distance-to-set penalties when seeking a constrained solution. Such penalties allow a flexible range of constraints, and are competitive with standard shrinkage methods for sparse and low-rank regression in high dimensions. The MM principle yields a reliable solution method with theoretical guarantees and strong empirical results over a number of practical examples.
These examples emphasize promising performance under non-convex constraints, and demonstrate how distance penalization avoids the disadvantages of shrinkage approaches.

Several avenues for future work may be pursued. The primary computational bottleneck we face is matrix inversion, which limits the algorithm when faced with extremely large, high-dimensional datasets. Further improvements may be possible using modifications of the algorithm tailored to specific problems, such as coordinate or block descent variants. Since the linear systems encountered in our parameter updates are well conditioned, a conjugate gradient algorithm may be preferable to direct methods of solution in such cases. The updates within our algorithm can be recast as weighted least squares minimization, and a re-examination of this classical problem may suggest even better iterative solvers. As the methods apply to a generalized objective composed of multiple Bregman divergences, it will be fruitful to study penalties under alternate measures of distance, and settings beyond GLM regression such as quasi-likelihood estimation.

While our experiments primarily compare against shrinkage approaches, an anonymous referee points us to recent work revisiting best subset selection using modern advances in mixed integer optimization [3]. These exciting developments make best subset regression feasible for much larger problems than previously thought possible. As [3] focus on the linear case, it is of interest to consider how the ideas in this paper may offer extensions to GLMs, and to compare the performance of such generalizations. Best subsets constitutes a gold standard for sparse estimation in the noiseless setting; whether it outperforms shrinkage methods seems to depend on the noise level and is a topic of much recent discussion [15, 23].
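In the sparse linear (Gaussian) case, the distance-penalty strategy discussed in this paper reduces to a simple iteration: majorize the squared distance to the set of k-sparse vectors by the squared distance to the current projection, solve the resulting ridge-like linear system, and slowly inflate the penalty ρ. A minimal sketch (the function names, the geometric ρ schedule, and the toy data are our own illustration, not the authors' implementation):

```python
import numpy as np

def project_sparse(beta, k):
    """Keep the k largest-magnitude entries of beta and zero the rest:
    the Euclidean projection onto the set of k-sparse vectors."""
    out = np.zeros_like(beta)
    idx = np.argsort(np.abs(beta))[-k:]
    out[idx] = beta[idx]
    return out

def distance_penalized_ls(X, y, k, rho=1.0, rho_mult=1.2, iters=200):
    """MM iteration for 0.5*||y - X b||^2 + (rho/2)*dist(b, k-sparse)^2.
    Each update solves (X'X + rho I) b = X'y + rho * P(b_old)."""
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.zeros(p)
    for _ in range(iters):
        proj = project_sparse(beta, k)
        beta = np.linalg.solve(XtX + rho * np.eye(p), Xty + rho * proj)
        rho *= rho_mult  # gradually tighten the constraint
    return project_sparse(beta, k)

# toy demo: recover a 2-sparse coefficient vector
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[2], beta_true[7] = 3.0, -2.0
y = X @ beta_true + 0.1 * rng.standard_normal(100)
est = distance_penalized_ls(X, y, k=2)
print(np.nonzero(est)[0])  # support of the estimate
```

Because the penalty vanishes on the constraint set, the surviving coefficients are not shrunk toward zero, in contrast to LASSO-type estimates.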
Finally, these studies as well as our present paper focus on estimation, and it will be fruitful to examine variable selection properties in future work. Recent work demonstrates an inevitable trade-off between false and true positives under LASSO shrinkage in the linear sparsity regime [30]. The authors show that this need not be the case with ℓ0 methods, remarking that computationally efficient methods that also enjoy good model performance would be highly desirable, as ℓ0 and ℓ1 approaches each possess one property but not the other [30]. Our results suggest that distance penalties, together with the MM principle, enjoy benefits from both worlds on a number of statistical tasks.

Acknowledgements: We would like to thank Hua Zhou for helpful discussions about matrix regression and the EEG data. JX was supported by NSF MSPRF #1606177.

References

[1] Barlow, R. E., Bartholomew, D. J., Bremner, J., and Brunk, H. D. Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York, 1972.

[2] Becker, M. P., Yang, I., and Lange, K. EM algorithms without missing data. Statistical Methods in Medical Research, 6:38–54, 1997.

[3] Bertsimas, D., King, A., and Mazumder, R. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.

[4] Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[5] Byrne, C. and Censor, Y. Proximity function minimization using multiple Bregman projections, with applications to split feasibility and Kullback–Leibler distance minimization. Annals of Operations Research, 105(1-4):77–98, 2001.

[6] Censor, Y. and Elfving, T.
A multiprojection algorithm using Bregman projections in a product space. Numerical Algorithms, 8(2):221–239, 1994.

[7] Censor, Y., Elfving, T., Kopf, N., and Bortfeld, T. The multiple-sets split feasibility problem and its applications for inverse problems. Inverse Problems, 21(6):2071–2084, 2005.

[8] Chi, E. C., Zhou, H., and Lange, K. Distance majorization and its applications. Mathematical Programming Series A, 146(1-2):409–436, 2014.

[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), pages 1–38, 1977.

[10] Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[11] Fan, J. and Lv, J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101, 2010.

[12] Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[13] Gaines, B. R. and Zhou, H. Algorithms for fitting the constrained lasso. arXiv preprint arXiv:1611.01511, 2016.

[14] Golub, G. H. and Van Loan, C. F. Matrix Computations, volume 3. JHU Press, 2012.

[15] Hastie, T., Tibshirani, R., and Tibshirani, R. J. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.

[16] Hung, H. and Wang, C.-C. Matrix variate logistic regression model with application to EEG data. Biostatistics, 14(1):189–202, 2013.

[17] Jones, P., Parker, D., Osborn, T., and Briffa, K. Global and hemispheric temperature anomalies – land and marine instrumental records. Trends: A Compendium of Data on Global Change, 2016.

[18] Lange, K., Hunter, D.
R., and Yang, I. Optimization transfer using surrogate objective functions (with discussion). Journal of Computational and Graphical Statistics, 9:1–20, 2000.

[19] Lange, K. MM Optimization Algorithms. SIAM, 2016.

[20] Lange, K. and Keys, K. L. The proximal distance algorithm. arXiv preprint arXiv:1507.07598, 2015.

[21] Li, B., Kim, M. K., and Altman, N. On dimension folding of matrix- or array-valued statistical objects. The Annals of Statistics, pages 1094–1121, 2010.

[22] Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

[23] Mazumder, R., Radchenko, P., and Dedieu, A. Subset selection with shrinkage: Sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288, 2017.

[24] McCullagh, P. and Nelder, J. A. Generalized Linear Models, volume 37. CRC Press, 1989.

[25] Meinshausen, N. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.

[26] Moré, J. J. The Levenberg-Marquardt algorithm: Implementation and theory. In Numerical Analysis, pages 105–116. Springer, 1978.

[27] Park, M. Y. and Hastie, T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Methodological), 69(4):659–677, 2007.

[28] Polson, N. G., Scott, J. G., and Willard, B. T. Proximal algorithms in statistics and machine learning. Statistical Science, 30(4):559–581, 2015.

[29] Richard, E., Savalle, P.-A., and Vayatis, N. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1351–1358, 2012.

[30] Su, W., Bogdan, M., and Candès, E. False discoveries occur early on the lasso path. The Annals of Statistics, 45(5), 2017.

[31] Tibshirani, R.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267–288, 1996.

[32] Wu, W. B., Woodroofe, M., and Mentz, G. Isotonic regression: Another look at the changepoint problem. Biometrika, pages 793–804, 2001.

[33] Xu, J., Chi, E. C., Yang, M., and Lange, K. A majorization-minimization algorithm for split feasibility problems. arXiv preprint arXiv:1612.05614, 2017.

[34] Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[35] Zhang, X. L., Begleiter, H., Porjesz, B., Wang, W., and Litke, A. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538, 1995.

[36] Zhou, H., Alexander, D., and Lange, K. A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21:261–273, 2011.

[37] Zhou, H. and Li, L. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Methodological), 76(2):463–483, 2014.

[38] Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Methodological), 67(2):301–320, 2005.