{"title": "Scalable nonconvex inexact proximal splitting", "book": "Advances in Neural Information Processing Systems", "page_first": 530, "page_last": 538, "abstract": "We study large-scale, nonsmooth, nonconconvex optimization problems. In particular, we focus on nonconvex problems with \\emph{composite} objectives. This class of problems includes the extensively studied convex, composite objective problems as a special case. To tackle composite nonconvex problems, we introduce a powerful new framework based on asymptotically \\emph{nonvanishing} errors, avoiding the common convenient assumption of eventually vanishing errors. Within our framework we derive both batch and incremental nonconvex proximal splitting algorithms. To our knowledge, our framework is first to develop and analyze incremental \\emph{nonconvex} proximal-splitting algorithms, even if we disregard the ability to handle nonvanishing errors. We illustrate our theoretical framework by showing how it applies to difficult large-scale, nonsmooth, and nonconvex problems.", "full_text": "Scalable nonconvex inexact proximal splitting\n\nSuvrit Sra\n\nMax Planck Institute for Intelligent Systems\n\n72076 T\u00a8ubigen, Germany\nsuvrit@tuebingen.mpg.de\n\nAbstract\n\nWe study a class of large-scale, nonsmooth, and nonconvex optimization prob-\nlems. In particular, we focus on nonconvex problems with composite objectives.\nThis class includes the extensively studied class of convex composite objective\nproblems as a subclass. To solve composite nonconvex problems we introduce a\npowerful new framework based on asymptotically nonvanishing errors, avoiding\nthe common stronger assumption of vanishing errors. Within our new framework\nwe derive both batch and incremental proximal splitting algorithms. To our knowl-\nedge, our work is \ufb01rst to develop and analyze incremental nonconvex proximal-\nsplitting algorithms, even if we were to disregard the ability to handle nonvanish-\ning errors. 
We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization.\n\n1 Introduction\n\nThis paper focuses on nonconvex composite objective problems of the form\n\nminimize Φ(x) := f(x) + h(x),  x ∈ X,  (1)\n\nwhere f : Rn → R is continuously differentiable, h : Rn → R ∪ {∞} is lower semi-continuous (lsc) and convex (possibly nonsmooth), and X is a compact convex set. We also make the common assumption that ∇f is locally (in X) Lipschitz continuous, i.e., there is a constant L > 0 such that\n\n‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖  for all x, y ∈ X.  (2)\n\nProblem (1) is a natural but far-reaching generalization of composite objective convex problems, which enjoy tremendous importance in machine learning; see e.g., [2, 3, 11, 34]. Although convex formulations are extremely useful, for many difficult problems a nonconvex formulation is natural. Familiar examples include matrix factorization [20, 23], blind deconvolution [19], dictionary learning [18, 23], and neural networks [4, 17].\n\nThe primary contribution of this paper is theoretical. Specifically, we present a new algorithmic framework: Nonconvex Inexact Proximal Splitting (NIPS). Our framework solves (1) by "splitting" the task into smooth (gradient) and nonsmooth (proximal) parts. Beyond splitting, the most notable feature of NIPS is that it allows computational errors. This capability proves critical to obtaining a scalable, incremental-gradient variant of NIPS, which, to our knowledge, is the first incremental proximal-splitting method for nonconvex problems.\n\nNIPS further distinguishes itself in how it models computational errors. 
Notably, it does not require the errors to vanish in the limit, which is a more realistic assumption, as one often has limited or no control over the computational errors inherent to a complex system. In accord with the errors, NIPS also does not require stepsizes (learning rates) to shrink to zero. In contrast, most incremental-gradient methods [5] and stochastic gradient algorithms [16] do assume that the computational errors and stepsizes decay to zero. We do not make these simplifying assumptions, which complicates the convergence analysis a bit, but results in perhaps a more satisfying description.\n\nOur analysis builds on the remarkable work of Solodov [29], who studied the simpler setting of differentiable nonconvex problems (which corresponds to h ≡ 0 in (1)). NIPS is strictly more general: unlike [29] it solves a non-differentiable problem by allowing a nonsmooth regularizer h ≢ 0, and this h is tackled by invoking proximal splitting [8].\n\nProximal splitting has proved to be exceptionally fruitful and effective [2, 3, 8, 11]. It retains the simplicity of gradient projection while handling the nonsmooth regularizer h via its proximity operator. This approach is especially attractive because for several important choices of h, efficient implementations of the associated proximity operators exist [2, 22, 23]. For convex problems, an alternative to proximal splitting is the subgradient method; similarly, for nonconvex problems one may use a generalized subgradient method [7, 12]. However, as in the convex case, the use of subgradients has drawbacks: it fails to exploit the composite structure, and even when using sparsity-promoting regularizers it does not generate intermediate sparse iterates [11].\n\nAmong batch nonconvex splitting methods, an early paper is [14]. 
More recently, in his pioneering paper on convex composite minimization, Nesterov [26] also briefly discussed nonconvex problems. Both [14] and [26], however, enforced monotonic descent in the objective value to ensure convergence. Very recently, Attouch et al. [1] have introduced a generic method for nonconvex nonsmooth problems based on Kurdyka-Łojasiewicz theory, but their entire framework too hinges on descent. A method that uses a nonmonotone line-search to eliminate dependence on strict descent is [13].\n\nIn general, the insistence on strict descent and exact gradients makes many of these methods unsuitable for incremental, stochastic, or online variants, all of which usually lead to nonmonotone objective values, especially due to inexact gradients. Among nonmonotonic methods that apply to (1), we are aware of the generalized gradient-type algorithms of [31] and the stochastic generalized gradient methods of [12]. Both methods, however, are analogous to the usual subgradient-based algorithms that fail to exploit the composite objective structure, unlike proximal-splitting methods.\n\nBut proximal-splitting methods do not apply out-of-the-box to (1): nonconvexity raises significant obstructions, especially because nonmonotonic descent in the objective function values is allowed and inexact gradients might be used. Overcoming these obstructions to achieve a scalable, non-descent-based method that allows inexact gradients is what makes our NIPS framework novel.\n\n2 The NIPS Framework\n\nTo simplify presentation, we replace h by the penalty function\n\ng(x) := h(x) + δ(x|X),  (3)\n\nwhere δ(·|X) is the indicator function for X: δ(x|X) = 0 for x ∈ X, and δ(x|X) = ∞ for x ∉ X. With this notation, we may rewrite (1) as the unconstrained problem\n\nmin_{x∈Rn} Φ(x) := f(x) + g(x),  (4)\n\nand this particular formulation is our primary focus. 
We solve (4) via a proximal-splitting approach, so let us begin by defining our most important component.\n\nDefinition 1 (Proximity operator). Let g : Rn → R be an lsc, convex function. The proximity operator for g, indexed by η > 0, is the nonlinear map [see e.g., 28; Def. 1.22]:\n\nP^g_η : y ↦ argmin_{x∈Rn} ( g(x) + (1/2η)‖x − y‖² ).  (5)\n\nThe operator (5) was introduced by Moreau [24] (1962) as a generalization of orthogonal projections. It is also key to Rockafellar's classic proximal point algorithm [27], and it arises in a host of proximal-splitting methods [2, 3, 8, 11], most notably in forward-backward splitting (FBS) [8].\n\nFBS is particularly attractive because of its simplicity and algorithmic structure. It minimizes convex composite objective functions by alternating between "forward" (gradient) steps and "backward" (proximal) steps. Formally, suppose f in (4) is convex; for such f, FBS performs the iteration\n\nxk+1 = P^g_ηk(xk − ηk∇f(xk)),  k = 0, 1, . . . ,  (6)\n\nwhere {ηk} is a suitable sequence of stepsizes. The usual convergence analysis of FBS is intimately tied to convexity of f. Therefore, to tackle nonconvex f we must take a different approach. As previously mentioned, such approaches were considered by Fukushima and Mine [14] and Nesterov [26], but both proved convergence by enforcing monotonic descent.\n\nThis insistence on descent severely impedes scalability. Thus, the key challenge is: how to retain the algorithmic simplicity of FBS and allow nonconvex losses, without sacrificing scalability?\n\nWe address this challenge by introducing the following inexact proximal-splitting iteration:\n\nxk+1 = P^g_ηk(xk − ηk∇f(xk) + ηk e(xk)),  k = 0, 1, . . . ,  (7)\n\nwhere e(xk) models the computational errors in computing the gradient ∇f(xk). 
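The proximity operator (5) and the FBS iteration (6) are easy to realize in code. The sketch below is our own minimal illustration (not from the paper): it uses the classical soft-thresholding prox for g(x) = λ‖x‖₁ and runs FBS on a tiny convex least-squares instance; all names and the toy data are illustrative assumptions.

```python
import numpy as np

def prox_l1(y, eta, lam=1.0):
    # Proximity operator (5) for g(x) = lam*||x||_1:
    # argmin_x lam*||x||_1 + (1/(2*eta))*||x - y||^2,
    # which is elementwise soft-thresholding.
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)

def fbs(grad_f, prox_g, x0, eta, iters=500):
    # Forward-backward splitting (6): a forward (gradient) step on f
    # followed by a backward (proximal) step on g, constant stepsize eta.
    x = x0
    for _ in range(iters):
        x = prox_g(x - eta * grad_f(x), eta)
    return x

# Tiny convex instance: f(x) = 0.5*||Ax - b||^2, g(x) = ||x||_1.
A = np.diag([2.0, 1.0])
b = np.array([4.0, 0.1])
x_star = fbs(lambda x: A.T @ (A @ x - b), prox_l1, np.zeros(2), eta=0.2)
```

Here eta = 0.2 satisfies the usual FBS requirement η < 2/L with L = ‖AᵀA‖ = 4; the iterates converge to the lasso solution (1.75, 0) of this toy problem.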
We also assume that for η > 0 smaller than some stepsize η̄, the computational error is uniformly bounded, that is,\n\nη‖e(x)‖ ≤ ε̄  for some fixed error level ε̄ ≥ 0, and ∀x ∈ X.  (8)\n\nCondition (8) is weaker than the typical vanishing-error requirements\n\n∑_k η‖e(xk)‖ < ∞,  lim_{k→∞} η‖e(xk)‖ = 0,\n\nwhich are stipulated by most analyses of methods with gradient errors [4, 5]. Obviously, since errors are nonvanishing, exact stationarity cannot be guaranteed. We will, however, show that the iterates produced by (7) do progress towards reasonable inexact stationary points. We note in passing that even if we assume the simpler case of vanishing errors, NIPS is still the first nonconvex proximal-splitting framework that does not insist on monotonicity, which complicates convergence analysis but ultimately proves crucial to scalability.\n\nAlgorithm 1 Inexact Nonconvex Proximal Splitting (NIPS)\nInput: Operator P^g_η, and a sequence {ηk} satisfying\n\nc ≤ lim inf_k ηk,  lim sup_k ηk ≤ min{1, 2/L − c},  0 < c < 1/L.  (9)\n\nOutput: Approximate solution to (7)\nk ← 0; select arbitrary x0 ∈ X\nwhile not converged do\n  Compute approximate gradient ∇̃f(xk) := ∇f(xk) − e(xk)\n  Update: xk+1 = P^g_ηk(xk − ηk∇̃f(xk))\n  k ← k + 1\nend while\n\n2.1 Convergence analysis\n\nWe begin by characterizing inexact stationarity. A point x* is a stationary point for (4) if and only if it satisfies the inclusion (10), where ∂_C denotes the Clarke subdifferential [7]. 
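To make iteration (7) and Algorithm 1 concrete, here is a minimal Python sketch of our own (not the paper's implementation): the nonconvex smooth loss, the ℓ1 regularizer, and the simulated bounded perturbation are all illustrative assumptions. Note that the injected error is bounded but never driven to zero, and the stepsize stays bounded away from zero, in the spirit of conditions (8) and (9).

```python
import numpy as np

def prox_l1(y, eta, lam=0.1):
    # prox of lam*||.||_1 (soft-thresholding); plays the role of P^g_eta.
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)

def grad_f(x):
    # Nonconvex smooth loss f(x) = sum_i x_i^2 / (1 + x_i^2);
    # its gradient 2x/(1+x^2)^2 is Lipschitz with L <= 2.
    return 2.0 * x / (1.0 + x ** 2) ** 2

def nips(x0, eta=0.4, iters=200, err_bound=0.0, seed=0):
    # Inexact proximal splitting (7): xk+1 = prox(xk - eta*(grad - e)),
    # where e is a bounded, *nonvanishing* gradient perturbation.
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(iters):
        e = err_bound * rng.uniform(-1.0, 1.0, size=x.shape)
        x = prox_l1(x - eta * (grad_f(x) - e), eta)
    return x

x_exact = nips(np.array([1.5, -2.0]))                  # exact gradients
x_noisy = nips(np.array([1.5, -2.0]), err_bound=0.05)  # nonvanishing errors
```

With exact gradients the iterates reach the stationary point 0; with the nonvanishing perturbation they merely remain in a small neighborhood of it, exactly the kind of inexact stationarity the analysis below quantifies.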
A brief exercise shows that the inclusion\n\n0 ∈ ∂_C Φ(x*) := ∇f(x*) + ∂g(x*)  (10)\n\nmay be equivalently recast as the fixed-point equation (which augurs the idea of proximal splitting)\n\nx* = P^g_η(x* − η∇f(x*)),  for η > 0.  (11)\n\nThis equation helps us define a measure of inexact stationarity: the proximal residual\n\nρ(x) := x − P^g_1(x − ∇f(x)).  (12)\n\nNote that for an exact stationary point x* the residual norm ‖ρ(x*)‖ = 0. Thus, we call a point x ε-stationary if for a prescribed error level ε(x), the corresponding residual norm satisfies\n\n‖ρ(x)‖ ≤ ε(x).  (13)\n\nAssuming the error level ε(x) (say if ε̄ = lim sup_k ε(xk)) satisfies the bound (8), we prove below that the iterates {xk} generated by (7) satisfy an approximate stationarity condition of the form (13), by allowing the stepsize η to become correspondingly small (but strictly bounded away from zero).\n\nWe start by recalling two basic facts, stated without proof as they are standard knowledge.\n\nLemma 2 (Lipschitz-descent [see e.g., 25; Lemma 2.1.3]). Let f ∈ C¹_L(X). Then,\n\n|f(x) − f(y) − ⟨∇f(y), x − y⟩| ≤ (L/2)‖x − y‖²,  ∀x, y ∈ X.  (14)\n\nLemma 3 (Nonexpansivity [see e.g., 9; Lemma 2.4]). The operator P^g_η is nonexpansive, that is,\n\n‖P^g_η(x) − P^g_η(y)‖ ≤ ‖x − y‖,  ∀x, y ∈ Rn.  (15)\n\nNext we prove a crucial monotonicity property that actually subsumes similar results for projection operators derived by Gafni and Bertsekas [15; Lem. 1], and may therefore be of independent interest.\n\nLemma 4 (Prox-Monotonicity). Let y, z ∈ Rn, and η > 0. 
Define the functions\n\np_g(η) := (1/η)‖P^g_η(y − ηz) − y‖,  and  q_g(η) := ‖P^g_η(y − ηz) − y‖.  (16)\n\nThen, p_g(η) is a decreasing function of η, and q_g(η) an increasing function of η.\n\nProof. Our proof exploits properties of Moreau envelopes [28; pp. 19, 52], and we present it in the language of proximity operators. Consider the "deflected" proximal objective\n\nm_g(x, η; y, z) := ⟨z, x − y⟩ + (1/2η)‖x − y‖² + g(x),  for some y, z ∈ X.  (17)\n\nAssociate to the objective m_g the deflected Moreau envelope\n\nE_g(η) := inf_{x∈X} m_g(x, η; y, z),  (18)\n\nwhose infimum is attained at the unique point P^g_η(y − ηz). Thus, E_g(η) is differentiable, and its derivative is given by E'_g(η) = −(1/2η²)‖P^g_η(y − ηz) − y‖² = −(1/2)p_g(η)². Since E_g is convex in η, E'_g is increasing ([28; Thm. 2.26]), or equivalently p_g(η) is decreasing. Similarly, define ê_g(γ) := E_g(1/γ); this function is concave in γ as it is a pointwise infimum (indexed by x) of functions linear in γ [see e.g., §3.2.3 in 6]. Thus, its derivative ê'_g(γ) = (1/2)‖P^g_{1/γ}(y − γ⁻¹z) − y‖² = (1/2)q_g(1/γ)² is a decreasing function of γ. Set η = 1/γ to conclude the argument about q_g(η).\n\nWe now proceed to bound the difference between objective function values from iteration k to k + 1, by developing a bound of the form\n\nΦ(xk) − Φ(xk+1) ≥ h(xk).  (19)\n\nObviously, since we do not enforce strict descent, h(xk) may be negative too. However, we show that for sufficiently large k the algorithm makes enough progress to ensure convergence.\n\nLemma 5. 
Let xk+1, xk, ηk, and X be as in (7), and assume that ηk‖e(xk)‖ ≤ ε(xk) holds. Then,\n\nΦ(xk) − Φ(xk+1) ≥ ((2 − Lηk)/(2ηk))‖xk+1 − xk‖² − (1/ηk)ε(xk)‖xk+1 − xk‖.  (20)\n\nProof. For the deflected Moreau envelope (17), consider the directional derivative dm_g with respect to x in the direction w; at x = xk+1, this derivative satisfies the optimality condition\n\ndm_g(xk+1, η; y, z)(w) = ⟨z + η⁻¹(xk+1 − y) + sk+1, w⟩ ≥ 0,  sk+1 ∈ ∂g(xk+1).  (21)\n\nSet z = ∇f(xk) − e(xk), y = xk, and w = xk − xk+1 in (21), and rearrange to obtain\n\n⟨∇f(xk) − e(xk), xk+1 − xk⟩ ≤ ⟨ηk⁻¹(xk+1 − xk) + sk+1, xk − xk+1⟩.  (22)\n\nFrom Lemma 2 it follows that\n\nΦ(xk+1) ≤ f(xk) + ⟨∇f(xk), xk+1 − xk⟩ + (L/2)‖xk+1 − xk‖² + g(xk+1),  (23)\n\nwhereby upon adding and subtracting e(xk), and then using (22) we further obtain\n\nf(xk) + ⟨∇f(xk) − e(xk), xk+1 − xk⟩ + (L/2)‖xk+1 − xk‖² + g(xk+1) + ⟨e(xk), xk+1 − xk⟩\n ≤ f(xk) + g(xk+1) + ⟨sk+1, xk − xk+1⟩ + (L/2 − 1/ηk)‖xk+1 − xk‖² + ⟨e(xk), xk+1 − xk⟩\n ≤ f(xk) + g(xk) − ((2 − Lηk)/(2ηk))‖xk+1 − xk‖² + ⟨e(xk), xk+1 − xk⟩\n ≤ Φ(xk) − ((2 − Lηk)/(2ηk))‖xk+1 − xk‖² + ‖e(xk)‖‖xk+1 − xk‖\n ≤ Φ(xk) − ((2 − Lηk)/(2ηk))‖xk+1 − xk‖² + (1/ηk)ε(xk)‖xk+1 − xk‖.\n\nThe second 
inequality above follows from convexity of g, the third one from Cauchy-Schwarz, and the last one by the assumption on ε(xk). Now flip signs and apply (23) to conclude the bound (20).\n\nNext we further bound (20) by deriving two-sided bounds on ‖xk+1 − xk‖.\n\nLemma 6. Let xk+1, xk, and ε(xk) be as before; also let c and ηk satisfy (9). Then,\n\nc‖ρ(xk)‖ − ε(xk) ≤ ‖xk+1 − xk‖ ≤ ‖ρ(xk)‖ + ε(xk).  (24)\n\nProof. First observe from Lemma 4 that for ηk > 0 it holds that\n\nif 1 ≤ ηk then q_g(1) ≤ q_g(ηk),  and if ηk ≤ 1 then p_g(1) ≤ p_g(ηk) = (1/ηk)q_g(ηk).  (25)\n\nUsing (25), the triangle inequality, and Lemma 3, we have\n\nmin{1, ηk} q_g(1) = min{1, ηk}‖ρ(xk)‖ ≤ ‖P^g_ηk(xk − ηk∇f(xk)) − xk‖\n ≤ ‖xk+1 − xk‖ + ‖xk+1 − P^g_ηk(xk − ηk∇f(xk))‖\n ≤ ‖xk+1 − xk‖ + ‖ηk e(xk)‖ ≤ ‖xk+1 − xk‖ + ε(xk).\n\nFrom (9) it follows that for sufficiently large k we have ‖xk+1 − xk‖ ≥ c‖ρ(xk)‖ − ε(xk). For the upper bound note that\n\n‖xk+1 − xk‖ ≤ ‖xk − P^g_ηk(xk − ηk∇f(xk))‖ + ‖P^g_ηk(xk − ηk∇f(xk)) − xk+1‖\n ≤ max{1, ηk}‖ρ(xk)‖ + ‖ηk e(xk)‖ ≤ ‖ρ(xk)‖ + ε(xk).\n\nLemma 5 and Lemma 6 help prove the following crucial corollary.\n\nCorollary 7. 
Let xk, xk+1, ηk, and c be as above, and let k be sufficiently large so that c and ηk satisfy (9). Then, Φ(xk) − Φ(xk+1) ≥ h(xk) holds with h(xk) given by\n\nh(xk) := (L²c³/(2(2 − 2Lc)))‖ρ(xk)‖² − (L²c²/(2 − cL) + 1/c)‖ρ(xk)‖ε(xk) − (1/c − L²c/(2(2 − cL)))ε(xk)².  (26)\n\nProof. Plug the bounds (24) into (20), invoke (9), and simplify; see [32] for details.\n\nWe now have all the ingredients to state the main convergence theorem.\n\nTheorem 8 (Convergence). Let f ∈ C¹_L(X) be such that inf_X f > −∞, and let g be lsc and convex on X. Let {xk} ⊂ X be a sequence generated by (7), and let condition (8) hold for each ‖e(xk)‖. Then there exists a limit point x* of the sequence {xk} and a constant K > 0 such that ‖ρ(x*)‖ ≤ Kε(x*). If {Φ(xk)} converges, then for every limit point x* of {xk} it holds that ‖ρ(x*)‖ ≤ Kε(x*).\n\nProof. Lemma 5, Lemma 6, and Corollary 7 have done all the hard work. Indeed, they allow us to reduce our convergence proof to the case where the analysis of the differentiable case becomes applicable, and an appeal to the analysis of [29; Thm. 2.1] grants us our claim.\n\nTheorem 8 says that we can obtain an approximate stationary point for which the norm of the residual is bounded by a linear function of the error level. The statement of the theorem is written in a conditional form, because nonvanishing errors e(x) prevent us from making a stronger statement. In particular, once the iterates enter a region where the residual norm falls below the error threshold, the behavior of {xk} may be arbitrary. This, however, is a small price to pay for having the added flexibility of nonvanishing errors. 
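The prox-monotonicity of Lemma 4, which drives the two-sided bounds used in the analysis above, is easy to sanity-check numerically. The sketch below is our own illustration (not from the paper): it evaluates p_g and q_g from (16) for the ℓ1 proximity operator on a grid of stepsizes and checks that p_g is nonincreasing while q_g is nondecreasing.

```python
import numpy as np

def prox_l1(v, eta):
    # prox of ||.||_1 with parameter eta (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - eta, 0.0)

def p_q(eta, y, z):
    # The two quantities of (16):
    #   p_g(eta) = ||P_eta(y - eta*z) - y|| / eta   (nonincreasing in eta)
    #   q_g(eta) = ||P_eta(y - eta*z) - y||         (nondecreasing in eta)
    d = np.linalg.norm(prox_l1(y - eta * z, eta) - y)
    return d / eta, d

y = np.array([1.0, -2.0, 0.3])
z = np.array([0.5, 1.0, -0.2])
etas = np.linspace(0.1, 3.0, 30)
ps, qs = zip(*(p_q(e, y, z) for e in etas))
```

Lemma 4 guarantees these monotonicity properties for any lsc convex g, so the check passes regardless of the particular y, z chosen here.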
Under the stronger assumption of vanishing errors (and diminishing stepsizes), we can also obtain guarantees to exact stationary points.\n\n3 Scaling up NIPS: incremental variant\n\nWe now apply NIPS to the large-scale setting, where we have composite objectives of the form\n\nΦ(x) := ∑_{t=1}^T ft(x) + g(x),  (27)\n\nwhere each ft : Rn → R is a C¹_{Lt}(X) function. For simplicity, we use L = max_t Lt in the sequel. It is well known that for such decomposable objectives it can be advantageous to replace the full gradient ∑_t ∇ft(x) by an incremental gradient ∇f_{σ(t)}(x), where σ(t) is some suitable index.\n\nNonconvex incremental methods for differentiable problems have been extensively analyzed, e.g., backpropagation algorithms [5, 29], which correspond to g(x) ≡ 0. However, when g(x) ≠ 0, the only incremental methods that we are aware of are the stochastic generalized gradient methods of [12] and the generalized gradient methods of [31]. As previously mentioned, both of these fail to exploit the composite structure of the objective function, a disadvantage even in the convex case [11].\n\nIn stark contrast, we do exploit the composite structure of (27). Formally, we propose the following incremental nonconvex proximal-splitting iteration:\n\nxk+1 = M(xk − ηk ∑_{t=1}^T ∇ft(xk,t)),  k = 0, 1, . . . ,\nxk,1 = xk,  xk,t+1 = O(xk,t − ηk∇ft(xk,t)),  t = 1, . . . , T − 1,  (28)\n\nwhere O and M are appropriate operators, different choices of which lead to different algorithms. For example, when X = Rn, g(x) ≡ 0, M = O = Id, and ηk → 0, then (28) reduces to the classic incremental gradient method (IGM) [4], and to the IGM of [30] if lim ηk = η̄ > 0. 
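As a rough illustration of iteration (28) with M = O = P^g_η, the following sketch performs minor prox-gradient steps over the components f_t and then a major update with the accumulated incremental gradients. The quadratic components and the ℓ1 prox are our own toy assumptions, not the paper's experimental setup.

```python
import numpy as np

def prox_l1(y, eta, lam=0.1):
    # prox of lam*||.||_1; stands in for P^g_eta.
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)

def incremental_nips(grads, prox, x0, eta, epochs=300):
    # Iteration (28) with M = O = P^g_eta: sweep the components f_t,
    # taking a prox step after each incremental gradient (minor (k, t)
    # iterations), then a major (k) update with the accumulated gradients
    # evaluated at the inner iterates x_{k,t}.
    x = x0
    for _ in range(epochs):
        xt = x
        total = np.zeros_like(x)
        for g in grads:
            gt = g(xt)                    # gradient at inner iterate x_{k,t}
            total += gt
            xt = prox(xt - eta * gt, eta) # minor update
        x = prox(x - eta * total, eta)    # major update
    return x

# Toy decomposable objective: f_t(x) = 0.5*||x - c_t||^2, g = 0.1*||x||_1.
centers = [np.array([1.0, 0.0]), np.array([3.0, 0.0]), np.array([2.0, 0.3])]
grads = [lambda x, c=c: x - c for c in centers]
x_inc = incremental_nips(grads, prox_l1, np.zeros(2), eta=0.1)
```

With this convex toy objective the iterates settle near the ℓ1-shrunk mean of the centers; the inner-iterate gradients act exactly as the bounded error term e(xk) analyzed below.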
If X is a closed convex set, g(x) ≡ 0, M is the orthogonal projection onto X, O = Id, and ηk → 0, then iteration (28) reduces to (projected) IGM [4, 5].\n\nWe consider the four variants of (28) listed in Table 1; to our knowledge, all of these are new. Which of the four variants one prefers depends on the complexity of the constraint set X and the cost of applying P^g_η. The analysis of all four variants is similar, so we present details only for the most general case.\n\nVariant 1: X = Rn, g ≢ 0, M = P^g_η, O = Id (penalized, unconstrained); proximity operator called once every major (k) iteration.\nVariant 2: X = Rn, g ≢ 0, M = P^g_η, O = P^g_η (penalized, unconstrained); proximity operator called once every minor (k, t) iteration.\nVariant 3: X convex, g = h(x) + δ(x|X), M = P^g_η, O = Id (penalized, constrained); proximity operator called once every major (k) iteration.\nVariant 4: X convex, g = h(x) + δ(x|X), M = P^g_η, O = P^g_η (penalized, constrained); proximity operator called once every minor (k, t) iteration.\nTable 1: Different variants of incremental NIPS (28).\n\n3.1 Convergence analysis\n\nSpecifically, we analyze convergence for the case M = O = P^g_η, by generalizing the differentiable case treated by [30]. We begin by rewriting (28) in a form that matches the main iteration (7):\n\nxk+1 = P^g_η(xk − ηk ∑_{t=1}^T ∇ft(xk,t))\n = P^g_η(xk − ηk ∑_{t=1}^T ∇ft(xk) + ηk ∑_{t=1}^T [∇ft(xk) − ∇ft(xk,t)])\n = P^g_η(xk − ηk ∑_{t=1}^T ∇ft(xk) + ηk e(xk)).  (29)\n\nTo show that iteration (29) is well-behaved and actually fits the main NIPS iteration (7), we must ensure that the norm of the error term is bounded. We show this via a sequence of lemmas.\n\nLemma 9 (Bounded increment). Let xk,t+1 be computed by (28), and let st ∈ ∂g(xk,t). 
Then,\n\n‖xk,t+1 − xk,t‖ ≤ 2ηk‖∇ft(xk,t) + st‖.  (30)\n\nProof. From the definition of the proximity operator (5), we have the inequality\n\n(1/2)‖xk,t+1 − xk,t + ηk∇ft(xk,t)‖² + ηk g(xk,t+1) ≤ (1/2)‖ηk∇ft(xk,t)‖² + ηk g(xk,t),\n⟹ (1/2)‖xk,t+1 − xk,t‖² ≤ ηk⟨∇ft(xk,t), xk,t − xk,t+1⟩ + ηk(g(xk,t) − g(xk,t+1)).\n\nSince st ∈ ∂g(xk,t), we have g(xk,t+1) ≥ g(xk,t) + ⟨st, xk,t+1 − xk,t⟩. Therefore,\n\n(1/2)‖xk,t+1 − xk,t‖² ≤ ηk⟨st, xk,t − xk,t+1⟩ + ηk⟨∇ft(xk,t), xk,t − xk,t+1⟩\n ≤ ηk‖st + ∇ft(xk,t)‖‖xk,t − xk,t+1‖\n⟹ ‖xk,t+1 − xk,t‖ ≤ 2ηk‖∇ft(xk,t) + st‖.\n\nLemma 9 proves helpful in bounding the overall error.\n\nLemma 10 (Bounded error). If for all xk ∈ X, ‖∇ft(xk)‖ ≤ M and ‖∂g(xk)‖ ≤ G, then there exists a constant K1 > 0 such that ‖e(xk)‖ ≤ K1.\n\nProof. To bound the error of using xk,t instead of xk, first define the terms εt := ‖∇ft(xk,t) − ∇ft(xk)‖ for t = 1, . . . 
, T.  (31)\n\nThen, an inductive argument (see [32] for details) shows that for 2 ≤ t ≤ T,\n\nεt ≤ 2ηkL ∑_{j=1}^{t−1} (1 + 2ηkL)^{t−1−j}‖∇fj(xk) + sj‖.  (32)\n\nSince ‖e(xk)‖ ≤ ∑_{t=1}^T εt and ε1 = 0, writing βt := ‖∇ft(xk) + st‖, (32) then leads to the bound\n\n∑_{t=1}^T εt ≤ 2ηkL ∑_{t=2}^T ∑_{j=1}^{t−1} (1 + 2ηkL)^{t−1−j}βj = 2ηkL ∑_{t=1}^{T−1} βt (∑_{j=0}^{T−t−1} (1 + 2ηkL)^j)\n ≤ ∑_{t=1}^{T−1} (1 + 2ηkL)^{T−t}βt ≤ (1 + 2ηkL)^{T−1} ∑_{t=1}^{T−1} βt ≤ C1(T − 1)(M + G) =: K1,\n\nwhere C1 is an upper bound on (1 + 2ηkL)^{T−1}, which is finite since ηk is bounded above by (9).\n\nThus, the error norm ‖e(xk)‖ is bounded from above by a constant, whereby it satisfies the requirement (8), making the incremental NIPS method (28) a special case of the general NIPS framework. This allows us to invoke the convergence result Theorem 8 without further ado.\n\n4 Illustrative application\n\nThe main contribution of our paper is the new NIPS framework, and a specific application is not one of the prime aims of this paper. We do, however, provide an illustrative application of NIPS to a challenging nonconvex problem: sparsity-regularized low-rank matrix factorization\n\nmin_{X,A≥0} (1/2)‖Y − XA‖²_F + ψ0(X) + ∑_{t=1}^T ψt(at),  (33)\n\nwhere Y ∈ R^{m×T}, X ∈ R^{m×K} and A ∈ R^{K×T}, with a1, . . . , aT as its columns. Problem (33) generalizes the well-known nonnegative matrix factorization (NMF) problem of [20] by permitting arbitrary Y (not necessarily nonnegative), and by adding regularizers on X and A. A related class of problems was studied in [23], but with a crucial difference: the formulation in [23] does not allow nonsmooth regularizers on X. 
The class of problems studied in [23] is in fact a subset of those covered by NIPS. On a more theoretical note, [23] considered stochastic-gradient-like methods whose analysis requires computational errors and stepsizes to vanish, whereas our method is deterministic and allows nonvanishing stepsizes and errors.\n\nFollowing [23], we also rewrite (33) in a form more amenable to NIPS. We eliminate A and consider\n\nmin_X φ(X) := ∑_{t=1}^T ft(X) + g(X),  where  g(X) := ψ0(X) + δ(X | ≥ 0),  (34)\n\nand where each ft(X) for 1 ≤ t ≤ T is defined as\n\nft(X) := min_a (1/2)‖yt − Xa‖² + gt(a),  (35)\n\nwhere gt(a) := ψt(a) + δ(a | ≥ 0). For simplicity, assume that (35) attains its unique¹ minimum, say a*; then ft(X) is differentiable and we have ∇_X ft(X) = (Xa* − yt)(a*)^T. Thus, we can instantiate (28), and all we need is a subroutine for solving (35).²\n\nWe present empirical results on the following two variants of (34): (i) pure unpenalized NMF (ψt ≡ 0 for 0 ≤ t ≤ T) as a baseline; and (ii) sparsity-penalized NMF, where ψ0(X) ≡ λ‖X‖1 and ψt(at) ≡ γ‖at‖1. 
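For concreteness, the gradient ∇_X f_t(X) = (Xa* − y_t)(a*)^T can be obtained from any subroutine for the inner problem (35). Below is a minimal sketch of our own (not the paper's mini-batch implementation): it solves (35) by proximal gradient for g_t(a) = γ‖a‖₁ + δ(a | ≥ 0), whose prox is a soft shift followed by clipping at zero; the toy X and y_t are illustrative assumptions.

```python
import numpy as np

def prox_nn_l1(a, eta, gamma):
    # prox of gamma*||a||_1 + indicator(a >= 0): soft-shift, then clip.
    return np.maximum(a - eta * gamma, 0.0)

def grad_ft(X, y_t, gamma=0.1, iters=500):
    # Solve inner problem (35) by proximal gradient, then return
    # grad_X f_t(X) = (X a* - y_t)(a*)^T together with a*.
    a = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2   # Lipschitz constant of a -> X^T(Xa - y_t)
    eta = 1.0 / L
    for _ in range(iters):
        a = prox_nn_l1(a - eta * X.T @ (X @ a - y_t), eta, gamma)
    return np.outer(X @ a - y_t, a), a

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y_t = np.array([1.0, 2.0, 2.0])
G, a_star = grad_ft(X, y_t)
```

The returned a* is nonnegative by construction, and on this toy instance both coordinates are active, so the stationarity condition Xᵀ(Xa* − y_t) + γ ≈ 0 holds to solver tolerance.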
Note that without the nonnegativity constraints, (34) is similar to sparse PCA. We use the following datasets and parameters: (i) RAND: 4000 × 4000 dense random (uniform [0, 1]); rank-32 factorization; (λ, γ) = (10^-5, 10); (ii) CBCL: CBCL database [33]; 361 × 2429; rank-49 factorization; (iii) YALE: Yale B database [21]; 32256 × 2414 matrix; rank-32 factorization; (iv) WEB: web graph from Google; sparse 714545 × 739454 matrix (empty rows and columns removed; ID 2301 in the sparse matrix collection [10]); rank-4 factorization; (λ = γ = 10^-6).\n\n¹Otherwise, at the expense of more notation, we can add a small strictly convex perturbation to ensure uniqueness; this perturbation can then be absorbed into the overall computational error.\n²In practice, it is better to use mini-batches, and we used the same sized mini-batches for all the algorithms.\n\nFigure 1: Running times of NIPS (Matlab) versus SPAMS (C++) for NMF on the RAND, CBCL, and YALE datasets. Initial objective values and tiny runtimes have been suppressed for clarity of presentation.\n\nOn the NMF baseline (Fig. 1), we compare NIPS against the well-optimized, state-of-the-art C++ toolbox SPAMS (version 2.3) [23]. We compare against SPAMS only on dense matrices, as its NMF code seems to be optimized for this case. Obviously, the comparison is not fair: unlike SPAMS, NIPS and its subroutines are all implemented in MATLAB, and they run equally easily on large sparse matrices. Nevertheless, NIPS proves to be quite competitive: Fig. 1 shows that our MATLAB implementation runs only slightly slower than SPAMS. 
We expect a well-tuned C++ implementation of NIPS to run at least 4–10 times faster than the MATLAB version; the dashed line in the plots visualizes what even a modest 3X speedup of NIPS would mean.\n\nFigure 2 shows numerical results comparing the stochastic generalized gradient (SGGD) algorithm of [12] against NIPS, when both are started at the same point. As is well known, SGGD requires careful stepsize tuning; so we searched over a range of stepsizes, and have reported the best results. NIPS too requires some stepsize tuning, but substantially less than SGGD. As predicted, the solutions returned by NIPS have objective function values lower than SGGD's, and have greater sparsity.\n\nFigure 2: Sparse NMF: NIPS versus SGGD. The bar plots show the sparsity (higher is better) of the factors X and A. Left plots are for the RAND dataset; right plots for WEB. As expected, SGGD yields slightly worse objective function values and less sparse solutions than NIPS.\n\n5 Discussion\n\nWe presented a new framework, NIPS, which solves a broad class of nonconvex composite objective problems. NIPS permits nonvanishing computational errors, which can be practically useful. We also specialized NIPS to obtain a scalable incremental version. Our numerical experiments on large-scale matrix factorization indicate that NIPS is competitive with state-of-the-art methods.\n\nWe conclude by mentioning that NIPS includes numerous other algorithms as special cases, for example, batch and incremental convex FBS, convex and nonconvex gradient projection, and the proximal-point algorithm, among others. Theoretically, however, the most exciting open problem resulting from this paper is: extend NIPS in a scalable way to the case where even the nonsmooth part is nonconvex. This case will require a very different convergence analysis, and is left to the future.\n\nReferences\n\n[1] H. Attouch, J. Bolte, and B. F. Svaiter. 
Convergence of descent methods for semi-algebraic and tame prob-\nlems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math.\n\n8\n\n010203040506070100Running time (seconds)Objective function value  NIPSSPAMS0510152025102.1102.2102.3Running time (seconds)Objective function value  NIPSSPAMS050100150200250300350400102.3102.4102.5102.6102.7102.8Running time (seconds)Objective function value  NIPSSPAMS01020304050607080101102103104Running time (seconds)Objective function value  NIPSSGGD00.10.20.30.40.50.60.70.80.9SGGD\u2212ANIPS\u2212ASGGD\u2212XNIPS\u2212XSparsity\fProgramming Series A, Aug. 2011. Online First.\n\n[2] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In\n\nS. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.\n\n[3] A. Beck and M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Prob-\n\nlems. SIAM J. Imgaging Sciences, 2(1):183\u2013202, 2009.\n\n[4] D. P. Bertsekas. Nonlinear Programming. Athena Scienti\ufb01c, second edition, 1999.\n[5] D. P. Bertsekas. Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A\n\nSurvey. Technical Report LIDS-P-2848, MIT, August 2010.\n\n[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.\n[7] F. H. Clarke. Optimization and nonsmooth analysis. John Wiley & Sons, Inc., 1983.\n[8] P. L. Combettes and J.-C. Pesquet. Proximal Splitting Methods in Signal Processing. arXiv:0912.3522v4,\n\nMay 2010.\n\n[9] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale\n\nModeling and Simulation, 4(4):1168\u20131200, 2005.\n\n[10] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on\n\nMathematical Software, 2011. To appear.\n\n[11] J. Duchi and Y. Singer. Online and Batch Learning using Forward-Backward Splitting. J. 
Mach. Learning Res. (JMLR), Sep. 2009.
[12] Y. M. Ermoliev and V. I. Norkin. Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybernetics and Systems Analysis, 34:196–215, 1998.
[13] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems. IEEE J. Selected Topics in Sig. Proc., 1(4):586–597, 2007.
[14] M. Fukushima and H. Mine. A generalized proximal point algorithm for certain non-convex minimization problems. Int. J. Systems Science, 12(8):989–1000, 1981.
[15] E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22(6):936–964, 1984.
[16] A. A. Gaivoronski. Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods. Part 1. Optimization Methods and Software, 4(2):117–134, 1994.
[17] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1st edition, 1994.
[18] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15:349–396, 2003.
[19] D. Kundur and D. Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3), May 1996.
[20] D. D. Lee and H. S. Seung. Algorithms for Nonnegative Matrix Factorization. In NIPS, 2000.
[21] K. C. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5):684–698, 2005.
[22] J. Liu and J. Ye. Efficient Euclidean projections in linear time. In ICML, Jun. 2009.
[23] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. JMLR, 11:19–60, 2010.
[24] J. J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris Sér. A Math., 255:2897–2899, 1962.
[25] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[26] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 2007/76, Université catholique de Louvain, September 2007.
[27] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control and Optimization, 14, 1976.
[28] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, 1998.
[29] M. V. Solodov. Convergence analysis of perturbed feasible descent methods. J. Optimization Theory and Applications, 93(2):337–353, 1997.
[30] M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11:23–35, 1998.
[31] M. V. Solodov and S. K. Zavriev. Error stability properties of generalized gradient-type algorithms. J. Optimization Theory and Applications, 98(3):663–680, 1998.
[32] S. Sra. Nonconvex proximal-splitting: Batch and incremental algorithms. arXiv:1109.0258v2, Sep. 2012.
[33] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, 1996.
[34] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In NIPS, 2009.