{"title": "Faster and Non-ergodic O(1/K) Stochastic Alternating Direction Method of Multipliers", "book": "Advances in Neural Information Processing Systems", "page_first": 4476, "page_last": 4485, "abstract": "We study stochastic convex optimization subject to linear equality constraints. Traditional Stochastic Alternating Direction Method of Multipliers and its Nesterov's acceleration scheme can only achieve ergodic O(1/\\sqrt{K}) convergence rates, where K is the number of iterations. By introducing Variance Reduction (VR) techniques, the convergence rates improve to ergodic O(1/K). In this paper, we propose a new stochastic ADMM which elaborately integrates Nesterov's extrapolation and VR techniques. With Nesterov\u2019s extrapolation, our algorithm can achieve a non-ergodic O(1/K) convergence rate which is optimal for separable linearly constrained non-smooth convex problems, while the convergence rates of VR based ADMM methods are actually tight O(1/\\sqrt{K}) in the non-ergodic sense. To the best of our knowledge, this is the first work that achieves a truly accelerated, stochastic convergence rate for constrained convex problems. The experimental results demonstrate that our algorithm is significantly faster than the existing state-of-the-art stochastic ADMM methods.", "full_text": "Faster and Non-ergodic O(1/K) Stochastic\nAlternating Direction Method of Multipliers\n\nCong Fang\n\nFeng Cheng\n\nZhouchen Lin\u2217\n\nKey Laboratory of Machine Perception (MOE), School of EECS, Peking University, P. R. China\n\nCooperative Medianet Innovation Center, Shanghai Jiao Tong University, P. R. 
China

fangcong@pku.edu.cn    fengcheng@pku.edu.cn    zlin@pku.edu.cn

Abstract

We study stochastic convex optimization subject to linear equality constraints. Traditional Stochastic Alternating Direction Method of Multipliers [1] and its Nesterov's acceleration scheme [2] can only achieve ergodic O(1/\sqrt{K}) convergence rates, where K is the number of iterations. By introducing Variance Reduction (VR) techniques, the convergence rates improve to ergodic O(1/K) [3, 4]. In this paper, we propose a new stochastic ADMM which elaborately integrates Nesterov's extrapolation and VR techniques. With Nesterov's extrapolation, our algorithm can achieve a non-ergodic O(1/K) convergence rate, which is optimal for separable linearly constrained non-smooth convex problems, while the convergence rates of VR based ADMM methods are actually tight O(1/\sqrt{K}) in the non-ergodic sense. To the best of our knowledge, this is the first work that achieves a truly accelerated, stochastic convergence rate for constrained convex problems. The experimental results demonstrate that our algorithm is faster than the existing state-of-the-art stochastic ADMM methods.

1 Introduction

We consider the following general convex finite-sum problem with linear constraints:

min_{x_1, x_2} h_1(x_1) + f_1(x_1) + h_2(x_2) + \frac{1}{n} \sum_{i=1}^{n} f_{2,i}(x_2),   s.t.   A_1 x_1 + A_2 x_2 = b,    (1)

where f_1(x_1) and f_{2,i}(x_2) with i \in {1, 2, ..., n} are convex and have Lipschitz continuous gradients, and h_1(x_1) and h_2(x_2) are also convex, but can be non-smooth. We use the following notations: L_1 denotes the Lipschitz constant of f_1(x_1), L_2 is the Lipschitz constant of f_{2,i}(x_2) with i \in {1, 2, ..., n}, and f_2(x) = \frac{1}{n} \sum_{i=1}^{n} f_{2,i}(x). We use \nabla f to denote the gradient of f.

Problem (1) is of great importance in machine learning. 
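To make the splitting in Problem (1) concrete, the sketch below builds a toy instance in which the constraint decouples a non-smooth regularizer from a smooth finite-sum loss (here A_1 = -I, A_2 = A, b = 0). All data and names are our own illustration, not the paper's experiments:

```python
import numpy as np

# Toy instance of Problem (1): h_1 = mu*||x_1||_1 (non-smooth), f_1 = 0,
# h_2 = 0, f_{2,i} = squared loss on sample i, constraint -x_1 + A x_2 = 0.
rng = np.random.default_rng(0)
n, d = 8, 5                      # number of samples, number of features
Z = rng.standard_normal((n, d))  # data matrix (one row per sample)
t = rng.standard_normal(n)       # regression targets
A = np.eye(d)                    # structure-encoding matrix (identity for this sketch)
mu = 0.1

def f2(x2):
    # f_2(x_2) = (1/n) * sum_i f_{2,i}(x_2), each term smooth with Lipschitz gradient
    return 0.5 * np.mean((Z @ x2 - t) ** 2)

def h1(x1):
    # non-smooth regularizer, handled separately thanks to the linear constraint
    return mu * np.abs(x1).sum()

def objective(x1, x2):
    return h1(x1) + f2(x2)

# A feasible point satisfies A_1 x_1 + A_2 x_2 = b, i.e. x_1 = A x_2 here.
x2 = rng.standard_normal(d)
x1 = A @ x2
residual = np.linalg.norm(-x1 + A @ x2)  # A_1 x_1 + A_2 x_2 - b with A_1 = -I, b = 0
```

The point of the construction is that the proximal step on h_1 and the stochastic gradient step on f_2 never have to be solved jointly; ADMM couples them only through the constraint residual.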
The finite-sum function f_2(x_2) is typically a loss over training samples, and the remaining functions control the structure or regularize the model to aid generalization [2]. The idea of using linear constraints to decouple the loss and regularization terms enables researchers to consider more sophisticated regularization terms which might be very complicated to solve through proximity operators for Gradient Descent [5] methods. For example, for multitask learning problems [6, 7], the regularization term is set as \mu_1 \|x\|_* + \mu_2 \|x\|_1; for most graph-guided fused Lasso and overlapping group Lasso problems [8, 4], the regularization term can be written as \mu \|Ax\|_1; and for many multi-view learning tasks [9], the regularization terms often involve \mu_1 \|x\|_{2,1} + \mu_2 \|x\|_*.

*Corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Convergence rates of ADMM type methods solving Problem (1).

Type       | Algorithm        | Convergence Rate
-----------|------------------|--------------------------------
Batch      | ADMM [13]        | Tight non-ergodic O(1/\sqrt{K})
Batch      | LADM-NE [15]     | Optimal non-ergodic O(1/K)
Stochastic | STOC-ADMM [1]    | ergodic O(1/\sqrt{K})
Stochastic | OPG-ADMM [16]    | ergodic O(1/\sqrt{K})
Stochastic | OPT-ADMM [2]     | ergodic O(1/\sqrt{K})
Stochastic | SDCA-ADMM [17]   | unknown
Stochastic | SAG-ADMM [3]     | Tight non-ergodic O(1/\sqrt{K})
Stochastic | SVRG-ADMM [4]    | Tight non-ergodic O(1/\sqrt{K})
Stochastic | ACC-SADMM (ours) | Optimal non-ergodic O(1/K)

Alternating Direction Method of Multipliers (ADMM) is a very popular optimization method for solving Problem (1), with its advantages in speed, easy implementation and good scalability shown in a large body of literature (see the survey [10]). A popular criterion for an algorithm's convergence rate is its ergodic convergence. 
It is proved in [11, 12] that ADMM converges at an O(1/K) ergodic rate. In this paper, however, we consider convergence in the non-ergodic sense. The reasons are twofold: 1) in real applications, the output of ADMM methods is the non-ergodic result (x_K), rather than the ergodic one (a convex combination of x_1, x_2, ..., x_K), as the non-ergodic results are much faster (see detailed discussions in Section 5.3); 2) the ergodic convergence rate is not trivially the same as the general-case rate. For example, the sequence {a_k} = {1, -1, 1, -1, 1, -1, ...} (a_k is 1 when k is odd and -1 when k is even) is divergent, while in the ergodic sense it converges at O(1/K). So analyses in the non-ergodic sense are closer to reality. Point 2) is especially relevant for ADMM methods. In [13], Davis et al. prove that Douglas-Rachford (DR) splitting converges at a non-ergodic O(1/\sqrt{K}) rate. They also construct a family of functions showing that the non-ergodic O(1/\sqrt{K}) rate is tight. Chen et al. establish O(1/\sqrt{K}) for Linearized ADMM [14]. Then Li et al. accelerate ADMM through Nesterov's extrapolation and obtain a non-ergodic O(1/K) convergence rate [15]. They also prove that the lower complexity bound of ADMM type methods for separable linearly constrained non-smooth convex problems is exactly O(1/K), which demonstrates that their algorithm is optimal. The convergence rates of different ADMM based algorithms are shown in Table 1.
On the other hand, to meet the demands of solving large-scale machine learning problems, stochastic algorithms [18] have drawn a lot of interest in recent years. For stochastic ADMM (SADMM), the prior works are STOC-ADMM [1] and OPG-ADMM [16]. Due to the noise of the gradient, both algorithms can only achieve an ergodic O(1/\sqrt{K}) convergence rate. There are two lines of research to accelerate SADMM. The first is to introduce the Variance Reduction (VR) [19, 20, 21] techniques into SADMM. 
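The divergent-but-ergodically-convergent sequence {a_k} mentioned above can be checked numerically; a toy sketch, independent of the paper's code:

```python
import numpy as np

# a_k = 1 for odd k, -1 for even k: the iterates themselves diverge,
# but the ergodic average (1/K) * sum_{k<=K} a_k converges at rate O(1/K).
K = 1000
a = np.array([1.0 if (k % 2 == 1) else -1.0 for k in range(1, K + 1)])
ergodic_avg = np.cumsum(a) / np.arange(1, K + 1)

# The non-ergodic iterates keep oscillating between 1 and -1 ...
assert set(np.unique(a)) == {1.0, -1.0}
# ... while the ergodic average is bounded by 1/K (partial sums are 0 or 1).
bound = 1.0 / np.arange(1, K + 1)
```

This is exactly why an ergodic rate says nothing about the last iterate, which is what ADMM implementations actually return.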
VR methods ensure the descent direction to have a bounded variance and so can achieve faster convergence rates. The existing VR based SADMM algorithms include SDCA-ADMM [17], SAG-ADMM [3] and SVRG-ADMM [4]. SAG-ADMM and SVRG-ADMM can provably achieve ergodic O(1/K) rates for Problem (1). The second way to accelerate SADMM is through Nesterov's acceleration [22]. This work is from [2], in which the authors propose an ergodic O(R^2/K^2 + (D_y + \rho)/K + \sigma/\sqrt{K}) stochastic algorithm (OPT-ADMM). The dependence of the convergence rate on the smoothness constant is O(1/K^2), and so each term in the convergence rate seems to have been improved to optimal. However, its worst-case convergence rate is still O(1/\sqrt{K}).
In this paper, we propose Accelerated Stochastic ADMM (ACC-SADMM) for large scale general convex finite-sum problems with linear constraints. By elaborately integrating Nesterov's extrapolation and VR techniques, ACC-SADMM provably achieves a non-ergodic O(1/K) convergence rate, which is optimal for non-smooth problems. Since in the non-ergodic sense the VR based SADMM methods (e.g. SVRG-ADMM, SAG-ADMM) converge at a tight O(1/\sqrt{K}) rate (please see detailed discussions in Section 5.3), ACC-SADMM improves the convergence rate from O(1/\sqrt{K}) to O(1/K) in the non-ergodic sense and fills the theoretical gap between stochastic and batch (deterministic) ADMM. The original idea in designing ACC-SADMM is to explicitly incorporate the snapshot vector \tilde{x} (approximately the mean value of x in the last epoch) into the extrapolation terms. This is, to some degree, inspired by [23], which proposes an O(1/K^2) stochastic gradient algorithm named Katyusha for convex problems.

Table 2: Notations and Variables

Notation                        | Meaning                  || Variable                                       | Meaning
\langle x,y \rangle_G, \|x\|_G  | x^T G y, \sqrt{x^T G x}  || y^k_{s,1}, y^k_{s,2}                           | extrapolation variables
F_i(x_i)                        | h_i(x_i) + f_i(x_i)      || x^k_{s,1}, x^k_{s,2}                           | primal variables
x                               | (x_1, x_2)               || \tilde\lambda^k_s, \lambda^k_s                 | dual and temp variables
y                               | (y_1, y_2)               || \tilde{x}_{s,1}, \tilde{x}_{s,2}, \tilde{b}_s  | snapshot vectors
F(x)                            | F_1(x_1) + F_2(x_2)      || x^*_1, x^*_2, \lambda^*                        | optimal solution of Eq. (1)

However, there are many distinctions between the two algorithms (please see detailed discussions in Section 5.1). Our method is also very efficient in practice since we have sufficiently considered the noise of the gradient in our acceleration scheme. For example, we adopt the extrapolation y^k_s = x^k_s + (1 - \theta_{1,s} - \theta_2)(x^k_s - x^{k-1}_s) in the inner loop, where \theta_2 is a constant and \theta_{1,s} decreases after every epoch, instead of directly adopting the extrapolation y^k = x^k + \frac{\theta^k_1 (1 - \theta^{k-1}_1)}{\theta^{k-1}_1} (x^k - x^{k-1}) of the original Nesterov's scheme and adding the proximal term \frac{\|x^{k+1} - x^k\|^2}{\sigma k^{3/2}} as [2] does. There are also variants on the updating of the multiplier and the snapshot vector. We list the contributions of our work as follows:

• We propose ACC-SADMM for large scale convex finite-sum problems with linear constraints which integrates Nesterov's extrapolation and VR techniques. We prove that our algorithm converges in non-ergodic O(1/K), which is optimal for separable linearly constrained non-smooth convex problems. 
To the best of our knowledge, this is the first work that achieves a truly accelerated, stochastic convergence rate for constrained convex problems.

• We conduct experiments on four benchmark datasets to demonstrate the superiority of our algorithm. We also conduct an experiment on the Multitask Learning [6] problem to demonstrate that our algorithm can be used on very large datasets.

2 Preliminary

Most SADMM methods alternately minimize the following variant surrogate of the augmented Lagrangian:

L'(x_1, x_2, \lambda, \beta) = h_1(x_1) + \langle \nabla f_1(x_1), x_1 \rangle + \frac{L_1}{2} \|x_1 - x^k_1\|^2_{G_1} + h_2(x_2) + \langle \tilde\nabla f_2(x_2), x_2 \rangle + \frac{L_2}{2} \|x_2 - x^k_2\|^2_{G_2} + \frac{\beta}{2} \|A_1 x_1 + A_2 x_2 - b + \frac{\lambda}{\beta}\|^2,    (2)

where \tilde\nabla f_2(x_2) is an estimator of \nabla f_2(x_2) from one or a mini-batch of training samples. So the computation cost of each iteration reduces from O(n) to O(b), where b is the mini-batch size. When f_i(x) = 0 and G_i = 0, with i = 1, 2, Problem (1) is solved as exact ADMM. When there is no h_i(x_i) and G_i is set as the identity matrix I, with i = 1, 2, the subproblem in x_i can be solved through matrix inversion. This scheme is advocated in many SADMM methods [1, 3]. Another common approach is linearization (also called the inexact Uzawa method) [24, 25], where G_i is set as \eta_i I - \frac{\beta}{L_i} A_i^T A_i with \eta_i \geq 1 + \frac{\beta}{L_i} \|A_i^T A_i\|.

For STOC-ADMM [1], \tilde\nabla f_2(x_2) is simply set as:

\tilde\nabla f_2(x_2) = \frac{1}{b} \sum_{i_k \in I_k} \nabla f_{2,i_k}(x_2),    (3)

where I_k is the mini-batch of size b from {1, 2, ..., n}. For SVRG-ADMM [4], the gradient estimator can be written as:

\tilde\nabla f_2(x_2) = \frac{1}{b} \sum_{i_k \in I_k} (\nabla f_{2,i_k}(x_2) - \nabla f_{2,i_k}(\tilde{x}_2)) + \nabla f_2(\tilde{x}_2),    (4)

where \tilde{x}_2 is a snapshot vector (the mean value of the last epoch).

Algorithm 1 Inner loop of ACC-SADMM
for k = 0 to m - 1 do
    Update dual variable: \lambda^k_s = \tilde\lambda^k_s + \frac{\beta \theta_2}{\theta_{1,s}} (A_1 x^k_{s,1} + A_2 x^k_{s,2} - \tilde{b}_s).
    Update x^{k+1}_{s,1} through Eq. (6).
    Update x^{k+1}_{s,2} through Eq. (7).
    Update dual variable: \tilde\lambda^{k+1}_s = \lambda^k_s + \beta (A_1 x^{k+1}_{s,1} + A_2 x^{k+1}_{s,2} - b).
    Update y^{k+1}_s through Eq. (5).
end for

3 Our Algorithm

3.1 ACC-SADMM

To help readers understand our algorithm more easily, we list the notations and variables in Table 2. Our algorithm has double loops as we use SVRG [19], which also has two layers of nested loops to estimate the gradient. We denote by subscript s the index of the outer loop and by superscript k the index of the inner loop. For example, x^k_{s,1} is the value of x_1 at the k-th step of the inner iteration and the s-th step of the outer iteration. And we use x^k_s to denote (x^k_{s,1}, x^k_{s,2}) and y^k_s to denote (y^k_{s,1}, y^k_{s,2}), respectively. In each inner loop, we update primal variables x^k_{s,1} and x^k_{s,2}, extrapolation terms y^k_{s,1} and y^k_{s,2}, and dual variable \lambda^k_s, while s remains unchanged. In the outer loop, we maintain snapshot vectors \tilde{x}_{s+1,1}, \tilde{x}_{s+1,2} and \tilde{b}_{s+1}, and then assign the initial value to the extrapolation terms y^0_{s+1,1} and y^0_{s+1,2}. We directly linearize both the smooth term f_i(x_i) and the augmented term \frac{\beta}{2} \|A_1 x_1 + A_2 x_2 - b + \frac{\lambda}{\beta}\|^2. The whole algorithm is shown in Algorithm 2.

3.2 Inner Loop

The inner loop of ACC-SADMM is straightforward, shown as Algorithm 1. 
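Before turning to the updates, the two gradient estimators of Eqs. (3) and (4) can be compared numerically. The sketch below (a toy least-squares setup of our own, not the paper's code) checks that both estimators are unbiased over the mini-batch sampling, and that the SVRG estimator of Eq. (4) reproduces the full gradient exactly when evaluated at the snapshot point, which is the source of its variance reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
Z = rng.standard_normal((n, d))
t = rng.standard_normal(n)

def grad_i(x, i):
    # gradient of the toy component f_{2,i}(x) = 0.5 * (z_i^T x - t_i)^2
    return (Z[i] @ x - t[i]) * Z[i]

def full_grad(x):
    return np.mean([grad_i(x, i) for i in range(n)], axis=0)

def stoc_estimator(x, batch):
    # Eq. (3): plain mini-batch gradient
    return np.mean([grad_i(x, i) for i in batch], axis=0)

def svrg_estimator(x, x_snap, batch):
    # Eq. (4): mini-batch difference plus the full gradient at the snapshot
    return (np.mean([grad_i(x, i) - grad_i(x_snap, i) for i in batch], axis=0)
            + full_grad(x_snap))

x = rng.standard_normal(d)
x_snap = x + 0.1 * rng.standard_normal(d)

# Unbiasedness: averaging each estimator over all singleton mini-batches
# recovers the full gradient.
avg3 = np.mean([stoc_estimator(x, [i]) for i in range(n)], axis=0)
avg4 = np.mean([svrg_estimator(x, x_snap, [i]) for i in range(n)], axis=0)

# Variance reduction: at the snapshot itself, Eq. (4) is exact for ANY mini-batch.
exact = svrg_estimator(x_snap, x_snap, [3])
```
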
In each iteration, we do extrapolation and then update the primal and dual variables. There are two critical steps which enable us to obtain non-ergodic results. The first is extrapolation, which we perform as:

y^{k+1}_s = x^{k+1}_s + (1 - \theta_{1,s} - \theta_2)(x^{k+1}_s - x^k_s).    (5)

We can find that 1 - \theta_{1,s} - \theta_2 \leq 1 - \theta_{1,s}. So compared with the original Nesterov's scheme, our way of tackling the noise of the gradient is more "mild". The second critical step is the updating of the primal variables:

x^{k+1}_{s,1} = argmin_{x_1} h_1(x_1) + \langle \nabla f_1(y^k_{s,1}), x_1 \rangle + \langle \frac{\beta}{\theta_{1,s}} (A_1 y^k_{s,1} + A_2 y^k_{s,2} - b) + \lambda^k_s, A_1 x_1 \rangle + (\frac{L_1}{2} + \frac{\beta \|A_1^T A_1\|}{2 \theta_{1,s}}) \|x_1 - y^k_{s,1}\|^2.    (6)

We then update x_2 with the latest information of x_1, which can be written as:

x^{k+1}_{s,2} = argmin_{x_2} h_2(x_2) + \langle \tilde\nabla f_2(y^k_{s,2}), x_2 \rangle + \langle \frac{\beta}{\theta_{1,s}} (A_1 x^{k+1}_{s,1} + A_2 y^k_{s,2} - b) + \lambda^k_s, A_2 x_2 \rangle + (\frac{(1 + \frac{1}{b \theta_2}) L_2}{2} + \frac{\beta \|A_2^T A_2\|}{2 \theta_{1,s}}) \|x_2 - y^k_{s,2}\|^2,    (7)

where \tilde\nabla f_2(y^k_{s,2}) is obtained by the technique of SVRG [19] with the form:

\tilde\nabla f_2(y^k_{s,2}) = \frac{1}{b} \sum_{i_{k,s} \in I_{k,s}} (\nabla f_{2,i_{k,s}}(y^k_{s,2}) - \nabla f_{2,i_{k,s}}(\tilde{x}_{s,2}) + \nabla f_2(\tilde{x}_{s,2})).

Compared with unaccelerated SADMM methods, which alternately minimize Eq. (2), our method is distinct in two ways. The first is that the gradient estimator is computed at y^k_{s,2}. 
The second is that we have chosen a slowly increasing penalty factor \beta/\theta_{1,s}, instead of a fixed one.

Algorithm 2 ACC-SADMM
Input: epoch length m > 2, \beta, \tau = 2, c = 2, x^0_0 = 0, \tilde\lambda^0_0 = 0, \tilde{x}_0 = x^0_0, y^0_0 = x^0_0, \theta_{1,s} = \frac{1}{c + \tau s}, \theta_2 = \frac{m - \tau}{\tau (m - 1)}.
for s = 0 to S - 1 do
    Do inner loop, as stated in Algorithm 1.
    Set primal variables: x^0_{s+1} = x^m_s.
    Update snapshot vectors \tilde{x}_{s+1} through Eq. (8).
    Update dual variable: \tilde\lambda^0_{s+1} = \lambda^{m-1}_s + \beta (1 - \tau)(A_1 x^m_{s,1} + A_2 x^m_{s,2} - b).
    Update dual snapshot variable: \tilde{b}_{s+1} = A_1 \tilde{x}_{s+1,1} + A_2 \tilde{x}_{s+1,2}.
    Update extrapolation terms y^0_{s+1} through Eq. (9).
end for
Output: \hat{x}_S = \frac{1}{(m-1)(\theta_{1,S} + \theta_2) + 1} x^m_S + \frac{\theta_{1,S} + \theta_2}{(m-1)(\theta_{1,S} + \theta_2) + 1} \sum_{k=1}^{m-1} x^k_S.

3.3 Outer Loop

The outer loop of our algorithm is a little more complex: we preserve the snapshot vectors and then reset the initial values. The main variants we adopt concern the snapshot vector \tilde{x}_{s+1} and the extrapolation term y^0_{s+1}. For the snapshot vector \tilde{x}_{s+1}, we update it as:

\tilde{x}_{s+1} = \frac{1}{m} ([1 + \frac{(\tau - 1) \theta_{1,s+1}}{\theta_2}] x^m_s + [1 - \frac{(\tau - 1) \theta_{1,s+1}}{(m - 1) \theta_2}] \sum_{k=1}^{m-1} x^k_s).    (8)

\tilde{x}_{s+1} is not the average of {x^k_s}, different from most SVRG-based methods [19, 4]. The way of generating \tilde{x} guarantees a faster convergence rate for the constraints. Then we reset y^0_{s+1} as:

y^0_{s+1} = (1 - \theta_2) x^m_s + \theta_2 \tilde{x}_{s+1} + \frac{\theta_{1,s+1}}{\theta_{1,s}} [(1 - \theta_{1,s}) x^m_s - (1 - \theta_{1,s} - \theta_2) x^{m-1}_s - \theta_2 \tilde{x}_s].    (9)

4 Convergence Analysis

In this section, we give the convergence results of ACC-SADMM. The proof and an outline can be found in the Supplementary Material. As we have mentioned in Section 3.2, the main strategy that enables us to obtain non-ergodic results is that we adopt the extrapolation of Eq. (5). We first analyze each inner iteration, as shown in Lemma 1. We omit the subscript s since s is unchanged in the inner iteration.

Lemma 1 Assume that f_1(x_1) and f_{2,i}(x_2) with i \in {1, 2, ..., n} are convex and have Lipschitz continuous gradients. L_1 is the Lipschitz constant of f_1(x_1). L_2 is the Lipschitz constant of f_{2,i}(x_2) with i \in {1, 2, ..., n}. h_1(x_1) and h_2(x_2) are also convex. 
For Algorithm 2, in any epoch, we have

E_{i_k}[L(x^{k+1}_1, x^{k+1}_2, \lambda^*)] - (1 - \theta_1 - \theta_2) L(x^k_1, x^k_2, \lambda^*) - \theta_2 L(\tilde{x}_1, \tilde{x}_2, \lambda^*)
\leq \frac{\theta_1}{2\beta} (\|\hat\lambda^k - \lambda^*\|^2 - E_{i_k}[\|\hat\lambda^{k+1} - \lambda^*\|^2])
- \frac{1}{2} E_{i_k}(\|x^{k+1}_1 - (1 - \theta_1 - \theta_2) x^k_1 - \theta_2 \tilde{x}_1 - \theta_1 x^*_1\|^2_{G_3}) + \frac{1}{2} \|y^k_1 - (1 - \theta_1 - \theta_2) x^k_1 - \theta_2 \tilde{x}_1 - \theta_1 x^*_1\|^2_{G_3}
- \frac{1}{2} E_{i_k}(\|x^{k+1}_2 - (1 - \theta_1 - \theta_2) x^k_2 - \theta_2 \tilde{x}_2 - \theta_1 x^*_2\|^2_{G_4}) + \frac{1}{2} \|y^k_2 - (1 - \theta_1 - \theta_2) x^k_2 - \theta_2 \tilde{x}_2 - \theta_1 x^*_2\|^2_{G_4},

where E_{i_k} denotes that the expectation is taken over the random samples in the mini-batch I_{k,s}, L(x_1, x_2, \lambda) = F_1(x_1) + F_2(x_2) + \langle \lambda, A_1 x_1 + A_2 x_2 - b \rangle, \hat\lambda^k = \tilde\lambda^k + \frac{\beta (1 - \theta_1)}{\theta_1} (A x^k - b), G_3 = (L_1 + \frac{\beta \|A_1^T A_1\|}{\theta_1}) I - \frac{\beta}{\theta_1} A_1^T A_1, and G_4 = ((1 + \frac{1}{b \theta_2}) L_2 + \frac{\beta \|A_2^T A_2\|}{\theta_1}) I.

Then Theorem 1 analyzes ACC-SADMM over the whole iteration process; it is the key convergence result of the paper.

Theorem 1 If the conditions in Lemma 1 hold, then we have

E(\frac{m}{\theta_{1,S}} (F(\hat{x}_S) - F(x^*) + \langle \lambda^*, A \hat{x}_S - b \rangle))
+ E(\frac{1}{2\beta} \|\frac{\beta m}{\theta_{1,S}} (A \hat{x}_S - b) - \beta (m-1) \theta_2 (A x^0_0 - b) + \tilde\lambda^0_0 - \lambda^*\|^2)
\leq C_3 (F(x^0_0) - F(x^*) + \langle \lambda^*, A x^0_0 - b \rangle) + \frac{1}{2\beta} \|\tilde\lambda^0_0 + \frac{\beta (1 - \theta_{1,0})}{\theta_{1,0}} (A x^0_0 - b) - \lambda^*\|^2
+ \frac{1}{2} \|x^0_{0,1} - x^*_1\|^2_{(\theta_{1,0} L_1 + \beta \|A_1^T A_1\|) I - \beta A_1^T A_1} + \frac{1}{2} \|x^0_{0,2} - x^*_2\|^2_{((1 + \frac{1}{b \theta_2}) \theta_{1,0} L_2 + \beta \|A_2^T A_2\|) I},    (10)

where C_3 = \frac{1 - \theta_{1,0} + (m-1) \theta_2}{\theta_{1,0}}.

Corollary 1 directly demonstrates that ACC-SADMM has a non-ergodic O(1/K) convergence rate.

Corollary 1 If the conditions in Lemma 1 hold, we have

E|F(\hat{x}_S) - F(x^*)| \leq O(1/S),   and   E\|A \hat{x}_S - b\| \leq O(1/S).    (11)

We can find that \hat{x}_S depends on the latest m iterates x^k_S. So our convergence result is in the non-ergodic sense, while the analyses for SVRG-ADMM [4] and SAG-ADMM [3] are in the ergodic sense, since they consider the point \hat{x}_S = \frac{1}{mS} \sum_{s=1}^{S} \sum_{k=1}^{m} x^k_s, which is a convex combination of the x^k_s over all the iterations.
Now we directly use the theoretical results of [15] to demonstrate that our algorithm is optimal when there exists a non-smooth term in the objective function.

Theorem 2 For the following problem:

min_{x_1, x_2} F_1(x_1) + F_2(x_2),   s.t.   x_1 - x_2 = 0,    (12)

let the ADMM type algorithm to solve it be:

• Generate \lambda^k_2 and y^k_2 in any way,
• x^{k+1}_1 = Prox_{F_1/\beta_k}(y^k_2 - \frac{\lambda^k_2}{\beta_k}),
• Generate \lambda^{k+1}_1 and y^{k+1}_1 in any way,
• x^{k+1}_2 = Prox_{F_2/\beta_k}(y^{k+1}_1 - \frac{\lambda^{k+1}_1}{\beta_k}).

Then there exist convex functions F_1 and F_2 defined on X = {x \in R^{6k+5} : \|x\| \leq B} for the above general ADMM method, satisfying

L \|\hat{x}^k_2 - \hat{x}^k_1\| + |F_1(\hat{x}^k_1) + F_2(\hat{x}^k_2) - F_1(x^*_1) - F_2(x^*_2)| \geq \frac{LB}{8(k+1)},    (13)

where \hat{x}^k_1 = \sum_{i=1}^{k} \alpha^i_1 x^i_1 and \hat{x}^k_2 = \sum_{i=1}^{k} \alpha^i_2 x^i_2 for any \alpha^i_1 and \alpha^i_2 with i from 1 to k.

Theorem 2 is Theorem 11 in [15]; more details can be found there. Problem (12) is a special case of Problem (1), as we can set each F_{2,i}(x_2) = F(x_2) with i = 1, ..., n or set n = 1. So there is no ADMM type algorithm that converges faster than O(1/K) for Problem (1).

5 Discussions

We discuss some properties of ACC-SADMM and make further comparisons with some related methods.

Table 3: Size of datasets and mini-batch size we adopt in the experiments

Problem   | Dataset   | # training | # testing | # dimension x # class | # minibatch
----------|-----------|------------|-----------|-----------------------|------------
Lasso     | a9a       | 72,876     | 72,875    | 74 x 2                | 100
Lasso     | covertype | 290,506    | 290,506   | 54 x 2                | 100
Lasso     | mnist     | 60,000     | 10,000    | 784 x 10              | 100
Lasso     | dna       | 2,400,000  | 600,000   | 800 x 2               | 500
Multitask | ImageNet  | 1,281,167  | 50,000    | 4,096 x 1,000         | 2,000

5.1 Comparison with Katyusha

As we have mentioned in the Introduction, some intuitions of our algorithm are inspired by Katyusha [23], which obtains an O(1/K^2) algorithm for convex finite-sum problems. 
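Stepping back to Theorem 2 for a moment: the Prox_{F/\beta_k} steps there have closed forms for common regularizers. For instance, with F(x) = \mu \|x\|_1 the proximal map is coordinate-wise soft-thresholding; a minimal sketch of our own, not tied to the lower-bound construction:

```python
import numpy as np

def prox_l1(v, mu, beta):
    # Prox_{F/beta}(v) with F(x) = mu * ||x||_1, i.e.
    # argmin_x  mu * ||x||_1 + (beta / 2) * ||x - v||^2  (soft-thresholding)
    thresh = mu / beta
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

v = np.array([1.5, -0.2, 0.6, -2.0])
out = prox_l1(v, mu=0.5, beta=1.0)  # each entry shrunk toward 0 by 0.5
```
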
However, Katyusha cannot solve problems with linear constraints. Besides, Katyusha uses Nesterov's second scheme to accelerate the algorithm, while our method conducts acceleration through Nesterov's extrapolation (Nesterov's first scheme). And our proof uses the technique of [26], which is different from [23]. Our algorithm can easily be extended to unconstrained convex finite-sum problems, where it also obtains an O(1/K^2) rate but belongs to the Nesterov's first scheme^2.

5.2 The Growth of the Penalty Factor \beta/\theta_{1,s}

The penalty factor \beta/\theta_{1,s} increases linearly with the iterations. One might deem that this makes our algorithm impractical, because after dozens of epochs the large value of the penalty factor might slow down the decrement of the function value. However, we have not found any such bad influence. There may be two reasons: 1. In our algorithm, \theta_{1,s} decreases after each epoch (m iterations), which is much slower than in LADM-NE [15]. So the growth of the penalty factor works as a continuation technique [28], which may help to decrease the function value. 2. From Theorem 1, our algorithm converges in O(1/S) no matter how large \theta_{1,s} is. So from the theoretical viewpoint, a large \theta_{1,s} cannot slow down our algorithm. We find that OPT-ADMM [2] also needs to decrease its step size with the iterations. However, its step size decreasing rate is O(k^{3/2}), which is faster than ours.

5.3 The Importance of Non-ergodic O(1/K)

SAG-ADMM [3] and SVRG-ADMM [4] accelerate SADMM to ergodic O(1/K). In Theorem 9 of [15], the authors generate a class of functions showing that the original ADMM has a tight non-ergodic O(1/\sqrt{K}) convergence rate. When n = 1, SAG-ADMM and SVRG-ADMM are the same as batch ADMM, so their convergence rates are no better than O(1/\sqrt{K}). 
So in the non-ergodic sense, our algorithm does have a faster convergence rate than VR based SADMM methods.
We now highlight the importance of our non-ergodic result. As we have mentioned in the Introduction, in practice the output of ADMM methods is the non-ergodic result x_K, not the mean of x_1 to x_K. For deterministic ADMM, the proof of the ergodic O(1/K) rate was proposed in [11], after ADMM had become a prevailing method for solving machine learning problems [29]; for stochastic ADMM, e.g. SVRG-ADMM [4], the authors give an ergodic O(1/K) proof, but in experiments what they emphasize using is the mean value of the last epoch as the result. As the non-ergodic results are closer to reality, our algorithm is much faster than VR based SADMM methods, even when its rate is seemingly the same. Actually, though VR based SADMM methods have provably faster rates than STOC-ADMM, the improvement in practice is evident only after a number of iterations, when the iterates are close to the convergence point, rather than at the early stage. In both [3] and [4], the authors claim that SAG-ADMM and SVRG-ADMM are sensitive to initial points. We also find that if the step sizes are set based on their theoretical guidance, sometimes they are even slower than STOC-ADMM (see Fig. 1), as the early stage lasts longer when the step size is small. 
Our algorithm is faster than these two algorithms, which demonstrates that Nesterov's extrapolation has truly accelerated the speed and that the integration of extrapolation and VR techniques is harmonious and complementary.

^2 We follow [26] in naming the extrapolation scheme Nesterov's first scheme and the three-step scheme [27] Nesterov's second scheme.

(Panels: (a) a9a-original, (b) covertype-original, (c) mnist-original, (d) dna-original, (e) a9a-group, (f) covertype-group, (g) mnist-group, (h) dna-group.)
Figure 1: Experimental results of solving the original Lasso (Top) and Graph-Guided Fused Lasso (Bottom). The computation time includes the cost of calculating full gradients for SVRG based methods. SVRG-ADMM and SAG-ADMM are initialized by running STOC-ADMM for 3n/b iterations. "-ERG" represents the ergodic results of the corresponding algorithms.

6 Experiments

We conduct experiments to show the effectiveness of our method^3. We compare our method with the following state-of-the-art SADMM algorithms: (1) STOC-ADMM [1], (2) SVRG-ADMM [4], (3) OPT-SADMM [2], (4) SAG-ADMM [3]. We ignore SDCA-ADMM [17] in our comparison since it gives no analysis on general convex problems and is also not faster than SVRG-ADMM [4]. Experiments are performed on an Intel(R) CPU i7-4770 @ 3.40GHz machine with 16 GB memory. Our experiments focus on two typical problems [4]: the Lasso problem and Multitask Learning. Due to space limitations, the experiment on Multitask Learning is shown in the Supplementary Materials. For the Lasso problem, we perform experiments under the following typical variations. The first is the original Lasso problem, and the second is the Graph-Guided Fused Lasso model: min_x \mu \|Ax\|_1 + \frac{1}{n} \sum_{i=1}^{n} l_i(x), where l_i(x) is the logistic loss on sample i, and A = [G; I] is a matrix encoding the feature sparsity pattern. 
G is the sparsity pattern of the graph obtained by sparse inverse covariance estimation [30]. The experiments are performed on four benchmark datasets: a9a, covertype, mnist and dna^4. The details of the datasets and the mini-batch sizes that we use in all SADMM methods are shown in Table 3. Like [3] and [4], we fix \mu = 10^{-5} and report the performance based on (x_t, A x_t) to satisfy the constraints of ADMM. Results are averaged over five repetitions, and we set m = 2n/b for all the algorithms. For the original Lasso problem, the step sizes are set through the theoretical guidance for each algorithm. For the Graph-Guided Fused Lasso, the best step sizes are obtained through a parameter search for the best convergence progress. For all algorithms except ACC-SADMM, we use the continuation technique [28] to accelerate them. SAG-ADMM is performed only on the first three datasets due to its large memory requirement.
The experimental results are shown in Fig. 1. We find that our algorithm consistently outperforms the other compared methods on all these datasets for both problems, which verifies our theoretical analysis. 
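The Graph-Guided Fused Lasso objective above can be written down directly. The sketch below (toy data and our own variable names, not the experimental code) evaluates min_x \mu \|Ax\|_1 + \frac{1}{n} \sum_i l_i(x) with logistic losses and A = [G; I]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 6
Z = rng.standard_normal((n, d))           # features, one row per sample
y = rng.choice([-1.0, 1.0], size=n)       # binary labels
G = rng.choice([0.0, 1.0], size=(d, d))   # stand-in for the graph's sparsity pattern
A = np.vstack([G, np.eye(d)])             # A = [G; I]
mu = 1e-5

def logistic_loss(x):
    # (1/n) * sum_i log(1 + exp(-y_i * z_i^T x))
    return np.mean(np.log1p(np.exp(-y * (Z @ x))))

def ggfl_objective(x):
    return mu * np.abs(A @ x).sum() + logistic_loss(x)

x = np.zeros(d)
val = ggfl_objective(x)  # at x = 0 the penalty vanishes and the loss is log(2)
```

In the ADMM splitting, the non-smooth term \mu \|Ax\|_1 is assigned to an auxiliary variable constrained to equal Ax, exactly as in Problem (1).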
The details about parameter settings, experimental results where we set a larger fixed step size for the Graph-Guided Fused Lasso problem, curves of the test error, the memory costs of all algorithms, and the Multitask Learning experiment are shown in the Supplementary Materials.

^3 The code will be available at http://www.cis.pku.edu.cn/faculty/vision/zlin/zlin.htm.
^4 a9a, covertype and dna are from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, and mnist is from: http://yann.lecun.com/exdb/mnist/.
7 Conclusion

We propose ACC-SADMM for general convex finite-sum problems. ACC-SADMM integrates Nesterov's extrapolation and VR techniques and achieves a non-ergodic O(1/K) convergence rate, which is of both theoretical and practical importance. Our experiments demonstrate that the algorithm is faster than other SADMM methods.

Acknowledgment

Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502) and National Natural Science Foundation (NSF) of China (grant nos. 61625301, 61731018, and 61231002).

References

[1] Hua Ouyang, Niao He, Long Tran, and Alexander G. Gray. Stochastic alternating direction method of multipliers. In Proc. Int'l. Conf. on Machine Learning, 2013.

[2] Samaneh Azadi and Suvrit Sra. Towards an optimal stochastic alternating direction method of multipliers. In Proc. Int'l. Conf. on Machine Learning, 2014.

[3] Wenliang Zhong and James Tin-Yau Kwok. Fast stochastic alternating direction method of multipliers. In Proc. Int'l. Conf. on Machine Learning, 2014.

[4] Shuai Zheng and James T. Kwok. Fast-and-light stochastic ADMM. In Proc. Int'l. Joint Conf.
on Artificial Intelligence, 2016.

[5] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[6] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Proc. Conf. Advances in Neural Information Processing Systems, 2007.

[7] Li Shen, Gang Sun, Zhouchen Lin, Qingming Huang, and Enhua Wu. Adaptive sharing for image classification. In Proc. Int'l. Joint Conf. on Artificial Intelligence, 2015.

[8] Seyoung Kim, Kyung-Ah Sohn, and Eric P. Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.

[9] Kaiye Wang, Ran He, Liang Wang, Wei Wang, and Tieniu Tan. Joint feature selection and subspace learning for cross-modal retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 38(10):1–1, 2016.

[10] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.

[11] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

[12] Zhouchen Lin, Risheng Liu, and Huan Li. Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Machine Learning, 99(2):287–325, 2015.

[13] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. 2016.

[14] Caihua Chen, Raymond H. Chan, Shiqian Ma, and Junfeng Yang.
Inertial proximal ADMM for linearly constrained separable convex optimization. SIAM Journal on Imaging Sciences, 8(4):2239–2267, 2015.

[15] Huan Li and Zhouchen Lin. Optimal nonergodic O(1/k) convergence rate: When linearized ADM meets Nesterov's extrapolation. arXiv preprint arXiv:1608.06366, 2016.

[16] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proc. Int'l. Conf. on Machine Learning, 2013.

[17] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proc. Int'l. Conf. on Machine Learning, 2014.

[18] Léon Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, pages 146–168. 2004.

[19] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Advances in Neural Information Processing Systems, 2013.

[20] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Conf. Advances in Neural Information Processing Systems, 2014.

[21] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.

[22] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). In Doklady AN SSSR, volume 269, pages 543–547, 1983.

[23] Zeyuan Allen-Zhu. Katyusha: The first truly accelerated stochastic gradient method. In Annual Symposium on the Theory of Computing, 2017.

[24] Zhouchen Lin, Risheng Liu, and Zhixun Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In Proc. Conf. Advances in Neural Information Processing Systems, 2011.

[25] Xiaoqun Zhang, Martin Burger, and Stanley Osher.
A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 46:20–46, 2011.

[26] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, 2008.

[27] Yurii Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24(3):509–517, 1988.

[28] Wangmeng Zuo and Zhouchen Lin. A generalized accelerated proximal gradient approach for total-variation-based image restoration. IEEE Trans. on Image Processing, 20(10):2748, 2011.

[29] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.

[30] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.