{"title": "Distributed Low-rank Matrix Factorization With Exact Consensus", "book": "Advances in Neural Information Processing Systems", "page_first": 8422, "page_last": 8432, "abstract": "Low-rank matrix factorization is a problem of broad importance, owing to the ubiquity of low-rank models in machine learning contexts. In spite of its non- convexity, this problem has a well-behaved geometric landscape, permitting local search algorithms such as gradient descent to converge to global minimizers. In this paper, we study low-rank matrix factorization in the distributed setting, where local variables at each node encode parts of the overall matrix factors, and consensus is encouraged among certain such variables. We identify conditions under which this new problem also has a well-behaved geometric landscape, and we propose an extension of distributed gradient descent (DGD) to solve this problem. The favorable landscape allows us to prove convergence to global optimality with exact consensus, a stronger result than what is provided by off-the-shelf DGD theory.", "full_text": "Distributed Low-rank Matrix Factorization\n\nWith Exact Consensus\n\nZhihui Zhu\u21e4\n\nMathematical Institute for Data Science\n\nJohns Hopkins University\n\nBaltimore, MD, USA\nzzhu29@jhu.edu\n\nQiuwei Li\u21e4\n\nDepartment of Electrical Engineering\n\nColorado School of Mines\n\nGolden, CO, USA\nqiuli@mines.edu\n\nXinshuo Yang\n\nDepartment of Electrical Engineering\n\nColorado School of Mines\n\nGolden, CO, USA\n\nxinshuoyang@mines.edu\n\nGongguo Tang\n\nDepartment of Electrical Engineering\n\nColorado School of Mines\n\nGolden, CO, USA\ngtang@mines.edu\n\nMichael B. Wakin\n\nDepartment of Electrical Engineering\n\nColorado School of Mines\n\nGolden, CO, USA\nmwakin@mines.edu\n\nAbstract\n\nLow-rank matrix factorization is a problem of broad importance, owing to the\nubiquity of low-rank models in machine learning contexts. 
In spite of its non-\nconvexity, this problem has a well-behaved geometric landscape, permitting local\nsearch algorithms such as gradient descent to converge to global minimizers. In this\npaper, we study low-rank matrix factorization in the distributed setting, where local\nvariables at each node encode parts of the overall matrix factors, and consensus\nis encouraged among certain such variables. We identify conditions under which\nthis new problem also has a well-behaved geometric landscape, and we propose\nan extension of distributed gradient descent (DGD) to solve this problem. The\nfavorable landscape allows us to prove convergence to global optimality with exact\nconsensus, a stronger result than what is provided by off-the-shelf DGD theory.\n\n1\n\nIntroduction\n\nA promising line of recent literature has examined the nonconvex objective functions that arise\nwhen certain matrix optimization problems are solved in factored form, that is, when a low-rank\noptimization variable X is replaced by a product of two thin matrices UVT and the optimization\nproceeds jointly over U and V [6, 2, 10, 14, 23, 24, 27, 37, 38, 41, 42, 43]. In many cases, a study of\nthe geometric landscape of these objective functions reveals that\u2014despite their nonconvexity\u2014they\npossess a certain favorable geometry. In particular, many of the resulting objective functions (i) satisfy\nthe strict saddle property [13, 36], where every critical point is either a local minimum or a strict\nsaddle point, at which the Hessian matrix has at least one negative eigenvalue, and (ii) have no\nspurious local minima (every local minimum corresponds to a global minimum).\n\n\u21e4Equal contribution. 
ZZ is also with the Department of Electrical & Computer Engineering, University of Denver.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOne such problem\u2014which is both of fundamental importance and representative of structures that arise in many other machine learning problems [19]\u2014is the low-rank matrix approximation problem, where given a data matrix Y the objective is to minimize \|UV^T - Y\|_F^2. As we explain in Theorem 3.1, building on recent analysis in [32] and [42], this problem satisfies the strict saddle property and has no spurious local minima.\nIn parallel with the recent focus on the favorable geometry of certain nonconvex landscapes, it has been shown that a number of local search algorithms have the capability to avoid strict saddle points and converge to a local minimizer for problems that satisfy the strict saddle property [21, 17, 33, 25, 26, 28]; see [9, 11] for an overview. As stated in [20] and as we summarize in Theorem 2.2, gradient descent when started from a random initialization is one such algorithm. For problems such as low-rank matrix approximation that have no spurious local minima, converging to a local minimizer means converging to a global minimizer.\nTo date, the geometric and algorithmic research described above has largely focused on centralized optimization, where all computations happen at one \u201ccentral\u201d node that has full access, for example, to the data matrix Y.\nIn this work, we study the impact of distributing the factored optimization problem, such as would be necessary if the data matrix Y in low-rank matrix approximation were partitioned into submatrices Y = [Y_1 \; Y_2 \; \cdots \; Y_J], each of which was available at only one node in a network. By similarly partitioning the matrix V, one can partition the objective function\n\n\|UV^T - Y\|_F^2 = \sum_{j=1}^J \|U V_j^T - Y_j\|_F^2.   (1)\n\nOne can attempt to minimize the resulting objective, in which the matrix U appears in every term of the summation, using techniques similar to classical distributed algorithms such as distributed gradient descent (DGD) [30], distributed Riemannian gradient descent (DRGD) [22], gossip-based methods [18, 4], and primal-dual methods [5, 35, 7, 15]. At a minimum, however, these algorithms involve creating local copies U_1, U_2, . . . , U_J of the optimization variable U and iteratively sharing updates of these variables with the aim of converging to a consensus where (exactly or approximately) U_1 = U_2 = \cdots = U_J. The introduction of additional variables (and possibly constraints) means that these distributed algorithms are navigating a potentially different geometric landscape than their centralized counterparts.\nIn this paper we study a straightforward extension of DGD for solving such problems. This extension, which we term DGD+LOCAL, resembles classical DGD in that each node j has a local copy U_j of the optimization variable U as described above. Additionally, however, each node has a local block V_j of the partitioned optimization variable V, and this block exists only locally at node j without any consensus or sharing among other nodes.
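The DGD+LOCAL scheme just described (a per-node copy U_j of the shared factor, mixed through network weights, plus a purely local block V_j updated by an ordinary gradient step) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions of our own choosing, not the paper's experimental setup: the 3-node fully connected network, the doubly stochastic weights, the stepsize, and the iteration count are all illustrative.

```python
import numpy as np

def dgd_local_factorization(Ys, W_tilde, r, mu, iters, rng):
    """Sketch of DGD+LOCAL for minimizing sum_j ||U V_j^T - Y_j||_F^2.

    Node j keeps a local copy U_j of the shared factor (averaged with
    neighbors through the weights W_tilde, then nudged by a gradient step)
    and a purely local block V_j (plain gradient descent, never shared)."""
    J = len(Ys)
    n = Ys[0].shape[0]
    Us = [0.1 * rng.standard_normal((n, r)) for _ in range(J)]
    Vs = [0.1 * rng.standard_normal((Y.shape[1], r)) for Y in Ys]
    for _ in range(iters):
        resid = [Us[j] @ Vs[j].T - Ys[j] for j in range(J)]
        # Shared variable: neighbor averaging plus a local gradient step.
        new_Us = [sum(W_tilde[j, i] * Us[i] for i in range(J))
                  - 2 * mu * resid[j] @ Vs[j] for j in range(J)]
        # Local variable: ordinary gradient step, no consensus term.
        Vs = [Vs[j] - 2 * mu * resid[j].T @ Us[j] for j in range(J)]
        Us = new_Us
    return Us, Vs

rng = np.random.default_rng(0)
J, n, r = 3, 6, 2
A = rng.standard_normal((n, r))
Bs = [rng.standard_normal((4, r)) for _ in range(J)]  # Y_j = A B_j^T has rank <= r
Ys = [A @ B.T for B in Bs]
# Symmetric doubly stochastic weights on a fully connected 3-node graph;
# off-diagonal row mass is 1/3 < 1/2, matching the analysis later in the paper.
W_tilde = np.full((J, J), 1.0 / 6) + np.eye(J) / 2
Us, Vs = dgd_local_factorization(Ys, W_tilde, r, mu=0.01, iters=8000, rng=rng)
obj = sum(np.linalg.norm(Us[j] @ Vs[j].T - Ys[j], "fro") ** 2 for j in range(J))
consensus = max(np.linalg.norm(Us[0] - Us[j], "fro") for j in range(1, J))
```

On this toy rank-r instance the residual and the disagreement between the local copies both shrink toward zero, which is the behavior (exact consensus plus global optimality) that the paper establishes under its stated conditions.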
(In contrast, applying classical DGD to (1) would actually require every node to maintain and update a copy of both entire matrices U and V.)\nWe present a geometric framework for analyzing the convergence of DGD+LOCAL in such problems. Our framework relies on a straightforward conversion which reveals (for example in the low-rank matrix approximation problem) that DGD+LOCAL as described above is equivalent to running conventional gradient descent on the objective function\n\n\sum_{j=1}^J \Big( \|U_j V_j^T - Y_j\|_F^2 + \sum_{i=1}^J w_{ji} \|U_j - U_i\|_F^2 \Big),   (2)\n\nwhere w_{ji} are weights inherited from the DGD+LOCAL iterations. This objective function (2) differs from the original objective function (1) in two respects: it contains more optimization variables, and it includes a quadratic regularizer to encourage consensus. Although the geometry of (1) is understood to be well-behaved, new questions arise about the geometry of (2): Does it contain new critical points (local minima that are not global, saddle points that are not strict)?
And on the consensus subspace, where U_1 = \cdots = U_J, how do the critical points of (2) relate to the critical points of (1)?\nWe answer these questions and build on the algorithmic results for gradient descent to identify in Theorem 3.2 sufficient conditions under which DGD+LOCAL is guaranteed to converge to a point that (i) is exactly on the consensus subspace, and (ii) coincides with a global minimizer of problem (1). Under these conditions, the distributed low-rank matrix approximation problem is shown to enjoy the same geometric and algorithmic guarantees as its well-behaved centralized counterpart.\n\n2\n\n\f[Figure 1 appears here: two semilog convergence plots (objective value and consensus error versus iteration).]\n\nFigure 1: (Left) Exact optimality and consensus are possible for DGD+LOCAL on a distributed low-rank matrix approximation problem. (Right) Such properties do not hold in general, as demonstrated by running DGD on a least squares problem. Full details are provided in the supplementary material.\n\nFor the distributed low-rank matrix approximation problem, these guarantees are stronger than those that appear in the literature for classical DGD and more general problems. In particular, we show exact convergence to the consensus subspace with a fixed DGD+LOCAL stepsize, which in more general works is accomplished only with diminishing DGD stepsizes for convex [8, 16] and nonconvex [40] problems or by otherwise modifying DGD as in the EXTRA algorithm [34]. Moreover, we show convergence to a global minimizer of the original centralized nonconvex problem. Until recently, existing DGD results either considered convex problems [8, 16] or showed convergence to stationary points of nonconvex problems [40].
Very recently, it was also shown [12] that with an appropriately\nsmall stepsize, DGD can converge to an arbitrarily small neighborhood of a second-order critical point\nfor general nonconvex problems with additional technical assumptions. Our work differs from [12]\nin our use of DGD+LOCAL (rather than DGD) and our focus on one speci\ufb01c problem where we can\nestablish stronger guarantees of exact global optimality and exact consensus without requiring an\narbitrarily small (or diminishing) stepsize.\nTo summarize the above discussion, in this paper we make the following contributions:\n\u2022 For general problems and under certain conditions (see Theorem 2.5 and Theorem 2.6), following\nthe analysis in [21], we show that DGD+LOCAL converges to a second-order critical point of the\nregularized objective function (7). In the case of distributed low-rank matrix approximation, (7)\ncorresponds to (2).\n\n\u2022 For general problems and under a certain symmetric gradient condition (see Proposition 2.3), we\nshow that every critical point of the distributed objective function (7) lies exactly on the consensus\nsubspace, thus ensuring that DGD+LOCAL converges to an exact consensus point even with a \ufb01xed\nstepsize. We further show (see Theorem 2.7) that along the consensus subspace, the distributed\nobjective function (7) has a certain geometric correspondence to its centralized counterpart (3). In\nparticular, every critical point of (7) corresponds to a critical point of (3), and every strict saddle\npoint of (3) corresponds to a strict saddle point of (7).\n\n\u2022 We show (see Theorem 3.2) that distributed low-rank matrix approximation satis\ufb01es the symmetric\ngradient condition. 
Combined with the fact that the centralized low-rank matrix approximation problem satisfies the strict saddle property and has no spurious local minima (see Theorem 3.1), we conclude that DGD+LOCAL with a fixed stepsize achieves exact consensus and global optimality for distributed low-rank matrix approximation (see Theorem 3.2).\n\nTo demonstrate our conclusion for distributed low-rank matrix approximation, the left panel in Figure 1 shows the convergence of DGD+LOCAL for a low-rank matrix factorization problem whose setup is described in the supplementary material. Both the blue line (showing the objective value) and the red line (showing the consensus error) converge to zero. In contrast, the right panel in Figure 1 shows that DGD fails to achieve such optimality and consensus on a different, least squares problem. We also include experiments on distributed matrix completion and matrix sensing in the supplementary material.\nOur main results on distributed low-rank matrix factorization are presented in Section 3. These results build on several more general algorithmic and geometric results that we first establish in Section 2. The results from Section 2 may have broader applicability, and the geometric and algorithmic discussions in Section 2 may have independent interest from one another. All proofs can be found in the supplementary material.\n\n3\n\n\f2 General Analysis of DGD+LOCAL\n\nConsider a centralized minimization problem that can be written in the form\n\nminimize_{x, y} f(x, y) = \sum_{j=1}^J f_j(x, y_j),   (3)\n\nwhere y = [y_1^T \cdots y_J^T]^T. Here x is the common variable in all of the objective functions \{f_j\}_{j \in [J]} and y_j is the variable corresponding only to f_j.\nThe standard DGD algorithm [30] is stated for problems of minimizing f(x) = \sum_{j=1}^J f_j(x), and for such problems it involves updates of the form x_j(k+1) = \sum_{i=1}^J \widetilde{w}_{ji} x_i(k) - \mu \nabla_x f_j(x_j(k)), where \{\widetilde{w}_{ji}\} are a set of symmetric nonnegative weights, and \widetilde{w}_{ji} is positive if and only if nodes i and j are neighbors in the network or i = j. Throughout this paper, we will make the common assumption [29] that\n\n\sum_{i=1}^J \widetilde{w}_{ji} = 1 for all j \in [J].   (4)\n\nA very natural extension of DGD to problems of the form (3)\u2014which involve local copies of the shared variable x and local partitions of the variable y\u2014is to perform the updates\n\nx_j(k+1) = \sum_{i=1}^J \widetilde{w}_{ji} x_i(k) - \mu \nabla_x f_j(x_j(k), y_j(k)),\ny_j(k+1) = y_j(k) - \mu \nabla_y f_j(x_j(k), y_j(k)).   (5)\n\nBecause we are interested in solving problems of the form (3), we refer to (5) as DGD+LOCAL throughout this paper. We note that DGD+LOCAL is not equivalent to the algorithm one would obtain by applying classical DGD to reach consensus over the concatenated variables x and y, as this would require each node to maintain a local copy of the entire vector y. For the same reason, DGD+LOCAL is not equivalent to the blocked variable problem described in [31].\n\n2.1 Relation to Gradient Descent\nNote that we can rewrite the first equation in (5) as\n\nx_j(k+1) = x_j(k) - \mu \Big( \nabla_x f_j(x_j(k), y_j(k)) + \sum_{i \neq j} \frac{\widetilde{w}_{ji}}{\mu} (x_j(k) - x_i(k)) \Big),\n\nwhere the assumption (4) is used. Thus, by defining \{w_{ji}\} such that\n\nw_{ji} = w_{ij} = \widetilde{w}_{ji}/(4\mu) for i \neq j, and w_{ji} = 0 for i = j,   (6)\n\nwe see that DGD+LOCAL (5) is equivalent to applying standard gradient descent (with stepsize \mu) to\n\nminimize_z g(z) = \sum_{j=1}^J \Big( f_j(x_j, y_j) + \sum_{i=1}^J w_{ji} \|x_j - x_i\|^2 \Big),   (7)\n\nwhere z = (x1, . . . , xJ , y1, . . .
, yJ ) and W = \{w_{ji}\} is a J \times J connectivity matrix with nonnegative entries defined in (6) and zeros on the diagonal.\n\n2.2 Algorithmic Analysis\nWe are interested in understanding the convergence of the gradient descent algorithm when it is applied to minimizing g(z) in (7); as we have argued in Section 2.1, this is equivalent to running the DGD+LOCAL algorithm (5) to minimize the objective function f(x, y) in (3).\n\n4\n\n\f2.2.1 Objective Function Properties and Convergence of Gradient Descent\nIn this section, we review and present convergence results for gradient descent, before more explicitly discussing the convergence of DGD+LOCAL in Section 2.2.2. Under certain conditions, we can guarantee that gradient descent will converge to a second-order critical point of the objective function g(z) in (7). The proof relies on certain properties of the functions f_j comprising (3). We describe these properties before discussing convergence results. The first property concerns the assumption that each f_j comprising (3) has Lipschitz gradient. In this case we can also argue that g in (7) has Lipschitz gradient.\n\nProposition 2.1. Let f(x, y) = \sum_{j=1}^J f_j(x, y_j) be an objective function as in (3) and let g(z) be as in (7) with z = (x1, . . . , xJ , y1, . . . , yJ). Suppose that each f_j has Lipschitz gradient, i.e., \nabla f_j is Lipschitz continuous with constant L_j > 0. Then \nabla g is Lipschitz continuous with constant L_g = L + 2\omega/\mu, where L := max_j L_j, \omega := max_j \sum_{i \neq j} \widetilde{w}_{ji}, and \widetilde{w}_{ji} and \mu are the DGD+LOCAL weights and stepsize as in (5).\n\nThe second property concerns the following \u0141ojasiewicz inequality.\nDefinition 2.1. [1] Assume that h : R^n \to R is continuously differentiable.
Then h is said to satisfy the \u0141ojasiewicz inequality if, for any critical point \bar{x} of h, there exist \delta > 0, C_1 > 0, and \theta \in [0, 1) (which is often referred to as the KL exponent) such that\n\n|h(x) - h(\bar{x})|^\theta \leq C_1 \|\nabla h(x)\|, \forall x \in B(\bar{x}, \delta).\n\nThis \u0141ojasiewicz inequality (or the more general Kurdyka-\u0141ojasiewicz (KL) inequality for nonsmooth problems) characterizes the local geometric properties of the objective function around its critical points and has proved useful for convergence analysis [1, 3]. The \u0141ojasiewicz inequality (or KL inequality) is very general and holds for most problems in engineering. For example, every analytic function satisfies the \u0141ojasiewicz inequality, but each function may have a different \u0141ojasiewicz exponent \theta, which determines the convergence rate; see [1, 3] for details.\nA general result for convergence of gradient descent to a first-order critical point for a function satisfying the \u0141ojasiewicz inequality is as follows.2\nTheorem 2.1. [1] Suppose inf_{R^n} h > -\infty and h satisfies the \u0141ojasiewicz inequality. Also assume \nabla h is Lipschitz continuous with constant L > 0. Let \{x(k)\} be the sequence generated by gradient descent x(k+1) = x(k) - \mu \nabla h(x(k)) with \mu < 1/L. Then if the sequence \{x(k)\} is bounded, it converges to a critical point of h.\n\nThe following result further characterizes the convergence behavior of gradient descent to a second-order critical point.\nTheorem 2.2. [21] Suppose h is a twice-continuously differentiable function and \nabla h is Lipschitz continuous with constant L > 0. Let \{x(k)\} be the sequence generated by gradient descent x(k+1) = x(k) - \mu \nabla h(x(k)) with \mu < 1/L. Suppose x(0) is chosen randomly from a probability distribution supported on a set S having positive measure.
Then the sequence \{x(k)\} almost surely avoids strict saddles, where the Hessian has at least one negative eigenvalue.\n\nTheorems 2.1 and 2.2 apply for functions h that globally satisfy the \u0141ojasiewicz and Lipschitz gradient conditions. In some problems, however, one or both of these properties may be satisfied only locally. Nevertheless, under an assumption of bounded iterations\u2014as is already made in Theorem 2.1\u2014it is possible to extend the first- and second-order convergence results to such functions. For example, one can extend Theorem 2.1 as follows by noting that the original derivation in [1] used the \u0141ojasiewicz property only locally around limit points of the sequence \{x(k)\}.\nTheorem 2.3. [1] Suppose inf_{R^n} h > -\infty. For \rho > 0, let B_\rho denote the open ball of radius \rho, B_\rho := \{x : \|x\|_2 < \rho\}, and suppose h satisfies the \u0141ojasiewicz inequality at all points x \in B_\rho. Also assume \nabla h is Lipschitz continuous with constant L > 0. Let \{x(k)\} be the sequence generated by gradient descent x(k+1) = x(k) - \mu \nabla h(x(k)) with \mu < 1/L. Suppose \{x(k)\} \subset B_\rho and all limit points of \{x(k)\} are in B_\rho. Then the sequence \{x(k)\} converges to a critical point of h.\n\n2The result in [1] is stated for the proximal method, but the result can be extended to gradient descent as long as \mu < 1/L.\n\n5\n\n\fThe following result establishes second-order convergence for a function with a locally Lipschitz gradient.\nTheorem 2.4. Let \rho > 0, and consider an objective function h where:\n\n1. inf_{R^n} h > -\infty,\n2. h satisfies the \u0141ojasiewicz inequality within B_\rho,\n3. h is twice-continuously differentiable, and\n\n4.
|h(x)| \leq L_0, \|\nabla h(x)\| \leq L_1, and \|\nabla^2 h(x)\|_2 \leq L_2 for all x \in B_{2\rho}.\n\nSuppose the gradient descent stepsize\n\n\mu < \frac{1}{L_2 + 4L_1/\rho + (4 + 2\pi)L_0/\rho^2}.   (8)\n\nSuppose x(0) is chosen randomly from a probability distribution supported on a set S \subset B_\rho with S having positive measure, and suppose that under such random initialization, there is a positive probability that the sequence \{x(k)\} remains bounded in B_\rho and all limit points of \{x(k)\} are in B_\rho. Then conditioned on observing that \{x(k)\} \subset B_\rho and all limit points of \{x(k)\} are in B_\rho, gradient descent converges to a critical point of h, and the probability that this critical point is a strict saddle point is zero.\n\n2.2.2 Convergence Analysis of DGD+LOCAL\nAs described in the following theorem, under certain conditions, we can guarantee that the DGD+LOCAL algorithm (5) (which is equivalent to gradient descent applied to minimizing g(z) in (7)) will converge to a second-order critical point of the objective function g(z).\n\nTheorem 2.5. Let f(x, y) = \sum_{j=1}^J f_j(x, y_j) be an objective function as in (3) and let g(z) be as in (7) with z = (x1, . . . , xJ , y1, . . . , yJ). Suppose each f_j satisfies inf_{R^n} f_j > -\infty, is twice continuously differentiable, and has Lipschitz gradient, i.e., \nabla f_j is Lipschitz continuous with constant L_j > 0. Suppose g satisfies the \u0141ojasiewicz inequality. Let L := max_j L_j, and let \widetilde{w}_{ji} and \mu be the DGD+LOCAL weights and stepsize as in (5). Assume \omega := max_j \sum_{i \neq j} \widetilde{w}_{ji} < 1/2. Let \{z(k)\} be the sequence generated by the DGD+LOCAL algorithm in (5) with\n\n\mu < \frac{1 - 2\omega}{L}   (9)\n\nand with random initialization from a probability distribution supported on a set S having positive measure. Then if the sequence \{z(k)\} is bounded, it almost surely converges to a second-order critical point of the objective function in (7).\nRemark 2.1.
The requirement that the DGD+LOCAL stepsize satisfy \mu = O(1/L) also appears in the convergence analysis of DGD in [39, 40].\nRemark 2.2. The function g is guaranteed to satisfy the \u0141ojasiewicz inequality [1, 3], for example, if every f_j is semi-algebraic, because this will imply that g is semi-algebraic, and every semi-algebraic function satisfies the \u0141ojasiewicz inequality.\nRemark 2.3. The bounded sequence assumption is commonly used in analysis involving the \u0141ojasiewicz inequality [1, 3]. If we further assume that the function is coercive, the sequence must be bounded.\nRemark 2.4. Asymptotic convergence is a consequence of the \u0141ojasiewicz inequality [1, 3]. Empirically, however, DGD+LOCAL for distributed low-rank matrix approximation converges at a linear rate; see Figure 1.\nRemark 2.5. In order to satisfy (9), it must hold that \omega < 1/2. In the case where the DGD+LOCAL weight matrix \widetilde{W} is symmetric and doubly stochastic (i.e., \widetilde{W} has nonnegative entries and each of its rows and columns sums to 1), this condition is equivalent to requiring that each diagonal element of \widetilde{W} is larger than 1/2. Given any symmetric and doubly stochastic matrix \widetilde{W}, one can design a new weight matrix (\widetilde{W} + I)/2 that satisfies this requirement. This strategy is also mentioned at the end of [39, Section 2.1].\n\n6\n\n\fWe also have the following DGD+LOCAL convergence result when the functions f_j have only a locally Lipschitz gradient.\n\nTheorem 2.6. Let f(x, y) = \sum_{j=1}^J f_j(x, y_j) be an objective function as in (3) and let g(z) be as in (7) with z = (x1, . . . , xJ , y1, . . . , yJ). Let \rho > 0 and suppose each f_j satisfies\n\n1. inf_{R^n} f_j > -\infty,\n2. f_j is twice-continuously differentiable, and\n3. |f_j(x, y_j)| \leq L_{0,j}, \|\nabla f_j(x, y_j)\| \leq L_{1,j}, and \|\nabla^2 f_j(x, y_j)\|_2 \leq L_{2,j} for all (x, y_j) \in B_{2\rho}.\n\nSuppose also that g satisfies the \u0141ojasiewicz inequality within B_\rho.
Let \widetilde{w}_{ji} and \mu be the DGD+LOCAL weights and stepsize as in (5). Assume \omega := max_j \sum_{i \neq j} \widetilde{w}_{ji} < 1/2. Let \{z(k)\} be the sequence generated by the DGD+LOCAL algorithm in (5) with\n\n\mu < \frac{1 - 2\omega}{max_j \big( L_{2,j} + 4L_{1,j}/\rho + (4 + 2\pi)L_{0,j}/\rho^2 \big)}.   (10)\n\nSuppose z(0) is chosen randomly from a probability distribution supported on a set S \subset B_\rho with S having positive measure, and suppose that under such random initialization, there is a positive probability that the sequence \{z(k)\} remains bounded in B_\rho and all limit points of \{z(k)\} are in B_\rho.\nThen conditioned on observing that \{z(k)\} \subset B_\rho and all limit points of \{z(k)\} are in B_\rho, DGD+LOCAL converges to a critical point of the objective function in (7), and the probability that this critical point is a strict saddle point is zero.\n\n2.3 Geometric Analysis\n\nSection 2.2 establishes that, under certain conditions, DGD+LOCAL will converge to a second-order critical point of the objective function g(z) in (7).\nIn this section, we are interested in studying the geometric landscape of the distributed objective function in (7) and comparing it to the geometric landscape of the original centralized objective function in (3). In particular, we would like to understand how the critical points of g(z) in (7) are related to the critical points of f(x, y) in (3).\nThese problems differ in two important respects:\n\n\u2022 The objective function in (7) involves more optimization variables than that in (3). Thus, the optimization takes place in a higher-dimensional space and there is the potential for new features to be introduced into the geometric landscape.\n\n\u2022 The objective function in (7) involves a quadratic regularization term that will promote consensus among the variables x1, . . . , xJ. This term is absent from (3).
However, along the consensus subspace where x_1 = \cdots = x_J, this regularizer will be zero and the objective functions will coincide.\n\nDespite these differences, we characterize below some ways in which the geometric landscapes of the two problems may be viewed as equivalent. These results may have independent interest from the specific DGD+LOCAL convergence analysis in Section 2.2.\nOur first result establishes that if the sub-objective functions f_j satisfy certain properties, the formulation (7) does not introduce any new global minima outside of the consensus subspace.\n\nProposition 2.2. Let f(x, y) = \sum_{j=1}^J f_j(x, y_j) be as in (3). Suppose the topology defined by W is connected. Also suppose there exist x^\star (which is independent of j) and y_j^\star, j \in [J], such that\n\n(x^\star, y_j^\star) \in arg min_{x, y_j} f_j(x, y_j), \forall j \in [J].   (11)\n\nThen g(z) defined in (7) satisfies min_z g(z) = min_{x,y} f(x, y), and g(z) achieves its global minimum only for z with x_1 = \cdots = x_J.\n\n7\n\n\fWe note that the assumption in Proposition 2.2 is fairly strong, and while there are problems where it can hold, there are also many problems where it will not hold.\nProposition 2.2 establishes that, in certain cases, there will exist no global minimizers of the distributed objective function g(z) that fall outside of the consensus subspace. (Moreover, and also importantly, there will exist a global minimizer on the consensus subspace.) Also relevant is the question of whether there may exist any other types of critical points (such as local minima or saddle points) outside of the consensus subspace. Under certain conditions, the following proposition ensures that the answer is no.\nProposition 2.3. Let f(x, y) be as in (3) and g(z) be as in (7) with z = (x1, . . . , xJ , y1, . . . , yJ). Suppose the matrix W is connected and symmetric.
Also suppose the gradient of f_j satisfies the following symmetric property:\n\n\langle \nabla_x f_j(x, y_j), x \rangle = \langle \nabla_{y_j} f_j(x, y_j), y_j \rangle   (12)\n\nfor all j \in [J]. Then, any critical point of g must satisfy x_1 = \cdots = x_J.\nFinally, we can also make a statement about the behavior of critical points that do fall on the consensus subspace.\nTheorem 2.7. Let C_f denote the set of critical points of (3), C_f := \{(x, y) : \nabla f(x, y) = 0\}, and let C_g denote the set of critical points of (7), C_g := \{z : \nabla g(z) = 0\}. Then, for any z = (x_1, . . . , x_J, y) \in C_g with x_1 = \cdots = x_J = x, we have (x, y) \in C_f. Furthermore, if (x, y) is a strict saddle of f, then z = (x, . . . , x, y) is also a strict saddle of g.\n\n3 Analysis of Distributed Matrix Factorization\n\nWe now consider the prototypical low-rank matrix approximation problem in factored form, where given a data matrix Y \in R^{n \times m}, we seek to solve\n\nminimize_{U \in R^{n \times r}, V \in R^{m \times r}} \|UV^T - Y\|_F^2.   (13)\n\nHere U \in R^{n \times r} and V \in R^{m \times r} are tall matrices, and r is chosen in advance to allow for a suitable approximation of Y. In some of our results below, we will assume that the data matrix Y has rank at most r.\nOne can solve problem (13) using local search algorithms such as gradient descent. Such algorithms do not require expensive SVDs, and the storage complexity for U and V scales with (n + m)r, which is smaller than the nm required for Y. Unfortunately, problem (13) is nonconvex in the optimization variables (U, V). Thus, the question arises of whether local search algorithms such as gradient descent actually converge to a global minimizer of (13). Using geometric analysis of the critical points of problem (13), however, it is possible to prove convergence to a global minimizer.\nIn Section 11 of the supplementary material, building on analysis in [32], we prove the following result about the favorable geometry of the nonconvex problem (13).\nTheorem 3.1.
For any data matrix $Y$, every critical point (i.e., every point where the gradient is zero) of problem (13) is either a global minimum or a strict saddle point, where the Hessian has at least one negative eigenvalue.

Such favorable geometry has been used in the literature to show that local search algorithms (particularly gradient descent with random initialization [21]) will converge to a global minimum of the objective function.

3.1 Distributed Problem Formulation

We are interested in generalizing the matrix approximation problem from centralized to distributed scenarios. To be specific, suppose the columns of the data matrix $Y$ are distributed among $J$ nodes/sensors. Without loss of generality, partition the columns of $Y$ as
$$Y = [Y_1 \ Y_2 \ \cdots \ Y_J],$$
where for $j \in \{1, 2, \ldots, J\}$, matrix $Y_j$ (which is stored at node $j$) has size $n \times m_j$, and where $m = \sum_{j=1}^J m_j$. Partitioning $V$ similarly as
$$V = [V_1^T \ \cdots \ V_J^T]^T, \qquad (14)$$
where $V_j$ has size $m_j \times r$, we obtain the following optimization problem
$$\underset{U, V_1, \ldots, V_J}{\text{minimize}} \ \sum_{j=1}^J \|U V_j^T - Y_j\|_F^2, \qquad (15)$$
which is exactly equivalent to (13). Problem (15), in turn, can be written in the form of problem (3) by taking
$$x = \mathrm{vec}(U), \quad y_j = \mathrm{vec}(V_j), \quad \text{and} \quad f_j(x, y_j) = \|U V_j^T - Y_j\|_F^2. \qquad (16)$$
Consequently, we can use the analysis from Section 2 to study the performance of DGD+LOCAL (5) when applied to problem (15).

For convenience, we note that in this context the DGD+LOCAL iterations (5) take the form
$$U_j(k+1) = \sum_{i=1}^J \widetilde{w}_{ji} U_i(k) - 2\mu \left( U_j(k) V_j^T(k) - Y_j \right) V_j(k),$$
$$V_j(k+1) = V_j(k) - 2\mu \left( U_j(k) V_j^T(k) - Y_j \right)^T U_j(k), \qquad (17)$$
and the corresponding gradient descent objective function (7) takes the form
$$\underset{z}{\text{minimize}} \ g(z) = \sum_{j=1}^J \left( \|U_j V_j^T - Y_j\|_F^2 + \sum_{i=1}^J w_{ji} \|U_j - U_i\|_F^2 \right), \qquad (18)$$
where $U_1, \ldots, U_J \in \mathbb{R}^{n \times r}$ are local copies of the optimization variable $U$; $V_1, \ldots, V_J$ are a partition of $V$ as in (14); and the weights $\{w_{ji}\}$ are determined by $\{\widetilde{w}_{ji}\}$ and $\mu$ as in (6).

Problems (15) and (18) (as special cases of problems (3) and (7), respectively) satisfy many of the assumptions required for the geometric and algorithmic analysis in Section 2. We use these facts in proving our main result for the convergence of DGD+LOCAL on the matrix factorization problem.

Theorem 3.2. Suppose $\mathrm{rank}(Y) \le r$. Suppose DGD+LOCAL (17) is used to solve problem (15), with weights $\{\widetilde{w}_{ji}\}$ and stepsize
$$\mu < \frac{1 - 2\omega}{\max_j \left[ (276 + 64\pi)\rho^2 + 34\|Y_j\|_F + \frac{8 + 4\pi}{\rho^2}\|Y_j\|_F^2 \right]} \qquad (19)$$
for some $\rho > 0$, where $\omega := \max_j \sum_{i \ne j} \widetilde{w}_{ji} < \frac{1}{2}$. Suppose the $J \times J$ connectivity matrix $W = \{w_{ji}\}$ (with $w_{ji}$ defined in (6)) is connected and symmetric. Let $\{z(k)\}$ be the sequence generated by the DGD+LOCAL algorithm. Suppose $z(0)$ is chosen randomly from a probability distribution supported on a set $S \subseteq B_\rho$ with $S$ having positive measure, and suppose that under such random initialization, there is a positive probability that the sequence $\{z(k)\}$ remains bounded in $B_\rho$ and all limit points of $\{z(k)\}$ are in $B_\rho$.

Then, conditioned on observing that $\{z(k)\} \subseteq B_\rho$ and all limit points of $\{z(k)\}$ are in $B_\rho$, DGD+LOCAL almost surely converges to a solution $z^\star = (U_1^\star, \ldots, U_J^\star, V_1^\star, \ldots, V_J^\star)$ with the following properties:

• Consensus: $U_1^\star = \cdots = U_J^\star = U^\star$.
• Global optimality: $(U^\star, V^\star)$ is a global minimizer of (13), where $V^\star$ denotes the concatenation of $V_1^\star, \ldots, V_J^\star$ as in (14).

Such consensus and global optimality for the distributed low-rank matrix approximation problem are demonstrated by our experiment in the top panel of Figure 1.

Acknowledgments
This work was supported by the DARPA Lagrange Program under ONR/SPAWAR contract N660011824020.
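As a concrete illustration, the DGD+LOCAL iterations (17) can be simulated on a small synthetic instance. The sketch below is not the experiment of Figure 1: the 3-node fully connected network with uniform mixing weights $\widetilde{w}_{ji} = 0.2$ (so $\omega = 0.4 < 1/2$), the synthetic rank-$r$ data, and the empirically chosen stepsize $\mu$ (rather than the bound (19)) are all illustrative assumptions. It checks that the local copies $U_j$ reach consensus and that the concatenated local products fit $Y$.

```python
import numpy as np

# Illustrative sketch of the DGD+LOCAL iterations (17); network, data, and
# stepsize are assumptions for this demo, not the constants of Theorem 3.2.
rng = np.random.default_rng(0)
n, r, J = 8, 2, 3
m_list = [5, 4, 6]  # column counts m_j per node, m = sum(m_list)

# Rank-r data matrix, partitioned by columns: Y = [Y_1 ... Y_J].
Y = rng.standard_normal((n, r)) @ rng.standard_normal((r, sum(m_list)))
Y_parts = np.split(Y, np.cumsum(m_list)[:-1], axis=1)

# Symmetric, connected, row-stochastic mixing weights w~_ji.
W = np.full((J, J), 0.2)
np.fill_diagonal(W, 1 - 0.2 * (J - 1))

mu = 1e-3  # empirical stepsize (not derived from (19))
U = [0.1 * rng.standard_normal((n, r)) for _ in range(J)]   # local copies of U
V = [0.1 * rng.standard_normal((mj, r)) for mj in m_list]   # partition of V

for _ in range(30000):
    # Synchronous update: mix the previous U_i's, then take local gradient steps.
    mixed = [sum(W[j, i] * U[i] for i in range(J)) for j in range(J)]
    for j in range(J):
        R = U[j] @ V[j].T - Y_parts[j]  # local residual U_j V_j^T - Y_j
        # U_j(k+1) = sum_i w~_ji U_i(k) - 2 mu R V_j(k)
        # V_j(k+1) = V_j(k) - 2 mu R^T U_j(k)
        U[j], V[j] = mixed[j] - 2 * mu * R @ V[j], V[j] - 2 * mu * R.T @ U[j]

# Consensus gap among the U_j and global fit of Y.
consensus_gap = max(np.linalg.norm(U[j] - U[0]) for j in range(1, J))
fit_err = np.linalg.norm(np.hstack([U[j] @ V[j].T for j in range(J)]) - Y)
print(consensus_gap, fit_err)
```

On exact-rank data and with a sufficiently small stepsize, both the consensus gap and the fit error are driven toward zero, consistent with the consensus and global-optimality properties asserted by Theorem 3.2.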
The views, opinions and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

[1] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1-2):5–16, 2009.

[2] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

[3] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.

[4] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508–2530, 2006.

[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[6] S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[7] T.-H. Chang, M. Hong, and X. Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2014.

[8] A. I.-A. Chen. Fast distributed first-order methods. PhD thesis, Massachusetts Institute of Technology, 2012.

[9] Y. Chen and Y. Chi. Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine, 35(4):14–31, 2018.

[10] Y. Chi. Low-rank matrix completion [lecture notes].
IEEE Signal Processing Magazine, 35(5):178–181, 2018.

[11] Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 2019.

[12] A. Daneshmand, G. Scutari, and V. Kungurtsev. Second-order guarantees of distributed gradient algorithms. arXiv preprint arXiv:1809.08694, 2018.

[13] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[14] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

[15] D. Jakovetić, J. M. Moura, and J. Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2014.

[16] D. Jakovetić, J. Xavier, and J. M. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.

[17] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

[18] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482–491. IEEE, 2003.

[19] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[20] J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order methods almost always avoid strict saddle points. Mathematical Programming, pages 1–27.

[21] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers.
In Conference on Learning Theory, pages 1246–1257, 2016.

[22] Q. Li, X. Yang, Z. Zhu, G. Tang, and M. B. Wakin. The geometric effects of distributing constrained nonconvex optimization problems. In 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2019.

[23] Q. Li, Z. Zhu, and G. Tang. Geometry of factored nuclear norm regularization. arXiv preprint arXiv:1704.01265, 2017.

[24] Q. Li, Z. Zhu, and G. Tang. The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 2018.

[25] Q. Li, Z. Zhu, and G. Tang. Alternating minimizations converge to second-order optimal solutions. In International Conference on Machine Learning, pages 3935–3943, 2019.

[26] Q. Li, Z. Zhu, G. Tang, and M. B. Wakin. Provable Bregman-divergence based methods for nonconvex and non-Lipschitz problems. arXiv preprint arXiv:1904.09712, 2019.

[27] S. Li, G. Tang, and M. B. Wakin. The landscape of non-convex empirical risk with degenerate population risk. In Advances in Neural Information Processing Systems, 2019.

[28] S. Lu, M. Hong, and Z. Wang. PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex optimization. In International Conference on Machine Learning, pages 4134–4143, 2019.

[29] A. Mokhtari, Q. Ling, and A. Ribeiro. Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1):146–161, 2017.

[30] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[31] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano. Distributed big-data optimization via block communications. In 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 1–5.
IEEE, 2017.

[32] M. Nouiehed and M. Razaviyayn. Learning deep models: Critical points and local openness. arXiv preprint arXiv:1803.02968, 2018.

[33] C. W. Royer, M. O'Neill, and S. J. Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization. arXiv preprint arXiv:1803.02924, 2018.

[34] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

[35] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

[36] J. Sun. When are nonconvex optimization problems not scary? PhD thesis, Columbia University, 2016.

[37] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

[38] A. Uschmajew and B. Vandereycken. On critical points of quadratic low-rank matrix optimization problems. 2018.

[39] K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.

[40] J. Zeng and W. Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, 2018.

[41] X. Zhang, L. Wang, Y. Yu, and Q. Gu. A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In International Conference on Machine Learning, pages 5857–5866, 2018.

[42] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin. The global optimization geometry of low-rank matrix optimization. arXiv preprint arXiv:1703.01256, 2017.

[43] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin. Global optimality in low-rank matrix optimization.
IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.