{"title": "On Markov Chain Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 9896, "page_last": 9905, "abstract": "Stochastic gradient methods are the workhorse (algorithms) of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent where the random samples are taken on the trajectory of a Markov chain. Existing results of this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Non-reversible finite-state Markov chains, on the other hand, can mix substatially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions.", "full_text": "On Markov Chain Gradient Descent\u2217\n\nTao Sun\n\nCollege of Computer\n\nNational University of Defense Technology\n\nChangsha, Hunan 410073, China\n\nnudtsuntao@163.com\n\nYuejiao Sun\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095, USA\n\nsunyj@math.ucla.edu\n\nWotao Yin\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095, USA\nwotaoyin@math.ucla.edu\n\nAbstract\n\nStochastic gradient methods are the workhorse (algorithms) of large-scale opti-\nmization problems in machine learning, signal processing, and other computational\nsciences and engineering. This paper studies Markov chain gradient descent, a\nvariant of stochastic gradient descent where the random samples are taken on the\ntrajectory of a Markov chain. 
Existing results of this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Non-reversible finite-state Markov chains, on the other hand, can mix substantially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions.\n\n1 Introduction\n\nIn this paper, we consider a stochastic minimization problem. Let \u039e be a statistical sample space with probability distribution \u03a0 (we omit the underlying \u03c3-algebra). Let X \u2286 R^n be a closed convex set, which represents the parameter space. F(\u00b7; \u03be) : X \u2192 R is a closed convex function associated with \u03be \u2208 \u039e. We aim to solve the following problem:\n\nminimize_{x \u2208 X \u2286 R^n} E_\u03be F(x; \u03be) = \u222b_\u03a0 F(x; \u03be) d\u03a0(\u03be). (1)\n\nA common method to minimize (1) is Stochastic Gradient Descent (SGD) [11]:\n\nx^{k+1} = Proj_X(x^k - \u03b3_k \u2202F(x^k; \u03be^k)), samples \u03be^k i.i.d. \u223c \u03a0. (2)\n\nHowever, for some problems and distributions, direct sampling from \u03a0 is expensive or impossible, and it is possible that the sample space \u039e is not explicitly known. In these cases, it can be much cheaper to sample by following a Markov chain that has a desired equilibrium distribution \u03a0.\n\n\u2217The work is supported in part by the National Key R&D Program of China 2017YFB0202902, China Scholarship Council. The work of Y. Sun and W. 
Yin was supported in part by AFOSR MURI FA9550-18-1-0502, NSF DMS-1720237, NSFC 11728105, and ONR N000141712162.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTo be concrete, imagine solving problem (1) with a discrete space \u039e := {x \u2208 {0, 1}^n | \u27e8a, x\u27e9 \u2264 b}, where a \u2208 R^n and b \u2208 R, and the uniform distribution \u03a0 over \u039e. A straightforward way to obtain a uniform sample is iteratively randomly sampling x \u2208 {0, 1}^n until the constraint \u27e8a, x\u27e9 \u2264 b is satisfied. Even if the feasible set is small, it may take up to O(2^n) iterations to get a feasible sample. Instead, one can sample a trajectory of a Markov chain described in [4]; to obtain a sample \u03b5-close to the distribution \u03a0, one only needs log(\u221a|\u039e|/\u03b5) exp(O(\u221an (log n)^{5/2})) samples [2], where |\u039e| is the cardinality of \u039e. This presents a significant saving in sampling cost.\n\nMarkov chains also naturally arise in some applications. Common examples are systems that evolve according to Markov chains, for example, linear dynamic systems with random transitions or errors. Another example is a distributed system in which every node locally stores a subset of training samples; to train a model using these samples, we can let a token that holds all the model parameters traverse the nodes following a random walk, so the samples are accessed according to a Markov chain.\n\nSuppose that the Markov chain has a stationary distribution \u03a0 and a finite mixing time T, which is how long a random trajectory needs to be until its current state has a distribution that roughly matches \u03a0. A larger T means a closer match. Then, in order to run one iteration of (2), we can generate a trajectory of samples \u03be1, \u03be2, \u03be3, . . . , \u03beT and only take the last sample \u03be := \u03beT. 
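The last-sample scheme just described can be sketched as follows. The three-state transition matrix is a toy example (all numbers assumed for illustration), and the total-variation distance measures how far the distribution of \u03beT still is from \u03a0:

```python
import numpy as np

# A toy 3-state transition matrix (assumed for illustration only).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

def stationary(P):
    # Left eigenvector of P for eigenvalue 1, normalized into a distribution.
    w, V = np.linalg.eig(P.T)
    v = np.real(V[:, np.argmax(np.real(w))])
    return v / v.sum()

def last_state(P, T, rng, start=0):
    # Run a trajectory of length T from `start` and keep only the final state.
    s = start
    for _ in range(T):
        s = rng.choice(len(P), p=P[s])
    return s

rng = np.random.default_rng(0)
pi = stationary(P)
for T in (1, 20):
    draws = [last_state(P, T, rng) for _ in range(20000)]
    emp = np.bincount(draws, minlength=len(P)) / len(draws)
    print(f"T={T:2d}  TV distance to the stationary distribution: "
          f"{0.5 * np.abs(emp - pi).sum():.3f}")
```

For this chain, T = 1 is visibly biased toward the start state's row, while T = 20 is essentially stationary; the price is T chain steps per usable sample, which is exactly the waste discussed next.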
To run another iteration of (2), we repeat this process, i.e., sample a new trajectory \u03be1, \u03be2, \u03be3, . . . , \u03beT and take \u03be := \u03beT.\n\nClearly, sampling a long trajectory just to use the last sample wastes a lot of samples, especially when T is large. But, this may seem necessary because \u03bet, for all small t, have large biases. After all, it can take a long time for a random trajectory to explore all of the space, and it will often double back and visit states that it previously visited. Furthermore, it is also difficult to choose an appropriate T. A small T will cause a large bias in \u03beT, which slows the SGD convergence and reduces its final accuracy. A large T, on the other hand, is wasteful, especially when x^k is still far from convergence and some bias does not prevent (2) from making good progress. Therefore, T should increase adaptively as k increases \u2014 this makes the choice of T even more difficult.\n\nSo, why waste samples, why worry about T, and why not just apply every sample immediately in stochastic gradient descent? This approach has appeared in [5, 6], which we call the Markov Chain Gradient Descent (MCGD) algorithm for problem (1):\n\nx^{k+1} = Proj_X(x^k - \u03b3_k \u02c6\u2207F(x^k; \u03be^k)), (3)\n\nwhere \u03be^0, \u03be^1, . . . are samples on a Markov chain trajectory and \u02c6\u2207F(x^k; \u03be^k) \u2208 \u2202F(x^k; \u03be^k) is a subgradient.\n\nLet us examine some special cases. Suppose the distribution \u03a0 is supported on a set of M points, y1, . . . , yM. Then, by letting f_i(x) := M \u00b7 Prob(\u03be = y_i) \u00b7 F(x, y_i), problem (1) reduces to the finite-sum problem:\n\nminimize_{x \u2208 X \u2286 R^d} f(x) \u2261 (1/M) \u2211_{i=1}^{M} f_i(x). (4)\n\nBy the definition of f_i, each state i has the uniform probability 1/M. At each iteration k of MCGD, we have\n\nx^{k+1} = Proj_X(x^k - \u03b3_k \u02c6\u2207f_{j_k}(x^k)), (5)\n\nwhere (j_k)_{k\u22650} is a trajectory of a Markov chain on {1, 2, . . . , M} that has a uniform stationary distribution. Here, (\u03be^k)_{k\u22650} \u2286 \u03a0 and (j_k)_{k\u22650} \u2286 [M] are two different, but related, Markov chains. Starting from a deterministic and arbitrary initialization x^0, the iteration is illustrated by the following diagram:\n\nj0 \u2192 j1 \u2192 j2 \u2192 . . .\n\u2193 \u2193 \u2193\nx0 \u2192 x1 \u2192 x2 \u2192 x3 \u2192 . . . (6)\n\nIn the diagram, given each j_k, the next state j_{k+1} is statistically independent of j_{k-1}, . . . , j_0; given j_k and x^k, the next iterate x^{k+1} is statistically independent of j_{k-1}, . . . , j_0 and x^{k-1}, . . . , x^0.\n\nAnother application of MCGD involves a network: consider a strongly connected graph G = (V, E) with the set of vertices V = {1, 2, . . . , M} and set of edges E \u2286 V \u00d7 V. Each node j \u2208 {1, 2, . . . , M} possesses some data and can compute \u2207f_j(\u00b7). To run MCGD, we employ a token that carries the variable x, walking randomly over the network. When it reaches a node j, node j reads x from the token and computes \u2207f_j(\u00b7) to update x according to (5). Then, the token walks away to a random neighbor of node j.\n\n1.1 Numerical tests\n\nWe present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain. Our results on nonconvex objectives and non-reversible chains are new.\n\n1. Comparison with SGD\n\nLet us compare:\n\n1. MCGD (3), where j_k is taken from one trajectory of the Markov chain;\n2. 
SGD_T, for T = 1, 2, 4, 8, 16, 32, where each j_k is the T-th sample of a fresh, independent trajectory. All trajectories are generated by starting from the same state 0.\n\nTo compute T gradients, SGD_T uses T times as many samples as MCGD. We did not try to adapt T as k increases because theoretical guidance is lacking.\n\nIn the first test, we recover a vector u from an auto-regressive process, which closely resembles the first experiment in [1]. Set matrix A as a subdiagonal matrix with random entries A_{i,i-1} i.i.d. \u223c U[0.8, 0.99]. Randomly sample a vector u \u2208 R^d, d = 50, with unit 2-norm. Our data (\u03be1_t, \u03be2_t)_{t=1}^{\u221e} are generated according to the following auto-regressive process:\n\n\u03be1_t = A \u03be1_{t-1} + e_1 W_t, W_t i.i.d. \u223c N(0, 1);\n\u00af\u03be2_t = 1 if \u27e8u, \u03be1_t\u27e9 > 0, and 0 otherwise;\n\u03be2_t = \u00af\u03be2_t with probability 0.8, and 1 - \u00af\u03be2_t with probability 0.2.\n\nClearly, (\u03be1_t, \u03be2_t)_{t=1}^{\u221e} forms a Markov chain. Let \u03a0 denote the stationary distribution of this Markov chain. We recover u as the solution to the following problem:\n\nminimize_x E_{(\u03be1,\u03be2)\u223c\u03a0} \u2113(x; \u03be1, \u03be2).\n\nWe consider both convex and nonconvex loss functions, which were not done before in the literature. The convex one is the logistic loss\n\n\u2113(x; \u03be1, \u03be2) = -\u03be2 log(\u03c3(\u27e8x, \u03be1\u27e9)) - (1 - \u03be2) log(1 - \u03c3(\u27e8x, \u03be1\u27e9)), where \u03c3(t) = 1/(1 + exp(-t)).\n\nAnd the nonconvex one is taken as\n\n\u2113(x; \u03be1, \u03be2) = (1/2)(\u03c3(\u27e8x, \u03be1\u27e9) - \u03be2)^2\n\nfrom [7]. We choose \u03b3_k = 1/k^q as our stepsize, where q = 0.501. This choice is consistent with our
We choose \u03b3k = 1\ntheory below.\nOur results in Figure 1 are surprisingly positive on MCGD, more so to our expectation. As we\nhad expected, MCGD used signi\ufb01cantly fewer total samples than SGD on every T . But, it is\nsurprising that MCGD did not need even more gradient evaluations. Randomly generated data\nmust have helped homogenize the samples over the different states, making it less important for a\ntrajectory to converge. It is important to note that SGD1 and SGD2, as well as SGD4, in the noncon-\nvex case, stagnate at noticeably lower accuracies because their T values are too small for convergence.\n\n3\n\n\fFigure 1: Comparisons of MCGD and SGDT for T = 1, 2, 4, 8, 16, 32. xk is the average of\nx1, . . . , xk.\n\n2. Comparison of reversible and non-reversible Markov chains\nWe also compare the convergence of MCGD when working with reversible and non-reversible\nMarkov chains (the de\ufb01nition of reversibility is given in next section). As mentioned in [14],\ntransforming a reversible Markov chain into non-reversible Markov chain can signi\ufb01cantly accelerate\nthe mixing process. This technique also helps to accelerate the convergence of MCGD.\nIn our experiment, we \ufb01rst construct an undirected connected graph with n = 20 nodes with edges\nrandomly generated. Let G denote the adjacency matrix of the graph, that is,\n\nLet dmax be the maximum number of outgoing edges of a node. Select d = 10 and compute\n\u03b2\u2217 \u223c N (0, Id). The transition probability of the reversible Markov chain is then de\ufb01ned by, known\nas Metropolis-Hastings markov chain,\n\nGi,j =(cid:26) 1,\n\nif i, j are connected;\n\n0, otherwise.\n\n(cid:80)\n\n,\n\n1\n\ndmax\n\n1 \u2212\n0,\n\nPi,j =\uf8f1\uf8f2\uf8f3\n\nj(cid:54)=i Gi,j\ndmax\n\nif j (cid:54)= i, Gi,j = 1;\nif j = i;\notherwise.\n\n,\n\nObviously, P is symmetric and the stationary distribu-\ntion is uniform. The non-reversible Markov chain is con-\nstructed by adding cycles. 
The edges of these cycles are directed, and let V denote the adjacency matrix of these cycles. If V_{i,j} = 1, then V_{j,i} = 0. Let w_0 > 0 be the weight of flows along these cycles. Then we construct the transition probability of the non-reversible Markov chain as follows:\n\nQ_{i,j} = W_{i,j} / \u2211_l W_{i,l}, where W = dmax \u00b7 P + w_0 \u00b7 V.\n\nSee [14] for an explanation of why this change makes the chain mix faster.\n\nIn our experiment, we add 5 cycles of length 4, with edges existing in G. w_0 is set to be dmax/2. We test MCGD on a least squares problem. First, we select \u03b2\u2217 \u223c N(0, I_d); and then for each node i, we generate x_i \u223c N(0, I_d) and y_i = x_i^T \u03b2\u2217. The objective function is defined as\n\nf(\u03b2) = (1/2) \u2211_{i=1}^{n} (x_i^T \u03b2 - y_i)^2.\n\nThe convergence results are depicted in Figure 2.\n\nFigure 2: Comparison of reversible and non-reversible Markov chains. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.\n\n1.2 Known approaches and results\n\nIt is more difficult to analyze MCGD due to its biased samples. To see this, let p_{k,j} be the probability to select \u2207f_j in the kth iteration. SGD's uniform probability selection (p_{k,j} \u2261 1/M) yields an unbiased gradient estimate\n\nE_{j_k}(\u2207f_{j_k}(x^k)) = C \u2207f(x^k) (7)\n\nfor some C > 0. However, in MCGD, it is possible to have p_{k,j} = 0 for some k, j. 
Consider a \u201crandom walk\u201d. The probability p_{j_k,j} is determined by the current state j_k, and we have p_{j_k,i} > 0 only for i \u2208 N(j_k) and p_{j_k,i} = 0 for i \u2209 N(j_k), where N(j_k) denotes the neighborhood of j_k. Therefore, we no longer have (7).\n\nAll analyses of MCGD must deal with the biased expectation. Papers [6, 5] investigate the conditional expectation E_{j_{k+\u03c4} | j_k}(\u2207f_{j_{k+\u03c4}}(x^k)). For a sufficiently large \u03c4 \u2208 Z+, it is sufficiently close to (1/M) \u2207f(x^k) (but still different). In [6, 5], the authors proved that, to achieve an \u03b5 error, MCGD with stepsize O(\u03b5) can return a solution in O(1/\u03b5^2) iterations. Their error bound is given in the ergodic sense and using liminf. The authors of [10] proved that lim inf f(x^k) and E dist^2(x^k, X\u2217) have almost sure convergence under diminishing stepsizes \u03b3_k = 1/k^q, 2/3 < q \u2264 1. Although the authors did not compute any rates, we computed that their stepsizes will lead to a solution with \u03b5 error in O(\u03b5^{-1/(1-q)}) iterations, for 2/3 < q < 1, and O(e^{1/\u03b5}) iterations for q = 1. In [1], the authors improved the stepsizes to \u03b3_k = 1/\u221ak and showed ergodic convergence; in other words, to achieve an \u03b5 error, it is enough to run MCGD for O(1/\u03b5^2) iterations. There is no non-ergodic result regarding the convergence of f(x^k). It is worth mentioning that [10, 1] use time non-homogeneous Markov chains, where the transition probability can change over the iterations as long as there is still a finite mixing time. In [1], MCGD is generalized from gradient descent to mirror descent. In all these works, the Markov chain is required to be reversible, and all functions f_i, i \u2208 [M], are assumed to be convex. However, non-reversible chains can have substantially faster convergence and are thus more numerically efficient.\n\n1.3 Our approaches and results\n\nIn this paper, we improve the analyses of MCGD to non-reversible finite-state Markov chains and to nonconvex functions. The former allows us to have faster mixing, and the latter frequently appears in applications. Our convergence result is given in the non-ergodic sense, though the rate results are still given in the ergodic sense. It is important to mention that, in our analysis, the mixing time of the underlying Markov chain is not tied to a fixed mixing level but can vary to different levels. This is essential because MCGD needs time to reduce its objective error from its current value to a lower one, and this time becomes longer when the current value is lower, since a more accurate Markov chain convergence, and thus a longer mixing time, are required. When f_1, f_2, . . . , f_M are all convex, we allow them to be non-differentiable and MCGD to use subgradients, provided that X is bounded. When any of them is nonconvex, we assume X is the full space and f_1, f_2, . . . , f_M are differentiable with bounded gradients. The bounded-gradient assumption is due to a technical difficulty associated with nonconvexity.\n\nSpecifically, in the convex setting, we prove lim_k Ef(x^k) = f\u2217 (the minimum of f over X) for both exact and inexact MCGD with stepsizes \u03b3_k = 1/k^q, 1/2 < q < 1. The convergence rates of MCGD with exact and inexact subgradient computations are presented. The first analysis of nonconvex MCGD is also presented, with its convergence given in the expectation of \u2016\u2207f(x^k)\u2016. 
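As a concrete illustration of iteration (5), the following sketch runs MCGD on a toy least-squares finite sum, sampling f_j along a lazy random walk on a ring of M nodes (which has a uniform stationary distribution). The data, graph, and stepsize constant are all assumed for illustration and are not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 20, 5
A = rng.normal(size=(M, d))
beta_star = rng.normal(size=d)
y = A @ beta_star                         # consistent, noiseless targets

def f(b):
    # f(b) = (1/M) sum_i f_i(b) with f_i(b) = 0.5 * (a_i^T b - y_i)^2
    return 0.5 * np.mean((A @ b - y) ** 2)

b, j = np.zeros(d), 0
fvals = [f(b)]
for k in range(1, 50001):
    g = A[j] * (A[j] @ b - y[j])          # gradient of f_j at the current b
    b -= 0.1 / k ** 0.75 * g              # gamma_k = O(1/k^q) with q = 0.75
    j = (j + rng.integers(-1, 2)) % M     # lazy walk on a ring: j-1, j, or j+1
    fvals.append(f(b))
print(f"f(x^0) = {fvals[0]:.3f},  f(x^K) = {fvals[-1]:.2e}")
```

Every chain step is used immediately, so no samples are discarded; the bias of the walk is absorbed by the diminishing stepsizes.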
These results hold for non-reversible finite-state Markov chains and can be extended to time non-homogeneous Markov chains under extra assumptions [10, Assumptions 4 and 5] and [1, Assumption C], which essentially ensure finite mixing.\n\nOur results for finite-state Markov chains are first presented in Sections 3 and 4. They are extended to continuous-state reversible Markov chains in Section 5.\n\nSome novel results are developed based on new techniques and approaches introduced in this paper. To get the stronger results in general cases, we use varying mixing times rather than fixed ones.\n\nWe list the possible extensions of MCGD that are not discussed in this paper. The first one is accelerated versions, including Nesterov's acceleration and variance reduction schemes. The second one is the design and optimization of Markov chains to improve the convergence of MCGD.\n\n2 Preliminaries\n\n2.1 Markov chain\n\nWe recall some definitions, properties, and existing results about Markov chains. Although we use the finite-state time-homogeneous Markov chain, the results can be extended to more general chains under similar extra assumptions in [10, Assumptions 4, 5] and [1, Assumption C].\n\nDefinition 1 (finite-state time-homogeneous Markov chain) Let P be an M \u00d7 M matrix with real-valued elements. A stochastic process X_1, X_2, ... in a finite state space [M] := {1, 2, . . . , M} is called a time-homogeneous Markov chain with transition matrix P if, for k \u2208 N, i, j \u2208 [M], and i_0, i_1, . . . , i_{k-1} \u2208 [M], we have\n\nP(X_{k+1} = j | X_0 = i_0, X_1 = i_1, . . . , X_k = i) = P(X_{k+1} = j | X_k = i) = P_{i,j}. (8)\n\nLet the probability distribution of X_k be denoted as the non-negative row vector \u03c0^k = (\u03c0^k_1, \u03c0^k_2, . . . , \u03c0^k_M), that is, P(X_k = j) = \u03c0^k_j. \u03c0^k satisfies \u2211_{i=1}^{M} \u03c0^k_i = 1. When the Markov chain is time-homogeneous, we have \u03c0^k = \u03c0^{k-1} P and\n\n\u03c0^k = \u03c0^{k-1} P = \u00b7\u00b7\u00b7 = \u03c0^0 P^k, (9)\n\nfor k \u2208 N, where P^k denotes the kth power of P. A Markov chain is irreducible if, for any i, j \u2208 [M], there exists k such that (P^k)_{i,j} > 0. State i \u2208 [M] is said to have period d if (P^k)_{i,i} = 0 whenever k is not a multiple of d and d is the greatest integer with this property. If d = 1, then we say state i is aperiodic. If every state is aperiodic, the Markov chain is said to be aperiodic.\n\nAny time-homogeneous, irreducible, and aperiodic Markov chain has a stationary distribution \u03c0\u2217 = lim_k \u03c0^k = [\u03c0\u2217_1, \u03c0\u2217_2, . . . , \u03c0\u2217_M] with \u2211_{i=1}^{M} \u03c0\u2217_i = 1 and min_i {\u03c0\u2217_i} > 0, and \u03c0\u2217 = \u03c0\u2217 P. It also holds that\n\nlim_k P^k = [(\u03c0\u2217); (\u03c0\u2217); . . . ; (\u03c0\u2217)] =: \u03a0\u2217 \u2208 R^{M\u00d7M}. (10)\n\nThe largest eigenvalue of P is 1, and the corresponding left eigenvector is \u03c0\u2217.\n\nAssumption 1 The Markov chain (X_k)_{k\u22650} is time-homogeneous, irreducible, and aperiodic. It has a transition matrix P and stationary distribution \u03c0\u2217.\n\n2.2 Mixing time\n\nMixing time measures how long a Markov chain evolves until its current state has a distribution very close to its stationary distribution. The literature has a thorough investigation of various kinds of mixing times, with the majority for reversible Markov chains (that is, \u03c0_i P_{i,j} = \u03c0_j P_{j,i}). Mixing times of non-reversible Markov chains are discussed in [3]. In this part, we consider a new type of mixing time for non-reversible Markov chains. The proofs are based on basic matrix analysis. Our mixing time gives us a direct relationship between k and the deviation of the distribution of the current state from the stationary distribution.\n\nTo start a lemma, we review some basic notions in linear algebra. 
Let C denote the complex field. The modulus of a complex number a \u2208 C is given as |a|. For a vector x \u2208 C^n, the \u2113\u221e and \u21132 norms are defined as \u2016x\u2016\u221e := max_i |x_i| and \u2016x\u2016_2 := \u221a(\u2211_{i=1}^{n} |x_i|^2). For a matrix A = [a_{i,j}] \u2208 C^{m\u00d7n}, its \u221e-induced and Frobenius norms are \u2016A\u2016\u221e := max_{i,j} |a_{i,j}| and \u2016A\u2016_F := \u221a(\u2211_{i,j} |a_{i,j}|^2), respectively.\n\nWe know P^k \u2192 \u03a0\u2217 as k \u2192 \u221e. The following lemma presents a deviation bound for finite k.\n\nLemma 1 Let Assumption 1 hold and let \u03bb_i(P) \u2208 C be the ith largest eigenvalue of P, and\n\n\u03bb(P) := (max{|\u03bb_2(P)|, |\u03bb_M(P)|} + 1)/2 \u2208 [0, 1).\n\nThen, we can bound the largest entry-wise absolute value of the deviation matrix \u03b4^k := \u03a0\u2217 - P^k \u2208 R^{M\u00d7M} as\n\n\u2016\u03b4^k\u2016\u221e \u2264 C_P \u00b7 \u03bb^k(P) (11)\n\nfor k \u2265 K_P, where C_P is a constant that also depends on the Jordan canonical form of P and K_P is a constant that depends on \u03bb(P) and \u03bb_2(P). Their formulas are given in (45) and (46) in the Supplementary Material.\n\nRemark 1 If P is symmetric, then the \u03bb_i(P)'s are all real and nonnegative, K_P = 0, and C_P \u2264 M^{3/2}. Furthermore, (42) can be improved by directly using \u03bb_2^k(P) for the right side as\n\n\u2016\u03b4^k\u2016\u221e \u2264 \u2016\u03b4^k\u2016_F \u2264 M^{3/2} \u00b7 \u03bb_2^k(P), k \u2265 0.\n\n3 Convergence analysis for convex minimization\n\nThis part considers the convergence of MCGD in the convex case, i.e., f_1, f_2, . . . , f_M and X are all convex. We investigate the convergence of scheme (5). We prove non-ergodic convergence of the expected objective value sequence under diminishing non-summable stepsizes, where the stepsizes are required to be \u201calmost\u201d square summable. 
Therefore, the convergence requirements are almost equal to those of SGD. This section uses the following assumption.\n\nAssumption 2 The set X is assumed to be convex and compact.\n\nNow, we present the convergence results for MCGD in the convex (but not necessarily differentiable) case. Let f\u2217 be the minimum value of f over X.\n\nTheorem 1 Let Assumptions 1 and 2 hold and (x^k)_{k\u22650} be generated by scheme (5). Assume that f_i, i \u2208 [M], are convex functions, and the stepsizes satisfy\n\n\u2211_k \u03b3_k = +\u221e, \u2211_k ln k \u00b7 \u03b3_k^2 < +\u221e. (12)\n\nThen, we have\n\nlim_k Ef(x^k) = f\u2217. (13)\n\nDefine\n\n\u03c8(P) := max{1, 1/ln(1/\u03bb(P))}.\n\nWe have:\n\nE(f(x\u0304^k) - f\u2217) = O(\u03c8(P) / \u2211_{i=1}^{k} \u03b3_i), (14)\n\nwhere x\u0304^k := (\u2211_{i=1}^{k} \u03b3_i x^i) / (\u2211_{i=1}^{k} \u03b3_i). Therefore, if we select the stepsize \u03b3_k = O(1/k^q) with 1/2 < q < 1, we get the rate E(f(x\u0304^k) - f\u2217) = O(\u03c8(P)/k^{1-q}).\n\nFurthermore, consider the inexact version of MCGD:\n\nx^{k+1} = Proj_X(x^k - \u03b3_k(\u02c6\u2207f_{j_k}(x^k) + e^k)), (15)\n\nwhere the noise sequence (e^k)_{k\u22650} is arbitrary but obeys\n\n\u2211_{k=2}^{+\u221e} \u2016e^k\u2016^2 / ln k < +\u221e. (16)\n\nThen, for iteration (15), results (13) and (14) still hold; furthermore, if \u2016e^k\u2016 = O(1/k^p) with p > 1/2 and \u03b3_k = O(1/k^q) with 1/2 < q < 1, the rate E(f(x\u0304^k) - f\u2217) = O(\u03c8(P)/k^{1-q}) also holds.\n\nThe stepsize requirement (12) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting \u03b3_k = O(1/k^q) with 1/2 < q < 1. This kind of stepsize requirement also works for SGD and subgradient algorithms. 
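The two conditions in (12) can be checked numerically for \u03b3_k = 1/k^q with q = 0.75 (a quick sanity check, not a proof): the partial sums of \u03b3_k keep growing, while the partial sums of ln k \u00b7 \u03b3_k^2 level off.

```python
import numpy as np

q = 0.75
k = np.arange(2, 2_000_001, dtype=float)
gamma = 1.0 / k ** q
S1 = np.cumsum(gamma)                  # should diverge: gamma_k is non-summable
S2 = np.cumsum(np.log(k) * gamma**2)   # should converge: ln k * gamma_k^2 is summable

for N in (10**4, 10**5, 10**6):
    i = N - 2
    print(f"N={N:>8}:  sum gamma_k ~ {S1[i]:8.1f}   sum ln k * gamma_k^2 ~ {S2[i]:.4f}")
```

The first column keeps growing roughly like N^{1-q}, while the second column barely changes between N = 10^5 and N = 10^6.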
The convergence rate of MCGD is O(1 / \u2211_{i=1}^{k} \u03b3_i) = O(1/k^{1-q}), which is the same as that of SGD and subgradient algorithms for \u03b3_k = O(1/k^q).\n\n4 Convergence analysis for nonconvex minimization\n\nThis section considers the convergence of MCGD when one or more of the f_i are nonconvex. In this case, we assume f_i, i = 1, 2, . . . , M, are differentiable and \u2207f_i is Lipschitz with constant L.\u00b2 We also set X as the full space. We study the following scheme:\n\nx^{k+1} = x^k - \u03b3_k \u2207f_{j_k}(x^k). (17)\n\nWe prove non-ergodic convergence of the expected gradient norm of f under diminishing non-summable stepsizes. The stepsize requirements in this section are slightly stronger than those in the convex case, with an extra ln k factor. In this part, we use the following assumption.\n\nAssumption 3 The gradients of f_i are assumed to be bounded, i.e., there exists D > 0 such that\n\n\u2016\u2207f_i(x)\u2016 \u2264 D, i \u2208 [M]. (18)\n\nWe use this new assumption because X is now the full space, and we have to directly bound the size of \u2016\u2207f_i(x)\u2016. In the nonconvex case, we cannot obtain objective value convergence, and we only bound the gradients. Now, we are prepared to present our convergence results for nonconvex MCGD.\n\nTheorem 2 Let Assumptions 1 and 3 hold and (x^k)_{k\u22650} be generated by scheme (17). Also, assume f_i is differentiable and \u2207f_i is L-Lipschitz, and the stepsizes satisfy\n\n\u2211_k \u03b3_k = +\u221e, \u2211_k ln^2 k \u00b7 \u03b3_k^2 < +\u221e. (19)\n\nThen, we have\n\nlim_k E\u2016\u2207f(x^k)\u2016 = 0 (20)\n\nand\n\nE(min_{1\u2264i\u2264k} {\u2016\u2207f(x^i)\u2016^2}) = O(\u03c8(P) / \u2211_{i=1}^{k} \u03b3_i), (21)\n\nwhere \u03c8(P) is given in Lemma 1. 
If we select the stepsize as \u03b3_k = O(1/k^q), 1/2 < q < 1, then we get the rate E(min_{1\u2264i\u2264k} {\u2016\u2207f(x^i)\u2016^2}) = O(\u03c8(P)/k^{1-q}).\n\nFurthermore, let (e^k)_{k\u22650} be a sequence of noise and consider the inexact nonconvex MCGD iteration:\n\nx^{k+1} = x^k - \u03b3_k(\u2207f_{j_k}(x^k) + e^k). (22)\n\nIf the noise sequence obeys\n\n\u2211_{k=1}^{+\u221e} \u03b3_k \u00b7 \u2016e^k\u2016 < +\u221e, (23)\n\nthen the convergence results (20) and (21) still hold for inexact nonconvex MCGD. In addition, if we set \u03b3_k = O(1/k^q) with 1/2 < q < 1 and the noise satisfies \u2016e^k\u2016 = O(1/k^p) for p + q > 1, then (20) still holds and E(min_{1\u2264i\u2264k} {\u2016\u2207f(x^i)\u2016^2}) = O(\u03c8(P)/k^{1-q}).\n\nThe proof of Theorem 2 is different from the previous one. In particular, we cannot expect some sort of convergence to f(x\u2217), where x\u2217 \u2208 argmin f, due to nonconvexity. To this end, we use the Lipschitz continuity of \u2207f_i (i \u2208 [M]) to derive the \u201cdescent\u201d. Here, the \u201cO\u201d contains a polynomial composition of the constants D and L.\n\nCompared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD become slightly stronger; in the summable part, we need \u2211_k ln^2 k \u00b7 \u03b3_k^2 < +\u221e rather than \u2211_k ln k \u00b7 \u03b3_k^2 < +\u221e. Nevertheless, we can still use \u03b3_k = O(1/k^q) for 1/2 < q < 1.\n\n\u00b2This is for the convenience of the presentation in the proofs. If each f_i has its own L_i, it is possible to improve our results slightly. But, we simply set L := max_i {L_i}.\n\n5 Convergence analysis for continuous state space\n\nWhen the state space \u039e is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. 
Using the results in [8, Theorem 4.9], the mixing of this kind of Markov chain still enjoys a geometric decrease like (11). Since Lemma 1 is based on a linear algebra analysis, it no longer applies to the continuous case. Nevertheless, the previous results still hold with nearly unchanged proofs under the following assumption:\n\nAssumption 4 For any \u03be \u2208 \u039e, |F(x; \u03be) - F(y; \u03be)| \u2264 L\u2016x - y\u2016, sup_{x\u2208X, \u03be\u2208\u039e} {\u2016\u02c6\u2207F(x; \u03be)\u2016} \u2264 D, E_\u03be \u02c6\u2207F(x; \u03be) \u2208 \u2202E_\u03be F(x; \u03be), and sup_{x,y\u2208X, \u03be\u2208\u039e} |F(x; \u03be) - F(y; \u03be)| \u2264 H.\n\nWe consider the general scheme\n\nx^{k+1} = Proj_X(x^k - \u03b3_k(\u02c6\u2207F(x^k; \u03be^k) + e^k)), (24)\n\nwhere \u03be^k are samples on a Markov chain trajectory. If e^k \u2261 0, the scheme reduces to (3).\n\nCorollary 1 Assume F(\u00b7; \u03be) is convex for each \u03be \u2208 \u039e. Let the stepsizes satisfy (12), (x^k)_{k\u22650} be generated by Algorithm (24), and (e^k)_{k\u22650} satisfy (16). Let F\u2217 := min_{x\u2208X} E_\u03be(F(x; \u03be)). If Assumption 4 holds and the Markov chain is time-homogeneous, irreducible, aperiodic, and reversible, then we have\n\nlim_k E(E_\u03be(F(x^k; \u03be)) - F\u2217) = 0, E(E_\u03be(F(x\u0304^k; \u03be)) - F\u2217) = O(max{1, 1/ln(1/\u03bb)} / \u2211_{i=1}^{k} \u03b3_i),\n\nwhere 0 < \u03bb < 1 is the geometric rate of the mixing time of the Markov chain (which corresponds to \u03bb(P) in the finite-state case).\n\nNext, we present our result for a possibly nonconvex objective function F(\u00b7; \u03be) under the following assumption.\n\nAssumption 5 For any \u03be \u2208 \u039e, F(x; \u03be) is differentiable, and \u2016\u2207F(x; \u03be) - \u2207F(y; \u03be)\u2016 \u2264 L\u2016x - y\u2016. 
In addition, sup_{x∈X, ξ∈Ξ} ‖∇F(x; ξ)‖ < +∞, X is the full space, and E_ξ ∇F(x; ξ) = ∇E_ξ F(x; ξ).

Since F(x; ξ) is differentiable and X is the full space, the iteration reduces to

    x^{k+1} = x^k − γ_k (∇F(x^k; ξ^k) + e^k).    (25)

Corollary 2 Let the stepsizes satisfy (19), (x^k)_{k≥0} be generated by Algorithm (25), the noises obey (23), and Assumption 5 hold. Assume the Markov chain is time-homogeneous, irreducible, aperiodic, and reversible. Then, we have

    lim_k E‖∇E_ξ F(x^k; ξ)‖ = 0,    E( min_{1≤i≤k} ‖∇E_ξ F(x^i; ξ)‖² ) = O( max{1, 1/ln(1/λ)} / ∑_{i=1}^k γ_i ),    (26)

where 0 < λ < 1 is the geometric rate of the mixing time of the Markov chain.

6 Conclusion

In this paper, we have analyzed the stochastic gradient descent method where the samples are taken along the trajectory of a Markov chain. One of our main contributions is a non-ergodic convergence analysis for convex MCGD, which uses a novel line of analysis. The result is then extended to inexact gradients. This analysis lets us establish convergence for non-reversible finite-state Markov chains and for nonconvex minimization problems. Our results are useful in cases where it is impossible or expensive to take samples directly from a distribution, or the distribution is not even known, but sampling via a Markov chain is possible. Our results also apply to decentralized learning over a network, where we can employ a random walker to traverse the network and minimize an objective that is defined over the samples held at the nodes in a distributed fashion.

References

[1] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan.
Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.

[2] Martin Dyer, Alan Frieze, Ravi Kannan, Ajai Kapoor, Ljubomir Perkovic, and Umesh Vazirani. A mildly exponential time algorithm for approximating the number of solutions to a multidimensional knapsack problem. Combinatorics, Probability and Computing, 2(3):271–284, 1993.

[3] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. The Annals of Applied Probability, 1(1):62–87, 1991.

[4] Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Citeseer, 1996.

[5] Björn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In Proceedings of the 46th IEEE Conference on Decision and Control, pages 4705–4710. IEEE, 2007.

[6] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.

[7] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.

[8] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237–354, 2006.

[9] Rufus Oldenburger. Infinite powers of matrices and characteristic roots. Duke Mathematical Journal, 6(2):357–361, 1940.

[10] S. Sundhar Ram, Angelia Nedić, and Venugopal V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009.

[11] Herbert Robbins and Sutton Monro. A stochastic approximation method.
The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[12] Herbert Robbins and David Siegmund. A convergence theorem for non-negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.

[13] Ralph Tyrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

[14] Konstantin S. Turitsyn, Michael Chertkov, and Marija Vucelja. Irreversible Monte Carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, 240(4-5):410–414, 2011.

[15] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, 2018.