{"title": "Optimal Algorithms for Non-Smooth Distributed Optimization in Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2740, "page_last": 2749, "abstract": "In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.", "full_text": "Optimal Algorithms for Non-Smooth\nDistributed Optimization in Networks\n\nKevin Scaman1 Francis Bach2 S\u00e9bastien Bubeck3 Yin Tat Lee3,4 Laurent Massouli\u00e92,5\n\n1 Huawei Noah\u2019s Ark Lab, 2 INRIA, Ecole Normale Sup\u00e9rieure, PSL Research University,\n\n3 Microsoft Research, 4 University of Washington, 5 MSR-INRIA Joint Centre\n\nAbstract\n\nIn this work, we consider the distributed optimization of non-smooth convex func-\ntions using a network of computing units. 
We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in O(1/√t), the structure of the communication network only impacts a second-order term in O(1/t), where t is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a d^{1/4} multiplicative factor of the optimal convergence rate, where d is the underlying dimension.

1 Introduction

Distributed optimization finds many applications in machine learning, for example when the dataset is large and training is achieved using a cluster of computing units. As a result, many algorithms were recently introduced to minimize the average f̄ = (1/n) Σ_{i=1}^n f_i of functions f_i which are respectively accessible by separate nodes in a network [1, 2, 3, 4]. Most often, these algorithms alternate between local and incremental improvement steps (such as gradient steps) and communication steps between nodes in the network, and come with a variety of convergence rates (see for example [5, 4, 6, 7]). Recently, a theoretical analysis of first-order distributed methods provided optimal convergence rates for strongly-convex and smooth optimization in networks [8]. 
In this paper, we extend this analysis to the more challenging case of non-smooth convex optimization. The main contribution of this paper is to provide optimal convergence rates and their corresponding optimal algorithms for this class of distributed problems under two regularity assumptions: (1) the Lipschitz continuity of the global objective function f̄, and (2) a bound on the average of Lipschitz constants of local functions f_i. Under the local regularity assumption, we provide in Section 4 matching upper and lower bounds of complexity in a decentralized setting in which communication is performed using the gossip algorithm [9]. Moreover, we propose the first optimal algorithm for non-smooth decentralized optimization, called multi-step primal-dual (MSPD). Under the more challenging global regularity assumption, we show in Section 3 that distributing the simple smoothing approach introduced in [10] yields fast convergence rates with respect to communication. This algorithm, called distributed randomized smoothing (DRS), achieves a convergence rate matching the lower bound up to a d^{1/4} multiplicative factor, where d is the dimensionality of the problem.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our analysis differs from the smooth and strongly-convex setting in two major aspects: (1) the naïve master/slave distributed algorithm is in this case not optimal, and (2) the convergence rates differ between communication and local computations. More specifically, the error due to limits in communication resources enjoys fast convergence rates, as we establish by formulating the optimization problem as a composite saddle-point problem with a smooth term for communication and a non-smooth term for the optimization of the local functions (see Section 4 and Eq. (21) for more details).

Related work. 
Many algorithms were proposed to solve the decentralized optimization of an average of functions (see for example [1, 11, 3, 4, 12, 2, 13, 5]), and a large amount of work was devoted to improving the convergence rate of these algorithms [5, 6]. In the case of non-smooth optimization, fast communication schemes were developed in [14, 15], although precise optimal convergence rates were not obtained. Our decentralized algorithm is closely related to the recent primal-dual algorithm of [14], which enjoys fast communication rates in a decentralized and stochastic setting. Unfortunately, their algorithm lacks gossip acceleration to reach optimality with respect to communication time. Finally, optimal convergence rates for distributed algorithms were investigated in [8] for smooth and strongly-convex objective functions, and in [16, 17] for totally connected networks.

2 Distributed optimization setting

Optimization problem. Let G = (V, E) be a strongly connected directed graph of n computing units and diameter Δ, each having access to a convex function f_i over a convex set K ⊂ R^d. We consider minimizing the average of the local functions

min_{θ∈K} f̄(θ) = (1/n) Σ_{i=1}^n f_i(θ) ,   (1)

in a distributed setting. More specifically, we assume that each computing unit can compute a subgradient ∇f_i(θ) of its own function in one unit of time, and communicate values (i.e. vectors in R^d) to its neighbors in G. A direct communication along the edge (i, j) ∈ E requires a time τ ≥ 0. These actions may be performed asynchronously and in parallel, and each machine i possesses a local version of the parameter, which we refer to as θ_i ∈ K.

Regularity assumptions. Optimal convergence rates depend on the precise set of assumptions applied to the objective function. In our case, we will consider two different constraints on the regularity of the functions:

(A1) Global regularity: the (global) function f̄ is convex and L_g-Lipschitz continuous, in the sense that, for all θ, θ′ ∈ K,

|f̄(θ) − f̄(θ′)| ≤ L_g ‖θ − θ′‖_2 .   (2)

(A2) Local regularity: each local function f_i is convex and L_i-Lipschitz continuous, and we denote as L_ℓ = sqrt((1/n) Σ_{i=1}^n L_i^2) the ℓ2-average of the local Lipschitz constants.

Assumption (A1) is weaker than (A2), as we always have L_g ≤ L_ℓ. Moreover, we may have L_g = 0 and L_ℓ arbitrarily large, for example with two linear functions f_1(x) = −f_2(x) = ax and a → +∞. We will see in the following sections that the local regularity assumption is easier to analyze and leads to matching upper and lower bounds. For the global regularity assumption, we only provide an algorithm with a d^{1/4} competitive ratio, where d is the dimension of the problem. Finding an optimal distributed algorithm for global regularity is, to our understanding, a much more challenging task and is left for future work.

Finally, we assume that the feasible region K is convex and bounded, and denote by R the radius of a ball containing K, i.e.

∀θ ∈ K, ‖θ − θ_0‖_2 ≤ R ,   (3)

where θ_0 ∈ K is the initial value of the algorithm, which we set to θ_0 = 0 without loss of generality.

Black-box optimization procedure. The lower complexity bounds in Theorem 2 and Theorem 3 depend on the notion of black-box optimization procedures of [8], which we now recall. A black-box optimization procedure is a distributed algorithm verifying the following constraints:

1. Local memory: each node i can store past values in a (finite) internal memory M_{i,t} ⊂ R^d at time t ≥ 0. 
These values can be accessed and used at time t by the algorithm run by node i, and are updated either by local computation or by communication (defined below), that is, for all i ∈ {1, ..., n},

M_{i,t} ⊂ M^comp_{i,t} ∪ M^comm_{i,t} .   (4)

2. Local computation: each node i can, at time t, compute a subgradient of its local function ∇f_i(θ) for a value θ ∈ M_{i,t−1} in the node's internal memory before the computation, that is, for all i ∈ {1, ..., n},

M^comp_{i,t} = Span({θ, ∇f_i(θ) : θ ∈ M_{i,t−1}}) .   (5)

3. Local communication: each node i can, at time t, share a value to all or part of its neighbors, that is, for all i ∈ {1, ..., n},

M^comm_{i,t} = Span( ∪_{(j,i)∈E} M_{j,t−τ} ) .   (6)

4. Output value: each node i must, at time t, specify one vector in its memory as local output of the algorithm, that is, for all i ∈ {1, ..., n},

θ_{i,t} ∈ M_{i,t} .   (7)

Hence, a black-box procedure will return n output values—one for each computing unit—and our analysis will focus on ensuring that all local output values are converging to the optimal parameter of Eq. (1). 
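As a toy illustration of the communication rule (6), the following sketch (our own minimal simulation; the 3-node path graph and all Python names are illustrative, not from the paper) tracks the generators of each node's memory and shows that a value held by one node propagates one hop per communication step:

```python
# Toy illustration of the black-box memory model (illustrative code only).
# Each memory stores a set of generators of the Span; a communication round
# merges the memories of a node's in-neighbors, as in Eq. (6).

def communication_round(memories, in_neighbors):
    # node i's new memory spans its own values and those of its in-neighbors
    return [
        memories[i] | set().union(*(memories[j] for j in in_neighbors[i]))
        for i in range(len(memories))
    ]

# path graph 0 - 1 - 2 (undirected, so in-neighbors = graph neighbors)
in_neighbors = {0: [1], 1: [0, 2], 2: [1]}
# node 0 starts knowing the value 1.0; every node holds 0.0 (M_{i,0} = {0})
memories = [{0.0, 1.0}, {0.0}, {0.0}]

for t in range(2):  # information travels one hop per communication step
    memories = communication_round(memories, in_neighbors)

assert 1.0 in memories[2]  # node 2 learns node 0's value only after 2 hops
```

The two-hop delay mirrors the t − τ index in Eq. (6): with per-edge delay τ, a value needs Δτ time to cross a graph of diameter Δ.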
For simplicity, we assume that all nodes start with the simple internal memory M_{i,0} = {0}. Note that communications and local computations may be performed in parallel and asynchronously.

3 Distributed optimization under global regularity

The most standard approach for distributing a first-order optimization method consists in computing a subgradient of the average function

∇f̄(θ) = (1/n) Σ_{i=1}^n ∇f_i(θ) ,   (8)

where ∇f_i(θ) is any subgradient of f_i at θ, by sending the current parameter θ_t to all nodes, performing the computation of all local subgradients in parallel and averaging them on a master node. Since each iteration requires communicating twice to the whole network (once for θ_t and once for sending the local subgradients to the master node, which both take a time Δτ, where Δ is the diameter of the network) and one subgradient computation (performed on each node in parallel), the time to reach a precision ε with such a distributed subgradient descent is upper-bounded by

O( (R L_g / ε)^2 (Δτ + 1) ) .   (9)

Note that this convergence rate depends on the global Lipschitz constant L_g, and is thus applicable under the global regularity assumption. The number of subgradient computations in Eq. (9) (i.e. the term not proportional to τ) cannot be improved, since it is already optimal for objective functions defined on only one machine (see for example Theorem 3.13 p. 280 in [18]). However, quite surprisingly, the error due to communication time may benefit from fast convergence rates in O(R L_g / ε). This result is already known under the local regularity assumption (i.e. replacing L_g with L_ℓ or even max_i L_i) in the case of decentralized optimization [14] or distributed optimization on a totally connected network [17]. 
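The master/slave subgradient scheme described above can be sketched as follows (an illustrative serial simulation under our own toy losses; in the actual distributed setting, each subgradient in the sum would be evaluated on a separate machine and shipped to the master):

```python
import math

# Sketch of naive distributed subgradient descent: the master broadcasts
# theta_t, nodes return local subgradients, the master averages them and
# takes a projected subgradient step with step-size R/(L*sqrt(t)).

def distributed_subgradient(subgrads, project, R, L, T, theta0=0.0):
    theta = theta0
    iterates = []
    for t in range(1, T + 1):
        g = sum(sg(theta) for sg in subgrads) / len(subgrads)  # one round trip
        theta = project(theta - (R / (L * math.sqrt(t))) * g)
        iterates.append(theta)
    # averaging the iterates yields the classical O(RL/sqrt(T)) guarantee
    return sum(iterates) / len(iterates)

# toy example: f_i(x) = |x - a_i|; the average is minimized at the median of a
a = [-1.0, 0.0, 2.0]
subgrads = [lambda x, ai=ai: (1.0 if x > ai else -1.0) for ai in a]
theta_T = distributed_subgradient(subgrads, lambda x: max(-3.0, min(3.0, x)),
                                  R=3.0, L=1.0, T=4000, theta0=2.5)
assert abs(theta_T - 0.0) < 0.2  # close to the median, which is 0
```

Each iteration of this loop costs 2Δτ of communication plus one parallel subgradient computation, which is exactly how the bound in Eq. (9) arises.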
To our knowledge, the case of global regularity has not been investigated by prior work.

3.1 A simple algorithm with fast communication rates

We now show that the simple smoothing approach introduced in [10] can lead to fast rates for the error due to communication time. Let γ > 0 and f : R^d → R be a real function. We denote as the smoothed version of f the following function:

f^γ(θ) = E[f(θ + γX)] ,   (10)

where X ∼ N(0, I) is a standard Gaussian random variable. The following lemma shows that f^γ is both smooth and a good approximation of f.

Algorithm 1 distributed randomized smoothing

Input: approximation error ε > 0, communication graph G, α_0 = 1, α_{t+1} = 2/(1 + sqrt(1 + 4/α_t^2)), T = ⌈20 R L_g d^{1/4} / ε⌉, K = ⌈5 R L_g d^{−1/4} / ε⌉, γ_t = R d^{−1/4} α_t, η_t = R α_t / (2 L_g (d^{1/4} + sqrt(t+1)/K)).
Output: optimizer θ_T
1: Compute a spanning tree T on G.
2: Send a random seed s to every node in T.
3: Initialize the random number generator of each node using s.
4: x_0 = 0, z_0 = 0, G_0 = 0
5: for t = 0 to T − 1 do
6:   y_t = (1 − α_t) x_t + α_t z_t
7:   Send y_t to every node in T.
8:   Each node i computes g_i = (1/K) Σ_{k=1}^K ∇f_i(y_t + γ_t X_{t,k}), where X_{t,k} ∼ N(0, I)
9:   G_{t+1} = G_t + (1/(n α_t)) Σ_i g_i
10:  z_{t+1} = argmin_{x∈K} ‖x + η_{t+1} G_{t+1}‖_2^2
11:  x_{t+1} = (1 − α_t) x_t + α_t z_{t+1}
12: end for
13: return θ_T = x_T

Lemma 1 (Lemma E.3 of [10]). If γ > 0, then f^γ is (L_g/γ)-smooth and, for all θ ∈ R^d,

f(θ) ≤ f^γ(θ) ≤ f(θ) + γ L_g √d .   (11)

Hence, smoothing the objective function allows the use of accelerated optimization algorithms and provides faster convergence rates. Of course, the price to pay is that each computation of the smoothed gradient ∇f̄^γ(θ) = (1/n) Σ_{i=1}^n ∇f_i^γ(θ) now requires, at each iteration m, to sample a sufficient amount of subgradients ∇f_i(θ + γX_{m,k}) to approximate Eq. (10), where the X_{m,k} are K i.i.d. Gaussian random variables. At first glance, this algorithm requires all computing units to synchronize on the choice of X_{m,k}, which would require sending each X_{m,k} to all nodes and thus incur a communication cost proportional to the number of samples. Fortunately, computing units only need to share one random seed s ∈ R and then use a random number generator initialized with the provided seed to generate the same random variables X_{m,k} without the need to communicate any vector. The overall algorithm, denoted distributed randomized smoothing (DRS), uses the randomized smoothing optimization algorithm of [10] adapted to a distributed setting, and is summarized in Alg. 1. The computation of a spanning tree T in step 1 allows efficient communication to the whole network in time at most Δτ. Most of the algorithm (i.e. steps 2, 4, 6, 7, 9, 10 and 11) is performed on the root of the spanning tree T, while the rest of the computing units are responsible for computing the smoothed gradient (step 8). The seed s of step 2 is used to ensure that every X_{m,k}, although random, is the same on every node. Finally, step 10 is a simple orthogonal projection of the gradient step on the convex set K.

We now show that the DRS algorithm converges to the optimal parameter under the global regularity assumption.

Theorem 1. Under global regularity (A1), Alg. 
1 achieves an approximation error E[f̄(θ_T)] − f̄(θ*) of at most ε > 0 in a time T_ε upper-bounded by

O( (R L_g / ε) d^{1/4} (Δτ + 1) + (R L_g / ε)^2 ) .   (12)

More specifically, Alg. 1 completes its T iterations by time

T_ε ≤ 40 ⌈R L_g d^{1/4} / ε⌉ (Δτ + 1) + 100 ⌈R L_g d^{1/4} / ε⌉ ⌈R L_g d^{−1/4} / ε⌉ .   (13)

Comparing Eq. (13) to Eq. (9), we can see that our algorithm improves on the standard method when the dimension is not too large, and more specifically

d ≤ (R L_g / ε)^4 .   (14)

In practice, this condition is easily met, as ε ≤ 10^{−2} already leads to the condition d ≤ 10^8 (assuming that R and L_g have values around 1). Moreover, for problems of moderate dimension, the term d^{1/4} remains a small multiplicative factor (e.g. for d = 1000, d^{1/4} ≈ 6). Finally, note that DRS achieves a linear speedup when communication through the whole network requires a constant time, i.e., Δτ = O(1), and that the convexity of each local function f_i is not necessary for Theorem 1 to hold.

Remark 1. Several other smoothing methods exist in the literature, notably the Moreau envelope [19], which enjoys a dimension-free approximation guarantee. However, the Moreau envelope of an average of functions is difficult to compute (it requires a different oracle than computing a subgradient), and unfortunately leads to convergence rates with respect to local Lipschitz characteristics instead of L_g.

3.2 Optimal convergence rate

The following result provides oracle complexity lower bounds under the global regularity assumption, and is proved in the supplemental material. 
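As a concrete aside on the preceding subsection, the shared-seed construction at the heart of DRS (steps 2–3 and 8 of Alg. 1) can be sketched as follows (a scalar, single-process simulation with our own illustrative functions, not the paper's implementation): because every node seeds its generator identically, all nodes draw the same perturbations X_{t,k} without exchanging a single random vector.

```python
import random

# Sketch of the shared-seed smoothed-subgradient estimator of Alg. 1:
# each node averages K subgradients of f_i at Gaussian perturbations of
# theta, drawn from a generator initialized with a *common* seed.

def smoothed_subgrad_estimate(subgrad, theta, gamma, K, seed):
    rng = random.Random(seed)  # same seed => same X_{t,k} on every node
    return sum(subgrad(theta + gamma * rng.gauss(0.0, 1.0))
               for _ in range(K)) / K

# two "nodes" with f_1(x) = |x|, f_2(x) = |x - 1|
subgrads = [lambda x: (1.0 if x > 0 else -1.0),
            lambda x: (1.0 if x > 1 else -1.0)]

seed = 1234  # in DRS, broadcast once over the spanning tree (step 2)
g = [smoothed_subgrad_estimate(sg, 0.5, gamma=0.1, K=64, seed=seed)
     for sg in subgrads]
g_bar = sum(g) / len(g)  # estimate of the smoothed average gradient

# the two smoothed slopes nearly cancel around the minimizer region
assert abs(g_bar) < 0.1
```

Averaging the g_i over nodes gives an unbiased estimate of ∇f̄^γ(θ), which is what the accelerated outer loop of Alg. 1 consumes.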
This lower bound extends the communication complexity lower bound for totally connected communication networks of [17].

Theorem 2. Let G be a network of computing units of size n > 0, and L_g, R > 0. There exist n functions f_i : R^d → R such that (A1) holds and, for any t < ((d−2)/2) min{Δτ, 1} and any black-box procedure, one has, for all i ∈ {1, ..., n},

f̄(θ_{i,t}) − min_{θ∈B_2(R)} f̄(θ) ≥ (R L_g / 36) sqrt( 1/(1 + t/(2Δτ))^2 + 1/(1 + t) ) .   (15)

Assuming that the dimension d is large compared to the characteristic values of the problem (a standard set-up for lower bounds in non-smooth optimization [20, Theorem 3.2.1]), Theorem 2 implies that, under the global regularity assumption (A1), the time to reach a precision ε > 0 with any black-box procedure is lower-bounded by

Ω( (R L_g / ε) Δτ + (R L_g / ε)^2 ) ,   (16)

where the notation g(ε) = Ω(f(ε)) stands for ∃C > 0 s.t. ∀ε > 0, g(ε) ≥ C f(ε). This lower bound proves that the convergence rate of DRS in Eq. (13) is optimal with respect to computation time and within a d^{1/4} multiplicative factor of the optimal convergence rate with respect to communication.

The proof of Theorem 2 relies on the use of two objective functions: first, the standard worst-case function used for single-machine convex optimization (see e.g. [18]) is used to obtain a lower bound on the local computation time of individual machines. Then, a second function, first introduced in [17], is split on the two most distant machines to obtain worst-case communication times. By aggregating these two functions, a third one is obtained with the desired lower bound on the convergence rate. The complete proof is available as supplementary material. Finally, note that, due to its random nature, Alg. 
1 is not per se a black-box procedure, and Theorem 2 does not apply to it. Lower bounds for random algorithms are more challenging and left for future work.

Remark 2. The lower bound also holds for the average of local parameters (1/n) Σ_{i=1}^n θ_i, and more generally for any parameter that can be computed using the vectors of the local memories at time t: in Theorem 2, θ_{i,t} may be replaced by any θ_t such that

θ_t ∈ Span( ∪_{i∈V} M_{i,t} ) .   (17)

4 Decentralized optimization under local regularity

In many practical scenarios, the network may be unknown or changing through time, and a local communication scheme is preferable to the master/slave approach of Alg. 1. Decentralized algorithms tackle this problem by replacing targeted communication by local averaging of the values of neighboring nodes [9]. More specifically, we now consider that, during a communication step, each machine i broadcasts a vector x_i ∈ R^d to its neighbors, then performs a weighted average of the values received from its neighbors:

node i sends x_i to its neighbors and receives Σ_j W_{ji} x_j .   (18)

In order to ensure the efficiency of this communication scheme, we impose standard assumptions on the matrix W ∈ R^{n×n}, called the gossip matrix [9, 8]:

1. W is symmetric and positive semi-definite,
2. The kernel of W is the set of constant vectors: Ker(W) = Span(1), where 1 = (1, ..., 1)^⊤,
3. 
W is defined on the edges of the network: W_{ij} ≠ 0 only if i = j or (i, j) ∈ E.

Note that these assumptions are implied by symmetry, stochasticity and a positive eigengap on I − W.

4.1 Optimal convergence rate

Similarly to the smooth and strongly-convex case of [8], the lower bound on the optimal convergence rate is obtained by replacing the diameter of the network with 1/sqrt(γ(W)), where γ(W) = λ_{n−1}(W)/λ_1(W) is the ratio between the smallest non-zero and largest eigenvalues of W, also known as the normalized eigengap.

Theorem 3. Let L_ℓ, R > 0 and γ ∈ (0, 1]. There exist a matrix W of eigengap γ(W) = γ and n functions f_i satisfying (A2), where n is the size of W, such that for all t < ((d−2)/2) min(τ/√γ, 1) and all i ∈ {1, ..., n},

f̄(θ_{i,t}) − min_{θ∈B_2(R)} f̄(θ) ≥ (R L_ℓ / 108) sqrt( 1/(1 + 2t√γ/τ)^2 + 1/(1 + t) ) .   (19)

Assuming that the dimension d is large compared to the characteristic values of the problem, Theorem 3 implies that, under the local regularity assumption (A2) and for a gossip matrix W with eigengap γ(W), the time to reach a precision ε > 0 with any decentralized black-box procedure is lower-bounded by

Ω( (R L_ℓ / ε) τ/sqrt(γ(W)) + (R L_ℓ / ε)^2 ) .   (20)

The proof of Theorem 3 relies on linear graphs (whose diameter is proportional to 1/sqrt(γ(L)), where L is the Laplacian matrix) and on Theorem 2. More specifically, a technical aspect of the proof consists in splitting the functions used in Theorem 2 on multiple nodes to obtain a dependency in L_ℓ instead of L_g. The complete derivation is available as supplementary material.

4.2 Optimal decentralized algorithm

We now provide an optimal decentralized optimization algorithm under (A2). This algorithm is closely related to the primal-dual algorithm proposed by [14], which we modify by the use of accelerated gossip using Chebyshev polynomials as in [8].

First, we formulate our optimization problem in Eq. (1) as the saddle-point problem in Eq. (21) below, by considering the equivalent problem of minimizing (1/n) Σ_{i=1}^n f_i(θ_i) over Θ = (θ_1, ..., θ_n) ∈ K^n with the constraint that θ_1 = ··· = θ_n, or equivalently ΘA = 0, where A is a square root of the symmetric matrix W. Through Lagrangian duality, we therefore get the equivalent problem:

min_{Θ∈K^n} max_{Λ∈R^{d×n}} (1/n) Σ_{i=1}^n f_i(θ_i) − tr Λ^⊤ΘA .   (21)

We solve it by applying Algorithm 1 of Chambolle and Pock [21] (we could alternatively apply composite Mirror-Prox [22]), which is both simple and well tailored to our problem: (a) it is an accelerated method for saddle-point problems, (b) it allows for composite problems with a sum of non-smooth and smooth terms, and (c) it provides a primal-dual gap that can easily be extended to the case of approximate proximal operators. At each iteration t, with initialization Λ_0 = 0 and Θ_0 = Θ_{−1} = (θ_0, ..., θ_0):

(a) Λ_{t+1} = Λ_t − σ(2Θ_t − Θ_{t−1})A ,
(b) Θ_{t+1} = argmin_{Θ∈K^n} (1/n) Σ_{i=1}^n f_i(θ_i) − tr Θ^⊤Λ_{t+1}A^⊤ + (1/2η) tr(Θ − Θ_t)^⊤(Θ − Θ_t) ,   (22)

where the gain parameters η, σ are required to satisfy σηλ_1(W) ≤ 1. We implement the algorithm 
We implement the algorithm\nn) \u2208 Rd\u00d7n, for which all updates can be made\nwith the variables \u0398t and Y t = \u039btA(cid:62) = (yt\n\n1, . . . , yt\n\nn(cid:88)\n\ni=1\n\n6\n\n\fAlgorithm 2 multi-step primal-dual algorithm\nInput: approximation error \u03b5 > 0, gossip matrix W \u2208 Rn\u00d7n,\n\nK = (cid:98)1/(cid:112)\u03b3(W )(cid:99), M = T = (cid:100) 4RL(cid:96)\n\n\u03b5 (cid:101), c1 =\n\n\u221a\n1\u2212\n\u221a\n1+\n\n\u03b3(W )\n\n\u03b3(W )\n\nOutput: optimizer \u00af\u03b8T\n1: \u03980 = 0, \u0398\u22121 = 0, Y0 = 0\n2: for t = 0 to T \u2212 1 do\n3:\n4:\n5:\n6:\n7:\n8: \u0398t+1 = \u02dc\u0398M\n9: end for\n10: return \u00af\u03b8T = 1\nT\n\nY t+1 = Y t \u2212 \u03c3 ACCELERATEDGOSSIP(2\u0398t \u2212 \u0398t\u22121, W , K)\n\u02dc\u03980 = \u0398t\nfor m = 0 to M \u2212 1 do\ni \u2212 2\n\u02dc\u03b8m\n(cid:80)T\n(cid:80)n\n\n(cid:2) \u03b7\nn\u2207fi(\u02dc\u03b8m\n\ni ) \u2212 \u03b7yt+1\n\n\u02dc\u03b8m+1\ni\nend for\n\ni \u2212 \u03b8t\n\n= m\nm+2\n\ni=1 \u03b8t\ni\n\nm+2\n\nt=1\n\n1\nn\n\ni\n\n, \u03b7 = nR\nL(cid:96)\n\n1\u2212cK\n1\n1+cK\n1\n\n, \u03c3 = 1+c2K\n\u03c4 (1\u2212cK\n\n1\n\n1 )2 .\n\n// see [8, Alg. 2]\n\n(cid:3), \u2200i \u2208 {1, . . . , n}\n\nlocally: Since AA(cid:62) = W , they now become\n\n(a(cid:48)) Y t+1 = Y t \u2212 \u03c3(2\u0398t+1 \u2212 \u0398t)W\nfi(\u03b8i) \u2212 \u03b8(cid:62)\n(b(cid:48))\n\n\u03b8t+1\ni\n\n= argmin\n\u03b8i\u2208K\n\n1\nn\n\ni yt+1\n\ni +\n\n(cid:107)\u03b8i \u2212 \u03b8t\n\ni(cid:107)2,\u2200i \u2208 {1, . . . , n} ,\n\n1\n2\u03b7\n\n(23)\n\nThe step (b(cid:48)) still requires a proximal step for each function fi. We approximate it by the outcome of\nthe subgradient method run for M steps, with a step-size proportional to 2/(m + 2) as suggested\nin [23]. That is, initialized with \u02dc\u03b80\ni \u2212 2\n\u02dc\u03b8m\n\n(cid:3), m = 0, . . . 
, M \u2212 1.\n\ni, it performs the iterations\ni \u2212 \u03b8t\n\ni ) \u2212 \u03b7yt+1\n\n\u2207fi(\u02dc\u03b8m\n\n(cid:2) \u03b7\n\ni = \u03b8t\n\n\u02dc\u03b8m+1\ni\n\n(24)\n\nm\n\n=\n\ni\n\nm + 2\n\nm + 2\n\nn\n\nWe thus replace the step (b(cid:48)) by running M steps of the subgradient method to obtain \u02dc\u03b8M\nTheorem 4. Under local regularity (A2), the approximation error with the iterative algorithm of\nEq. (23) and (24) after T iterations and using M subgradient steps per iteration is bounded by\n\n.\n\ni\n\nTheorem 4 implies that the proposed algorithm achieves an error of at most \u03b5 in a time no larger than\n\n(cid:16) 1\n\nT\n\n(cid:17)\n\n+\n\n1\nM\n\n(cid:19)2(cid:19)\n\n\u00aff (\u03b8) \u2264 RL(cid:96)(cid:112)\u03b3(W )\n(cid:18) RL(cid:96)\n\u03c4(cid:112)\u03b3(W )\n\n+\n\n\u03b5\n\n1(cid:112)\u03b3(W )\n\n\u00aff (\u00af\u03b8T ) \u2212 min\n\u03b8\u2208K\n\n(cid:18) RL(cid:96)\n\n\u03b5\n\nO\n\n.\n\n(25)\n\n.\n\n(26)\n\nWhile the \ufb01rst term (associated to communication) is optimal, the second does not match the lower\nbound of Theorem 3. This situation is similar to that of strongly-convex and smooth decentralized\noptimization [8], when the number of communication steps is taken equal to the number of overall\niterations.\nBy using Chebyshev acceleration [24, 25] with an increased number of communication steps, the\nalgorithm reaches the optimal convergence rate. More precisely, since one communication step is a\nmultiplication (of \u0398 e.g.) by the gossip matrix W , performing K communication steps is equivalent\nto multiplying by a power of W . More generally, multiplication by any polynomial PK(W ) of\ndegree K can be achieved in K steps. Since our algorithm depends on the eigengap of the gossip\nmatrix, a good choice of polynomial consists in maximizing this eigengap \u03b3(PK(W )). 
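Concretely, multiplying a vector by the degree-K polynomial P_K(y) = 1 − T_K(c_2(1 − y))/T_K(c_2) of [8] costs only K multiplications by W, thanks to the three-term Chebyshev recurrence. A minimal sketch (plain Python with dense lists; our own illustrative code, not the paper's implementation):

```python
# Sketch of Chebyshev-accelerated gossip: compute P_K(W) x using the
# recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x), i.e. K gossip steps.

def matvec(W, x):
    return [sum(Wij * xj for Wij, xj in zip(row, x)) for row in W]

def chebyshev_gossip(W, x, K, gamma):
    # P_K(y) = 1 - T_K(c2 (1 - y)) / T_K(c2), with c2 = (1+gamma)/(1-gamma)
    c2 = (1.0 + gamma) / (1.0 - gamma)
    # vector recurrence for z_k = T_k(c2 (I - W)) x, starting at z_0 = x
    z_prev, z = x[:], [c2 * (xi - wi) for xi, wi in zip(x, matvec(W, x))]
    t_prev, t = 1.0, c2                 # scalar recurrence for T_k(c2)
    for _ in range(K - 1):
        z_next = [2.0 * c2 * (zi - wi) - zpi
                  for zi, wi, zpi in zip(z, matvec(W, z), z_prev)]
        z_prev, z = z, z_next
        t_prev, t = t, 2.0 * c2 * t - t_prev
    return [xi - zi / t for xi, zi in zip(x, z)]

# gossip matrix: Laplacian of the path 0 - 1 - 2 (eigenvalues 0, 1, 3,
# so the normalized eigengap is gamma(W) = 1/3)
W = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]

# P_K(W) keeps Ker(W): constant vectors are mapped exactly to zero
out = chebyshev_gossip(W, [1.0, 1.0, 1.0], K=2, gamma=1.0 / 3.0)
assert all(abs(v) < 1e-9 for v in out)
```

For K = 1 the routine reduces to plain multiplication by W (since P_1(y) = y), and larger K boosts the effective eigengap of the polynomial, which is exactly the mechanism exploited below.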
This is the approach followed by [8], which leads to the choice P_K(x) = 1 − T_K(c_2(1 − x))/T_K(c_2), where c_2 = (1 + γ(W))/(1 − γ(W)) and the T_K are the Chebyshev polynomials [24], defined as T_0(x) = 1, T_1(x) = x, and, for all k ≥ 1, T_{k+1}(x) = 2x T_k(x) − T_{k−1}(x). We refer the reader to [8] for more details on the method. Finally, as mentioned in [8], choosing K = ⌊1/sqrt(γ(W))⌋ leads to an eigengap γ(P_K(W)) ≥ 1/4 and to the optimal convergence rate.

We denote the resulting algorithm as multi-step primal-dual (MSPD) and describe it in Alg. 2. The procedure ACCELERATEDGOSSIP is extracted from [8, Algorithm 2] and performs one step of Chebyshev accelerated gossip, while steps 4 to 8 compute the approximation of the minimization problem (b′) of Eq. (23). Our performance guarantee for the MSPD algorithm is then the following:

Theorem 5. Under local regularity (A2), Alg. 2 achieves an approximation error f̄(θ̄_T) − f̄(θ*) of at most ε > 0 in a time T_ε upper-bounded by

O( (R L_ℓ / ε) τ/sqrt(γ(W)) + (R L_ℓ / ε)^2 ) ,   (27)

which matches the lower complexity bound of Theorem 3. Alg. 2 is therefore optimal under the local regularity assumption (A2).

Remark 3. It is clear from the algorithm's description that it completes its T iterations by time

T_ε ≤ ⌈4RL_ℓ/ε⌉ τ/sqrt(γ(W)) + ⌈4RL_ℓ/ε⌉^2 .   (28)

To obtain the average of local parameters θ̄_T = (1/nT) Σ_{t=1}^T Σ_{i=1}^n θ^t_i, one can then rely on the gossip algorithm [9] to average the individual nodes' time averages over the network. Let W′ = I − c_3 P_K(W), where c_3 = (1 + c_1^{2K})/(1 − c_1^K)^2. Since W′ is bi-stochastic, positive semi-definite and λ_2(W′) = 1 − γ(P_K(W)) ≤ 3/4, using it for gossiping the time averages leads to a time O( (τ/√γ) ln(R L_ℓ / ε) ) to ensure that each node reaches a precision ε on the objective function (see [9] for more details on the linear convergence of gossip), which is negligible compared to Eq. (27).

Remark 4. A stochastic version of the algorithm is also possible by considering stochastic oracles on each f_i and using stochastic subgradient descent instead of the subgradient method.

Remark 5. In the more general context where node compute times ρ_i are not necessarily all equal to 1, we may still apply Alg. 2, where now the number of subgradient iterations performed by node i is M/ρ_i rather than M. The proof of Theorem 5 also applies, and now yields the modified upper bound on the time to reach precision ε:

O( (R L_ℓ / ε) τ/sqrt(γ(W)) + (R L_c / ε)^2 ) ,   (29)

where L_c^2 = (1/n) Σ_{i=1}^n ρ_i L_i^2.

5 Conclusion

In this paper, we provide optimal convergence rates for non-smooth and convex distributed optimization in two settings: Lipschitz continuity of the global objective function, and Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide optimal convergence rates that depend on the ℓ2-average of the local Lipschitz constants and the (normalized) eigengap of the gossip matrix. 
Moreover, we also provide the first optimal decentralized algorithm, called multi-step primal-dual (MSPD).

Under the global regularity assumption, we provide a lower complexity bound that depends on the Lipschitz constant of the (global) objective function, as well as a distributed version of the smoothing approach of [10], and show that this algorithm is within a d^{1/4} multiplicative factor of the optimal convergence rate.

In both settings, the optimal convergence rate exhibits two different speeds: a slow rate in Θ(1/√t) with respect to local computations and a fast rate in Θ(1/t) due to communication. Intuitively, communication is the limiting factor in the initial phase of optimization. However, its impact decreases with time and, for the final phase of optimization, local computation time is the main limiting factor.

The analysis presented in this paper allows several natural extensions, including time-varying communication networks, asynchronous algorithms, stochastic settings, and an analysis of unequal node compute speeds going beyond Remark 5. Moreover, despite the efficiency of DRS, finding an optimal algorithm under the global regularity assumption remains an open problem and would make a notable addition to this work.

References

[1] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[3] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling.
IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

[4] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

[5] Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

[6] Dušan Jakovetić, José M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

[7] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

[8] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3027–3036, 2017.

[9] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508–2530, 2006.

[10] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

[11] Dušan Jakovetić, Joao Xavier, and José M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.

[12] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(1):2165–2199, 2016.

[13] Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers.
In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

[14] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.

[15] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27, pages 3068–3076, 2014.

[16] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems 27, pages 163–171, 2014.

[17] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems 28, pages 1756–1764, 2015.

[18] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[19] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.

[20] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

[21] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, May 2011.

[22] Niao He, Anatoli Juditsky, and Arkadi Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Optimization and Applications, 61(2):275–319, 2015.

[23] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach.
A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Technical Report 1212.2002, arXiv, 2012.

[24] W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

[25] M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.