{"title": "Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up", "book": "Advances in Neural Information Processing Systems", "page_first": 1216, "page_last": 1227, "abstract": "We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations that scales, up to a threshold, with the inverse of the spectral gap of the gossip matrix divided by the number of samples owned by each agent raised to a problem-dependent power. The presence of the threshold comes from statistics. It encodes the existence of a \"big data\" regime where the number of required iterations does not depend on the network topology. In this regime, Distributed Gradient Descent achieves optimal statistical rates with the same order of iterations as gradient descent run with all the samples in the network. Provided the communication delay is sufficiently small, the distributed protocol yields a linear speed-up in runtime compared to the single-machine protocol. This is in contrast to decentralised optimisation algorithms that do not exploit statistics and only yield a linear speed-up in graphs where the spectral gap is bounded away from zero. Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. 
Bounds are given in the standard non-parametric setting with source/capacity assumptions.", "full_text": "Optimal Statistical Rates for Decentralised\n\nNon-Parametric Regression with Linear Speed-Up\n\nDominic Richards\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nPatrick Rebeschini\n\nDepartment of Statistics\n\nUniversity of Oxford\n\n24-29 St Giles\u2019, Oxford, OX1 3LB\n\n24-29 St Giles\u2019, Oxford, OX1 3LB\n\ndominic.richards@spc.ox.ac.uk\n\npatrick.rebeschini@stats.ox.ac.uk\n\nAbstract\n\nWe analyse the learning performance of Distributed Gradient Descent in the context\nof multi-agent decentralised non-parametric regression with the square loss function\nwhen i.i.d. samples are assigned to agents. We show that if agents hold suf\ufb01ciently\nmany samples with respect to the network size, then Distributed Gradient Descent\nachieves optimal statistical rates with a number of iterations that scales, up to a\nthreshold, with the inverse of the spectral gap of the gossip matrix divided by the\nnumber of samples owned by each agent raised to a problem-dependent power.\nThe presence of the threshold comes from statistics. It encodes the existence of a\n\u201cbig data\u201d regime where the number of required iterations does not depend on the\nnetwork topology. In this regime, Distributed Gradient Descent achieves optimal\nstatistical rates with the same order of iterations as gradient descent run with all the\nsamples in the network. Provided the communication delay is suf\ufb01ciently small,\nthe distributed protocol yields a linear speed-up in runtime compared to the single-\nmachine protocol. This is in contrast to decentralised optimisation algorithms that\ndo not exploit statistics and only yield a linear speed-up in graphs where the spectral\ngap is bounded away from zero. 
Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. Bounds are given in the standard non-parametric setting with source/capacity assumptions.\n\n1 Introduction\n\nIn machine learning a canonical goal is to use training data sampled independently from an unknown distribution to fit a model that performs well on unseen data from the same distribution. With a loss function measuring the performance of a model on a data point, a common approach is to find a model that minimises the average loss on the training data with some form of explicit regularisation to control model complexity and avoid overfitting. Due to the increasingly large size of datasets and high model complexity, direct minimisation of the regularised problem is posing more and more computational challenges. This has led to growing interest in approaches that improve models incrementally using gradient descent methods [8], where model complexity is controlled through forms of implicit/algorithmic regularisation such as early stopping and step-size tuning [57, 58, 27].\nThe growth in the size of modern datasets has also meant that the coordination of multiple machines is often required to fit machine learning models. In the centralised server-clients setup, a single machine (server) is responsible for aggregating and disseminating information to other machines (clients) in what is effectively a star topology. In some settings, such as ad-hoc wireless and peer-to-peer networks, network instability, bandwidth limitation and privacy concerns make centralised approaches less feasible. This has motivated research into scalable methods that can avoid the bottleneck and vulnerability introduced by the presence of a central authority.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Such solutions are called \u201cdecentralised\u201d, as no single entity is responsible for the collection and dissemination of information: machines communicate with neighbours in a network structure that encodes communication channels.\nFrom the early works [52, 53] to more recent work [22, 34, 33, 23, 29, 30, 10, 18, 47, 31], problems in decentralised multi-agent optimisation have often been treated as a particular instance of consensus optimisation. In this framework, a network of machines or agents collaborate to minimise the average of functions held by individual agents, hence \u201creaching consensus\u201d on the solution of the global problem. In this setting the performance of the chosen protocol naturally depends on the network topology, since to solve the problem each agent has to communicate and receive information from all other agents. In particular, the number of iterations required by decentralised iterative gradient methods typically scales with the inverse of the spectral gap of the communication matrix (a.k.a. gossip or consensus matrix) [18, 42, 43], which reflects the performance of gossip protocols in the problem of distributed averaging [9, 17, 44, 4].\nMany distributed machine learning problems, in particular those involving empirical risk minimisation, have been framed in the context of consensus optimisation. However, as highlighted in [46] and more recently in [38], often these problems have more structure than consensus optimisation due to the statistical regularity of the data.
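The dependence on the spectral gap mentioned above can be illustrated numerically. The sketch below is illustrative only and not part of the paper's protocol: it builds lazy random-walk gossip matrices for a cycle and for a complete graph and computes the inverse spectral gap (1 - sigma_2)^{-1}, which grows roughly quadratically in n for the cycle while staying at 1 for the complete graph.

```python
import numpy as np

def gossip_matrix_cycle(n):
    # Lazy random-walk gossip matrix of a cycle graph: each agent keeps
    # weight 1/2 on itself and averages with its two neighbours.
    # (Laziness keeps the spectrum positive, avoiding periodicity.)
    P = np.zeros((n, n))
    for v in range(n):
        P[v, v] = 0.5
        P[v, (v - 1) % n] = 0.25
        P[v, (v + 1) % n] = 0.25
    return P

def gossip_matrix_complete(n):
    # Complete-graph gossip matrix: uniform averaging over all agents.
    return np.full((n, n), 1.0 / n)

def inverse_spectral_gap(P):
    # sigma_2 is the second largest eigenvalue of P in magnitude; the
    # iteration count of decentralised gradient methods typically
    # scales with 1 / (1 - sigma_2).
    eigs = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    return 1.0 / (1.0 - eigs[1])

for n in (8, 16, 32):
    print(n, inverse_spectral_gap(gossip_matrix_cycle(n)),
          inverse_spectral_gap(gossip_matrix_complete(n)))
```

Doubling n roughly quadruples the cycle's inverse spectral gap, matching the Theta(n^2) scaling discussed later in the text.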
When the agents\u2019 functions are the empirical risk of their local data, in\nthe setting where the local data comes from the same unknown distribution (homogeneous setting), the\nfunctions held by each agent are similar to one another by the phenomenon of statistical concentration.\nIn particular, in the limit of an in\ufb01nite amount of data per agent, the local functions are the same and\nagents do not need to communicate to solve the problem. This phenomenon highlights the existence\nof a natural trade-off between statistics and communication. While statistical similarities of local\nobjective functions and the statistics/communication trade-off have been investigated and exploited in\ncentralised server-clients setup, typically in the analysis and design of divide-and-conquer schemes\n[60, 28, 20, 32, 26, 1, 62, 46, 45, 61, 2], only recently there has been some investigation into the\ninterplay between statistics and communication/network-topology in the decentralised setting. The\nauthors in [6] investigate the interplay between the spectral norm of the data-generating distribution\nand the inverse spectral gap of the communication matrix for Distributed Stochastic Gradient Descent\nin the case of strongly convex losses. As most of the literature on decentralised machine learning,\nthis work also focuses on minimising the training error and not the test/prediction error (numerical\nexperiments are given for the test error). Some works have investigated the performance on the\ntest loss in the single-pass/online stochastic setting where agents use each data point only once.\nThe authors in [37, 51] investigate a distributed regularised online learning setting [55] and obtain\nguarantees for a \u201cmulti-step\u201d Distributed Stochastic Mirror Descent algorithm where agents reach\nconsensus on their stochastic gradients in-between computation steps. 
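The statistical-concentration phenomenon described above can be seen in a small simulation. The sketch below uses synthetic Gaussian data with hypothetical parameters (d, n, m, the noise level): each agent's local empirical gradient at omega = 0 concentrates around the common population gradient at rate O(1/sqrt(m)), so the local objectives become increasingly similar across agents as m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10                      # hypothetical dimension and number of agents
w_star = rng.normal(size=d)       # ground-truth linear model

def max_gradient_deviation(m):
    # Each agent holds m i.i.d. samples from the same distribution.
    # With standard normal inputs, the population gradient of the square
    # loss at omega = 0 is -E[y x] = -w_star; each agent's local
    # empirical gradient concentrates around it at rate O(1/sqrt(m)).
    devs = []
    for _ in range(n):
        x = rng.normal(size=(m, d))
        y = x @ w_star + 0.1 * rng.normal(size=m)
        local_grad = -(x.T @ y) / m
        devs.append(np.linalg.norm(local_grad + w_star))
    return max(devs)

for m in (100, 10_000):
    print(m, max_gradient_deviation(m))
```

The maximum deviation across agents shrinks by roughly a factor of ten when m grows a hundredfold, in line with the 1/sqrt(m) rate.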
The works [25] and [3]\nconsider the performance of Distributed Stochastic Gradient Descent algorithms in the non-convex\nsmooth case. They investigate the average performance of the agents over the network in terms of\nconvergence to a stationary point of the test loss [19] and show that a linear speed-up in computational\ntime can be achieved provided the number of samples seen, equivalently the number of iterations\nperformed, exceeds the network size times the inverse of the spectral gap, each raised to a certain\npower. The work [38] seems to be the \ufb01rst to have considered minimisation of the test error in the\nmulti-pass/of\ufb02ine stochastic setting that more naturally relates to the classical literature on consensus\noptimisation. The authors investigate stability of Distributed Stochastic Gradient Descent on the test\nerror and show that for smooth and convex losses the number of iterations required to achieve optimal\nstatistical rates scales with the inverse of the spectral gap of the gossip matrix, a term that captures\nthe noise of the gradients\u2019 estimates, and a term that controls the statistical proximity of the local\nempirical losses.\n\n1.1 Contributions\n\nIn this work we investigate the implicit-regularisation learning performance of full-batch Distributed\nGradient Descent [33] on the test error in the context of non-parametric regression with the square\nloss function. In the homogeneous setting where agents hold independent and identically distributed\ndata points, we investigate the choice of step size and number of iterations that guarantee each agent to\nachieve optimal statistical rates with respect to all the samples in the network. We build a theoretical\nframework that allows to directly and explicitly exploit the statistical concentration of quantities\n(i.e. batched gradients) held by agents. On the one hand, exploiting concentration yields savings on\ncomputation, i.e. 
it allows to achieve faster convergence rates compared to methods that do not exploit\n\n2\n\n\fconcentration in their parameter tuning. On the other hand, it yields savings on communication,\nas it allows to take advantage of the trade-off between statistical power and communication costs.\nFirstly, we show that if agents hold suf\ufb01ciently many samples with respect to the network size, then\nDistributed Gradient Descent achieves optimal statistical rates up to poly-logarithmic factors with\na number of iterations that scales with the inverse of the spectral gap of the communication matrix\ndivided by the number of samples owned by each agent raised to a problem-dependent power, up\nto a statistics-induced threshold. Previous results for decentralised iterative gradient schemes in the\ncontext of consensus optimisation do not take advantage of the statistical nature of decentralised\nempirical risk minimisation problems. In the statistical setting that we consider, these methods would\nrequire a larger number of iterations that scales only with respect to the inverse of the spectral gap.\nSecondly, we show that if agents additionally hold suf\ufb01ciently many samples with respect to the\ninverse of the spectral gap, then the same order of iterations allows Distributed Gradient Descent and\nSingle-Machine Gradient Descent (i.e. gradient descent run on a single machine that holds all the\nsamples in the network) to achieve optimal statistical rates up to poly-logarithmic factors. Provided\nthe communication delay is suf\ufb01ciently small, this yields a linear speed-up in runtime over Single-\nMachine Gradient Descent, with a \u201csingle-step\u201d method that performs a single communication round\nper local gradient descent step. Single-step methods that do not exploit concentration can only achieve\na linear speed-up in runtime in graphs with spectral gap bounded away from zero, i.e. expanders or\nthe complete graph. 
Our results demonstrate how the increased statistical similarity between the local empirical risk functions can make up for a decreased connectivity in the graph topology, showing that a linear speed-up in runtime can be achieved in any graph topology by exploiting concentration. To the best of our knowledge, we are the first to isolate this type of phenomenon.\nWe prove our results under the standard \u201csource\u201d and \u201ccapacity\u201d assumptions in non-parametric regression. These assumptions relate, respectively, to the projection of the optimal predictor on the hypothesis space and to the effective dimension of this space [59, 12]. A contribution of this work is to show that proper tuning yields, up to poly-logarithmic terms, optimal non-parametric rates in decentralised learning. As far as we are aware, in the distributed setting such guarantees have been established only for centralised divide-and-conquer methods [60, 28, 20, 32, 26].\nTo prove our results we build upon previous work for Single-Machine Gradient Descent applied to non-parametric regression, in particular the line of works [57, 40, 27]. Exploiting the fact that in our setting the iterates of Distributed Gradient Descent can be written in terms of products of linear operators depending on the data held by agents, we decompose the excess risk into bias and sample variance terms for Single-Machine Gradient Descent plus an additional quantity that captures the error incurred by using a decentralised protocol over the communication network. We analyse this network error term by further decomposing it into a term that behaves similarly to the consensus error previously considered in [18, 33], and a new higher-order term.
We control both terms by using\nthe structure of the gradient updates, which allows us to analyse the interplay between statistics, via\nconcentration, and network topology, via mixing of random walks related to the gossip matrix.\nThe work is structured as follows. Section 2 presents the setting, assumptions, and algorithm that\nwe consider. Section 3 states the main convergence result and discusses implications from the point\nof view of statistics, computation and communication. Section 4 presents the error decomposition\ninto bias, variance, and network error, and it illustrates the implicit regularisation strategy that we\nadopt. Section 5 highlights some of the features of our contribution in the light of future research\ndirections. The appendix in the supplementary material is structured as follows. Section A includes\nsome remarks about our results. Section B illustrates the main scheme of the proofs, highlighting the\ninterplay between statistics and network topology. Section C contains the full details of the proofs.\n\n2 Setup\n\nIn this section we describe the learning problem, assumptions and algorithm that we consider.\n\n2.1 Learning problem: decentralised non-parametric least-squares regression\n\nWe adopt the setting used in [40, 27], which involves regression in abstract Hilbert spaces. This\nsetting is of relevance for applications related to the Reproducing Kernel Hilbert Space (RKHS). See\nthe work in [57] and references therein.\n\n3\n\n\fLet H be a separable Hilbert Space with inner product and induced norm denoted by (cid:104)\u00b7 , \u00b7(cid:105)H and\n(cid:107) \u00b7 (cid:107)H, respectively. Let X \u2286 H be the input space and Y \u2282 R be the output space. Let \u03c1 be an\nunknown probability measure on Z = X \u00d7 Y , \u03c1X (\u00b7 ) be the marginal on X, and \u03c1(\u00b7|x) be the\nconditional distribution on Y given x \u2208 X. 
Assume that there exists a constant \u03ba \u2208 [1, \u221e) so that\n\n\u27e8x, x'\u27e9_H \u2264 \u03ba^2, \u2200x, x' \u2208 X. (1)\n\nLet the network of agents be modelled by a simple, connected, undirected, finite graph G = (V, E), with |V| = n nodes joined by edges E \u2286 V \u00d7 V. Edges represent communication constraints: agents v, w \u2208 V can only communicate if they share an edge (v, w) \u2208 E. We consider the homogeneous setting where each agent v \u2208 V is given m data points z_v := {x_v, y_v} sampled independently from \u03c1, where x_v = {x_{i,v}}_{i=1,...,m} and y_v = {y_{i,v}}_{i=1,...,m}, and each pair (x_{i,v}, y_{i,v}) is sampled from \u03c1. The problem under study is the minimisation of the test/prediction risk with the square loss:\n\ninf_{\u03c9\u2208H} E(\u03c9), E(\u03c9) = \u222b_{X\u00d7Y} (\u27e8\u03c9, x\u27e9_H \u2212 y)^2 d\u03c1(x, y), (2)\n\nThe quality of an approximate solution \u03c9\u0302 \u2208 H is measured by the excess risk E(\u03c9\u0302) \u2212 inf_{\u03c9\u2208H} E(\u03c9).\n\nNotation. Given a matrix A \u2208 R^{n\u00d7n}, let A_{vw} denote the (v, w)-th element and A_v = (A_{vw})_{w=1,...,n} denote the v-th row. Let O(\u00b7) denote orders of magnitude up to constants in n and m, and \u00d5(\u00b7) denote orders of magnitude up to both constants and poly-logarithmic terms in n and m. Let \u2272, \u2273, \u2243 denote inequalities and equalities modulo constants and poly-logarithmic terms in n, m. We use the notation a \u2228 b = max{a, b} and a \u2227 b = min{a, b}.\n\n2.2 Assumptions\n\nThe assumptions that we consider are standard in non-parametric regression [27, 35]. The first assumption is a control on the even moments of the response.\nAssumption 1. There exist M \u2208 (0, \u221e) and \u03bd \u2208 (1, \u221e) such that for any \u2113 \u2208 N we have \u222b_Y y^{2\u2113} d\u03c1(y|x) \u2264 \u03bd \u2113! M^{\u2113} \u03c1_X-almost surely.\nLet L^2(H, \u03c1_X) be the Hilbert space of square-integrable functions from H to R with respect to \u03c1_X, with norm \u2016f\u2016_\u03c1 := (\u222b_X |f(x)|^2 d\u03c1_X(x))^{1/2}. Let L_\u03c1 : L^2(H, \u03c1_X) \u2192 L^2(H, \u03c1_X) be the operator defined as L_\u03c1(f) := \u222b_X \u27e8x, \u00b7\u27e9_H f(x) d\u03c1_X(x). Under Assumption 1 the operator L_\u03c1 can be proved to be in the class of positive trace operators [15], and therefore the r-th power L_\u03c1^r, with r \u2208 R, can be defined by using spectral theory. Let us also define the operator T_\u03c1 : H \u2192 H as T_\u03c1 := \u222b_X \u27e8x, \u00b7\u27e9_H x d\u03c1_X(x) and its operator norm \u2016T_\u03c1\u2016 := sup_{\u03c9\u2208H, \u2016\u03c9\u2016_H = 1} \u2016T_\u03c1 \u03c9\u2016_H. The function minimising the expected squared loss (2) over all measurable functions f : H \u2192 R is known to be the conditional expectation f_\u03c1(x) := \u222b_Y y d\u03c1(y|x) for x \u2208 X. Let H_\u03c1 := {f : X \u2192 R | \u2203\u03c9 \u2208 H with f(x) = \u27e8\u03c9, x\u27e9_H, \u03c1_X-almost surely} be the hypothesis space that we consider. The optimal f_\u03c1 may not be in H_\u03c1, as under Assumption 1 the space of functions searched, H_\u03c1, is a subspace of L^2(H, \u03c1_X). Let f_H denote the projection of f_\u03c1 onto the closure of H_\u03c1 in L^2(H, \u03c1_X). Searching for a solution to (2) is equivalent to searching for a linear function in H_\u03c1 that approximates f_H.\nThe following assumption quantifies how well the target function f_H can be approximated in H_\u03c1.\nAssumption 2. There exist r > 0 and R > 0 such that \u2016L_\u03c1^{\u2212r} f_H\u2016_\u03c1 \u2264 R.\nThis assumption is often called the \u201csource\u201d condition [12]. Representing f_H in the eigenspace of L_\u03c1, this condition can be related to the rate at which the coefficients of this representation decay. The bigger r is, the faster the decay, and the more stringent the assumption is. In particular, if r \u2265 1/2 then the target function is in the hypothesis space: f_H \u2208 H_\u03c1. The last assumption is on the capacity of the hypothesis space.\nAssumption 3. There exist \u03b3 \u2208 (0, 1], c_\u03b3 > 0 such that Tr(L_\u03c1 (L_\u03c1 + \u03bbI)^{\u22121}) \u2264 c_\u03b3 \u03bb^{\u2212\u03b3} for all \u03bb > 0.\nAssumption 3 relates to the effective dimension of the underlying regression problem [59, 12] and is often called the \u201ccapacity\u201d assumption. This assumption is always satisfied for \u03b3 = 1 and c_\u03b3 = \u03ba^2 since L_\u03c1 is a trace class operator. This case is called the capacity-independent setting. Meanwhile, this assumption is satisfied for \u03b3 \u2208 (0, 1] if, for instance, the eigenvalues of L_\u03c1, denoted by {\u03c4_i}_{i\u22651}, decay sufficiently quickly, i.e. \u03c4_i = O(i^{\u22121/\u03b3}). This case allows improved rates to be obtained. For more details on the interpretation of these assumptions we refer to the work in [40, 27, 35].\n\n2.3 Algorithm: distributed gradient descent\n\nWe now describe the Distributed Gradient Descent algorithm [33] and its application to the problem of non-parametric regression. Let P \u2208 R^{n\u00d7n}_{\u22650} be a symmetric doubly-stochastic matrix, i.e. P = P\u22a4 and P1 = 1 where 1 = (1, . . . , 1) \u2208 R^n is the vector of all ones. Let P be supported on the graph, i.e. for any v \u2260 w, P_{vw} \u2260 0 only if (v, w) \u2208 E. The matrix P encodes local averaging on the network: when each agent has a real number represented by the vector a = (a_v)_{v\u2208V} \u2208 R^n, the vector (Pa)_v = \u2211_{w\u2208V} P_{vw} a_w for v \u2208 V encodes what each agent computes after taking a weighted average of its own and neighbours' numbers. Distributed Gradient Descent is implemented by communication on the network through the gossip matrix P. Initialised at \u03c9_{1,v} = 0 for v \u2208 V, the iterates of Distributed Gradient Descent are defined as follows, for v \u2208 V and t \u2265 1:\n\n\u03c9_{t+1,v} = \u2211_{w\u2208V} P_{vw} (\u03c9_{t,w} \u2212 \u03b7_t (1/m) \u2211_{i=1}^{m} (\u27e8\u03c9_{t,w}, x_{i,w}\u27e9_H \u2212 y_{i,w}) x_{i,w}), (3)\n\nwhere {\u03b7_t}_{t\u22651} is the sequence of positive step sizes. The iterates (3) can be seen as a combination of two steps: first, each agent w \u2208 V performs a local gradient descent step \u03c9_{t+1/2,w} = \u03c9_{t,w} \u2212 \u03b7_t (1/m) \u2211_{i=1}^{m} (\u27e8\u03c9_{t,w}, x_{i,w}\u27e9_H \u2212 y_{i,w}) x_{i,w}; second, each agent performs local averaging through the consensus step1 \u03c9_{t+1,v} = \u2211_{w\u2208V} P_{vw} \u03c9_{t+1/2,w}. We treat gradient descent as a statistical device. We are interested in tuning the parameters of the algorithm to bound the expected value of the excess risk E[E(\u03c9_{t+1,v})] \u2212 inf_{\u03c9\u2208H} E(\u03c9), where E[\u00b7] denotes expectation with respect to the data {z_v}_{v\u2208V}.\n\nNetwork dependence. Let \u03c32 be the second largest eigenvalue in magnitude of the communication matrix P. Specifically, given the spectral decomposition of the gossip matrix P = \u2211_{l=1}^{n} \u03bb_l u_l u_l\u22a4, where 1 = \u03bb1 \u2265 \u03bb2 \u2265 . . . \u2265 \u03bbn > \u22121 are the ordered real eigenvalues of P and {u_l}_{l=1,...,n} the associated eigenvectors, we have \u03c32 := max{|\u03bb2|, |\u03bbn|}. In many settings, the spectral gap scales with the size of the network raised to a certain power depending on the topology. For instance, supposing G is a finite regular graph and the communication matrix is the random walk matrix, then the inverse of the spectral gap (1 \u2212 \u03c32)^{\u22121} scales as \u0398(1) for a complete graph, \u0398(n) for a grid, and \u0398(n^2) for a cycle [14, 24, 18]. The question of designing gossip matrices P that yield better (smaller) scaling for the quantity (1 \u2212 \u03c32)^{\u22121} has been investigated [56], and it has been found numerically that the rates mentioned above cannot be improved unless lifted graphs are considered [44].\n\n3 Main result: optimal statistical rates with linear speed-up in runtime\n\nWe now state and highlight the main contribution of this work in the context of decentralised statistical optimisation. The result that we are about to state in Theorem 1 showcases the interplay between statistics and communication that arises from the statistical regularities of the problem. This result shows the existence of a \u201cbig data\u201d regime where Distributed Gradient Descent can achieve a linear (in the number of agents n) speed-up in runtime compared to Single-Machine Gradient Descent.\nTheorem 1. Let Assumptions 1, 2, 3 hold with r \u2265 1/2 and 2r + \u03b3 > 2. Let t be the smallest integer greater than the quantity\n\n(nm)^{1/(2r+\u03b3)} \u00d7 { ((nm)^{2r/(2r+\u03b3)} / (m(1 \u2212 \u03c32)^{\u03b3}))^{1/\u03b3} \u2228 1 if m \u2265 n^{2r/\u03b3}; (nm)^{r/(2r+\u03b3)} / (\u221am (1 \u2212 \u03c32)) otherwise },\n\nwhere the first factor (nm)^{1/(2r+\u03b3)} is the order of the Single-Machine iterations. Let \u03b7_s \u2261 \u03b7 = \u03ba^{\u22122} (nm)^{1/(2r+\u03b3)} / t for all s \u2265 1. If m \u2265 n^{(2r+2+\u03b3)/(2r+\u03b3\u22122)} and n \u2265 2(1 + r) log(n/(1 \u2212 \u03c32)), then \u2200v \u2208 V:\n\nE[E(\u03c9_{t+1,v})] \u2212 inf_{\u03c9\u2208H} E(\u03c9) \u2264 C(nm)^{\u22122r/(2r+\u03b3)},\n\nwhere C depends on \u03ba^2, \u2016T_\u03c1\u2016, M, \u03bd, r, R, \u03b3, c_\u03b3, and polynomials of log(nm) and log(1/(1 \u2212 \u03c32)).\n\n1 We note that, while this assumes agents communicate infinite-dimensional quantities in the general non-parametric setting, the framework we consider accommodates finite approximations of infinite-dimensional quantities whilst accounting for the statistical precision [13].\n\nTheorem 1 shows that when agents are given sufficiently many samples (m) with respect to the number of agents (n), m \u2265 n^{(2r+2+\u03b3)/(2r+\u03b3\u22122)}, proper tuning of the step size and number of iterations (a form of implicit regularisation) allows Distributed Gradient Descent to recover the optimal statistical rate O((nm)^{\u22122r/(2r+\u03b3)}) for r \u2208 (1/2, 1) [12] up to poly-logarithmic terms.\nSingle-Machine Gradient Descent run on all of the observations has been previously shown to reach optimal statistical accuracy with a number of iterations of the order t_{Single-Machine} \u223c O((nm)^{1/(2r+\u03b3)}) [27]. The number of iterations t \u2261 t_{Distributed} prescribed by Theorem 1 scales like t_{Single-Machine} times a network-dependent factor that is a function of the inverse of the spectral gap (1 \u2212 \u03c32)^{\u22121}. The fact that the number of iterations required to reach a prescribed level of error accuracy is inversely proportional to the spectral gap is a standard feature of iterative gradient methods applied to generic decentralised consensus optimisation problems [18, 42, 43].
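As a rough illustration of the tuning in Theorem 1, the sketch below transcribes the prescribed iteration count, up to constants and integer rounding; the parameter values in the usage lines are hypothetical and chosen only to contrast a well-connected with a poorly connected network.

```python
import math

def iterations(n, m, r, gamma, sigma2):
    # Iteration count prescribed by Theorem 1, up to constants: the
    # single-machine order (nm)^{1/(2r+gamma)} times a network-dependent
    # factor that saturates at the threshold 1 in the "big data" regime.
    t_single = (n * m) ** (1.0 / (2 * r + gamma))
    if m >= n ** (2 * r / gamma):
        factor = max(((n * m) ** (2 * r / (2 * r + gamma))
                      / (m * (1 - sigma2) ** gamma)) ** (1.0 / gamma), 1.0)
    else:
        factor = (n * m) ** (r / (2 * r + gamma)) / (math.sqrt(m) * (1 - sigma2))
    return math.ceil(t_single * factor)

# Well-connected network (sigma2 small): the threshold is hit and the
# distributed iteration count matches the single-machine order.
print(iterations(10, 100_000, 1.0, 0.5, 0.0))
# Poorly connected network: more iterations are needed.
print(iterations(10, 100_000, 1.0, 0.5, 0.9))
```

With these illustrative values m exceeds n^{2r/gamma}, so the first branch applies; as sigma2 grows towards 1 the network factor leaves the threshold and the iteration count inflates accordingly.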
This dependence encodes the fact that in the case of generic objective functions assigned to agents, agents have to share information with everyone to solve the global problem and minimise the sum of the local functions; hence, more iterations are required in graph topologies that are less well-connected. In the present homogeneous setting, however, the statistical nature of the problem allows one to exploit concentration of random variables to characterise the existence of a (network-dependent) \u201cbig data\u201d regime where the number of iterations does not depend on the network topology. The trade-off between statistics and communication is encoded by the dependence of the tuning parameters (stopping time and step size) on the number of samples m assigned to each agent. Observe that the factor ((nm)^{2r/(2r+\u03b3)} / (m(1 \u2212 \u03c32)^{\u03b3}))^{1/\u03b3} \u2228 1 is a decreasing function of m, up to the threshold 1. When m \u2265 n^{2r/\u03b3}/(1 \u2212 \u03c32)^{2r+\u03b3} \u2228 n^{(2r+2+\u03b3)/(2r+\u03b3\u22122)} this factor becomes 1 and Theorem 1 guarantees that the same order of iterations allows both Distributed and Single-Machine Gradient Descent to achieve the optimal statistical rates up to poly-logarithmic factors. This regime represents the case when the increased statistical similarity between the local empirical risk functions assigned to each agent (increasing as a function of m, as described by the non-asymptotic Law of Large Numbers) makes up for the decreased connectivity in the graph topology (typically decreasing with the spectral gap 1 \u2212 \u03c32) to yield a linear speed-up in runtime over Single-Machine Gradient Descent when the communication delay between agents is sufficiently small. See Section 3.1 below.\nThe result of Theorem 1 depends on some other requirements, which we now briefly discuss. The requirement n \u2265 2(1 + r) log(n/(1 \u2212 \u03c32)) is technical and arises from the need to perform sufficiently many iterations to reach the mixing time of the gossip matrix P, i.e. t \u2273 (1 \u2212 \u03c32)^{\u22121}; note that the number of iterations t depends on the number of agents, the number of samples, and the spectral gap. The requirement 2r + \u03b3 > 2 relates to the difficulty of the estimation problem and is stronger than a similar condition seen for single-machine gradient methods, where 2r + \u03b3 > 1; see for instance the works [27, 35]. This requirement, alongside m \u2265 n^{(2r+2+\u03b3)/(2r+\u03b3\u22122)}, ensures that the higher-order error terms arising from considering a decentralised protocol decay sufficiently quickly with respect to the number of samples m owned by agents. The condition m \u2265 n^{(2r+2+\u03b3)/(2r+\u03b3\u22122)} can be removed if the covariance operator T_\u03c1 is assumed to be known to agents, which aligns with the additive noise oracle in single-pass Stochastic Gradient Descent [16] or fixed-design regression in finite-dimensional settings [21]. The condition m \u2265 n^{2r/\u03b3} corresponds to the case when the rate of concentration of the batched gradients held by agents (i.e. 1/m) is faster than the optimal statistical rate, i.e. 1/m \u2264 (nm)^{\u22122r/(2r+\u03b3)}. This condition becomes more stringent (i.e. more data per agent is needed) as the problem becomes easier from a statistical point of view and r and 1/\u03b3 increase (see discussion in Section 2.2). This is due to the fact that as r and 1/\u03b3 increase, only the statistical rate improves while the rate of concentration in the network error stays the same, implying that more data is needed to balance the two terms.\n\n3.1 Linear speed-up in runtime\n\nLet gradient computations cost 1 unit of time and the communication delay between agents be \u03c4 units of time.2 Denote the number of iterations required by Single-Machine Gradient Descent and Distributed Gradient Descent to achieve the optimal statistical rate by t_{Single-Machine} and t_{Distributed}, respectively. The speed-up in computational time obtained by running the distributed protocol over the single-machine protocol is of the order (t_{Single-Machine} nm) / (t_{Distributed} (m + \u03c4 + Deg(P))), where Deg(P) = max_{v\u2208V} |{P_{vw} \u2260 0, w \u2208 V}| is the maximum degree of the communication matrix P. Theorem 1 implies that when m \u2265 n^{2r/\u03b3}/(1 \u2212 \u03c32)^{2r+\u03b3} \u2228 n^{(2r+2+\u03b3)/(2r+\u03b3\u22122)} then t_{Distributed} \u223c t_{Single-Machine}, and if \u03c4 + Deg(P) grows as O(m) then the speed-up in computational time is of order n, linear in the number of agents. Classical \u201csingle-step\u201d decentralised methods that alternate single communication rounds per local gradient computation, such as the methods inspired by [33], do not exploit concentration and have a runtime that scales with the inverse of the spectral gap, without any threshold. As a result, these methods only yield a linear speed-up in graphs with spectral gap bounded away from zero, i.e. expanders or the complete graph. See below for more details.\n\n2 For details on this communication model as well as a comparison to [50] see the remarks within Appendix A.
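The runtime comparison above can be made concrete with a toy calculation; all parameter values below are illustrative, not taken from the paper.

```python
def speedup(n, m, t_single, t_dist, tau, deg):
    # Single-machine runtime: t_single iterations, each costing n*m
    # unit-time gradient computations. Distributed runtime: t_dist
    # iterations, each costing m local gradient computations plus
    # tau + Deg(P) for one communication round.
    return (t_single * n * m) / (t_dist * (m + tau + deg))

# In the "big data" regime t_dist ~ t_single, and when tau + Deg(P)
# is O(m) the speed-up is linear in the number of agents n.
print(speedup(n=16, m=1000, t_single=200, t_dist=200, tau=5, deg=2))
```

Here the communication cost tau + Deg(P) is small next to m, so the ratio comes out just under n = 16, i.e. a near-linear speed-up.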
On the other hand, “multi-step” methods that alternate multiple communication rounds with each local gradient computation, such as the ones considered in [37, 51, 42, 43], display a runtime that scales with a factor of the form m + (τ + Deg(P))/(1 − σ2) in our setting. Thus, while these methods can achieve a linear speed-up in any graph topology in the “big data” regime m ≳ (τ + Deg(P))/(1 − σ2) without exploiting concentration, they require an additional amount of communication rounds that is network-dependent and scales with the inverse of the spectral gap. For a cycle graph, for instance, this means an extra O(n^2) communication steps per iteration (or O(n) for gossip-accelerated methods). Hence, classical decentralised optimisation methods that do not exploit concentration suffer from a trade-off between runtime and communication cost: reducing the first increases the second, and vice versa. Our results show that single-step methods can achieve a linear speed-up in runtime in any graph topology by exploiting concentration: statistics allows us to find a regime where it is possible to simultaneously have a linear speed-up in runtime without increasing communication.

Comparison to single-step decentralised methods that do not exploit concentration Decentralised optimisation methods that do not consider statistical concentration rates in their parameter tuning cannot exploit the statistics/communication trade-off encoded by the presence of the factor ((nm)^{2r/(2r+γ)}/(m(1 − σ2)^γ))^{1/γ} ∨ 1 in Theorem 1, and they typically require a smaller step size and more iterations to achieve optimal statistical rates. The convergence rate typically achieved by classical consensus optimisation methods, e.g.
[18], is recovered in Theorem 1 when m = n^{2r/γ}, as in this case the number of iterations required becomes t ∼ (nm)^{1/(2r+γ)}/(1 − σ2), which corresponds to tSingle-Machine scaled by a certain power of 1/(1 − σ2) (in our setting the power is 1). This represents the setting where the choice of step size aligns with the single-machine choice scaled by (1 − σ2), and a linear speed-up occurs when (1 − σ2)^{−1} = O(1). Since in our case the network error is decreasing in m (due to concentration), larger step sizes can be chosen for m > n^{2r/γ}. Specifically, the single-machine step size is now scaled by [(1 − σ2)(m/n^{2r/γ})^{1/(2r+γ)}] ∧ 1, yielding a linear speed-up when (1 − σ2)^{−1} = O((m/n^{2r/γ})^{1/(2r+γ)}), which, as m increases, is a weaker requirement on the network topology than in the standard consensus optimisation setting.

4 General result: error decomposition and implicit regularisation

Theorem 1 is a corollary of the next result, which explicitly highlights the interplay between statistics and network topology and the implicit regularisation role of the step size and number of iterations.

Theorem 2. Let Assumptions 1, 2, 3 hold with r ≥ 1/2. Let η_s = η s^{−θ} for all s ≥ 1, with θ ∈ (0, 3/4) and η ∈ (0, κ^{−2}].
If t/2 ≥ ⌈(r + 1) log(t)/(1 − σ2)⌉ =: t⋆, then for all v ∈ V, α ∈ [0, 1/2] and γ′ ∈ [1, γ]:

E[E(ω_{t+1,v})] − inf_{ω∈H} E(ω)
  ≤ [q1 (ηt^{1−θ})^{−2r} + q2 (nm)^{−2r/(2r+γ)} (1 ∨ (nm)^{−2/(2r+γ)}(ηt^{1−θ})^2 ∨ t^{−2}(ηt^{1−θ})^2)] log^2(t)   (4)
  + q3 (η^2 t^{−2r} ∨ m^{−1}(ηt⋆)^{1+2α} ∨ (ηt⋆)^{γ′+2α}) log^2(n) log^2(t⋆) / m   (5)
  + q4 (1 ∨ (ηt^{1−θ})^2 ∨ t^{−2}(ηt^{1−θ})^4)(m^{−1}ηt^{1−θ} ∨ (ηt^{1−θ})^γ) log^4(n) log^2(t) / m^2   (6)

where q1, q2, q3, q4 are constants depending on κ^2, ‖Tρ‖, M, ν, r, R, γ, cγ.

The bound in Theorem 2 shows that the excess risk decomposes into three main terms, as detailed in Section B.1. The first term (4) corresponds to the error achieved by Single-Machine Gradient Descent run on all nm samples; it consists of both bias and sample-variance terms [27]. The remaining two terms (5) and (6) characterise the network error due to the use of a decentralised protocol. These terms decrease with the number of samples m owned by each agent. This captures the fact that, as agents are given samples from the same unknown distribution, they are in fact solving the same learning problem, and their local empirical loss functions concentrate to the same objective as m increases. The decentralised error is itself composed of two terms which decay at different rates with respect to m. The term (5) is dominant and decays at the order Õ(1/m); it can be interpreted as the consensus error seen, for instance, in the works [33, 18]. As in that setting, this quantity is increasing in the step size η and decreasing in the spectral gap 1 − σ2 of the communication matrix, as encoded by t⋆. The term (6) decays at the faster rate Õ(1/m^2); it is a higher-order error term that does not appear in the error decomposition when the covariance operator Tρ is assumed to be known to agents. It arises from the interaction between the local averaging on the network through P and what has previously been labelled the “multiplicative” noise in the single-machine single-pass stochastic gradient setting for least squares [16], i.e. the empirical covariance operator interacting with the iterates at each step. Section B.2 provides a high-level illustration of the analysis of the network error terms (5) and (6).

The bound in Theorem 2 also shows how the algorithmic parameters, namely the step size and the number of iterations, act as regularisation parameters for Distributed Gradient Descent, mirroring what is seen in the single-machine setting. Theorem 1 demonstrates how optimal statistical rates can be recovered by tuning these parameters appropriately with respect to the network topology, the network size, the number of samples, and the estimation problem itself. The bound in Theorem 1 is obtained from the bound in Theorem 2 by first tuning the quantity ηt to the order (nm)^{1/(2r+γ)} so that the bias and variance terms in (4) achieve the optimal statistical rate. This leaves the remaining degree of freedom (say η) to be tuned so that the network error also achieves the optimal statistical rate. The high-level idea is the following. As m increases, the network error is dominated by the term in (5) that is proportional to the factor (ηt⋆)^{γ′+2α}/m.
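For concreteness, the recursion whose parameters are being tuned has each agent gossip its iterate through P and then take a local gradient step with decaying step size η_s = η s^{−θ}. The following minimal sketch instantiates it for finite-dimensional linear least squares on a cycle graph; the paper's setting is an RKHS, and the dimensions, data and parameter values below are illustrative assumptions, not figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, T = 8, 200, 5, 100      # agents, samples per agent, dimension, iterations
eta0, theta = 0.5, 0.25          # decaying step size eta_s = eta0 * s**(-theta)

# Doubly stochastic gossip matrix P for a cycle graph: each agent averages
# its iterate with its two neighbours before taking a local gradient step.
P = np.zeros((n, n))
for v in range(n):
    P[v, v] = 0.5
    P[v, (v - 1) % n] += 0.25
    P[v, (v + 1) % n] += 0.25

w_star = rng.normal(size=d)                   # ground-truth regression vector
X = rng.normal(size=(n, m, d))                # agent v holds the batch (X[v], y[v])
y = np.einsum('vij,j->vi', X, w_star) + 0.1 * rng.normal(size=(n, m))

W = np.zeros((n, d))                          # row v is agent v's iterate
for s in range(1, T + 1):
    residual = np.einsum('vij,vj->vi', X, W) - y        # X_v w_v - y_v per agent
    grads = np.einsum('vij,vi->vj', X, residual) / m    # local empirical gradients
    W = P @ W - eta0 * s ** (-theta) * grads            # gossip step + gradient step

err = np.linalg.norm(W.mean(axis=0) - w_star)
print(f"error of averaged iterate: {err:.3f}")
```

Note that a single communication round alternates with each local gradient computation, i.e. the "single-step" protocol discussed in Section 3.1.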
There are two ways to choose the largest possible step size η that guarantees this factor is Õ((nm)^{−2r/(2r+γ)}), depending on whether or not the rate of concentration of the batched gradients held by agents is faster than the optimal statistical rate, i.e., whether or not m ≥ n^{2r/γ} holds (cf. Section 3). The two cases yield the factors ((nm)^{2r/(2r+γ)}/(m(1 − σ2)^γ))^{1/γ} ∨ 1 and (nm)^{r/(2r+γ)}/√(m(1 − σ2)) in Theorem 1, corresponding to the choices γ′ = γ, α = 0 and γ′ = 1, α = 1/2, respectively. If the concentration of the batched gradients held by agents fully compensates for the network error, i.e. m ≥ n^{2r/γ}/(1 − σ2)^{2r+γ}, then (ηt⋆)^{γ′+2α}/m ≃ (nm)^{−2r/(2r+γ)} with a constant step size and tDistributed ∼ tSingle-Machine ∼ (nm)^{1/(2r+γ)}, yielding the regime where a linear speed-up occurs. For more details on the parameters α, γ′, see Lemma 8 in Appendix C.3.1.

5 Future directions

We highlight some of the features of our contribution and outline directions for future research.

Non-parametric setting We prove bounds in the attainable case r ≥ 1/2. The non-attainable case r < 1/2 is known to be more challenging [27], and it is natural to investigate to what extent our results can be extended to that setting. We consider the case γ > 0, which does not include the finite-dimensional setting H = R^d, γ = 0, where the optimal rate is O(d/(nm)) [54]. While adapting our results to this setting requires only minor modifications, optimal bounds would hold only for “easy” estimation problems with r > 1 due to the higher-order term in the network error.
Improvements require obtaining better bounds on this term, potentially using a different learning rate.

General loss functions The analysis that we develop is specific to the square loss, which yields the bias/variance error decomposition and allows us to obtain explicit characterisations by expanding the squares. While the concentration phenomena that we exploit are generic, different techniques are required to extend our analysis to other losses, as in the single-machine setting. The statistical proximity of agents' functions in the finite-dimensional setting has been investigated in [38].

Statistics/communication trade-off with sparse/randomised gossip In this work we show that when agents hold sufficiently many samples, Distributed and Single-Machine Gradient Descent achieve the optimal statistical rate with the same order of iterations. This motivates balancing and trading off communication and statistics, e.g., investigating statistically robust procedures in settings where agents communicate with a subset of neighbours, either deterministically or randomly [9, 17, 4].

Stochastic gradient descent and mini-batches Our work exploits the concentration of gradients around their means, so full-batch gradients (i.e. batches of size m) yield the concentration rate 1/m. In single-machine learning, stochastic gradient descent [39] has been shown to achieve good statistical performance in a variety of settings while allowing for computational savings. Extending our findings to stochastic methods with appropriate mini-batch sizes is another avenue for future investigation.

Acknowledgments

Dominic Richards is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). Patrick Rebeschini is supported in part by the Alan Turing Institute under the EPSRC grant EP/N510129/1.
We would like to thank Francis Bach, Lorenzo Rosasco and Alessandro Rudi for helpful discussions.

References

[1] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

[2] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, pages 1756–1764, 2015.

[3] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.

[4] Florence Bénézit, Alexandros G Dimakis, Patrick Thiran, and Martin Vetterli. Order-optimal consensus through randomized path averaging. IEEE Transactions on Information Theory, 56(10):5150–5167, 2010.

[5] Raphaël Berthier, Francis Bach, and Pierre Gaillard. Accelerated gossip in networks of given dimension using Jacobi polynomial iterations. arXiv preprint arXiv:1805.08531, May 2018.

[6] Avleen S Bijral, Anand D Sarwate, and Nathan Srebro. Data-dependent convergence for consensus stochastic optimization. IEEE Transactions on Automatic Control, 62(9):4483–4498, 2017.

[7] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2018.

[8] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[9] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

[10] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, January 2011.

[11] Ming Cao, Daniel A Spielman, and Edmund M Yeh. Accelerated gossip algorithms for distributed computation. In Proc. of the 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959. Citeseer, 2006.

[12] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[13] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. In Advances in Neural Information Processing Systems, pages 10192–10203, 2018.

[14] Fan R. K. Chung. Spectral graph theory. Number 92. American Mathematical Soc., 1997.

[15] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.

[16] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.

[17] Alexandros DG Dimakis, Anand D Sarwate, and Martin J Wainwright. Geographic gossip: Efficient averaging for sensor networks. IEEE Transactions on Signal Processing, 56(3):1205–1216, 2008.

[18] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

[19] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[20] Zheng-Chu Guo, Shao-Bo Lin, and Ding-Xuan Zhou. Learning theory of distributed spectral algorithms.
Inverse Problems, 33(7):074009, 2017.

[21] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.

[22] Bjorn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In 2007 46th IEEE Conference on Decision and Control, pages 4705–4710. IEEE, 2007.

[23] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.

[24] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.

[25] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.

[26] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral-regularization algorithms. arXiv preprint arXiv:1801.07226, 2018.

[27] Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.

[28] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1):3202–3232, 2017.

[29] Ilan Lobel and Asuman Ozdaglar. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control, 56(6):1291–1306, 2011.

[30] Ion Matei and John S Baras. Performance evaluation of the consensus-based distributed subgradient method under random communication topologies.
IEEE Journal of Selected Topics in Signal Processing, 5(4):754–771, 2011.

[31] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.

[32] Nicole Mücke and Gilles Blanchard. Parallelizing spectrally regularized kernel algorithms. The Journal of Machine Learning Research, 19(1):1069–1097, 2018.

[33] Angelia Nedić, Alex Olshevsky, Asuman Ozdaglar, and John N. Tsitsiklis. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.

[34] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[35] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems 31, pages 8125–8135, 2018.

[36] IF Pinelis and AI Sakhanenko. Remarks on inequalities for large deviation probabilities. Theory of Probability & Its Applications, 30(1):143–148, 1986.

[37] M. Rabbat. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 517–520, Dec 2015.

[38] D. Richards and P. Rebeschini. Graph-dependent implicit regularisation for distributed stochastic subgradient descent. arXiv preprint, September 2018.

[39] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.

[40] Lorenzo Rosasco and Silvia Villa.
Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

[41] Ali H. Sayed. Adaptive networks. Proceedings of the IEEE, 102(4):460–497, 2014.

[42] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3027–3036. JMLR.org, 2017.

[43] Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2745–2754, 2018.

[44] Devavrat Shah. Gossip algorithms. Foundations and Trends in Networking, 3(1):1–125, 2009.

[45] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems, pages 163–171, 2014.

[46] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.

[47] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

[48] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.

[49] Pierre Tarres and Yuan Yao. Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014.

[50] Konstantinos Tsianos, Sean Lawlor, and Michael G Rabbat.
Communication/computation tradeoffs in consensus-based distributed optimization. In Advances in Neural Information Processing Systems, pages 1943–1951, 2012.

[51] Konstantinos I Tsianos and Michael G Rabbat. Efficient distributed online prediction and stochastic optimization with approximate distributed averaging. IEEE Transactions on Signal and Information Processing over Networks, 2(4):489–506, 2016.

[52] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[53] John Nikolas Tsitsiklis. Problems in decentralized decision making and computation. Technical report, Massachusetts Institute of Technology, Laboratory for Information and Decision Systems, 1984.

[54] Alexandre B Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel Machines, pages 303–313. Springer, 2003.

[55] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

[56] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems and Control Letters, 53(1):65–78, 2004.

[57] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[58] Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.

[59] Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.

[60] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates.
The Journal of Machine Learning Research, 16(1):3299–3340, 2015.

[61] Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370, 2015.

[62] Yuchen Zhang, Martin J. Wainwright, and John C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.