{"title": "Information-theoretic lower bounds for distributed statistical estimation with communication constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2328, "page_last": 2336, "abstract": "We establish minimax risk lower bounds for distributed statistical estimation given a budget $B$ of the total number of bits that may be communicated. Such lower bounds in turn reveal the minimum amount of communication required by any procedure to achieve the classical optimal rate for statistical estimation. We study two classes of protocols in which machines send messages either independently or interactively. The lower bounds are established for a variety of problems, from estimating the mean of a population to estimating parameters in linear regression or binary classification.", "full_text": "Information-theoretic lower bounds for distributed statistical estimation with communication constraints\n\nYuchen Zhang¹, John C. Duchi¹, Michael I. Jordan¹ʼ², Martin J. Wainwright¹ʼ²\n¹Department of Electrical Engineering and Computer Science and ²Department of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\n{yuczhang,jduchi,jordan,wainwrig}@eecs.berkeley.edu\n\nAbstract\n\nWe establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. 
We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models.\n\n1 Introduction\n\nRapid growth in the size and scale of datasets has fueled increasing interest in statistical estimation in distributed settings [see, e.g., 5, 23, 7, 9, 17, 2]. Modern data sets are often too large to be stored on a single machine, so that it is natural to consider methods that involve multiple machines, each assigned a smaller subset of the full dataset. An essential design parameter in such methods is the amount of communication required between machines or chips. Bandwidth limitations on network and inter-chip communication often impose significant bottlenecks on algorithmic efficiency.\n\nThe focus of the current paper is the communication complexity of various classes of statistical estimation problems. More formally, suppose that we are interested in estimating the parameter θ of some unknown distribution P, based on a dataset of N i.i.d. samples. In the classical setting, one considers centralized estimators that have access to all N samples, and for a given estimation problem, the optimal performance over all centralized schemes can be characterized by the minimax rate. By way of contrast, in the distributed setting, one is given m different machines, and each machine is assigned a subset of samples of size n = ⌈N/m⌉. Each machine may perform arbitrary operations on its own subset of data, and it then communicates results of these intermediate computations to the other processors or to a central fusion node. 
In this paper, we try to answer the following question: what is the minimal number of bits that must be exchanged in order to achieve the optimal estimation error achievable by centralized schemes?\n\nThere is a substantial literature on communication complexity in many settings, including function computation in theoretical computer science (e.g., [21, 1, 13]), decentralized detection and estimation (e.g., [18, 16, 15]), and information theory [11]. For instance, Luo [15] considers architectures in which machines may send only a single bit to a centralized processor; for certain problems, he shows that if each machine receives a single one-dimensional sample, it is possible to achieve the optimal centralized rate up to constant factors. Among other contributions, Balcan et al. [2] study Probably Approximately Correct (PAC) learning in the distributed setting; however, their stated lower bounds do not involve the number of machines. In contrast, our work focuses on scaling issues, both in terms of the number of machines as well as the dimensionality of the underlying data, and we formalize the problem in terms of statistical minimax theory.\n\nMore precisely, we study the following problem: given a budget B of the total number of bits that may be communicated from the m distributed datasets, what is the minimax risk of any estimator based on the communicated messages? While there is a rich literature connecting information-theoretic techniques with the risk of statistical estimators (e.g., [12, 22, 20, 19]), little of it characterizes the effects of limiting communication. In this paper, we present some minimax lower bounds for distributed statistical estimation. By comparing our lower bounds with results in statistical estimation, we can identify the minimal communication cost that a distributed estimator must pay to have performance comparable to classical centralized estimators. 
Moreover, we show how to leverage recent work [23] to achieve these fundamental limits.\n\n2 Problem setting and notation\n\nWe begin with a formal description of the statistical estimation problems considered here. Let P denote a family of distributions and let θ : P → Θ ⊆ ℝ^d denote a function defined on P. A canonical example throughout the paper is the problem of mean estimation, in which θ(P) = E_P[X]. Suppose that, for some fixed but unknown member P of P, there are m sets of data stored on individual machines, where each subset X^(i) is an i.i.d. sample of size n from the unknown distribution P.¹ Given this distributed collection of local data sets, our goal is to estimate θ(P) based on the m samples X^(1), ..., X^(m), but using limited communication.\n\nWe consider a class of distributed protocols Π, in which at each round t = 1, 2, ..., machine i sends a message Y_{t,i} that is a measurable function of the local data X^(i), and potentially of past messages. It is convenient to model this message as being sent to a central fusion center. Let Y^t = {Y_{t,i}}_{i∈[m]} denote the collection of all messages sent at round t. Given a total of T rounds, the protocol Π collects the sequence (Y^1, ..., Y^T), and constructs an estimator θ̂ := θ̂(Y^1, ..., Y^T). The length L_{t,i} of message Y_{t,i} is the minimal number of bits required to encode it, and the total L = ∑_{t=1}^T ∑_{i=1}^m L_{t,i} of all messages sent corresponds to the total communication cost of the protocol. Note that the communication cost is a random variable, since the length of the messages may depend on the data, and the protocol may introduce auxiliary randomness.\n\nIt is useful to distinguish two different classes, namely independent versus interactive protocols. An independent protocol Π is based on a single round (T = 1) of communication, in which machine i sends message Y_{1,i} to the fusion center. 
Since there are no past messages, the message Y_{1,i} can depend only on the local sample X^(i). Given a family P, the class of independent protocols with budget B ≥ 0 is given by\n\nA_ind(B, P) = { independent protocols Π such that sup_{P∈P} E_P[ ∑_{i=1}^m L_i ] ≤ B }.   (1)\n\n(For simplicity, we use Y_i to indicate the message sent from processor i and L_i to denote its length in the independent case.) It can be useful in some situations to have more granular control on the amount of communication, in particular by enforcing budgets on a per-machine basis. In such cases, we introduce the shorthand B_{1:m} = (B_1, ..., B_m) and define\n\nA_ind(B_{1:m}, P) = { independent protocols Π such that sup_{P∈P} E_P[L_i] ≤ B_i for i ∈ [m] }.   (2)\n\nIn contrast to independent protocols, the class of interactive protocols allows for interaction at different stages of the message passing process. In particular, suppose that machine i sends message Y_{t,i} to the fusion center at time t, who then posts it on a “public blackboard,” where all machines can read Y_{t,i}. We think of this as a global broadcast system, which may be natural in settings in which processors have limited power or upstream capacity, but the centralized fusion center can send messages without limit. In the interactive setting, the message Y_{t,i} should be viewed as a measurable function of the local data X^(i) and the past messages Y^{1:t−1}. The family of interactive protocols with budget B ≥ 0 is given by\n\nA_inter(B, P) = { interactive protocols Π such that sup_{P∈P} E_P[L] ≤ B }.   (3)\n\n¹ Although we assume in this paper that every machine has the same amount of data, our technique generalizes easily to prove tight lower bounds for distinct data sizes on different machines.\n\nWe conclude this section by defining the minimax framework used throughout this paper. 
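Before turning to that framework, the independent-protocol class above can be made concrete with a toy numerical sketch of our own (the function names and parameter values are illustrative, not from the paper): each of m machines holds n samples from a distribution on [0, 1], quantizes its local sample mean to a fixed number of bits, and a fusion center averages the decoded messages. Because every message length is deterministic here, the budget constraint E_P[∑_i L_i] ≤ B holds with B = m · bits.

```python
import random

def machine_message(samples, bits):
    # One round (T = 1): quantize the local sample mean in [0, 1] to `bits` bits.
    mean = sum(samples) / len(samples)
    levels = 2 ** bits - 1
    return round(mean * levels)  # integer message in {0, ..., 2^bits - 1}

def fusion_estimate(messages, bits):
    # Fusion center decodes each message and averages the m decoded values.
    levels = 2 ** bits - 1
    return sum(msg / levels for msg in messages) / len(messages)

random.seed(0)
m, n, bits = 20, 100, 8   # m machines, n samples each, B_i = 8 bits per machine
theta = 0.3               # true mean of an (illustrative) Bernoulli population
data = [[random.random() < theta for _ in range(n)] for _ in range(m)]
messages = [machine_message(x, bits) for x in data]
theta_hat = fusion_estimate(messages, bits)
total_bits = m * bits     # deterministic, so E[sum_i L_i] = m * bits
```

The estimate θ̂ concentrates around θ at the centralized rate while the total communication stays at m · bits; shrinking `bits` trades quantization error against the budget, which is exactly the tension the minimax risks below quantify.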
We wish to characterize the best achievable performance of estimators θ̂ that are functions of only the messages (Y^1, ..., Y^T). We measure the quality of a protocol and estimator θ̂ by the mean-squared error\n\nE_{P,Π}[ ‖θ̂(Y^1, ..., Y^T) − θ(P)‖²₂ ],\n\nwhere the expectation is taken with respect to the protocol Π and the m i.i.d. samples X^(i) of size n from distribution P. Given a class of distributions P, parameter θ : P → Θ, and communication budget B, the minimax risk for independent protocols is\n\nM_ind(θ, P, B) := inf_{Π∈A_ind(B,P)} inf_{θ̂} sup_{P∈P} E_{P,Π}[ ‖θ̂(Y_1, ..., Y_m) − θ(P)‖²₂ ].   (4)\n\nHere, the infimum is taken jointly over all independent protocols Π that satisfy the budget constraint B, and over all estimators θ̂ that are measurable functions of the messages in the protocol. This minimax risk should also be understood to depend on both the number of machines m and the individual sample size n. The minimax risk for interactive protocols, denoted by M_inter, is defined analogously, where the infimum is instead taken over the class of interactive protocols. These communication-dependent minimax risks are the central objects in this paper: they provide a sharp characterization of the optimal rate of statistical estimation as a function of the communication budget B.\n\n3 Main results\n\nWith our setup in place, we now turn to the statement of our main results, along with some discussion of their consequences.\n\n3.1 Lower bound based on metric entropy\n\nWe begin with a general but relatively naive lower bound that depends only on the geometric structure of the parameter space, as captured by its metric entropy. In particular, given a subset Θ ⊂ ℝ^d, we say {θ^1, ..., θ^K} are δ-separated if ‖θ^i − θ^j‖₂ ≥ δ for i ≠ j. We then define the packing entropy of Θ as\n\nlog M_Θ(δ) := log₂ max{ K ∈ ℕ | {θ^1, ..., θ^K} ⊂ Θ are δ-separated }.   (5)\n\nThe function δ ↦ log M_Θ(δ) is left-continuous and non-increasing in δ, so we may define the inverse function log M_Θ^{−1}(B) := sup{ δ | log M_Θ(δ) ≥ B }.\n\nProposition 1 For any family of distributions P and parameter set Θ = θ(P), the interactive minimax risk is lower bounded as\n\nM_inter(θ, P, B) ≥ ( (1/4) log M_Θ^{−1}(2B + 2) )².   (6)\n\nOf course, the same lower bound also holds for M_ind(θ, P, B), since any independent protocol is a special case of an interactive protocol. Although Proposition 1 is a relatively generic statement, not exploiting any particular structure of the problem, it is in general unimprovable by more than constant factors, as the following example illustrates.\n\nExample: Bounded mean estimation. Suppose that our goal is to estimate the mean θ = θ(P) of a class of distributions P supported on the interval [0, 1], so that Θ = θ(P) = [0, 1]. Suppose that a single machine (m = 1) receives n i.i.d. observations X_i according to P. Since the packing entropy is lower bounded as log M_Θ(δ) ≥ log(1/δ), the lower bound (6) implies\n\nM_ind(θ, P, B) ≥ M_inter(θ, P, B) ≥ (e^{−2}/4) e^{−2B}.\n\nThus, setting B = (1/2) log n yields the lower bound M_ind(θ, P([0, 1]), B) ≥ e^{−2}/(4n). This lower bound is sharp up to the constant pre-factor, since it can be achieved by a simple method. Given its n observations, the single machine can compute the sample mean X̄_n = (1/n) ∑_{i=1}^n X_i. 
Since the sample mean lies in the interval [0, 1], it can be quantized to accuracy 1/n using log(n) bits, and this quantized version θ̂ can be transmitted. A straightforward calculation shows that E[(θ̂ − θ)²] ≤ 2/n, so Proposition 1 yields an order-optimal bound in this case.\n\n3.2 Multi-machine settings\n\nWe now turn to the more interesting multi-machine setting (m > 1). Let us study how the budget B (meaning the number of bits required to achieve the minimax rate) scales with the number of machines m. We begin by considering the uniform location family U = {P_θ, θ ∈ [−1, 1]}, where P_θ is the uniform distribution on the interval [θ − 1, θ + 1]. For this problem, a direct application of Proposition 1 gives a nearly sharp result.\n\nCorollary 1 Consider the uniform location family U with n i.i.d. observations per machine:\n\n(a) Whenever the communication budget is upper bounded as B ≤ log(mn), there is a universal constant c such that\n\nM_inter(θ, U, B) ≥ c/(mn)².\n\n(b) Conversely, given a budget of B = ⌈2 + 2 ln m⌉ log(mn) bits, there is a universal constant c′ such that\n\nM_inter(θ, U, B) ≤ c′/(mn)².\n\nIf each of m machines receives n observations, we have a total sample size of mn, so the minimax rate over all centralized procedures scales as 1/(mn)² (for instance, see [14]). Consequently, Corollary 1(b) shows that the number of bits required to achieve the centralized rate has only logarithmic dependence on the number m of machines. Part (a) shows that this logarithmic dependence on m is unavoidable.\n\nIt is natural to wonder whether such logarithmic dependence holds more generally. The following result shows that it does not: for some problems, the dependence on m must be (nearly) linear. In particular, we consider estimation in a normal location family model, where each machine receives an i.i.d. 
sample of size n from a normal distribution N(θ, σ²) with unknown mean θ.\n\nTheorem 1 For the univariate normal family N = {N(θ, σ²) | θ ∈ [−1, 1]}, there is a universal constant c such that\n\nM_inter(θ, N, B) ≥ c (σ²/(mn)) min{ mn/σ², m/log m, m/(B log m) }.   (7)\n\nThe centralized minimax rate for estimating a univariate normal mean based on mn observations is σ²/(mn); consequently, the lower bound (7) shows that at least B = Ω(m/log m) bits are required for a decentralized procedure to match the centralized rate in this case. This type of scaling is dramatically different than the logarithmic scaling for the uniform family, showing that establishing sharp communication-based lower bounds requires careful study of the underlying family of distributions.\n\n3.3 Independent protocols in multi-machine settings\n\nDeparting from the interactive setting, in this section we focus on independent protocols, providing somewhat more general results than those for interactive protocols. We first provide lower bounds for the problem of mean estimation of the parameter of a d-dimensional normal location family\n\nN_d = { N(θ, σ² I_{d×d}) | θ ∈ Θ = [−1, 1]^d }.   (8)\n\nTheorem 2 For i = 1, ..., m, assume that each machine has communication budget B_i, and receives an i.i.d. sample of size n from a distribution P ∈ N_d. 
There exists a universal (numerical) constant c such that\n\nM_ind(θ, N_d, B_{1:m}) ≥ c (σ²d/(mn)) min{ mn/σ², m/log m, m/((∑_{i=1}^m min{1, B_i/d}) log m) }.   (9)\n\nGiven centralized access to the full mn-sized sample, a reasonable procedure would be to compute the sample mean, leading to an estimate with mean-squared error σ²d/(mn), which is minimax optimal. Consequently, Theorem 2 shows that to achieve an order-optimal mean-squared error, the total number of bits communicated must (nearly) scale with the product of the dimension d and number of machines m, that is, as dm/log m. If we ignore logarithmic factors, this lower bound is achievable by a simple procedure: each machine computes the sample mean of its local data and quantizes each coordinate to precision σ²/n using O(d log(n/σ²)) bits. These quantized sample averages are communicated to the fusion center using B = O(dm log(n/σ²)) total bits. The fusion center averages them, obtaining an estimate with mean-squared error of optimal order σ²d/(mn) as required.\n\nWe finish this section by presenting a result that is sharp up to numerical constant prefactors. It is a minimax lower bound for mean estimation over the family P_d = {P supported on [−1, 1]^d}.\n\nProposition 2 Assume that each of m machines receives a single sample (n = 1) from a distribution in P_d. There exists a universal (numerical) constant c such that\n\nM_ind(θ, P_d, B_{1:m}) ≥ c (d/m) min{ m, m/(∑_{i=1}^m min{1, B_i/d}) },   (10)\n\nwhere B_i is the budget for machine i.\n\nThe standard minimax rate for d-dimensional mean estimation scales as d/m. The lower bound (10) shows that in order to achieve this scaling, we must have ∑_{i=1}^m min{1, B_i/d} ≳ m, showing that each machine must send B_i ≳ d bits.\n\nMoreover, this lower bound is achievable by a simple scheme. Suppose that machine i receives 
Suppose that machine i receives\na d-dimensional vector Xi \u2208 [\u22121, 1]d. Based on Xi, it generates a Bernoulli random vector\nZi = (Zi1, . . . , Zid) with Zij \u2208{ 0, 1} taking the value 1 with probability (1 + Xij)/2, indepen-\ndently across coordinates. Machine i uses d bits to send the vector Zi \u2208{ 0, 1}d to the fusion center.\nThe fusion center then computes the average!\u03b8 = 1\ni=1(2Zi \u2212 1). This average is unbiased, and\nits expected squared error is bounded by d/m.\n\nm\"m\n\n4 Consequences for regression\n\nIn this section, we turn to identifying the minimax rates for a pair of important estimation problems:\nlinear regression and probit regression.\n\n4.1 Linear regression\n\nWe consider a distributed instantiation of linear regression with \ufb01xed design matrices. Concretely,\nsuppose that each of m machines has stored a \ufb01xed design matrix A(i) \u2208 Rn\u00d7d and then observes a\nresponse vector b(i) \u2208 Rd from the standard linear regression model\n\n(11)\nwhere \u03b5(i) \u223c N(0,\u03c3 2In\u00d7n) is a noise vector. Our goal is to estimate unknown regression vector\n\u03b8 \u2208 \u0398= [ \u22121, 1]d, shared across all machines, in a distributed manner, To state our result, we\nassume uniform upper and lower bounds on the eigenvalues of the rescaled design matrices, namely\n\nb(i) = A(i)\u03b8 + \u03b5(i),\n\n0 <\u03bb min \u2264 min\n\ni\u2208{1,...,m}\n\n\u03b7min(A(i))\n\n\u221an\n\nand\n\nmax\n\ni\u2208{1,...,m}\n\n\u03b7max(A(i))\n\n\u221an\n\n\u2264 \u03bbmax.\n\n(12)\n\nCorollary 2 Consider an instance of the linear regression model (11) under condition (12).\n\n(a) Then there is a universal positive constant c such that\n\nMind(\u03b8, P, B1:m) \u2265 c\n\n\u03c32d\nmn\n\nmin3 mn\n\n\u03c32 ,\n\nm\n\n\u03bb2\nmax log m\n\n,\n\nm\n\ni=1 min{1, Bi\n\n\u03bb2\n\nmax1\"m\n\nd }2 log m4 .\n\n(b) Conversely, given budgets Bi \u2265 d log(mn) for i = 1, . . . 
, m, there is a universal constant c′ such that\n\nM_ind(θ, P, B_{1:m}) ≤ (c′/λ²_min) (σ²d/(mn)).\n\nIt is a classical fact (e.g., [14]) that the minimax rate for d-dimensional linear regression scales as dσ²/(nm). Part (a) of Corollary 2 shows this optimal rate is attainable only if the budget B_i at each machine is of the order d/log(m), meaning that the total budget B = ∑_{i=1}^m B_i must grow as dm/log m. Part (b) of the corollary shows that the minimax rate is achievable with budgets that match the lower bound up to logarithmic factors.\n\nProof: The proof of part (b) follows from techniques of Zhang et al. [23], who show that solving each regression problem separately and then performing a form of approximate averaging, in which each machine uses B_i = d log(mn) bits, achieves the minimax rate up to constant prefactors.\n\nTo prove part (a), we show that solving an arbitrary Gaussian mean estimation problem can be reduced to solving a specially constructed linear regression problem. This reduction allows us to apply the lower bound from Theorem 2. Given θ ∈ Θ, consider the Gaussian mean model\n\nX^(i) = θ + w^(i),  where w^(i) ∼ N(0, (σ²/(λ²_max n)) I_{d×d}).\n\nEach machine i has its own design matrix A^(i), and we use it to construct a response vector b^(i) ∈ ℝ^n. Since η_max(A^(i)/√n) ≤ λ_max, the matrix Σ^(i) := σ² I_{n×n} − (σ²/(λ²_max n)) A^(i)(A^(i))^⊤ is positive semidefinite. Consequently, we may form a response vector via\n\nb^(i) = A^(i) X^(i) + z^(i),  where z^(i) ∼ N(0, Σ^(i)) is drawn independently of w^(i).   (13)\n\nThe independence of w^(i) and z^(i) guarantees that b^(i) ∼ N(A^(i)θ, σ² I_{n×n}), so that the pair (b^(i), A^(i)) is faithful to the regression model (11).\n\nNow consider any protocol Π ∈ A_ind(B, P) that can solve any regression problem to within accuracy δ, so that E[‖θ̂ − θ‖²₂] ≤ δ². By the previously described reduction, the protocol Π can also solve the mean estimation problem to accuracy δ, in particular via the pair (A^(i), b^(i)) constructed via expression (13). Combined with this reduction, the corollary thus follows from Theorem 2. □\n\n4.2 Probit regression\n\nWe now turn to the problem of binary classification, in particular considering the probit regression model. As in the previous section, each of m machines has a fixed design matrix A^(i) ∈ ℝ^{n×d}, where A^(i,k) denotes the kth row of A^(i). Machine i receives n binary responses Z^(i) = (Z^(i,1), ..., Z^(i,n)), drawn from the conditional distribution\n\nP(Z^(i,k) = 1 | A^(i,k), θ) = Φ(A^(i,k) θ)  for some fixed θ ∈ Θ = [−1, 1]^d,   (14)\n\nwhere Φ(·) denotes the standard normal CDF. The log-likelihood of the probit model (14) is concave [4, Exercise 3.54]. Under condition (12) on the design matrices, we have:\n\nCorollary 3 Consider the probit model (14) under condition (12). Then\n\n(a) There is a universal constant c such that\n\nM_ind(θ, P, B_{1:m}) ≥ c (d/(mn)) min{ mn, m/(λ²_max log m), m/(λ²_max (∑_{i=1}^m min{1, B_i/d}) log m) }.\n\n(b) Conversely, given budgets B_i ≥ d log(mn) for i = 1, ..., m, there is a universal constant c′ such that\n\nM_ind(θ, P, B_{1:m}) ≤ (c′/λ²_min) (d/(mn)).\n\nProof: As in the previous case with linear regression, Zhang et al.’s study of distributed convex optimization [23] gives part (b): each machine solves the local probit regression separately, after which each machine sends B_i = d log(mn) bits to average its local solution.\n\nTo prove part (a), we show that linear regression problems can be solved via estimation in a specially constructed probit model. 
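As a quick numerical sanity check of this reduction (a sketch of our own, with illustrative variable names and values), one can verify that thresholding a Gaussian linear-model response b = a·θ + ε, with ε ∼ N(0, 1), at zero produces a binary response obeying the probit law Φ(a·θ) of model (14):

```python
import math
import random

def probit_cdf(x):
    # Standard normal CDF, written via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

random.seed(1)
theta = [0.5, -0.25]   # illustrative parameter in [-1, 1]^d
a = [1.2, 0.7]         # one illustrative design row A^(i,k)
mu = sum(ai * ti for ai, ti in zip(a, theta))  # mean response a . theta

# Simulate b = a.theta + eps with unit noise variance, then threshold at zero.
trials = 200_000
z = [1 if mu + random.gauss(0.0, 1.0) >= 0 else 0 for _ in range(trials)]
empirical = sum(z) / trials   # empirical P(Z = 1)
predicted = probit_cdf(mu)    # probit model prediction Phi(a . theta)
```

The empirical frequency of Z = 1 matches Φ(a·θ) to Monte Carlo accuracy; applied row by row to a full design matrix, this is why a protocol solving the constructed probit problem inherits the risk of the original linear regression problem.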
Consider an arbitrary θ ∈ Θ; assume we have a regression problem of the form (11) with noise variance σ² = 1. We construct the binary responses for our probit regression (Z^(i,1), ..., Z^(i,n)) by\n\nZ^(i,k) = 1 if b^(i,k) ≥ 0, and Z^(i,k) = 0 otherwise.   (15)\n\nBy construction, we have P(Z^(i,k) = 1 | A^(i), θ) = Φ(A^(i,k) θ) as desired for our model (14). By inspection, any protocol Π ∈ A_ind(B, P) solving the probit regression problem provides an estimator with the same error (risk) as the original linear regression problem via the construction (15). Corollary 2 provides the desired lower bound. □\n\n5 Proof sketches for main results\n\nWe now give an outline of the proof of each of our main results (Theorems 1 and 2), providing a more detailed proof sketch for Proposition 2, since it displays techniques common to our arguments.\n\n5.1 Broad outline\n\nMost of our lower bounds follow the same basic strategy of reducing an estimation problem to a testing problem. Following this reduction, we then develop inequalities relating the probability of error in the test to the number of bits contained in the messages Y_i sent from each machine. Establishing these links is the most technically challenging aspect.\n\nOur reduction from estimation to testing is somewhat more general than the classical reductions (e.g., [22, 20]), since we do not map the original estimation problem to a strict test, but rather to a test that allows some errors. Let V denote an index set of finite cardinality, where ν ∈ V indexes a family of probability distributions {P(· | ν)}_{ν∈V}. With each member of this family, we associate a parameter θ_ν := θ(P(· | ν)) ∈ Θ, where Θ denotes the parameter space. In our proofs applicable to d-dimensional problems, we set V = {−1, 1}^d, and we index vectors θ_ν by ν ∈ V. 
Now, we sample V uniformly at random from V. Conditional on V = ν, we then sample X from a distribution P_X(· | V = ν) satisfying θ_ν := θ(P_X(· | ν)) = δν, where δ > 0 is a fixed quantity that we control. We define d_ham(ν, ν′) to be the Hamming distance between ν, ν′ ∈ V. This construction gives\n\n‖θ_ν − θ_{ν′}‖₂ = 2δ √(d_ham(ν, ν′)).\n\nFixing t ∈ ℝ, the following lemma reduces the problem of estimating θ to finding a point ν ∈ V within distance t of the random variable V. The result extends a result of Duchi and Wainwright [8]; for completeness we provide a proof in Appendix H.\n\nLemma 1 Let V be uniformly sampled from V. For any estimator θ̂ and any t ∈ ℝ, we have\n\nsup_{P∈P} E[‖θ̂ − θ(P)‖²₂] ≥ δ² (⌊t⌋ + 1) inf_{ν̂} P(d_ham(ν̂, V) > t),\n\nwhere the infimum ranges over all testing functions.\n\nLemma 1 shows that a minimax lower bound can be derived by showing that, for some t > 0 to be chosen, it is difficult to identify V within a radius of t. The following extension of Fano’s inequality [8] can be used to control this type of error probability:\n\nLemma 2 Let V → X → V̂ be a Markov chain, where V is uniform on V. For any t ∈ ℝ, we have\n\nP(d_ham(V̂, V) > t) ≥ 1 − (I(V; X) + log 2)/log(|V|/N_t),\n\nwhere N_t := max_{ν∈V} |{ν′ ∈ V : d_ham(ν, ν′) ≤ t}| is the size of the largest t-neighborhood in V.\n\nLemma 2 allows flexibility in the application of the minimax bounds from Lemma 1. 
If there is a large set V for which it is easy to control I(V; X), whereas neighborhoods in V are relatively small (i.e., N_t is small), then we can obtain sharp lower bounds.\n\nIn a distributed protocol, we have a Markov chain V → X → Y, where Y denotes the messages the different machines send. Based on the messages Y, we consider an arbitrary estimator θ̂(Y). For 0 ≤ t ≤ ⌊d/3⌋, we have N_t = ∑_{τ=0}^t (d choose τ) ≤ 2 (d choose t). Since (d choose t) ≤ (de/t)^t, for t ≤ d/6 we have\n\nlog(|V|/N_t) ≥ d log 2 − log(2 (d choose t)) ≥ d log 2 − (d/6) log(6e) − log 2 = d log( 2 / (2^{1/d} (6e)^{1/6}) ) > d/6\n\nfor d ≥ 12 (the case d < 12 can be checked directly). Thus, combining Lemma 1 and Lemma 2 (using the Markov chain V → X → Y → θ̂), we find that for t = ⌊d/6⌋,\n\nsup_{P∈P} E[‖θ̂(Y) − θ(P)‖²₂] ≥ δ² (⌊d/6⌋ + 1) ( 1 − (I(Y; V) + log 2)/(d/6) ).   (16)\n\nWith inequality (16) in hand, it then remains to upper bound the mutual information I(Y; V), which is the main technical content of each of our results.\n\n5.2 Proof sketch of Proposition 2\n\nFollowing the general outline of the previous section, let V be uniform on V = {−1, 1}^d. Letting 0 < δ ≤ 1 be a positive number, for i ∈ [m] we independently sample X^(i) ∈ ℝ^d according to\n\nP(X^(i)_j = ν_j | V = ν) = (1 + δ)/2  and  P(X^(i)_j = −ν_j | V = ν) = (1 − δ)/2.   (17)\n\nUnder this distribution, we can give a sharp characterization of the mutual information I(V; Y_i). 
In particular, we show in Appendix B that under the sampling distribution (17), there exists a numerical constant c such that\n\nI(V; Y_i) ≤ c δ² I(X^(i); Y_i).   (18)\n\nSince the random variable X^(i) takes discrete values, we have\n\nI(X^(i); Y_i) ≤ min{H(X^(i)), H(Y_i)} ≤ min{d, H(Y_i)}.\n\nSince the expected length of message Y_i is bounded by B_i, Shannon’s source coding theorem [6] implies that H(Y_i) ≤ B_i. In particular, inequality (18) establishes a link between the initial distribution (17) and the number of bits used to transmit information, that is,\n\nI(V; Y_i) ≤ c δ² min{d, B_i}.   (19)\n\nWe can now apply the quantitative data processing inequality (19) in the bound (16). By the independence of the communication scheme, I(V; Y_{1:m}) ≤ ∑_{i=1}^m I(V; Y_i), and thus inequality (16) simplifies to\n\nM_ind(θ, P, B_{1:m}) ≥ δ² (⌊d/6⌋ + 1) ( 1 − (c δ² ∑_{i=1}^m min{d, B_i} + log 2)/(d/6) ).\n\nAssuming d ≥ 9, so that 1 − 6 log 2/d > 1/2, we see that choosing δ² = min{ 1, d/(24c ∑_{i=1}^m min{B_i, d}) } implies\n\nM_ind(θ, P, B_{1:m}) ≥ δ² (⌊d/6⌋ + 1)/4 = ((⌊d/6⌋ + 1)/4) min{ 1, d/(24c ∑_{i=1}^m min{B_i, d}) }.\n\nRearranging slightly gives the statement of the proposition.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for their helpful feedback and comments. JCD was supported by a Facebook Graduate Fellowship. Our work was supported in part by the U.S. Army Research Laboratory, U.S. Army Research Office under grant number W911NF-11-1-0391, and Office of Naval Research MURI grant N00014-11-1-0688.\n\nReferences\n\n[1] H. Abelson. Lower bounds on information transfer in distributed computations. Journal of the ACM, 27(2):384–392, 1980.\n\n[2] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory, 2012. URL http://arxiv.org/abs/1204.3514.\n\n[3] K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, pages 1–58. MSRI Publications, 1997.\n\n[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 2011.\n\n[6] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley, 2006.\n\n[7] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.\n\n[8] J. C. Duchi and M. J. Wainwright. Distance-based and continuum Fano inequalities with applications to statistical estimation. arXiv [cs.IT], to appear, 2013.\n\n[9] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.\n\n[10] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. arXiv:1302.3203 [math.ST], 2013. URL http://arXiv.org/abs/1302.3203.\n\n[11] S. Han and S. Amari. Statistical inference under multiterminal data compression. IEEE Transactions on Information Theory, 44(6):2300–2324, 1998.\n\n[12] I. A. Ibragimov and R. Z. Has’minskii. Statistical Estimation: Asymptotic Theory. Springer-Verlag, 1981.\n\n[13] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.\n\n[14] E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998.\n\n[15] Z.-Q. Luo. 
Universal decentralized estimation in a bandwidth constrained sensor network. IEEE Transactions on Information Theory, 51(6):2210–2219, 2005.\n\n[16] Z.-Q. Luo and J. N. Tsitsiklis. Data fusion with minimal communication. IEEE Transactions on Information Theory, 40(5):1551–1563, 1994.\n\n[17] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In North American Chapter of the Association for Computational Linguistics (NAACL), 2010.\n\n[18] J. N. Tsitsiklis. Decentralized detection. In Advances in Signal Processing, Vol. 2, pages 297–344. JAI Press, 1993.\n\n[19] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.\n\n[20] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.\n\n[21] A. C.-C. Yao. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the Eleventh Annual ACM Symposium on the Theory of Computing, pages 209–213. ACM, 1979.\n\n[22] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.\n\n[23] Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems 26, 2012.\n", "award": [], "sourceid": 1114, "authors": [{"given_name": "Yuchen", "family_name": "Zhang", "institution": "UC Berkeley"}, {"given_name": "John", "family_name": "Duchi", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}, {"given_name": "Martin", "family_name": "Wainwright", "institution": "UC Berkeley"}]}