Paper ID: 1114
Title: Information-theoretic lower bounds for distributed statistical estimation with communication constraints
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper gives minimax lower bounds on the number of bits, B, that must be communicated among m machines, each of which has access to n i.i.d. samples, in order for the estimation error to be on par with the centralized case (with mn samples). This is a very interesting problem of theoretical and practical significance. In view of the fact that the minimax rate without communication constraints does not admit a general solution, I do not expect a general solution for the minimal B. As the authors show, the conclusion seems problem-specific and can be quite pessimistic, meaning that a large amount of communication overhead is necessary. While the proof of Thm 2 is quite trivial (a vanilla application of Fano's inequality), the proof of Thm 3 is quite interesting: it relies on a type of strong data processing inequality for mutual information under a bounded density condition, which is reminiscent of differential privacy and Thm 1 in

J. C. Duchi, M. I. Jordan and M. J. Wainwright (2013). Local privacy and statistical minimax rates. Arxiv technical report, February 2013.
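
For concreteness, the vanilla reduction I have in mind is the standard one (the notation below is mine, not necessarily the paper's): for a 2\delta-separated family {\theta_v : v \in V} with V drawn uniformly from a finite index set, and Y denoting the (at most B-bit) transcript available to the estimator, Fano's inequality gives

  \inf_{\hat{\theta}} \max_v E[ ||\hat{\theta}(Y) - \theta_v||_2^2 ]
      \ge \delta^2 ( 1 - (I(V;Y) + \log 2) / \log|V| ),
  with I(V;Y) \le H(Y) \le B   (logarithms in bits).

As I read it, the non-trivial work in Thm 3 is in bounding I(V;Y) by something much smaller than B, which is where the strong data processing inequality enters.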


- It would be very interesting to look at infinite-dimensional problems (e.g., function estimation under white Gaussian noise), especially those where the parameter set grows fast enough that the Yang-Barron argument gives the sharp rate w.r.t. KL loss. Does one need an even bigger B in order to reach the rate at sample size mn?

- I wonder if there is any problem in multi-user information theory that is similar in spirit, where a joint decoder has access to rate-constrained coded messages and wants to recover the original source with some fidelity. I am not familiar with this field, but it might be interesting to identify this connection. Of course, the major difference here is that there is no large blocklength to exploit.

- The following paper might be relevant (the reviewer has NOT read the paper)

Maxim Raginsky, Achievability results for statistical learning under communication constraints
Proceedings of ISIT 2009

- "this work does not characterize the effects of limiting communication" -> "these results do not characterize the effects of limited communication"

- The well-known statistical minimax rate for linear regression scales as d \sigma^2 / (\lambda^2 nm) (e.g. [15]). -- I am not so sure about this statement. The result of [15] is known to be loose in finite-dimensional models. Also, what conditions on the design matrices A are assumed?
Q2: Please summarize your review in 1-2 sentences
This is a well-written and interesting paper. A clear accept.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers minimax lower bounds on the mean squared error for statistical estimation problems, in a setting where available data is distributed i.i.d. across several machines and there is a constraint on the number of bits that may be communicated. The paper considers both interactive and non-interactive protocols, where the final estimate is a function of the messages sent by the machines. For interactive protocols, these include lower bounds for estimating the mean of a uniform distribution in a unit cube, and the mean of a univariate Gaussian distribution. For non-interactive protocols, the lower bounds are for estimating the mean of a multivariate Gaussian distribution, and applications for linear and probit regression in a fixed (known) design setting.

The issue of distributed learning and estimation has received plenty of attention in recent years, mostly in the form of algorithms and upper bounds. This paper is among the first (see remarks below) to explicitly consider lower bounds. Since lower bounds are crucial to understanding the limits of achievable performance, I consider this a very interesting line of research. The results presented here seem to be interesting first steps in this direction. Also, I found the paper clear and quite easy to follow. Finally, the results seem overall correct, although I haven't checked the proofs in the supplementary carefully.

My main criticism is that the results focus on somewhat limited regimes, which may not correspond to realistic applications of distributed learning. To be more specific:

1) The setting considered here is only harder than the standard (non-distributed) setting of statistical estimation. Thus, all the standard (non-distributed) lower bounds trivially hold. The lower bounds presented here are stronger than those standard bounds only in the regime where the average amount of communication B_i per machine is smaller than the data dimension d. However, this is a very harsh constraint. For example, in standard linear prediction, it means that each machine can't even output a single predictor. For most applications of distributed learning, one is very happy with protocols where B_i is on the order of d, and in that regime the results here do not improve on the existing standard lower bounds. For fairness, it should be said that it is not known whether the standard lower bounds are improvable when B_i >= d, except in some cases, such as statistical estimation with sufficiently smooth loss functions (see Zhang et al. 2012 - reference [18] in the paper). Those results do not apply to non-smooth losses, e.g. when we try to distribute support vector machines. Moreover, all these results are in terms of mean squared error, and do not give tight results when converted to optimization error / excess risk (e.g. in terms of the dependence on the strong-convexity parameter, which can be small in regularized statistical learning problems). In any case, the results in this paper do not shed light on these issues.

2) Most of the results here are for non-interactive protocols, where each machine can only send a single message based on its data, and the final estimator is a function of these messages. This is restrictive, since one can easily imagine algorithms which require several communication rounds between the machines. The concrete results for interactive protocols (proposition 1 and theorem 2) seem much more limited, focusing either on a uniform distribution in a unit cube or on a 1-dimensional setting. As to the more generic theorem 1, I may be wrong, but I think the same proof would apply to a non-distributed setting where the only constraint is that the output \hat{\theta} is limited to B bits. So in a certain sense, it may not capture an interesting aspect of the distributed setting.

3) The regression lower bounds focus on a fixed design setting, where the design/instances are known by all machines in advance, and the uncertainty is in the target values. In a distributed setting, it's much more realistic to assume unknown instances partitioned between the machines.



Technical Comments
------------------

- Line 41: "Currently there is little theoretical understanding of the role of communication complexity in statistical inference, and thus the design
and analysis of distributed inference algorithms is more art than science". I think this is exaggerated. Principled approaches to design and analyze distributed inference and learning have been well-studied in the literature. It is true that lower bounds have not been as carefully studied.

- It would be good to explain why the dependence on B is so different between the two settings of proposition 1 and theorem 2.

- Line 111: "an protocol"

- Line 119, definition of minimax risk - why is the protocol Pi separated from the estimator function \Theta? Perhaps it would make more sense to include the final estimate computation as part of the protocol.

- Line 148: If I understood correctly, example 1 shows the tightness of Theorem 1 only for one particular choice of B. It would be better if one could come up with an example which shows the tightness of the bound for *any* choice of B.

- Line 1120 in supplementary: "achieves minimax" -> "which achieves the"(?)

- In discussing related work, the authors should also mention the COLT 2012 paper "Distributed Learning, Communication Complexity and Privacy" by Balcan et al., which also discusses lower bounds for distributed learning (although in a different setting).

- Section 2: When defining A_ind and A_inter, do any of the lower bounds become better if we require L <= B always rather than just in expectation? Assuming not, it might be worthwhile to point this out.
Q2: Please summarize your review in 1-2 sentences
The paper provides lower bounds for statistical estimation in a data-distributed setting. These are interesting first steps in this direction, although the regimes considered are somewhat limited.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is well-described by its title - it gives minimax lower bounds (and some matching upper bounds) for statistical estimation problems when data is split across several machines given a constraint on how much communication is allowed (independently or interactively) between the servers having the data and a fusion point where the messages from the servers are collected and a final estimate is reached. This is done in a variety of settings, all of which were followable, and the lower bounds are proved using standard reductions to hypothesis testing followed by some non-standard detailed calculations (that I did not completely follow).

I think the paper was well-structured and easy to read, going from more examples and intuition to proofs at the end, with good intuition for the start of the proof. To the best of my knowledge, the work was original, it is definitely of significance to the community, and it is of high quality.

I think more emphasis should be placed on the differences in rates between interactive and independent settings. Also, the proof on page 8 has descriptions for each step, but I feel better intuition can be given (in section 5.2, in addition to section 5.1).
Q2: Please summarize your review in 1-2 sentences
Overall, I think this is a very good paper and highly recommend its acceptance - it was clear, high quality, original and very significant.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their careful reading and for their comments and
criticism of our paper. We will address all comments in the revised version of
the paper; we respond to the major feedback below.

REVIEWER 4

We appreciate the positive feedback, and indeed, Thm. 3 is reminiscent of
Duchi et al.'s work--we will provide a citation. There is some difference
because our likelihood ratio inequalities (in the case of Theorems 2 and 4,
e.g. Lemma 3 and Eq. (31) in the appendix) do not hold for all samples, and
the data processing inequalities (at least for now) require somewhat careful
conditions on the random variable V. But certainly they do have substantial
similarities; we are quite curious to understand when such likelihood
ratio/bounded density conditions give data processing inequalities.
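
Schematically (and not with the exact form or constants used in the paper),
the kind of inequality we mean is the following: for a Markov chain
V -> X -> Y, if the conditional densities satisfy p(x | v) / p(x | v')
\le e^{\alpha} for all x, v, v', then

  I(V;Y) \le c(\alpha) I(X;Y),   with c(\alpha) -> 0 as \alpha -> 0,

which strengthens the classical data processing inequality
I(V;Y) \le min{ I(V;X), I(X;Y) }. The careful conditions on V mentioned above
are what is needed to make such a statement usable in our proofs.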

- The infinite dimensional question is extremely interesting--we will be
looking at that in the future (we have seen some recent work on communication
upper bounds here; see Zhang, Duchi, Wainwright, COLT 2013).

- The connections with multi-user information theory are interesting; we do
not yet have a full understanding. Slepian-Wolf coding is the closest problem
in spirit to ours: senders encode independently and a joint decoder decodes
the independently encoded messages. In that setting it is possible (since
large blocklengths can be exploited) to send messages at a total rate equal to
the joint entropy; in the statistical estimation setting it appears we need to
send a bit more.
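
For reference, the classical Slepian-Wolf rate region for two correlated
sources X_1, X_2 that are encoded separately and decoded jointly is

  R_1 \ge H(X_1 | X_2),  R_2 \ge H(X_2 | X_1),  R_1 + R_2 \ge H(X_1, X_2),

so the sum rate can be as small as the joint entropy H(X_1, X_2), just as if
the sources were encoded together; this is the sense of the joint-entropy rate
mentioned above. As the reviewer notes, our setting has no large blocklength
to amortize over.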

- In terms of linear regression bounds, we are assuming that A has bounded
minimum and maximum singular values--we will be more precise. (The
true minimax rate is tr(((1/n) A'*A)^{-1}) / n).
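
Concretely: with the singular values of A / \sqrt{n} bounded below by
\lambda, every eigenvalue of ((1/n) A'A)^{-1} is at most 1/\lambda^2, so

  tr( ((1/n) A'A)^{-1} ) \le d / \lambda^2,

which is where the d / \lambda^2 scaling quoted by the reviewer comes from;
the trace expression is the exact quantity.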

Thanks also for the pointer to the Raginsky paper. The settings are slightly
different--Raginsky considers a single communication channel--but it is good
to have the reference.

REVIEWER 7

1) The reviewer is correct--our setting is only harder. But given the
difficulty of providing such lower bounds--to our knowledge there are no prior
lower bounds on communication, and sending O(d) bits is sufficient for some
problems--we think this is a good step in that direction. Multi-user
information theory, even after fifty years of work, remains a largely
unsolved field.

2) Theorem 1 is a straightforward application of Fano's inequality, so yes, it
also applies in non-distributed settings. Our main contributions are Theorems
2-4, which *are* distributed. But Theorem 1 is sharp in some settings, as
exemplified by Proposition 1, where estimating the uniform family can be done
with exponentially fewer bits (in m) with interactivity than without it (or
than for the normal family).

3) We provide *lower* bounds for the *easier* fixed-design regression problem;
a harder problem (where the design matrices are unknown) can only have larger
lower bounds. Given Zhang et al.'s [18] results, however, our lower bounds
are sharp in this case as well.

Technical comments:

- We view an understanding of communication complexity as fleshed out only
when there are demonstrable tradeoffs, which necessarily require lower
bounds.

- Dependence on B: The compactness of support of the uniform appears
to allow improvements. We will explain this more.

- 119: The protocol and \theta are both infimized over; they may interact
and our definition does not exclude this. We will be clearer in the final
version.

- 148: The bound is tight for essentially any 1-machine setting; we may either
choose the number of samples n or the communication bound B to make it tight.

- We will discuss connections to Balcan et al. Given the difference in
settings, the connections are not immediate.

- Section 2: The lower bounds become somewhat easier to prove when L \le B,
but Theorems 2-4 do not change. The upper bound in Proposition 1 requires
randomized communication; we believe a \log(m) dependence in the upper bound
is impossible without randomization. (Each processor must communicate at least
1 bit in a non-random setting.)
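
To illustrate the distinction (this is not the protocol of Proposition 1): if
each machine independently sends a one-bit message only with probability
\log(m)/m, the expected total communication is about \log(m) bits while the
worst case is m bits; a deterministic protocol in which every machine must
communicate already uses at least m bits.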

REVIEWER 9

We think the title is pretty descriptive as well :).

Does the reviewer have suggestions for parts of Section 5 that are unclear?
We would certainly like the intuition to come through clearly and would
appreciate any suggestions. We recognize that after the standard lower bound
(through Eq. (12), essentially), the calculations become quite complex and a
bit non-intuitive. Perhaps making clearer the importance of the strong
data-processing inequality (15), which is the essential part of the argument,
and its connections with the probability of error in identifying the V_j
would be helpful.