NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5951
Title:Unified Sample-Optimal Property Estimation in Near-Linear Time

Reviewer 1

Originality: While using polynomial approximators for functionals of distributions has a long history, the "local" sense in which this paper proposes using them is interesting. The discussion of prior work in the context of the paper's results could use more elaboration; see below. Quality: I have not checked all the proofs, but the ones I did check seem correct. Clarity: The paper is reasonably well-written and intuition is presented well. Significance: While the methodology is interesting and the applications to various estimation problems seems promising, the paper would benefit from a revision that better articulates certain specific conclusions. Details below: -- The paper left me slightly confused about the correct asymptotic to consider in order to compare the various methodologies for these problems. In particular, the results of the current paper seem to exhibit a variance term that has a strictly sub-optimal exponent of the sample size n. In the case where k is fixed and n \to \infty, this should imply that the estimators in the current paper are not actually minimax optimal? Optimality is of course evident in the case where k grows, but this is not mentioned anywhere in the manuscript. It would be good to be up front about this. -- In the absence of concrete experiments for some of these properties, it is not clear that this methodology is actually more suited to the various properties than existing estimators, so an empirical evaluation (even just on synthetic data) would be very useful. In parituclar, many of the claimed imporvements over prior art are (sub-)logarithmic, and so it is not immediately clear that employing this methodology provides practical gains in useful regimes of n. Specific comments: -- It would be good to make the Lipschitz constant explicit in the statement of Theorem 1. -- In table 2, I think it would be helpful from a comparison perspective if the best available bounds for these three properties (before this paper) were also included, so as to facilitate a concrete comparison. -- Line 101: plug -> plugging -- Line 234: liner -> linear -- Line 277: Number of symbols is at most k, correct? The draft says n. In general, this subsection is slightly confusing and should be rewritten for clarity.

Reviewer 2

This paper considers property estimation for discrete distributions: given a "smooth enough" functional f of distributions, assumed further to be "additive" (i.e., f(p) = sum_i f_i(p_i), where p=(p_i)_i is the distribution identified to its pmf), the authors derive a general approach, based on polynomial approximation, to obtain optimal or near-optimal rates, with estimators that are easy to construct and can be computed in time nearly linear in the sample size. The main result is Theorem 1, which provides the general upper bound statement. Overall, I find the results and approach of this paper nice, but somewhat incremental; and far from the "novelty" and "strong implications" claimed. (Also, the privacy part doesn't really seem to fit with the rest... basically, you get Lipscitz estimators, so can get privacy almost for free; maybe make that a distinct section, instead of putting it in the middle, as I'd argue it's not really the meat of your work?) Comments: - line 73: distance estimation to a fixed distribution or between two unknown distributions is known as well by the techniques of Valiant-Valiant; see ***- ll.93-96: This is one of my main criticisms. You claim a novel approach, departing conceptually from the two extisting methods you described before. However, what you described here does not seem really novel: it is, unless I misunderstood what you describe, a refinement of the "approximation" technique, so sounds a lot more incremental than what you suggest. - ll 127-133 on high probability: this is really misleading. As far as I can tell, the lack of high-probability statements from previous work is not that the approaches do not yield the "good" (additive) dependence on log(1/\delta): for all we know, they might (using McDiarmid, or even direct arguments?), or might not -- it's just that it wasn't *considered* in those works. Thus, claiming that your work significantly improves on those wrt delta by comparing your results with the previous one *using the median trick* is not something of a misrepresentation. - ll. 140-141: so you need subconstant approximation parameter eps for support size? That's a *big* restriction. - l.150: how "conventional" is that? Is it actually common (I don't recall seeing this assumption). Also, do you assume knowledge of that constant c by the algorithm? - l.154-155: you should compare with the upper bound of [2]! They have a result on this, and your upper bounds are nearly the same (you improve by basically a (log k)^2 factor in the middle term). You cannot not mention it! (also, please remove the constants 2 in 2eps, 2alpha, they are not meaningful and obscure the statement, since you use \lesssim afterwards) - l.234: "liner" -> "linear" UPDATE: I have read the authors' response, and taken it into account in my assessment.

Reviewer 3

The paper considers the problem of distribution property estimation, where you observe n i.i.d samples from a distribution discrete distribution p over [k] = {1, 2, ...,k} and you want to estimate a certain property f(p) of the distribution. The problem is a fundamental problem in statistics and machine learning with an increasing interest in the finite sample regime where you consider the sample complexity during the recent decade. The paper considers all distributions that can be expressed in an additive form with f(p) = \sum_i f_i(p_i) and provides a unified algorithm by approximating each f_i by a piece-wise polynomial function f'_i and the final estimation is the sum of unbiased estimators for f'_i(p_i)'s. The proposed estimator: 1) achieves a sublinear sample complexity of (k/\log k) for all Lipschitz properties in near-linear time. 2) achieve optimal sample complexity for various symmetric properties including Shannon entropy, support size, support cover, distance to a certain distribution and etc. Although sample optimal estimators for the aforementioned properties already exist in the literature. But it is not clear whether they can be extended to other all additive properties. The idea of using best polynomial approximation to estimate additive properties is not new (See [Wu, Yang, 2016] for entropy estimation). This paper extends the idea to a more general class of distribution properties by applying best polynomial approximation in multiple intervals. And this extension might be of independent interest. Cons: 1. The writing of the paper needs to be improved. It might be good to write what the final estimator will be in Section 2 to give an overall view of the whole algorithm before going into math details in the following sections. 2. Detailed comments´╝Ü 1) The abbreviation PML is used without defining. 2) Line 81: belongs to 3) The bounds for power sum seem not matching. Can you elaborate more on this? 4) The equation between 154 and 155 is missing a term \frac{\log^2{\min{n,k}}}{\alpha^2}. 5) Line 214. The statement may not be correct without proper assumption on the posterior of p although I believe lien 213 is correct. 6) Line 237. Should be bias instead of error. 7) The proof for theorem 3 is missing in the appendix. Overall, I would recommend the paper for acceptance.