Review for NeurIPS paper: Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Distributions

NeurIPS 2020

Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Distributions

Review 1

Summary and Contributions: This paper introduces "profile entropy" for discrete distributions, and it shows its applications in the tasks of estimation, inference, and compression. More precisely, the speed of estimating a distribution relative to the best natural estimator is determined, the rate of inferring all symmetric properties relative to the best estimator over any label-invariant distribution collection is calculated, and compression of sample profiles are studied for which profile entropy plays the role of fundamental limit.

Strengths: - The paper has some merit by having contributions in studying the applications of profile entropy in the tasks of estimation, inference and compression. The topic of the paper is of interest to the NeurIPS community. - The results of the paper seem correct to me, except for some confusions which I point out below.

Weaknesses: - I found the paper fairly hard to read and follow. Especially, at times it is not clear whether a stated result is a known result or is a contribution of this paper. The paper does not seem to have much coherence and is a collection of relatively different results with no particular storyline. - In line 225, where exactly is the block compression algorithm proposed? Similarly in line 22, it should be pointed out where exactly the adaptive learning algorithm is. - Merely due to the fact that the profile of a sequence is the type of its type, it seems premature to claim that the paper connects profile entropy with the method of types in lines 333-338.

Correctness: The results seem correct to me, except that the statement of Theorem 1 and its proof in the appendix can be made more precise. For any given fixed n, I can compute the right hand side of the inequality, but then what does (D_n ~ H_n) mean for fixed n?

Clarity: The paper is fairly hard to read and sometimes confusing for realizing which of the results are contributions of this paper and which are known results. At times it seems that the paper is written with the style of a survey paper. Many \paragraph are used, some of which are just literature review. For example, the first paragraphs in section 2.4 which are applications of sample profiles in prior works, can be compressed and moved to the introduction.

Relation to Prior Work: The paper has compared its contributions with known results all in the "Main results" Section. The paper will benefit by putting a separate related result section.

Reproducibility: Yes

Additional Feedback: - I would like the author(s) to explain how they are going to improve the clarity of the paper and their own contributions in their response. - It will be good if an example can be provided on sample profile and entropy in Section 1 after their definitions. Also, why is small phi used in line 34 but big phi^n in line 39? - Line 143: has been --> has ----- Post Author Response ----- Since the author(s) have explained how they are going to improve the clarity of their paper in their response, I slightly increase my score.

Review 2

Summary and Contributions: This paper characterizes the profile entropy, which is the entropy of a set that measures the frequencies of symbols. This measure act as a fundamental bound that encompasses several statistical problems such as estimation, inference, and compression. In concrete, the profile entropy measures the discrepancy between the estimated distribution of a sample concerning the natural assignment. On the other hand, estimators of symmetric properties based on the profile entropy universally achieve minimax error. Finally, the profile entropy is a measure of complexity for block compression problems.

Strengths: The work is technically robust, and the contribution is clear in the sense that encompasses their results in three different fields. In particular, I like very much the following ingredients : + The profile entropy seems a novelty contribution that can be very useful in the information theory field and serves as a fundamental bound in a variety of statistical problems; + The proofs of the results are correct and consistent, and connections with universal coding are appealing and interesting.

Weaknesses: Even though the profile entropy seems an interesting tool that connects estimation, inference, and compression, there is no insights about its benefits and implications for practical problems. For instance, + How complex is the profile entropy to compute it? Specially, when dealing with high-dimensional alphabets. + Are there some examples where you can see the PML estimator? + Can we see how asymptotically close the profile is with its dimension? + Besides, the connection of the proofs with the method of types could be developed with more depth. The method of types concentrates the probability in a set, and we use it as a way of characterizing the behaviour of long sequences based on the properties of the type of a given sequence. In that sense, what is the gain of using the profile entropy instead of using the empirical distribution?

Correctness: Yes, the claims are robust even though it lacks empirical analysis. The contributions are well ordered and presented. In each area, the authors connected their work with state-of-the-art results from different references.

Clarity: The paper is well written and the results are well stated.

Relation to Prior Work: Most of the results are clearly connected with previous works and contributions. However, the authors could compare their results more naturally with some of the classical information-theoretic contributions in this field (e.g., universal compression of sequences).

Reproducibility: Yes

Additional Feedback: I suggest the authors to better address connections with universal source coding and related problems already mentioned in comments about "the weaknesses". In general, when comparing with other state-of-the-art, it is important in my opinion to show how (technically) problems are connected. I believe this can add much more value to the present work.

Review 3

Summary and Contributions: The paper proposes profile entropy H_phi(p), a function of any discrete distribution p, as a fundamental measure of how effective one can infer its properties from its samples. The main contributions are the following: 1. For any distribution p, its profile entropy characterizes how well one can estimate the distribution in KL divergence compared to any natural estimator. 2. For any discrete distribution and any symmetric property of a distribution, the paper shows that the plug-in estimator with profile maximum likelihood estimation for the distribution achieves the performance of the best estimator with sample size n/H_phi(p). 3. H_phi(p) characterizes how well one can compress the profile of a distribution losslessly. The paper also proposes algorithms to achieve the compression rate.

Strengths: 1. For instance-based distribution estimation, [Hao and Orlitsky 2019b] shows that there exists an algorithm that achieves a competitive risk depending on the expectation of the profile dimension (which is equal to profile entropy up to log factors). This work complements the result by proving a matching lower bound over all distribution with bounded profile entropy. Taking the maximum over the profile entropy for any distribution p, the bound recovers previous bound in the min-max setting. For several natural distribution classes, the bound strictly improves over min-max bounds over all distributions. 2. There have been several attempts on trying to construct a universal estimator that performs well for any property estimation tasks. However, they also impose some constriaints on the format of the properties (e.g. Lipshictz, additive and etc.). The result in the paper shows that the PML + plug-in estimator can be used as a universal estimator for any symmetric distributions with a loss of at most 1/H_phi(p) factor of the samples compared to the optimal estimators. 3. The concept of profile entropy can be of independent interest for other applications beyond those considered in the work.

Weaknesses: The result about competitiveness of PML estimator doesn't recover the previous results on establishing the optimality of PML for several property estimation tasks. So it would be intersting to see when does the optimality of the bound holds.

Correctness: I didn't check all the proofs in the appendix. But the results seem correct to me.

Clarity: The writing of the paper is good overall.

Relation to Prior Work: The paper addresses comparison with previous results nicely overall. One thing to improve is that it is better to indicate explicitly that the upper bound part of Theorem 2 is achieved by a scheme proposed in previous work [Hao and Orlitsky 2019b].

Reproducibility: Yes

Additional Feedback: Minor comments: The lower bound statement of Theorem 2 is a bit hard to understand. The explanation in the following paragraph does a good job but it somehow complicates things. One suggestion is to state it as min_{\hat{p}} \max_{p : H_phi(p) \le H} r_n(p; \hat{p}) \ge H/n. Line 887 in supplementary: "us to " -> "up to"