NeurIPS 2020

Asymptotic Guarantees for Generative Modeling Based on the Smooth Wasserstein Distance


Review 1

Summary and Contributions: ***Update after author feedback*** My initial impression stands. I think this is a very good paper relative to other submissions I have reviewed, both currently and in the past, for ML conferences. In my opinion, the question tackled by the authors is hard and relevant, and the theoretical results are not trivial at all. I agree with the author rebuttal that the results are of a statistical, rather than computational, nature. However, the computational limitations should be put more upfront, as they are mentioned only briefly. On the other hand, my main concern remains: the discretization error of the SWD still depends exponentially on the dimension (ref [26]), but in this manuscript it is conveniently hidden in the "approximately less than or equal" ($\lesssim$) notation, which I think is misleading and should be remarked upon. The rebuttal has clarified my concerns about Figure 1, and I agree it would be good to expand on its description.

***Original Review*** The paper derives limit distributions for the smooth Wasserstein distance between a probability measure and its empirical counterpart, as the number of samples tends to infinity. The smooth Wasserstein distance is the usual Wasserstein distance between two measures after each has been convolved with Gaussian noise. Such results were previously known only in one dimension, or for measures supported on a countable set. Measurability, consistency, and limiting distributions of the minimum smooth Wasserstein distance are also derived.
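For concreteness (this is my own restatement of the object under study, in what I understand to be the paper's notation): writing $\mathcal{N}_\sigma = \mathcal{N}(0, \sigma^2 I_d)$ for the isotropic Gaussian, the smooth Wasserstein distance between two probability measures $\mu$ and $\nu$ is
$$W_1^{(\sigma)}(\mu, \nu) = W_1(\mu * \mathcal{N}_\sigma, \nu * \mathcal{N}_\sigma),$$
i.e., the ordinary 1-Wasserstein distance evaluated after both measures have been convolved with the same Gaussian noise.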

Strengths: The paper is highly significant as it further develops the theory around a recently proposed smooth Wasserstein distance that attempts to overcome the curse of dimensionality of the original W1. The work is highly relevant to the community, as optimal transport has been used in many machine learning applications in recent years. The clarity of presentation and mathematical rigour are another strength of this paper: it states the results in a precise and concise manner while keeping the notation simple. The quality of the provided proofs and the general structure and style of the paper show that the authors have a deep understanding of the topic.

Weaknesses: The main weakness I find is that Figure 1 is hard to understand visually: the point clouds all look pretty similar, and I have a hard time telling whether there should be a difference between the clouds corresponding to different colors. Perhaps the authors should elaborate a bit more on the main takeaway from this figure and on what we should expect from the point clouds corresponding to different colors.

Correctness: I have checked the high-level points of the proofs for soundness and they seem correct; however, I would say such results merit more thorough reviewing than what a conference review cycle allows. Claims and arguments are clear and well written. The results are built with a mix of new and well-known techniques, and also rely on well-established results in statistics and probability. The proposed experiments illustrate the theory well. However, I was somewhat confused by Figure 1 (see the weaknesses section).

Clarity: The paper is really well written, it is easy to follow and the main claims and statements of the theorems are precise. The level of rigour is high while preserving readability.

Relation to Prior Work: The main differences with previous work are clearly stated and authors provide enough references to relevant material.

Reproducibility: Yes

Additional Feedback:
1. Integral probability metrics may attain the parametric rate of convergence n^(-1/2); see, for example, https://papers.nips.cc/paper/6483-minimax-estimation-of-maximum-mean-discrepancy-with-radial-kernels.pdf. Please check your claim in line 34. Also, it is not quite right to present the Wasserstein distance and IPMs as different things (lines 33-34), because W1 is a particular case of an IPM.
2. Line 45 is a bit misleading; it might be better to recall that the rate of W1^sigma still depends exponentially on the dimension.
3. Line 56: "allows enables" -- choose one.
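To make point 1 concrete (this is standard material, not specific to the paper): an integral probability metric indexed by a function class $\mathcal{F}$ is
$$d_{\mathcal{F}}(\mu, \nu) = \sup_{f \in \mathcal{F}} \left| \int f \, d\mu - \int f \, d\nu \right|,$$
and by Kantorovich-Rubinstein duality, W1 is the special case where $\mathcal{F}$ is the class of 1-Lipschitz functions; MMD corresponds to taking $\mathcal{F}$ to be the unit ball of an RKHS, which is the setting of the reference above where the parametric rate is attained.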


Review 2

Summary and Contributions: After reading the rebuttal, I slightly increased my grade. I think the authors only partially addressed the issue I raised. In particular, the statement "exponential in d runtime (though no major issues arose in practice)" is quite problematic and suggests that the authors are not taking the numerical evaluation of the theory seriously. == This paper proposes a theoretical study of Gaussian smoothing of optimal transport. In particular, it carries out a sort of central limit theorem analysis, thereby extending some previously known results for (un-regularized) Wasserstein distances.

Strengths: It studies an important problem, is well written, and the mathematical treatment is rigorous. I would be quite supportive of its acceptance provided the issues raised below are addressed. I think this should include a rewriting of some parts, and this could be explained in the rebuttal.

Weaknesses: I think the paper vastly oversells the Gaussian smoothing technique, to the point that it does a disservice to its own interesting mathematical results. In short, I think the paper does not provide a fair overview of the current bottlenecks of using OT in high dimension, mostly because: 1/ up to now, estimating the Gaussian-smoothed distance seems intractable; 2/ there is no clear statement about why W^sigma is an interesting quantity to minimize. Either it is because one is interested in approximating OT (but this is not the problem under study), or it is because it carries interesting properties for a non-vanishing sigma, but this is not really discussed. Quoting the authors, "W^sigma enjoys the rich structure of Wasserstein distances", but I am unsure what this means. I would love to be wrong on both issues (and I might be wrong), but at the very least I find that the paper does not really address these key issues and hides the most important and interesting questions of OT for ML. In the current state of the paper (in particular given the numerical schemes proposed, using re-sampling and GAN-like methods), I think the statement "posing the SWD as a potent tool for learning and inference in high dimensions" is not correct.

Correctness: From a theoretical perspective, the paper is correct. The numerical part, however, is not really satisfying (even as a proof of concept to validate the theory), as explained below.

Clarity: Yes the paper is well written.

Relation to Prior Work: As explained below, a lot of relevant literature is missing.

-- Bibliography --
[1] Jonathan Weed and Quentin Berthet. Estimation of smooth densities in Wasserstein distance. In Conference on Learning Theory, pages 3118-3119, 2019.
[2] Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems, pages 4543-4553, 2019.
[3] Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1574-1583, 2019.
[4] Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications.
[5] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances.

Reproducibility: Yes

Additional Feedback: The list of remarks and questions I have:
* The current standard for regularizing OT is entropic regularization of the transport plan (Cuturi's paper [5], and also the sample complexity results [3,4]). This paper seems to mostly ignore this literature, which is quite surprising given that the goals are (almost) the same.
* There are known results on the asymptotic distribution of entropy-regularized transport; see for instance [4] (there might be other references, including studies of parameter estimation). Given that entropic regularization can (and should) be viewed as a "cheap proxy" for Gaussian smoothing, a proper and detailed comparison seems in order.
* The biggest issue of the paper (unless I am missing something) is the actual computation of W^sigma. The authors seem to be using a re-sampling scheme: "Sampling from P_n and N_sigma and adding the obtained values produces samples from P_n * N_sigma". The potential problem is that, to cope with the curse of dimensionality, this might require a number of samples exponential in the dimension (a rough sketch of the plug-in scheme I have in mind is given after this list).
* Gaussian smoothing is very similar to the technique proposed in [1] to fight the curse of dimensionality for smooth densities (they use a more involved wavelet method, but for uniformly smooth densities this would have the same properties). They do not propose efficient algorithms, and use a re-sampling scheme and a discrete plug-in estimator, which I have the impression is the same as what is used here. Some comments on this type of approach seem in order.
* The use of a GAN-type method is also problematic and, I believe, not covered by the theory.
* This is in sharp contrast with Sinkhorn's approach, which seems to offer similar 1/sqrt(n) sample complexity guarantees and could be amenable to a similar analysis of its asymptotic distributions (see the references above), but at a cheaper O(n^2) computational price, independent of the dimension. This should at least be discussed.
* The role of sigma>0 is not really explained. For instance, the dependency on sigma of the constants involved is not discussed.
* The extreme case of Sinkhorn regularization is MMD. It is also the large-sigma limit of Gaussian smoothing. MMD does not suffer from the curse of dimensionality either. So what are the tradeoffs, in terms of parameter selection versus sample size and smoothness of the distributions, governing the choice between Gaussian smoothing, Sinkhorn smoothing, and MMD?
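To make the computational concern in the third bullet concrete, here is a rough sketch of the kind of plug-in / re-sampling estimator I understand to be in play (this is my own illustration in Python, not the authors' code; the function name smoothed_w1_plugin and the resample size m are mine):

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.optimize import linear_sum_assignment

    def smoothed_w1_plugin(x, y, sigma, m=2000, seed=0):
        # x: (n, d) samples from P;  y: (k, d) samples from Q.
        # Draw m points from each smoothed empirical measure by resampling
        # with replacement and adding N(0, sigma^2 I) noise.
        rng = np.random.default_rng(seed)
        d = x.shape[1]
        xs = x[rng.integers(len(x), size=m)] + sigma * rng.standard_normal((m, d))
        ys = y[rng.integers(len(y), size=m)] + sigma * rng.standard_normal((m, d))
        # With m uniform-weight points on each side, the discrete OT problem
        # reduces to an optimal assignment; its average cost is the empirical W1.
        cost = cdist(xs, ys)  # Euclidean ground cost
        rows, cols = linear_sum_assignment(cost)
        return cost[rows, cols].mean()

My worry is precisely the choice of m: for the noisy point clouds to approximate the smoothed measures in high dimension, m may need to grow exponentially in d, a cost that is invisible in the statistical rates.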


Review 3

Summary and Contributions: This paper deals with minimum distance estimation, i.e., estimation of a parameter $\theta$ by minimizing (with respect to it) a statistical distance between the observed empirical distribution and the target one (which depends on $\theta$), when the statistical distance at stake is the smooth 1-Wasserstein distance. The main contribution of this paper is to establish asymptotic results for the smooth 1-Wasserstein estimator and related quantities when the underlying distribution is sub-Gaussian.

== Update == I would like to thank the authors for their response. They plan to modify the manuscript based on my main remarks, and their feedback is satisfactory in this regard. My opinion on this paper remains very positive.
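To fix notation for my comments below (schematically, as I read the paper's setup): with $P_n$ the empirical measure of the sample and $\{Q_\theta\}_{\theta \in \Theta}$ the parametric model, the estimator under study is
$$\hat{\theta}_n \in \operatorname{argmin}_{\theta \in \Theta} W_1^{(\sigma)}(P_n, Q_\theta),$$
where $W_1^{(\sigma)}$ denotes the smooth 1-Wasserstein distance with smoothing parameter $\sigma$.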

Strengths: Wasserstein distances (and others) suffer from the curse of dimensionality: the rate of convergence of the empirical distribution is of order $n^{-1/d}$, which becomes worse and worse as the dimension $d$ increases. In contrast, the paper states that the rate of convergence under the smooth Wasserstein distance is $n^{-1/2}$ regardless of the dimension, when the distribution is sub-Gaussian. This main result is, to me, of great interest for the community, and I felt particularly enthusiastic when reading the paper.
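In symbols, and hiding constants (which may depend on $d$ and $\sigma$): for an absolutely continuous distribution $P$ on $\mathbb{R}^d$ with $d \geq 3$ one has $\mathbb{E}[W_1(P_n, P)] \gtrsim n^{-1/d}$, whereas, as I read Theorem 1 and the prior work it builds on, $\mathbb{E}[W_1^{(\sigma)}(P_n, P)] \lesssim n^{-1/2}$ for sub-Gaussian $P$.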

Weaknesses: From my point of view, the main limitation of this work lies in the choice of the smoothing parameter $\sigma$, which is unfortunately never discussed. From both the theoretical results (Theorem 1) and the numerical simulation study, one may think that the best choice is to take it very large, but in that case the smooth Wasserstein distance can be far from the usual one. The choice of a statistical distance cannot be based only on its asymptotic properties; what it measures on the data matters as well. What is the opinion of the authors on this question? Perhaps it would have been a good idea to compare generalization results obtained with the Wasserstein and smooth Wasserstein distances.

Correctness: To the best of my understanding, the results presented in this paper are correct.

Clarity: The paper is pretty easy to read although technical. I have the following two minor remarks:
- The paper is presented with a strong emphasis on generative modeling, but the theoretical and numerical results it contains, although very interesting, concern classical statistical problems (parametric estimation). Despite the paragraph "Generalization discussion" (l.248), I think that the content of this manuscript could benefit from another presentation.
- In Theorem 1, the order of the dependency on $\sigma$ is easy to read, but this is not the case in other results. I understand that this can be very tricky and technical, especially because of the definition of the limit process, but perhaps the authors could help the reader on this point.
Typos:
- l.56: "allows enables"
- l.95: "sharp rate are"
- l.138: "one many then obtain"
- l.279: "the distribution [...] converge"

Relation to Prior Work: The sliced Wasserstein distance has properties similar to those of the smooth Wasserstein distance investigated in this paper, since neither suffers from the curse of dimensionality (see Theorems 5 and 6 in Nadjahi et al., NeurIPS 2019). I would have liked the positioning in relation to this work to be more detailed than the few lines written in the introduction. Perhaps it goes beyond the scope of the paper, but a natural question is: which distance should one choose, given that neither suffers from the dimensionality problem? This can be answered from a theoretical or a very practical point of view.
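For reference, the sliced distance I have in mind is (up to normalization conventions)
$$SW_1(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_1(\theta^\star_{\#}\mu, \theta^\star_{\#}\nu) \, d\theta,$$
where $\theta^\star_{\#}\mu$ is the pushforward of $\mu$ under the projection $x \mapsto \langle \theta, x \rangle$, so the computation reduces to one-dimensional transport problems.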

Reproducibility: Yes

Additional Feedback: See my questions and comments in the previous sections: weaknesses, clarity, and relation to prior work.


Review 4

Summary and Contributions: +++++++++ Update after reading the author's response +++++++++ I thank the authors for their response. However, I agree with reviewer #3 that the claim that the rate of convergence is independent of the sample dimension should be tempered in view of the computational reality of actually using the metric. In particular, as I mentioned in my review, this needs to be addressed via a more thorough exposition of the GAN experiment. +++++++++++++++++++++++++++++++++++++++++++++ The paper establishes new, nice, and important theoretical properties of the smooth Wasserstein distance (SWD) as a statistical distance for learning generative models. In particular, convergence at a rate independent of the sample dimension (of order one over the square root of the sample size) is proved for the parameter of the distribution minimizing the SWD to the empirical distribution of the sample.

Strengths: The paper is well written and provides proofs or references for all stated results. It provides new results which look very promising for the machine learning community by establishing statistical properties of a statistical distance that are independent of the data dimension. These results are very likely to draw attention to the smooth Wasserstein distance in the ML community, as this statistical distance is shown to be quite unique in its robustness to the curse of dimensionality. The authors also provide convincing empirical evidence for their result.

Weaknesses: While the paper is well written by mathematical standards, the importance of the work for ML applications may be difficult to grasp for readers with a more empirical background. In particular, the presentation of the last experiment, which uses a parametric family of distributions given by a generative network, can be improved.

Correctness: I do not have the expertise to check the proof in the amount of time provided, but I am confident that an expert in the field should have no problem doing this job given the clarity of the authors' exposition. The empirical methodology is correct, even though more details are necessary for the last experiment.

Clarity: Yes, as mentioned already. However, more effort could be made to reach the wider audience interested in the results. Note the following typos:
- l.56: choose between "allows" or "enables"
- l.95: "rate" should be "rates"
- l.511: "have verify" should be "have to verify"

Relation to Prior Work: This is done in much detail.

Reproducibility: Yes

Additional Feedback: