Paper ID: 111
Title: Measuring Sample Quality with Stein's Method
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers the general problem of estimating expectations $E_P h$ using a sample drawn (effectively) from some distribution $Q$, and in particular a method for quantifying the quality of $Q$ as a surrogate for $P$. Specifically, they focus on integral probability metrics to measure the quality of $Q$.

The paper includes interesting extensions of technical results on the Stein discrepancy, though at times it is unclear which ones are the authors' own contributions. But overall, the theoretical results seem thorough and complete.

The paper also includes a method for efficiently computing Stein discrepancies, which also seems novel and which they show is of the same order as the classical Stein discrepancy.

Overall, the method is highly interesting and seems to have general applicability. The experimental results illustrate it well.

Q2: Please summarize your review in 1-2 sentences
The paper presents theory and algorithms for estimating Stein discrepancies between distributions. The theoretical results are interesting and the method is highly applicable.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The main contribution of the paper is a proposal for an IPM with certain properties, geared towards a specific application. The proposed IPM is based on a Stein operator, which is by now one of the standard methods for studying convergence in distribution. The motivating application is measuring the quality of biased MCMC procedures. The assumptions are: (i) the expectation under the target distribution is zero (which is made because of Stein's method), (ii) the operator used is the infinitesimal generator of a Markov process, and (iii) the set of functions defining the IPM is what is called the 'classical Stein set' on page 3 of the paper.

The main contributions of the paper are 1) Theorem 7 (and Theorem 2), relating the Stein and (Wasser)stein IPMs for log-concave densities under certain assumptions in the multivariate setting, and 2) providing computable optimization programs for the Stein discrepancy. I especially liked the use of Whitney's extension theorem for computing a graph-based discrepancy.

It would be interesting to rigorously explore rates of convergence for the Stein discrepancy. Furthermore, it would be interesting to explore the relationship to MMD (as one of the main focuses is on computability). Using MMD in this context would correspond to goodness-of-fit style testing for the target distribution (say $P$) using MMD. The authors mention MMD would work when 'expected kernel evaluations are easily computed under the target'. But since the space of RKHS functions is dense in $L_2(P)$, it could essentially approximate a large class of functions (not in the sample sense mentioned in the paper but in a population setting).

I went over the proofs and they seem to be correct at least at a high level. The contributions of this paper are interesting and I would recommend accepting the paper. I think extending and exploring ideas along the lines of this paper should lead to interesting results.
Q2: Please summarize your review in 1-2 sentences
The contributions of this paper are interesting and I would recommend accepting the paper. I think extending and exploring ideas along the lines of this paper should lead to interesting results.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Massive data pose many challenges to MC methods.

In a variety of applications it is easier to sample using an approximate transition kernel than the exact one. Examples include posterior exploration or maximum a posteriori estimation based on small mini-batches of the data. This paper introduces a pseudo-metric based on Stein's method to compare samples from the approximate and exact procedures.

Theoretically grounded, the method is applicable to a wide variety of important problems and can be computed efficiently using off-the-shelf linear programming solvers.

The paper is clear and well-written.

It introduces a new application of Stein's method that has excellent applications to current MC methods.

We also offer a few suggestions for improvement:

- In the decoupled linear program, why is the \ell_1 norm appropriate? It would be beneficial to compare the simulation results for, say, the \ell_1 and \ell_2 norms.
- Provide some intuition about the sufficient conditions in Thm 2 and why they are not necessary; a similar comment applies to the definition of the set G in Eqn (5).

Q2: Please summarize your review in 1-2 sentences
This work introduces a pseudo-metric based on Stein's method to compare the Monte Carlo (MC) estimates from approximate methods and their target value.

The proposed approach is theoretically grounded and can be efficiently computed.

Many experiments are presented that demonstrate the theoretical results on real and simulated data.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Although much more technically involved than your average NIPS paper, it was a pleasure reading and reviewing this manuscript. I do not have many comments:

-Section 4 compresses the material far too much, and makes it difficult to follow. It may be worthwhile to remove some of the results (or move them to an appendix), and expand the discussion in the text. For a journal extension I would certainly include all the detail presented here, but as this is a conference paper I would favour simplicity/clarity over detail.

-Some extra discussion of the implications of the various Stein functions in figure 1 would be nice. I am not sure what to extract from these given the text.

-I think the only thing that is sorely lacking in this paper is an evaluation (experimental, theoretical, or optimally both) of the computational complexity/practicality of what you've proposed. Please comment on why this is missing or suggest a change to the paper that addresses this.
Q2: Please summarize your review in 1-2 sentences
This paper is a very cool application of Stein's method to assessing the quality of samples from MCMC algorithms. It addresses a major gap in state of the art MCMC research. While this paper should probably just be a journal publication (it certainly has enough material to warrant it), I would love to see this at NIPS.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their helpful feedback and provide our responses to specific points raised below.

Reviewer 1:

"Rates of convergence": A theoretical assessment of convergence rates would be very interesting. Prop. 3 (or other upper-bounding techniques) could be used to analyze Stein discrepancy convergence under various sampling assumptions. Thm. 7 and Lem. 8 would then imply explicit rates of convergence for the smooth function distance and Wasserstein distance, respectively.

"Relationship to MMD": MMD (in its most common use as an IPM with H the unit ball of a reproducing kernel Hilbert space) has a number of desirable properties: its IPM optimization problem has a closed-form solution, and, when a universal kernel like a Gaussian kernel is used, MMD satisfies Desiderata (i) and (ii). The chief drawback is that the MMD solution requires explicit integration under P. For example, Gaussian kernel MMD requires computing integrals of the form \int \exp(-||x-z||^2) p(z). While this is straightforward for select target distributions (like finite Gaussian mixtures), these integrals are unavailable for most targets of interest. This limitation on the applicability of MMD was one of the initial motivations for this work, and we will endeavor to make this important point clearer in the text. Note that one can avoid the intractable integration under P by approximating P with a sample from P, but this requires access to a separate surrogate sample.


Reviewer 2:

"Why the L1 norm?": The graph Stein discrepancy equipped with the L1 norm has two especially desirable properties: 1) the resulting optimization problem is a linear program and 2) the optimization problem decomposes into d independent linear programs that may be solved in parallel. Other norm choices, including L2, require solving a much larger coupled non-linear program instead of d decoupled smaller linear programs. We will work to make these points clearer in the text.

"Sufficient conditions in Thm. 2": The Thm. 2 conditions relate to the speed at which the overdamped Langevin diffusion underlying our Stein operator T_P converges to stationarity and were chosen to ensure that the diffusion was a sufficiently smooth function of its starting point. Recent work by Eberle ("Reflection coupling and Wasserstein contractivity without convexity
") suggests that these conditions can be relaxed, and alternative sufficient conditions (for the univariate case) appear in [9,10].

"Intuition about Eq. (5)": To establish Thm. 2, we first bound the Wasserstein distance by a Stein discrepancy based on the non-uniform Stein set in Eq. (5) and then bound the non-uniform Stein discrepancy by the classical Stein discrepancy (Prop. 4). It is in this sense that the non-uniform Stein set leads to a tighter bound on the Wasserstein distance. For general use with a new target distribution, we advocate the parameter-free classical Stein and graph Stein sets, so we do not use the non-uniform Stein sets in our applications.


Reviewer 3:

"Sec. 4 overly compressed ... may be worthwhile to remove some of the results": We are reluctant to cut an example, as we hope that the variety of examples will inspire members of the community with distinct interests to build upon these ideas. However, we agree that Sec. 4 is overly compressed, and we will prioritize revising this section to improve clarity.

"Implications of Stein functions": The recovered test function h = T_Pg represents an explicit witness of the discrepancy between Q and P (as the expectations of Q and P over h deviate by the Stein discrepancy). We are not yet comfortable drawing additional conclusions from these functions, but we believe that a thorough cataloguing of Stein functions for various sample-target pairs is warranted to better understand the implications.

"Evaluation of computational complexity/practicality of what you've proposed":
We will make space to report experiment timings for each of our proposed applications.
Using a single core of a dual socket Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, the most expensive Stein discrepancy computations had the following running times:
-Hyperparameter selection: 0.4s (spanner), 1.4s (coordinate LP)
-Bias-variance tradeoff: 1.3s (spanner), 11s (coordinate LP)
-Convergence rates: 0.08s (LP)