NeurIPS 2020

A Non-Asymptotic Analysis for Stein Variational Gradient Descent


Review 1

Summary and Contributions: The authors consider the Stein Variational Gradient Descent (SVGD) algorithm, which optimises a set of particles to approximate a target probability distribution. Under slightly weaker assumptions than prior work, the authors establish that the algorithm decreases the KL divergence at each step. They then show convergence of the infinite-particle system to the target under a Stein log-Sobolev inequality assumption.
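For readers less familiar with the method, here is a minimal sketch of the standard SVGD update (Liu & Wang, 2016) that the paper analyses, assuming an RBF kernel; this is an illustrative sketch of the generic algorithm, not the authors' implementation:

    import numpy as np

    def rbf_kernel(X, h=1.0):
        diff = X[:, None, :] - X[None, :, :]          # diff[i, j] = x_i - x_j
        K = np.exp(-np.sum(diff ** 2, axis=-1) / h)   # K[i, j] = exp(-||x_i - x_j||^2 / h)
        # Repulsive term: sum_j grad_{x_j} K[i, j] = (2/h) * sum_j K[i, j] * (x_i - x_j)
        grad_K = 2.0 / h * np.einsum('ij,ijk->ik', K, diff)
        return K, grad_K

    def svgd_step(X, grad_log_p, step=0.1, h=1.0):
        # One update: x_i += step * (1/N) * sum_j [K[i, j] * grad log p(x_j) + grad_{x_j} K[i, j]]
        K, grad_K = rbf_kernel(X, h)
        return X + step * (K @ grad_log_p(X) + grad_K) / X.shape[0]

    # Toy usage: 50 particles driven toward a standard Gaussian target.
    grad_log_p = lambda X: -X          # score of N(0, I)
    X = np.random.randn(50, 2) + 5.0
    for _ in range(500):
        X = svgd_step(X, grad_log_p)

The paper studies the mean-field limit of exactly this iteration as the number of particles tends to infinity.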

Strengths: The main result is the convergence of SVGD with infinitely many particles to the target, which narrows the gap between theory and practical implementation.

Weaknesses: The finite-particle result (in the appendix) depends exponentially on the parameters, which makes it inapplicable in practice.

Correctness: The proofs seem to be correct.

Clarity: The paper is quite well written.

Relation to Prior Work: The authors nicely present existing results in the field and clearly explain their contribution.

Reproducibility: Yes

Additional Feedback: My detailed minor remarks:
- In Section 2: $$L_2(\mu)$$ should be a space of functions $$f : \mathcal{X} \rightarrow \mathbb{R}$$, not $$f: \mathcal{X} \rightarrow \mathcal{X}$$.
- In Section 2: It would be nice to include a reference for the $$W_2$$ Hessian of the KL divergence - I am not familiar with it.
- In Section 3: Praised be the author who uses the term "positive semi-definite kernel" instead of "positive definite kernel".
- In Section 3.2 there is an inconsistency in notation: in some places we have $$P_{\mu_n} \nabla$$, in others $$P_{\mu_n} \nabla_{W_2}$$.
#UPDATE: After reading the rebuttal and the other reviews, I still recommend acceptance.


Review 2

Summary and Contributions: This paper proves linear convergence in KL of the discretized, infinite-particle version of Stein variational gradient descent (SVGD) under a Stein log-Sobolev inequality (plus additional technical assumptions).

Strengths: The paper proves an important -- although not too surprising -- result about the convergence of infinite-particle SVGD. Overall, the analysis is elegant and, other than the Stein LSI, relies only on very reasonable smoothness assumptions plus a single boundedness assumption. The SM (Prop 12) also provides an important propagation-of-chaos result on the finite-particle accuracy of SVGD. This is a valuable paper for understanding the approximation properties of SVGD and similar algorithms.

Weaknesses: There needs to be more discussion of examples where all three assumptions required for Prop 5 hold. One concern I have is that these assumptions (particularly A1) are in conflict with the Stein LSI holding. Does A1 hold in the 1D examples from Duncan et al. 2019 discussed after Definition 2? Is there any hope of checking whether A3 holds?
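For reference, the Stein log-Sobolev inequality at issue (Definition 2 in the paper, following Duncan et al. 2019) says, written schematically in the paper's notation, that for some $$\lambda > 0$$,

$$\mathrm{KL}(\mu \,\|\, \pi) \le \frac{1}{2\lambda} \left\| P_{\mu} \nabla \log \frac{\mu}{\pi} \right\|^2_{\mathcal{H}},$$

i.e., the squared kernel Stein discrepancy controls the KL divergence. Since A1-A3 constrain the same kernel and target, it would be good to verify that they do not rule out every $$\lambda$$ for which this inequality can hold.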

Correctness: While I have not checked the proofs in detail, the claims appear to be correct and reasonable.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: ********** Update based on author rebuttal: The authors did a good job addressing my comments. I'm happy to accept this paper and have increased my score to 7. Though I very much hope the authors move their finite-particle results to the main paper. ********** A. I would encourage the authors to add at least a short presentation of the finite-particle results to the main paper -- these are important. B. Please be sure to put citations in parentheses when appropriate. There were many confusing non-parenthetical citations at the ends of sentences.


Review 3

Summary and Contributions: This paper proves the first nonasymptotic convergence bound for SVGD (in its infinite-particle limit), the first such analysis in the literature.

Strengths: The paper is well written, and the bounds are clean and understandable. SVGD has become a popular method, hence its analysis is relevant to the NeurIPS community.

Weaknesses: As with most analysis papers, verifying whether the target satisfies the log-Sobolev inequality is not trivial, hence the bounds in this paper may not be immediately practical yet. Furthermore, the analysis applies to the infinite-particle limit, and the finite-particle bounds are not as clean as the infinite-particle case (see my detailed comments).

Correctness: Yes.

Clarity: Yes. In many places, citations are not formatted correctly: they are always of the form Author (year), whereas in many places they should be (Author, year).

Relation to Prior Work: Yes. Previous contributions in this area focused on either a continuous-time analysis or asymptotic convergence results. This paper provides the first nonasymptotic result.

Reproducibility: Yes

Additional Feedback: Post rebuttal: Thank you for your response. I keep my score as it is (which is accept) -- but like R2, I would really like to see the finite-N results in the extra page that would be provided if this paper is accepted.

This paper deals with an important open problem: deriving convergence rates for SVGD in discrete time. The analysis looks sound and nicely done.

(1) First of all, the key to establishing the results is the assumption that the target satisfies a Stein log-Sobolev inequality. Some elaboration on what satisfies this inequality is provided before Sec. 3.4, but it would be especially nice if a somewhat realistic example were given for which the theory provably holds.

(2) The convergence rate is obtained in the number of iterations in discrete time, but in the infinite-particle limit. Therefore, these results are still not precisely about SVGD, but about its infinite-particle limit. Results regarding the finite-particle case are deferred to the Appendix, but I am still not sure what they mean. Is it possible to obtain a unified convergence bound containing both N (the number of particles, as in Prop. 12) and n (the number of iterations, as in Theorem 7)? Currently these two results bound different things, and it is not clear that a meaningful finite-time result (decaying in n *and* N) can be obtained in either the KL or the Wasserstein-2 distance. This is especially the case since the bound in Prop. 12 grows with time, whereas Theorem 7 decays geometrically in the number of iterations. So it looks like the time-decaying nature of Theorem 7 does not carry over to the finite-N case, which should be made explicit as an open problem. Without this final result in the paper, I actually think the title should be slightly edited as well, since the analysis applies only to the infinite-particle limit rather than the real SVGD algorithm with a finite number of particles.

(3) A discussion of the effect of dimension on the bounds is missing. In particular, I strongly suggest the authors identify the dimension dependence (qualitatively, if not quantitatively) of the bounds and comment on whether they deteriorate as d grows and, if so, how fast. Relating to comment (2), it is certain that the finite-particle bound suffers from dimensionality, but it would also be interesting to see whether the infinite-particle bound has any bad dependence on d.

(4) I am not sure about the sentence following Theorem 7: "Hence, if 2cγλ < 1, we obtain exponential convergence." It looks like it is not possible to convert this sentence into a condition on the step size that *guarantees* this holds. So, to be clear, is it possible that the conditions of Theorem 7 are satisfied and yet this quantity is larger than 1? This would result in a bound growing in time. If this is the case, I strongly suggest some discussion of how/when this can happen.
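To make point (2) concrete: schematically (with generic constants $$\rho$$ and $$L$$ standing in for the exact expressions in the paper), the two results have the shapes

$$\mathrm{KL}(\mu_n \,\|\, \pi) \le \rho^n \, \mathrm{KL}(\mu_0 \,\|\, \pi) \ \ (\rho < 1 \text{ if } 2c\gamma\lambda < 1), \qquad W_2(\hat{\mu}_n^N, \mu_n) \lesssim \frac{e^{Ln}}{\sqrt{N}},$$

so any naive combination requires N to grow exponentially with n for the particle-approximation error to stay controlled, which is why I do not see a unified bound decaying in both n and N following from the current results.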


Review 4

Summary and Contributions: This paper studies the Stein Variational Gradient Descent (SVGD) algorithm, a particle-based variational inference method which optimizes a set of particles to approximate a target probability distribution. The algorithm has been shown to be practical, but its non-asymptotic analysis remains incomplete. The most interesting part of this paper is that it provides a guarantee on the convergence rate in KL divergence under some minor assumptions. The authors argue that, with recent results on the Stein geometry, they can draw a clear connection between the time-discretized Wasserstein gradient flow and SVGD. Compared to recent related works, the assumptions are more natural and simpler, and the analysis looks simple yet correct.

Strengths:
[+] Detailed discussion of related work and background.
[+] Clear discussion and summary of the theoretical results.
[+] Well written.

Weaknesses:
[-] In Section 5, the authors provide an analysis of the discretization of the gradient flow. Assumptions A1-A3 do not look very intuitive; I hope some remarks, e.g., a connection to moment constraints, can be added here.
[-] Lines 312-313 say Theorem 7 'gives clearer connections with Wasserstein gradient flows'. Here, revisiting the boundedness condition on the KSD may make the comparison clearer.

Correctness: The claims and methods are correct.

Clarity: This paper is well-written.

Relation to Prior Work: This paper clearly discusses how this work differs from previous contributions.

Reproducibility: Yes

Additional Feedback: