NeurIPS 2020

SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Review 1

Summary and Contributions: In this paper, the authors give a new perspective on Stein Variational Gradient Descent (SVGD). Rather than viewing it as a gradient flow of the Kullback-Leibler divergence, they view SVGD as the (kernelized) gradient flow of the chi-squared divergence, which exhibits a strong form of uniform exponential ergodicity under conditions as weak as a Poincare inequality. Based on this theoretical analysis, they also propose an alternative to SVGD, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD).
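(For readers unfamiliar with the baseline being reinterpreted here, the standard SVGD particle update of Liu and Wang can be sketched as below. The RBF kernel, bandwidth, step size, and Gaussian target are this reviewer's illustrative choices, not taken from the paper.)

```python
import numpy as np

def svgd_step(x, grad_log_pi, step=0.1, bandwidth=1.0):
    """One SVGD update for particles x of shape (n, d)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sq_dist / (2 * bandwidth ** 2))     # RBF kernel matrix
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = sum_j diff[i, j] * k[i, j] / h^2.
    repulsion = (diff * k[..., None]).sum(axis=1) / bandwidth ** 2
    phi = (k @ grad_log_pi(x) + repulsion) / n      # Stein variational direction
    return x + step * phi

# Illustrative target: standard Gaussian, so grad log pi(x) = -x.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(50, 1))
for _ in range(300):
    x = svgd_step(x, lambda z: -z)
```

The kernel-averaged gradient pulls particles toward high-density regions, while the repulsion term keeps them spread out; the paper's contribution is reinterpreting which divergence this flow is descending.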

Strengths: 1. They provide a detailed theoretical analysis of SVGD, which appears sound. 2. Based on this analysis, they propose a new algorithm called LAWGD.

Weaknesses: 1. My main concern about this work is the experiments. Although the paper is equipped with powerful theoretical analysis, its experiments on LAWGD are weak. I am not very familiar with SVGD, but I do not think the experiments are sufficient for NeurIPS. For example, in [1], the authors apply SVGD to Bayesian logistic regression and Bayesian neural networks on different tasks. Compared with [1], this work only illustrates the advantage of LAWGD on simulated data, such as mixtures of Gaussians. If the authors can provide more experimental results on LAWGD, I may improve my score. 2. The theoretical analysis takes up most of the paper. Many readers may not be familiar with such detailed analysis; most will want to know how this work differs from existing ones. However, such discussion is scarce in the paper; for example, the related-work section is somewhat weak. The first reason is the main reason I give the score 5. The second is my personal suggestion on how to make a theoretical paper easier to understand for most readers. [1] Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Correctness: I am not very familiar with this field. I have tried my best to understand the method and proofs, and I believe they are correct to the best of my current knowledge.

Clarity: The paper is well written. However, as a personal suggestion, I think the authors could discuss the differences from, and novelty over, existing work in more depth.

Relation to Prior Work: Not very clear.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper makes the following contributions: 1) an interpretation (up to a constant factor of 2) of SVGD as the (kernelized) gradient flow of the chi-squared divergence, referred to as CSF; 2) exponential ergodicity of CSF (in the continuous case) with respect to the KL and chi-squared divergences, under a certain Poincare condition (or LSI) on the target; 3) the proposal of a Laplacian-based kernel for SVGD, with scale-invariant ergodicity results established under certain assumptions. =====updates after response===== Line 11 of the response mentions that kernel selection is an issue. Indeed, this is an issue with any kernel method (from SVM to MMD to SVGD), and it has been addressed in various ways; if one were critical, there is still no "nice" way to pick a kernel. As mentioned in Lines 16 and 17, a single integral operator depending on the target \pi is good (and in a way also along expected lines -- for example, in the MMD context something similar leads to optimality properties). However, I do not agree 100% with Lines 27-28 that "solving high-dimensional PDEs is precisely the target of intensive research in modern numerical PDE", which is my main concern with the practical applicability of the proposed work. To the best of the reviewer's knowledge there is no concrete progress in this direction, despite several recent ad-hoc approaches. I also recognize, however, that making concrete progress in high-dimensional numerical PDE is not the main aim of this paper. Given this, I remain skeptical of the practicality of the proposed approach (there is a good chance it will remain difficult to compute for various \pi and dimension combinations -- I would be delighted to be proven wrong! This is unfortunately the case for many practical kernel selection methods). Hence, I retain my score.

Strengths: 1) The interpretation of SVGD (up to a constant) as Chi-squared flow is interesting. 2) The proposal of Laplacian based kernel for SVGD is interesting.

Weaknesses: 1) The main issue with the paper is the lack of results or algorithms in the discrete setting. In particular, how to compute the Laplacian-based kernel efficiently is left entirely as an open question, with a pointer to the numerical-PDE literature. It could be argued that the numerical techniques in the PDE literature are not scalable, which severely limits the applicability of the proposed method. Do the authors have more to comment on this aspect? 2) The exponential ergodicity results are more or less standard given the flow interpretation. Hence, the main contribution of the paper is an "idea" or a "method" with some potential.
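To make the scalability concern concrete: even in one dimension, a naive construction of a generator-based kernel via finite differences already requires a full O(n^3) eigendecomposition of an n x n grid matrix, and the grid itself grows exponentially with dimension. The sketch below is this reviewer's own illustration, not the authors' scheme; the Gaussian potential V(x) = x^2/2, the grid, the sign convention for the generator, and the Euclidean symmetrization are all assumptions.

```python
import numpy as np

# Discretize the Langevin generator L f = -f'' + V' f' on a 1D grid,
# eigendecompose it, and build a LAWGD-style kernel
# K = sum_k phi_k phi_k^T / lambda_k over the positive spectrum.
n, lo, hi = 200, -6.0, 6.0
x = np.linspace(lo, hi, n)
h = x[1] - x[0]
Vp = x  # V(x) = x^2 / 2, so V'(x) = x (standard Gaussian target)

# Second-difference and centered first-difference matrices.
D2 = (np.diag(np.ones(n - 1), 1) - 2 * np.eye(n)
      + np.diag(np.ones(n - 1), -1)) / h ** 2
D1 = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * h)
L_mat = -D2 + np.diag(Vp) @ D1  # discretized generator (up to sign convention)

# Crudely symmetrize (the operator is only self-adjoint in L^2(pi)),
# then invert on the strictly positive part of the spectrum.
lam, phi = np.linalg.eigh((L_mat + L_mat.T) / 2)
keep = lam > 1e-8  # drop (near-)zero modes such as the constant function
K = (phi[:, keep] / lam[keep]) @ phi[:, keep].T
```

The eigendecomposition is the bottleneck: it costs O(n^3) here, and a tensor-product grid in d dimensions would have n^d points, which is exactly why the discrete setting is the open question.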

Correctness: To my understanding the claims and the methods are correct.

Clarity: yes -- very easy to read and understand

Relation to Prior Work: yes.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: Stein variational gradient descent (SVGD) is a particle-based sampling algorithm that has attracted much attention recently. While SVGD has typically been viewed as the kernelized gradient flow of the KL divergence, this paper offers a new perspective by interpreting it as the kernelized gradient flow of the chi-squared divergence. Adopting this perspective, the paper proves uniform ergodicity under a Poincare inequality. Motivated by the search for an optimal kernel, the paper further proposes an alternative sampling algorithm, termed Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), that exhibits stronger convergence guarantees.

Strengths: SVGD and its analysis from the perspective of gradient flows have attracted much attention recently, and I believe this paper will be a refreshing addition to the literature. The paper is clearly written, and the derivation of SVGD as the kernelized gradient flow of the chi-squared divergence is simple and rather elegant. It’s nice to see the advantage of this interpretation over SVGD_p, namely that the kernel integral operator depends only on the target distribution and is stationary through time. While the proofs are not very complex, the result on strong uniform ergodicity under a Poincare inequality is quite appealing. I also find the motivation and derivation of the LAWGD algorithm from the generator of the Langevin diffusion insightful.

Weaknesses: An obvious limitation of the current work is that the current implementation of the LAWGD algorithm is merely a proof-of-concept, and much more efficient numerical solvers should be used in place of the current finite-difference scheme to make the algorithm actually practical. However, this is fine with me as the paper’s primary focus is theoretical, and would probably inspire follow-up research on developing more practical algorithms. I should also note that while the theoretical results look appealing to me, I am not sufficiently familiar with the recent developments in the learning theory community to comment on the novelty and innovativeness of the analysis techniques.

Correctness: To the best of my knowledge, yes; but I have not checked the proofs step-by-step.

Clarity: Yes, the paper is clearly written and a pleasure to read.

Relation to Prior Work: I believe the paper has addressed prior work appropriately, but I may be missing very recent developments in the learning theory community.

Reproducibility: Yes

Additional Feedback: I thank the authors for their response, and I also think Reviewers 2 and 4 raised valid points regarding the lack of discrete-time results. On the other hand, regarding the lack of extensive or practical experiments/applications: while this is definitely the case, I believe a simple proof-of-concept is sufficient for a paper whose focus is theoretical, and the present paper will likely inspire future research that develops more practical algorithms. In this regard, I'm happy to raise my overall score from 7 to 8. ================================================== I have some more specific comments/questions:
- While the derivation of the chi-squared divergence perspective does not exhibit a direct analogy to the case of the KL divergence, could similar interpretations be established for other f-divergences?
- A well-known problem with SVGD is that in high dimensions the particles tend to collapse around some modes of the target distribution and fail to spread out. Theoretically speaking, would LAWGD suffer from the same issue with, e.g., the RBF kernel, or would one expect LAWGD to explore the parameter space more fully?
- As a consequence of Equations (10) and (11), it’s somewhat interesting that to approximate K_\Lcal one is interested in the bottom of the spectrum of \Lcal rather than the largest eigenvalues. Does this have any practical (numerical) implications for discretizing the algorithm?
- It’s quite interesting to see the connection between particle-based sampling and the spectrum of the Schrodinger operator. More generally, could LAWGD be derived from the Schrodinger operator rather than from the generator of the Langevin diffusion?
- Minor: In Equation (8), \gtrsim is not properly defined.
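On the point about the bottom of the spectrum: a quick synthetic check, with a randomly generated orthonormal basis and an ascending spectrum of this reviewer's own choosing, confirms that a kernel weighted by inverse eigenvalues is dominated by its smallest eigenpairs, so iterative eigensolvers that target the bottom of the spectrum (rather than the usual largest-eigenvalue routines) would seem the natural discretization tool.

```python
import numpy as np

# Synthetic check: K = sum_k phi_k phi_k^T / lambda_k is dominated by
# the smallest eigenvalues, since the weights 1 / lambda_k decay.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(100, 100)))  # random orthonormal columns
lam = np.linspace(0.5, 50.0, 100)                 # ascending synthetic spectrum
K_full = (Q / lam) @ Q.T
m = 10                                            # keep the 10 smallest eigenvalues
K_trunc = (Q[:, :m] / lam[:m]) @ Q[:, :m].T
rel_err = np.linalg.norm(K_full - K_trunc) / np.linalg.norm(K_full)
```

Here the 10 smallest eigenpairs already capture most of the kernel's Frobenius mass; how well this transfers to the actual operator \Lcal of course depends on its spectral gap and eigenvalue growth.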

Review 4

Summary and Contributions: This paper first analyzes the popular SVGD algorithm as a kernelized gradient flow of the chi-squared divergence, and then proposes Laplacian Adjusted Wasserstein Gradient Descent, another particle-based variational inference method. Under some assumptions on the derivatives of the target distribution, the new method achieves a faster convergence rate.

Strengths: [+] Detailed discussion of the background and related work, e.g., SVGD as a Wasserstein gradient flow and accelerated versions of SVGD. [+] Well-organized methodology section that shows the advantages of the proposed algorithm clearly.

Weaknesses: [-] Following the analysis of SVGD with the KL divergence, the authors analyze the dynamics for the chi-squared divergence. I would like to see further analysis of the discretization of the gradient flow, which is more difficult but more meaningful. [-] Compared to [1] and its variants, the difference is the convergence rate in Theorem 4, which should be highlighted. [-] (Main) My main concern is the experimental part. This paper proposes a new algorithm, and I think the experiments should empirically demonstrate its advantages. (Baselines) Besides SVGD, some recent works discussed in Section 2 should be compared, e.g., [1] and [3]; these algorithms also claim to accelerate SVGD empirically and to improve performance. (Time complexity) The proposed Laplacian Adjusted Wasserstein Gradient Descent needs to compute an eigendecomposition, so I would like to see running-time results for real-world applications, e.g., Bayesian neural networks or generative models [2]. (Practical) I would like to see experiments beyond the toy cases (e.g., Gaussian mixtures), such as Bayesian neural networks or generative models, using Laplacian Adjusted Wasserstein Gradient Descent; otherwise, I would question whether this is a practical algorithm. [1] A Stein variational Newton method [2] Kernel Stein Generative Modeling [3] Stein variational gradient descent with matrix-valued kernels

Correctness: The claims and methods are correct.

Clarity: The paper is well-written.

Relation to Prior Work: The paper clearly discusses how this work differs from previous contributions.

Reproducibility: Yes

Additional Feedback: