NeurIPS 2020

Variance reduction for Random Coordinate Descent-Langevin Monte Carlo

Review 1

Summary and Contributions: The paper proposes to integrate Random Coordinate Descent (RCD) into Langevin Monte Carlo (LMC)(both overdamped and underdamped version) framework and theoretically analyzes its convergence properties. The paper presents one negative result when directly apply RCD in overdamped LMC.

Strengths: The idea presented is well motivated and clearly presented. And the work may be useful for wide range of applications, especially in the case when gradient info. is not readily available. It may also have potential in generative modeling area.

Weaknesses: This paper is mainly a theoretical paper with convergence analysis included, however, it would not be so convincing without comparisons on toy data. Therefore, it would be better to also numerically and empirically compare with the traditional LMC in terms accuracy and computation efficiency on the synthetic/simulated data. There is a gap between theoretical analysis and empirically performance in general, without illustration on simplest toy data might not be easy to let general audience to embrace the usefulness of the proposed algorithm.

Correctness: The algorithm/methodology seems correct. The convergence proof correctness is beyond my scope and thus unjudgeable.

Clarity: The paper is well written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The author rebuttal did address some of my concerns, more illustrative simulated experiments could be included for demonstration in their revision.

Review 2

Summary and Contributions: I have read the other reviews and author feedback carefully. It would be great if the authors could add the simulation figures in the final paper. ------ This paper considers the problem of sampling from a strongly log-concave distribution when gradient is not available and needs to be approximated via finite-difference. The naive implementation of the randomized coordinate descent does not improved upon the direct finite difference approximation of gradients. The authors propose a new variance reduction approach to implement the randomized coordinate averaging descent (RCAD) to be combined with Langevin dynamics for log-concave sampling. With K denoting dimension, Langevin Monte Carlo (LMC) with direct finite difference approximation of gradients converges in K^2/epsilon function evaluations because each step computes K coordinates of the gradient. The proposed RCAD-O-LMC converges in K^{3/2}/epsilon function evaluation. Similar improvement of RCAD-U-LMC (the underdamped Langevin with RCAD) over U-LMC (underdamped Langevin) is also derived. The improvement is largely inspired by the variance reduction techniques for stochastic gradient methods on a finite sum function in optimization.

Strengths: In the specific setting of strongly log-concave sampling where gradients need to be approximated via finite difference, their methods have improved dimension dependency K on the number of function evaluations. The proposed algorithm is not that novel for optimization, but for sampling, the proposed algorithm is new.

Weaknesses: 1. Some parts of the paper writing seems to be rushed and contains small errors. For example, in the main Theorem 5.1 equation (16) (line 238), it should be \sqrt{1 + R^2} instead of \sqrt{1 + 1/R^2}. Because According to line 108 of Appendix, due to "c_p" choice, R^2 should appear there. It is a mistake. There are also other typos in the Appendix that I will explain below. 2. This paper does not have a convincing application case for their theoretical setting. For the proposed algorithms to have improvement, one has to be in a setting where gradient computation is much more expensive than function evaluation. Could the authors provide a good example? For statistical problems like Bayesian logistic regression, the computation cost for the gradient is similar to function evaluation. If one has a neural network, then the forward cost (function evaluation) and the backward cost (gradient evaluation) have the same order of computation. 3. It would be better if the authors provide some simulation results to check the suggested rate improvement. 4. Theorem 5.1 requires Hessian-Lipschitz condition. Is it possible to provide a version without Hessian-Lipschitz condition in the main paper? 5. The proposed technique seems only work for Wasserstein-2 distance, can the authors comment on other distributional distance?

Correctness: I have checked the proof of Theorem 5.1. The main proof is correct, except for some small typos. For example, Appendix line 108, line 100 1-1/K instead of 1-K, line 103 + is missing etc.

Clarity: The main paper is clearly written. Maybe the heavy notation in Theorem 5.1 and Theorem 5.2 can be cleaned up a bit. Big O notation in Table 1 is not well defined. What is being ignored in big O notation? just constant, or also the condition number? The Appendix needs more work. There are many typos to be fixed. The way the current proof is written is not easy to read. Adding a paragraph or a diagram in the begining of Appendix B to explain the proof sketch might be helpful. Inconsistent notation, like putting subscript 2 in the l2 norm in Line 73 of Appendix should be avoided. Letter B is overloaded, for Brownian motion and for the constant.

Relation to Prior Work: Yes. To my knowledge, this specific type of log-concave sampling where gradient needs to be approximated via function evaluation is not studied before.

Reproducibility: Yes

Additional Feedback: The title could be more clear about the specific setting where gradient needs to be approximated by finite difference. It is not just about high-dimension.

Review 3

Summary and Contributions: The paper deals with the problem of efficient sampling from log-concave distributions in a high dimensional setting. In this setup, a simple yet efficient sampling algorithms is the Langevin Monte Carlo algorithm. It makes use of the gradient of the (log-)pdf to ensure a fast convergence towards the region of high density. However, computing the gradient in a high-dimensional setting is computationally expensive. The contributions of the paper are related to techniques for reducing the computational cost for evaluating the gradient, and are two-fold: first, an analysis of the simple random coordinate descent (RCD) technique for reducing the computational cost for the gradient is carried out, and second, a new technique to reduce the computational cost for gradient evaluation is proposed and analysed. The contributions are theoretical. NOTE: I read the authors rebuttal.

Strengths: Overall, the paper proposes an interesting technique for reducing the per-iteration computational cost of gradient evaluation while still maintaing an overall acceptable computational cost. The technique is employed in two sampling algorithms. The paper starts by analysing an existing technique for reducing the per-iteration computational cost of the gradient evaluation in sampling algorithms, namely random coordinate descent (RCD). The analysis is carried out in the general case and the result is that in such a setting for a given required accuracy there is no overall net gain in computational cost as the per-iteration reduced cost is balanced out by the increase in the required number of iterations. Such a theoretical analysis is interesting and the results are, in my opinion, noteworthy. The authors also point out that there exist particular scenarios in which the RCD approach yields an overall net gain in computational cost for a given accuracy. The paper then proposes an improved technique for reducing the per-iteration computational cost for gradient evaluation. It is important to note that the 'new' gradient that is computed at each iteration and that has a reduced computational cost is in fact an approximation of the true gradient. The issue with the RCD technique is that the approximation has a rather high variance, hence the need for a large number of iterations to reach the required accuracy. The proposed technique starts with a complete gradient evaluation and then at each iteration proceeds to update the gradient in only one direction, which is chosen at random. For the other directions, the technique calls for recycling the gradients from the previous iteration irrespective if they have been updated or not. Theoretical results for the distance between the distribution of the iterates at iteration m and the target distribution are given for both algorithms that are considered in the paper. The improved technique for approximating the gradient and the theoretical results for the distance between the distribution of the iterates and the target distribution are the strong points of the paper. From my point of view, these contributions are of interest for the ML/NeurIPS community.

Weaknesses: The main weaknesse of the paper is that there are no numerical examples of the proposed technique and algorithms. I feel that the authors devoted too much space explaining the existing algorithms: the contributions are discussed starting with section 4 which starts at the bottom of the 5th page. Space could have been easily gained to make room for a section containing a numerical evaluation of the performances of the proposed technique and algorithms. If we look at theorems 5.1 and 5.2 we see that there is a rather intricate dependence of the theoretical results with respect to what we can collectively refer to as parameters of the proposed methods (h, \eta, M, L), the number of iterations and the dimension of input data. A numerical evaluation would have definitely allowed one to gain some insight on the performances of the proposed methods for real-life problems, more so as the contributions are given in a general setting. I am left wondering why did one not consider the case of stochastic optimization and focused solely on sampling. The technique that is used to reduce the number of gradient evaluations in one iteration could be easily applied to an optimization problem. Personally, I would have first studied the advantages offered by the proposed technique for an optimization problem. The fact that the paper analyses the proposed technique in a general setting is a strong point. However, nothing is said about particular cases. How does the proposed technique fares with RCD in such a case, especially in a case in which RCD offers good results? Also, nothing is said about how the method fares with other competing methods besides LMC and RCD.

Correctness: The claims seem to be correct. There are no numerical experiments.

Clarity: The paper is mostly clearly written, though some paragraphs from the introduction section are somewhat ambiguous, namely the last paragraph on page 1 and the first two paragraphs on page 2: - In the last paragraph on page 1 one contrasts the broad class of MCMC algorithms with particular instances of MCMC type algorithms: MH algorithm, Langevin dynamics based methods, Hamiltonian Monte Carlo, etc. There is no point in contrasting a class of algorithms with particular instances of algorithms. - In the first paragraph on page 2 one says that MCMC is one popular method. I would not call MCMC a method, MCMC is a class of algorithms. Furthermore, from reading the first and second paragraph on page 2 I get the impression that Langevin dynamics is not an MCMC type algorithm, when it is the case.

Relation to Prior Work: The relation to prior work is in my opinion clearly discussed.

Reproducibility: Yes

Additional Feedback: