NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 186 Block Coordinate Regularization by Denoising

### Reviewer 1

A recent trend in large scale optimization, specially in the machine learning community, was to replace full gradient based algorithm by its coordinate descent counterpart. The idea being to reduce the computational cost of each iteration while enjoying similar rate of convergence. Often, the solution of maximum a posteriori (estimated with a proximal algorithm) is hard to estimate exactly when the prior is not directly available. In that case, the "proximal iteration" is replaced by "Denoised iteration" where the proximal operator of the prior is replaced by another adequate denoising operator. Such algorithm is then based on full vector update just as vanilla (proximal) gradient descent. Then, this paper follows the recent trend and make the same coordinate descent moves. Using (same) classical assumptions for establishing rate of convergence of smooth + non-smooth optimization problem, this time replacing non expansiveness of the prox by non expansiveness of the denoiser, and using same proof techniques, the authors obtain similar rates of convergence O(1/t). At the difference that, the fixed point is not explicitly known as a minimizer of some objective function. This seems to be not too surprising and immediately follows from the current results in smooth + nonsmooth convex optimization. Again, just replace the proximal operator by any other non-expansive operator and all the results remains the same (up to explicit characterization of the fixed point) When D is a proximal operator of h, the iterations in Eq. 3 correspond to the two steps 1- a (proximal) smoothing of h 2- a gradient descent on the smooth function {g + smoothed(h)}. With this view, it is natural that line 161: "when the denoiser is a proximal operator of some convex function h, BC-RED is not directly solving (1), but rather its approximation". Indeed, it solve the smoothed version of (1). These strategies are similar to Nesterov smoothing see (Beck and Teboulle, 2012: Smoothing and First Order Methods: A Unified Framework). More precisely, it seems to correspond to the "partial smoothing" method. Small questions: - Usually, proximal iteration on a smooth + nonsmooth strongly convex, enjoys geometric rate of convergence. using a general denoiser, what would be a natural assumption to get the geometric rates? - Usually, convergence analysis of bloc coordinate descent require separability of the regularizer. (If necessary) Can the authors explicitly clarify the equivalent property for a generic denoiser? Small typos: - In proof of Theorem 1, in the second and third equation after line 322, there is some missing ^2. - In the proof of Proposition 8, the function phi can be more than 1-strongly convex eg when h is already alpha-strongly convex.