NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4653
Title:A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers

Reviewer 1

Overall the paper is well written and the proposed algorithm is well explained. Since it is variant of the existing ADMM, the authors are encouraged to spend more texts to explain the key difference, for example, what kind of advantages we can get by using this new ADMM variant. On the other hand, I find the experimental part is pretty weak. Using real and large scale data may further improve the paper. Line 36, it seems the existing ADMM can be extended to the case N \ge 3, based on certain strongly convexity condition. Is the reduction of the communication round due to the two layer-structure in Algorithm 1? Besides this part, what could be the essential advantage of this proposed ADMM method? On the other hand, the authors may want to show how much computation time or resources we can get by this communication round reduction. This paper doesn’t talk much about the fault tolerance. Does the model convergence rely on the success of the communication? If one communication fails, what will the results look like? This proposed method is still limited to the decomposable property in the variable. If it is not the case, for example, apply a L_{21} regularization term in the objective, will the Algorithm still work? In Section 4, these experimental results are all based on synthetic and small data sets. The effectiveness will be more convincing if real and large data sets are used for evaluation. Some minor comments: (1) On Line 28-29, the presentation can be be simplified as “Ax = b, x_i == x_j \forall i, j”. Apparently x_i == x_j if i == j.

Reviewer 2

This paper proposes a new communication efficient multi-block ADMM for linearly constrained stochastic optimization. The proposed method is as fast as (or faster than) existing stochastic ADMMs but the associated communication overhead is only the square root of that required by existing ADMMs. Although the paper is theoretically sound, there are still some questions need to be discussed in this paper: 1. This paper assumes that the strong duality hold in Assumption 1, but in many applications this assumption does not hold. Therefore, does the algorithm in this paper have some limitations? 2. What’s the stochastic objective function? In addition, there are many related algorithms, for example, E. Wei and A. Ozdaglar. Distributed Alternating Direction Method of Multipliers. CDC, 2012. F. Huang, S. Chen and H. Huang. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization. ICML, 2019. T. Chang, W. Liao, M. Hong and X. Wang. Asynchronous Distributed ADMM for Large-Scale Optimization-Part II: Linear Convergence Analysis and Numerical Performance. IEEE Transactions on Signal Processing, 2016. 3. The experiments in Section 4.2 of this paper lack the detailed information about the experimental environment, and the tools and hardware facilities used in the experiment should be appropriately introduced. 4. More compared algorithms (including distributed deterministic ADMM and stochastic ADMM algorithms) should be added to reflect the superiority of the proposed algorithm, and the performance of the proposed algorithm in terms of running time should be also reported.

Reviewer 3

This paper considers a communication efficient distributed optimization algorithm based on ADMM for stochastic optimization. The main idea is to perform multiple steps (can be timevarying) of stochastic gradient updates before the agents communicate, and therefore improving the communication efficiency. The proposed algorithm is shown to converge (in objective value & constraint violation) under a general non-smooth, non-strongly convex settings with O(1/eps) communication rounds and O(1/eps^2) unbiased gradient oracle calls. Other setting such as smooth+strongly convex and non-smooth+strongly convex are also analyzed and presented. Using multiple steps is however not novel. In addition, a significant drawback for the proposed algorithm is that it is not supported for time varying communication -- which is a major set back with ADMM type distributed optimization. While in overall the paper is well written with several interesting results (e.g., convergence with non-smooth+non-strongly convex problems), there are some outstanding concerns from the reviewer: -- Communication Efficiency and Comparison to Prior Work This paper counts the number of "outer-loop" as a measure of communication efficiency, especially in terms of its scaling with the desired accuracy epsilon. However, such notion should also be compared in terms of the network topology which is missing in the current Theorems (and is perhaps hidden in the constants that depends on ||A||^2). These notions are emphasized and compared in a few recent work: Nedic et al., "Network topology and communication-computation tradeoffs in decentralized optimization", Proceedings of IEEE, 2018. Scaman et al., "Optimal Algorithms for Non-Smooth Distributed Optimization in Networks", in NeurIPS 2018. Uribe et al., "Optimal Algorithms for Distributed Optimization", arXiv:1712.00232 It is also important to point out that Scaman et al. have considered using multiple step in primal update to achieve a similar performance as the current paper, i.e., requiring O(1/eps) outer loops to achieve an eps-accurate solution. In addition to the above work that tackles deterministic optimization, the following work has also considered stochastic optimization problem like the current paper: S. Pu, A. Nedic, "Distributed Stochastic Gradient Tracking Methods", arXiv:1805.11454 Without using a multiple steps update, they also achieve a similar O(1/eps) rate for strongly convex objective functions. Comparing these methods (at least) numerically is important for improving the current paper. Overall, the literature review done in this paper seems to be limited to ADMM type algorithms, while missing a lot of advances made in the gradient tracking type algorithms. Note that the latter is also actively researched. -- Minor Comments - While it is understood that the constraints "\sum_{i=1}^N A_i x_i = b" can be re-expressed to reflect on the consensus constraints, the application of the proposed algorithms to decentralized setting may still be unclear for non-expert readers. Such formulation should be described clearly in the paper as well. (e.g., in terms of the "A_i" and "b" involved in these settings). - the authors mentioned that the negative term is (7) is useful for analysis - please expatiate in terms of its intuition as it is not clear in the main paper. ---- The reviewer has read the response and I agree that the network dependence is more minor than my suggested references, which should be emphasized better in the revision. However, my comment with time varying / unreliable communication is not addressed by the authors. This concern is relevant to the authors response R2Q1, where (4),(5) are claimed to be ignorable if communication fails - there the authors seem to assume that communication failure is an "all or nothing" situation, while in reality communication failure happens only partially in the network. Second, the theory requires setting $K=T$ where $T=O(1/epsilon)$. This can be a huge number and requires knowing $\epsilon$ in advance.