Paper ID: 249
Title: Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Reviews

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper studies a mini-batch method for stochastic dual coordinate ascent. The idea is simple: at each iteration, randomly pick m samples and perform a joint update on the corresponding dual coordinates. The authors prove that the convergence rate of the mini-batch method interpolates between those of SDCA and AGD -- in certain circumstances it can be faster than both.
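
To make the basic update concrete, here is a minimal sketch of the plain (unaccelerated) mini-batch dual coordinate ascent idea, instantiated for ridge regression; this is only an illustration of the mini-batch primitive, not the paper's ASDCA, which applies an additional acceleration step on top of such updates, and the function and parameter names below are illustrative choices.

import numpy as np

def minibatch_dca_ridge(X, y, lam, m, n_iters, seed=0):
    # Plain (unaccelerated) mini-batch dual coordinate ascent for
    #   min_w (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2 + 0.5 * lam * ||w||^2.
    # Maintains dual variables alpha and the primal iterate w = X^T alpha / (lam * n).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = (X ** 2).sum(axis=1)  # ||x_i||^2, needed by the closed-form coordinate step
    for _ in range(n_iters):
        B = rng.choice(n, size=m, replace=False)  # random mini-batch of m samples
        # Closed-form dual coordinate maximization for the squared loss; all m steps
        # are computed against the current w and then applied jointly.
        delta = (y[B] - X[B] @ w - alpha[B]) / (1.0 + sq_norms[B] / (lam * n))
        alpha[B] += delta
        w += X[B].T @ delta / (lam * n)
    return w

With m = 1 this reduces to the standard SDCA step; how the rate behaves as m grows is exactly what the paper analyzes.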

I am a little surprised that in the case of \gamma \lambda n = O(1), the number of examples processed by ASDCA is n\sqrt{m}, which means that under full parallelization, m machines give a speedup of only \sqrt{m}. Since mini-batch SGD gives the optimal speedup of m, I am not sure whether the bound in this paper is tight. Is it possible for ASDCA to achieve the optimal rate? The experiments seem to suggest that ASDCA can do better than \sqrt{m}.
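
To spell out the arithmetic behind this observation (assuming the standard \tilde{O}(n + 1/(\lambda\gamma)) bound on the number of examples processed by serial SDCA, and taking \gamma\lambda n = \Theta(1)):

serial SDCA: \tilde{O}(n + 1/(\lambda\gamma)) = \tilde{O}(n) examples, hence \tilde{O}(n) sequential updates;
ASDCA: \tilde{O}(n\sqrt{m}) examples, processed m at a time across m machines, hence \tilde{O}(n\sqrt{m}/m) = \tilde{O}(n/\sqrt{m}) parallel rounds;
speedup: n / (n/\sqrt{m}) = \sqrt{m}, compared with the factor-m speedup achievable by mini-batch SGD.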

It would be better to compare the actual runtimes of the three algorithms in the empirical study. The authors discuss the cost of communication and of opening channels to highlight the advantage of ASDCA; however, these costs are not measured in the experiments.

In general, this is an interesting paper with a nice theoretical result. A more careful discussion of the tightness of the bound and a more detailed empirical study would make it better.
Q2: Please summarize your review in 1-2 sentences
In general, this is an interesting paper with a nice theoretical result. A more careful discussion of the tightness of the bound and a more detailed empirical study would make it better.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present an accelerated mini-batch version of SDCA (ASDCA) and analyze its convergence rate. The theoretical results show that ASDCA interpolates between SDCA and AGD. The authors also discuss parallel implementation, and we observe that with an appropriate batch size, ASDCA may outperform both SDCA and AGD. Some experiments are provided to support the theoretical analysis.

The paper is well written and easy to follow. The proposed algorithm is a nontrivial extension of SDCA. The algorithm is important in practice, since it can utilize parallel computing to reduce the running time. The paper is potentially impactful for the optimization and learning communities.

Some suggestions:
1. Besides the idea of mini-batches, the authors also employ Nesterov's acceleration method in the proposed algorithm. It is not clear to me why this is necessary. What would happen if we applied naive (unaccelerated) mini-batch SDCA to the optimization problem considered in this study?
2. As in Shalev-Shwartz and Zhang [2013], the linear convergence rate only holds in expectation. Is it possible to derive a high-probability bound?

Typos:
1. Line 52, \gamma --> \frac{1}{\gamma}
2. Line 314, 10-2 --> 10^{-2}
Q2: Please summarize your review in 1-2 sentences
The proposed ASDCA is a nontrivial extension of the existing work, and leads to a speedup in parallel computing.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends stochastic dual coordinate ascent (SDCA) to the mini-batch setting, proposing the accelerated mini-batch SDCA (ASDCA) algorithm. Its rate of convergence interpolates between those of SDCA and accelerated gradient descent (AGD). In addition, a parallel implementation is proposed in which ASDCA again interpolates between SDCA and AGD. This interpolation is also demonstrated in the experiments.

Although the rates of convergence for ASDCA are interesting, a careful look at Table 2 and Figure 1 reveals that ASDCA is nowhere superior to SDCA. I wonder why it is useful to have such an algorithm. In addition, Section 3 shows much promise for ASDCA on parallel architectures; however, no experiments were done in this setting. Therefore, more experimental studies (especially on parallel architectures) are necessary before we can confirm the empirical usefulness/superiority of ASDCA (as indicated by the word ‘Accelerated’ in the title).

In response to the authors' rebuttal: I stand by my position that the paper should show empirical evidence of the advantage of the new algorithm. I think this is an important issue, because in practice the speedup offered by parallelization can be highly tricky to realize. The theoretical part of the paper is certainly good, so I won't mind if the paper is accepted.
Q2: Please summarize your review in 1-2 sentences
This paper proves the rate of convergence of a stochastic dual coordinate ascent (SDCA) algorithm in the mini-batch setting. However, no theoretical or empirical advantage over plain SDCA is shown for this algorithm. The discussion of the parallel implementation also lacks an experimental study.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
To reviewers 2 and 3:
===============

Thanks for the positive feedback.
Regarding optimality: This is a good point, which we’re still trying to figure out.
Regarding acceleration: While we don't have a rigorous explanation for this, it seems that some form of acceleration is necessary for improving the rate using mini-batches. This is also the approach taken in previous mini-batch algorithms for SGD (such as [1,2]).
Regarding expectation vs. high probability: It is possible to obtain high probability bounds by a rather simple amplification argument.
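
Concretely, one standard version of such an argument (a sketch only, not the formal statement): since the iteration bound T(\epsilon) depends on the target accuracy only through a \log(1/\epsilon) factor, one can run the algorithm until the expected duality gap is at most \epsilon\delta; Markov's inequality then gives Pr[gap > \epsilon] \le \delta, at the price of replacing \log(1/\epsilon) by \log(1/(\epsilon\delta)) in the bound.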

To reviewer 5:
===========

Our theoretical and experimental results show that in many cases, by taking a mini-batch of size larger than 1 (but not too large), the total number of processed examples remains roughly the same. As discussed in Section 3, this immediately implies a potentially significant improvement when using a parallel machine (such as a cluster or a GPU). This is the type of theory that exists in the literature on mini-batching --- for SGD, mini-batching also showed no improvement over plain SGD in the serial computing setting (see, e.g., [1,2]), yet these papers are still important in the parallel setting. In this sense, your statement that “no theoretical advantage is shown” is not quite accurate.
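
To make the implication explicit (under the idealized assumption that m workers process a mini-batch of m examples in the time a single machine processes one example): if both serial SDCA and ASDCA with batch size m need roughly N processed examples to reach a given accuracy, the serial wall-clock time scales like N sequential updates, while the parallel wall-clock time scales like N/m rounds --- an up-to-m-fold reduction, before accounting for communication costs.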

Regarding the experiments:
Performing experiments on a parallel machine involves many more implementation details, which are beyond the scope of a theoretically oriented NIPS submission. Our contribution is mainly theoretical, and the experiments are meant to validate our theoretical findings.
As a side note, most previous papers on this topic (many of which appeared at NIPS) follow exactly the experimental evaluation methodology that we used (e.g., [1,3,4]).

[1] Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. NIPS, 2011.

[2] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. JMLR, 2012.

[3] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML, 2013.

[4] Alekh Agarwal and John Duchi. Distributed delayed stochastic optimization. CDC, 2012.