
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends the stochastic optimization algorithm SVRG proposed in recent years. The modifications mainly include: a convergence analysis of SVRG with a corrupted full gradient; mixing SGD and SVRG iterations; a mini-batching strategy; the use of support vectors; etc. For each modification, the authors give clear proofs and achieve linear convergence under smoothness and strong-convexity assumptions. However, the paper's novelty is limited. The improvement in the convergence rate is not substantial, and the proof outline is very similar to that of the original SVRG. Key problems, such as support for non-strongly convex losses, remain unsolved. Besides, there are some small mistakes in the notation in the appendix.
Q2: Please summarize your review in 1-2 sentences
This paper makes some practical improvements to the stochastic optimization algorithm SVRG proposed in recent years. These improvements are shown to be effective both in theory and in the experimental results. However, the novelty and the performance improvements are limited. The proof idea is very similar to that of the original SVRG, and key problems such as support for non-strongly convex losses remain unsolved.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Comments:
Tricks like non-uniform sampling followed by an importance-sampling correction have been around for much longer; see e.g. [1], which claims a speedup of 3x. Those older approaches chose the probability based on the error instead of L_i. Could you discuss/compare such an approach, at least in practice? I'd suspect this may be useful in practice where the L_i may not be known.
Channeling Leon Bottou: please discuss the limitations, and the cases where your method may *not* be applicable.
I would really appreciate an open-source code release accompanying the paper, which would increase its impact.
The experimental section and figures are clearly the weakest part of the paper; I'd recommend a serious second pass. For example, all plots should have error bars, state over how many runs they were averaged, and be labeled more clearly (SVRG > full batch in the top plots); the test-error plot does not have the resolution to be readable (maybe log-scale is better even if it won't go to zero). The same holds for the figures in the appendix. I do not understand why there is no difference at all between uniform sampling and Lipschitz sampling (in experiment 3). Of course, a more challenging domain than logistic regression would be appreciated, but that can rightfully be shown in another paper.
[1] Geoff Hinton, "To recognize objects, first learn to generate images", 2007.
Typos:
L19: a variant; L53: increases over; L77: expectation missing; L96, L132: expectation symbol misformatted; L164: B^s; L289: can be used; L316: superscript instead of subscript; L325: sometimes; L416: sometimes -> often?
Q2: Please summarize your review in 1-2 sentences
This is a solid and important paper about variance-reduced gradient descent. It has many good, new ideas, explains them clearly, and determines convergence rates for all its proposed variants. The paper's weakness is the experimental section.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors should be explicit about what this new method adds to existing stochastic optimization (e.g., Hu, Kwok and Pan, 2009) and related mini-batch optimization (e.g., Konecny et al., 2013 and 2014), or Smola's work (e.g., M. Zinkevich, M. Weimer, L. Li, A. J. Smola, 2010), including using such methods as baselines in the experiments.
Merely comparing the method to variants of itself is insufficient.
This is a crowded field.
It behooves the authors to explicitly delineate their novel contributions.
The paper is relatively well written, and the comparative bounds table (sec 8) is useful.
Q2: Please summarize your review in 1-2 sentences
The paper discusses the stochastic variance-reduced gradient method for speeding up convergence when optimizing a sum of strongly convex, Lipschitz-continuous functions. The method trades off inexact computation for speed, and it is not clear whether it represents a sufficient improvement over the state of the art in stochastic optimization to merit publication at NIPS.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes numerous methods to speed up SVRG. The methods are a growing batch size, mixed SGD/SVRG iterations, and a new mini-batching scheme.
Specific comments: 1. It would be nice to have an empirical comparison against a well-tuned SGD implementation. 2. I'm not clear on how the results for "time to reach E_est + E_opt" for FG and SVRG were reached. E_est should only depend on n. For alpha = 1, it should be O(1/n). It is unclear how log^2(1/epsilon) and d^2 arise.
Q2: Please summarize your review in 1-2 sentences
This is a nice paper that proposes numerous modifications to SVRG.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces several approaches to increase the convergence speed of a well-known stochastic gradient method called the stochastic variance-reduced gradient (SVRG) method.
The paper is well-written and it is very easy to follow its arguments. The authors start with a short introduction to the SVRG method and subsequently explain three approaches for improving its convergence speed.
The paper starts with a key proposition showing that SVRG does not require a very accurate approximation of the total gradient of the objective function. The authors use this proposition to derive a batching SVRG algorithm with the same convergence rate as the original SVRG. Then, the authors propose a mixed stochastic gradient/SVRG approach and give a convergence proof for this scheme. As a different speedup approach, the authors propose a technique for the Huberized hinge-loss support vector machine.
In addition to the speedup strategies and their convergence analysis, the authors include a similar analysis for regularized SVRG algorithms and provide propositions on the convergence of mini-batching strategies. Through extensive simulations, the authors show the improvements obtained by their methods.
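For readers less familiar with SVRG, the core variance-reduced update that all of these variants build on can be sketched as follows. This is a minimal illustration of the standard SVRG method of Johnson and Zhang, not the paper's exact batching or mixed variants, and all names and parameter choices here are illustrative:

```python
import numpy as np

def svrg(grad_i, n, x0, step, n_outer, n_inner, seed=0):
    """Minimize (1/n) * sum_i f_i(x), given per-example gradients grad_i(i, x)."""
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float)
    for _ in range(n_outer):
        # Full gradient at the snapshot point (the quantity the paper
        # argues may be approximated on a growing batch).
        mu = np.mean([grad_i(i, x_tilde) for i in range(n)], axis=0)
        x = x_tilde.copy()
        for _ in range(n_inner):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient: unbiased, and its
            # variance vanishes as x and x_tilde approach the optimum.
            g = grad_i(i, x) - grad_i(i, x_tilde) + mu
            x = x - step * g
        x_tilde = x
    return x_tilde
```

On a toy least-squares problem this reaches the exact minimizer at a linear rate with a constant step size, which is the key property distinguishing SVRG from plain SGD.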
Except for the minor comments stated below, I do not have any major comments for further improving the manuscript.
Minor Comments:
* On the right-hand side of the inequality in line 77, there should be an expectation.
* The question marks in the first line and in the middle of page 10 should be corrected.
Q2: Please summarize your review in 1-2 sentences
The topic of the paper is of interest to the NIPS community, and the analysis and simulation results are carefully done.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for taking the time to read and comment on our work. We will use the comments to improve the paper. Below we comment on some specific issues that were raised.
R1:
These are good points regarding the experiments; we will update the plots following these suggestions. Note that uniform and Lipschitz sampling are the same in some plots because the rows of the data are normalized (Lipschitz+ can still give improvements here because it depends on the potentially smaller Lipschitz constant of the deterministic part).
Choosing the probabilities based on the error is an interesting suggestion. Proving a convergence rate in this setting may be hard, but for many models the errors are related to the *local* L_i values, which give much faster rates than the global L_i values.
It was hard to fit in a discussion of limitations/extensions given the page limit, but we'll try to squeeze it in. We are planning to release an open-source implementation of SVRG and our various extensions (including using local L_i values).
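As a concrete sketch of what Lipschitz sampling looks like in practice, the following is a hedged illustration with made-up names, not the paper's implementation: draw example i with probability proportional to L_i, then reweight so the stochastic gradient remains unbiased.

```python
import numpy as np

def lipschitz_sampled_grad(grad_i, L, x, rng):
    """One unbiased stochastic gradient, sampling i proportionally to L_i."""
    n = len(L)
    p = L / L.sum()          # p_i proportional to the Lipschitz constant L_i
    i = rng.choice(n, p=p)
    # Reweighting by 1/(n * p_i) keeps E[g] = (1/n) * sum_i grad_i(i, x).
    return grad_i(i, x) / (n * p[i])
```

The same reweighting trick applies if the probabilities are chosen by error instead of by L_i, which is why the error-based variant suggested by the reviewer also yields an unbiased estimator.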
R2:
Unfortunately, we accidentally removed the last few sentences of Appendix H that showed how to obtain the new "time to reach E_est + E_opt" result. The subtlety in that result is that, if the condition number is large, we can choose the number of examples 'n' to be smaller than the total available in order to jointly drive (E_est + E_opt) to zero at a faster rate.
There are a few reasons we did not compare against SGD: (i) beginning with Le Roux et al., there are already many empirical comparisons showing an advantage of linearly-convergent methods over tuned SGD; (ii) our main contribution is improving on SVRG (as the only memory-free linearly-convergent method); (iii) we are running the methods as parameter-free 'black boxes', and SGD usually requires tuning for good performance.
R6:
Please note that our introduction explicitly states why we are focusing on SVRG: it is the only linearly-convergent stochastic method that does not have a memory. There is now ample evidence that linearly-convergent methods often outperform classic stochastic gradient methods that only have sublinear convergence rates (the works of Hu et al. and Zinkevich et al. that are referenced in your review have sublinear rates). But the lack of a memory requirement clearly makes SVRG advantageous over other linearly-convergent methods in many applications. These two reasons make SVRG stand out in the crowded field and (as discussed in the paper) are why we focus on SVRG.
We ultimately removed a more thorough discussion contrasting our work with Konecny et al. in order to save space, but note that we still discuss this work explicitly in Section 7. In particular, their method uses mini-batches in the inner iteration to reduce the variance but still uses full passes through the data to compute mu^s. Our method is complementary to the one of Konecny et al. (you can use both modifications), and further, in Section 7 we give tighter bounds on their method and propose a new method with better properties in many practical scenarios.
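To make the distinction concrete, an inner mini-batch variance-reduced step in the spirit of the Konecny et al. variant discussed above might look as follows. This is an illustrative sketch with invented names, not their code; mu is still the snapshot gradient computed by a full pass:

```python
import numpy as np

def minibatch_vr_step(grad_i, x, x_tilde, mu, batch, step):
    """One inner step: average the variance-reduced gradients over a mini-batch."""
    correction = np.mean(
        [grad_i(i, x) - grad_i(i, x_tilde) for i in batch], axis=0)
    g = correction + mu   # unbiased estimate of the full gradient at x
    return x - step * g
```

At the snapshot (x equal to x_tilde) the correction term vanishes and the step reduces to a full-gradient step along mu; larger mini-batches reduce the variance of the correction but leave the cost of computing mu untouched, which is the part our growing-batch strategy addresses.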
R7:
It is true that some of our proofs follow the same structure as the SVRG argument. However, we have extended these arguments in various novel and significant ways (allowing error, allowing regular SG steps, showing that the rate is improved via the regularized update, better mini-batch strategies, etc.). Further, some of the results (like the generalization error result) are based on completely different methods.
Regarding the performance improvements, please see the bottom row of Figure 1 in the supplementary material. The difference between the 'SVRG' and 'Grow' methods is huge there, and this row corresponds to the largest datasets (the caption is unfortunately wrong in this particular figure). We could have done "dataset selection" and only shown these largest datasets to make the difference seem more significant, but instead chose to plot performance on all datasets to give the reader the full picture. Further, the differences become even larger when we combine multiple ideas from this paper.
While the non-strongly convex case is interesting, in machine learning we often don't worry about this case because we can add an L2-regularizer to make the problem strongly convex. In contrast, in this work we have analyzed many other important issues (e.g., improving performance on huge datasets, justifying the regularized update used in everyone's implementation, analyzing the generalization error, exploiting GPUs, etc.). Indeed, the generalization error in particular is one of the most important issues in the field of machine learning.
Nevertheless, as shown in http://arxiv.org/pdf/1312.1666v2.pdf, if we simply add an L2-regularizer with lambda = O(L/epsilon), then all of our results also give rates in the non-strongly convex case.
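The reduction mentioned here can be written out explicitly. The following is a generic sketch of the standard regularization argument, with the precise choice of lambda deferred to the cited note:

```latex
% Regularization reduction (generic sketch; the cited note gives the precise \lambda).
\begin{align*}
F_\lambda(x) &= F(x) + \tfrac{\lambda}{2}\|x\|^2
  && \text{($\lambda$-strongly convex, so the linear rates apply)} \\
F(\hat{x}) - F(x^*)
  &\le \underbrace{F_\lambda(\hat{x}) - \min_x F_\lambda(x)}_{\le\,\epsilon/2
  \text{ by the linear rate}} + \tfrac{\lambda}{2}\|x^*\|^2
  && \text{so a small enough $\lambda$ gives $\epsilon$-accuracy on } F.
\end{align*}
```

Here $x^*$ is any minimizer of $F$ and $\hat{x}$ is the approximate minimizer of $F_\lambda$ returned by the linearly-convergent method; choosing $\lambda$ so that $\tfrac{\lambda}{2}\|x^*\|^2 \le \epsilon/2$ completes the argument.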
