
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose to approximate the gradients for risk
minimization in large-scale settings using incomplete U-statistics, i.e., by sampling tuples of data points with replacement. Previous work estimated the gradients by sampling data points without replacement (complete U-statistics based on subsamples), which is less efficient than the proposed incomplete U-statistics. The authors first provide non-asymptotic rate bounds and asymptotic convergence rates for the SGD procedure applied to the empirical minimization of a U-statistic. A novel generalization bound is provided, from which we
can see that by using an incomplete U-statistic, we can achieve better performance for the same computational cost compared to a complete U-statistic. Finally, numerical experiments were conducted, and the results show the effectiveness of the proposed
method.
The quality of this paper is good. The authors aim to achieve a good trade-off between statistical accuracy and computational cost for the well-known Empirical Risk Minimization problem in large-scale settings. The proposed method is easy to reproduce, and the
datasets in the experiments are publicly available as well. One point that needs to be made is that all the bounds are based on
the assumption of a step size \gamma_t = \gamma_1 / t^{\beta}. What if the step size is not of this form? Besides, the step size in
the first experiment has the form \gamma_t = \frac{\gamma_1}{1+\gamma_1 \lambda t}, which is different. The paper should address this issue.
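To illustrate the relation between the two schedules (a toy sketch with made-up constants gamma1, lam, beta, not values from the paper): for large t the experimental schedule behaves like 1/(\lambda t), i.e., the \beta = 1 case of the assumed power form up to a constant.

```python
# Toy comparison of the two step-size schedules discussed above.
# gamma1, lam, beta are illustrative constants, not values from the paper.
gamma1, lam, beta = 1.0, 0.1, 1.0

def step_power(t):
    # Schedule assumed in the bounds: gamma_t = gamma_1 / t**beta
    return gamma1 / t ** beta

def step_practical(t):
    # Schedule used in the first experiment:
    # gamma_t = gamma_1 / (1 + gamma_1 * lam * t)
    return gamma1 / (1.0 + gamma1 * lam * t)

# For large t the practical schedule behaves like 1/(lam * t),
# i.e. the beta = 1 power schedule up to the constant 1/lam.
for t in (10 ** 3, 10 ** 5, 10 ** 7):
    print(t, step_practical(t) * lam * t)  # ratio tends to 1
```

This suggests the two forms should lead to the same asymptotic conclusions, but the paper should state this explicitly.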
Overall, the paper is well written and easy to follow. The main contributions and technical contents are clearly stated. There are
two minor points on which the paper could be improved: a) The statement "Increasing slightly the variance of the gradient estimators is the price to pay for reducing SGD cost complexity" is a little confusing: it is shown that (6) has smaller variance
than (5), yet the computational cost of (6) is also lower than that of (5). b) In the experimental part, it would be better to elaborate on how the batch size and the initial \gamma affect the performance, i.e., by keeping one parameter fixed and observing how the performance
changes with the other.
The originality of this paper is good. To handle large-scale minimization of the empirical risk, incomplete U-statistics are proposed, which achieve a trade-off between statistical accuracy and computational cost. The authors propose a Monte Carlo method to estimate the empirical gradient, which yields better performance than using a complete U-statistic based on subsamples. Technical proofs are provided to show that incomplete U-statistics can achieve better performance while reducing the computational cost. This paper is of good significance to the community. As real-world data sets are very large, we need to figure out how to solve the ERM problem efficiently. The method proposed in this paper is easy to implement and has the potential to be adopted in both academia and industry.
In summary, the paper elegantly proposes to use incomplete U-statistics to deal with the ERM problem at large scale. A novel generalization bound shows that incomplete U-statistics can achieve better performance than complete U-statistics.
Q2: Please summarize your review in 1-2 sentences
In this paper, the authors propose to obtain gradient estimates from incomplete U-statistics instead of complete U-statistics for Empirical Risk Minimization in large-scale settings. A novel generalization bound is derived, and results from numerical experiments show the effectiveness of the proposed method.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
--- Review after rebuttal --- I thank the authors for their responses. I notched up my recommendation a bit, but I still cannot offer a clear accept for this interesting paper. My quibble with Prop. 1 was not fully resolved. The authors said "the variances \sigma are w.r.t. the gradient of the *true* loss L(\theta)". This is a bit unclear, but I take it to mean that
\theta should be \theta^* in Prop. 1?
That would be an important typo that was not acknowledged. Furthermore, in a later search through the citation the authors provide and through other works, I found an inconsistency in the results, which basically boils down to the \sum n'_k being a product
in Prop. 1. Unfortunately, I cannot disclose the other works in order to maintain author anonymity. Together with the way the authors originally presented this result, this makes me skeptical
about Prop. 1. My comments on excess mathematical clutter were generally acknowledged by the authors, but it is unclear how much can be done in such limited time.
Still, I decided to increase my rating for this paper, so if the other reviewers are happy with the
rebuttal the paper could get through.

In this paper the authors investigate ways to perform stochastic approximation using U-statistics in a manner that is both computationally and statistically efficient. The quite surprising result is that using "incomplete U-statistics" is significantly
more efficient than using "complete U-statistics", in both respects. This is surprising because complete U-statistics use averages over all
"data tuples", albeit from a reduced sample.
I am skeptical about this result, which originates in Proposition 1. Specifically, the variance for the complete U-statistic is proportional to 1/\sum n_k', whereas the variance for the incomplete U-statistic involves 1/\prod \binom{\cdot}{\cdot} (see the equations in Prop. 1). This difference is huge, and the mathematical formulation is counterintuitive. If we set n_k' = n_k,
shouldn't the complete U-statistic be as efficient as theory suggests? It seems that efficiency should depend on
terms like n_k'/n_k, which would not follow from Prop. 1. Unfortunately there is no proof of Prop. 1, only a citation that is not very helpful.
I strongly suggest that the authors provide a simple proof for that fact because it is also
important for Theorem 2, and for the paper in general.
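A quick empirical sanity check (no substitute for a proof, and with a toy kernel and constants of my own choosing, not the authors') is to compare the two estimators at an equal budget of kernel evaluations, conditionally on one fixed data set:

```python
import random
import statistics

random.seed(0)
n, n_sub = 200, 10                     # toy full-sample and subsample sizes
B = n_sub * (n_sub - 1) // 2           # equal budget: number of kernel evaluations

def h(x, y):
    # Toy symmetric pairwise kernel (K=1, d=2), standing in for a pairwise loss.
    return (x - y) ** 2

def complete_on_subsample(data):
    # Complete U-statistic over a without-replacement subsample of size n_sub.
    sub = random.sample(data, n_sub)
    vals = [h(sub[i], sub[j]) for i in range(n_sub) for j in range(i + 1, n_sub)]
    return sum(vals) / len(vals)

def incomplete_estimate(data):
    # Incomplete U-statistic: B pairs drawn with replacement from the full sample.
    total = 0.0
    for _ in range(B):
        i, j = random.sample(range(len(data)), 2)
        total += h(data[i], data[j])
    return total / B

data = [random.gauss(0.0, 1.0) for _ in range(n)]
comp = [complete_on_subsample(data) for _ in range(2000)]
inc = [incomplete_estimate(data) for _ in range(2000)]
print(statistics.variance(comp), statistics.variance(inc))
```

In this toy setup the incomplete estimator does exhibit much lower variance at equal budget, so a short proof of the general statement seems within reach and would strengthen the paper considerably.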
The other major problem is mathematical clutter. For example, Def. 1 is a quarter-page long and overwhelming. A more intuitive explanation of U-statistics is possible, since this is a well-developed subject in statistics. I would recommend building up from primitives, e.g., introducing U-statistics for K=1 and then
generalizing. I also think that Proposition 2 and Theorem 1 are standard theory
in stochastic approximation and should be moved to the Appendix. This section
confused me the most. How is normality used?
Is the purpose of this section to show that the conditional variance matters
for the convergence of SGD? If so, the section is unnecessary because it replicates standard
results. If not, the authors should be more explicit and precise about what this section
covers and what the new results are.
The experimental results are limited and do not seem to support
the claim "the incomplete U-estimator significantly outperforms its complete variant". For example, the results in Figure 1 show that the difference between the two methods
is not statistically significant, unless the plotted standard deviations do not reflect
standard errors; in that case the authors should make this explicit.
Finally, there are several parts of the paper where I was confused or struggled through excess
wording, e.g.,
1) In the abstract: "In this paper, we argue that in the large-scale setting
the inductive principle of SA for risk minimization should be implemented in a very specific manner. ...". This sentence spans 3 lines and conveys no information.
2) In Lines #48-49 a description of SGD is given that is not typical. In plain SGD there is no averaging.
3) Overall, the use of "tuples" is very confusing (starting from the abstract). "Tuple" can mean many different things. Because this term is heavily used
and only becomes clear on Page 3, I would suggest a very early definition.
4) Are the indexes in Line #106 of Def. 1 non-overlapping?
5) Line #173 says that the complete U-statistic is "far from optimal". I think this is
evident by definition?
6) Remark 1 is confusing and could be omitted, or moved to a discussion section or
the conclusion.
7) The "conditional performance analysis" is confusing as well. Line #207 says that the analysis is "conditional on the observed data samples", but from what follows it seems that we simply condition on the number of samples, not the data samples themselves. Is that correct?
8) Line #215 says "We point out that the present analysis can be extended to more general
contexts", and then the authors cite the work of Moulines and Bach on non-asymptotic
analysis of stochastic approximations. This is a very confusing statement.
What extension do the authors mean?
Minor: * In Line #184, D_B should use "\mathcal". * The constant C in Line #295 is already used earlier.
Q2: Please summarize your review in 1-2 sentences
This is good work in a very interesting area of statistical learning. However, there are important flaws in the submitted version of the paper,
particularly in the accuracy of the theoretical results and the significance of the experimental results. Furthermore, several parts of the paper appear to be unnecessary.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their very helpful
comments. Some reviewers have suggested interesting modifications in the
presentation of the paper that we will take into account to improve the
quality and clarity of the manuscript. Below, we answer specific questions
and address some technical points raised by the reviewers, in particular a
clarification of Proposition 1.
*** Assigned_Reviewer_1
***
Although not rooted in theory, step sizes of the form \gamma_t
= \frac{\gamma_1}{1+\gamma_1 \lambda t} are often used in practice in SGD
for better convergence and to avoid very large steps in the beginning of
the optimization. Note that the step size asymptotically decreases like
\frac{1}{\lambda t}. To avoid confusion we will instead show the results
for \gamma_t = \frac{\gamma_1}{\lambda t}, for which the same conclusions
hold.
*** Assigned_Reviewer_2 ***
We feel that the criticism
of the reviewer is a bit unfair. We do not mean to compare ourselves with
specific state-of-the-art methods in clustering, ranking, metric learning,
etc. Instead, we take a more general perspective by observing that the
associated optimization problems are all in the form of ERM of
U-statistics. When the training set is large, many methods resort to
SGD-type algorithms to numerically solve these problems in practice.
However, to the best of our knowledge, the impact of the tuple sampling
strategy on the performance of SGD has never been carefully investigated.
Our paper provides a novel analysis and argues that one strategy should
clearly be preferred over the other, providing both theoretical and
experimental evidence of its superior performance. We believe that our
study can help practitioners implement efficient SGD algorithms to solve
this general family of problems.
*** Assigned_Reviewer_3
***
We first want to thank the reviewer for his/her several useful
comments to improve the clarity of the paper.
- It seems that the reviewer slightly misunderstood the nature of the statement in Proposition 1: the variances \sigma are w.r.t. the gradient of the *true* loss L(\theta), and not that of the empirical loss \hat{L}_n(\theta), unlike the empirical variances \sigma_n of Section 3.2. We will make this clearer and provide a basic proof. While the huge difference in variance shown by Proposition 1 might sound counterintuitive at first, it is due to the fact that the incomplete U-statistic estimate involves many more distinct observations. For instance, take the case where K=2 and d=1, and we are given n i.i.d. observations. For a batch size B = n'(n'-1)/2, the number of distinct observations involved in the complete U-statistic is exactly n', while for the incomplete U-statistic it is always greater than n' (more precisely, if n' is no bigger than \sqrt{n}, then the expected number of distinct observations involved is of the order of n'(n'-1)).
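This count can be checked with a quick simulation (the sizes below are toy values chosen for illustration, not the settings of the paper's experiments):

```python
import random

random.seed(1)
n, n_sub = 10_000, 50                  # toy sizes with n_sub <= sqrt(n)
B = n_sub * (n_sub - 1) // 2           # same budget of pair evaluations as the complete variant

# The complete U-statistic built on a subsample of size n_sub touches exactly
# n_sub distinct observations.  Count how many the incomplete one touches.
seen = set()
for _ in range(B):
    i, j = random.sample(range(n), 2)  # one pair per draw, with replacement across draws
    seen.update((i, j))

print(len(seen))                       # on the order of n_sub * (n_sub - 1), far above n_sub
```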
- Note that Figure 1 shows averages over 50 random runs
of SGD. While the performance can indeed vary quite a lot from one run to
the other (which is common in SGD algorithms), on average (solid lines)
the incomplete U-statistic estimate significantly outperforms the complete
estimate. We also stress that the case K=2, d=1 (as in the AUC
optimization results of Figure 1) is the "least favorable case" for the
incomplete U-statistic (as seen in Proposition 1). The results for K=1,
d=3 given in Figure 2 show a much larger gap, which is consistent with our
theoretical results.
- In Section 3.2, we work conditionally upon
the data sample in the sense that we only deal with the empirical risk,
which is defined w.r.t. the data sample.
- Finally, the kind of
simple extension we have in mind is to address the smooth but not strongly
convex case, for instance by adapting the analysis in Bach and Moulines (NIPS
2011).
