Paper ID: 946
Title: Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors claim a convergence rate of O(1/sqrt(T)), without assumptions on the number of asynchronously working processors and with any amount of "sparsity" (partial separability). Considering that the subsequent empirical section shows that the convergence depends heavily on the amount of sparsity, I would suggest the authors also study the dependence on sparsity and on the number of processors, e.g. similar to http://arxiv.org/abs/1311.1873, http://arxiv.org/abs/1406.0238.
Q2: Please summarize your review in 1-2 sentences
The study of asynchronous parallel algorithms is very challenging. The authors do some heavy lifting to obtain a middling result, but that should still be applauded.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides a theoretical analysis showing that completely asynchronous stochastic gradient procedures asymptotically achieve optimal convergence rates for convex optimization problems under nearly the same conditions required for asymptotic optimality of standard stochastic gradient procedures. Empirical results are also presented to verify the theoretical findings.

The theoretical results of this paper seem to be promising and important. Furthermore, the experimental results successfully verify the strong performance of the asynchronous parallel stochastic optimization schemes, which is consistent with the theoretical analysis.

The results could be strengthened if faster convergence rates were also provided under stricter assumptions, such as smoothness and strong convexity.
Q2: Please summarize your review in 1-2 sentences
The theoretical results of this paper seem to be promising, and the empirical results are consistent with the theoretical analysis.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents an asynchronous stochastic approximation algorithm with a distributed implementation in mind. The proposed algorithm is described at the bottom of the first page, and this description is rather unclear. I think the authors should describe exactly how the various processors access the central memory, for both writing and reading. In particular, I did not understand how the centralized counter k is updated. Is k read along with the data x in step (i)? Please clarify.

This work extends the results of [1] by removing some of the technical assumptions made to ensure convergence. The authors need to elaborate further on the differences between existing work, in particular [1], and their own. For example, the assumptions made should be discussed in more detail. Right now the related work section is sketchy (12 lines on the second page).

The role of the centralized counter should also be emphasized. Was it present in the algorithms proposed in [1]? My guess is that having such a counter does not matter at all; see for example http://arxiv.org/pdf/1505.04824.pdf and references therein.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an asynchronous stochastic approximation algorithm, and establishes its convergence rate. The description of the algorithm is not so clear to me, and its connection to existing algorithms and convergence results could also be clarified. The topic is interesting and important, and establishing convergence of such schemes under a minimal set of assumptions is valuable.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary of paper:

The paper studies a class of asynchronous methods for solving optimization problems in which the objective is an integral functional of a random variable or process integrated against a probability measure. The problem class assumed is one in which, for each outcome (or set of non-zero measure) in the sample space on which the measure is defined, the integrand is a convex function of the argument. In this setting the authors demonstrate that, asymptotically, completely asynchronous stochastic gradient procedures achieve optimal convergence rates for convex optimization problems under nearly the same conditions required for asymptotic optimality of standard stochastic gradient procedures. Based on these results, the authors then propose specialised asynchronous procedures, built on multi-core parallelization schemes for stochastic-gradient-based optimization algorithms, which behave analogously to synchronous procedures.

Quality:

The problem statement and all assumptions are carefully set out, clearly written, and explained with regard to their meaning and significance. The exposition and explanation of Theorem 1 is very well presented, as are the technical details for Theorem 2 and the required lemmas in the appendix.

The examples selected to illustrate the method are also very interesting: linear regression is used as a toy illustration, and non-linear logistic regression is then proposed. These examples are relatively simple and demonstrate that the required conditions are not overly restrictive for real problems in applied settings.

However, it would perhaps be informative to give a little more background on the extended problems based on stochastic monotone operators; this discussion is left very brief, with only a citation given, and more context would be useful here.

For Assumptions C and D on the distribution of the noise process, it would be good to have more explanation of their significance and interpretation.

The experiment performed is very carefully detailed and well designed.

Clarity:

The paper is highly technical for a NIPS audience, but it is very well written and carefully explained, so in my opinion it is still within the scope of the conference. I believe the authors have done an excellent job of explaining such detailed results so clearly and concisely in a short paper.

Originality:

The paper builds on several existing works but provides a useful set of results for practical problems, generalising the results and relaxing the assumptions of prior work.

Significance:

The authors show that their results significantly generalize existing works that have considered similar questions; as they state, previous works of a related nature often impose strong conditions on gradient sparsity, conditioning of the Hessian, or the allowable types of asynchrony. In particular, they require only minimal conditions: positive definiteness at the argument minimizing the objective function and smoothness of the gradients.

Q2: Please summarize your review in 1-2 sentences
The paper is very well written, and the motivation for the problem statement and the significance of the derived results are clearly explained. The results in Theorem 1 and Theorem 2 are clearly placed in the context of the existing works they generalize, and the examples illustrating the theoretical results on linear and logistic regression are well designed and clearly explained.

Author Feedback
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank each of the reviewers for their careful reading of the paper,
constructive feedback, and references provided. We address each reviewer's
responses in turn below, asking forgiveness for necessary brevity.

R1

Thanks for the comments and references to other work. Let us briefly discuss
the reviewer's comment that "convergence depends on sparsity" in the empirical
section. Indeed, the convergence behavior of SGD itself is different for
different amounts of sparsity--this is clear both theoretically and
experimentally, as noted by the reviewer. We show, however, that asynchrony
does not introduce any additional slowdown in convergence rate (as a function
of iteration) over a synchronous/one-thread method, so that there is
*relatively* no penalty. In Figures 1 and 3, the convergence (in number of
iterations) for each number of cores is nearly identical, which we
believe highlights that asynchrony does not hurt. And Figures 2 and 4 show
that it can, in fact, be very helpful.
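
Schematically (writing err(.) generically for the error measure analyzed in the
paper and suppressing the precise constants), the guarantee being referenced has
the form

    \mathrm{err}(\bar{x}_T) \;\le\; \frac{C}{\sqrt{T}} \;+\; o\!\Big(\frac{1}{\sqrt{T}}\Big),

where the leading constant $C$ is the same as for a single-threaded method and
asynchrony enters only through the lower-order term; this is why the
per-iteration curves for different numbers of cores coincide.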

R2

We thank the reviewer for the positive feedback.

We note in passing that our convergence guarantees are unimprovable and
optimal, even to numerical constant factors. (See the discussion after Theorem
1.) It will certainly be interesting to understand exactly what conditions are
necessary for variants and improvements of these rates of convergence.

R4

We appreciate the reviewer's enthusiasm and quite positive feedback--we
tried hard to make the paper approachable given the length constraints.

The reviewer's remark that our elaboration on monotone operator problems is
sparse is well taken--we will give more in our revision. As a short tidbit
along these lines, we note Bertsekas and Tsitsiklis's book [reference 3,
Chapter 3.5]; a few natural problems include stochastic games and saddle-point
problems, systems of equations, and (stochastic) network equilibrium problems.
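
For context (a standard formulation, not a quotation from the paper), the
stochastic monotone operator setting asks for $x^\star \in \mathcal{X}$ with

    \langle F(x^\star),\, x - x^\star \rangle \ge 0 \quad \text{for all } x \in \mathcal{X},
    \qquad F(x) = \mathbb{E}_P[F(x;\xi)],

where $F$ is monotone, i.e. $\langle F(x) - F(y),\, x - y \rangle \ge 0$. Convex
minimization is the special case $F = \nabla f$, and a convex-concave
saddle-point problem $\min_u \max_v g(u,v)$ corresponds to
$F(u,v) = (\nabla_u g(u,v),\, -\nabla_v g(u,v))$.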

R6

Thanks for the quick feedback.

R7

We thank the reviewer for the review and feedback, and hope to clarify a few
of the issues the reviewer raises. Our discussion of related work was brief
due to the constraints of the 8 page conference paper. In short, previous work
on asynchronous algorithms generally (1) makes somewhat strong assumptions on
problem structure such as gradient sparsity, variants of diagonal dominance of
the Hessian matrix, or ordering of updates (refs. [1,3,4,11,13,16]), or (2)
leaves open interesting questions about the possibility of optimal convergence
rates (e.g. ref. [3]). We attempt to address both of these issues. (The paper
[1] in particular assumes that the updates are performed in fixed
round-robin-like orders, and also does not give optimality guarantees as sharp
as ours.)

In terms of memory access, we assume that our computer is a shared memory
system, where each core has access to the same memory, though we do not
believe this is an essential part of our contribution. The memory model of Niu
et al.'s Hogwild! is essentially identical to ours. Broadly, our contribution
is to show that asynchrony has asymptotically negligible effect on the
optimality guarantees of stochastic gradient procedures, and this is quite
robust to delay models.

The issue of the centralized counter is a good question; we will clarify that
this counter is a single integer shared between processors, accessed
asynchronously in the same way that the vector $x$ is read and written (so it
is read with the data in step (i)). Most modern processors allow an atomic
increment, which we assume for $k$. All stochastic gradient procedures we know
either assume a fixed number of updates--so that each processor knows a
stepsize in advance--or have decreasing stepsizes based on something like a
counter. Other work on asynchronous procedures makes similar assumptions. The
algorithms of [1] (see Eq. (9) of http://arxiv.org/pdf/1104.5525.pdf) have a
stepsize $\alpha(t)$ that decreases with a (known centralized) iteration $t$.
The paper http://arxiv.org/abs/1505.04824 assumes a bounded delay
$\tau_{\max}$ on pg. 8, and Alg. 1 has a stepsize that is implicitly global
because of this delay. We could change our analysis to have local counters
$k$, but this adds--in our view--unnecessary complexity.
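
To illustrate the protocol concretely, here is a minimal sketch of one worker's
loop (illustrative only: the variable names, the thread-based simulation, the
least-squares objective, and the specific stepsize $c/\sqrt{k}$ are expository
choices, not the implementation used for the experiments in the paper):

    import threading
    import numpy as np

    # Illustrative problem: least squares, f(x) = E[(a^T x - b)^2 / 2].
    dim = 10
    x_true = np.random.default_rng(0).normal(size=dim)

    # Shared state: the iterate x and a single integer counter k.  The iterate
    # is read and written without locks (Hogwild!-style); only the counter
    # increment is protected, standing in for a hardware atomic increment.
    x = np.zeros(dim)
    k = 0
    k_lock = threading.Lock()

    def worker(seed, num_updates, c=0.5):
        global x, k
        rng = np.random.default_rng(seed)
        for _ in range(num_updates):
            x_local = x.copy()                # (i) read the (possibly stale) iterate ...
            k_local = k                       #     ... together with the shared counter
            a = rng.normal(size=dim)          # draw a fresh sample (a, b)
            b = a @ x_true + 0.1 * rng.normal()
            g = (a @ x_local - b) * a         # (ii) stochastic gradient at the stale iterate
            alpha = c / np.sqrt(k_local + 1)  # (iii) stepsize decreasing in the counter
            x -= alpha * g                    # (iv) unsynchronized write-back
            with k_lock:                      # (v) atomic increment of the counter
                k += 1

    # Four asynchronous workers; Python's GIL makes this a protocol
    # illustration rather than a performance demonstration.
    threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("distance to optimum:", np.linalg.norm(x - x_true))

The only cross-processor coordination in this sketch is the counter increment;
all reads and writes of $x$ proceed without synchronization and may be
arbitrarily stale.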