Paper ID: 1554
Title: Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a unified theoretical framework for analyzing the convergence rates of asynchronous and low-precision randomized algorithms such as stochastic gradient descent (SGD). The basic idea is to use a martingale-based analysis, which can capture the rich noise models arising from such algorithms. Moreover, a new SGD variant called BUCKWILD!, which uses lower-precision arithmetic, is proposed. Experiments on real-world datasets are used to evaluate the performance of the proposed method.

The martingale-based unified analysis seems to be interesting. It provides another tool to analyze some existing methods, such as HOGWILD!.

However, the analysis of this paper provides no hint to improve the algorithm of HOGWILD!, and the convergence rate in this paper is only the same as that in HOGWILD!. This makes the paper less compelling. Furthermore, the speedup of BUCKWILD! is not satisfactory. For example, the speedup of 8-bit int with 24 threads is only 6, which is far less than the ideal speedup of 24.

Q2: Please summarize your review in 1-2 sentences
The martingale-based unified analysis seems to be interesting. However, the analysis of this paper provides no hint for improving either the algorithm or the convergence rate of the existing HOGWILD! method.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is an essentially theoretical contribution regarding convergence rates for the so-called "Hogwild"-style algorithms for stochastic gradient descent.

In these algorithms, the gradient steps are performed asynchronously over different chunks of the dataset in parallel, with the results updating the current weights as they are completed, independently of other parallel updates.

Previously, demonstrating theoretical convergence has been difficult and somewhat brittle.

In this paper, the authors present an analysis framework that is able to extend convergence results to a variety of specializations/generalizations of the algorithm.

They show that one of their proven variants, "Buckwild", provides significant real-world speedups by using lower-precision arithmetic to compute the gradient steps.

As far as the paper goes, it is generally good.

I had little trouble reading and understanding the paper (I think), and they make a point to explain the maths in an intuitive fashion, insofar as it is possible.

The authors do a good job placing their work in the context of other research.

I admit that I'm not an expert in convergence proofs of this type, so I'll have to leave it to others to do a robust check of the correctness thereof.

If the proof is correct, it seems to satisfy the usual criteria for a reasonable proof of this type (part of which is that it is by itself not sufficient to show the practical usefulness of the underlying algorithm).

One concern I have is that it's not obvious where the work of others ends and the work done by the authors begins.

All of the citations are there; I'm just not sure how closely related previous results are to what's being shown here.

For example, on p. 1, line 51, the authors mention at least one other version of SGD that uses reduced-precision arithmetic, but no details from that paper are referenced in Section 3.2.

Is Buckwild closely related?

Are the results similar?

Moreover, I could swear I remember a paper in which supermartingale methods are used to prove convergence results in some Hogwild-related setting, but neither Google nor memory is serving me well, so I'll take the authors' word that they're first to this idea, as they seem to claim (in any case, they should please give the literature one last look to make sure that this is so).

Another concern is the broadness of the result:

I'm not sure how popular fixed-point arithmetic is in this area, especially considering the dominance of GPU calculations.

The result for the non-convex optimization is nice, but if we decide the fixed-point result isn't that important, the contribution starts to seem a bit thin to me.

Of course, you often have to do limited precision calculations to use the GPU (e.g., 64 -> 32 bit floats).

Do these results apply in that case?

Might be a nice addition.

The empirical results are somewhat lightweight, but it's to be expected in a mostly theoretical paper.

I'd like to see confidence bars on the speed comparison, and I'd like to see every thread size from 1 to 24 if they can manage it.

Why does the 32-bit version slow down at high numbers of threads?

Figure 1b. seems unnecessary to me:

I'm willing to take the authors' word for it that the two algorithms converge to roughly the same answers in roughly the same fashion.

Table 1 is kind of a fun result by itself which I'd love to see explored more, maybe in a different paper.

It seems crazy to me that we can throw out a whole bunch of information and have virtually no impact on performance.

Honestly, let's just use a single bit and see what happens.

Could you reference other papers that show something similar here?

Incidentally, Googling related terms brings up this paper both on arXiv and on the authors' homepage.

Is this what we're doing now?

At a minimum, it makes blind review kind of meaningless.
Q2: Please summarize your review in 1-2 sentences
A theoretical analysis of Hogwild that accounts for common approximations used in the current literature.

A decent paper, with some reasonable experimental results.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Positives:
* Unified analysis of asynchronous update algorithms. Asynchronous algorithms are fast and empirically relevant, and analyzing their convergence rates/properties is definitely useful to the community.
* Presents a convergence analysis of Hogwild-style algorithms without requiring sparsity assumptions.
* Their analysis of asynchronous SGD-style algorithms also extends to (a specific) non-convex case, which is a big plus.
* Low-precision algorithms, which I believe are getting popular in the machine learning community; a theoretical framework to analyze them would indeed be interesting.

Concerns:
* The limited precision is only useful when the feature space is not large (I am assuming that since the input is fixed precision, the feature weights that are updated based on the input data are also represented in low precision). But in many industry-scale real-world problems, the limited precision might lead to feature collisions (due to hashing into a limited feature space). It would be interesting to observe the robustness of low-precision arithmetic for datasets with millions of features.
* I would temper the claims made at the end of Section 4 about the low-precision asynchronous updates leading to a 2.3x speedup without any significant error degradation. Surely this applies only to the small subset of datasets that were empirically analyzed in the paper.
Q2: Please summarize your review in 1-2 sentences
Overall, I think this is an interesting paper that proposes a unified analysis of asynchronous SGD-type algorithms. What is more useful is that their analysis extends to specific non-convex use cases as well as fixed-point algorithms, and probably a combination of both. The empirical results for the low-precision case, and the claims made based on them, are a bit preliminary and merit more exhaustive treatment. It would be really interesting to see how the accuracy varies and what other tradeoffs come into play when using low-precision arithmetic for very high-dimensional datasets (features ~ a few million, e.g., the Criteo click-prediction dataset).

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Minor comments
-------------------

Line 320: "we focus on the rank one version of the problem in (16)": actually, (16) is already the rank-one problem. Maybe the authors meant to put the full-rank problem in (16)?
Q2: Please summarize your review in 1-2 sentences
The paper is clearly written, considers an interesting topic and unifies the analysis of asynchronous updates with quantization.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their reviews and feedback, which have helped us improve our paper. Two important points mentioned by the reviewers deserve specific comment:

1. In the introduction, R2 mentions that it is "not obvious where the work of others ends and the work done by the authors begins." We agree that this could be made clearer, and we will add prose to the introduction that delineates cited work from ours. For the example mentioned by R2, the cited work on low-precision algorithms focuses specifically on an empirical analysis of neural network training and provides no theoretical guarantees, whereas our work gives theoretical guarantees for a general class of problems.

2. Both R1 and R2 had questions about the experimental section, particularly the results given in Figure 1(a). R1 says "the speedup of BUCKWILD! is not satisfactory. For example, the speedup of 8-bit int...is far less than the ideal speedup." R2 has a similar question: "Why does the 32-bit version slow down at high numbers of threads?"

We don't get linear speedup because we are bound by the memory bandwidth available to the cores. The 32-bit version slows down at 24 threads because there are only 12 physical cores and the workload is already memory-bandwidth bound at 12 cores. Adding 2x logical cores with hyperthreading provides no benefit, while thrashing the L1 and L2 caches and increasing conflicts from updating the model concurrently. However, for the same number of threads, the limited-precision version achieves higher performance than the full-precision one, both because it decreases the memory load and because we are able to compute with high-throughput SIMD instructions. If an algorithmic change makes HOGWILD! run faster, we expect BUCKWILD! to also run faster. We will amend the paper to make clear how our experiments are affected by the combination of SIMD, memory bandwidth, and caching.
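To make the arithmetic concrete, the scalar sketch below illustrates the kind of unbiased low-precision update that BUCKWILD! performs: the gradient step is stochastically rounded to an 8-bit fixed-point grid so that the rounded update is correct in expectation. The function names, the fixed-point scale, and the plain loop are illustrative only and are not code from our implementation, which vectorizes this loop with SIMD.

```cpp
// Minimal, hypothetical sketch of a low-precision SGD step with unbiased
// (stochastic) rounding to 8-bit fixed point. Illustrative only.
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Round x (in real units) to an int8 grid with spacing `scale`, randomly,
// so the result equals x/scale in expectation.
int8_t stochastic_round_int8(double x, double scale, std::mt19937& rng) {
    double v = x / scale;                    // value in fixed-point units
    double lo = std::floor(v);
    std::bernoulli_distribution up(v - lo);  // P(round up) = fractional part
    double r = lo + (up(rng) ? 1.0 : 0.0);
    if (r > 127.0) r = 127.0;                // saturate to the int8 range
    if (r < -128.0) r = -128.0;
    return static_cast<int8_t>(r);
}

// One SGD-style update on an 8-bit model: w <- w - eta * grad, with the
// step quantized before it is applied.
void low_precision_step(std::vector<int8_t>& w, const std::vector<double>& grad,
                        double eta, double scale, std::mt19937& rng) {
    for (size_t i = 0; i < w.size(); ++i) {
        int wi = w[i] - stochastic_round_int8(eta * grad[i], scale, rng);
        if (wi > 127) wi = 127;              // keep the weight representable
        if (wi < -128) wi = -128;
        w[i] = static_cast<int8_t>(wi);
    }
}

int main() {
    std::mt19937 rng(0);
    std::vector<int8_t> w(4, 0);
    std::vector<double> grad = {0.3, -1.2, 0.05, 2.0};
    low_precision_step(w, grad, /*eta=*/0.1, /*scale=*/1.0 / 64, rng);
    return 0;
}
```

Since each coordinate then occupies one byte instead of four, the same memory traffic carries four times as many coordinates, which is where the bandwidth savings above come from.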

Reviewer 1

R1 notes that the "convergence rate in this paper is only the same as that in HOGWILD!" While this is true when measured in terms of the number of updates, our low-precision BUCKWILD! updates each take less time than HOGWILD! steps, so BUCKWILD! executes in less wall-clock time. We will modify Section 3.2 to include this point. Also note that, compared to previous results, our rate applies under more general conditions.

R1 states that we give no "hint to improve the algorithm of HOGWILD!" We will amend the paper to make it clearer that low-precision BUCKWILD! updates are intended to serve as such an improvement in many cases.

Reviewer 2

R2 mentions remembering a HOGWILD!-related paper "in which supermartingale methods are used to prove convergence." We are unaware of any such paper, but if we can identify it, we will cite it properly.

R2 also mentions that they are "not sure how popular fixed-point arithmetic is in this area," and is concerned about the broadness of the result and about running on GPUs. While we agree that the performance depends on the instruction set of a device, GPUs do typically have 16-bit float and 8- through 64-bit integer types (see "Accelerating GPU computation through mixed-precision methods"). On any device, we can always save memory bandwidth by using low-precision arithmetic, and our results still apply even to 32-bit floating-point operations on GPUs. Additionally, people often run SGD on the CPU in practice, so it is important to understand how we can speed it up.

We will modify Figure 1a to include more thread counts.

R2 is also interested in algorithms that use just a single bit. There are reasons to expect that this will work well, including the ease of doing 1-bit multiplies and other bitwise operations on the CPU. Unfortunately, our theoretical bounds become quite poor as the number of bits goes to 1, possibly because we end up discarding too much information. We ran 1-bit and 4-bit experiments, and we will amend the paper to briefly include these results. For the 4-bit case, because Intel CPUs do not have hardware-level SIMD support for half-byte data, we observe that it runs slower than the 8-bit version due to the extra computation cost.
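As a rough back-of-the-envelope illustration of why the bounds degrade (not a calculation taken from the paper): unbiased rounding to a grid of spacing delta adds per-coordinate variance of at most delta^2/4, and with b bits covering a fixed range [-R, R] the spacing is delta = 2R/(2^b - 1), so moving from 8 bits to 1 bit inflates the worst-case quantization variance by a factor of 255^2:

\[
\operatorname{Var}\bigl[Q(x) - x\bigr] \le \frac{\delta^2}{4},
\qquad
\delta = \frac{2R}{2^b - 1}
\;\Longrightarrow\;
\frac{\delta_{b=1}^2}{\delta_{b=8}^2} = 255^2 \approx 6.5 \times 10^4 .
\]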

Reviewer 3

We thank R3 for their review, and will correct the mentioned issue.

Reviewer 4

R4 notes that "the limited precision is only useful when the feature space is not large." It is true that the algorithm will run slower with a larger feature space. However, compared to a full-precision version, each low-precision update will still run faster even for a large feature space. As future work, we will test how the error due to low-precision updates interacts with feature collisions, but such an analysis is beyond the scope of this paper.

We will be more specific with our claims in Section 4 about the conditions under which we achieve the stated speedup without significant errors.

Reviewer 5

We thank R5 for their review.

Reviewer 6

We thank R6 for their review.