NIPS 2016
Monday, December 5th through Sunday, December 11th, 2016, at the Centre Convencions Internacional Barcelona
Paper ID 950: Scalable Adaptive Stochastic Optimization Using Random Projections

### Reviewer 1

#### Summary

This paper considers faster stochastic gradient algorithms for possibly non-convex functions. In particular, it considers various schemes to improve upon AdaGrad and its full-matrix variant AdaFull, popular iterative methods for solving such problems over large data sets. More precisely, the paper provides an algorithm called Ada-LR that uses sketching techniques to obtain convergence guarantees similar to those of AdaFull without paying the full per-iteration cost. The paper proves a regret bound for this algorithm to demonstrate its advantage over AdaGrad and then discusses various improvements that decrease the iteration cost, make the algorithm more stable, and perform variance reduction. Finally, the paper provides an empirical evaluation of these methods on several problems, suggesting how they may improve the performance of deep network training in some regimes.
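For readers less familiar with these methods, the baseline the paper improves on can be sketched as follows. This is a toy illustration of the standard diagonal-AdaGrad update, not the authors' code; the function name and the quadratic test problem are my own.

```python
import numpy as np

def adagrad_step(beta, g, state, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad step: each coordinate gets an adaptive
    learning rate scaled by its running sum of squared gradients."""
    state = state + g * g
    beta = beta - lr * g / (np.sqrt(state) + eps)
    return beta, state

# Toy problem: minimize f(beta) = 0.5 * ||beta||^2, whose gradient is beta.
beta, state = np.ones(5), np.zeros(5)
for _ in range(200):
    beta, state = adagrad_step(beta, beta, state)
```

AdaFull replaces the per-coordinate `state` with the full matrix of gradient outer products, which is exactly the object whose low-rank structure the sketching in this paper exploits.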

#### Qualitative Assessment

Using fast randomized numerical linear algebra techniques to provably and practically improve the performance of AdaGrad and AdaFull is a wonderful idea, and this paper initiates what could be a very interesting line of work in this area. However, the particular algorithm and modifications proposed are quite natural, and consequently it is difficult to pinpoint where the novelty in the approach lies. More troubling, it seems that details of the algorithm's performance may be hidden in the size of the sketch matrix Pi and how frequently it is computed. It would be nice if the main theorem (Proposition 2) made clear the end-to-end cost of the algorithm and what it depends on. Also, note that properties of a sketch matrix Pi may not hold under adaptive queries: the vectors g_t+1 depend on beta_t+1, which in turn depends on Pi. Given this, how is it ensured that Pi has the right properties in such a scenario if it is not recomputed? I believe Theorem 11.2 in [14] only holds if the matrix being sketched is fixed a priori. Nevertheless, the empirical success of the algorithm, the cleanliness of the paper, and its practical utility may be sufficient to counteract the above concerns.

Detailed comments:
• Lines 33–35: just in certain cases, right?
• Line 53: I would be a little more specific about what is meant by "AdaGrad-style algorithms".
• Line 67: Why is this impractical if it was just a low-space single pass?
• Lines 70–81: Maybe subspace embeddings should be included in this discussion?
• Line 93: Why was G_t-1 + g_t g_t^T written instead of just G_t?
• Line 124: What is M?
• Sections 3.2 and 3.3: are there any theoretical guarantees for these algorithms?

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

This paper proposes an extension of Adagrad, a method for adaptively adjusting the learning rate of each individual feature, to a full second-order version called Ada-Full. The resulting second-order matrix is assumed to be low rank, so the system of linear equations that must be solved at each step is solved via random projections. This yields a significant improvement in per-iteration complexity while maintaining, with high probability, the desired improvement at each step. Additionally, the authors propose a further improvement that employs inexact computations at each step. This allows an even cheaper iteration complexity; no theory is provided for this variant, yet it seems to work well in practice.
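For intuition, the random-projection step described here can be illustrated with a standard randomized range finder in the style of Halko et al. This is my own toy sketch under the low-rank assumption, not the paper's algorithm; all variable names and dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-rank ground truth: G = A A^T with rank r much smaller than d,
# standing in for the matrix of accumulated gradient outer products.
d, r = 100, 5
A = rng.standard_normal((d, r))
G = A @ A.T

# Randomized range finder: project onto a random subspace of size k
# (slightly above the rank), orthonormalize, and compress G so that
# downstream linear algebra happens in k dimensions instead of d.
k = 10
Pi = rng.standard_normal((d, k))
Q, _ = np.linalg.qr(G @ Pi)      # orthonormal basis capturing range(G)
G_approx = Q @ (Q.T @ G)         # low-rank approximation of G

err = np.linalg.norm(G - G_approx) / np.linalg.norm(G)
```

When G has exact rank r and k > r, the random sketch captures the range almost surely and the relative error is at machine precision; with only approximately low-rank matrices the error degrades gracefully.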

#### Qualitative Assessment

I find the paper very clearly written, with well-defined goals and contributions. I see the use of randomized linear algebra as very significant for the future development of optimization algorithms. My only puzzlement is why the authors chose to analyze the online (regret) setting, while their implementation seems to target the standard optimization setting for ML. I assume the choice is due to the availability of theoretical results for regret, rather than standard objective convergence rates. The difference is not very significant, of course.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper proposes a low-rank approximation of a core matrix. A randomized version is also provided. A regret bound is given, and experiments show advantages in speed.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Qualitative Assessment

I am generally impressed with this paper. The introduction gives convincing arguments for the need to modify ADAGRAD/ADA-FULL in the presence of low-effective-rank data. The paper then introduces and describes the new algorithms as a series of matrix operations approximating ADA-FULL, provides some theoretical results (deferring proofs to the appendix), and presents three sets of experiments. This is a logical progression of the material.

In the related work section, you might wish to remove the second paragraph in favor of other references that argue for the prevalence of low-effective-rank data. The second paragraph, while certainly conceptually related to the paper, says nothing about its explicit relation. For instance, something like "variance-reduced SGD algorithms have vastly superior performance BUT perform extremely poorly on low-effective-rank data" would help the reader understand its context. I suggest this because the paper's contribution is only going to be as good as the prevalence of low-effective-rank data in the real world, and this is the recurring theme that must be emphasized.

The algorithms appear useful and insightful, and involve "just" a series of matrix operations simple enough that an alert reader should be able to code them up. It seems a little awkward to include ADA-VR when, as far as I can tell, there is virtually no reason to use it over RADAGRAD, but it makes the paper's story proceed nicely, so I guess it is OK. The table in Appendix A is enormously helpful; thank you for including it.

The experiments the authors chose to run seem reasonable to me given the research goals (and the 9-page space constraint). The first experiment is logical to include since it provides the best-case scenario for the algorithms, and they certainly perform well. The second experiment, presented in Figure 2, is to me the most interesting one in the paper. My main concern there pertains to your comment about using RADAGRAD only for the convolutional layers. It seems a fairer comparison would be to use RADAGRAD for everything or ADAGRAD for everything. I think more discussion of this would be worthier than the earlier discussion of the implementation; there is little reason to talk about the code beyond noting that it was implemented in Lasagne/Theano/Python. I am also intrigued by the third experiment due to the growing importance of LSTMs nowadays. However, the performance benefits over ADAGRAD+SVRG seem extremely minor, even taking the log scale into account. Is there a way to explain intuitively what this extra benefit means? You might explain this and delete the second paragraph of Section 5, which does not seem to make sense for the paper (the forgetting schemes were never discussed earlier).

Despite some nitpicks over the experiments, as mentioned earlier I have a positive impression of this paper and lean towards its acceptance to NIPS. It introduces an important extension to the widely used ADAGRAD algorithm and argues that these algorithms are needed when we have low-effective-rank data. While it is not as groundbreaking as the original ADAGRAD paper, I believe the research contribution is substantial enough to warrant consideration at NIPS.

Minor comments:
• In line 108, you might wish to say "borrowing the $g_{1:T,j}$ notation" or something similar, because the notation is never explicitly defined in this paper but is used later (e.g., in Equation 2).
• In lines 124 through 128, including Equation 2, I am unable to derive the first part of Equation 2. I am also unsure why you assume $g_t \propto Mx_t$: M is later revealed to be a scalar, so why the need for the proportionality symbol? That constant can be absorbed into M. I once thought it should be something like $\|g_t\|_2 \le M\|x_t\|_2$, but perhaps the authors could clarify.
• In line 145, did you mean to say "applicable in this low-dimensional *space*"?
• In line 192, "it's" should be "its".
• In Figure 1, please be consistent and use ADA-FULL instead of ADA-F, since the former is used throughout the paper's text.
• In Figure 1(b), you might consider using the same number of principal components for RADAGRAD and ADA-LR. I think one has 21 and the other 28, and I believe these values come from the top $\tau$ singular values, so you might want to make them equal or otherwise explain your choice of $\tau$. I am not sure why there is a difference; I assume $\tau$ is just chosen heuristically?
• In line 268, "A the end" should be "At the end".
• In line 397 (in the appendix), the title of C.1 is misleading: it should be the regret bound for ADA-LR, not RADAGRAD.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)