NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center

### Reviewer 1

Paper Summary: The main idea is that Nesterov's acceleration method's and Stochastic Gradient Descent's (SGD) advantages are used to solve sparse and dense optimization problems with high-dimensions by using an improved GCD (Greedy Coordinate Descent) algorithm. First, by using a greedy rule, an $l_1$-square-regularized approximate optimization problem (find a solution close to $x^*$ within a neighborhood $\epsilon$) can be reformulated as a convex but non-trivial to solve problem. Then, the same problem is solved as an exact problem by using the SOTOPO algorithm. Finally, the solution is improved by using both the convergence rate advantage of Nesterov's method and the "reduced-by-one-sample" complexity of SGD. The resulted algorithm is an improved GCD (ASGCD=Accelerated Stochastic Greedy Coordinate Descent) with a convergence rate of $O(\sqrt{1/\epsilon})$ and complexity reduced-by-one-sample compared to the vanilla GCD. Originality of the paper: The SOTOPO algorithm proposed, takes advantage of the l1 regularization term to investigate the potential values of the sub-gradient directions and sorts them to find the optimal direction without having to calculate the full gradient beforehand. The combination of Nesterov's advantage with SGC advantage and the GCD advantage is less impressive. Bonus for making an efficient and rigorous algorithm despite the many pieces that had to be put together Contribution: -Reduces complexity and increases convergence rate for large-scale, dense, convex optimization problems with sparse solutions (+), -Uses existing results known to improve performance and combines them to generate a more efficient algorithm (+), -Proposes a criterion to reduce the complexity by identifying the non-zero directions of descent and sorting them to find the optimal direction faster (+), -Full computation of the gradient beforehand is not necessary in the proposed algorithm (+), -There is no theoretical way proposed for the choice of the regularization parameter $\lambda$ as a function of the batch size. The choice of $\lambda$ seems to affect the performance of the ASGCD in both batch choice cases (-). Technical Soundness: -All proofs to Lemmas, Corollaries, Theorems and Propositions used are provided in the supplementary material (+), -Derivations are rigorous enough and solid. In some derivations further reference to basic optimization theorems or Lemmas could be more en-lighting to non-optimization related researchers (-). Implementation of Idea: The algorithm is complicated to implement (especially the SOTOPO part). Clarity of presentation: -Overall presentation of the paper is detailed but the reader is not helped to keep in mind the bigger picture (might be lost in the details). Perhaps reminders of the goal/purpose of each step throughout the paper would help the reader understand why each step is necessary(-), -Regarding the order of application of different known algorithms or parts of them to the problem: it is explained but could be more clear with a diagram or pseudo-code (-), -Notation: in equation 3, $g$ is not clearly explained and in Algorithm 1 there are two typos in referencing equations (-), -For the difficulty of writing such a mathematically incremental paper, the clarity is at descent (+). Theoretical basis: -All Lemmas and transformations are proved thoroughly in the supplementary material (+), -Some literature results related to convergence rate or complexity of known algorithms are not referenced (lines 24,25,60,143 and 73 was not explained until equation 16 which brings some confusion initially). Remark 1 could have been referenced/justified so that it does not look completely arbitrary (-), -A comparison of the theoretical solution accuracy with the other pre-existing methods would be interesting to the readers (-), -In the supplementary material in line 344, a $d \theta_t$ is missing from one of the integrals (-). Empirical/Experimental basis: -The experimental results verify the performance of the proposed algorithm with respect to the ones chosen for comparison. Consistency in the data sets used between the different algorithms, supports a valid experimental analysis (+), -A choice of better smoothing constant $T_1$ is provided in line 208 (+) but please make it more clear to the reader why this is a better option in the case of $b=n$ batch size (-), -The proposed method is under-performing (when the batch size is 1) compared to Katyusha for small regularization $10^{-6}$ and for the test case Mnist while for Gisette it is comparable to Katyusha. There might be room for improvement in these cases or if not it would be interesting to show which regularization value is the threshold and why. The latter means that the algorithm proposed is more efficient for large-scale problems with potentially a threshold in sparsity (minimum regularization parameter) that the authors have not theoretically explored. Moreover, there seems to be a connection between the batch size (1 or n, in other words stochastic or deterministic case) and the choice of regularization value that makes the ASGCD outperform other methods which is not discussed (-). Interest to NIPS audience [YES]: This paper compares the proposed algorithm with well-established algorithms or performance improvement schemes and therefore would be interesting to the NIPS audience. Interesting discussion might arise related to whether or not the algorithm can be simplified without compromising it's performance.