NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 409 Convergence Analysis of Two-layer Neural Networks with ReLU Activation

### Reviewer 1

This paper provides convergence guarantees for two-layer residual networks with ReLU activations using SGD. This is an important problem and will be interesting to NIPS community. In particular, authors show that when the input is Gaussian and the ground truth parameters (the residual part W^*) are small enough, SGD will converge to the global optimal, which is also the ground truth in the considered setting. Preliminary experiments verify the claim on both synthetic data and real-world data. The paper is in general well written and a nice proof flowchart is provided to help readers. The two-phase analysis is interesting and novel, where in the first phase a potential function is constructed and proved to decrease with the iterations of SGD and in the second phase a classical SGD analysis is employed following one-point strong convexity. Potential issue in proof: We found a potential flaw in the proof for Lemma 2.5 (the convergence analysis of SGD). In line 778 of the appendix, the second equality should lead to $\langle W_t - W^*, \eta G_t \rangle$ instead of $\langle W_t - W^*, \eta \nabla f(W) \rangle$. These two formulations are equivalent to each other if they are in expectation, but the current results in Lemma 2.5 are not in expectation. Comparison to result of [ZSJ+2017]: a) it's important to note that the works are independent as that paper has not been presented so far, b) however, both the papers provide very similar Guarantees as this paper require "small" norm of the residual parameter while [ZSJ+2017] require good enough initialization, c) this paper however provides analysis of SGD which is more practical algorithm than GD. Overall, the paper presents an interesting result for two-layer neural networks and should be of interest to the NIPS community. However, the proof might have an issue that requires clarification. Refs: [ZSJ+ 2017] Recovery Guarantees for One-hidden-layer Neural Networks, ICML 2017, https://arxiv.org/pdf/1706.03175.pdf

### Reviewer 2

This paper discusses the convergence property of SGD for two layer NN with ReLU and identity mapping. The authors argue that SGD converges to a global minimizer from (almost) arbitrary initialization in two phases: * Phase 1: W is drifting in wrong directions but a potential function g(W) = || W* + I ||_{2, 1} - || W* + I ||_{2, 1} decreases * Phase 2: W converges to W* at a sublinear rate The analysis is a combination of local one-point restricted strong convexity (used to show Phase 2) + guarantees for entering this benign region (Phase 1). I didn't check the proofs line by line, but I'm a bit unclear about the statements of Lemma 5, I would kindly ask the authors to elaborate the following: First, W^t needs to satisfy the one-point strongly convex property for all t. We need to control ||W^t||_2 < gamma -- the upper bound for ||W^*||_2. I think for this you need some smoothness assumption. Could you please point me where how do you ensure this? Second, the authors assumes that the iterates W^t stays in the one-point RSS region with diameter D. In line 174, you mentioned D = sqrt(d)/50, where d is the input dimension. This means the number of iterations you need must be larger than dimension d. I think this might be a bit weak, as many convergence results for nonconvex optimization are dimension free. Thanks!