Review for NeurIPS paper: Online Robust Regression via SGD on the l1 loss

NeurIPS 2020

Online Robust Regression via SGD on the l1 loss

Review 1

Summary and Contributions: This paper consider the problem of robust regression, and consider the simple algorithm: run SGD on ell_1 loss. And the authors establish a convergence rate on this simple algorithm: 1((1-eta)^2*n), though the dependence on eta might not be optimal, given the simplicity of the algorithm and strong empirical performance, I think this result deserves presenting at the conference.

Strengths: Clear theoretical justification of the proposed approach with guarantees.

Weaknesses: No real data experiments to verify the effectiveness of the approach in real world.

Correctness: Theoretical results are clearly stated and seem correct. Empirical results look reasonable.

Clarity: This paper is well written.

Relation to Prior Work: Related work are extensively discussed.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper performed a theoretical analysis on the empirical loss minimization with absolute loss. In a setting they called the online oblivious response corruption, they proved the average of a sequence obtained by the SGD achieves convergence rate of 1/n.

Strengths: They show the proposed algorithm (averaging of a sequence of SGD) is highly scalable and optimal. They show other benefits of the proposed algorithm such as low dependency on noise level and feature confounding.

Weaknesses: Their analysis seems to strongly depend on the assumption that data is drawn from 0 mean gaussian distribution. When it is said 1/n is the optimal rate, one usually assumes much less.

Correctness: Their proof seems to be correct.

Clarity: Yes

Relation to Prior Work: The authors relates this work to others from both aspects of robust statistics and stochastic optimization.

Reproducibility: No

Additional Feedback:

Review 3

Summary and Contributions: The robust linear regression problem is studied in the online setting. Under certain conditions, the convergence rate of the averaged iterate is obtained.

Strengths: The online version of SGD for the robust linear regression problem seems novel.

Weaknesses: The linear model is simple, and the assumptions on the data seems a little bit restrictive.

Correctness: The proofs seems correct, but the referee did not check them line by line.

Clarity: yes

Relation to Prior Work: seems so

Reproducibility: Yes

Additional Feedback: After the rebuttal: I am not very familiar with this area, while the authors addressed my concerns partially. Hence, I decide to remain the current score.

Review 4

Summary and Contributions: This paper addresses the task of online learning for robust linear regression (with L1 loss). In particular, based on the smoothing mechanism, the authors propose a stochastic gradient descent on the l1 loss with guaranteed convergence. The authors show some encouraging results on robust regression.

Strengths: - The problem of online learning for robust regression is quite important in several machine learning tasks, where the algorithm only has access to the the data in a streaming manner. Thus, the proposed algorithm would be very useful for such applications. - The non-smothness of the l1 loss is addressed using Gaussian smoothing. Although Gaussian smoothing is not new, using it in this online learning context is rather interesting. - The final algorithm is quite simple, which can be easily implemented using existing optimization frameworks. - A proof for convergence guarantee is provided.

Weaknesses: - As can be seen from Lemma 3, the higher the outlier proportion, the local conditioning becomes worse. Also, if the noise level tends to 0, the problem becomes non-smooth. It is not clear if the proposed algorithm still well-behaves in such extreme cases. - Some work on robust online regression had been previously discussed, for example: + Briegel, Thomas, and Volker Tresp. "Robust neural network regression for offline and online learning." Advances in Neural Information Processing Systems. 2000. It is not clear why such references are not mentioned and compared in the paper. ======== Update after rebuttal: I thank the authors for providing feedback to all reviewers' comments. Some of my concerns have been addressed. Therefore, I have upgraded my rating for this work to "7. Accept".

Correctness: - The proofs look reasonable (although the details have not been carefully checked). - The empirical results support the theory.

Clarity: - The paper is well written where most details are clearly explained

Relation to Prior Work: Some related works that addressed the similar problems are not thoroughly discussed.

Reproducibility: Yes

Additional Feedback: - The robustness of the algorithm is attained through L1 (or Huber loss). However, in many practical applications with high outlier rates, one needs to use more difficult loss functions, e.g., Tukey loss. It would be interesting to discuss if it is straight-forward to extend this work to such loss.