Paper ID: 419

Title: Efficient Symmetric Norm Regression via Linear Sketching

The paper considers the problem of sketching for linear regression with symmetric norm loss functions. The authors provide a novel sketching method for generic symmetric norms that yields a $d \cdot \mathrm{polylog}\, n$ approximation to the underlying solution. This is an interesting result, especially because it generalizes previous results in the literature. However, unlike other results in the literature (which mostly focus on $\ell_p$-norms and, more recently, M-estimators), it cannot provide a $(1+\epsilon)$-approximation of the solution (this is presented as a future direction in the conclusion section). For the special case of the Orlicz norm, the row-sampling algorithm based on Orlicz norm leverage scores is a novel algorithm that improves on previous results (in [2]). However, it should be emphasized that Assumption 1 limits the applicability of this result. Specifically, because the growth of the function $G$ must be sub-quadratic, this algorithm cannot be used when the loss function is an $\ell_p$-norm with $p > 2$.

I understand that this is a theory paper. However, in order to show the applicability of the algorithms, I expected to see experimental results comparing the time complexity and the quality of the estimate of the proposed algorithm with previous results in the literature. The paper lacks experimental results, which would have been very interesting for the NeurIPS audience.
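To make the scope restriction concrete, here is a tiny illustrative check of why sub-quadratic growth rules out $\ell_p$ losses with $p > 2$. The exact form of Assumption 1 is paraphrased here, and `growth_ratio` is a name introduced purely for illustration, not from the paper:

```python
def growth_ratio(G, t, C):
    """Return G(C*t) / (C**2 * G(t)).

    A (paraphrased) sub-quadratic growth condition in the spirit of
    Assumption 1 asks that this ratio stay bounded over all C >= 1.
    For G(t) = t**p the ratio equals C**(p - 2), which is bounded
    if and only if p <= 2.
    """
    return G(C * t) / (C ** 2 * G(t))

quad = lambda t: t ** 2   # l_2-type loss: ratio is identically 1
cubic = lambda t: t ** 3  # l_3-type loss: ratio grows linearly in C
```

For example, `growth_ratio(quad, 1.0, 10.0)` is `1.0`, while `growth_ratio(cubic, 1.0, 10.0)` is `10.0` and keeps growing as `C` increases, so an $\ell_3$ loss falls outside the assumption.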

After reading the response, I tend to agree with the authors that the "Orlicz norm leverage scores" are a nice contribution. However, I would still appreciate a more detailed discussion and comparison between their results and Clarkson and Woodruff (2015) [11].

--------------

The authors present new sketching algorithms for linear regression under a family of (nice) Orlicz norms and, more generally, symmetric norms. All of the algorithms run in O(nnz(A) + poly(d)) time, where nnz(A) is the number of nonzeros in the n x d input matrix A (whose rows represent data points in the regression). In the case of Orlicz norms, the proposed algorithm obtains an eps-approximation of the loss of the optimum, and in the case of an arbitrary symmetric norm, the approximation quality is sqrt(d) x polylog(n) x mmc(l), where mmc(l) is a sort of condition number of the norm l. The prior works on this obtained a d x polylog(n) approximation for Orlicz norms and an eps-approximation for a related class of M-estimators and a narrower class of l_p norms. The authors mention a couple of examples of symmetric and Orlicz norms that do not fall under those other classes of regression losses.

To my understanding, the contribution is of primarily theoretical interest, so it is important to take into account the novelty of the proofs. I'm not an expert and did not go over the proofs in the appendix in significant detail. My overall impression is that there are two contributions:

1. A subspace embedding sketch for symmetric norms, which is an incremental improvement over prior work that used similar sketches in a somewhat different data streaming context [7].
2. Analysis of sampling based on "Orlicz norm leverage scores", which is described as the primary contribution. These techniques look somewhat similar to the "leverage score sampling" used in Clarkson and Woodruff (2015) [11].
Even though [11] uses different leverage scores (based on l_p norms, I believe) and claims guarantees for M-estimators rather than Orlicz norms, my impression is that those techniques may apply here as well (I refer to Section 4.2 of the arXiv version of [11]). One argument for this is that, even though Orlicz norms are not "entry-wise" norms, the analysis in this paper immediately makes a transformation that converts the norms into "entry-wise" expressions that are very similar to M-estimators (see equation (6) and the proof of Lemma 7). As I said, I'm not an expert, but even if the authors disagree, this would be worth discussing more closely in the paper.

Minor comments:
- Line 33: "Despite we have successfully applied" -> Despite having successfully applied
- Line 40: "Is that possible" -> Is it possible
- Line 43: "a list a M-estimators" -> a list of M-estimators
- Line 55: "is prohibited for" -> is prohibitive for
- Line 76: "function G appeared in" -> function G which appeared in
- Line 96: "as follow" -> as follows
- Lines 221, 223: "least square" -> least squares

The definition in (2) seems quite important for the analysis, e.g., in the proof of Lemma 7. However, the original definition of the Orlicz norm is slightly different from (2): in (2), the equality holds strictly and the $\alpha$ should be unique. I'm not sure whether (2) and the definition of the Orlicz norm are equivalent. Please clarify the relationship between these two.

The authors mainly focus on the regression problem in the article. However, classification and non-linear regression are also important in practice. Could the algorithm be extended to such more general problems?
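As a numerical sanity check on the equivalence question, here is a minimal sketch (not from the paper) that computes the standard Orlicz norm $\|x\|_G = \inf\{\alpha > 0 : \sum_i G(|x_i|/\alpha) \le 1\}$ by bisection. Under the stated assumptions on $G$, the map $\alpha \mapsto \sum_i G(|x_i|/\alpha)$ is continuous and strictly decreasing, so the infimum is attained at a unique $\alpha$ where the sum equals 1 exactly, which is the strict-equality form:

```python
def orlicz_norm(x, G, n_iter=200):
    """||x||_G = inf{ alpha > 0 : sum_i G(|x_i| / alpha) <= 1 }, by bisection.

    Assumes G is continuous and strictly increasing with G(0) = 0 and
    G(t) -> infinity as t -> infinity; then alpha -> sum_i G(|x_i|/alpha)
    is strictly decreasing, so the infimum is attained at the unique alpha
    with sum_i G(|x_i|/alpha) = 1.
    """
    if all(v == 0 for v in x):
        return 0.0
    phi = lambda a: sum(G(abs(v) / a) for v in x)
    lo, hi = 1.0, 1.0
    while phi(lo) <= 1:          # shrink lo until the constraint is violated
        lo *= 0.5
    while phi(hi) > 1:           # grow hi until the constraint is satisfied
        hi *= 2.0
    for _ in range(n_iter):      # bisect: phi(lo) > 1 >= phi(hi)
        mid = 0.5 * (lo + hi)
        if phi(mid) > 1:
            lo = mid
        else:
            hi = mid
    return hi
```

For instance, `G(t) = t**2` recovers the Euclidean norm (`orlicz_norm([3.0, 4.0], lambda t: t * t)` gives approximately 5), and `G(t) = |t|` gives the l_1 norm (approximately 7 on the same vector).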