NeurIPS 2020

Adaptive Reduced Rank Regression


Review 1

Summary and Contributions: This paper studies the low-rank regression problem in a high-dimensional setting. It provides an efficient algorithm with small mean squared error under an eigenvalue decay assumption about the covariance matrix of the features.

Strengths: The paper provides a simple two-stage algorithm with provable guarantees under a power-law eigenvalue decay assumption. The algorithm is natural and seems to perform well experimentally. The theoretical claims appear to be sound.

Weaknesses: It is not clear to me how restrictive the assumption on the covariance matrix is.

Correctness: The theoretical claims appear to be correct and the experimental methodology is sound.

Clarity: Overall, the paper is decently written.

Relation to Prior Work: I am not an expert on this part of the literature. The authors argue that their results are distinct from prior literature on the topic.

Reproducibility: Yes

Additional Feedback: Thanks to the authors for their response. I have decided to keep the same score.


Review 2

Summary and Contributions: This paper studies the low-rank regression problem: given observed responses Y in R^{d_2 x n} and features X in R^{d_1 x n}, the goal is to recover a low-rank matrix M such that y = Mx + eps. Unlike the previous literature, this paper gives the first result with theoretical guarantees when n << d_1, i.e., when the number of observations is much smaller than the dimension of the features x.

Strengths: This paper introduces a simple and clean algorithm that is easy to implement in practice. The analysis for the error bound is not trivial. The authors give several new views of the PCA method. In particular, they give a clever way to first choose a rank k_1 and reduce the features to a rank-k_1 subspace, and they show that the samples give a good estimate of these reduced features. The authors then apply a relatively standard denoising step to obtain a rank-k_2 result.
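To make the two-stage procedure concrete, here is a minimal numpy sketch of how I read it; the function name, the plain least-squares fit in the reduced space, and the hard rank-k_2 truncation are my own illustrative choices, not necessarily the authors' exact steps (the paper may, e.g., choose the ranks or the thresholding adaptively).

    import numpy as np

    def two_stage_estimator(X, Y, k1, k2):
        """Illustrative two-stage estimator: Step-1 PCA on X, Step-2 rank-k2 truncation.
        X: (d1, n) features, Y: (d2, n) responses, columns are samples.
        Returns an estimate of M with shape (d2, d1) and rank at most k2."""
        # Step 1: project the features onto their top-k1 principal directions.
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U1 = U[:, :k1]                          # d1 x k1 PCA basis of X
        Z = U1.T @ X                            # k1 x n reduced features

        # Least-squares regression of Y on the reduced features.
        A = Y @ Z.T @ np.linalg.pinv(Z @ Z.T)   # d2 x k1

        # Step 2: denoise by truncating A to rank k2 via its SVD.
        Ua, s, Vta = np.linalg.svd(A, full_matrices=False)
        A_k2 = Ua[:, :k2] @ np.diag(s[:k2]) @ Vta[:k2, :]

        # Map back to the original d1-dimensional feature space.
        return A_k2 @ U1.T                      # d2 x d1, rank <= k2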

Weaknesses: My first concern is that the authors assume the feature x is multivariate Gaussian. This is usually not the case in practice, and I am not sure whether it is a standard assumption in the literature. Although the authors claim in line 86 that they can relax the assumption for most of the results, it is not clear to me how the extension would work. My second concern is that, since n << d_1, I am curious why the following simpler algorithm would not work: compute a rank-k PCA Y' of Y, where k is a tunable parameter, and then solve the equation Y' = MX to find M. Notice that since n << d_1, a solution to the equation exists. I would appreciate it if the authors gave more intuition for why this simple denoising approach for Y does not work; a sketch of what I have in mind follows.
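A minimal numpy sketch of the baseline I am proposing (the function name and the choice of the minimum-norm solution via the pseudoinverse are my own, purely for illustration):

    import numpy as np

    def pca_then_solve(X, Y, k):
        """Baseline: rank-k PCA of Y, then solve Y_k = M X exactly.
        X: (d1, n) features, Y: (d2, n) responses, with n << d1."""
        # Rank-k PCA (truncated SVD) of the responses.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        Y_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # d2 x n, rank <= k

        # Since n << d1, X generically has full column rank, so M X = Y_k has
        # an exact solution; the pseudoinverse gives the minimum-norm one.
        M = Y_k @ np.linalg.pinv(X)                   # d2 x d1
        return M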

Correctness: I did not find obvious issues in the proofs or the experiments.

Clarity: The paper is well written in general.

Relation to Prior Work: In Section 9, the authors give a clear discussion comparing their results with previous work.

Reproducibility: Yes

Additional Feedback: Some minor comments are as follows:
- It would be good to mention at the beginning of the paper that the observations are n response vectors corresponding to n feature vectors.
- Line 88: it would be good if the authors justified the assumption d_2 ~= n.
Post Rebuttal: The authors addressed my questions.


Review 3

Summary and Contributions: This paper studies the problem of low rank regression where the number of samples is much less than the number of features. The authors give an algorithm achieving improved results in some parameter regimes, and the algorithm is arguably simpler than the previous state of the art (requiring two applications of PCA rather than solving an SDP). They also give lower bounds suggesting that their algorithm is essentially optimal in their setting. Experiments bear out that their algorithm is competitive.

Strengths: The algorithm described here is very simple, and solves a foundational problem nearly optimally. It's great that the authors are able to say something new just by analyzing PCA.

Weaknesses: I would be happy to see more compelling experimental datasets. What about predicting height from the genome, which seems well suited to this algorithm? It would also be helpful to provide one or two simplified corollaries of Theorem 1 that state the result without so many parameters (e.g., under a rank assumption).

Correctness: I believe the technical results are correct, but I take issue with some of the expository claims made by the authors. For example, they claim in 9.2(b) (in the full version) that the power-law decay assumption is "the weakest possible assumption for non-identity covariance matrices". I don't have any problem with the assumption, but there are clearly non-identity matrices that don't satisfy it. They also state in 9.1.1 that their analysis works for X that is not near low rank, contrasting with other analyses of PCA. But while their theorem may hold for X that is not close to low rank, it isn't obvious to me that it gives an effective bound, or an interesting statement, in such cases. I am also skeptical of the authors' claim of novelty in viewing PCA as a regularizer. As evidence, they quote Andrew Ng's intro machine learning course, where he advises students not to use PCA to prevent overfitting in cases where doing so throws away a substantial amount of the variance. I don't think it is charitable for the authors to equate simple advice for new ML students with researchers' views about possible uses of PCA.

Clarity: The paper could be better written. The exposition is generally clear, and a great deal of intuition is provided. But the authors could, in particular, give clearer statements of their theorems. For example, Theorem 1 carries an excessive amount of notation: \epsilon and \varepsilon mean different things, and y is apparently defined twice (in fact, the second definition of y is a definition of N).

Relation to Prior Work: As mentioned above, the novelty justifications in Section 9 of the full version are, in my opinion, overstated. But the discussion in Section 5 is good.

Reproducibility: Yes

Additional Feedback: I thank the authors for responding to my feedback. Yes, the UK Biobank data would be a great dataset to try your algorithm on, but I think you have to jump through some hoops to get access. I think I remember seeing some smaller human dataset somewhere, or maybe there is a non-human dataset that is easier to access. I urge the authors to purge from their paper any claim to novelty in using PCA for regularization. Rules of thumb given to students in intro ML courses cannot be used to make claims about the state of the art of ML research. In particular, if you claim that Andrew Ng holds a certain opinion, then either cite a *research* paper of his that explicitly states that view, or email him and ask for a direct quote. Better yet, just get rid of this distraction from your paper! I would also encourage you to clean up the notation in Thm 1, not just add an extra pointer afterward. \epsilon and \varepsilon shouldn't be used together; there are many Greek letters available! If you want to include all those parameters, please put some thought into how to make the theorem as readable as possible. I have slightly lowered the score from 8 to 7, because I'm not sure the authors are serious enough about addressing my expository concerns.


Review 4

Summary and Contributions: This paper suggests a reduced-rank regression (RRR) estimator suitable for the high-dimensional n << p setting. The estimator is very simple and consists of two steps: (1) reduce X with PCA to Z; (2) do an SVD of the cross-covariance between Z and Y. The paper claims that this procedure has good statistical guarantees and outperforms all existing competitors.

Strengths: The paper addresses the important problem of estimating the regression coefficients of a multivariate regression in the n << p setting. It develops a detailed mathematical treatment (mostly in the Appendix) to provide some statistical guarantees on the performance.

Weaknesses: That said, I am not convinced that this paper provides a contribution of NeurIPS level. The suggested estimator is extremely simple; one could even say "naive". Both ingredients are standard in statistics and machine learning: step (1) is the same as in principal component regression (PCR); step (2) is the same as in partial least squares (PLS). Together, this method is something like a PCR-PLS, with some singular value thresholding. The authors claim that it's very novel and outperforms all the competitors, but I remain not entirely convinced by this (see below). Disclaimer: I did not attempt to follow the mathematical proofs in the Appendix.

Correctness: Likely yes, but I did not attempt to follow the mathematical proofs.

Clarity: Okay-ish.

Relation to Prior Work: Not clearly enough.

Reproducibility: Yes

Additional Feedback:

Major issues
* The algorithm consists of applying PCA to X, as in PCR, and then doing an SVD of the cross-covariance between Y and the PCs of X, as in PLS. The entire method could be called PCR-PLS. Of course PLS is related to RRR, but they are not equivalent. For example, the proposed algorithm does not seem to converge to standard RRR with infinite data, or when Step-1-PCA is modified to keep all PC components of X. Doing an SVD of ZY, where Z are the standardized PCs of X, is not equivalent to RRR. This is confusing.
* I suspect that the authors have already had to deal with similar criticism (that their method is not very novel), because the Appendix contains a dedicated section on why this approach is novel. One part claims that PCA is traditionally *not* seen as a regularization method. It cites Andrew Ng saying that PCA is used to speed up computations but cannot prevent overfitting. Whatever the views of Andrew Ng might be (and with all due respect), it is standard textbook knowledge that PCA in principal component regression (PCR) can very effectively prevent overfitting and is in fact closely related to ridge regression; see Hastie et al., The Elements of Statistical Learning. It is strange to claim that PCA regularization is a novel idea.
* The experimental results in Table 1 are not convincing -- perhaps because they are presented in an unclear way: (a) What exactly is R^2_out? If it actually is R^2, then it cannot be above 1; it is unclear what R^2 = 18 means. (b) Many comparison methods have hyper-parameters (regularization strength, i.e. the penalty in ridge regression, the number of PCs in PCR, the number of components in RRR, etc.). How were they set? It only makes sense to set the optimal parameters by cross-validation (see the sketch after this list for the kind of tuning I have in mind). Was CV used here? It is not mentioned. (c) I cannot believe that the method suggested here outperforms PCR, RR, RRR, etc. by more than 3-fold (as in Table 1). This is an enormous difference, which makes one suspect that the other methods were not applied correctly. Unfortunately, the manuscript does not convince the reader that the comparisons are sound. I want to stress that this comparison table is absolutely crucial for the paper.
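To be explicit about the kind of hyper-parameter tuning I would expect for the baselines, here is a minimal scikit-learn sketch that selects the ridge penalty by cross-validation; the toy data and grid values are placeholders, not the paper's actual setup.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # Toy stand-in data (samples in rows, as scikit-learn expects):
    # n = 50 samples, d1 = 200 features, d2 = 10 responses.
    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((50, 200))
    Y_train = rng.standard_normal((50, 10))

    # Choose the ridge penalty by 5-fold cross-validation instead of fixing it by hand.
    alphas = np.logspace(-3, 3, 13)
    search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5)
    search.fit(X_train, Y_train)
    print(search.best_params_)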
Medium issues
* The motivation in Section 2 is slightly confusing because "our model" in lines 84-88 does not mention reduced rank. It seems that reduced-rank regression here is not the goal per se but rather a regularization method for estimating the multivariate regression coefficients (the matrix M). Maybe this could be pointed out more explicitly.
* Line 119: "PCA reduction outputs P_k(X)", where P_k is a rank-k approximation to X -- why? PCA reduction can mean different things, including extracting the k left singular vectors of X (see the sketch at the end of this review).

Minor issues
* Line 32: what is "sample complexity n"?

POST-REBUTTAL: I have increased my score from 3 to 4. The authors addressed some of the issues I raised above, but I am still not really convinced by the experiments and do not understand how PCA+RRR can outperform, e.g., ridge RRR by such a large margin. Regarding "When all PCs of X is kept, the problem reduces to RRR": I wish the authors would point this out, and ideally prove it, in the revised paper. Regarding the discussion around Ng's quote: I do not understand what the authors mean by "but the analysis is possible only under factor models", and I am not sure what "under factor models" means here. I think "View 1" is simply that PCA can serve as regularization. This is what PCR is all about. I teach this in the first intro course to machine learning, and I do not mention any "factor models". I think the authors are actually doing themselves a disservice by arguing this point.
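To clarify the ambiguity I see at line 119, here is a small numpy illustration of the two things "PCA reduction" could mean; the shapes and variable names are mine, chosen only for illustration.

    import numpy as np

    # X is d1 x n with d1 = 200 features and n = 50 samples.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    k = 5

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Reading 1: a rank-k approximation of X, still living in the original space.
    P_k_X = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # 200 x 50, rank <= k

    # Reading 2: the k-dimensional scores, i.e. X projected onto the top-k left singular vectors.
    Z = U[:, :k].T @ X                              # 5 x 50

    print(P_k_X.shape, Z.shape)   # (200, 50) (5, 50)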