Review for NeurIPS paper: How many samples is a good initial point worth in Low-rank Matrix Recovery?

NeurIPS 2020

How many samples is a good initial point worth in Low-rank Matrix Recovery?

Review 1

Summary and Contributions: This paper investigates matrix sensing. The problem is non-convex but it is known that if enough linear measurements are taken, then there are no spurious local minima making it amenable to optimization. In this work the authors characterize the number of measurements that are required to prevent any given point from becoming a spurious local minimum. Subsequently, they connect the quality of an initial point (in terms of how close it is to the true matrix) with the number of measurements needed to prevent the existence of spurious local minima within its neighborhood.

Strengths: I find this work very interesting in terms of understanding how the optimization landscapes gradually shapes into making a non-convex problem optimizable. The paper is theoretically sound and extends results that were previously known only for matrices with rank 1. I believe that understanding the landscape of non-convex optimization problems, as well as why and under what initial conditions we can efficiently solve them constitute important questions that are core and of interest to the NeurIPS community.

Weaknesses: There are practical questions that emerge from this work that remain unanswered, for example, how would you find such an initial point that is good, in order to reduce the number of measurements that you need? However, I think that the value of this work is mostly on providing a theoretical understanding for the evolution of the landscape and less on being a practical solution. In a similar note, I have a question regarding the optimization: we know that if we have enough measurements we will not have a spurious local minimum in a ball around the true solution. However, does this imply that gradient descent, will converge to the true minimum? There could be other spurious local minima outside of the ball and there is no guarantee as to where GD will take us, in principle it can take us to a spurious local minimum outside of the ball, right?

Correctness: The paper is quite technical, as far as I can tell the claims seem correct and believable, same for the high-level ideas. I didn’t verify the appendix proofs.

Clarity: Overall, the paper is well-written. I would say that the related work is in an awkward place, I’d prefer to see it closer to the introduction (especially given that it mentions the contributions of the current paper in a high-level while it follows Section 3 where the authors presented the contributions formally), or at the end. Now it feels that it cuts the paper in half.

Relation to Prior Work: The authors explain how their current work extends previous results on both local and global guarantees and how their techniques differ from the ones of Zhang et al. that appears to be one of the most related works.

Reproducibility: Yes

Additional Feedback: Overall, I enjoyed reading this paper and I think it makes interesting and important contributions to our understanding of non-convex problems. I would like to see it in the conference. Minor typos: - l123: \grad f_A = X) = 0 instead of \grad f_A^*(X) = 0 - l139: \delta* = 1/5 > 1 instead of for r > 1 - l174 that this is both and necessary and sufficient

Review 2

Summary and Contributions: In this work, the authors came up with a closed-form expression for the threshold on the number of samples needed to guarantee no spurious local minima for the nonconvex matrix recovery problem. It provides an important trade-off between the quality of the initial point and the sample size.

Strengths: The paper is well written. I briefly went over the proofs and they looked correct. They were carried out systematically. Moreover, the geometric intuition behind the closed-form formulation of the lower bound on the local threshold made the analysis even more clear. In addition, this work also simplifies and extends the proof of [23] to accommodate the case where the rank is greater than unity. Overall, I think this work is going to be an important addition in terms of understanding the significance of a good initialization in nonconvex setup.

Weaknesses: This is not exactly a weakness. But I think title of the paper is way too generic. It would be great if the authors could make it a bit more problem specific.

Correctness: I think the paper is very well written and well presented and I enjoyed reading it.

Clarity: See above.

Relation to Prior Work: The authors understand many of the underlying subtleties and know the background literature well, and are able to use very specific results from the literature.

Reproducibility: Yes

Additional Feedback: The authors have addressed my comments in their response.

Review 3

Summary and Contributions: This paper considers a low-rank matrix sensing problem from linear measurements and studies the sample complexity needed to guarantee no spurious local minimum in a given parameter region. The approach leverages the notion of RIP constant of the measurement matrix.

Strengths: The theorem in this paper provides a restricted isometry constant threshold for matrix sensing problem which can guarantee that the true solution is the unique local/global minimum point within a neighborhood of the true solution, where the neighborhood radius epsilon and the rank r of the input matrix can be arbitrary. Hence, this theoretical result is more flexible and general than the previous ones, and thus provide insights on the trade-off between sample complexity requirement and initialization quality in the matrix sensing problem.

Weaknesses: (1) More information could be added to the introduction of the matrix sensing problem. For example, are A_1…A_m known? Are they fixed or random? What are applications of matrix sensing and maybe what is its relationship to machine learning? Such information perhaps makes this topic clearer and more motivating, especially to the people who touch this topic for the first time. (2) The key idea of this paper is to propose to use the sample complexity for eliminating spurious critical points in a given region and the sample complexity for eliminating the spurious local min globally as a lower bound for the sample complexity for eliminating local min in a given region, see eq (7). This is a correct lower bound. But I did not see why it is a tight one as the author claimed from Fig1. Is it a numerical evaluation or a rigorous proof? (3)The paper focuses on low-rank matrix sensing problem, which has a relatively simple loss that leads to closed form solutions. Can the results of this paper have any implications on deep models? In deep learning, one usually applies Gaussian-type initialization and it seems that the training can always find a global min. (4) I want to point out that the notion \delta_soc(W) can eliminate spurious local min for any set W. But the sample complexity lower bound obtained in the paper only applies to W that is close to the ground truth. I suggest the author clarify this. Moreover, existing studies on matrix sensing and phase retrieval have established sample complexities for guaranteeing the condition eq(9) by proving local strong convexity-related conditions. In this perspective, how does the main result here strengthen the existing results? (5) The numerical example is an explanation of theorem 5 under the specific 1-rank case. I think it may be better to demonstrate the theorem via more experiments. For example, for a certain (X,Z), use Corollary 6 to calculate the threshold for m, the number of samples. Then, select a grid of values of m, some below the threshold and some above. For each value of m, conduct multiple optimizations using randomly generated A_1… A_m and initialization X, and see how many experiments converges to Z. Then, show the proportion of the experiments that converges to Z for each value of m. Can it reach 100% when m exceeds the threshold?

Correctness: Yes, I think the theoretical method and empirical example are both correct.

Clarity: Yes, I think the paper is well written.

Relation to Prior Work: Yes, I think it clearly discussed the contribution of this work from the previous works.

Reproducibility: Yes

Additional Feedback: (1) What is the value of C_0 in Corollary 6?

Review 4

Summary and Contributions: This paper studies the low-rank matrix sensing problem. It characterizes a lower bound on the RIP constant so an arbitrary point is not a spurious local minimum. In order to make the problem more tractable, the authors relaxed the condition so they found a lower bound on the constant where the arbitrary point is not even a stationary point. The main contribution of the paper would be that the RIP condition can be relaxed from the global guarantee by Bhojanapalli et al. if we have an initial guess closer to the solution. This leads to relaxed sample complexity by a constant factor, i.e., m > C(\eps) nr (m : # samples, n : dimension, r : rank, \eps : distance between the initial guess and the solution, C(0) is still non zero constant)

Strengths: 1. Understanding the landscape of low rank matrix problems has been of interest to this community. Many inspiring results have been published recently. 2. The key result establishes a relation between the basin of attraction and RIP condition, and the fact that the RIP condition can be relaxed when the initial guess is close to the global solution. This characterization wasn't much discussed in the existing work.

Weaknesses: 1. I wonder if this work should also discuss how harder it gets to obtain such a good initial point (within B_\eps). While we characterized that we only need relaxed RIP condition when the initial point is closer to the global optimum, but wouldn't we need a stronger RIP condition to get such a good initial point? I wonder if the motivation of this work is only establishing a theoretical relationship between the RIP condition and the basin of attraction, or also related to any practical applications where this results can be effective. 2. Considering the existing analysis in the literature, I think that the analytical result is a bit incremental. The implication of Theorem 4 looks similar to [1, Lemma 4.2], which could also be interpreted as a lower bound on the RIP constant for a matrix to be a stationary point. And it can also lead to a similar property to L47: "a point X is more likely to be a local minimum if ... is closer to orthogonal", though the lemma wasn't used for this purpose. I couldn't find an interesting analysis technique for NeurIPS publication. [1] Bhojanapalli et al, "Global Optimality of Local Search for Low Rank Matrix Recovery"

Correctness: The analysis made sense, thought I couldn't fully checked the proof line by line.

Clarity: This paper is clear and well written. I didn't get any concerns in its clarity when reading the paper.

Relation to Prior Work: It is fairly clear that how this work differs from existing works.

Reproducibility: Yes

Additional Feedback: