Paper ID: 2540

Title: Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates

The paper provides an analysis of a specific alternating-updates scheme for the non-negative matrix factorization problem, for input data following a certain generative model. The analysis is more general than earlier ones in the sense that it requires neither separability (e.g., Bittorf et al., 2012) nor incoherence assumptions, while still covering noisy scenarios. On the other hand, a generative model for the coefficients and an assumption on the initialization are required.

Analysis of alternating minimization for NMF is a long-standing problem, and the authors make a laudable attempt here. However, 'alternating minimization' here does not refer to the commonly used alternating least squares approach but to a different algorithm devised by the authors. My main concern is that the paper contains neither empirical results on the performance of the algorithm under study nor (numerical) illustrations of the many theoretical developments made in the paper. This makes it particularly hard to evaluate the practical impact of the analysis. The paper is not easy to read: there are many definitions, conditions, and remarks about their meaning and how they interact. Eventually, I found it hard to keep track of all of them and to arrive at an interpretation of the stated results with regard to the scope of problems they actually apply to. In my opinion, the paper would profit from moving more of the technical content to the appendix, using the freed space to comment on the conditions, and possibly stating simplified versions of the theorems that apply to special cases of interest.

2-Confident (read it all; understood it all reasonably well)

This paper provides theoretical guarantees for nonnegative matrix factorization using a simple and natural alternating minimization procedure. Previous theory work on this subject has fallen into three categories:

-- provided inefficient algorithms, with strong recovery guarantees

-- analyzed efficient combinatorial (e.g., clustering-based) procedures under strong separation conditions, with stronger recovery guarantees than those of the present paper

-- analyzed a rather different alternating minimization scheme

This is a significant contribution to the literature on NMF:

-- it analyzes a simple and natural scheme

-- it provides lots of intuition for the analysis

-- it provides many insights into the NMF problem

The condensed NIPS format makes it hard to include a lot of detail, but I think the authors have done a good job of deciding what to present. I wish they had said a bit more about initialization, though. Their analysis requires the initializer of the alternating minimization procedure to already be a pretty good solution (and this is a standard feature of such analyses). It would be good if they could (1) comment on what methods could provably generate a good initialization of this form, and (2) provide an example of a bad initialization that would (provably) force the procedure to go severely astray.

One other suggestion for the presentation: the authors have chosen to present their result in maximum generality, presumably because they feel this emphasizes the significance of their contribution. But the real contribution is a nice style of analysis for a simple method, and this would be enhanced if they simply cut out a lot of irrelevant complexity by focusing on a core subcase (e.g., simpler assumptions on X and on the noise). This would greatly reduce the number of confusing constants floating around, and the fully general result could simply be asserted as an afterthought and proved in an appendix.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper investigates the problem of non-negative matrix factorization (Y = AX), where A is deterministic and the columns of X are non-negative and sampled from some distribution. The paper proposes an algorithm that can recover the ground-truth A from noisy observations with provable guarantees, given a good initialization and some conditions on the matrices A and X.

- In Theorem 1, the result claims that if the norm of N_0 is less than c/10, then the algorithm can estimate A up to an error of \epsilon + c. Noting that N_0, as defined in condition A3, is approximately the error of the initial estimate, the final estimation error upper bound (c) is greater than the initial error (c/10). It is not clear to me that the analysis and algorithm actually reduce the estimation error. I would suggest the authors verify that Theorem 1 indeed makes sense. (Maybe I have a wrong interpretation; please correct me if that is the case.)

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

The paper proposes an algorithm that updates features and weights alternately to solve a special version of the non-negative matrix factorization problem (one that does not require the feature matrix to be non-negative). Assuming a generative model of the data, the algorithm is guaranteed to recover the ground truth under mild conditions.

This paper solves a special kind of nonnegative matrix factorization problem (Y = AX) using alternating updates; the main contribution is that it provably works in the noisy case, where most existing work tends to fail. The paper is well written, easy to follow, and comes with detailed supplementary information and strong technical results.

My main concerns are: (1) The paper only requires the weight matrix X to be nonnegative. Although the authors explain the difference between their problem and standard NMF (where the feature and weight matrices A and X are both nonnegative), there are plenty of applications requiring both A and X to be nonnegative, and for symmetric NMF we further require X = A^T. To the reviewer's understanding, as the authors have claimed, if the ground truth is nonnegative the algorithm should still be able to recover it; but I am not sure whether this model can cover the symmetric NMF case. As we know, in the noisy case standard NMF does not have a unique solution, so it would be better for the authors to discuss these two cases to demonstrate the generality of their algorithm. (2) There is no experimental/numerical validation of the proposed algorithm. It would be better if some experiments on synthetic and real data were conducted to validate the performance of the algorithm. Furthermore, if the authors could experiment on the standard NMF and symmetric NMF cases, it would be more convincing.

Some minor problems: (1) The citations are not in NIPS format but in theoretical computer science format. (2) Line 94: I suppose 'neurally plausible' should be 'neutrally plausible' or 'naturally plausible'. (3) In the SI, lines 411, 441, 470, 522, and 697 contain typos in the Lemma/Theorem references.

2-Confident (read it all; understood it all reasonably well)

The paper shows recovery guarantees for non-negative matrix factorization under adversarial noise and unbiased noise. The algorithm proceeds by alternately decoding and encoding, which correspond to updating the weight matrix and updating the feature matrix, respectively. The analysis requires some conditions, including linear independence among feature columns, independent sampling of weights for different features, balanced second moments of the weights, and a good initialization.
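For concreteness, the decode/encode alternation described above can be sketched generically as follows. This is an illustrative stand-in, not the paper's actual update rules: the decode step here is a least-squares solve projected onto the nonnegative orthant, and the encode step is a plain gradient step on the feature matrix; the function name and step size are my own.

```python
import numpy as np

def alternating_nmf(Y, k, iters=200, lr=0.1, seed=0):
    """Generic alternating decode/encode sketch for Y ~ A X with X >= 0.

    Hypothetical illustration: not the update rules of the paper under
    review. A is unconstrained (as in the reviewed model); only the
    weight matrix X is kept nonnegative.
    """
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    A = rng.standard_normal((d, k))  # stands in for a "good" initializer
    for _ in range(iters):
        # Decode: estimate nonnegative weights X given the current A
        # (least squares, then projection onto the nonnegative orthant).
        X = np.maximum(np.linalg.lstsq(A, Y, rcond=None)[0], 0.0)
        # Encode: gradient step on A for the squared reconstruction error.
        A -= lr * (A @ X - Y) @ X.T / n
    return A, X
```

Even this toy version makes the reviewer's point tangible: its behavior hinges on the initialization of A and on the distribution of the weights, which is exactly what conditions A0-A3 formalize.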

The recovery guarantee for NMF under mild conditions is very important for the machine learning community. This paper provides a solution under three main conditions, as follows: 1) the conditions on the feature matrix, A0 and A1, are very mild; 2) the condition on the weights, A2, is also reasonable except for the independence requirement; 3) the condition for a good initialization, A3 — however, it is unclear whether this initialization can be achieved. The proof technique seems to be novel, i.e., different from the popular techniques used for alternating minimization in matrix completion or for tensor methods, so it will probably be useful for the community. Therefore, I vote poster-level for novelty and oral-level for potential impact.

I have two concerns about this paper: 1. It seems that the theorems hold only with some (high) probability, which is not stated in the theorems. The reason is that the encoding step uses the empirical expectation rather than the population expectation. Even with samples as large as claimed on Lines 643 and 647 in the appendix, as long as the number of samples is finite, the theorems will not hold almost surely without more assumptions on the weights. 2. If I understand Sec. 5 correctly (Sec. 5 is a little harder to read), the goal of this section is to provide solutions when the weights don't satisfy Assumption A2. However, according to Theorem 6, A^* is required to satisfy Assumption A0, which implies A^*D will not satisfy A0. Therefore, we can't apply Algorithm 1 afterwards to get A^* as claimed on Line 293. This would invalidate the whole of Sec. 5. Due to the above two concerns, I vote poster-level for the technical quality. The paper is not hard to understand, so I vote poster-level for clarity.

Some minor comments: 1. At the end of the paper (Lines 292-297), the authors mention how to initialize, but the passage lacks references and a clear statement. 2. It would be better to provide some experimental results to verify the theoretical analysis, even if only on synthetic data. 3. The lemma number is missing near Line 411 in the appendix. 4. Line 294: pontentially -> potentially.

========= To authors' feedback: I have read the feedback. The feedback clarified my concerns.

2-Confident (read it all; understood it all reasonably well)

The authors propose an algorithm for non-negative matrix factorization that alternates between decoding and encoding steps, performing a gradient-descent-like update of the model parameters.

The authors propose an algorithm for non-negative matrix factorization that alternates between decoding and encoding steps. This idea is not new: it was already proposed by Clevert et al. (2015), who used Ganchev et al.'s posterior regularization to enforce non-negativity constraints on the posterior. Previous work is not appropriately cited: Clevert et al., Rectified Factor Networks, NIPS 2015. The authors claim that a major feature of their algorithm is its significant robustness to noise, but they do not substantiate this claim. The paper lacks an experimental section; not even a basic comparison with RFNs or with standard NMF is presented. The model seems analytically well studied, but due to the missing experiments I am not convinced of its merits.

2-Confident (read it all; understood it all reasonably well)