Review for NeurIPS paper: Regularized linear autoencoders recover the principal components, eventually

NeurIPS 2020

Regularized linear autoencoders recover the principal components, eventually

Review 1

Summary and Contributions: The presented

Strengths: The paper is clearly structured and relatively easy to follow. The analysis of the loss landscape for the non-uniformly l2-regularized linear autoencoder appears thorough and sound (I have only superficially checked the proofs in the appendix for plausibility). The idea of using non-uniform l2 regularization for obtaining ordered representations is intriguing and could prove useful in other applications of neural networks. Additionally, the paper presents a deterministic variant of the regularization term given by nested dropout. The introduced modification to the gradient-based update rule could improve convergence of neural networks, where the problem of weakly-broken rotational symmetry exists.

Weaknesses: My main concern with this submission lies in its utility for the community. I do not see how the presented findings can be generalized to any neural network used in practice. On the other hand, I acknowledge that the main benefit of research is not in the immediate application of its findings. Nevertheless, I would appreciate a discussion of this aspect. For example, the utility of the proposed update rule should be discussed more in-depth. From what I understand it is only useful for problems withouth mini-batching and where there is a subspace with weakly

Correctness: The derivations and proofs are correct, as far as I can tell.

Clarity: The paper has a clear and intuitive structure, which makes it easy to follow its logic. The writing is somewhat dry, as can be expected of a mathematical paper.

Relation to Prior Work: The relevant work I know of is covered with the exception of papers investigating the link between L2 regularization and dropout (e.g. Wager et al. "Dropout Training as Adaptive Regularization"), which I believe to be relevant here.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: I have read the rebuttal. Thank you **** The authors study symmetry breaking properties of two regularization schemes for linear autoencodes, and provide a new algorithm to circumvent the slow convergence stemming out of illconditioning introduced by such a scheme.

Strengths: The paper is very well written and delineates its contributions well. The paper makes an important contribution to the community – discussing the symmetry breaking properties of certain regularization schemes, closely studying convergence properties based on loss landscape and designing algorithms to circumvent the problematic issues that slow down the convergence. The authors build upon the previous work of Kunin et al who showed a slightly weaker form of symmetry breaking (still retains orthogonality) using uniform l2 regularization and show that non-uniform regularization breaks orthogonality and only retains reflection. This is intuitive, since penalizing different directions differently does not allow for orthogonality anymore. While the proof methodologies are not new, the proofs themselves are novel and the results interesting. The close analysis of bad conditioning that arises out of the non-uniform regularization and the two-phase split of convergence with the latter slower phase attributed to breaking the rotational symmetry. The authors then modify the gradient descent algorithm to address this issue. This is not straight-forward, as noted by the authors. The noted connections of this proposed algorithm to Hebbian algorithm is also interesting.

Weaknesses: Minor gripes: The links for references did not seem to be working for some reason, please fix as it makes jumping around for references and equations harder. Also, it would be nice if the authors could expand more on future directions. E.g. the development of the augmented gradient method as equivalent to applying a gauge field, this sentence adds zero information without a couple of more sentences explaining the said field, and putting quotes around gauge field does not help at all. The authors have shied away from probabilistic interpretation studied in Kunin et al. It seems the non-uniform regularization should have a straightforward generalization. What about the nested dropout ? Is there a reason to not make these connections in this work?

Correctness: I would think so. I have gone through the main claims/proofs.

Clarity: Definitely yes.

Relation to Prior Work: Yes, the connection to previous works is well-highlighted, including the attribution to results, as well as proof techniques.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: Post-rebuttal update: Thank you for clarifying on the motivation of this work as well as the mistake in proof. I'm updating my score since the two-stage convergence behavior is interesting, and the proposed algorithm has interesting connections to prior work. However, I'm not sure if the modified linear AE model is the best model to understand the slowness of learning NN representations, as the regularization schemes seem somewhat artificial and doesn't seem to correspond to any commonly-used algorithms. From a probabilistic perspective, it's also unclear why we would assign an arbitrary non-uniform prior when we don't have any knowledge about their scales, e.g. is there any gain from choosing a more correct prior about the scales? I think further discussion on these issues would greatly enhance this work. ====== Building upon Kunin et al (2019), this work investigates ways for the regularized linear autoencoder to recover the original principal components (up to inconsequential ambiguities such as permutation). It is shown that both non-uniform L2 regularization and nested dropout lead to such recovery, but ordinary GD using these objectives suffers from slow convergence. An alternative optimization algorithm is developed to accelerate convergence, and is connected to the generalized Hebbian algorithm.

Strengths: This work established new identifiability results, identified the convergence issue of OGD, and proposed an optimization algorithm connected to GHA.

Weaknesses: * Proof of theorem 1 appears incomplete; see below. * Comparing with previous work, the algorithms investigated in this work are less related to those used in practice. While I don't think this issue is deal-breaking, it nonetheless make the scope of this work more limited.

Correctness: I'm not sure if the proof for Theorem 1 is correct: on L593-594, you claim that if \Lambda is a diagonal matrix with positive diagonal elements, v^T \Lambda AA^T v >= 0 will always hold. This is not true; consider AA^T=[[1,-1],[-1,1]], v=[1,0.5], \Lambda v=[1,2].

Clarity: This paper is well written.

Relation to Prior Work: The related work section is mostly thorough, and difference from prior work is clearly discussed. However, * The following concurrent work (according to NeurIPS guidelines) investigates the same identifiability issue, and should probably be discussed in a revised version: Oftadeh et al, Eliminating the Invariance on the Loss Landscape of Linear Autoencoders. ICML 2020. * Regarding connection between linear VAE trained with ELBO and pPCA, you should probably cite the following works, which are earlier than [17]. Dai, Bin, et al. "Connections with robust PCA and the role of emergent sparsity in variational autoencoder models." The Journal of Machine Learning Research 19.1 (2018): 1573-1614. Rolinek, Michal, Dominik Zietlow, and Georg Martius. "Variational autoencoders pursue pca directions (by accident)." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

Reproducibility: Yes

Additional Feedback: I will update my review once my question regarding correctness is clarified.

Review 4

Summary and Contributions: The authors further theoretical understanding of learning the optimal representations for data using linear autoencoders (LAEs) with different gradient based optimizers. They provide two different regularization methods, non-uniform L2 regularization and deterministic nested dropout, to allow for convergence to the optimal axis-aligned representation. They provide explanations for why such convergence is slow via condition number analysis, and present a modified learning algorithm that greatly enhances the speed of convergence to the optimal axis-aligned solution.

Strengths: The authors further previous work regarding breaking symmetry of the set of optimal solutions for a LAE. The contribution is significant, as they further understanding of why convergence towards the optimal solution might be slow, and they provide an algorithm that shows promising empirical results. The theoretical grounding for the algorithm is good, as they are able to provide a local convergence rate.

Weaknesses: A discussion regarding how the regularization matrix Lambda is chosen for the non-uniform L2 regularized objective would be helpful, since it is no longer just one regularization parameter. There is an assumption made stating that the regularization terms should be smaller than the respective singular values during the characterization of the stationary points (Theorem 3). The authors should explain how the weights are known to satisfy this condition. An updated figure like figure 1 showing the convergence trajectories of the various methods on the loss landscape would be helpful.

Correctness: The claims and method are correct and seem to work empirically. Proofs for all lemmas and theorems are present in the supplementary methods and are sound.

Clarity: The paper is well written and it is easy to follow logically. There are a few minor inconsistencies in the writing. In the experiments section, the text of the work states that convergence in figure 4 is plotted until a value of .25 is reached, but in the figure a value of .3 is stated. It is slightly unclear whether the models that use the augmented gradient algorithm also uses an Adam optimizer, or if the models trained using an Adam optimizer only have a different objective (i.e. non-uniform L2 or deterministic dropout) since the Adam trained models seem to converge the best.

Relation to Prior Work: A good background discussion is provided on prior work and how this work expands upon prior work. The work is clearly contextualized.

Reproducibility: Yes

Additional Feedback: ** after author rebuttal ** Thank you for the response. I suggest the authors include in the revised text the discussion on choosing lambda matrix in the rebuttal.