NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:8871
Title:Structured and Deep Similarity Matching via Structured and Deep Hebbian Networks

Reviewer 1


		
EDIT: I have read the author response and appreciate the effort made by the authors to address the review suggestions. Unfortunately, the empirical results given in the author response appear to detract from, rather than add to, the significance of the work. Deep networks produce less useful features than shallow networks. Since the similarity-matching-for-shallow-networks method was already present in the literature, it is unclear what benefit the method being presented offers. The authors provided no comparison between their structured network approach and an unstructured network, so it is not possible to tell if their other contribution provides any empirical benefit. Without such a contribution, I find it difficult to recommend this paper in its current form for publication, particularly in light of the point mentioned in my initial review that the paper's main conceptual insights are drawn from prior work. Further experimentation or refinement of the method may address this issue, and I would love to see an updated paper that does so. ORIGINAL REVIEW: This paper generalizes recent work that connects similarity matching to particular Hebbian / Anti-Hebbian dense and shallow network architectures, to the structured and deep network case. The mathematical derivation is clear, particularly in the context of the papers it builds on. However, I have concerns about the novelty and significance of the result. 1. The feedback architecture is essentially the same as that used in contrastive Hebbian learning, a model not mentioned explicitly in the paper though cited implicitly (citation 20, Xie & Seung). Moreover, Xie & Seung proved that contrastive Hebbian learning can approximate (or in a certain regime be equivalent to) backpropagation. 
Thus this paper appears to, in effect, stitch together the insight of the earlier literature on relating similarity matching objectives to Hebbian / anti-Hebbian learning rules and the insight of older literature on the ability of contrastive Hebbian learning to use feedback and local learning rules to assign credit in deep neural networks. Moreover, Bahroun et al. arrived at a similar architecture excluding the feedback component. To me, this scale of theoretical contribution warrants publication only if it leads to meaningful empirical advances, which (see below) I don't find to be demonstrated in the paper in its current form. 2. More explanation is warranted for the value of similarity matching as an objective for deep networks. After all, the goal of many deep learning problems is to radically warp the similarity metric between data points! Similarity matching in the linear case has a nice interpretation in terms of PCA, but the authors do not give a similar interpretation to the present model (e.g. a relationship to autoencoders). 3. The empirical results are uncompelling. First, the role and value of feedback, which is the main contribution of this paper beyond Bahroun et al., is not demonstrated empirically. Second, the features learned by deep networks are not compared to those learned by a comparable shallow model. Thirdly, no quantitative evidence of the quality of the learned features (e.g. their utility in a downstream supervised learning task) is given. 4. This paper does not claim to aim for state-of-the-art empirical results, but rather to introduce a biologically plausible approach to unsupervised learning. As such, the biological plausibility of the model is very important. The proposed model involves an obvious weight transport issue, which the authors acknowledge and propose a workaround for (namely, that Hebbian learning of both forward and backward weights will tend to cause them to become transposes of one another). 
While an interesting argument, I am not entirely convinced that it is robust to biological noise, other sources of learning, etc.
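As a note on point 2, the PCA interpretation of linear similarity matching can be checked numerically. The following is a minimal sketch, not the authors' method. It assumes the standard offline objective ||X^T X - Y^T Y||_F^2 with X stored as a features-by-samples matrix, and all names (similarity_matching_cost, pca_projection, the synthetic data) are hypothetical choices made for illustration.

```python
import numpy as np


def similarity_matching_cost(X, Y):
    """Squared Frobenius mismatch between input and output Gram matrices."""
    return np.linalg.norm(X.T @ X - Y.T @ Y, ord="fro") ** 2


def pca_projection(X, k):
    """Project the columns of X (features x samples) onto its top-k principal directions."""
    # The left singular vectors of centered X are the principal directions.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k].T @ X


rng = np.random.default_rng(0)
n_features, n_samples, k = 10, 200, 3

# Synthetic anisotropic data (an assumption for illustration), centered per feature.
scales = np.linspace(3.0, 0.3, n_features)[:, None]
X = scales * rng.standard_normal((n_features, n_samples))
X -= X.mean(axis=1, keepdims=True)

# Output obtained by PCA projection to k dimensions.
Y_pca = pca_projection(X, k)
cost_pca = similarity_matching_cost(X, Y_pca)

# Baseline: projection onto a random k-dimensional orthonormal basis.
Q, _ = np.linalg.qr(rng.standard_normal((n_features, k)))
Y_rand = Q.T @ X
cost_rand = similarity_matching_cost(X, Y_rand)

# Singular values of X; the discarded ones determine the optimal cost.
s = np.linalg.svd(X, compute_uv=False)
residual = np.sum(s[k:] ** 4)

print("cost with PCA projection:   ", cost_pca)
print("cost with random projection:", cost_rand)
print("sum of discarded s_i^4:     ", residual)
```

In this sketch the PCA projection attains a cost equal to the sum of the fourth powers of the discarded singular values of X, which is the best achievable by any k-dimensional output (the Gram matrix Y^T Y has rank at most k), while the random projection does worse. The question raised in point 2 is what plays the analogous role in the deep, structured setting.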

Reviewer 2


		
——Rebuttal Response—— It looks like my score is a bit of an outlier here. I took some time to read the paper, all the discussions, and the authors’ rebuttal one more time. While I hear the objections, my position remains unchanged. This is a very interesting paper and I think that it should be accepted. This paper is indeed a first look at the problem, rather than a fully conclusive and executed research program, but I think it contains an interesting idea. Compared to the previous work on similarity matching, this is the first one that extends the similarity matching cost function to networks with several layers and structured connectivity. In my view, this is a huge step. It is also nice that the authors provide a reasonably “biological” implementation of their optimization, yes, with a little bit of weight transport, but nothing is perfectly biological anyway. Regarding Table 1 of the rebuttal, I would encourage the authors to invest enough time to experiment with the hyperparameters and clearly state in the final version where they stand. If it turns out that deep architectures learn representations leading to better accuracy, this is an interesting statement. If it turns out that shallow architectures lead to better accuracy, this is an equally interesting and important statement. Since the proposed algorithm is different from traditional backpropagation learning, both results are possible. I agree with reviewer 3 that the paper is hard to follow. I would encourage the authors to invest time in making the presentation clearer, as they promised in the rebuttal. ———————— There is a large body of work devoted to approximating the backpropagation algorithm by local learning rules with varying degrees of biological plausibility. On the other hand, the amount of work dedicated to investigating various biologically-plausible learning rules and their computational significance for modern machine learning problems is rather small. 
This paper tackles the second problem and demonstrates that a network with structured connectivity and a good degree of biological plausibility is capable of learning latent representations. This is a significant statement. For this reason I argue in favor of accepting this paper. Specifically, the authors study the similarity-matching objective function, and show how it can be optimized in deep networks with structured connectivity. To the best of my knowledge these results are new. While the general idea and results are clear, I have several technical questions throughout the paper: 1. There seems to be a misprint in Eq. (2). Is y_i the same variable as r_i? 2. I do not understand the “Illustrative example” in section 6.1. How were the inputs constructed? What does it mean that they were “clustered into two groups and drawn from a gaussian distribution”? There are many ways one can do this. Is it possible to show examples of the inputs? Without this, it’s hard to tell how non-trivial these results are. In general, this section would benefit from more detailed explanations, I think. 3. Figure 3 is impressive, but the description of how it is constructed is insufficient, in my opinion. I would appreciate a clear explanation of the sentence “Features are calculated by reverse correlation on the dataset, and masking these features to keep only the portions of the dataset which elicits a response in the neuron”. Maybe in supplementary materials? 4. What is the interpretation of the features in the first layer? Are they PCA components of the inputs or are they something else? 5. I understand that learning rules (2) are needed for the optimization of the similarity-matching objective (3). Out of curiosity, would something like figure 3 emerge if in equation (1) the weights L were fixed and all equal to a positive constant (global lateral inhibition), or is it crucial that the weights L are learned? 
In a recent paper https://www.pnas.org/content/116/16/7723, for example, a network with constant L was shown to lead to high quality latent representations. Are the representations reported in this submission in some sense “better” than the ones reported in the above paper (since the weights L are learned), or is learning the weights L simply necessary to make a connection with the similarity matching objective function?

Reviewer 3


		
*** REBUTTAL RESPONSE *** Thank you for the sincere rebuttal. I've changed my review to marginally above accept. I look forward to revisions for clarity and to the expanded empirical evaluations as mentioned in the rebuttal. *** ORIGINAL REVIEW *** As observed in previous studies, certain similarity matching objectives lead to a biologically plausible (local) hebbian/anti-hebbian learning algorithm. This paper introduces a "deep" and "structured" similarity matching objective and derives the corresponding local hebbian/anti-hebbian learning algorithm. The objective is structured in the sense that constraints are introduced on which input and output neuron pairs should contribute to the matching objective. In particular, for image inputs local spatial structure is imposed, akin to convolutional filters. The structured objective is a generalization and becomes the unstructured objective in the limit of no constraints. The similarity matching objective is deep in the sense that it's the sum of similarity measures over multiple layers of a multi-layer neural network. The paper performs experiments on a toy dataset and on labeled faces in the wild. The main results are the filters learned by the network, which resemble a clear hierarchy of face parts, although it's not clear how the filters were visualized exactly. w.r.t. originality: In a sense the deep and structured extensions seem to be "just" imported and brought together from previous studies in sparse coding (no small feat to be sure). It is however very interesting to see what this deep, structured hebbian/anti-hebbian learning is actually optimizing for. w.r.t. quality and clarity: I think the paper is very hard to follow and has several quality issues. In my opinion it could be a lot better. The introduction and abstract are fairly well written, and explain the idea pretty well. However, in the following sections much of the math is poorly introduced and hard to follow. 
For instance, W and L are not defined in eq. 1 and y_i is not defined in eq. 2. Eq. 2 is given as self-evident, but isn't shown until eq. 9. A "leak term" is mentioned on line 77 which is never defined. I couldn't follow the parts on regularization in eq. 3 and 4, where several results are given as self-evident. At this point in the text the links between eq. 1, 2 and 3 have not yet been made, but the regularization discussion mentions these links as self-evident and makes extensive claims. I'd suggest the authors either take the time and space needed to actually introduce and explain the equations, or simply give them as proven, give the relevant reference and explain what they mean. If you read the referenced papers it becomes more and more clear what the authors mean by their math, but the paper is simply not understandable as a self-contained unit in my opinion. After having spent 2 days trying to understand section 2 I gave up, and skimmed the rest of the math, and tried to understand the idea instead. Maybe I'm just not intelligent enough, although I think I'm fairly close to the average reader, and I think I was more patient than the average reader would be. I suggest significant editing of the manuscript, with a focus on how understandable it is to the average reader and what parts of the math are truly significant to the idea. Nitpick: It's "The credit assignment problem is ..." not "Credit assignment problem is ..." The experimental results look good, but are presented without much analysis. The paper introduces several hyper-parameters, e.g. the gamma parameter which determines feedback (top down) strength, the regularization functions, the structural constraints, etc. but doesn't examine how these affect the results or whether they're important. The results are not compared to any other methods and it's not exactly clear how the experiments were performed. I would not be able to reproduce the experiments. 
I want to like this paper, since I think the subject is fascinating, and the direction important, but in the current form I can't recommend it for publication.