__ Summary and Contributions__: The paper proposes a layer-wise objective based on the concept of the information bottleneck for backpropagation-free supervised learning in deep neural networks.
Moreover, the authors show how their learning rule can be related to a possibly biologically plausible learning rule.
Finally, the authors analyze the effectiveness of their proposed learning rule on MNIST and CIFAR10 where they show good performance.

__ Strengths__: The authors' back-propagation-free kernel approach to representation learning in deep neural networks is laudable. It holds the potential not only to discover new learning algorithms that do not suffer from backward locking but also to generate a deeper understanding of the mechanistic underpinnings of representation learning in deep neural architectures.

__ Weaknesses__: The main weakness is that the authors did not empirically verify the various approximations that go into the learning rule, other than showing that the learning rule works well in a final training setup. A better understanding of how well the chosen approximations hold (e.g., Eq. (6)) would help explain why the proposed method works.
Moreover, a direct comparison of the proposed learning algorithm to similar previous work, e.g., Nøkland and Eidnes (2019), using the same network and dataset is missing.

__ Correctness__: The analytical derivations seem sound. However, for some of the approximations, many readers will lack a good intuition for how much error they introduce.

__ Clarity__: Overall, the paper is well written and quite clear. However, the abstract could be reworked by reducing the amount of slang, e.g., "learn to squeeze as much information as possible," and emphasizing the premise of the present study.

__ Relation to Prior Work__: Yes, previous work is mentioned and cited appropriately. The only suggestion for improvement would be to discuss other greedy layer-wise training approaches.

__ Reproducibility__: Yes

__ Additional Feedback__: Consider discussing (Mostafa et al., 2018) as the study underlying (Nøkland and Eidnes, 2019).
For central approximations such as Eq. (6), it would be great to provide empirical support for how well the approximation is justified. For instance, sample pairs of values for the left-hand side and the right-hand side and show them on a scatter plot. The point is to establish whether this is indeed a good approximation or whether the replacement simply gives rise to a learning rule that happens to work.
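The kind of check meant here could be sketched roughly as follows. This is only an illustration: the helper name `approximation_report` and the demo arrays are hypothetical, and the demo data are random placeholders, not values of Eq. (6). In practice `lhs` and `rhs` would hold the two sides of the approximation evaluated on the same sampled pairs.

```python
import numpy as np

def approximation_report(lhs, rhs):
    """Summarize how closely sampled values of `rhs` track `lhs`."""
    lhs = np.asarray(lhs, dtype=float).ravel()
    rhs = np.asarray(rhs, dtype=float).ravel()
    if lhs.shape != rhs.shape:
        raise ValueError("lhs and rhs must contain the same number of samples")
    # Pearson correlation: 1.0 means the two sides are perfectly linearly related.
    corr = float(np.corrcoef(lhs, rhs)[0, 1])
    # Least-squares fit of rhs against lhs; slope 1 and intercept 0 is an exact match.
    slope, intercept = np.polyfit(lhs, rhs, deg=1)
    # Mean relative error, guarded against division by zero.
    rel_err = float(np.mean(np.abs(rhs - lhs) / np.maximum(np.abs(lhs), 1e-12)))
    return {"corr": corr, "slope": float(slope),
            "intercept": float(intercept), "rel_err": rel_err}

# Placeholder data only (assumption): a noisy copy stands in for the approximation.
rng = np.random.default_rng(0)
lhs_demo = rng.normal(size=500)
rhs_demo = lhs_demo + 0.1 * rng.normal(size=500)
report = approximation_report(lhs_demo, rhs_demo)
```

Plotting `lhs` against `rhs` with any plotting library then gives the scatter plot; the correlation, fitted slope, and relative error summarize the same information in numbers.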
pg3.98: "But if knew how" ... insert "we"
Doesn't Eq. (11) break the plausible online character?
The steps from Eq. (13) to (16) were a tad fast for me. It wasn't clear whether this temporal variant is verified empirically in the experiments section or whether it is just an addendum to the theory.
I suggest also mentioning the results obtained with Adam in the main manuscript instead of postponing them to the appendix.
Finally, the notion of "grouping" was not clear to me and not explained in the text. However, since grouping is applied generously in the experiments section, this term should be explained more clearly in the main text.
# UPDATE: Thanks for the clarifications and for addressing my suggestions.

__ Summary and Contributions__: Building on the information bottleneck principle and kernel methods, the authors propose a layer-wise learning rule with a three-factor structure (pre- and post-synaptic activities plus a third factor) that can be applied to training artificial neural networks.
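For readers unfamiliar with the term, a generic three-factor update can be sketched as below. This is the textbook form only, not the rule derived in the paper; the function name `three_factor_update`, the scalar `third_factor`, and the learning rate `lr` are illustrative assumptions.

```python
import numpy as np

def three_factor_update(w, pre, post, third_factor, lr=0.01):
    """Generic three-factor update: dw[i, j] = lr * third_factor * post[i] * pre[j]."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    # The outer product is the local Hebbian term; the scalar modulates it globally.
    return w + lr * third_factor * np.outer(post, pre)

# Toy example with hypothetical activities for 3 inputs and 2 outputs.
w = np.zeros((2, 3))
w_new = three_factor_update(w, pre=[1.0, 0.0, 2.0], post=[1.0, -1.0],
                            third_factor=0.5, lr=0.1)
```

The first two factors are local to the synapse, while the third is a signal shared across synapses.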

__ Strengths__: This is a very interesting paper that contains some new ideas as well as tricks that improve performance. Experimental evaluations demonstrate that the proposed learning rule can learn useful representations in hidden layers for ANNs.

__ Weaknesses__: I think this paper would benefit from a more comprehensive connection to existing ideas in the field of biologically-plausible learning. Specifically:
1. There is a body of literature on biologically plausible methods for training feedforward ANNs with labels, see [R1] and [R2] for example.
2. Biologically-plausible learning rules that do not require label information have been studied in [R3] for shallow networks, and in [R4] for training multiple hidden layers of representations. The latter work, although targeting similarity search as a downstream task and not classification, also extensively uses divisive normalization for images. Similarly to this work, this seems to be important for achieving high accuracy (precision).
3. The empirical evaluations (Table 1) need to be compared with previously published results in biologically plausible settings. For instance, [R3] reports better than 98.5% accuracy on MNIST for a fully connected network. Refs. [R1, R3] report slightly better than 50% accuracy for fully connected architectures on CIFAR-10.
Refs:
R1. https://arxiv.org/abs/1412.7525
R2. https://www.frontiersin.org/articles/10.3389/fncom.2017.00024/full
R3. https://www.pnas.org/content/116/16/7723
R4. https://arxiv.org/abs/2001.04907

__ Correctness__: I don’t understand the approximation in Eq 6 leading to the pHSIC learning rule. Could the authors please provide some intuition/computational reason on why this approximation is a meaningful thing to do?
Same question pertaining to approximations leading to equation 8 from equation 7.
-------Post Rebuttal-------
Thanks for the clarifications! Although I agree with some of the other reviewers' comments that this paper is not perfectly written, I still think that the originality of the proposed learning rule outweighs the problems with the presentation (which need to be fixed). I am inclined to keep my initial scores and still vote for acceptance. I hope that the authors take seriously the issues raised during the review: adjust the claims on the biological plausibility of the proposed method, and add the missing references.

__ Clarity__: Clarity can be improved. Certain technical details are missing. For example, I had to read Ref. [6] from the paper to understand the details of the convolutional architecture implementation. It would be nice if the authors made the paper self-contained by specifying the sizes of the convolutional and pooling filters, the strides, etc.
I am not sure about the overall sign convention in Eq. (12), but there is a misprint in Hebbian/anti-Hebbian learning in one of the two instances in lines 133-135.

__ Relation to Prior Work__: As I explained above, the paper can be improved by more comprehensively connecting and comparing this current work with previously published results.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes a new biologically plausible learning rule based on the Hilbert-Schmidt Independence Criterion (HSIC). According to the authors, the new rule improves on past biologically plausible learning rules by unifying the training and inference stages and by reducing the need for large amounts of data. The rule is then demonstrated on the MNIST and CIFAR-10 datasets.
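As background, the standard biased empirical HSIC estimate, trace(K H L H) / (n - 1)^2 with centering matrix H, can be sketched as below. The Gaussian kernel, the bandwidths, and the function names are assumptions for illustration; the sketch does not reproduce the paper's learning rule or its approximations.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel over the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    # Clip tiny negative distances caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    K = gaussian_gram(x, sigma_x)
    L = gaussian_gram(y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

# Toy check on placeholder data: a dependent pair versus an independent pair.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = rng.normal(size=200)
dependent = hsic_biased(x, x)
independent = hsic_biased(x, noise)
```

On such toy data the dependent pair should give a clearly larger value than the independent one, which is the property an HSIC-based objective relies on.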

__ Strengths__: The rule proposed does indeed appear biologically plausible. Moreover, the rule depends on a global scalar feedback term and pairwise cross-layer activations - both of which can be interpreted as biological substrates implicated in learning and plasticity.

__ Weaknesses__: The main weaknesses of the paper are (1) the lack of comparison to other biologically plausible rules, (2) the simplicity of the datasets the rule is tested on, and (3) several unsubstantiated points from the abstract.
(1) While the paper compares the performance of the proposed rule with various kernels across four small datasets, it lacks a thorough comparison to other state-of-the-art biologically plausible learning rules.
(2) Even on the relatively simple datasets of MNIST and CIFAR10, the proposed methods significantly underperform small networks trained with SGD. Moreover, the most biologically plausible of the proposed kernels, the Gaussian kernel, performs the worst of all methods displayed. This suggests that the proposed rule does not scale to larger problems.
(3) The abstract suggests the new rule is novel due to its superior performance in the absence of large amounts of data and the absence of an inference/learning bifurcation. Neither of these claims was adequately supported by the data presented. First, no evidence was supplied that these new methods perform better than competing biologically plausible rules when trained on little data. Second, the inference/learning bifurcation contained in the weight updates suggested here did not appear any different from that of previously proposed methods.

__ Correctness__: The derivation of the biologically plausible rule appeared sound and correct.

__ Clarity__: This paper is well written; however, I would have appreciated more comparison with other competing methods.

__ Relation to Prior Work__: The authors contrast the newly proposed learning rule with previous rules such as feedback alignment on two primary points; however, these two points could benefit from further elaboration.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: A 3-factor Hebbian learning rule is proposed that provides a biologically plausible alternative to BP for training feedforward artificial neural networks.

__ Strengths__: The derivation of the rule integrates a number of clever ideas. The performance is tested for CIFAR10 on CNNs, where it comes very close to that of BP.

__ Weaknesses__: The stated goal is to provide a biologically plausible alternative to BP. Unfortunately "biologically plausible" is a rather subjective term. For example, one could point to several aspects of the resulting paradigm which someone could find to be not biologically plausible, such as
-feedforward rather than recurrent NNs
-CNNs (weight sharing, max-pooling)
-use of the SELU activation function, which takes both positive and negative values
-transfer of the supervised learning paradigm from ML, rather than taking into account that brains are likely to combine a lot of unsupervised and self-supervised learning with occasional labels from a teacher
Also, the resulting Hebbian plasticity rule looks really complicated, and it is not clear how it can be related to experimental data for synaptic plasticity.

__ Correctness__: As far as I can see, yes.

__ Clarity__: The writing is rather technical, but pretty clear.

__ Relation to Prior Work__: I am missing a discussion of Broadcast Alignment (Lillicrap et al.) and methods proposed by Bengio et al. as biologically plausible alternatives to BP, and especially a performance comparison with these.
The new method is also not compared with other approaches based on Information Bottleneck or kernels.

__ Reproducibility__: Yes

__ Additional Feedback__: