NeurIPS 2020

### Review 1

Summary and Contributions: 1) Summary: The authors propose a gradient-based supervised learning method, the activation- and timing-based learning rule (ANTLR), that improves network accuracy in situations where the number of spikes is constrained, by precisely considering the influence of individual spikes. 2) Contributions: The key idea is unifying two learning methods (activation-based and timing-based) suited to different coding schemes (temporal coding and rate coding). The authors run a series of experiments demonstrating that the proposed method achieves higher performance, in terms of both accuracy and efficiency, than the activation-based and timing-based approaches.

Strengths: The proposed approach is interesting, since it suggests a way to process patterns that are encoded with both a rate code and a temporal code. However, the method proposed in this paper merely integrates the two learning methods, which is not innovative enough. Since the experiments and theoretical analysis are sufficient and properly assess some of the points made by the authors, I think the paper is an acceptable contribution for NeurIPS.

Weaknesses: The method proposed in this paper merely integrates the two learning methods, which is not innovative enough.

Correctness: In the third paragraph of Section 4.2, the authors mention that "The number of spikes used to finish a task was usually not presented in previous works, .... " However, many learning algorithms based on the number of spikes have been proposed. It is suggested that the authors investigate the relevant studies, correct this statement, add comparison experiments among the activation-based method, ANTLR, and other spike-count-based methods (or analyze their differences), and point out the advantages of ANTLR.

Clarity: The paper is clearly written and easy to follow.

Relation to Prior Work: This paper clearly discusses how ANTLR differs from previous contributions, and it describes in detail how the two learning methods are combined.

Reproducibility: Yes

### Review 2

Summary and Contributions: This paper presents a backpropagation (BP) training method (ANTLR) for spiking neural networks that averages two different BP formulations from the literature in the form of a weighted sum. Essentially, a weighted sum of the gradients computed by the two well-known SNN BP formulations (BPTT with surrogate-derivative approximation, referred to as activation-based methods by the authors, and SpikeProp-type BP methods, referred to as timing-based methods in the paper) is empirically used to update the weight/bias parameters of the network. The authors claim this weighted average of the two known formulations offers a better solution to the supervised training of spiking neural networks. The following assessment is based on reading the paper and the submitted author rebuttal. This reviewer is concerned with several major problems:
(P1, lack of novelty): The proposed ANTLR essentially computes the error gradients by averaging the gradients computed by two well-known BP formulations. As such, the presented work makes no fundamental new contribution.
(P2, lack of mathematical rigor): The averaging scheme in ANTLR is ad hoc and is suggested with no firm mathematical foundation. It is merely based on the intuition that a weighted sum of the two methods can lead to a method that outperforms both.
(P3, poor accuracy of the proposed method): ANTLR is tested only on two small datasets, latency-coded MNIST and N-MNIST, using small feedforward spiking neural networks with only a single hidden layer. Even so, the reported ANTLR accuracies are clearly worse than those of other methods cited in the paper.
(P4, lack of convincing experimental evidence): The authors made no direct quantitative comparison with other published methods in terms of accuracy and sparsity. The experimental settings are inconsistent among the three types of BP methods implemented by the authors and hence are insufficient to support the claimed merits of ANTLR, such as sparsity.
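The weighted-sum update described above can be illustrated with a minimal sketch (hypothetical function and variable names; this is not the authors' implementation, and the mixing weight `alpha` is an assumption):

```python
import numpy as np

def antlr_style_gradient(grad_activation, grad_timing, alpha=0.5):
    """Hypothetical weighted-sum combination of two gradient estimates,
    as the review describes ANTLR doing; alpha is an assumed mixing weight."""
    return alpha * np.asarray(grad_activation) + (1.0 - alpha) * np.asarray(grad_timing)

# Toy example: combine two gradient vectors for the same set of weights.
g_act = np.array([0.2, -0.1, 0.4])  # gradient from activation-based BP (surrogate derivative)
g_tim = np.array([0.0, -0.3, 0.2])  # gradient from timing-based BP (SpikeProp-style)
g = antlr_style_gradient(g_act, g_tim, alpha=0.5)
# With alpha = 0.5 this is the element-wise average of the two gradients.
```

The reviewer's point P2 is that the choice of `alpha` in such a scheme is empirical rather than derived from a principled objective.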

Strengths: This paper is easy to follow and clearly written. The authors did a good job surveying the existing activation-based (BPTT with surrogate derivative) and timing-based (SpikeProp-like) methods and contrasting the differences in their formulations.

Correctness: The presented ANTLR falls behind other published methods on accuracy. Due to the inconsistent experimental settings used for the three methods implemented by the authors, the relative merits of ANTLR are not well supported by the presented experimental evidence. In addition, the adopted datasets and network sizes are very small; the authors should scale up their experimental effort. These issues are detailed under "Weaknesses".

Clarity: In general, the paper is easy to follow, particularly the survey of the existing BP methods. The authors call the particular type of activation-based method in Section 2.1 an RNN-like method. "RNN" is a poor choice of wording and can easily cause confusion with recurrent networks. Section 2.2.1: it is mentioned that neglecting reset paths can improve performance. How stable is this conclusion in practice? Can better performance be obtained if reset paths are considered only over a short time window? Figure 3(b) is rather confusing. What is meant by a timing shift being realized by two spike-activation changes in opposite directions? How can this be leveraged in activation-based BP methods to mimic spike timing shifts? If so, how is the amount of shift controlled? In ANTLR, can the activation-based BP component be either the RNN-like or the SRM-based method? I assume the answer is yes.

Relation to Prior Work: Since not all BP methods fall under the categories of the so-called activation-based and timing-based methods, such exceptions should be made clear upfront in the introduction. As discussed under "Weaknesses", the authors should add direct experimental comparisons with related BP methods from the recent literature.

Reproducibility: Yes

### Review 3

Summary and Contributions: The paper investigates gradient-based supervised learning rules for spiking neural networks (SNNs), in particular those calculating gradients with respect to spike activations and those calculating gradients with respect to spike timing. A novel method is presented that combines the two paradigms into the ANTLR learning rule, which uses a weighted combination of the two gradients and is able to deal with loss functions and scenarios that are hard or even infeasible with one or the other approach. The method is tested on spike-train matching, MNIST, and N-MNIST, achieving moderate accuracy while using very few spikes.

Strengths: 1. The significance of the paper comes from the integration of two concepts for training SNNs (activation-based gradients and timing-based gradients) that have so far been studied separately. The combination provides advantages that cannot be achieved with only one of the approaches. 2. The ANTLR method is well motivated and clearly derived from existing methods. The originality and novelty are limited, though, as it is merely a combination of existing building blocks. 3. The presented method is efficient in its use of spikes and achieves satisfactory results with a very small spike count (although accuracies are significantly below state-of-the-art approaches for SNNs that use more spikes). 4. Although the evaluation is not done on real-world tasks, it makes use of toy datasets (random spike-train matching), classical machine learning tasks (MNIST), and specific spiking datasets (N-MNIST). It thereby shows its usefulness in a variety of relevant tasks for SNNs. 5. The experiments are sufficiently documented to be reproducible; code will be released at a later stage.

Weaknesses: 1. The paper overall lacks clarity; in particular, the figures of computational graphs (Fig. 1, 2, 4) are not well explained and lack meaningful captions. 2. The demonstrated advantages are mainly relative to either purely activation-based or purely timing-based rules, but do not compare to state-of-the-art methods. The accuracies on MNIST and N-MNIST are well below those reached in other papers. The paper admits this and highlights the efficient use of very few spikes. While this is an advantage, I would like to see a recommendation or, even better, evidence that the accuracy gap can be closed, e.g., with larger networks. 3. Broader impact is not addressed.

Correctness: Yes, the methods are derived and evaluations seem to be done properly.

Clarity: Clarity is average at best, mainly because the figures are hard to interpret and not properly explained in the captions. Some important derivations are made in the appendix only.

Relation to Prior Work: The paper properly cites sources from both the activation- and the timing-based gradient literature for SNNs. There are certainly other relevant approaches that could be cited, but I consider it sufficient.

Reproducibility: Yes

Additional Feedback: I consider this paper relevant work for integrating two previously separate strands of research on SNN training. The results are not really convincing, though, since the accuracies are relatively far from the best results achieved with competing approaches. I would recommend adding a comparison to the state of the art and making proposals for how the gap can be overcome. I would strongly recommend making the figures more self-explanatory, e.g., with longer captions. =================================== Update after rebuttal: I thank the authors for their clear rebuttal letter, which has addressed some of the major concerns. I still think that this is a valuable contribution, since it offers a new perspective on how to deal with the two seemingly opposite views on spike codes. On the other hand, it is slightly ad hoc to just linearly weight the two gradients, and the experimental results are not really convincing. For a conference like NeurIPS this puts the paper right at the borderline, and a resubmission with a more thorough experimental validation of the method would seem like the best option to me.

### Review 4

Summary and Contributions: This paper first discusses the deficiencies of two independent approaches to learning in spiking neural networks: activation-based and timing-based methods. The former considers learning as a process of generating and removing spikes, while the latter considers the shifting of spiking times. Then the paper proposes to combine both approaches in the learning rule. Experimental tests show that the proposed method is both accurate and efficient (in terms of the number of spikes).

Strengths: The proposed method is based on a sound analysis of spike dynamics. The generation, removal and shifting of spikes should all be incorporated in the analysis. Not only is this paper important in proposing a new algorithm, but it also introduces a comprehensive viewpoint that may have long-term impact in the field.

Weaknesses: I queried why the experimental settings are not completely consistent among the three types of backpropagation methods. Nevertheless, the timing-based method and ANTLR seem to have more similar settings and can be compared, and the results showed that ANTLR outperforms the timing-based method in Figs. 7(a), 7(b) and 8(a). For Fig. 8(b), ANTLR and the timing-based method have comparable performance, but there are more spikes in the latter method. Although it was claimed that the no-spike penalty is one of the reasons, it remains for the manuscript to clarify whether including this penalty upsets the fairness of comparison. In the discussion and conclusion section, the manuscript mentioned a few other works that are neither activation-based nor timing-based. To ensure that the supporting evidence of the proposed method is complete, there should be a check to verify whether those methods are also efficient in terms of sparsity. Update after feedback and discussions: ============================= While it is true that the two component algorithms are not new, the manuscript’s contribution is more than merely combining two common algorithms using a weighted sum to retain the best of both worlds. In the combination, the activation-based component has not taken into account the time shift of the spikes, whereas the timing-based component has not taken into account the generation and removal of spikes. The algorithm is therefore combining the two methods with complementary nature. Thus, the approach is considered to be a principled one. I consider this a conceptual advance that will influence future analyses of spike dynamics in a holistic way. Indeed, it is noted that both amplitude and timing characterize a spike. Thus, learning in spiking neural networks should have handles to adjust both elements. 
Including the gradients due to both activation and time shifts with equal weighting is consistent with the calculus of the chain rule for calculating derivatives of multivariable functions. Although the reported accuracies are worse than those of other methods cited in the manuscript, the focus of this paper was cases in which the networks are forced to use fewer spikes, which are constrained to have lower performance. The manuscript considered latency coding, which belongs to the regime of reduced input spikes and has not been the focus of previous works. On the other hand, I agree with my fellow reviewers that more experimental results are needed to verify the effectiveness of the algorithm. Overall, my score is tuned from 8 to 7.
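The chain-rule intuition appealed to here can be sketched in notation not taken from the paper: if the loss $L$ depends on a spike both through its presence (activation) $s$ and its timing $t$, and both depend on a weight $w$, the total derivative sums the two paths:

$$\frac{dL}{dw} = \frac{\partial L}{\partial s}\,\frac{\partial s}{\partial w} + \frac{\partial L}{\partial t}\,\frac{\partial t}{\partial w},$$

so including both the activation-based and the timing-based gradient terms amounts to treating a spike as characterized by both amplitude and timing, each contributing its own term.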

Correctness: The paper may need to clarify whether the derivative of the membrane potential in Eq. (7) should also be dependent on the last spike of the same neuron, that is, $\tilde t_j^{\rm last}$.

Clarity: Overall, the paper is clearly written, except for a few points to be clarified. Figure 2: Not all partial derivatives appearing in Appendices C and D are displayed in the figure. It may be instructive to mention them in the figure caption. Line 99 and Equation (7): Please explain what approximation is used in the derivative of the membrane potential. Section 4.2: The reader may wonder why different loss functions were used in the three methods to be compared, and whether the inclusion of no-spike penalty upsets the fairness of comparison. Line 221: “No-spike penalty” was applied in the timing-based method, but in the previous paragraph an additional count loss was applied to ANTLR instead. Please clarify.

Relation to Prior Work: The paper discussed how its proposed method was related to the previous activation-based and timing-based methods and how it improves over them.

Reproducibility: Yes