NeurIPS 2020

### Review 1

Summary and Contributions: A new analysis of a reservoir model through the soft margin, with an interpretation in terms of cumulants and the second-order Green function.

Strengths: - interesting new theoretical approach - empirical evaluation only on one example - good relevance to NeurIPS

Weaknesses: - see below

Correctness: - the title is too general and should be more specific

Clarity: - overall well written

Relation to Prior Work: - could be improved

Reproducibility: Yes

Additional Feedback: The authors develop an interesting analysis, especially with eqs. (4), (5), (6) and the second-order Green function (12), (13).

- The authors study only one example in one application, while the title makes a very general claim. Either the authors should demonstrate and compare on many more data sets, or they should make the title more specific.
- Interplay of recurrent networks and nonlinearities: additional work in the area of recurrent neural networks and nonlinear systems analysis could be mentioned here.
- Eq. (6): is there a connection with Fisher discriminant analysis methods, which are expressed in terms of between-class and within-class covariance matrices?
- Is the choice of the initial state important or not? (Probably more important in the nonlinear case?)
- Are there implicit assumptions on the stability of the system? Is global asymptotic stability assumed, or can the system also be, e.g., chaotic?
- Table 1: the accuracies in the last two columns are the same, while the paper states that the nonlinear case is much better than the linear one. Please explain.
- Section 5: unfortunately only one example is given here. Can't the method be proposed as a general-purpose scheme for different applications?

Thanks to the authors for the replies; I have taken them into consideration.
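For reference on the question about eq. (6) and Fisher discriminant analysis, the standard two-class Fisher criterion is sketched here. The notation ($w$, $\mu_\pm$, $\Sigma_\pm$, $S_B$, $S_W$) is generic textbook notation and an assumption of this note, not taken from the paper; whether eq. (6) actually reduces to this form is exactly the open question.

$$
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad
S_B = (\mu_+ - \mu_-)(\mu_+ - \mu_-)^\top,
\qquad
S_W = \Sigma_+ + \Sigma_-,
$$

$$
w^\ast \;\propto\; S_W^{-1}\,(\mu_+ - \mu_-).
$$

Here $\mu_\pm$ and $\Sigma_\pm$ denote the class means and covariances of the features seen by the readout, and $w^\ast$ maximizes $J$ whenever $S_W$ is invertible.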

### Review 2

Summary and Contributions: In this work, the authors study a classification task on one-dimensional temporal sequences. They show that a recurrent neural network (RNN) can act as a spatiotemporal kernel and facilitate the classification of inputs. To unfold the temporal evolution of the recurrent network's internal dynamics, they expand the Green function of the network's response to leading orders. By averaging over the input statistics, they arrive at a series expansion in the cumulants of the output. They show that for a randomly connected RNN (a reservoir), the output's classification performance is optimal when the input projections are optimized. They derive a perturbative theory for linear networks and networks with "weak" nonlinearity. Finally, they test their result using labeled time-series data.

Strengths:

* The framework of expanding the response of a recurrent network using the Green function is novel and holds great potential. Following ideas from statistical mechanics and many-body physics, this method allows a perturbative expansion of complex interactions as a power series in the degree of interaction. While the current work studies a relatively simple case, I think that it serves as a proof of concept that perturbation analysis of the interaction is useful in practical computational problems.
* The authors show that the input projections into a recurrent neural network can have a significant impact on the readout performance, at least for a simple classification task. This perspective is different from the bulk of studies on computation by RNNs, which are inspired by reservoir computing with fixed input weights and a learned readout.
* The paper is written in a clear way that is accessible to readers without prior knowledge of perturbative methods in many-body physics.
* The theoretical predictions are backed up by simulations with artificial inputs and inputs from a known time-series database.

Weaknesses:

* The authors address the nonlinearity by expanding the activity around a steady state, which is a fair assumption. However, the authors further assume that the nonlinear terms, and in particular the quadratic terms, are small by letting alpha be small in eq. (9). This is a strong assumption, which I am not sure is valid in many cases of interest. The authors cite [Roxin et al., 2011], who studied the long-tailed distribution of synaptic input to cortical circuits. The authors state that in vivo, neurons typically operate in a low-med regime. The most common biological models of single-neuron dynamics (e.g., LIF and HH) include rectification of the transfer function at the origin. In that case, there is a sharp nonlinearity, and the assumption of low alpha does not seem to be valid. Furthermore, one would expect that complex computations require operation far from any locally linear regime, and that networks take advantage of nonlinearities such as rectification. I think the result on the nonlinearity's effect should be understood qualitatively and should not be directly compared to cortical circuits. Moreover, the results in Figure 3 clearly show that the nonlinearity makes a minor contribution to the overall dynamics.
* The authors show that classification at time $T$ is mostly affected by the input a short time before it [lines 133-135]. This is a trivial statement, since the reservoir network operates in its fixed-point regime and the effect of any input decays exponentially. This point is demonstrated by the authors' Lyapunov spectra analysis. It is not clear to me whether the authors suggest that optimizing the input would result in any other behavior; if they do, I don't see any evidence for that.
* The result that optimizing the input is more significant in nonlinear networks than in linear ones is somewhat trivial. The optimization of the input projections is equivalent to selecting a feature set for the kernel. For nonlinear networks, the space of features (different spatiotemporal correlations) is much larger. Thus, input optimization, or choosing the best features, will necessarily have a more substantial effect.
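The exponential-decay argument above can be illustrated with a minimal sketch. It assumes a discrete-time linear reservoir $x_{t+1} = W x_t + u\, s_t$ with a random Gaussian $W$ rescaled to a spectral radius below 1; this update rule, the function name `impulse_response_norms`, and all parameter values are assumptions made for illustration and are not the model or code of the paper under review.

```python
import numpy as np

def impulse_response_norms(n=100, spectral_radius=0.8, max_lag=40, seed=0):
    """Norm of the reservoir state caused by a unit input pulse, per time lag.

    Hypothetical discrete-time linear reservoir: x_{t+1} = W x_t + u s_t.
    A pulse applied `lag` steps before readout contributes W^lag u to the state.
    """
    rng = np.random.default_rng(seed)
    # Random connectivity, rescaled so the largest |eigenvalue| equals spectral_radius.
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    # Random input projection with unit norm.
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)

    norms = []
    x = u.copy()
    for _ in range(max_lag + 1):
        norms.append(np.linalg.norm(x))  # contribution of a pulse at this lag
        x = W @ x                        # propagate one step further back in time
    return np.array(norms)

norms = impulse_response_norms()
for lag in (0, 10, 20, 40):
    print(f"lag {lag:2d}: |W^lag u| = {norms[lag]:.2e}")
```

In this stable linear setting, inputs far in the past contribute little to the state at readout time regardless of how `u` is chosen, which is the sense in which the statement is trivial; the sketch says nothing about the nonlinear or optimized-input case.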

Correctness: The method is novel and seems correct. I found some issues with the underlying assumptions (see above).

Clarity: The paper is written in a clear way that is accessible to readers without prior knowledge of perturbative methods in many-body physics.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Other minor comments:

* In the application to the ECG datasets, the authors normalize the input so that $\mu = \pm 1$ [line 232]. It is unclear to me how this is done without knowledge of the labels (perhaps I am missing something). If the data is indeed normalized using the labels, then I see no point in using a real data set, and this test is not preferable to artificial data.
* Figure 3a: the right inset has no scale, so it is hard to determine whether the correction due to the nonlinear terms is meaningful. I suspect it is not (following my comment above).
* The loss in eq. (8) is maximized, so I think the constraints should have a negative sign. The constraints seem to be remnants of the delta-function representation in the partition function. When defining a cost function, I think that either the constraint terms should be squared to normalize the vector, or they should be interpreted as minimizing the norm (which means having a minus sign, and no need for the "1" term).

******

I don't think the authors properly addressed my comments; this may be due to the limited space in the rebuttal and the initial high score. While I think there are flaws in the study, which I hope will be addressed, I stand by my original view that this work presents a new conceptual way to treat the dynamics of RNNs and to approach the structure-activity relation.

### Review 3

Summary and Contributions: This paper builds an optimal time series classifier on top of random reservoir networks. A reservoir network is a kind of random recurrent network that plays the role of a temporal kernel. By developing an analytical approach that unrolls recurrent nonlinear networks using a perturbative expansion, the input projection u and the readout vector v can be jointly optimized. A binary time series classification task is selected to test the proposed model. The results show that this joint optimization of u and v can lead to a significant performance improvement.

Strengths: The derivatives with respect to the input projection u and the readout vector v can be calculated in closed form.

Weaknesses: More recent related work on recurrent/temporal kernels could be considered. It is unclear what the advantages of random recurrent networks are compared to RNNs/LSTMs trained with BPTT, as used in the deep learning community. Since reservoir networks such as echo state networks are well suited to chaotic time series prediction, are there any assumptions about the input data when doing time series classification? As mentioned in equation 7, should the time series input come from a Gaussian distribution? Is the length of the time series an important hyperparameter? Do the final states at time T (when doing classification) converge to two different stable points/attractors? Since reservoir networks always exhibit strong short-term memory, it would be better to discuss the connection between these memory characteristics and the studied temporal kernel. Will the network easily forget information from the beginning of the time series?

Correctness: Based on a perturbative expansion using the first- and second-order Green's functions of the system, the closed forms of the derivatives for the input projection and readout vector are derived. This process is correct, and the soft margin values during optimization are also discussed in Figures 2-3.

Clarity: This paper is well written.

Relation to Prior Work: Related work on reservoir networks and temporal kernels could be covered in more detail.

Reproducibility: Yes

Additional Feedback: Thanks to authors for the replies. I stick with my rating.

### Review 4

Summary and Contributions: The paper investigates the use of randomly connected RNNs for time series classification tasks, building on existing work in reservoir computing. In contrast to most reservoir computing approaches, the proposed method investigates the input weights, which are traditionally drawn randomly.

Strengths: The idea to optimise input weights of a random reservoir for a binary classification task is novel. The work is also relevant to the NeurIPS community, and addresses the problem of time series classification using an efficient approach for RNN training.

Weaknesses: It is difficult to see how well we can expect the approach to work, from either theoretical considerations or a proper empirical evaluation. The reservoir computing / ESN literature contains examples of time series classification that would be useful for a comparison; for example, the Japanese Vowel Dataset is a common benchmark (e.g., in https://doi.org/10.1016/j.neunet.2007.04.016). It would also be good to compare against other time series classification approaches in terms of performance and computational complexity. The work at present does not appear to cite recent / major contributions in the area of time series classification; some relevant ones that contain further references are, for example: Wang, Z., Yan, W., & Oates, T. (2017, May). Time series classification from scratch with deep neural networks: A strong baseline. Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2017). The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances.

Correctness: The claims that the method does indeed improve time series classification would need to be backed up with more evidence.

Clarity: The paper could be improved by better separating the aims, idea, derivation and realisation of the proposed idea. In its current state, it is hard to tell what the overall training procedure looks like, and I would suggest including the algorithm in the paper.

Relation to Prior Work: It becomes clear what the work is aiming to do differently. It does not become sufficiently clear how this is achieved, nor how well the method works.

Reproducibility: No

Additional Feedback: The authors' responses clarified issues that I previously hadn't seen, and I have adjusted my overall score.