__ Summary and Contributions__: The authors approach the problem of optimizing autoregressive spike train models for modeling neural spike trains in response to stimuli. They suggest that a focus on optimizing these autoregressive models by maximizing likelihood on one-step prediction often leads to unstable models that don't predict actual spike trains well. This happens because at training the model is constrained on actually observed spike train values at time steps <t to predict tilmestep t (data-conditioned), but in prediction time, it is only conditioned on it's own prediction at time steps <t (free-running). This can lead to instability in certain regimes.
They take inspiration from the GAN literature to rethink the objective they are actually optimizing for and suggest maximum mean discrepancy, coupled with spike train kernels, as a suitable alternative to MLE and claim to show that it outperforms MLE.

__ Strengths__: Updated Review:
I'd like to thank the authors for answering my concerns re: the stochastic nature of the kernel optimization. I am now even more confident in my assessment that this is a good submission and an accept from me.
In response to an under constrained general MLE framework for fitting autoregressive models, the authors introduce a method to balance multiple objectives (fidelity to outputs under data-constrained and free-running conditions). They achieve this with model-based MMD (sometimes with the addition of a likelihood objective), which requires matching of free-running and data-constrained model features, essentially maximizing likelihood under the constraint of requiring faithful behavior in both conditions. This leads to a model that is both faithful and stable in the free-running condition.
The results show that their method comes close to capturing the MLE solution for a simple model built on top of toy data with no history dependence in finite samples, and improves upon the MLE model for toy data with explicit history dependence. They verify this on real datasets from monkey and human cortex and show similar improvement over MLE fits.
I think this work has a general significance and relevance to people in the neuroscience community at NeurIPS, but also for people modeling other scenarios with point process dynamics with an emphasis on explainability in their models.

__ Weaknesses__: The authors discuss the the effects of different kernels, and also on the stochastic nature of their optimization procedure, but don't show work leveraging the stochastic nature of their procedure to obtain the expected values of many samples of this optimization process. It's difficult to determine if the differences in kernel estimates fo the history dependence function are consequences of the kernels used or of the stochastic nature of the optimization without this.

__ Correctness__: On page 7 you wrote :
"We used a dataset previously used in Hocker and Park [21] and (original reference?)"
please add original reference.

__ Clarity__: Yes, the paper is well written and clear.

__ Relation to Prior Work__: Previous work is discussed clearly, and it is clearly demonstrated how this work is a novel contribution.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors address the important problem of run-away excitation in GLM models, a common approach to model the dynamics in neuroscience. Typical estimation method maximizes the data-conditioned log-likelihood, i.e., maximize the probability of observing the data at the next time step given past observations. However, this might not necessarily maximize the free-running likelihood, i.e., recursively using samples generated by the model itself to predict accurate samples for future. Directly maximizing the free-running likelihood from the data is hard. Hence, the authors propose to regularize the traditional data-conditioned likelihood maximization with an additional loss: minimize a Kernel based distance between the model generated spike trains and the data. They present a score-function based estimation method for arbitrary kernels but with higher variance, and an alternate model-based kernel approach that has lower variance.

__ Strengths__: Soundness of Claims
The approach is well grounded in the widely used Kernel methods.
Significance
The authors present a viable method to fix the run-away excitation problem in GLMs. The presented estimation methods that can be applied in different conditions and under different assumptions. As shown by the analysis using autocorrelation, the Kernels can incorporate different features of interest in the responses.
Novelty
This work presents a different solution to the common problem of run-away excitation in GLMs.
Relevance
The topic is very relevant to GLMs, a popular modeling tool in Computational Neuroscience, and many in NeurIPS community will find it useful.

__ Weaknesses__: Soundness of Claims
As described in more detail below, statistical significance is unclear for most of the analyses.
Novelty
Quantitative comparison of this approach to existing methods for this problem is absent. Perhaps a simple regularization on history filter would have fixed the problem?

__ Correctness__: Majority of the discussion in Sec 4.3 (Lines 237-250) relies on observations in one model fit for kernel type in Figure 4. However, the lack of error bars in Figure 4 makes the conclusions uncertain. All panels in Figure 4 must have error bars. For example, are the differences in history filter in Figure 4A consistent across multiple fits of model and across different partitions of data? Same for other panels.
Many of the spike-train statistics computed on ML fits could be dominated by the few trials that show run-away activation (For example, in Figures 3E, J). What if we just identify and ignore the trials with run-away activation? Does this common (but naive) approach fix this problem?
------
Update post review: The authors do a reasonably good job addressing the concerns. Addition of error-bars in estimated fits and statistics is very helpful, and should go in the final paper. Increasing the score upwards.

__ Clarity__: Section 2.3 and Line 123 describe minimization of dMMD(q_\theta, p_\theta), but
Section 3 and results describe minimization of dMMD(\hat{p}, p_\thetha) (See line Eq 10-16, for example).
Describe all the Kernels used in the Results (sec 4.2, 4.3) in the Methods (sec 3.2).
Small comments:
-- Explain in more detail why minimization of KL divergence with free-running likelihood is hard (line 100).
-- Line 117: this -> these
-- Line 122 : to -> too
-- Line 187 : Give figure number.
-- Line 192 : Figure 2 -> Figure 2F
-- Lines 197 - 211 : Figure 3 (and its panels) must be referred to in main text.
-- Line 201, 216 : Mention how is \alpha chosen.
-- Line 214 : Fix [original reference]

__ Relation to Prior Work__: Yes, multiple prior works have been mentioned.
A related recent work could be included in the paper: https://doi.org/10.1101/2020.06.11.145904

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: They optimized many steps of an auto-regressive model for spike trains rather than just one step, using MMD. This avoided instability during generation that is common in models that don't do this.

__ Strengths__: Seems like an important advance.

__ Weaknesses__: It wasn't clear whether it could be used at scale. But it wasn't clear it couldn't, so this is not much of a weakness.
Update post-author-response: Given the authors response, it looks like this doesn't have the best scaling with the number of spikes. But it's not so bad that it will kill the methods. So my score won't change.

__ Correctness__: I think so.

__ Clarity__: Yes! Both on an absolute scale, and the best written paper of the six I reviewed.

__ Relation to Prior Work__: I think so, but I'm not really an expert.

__ Reproducibility__: Yes

__ Additional Feedback__: None.