Paper ID: 2805

Title: Mutually Regressive Point Processes

In the manuscript entitled "Mutually Regressive Point Processes", the authors present a hierarchical model and associated Gibbs sampler for a modification of the mutually excitatory Hawkes process that also allows suppressive behaviour. The presentation is clear, and both the proposed model and sampling scheme appear correct. The application of the model to a neural spike dataset is both topical (for the conference name! but really as a topic of current work) and interesting, since one must imagine the true model is not exactly of the class assumed: i.e., an application with (presumably) a typical degree of misspecification. I think this model has the potential to be applied to a number of scientific problems in addition to the neural one: for instance, in studies of disease where new cases spread and generate others, but each can also increase the likelihood of a community receiving prophylaxis in response. Given the level of detail in the current manuscript, the examination of both well-specified and misspecified (real-world) data, and the supplementary code, I consider this paper already sufficient to warrant acceptance. That said, I would encourage the authors to think about some issues for future work, or if planning to release the code through (e.g.) an R package. One is obviously scalability and approximate posterior representation, as the authors themselves allude to. The other is prior choice. It can be difficult to choose good priors for these deep hierarchical models on a case-by-case basis, yet very broad 'non-informative' priors can nevertheless pull the model away from the data-preferred regions of parameter space in surprising/unwanted ways. Ideally the authors could come up with a set of summary statistics or visualisations that would help new users choose appropriate hyper-parameters for their priors and/or diagnose prior sensitivity.
(The latter might be assessable via posterior re-weighting, if Radon-Nikodym derivatives for the collapsed processes can be calculated a la Ghosh et al. 2006 for the Dirichlet process.)
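To illustrate the re-weighting idea in a simple parametric case (not the harder Radon-Nikodym computation for the collapsed processes), posterior draws obtained under one prior can be importance-re-weighted to target the posterior under an alternative prior, with the effective sample size of the weights serving as a sensitivity diagnostic. The priors and draws below are hypothetical stand-ins, a sketch of the generic technique rather than anything from the paper:

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def prior_sensitivity_ess(logp_new, logp_old):
    """Re-weight posterior draws obtained under an old prior so they target
    the posterior under a new prior (likelihood unchanged), and report the
    Kish effective sample size of the normalised importance weights."""
    logw = logp_new - logp_old
    logw -= logw.max()            # numerical stability before exponentiating
    w = np.exp(logw)
    w /= w.sum()
    ess = 1.0 / np.sum(w**2)      # ESS near len(w): insensitive to the change
    return w, ess

# Mock posterior draws for one hyper-parameter (stand-in for MCMC output)
rng = np.random.default_rng(1)
theta = rng.normal(0.8, 0.3, size=5000)

# Compare a broad prior N(0, 10^2) against a much tighter alternative N(0, 1)
w, ess = prior_sensitivity_ess(
    logp_new=normal_logpdf(theta, 0.0, 1.0),
    logp_old=normal_logpdf(theta, 0.0, 10.0),
)
print(ess)                 # a large fraction of 5000: robust to this change
print(np.sum(w * theta))   # re-weighted posterior mean under the new prior
```

A collapse of the ESS under a modest prior perturbation would be exactly the kind of prior-sensitivity warning a new user could act on.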

Main comments

1) When the model is introduced in Section 2.3, I would have liked some discussion of why this structure would be beneficial and sufficiently flexible to do the job being asked of it. I figured it out slowly, but it took longer than it would have with a bit of author assistance.

2) Relatedly, after the model (and prior) have been introduced fully, it would have been nice to see some demonstration of the range and kinds of excitation/inhibition behaviour that it is now possible to both simulate and model. What is this model capable of doing? Where are its (simulatory) limitations? Show how it goes beyond the most relevant competing models.

3) There were no real comparisons to other methods, so I do not know whether I am getting anything extra in the model compared to other approaches out there, or whether the only benefit here is the exact simulation component. If there is nothing extra in the model compared to other approaches, this is fine, but it should be made clear that the computational contribution is primary here. On the other hand, if the new model is more flexible than existing approaches, then the authors should not hold back from actually demonstrating this!

4) Figure 4(b) is clearly bimodal, with both positive and negative modes, and there is evidence that many of the other parameters are multimodal. What does this mean for understanding the excitation/inhibition process? Alternatively, as this is inference based on data simulated under the model, is this just evidence that the Markov chain sampler has not yet adequately explored the posterior? In that case, why not run the chain for longer and provide a better description of the (marginal) posteriors? More generally, given that exact computation for this continuous-time model is important, I would have liked to see more discussion of sampler performance in terms of mixing (e.g. integrated autocorrelation time, autocorrelations, or whatever is suitable). Given the bumps in Figures 1 and 2, I am currently slightly suspicious of sampler performance. Exact simulation is good, but if performance is terrible compared to approximate methods, then ...

5) L.207: The simulated activity in Fig 4(c) is stated to "remain physiological and similar to the one used for training in Fig 4a." This is a statement about the application-specific structure of the data that is not properly quantified or qualified. What makes this particular simulated data "better", or more relevant in an application-based sense, than data that could have come from other methods? This needs explaining/quantifying more clearly.

Minor comments

The paper would benefit from a close reading of sentences for grammar and structure. Please use commas where they would help the reader parse a sentence, for example.

L.142: what does the asterisk denote in the numerator of the LHS fraction?

It was mildly annoying that much of the information that could have been in the main paper (with a little consideration of space restrictions) was instead relegated to the supporting information, requiring constant forward-backward flipping. Perhaps the authors could consider whether any important details could be returned to the main paper, restricting this to the less important parts?
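To make the mixing diagnostic in main comment 4 concrete, the integrated autocorrelation time (IAT) can be estimated directly from a chain of posterior draws. The sketch below is generic and not tied to the authors' sampler; the two chains are simulated stand-ins (white noise versus a sticky AR(1) chain):

```python
import numpy as np

def integrated_autocorrelation_time(chain, max_lag=None):
    """Estimate the IAT of a 1-D chain: tau = 1 + 2 * sum_k rho(k),
    truncated at the first negative autocorrelation (a simple common rule).
    tau near 1 means near-independent draws; large tau means slow mixing."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance via FFT (zero-padded to avoid circular wrap-around)
    f = np.fft.rfft(x, n=2 * n)
    acf = np.fft.irfft(f * np.conj(f))[:n].real
    acf /= acf[0]
    if max_lag is None:
        max_lag = n // 2
    tau = 1.0
    for k in range(1, max_lag):
        if acf[k] < 0:
            break
        tau += 2.0 * acf[k]
    return tau

rng = np.random.default_rng(0)

# A perfectly mixing "chain": i.i.d. draws, IAT close to 1
iid = rng.normal(size=20000)
print(integrated_autocorrelation_time(iid))

# A slowly mixing chain: AR(1) with coefficient 0.95, IAT far above 1
ar = np.zeros(20000)
for t in range(1, len(ar)):
    ar[t] = 0.95 * ar[t - 1] + rng.normal()
print(integrated_autocorrelation_time(ar))
```

Reporting IAT (or effective sample size, which is chain length divided by IAT) per parameter would directly address whether the bumps in Figures 1 and 2 reflect posterior structure or inadequate exploration.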

A generally good paper. The paper proposes a class of point processes, called mutually regressive point processes, that captures both excitation and inhibition in sequences of events. The proposed approach can model nonlinear temporal interactions in continuous time. The construction builds on a generalization of Hawkes processes, combining existing methods in a new way; the traditional Hawkes process can be recovered under certain parameter settings. The resulting construction is quite involved. It is based on augmenting the Hawkes process intensity with a probability term. This term is equipped with a weight parameter that tunes the influence of one event type on another. In particular, a sparse normal-gamma prior is placed on the weights so that an interaction between two types is inhibitory or excitatory in a probabilistic manner.

The sampling scheme for the generative model and the inference procedure are both well justified and supported: for the former, there is a detailed explanation of the hierarchy, the parameters, and the constraints; for the latter, a thorough description of the inference and the related methods (e.g. thinning). As a downside, however, the paper is heavy in notation, owing to the many parameters involved in the construction, which makes it harder to read and comprehend. The authors give the precise algorithms for inference. The synthetic-data experiment supports the validity of the model. Regarding the other experiments, although the paper brings interesting directions in the use of these processes for neuron data, the experiments are limited. Sufficient comparisons to other models, on both interpretation and predictive performance, are also missing.
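As a rough illustration of the construction described above (a Hawkes intensity thinned by a probability term whose weight sets the sign and strength of an interaction), here is a toy univariate simulator using Ogata-style thinning. The logistic form, kernels, and parameter names are my own simplification for illustration, not the paper's exact model:

```python
import numpy as np

def simulate_regressive_hawkes(T, mu, alpha, beta, w, b, tau, rng):
    """Toy sketch: Hawkes intensity mu + sum alpha*exp(-beta*(t-s)), with each
    candidate event additionally accepted with probability
    p(t) = sigmoid(b + w * sum exp(-(t-s)/tau)).
    w > 0: past events push p(t) up (excitation in the probability term);
    w < 0: past events suppress new events (inhibition).
    With w = 0 and b large and positive, p(t) is near 1 and a plain
    Hawkes process is recovered."""
    events = []
    t = 0.0
    while True:
        # Between events the exponential kernels only decay, so the current
        # intensity upper-bounds the intensity until the next accepted event.
        lam_bar = mu + sum(alpha * np.exp(-beta * (t - s)) for s in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            break
        lam = mu + sum(alpha * np.exp(-beta * (t - s)) for s in events)
        drive = b + w * sum(np.exp(-(t - s) / tau) for s in events)
        p = 1.0 / (1.0 + np.exp(-drive))
        # Thinning: the effective intensity lam * p never exceeds lam_bar.
        if rng.uniform() < (lam / lam_bar) * p:
            events.append(t)
    return np.array(events)

rng = np.random.default_rng(2)
common = dict(T=100.0, mu=1.0, alpha=0.5, beta=1.0, b=0.0, tau=1.0, rng=rng)
excited = simulate_regressive_hawkes(w=2.0, **common)
inhibited = simulate_regressive_hawkes(w=-4.0, **common)
print(len(excited), len(inhibited))  # inhibition thins the process
```

A figure built from exactly this kind of sweep over the weight parameter is what main comment 2 of the first review asks for: it would show readers at a glance the range of behaviours the model can produce.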