NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6051 Attentive State-Space Modeling of Disease Progression

### Reviewer 1

Originality: The paper is innovative. Disease progression modelling is a very timely application, and few papers exist that present experimental and clinically meaningful machine learning results.

Clarity: The paper is very clearly written. It was my pleasure to read and review it. As complimented above, the authors have excelled in conceptualising a problem, composing it as a machine learning task, and writing and referencing the related report in a way that is meaningful and engaging for both the computer science/engineering and health care/sciences communities. It is very rare to see a paper that delivers so well to such a diverse audience. Also, the findings are presented and visualised (as Tables and Figures) in a mature, meaningful, and comparative way.

Quality: Although many of the methods are perhaps unsurprising and perhaps not as inspiring to machine learning method developers, they are well justified, compared, and related. The contribution of this paper really is that it puts machine learning into practice in a way that is meaningful to clinicians.

Significance: The authors consider a societally significant problem of modelling disease progression in a way that is clinically meaningful, providing actionable insights to inform clinicians' judgement and decision-making. It is important that studies of this kind are presented at NeurIPS to attract more people to contribute to this important topic of medical informatics.

### Reviewer 2

#### Originality

Discrete state space models are often considered to be interpretable because they essentially cluster the elements of time series data (while accounting for time dependence). The key idea in this paper is to maintain this property of discrete state space models while relaxing the stationary Markov assumption on the transition probabilities that we typically use to simplify inference. Although this idea is not new (e.g. semi-Markov models, or Markov models that augment the state with a larger history), the proposed mechanism for relaxing the assumption does seem to be original. The variational inference algorithm for this model also seems to be new.

#### Quality

[1] I think that the HMM baseline is unnecessarily weak. In practice, we can relax the "strict" Markov assumption (i.e. the state in year $t + 1$ is conditionally independent of the past given the state at year $t$) by augmenting the state with the past $h > 1$ years. This keeps the inference exact and relatively easy to implement. Although the state space can grow quite large, it still may not be quite as big a burden as fitting a complicated model with poorly understood convergence properties.

[2] I also thought that the analysis of the proposed variational inference algorithm could be much stronger. The main theoretical motivation for the VI algorithm is Rao-Blackwellization, but I was confused by this claim. Typically, this means that an estimator (i.e. a function of the data) is replaced with its expected value conditioned on some piece of observable information. The paper doesn't show how this definition connects to the proposed VI algorithm, which weakened the argument. The experimental results did not support this claim either. The only result shows the change in negative log-likelihood as a function of epochs for the proposed algorithm and two reasonable alternatives. Although the proposed VI algorithm does seem to achieve a better upper bound on the NLL, there's no evidence that this is due to the reasons that the authors discuss in Section 3. Moreover, the NLL doesn't answer the question of whether the inferences using the proposed VI algorithm are actually better than those obtained using the baselines. Two inference-related questions I have are: (1) Do these simpler inference algorithms recover similar stages of Cystic Fibrosis? (2) How do the prediction results in Table 2 change when we use these alternative inference algorithms?

#### Clarity

The paper is clearly written, and I enjoyed reading it.

#### Significance

I think the method has the potential to make an impact on the ML+health field, but I have a few reservations. First, this algorithm might be difficult for practitioners to use in practice. Comparison to a stronger HMM baseline (e.g. the one I described above in "Quality", which is simple to implement) could help to show that the increased difficulty of implementation is justified. Second, when internal details about a model are given to clinicians to help make predictions more "interpretable", I believe that it is important to quantify uncertainty. For instance, it appears that patients in Stages 2 or 3 of CF are at much higher risk of diabetes than those at Stage 1. How stable is this pattern? Would it change if we fit the model using, say, a different random seed? In general, what can we say about the reliability/stability of the latent states/stages that this model learns?

#### Minor questions

How did you choose the number of states for the attentive state space model in the CF analysis? And how did you initialize the states?
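To make the stronger HMM baseline from Quality [1] concrete, here is a minimal sketch of the state-augmentation trick: an order-$h$ Markov chain over $K$ states is rewritten as a first-order chain over $K^h$ history tuples, so standard exact HMM inference still applies. The function name `augment_transitions`, the tensor layout of `P`, and the random example values are assumptions made for illustration; none of them come from the paper.

```python
import itertools
import numpy as np

def augment_transitions(P):
    """Rewrite an order-h transition tensor as a first-order transition matrix.

    Assumed layout: P has shape (K,)*h + (K,), where P[s_1, ..., s_h, s_next]
    is the probability of s_next given the last h states (oldest first).
    The augmented chain lives on the K**h tuples (s_1, ..., s_h).
    """
    K = P.shape[-1]
    h = P.ndim - 1
    tuples = list(itertools.product(range(K), repeat=h))
    index = {s: i for i, s in enumerate(tuples)}
    A = np.zeros((K ** h, K ** h))
    for s in tuples:
        for nxt in range(K):
            # The history window slides by one: drop the oldest state, append the new one.
            A[index[s], index[s[1:] + (nxt,)]] = P[s + (nxt,)]
    return A, tuples

# Hypothetical example: K = 3 disease states, history length h = 2.
rng = np.random.default_rng(0)
K, h = 3, 2
P = rng.dirichlet(np.ones(K), size=(K,) * h)  # shape (3, 3, 3), last axis sums to 1
A, tuples = augment_transitions(P)
```

Each row of `A` has at most `K` non-zero entries, so the forward-backward recursions can exploit this sparsity even though the augmented state space has `K**h` elements.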

### Reviewer 3

Overall: I found this paper well-written and convincing. The authors do a strong job justifying the proposed model and positioning it relative to standard models in the field. I have a few questions, but overall, I think it is a very strong applied paper.

Questions/comments:

1. Clinical visits are generally irregularly spaced and rife with missing measurements. How was this handled in the context of your discrete time model?
2. Beyond evaluation in the context of this paper, why restrict the model to have the same number of states as an existing disease phenotype scheme? In particular, we cannot assume that the unsupervised model will learn back an existing phenotype scheme and, in many cases, it is not clear that we would want it to.
3. Line 159: Similarly, we *hope* that the factorization in (2) produces clinically meaningful disease states, but it is by no means a guarantee.
4. Line 177: This posterior factorization was not previously discussed. I recommend adding a section in the supplementary material that gives more details on its derivation. Maybe even just a simple three time step example derived from the graphical model would suffice.
6. Line 183: $\vec{\alpha}_t$ should be $\vec{\alpha}_{t-1}$.
7. Figure 4: I'm not sure that the training NLL is very informative here since we would expect a more expressive model to fit the training data better regardless of how well it generalizes. I would recommend a held-out NLL plot instead. Also, are the lines for Mean-Field and Attentive actual NLL values or are they the ELBO?
8. Line 237: The learned transition matrix has non-zero probabilities for transitioning backward from more severe stages to less severe ones. Is the existing CF progression monotonic, or does it allow for backward transitions? If it is monotonic, this seems like the kind of constraint that could be easily encoded in the model.
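As an illustration of the constraint raised in point 8, the sketch below shows one common way to forbid backward transitions in a plain transition matrix: mask the entries below the diagonal before a row-wise softmax. It assumes the states are indexed in order of increasing severity and that transitions are parameterised by an unconstrained logit matrix. Both are assumptions for illustration, the name `monotone_transition_matrix` is hypothetical, and the paper's attention-based transitions may need a different mechanism.

```python
import numpy as np

def monotone_transition_matrix(logits):
    """Row-wise softmax restricted to 'stay or progress' transitions.

    Assumes state i is less severe than state j whenever i < j, so only
    entries on or above the diagonal are allowed to be non-zero.
    """
    K = logits.shape[0]
    allowed = np.triu(np.ones((K, K), dtype=bool))
    masked = np.where(allowed, logits, -np.inf)
    # Subtract the row maximum for numerical stability; exp(-inf) evaluates to 0.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical example with three stages and random logits.
rng = np.random.default_rng(0)
T = monotone_transition_matrix(rng.normal(size=(3, 3)))
```

Under this parameterisation the most severe stage is absorbing by construction, which may or may not be clinically appropriate.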