Paper ID:1890
Title: Stochastic variational inference for hidden Markov models
Current Reviews

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The authors adapt the SVI of Hoffman, Blei, Wang, and Paisley (2013) to hidden
Markov models (HMMs).
Previous applications of SVI consider minibatches of complete data items;
SVIHMM differs in that its minibatches consist of subsequences.
This is an interesting idea, in which some external observations are used to
seed the messages on each side of the subsequence.
The authors propose a heuristic for determining how many observations need to be added
on each side of the subsequence. It is a bit surprising that this works, since
one might imagine there could be long-term dependence in the messages.
Doesn't GrowBuf terminate immediately if S^new = S^old?
Is there a relationship between the optimal tau and the second largest
eigenvalue of A?

The quality of the paper itself is unfortunately low: figures and tables are missing from the
experiments (e.g., Table 4 at line 355 and the timing experiments).
The introduction and review are rather long: it is not until page 4 that we get
to the material of the paper. Consequently, too much is placed in the
supplement. GrowBuf would appear key to understanding the paper, but is only
presented in the supplement.
I find it hard to assess the quoted FDR: the DBN has an FDR of 0.999038, whilst
this method yields 0.999026. This difference appears really rather slight,
but perhaps SVIHMM is faster than the DBN? No timing results were presented.
Also, how much noise is there in the estimate of the FDR?
Q2: Please summarize your review in 1-2 sentences
A mostly straightforward adaptation of SVI to HMMs with a heuristic for training on subsequences. Presentation is incomplete.

Submitted by Assigned_Reviewer_31

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The paper presents stochastic variational inference (SVI) for HMMs. Unlike LDA, in HMMs the chain-based dependence structure poses problems for SVI. Subsampling in each iteration cannot assume independence between series data, and skipping any data outside the chain can lead to errors. (Network models present somewhat different difficulties, replacing chain-based with pair-wise dependence. Subsampling in this setting has been handled in prior work (Gopalan et al. (2012)).)

In this work, the authors show how to scale the noisy subsample gradients for unbiased estimates and bound error induced by updating only variables corresponding to subchain samples. They augment subsampled subchains with extra observations to achieve this. They argue that this buffering and the forward-backward computations can be done incrementally, and therefore efficiently. They apply their algorithm to a large genomics data set, and demonstrate similar performance to VB.
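For intuition on the unbiased scaling described above: sufficient statistics computed on a minibatch of subchains are rescaled by the ratio of total observations to observations sampled, so that their expectation matches the full-data statistics. A minimal sketch (the function name and toy statistics are our own, not the paper's):

```python
import numpy as np

def scaled_stats(subchain_stats, T, M, L):
    """Rescale minibatch sufficient statistics so their expectation
    matches the full-data statistics.

    subchain_stats: (M, D) array of per-subchain statistics
    T: total number of observations in the chain
    M: number of subchains per minibatch, each of length L
    """
    scale = T / (M * L)  # each minibatch covers M*L of T observations
    return scale * np.sum(subchain_stats, axis=0)

# Toy check: 2 subchains of length 5 from a chain of length 100
stats = np.ones((2, 3))
print(scaled_stats(stats, T=100, M=2, L=5))  # [20. 20. 20.]
```

Under uniform subchain sampling, the expectation of this rescaled statistic equals the full-chain statistic, which is what makes the resulting stochastic natural gradient unbiased.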


The paper is well written; the contributions made in developing an SVI algorithm for HMMs are clear and significant.


- My main criticism of this work is that a comparison to Johnson et al. (2014) is not provided. That paper develops an SVI algorithm for large collections of independent series. Surely the SVI algorithm in this paper makes better (and correct) assumptions; but it is hard to know whether the additional complexity is worth it, in practice, when running on real data. Comparison to an EM algorithm is insufficient, as the gains could derive simply from using SVI.

- You mention "significant gains" in computation for the genomic data compared to the EM algorithm, but no runtime evidence is provided.

- The computational complexity of the batch vs. SVI algorithm is not mentioned (although it is evaluated); I assume SVI still has per-iteration complexity quadratic in the number of states?

- What is the worst-case cost when L is set poorly? The algorithm requires a good choice of subchain length (L) and number of subchains per mini-batch. This is not unusual for SVI algorithms, where mini-batch size can play a key role (Hoffman et al., 2010). However, choosing a good L is critical, as a poor choice may cause the "buffering" algorithm to add too many observations to achieve a good approximation. The authors state that this is not the case, but the evidence is not convincing, as it comes only from simulated data.
Q2: Please summarize your review in 1-2 sentences
The paper presents a thorough development of SVI for HMMs that respects the chain-based dependencies in long time-series data. Prior work has dealt with the somewhat different challenges posed by dependencies in network data, or has considered only collections of independent time series. The authors demonstrate improvements over the EM algorithm on a large genomics data set, and study decisions within the algorithm using simulated data.

One drawback is that they have not presented a comparison with prior SVI algorithms (Johnson et al. (2014)) for collections of independent time-series data.

Submitted by Assigned_Reviewer_39

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
This paper provides an innovative method for solving HMM models using SVI. The algorithm samples subchains in sequence data, but also addresses the issue of dependency breaking at the edges of subchains. Specifically, this problem is resolved by expanding the scope of a buffer around the target subchain until the subchain posteriors, across different buffer lengths, converge. The experiments show that the method is much more efficient than existing methods and applicable to real data. This paper provides insights for applying online learning algorithms to time-dependent models.
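To make the buffering idea concrete, here is a minimal toy sketch of how such a scheme might look (our own illustration, not the authors' code): run forward-backward on the subchain padded with a growing buffer on each side, and stop once the subchain's posterior marginals stabilize.

```python
import numpy as np

def forward_backward(pi, A, lik):
    """Posterior state marginals for an HMM segment.
    pi: (K,) initial distribution, A: (K, K) transitions,
    lik: (L, K) per-step observation likelihoods."""
    L, K = lik.shape
    alpha = np.zeros((L, K))
    beta = np.ones((L, K))
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, L):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(L - 2, -1, -1):             # backward pass
        beta[t] = A @ (beta[t + 1] * lik[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

def grow_buf(pi, A, lik_full, start, L, eps=1e-4, step=2):
    """Grow a buffer around lik_full[start:start+L] until the
    subchain posterior stops changing (or the buffer hits the
    chain boundaries, where it trivially converges)."""
    T = lik_full.shape[0]
    buf, prev = 0, None
    while True:
        lo, hi = max(0, start - buf), min(T, start + L + buf)
        post = forward_backward(pi, A, lik_full[lo:hi])
        sub = post[start - lo : start - lo + L]  # marginals on the subchain
        if prev is not None and np.abs(sub - prev).max() < eps:
            return sub, buf
        prev, buf = sub, buf + step
```

This differs from the paper's growBuf in details (stopping rule, how boundary messages are seeded), but captures the mechanism: the buffer observations are used only to stabilize the messages entering the subchain, and are then discarded.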

Minor comments:

Equation (1): p(y1|x1) is missing the phi parameter.

Line 355: “Table 4” -> “Table 1”

Table 1: What is the predictive log-likelihood in this experiment? What is the held-out percentage? What is the setting of the hyper-parameters in this case? Why was ||A - A0||_F not compared as well in this experiment? The error is hard to judge in the log-predictive case.

For all these experiments, the authors only mention the hyper-parameter setting for k; what about the other hyper-parameters?

Page 8
In the human chromatin segmentation experiment, what is the runtime for the DBN?
Q2: Please summarize your review in 1-2 sentences
An important advance in dealing with large scale HMM inference.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their positive remarks and thoughtful criticisms. Your responses have brought to our attention points of confusion in our paper that we will better emphasize and clarify. Our paper presents an algorithm that

1) allows efficient, scalable Bayesian inference in a dependent setting and
2) accounts for the broken dependencies which are ignored by naive subsampling.

Common to the reviews was some confusion about growBuf; we will focus on clearly presenting the idea after rewriting the introductory material more concisely. growBuf is critical: it is a novel solution to 2).

The real data application is primarily meant to illustrate point 1), enabling inference on the full dataset and achieving performance similar to a more complex model. All reviews rightly mention an incomplete runtime comparison: we were unfortunately left waiting for a response regarding runtimes from Noble et al. prior to submission, but a specific timing comparison will definitely be present in the final revision.


- Our algorithm simplifies to Johnson’s if given a collection of many short, independent data streams, but their method is not directly applicable to long chains. Breaking up a long chain, treating the segments as independent, and applying Johnson’s algorithm would be equivalent to applying our algorithm without growBuf and with a naive sampling scheme, which our simulations demonstrate is suboptimal. A direct comparison to their implementation was not included because their paper draft was released shortly before the NIPS submission deadline. We will include more discussion emphasizing this difference.

- The computational complexity of our algorithm is O(L*M*K^2) per iteration, where L is the subchain length, M is the number of subchains per minibatch, and K is the number of states. Since L*M << T, this leads to significant efficiency gains.
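As a back-of-the-envelope illustration of that complexity claim (toy numbers, our own, not from the paper): each of the L steps in each of the M subchains multiplies a K-vector by the K x K transition matrix, versus T such steps for a full-batch pass.

```python
K = 25            # number of hidden states (toy value)
T = 10_000_000    # total chain length (toy value)
L, M = 1_000, 10  # subchain length and subchains per minibatch

svi_ops = L * M * K ** 2   # O(L*M*K^2) per SVI iteration
batch_ops = T * K ** 2     # O(T*K^2) per full-batch iteration
print(svi_ops / batch_ops)  # 0.001: three orders of magnitude cheaper
```

The per-iteration saving is exactly the ratio L*M/T, which is why the condition L*M << T is the crux of the efficiency argument.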

- It is true that minibatch size can play a key role in the success of SVI algorithms: this is more analogous to the choice of M than of L, since it has to do with reducing noise in the stochastic gradients by covering more of the data. Choosing a large enough L is important so that the approximation error from edge effects is small after message passing; this is illustrated in the reverse-cycle experiment. Importantly, we see that our proposed growBuf algorithm ameliorates sensitivity to the choice of L, enabling successful inference even with a poor setting.

- In our experiments, growBuf adds at most around 20 observations when L is small and the current parameter estimates are noisy, and only around 4 observations when L is large (as in the real data application). To clarify, adding many observations does not increase error, as the "buffer" itself is discarded. growBuf adding far fewer observations than T is common in most real-world applications due to limited memory in the chain.


- The typos and undefined items (e.g., hyper-parameter settings) will be corrected in the final version.

- As for the table heading “log-predictive”: this column contains the predictive log-likelihood values for held-out data, and will be relabeled for clarity. This is computed with 10% of observations held out, over 5 folds. We chose to omit the Frobenius norm comparison simply due to redundancy, as it tells the same story as “log-predictive”, but we can include it in the final version for consistency with the other experiments.


- We apologize for the mislabeling of Table 1 (labeled Table 4), which includes the timing comparisons you mention. We are unsure which additional figures/results are missing, since all seem to be present.

- We will shorten the introduction in favor of elaborating upon growBuf, and agree with this suggestion, as growBuf is a focal point of our contribution. Your skepticism of growBuf may stem from a shortcoming in our presentation of the method. For most real-world applications, the memory in the chain tends to be fairly local (second eigenvalue of the transition matrix bounded away from one), especially in the context of our long chains of interest. In a different setting, Gonzalez et al. (JMLR, 2009) harness this short memory as well.

- If the parameters have changed between subchain samplings, growBuf does not necessarily converge to the same point.

- Your question about the optimal tau and its relation to the second largest eigenvalue is interesting. As mentioned briefly in our paper as well as in Gonzalez et al (JMLR, 2009), it is difficult to obtain an analytic expression for the optimal tau in (17) for our case. Importantly, even with an analytic relationship between tau and the second eigenvalue, computing the optimal tau would rely on knowing the true A, which is part of what we are aiming to learn. Following Gonzalez et al (who even assume known parameters), we simply provide a practical solution to this problem.
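For readers following this exchange: the magnitude of the second eigenvalue of A governs how quickly the chain forgets its past, and hence how far edge effects can propagate into a subchain. A quick way to inspect it (our own toy illustration, not from the paper or rebuttal):

```python
import numpy as np

def second_eigenvalue_magnitude(A):
    """|lambda_2| of a row-stochastic transition matrix: values near 1
    mean long memory, values near 0 mean fast forgetting."""
    mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return mags[1]

A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(second_eigenvalue_magnitude(A))  # approximately 0.7
```

As the rebuttal notes, using this in practice would require knowing the true A, which is itself being learned; the snippet is only a diagnostic on a given estimate.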

- The takeaway from the FDR values on the real data is that SVI-HMM obtains performance comparable to the complicated DBN used by Noble et al. while enabling efficient inference on the full dataset jointly (something infeasible for the Noble et al. method). Moreover, our results were acquired in only a couple of hours, much less time than the Noble et al. approach required. While it is certainly true that it is difficult to assess performance solely through FDR, the focus in Noble et al. is unsupervised pattern discovery, and FDR is the only metric provided for direct comparison.