Paper ID: 391
Title: Projecting Ising Model Parameters for Fast Mixing
Reviews

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Aims to improve the mixing rate of Gibbs sampling in pairwise Ising models with strong interactions, which are known to be "slow-mixing". Several projections to "fast-mixing" models are proposed; essentially, parameters are identified which are as close as possible to the original model, but weak enough to satisfy a spectral bound establishing rapid mixing. Experiments show some regimes where this leads to improved marginal estimates for a given computation time, compared to variational methods (mean field and belief propagation variants) and Gibbs sampling in the original model.

The technical content builds heavily on prior results establishing conditions for rapid mixing [4,8]. But I haven't seen the idea of projecting onto the nearest rapidly mixing model, as an approximate inference method, explored before.

On the domain of "toy tiny Ising models", the results offer some nice improvements over a good set of baselines (standard sampling and contemporary variational methods). My main concern is that there should be more than the usual amount of skepticism as to whether these results will scale to larger models. The experiments involve comparisons of the mixing times of samplers on different models, and it is hard to judge how these will scale with problem size. Also, there is an apparently computationally expensive projection step required to build the fast-mixing model, the cost of which seems not to be accounted for. Finally, it is not at all clear that generalization beyond binary states will be possible, since establishing convergence bounds more generally is far more challenging.

CLARITY:
In general the presentation is clear and reasonably accessible, given the technical content. But there are some places where reorganization and clarification is needed:
* The paper refers several times to "standard results" and even states these in the form of a theorem (e.g. Theorem 6) without reference or proof sketch. Lemma 5 is left unproven; the cited reference only proves the special case of zero field.
* Discuss properties of the dependency matrix R. For instance, it does not appear to be tractable to compute because of the maximization; please state whether this is the case. (I presume this is why Lemma 5 is invoked.)
* The second half of Sec. 4 is nearly impossible to follow. The text preceding Theorem 7 references g, M, and Lambda before they are introduced. Then the statement of Theorem 4 includes notation that is not really explained. If this optimization is going to be discussed, more explanation is needed. (Perhaps less time could be spent restating results from [8] which are not really used.)
* In Sec. 1, KL(q||p) is written for both directions of the KL divergence; the arguments should be swapped for one of them. In the first paragraph, q is used for the true distribution and p for the tractable approximation; this is the opposite of almost all related literature.

EXPERIMENTS:
It appears that the time comparison in Figure 2 does *not* include the computation required for projection. A statement to this effect appears in Sec. 6.1 but is ambiguous; please clarify, and if this is the case, show results with projection time included. As it stands, the proposed methods essentially get to use the output of a sophisticated variational optimization without penalty, which certainly makes the improvement over standard Gibbs less convincing.

It was disappointing not to see experiments on larger models, given that Gibbs mixing times often depend on problem size. There are certainly options for running experiments in regimes where junction tree is intractable, like using a model where symmetries let the true marginals be computed, or taking the output of a very long sampling run as truth.

I understand that KL(\theta||\psi) is intractable in general, but it would still be interesting to explore here as a potential "best case" for how sampling in an approximate model would perform. (Junction tree could be used for the toy models used in the submission.)

Mean field is a degenerate case of the reverse KL projection, as the paper points out, yet there is a large difference between mean field error and the error from reverse KL projection. This deserves discussion.
Q2: Please summarize your review in 1-2 sentences
The idea of approximating slow-mixing models by projecting to the closest fast-mixing model is a nice one, and recent work on mixing bounds is leveraged in an elegant way. But there are some concerns about experimental comparisons, and the limited range of models to which this approach is potentially applicable.

Submitted by Assigned_Reviewer_11

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a method for improving the mixing rate of Gibbs sampling in Ising models by projecting the model's original parameterization onto a new parameter setting that satisfies the Dobrushin criterion needed for fast mixing. They formulate the projection as a constrained optimization problem, where the constraints are needed to ensure that the new parameters are defined over the same space as the original parameters, and examine this projection under several distance/divergence measures.

In my opinion, methods that combine principles from stochastic and deterministic inference are an under-explored area, and as a result I think this is an interesting idea. While the idea of augmenting the original parameters to improve mixing time is straightforward, I found the description of the dual of the projection problem to be a little unclear - e.g. how do the z_ij*d_ij = 0 constraints ensure that the new parameterization is over the same space (can't it be over a smaller space)? I also was unsure of the overall procedure - do you perform the projected gradient operations to completion and then run a Gibbs sampler, or do you somehow interleave sampling with the gradient updates? An algorithm description box would help to clarify. Also, is the proposed projected gradient descent strategy guaranteed to converge?

I also found the experiments to be a little unconvincing. In the first set of experiments, why was the Gibbs sampler run for 30K iterations? Since you are comparing a sampling method to deterministic methods, a comparison on the basis of time would seem more fair. Also, where are the error bars on these charts - the reader cannot tell if these results are significant. The second set of results compares the original/naive Gibbs sampler with Gibbs samplers under different projections and shows that the projection does lead to faster mixing. However, it takes time to perform the projection operation, and this is not accounted for in this comparison (e.g. if the projection takes 1 minute, and the naive Gibbs sampler can generate 10k samples in that time, then the projection might not be worthwhile). Lastly, why was there no comparison to blocked Gibbs samplers or Gogate's "Lifted Gibbs Sampler"?
Q2: Please summarize your review in 1-2 sentences
All in all, a very nice idea. However, the development of the projection problem and the proposed method could use a little work, and a bolstered set of experiments is needed to convince me of the utility of the method.

Submitted by Assigned_Reviewer_13

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Success of inference by Gibbs sampling in MRFs (here, only with two labels, i.e., the Ising model) depends strongly on the mixing rate of the underlying Markov chain. The paper suggests the following approach to inference:

1) Project the input model (which possibly does not mix fast) onto the set of models that do mix fast.

2) Do inference on the obtained model that mixes fast.

The space of fast-mixing models is defined by bounding the spectral norm of the matrix of absolute values of the Ising edge strengths. "Projection" is defined via divergences between the Gibbs distributions, and is forced to preserve the graph structure. Projection in Euclidean distance is obtained by dualizing the initial task and using LBFGS-B. For the other divergences (KL, piecewise KL, and reversed KL divergences are implemented), a projected gradient algorithm is used. For the reversed KL divergence, Gibbs sampling (but on a fast-mixing model) must be done to compute the projection.
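
For concreteness, here is a minimal sketch (in Python/numpy) of the projection step as I read it. The alternating-projection scheme and the function names are only illustrative - the paper actually dualizes the problem and runs LBFGS-B, and a Dykstra-style correction would be needed for an exact Euclidean projection - but it shows the two ingredients: bounding the spectral norm of the absolute-value matrix and preserving the graph structure.

    import numpy as np

    def clip_spectral_norm(A, c):
        # Project a matrix onto {A : ||A||_2 <= c} by clipping its singular values.
        U, s, Vt = np.linalg.svd(A)
        return U @ np.diag(np.minimum(s, c)) @ Vt

    def euclidean_projection(theta, mask, c, iters=100):
        # Approximately project the coupling matrix `theta` onto the set of
        # parameters whose matrix of absolute edge strengths has spectral norm
        # at most c, while keeping the 0/1 sparsity pattern `mask` of the
        # original graph. Naive alternating projections, for illustration only.
        psi = theta.copy()
        for _ in range(iters):
            mag = clip_spectral_norm(np.abs(psi), c)  # bound the spectral norm
            psi = np.sign(theta) * mag * mask         # restore signs, keep graph
        return psi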

Extensive experiments on small random models are presented that compare the approximate marginals with the true marginals. The methods tested are the proposed one (with all the above divergences) and loopy BP, TRW, MF, and Gibbs sampling on the original model. Not only accuracy but also a runtime-vs-accuracy evaluation is done. The experiments show that the proposed methods consistently outperform TRW, MF, and LBP in accuracy, and, for a reasonable range of runtimes, also Gibbs sampling on the original model. Of the proposed methods, the one with the reversed KL divergence performs consistently best.

Comments:

The projected gradient algorithm from Section 5.1 in fact has two nested loops, the inner loop being LBFGS-B. Please give details on when the inner iterations are stopped.

It is not clear what the horizontal axis in the plots in Figure 2 (and the supplement) means. It is titled "number of samples", but sampling is used only for the reversed KL divergence. I believe the horizontal axis should be the runtime of the algorithm. Similarly, why not also report the runtime of LBP, TRW, and MF? This would ensure a fair accuracy-runtime comparison of all tested algorithms. Please clarify this issue - without that it is hard to interpret the experiments. Give absolute running times in seconds.

Please consider an experimental comparison with larger models. An interesting option is to use the models from the paper

[Boris Flach: A class of random fields on complete graphs with tractable partition function, to appear in TPAMI, available online]

which allow polynomial-time inference.

222: replace "onto the tree" with "onto graph T"

226: Shouldn't we say "subgradient" rather than "derivative"?
Q2: Please summarize your review in 1-2 sentences
Interesting paper, convincing empirical results. Practical utility may be limited, though, due to high runtime (this needs clarification in the rebuttal).
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thanks to all the reviewers for helpful comments. We have a couple of minor comments for each reviewer, and a longer discussion of the efficiency of projection, which all reviewers asked about.

(Minor Comments)

Reviewer_11

The z_ij*d_ij=0 constraints are to ensure that the new parameterization is not over a *larger* space. Without these constraints, projecting the parameters of a sparse graph would yield a densely connected graph.

Convergence is guaranteed, in a certain sense (see reference 3 on ergodic mirror descent).  However, as with mean-field, a non-convex objective is being fit, meaning local minima are possible.

Lifted Gibbs Sampling doesn't seem applicable here since we aren't in a Markov Logic Network setting, but one could certainly apply blocked Gibbs sampling. As we note in the paper, the mixing time bound used here is sufficient to guarantee fast mixing for block Gibbs. (Though it could be loose.) We agree that extending the mixing-time bounds and projections to these more complex samplers is important, but also quite challenging.

Reviewer_13

We have done some experiments on large (e.g. 40x40) zero-field planar Ising models, with results similar to those shown. (Although one must measure pairwise marginal accuracy, since the true univariate marginals are always exactly uniform.) Thank you for the reference on regular fully-connected graphs. Along with attractive zero-field planar Ising models, this provides a very interesting class of large-scale models with tractable exact marginals.

Reviewer_9

It is indeed non-obvious how to generalize these results beyond the Ising model. We quickly outline how this might be done on lines 418-420. After writing this paper, we spent several months working on this generalization, and found that it is possible to do these types of projections with general MRFs, although the more general projection algorithm is considerably more complex.

We actually ran all the experiments including projection under KL(\theta||\psi) (with marginals computed via the junction tree algorithm) but removed it out of concern that it might be confusing, since it is intractable on large graphs. Still, as you'd expect, it does perform better than the "reversed" KL divergence, so we'd be happy to put the results back in.

We were also at first surprised by the difference in performance between mean field and the reversed KL divergence. As one increases the allowed spectral norm from zero (mean field) to 2.5, the quality of the results smoothly interpolates, confirming that the performance difference is simply due to optimizing the divergence over a larger constraint set. Presumably this continues as the norm increases further, but of course Gibbs sampling could then be exponentially slow.

(Efficiency of Projection)

The experiments do not currently include the computational effort needed to find the projected parameters (as mentioned on lines 354-358). Particularly with the SGD algorithm, finding the most efficient optimization parameters (sample pool size, # Markov transitions, LBFGS convergence threshold, gradient step size) is challenging. We did not attempt to find the fastest parameters, instead opting for conservative (slow, reliable) parameters to ensure transparency of the results. Our thinking was, first, that the main goal is to show that strong fast-mixing approximations exist, and second, that projection could be deployed "offline" (e.g. one might project once, and then do inference after conditioning on various inputs).
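
To make the intended offline/online split concrete, here is a minimal sketch (Python/numpy, illustrative only): `project` stands in for any of our projection operators (Euclidean, piecewise KL, reversed KL), and no burn-in or thinning is shown.

    import numpy as np

    def approximate_marginals(theta, field, mask, project, c=2.5,
                              n_sweeps=30000, rng=None):
        # Offline: project the couplings once onto the fast-mixing set.
        # Online: run single-site Gibbs on the projected model and average.
        # `theta` is a full symmetric coupling matrix, `field` the univariate
        # parameters, and `project(theta, mask, c)` any projection operator
        # with spectral-norm bound c.
        rng = np.random.default_rng() if rng is None else rng
        psi = project(theta, mask, c)              # offline, done once

        n = len(field)
        x = rng.choice([-1, 1], size=n)            # random +/-1 start
        counts = np.zeros(n)
        for _ in range(n_sweeps):
            for i in range(n):                     # one sweep of single-site Gibbs
                b = field[i] + psi[i] @ x - psi[i, i] * x[i]
                x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * b)) else -1
            counts += (x == 1)
        return counts / n_sweeps                   # estimated P(x_i = +1)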

That being said, in hindsight it seems obvious that readers would also be interested in the efficiency of projection compared to Gibbs or variational methods, and the paper should certainly make such comparisons clear. Measurement is somewhat complicated by our unoptimized implementation in an interpreted language, but the computational bottlenecks are (a) Gibbs sampling and (b) the singular value decomposition called when optimizing the dual in Theorem 7. As these are implemented efficiently in C/Fortran, we report their times below for the 8x8 grid with attractive interactions of strength 3.

30,000 iterations of Gibbs sampling: 0.18s
A single SVD of a dependency matrix: 0.0026s
Euclidean projection (93 SVDs): 0.15s
Piecewise-1 projection (399 SVDs): 1.1s
KL via stochastic gradient descent (540 SVDs + 30,000 Gibbs steps): 1.4s + 0.2s = 1.6s
Variational methods: less than 0.001s

Here are the same measurements for strength 1.5 interactions:

Euclidean projection (23 SVDs): 0.061s
Piecewise-1 projection (206 SVDs): 0.51s
KL via stochastic gradient descent (252 SVDs + 30,000 Gibbs steps): 0.7s + 0.2s = 0.9s
Variational methods: less than 0.001s

Note here that, though SGD performs 60 Euclidean projections, the cost is much less than 60x that of a single projection, due to the use of warm starts.

Roughly speaking: with strength 3, a single Euclidean projection costs the same as 30,000 iterations of Gibbs, and a KL projection costs the same as 300,000 iterations of Gibbs. With strength 1.5, a Euclidean projection costs the same as 10,000 iterations of Gibbs, and KL costs the same as 30,000 iterations. The paper should definitely include a table of such timing measurements, and run Gibbs for 300,000 iterations (one more order of magnitude) to match the cost of KL projection. Note that Gibbs on the original model will still clearly be inferior to projection for strong interaction strengths and/or dense graphs. (Observe the essentially horizontal lines for the original parameters in Figs. 5-8.) An online hybrid sampling/projection algorithm would presumably be superior in this setting, but we believe this goes beyond what can be done in one paper.
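
For transparency about how such timings can be measured, here is a minimal sketch (Python/numpy, illustrative only; pure-numpy timings will of course not match the C/Fortran figures above, and the grid size and coupling values are placeholders).

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    n = 64                                      # an 8x8 grid has 64 variables
    theta = rng.normal(scale=0.5, size=(n, n))
    theta = (theta + theta.T) / 2               # symmetric stand-in couplings
    np.fill_diagonal(theta, 0.0)

    # Bottleneck (b): a single SVD of the (absolute-value) dependency matrix.
    t0 = time.perf_counter()
    np.linalg.svd(np.abs(theta))
    svd_time = time.perf_counter() - t0

    # Bottleneck (a): a fixed budget of single-site Gibbs updates.
    x = rng.choice([-1, 1], size=n)
    t0 = time.perf_counter()
    for _ in range(30000):
        i = rng.integers(n)
        b = theta[i] @ x                        # diagonal is zero, so no self-term
        x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * b)) else -1
    gibbs_time = time.perf_counter() - t0

    print(f"one SVD: {svd_time:.4f}s, 30,000 Gibbs updates: {gibbs_time:.3f}s")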