Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a new MCMC sampling scheme for
Dirichlet process mixture models. The sampling scheme can be parallelised,
and empirical results are presented that show the method has superior
convergence properties to existing methods. These are both important
qualities in the context of working with DP mixture models.
The
presented method is constructed from 'restricted' samplers, that is,
samplers that satisfy detailed balance but are not ergodic. The authors
show that one can combine multiple such samplers to produce an ergodic
chain, hence ensuring the uniqueness of the stationary distribution to
which the MCMC chain converges. By working with these 'restricted'
samplers, they are able to produce a sampling scheme with the above
desirable properties. The resulting sampler is a combination of restricted
Gibbs steps and 'split' steps, which together form a valid sampler.
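To make this construction concrete, here is a minimal toy sketch (ours, in Python; it is not the paper's sampler) of two kernels on a four-state space that each satisfy detailed balance with respect to a uniform target but are individually non-ergodic, while their composition reaches every state:

    import random
    from collections import Counter

    # Two 'restricted' kernels on {0, 1, 2, 3} with a uniform target.  Each kernel
    # proposes a swap within fixed pairs (always accepted under a uniform target,
    # so detailed balance holds), but neither can reach all states on its own.
    def kernel_a(x):
        # moves only within the pairs {0, 1} and {2, 3}
        partner = {0: 1, 1: 0, 2: 3, 3: 2}[x]
        return partner if random.random() < 0.5 else x

    def kernel_b(x):
        # moves only within the pairs {1, 2} and {0, 3}
        partner = {1: 2, 2: 1, 0: 3, 3: 0}[x]
        return partner if random.random() < 0.5 else x

    x, counts = 0, Counter()
    for _ in range(100_000):
        x = kernel_b(kernel_a(x))   # alternating the restricted kernels is ergodic
        counts[x] += 1
    print(counts)   # all four states visited with roughly equal frequency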
Results are presented on both synthetic (2D Gaussian clusters) and
real data. Even without parallelisation, there are gains to be had in
terms of convergence. When parallelised, the algorithm of course runs much
faster (with regards to wallclock time).
This is an interesting
and useful piece of research. DPs are widely used and are often
computationally intensive to work with. Being able to efficiently
parallelise MCMC sampling for a DP is therefore very useful. This is also
an interesting and novel way to construct an MCMC sampler, and so may
represent a new direction for research in this area.
Q2: Please summarize your review in 1-2
sentences
This paper presents a new MCMC sampling scheme for
Dirichlet process mixture models. This scheme is highly parallelisable and
has good convergence properties. Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a parallel algorithm for the DPMM
that parallelizes an RJMCMC sampler that jumps between finite models. While
the parallelization and the RJMCMC sampler are proposed together, I will
separate them for the purpose of this review, in order to ask questions
about each part separately.
First, the RJMCMC algorithm (by which
I mean, the algorithm we would have on a single cluster). Here, we use a
reversible-jump MCMC algorithm to jump between finite-dimensional
Dirichlet distributions. As an aside, since $\bar{\pi}_{K+1}$ is not used
in the mixture model (the mixture model is defined on the renormalized
occupied K components), it would seem to make more sense to define a
K-dimensional, rather than a (K-1)-dimensional, Dirichlet distribution;
this is valid under the marginalization properties of the Dirichlet
distribution, since equation 10 samples from a distribution proportional
to $\pi_1 \cdots \pi_K$.
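For reference, the standard Dirichlet aggregation and renormalization identities that this argument appears to rely on can be stated as follows (our notation, not necessarily the paper's): if $(\pi_1,\dots,\pi_{K+1}) \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_{K+1})$, then merging components gives $(\pi_1,\dots,\pi_{K-1},\,\pi_K+\pi_{K+1}) \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_{K-1},\,\alpha_K+\alpha_{K+1})$, and the renormalized subvector $(\pi_1,\dots,\pi_K)/S \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_K)$ with $S=\sum_{k=1}^{K}\pi_k$, independently of $S$.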
To jump between model dimensionalities, the
authors propose a split/merge RJMCMC step that is reminiscent of that of
Green and Richardson. In fact, the “naive” choice is very similar to the
Green and Richardson method, and I am not sure why it is not valid in this
case. I would like to see a more detailed comparison of the Green and
Richardson method (and other split/merge methods) and the proposed method,
and the reasons for the different design choices.
The
parallelization method splits the clusters onto processors with
probability related to the similarity between clusters. Each data point
can only be assigned to processors within its own cluster. The split/merge
method of creating new clusters, and the fact that atoms are explicitly
instantiated, ensures that there is no alignment problem of new clusters
across processors. Update of cluster parameters can be done in parallel;
however, contrary to line 194, I do not see how sampling the $\pi_k$ can be
done in parallel. Further, sampling the superclusters must be done in
parallel.
The experiments suggest improved performance; however, I
would be interested in seeing further experimental evaluation and
information. What is the number of superclusters? How often are $\pi$ and
the super-clusters updated? How does performance scale with number of
clusters? Also, it would be interesting to discuss why your split/merge
sampler performs so much better than the other split/merge sampler.
Additional points:
Line 107: The Dirichlet process can
be sampled by normalizing an infinite sequence of gamma random variables,
not beta random variables (or more accurately, by normalizing a sample
from a gamma process). The Sethuraman paper cited describes an iterative
algorithm for sampling size-biased samples from a DP, but it is based on
the fact that the DP has Dirichlet marginals rather than normalization of
an unnormalized random measure.
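To make the distinction concrete, here is a small illustrative sketch (ours; the truncation level K is an assumption for illustration, and this is not code from the paper) contrasting the stick-breaking construction with the normalized-gamma view:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, K = 1.0, 1000   # concentration and truncation level (illustrative only)

    # Stick-breaking (Sethuraman): weights in size-biased order from Beta(1, alpha)
    # draws; the truncated weights sum to just under one.
    v = rng.beta(1.0, alpha, size=K)
    w_stick = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

    # Normalized-gamma view: the DP is a normalized gamma process.  Evaluated on a
    # partition of the space into K equal-mass cells, this reduces to normalizing K
    # independent Gamma(alpha/K, 1) draws, i.e. a symmetric Dirichlet -- the finite
    # analogue of the construction the reviewer contrasts with stick-breaking.
    g = rng.gamma(alpha / K, 1.0, size=K)
    w_gamma = g / g.sum()

    print(w_stick[:5], w_stick.sum())
    print(np.sort(w_gamma)[::-1][:5], w_gamma.sum())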
Line ~079, line ~131: In addition
to the approximate truncated and finite methods mentioned, there exist
instantiated-weight samplers that are asymptotically correct -- for
example, slice sampling and retrospective sampling approaches, or the
recent paper by Favaro & Teh.
Q2: Please
summarize your review in 1-2 sentences
Scalable inference is an important goal in Bayesian
nonparametrics, and this shows signs of being a useful approach; however, I
would like to see further evaluation and
exploration. Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Massively Parallel Sampling of DP Mixture Models using
Sub-Clusters
The paper proposes a new sampling strategy for
Dirichlet process mixture models that can be applied both in the conjugate
and non-conjugate setting. The method is based on restricted Gibbs
sampling, which can be computed in parallel, and split-merge moves.
Together, these two move types form an ergodic Markov chain, which
importantly can be implemented in a highly parallel fashion. An important
idea used in the paper is that the parameter space is augmented with sub-
and super-clusters which serve to facilitate both the split moves and the
restricted Gibbs moves.
Efficiently sampling from Dirichlet
process mixtures is an important problem, and this paper makes a
significant contribution towards designing correct parallelizable MCMC
methods for this problem.
Regarding the use of super-clusters in
the restricted Gibbs move, I am not sure I understand the argumentation.
If we think of Algorithm 1 as just some random procedure for choosing a
super-cluster, which chooses the particular restricted Gibbs proposal used
subsequently, then the random procedure must not depend on the current
state of the MCMC chain to be valid, but it does. If we alternatively
think of the super-clusters, g, as another parameter in an augmented
parameter space then Algorithm 1 simply defines the conditional
distribution of g given all other parameters - however, when sampling the
parameters conditioned on g the joint distribution of the parameters and g
should be used when computing the acceptance probability. Perhaps this is
what the authors suggest, but I am not sure, or maybe I misunderstand
something?
Another thing that puzzles me is the proposed split
procedure. As I understand it, the authors propose a split-merge procedure
that in itself is a valid MCMC procedure - thus in theory running this
alone would sample correctly from the posterior. However, it is noted that
the merge step is (almost) never accepted and can be omitted. I would have
liked some more discussion about this, since this (for me at least) goes
against intuition.
The ideas in the paper are well presented,
although I found some of the argumentation regarding the choices of
distributions for the auxiliary variables as well as the above mentioned
details difficult to follow. I think the introduction, review, and
experimental section are well written and sufficiently detailed.
I
look forward to hearing the authors' response to my comments.
UPDATE: I am happy with the authors' comments and confident that
this paper is a solid and important contribution.
Q2: Please summarize your review in 1-2
sentences
A well written paper with interesting and useful new
ideas, but with some unclear parts that could be improved.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Clarity: Lots of little issues with the
references: "dp", "dirichlet", "Statstica", "markov". Good writing and
grammar otherwise.
Quality: General parallel clustering methods need to
be cited as well: The database community has looked at distributed
clustering for years, starting with BIRCH by Zhang, Ramakrishnan and
Livny, the use of variants of KD-trees, and simple map-reduce style
approaches in an EM framework. Bradley and Fayyad had papers on
other scaling and distributed methods in the late 90s. The effort here
is *not* on the mixture model esoterics; it's on large-scale
distance/log-prob calculations.
Split and merge clustering
algorithms in an EM style, keeping a pair of sub-clusters, were
published in 1970 by Wallace and Boulton with SNOB. The Jain and Neal
split-merge proposals do not appear to work well on larger problems
and so MCMC split-merge approaches are important.
The focus on
Dirichlet process mixture models could be relaxed. One can use
Dirichlet mixture models with a large K and small alpha as well, since
these often compete well with DPs. So the focus should be
probabilistic mixture models generally. That said, the authors do a
great job of simplifying things and basically removing any complexity
created by the DP. Equation (7) is the key. Using DPs to "pick the
right number of clusters" is not always significant ... one can use
Pitman-Yors, and even Dirichlets for this, and setting the other
hyperparameters can greatly affect the number of clusters chosen.
This paper is a great tour of techniques in sampling. Very good
read. I made lots of notes on technical details to check, but
everything seemed to pan out. I would like to see a better theoretical
statement about how multiple restricted samplers can be stitched
together to create a correct sampler. Automatically rejecting the
merge moves (end of section 5.3) is an interesting result!
The
results section is way too short, though I don't know what could be
eliminated to give more room for it. The results seem to justify the
approach, though the sizes are a bit small.
One problem I see:
3000 seconds for full Gibbs cycle, versus 5 seconds for your algorithm.
That is a 600-fold speedup. Perhaps your tricks with the g variable have
allowed a good speedup, but 600?
Originality: The combination
of the sub-cluster idea with the special samplers and the restricted
samplers is original in my understanding. Perhaps better split-merge
MCMC sampling exists already?
Significance: This is the first
significant attempt I have seen to address split-merge clustering in
MCMC using methods from early EM style algorithms. Anyway, this is a
significant development that should open the way for many other
problems where split-merge is needed, not just clustering. I have some
other applications in mind, e.g., topic
models!!! Q2: Please summarize your review in 1-2
sentences
Authors: please reread the review; some changes have been made.
The related work section is poor. The contribution, however,
seems to be a tractable split-merge sampling procedure for MCMC
clustering, and that is very important, for the Dirichlet distribution
too.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their insightful and
helpful comments. They will be considered upon any final submission of the
work. We would appreciate suggestions from the reviewers for sections that
can be shortened so that other sections that need more discussion can be
expanded. Individual reviewer comments are now addressed.
==R6== We believe R6 has a misconception about this work.
The majority of computation time in typical methods is spent sampling
cluster labels. The presented algorithm exhibits intra-cluster
parallelization via super-clusters *and* inter-cluster parallelization
since each point’s label assignment can be sampled in parallel via the
restricted Gibbs sampler. Inter-cluster parallelization samplers typically
require approximations or result in slower convergence (see below ++).
The presented method does not suffer from these.
“make more sense
to define a K-dimensional rather than a K-1 - dimensional Dirichlet”
We are unclear as to what R6 means. While the weights in Eqn 8 are a
K+1 dimensional Dirichlet, the restricted sampler only considers the
non-empty, K dimensional Dirichlet. There is no K-1 dimensional Dirichlet.
“I am not sure why [the naive choice is] not valid” It is
valid, but results in posterior distributions that are difficult to sample
and go against intuition [Ln 250-257]. Details were left for the
supplement [Suppl. Ln 40-85] to not detract from the paper.
“comparison [to] Green and Richardson [10]” [10] develops a
general framework for split/merge moves, but does not explicitly specify
how to choose their “mock weights”, w, or their mapping “vector
functions”, m. Proposed splits are unlikely to fit the posterior since
auxiliary variables, u and w, governing the split cluster parameters and
weights are proposed independently of the data. Furthermore, similar to all
other split methods [4,14,15], a split is constructed in linear time and
fed into a Hastings acceptance. If the split is rejected, considerable
computation is wasted, and all information contained in learning the split
is forgotten. In contrast, our splits are learned over many iterations via
the parallel restricted Gibbs sampler. Conditioned on the sub-clusters,
the proposed splits are evaluated in constant time [Ln 69-78, 280-285,
336-350].
“sampling the pi_k... in parallel” They are sampled
by normalizing K independent draws from a Gamma distribution. The
normalization term can be computed with a parallel reduction, and the
actual sampling and division are independent and parallelizable.
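As an illustrative sketch of this point (our own code, assuming the standard gamma representation of the Dirichlet; not the authors' implementation, and the parameter values are hypothetical):

    import numpy as np

    rng = np.random.default_rng(1)

    # A Dirichlet(a_1, ..., a_K) sample can be formed from K independent draws
    # g_k ~ Gamma(a_k, 1) followed by a single normalization.  The K draws can be
    # made in parallel (e.g. one per cluster); only the sum needs a reduction,
    # and the final division is again elementwise and parallelizable.
    a = np.array([3.0, 1.5, 7.2, 0.8])   # hypothetical posterior Dirichlet parameters
    g = rng.gamma(shape=a, scale=1.0)    # independent per-component gamma draws
    pi = g / g.sum()                     # reduction + elementwise divide
    print(pi, pi.sum())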
“number of superclusters?” This is automatically determined
from our algorithm. While the count depends on the data, we typically see
1-5 super-clusters for ~20 clusters.
“How often are pi and the
super-clusters updated?” We alternate the sampling of {labels and
sub-cluster labels}, {cluster parameters and weight}, and {super-cluster
labels}. These details were mistakenly left out and will be included in
any final submission.
“scale with number of clusters?” It
should scale similarly to other sampling algorithms since the core of the
algorithm is the same.
“why [is performance] better than the other
split/merge sampler.” See [Ln. 69-78] and the comparison to other
split/merge methods above.
“The DP can be sampled by normalizing
an infinite sequence of gamma” Agreed, we will correct the error.
Missing references++ The slice/retrospective samplers of
[Walker, Papaspiliopoulos & Roberts, and Favaro & Teh] are exact
instantiated-weight samplers. In these cases, if the cluster parameters
(theta) are not instantiated, the sampling cannot be parallelized. If they
are instantiated, empty cluster parameters are sampled from the prior, and
it is unlikely to obtain a parameter that fits the data [Ln. 135-137]. The
final submission will include the references.
==R7== “when
sampling the parameters conditioned on g the joint distribution of the
parameters and g should be used” We circumvent this problem with the
following relationship: p(theta,g|x,z) = p(theta|x,z) p(g|theta,x,z). We
can then draw a joint sample of theta and g by theta~p(theta|x,z) [Eqn 9],
followed by g~p(g|theta,x,z) [Alg 1]. However, as indicated by the
results, the super-clusters do not have much of an effect in the
~20-cluster datasets.
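Written out (our paraphrase of this reply): since $p(\theta,g \mid x,z) = p(\theta \mid x,z)\,p(g \mid \theta,x,z)$, drawing $\theta \sim p(\theta \mid x,z)$ followed by $g \sim p(g \mid \theta,x,z)$ yields an exact joint sample of $(\theta,g)$, so no additional Metropolis-Hastings correction involving $g$ is needed for this step.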
“split-merge procedure that in itself is a
valid MCMC procedure” Actually, without the restricted Gibbs sampler,
the proposed deterministic splits would never change, and the resulting
chain would not be ergodic [Ln 280-350].
“the merge step is never
accepted... more discussion, since this goes against intuition.”
Detailed discussion was left to the supplement [Suppl. Ln 182-256] to
not distract from the main points and due to space limitations.
==R8== “parallel clustering methods need to be cited”
These were omitted due to the focus on DPMMs. We will include them in
a final submission space permitting.
“The focus on DPMMs is
misdirected” We suppose this is a matter of preference. While the
methodology applies to generic mixture models, we felt it was interesting
that combining splits with restricted moves creates an ergodic chain.
Restricted moves are unnecessary for finite mixture models as all clusters
can be instantiated.
“I would like to see a better theoretical
statement about how multiple restricted samplers can be stitched together
to create a correct sampler” It is quite difficult to give general
conditions on combining multiple restricted samplers for an arbitrary
model. The restricted samplers must combine to make the chain reachable
and aperiodic, but these are just the general principles of ergodicity.
“the [dataset] sizes are a bit small” We test on datasets that
are feasible to run with typical algorithms. EDIT: For the final
submission, and thanks to R8, we have corrected an inefficiency existing
in all algorithms to improve performance.