Paper ID: 20
Title: Documents as multiple overlapping windows into grids of counts
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors devise a new topic model that combines gradual topic changes of counting grids with a mixture model. As a byproduct the counting grid gives rise to new topic visualizations, as word distributions are embedded in a 2D plane. This paper evaluates the gains from giving the model the freedom to combine topics that are not close in the 2D space---in analog to a teleportation step. The new model is evaluated on three data sets.

The model extension is very reasonable, the derivation is correct. I deem the integration of counting grids into topic models as a valuable area of advancing topic models, that seemed to get stuck over the last few years.

The experimental evaluation on three data sets is rich with insight. The authors compares the model to a reasonable selection of related models, including counting grids without the extension.

The paper is well-written and understandable.

My only complaint is that the extension is rather straight-forward, where the main part of the paper is dedicated to introducing counting grids. On the other hand we appreciate the in-depth evaluation on the gains of this extension.

Q2: Please summarize your review in 1-2 sentences
A reasonable extension to counting grid topic models. Convincing approach and experimental evaluation.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents an LDA-style mixed membership model where the component distributions are counting grids. Each document is a Dirichlet mixture over windows into the grid. Each window is a uniform distribution over the cells contained in that window. Each cell in the grid is a distribution over the vocabulary. The method is evaluated using classification accuracy, embedding quality, AUC in a retrieval task, and a visualization example.

I like this paper a lot. I think it's marginally above the accept threshold as is, and could be substantially improved with another round of revisions. I gave it an "incremental", but it's more like a 1.499.

The introduction does a reasonably good job presenting the basic intuition of the paper, that words often appear in a smooth continuum rather than distinct topics, and that the single-window model is too restrictive for real documents. Figure 1 is a bit of a missed opportunity. 1a works for me, but I found 1b and 1c confusing. A simpler toy example might be better, especially if it could be merged with Figure 2b.

I had a hard time moving from the introduction to section 2. In part, I felt there was a lot of technical terminology floating around, some of it redundant. What is a source, for example, or a bag? Are these cells and documents? I would pick a small set of concrete terms and use them consistently in both sections. Avoid "topic".

In Sect 2:

I would call pi_i(z) a distribution over the vocabulary, not a set of normalized counts. The latter may be its actual implementation in an algorithm, but here we're defining the model in the abstract. Give an example: pi_{1,1}("pizza") is the probability of the word "pizza" in the top left cell (or similar, if that's not correct)

What's a sample? Is that a document?

What is Z? The size of the vocabulary or the size of the grid? The variable z means "hidden variable" to me.

I'm really looking for a toy example.

Give an example for W, like 3x3.

Actually, I might restrict the initial presentation to square windows in 2D grids, for the sake of simplicity of notation, and then comment that the model is easily extended to d-dimensional grids and arbitrary window sizes.

I don't see why the number of non-overlapping windows is equivalent to the number of topics. The whole point is that the windows can overlap, no? Overlapping windows don't represent independent topics, but they do add modeling power, and that has to be taken into account for a fair comparison. Also, give a formula for kappa explicitly: it's not unambiguous right now (which ratio?).

I would rewrite section 2 around a step-by-step generative model. To generate a document, you first pick a distribution over windows. Then to generate each word, you pick a window, then pick a cell within the window, and finally pick a word from the cell.

The connection to LDA (windows of size 1x1) would be good to point out here.

It took a moment to figure out that U^W was a uniform distribution over a window. It looked like a matrix multiplication at first.

"Locations in the grid generate similar features" -- does this mean that the words in neighboring cells are semantically similar (but the words themselves differ), or that the distributions themselves are similar?

"The number of sources must not be specified..." First, I think this should be "need not be" -- "must not be" implies that you will get in trouble if you do! Second, I'm not sure what this means. The rest of the paper seems to assume that all dimensions are known in advance. Also, I'm still not sure what sources are (cells? windows?).

I didn't follow the marginalization step. I'd like a bit more explanation. If space is needed, variational update equations are an excellent candidate for supplementary material.

Distinguish between variables and distributions, and avoid referring to symbol alone: "the new updates for the cell distributions q(k) and the window distributions q(l) ..."

The comparison to Dunson et al is reasonable, but it might be worth talking about what is lost by using a grid rather than a continuous space.

Section 2.1 seems like a tangent.

Evaluation:

I didn't see held-out likelihood tests. It's not trivial for LDA. How would they be computed for this model?

I'd like to see a linear SVM used for document classification, in addition to the nearest neighbor method (I'm assuming that's BOW?).

See my earlier concerns about using "capacity" as a measure of approximate # of topics. Why not just the number of grid cells? In some sense CCG is just a fancier correlated distribution over the cells, like CTM or PAM.

The Euclidean embedding section is interesting. I'd like to see more of it.

What are the visual words? Are they given with the data set?

On the whole I buy the visualization aspect, because the word distributions are intrinsically embedded in a 2D space. I'd like to see more development of this idea. How would people use this embedding to learn something about the corpus? I didn't find Figure 5 particularly compelling for the fry/deep fry distinction, but the cluster of words that I associate with Indian recipes in the bottom right was interesting.
Q2: Please summarize your review in 1-2 sentences
An interesting twist on topic models, with a more psychologically plausible notion of semantic gradients, and an intrinsic connection to visualization interfaces. Presentation is confusing, and results could be strengthened.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes a creative alternative for topic modeling:
mixed-membership on a "counting grid." The advantage of this approach
seems to be that you can move smoothly across the grid, achieving a
high effective number of topics while the spatial smoothing prevents
overfitting. The disadvantage seems to be that there are more
parameters (grid dimension and size, and window size). A variational
inference procedure that is somewhat to LDA is possible, although no
speed/complexity comparisons are provided. The spatial nature of the
approach has potential advantages for visualization as well.

One major problem for the paper is that the presentation is quite
confusing. This is partly due to the writing, which could generally
use a closer edit and more attention to detail. The mathematical
notation is also confusing, with too many variable names and some
inconsistency. For the probability of a word, we have four different
notations.

pi(z)
pi(w_n)
h_{i,z}
p(w_n | k_n, \pi)

I think the paper could survive with just two:
p_{W|K}(w | K = k) -- the probability of a word, conditioned on a draw directly from location k
p_{W|L}(w | L = l) -- the probability of a word, conditioned on a window centered on location l

with
p_{W|L}(w | L = l) \propto \sum_{k \in W_l} p_{W|K}(w | K = k) p_{K|L}(k | L = k),

where the last term has been defined to be uniform later in the paper
(this is not clear in Figure 2). Avoid capital I as a variable
name, especially in a sans-serif font.

The variational approximation seems to include a term q(\theta), but
this is never updated; rather, theta is updated directly in equation
(5). This may be connected with the definition of q(\theta) as a
"Dirac function centered at the optimal value theta_hat," but
theta_hat is never defined or updated. I think it would make more
sense just to treat this as a parameter and not bother to define
q(\theta).

The note that the "minimization procedure can be carried out
efficiently in O(n log n) using FFTs" is not helpful. At a minimum,
the reader needs a citation to a paper that explains how to do this;
but really, give specifics in an appendix. Also, please explain what
is n, and how it depends on the window size, vocabulary size, and
number of word tokens.

The marginalized updates (9) and (10) are presented in terms of the
overall likelihood rather than the variational bound. It is indeed
interesting that these updates don't depend on each other, although
it's a little misleading. If they were truly decoupled, there would be
no point in having l, as it would be decoupled from the observations
w. In fact, the update for q(l) affects theta, which in turn affects
the update for q(k); conversely, the update for q(k) affects \pi,
which in turn affects q(l).

The empirical results seem strong, beating published work on the 20
news task, and also on a new document classification datasets
(although the mind boggles that new document classification datasets
are still felt to be necessary). What is the meaning of the multiple
shadings of the circles in Figure 3? How were the model parameters
(grid size, dimensionality, window size) determined in the multimodal
retrieval and euclidean embedding experiments?

The visualization application seems interesting, although it seems
possible for individual words to appear in multiple positions, and
this has happened in Figure 5 for a few words (e.g.,
chicken). Wouldn't this be confusing for users?
Q2: Please summarize your review in 1-2 sentences
Creative idea, confusing presentation.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for positive scores and even much more positive comments. Since [3] is new and somewhat overlooked, we feel that this paper will serve both the purpose of popularizing the main idea of counting grids and making a significant step ahead in topic models, as the reviewers imply. This will make impact in various communities beyond ML such as NLP, Visualization, HCI, Neuroscience, CV.

*R1* Yes, Rome was not built in one day. Whenever a relatively new model is extended, part of the audience asks that the paper stands on its own, as the basic idea has not yet been digested. A significant fraction of time it is one of the extensions that finally alerts the community to the recent overlooked developments, and this makes it even more important for early papers to properly review the basic idea -- in this case [3]. So we explain this basic idea before moving to componential structure, which significantly enriches the model.

*R2* We will follow the suggestions to prepare the final version which will meet the expectations.
The overlapping windows (h at nearby locations) have similar feature distributions because they share many cells. pi can have variable neighboring distributions, though they do tend to be semantically similar.
Regarding the capacity, CCG has as many topics as cells in the grid since the window, regardless of its size could be positioned to start at any of the cells, each time yielding a new distribution over words. However, the differences for two highly overlapping windows are very small. So compared with LDA where topics are less constrained, CCG uses a much larger number of topics, but they only differ from their neighbors slightly. However, any LDA model can be turned into a CCG model of any size by simply designating large enough nonoverlapping windows to topics, filling them with words in one of many ways so that the sum (h) represents the topic distribution, and then using a prior over locations in the grid so that hybrid (overlapping windows) are forbidden. Thinking of a capacity of the grid as equal to the size of the grid would be a bit misleading, since we can train a 100X100 model on 100 datapoints without overfitting and even underfit when W=99X99. On the other hand, if the data is truly generated from an LDA model, the learned grid would, theoretically be learned to spread the topics into non-overlapping windows. Thus the notion of capacity as we use it in the paper k=|E|/|W|.
Lack of space prohibited further discussion of Eucl. embedding methods. For example, CCG’s embedding better preserves the original distances that are below a certain threshold: the correlation between the cosine distance between the counts and the symmetric KL divergence between theta’s (grid location posterior) is higher than the correlation with the Euclidean distance in the spaces provided by LLE and ISOMAP when the analysis is limited to document pairs with small distances (\rho=~0.4 vs ~0.3 after CE - all documents are embedded but we only look at how well small distances are preserved). This is especially important in visualization, where we mostly only care about small distances
Visual words are, as in [7], quantized sift descriptors extracted from the images.
In Tab.1, BOW is the result of linear SVM.
Regarding held-out likelihood, different models, even different learning algorithms, are hard to compare [9], and the comparisons often do not translate to performance differences in applications. In addition, CCGs subsume LDA, which means that some advantage for CCGs is expected by default in such tests.
Regarding the browsing applications, our experience shows that the grid consumption makes it much easier for us to formulate search expressions which filter the grid. This is because each word in the grid is accompanied by related words, stimulating users’ own thinking in terms of word associations (word embedding is close to mapping, so the user suddenly realizes that searching for mapping rather than embedding will find more journal papers related to our submission). Additionally docs that use the same word in different contexts are separated into different spatial clusters and this help disambiguation.
Finally, even if the user is after a relatively specific subset of documents, and though they can find them quickly using our interface, the user gets exposed to a variety of terms form other documents embedded nearby, serving a similar purpose that peripheral vision provides in vision: establishing context, and providing additional vague awareness of other related content that might be useful in the near future.

*R3* CCGs may seem to require more parameters be set than LDA, but this is not true in practice as once the W is “sufficiently big” (>3x3), the main parameter that matters is the ratio between grid and window areas. In Fig.3, the shaded circles indicate the CG area (see ln 300).
The fact that the same word can appear in multiple positions (if needed) is one of the points we didn’t emphasize enough! As opposed to previous co-occurrence embedding methods that consider all pairs of words, our representation naturally captures the same word appearing in different multi-word contexts. The word “memory” in the Science magazine corpus is a striking example (memory in neruoscience, memory in electronic devices, immunologic memory). For the most part, presenting a raw grid does make sense to the users once they get used to the presentation: a prominent word is prominent locally and derives one of its multiple meanings from its neighbors. The repetition of one word in nearby positions may seem odd, so we used a simple heuristic to prevent this.
In each task we employed the dataset author’s validation partition and protocol; if not available, we used 10% of the training data as validation set. We will clarify this in the introduction of Sec.5.
We thank R3 for the helpful suggestions regarding the notation and cleaning up the math section.