Paper ID: 697
Title: A Deep Architecture for Matching Short Texts
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper aims at measuring the similarity between pairs of short texts using a neural network. Each input unit is associated with a set of words from each text and first computes a bilinear match score from its pair of word sets; the rest of the network is more classical. The connectivity pattern of the network, i.e., the association of terms to input units and the connections between layers, comes from a multi-resolution topic model.
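For concreteness, my understanding of each input unit is captured by the following minimal sketch, assuming bag-of-words vectors for the two texts and a low-rank factorization A_p = L_p R_p^T of the patch's bilinear matrix (the names and the factorization are illustrative, not the authors'):

```python
import numpy as np

def local_match(x, y, patch_x, patch_y, L, R):
    """One input unit: a bilinear match score restricted to a patch.

    x, y             : bag-of-words vectors for the two short texts
    patch_x, patch_y : indices of the words the patch covers on each side
    L, R             : low-rank factors, so the patch matrix is A_p = L @ R.T
    """
    xp = x[patch_x]              # restrict each text to the patch vocabulary
    yp = y[patch_y]
    return xp @ (L @ R.T) @ yp   # scalar score x_p^T A_p y_p
```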

The paper is clear and reads well. References to related work and the application motivation are appropriate. The combination of a multi-resolution connectivity structure and bilinear input matching is original and seems to be a nice way to address the targeted problem efficiently. I have a few suggestions to improve the paper.

** Model Overview **

I feel the analogy with image patches is not very clear. I would suggest directly introducing the fact that you use a hierarchical topic model to build the network connectivity, defining the input layer, and showing how information propagates in the network, possibly with a simplified running example. This is important for understanding the paper, and it highlights the originality of your approach. You can take more space for that and drop the gradient computations, which are not necessary.

** Structure from LDA **

Getting the network structure from multi-resolution LDA involves several hyperparameters: the number of resolutions and the number of topics per resolution. Then getting sparse connections between words and topics, as well as between topics from different resolutions, involves binarizing continuous probabilities and a word-overlap measure. How did you pick these parameters, and could you give a sense of the sensitivity to them? In particular, there seems to be an efficiency/effectiveness tradeoff in this binarization step.
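To make the sensitivity question concrete, here is the kind of binarization and cross-resolution linking I have in mind; the threshold and overlap values are purely hypothetical hyperparameters, and the paper's exact criteria may differ:

```python
import numpy as np

def topics_to_word_sets(phi, threshold=0.01):
    """Binarize a topic-word matrix phi (K x V): keep the words whose
    in-topic probability exceeds a threshold (a hyperparameter)."""
    return [set(np.flatnonzero(row > threshold)) for row in phi]

def link_resolutions(fine_sets, coarse_sets, min_overlap=0.5):
    """Connect a fine-resolution topic to a coarse one when enough of
    its words are covered by the coarse topic's word set."""
    edges = []
    for i, fine in enumerate(fine_sets):
        for j, coarse in enumerate(coarse_sets):
            if len(fine & coarse) / max(len(fine), 1) >= min_overlap:
                edges.append((i, j))
    return edges
```

Lowering the threshold or the overlap makes the network denser (perhaps more effective, but less efficient), which is exactly the tradeoff I would like to see quantified.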

More generally, could you show how efficient your technique is? For instance, given a computational budget, one could choose to learn (i) a fully connected model (all words are used by all inputs and the rest of the net is fully connected), or (ii) your strategy of a sparsely connected model built from the topic DAG. What would be the impact on the results?

** Hyperparameter Validation **

Do you use the same rank for A in all input units? In particular, it seems that the number of words per input unit and their frequency vary.

** Typos **

pLSI -> this is confusing given that Thomas Hofmann introduced a popular model with that name as well; I would just use LSI there.
Inconsistency: NDCG@5 in the text and NDCG@6 in the tables.
Q2: Please summarize your review in 1-2 sentences
The paper is clear and reads well. References to related work and the application motivation are appropriate. The combination of a multi-resolution connectivity structure and bilinear input matching is original and seems to be a nice way to address the targeted problem efficiently.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a combination of bilingual topic models with a neural-network-based matching function.

Pros:
- interesting combination

Cons:
- overly complicated notation and a description that is not easy to follow
- the pipeline of running multiple LDA models with varying numbers of topics and then the neural network is not very elegant or principled
- I would use a hierarchical topic model instead; this seems to be a reasonable application for one.
- More generally, I would use one of the many deep-learning-based topic models and try to train the full model jointly.
- the evaluation section is very poor, since the datasets are new and not publicly available, and the random baseline scores quite high too.

Are the patches just words from the different topics?

The figures, such as Fig. 5, could be clearer.
The notation is unnecessarily complicated and not easy to follow; e.g., a_p(x,y) is used and then only defined several paragraphs later.

If possible, please find a native speaker to proofread; there are a couple of grammar mistakes, e.g.
"the remained words", "increasing with either way".

The claim that deep learning models do not give matching functions and cannot handle short texts with large vocabularies is just not true. You even cite the paper by Socher et al. that deals with large vocabularies and single sentences for paraphrase detection.


What is the speed of your method?
Q2: Please summarize your review in 1-2 sentences
This paper proposes a combination of bilingual topic models with a neural-network-based matching function.

The results are hard to replicate since the data is not available and some claims are not true but there are some interesting ideas in there.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes an approach to matching short texts based on using topic
modeling to identify the common word co-occurrence patterns within pairs of
matched documents. Topic modeling is performed several times using fewer and
fewer topics, producing a hierarchy over topics and thus word co-occurrence
patterns. The hierarchy is then used to define the structure of a sparsely
connected neural network, taking advantage of "semantic locality". The model is
then trained to score correct matches above incorrect ones.
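For reference, the standard form of such a margin-based ranking criterion, which the paper presumably instantiates with its matching score s(x, y) over triples of a text x, a correct match y^+, and an incorrect match y^-, is:

```latex
% pairwise hinge (margin) objective; s(x, y) is the model's match score
\mathcal{L} = \sum_{(x,\, y^+,\, y^-)} \max\bigl(0,\; 1 - s(x, y^+) + s(x, y^-)\bigr)
```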

The approach is novel and potentially interesting, but is fairly complicated
and not very clearly described in the paper. While the analogy between image
patches and groups of co-occurring words is probably worth mentioning,
using the image patch terminology throughout the paper is more confusing than
helpful.

The experimental section is the weakest point of the paper. The authors claim
that their model performs so well because it is hierarchical and nonlinear, but
the experimental results do not provide direct evidence for these claims. To
substantiate these claims, DeepMatch should be compared to a linear version of
itself (with all sigmoid units replaced by linear ones) as well as to a
non-hierarchical version of itself (using only one resolution of topics).
Without these comparisons there are several confounding factors that could
potentially explain the superior results obtained by DeepMatch. As for using
existing baselines, it makes more sense to compare to Supervised Semantic
Indexing or Polynomial Semantic Indexing instead of RMLS, as those methods are
trained using the same margin-based criterion as DeepMatch. The choice of the
datasets is also unfortunate, as they appear to be non-standard and
proprietary, which makes the reported scores difficult to interpret and the
results impossible to reproduce. The authors also do not report several
crucial details such as the latent dimensionality of the models, the number of
topics used, the number of hidden units (sigmoid vs. linear), regularization
parameters, etc. The description of the hyperparameter selection procedure on
lines 345-48 is a cause for concern, as it seems to suggest that the
hyperparameters were selected on a subset of the training set, which is likely
to result in overfitting. The big gap in performance between DeepMatch and the
siamese network makes me wonder whether the latter was properly regularized.
The mention of "zero regularization" for that model on line 364 seems to
suggest otherwise. Given the relatively small size of the datasets for the
vocabulary sizes used, proper model regularization is absolutely essential for
a fair comparison.

The gradients given by Eqs. (8) and (9) are not quite right as they reference
y_i instead of y+_i and y-_i. It is unclear what the potential value (pot^k_p)
is. Is it the total input to a unit?
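Assuming the standard pairwise hinge objective, the corrected gradient of one hinge term should presumably distinguish the two candidate texts, along these lines (zero when the margin is satisfied):

```latex
% subgradient of one term when 1 - s(x, y^+) + s(x, y^-) > 0
\frac{\partial \mathcal{L}}{\partial \theta}
  = -\,\frac{\partial s(x, y^+)}{\partial \theta}
    + \frac{\partial s(x, y^-)}{\partial \theta}
```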

It appears that DeepMatch used a reduced vocabulary ("trim the words" on line
172). Is this correct, and if so, was the same vocabulary reduction performed
for all models?

How exactly are the higher-resolution topics assigned to the lower-resolution ones?

Were the comparison accuracy scores in Figure 6 computed on the training data or on the test data? What are the "correct comparisons" shown in the figure?

The paper repeatedly uses "localness" instead of the more appropriate "locality".
Q2: Please summarize your review in 1-2 sentences
The idea is potentially interesting, but the write-up is unclear and the experimental evaluation is not convincing.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new deep architecture for matching texts from two categories (such as questions and answers). The main originality consists in defining the architecture of the neural network using hierarchical topic models, trained jointly on both text categories. The topic models are used to define word patches of different granularities as well as hierarchical connections between them. These patches and connections then serve to define the connectivity pattern of a neural network (later trained by backprop + SGD).


The paper is fairly well written and quite easy to follow.

The main idea of the paper is neat and original, I think. Relying on LDA to define the text patches is very nice, and so is using it to derive the connectivity pattern of the network.

The main weakness of this paper is the experimental part. The main assumptions underlying the introduced framework are:
(1) models for dealing with text should take its inherent structure into account;
(2) such structure does not necessarily follow the ordering of words in the text, nor does it follow any kind of parse tree.

The main originality of the paper concerns (2) since many previous approaches agree with (1). And here is the problem: to justify claim (2) (which I am inclined to accept), one should compare with previous approaches using text structure in the model. But there is no such comparison (only variants using bag-of-words). Some interesting comparisons could be:

* siamese networks with n-grams or with parse-tree features (from syntactic or dependency parses): easy to do, and already considers some structure.
* non-linear neural networks that use the word ordering as structure (such as in Collobert et al., JMLR 2011).
* the RNN of Socher et al.
* and how about directly using the hierarchical topic models to match the short texts? (a minimal sketch follows below)
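For that last point, even a very simple matcher would be a useful reference: infer topic mixtures for the two texts with the (bilingual) topic model and compare them with a fixed similarity. The cosine choice below is illustrative, not something the paper prescribes:

```python
import numpy as np

def topic_match(theta_x, theta_y, eps=1e-12):
    """Baseline: match two short texts by the cosine similarity of
    their inferred topic mixtures theta_x and theta_y."""
    num = float(theta_x @ theta_y)
    den = np.linalg.norm(theta_x) * np.linalg.norm(theta_y) + eps
    return num / den
```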
Q2: Please summarize your review in 1-2 sentences
This paper introduces a nice new framework for "learning" the structure of neural networks dealing with text as input. Unfortunately, the experiments do not allow one to completely assess the effectiveness of this approach.
Author Feedback

Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.

We thank all 4 reviewers for their insightful comments. We are working towards making the dataset publicly available in the near future.

To Reviewer 5:
1. We used the same rank for all the local bilinear models.
2. We agree with the reviewer that a comparison between different models under a fixed "computational budget" would be quite meaningful.

To Reviewer 6:
1. We agree that a hierarchical LDA would be a more elegant way to find the architecture of the model, but the heuristics still needed in the construction process (e.g., assigning at least k more specific clusters to a more general one) would probably diminish the elegance of hLDA.
2. To answer the question ("are the patches just words from the different topics?"): each patch depicts the interaction between words from the two domains in a concatenated topic of the "bilingual" topic model.
3. To answer the comment ("the claim that deep learning models do not give matching functions and cannot handle short texts with large vocabularies is just not true. You even cite the paper by Socher et al. that deals with large vocabularies and single sentences for paraphrase detection."): the model of Socher et al. (ref [16]) is based on word embeddings and therefore avoids the difficulty brought by the large vocabulary, but its setting only applies to paraphrase detection.
4. Training took about 10 hours with 100,000 triples on a PC.

To Reviewer 7:
1. We tried several settings of regularization for the Siamese network and actually reported the best one (which is not significantly better than the unregularized one). Our remark about "zero regularization" is merely to point out that the inferior performance of the Siamese network is not due to a lack of model capacity.
2. Yes, Eqs. (8) & (9) only give the gradient from one term of the error function and are therefore incomplete.
3. We thank the reviewer for pointing out the work on Supervised Semantic Indexing. We did not include a comparison to it since the Siamese network is a nonlinear version of it (with the same margin-based objective), but we will include the empirical comparison in a future version.
4. In Section 5.3 we already reported the performance of several variants of the architecture, which indicates that a shallower architecture performs worse. We omitted several primitive ones (e.g., the linear combination of local matchings) but will consider including them in a future version.
5. Yes, we used the same reduced vocabulary for all models.
6. Figure 6 reports the performance on test data.


To Reviewer 8:
1. We agree with the reviewer that some comparison to models with natural-language structure would be interesting. However, the current models (e.g., Socher et al.'s RNN and C&W's SENNA) cannot be directly applied to the matching case.
2. A hierarchical topic model would give a multi-resolution representation of the text, which could be used for learning a matching model. We may consider including experiments like that in a future version of the paper.