Paper ID: 1421
Title: Distributed Representations of Words and Phrases and their Compositionality
Reviews

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper has 3 main contributions: (i) it extends the skip-gram model of (Mikolov et al, 2013) to speed up training, by adopting an objective close to Noise Contrastive Estimation and by subsampling frequent words; (ii) it proposes to learn phrase representations and introduces a dataset to evaluate them; (iii) it introduces the concept of additive compositionality.

This is a good paper. It is clear and reads well. It is technically correct. The part on speeding up learning (i) justifies the paper. The additive compositionality (iii) is a nice concept to throw in, even if the authors make no attempt to explain how it arises from the optimization problem that training solves. The part on phrase representation (ii) could be given more emphasis, given that it may be more controversial. In particular, testing the limits of the proposed technique might make the paper more insightful.

** NCE & Subsampling **

I feel that you should spend more time introducing NCE and highlighting how it differs from the proposed negative sampling objective. From the text, it seems that the notation in (4) is wrong: you sample k words w_i according to Pn(w) and then use each individual sample's log sigma(-v'_{w_i} . v_{w_I}) term as an unbiased estimator of the expectation, but you never manipulate the expectation itself.
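To make the point about the notation concrete, here is a minimal sketch of how I read the intended computation (the names `v_in`, `v_out_pos`, `out_vectors` and `noise_probs` are mine, not the paper's): in practice the expectation over Pn(w) in (4) is replaced by a k-sample Monte Carlo estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_term(v_in, v_out_pos, out_vectors, noise_probs, k, rng):
    """One training pair's contribution to the negative-sampling objective:
    the expectation over the noise distribution Pn(w) is replaced by a
    k-sample Monte Carlo estimate."""
    pos_term = np.log(sigmoid(v_out_pos @ v_in))
    # draw k noise words w_i ~ Pn(w); each log sigma(-v'_{w_i} . v_in) term
    # is an unbiased estimate of the expectation appearing in (4)
    neg_ids = rng.choice(len(noise_probs), size=k, p=noise_probs)
    neg_terms = np.log(sigmoid(-(out_vectors[neg_ids] @ v_in)))
    return pos_term + neg_terms.sum()
```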

For subsampling frequent words, it might be worth mentioning that this strategy might be inefficient for less semantically oriented NLP tasks (e.g. POS tagging or parsing), where the representation of common words might matter more than it does for the semantic test from [7].
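For readers unfamiliar with the technique, my understanding is that the subsampling rule discards each occurrence of a word with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative corpus frequency and t is around 1e-5; a minimal sketch, with the function name of my choosing:

```python
import random

def keep_occurrence(freq, t=1e-5):
    """Subsampling of frequent words: an occurrence of a word whose relative
    corpus frequency is `freq` is discarded with probability 1 - sqrt(t / freq),
    so very frequent words are aggressively thinned while rare words are
    almost always kept."""
    discard_prob = max(0.0, 1.0 - (t / freq) ** 0.5)
    return random.random() >= discard_prob
```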

** Additive compositionality **

This property is similar to the one highlighted in [Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, Linguistic Regularities in Continuous Space Word Representations], which observed
[Koruna - Czech] ≈ [Yen - Japan]
You make the remark that these two kinds of vectors are also close to the vector for "currency". This is an interesting observation. It would be nice if one attempted to explain how the optimization problem solved by maximizing (4) actually gives rise to such a property. This remains puzzling to me, and maybe to some other members of the ML community.
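For concreteness, the property is usually checked by a nearest-neighbour search under cosine similarity: vec(Koruna) - vec(Czech) + vec(Japan) should land close to vec(Yen). A tiny sketch, assuming a hypothetical `embeddings` dict of unit-normalized numpy vectors:

```python
import numpy as np

def analogy(embeddings, a, b, c, topn=1):
    """Return the word(s) closest, by cosine similarity, to
    vec(b) - vec(a) + vec(c), excluding the three query words.
    `embeddings` maps words to unit-length numpy vectors."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    scores = {w: float(v @ target) for w, v in embeddings.items()
              if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy(embeddings, "Czech", "Koruna", "Japan") would ideally return ["Yen"]
```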

** Phrase representation **

You propose to learn a representation for each frequent phrase (according to (6)). This is a valid proposal, and your "Air Canada" example against Socher-like models makes sense. I do not have a strong argument against your strategy when data are plentiful. However, a strong argument in favor of distributed LMs, as opposed to n-gram LMs, is that they allow parameter sharing across similar contexts. This helps in modeling language, because there will always be infrequent phrases that can benefit from similar frequent phrases; e.g. learning that "the (small town name) warriors" is a sports team needs to share parameters with other popular teams. It might be worth testing the limits of your strategy in terms of the number of occurrences required, as such an analysis might go a long way.
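For reference, my reading of the scoring rule in (6) is score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj)), with bigrams scoring above a threshold merged into phrases; a minimal sketch (the function names, delta and threshold values below are illustrative, not the paper's):

```python
def phrase_score(count_bigram, count_w1, count_w2, delta):
    """Score from (6): the bigram count is discounted by `delta` and
    normalized by the unigram counts, so that only pairs co-occurring far
    more often than chance (and not made of very rare words) score highly."""
    return (count_bigram - delta) / (count_w1 * count_w2)

def find_phrases(unigram_counts, bigram_counts, delta=5, threshold=1e-4):
    """Return the bigrams whose score exceeds `threshold`; these would be
    treated as single tokens in subsequent training passes."""
    return {(w1, w2) for (w1, w2), c in bigram_counts.items()
            if phrase_score(c, unigram_counts[w1], unigram_counts[w2], delta) > threshold}
```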

** Typos **

- Table 3: samsampling -> subsampling
- It might be worth citing Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, Linguistic Regularities in Continuous Space Word Representations, which first introduced the unigram fact dataset.
Q2: Please summarize your review in 1-2 sentences
This is a good paper. It is clear and reads well. It is technically correct. The part on speeding up learning (i) justifies the paper. The additive compositionality (iii) is a nice concept to throw in, even if the authors make no attempt to explain how it arises from the optimization problem that training solves. The part on phrase representation (ii) could be given more emphasis, given that it may be more controversial. In particular, testing the limits of the proposed technique might make the paper more insightful.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes 3 improvements to the skip-gram model, which learns embeddings for words. The first improvement is subsampling of frequent words, the second is the use of a simplified version of noise contrastive estimation (NCE), and finally the authors propose a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad hoc. In practice, both the subsampling and the negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.

The article does not propose any explanation as to why negative sampling produces better results than NCE, which it is supposed to loosely approximate. In fact, it does not explain why, besides the obvious generalization gain, the negative sampling scheme should be preferred over NCE, since the two achieve similar speeds.

Table 6 is a misrepresentation of past work and should be improved before publication. There are two problems:
1. The paper claims that the skip-gram model learns better embeddings based on this evaluation. However, the training set for the skip-gram model is as much as 30x bigger. To make this claim, the models would need to be trained on the same training set. As it stands, it can only be said that, as ready-made embeddings, the skip-gram embeddings are better.
2. The table does not acknowledge the fact that these are rare words (due to random selection). If the aim is to compare performance on rare words, this needs to be said explicitly. If not, simply select words at random according to their unigram probability, as in the sketch below.
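The suggested alternative, selecting evaluation words in proportion to their unigram frequency, could be done as follows (`counts` is a hypothetical word-to-count table, not something from the paper):

```python
import numpy as np

def sample_eval_words(counts, n, seed=0):
    """Sample n distinct evaluation words with probability proportional to
    their unigram frequency, so common words appear in the sample roughly as
    often as they appear in running text."""
    rng = np.random.default_rng(seed)
    words = list(counts)
    probs = np.array([counts[w] for w in words], dtype=float)
    probs /= probs.sum()
    return list(rng.choice(words, size=n, replace=False, p=probs))
```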
Q2: Please summarize your review in 1-2 sentences
The paper describes improvements over the skip-gram model. The improvements lead to better generalization error in the experiments. However, the comparison to other embedding methods in Table 6 can be misleading and should be addressed.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation [7] in the paper), which learns linear word embeddings that are particularly useful for analogical-reasoning-type tasks. The proposed extensions (namely, negative sampling and sub-sampling of high-frequency words) enable extremely fast training of the model on large scale datasets. This also results in significantly improved performance compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase-level embeddings by slightly tweaking the original training algorithm.
The various proposed extensions to Skip-gram algorithm are compared with respect to speed and accuracy on "some" large scale dataset. The authors also qualitatively compare the goodness of the learnt embeddings with those provided in the literature using neural network models.

Quality:
I found the quality of research discussed in the paper to be above average. The extensions proposed in the paper do result in significant improvements in the training of the original skip-gram model. However, there are a number of issues in the paper which are worth pointing out.

-- The authors only give a qualitative comparison of the goodness of the embeddings learnt by their model versus those proposed by others in the literature. Any reason why? Ideally, one should compare the performance of learnt embeddings quantitatively by using them on potentially multiple NLP tasks. In the end, no matter how good the embeddings look to the human eye, they are not useful if they cannot be used in a concrete NLP task.

-- Also, I thought the comparison (Table 6) is a bit unfair, because the previous embeddings are trained on a different and smaller dataset than the ones learnt in the paper. A fairer comparison would be to learn all the embeddings on the same dataset.

Clarity:
For the most part the paper is clearly written. There are a number of areas where it needs improvement though. For instance,

-- the authors need to elaborate on how the various hyper-parameters in the proposed extensions are chosen.

-- The details of the dataset used by the authors are completely missing. I have no idea about the nature and source of the data being used for training the models.

Originality:
The paper gives a number of incremental extensions to the original Skip-gram algorithm for learning word embeddings. Though these extensions seem to perform well, I would not call the contributions significantly original.

Significance:
I think the results in the paper are fairly interesting and certainly merit further research. I was particularly impressed by how cleanly the model was able to learn embeddings associated with phrases composed of multiple words. Equally interesting was how one could combine (add/subtract) word vectors to generate new meaningful vectors. These properties of the proposed model are certainly worth further attention towards achieving the goal of creating meaningful vector representations of full sentences.


Q2: Please summarize your review in 1-2 sentences
The paper gives a number of interesting extensions to the original Skip-gram model for training word embeddings in a linear fashion. The work presented is significant enough to be accepted in NIPS and shared among other researchers in the area.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their detailed comments.

R6 + R7:

We will improve our comparison to previous work and explicitly state that we used more data in the experiments. The aim of Table 6 was not to argue that the Skip-gram model learns better representations given the same amount of data. Instead, we tried to show that, if our goal is to learn the best possible word representations (perhaps because we have an application in mind), then the Skip-gram model is the best choice precisely because it is so much faster than other methods and can therefore be trained on much more data.

As for the words in Table 6 being rare, we deliberately chose them to be so (see line 373), because a common objection to word vectors is that they are great for the common words but not useful for rare words.


R7:

We agree that the ultimate test of the quality of word embeddings is their usefulness for other NLP tasks, and we have already demonstrated that this is the case for machine translation in a follow-up submission. However, it is also the case that the analogical reasoning tasks we use to evaluate the word vectors are at least a somewhat reasonable metric of word vector quality.

We will release the code to make our results reproducible.