Reviews: Comparing Unsupervised Word Translation Methods Step by Step

Although this paper only combines existing techniques, it's good science to do more careful experimentation on existing techniques and new combinations of existing techniques. I think this is a valuable contribution.

The heart of the paper is in Sections 4.1 and 4.2. I could be mistaken, but the impression I get is that these sections read like two different projects combined together.

Section 4.1 shows clearly and convincingly that more recent alternatives to GAN distribution matching don't do as well on more difficult language pairs. Some small things that could strengthen this section are:

- How was the 2% threshold for defining "failure" chosen? Could you use some other metric that doesn't require you to choose an arbitrary threshold? Also, reporting the max over 10 runs might be misleading since the variance is so high. Would it be better to report mean and standard deviation, or even plot all 10 scores?
- Restate results on less difficult language pairs too, to provide a complete picture, so that the takeaway message is something like "for languages with properties ABC, use method D, but for languages with properties EFG, use method H."
- The paper presenting Gromov-Wasserstein alignment points out that their method is much faster than GANs. This would be worth mentioning here too.

I found Section 4.2 difficult to understand. It took me a couple of reads to even realize that this section contained both of the main innovations of the paper. I would have expected this section to be written as follows:

- Subsection 4.2.1: GAN with Procrustes (Table 2, bottom half) is compared with GAN with SBDI, and is found to have a higher average or maximum score over 10 runs.
- Subsection 4.2.2: Using 10 runs of GAN+SBDI, model selection using discriminator loss is compared against model selection using cosine similarity, and the latter is found to have a higher score.

(The opposite order would be possible too: show that cosine-similarity model selection works well on GAN+Procrustes, then show that GAN+SBDI is better.)

But neither of these comparisons is made. First, it's stated that model selection using discriminator loss doesn't work (line 241); as far as I can tell, no justification is given for this claim. Second, a comparison is made between C-MUSE+Procrustes and G-W+SBDI. The authors admit this is not an apples-to-apples comparison, so they make another comparison between C-MUSE+Procrustes and C-MUSE+SBDI. But this comparison fails to make a connection with Section 4.1 in two ways:

- I assume that what is called GAN in Section 4.1 is the same as what is called MUSE in Section 4.2. This is not stated anywhere, as far as I can tell, and if I'm mistaken, then I don't know how to relate 4.1 and 4.2 at all.
- Section 4.2 adds cosine-similarity model selection without comparing it to any results in Section 4.1.
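
To make the reporting suggestion for Section 4.1 concrete, here is a minimal sketch of how per-run scores could be summarized so that the max, the mean, the standard deviation, and the failure count appear side by side. Everything in it is an assumption for illustration: the function name `summarize_runs` is hypothetical, a "failure" is taken to mean a score strictly below the 2% threshold, and the scores are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores, failure_threshold=2.0):
    """Summarize per-run accuracy scores (in percent) for one language pair.

    Assumption: a run counts as a "failure" if its score is strictly
    below failure_threshold (2% here, mirroring the threshold in the paper).
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return {
        "runs": len(scores),
        "max": max(scores),
        "mean": mean(scores),
        "std": stdev(scores),  # sample standard deviation (n - 1 denominator)
        "failures": sum(1 for s in scores if s < failure_threshold),
    }


# Placeholder scores for illustration only; these are NOT results from the paper.
placeholder_scores = [0.0, 0.5, 40.0, 0.0, 1.0, 38.0, 0.0, 0.0, 41.0, 0.5]
summary = summarize_runs(placeholder_scores)
print(summary)
```

With scores this bimodal, the max alone hides the fact that most runs fail, which is why reporting the mean, the spread, and the failure count together (or simply plotting all 10 scores) seems more informative.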

Originality: This paper is mostly about empirical analysis. Although there's not much originality in algorithmic proposals, it provides a critical view of the current methods. After all, the authors also achieve SoTA on some language pairs.

Quality: This paper looks at the current methods for embedding alignment with a critical eye, to see what really works best using the same pairs of languages. The summary of related work is quite illuminating.

Clarity: The paper is very well written and very clear, at least for someone who is somewhat familiar with embeddings and word alignment.

Significance: I think researchers and practitioners will find this useful.

Paper ID:	3240
Title:	Comparing Unsupervised Word Translation Methods Step by Step

Reviewer 1

Reviewer 2

Reviewer 3