NeurIPS 2020

Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable Neural Distribution Alignment


Review 1

Summary and Contributions: This paper proposes a new method for distribution alignment that uses a normalising flow as the alignment transformation. The authors propose a new objective (LRMF) that bounds the log-likelihood ratio distance. The main advantage over existing objectives is that LRMF is non-adversarial. As a result, the authors claim that training is more robust and that the model has a meaningful loss value that can be used to show convergence.

Strengths: The paper is well written and uses theory effectively to explain the method. The method itself is an important addition to the literature, as it introduces a direct optimization method for distribution alignment, which is typically done with adversarial methods. The theory in the paper is a pleasure to read because the motivation for the different concepts is explained well. In addition, the paper illustrates problems and methods using examples and figures, which aids understanding. The paper places itself in context by comparing to existing methods in a structured manner and discusses advantages and disadvantages. The paper is also upfront about its limitations (mainly high dimensionality leading to vanishing gradients).

Weaknesses: The main weakness of the paper is the experimental section. Whereas adversarial methods in the literature manage to align complex input images, the paper mainly focuses on smaller problems. In Figures 5a,b, it could be made clearer that T(A) and emb/pix refer to the authors' method. It is also unclear whether the F \circ G^{-1}(A) composition in the experiments refers to an AlignFlow-inspired baseline; please clarify this. As the authors discuss, scalability to higher dimensions is the main limitation of the method in its current form. The proofs and claims in the paper seem to depend on being able to find global optima (for instance, Eq. 3 depends on finding the optimal parameters \theta). Since in practice neural networks do not really converge to global (or even local) optima, it is currently unclear how optimization affects these claims. Additionally, it would be helpful if the source of the second inequality in Eq. 3 (between d_\Lambda and \min L_{LRMF}) could be quantified. When does equality hold (apart from the trivial case when L_{LRMF} and \epsilon_{bias} are zero)?
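For reference, my reading of Eq. 3 is a chain of the form 0 \le d_\Lambda(A, B) \le \min_\phi L_{LRMF}(A, B, \phi) + \epsilon_{bias} (my notation, reconstructed from the surrounding text rather than quoted from the paper); the question above concerns what controls the gap in the second of these inequalities.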

Correctness: The claims in the paper appear to be correct, and the empirical methodology seems sound.

Clarity: The paper is really well-written. I would recommend this paper to newcomers in the field because of the illustrated introduction and the clearly outlined related work section.

Relation to Prior Work: The relation to prior work is clearly discussed. The related work section is structured nicely into the different existing approaches.

Reproducibility: Yes

Additional Feedback: Since the paper uses a flow as the distribution transformation, it is not possible to change dimensionality between the domains. The authors could address this limitation and discuss possible workarounds. --------- After rebuttal --------- I have read the rebuttal and I am satisfied with the authors' answers. I will keep my score as it is.


Review 2

Summary and Contributions: The paper proposes a method for neural distribution alignment with the log-likelihood ratio as the cost function, and shows that under a certain family of transformations, i.e., normalising flows, this cost can be optimised efficiently compared to existing methods for neural distribution alignment, and also provides an interpretable value for model selection. The intuition behind the approach is that two datasets are indistinguishable from each other in the context of a model family if the combined likelihood of the two datasets under individually fitted models is the same as the likelihood of the combined dataset under a single fitted model. Additionally, the likelihood of the individually fitted model for the transformed dataset can be approximated in closed form under a suitable family of transformations, e.g., normalising flows, where the Jacobian is well defined. Both of these observations lead to the final objective function (Def. 2.2), which involves maximisation over two sets of parameters, i.e., the parameters of the shared model and the parameters of the normalising flow, respectively.
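To make that intuition concrete (my notation, reconstructed from the definitions rather than quoted verbatim): the log-likelihood ratio distance compares individually fitted models against a single shared fit, d_\Lambda(A, B) = \max_{\theta_1} \log P_M(A; \theta_1) + \max_{\theta_2} \log P_M(B; \theta_2) - \max_{\theta_s} [\log P_M(A; \theta_s) + \log P_M(B; \theta_s)], which is non-negative and vanishes exactly when a single shared model explains both datasets as well as two separately fitted ones. For a normalising flow T_\phi, the density of the transformed dataset is available in closed form via the change-of-variables formula, \log p_{T_\phi(A)}(T_\phi(x)) = \log p_A(x) - \log |\det J_{T_\phi}(x)|, which is what makes the transformed-data term tractable.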

Strengths: The paper tackles a relevant problem and provides an intriguing perspective. The arguments are well supported by examples, explanations, and illustrations. The connection drawn between LRMF and existing methods is very relevant.

Weaknesses: The experiments on real datasets are limited and could be extended further.

Correctness: The derivation of the method seems to be correct, and the experimental evidence is convincing, although the section on real data could be extended.

Clarity: The paper is very well written and quite easy to follow, with well-placed examples and illustrations.

Relation to Prior Work: The paper addresses prior work extensively and explicitly discusses the relationship between the proposed model and existing models.

Reproducibility: Yes

Additional Feedback: Should this be called the log likelihood-ratio rather than the log-likelihood ratio, given that the authors take the log of the likelihood ratio (Definition 2.1) rather than a ratio of log-likelihoods? The description of the experiments on real datasets is too short; it would be great to provide some more information and to enlarge the figures. The authors discuss the experiment on the simulated dataset in great detail and provide thoughtful insight and comparison. However, a similar level of detail and comparison is missing for the real datasets: for example, why was this experiment selected, does LRMF perform better, and why were digits aligned in both embedding and pixel space? Some more information on what transformations were used would also be great. For example, what is RealNVP, and how were the specific transformations chosen in each experiment? In Figure 2, the boundary between the first and second rows is not obvious. ---------- I have read the rebuttal and I thank the authors for addressing my comments.


Review 3

Summary and Contributions: The paper proposes an unsupervised domain alignment method based on the log-likelihood ratio with normalizing flows. The method is designed to find a transformation of a dataset such that it is equivalent to another dataset with respect to a certain family of density functions. The main contribution of the work is a log-likelihood-ratio-minimizing distance metric based on normalizing flows, with convergence guarantees. The optimization for unsupervised domain alignment is thus a minimization problem, in contrast to adversarial formulations based on GANs. Experiments are performed on two datasets: the moons dataset and MNIST-USPS.
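To spell out the contrast (my notation, as I understand the construction): an adversarial formulation solves a saddle-point problem \min_G \max_D V(G, D), whereas here the shared-model term enters the loss with a negative sign, so \min_\phi [C - \max_{\theta_s} f(\phi, \theta_s)] = \min_{\phi, \theta_s} [C - f(\phi, \theta_s)], where C collects the individually fitted terms and is treated as constant in \phi. The flow parameters and shared-model parameters can therefore be minimized jointly, and the resulting loss value is directly interpretable for monitoring convergence.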

Strengths:
+ The theoretical framework for minimizing the log-likelihood ratio between two distributions using normalizing flows is sound.
+ The method, as shown in Figure 2 (last column), seems to have stable training (to convergence) compared to its SN-GAN counterparts for domain alignment.
+ The qualitative examples are well structured to explain the advantages of the proposed LRMF over MMD, SN-GAN, and AlignFlow.

Weaknesses:
- Central parts of the paper are unclear, e.g., in line 80, \log P_M(X; \theta) should be the negative cross-entropy (see the identity spelled out after this list).
- The proposed objective, Eq. 2 in line 128, requires optimisation over both the parameters of the transformation \phi and those of the shared model \theta_s. The effect on the number of parameters vs. prior work, e.g., AlignFlow (Grover et al., 2019), has not been discussed clearly.
- The paper is sparse in quantitative results and does not compare to important prior work based on GANs [1]. The only quantitative results are on adaptation from USPS to MNIST in line 268; however, prior work [1] achieves 96.5% accuracy in comparison to the 55% accuracy achieved by the proposed method.
- The empirical evaluation is restricted to small datasets, e.g., moons, MNIST, and USPS. It would be desirable to evaluate the proposed approach on the more complex Facades/Maps/Cityscapes datasets using the MSE metric to facilitate comparison with AlignFlow and [1].
- The shared model (\theta_s) is trained on two datasets simultaneously. It is unclear how the inductive bias from each of the datasets influences the shared space.
[1] CyCADA: Cycle-Consistent Adversarial Domain Adaptation, ICML 2018.
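To spell out the first item above (my notation): for a dataset X = \{x_i\}_{i=1}^N drawn from P_X, the normalized log-likelihood satisfies \frac{1}{N} \sum_i \log P_M(x_i; \theta) \approx \mathbb{E}_{x \sim P_X}[\log P_M(x; \theta)] = -H(P_X, P_M(\cdot; \theta)), i.e., the negative cross-entropy between the data distribution and the model, so the sign and normalization in line 80 should be stated accordingly.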

Correctness: The empirical evaluation is limited to toy datasets, whereas competing methods (AlignFlow, CyCADA) use stronger experimental setups; improvement over previous approaches is therefore not established.

Clarity:
- The paper could be improved with respect to the introduction of technical notation. The definitions of the (negative) cross-entropy could be made more consistent.
- The introduction has various claims without appropriate citations, e.g., line 24, "difficult to quantitatively reason about the performance of such methods"; line 31, "but rarely on density alignment"; line 72, "In general, this would require solving an adversarial optimization problem".
- Line 191: "Authors of the AlignFlow" -> "AlignFlow".

Relation to Prior Work: The prior work discussion is limited to SN-GAN, AlignFlow, and MMD. Various approaches for domain alignment based on GANs, e.g., [1], are not discussed. The related work section could be improved.

Reproducibility: Yes

Additional Feedback: ---- Update after rebuttal -------- After the discussion and the rebuttal, I agree with the other reviewers that the idea is sound and has certain advantages over prior work (AlignFlow). The scalability issue can be addressed in future work. However, the experimental section could have been improved, as pointed out in the review above. I have modified my score accordingly.