Review for NeurIPS paper: Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder

NeurIPS 2020

Review 1

Summary and Contributions: This paper proposes a new out-of-distribution (OOD) detection approach for Variational Autoencoders (VAEs). The authors explain that the bits per dimension (BPD) computed for VAEs changes by a smaller amount than for flows and PixelCNNs, and that this is why previously proposed OOD scores don’t work well for VAEs. The proposed score is based on the improvement in likelihood on a new example when the encoder is further optimized on that example.

Strengths: As the authors point out, deep generative models (and VAEs in particular) fail to assign lower p(x) to OOD samples, and the OOD scores proposed so far do not seem to address this. While one line of work involves coming up with better models and objectives, another reasonable approach, and the one this paper focuses on, is designing better metrics. The proposed approach makes sense and, based on the experiments, seems to address this problem to some extent.

Weaknesses: I had a question for the authors. Something I often wonder about OOD scores is how good they are at differentiating between “difficult” in-domain examples and OOD examples. For example, if the model is trained on (red squares, red circles, blue squares), can the proposed score detect that a ‘blue circle’ is in-domain but, say, a ‘car’ image is OOD? Also, I was wondering whether the authors have done any architecture search. Working on VAEs myself, I know that the generalization of deep and shallow VAEs differs vastly.

Correctness: Yes

Clarity: This is a very well-written paper. I found it really easy to read overall. I have a small complaint about the notation. It is common in the VAE literature for θ to denote the parameters of the generative model. In the paper, the generative model is defined as pθ(x), but the authors then write the decoder as pη(x|z) and define θ as (η,φ), which doesn’t make sense given that p(x) theoretically doesn’t depend on the encoder.
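
To make the notation point concrete, here is a short sketch using the reviewer's symbols (η for the decoder, φ for the encoder) and assuming a fixed prior p(z). The marginal likelihood involves only η, while φ enters only through the evidence lower bound:

```latex
\log p_\eta(x)
  = \log \int p_\eta(x \mid z)\, p(z)\, dz
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\eta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

On this reading, the pair (η,φ) parameterizes the bound rather than the model p(x) itself, which is the mismatch the comment above points at.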

Relation to Prior Work: I am not very familiar with SOTA models for OOD detection but the authors seem to cover the literature review well.

Reproducibility: Yes

Additional Feedback: Typically people use a linear layer after the last Conv layer in the encoder. I noticed in Table 3 that this is not the case here. I am curious whether this makes a difference.

Review 2

Summary and Contributions: This paper develops an elegant yet effective method to detect OOD examples using VAE, termed Likelihood Regret (LR). LR is obtained by taking the difference between the log-likelihood of a sample under the original VAE model trained on the entire training set and the log-likelihood of the same sample under a model “fine-tuned” on this specific sample -- either by finding the optimal posterior for that single sample or by optimizing the encoder only for that single sample. The authors show that the LR method is superior to the likelihood baseline and to previous methods (including input complexity adjusted score, likelihood ratio, and latent Mahalanobis distance) on VAE, where previous methods have failure modes on some OOD datasets.
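
The "optimal posterior for a single sample" variant described above can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: it assumes a one-dimensional linear-Gaussian model whose ELBO has a closed form, and the decoder weight `W`, the `encoder` function, and the step counts are all hypothetical choices made for illustration. The encoder is deliberately accurate only near x = 0, standing in for the training region.

```python
import math

W = 2.0  # hypothetical decoder weight: p(x|z) = N(x; W*z, 1), prior p(z) = N(0, 1)


def elbo(x, mu, log_var):
    """Closed-form ELBO for the posterior q(z|x) = N(mu, exp(log_var))."""
    var = math.exp(log_var)
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - W * mu) ** 2 + W * W * var)
    kl = 0.5 * (mu * mu + var - 1.0 - log_var)
    return recon - kl


def encoder(x):
    """Hypothetical amortized encoder, accurate only for small |x|."""
    mu = W / (1.0 + W * W) * math.tanh(x)
    log_var = -math.log(1.0 + W * W)
    return mu, log_var


def likelihood_regret(x, steps=200, lr=0.05):
    """ELBO gain from refining the posterior parameters on the single sample x."""
    mu, log_var = encoder(x)
    base = elbo(x, mu, log_var)
    for _ in range(steps):
        var = math.exp(log_var)
        # Analytic gradients of the ELBO with respect to mu and log_var.
        grad_mu = W * (x - W * mu) - mu
        grad_log_var = 0.5 * (1.0 - (1.0 + W * W) * var)
        mu += lr * grad_mu          # gradient ascent: the decoder stays fixed
        log_var += lr * grad_log_var
    return elbo(x, mu, log_var) - base
```

In this toy setting, `likelihood_regret(0.1)` is close to zero because the encoder is already near-optimal there, while `likelihood_regret(3.0)` is much larger because per-sample optimization recovers a lot of ELBO. That gap is the quantity used as the OOD score.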

Strengths: The method is simple and elegant yet very powerful for OOD detection. It's a really interesting and well-written paper. The method adds very little computational overhead, on the scale of seconds, which makes it practical to use. The problem statement is very well motivated and clearly communicated: Sections 1, 2.2 and 3 make it very clear what is missing, namely an OOD detection score for VAE. Nice analysis on lines 249-259 of how the complexity measurement gap can override the likelihood gap. Experimentation is complete and comprehensive for VAEs. The authors compared the proposed method with prior methods, including the likelihood ratio and the input complexity adjusted score, as well as the likelihood baseline. Even better, two implementations of the proposed method are compared.

Weaknesses: My main concern is the lack of comparison with previous methods (input complexity adjusted score and the likelihood ratio method) on the generative models used in the original papers (e.g. Glow and PixelCNN), instead of only comparing those methods on VAE. Those methods were developed and tested with the corresponding generative models in the original papers, and it seems unfair to compare with them only on VAEs. Without this comparison, a researcher who wants to choose the SOTA OOD detection method for their own application, and who is free to choose the generative model, cannot tell which method is most likely to achieve the best performance. This is the main drawback and the main reason for my rating. Furthermore, this bears on the general motivation of the paper. I really like the analysis of why prior likelihood-ratio-based methods don’t work as well on VAEs; however, if all we care about is detecting OOD examples, why is it absolutely necessary to have a method that works well on VAEs? It’d be great if the authors could explain and motivate the necessity. Additionally, it’d be very helpful to include an analysis, or a hypothesis, on why optimizing the encoder leads to slightly better performance than optimizing the posteriors only. Minor concerns: I assume the y axis of the histograms in Figure 1 is frequency; it’d be great if you could label the y axis, with units, for clarity. Same with Figure 3. Since the τ (“tau”) notation is central to this paper, it’d be helpful to spend a sentence explaining, in plain English, what it is and how to interpret it. When describing the experiments in lines 214-220, clarify that the previous methods were tested on VAE instead of the generative models in the original papers. It’d be helpful to expand on how the bottleneck structure of VAE provides a natural regularization, in lines 154-156.
Updates: I have read the rebuttal and the other reviews, and based on the new results I have decided to increase my score from 6 to 7. I was not trying to say that a method for OOD detection using VAEs is not useful. Rather, my point is that when introducing a new method, it's important to provide thorough empirical evidence and analysis on all major types of deep generative models. It's OK if the method works better on one model than on the rest, but it's important to provide complete information for future research. I really appreciate the follow-up experiments; however, they still seem a bit rushed, as the results were achieved by fine-tuning only the last layer to prevent overfitting, and other alternatives, such as optimizing the latent code directly, were not tested on the other generative models. This rush is very understandable given the limited time of one week. However, I do think it's important to do more thorough experiments of this sort to make sure not to misguide the community.

Correctness: Yes. The authors claim that their LR method obtains the best overall OOD detection performance when applied to VAE. Their detailed experimentation supports this claim: they test the method on 7 datasets and compare it with 3 methods from prior work along with a likelihood baseline, and the results indicate the superiority of the LR method on VAE.

Clarity: The paper is very well written, clear, and easy to read. The introduction is comprehensive and clear, pointing out that the failure of current OOD scores on VAE presents a gap in the current research literature and that the proposed method can fill this gap. Nice summary of results in 5.2. It’ll probably be worth repeating throughout the paper that everything tested in this paper is on VAEs, as the prior methods were sometimes developed on other generative models and this might confuse readers who are only skimming the paper.

Relation to Prior Work: Yes, the major prior works are discussed along with their limitations, as is how the authors' LR method can overcome those limitations.

Reproducibility: Yes

Additional Feedback: In the related work section, near the end of paragraph two where comparing a batch of samples is discussed, it may also be helpful to mention a comprehensive study on detecting dataset shift: https://papers.nips.cc/paper/8420-failing-loudly-an-empirical-study-of-methods-for-detecting-dataset-shift.pdf Line 51: A slightly more detailed summary of the experimental results would give the reader a better overall idea of the paper, e.g., “obtains the best overall performances”... compared to what?

Review 3

Summary and Contributions: The paper proposes a way to do out-of-distribution detection with a given variational auto-encoder. For this, the authors define Likelihood Regret as the difference in ELBO obtained by replacing the VAE's encoding (or alternatively its encoder) with one that is optimal for the given sample. Using this number to discriminate between in- and out-distribution, they generally improve OOD detection performance compared to other VAE-based methods and appear to have no unacceptable weaknesses on any of the standard evaluation dataset pairs, unlike all of the previous methods they compare against.

Strengths: Having good OOD detection properties is desirable for different kinds of machine learning models that by themselves do not need to be focused on solving that task. In that sense, it is useful to enable the popular VAEs to be used in that regard, especially since the method does not need any adjustments to the model itself (training, architecture) and therefore does not impact its performance in other regards. The paper gives a good theoretical and methodological derivation of the proposed Likelihood Regret without needing any difficult to access assumptions or hypotheses. The applicable algorithm follows directly from these theoretical considerations. The results, while not remarkable for general OOD detection methods, are very good when compared to other VAE-based methods, especially since they do not show catastrophic failure on any of the presented out-distributions (which include a good standard range of datasets).

Weaknesses: The results are only good for the quite restricted case of using a VAE for OOD detection. It is not clear why this is an important use case, since e.g. classifier-based OOD detection seems to work better than generative model / density / likelihood-based approaches. Non-VAE types of the latter class, as stated, also work better in that regard. As several modifications of the VAE (beta-VAE etc.) have been proposed and become popular, a transfer to them should be discussed and if possible evaluated. If there are further arguments for using VAEs in particular for OOD detection, their inclusion would be welcome. Even if all methods work nearly perfectly, the results on MNIST should be included in the supplement, ideally together with the EMNIST letter dataset, where other OOD detection methods do not yet achieve perfect AUCROC. For the CIFAR-10 in-distribution, the CIFAR-100 dataset should be included as out-distribution, since it consists of very similarly captured images and is the most challenging one for many approaches. The AUCROC of CIFAR-10 vs. SVHN is not close to the optimal value 1, since 87% still represents quite a few confused pairs.
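
The remark about confused pairs follows from the pairwise reading of AUCROC: it equals the fraction of (in-distribution, OOD) pairs in which the OOD sample receives the higher score, with ties counted as half. A minimal sketch with made-up scores follows; the function name and the numbers are illustrative and not taken from the paper.

```python
def auroc(in_scores, ood_scores):
    """Fraction of (in, OOD) pairs where the OOD sample has the higher score."""
    correct = 0.0
    for s_in in in_scores:
        for s_out in ood_scores:
            if s_out > s_in:
                correct += 1.0
            elif s_out == s_in:
                correct += 0.5  # ties count as half a correctly ordered pair
    return correct / (len(in_scores) * len(ood_scores))


# Made-up OOD scores (higher should mean "more likely OOD").
in_dist = [0.1, 0.2, 0.3, 0.9]
out_dist = [0.25, 0.8, 1.0, 1.2]
example = auroc(in_dist, out_dist)  # 13 of 16 pairs are ordered correctly
```

On this reading, an AUCROC of 87% means that roughly 13% of such pairs are ranked in the wrong order.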

Correctness: The mathematical derivations are clear and seem to be correct. The demo code contains typos in lines 90 and 131 and then gives a FileNotFoundError after the evaluations on the in-distribution are done (their number should be set lower than 999 if it is just about printouts of the ELBOs). The dataloaders should probably have shuffle=False to get reproducible evaluation runs.

Clarity: Yes, it is very well written, set and structured. Minor remarks: Please check for consistent spelling of dataset names and "vs.".

Relation to Prior Work: Yes, prior work and other approaches are referenced.

Reproducibility: Yes

Additional Feedback: I couldn't find the density-based definition of OOD in [15]. Could you please point out where they make it? Splitting off the presentation of the FashionMNIST vs. MNIST and CIFAR-10 vs. SVHN results from the rest seems unnecessary. For comparison, AUCROC values of non-VAE-based methods could be included in the appendix. Update: Thanks for the new points on the relevance of VAEs and generative models in general for OOD detection and for the additional evaluations. I've decided to raise my score from 6 to 7. I think the method is an interesting improvement for the limited scope that it is proposed for, and the paper is well written. Thus it might be a small but valuable step towards understanding the behaviour of likelihood estimations on unseen data. My concerns have been well addressed in the rebuttal. However, the CIFAR-10 vs. CIFAR-100 score is a bit disappointing, especially as classifier-based methods achieve around 90% AUCROC; since the reported 58.2% improves on other likelihood-based methods, this is OK though. The MNIST-EMNIST score, on the other hand, is quite impressive. The first rebuttal point, about generative models working with unlabelled data in contrast to classifier-based OOD detection, adds for me some convincing justification for the relevance of using VAEs here. An extended discussion of this and the other items should be included in the potential final version. From my point of view, the improvement on VAEs has merit by itself, but a comprehensive treatment of all commonly used generative models (more in-depth than the first results shown in the rebuttal) would be very beneficial.

Review 4

Summary and Contributions: The paper tackles the problem of OOD detection via generative models. It highlights how current OOD scores perform poorly when applied in the context of VAEs, proposes a new OOD score (the likelihood regret score), and shows that it generally outperforms alternatives. In particular, while each of the alternatives performs poorly in at least one setting, the proposed method is consistently among the top-performing methods.

Strengths:
* The OOD detection problem is important to the community and recent works have attempted to provide OOD scores that outperform log-likelihood thresholding at OOD detection. From this perspective, the current paper is relevant to the community.
* The paper illustrates clearly that likelihood thresholding can fail to detect OOD samples in the VAE case and demonstrates that the proposed approach can alleviate this issue.
* The experimental results are strong, with the proposed approach outperforming the generic OOD scores.
* The paper makes a reasonable attempt to explain why the VAEs might be different than other generative models when it comes to OOD detection.
* The paper also discusses its limitations (e.g. the increased computational time).

Weaknesses: I generally like this investigation. My main concerns are the following:
* Previous work has already established that likelihood thresholding can fail to detect OOD samples and has proposed OOD scores to mitigate this issue. The current approach introduces a VAE-specific score. From this perspective, the current approach may have a limited impact (this depends on how often a VAE is the generative model of choice for performing OOD detection).
* Related to the above, the paper would have a stronger impact if it compared VAE-based OOD detection with alternative generative models on the same datasets, and if it highlighted cases where a VAE is necessary and therefore a VAE-specific OOD score is necessary as well.
* Given the restriction to VAEs, another question is whether the results hold only for image VAEs, or would also hold for other domains (e.g. VAEs have been used to model sequences and graphs).

Correctness:
* The proposed approach is sensible.
* The main claims, i.e. (1) current OOD scores do not perform well for VAE and (2) the likelihood regret score is effective across tasks, are well supported by evidence.

Clarity: The paper has been very easy to follow.

Relation to Prior Work: The paper mentioned the relevant prior work that I'm familiar with. The paper distinguishes itself by proposing a VAE-specific approach.

Reproducibility: Yes

Additional Feedback: